(Part 4-4) Predicting Ames housing prices with a neural network

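This time I'll use a neural network.

These days, more people have probably heard the term deep learning.

Deep learning appears to be a technique that achieves high accuracy by stacking neural networks into many layers.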

Deep learning is a technique that extends neural networks, a more fundamental and broader machine learning method, to enable highly accurate analysis and applications.
Source: https://www.soumu.go.jp/ict_skill/pdf/ict_skill_3_5.pdf

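Around 2014 in Silicon Valley, I happened to attend a presentation on deep learning and doc2vec by Jeff Dean of Google, but I never imagined I would end up using it at work. If I had taken more interest and studied it back then, things might be a little easier for me now (lol).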

Evaluation metric
Neural network analysis
Library versions used
Summary

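Evaluation metric

This is a competition to predict SalePrice for each house Id.

The evaluation metric is the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted SalePrice and the logarithm of the observed SalePrice.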

House Prices - Advanced Regression Techniques | Kaggle

Predict sales prices and practice feature engineering, RFs, and gradient boosting
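The metric described above can be sketched in a few lines of numpy. This is only an illustration: the function name `rmse_of_logs` is my own, the three actual prices are the first three SalePrice values from the training data, and the predicted values are hypothetical.

```python
import numpy as np


def rmse_of_logs(y_true, y_pred):
    """RMSE between log(actual price) and log(predicted price)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Logarithms are only defined for strictly positive prices.
    if np.any(y_true <= 0) or np.any(y_pred <= 0):
        raise ValueError("prices must be positive to take the logarithm")
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))


actual = [208500, 181500, 223500]      # first three SalePrice values in the training data
predicted = [200000, 190000, 223500]   # hypothetical predictions, for illustration only
score = rmse_of_logs(actual, predicted)
print(score)
```

Because the errors are measured on the log scale, being off by 10% on a cheap house costs about as much as being off by 10% on an expensive one.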

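Neural network analysis

Preparing the data for analysis

I handled missing values and did the feature engineering beforehand, then exported the data.

To reproduce the results in this article, please prepare the data by following the articles below first.

(Part 3-2) Data processing for the Ames housing price dataset (1)

(Part 3-3) Data processing for the Ames housing price dataset (2)

Loading the training data and the scoring data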

import pandas as pd
import numpy as np
# Load the training and test data of the Ames housing price dataset
df = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/ames/ames_train.csv")
df_test = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/ames/ames_test.csv")

df.head()

Out[0]

	Id	LotFrontage	LotArea	LotShape	Utilities	LandSlope	OverallQual	OverallCond	MasVnrArea	ExterCond	...	SaleType_New	SaleType_Oth	SaleType_WD	SaleCondition_Abnorml	SaleCondition_AdjLand	SaleCondition_Alloca	SaleCondition_Family	SaleCondition_Normal	SaleCondition_Partial	SalePrice
0	1	65.0	8450	3.0	3.0	2.0	7	5	196.0	2.0	...	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	208500
1	2	80.0	9600	3.0	3.0	2.0	6	8	0.0	2.0	...	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	181500
2	3	68.0	11250	2.0	3.0	2.0	7	5	162.0	2.0	...	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	223500
3	4	60.0	9550	2.0	3.0	2.0	7	5	0.0	2.0	...	0.0	0.0	1.0	1.0	0.0	0.0	0.0	0.0	0.0	140000
4	5	84.0	14260	2.0	3.0	2.0	8	5	350.0	2.0	...	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	250000

5 rows × 335 columns

# Plot settings
from IPython.display import HTML
import seaborn as sns
from matplotlib import ticker
import matplotlib.pyplot as plt
sns.set_style("whitegrid")
from matplotlib import rcParams
rcParams['font.family'] = 'Hiragino Sans' # on Mac
#rcParams['font.family'] = 'Meiryo' # on Windows
#rcParams['font.family'] = 'VL PGothic' # on Linux
rcParams['xtick.labelsize'] = 12       # font size of the x-axis tick labels
rcParams['ytick.labelsize'] = 12       # font size of the y-axis tick labels
rcParams['axes.labelsize'] = 18        # font size of the axis labels
rcParams['figure.figsize'] = 18,8      # figure size (inches)

Choosing the variables for the neural network

I'll try throwing in all of them.

Incidentally, it seems better to standardize the data.

Multi-layer Perceptron is sensitive to feature scaling, so it is highly recommended to scale your data.
Source: https://scikit-learn.org/stable/modules/neural_networks_supervised.html#tips-on-practical-use
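To get a feel for what standardization does, here is a small sketch using only two columns, LotArea and OverallQual, with the values from the first five training rows. It shows the same StandardScaler step that the pipeline in this post applies, but on this tiny subset rather than the full dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# LotArea and OverallQual for the first five rows of the training data
X = np.array(
    [
        [8450, 7],
        [9600, 6],
        [11250, 7],
        [9550, 7],
        [14260, 8],
    ],
    dtype=float,
)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X.std(axis=0))          # the raw columns have very different spreads
print(X_scaled.mean(axis=0))  # after scaling, each column is centered near 0
print(X_scaled.std(axis=0))   # and has unit standard deviation
```

Before scaling, LotArea varies by thousands while OverallQual varies by about one, so gradient-based training would see the two inputs on very different scales.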

Training the neural network

# Explanatory variables
df.drop(["Id","SalePrice"],axis=1)

	LotFrontage	LotArea	LotShape	Utilities	LandSlope	OverallQual	OverallCond	MasVnrArea	ExterCond	BsmtQual	...	SaleType_ConLw	SaleType_New	SaleType_Oth	SaleType_WD	SaleCondition_Abnorml	SaleCondition_AdjLand	SaleCondition_Alloca	SaleCondition_Family	SaleCondition_Normal	SaleCondition_Partial
0	65.0	8450	3.0	3.0	2.0	7	5	196.0	2.0	4.0	...	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0
1	80.0	9600	3.0	3.0	2.0	6	8	0.0	2.0	4.0	...	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0
2	68.0	11250	2.0	3.0	2.0	7	5	162.0	2.0	4.0	...	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0
3	60.0	9550	2.0	3.0	2.0	7	5	0.0	2.0	3.0	...	0.0	0.0	0.0	1.0	1.0	0.0	0.0	0.0	0.0	0.0
4	84.0	14260	2.0	3.0	2.0	8	5	350.0	2.0	4.0	...	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1455	62.0	7917	3.0	3.0	2.0	6	5	0.0	2.0	4.0	...	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0
1456	85.0	13175	3.0	3.0	2.0	6	6	119.0	2.0	4.0	...	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0
1457	66.0	9042	3.0	3.0	2.0	7	9	0.0	3.0	3.0	...	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0
1458	68.0	9717	3.0	3.0	2.0	5	6	0.0	2.0	3.0	...	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0
1459	75.0	9937	3.0	3.0	2.0	5	6	0.0	2.0	3.0	...	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0

1460 rows × 333 columns

# Specify the explanatory variables and the target variable

# Training data
X_train = df.drop(["Id","SalePrice"],axis=1)
Y_train = df["SalePrice"] # sale price

# Test data
X_test = df_test.drop(["Id"],axis=1)

# Standardize the data and fit the model in a pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

#https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html
from sklearn.neural_network import MLPRegressor
pipeline = make_pipeline(StandardScaler(), MLPRegressor(random_state=1, max_iter=10000))

# Fit the pipeline
fit_pipeline = pipeline.fit(X_train,Y_train)

Out[0]

/Users/hinomaruc/Desktop/blog/my-venv/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:702: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (10000) reached and the optimization hasn't converged yet.
  warnings.warn(

I set max_iter to 10,000, but training still did not converge. There may have been too many variables.
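When this warning appears, the fitted MLPRegressor can be inspected to see how far training got. The sketch below is a minimal illustration under assumptions: it uses synthetic data from make_regression instead of the Ames data, and max_iter is deliberately small so that it runs quickly.

```python
import warnings

from sklearn.datasets import make_regression
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); replace with X_train / Y_train.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=1)

pipe = make_pipeline(StandardScaler(), MLPRegressor(random_state=1, max_iter=50))
with warnings.catch_warnings():
    # Silence the expected warning; max_iter is intentionally too small here.
    warnings.simplefilter("ignore", ConvergenceWarning)
    pipe.fit(X, y)

mlp = pipe.named_steps["mlpregressor"]
print("iterations run:", mlp.n_iter_)
print("hit max_iter:", mlp.n_iter_ >= mlp.max_iter)
print("first loss:", mlp.loss_curve_[0])
print("last loss:", mlp.loss_curve_[-1])
```

If the last values in loss_curve_ are still falling steadily, the optimizer simply ran out of iterations. If they have flattened out at a high value, more iterations alone are unlikely to help.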

# List the model parameters
fit_pipeline.get_params()

Out[0]


    {'memory': None,
     'steps': [('standardscaler', StandardScaler()),
      ('mlpregressor', MLPRegressor(max_iter=10000, random_state=1))],
     'verbose': False,
     'standardscaler': StandardScaler(),
     'mlpregressor': MLPRegressor(max_iter=10000, random_state=1),
     'standardscaler__copy': True,
     'standardscaler__with_mean': True,
     'standardscaler__with_std': True,
     'mlpregressor__activation': 'relu',
     'mlpregressor__alpha': 0.0001,
     'mlpregressor__batch_size': 'auto',
     'mlpregressor__beta_1': 0.9,
     'mlpregressor__beta_2': 0.999,
     'mlpregressor__early_stopping': False,
     'mlpregressor__epsilon': 1e-08,
     'mlpregressor__hidden_layer_sizes': (100,),
     'mlpregressor__learning_rate': 'constant',
     'mlpregressor__learning_rate_init': 0.001,
     'mlpregressor__max_fun': 15000,
     'mlpregressor__max_iter': 10000,
     'mlpregressor__momentum': 0.9,
     'mlpregressor__n_iter_no_change': 10,
     'mlpregressor__nesterovs_momentum': True,
     'mlpregressor__power_t': 0.5,
     'mlpregressor__random_state': 1,
     'mlpregressor__shuffle': True,
     'mlpregressor__solver': 'adam',
     'mlpregressor__tol': 0.0001,
     'mlpregressor__validation_fraction': 0.1,
     'mlpregressor__verbose': False,
     'mlpregressor__warm_start': False}

# Extract the neural network model from the pipeline steps
model_pipeline = fit_pipeline.named_steps["mlpregressor"] # or pipeline.steps[1][1]

# Check the parameters of the model step
model_pipeline.get_params()

Out[0]


    {'activation': 'relu',
     'alpha': 0.0001,
     'batch_size': 'auto',
     'beta_1': 0.9,
     'beta_2': 0.999,
     'early_stopping': False,
     'epsilon': 1e-08,
     'hidden_layer_sizes': (100,),
     'learning_rate': 'constant',
     'learning_rate_init': 0.001,
     'max_fun': 15000,
     'max_iter': 10000,
     'momentum': 0.9,
     'n_iter_no_change': 10,
     'nesterovs_momentum': True,
     'power_t': 0.5,
     'random_state': 1,
     'shuffle': True,
     'solver': 'adam',
     'tol': 0.0001,
     'validation_fraction': 0.1,
     'verbose': False,
     'warm_start': False}

### Apply the model and predict SalePrice
df_test["SalePrice"] = fit_pipeline.predict(X_test)

df_test[["Id","SalePrice"]]

Out[0]

	Id	SalePrice
0	1461	124756.768341
1	1462	171088.184988
2	1463	194691.478136
3	1464	174850.818991
4	1465	199959.214332
...	...	...
1454	2915	98613.927093
1455	2916	64153.925738
1456	2917	203805.591756
1457	2918	43940.916577
1458	2919	219198.983817

1459 rows × 2 columns

The predictions appear to have been generated.

Uploading the scored results to Kaggle

df_test[["Id","SalePrice"]].to_csv("ames_submission.csv",index=False)

!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c house-prices-advanced-regression-techniques -f ames_submission.csv -m "#4 nn"

Out[0]

100%|██████████████████████████████████████| 33.6k/33.6k [00:02<00:00, 11.6kB/s]
Successfully submitted to House Prices - Advanced Regression Techniques

#4 nn
0.96401

The score got considerably worse. Training may have been far too short for the number of inputs.
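This post fits the network on raw SalePrice. Since the metric compares logarithms of prices, one option that is not tried here is to train on log(SalePrice) and convert the predictions back. The sketch below shows the idea with scikit-learn's TransformedTargetRegressor. The data is synthetic, and whether this would improve the 0.96401 score is an untested assumption.

```python
import warnings

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic positive "prices" (assumption); replace with X_train / Y_train.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
price = np.exp(12.0 + 0.3 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.05, size=300))

inner = make_pipeline(StandardScaler(), MLPRegressor(random_state=1, max_iter=500))

# The regressor is trained on log(price); predictions are mapped back with exp.
model = TransformedTargetRegressor(regressor=inner, func=np.log, inverse_func=np.exp)

with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)
    model.fit(X, price)

pred = model.predict(X)
print(pred[:3])
```

A side effect of mapping back with exp is that predictions are always positive, so the logarithm in the metric is always defined.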

Trying a neural network (using the same variables as the multiple regression)


# Explanatory variables
ana_cols=[
  'TotalLivArea'
, 'OverallQual'
, 'TotalBathRms'
, 'GarageCars'
, 'BsmtQual'
, 'FullBath'
, 'GarageFinish'
, 'FireplaceQu'
, 'TotRmsAbvGrd'
#, 'Neighborhood_Blmngtn' dropped to avoid multicollinearity
, 'Neighborhood_Blueste'
, 'Neighborhood_BrDale'
, 'Neighborhood_BrkSide'
, 'Neighborhood_ClearCr'
, 'Neighborhood_CollgCr'
, 'Neighborhood_Crawfor'
, 'Neighborhood_Edwards'
, 'Neighborhood_Gilbert'
, 'Neighborhood_IDOTRR'
, 'Neighborhood_MeadowV'
, 'Neighborhood_Mitchel'
, 'Neighborhood_NAmes'
, 'Neighborhood_NPkVill'
, 'Neighborhood_NWAmes'
, 'Neighborhood_NoRidge'
, 'Neighborhood_NridgHt'
, 'Neighborhood_OldTown'
, 'Neighborhood_SWISU'
, 'Neighborhood_Sawyer'
, 'Neighborhood_SawyerW'
, 'Neighborhood_Somerst'
, 'Neighborhood_StoneBr'
, 'Neighborhood_Timber'
, 'Neighborhood_Veenker'
]

# Training data
X_train = df[ana_cols]
Y_train = df["SalePrice"] # sale price

# Test data
X_test = df_test[ana_cols]

# Standardize the data and fit the model in a pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

#https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html
from sklearn.neural_network import MLPRegressor
pipeline = make_pipeline(StandardScaler(), MLPRegressor(random_state=1, max_iter=10000))

# Fit the pipeline
fit_pipeline = pipeline.fit(X_train,Y_train)

# Apply the model
df_test["SalePrice"] = fit_pipeline.predict(df_test[ana_cols])

# Create the competition submission file
df_test[["Id","SalePrice"]].to_csv("ames_submission.csv",index=False)

# Submit to the competition
!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c house-prices-advanced-regression-techniques -f ames_submission.csv -m "#4 nn same columns as multiregression"

Out[0]

100%|██████████████████████████████████████| 33.7k/33.7k [00:03<00:00, 10.2kB/s]
Successfully submitted to House Prices - Advanced Regression Techniques

#4 nn same columns as multiregression
0.18823

Library versions used

pandas Version: 1.4.3
numpy Version: 1.22.4
scikit-learn Version: 1.1.1
seaborn Version: 0.11.2
matplotlib Version: 3.5.2

Summary

The fit was better than multiple regression but worse than polynomial regression.

However, I only set the bare minimum of parameters, so tuning might improve the accuracy further.
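As a starting point for that kind of tuning, here is a minimal sketch with GridSearchCV. The grid values, cv=3 and the synthetic data are all assumptions chosen so that the example runs quickly. They are not settings tested on the Ames data.

```python
import warnings

from sklearn.datasets import make_regression
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); replace with X_train / Y_train.
X, y = make_regression(n_samples=150, n_features=10, noise=5.0, random_state=1)

pipeline = make_pipeline(StandardScaler(), MLPRegressor(random_state=1, max_iter=300))

# Parameter names follow the "<step name>__<parameter>" pattern seen in get_params().
param_grid = {
    "mlpregressor__hidden_layer_sizes": [(50,), (100,)],
    "mlpregressor__alpha": [1e-4, 1e-2],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="neg_root_mean_squared_error")

with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)
    search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best combination
```

Scoring locally with cross-validation like this also makes it possible to compare settings without submitting to Kaggle each time.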

I expect to use neural networks more and more, so I'd like to experiment with them and deepen my understanding.

Next time I'll try Support Vector Regression.