This time I'll try AutoGluon, an AutoML library, on the Ames housing dataset.
I've summarized how to set up AutoML environments on a Mac in the articles below. Since almost everything is installed with pip, the same commands should work on Linux as well.
Note: the brew install steps do need to be replaced with yum or apt.
(MLJAR) Setting Up Three AutoML Environments in Python
(AutoGluon) Setting Up Three AutoML Environments in Python
(auto-sklearn) Setting Up Three AutoML Environments in Python
Let's give it a try.
Upgrading AutoGluon
I'll use as recent a version of the library as possible.
source ~/venv-autogluon/bin/activate
(venv-autogluon) python3 -m pip install autogluon --upgrade
Requirement already satisfied: autogluon in ./venv-autogluon/lib/python3.8/site-packages (0.4.2)
Collecting autogluon
  Using cached autogluon-0.5.2-py3-none-any.whl (9.6 kB)
... (omitted) ...
Successfully installed Cython-0.29.32 autogluon-0.5.2 autogluon.common-0.5.2 autogluon.core-0.5.2 autogluon.features-0.5.2 autogluon.multimodal-0.5.2 autogluon.tabular-0.5.2 autogluon.text-0.5.2 autogluon.timeseries-0.5.2 autogluon.vision-0.5.2 click-8.0.4 convertdate-2.4.0 distlib-0.3.5 future-0.18.2 gluonts-0.9.7 grpcio-1.43.0 hijri-converter-2.2.4 holidays-0.14.2 hyperopt-0.2.7 korean-lunar-calendar-0.2.1 llvmlite-0.39.0 nlpaug-1.1.10 nltk-3.7 numba-0.56.0 numpy-1.21.6 patsy-0.5.2 platformdirs-2.5.2 pmdarima-1.8.5 protobuf-3.18.1 py4j-0.10.9.5 pymeeus-0.5.11 pytorch-metric-learning-1.3.2 ray-1.13.0 sktime-0.11.4 statsmodels-0.13.2 tbats-1.1.0 tensorboardX-2.5.1 torch-1.11.0 torchtext-0.12.0 torchvision-0.12.0 transformers-4.20.1 virtualenv-20.16.3
The version was upgraded from 0.4.2 to 0.5.2.
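If you want to double-check which version is active from inside Python, something like the following should work (a minimal sketch using only the standard library; importlib.metadata is available from Python 3.8 onward, and this check is not part of the original workflow):

# Query pip's package metadata for the installed autogluon version
from importlib.metadata import version
print(version("autogluon"))  # expecting 0.5.2 here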
Evaluation Metric
This is a competition to predict SalePrice for each house Id.
The evaluation metric appears to be the Root-Mean-Squared-Error (RMSE) computed between the logarithm of the predicted SalePrice and the logarithm of the actual SalePrice.
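Since the leaderboard scores RMSE on log-transformed prices, it can be handy to compute the same quantity locally. A minimal sketch (my own helper, assuming strictly positive prices; not part of the competition tooling):

import numpy as np
from sklearn.metrics import mean_squared_error

def log_rmse(y_true, y_pred):
    # RMSE between log(actual) and log(predicted) SalePrice,
    # which is what the Kaggle leaderboard reports
    return np.sqrt(mean_squared_error(np.log(y_true), np.log(y_pred)))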
AutoGluon
Preparing the data for analysis
The missing-value handling and feature engineering were done in advance and the processed data was exported.
To reproduce the results in this article, prepare the data by following the articles below first.
(Part 3-2) Ames Housing Dataset: Data Preprocessing (1)
(Part 3-3) Ames Housing Dataset: Data Preprocessing (2)
Loading the training data and the scoring data
import pandas as pd
import numpy as np
# Load the training and test data of the Ames housing dataset
df = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/ames/ames_train.csv")
df_test = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/ames/ames_test.csv")
# Plot settings
import seaborn as sns
from matplotlib import ticker
import matplotlib.pyplot as plt
sns.set_style("whitegrid")
from matplotlib import rcParams
rcParams['font.family'] = 'Hiragino Sans' # for macOS
#rcParams['font.family'] = 'Meiryo' # for Windows
#rcParams['font.family'] = 'VL PGothic' # for Linux
rcParams['xtick.labelsize'] = 12 # font size of x-axis tick labels
rcParams['ytick.labelsize'] = 12 # font size of y-axis tick labels
rcParams['axes.labelsize'] = 18 # font size of axis labels
rcParams['figure.figsize'] = 18,8 # figure size (inches)
# Specify the explanatory variables and the target variable
# Training data (for AutoGluon, the target variable is kept in the DataFrame)
X_train = df.drop(["Id"],axis=1)
# Test data
X_test = df_test.drop(["Id"],axis=1)
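As a quick sanity check (my own snippet, not in the original flow): the label column has to stay in the training data, while the scoring data has no SalePrice column at all.

# The label must be present in X_train; df_test carries no SalePrice column
print(X_train.shape, X_test.shape)
print("SalePrice" in X_train.columns)  # True
print("SalePrice" in X_test.columns)   # False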
Building a model with AutoGluon
# https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.html
# Build a model with AutoGluon
from autogluon.tabular import TabularPredictor
predictor = TabularPredictor(label="SalePrice", problem_type="regression", path="RESULT_AUTOGLUON").fit(X_train, time_limit=600)
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "RESULT_AUTOGLUON/"
AutoGluon Version:  0.5.2
Python Version:     3.8.13
Operating System:   Darwin
Train Data Rows:    1460
Train Data Columns: 333
Label Column: SalePrice
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
    Available Memory:                    9107.26 MB
    Train Data (Original)  Memory Usage: 3.89 MB (0.0% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
            Note: Converting 286 features to boolean dtype as they only contain 2 unique values.
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting IdentityFeatureGenerator...
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Useless Original Features (Count: 7): ['MSSubClass_150', 'YearBuilt_1879', 'YearBuilt_1895', 'YearBuilt_1896', 'YearBuilt_1901', 'YearBuilt_1902', 'YearBuilt_1907']
        These features carry no predictive signal and should be manually investigated. This is typically a feature which has the same value for all rows.
        These features do not need to be present at inference time.
    Types of features in original data (raw dtype, special dtypes):
        ('float', []) : 310 | ['LotFrontage', 'LotShape', 'Utilities', 'LandSlope', 'MasVnrArea', ...]
        ('int', [])   :  16 | ['LotArea', 'OverallQual', 'OverallCond', '2ndFlrSF', 'LowQualFinSF', ...]
    Types of features in processed data (raw dtype, special dtypes):
        ('float', [])     :  24 | ['LotFrontage', 'LotShape', 'LandSlope', 'MasVnrArea', 'ExterCond', ...]
        ('int', [])       :  16 | ['LotArea', 'OverallQual', 'OverallCond', '2ndFlrSF', 'LowQualFinSF', ...]
        ('int', ['bool']) : 286 | ['Utilities', 'MSSubClass_120', 'MSSubClass_160', 'MSSubClass_180', 'MSSubClass_190', ...]
    1.0s = Fit runtime
    326 features in original data used to generate 326 features in processed data.
    Train Data (Processed) Memory Usage: 0.88 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 1.2s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
    This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
    To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 1168, Val Rows: 292
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif ... Training model for up to 598.8s of the 598.78s of remaining time.
    -49351.2967 = Validation score (-root_mean_squared_error)
    0.09s = Training runtime
    0.07s = Validation runtime
Fitting model: KNeighborsDist ... Training model for up to 598.59s of the 598.58s of remaining time.
    -49022.574 = Validation score (-root_mean_squared_error)
    0.06s = Training runtime
    0.04s = Validation runtime
Fitting model: LightGBMXT ... Training model for up to 598.46s of the 598.45s of remaining time.
    -32311.2318 = Validation score (-root_mean_squared_error)
    2.71s = Training runtime
    0.01s = Validation runtime
Fitting model: LightGBM ... Training model for up to 595.69s of the 595.68s of remaining time.
    -33364.3905 = Validation score (-root_mean_squared_error)
    0.82s = Training runtime
    0.01s = Validation runtime
Fitting model: RandomForestMSE ... Training model for up to 594.83s of the 594.82s of remaining time.
    -33091.9716 = Validation score (-root_mean_squared_error)
    4.49s = Training runtime
    0.07s = Validation runtime
Fitting model: CatBoost ... Training model for up to 590.17s of the 590.16s of remaining time.
    -29978.5556 = Validation score (-root_mean_squared_error)
    21.76s = Training runtime
    0.04s = Validation runtime
Fitting model: ExtraTreesMSE ... Training model for up to 568.35s of the 568.34s of remaining time.
    -32129.1943 = Validation score (-root_mean_squared_error)
    3.44s = Training runtime
    0.07s = Validation runtime
Fitting model: NeuralNetFastAI ... Training model for up to 564.74s of the 564.72s of remaining time.
No improvement since epoch 3: early stopping
[W ParallelNative.cpp:229] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
    -52385.7488 = Validation score (-root_mean_squared_error)
    20.07s = Training runtime
    0.18s = Validation runtime
Fitting model: XGBoost ... Training model for up to 544.44s of the 544.43s of remaining time.
    -28249.057 = Validation score (-root_mean_squared_error)
    4.8s = Training runtime
    0.02s = Validation runtime
Fitting model: NeuralNetTorch ... Training model for up to 539.59s of the 539.58s of remaining time.
[W ParallelNative.cpp:229] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
    -34557.6018 = Validation score (-root_mean_squared_error)
    5.05s = Training runtime
    0.04s = Validation runtime
Fitting model: LightGBMLarge ... Training model for up to 534.49s of the 534.48s of remaining time.
    -30409.2806 = Validation score (-root_mean_squared_error)
    2.78s = Training runtime
    0.01s = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 531.52s of remaining time.
    -27721.2328 = Validation score (-root_mean_squared_error)
    0.64s = Training runtime
    0.0s = Validation runtime
AutoGluon training complete, total runtime = 69.2s ...
Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("RESULT_AUTOGLUON/")
AutoGluon handles a lot of the work automatically.
In the end, the WeightedEnsemble_L2 model was selected as the best model.
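Rather than digging through the raw log, the per-model validation scores can also be pulled out with TabularPredictor's leaderboard method; a minimal sketch (column names as of AutoGluon 0.5):

# Ranked validation scores of every model AutoGluon trained
lb = predictor.leaderboard(silent=True)
print(lb[["model", "score_val", "fit_time"]])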
# Check feature importance
predictor.feature_importance(X_train)
These features in provided data are not utilized by the predictor and will be ignored: ['MSSubClass_150', 'YearBuilt_1879', 'YearBuilt_1895', 'YearBuilt_1896', 'YearBuilt_1901', 'YearBuilt_1902', 'YearBuilt_1907']
Computing feature importance via permutation shuffling for 326 features using 1460 rows with 5 shuffle sets...
    1178.64s = Expected runtime (235.73s per shuffle set)
    889.07s  = Actual runtime (Completed 5 of 5 shuffle sets)
                    importance       stddev       p_value  n      p99_high       p99_low
TotalLivArea      38261.203832  1094.563494  8.028544e-08  5  40514.925195  36007.482469
OverallQual       23512.093204   936.372603  3.012264e-07  5  25440.097335  21584.089073
BsmtQual           4811.859551   195.716333  3.277030e-07  5   5214.842186   4408.876915
GarageCars         2733.962537   333.895938  2.617385e-05  5   3421.458889   2046.466185
LotArea            2485.904056    60.642905  4.246384e-08  5   2610.768635   2361.039477
...                        ...          ...           ...  .           ...           ...
RoofMatl_CompShg     -2.558456     3.332701  9.194039e-01  5      4.303621     -9.420532
YrSold_2008          -3.084284     4.882976  8.846518e-01  5      6.969831    -13.138400
YearBuilt_1916       -3.790175     2.284067  9.896777e-01  5      0.912750     -8.493101
PoolQC              -36.494475    12.669437  9.985053e-01  5    -10.407929    -62.581021
MoSold_1            -71.427322    47.552748  9.858308e-01  5     26.484442   -169.339086

326 rows × 6 columns
It took quite a while to get the results.
Seeing TotalLivArea and OverallQual near the top suggests the importance ranking is credible.
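If the runtime of permutation importance is a concern, feature_importance accepts subsample_size and num_shuffle_sets arguments that trade precision for speed. A hedged sketch (the values here are arbitrary, not tuned):

# Fewer rows and shuffle sets finish much faster, at the cost of noisier estimates
fi = predictor.feature_importance(X_train, subsample_size=500, num_shuffle_sets=3)
print(fi.head(10))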
# Compute the adjusted R-squared
def adjusted_r2(X, Y, Yhat):
    from sklearn.metrics import r2_score
    r_squared = r2_score(Y, Yhat)
    # Penalize R-squared by the number of explanatory variables
    adjusted_r2 = 1 - (1 - r_squared) * (len(Y) - 1) / (len(Y) - X.shape[1] - 1)
    return adjusted_r2
# Check accuracy on the training data (to gauge how much the model overfits)
print("train r2_adjusted",adjusted_r2(X_train,X_train["SalePrice"], predictor.predict(X_train)))
train r2_adjusted 0.9680146829503177
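For comparison, AutoGluon's built-in evaluate method reports several regression metrics in one call; a minimal sketch (run on the training data here, so the numbers will look optimistic):

# Returns a dict of metrics such as root_mean_squared_error and r2
print(predictor.evaluate(X_train))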
Apply the model and predict SalePrice
df_test["SalePrice"] = predictor.predict(X_train)
df_test[["Id","SalePrice"]]
        Id      SalePrice
0     1461  208608.531250
1     1462  183918.640625
2     1463  225355.890625
3     1464  141544.953125
4     1465  286628.718750
...    ...            ...
1454  2915  187273.500000
1455  2916  173471.203125
1456  2917  208200.875000
1457  2918  266205.906250
1458  2919  142144.937500

1459 rows × 2 columns
# Check the distribution of the predicted SalePrice
sns.histplot(df_test["SalePrice"],bins=20)
Uploading the scored results to Kaggle
df_test[["Id","SalePrice"]].to_csv("ames_submission.csv",index=False)
!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c house-prices-advanced-regression-techniques -f ames_submission.csv -m "#9 automl autogluon"
100%|██████████████████████████████████████| 21.1k/21.1k [00:04<00:00, 5.29kB/s] Successfully submitted to House Prices - Advanced Regression Techniques #9 automl autogluon Score: 0.56305
Surprisingly, the score was not good at all. Perhaps the training time was insufficient, since I had limited it to 600 seconds.
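If training time really was the bottleneck, the first things I would try are a longer time_limit and the best_quality preset, both of which are documented fit options. An untested sketch (the output path RESULT_AUTOGLUON_LONG is just a placeholder):

# Untested: give AutoGluon an hour and enable bagging/stacking via presets
predictor = TabularPredictor(
    label="SalePrice", problem_type="regression", path="RESULT_AUTOGLUON_LONG"
).fit(X_train, presets="best_quality", time_limit=3600)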
Versions of the libraries used
pandas Version: 1.3.5
numpy Version: 1.21.6
scikit-learn Version: 1.0.2
seaborn Version: 0.11.2
matplotlib Version: 3.5.2
autogluon Version: 0.5.2
Summary
Next time, I'd like to try auto-sklearn.