This time I'll try AutoGluon, an AutoML library, on the Ames housing dataset.
I've summarized how to set up AutoML environments on a Mac in the articles below. Since almost everything is installed with pip, the same commands should work on Linux as well.
Note: the brew install steps do need to be replaced with yum or apt.
(MLJAR) Setting Up Three AutoML Environments in Python
(AutoGluon) Setting Up Three AutoML Environments in Python
(auto-sklearn) Setting Up Three AutoML Environments in Python
Let's give it a try.
Upgrading AutoGluon
I'll use as recent a version of the library as possible.
source ~/venv-autogluon/bin/activate
(venv-autogluon) python3 -m pip install autogluon --upgrade
Requirement already satisfied: autogluon in ./venv-autogluon/lib/python3.8/site-packages (0.4.2)
Collecting autogluon
  Using cached autogluon-0.5.2-py3-none-any.whl (9.6 kB)
... (omitted) ...
Successfully installed Cython-0.29.32 autogluon-0.5.2 autogluon.common-0.5.2 autogluon.core-0.5.2 autogluon.features-0.5.2 autogluon.multimodal-0.5.2 autogluon.tabular-0.5.2 autogluon.text-0.5.2 autogluon.timeseries-0.5.2 autogluon.vision-0.5.2 click-8.0.4 convertdate-2.4.0 distlib-0.3.5 future-0.18.2 gluonts-0.9.7 grpcio-1.43.0 hijri-converter-2.2.4 holidays-0.14.2 hyperopt-0.2.7 korean-lunar-calendar-0.2.1 llvmlite-0.39.0 nlpaug-1.1.10 nltk-3.7 numba-0.56.0 numpy-1.21.6 patsy-0.5.2 platformdirs-2.5.2 pmdarima-1.8.5 protobuf-3.18.1 py4j-0.10.9.5 pymeeus-0.5.11 pytorch-metric-learning-1.3.2 ray-1.13.0 sktime-0.11.4 statsmodels-0.13.2 tbats-1.1.0 tensorboardX-2.5.1 torch-1.11.0 torchtext-0.12.0 torchvision-0.12.0 transformers-4.20.1 virtualenv-20.16.3
The version was upgraded from 0.4.2 to 0.5.2.
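If you want to double-check which version is active from inside Python, something like the following should work (a minimal sketch using only the standard library; importlib.metadata is available from Python 3.8 onward, and this check is not part of the original workflow):

# Query pip's package metadata for the installed autogluon version
from importlib.metadata import version
print(version("autogluon"))  # expecting 0.5.2 here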
Evaluation Metric
This is a competition to predict SalePrice for each house Id.
The evaluation metric appears to be the Root-Mean-Squared-Error (RMSE) computed between the logarithm of the predicted SalePrice and the logarithm of the actual SalePrice.
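Since the leaderboard scores RMSE on log-transformed prices, it can be handy to compute the same quantity locally. A minimal sketch (my own helper, assuming strictly positive prices; not part of the competition tooling):

import numpy as np
from sklearn.metrics import mean_squared_error

def log_rmse(y_true, y_pred):
    # RMSE between log(actual) and log(predicted) SalePrice,
    # which is what the Kaggle leaderboard reports
    return np.sqrt(mean_squared_error(np.log(y_true), np.log(y_pred)))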
AutoGluon
Preparing the data for analysis
The missing-value handling and feature engineering were done in advance and the processed data was exported.
To reproduce the results in this article, prepare the data by following the articles below first.
(Part 3-2) Ames Housing Dataset: Data Preprocessing (1)
(Part 3-3) Ames Housing Dataset: Data Preprocessing (2)
Loading the training data and the scoring data
import pandas as pd
import numpy as np
# Load the training and test data of the Ames housing dataset
df = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/ames/ames_train.csv")
df_test = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/ames/ames_test.csv")
# Plot settings
import seaborn as sns
from matplotlib import ticker
import matplotlib.pyplot as plt
sns.set_style("whitegrid")
from matplotlib import rcParams
rcParams['font.family'] = 'Hiragino Sans' # for macOS
#rcParams['font.family'] = 'Meiryo' # for Windows
#rcParams['font.family'] = 'VL PGothic' # for Linux
rcParams['xtick.labelsize'] = 12 # font size of x-axis tick labels
rcParams['ytick.labelsize'] = 12 # font size of y-axis tick labels
rcParams['axes.labelsize'] = 18 # font size of axis labels
rcParams['figure.figsize'] = 18,8 # figure size (inches)
# Specify the explanatory variables and the target variable
# Training data (for AutoGluon, the target variable is kept in the DataFrame)
X_train = df.drop(["Id"],axis=1)
# Test data
X_test = df_test.drop(["Id"],axis=1)
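As a quick sanity check (my own snippet, not in the original flow): the label column has to stay in the training data, while the scoring data has no SalePrice column at all.

# The label must be present in X_train; df_test carries no SalePrice column
print(X_train.shape, X_test.shape)
print("SalePrice" in X_train.columns)  # True
print("SalePrice" in X_test.columns)   # False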
Building a model with AutoGluon
# https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.html
# Build a model with AutoGluon
from autogluon.tabular import TabularPredictor
predictor = TabularPredictor(label="SalePrice", problem_type="regression", path="RESULT_AUTOGLUON").fit(X_train, time_limit=600)
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "RESULT_AUTOGLUON/"
AutoGluon Version:  0.5.2
Python Version:     3.8.13
Operating System:   Darwin
Train Data Rows:    1460
Train Data Columns: 333
Label Column: SalePrice
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
    Available Memory:                    9107.26 MB
    Train Data (Original)  Memory Usage: 3.89 MB (0.0% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
            Note: Converting 286 features to boolean dtype as they only contain 2 unique values.
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting IdentityFeatureGenerator...
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Useless Original Features (Count: 7): ['MSSubClass_150', 'YearBuilt_1879', 'YearBuilt_1895', 'YearBuilt_1896', 'YearBuilt_1901', 'YearBuilt_1902', 'YearBuilt_1907']
        These features carry no predictive signal and should be manually investigated. This is typically a feature which has the same value for all rows.
        These features do not need to be present at inference time.
    Types of features in original data (raw dtype, special dtypes):
        ('float', []) : 310 | ['LotFrontage', 'LotShape', 'Utilities', 'LandSlope', 'MasVnrArea', ...]
        ('int', [])   :  16 | ['LotArea', 'OverallQual', 'OverallCond', '2ndFlrSF', 'LowQualFinSF', ...]
    Types of features in processed data (raw dtype, special dtypes):
        ('float', [])     :  24 | ['LotFrontage', 'LotShape', 'LandSlope', 'MasVnrArea', 'ExterCond', ...]
        ('int', [])       :  16 | ['LotArea', 'OverallQual', 'OverallCond', '2ndFlrSF', 'LowQualFinSF', ...]
        ('int', ['bool']) : 286 | ['Utilities', 'MSSubClass_120', 'MSSubClass_160', 'MSSubClass_180', 'MSSubClass_190', ...]
    1.0s = Fit runtime
    326 features in original data used to generate 326 features in processed data.
    Train Data (Processed) Memory Usage: 0.88 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 1.2s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
    This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
    To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 1168, Val Rows: 292
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif ... Training model for up to 598.8s of the 598.78s of remaining time.
    -49351.2967 = Validation score (-root_mean_squared_error)
    0.09s = Training runtime
    0.07s = Validation runtime
Fitting model: KNeighborsDist ... Training model for up to 598.59s of the 598.58s of remaining time.
    -49022.574 = Validation score (-root_mean_squared_error)
    0.06s = Training runtime
    0.04s = Validation runtime
Fitting model: LightGBMXT ... Training model for up to 598.46s of the 598.45s of remaining time.
    -32311.2318 = Validation score (-root_mean_squared_error)
    2.71s = Training runtime
    0.01s = Validation runtime
Fitting model: LightGBM ... Training model for up to 595.69s of the 595.68s of remaining time.
    -33364.3905 = Validation score (-root_mean_squared_error)
    0.82s = Training runtime
    0.01s = Validation runtime
Fitting model: RandomForestMSE ... Training model for up to 594.83s of the 594.82s of remaining time.
    -33091.9716 = Validation score (-root_mean_squared_error)
    4.49s = Training runtime
    0.07s = Validation runtime
Fitting model: CatBoost ... Training model for up to 590.17s of the 590.16s of remaining time.
    -29978.5556 = Validation score (-root_mean_squared_error)
    21.76s = Training runtime
    0.04s = Validation runtime
Fitting model: ExtraTreesMSE ... Training model for up to 568.35s of the 568.34s of remaining time.
    -32129.1943 = Validation score (-root_mean_squared_error)
    3.44s = Training runtime
    0.07s = Validation runtime
Fitting model: NeuralNetFastAI ... Training model for up to 564.74s of the 564.72s of remaining time.
No improvement since epoch 3: early stopping
[W ParallelNative.cpp:229] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
    -52385.7488 = Validation score (-root_mean_squared_error)
    20.07s = Training runtime
    0.18s = Validation runtime
Fitting model: XGBoost ... Training model for up to 544.44s of the 544.43s of remaining time.
    -28249.057 = Validation score (-root_mean_squared_error)
    4.8s = Training runtime
    0.02s = Validation runtime
Fitting model: NeuralNetTorch ... Training model for up to 539.59s of the 539.58s of remaining time.
[W ParallelNative.cpp:229] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
    -34557.6018 = Validation score (-root_mean_squared_error)
    5.05s = Training runtime
    0.04s = Validation runtime
Fitting model: LightGBMLarge ... Training model for up to 534.49s of the 534.48s of remaining time.
    -30409.2806 = Validation score (-root_mean_squared_error)
    2.78s = Training runtime
    0.01s = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 531.52s of remaining time.
    -27721.2328 = Validation score (-root_mean_squared_error)
    0.64s = Training runtime
    0.0s = Validation runtime
AutoGluon training complete, total runtime = 69.2s ...
Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("RESULT_AUTOGLUON/")
AutoGluon handles a lot of the work automatically.
In the end, the WeightedEnsemble_L2 model was selected as the best model.
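Rather than digging through the raw log, the per-model validation scores can also be pulled out with TabularPredictor's leaderboard method; a minimal sketch (column names as of AutoGluon 0.5):

# Ranked validation scores of every model AutoGluon trained
lb = predictor.leaderboard(silent=True)
print(lb[["model", "score_val", "fit_time"]])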
# Check feature importance
predictor.feature_importance(X_train)
These features in provided data are not utilized by the predictor and will be ignored: ['MSSubClass_150', 'YearBuilt_1879', 'YearBuilt_1895', 'YearBuilt_1896', 'YearBuilt_1901', 'YearBuilt_1902', 'YearBuilt_1907']
Computing feature importance via permutation shuffling for 326 features using 1460 rows with 5 shuffle sets...
    1178.64s = Expected runtime (235.73s per shuffle set)
    889.07s  = Actual runtime (Completed 5 of 5 shuffle sets)
                    importance       stddev       p_value  n      p99_high       p99_low
TotalLivArea      38261.203832  1094.563494  8.028544e-08  5  40514.925195  36007.482469
OverallQual       23512.093204   936.372603  3.012264e-07  5  25440.097335  21584.089073
BsmtQual           4811.859551   195.716333  3.277030e-07  5   5214.842186   4408.876915
GarageCars         2733.962537   333.895938  2.617385e-05  5   3421.458889   2046.466185
LotArea            2485.904056    60.642905  4.246384e-08  5   2610.768635   2361.039477
...                        ...          ...           ...  .           ...           ...
RoofMatl_CompShg     -2.558456     3.332701  9.194039e-01  5      4.303621     -9.420532
YrSold_2008          -3.084284     4.882976  8.846518e-01  5      6.969831    -13.138400
YearBuilt_1916       -3.790175     2.284067  9.896777e-01  5      0.912750     -8.493101
PoolQC              -36.494475    12.669437  9.985053e-01  5    -10.407929    -62.581021
MoSold_1            -71.427322    47.552748  9.858308e-01  5     26.484442   -169.339086

326 rows × 6 columns
It took quite a while to get the results.
Seeing TotalLivArea and OverallQual near the top suggests the importance ranking is credible.
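If the runtime of permutation importance is a concern, feature_importance accepts subsample_size and num_shuffle_sets arguments that trade precision for speed. A hedged sketch (the values here are arbitrary, not tuned):

# Fewer rows and shuffle sets finish much faster, at the cost of noisier estimates
fi = predictor.feature_importance(X_train, subsample_size=500, num_shuffle_sets=3)
print(fi.head(10))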
# Compute the adjusted R-squared
def adjusted_r2(X, Y, Yhat):
    from sklearn.metrics import r2_score
    r_squared = r2_score(Y, Yhat)
    # Penalize R-squared by the number of explanatory variables
    adjusted_r2 = 1 - (1 - r_squared) * (len(Y) - 1) / (len(Y) - X.shape[1] - 1)
    return adjusted_r2
# Check accuracy on the training data (to gauge how much the model overfits)
print("train r2_adjusted",adjusted_r2(X_train,X_train["SalePrice"], predictor.predict(X_train)))
train r2_adjusted 0.9680146829503177
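For comparison, AutoGluon's built-in evaluate method reports several regression metrics in one call; a minimal sketch (run on the training data here, so the numbers will look optimistic):

# Returns a dict of metrics such as root_mean_squared_error and r2
print(predictor.evaluate(X_train))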
Apply the model and predict SalePrice
df_test["SalePrice"] = predictor.predict(X_train)
df_test[["Id","SalePrice"]]
        Id      SalePrice
0     1461  208608.531250
1     1462  183918.640625
2     1463  225355.890625
3     1464  141544.953125
4     1465  286628.718750
...    ...            ...
1454  2915  187273.500000
1455  2916  173471.203125
1456  2917  208200.875000
1457  2918  266205.906250
1458  2919  142144.937500

1459 rows × 2 columns
# Check the distribution of the predicted SalePrice
sns.histplot(df_test["SalePrice"],bins=20)
Uploading the scored results to Kaggle
df_test[["Id","SalePrice"]].to_csv("ames_submission.csv",index=False)
!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c house-prices-advanced-regression-techniques -f ames_submission.csv -m "#9 automl autogluon"
100%|██████████████████████████████████████| 21.1k/21.1k [00:04<00:00, 5.29kB/s] Successfully submitted to House Prices - Advanced Regression Techniques #9 automl autogluon Score: 0.56305
Surprisingly, the score was not good at all. Perhaps the training time was insufficient, since I had limited it to 600 seconds.
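If training time really was the bottleneck, the first things I would try are a longer time_limit and the best_quality preset, both of which are documented fit options. An untested sketch (the output path RESULT_AUTOGLUON_LONG is just a placeholder):

# Untested: give AutoGluon an hour and enable bagging/stacking via presets
predictor = TabularPredictor(
    label="SalePrice", problem_type="regression", path="RESULT_AUTOGLUON_LONG"
).fit(X_train, presets="best_quality", time_limit=3600)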
Versions of the libraries used
pandas Version: 1.3.5
numpy Version: 1.21.6
scikit-learn Version: 1.0.2
seaborn Version: 0.11.2
matplotlib Version: 3.5.2
autogluon Version: 0.5.2
Summary
Next time, I'd like to try auto-sklearn.