Last time, we built a classification model with XGBoost.
The provisional leader is the model built with logistic regression CV, which scored 0.76794 on Kaggle.
This time I would like to try AutoML.
It builds all sorts of models and searches for the most accurate one, which saves a great deal of time.
How to set up the AutoML environments on a Mac is summarized in the articles below. Since almost everything is installed with pip, the same commands may well work on Linux too.
Note: anything installed with brew install will need to be replaced with yum or apt.
(MLJAR) Setting up three AutoML environments in Python
(AutoGluon) Setting up three AutoML environments in Python
(auto-sklearn) Setting up three AutoML environments in Python
I have prepared three AutoML environments so far, so I would like to try them in turn.
First up is mljar.
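For reference, the mljar package is distributed on PyPI as mljar-supervised, so a plain pip setup is usually just:
pip install mljar-supervised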
Evaluation metric
The Titanic competition uses the proportion of passengers whose survival is predicted correctly (Accuracy) as its evaluation metric.
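As a minimal illustration of the metric (toy labels, not data from this article), accuracy is simply the fraction of correct predictions:
# Accuracy = number of correct predictions / number of predictions
from sklearn.metrics import accuracy_score
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
print(accuracy_score(y_true, y_pred))  # 3 of 4 correct -> 0.75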
Preparing the data for analysis
Missing-value handling and feature engineering were done beforehand, and the resulting data was exported.
To reproduce the results in this article, please check the article below and prepare the data first.
Loading the training and evaluation data
import pandas as pd
import numpy as np
# Load the training and evaluation data for the Titanic dataset
df_train = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/titanic_train.csv")
df_eval = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/titanic_eval.csv")
Data overview
# Check the overview of the training data
df_train.info()
RangeIndex: 891 entries, 0 to 890
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   PassengerId    891 non-null    int64
 1   Survived       891 non-null    int64
 2   Pclass         891 non-null    int64
 3   Name           891 non-null    object
 4   Sex            891 non-null    object
 5   Age            891 non-null    float64
 6   SibSp          891 non-null    int64
 7   Parch          891 non-null    int64
 8   Ticket         891 non-null    object
 9   Fare           891 non-null    float64
 10  Cabin          204 non-null    object
 11  Embarked       891 non-null    object
 12  FamilyCnt      891 non-null    int64
 13  SameTicketCnt  891 non-null    int64
 14  Pclass_str_1   891 non-null    float64
 15  Pclass_str_2   891 non-null    float64
 16  Pclass_str_3   891 non-null    float64
 17  Sex_female     891 non-null    float64
 18  Sex_male       891 non-null    float64
 19  Embarked_C     891 non-null    float64
 20  Embarked_Q     891 non-null    float64
 21  Embarked_S     891 non-null    float64
dtypes: float64(10), int64(7), object(5)
memory usage: 153.3+ KB
# Check the overview of the evaluation data
df_eval.info()
RangeIndex: 418 entries, 0 to 417
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   PassengerId    418 non-null    int64
 1   Pclass         418 non-null    int64
 2   Name           418 non-null    object
 3   Sex            418 non-null    object
 4   Age            418 non-null    float64
 5   SibSp          418 non-null    int64
 6   Parch          418 non-null    int64
 7   Ticket         418 non-null    object
 8   Fare           418 non-null    float64
 9   Cabin          91 non-null     object
 10  Embarked       418 non-null    object
 11  Pclass_str_1   418 non-null    float64
 12  Pclass_str_2   418 non-null    float64
 13  Pclass_str_3   418 non-null    float64
 14  Sex_female     418 non-null    float64
 15  Sex_male       418 non-null    float64
 16  Embarked_C     418 non-null    float64
 17  Embarked_Q     418 non-null    float64
 18  Embarked_S     418 non-null    float64
 19  FamilyCnt      418 non-null    int64
 20  SameTicketCnt  418 non-null    int64
dtypes: float64(10), int64(6), object(5)
memory usage: 68.7+ KB
# Plot settings
import seaborn as sns
from matplotlib import ticker
import matplotlib.pyplot as plt
sns.set_style("whitegrid")
from matplotlib import rcParams
rcParams['font.family'] = 'Hiragino Sans' # for macOS
#rcParams['font.family'] = 'Meiryo' # for Windows
#rcParams['font.family'] = 'VL PGothic' # for Linux
rcParams['xtick.labelsize'] = 12 # font size of the x-axis tick labels
rcParams['ytick.labelsize'] = 12 # font size of the y-axis tick labels
rcParams['axes.labelsize'] = 18 # font size of the axis labels
rcParams['figure.figsize'] = 18,8 # figure size (inches)
Splitting the training data into train and test sets for modeling
# Split into train and test sets.
from sklearn.model_selection import train_test_split
x_train, x_test = train_test_split(df_train, test_size=0.20,random_state=100)
# Explanatory variables
FEATURE_COLS=[
'Age'
, 'Fare'
, 'SameTicketCnt'
, 'Pclass_str_1'
, 'Pclass_str_3'
, 'Sex_female'
, 'Embarked_Q'
, 'Embarked_S'
]
X_train = x_train[FEATURE_COLS] # explanatory variables (train)
Y_train = x_train["Survived"] # target variable (train)
X_test = x_test[FEATURE_COLS] # explanatory variables (test)
Y_test = x_test["Survived"] # target variable (test)
mljar
# https://supervised.mljar.com/api/
# Build a model with mljar
from supervised.automl import AutoML
automl = AutoML(mode="Compete", random_state=100)
Several modes are available; this time we run Compete mode.
Compete mode is intended for competitions and pursues accuracy.
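For reference, here is a minimal sketch of the other configurations (the parameter values are illustrative; total_time_limit is mljar's overall time budget in seconds):
# A minimal sketch of other AutoML modes (illustrative values)
from supervised.automl import AutoML
# "Explain" builds a handful of explained models, "Perform" balances speed
# and accuracy, and "Compete" chases accuracy within the time budget.
automl_quick = AutoML(mode="Explain", random_state=100)
automl_budgeted = AutoML(mode="Compete", total_time_limit=3600, random_state=100)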
Building the model
# Train with fit
automl.fit(X_train,Y_train)
AutoML directory: AutoML_2
The task is binary_classification with evaluation metric logloss
AutoML will use algorithms: ['Decision Tree', 'Linear', 'Random Forest', 'Extra Trees', 'LightGBM', 'Xgboost', 'CatBoost', 'Neural Network', 'Nearest Neighbors']
AutoML will stack models
AutoML will ensemble available models
AutoML steps: ['adjust_validation', 'simple_algorithms', 'default_algorithms', 'not_so_random', 'golden_features', 'kmeans_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'boost_on_errors', 'ensemble', 'stack', 'ensemble_stacked']
* Step adjust_validation will try to check up to 1 model
1_DecisionTree logloss 0.567949 trained in 1.19 seconds
Adjust validation. Remove: 1_DecisionTree
Validation strategy: 10-fold CV Shuffle,Stratify
* Step simple_algorithms will try to check up to 4 models
1_DecisionTree logloss 0.602245 trained in 4.38 seconds
2_DecisionTree logloss 0.563085 trained in 4.04 seconds
3_DecisionTree logloss 0.455007 trained in 4.42 seconds
4_Linear logloss 0.450752 trained in 8.04 seconds
* Step default_algorithms will try to check up to 7 models
5_Default_LightGBM logloss 0.421875 trained in 6.39 seconds
6_Default_Xgboost logloss 0.419798 trained in 6.82 seconds
7_Default_CatBoost logloss 0.39265 trained in 6.3 seconds
8_Default_NeuralNetwork logloss 0.452638 trained in 8.01 seconds
9_Default_RandomForest logloss 0.412185 trained in 11.74 seconds
10_Default_ExtraTrees logloss 0.447975 trained in 11.89 seconds
11_Default_NearestNeighbors logloss 0.920078 trained in 6.0 seconds
* Step not_so_random will try to check up to 61 models
21_LightGBM logloss 0.418538 trained in 7.4 seconds
12_Xgboost logloss 0.41681 trained in 8.47 seconds
30_CatBoost logloss 0.388772 trained in 7.15 seconds
39_RandomForest logloss 0.410939 trained in 13.22 seconds
48_ExtraTrees logloss 0.414481 trained in 12.63 seconds
57_NeuralNetwork logloss 0.47381 trained in 9.41 seconds
66_NearestNeighbors logloss 0.807434 trained in 7.35 seconds
22_LightGBM logloss 0.410037 trained in 8.64 seconds
13_Xgboost logloss 0.43098 trained in 9.28 seconds
31_CatBoost logloss 0.39616 trained in 10.19 seconds
40_RandomForest logloss 0.414593 trained in 18.0 seconds
49_ExtraTrees logloss 0.417487 trained in 14.32 seconds
58_NeuralNetwork logloss 0.460169 trained in 12.2 seconds
67_NearestNeighbors logloss 1.049051 trained in 8.35 seconds
23_LightGBM logloss 0.430469 trained in 9.81 seconds
14_Xgboost logloss 0.438132 trained in 10.99 seconds
32_CatBoost logloss 0.400784 trained in 14.11 seconds
41_RandomForest logloss 0.405409 trained in 15.38 seconds
50_ExtraTrees logloss 0.420452 trained in 14.92 seconds
59_NeuralNetwork logloss 0.51347 trained in 12.73 seconds
68_NearestNeighbors logloss 1.427616 trained in 10.03 seconds
24_LightGBM logloss 0.426847 trained in 12.63 seconds
15_Xgboost logloss 0.402373 trained in 12.21 seconds
33_CatBoost logloss 0.390621 trained in 12.36 seconds
42_RandomForest logloss 0.41898 trained in 20.05 seconds
51_ExtraTrees logloss 0.423715 trained in 16.42 seconds
60_NeuralNetwork logloss 0.52747 trained in 13.77 seconds
69_NearestNeighbors logloss 1.427616 trained in 10.48 seconds
25_LightGBM logloss 0.419036 trained in 12.17 seconds
16_Xgboost logloss 0.463085 trained in 12.62 seconds
34_CatBoost logloss 0.402524 trained in 13.88 seconds
43_RandomForest logloss 0.416642 trained in 23.7 seconds
52_ExtraTrees logloss 0.435716 trained in 20.86 seconds
61_NeuralNetwork logloss 0.546483 trained in 15.6 seconds
70_NearestNeighbors logloss 1.049051 trained in 11.85 seconds
26_LightGBM logloss 0.406914 trained in 13.12 seconds
17_Xgboost logloss 0.585177 trained in 14.25 seconds
35_CatBoost logloss 0.397436 trained in 16.04 seconds
44_RandomForest logloss 0.414609 trained in 19.83 seconds
53_ExtraTrees logloss 0.454507 trained in 21.77 seconds
62_NeuralNetwork logloss 0.643082 trained in 15.52 seconds
71_NearestNeighbors logloss 1.608606 trained in 12.84 seconds
27_LightGBM logloss 0.424306 trained in 14.85 seconds
18_Xgboost logloss 0.596754 trained in 14.84 seconds
36_CatBoost logloss 0.396218 trained in 15.97 seconds
45_RandomForest logloss 0.421922 trained in 23.84 seconds
54_ExtraTrees logloss 0.470755 trained in 20.88 seconds
63_NeuralNetwork logloss 0.475343 trained in 17.87 seconds
72_NearestNeighbors logloss 1.427616 trained in 14.14 seconds
28_LightGBM logloss 0.409946 trained in 15.94 seconds
19_Xgboost logloss 0.430856 trained in 16.9 seconds
37_CatBoost logloss 0.389202 trained in 17.44 seconds
46_RandomForest logloss 0.413533 trained in 23.31 seconds
55_ExtraTrees logloss 0.442965 trained in 20.53 seconds
64_NeuralNetwork logloss 0.480929 trained in 17.99 seconds
29_LightGBM logloss 0.417319 trained in 16.69 seconds
20_Xgboost logloss 0.584405 trained in 18.04 seconds
38_CatBoost logloss 0.395742 trained in 18.44 seconds
47_RandomForest logloss 0.414602 trained in 26.64 seconds
56_ExtraTrees logloss 0.418954 trained in 25.37 seconds
65_NeuralNetwork logloss 0.500499 trained in 19.26 seconds
* Step golden_features will try to check up to 3 models
None 10
Add Golden Feature: Pclass_str_3_diff_Sex_female
Add Golden Feature: Sex_female_sum_Pclass_str_1
Add Golden Feature: Sex_female_ratio_SameTicketCnt
Add Golden Feature: Sex_female_multiply_SameTicketCnt
Add Golden Feature: SameTicketCnt_ratio_Sex_female
Add Golden Feature: Sex_female_diff_Embarked_S
Add Golden Feature: Embarked_Q_sum_Sex_female
Add Golden Feature: Sex_female_sum_SameTicketCnt
Add Golden Feature: Sex_female_multiply_Pclass_str_1
Add Golden Feature: Sex_female_ratio_Pclass_str_1
Created 10 Golden Features in 11.27 seconds.
30_CatBoost_GoldenFeatures logloss 0.390194 trained in 30.33 seconds
37_CatBoost_GoldenFeatures logloss 0.3996 trained in 19.95 seconds
33_CatBoost_GoldenFeatures logloss 0.398953 trained in 20.31 seconds
* Step kmeans_features will try to check up to 3 models
30_CatBoost_KMeansFeatures logloss 0.394747 trained in 20.3 seconds
37_CatBoost_KMeansFeatures logloss 0.403682 trained in 21.41 seconds
33_CatBoost_KMeansFeatures logloss 0.402157 trained in 21.79 seconds
* Step insert_random_feature will try to check up to 1 model
30_CatBoost_RandomFeature logloss 0.398843 trained in 20.69 seconds
Drop features ['Embarked_S', 'random_feature', 'Embarked_Q']
* Step features_selection will try to check up to 6 models
30_CatBoost_SelectedFeatures logloss 0.39354 trained in 21.2 seconds
15_Xgboost_SelectedFeatures logloss 0.402946 trained in 20.32 seconds
41_RandomForest_SelectedFeatures logloss 0.408014 trained in 26.05 seconds
26_LightGBM_SelectedFeatures logloss 0.407294 trained in 19.79 seconds
48_ExtraTrees_SelectedFeatures logloss 0.410218 trained in 24.66 seconds
8_Default_NeuralNetwork_SelectedFeatures logloss 0.431012 trained in 22.16 seconds
* Step hill_climbing_1 will try to check up to 31 models
73_CatBoost logloss 0.389943 trained in 20.52 seconds
74_CatBoost_GoldenFeatures logloss 0.395436 trained in 20.95 seconds
75_Xgboost logloss 0.40114 trained in 21.68 seconds
76_Xgboost logloss 0.405252 trained in 22.0 seconds
77_Xgboost_SelectedFeatures logloss 0.402009 trained in 22.96 seconds
78_Xgboost_SelectedFeatures logloss 0.403257 trained in 22.28 seconds
79_RandomForest logloss 0.411572 trained in 28.07 seconds
80_RandomForest logloss 0.407432 trained in 27.12 seconds
81_LightGBM logloss 0.410037 trained in 21.76 seconds
82_LightGBM logloss 0.40899 trained in 21.47 seconds
83_LightGBM_SelectedFeatures logloss 0.408176 trained in 22.33 seconds
84_LightGBM_SelectedFeatures logloss 0.406016 trained in 22.01 seconds
85_RandomForest_SelectedFeatures logloss 0.411539 trained in 30.15 seconds
86_RandomForest_SelectedFeatures logloss 0.408014 trained in 28.85 seconds
87_LightGBM logloss 0.415542 trained in 22.73 seconds
88_LightGBM logloss 0.419886 trained in 22.69 seconds
89_ExtraTrees_SelectedFeatures logloss 0.410218 trained in 28.21 seconds
90_RandomForest logloss 0.410939 trained in 29.69 seconds
91_ExtraTrees logloss 0.414481 trained in 28.02 seconds
92_Xgboost logloss 0.418173 trained in 24.23 seconds
93_ExtraTrees logloss 0.417487 trained in 40.5 seconds
94_NeuralNetwork_SelectedFeatures logloss 0.443693 trained in 37.7 seconds
95_NeuralNetwork_SelectedFeatures logloss 0.514071 trained in 41.13 seconds
96_NeuralNetwork logloss 0.465475 trained in 29.56 seconds
97_NeuralNetwork logloss 0.479735 trained in 32.72 seconds
98_DecisionTree logloss 0.514448 trained in 30.53 seconds
99_NeuralNetwork logloss 0.531638 trained in 28.44 seconds
100_DecisionTree logloss 0.464915 trained in 24.23 seconds
101_DecisionTree logloss 0.723283 trained in 24.46 seconds
102_DecisionTree logloss 0.514448 trained in 24.73 seconds
103_NearestNeighbors logloss 1.1462 trained in 24.79 seconds
* Step hill_climbing_2 will try to check up to 10 models
104_CatBoost logloss 0.394675 trained in 26.15 seconds
105_Xgboost logloss 0.411168 trained in 27.43 seconds
106_Xgboost_SelectedFeatures logloss 0.411864 trained in 41.72 seconds
107_Xgboost logloss 0.40767 trained in 37.67 seconds
108_LightGBM_SelectedFeatures logloss 0.410488 trained in 30.6 seconds
109_LightGBM logloss 0.406375 trained in 33.79 seconds
110_LightGBM_SelectedFeatures logloss 0.407976 trained in 36.19 seconds
111_RandomForest logloss 0.418069 trained in 42.58 seconds
112_ExtraTrees_SelectedFeatures logloss 0.411296 trained in 37.63 seconds
113_NeuralNetwork_SelectedFeatures logloss 0.489014 trained in 33.15 seconds
* Step boost_on_errors will try to check up to 1 model
30_CatBoost_BoostOnErrors logloss 0.393569 trained in 30.84 seconds
* Step ensemble will try to check up to 1 model
Ensemble logloss 0.384841 trained in 84.58 seconds
* Step stack will try to check up to 59 models
30_CatBoost_Stacked logloss 0.373164 trained in 38.49 seconds
75_Xgboost_Stacked logloss 0.389562 trained in 40.42 seconds
41_RandomForest_Stacked logloss 0.391046 trained in 43.91 seconds
84_LightGBM_SelectedFeatures_Stacked logloss 0.39183 trained in 30.89 seconds
89_ExtraTrees_SelectedFeatures_Stacked logloss 0.370869 trained in 40.62 seconds
8_Default_NeuralNetwork_SelectedFeatures_Stacked logloss 0.412413 trained in 32.7 seconds
37_CatBoost_Stacked logloss 0.387002 trained in 42.13 seconds
77_Xgboost_SelectedFeatures_Stacked logloss 0.39166 trained in 42.12 seconds
80_RandomForest_Stacked logloss 0.390082 trained in 43.68 seconds
109_LightGBM_Stacked logloss 0.389491 trained in 35.17 seconds
48_ExtraTrees_SelectedFeatures_Stacked logloss 0.372773 trained in 40.27 seconds
94_NeuralNetwork_SelectedFeatures_Stacked logloss 0.445272 trained in 33.97 seconds
73_CatBoost_Stacked logloss 0.373393 trained in 62.72 seconds
15_Xgboost_Stacked logloss 0.390611 trained in 38.46 seconds
41_RandomForest_SelectedFeatures_Stacked logloss 0.383395 trained in 52.86 seconds
26_LightGBM_Stacked logloss 0.393751 trained in 35.58 seconds
* Step ensemble_stacked will try to check up to 1 model
Ensemble_Stacked logloss 0.366951 trained in 112.21 seconds
AutoML fit time: 3754.78 seconds
AutoML best model: Ensemble_Stacked
It automatically explores a truly wide range of techniques.
In the end, the most accurate model was Ensemble_Stacked.
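As a side note, mljar saves every trained model's results under the AutoML directory shown in the log. Here is a minimal sketch for inspecting them, assuming the default layout (the leaderboard.csv file name and the metric_value column are assumptions based on mljar's usual output, not something shown in this article):
# Inspect the saved results (assumed default directory from the log: AutoML_2)
import pandas as pd
leaderboard = pd.read_csv("AutoML_2/leaderboard.csv")
print(leaderboard.sort_values("metric_value").head())  # lower logloss is better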
Checking the accuracy
# Return the mean accuracy on the given data and labels.
print("train",automl.score(X_train,Y_train))
print("test",automl.score(X_test,Y_test))
train 0.8637640449438202
test 0.8156424581005587
With roughly 0.86 on train and 0.82 on test, the gap is about five points, so the model does not seem to be overfitting too badly.
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import confusion_matrix
print(confusion_matrix(Y_test,automl.predict(X_test)))
ConfusionMatrixDisplay.from_estimator(automl,X_test,Y_test,cmap="Reds",display_labels=["Not survived","Survived"],normalize="all")
plt.show()
[[96 8]
[25 50]]
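As a sanity check, the test accuracy can be recovered from this confusion matrix: (96 + 50) / (96 + 8 + 25 + 50) = 146 / 179 ≈ 0.8156, which matches the test score above.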
Uploading the predictions to Kaggle
df_eval["Survived"] = automl.predict(df_eval[FEATURE_COLS])
df_eval[["PassengerId","Survived"]].to_csv("titanic_submission.csv",index=False)
!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c titanic -f titanic_submission.csv -m "model #009. mljar"
100%|████████████████████████████████████████| 2.77k/2.77k [00:05<00:00, 566B/s]
Successfully submitted to Titanic - Machine Learning from Disaster
The score came back as 0.77511.
The logistic regression CV model scored 0.76794, so the model built with mljar takes the provisional #1 spot.
AutoML really is impressive.
Perhaps, rather than the modeling technique itself, it is how you build the modeling data that depends on the analyst's knowledge and skill?
I will also think seriously about how to differentiate my own career from others.
Summary
mljar's accuracy of 0.775 takes the provisional #1 spot.
As a bonus experiment, I would also like to check what happens if, say, the full set of columns is thrown in as-is, without feature selection or data preprocessing; see the sketch below.
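A rough sketch of that experiment (hypothetical setup: the column handling below is my assumption; mljar preprocesses raw columns such as strings and missing values itself, which is part of what the experiment would test):
# A rough sketch of the "feed in everything" experiment (hypothetical)
from supervised.automl import AutoML
# Drop only the ID and the target; leave raw columns (Name, Ticket, Cabin, ...) in.
ALL_COLS = [c for c in df_train.columns if c not in ("PassengerId", "Survived")]
automl_all = AutoML(mode="Compete", random_state=100)
automl_all.fit(df_train[ALL_COLS], df_train["Survived"])
df_eval["Survived"] = automl_all.predict(df_eval[ALL_COLS])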