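Note: the brew install commands in the setup articles below need to be replaced with yum or apt, depending on your OS.
(MLJAR) Setting up three AutoML environments in Python
(AutoGluon) Setting up three AutoML environments in Python
(auto-sklearn) Setting up three AutoML environments in Python
Let's start with mljar.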

import pandas as pd
import numpy as np
# Load the Titanic training data and evaluation data
df_train = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/titanic_train.csv")
df_eval = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/titanic_eval.csv")
# Overview check
df_train.info()
RangeIndex: 891 entries, 0 to 890
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   PassengerId    891 non-null    int64
 1   Survived       891 non-null    int64
 2   Pclass         891 non-null    int64
 3   Name           891 non-null    object
 4   Sex            891 non-null    object
 5   Age            891 non-null    float64
 6   SibSp          891 non-null    int64
 7   Parch          891 non-null    int64
 8   Ticket         891 non-null    object
 9   Fare           891 non-null    float64
 10  Cabin          204 non-null    object
 11  Embarked       891 non-null    object
 12  FamilyCnt      891 non-null    int64
 13  SameTicketCnt  891 non-null    int64
 14  Pclass_str_1   891 non-null    float64
 15  Pclass_str_2   891 non-null    float64
 16  Pclass_str_3   891 non-null    float64
 17  Sex_female     891 non-null    float64
 18  Sex_male       891 non-null    float64
 19  Embarked_C     891 non-null    float64
 20  Embarked_Q     891 non-null    float64
 21  Embarked_S     891 non-null    float64
dtypes: float64(10), int64(7), object(5)
memory usage: 153.3+ KB
# Overview check
df_eval.info()
RangeIndex: 418 entries, 0 to 417
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   PassengerId    418 non-null    int64
 1   Pclass         418 non-null    int64
 2   Name           418 non-null    object
 3   Sex            418 non-null    object
 4   Age            418 non-null    float64
 5   SibSp          418 non-null    int64
 6   Parch          418 non-null    int64
 7   Ticket         418 non-null    object
 8   Fare           418 non-null    float64
 9   Cabin          91 non-null     object
 10  Embarked       418 non-null    object
 11  Pclass_str_1   418 non-null    float64
 12  Pclass_str_2   418 non-null    float64
 13  Pclass_str_3   418 non-null    float64
 14  Sex_female     418 non-null    float64
 15  Sex_male       418 non-null    float64
 16  Embarked_C     418 non-null    float64
 17  Embarked_Q     418 non-null    float64
 18  Embarked_S     418 non-null    float64
 19  FamilyCnt      418 non-null    int64
 20  SameTicketCnt  418 non-null    int64
dtypes: float64(10), int64(6), object(5)
memory usage: 68.7+ KB
# Plot settings
import seaborn as sns
from matplotlib import ticker
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams['font.family'] = 'Hiragino Sans' # on Mac
#rcParams['font.family'] = 'Meiryo' # on Windows
#rcParams['font.family'] = 'VL PGothic' # on Linux
rcParams['xtick.labelsize'] = 12 # font size of the x-axis tick labels
rcParams['ytick.labelsize'] = 12 # font size of the y-axis tick labels
rcParams['axes.labelsize'] = 18 # font size of the axis labels
rcParams['figure.figsize'] = 18,8 # figure size (inches)
# Split into training data and test data.
from sklearn.model_selection import train_test_split
x_train, x_test = train_test_split(df_train, test_size=0.20,random_state=100)
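The row counts produced by this split can be checked by hand. The sketch below assumes, as I understand scikit-learn's behaviour, that a fractional test_size is rounded up to a whole number of rows and the remainder goes to the training side; the variable names are illustrative only.

```python
import math

n_rows = 891        # rows in df_train, per the info() output
test_size = 0.20    # same fraction as passed to train_test_split

# Assumption: the test row count is the ceiling of test_size * n_rows.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```

Under that assumption the test side has 179 rows, which agrees with the four cells of the confusion matrix shown later (96 + 8 + 25 + 50).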
# Explanatory variables
FEATURE_COLS = [
    # the first entry of this list is not shown in the source
    'Fare'
  , 'SameTicketCnt'
  , 'Pclass_str_1'
  , 'Pclass_str_3'
  , 'Sex_female'
  , 'Embarked_Q'
  , 'Embarked_S'
]
X_train = x_train[FEATURE_COLS] # explanatory variables (train)
Y_train = x_train["Survived"] # target variable (train)
X_test = x_test[FEATURE_COLS] # explanatory variables (test)
Y_test = x_test["Survived"] # target variable (test)
# https://supervised.mljar.com/api/
# Create the mljar model
from supervised.automl import AutoML
automl = AutoML(mode="Compete", random_state=100)
# Train with fit
automl.fit(X_train, Y_train)
AutoML directory: AutoML_2
The task is binary_classification with evaluation metric logloss
AutoML will use algorithms: ['Decision Tree', 'Linear', 'Random Forest', 'Extra Trees', 'LightGBM', 'Xgboost', 'CatBoost', 'Neural Network', 'Nearest Neighbors']
AutoML will stack models
AutoML will ensemble available models
AutoML steps: ['adjust_validation', 'simple_algorithms', 'default_algorithms', 'not_so_random', 'golden_features', 'kmeans_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'boost_on_errors', 'ensemble', 'stack', 'ensemble_stacked']
* Step adjust_validation will try to check up to 1 model
1_DecisionTree logloss 0.567949 trained in 1.19 seconds
Adjust validation. Remove: 1_DecisionTree
Validation strategy: 10-fold CV Shuffle,Stratify
* Step simple_algorithms will try to check up to 4 models
1_DecisionTree logloss 0.602245 trained in 4.38 seconds
2_DecisionTree logloss 0.563085 trained in 4.04 seconds
3_DecisionTree logloss 0.455007 trained in 4.42 seconds
4_Linear logloss 0.450752 trained in 8.04 seconds
* Step default_algorithms will try to check up to 7 models
5_Default_LightGBM logloss 0.421875 trained in 6.39 seconds
6_Default_Xgboost logloss 0.419798 trained in 6.82 seconds
7_Default_CatBoost logloss 0.39265 trained in 6.3 seconds
8_Default_NeuralNetwork logloss 0.452638 trained in 8.01 seconds
9_Default_RandomForest logloss 0.412185 trained in 11.74 seconds
10_Default_ExtraTrees logloss 0.447975 trained in 11.89 seconds
11_Default_NearestNeighbors logloss 0.920078 trained in 6.0 seconds
* Step not_so_random will try to check up to 61 models
21_LightGBM logloss 0.418538 trained in 7.4 seconds
12_Xgboost logloss 0.41681 trained in 8.47 seconds
30_CatBoost logloss 0.388772 trained in 7.15 seconds
39_RandomForest logloss 0.410939 trained in 13.22 seconds
48_ExtraTrees logloss 0.414481 trained in 12.63 seconds
57_NeuralNetwork logloss 0.47381 trained in 9.41 seconds
66_NearestNeighbors logloss 0.807434 trained in 7.35 seconds
22_LightGBM logloss 0.410037 trained in 8.64 seconds
13_Xgboost logloss 0.43098 trained in 9.28 seconds
31_CatBoost logloss 0.39616 trained in 10.19 seconds
40_RandomForest logloss 0.414593 trained in 18.0 seconds
49_ExtraTrees logloss 0.417487 trained in 14.32 seconds
58_NeuralNetwork logloss 0.460169 trained in 12.2 seconds
67_NearestNeighbors logloss 1.049051 trained in 8.35 seconds
23_LightGBM logloss 0.430469 trained in 9.81 seconds
14_Xgboost logloss 0.438132 trained in 10.99 seconds
32_CatBoost logloss 0.400784 trained in 14.11 seconds
41_RandomForest logloss 0.405409 trained in 15.38 seconds
50_ExtraTrees logloss 0.420452 trained in 14.92 seconds
59_NeuralNetwork logloss 0.51347 trained in 12.73 seconds
68_NearestNeighbors logloss 1.427616 trained in 10.03 seconds
24_LightGBM logloss 0.426847 trained in 12.63 seconds
15_Xgboost logloss 0.402373 trained in 12.21 seconds
33_CatBoost logloss 0.390621 trained in 12.36 seconds
42_RandomForest logloss 0.41898 trained in 20.05 seconds
51_ExtraTrees logloss 0.423715 trained in 16.42 seconds
60_NeuralNetwork logloss 0.52747 trained in 13.77 seconds
69_NearestNeighbors logloss 1.427616 trained in 10.48 seconds
25_LightGBM logloss 0.419036 trained in 12.17 seconds
16_Xgboost logloss 0.463085 trained in 12.62 seconds
34_CatBoost logloss 0.402524 trained in 13.88 seconds
43_RandomForest logloss 0.416642 trained in 23.7 seconds
52_ExtraTrees logloss 0.435716 trained in 20.86 seconds
61_NeuralNetwork logloss 0.546483 trained in 15.6 seconds
70_NearestNeighbors logloss 1.049051 trained in 11.85 seconds
26_LightGBM logloss 0.406914 trained in 13.12 seconds
17_Xgboost logloss 0.585177 trained in 14.25 seconds
35_CatBoost logloss 0.397436 trained in 16.04 seconds
44_RandomForest logloss 0.414609 trained in 19.83 seconds
53_ExtraTrees logloss 0.454507 trained in 21.77 seconds
62_NeuralNetwork logloss 0.643082 trained in 15.52 seconds
71_NearestNeighbors logloss 1.608606 trained in 12.84 seconds
27_LightGBM logloss 0.424306 trained in 14.85 seconds
18_Xgboost logloss 0.596754 trained in 14.84 seconds
36_CatBoost logloss 0.396218 trained in 15.97 seconds
45_RandomForest logloss 0.421922 trained in 23.84 seconds
54_ExtraTrees logloss 0.470755 trained in 20.88 seconds
63_NeuralNetwork logloss 0.475343 trained in 17.87 seconds
72_NearestNeighbors logloss 1.427616 trained in 14.14 seconds
28_LightGBM logloss 0.409946 trained in 15.94 seconds
19_Xgboost logloss 0.430856 trained in 16.9 seconds
37_CatBoost logloss 0.389202 trained in 17.44 seconds
46_RandomForest logloss 0.413533 trained in 23.31 seconds
55_ExtraTrees logloss 0.442965 trained in 20.53 seconds
64_NeuralNetwork logloss 0.480929 trained in 17.99 seconds
29_LightGBM logloss 0.417319 trained in 16.69 seconds
20_Xgboost logloss 0.584405 trained in 18.04 seconds
38_CatBoost logloss 0.395742 trained in 18.44 seconds
47_RandomForest logloss 0.414602 trained in 26.64 seconds
56_ExtraTrees logloss 0.418954 trained in 25.37 seconds
65_NeuralNetwork logloss 0.500499 trained in 19.26 seconds
* Step golden_features will try to check up to 3 models
None
10
Add Golden Feature: Pclass_str_3_diff_Sex_female
Add Golden Feature: Sex_female_sum_Pclass_str_1
Add Golden Feature: Sex_female_ratio_SameTicketCnt
Add Golden Feature: Sex_female_multiply_SameTicketCnt
Add Golden Feature: SameTicketCnt_ratio_Sex_female
Add Golden Feature: Sex_female_diff_Embarked_S
Add Golden Feature: Embarked_Q_sum_Sex_female
Add Golden Feature: Sex_female_sum_SameTicketCnt
Add Golden Feature: Sex_female_multiply_Pclass_str_1
Add Golden Feature: Sex_female_ratio_Pclass_str_1
Created 10 Golden Features in 11.27 seconds.
30_CatBoost_GoldenFeatures logloss 0.390194 trained in 30.33 seconds
37_CatBoost_GoldenFeatures logloss 0.3996 trained in 19.95 seconds
33_CatBoost_GoldenFeatures logloss 0.398953 trained in 20.31 seconds
* Step kmeans_features will try to check up to 3 models
30_CatBoost_KMeansFeatures logloss 0.394747 trained in 20.3 seconds
37_CatBoost_KMeansFeatures logloss 0.403682 trained in 21.41 seconds
33_CatBoost_KMeansFeatures logloss 0.402157 trained in 21.79 seconds
* Step insert_random_feature will try to check up to 1 model
30_CatBoost_RandomFeature logloss 0.398843 trained in 20.69 seconds
Drop features ['Embarked_S', 'random_feature', 'Embarked_Q']
* Step features_selection will try to check up to 6 models
30_CatBoost_SelectedFeatures logloss 0.39354 trained in 21.2 seconds
15_Xgboost_SelectedFeatures logloss 0.402946 trained in 20.32 seconds
41_RandomForest_SelectedFeatures logloss 0.408014 trained in 26.05 seconds
26_LightGBM_SelectedFeatures logloss 0.407294 trained in 19.79 seconds
48_ExtraTrees_SelectedFeatures logloss 0.410218 trained in 24.66 seconds
8_Default_NeuralNetwork_SelectedFeatures logloss 0.431012 trained in 22.16 seconds
* Step hill_climbing_1 will try to check up to 31 models
73_CatBoost logloss 0.389943 trained in 20.52 seconds
74_CatBoost_GoldenFeatures logloss 0.395436 trained in 20.95 seconds
75_Xgboost logloss 0.40114 trained in 21.68 seconds
76_Xgboost logloss 0.405252 trained in 22.0 seconds
77_Xgboost_SelectedFeatures logloss 0.402009 trained in 22.96 seconds
78_Xgboost_SelectedFeatures logloss 0.403257 trained in 22.28 seconds
79_RandomForest logloss 0.411572 trained in 28.07 seconds
80_RandomForest logloss 0.407432 trained in 27.12 seconds
81_LightGBM logloss 0.410037 trained in 21.76 seconds
82_LightGBM logloss 0.40899 trained in 21.47 seconds
83_LightGBM_SelectedFeatures logloss 0.408176 trained in 22.33 seconds
84_LightGBM_SelectedFeatures logloss 0.406016 trained in 22.01 seconds
85_RandomForest_SelectedFeatures logloss 0.411539 trained in 30.15 seconds
86_RandomForest_SelectedFeatures logloss 0.408014 trained in 28.85 seconds
87_LightGBM logloss 0.415542 trained in 22.73 seconds
88_LightGBM logloss 0.419886 trained in 22.69 seconds
89_ExtraTrees_SelectedFeatures logloss 0.410218 trained in 28.21 seconds
90_RandomForest logloss 0.410939 trained in 29.69 seconds
91_ExtraTrees logloss 0.414481 trained in 28.02 seconds
92_Xgboost logloss 0.418173 trained in 24.23 seconds
93_ExtraTrees logloss 0.417487 trained in 40.5 seconds
94_NeuralNetwork_SelectedFeatures logloss 0.443693 trained in 37.7 seconds
95_NeuralNetwork_SelectedFeatures logloss 0.514071 trained in 41.13 seconds
96_NeuralNetwork logloss 0.465475 trained in 29.56 seconds
97_NeuralNetwork logloss 0.479735 trained in 32.72 seconds
98_DecisionTree logloss 0.514448 trained in 30.53 seconds
99_NeuralNetwork logloss 0.531638 trained in 28.44 seconds
100_DecisionTree logloss 0.464915 trained in 24.23 seconds
101_DecisionTree logloss 0.723283 trained in 24.46 seconds
102_DecisionTree logloss 0.514448 trained in 24.73 seconds
103_NearestNeighbors logloss 1.1462 trained in 24.79 seconds
* Step hill_climbing_2 will try to check up to 10 models
104_CatBoost logloss 0.394675 trained in 26.15 seconds
105_Xgboost logloss 0.411168 trained in 27.43 seconds
106_Xgboost_SelectedFeatures logloss 0.411864 trained in 41.72 seconds
107_Xgboost logloss 0.40767 trained in 37.67 seconds
108_LightGBM_SelectedFeatures logloss 0.410488 trained in 30.6 seconds
109_LightGBM logloss 0.406375 trained in 33.79 seconds
110_LightGBM_SelectedFeatures logloss 0.407976 trained in 36.19 seconds
111_RandomForest logloss 0.418069 trained in 42.58 seconds
112_ExtraTrees_SelectedFeatures logloss 0.411296 trained in 37.63 seconds
113_NeuralNetwork_SelectedFeatures logloss 0.489014 trained in 33.15 seconds
* Step boost_on_errors will try to check up to 1 model
30_CatBoost_BoostOnErrors logloss 0.393569 trained in 30.84 seconds
* Step ensemble will try to check up to 1 model
Ensemble logloss 0.384841 trained in 84.58 seconds
* Step stack will try to check up to 59 models
30_CatBoost_Stacked logloss 0.373164 trained in 38.49 seconds
75_Xgboost_Stacked logloss 0.389562 trained in 40.42 seconds
41_RandomForest_Stacked logloss 0.391046 trained in 43.91 seconds
84_LightGBM_SelectedFeatures_Stacked logloss 0.39183 trained in 30.89 seconds
89_ExtraTrees_SelectedFeatures_Stacked logloss 0.370869 trained in 40.62 seconds
8_Default_NeuralNetwork_SelectedFeatures_Stacked logloss 0.412413 trained in 32.7 seconds
37_CatBoost_Stacked logloss 0.387002 trained in 42.13 seconds
77_Xgboost_SelectedFeatures_Stacked logloss 0.39166 trained in 42.12 seconds
80_RandomForest_Stacked logloss 0.390082 trained in 43.68 seconds
109_LightGBM_Stacked logloss 0.389491 trained in 35.17 seconds
48_ExtraTrees_SelectedFeatures_Stacked logloss 0.372773 trained in 40.27 seconds
94_NeuralNetwork_SelectedFeatures_Stacked logloss 0.445272 trained in 33.97 seconds
73_CatBoost_Stacked logloss 0.373393 trained in 62.72 seconds
15_Xgboost_Stacked logloss 0.390611 trained in 38.46 seconds
41_RandomForest_SelectedFeatures_Stacked logloss 0.383395 trained in 52.86 seconds
26_LightGBM_Stacked logloss 0.393751 trained in 35.58 seconds
* Step ensemble_stacked will try to check up to 1 model
Ensemble_Stacked logloss 0.366951 trained in 112.21 seconds
AutoML fit time: 3754.78 seconds
AutoML best model: Ensemble_Stacked
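Every model in the log above is ranked by logloss, so it may help to see what that number measures. The following is a minimal pure-Python sketch of binary log loss; the function name binary_logloss and the sample labels and probabilities are made up for illustration and are not taken from mljar.

```python
import math

def binary_logloss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]              # made-up labels
confident = [0.9, 0.1, 0.8, 0.2]   # probabilities that lean the right way
coin_flip = [0.5, 0.5, 0.5, 0.5]   # always predicting 0.5

print(binary_logloss(y_true, confident))
print(binary_logloss(y_true, coin_flip))  # equals ln 2, about 0.693
```

Lower is better. Since always predicting 0.5 scores ln 2 (about 0.693), the NearestNeighbors runs in the log, which sit between roughly 0.8 and 1.6, do worse on this metric than an uninformative constant prediction, while the best model, Ensemble_Stacked, reaches 0.366951.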
# Return the mean accuracy on the given data and labels.
print("train", automl.score(X_train, Y_train))
print("test", automl.score(X_test, Y_test))
train 0.8637640449438202
test 0.8156424581005587
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import confusion_matrix
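cm = confusion_matrix(Y_test, automl.predict(X_test))
print(cm)
ConfusionMatrixDisplay(confusion_matrix=cm).plot()
plt.show()
[[96  8]
 [25 50]]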
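The confusion matrix can be turned into the usual summary metrics by hand. This sketch assumes the matrix follows scikit-learn's convention (rows are true labels, columns are predicted labels, label order 0 then 1), so the cells read as TN, FP on the first row and FN, TP on the second; the variable names are illustrative.

```python
# Cells copied from the printed matrix, assuming [[TN, FP], [FN, TP]].
tn, fp = 96, 8
fn, tp = 25, 50

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all passengers classified correctly
precision = tp / (tp + fp)     # of those predicted to survive, how many did
recall = tp / (tp + fn)        # of those who survived, how many were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy works out to 146/179, which matches the test score of 0.8156 reported above. Under the stated assumption, the model is fairly precise when it predicts survival but misses about a third of the actual survivors.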
df_eval["Survived"] = automl.predict(df_eval[FEATURE_COLS])
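The step that writes titanic_submission.csv is not shown here. A minimal sketch of how it might look follows, assuming the file holds only the PassengerId and Survived columns, which is the layout the Kaggle Titanic competition expects. The small df_eval_demo frame and the temporary output directory are stand-ins invented for this example, not the author's data or path.

```python
import os
import tempfile

import pandas as pd

# Stand-in for df_eval after the Survived predictions were added (values are made up).
df_eval_demo = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Fare": [7.8292, 7.0, 9.6875],
    "Survived": [0, 1, 0],
})

# Written to a temporary directory here; the post submits a file named titanic_submission.csv.
out_path = os.path.join(tempfile.mkdtemp(), "titanic_submission.csv")

# Keep only the two submission columns, force integer labels, and drop the index.
submission = df_eval_demo[["PassengerId", "Survived"]].astype({"Survived": int})
submission.to_csv(out_path, index=False)

with open(out_path) as f:
    submission_text = f.read()
print(submission_text)
```

Passing index=False matters because an extra unnamed index column would not match the two-column layout.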
!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c titanic -f titanic_submission.csv -m "model #009. mljar"
100%|████████████████████████████████████████| 2.77k/2.77k [00:05<00:00, 566B/s]
Successfully submitted to Titanic - Machine Learning from Disaster