I have built a variety of models on the Titanic dataset so far.
The models with the best accuracy were all ones built with AutoML.
Until now I had been modeling with only a portion of the training data, after missing-value imputation, variable selection, and feature engineering.
(I split the training data further into train and test sets because I wanted to check the accuracy with a confusion matrix.)
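For reference, here is a minimal sketch of that earlier split-and-evaluate flow. It assumes scikit-learn's train_test_split and confusion_matrix and uses a stand-in classifier; the actual models varied from post to post.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.ensemble import RandomForestClassifier  # stand-in model

df_train = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/titanic_train.csv")
FEATURE_COLS = ['Age', 'Fare', 'SameTicketCnt', 'Pclass_str_1', 'Pclass_str_3',
                'Sex_female', 'Embarked_Q', 'Embarked_S']
# Hold out part of the training data so a confusion matrix can be computed
X_tr, X_te, y_tr, y_te = train_test_split(
    df_train[FEATURE_COLS], df_train["Survived"], test_size=0.3, random_state=100)
model = RandomForestClassifier(random_state=100).fit(X_tr, y_tr)
pred = model.predict(X_te)
print(confusion_matrix(y_te, pred))
print("accuracy:", accuracy_score(y_te, pred))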
So I would like to try the following two patterns and see whether the accuracy improves:
- Modeling with all of the training data after missing-value imputation, variable selection, and feature engineering
- Modeling with all of the raw data, with no missing-value imputation, variable selection, or feature engineering at all
How to set up the AutoML environments on a Mac is summarized in the articles below. Almost everything is installed with pip, so the same commands will likely work on Linux as well.
* Note: anything installed with brew install will need to be replaced with yum or apt.
(MLJAR) Preparing three AutoML environments in Python
(AutoGluon) Preparing three AutoML environments in Python
(auto-sklearn) Preparing three AutoML environments in Python
Evaluation metric
The Titanic competition uses the proportion of passengers whose survival is predicted correctly (accuracy) as the evaluation metric.
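As a quick illustration (a standalone sketch, not part of the submission flow), accuracy is simply the share of correct predictions:

from sklearn.metrics import accuracy_score
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
print(accuracy_score(y_true, y_pred))  # 3 of 4 correct -> 0.75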
Pattern 1
Let's rebuild the models using all of the training data that went through missing-value imputation, variable selection, and feature engineering.
Since the amount of training data increases, I want to check whether the accuracy goes up.
mljar
import pandas as pd
import numpy as np
# Load the Titanic training data and the evaluation (submission) data
df_train = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/titanic_train.csv")
df_eval = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/titanic_eval.csv")
# Explanatory variables (features)
FEATURE_COLS=[
'Age'
, 'Fare'
, 'SameTicketCnt'
, 'Pclass_str_1'
, 'Pclass_str_3'
, 'Sex_female'
, 'Embarked_Q'
, 'Embarked_S'
]
X_train = df_train[FEATURE_COLS] # explanatory variables (train)
Y_train = df_train["Survived"] # target variable (train)
# https://supervised.mljar.com/api/
# Build the model with mljar-supervised
from supervised.automl import AutoML
automl = AutoML(mode="Compete", random_state=100)
# Fit the AutoML pipeline
automl.fit(X_train,Y_train)
AutoML directory: AutoML_3 The task is binary_classification with evaluation metric logloss AutoML will use algorithms: ['Decision Tree', 'Linear', 'Random Forest', 'Extra Trees', 'LightGBM', 'Xgboost', 'CatBoost', 'Neural Network', 'Nearest Neighbors'] AutoML will stack models AutoML will ensemble available models AutoML steps: ['adjust_validation', 'simple_algorithms', 'default_algorithms', 'not_so_random', 'golden_features', 'kmeans_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'boost_on_errors', 'ensemble', 'stack', 'ensemble_stacked'] * Step adjust_validation will try to check up to 1 model 1_DecisionTree logloss 0.643133 trained in 1.21 seconds Adjust validation. Remove: 1_DecisionTree Validation strategy: 10-fold CV Shuffle,Stratify * Step simple_algorithms will try to check up to 4 models 1_DecisionTree logloss 0.535371 trained in 2.87 seconds 2_DecisionTree logloss 0.46238 trained in 2.78 seconds 3_DecisionTree logloss 0.46392 trained in 2.76 seconds 4_Linear logloss 0.452554 trained in 7.67 seconds * Step default_algorithms will try to check up to 7 models 5_Default_LightGBM logloss 0.402072 trained in 5.66 seconds 6_Default_Xgboost logloss 0.400937 trained in 5.37 seconds 7_Default_CatBoost logloss 0.384351 trained in 5.38 seconds 8_Default_NeuralNetwork logloss 0.490864 trained in 7.85 seconds 9_Default_RandomForest logloss 0.417678 trained in 11.12 seconds 10_Default_ExtraTrees logloss 0.420511 trained in 11.95 seconds 11_Default_NearestNeighbors logloss 0.909295 trained in 4.29 seconds * Step not_so_random will try to check up to 61 models 21_LightGBM logloss 0.398133 trained in 9.02 seconds 12_Xgboost logloss 0.391173 trained in 9.16 seconds 30_CatBoost logloss 0.387318 trained in 6.19 seconds 39_RandomForest logloss 0.407113 trained in 13.39 seconds 48_ExtraTrees logloss 0.411102 trained in 11.25 seconds 57_NeuralNetwork logloss 0.444187 trained in 9.94 seconds 66_NearestNeighbors logloss 0.834326 trained in 5.44 seconds 22_LightGBM logloss 0.386149 trained in 7.45 seconds 13_Xgboost logloss 0.415666 trained in 8.3 seconds 31_CatBoost logloss 0.385593 trained in 9.93 seconds 40_RandomForest logloss 0.406411 trained in 17.76 seconds 49_ExtraTrees logloss 0.414009 trained in 18.35 seconds 58_NeuralNetwork logloss 0.439771 trained in 15.31 seconds 67_NearestNeighbors logloss 1.112462 trained in 7.37 seconds 23_LightGBM logloss 0.406987 trained in 9.88 seconds 14_Xgboost logloss 0.421778 trained in 10.6 seconds 32_CatBoost logloss 0.389089 trained in 15.27 seconds 41_RandomForest logloss 0.411355 trained in 16.84 seconds 50_ExtraTrees logloss 0.410128 trained in 14.09 seconds 59_NeuralNetwork logloss 0.473541 trained in 12.82 seconds 68_NearestNeighbors logloss 1.299103 trained in 8.58 seconds 24_LightGBM logloss 0.402071 trained in 10.49 seconds 15_Xgboost logloss 0.383564 trained in 11.58 seconds 33_CatBoost logloss 0.388264 trained in 11.46 seconds 42_RandomForest logloss 0.411391 trained in 16.71 seconds 51_ExtraTrees logloss 0.430014 trained in 19.28 seconds 60_NeuralNetwork logloss 0.466434 trained in 14.24 seconds 69_NearestNeighbors logloss 1.299103 trained in 9.21 seconds 25_LightGBM logloss 0.400083 trained in 11.1 seconds 16_Xgboost logloss 0.449111 trained in 11.83 seconds 34_CatBoost logloss 0.383657 trained in 13.94 seconds 43_RandomForest logloss 0.419977 trained in 19.14 seconds 52_ExtraTrees logloss 0.433074 trained in 17.28 seconds 61_NeuralNetwork logloss 0.509227 trained in 15.43 seconds 70_NearestNeighbors 
logloss 1.112462 trained in 10.47 seconds 26_LightGBM logloss 0.386131 trained in 12.3 seconds 17_Xgboost logloss 0.4757 trained in 13.62 seconds 35_CatBoost logloss 0.39084 trained in 14.67 seconds 44_RandomForest logloss 0.407182 trained in 19.22 seconds 53_ExtraTrees logloss 0.416177 trained in 19.04 seconds 62_NeuralNetwork logloss 0.685383 trained in 14.21 seconds 71_NearestNeighbors logloss 1.559218 trained in 11.71 seconds 27_LightGBM logloss 0.402373 trained in 16.53 seconds 18_Xgboost logloss 0.541056 trained in 16.04 seconds 36_CatBoost logloss 0.388274 trained in 18.56 seconds 45_RandomForest logloss 0.411463 trained in 22.94 seconds 54_ExtraTrees logloss 0.414643 trained in 20.77 seconds 63_NeuralNetwork logloss 0.457195 trained in 18.83 seconds 72_NearestNeighbors logloss 1.299103 trained in 13.34 seconds 28_LightGBM logloss 0.396239 trained in 15.29 seconds 19_Xgboost logloss 0.403582 trained in 18.18 seconds 37_CatBoost logloss 0.390126 trained in 16.62 seconds 46_RandomForest logloss 0.405416 trained in 21.1 seconds 55_ExtraTrees logloss 0.398471 trained in 20.1 seconds 64_NeuralNetwork logloss 0.496227 trained in 18.8 seconds 29_LightGBM logloss 0.39758 trained in 16.64 seconds 20_Xgboost logloss 0.473729 trained in 17.73 seconds 38_CatBoost logloss 0.386961 trained in 18.37 seconds 47_RandomForest logloss 0.406615 trained in 27.31 seconds 56_ExtraTrees logloss 0.414916 trained in 23.71 seconds 65_NeuralNetwork logloss 0.45266 trained in 20.0 seconds * Step golden_features will try to check up to 3 models None 10 Add Golden Feature: Pclass_str_3_diff_Sex_female Add Golden Feature: Sex_female_multiply_SameTicketCnt Add Golden Feature: SameTicketCnt_ratio_Sex_female Add Golden Feature: Sex_female_ratio_SameTicketCnt Add Golden Feature: Age_ratio_Sex_female Add Golden Feature: Sex_female_multiply_Age Add Golden Feature: Sex_female_ratio_Age Add Golden Feature: Sex_female_sum_SameTicketCnt Add Golden Feature: Embarked_Q_sum_Sex_female Add Golden Feature: Sex_female_diff_Embarked_S Created 10 Golden Features in 13.04 seconds. 
15_Xgboost_GoldenFeatures logloss 0.38565 trained in 34.62 seconds 34_CatBoost_GoldenFeatures logloss 0.387767 trained in 22.28 seconds 7_Default_CatBoost_GoldenFeatures logloss 0.390365 trained in 18.69 seconds * Step kmeans_features will try to check up to 3 models 15_Xgboost_KMeansFeatures logloss 0.393264 trained in 23.62 seconds 34_CatBoost_KMeansFeatures logloss 0.391255 trained in 39.62 seconds 7_Default_CatBoost_KMeansFeatures logloss 0.392218 trained in 21.07 seconds * Step insert_random_feature will try to check up to 1 model 15_Xgboost_RandomFeature logloss 0.400385 trained in 20.27 seconds Drop features ['random_feature', 'Embarked_S', 'Embarked_Q'] * Step features_selection will try to check up to 6 models 15_Xgboost_SelectedFeatures logloss 0.388467 trained in 20.93 seconds 34_CatBoost_SelectedFeatures logloss 0.388058 trained in 20.91 seconds 26_LightGBM_SelectedFeatures logloss 0.386319 trained in 18.57 seconds 55_ExtraTrees_SelectedFeatures logloss 0.411567 trained in 24.17 seconds 46_RandomForest_SelectedFeatures logloss 0.40437 trained in 25.48 seconds 58_NeuralNetwork_SelectedFeatures logloss 0.428871 trained in 23.06 seconds * Step hill_climbing_1 will try to check up to 31 models 73_Xgboost logloss 0.381747 trained in 20.42 seconds 74_Xgboost logloss 0.385947 trained in 20.49 seconds 75_CatBoost logloss 0.387476 trained in 21.48 seconds 76_CatBoost logloss 0.384692 trained in 20.52 seconds 77_CatBoost logloss 0.385155 trained in 20.94 seconds 78_CatBoost logloss 0.382853 trained in 21.67 seconds 79_CatBoost logloss 0.386026 trained in 26.66 seconds 80_Xgboost_GoldenFeatures logloss 0.389585 trained in 22.6 seconds 81_Xgboost_GoldenFeatures logloss 0.384525 trained in 22.23 seconds 82_LightGBM logloss 0.386149 trained in 21.3 seconds 83_LightGBM logloss 0.380569 trained in 20.84 seconds 84_LightGBM logloss 0.386131 trained in 21.03 seconds 85_LightGBM_SelectedFeatures logloss 0.388061 trained in 22.38 seconds 86_LightGBM_SelectedFeatures logloss 0.38196 trained in 21.28 seconds 87_Xgboost_SelectedFeatures logloss 0.39152 trained in 22.81 seconds 88_Xgboost_SelectedFeatures logloss 0.391144 trained in 22.98 seconds 89_ExtraTrees logloss 0.410087 trained in 27.97 seconds 90_RandomForest_SelectedFeatures logloss 0.398531 trained in 27.54 seconds 91_RandomForest logloss 0.402316 trained in 31.35 seconds 92_RandomForest logloss 0.406411 trained in 30.2 seconds 93_ExtraTrees logloss 0.413942 trained in 28.46 seconds 94_ExtraTrees logloss 0.409957 trained in 29.7 seconds 95_ExtraTrees logloss 0.411102 trained in 27.93 seconds 96_NeuralNetwork_SelectedFeatures logloss 0.434763 trained in 27.42 seconds 97_NeuralNetwork_SelectedFeatures logloss 0.43951 trained in 25.95 seconds 98_NeuralNetwork logloss 0.4356 trained in 26.6 seconds 99_DecisionTree logloss 0.467542 trained in 23.15 seconds 100_DecisionTree logloss 0.646404 trained in 22.99 seconds 101_DecisionTree logloss 0.444153 trained in 23.72 seconds 102_DecisionTree logloss 0.444153 trained in 23.86 seconds 103_NearestNeighbors logloss 1.200042 trained in 23.7 seconds * Step hill_climbing_2 will try to check up to 12 models 104_LightGBM logloss 0.38445 trained in 24.67 seconds 105_Xgboost logloss 0.391009 trained in 27.53 seconds 106_LightGBM_SelectedFeatures logloss 0.389836 trained in 25.0 seconds 107_CatBoost logloss 0.388503 trained in 26.52 seconds 108_Xgboost logloss 0.388531 trained in 27.58 seconds 109_Xgboost_GoldenFeatures logloss 0.387964 trained in 27.87 seconds 110_LightGBM logloss 0.383974 trained in 31.19 
seconds 111_RandomForest_SelectedFeatures logloss 0.396759 trained in 37.62 seconds 112_RandomForest logloss 0.397621 trained in 35.58 seconds 113_ExtraTrees logloss 0.421114 trained in 32.88 seconds 114_ExtraTrees logloss 0.402344 trained in 32.32 seconds 115_NeuralNetwork logloss 0.4562 trained in 40.35 seconds * Step boost_on_errors will try to check up to 1 model 83_LightGBM_BoostOnErrors logloss 0.390312 trained in 34.35 seconds * Step ensemble will try to check up to 1 model Ensemble logloss 0.370908 trained in 77.27 seconds * Step stack will try to check up to 60 models 83_LightGBM_Stacked logloss 0.366533 trained in 29.04 seconds 73_Xgboost_Stacked logloss 0.368775 trained in 35.0 seconds 78_CatBoost_Stacked logloss 0.363835 trained in 65.46 seconds 111_RandomForest_SelectedFeatures_Stacked logloss 0.378099 trained in 49.06 seconds 55_ExtraTrees_Stacked logloss 0.365257 trained in 38.91 seconds 58_NeuralNetwork_SelectedFeatures_Stacked logloss 0.415738 trained in 33.64 seconds 86_LightGBM_SelectedFeatures_Stacked logloss 0.366778 trained in 30.07 seconds 15_Xgboost_Stacked logloss 0.370325 trained in 36.88 seconds 34_CatBoost_Stacked logloss 0.364966 trained in 143.69 seconds 112_RandomForest_Stacked logloss 0.369609 trained in 49.01 seconds 114_ExtraTrees_Stacked logloss 0.365528 trained in 39.8 seconds 96_NeuralNetwork_SelectedFeatures_Stacked logloss 0.444507 trained in 792.28 seconds * Step ensemble_stacked will try to check up to 1 model Ensemble_Stacked logloss 0.355152 trained in 107.36 seconds AutoML fit time: 4292.47 seconds AutoML best model: Ensemble_Stacked
It tries out all sorts of models for us.
The run finished with "AutoML best model: Ensemble_Stacked".
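To dig into the run, mljar-supervised can summarize every model it trained. A minimal sketch (report() renders in a Jupyter notebook; a finished run can also be reloaded from the AutoML_3 results directory named in the log above):

# Summary report of all trained models (renders in a notebook)
automl.report()
# Reload a finished run from its results directory
from supervised.automl import AutoML
automl_loaded = AutoML(results_path="AutoML_3")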
df_eval["Survived"] = automl.predict(df_eval[FEATURE_COLS])
df_eval[["PassengerId","Survived"]].to_csv("titanic_submission.csv",index=False)
!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c titanic -f titanic_submission.csv -m "model #011. mljar パターン1"
0.75119
For some reason, the accuracy came out lower than when using only part of the training data.
AutoGluon
import pandas as pd
import numpy as np
# Load the Titanic training data and the evaluation (submission) data
df_train = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/titanic_train.csv")
df_eval = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/titanic_eval.csv")
FEATURE_COLS=[
'Age'
, 'Fare'
, 'SameTicketCnt'
, 'Pclass_str_1'
, 'Pclass_str_3'
, 'Sex_female'
, 'Embarked_Q'
, 'Embarked_S'
, 'Survived'
]
X_train = df_train[FEATURE_COLS] # features plus the target (AutoGluon expects the label column inside the training frame)
# Build the AutoGluon model (10-minute time limit)
from autogluon.tabular import TabularPredictor
predictor = TabularPredictor(label="Survived", problem_type="binary",path="RESULT_AUTOGLUON").fit(X_train, time_limit = 600)
Beginning AutoGluon training ... Time limit = 600s AutoGluon will save models to "RESULT_AUTOGLUON/" AutoGluon Version: 0.4.2 Python Version: 3.8.13 Operating System: Darwin Train Data Rows: 891 Train Data Columns: 8 Label Column: Survived Preprocessing data ... Selected class <--> label mapping: class 1 = 1, class 0 = 0 Using Feature Generators to preprocess the data ... Fitting AutoMLPipelineFeatureGenerator... Available Memory: 11537.1 MB Train Data (Original) Memory Usage: 0.06 MB (0.0% of available memory) Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features. Stage 1 Generators: Fitting AsTypeFeatureGenerator... Note: Converting 5 features to boolean dtype as they only contain 2 unique values. Stage 2 Generators: Fitting FillNaFeatureGenerator... Stage 3 Generators: Fitting IdentityFeatureGenerator... Stage 4 Generators: Fitting DropUniqueFeatureGenerator... Types of features in original data (raw dtype, special dtypes): ('float', []) : 7 | ['Age', 'Fare', 'Pclass_str_1', 'Pclass_str_3', 'Sex_female', ...] ('int', []) : 1 | ['SameTicketCnt'] Types of features in processed data (raw dtype, special dtypes): ('float', []) : 2 | ['Age', 'Fare'] ('int', []) : 1 | ['SameTicketCnt'] ('int', ['bool']) : 5 | ['Pclass_str_1', 'Pclass_str_3', 'Sex_female', 'Embarked_Q', 'Embarked_S'] 0.1s = Fit runtime 8 features in original data used to generate 8 features in processed data. Train Data (Processed) Memory Usage: 0.03 MB (0.0% of available memory) Data preprocessing and feature engineering runtime = 0.12s ... AutoGluon will gauge predictive performance using evaluation metric: 'accuracy' To change this, specify the eval_metric parameter of Predictor() Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 712, Val Rows: 179 Fitting 13 L1 models ... Fitting model: KNeighborsUnif ... Training model for up to 599.88s of the 599.88s of remaining time. 0.6592 = Validation score (accuracy) 0.02s = Training runtime 0.04s = Validation runtime Fitting model: KNeighborsDist ... Training model for up to 599.8s of the 599.8s of remaining time. 0.6704 = Validation score (accuracy) 0.01s = Training runtime 0.01s = Validation runtime Fitting model: LightGBMXT ... Training model for up to 599.77s of the 599.77s of remaining time. 0.8212 = Validation score (accuracy) 2.65s = Training runtime 0.01s = Validation runtime Fitting model: LightGBM ... Training model for up to 597.1s of the 597.1s of remaining time. 0.838 = Validation score (accuracy) 0.4s = Training runtime 0.01s = Validation runtime Fitting model: RandomForestGini ... Training model for up to 596.68s of the 596.68s of remaining time. 0.7989 = Validation score (accuracy) 1.13s = Training runtime 0.08s = Validation runtime Fitting model: RandomForestEntr ... Training model for up to 595.42s of the 595.42s of remaining time. 0.7989 = Validation score (accuracy) 0.8s = Training runtime 0.09s = Validation runtime Fitting model: CatBoost ... Training model for up to 594.48s of the 594.47s of remaining time. 0.8547 = Validation score (accuracy) 1.43s = Training runtime 0.0s = Validation runtime Fitting model: ExtraTreesGini ... Training model for up to 593.03s of the 593.03s of remaining time. 0.7877 = Validation score (accuracy) 0.78s = Training runtime 0.08s = Validation runtime Fitting model: ExtraTreesEntr ... Training model for up to 592.11s of the 592.11s of remaining time. 
0.7821 = Validation score (accuracy) 0.78s = Training runtime 0.12s = Validation runtime Fitting model: NeuralNetFastAI ... Training model for up to 591.16s of the 591.15s of remaining time. 0.8324 = Validation score (accuracy) 5.45s = Training runtime 0.02s = Validation runtime Fitting model: XGBoost ... Training model for up to 585.66s of the 585.66s of remaining time. 0.8436 = Validation score (accuracy) 0.59s = Training runtime 0.01s = Validation runtime Fitting model: NeuralNetTorch ... Training model for up to 585.04s of the 585.04s of remaining time. 0.8045 = Validation score (accuracy) 3.17s = Training runtime 0.02s = Validation runtime Fitting model: LightGBMLarge ... Training model for up to 581.85s of the 581.84s of remaining time. 0.8324 = Validation score (accuracy) 0.67s = Training runtime 0.01s = Validation runtime Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 580.44s of remaining time. 0.8603 = Validation score (accuracy) 0.74s = Training runtime 0.0s = Validation runtime AutoGluon training complete, total runtime = 20.39s ... Best model: "WeightedEnsemble_L2" TabularPredictor saved. To load, use: predictor = TabularPredictor.load("RESULT_AUTOGLUON/")
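Before predicting, it is worth looking at how each model scored on the validation split; a minimal sketch using AutoGluon's leaderboard():

# Ranked validation accuracy of every trained model
print(predictor.leaderboard(silent=True))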
df_eval["Survived"] = predictor.predict(df_eval)
df_eval[["PassengerId","Survived"]].to_csv("titanic_submission.csv",index=False)
!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c titanic -f titanic_submission.csv -m "model #011. autogluon パターン1"
0.76076
auto-sklearn
import pandas as pd
import numpy as np
# Load the Titanic training data and the evaluation (submission) data
df_train = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/titanic_train.csv")
df_eval = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/titanic_eval.csv")
FEATURE_COLS=[
'Age'
, 'Fare'
, 'SameTicketCnt'
, 'Pclass_str_1'
, 'Pclass_str_3'
, 'Sex_female'
, 'Embarked_Q'
, 'Embarked_S'
]
X_train = df_train[FEATURE_COLS] # explanatory variables (train)
Y_train = df_train["Survived"] # target variable (train)
# Build the auto-sklearn model (default settings)
import autosklearn.classification
cls = autosklearn.classification.AutoSklearnClassifier()
cls.fit(X_train, Y_train)
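auto-sklearn is fairly quiet by default, but it can summarize what it searched; a minimal sketch using its documented helpers:

# Brief statistics of the search (runs, best validation score, timeouts)
print(cls.sprint_statistics())
# The models that made it into the final ensemble
print(cls.show_models())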
df_eval["Survived"] = cls.predict(df_eval[FEATURE_COLS])
df_eval[["PassengerId","Survived"]].to_csv("titanic_submission.csv",index=False)
!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c titanic -f titanic_submission.csv -m "model #011. autosklearn パターン1"
0.76555
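As an aside, AutoSklearnClassifier runs for a fixed time budget; both budgets (in seconds) can be shortened via constructor arguments if you want a faster experiment. A minimal sketch:

import autosklearn.classification
# e.g. 10-minute total budget, at most 60 seconds per candidate model
cls_fast = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=600, per_run_time_limit=60)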
Pattern 1 summary
Accuracy ended up dropping across the board.
Does this mean the training data contains noise, or that the data preparation still needs more thought?
Pattern 2
- Modeling with all of the raw data, with no missing-value imputation, variable selection, or feature engineering at all
Based on the hypothesis that just handing everything to AutoML might give better accuracy, let's model on the data as-is.
mljar
import pandas as pd
import numpy as np
# Load the Titanic training data and the evaluation (submission) data
df_train = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/train.csv")
df_eval = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/test.csv")
X_train = df_train.drop("Survived",axis=1) # explanatory variables (train)
Y_train = df_train["Survived"] # target variable (train)
# https://supervised.mljar.com/api/
# Build the model with mljar-supervised
from supervised.automl import AutoML
automl = AutoML(mode="Compete", random_state=100)
automl.fit(X_train,Y_train)
AutoML directory: AutoML_4 The task is binary_classification with evaluation metric logloss AutoML will use algorithms: ['Decision Tree', 'Linear', 'Random Forest', 'Extra Trees', 'LightGBM', 'Xgboost', 'CatBoost', 'Neural Network', 'Nearest Neighbors'] AutoML will stack models AutoML will ensemble available models AutoML steps: ['adjust_validation', 'simple_algorithms', 'default_algorithms', 'not_so_random', 'mix_encoding', 'golden_features', 'kmeans_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'boost_on_errors', 'ensemble', 'stack', 'ensemble_stacked'] * Step adjust_validation will try to check up to 1 model 1_DecisionTree logloss 0.461027 trained in 2.64 seconds Adjust validation. Remove: 1_DecisionTree Validation strategy: 10-fold CV Shuffle,Stratify * Step simple_algorithms will try to check up to 4 models 1_DecisionTree logloss 0.588967 trained in 12.97 seconds 2_DecisionTree logloss 0.42534 trained in 11.45 seconds 3_DecisionTree logloss 0.457723 trained in 11.2 seconds 4_Linear logloss 0.526347 trained in 23.32 seconds * Step default_algorithms will try to check up to 7 models 5_Default_LightGBM logloss 0.405131 trained in 14.45 seconds 6_Default_Xgboost logloss 0.403659 trained in 14.77 seconds 7_Default_CatBoost logloss 0.395797 trained in 18.35 seconds 8_Default_NeuralNetwork logloss 0.76738 trained in 20.43 seconds 9_Default_RandomForest logloss 0.398352 trained in 24.21 seconds 10_Default_ExtraTrees logloss 0.398378 trained in 26.34 seconds 11_Default_NearestNeighbors logloss 1.098024 trained in 16.76 seconds * Step not_so_random will try to check up to 61 models 21_LightGBM logloss 0.399716 trained in 14.63 seconds 12_Xgboost logloss 0.406368 trained in 16.04 seconds 30_CatBoost logloss 0.40208 trained in 17.71 seconds 39_RandomForest logloss 0.404127 trained in 32.21 seconds 48_ExtraTrees logloss 0.403249 trained in 27.43 seconds 57_NeuralNetwork logloss 0.584528 trained in 21.8 seconds 66_NearestNeighbors logloss 0.854819 trained in 18.14 seconds 22_LightGBM logloss 0.410192 trained in 15.57 seconds 13_Xgboost logloss 0.408691 trained in 17.36 seconds 31_CatBoost logloss 0.395837 trained in 29.12 seconds 40_RandomForest logloss 0.398086 trained in 33.49 seconds 49_ExtraTrees logloss 0.403386 trained in 37.29 seconds 58_NeuralNetwork logloss 0.701188 trained in 34.28 seconds 67_NearestNeighbors logloss 0.847123 trained in 25.93 seconds 23_LightGBM logloss 0.425954 trained in 20.98 seconds 14_Xgboost logloss 0.413054 trained in 27.48 seconds 32_CatBoost logloss 0.398507 trained in 61.67 seconds 41_RandomForest logloss 0.400717 trained in 34.67 seconds 50_ExtraTrees logloss 0.390717 trained in 29.41 seconds 59_NeuralNetwork logloss 0.850408 trained in 24.97 seconds 68_NearestNeighbors logloss 1.652493 trained in 21.87 seconds 24_LightGBM logloss 0.402971 trained in 18.79 seconds 15_Xgboost logloss 0.39368 trained in 20.78 seconds 33_CatBoost logloss 0.402891 trained in 27.12 seconds 42_RandomForest logloss 0.398977 trained in 29.68 seconds 51_ExtraTrees logloss 0.406206 trained in 28.03 seconds 60_NeuralNetwork logloss 0.976241 trained in 29.27 seconds 69_NearestNeighbors logloss 1.652493 trained in 23.4 seconds 25_LightGBM logloss 0.400252 trained in 21.61 seconds 16_Xgboost logloss 0.439921 trained in 24.78 seconds 34_CatBoost logloss 0.403254 trained in 37.62 seconds 43_RandomForest logloss 0.403334 trained in 36.33 seconds 52_ExtraTrees logloss 0.409987 trained in 34.3 seconds 61_NeuralNetwork logloss 0.911824 trained in 
30.49 seconds 70_NearestNeighbors logloss 0.847123 trained in 24.23 seconds 26_LightGBM logloss 0.409352 trained in 21.49 seconds 17_Xgboost logloss 0.46196 trained in 23.23 seconds 35_CatBoost logloss 0.403134 trained in 36.35 seconds 44_RandomForest logloss 0.399452 trained in 35.27 seconds 53_ExtraTrees logloss 0.407137 trained in 35.21 seconds 62_NeuralNetwork logloss 0.960051 trained in 30.22 seconds 71_NearestNeighbors logloss 1.653601 trained in 24.74 seconds 27_LightGBM logloss 0.406186 trained in 26.0 seconds 18_Xgboost logloss 0.475936 trained in 24.12 seconds 36_CatBoost logloss 0.395098 trained in 38.38 seconds 45_RandomForest logloss 0.401126 trained in 38.59 seconds 54_ExtraTrees logloss 0.418287 trained in 34.39 seconds 63_NeuralNetwork logloss 0.67047 trained in 33.34 seconds 72_NearestNeighbors logloss 1.652493 trained in 28.84 seconds 28_LightGBM logloss 0.397014 trained in 26.25 seconds * Step mix_encoding will try to check up to 1 model 15_Xgboost_categorical_mix logloss 0.394184 trained in 30.49 seconds * Step golden_features will try to check up to 3 models None 10 Add Golden Feature: Parch_sum_SibSp Add Golden Feature: SibSp_sum_Pclass Add Golden Feature: SibSp_ratio_Parch Add Golden Feature: Pclass_diff_Parch Add Golden Feature: SibSp_ratio_Fare Add Golden Feature: SibSp_multiply_Pclass Add Golden Feature: Parch_multiply_SibSp Add Golden Feature: Parch_ratio_SibSp Add Golden Feature: SibSp_diff_Parch Add Golden Feature: Parch_multiply_Pclass Created 10 Golden Features in 13.19 seconds. 50_ExtraTrees_GoldenFeatures logloss 0.39349 trained in 56.81 seconds 15_Xgboost_GoldenFeatures logloss 0.39824 trained in 30.44 seconds 15_Xgboost_categorical_mix_GoldenFeatures logloss 0.396049 trained in 30.37 seconds * Step kmeans_features will try to check up to 3 models 50_ExtraTrees_KMeansFeatures logloss 0.391718 trained in 42.41 seconds 15_Xgboost_KMeansFeatures logloss 0.405067 trained in 33.91 seconds 15_Xgboost_categorical_mix_KMeansFeatures logloss 0.402308 trained in 35.31 seconds * Step insert_random_feature will try to check up to 1 model 50_ExtraTrees_RandomFeature logloss 0.398504 trained in 137.61 seconds Drop features ['Ticket_1601', 'Ticket_113781', 'Age', 'Ticket_347082', 'Name_rev', 'Ticket_2666', 'Ticket_ston', 'Name_dr', 'Name_robert', 'random_feature', 'Ticket_17755', 'Ticket_ca', 'Ticket_29106', 'Name_joseph', 'Name_william', 'Ticket_2343', 'Name_leonard', 'Name_kate', 'Name_peter', 'Name_johan', 'Name_skoog', 'Ticket_347088', 'Ticket_113760', 'Name_edward', 'Ticket_11767', 'Name_emily', 'Name_ivan', 'Name_martha', 'Ticket_17569', 'Ticket_237736', 'Name_palsson', 'Name_viktor', 'Ticket_2699', 'Ticket_347080', 'Name_arnold', 'Ticket_28403', 'Name_carl', 'Ticket_250647', 'Ticket_248738', 'Ticket_36947', 'Ticket_367230', 'Ticket_370129', 'Ticket_19943', 'Ticket_2668', 'Name_baclini', 'Name_hart', 'Name_van', 'Ticket_31921', 'Ticket_367226', 'Name_catherine', 'Ticket_364516', 'Ticket_110152', 'Ticket_34651', 'Ticket_36973', 'Ticket_220845', 'Name_marion', 'Ticket_230433', 'Ticket_2659', 'Name_vander', 'Name_nils', 'Ticket_349909', 'Ticket_345773', 'Name_daniel', 'Name_fortune', 'Name_walter', 'Ticket_36928', 'Ticket_17604', 'Ticket_17611', 'Ticket_17593', 'Ticket_19928', 'Ticket_17474', 'Name_stanley', 'Name_harry', 'Ticket_2691', 'Ticket_110413', 'Name_jane', 'Name_louise', 'Ticket_371110', 'Ticket_48871', 'Name_karl', 'Name_david', 'Ticket_347742', 'Name_hugh', 'Name_lefebre', 'Ticket_paris', 'Ticket_2678', 'Ticket_13529', 'Name_elias', 'Name_martin', 
'Name_frank', 'Ticket_o2', 'Ticket_2653', 'Ticket_2665', 'Ticket_370365', 'Ticket_19996', 'Name_bertram', 'Ticket_24160', 'Name_gustaf', 'Name_ellen', 'Name_richards', 'Name_charles', 'Name_panula', 'Ticket_sc', 'Name_alice', 'Ticket_244252', 'Name_elsie', 'Name_matilda', 'Ticket_17558', 'Ticket_244367', 'Name_thayer', 'Ticket_349237', 'Ticket_2651', 'Ticket_2661', 'Ticket_2315', 'Name_hansen', 'Name_brown', 'Ticket_248727', 'Ticket_363291', 'Ticket_14879', 'Ticket_6608', 'Ticket_239853', 'Ticket_19950', 'Ticket_17582', 'Ticket_31027', 'Ticket_3336', 'Ticket_3101279', 'Name_boulos', 'Name_benjamin', 'Ticket_ah', 'Ticket_2079', 'Ticket_250655', 'Ticket_19877', 'Ticket_2673', 'Ticket_17761', 'Ticket_230136', 'Ticket_26360', 'Ticket_382652', 'Ticket_17572', 'Name_williams', 'Ticket_soton', 'Name_alfred', 'Name_sage', 'Ticket_347077', 'Name_sofia', 'Ticket_2123', 'Ticket_line', 'Ticket_17760', 'Ticket_9549', 'Name_kelly', 'Ticket_12749', 'Name_ernst', 'Name_johansson', 'Name_goodwin', 'Ticket_113789', 'Name_hans', 'Name_rice', 'Name_marie', 'Ticket_113505', 'Ticket_2144', 'Ticket_751', 'Ticket_6607', 'Name_norman', 'Ticket_17757', 'Ticket_37671', 'Name_percival', 'Ticket_392096', 'Ticket_54636', 'Ticket_17421', 'Name_francis', 'Name_victor', 'Name_augusta', 'Name_august', 'Ticket_230080', 'Name_jr', 'Name_samuel', 'Name_albert', 'Ticket_35273', 'Ticket_17758', 'Name_johnson', 'Name_alexander', 'Ticket_3101295', 'Ticket_oq', 'Name_margaret', 'Name_olsen', 'Name_bertha', 'Ticket_358585', 'Name_jensen', 'Name_elisabeth', 'Ticket_29750', 'Ticket_17485', 'Name_asplund', 'Ticket_243847', 'Ticket_17477', 'Name_ada', 'Ticket_250649', 'Ticket_3381', 'Ticket_33112', 'Ticket_239865', 'Ticket_13502', 'Name_thomas', 'Ticket_35281', 'Ticket_3101278', 'Ticket_347054', 'Name_florence', 'Name_patrick', 'Name_anne', 'Name_richard', 'Ticket_345764', 'Name_harper', 'Name_edith', 'Name_gustafsson', 'Name_hanna', 'Name_smith', 'Ticket_364849', 'Ticket_7534', 'Ticket_111361', 'Ticket_2627', 'Ticket_250644', 'Ticket_231919', 'Name_emil', 'Name_katherine', 'Name_andrew', 'Ticket_17608', 'Name_sidney', 'Ticket_2908', 'Ticket_376564', 'Name_oskar', 'Name_bourke', 'Ticket_17453', 'Ticket_113572', 'Ticket_113803', 'Ticket_110465', 'Name_douglas', 'Name_ford', 'Name_henry', 'Ticket_4133', 'Name_anna', 'Name_john', 'Name_frederick', 'Name_carter', 'Name_ernest', 'Name_harris', 'Name_james', 'Name_helen', 'Name_arthur', 'Name_annie', 'Name_elizabeth', 'Name_mary', 'Name_george', 'Ticket_pp', 'Ticket_pc', 'Parch', 'Name_andersson', 'Name_maria'] * Step features_selection will try to check up to 6 models 50_ExtraTrees_SelectedFeatures logloss 0.396132 trained in 35.84 seconds 15_Xgboost_SelectedFeatures logloss 0.404208 trained in 34.39 seconds 36_CatBoost_SelectedFeatures logloss 0.402618 trained in 41.61 seconds 28_LightGBM_SelectedFeatures logloss 0.406602 trained in 40.25 seconds 40_RandomForest_SelectedFeatures logloss 0.406838 trained in 41.95 seconds 57_NeuralNetwork_SelectedFeatures logloss 0.425396 trained in 33.05 seconds * Step hill_climbing_1 will try to check up to 32 models 73_ExtraTrees logloss 0.390712 trained in 41.55 seconds 74_ExtraTrees logloss 0.401999 trained in 41.04 seconds 75_ExtraTrees logloss 0.38248 trained in 40.55 seconds 76_ExtraTrees logloss 0.400482 trained in 41.51 seconds 77_ExtraTrees_GoldenFeatures logloss 0.387611 trained in 40.97 seconds 78_ExtraTrees_GoldenFeatures logloss 0.402254 trained in 47.96 seconds * Step hill_climbing_2 will try to check up to 30 models 79_ExtraTrees logloss 
0.384233 trained in 45.27 seconds 80_ExtraTrees_GoldenFeatures logloss 0.386553 trained in 42.1 seconds 81_ExtraTrees logloss 0.391268 trained in 39.77 seconds 82_Xgboost logloss 0.394635 trained in 33.64 seconds 83_Xgboost logloss 0.391803 trained in 34.54 seconds 84_Xgboost logloss 0.39389 trained in 34.75 seconds 85_Xgboost logloss 0.392496 trained in 34.79 seconds * Step boost_on_errors will try to check up to 1 model 75_ExtraTrees_BoostOnErrors not trained. Force to stop the training. Total time for AutoML training already exceeded. * Step ensemble will try to check up to 1 model Ensemble logloss 0.373248 trained in 5148.8 seconds Skip stack because no parameters were generated. Skip ensemble_stacked because no parameters were generated. AutoML fit time: 32704.8 seconds AutoML best model: Ensemble
df_eval["Survived"] = automl.predict(df_eval)
df_eval[["PassengerId","Survived"]].to_csv("titanic_submission.csv",index=False)
!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c titanic -f titanic_submission.csv -m "model #011. mljar パターン2"
0.79186
Oh, the accuracy went up.
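One caveat: the log shows this run took about nine hours (AutoML fit time: 32704.8 seconds) and one step was force-stopped for exceeding the time budget. If you want to cap the search, AutoML accepts a total_time_limit argument in seconds; a minimal sketch:

from supervised.automl import AutoML
# Cap the entire Compete-mode search at one hour
automl_capped = AutoML(mode="Compete", total_time_limit=3600, random_state=100)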
AutoGluon
import pandas as pd
import numpy as np
# Load the Titanic training data and the evaluation (submission) data
df_train = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/train.csv")
df_eval = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/test.csv")
X_train = df_train # the entire raw training frame, including the label column
# Build the AutoGluon model (10-minute time limit)
from autogluon.tabular import TabularPredictor
predictor = TabularPredictor(label="Survived", problem_type="binary",path="RESULT_AUTOGLUON").fit(X_train, time_limit = 600)
Beginning AutoGluon training ... Time limit = 600s AutoGluon will save models to "RESULT_AUTOGLUON/" AutoGluon Version: 0.4.2 Python Version: 3.8.13 Operating System: Darwin Train Data Rows: 891 Train Data Columns: 11 Label Column: Survived Preprocessing data ... Selected class <--> label mapping: class 1 = 1, class 0 = 0 Using Feature Generators to preprocess the data ... Fitting AutoMLPipelineFeatureGenerator... Available Memory: 10859.32 MB Train Data (Original) Memory Usage: 0.31 MB (0.0% of available memory) Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features. Stage 1 Generators: Fitting AsTypeFeatureGenerator... Note: Converting 1 features to boolean dtype as they only contain 2 unique values. Stage 2 Generators: Fitting FillNaFeatureGenerator... Stage 3 Generators: Fitting IdentityFeatureGenerator... Fitting CategoryFeatureGenerator... Fitting CategoryMemoryMinimizeFeatureGenerator... Fitting TextSpecialFeatureGenerator... Fitting BinnedFeatureGenerator... Fitting DropDuplicatesFeatureGenerator... Fitting TextNgramFeatureGenerator... Fitting CountVectorizer for text features: ['Name'] CountVectorizer fit with vocabulary size = 8 Stage 4 Generators: Fitting DropUniqueFeatureGenerator... Types of features in original data (raw dtype, special dtypes): ('float', []) : 2 | ['Age', 'Fare'] ('int', []) : 4 | ['PassengerId', 'Pclass', 'SibSp', 'Parch'] ('object', []) : 4 | ['Sex', 'Ticket', 'Cabin', 'Embarked'] ('object', ['text']) : 1 | ['Name'] Types of features in processed data (raw dtype, special dtypes): ('category', []) : 3 | ['Ticket', 'Cabin', 'Embarked'] ('float', []) : 2 | ['Age', 'Fare'] ('int', []) : 4 | ['PassengerId', 'Pclass', 'SibSp', 'Parch'] ('int', ['binned', 'text_special']) : 9 | ['Name.char_count', 'Name.word_count', 'Name.capital_ratio', 'Name.lower_ratio', 'Name.special_ratio', ...] ('int', ['bool']) : 1 | ['Sex'] ('int', ['text_ngram']) : 9 | ['__nlp__.henry', '__nlp__.john', '__nlp__.master', '__nlp__.miss', '__nlp__.mr', ...] 0.8s = Fit runtime 11 features in original data used to generate 28 features in processed data. Train Data (Processed) Memory Usage: 0.07 MB (0.0% of available memory) Data preprocessing and feature engineering runtime = 0.92s ... AutoGluon will gauge predictive performance using evaluation metric: 'accuracy' To change this, specify the eval_metric parameter of Predictor() Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 712, Val Rows: 179 Fitting 13 L1 models ... Fitting model: KNeighborsUnif ... Training model for up to 599.07s of the 599.07s of remaining time. 0.6536 = Validation score (accuracy) 0.12s = Training runtime 0.05s = Validation runtime Fitting model: KNeighborsDist ... Training model for up to 598.87s of the 598.87s of remaining time. 0.6536 = Validation score (accuracy) 0.07s = Training runtime 0.03s = Validation runtime Fitting model: LightGBMXT ... Training model for up to 598.75s of the 598.74s of remaining time. 0.8156 = Validation score (accuracy) 2.66s = Training runtime 0.02s = Validation runtime Fitting model: LightGBM ... Training model for up to 596.04s of the 596.04s of remaining time. 0.8212 = Validation score (accuracy) 0.53s = Training runtime 0.02s = Validation runtime Fitting model: RandomForestGini ... Training model for up to 595.48s of the 595.47s of remaining time. 0.8156 = Validation score (accuracy) 1.31s = Training runtime 0.1s = Validation runtime Fitting model: RandomForestEntr ... 
Training model for up to 594.01s of the 594.01s of remaining time. 0.8156 = Validation score (accuracy) 0.96s = Training runtime 0.13s = Validation runtime Fitting model: CatBoost ... Training model for up to 592.86s of the 592.85s of remaining time. 0.8268 = Validation score (accuracy) 1.71s = Training runtime 0.02s = Validation runtime Fitting model: ExtraTreesGini ... Training model for up to 591.12s of the 591.12s of remaining time. 0.8101 = Validation score (accuracy) 1.03s = Training runtime 0.11s = Validation runtime Fitting model: ExtraTreesEntr ... Training model for up to 589.93s of the 589.92s of remaining time. 0.8101 = Validation score (accuracy) 1.01s = Training runtime 0.11s = Validation runtime Fitting model: NeuralNetFastAI ... Training model for up to 588.73s of the 588.72s of remaining time. No improvement since epoch 9: early stopping 0.8268 = Validation score (accuracy) 7.76s = Training runtime 0.04s = Validation runtime Fitting model: XGBoost ... Training model for up to 580.89s of the 580.88s of remaining time. 0.8101 = Validation score (accuracy) 0.8s = Training runtime 0.02s = Validation runtime Fitting model: NeuralNetTorch ... Training model for up to 580.05s of the 580.04s of remaining time. 0.8492 = Validation score (accuracy) 8.53s = Training runtime 0.04s = Validation runtime Fitting model: LightGBMLarge ... Training model for up to 571.47s of the 571.47s of remaining time. 0.8324 = Validation score (accuracy) 1.61s = Training runtime 0.02s = Validation runtime Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 568.72s of remaining time. 0.8603 = Validation score (accuracy) 0.89s = Training runtime 0.0s = Validation runtime AutoGluon training complete, total runtime = 32.25s ... Best model: "WeightedEnsemble_L2" TabularPredictor saved. To load, use: predictor = TabularPredictor.load("RESULT_AUTOGLUON/")
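Since AutoGluon did its own feature engineering here (boolean Sex, category-encoded Ticket/Cabin/Embarked, text features from Name), it is interesting to check which raw columns actually mattered; a minimal sketch using the documented feature_importance() helper (it relies on permutation shuffling, so it takes a moment):

# Permutation importance of each raw input column
print(predictor.feature_importance(df_train))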
df_eval["Survived"] = predictor.predict(df_eval)
df_eval[["PassengerId","Survived"]].to_csv("titanic_submission.csv",index=False)
!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c titanic -f titanic_submission.csv -m "model #011. autogluon パターン2"
0.76555
Not bad.
auto-sklearn
import pandas as pd
import numpy as np
# Load the Titanic training data and the evaluation (submission) data
df_train = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/train.csv")
df_eval = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/test.csv")
# Workaround for a ValueError (see the note below): cast object columns to category
cols=df_train.select_dtypes(exclude=['int','float']).columns.to_list()
df_train[cols] = df_train[cols].astype('category')
# Build X and Y
X_train = df_train.drop("Survived",axis=1) # explanatory variables (train)
Y_train = df_train["Survived"] # target variable (train)
# Build the auto-sklearn model
import autosklearn.classification
cls = autosklearn.classification.AutoSklearnClassifier()
cls.fit(X_train, Y_train)
AutoSklearnClassifier(per_run_time_limit=360)
Without the cast, fit fails with:
ValueError: Input Column Name has invalid type object.
Cast it to a valid dtype before using it in Auto-Sklearn. Valid types are numerical, categorical or boolean.
so the object-dtype columns are converted to the category dtype beforehand.
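An alternative to the dtype cast (a sketch based on auto-sklearn's documented fit signature) is to declare the column types explicitly with the feat_type argument:

# Tell auto-sklearn which columns are categorical instead of casting dtypes
feat_type = ["Categorical" if str(X_train[c].dtype) in ("object", "category")
             else "Numerical" for c in X_train.columns]
cls.fit(X_train, Y_train, feat_type=feat_type)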
df_eval["Survived"] = cls.predict(df_eval)
df_eval[["PassengerId","Survived"]].to_csv("titanic_submission.csv",index=False)
!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c titanic -f titanic_submission.csv -m "model #011. autosklearn パターン2"
0.75598
Pattern 2 summary
mljar scored better on the raw, untouched data.
The other two, AutoGluon and auto-sklearn, landed at about the same accuracy with or without the data preparation.
Summary
In Pattern 1, accuracy tended to drop overall.
Perhaps the training data contains noise, or the data preparation needs a bit more work?
In Pattern 2, mljar clearly improved with no preprocessing at all (0.79186 vs. 0.75119 in Pattern 1). For AutoGluon (0.76555 raw vs. 0.76076 preprocessed) and auto-sklearn (0.75598 raw vs. 0.76555 preprocessed), the scores landed in the same range either way, so preprocessing made little difference there.