I have built a variety of models on the Titanic dataset so far.
The models with the best accuracy were all ones built with AutoML.
Until now I had been modeling with only a portion of the training data, after missing-value imputation, variable selection, and feature engineering.
(I split the training data further into train and test sets because I wanted to check the accuracy with a confusion matrix.)
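For reference, here is a minimal sketch of that earlier split-and-evaluate flow. It assumes scikit-learn's train_test_split and confusion_matrix and uses a stand-in classifier; the actual models varied from post to post.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.ensemble import RandomForestClassifier  # stand-in model

df_train = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/titanic_train.csv")
FEATURE_COLS = ['Age', 'Fare', 'SameTicketCnt', 'Pclass_str_1', 'Pclass_str_3',
                'Sex_female', 'Embarked_Q', 'Embarked_S']
# Hold out part of the training data so a confusion matrix can be computed
X_tr, X_te, y_tr, y_te = train_test_split(
    df_train[FEATURE_COLS], df_train["Survived"], test_size=0.3, random_state=100)
model = RandomForestClassifier(random_state=100).fit(X_tr, y_tr)
pred = model.predict(X_te)
print(confusion_matrix(y_te, pred))
print("accuracy:", accuracy_score(y_te, pred))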
So I would like to try the following two patterns and see whether the accuracy improves:
- Modeling with all of the training data after missing-value imputation, variable selection, and feature engineering
- Modeling with all of the raw data, with no missing-value imputation, variable selection, or feature engineering at all
How to set up the AutoML environments on a Mac is summarized in the articles below. Almost everything is installed with pip, so the same commands will likely work on Linux as well.
* Note: anything installed with brew install will need to be replaced with yum or apt.
(MLJAR) Preparing three AutoML environments in Python
(AutoGluon) Preparing three AutoML environments in Python
(auto-sklearn) Preparing three AutoML environments in Python
Evaluation metric
The Titanic competition uses the proportion of passengers whose survival is predicted correctly (accuracy) as the evaluation metric.
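As a quick illustration (a standalone sketch, not part of the submission flow), accuracy is simply the share of correct predictions:

from sklearn.metrics import accuracy_score
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
print(accuracy_score(y_true, y_pred))  # 3 of 4 correct -> 0.75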
Pattern 1
Let's rebuild the models using all of the training data that went through missing-value imputation, variable selection, and feature engineering.
Since the amount of training data increases, I want to check whether the accuracy goes up.
mljar
import pandas as pd
import numpy as np
# Load the Titanic training data and the evaluation (submission) data
df_train = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/titanic_train.csv")
df_eval = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/titanic_eval.csv")
# Explanatory variables (features)
FEATURE_COLS=[
'Age'
, 'Fare'
, 'SameTicketCnt'
, 'Pclass_str_1'
, 'Pclass_str_3'
, 'Sex_female'
, 'Embarked_Q'
, 'Embarked_S'
]
X_train = df_train[FEATURE_COLS] # explanatory variables (train)
Y_train = df_train["Survived"] # target variable (train)
# https://supervised.mljar.com/api/
# Build the model with mljar-supervised
from supervised.automl import AutoML
automl = AutoML(mode="Compete", random_state=100)
# Fit the AutoML pipeline
automl.fit(X_train,Y_train)
AutoML directory: AutoML_3 The task is binary_classification with evaluation metric logloss AutoML will use algorithms: ['Decision Tree', 'Linear', 'Random Forest', 'Extra Trees', 'LightGBM', 'Xgboost', 'CatBoost', 'Neural Network', 'Nearest Neighbors'] AutoML will stack models AutoML will ensemble available models AutoML steps: ['adjust_validation', 'simple_algorithms', 'default_algorithms', 'not_so_random', 'golden_features', 'kmeans_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'boost_on_errors', 'ensemble', 'stack', 'ensemble_stacked'] * Step adjust_validation will try to check up to 1 model 1_DecisionTree logloss 0.643133 trained in 1.21 seconds Adjust validation. Remove: 1_DecisionTree Validation strategy: 10-fold CV Shuffle,Stratify * Step simple_algorithms will try to check up to 4 models 1_DecisionTree logloss 0.535371 trained in 2.87 seconds 2_DecisionTree logloss 0.46238 trained in 2.78 seconds 3_DecisionTree logloss 0.46392 trained in 2.76 seconds 4_Linear logloss 0.452554 trained in 7.67 seconds * Step default_algorithms will try to check up to 7 models 5_Default_LightGBM logloss 0.402072 trained in 5.66 seconds 6_Default_Xgboost logloss 0.400937 trained in 5.37 seconds 7_Default_CatBoost logloss 0.384351 trained in 5.38 seconds 8_Default_NeuralNetwork logloss 0.490864 trained in 7.85 seconds 9_Default_RandomForest logloss 0.417678 trained in 11.12 seconds 10_Default_ExtraTrees logloss 0.420511 trained in 11.95 seconds 11_Default_NearestNeighbors logloss 0.909295 trained in 4.29 seconds * Step not_so_random will try to check up to 61 models 21_LightGBM logloss 0.398133 trained in 9.02 seconds 12_Xgboost logloss 0.391173 trained in 9.16 seconds 30_CatBoost logloss 0.387318 trained in 6.19 seconds 39_RandomForest logloss 0.407113 trained in 13.39 seconds 48_ExtraTrees logloss 0.411102 trained in 11.25 seconds 57_NeuralNetwork logloss 0.444187 trained in 9.94 seconds 66_NearestNeighbors logloss 0.834326 trained in 5.44 seconds 22_LightGBM logloss 0.386149 trained in 7.45 seconds 13_Xgboost logloss 0.415666 trained in 8.3 seconds 31_CatBoost logloss 0.385593 trained in 9.93 seconds 40_RandomForest logloss 0.406411 trained in 17.76 seconds 49_ExtraTrees logloss 0.414009 trained in 18.35 seconds 58_NeuralNetwork logloss 0.439771 trained in 15.31 seconds 67_NearestNeighbors logloss 1.112462 trained in 7.37 seconds 23_LightGBM logloss 0.406987 trained in 9.88 seconds 14_Xgboost logloss 0.421778 trained in 10.6 seconds 32_CatBoost logloss 0.389089 trained in 15.27 seconds 41_RandomForest logloss 0.411355 trained in 16.84 seconds 50_ExtraTrees logloss 0.410128 trained in 14.09 seconds 59_NeuralNetwork logloss 0.473541 trained in 12.82 seconds 68_NearestNeighbors logloss 1.299103 trained in 8.58 seconds 24_LightGBM logloss 0.402071 trained in 10.49 seconds 15_Xgboost logloss 0.383564 trained in 11.58 seconds 33_CatBoost logloss 0.388264 trained in 11.46 seconds 42_RandomForest logloss 0.411391 trained in 16.71 seconds 51_ExtraTrees logloss 0.430014 trained in 19.28 seconds 60_NeuralNetwork logloss 0.466434 trained in 14.24 seconds 69_NearestNeighbors logloss 1.299103 trained in 9.21 seconds 25_LightGBM logloss 0.400083 trained in 11.1 seconds 16_Xgboost logloss 0.449111 trained in 11.83 seconds 34_CatBoost logloss 0.383657 trained in 13.94 seconds 43_RandomForest logloss 0.419977 trained in 19.14 seconds 52_ExtraTrees logloss 0.433074 trained in 17.28 seconds 61_NeuralNetwork logloss 0.509227 trained in 15.43 seconds 70_NearestNeighbors 
logloss 1.112462 trained in 10.47 seconds 26_LightGBM logloss 0.386131 trained in 12.3 seconds 17_Xgboost logloss 0.4757 trained in 13.62 seconds 35_CatBoost logloss 0.39084 trained in 14.67 seconds 44_RandomForest logloss 0.407182 trained in 19.22 seconds 53_ExtraTrees logloss 0.416177 trained in 19.04 seconds 62_NeuralNetwork logloss 0.685383 trained in 14.21 seconds 71_NearestNeighbors logloss 1.559218 trained in 11.71 seconds 27_LightGBM logloss 0.402373 trained in 16.53 seconds 18_Xgboost logloss 0.541056 trained in 16.04 seconds 36_CatBoost logloss 0.388274 trained in 18.56 seconds 45_RandomForest logloss 0.411463 trained in 22.94 seconds 54_ExtraTrees logloss 0.414643 trained in 20.77 seconds 63_NeuralNetwork logloss 0.457195 trained in 18.83 seconds 72_NearestNeighbors logloss 1.299103 trained in 13.34 seconds 28_LightGBM logloss 0.396239 trained in 15.29 seconds 19_Xgboost logloss 0.403582 trained in 18.18 seconds 37_CatBoost logloss 0.390126 trained in 16.62 seconds 46_RandomForest logloss 0.405416 trained in 21.1 seconds 55_ExtraTrees logloss 0.398471 trained in 20.1 seconds 64_NeuralNetwork logloss 0.496227 trained in 18.8 seconds 29_LightGBM logloss 0.39758 trained in 16.64 seconds 20_Xgboost logloss 0.473729 trained in 17.73 seconds 38_CatBoost logloss 0.386961 trained in 18.37 seconds 47_RandomForest logloss 0.406615 trained in 27.31 seconds 56_ExtraTrees logloss 0.414916 trained in 23.71 seconds 65_NeuralNetwork logloss 0.45266 trained in 20.0 seconds * Step golden_features will try to check up to 3 models None 10 Add Golden Feature: Pclass_str_3_diff_Sex_female Add Golden Feature: Sex_female_multiply_SameTicketCnt Add Golden Feature: SameTicketCnt_ratio_Sex_female Add Golden Feature: Sex_female_ratio_SameTicketCnt Add Golden Feature: Age_ratio_Sex_female Add Golden Feature: Sex_female_multiply_Age Add Golden Feature: Sex_female_ratio_Age Add Golden Feature: Sex_female_sum_SameTicketCnt Add Golden Feature: Embarked_Q_sum_Sex_female Add Golden Feature: Sex_female_diff_Embarked_S Created 10 Golden Features in 13.04 seconds. 
15_Xgboost_GoldenFeatures logloss 0.38565 trained in 34.62 seconds 34_CatBoost_GoldenFeatures logloss 0.387767 trained in 22.28 seconds 7_Default_CatBoost_GoldenFeatures logloss 0.390365 trained in 18.69 seconds * Step kmeans_features will try to check up to 3 models 15_Xgboost_KMeansFeatures logloss 0.393264 trained in 23.62 seconds 34_CatBoost_KMeansFeatures logloss 0.391255 trained in 39.62 seconds 7_Default_CatBoost_KMeansFeatures logloss 0.392218 trained in 21.07 seconds * Step insert_random_feature will try to check up to 1 model 15_Xgboost_RandomFeature logloss 0.400385 trained in 20.27 seconds Drop features ['random_feature', 'Embarked_S', 'Embarked_Q'] * Step features_selection will try to check up to 6 models 15_Xgboost_SelectedFeatures logloss 0.388467 trained in 20.93 seconds 34_CatBoost_SelectedFeatures logloss 0.388058 trained in 20.91 seconds 26_LightGBM_SelectedFeatures logloss 0.386319 trained in 18.57 seconds 55_ExtraTrees_SelectedFeatures logloss 0.411567 trained in 24.17 seconds 46_RandomForest_SelectedFeatures logloss 0.40437 trained in 25.48 seconds 58_NeuralNetwork_SelectedFeatures logloss 0.428871 trained in 23.06 seconds * Step hill_climbing_1 will try to check up to 31 models 73_Xgboost logloss 0.381747 trained in 20.42 seconds 74_Xgboost logloss 0.385947 trained in 20.49 seconds 75_CatBoost logloss 0.387476 trained in 21.48 seconds 76_CatBoost logloss 0.384692 trained in 20.52 seconds 77_CatBoost logloss 0.385155 trained in 20.94 seconds 78_CatBoost logloss 0.382853 trained in 21.67 seconds 79_CatBoost logloss 0.386026 trained in 26.66 seconds 80_Xgboost_GoldenFeatures logloss 0.389585 trained in 22.6 seconds 81_Xgboost_GoldenFeatures logloss 0.384525 trained in 22.23 seconds 82_LightGBM logloss 0.386149 trained in 21.3 seconds 83_LightGBM logloss 0.380569 trained in 20.84 seconds 84_LightGBM logloss 0.386131 trained in 21.03 seconds 85_LightGBM_SelectedFeatures logloss 0.388061 trained in 22.38 seconds 86_LightGBM_SelectedFeatures logloss 0.38196 trained in 21.28 seconds 87_Xgboost_SelectedFeatures logloss 0.39152 trained in 22.81 seconds 88_Xgboost_SelectedFeatures logloss 0.391144 trained in 22.98 seconds 89_ExtraTrees logloss 0.410087 trained in 27.97 seconds 90_RandomForest_SelectedFeatures logloss 0.398531 trained in 27.54 seconds 91_RandomForest logloss 0.402316 trained in 31.35 seconds 92_RandomForest logloss 0.406411 trained in 30.2 seconds 93_ExtraTrees logloss 0.413942 trained in 28.46 seconds 94_ExtraTrees logloss 0.409957 trained in 29.7 seconds 95_ExtraTrees logloss 0.411102 trained in 27.93 seconds 96_NeuralNetwork_SelectedFeatures logloss 0.434763 trained in 27.42 seconds 97_NeuralNetwork_SelectedFeatures logloss 0.43951 trained in 25.95 seconds 98_NeuralNetwork logloss 0.4356 trained in 26.6 seconds 99_DecisionTree logloss 0.467542 trained in 23.15 seconds 100_DecisionTree logloss 0.646404 trained in 22.99 seconds 101_DecisionTree logloss 0.444153 trained in 23.72 seconds 102_DecisionTree logloss 0.444153 trained in 23.86 seconds 103_NearestNeighbors logloss 1.200042 trained in 23.7 seconds * Step hill_climbing_2 will try to check up to 12 models 104_LightGBM logloss 0.38445 trained in 24.67 seconds 105_Xgboost logloss 0.391009 trained in 27.53 seconds 106_LightGBM_SelectedFeatures logloss 0.389836 trained in 25.0 seconds 107_CatBoost logloss 0.388503 trained in 26.52 seconds 108_Xgboost logloss 0.388531 trained in 27.58 seconds 109_Xgboost_GoldenFeatures logloss 0.387964 trained in 27.87 seconds 110_LightGBM logloss 0.383974 trained in 31.19 
seconds 111_RandomForest_SelectedFeatures logloss 0.396759 trained in 37.62 seconds 112_RandomForest logloss 0.397621 trained in 35.58 seconds 113_ExtraTrees logloss 0.421114 trained in 32.88 seconds 114_ExtraTrees logloss 0.402344 trained in 32.32 seconds 115_NeuralNetwork logloss 0.4562 trained in 40.35 seconds * Step boost_on_errors will try to check up to 1 model 83_LightGBM_BoostOnErrors logloss 0.390312 trained in 34.35 seconds * Step ensemble will try to check up to 1 model Ensemble logloss 0.370908 trained in 77.27 seconds * Step stack will try to check up to 60 models 83_LightGBM_Stacked logloss 0.366533 trained in 29.04 seconds 73_Xgboost_Stacked logloss 0.368775 trained in 35.0 seconds 78_CatBoost_Stacked logloss 0.363835 trained in 65.46 seconds 111_RandomForest_SelectedFeatures_Stacked logloss 0.378099 trained in 49.06 seconds 55_ExtraTrees_Stacked logloss 0.365257 trained in 38.91 seconds 58_NeuralNetwork_SelectedFeatures_Stacked logloss 0.415738 trained in 33.64 seconds 86_LightGBM_SelectedFeatures_Stacked logloss 0.366778 trained in 30.07 seconds 15_Xgboost_Stacked logloss 0.370325 trained in 36.88 seconds 34_CatBoost_Stacked logloss 0.364966 trained in 143.69 seconds 112_RandomForest_Stacked logloss 0.369609 trained in 49.01 seconds 114_ExtraTrees_Stacked logloss 0.365528 trained in 39.8 seconds 96_NeuralNetwork_SelectedFeatures_Stacked logloss 0.444507 trained in 792.28 seconds * Step ensemble_stacked will try to check up to 1 model Ensemble_Stacked logloss 0.355152 trained in 107.36 seconds AutoML fit time: 4292.47 seconds AutoML best model: Ensemble_Stacked
It tries out all sorts of models for us.
The run finished with "AutoML best model: Ensemble_Stacked".
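To dig into the run, mljar-supervised can summarize every model it trained. A minimal sketch (report() renders in a Jupyter notebook; a finished run can also be reloaded from the AutoML_3 results directory named in the log above):

# Summary report of all trained models (renders in a notebook)
automl.report()
# Reload a finished run from its results directory
from supervised.automl import AutoML
automl_loaded = AutoML(results_path="AutoML_3")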
df_eval["Survived"] = automl.predict(df_eval[FEATURE_COLS])
df_eval[["PassengerId","Survived"]].to_csv("titanic_submission.csv",index=False)
!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c titanic -f titanic_submission.csv -m "model #011. mljar パターン1"
0.75119
For some reason, the accuracy came out lower than when using only part of the training data.
AutoGluon
import pandas as pd
import numpy as np
# Load the Titanic training data and the evaluation (submission) data
df_train = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/titanic_train.csv")
df_eval = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/titanic_eval.csv")
FEATURE_COLS=[
'Age'
, 'Fare'
, 'SameTicketCnt'
, 'Pclass_str_1'
, 'Pclass_str_3'
, 'Sex_female'
, 'Embarked_Q'
, 'Embarked_S'
, 'Survived'
]
X_train = df_train[FEATURE_COLS] # features plus the target (AutoGluon expects the label column inside the training frame)
# Build the AutoGluon model (10-minute time limit)
from autogluon.tabular import TabularPredictor
predictor = TabularPredictor(label="Survived", problem_type="binary",path="RESULT_AUTOGLUON").fit(X_train, time_limit = 600)
Beginning AutoGluon training ... Time limit = 600s AutoGluon will save models to "RESULT_AUTOGLUON/" AutoGluon Version: 0.4.2 Python Version: 3.8.13 Operating System: Darwin Train Data Rows: 891 Train Data Columns: 8 Label Column: Survived Preprocessing data ... Selected class <--> label mapping: class 1 = 1, class 0 = 0 Using Feature Generators to preprocess the data ... Fitting AutoMLPipelineFeatureGenerator... Available Memory: 11537.1 MB Train Data (Original) Memory Usage: 0.06 MB (0.0% of available memory) Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features. Stage 1 Generators: Fitting AsTypeFeatureGenerator... Note: Converting 5 features to boolean dtype as they only contain 2 unique values. Stage 2 Generators: Fitting FillNaFeatureGenerator... Stage 3 Generators: Fitting IdentityFeatureGenerator... Stage 4 Generators: Fitting DropUniqueFeatureGenerator... Types of features in original data (raw dtype, special dtypes): ('float', []) : 7 | ['Age', 'Fare', 'Pclass_str_1', 'Pclass_str_3', 'Sex_female', ...] ('int', []) : 1 | ['SameTicketCnt'] Types of features in processed data (raw dtype, special dtypes): ('float', []) : 2 | ['Age', 'Fare'] ('int', []) : 1 | ['SameTicketCnt'] ('int', ['bool']) : 5 | ['Pclass_str_1', 'Pclass_str_3', 'Sex_female', 'Embarked_Q', 'Embarked_S'] 0.1s = Fit runtime 8 features in original data used to generate 8 features in processed data. Train Data (Processed) Memory Usage: 0.03 MB (0.0% of available memory) Data preprocessing and feature engineering runtime = 0.12s ... AutoGluon will gauge predictive performance using evaluation metric: 'accuracy' To change this, specify the eval_metric parameter of Predictor() Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 712, Val Rows: 179 Fitting 13 L1 models ... Fitting model: KNeighborsUnif ... Training model for up to 599.88s of the 599.88s of remaining time. 0.6592 = Validation score (accuracy) 0.02s = Training runtime 0.04s = Validation runtime Fitting model: KNeighborsDist ... Training model for up to 599.8s of the 599.8s of remaining time. 0.6704 = Validation score (accuracy) 0.01s = Training runtime 0.01s = Validation runtime Fitting model: LightGBMXT ... Training model for up to 599.77s of the 599.77s of remaining time. 0.8212 = Validation score (accuracy) 2.65s = Training runtime 0.01s = Validation runtime Fitting model: LightGBM ... Training model for up to 597.1s of the 597.1s of remaining time. 0.838 = Validation score (accuracy) 0.4s = Training runtime 0.01s = Validation runtime Fitting model: RandomForestGini ... Training model for up to 596.68s of the 596.68s of remaining time. 0.7989 = Validation score (accuracy) 1.13s = Training runtime 0.08s = Validation runtime Fitting model: RandomForestEntr ... Training model for up to 595.42s of the 595.42s of remaining time. 0.7989 = Validation score (accuracy) 0.8s = Training runtime 0.09s = Validation runtime Fitting model: CatBoost ... Training model for up to 594.48s of the 594.47s of remaining time. 0.8547 = Validation score (accuracy) 1.43s = Training runtime 0.0s = Validation runtime Fitting model: ExtraTreesGini ... Training model for up to 593.03s of the 593.03s of remaining time. 0.7877 = Validation score (accuracy) 0.78s = Training runtime 0.08s = Validation runtime Fitting model: ExtraTreesEntr ... Training model for up to 592.11s of the 592.11s of remaining time. 
0.7821 = Validation score (accuracy) 0.78s = Training runtime 0.12s = Validation runtime Fitting model: NeuralNetFastAI ... Training model for up to 591.16s of the 591.15s of remaining time. 0.8324 = Validation score (accuracy) 5.45s = Training runtime 0.02s = Validation runtime Fitting model: XGBoost ... Training model for up to 585.66s of the 585.66s of remaining time. 0.8436 = Validation score (accuracy) 0.59s = Training runtime 0.01s = Validation runtime Fitting model: NeuralNetTorch ... Training model for up to 585.04s of the 585.04s of remaining time. 0.8045 = Validation score (accuracy) 3.17s = Training runtime 0.02s = Validation runtime Fitting model: LightGBMLarge ... Training model for up to 581.85s of the 581.84s of remaining time. 0.8324 = Validation score (accuracy) 0.67s = Training runtime 0.01s = Validation runtime Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 580.44s of remaining time. 0.8603 = Validation score (accuracy) 0.74s = Training runtime 0.0s = Validation runtime AutoGluon training complete, total runtime = 20.39s ... Best model: "WeightedEnsemble_L2" TabularPredictor saved. To load, use: predictor = TabularPredictor.load("RESULT_AUTOGLUON/")
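Before predicting, it is worth looking at how each model scored on the validation split; a minimal sketch using AutoGluon's leaderboard():

# Ranked validation accuracy of every trained model
print(predictor.leaderboard(silent=True))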
df_eval["Survived"] = predictor.predict(df_eval)
df_eval[["PassengerId","Survived"]].to_csv("titanic_submission.csv",index=False)
!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c titanic -f titanic_submission.csv -m "model #011. autogluon パターン1"
0.76076
auto-sklearn
import pandas as pd
import numpy as np
# Load the Titanic training data and the evaluation (submission) data
df_train = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/titanic_train.csv")
df_eval = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/titanic_eval.csv")
FEATURE_COLS=[
'Age'
, 'Fare'
, 'SameTicketCnt'
, 'Pclass_str_1'
, 'Pclass_str_3'
, 'Sex_female'
, 'Embarked_Q'
, 'Embarked_S'
]
X_train = df_train[FEATURE_COLS] # explanatory variables (train)
Y_train = df_train["Survived"] # target variable (train)
# Build the auto-sklearn model (default settings)
import autosklearn.classification
cls = autosklearn.classification.AutoSklearnClassifier()
cls.fit(X_train, Y_train)
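auto-sklearn is fairly quiet by default, but it can summarize what it searched; a minimal sketch using its documented helpers:

# Brief statistics of the search (runs, best validation score, timeouts)
print(cls.sprint_statistics())
# The models that made it into the final ensemble
print(cls.show_models())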
df_eval["Survived"] = cls.predict(df_eval[FEATURE_COLS])
df_eval[["PassengerId","Survived"]].to_csv("titanic_submission.csv",index=False)
!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c titanic -f titanic_submission.csv -m "model #011. autosklearn パターン1"
0.76555
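As an aside, AutoSklearnClassifier runs for a fixed time budget; both budgets (in seconds) can be shortened via constructor arguments if you want a faster experiment. A minimal sketch:

import autosklearn.classification
# e.g. 10-minute total budget, at most 60 seconds per candidate model
cls_fast = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=600, per_run_time_limit=60)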
Pattern 1 summary
Accuracy ended up dropping across the board.
Does this mean the training data contains noise, or that the data preparation still needs more thought?
Pattern 2
- Modeling with all of the raw data, with no missing-value imputation, variable selection, or feature engineering at all
Based on the hypothesis that just handing everything to AutoML might give better accuracy, let's model on the data as-is.
mljar
import pandas as pd
import numpy as np
# Load the Titanic training data and the evaluation (submission) data
df_train = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/train.csv")
df_eval = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/test.csv")
X_train = df_train.drop("Survived",axis=1) # explanatory variables (train)
Y_train = df_train["Survived"] # target variable (train)
# https://supervised.mljar.com/api/
# Build the model with mljar-supervised
from supervised.automl import AutoML
automl = AutoML(mode="Compete", random_state=100)
automl.fit(X_train,Y_train)
AutoML directory: AutoML_4 The task is binary_classification with evaluation metric logloss AutoML will use algorithms: ['Decision Tree', 'Linear', 'Random Forest', 'Extra Trees', 'LightGBM', 'Xgboost', 'CatBoost', 'Neural Network', 'Nearest Neighbors'] AutoML will stack models AutoML will ensemble available models AutoML steps: ['adjust_validation', 'simple_algorithms', 'default_algorithms', 'not_so_random', 'mix_encoding', 'golden_features', 'kmeans_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'boost_on_errors', 'ensemble', 'stack', 'ensemble_stacked'] * Step adjust_validation will try to check up to 1 model 1_DecisionTree logloss 0.461027 trained in 2.64 seconds Adjust validation. Remove: 1_DecisionTree Validation strategy: 10-fold CV Shuffle,Stratify * Step simple_algorithms will try to check up to 4 models 1_DecisionTree logloss 0.588967 trained in 12.97 seconds 2_DecisionTree logloss 0.42534 trained in 11.45 seconds 3_DecisionTree logloss 0.457723 trained in 11.2 seconds 4_Linear logloss 0.526347 trained in 23.32 seconds * Step default_algorithms will try to check up to 7 models 5_Default_LightGBM logloss 0.405131 trained in 14.45 seconds 6_Default_Xgboost logloss 0.403659 trained in 14.77 seconds 7_Default_CatBoost logloss 0.395797 trained in 18.35 seconds 8_Default_NeuralNetwork logloss 0.76738 trained in 20.43 seconds 9_Default_RandomForest logloss 0.398352 trained in 24.21 seconds 10_Default_ExtraTrees logloss 0.398378 trained in 26.34 seconds 11_Default_NearestNeighbors logloss 1.098024 trained in 16.76 seconds * Step not_so_random will try to check up to 61 models 21_LightGBM logloss 0.399716 trained in 14.63 seconds 12_Xgboost logloss 0.406368 trained in 16.04 seconds 30_CatBoost logloss 0.40208 trained in 17.71 seconds 39_RandomForest logloss 0.404127 trained in 32.21 seconds 48_ExtraTrees logloss 0.403249 trained in 27.43 seconds 57_NeuralNetwork logloss 0.584528 trained in 21.8 seconds 66_NearestNeighbors logloss 0.854819 trained in 18.14 seconds 22_LightGBM logloss 0.410192 trained in 15.57 seconds 13_Xgboost logloss 0.408691 trained in 17.36 seconds 31_CatBoost logloss 0.395837 trained in 29.12 seconds 40_RandomForest logloss 0.398086 trained in 33.49 seconds 49_ExtraTrees logloss 0.403386 trained in 37.29 seconds 58_NeuralNetwork logloss 0.701188 trained in 34.28 seconds 67_NearestNeighbors logloss 0.847123 trained in 25.93 seconds 23_LightGBM logloss 0.425954 trained in 20.98 seconds 14_Xgboost logloss 0.413054 trained in 27.48 seconds 32_CatBoost logloss 0.398507 trained in 61.67 seconds 41_RandomForest logloss 0.400717 trained in 34.67 seconds 50_ExtraTrees logloss 0.390717 trained in 29.41 seconds 59_NeuralNetwork logloss 0.850408 trained in 24.97 seconds 68_NearestNeighbors logloss 1.652493 trained in 21.87 seconds 24_LightGBM logloss 0.402971 trained in 18.79 seconds 15_Xgboost logloss 0.39368 trained in 20.78 seconds 33_CatBoost logloss 0.402891 trained in 27.12 seconds 42_RandomForest logloss 0.398977 trained in 29.68 seconds 51_ExtraTrees logloss 0.406206 trained in 28.03 seconds 60_NeuralNetwork logloss 0.976241 trained in 29.27 seconds 69_NearestNeighbors logloss 1.652493 trained in 23.4 seconds 25_LightGBM logloss 0.400252 trained in 21.61 seconds 16_Xgboost logloss 0.439921 trained in 24.78 seconds 34_CatBoost logloss 0.403254 trained in 37.62 seconds 43_RandomForest logloss 0.403334 trained in 36.33 seconds 52_ExtraTrees logloss 0.409987 trained in 34.3 seconds 61_NeuralNetwork logloss 0.911824 trained in 
30.49 seconds 70_NearestNeighbors logloss 0.847123 trained in 24.23 seconds 26_LightGBM logloss 0.409352 trained in 21.49 seconds 17_Xgboost logloss 0.46196 trained in 23.23 seconds 35_CatBoost logloss 0.403134 trained in 36.35 seconds 44_RandomForest logloss 0.399452 trained in 35.27 seconds 53_ExtraTrees logloss 0.407137 trained in 35.21 seconds 62_NeuralNetwork logloss 0.960051 trained in 30.22 seconds 71_NearestNeighbors logloss 1.653601 trained in 24.74 seconds 27_LightGBM logloss 0.406186 trained in 26.0 seconds 18_Xgboost logloss 0.475936 trained in 24.12 seconds 36_CatBoost logloss 0.395098 trained in 38.38 seconds 45_RandomForest logloss 0.401126 trained in 38.59 seconds 54_ExtraTrees logloss 0.418287 trained in 34.39 seconds 63_NeuralNetwork logloss 0.67047 trained in 33.34 seconds 72_NearestNeighbors logloss 1.652493 trained in 28.84 seconds 28_LightGBM logloss 0.397014 trained in 26.25 seconds * Step mix_encoding will try to check up to 1 model 15_Xgboost_categorical_mix logloss 0.394184 trained in 30.49 seconds * Step golden_features will try to check up to 3 models None 10 Add Golden Feature: Parch_sum_SibSp Add Golden Feature: SibSp_sum_Pclass Add Golden Feature: SibSp_ratio_Parch Add Golden Feature: Pclass_diff_Parch Add Golden Feature: SibSp_ratio_Fare Add Golden Feature: SibSp_multiply_Pclass Add Golden Feature: Parch_multiply_SibSp Add Golden Feature: Parch_ratio_SibSp Add Golden Feature: SibSp_diff_Parch Add Golden Feature: Parch_multiply_Pclass Created 10 Golden Features in 13.19 seconds. 50_ExtraTrees_GoldenFeatures logloss 0.39349 trained in 56.81 seconds 15_Xgboost_GoldenFeatures logloss 0.39824 trained in 30.44 seconds 15_Xgboost_categorical_mix_GoldenFeatures logloss 0.396049 trained in 30.37 seconds * Step kmeans_features will try to check up to 3 models 50_ExtraTrees_KMeansFeatures logloss 0.391718 trained in 42.41 seconds 15_Xgboost_KMeansFeatures logloss 0.405067 trained in 33.91 seconds 15_Xgboost_categorical_mix_KMeansFeatures logloss 0.402308 trained in 35.31 seconds * Step insert_random_feature will try to check up to 1 model 50_ExtraTrees_RandomFeature logloss 0.398504 trained in 137.61 seconds Drop features ['Ticket_1601', 'Ticket_113781', 'Age', 'Ticket_347082', 'Name_rev', 'Ticket_2666', 'Ticket_ston', 'Name_dr', 'Name_robert', 'random_feature', 'Ticket_17755', 'Ticket_ca', 'Ticket_29106', 'Name_joseph', 'Name_william', 'Ticket_2343', 'Name_leonard', 'Name_kate', 'Name_peter', 'Name_johan', 'Name_skoog', 'Ticket_347088', 'Ticket_113760', 'Name_edward', 'Ticket_11767', 'Name_emily', 'Name_ivan', 'Name_martha', 'Ticket_17569', 'Ticket_237736', 'Name_palsson', 'Name_viktor', 'Ticket_2699', 'Ticket_347080', 'Name_arnold', 'Ticket_28403', 'Name_carl', 'Ticket_250647', 'Ticket_248738', 'Ticket_36947', 'Ticket_367230', 'Ticket_370129', 'Ticket_19943', 'Ticket_2668', 'Name_baclini', 'Name_hart', 'Name_van', 'Ticket_31921', 'Ticket_367226', 'Name_catherine', 'Ticket_364516', 'Ticket_110152', 'Ticket_34651', 'Ticket_36973', 'Ticket_220845', 'Name_marion', 'Ticket_230433', 'Ticket_2659', 'Name_vander', 'Name_nils', 'Ticket_349909', 'Ticket_345773', 'Name_daniel', 'Name_fortune', 'Name_walter', 'Ticket_36928', 'Ticket_17604', 'Ticket_17611', 'Ticket_17593', 'Ticket_19928', 'Ticket_17474', 'Name_stanley', 'Name_harry', 'Ticket_2691', 'Ticket_110413', 'Name_jane', 'Name_louise', 'Ticket_371110', 'Ticket_48871', 'Name_karl', 'Name_david', 'Ticket_347742', 'Name_hugh', 'Name_lefebre', 'Ticket_paris', 'Ticket_2678', 'Ticket_13529', 'Name_elias', 'Name_martin', 
'Name_frank', 'Ticket_o2', 'Ticket_2653', 'Ticket_2665', 'Ticket_370365', 'Ticket_19996', 'Name_bertram', 'Ticket_24160', 'Name_gustaf', 'Name_ellen', 'Name_richards', 'Name_charles', 'Name_panula', 'Ticket_sc', 'Name_alice', 'Ticket_244252', 'Name_elsie', 'Name_matilda', 'Ticket_17558', 'Ticket_244367', 'Name_thayer', 'Ticket_349237', 'Ticket_2651', 'Ticket_2661', 'Ticket_2315', 'Name_hansen', 'Name_brown', 'Ticket_248727', 'Ticket_363291', 'Ticket_14879', 'Ticket_6608', 'Ticket_239853', 'Ticket_19950', 'Ticket_17582', 'Ticket_31027', 'Ticket_3336', 'Ticket_3101279', 'Name_boulos', 'Name_benjamin', 'Ticket_ah', 'Ticket_2079', 'Ticket_250655', 'Ticket_19877', 'Ticket_2673', 'Ticket_17761', 'Ticket_230136', 'Ticket_26360', 'Ticket_382652', 'Ticket_17572', 'Name_williams', 'Ticket_soton', 'Name_alfred', 'Name_sage', 'Ticket_347077', 'Name_sofia', 'Ticket_2123', 'Ticket_line', 'Ticket_17760', 'Ticket_9549', 'Name_kelly', 'Ticket_12749', 'Name_ernst', 'Name_johansson', 'Name_goodwin', 'Ticket_113789', 'Name_hans', 'Name_rice', 'Name_marie', 'Ticket_113505', 'Ticket_2144', 'Ticket_751', 'Ticket_6607', 'Name_norman', 'Ticket_17757', 'Ticket_37671', 'Name_percival', 'Ticket_392096', 'Ticket_54636', 'Ticket_17421', 'Name_francis', 'Name_victor', 'Name_augusta', 'Name_august', 'Ticket_230080', 'Name_jr', 'Name_samuel', 'Name_albert', 'Ticket_35273', 'Ticket_17758', 'Name_johnson', 'Name_alexander', 'Ticket_3101295', 'Ticket_oq', 'Name_margaret', 'Name_olsen', 'Name_bertha', 'Ticket_358585', 'Name_jensen', 'Name_elisabeth', 'Ticket_29750', 'Ticket_17485', 'Name_asplund', 'Ticket_243847', 'Ticket_17477', 'Name_ada', 'Ticket_250649', 'Ticket_3381', 'Ticket_33112', 'Ticket_239865', 'Ticket_13502', 'Name_thomas', 'Ticket_35281', 'Ticket_3101278', 'Ticket_347054', 'Name_florence', 'Name_patrick', 'Name_anne', 'Name_richard', 'Ticket_345764', 'Name_harper', 'Name_edith', 'Name_gustafsson', 'Name_hanna', 'Name_smith', 'Ticket_364849', 'Ticket_7534', 'Ticket_111361', 'Ticket_2627', 'Ticket_250644', 'Ticket_231919', 'Name_emil', 'Name_katherine', 'Name_andrew', 'Ticket_17608', 'Name_sidney', 'Ticket_2908', 'Ticket_376564', 'Name_oskar', 'Name_bourke', 'Ticket_17453', 'Ticket_113572', 'Ticket_113803', 'Ticket_110465', 'Name_douglas', 'Name_ford', 'Name_henry', 'Ticket_4133', 'Name_anna', 'Name_john', 'Name_frederick', 'Name_carter', 'Name_ernest', 'Name_harris', 'Name_james', 'Name_helen', 'Name_arthur', 'Name_annie', 'Name_elizabeth', 'Name_mary', 'Name_george', 'Ticket_pp', 'Ticket_pc', 'Parch', 'Name_andersson', 'Name_maria'] * Step features_selection will try to check up to 6 models 50_ExtraTrees_SelectedFeatures logloss 0.396132 trained in 35.84 seconds 15_Xgboost_SelectedFeatures logloss 0.404208 trained in 34.39 seconds 36_CatBoost_SelectedFeatures logloss 0.402618 trained in 41.61 seconds 28_LightGBM_SelectedFeatures logloss 0.406602 trained in 40.25 seconds 40_RandomForest_SelectedFeatures logloss 0.406838 trained in 41.95 seconds 57_NeuralNetwork_SelectedFeatures logloss 0.425396 trained in 33.05 seconds * Step hill_climbing_1 will try to check up to 32 models 73_ExtraTrees logloss 0.390712 trained in 41.55 seconds 74_ExtraTrees logloss 0.401999 trained in 41.04 seconds 75_ExtraTrees logloss 0.38248 trained in 40.55 seconds 76_ExtraTrees logloss 0.400482 trained in 41.51 seconds 77_ExtraTrees_GoldenFeatures logloss 0.387611 trained in 40.97 seconds 78_ExtraTrees_GoldenFeatures logloss 0.402254 trained in 47.96 seconds * Step hill_climbing_2 will try to check up to 30 models 79_ExtraTrees logloss 
0.384233 trained in 45.27 seconds 80_ExtraTrees_GoldenFeatures logloss 0.386553 trained in 42.1 seconds 81_ExtraTrees logloss 0.391268 trained in 39.77 seconds 82_Xgboost logloss 0.394635 trained in 33.64 seconds 83_Xgboost logloss 0.391803 trained in 34.54 seconds 84_Xgboost logloss 0.39389 trained in 34.75 seconds 85_Xgboost logloss 0.392496 trained in 34.79 seconds * Step boost_on_errors will try to check up to 1 model 75_ExtraTrees_BoostOnErrors not trained. Force to stop the training. Total time for AutoML training already exceeded. * Step ensemble will try to check up to 1 model Ensemble logloss 0.373248 trained in 5148.8 seconds Skip stack because no parameters were generated. Skip ensemble_stacked because no parameters were generated. AutoML fit time: 32704.8 seconds AutoML best model: Ensemble
df_eval["Survived"] = automl.predict(df_eval)
df_eval[["PassengerId","Survived"]].to_csv("titanic_submission.csv",index=False)
!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c titanic -f titanic_submission.csv -m "model #011. mljar パターン2"
0.79186
Oh, the accuracy went up.
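One caveat: the log shows this run took about nine hours (AutoML fit time: 32704.8 seconds) and one step was force-stopped for exceeding the time budget. If you want to cap the search, AutoML accepts a total_time_limit argument in seconds; a minimal sketch:

from supervised.automl import AutoML
# Cap the entire Compete-mode search at one hour
automl_capped = AutoML(mode="Compete", total_time_limit=3600, random_state=100)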
AutoGluon
import pandas as pd
import numpy as np
# Load the Titanic training data and the evaluation (submission) data
df_train = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/train.csv")
df_eval = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/test.csv")
X_train = df_train # the entire raw training frame, including the label column
# Build the AutoGluon model (10-minute time limit)
from autogluon.tabular import TabularPredictor
predictor = TabularPredictor(label="Survived", problem_type="binary",path="RESULT_AUTOGLUON").fit(X_train, time_limit = 600)
Beginning AutoGluon training ... Time limit = 600s AutoGluon will save models to "RESULT_AUTOGLUON/" AutoGluon Version: 0.4.2 Python Version: 3.8.13 Operating System: Darwin Train Data Rows: 891 Train Data Columns: 11 Label Column: Survived Preprocessing data ... Selected class <--> label mapping: class 1 = 1, class 0 = 0 Using Feature Generators to preprocess the data ... Fitting AutoMLPipelineFeatureGenerator... Available Memory: 10859.32 MB Train Data (Original) Memory Usage: 0.31 MB (0.0% of available memory) Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features. Stage 1 Generators: Fitting AsTypeFeatureGenerator... Note: Converting 1 features to boolean dtype as they only contain 2 unique values. Stage 2 Generators: Fitting FillNaFeatureGenerator... Stage 3 Generators: Fitting IdentityFeatureGenerator... Fitting CategoryFeatureGenerator... Fitting CategoryMemoryMinimizeFeatureGenerator... Fitting TextSpecialFeatureGenerator... Fitting BinnedFeatureGenerator... Fitting DropDuplicatesFeatureGenerator... Fitting TextNgramFeatureGenerator... Fitting CountVectorizer for text features: ['Name'] CountVectorizer fit with vocabulary size = 8 Stage 4 Generators: Fitting DropUniqueFeatureGenerator... Types of features in original data (raw dtype, special dtypes): ('float', []) : 2 | ['Age', 'Fare'] ('int', []) : 4 | ['PassengerId', 'Pclass', 'SibSp', 'Parch'] ('object', []) : 4 | ['Sex', 'Ticket', 'Cabin', 'Embarked'] ('object', ['text']) : 1 | ['Name'] Types of features in processed data (raw dtype, special dtypes): ('category', []) : 3 | ['Ticket', 'Cabin', 'Embarked'] ('float', []) : 2 | ['Age', 'Fare'] ('int', []) : 4 | ['PassengerId', 'Pclass', 'SibSp', 'Parch'] ('int', ['binned', 'text_special']) : 9 | ['Name.char_count', 'Name.word_count', 'Name.capital_ratio', 'Name.lower_ratio', 'Name.special_ratio', ...] ('int', ['bool']) : 1 | ['Sex'] ('int', ['text_ngram']) : 9 | ['__nlp__.henry', '__nlp__.john', '__nlp__.master', '__nlp__.miss', '__nlp__.mr', ...] 0.8s = Fit runtime 11 features in original data used to generate 28 features in processed data. Train Data (Processed) Memory Usage: 0.07 MB (0.0% of available memory) Data preprocessing and feature engineering runtime = 0.92s ... AutoGluon will gauge predictive performance using evaluation metric: 'accuracy' To change this, specify the eval_metric parameter of Predictor() Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 712, Val Rows: 179 Fitting 13 L1 models ... Fitting model: KNeighborsUnif ... Training model for up to 599.07s of the 599.07s of remaining time. 0.6536 = Validation score (accuracy) 0.12s = Training runtime 0.05s = Validation runtime Fitting model: KNeighborsDist ... Training model for up to 598.87s of the 598.87s of remaining time. 0.6536 = Validation score (accuracy) 0.07s = Training runtime 0.03s = Validation runtime Fitting model: LightGBMXT ... Training model for up to 598.75s of the 598.74s of remaining time. 0.8156 = Validation score (accuracy) 2.66s = Training runtime 0.02s = Validation runtime Fitting model: LightGBM ... Training model for up to 596.04s of the 596.04s of remaining time. 0.8212 = Validation score (accuracy) 0.53s = Training runtime 0.02s = Validation runtime Fitting model: RandomForestGini ... Training model for up to 595.48s of the 595.47s of remaining time. 0.8156 = Validation score (accuracy) 1.31s = Training runtime 0.1s = Validation runtime Fitting model: RandomForestEntr ... 
Training model for up to 594.01s of the 594.01s of remaining time. 0.8156 = Validation score (accuracy) 0.96s = Training runtime 0.13s = Validation runtime Fitting model: CatBoost ... Training model for up to 592.86s of the 592.85s of remaining time. 0.8268 = Validation score (accuracy) 1.71s = Training runtime 0.02s = Validation runtime Fitting model: ExtraTreesGini ... Training model for up to 591.12s of the 591.12s of remaining time. 0.8101 = Validation score (accuracy) 1.03s = Training runtime 0.11s = Validation runtime Fitting model: ExtraTreesEntr ... Training model for up to 589.93s of the 589.92s of remaining time. 0.8101 = Validation score (accuracy) 1.01s = Training runtime 0.11s = Validation runtime Fitting model: NeuralNetFastAI ... Training model for up to 588.73s of the 588.72s of remaining time. No improvement since epoch 9: early stopping 0.8268 = Validation score (accuracy) 7.76s = Training runtime 0.04s = Validation runtime Fitting model: XGBoost ... Training model for up to 580.89s of the 580.88s of remaining time. 0.8101 = Validation score (accuracy) 0.8s = Training runtime 0.02s = Validation runtime Fitting model: NeuralNetTorch ... Training model for up to 580.05s of the 580.04s of remaining time. 0.8492 = Validation score (accuracy) 8.53s = Training runtime 0.04s = Validation runtime Fitting model: LightGBMLarge ... Training model for up to 571.47s of the 571.47s of remaining time. 0.8324 = Validation score (accuracy) 1.61s = Training runtime 0.02s = Validation runtime Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 568.72s of remaining time. 0.8603 = Validation score (accuracy) 0.89s = Training runtime 0.0s = Validation runtime AutoGluon training complete, total runtime = 32.25s ... Best model: "WeightedEnsemble_L2" TabularPredictor saved. To load, use: predictor = TabularPredictor.load("RESULT_AUTOGLUON/")
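Since AutoGluon did its own feature engineering here (boolean Sex, category-encoded Ticket/Cabin/Embarked, text features from Name), it is interesting to check which raw columns actually mattered; a minimal sketch using the documented feature_importance() helper (it relies on permutation shuffling, so it takes a moment):

# Permutation importance of each raw input column
print(predictor.feature_importance(df_train))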
df_eval["Survived"] = predictor.predict(df_eval)
df_eval[["PassengerId","Survived"]].to_csv("titanic_submission.csv",index=False)
!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c titanic -f titanic_submission.csv -m "model #011. autogluon パターン2"
0.76555
Not bad.
auto-sklearn
import pandas as pd
import numpy as np
# Load the Titanic training data and the evaluation (submission) data
df_train = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/train.csv")
df_eval = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/titanic/test.csv")
# Workaround for a ValueError (see the note below): cast object columns to category
cols=df_train.select_dtypes(exclude=['int','float']).columns.to_list()
df_train[cols] = df_train[cols].astype('category')
# Build X and Y
X_train = df_train.drop("Survived",axis=1) # explanatory variables (train)
Y_train = df_train["Survived"] # target variable (train)
# Build the auto-sklearn model
import autosklearn.classification
cls = autosklearn.classification.AutoSklearnClassifier()
cls.fit(X_train, Y_train)
AutoSklearnClassifier(per_run_time_limit=360)
Without the cast, fit fails with:
ValueError: Input Column Name has invalid type object.
Cast it to a valid dtype before using it in Auto-Sklearn. Valid types are numerical, categorical or boolean.
so the object-dtype columns are converted to the category dtype beforehand.
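An alternative to the dtype cast (a sketch based on auto-sklearn's documented fit signature) is to declare the column types explicitly with the feat_type argument:

# Tell auto-sklearn which columns are categorical instead of casting dtypes
feat_type = ["Categorical" if str(X_train[c].dtype) in ("object", "category")
             else "Numerical" for c in X_train.columns]
cls.fit(X_train, Y_train, feat_type=feat_type)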
df_eval["Survived"] = cls.predict(df_eval)
df_eval[["PassengerId","Survived"]].to_csv("titanic_submission.csv",index=False)
!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c titanic -f titanic_submission.csv -m "model #011. autosklearn パターン2"
0.75598
Pattern 2 summary
mljar scored better on the raw, untouched data.
The other two, AutoGluon and auto-sklearn, landed at about the same accuracy with or without the data preparation.
Summary
In Pattern 1, accuracy tended to drop overall.
Perhaps the training data contains noise, or the data preparation needs a bit more work?
In Pattern 2, mljar clearly improved with no preprocessing at all (0.79186 vs. 0.75119 in Pattern 1). For AutoGluon (0.76555 raw vs. 0.76076 preprocessed) and auto-sklearn (0.75598 raw vs. 0.76555 preprocessed), the scores landed in the same range either way, so preprocessing made little difference there.