
Classification

Loading the dataset

In [1]:
import pandas as pd

from sklearn import set_config

# Make all scikit-learn transformers return pandas DataFrames instead of numpy arrays
set_config(transform_output="pandas")

random_state = 9

df = pd.read_csv("data/titanic.csv", index_col="PassengerId")

df
Out[1]:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ...
887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 11 columns

Splitting the dataset into training and test sets (80/20) for the classification task

Target feature -- Survived
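
The helper split_stratified_into_train_val_test lives in src.utils and its source is not shown in this notebook. A minimal sketch of such a helper, assuming it wraps scikit-learn's train_test_split with the stratify argument (hypothetical reconstruction; the actual implementation may differ):

from sklearn.model_selection import train_test_split

def split_stratified_into_train_val_test(
    df, stratify_colname, frac_train, frac_val, frac_test, random_state
):
    # First split off the test part, preserving the class proportions
    df_train_val, df_test = train_test_split(
        df, test_size=frac_test, stratify=df[stratify_colname], random_state=random_state
    )
    if frac_val > 0:
        # Re-normalize the validation fraction relative to the remaining rows
        df_train, df_val = train_test_split(
            df_train_val,
            test_size=frac_val / (frac_train + frac_val),
            stratify=df_train_val[stratify_colname],
            random_state=random_state,
        )
    else:
        df_train, df_val = df_train_val, df_train_val.iloc[0:0]
    # The features keep all columns; the target is a single-column DataFrame
    return (
        df_train, df_val, df_test,
        df_train[[stratify_colname]], df_val[[stratify_colname]], df_test[[stratify_colname]],
    )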

In [2]:
from src.utils import split_stratified_into_train_val_test

X_train, X_val, X_test, y_train, y_val, y_test = split_stratified_into_train_val_test(
    df, stratify_colname="Survived", frac_train=0.80, frac_val=0, frac_test=0.20, random_state=random_state
)

display("X_train", X_train)
display("y_train", y_train)

display("X_test", X_test)
display("y_test", y_test)
'X_train'
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
145 0 2 Andrew, Mr. Edgardo Samuel male 18.00 0 0 231945 11.5000 NaN S
206 0 3 Strom, Miss. Telma Matilda female 2.00 0 1 347054 10.4625 G6 S
349 1 3 Coutts, Master. William Loch "William" male 3.00 1 1 C.A. 37671 15.9000 NaN S
329 1 3 Goldsmith, Mrs. Frank John (Emily Alice Brown) female 31.00 1 1 363291 20.5250 NaN S
289 1 2 Hosono, Mr. Masabumi male 42.00 0 0 237798 13.0000 NaN S
... ... ... ... ... ... ... ... ... ... ... ...
756 1 2 Hamalainen, Master. Viljo male 0.67 1 1 250649 14.5000 NaN S
816 0 1 Fry, Mr. Richard male NaN 0 0 112058 0.0000 B102 S
890 1 1 Behr, Mr. Karl Howell male 26.00 0 0 111369 30.0000 C148 C
738 1 1 Lesurer, Mr. Gustave J male 35.00 0 0 PC 17755 512.3292 B101 C
61 0 3 Sirayanian, Mr. Orsen male 22.00 0 0 2669 7.2292 NaN C

712 rows × 11 columns

'y_train'
Survived
PassengerId
145 0
206 0
349 1
329 1
289 1
... ...
756 1
816 0
890 1
738 1
61 0

712 rows × 1 columns

'X_test'
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
843 1 1 Serepeca, Miss. Augusta female 30.0 0 0 113798 31.0000 NaN C
791 0 3 Keane, Mr. Andrew "Andy" male NaN 0 0 12460 7.7500 NaN Q
509 0 3 Olsen, Mr. Henry Margido male 28.0 0 0 C 4001 22.5250 NaN S
828 1 2 Mallet, Master. Andre male 1.0 0 2 S.C./PARIS 2079 37.0042 NaN C
414 0 2 Cunningham, Mr. Alfred Fleming male NaN 0 0 239853 0.0000 NaN S
... ... ... ... ... ... ... ... ... ... ... ...
824 1 3 Moor, Mrs. (Beila) female 27.0 0 1 392096 12.4750 E121 S
353 0 3 Elias, Mr. Tannous male 15.0 1 1 2695 7.2292 NaN C
674 1 2 Wilhelms, Mr. Charles male 31.0 0 0 244270 13.0000 NaN S
100 0 2 Kantor, Mr. Sinai male 34.0 1 0 244367 26.0000 NaN S
542 0 3 Andersson, Miss. Ingeborg Constanzia female 9.0 4 2 347082 31.2750 NaN S

179 rows × 11 columns

'y_test'
Survived
PassengerId
843 1
791 0
509 0
828 1
414 0
... ...
824 1
353 0
674 1
100 0
542 0

179 rows × 1 columns

Building the preprocessing pipeline for the classification task

preprocessing_num -- pipeline for numeric features: missing-value imputation and standardization

preprocessing_cat -- pipeline for categorical features: missing-value imputation and one-hot encoding

features_preprocessing -- transformer for feature preprocessing

features_engineering -- transformer for feature engineering

drop_columns -- transformer that drops columns

features_postprocessing -- transformer that one-hot encodes the newly engineered features

pipeline_end -- the main pipeline for data preprocessing and feature engineering

A Pipeline runs its steps sequentially.

A ColumnTransformer applies its transformers in parallel, each to its own subset of columns.

Documentation:

https://scikit-learn.org/1.5/api/sklearn.pipeline.html

https://scikit-learn.org/1.5/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer
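
The custom TitanicFeatures transformer is imported from src.transformers, whose source is not part of this notebook. A minimal sketch, inferred from the engineered columns that appear later (Is_married, Cabin_type); this is a hypothetical reconstruction, not the actual implementation:

from sklearn.base import BaseEstimator, TransformerMixin

class TitanicFeatures(BaseEstimator, TransformerMixin):
    """Derive Is_married from the passenger's title and Cabin_type from the cabin code."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        # A "Mrs." title in the name indicates a married woman
        X["Is_married"] = X["Name"].str.contains("Mrs.", regex=False).astype(int)
        # Deck letter: first character of the cabin code ("u" for the imputed "unknown")
        X["Cabin_type"] = X["Cabin"].str[0]
        return X

    def get_feature_names_out(self, input_features=None):
        # Name and Cabin are kept so that the drop_columns step can remove them later
        return ["Name", "Cabin", "Is_married", "Cabin_type"]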

In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from src.transformers import TitanicFeatures


columns_to_drop = ["Survived", "Name", "Cabin", "Ticket", "Embarked", "Parch", "Fare"]
num_columns = [
    column
    for column in df.columns
    if column not in columns_to_drop and df[column].dtype != "object"
]
cat_columns = [
    column
    for column in df.columns
    if column not in columns_to_drop and df[column].dtype == "object"
]

num_imputer = SimpleImputer(strategy="median")
num_scaler = StandardScaler()
preprocessing_num = Pipeline(
    [
        ("imputer", num_imputer),
        ("scaler", num_scaler),
    ]
)

cat_imputer = SimpleImputer(strategy="constant", fill_value="unknown")
cat_encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop="first")
preprocessing_cat = Pipeline(
    [
        ("imputer", cat_imputer),
        ("encoder", cat_encoder),
    ]
)

features_preprocessing = ColumnTransformer(
    verbose_feature_names_out=False,
    transformers=[
        ("prepocessing_num", preprocessing_num, num_columns),
        ("prepocessing_cat", preprocessing_cat, cat_columns),
        ("prepocessing_features", cat_imputer, ["Name", "Cabin"]),
    ],
    remainder="passthrough"
)

features_engineering = ColumnTransformer(
    verbose_feature_names_out=False,
    transformers=[
        ("add_features", TitanicFeatures(), ["Name", "Cabin"]),
    ],
    remainder="passthrough",
)

drop_columns = ColumnTransformer(
    verbose_feature_names_out=False,
    transformers=[
        ("drop_columns", "drop", columns_to_drop),
    ],
    remainder="passthrough",
)

features_postprocessing = ColumnTransformer(
    verbose_feature_names_out=False,
    transformers=[
        ("prepocessing_cat", preprocessing_cat, ["Cabin_type"]),
    ],
    remainder="passthrough",
)

pipeline_end = Pipeline(
    [
        ("features_preprocessing", features_preprocessing),
        ("features_engineering", features_engineering),
        ("drop_columns", drop_columns),
        ("features_postprocessing", features_postprocessing),
    ]
)

Demonstrating the preprocessing pipeline for classification

In [4]:
preprocessing_result = pipeline_end.fit_transform(X_train)
# With transform_output="pandas", fit_transform already returns a DataFrame;
# the explicit constructor just re-attaches the feature names
preprocessed_df = pd.DataFrame(
    preprocessing_result,
    columns=pipeline_end.get_feature_names_out(),
)

preprocessed_df
Out[4]:
Cabin_type_B Cabin_type_C Cabin_type_D Cabin_type_E Cabin_type_F Cabin_type_G Cabin_type_T Cabin_type_u Is_married Pclass Age SibSp Sex_male
PassengerId
145 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0 -0.379423 -0.869506 -0.473465 1.0
206 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0 0.821241 -2.102186 -0.473465 0.0
349 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0 0.821241 -2.025143 0.437635 1.0
329 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1 0.821241 0.132047 0.437635 0.0
289 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0 -0.379423 0.979514 -0.473465 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
756 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0 -0.379423 -2.204652 0.437635 1.0
816 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 -1.580088 -0.099081 -0.473465 1.0
890 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0 -1.580088 -0.253166 -0.473465 1.0
738 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 -1.580088 0.440217 -0.473465 1.0
61 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0 0.821241 -0.561336 -0.473465 1.0

712 rows × 13 columns

Building the set of classification models

logistic -- logistic regression

ridge -- ridge classifier (implemented below as an L2-regularized logistic regression; RidgeClassifierCV, shown commented out, does not provide predict_proba)

decision_tree -- decision tree

knn -- k-nearest neighbors

naive_bayes -- naive Bayes classifier

gradient_boosting -- gradient boosting (an ensemble of decision trees)

random_forest -- random forest (an ensemble of decision trees)

mlp -- multilayer perceptron (a neural network)

Documentation: https://scikit-learn.org/1.5/supervised_learning.html

In [5]:
from sklearn import ensemble, linear_model, naive_bayes, neighbors, neural_network, tree

class_models = {
    "logistic": {"model": linear_model.LogisticRegression()},
    # "ridge": {"model": linear_model.RidgeClassifierCV(cv=5, class_weight="balanced")},
    "ridge": {"model": linear_model.LogisticRegression(penalty="l2", class_weight="balanced")},
    "decision_tree": {
        "model": tree.DecisionTreeClassifier(max_depth=7, random_state=random_state)
    },
    "knn": {"model": neighbors.KNeighborsClassifier(n_neighbors=7)},
    "naive_bayes": {"model": naive_bayes.GaussianNB()},
    "gradient_boosting": {
        "model": ensemble.GradientBoostingClassifier(n_estimators=210)
    },
    "random_forest": {
        "model": ensemble.RandomForestClassifier(
            max_depth=11, class_weight="balanced", random_state=random_state
        )
    },
    "mlp": {
        "model": neural_network.MLPClassifier(
            hidden_layer_sizes=(7,),
            max_iter=500,
            early_stopping=True,
            random_state=random_state,
        )
    },
}

Training the models on the training set and evaluating them on the test set

In [6]:
from src.utils import run_classification

for model_name in class_models.keys():
    print(f"Model: {model_name}")
    model = class_models[model_name]["model"]

    pipeline = Pipeline([("pipeline", pipeline_end), ("model", model)]).fit(
        X_train, y_train.values.ravel()
    )

    class_models[model_name] = run_classification(
        pipeline, X_train, X_test, y_train, y_test
    )
Model: logistic
Model: ridge
Model: decision_tree
Model: knn
Model: naive_bayes
Model: gradient_boosting
Model: random_forest
Model: mlp
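
run_classification is also a helper from src.utils. A sketch of what it computes, assuming it is built on sklearn.metrics and that the dictionary keys match those used in the summary tables below (hypothetical reconstruction):

from sklearn import metrics

def run_classification(pipeline, X_train, X_test, y_train, y_test):
    y_train, y_test = y_train.values.ravel(), y_test.values.ravel()
    y_train_pred = pipeline.predict(X_train)
    y_test_pred = pipeline.predict(X_test)
    # Probability of the positive class, needed for ROC AUC
    y_test_proba = pipeline.predict_proba(X_test)[:, 1]
    return {
        "pipeline": pipeline,
        "preds": y_test_pred,
        "Precision_train": metrics.precision_score(y_train, y_train_pred),
        "Precision_test": metrics.precision_score(y_test, y_test_pred),
        "Recall_train": metrics.recall_score(y_train, y_train_pred),
        "Recall_test": metrics.recall_score(y_test, y_test_pred),
        "Accuracy_train": metrics.accuracy_score(y_train, y_train_pred),
        "Accuracy_test": metrics.accuracy_score(y_test, y_test_pred),
        "F1_train": metrics.f1_score(y_train, y_train_pred),
        "F1_test": metrics.f1_score(y_test, y_test_pred),
        "ROC_AUC_test": metrics.roc_auc_score(y_test, y_test_proba),
        "Cohen_kappa_test": metrics.cohen_kappa_score(y_test, y_test_pred),
        "MCC_test": metrics.matthews_corrcoef(y_test, y_test_pred),
        "Confusion_matrix": metrics.confusion_matrix(y_test, y_test_pred),
    }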

Summary table of quality metrics for the classification models

Documentation: https://scikit-learn.org/1.5/modules/model_evaluation.html

Confusion matrices

In [7]:
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

_, ax = plt.subplots(len(class_models) // 2, 2, figsize=(12, 10), sharex=False, sharey=False)
for index, key in enumerate(class_models.keys()):
    c_matrix = class_models[key]["Confusion_matrix"]
    disp = ConfusionMatrixDisplay(
        confusion_matrix=c_matrix, display_labels=["Died", "Survived"]
    ).plot(ax=ax.flat[index])
    disp.ax_.set_title(key)

plt.subplots_adjust(top=1, bottom=0, hspace=0.4, wspace=0.1)
plt.show()
[Figure: confusion matrices for the eight classifiers]

Precision, recall, accuracy, F1 score
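
For a binary confusion matrix with counts TP, TN, FP, FN, the metrics in the table are:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1 = 2 · Precision · Recall / (Precision + Recall)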

In [8]:
class_metrics = pd.DataFrame.from_dict(class_models, "index")[
    [
        "Precision_train",
        "Precision_test",
        "Recall_train",
        "Recall_test",
        "Accuracy_train",
        "Accuracy_test",
        "F1_train",
        "F1_test",
    ]
]
class_metrics.sort_values(
    by="Accuracy_test", ascending=False
).style.background_gradient(
    cmap="plasma",
    low=0.3,
    high=1,
    subset=["Accuracy_train", "Accuracy_test", "F1_train", "F1_test"],
).background_gradient(
    cmap="viridis",
    low=1,
    high=0.3,
    subset=[
        "Precision_train",
        "Precision_test",
        "Recall_train",
        "Recall_test",
    ],
)
Out[8]:
  Precision_train Precision_test Recall_train Recall_test Accuracy_train Accuracy_test F1_train F1_test
random_forest 0.894340 0.794118 0.868132 0.782609 0.910112 0.837989 0.881041 0.788321
gradient_boosting 0.889764 0.800000 0.827839 0.753623 0.894663 0.832402 0.857685 0.776119
logistic 0.751880 0.806452 0.732601 0.724638 0.804775 0.826816 0.742115 0.763359
decision_tree 0.852459 0.839286 0.761905 0.681159 0.858146 0.826816 0.804642 0.752000
knn 0.829167 0.827586 0.728938 0.695652 0.838483 0.826816 0.775828 0.755906
ridge 0.720395 0.688312 0.802198 0.768116 0.804775 0.776536 0.759099 0.726027
naive_bayes 0.554524 0.575472 0.875458 0.884058 0.682584 0.703911 0.678977 0.697143
mlp 0.900000 0.833333 0.197802 0.217391 0.683989 0.681564 0.324324 0.344828

ROC AUC, Cohen's kappa, Matthews correlation coefficient
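
For reference: Cohen's kappa = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement (accuracy) and p_e is the agreement expected by chance; the Matthews correlation coefficient is MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)). Both are more robust to class imbalance than plain accuracy.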

In [9]:
class_metrics = pd.DataFrame.from_dict(class_models, "index")[
    [
        "Accuracy_test",
        "F1_test",
        "ROC_AUC_test",
        "Cohen_kappa_test",
        "MCC_test",
    ]
]
class_metrics.sort_values(by="ROC_AUC_test", ascending=False).style.background_gradient(
    cmap="plasma",
    low=0.3,
    high=1,
    subset=[
        "ROC_AUC_test",
        "MCC_test",
        "Cohen_kappa_test",
    ],
).background_gradient(
    cmap="viridis",
    low=1,
    high=0.3,
    subset=[
        "Accuracy_test",
        "F1_test",
    ],
)
Out[9]:
  Accuracy_test F1_test ROC_AUC_test Cohen_kappa_test MCC_test
random_forest 0.837989 0.788321 0.858893 0.657111 0.657157
logistic 0.826816 0.763359 0.854084 0.627409 0.629641
ridge 0.776536 0.726027 0.851054 0.538303 0.540613
gradient_boosting 0.832402 0.776119 0.850922 0.642381 0.643113
knn 0.826816 0.755906 0.838735 0.623260 0.628905
decision_tree 0.826816 0.752000 0.794137 0.621151 0.629142
naive_bayes 0.703911 0.697143 0.785903 0.431814 0.470403
mlp 0.681564 0.344828 0.712714 0.220490 0.307678
In [10]:
best_model = str(class_metrics.sort_values(by="MCC_test", ascending=False).iloc[0].name)

display(best_model)
'random_forest'

Displaying the misclassified test samples for review

In [11]:
preprocessing_result = pipeline_end.transform(X_test)
preprocessed_df = pd.DataFrame(
    preprocessing_result,
    columns=pipeline_end.get_feature_names_out(),
)

y_pred = class_models[best_model]["preds"]

error_index = y_test[y_test["Survived"] != y_pred].index.tolist()
display(f"Error items count: {len(error_index)}")

error_predicted = pd.Series(y_pred, index=y_test.index).loc[error_index]
error_df = X_test.loc[error_index].copy()
error_df.insert(loc=1, column="Predicted", value=error_predicted)
error_df.sort_index()
'Error items count: 29'
Out[11]:
Survived Predicted Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
26 1 0 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 NaN S
72 0 1 3 Goodwin, Miss. Lillian Amy female 16.0 5 2 CA 2144 46.9000 NaN S
103 0 1 1 White, Mr. Richard Frasar male 21.0 0 1 35281 77.2875 D26 S
108 1 0 3 Moss, Mr. Albert Johan male NaN 0 0 312991 7.7750 NaN S
128 1 0 3 Madsen, Mr. Fridtjof Arne male 24.0 0 0 C 17369 7.1417 NaN S
193 1 0 3 Andersen-Jensen, Miss. Carla Christine Nielsine female 19.0 1 0 350046 7.8542 NaN S
241 0 1 3 Zabour, Miss. Thamine female NaN 1 0 2665 14.4542 NaN C
272 1 0 3 Tornquist, Mr. William Henry male 25.0 0 0 LINE 0.0000 NaN S
293 0 1 2 Levy, Mr. Rene Jacques male 36.0 0 0 SC/Paris 2163 12.8750 D C
352 0 1 1 Williams-Lambert, Mr. Fletcher Fellows male NaN 0 0 113510 35.0000 C128 S
358 0 1 2 Funk, Miss. Annie Clemmer female 38.0 0 0 237671 13.0000 NaN S
378 0 1 1 Widener, Mr. Harry Elkins male 27.0 0 2 113503 211.5000 C82 C
445 1 0 3 Johannesen-Bratthammer, Mr. Bernt male NaN 0 0 65306 8.1125 NaN S
450 1 0 1 Peuchen, Major. Arthur Godfrey male 52.0 0 0 113786 30.5000 C104 S
508 1 0 1 Bradley, Mr. George ("George Arthur Brayton") male NaN 0 0 111427 26.5500 NaN S
511 1 0 3 Daly, Mr. Eugene Patrick male 29.0 0 0 382651 7.7500 NaN Q
570 1 0 3 Jonsson, Mr. Carl male 32.0 0 0 350417 7.8542 NaN S
579 0 1 3 Caram, Mrs. Joseph (Maria Elias) female NaN 1 0 2689 14.4583 NaN C
584 0 1 1 Ross, Mr. John Hugo male 36.0 0 0 13049 40.1250 A10 C
588 1 0 1 Frolicher-Stehli, Mr. Maxmillian male 60.0 1 1 13567 79.2000 B41 C
618 0 1 3 Lobb, Mrs. William Arthur (Cordelia K Stanlick) female 26.0 1 0 A/5. 3336 16.1000 NaN S
658 0 1 3 Bourke, Mrs. John (Catherine) female 32.0 1 1 364849 15.5000 NaN Q
661 1 0 1 Frauenthal, Dr. Henry William male 50.0 2 0 PC 17611 133.6500 NaN S
674 1 0 2 Wilhelms, Mr. Charles male 31.0 0 0 244270 13.0000 NaN S
745 1 0 3 Stranden, Mr. Juho male 31.0 0 0 STON/O 2. 3101288 7.9250 NaN S
773 0 1 2 Mack, Mrs. (Mary) female 57.0 0 0 S.O./P.P. 3 10.5000 E77 S
807 0 1 1 Andrews, Mr. Thomas Jr male 39.0 0 0 112050 0.0000 A36 S
814 0 1 3 Andersson, Miss. Ebba Iris Alfrida female 6.0 4 2 347082 31.2750 NaN S
829 1 0 3 McCormack, Mr. Thomas Joseph male NaN 0 0 367228 7.7500 NaN Q

Example of using the trained model (pipeline) for prediction

In [12]:
model = class_models[best_model]["pipeline"]

example_id = 450
test = pd.DataFrame(X_test.loc[example_id, :]).T
test_preprocessed = pd.DataFrame(preprocessed_df.loc[example_id, :]).T
display(test)
display(test_preprocessed)
result_proba = model.predict_proba(test)[0]
result = model.predict(test)[0]
real = int(y_test.loc[example_id].values[0])
display(f"predicted: {result} (proba: {result_proba})")
display(f"real: {real}")
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
450 1 1 Peuchen, Major. Arthur Godfrey male 52.0 0 0 113786 30.5 C104 S
Cabin_type_B Cabin_type_C Cabin_type_D Cabin_type_E Cabin_type_F Cabin_type_G Cabin_type_T Cabin_type_u Is_married Pclass Age SibSp Sex_male
450 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -1.580088 1.749939 -0.473465 1.0
'predicted: 0 (proba: [0.91145747 0.08854253])'
'real: 1'
In [13]:
from sklearn.model_selection import GridSearchCV

optimized_model_type = "random_forest"

random_forest_model = class_models[optimized_model_type]["pipeline"]

param_grid = {
    "model__n_estimators": [10, 20, 30, 40, 50, 100, 150, 200, 250, 500],
    "model__max_features": ["sqrt", "log2", 2],
    "model__max_depth": [2, 3, 4, 5, 6, 7, 8, 9 ,10],
    "model__criterion": ["gini", "entropy", "log_loss"],
}

gs_optimizer = GridSearchCV(
    estimator=random_forest_model, param_grid=param_grid, n_jobs=-1
)
gs_optimizer.fit(X_train, y_train.values.ravel())
gs_optimizer.best_params_
c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\numpy\ma\core.py:2881: RuntimeWarning: invalid value encountered in cast
  _data = np.array(data, dtype=dtype, copy=copy,
Out[13]:
{'model__criterion': 'gini',
 'model__max_depth': 7,
 'model__max_features': 'sqrt',
 'model__n_estimators': 30}
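
The mean cross-validated score of the winning configuration is also available; for a classifier, GridSearchCV scores with the estimator's default metric (accuracy) unless a scoring argument is given:

# Mean cross-validated accuracy of the best parameter combination
gs_optimizer.best_score_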

Training the model with the new hyperparameters

In [14]:
pipeline = gs_optimizer.best_estimator_.fit(X_train, y_train.values.ravel())

result = run_classification(pipeline, X_train, X_test, y_train, y_test)

Assembling the data for comparing the old and the new version of the model

In [15]:
optimized_metrics = pd.DataFrame(columns=list(result.keys()))
optimized_metrics.loc[len(optimized_metrics)] = pd.Series(
    data=class_models[optimized_model_type]
)
optimized_metrics.loc[len(optimized_metrics)] = pd.Series(
    data=result
)
optimized_metrics.insert(loc=0, column="Name", value=["Old", "New"])
optimized_metrics = optimized_metrics.set_index("Name")

Comparing the metrics of the old and the new model

In [16]:
optimized_metrics[
    [
        "Precision_train",
        "Precision_test",
        "Recall_train",
        "Recall_test",
        "Accuracy_train",
        "Accuracy_test",
        "F1_train",
        "F1_test",
    ]
].style.background_gradient(
    cmap="plasma",
    low=0.3,
    high=1,
    subset=["Accuracy_train", "Accuracy_test", "F1_train", "F1_test"],
).background_gradient(
    cmap="viridis",
    low=1,
    high=0.3,
    subset=[
        "Precision_train",
        "Precision_test",
        "Recall_train",
        "Recall_test",
    ],
)
Out[16]:
  Precision_train Precision_test Recall_train Recall_test Accuracy_train Accuracy_test F1_train F1_test
Name                
Old 0.894340 0.794118 0.868132 0.782609 0.910112 0.837989 0.881041 0.788321
New 0.800699 0.777778 0.838828 0.811594 0.858146 0.837989 0.819320 0.794326
In [17]:
optimized_metrics[
    [
        "Accuracy_test",
        "F1_test",
        "ROC_AUC_test",
        "Cohen_kappa_test",
        "MCC_test",
    ]
].style.background_gradient(
    cmap="plasma",
    low=0.3,
    high=1,
    subset=[
        "ROC_AUC_test",
        "MCC_test",
        "Cohen_kappa_test",
    ],
).background_gradient(
    cmap="viridis",
    low=1,
    high=0.3,
    subset=[
        "Accuracy_test",
        "F1_test",
    ],
)
Out[17]:
  Accuracy_test F1_test ROC_AUC_test Cohen_kappa_test MCC_test
Name          
Old 0.837989 0.788321 0.858893 0.657111 0.657157
New 0.837989 0.794326 0.866140 0.660785 0.661193
In [18]:
_, ax = plt.subplots(1, 2, figsize=(10, 4), sharex=False, sharey=False)

for index in range(len(optimized_metrics)):
    c_matrix = optimized_metrics.iloc[index]["Confusion_matrix"]
    ConfusionMatrixDisplay(
        confusion_matrix=c_matrix, display_labels=["Died", "Survived"]
    ).plot(ax=ax.flat[index])

plt.subplots_adjust(top=1, bottom=0, hspace=0.4, wspace=0.3)
plt.show()
[Figure: confusion matrices for the old and new random forest models]

Regression

Loading the data

In [19]:
import pandas as pd

density_train = pd.read_csv("data/density/density_train.csv", sep=";", decimal=",")
density_test = pd.read_csv("data/density/density_test.csv", sep=";", decimal=",")

display(density_train.head(3))
display(density_test.head(3))
T Al2O3 TiO2 Density
0 20 0.0 0.0 1.06250
1 25 0.0 0.0 1.05979
2 35 0.0 0.0 1.05404
T Al2O3 TiO2 Density
0 30 0.00 0.0 1.05696
1 55 0.00 0.0 1.04158
2 25 0.05 0.0 1.08438

Forming the feature and target samples

In [20]:
density_y_train = pd.DataFrame(density_train["Density"], columns=["Density"])
density_train = density_train.drop(["Density"], axis=1)

display(density_train.head(3))
display(density_y_train.head(3))

density_y_test = pd.DataFrame(density_test["Density"], columns=["Density"])
density_test = density_test.drop(["Density"], axis=1)

display(density_test.head(3))
display(density_y_test.head(3))
T Al2O3 TiO2
0 20 0.0 0.0
1 25 0.0 0.0
2 35 0.0 0.0
Density
0 1.06250
1 1.05979
2 1.05404
T Al2O3 TiO2
0 30 0.00 0.0
1 55 0.00 0.0
2 25 0.05 0.0
Density
0 1.05696
1 1.04158
2 1.08438

Defining the list of algorithms for the approximation (regression) task

In [21]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model, tree, neighbors, ensemble, neural_network

random_state = 9

models = {
    "linear": {"model": linear_model.LinearRegression(n_jobs=-1)},
    "linear_poly": {
        "model": make_pipeline(
            PolynomialFeatures(degree=2),
            linear_model.LinearRegression(fit_intercept=False, n_jobs=-1),
        )
    },
    "linear_interact": {
        "model": make_pipeline(
            PolynomialFeatures(interaction_only=True),
            linear_model.LinearRegression(fit_intercept=False, n_jobs=-1),
        )
    },
    "ridge": {"model": linear_model.RidgeCV()},
    "decision_tree": {
        "model": tree.DecisionTreeRegressor(max_depth=7, random_state=random_state)
    },
    "knn": {"model": neighbors.KNeighborsRegressor(n_neighbors=7, n_jobs=-1)},
    "random_forest": {
        "model": ensemble.RandomForestRegressor(
            max_depth=7, random_state=random_state, n_jobs=-1
        )
    },
    "mlp": {
        "model": neural_network.MLPRegressor(
            activation="tanh",
            hidden_layer_sizes=(3,),
            max_iter=500,
            early_stopping=True,
            random_state=random_state,
        )
    },
}
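
To illustrate what the polynomial models see: PolynomialFeatures(degree=2) expands the three inputs into all monomials up to degree two, while interaction_only=True keeps only the bias, the original features, and their pairwise products. A quick check:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
poly.fit(np.array([[20.0, 0.05, 0.0]]))
# ['1' 'T' 'Al2O3' 'TiO2' 'T^2' 'T Al2O3' 'T TiO2' 'Al2O3^2' 'Al2O3 TiO2' 'TiO2^2']
print(poly.get_feature_names_out(["T", "Al2O3", "TiO2"]))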

Training and evaluating models with the different algorithms

In [22]:
from src.utils import run_regression

for model_name in models.keys():
    print(f"Model: {model_name}")
    X_train = density_train
    X_test = density_test
    y_train = density_y_train
    y_test = density_y_test

    model = models[model_name]["model"]
    fitted_model = model.fit(
        X_train.values, density_y_train.values.ravel()
    )

    models[model_name] = run_regression(fitted_model, X_train, X_test, y_train, y_test)
Model: linear
Model: linear_poly
Model: linear_interact
Model: ridge
Model: decision_tree
Model: knn
Model: random_forest
Model: mlp
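
run_regression, like run_classification, comes from src.utils. A sketch under the assumptions that RMSE is the root mean squared error, RMAE the square root of the mean absolute error, and that the keys train_preds and preds hold the predictions used below (hypothetical reconstruction):

import math
from sklearn import metrics

def run_regression(fitted_model, X_train, X_test, y_train, y_test):
    y_train, y_test = y_train.values.ravel(), y_test.values.ravel()
    # Predict on raw arrays, matching how the models were fitted
    y_train_pred = fitted_model.predict(X_train.values)
    y_test_pred = fitted_model.predict(X_test.values)
    return {
        "fitted": fitted_model,
        "train_preds": y_train_pred,
        "preds": y_test_pred,
        "RMSE_train": math.sqrt(metrics.mean_squared_error(y_train, y_train_pred)),
        "RMSE_test": math.sqrt(metrics.mean_squared_error(y_test, y_test_pred)),
        # assuming RMAE = sqrt(mean absolute error)
        "RMAE_test": math.sqrt(metrics.mean_absolute_error(y_test, y_test_pred)),
        "R2_test": metrics.r2_score(y_test, y_test_pred),
    }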

Displaying the evaluation results

In [23]:
reg_metrics = pd.DataFrame.from_dict(models, "index")[
    ["RMSE_train", "RMSE_test", "RMAE_test", "R2_test"]
]
reg_metrics.sort_values(by="RMSE_test").style.background_gradient(
    cmap="viridis", low=1, high=0.3, subset=["RMSE_train", "RMSE_test"]
).background_gradient(cmap="plasma", low=0.3, high=1, subset=["RMAE_test", "R2_test"])
Out[23]:
  RMSE_train RMSE_test RMAE_test R2_test
linear_poly 0.000319 0.000362 0.016643 0.999965
linear_interact 0.001131 0.001491 0.033198 0.999413
linear 0.002464 0.003261 0.049891 0.997191
random_forest 0.002716 0.005575 0.067298 0.991788
decision_tree 0.000346 0.006433 0.076138 0.989067
ridge 0.013989 0.015356 0.116380 0.937703
knn 0.053108 0.056776 0.217611 0.148414
mlp 0.095734 0.099654 0.270371 -1.623554

Showing the actual and "predicted" values for the training and test samples

Selecting the best model

In [24]:
best_model = str(reg_metrics.sort_values(by="RMSE_test").iloc[0].name)

display(best_model)
'linear_poly'

Output for the training sample

In [25]:
pd.concat(
    [
        density_train,
        density_y_train,
        pd.Series(
            models[best_model]["train_preds"],
            index=density_y_train.index,
            name="DensityPred",
        ),
    ],
    axis=1,
).head(5)
Out[25]:
T Al2O3 TiO2 Density DensityPred
0 20 0.0 0.0 1.06250 1.063174
1 25 0.0 0.0 1.05979 1.060117
2 35 0.0 0.0 1.05404 1.053941
3 40 0.0 0.0 1.05103 1.050822
4 45 0.0 0.0 1.04794 1.047683

Output for the test sample

In [26]:
pd.concat(
    [
        density_test,
        density_y_test,
        pd.Series(
            models[best_model]["preds"],
            index=density_y_test.index,
            name="DensityPred",
        ),
    ],
    axis=1,
).head(5)
Out[26]:
T Al2O3 TiO2 Density DensityPred
0 30 0.00 0.0 1.05696 1.057040
1 55 0.00 0.0 1.04158 1.041341
2 25 0.05 0.0 1.08438 1.084063
3 30 0.05 0.0 1.08112 1.080764
4 35 0.05 0.0 1.07781 1.077444