MII/lec4_1.ipynb
2024-11-15 23:06:57 +04:00

220 KiB
Raw Blame History

Загрузка набора данных

In [1]:
import pandas as pd

from sklearn import set_config

set_config(transform_output="pandas")

random_state=9

df = pd.read_csv("data/healthcare.csv", index_col="id")

df
Out[1]:
gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
id
9046 Male 67.0 0 1 Yes Private Urban 228.69 36.6 formerly smoked 1
51676 Female 61.0 0 0 Yes Self-employed Rural 202.21 NaN never smoked 1
31112 Male 80.0 0 1 Yes Private Rural 105.92 32.5 never smoked 1
60182 Female 49.0 0 0 Yes Private Urban 171.23 34.4 smokes 1
1665 Female 79.0 1 0 Yes Self-employed Rural 174.12 24.0 never smoked 1
... ... ... ... ... ... ... ... ... ... ... ...
18234 Female 80.0 1 0 Yes Private Urban 83.75 NaN never smoked 0
44873 Female 81.0 0 0 Yes Self-employed Urban 125.20 40.0 never smoked 0
19723 Female 35.0 0 0 Yes Self-employed Rural 82.99 30.6 never smoked 0
37544 Male 51.0 0 0 Yes Private Rural 166.29 25.6 formerly smoked 0
44679 Female 44.0 0 0 Yes Govt_job Urban 85.28 26.2 Unknown 0

5110 rows × 11 columns

Разделение набора данных на обучающую и тестовые выборки (80/20) для задачи классификации

Целевой признак -- heart_disease - есть ли заболевания сердца . x - полная выборка, y - gear box столбец

In [2]:
from utils import split_stratified_into_train_val_test

X_train, X_val, X_test, y_train, y_val, y_test = split_stratified_into_train_val_test(
    df, stratify_colname="heart_disease", frac_train=0.80, frac_val=0, frac_test=0.20, random_state=random_state
)

display("X_train", X_train)
display("y_train", y_train)

display("X_test", X_test)
display("y_test", y_test)
'X_train'
gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
id
17762 Female 3.0 0 0 No children Rural 114.88 19.1 Unknown 0
48652 Female 8.0 0 0 No children Urban 83.55 22.4 Unknown 0
6903 Female 15.0 0 0 No children Rural 77.57 18.3 Unknown 0
2903 Female 35.0 0 0 No Private Rural 123.83 23.8 never smoked 0
58153 Female 18.0 0 0 No Private Urban 123.66 22.2 never smoked 0
... ... ... ... ... ... ... ... ... ... ... ...
34084 Male 7.0 0 0 No children Urban 77.12 18.6 Unknown 0
11176 Male 9.0 0 0 No children Rural 85.02 16.3 Unknown 0
52554 Male 19.0 0 0 No Private Rural 64.92 22.5 Unknown 0
10381 Female 38.0 1 0 Yes Self-employed Urban 91.00 33.3 never smoked 0
70884 Female 34.0 0 0 Yes Private Urban 79.80 37.4 smokes 0

4088 rows × 11 columns

'y_train'
heart_disease
id
17762 0
48652 0
6903 0
2903 0
58153 0
... ...
34084 0
11176 0
52554 0
10381 0
70884 0

4088 rows × 1 columns

'X_test'
gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
id
2520 Female 26.0 0 0 Yes Private Rural 84.90 26.2 never smoked 0
56855 Male 46.0 0 0 Yes Private Urban 137.77 29.3 never smoked 0
27034 Female 65.0 0 0 Yes Govt_job Urban 82.72 29.8 smokes 0
641 Male 52.0 0 0 Yes Govt_job Rural 87.26 40.1 smokes 0
65407 Female 64.0 0 0 Yes Self-employed Rural 65.46 32.5 formerly smoked 0
... ... ... ... ... ... ... ... ... ... ... ...
40447 Female 59.0 0 0 Yes Private Rural 82.42 28.8 never smoked 0
56324 Female 53.0 0 0 Yes Self-employed Rural 81.76 34.3 formerly smoked 0
4813 Male 27.0 0 0 No Private Urban 112.98 44.7 never smoked 0
14372 Male 50.0 0 0 Yes Self-employed Urban 192.16 43.6 never smoked 0
50522 Female 72.0 0 0 Yes Govt_job Urban 131.41 28.4 never smoked 1

1022 rows × 11 columns

'y_test'
heart_disease
id
2520 0
56855 0
27034 0
641 0
65407 0
... ...
40447 0
56324 0
4813 0
14372 0
50522 0

1022 rows × 1 columns

В итоге, этот код выполняет следующие действия:

  • Заполняет пропущенные значения: В числовых столбцах медианой, в категориальных - значением "unknown".
  • Стандартизирует числовые данные: приводит их к нулевому среднему и единичному стандартному отклонению.
  • Преобразует категориальные данные: использует one-hot-кодирование.
  • Удаляет ненужные столбцы: из списка columns_to_drop.

Формирование конвейера для классификации данных

preprocessing_num -- конвейер для обработки числовых данных: заполнение пропущенных значений и стандартизация

preprocessing_cat -- конвейер для обработки категориальных данных: заполнение пропущенных данных и унитарное кодирование

features_preprocessing -- трансформер для предобработки признаков

features_engineering -- трансформер для конструирования признаков

drop_columns -- трансформер для удаления колонок

features_postprocessing -- трансформер для унитарного кодирования новых признаков

pipeline_end -- основной конвейер предобработки данных и конструирования признаков

Конвейер выполняется последовательно.

Трансформер выполняет параллельно для указанного набора колонок.

Документация:

https://scikit-learn.org/1.5/api/sklearn.pipeline.html

https://scikit-learn.org/1.5/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer

In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.discriminant_analysis import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

from transformers import TitanicFeatures


columns_to_drop = []
#columns_to_drop = ["Doors", "Color", "Gear box type", "Prod_year", "Mileage", "Airbags", "Levy", "Leather_interior", "Fuel type", "Drive wheels"]
num_columns = [
    column
    for column in df.columns
    if column not in columns_to_drop and df[column].dtype != "object"
]
cat_columns = [
    column
    for column in df.columns
    if column not in columns_to_drop and df[column].dtype == "object"
]

num_imputer = SimpleImputer(strategy="median")
num_scaler = StandardScaler()
preprocessing_num = Pipeline(
    [
        ("imputer", num_imputer),
        ("scaler", num_scaler),
    ]
)

cat_imputer = SimpleImputer(strategy="constant", fill_value="unknown")
cat_encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop="first")
preprocessing_cat = Pipeline(
    [
        ("imputer", cat_imputer),
        ("encoder", cat_encoder),
    ]
)

features_preprocessing = ColumnTransformer(
    verbose_feature_names_out=False,
    transformers=[
        ("prepocessing_num", preprocessing_num, num_columns),
        ("prepocessing_cat", preprocessing_cat, cat_columns),
        #("prepocessing_features", cat_imputer, ["Name", "Cabin"]),
    ],
    remainder="passthrough"
)

# features_engineering = ColumnTransformer(
#     verbose_feature_names_out=False,
#     transformers=[
#         ("add_features", TitanicFeatures(), ["Name", "Cabin"]),
#     ],
#     remainder="passthrough",
# )

drop_columns = ColumnTransformer(
    verbose_feature_names_out=False,
    transformers=[
        ("drop_columns", "drop", columns_to_drop),
    ],
    remainder="passthrough",
)

# features_postprocessing = ColumnTransformer(
#     verbose_feature_names_out=False,
#     transformers=[
#         ("prepocessing_cat", preprocessing_cat, ["Cabin_type"]),
#     ],
#     remainder="passthrough",
# )

pipeline_end = Pipeline(
    [
        ("features_preprocessing", features_preprocessing),
       # ("features_engineering", features_engineering),
        ("drop_columns", drop_columns),
       # ("features_postprocessing", features_postprocessing),
    ]
)

Демонстрация работы конвейера для предобработки данных при классификации

In [4]:
preprocessing_result = pipeline_end.fit_transform(X_train)
preprocessed_df = pd.DataFrame(
    preprocessing_result,
    columns=pipeline_end.get_feature_names_out(),
)

preprocessed_df
Out[4]:
age hypertension heart_disease avg_glucose_level bmi stroke gender_Male ever_married_Yes work_type_Never_worked work_type_Private work_type_Self-employed work_type_children Residence_type_Urban smoking_status_formerly smoked smoking_status_never smoked smoking_status_smokes
id
17762 -1.767059 -0.331155 -0.239061 0.202705 -1.260486 -0.221387 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
48652 -1.546212 -0.331155 -0.239061 -0.493692 -0.834786 -0.221387 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0
6903 -1.237027 -0.331155 -0.239061 -0.626614 -1.363686 -0.221387 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
2903 -0.353640 -0.331155 -0.239061 0.401643 -0.654186 -0.221387 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
58153 -1.104519 -0.331155 -0.239061 0.397865 -0.860586 -0.221387 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
34084 -1.590381 -0.331155 -0.239061 -0.636617 -1.324986 -0.221387 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0
11176 -1.502043 -0.331155 -0.239061 -0.461017 -1.621686 -0.221387 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
52554 -1.060349 -0.331155 -0.239061 -0.907796 -0.821886 -0.221387 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
10381 -0.221132 3.019737 -0.239061 -0.328095 0.571314 -0.221387 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0
70884 -0.397810 -0.331155 -0.239061 -0.577046 1.100214 -0.221387 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0

4088 rows × 16 columns

Формирование набора моделей для классификации

logistic -- логистическая регрессия

ridge -- гребневая регрессия

decision_tree -- дерево решений

knn -- k-ближайших соседей

naive_bayes -- наивный Байесовский классификатор

gradient_boosting -- метод градиентного бустинга (набор деревьев решений)

random_forest -- метод случайного леса (набор деревьев решений)

mlp -- многослойный персептрон (нейронная сеть)

Документация: https://scikit-learn.org/1.5/supervised_learning.html

In [5]:
from sklearn import ensemble, linear_model, naive_bayes, neighbors, neural_network, tree

class_models = {
    "logistic": {"model": linear_model.LogisticRegression()},
    # "ridge": {"model": linear_model.RidgeClassifierCV(cv=5, class_weight="balanced")},
    "ridge": {"model": linear_model.LogisticRegression(penalty="l2", class_weight="balanced")},
    "decision_tree": {
        "model": tree.DecisionTreeClassifier(max_depth=7, random_state=random_state)
    },
    "knn": {"model": neighbors.KNeighborsClassifier(n_neighbors=7)},
    "naive_bayes": {"model": naive_bayes.GaussianNB()},
    "gradient_boosting": {
        "model": ensemble.GradientBoostingClassifier(n_estimators=210)
    },
    "random_forest": {
        "model": ensemble.RandomForestClassifier(
            max_depth=11, class_weight="balanced", random_state=random_state
        )
    },
    "mlp": {
        "model": neural_network.MLPClassifier(
            hidden_layer_sizes=(7,),
            max_iter=100000,
            early_stopping=True,
            random_state=random_state,
        )
    },
}
In [6]:
# print(y_train.dtypes)
# print(y_test.dtypes)
# df.info()
print(y_train.head())
print(y_test.head())
       heart_disease
id                  
17762              0
48652              0
6903               0
2903               0
58153              0
       heart_disease
id                  
2520               0
56855              0
27034              0
641                0
65407              0

Обучение моделей на обучающем наборе данных и оценка на тестовом

In [13]:
import numpy as np
from sklearn import metrics

for model_name in class_models.keys():
    print(f"Model: {model_name}")
    model = class_models[model_name]["model"]

    model_pipeline = Pipeline([("pipeline", pipeline_end), ("model", model)])
    model_pipeline = model_pipeline.fit(X_train, y_train.values.ravel())

    y_train_predict = model_pipeline.predict(X_train)
    y_test_probs = model_pipeline.predict_proba(X_test)[:, 1]
    y_test_predict = np.where(y_test_probs > 0.5, 1, 0)

    class_models[model_name]["pipeline"] = model_pipeline
    class_models[model_name]["probs"] = y_test_probs
    class_models[model_name]["preds"] = y_test_predict

    print(y_train_predict)
    print(y_test_predict)
    class_models[model_name]["Precision_train"] = metrics.precision_score(
        y_train, y_train_predict, 
    )
    class_models[model_name]["Precision_test"] = metrics.precision_score(
        y_test, y_test_predict, 
    )
    class_models[model_name]["Recall_train"] = metrics.recall_score(
        y_train, y_train_predict, 
    )
    class_models[model_name]["Recall_test"] = metrics.recall_score(
        y_test, y_test_predict
    )
    class_models[model_name]["Accuracy_train"] = metrics.accuracy_score(
        y_train, y_train_predict
    )
    class_models[model_name]["Accuracy_test"] = metrics.accuracy_score(
        y_test, y_test_predict
    )
    class_models[model_name]["ROC_AUC_test"] = metrics.roc_auc_score(
        y_test, y_test_probs
    )
    class_models[model_name]["F1_train"] = metrics.f1_score(y_train, y_train_predict)
    class_models[model_name]["F1_test"] = metrics.f1_score(y_test, y_test_predict)
    class_models[model_name]["MCC_test"] = metrics.matthews_corrcoef(
        y_test, y_test_predict
    )
    class_models[model_name]["Cohen_kappa_test"] = metrics.cohen_kappa_score(
        y_test, y_test_predict
    )
    class_models[model_name]["Confusion_matrix"] = metrics.confusion_matrix(
        y_test, y_test_predict
    )
Model: logistic
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
Model: ridge
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
Model: decision_tree
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
Model: knn
c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\sklearn\preprocessing\_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\sklearn\preprocessing\_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\sklearn\preprocessing\_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\sklearn\preprocessing\_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\sklearn\preprocessing\_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
Model: naive_bayes
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
Model: gradient_boosting
c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\sklearn\preprocessing\_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
Model: random_forest
c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\sklearn\preprocessing\_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\sklearn\preprocessing\_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
Model: mlp
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]

Сводная таблица оценок качества для использованных моделей классификации

Документация: https://scikit-learn.org/1.5/modules/model_evaluation.html

Матрица неточностей

In [8]:
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

_, ax = plt.subplots(int(len(class_models) / 2), 2, figsize=(12, 10), sharex=False, sharey=False)
for index, key in enumerate(class_models.keys()):
    c_matrix = class_models[key]["Confusion_matrix"]
    disp = ConfusionMatrixDisplay(
        confusion_matrix=c_matrix, display_labels=["bad heart", "nice heart"]
    ).plot(ax=ax.flat[index])
    disp.ax_.set_title(key)

plt.subplots_adjust(top=1, bottom=0, hspace=0.4, wspace=0.1)
plt.show()
No description has been provided for this image

Точность, полнота, верность (аккуратность), F-мера

In [9]:
class_metrics = pd.DataFrame.from_dict(class_models, "index")[
    [
        "Precision_train",
        "Precision_test",
        "Recall_train",
        "Recall_test",
        "Accuracy_train",
        "Accuracy_test",
        "F1_train",
        "F1_test",
    ]
]
class_metrics.sort_values(
    by="Accuracy_test", ascending=False
).style.background_gradient(
    cmap="plasma",
    low=0.3,
    high=1,
    subset=["Accuracy_train", "Accuracy_test", "F1_train", "F1_test"],
).background_gradient(
    cmap="viridis",
    low=1,
    high=0.3,
    subset=[
        "Precision_train",
        "Precision_test",
        "Recall_train",
        "Recall_test",
    ],
)
Out[9]:
  Precision_train Precision_test Recall_train Recall_test Accuracy_train Accuracy_test F1_train F1_test
logistic 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
ridge 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
decision_tree 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
knn 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
naive_bayes 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
gradient_boosting 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
random_forest 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
mlp 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

ROC-кривая, каппа Коэна, коэффициент корреляции Мэтьюса

In [10]:
class_metrics = pd.DataFrame.from_dict(class_models, "index")[
    [
        "Accuracy_test",
        "F1_test",
        "ROC_AUC_test",
        "Cohen_kappa_test",
        "MCC_test",
    ]
]
class_metrics.sort_values(by="ROC_AUC_test", ascending=False).style.background_gradient(
    cmap="plasma",
    low=0.3,
    high=1,
    subset=[
        "ROC_AUC_test",
        "MCC_test",
        "Cohen_kappa_test",
    ],
).background_gradient(
    cmap="viridis",
    low=1,
    high=0.3,
    subset=[
        "Accuracy_test",
        "F1_test",
    ],
)
Out[10]:
  Accuracy_test F1_test ROC_AUC_test Cohen_kappa_test MCC_test
logistic 1.000000 1.000000 1.000000 1.000000 1.000000
ridge 1.000000 1.000000 1.000000 1.000000 1.000000
decision_tree 1.000000 1.000000 1.000000 1.000000 1.000000
knn 1.000000 1.000000 1.000000 1.000000 1.000000
naive_bayes 1.000000 1.000000 1.000000 1.000000 1.000000
gradient_boosting 1.000000 1.000000 1.000000 1.000000 1.000000
random_forest 1.000000 1.000000 1.000000 1.000000 1.000000
mlp 1.000000 1.000000 1.000000 1.000000 1.000000
In [11]:
best_model = str(class_metrics.sort_values(by="MCC_test", ascending=False).iloc[0].name)

display(best_model)
'logistic'

Вывод данных с ошибкой предсказания для оценки

In [12]:
preprocessing_result = pipeline_end.transform(X_test)
preprocessed_df = pd.DataFrame(
    preprocessing_result,
    columns=pipeline_end.get_feature_names_out(),
)

y_pred = class_models[best_model]["preds"]

error_index = y_test[y_test["Survived"] != y_pred].index.tolist()
display(f"Error items count: {len(error_index)}")

error_predicted = pd.Series(y_pred, index=y_test.index).loc[error_index]
error_df = X_test.loc[error_index].copy()
error_df.insert(loc=1, column="Predicted", value=error_predicted)
error_df.sort_index()
c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\sklearn\preprocessing\_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\pandas\core\indexes\base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas\\_libs\\hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas\\_libs\\hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Survived'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[12], line 9
      2 preprocessed_df = pd.DataFrame(
      3     preprocessing_result,
      4     columns=pipeline_end.get_feature_names_out(),
      5 )
      7 y_pred = class_models[best_model]["preds"]
----> 9 error_index = y_test[y_test["Survived"] != y_pred].index.tolist()
     10 display(f"Error items count: {len(error_index)}")
     12 error_predicted = pd.Series(y_pred, index=y_test.index).loc[error_index]

File c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\pandas\core\frame.py:4102, in DataFrame.__getitem__(self, key)
   4100 if self.columns.nlevels > 1:
   4101     return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
   4103 if is_integer(indexer):
   4104     indexer = [indexer]

File c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\pandas\core\indexes\base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 'Survived'

Пример использования обученной модели (конвейера) для предсказания

In [ ]:
model = class_models[best_model]["pipeline"]

example_id = 450
test = pd.DataFrame(X_test.loc[example_id, :]).T
test_preprocessed = pd.DataFrame(preprocessed_df.loc[example_id, :]).T
display(test)
display(test_preprocessed)
result_proba = model.predict_proba(test)[0]
result = model.predict(test)[0]
real = int(y_test.loc[example_id].values[0])
display(f"predicted: {result} (proba: {result_proba})")
display(f"real: {real}")
In [ ]:
from sklearn.model_selection import GridSearchCV

optimized_model_type = "random_forest"

random_forest_model = class_models[optimized_model_type]["pipeline"]

param_grid = {
    "model__n_estimators": [10, 20, 30, 40, 50, 100, 150, 200, 250, 500],
    "model__max_features": ["sqrt", "log2", 2],
    "model__max_depth": [2, 3, 4, 5, 6, 7, 8, 9 ,10],
    "model__criterion": ["gini", "entropy", "log_loss"],
}

gs_optomizer = GridSearchCV(
    estimator=random_forest_model, param_grid=param_grid, n_jobs=-1
)
gs_optomizer.fit(X_train, y_train.values.ravel())
gs_optomizer.best_params_

Обучение модели с новыми гиперпараметрами

In [ ]:
optimized_model = ensemble.RandomForestClassifier(
    random_state=random_state,
    criterion="gini",
    max_depth=7,
    max_features="sqrt",
    n_estimators=30,
)

result = {}

result["pipeline"] = Pipeline([("pipeline", pipeline_end), ("model", optimized_model)]).fit(X_train, y_train.values.ravel())
result["train_preds"] = result["pipeline"].predict(X_train)
result["probs"] = result["pipeline"].predict_proba(X_test)[:, 1]
result["preds"] = np.where(result["probs"] > 0.5, 1, 0)

result["Precision_train"] = metrics.precision_score(y_train, result["train_preds"])
result["Precision_test"] = metrics.precision_score(y_test, result["preds"])
result["Recall_train"] = metrics.recall_score(y_train, result["train_preds"])
result["Recall_test"] = metrics.recall_score(y_test, result["preds"])
result["Accuracy_train"] = metrics.accuracy_score(y_train, result["train_preds"])
result["Accuracy_test"] = metrics.accuracy_score(y_test, result["preds"])
result["ROC_AUC_test"] = metrics.roc_auc_score(y_test, result["probs"])
result["F1_train"] = metrics.f1_score(y_train, result["train_preds"])
result["F1_test"] = metrics.f1_score(y_test, result["preds"])
result["MCC_test"] = metrics.matthews_corrcoef(y_test, result["preds"])
result["Cohen_kappa_test"] = metrics.cohen_kappa_score(y_test, result["preds"])
result["Confusion_matrix"] = metrics.confusion_matrix(y_test, result["preds"])

Формирование данных для оценки старой и новой версии модели

In [ ]:
optimized_metrics = pd.DataFrame(columns=list(result.keys()))
optimized_metrics.loc[len(optimized_metrics)] = pd.Series(
    data=class_models[optimized_model_type]
)
optimized_metrics.loc[len(optimized_metrics)] = pd.Series(
    data=result
)
optimized_metrics.insert(loc=0, column="Name", value=["Old", "New"])
optimized_metrics = optimized_metrics.set_index("Name")

Оценка параметров старой и новой модели

In [ ]:
optimized_metrics[
    [
        "Precision_train",
        "Precision_test",
        "Recall_train",
        "Recall_test",
        "Accuracy_train",
        "Accuracy_test",
        "F1_train",
        "F1_test",
    ]
].style.background_gradient(
    cmap="plasma",
    low=0.3,
    high=1,
    subset=["Accuracy_train", "Accuracy_test", "F1_train", "F1_test"],
).background_gradient(
    cmap="viridis",
    low=1,
    high=0.3,
    subset=[
        "Precision_train",
        "Precision_test",
        "Recall_train",
        "Recall_test",
    ],
)
In [ ]:
optimized_metrics[
    [
        "Accuracy_test",
        "F1_test",
        "ROC_AUC_test",
        "Cohen_kappa_test",
        "MCC_test",
    ]
].style.background_gradient(
    cmap="plasma",
    low=0.3,
    high=1,
    subset=[
        "ROC_AUC_test",
        "MCC_test",
        "Cohen_kappa_test",
    ],
).background_gradient(
    cmap="viridis",
    low=1,
    high=0.3,
    subset=[
        "Accuracy_test",
        "F1_test",
    ],
)
In [14]:
_, ax = plt.subplots(1, 2, figsize=(10, 4), sharex=False, sharey=False
)

for index in range(0, len(optimized_metrics)):
    c_matrix = optimized_metrics.iloc[index]["Confusion_matrix"]
    disp = ConfusionMatrixDisplay(
        confusion_matrix=c_matrix, display_labels=["Died", "Sirvived"]
    ).plot(ax=ax.flat[index])

plt.subplots_adjust(top=1, bottom=0, hspace=0.4, wspace=0.3)
plt.show()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[14], line 4
      1 _, ax = plt.subplots(1, 2, figsize=(10, 4), sharex=False, sharey=False
      2 )
----> 4 for index in range(0, len(optimized_metrics)):
      5     c_matrix = optimized_metrics.iloc[index]["Confusion_matrix"]
      6     disp = ConfusionMatrixDisplay(
      7         confusion_matrix=c_matrix, display_labels=["Died", "Sirvived"]
      8     ).plot(ax=ax.flat[index])

NameError: name 'optimized_metrics' is not defined
No description has been provided for this image