MII/lec4.ipynb
2024-10-25 22:08:47 +04:00

102 KiB
Raw Blame History

Загрузка набора данных

In [4]:
import pandas as pd

from sklearn import set_config

set_config(transform_output="pandas")

random_state=9

df = pd.read_csv("data/car_price_prediction.csv", index_col="ID")

df
Out[4]:
Price Levy Manufacturer Model Prod_year Category Leather_interior Fuel type Engine volume Mileage Cylinders Gear box type Drive wheels Doors Wheel Color Airbags
ID
45654403 13328 1399 LEXUS RX 450 2010 Jeep Yes Hybrid 3.5 186005 km 6.0 Automatic 4x4 04-May Left wheel Silver 12
44731507 16621 1018 CHEVROLET Equinox 2011 Jeep No Petrol 3 192000 km 6.0 Tiptronic 4x4 04-May Left wheel Black 8
45774419 8467 - HONDA FIT 2006 Hatchback No Petrol 1.3 200000 km 4.0 Variator Front 04-May Right-hand drive Black 2
45769185 3607 862 FORD Escape 2011 Jeep Yes Hybrid 2.5 168966 km 4.0 Automatic 4x4 04-May Left wheel White 0
45809263 11726 446 HONDA FIT 2014 Hatchback Yes Petrol 1.3 91901 km 4.0 Automatic Front 04-May Left wheel Silver 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
45798355 8467 - MERCEDES-BENZ CLK 200 1999 Coupe Yes CNG 2.0 Turbo 300000 km 4.0 Manual Rear 02-Mar Left wheel Silver 5
45778856 15681 831 HYUNDAI Sonata 2011 Sedan Yes Petrol 2.4 161600 km 4.0 Tiptronic Front 04-May Left wheel Red 8
45804997 26108 836 HYUNDAI Tucson 2010 Jeep Yes Diesel 2 116365 km 4.0 Automatic Front 04-May Left wheel Grey 4
45793526 5331 1288 CHEVROLET Captiva 2007 Jeep Yes Diesel 2 51258 km 4.0 Automatic Front 04-May Left wheel Black 4
45813273 470 753 HYUNDAI Sonata 2012 Sedan Yes Hybrid 2.4 186923 km 4.0 Automatic Front 04-May Left wheel White 12

19237 rows × 17 columns

Разделение набора данных на обучающую и тестовые выборки (80/20) для задачи классификации

Целевой признак -- gear box type - коробка переключения передач. x - полная выборка, y - gear box столбец

In [5]:
from utils import split_stratified_into_train_val_test

X_train, X_val, X_test, y_train, y_val, y_test = split_stratified_into_train_val_test(
    df, stratify_colname="Gear box type", frac_train=0.80, frac_val=0, frac_test=0.20, random_state=random_state
)

display("X_train", X_train)
display("y_train", y_train)

display("X_test", X_test)
display("y_test", y_test)
'X_train'
Price Levy Manufacturer Model Prod_year Category Leather_interior Fuel type Engine volume Mileage Cylinders Gear box type Drive wheels Doors Wheel Color Airbags
ID
45758153 1333 289 FORD Escape 2008 Jeep Yes Hybrid 0.4 349288 km 4.0 Automatic Front 04-May Left wheel Blue 0
45699930 17249 - FORD Escape Hybrid 2008 Jeep No Hybrid 2.3 147000 km 4.0 Variator 4x4 04-May Left wheel White 8
45646562 1333 1053 LEXUS ES 350 2014 Sedan Yes Petrol 3.5 179358 km 6.0 Automatic Front 04-May Left wheel Red 12
45656923 9879 1018 MERCEDES-BENZ ML 350 2011 Jeep Yes Diesel 3 275862 km 6.0 Automatic 4x4 04-May Left wheel Silver 12
45815887 10976 1275 HYUNDAI Sonata 2019 Sedan Yes Petrol 2.4 29419 km 4.0 Automatic Front 04-May Left wheel Blue 12
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
45802363 21805 1024 HYUNDAI H1 2010 Minivan Yes Diesel 2.5 58958 km 4.0 Automatic Front 04-May Left wheel Black 4
45812777 220 1327 TOYOTA Camry 2018 Sedan Yes Petrol 2.5 47688 km 4.0 Automatic Front 04-May Left wheel Blue 12
44104417 15210 - TOYOTA Aqua 2014 Hatchback No Hybrid 1.5 139000 km 4.0 Variator Front 04-May Right-hand drive White 2
45793406 3136 - OPEL Corsa 1995 Hatchback No Petrol 1.4 100000 km 4.0 Manual Front 02-Mar Left wheel Grey 2
45700700 18817 - TOYOTA Camry 2007 Sedan Yes Hybrid 2.4 151000 km 4.0 Variator Front 04-May Left wheel Black 10

15389 rows × 17 columns

'y_train'
Gear box type
ID
45758153 Automatic
45699930 Variator
45646562 Automatic
45656923 Automatic
45815887 Automatic
... ...
45802363 Automatic
45812777 Automatic
44104417 Variator
45793406 Manual
45700700 Variator

15389 rows × 1 columns

'X_test'
Price Levy Manufacturer Model Prod_year Category Leather_interior Fuel type Engine volume Mileage Cylinders Gear box type Drive wheels Doors Wheel Color Airbags
ID
45813151 220 919 MERCEDES-BENZ ML 350 2012 Jeep Yes Diesel 3 209072 km 6.0 Automatic 4x4 04-May Left wheel Grey 12
45783744 11000 - JEEP Liberty 2001 Jeep Yes LPG 3.7 137582 km 6.0 Automatic 4x4 04-May Right-hand drive Silver 6
45805850 10976 - TOYOTA RAV 4 2002 Jeep Yes CNG 2 200000 km 4.0 Automatic 4x4 04-May Left wheel White 4
45816409 1568 753 HYUNDAI Sonata 2012 Sedan Yes Petrol 2.4 246230 km 4.0 Automatic Front 04-May Left wheel Black 12
45281242 8938 843 TOYOTA Prius 2008 Sedan No Hybrid 1.5 133016 km 4.0 Automatic Front 04-May Left wheel Beige 8
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
45798478 13172 639 FORD Focus 2014 Sedan Yes Petrol 2 134400 km 4.0 Tiptronic Front 04-May Left wheel Red 8
45321909 16621 - TOYOTA Prius 2010 Hatchback No Hybrid 1.8 154000 km 4.0 Variator Front 04-May Left wheel White 6
45758118 15681 1811 LEXUS GX 460 2010 Jeep Yes Petrol 4.6 275240 km 8.0 Automatic 4x4 04-May Left wheel Silver 0
45758137 6476 - NISSAN Note 2008 Hatchback No CNG 1.5 999999999 km 4.0 Automatic 4x4 04-May Right-hand drive Black 0
45720411 3 697 VOLKSWAGEN Jetta 2015 Sedan Yes Petrol 1.8 Turbo 65000 km 4.0 Automatic Front 04-May Left wheel Grey 12

3848 rows × 17 columns

'y_test'
Gear box type
ID
45813151 Automatic
45783744 Automatic
45805850 Automatic
45816409 Automatic
45281242 Automatic
... ...
45798478 Tiptronic
45321909 Variator
45758118 Automatic
45758137 Automatic
45720411 Automatic

3848 rows × 1 columns

В итоге, этот код выполняет следующие действия:

  • Заполняет пропущенные значения: В числовых столбцах медианой, в категориальных - значением "unknown".
  • Стандартизирует числовые данные: приводит их к нулевому среднему и единичному стандартному отклонению.
  • Преобразует категориальные данные: использует one-hot-кодирование.
  • Удаляет ненужные столбцы: из списка columns_to_drop.

Формирование конвейера для классификации данных

preprocessing_num -- конвейер для обработки числовых данных: заполнение пропущенных значений и стандартизация

preprocessing_cat -- конвейер для обработки категориальных данных: заполнение пропущенных данных и унитарное кодирование

features_preprocessing -- трансформер для предобработки признаков

features_engineering -- трансформер для конструирования признаков

drop_columns -- трансформер для удаления колонок

features_postprocessing -- трансформер для унитарного кодирования новых признаков

pipeline_end -- основной конвейер предобработки данных и конструирования признаков

Конвейер выполняется последовательно.

Трансформер выполняет параллельно для указанного набора колонок.

Документация:

https://scikit-learn.org/1.5/api/sklearn.pipeline.html

https://scikit-learn.org/1.5/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.discriminant_analysis import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

from transformers import TitanicFeatures


#columns_to_drop = ["Survived", "Name", "Cabin", "Ticket", "Embarked", "Parch", "Fare"]
columns_to_drop = ["Doors", "Color", "Gear box type", "Prod_year", "Mileage", "Airbags", "Levy", "Leather_interior", "Fuel type", "Drive wheels"]
num_columns = [
    column
    for column in df.columns
    if column not in columns_to_drop and df[column].dtype != "object"
]
cat_columns = [
    column
    for column in df.columns
    if column not in columns_to_drop and df[column].dtype == "object"
]

num_imputer = SimpleImputer(strategy="median")
num_scaler = StandardScaler()
preprocessing_num = Pipeline(
    [
        ("imputer", num_imputer),
        ("scaler", num_scaler),
    ]
)

cat_imputer = SimpleImputer(strategy="constant", fill_value="unknown")
cat_encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop="first")
preprocessing_cat = Pipeline(
    [
        ("imputer", cat_imputer),
        ("encoder", cat_encoder),
    ]
)

features_preprocessing = ColumnTransformer(
    verbose_feature_names_out=False,
    transformers=[
        ("prepocessing_num", preprocessing_num, num_columns),
        ("prepocessing_cat", preprocessing_cat, cat_columns),
        #("prepocessing_features", cat_imputer, ["Name", "Cabin"]),
    ],
    remainder="passthrough"
)

# features_engineering = ColumnTransformer(
#     verbose_feature_names_out=False,
#     transformers=[
#         ("add_features", TitanicFeatures(), ["Name", "Cabin"]),
#     ],
#     remainder="passthrough",
# )

drop_columns = ColumnTransformer(
    verbose_feature_names_out=False,
    transformers=[
        ("drop_columns", "drop", columns_to_drop),
    ],
    remainder="passthrough",
)

# features_postprocessing = ColumnTransformer(
#     verbose_feature_names_out=False,
#     transformers=[
#         ("prepocessing_cat", preprocessing_cat, ["Cabin_type"]),
#     ],
#     remainder="passthrough",
# )

pipeline_end = Pipeline(
    [
        ("features_preprocessing", features_preprocessing),
       # ("features_engineering", features_engineering),
        ("drop_columns", drop_columns),
       # ("features_postprocessing", features_postprocessing),
    ]
)

Демонстрация работы конвейера для предобработки данных при классификации

In [7]:
preprocessing_result = pipeline_end.fit_transform(X_train)
preprocessed_df = pd.DataFrame(
    preprocessing_result,
    columns=pipeline_end.get_feature_names_out(),
)

preprocessed_df
Out[7]:
Price Cylinders Manufacturer_ALFA ROMEO Manufacturer_ASTON MARTIN Manufacturer_AUDI Manufacturer_BENTLEY Manufacturer_BMW Manufacturer_BUICK Manufacturer_CADILLAC Manufacturer_CHEVROLET ... Engine volume_5.7 Turbo Engine volume_5.8 Engine volume_5.9 Engine volume_6 Engine volume_6.2 Engine volume_6.3 Engine volume_6.3 Turbo Engine volume_6.7 Engine volume_6.8 Wheel_Right-hand drive
ID
45758153 -0.082497 -0.485038 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
45699930 -0.007675 -0.485038 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
45646562 -0.082497 1.187062 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
45656923 -0.042322 1.187062 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
45815887 -0.037165 -0.485038 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
45802363 0.013743 -0.485038 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
45812777 -0.087729 -0.485038 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
44104417 -0.017260 -0.485038 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
45793406 -0.074021 -0.485038 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
45700700 -0.000304 -0.485038 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

15389 rows × 1573 columns

Формирование набора моделей для классификации

logistic -- логистическая регрессия

ridge -- гребневая регрессия

decision_tree -- дерево решений

knn -- k-ближайших соседей

naive_bayes -- наивный Байесовский классификатор

gradient_boosting -- метод градиентного бустинга (набор деревьев решений)

random_forest -- метод случайного леса (набор деревьев решений)

mlp -- многослойный персептрон (нейронная сеть)

Документация: https://scikit-learn.org/1.5/supervised_learning.html

In [8]:
from sklearn import ensemble, linear_model, naive_bayes, neighbors, neural_network, tree

class_models = {
    "logistic": {"model": linear_model.LogisticRegression()},
    # "ridge": {"model": linear_model.RidgeClassifierCV(cv=5, class_weight="balanced")},
    "ridge": {"model": linear_model.LogisticRegression(penalty="l2", class_weight="balanced")},
    "decision_tree": {
        "model": tree.DecisionTreeClassifier(max_depth=7, random_state=random_state)
    },
    "knn": {"model": neighbors.KNeighborsClassifier(n_neighbors=7)},
    "naive_bayes": {"model": naive_bayes.GaussianNB()},
    "gradient_boosting": {
        "model": ensemble.GradientBoostingClassifier(n_estimators=210)
    },
    "random_forest": {
        "model": ensemble.RandomForestClassifier(
            max_depth=11, class_weight="balanced", random_state=random_state
        )
    },
    "mlp": {
        "model": neural_network.MLPClassifier(
            hidden_layer_sizes=(7,),
            max_iter=100000,
            early_stopping=True,
            random_state=random_state,
        )
    },
}
In [14]:
print(y_train.dtypes)
print(y_test.dtypes)
df.info()
Gear box type    object
dtype: object
Gear box type    object
dtype: object
<class 'pandas.core.frame.DataFrame'>
Index: 19237 entries, 45654403 to 45813273
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Price             19237 non-null  int64  
 1   Levy              19237 non-null  object 
 2   Manufacturer      19237 non-null  object 
 3   Model             19237 non-null  object 
 4   Prod_year         19237 non-null  int64  
 5   Category          19237 non-null  object 
 6   Leather_interior  19237 non-null  object 
 7   Fuel type         19237 non-null  object 
 8   Engine volume     19237 non-null  object 
 9   Mileage           19237 non-null  object 
 10  Cylinders         19237 non-null  float64
 11  Gear box type     19237 non-null  object 
 12  Drive wheels      19237 non-null  object 
 13  Doors             19237 non-null  object 
 14  Wheel             19237 non-null  object 
 15  Color             19237 non-null  object 
 16  Airbags           19237 non-null  int64  
dtypes: float64(1), int64(3), object(13)
memory usage: 2.6+ MB

Обучение моделей на обучающем наборе данных и оценка на тестовом

In [16]:
import numpy as np
from sklearn import metrics

for model_name in class_models.keys():
    print(f"Model: {model_name}")
    model = class_models[model_name]["model"]

    model_pipeline = Pipeline([("pipeline", pipeline_end), ("model", model)])
    model_pipeline = model_pipeline.fit(X_train, y_train.values.ravel())

    y_train_predict = model_pipeline.predict(X_train)
    y_test_probs = model_pipeline.predict_proba(X_test)[:, 1]
    y_test_predict = np.where(y_test_probs > 0.5, 1, 0)

    class_models[model_name]["pipeline"] = model_pipeline
    class_models[model_name]["probs"] = y_test_probs
    class_models[model_name]["preds"] = y_test_predict

    class_models[model_name]["Precision_train"] = metrics.precision_score(
        y_train, y_train_predict, average="micro"
    )
    class_models[model_name]["Precision_test"] = metrics.precision_score(
        y_test, y_test_predict
    )
    class_models[model_name]["Recall_train"] = metrics.recall_score(
        y_train, y_train_predict, average="micro"
    )
    class_models[model_name]["Recall_test"] = metrics.recall_score(
        y_test, y_test_predict
    )
    class_models[model_name]["Accuracy_train"] = metrics.accuracy_score(
        y_train, y_train_predict
    )
    class_models[model_name]["Accuracy_test"] = metrics.accuracy_score(
        y_test, y_test_predict
    )
    class_models[model_name]["ROC_AUC_test"] = metrics.roc_auc_score(
        y_test, y_test_probs
    )
    class_models[model_name]["F1_train"] = metrics.f1_score(y_train, y_train_predict)
    class_models[model_name]["F1_test"] = metrics.f1_score(y_test, y_test_predict)
    class_models[model_name]["MCC_test"] = metrics.matthews_corrcoef(
        y_test, y_test_predict
    )
    class_models[model_name]["Cohen_kappa_test"] = metrics.cohen_kappa_score(
        y_test, y_test_predict
    )
    class_models[model_name]["Confusion_matrix"] = metrics.confusion_matrix(
        y_test, y_test_predict
    )
Model: logistic
c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\sklearn\preprocessing\_encoders.py:242: UserWarning: Found unknown categories in columns [0, 1, 3] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[16], line 28
     19 # class_models[model_name]["Precision_train"] = metrics.precision_score(
     20 #     y_train, y_train_predict, average="micro"
     21 # )
     22 # class_models[model_name]["Precision_test"] = metrics.precision_score(
     23 #     y_test, y_test_predict
     24 # )
     25 class_models[model_name]["Recall_train"] = metrics.recall_score(
     26     y_train, y_train_predict, average="micro"
     27 )
---> 28 class_models[model_name]["Recall_test"] = metrics.recall_score(
     29     y_test, y_test_predict
     30 )
     31 class_models[model_name]["Accuracy_train"] = metrics.accuracy_score(
     32     y_train, y_train_predict
     33 )
     34 class_models[model_name]["Accuracy_test"] = metrics.accuracy_score(
     35     y_test, y_test_predict
     36 )

File c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\sklearn\utils\_param_validation.py:213, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    207 try:
    208     with config_context(
    209         skip_parameter_validation=(
    210             prefer_skip_nested_validation or global_skip_validation
    211         )
    212     ):
--> 213         return func(*args, **kwargs)
    214 except InvalidParameterError as e:
    215     # When the function is just a wrapper around an estimator, we allow
    216     # the function to delegate validation to the estimator, but we replace
    217     # the name of the estimator by the name of the function in the error
    218     # message to avoid confusion.
    219     msg = re.sub(
    220         r"parameter of \w+ must be",
    221         f"parameter of {func.__qualname__} must be",
    222         str(e),
    223     )

File c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\sklearn\metrics\_classification.py:2385, in recall_score(y_true, y_pred, labels, pos_label, average, sample_weight, zero_division)
   2217 @validate_params(
   2218     {
   2219         "y_true": ["array-like", "sparse matrix"],
   (...)
   2244     zero_division="warn",
   2245 ):
   2246     """Compute the recall.
   2247 
   2248     The recall is the ratio ``tp / (tp + fn)`` where ``tp`` is the number of
   (...)
   2383     array([1. , 1. , 0.5])
   2384     """
-> 2385     _, r, _, _ = precision_recall_fscore_support(
   2386         y_true,
   2387         y_pred,
   2388         labels=labels,
   2389         pos_label=pos_label,
   2390         average=average,
   2391         warn_for=("recall",),
   2392         sample_weight=sample_weight,
   2393         zero_division=zero_division,
   2394     )
   2395     return r

File c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\sklearn\utils\_param_validation.py:186, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    184 global_skip_validation = get_config()["skip_parameter_validation"]
    185 if global_skip_validation:
--> 186     return func(*args, **kwargs)
    188 func_sig = signature(func)
    190 # Map *args/**kwargs to the function signature

File c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\sklearn\metrics\_classification.py:1789, in precision_recall_fscore_support(y_true, y_pred, beta, labels, pos_label, average, warn_for, sample_weight, zero_division)
   1626 """Compute precision, recall, F-measure and support for each class.
   1627 
   1628 The precision is the ratio ``tp / (tp + fp)`` where ``tp`` is the number of
   (...)
   1786  array([2, 2, 2]))
   1787 """
   1788 _check_zero_division(zero_division)
-> 1789 labels = _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
   1791 # Calculate tp_sum, pred_sum, true_sum ###
   1792 samplewise = average == "samples"

File c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\sklearn\metrics\_classification.py:1564, in _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
   1561 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
   1562 # Convert to Python primitive type to avoid NumPy type / Python str
   1563 # comparison. See https://github.com/numpy/numpy/issues/6784
-> 1564 present_labels = unique_labels(y_true, y_pred).tolist()
   1565 if average == "binary":
   1566     if y_type == "binary":

File c:\Users\1\Desktop\улгту\3 курс\МИИ\mai\.venv\Lib\site-packages\sklearn\utils\multiclass.py:114, in unique_labels(*ys)
    112 # Check that we don't mix string type with number type
    113 if len(set(isinstance(label, str) for label in ys_labels)) > 1:
--> 114     raise ValueError("Mix of label input types (string and number)")
    116 return xp.asarray(sorted(ys_labels))

ValueError: Mix of label input types (string and number)

Сводная таблица оценок качества для использованных моделей классификации

Документация: https://scikit-learn.org/1.5/modules/model_evaluation.html

Матрица неточностей

In [ ]:
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

_, ax = plt.subplots(int(len(class_models) / 2), 2, figsize=(12, 10), sharex=False, sharey=False)
for index, key in enumerate(class_models.keys()):
    c_matrix = class_models[key]["Confusion_matrix"]
    disp = ConfusionMatrixDisplay(
        confusion_matrix=c_matrix, display_labels=["Died", "Sirvived"]
    ).plot(ax=ax.flat[index])
    disp.ax_.set_title(key)

plt.subplots_adjust(top=1, bottom=0, hspace=0.4, wspace=0.1)
plt.show()

Точность, полнота, верность (аккуратность), F-мера

In [ ]:
class_metrics = pd.DataFrame.from_dict(class_models, "index")[
    [
        "Precision_train",
        "Precision_test",
        "Recall_train",
        "Recall_test",
        "Accuracy_train",
        "Accuracy_test",
        "F1_train",
        "F1_test",
    ]
]
class_metrics.sort_values(
    by="Accuracy_test", ascending=False
).style.background_gradient(
    cmap="plasma",
    low=0.3,
    high=1,
    subset=["Accuracy_train", "Accuracy_test", "F1_train", "F1_test"],
).background_gradient(
    cmap="viridis",
    low=1,
    high=0.3,
    subset=[
        "Precision_train",
        "Precision_test",
        "Recall_train",
        "Recall_test",
    ],
)

ROC-кривая, каппа Коэна, коэффициент корреляции Мэтьюса

In [ ]:
class_metrics = pd.DataFrame.from_dict(class_models, "index")[
    [
        "Accuracy_test",
        "F1_test",
        "ROC_AUC_test",
        "Cohen_kappa_test",
        "MCC_test",
    ]
]
class_metrics.sort_values(by="ROC_AUC_test", ascending=False).style.background_gradient(
    cmap="plasma",
    low=0.3,
    high=1,
    subset=[
        "ROC_AUC_test",
        "MCC_test",
        "Cohen_kappa_test",
    ],
).background_gradient(
    cmap="viridis",
    low=1,
    high=0.3,
    subset=[
        "Accuracy_test",
        "F1_test",
    ],
)
In [ ]:
best_model = str(class_metrics.sort_values(by="MCC_test", ascending=False).iloc[0].name)

display(best_model)

Вывод данных с ошибкой предсказания для оценки

In [ ]:
preprocessing_result = pipeline_end.transform(X_test)
preprocessed_df = pd.DataFrame(
    preprocessing_result,
    columns=pipeline_end.get_feature_names_out(),
)

y_pred = class_models[best_model]["preds"]

error_index = y_test[y_test["Survived"] != y_pred].index.tolist()
display(f"Error items count: {len(error_index)}")

error_predicted = pd.Series(y_pred, index=y_test.index).loc[error_index]
error_df = X_test.loc[error_index].copy()
error_df.insert(loc=1, column="Predicted", value=error_predicted)
error_df.sort_index()

Пример использования обученной модели (конвейера) для предсказания

In [ ]:
model = class_models[best_model]["pipeline"]

example_id = 450
test = pd.DataFrame(X_test.loc[example_id, :]).T
test_preprocessed = pd.DataFrame(preprocessed_df.loc[example_id, :]).T
display(test)
display(test_preprocessed)
result_proba = model.predict_proba(test)[0]
result = model.predict(test)[0]
real = int(y_test.loc[example_id].values[0])
display(f"predicted: {result} (proba: {result_proba})")
display(f"real: {real}")
In [ ]:
from sklearn.model_selection import GridSearchCV

optimized_model_type = "random_forest"

random_forest_model = class_models[optimized_model_type]["pipeline"]

param_grid = {
    "model__n_estimators": [10, 20, 30, 40, 50, 100, 150, 200, 250, 500],
    "model__max_features": ["sqrt", "log2", 2],
    "model__max_depth": [2, 3, 4, 5, 6, 7, 8, 9 ,10],
    "model__criterion": ["gini", "entropy", "log_loss"],
}

gs_optomizer = GridSearchCV(
    estimator=random_forest_model, param_grid=param_grid, n_jobs=-1
)
gs_optomizer.fit(X_train, y_train.values.ravel())
gs_optomizer.best_params_

Обучение модели с новыми гиперпараметрами

In [90]:
optimized_model = ensemble.RandomForestClassifier(
    random_state=random_state,
    criterion="gini",
    max_depth=7,
    max_features="sqrt",
    n_estimators=30,
)

result = {}

result["pipeline"] = Pipeline([("pipeline", pipeline_end), ("model", optimized_model)]).fit(X_train, y_train.values.ravel())
result["train_preds"] = result["pipeline"].predict(X_train)
result["probs"] = result["pipeline"].predict_proba(X_test)[:, 1]
result["preds"] = np.where(result["probs"] > 0.5, 1, 0)

result["Precision_train"] = metrics.precision_score(y_train, result["train_preds"])
result["Precision_test"] = metrics.precision_score(y_test, result["preds"])
result["Recall_train"] = metrics.recall_score(y_train, result["train_preds"])
result["Recall_test"] = metrics.recall_score(y_test, result["preds"])
result["Accuracy_train"] = metrics.accuracy_score(y_train, result["train_preds"])
result["Accuracy_test"] = metrics.accuracy_score(y_test, result["preds"])
result["ROC_AUC_test"] = metrics.roc_auc_score(y_test, result["probs"])
result["F1_train"] = metrics.f1_score(y_train, result["train_preds"])
result["F1_test"] = metrics.f1_score(y_test, result["preds"])
result["MCC_test"] = metrics.matthews_corrcoef(y_test, result["preds"])
result["Cohen_kappa_test"] = metrics.cohen_kappa_score(y_test, result["preds"])
result["Confusion_matrix"] = metrics.confusion_matrix(y_test, result["preds"])

Формирование данных для оценки старой и новой версии модели

In [98]:
optimized_metrics = pd.DataFrame(columns=list(result.keys()))
optimized_metrics.loc[len(optimized_metrics)] = pd.Series(
    data=class_models[optimized_model_type]
)
optimized_metrics.loc[len(optimized_metrics)] = pd.Series(
    data=result
)
optimized_metrics.insert(loc=0, column="Name", value=["Old", "New"])
optimized_metrics = optimized_metrics.set_index("Name")

Оценка параметров старой и новой модели

In [ ]:
optimized_metrics[
    [
        "Precision_train",
        "Precision_test",
        "Recall_train",
        "Recall_test",
        "Accuracy_train",
        "Accuracy_test",
        "F1_train",
        "F1_test",
    ]
].style.background_gradient(
    cmap="plasma",
    low=0.3,
    high=1,
    subset=["Accuracy_train", "Accuracy_test", "F1_train", "F1_test"],
).background_gradient(
    cmap="viridis",
    low=1,
    high=0.3,
    subset=[
        "Precision_train",
        "Precision_test",
        "Recall_train",
        "Recall_test",
    ],
)
In [ ]:
optimized_metrics[
    [
        "Accuracy_test",
        "F1_test",
        "ROC_AUC_test",
        "Cohen_kappa_test",
        "MCC_test",
    ]
].style.background_gradient(
    cmap="plasma",
    low=0.3,
    high=1,
    subset=[
        "ROC_AUC_test",
        "MCC_test",
        "Cohen_kappa_test",
    ],
).background_gradient(
    cmap="viridis",
    low=1,
    high=0.3,
    subset=[
        "Accuracy_test",
        "F1_test",
    ],
)
In [ ]:
_, ax = plt.subplots(1, 2, figsize=(10, 4), sharex=False, sharey=False
)

for index in range(0, len(optimized_metrics)):
    c_matrix = optimized_metrics.iloc[index]["Confusion_matrix"]
    disp = ConfusionMatrixDisplay(
        confusion_matrix=c_matrix, display_labels=["Died", "Sirvived"]
    ).plot(ax=ax.flat[index])

plt.subplots_adjust(top=1, bottom=0, hspace=0.4, wspace=0.3)
plt.show()