
Classification

Loading the dataset

In [1]:
import pandas as pd

from sklearn import set_config

# Make all scikit-learn transformers return pandas DataFrames instead of numpy arrays
set_config(transform_output="pandas")

random_state = 9

df = pd.read_csv("data/titanic.csv", index_col="PassengerId")

df
Out[1]:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ...
887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 11 columns

Splitting the dataset into training and test sets (80/20) for the classification task

Target feature -- Survived
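
The helper split_stratified_into_train_val_test lives in src.utils and its source is not shown in this notebook. A minimal sketch of such a helper, assuming it wraps scikit-learn's train_test_split with the stratify argument (hypothetical reconstruction; the actual implementation may differ):

from sklearn.model_selection import train_test_split

def split_stratified_into_train_val_test(
    df, stratify_colname, frac_train, frac_val, frac_test, random_state
):
    # First split off the test part, preserving the class proportions
    df_train_val, df_test = train_test_split(
        df, test_size=frac_test, stratify=df[stratify_colname], random_state=random_state
    )
    if frac_val > 0:
        # Re-normalize the validation fraction relative to the remaining rows
        df_train, df_val = train_test_split(
            df_train_val,
            test_size=frac_val / (frac_train + frac_val),
            stratify=df_train_val[stratify_colname],
            random_state=random_state,
        )
    else:
        df_train, df_val = df_train_val, df_train_val.iloc[0:0]
    # The features keep all columns; the target is a single-column DataFrame
    return (
        df_train, df_val, df_test,
        df_train[[stratify_colname]], df_val[[stratify_colname]], df_test[[stratify_colname]],
    )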

In [2]:
from src.utils import split_stratified_into_train_val_test

X_train, X_val, X_test, y_train, y_val, y_test = split_stratified_into_train_val_test(
    df, stratify_colname="Survived", frac_train=0.80, frac_val=0, frac_test=0.20, random_state=random_state
)

display("X_train", X_train)
display("y_train", y_train)

display("X_test", X_test)
display("y_test", y_test)
'X_train'
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
145 0 2 Andrew, Mr. Edgardo Samuel male 18.00 0 0 231945 11.5000 NaN S
206 0 3 Strom, Miss. Telma Matilda female 2.00 0 1 347054 10.4625 G6 S
349 1 3 Coutts, Master. William Loch "William" male 3.00 1 1 C.A. 37671 15.9000 NaN S
329 1 3 Goldsmith, Mrs. Frank John (Emily Alice Brown) female 31.00 1 1 363291 20.5250 NaN S
289 1 2 Hosono, Mr. Masabumi male 42.00 0 0 237798 13.0000 NaN S
... ... ... ... ... ... ... ... ... ... ... ...
756 1 2 Hamalainen, Master. Viljo male 0.67 1 1 250649 14.5000 NaN S
816 0 1 Fry, Mr. Richard male NaN 0 0 112058 0.0000 B102 S
890 1 1 Behr, Mr. Karl Howell male 26.00 0 0 111369 30.0000 C148 C
738 1 1 Lesurer, Mr. Gustave J male 35.00 0 0 PC 17755 512.3292 B101 C
61 0 3 Sirayanian, Mr. Orsen male 22.00 0 0 2669 7.2292 NaN C

712 rows × 11 columns

'y_train'
Survived
PassengerId
145 0
206 0
349 1
329 1
289 1
... ...
756 1
816 0
890 1
738 1
61 0

712 rows × 1 columns

'X_test'
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
843 1 1 Serepeca, Miss. Augusta female 30.0 0 0 113798 31.0000 NaN C
791 0 3 Keane, Mr. Andrew "Andy" male NaN 0 0 12460 7.7500 NaN Q
509 0 3 Olsen, Mr. Henry Margido male 28.0 0 0 C 4001 22.5250 NaN S
828 1 2 Mallet, Master. Andre male 1.0 0 2 S.C./PARIS 2079 37.0042 NaN C
414 0 2 Cunningham, Mr. Alfred Fleming male NaN 0 0 239853 0.0000 NaN S
... ... ... ... ... ... ... ... ... ... ... ...
824 1 3 Moor, Mrs. (Beila) female 27.0 0 1 392096 12.4750 E121 S
353 0 3 Elias, Mr. Tannous male 15.0 1 1 2695 7.2292 NaN C
674 1 2 Wilhelms, Mr. Charles male 31.0 0 0 244270 13.0000 NaN S
100 0 2 Kantor, Mr. Sinai male 34.0 1 0 244367 26.0000 NaN S
542 0 3 Andersson, Miss. Ingeborg Constanzia female 9.0 4 2 347082 31.2750 NaN S

179 rows × 11 columns

'y_test'
Survived
PassengerId
843 1
791 0
509 0
828 1
414 0
... ...
824 1
353 0
674 1
100 0
542 0

179 rows × 1 columns

Building the preprocessing pipeline for the classification task

preprocessing_num -- pipeline for numeric features: missing-value imputation and standardization

preprocessing_cat -- pipeline for categorical features: missing-value imputation and one-hot encoding

features_preprocessing -- transformer for feature preprocessing

features_engineering -- transformer for feature engineering

drop_columns -- transformer that drops columns

features_postprocessing -- transformer that one-hot encodes the newly engineered features

pipeline_end -- the main pipeline for data preprocessing and feature engineering

A Pipeline runs its steps sequentially.

A ColumnTransformer applies its transformers in parallel, each to its own subset of columns.

Documentation:

https://scikit-learn.org/1.5/api/sklearn.pipeline.html

https://scikit-learn.org/1.5/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer
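
The custom TitanicFeatures transformer is imported from src.transformers, whose source is not part of this notebook. A minimal sketch, inferred from the engineered columns that appear later (Is_married, Cabin_type); this is a hypothetical reconstruction, not the actual implementation:

from sklearn.base import BaseEstimator, TransformerMixin

class TitanicFeatures(BaseEstimator, TransformerMixin):
    """Derive Is_married from the passenger's title and Cabin_type from the cabin code."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        # A "Mrs." title in the name indicates a married woman
        X["Is_married"] = X["Name"].str.contains("Mrs.", regex=False).astype(int)
        # Deck letter: first character of the cabin code ("u" for the imputed "unknown")
        X["Cabin_type"] = X["Cabin"].str[0]
        return X

    def get_feature_names_out(self, input_features=None):
        # Name and Cabin are kept so that the drop_columns step can remove them later
        return ["Name", "Cabin", "Is_married", "Cabin_type"]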

In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from src.transformers import TitanicFeatures


columns_to_drop = ["Survived", "Name", "Cabin", "Ticket", "Embarked", "Parch", "Fare"]
num_columns = [
    column
    for column in df.columns
    if column not in columns_to_drop and df[column].dtype != "object"
]
cat_columns = [
    column
    for column in df.columns
    if column not in columns_to_drop and df[column].dtype == "object"
]

num_imputer = SimpleImputer(strategy="median")
num_scaler = StandardScaler()
preprocessing_num = Pipeline(
    [
        ("imputer", num_imputer),
        ("scaler", num_scaler),
    ]
)

cat_imputer = SimpleImputer(strategy="constant", fill_value="unknown")
cat_encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop="first")
preprocessing_cat = Pipeline(
    [
        ("imputer", cat_imputer),
        ("encoder", cat_encoder),
    ]
)

features_preprocessing = ColumnTransformer(
    verbose_feature_names_out=False,
    transformers=[
        ("prepocessing_num", preprocessing_num, num_columns),
        ("prepocessing_cat", preprocessing_cat, cat_columns),
        ("prepocessing_features", cat_imputer, ["Name", "Cabin"]),
    ],
    remainder="passthrough"
)

features_engineering = ColumnTransformer(
    verbose_feature_names_out=False,
    transformers=[
        ("add_features", TitanicFeatures(), ["Name", "Cabin"]),
    ],
    remainder="passthrough",
)

drop_columns = ColumnTransformer(
    verbose_feature_names_out=False,
    transformers=[
        ("drop_columns", "drop", columns_to_drop),
    ],
    remainder="passthrough",
)

features_postprocessing = ColumnTransformer(
    verbose_feature_names_out=False,
    transformers=[
        ("prepocessing_cat", preprocessing_cat, ["Cabin_type"]),
    ],
    remainder="passthrough",
)

pipeline_end = Pipeline(
    [
        ("features_preprocessing", features_preprocessing),
        ("features_engineering", features_engineering),
        ("drop_columns", drop_columns),
        ("features_postprocessing", features_postprocessing),
    ]
)

Demonstrating the preprocessing pipeline for classification

In [4]:
preprocessing_result = pipeline_end.fit_transform(X_train)
# With transform_output="pandas", fit_transform already returns a DataFrame;
# the explicit constructor just re-attaches the feature names
preprocessed_df = pd.DataFrame(
    preprocessing_result,
    columns=pipeline_end.get_feature_names_out(),
)

preprocessed_df
Out[4]:
Cabin_type_B Cabin_type_C Cabin_type_D Cabin_type_E Cabin_type_F Cabin_type_G Cabin_type_T Cabin_type_u Is_married Pclass Age SibSp Sex_male
PassengerId
145 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0 -0.379423 -0.869506 -0.473465 1.0
206 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0 0.821241 -2.102186 -0.473465 0.0
349 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0 0.821241 -2.025143 0.437635 1.0
329 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1 0.821241 0.132047 0.437635 0.0
289 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0 -0.379423 0.979514 -0.473465 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
756 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0 -0.379423 -2.204652 0.437635 1.0
816 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 -1.580088 -0.099081 -0.473465 1.0
890 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0 -1.580088 -0.253166 -0.473465 1.0
738 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 -1.580088 0.440217 -0.473465 1.0
61 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0 0.821241 -0.561336 -0.473465 1.0

712 rows × 13 columns

Building the set of classification models

logistic -- logistic regression

ridge -- ridge classifier (implemented below as an L2-regularized logistic regression; RidgeClassifierCV, shown commented out, does not provide predict_proba)

decision_tree -- decision tree

knn -- k-nearest neighbors

naive_bayes -- naive Bayes classifier

gradient_boosting -- gradient boosting (an ensemble of decision trees)

random_forest -- random forest (an ensemble of decision trees)

mlp -- multilayer perceptron (a neural network)

Documentation: https://scikit-learn.org/1.5/supervised_learning.html

In [5]:
from sklearn import ensemble, linear_model, naive_bayes, neighbors, neural_network, tree

class_models = {
    "logistic": {"model": linear_model.LogisticRegression()},
    # "ridge": {"model": linear_model.RidgeClassifierCV(cv=5, class_weight="balanced")},
    "ridge": {"model": linear_model.LogisticRegression(penalty="l2", class_weight="balanced")},
    "decision_tree": {
        "model": tree.DecisionTreeClassifier(max_depth=7, random_state=random_state)
    },
    "knn": {"model": neighbors.KNeighborsClassifier(n_neighbors=7)},
    "naive_bayes": {"model": naive_bayes.GaussianNB()},
    "gradient_boosting": {
        "model": ensemble.GradientBoostingClassifier(n_estimators=210)
    },
    "random_forest": {
        "model": ensemble.RandomForestClassifier(
            max_depth=11, class_weight="balanced", random_state=random_state
        )
    },
    "mlp": {
        "model": neural_network.MLPClassifier(
            hidden_layer_sizes=(7,),
            max_iter=500,
            early_stopping=True,
            random_state=random_state,
        )
    },
}

Training the models on the training set and evaluating them on the test set

In [6]:
from src.utils import run_classification

for model_name in class_models.keys():
    print(f"Model: {model_name}")
    model = class_models[model_name]["model"]

    pipeline = Pipeline([("pipeline", pipeline_end), ("model", model)]).fit(
        X_train, y_train.values.ravel()
    )

    class_models[model_name] = run_classification(
        pipeline, X_train, X_test, y_train, y_test
    )
Model: logistic
Model: ridge
Model: decision_tree
Model: knn
Model: naive_bayes
Model: gradient_boosting
Model: random_forest
Model: mlp
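
run_classification is also a helper from src.utils. A sketch of what it computes, assuming it is built on sklearn.metrics and that the dictionary keys match those used in the summary tables below (hypothetical reconstruction):

from sklearn import metrics

def run_classification(pipeline, X_train, X_test, y_train, y_test):
    y_train, y_test = y_train.values.ravel(), y_test.values.ravel()
    y_train_pred = pipeline.predict(X_train)
    y_test_pred = pipeline.predict(X_test)
    # Probability of the positive class, needed for ROC AUC
    y_test_proba = pipeline.predict_proba(X_test)[:, 1]
    return {
        "pipeline": pipeline,
        "preds": y_test_pred,
        "Precision_train": metrics.precision_score(y_train, y_train_pred),
        "Precision_test": metrics.precision_score(y_test, y_test_pred),
        "Recall_train": metrics.recall_score(y_train, y_train_pred),
        "Recall_test": metrics.recall_score(y_test, y_test_pred),
        "Accuracy_train": metrics.accuracy_score(y_train, y_train_pred),
        "Accuracy_test": metrics.accuracy_score(y_test, y_test_pred),
        "F1_train": metrics.f1_score(y_train, y_train_pred),
        "F1_test": metrics.f1_score(y_test, y_test_pred),
        "ROC_AUC_test": metrics.roc_auc_score(y_test, y_test_proba),
        "Cohen_kappa_test": metrics.cohen_kappa_score(y_test, y_test_pred),
        "MCC_test": metrics.matthews_corrcoef(y_test, y_test_pred),
        "Confusion_matrix": metrics.confusion_matrix(y_test, y_test_pred),
    }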

Summary table of quality metrics for the classification models

Documentation: https://scikit-learn.org/1.5/modules/model_evaluation.html

Confusion matrices

In [7]:
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

_, ax = plt.subplots(len(class_models) // 2, 2, figsize=(12, 10), sharex=False, sharey=False)
for index, key in enumerate(class_models.keys()):
    c_matrix = class_models[key]["Confusion_matrix"]
    disp = ConfusionMatrixDisplay(
        confusion_matrix=c_matrix, display_labels=["Died", "Survived"]
    ).plot(ax=ax.flat[index])
    disp.ax_.set_title(key)

plt.subplots_adjust(top=1, bottom=0, hspace=0.4, wspace=0.1)
plt.show()
[Figure: confusion matrices for the eight classifiers]

Precision, recall, accuracy, F1 score
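
For a binary confusion matrix with counts TP, TN, FP, FN, the metrics in the table are:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1 = 2 · Precision · Recall / (Precision + Recall)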

In [8]:
class_metrics = pd.DataFrame.from_dict(class_models, "index")[
    [
        "Precision_train",
        "Precision_test",
        "Recall_train",
        "Recall_test",
        "Accuracy_train",
        "Accuracy_test",
        "F1_train",
        "F1_test",
    ]
]
class_metrics.sort_values(
    by="Accuracy_test", ascending=False
).style.background_gradient(
    cmap="plasma",
    low=0.3,
    high=1,
    subset=["Accuracy_train", "Accuracy_test", "F1_train", "F1_test"],
).background_gradient(
    cmap="viridis",
    low=1,
    high=0.3,
    subset=[
        "Precision_train",
        "Precision_test",
        "Recall_train",
        "Recall_test",
    ],
)
Out[8]:
  Precision_train Precision_test Recall_train Recall_test Accuracy_train Accuracy_test F1_train F1_test
random_forest 0.894340 0.794118 0.868132 0.782609 0.910112 0.837989 0.881041 0.788321
gradient_boosting 0.889764 0.800000 0.827839 0.753623 0.894663 0.832402 0.857685 0.776119
logistic 0.751880 0.806452 0.732601 0.724638 0.804775 0.826816 0.742115 0.763359
decision_tree 0.852459 0.839286 0.761905 0.681159 0.858146 0.826816 0.804642 0.752000
knn 0.829167 0.827586 0.728938 0.695652 0.838483 0.826816 0.775828 0.755906
ridge 0.720395 0.688312 0.802198 0.768116 0.804775 0.776536 0.759099 0.726027
naive_bayes 0.554524 0.575472 0.875458 0.884058 0.682584 0.703911 0.678977 0.697143
mlp 0.900000 0.833333 0.197802 0.217391 0.683989 0.681564 0.324324 0.344828

ROC AUC, Cohen's kappa, Matthews correlation coefficient
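
For reference: Cohen's kappa = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement (accuracy) and p_e is the agreement expected by chance; the Matthews correlation coefficient is MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)). Both are more robust to class imbalance than plain accuracy.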

In [9]:
class_metrics = pd.DataFrame.from_dict(class_models, "index")[
    [
        "Accuracy_test",
        "F1_test",
        "ROC_AUC_test",
        "Cohen_kappa_test",
        "MCC_test",
    ]
]
class_metrics.sort_values(by="ROC_AUC_test", ascending=False).style.background_gradient(
    cmap="plasma",
    low=0.3,
    high=1,
    subset=[
        "ROC_AUC_test",
        "MCC_test",
        "Cohen_kappa_test",
    ],
).background_gradient(
    cmap="viridis",
    low=1,
    high=0.3,
    subset=[
        "Accuracy_test",
        "F1_test",
    ],
)
Out[9]:
  Accuracy_test F1_test ROC_AUC_test Cohen_kappa_test MCC_test
random_forest 0.837989 0.788321 0.858893 0.657111 0.657157
logistic 0.826816 0.763359 0.854084 0.627409 0.629641
ridge 0.776536 0.726027 0.851054 0.538303 0.540613
gradient_boosting 0.832402 0.776119 0.850922 0.642381 0.643113
knn 0.826816 0.755906 0.838735 0.623260 0.628905
decision_tree 0.826816 0.752000 0.794137 0.621151 0.629142
naive_bayes 0.703911 0.697143 0.785903 0.431814 0.470403
mlp 0.681564 0.344828 0.712714 0.220490 0.307678
In [10]:
best_model = str(class_metrics.sort_values(by="MCC_test", ascending=False).iloc[0].name)

display(best_model)
'random_forest'

Displaying the misclassified test samples for review

In [11]:
preprocessing_result = pipeline_end.transform(X_test)
preprocessed_df = pd.DataFrame(
    preprocessing_result,
    columns=pipeline_end.get_feature_names_out(),
)

y_pred = class_models[best_model]["preds"]

error_index = y_test[y_test["Survived"] != y_pred].index.tolist()
display(f"Error items count: {len(error_index)}")

error_predicted = pd.Series(y_pred, index=y_test.index).loc[error_index]
error_df = X_test.loc[error_index].copy()
error_df.insert(loc=1, column="Predicted", value=error_predicted)
error_df.sort_index()
'Error items count: 29'
Out[11]:
Survived Predicted Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
26 1 0 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 NaN S
72 0 1 3 Goodwin, Miss. Lillian Amy female 16.0 5 2 CA 2144 46.9000 NaN S
103 0 1 1 White, Mr. Richard Frasar male 21.0 0 1 35281 77.2875 D26 S
108 1 0 3 Moss, Mr. Albert Johan male NaN 0 0 312991 7.7750 NaN S
128 1 0 3 Madsen, Mr. Fridtjof Arne male 24.0 0 0 C 17369 7.1417 NaN S
193 1 0 3 Andersen-Jensen, Miss. Carla Christine Nielsine female 19.0 1 0 350046 7.8542 NaN S
241 0 1 3 Zabour, Miss. Thamine female NaN 1 0 2665 14.4542 NaN C
272 1 0 3 Tornquist, Mr. William Henry male 25.0 0 0 LINE 0.0000 NaN S
293 0 1 2 Levy, Mr. Rene Jacques male 36.0 0 0 SC/Paris 2163 12.8750 D C
352 0 1 1 Williams-Lambert, Mr. Fletcher Fellows male NaN 0 0 113510 35.0000 C128 S
358 0 1 2 Funk, Miss. Annie Clemmer female 38.0 0 0 237671 13.0000 NaN S
378 0 1 1 Widener, Mr. Harry Elkins male 27.0 0 2 113503 211.5000 C82 C
445 1 0 3 Johannesen-Bratthammer, Mr. Bernt male NaN 0 0 65306 8.1125 NaN S
450 1 0 1 Peuchen, Major. Arthur Godfrey male 52.0 0 0 113786 30.5000 C104 S
508 1 0 1 Bradley, Mr. George ("George Arthur Brayton") male NaN 0 0 111427 26.5500 NaN S
511 1 0 3 Daly, Mr. Eugene Patrick male 29.0 0 0 382651 7.7500 NaN Q
570 1 0 3 Jonsson, Mr. Carl male 32.0 0 0 350417 7.8542 NaN S
579 0 1 3 Caram, Mrs. Joseph (Maria Elias) female NaN 1 0 2689 14.4583 NaN C
584 0 1 1 Ross, Mr. John Hugo male 36.0 0 0 13049 40.1250 A10 C
588 1 0 1 Frolicher-Stehli, Mr. Maxmillian male 60.0 1 1 13567 79.2000 B41 C
618 0 1 3 Lobb, Mrs. William Arthur (Cordelia K Stanlick) female 26.0 1 0 A/5. 3336 16.1000 NaN S
658 0 1 3 Bourke, Mrs. John (Catherine) female 32.0 1 1 364849 15.5000 NaN Q
661 1 0 1 Frauenthal, Dr. Henry William male 50.0 2 0 PC 17611 133.6500 NaN S
674 1 0 2 Wilhelms, Mr. Charles male 31.0 0 0 244270 13.0000 NaN S
745 1 0 3 Stranden, Mr. Juho male 31.0 0 0 STON/O 2. 3101288 7.9250 NaN S
773 0 1 2 Mack, Mrs. (Mary) female 57.0 0 0 S.O./P.P. 3 10.5000 E77 S
807 0 1 1 Andrews, Mr. Thomas Jr male 39.0 0 0 112050 0.0000 A36 S
814 0 1 3 Andersson, Miss. Ebba Iris Alfrida female 6.0 4 2 347082 31.2750 NaN S
829 1 0 3 McCormack, Mr. Thomas Joseph male NaN 0 0 367228 7.7500 NaN Q

Example of using the trained model (pipeline) for prediction

In [12]:
model = class_models[best_model]["pipeline"]

example_id = 450
test = pd.DataFrame(X_test.loc[example_id, :]).T
test_preprocessed = pd.DataFrame(preprocessed_df.loc[example_id, :]).T
display(test)
display(test_preprocessed)
result_proba = model.predict_proba(test)[0]
result = model.predict(test)[0]
real = int(y_test.loc[example_id].values[0])
display(f"predicted: {result} (proba: {result_proba})")
display(f"real: {real}")
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
450 1 1 Peuchen, Major. Arthur Godfrey male 52.0 0 0 113786 30.5 C104 S
Cabin_type_B Cabin_type_C Cabin_type_D Cabin_type_E Cabin_type_F Cabin_type_G Cabin_type_T Cabin_type_u Is_married Pclass Age SibSp Sex_male
450 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -1.580088 1.749939 -0.473465 1.0
'predicted: 0 (proba: [0.91145747 0.08854253])'
'real: 1'
In [13]:
from sklearn.model_selection import GridSearchCV

optimized_model_type = "random_forest"

random_forest_model = class_models[optimized_model_type]["pipeline"]

param_grid = {
    "model__n_estimators": [10, 20, 30, 40, 50, 100, 150, 200, 250, 500],
    "model__max_features": ["sqrt", "log2", 2],
    "model__max_depth": [2, 3, 4, 5, 6, 7, 8, 9 ,10],
    "model__criterion": ["gini", "entropy", "log_loss"],
}

gs_optimizer = GridSearchCV(
    estimator=random_forest_model, param_grid=param_grid, n_jobs=-1
)
gs_optimizer.fit(X_train, y_train.values.ravel())
gs_optimizer.best_params_
c:\Users\user\Projects\python\ckmai\.venv\Lib\site-packages\numpy\ma\core.py:2881: RuntimeWarning: invalid value encountered in cast
  _data = np.array(data, dtype=dtype, copy=copy,
Out[13]:
{'model__criterion': 'gini',
 'model__max_depth': 7,
 'model__max_features': 'sqrt',
 'model__n_estimators': 30}
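
The mean cross-validated score of the winning configuration is also available; for a classifier, GridSearchCV scores with the estimator's default metric (accuracy) unless a scoring argument is given:

# Mean cross-validated accuracy of the best parameter combination
gs_optimizer.best_score_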

Training the model with the new hyperparameters

In [14]:
pipeline = gs_optimizer.best_estimator_.fit(X_train, y_train.values.ravel())

result = run_classification(pipeline, X_train, X_test, y_train, y_test)

Assembling the data for comparing the old and the new version of the model

In [15]:
optimized_metrics = pd.DataFrame(columns=list(result.keys()))
optimized_metrics.loc[len(optimized_metrics)] = pd.Series(
    data=class_models[optimized_model_type]
)
optimized_metrics.loc[len(optimized_metrics)] = pd.Series(
    data=result
)
optimized_metrics.insert(loc=0, column="Name", value=["Old", "New"])
optimized_metrics = optimized_metrics.set_index("Name")

Comparing the metrics of the old and the new model

In [16]:
optimized_metrics[
    [
        "Precision_train",
        "Precision_test",
        "Recall_train",
        "Recall_test",
        "Accuracy_train",
        "Accuracy_test",
        "F1_train",
        "F1_test",
    ]
].style.background_gradient(
    cmap="plasma",
    low=0.3,
    high=1,
    subset=["Accuracy_train", "Accuracy_test", "F1_train", "F1_test"],
).background_gradient(
    cmap="viridis",
    low=1,
    high=0.3,
    subset=[
        "Precision_train",
        "Precision_test",
        "Recall_train",
        "Recall_test",
    ],
)
Out[16]:
  Precision_train Precision_test Recall_train Recall_test Accuracy_train Accuracy_test F1_train F1_test
Name                
Old 0.894340 0.794118 0.868132 0.782609 0.910112 0.837989 0.881041 0.788321
New 0.800699 0.777778 0.838828 0.811594 0.858146 0.837989 0.819320 0.794326
In [17]:
optimized_metrics[
    [
        "Accuracy_test",
        "F1_test",
        "ROC_AUC_test",
        "Cohen_kappa_test",
        "MCC_test",
    ]
].style.background_gradient(
    cmap="plasma",
    low=0.3,
    high=1,
    subset=[
        "ROC_AUC_test",
        "MCC_test",
        "Cohen_kappa_test",
    ],
).background_gradient(
    cmap="viridis",
    low=1,
    high=0.3,
    subset=[
        "Accuracy_test",
        "F1_test",
    ],
)
Out[17]:
  Accuracy_test F1_test ROC_AUC_test Cohen_kappa_test MCC_test
Name          
Old 0.837989 0.788321 0.858893 0.657111 0.657157
New 0.837989 0.794326 0.866140 0.660785 0.661193
In [18]:
_, ax = plt.subplots(1, 2, figsize=(10, 4), sharex=False, sharey=False)

for index in range(len(optimized_metrics)):
    c_matrix = optimized_metrics.iloc[index]["Confusion_matrix"]
    ConfusionMatrixDisplay(
        confusion_matrix=c_matrix, display_labels=["Died", "Survived"]
    ).plot(ax=ax.flat[index])

plt.subplots_adjust(top=1, bottom=0, hspace=0.4, wspace=0.3)
plt.show()
[Figure: confusion matrices for the old and new random forest models]

Regression

Loading the data

In [19]:
import pandas as pd

density_train = pd.read_csv("data/density/density_train.csv", sep=";", decimal=",")
density_test = pd.read_csv("data/density/density_test.csv", sep=";", decimal=",")

display(density_train.head(3))
display(density_test.head(3))
T Al2O3 TiO2 Density
0 20 0.0 0.0 1.06250
1 25 0.0 0.0 1.05979
2 35 0.0 0.0 1.05404
T Al2O3 TiO2 Density
0 30 0.00 0.0 1.05696
1 55 0.00 0.0 1.04158
2 25 0.05 0.0 1.08438

Forming the feature and target samples

In [20]:
density_y_train = pd.DataFrame(density_train["Density"], columns=["Density"])
density_train = density_train.drop(["Density"], axis=1)

display(density_train.head(3))
display(density_y_train.head(3))

density_y_test = pd.DataFrame(density_test["Density"], columns=["Density"])
density_test = density_test.drop(["Density"], axis=1)

display(density_test.head(3))
display(density_y_test.head(3))
T Al2O3 TiO2
0 20 0.0 0.0
1 25 0.0 0.0
2 35 0.0 0.0
Density
0 1.06250
1 1.05979
2 1.05404
T Al2O3 TiO2
0 30 0.00 0.0
1 55 0.00 0.0
2 25 0.05 0.0
Density
0 1.05696
1 1.04158
2 1.08438

Defining the list of algorithms for the approximation (regression) task

In [21]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model, tree, neighbors, ensemble, neural_network

random_state = 9

models = {
    "linear": {"model": linear_model.LinearRegression(n_jobs=-1)},
    "linear_poly": {
        "model": make_pipeline(
            PolynomialFeatures(degree=2),
            linear_model.LinearRegression(fit_intercept=False, n_jobs=-1),
        )
    },
    "linear_interact": {
        "model": make_pipeline(
            PolynomialFeatures(interaction_only=True),
            linear_model.LinearRegression(fit_intercept=False, n_jobs=-1),
        )
    },
    "ridge": {"model": linear_model.RidgeCV()},
    "decision_tree": {
        "model": tree.DecisionTreeRegressor(max_depth=7, random_state=random_state)
    },
    "knn": {"model": neighbors.KNeighborsRegressor(n_neighbors=7, n_jobs=-1)},
    "random_forest": {
        "model": ensemble.RandomForestRegressor(
            max_depth=7, random_state=random_state, n_jobs=-1
        )
    },
    "mlp": {
        "model": neural_network.MLPRegressor(
            activation="tanh",
            hidden_layer_sizes=(3,),
            max_iter=500,
            early_stopping=True,
            random_state=random_state,
        )
    },
}
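
To illustrate what the polynomial models see: PolynomialFeatures(degree=2) expands the three inputs into all monomials up to degree two, while interaction_only=True keeps only the bias, the original features, and their pairwise products. A quick check:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
poly.fit(np.array([[20.0, 0.05, 0.0]]))
# ['1' 'T' 'Al2O3' 'TiO2' 'T^2' 'T Al2O3' 'T TiO2' 'Al2O3^2' 'Al2O3 TiO2' 'TiO2^2']
print(poly.get_feature_names_out(["T", "Al2O3", "TiO2"]))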

Training and evaluating models with the different algorithms

In [22]:
from src.utils import run_regression

for model_name in models.keys():
    print(f"Model: {model_name}")
    X_train = density_train
    X_test = density_test
    y_train = density_y_train
    y_test = density_y_test

    model = models[model_name]["model"]
    fitted_model = model.fit(
        X_train.values, density_y_train.values.ravel()
    )

    models[model_name] = run_regression(fitted_model, X_train, X_test, y_train, y_test)
Model: linear
Model: linear_poly
Model: linear_interact
Model: ridge
Model: decision_tree
Model: knn
Model: random_forest
Model: mlp
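
run_regression, like run_classification, comes from src.utils. A sketch under the assumptions that RMSE is the root mean squared error, RMAE the square root of the mean absolute error, and that the keys train_preds and preds hold the predictions used below (hypothetical reconstruction):

import math
from sklearn import metrics

def run_regression(fitted_model, X_train, X_test, y_train, y_test):
    y_train, y_test = y_train.values.ravel(), y_test.values.ravel()
    # Predict on raw arrays, matching how the models were fitted
    y_train_pred = fitted_model.predict(X_train.values)
    y_test_pred = fitted_model.predict(X_test.values)
    return {
        "fitted": fitted_model,
        "train_preds": y_train_pred,
        "preds": y_test_pred,
        "RMSE_train": math.sqrt(metrics.mean_squared_error(y_train, y_train_pred)),
        "RMSE_test": math.sqrt(metrics.mean_squared_error(y_test, y_test_pred)),
        # assuming RMAE = sqrt(mean absolute error)
        "RMAE_test": math.sqrt(metrics.mean_absolute_error(y_test, y_test_pred)),
        "R2_test": metrics.r2_score(y_test, y_test_pred),
    }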

Displaying the evaluation results

In [23]:
reg_metrics = pd.DataFrame.from_dict(models, "index")[
    ["RMSE_train", "RMSE_test", "RMAE_test", "R2_test"]
]
reg_metrics.sort_values(by="RMSE_test").style.background_gradient(
    cmap="viridis", low=1, high=0.3, subset=["RMSE_train", "RMSE_test"]
).background_gradient(cmap="plasma", low=0.3, high=1, subset=["RMAE_test", "R2_test"])
Out[23]:
  RMSE_train RMSE_test RMAE_test R2_test
linear_poly 0.000319 0.000362 0.016643 0.999965
linear_interact 0.001131 0.001491 0.033198 0.999413
linear 0.002464 0.003261 0.049891 0.997191
random_forest 0.002716 0.005575 0.067298 0.991788
decision_tree 0.000346 0.006433 0.076138 0.989067
ridge 0.013989 0.015356 0.116380 0.937703
knn 0.053108 0.056776 0.217611 0.148414
mlp 0.095734 0.099654 0.270371 -1.623554

Showing the actual and "predicted" values for the training and test samples

Selecting the best model

In [24]:
best_model = str(reg_metrics.sort_values(by="RMSE_test").iloc[0].name)

display(best_model)
'linear_poly'

Output for the training sample

In [25]:
pd.concat(
    [
        density_train,
        density_y_train,
        pd.Series(
            models[best_model]["train_preds"],
            index=density_y_train.index,
            name="DensityPred",
        ),
    ],
    axis=1,
).head(5)
Out[25]:
T Al2O3 TiO2 Density DensityPred
0 20 0.0 0.0 1.06250 1.063174
1 25 0.0 0.0 1.05979 1.060117
2 35 0.0 0.0 1.05404 1.053941
3 40 0.0 0.0 1.05103 1.050822
4 45 0.0 0.0 1.04794 1.047683

Output for the test sample

In [26]:
pd.concat(
    [
        density_test,
        density_y_test,
        pd.Series(
            models[best_model]["preds"],
            index=density_y_test.index,
            name="DensityPred",
        ),
    ],
    axis=1,
).head(5)
Out[26]:
T Al2O3 TiO2 Density DensityPred
0 30 0.00 0.0 1.05696 1.057040
1 55 0.00 0.0 1.04158 1.041341
2 25 0.05 0.0 1.08438 1.084063
3 30 0.05 0.0 1.08112 1.080764
4 35 0.05 0.0 1.07781 1.077444