annalyovushkina@yandex.ru 16d54136f0 ещё коммит

2024-11-15 23:33:34 +04:00

68 KiB

Raw Blame History

In [1]:

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("C://Users//annal//aim//static//csv//Forbes_Billionaires.csv")
print(df.columns)

Index(['Rank ', 'Name', 'Networth', 'Age', 'Country', 'Source', 'Industry'], dtype='object')

Определим бизнес цели:¶

1- Прогнозирование состояния миллиардера(регрессия)¶

2- Прогнозирование возраста миллиардера(классификация)¶

Подготовим данные: категоризируем колонку age¶

In [2]:

print(df.isnull().sum())

print()

# Есть ли пустые значения признаков
print(df.isnull().any())

print()

# Процент пустых значений признаков
for i in df.columns:
    null_rate = df[i].isnull().sum() / len(df) * 100
    if null_rate > 0:
        print(f"{i} процент пустых значений: %{null_rate:.2f}")

Rank        0
Name        0
Networth    0
Age         0
Country     0
Source      0
Industry    0
dtype: int64

Rank        False
Name        False
Networth    False
Age         False
Country     False
Source      False
Industry    False
dtype: bool

In [2]:

bins = [0, 30, 40, 50, 60, 70, 80, 101]  # границы для возрастных категорий
labels = ['Under 30', '30-40', '40-50', '50-60', '60-70', '70-80', '80+']  # метки для категорий

df["Age_category"] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
# Удаляем оригинальные колонки 'country', 'industry' и 'source' из исходного DataFrame
df.drop(columns=['Age'], inplace=True)

# Просмотр результата
print(df.head())

   Rank                        Name  Networth        Country  \
0      1                 Elon Musk      219.0  United States   
1      2                Jeff Bezos      171.0  United States   
2      3  Bernard Arnault & family      158.0         France   
3      4                Bill Gates      129.0  United States   
4      5            Warren Buffett      118.0  United States   

               Source                Industry Age_category  
0       Tesla, SpaceX             Automotive         50-60  
1              Amazon             Technology         50-60  
2                LVMH       Fashion & Retail         70-80  
3           Microsoft             Technology         60-70  
4  Berkshire Hathaway  Finance & Investments           80+

In [27]:

from utils import split_stratified_into_train_val_test

X_train, X_val, X_test, y_train, y_val, y_test = split_stratified_into_train_val_test(
    df, stratify_colname="Age_category", frac_train=0.80, frac_val=0, frac_test=0.20, random_state=9
)

display("X_train", X_train)
display("y_train", y_train)

display("X_test", X_test)
display("y_test", y_test)

'X_train'

	Rank	Name	Networth	Country	Source	Industry	Age_category
1909	1818	Tran Ba Duong & family	1.6	Vietnam	automotive	Automotive	60-70
2099	2076	Mark Dixon	1.4	United Kingdom	office real estate	Real Estate	60-70
1392	1341	Yingzhuo Xu	2.3	China	agribusiness	Food & Beverage	50-60
627	622	Bruce Flatt	4.6	Canada	money management	Finance & Investments	50-60
527	523	Li Liangbin	5.2	China	lithium	Manufacturing	50-60
...	...	...	...	...	...	...	...
84	85	Theo Albrecht, Jr. & family	18.7	Germany	Aldi, Trader Joe's	Fashion & Retail	70-80
633	622	Tony Tamer	4.6	United States	private equity	Finance & Investments	60-70
922	913	Bob Gaglardi	3.3	Canada	hotels	Real Estate	80+
2178	2076	Eugene Wu	1.4	Taiwan	finance	Finance & Investments	70-80
415	411	Leonard Stern	6.2	United States	real estate	Real Estate	80+

2080 rows × 7 columns

'y_train'

	Age_category
1909	60-70
2099	60-70
1392	50-60
627	50-60
527	50-60
...	...
84	70-80
633	60-70
922	80+
2178	70-80
415	80+

2080 rows × 1 columns

'X_test'

	Rank	Name	Networth	Country	Source	Industry	Age_category
2075	2076	Radhe Shyam Agarwal	1.4	India	consumer goods	Fashion & Retail	70-80
1529	1513	Robert Duggan	2.0	United States	pharmaceuticals	Healthcare	70-80
1803	1729	Yao Kuizhang	1.7	China	beverages	Food & Beverage	50-60
425	424	Alexei Kuzmichev	6.0	Russia	oil, banking, telecom	Energy	50-60
2597	2578	Ramesh Genomal	1.0	Philippines	apparel	Fashion & Retail	70-80
...	...	...	...	...	...	...	...
935	913	Alfred Oetker	3.3	Germany	consumer goods	Fashion & Retail	50-60
1541	1513	Thomas Lee	2.0	United States	private equity	Finance & Investments	70-80
1646	1645	Roberto Angelini Rossi	1.8	Chile	forestry, mining	diversified	70-80
376	375	Patrick Drahi	6.6	France	telecom	Telecom	50-60
1894	1818	Gerald Schwartz	1.6	Canada	finance	Finance & Investments	80+

520 rows × 7 columns

'y_test'

	Age_category
2075	70-80
1529	70-80
1803	50-60
425	50-60
2597	70-80
...	...
935	50-60
1541	70-80
1646	70-80
376	50-60
1894	80+

520 rows × 1 columns

Формирование конвейера для классификации данных¶

preprocessing_num -- конвейер для обработки числовых данных: заполнение пропущенных значений и стандартизация¶

preprocessing_cat -- конвейер для обработки категориальных данных: заполнение пропущенных данных и унитарное кодирование¶

features_preprocessing -- трансформер для предобработки признаков¶

features_engineering -- трансформер для конструирования признаков¶

drop_columns -- трансформер для удаления колонок¶

pipeline_end -- основной конвейер предобработки данных и конструирования признаков¶

In [37]:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import pandas as pd

# Исправляем ColumnTransformer с сохранением имен колонок
columns_to_drop = ["Age_category", "Rank ", "Name"]

num_columns = [
    column
    for column in X_train.columns
    if column not in columns_to_drop and X_train[column].dtype != "object"
]
cat_columns = [
    column
    for column in X_train.columns
    if column not in columns_to_drop and X_train[column].dtype == "object"
]

# Предобработка числовых данных
num_imputer = SimpleImputer(strategy="median")
num_scaler = StandardScaler()
preprocessing_num = Pipeline(
    [
        ("imputer", num_imputer),
        ("scaler", num_scaler),
    ]
)

# Предобработка категориальных данных
cat_imputer = SimpleImputer(strategy="constant", fill_value="unknown")
cat_encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop="first")
preprocessing_cat = Pipeline(
    [
        ("imputer", cat_imputer),
        ("encoder", cat_encoder),
    ]
)

# Общая предобработка признаков
features_preprocessing = ColumnTransformer(
    verbose_feature_names_out=True,  # Сохраняем имена колонок
    transformers=[
        ("prepocessing_num", preprocessing_num, num_columns),
        ("prepocessing_cat", preprocessing_cat, cat_columns),
    ],
    remainder="drop"  # Убираем неиспользуемые столбцы
)

# Итоговый конвейер
pipeline_end = Pipeline(
    [
        ("features_preprocessing", features_preprocessing),
    ]
)

# Преобразуем данные
preprocessing_result = pipeline_end.fit_transform(X_train)

# Создаем DataFrame с правильными именами колонок
preprocessed_df = pd.DataFrame(
    preprocessing_result,
    columns=pipeline_end.get_feature_names_out(),
    index=X_train.index,  # Сохраняем индексы
)

preprocessed_df

Out[37]:

	prepocessing_num__Networth	prepocessing_cat__Country_Argentina	prepocessing_cat__Country_Australia	prepocessing_cat__Country_Austria	prepocessing_cat__Country_Barbados	prepocessing_cat__Country_Belgium	prepocessing_cat__Country_Belize	prepocessing_cat__Country_Brazil	prepocessing_cat__Country_Bulgaria	prepocessing_cat__Country_Canada	...	prepocessing_cat__Industry_Logistics	prepocessing_cat__Industry_Manufacturing	prepocessing_cat__Industry_Media & Entertainment	prepocessing_cat__Industry_Metals & Mining	prepocessing_cat__Industry_Real Estate	prepocessing_cat__Industry_Service	prepocessing_cat__Industry_Sports	prepocessing_cat__Industry_Technology	prepocessing_cat__Industry_Telecom	prepocessing_cat__Industry_diversified
1909	-0.309917	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2099	-0.329245	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0
1392	-0.242268	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
627	-0.019995	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
527	0.037990	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
84	1.342637	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
633	-0.019995	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
922	-0.145628	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	...	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0
2178	-0.329245	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
415	0.134630	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0

2080 rows × 860 columns

Формирование набора моделей для классификации¶

logistic -- логистическая регрессия¶

ridge -- гребневая регрессия¶

decision_tree -- дерево решений¶

knn -- k-ближайших соседей¶

naive_bayes -- наивный Байесовский классификатор¶

gradient_boosting -- метод градиентного бустинга (набор деревьев решений)¶

random_forest -- метод случайного леса (набор деревьев решений)¶

mlp -- многослойный персептрон (нейронная сеть)¶

In [38]:

from sklearn import ensemble, linear_model, naive_bayes, neighbors, neural_network, tree

class_models = {
    "logistic": {"model": linear_model.LogisticRegression()},
    # "ridge": {"model": linear_model.RidgeClassifierCV(cv=5, class_weight="balanced")},
    "ridge": {"model": linear_model.LogisticRegression(penalty="l2", class_weight="balanced")},
    "decision_tree": {
        "model": tree.DecisionTreeClassifier(max_depth=7, random_state=9)
    },
    "knn": {"model": neighbors.KNeighborsClassifier(n_neighbors=7)},
    "naive_bayes": {"model": naive_bayes.GaussianNB()},
    "gradient_boosting": {
        "model": ensemble.GradientBoostingClassifier(n_estimators=210)
    },
    "random_forest": {
        "model": ensemble.RandomForestClassifier(
            max_depth=11, class_weight="balanced", random_state=9
        )
    },
    "mlp": {
        "model": neural_network.MLPClassifier(
            hidden_layer_sizes=(7,),
            max_iter=500,
            early_stopping=True,
            random_state=9,
        )
    },
}

Обучение моделей на обучающем наборе данных и оценка на тестовом¶

In [40]:

import numpy as np
from sklearn import metrics

for model_name in class_models.keys():
    print(f"Model: {model_name}")
    model = class_models[model_name]["model"]

    model_pipeline = Pipeline([("pipeline", pipeline_end), ("model", model)])
    model_pipeline = model_pipeline.fit(X_train, y_train.values.ravel())

    y_train_predict = model_pipeline.predict(X_train)
    y_test_probs = model_pipeline.predict_proba(X_test)[:, 1]
    y_test_predict = np.where(y_test_probs > 0.5, 1, 0)

    class_models[model_name]["pipeline"] = model_pipeline
    class_models[model_name]["probs"] = y_test_probs
    class_models[model_name]["preds"] = y_test_predict

    class_models[model_name]["Precision_train"] = metrics.precision_score(
        y_train, y_train_predict
    )
    class_models[model_name]["Precision_test"] = metrics.precision_score(
        y_test, y_test_predict
    )
    class_models[model_name]["Recall_train"] = metrics.recall_score(
        y_train, y_train_predict
    )
    class_models[model_name]["Recall_test"] = metrics.recall_score(
        y_test, y_test_predict
    )
    class_models[model_name]["Accuracy_train"] = metrics.accuracy_score(
        y_train, y_train_predict
    )
    class_models[model_name]["Accuracy_test"] = metrics.accuracy_score(
        y_test, y_test_predict
    )
    class_models[model_name]["ROC_AUC_test"] = metrics.roc_auc_score(
        y_test, y_test_probs
    )
    class_models[model_name]["F1_train"] = metrics.f1_score(y_train, y_train_predict)
    class_models[model_name]["F1_test"] = metrics.f1_score(y_test, y_test_predict)
    class_models[model_name]["MCC_test"] = metrics.matthews_corrcoef(
        y_test, y_test_predict
    )
    class_models[model_name]["Cohen_kappa_test"] = metrics.cohen_kappa_score(
        y_test, y_test_predict
    )
    class_models[model_name]["Confusion_matrix"] = metrics.confusion_matrix(
        y_test, y_test_predict
    )

Model: logistic

c:\Users\annal\aim\.venv\Lib\site-packages\sklearn\preprocessing\_encoders.py:242: UserWarning: Found unknown categories in columns [0, 1] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[40], line 19
     16 class_models[model_name]["probs"] = y_test_probs
     17 class_models[model_name]["preds"] = y_test_predict
---> 19 class_models[model_name]["Precision_train"] = metrics.precision_score(
     20     y_train, y_train_predict
     21 )
     22 class_models[model_name]["Precision_test"] = metrics.precision_score(
     23     y_test, y_test_predict
     24 )
     25 class_models[model_name]["Recall_train"] = metrics.recall_score(
     26     y_train, y_train_predict
     27 )

File c:\Users\annal\aim\.venv\Lib\site-packages\sklearn\utils\_param_validation.py:213, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    207 try:
    208     with config_context(
    209         skip_parameter_validation=(
    210             prefer_skip_nested_validation or global_skip_validation
    211         )
    212     ):
--> 213         return func(*args, **kwargs)
    214 except InvalidParameterError as e:
    215     # When the function is just a wrapper around an estimator, we allow
    216     # the function to delegate validation to the estimator, but we replace
    217     # the name of the estimator by the name of the function in the error
    218     # message to avoid confusion.
    219     msg = re.sub(
    220         r"parameter of \w+ must be",
    221         f"parameter of {func.__qualname__} must be",
    222         str(e),
    223     )

File c:\Users\annal\aim\.venv\Lib\site-packages\sklearn\metrics\_classification.py:2204, in precision_score(y_true, y_pred, labels, pos_label, average, sample_weight, zero_division)
   2037 @validate_params(
   2038     {
   2039         "y_true": ["array-like", "sparse matrix"],
   (...)
   2064     zero_division="warn",
   2065 ):
   2066     """Compute the precision.
   2067 
   2068     The precision is the ratio ``tp / (tp + fp)`` where ``tp`` is the number of
   (...)
   2202     array([0.5, 1. , 1. ])
   2203     """
-> 2204     p, _, _, _ = precision_recall_fscore_support(
   2205         y_true,
   2206         y_pred,
   2207         labels=labels,
   2208         pos_label=pos_label,
   2209         average=average,
   2210         warn_for=("precision",),
   2211         sample_weight=sample_weight,
   2212         zero_division=zero_division,
   2213     )
   2214     return p

File c:\Users\annal\aim\.venv\Lib\site-packages\sklearn\utils\_param_validation.py:186, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    184 global_skip_validation = get_config()["skip_parameter_validation"]
    185 if global_skip_validation:
--> 186     return func(*args, **kwargs)
    188 func_sig = signature(func)
    190 # Map *args/**kwargs to the function signature

File c:\Users\annal\aim\.venv\Lib\site-packages\sklearn\metrics\_classification.py:1789, in precision_recall_fscore_support(y_true, y_pred, beta, labels, pos_label, average, warn_for, sample_weight, zero_division)
   1626 """Compute precision, recall, F-measure and support for each class.
   1627 
   1628 The precision is the ratio ``tp / (tp + fp)`` where ``tp`` is the number of
   (...)
   1786  array([2, 2, 2]))
   1787 """
   1788 _check_zero_division(zero_division)
-> 1789 labels = _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
   1791 # Calculate tp_sum, pred_sum, true_sum ###
   1792 samplewise = average == "samples"

File c:\Users\annal\aim\.venv\Lib\site-packages\sklearn\metrics\_classification.py:1578, in _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
   1576         if y_type == "multiclass":
   1577             average_options.remove("samples")
-> 1578         raise ValueError(
   1579             "Target is %s but average='binary'. Please "
   1580             "choose another average setting, one of %r." % (y_type, average_options)
   1581         )
   1582 elif pos_label not in (None, 1):
   1583     warnings.warn(
   1584         "Note that pos_label (set to %r) is ignored when "
   1585         "average != 'binary' (got %r). You may use "
   (...)
   1588         UserWarning,
   1589     )

ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].

68 KiB Raw Blame History Unescape Escape

Определим бизнес цели:¶

1- Прогнозирование состояния миллиардера(регрессия)¶

2- Прогнозирование возраста миллиардера(классификация)¶

Подготовим данные: категоризируем колонку age¶

Формирование конвейера для классификации данных¶

preprocessing_num -- конвейер для обработки числовых данных: заполнение пропущенных значений и стандартизация¶

preprocessing_cat -- конвейер для обработки категориальных данных: заполнение пропущенных данных и унитарное кодирование¶

features_preprocessing -- трансформер для предобработки признаков¶

features_engineering -- трансформер для конструирования признаков¶

drop_columns -- трансформер для удаления колонок¶

pipeline_end -- основной конвейер предобработки данных и конструирования признаков¶

Формирование набора моделей для классификации¶

logistic -- логистическая регрессия¶

ridge -- гребневая регрессия¶

decision_tree -- дерево решений¶

knn -- k-ближайших соседей¶

naive_bayes -- наивный Байесовский классификатор¶

gradient_boosting -- метод градиентного бустинга (набор деревьев решений)¶

random_forest -- метод случайного леса (набор деревьев решений)¶

mlp -- многослойный персептрон (нейронная сеть)¶

Обучение моделей на обучающем наборе данных и оценка на тестовом¶

68 KiB

Raw Blame History