EXE 90061e7cc8 lab3

2024-10-27 15:20:41 +04:00

188 KiB


Assignment variant: Forecasting sales volume at a coffee shop

Business goals:

Goal: Develop a machine learning model that predicts coffee sales volume from its other characteristics (opening price, closing price).

Technical project goals:

Data collection and preparation: Clean the data of missing values, outliers, and duplicates. Convert categorical variables to numeric ones. Split the data into training and test sets.
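The preparation steps listed above can be sketched as follows. This is a minimal illustration on a made-up frame, not the notebook's own code: the column names mirror the dataset, while the helper name `clean_frame` and the 1.5 × IQR outlier rule are assumptions.

```python
import pandas as pd


def clean_frame(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Drop duplicate rows, fill gaps with the median, filter IQR outliers."""
    out = df.drop_duplicates().copy()
    # Fill missing values with the column median
    out[column] = out[column].fillna(out[column].median())
    # Keep only values within 1.5 * IQR of the quartiles (assumed rule)
    q1, q3 = out[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = out[column].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return out[in_range].reset_index(drop=True)


# Made-up data: one duplicate row, one missing value, one extreme value
toy = pd.DataFrame({
    "Close": [10.0, 10.0, 11.0, None, 12.0, 500.0],
    "Volume": [100, 100, 120, 130, 110, 90],
})
cleaned = clean_frame(toy, "Close")
```

On the toy frame the duplicate row and the extreme value are dropped and the gap is filled, leaving four rows.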

In [20]:

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import matplotlib.ticker as ticker
from datetime import datetime
import matplotlib.dates as md

df = pd.read_csv("./static/csv/Starbucks Dataset.csv")
print(df.columns)

# Parse the date strings into a datetime64 column
df["date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")
df.info()

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8036 entries, 0 to 8035
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       8036 non-null   object        
 1   Open       8036 non-null   float64       
 2   High       8036 non-null   float64       
 3   Low        8036 non-null   float64       
 4   Close      8036 non-null   float64       
 5   Adj Close  8036 non-null   float64       
 6   Volume     8036 non-null   int64         
 7   date       8036 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(5), int64(1), object(1)
memory usage: 502.4+ KB

Split the data into three sets

In [21]:

from sklearn.model_selection import train_test_split

# Split the data into training and test sets (80% training, 20% test)
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

# Split the training set into training and validation sets (80% / 20% of the remainder,
# i.e. about 64% / 16% of the full data)
train_data, val_data = train_test_split(train_data, test_size=0.2, random_state=42)

print("Training set size:", len(train_data))
print("Validation set size:", len(val_data))
print("Test set size:", len(test_data))

Training set size: 5142
Validation set size: 1286
Test set size: 1608
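The reported sizes can be checked by hand. The sketch below assumes that `train_test_split` rounds a fractional test size up and gives the remainder to the training set; the helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test), rounding the test share up (assumed rule)."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# First split: full data -> (train + validation, test)
n_rest, n_test = split_sizes(8036, 0.2)
# Second split: remainder -> (train, validation)
n_train, n_val = split_sizes(n_rest, 0.2)

# Share of the full dataset that ends up in each set
shares = [round(n / 8036, 2) for n in (n_train, n_val, n_test)]
```

Under that assumption the arithmetic gives 5142, 1286, and 1608 rows, which matches the printed sizes and corresponds to roughly a 64% / 16% / 20% split.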

In [22]:

import seaborn as sns
import matplotlib.pyplot as plt

# Histogram of the Volume distribution in the training set
sns.histplot(train_data["Volume"], kde=True)
plt.title('Volume distribution in the training set')
plt.show()

# Histogram of the Volume distribution in the validation set
sns.histplot(val_data["Volume"], kde=True)
plt.title('Volume distribution in the validation set')
plt.show()

# Histogram of the Volume distribution in the test set
sns.histplot(test_data["Volume"], kde=True)
plt.title('Volume distribution in the test set')
plt.show()

[Figure: histograms of Volume with KDE curves for the training, validation, and test sets]

Feature engineering

One-hot encoding of categorical features

One-hot encoding: converting categorical features into binary vectors.

In [23]:

import pandas as pd

# Columns treated as categorical features.
# Note: both are date columns, so encoding them yields roughly one column per distinct date.
categorical_features = [
    "Date",
    "date"
]

# Apply one-hot encoding to each split separately
train_data_encoded = pd.get_dummies(train_data, columns=categorical_features)
val_data_encoded = pd.get_dummies(val_data, columns=categorical_features)
test_data_encoded = pd.get_dummies(test_data, columns=categorical_features)
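When `pd.get_dummies` is applied to each split separately, the splits can end up with different columns if they contain different category values. The sketch below shows one possible way to align them, using a made-up `weekday` column that is not part of the dataset.

```python
import pandas as pd

# Toy splits with a hypothetical low-cardinality categorical column
train = pd.DataFrame({"weekday": ["Mon", "Tue", "Wed"], "Open": [1.0, 2.0, 3.0]})
val = pd.DataFrame({"weekday": ["Tue", "Fri"], "Open": [4.0, 5.0]})

train_enc = pd.get_dummies(train, columns=["weekday"])
val_enc = pd.get_dummies(val, columns=["weekday"])

# Align the validation columns to the training columns:
# categories missing from val are added as 0, categories unseen in train are dropped
val_enc = val_enc.reindex(columns=train_enc.columns, fill_value=0)
```

After the `reindex` call both frames share the same columns in the same order, which is what a model fitted on the training frame expects.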

Discretization of numeric features

In [24]:

import numpy as np

# Labels for the three price bins
labels = ["low hight price", "medium hight price", "big hight price"]
num_bins = 3

# Compute equal-width bin edges over the range of High
hist1, bins1 = np.histogram(
    df["High"].fillna(df["High"].median()), bins=num_bins
)

# Show each High value next to the interval it falls into
pd.concat([df["High"], pd.cut(df["High"], list(bins1))], axis=1).tail(20)

Out[24]:

	High	High
8016	89.250000	(84.329, 126.32]
8017	88.610001	(84.329, 126.32]
8018	88.989998	(84.329, 126.32]
8019	76.989998	(42.338, 84.329]
8020	75.150002	(42.338, 84.329]
8021	75.510002	(42.338, 84.329]
8022	74.190002	(42.338, 84.329]
8023	72.849998	(42.338, 84.329]
8024	74.470001	(42.338, 84.329]
8025	75.760002	(42.338, 84.329]
8026	76.309998	(42.338, 84.329]
8027	76.839996	(42.338, 84.329]
8028	76.730003	(42.338, 84.329]
8029	76.029999	(42.338, 84.329]
8030	75.550003	(42.338, 84.329]
8031	78.000000	(42.338, 84.329]
8032	78.320000	(42.338, 84.329]
8033	78.220001	(42.338, 84.329]
8034	81.019997	(42.338, 84.329]
8035	80.699997	(42.338, 84.329]

In [25]:

pd.concat(
    [df["High"], pd.cut(df["High"], list(bins1), labels=labels)], axis=1
).head(20)

Out[25]:

	High	High
0	0.347656	NaN
1	0.367188	low hight price
2	0.371094	low hight price
3	0.359375	low hight price
4	0.359375	low hight price
5	0.355469	low hight price
6	0.355469	low hight price
7	0.355469	low hight price
8	0.359375	low hight price
9	0.367188	low hight price
10	0.371094	low hight price
11	0.382813	low hight price
12	0.382813	low hight price
13	0.414063	low hight price
14	0.437500	low hight price
15	0.437500	low hight price
16	0.445313	low hight price
17	0.437500	low hight price
18	0.441406	low hight price
19	0.449219	low hight price
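The `NaN` in row 0 is most likely a boundary effect: by default `pd.cut` builds intervals that are open on the left, so a value equal to the lowest bin edge falls outside every bin. A minimal sketch on made-up prices, using the `include_lowest` parameter of `pd.cut`:

```python
import numpy as np
import pandas as pd

# Made-up prices; the minimum (1.0) coincides with the lowest bin edge
prices = pd.Series([1.0, 2.0, 5.0, 10.0])
_, edges = np.histogram(prices, bins=3)  # equal-width edges: 1, 4, 7, 10

# Default: intervals are (left, right], so the minimum is not assigned to any bin
default_bins = pd.cut(prices, list(edges))

# include_lowest=True makes the first interval include its left edge
closed_bins = pd.cut(prices, list(edges), include_lowest=True)
```

With `include_lowest=True` every value, including the minimum, is assigned to a bin.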

Manual feature synthesis

In [26]:

# Example of a synthesized feature: the ratio of the high price to the low price
# (the column is named "medium", but it holds High / Low, not a mean of the two)
train_data_encoded["medium"] = train_data_encoded["High"] / train_data_encoded["Low"]
val_data_encoded["medium"] = val_data_encoded["High"] / val_data_encoded["Low"]
test_data_encoded["medium"] = test_data_encoded["High"] / test_data_encoded["Low"]
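The cell above divides `High` by `Low`, which gives a ratio rather than an average. For comparison, here is a small sketch on made-up prices that computes both the ratio and the midpoint; the column names `high_low_ratio` and `high_low_mean` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Made-up daily high and low prices
toy = pd.DataFrame({"High": [10.0, 21.0], "Low": [8.0, 20.0]})

# Ratio of high to low: a relative measure of the daily range
toy["high_low_ratio"] = toy["High"] / toy["Low"]

# Midpoint of high and low: an average price level for the day
toy["high_low_mean"] = (toy["High"] + toy["Low"]) / 2
```

The ratio stays close to 1 regardless of the price level, while the midpoint follows the price level itself, so the two features carry different information.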

Feature scaling is the process of transforming numeric features so that they share the same scale. This matters for many machine learning algorithms that are sensitive to feature scale, such as linear regression (particularly with regularization or gradient-based fitting), support vector machines (SVM), and neural networks.

In [27]:

from sklearn.preprocessing import StandardScaler

# Scale the numeric features: fit on the training set only, then reuse the fitted scaler
numerical_features = ["Open", "Close"]

scaler = StandardScaler()
train_data_encoded[numerical_features] = scaler.fit_transform(train_data_encoded[numerical_features])
val_data_encoded[numerical_features] = scaler.transform(val_data_encoded[numerical_features])
test_data_encoded[numerical_features] = scaler.transform(test_data_encoded[numerical_features])
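What standardization does can be written out by hand. The sketch below uses made-up values and plain NumPy; it mirrors the fit-on-train, transform-on-validation pattern of the cell above and is meant as an illustration, not a replacement for `StandardScaler`.

```python
import numpy as np

# Made-up single-feature data
train = np.array([[1.0], [2.0], [3.0]])
val = np.array([[4.0]])

# "fit": statistics come from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

# "transform": the same statistics are applied to every split
train_scaled = (train - mean) / std
val_scaled = (val - mean) / std
```

The scaled training column has zero mean and unit variance, while the validation value is expressed in training-set standard deviations. Reusing the training statistics keeps information about the validation and test sets out of the fitted transform.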

Feature engineering with the Featuretools framework

In [28]:

import featuretools as ft

# Define the entity set
es = ft.EntitySet(id='coffee_data')
es = es.add_dataframe(dataframe_name='starbucks', dataframe=train_data_encoded, index='id')


# Generate features
feature_matrix, feature_defs = ft.dfs(
    entityset=es, target_dataframe_name="starbucks", max_depth=2
)

# Compute the same features for the validation and test splits.
# Caveat: `es` holds only the training rows, so these calls look up the given
# instance IDs in the training data rather than in the validation/test frames.
val_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=val_data_encoded.index)
test_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=test_data_encoded.index)

d:\3_КУРС_ПИ\МИИ\aisenv\Lib\site-packages\featuretools\entityset\entityset.py:1733: UserWarning: index id not found in dataframe, creating new integer column
  warnings.warn(
d:\3_КУРС_ПИ\МИИ\aisenv\Lib\site-packages\featuretools\synthesis\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created
  warnings.warn(
d:\3_КУРС_ПИ\МИИ\aisenv\Lib\site-packages\featuretools\computational_backends\feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  df = pd.concat([df, default_df], sort=True)
d:\3_КУРС_ПИ\МИИ\aisenv\Lib\site-packages\woodwork\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  series = series.replace(ww.config.get_option("nan_values"), np.nan)

Evaluating the quality of each feature set

Predictive power. Metrics: RMSE, MAE, R². Methods: train the model on the training set and evaluate it on the validation and test sets.

Computation speed. Methods: measure the time taken to generate the features and train the model.

Reliability. Methods: cross-validation, analysis of the model's sensitivity to changes in the data.

Correlation. Methods: analyze the feature correlation matrix, remove multicollinear features.

Integrity. Methods: check the logical link between the features and the target variable, interpret the model's results.
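The three predictive-power metrics named above can be computed directly from their definitions. The sketch below uses NumPy and made-up values; the helper name `regression_metrics` is hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, and R² from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = float(np.sqrt(np.mean(residuals ** 2)))  # root mean squared error
    mae = float(np.mean(np.abs(residuals)))         # mean absolute error
    ss_res = float(np.sum(residuals ** 2))          # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    return {"rmse": rmse, "mae": mae, "r2": 1.0 - ss_res / ss_tot}


# Made-up targets and predictions: three exact hits and one miss by 2
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
```

RMSE and MAE are expressed in the units of the target, so they are easiest to interpret next to the typical magnitude of the target, whereas R² is unitless.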

In [29]:

import featuretools as ft

# Define the entity set
es = ft.EntitySet(id='coffee_data')
es = es.add_dataframe(
    dataframe_name="starbucks", dataframe=train_data_encoded, index="id"
)

# Generate features
feature_matrix, feature_defs = ft.dfs(
    entityset=es, target_dataframe_name="starbucks", max_depth=2
)

# Compute the same features for the validation and test splits
val_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=val_data_encoded.index)
test_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=test_data_encoded.index)

d:\3_КУРС_ПИ\МИИ\aisenv\Lib\site-packages\featuretools\entityset\entityset.py:724: UserWarning: A Woodwork-initialized DataFrame was provided, so the following parameters were ignored: index
  warnings.warn(
d:\3_КУРС_ПИ\МИИ\aisenv\Lib\site-packages\featuretools\synthesis\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created
  warnings.warn(
d:\3_КУРС_ПИ\МИИ\aisenv\Lib\site-packages\featuretools\computational_backends\feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  df = pd.concat([df, default_df], sort=True)
d:\3_КУРС_ПИ\МИИ\aisenv\Lib\site-packages\woodwork\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  series = series.replace(ww.config.get_option("nan_values"), np.nan)

In [30]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import seaborn as sns

# Drop rows containing NaN
feature_matrix = feature_matrix.dropna()
val_feature_matrix = val_feature_matrix.dropna()
test_feature_matrix = test_feature_matrix.dropna()

# Separate the features from the target (Volume) in each split
X_train = feature_matrix.drop("Volume", axis=1)
y_train = feature_matrix["Volume"]
X_val = val_feature_matrix.drop("Volume", axis=1)
y_val = val_feature_matrix["Volume"]
X_test = test_feature_matrix.drop("Volume", axis=1)
y_test = test_feature_matrix["Volume"]

# Choose the model
model = RandomForestRegressor(random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict on the test set and evaluate
y_pred = model.predict(X_test)

# RMSE is the square root of the MSE (avoids the deprecated `squared=False` argument)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print(f"RMSE: {rmse}")
print(f"R²: {r2}")
print(f"MAE: {mae}")

# Cross-validation on the training set
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
rmse_cv = (-scores.mean())**0.5
print(f"Cross-validated RMSE: {rmse_cv}")

# Feature importance analysis
feature_importances = model.feature_importances_
feature_names = X_train.columns


# Overfitting check: evaluate on the training set
y_train_pred = model.predict(X_train)

rmse_train = mean_squared_error(y_train, y_train_pred) ** 0.5
r2_train = r2_score(y_train, y_train_pred)
mae_train = mean_absolute_error(y_train, y_train_pred)

print(f"Train RMSE: {rmse_train}")
print(f"Train R²: {r2_train}")
print(f"Train MAE: {mae_train}")

# Visualize the results
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
plt.xlabel("Actual Volume")
plt.ylabel("Predicted Volume")
plt.title("Actual vs Predicted Volume")
plt.show()

RMSE: 2885972.9324181927
R²: 0.9328285916832842
MAE: 1680373.6776608187
Cross-validated RMSE: 12160466.835803727
Train RMSE: 4388457.199779966
Train R²: 0.9082228071090095
Train MAE: 1787810.5665033064

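Prediction accuracy: The model reaches a fairly high R² on the test set (0.933), which indicates that it explains most of the variation in volume. The test RMSE (about 2.89 million) and MAE (about 1.68 million) suggest that the model predicts volume reasonably accurately, although the cross-validated RMSE (about 12.16 million) is roughly four times the test RMSE and points to less stable performance across folds.

Overfitting: The gap between the training and test errors is not large. The training RMSE (about 4.39 million) is in fact higher than the test RMSE (about 2.89 million), and the training R² (0.908) is lower than the test R² (0.933), so these numbers show no sign of critical overfitting. It is still worth staying cautious and continuing to monitor this indicator.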
