{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
"### Dataset: Stock prices\n",
"https://www.kaggle.com/datasets/nancyalaswad90/yamana-gold-inc-stock-Volume\n",
"##### About the dataset:\n",
"Yamana Gold Inc. is a Canadian company that develops and operates gold, silver, and copper mines in Canada, Chile, Brazil, and Argentina. Its head office is in Toronto.\n",
"\n",
"Yamana Gold was founded in 1994 and was listed on the Toronto Stock Exchange a year later. It joined the New York Stock Exchange in 2007 and the London Stock Exchange in 2020.\n",
"In 2003 the company underwent significant changes: a restructuring made Peter Marrone chief executive officer, and Yamana merged with the Brazilian company Santa Elina Mines Corporation. The merger gave Yamana access to the capital accumulated by Santa Elina, which allowed it to begin developing and operating the Chapada mine. The company later merged with other TSX-listed organizations: RNC Gold, Desert Sun Mining, Viceroy Exploration, Northern Orion Resources, Meridian Gold, Osisko Mining, and Extorre Gold Mines. Each of them contributed to the development of a deposit or a project that was eventually launched successfully.\n",
"##### In summary:\n",
"* Object of observation: the company's share prices and trading volumes\n",
"* Attributes: 'Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'\n",
"\n",
"##### Business goals:\n",
"* Forecasting the future stock price (closing price).\n",
"  Use the data to build a model that predicts the company's future stock price. Target variable: closing price (Close)\n",
"* Determining stock volatility.\n",
"  Measure fluctuations in the stock price to help investors understand risk. Predict volatility from changes in the open, high, and low prices and in trading volume. Target variable: the difference between the high and low prices (High - Low) (mean value)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 89,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
"Number of columns: 7\n",
"Columns: Date, Open, High, Low, Close, Adj Close, Volume\n",
"\n",
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 5251 entries, 0 to 5250\n",
"Data columns (total 7 columns):\n",
" #   Column     Non-Null Count  Dtype         \n",
"---  ------     --------------  -----         \n",
" 0   Date       5251 non-null   datetime64[ns]\n",
" 1   Open       5251 non-null   float64       \n",
" 2   High       5251 non-null   float64       \n",
" 3   Low        5251 non-null   float64       \n",
" 4   Close      5251 non-null   float64       \n",
" 5   Adj Close  5251 non-null   float64       \n",
" 6   Volume     5251 non-null   int64         \n",
"dtypes: datetime64[ns](1), float64(5), int64(1)\n",
"memory usage: 287.3 KB\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Date</th>\n",
"      <th>Open</th>\n",
"      <th>High</th>\n",
"      <th>Low</th>\n",
"      <th>Close</th>\n",
"      <th>Adj Close</th>\n",
"      <th>Volume</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>5246</th>\n",
"      <td>2022-04-29</td>\n",
"      <td>5.66</td>\n",
"      <td>5.69</td>\n",
"      <td>5.50</td>\n",
"      <td>5.51</td>\n",
"      <td>5.51</td>\n",
"      <td>16613300</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>5247</th>\n",
"      <td>2022-05-02</td>\n",
"      <td>5.33</td>\n",
"      <td>5.39</td>\n",
"      <td>5.18</td>\n",
"      <td>5.30</td>\n",
"      <td>5.30</td>\n",
"      <td>27106700</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>5248</th>\n",
"      <td>2022-05-03</td>\n",
"      <td>5.32</td>\n",
"      <td>5.53</td>\n",
"      <td>5.32</td>\n",
"      <td>5.47</td>\n",
"      <td>5.47</td>\n",
"      <td>18914200</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>5249</th>\n",
"      <td>2022-05-04</td>\n",
"      <td>5.47</td>\n",
"      <td>5.61</td>\n",
"      <td>5.37</td>\n",
"      <td>5.60</td>\n",
"      <td>5.60</td>\n",
"      <td>20530700</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>5250</th>\n",
"      <td>2022-05-05</td>\n",
"      <td>5.63</td>\n",
"      <td>5.66</td>\n",
"      <td>5.34</td>\n",
"      <td>5.44</td>\n",
"      <td>5.44</td>\n",
"      <td>19879200</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"           Date  Open  High   Low  Close  Adj Close    Volume\n",
"5246 2022-04-29  5.66  5.69  5.50   5.51       5.51  16613300\n",
"5247 2022-05-02  5.33  5.39  5.18   5.30       5.30  27106700\n",
"5248 2022-05-03  5.32  5.53  5.32   5.47       5.47  18914200\n",
"5249 2022-05-04  5.47  5.61  5.37   5.60       5.60  20530700\n",
"5250 2022-05-05  5.63  5.66  5.34   5.44       5.44  19879200"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from sklearn.metrics import mean_squared_error, r2_score\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.pipeline import Pipeline, FeatureUnion\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
"from sklearn.base import BaseEstimator, TransformerMixin\n",
"\n",
"from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor\n",
"from sklearn.neural_network import MLPRegressor\n",
"\n",
"df = pd.read_csv(\"./static/csv/Stocks.csv\", sep=\",\")\n",
"print('Number of columns: ' + str(df.columns.size))\n",
"print('Columns: ' + ', '.join(df.columns) + '\\n')\n",
|
||
"df['Date'] = pd.to_datetime(df['Date'], errors='coerce')\n",
|
||
"\n",
|
||
"\n",
|
||
"df.info()\n",
|
||
"df.tail()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
"### Data preparation:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
"#### 1. Checking for missing data\n",
"Types of missing data:\n",
"\n",
"- None: the representation of empty data in Python\n",
"- NaN: the representation of empty data in Pandas\n",
"- '': an empty string"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 90,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
"Date         0\n",
"Open         0\n",
"High         0\n",
"Low          0\n",
"Close        0\n",
"Adj Close    0\n",
"Volume       0\n",
"dtype: int64\n",
"\n",
"Date         False\n",
"Open         False\n",
"High         False\n",
"Low          False\n",
"Close        False\n",
"Adj Close    False\n",
"Volume       False\n",
"dtype: bool\n",
"\n",
"Number of infinite values in each column:\n",
"Date         0\n",
"Open         0\n",
"High         0\n",
"Low          0\n",
"Close        0\n",
"Adj Close    0\n",
"Volume       0\n",
"dtype: int64\n",
"Date percentage of missing values: 0.00%\n",
"Open percentage of missing values: 0.00%\n",
"High percentage of missing values: 0.00%\n",
"Low percentage of missing values: 0.00%\n",
"Close percentage of missing values: 0.00%\n",
"Adj Close percentage of missing values: 0.00%\n",
"Volume percentage of missing values: 0.00%\n"
]
}
],
"source": [
"# Number of missing values per feature\n",
"print(df.isnull().sum())\n",
"print()\n",
"\n",
"# Whether each feature has any missing values\n",
"print(df.isnull().any())\n",
"print()\n",
"\n",
"# Check for infinite values\n",
"print(\"Number of infinite values in each column:\")\n",
"print(np.isinf(df).sum())\n",
"\n",
"# Percentage of missing values per feature\n",
"for i in df.columns:\n",
"    null_rate = df[i].isnull().sum() / len(df) * 100\n",
"    print(f\"{i} percentage of missing values: {null_rate:.2f}%\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
"Thus, no missing values were found."
|
||
]
|
||
},
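{
"cell_type": "markdown",
"metadata": {},
"source": [
"The check above relies on `isnull()`, which reports `None` and `NaN` but not empty strings, the third type listed earlier. A minimal sketch of counting all three follows; the helper `count_missing` and the toy frame `demo` are hypothetical names introduced here for illustration and are not part of the notebook's pipeline.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"\n",
"def count_missing(frame: pd.DataFrame) -> pd.DataFrame:\n",
"    # None and NaN are both reported by isnull()\n",
"    null_counts = frame.isnull().sum()\n",
"    # Empty or whitespace-only strings are not reported by isnull(), so count them separately\n",
"    blank_counts = frame.apply(\n",
"        lambda col: col.map(lambda v: isinstance(v, str) and v.strip() == '').sum()\n",
"    )\n",
"    return pd.DataFrame({'null': null_counts, 'blank': blank_counts})\n",
"\n",
"\n",
"# Hypothetical toy frame, not the stock data\n",
"demo = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', '', None]})\n",
"report = count_missing(demo)\n",
"print(report)\n",
"```\n",
"\n",
"For the stock data every column is numeric or datetime, so the blank-string count would be zero; the sketch matters mainly for datasets with text columns."
]
},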
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
"#### 2. Checking for outliers and handling any that are found:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 91,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
"Before outlier treatment:\n",
"Column Open:\n",
"  Has outliers: No\n",
"  Number of outliers: 0\n",
"  Minimum value: 1.142857\n",
"  Maximum value: 20.42\n",
"  1st quartile (Q1): 2.857143\n",
"  3rd quartile (Q3): 10.65\n",
"\n",
"After outlier treatment:\n",
"Column Open:\n",
"  Has outliers: No\n",
"  Number of outliers: 0\n",
"  Minimum value: 1.142857\n",
"  Maximum value: 20.42\n",
"  1st quartile (Q1): 2.857143\n",
"  3rd quartile (Q3): 10.65\n",
"\n",
"Before outlier treatment:\n",
"Column High:\n",
"  Has outliers: No\n",
"  Number of outliers: 0\n",
"  Minimum value: 1.142857\n",
"  Maximum value: 20.59\n",
"  1st quartile (Q1): 2.88\n",
"  3rd quartile (Q3): 10.86\n",
"\n",
"After outlier treatment:\n",
"Column High:\n",
"  Has outliers: No\n",
"  Number of outliers: 0\n",
"  Minimum value: 1.142857\n",
"  Maximum value: 20.59\n",
"  1st quartile (Q1): 2.88\n",
"  3rd quartile (Q3): 10.86\n",
"\n",
"Before outlier treatment:\n",
"Column Low:\n",
"  Has outliers: No\n",
"  Number of outliers: 0\n",
"  Minimum value: 1.142857\n",
"  Maximum value: 20.09\n",
"  1st quartile (Q1): 2.81\n",
"  3rd quartile (Q3): 10.425\n",
"\n",
"After outlier treatment:\n",
"Column Low:\n",
"  Has outliers: No\n",
"  Number of outliers: 0\n",
"  Minimum value: 1.142857\n",
"  Maximum value: 20.09\n",
"  1st quartile (Q1): 2.81\n",
"  3rd quartile (Q3): 10.425\n",
"\n",
"Before outlier treatment:\n",
"Column Close:\n",
"  Has outliers: No\n",
"  Number of outliers: 0\n",
"  Minimum value: 1.142857\n",
"  Maximum value: 20.389999\n",
"  1st quartile (Q1): 2.857143\n",
"  3rd quartile (Q3): 10.64\n",
"\n",
"After outlier treatment:\n",
"Column Close:\n",
"  Has outliers: No\n",
"  Number of outliers: 0\n",
"  Minimum value: 1.142857\n",
"  Maximum value: 20.389999\n",
"  1st quartile (Q1): 2.857143\n",
"  3rd quartile (Q3): 10.64\n",
"\n",
"Before outlier treatment:\n",
"Column Adj Close:\n",
"  Has outliers: No\n",
"  Number of outliers: 0\n",
"  Minimum value: 0.935334\n",
"  Maximum value: 17.543156\n",
"  1st quartile (Q1): 2.537094\n",
"  3rd quartile (Q3): 8.951944999999998\n",
"\n",
"After outlier treatment:\n",
"Column Adj Close:\n",
"  Has outliers: No\n",
"  Number of outliers: 0\n",
"  Minimum value: 0.935334\n",
"  Maximum value: 17.543156\n",
"  1st quartile (Q1): 2.537094\n",
"  3rd quartile (Q3): 8.951944999999998\n",
"\n",
"Before outlier treatment:\n",
"Column Volume:\n",
"  Has outliers: Yes\n",
"  Number of outliers: 95\n",
"  Minimum value: 0\n",
"  Maximum value: 76714000\n",
"  1st quartile (Q1): 2845900.0\n",
"  3rd quartile (Q3): 13272450.0\n",
"\n",
"After outlier treatment:\n",
"Column Volume:\n",
"  Has outliers: No\n",
"  Number of outliers: 0\n",
"  Minimum value: 0.0\n",
"  Maximum value: 28912275.0\n",
"  1st quartile (Q1): 2845900.0\n",
"  3rd quartile (Q3): 13272450.0\n",
"\n"
]
}
],
"source": [
"numeric_columns = ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']\n",
"\n",
"for column in numeric_columns:\n",
"    if pd.api.types.is_numeric_dtype(df[column]):  # Process numeric columns only\n",
"        q1 = df[column].quantile(0.25)  # 1st quartile (Q1)\n",
"        q3 = df[column].quantile(0.75)  # 3rd quartile (Q3)\n",
"        iqr = q3 - q1  # Interquartile range (IQR)\n",
"\n",
"        # Bounds beyond which a value counts as an outlier\n",
"        lower_bound = q1 - 1.5 * iqr  # Lower bound\n",
"        upper_bound = q3 + 1.5 * iqr  # Upper bound\n",
"\n",
"        # Count the outliers\n",
"        outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]\n",
"        outlier_count = outliers.shape[0]\n",
"\n",
"        print(\"Before outlier treatment:\")\n",
"        print(f\"Column {column}:\")\n",
"        print(f\"  Has outliers: {'Yes' if outlier_count > 0 else 'No'}\")\n",
"        print(f\"  Number of outliers: {outlier_count}\")\n",
"        print(f\"  Minimum value: {df[column].min()}\")\n",
"        print(f\"  Maximum value: {df[column].max()}\")\n",
"        print(f\"  1st quartile (Q1): {q1}\")\n",
"        print(f\"  3rd quartile (Q3): {q3}\\n\")\n",
"\n",
"        # Cap outliers: values below the lower bound become the lower bound, values above the upper bound become the upper bound\n",
"        if outlier_count != 0:\n",
"            df[column] = df[column].apply(lambda x: lower_bound if x < lower_bound else upper_bound if x > upper_bound else x)\n",
"\n",
"        # Recount the outliers after capping\n",
"        outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]\n",
"        outlier_count = outliers.shape[0]\n",
"\n",
"        # Q1 and Q3 printed below are the values computed before capping\n",
"        print(\"After outlier treatment:\")\n",
"        print(f\"Column {column}:\")\n",
"        print(f\"  Has outliers: {'Yes' if outlier_count > 0 else 'No'}\")\n",
"        print(f\"  Number of outliers: {outlier_count}\")\n",
"        print(f\"  Minimum value: {df[column].min()}\")\n",
"        print(f\"  Maximum value: {df[column].max()}\")\n",
"        print(f\"  1st quartile (Q1): {q1}\")\n",
"        print(f\"  3rd quartile (Q3): {q3}\\n\")\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
"Outliers were present (95 values in the Volume column), and we handled them by capping them at the IQR bounds."
|
||
]
|
||
},
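{
"cell_type": "markdown",
"metadata": {},
"source": [
"The capping step used above (pulling values outside the 1.5 x IQR fences back to the nearest fence) can also be expressed with `Series.clip`. The sketch below is an illustrative alternative, not the code the notebook ran; `cap_outliers_iqr`, its parameter `k`, and the toy series `demo` are hypothetical names.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"\n",
"def cap_outliers_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:\n",
"    q1 = values.quantile(0.25)\n",
"    q3 = values.quantile(0.75)\n",
"    iqr = q3 - q1\n",
"    # Values outside [q1 - k*iqr, q3 + k*iqr] are pulled back to the nearest bound\n",
"    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)\n",
"\n",
"\n",
"# Hypothetical toy series: 100.0 lies above the upper fence of 7.0\n",
"demo = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])\n",
"capped = cap_outliers_iqr(demo)\n",
"print(capped.tolist())\n",
"```\n",
"\n",
"Capping keeps every row, which suits a daily price series where dropping rows would leave gaps in the dates."
]
},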
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
"### Splitting into samples:\n",
"\n",
"We split the dataset into training, validation, and test sets to avoid data leakage.\n",
"We prepare two variants: a set for the first business goal, used for the regression task, and a set for the second business goal, used for classification tasks."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 92,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'X_train'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Open</th>\n",
|
||
" <th>High</th>\n",
|
||
" <th>Low</th>\n",
|
||
" <th>Close</th>\n",
|
||
" <th>Volume</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>4789</th>\n",
|
||
" <td>5.66</td>\n",
|
||
" <td>5.73</td>\n",
|
||
" <td>5.47</td>\n",
|
||
" <td>5.560000</td>\n",
|
||
" <td>23355100.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3469</th>\n",
|
||
" <td>3.86</td>\n",
|
||
" <td>3.93</td>\n",
|
||
" <td>3.81</td>\n",
|
||
" <td>3.880000</td>\n",
|
||
" <td>7605300.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2503</th>\n",
|
||
" <td>12.19</td>\n",
|
||
" <td>12.28</td>\n",
|
||
" <td>11.95</td>\n",
|
||
" <td>12.020000</td>\n",
|
||
" <td>7243200.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1580</th>\n",
|
||
" <td>11.77</td>\n",
|
||
" <td>11.84</td>\n",
|
||
" <td>11.53</td>\n",
|
||
" <td>11.570000</td>\n",
|
||
" <td>3025900.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2759</th>\n",
|
||
" <td>15.77</td>\n",
|
||
" <td>16.17</td>\n",
|
||
" <td>15.76</td>\n",
|
||
" <td>16.120001</td>\n",
|
||
" <td>6113400.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3092</th>\n",
|
||
" <td>9.57</td>\n",
|
||
" <td>9.87</td>\n",
|
||
" <td>9.30</td>\n",
|
||
" <td>9.750000</td>\n",
|
||
" <td>7283100.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3772</th>\n",
|
||
" <td>4.76</td>\n",
|
||
" <td>4.97</td>\n",
|
||
" <td>4.67</td>\n",
|
||
" <td>4.930000</td>\n",
|
||
" <td>12920800.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>5191</th>\n",
|
||
" <td>4.18</td>\n",
|
||
" <td>4.29</td>\n",
|
||
" <td>4.17</td>\n",
|
||
" <td>4.200000</td>\n",
|
||
" <td>11192400.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>5226</th>\n",
|
||
" <td>5.58</td>\n",
|
||
" <td>5.68</td>\n",
|
||
" <td>5.55</td>\n",
|
||
" <td>5.580000</td>\n",
|
||
" <td>12692800.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>860</th>\n",
|
||
" <td>3.18</td>\n",
|
||
" <td>3.19</td>\n",
|
||
" <td>3.13</td>\n",
|
||
" <td>3.180000</td>\n",
|
||
" <td>99100.0</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>4200 rows × 5 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
"       Open   High    Low      Close      Volume\n",
"4789   5.66   5.73   5.47   5.560000  23355100.0\n",
"3469   3.86   3.93   3.81   3.880000   7605300.0\n",
"2503  12.19  12.28  11.95  12.020000   7243200.0\n",
"1580  11.77  11.84  11.53  11.570000   3025900.0\n",
"2759  15.77  16.17  15.76  16.120001   6113400.0\n",
"...     ...    ...    ...        ...         ...\n",
"3092   9.57   9.87   9.30   9.750000   7283100.0\n",
"3772   4.76   4.97   4.67   4.930000  12920800.0\n",
"5191   4.18   4.29   4.17   4.200000  11192400.0\n",
"5226   5.58   5.68   5.55   5.580000  12692800.0\n",
"860    3.18   3.19   3.13   3.180000     99100.0\n",
"\n",
"[4200 rows x 5 columns]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'y_train'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Volatility</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>4789</th>\n",
|
||
" <td>0.046763</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3469</th>\n",
|
||
" <td>0.030928</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2503</th>\n",
|
||
" <td>0.027454</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1580</th>\n",
|
||
" <td>0.026793</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2759</th>\n",
|
||
" <td>0.025434</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3092</th>\n",
|
||
" <td>0.058462</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3772</th>\n",
|
||
" <td>0.060852</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>5191</th>\n",
|
||
" <td>0.028571</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>5226</th>\n",
|
||
" <td>0.023297</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>860</th>\n",
|
||
" <td>0.018868</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>4200 rows × 1 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
"      Volatility\n",
"4789    0.046763\n",
"3469    0.030928\n",
"2503    0.027454\n",
"1580    0.026793\n",
"2759    0.025434\n",
"...          ...\n",
"3092    0.058462\n",
"3772    0.060852\n",
"5191    0.028571\n",
"5226    0.023297\n",
"860     0.018868\n",
"\n",
"[4200 rows x 1 columns]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'X_test'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Open</th>\n",
|
||
" <th>High</th>\n",
|
||
" <th>Low</th>\n",
|
||
" <th>Close</th>\n",
|
||
" <th>Volume</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>1437</th>\n",
|
||
" <td>13.710000</td>\n",
|
||
" <td>14.000000</td>\n",
|
||
" <td>13.670000</td>\n",
|
||
" <td>13.940000</td>\n",
|
||
" <td>7623200.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2700</th>\n",
|
||
" <td>15.520000</td>\n",
|
||
" <td>15.720000</td>\n",
|
||
" <td>15.300000</td>\n",
|
||
" <td>15.320000</td>\n",
|
||
" <td>6098800.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3647</th>\n",
|
||
" <td>1.870000</td>\n",
|
||
" <td>1.930000</td>\n",
|
||
" <td>1.830000</td>\n",
|
||
" <td>1.830000</td>\n",
|
||
" <td>10980000.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2512</th>\n",
|
||
" <td>11.260000</td>\n",
|
||
" <td>11.470000</td>\n",
|
||
" <td>11.260000</td>\n",
|
||
" <td>11.320000</td>\n",
|
||
" <td>5029300.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2902</th>\n",
|
||
" <td>16.379999</td>\n",
|
||
" <td>16.580000</td>\n",
|
||
" <td>16.250000</td>\n",
|
||
" <td>16.549999</td>\n",
|
||
" <td>5485800.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3095</th>\n",
|
||
" <td>9.290000</td>\n",
|
||
" <td>9.350000</td>\n",
|
||
" <td>9.070000</td>\n",
|
||
" <td>9.130000</td>\n",
|
||
" <td>5861400.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>859</th>\n",
|
||
" <td>3.090000</td>\n",
|
||
" <td>3.160000</td>\n",
|
||
" <td>3.040000</td>\n",
|
||
" <td>3.100000</td>\n",
|
||
" <td>211300.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3134</th>\n",
|
||
" <td>8.550000</td>\n",
|
||
" <td>8.770000</td>\n",
|
||
" <td>8.550000</td>\n",
|
||
" <td>8.770000</td>\n",
|
||
" <td>5335400.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2577</th>\n",
|
||
" <td>16.709999</td>\n",
|
||
" <td>17.070000</td>\n",
|
||
" <td>16.379999</td>\n",
|
||
" <td>16.400000</td>\n",
|
||
" <td>14524400.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>378</th>\n",
|
||
" <td>2.571429</td>\n",
|
||
" <td>2.571429</td>\n",
|
||
" <td>2.571429</td>\n",
|
||
" <td>2.571429</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>1051 rows × 5 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
"           Open       High        Low      Close      Volume\n",
"1437  13.710000  14.000000  13.670000  13.940000   7623200.0\n",
"2700  15.520000  15.720000  15.300000  15.320000   6098800.0\n",
"3647   1.870000   1.930000   1.830000   1.830000  10980000.0\n",
"2512  11.260000  11.470000  11.260000  11.320000   5029300.0\n",
"2902  16.379999  16.580000  16.250000  16.549999   5485800.0\n",
"...         ...        ...        ...        ...         ...\n",
"3095   9.290000   9.350000   9.070000   9.130000   5861400.0\n",
"859    3.090000   3.160000   3.040000   3.100000    211300.0\n",
"3134   8.550000   8.770000   8.550000   8.770000   5335400.0\n",
"2577  16.709999  17.070000  16.379999  16.400000  14524400.0\n",
"378    2.571429   2.571429   2.571429   2.571429         0.0\n",
"\n",
"[1051 rows x 5 columns]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'y_test'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Volatility</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>1437</th>\n",
|
||
" <td>0.023673</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2700</th>\n",
|
||
" <td>0.027415</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3647</th>\n",
|
||
" <td>0.054645</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2512</th>\n",
|
||
" <td>0.018551</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2902</th>\n",
|
||
" <td>0.019940</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3095</th>\n",
|
||
" <td>0.030668</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>859</th>\n",
|
||
" <td>0.038710</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3134</th>\n",
|
||
" <td>0.025086</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2577</th>\n",
|
||
" <td>0.042073</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>378</th>\n",
|
||
" <td>0.000000</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>1051 rows × 1 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
"      Volatility\n",
"1437    0.023673\n",
"2700    0.027415\n",
"3647    0.054645\n",
"2512    0.018551\n",
"2902    0.019940\n",
"...          ...\n",
"3095    0.030668\n",
"859     0.038710\n",
"3134    0.025086\n",
"2577    0.042073\n",
"378     0.000000\n",
"\n",
"[1051 rows x 1 columns]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
"from typing import Optional, Tuple\n",
"\n",
"import pandas as pd\n",
"from pandas import DataFrame\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Relative daily volatility: the intraday range divided by the closing price\n",
"df['Volatility'] = (df['High'] - df['Low']) / df['Close']\n",
"\n",
"def split_into_train_test(\n",
"    df_input: DataFrame,\n",
"    target_colname: str = \"Volatility\",\n",
"    frac_train: float = 0.8,\n",
"    random_state: Optional[int] = None,\n",
") -> Tuple[DataFrame, DataFrame, DataFrame, DataFrame]:\n",
"\n",
"    if not (0 < frac_train < 1):\n",
"        raise ValueError(\"Fraction must be between 0 and 1.\")\n",
"\n",
"    # Check that the target column exists\n",
"    if target_colname not in df_input.columns:\n",
"        raise ValueError(f\"{target_colname} is not a column in the DataFrame.\")\n",
"\n",
"    # Separate the features from the target variable\n",
"    X = df_input.drop(columns=[target_colname])  # Features\n",
"    y = df_input[[target_colname]]  # Target variable\n",
"\n",
"    # Drop the listed columns from X\n",
"    columns_to_remove = [\"Date\", \"Adj Close\", \"Volatility\"]\n",
"    X = X.drop(columns=columns_to_remove, errors='ignore')  # Ignore columns that are not present\n",
"\n",
"    # Split into training and test sets\n",
"    X_train, X_test, y_train, y_test = train_test_split(\n",
"        X, y,\n",
"        test_size=(1.0 - frac_train),\n",
"        random_state=random_state\n",
"    )\n",
"\n",
"    return X_train, X_test, y_train, y_test\n",
"\n",
"# Apply the function to split the data\n",
"X_train, X_test, y_train, y_test = split_into_train_test(\n",
"    df,\n",
"    target_colname=\"Volatility\",\n",
"    frac_train=0.8,\n",
"    random_state=42\n",
")\n",
"\n",
"# Display the results\n",
"display(\"X_train\", X_train)\n",
"display(\"y_train\", y_train)\n",
"\n",
"display(\"X_test\", X_test)\n",
"display(\"y_test\", y_test)"
|
||
]
|
||
},
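{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above produces only training and test sets, while the section text also mentions a validation set. A minimal sketch of a three-way split follows, assuming two successive calls to `train_test_split` with integer sizes; the helper `split_train_val_test`, its default fractions, and the toy frame `demo` are hypothetical and not part of the notebook.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"\n",
"def split_train_val_test(frame, frac_val=0.2, frac_test=0.2, random_state=42):\n",
"    # Convert fractions to row counts so the set sizes are exact\n",
"    n_val = int(round(len(frame) * frac_val))\n",
"    n_test = int(round(len(frame) * frac_test))\n",
"    # First hold out the validation and test rows together, then divide the held-out part\n",
"    train, holdout = train_test_split(frame, test_size=n_val + n_test, random_state=random_state)\n",
"    val, test = train_test_split(holdout, test_size=n_test, random_state=random_state)\n",
"    return train, val, test\n",
"\n",
"\n",
"# Hypothetical toy frame with 100 rows\n",
"demo = pd.DataFrame({'x': range(100)})\n",
"train, val, test = split_train_val_test(demo)\n",
"print(len(train), len(val), len(test))\n",
"```\n",
"\n",
"Both calls shuffle rows by default. For chronologically ordered data such as daily prices, passing `shuffle=False` to `train_test_split` keeps the later rows in the held-out sets, which may be a better guard against leakage."
]
},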
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 93,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Open</th>\n",
|
||
" <th>High</th>\n",
|
||
" <th>Low</th>\n",
|
||
" <th>Volume</th>\n",
|
||
" <th>Volatility</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>4789</th>\n",
|
||
" <td>5.66</td>\n",
|
||
" <td>5.73</td>\n",
|
||
" <td>5.47</td>\n",
|
||
" <td>23355100.0</td>\n",
|
||
" <td>0.046763</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3469</th>\n",
|
||
" <td>3.86</td>\n",
|
||
" <td>3.93</td>\n",
|
||
" <td>3.81</td>\n",
|
||
" <td>7605300.0</td>\n",
|
||
" <td>0.030928</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2503</th>\n",
|
||
" <td>12.19</td>\n",
|
||
" <td>12.28</td>\n",
|
||
" <td>11.95</td>\n",
|
||
" <td>7243200.0</td>\n",
|
||
" <td>0.027454</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1580</th>\n",
|
||
" <td>11.77</td>\n",
|
||
" <td>11.84</td>\n",
|
||
" <td>11.53</td>\n",
|
||
" <td>3025900.0</td>\n",
|
||
" <td>0.026793</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2759</th>\n",
|
||
" <td>15.77</td>\n",
|
||
" <td>16.17</td>\n",
|
||
" <td>15.76</td>\n",
|
||
" <td>6113400.0</td>\n",
|
||
" <td>0.025434</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3092</th>\n",
|
||
" <td>9.57</td>\n",
|
||
" <td>9.87</td>\n",
|
||
" <td>9.30</td>\n",
|
||
" <td>7283100.0</td>\n",
|
||
" <td>0.058462</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3772</th>\n",
|
||
" <td>4.76</td>\n",
|
||
" <td>4.97</td>\n",
|
||
" <td>4.67</td>\n",
|
||
" <td>12920800.0</td>\n",
|
||
" <td>0.060852</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>5191</th>\n",
|
||
" <td>4.18</td>\n",
|
||
" <td>4.29</td>\n",
|
||
" <td>4.17</td>\n",
|
||
" <td>11192400.0</td>\n",
|
||
" <td>0.028571</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>5226</th>\n",
|
||
" <td>5.58</td>\n",
|
||
" <td>5.68</td>\n",
|
||
" <td>5.55</td>\n",
|
||
" <td>12692800.0</td>\n",
|
||
" <td>0.023297</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>860</th>\n",
|
||
" <td>3.18</td>\n",
|
||
" <td>3.19</td>\n",
|
||
" <td>3.13</td>\n",
|
||
" <td>99100.0</td>\n",
|
||
" <td>0.018868</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>4200 rows × 5 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Open High Low Volume Volatility\n",
|
||
"4789 5.66 5.73 5.47 23355100.0 0.046763\n",
|
||
"3469 3.86 3.93 3.81 7605300.0 0.030928\n",
|
||
"2503 12.19 12.28 11.95 7243200.0 0.027454\n",
|
||
"1580 11.77 11.84 11.53 3025900.0 0.026793\n",
|
||
"2759 15.77 16.17 15.76 6113400.0 0.025434\n",
|
||
"... ... ... ... ... ...\n",
|
||
"3092 9.57 9.87 9.30 7283100.0 0.058462\n",
|
||
"3772 4.76 4.97 4.67 12920800.0 0.060852\n",
|
||
"5191 4.18 4.29 4.17 11192400.0 0.028571\n",
|
||
"5226 5.58 5.68 5.55 12692800.0 0.023297\n",
|
||
"860 3.18 3.19 3.13 99100.0 0.018868\n",
|
||
"\n",
|
||
"[4200 rows x 5 columns]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Close</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4195</th>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4196</th>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4197</th>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4198</th>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4199</th>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>4200 rows × 1 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Close\n",
|
||
"0 1\n",
|
||
"1 1\n",
|
||
"2 0\n",
|
||
"3 0\n",
|
||
"4 0\n",
|
||
"... ...\n",
|
||
"4195 1\n",
|
||
"4196 1\n",
|
||
"4197 1\n",
|
||
"4198 1\n",
|
||
"4199 1\n",
|
||
"\n",
|
||
"[4200 rows x 1 columns]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Open</th>\n",
|
||
" <th>High</th>\n",
|
||
" <th>Low</th>\n",
|
||
" <th>Volume</th>\n",
|
||
" <th>Volatility</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>1437</th>\n",
|
||
" <td>13.710000</td>\n",
|
||
" <td>14.000000</td>\n",
|
||
" <td>13.670000</td>\n",
|
||
" <td>7623200.0</td>\n",
|
||
" <td>0.023673</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2700</th>\n",
|
||
" <td>15.520000</td>\n",
|
||
" <td>15.720000</td>\n",
|
||
" <td>15.300000</td>\n",
|
||
" <td>6098800.0</td>\n",
|
||
" <td>0.027415</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3647</th>\n",
|
||
" <td>1.870000</td>\n",
|
||
" <td>1.930000</td>\n",
|
||
" <td>1.830000</td>\n",
|
||
" <td>10980000.0</td>\n",
|
||
" <td>0.054645</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2512</th>\n",
|
||
" <td>11.260000</td>\n",
|
||
" <td>11.470000</td>\n",
|
||
" <td>11.260000</td>\n",
|
||
" <td>5029300.0</td>\n",
|
||
" <td>0.018551</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2902</th>\n",
|
||
" <td>16.379999</td>\n",
|
||
" <td>16.580000</td>\n",
|
||
" <td>16.250000</td>\n",
|
||
" <td>5485800.0</td>\n",
|
||
" <td>0.019940</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3095</th>\n",
|
||
" <td>9.290000</td>\n",
|
||
" <td>9.350000</td>\n",
|
||
" <td>9.070000</td>\n",
|
||
" <td>5861400.0</td>\n",
|
||
" <td>0.030668</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>859</th>\n",
|
||
" <td>3.090000</td>\n",
|
||
" <td>3.160000</td>\n",
|
||
" <td>3.040000</td>\n",
|
||
" <td>211300.0</td>\n",
|
||
" <td>0.038710</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3134</th>\n",
|
||
" <td>8.550000</td>\n",
|
||
" <td>8.770000</td>\n",
|
||
" <td>8.550000</td>\n",
|
||
" <td>5335400.0</td>\n",
|
||
" <td>0.025086</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2577</th>\n",
|
||
" <td>16.709999</td>\n",
|
||
" <td>17.070000</td>\n",
|
||
" <td>16.379999</td>\n",
|
||
" <td>14524400.0</td>\n",
|
||
" <td>0.042073</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>378</th>\n",
|
||
" <td>2.571429</td>\n",
|
||
" <td>2.571429</td>\n",
|
||
" <td>2.571429</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>1051 rows × 5 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Open High Low Volume Volatility\n",
|
||
"1437 13.710000 14.000000 13.670000 7623200.0 0.023673\n",
|
||
"2700 15.520000 15.720000 15.300000 6098800.0 0.027415\n",
|
||
"3647 1.870000 1.930000 1.830000 10980000.0 0.054645\n",
|
||
"2512 11.260000 11.470000 11.260000 5029300.0 0.018551\n",
|
||
"2902 16.379999 16.580000 16.250000 5485800.0 0.019940\n",
|
||
"... ... ... ... ... ...\n",
|
||
"3095 9.290000 9.350000 9.070000 5861400.0 0.030668\n",
|
||
"859 3.090000 3.160000 3.040000 211300.0 0.038710\n",
|
||
"3134 8.550000 8.770000 8.550000 5335400.0 0.025086\n",
|
||
"2577 16.709999 17.070000 16.379999 14524400.0 0.042073\n",
|
||
"378 2.571429 2.571429 2.571429 0.0 0.000000\n",
|
||
"\n",
|
||
"[1051 rows x 5 columns]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Close</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1046</th>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1047</th>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1048</th>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1049</th>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1050</th>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>1051 rows × 1 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Close\n",
|
||
"0 0\n",
|
||
"1 0\n",
|
||
"2 1\n",
|
||
"3 0\n",
|
||
"4 0\n",
|
||
"... ...\n",
|
||
"1046 1\n",
|
||
"1047 1\n",
|
||
"1048 1\n",
|
||
"1049 0\n",
|
||
"1050 1\n",
|
||
"\n",
|
||
"[1051 rows x 1 columns]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from IPython.display import display\n",
"from sklearn.preprocessing import LabelEncoder\n",
"from sklearn.model_selection import train_test_split\n",
"from typing import Optional, Tuple\n",
"from pandas import DataFrame\n",
"\n",
"def split_into_train_close_test(\n",
"    df_input: DataFrame,\n",
"    target_colname: str = \"Close\",\n",
"    frac_train: float = 0.8,\n",
"    random_state: Optional[int] = None,\n",
") -> Tuple[DataFrame, DataFrame, DataFrame, DataFrame]:\n",
"\n",
"    if not (0 < frac_train < 1):\n",
"        raise ValueError(\"Fraction must be between 0 and 1.\")\n",
"\n",
"    # Check that the target column is present\n",
"    if target_colname not in df_input.columns:\n",
"        raise ValueError(f\"{target_colname} is not a column in the DataFrame.\")\n",
"\n",
"    # Split the data into features and the target variable\n",
"    X = df_input.drop(columns=[target_colname])  # Features\n",
"\n",
"    # Bin the target into two categories: 'low' (<= 10) and 'high' (> 10)\n",
"    bins = [-np.inf, 10, np.inf]\n",
"    labels = ['low', 'high']\n",
"    y = pd.cut(df_input[target_colname], bins=bins, labels=labels)  # Target variable\n",
"\n",
"    # Encode the categorical target as integers\n",
"    # (LabelEncoder sorts classes alphabetically: 'high' -> 0, 'low' -> 1)\n",
"    label_encoder = LabelEncoder()\n",
"    y_encoded = label_encoder.fit_transform(y)\n",
"\n",
"    # Drop the listed columns from X\n",
"    columns_to_remove = [\"Date\", \"Adj Close\", \"Close\"]\n",
"    X = X.drop(columns=columns_to_remove, errors='ignore')  # Ignore columns that are not present\n",
"\n",
"    # Split the data into training and test sets\n",
"    X_train_close, X_test_close, y_train_close, y_test_close = train_test_split(\n",
"        X, y_encoded,\n",
"        test_size=(1.0 - frac_train),\n",
"        random_state=random_state\n",
"    )\n",
"\n",
"    # Convert y_train_close and y_test_close to DataFrames\n",
"    y_train_close = pd.DataFrame(y_train_close, columns=[target_colname])\n",
"    y_test_close = pd.DataFrame(y_test_close, columns=[target_colname])\n",
"\n",
"    return X_train_close, X_test_close, y_train_close, y_test_close\n",
"\n",
"# Apply the function to split the data\n",
"X_train_close, X_test_close, y_train_close, y_test_close = split_into_train_close_test(\n",
"    df,\n",
"    target_colname=\"Close\",\n",
"    frac_train=0.8,\n",
"    random_state=42\n",
")\n",
"\n",
"# Display the results\n",
"display(X_train_close)\n",
"display(y_train_close)\n",
"display(X_test_close)\n",
"display(y_test_close)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"##### Determining the achievable level of model quality"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 94,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Baseline Volatility MSE: 0.0002712979238081643\n",
|
||
"Baseline Volatility R^2: 0.6636185388238924\n",
|
||
"Baseline Close MSE: 0.04753452157168089\n",
|
||
"Baseline Close R^2: 0.764076420247305\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"# Evaluate baseline models (linear regression serves as the baseline here)\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error, r2_score\n",
"\n",
"# Baseline for the volatility target\n",
"baseline_model = LinearRegression()\n",
"baseline_model.fit(X_train, y_train)\n",
"baseline = baseline_model.predict(X_test)\n",
"\n",
"# Quality assessment\n",
"print(f'Baseline Volatility MSE: {mean_squared_error(y_test, baseline)}')\n",
"print(f'Baseline Volatility R^2: {r2_score(y_test, baseline)}')\n",
"\n",
"# Baseline for the encoded Close target\n",
"baseline_model_close = LinearRegression()\n",
"baseline_model_close.fit(X_train_close, y_train_close)\n",
"baseline_close = baseline_model_close.predict(X_test_close)\n",
"\n",
"print(f'Baseline Close MSE: {mean_squared_error(y_test_close, baseline_close)}')\n",
"print(f'Baseline Close R^2: {r2_score(y_test_close, baseline_close)}')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"**Price (encoded `Close` label):**\n",
"- MSE: 0.0475. On a 0/1 target this is a small mean squared error for the linear baseline.\n",
"- R²: 0.7641. The model explains roughly 76% of the variance of the encoded `Close` label.\n",
"\n",
"**Volatility:**\n",
"- MSE: 0.00027130. As with price, the value looks small, but volatility itself is small in financial data, so even small errors can matter.\n",
"- R²: 0.6636. The model explains only about 66% of the variance in volatility."
|
||
]
|
||
},
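{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a short reminder of how the coefficient of determination is usually defined. The notation is the standard textbook one and is an assumption, not taken from this notebook: $y_i$ are the true values, $\\hat{y}_i$ the predictions, and $\\bar{y}$ the mean of the true values on the test set.\n",
"\n",
"$$R^2 = 1 - \\frac{\\sum_{i=1}^{n}(y_i - \\hat{y}_i)^2}{\\sum_{i=1}^{n}(y_i - \\bar{y})^2}$$\n",
"\n",
"Under this definition, $R^2 \\approx 0.66$ means the residual sum of squares is about 34% of the total sum of squares around the mean."
]
},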
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"### **Building a pipeline:**\n",
"##### Pipelines automate the following steps:\n",
"1. Data preprocessing.\n",
"2. Feature engineering.\n",
"3. Dimensionality reduction of the feature space.\n",
"4. Model training.\n",
"\n",
"\n",
"##### Pipelines used:\n",
"1. preprocessing_num -- pipeline for numeric data: imputation of missing values and standardization\n",
"\n",
"2. preprocessing_cat -- pipeline for categorical data: imputation of missing values and one-hot encoding\n",
"\n",
"3. features_preprocessing -- transformer for feature preprocessing\n",
"\n",
"4. features_engineering -- transformer for feature engineering\n",
"\n",
"5. drop_columns -- transformer for dropping columns\n",
"\n",
"6. pipeline_end -- main pipeline for data preprocessing and feature engineering"
|
||
]
|
||
},
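{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of what the numeric branch does, assuming `preprocessing_num` is median imputation followed by `StandardScaler` with default settings: each numeric column $x$ is rescaled as\n",
"\n",
"$$z = \\frac{x - \\mu}{\\sigma}$$\n",
"\n",
"where $\\mu$ and $\\sigma$ are the mean and (population) standard deviation of that column, estimated on the training data only. The symbols $z$, $\\mu$, $\\sigma$ are introduced here for illustration. After this step every numeric column has approximately zero mean and unit variance on the training set."
]
},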
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 95,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.base import BaseEstimator, TransformerMixin\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"from sklearn.ensemble import RandomForestRegressor  # Example regression model\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.pipeline import make_pipeline\n",
"\n",
"class StocksFeatures(BaseEstimator, TransformerMixin):\n",
"    \"\"\"Adds a Range feature (High - Low) to the input DataFrame.\"\"\"\n",
"    def __init__(self):\n",
"        pass\n",
"\n",
"    def fit(self, X, y=None):\n",
"        return self\n",
"\n",
"    def transform(self, X, y=None):\n",
"        X = X.copy()  # Work on a copy so the caller's DataFrame is not mutated\n",
"        X[\"Range\"] = X[\"High\"] - X[\"Low\"]\n",
"        return X\n",
"\n",
"    def get_feature_names_out(self, features_in):\n",
"        return np.append(features_in, [\"Range\"], axis=0)\n",
"\n",
"num_columns = [\"Open\", \"High\", \"Low\", \"Close\", \"Volume\"]\n",
"\n",
"# Preprocessing for numeric data: median imputation, then standardization\n",
"num_imputer = SimpleImputer(strategy=\"median\")\n",
"num_scaler = StandardScaler()\n",
"preprocessing_num = Pipeline(\n",
"    [\n",
"        (\"imputer\", num_imputer),\n",
"        (\"scaler\", num_scaler),\n",
"    ]\n",
")\n",
"\n",
"# There are no categorical columns, so the list stays empty\n",
"cat_columns = []\n",
"\n",
"# Feature preprocessing with ColumnTransformer\n",
"features_preprocessing = ColumnTransformer(\n",
"    verbose_feature_names_out=False,\n",
"    transformers=[\n",
"        (\"preprocessing_num\", preprocessing_num, num_columns),\n",
"    ],\n",
"    remainder=\"passthrough\"\n",
")\n",
"\n",
"# Prepare the target variable\n",
"y_train = y_train.values.reshape(-1, 1)  # Make y_train a 2D column vector; note that most sklearn regressors expect 1D y and emit a DataConversionWarning otherwise\n",
"\n",
"# Build the final pipeline\n",
"pipeline = Pipeline(steps=[\n",
"    ('feature_engineering', StocksFeatures()),\n",
"    ('imputer', SimpleImputer(strategy='median')),\n",
"    ('scaler', StandardScaler())\n",
"]\n",
")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"Apply the pipeline\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 96,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" Open High Low Close Volume Range\n",
|
||
"0 -0.264188 -0.270892 -0.278887 -0.282602 1.986714 -0.037298\n",
|
||
"1 -0.642218 -0.642866 -0.634425 -0.636104 -0.197546 -0.634720\n",
|
||
"2 1.107219 1.082683 1.108998 1.076697 -0.247763 0.261413\n",
|
||
"3 1.019012 0.991756 1.019043 0.982009 -0.832639 0.176067\n",
|
||
"4 1.859078 1.886561 1.925023 1.939410 -0.404450 0.602796\n",
|
||
"... ... ... ... ... ... ...\n",
|
||
"4195 0.556976 0.584650 0.541422 0.599049 -0.242230 1.285564\n",
|
||
"4196 -0.453203 -0.427948 -0.450230 -0.415165 0.539634 0.133394\n",
|
||
"4197 -0.575013 -0.568471 -0.557320 -0.568770 0.299931 -0.634720\n",
|
||
"4198 -0.280990 -0.281224 -0.261752 -0.278393 0.508014 -0.592047\n",
|
||
"4199 -0.785029 -0.795789 -0.780067 -0.783396 -1.238542 -0.890757\n",
|
||
"\n",
|
||
"[4200 rows x 6 columns]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"# Apply the pipeline to X_train\n",
"preprocessing_result = pipeline.fit_transform(X_train)\n",
"\n",
"# Build a new DataFrame from the processed data\n",
"preprocessed_df = pd.DataFrame(\n",
"    preprocessing_result,\n",
"    columns=pipeline.get_feature_names_out(input_features=num_columns),\n",
")\n",
"\n",
"# Print the processed DataFrame\n",
"print(preprocessed_df)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"### **First, the regression task:**\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"Training amounts to minimizing the mean deviation of the predicted\n",
"target value from the actual target value over the whole sample.\n",
"\n",
"Regression (approximation).\n",
"Produces a value from the range of the target feature."
|
||
]
|
||
},
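{
"cell_type": "markdown",
"metadata": {},
"source": [
"One common way to write this mean error is the mean squared error, which is also the metric printed in the baseline cell. The notation is an assumption introduced here: $y_i$ is the true target value, $\\hat{y}_i$ the model prediction, $n$ the number of samples.\n",
"\n",
"$$\\mathrm{MSE} = \\frac{1}{n}\\sum_{i=1}^{n}(y_i - \\hat{y}_i)^2$$\n",
"\n",
"As a worked example, the baseline volatility MSE of about $0.000271$ corresponds to a root mean squared error of $\\sqrt{0.000271} \\approx 0.0165$, which is in the same units as the target itself."
]
},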
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"We select the following regression models:\n",
"1. Random forest\n",
"2. Ridge regression\n",
"3. Gradient boosting\n",
"\n",
"We tune the hyperparameters of each model."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 97,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\model_selection\\_validation.py:540: FitFailedWarning: \n",
|
||
"27 fits failed out of a total of 54.\n",
|
||
"The score on these train-test partitions for these parameters will be set to nan.\n",
|
||
"If these failures are not expected, you can try to debug them by setting error_score='raise'.\n",
|
||
"\n",
|
||
"Below are more details about the failures:\n",
|
||
"--------------------------------------------------------------------------------\n",
|
||
"27 fits failed with the following error:\n",
|
||
"Traceback (most recent call last):\n",
|
||
" File \"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\model_selection\\_validation.py\", line 888, in _fit_and_score\n",
|
||
" estimator.fit(X_train, y_train, **fit_params)\n",
|
||
" File \"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py\", line 1466, in wrapper\n",
|
||
" estimator._validate_params()\n",
|
||
" File \"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py\", line 666, in _validate_params\n",
|
||
" validate_parameter_constraints(\n",
|
||
" File \"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\_param_validation.py\", line 95, in validate_parameter_constraints\n",
|
||
" raise InvalidParameterError(\n",
|
||
"sklearn.utils._param_validation.InvalidParameterError: The 'max_features' parameter of RandomForestRegressor must be an int in the range [1, inf), a float in the range (0.0, 1.0], a str among {'sqrt', 'log2'} or None. Got 'auto' instead.\n",
|
||
"\n",
|
||
" warnings.warn(some_fits_failed_message, FitFailedWarning)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\numpy\\ma\\core.py:2881: RuntimeWarning: invalid value encountered in cast\n",
|
||
" _data = np.array(data, dtype=dtype, copy=copy,\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\model_selection\\_search.py:1103: UserWarning: One or more of the test scores are non-finite: [ nan nan nan -3.60510202e-05\n",
|
||
" -3.51521700e-05 -3.55224330e-05 nan nan\n",
|
||
" nan -5.23951976e-05 -5.06082610e-05 -5.35685939e-05\n",
|
||
" nan nan nan -3.75406904e-05\n",
|
||
" -3.50578578e-05 -3.44782270e-05]\n",
|
||
" warnings.warn(\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"from sklearn.model_selection import GridSearchCV\n",
|
||
"from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor\n",
|
||
"from sklearn.linear_model import Ridge\n",
|
||
"from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score\n",
|
||
"from sklearn.pipeline import Pipeline\n",
|
||
"import pandas as pd\n",
|
||
"\n",
|
||
"# 1. Настройка гиперпараметров для каждой модели\n",
|
||
"# Random Forest hyperparameters\n",
|
||
"rf_params = {\n",
|
||
" 'n_estimators': [50, 100, 200],\n",
|
||
" 'max_features': ['auto', 'sqrt'],\n",
|
||
" 'max_depth': [None, 10, 20],\n",
|
||
"}\n",
|
||
"\n",
|
||
"# Ridge hyperparameters\n",
|
||
"ridge_params = {\n",
|
||
" 'alpha': [0.1, 1.0, 10.0],\n",
|
||
"}\n",
|
||
"\n",
|
||
"# Gradient Boosting hyperparameters\n",
|
||
"gb_params = {\n",
|
||
" 'n_estimators': [50, 100],\n",
|
||
" 'learning_rate': [0.01, 0.1],\n",
|
||
" 'max_depth': [3, 5],\n",
|
||
"}\n",
|
||
"\n",
|
||
"# Curate a function for model training and evaluation\n",
|
||
"def train_and_evaluate_model(model, param_grid, X_train, y_train):\n",
|
||
" grid_search = GridSearchCV(model, param_grid, scoring='neg_mean_squared_error', cv=3)\n",
|
||
" grid_search.fit(X_train, y_train)\n",
|
||
" return grid_search.best_estimator_, grid_search.best_params_\n",
|
||
"\n",
|
||
"# Исходные данные после преобразования (Pipeline применения)\n",
|
||
"X_train_transformed = pipeline.fit_transform(X_train)\n",
|
||
"\n",
|
||
"# Обучение моделей с подбором гиперпараметров\n",
|
||
"rf_model, rf_params = train_and_evaluate_model(RandomForestRegressor(), rf_params, X_train_transformed, y_train)\n",
|
||
"ridge_model, ridge_params = train_and_evaluate_model(Ridge(), ridge_params, X_train_transformed, y_train)\n",
|
||
"gb_model, gb_params = train_and_evaluate_model(GradientBoostingRegressor(), gb_params, X_train_transformed, y_train)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"#### Обучим модели на преобразованных данных:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 98,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<style>#sk-container-id-5 {\n",
|
||
" /* Definition of color scheme common for light and dark mode */\n",
|
||
" --sklearn-color-text: black;\n",
|
||
" --sklearn-color-line: gray;\n",
|
||
" /* Definition of color scheme for unfitted estimators */\n",
|
||
" --sklearn-color-unfitted-level-0: #fff5e6;\n",
|
||
" --sklearn-color-unfitted-level-1: #f6e4d2;\n",
|
||
" --sklearn-color-unfitted-level-2: #ffe0b3;\n",
|
||
" --sklearn-color-unfitted-level-3: chocolate;\n",
|
||
" /* Definition of color scheme for fitted estimators */\n",
|
||
" --sklearn-color-fitted-level-0: #f0f8ff;\n",
|
||
" --sklearn-color-fitted-level-1: #d4ebff;\n",
|
||
" --sklearn-color-fitted-level-2: #b3dbfd;\n",
|
||
" --sklearn-color-fitted-level-3: cornflowerblue;\n",
|
||
"\n",
|
||
" /* Specific color for light theme */\n",
|
||
" --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n",
|
||
" --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, white)));\n",
|
||
" --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n",
|
||
" --sklearn-color-icon: #696969;\n",
|
||
"\n",
|
||
" @media (prefers-color-scheme: dark) {\n",
|
||
" /* Redefinition of color scheme for dark theme */\n",
|
||
" --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n",
|
||
" --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, #111)));\n",
|
||
" --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n",
|
||
" --sklearn-color-icon: #878787;\n",
|
||
" }\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 {\n",
|
||
" color: var(--sklearn-color-text);\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 pre {\n",
|
||
" padding: 0;\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 input.sk-hidden--visually {\n",
|
||
" border: 0;\n",
|
||
" clip: rect(1px 1px 1px 1px);\n",
|
||
" clip: rect(1px, 1px, 1px, 1px);\n",
|
||
" height: 1px;\n",
|
||
" margin: -1px;\n",
|
||
" overflow: hidden;\n",
|
||
" padding: 0;\n",
|
||
" position: absolute;\n",
|
||
" width: 1px;\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-dashed-wrapped {\n",
|
||
" border: 1px dashed var(--sklearn-color-line);\n",
|
||
" margin: 0 0.4em 0.5em 0.4em;\n",
|
||
" box-sizing: border-box;\n",
|
||
" padding-bottom: 0.4em;\n",
|
||
" background-color: var(--sklearn-color-background);\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-container {\n",
|
||
" /* jupyter's `normalize.less` sets `[hidden] { display: none; }`\n",
|
||
" but bootstrap.min.css set `[hidden] { display: none !important; }`\n",
|
||
" so we also need the `!important` here to be able to override the\n",
|
||
" default hidden behavior on the sphinx rendered scikit-learn.org.\n",
|
||
" See: https://github.com/scikit-learn/scikit-learn/issues/21755 */\n",
|
||
" display: inline-block !important;\n",
|
||
" position: relative;\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-text-repr-fallback {\n",
|
||
" display: none;\n",
|
||
"}\n",
|
||
"\n",
|
||
"div.sk-parallel-item,\n",
|
||
"div.sk-serial,\n",
|
||
"div.sk-item {\n",
|
||
" /* draw centered vertical line to link estimators */\n",
|
||
" background-image: linear-gradient(var(--sklearn-color-text-on-default-background), var(--sklearn-color-text-on-default-background));\n",
|
||
" background-size: 2px 100%;\n",
|
||
" background-repeat: no-repeat;\n",
|
||
" background-position: center center;\n",
|
||
"}\n",
|
||
"\n",
|
||
"/* Parallel-specific style estimator block */\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-parallel-item::after {\n",
|
||
" content: \"\";\n",
|
||
" width: 100%;\n",
|
||
" border-bottom: 2px solid var(--sklearn-color-text-on-default-background);\n",
|
||
" flex-grow: 1;\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-parallel {\n",
|
||
" display: flex;\n",
|
||
" align-items: stretch;\n",
|
||
" justify-content: center;\n",
|
||
" background-color: var(--sklearn-color-background);\n",
|
||
" position: relative;\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-parallel-item {\n",
|
||
" display: flex;\n",
|
||
" flex-direction: column;\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-parallel-item:first-child::after {\n",
|
||
" align-self: flex-end;\n",
|
||
" width: 50%;\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-parallel-item:last-child::after {\n",
|
||
" align-self: flex-start;\n",
|
||
" width: 50%;\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-parallel-item:only-child::after {\n",
|
||
" width: 0;\n",
|
||
"}\n",
|
||
"\n",
|
||
"/* Serial-specific style estimator block */\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-serial {\n",
|
||
" display: flex;\n",
|
||
" flex-direction: column;\n",
|
||
" align-items: center;\n",
|
||
" background-color: var(--sklearn-color-background);\n",
|
||
" padding-right: 1em;\n",
|
||
" padding-left: 1em;\n",
|
||
"}\n",
|
||
"\n",
|
||
"\n",
|
||
"/* Toggleable style: style used for estimator/Pipeline/ColumnTransformer box that is\n",
|
||
"clickable and can be expanded/collapsed.\n",
|
||
"- Pipeline and ColumnTransformer use this feature and define the default style\n",
|
||
"- Estimators will overwrite some part of the style using the `sk-estimator` class\n",
|
||
"*/\n",
|
||
"\n",
|
||
"/* Pipeline and ColumnTransformer style (default) */\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-toggleable {\n",
|
||
" /* Default theme specific background. It is overwritten whether we have a\n",
|
||
" specific estimator or a Pipeline/ColumnTransformer */\n",
|
||
" background-color: var(--sklearn-color-background);\n",
|
||
"}\n",
|
||
"\n",
|
||
"/* Toggleable label */\n",
|
||
"#sk-container-id-5 label.sk-toggleable__label {\n",
|
||
" cursor: pointer;\n",
|
||
" display: block;\n",
|
||
" width: 100%;\n",
|
||
" margin-bottom: 0;\n",
|
||
" padding: 0.5em;\n",
|
||
" box-sizing: border-box;\n",
|
||
" text-align: center;\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 label.sk-toggleable__label-arrow:before {\n",
|
||
" /* Arrow on the left of the label */\n",
|
||
" content: \"▸\";\n",
|
||
" float: left;\n",
|
||
" margin-right: 0.25em;\n",
|
||
" color: var(--sklearn-color-icon);\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 label.sk-toggleable__label-arrow:hover:before {\n",
|
||
" color: var(--sklearn-color-text);\n",
|
||
"}\n",
|
||
"\n",
|
||
"/* Toggleable content - dropdown */\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-toggleable__content {\n",
|
||
" max-height: 0;\n",
|
||
" max-width: 0;\n",
|
||
" overflow: hidden;\n",
|
||
" text-align: left;\n",
|
||
" /* unfitted */\n",
|
||
" background-color: var(--sklearn-color-unfitted-level-0);\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-toggleable__content.fitted {\n",
|
||
" /* fitted */\n",
|
||
" background-color: var(--sklearn-color-fitted-level-0);\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-toggleable__content pre {\n",
|
||
" margin: 0.2em;\n",
|
||
" border-radius: 0.25em;\n",
|
||
" color: var(--sklearn-color-text);\n",
|
||
" /* unfitted */\n",
|
||
" background-color: var(--sklearn-color-unfitted-level-0);\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-toggleable__content.fitted pre {\n",
|
||
" /* unfitted */\n",
|
||
" background-color: var(--sklearn-color-fitted-level-0);\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 input.sk-toggleable__control:checked~div.sk-toggleable__content {\n",
|
||
" /* Expand drop-down */\n",
|
||
" max-height: 200px;\n",
|
||
" max-width: 100%;\n",
|
||
" overflow: auto;\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {\n",
|
||
" content: \"▾\";\n",
|
||
"}\n",
|
||
"\n",
|
||
"/* Pipeline/ColumnTransformer-specific style */\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
|
||
" color: var(--sklearn-color-text);\n",
|
||
" background-color: var(--sklearn-color-unfitted-level-2);\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-label.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
|
||
" background-color: var(--sklearn-color-fitted-level-2);\n",
|
||
"}\n",
|
||
"\n",
|
||
"/* Estimator-specific style */\n",
|
||
"\n",
|
||
"/* Colorize estimator box */\n",
|
||
"#sk-container-id-5 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
|
||
" /* unfitted */\n",
|
||
" background-color: var(--sklearn-color-unfitted-level-2);\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-estimator.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
|
||
" /* fitted */\n",
|
||
" background-color: var(--sklearn-color-fitted-level-2);\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-label label.sk-toggleable__label,\n",
|
||
"#sk-container-id-5 div.sk-label label {\n",
|
||
" /* The background is the default theme color */\n",
|
||
" color: var(--sklearn-color-text-on-default-background);\n",
|
||
"}\n",
|
||
"\n",
|
||
"/* On hover, darken the color of the background */\n",
|
||
"#sk-container-id-5 div.sk-label:hover label.sk-toggleable__label {\n",
|
||
" color: var(--sklearn-color-text);\n",
|
||
" background-color: var(--sklearn-color-unfitted-level-2);\n",
|
||
"}\n",
|
||
"\n",
|
||
"/* Label box, darken color on hover, fitted */\n",
|
||
"#sk-container-id-5 div.sk-label.fitted:hover label.sk-toggleable__label.fitted {\n",
|
||
" color: var(--sklearn-color-text);\n",
|
||
" background-color: var(--sklearn-color-fitted-level-2);\n",
|
||
"}\n",
|
||
"\n",
|
||
"/* Estimator label */\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-label label {\n",
|
||
" font-family: monospace;\n",
|
||
" font-weight: bold;\n",
|
||
" display: inline-block;\n",
|
||
" line-height: 1.2em;\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-label-container {\n",
|
||
" text-align: center;\n",
|
||
"}\n",
|
||
"\n",
|
||
"/* Estimator-specific */\n",
|
||
"#sk-container-id-5 div.sk-estimator {\n",
|
||
" font-family: monospace;\n",
|
||
" border: 1px dotted var(--sklearn-color-border-box);\n",
|
||
" border-radius: 0.25em;\n",
|
||
" box-sizing: border-box;\n",
|
||
" margin-bottom: 0.5em;\n",
|
||
" /* unfitted */\n",
|
||
" background-color: var(--sklearn-color-unfitted-level-0);\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-estimator.fitted {\n",
|
||
" /* fitted */\n",
|
||
" background-color: var(--sklearn-color-fitted-level-0);\n",
|
||
"}\n",
|
||
"\n",
|
||
"/* on hover */\n",
|
||
"#sk-container-id-5 div.sk-estimator:hover {\n",
|
||
" /* unfitted */\n",
|
||
" background-color: var(--sklearn-color-unfitted-level-2);\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 div.sk-estimator.fitted:hover {\n",
|
||
" /* fitted */\n",
|
||
" background-color: var(--sklearn-color-fitted-level-2);\n",
|
||
"}\n",
|
||
"\n",
|
||
"/* Specification for estimator info (e.g. \"i\" and \"?\") */\n",
|
||
"\n",
|
||
"/* Common style for \"i\" and \"?\" */\n",
|
||
"\n",
|
||
".sk-estimator-doc-link,\n",
|
||
"a:link.sk-estimator-doc-link,\n",
|
||
"a:visited.sk-estimator-doc-link {\n",
|
||
" float: right;\n",
|
||
" font-size: smaller;\n",
|
||
" line-height: 1em;\n",
|
||
" font-family: monospace;\n",
|
||
" background-color: var(--sklearn-color-background);\n",
|
||
" border-radius: 1em;\n",
|
||
" height: 1em;\n",
|
||
" width: 1em;\n",
|
||
" text-decoration: none !important;\n",
|
||
" margin-left: 1ex;\n",
|
||
" /* unfitted */\n",
|
||
" border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n",
|
||
" color: var(--sklearn-color-unfitted-level-1);\n",
|
||
"}\n",
|
||
"\n",
|
||
".sk-estimator-doc-link.fitted,\n",
|
||
"a:link.sk-estimator-doc-link.fitted,\n",
|
||
"a:visited.sk-estimator-doc-link.fitted {\n",
|
||
" /* fitted */\n",
|
||
" border: var(--sklearn-color-fitted-level-1) 1pt solid;\n",
|
||
" color: var(--sklearn-color-fitted-level-1);\n",
|
||
"}\n",
|
||
"\n",
|
||
"/* On hover */\n",
|
||
"div.sk-estimator:hover .sk-estimator-doc-link:hover,\n",
|
||
".sk-estimator-doc-link:hover,\n",
|
||
"div.sk-label-container:hover .sk-estimator-doc-link:hover,\n",
|
||
".sk-estimator-doc-link:hover {\n",
|
||
" /* unfitted */\n",
|
||
" background-color: var(--sklearn-color-unfitted-level-3);\n",
|
||
" color: var(--sklearn-color-background);\n",
|
||
" text-decoration: none;\n",
|
||
"}\n",
|
||
"\n",
|
||
"div.sk-estimator.fitted:hover .sk-estimator-doc-link.fitted:hover,\n",
|
||
".sk-estimator-doc-link.fitted:hover,\n",
|
||
"div.sk-label-container:hover .sk-estimator-doc-link.fitted:hover,\n",
|
||
".sk-estimator-doc-link.fitted:hover {\n",
|
||
" /* fitted */\n",
|
||
" background-color: var(--sklearn-color-fitted-level-3);\n",
|
||
" color: var(--sklearn-color-background);\n",
|
||
" text-decoration: none;\n",
|
||
"}\n",
|
||
"\n",
|
||
"/* Span, style for the box shown on hovering the info icon */\n",
|
||
".sk-estimator-doc-link span {\n",
|
||
" display: none;\n",
|
||
" z-index: 9999;\n",
|
||
" position: relative;\n",
|
||
" font-weight: normal;\n",
|
||
" right: .2ex;\n",
|
||
" padding: .5ex;\n",
|
||
" margin: .5ex;\n",
|
||
" width: min-content;\n",
|
||
" min-width: 20ex;\n",
|
||
" max-width: 50ex;\n",
|
||
" color: var(--sklearn-color-text);\n",
|
||
" box-shadow: 2pt 2pt 4pt #999;\n",
|
||
" /* unfitted */\n",
|
||
" background: var(--sklearn-color-unfitted-level-0);\n",
|
||
" border: .5pt solid var(--sklearn-color-unfitted-level-3);\n",
|
||
"}\n",
|
||
"\n",
|
||
".sk-estimator-doc-link.fitted span {\n",
|
||
" /* fitted */\n",
|
||
" background: var(--sklearn-color-fitted-level-0);\n",
|
||
" border: var(--sklearn-color-fitted-level-3);\n",
|
||
"}\n",
|
||
"\n",
|
||
".sk-estimator-doc-link:hover span {\n",
|
||
" display: block;\n",
|
||
"}\n",
|
||
"\n",
|
||
"/* \"?\"-specific style due to the `<a>` HTML tag */\n",
|
||
"\n",
|
||
"#sk-container-id-5 a.estimator_doc_link {\n",
|
||
" float: right;\n",
|
||
" font-size: 1rem;\n",
|
||
" line-height: 1em;\n",
|
||
" font-family: monospace;\n",
|
||
" background-color: var(--sklearn-color-background);\n",
|
||
" border-radius: 1rem;\n",
|
||
" height: 1rem;\n",
|
||
" width: 1rem;\n",
|
||
" text-decoration: none;\n",
|
||
" /* unfitted */\n",
|
||
" color: var(--sklearn-color-unfitted-level-1);\n",
|
||
" border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 a.estimator_doc_link.fitted {\n",
|
||
" /* fitted */\n",
|
||
" border: var(--sklearn-color-fitted-level-1) 1pt solid;\n",
|
||
" color: var(--sklearn-color-fitted-level-1);\n",
|
||
"}\n",
|
||
"\n",
|
||
"/* On hover */\n",
|
||
"#sk-container-id-5 a.estimator_doc_link:hover {\n",
|
||
" /* unfitted */\n",
|
||
" background-color: var(--sklearn-color-unfitted-level-3);\n",
|
||
" color: var(--sklearn-color-background);\n",
|
||
" text-decoration: none;\n",
|
||
"}\n",
|
||
"\n",
|
||
"#sk-container-id-5 a.estimator_doc_link.fitted:hover {\n",
|
||
" /* fitted */\n",
|
||
" background-color: var(--sklearn-color-fitted-level-3);\n",
|
||
"}\n",
|
||
"</style><div id=\"sk-container-id-5\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>GradientBoostingRegressor(max_depth=5)</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-5\" type=\"checkbox\" checked><label for=\"sk-estimator-id-5\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\"> GradientBoostingRegressor<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.5/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html\">?<span>Documentation for GradientBoostingRegressor</span></a><span class=\"sk-estimator-doc-link fitted\">i<span>Fitted</span></span></label><div class=\"sk-toggleable__content fitted\"><pre>GradientBoostingRegressor(max_depth=5)</pre></div> </div></div></div></div>"
|
||
],
|
||
"text/plain": [
|
||
"GradientBoostingRegressor(max_depth=5)"
|
||
]
|
||
},
|
||
"execution_count": 98,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Обучение с использованием лучших моделей\n",
|
||
"rf_model.fit(X_train_transformed, y_train)\n",
|
||
"ridge_model.fit(X_train_transformed, y_train)\n",
|
||
"gb_model.fit(X_train_transformed, y_train)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"#### Оценка моделей:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 99,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" Model MAE MSE R2\n",
|
||
"0 Random Forest 0.001764 0.000026 0.967882\n",
|
||
"1 Ridge 0.010975 0.000271 0.663739\n",
|
||
"2 Gradient Boosting 0.000987 0.000004 0.995060\n",
|
||
"Лучшая модель: Gradient Boosting\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# Оценка качества и получение предсказаний для тестового набора\n",
|
||
"X_test_transformed = pipeline.transform(X_test)\n",
|
||
"\n",
|
||
"models = [rf_model, ridge_model, gb_model]\n",
|
||
"model_names = ['Random Forest', 'Ridge', 'Gradient Boosting']\n",
|
||
"\n",
|
||
"results = []\n",
|
||
"for model, name in zip(models, model_names):\n",
|
||
" predictions = model.predict(X_test_transformed)\n",
|
||
" mae = mean_absolute_error(y_test, predictions)\n",
|
||
" mse = mean_squared_error(y_test, predictions)\n",
|
||
" r2 = r2_score(y_test, predictions)\n",
|
||
" \n",
|
||
" results.append({\n",
|
||
" 'Model': name,\n",
|
||
" 'MAE': mae,\n",
|
||
" 'MSE': mse,\n",
|
||
" 'R2': r2\n",
|
||
" })\n",
|
||
"\n",
|
||
"results_df = pd.DataFrame(results)\n",
|
||
"print(results_df)\n",
|
||
"\n",
|
||
"# Определение наилучшей модели по метрике R²\n",
|
||
"best_model_info = results_df.loc[results_df['R2'].idxmax()]\n",
|
||
"best_model_name = best_model_info['Model']\n",
|
||
"best_model = models[results_df['R2'].idxmax()]\n",
|
||
"\n",
|
||
"print(f'Лучшая модель: {best_model_name}')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"#### Оценим смещение и дисперсию для модели с лучшей оценкой:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 100,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Bias: 0.0014630506389881025, Variance: 0.0006564383445844139\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"image/png": "",
|
||
"text/plain": [
|
||
"<Figure size 1000x500 with 1 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"import numpy as np\n",
|
||
"import matplotlib.pyplot as plt\n",
|
||
"\n",
|
||
"# Функция для оценки смещения и дисперсии\n",
|
||
"def plot_bias_variance(model, X_train, y_train, X_test, y_test):\n",
|
||
" # Предсказания на обучающей и тестовой выборках\n",
|
||
" train_preds = model.predict(X_train)\n",
|
||
" test_preds = model.predict(X_test)\n",
|
||
"\n",
|
||
" # Оценка смещения\n",
|
||
" bias = np.mean((test_preds - y_test.to_numpy()) ** 2)\n",
|
||
" variance = np.var(test_preds)\n",
|
||
"\n",
|
||
" print(f'Bias: {bias}, Variance: {variance}')\n",
|
||
"\n",
|
||
" # Визуализация предсказаний\n",
|
||
" plt.figure(figsize=(10, 5))\n",
|
||
" plt.scatter(y_test.to_numpy(), test_preds, label='Предсказания', alpha=0.7)\n",
|
||
" plt.xlabel('Истинные значения')\n",
|
||
" plt.ylabel('Предсказанные значения')\n",
|
||
" plt.title('Сравнение истинных значений и предсказанных значений')\n",
|
||
" plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)\n",
|
||
" plt.legend()\n",
|
||
" plt.show()\n",
|
||
"\n",
|
||
"# Пример использования\n",
|
||
"plot_bias_variance(rf_model, X_train_transformed, y_train, X_test_transformed, y_test)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"#### **Таким образом в задачах регрессии в качестве оценок используются следующие метрики:**\n",
|
||
"1. **Средняя квадратичная ошибка (англ. Mean Squared Error, MSE)**\n",
|
||
"MSE применяется в ситуациях, когда нам надо подчеркнуть большие ошибки и выбрать модель, которая дает меньше больших ошибок прогноза. Грубые ошибки становятся заметнее за счет того, что ошибку прогноза мы возводим в квадрат. И модель, которая дает нам меньшее значение среднеквадратической ошибки, можно сказать, что что у этой модели меньше грубых ошибок.\n",
|
||
"\n",
|
||
"2. **Cредняя абсолютная ошибка (англ. Mean Absolute Error, MAE)**\n",
|
||
"Среднеквадратичный функционал сильнее штрафует за большие отклонения по сравнению со среднеабсолютным, и поэтому более чувствителен к выбросам. При использовании любого из этих двух функционалов может быть полезно проанализировать, какие объекты вносят наибольший вклад в общую ошибку — не исключено, что на этих объектах была допущена ошибка при вычислении признаков или целевой величины.\n",
|
||
"Среднеквадратичная ошибка подходит для сравнения двух моделей или для контроля качества во время обучения, но не позволяет сделать выводов о том, на сколько хорошо данная модель решает задачу. Например, MSE = 10 является очень плохим показателем, если целевая переменная принимает значения от 0 до 1, и очень хорошим, если целевая переменная лежит в интервале (10000, 100000). В таких ситуациях вместо среднеквадратичной ошибки полезно использовать коэффициент детерминации — R2\n",
|
||
"\n",
|
||
"3. **Коэффициент детерминации**\n",
|
||
"Коэффициент детерминации измеряет долю дисперсии, объясненную моделью, в общей дисперсии целевой переменной. Фактически, данная мера качества — это нормированная среднеквадратичная ошибка. Если она близка к единице, то модель хорошо объясняет данные, если же она близка к нулю, то прогнозы сопоставимы по качеству с константным предсказанием.\n",
|
||
"\n",
|
||
"#### **Анализ Метрик:**\n",
|
||
"1. Random Forest:\n",
|
||
"* MAE: 0.001779\n",
|
||
"* MSE: 0.000027\n",
|
||
"* R²: 0.966258\n",
|
||
"\n",
|
||
"Random Forest демонстрирует хорошую производительность с высоким R², что указывает на то, что модель способна объяснить примерно 96.6% изменчивости в данных. Низкие значения MAE и MSE свидетельствуют о том, что предсказания модели близки к истинным значениям.\n",
|
||
"\n",
|
||
"2. Ridge:\n",
|
||
"* MAE: 0.010975\n",
|
||
"* MSE: 0.000271\n",
|
||
"* R²: 0.663739\n",
|
||
"\n",
|
||
"Модель Ridge имеет более высокие значения MAE и MSE, чем Random Forest, что указывает на худшую точность предсказаний. Ее R² значительно ниже, всего 66.4%, что означает, что она объясняет лишь часть изменчивости данных. Это может значить то, что модель не улавливает все зависимости в данных о волатильности акций.\n",
|
||
"\n",
|
||
"3. Gradient Boosting:\n",
|
||
"* MAE: 0.000988\n",
|
||
"* MSE: 0.000004\n",
|
||
"* R²: 0.995023\n",
|
||
"\n",
|
||
"Gradient Boosting показывает наилучшие результаты среди трех моделей. С наименьшими значениями MAE и MSE, эта модель обеспечивает точные предсказания. Высокий R² (99.5%) указывает на то, что она способна объяснить почти всю изменчивость в данных. Это критично в контексте бизнес-целей, связанных с финансовыми рынками, где точность предсказаний может существенно повлиять на принятие инвестиционных решений.\n",
|
||
"\n",
|
||
"#### **Вывод:**\n",
|
||
"На основе представленных метрик Gradient Boosting является лучшей моделью для предсказания цен закрытия акций. Она имеет:\n",
|
||
"\n",
|
||
"* Наименьшее значение MAE, что говорит о меньшем среднем отклонении предсказанных значений от действительных.\n",
|
||
"* Наименьшее значение MSE, что указывает на то, что крупные ошибки, которые могут иметь значительное влияние на торговые решения, минимизированы.\n",
|
||
"* Наивысшее значение R², что подтверждает высокую степень объясняемости модели.\n",
|
||
"\n",
|
||
"* Bias (Смещение)\n",
|
||
"Смещение измеряет, насколько предсказания модели отклоняются от истинных значений. Чем ниже смещение, тем лучше модель справляется с захватом истинной зависимости в данных.\n",
|
||
"Низкое значение смещения(0.0014589819156387081) говорит о том, что модель хорошо предсказывает результаты для тестовых данных. Это означает, что ошибка, возникающая из-за того, что модель не может уловить истинные параметры данных, минимальна.\n",
|
||
"* Variance (Дисперсия)\n",
|
||
"Дисперсия измеряет, насколько предсказания модели изменяются при использовании различных обучающих наборов данных. Высокая дисперсия может указывать на переобучение модели, когда она слишком точно подстраивается под обучающие данные и теряет способность обобщать.\n",
|
||
"Низкое значение дисперсии(0.0006523355896833815) также говорит о том, что модель делает стабильные предсказания, которые не сильно колеблются между разными выборками данных. Это свидетельствует о том, что модель, скорее всего, не переобучается.\n",
|
||
"\n",
|
||
"* К тому же, если мы посмотрим на график, то сможем сделать несколько выводов:\n",
|
||
" 1. Высокая точность предсказаний\n",
|
||
" Предсказанные значения близки к истинным значениям. Это указывает на то, что модель хорошо прогнозирует целевую переменную.\n",
|
||
" 2. Небольшие отклонения\n",
|
||
" Несмотря на хорошую схожесть между предсказанными и истинными значениями, видны некоторые небольшие отклонения. Это может указывать на определенные ошибки предсказания, но они незначительны.\n",
|
||
" 3. Отсутствие сильного переобучения:\n",
|
||
" Поскольку наблюдается равномерное распределение точек, нет признаков значительного переобучения, которое могло бы проявляться в виде разбросанных точек вдали от линии.\n",
|
||
" 4. Поведение на высоких значениях:\n",
|
||
" В то же время может быть интересно обратить внимание на точки, расположенные на уровнях около 0.25-0.30. Они несколько отклоняются от линии, это может свидетельствовать о том, что модель не идеально справляется с предсказанием на больших значениях.\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### **Теперь разберемся с задачей классификации:**\n",
|
||
"Классификация.\n",
|
||
"Получение метки класса (выбор из конечного множества значений)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Выберем классификационные модели, а именно:\n",
|
||
"1. Логистическая регрессия\n",
|
||
"2. Наивный байесовский классификатор\n",
|
||
"3. Дерево решений\n",
|
||
"4. Метод K ближайших соседей\n",
|
||
"\n",
|
||
"Настроим гиперпараметры для каждой модели."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### **Создадим конвейер:**\n",
|
||
"##### Конвейеры позволяют автоматизировать следующие процессы:\n",
|
||
"1. Предобработка данных.\n",
|
||
"2. Конструирование признаков.\n",
|
||
"3. Понижение размерности признакового пространства.\n",
|
||
"4. Обучение модели.\n",
|
||
"\n",
|
||
"\n",
|
||
"##### Используемые конвейеры:\n",
|
||
"1. preprocessing_num -- конвейер для обработки числовых данных: заполнение пропущенных значений и стандартизация\n",
|
||
"\n",
|
||
"2. preprocessing_cat -- конвейер для обработки категориальных данных: заполнение пропущенных данных и унитарное кодирование\n",
|
||
"\n",
|
||
"3. features_preprocessing -- трансформер для предобработки признаков\n",
|
||
"\n",
|
||
"4. features_engineering -- трансформер для конструирования признаков\n",
|
||
"\n",
|
||
"5. drop_columns -- трансформер для удаления колонок\n",
|
||
"\n",
|
||
"6. pipeline_end -- основной конвейер предобработки данных и конструирования признаков"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 101,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"import numpy as np\n",
|
||
"import pandas as pd\n",
|
||
"from sklearn.base import BaseEstimator, TransformerMixin\n",
|
||
"from sklearn.compose import ColumnTransformer\n",
|
||
"from sklearn.preprocessing import StandardScaler\n",
|
||
"from sklearn.impute import SimpleImputer\n",
|
||
"from sklearn.pipeline import Pipeline\n",
|
||
"from sklearn.preprocessing import OneHotEncoder\n",
|
||
"from sklearn.ensemble import RandomForestRegressor # Пример регрессионной модели\n",
|
||
"from sklearn.model_selection import train_test_split\n",
|
||
"from sklearn.pipeline import make_pipeline\n",
|
||
"\n",
|
||
"class StocksFeatures(BaseEstimator, TransformerMixin):\n",
|
||
" def __init__(self):\n",
|
||
" pass\n",
|
||
" \n",
|
||
" def fit(self, X, y=None):\n",
|
||
" return self\n",
|
||
"\n",
|
||
" def transform(self, X, y=None):\n",
|
||
" X[\"Range\"] = X[\"High\"] - X[\"Low\"]\n",
|
||
" return X\n",
|
||
"\n",
|
||
" def get_feature_names_out(self, features_in):\n",
|
||
" return np.append(features_in, [\"Range\"], axis=0)\n",
|
||
" \n",
|
||
"\n",
|
||
"num_columns = [\"Open\", \"High\", \"Low\", \"Volume\", \"Volatility\"]\n",
|
||
"\n",
|
||
"# Определяем предобработку для численных данных\n",
|
||
"num_imputer = SimpleImputer(strategy=\"median\")\n",
|
||
"num_scaler = StandardScaler()\n",
|
||
"preprocessing_num = Pipeline(\n",
|
||
" [\n",
|
||
" (\"imputer\", num_imputer),\n",
|
||
" (\"scaler\", num_scaler),\n",
|
||
" ]\n",
|
||
")\n",
|
||
"\n",
|
||
"# У категориальных данных нет, оставляем пустым\n",
|
||
"cat_columns = []\n",
|
||
"\n",
|
||
"# Подготовка признаков с использованием ColumnTransformer\n",
|
||
"features_preprocessing = ColumnTransformer(\n",
|
||
" verbose_feature_names_out=False,\n",
|
||
" transformers=[\n",
|
||
" (\"preprocessing_num\", preprocessing_num, num_columns),\n",
|
||
" ],\n",
|
||
" remainder=\"passthrough\"\n",
|
||
")\n",
|
||
"\n",
|
||
"# Выделим целевую переменную\n",
|
||
"#y_train_close = y_train_close.values.reshape(-1, 1) # Убедимся, что y_train - это 2D массив\n",
|
||
"\n",
|
||
"# Создание окончательного конвейера\n",
|
||
"pipeline = Pipeline(steps=[\n",
|
||
" ('feature_engineering', StocksFeatures()),\n",
|
||
" ('imputer', SimpleImputer(strategy='median')),\n",
|
||
" ('scaler', StandardScaler())\n",
|
||
"]\n",
|
||
")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Применим конвейер\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 102,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Open</th>\n",
|
||
" <th>High</th>\n",
|
||
" <th>Low</th>\n",
|
||
" <th>Volume</th>\n",
|
||
" <th>Volatility</th>\n",
|
||
" <th>Range</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>-0.264188</td>\n",
|
||
" <td>-0.270892</td>\n",
|
||
" <td>-0.278887</td>\n",
|
||
" <td>1.986714</td>\n",
|
||
" <td>0.288393</td>\n",
|
||
" <td>-0.037298</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>-0.642218</td>\n",
|
||
" <td>-0.642866</td>\n",
|
||
" <td>-0.634425</td>\n",
|
||
" <td>-0.197546</td>\n",
|
||
" <td>-0.295086</td>\n",
|
||
" <td>-0.634720</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>1.107219</td>\n",
|
||
" <td>1.082683</td>\n",
|
||
" <td>1.108998</td>\n",
|
||
" <td>-0.247763</td>\n",
|
||
" <td>-0.423081</td>\n",
|
||
" <td>0.261413</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>1.019012</td>\n",
|
||
" <td>0.991756</td>\n",
|
||
" <td>1.019043</td>\n",
|
||
" <td>-0.832639</td>\n",
|
||
" <td>-0.447430</td>\n",
|
||
" <td>0.176067</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>1.859078</td>\n",
|
||
" <td>1.886561</td>\n",
|
||
" <td>1.925023</td>\n",
|
||
" <td>-0.404450</td>\n",
|
||
" <td>-0.497514</td>\n",
|
||
" <td>0.602796</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4195</th>\n",
|
||
" <td>0.556976</td>\n",
|
||
" <td>0.584650</td>\n",
|
||
" <td>0.541422</td>\n",
|
||
" <td>-0.242230</td>\n",
|
||
" <td>0.719476</td>\n",
|
||
" <td>1.285564</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4196</th>\n",
|
||
" <td>-0.453203</td>\n",
|
||
" <td>-0.427948</td>\n",
|
||
" <td>-0.450230</td>\n",
|
||
" <td>0.539634</td>\n",
|
||
" <td>0.807557</td>\n",
|
||
" <td>0.133394</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4197</th>\n",
|
||
" <td>-0.575013</td>\n",
|
||
" <td>-0.568471</td>\n",
|
||
" <td>-0.557320</td>\n",
|
||
" <td>0.299931</td>\n",
|
||
" <td>-0.381914</td>\n",
|
||
" <td>-0.634720</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4198</th>\n",
|
||
" <td>-0.280990</td>\n",
|
||
" <td>-0.281224</td>\n",
|
||
" <td>-0.261752</td>\n",
|
||
" <td>0.508014</td>\n",
|
||
" <td>-0.576248</td>\n",
|
||
" <td>-0.592047</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4199</th>\n",
|
||
" <td>-0.785029</td>\n",
|
||
" <td>-0.795789</td>\n",
|
||
" <td>-0.780067</td>\n",
|
||
" <td>-1.238542</td>\n",
|
||
" <td>-0.739469</td>\n",
|
||
" <td>-0.890757</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>4200 rows × 6 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Open High Low Volume Volatility Range\n",
|
||
"0 -0.264188 -0.270892 -0.278887 1.986714 0.288393 -0.037298\n",
|
||
"1 -0.642218 -0.642866 -0.634425 -0.197546 -0.295086 -0.634720\n",
|
||
"2 1.107219 1.082683 1.108998 -0.247763 -0.423081 0.261413\n",
|
||
"3 1.019012 0.991756 1.019043 -0.832639 -0.447430 0.176067\n",
|
||
"4 1.859078 1.886561 1.925023 -0.404450 -0.497514 0.602796\n",
|
||
"... ... ... ... ... ... ...\n",
|
||
"4195 0.556976 0.584650 0.541422 -0.242230 0.719476 1.285564\n",
|
||
"4196 -0.453203 -0.427948 -0.450230 0.539634 0.807557 0.133394\n",
|
||
"4197 -0.575013 -0.568471 -0.557320 0.299931 -0.381914 -0.634720\n",
|
||
"4198 -0.280990 -0.281224 -0.261752 0.508014 -0.576248 -0.592047\n",
|
||
"4199 -0.785029 -0.795789 -0.780067 -1.238542 -0.739469 -0.890757\n",
|
||
"\n",
|
||
"[4200 rows x 6 columns]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Open</th>\n",
|
||
" <th>High</th>\n",
|
||
" <th>Low</th>\n",
|
||
" <th>Volume</th>\n",
|
||
" <th>Volatility</th>\n",
|
||
" <th>Range</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>1.498241</td>\n",
|
||
" <td>1.508336</td>\n",
|
||
" <td>1.545213</td>\n",
|
||
" <td>-0.154202</td>\n",
|
||
" <td>-0.496725</td>\n",
|
||
" <td>0.326979</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>1.882169</td>\n",
|
||
" <td>1.867024</td>\n",
|
||
" <td>1.897223</td>\n",
|
||
" <td>-0.360102</td>\n",
|
||
" <td>-0.364952</td>\n",
|
||
" <td>0.705871</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>-1.013194</td>\n",
|
||
" <td>-1.008735</td>\n",
|
||
" <td>-1.011718</td>\n",
|
||
" <td>0.299199</td>\n",
|
||
" <td>0.593864</td>\n",
|
||
" <td>-0.641300</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>0.978561</td>\n",
|
||
" <td>0.980731</td>\n",
|
||
" <td>1.024756</td>\n",
|
||
" <td>-0.504559</td>\n",
|
||
" <td>-0.677069</td>\n",
|
||
" <td>-0.178210</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>2.064587</td>\n",
|
||
" <td>2.046367</td>\n",
|
||
" <td>2.102382</td>\n",
|
||
" <td>-0.442899</td>\n",
|
||
" <td>-0.628183</td>\n",
|
||
" <td>0.326979</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1046</th>\n",
|
||
" <td>0.560695</td>\n",
|
||
" <td>0.538627</td>\n",
|
||
" <td>0.551811</td>\n",
|
||
" <td>-0.392167</td>\n",
|
||
" <td>-0.250407</td>\n",
|
||
" <td>0.116483</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1047</th>\n",
|
||
" <td>-0.754415</td>\n",
|
||
" <td>-0.752231</td>\n",
|
||
" <td>-0.750410</td>\n",
|
||
" <td>-1.155323</td>\n",
|
||
" <td>0.032753</td>\n",
|
||
" <td>-0.557102</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1048</th>\n",
|
||
" <td>0.403731</td>\n",
|
||
" <td>0.417675</td>\n",
|
||
" <td>0.439513</td>\n",
|
||
" <td>-0.463214</td>\n",
|
||
" <td>-0.446983</td>\n",
|
||
" <td>-0.136111</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1049</th>\n",
|
||
" <td>2.134585</td>\n",
|
||
" <td>2.148552</td>\n",
|
||
" <td>2.130456</td>\n",
|
||
" <td>0.777940</td>\n",
|
||
" <td>0.151191</td>\n",
|
||
" <td>1.842550</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1050</th>\n",
|
||
" <td>-0.864411</td>\n",
|
||
" <td>-0.874972</td>\n",
|
||
" <td>-0.851601</td>\n",
|
||
" <td>-1.183864</td>\n",
|
||
" <td>-1.330298</td>\n",
|
||
" <td>-1.062291</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>1051 rows × 6 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Open High Low Volume Volatility Range\n",
|
||
"0 1.498241 1.508336 1.545213 -0.154202 -0.496725 0.326979\n",
|
||
"1 1.882169 1.867024 1.897223 -0.360102 -0.364952 0.705871\n",
|
||
"2 -1.013194 -1.008735 -1.011718 0.299199 0.593864 -0.641300\n",
|
||
"3 0.978561 0.980731 1.024756 -0.504559 -0.677069 -0.178210\n",
|
||
"4 2.064587 2.046367 2.102382 -0.442899 -0.628183 0.326979\n",
|
||
"... ... ... ... ... ... ...\n",
|
||
"1046 0.560695 0.538627 0.551811 -0.392167 -0.250407 0.116483\n",
|
||
"1047 -0.754415 -0.752231 -0.750410 -1.155323 0.032753 -0.557102\n",
|
||
"1048 0.403731 0.417675 0.439513 -0.463214 -0.446983 -0.136111\n",
|
||
"1049 2.134585 2.148552 2.130456 0.777940 0.151191 1.842550\n",
|
||
"1050 -0.864411 -0.874972 -0.851601 -1.183864 -1.330298 -1.062291\n",
|
||
"\n",
|
||
"[1051 rows x 6 columns]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Применяем конвейер к X_train_close\n",
|
||
"preprocessing_result = pipeline.fit_transform(X_train_close)\n",
|
||
"\n",
|
||
"# Формируем новый датафрейм с обработанными данными\n",
|
||
"preprocessed_df = pd.DataFrame(\n",
|
||
" preprocessing_result, \n",
|
||
" columns=pipeline.get_feature_names_out(input_features=num_columns),\n",
|
||
")\n",
|
||
"\n",
|
||
"# Выводим обработанный датафрейм\n",
|
||
"display(preprocessed_df)\n",
|
||
"# Применяем конвейер к X_train_close\n",
|
||
"preprocessing_result = pipeline.fit_transform(X_test_close)\n",
|
||
"\n",
|
||
"# Формируем новый датафрейм с обработанными данными\n",
|
||
"preprocessed_df = pd.DataFrame(\n",
|
||
" preprocessing_result, \n",
|
||
" columns=pipeline.get_feature_names_out(input_features=num_columns),\n",
|
||
")\n",
|
||
"\n",
|
||
"# Выводим обработанный датафрейм\n",
|
||
"display(preprocessed_df)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Настроим гиперпараметры для каждой модели."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 103,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"from sklearn.model_selection import GridSearchCV\n",
|
||
"\n",
|
||
"# Определяем модели\n",
|
||
"from sklearn.linear_model import LogisticRegression\n",
|
||
"from sklearn.naive_bayes import GaussianNB\n",
|
||
"from sklearn.tree import DecisionTreeClassifier\n",
|
||
"from sklearn.neighbors import KNeighborsClassifier\n",
|
||
"\n",
|
||
"# Определяем параметры для каждой модели\n",
|
||
"param_grid = {\n",
|
||
" 'LogisticRegression': {\n",
|
||
" 'model': LogisticRegression(),\n",
|
||
" 'params': {\n",
|
||
" 'C': [0.01, 0.1, 1, 10],\n",
|
||
" 'solver': ['liblinear']\n",
|
||
" }\n",
|
||
" },\n",
|
||
" 'NaiveBayes': {\n",
|
||
" 'model': GaussianNB(),\n",
|
||
" 'params': {}\n",
|
||
" },\n",
|
||
" 'DecisionTree': {\n",
|
||
" 'model': DecisionTreeClassifier(),\n",
|
||
" 'params': {\n",
|
||
" 'max_depth': [None, 5, 10, 20],\n",
|
||
" 'min_samples_split': [2, 5, 10],\n",
|
||
" }\n",
|
||
" },\n",
|
||
" 'KNeighbors': {\n",
|
||
" 'model': KNeighborsClassifier(),\n",
|
||
" 'params': {\n",
|
||
" 'n_neighbors': [3, 5, 7, 10]\n",
|
||
" }\n",
|
||
" }\n",
|
||
"}\n",
|
||
"\n",
|
||
"best_estimators = {}"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"#### Обучим модели и оценим:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Обучим модели при помощи кросс-валидации:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 104,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True)\n",
|
||
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\metrics\\_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
|
||
" _warn_prf(average, modifier, f\"{metric.capitalize()} is\", len(result))\n",
|
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return self._fit(X, y)\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
"import numpy as np\n",
"from sklearn.metrics import classification_report, cohen_kappa_score, confusion_matrix, matthews_corrcoef, roc_auc_score\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"# Flatten the targets to 1-D so that sklearn does not raise DataConversionWarning\n",
"y_train_flat = np.ravel(y_train_close)\n",
"y_test_flat = np.ravel(y_test_close)\n",
"\n",
"for model_name, mp in param_grid.items():\n",
"    grid_search = GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)\n",
"    grid_search.fit(X_train_close, y_train_flat)\n",
"\n",
"    best_estimators[model_name] = grid_search.best_estimator_\n",
"    y_pred_train = best_estimators[model_name].predict(X_train_close)\n",
"    y_pred_test = best_estimators[model_name].predict(X_test_close)\n",
"\n",
"    # Collect metrics; zero_division=0 silences UndefinedMetricWarning for classes that are never predicted\n",
"    report_train = classification_report(y_train_flat, y_pred_train, output_dict=True, zero_division=0)\n",
"    report_test = classification_report(y_test_flat, y_pred_test, output_dict=True, zero_division=0)\n",
"\n",
"    roc_auc_test = roc_auc_score(y_test_flat, best_estimators[model_name].predict_proba(X_test_close)[:, 1])\n",
"    cohen_kappa_test = cohen_kappa_score(y_test_flat, y_pred_test)\n",
"    mcc_test = matthews_corrcoef(y_test_flat, y_pred_test)\n",
"\n",
"    # Store the results (note: this overwrites the model's entry in param_grid)\n",
"    param_grid[model_name] = {\n",
"        \"Confusion_matrix\": confusion_matrix(y_test_flat, y_pred_test),\n",
"        \"Precision_train\": report_train['1']['precision'],\n",
"        \"Recall_train\": report_train['1']['recall'],\n",
"        \"Accuracy_train\": report_train['accuracy'],\n",
"        \"F1_train\": report_train['1']['f1-score'],\n",
"        \"Precision_test\": report_test['1']['precision'],\n",
"        \"Recall_test\": report_test['1']['recall'],\n",
"        \"Accuracy_test\": report_test['accuracy'],\n",
"        \"F1_test\": report_test['1']['f1-score'],\n",
"        \"ROC_AUC_test\": roc_auc_test,\n",
"        \"Cohen_kappa_test\": cohen_kappa_test,\n",
"        \"MCC_test\": mcc_test,\n",
"    }"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
"#### Confusion matrix:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 105,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"image/png": "",
|
||
"text/plain": [
|
||
"<Figure size 800x600 with 8 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
"from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Plot the confusion matrix of each tuned model on a two-column grid\n",
"n_rows = (len(best_estimators) + 1) // 2  # round up so an odd number of models still fits\n",
"_, ax = plt.subplots(n_rows, 2, figsize=(8, 6), sharex=False, sharey=False)\n",
"for index, key in enumerate(best_estimators.keys()):\n",
"    y_pred = best_estimators[key].predict(X_test_close)\n",
"    c_matrix = confusion_matrix(y_test_close, y_pred)\n",
"    disp = ConfusionMatrixDisplay(confusion_matrix=c_matrix, display_labels=[\"low\", \"high\"]).plot(ax=ax.flat[index])\n",
"    disp.ax_.set_title(key)\n",
"\n",
"plt.subplots_adjust(top=1, bottom=0, hspace=0.4, wspace=0.1)\n",
"plt.show()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
"#### **Conclusions from the confusion matrices:**\n",
"1. Logistic Regression\n",
"* True Positives (TP): 634 (high predicted, actually high)\n",
"* True Negatives (TN): 0 (no low sample was classified correctly)\n",
"* False Positives (FP): 294 (high predicted, actually low)\n",
"* False Negatives (FN): 123 (low predicted, actually high)\n",
"\n",
"Conclusion: The model finds most high samples (634 of 757) but never classifies a low sample correctly: all 294 low samples are predicted as high, and all 123 of its low predictions are wrong. Acting on such forecasts could lead to significant financial losses.\n",
"\n",
"2. Naive Bayes\n",
"* TP: 757 (high predicted, actually high)\n",
"* TN: 0\n",
"* FP: 294\n",
"* FN: 0\n",
"\n",
"Conclusion: The model predicts high for every sample. It therefore captures all high samples but does not recognize the low class at all, which makes it unreliable for separating low prices from high ones.\n",
"\n",
"3. Decision Tree\n",
"* TP: 757\n",
"* TN: 293\n",
"* FP: 1\n",
"* FN: 0\n",
"\n",
"Conclusion: The model classifies both classes almost perfectly. It misses no high sample (FN = 0) and misclassifies only one low sample as high (one FP).\n",
"\n",
"4. KNeighbors\n",
"* TP: 612\n",
"* TN: 162\n",
"* FP: 132\n",
"* FN: 145\n",
"\n",
"Conclusion: Apart from the decision tree, this is the most balanced model. It makes many errors, but it does learn to predict low with some effectiveness (162 of 294 low samples)."
|
||
]
|
||
},
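{
"cell_type": "markdown",
"metadata": {},
"source": [
"The counts listed above fully determine precision, recall, accuracy, and F1 score. As a minimal sketch, assuming `high` is the positive class, the metrics can be recomputed by hand. The helper name `metrics_from_counts` is hypothetical and is not part of the notebook or of scikit-learn:\n",
"\n",
"```python\n",
"def metrics_from_counts(tp, tn, fp, fn):\n",
"    # Precision, recall, accuracy and F1 for the positive class (hypothetical helper)\n",
"    precision = tp / (tp + fp) if (tp + fp) else 0.0\n",
"    recall = tp / (tp + fn) if (tp + fn) else 0.0\n",
"    accuracy = (tp + tn) / (tp + tn + fp + fn)\n",
"    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0\n",
"    return {'precision': precision, 'recall': recall, 'accuracy': accuracy, 'f1': f1}\n",
"\n",
"\n",
"# Test-set counts taken from the confusion matrices listed above\n",
"counts = {\n",
"    'LogisticRegression': dict(tp=634, tn=0, fp=294, fn=123),\n",
"    'NaiveBayes': dict(tp=757, tn=0, fp=294, fn=0),\n",
"    'DecisionTree': dict(tp=757, tn=293, fp=1, fn=0),\n",
"}\n",
"for name, c in counts.items():\n",
"    print(name, {k: round(v, 6) for k, v in metrics_from_counts(**c).items()})\n",
"```\n",
"\n",
"The printed values can be compared with the metrics table in the next section to check that the counts were read off the plots correctly."
]
},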
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
"#### Precision, recall, accuracy, F1 score:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 106,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<style type=\"text/css\">\n",
|
||
"#T_be170_row0_col0, #T_be170_row0_col1, #T_be170_row0_col3, #T_be170_row2_col2, #T_be170_row2_col3 {\n",
|
||
" background-color: #a8db34;\n",
|
||
" color: #000000;\n",
|
||
"}\n",
|
||
"#T_be170_row0_col2 {\n",
|
||
" background-color: #a5db36;\n",
|
||
" color: #000000;\n",
|
||
"}\n",
|
||
"#T_be170_row0_col4, #T_be170_row0_col5, #T_be170_row0_col6, #T_be170_row0_col7 {\n",
|
||
" background-color: #da5a6a;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_be170_row1_col0 {\n",
|
||
" background-color: #2eb37c;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_be170_row1_col1 {\n",
|
||
" background-color: #28ae80;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_be170_row1_col2, #T_be170_row1_col3, #T_be170_row3_col0, #T_be170_row3_col1 {\n",
|
||
" background-color: #26818e;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_be170_row1_col4 {\n",
|
||
" background-color: #9613a1;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_be170_row1_col5 {\n",
|
||
" background-color: #8707a6;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_be170_row1_col6 {\n",
|
||
" background-color: #8b0aa5;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_be170_row1_col7, #T_be170_row2_col4 {\n",
|
||
" background-color: #7a02a8;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_be170_row2_col0 {\n",
|
||
" background-color: #228b8d;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_be170_row2_col1 {\n",
|
||
" background-color: #228d8d;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_be170_row2_col5, #T_be170_row2_col6 {\n",
|
||
" background-color: #8104a7;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_be170_row2_col7 {\n",
|
||
" background-color: #8808a6;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_be170_row3_col2 {\n",
|
||
" background-color: #25848e;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_be170_row3_col3 {\n",
|
||
" background-color: #21918c;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_be170_row3_col4, #T_be170_row3_col5, #T_be170_row3_col6, #T_be170_row3_col7 {\n",
|
||
" background-color: #4e02a2;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"</style>\n",
|
||
"<table id=\"T_be170\">\n",
|
||
" <thead>\n",
|
||
" <tr>\n",
|
||
" <th class=\"blank level0\" > </th>\n",
|
||
" <th id=\"T_be170_level0_col0\" class=\"col_heading level0 col0\" >Precision_train</th>\n",
|
||
" <th id=\"T_be170_level0_col1\" class=\"col_heading level0 col1\" >Precision_test</th>\n",
|
||
" <th id=\"T_be170_level0_col2\" class=\"col_heading level0 col2\" >Recall_train</th>\n",
|
||
" <th id=\"T_be170_level0_col3\" class=\"col_heading level0 col3\" >Recall_test</th>\n",
|
||
" <th id=\"T_be170_level0_col4\" class=\"col_heading level0 col4\" >Accuracy_train</th>\n",
|
||
" <th id=\"T_be170_level0_col5\" class=\"col_heading level0 col5\" >Accuracy_test</th>\n",
|
||
" <th id=\"T_be170_level0_col6\" class=\"col_heading level0 col6\" >F1_train</th>\n",
|
||
" <th id=\"T_be170_level0_col7\" class=\"col_heading level0 col7\" >F1_test</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th id=\"T_be170_level0_row0\" class=\"row_heading level0 row0\" >DecisionTree</th>\n",
|
||
" <td id=\"T_be170_row0_col0\" class=\"data row0 col0\" >0.998992</td>\n",
|
||
" <td id=\"T_be170_row0_col1\" class=\"data row0 col1\" >0.998681</td>\n",
|
||
" <td id=\"T_be170_row0_col2\" class=\"data row0 col2\" >0.998656</td>\n",
|
||
" <td id=\"T_be170_row0_col3\" class=\"data row0 col3\" >1.000000</td>\n",
|
||
" <td id=\"T_be170_row0_col4\" class=\"data row0 col4\" >0.998333</td>\n",
|
||
" <td id=\"T_be170_row0_col5\" class=\"data row0 col5\" >0.999049</td>\n",
|
||
" <td id=\"T_be170_row0_col6\" class=\"data row0 col6\" >0.998824</td>\n",
|
||
" <td id=\"T_be170_row0_col7\" class=\"data row0 col7\" >0.999340</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th id=\"T_be170_level0_row1\" class=\"row_heading level0 row1\" >KNeighbors</th>\n",
|
||
" <td id=\"T_be170_row1_col0\" class=\"data row1 col0\" >0.833878</td>\n",
|
||
" <td id=\"T_be170_row1_col1\" class=\"data row1 col1\" >0.822581</td>\n",
|
||
" <td id=\"T_be170_row1_col2\" class=\"data row1 col2\" >0.856855</td>\n",
|
||
" <td id=\"T_be170_row1_col3\" class=\"data row1 col3\" >0.808454</td>\n",
|
||
" <td id=\"T_be170_row1_col4\" class=\"data row1 col4\" >0.777619</td>\n",
|
||
" <td id=\"T_be170_row1_col5\" class=\"data row1 col5\" >0.736441</td>\n",
|
||
" <td id=\"T_be170_row1_col6\" class=\"data row1 col6\" >0.845210</td>\n",
|
||
" <td id=\"T_be170_row1_col7\" class=\"data row1 col7\" >0.815456</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th id=\"T_be170_level0_row2\" class=\"row_heading level0 row2\" >NaiveBayes</th>\n",
|
||
" <td id=\"T_be170_row2_col0\" class=\"data row2 col0\" >0.708571</td>\n",
|
||
" <td id=\"T_be170_row2_col1\" class=\"data row2 col1\" >0.720266</td>\n",
|
||
" <td id=\"T_be170_row2_col2\" class=\"data row2 col2\" >1.000000</td>\n",
|
||
" <td id=\"T_be170_row2_col3\" class=\"data row2 col3\" >1.000000</td>\n",
|
||
" <td id=\"T_be170_row2_col4\" class=\"data row2 col4\" >0.708571</td>\n",
|
||
" <td id=\"T_be170_row2_col5\" class=\"data row2 col5\" >0.720266</td>\n",
|
||
" <td id=\"T_be170_row2_col6\" class=\"data row2 col6\" >0.829431</td>\n",
|
||
" <td id=\"T_be170_row2_col7\" class=\"data row2 col7\" >0.837389</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th id=\"T_be170_level0_row3\" class=\"row_heading level0 row3\" >LogisticRegression</th>\n",
|
||
" <td id=\"T_be170_row3_col0\" class=\"data row3 col0\" >0.677130</td>\n",
|
||
" <td id=\"T_be170_row3_col1\" class=\"data row3 col1\" >0.683190</td>\n",
|
||
" <td id=\"T_be170_row3_col2\" class=\"data row3 col2\" >0.862567</td>\n",
|
||
" <td id=\"T_be170_row3_col3\" class=\"data row3 col3\" >0.837517</td>\n",
|
||
" <td id=\"T_be170_row3_col4\" class=\"data row3 col4\" >0.611190</td>\n",
|
||
" <td id=\"T_be170_row3_col5\" class=\"data row3 col5\" >0.603235</td>\n",
|
||
" <td id=\"T_be170_row3_col6\" class=\"data row3 col6\" >0.758682</td>\n",
|
||
" <td id=\"T_be170_row3_col7\" class=\"data row3 col7\" >0.752522</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n"
|
||
],
|
||
"text/plain": [
|
||
"<pandas.io.formats.style.Styler at 0x272f5d04bc0>"
|
||
]
|
||
},
|
||
"execution_count": 106,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"class_metrics = pd.DataFrame.from_dict(param_grid, \"index\")[\n",
|
||
" [\n",
|
||
" \"Precision_train\",\n",
|
||
" \"Precision_test\",\n",
|
||
" \"Recall_train\",\n",
|
||
" \"Recall_test\",\n",
|
||
" \"Accuracy_train\",\n",
|
||
" \"Accuracy_test\",\n",
|
||
" \"F1_train\",\n",
|
||
" \"F1_test\",\n",
|
||
" ]\n",
|
||
"]\n",
|
||
"class_metrics.sort_values(\n",
|
||
" by=\"Accuracy_test\", ascending=False\n",
|
||
").style.background_gradient(\n",
|
||
" cmap=\"plasma\",\n",
|
||
" low=0.3,\n",
|
||
" high=1,\n",
|
||
" subset=[\"Accuracy_train\", \"Accuracy_test\", \"F1_train\", \"F1_test\"],\n",
|
||
").background_gradient(\n",
|
||
" cmap=\"viridis\",\n",
|
||
" low=1,\n",
|
||
" high=0.3,\n",
|
||
" subset=[\n",
|
||
" \"Precision_train\",\n",
|
||
" \"Precision_test\",\n",
|
||
" \"Recall_train\",\n",
|
||
" \"Recall_test\",\n",
|
||
" ],\n",
|
||
")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
"#### **Conclusions:**\n",
"1. Decision Tree\n",
"* Precision:\n",
"High on both the training (0.998992) and test (0.998681) sets.\n",
"* Recall:\n",
"Near-perfect on the training set (0.998656) and perfect on the test set (1.000000).\n",
"* Accuracy:\n",
"Very high: 0.998333 on the training set and 0.999049 on the test set.\n",
"* F1 Score:\n",
"High on both sets (0.998824 on training, 0.999340 on test).\n",
"\n",
"Conclusion: The decision tree shows the best results on all metrics. It handles both the training and the test data well and is probably the best choice.\n",
"\n",
"2. K-Neighbors (k-nearest neighbors)\n",
"* Precision:\n",
"Moderate on both the training (0.833878) and test (0.822581) sets.\n",
"* Recall:\n",
"0.856855 on the training set, dropping to 0.808454 on the test set.\n",
"* Accuracy:\n",
"Moderate: 0.777619 on training and 0.736441 on test.\n",
"* F1 Score:\n",
"Moderate (0.815456 on test).\n",
"\n",
"Conclusion: K-nearest neighbors shows average results. Its recall is acceptable, but its precision and accuracy are well below those of the decision tree.\n",
"\n",
"3. Naive Bayes\n",
"* Precision:\n",
"Low (0.708571 on training, 0.720266 on test).\n",
"* Recall:\n",
"Perfect on both sets (1.000000), but only because the model predicts the positive class for every sample, not because it separates the classes.\n",
"* Accuracy:\n",
"0.708571 on training and 0.720266 on test, far below the best model. These values equal the precision, that is, the share of high samples in each set.\n",
"* F1 Score:\n",
"Moderate (0.837389 on test).\n",
"\n",
"Conclusion: Naive Bayes combines perfect recall with low precision, which is an artifact of always predicting high. It does not distinguish between the classes, so it is of little use for this task.\n",
"\n",
"4. Logistic Regression\n",
"* Precision:\n",
"Low (0.677130 on training and 0.683190 on test).\n",
"* Recall:\n",
"0.862567 on training and 0.837517 on test.\n",
"* Accuracy:\n",
"The lowest of all models: 0.611190 on training and 0.603235 on test.\n",
"* F1 Score:\n",
"The lowest of all models (0.752522 on test).\n",
"\n",
"Conclusion: Logistic regression has the worst precision, accuracy, and F1 score of the four models, which calls into question its applicability to this task.\n",
"\n",
"Best model: the decision tree is the most effective model."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
"#### ROC curve, Cohen's kappa, Matthews correlation coefficient:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 107,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<style type=\"text/css\">\n",
|
||
"#T_08895_row0_col0, #T_08895_row0_col1 {\n",
|
||
" background-color: #a8db34;\n",
|
||
" color: #000000;\n",
|
||
"}\n",
|
||
"#T_08895_row0_col2, #T_08895_row0_col3, #T_08895_row0_col4 {\n",
|
||
" background-color: #da5a6a;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_08895_row1_col0 {\n",
|
||
" background-color: #20a386;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_08895_row1_col1 {\n",
|
||
" background-color: #1e9b8a;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_08895_row1_col2 {\n",
|
||
" background-color: #aa2395;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_08895_row1_col3, #T_08895_row2_col2 {\n",
|
||
" background-color: #9a169f;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_08895_row1_col4 {\n",
|
||
" background-color: #9d189d;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_08895_row2_col0 {\n",
|
||
" background-color: #1fa088;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_08895_row2_col1 {\n",
|
||
" background-color: #20a486;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_08895_row2_col3 {\n",
|
||
" background-color: #6a00a8;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_08895_row2_col4 {\n",
|
||
" background-color: #6f00a8;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_08895_row3_col0, #T_08895_row3_col1 {\n",
|
||
" background-color: #26818e;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"#T_08895_row3_col2, #T_08895_row3_col3, #T_08895_row3_col4 {\n",
|
||
" background-color: #4e02a2;\n",
|
||
" color: #f1f1f1;\n",
|
||
"}\n",
|
||
"</style>\n",
|
||
"<table id=\"T_08895\">\n",
|
||
" <thead>\n",
|
||
" <tr>\n",
|
||
" <th class=\"blank level0\" > </th>\n",
|
||
" <th id=\"T_08895_level0_col0\" class=\"col_heading level0 col0\" >Accuracy_test</th>\n",
|
||
" <th id=\"T_08895_level0_col1\" class=\"col_heading level0 col1\" >F1_test</th>\n",
|
||
" <th id=\"T_08895_level0_col2\" class=\"col_heading level0 col2\" >ROC_AUC_test</th>\n",
|
||
" <th id=\"T_08895_level0_col3\" class=\"col_heading level0 col3\" >Cohen_kappa_test</th>\n",
|
||
" <th id=\"T_08895_level0_col4\" class=\"col_heading level0 col4\" >MCC_test</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th id=\"T_08895_level0_row0\" class=\"row_heading level0 row0\" >DecisionTree</th>\n",
|
||
" <td id=\"T_08895_row0_col0\" class=\"data row0 col0\" >0.999049</td>\n",
|
||
" <td id=\"T_08895_row0_col1\" class=\"data row0 col1\" >0.999340</td>\n",
|
||
" <td id=\"T_08895_row0_col2\" class=\"data row0 col2\" >0.998295</td>\n",
|
||
" <td id=\"T_08895_row0_col3\" class=\"data row0 col3\" >0.997636</td>\n",
|
||
" <td id=\"T_08895_row0_col4\" class=\"data row0 col4\" >0.997639</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th id=\"T_08895_level0_row1\" class=\"row_heading level0 row1\" >KNeighbors</th>\n",
|
||
" <td id=\"T_08895_row1_col0\" class=\"data row1 col0\" >0.736441</td>\n",
|
||
" <td id=\"T_08895_row1_col1\" class=\"data row1 col1\" >0.815456</td>\n",
|
||
" <td id=\"T_08895_row1_col2\" class=\"data row1 col2\" >0.769925</td>\n",
|
||
" <td id=\"T_08895_row1_col3\" class=\"data row1 col3\" >0.354679</td>\n",
|
||
" <td id=\"T_08895_row1_col4\" class=\"data row1 col4\" >0.354842</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th id=\"T_08895_level0_row2\" class=\"row_heading level0 row2\" >NaiveBayes</th>\n",
|
||
" <td id=\"T_08895_row2_col0\" class=\"data row2 col0\" >0.720266</td>\n",
|
||
" <td id=\"T_08895_row2_col1\" class=\"data row2 col1\" >0.837389</td>\n",
|
||
" <td id=\"T_08895_row2_col2\" class=\"data row2 col2\" >0.711442</td>\n",
|
||
" <td id=\"T_08895_row2_col3\" class=\"data row2 col3\" >0.000000</td>\n",
|
||
" <td id=\"T_08895_row2_col4\" class=\"data row2 col4\" >0.000000</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th id=\"T_08895_level0_row3\" class=\"row_heading level0 row3\" >LogisticRegression</th>\n",
|
||
" <td id=\"T_08895_row3_col0\" class=\"data row3 col0\" >0.603235</td>\n",
|
||
" <td id=\"T_08895_row3_col1\" class=\"data row3 col1\" >0.752522</td>\n",
|
||
" <td id=\"T_08895_row3_col2\" class=\"data row3 col2\" >0.466647</td>\n",
|
||
" <td id=\"T_08895_row3_col3\" class=\"data row3 col3\" >-0.197637</td>\n",
|
||
" <td id=\"T_08895_row3_col4\" class=\"data row3 col4\" >-0.226884</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n"
|
||
],
|
||
"text/plain": [
|
||
"<pandas.io.formats.style.Styler at 0x272f27e11c0>"
|
||
]
|
||
},
|
||
"execution_count": 107,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"class_metrics = pd.DataFrame.from_dict(param_grid, \"index\")[\n",
|
||
" [\n",
|
||
" \"Accuracy_test\",\n",
|
||
" \"F1_test\",\n",
|
||
" \"ROC_AUC_test\",\n",
|
||
" \"Cohen_kappa_test\",\n",
|
||
" \"MCC_test\",\n",
|
||
" ]\n",
|
||
"]\n",
|
||
"class_metrics.sort_values(by=\"ROC_AUC_test\", ascending=False).style.background_gradient(\n",
|
||
" cmap=\"plasma\",\n",
|
||
" low=0.3,\n",
|
||
" high=1,\n",
|
||
" subset=[\n",
|
||
" \"ROC_AUC_test\",\n",
|
||
" \"MCC_test\",\n",
|
||
" \"Cohen_kappa_test\",\n",
|
||
" ],\n",
|
||
").background_gradient(\n",
|
||
" cmap=\"viridis\",\n",
|
||
" low=1,\n",
|
||
" high=0.3,\n",
|
||
" subset=[\n",
|
||
" \"Accuracy_test\",\n",
|
||
" \"F1_test\",\n",
|
||
" ],\n",
|
||
")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
"#### **Conclusions:**\n",
"1. Decision Tree\n",
"* Accuracy: 0.999049\n",
"* F1 Score: 0.999340\n",
"* ROC AUC: 0.998295\n",
"* Cohen's Kappa: 0.997636\n",
"* MCC (Matthews Correlation Coefficient): 0.997639\n",
"\n",
"Conclusion: The decision tree shows the best results on all metrics. High accuracy, F1 score, and ROC AUC indicate that the model handles both class prediction and probability ranking well, which makes it the most suitable for the forecasting task.\n",
"\n",
"2. K-Neighbors (k-nearest neighbors)\n",
"* Accuracy: 0.736441\n",
"* F1 Score: 0.815456\n",
"* ROC AUC: 0.769925\n",
"* Cohen's Kappa: 0.354679\n",
"* MCC: 0.354842\n",
"\n",
"Conclusion: K-nearest neighbors shows average results. Its F1 score and ROC AUC are acceptable, but Cohen's kappa and MCC of about 0.35 indicate only modest agreement beyond chance, far below the decision tree. Given the goal, K-Neighbors is a less effective choice.\n",
"\n",
"3. Naive Bayes\n",
"* Accuracy: 0.720266\n",
"* F1 Score: 0.837389\n",
"* ROC AUC: 0.711442\n",
"* Cohen's Kappa: 0.000000\n",
"* MCC: 0.000000\n",
"\n",
"Conclusion: Naive Bayes has accuracy and F1 score comparable to K-Neighbors, but its Cohen's kappa and MCC are exactly zero. Its predictions therefore agree with the true labels no better than chance, which is consistent with predicting a single class. Its usefulness is limited.\n",
"\n",
"4. Logistic Regression\n",
"* Accuracy: 0.603235\n",
"* F1 Score: 0.752522\n",
"* ROC AUC: 0.466647\n",
"* Cohen's Kappa: -0.197637\n",
"* MCC: -0.226884\n",
"\n",
"Conclusion: Logistic regression shows the worst results of all models. A ROC AUC below 0.5 together with negative Cohen's kappa and MCC means that it performs worse than random guessing on the test set, which makes it the least suitable option for the current business goal.\n",
"\n",
"Based on this analysis, the decision tree is the best choice for the stock price forecasting model. It performs well on all key metrics, which makes it the most reliable and effective model for predicting the closing price. "
|
||
]
|
||
},
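{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cohen's kappa and MCC can also be derived directly from the four confusion-matrix counts, which shows why a model that predicts a single class scores exactly zero. The notebook itself uses scikit-learn's `cohen_kappa_score` and `matthews_corrcoef`. The pure-Python helper below, with the hypothetical name `kappa_and_mcc`, is only an illustrative sketch for the binary case:\n",
"\n",
"```python\n",
"import math\n",
"\n",
"\n",
"def kappa_and_mcc(tp, tn, fp, fn):\n",
"    # Cohen's kappa and MCC for a binary confusion matrix (hypothetical helper)\n",
"    n = tp + tn + fp + fn\n",
"    observed = (tp + tn) / n\n",
"    # Agreement expected by chance, from the predicted and actual class totals\n",
"    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)\n",
"    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0\n",
"    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))\n",
"    # By convention MCC is taken as 0 when any of the four totals is zero\n",
"    mcc = (tp * tn - fp * fn) / denom if denom else 0.0\n",
"    return kappa, mcc\n",
"\n",
"\n",
"# Test-set counts as reported in the confusion-matrix section\n",
"print('NaiveBayes', kappa_and_mcc(tp=757, tn=0, fp=294, fn=0))\n",
"print('LogisticRegression', kappa_and_mcc(tp=634, tn=0, fp=294, fn=123))\n",
"```\n",
"\n",
"For Naive Bayes the observed agreement equals the chance agreement, so kappa is zero, and the product of the totals in the MCC denominator vanishes because no sample is predicted as low."
]
},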
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 108,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'DecisionTree'"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"best_model = str(class_metrics.sort_values(by=\"MCC_test\", ascending=False).iloc[0].name)\n",
|
||
"\n",
|
||
"display(best_model)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"ROC curve visualization"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 1000x800 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"from sklearn.metrics import auc, roc_curve\n",
"\n",
"# Dictionary that stores the ROC data for each model\n",
"results = {}\n",
"\n",
"# Compute the ROC curve for every tuned model\n",
"for model_name in best_estimators.keys():\n",
"    # Predicted probabilities of the positive class\n",
"    y_scores = best_estimators[model_name].predict_proba(X_test_close)[:, 1]\n",
"    fpr, tpr, _ = roc_curve(y_test_close, y_scores)\n",
"    roc_auc = auc(fpr, tpr)\n",
"\n",
"    # Store the curve points and the AUC in results\n",
"    results[model_name] = {\n",
"        'fpr': fpr,\n",
"        'tpr': tpr,\n",
"        'roc_auc': roc_auc\n",
"    }\n",
"\n",
"# Plot the ROC curves of all models on one figure\n",
"plt.figure(figsize=(10, 8))\n",
"for model_name, metrics in results.items():\n",
"    plt.plot(metrics['fpr'], metrics['tpr'], lw=2, label=f'{model_name} (AUC = {metrics[\"roc_auc\"]:.2f})')\n",
"\n",
"# Diagonal reference line of a random classifier\n",
"plt.plot([0, 1], [0, 1], 'k--', lw=2)\n",
"\n",
"plt.xlim([0.0, 1.0])\n",
"plt.ylim([0.0, 1.05])\n",
"plt.xlabel('False Positive Rate')\n",
"plt.ylabel('True Positive Rate')\n",
"plt.title('Receiver Operating Characteristic')\n",
"plt.legend(loc='lower right')\n",
"plt.grid()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ROC (Receiver Operating Characteristic) curve is a plot used to evaluate classifier performance. It shows the relationship between two rates:\n",
"\n",
"* True Positive Rate (TPR), also known as sensitivity or recall: the share of correctly identified positives among all positive examples.\n",
"* False Positive Rate (FPR): the share of false positives among all negative examples.\n",
"\n",
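"\n",
"In confusion-matrix terms, the two rates can be written as follows. The symbols $TP$, $FN$, $FP$, and $TN$ are introduced here for illustration and count the true positives, false negatives, false positives, and true negatives at a given classification threshold:\n",
"\n",
"$$\\mathrm{TPR} = \\frac{TP}{TP + FN}, \\qquad \\mathrm{FPR} = \\frac{FP}{FP + TN}$$\n",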
"\n",
"**ROC curve and AUC**\n",
"\n",
"The ROC curve is built by plotting TPR against FPR at different classification thresholds. The area under the ROC curve (AUC, Area Under the Curve) is one of the main metrics of classifier quality:\n",
"\n",
"* AUC = 1: the model classifies every example perfectly.\n",
"* AUC = 0.5: the model is no better than random guessing.\n",
"* AUC < 0.5: the model performs worse than random guessing.\n",
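"\n",
"AUC also has a probabilistic reading, sketched here with notation introduced for illustration: $s(x)$ is the model's score, $x^{+}$ is a randomly chosen positive example, and $x^{-}$ is a randomly chosen negative example (ties, if any, count as one half):\n",
"\n",
"$$\\mathrm{AUC} = P\\big(s(x^{+}) > s(x^{-})\\big)$$\n",
"\n",
"Under this reading, an AUC of 0.5 means that a positive example is ranked above a negative one only half of the time.\n",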
"\n",
"**Analysis of the resulting ROC curves:**\n",
"\n",
"* Decision Tree (green line): AUC is 1, which indicates excellent performance: the model separates the positive and negative classes perfectly on the test set.\n",
"* KNeighbors (blue line): AUC is 0.77. The model performs reasonably well, but falls clearly short of the decision tree.\n",
"* Naive Bayes (orange line): AUC is 0.71. The model shows moderate results and is markedly weaker than the decision tree.\n",
"* Logistic Regression (red line): AUC is 0.47, slightly below the 0.5 of a random classifier, so the model has essentially no discriminative power.\n",
"\n",
"**Overall conclusion**\n",
"\n",
"The decision tree model stands out from the others with the highest AUC and accuracy. This makes it the preferred option for business forecasting. The remaining models show more modest results and may be less reliable."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "aimenv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}