{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Датасет: Цены на акции\n", "https://www.kaggle.com/datasets/nancyalaswad90/yamana-gold-inc-stock-Volume\n", "##### О наборе данных: \n", "Yamana Gold Inc. — это канадская компания, которая занимается разработкой и управлением золотыми, серебряными и медными рудниками, расположенными в Канаде, Чили, Бразилии и Аргентине. Головной офис компании находится в Торонто.\n", "\n", "Yamana Gold была основана в 1994 году и уже через год была зарегистрирована на фондовой бирже Торонто. В 2007 году она стала участником Нью-Йоркской фондовой биржи, а в 2020 году — Лондонской.\n", "В 2003 году компания претерпела значительные изменения: была проведена реструктуризация, в результате которой Питер Марроне занял пост главного исполнительного директора. Кроме того, Yamana объединилась с бразильской компанией Santa Elina Mines Corporation. Благодаря этому слиянию Yamana получила доступ к капиталу, накопленному Santa Elina, что позволило ей начать разработку и эксплуатацию рудника Чапада. Затем компания объединилась с другими организациями, зарегистрированными на бирже TSX: RNC Gold, Desert Sun Mining, Viceroy Exploration, Northern Orion Resources, Meridian Gold, Osisko Mining и Extorre Gold Mines. Каждая из них внесла свой вклад в разработку месторождения или проект, который в итоге был успешно запущен.\n", "##### Таким образом:\n", "* Объект наблюдения - цены и объемы акций компании\n", "* Атрибуты: 'Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'\n", "\n", "##### Бизнес цели:\n", "* Прогнозирование будущей цены акций.(Цены закрытия)\n", " Использование данных для создания модели, которая будет предсказывать цену акций компании в будущем. Целевая переменная: Цена закрытия (Close)\n", "* Определение волатильности акций.\n", " Определение, колебаний цен акций, что поможет инвесторам понять риски. Прогнозировать волатильность акций на основе изменений в ценах открытий, максимума, минимума и объема торгов. Целевая переменная: Разница между высокой и низкой ценой (High - Low). (среднее значение)" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Количество колонок: 7\n", "Колонки: Date, Open, High, Low, Close, Adj Close, Volume\n", "\n", "\n", "RangeIndex: 5251 entries, 0 to 5250\n", "Data columns (total 7 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Date 5251 non-null datetime64[ns]\n", " 1 Open 5251 non-null float64 \n", " 2 High 5251 non-null float64 \n", " 3 Low 5251 non-null float64 \n", " 4 Close 5251 non-null float64 \n", " 5 Adj Close 5251 non-null float64 \n", " 6 Volume 5251 non-null int64 \n", "dtypes: datetime64[ns](1), float64(5), int64(1)\n", "memory usage: 287.3 KB\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateOpenHighLowCloseAdj CloseVolume
52462022-04-295.665.695.505.515.5116613300
52472022-05-025.335.395.185.305.3027106700
52482022-05-035.325.535.325.475.4718914200
52492022-05-045.475.615.375.605.6020530700
52502022-05-055.635.665.345.445.4419879200
\n", "
" ], "text/plain": [ " Date Open High Low Close Adj Close Volume\n", "5246 2022-04-29 5.66 5.69 5.50 5.51 5.51 16613300\n", "5247 2022-05-02 5.33 5.39 5.18 5.30 5.30 27106700\n", "5248 2022-05-03 5.32 5.53 5.32 5.47 5.47 18914200\n", "5249 2022-05-04 5.47 5.61 5.37 5.60 5.60 20530700\n", "5250 2022-05-05 5.63 5.66 5.34 5.44 5.44 19879200" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from sklearn.metrics import mean_squared_error, r2_score\n", "from sklearn.discriminant_analysis import StandardScaler\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.pipeline import Pipeline, FeatureUnion\n", "from sklearn.compose import ColumnTransformer\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n", "from sklearn.base import BaseEstimator, TransformerMixin\n", "\n", "from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor\n", "from sklearn.neural_network import MLPRegressor\n", "\n", "df = pd.read_csv(\".//static//csv//Stocks.csv\", sep=\",\")\n", "print('Количество колонок: ' + str(df.columns.size)) \n", "print('Колонки: ' + ', '.join(df.columns)+'\\n')\n", "df['Date'] = pd.to_datetime(df['Date'], errors='coerce')\n", "\n", "\n", "df.info()\n", "df.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Подготовка данных:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1. Получение сведений о пропущенных данных\n", "Типы пропущенных данных:\n", "\n", "- None - представление пустых данных в Python\n", "- NaN - представление пустых данных в Pandas\n", "- '' - пустая строка" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Date 0\n", "Open 0\n", "High 0\n", "Low 0\n", "Close 0\n", "Adj Close 0\n", "Volume 0\n", "dtype: int64\n", "\n", "Date False\n", "Open False\n", "High False\n", "Low False\n", "Close False\n", "Adj Close False\n", "Volume False\n", "dtype: bool\n", "\n", "Количество бесконечных значений в каждом столбце:\n", "Date 0\n", "Open 0\n", "High 0\n", "Low 0\n", "Close 0\n", "Adj Close 0\n", "Volume 0\n", "dtype: int64\n", "Date процент пустых значений: %0.00\n", "Open процент пустых значений: %0.00\n", "High процент пустых значений: %0.00\n", "Low процент пустых значений: %0.00\n", "Close процент пустых значений: %0.00\n", "Adj Close процент пустых значений: %0.00\n", "Volume процент пустых значений: %0.00\n" ] } ], "source": [ "# Количество пустых значений признаков\n", "print(df.isnull().sum())\n", "print()\n", "\n", "# Есть ли пустые значения признаков\n", "print(df.isnull().any())\n", "print()\n", "\n", "# Проверка на бесконечные значения\n", "print(\"Количество бесконечных значений в каждом столбце:\")\n", "print(np.isinf(df).sum())\n", "\n", "# Процент пустых значений признаков\n", "for i in df.columns:\n", " null_rate = df[i].isnull().sum() / len(df) * 100\n", " print(f\"{i} процент пустых значений: %{null_rate:.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Таким образом, пропущенных значений не найдено." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. Проверка выбросов данных и устранение их при наличии:" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "До устранения выбросов:\n", "Колонка Open:\n", " Есть выбросы: Нет\n", " Количество выбросов: 0\n", " Минимальное значение: 1.142857\n", " Максимальное значение: 20.42\n", " 1-й квартиль (Q1): 2.857143\n", " 3-й квартиль (Q3): 10.65\n", "\n", "После устранения выбросов:\n", "Колонка Open:\n", " Есть выбросы: Нет\n", " Количество выбросов: 0\n", " Минимальное значение: 1.142857\n", " Максимальное значение: 20.42\n", " 1-й квартиль (Q1): 2.857143\n", " 3-й квартиль (Q3): 10.65\n", "\n", "До устранения выбросов:\n", "Колонка High:\n", " Есть выбросы: Нет\n", " Количество выбросов: 0\n", " Минимальное значение: 1.142857\n", " Максимальное значение: 20.59\n", " 1-й квартиль (Q1): 2.88\n", " 3-й квартиль (Q3): 10.86\n", "\n", "После устранения выбросов:\n", "Колонка High:\n", " Есть выбросы: Нет\n", " Количество выбросов: 0\n", " Минимальное значение: 1.142857\n", " Максимальное значение: 20.59\n", " 1-й квартиль (Q1): 2.88\n", " 3-й квартиль (Q3): 10.86\n", "\n", "До устранения выбросов:\n", "Колонка Low:\n", " Есть выбросы: Нет\n", " Количество выбросов: 0\n", " Минимальное значение: 1.142857\n", " Максимальное значение: 20.09\n", " 1-й квартиль (Q1): 2.81\n", " 3-й квартиль (Q3): 10.425\n", "\n", "После устранения выбросов:\n", "Колонка Low:\n", " Есть выбросы: Нет\n", " Количество выбросов: 0\n", " Минимальное значение: 1.142857\n", " Максимальное значение: 20.09\n", " 1-й квартиль (Q1): 2.81\n", " 3-й квартиль (Q3): 10.425\n", "\n", "До устранения выбросов:\n", "Колонка Close:\n", " Есть выбросы: Нет\n", " Количество выбросов: 0\n", " Минимальное значение: 1.142857\n", " Максимальное значение: 20.389999\n", " 1-й квартиль (Q1): 2.857143\n", " 3-й квартиль (Q3): 10.64\n", "\n", "После устранения выбросов:\n", "Колонка Close:\n", " Есть выбросы: Нет\n", " Количество выбросов: 0\n", " Минимальное значение: 1.142857\n", " Максимальное значение: 20.389999\n", " 1-й квартиль (Q1): 2.857143\n", " 3-й квартиль (Q3): 10.64\n", "\n", "До устранения выбросов:\n", "Колонка Adj Close:\n", " Есть выбросы: Нет\n", " Количество выбросов: 0\n", " Минимальное значение: 0.935334\n", " Максимальное значение: 17.543156\n", " 1-й квартиль (Q1): 2.537094\n", " 3-й квартиль (Q3): 8.951944999999998\n", "\n", "После устранения выбросов:\n", "Колонка Adj Close:\n", " Есть выбросы: Нет\n", " Количество выбросов: 0\n", " Минимальное значение: 0.935334\n", " Максимальное значение: 17.543156\n", " 1-й квартиль (Q1): 2.537094\n", " 3-й квартиль (Q3): 8.951944999999998\n", "\n", "До устранения выбросов:\n", "Колонка Volume:\n", " Есть выбросы: Да\n", " Количество выбросов: 95\n", " Минимальное значение: 0\n", " Максимальное значение: 76714000\n", " 1-й квартиль (Q1): 2845900.0\n", " 3-й квартиль (Q3): 13272450.0\n", "\n", "После устранения выбросов:\n", "Колонка Volume:\n", " Есть выбросы: Нет\n", " Количество выбросов: 0\n", " Минимальное значение: 0.0\n", " Максимальное значение: 28912275.0\n", " 1-й квартиль (Q1): 2845900.0\n", " 3-й квартиль (Q3): 13272450.0\n", "\n" ] } ], "source": [ "numeric_columns = ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']\n", "\n", "for column in numeric_columns:\n", " if pd.api.types.is_numeric_dtype(df[column]): # Проверяем, является ли колонка числовой\n", " q1 = df[column].quantile(0.25) # Находим 1-й квартиль (Q1)\n", " q3 = df[column].quantile(0.75) # Находим 3-й квартиль (Q3)\n", " iqr = q3 - q1 # Вычисляем межквартильный размах (IQR)\n", "\n", " # Определяем границы для выбросов\n", " lower_bound = q1 - 1.5 * iqr # Нижняя граница\n", " upper_bound = q3 + 1.5 * iqr # Верхняя граница\n", "\n", " # Подсчитываем количество выбросов\n", " outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]\n", " outlier_count = outliers.shape[0]\n", "\n", " print(\"До устранения выбросов:\")\n", " print(f\"Колонка {column}:\")\n", " print(f\" Есть выбросы: {'Да' if outlier_count > 0 else 'Нет'}\")\n", " print(f\" Количество выбросов: {outlier_count}\")\n", " print(f\" Минимальное значение: {df[column].min()}\")\n", " print(f\" Максимальное значение: {df[column].max()}\")\n", " print(f\" 1-й квартиль (Q1): {q1}\")\n", " print(f\" 3-й квартиль (Q3): {q3}\\n\")\n", "\n", " # Устраняем выбросы: заменяем значения ниже нижней границы на саму нижнюю границу, а выше верхней — на верхнюю\n", " if outlier_count != 0:\n", " df[column] = df[column].apply(lambda x: lower_bound if x < lower_bound else upper_bound if x > upper_bound else x)\n", " \n", " # Подсчитываем количество выбросов\n", " outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]\n", " outlier_count = outliers.shape[0]\n", "\n", " print(\"После устранения выбросов:\")\n", " print(f\"Колонка {column}:\")\n", " print(f\" Есть выбросы: {'Да' if outlier_count > 0 else 'Нет'}\")\n", " print(f\" Количество выбросов: {outlier_count}\")\n", " print(f\" Минимальное значение: {df[column].min()}\")\n", " print(f\" Максимальное значение: {df[column].max()}\")\n", " print(f\" 1-й квартиль (Q1): {q1}\")\n", " print(f\" 3-й квартиль (Q3): {q3}\\n\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Выбросы присутствовали, но мы их устранили." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Разбиение на выборки:\n", "\n", "Разобьем наш набор на обучающую, контрольную и тестовую выборки для устранения проблемы просачивания данных.\n", "Разделим на два варианта - набор для первой бизнес цели - его будем применять для решения задаи регрессии. И набор для второй бизнес цели - его используем для решения задач классификации." ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'X_train'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OpenHighLowCloseVolume
47895.665.735.475.56000023355100.0
34693.863.933.813.8800007605300.0
250312.1912.2811.9512.0200007243200.0
158011.7711.8411.5311.5700003025900.0
275915.7716.1715.7616.1200016113400.0
..................
30929.579.879.309.7500007283100.0
37724.764.974.674.93000012920800.0
51914.184.294.174.20000011192400.0
52265.585.685.555.58000012692800.0
8603.183.193.133.18000099100.0
\n", "

4200 rows × 5 columns

\n", "
" ], "text/plain": [ " Open High Low Close Volume\n", "4789 5.66 5.73 5.47 5.560000 23355100.0\n", "3469 3.86 3.93 3.81 3.880000 7605300.0\n", "2503 12.19 12.28 11.95 12.020000 7243200.0\n", "1580 11.77 11.84 11.53 11.570000 3025900.0\n", "2759 15.77 16.17 15.76 16.120001 6113400.0\n", "... ... ... ... ... ...\n", "3092 9.57 9.87 9.30 9.750000 7283100.0\n", "3772 4.76 4.97 4.67 4.930000 12920800.0\n", "5191 4.18 4.29 4.17 4.200000 11192400.0\n", "5226 5.58 5.68 5.55 5.580000 12692800.0\n", "860 3.18 3.19 3.13 3.180000 99100.0\n", "\n", "[4200 rows x 5 columns]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'y_train'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Volatility
47890.046763
34690.030928
25030.027454
15800.026793
27590.025434
......
30920.058462
37720.060852
51910.028571
52260.023297
8600.018868
\n", "

4200 rows × 1 columns

\n", "
" ], "text/plain": [ " Volatility\n", "4789 0.046763\n", "3469 0.030928\n", "2503 0.027454\n", "1580 0.026793\n", "2759 0.025434\n", "... ...\n", "3092 0.058462\n", "3772 0.060852\n", "5191 0.028571\n", "5226 0.023297\n", "860 0.018868\n", "\n", "[4200 rows x 1 columns]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'X_test'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OpenHighLowCloseVolume
143713.71000014.00000013.67000013.9400007623200.0
270015.52000015.72000015.30000015.3200006098800.0
36471.8700001.9300001.8300001.83000010980000.0
251211.26000011.47000011.26000011.3200005029300.0
290216.37999916.58000016.25000016.5499995485800.0
..................
30959.2900009.3500009.0700009.1300005861400.0
8593.0900003.1600003.0400003.100000211300.0
31348.5500008.7700008.5500008.7700005335400.0
257716.70999917.07000016.37999916.40000014524400.0
3782.5714292.5714292.5714292.5714290.0
\n", "

1051 rows × 5 columns

\n", "
" ], "text/plain": [ " Open High Low Close Volume\n", "1437 13.710000 14.000000 13.670000 13.940000 7623200.0\n", "2700 15.520000 15.720000 15.300000 15.320000 6098800.0\n", "3647 1.870000 1.930000 1.830000 1.830000 10980000.0\n", "2512 11.260000 11.470000 11.260000 11.320000 5029300.0\n", "2902 16.379999 16.580000 16.250000 16.549999 5485800.0\n", "... ... ... ... ... ...\n", "3095 9.290000 9.350000 9.070000 9.130000 5861400.0\n", "859 3.090000 3.160000 3.040000 3.100000 211300.0\n", "3134 8.550000 8.770000 8.550000 8.770000 5335400.0\n", "2577 16.709999 17.070000 16.379999 16.400000 14524400.0\n", "378 2.571429 2.571429 2.571429 2.571429 0.0\n", "\n", "[1051 rows x 5 columns]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'y_test'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Volatility
14370.023673
27000.027415
36470.054645
25120.018551
29020.019940
......
30950.030668
8590.038710
31340.025086
25770.042073
3780.000000
\n", "

1051 rows × 1 columns

\n", "
" ], "text/plain": [ " Volatility\n", "1437 0.023673\n", "2700 0.027415\n", "3647 0.054645\n", "2512 0.018551\n", "2902 0.019940\n", "... ...\n", "3095 0.030668\n", "859 0.038710\n", "3134 0.025086\n", "2577 0.042073\n", "378 0.000000\n", "\n", "[1051 rows x 1 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from typing import Tuple\n", "import pandas as pd\n", "from pandas import DataFrame\n", "from sklearn.model_selection import train_test_split\n", "\n", "df['Volatility'] = (df['High'] - df['Low']) / df['Close']\n", "\n", "def split_into_train_test(\n", " df_input: DataFrame,\n", " target_colname: str = \"Volatility\",\n", " frac_train: float = 0.8,\n", " random_state: int = None,\n", ") -> Tuple[DataFrame, DataFrame, DataFrame, DataFrame]:\n", " \n", " if not (0 < frac_train < 1):\n", " raise ValueError(\"Fraction must be between 0 and 1.\")\n", " \n", " # Проверка наличия целевого признака\n", " if target_colname not in df_input.columns:\n", " raise ValueError(f\"{target_colname} is not a column in the DataFrame.\")\n", " \n", " # Разделяем данные на признаки и целевую переменную\n", " X = df_input.drop(columns=[target_colname]) # Признаки\n", " y = df_input[[target_colname]] # Целевая переменная\n", "\n", " # Удаляем указанные столбцы из X\n", " columns_to_remove = [\"Date\", \"Adj Close\", \"Volatility\"]\n", " X = X.drop(columns=columns_to_remove, errors='ignore') # Игнорировать ошибку, если столбцы не найдены\n", "\n", " # Разделяем данные на обучающую и тестовую выборки\n", " X_train, X_test, y_train, y_test = train_test_split(\n", " X, y,\n", " test_size=(1.0 - frac_train),\n", " random_state=random_state\n", " )\n", " \n", " return X_train, X_test, y_train, y_test\n", "\n", "# Применение функции для разделения данных\n", "X_train, X_test, y_train, y_test = split_into_train_test(\n", " df, \n", " target_colname=\"Volatility\", \n", " frac_train=0.8, \n", " random_state=42\n", ")\n", "\n", "# Для отображения результатов\n", "display(\"X_train\", X_train)\n", "display(\"y_train\", y_train)\n", "\n", "display(\"X_test\", X_test)\n", "display(\"y_test\", y_test)" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OpenHighLowVolumeVolatility
47895.665.735.4723355100.00.046763
34693.863.933.817605300.00.030928
250312.1912.2811.957243200.00.027454
158011.7711.8411.533025900.00.026793
275915.7716.1715.766113400.00.025434
..................
30929.579.879.307283100.00.058462
37724.764.974.6712920800.00.060852
51914.184.294.1711192400.00.028571
52265.585.685.5512692800.00.023297
8603.183.193.1399100.00.018868
\n", "

4200 rows × 5 columns

\n", "
" ], "text/plain": [ " Open High Low Volume Volatility\n", "4789 5.66 5.73 5.47 23355100.0 0.046763\n", "3469 3.86 3.93 3.81 7605300.0 0.030928\n", "2503 12.19 12.28 11.95 7243200.0 0.027454\n", "1580 11.77 11.84 11.53 3025900.0 0.026793\n", "2759 15.77 16.17 15.76 6113400.0 0.025434\n", "... ... ... ... ... ...\n", "3092 9.57 9.87 9.30 7283100.0 0.058462\n", "3772 4.76 4.97 4.67 12920800.0 0.060852\n", "5191 4.18 4.29 4.17 11192400.0 0.028571\n", "5226 5.58 5.68 5.55 12692800.0 0.023297\n", "860 3.18 3.19 3.13 99100.0 0.018868\n", "\n", "[4200 rows x 5 columns]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Close
01
11
20
30
40
......
41951
41961
41971
41981
41991
\n", "

4200 rows × 1 columns

\n", "
" ], "text/plain": [ " Close\n", "0 1\n", "1 1\n", "2 0\n", "3 0\n", "4 0\n", "... ...\n", "4195 1\n", "4196 1\n", "4197 1\n", "4198 1\n", "4199 1\n", "\n", "[4200 rows x 1 columns]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OpenHighLowVolumeVolatility
143713.71000014.00000013.6700007623200.00.023673
270015.52000015.72000015.3000006098800.00.027415
36471.8700001.9300001.83000010980000.00.054645
251211.26000011.47000011.2600005029300.00.018551
290216.37999916.58000016.2500005485800.00.019940
..................
30959.2900009.3500009.0700005861400.00.030668
8593.0900003.1600003.040000211300.00.038710
31348.5500008.7700008.5500005335400.00.025086
257716.70999917.07000016.37999914524400.00.042073
3782.5714292.5714292.5714290.00.000000
\n", "

1051 rows × 5 columns

\n", "
" ], "text/plain": [ " Open High Low Volume Volatility\n", "1437 13.710000 14.000000 13.670000 7623200.0 0.023673\n", "2700 15.520000 15.720000 15.300000 6098800.0 0.027415\n", "3647 1.870000 1.930000 1.830000 10980000.0 0.054645\n", "2512 11.260000 11.470000 11.260000 5029300.0 0.018551\n", "2902 16.379999 16.580000 16.250000 5485800.0 0.019940\n", "... ... ... ... ... ...\n", "3095 9.290000 9.350000 9.070000 5861400.0 0.030668\n", "859 3.090000 3.160000 3.040000 211300.0 0.038710\n", "3134 8.550000 8.770000 8.550000 5335400.0 0.025086\n", "2577 16.709999 17.070000 16.379999 14524400.0 0.042073\n", "378 2.571429 2.571429 2.571429 0.0 0.000000\n", "\n", "[1051 rows x 5 columns]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Close
00
10
21
30
40
......
10461
10471
10481
10490
10501
\n", "

1051 rows × 1 columns

\n", "
" ], "text/plain": [ " Close\n", "0 0\n", "1 0\n", "2 1\n", "3 0\n", "4 0\n", "... ...\n", "1046 1\n", "1047 1\n", "1048 1\n", "1049 0\n", "1050 1\n", "\n", "[1051 rows x 1 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "from IPython.display import display\n", "from sklearn.preprocessing import LabelEncoder\n", "from sklearn.model_selection import train_test_split\n", "from typing import Tuple\n", "from pandas import DataFrame\n", "\n", "def split_into_train_close_test(\n", " df_input: DataFrame,\n", " target_colname: str = \"Close\",\n", " frac_train: float = 0.8,\n", " random_state: int = None,\n", ") -> Tuple[DataFrame, DataFrame, DataFrame, DataFrame]:\n", " \n", " if not (0 < frac_train < 1):\n", " raise ValueError(\"Fraction must be between 0 and 1.\")\n", " \n", " # Проверка наличия целевого признака\n", " if target_colname not in df_input.columns:\n", " raise ValueError(f\"{target_colname} is not a column in the DataFrame.\")\n", " \n", " # Разделяем данные на признаки и целевую переменную\n", " X = df_input.drop(columns=[target_colname]) # Признаки\n", " \n", " # Преобразование целевой переменной в категориальную\n", " bins = [-np.inf, 10, np.inf]\n", " labels = ['low', 'high']\n", " y = pd.cut(df_input[target_colname], bins=bins, labels=labels) # Целевая переменная\n", " \n", " # Преобразование целевой переменной в числовые значения\n", " label_encoder = LabelEncoder()\n", " y_encoded = label_encoder.fit_transform(y) # Интеграция, чтобы вернуть числовые метки\n", " \n", " # Удаляем указанные столбцы из X\n", " columns_to_remove = [\"Date\", \"Adj Close\", \"Close\"]\n", " X = X.drop(columns=columns_to_remove, errors='ignore') # Игнорировать ошибку, если столбцы не найдены\n", "\n", " # Разделяем данные на обучающую и тестовую выборки\n", " X_train_close, X_test_close, y_train_close, y_test_close = train_test_split(\n", " X, y_encoded,\n", " test_size=(1.0 - frac_train),\n", " random_state=random_state\n", " )\n", " \n", " # Конвертируем y_train_close и y_test_close в DataFrame\n", " y_train_close = pd.DataFrame(y_train_close, columns=[target_colname])\n", " y_test_close = pd.DataFrame(y_test_close, columns=[target_colname])\n", "\n", " return X_train_close, X_test_close, y_train_close, y_test_close\n", "\n", "# Применение функции для разделения данных\n", "X_train_close, X_test_close, y_train_close, y_test_close = split_into_train_close_test(\n", " df, \n", " target_colname=\"Close\", \n", " frac_train=0.8, \n", " random_state=42\n", ")\n", "\n", "# Для отображения результатов\n", "display(X_train_close)\n", "display(y_train_close)\n", "display(X_test_close)\n", "display(y_test_close)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Определение достижимого уровня качества модели" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Baseline Volatility MSE: 0.0002712979238081643\n", "Baseline Volatility R^2: 0.6636185388238924\n", "Baseline Close MSE: 0.04753452157168089\n", "Baseline Close R^2: 0.764076420247305\n" ] } ], "source": [ "# Оценка базовых моделей (можно использовать, например, линейную регрессию как базу)\n", "from sklearn.linear_model import LinearRegression\n", "\n", "\n", "baseline_model = LinearRegression()\n", "baseline_model.fit(X_train, y_train)\n", "baseline = baseline_model.predict(X_test)\n", "\n", "# Оценка качества\n", "print(f'Baseline Volatility MSE: {mean_squared_error(y_test, baseline)}')\n", "print(f'Baseline Volatility R^2: {r2_score(y_test, baseline)}')\n", "\n", "baseline_model_close = LinearRegression()\n", "baseline_model_close.fit(X_train_close, y_train_close)\n", "baseline_close = baseline_model_close.predict(X_test_close)\n", "\n", "print(f'Baseline Close MSE: {mean_squared_error(y_test_close, baseline_close)}')\n", "print(f'Baseline Close R^2: {r2_score(y_test_close, baseline_close)}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Цена:**\n", "- MSE: 0.0475— Этот показатель говорит о том, что в среднем модель делает небольшую ошибку в предсказании цен.\n", "- R²: 0.6636 — Это значение указывает на то, что модель объясняет только около 66% вариации волатильности.\n", "\n", "**Волатильность:**\n", "- MSE: 0.00027183 — Как и в случае с ценами, это значение может показаться малым, однако из-за низкого значения волатильности в финансовых данных даже небольшие ошибки могут иметь значение.\n", "- R²: 0.6629 — Это значение указывает на то, что модель объясняет только около 66% вариации волатильности." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Создадим конвейер:**\n", "##### Конвейеры позволяют автоматизировать следующие процессы:\n", "1. Предобработка данных.\n", "2. Конструирование признаков.\n", "3. Понижение размерности признакового пространства.\n", "4. Обучение модели.\n", "\n", "\n", "##### Используемые конвейеры:\n", "1. preprocessing_num -- конвейер для обработки числовых данных: заполнение пропущенных значений и стандартизация\n", "\n", "2. preprocessing_cat -- конвейер для обработки категориальных данных: заполнение пропущенных данных и унитарное кодирование\n", "\n", "3. features_preprocessing -- трансформер для предобработки признаков\n", "\n", "4. features_engineering -- трансформер для конструирования признаков\n", "\n", "5. drop_columns -- трансформер для удаления колонок\n", "\n", "6. pipeline_end -- основной конвейер предобработки данных и конструирования признаков" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from sklearn.base import BaseEstimator, TransformerMixin\n", "from sklearn.compose import ColumnTransformer\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import OneHotEncoder\n", "from sklearn.ensemble import RandomForestRegressor # Пример регрессионной модели\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.pipeline import make_pipeline\n", "\n", "class StocksFeatures(BaseEstimator, TransformerMixin):\n", " def __init__(self):\n", " pass\n", " \n", " def fit(self, X, y=None):\n", " return self\n", "\n", " def transform(self, X, y=None):\n", " X[\"Range\"] = X[\"High\"] - X[\"Low\"]\n", " return X\n", "\n", " def get_feature_names_out(self, features_in):\n", " return np.append(features_in, [\"Range\"], axis=0)\n", "\n", "num_columns = [\"Open\", \"High\", \"Low\", \"Close\", \"Volume\"]\n", "\n", "# Определяем предобработку для численных данных\n", "num_imputer = SimpleImputer(strategy=\"median\")\n", "num_scaler = StandardScaler()\n", "preprocessing_num = Pipeline(\n", " [\n", " (\"imputer\", num_imputer),\n", " (\"scaler\", num_scaler),\n", " ]\n", ")\n", "\n", "# У категориальных данных нет, оставляем пустым\n", "cat_columns = []\n", "\n", "# Подготовка признаков с использованием ColumnTransformer\n", "features_preprocessing = ColumnTransformer(\n", " verbose_feature_names_out=False,\n", " transformers=[\n", " (\"preprocessing_num\", preprocessing_num, num_columns),\n", " ],\n", " remainder=\"passthrough\"\n", ")\n", "\n", "# Выделим целевую переменную\n", "y_train = y_train.values.reshape(-1, 1) # Убедимся, что y_train - это 2D массив\n", "\n", "# Создание окончательного конвейера\n", "pipeline = Pipeline(steps=[\n", " ('feature_engineering', StocksFeatures()),\n", " ('imputer', SimpleImputer(strategy='median')),\n", " ('scaler', StandardScaler())\n", "]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Применим конвейер\n" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Open High Low Close Volume Range\n", "0 -0.264188 -0.270892 -0.278887 -0.282602 1.986714 -0.037298\n", "1 -0.642218 -0.642866 -0.634425 -0.636104 -0.197546 -0.634720\n", "2 1.107219 1.082683 1.108998 1.076697 -0.247763 0.261413\n", "3 1.019012 0.991756 1.019043 0.982009 -0.832639 0.176067\n", "4 1.859078 1.886561 1.925023 1.939410 -0.404450 0.602796\n", "... ... ... ... ... ... ...\n", "4195 0.556976 0.584650 0.541422 0.599049 -0.242230 1.285564\n", "4196 -0.453203 -0.427948 -0.450230 -0.415165 0.539634 0.133394\n", "4197 -0.575013 -0.568471 -0.557320 -0.568770 0.299931 -0.634720\n", "4198 -0.280990 -0.281224 -0.261752 -0.278393 0.508014 -0.592047\n", "4199 -0.785029 -0.795789 -0.780067 -0.783396 -1.238542 -0.890757\n", "\n", "[4200 rows x 6 columns]\n" ] } ], "source": [ "# Применяем конвейер к X_train\n", "preprocessing_result = pipeline.fit_transform(X_train)\n", "\n", "# Формируем новый датафрейм с обработанными данными\n", "preprocessed_df = pd.DataFrame(\n", " preprocessing_result, \n", " columns=pipeline.get_feature_names_out(input_features=num_columns),\n", ")\n", "\n", "# Выводим обработанный датафрейм\n", "print(preprocessed_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Для начала разберемся с задачей регрессии:**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Обучение сводится к минимизации средней ошибки отклонения полученного\n", "целевого признака от реального значения целевого признака для всей выборки.\n", "\n", "Регрессия (аппроксимация).\n", "Получение значения из области значений целевого признака." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Выберем регрессонные модели, а именно:\n", "1. Рандомный лес\n", "2. Гребневая регрессия\n", "3. Градиентный бустинг\n", "\n", "Настроим гиперпараметры для каждой модели." ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\model_selection\\_validation.py:540: FitFailedWarning: \n", "27 fits failed out of a total of 54.\n", "The score on these train-test partitions for these parameters will be set to nan.\n", "If these failures are not expected, you can try to debug them by setting error_score='raise'.\n", "\n", "Below are more details about the failures:\n", "--------------------------------------------------------------------------------\n", "27 fits failed with the following error:\n", "Traceback (most recent call last):\n", " File \"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\model_selection\\_validation.py\", line 888, in _fit_and_score\n", " estimator.fit(X_train, y_train, **fit_params)\n", " File \"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py\", line 1466, in wrapper\n", " estimator._validate_params()\n", " File \"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py\", line 666, in _validate_params\n", " validate_parameter_constraints(\n", " File \"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\_param_validation.py\", line 95, in validate_parameter_constraints\n", " raise InvalidParameterError(\n", "sklearn.utils._param_validation.InvalidParameterError: The 'max_features' parameter of RandomForestRegressor must be an int in the range [1, inf), a float in the range (0.0, 1.0], a str among {'sqrt', 'log2'} or None. Got 'auto' instead.\n", "\n", " warnings.warn(some_fits_failed_message, FitFailedWarning)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\numpy\\ma\\core.py:2881: RuntimeWarning: invalid value encountered in cast\n", " _data = np.array(data, dtype=dtype, copy=copy,\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\model_selection\\_search.py:1103: UserWarning: One or more of the test scores are non-finite: [ nan nan nan -3.60510202e-05\n", " -3.51521700e-05 -3.55224330e-05 nan nan\n", " nan -5.23951976e-05 -5.06082610e-05 -5.35685939e-05\n", " nan nan nan -3.75406904e-05\n", " -3.50578578e-05 -3.44782270e-05]\n", " warnings.warn(\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n" ] } ], "source": [ "from sklearn.model_selection import GridSearchCV\n", "from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor\n", "from sklearn.linear_model import Ridge\n", "from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score\n", "from sklearn.pipeline import Pipeline\n", "import pandas as pd\n", "\n", "# 1. Настройка гиперпараметров для каждой модели\n", "# Random Forest hyperparameters\n", "rf_params = {\n", " 'n_estimators': [50, 100, 200],\n", " 'max_features': ['auto', 'sqrt'],\n", " 'max_depth': [None, 10, 20],\n", "}\n", "\n", "# Ridge hyperparameters\n", "ridge_params = {\n", " 'alpha': [0.1, 1.0, 10.0],\n", "}\n", "\n", "# Gradient Boosting hyperparameters\n", "gb_params = {\n", " 'n_estimators': [50, 100],\n", " 'learning_rate': [0.01, 0.1],\n", " 'max_depth': [3, 5],\n", "}\n", "\n", "# Curate a function for model training and evaluation\n", "def train_and_evaluate_model(model, param_grid, X_train, y_train):\n", " grid_search = GridSearchCV(model, param_grid, scoring='neg_mean_squared_error', cv=3)\n", " grid_search.fit(X_train, y_train)\n", " return grid_search.best_estimator_, grid_search.best_params_\n", "\n", "# Исходные данные после преобразования (Pipeline применения)\n", "X_train_transformed = pipeline.fit_transform(X_train)\n", "\n", "# Обучение моделей с подбором гиперпараметров\n", "rf_model, rf_params = train_and_evaluate_model(RandomForestRegressor(), rf_params, X_train_transformed, y_train)\n", "ridge_model, ridge_params = train_and_evaluate_model(Ridge(), ridge_params, X_train_transformed, y_train)\n", "gb_model, gb_params = train_and_evaluate_model(GradientBoostingRegressor(), gb_params, X_train_transformed, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Обучим модели на преобразованных данных:" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True) # TODO: Is this still required?\n" ] }, { "data": { "text/html": [ "
GradientBoostingRegressor(max_depth=5)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
" ], "text/plain": [ "GradientBoostingRegressor(max_depth=5)" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Обучение с использованием лучших моделей\n", "rf_model.fit(X_train_transformed, y_train)\n", "ridge_model.fit(X_train_transformed, y_train)\n", "gb_model.fit(X_train_transformed, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Оценка моделей:" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Model MAE MSE R2\n", "0 Random Forest 0.001764 0.000026 0.967882\n", "1 Ridge 0.010975 0.000271 0.663739\n", "2 Gradient Boosting 0.000987 0.000004 0.995060\n", "Лучшая модель: Gradient Boosting\n" ] } ], "source": [ "# Оценка качества и получение предсказаний для тестового набора\n", "X_test_transformed = pipeline.transform(X_test)\n", "\n", "models = [rf_model, ridge_model, gb_model]\n", "model_names = ['Random Forest', 'Ridge', 'Gradient Boosting']\n", "\n", "results = []\n", "for model, name in zip(models, model_names):\n", " predictions = model.predict(X_test_transformed)\n", " mae = mean_absolute_error(y_test, predictions)\n", " mse = mean_squared_error(y_test, predictions)\n", " r2 = r2_score(y_test, predictions)\n", " \n", " results.append({\n", " 'Model': name,\n", " 'MAE': mae,\n", " 'MSE': mse,\n", " 'R2': r2\n", " })\n", "\n", "results_df = pd.DataFrame(results)\n", "print(results_df)\n", "\n", "# Определение наилучшей модели по метрике R²\n", "best_model_info = results_df.loc[results_df['R2'].idxmax()]\n", "best_model_name = best_model_info['Model']\n", "best_model = models[results_df['R2'].idxmax()]\n", "\n", "print(f'Лучшая модель: {best_model_name}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Оценим смещение и дисперсию для модели с лучшей оценкой:" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Bias: 0.0014630506389881025, Variance: 0.0006564383445844139\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# Функция для оценки смещения и дисперсии\n", "def plot_bias_variance(model, X_train, y_train, X_test, y_test):\n", " # Предсказания на обучающей и тестовой выборках\n", " train_preds = model.predict(X_train)\n", " test_preds = model.predict(X_test)\n", "\n", " # Оценка смещения\n", " bias = np.mean((test_preds - y_test.to_numpy()) ** 2)\n", " variance = np.var(test_preds)\n", "\n", " print(f'Bias: {bias}, Variance: {variance}')\n", "\n", " # Визуализация предсказаний\n", " plt.figure(figsize=(10, 5))\n", " plt.scatter(y_test.to_numpy(), test_preds, label='Предсказания', alpha=0.7)\n", " plt.xlabel('Истинные значения')\n", " plt.ylabel('Предсказанные значения')\n", " plt.title('Сравнение истинных значений и предсказанных значений')\n", " plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)\n", " plt.legend()\n", " plt.show()\n", "\n", "# Пример использования\n", "plot_bias_variance(rf_model, X_train_transformed, y_train, X_test_transformed, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Таким образом в задачах регрессии в качестве оценок используются следующие метрики:**\n", "1. **Средняя квадратичная ошибка (англ. Mean Squared Error, MSE)**\n", "MSE применяется в ситуациях, когда нам надо подчеркнуть большие ошибки и выбрать модель, которая дает меньше больших ошибок прогноза. Грубые ошибки становятся заметнее за счет того, что ошибку прогноза мы возводим в квадрат. И модель, которая дает нам меньшее значение среднеквадратической ошибки, можно сказать, что что у этой модели меньше грубых ошибок.\n", "\n", "2. **Cредняя абсолютная ошибка (англ. Mean Absolute Error, MAE)**\n", "Среднеквадратичный функционал сильнее штрафует за большие отклонения по сравнению со среднеабсолютным, и поэтому более чувствителен к выбросам. При использовании любого из этих двух функционалов может быть полезно проанализировать, какие объекты вносят наибольший вклад в общую ошибку — не исключено, что на этих объектах была допущена ошибка при вычислении признаков или целевой величины.\n", "Среднеквадратичная ошибка подходит для сравнения двух моделей или для контроля качества во время обучения, но не позволяет сделать выводов о том, на сколько хорошо данная модель решает задачу. Например, MSE = 10 является очень плохим показателем, если целевая переменная принимает значения от 0 до 1, и очень хорошим, если целевая переменная лежит в интервале (10000, 100000). В таких ситуациях вместо среднеквадратичной ошибки полезно использовать коэффициент детерминации — R2\n", "\n", "3. **Коэффициент детерминации**\n", "Коэффициент детерминации измеряет долю дисперсии, объясненную моделью, в общей дисперсии целевой переменной. Фактически, данная мера качества — это нормированная среднеквадратичная ошибка. Если она близка к единице, то модель хорошо объясняет данные, если же она близка к нулю, то прогнозы сопоставимы по качеству с константным предсказанием.\n", "\n", "#### **Анализ Метрик:**\n", "1. Random Forest:\n", "* MAE: 0.001779\n", "* MSE: 0.000027\n", "* R²: 0.966258\n", "\n", "Random Forest демонстрирует хорошую производительность с высоким R², что указывает на то, что модель способна объяснить примерно 96.6% изменчивости в данных. Низкие значения MAE и MSE свидетельствуют о том, что предсказания модели близки к истинным значениям.\n", "\n", "2. Ridge:\n", "* MAE: 0.010975\n", "* MSE: 0.000271\n", "* R²: 0.663739\n", "\n", "Модель Ridge имеет более высокие значения MAE и MSE, чем Random Forest, что указывает на худшую точность предсказаний. Ее R² значительно ниже, всего 66.4%, что означает, что она объясняет лишь часть изменчивости данных. Это может значить то, что модель не улавливает все зависимости в данных о волатильности акций.\n", "\n", "3. Gradient Boosting:\n", "* MAE: 0.000988\n", "* MSE: 0.000004\n", "* R²: 0.995023\n", "\n", "Gradient Boosting показывает наилучшие результаты среди трех моделей. С наименьшими значениями MAE и MSE, эта модель обеспечивает точные предсказания. Высокий R² (99.5%) указывает на то, что она способна объяснить почти всю изменчивость в данных. Это критично в контексте бизнес-целей, связанных с финансовыми рынками, где точность предсказаний может существенно повлиять на принятие инвестиционных решений.\n", "\n", "#### **Вывод:**\n", "На основе представленных метрик Gradient Boosting является лучшей моделью для предсказания цен закрытия акций. Она имеет:\n", "\n", "* Наименьшее значение MAE, что говорит о меньшем среднем отклонении предсказанных значений от действительных.\n", "* Наименьшее значение MSE, что указывает на то, что крупные ошибки, которые могут иметь значительное влияние на торговые решения, минимизированы.\n", "* Наивысшее значение R², что подтверждает высокую степень объясняемости модели.\n", "\n", "* Bias (Смещение)\n", "Смещение измеряет, насколько предсказания модели отклоняются от истинных значений. Чем ниже смещение, тем лучше модель справляется с захватом истинной зависимости в данных.\n", "Низкое значение смещения(0.0014589819156387081) говорит о том, что модель хорошо предсказывает результаты для тестовых данных. Это означает, что ошибка, возникающая из-за того, что модель не может уловить истинные параметры данных, минимальна.\n", "* Variance (Дисперсия)\n", "Дисперсия измеряет, насколько предсказания модели изменяются при использовании различных обучающих наборов данных. Высокая дисперсия может указывать на переобучение модели, когда она слишком точно подстраивается под обучающие данные и теряет способность обобщать.\n", "Низкое значение дисперсии(0.0006523355896833815) также говорит о том, что модель делает стабильные предсказания, которые не сильно колеблются между разными выборками данных. Это свидетельствует о том, что модель, скорее всего, не переобучается.\n", "\n", "* К тому же, если мы посмотрим на график, то сможем сделать несколько выводов:\n", " 1. Высокая точность предсказаний\n", " Предсказанные значения близки к истинным значениям. Это указывает на то, что модель хорошо прогнозирует целевую переменную.\n", " 2. Небольшие отклонения\n", " Несмотря на хорошую схожесть между предсказанными и истинными значениями, видны некоторые небольшие отклонения. Это может указывать на определенные ошибки предсказания, но они незначительны.\n", " 3. Отсутствие сильного переобучения:\n", " Поскольку наблюдается равномерное распределение точек, нет признаков значительного переобучения, которое могло бы проявляться в виде разбросанных точек вдали от линии.\n", " 4. Поведение на высоких значениях:\n", " В то же время может быть интересно обратить внимание на точки, расположенные на уровнях около 0.25-0.30. Они несколько отклоняются от линии, это может свидетельствовать о том, что модель не идеально справляется с предсказанием на больших значениях.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Теперь разберемся с задачей классификации:**\n", "Классификация.\n", "Получение метки класса (выбор из конечного множества значений)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Выберем классификационные модели, а именно:\n", "1. Логистическая регрессия\n", "2. Наивный байесовский классификатор\n", "3. Дерево решений\n", "4. Метод K ближайших соседей\n", "\n", "Настроим гиперпараметры для каждой модели." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Создадим конвейер:**\n", "##### Конвейеры позволяют автоматизировать следующие процессы:\n", "1. Предобработка данных.\n", "2. Конструирование признаков.\n", "3. Понижение размерности признакового пространства.\n", "4. Обучение модели.\n", "\n", "\n", "##### Используемые конвейеры:\n", "1. preprocessing_num -- конвейер для обработки числовых данных: заполнение пропущенных значений и стандартизация\n", "\n", "2. preprocessing_cat -- конвейер для обработки категориальных данных: заполнение пропущенных данных и унитарное кодирование\n", "\n", "3. features_preprocessing -- трансформер для предобработки признаков\n", "\n", "4. features_engineering -- трансформер для конструирования признаков\n", "\n", "5. drop_columns -- трансформер для удаления колонок\n", "\n", "6. pipeline_end -- основной конвейер предобработки данных и конструирования признаков" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from sklearn.base import BaseEstimator, TransformerMixin\n", "from sklearn.compose import ColumnTransformer\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import OneHotEncoder\n", "from sklearn.ensemble import RandomForestRegressor # Пример регрессионной модели\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.pipeline import make_pipeline\n", "\n", "class StocksFeatures(BaseEstimator, TransformerMixin):\n", " def __init__(self):\n", " pass\n", " \n", " def fit(self, X, y=None):\n", " return self\n", "\n", " def transform(self, X, y=None):\n", " X[\"Range\"] = X[\"High\"] - X[\"Low\"]\n", " return X\n", "\n", " def get_feature_names_out(self, features_in):\n", " return np.append(features_in, [\"Range\"], axis=0)\n", " \n", "\n", "num_columns = [\"Open\", \"High\", \"Low\", \"Volume\", \"Volatility\"]\n", "\n", "# Определяем предобработку для численных данных\n", "num_imputer = SimpleImputer(strategy=\"median\")\n", "num_scaler = StandardScaler()\n", "preprocessing_num = Pipeline(\n", " [\n", " (\"imputer\", num_imputer),\n", " (\"scaler\", num_scaler),\n", " ]\n", ")\n", "\n", "# У категориальных данных нет, оставляем пустым\n", "cat_columns = []\n", "\n", "# Подготовка признаков с использованием ColumnTransformer\n", "features_preprocessing = ColumnTransformer(\n", " verbose_feature_names_out=False,\n", " transformers=[\n", " (\"preprocessing_num\", preprocessing_num, num_columns),\n", " ],\n", " remainder=\"passthrough\"\n", ")\n", "\n", "# Выделим целевую переменную\n", "#y_train_close = y_train_close.values.reshape(-1, 1) # Убедимся, что y_train - это 2D массив\n", "\n", "# Создание окончательного конвейера\n", "pipeline = Pipeline(steps=[\n", " ('feature_engineering', StocksFeatures()),\n", " ('imputer', SimpleImputer(strategy='median')),\n", " ('scaler', StandardScaler())\n", "]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Применим конвейер\n" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OpenHighLowVolumeVolatilityRange
0-0.264188-0.270892-0.2788871.9867140.288393-0.037298
1-0.642218-0.642866-0.634425-0.197546-0.295086-0.634720
21.1072191.0826831.108998-0.247763-0.4230810.261413
31.0190120.9917561.019043-0.832639-0.4474300.176067
41.8590781.8865611.925023-0.404450-0.4975140.602796
.....................
41950.5569760.5846500.541422-0.2422300.7194761.285564
4196-0.453203-0.427948-0.4502300.5396340.8075570.133394
4197-0.575013-0.568471-0.5573200.299931-0.381914-0.634720
4198-0.280990-0.281224-0.2617520.508014-0.576248-0.592047
4199-0.785029-0.795789-0.780067-1.238542-0.739469-0.890757
\n", "

4200 rows × 6 columns

\n", "
" ], "text/plain": [ " Open High Low Volume Volatility Range\n", "0 -0.264188 -0.270892 -0.278887 1.986714 0.288393 -0.037298\n", "1 -0.642218 -0.642866 -0.634425 -0.197546 -0.295086 -0.634720\n", "2 1.107219 1.082683 1.108998 -0.247763 -0.423081 0.261413\n", "3 1.019012 0.991756 1.019043 -0.832639 -0.447430 0.176067\n", "4 1.859078 1.886561 1.925023 -0.404450 -0.497514 0.602796\n", "... ... ... ... ... ... ...\n", "4195 0.556976 0.584650 0.541422 -0.242230 0.719476 1.285564\n", "4196 -0.453203 -0.427948 -0.450230 0.539634 0.807557 0.133394\n", "4197 -0.575013 -0.568471 -0.557320 0.299931 -0.381914 -0.634720\n", "4198 -0.280990 -0.281224 -0.261752 0.508014 -0.576248 -0.592047\n", "4199 -0.785029 -0.795789 -0.780067 -1.238542 -0.739469 -0.890757\n", "\n", "[4200 rows x 6 columns]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OpenHighLowVolumeVolatilityRange
01.4982411.5083361.545213-0.154202-0.4967250.326979
11.8821691.8670241.897223-0.360102-0.3649520.705871
2-1.013194-1.008735-1.0117180.2991990.593864-0.641300
30.9785610.9807311.024756-0.504559-0.677069-0.178210
42.0645872.0463672.102382-0.442899-0.6281830.326979
.....................
10460.5606950.5386270.551811-0.392167-0.2504070.116483
1047-0.754415-0.752231-0.750410-1.1553230.032753-0.557102
10480.4037310.4176750.439513-0.463214-0.446983-0.136111
10492.1345852.1485522.1304560.7779400.1511911.842550
1050-0.864411-0.874972-0.851601-1.183864-1.330298-1.062291
\n", "

1051 rows × 6 columns

\n", "
" ], "text/plain": [ " Open High Low Volume Volatility Range\n", "0 1.498241 1.508336 1.545213 -0.154202 -0.496725 0.326979\n", "1 1.882169 1.867024 1.897223 -0.360102 -0.364952 0.705871\n", "2 -1.013194 -1.008735 -1.011718 0.299199 0.593864 -0.641300\n", "3 0.978561 0.980731 1.024756 -0.504559 -0.677069 -0.178210\n", "4 2.064587 2.046367 2.102382 -0.442899 -0.628183 0.326979\n", "... ... ... ... ... ... ...\n", "1046 0.560695 0.538627 0.551811 -0.392167 -0.250407 0.116483\n", "1047 -0.754415 -0.752231 -0.750410 -1.155323 0.032753 -0.557102\n", "1048 0.403731 0.417675 0.439513 -0.463214 -0.446983 -0.136111\n", "1049 2.134585 2.148552 2.130456 0.777940 0.151191 1.842550\n", "1050 -0.864411 -0.874972 -0.851601 -1.183864 -1.330298 -1.062291\n", "\n", "[1051 rows x 6 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Применяем конвейер к X_train_close\n", "preprocessing_result = pipeline.fit_transform(X_train_close)\n", "\n", "# Формируем новый датафрейм с обработанными данными\n", "preprocessed_df = pd.DataFrame(\n", " preprocessing_result, \n", " columns=pipeline.get_feature_names_out(input_features=num_columns),\n", ")\n", "\n", "# Выводим обработанный датафрейм\n", "display(preprocessed_df)\n", "# Применяем конвейер к X_train_close\n", "preprocessing_result = pipeline.fit_transform(X_test_close)\n", "\n", "# Формируем новый датафрейм с обработанными данными\n", "preprocessed_df = pd.DataFrame(\n", " preprocessing_result, \n", " columns=pipeline.get_feature_names_out(input_features=num_columns),\n", ")\n", "\n", "# Выводим обработанный датафрейм\n", "display(preprocessed_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Настроим гиперпараметры для каждой модели." ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "# Определяем модели\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "# Определяем параметры для каждой модели\n", "param_grid = {\n", " 'LogisticRegression': {\n", " 'model': LogisticRegression(),\n", " 'params': {\n", " 'C': [0.01, 0.1, 1, 10],\n", " 'solver': ['liblinear']\n", " }\n", " },\n", " 'NaiveBayes': {\n", " 'model': GaussianNB(),\n", " 'params': {}\n", " },\n", " 'DecisionTree': {\n", " 'model': DecisionTreeClassifier(),\n", " 'params': {\n", " 'max_depth': [None, 5, 10, 20],\n", " 'min_samples_split': [2, 5, 10],\n", " }\n", " },\n", " 'KNeighbors': {\n", " 'model': KNeighborsClassifier(),\n", " 'params': {\n", " 'n_neighbors': [3, 5, 7, 10]\n", " }\n", " }\n", "}\n", "\n", "best_estimators = {}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Обучим модели и оценим:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Обучим модели при помощи кросс-валидации:" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1339: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\metrics\\_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n", " _warn_prf(average, modifier, f\"{metric.capitalize()} is\", len(result))\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\metrics\\_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n", " _warn_prf(average, modifier, f\"{metric.capitalize()} is\", len(result))\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\metrics\\_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n", " _warn_prf(average, modifier, f\"{metric.capitalize()} is\", len(result))\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\metrics\\_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n", " _warn_prf(average, modifier, f\"{metric.capitalize()} is\", len(result))\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\metrics\\_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n", " _warn_prf(average, modifier, f\"{metric.capitalize()} is\", len(result))\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\metrics\\_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n", " _warn_prf(average, modifier, f\"{metric.capitalize()} is\", len(result))\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n", "c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\neighbors\\_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n" ] } ], "source": [ "from sklearn.metrics import classification_report, cohen_kappa_score, confusion_matrix, matthews_corrcoef, roc_auc_score\n", "\n", "for model_name, mp in param_grid.items():\n", " grid_search = GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)\n", " grid_search.fit(X_train_close, y_train_close)\n", "\n", " best_estimators[model_name] = grid_search.best_estimator_\n", " y_pred_train = best_estimators[model_name].predict(X_train_close)\n", " y_pred_test = best_estimators[model_name].predict(X_test_close)\n", "\n", " # Сбор метрик\n", " report_train = classification_report(y_train_close, y_pred_train, output_dict=True)\n", " report_test = classification_report(y_test_close, y_pred_test, output_dict=True)\n", "\n", " roc_auc_test = roc_auc_score(y_test_close, best_estimators[model_name].predict_proba(X_test_close)[:, 1])\n", " cohen_kappa_test = cohen_kappa_score(y_test_close, y_pred_test)\n", " mcc_test = matthews_corrcoef(y_test_close, y_pred_test)\n", "\n", " # Сохранение результатов\n", " param_grid[model_name] = {\n", " \"Confusion_matrix\": confusion_matrix(y_test_close, y_pred_test),\n", " \"Precision_train\": report_train['1']['precision'],\n", " \"Recall_train\": report_train['1']['recall'],\n", " \"Accuracy_train\": report_train['accuracy'],\n", " \"F1_train\": report_train['1']['f1-score'],\n", " \"Precision_test\": report_test['1']['precision'],\n", " \"Recall_test\": report_test['1']['recall'],\n", " \"Accuracy_test\": report_test['accuracy'],\n", " \"F1_test\": report_test['1']['f1-score'],\n", " \"ROC_AUC_test\": roc_auc_test,\n", " \"Cohen_kappa_test\": cohen_kappa_test,\n", " \"MCC_test\": mcc_test,\n", " }" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Используем матрицу неточностей:" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn.metrics import ConfusionMatrixDisplay\n", "import matplotlib.pyplot as plt\n", "\n", "# Визуализация матриц\n", "_, ax = plt.subplots(int(len(best_estimators) / 2), 2, figsize=(8, 6), sharex=False, sharey=False)\n", "for index, key in enumerate(best_estimators.keys()):\n", " y_pred = best_estimators[key].predict(X_test_close)\n", " c_matrix = confusion_matrix(y_test_close, y_pred)\n", " disp = ConfusionMatrixDisplay(confusion_matrix=c_matrix, display_labels=[\"low\", \"high\"]).plot(ax=ax.flat[index])\n", " disp.ax_.set_title(key)\n", "\n", "plt.subplots_adjust(top=1, bottom=0, hspace=0.4, wspace=0.1)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Сделаем выводы относительно матрицы неточностей:**\n", "1. Logistic Regression\n", "* True Positives (TP): 634 (правильные предсказания для high)\n", "* True Negatives (TN): 0 (не было правильных предсказаний для low)\n", "* False Positives (FP): 294 (high предсказано, но на самом деле low)\n", "* False Negatives (FN): 123 (low предсказано, но на самом деле high)\n", "\n", "Вывод: Модель показывает высокую точность для прогноза high, но полностью игнорирует класс low. Это может вызвать серьезные проблемы, так как неверные прогнозы могут привести к значительным финансовым потерям.\n", "\n", "2. Naive Bayes\n", "* TP: 757 (правильные предсказания для high)\n", "* TN: 0\n", "* FP: 294\n", "* FN: 0\n", "\n", "Вывод: Модель предсказывает только high с высокой точностью, но также не распознает класс low. Это делает модель ненадежной в контексте предсказания как низких, так и высоких цен.\n", "\n", "3. Decision Tree\n", "* TP: 757\n", "* TN: 293\n", "* FP: 1\n", "* FN: 0\n", "\n", "Вывод: Модель хорошо справляется с предсказаниями для high, при этом включает небольшое количество неверных предсказаний для low (один FP). Справляется лучше в контексте выявления low цен, но все еще не идеально, так как не распознает FN.\n", "\n", "4. KNeighbors\n", "* TP: 612\n", "* TN: 145\n", "* FP: 132\n", "* FN: 162\n", "\n", "Вывод: Эта модель наиболее сбалансирована из всех представленных, показывая разумную точность для обоих классов. Она делает больше ошибок, но также учит и предсказывает low с некоторой эффективностью." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Точность, полнота, верность (аккуратность), F-мера:" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 Precision_trainPrecision_testRecall_trainRecall_testAccuracy_trainAccuracy_testF1_trainF1_test
DecisionTree0.9989920.9986810.9986561.0000000.9983330.9990490.9988240.999340
KNeighbors0.8338780.8225810.8568550.8084540.7776190.7364410.8452100.815456
NaiveBayes0.7085710.7202661.0000001.0000000.7085710.7202660.8294310.837389
LogisticRegression0.6771300.6831900.8625670.8375170.6111900.6032350.7586820.752522
\n" ], "text/plain": [ "" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class_metrics = pd.DataFrame.from_dict(param_grid, \"index\")[\n", " [\n", " \"Precision_train\",\n", " \"Precision_test\",\n", " \"Recall_train\",\n", " \"Recall_test\",\n", " \"Accuracy_train\",\n", " \"Accuracy_test\",\n", " \"F1_train\",\n", " \"F1_test\",\n", " ]\n", "]\n", "class_metrics.sort_values(\n", " by=\"Accuracy_test\", ascending=False\n", ").style.background_gradient(\n", " cmap=\"plasma\",\n", " low=0.3,\n", " high=1,\n", " subset=[\"Accuracy_train\", \"Accuracy_test\", \"F1_train\", \"F1_test\"],\n", ").background_gradient(\n", " cmap=\"viridis\",\n", " low=1,\n", " high=0.3,\n", " subset=[\n", " \"Precision_train\",\n", " \"Precision_test\",\n", " \"Recall_train\",\n", " \"Recall_test\",\n", " ],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Выводы:**\n", "1. Decision Tree (Решающие деревья)\n", "* Precision (Точность):\n", "Высокий уровень как на обучающем (0.99992), так и на тестовом (0.99868) наборах данных.\n", "* Recall (Полнота):\n", "Идеальный результат на обучающем наборе (1.00000), с незначительным падением на тесте (0.99865).\n", "* Accuracy (Точность):\n", "Очень высокая: 0.99833 на обучающем и 0.999049 на тестовом наборах.\n", "* F1 Score:\n", "Высокий на обоих наборах (0.998824 для тестового).\n", "\n", "Вывод: Деревья решений демонстрируют наилучшие результаты по всем метрикам. Они хорошо справляются как с обучающими, так и с тестовыми данными, и, вероятно, являются наилучшим выбором.\n", "\n", "2. K-Neighbors (Метод ближайших соседей)\n", "* Precision:\n", "Низкая точность как на обучающем (0.833878), так и на тестовом (0.822581) наборах.\n", "* Recall:\n", "Высокая полнота на обучающем (0.856855), но значительно ниже на тестовом (0.808454).\n", "* Accuracy:\n", "Умеренные результаты: 0.777619 на обучающем и 0.736441 на тестовом.\n", "* F1 Score:\n", "Умеренная производительность (0.815456 на тестовом).\n", "\n", "Вывод: Метод ближайших соседей показывает средние результаты. Хотя он имеет приемлемую полноту, точность значительно ниже, чем у деревьев решений.\n", "\n", "3. Naive Bayes (Наивный байесовский классификатор)\n", "* Precision:\n", "Низкая точность (0.708571 на обучающем, 0.720266 на тестовом).\n", "* Recall:\n", "Полнота идеально на обучающем (1.00000), но это может указывать на переобучение.\n", "* Accuracy:\n", "Точность на уровне 0.708571 на обучающем и 0.720266 на тестовом — значительно ниже, чем у лучших моделей.\n", "* F1 Score:\n", "Умеренные результаты (0.837389 на тестовом).\n", "\n", "Вывод: Наивный байесовский классификатор показывает проблемы с точностью, несмотря на хорошую полноту, что может указывать на его пригодность для задач, где много классов.\n", "\n", "4. Logistic Regression (Логистическая регрессия)\n", "* Precision:\n", "Низкая точность (0.677130 на обучающем и 0.683190 на тестовом).\n", "* Recall:\n", "Полнота также ниже (0.862567 на тестовом).\n", "* Accuracy:\n", "Совсем низкие значения: 0.611190 на обучающем и 0.603235 на тестовом.\n", "* F1 Score:\n", "Низкие значения (0.752522 на тестовом).\n", "\n", "Вывод: Логистическая регрессия демонстрирует наихудшие показатели по всем метрикам, что ставит под сомнение её применимость для данной задачи.\n", "Лучшая модель: Деревья решений являются наиболее эффективной моделью." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### ROC-кривая, каппа Коэна, коэффициент корреляции Мэтьюса:" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 Accuracy_testF1_testROC_AUC_testCohen_kappa_testMCC_test
DecisionTree0.9990490.9993400.9982950.9976360.997639
KNeighbors0.7364410.8154560.7699250.3546790.354842
NaiveBayes0.7202660.8373890.7114420.0000000.000000
LogisticRegression0.6032350.7525220.466647-0.197637-0.226884
\n" ], "text/plain": [ "" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class_metrics = pd.DataFrame.from_dict(param_grid, \"index\")[\n", " [\n", " \"Accuracy_test\",\n", " \"F1_test\",\n", " \"ROC_AUC_test\",\n", " \"Cohen_kappa_test\",\n", " \"MCC_test\",\n", " ]\n", "]\n", "class_metrics.sort_values(by=\"ROC_AUC_test\", ascending=False).style.background_gradient(\n", " cmap=\"plasma\",\n", " low=0.3,\n", " high=1,\n", " subset=[\n", " \"ROC_AUC_test\",\n", " \"MCC_test\",\n", " \"Cohen_kappa_test\",\n", " ],\n", ").background_gradient(\n", " cmap=\"viridis\",\n", " low=1,\n", " high=0.3,\n", " subset=[\n", " \"Accuracy_test\",\n", " \"F1_test\",\n", " ],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Выводы:**\n", "1. Decision Tree (Решающие деревья)\n", "* Accuracy (Точность): 0.999049\n", "* F1 Score: 0.999340\n", "* ROC AUC: 0.998295\n", "* Cohen's Kappa: 0.997636\n", "* MCC (Matthews Correlation Coefficient): 0.997639\n", "\n", "Вывод: Деревья решений показывают наилучшие результаты по всем метрикам. Высокие значения точности, F1 Score и ROC AUC указывают на то, что модель хорошо справляется как с классификацией, так и с предсказанием вероятностей. Это делает её наиболее подходящей для задачи прогнозирования.\n", "\n", "2. K-Neighbors (Метод ближайших соседей)\n", "* Accuracy: 0.736441\n", "* F1 Score: 0.815456\n", "* ROC AUC: 0.769925\n", "* Cohen's Kappa: 0.354679\n", "* MCC: 0.354842\n", "\n", "Вывод: Метод ближайших соседей демонстрирует средние результаты. Хотя F1 Score и ROC AUC указывают на относительно приемлемую степень точности, это всё же значительно уступает показателям деревьев решений. Учитывая цель, K-Neighbors может быть менее эффективным выбором.\n", "\n", "3. Naive Bayes (Наивный байесовский классификатор)\n", "* Accuracy: 0.720266\n", "* F1 Score: 0.837389\n", "* ROC AUC: 0.711442\n", "* Cohen's Kappa: 0.000000\n", "* MCC: 0.000000\n", "\n", "Вывод: Наивный байесовский классификатор показывает также средние результаты, но его Cohen's Kappa и MCC ровны нулю. Это свидетельствует о том, что модель может плохо предсказывать тренды. Следовательно, её применение может быть ограниченным.\n", "\n", "4. Logistic Regression (Логистическая регрессия)\n", "* Accuracy: 0.603235\n", "* F1 Score: 0.752522\n", "* ROC AUC: 0.466647\n", "* Cohen's Kappa: -0.197637\n", "* MCC: -0.226884\n", "\n", "Вывод: Логистическая регрессия показывает наихудшие результаты среди всех моделей. Низкие значения точности и ROC AUC указывают на ненадежность этой модели для прогнозирования цен акций, что делает её наименее подходящим вариантом для текущей бизнес-цели.\n", "\n", "На основе проведенного анализа, Decision Tree является наилучшим выбором для построения модели прогнозирования цены акций. Она демонстрирует высокую производительность по всем ключевым метрикам, что делает её наиболее надежной и эффективной для предсказания цен закрытия. " ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'DecisionTree'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "best_model = str(class_metrics.sort_values(by=\"MCC_test\", ascending=False).iloc[0].name)\n", "\n", "display(best_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Визуализация ROC-кривой" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "from sklearn.metrics import auc, roc_curve\n", "\n", "# Инициализация словаря для хранения результатов\n", "results = {}\n", "\n", "# После подбора модели\n", "for model_name in best_estimators.keys():\n", " # Получаем вероятности для положительного класса\n", " y_scores = best_estimators[model_name].predict_proba(X_test_close)[:, 1]\n", " fpr, tpr, _ = roc_curve(y_test_close, y_scores)\n", " roc_auc = auc(fpr, tpr)\n", "\n", " # Сохраняем полученные значения в словаре results\n", " results[model_name] = {\n", " 'fpr': fpr,\n", " 'tpr': tpr,\n", " 'roc_auc': roc_auc\n", " }\n", "\n", "# Визуализация ROC-кривой\n", "plt.figure(figsize=(10, 8))\n", "for model_name, metrics in results.items():\n", " plt.plot(metrics['fpr'], metrics['tpr'], lw=2, label=f'{model_name} (AUC = {metrics[\"roc_auc\"]:.2f})')\n", "\n", "# Диагональная линия глухого классификатора\n", "plt.plot([0, 1], [0, 1], 'k--', lw=2)\n", "\n", "plt.xlim([0.0, 1.0])\n", "plt.ylim([0.0, 1.05])\n", "plt.xlabel('False Positive Rate')\n", "plt.ylabel('True Positive Rate')\n", "plt.title('Receiver Operating Characteristic')\n", "plt.legend(loc='lower right')\n", "plt.grid()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ROC (Receiver Operating Characteristic) кривая — это график, используемый для оценки производительности классификаторов. Она отображает соотношение между двумя показателями:\n", "\n", "True Positive Rate (TPR), также известная как чувствительность или полнота — доля верных положительных результатов среди всех положительных примеров.\n", "False Positive Rate (FPR) — доля ложных положительных результатов среди всех отрицательных примеров.\n", "ROC-кривая и AUC\n", "ROC-кривая строится путем отображения TPR против FPR при разных порогах классификации. Площадь под ROC-кривой (AUC - Area Under the Curve) служит одной из основных метрик для оценки качества классификатора:\n", "\n", "AUC = 1: Модель идеально классифицирует все примеры.\n", "AUC = 0.5: Модель не лучше случайного угадывания.\n", "AUC < 0.5: Модель показывает худшие результаты, чем случайный угадыватель.\n", "\n", "**Анализ получившейся ROC-кривой:**\n", "* Decision Tree (зеленая линия): AUC равен 1, что указывает на отличную производительность модели. Она идеально разделяет положительные и отрицательные классы.\n", "* KNeighbors (синяя линия): AUC равен 0.77. Эта модель показывает хорошую производительность, но не так идеальна, как дерево решений.\n", "* Naive Bayes (оранжевая линия): AUC равен 0.71. Модель демонстрирует средние результаты, но имеет значительные недостатки по сравнению с деревом решений.\n", "* Logistic Regression (красная линия): AUC равен 0.47, что говорит о том, что модель практически неэффективна и хуже случайного классификатора.\n", "\n", "**Общий вывод**\n", "Модель дерева решений выделяется на фоне других, обеспечивая высокую точность. Это делает её наиболее предпочтительным вариантом для бизнес-прогнозирования. Остальные модели показывают более скромные результаты и могут быть менее надежными." ] } ], "metadata": { "kernelspec": { "display_name": "aimenv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.5" } }, "nbformat": 4, "nbformat_minor": 2 }