1899 lines
487 KiB
Plaintext
1899 lines
487 KiB
Plaintext
|
{
|
|||
|
"cells": [
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Датасет: Цены на акции\n",
|
|||
|
"https://www.kaggle.com/datasets/nancyalaswad90/yamana-gold-inc-stock-Volume\n",
|
|||
|
"##### О наборе данных: \n",
|
|||
|
"Yamana Gold Inc. — это канадская компания, которая занимается разработкой и управлением золотыми, серебряными и медными рудниками, расположенными в Канаде, Чили, Бразилии и Аргентине. Головной офис компании находится в Торонто.\n",
|
|||
|
"\n",
|
|||
|
"Yamana Gold была основана в 1994 году и уже через год была зарегистрирована на фондовой бирже Торонто. В 2007 году она стала участником Нью-Йоркской фондовой биржи, а в 2020 году — Лондонской.\n",
|
|||
|
"В 2003 году компания претерпела значительные изменения: была проведена реструктуризация, в результате которой Питер Марроне занял пост главного исполнительного директора. Кроме того, Yamana объединилась с бразильской компанией Santa Elina Mines Corporation. Благодаря этому слиянию Yamana получила доступ к капиталу, накопленному Santa Elina, что позволило ей начать разработку и эксплуатацию рудника Чапада. Затем компания объединилась с другими организациями, зарегистрированными на бирже TSX: RNC Gold, Desert Sun Mining, Viceroy Exploration, Northern Orion Resources, Meridian Gold, Osisko Mining и Extorre Gold Mines. Каждая из них внесла свой вклад в разработку месторождения или проект, который в итоге был успешно запущен.\n",
|
|||
|
"##### Таким образом:\n",
|
|||
|
"* Объект наблюдения - цены и объемы акций компании\n",
|
|||
|
"* Атрибуты: 'Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'\n",
|
|||
|
"\n",
|
|||
|
"##### Бизнес цели:\n",
|
|||
|
"* Прогнозирование будущей цены акций.\n",
|
|||
|
" Использование данных для создания модели, которая будет предсказывать цену акций компании в будущем.\n",
|
|||
|
"* Определение волатильности акций.\n",
|
|||
|
" Определение, колебаний цен акций, что поможет инвесторам понять риски.\n",
|
|||
|
"\n",
|
|||
|
"##### Технические цели:\n",
|
|||
|
"* Разработать модель машинного обучения для прогноза цены акций на основе имеющихся данных.\n",
|
|||
|
"* Разработать метрику и модель для оценки волатильности акций на основе исторических данных."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 31,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Количество колонок: 7\n",
|
|||
|
"Колонки: Date, Open, High, Low, Close, Adj Close, Volume\n",
|
|||
|
"\n",
|
|||
|
"<class 'pandas.core.frame.DataFrame'>\n",
|
|||
|
"RangeIndex: 5251 entries, 0 to 5250\n",
|
|||
|
"Data columns (total 7 columns):\n",
|
|||
|
" # Column Non-Null Count Dtype \n",
|
|||
|
"--- ------ -------------- ----- \n",
|
|||
|
" 0 Date 5251 non-null datetime64[ns]\n",
|
|||
|
" 1 Open 5251 non-null float64 \n",
|
|||
|
" 2 High 5251 non-null float64 \n",
|
|||
|
" 3 Low 5251 non-null float64 \n",
|
|||
|
" 4 Close 5251 non-null float64 \n",
|
|||
|
" 5 Adj Close 5251 non-null float64 \n",
|
|||
|
" 6 Volume 5251 non-null int64 \n",
|
|||
|
"dtypes: datetime64[ns](1), float64(5), int64(1)\n",
|
|||
|
"memory usage: 287.3 KB\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>Date</th>\n",
|
|||
|
" <th>Open</th>\n",
|
|||
|
" <th>High</th>\n",
|
|||
|
" <th>Low</th>\n",
|
|||
|
" <th>Close</th>\n",
|
|||
|
" <th>Adj Close</th>\n",
|
|||
|
" <th>Volume</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>2001-06-22</td>\n",
|
|||
|
" <td>3.428571</td>\n",
|
|||
|
" <td>3.428571</td>\n",
|
|||
|
" <td>3.428571</td>\n",
|
|||
|
" <td>3.428571</td>\n",
|
|||
|
" <td>2.806002</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>2001-06-25</td>\n",
|
|||
|
" <td>3.428571</td>\n",
|
|||
|
" <td>3.428571</td>\n",
|
|||
|
" <td>3.428571</td>\n",
|
|||
|
" <td>3.428571</td>\n",
|
|||
|
" <td>2.806002</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>2001-06-26</td>\n",
|
|||
|
" <td>3.714286</td>\n",
|
|||
|
" <td>3.714286</td>\n",
|
|||
|
" <td>3.714286</td>\n",
|
|||
|
" <td>3.714286</td>\n",
|
|||
|
" <td>3.039837</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>2001-06-27</td>\n",
|
|||
|
" <td>3.714286</td>\n",
|
|||
|
" <td>3.714286</td>\n",
|
|||
|
" <td>3.714286</td>\n",
|
|||
|
" <td>3.714286</td>\n",
|
|||
|
" <td>3.039837</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>2001-06-28</td>\n",
|
|||
|
" <td>3.714286</td>\n",
|
|||
|
" <td>3.714286</td>\n",
|
|||
|
" <td>3.714286</td>\n",
|
|||
|
" <td>3.714286</td>\n",
|
|||
|
" <td>3.039837</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" Date Open High Low Close Adj Close Volume\n",
|
|||
|
"0 2001-06-22 3.428571 3.428571 3.428571 3.428571 2.806002 0\n",
|
|||
|
"1 2001-06-25 3.428571 3.428571 3.428571 3.428571 2.806002 0\n",
|
|||
|
"2 2001-06-26 3.714286 3.714286 3.714286 3.714286 3.039837 0\n",
|
|||
|
"3 2001-06-27 3.714286 3.714286 3.714286 3.714286 3.039837 0\n",
|
|||
|
"4 2001-06-28 3.714286 3.714286 3.714286 3.714286 3.039837 0"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 31,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"df = pd.read_csv(\".//static//csv//Stocks.csv\", sep=\",\")\n",
|
|||
|
"print('Количество колонок: ' + str(df.columns.size)) \n",
|
|||
|
"print('Колонки: ' + ', '.join(df.columns)+'\\n')\n",
|
|||
|
"df['Date'] = pd.to_datetime(df['Date'], errors='coerce')\n",
|
|||
|
"\n",
|
|||
|
"df.info()\n",
|
|||
|
"df.head()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Подготовка данных:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"#### 1. Получение сведений о пропущенных данных\n",
|
|||
|
"Типы пропущенных данных:\n",
|
|||
|
"\n",
|
|||
|
"- None - представление пустых данных в Python\n",
|
|||
|
"- NaN - представление пустых данных в Pandas\n",
|
|||
|
"- '' - пустая строка"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 32,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Date 0\n",
|
|||
|
"Open 0\n",
|
|||
|
"High 0\n",
|
|||
|
"Low 0\n",
|
|||
|
"Close 0\n",
|
|||
|
"Adj Close 0\n",
|
|||
|
"Volume 0\n",
|
|||
|
"dtype: int64\n",
|
|||
|
"\n",
|
|||
|
"Date False\n",
|
|||
|
"Open False\n",
|
|||
|
"High False\n",
|
|||
|
"Low False\n",
|
|||
|
"Close False\n",
|
|||
|
"Adj Close False\n",
|
|||
|
"Volume False\n",
|
|||
|
"dtype: bool\n",
|
|||
|
"\n",
|
|||
|
"Количество бесконечных значений в каждом столбце:\n",
|
|||
|
"Date 0\n",
|
|||
|
"Open 0\n",
|
|||
|
"High 0\n",
|
|||
|
"Low 0\n",
|
|||
|
"Close 0\n",
|
|||
|
"Adj Close 0\n",
|
|||
|
"Volume 0\n",
|
|||
|
"dtype: int64\n",
|
|||
|
"Date процент пустых значений: %0.00\n",
|
|||
|
"Open процент пустых значений: %0.00\n",
|
|||
|
"High процент пустых значений: %0.00\n",
|
|||
|
"Low процент пустых значений: %0.00\n",
|
|||
|
"Close процент пустых значений: %0.00\n",
|
|||
|
"Adj Close процент пустых значений: %0.00\n",
|
|||
|
"Volume процент пустых значений: %0.00\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import numpy as np\n",
|
|||
|
"# Количество пустых значений признаков\n",
|
|||
|
"print(df.isnull().sum())\n",
|
|||
|
"print()\n",
|
|||
|
"\n",
|
|||
|
"# Есть ли пустые значения признаков\n",
|
|||
|
"print(df.isnull().any())\n",
|
|||
|
"print()\n",
|
|||
|
"\n",
|
|||
|
"# Проверка на бесконечные значения\n",
|
|||
|
"print(\"Количество бесконечных значений в каждом столбце:\")\n",
|
|||
|
"print(np.isinf(df).sum())\n",
|
|||
|
"\n",
|
|||
|
"# Процент пустых значений признаков\n",
|
|||
|
"for i in df.columns:\n",
|
|||
|
" null_rate = df[i].isnull().sum() / len(df) * 100\n",
|
|||
|
" print(f\"{i} процент пустых значений: %{null_rate:.2f}\")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Таким образом, пропущенных значений не найдено."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"#### 2. Проверка выбросов данных и устранение их при наличии:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 33,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"До устранения выбросов:\n",
|
|||
|
"Колонка Open:\n",
|
|||
|
" Есть выбросы: Нет\n",
|
|||
|
" Количество выбросов: 0\n",
|
|||
|
" Минимальное значение: 1.142857\n",
|
|||
|
" Максимальное значение: 20.42\n",
|
|||
|
" 1-й квартиль (Q1): 2.857143\n",
|
|||
|
" 3-й квартиль (Q3): 10.65\n",
|
|||
|
"\n",
|
|||
|
"После устранения выбросов:\n",
|
|||
|
"Колонка Open:\n",
|
|||
|
" Есть выбросы: Нет\n",
|
|||
|
" Количество выбросов: 0\n",
|
|||
|
" Минимальное значение: 1.142857\n",
|
|||
|
" Максимальное значение: 20.42\n",
|
|||
|
" 1-й квартиль (Q1): 2.857143\n",
|
|||
|
" 3-й квартиль (Q3): 10.65\n",
|
|||
|
"\n",
|
|||
|
"До устранения выбросов:\n",
|
|||
|
"Колонка High:\n",
|
|||
|
" Есть выбросы: Нет\n",
|
|||
|
" Количество выбросов: 0\n",
|
|||
|
" Минимальное значение: 1.142857\n",
|
|||
|
" Максимальное значение: 20.59\n",
|
|||
|
" 1-й квартиль (Q1): 2.88\n",
|
|||
|
" 3-й квартиль (Q3): 10.86\n",
|
|||
|
"\n",
|
|||
|
"После устранения выбросов:\n",
|
|||
|
"Колонка High:\n",
|
|||
|
" Есть выбросы: Нет\n",
|
|||
|
" Количество выбросов: 0\n",
|
|||
|
" Минимальное значение: 1.142857\n",
|
|||
|
" Максимальное значение: 20.59\n",
|
|||
|
" 1-й квартиль (Q1): 2.88\n",
|
|||
|
" 3-й квартиль (Q3): 10.86\n",
|
|||
|
"\n",
|
|||
|
"До устранения выбросов:\n",
|
|||
|
"Колонка Low:\n",
|
|||
|
" Есть выбросы: Нет\n",
|
|||
|
" Количество выбросов: 0\n",
|
|||
|
" Минимальное значение: 1.142857\n",
|
|||
|
" Максимальное значение: 20.09\n",
|
|||
|
" 1-й квартиль (Q1): 2.81\n",
|
|||
|
" 3-й квартиль (Q3): 10.425\n",
|
|||
|
"\n",
|
|||
|
"После устранения выбросов:\n",
|
|||
|
"Колонка Low:\n",
|
|||
|
" Есть выбросы: Нет\n",
|
|||
|
" Количество выбросов: 0\n",
|
|||
|
" Минимальное значение: 1.142857\n",
|
|||
|
" Максимальное значение: 20.09\n",
|
|||
|
" 1-й квартиль (Q1): 2.81\n",
|
|||
|
" 3-й квартиль (Q3): 10.425\n",
|
|||
|
"\n",
|
|||
|
"До устранения выбросов:\n",
|
|||
|
"Колонка Close:\n",
|
|||
|
" Есть выбросы: Нет\n",
|
|||
|
" Количество выбросов: 0\n",
|
|||
|
" Минимальное значение: 1.142857\n",
|
|||
|
" Максимальное значение: 20.389999\n",
|
|||
|
" 1-й квартиль (Q1): 2.857143\n",
|
|||
|
" 3-й квартиль (Q3): 10.64\n",
|
|||
|
"\n",
|
|||
|
"После устранения выбросов:\n",
|
|||
|
"Колонка Close:\n",
|
|||
|
" Есть выбросы: Нет\n",
|
|||
|
" Количество выбросов: 0\n",
|
|||
|
" Минимальное значение: 1.142857\n",
|
|||
|
" Максимальное значение: 20.389999\n",
|
|||
|
" 1-й квартиль (Q1): 2.857143\n",
|
|||
|
" 3-й квартиль (Q3): 10.64\n",
|
|||
|
"\n",
|
|||
|
"До устранения выбросов:\n",
|
|||
|
"Колонка Adj Close:\n",
|
|||
|
" Есть выбросы: Нет\n",
|
|||
|
" Количество выбросов: 0\n",
|
|||
|
" Минимальное значение: 0.935334\n",
|
|||
|
" Максимальное значение: 17.543156\n",
|
|||
|
" 1-й квартиль (Q1): 2.537094\n",
|
|||
|
" 3-й квартиль (Q3): 8.951944999999998\n",
|
|||
|
"\n",
|
|||
|
"После устранения выбросов:\n",
|
|||
|
"Колонка Adj Close:\n",
|
|||
|
" Есть выбросы: Нет\n",
|
|||
|
" Количество выбросов: 0\n",
|
|||
|
" Минимальное значение: 0.935334\n",
|
|||
|
" Максимальное значение: 17.543156\n",
|
|||
|
" 1-й квартиль (Q1): 2.537094\n",
|
|||
|
" 3-й квартиль (Q3): 8.951944999999998\n",
|
|||
|
"\n",
|
|||
|
"До устранения выбросов:\n",
|
|||
|
"Колонка Volume:\n",
|
|||
|
" Есть выбросы: Да\n",
|
|||
|
" Количество выбросов: 95\n",
|
|||
|
" Минимальное значение: 0\n",
|
|||
|
" Максимальное значение: 76714000\n",
|
|||
|
" 1-й квартиль (Q1): 2845900.0\n",
|
|||
|
" 3-й квартиль (Q3): 13272450.0\n",
|
|||
|
"\n",
|
|||
|
"После устранения выбросов:\n",
|
|||
|
"Колонка Volume:\n",
|
|||
|
" Есть выбросы: Нет\n",
|
|||
|
" Количество выбросов: 0\n",
|
|||
|
" Минимальное значение: 0.0\n",
|
|||
|
" Максимальное значение: 28912275.0\n",
|
|||
|
" 1-й квартиль (Q1): 2845900.0\n",
|
|||
|
" 3-й квартиль (Q3): 13272450.0\n",
|
|||
|
"\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"numeric_columns = ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']\n",
|
|||
|
"\n",
|
|||
|
"for column in numeric_columns:\n",
|
|||
|
" if pd.api.types.is_numeric_dtype(df[column]): # Проверяем, является ли колонка числовой\n",
|
|||
|
" q1 = df[column].quantile(0.25) # Находим 1-й квартиль (Q1)\n",
|
|||
|
" q3 = df[column].quantile(0.75) # Находим 3-й квартиль (Q3)\n",
|
|||
|
" iqr = q3 - q1 # Вычисляем межквартильный размах (IQR)\n",
|
|||
|
"\n",
|
|||
|
" # Определяем границы для выбросов\n",
|
|||
|
" lower_bound = q1 - 1.5 * iqr # Нижняя граница\n",
|
|||
|
" upper_bound = q3 + 1.5 * iqr # Верхняя граница\n",
|
|||
|
"\n",
|
|||
|
" # Подсчитываем количество выбросов\n",
|
|||
|
" outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]\n",
|
|||
|
" outlier_count = outliers.shape[0]\n",
|
|||
|
"\n",
|
|||
|
" print(\"До устранения выбросов:\")\n",
|
|||
|
" print(f\"Колонка {column}:\")\n",
|
|||
|
" print(f\" Есть выбросы: {'Да' if outlier_count > 0 else 'Нет'}\")\n",
|
|||
|
" print(f\" Количество выбросов: {outlier_count}\")\n",
|
|||
|
" print(f\" Минимальное значение: {df[column].min()}\")\n",
|
|||
|
" print(f\" Максимальное значение: {df[column].max()}\")\n",
|
|||
|
" print(f\" 1-й квартиль (Q1): {q1}\")\n",
|
|||
|
" print(f\" 3-й квартиль (Q3): {q3}\\n\")\n",
|
|||
|
"\n",
|
|||
|
" # Устраняем выбросы: заменяем значения ниже нижней границы на саму нижнюю границу, а выше верхней — на верхнюю\n",
|
|||
|
" if outlier_count != 0:\n",
|
|||
|
" df[column] = df[column].apply(lambda x: lower_bound if x < lower_bound else upper_bound if x > upper_bound else x)\n",
|
|||
|
" \n",
|
|||
|
" # Подсчитываем количество выбросов\n",
|
|||
|
" outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]\n",
|
|||
|
" outlier_count = outliers.shape[0]\n",
|
|||
|
"\n",
|
|||
|
" print(\"После устранения выбросов:\")\n",
|
|||
|
" print(f\"Колонка {column}:\")\n",
|
|||
|
" print(f\" Есть выбросы: {'Да' if outlier_count > 0 else 'Нет'}\")\n",
|
|||
|
" print(f\" Количество выбросов: {outlier_count}\")\n",
|
|||
|
" print(f\" Минимальное значение: {df[column].min()}\")\n",
|
|||
|
" print(f\" Максимальное значение: {df[column].max()}\")\n",
|
|||
|
" print(f\" 1-й квартиль (Q1): {q1}\")\n",
|
|||
|
" print(f\" 3-й квартиль (Q3): {q3}\\n\")\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Выбросы присутствовали, но мы их устранили."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Разбиение на выборки:\n",
|
|||
|
"\n",
|
|||
|
"Разобьем наш набор на обучающую, контрольную и тестовую выборки для устранения проблемы просачивания данных."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 34,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Размер обучающей выборки: 4200\n",
|
|||
|
"Размер контрольной выборки: 1051\n",
|
|||
|
"Размер тестовой выборки: 1051\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"\n",
|
|||
|
"# Разделение данных на обучающую и тестовую выборки (80% - обучение, 20% - тестовая)\n",
|
|||
|
"X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Разделение данных на обучающую и контрольную выборки (80% - обучение, 20% - контроль)\n",
|
|||
|
"X_train, X_val = train_test_split(df, test_size=0.2, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"print(\"Размер обучающей выборки: \", len(X_train))\n",
|
|||
|
"print(\"Размер контрольной выборки: \", len(X_test))\n",
|
|||
|
"print(\"Размер тестовой выборки: \", len(X_val))\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 35,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA+0AAAIjCAYAAAB20vpjAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABeK0lEQVR4nO3dd3gU1eL/8U8CaRASSkgDAqGGXiIlIEVAQlEEohRBQVC8EJSmIipdRUClKIJ4KRYUwYKKgoSqYqgCgpSL3mBQkuDCTQKkQub3B9/sjyUJkJDsTuD9ep59ZGfOnHNmZmfjZ2fmjJNhGIYAAAAAAIDpODu6AwAAAAAAIHeEdgAAAAAATIrQDgAAAACASRHaAQAAAAAwKUI7AAAAAAAmRWgHAAAAAMCkCO0AAAAAAJgUoR0AAAAAAJMq6egOAAAAFHdpaWk6d+6cSpYsKV9fX0d3BwBwG+FMOwAAKHZGjhype++916F92LRpk3r27KmyZcvKw8NDlSpV0ujRox3aJxSuDRs2yNPTU//884+juwLgDkZoB4AisGLFCjk5OVlf7u7uql27tkaNGqWEhARHdw8o1mJiYvTvf/9bL7zwgsP68M477yg8PFxJSUmaP3++oqKiFBUVpenTpzusTyh8Xbt2Vc2aNTVz5kxHdwXAHczJMAzD0Z0AgNvNihUr9Nhjj2n69OkKDg5WWlqafvrpJ3344YeqWrWqDh8+rFKlSjm6m0CxNGbMGK1fv17Hjx93SPsnTpxQw4YN9dhjj+mdd96Rk5OTQ/oB+1i0aJGeeeYZxcfHq0yZMo7uDoA7EGfaAaAIdevWTYMGDdLjjz+uFStWaMyYMYqJidFXX33l6K4BxVJmZqZWrlypvn37OqwPCxYskL+/vxYsWEBgvwNEREQoPT1da9ascXRXANyhCO0AYEcdO3aUdOXyXkk6d+6cnnnmGTVs2FCenp7y8vJSt27ddPDgwRzLpqWlaerUqapdu7bc3d0VEBCgPn366I8//pAknTx50uaS/GtfHTp0sNa1bds2OTk56dNPP9ULL7wgf39/lS5dWj179tSpU6dytL1r1y517dpV3t7eKlWqlNq3b68dO3bkuo4dOnTItf2pU6fmKPvRRx8pNDRUHh4eKl++vPr3759r+9dbt6tlZWVp3rx5ql+/vtzd3eXn56cnn3xS//vf/2zKVatWTffdd1+OdkaNGpWjztz6PmfOnBzbVJLS09M1ZcoU1axZU25ubqpSpYqee+45paen57qtrtahQ4cc9b3yyitydnbWxx9/XKDt8frrr6t169aqUKGCPDw8FBoaqs8++yzX9j/66CO1aNFCpUqVUrly5dSuXTtt3LjRpsz69evVvn17lSlTRl5eXmrevHmOvq1Zs8a6T318fDRo0CD9/fffNmWGDBli0+dy5cqpQ4cO+vHHH2+4nX766SdZLBZ17tw51/lTp0694ec/P33Nzc6dOxUaGqqRI0fKz89Pbm5uatCggd57770cZfOzD669rSa3vt/sd0b2MZ5bW56enhoyZIjNtMTERI0dO1bVqlWTm5ubKleurEcffVQWi8Wmvm3bttks16NHjxzHSPY+OHbsmPr27SsvLy9VqFBBo0ePVlpams3yly5d0owZM1SjRg25ubmpWrVqeuGFF3IcM9WqVbNuD2dnZ/n7+6tfv36KjY21lsk+LlasWGGddv78eYWGhio4OFhxcXF5lpOkyMhIOTk55dg2vr6+atSoET+2AnAYRo8HADvKDtgVKlSQJP33v//V2rVr9dBDDyk4OFgJCQl699131b59ex05ckSBgYGSpMuXL+u+++7T5s2b1b9/f40ePVrnz59XVFSUDh8+rBo1aljbGDBggLp3727T7sSJE3PtzyuvvCInJydNmDBBZ86c0bx589S5c2cdOHBAHh4ekqQtW7aoW7duCg0N1ZQpU+Ts7Kzly5erY8eO+vHHH9WiRYsc9VauXNl6D+iFCxc0YsSIXNueNGmS+vbtq8cff1z//POP3nrrLbVr10779+9X2bJlcywzfPhwtW3bVpL0xRdf6Msvv7SZ/+STT1pvTXj66acVExOjt99+W/v379eOHTvk4uKS63bIj8TExFzvb83KylLPnj31008/afjw4apbt64OHTqkuXPn6j//+Y/Wrl2br3aWL1+ul156SW+88YYefvjhXMvcaHvMnz9fPXv21MCBA5WRkaFVq1bpoYce0rp169SjRw9ruWnTpmnq1Klq3bq1pk+fLldXV+3atUtbtmxRly5dJF0JlEOHDlX9+vU1ceJElS1bVvv379eGDRus/cve9s2bN9fMmTOVkJCg+fPna8eOHTn2qY+Pj+bOnStJ+uuvvzR//nx1795dp06dynXfZ/v555/l5OSkpk2bXnf7LVq0SJ6enpJy//znp6/XOnv2rPbu3auSJUsqMjJSNWrU0Nq1azV8+HCdPXtWzz//fL73wdXmzp0rHx8fSVeOk6vd7HdGfly4cEFt27bV0aNHNXToUDVr1kwWi0Vff/21/vrrL2tfrvXDDz/ou+++y7Pevn37qlq1apo5c6Z27typBQsW6H//+58++OADa5nHH39c77//vh588EGNHz9eu3bt0syZM3X06NEcn+e2bdtq+PDhysrK0uHDhzVv3jydPn06zx97MjMzFRERodjYWO3YsUMBAQF59vX333/P9UeXbKGhofk+hgGg0BgAgEK3fPlyQ5KxadMm459//jFOnTplrFq1yqhQoYLh4eFh/PXXX4ZhGEZaWppx+fJlm2VjYmIMNzc3Y/r06dZpy5YtMyQZb775Zo62srKyrMtJMubMmZOjTP369Y327dtb32/dutWQZFSqVMlITk62Tl+9erUhyZg/f7617lq1ahnh4eHWdgzDMFJSUozg4GDj3nvvzdFW69atjQYNGljf//PPP4YkY8qUKdZpJ0+eNEqUKGG88sorNsseOnTIKFmyZI7pJ06cMCQZ77//vnXalClTjKv/jP3444+GJGPlypU2y27YsCHH9KpVqxo9evTI0ffIyEjj2j+N1/b9ueeeM3x9fY3Q0FCbbfrhhx8azs7Oxo8//miz/OLFiw1Jxo4dO3K0d7X27dtb6/v222+NkiVLGuPHj8+17M1sD8O4sp+ulpGRYTRo0MDo2LGjTV3Ozs5G7969c3wWs/d5YmKiUaZMGaNly5ZGampqrmUyMjIMX19fo0GDBjZl1q1bZ0gyJk+ebJ02ePBgo2rVqjb1LFmyxJBk7N69O9d1zjZo0CCjQoUKec5/4YUXDEmGxWKxTrv285+fvuamatWqhiRjxYoV1mmXLl0yOnXqZLi5udm0fTP7INt7771nSDL+/PNP67SrPxeGcfPfGdnH+Jo1a3K0U7p0aWPw4MHW95MnTzYkGV988UWOstn7N7u+rVu3Wue1bNnS6NatW45jJPuz2LNnT5u6Ro4caUgyDh48aBiGYRw4cMCQZDz++OM25Z555hlDkrFlyxbrtKpVq9r02TAM4+GHHzZKlSplsx0kGcuXLzeysrKMgQMHGqVKlTJ27dqVY3tll8vWt29fo0GDBkaVKlVytGMYhvHqq68akoyEhIQc8wCgqHF5PAAUoc6dO6tixYqqUqWK+vfvL09PT3355ZeqVKmSJMnNzU3Ozle+ii9fvqyzZ8/K09NTderU0S+//GKt5/PPP5ePj4+eeuqpHG3cyj21jz76qM3ASg8++KACAgKsZ88OHDigEydO6OGHH9bZs2dlsVhksVh08eJFderUST/88IOysrJs6kxLS5O7u/t12/3iiy+UlZWlvn37Wuu0WCzy9/dXrVq1tHXrVpvyGRkZkq5sr7ysWbNG3t7euvfee23qDA0NlaenZ446MzMzbcpZLJYcl+5e6++//9Zbb72lSZMmWc/iXt1+3bp1FRISYlNn9i0R17afl927d6tv376KiIjQnDlzci1zM9tDkvVqCUn63//+p6SkJLVt29bms7V27VplZWVp8uTJ1s9ituzPVlRUlM6fP6/nn38+x77NLrN3716dOXNGI0eOtCnTo0cPhYSE6Ntvv7VZLisry7qNDhw
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1200x600 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA+0AAAIjCAYAAAB20vpjAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABbsElEQVR4nO3deVyVZf7/8ffBBVEWRWRTEFxx3FJKBsvdUnQ0y3KfNE2d1MqlMm1MsRo0K7dMcya1xTItM6cmy30Lzd0c0VEHoxJUdABFZZH794dfzs/jARSCc27k9Xw8zqPOfV/3dX3OfXMffHNvFsMwDAEAAAAAANNxcXYBAAAAAAAgb4R2AAAAAABMitAOAAAAAIBJEdoBAAAAADApQjsAAAAAACZFaAcAAAAAwKQI7QAAAAAAmBShHQAAAAAAkyrv7AIAAABKu2vXrunixYsqX768fH19nV0OAOAuwpF2AABQ6owaNUoPPvigU2vYsGGDevbsqapVq8rNzU01a9bUc88959SaULzWrVsnd3d3nT9/3tmlACjDCO0AUAKWLVsmi8VifVWqVEkNGjTQmDFjdPbsWWeXB5Rq8fHx+sc//qHJkyc7rYZ3331XXbp0UWpqqubOnav169dr/fr1mj59utNqQvHr2rWr6tWrp5iYGGeXAqAMsxiGYTi7CAC42yxbtkxPPvmkpk+frtDQUF27dk07duzQRx99pNq1a+vIkSOqXLmys8sESqWxY8fq22+/1fHjx50y/okTJ9S0aVM9+eSTevfdd2WxWJxSBxxj4cKFev7555WUlCQPDw9nlwOgDOJIOwCUoKioKA0aNEhPPfWUli1bprFjxyo+Pl5fffWVs0sDSqWsrCwtX75cffr0cVoN8+bNk7+/v+bNm0dgLwN69+6tjIwMrVq1ytmlACijCO0A4EAdO3aUdOP0Xkm6ePGinn/+eTVt2lTu7u7y9PRUVFSUDh06ZLfstWvXNG3aNDVo0ECVKlVSQECAHn30UZ06dUqSdPr0aZtT8m99tW/f3trXli1bZLFY9Nlnn2ny5Mny9/dXlSpV1LNnT/3yyy92Y+/evVtdu3aVl5eXKleurHbt2mnnzp15fsb27dvnOf60adPs2n788ccKDw+Xm5ubvL291a9fvzzHL+iz3SwnJ0dz5sxR48aNValSJfn5+WnkyJH63//+Z9MuJCREf/rTn+zGGTNmjF2fedU+a9Ysu3UqSRkZGZo6darq1asnV1dXBQUF6cUXX1RGRkae6+pm7du3t+vv9ddfl4uLiz755JMirY8333xTrVu3VvXq1eXm5qbw8HB9/vnneY7/8ccfq1WrVqpcubKqVaumtm3b6vvvv7dp8+2336pdu3by8PCQp6en7rvvPrvaVq1aZd2mPj4+GjRokH777TebNkOGDLGpuVq1amrfvr22b99+2/W0Y8cOJScnq3PnznnOnzZt2m1//gtTa1527dql8PBwjRo1Sn5+fnJ1dVWTJk3097//3a5tYbbBrZfV5FX7nX5n5O7jeY3l7u6uIUOG2ExLSUnRuHHjFBISIldXV9WqVUtPPPGEkpOTbfrbsmWLzXLdu3e320dyt8GxY8fUp08feXp6qnr16nruued07do1m+Wzs7P16quvqm7dunJ1dVVISIgmT55st8+EhIRY14eLi4v8/f3Vt29fJSQkWNvk7hfLli2zTrt06ZLCw8MVGhqqxMTEfNtJ0ujRo2WxWOzWja+vr5o1a8YfWwE4DXePBwAHyg3Y1atXlyT997//1Zo1a/T4448rNDRUZ8+e1Xvvvad27drp6NGjCgwMlCRdv35df/rTn7Rx40b169dPzz33nC5duqT169fryJEjqlu3rnWM/v37q1u3bjbjTpo0Kc96Xn/9dVksFk2cOFHnzp3TnDlz1LlzZx08eFBubm6SpE2bNikqKkrh4eGaOnWqXFxctHTpUnXs2FHbt29Xq1at7PqtVauW9RrQy5cv6+mnn85z7ClTpqhPnz566qmndP78ec2fP19t27bVgQMHVLVqVbtlRowYoTZt2kiSVq9erS+//NJm/siRI62XJjz77LOKj4/XO++8owMHDmjnzp2qUKFCnuuhMFJSUvK8vjUnJ0c9e/bUjh07NGLECDVq1Eg//fSTZs+erf/85z9as2ZNocZZunSp/vrXv+qtt97SgAED8mxzu/Uxd+5c9ezZUwMHDlRmZqZWrFihxx9/XF9//bW6d+9ubRcdHa1p06apdevWmj59uipWrKjdu3dr06ZNeuihhyTdCJRDhw5V48aNNWnSJFWtWlUHDhzQunXrrPXlrvv77rtPMTExOnv2rObOnaudO3fabVMfHx/Nnj1bkvTrr79q7ty56tatm3755Zc8t32uH374QRaLRS1atChw/S1cuFDu7u6S8v75L0ytt7pw4YL27t2r8uXLa/To0apbt67WrFmjESNG6MKFC3rppZcKvQ1uNnv2bPn4+Ei6sZ/c7E6/Mwrj8uXLatOmjeLi4jR06FC1bNlSycnJWrt2rX799VdrLbfatm2b/vWvf+Xbb58+fRQSEqKYmBjt2rVL8+bN0//+9z99+OGH1jZPPfWUPvjgAz322GOaMGGCdu/erZiYGMXFxdn9PLdp00YjRoxQTk6Ojhw5ojlz5ujMmTP5/rEnKytLvXv3VkJCgnbu3KmAgIB8az158mSef3TJFR4eXuh9GACKjQEAKHZLly41JBkbNmwwzp8/b/zyyy/GihUrjOrVqxtubm7Gr7/+ahiGYVy7ds24fv26zbLx8fGGq6urMX36dOu0JUuWGJKMt99+226snJwc63KSjFmzZtm1ady4sdGuXTvr+82bNxuSjJo1axppaWnW6StXrjQkGXPnzrX2Xb9+faNLly7WcQzDMK5cuWKEhoYaDz74oN1YrVu3Npo0aWJ9f/78eUOSMXXqVOu006dPG+XKlTNef/11m2V/+ukno3z58nbTT5w4YUgyPvjgA+u0qVOnGjf/Gtu+fbshyVi+fLnNsuvWrbObXrt2baN79+52tY8ePdq49VfjrbW/+OKLhq+vrxEeHm6zTj/66CPDxcXF2L59u83yixYtMiQZO3futBvvZu3atbP298033xjly5c3JkyYkGfbO1kfhnFjO90sMzPTaNKkidGxY0ebvlxcXIxHHnnE7mcxd5unpKQYHh4eRkREhHH16tU822RmZhq+vr5GkyZNbNp8/fXXhiTjlVdesU4bPHiwUbt2bZt+Fi9ebEgyfvzxxzw/c65BgwYZ1atXz3f+5MmTDUlGcnKyddqtP/+FqTUvtWvXNiQZy5Yts07Lzs42OnXqZLi6utqMfSfbINff//53Q5Lx888/W6fd/HNhGHf+nZG7j69atcpunCpVqhiDBw+2vn/llVcMScbq1avt2uZu39z+Nm/ebJ0XERFhREVF2e0juT+LPXv2tOlr1KhRhiTj0KFDhmEYxsGDBw1JxlNPPWXT7vnnnzckGZs2bbJOq127tk3NhmEYAwYMMCpXrmyzHiQZS5cuNXJycoyBAwcalStXNnbv3m23vnLb5erTp4/RpEkTIygoyG4cwzCMv/3tb4Yk4+zZs3bzAKCkcXo8AJSgzp07q0aNGgoKClK/fv3k7u6uL7/8UjVr1pQkubq6ysXlxlfx9evXdeHCBbm7u6thw4bav3+/tZ8vvvhCPj4+euaZZ+zG+D3X1D7xxBM2N1Z67LHHFBAQYD16dvDgQZ04cUIDBgzQhQsXlJycrOTkZKWnp6tTp07atm2bcnJybPq8du2aKlWqVOC4q1evVk5Ojvr06WPtMzk5Wf7+/qpfv742b95s0z4zM1PSjfWVn1WrVsnLy0sPPvigTZ/h4eFyd3e36zMrK8umXXJyst2pu7f67bffNH/+fE2ZMsV6FPfm8Rs1aqSwsDCbPnMvibh1/Pz8+OOP6tOnj3r37q1Zs2bl2eZO1ock69kSkvS///1PqampatOmjc3P1po1a5STk6NXXnnF+rOYK/dna/369bp06ZJeeuklu22b22bv3r06d+6cRo0aZdOme/fuCgsL0zfffGOzXE5OjnUdHTx4UB9
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1200x600 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA+0AAAIjCAYAAAB20vpjAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABbsElEQVR4nO3deVyVZf7/8ffBBVEWRWRTEFxx3FJKBsvdUnQ0y3KfNE2d1MqlMm1MsRo0K7dMcya1xTItM6cmy30Lzd0c0VEHoxJUdABFZZH794dfzs/jARSCc27k9Xw8zqPOfV/3dX3OfXMffHNvFsMwDAEAAAAAANNxcXYBAAAAAAAgb4R2AAAAAABMitAOAAAAAIBJEdoBAAAAADApQjsAAAAAACZFaAcAAAAAwKQI7QAAAAAAmBShHQAAAAAAkyrv7AIAAABKu2vXrunixYsqX768fH19nV0OAOAuwpF2AABQ6owaNUoPPvigU2vYsGGDevbsqapVq8rNzU01a9bUc88959SaULzWrVsnd3d3nT9/3tmlACjDCO0AUAKWLVsmi8VifVWqVEkNGjTQmDFjdPbsWWeXB5Rq8fHx+sc//qHJkyc7rYZ3331XXbp0UWpqqubOnav169dr/fr1mj59utNqQvHr2rWr6tWrp5iYGGeXAqAMsxiGYTi7CAC42yxbtkxPPvmkpk+frtDQUF27dk07duzQRx99pNq1a+vIkSOqXLmys8sESqWxY8fq22+/1fHjx50y/okTJ9S0aVM9+eSTevfdd2WxWJxSBxxj4cKFev7555WUlCQPDw9nlwOgDOJIOwCUoKioKA0aNEhPPfWUli1bprFjxyo+Pl5fffWVs0sDSqWsrCwtX75cffr0cVoN8+bNk7+/v+bNm0dgLwN69+6tjIwMrVq1ytmlACijCO0A4EAdO3aUdOP0Xkm6ePGinn/+eTVt2lTu7u7y9PRUVFSUDh06ZLfstWvXNG3aNDVo0ECVKlVSQECAHn30UZ06dUqSdPr0aZtT8m99tW/f3trXli1bZLFY9Nlnn2ny5Mny9/dXlSpV1LNnT/3yyy92Y+/evVtdu3aVl5eXKleurHbt2mnnzp15fsb27dvnOf60adPs2n788ccKDw+Xm5ubvL291a9fvzzHL+iz3SwnJ0dz5sxR48aNValSJfn5+WnkyJH63//+Z9MuJCREf/rTn+zGGTNmjF2fedU+a9Ysu3UqSRkZGZo6darq1asnV1dXBQUF6cUXX1RGRkae6+pm7du3t+vv9ddfl4uLiz755JMirY8333xTrVu3VvXq1eXm5qbw8HB9/vnneY7/8ccfq1WrVqpcubKqVaumtm3b6vvvv7dp8+2336pdu3by8PCQp6en7rvvPrvaVq1aZd2mPj4+GjRokH777TebNkOGDLGpuVq1amrfvr22b99+2/W0Y8cOJScnq3PnznnOnzZt2m1//gtTa1527dql8PBwjRo1Sn5+fnJ1dVWTJk3097//3a5tYbbBrZfV5FX7nX5n5O7jeY3l7u6uIUOG2ExLSUnRuHHjFBISIldXV9WqVUtPPPGEkpOTbfrbsmWLzXLdu3e320dyt8GxY8fUp08feXp6qnr16nruued07do1m+Wzs7P16quvqm7dunJ1dVVISIgmT55st8+EhIRY14eLi4v8/f3Vt29fJSQkWNvk7hfLli2zTrt06ZLCw8MVGhqqxMTEfNtJ0ujRo2WxWOzWja+vr5o1a8YfWwE4DXePBwAHyg3Y1atXlyT997//1Zo1a/T4448rNDRUZ8+e1Xvvvad27drp6NGjCgwMlCRdv35df/rTn7Rx40b169dPzz33nC5duqT169fryJEjqlu3rnWM/v37q1u3bjbjTpo0Kc96Xn/9dVksFk2cOFHnzp3TnDlz1LlzZx08eFBubm6SpE2bNikqKkrh4eGaOnWqXFxctHTpUnXs2FHbt29Xq1at7PqtVauW9RrQy5cv6+mnn85z7ClTpqhPnz566qmndP78ec2fP19t27bVgQMHVLVqVbtlRowYoTZt2kiSVq9erS+//NJm/siRI62XJjz77LOKj4/XO++8owMHDmjnzp2qUKFCnuuhMFJSUvK8vjUnJ0c9e/bUjh07NGLECDVq1Eg//fSTZs+erf/85z9as2ZNocZZunSp/vrXv+qtt97SgAED8mxzu/Uxd+5c9ezZUwMHDlRmZqZWrFihxx9/XF9//bW6d+9ubRcdHa1p06apdevWmj59uipWrKjdu3dr06ZNeuihhyTdCJRDhw5V48aNNWnSJFWtWlUHDhzQunXrrPXlrvv77rtPMTExOnv2rObOnaudO3fabVMfHx/Nnj1bkvTrr79q7ty56tatm3755Zc8t32uH374QRaLRS1atChw/S1cuFDu7u6S8v75L0ytt7pw4YL27t2r8uXLa/To0apbt67WrFmjESNG6MKFC3rppZcKvQ1uNnv2bPn4+Ei6sZ/c7E6/Mwrj8uXLatOmjeLi4jR06FC1bNlSycnJWrt2rX799VdrLbfatm2b/vWvf+Xbb58+fRQSEqKYmBjt2rVL8+bN0//+9z99+OGH1jZPPfWUPvjgAz322GOaMGGCdu/erZiYGMXFxdn9PLdp00YjRoxQTk6Ojhw5ojlz5ujMmTP5/rEnKytLvXv3VkJCgnbu3KmAgIB8az158mSef3TJFR4eXuh9GACKjQEAKHZLly41JBkbNmwwzp8/b/zyyy/GihUrjOrVqxtubm7Gr7/+ahiGYVy7ds24fv26zbLx8fGGq6urMX36dOu0JUuWGJKMt99+226snJwc63KSjFmzZtm1ady4sdGuXTvr+82bNxuSjJo1axppaWnW6StXrjQkGXPnzrX2Xb9+faNLly7WcQzDMK5cuWKEhoYaDz74oN1YrVu3Npo0aWJ9f/78eUOSMXXqVOu006dPG+XKlTNef/11m2V/+ukno3z58nbTT5w4YUgyPvjgA+u0qVOnGjf/Gtu+fbshyVi+fLnNsuvWrbObXrt2baN79+52tY8ePdq49VfjrbW/+OKLhq+vrxEeHm6zTj/66CPDxcXF2L59u83yixYtMiQZO3futBvvZu3atbP298033xjly5c3JkyYkGfbO1kfhnFjO90sMzPTaNKkidGxY0ebvlxcXIxHHnnE7mcxd5unpKQYHh4eRkREhHH16tU822RmZhq+vr5GkyZNbNp8/fXXhiTjlVdesU4bPHiwUbt2bZt+Fi9ebEgyfvzxxzw/c65BgwYZ1atXz3f+5MmTDUlGcnKyddqtP/+FqTUvtWvXNiQZy5Yts07Lzs42OnXqZLi6utqMfSfbINff//53Q5Lx888/W6fd/HNhGHf+nZG7j69atcpunCpVqhiDBw+2vn/llVcMScbq1avt2uZu39z+Nm/ebJ0XERFhREVF2e0juT+LPXv2tOlr1KhRhiTj0KFDhmEYxsGDBw1JxlNPPWXT7vnnnzckGZs2bbJOq127tk3NhmEYAwYMMCpXrmyzHiQZS5cuNXJycoyBAwcalStXNnbv3m23vnLb5erTp4/RpEkTIygoyG4cwzCMv/3tb4Yk4+zZs3bzAKCkcXo8AJSgzp07q0aNGgoKClK/fv3k7u6uL7/8UjVr1pQkubq6ysXlxlfx9evXdeHCBbm7u6thw4bav3+/tZ8vvvhCPj4+euaZZ+zG+D3X1D7xxBM2N1Z67LHHFBAQYD16dvDgQZ04cUIDBgzQhQsXlJycrOTkZKWnp6tTp07atm2bcnJybPq8du2aKlWqVOC4q1evVk5Ojvr06WPtMzk5Wf7+/qpfv742b95s0z4zM1PSjfWVn1WrVsnLy0sPPvigTZ/h4eFyd3e36zMrK8umXXJyst2pu7f67bffNH/+fE2ZMsV6FPfm8Rs1aqSwsDCbPnMvibh1/Pz8+OOP6tOnj3r37q1Zs2bl2eZO1ock69kSkvS///1PqampatOmjc3P1po1a5STk6NXXnnF+rOYK/dna/369bp06ZJeeuklu22b22bv3r06d+6cRo0aZdOme/fuCgsL0zfffGOzXE5OjnUdHTx4UB9
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1200x600 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import seaborn as sns\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"\n",
|
|||
|
"# Гистограмма распределения цены закрытия в обучающей выборке\n",
|
|||
|
"plt.figure(figsize=(12, 6))\n",
|
|||
|
"sns.histplot(X_train['Close'], bins=30, kde=False)\n",
|
|||
|
"plt.title(\"Распределение классов (до балансировки)\")\n",
|
|||
|
"plt.xlabel('Целевая переменная: Close')\n",
|
|||
|
"plt.ylabel('Частота')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"# Гистограмма распределения цены закрытия в контрольной выборке\n",
|
|||
|
"plt.figure(figsize=(12, 6))\n",
|
|||
|
"sns.histplot(X_val['Close'], bins=30, kde=False)\n",
|
|||
|
"plt.title(\"Распределение классов (до балансировки)\")\n",
|
|||
|
"plt.xlabel('Целевая переменная: Close')\n",
|
|||
|
"plt.ylabel('Частота')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"# Гистограмма распределения цены закрытия в тестовой выборке\n",
|
|||
|
"plt.figure(figsize=(12, 6))\n",
|
|||
|
"sns.histplot(X_test['Close'], bins=30, kde=False)\n",
|
|||
|
"plt.title(\"Распределение классов (до балансировки)\")\n",
|
|||
|
"plt.xlabel('Целевая переменная: Close')\n",
|
|||
|
"plt.ylabel('Частота')\n",
|
|||
|
"plt.show()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"##### Применим овер- и андерсемплинг к обучающей выборке:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 36,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
" Date Open High Low Close Adj Close Volume \\\n",
|
|||
|
"4789 2020-07-08 5.66 5.73 5.47 5.560000 5.341250 23355100.0 \n",
|
|||
|
"3469 2015-04-10 3.86 3.93 3.81 3.880000 3.513961 7605300.0 \n",
|
|||
|
"2503 2011-06-07 12.19 12.28 11.95 12.020000 10.138681 7243200.0 \n",
|
|||
|
"1580 2007-10-08 11.77 11.84 11.53 11.570000 9.509553 3025900.0 \n",
|
|||
|
"2759 2012-06-12 15.77 16.17 15.76 16.120001 13.771020 6113400.0 \n",
|
|||
|
"\n",
|
|||
|
" closePrice_category \n",
|
|||
|
"4789 high \n",
|
|||
|
"3469 medium \n",
|
|||
|
"2503 very_high \n",
|
|||
|
"1580 very_high \n",
|
|||
|
"2759 very_high \n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAn4AAAHHCAYAAAAh/VVAAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABc0UlEQVR4nO3dd1QU598F8LuAFKki0gwi9i4RjYIFoygKdiMWoliiRiXWqDGxYW+xK5bErrHGnqDYG3axi2gwEhURFRGU/rx/+O78GHZRJKtg5n7O2XPYmWdmvlN29jJtVUIIASIiIiL6z9PL7wKIiIiI6ONg8CMiIiJSCAY/IiIiIoVg8CMiIiJSCAY/IiIiIoVg8CMiIiJSCAY/IiIiIoVg8CMiIiJSCAY/IiKiPEhPT0dsbCzu37+f36WQjqWmpiImJgYPHz7M71J0jsGP6BOWmJiIuXPnSu/j4+OxaNGi/CuIPrp169bh3r170vtVq1bhwYMH+VfQf1xkZCR69+4NBwcHGBoaws7ODu7u7uCPYOVOQd5ez58/jy5dusDGxgZGRkZwcHBA+/bt87ssnVPp6ifbVq1ahR49ekjvjYyMUKJECTRt2hRjxoyBnZ2dLiZDRFlkZGTA0tISS5cuRYMGDfDzzz/j1q1bCAkJye/S6CPp27cvnj9/jhkzZiAiIgLt2rXDnTt34ODgkN+l/eecPn0azZs3h7W1NQYOHIhKlSpBpVLB0tIStWrVyu/yPgkFdXvduXMnOnbsiAoVKqBfv34oXbo0AMDW1hbVqlXL19p0zUDXI5wwYQJcXFyQnJyMEydOIDg4GH/88QeuXbuGwoUL63pyRIqmr6+PoKAgdOvWDZmZmbCwsMDevXvzuyz6iIYMGYKGDRvCxcUFADB06NB8/xL9L0pNTUWPHj1Qrlw57N+/H5aWlvld0iepIG6vz549wzfffANvb29s2bIFhoaG+VrPh6bz4Ne8eXPUrFkTAPDNN9+gaNGimD17Nnbu3InOnTvrenJEijds2DB07NgR0dHRqFixIqysrPK7JPqIKlSogLt37+LatWuwsbGRjlSQbu3evRsRERG4desWQ9+/UBC315UrVyI5ORmrVq36z4c+4CNc49eoUSMAQFRUFIA3yfr7779H1apVYWZmBgsLCzRv3hyXL1/WGDY5ORnjx49HuXLlYGxsDAcHB7Rr1w53794FANy7dw8qlSrHV8OGDaVxHTlyBCqVCps2bcKPP/4Ie3t7mJqaolWrVoiOjtaY9pkzZ9CsWTNYWlqicOHC8PT0xMmTJ7XOY8OGDbVOf/z48Rpt161bBzc3N5iYmMDa2hqdOnXSOv23zVtWmZmZmDt3LipXrgxjY2PY2dlJh9KzKlmyJFq0aKExncDAQI1xaqt95syZGssUAFJSUjBu3DiUKVMGRkZGcHJywogRI5CSkqJ1WWXVsGFDjfFNnjwZenp62LBhg9Tt+PHj6NChA0qUKCFNY8iQIXj9+rXUpnv37m/dFlQqley6kj///BP169eHqakpzM3N4evri+vXr8tqyWmcZcqUkbVbvHgxKleuDCMjIzg6OmLAgAGIj4/XmNcqVargwoUL8PDwgImJCVxcXLBkyRJZO/V2euTIEVl3X19fjfUyfvx4ad199tlncHd3h4GBAezt7bWOI7vg4GBUr14dlpaWMDU1RfXq1fHrr7/K2ly5cgXdu3dHqVKlYGxsDHt7e/Ts2RNPnz6Vtctai9rhw4dhZGSEb7/9VqPdrVu34OfnBwsLCxQtWhSDBg1CcnKybHiVSoXAwMAc61+1apXGei1ZsqS0nvT09GBvb4+OHTtqXHw/a9YseHh4oGjRojAxMYGbmxu2bt2qMY2SJUuie/fu0vuXL18iMDAQxYsXh5GREcqWLYtp06YhMzPznbW3aNECJUuWlHXbunWr1nUVHx+PwYMHw8nJCUZGRihTpgymT58um456H7Fq1SqYmpqidu3aKF26NAYMGACVSiWrWxv18LNmzdLoV6VKFY3PJpC7/aJ6HcfFxcm6nz9/XqpXrXv37jAzM3trnTntS7NSf27ULyMjI5QrVw5Tp07N1bV3sbGx6NWrF+zs7GBsbIzq1atj9erVsjanT5+Gi4sLtm3bhtKlS8PQ0BAlSpTAiBEjZPuigIAA2NjYIC0tTWM6TZs2Rfny5WU1Z1/33bt319hOuL1qficWKlQIJUuWxPDhw5Gamiq1U+8Xzp8/n+O4sn/3nD59Gq6urpgyZYo0Dzktq/T0dEycOBGlS5eGkZERSpYsiR9//FHjO0/9nbt//364urrC2NgYlSpVwu+//y5rp20/dv36dRQpUgQtWrRAenr6ey3n3ND5Eb/s1CGtaNGiAIC//voLO3bsQIcOHeDi4oLHjx9j6dKl8PT0xI0bN+Do6AjgzbVLLVq0wMGDB9GpUycMGjQIL1++RGhoKK5duyb7L6Fz587w8fGRTXfUqFFa65k8eTJUKhVGjhyJ2NhYzJ07F15eXggPD4eJiQkA4NChQ2jevDnc3Nwwbtw46OnpYeXKlWjUqBGOHz+OL774QmO8n332GaZOnQrgzQX3/fr10zrtMWPGwM/PD9988w2ePHmCBQsWoEGDBrh06ZLWIzV9+vRB/fr1AQC///47tm/fLuvft29f6frKgQMHIioqCgsXLsSlS5dw8uRJFCpUSOtyeB/x8fHSvGWVmZmJVq1a4cSJE+jTpw8qVqyIq1evYs6cObh9+zZ27NjxXtNZuXIlRo8ejZ9//hldunSRum/ZsgWvXr1Cv379ULRoUZw9exYLFizAP//8gy1btgB4sxy8vLykYbp27Yq2bduiXbt2UrdixYoBANauXYuAgAB4e3tj+vTpePXqFYKDg1GvXj1cunRJtrMzMjLCL7/8IqvT3Nxc+nv8+PEICgqCl5cX+vXrh4iICAQHB+PcuXMay//58+fw8fGBn58fOnfujM2bN6Nfv34wNDREz549c1wux44dwx9//JGrZfjzzz/j8ePHuWr78uVLNG3aFKVLl4YQAps3b8Y333wDKysr6YLm0NBQ/PXXX+jRowfs7e1x/fp1LFu2DNevX8fp06c1wp7a5cuX0aZNG/j4+Gi92cTPzw8lS5bE1KlTcfr0acyfPx/Pnz/HmjVrclX729SvXx99+vRBZmYmrl27hrlz5+Lhw4c4fvy41GbevHlo1aoV/P39kZqaio0bN6JDhw7Ys2cPfH19cxx3+/btERoaim7duuGLL77A4cOHMWrUKNy7d08jxOfVq1ev4OnpiQcPHqBv374oUaIETp06hVGjRuHRo0eym3myu3PnDpYvX66TOrLLy37xY/vxxx9RsWJFvH79Wvon39bWFr169cpxmNevX6Nhw4a4c+cOAgMD4eLigi1btqB79+6Ij4/HoEGDAABPnz7FX3/9hR9//BHt2rXDsGHDcP78ecycORPXrl3D3r17oVKp0LVrV6xZswb79u2T/bMdExODQ4cOYdy4ce89X9xe/0f9nZiSkoJ9+/Zh1qxZMDY2xsSJE/M8D0+fPsWJEydw4sQJ9OzZE25ubjh48KDWZfXNN99g9erV+OqrrzBs2DCcOXMGU6dOxc2bNzW+nyMjI9GxY0d8++23CAgIwMqVK9GhQweEhISgSZMmWmuJjo5Gs2bNUKFCBWzevBkGBm9i2r9ZzhqEjqxcuVIAEAcOHBBPnjwR0dHRYuPGjaJo0aLCxMRE/PPPP0IIIZKTk0VGRoZs2KioKGFkZCQmTJggdVuxYoUAIGbPnq0xrczMTGk4AGLmzJkabSpXriw8PT2l94cPHxYARPHixUVCQoLUffPmzQKAmDdvnjTusmXLCm9vb2k6Qgjx6tUr4eLiIpo0aaIxLQ8PD1GlShXp/ZMnTwQAMW7cOKnbvXv3hL6+vpg8ebJs2KtXrwoDAwON7pGRkQKAWL16tdRt3LhxIusqO378uAAg1q9fLxs2JCREo7uzs7Pw9fXVqH3AgAEi+2aQvfYRI0YIW1tb4ebmJluma9euFXp6euL48eOy4ZcsWSIAiJMnT2pMLytPT09pfHv37hUGBgZi2LBhGu1evXq
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 640x480 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAnEAAAHHCAYAAADQ9g7NAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABbuklEQVR4nO3dd1QU198G8GcB6U1EmsGGxl4iNrCAimKvEQuxi0Yl1lhIbNg1xq6oSazR2MWSSMTesIsVEQ0qURFREMRQhPv+4cv8GHZBQBRGn885ew57587Md3Z2Zx+mrUoIIUBEREREiqJV0AUQERERUe4xxBEREREpEEMcERERkQIxxBEREREpEEMcERERkQIxxBEREREpEEMcERERkQIxxBEREREpEEMcERERkQIxxBF9Zl69eoVFixZJz2NjY7F8+fKCK4g+ut9//x3379+Xnq9btw6PHj0quIKICrH79+9DpVJh3bp1UtvUqVOhUqkKrqj/V6Ahbt26dVCpVNJDX18fX375Jby9vfH06dOCLI3ok2VgYICJEydi06ZNiIiIwNSpU7Fv376CLos+opMnT2LcuHG4f/8+/v77bwwbNgxaWvyfnkhpdAq6AACYNm0aypQpg8TERJw6dQp+fn7466+/cOPGDRgaGhZ0eUSfFG1tbfj6+qJ3795IS0uDqakp/vzzz4Iuiz6iUaNGwdXVFWXKlAEAjB49Gra2tgVcFZFyTJw4ERMmTCjoMgpHiGvVqhVq164NABg4cCCKFSuGBQsWYM+ePejRo0cBV0f06RkzZgy6deuGiIgIVKpUCebm5gVdEn1EFStWxL1793Djxg1YWlrCwcGhoEuiT9zr168/qZ0yOjo60NEp+AhVKPefN23aFAAQHh4OAHjx4gW+//57VKtWDcbGxjA1NUWrVq1w9epVtXETExMxdepUfPnll9DX14etrS06d+6Me/fuAfjfse2sHq6urtK0jh07BpVKha1bt+KHH36AjY0NjIyM0L59e0RERKjN+9y5c2jZsiXMzMxgaGgIFxcXnD59WuMyurq6apz/1KlT1fr+/vvvcHR0hIGBASwsLNC9e3eN889u2TJKS0vDokWLUKVKFejr68Pa2hqDBw9GTEyMrF/p0qXRtm1btfl4e3urTVNT7T/99JPaawoASUlJmDJlCsqVKwc9PT3Y29tj3LhxSEpK0vhaZeTq6qo2vZkzZ0JLSwubN2+W2k6ePImuXbuiZMmS0jxGjRqF//77T+rTt2/fbN8LKpVKdt7QgQMH0KhRIxgZGcHExARt2rTBzZs3ZbVkNc1y5crJ+q1YsQJVqlSBnp4e7OzsMGzYMMTGxqota9WqVXHp0iU4OzvDwMAAZcqUwcqVK2X90t+nx44dk7W3adNGbb1kPI/jiy++gJOTE3R0dGBjY6NxGpn5+fmhRo0aMDMzg5GREWrUqIHffvtN1ufatWvo27cvypYtC319fdjY2KB///54/vy5rJ+mc0qOHj0KPT09fPvtt2r9bt++DQ8PD5iamqJYsWIYMWIEEhMTZeOrVCp4e3tnWX/6KRwZ12vp0qWl9aSlpQUbGxt069YNDx8+lI07f/58ODs7o1ixYjAwMICjoyN27NihNo/SpUujb9++0vP4+Hh4e3ujRIkS0NPTQ/ny5TFnzhykpaW9s/a2bduidOnSsrYdO3ZoXFexsbEYOXIk7O3toaenh3LlymHu3Lmy+WQ8t8fIyAj16tWDg4MDhg0bBpVKJatbk/Tx58+frzasatWqap9NIGfbxfR1HB0dLWu/ePGi2rlIffv2hbGxcbZ1ZrUtzSj9c5PVI/Nr8c8//6Br166wsLCAoaEh6tevr3EP9ru+gzIv87vm++jRI/Tv3x/W1tbQ09NDlSpVsGbNmmyXLd2bN28wffp0ODg4QE9PD6VLl8YPP/wg29a2bdsWZcuW1Ti+k5OTtIMlXU6+jzJuuxo3bgxDQ0P88MMPAN6uU3d3d1haWkrbtP79+8vGz+lnLf0zs337dlSuXBkGBgZwcnLC9evXAQCrVq1CuXLloK+vD1dXV9nnPnOd2W1jNdG0/Uqvx9/fH1WrVpXWV0BAgNr4x44dQ+3ataGvrw8HBwesWrUqT+fZFXyM1CD9zV6sWDEAbz88/v7+6Nq1K8qUKYOnT59i1apVcHFxwa1bt2BnZwcASE1NRdu2bXH48GF0794dI0aMQHx8PAIDA3Hjxg3Zf5s9evRA69atZfP18fHRWM/MmTOhUqkwfvx4REVFYdGiRXBzc0NwcDAMDAwAAEeOHEGrVq3g6OiIKVOmQEtLC2vXrkXTpk1x8uRJ1K1bV226X3zxBWbPng3g7cnmQ4YM0TjvSZMmwcPDAwMHDsSzZ8+wdOlSNG7cGFeuXNG4B2XQoEFo1KgRAGDXrl3YvXu3bPjgwYOxbt069OvXD8OHD0d4eDiWLVuGK1eu4PTp0yhSpIjG1yE3YmNjpWXLKC0tDe3bt8epU6cwaNAgVKpUCdevX8fChQtx584d+Pv752o+a9euxcSJE/Hzzz+jZ8+eUvv27dvx+vVrDBkyBMWKFcP58+exdOlS/Pvvv9i+fTuAt6+Dm5ubNE6vXr3QqVMndO7cWWorXrw4AGDjxo3o06cP3N3dMXfuXLx+/Rp+fn5o2LAhrly5Ivui1dPTw6+//iqr08TERPp76tSp8PX1hZubG4YMGYLQ0FD4+fnhwoULaq9/TEwMWrduDQ8PD/To0QPbtm3DkCFDoKurq7bhy+jEiRP466+/cvQa/vzzzzk+BzU+Ph4tWrSAg4MDhBDYtm0bBg4cCHNzc3Tp0gUAEBgYiH/++Qf9+vWDjY0Nbt68idWrV+PmzZs4e/Zslhupq1evomPHjmjdurXGCy08PDxQunRpzJ49G2fPnsWSJUsQExODDRs25Kj27DRq1AiDBg1CWloabty4gUWLFuHx48c4efKk1Gfx4sVo3749PD09kZycjC1btqBr167Yv38/2rRpk+W0u3TpgsDAQPTu3Rt169bF0aNH4ePjg/v37+foyyInXr9+DRcXFzx69AiDBw9GyZIlcebMGfj4+ODJkyeyC1kyu3v3Ln755Zd8qSOzvGwXP7bhw4ejTp06sraBAwfKnj99+hTOzs54/fo1hg8fjmLFimH9+vVo3749duzYgU6dOgHI3XdQuo0bN0p/jxo1Sm2+9evXl8JB8eLFceDAAQwYMABxcXEYOXJktss2cOBArF+/Hl9//TXGjBmDc+fOYfbs2QgJCZG+F7p164bevXvjwoULstfhwYMHOHv2LH766SepLTffR8+fP0erVq3QvXt3fPPNN7C2tkZUVBRatGiB4sWLY8KECTA3N8f9+/exa9cuWd25+aydPHkSe/fuxbBhwwAAs2fPRtu2bTFu3DisWLECQ4cORUxMDObNm4f+/fvjyJEjsvHzuo3NyqlTp7Br1y4MHToUJiYmWLJkCbp06YKHDx9KmebKlSto2bIlbG1t4evri9TUVEybNk36vskVUYDWrl0rAIhDhw6JZ8+eiYiICLFlyxZRrFgxYWBgIP79918hhBCJiYkiNTVVNm54eLjQ09MT06ZNk9rWrFkjAIgFCxaozSstLU0aD4D46aef1PpUqVJFuLi4SM+PHj0qAIgSJUqIuLg4qX3btm0CgFi8eLE07fLlywt3d3dpPkII8fr1a1GmTBnRvHlztXk5OzuLqlWrSs+fPXsmAIgpU6ZIbffv3xfa2tpi5syZsnGvX78udHR01NrDwsIEALF+/XqpbcqUKSLjaj558qQAIDZt2iQbNyAgQK29VKlSok2bNmq1Dxs2TGR+62Sufdy4ccLKyko4OjrKXtONGzcKLS0tcfLkSdn4K1euFADE6dOn1eaXkYuLizS9P//8U+jo6IgxY8ao9Xv9+rVa2+zZs4VKpRIPHjzQOO3My5AuPj5emJubCy8vL1l7ZGSkMDMzk7X36dNHGBkZZVl/VFSU0NXVFS1atJC9p5ctWyYAiDVr1siWFYD4+eefpbakpCRRs2ZNYWVlJZKTk4UQ/3ufHj16VOpXr1490apVK7Vlyvx+iIqKEiYmJlL
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 640x480 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAnYAAAHHCAYAAAAyKhW0AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABbDUlEQVR4nO3dd1QU5/s28GsB6U1AmsFu7CViA7uiqNiNRCV20SjEGjV+Y8OGGqNYUZNYQ2KNJSYSsTfsYhfRYCRGRBSkKEV43j98d34MuyAgAhmvzzl7DvvMszP3zOzOXkxblRBCgIiIiIj+83SKuwAiIiIiKhwMdkREREQKwWBHREREpBAMdkREREQKwWBHREREpBAMdkREREQKwWBHREREpBAMdkREREQKwWBHREREpBAMdkQKlZSUhICAAOl5fHw8Vq1aVXwFUZH76aef8ODBA+n5xo0b8ejRo+IriP7zVCoVZs2aVdxllAiDBw9GhQoVZG0lYfkUSbDbuHEjVCqV9DA0NMTHH38MX19fPHnypChKIPrgGBkZYdq0aQgKCkJUVBRmzZqF3377rbjLoiJ08uRJTJ48GQ8ePMCff/4JHx8f6Ojw/3kiJdMryonNnj0bFStWREpKCk6dOoXAwED88ccfuHHjBoyNjYuyFCLF09XVhZ+fHwYOHIjMzEyYm5vj999/L+6yqAiNHz8erVu3RsWKFQEAEyZMgIODQzFXRaRcr169gp5ekUYrDUU69U6dOqFhw4YAgOHDh8Pa2hpLlizB3r170a9fv6IsheiDMHHiRHz22WeIiopCjRo1YGlpWdwlURGqXr067t+/jxs3bsDGxgaVK1cu7pKIJCkpKdDX11fUXmRDQ8PiLqF4z7Fr27YtACAyMhIA8Pz5c3z11VeoU6cOTE1NYW5ujk6dOuHq1asar01JScGsWbPw8ccfw9DQEA4ODujVqxfu378PAHjw4IHs8G/2R+vWraVxHTt2DCqVCtu2bcP//vc/2Nvbw8TEBN26dUNUVJTGtM+dO4eOHTvCwsICxsbGaNWqFU6fPq11Hlu3bq11+tqOwf/0009wdnaGkZERrKys0LdvX63Tz23essrMzERAQABq1aoFQ0ND2NnZYeTIkYiLi5P1q1ChArp06aIxHV9fX41xaqv922+/1VimAJCamoqZM2eiSpUqMDAwgJOTEyZPnozU1FStyyqr1q1ba4xv3rx50NHRwc8//yy1nTx5En369EG5cuWkaYwfPx6vXr2S+gwePDjX94JKpZKdh3TgwAG0aNECJiYmMDMzg4eHB27evCmrJadxVqlSRdZv9erVqFWrFgwMDODo6AgfHx/Ex8drzGvt2rVx6dIluLq6wsjICBUrVsSaNWtk/dTv02PHjsnaPTw8NNbLrFmzpHX30UcfwcXFBXp6erC3t9c6juwCAwNRr149WFhYwMTEBPXq1cOPP/4o63Pt2jUMHjwYlSpVgqGhIezt7TF06FA8e/ZM1i9rLWpHjx6FgYEBvvjiC41+d+7cgaenJ8zNzWFtbY2xY8ciJSVF9nqVSgVfX98c61ef/pF1vVaoUEFaTzo6OrC3t8dnn32Ghw8fyl67ePFiuLq6wtraGkZGRnB2dsbOnTs1plGhQgUMHjxYep6YmAhfX1+ULVsWBgYGqFq1KhYsWIDMzMy31t6lSxeNc3V27typdV3Fx8dj3LhxcHJygoGBAapUqYKFCxfKpqPeRmzcuBEmJiZo0qQJKleuDB8fH6hUKlnd2qhfv3jxYo1htWvX1vhsAnnbLqrXcWxsrKz94sWLUr1qgwcPhqmpaa515uV8JvXnJqdH9mXx119/oU+fPrCysoKxsTGaNm2qdU/3276Dss/z26b76NEjDB06FHZ2djAwMECtWrWwfv36XOcNkK/rty0fdS337t3D4MGDYWlpCQsLCwwZMgQvX76UvTY1NRXjx49HmTJlYGZmhm7duuGff/7RWkNealevh61bt2LatGkoW7YsjI2NkZCQgPT0dPj5+aFq1aowNDSEtbU1mjdvjpCQEOn1+d3e3L17F59//jksLCxQpkwZTJ8+HUIIREVFoXv37jA3N4e9vT2+++47rXXmNQ8U5jJ/9eoVxowZAxsbG2mZP3r0KN/n7RXr/kL1B8Da2hrAmw/Unj170KdPH1SsWBFPnjzB2rVr0apVK9y6dQuOjo4AgIyMDHTp0gWHDx9G3759MXbsWCQmJiIkJAQ3btyQ/Vfar18/dO7cWTbdqVOnaq1n3rx5UKlUmDJlCmJiYhAQEAA3NzeEhYXByMgIAHDkyBF06tQJzs7OmDlzJnR0dLBhwwa0bdsWJ0+eROPGjTXG+9FHH8Hf3x/AmxPaR40apXXa06dPh6enJ4YPH46nT59ixYoVaNmyJa5cuaJ1T8uIESPQokULAMCvv/6K3bt3y4aPHDkSGzduxJAhQzBmzBhERkZi5cqVuHLlCk6fPo1SpUppXQ75ER8fL81bVpmZmejWrRtOnTqFESNGoEaNGrh+/TqWLl2Ku3fvYs+ePfmazoYNGzBt2jR899136N+/v9S+Y8cOvHz5EqNGjYK1tTXOnz+PFStW4J9//sGOHTsAvFkObm5u0msGDBiAnj17olevXlJbmTJlAABbtmzBoEGD4O7ujoULF+Lly5cIDAxE8+bNceXKFdmXr4GBAX744QdZnWZmZtLfs2bNgp+fH9zc3DBq1CiEh4cjMDAQFy5c0Fj+cXFx6Ny5Mzw9PdGvXz9s374do0aNgr6+PoYOHZrjcjlx4gT++OOPPC3D7777Ls/ntCYmJqJDhw6oXLkyhBDYvn07hg8fDktLS/Tu3RsAEBISgr/++gtDhgyBvb09bt68iXXr1uHmzZs4e/asRphTu3r1Knr06IHOnTtrvZjD09MTFSpUgL+/P86ePYvly5cjLi4OmzdvzlPtuWnRogVGjBiBzMxM3LhxAwEBAfj3339x8uRJqc+yZcvQrVs3eHl5IS0tDVu3bkWfPn2wf/9+eHh45Dju3r17IyQkBAMHDkTjxo1x9OhRTJ06FQ8ePNAI6QX18uVLtGrVCo8ePcLIkSNRrlw5nDlzBlOnTsXjx49lF8tkd+/ePXz//feFUkd2BdkuFrUxY8agUaNGsrbhw4fLnj958gSurq54+fIlxowZA2tra2zatAndunXDzp070bNnTwD5+w5S27Jli/T3+PHjNabbtGlTKfSXKVMGBw4cwLBhw5CQkIBx48YV0lJ4w9PTExUrVoS/vz8uX76MH374Aba2tli4cKHUZ/jw4fjpp5/Qv39/uLq64siRI1rf//mtfc6cOdDX18dXX32F1NRU6OvrY9asWfD398fw4cPRuHFjJCQk4OLFi7h8+TLat28PIP/bm88++ww1atTAggUL8Pvvv2Pu3LmwsrLC2rVr0bZtWyxcuBBBQUH46quv0KhRI7Rs2VL2+rzkgcJe5oMHD8b27dsxYMAANG3aFMePH891m5MjUQQ2bNggAIhDhw6Jp0+fiqioKLF161ZhbW0tjIyMxD///COEECIlJUVkZGTIXhsZGSkMDAzE7Nmzpbb169cLAGLJkiUa08rMzJReB0B8++23Gn1q1aolWrVqJT0/evSoACDKli0rEhISpPbt27cLAGLZsmXSuKtWrSrc3d2l6QghxMuXL0XFihVF+/btNabl6uoqateuLT1/+vSpACBmzpwptT148EDo6uqKefPmyV57/fp1oaenp9EeEREhAIhNmzZJbTNnzhRZV+fJkycFABEUFCR7bXBwsEZ7+fLlhYeHh0btPj4+IvtbJHvtkydPFra2tsLZ2Vm2TLds2SJ0dHTEyZMnZa9fs2aNACBOnz6tMb2sWrVqJY3v999/F3p6emLixIka/V6+fKnR5u/vL1Qqlfj777+1jjv7PKglJiYKS0tL4e3tLWuPjo4WFhYWsvZBgwYJExOTHOuPiYkR+vr6okOHDrL39MqVKwUAsX79etm8AhDfffed1Jaamirq168vbG1tRVpamhDi/96nR48elfo1adJEdOrUSWOesr8fYmJihJmZmdQ36zjy4vXr18Lc3Fz4+vpKbdq
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 640x480 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Размер обучающей выборки до oversampling и undersampling: 4200\n",
|
|||
|
"Размер обучающей выборки после oversampling и undersampling: 4232\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>Date</th>\n",
|
|||
|
" <th>Open</th>\n",
|
|||
|
" <th>High</th>\n",
|
|||
|
" <th>Low</th>\n",
|
|||
|
" <th>Close</th>\n",
|
|||
|
" <th>Adj Close</th>\n",
|
|||
|
" <th>Volume</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>2020-07-08</td>\n",
|
|||
|
" <td>5.66</td>\n",
|
|||
|
" <td>5.73</td>\n",
|
|||
|
" <td>5.47</td>\n",
|
|||
|
" <td>5.56</td>\n",
|
|||
|
" <td>5.341250</td>\n",
|
|||
|
" <td>23355100.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>20</th>\n",
|
|||
|
" <td>2021-01-19</td>\n",
|
|||
|
" <td>5.15</td>\n",
|
|||
|
" <td>5.15</td>\n",
|
|||
|
" <td>5.02</td>\n",
|
|||
|
" <td>5.13</td>\n",
|
|||
|
" <td>4.966732</td>\n",
|
|||
|
" <td>15906300.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>21</th>\n",
|
|||
|
" <td>2010-04-08</td>\n",
|
|||
|
" <td>10.60</td>\n",
|
|||
|
" <td>10.65</td>\n",
|
|||
|
" <td>10.48</td>\n",
|
|||
|
" <td>10.52</td>\n",
|
|||
|
" <td>8.794909</td>\n",
|
|||
|
" <td>10456400.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>24</th>\n",
|
|||
|
" <td>2020-12-07</td>\n",
|
|||
|
" <td>5.47</td>\n",
|
|||
|
" <td>5.80</td>\n",
|
|||
|
" <td>5.47</td>\n",
|
|||
|
" <td>5.75</td>\n",
|
|||
|
" <td>5.541336</td>\n",
|
|||
|
" <td>12929600.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>28</th>\n",
|
|||
|
" <td>2021-01-05</td>\n",
|
|||
|
" <td>6.15</td>\n",
|
|||
|
" <td>6.16</td>\n",
|
|||
|
" <td>5.98</td>\n",
|
|||
|
" <td>6.04</td>\n",
|
|||
|
" <td>5.847770</td>\n",
|
|||
|
" <td>15080900.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" Date Open High Low Close Adj Close Volume\n",
|
|||
|
"0 2020-07-08 5.66 5.73 5.47 5.56 5.341250 23355100.0\n",
|
|||
|
"20 2021-01-19 5.15 5.15 5.02 5.13 4.966732 15906300.0\n",
|
|||
|
"21 2010-04-08 10.60 10.65 10.48 10.52 8.794909 10456400.0\n",
|
|||
|
"24 2020-12-07 5.47 5.80 5.47 5.75 5.541336 12929600.0\n",
|
|||
|
"28 2021-01-05 6.15 6.16 5.98 6.04 5.847770 15080900.0"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 36,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"from imblearn.over_sampling import RandomOverSampler\n",
|
|||
|
"from imblearn.under_sampling import RandomUnderSampler\n",
|
|||
|
"import pandas as pd\n",
|
|||
|
"import seaborn as sns\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"\n",
|
|||
|
"# Преобразование целевой переменной (цены) в категориальные диапазоны с использованием квантилей\n",
|
|||
|
"X_train['closePrice_category'] = pd.qcut(X_train['Close'], q=4, labels=['low', 'medium', 'high', 'very_high'])\n",
|
|||
|
"print(X_train.head())\n",
|
|||
|
"\n",
|
|||
|
"# Визуализация распределения цен после преобразования в категории\n",
|
|||
|
"sns.countplot(x=X_train['closePrice_category'])\n",
|
|||
|
"plt.title('Распределение категорий закрывающей цены в обучающей выборке')\n",
|
|||
|
"plt.xlabel('Категория закрывающей цены')\n",
|
|||
|
"plt.ylabel('Частота')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"# Балансировка категорий с помощью RandomOverSampler (увеличение меньшинств)\n",
|
|||
|
"ros = RandomOverSampler(random_state=42)\n",
|
|||
|
"y_train = X_train['closePrice_category']\n",
|
|||
|
"X_train = X_train.drop(columns=['closePrice_category'])\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"# Применяем oversampling. Здесь важно, что мы используем X_train как DataFrame и y_train_categories как целевую переменную\n",
|
|||
|
"X_resampled, y_resampled = ros.fit_resample(X_train, y_train)\n",
|
|||
|
"\n",
|
|||
|
"# Визуализация распределения цен после oversampling\n",
|
|||
|
"sns.countplot(x=y_resampled)\n",
|
|||
|
"plt.title('Распределение категорий закрывающей цены после oversampling')\n",
|
|||
|
"plt.xlabel('Категория закрывающей цены')\n",
|
|||
|
"plt.ylabel('Частота')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"# Применение RandomUnderSampler для уменьшения большего класса\n",
|
|||
|
"rus = RandomUnderSampler(random_state=42)\n",
|
|||
|
"X_resampled, y_resampled = rus.fit_resample(X_resampled, y_resampled)\n",
|
|||
|
"\n",
|
|||
|
"# Визуализация распределения цен после undersampling\n",
|
|||
|
"sns.countplot(x=y_resampled)\n",
|
|||
|
"plt.title('Распределение категорий закрывающей цены после undersampling')\n",
|
|||
|
"plt.xlabel('Категория закрывающей цены')\n",
|
|||
|
"plt.ylabel('Частота')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"print(\"Размер обучающей выборки до oversampling и undersampling: \", len(X_train))\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"print(\"Размер обучающей выборки после oversampling и undersampling: \", len(X_resampled))\n",
|
|||
|
"X_resampled.head()\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"По сути, балансировка так то не требовалась, но все же мы ее провели, добавив в обучающую выборку 5 значений (ーー;)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Конструирование признаков\n",
|
|||
|
"1. **Унитарное кодирование категориальных признаков. Преобразование категориальных признаков в бинарные векторы.**\n",
|
|||
|
"* В данном датасете категориальные признаки отсутствуют, так что пропустим этот пункт.\n",
|
|||
|
"2. **Дискретизация числовых признаков. Преобразование непрерывных числовых значений в дискретные категории или интервалы (бины).**"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 37,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Названия столбцов в датасете:\n",
|
|||
|
"Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')\n",
|
|||
|
"Статистические параметры:\n",
|
|||
|
" Date Open High Low \\\n",
|
|||
|
"count 5251 5251.000000 5251.000000 5251.000000 \n",
|
|||
|
"mean 2011-12-01 11:59:51.772995840 6.863639 6.986071 6.720615 \n",
|
|||
|
"min 2001-06-22 00:00:00 1.142857 1.142857 1.142857 \n",
|
|||
|
"25% 2006-09-13 12:00:00 2.857143 2.880000 2.810000 \n",
|
|||
|
"50% 2011-11-29 00:00:00 4.600000 4.710000 4.490000 \n",
|
|||
|
"75% 2017-02-16 12:00:00 10.650000 10.860000 10.425000 \n",
|
|||
|
"max 2022-05-05 00:00:00 20.420000 20.590000 20.090000 \n",
|
|||
|
"std NaN 4.753836 4.832010 4.662891 \n",
|
|||
|
"\n",
|
|||
|
" Close Adj Close Volume \n",
|
|||
|
"count 5251.000000 5251.000000 5.251000e+03 \n",
|
|||
|
"mean 6.850606 5.895644 8.976705e+06 \n",
|
|||
|
"min 1.142857 0.935334 0.000000e+00 \n",
|
|||
|
"25% 2.857143 2.537094 2.845900e+06 \n",
|
|||
|
"50% 4.600000 4.337419 8.216200e+06 \n",
|
|||
|
"75% 10.640000 8.951945 1.327245e+07 \n",
|
|||
|
"max 20.389999 17.543156 2.891228e+07 \n",
|
|||
|
"std 4.746055 3.941634 7.251098e+06 \n",
|
|||
|
"После дискретизации 'Close':\n",
|
|||
|
" Date Open High Low Close Adj Close Volume \\\n",
|
|||
|
"0 2001-06-22 3.428571 3.428571 3.428571 3.428571 2.806002 0.0 \n",
|
|||
|
"1 2001-06-25 3.428571 3.428571 3.428571 3.428571 2.806002 0.0 \n",
|
|||
|
"2 2001-06-26 3.714286 3.714286 3.714286 3.714286 3.039837 0.0 \n",
|
|||
|
"3 2001-06-27 3.714286 3.714286 3.714286 3.714286 3.039837 0.0 \n",
|
|||
|
"4 2001-06-28 3.714286 3.714286 3.714286 3.714286 3.039837 0.0 \n",
|
|||
|
"\n",
|
|||
|
" Close_Disc \n",
|
|||
|
"0 2-4 \n",
|
|||
|
"1 2-4 \n",
|
|||
|
"2 2-4 \n",
|
|||
|
"3 2-4 \n",
|
|||
|
"4 2-4 \n",
|
|||
|
" Date Open High Low Close Adj Close Volume \\\n",
|
|||
|
"2623 2011-11-25 14.730000 15.050000 14.65 14.650000 12.429751 2433000.0 \n",
|
|||
|
"2624 2011-11-28 15.150000 15.370000 15.04 15.200000 12.896397 4348600.0 \n",
|
|||
|
"2625 2011-11-29 15.270000 15.710000 15.21 15.600000 13.235776 4576500.0 \n",
|
|||
|
"2626 2011-11-30 16.120001 16.850000 16.07 16.830000 14.279361 9537100.0 \n",
|
|||
|
"2627 2011-12-01 16.770000 16.940001 16.58 16.809999 14.262395 5111500.0 \n",
|
|||
|
"\n",
|
|||
|
" Close_Disc \n",
|
|||
|
"2623 14-16 \n",
|
|||
|
"2624 14-16 \n",
|
|||
|
"2625 14-16 \n",
|
|||
|
"2626 16+ \n",
|
|||
|
"2627 16+ \n",
|
|||
|
" Date Open High Low Close Adj Close Volume Close_Disc\n",
|
|||
|
"5246 2022-04-29 5.66 5.69 5.50 5.51 5.51 16613300.0 4-6\n",
|
|||
|
"5247 2022-05-02 5.33 5.39 5.18 5.30 5.30 27106700.0 4-6\n",
|
|||
|
"5248 2022-05-03 5.32 5.53 5.32 5.47 5.47 18914200.0 4-6\n",
|
|||
|
"5249 2022-05-04 5.47 5.61 5.37 5.60 5.60 20530700.0 4-6\n",
|
|||
|
"5250 2022-05-05 5.63 5.66 5.34 5.44 5.44 19879200.0 4-6\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"#Пример дискретизации по цене закрытия\n",
|
|||
|
"# Проверка на наличие числовых признаков\n",
|
|||
|
"print(\"Названия столбцов в датасете:\")\n",
|
|||
|
"print(df.columns)\n",
|
|||
|
"\n",
|
|||
|
"# Выводим основные статистические параметры для количественных признаков\n",
|
|||
|
"print(\"Статистические параметры:\")\n",
|
|||
|
"print(df.describe())\n",
|
|||
|
"\n",
|
|||
|
"# Дискретизация столбца 'Close' на группы\n",
|
|||
|
"bins = [0, 2, 4, 6, 8, 10, 12, 14, 16, 30] # Определяем границы корзин\n",
|
|||
|
"labels = ['0-2', '2-4', '4-6', '6-8', '8-10', '10-12', '12-14', '14-16', '16+'] # Названия категорий\n",
|
|||
|
"\n",
|
|||
|
"# Создание нового столбца 'Close_Disc' на основе дискретизации\n",
|
|||
|
"df['Close_Disc'] = pd.cut(df['Close'], bins=bins, labels=labels, include_lowest=True)\n",
|
|||
|
"\n",
|
|||
|
"# Проверка результата\n",
|
|||
|
"print(\"После дискретизации 'Close':\")\n",
|
|||
|
"print(df.head())\n",
|
|||
|
"n = len(df)\n",
|
|||
|
"middle_index = n // 2\n",
|
|||
|
"print(df.iloc[middle_index - 2: middle_index + 3])\n",
|
|||
|
"print(df.tail())"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Конструирование новых признаков:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 38,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"\n",
|
|||
|
"Исходный датасет: \n",
|
|||
|
" Date Open High Low Close Adj Close Volume Close_Disc\n",
|
|||
|
"5246 2022-04-29 5.66 5.69 5.50 5.51 5.51 16613300.0 4-6\n",
|
|||
|
"5247 2022-05-02 5.33 5.39 5.18 5.30 5.30 27106700.0 4-6\n",
|
|||
|
"5248 2022-05-03 5.32 5.53 5.32 5.47 5.47 18914200.0 4-6\n",
|
|||
|
"5249 2022-05-04 5.47 5.61 5.37 5.60 5.60 20530700.0 4-6\n",
|
|||
|
"5250 2022-05-05 5.63 5.66 5.34 5.44 5.44 19879200.0 4-6\n",
|
|||
|
"\n",
|
|||
|
"Обучающая выборка: \n",
|
|||
|
" Date Open High Low Close Adj Close Volume\n",
|
|||
|
"2435 2011-04-14 12.530000 12.84 12.480000 12.750000 10.754427 10527200.0\n",
|
|||
|
"1756 2013-05-30 11.510000 11.76 11.480000 11.720000 10.166282 9028100.0\n",
|
|||
|
"3296 2009-11-20 13.100000 13.28 12.870000 13.220000 11.031483 17024900.0\n",
|
|||
|
"1243 2012-09-17 18.870001 19.00 18.469999 18.870001 16.178450 6652400.0\n",
|
|||
|
"343 2006-12-12 12.920000 13.00 12.580000 12.800000 10.487218 3981100.0\n",
|
|||
|
"\n",
|
|||
|
"Тестовая выборка: \n",
|
|||
|
" Date Open High Low Close Adj Close \\\n",
|
|||
|
"3095 2013-10-14 9.290000 9.350000 9.070000 9.130000 8.025586 \n",
|
|||
|
"859 2004-11-24 3.090000 3.160000 3.040000 3.100000 2.537094 \n",
|
|||
|
"3134 2013-12-09 8.550000 8.770000 8.550000 8.770000 7.709136 \n",
|
|||
|
"2577 2011-09-21 16.709999 17.070000 16.379999 16.400000 13.869872 \n",
|
|||
|
"378 2002-12-27 2.571429 2.571429 2.571429 2.571429 2.104502 \n",
|
|||
|
"\n",
|
|||
|
" Volume \n",
|
|||
|
"3095 5861400.0 \n",
|
|||
|
"859 211300.0 \n",
|
|||
|
"3134 5335400.0 \n",
|
|||
|
"2577 14524400.0 \n",
|
|||
|
"378 0.0 \n",
|
|||
|
"\n",
|
|||
|
"Контрольная выборка: \n",
|
|||
|
" Date Open High Low Close Adj Close \\\n",
|
|||
|
"3095 2013-10-14 9.290000 9.350000 9.070000 9.130000 8.025586 \n",
|
|||
|
"859 2004-11-24 3.090000 3.160000 3.040000 3.100000 2.537094 \n",
|
|||
|
"3134 2013-12-09 8.550000 8.770000 8.550000 8.770000 7.709136 \n",
|
|||
|
"2577 2011-09-21 16.709999 17.070000 16.379999 16.400000 13.869872 \n",
|
|||
|
"378 2002-12-27 2.571429 2.571429 2.571429 2.571429 2.104502 \n",
|
|||
|
"\n",
|
|||
|
" Volume \n",
|
|||
|
"3095 5861400.0 \n",
|
|||
|
"859 211300.0 \n",
|
|||
|
"3134 5335400.0 \n",
|
|||
|
"2577 14524400.0 \n",
|
|||
|
"378 0.0 \n",
|
|||
|
"\n",
|
|||
|
"Новые признаки в обучающей выборке:\n",
|
|||
|
" Volume_Change\n",
|
|||
|
"2435 0.977868\n",
|
|||
|
"1756 -0.142403\n",
|
|||
|
"3296 0.885768\n",
|
|||
|
"1243 -0.609255\n",
|
|||
|
"343 -0.401554\n",
|
|||
|
"\n",
|
|||
|
"Новые признаки в тестовой выборке:\n",
|
|||
|
" Volume_Change\n",
|
|||
|
"3095 inf\n",
|
|||
|
"859 -0.963951\n",
|
|||
|
"3134 24.250355\n",
|
|||
|
"2577 1.722270\n",
|
|||
|
"378 -1.000000\n",
|
|||
|
"\n",
|
|||
|
"Новые признаки в контрольной выборке:\n",
|
|||
|
" Volume_Change\n",
|
|||
|
"3095 inf\n",
|
|||
|
"859 -0.963951\n",
|
|||
|
"3134 24.250355\n",
|
|||
|
"2577 1.722270\n",
|
|||
|
"378 -1.000000\n",
|
|||
|
"\n",
|
|||
|
"Новые признаки в датасете:\n",
|
|||
|
" Volume_Change\n",
|
|||
|
"5246 -0.218393\n",
|
|||
|
"5247 0.631626\n",
|
|||
|
"5248 -0.302232\n",
|
|||
|
"5249 0.085465\n",
|
|||
|
"5250 -0.031733\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"print('\\nИсходный датасет: ')\n",
|
|||
|
"print(df.tail())\n",
|
|||
|
"print('\\nОбучающая выборка: ')\n",
|
|||
|
"print(X_resampled.tail())\n",
|
|||
|
"print('\\nТестовая выборка: ')\n",
|
|||
|
"print(X_test.tail())\n",
|
|||
|
"print('\\nКонтрольная выборка: ')\n",
|
|||
|
"print(X_val.tail())\n",
|
|||
|
"\n",
|
|||
|
"#Объем изменений\n",
|
|||
|
"df['Volume_Change'] = df['Volume'].pct_change()\n",
|
|||
|
"X_resampled['Volume_Change'] = X_resampled['Volume'].pct_change()\n",
|
|||
|
"X_test['Volume_Change'] = X_test['Volume'].pct_change()\n",
|
|||
|
"X_val['Volume_Change'] = X_val['Volume'].pct_change()\n",
|
|||
|
"\n",
|
|||
|
"# Проверка создания новых признаков\n",
|
|||
|
"print(\"\\nНовые признаки в обучающей выборке:\")\n",
|
|||
|
"print(X_resampled[['Volume_Change']].tail())\n",
|
|||
|
"\n",
|
|||
|
"print(\"\\nНовые признаки в тестовой выборке:\")\n",
|
|||
|
"print(X_test[['Volume_Change']].tail())\n",
|
|||
|
"\n",
|
|||
|
"print(\"\\nНовые признаки в контрольной выборке:\")\n",
|
|||
|
"print(X_val[['Volume_Change']].tail())\n",
|
|||
|
"\n",
|
|||
|
"print(\"\\nНовые признаки в датасете:\")\n",
|
|||
|
"print(df[['Volume_Change']].tail())\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"##### Проверим новые признаки:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 39,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"\n",
|
|||
|
"Исходный датасет: \n",
|
|||
|
"Volume_Change 501\n",
|
|||
|
"dtype: int64\n",
|
|||
|
"\n",
|
|||
|
"Обучающая выборка: \n",
|
|||
|
"Volume_Change 102\n",
|
|||
|
"dtype: int64\n",
|
|||
|
"\n",
|
|||
|
"Тестовая выборка: \n",
|
|||
|
"Volume_Change 16\n",
|
|||
|
"dtype: int64\n",
|
|||
|
"\n",
|
|||
|
"Контрольная выборка: \n",
|
|||
|
"Volume_Change 16\n",
|
|||
|
"dtype: int64\n",
|
|||
|
"\n",
|
|||
|
"Есть ли пустые значения признаков: \n",
|
|||
|
"\n",
|
|||
|
"Исходный датасет: \n",
|
|||
|
"Volume_Change True\n",
|
|||
|
"dtype: bool\n",
|
|||
|
"\n",
|
|||
|
"Обучающая выорка: \n",
|
|||
|
"Volume_Change True\n",
|
|||
|
"dtype: bool\n",
|
|||
|
"\n",
|
|||
|
"Тестовая выборка: \n",
|
|||
|
"Volume_Change True\n",
|
|||
|
"dtype: bool\n",
|
|||
|
"\n",
|
|||
|
"Контрольная выборка: \n",
|
|||
|
"Volume_Change True\n",
|
|||
|
"dtype: bool\n",
|
|||
|
"\n",
|
|||
|
"Количество бесконечных значений в каждом столбце:\n",
|
|||
|
"\n",
|
|||
|
"Исходный датасет: \n",
|
|||
|
"Volume_Change 32\n",
|
|||
|
"dtype: int64\n",
|
|||
|
"\n",
|
|||
|
"Обучающая выборка: \n",
|
|||
|
"Volume_Change 310\n",
|
|||
|
"dtype: int64\n",
|
|||
|
"\n",
|
|||
|
"Тестовая выборка: \n",
|
|||
|
"Volume_Change 107\n",
|
|||
|
"dtype: int64\n",
|
|||
|
"\n",
|
|||
|
"Контрольная выборка: \n",
|
|||
|
"Volume_Change 107\n",
|
|||
|
"dtype: int64\n",
|
|||
|
"Volume_Change процент пустых значений в датасете: %9.54\n",
|
|||
|
"Volume_Change процент пустых значений в обучающей выборке: %2.41\n",
|
|||
|
"Volume_Change процент пустых значений в тестовой выборке: %1.52\n",
|
|||
|
"Volume_Change процент пустых значений в контрольной выборке: %1.52\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"print('\\nИсходный датасет: ')\n",
|
|||
|
"print(df[['Volume_Change']].isnull().sum())\n",
|
|||
|
"print('\\nОбучающая выборка: ')\n",
|
|||
|
"print(X_resampled[['Volume_Change']].isnull().sum())\n",
|
|||
|
"print('\\nТестовая выборка: ')\n",
|
|||
|
"print(X_test[['Volume_Change']].isnull().sum())\n",
|
|||
|
"print('\\nКонтрольная выборка: ')\n",
|
|||
|
"print(X_val[['Volume_Change']].isnull().sum())\n",
|
|||
|
"print()\n",
|
|||
|
"\n",
|
|||
|
"# Есть ли пустые значения признаков\n",
|
|||
|
"print('Есть ли пустые значения признаков: ')\n",
|
|||
|
"print('\\nИсходный датасет: ')\n",
|
|||
|
"print(df[['Volume_Change']].isnull().any())\n",
|
|||
|
"print('\\nОбучающая выорка: ')\n",
|
|||
|
"print(X_resampled[['Volume_Change']].isnull().any())\n",
|
|||
|
"print('\\nТестовая выборка: ')\n",
|
|||
|
"print(X_test[['Volume_Change']].isnull().any())\n",
|
|||
|
"print('\\nКонтрольная выборка: ')\n",
|
|||
|
"print(X_val[['Volume_Change']].isnull().any())\n",
|
|||
|
"print()\n",
|
|||
|
"\n",
|
|||
|
"# Проверка на бесконечные значения\n",
|
|||
|
"print(\"Количество бесконечных значений в каждом столбце:\")\n",
|
|||
|
"print('\\nИсходный датасет: ')\n",
|
|||
|
"print(np.isinf(df[['Volume_Change']]).sum())\n",
|
|||
|
"print('\\nОбучающая выборка: ')\n",
|
|||
|
"print(np.isinf(X_resampled[['Volume_Change']]).sum())\n",
|
|||
|
"print('\\nТестовая выборка: ')\n",
|
|||
|
"print(np.isinf(X_test[['Volume_Change']]).sum())\n",
|
|||
|
"print('\\nКонтрольная выборка: ')\n",
|
|||
|
"print(np.isinf(X_val[['Volume_Change']]).sum())\n",
|
|||
|
"\n",
|
|||
|
"# Процент пустых значений признаков\n",
|
|||
|
"for i in df[['Volume_Change']].columns:\n",
|
|||
|
" null_rate = df[['Volume_Change']][i].isnull().sum() / len(df[['Volume_Change']]) * 100\n",
|
|||
|
" print(f\"{i} процент пустых значений в датасете: %{null_rate:.2f}\")\n",
|
|||
|
"\n",
|
|||
|
"# Процент пустых значений признаков\n",
|
|||
|
"for i in X_resampled[['Volume_Change']].columns:\n",
|
|||
|
" null_rate = X_resampled[['Volume_Change']][i].isnull().sum() / len(X_resampled[['Volume_Change']]) * 100\n",
|
|||
|
" print(f\"{i} процент пустых значений в обучающей выборке: %{null_rate:.2f}\")\n",
|
|||
|
"\n",
|
|||
|
"# Процент пустых значений признаков\n",
|
|||
|
"for i in X_test[['Volume_Change']].columns:\n",
|
|||
|
" null_rate = X_test[['Volume_Change']][i].isnull().sum() / len(X_test[['Volume_Change']]) * 100\n",
|
|||
|
" print(f\"{i} процент пустых значений в тестовой выборке: %{null_rate:.2f}\")\n",
|
|||
|
"\n",
|
|||
|
"# Процент пустых значений признаков\n",
|
|||
|
"for i in X_val[['Volume_Change']].columns:\n",
|
|||
|
" null_rate = X_val[['Volume_Change']][i].isnull().sum() / len(X_val[['Volume_Change']]) * 100\n",
|
|||
|
" print(f\"{i} процент пустых значений в контрольной выборке: %{null_rate:.2f}\")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Заполним пустые данные"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 40,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"(5251, 1)\n",
|
|||
|
"(4232, 1)\n",
|
|||
|
"(1051, 1)\n",
|
|||
|
"(1051, 1)\n",
|
|||
|
"Volume_Change False\n",
|
|||
|
"dtype: bool\n",
|
|||
|
"Volume_Change False\n",
|
|||
|
"dtype: bool\n",
|
|||
|
"Volume_Change False\n",
|
|||
|
"dtype: bool\n",
|
|||
|
"Volume_Change False\n",
|
|||
|
"dtype: bool\n",
|
|||
|
" Volume_Change\n",
|
|||
|
"5246 -0.218393\n",
|
|||
|
"5247 0.631626\n",
|
|||
|
"5248 -0.302232\n",
|
|||
|
"5249 0.085465\n",
|
|||
|
"5250 -0.031733\n",
|
|||
|
" Volume_Change\n",
|
|||
|
"2435 0.977868\n",
|
|||
|
"1756 -0.142403\n",
|
|||
|
"3296 0.885768\n",
|
|||
|
"1243 -0.609255\n",
|
|||
|
"343 -0.401554\n",
|
|||
|
" Volume_Change\n",
|
|||
|
"3095 0.000000\n",
|
|||
|
"859 -0.963951\n",
|
|||
|
"3134 24.250355\n",
|
|||
|
"2577 1.722270\n",
|
|||
|
"378 -1.000000\n",
|
|||
|
" Volume_Change\n",
|
|||
|
"3095 0.000000\n",
|
|||
|
"859 -0.963951\n",
|
|||
|
"3134 24.250355\n",
|
|||
|
"2577 1.722270\n",
|
|||
|
"378 -1.000000\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"C:\\Users\\K\\AppData\\Local\\Temp\\ipykernel_21516\\2904461267.py:36: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!\n",
|
|||
|
"You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.\n",
|
|||
|
"A typical example is when you are setting values in a column of a DataFrame, like:\n",
|
|||
|
"\n",
|
|||
|
"df[\"col\"][row_indexer] = value\n",
|
|||
|
"\n",
|
|||
|
"Use `df.loc[row_indexer, \"col\"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.\n",
|
|||
|
"\n",
|
|||
|
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
|
|||
|
"\n",
|
|||
|
" df[['Volume_Change']].loc[df[\"Volume_Change\"] == 0, \"Volume_Change\"] = median_Volume_Change_df\n",
|
|||
|
"C:\\Users\\K\\AppData\\Local\\Temp\\ipykernel_21516\\2904461267.py:37: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!\n",
|
|||
|
"You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.\n",
|
|||
|
"A typical example is when you are setting values in a column of a DataFrame, like:\n",
|
|||
|
"\n",
|
|||
|
"df[\"col\"][row_indexer] = value\n",
|
|||
|
"\n",
|
|||
|
"Use `df.loc[row_indexer, \"col\"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.\n",
|
|||
|
"\n",
|
|||
|
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
|
|||
|
"\n",
|
|||
|
" X_resampled[['Volume_Change']].loc[X_resampled[\"Volume_Change\"] == 0, \"Volume_Change\"] = median_Volume_Change_train\n",
|
|||
|
"C:\\Users\\K\\AppData\\Local\\Temp\\ipykernel_21516\\2904461267.py:38: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!\n",
|
|||
|
"You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.\n",
|
|||
|
"A typical example is when you are setting values in a column of a DataFrame, like:\n",
|
|||
|
"\n",
|
|||
|
"df[\"col\"][row_indexer] = value\n",
|
|||
|
"\n",
|
|||
|
"Use `df.loc[row_indexer, \"col\"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.\n",
|
|||
|
"\n",
|
|||
|
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
|
|||
|
"\n",
|
|||
|
" X_test[['Volume_Change']].loc[X_test[\"Volume_Change\"] == 0, \"Volume_Change\"] = median_Volume_Change_test\n",
|
|||
|
"C:\\Users\\K\\AppData\\Local\\Temp\\ipykernel_21516\\2904461267.py:39: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!\n",
|
|||
|
"You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.\n",
|
|||
|
"A typical example is when you are setting values in a column of a DataFrame, like:\n",
|
|||
|
"\n",
|
|||
|
"df[\"col\"][row_indexer] = value\n",
|
|||
|
"\n",
|
|||
|
"Use `df.loc[row_indexer, \"col\"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.\n",
|
|||
|
"\n",
|
|||
|
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
|
|||
|
"\n",
|
|||
|
" X_val[['Volume_Change']].loc[X_val[\"Volume_Change\"] == 0, \"Volume_Change\"] = median_Volume_Change_val\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"# Заменяем бесконечные значения на NaN\n",
|
|||
|
"df.replace([np.inf, -np.inf], np.nan, inplace=True)\n",
|
|||
|
"X_resampled.replace([np.inf, -np.inf], np.nan, inplace=True)\n",
|
|||
|
"X_test.replace([np.inf, -np.inf], np.nan, inplace=True)\n",
|
|||
|
"X_val.replace([np.inf, -np.inf], np.nan, inplace=True)\n",
|
|||
|
"\n",
|
|||
|
"fillna_df = df[['Volume_Change']].fillna(0)\n",
|
|||
|
"fillna_X_resampled = X_resampled[['Volume_Change']].fillna(0)\n",
|
|||
|
"fillna_X_test = X_test[['Volume_Change']].fillna(0)\n",
|
|||
|
"fillna_X_val = X_val[['Volume_Change']].fillna(0)\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"print(fillna_df.shape)\n",
|
|||
|
"print(fillna_X_resampled.shape)\n",
|
|||
|
"print(fillna_X_test.shape)\n",
|
|||
|
"print(fillna_X_val.shape)\n",
|
|||
|
"\n",
|
|||
|
"print(fillna_df.isnull().any())\n",
|
|||
|
"print(fillna_X_resampled.isnull().any())\n",
|
|||
|
"print(fillna_X_test.isnull().any())\n",
|
|||
|
"print(fillna_X_val.isnull().any())\n",
|
|||
|
"\n",
|
|||
|
"# Замена пустых данных на 0\n",
|
|||
|
"df[\"Volume_Change\"] = df[\"Volume_Change\"].fillna(0)\n",
|
|||
|
"X_resampled[\"Volume_Change\"] = X_resampled[\"Volume_Change\"].fillna(0)\n",
|
|||
|
"X_test[\"Volume_Change\"] = X_test[\"Volume_Change\"].fillna(0)\n",
|
|||
|
"X_val[\"Volume_Change\"] = X_val[\"Volume_Change\"].fillna(0)\n",
|
|||
|
"\n",
|
|||
|
"# Вычисляем медиану для колонки \"Volume_Change\"\n",
|
|||
|
"median_Volume_Change_df = df[\"Volume_Change\"].median()\n",
|
|||
|
"median_Volume_Change_train = X_resampled[\"Volume_Change\"].median()\n",
|
|||
|
"median_Volume_Change_test = X_test[\"Volume_Change\"].median()\n",
|
|||
|
"median_Volume_Change_val = X_val[\"Volume_Change\"].median()\n",
|
|||
|
"\n",
|
|||
|
"# Заменяем значения 0 на медиану\n",
|
|||
|
"df[['Volume_Change']].loc[df[\"Volume_Change\"] == 0, \"Volume_Change\"] = median_Volume_Change_df\n",
|
|||
|
"X_resampled[['Volume_Change']].loc[X_resampled[\"Volume_Change\"] == 0, \"Volume_Change\"] = median_Volume_Change_train\n",
|
|||
|
"X_test[['Volume_Change']].loc[X_test[\"Volume_Change\"] == 0, \"Volume_Change\"] = median_Volume_Change_test\n",
|
|||
|
"X_val[['Volume_Change']].loc[X_val[\"Volume_Change\"] == 0, \"Volume_Change\"] = median_Volume_Change_val\n",
|
|||
|
"\n",
|
|||
|
"print(df[['Volume_Change']].tail())\n",
|
|||
|
"print(X_resampled[['Volume_Change']].tail())\n",
|
|||
|
"print(X_test[['Volume_Change']].tail())\n",
|
|||
|
"print(X_val[['Volume_Change']].tail())\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Удалим наблюдения с пропусками"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 41,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"(5251, 1)\n",
|
|||
|
"(4232, 1)\n",
|
|||
|
"(1051, 1)\n",
|
|||
|
"(1051, 1)\n",
|
|||
|
"Volume_Change False\n",
|
|||
|
"dtype: bool\n",
|
|||
|
" Volume_Change\n",
|
|||
|
"5246 -0.218393\n",
|
|||
|
"5247 0.631626\n",
|
|||
|
"5248 -0.302232\n",
|
|||
|
"5249 0.085465\n",
|
|||
|
"5250 -0.031733\n",
|
|||
|
"Volume_Change False\n",
|
|||
|
"dtype: bool\n",
|
|||
|
" Volume_Change\n",
|
|||
|
"2435 0.977868\n",
|
|||
|
"1756 -0.142403\n",
|
|||
|
"3296 0.885768\n",
|
|||
|
"1243 -0.609255\n",
|
|||
|
"343 -0.401554\n",
|
|||
|
"Volume_Change False\n",
|
|||
|
"dtype: bool\n",
|
|||
|
" Volume_Change\n",
|
|||
|
"3095 0.000000\n",
|
|||
|
"859 -0.963951\n",
|
|||
|
"3134 24.250355\n",
|
|||
|
"2577 1.722270\n",
|
|||
|
"378 -1.000000\n",
|
|||
|
"Volume_Change False\n",
|
|||
|
"dtype: bool\n",
|
|||
|
" Volume_Change\n",
|
|||
|
"3095 0.000000\n",
|
|||
|
"859 -0.963951\n",
|
|||
|
"3134 24.250355\n",
|
|||
|
"2577 1.722270\n",
|
|||
|
"378 -1.000000\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"dropna_df = df[['Volume_Change']].dropna()\n",
|
|||
|
"dropna_X_resampled = X_resampled[['Volume_Change']].dropna()\n",
|
|||
|
"dropna_X_test = X_test[['Volume_Change']].dropna()\n",
|
|||
|
"dropna_X_val = X_val[['Volume_Change']].dropna()\n",
|
|||
|
"\n",
|
|||
|
"print(dropna_df.shape)\n",
|
|||
|
"print(dropna_X_resampled.shape)\n",
|
|||
|
"print(dropna_X_test.shape)\n",
|
|||
|
"print(dropna_X_val.shape)\n",
|
|||
|
"\n",
|
|||
|
"print(dropna_df.isnull().any())\n",
|
|||
|
"print(df[['Volume_Change']].tail())\n",
|
|||
|
"print(dropna_X_resampled.isnull().any())\n",
|
|||
|
"print(X_resampled[['Volume_Change']].tail())\n",
|
|||
|
"print(dropna_X_test.isnull().any())\n",
|
|||
|
"print(X_test[['Volume_Change']].tail())\n",
|
|||
|
"print(dropna_X_val.isnull().any())\n",
|
|||
|
"print(X_val[['Volume_Change']].tail())"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"##### Масштабируем новые признаки:\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 42,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Результаты после масштабирования:\n",
|
|||
|
"\n",
|
|||
|
" Датафрейм:\n",
|
|||
|
" Volume_Change\n",
|
|||
|
"5246 -0.176620\n",
|
|||
|
"5247 0.224373\n",
|
|||
|
"5248 -0.216171\n",
|
|||
|
"5249 -0.033276\n",
|
|||
|
"5250 -0.088564\n",
|
|||
|
"\n",
|
|||
|
" Обучающая:\n",
|
|||
|
" Volume_Change\n",
|
|||
|
"2435 -0.033736\n",
|
|||
|
"1756 -0.033805\n",
|
|||
|
"3296 -0.033742\n",
|
|||
|
"1243 -0.033834\n",
|
|||
|
"343 -0.033821\n",
|
|||
|
"\n",
|
|||
|
" Тестовая:\n",
|
|||
|
" Volume_Change\n",
|
|||
|
"3095 -0.033796\n",
|
|||
|
"859 -0.033856\n",
|
|||
|
"3134 -0.032301\n",
|
|||
|
"2577 -0.033690\n",
|
|||
|
"378 -0.033858\n",
|
|||
|
"\n",
|
|||
|
" Контрольная:\n",
|
|||
|
" Volume_Change\n",
|
|||
|
"3095 -0.033796\n",
|
|||
|
"859 -0.033856\n",
|
|||
|
"3134 -0.032301\n",
|
|||
|
"2577 -0.033690\n",
|
|||
|
"378 -0.033858\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from sklearn.preprocessing import StandardScaler, MinMaxScaler\n",
|
|||
|
"\n",
|
|||
|
"# Пример масштабирования числовых признаков\n",
|
|||
|
"numerical_features = ['Volume_Change']\n",
|
|||
|
"\n",
|
|||
|
"scaler = StandardScaler()\n",
|
|||
|
"df[numerical_features] = scaler.fit_transform(df[numerical_features])\n",
|
|||
|
"X_resampled[numerical_features] = scaler.fit_transform(X_resampled[numerical_features])\n",
|
|||
|
"X_val[numerical_features] = scaler.transform(X_val[numerical_features])\n",
|
|||
|
"X_test[numerical_features] = scaler.transform(X_test[numerical_features])\n",
|
|||
|
"\n",
|
|||
|
"# Вывод результатов после масштабирования\n",
|
|||
|
"print(\"Результаты после масштабирования:\")\n",
|
|||
|
"print(\"\\n Датафрейм:\")\n",
|
|||
|
"print(df[numerical_features].tail())\n",
|
|||
|
"print(\"\\n Обучающая:\")\n",
|
|||
|
"print(X_resampled[numerical_features].tail())\n",
|
|||
|
"print(\"\\n Тестовая:\")\n",
|
|||
|
"print(X_val[numerical_features].tail())\n",
|
|||
|
"print(\"\\n Контрольная:\")\n",
|
|||
|
"print(X_test[numerical_features].tail())"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Данные признаки предоставляют важную информацию о текущем тренде и возможных изменениях в будущих ценах. Положительные значения Price_Change и Percentage_Change, наряду с высоким Volume_Change, могут поддерживать гипотезу о росте цен на акции.\n",
|
|||
|
"\n",
|
|||
|
"Также, эти признаки помогают понять уровень рискованности инвестиций. Высокие значения Price_Range и резкие изменения в Volume_Change могут указывать на склонность к большим колебаниям, что требует внимательного управления рисками."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Применим featuretools для конструирования признаков:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 43,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
|
|||
|
" pd.to_datetime(\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Empty DataFrame\n",
|
|||
|
"Columns: [Date, Open, High, Low, Close, Adj Close, Volume, Volume_Change, id]\n",
|
|||
|
"Index: []\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
|
|||
|
" warnings.warn(\n",
|
|||
|
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
|
|||
|
" warnings.warn(\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
" Open High Low Close Adj Close Volume Close_Disc \\\n",
|
|||
|
"id \n",
|
|||
|
"0 3.428571 3.428571 3.428571 3.428571 2.806002 0.0 2-4 \n",
|
|||
|
"1 3.428571 3.428571 3.428571 3.428571 2.806002 0.0 2-4 \n",
|
|||
|
"2 3.714286 3.714286 3.714286 3.714286 3.039837 0.0 2-4 \n",
|
|||
|
"3 3.714286 3.714286 3.714286 3.714286 3.039837 0.0 2-4 \n",
|
|||
|
"4 3.714286 3.714286 3.714286 3.714286 3.039837 0.0 2-4 \n",
|
|||
|
"\n",
|
|||
|
" Volume_Change DAY(Date) MONTH(Date) WEEKDAY(Date) YEAR(Date) \n",
|
|||
|
"id \n",
|
|||
|
"0 -0.073594 22 6 4 2001 \n",
|
|||
|
"1 -0.073594 25 6 0 2001 \n",
|
|||
|
"2 -0.073594 26 6 1 2001 \n",
|
|||
|
"3 -0.073594 27 6 2 2001 \n",
|
|||
|
"4 -0.073594 28 6 3 2001 \n",
|
|||
|
" Open High Low Close Adj Close Volume Volume_Change \\\n",
|
|||
|
"id \n",
|
|||
|
"0 5.66 5.73 5.47 5.56 5.341250 23355100.0 -0.033796 \n",
|
|||
|
"20 5.15 5.15 5.02 5.13 4.966732 15906300.0 -0.033816 \n",
|
|||
|
"21 10.60 10.65 10.48 10.52 8.794909 10456400.0 -0.033817 \n",
|
|||
|
"24 5.47 5.80 5.47 5.75 5.541336 12929600.0 -0.033782 \n",
|
|||
|
"28 6.15 6.16 5.98 6.04 5.847770 15080900.0 -0.033786 \n",
|
|||
|
"\n",
|
|||
|
" DAY(Date) MONTH(Date) WEEKDAY(Date) YEAR(Date) \n",
|
|||
|
"id \n",
|
|||
|
"0 8 7 2 2020 \n",
|
|||
|
"20 19 1 1 2021 \n",
|
|||
|
"21 8 4 3 2010 \n",
|
|||
|
"24 7 12 0 2020 \n",
|
|||
|
"28 5 1 1 2021 \n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\featuretools\\computational_backends\\feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
|
|||
|
" df = pd.concat([df, default_df], sort=True)\n",
|
|||
|
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\featuretools\\computational_backends\\feature_set_calculator.py:143: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
|
|||
|
" df = pd.concat([df, default_df], sort=True)\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import featuretools as ft\n",
|
|||
|
"\n",
|
|||
|
"df['id'] = df.index \n",
|
|||
|
"X_resampled['id'] = X_resampled.index\n",
|
|||
|
"X_val['id'] = X_val.index\n",
|
|||
|
"X_test['id'] = X_test.index\n",
|
|||
|
" # Добавляем уникальный идентификатор\n",
|
|||
|
"# Предобработка данных (например, кодирование категориальных признаков, удаление дубликатов)\n",
|
|||
|
"# Удаление дубликатов по идентификатору\n",
|
|||
|
"df = df.drop_duplicates(subset='id')\n",
|
|||
|
"duplicates = X_resampled[X_resampled['id'].duplicated(keep=False)]\n",
|
|||
|
"\n",
|
|||
|
"# Удаление дубликатов из столбца \"id\", сохранив первое вхождение\n",
|
|||
|
"df = df.drop_duplicates(subset='id', keep='first')\n",
|
|||
|
"\n",
|
|||
|
"print(duplicates)\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"# Создание EntitySet\n",
|
|||
|
"es = ft.EntitySet(id='stock_data')\n",
|
|||
|
"\n",
|
|||
|
"# Добавление датафрейма с домами\n",
|
|||
|
"es = es.add_dataframe(dataframe_name='stocks', dataframe=df, index='id')\n",
|
|||
|
"\n",
|
|||
|
"# Генерация признаков с помощью глубокой синтезы признаков\n",
|
|||
|
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='stocks', max_depth=2)\n",
|
|||
|
"\n",
|
|||
|
"# Выводим первые 5 строк сгенерированного набора признаков\n",
|
|||
|
"print(feature_matrix.head())\n",
|
|||
|
"\n",
|
|||
|
"X_resampled = X_resampled.drop_duplicates(subset='id')\n",
|
|||
|
"X_resampled = X_resampled.drop_duplicates(subset='id', keep='first') # or keep='last'\n",
|
|||
|
"\n",
|
|||
|
"# Определение сущностей (Создание EntitySet)\n",
|
|||
|
"es = ft.EntitySet(id='stock_data')\n",
|
|||
|
"\n",
|
|||
|
"es = es.add_dataframe(dataframe_name='stocks', dataframe=X_resampled, index='id')\n",
|
|||
|
"\n",
|
|||
|
"# Генерация признаков\n",
|
|||
|
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='stocks', max_depth=2)\n",
|
|||
|
"\n",
|
|||
|
"# Преобразование признаков для контрольной и тестовой выборок\n",
|
|||
|
"val_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=X_val.index)\n",
|
|||
|
"test_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=X_test.index)\n",
|
|||
|
"\n",
|
|||
|
"print(feature_matrix.head())\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Система сгенерировала следующие признаки:\n",
|
|||
|
"1. **Open, High, Low, Close, Adj Close**: Это стандартные финансовые параметры акций, отражающие цены открытия, максимальные, минимальные и закрытия за определенный период.\n",
|
|||
|
"**Volume**: Объем торгов акциями, который показывает, сколько акций было куплено/продано за определенный период.\n",
|
|||
|
"\n",
|
|||
|
"2. Сложные признаки:\n",
|
|||
|
"**Close_Disc**: Это диапазон цены закрытия.\n",
|
|||
|
"**Price_Change**: Изменение цены, т.е. разница между ценой закрытия и ценой открытия акций.\n",
|
|||
|
"**Percentage_Change**: Процентное изменение цен, которое позволяет оценить относительное изменение стоимости акций.\n",
|
|||
|
"**Average_Price**: Средняя цена акций за указанный период. Этот показатель может быть использован для оценки общей тенденции рынка.\n",
|
|||
|
"\n",
|
|||
|
"3. Также произошло разбиение даты на месяц, день недели и год, что может помочь в анализе сезонных и временных закономерностей."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Оценим качество каждого набора признаков:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 44,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjsAAAHHCAYAAABZbpmkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABLf0lEQVR4nO3dfVwVdf7//+fhGlEgVEATEa8v0kxKJWstRdEsc7U1XUtLP1mGldqay34yryrTrswyq/2YVmZtVlpZaWpqrqJ5WV4Qqy2KqWBogKhcCO/fH/04345cKIeDB8bH/XY7t5tn5j0zrzlzxvNk5j0zNmOMEQAAgEV5uLsAAACAqkTYAQAAlkbYAQAAlkbYAQAAlkbYAQAAlkbYAQAAlkbYAQAAlkbYAQAAlkbYAQAAlkbYAYBq4tlnn1VRUZEkqaioSDNnznRzRaiIvXv3avny5fb3u3fv1pdffum+gmBH2EGVWbRokWw2m/3l5+enli1bauzYsUpPT3d3eUC188477+iFF17QL7/8ohdffFHvvPOOu0tCBZw+fVoPPvigtmzZogMHDuixxx7Tnj173F0WJHm5uwBY3/Tp0xUVFaXc3Fz9+9//1vz58/XVV19p7969qlWrlrvLA6qN6dOna/jw4Zo0aZJ8fX21ePFid5eECoiJibG/JKlly5Z64IEH3FwVJMnGg0BRVRYtWqT7779f27Zt0/XXX28f/vjjj+ull17SkiVLNHToUDdWCFQ/J06c0MGDB9WiRQvVr1/f3eXACfv379e5c+fUvn17+fj4uLsciNNYcIMePXpIklJSUiRJp06d0t/+9je1b99etWvXVmBgoPr27asffvihxLS5ubmaOnWqWrZsKT8/PzVo0EADBw7Uzz//LEk6dOiQw6mzC1+33HKLfV7r16+XzWbTv/71L/3jH/9QeHi4AgIC1L9/fx05cqTEsrdu3ao+ffooKChItWrVUvfu3bVp06ZS1/GWW24pdflTp04t0Xbx4sWKjo6Wv7+/QkJCNGTIkFKXX966/VFRUZHmzJmjdu3ayc/PT2FhYXrwwQf122+/ObRr0qSJbr/99hLLGTt2bIl5llb7888/X+IzlaS8vDxNmTJFzZs3l6+vryIiIvTEE08oLy+v1M/qj2655RZdc801JYa/8MILstlsOnTokMPwzMxMjRs3ThEREfL19VXz5s01a9Yse7+XP5o6dWqpn919993n0O7o0aMaOXKkwsLC5Ovrq3bt2untt992aFP83Sl++fr6qmXLlpo5c6Yu/Ptx165d6tu3rwIDA1W7dm317NlTW7ZscWhTfMr30KFDCg0N1Y033qi6deuqQ4cOstlsWrRoUbmf24WnjC/2vavIOrpy/yjeBqGhoSooKHAY98EHH9jrzcjIcBj39ddf6+abb1ZAQIDq1Kmjfv36ad++fQ5t7rvvPtWuXbtEXR9//LFsNpvWr19vH1bR79nrr7+udu3aydfXVw0bNlR8fLwyMzMd2txyyy32faFt27aKjo7WDz/8UOo+isuP01i47IqDSd26dSVJ//3vf7V8+XL95S9/UVRUlNLT0/Xmm2+qe/fu2r9/vxo2bChJKiws1O233661a9dqyJAheuyxx3T69GmtXr1ae/fuVbNmzezLGDp0qG677TaH5SYkJJRazzPPPCObzaZJkybpxIkTmjNnjmJjY7V79275+/tLkr799lv17dtX0dHRmjJlijw8PLRw4UL16NFDGzduVOfOnUvMt1GjRvYOpjk5ORozZkypy548ebIGDx6s//mf/9Gvv/6qV199VX/605+0a9cuBQcHl5hm9OjRuvnmmyVJn376qZYtW+Yw/sEHH7QfVXv00UeVkpKi1157Tbt27dKmTZvk7e1d6udQEZmZmaV2ni0qKlL//v3173//W6NHj1abNm20Z88evfzyy/rPf/7j0Hmzss6ePavu3bvr6NGjevDBB9W4cWNt3rxZCQkJOn78uObMmVPqdO+995793+PHj3cYl56erq5du8pms2ns2LGqX7++vv76a40aNUrZ2dkaN26cQ/t//OMfatOmjc6dO2cPBaGhoRo1apQkad++fbr55psVGBioJ554Qt7e3nrzzTd1yy23aMOGDerSpUuZ6/fee+9VuL9H8SnjYqV97yq6jlWxf5w+fVorVqzQn//8Z/uwhQsXys/PT7m5uSU+hxEjRiguLk6zZs3S2bNnNX/+fN10003atWuXmjRpUqHPqKKmTp2qadOmKTY2VmPGjFFycrLmz5+vbdu2XXR/mjRpUpXWhgowQBVZuHChkWTWrFljfv31V3PkyBHz4Ycfmrp16xp/f3/zyy+/GGOMyc3NNYWFhQ7TpqSkGF9fXzN9+nT7sLfffttIMi+99FKJZRUVFdmnk2Sef/75Em3atWtnunfvbn+/bt06I8lcffXVJjs72z78o48+MpLMK6+8Yp93ixYtTFxcnH05xhhz9uxZExUVZXr16lViWTfeeKO55ppr7O9//fVXI8lMmTLFPuzQoUPG09PTPPPMMw7T7tmzx3h5eZUYfuDAASPJvPPOO/ZhU6ZMMX/cjTdu3Ggkmffff99h2pUrV5YYHhkZafr161ei9vj4eHPhfw0X1v7EE0+Y0NBQEx0d7fCZvvfee8bDw8Ns3LjRYfo33njDSDKbNm0qsbw/6t69u2nXrl2J4c8//7yRZFJSUuzDZsyYYQICAsx//vMfh7Z///vfjaenp0lNTXUY/r//+7/GZrM5DIuMjDQjRoywvx81apRp0KCBycjIcGg3ZMgQExQUZM6ePWuM+X/fnXXr1tnb5ObmGg8PD/Pwww/bhw0YMMD4+PiYn3/+2T7s2LFjpk6dOuZPf/qTfVjxvlK8frm5uaZx48amb9++RpJZuHBhyQ/rD4qn37Ztm8Pw0r53FV1HV+4fxd/XoUOHmttvv90+/PDhw8bDw8MMHTrUSDK//vqrMcaY06dPm+DgYPPAAw841JqWlmaCgoIcho8YMcIEBASU+GyWLl1aYltd6vfsxIkTxsfHx/Tu3dvh/6jXXnvNSDJvv/22wzz/uC989dVXRpLp06dPif0Jlx+nsVDlYmNjVb9+fUVERGjIkCGqXbu2li1bpquvvlqS5OvrKw+P37+KhYWFOnnypGrXrq1WrVpp586d9vl88sknqlevnh555JESy6jMYeLhw4erTp069vd33XWXGjRooK+++krS75ePHjhwQH/961918uRJZWRkKCMjQ2fOnFHPnj313XfflThtkpubKz8/v3KX++mnn6qoqEiDBw+2zzMjI0Ph4eFq0aKF1q1b59A+Pz9f0u+fV1mWLl2qoKAg9erVy2Ge0dHRql27dol5FhQUOLTLyMgo8Zf1hY4ePapXX31VkydPLnHaYOnSpWrTpo1at27tMM/iU5cXLr8yli5dqptvvllXXXWVw7JiY2NVWFio7777zqF9fn5+uZ+dMUaffPKJ7rjjDhljHOYZFxenrKwsh++jJGVlZSkjI0OpqamaPXu2ioqK7OtaWFiob775RgMGDFDTpk3t0zRo0EB//etf9e9//1vZ2dml1jJv3jydPHlSU6ZMcfbjcdk6VsX+MXLkSK1cuVJpaWmSfr8KLSYmRi1btnRot3r1amVmZmro0KEOtXp6eqpLly6lfp8u/D6fPn261M+isLCwRNuzZ886tFmzZo3y8/M1btw4+/9RkvTAAw8oMDCwzMvKjTFKSEjQoEGDyj16h8uH01iocvPmzVPLli3l5eWlsLAwtWrVyuE/jqKiIr3yyit6/fXXlZKSosLCQvu44lNd0u+nv1q1aiUvL9d+bVu0aOHw3mazqXnz5vbz9gcOHJAkjRgxosx5ZGVl6aqrrrK/z8jIKDHfCx04cEDGmDLbXXh4vLiPQGn9Ev44z6ysLIWGhpY6/sSJEw7vv/nmmwp3gp0yZYoaNmyoBx98UB9//HGJ5SclJZU5zwuXXxkHDhzQjz/+eMnLyszMLPez+/XXX5WZmam33npLb7311iXNc8CAAfZ/e3h46Mknn9SgQYPs8zt79qxatWpVYj5t2rRRUVGRjhw5onbt2jm
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 640x480 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Размер feature_matrix: 4232\n",
|
|||
|
"Размер y_train_categories: 4232\n",
|
|||
|
"Коэффициент детерминации R²: 1.00\n",
|
|||
|
"Время обучения модели: 45.54 секунд\n",
|
|||
|
"Среднеквадратичная ошибка: 0.00\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import time\n",
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"from sklearn.metrics import r2_score\n",
|
|||
|
"from sklearn.linear_model import LinearRegression\n",
|
|||
|
"from sklearn.metrics import mean_squared_error\n",
|
|||
|
"from sklearn.model_selection import cross_val_score\n",
|
|||
|
"\n",
|
|||
|
"# Разделение данных на обучающую и валидационную выборки. Удаляем целевую переменную\n",
|
|||
|
"y = feature_matrix['Close']\n",
|
|||
|
"X = feature_matrix.drop('Close', axis=1)\n",
|
|||
|
"\n",
|
|||
|
"plt.hist(y, bins=30, edgecolor='k')\n",
|
|||
|
"plt.title('Распределение целевой переменной')\n",
|
|||
|
"plt.xlabel('Close Price')\n",
|
|||
|
"plt.ylabel('Количество')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"print(\"Размер feature_matrix: \", feature_matrix.shape[0])\n",
|
|||
|
"print(\"Размер y_train_categories: \", y.shape[0])\n",
|
|||
|
"\n",
|
|||
|
"# One-hot encoding для категориальных переменных (преобразование категориальных объектов в числовые)\n",
|
|||
|
"X = pd.get_dummies(X, drop_first=True)\n",
|
|||
|
"\n",
|
|||
|
"# Проверяем, есть ли пропущенные значения, и заполняем их медианой или другим подходящим значением\n",
|
|||
|
"X.fillna(X.median(), inplace=True)\n",
|
|||
|
"\n",
|
|||
|
"X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Обучение модели\n",
|
|||
|
"model = LinearRegression()\n",
|
|||
|
"\n",
|
|||
|
"# Начинаем отсчет времени\n",
|
|||
|
"start_time = time.time()\n",
|
|||
|
"model.fit(X_train, y_train)\n",
|
|||
|
"\n",
|
|||
|
"# Время обучения модели\n",
|
|||
|
"train_time = time.time() - start_time\n",
|
|||
|
"\n",
|
|||
|
"# Предсказания и оценка модели и вычисляем среднеквадратичную ошибку\n",
|
|||
|
"predictions = model.predict(X_val)\n",
|
|||
|
"mse = mean_squared_error(y_val, predictions)\n",
|
|||
|
"\n",
|
|||
|
"r2 = r2_score(y_val, predictions)\n",
|
|||
|
"print(f'Коэффициент детерминации R²: {r2:.2f}')\n",
|
|||
|
"\n",
|
|||
|
"print(f'Время обучения модели: {train_time:.2f} секунд')\n",
|
|||
|
"print(f'Среднеквадратичная ошибка: {mse:.2f}')"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"В данном случае среднеквадратичные ошибки как в случае с контрольной выборкой, так и в случае с тестовой достаточно малы, что может значит о том, что предсказания модели близки к реальным значениям."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 45,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\metrics\\_regression.py:492: FutureWarning: 'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.\n",
|
|||
|
" warnings.warn(\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"\n",
|
|||
|
"RMSE: 0.09582857422264315\n",
|
|||
|
"R²: 0.9995934815979668\n",
|
|||
|
"MAE: 0.05673237514757995 \n",
|
|||
|
"\n",
|
|||
|
"Кросс-валидация RMSE: 0.10266281621290554 \n",
|
|||
|
"\n",
|
|||
|
"Train RMSE: 0.03608662827625366\n",
|
|||
|
"Train R²: 0.9999422411727147\n",
|
|||
|
"Train MAE: 0.022020514230428674\n",
|
|||
|
"\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"c:\\Users\\K\\source\\repos\\AIM-PIbd-31-Ievlewa-M-D\\aimenv\\Lib\\site-packages\\sklearn\\metrics\\_regression.py:492: FutureWarning: 'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.\n",
|
|||
|
" warnings.warn(\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA1cAAAIjCAYAAADvBuGTAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAADZ90lEQVR4nOzdeXgV1fnA8e/M3DX3JjcJ2UPYwiq7oChgRUEWBauidS0IVG1Fa7XV6q+tS7W11daltXUX9926L4i4I7hvqCwJIJBA9u3ud2bO74+YKzEBCQYT4P085nm8Z86cOTPJDffNOec9mlJKIYQQQgghhBDiB9G7ugNCCCGEEEIIsTeQ4EoIIYQQQgghOoEEV0IIIYQQQgjRCSS4EkIIIYQQQohOIMGVEEIIIYQQQnQCCa6EEEIIIYQQohNIcCWEEEIIIYQQnUCCKyGEEEIIIYToBBJcCSGEEEIIIUQnkOBKCCGEEEJ0qc2bN3P33XcnX2/YsIEHHnig6zokxC6S4EqIPcjpp5+O3+/v6m4IIYQQnUrTNBYuXMjixYvZsGEDF110EW+99VZXd0uIDnN0dQeEEDtWU1PDAw88wFtvvcWbb75JJBJh+vTpjB49mp/97GeMHj26q7sohBBC/CCFhYWcccYZTJ8+HYD8/Hxef/31ru2UELtAU0qpru6EEKJ9Dz/8MGeccQbBYJA+ffqQSCTYunUro0eP5tNPPyWRSDB37lxuu+02XC5XV3dXCCGE+EFKS0uprq5m2LBh+Hy+ru6OEB0m0wKF6KaWLVvGaaedRl5eHsuWLWP9+vVMmTIFj8fD+++/T3l5OSeffDL33HMP559/fqtz//GPfzB+/Hh69OiB1+tlzJgxPP74422uoWkal19+efK1aZoceeSRZGZm8uWXXybr7Ohr0qRJALz++utomtbmL41HHXVUm+tMmjQpeV6LDRs2oGlaqzn3AKtWreL4448nMzMTj8fD2LFjeeaZZ9rcS319Peeffz59+vTB7XbTs2dP5syZQ3V19Xb7V15eTp8+fRg7dizBYBCAeDzOpZdeypgxYwgEAvh8Pg455BBee+21NtesrKxkwYIF9OrVC8Mwks9kZ6Zu9unTh5kzZ7YpP+ecc9A0rU15WVkZ8+fPJzc3F7fbzdChQ7nrrrta1Wm5x/a+136/n9NPPz35ura2lt/97ncMHz4cv99PWloaM2bM4NNPP/3evsOOfy769OnTqm4oFOK3v/0tRUVFuN1uBg0axD/+8Q929m977777LkceeSQZGRn4fD5GjBjBjTfemDzeMl123bp1TJs2DZ/PR0FBAX/+85/bXKMj742WL8MwKCws5Mwzz6S+vj5ZpyPPG5p/Rn/zm98kn0P//v35+9//jm3byTot74N//OMfbdocNmxYq/dNR95zd999N5qmsWHDhmTZ4sWLGT9+PCkpKQQCAWbOnMnKlSvbXLc90WiUyy+/nIEDB+LxeMjPz+e4446jtLR0h+f16dNnhz8729I0jXPOOYcHHniAQYMG4fF4GDNmDG+++Wabdj/++GNmzJhBWloafr+fyZMns2LFilZ1Wp5Be1+bN28Gtj/1+vHHH2/3WT/22GOMGTMGr9dLVlYWp512GmVlZa3qXH755ey3337J99lBBx3EU0891apOe78T33///V1+Lq+99hqapvHkk0+2uZcHH3wQTdNYvnx5smxnfs+2PD+Xy0VVVVWrY8uXL0/29YMPPujwMzr99NOTvzeKi4sZN24ctbW1eL3eNj+3QnR3Mi1QiG7qb3/7G7Zt8/DDDzNmzJg2x7Oysrj33nv58ssvufXWW7nsssvIyckB4MYbb+Too4/m1FNPJR6P8/DDD3PCCSfw3HPPcdRRR233mr/4xS94/fXXWbJkCfvttx8A9913X/L4W2+9xW233cb1119PVlYWALm5udtt78033+SFF17YpfsH+OKLL5gwYQKFhYVcfPHF+Hw+Hn30UY455hieeOIJjj32WACCwSCHHHIIX331FfPnz2f//fenurqaZ555hs2bNyf7uq2GhgZmzJiB0+nkhRdeSH6gamxs5I477uDkk0/mjDPOoKmpiTvvvJNp06bx3nvvMWrUqGQbc+fO5ZVXXuHcc89l5MiRGIbBbbfdxkcffbTL99yeiooKDjrooOSHquzsbF588UUWLFhAY2Mjv/nNbzrc5rp163jqqac44YQT6Nu3LxUVFdx6660ceuihfPnllxQUFHxvG0cccQRz5sxpVfbPf/6Turq65GulFEcffTSvvfYaCxYsYNSoUSxevJgLL7yQsrIyrr/++h1eY8mSJcycOZP8/HzOO+888vLy+Oqrr3juuec477zzkvUsy2L69OkcdNBBXHPNNbz00ktcdtllmKbJn//852S9jrw3jj32WI477jhM02T58uXcdtttRCKRVu+JnRUOhzn00EMpKyvjrLPOolevXrzzzjtccsklbNmyhRtuuKHDbbZnZ99zb731FkceeSS9e/fmsssuI5FI8N///pcJEybw/vvvM3DgwO2ea1kWM2fOZOnSpZx00kmcd955NDU1sWTJElauXElxcfEOrz1q1Ch++9vftiq79957WbJkSZu6b7zxBo888gi//vWvcbvd/Pe//2X69Om89957DBs2DGj+PXHIIYeQlpbGRRddhNPp5NZbb2XSpEm88cYbjBs3rlWbf/7zn+nbt2+rsszMzB32uT1333038+bN44ADDuDqq6+moqKCG2+8kWXLlvHxxx+Tnp4ONP9x4dhjj6VPnz5EIhHuvvtuZs+ezfLlyznwwAO32/7vf//77R77vucyadIkioqKeOCBB5K/J1s88MADFBcXc/DBBwM7/3u2hWEY3H///a3+qLdo0SI8Hg/RaHSXnlF7Lr300jbtCbFHUEKIbikzM1P17t27VdncuXOVz+drVfanP/1JAerZZ59NloXD4VZ14vG4GjZsmDr88MNblQPqsssuU0opdckllyjDMNRTTz213T4tWrRIAWr9+vVtjr322msKUK+99lqybNy4cWrGjBmtrqOUUocddpj6yU9+0ur89evXK0AtWrQoWTZ58mQ1fPhwFY1Gk2W2bavx48erAQMGJMsuvfRSBaj//e9/bfpl23ab/kWjUTVp0iSVk5OjSkpKWtU3TVPFYrFWZXV1dSo3N1fNnz8/WRaJRJSu6+qss85qVbe971F7evfurY466qg25QsXLlTf/dW8YMEClZ+fr6qrq1uVn3TSSSoQCCS/3y33+Nhjj7Vp1+fzqblz5yZfR6NRZVlWqzrr169Xbrdb/fnPf/7e/gNq4cKFbcqPOuqoVj+3Tz31lALUVVdd1are8ccfrzRNa/P8t2Wapurbt6/q3bu3qqura3Ws5fuqVPMzB9S5557b6vhRRx2lXC6XqqqqSpbvynujxfjx49V+++2XfN2R533llVcqn8+n1qxZ06rexRdfrAzDUBs3blRKffs+uPbaa9u0OXToUHXooYe2uf7OvOe++94dM2aMCgQCauvWrck6a9asUU6nU82ePbvNtbd11113KUBdd911bY5t+31pT0d+7gEFqA8++CBZ9vXXXyuPx6OOPfbYZNkxxxyjXC6XKi0tTZaVl5er1NTUVr9nWp7B+++/v93+be/9+9hjj7V61vF4XOXk5Khhw4apSCSSrPfcc88pQF166aXbvUZlZaUC1D/+8Y9k2aGHHtrqe/vCCy8oQE2fPn2Xn8sll1yi3G63qq+vb3Vth8PR6mdjZ3/Ptjy/k08+WQ0fPjxZHgqFVFpamjrllFNaPd+OPKO5c+e2+r2xcuVKpet68me5vX9zhOiuZFqgEN1UU1NTciRqR1pGjhobG5NlXq83+f91dXU0NDRwyCGHbHdE5aabbuLqq6/mX//6Fz/96U9/YM+b/e9//+P999/nb3/7W5tjOTk5yWk421NbW8urr77Kz372M5qamqiurqa6upqamhqmTZvG2rVrk1NLnnjiCUaOHNnmL6xAmyk1tm0zZ84cVqxYwQsvvNDmr+yGYSTXr9m2TW1tLaZpMnbs2FbPLxQKYds2PXr02LkHsouUUjz
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1000x600 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA1oAAAMLCAYAAABXcObMAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3hT1f8H8Hd22jSrg7bsVqAFGWVTGUVoQaYgAgoqKIoCCoqK4CogigMVRHB8fyioqIAgDpApIMhGyrItLd2lg+6ZtE3y+6OQEJpUihmA79fz5HnIzTn3ns89Lb3nfs49EZhMJhOIiIiIiIjIYYTubgAREREREdHthgMtIiIiIiIiB+NAi4iIiIiIyME40CIiIiIiInIwDrSIiIiIiIgcjAMtIiIiIiIiB+NAi4iIiIiIyME40CIiIiIiInIwDrSIiIiIiIgcjAMtIiIiIiIiB+NAi4joGqtXr4ZAIMDx48frfPa///0PAoEAo0aNgsFgcEPriIiI6FbAgRYR0XX68ccfMW3aNPTt2xfff/89RCKRu5tERERENykOtIiIrsPevXvx4IMPol27dvjll18gl8vd3SQiIiK6iXGgRUT0D2JiYnDvvfciMDAQ27dvh1qtrlNmw4YN6Nq1Kzw8PODr64uHHnoImZmZVmUmT54MLy8vJCUlYfDgwVAoFGjcuDEWLlwIk8lkLpeSkgKBQIAlS5bgww8/RIsWLeDh4YGIiAicPXu2zrHj4uJw//33w9vbG3K5HN26dcPPP/9sM5b+/ftDIBDUea1evdqq3CeffIL27dvD09PTqtwPP/xgta/27dvXOcaSJUsgEAiQkpJi3nZlOubV24xGIzp27Gjz+L///jv69u0LhUIBjUaDe++9F7GxsVZl5s+fD4FAgLy8PKvtx48fr7PPK+f+Wj/88AMEAgH27t1r3rZ//36MHTsWzZs3h0wmQ7NmzfDcc8+hsrLSZv1u3bpBqVRanaclS5bUKXu1K+dDKpXi0qVLVp8dOnTIvJ+rp69eT7smT55ss3+vfl3pg5YtW2L48OHYsWMHwsLCIJfL0a5dO2zatMlmW6+n7xpynquqqvD666+ja9euUKvVUCgU6Nu3L/bs2VPvuSMiulWI3d0AIqKb2YULF3DPPfdAJpNh+/btCAwMrFNm9erVePTRR9G9e3csXrwYOTk5WLZsGf7880+cPHkSGo3GXNZgMOCee+5Br1698O6772Lbtm2Ijo5GTU0NFi5caLXfr776CqWlpZgxYwZ0Oh2WLVuGAQMG4MyZM/D39wcAnDt3Dr1790aTJk0wd+5cKBQKrF+/HqNGjcLGjRsxevToOu0NDQ3FK6+8AgDIy8vDc889Z/X5unXrMH36dPTv3x/PPPMMFAoFYmNj8dZbb/3b02nl66+/xpkzZ+ps37VrF4YMGYLg4GDMnz8flZWVWL58OXr37o2//voLLVu2dGg7rrVhwwZUVFRg2rRp8PHxwdGjR7F8+XJkZGRgw4YN5nKHDh3CuHHj0KlTJ7z99ttQq9U2z2d9RCIRvvnmG6s6X375JeRyOXQ6XYPb9eSTTyIyMtJc5+GHH8bo0aNx3333mbf5+fmZ/52QkIDx48fjqaeewqRJk/Dll19i7Nix2LZtG6Kiouy2217fNURJSQn+7//+Dw8++CCeeOIJlJaWYtWqVRg8eDCOHj2KsLCwf7V/IiK3MxERkZUvv/zSBMD066+/mu644w4TANOgQYNslq2qqjI1atTI1L59e1NlZaV5+6+//moCYHr99dfN2yZNmmQCYHrmmWfM24xGo2nYsGEmqVRqunTpkslkMpmSk5NNAEweHh6mjIwMc9kjR46YAJiee+4587aBAweaOnToYNLpdFb7vOuuu0ytW7eu097evXub7r77bvP7K8f68ssvzdsefPBBk0ajsYpnz549JgCmDRs2mLdFRESY7rzzzjrHeO+990wATMnJyeZtV87plW06nc7UvHlz05AhQ+ocPywszNSoUSNTfn6+edupU6dMQqHQ9Mgjj5i3RUdHmwCYz9sVx44dq7PPSZMmmRQKRZ22btiwwQTAtGfPHvO2ioqKOuUWL15sEggEptTUVPO2efPmmQCYsrKyzNuunM/33nuvzj6uduV8PPjgg6YOHTqYt5eXl5tUKpVpwoQJJgCmY8eONbhdVwNgio6OtvlZixYtTABMGzduNG8rLi42BQYGmjp37lynrdfTdw05zzU1NSa9Xm9VrrCw0OTv72967LHHbLaZiOhWwqmDRER2TJ48Genp6ZgwYQJ27Nhhlc244vjx48jNzcX06dOtntsaNmwYQkNDsWXLljp1nn76afO/BQIBnn76aVRVVWHXrl1W5UaNGoUmTZqY3/fo0QM9e/bE1q1bAQAFBQX4/fffMW7cOJSWliIvLw95eXnIz8/H4MGDkZCQUGf6YlVVFWQyWb1xl5aWwtPT06nPoa1YsQL5+fmIjo622p6VlYWYmBhMnjwZ3t7e5u0dO3ZEVFSUOfarFRQUmGPPy8tDcXGx3eNeXS4vLw+lpaV1ynh4eJj/XV5ejry8PNx1110wmUw4efKk+bPS0lIIhUKrjGVDPfzww4iLizNPEdy4cSPUajUGDhx4w+1qiMaNG1tlPVUqFR555BGcPHkS2dnZNuvY67uGEolEkEqlAGqnIhYUFKCmpgbdunXDX3/99a/2TUR0M+BAi4jIjoKCAnzzzTdYs2YNwsLCMGvWrDoX8ampqQCAkJCQOvVDQ0PNn18hFAoRHBxsta1NmzYAYPUMDAC0bt26zj7btGljLpeYmAiTyYTXXnsNfn5+Vq8rF8G5ublW9YuKimw+Q3O18PBwXLx4EfPnz0daWto/Dl4aqri4GG+99RZmz55tngJ5RX3ns23btsjLy0N5ebnV9pCQEKvYr546d7Xy8vI65+mxxx6rUy4tLc080PPy8oKfnx8iIiLMbb8iPDwcRqMRs2bNwoULF5CXl4fCwsIGnQs/Pz8MGzYMX3zxBQDgiy++wKRJkyAU1v3zfL3taohWrVpBIBBYbbP383jlOPb67kasWbMGHTt2hFwuh4+PD/z8/LBlyxaH/rwREbkLn9EiIrLjvffew9ixYwEAn3/+OXr16oV58+Zh5cqVbm5ZLaPRCAB44YUXMHjwYJtlWrVqZfU+OzvbbtkrnnvuOcTHx+ONN97AggULHNPYq7zzzjsQCoV48cUXkZ+f/6/3t3HjRqhUKvP78+fPY8aMGXXKyeVy/PLLL1bb9u/fb/VsnMFgQFRUFAoKCvDSSy8hNDQUCoUCmZmZmDx5svmcA8ADDzyAv/76C8uXL8fnn39+w+1/7LHH8Mgjj+CZZ57BH3/8gf/7v//D/v37rco0pF3O5Mi+++abbzB58mSMGjUKL774Iho1agSRSITFixfjwoULDmoxEZH7cKBFRGRHv379zP/u3r07ZsyYgRUrVuCRRx5Br169AAAtWrQAAMTHx2PAgAFW9ePj482fX2E0GpGUlGTOGgC1AwMAdRZ5SEhIqNOm8+fPm8tdyYxJJBK7WZyrZWRkoLS0FG3btq23nIeHB/73v//h5MmTUKvViI6OxqlTp/DCCy/84zH+ycWLF7Fs2TIsXrwYSqWyzsX61efzWnFxcfD19YVCobDa3q9fP/j6+prf25vKJxKJ6pynoqIiq/dnzpzB+fPnsWbNGjzyyCPm7Tt37qyzP6FQiCVLluDMmTNITk7GypUrkZOTg4ceesjm8e0ZMmQI5HI5HnjgAfTp0wd33HFHnYFWQ9rVEFeyoldntez9PP5T3zXUDz/8gODgYGzatMnq+P92SiIR0c2CUweJiK7Tm2++icDAQEydOhU1NTUAgG7duqFRo0b49NNPodfrzWV/++03xMbGYtiwYXX28/HHH5v/bTKZ8PHHH0MikdR5Lmfz5s1Wz1gdPXoUR44cwZAhQwAAjRo1Qv/+/fHZZ58hKyurznGuXTb8+++/B4A6A0Jb5s2bh7S0NHzzzTeIjIxE165d/7HO9ViwYAH8/f3x1FNP2fw8MDAQYWFhWLNmjdUg6OzZs9ixYweGDh3qkHbYc+VLqE1
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1000x800 with 2 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Корреляция признаков с 'Close':\n",
|
|||
|
"Close 1.000000\n",
|
|||
|
"Low 0.999572\n",
|
|||
|
"High 0.999527\n",
|
|||
|
"Open 0.998976\n",
|
|||
|
"Adj Close 0.997764\n",
|
|||
|
"Volume 0.062913\n",
|
|||
|
"WEEKDAY(Date) 0.009135\n",
|
|||
|
"DAY(Date) -0.011068\n",
|
|||
|
"Volume_Change -0.030591\n",
|
|||
|
"MONTH(Date) -0.034877\n",
|
|||
|
"YEAR(Date) -0.100269\n",
|
|||
|
"Name: Close, dtype: float64\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from sklearn.ensemble import RandomForestRegressor\n",
|
|||
|
"from sklearn.metrics import r2_score, mean_absolute_error\n",
|
|||
|
"from sklearn.model_selection import cross_val_score\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"# Удаление строк с NaN\n",
|
|||
|
"feature_matrix = feature_matrix.dropna()\n",
|
|||
|
"val_feature_matrix = val_feature_matrix.dropna()\n",
|
|||
|
"test_feature_matrix = test_feature_matrix.dropna()\n",
|
|||
|
"\n",
|
|||
|
"# Разделение данных на обучающую и тестовую выборки\n",
|
|||
|
"y_train = feature_matrix['Close']\n",
|
|||
|
"X_train = feature_matrix.drop('Close', axis=1)\n",
|
|||
|
"y_val = val_feature_matrix['Close']\n",
|
|||
|
"X_val = val_feature_matrix.drop('Close', axis=1)\n",
|
|||
|
"y_test = test_feature_matrix['Close']\n",
|
|||
|
"X_test = test_feature_matrix.drop('Close', axis=1)\n",
|
|||
|
"\n",
|
|||
|
"X_test = X_test.reindex(columns=X_train.columns, fill_value=0) \n",
|
|||
|
"\n",
|
|||
|
"# Кодирования категориальных переменных с использованием одноразового кодирования\n",
|
|||
|
"X = pd.get_dummies(X, drop_first=True)\n",
|
|||
|
"\n",
|
|||
|
"# Разобьём тренировочный тест и примерку модели\n",
|
|||
|
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Выбор модели\n",
|
|||
|
"model = RandomForestRegressor(random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Обучение модели\n",
|
|||
|
"model.fit(X_train, y_train)\n",
|
|||
|
"\n",
|
|||
|
"# Предсказание и оценка\n",
|
|||
|
"y_pred = model.predict(X_test)\n",
|
|||
|
"\n",
|
|||
|
"rmse = mean_squared_error(y_test, y_pred, squared=False)\n",
|
|||
|
"r2 = r2_score(y_test, y_pred)\n",
|
|||
|
"mae = mean_absolute_error(y_test, y_pred)\n",
|
|||
|
"\n",
|
|||
|
"print()\n",
|
|||
|
"print(f\"RMSE: {rmse}\")\n",
|
|||
|
"print(f\"R²: {r2}\")\n",
|
|||
|
"print(f\"MAE: {mae} \\n\")\n",
|
|||
|
"\n",
|
|||
|
"# Кросс-валидация\n",
|
|||
|
"scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')\n",
|
|||
|
"rmse_cv = (-scores.mean())**0.5\n",
|
|||
|
"print(f\"Кросс-валидация RMSE: {rmse_cv} \\n\")\n",
|
|||
|
"\n",
|
|||
|
"# Анализ важности признаков\n",
|
|||
|
"feature_importances = model.feature_importances_\n",
|
|||
|
"feature_names = X_train.columns\n",
|
|||
|
"\n",
|
|||
|
"# Проверка на переобучение\n",
|
|||
|
"y_train_pred = model.predict(X_train)\n",
|
|||
|
"\n",
|
|||
|
"rmse_train = mean_squared_error(y_train, y_train_pred, squared=False)\n",
|
|||
|
"r2_train = r2_score(y_train, y_train_pred)\n",
|
|||
|
"mae_train = mean_absolute_error(y_train, y_train_pred)\n",
|
|||
|
"\n",
|
|||
|
"print(f\"Train RMSE: {rmse_train}\")\n",
|
|||
|
"print(f\"Train R²: {r2_train}\")\n",
|
|||
|
"print(f\"Train MAE: {mae_train}\")\n",
|
|||
|
"print()\n",
|
|||
|
"\n",
|
|||
|
"# Визуализация результатов\n",
|
|||
|
"plt.figure(figsize=(10, 6))\n",
|
|||
|
"plt.scatter(y_test, y_pred, alpha=0.5)\n",
|
|||
|
"plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)\n",
|
|||
|
"plt.xlabel('Фактическая цена')\n",
|
|||
|
"plt.ylabel('Прогнозируемая цена')\n",
|
|||
|
"plt.title('Фактическая цена по сравнению с прогнозируемой')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"correlation_matrix = feature_matrix.corr()\n",
|
|||
|
"plt.figure(figsize=(10, 8))\n",
|
|||
|
"sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', square=True)\n",
|
|||
|
"plt.title('Корреляционная матрица')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"# Рассмотрим корреляцию с целевой переменной 'Close'\n",
|
|||
|
"correlation_with_close = correlation_matrix['Close'].sort_values(ascending=False)\n",
|
|||
|
"print(\"Корреляция признаков с 'Close':\")\n",
|
|||
|
"print(correlation_with_close)\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"На основании представленных данных о корреляции признаков с целевой переменной 'Close', а также значений Mean Squared Error (MSE), можно сделать несколько важных выводов:\n",
|
|||
|
"\n",
|
|||
|
"**Эффективность модели**\n",
|
|||
|
"Эффективность модели можно оценивать по нескольким критериям:\n",
|
|||
|
"\n",
|
|||
|
"* Точность предсказаний: График сравнения фактических и прогнозируемых цен показывает, что точки близки к линии равенства, это указывает на высокую точность модели. Высокая точность означает, что ваша модель хорошо прогнозирует цены, что критически важно для принятия обоснованных инвестиционных решений.\n",
|
|||
|
"* Метрики оценки: Использование таких метрик, как средняя абсолютная ошибка (MAE), средняя квадратичная ошибка (MSE) или коэффициент детерминации (R²), позволит оценить, насколько близки прогнозы к реальным значениям. Эти меры позволяют количественно оценить уровень ошибки модели. В данном случае среднеквадратичные ошибки достаточно малы(~1.5e-10), что может значит о том, что предсказания модели близки к реальным значениям.\n",
|
|||
|
"\n",
|
|||
|
"**Высокая корреляция признаков**\n",
|
|||
|
"\n",
|
|||
|
"* Показатели High, Low, Open и Average Price имеют крайне высокую положительную корреляцию с целевой переменной Close:\n",
|
|||
|
"Это говорит о том, что данные переменные практически линейно зависимы от значения Close. Таким образом, знание значений этих признаков позволяет с высокой степенью уверенности предсказывать значение Close.\n",
|
|||
|
"\n",
|
|||
|
"Year имеет наибольшую отрицательную корреляцию (-0.09) с Close, что говорит об их наименьшей зависимости друг от друга.\n",
|
|||
|
"\n",
|
|||
|
"**Переобучение**\n",
|
|||
|
"Переобучение (overfitting) — это распространенная проблема в моделях машинного обучения:\n",
|
|||
|
"\n",
|
|||
|
"* Признаки переобучения: Если модель показывает отличные результаты на обучающей выборке, но значительно хуже на тестовой, это свидетельствует о том, что модель слишком сложна и запоминает данные вместо того, чтобы их обобщать.\n",
|
|||
|
"В данном случае модель показала одинаково хорошие результаты на "
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"metadata": {
|
|||
|
"kernelspec": {
|
|||
|
"display_name": "aimenv",
|
|||
|
"language": "python",
|
|||
|
"name": "python3"
|
|||
|
},
|
|||
|
"language_info": {
|
|||
|
"codemirror_mode": {
|
|||
|
"name": "ipython",
|
|||
|
"version": 3
|
|||
|
},
|
|||
|
"file_extension": ".py",
|
|||
|
"mimetype": "text/x-python",
|
|||
|
"name": "python",
|
|||
|
"nbconvert_exporter": "python",
|
|||
|
"pygments_lexer": "ipython3",
|
|||
|
"version": "3.12.5"
|
|||
|
}
|
|||
|
},
|
|||
|
"nbformat": 4,
|
|||
|
"nbformat_minor": 2
|
|||
|
}
|