AIM-PIbd-32-Stroev-V.-M/lab_3/lab3.ipynb

2024-11-08 21:13:44 +04:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Business goals"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Predicting Tesla stock prices from insider activity: a key business goal is to build a model that forecasts the dynamics of Tesla stock using data on insider transactions. Because insiders have deep knowledge of the company's internal state, their actions may foreshadow changes in the share price. Analyzing the patterns and frequency of insider purchases and sales makes it possible to develop a predictive model that helps investors and analysts make better-informed decisions.\n",
"2. Analyzing the effect of insider transactions on Tesla's share-price dynamics in order to assess short-term and long-term risks: the goal is to examine how the actions of insiders (especially large shareholders and key executives) affect Tesla's share price. Identifying correlations between the volume, type, and frequency of insider trades and changes in the share price makes it possible to assess risks and trends in the stock's dynamics.\n",
"\n",
"Technical project goal: develop a machine learning model that predicts future share sales by the company's top management, and analyze the effect of insider transactions on Tesla's share-price dynamics in order to assess short-term and long-term risks."
]
},
{
"cell_type": "code",
"execution_count": 228,
"metadata": {},
"outputs": [],
"source": [
"from typing import Any\n",
"from math import ceil\n",
"import time\n",
"\n",
"import pandas as pd\n",
"from pandas import DataFrame, Series\n",
"from sklearn.preprocessing import MinMaxScaler, StandardScaler\n",
"from sklearn.model_selection import train_test_split, cross_val_score\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import root_mean_squared_error, r2_score, mean_absolute_error\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from imblearn.over_sampling import SMOTE\n",
"import featuretools as ft\n",
"from featuretools.entityset.entityset import EntitySet\n",
"import matplotlib.pyplot as plt\n",
"\n",
"df: DataFrame = pd.read_csv(\"static/csv/TSLA.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data type conversion:"
]
},
{
"cell_type": "code",
"execution_count": 229,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data sample:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Insider Trading</th>\n",
" <th>Relationship</th>\n",
" <th>Date</th>\n",
" <th>Transaction</th>\n",
" <th>Cost</th>\n",
" <th>Shares</th>\n",
" <th>Value ($)</th>\n",
" <th>Shares Total</th>\n",
" <th>SEC Form 4</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Kirkhorn Zachary</td>\n",
" <td>Chief Financial Officer</td>\n",
" <td>2022-03-06</td>\n",
" <td>Sale</td>\n",
" <td>196.72</td>\n",
" <td>10455</td>\n",
" <td>2056775</td>\n",
" <td>203073</td>\n",
" <td>Mar 07 07:58 PM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Taneja Vaibhav</td>\n",
" <td>Chief Accounting Officer</td>\n",
" <td>2022-03-06</td>\n",
" <td>Sale</td>\n",
" <td>195.79</td>\n",
" <td>2466</td>\n",
" <td>482718</td>\n",
" <td>100458</td>\n",
" <td>Mar 07 07:57 PM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Baglino Andrew D</td>\n",
" <td>SVP Powertrain and Energy Eng.</td>\n",
" <td>2022-03-06</td>\n",
" <td>Sale</td>\n",
" <td>195.79</td>\n",
" <td>1298</td>\n",
" <td>254232</td>\n",
" <td>65547</td>\n",
" <td>Mar 07 08:01 PM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Taneja Vaibhav</td>\n",
" <td>Chief Accounting Officer</td>\n",
" <td>2022-03-05</td>\n",
" <td>Option Exercise</td>\n",
" <td>0.00</td>\n",
" <td>7138</td>\n",
" <td>0</td>\n",
" <td>102923</td>\n",
" <td>Mar 07 07:57 PM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Baglino Andrew D</td>\n",
" <td>SVP Powertrain and Energy Eng.</td>\n",
" <td>2022-03-05</td>\n",
" <td>Option Exercise</td>\n",
" <td>0.00</td>\n",
" <td>2586</td>\n",
" <td>0</td>\n",
" <td>66845</td>\n",
" <td>Mar 07 08:01 PM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Kirkhorn Zachary</td>\n",
" <td>Chief Financial Officer</td>\n",
" <td>2022-03-05</td>\n",
" <td>Option Exercise</td>\n",
" <td>0.00</td>\n",
" <td>16867</td>\n",
" <td>0</td>\n",
" <td>213528</td>\n",
" <td>Mar 07 07:58 PM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Baglino Andrew D</td>\n",
" <td>SVP Powertrain and Energy Eng.</td>\n",
" <td>2022-02-27</td>\n",
" <td>Option Exercise</td>\n",
" <td>20.91</td>\n",
" <td>10500</td>\n",
" <td>219555</td>\n",
" <td>74759</td>\n",
" <td>Mar 01 07:29 PM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Baglino Andrew D</td>\n",
" <td>SVP Powertrain and Energy Eng.</td>\n",
" <td>2022-02-27</td>\n",
" <td>Sale</td>\n",
" <td>202.00</td>\n",
" <td>10500</td>\n",
" <td>2121000</td>\n",
" <td>64259</td>\n",
" <td>Mar 01 07:29 PM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Kirkhorn Zachary</td>\n",
" <td>Chief Financial Officer</td>\n",
" <td>2022-02-06</td>\n",
" <td>Sale</td>\n",
" <td>193.00</td>\n",
" <td>3750</td>\n",
" <td>723750</td>\n",
" <td>196661</td>\n",
" <td>Feb 08 06:14 PM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Baglino Andrew D</td>\n",
" <td>SVP Powertrain and Energy Eng.</td>\n",
" <td>2022-01-27</td>\n",
" <td>Option Exercise</td>\n",
" <td>20.91</td>\n",
" <td>10500</td>\n",
" <td>219555</td>\n",
" <td>74759</td>\n",
" <td>Jan 31 07:34 PM</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Insider Trading Relationship Date \\\n",
"0 Kirkhorn Zachary Chief Financial Officer 2022-03-06 \n",
"1 Taneja Vaibhav Chief Accounting Officer 2022-03-06 \n",
"2 Baglino Andrew D SVP Powertrain and Energy Eng. 2022-03-06 \n",
"3 Taneja Vaibhav Chief Accounting Officer 2022-03-05 \n",
"4 Baglino Andrew D SVP Powertrain and Energy Eng. 2022-03-05 \n",
"5 Kirkhorn Zachary Chief Financial Officer 2022-03-05 \n",
"6 Baglino Andrew D SVP Powertrain and Energy Eng. 2022-02-27 \n",
"7 Baglino Andrew D SVP Powertrain and Energy Eng. 2022-02-27 \n",
"8 Kirkhorn Zachary Chief Financial Officer 2022-02-06 \n",
"9 Baglino Andrew D SVP Powertrain and Energy Eng. 2022-01-27 \n",
"\n",
" Transaction Cost Shares Value ($) Shares Total SEC Form 4 \n",
"0 Sale 196.72 10455 2056775 203073 Mar 07 07:58 PM \n",
"1 Sale 195.79 2466 482718 100458 Mar 07 07:57 PM \n",
"2 Sale 195.79 1298 254232 65547 Mar 07 08:01 PM \n",
"3 Option Exercise 0.00 7138 0 102923 Mar 07 07:57 PM \n",
"4 Option Exercise 0.00 2586 0 66845 Mar 07 08:01 PM \n",
"5 Option Exercise 0.00 16867 0 213528 Mar 07 07:58 PM \n",
"6 Option Exercise 20.91 10500 219555 74759 Mar 01 07:29 PM \n",
"7 Sale 202.00 10500 2121000 64259 Mar 01 07:29 PM \n",
"8 Sale 193.00 3750 723750 196661 Feb 08 06:14 PM \n",
"9 Option Exercise 20.91 10500 219555 74759 Jan 31 07:34 PM "
]
},
"execution_count": 229,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Convert column data types\n",
"df['Insider Trading'] = df['Insider Trading'].astype('category')\n",
"df['Relationship'] = df['Relationship'].astype('category')\n",
"df['Transaction'] = df['Transaction'].astype('category')\n",
"df['Cost'] = pd.to_numeric(df['Cost'], errors='coerce')\n",
"df['Shares'] = pd.to_numeric(df['Shares'].str.replace(',', ''), errors='coerce')\n",
"df['Value ($)'] = pd.to_numeric(df['Value ($)'].str.replace(',', ''), errors='coerce')\n",
"df['Shares Total'] = pd.to_numeric(df['Shares Total'].str.replace(',', ''), errors='coerce')\n",
"\n",
"print('Data sample:')\n",
"df.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Missing data:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The missing-value check below shows that the DataFrame contains no empty feature values, so there is no need to apply any imputation methods."
]
},
{
"cell_type": "code",
"execution_count": 230,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Does the column contain missing values:\n",
"Insider Trading False\n",
"Relationship False\n",
"Date False\n",
"Transaction False\n",
"Cost False\n",
"Shares False\n",
"Value ($) False\n",
"Shares Total False\n",
"SEC Form 4 False\n",
"dtype: bool \n",
"\n",
"Number of missing values per column:\n",
"Insider Trading 0\n",
"Relationship 0\n",
"Date 0\n",
"Transaction 0\n",
"Cost 0\n",
"Shares 0\n",
"Value ($) 0\n",
"Shares Total 0\n",
"SEC Form 4 0\n",
"dtype: int64 \n",
"\n",
"Percentage of missing values per column:\n",
"\n"
]
}
],
"source": [
"# Helper that reports missing values in a DataFrame\n",
"def check_null_columns(dataframe: DataFrame) -> None:\n",
"    # Whether each column contains any missing values\n",
"    print('Does the column contain missing values:')\n",
"    print(dataframe.isnull().any(), '\\n')\n",
"\n",
"    # Number of missing values per column\n",
"    print('Number of missing values per column:')\n",
"    print(dataframe.isnull().sum(), '\\n')\n",
"\n",
"    # Percentage of missing values per column (only columns with gaps are listed)\n",
"    print('Percentage of missing values per column:')\n",
"    for column in dataframe.columns:\n",
"        null_rate: float = dataframe[column].isnull().sum() / len(dataframe) * 100\n",
"        if null_rate > 0:\n",
"            print(f\"{column} missing-value percentage: {null_rate:.2f}%\")\n",
"    print()\n",
"\n",
"\n",
"# Run the missing-data check\n",
"check_null_columns(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Noisy data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Noise is the presence of random errors or variations in the data that can make it harder to identify the true patterns.\n",
"Outliers, in turn, are values that differ significantly from the other observations in the dataset.\n",
"The code below detects outliers in the dataset and, if any are found, removes them by replacing values below the lower bound (the minimum considered) with the lower bound and values above the upper bound (the maximum considered) with the upper bound."
]
},
{
"cell_type": "code",
"execution_count": 231,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking columns for outliers:\n",
"Column Cost:\n",
"\tHas outliers: No\n",
"\tNumber of outliers: 0\n",
"\tMinimum value: 0.0\n",
"\tMaximum value: 1171.04\n",
"\t1st quartile (Q1): 50.5225\n",
"\t3rd quartile (Q3): 934.1075\n",
"\n",
"Column Shares:\n",
"\tHas outliers: Yes\n",
"\tNumber of outliers: 25\n",
"\tMinimum value: 121\n",
"\tMaximum value: 11920000\n",
"\t1st quartile (Q1): 3500.0\n",
"\t3rd quartile (Q3): 301797.75\n",
"\n",
"Column Value ($):\n",
"\tHas outliers: Yes\n",
"\tNumber of outliers: 23\n",
"\tMinimum value: 0\n",
"\tMaximum value: 2278695421\n",
"\t1st quartile (Q1): 271008.0\n",
"\t3rd quartile (Q3): 148713213.25\n",
"\n",
"Column Shares Total:\n",
"\tHas outliers: Yes\n",
"\tNumber of outliers: 21\n",
"\tMinimum value: 49\n",
"\tMaximum value: 455467432\n",
"\t1st quartile (Q1): 25103.5\n",
"\t3rd quartile (Q3): 1507273.75\n",
"\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1500x1000 with 4 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Check the DataFrame columns for outliers\n",
"def check_outliers(dataframe: DataFrame, columns: list[str]) -> None:\n",
"    for column in columns:\n",
"        if not pd.api.types.is_numeric_dtype(dataframe[column]):  # Skip non-numeric columns\n",
"            continue\n",
"\n",
"        Q1: float = dataframe[column].quantile(0.25)  # 1st quartile (25%)\n",
"        Q3: float = dataframe[column].quantile(0.75)  # 3rd quartile (75%)\n",
"        IQR: float = Q3 - Q1  # Interquartile range\n",
"\n",
"        # Bounds beyond which a value counts as an outlier\n",
"        lower_bound: float = Q1 - 1.5 * IQR  # Lower bound\n",
"        upper_bound: float = Q3 + 1.5 * IQR  # Upper bound\n",
"\n",
"        # Count the outliers\n",
"        outliers: DataFrame = dataframe[(dataframe[column] < lower_bound) | (dataframe[column] > upper_bound)]\n",
"        outlier_count: int = outliers.shape[0]\n",
"\n",
"        print(f\"Column {column}:\")\n",
"        print(f\"\\tHas outliers: {'Yes' if outlier_count > 0 else 'No'}\")\n",
"        print(f\"\\tNumber of outliers: {outlier_count}\")\n",
"        print(f\"\\tMinimum value: {dataframe[column].min()}\")\n",
"        print(f\"\\tMaximum value: {dataframe[column].max()}\")\n",
"        print(f\"\\t1st quartile (Q1): {Q1}\")\n",
"        print(f\"\\t3rd quartile (Q3): {Q3}\\n\")\n",
"\n",
"# Visualize the outliers\n",
"def visualize_outliers(dataframe: DataFrame, columns: list[str]) -> None:\n",
"    # Box plots, three per row\n",
"    plt.figure(figsize=(15, 10))\n",
"    rows: int = ceil(len(columns) / 3)\n",
"    for index, column in enumerate(columns, 1):\n",
"        plt.subplot(rows, 3, index)\n",
"        plt.boxplot(dataframe[column], vert=True, patch_artist=True)\n",
"        plt.title(f\"Box plot for \\\"{column}\\\"\")\n",
"        plt.xlabel(column)\n",
"\n",
"    # Show the plots\n",
"    plt.tight_layout()\n",
"    plt.show()\n",
"\n",
"\n",
"# Numeric columns of the DataFrame\n",
"numeric_columns: list[str] = [\n",
"    'Cost',\n",
"    'Shares',\n",
"    'Value ($)',\n",
"    'Shares Total'\n",
"]\n",
"\n",
"# Check the columns for outliers\n",
"print('Checking columns for outliers:')\n",
"check_outliers(df, numeric_columns)\n",
"visualize_outliers(df, numeric_columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remove the outliers and verify that they are gone"
]
},
{
"cell_type": "code",
"execution_count": 232,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking columns for outliers after removal:\n",
"Column Cost:\n",
"\tHas outliers: No\n",
"\tNumber of outliers: 0\n",
"\tMinimum value: 0.0\n",
"\tMaximum value: 1171.04\n",
"\t1st quartile (Q1): 50.5225\n",
"\t3rd quartile (Q3): 934.1075\n",
"\n",
"Column Shares:\n",
"\tHas outliers: No\n",
"\tNumber of outliers: 0\n",
"\tMinimum value: 121.0\n",
"\tMaximum value: 749244.375\n",
"\t1st quartile (Q1): 3500.0\n",
"\t3rd quartile (Q3): 301797.75\n",
"\n",
"Column Value ($):\n",
"\tHas outliers: No\n",
"\tNumber of outliers: 0\n",
"\tMinimum value: 0.0\n",
"\tMaximum value: 371376521.125\n",
"\t1st quartile (Q1): 271008.0\n",
"\t3rd quartile (Q3): 148713213.25\n",
"\n",
"Column Shares Total:\n",
"\tHas outliers: No\n",
"\tNumber of outliers: 0\n",
"\tMinimum value: 49.0\n",
"\tMaximum value: 3730529.125\n",
"\t1st quartile (Q1): 25103.5\n",
"\t3rd quartile (Q3): 1507273.75\n",
"\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1500x1000 with 4 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Remove outliers from the DataFrame by clipping them to the IQR bounds\n",
"def remove_outliers(dataframe: DataFrame, columns: list[str]) -> DataFrame:\n",
"    for column in columns:\n",
"        if not pd.api.types.is_numeric_dtype(dataframe[column]):  # Skip non-numeric columns\n",
"            continue\n",
"\n",
"        Q1: float = dataframe[column].quantile(0.25)  # 1st quartile (25%)\n",
"        Q3: float = dataframe[column].quantile(0.75)  # 3rd quartile (75%)\n",
"        IQR: float = Q3 - Q1  # Interquartile range\n",
"\n",
"        # Bounds beyond which a value counts as an outlier\n",
"        lower_bound: float = Q1 - 1.5 * IQR  # Lower bound\n",
"        upper_bound: float = Q3 + 1.5 * IQR  # Upper bound\n",
"\n",
"        # Remove the outliers:\n",
"        # values below the lower bound are replaced with the lower bound,\n",
"        # and values above the upper bound with the upper bound\n",
"        dataframe[column] = dataframe[column].apply(lambda x: lower_bound if x < lower_bound else upper_bound if x > upper_bound else x)\n",
"\n",
"    return dataframe\n",
"\n",
"\n",
"# Remove the outliers\n",
"df: DataFrame = remove_outliers(df, numeric_columns)\n",
"\n",
"# Check the columns for outliers again\n",
"print('Checking columns for outliers after removal:')\n",
"check_outliers(df, numeric_columns)\n",
"visualize_outliers(df, numeric_columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Splitting the dataset into subsets:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Training set (60-80%): used to fit the model (finding the coefficients of a mathematical function that approximates the data).\n",
"Validation set (10-20%): used to choose the training method and tune hyperparameters.\n",
"Test set (10-20% or 20-30%): used to assess model quality before handing it over to the customer.\n",
"\n",
"The data should be balanced; to achieve this we use data augmentation methods, in this case oversampling."
]
},
{
"cell_type": "code",
"execution_count": 233,
"metadata": {},
"outputs": [],
"source": [
"# Split a DataFrame into stratified training, validation, and test sets\n",
"def split_stratified_into_train_val_test(\n",
"    df_input,\n",
"    stratify_colname=\"y\",\n",
"    frac_train=0.6,\n",
"    frac_val=0.15,\n",
"    frac_test=0.25,\n",
"    random_state=None,\n",
") -> tuple[Any, Any, Any]:\n",
"\n",
"    # Compare with a tolerance: a sum of floating-point fractions may not equal 1.0 exactly\n",
"    if abs(frac_train + frac_val + frac_test - 1.0) > 1e-9:\n",
"        raise ValueError(\n",
"            \"fractions %f, %f, %f do not add up to 1.0\"\n",
"            % (frac_train, frac_val, frac_test)\n",
"        )\n",
"\n",
"    if stratify_colname not in df_input.columns:\n",
"        raise ValueError(\"%s is not a column in the dataframe\" % (stratify_colname))\n",
"\n",
"    X: DataFrame = df_input\n",
"    y: DataFrame = df_input[\n",
"        [stratify_colname]\n",
"    ]\n",
"\n",
"    # First split: training set versus everything else\n",
"    df_train, df_temp, y_train, y_temp = train_test_split(\n",
"        X, y,\n",
"        stratify=y,\n",
"        test_size=(1.0 - frac_train),\n",
"        random_state=random_state\n",
"    )\n",
"\n",
"    # Second split: divide the remainder into validation and test sets\n",
"    relative_frac_test: float = frac_test / (frac_val + frac_test)\n",
"    df_val, df_test, y_val, y_test = train_test_split(\n",
"        df_temp,\n",
"        y_temp,\n",
"        stratify=y_temp,\n",
"        test_size=relative_frac_test,\n",
"        random_state=random_state,\n",
"    )\n",
"\n",
"    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)\n",
"\n",
"    return df_train, df_val, df_test"
]
},
{
"cell_type": "code",
"execution_count": 234,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Distribution of observation counts by label (class):\n",
"Cost\n",
"0.00 18\n",
"6.24 10\n",
"62.72 8\n",
"20.91 7\n",
"52.38 4\n",
" ..\n",
"1098.24 1\n",
"1072.22 1\n",
"1019.03 1\n",
"1048.46 1\n",
"1068.09 1\n",
"Name: count, Length: 101, dtype: int64 \n",
"\n",
"Statistical description of the target feature:\n",
"count 156.000000\n",
"mean 478.785641\n",
"std 448.922903\n",
"min 0.000000\n",
"25% 50.522500\n",
"50% 240.225000\n",
"75% 934.107500\n",
"max 1171.040000\n",
"Name: Cost, dtype: float64 \n",
"\n",
"Distribution of observation counts by label (class):\n",
"Cost_category\n",
"medium 78\n",
"low 39\n",
"high 39\n",
"Name: count, dtype: int64 \n",
"\n",
"Checking subset balance:\n",
"Training set: (93, 184)\n",
"Distribution of the data sample by class in column \"Cost_category\":\n",
" Cost_category\n",
"medium 47\n",
"low 23\n",
"high 23\n",
"Name: count, dtype: int64\n",
"Percentage of objects of class \"medium\": 50.54%\n",
"Percentage of objects of class \"low\": 24.73%\n",
"Percentage of objects of class \"high\": 24.73%\n",
"\n",
"Validation set: (31, 184)\n",
"Distribution of the data sample by class in column \"Cost_category\":\n",
" Cost_category\n",
"medium 15\n",
"low 8\n",
"high 8\n",
"Name: count, dtype: int64\n",
"Percentage of objects of class \"medium\": 48.39%\n",
"Percentage of objects of class \"low\": 25.81%\n",
"Percentage of objects of class \"high\": 25.81%\n",
"\n",
"Test set: (32, 184)\n",
"Distribution of the data sample by class in column \"Cost_category\":\n",
" Cost_category\n",
"medium 16\n",
"low 8\n",
"high 8\n",
"Name: count, dtype: int64\n",
"Percentage of objects of class \"medium\": 50.00%\n",
"Percentage of objects of class \"low\": 25.00%\n",
"Percentage of objects of class \"high\": 25.00%\n",
"\n",
"Checking whether the subsets need augmentation:\n",
"Data augmentation is required for the training set\n",
"Data augmentation is required for the validation set\n",
"Data augmentation is required for the test set\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1500x500 with 3 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Оценка сбалансированности\n",
"def check_balance(dataframe: DataFrame, dataframe_name: str, column: str) -> None:\n",
" counts: Series[int] = dataframe[column].value_counts()\n",
" print(dataframe_name + \": \", dataframe.shape)\n",
" print(f\"Распределение выборки данных по классам в колонке \\\"{column}\\\":\\n\", counts)\n",
" total_count: int = len(dataframe)\n",
" for value in counts.index:\n",
" percentage: float = counts[value] / total_count * 100\n",
" print(f\"Процент объектов класса \\\"{value}\\\": {percentage:.2f}%\")\n",
" print()\n",
" \n",
"# Определение необходимости аугментации данных\n",
"def need_augmentation(dataframe: DataFrame,\n",
" column: str, \n",
" first_value: Any, second_value: Any) -> bool:\n",
" counts: Series[int] = dataframe[column].value_counts()\n",
" ratio: float = counts[first_value] / counts[second_value]\n",
" return ratio > 1.5 or ratio < 0.67\n",
" \n",
" # Визуализация сбалансированности классов\n",
"def visualize_balance(dataframe_train: DataFrame,\n",
" dataframe_val: DataFrame,\n",
" dataframe_test: DataFrame, \n",
" column: str) -> None:\n",
" fig, axes = plt.subplots(1, 3, figsize=(15, 5))\n",
"\n",
" # Обучающая выборка\n",
" counts_train: Series[int] = dataframe_train[column].value_counts()\n",
" axes[0].pie(counts_train, labels=counts_train.index, autopct='%1.1f%%', startangle=90)\n",
" axes[0].set_title(f\"Распределение классов \\\"{column}\\\" в обучающей выборке\")\n",
"\n",
" # Контрольная выборка\n",
" counts_val: Series[int] = dataframe_val[column].value_counts()\n",
" axes[1].pie(counts_val, labels=counts_val.index, autopct='%1.1f%%', startangle=90)\n",
" axes[1].set_title(f\"Распределение классов \\\"{column}\\\" в контрольной выборке\")\n",
"\n",
" # Тестовая выборка\n",
" counts_test: Series[int] = dataframe_test[column].value_counts()\n",
" axes[2].pie(counts_test, labels=counts_test.index, autopct='%1.1f%%', startangle=90)\n",
" axes[2].set_title(f\"Распределение классов \\\"{column}\\\" в тренировочной выборке\")\n",
"\n",
" # Отображение графиков\n",
" plt.tight_layout()\n",
" plt.show()\n",
" \n",
"\n",
"# Унитарное кодирование категориальных признаков (one-hot encoding)\n",
"df_encoded: DataFrame = pd.get_dummies(df)\n",
"\n",
"# Вывод распределения количества наблюдений по меткам (классам)\n",
"print('Распределение количества наблюдений по меткам (классам):')\n",
"print(df_encoded['Cost'].value_counts(), '\\n')\n",
"\n",
"# Статистическое описание целевого признака\n",
"print('Статистическое описание целевого признака:')\n",
"print(df_encoded['Cost'].describe().transpose(), '\\n')\n",
"\n",
"# Определим границы для каждой категории стоимости акций\n",
"bins: list[float] = [df_encoded['Cost'].min() - 1, \n",
" df_encoded['Cost'].quantile(0.25), \n",
" df_encoded['Cost'].quantile(0.75), \n",
" df_encoded['Cost'].max() + 1]\n",
"labels: list[str] = ['low', 'medium', 'high']\n",
"\n",
"# Создаем новую колонку с категориями стоимости акций\n",
"df_encoded['Cost_category'] = pd.cut(df_encoded['Cost'], bins=bins, labels=labels)\n",
"\n",
"# Вывод распределения количества наблюдений по меткам (классам)\n",
"print('Распределение количества наблюдений по меткам (классам):')\n",
"print(df_encoded['Cost_category'].value_counts(), '\\n')\n",
"\n",
"df_train, df_val, df_test = split_stratified_into_train_val_test(\n",
" df_encoded, \n",
" stratify_colname=\"Cost_category\", \n",
" frac_train=0.60, \n",
" frac_val=0.20, \n",
" frac_test=0.20\n",
")\n",
"\n",
"# Проверка сбалансированности выборок\n",
"print('Проверка сбалансированности выборок:')\n",
"check_balance(df_train, 'Обучающая выборка', 'Cost_category')\n",
"check_balance(df_val, 'Контрольная выборка', 'Cost_category')\n",
"check_balance(df_test, 'Тестовая выборка', 'Cost_category')\n",
"\n",
"# Проверка необходимости аугментации выборок\n",
"print('Проверка необходимости аугментации выборок:')\n",
"print(f\"Для обучающей выборки аугментация данных {'не ' if not need_augmentation(df_train, 'Cost_category', 'low', 'medium') else ''}требуется\")\n",
"print(f\"Для контрольной выборки аугментация данных {'не ' if not need_augmentation(df_val, 'Cost_category', 'low', 'medium') else ''}требуется\")\n",
"print(f\"Для тестовой выборки аугментация данных {'не ' if not need_augmentation(df_test, 'Cost_category', 'low', 'medium') else ''}требуется\")\n",
" \n",
"# Визуализация сбалансированности классов\n",
"visualize_balance(df_train, df_val, df_test, 'Cost_category')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Необходимо применить аугментацию выборки с избытком (oversampling) копирование наблюдений или генерация новых наблюдений на основе существующих с помощью алгоритмов SMOTE и ADASYN (нахождение k-ближайших соседей)."
]
},
{
"cell_type": "code",
"execution_count": 235,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Проверка сбалансированности выборок после применения метода oversampling:\n",
"Обучающая выборка: (141, 184)\n",
"Распределение выборки данных по классам в колонке \"Cost_category\":\n",
" Cost_category\n",
"low 47\n",
"medium 47\n",
"high 47\n",
"Name: count, dtype: int64\n",
"Процент объектов класса \"low\": 33.33%\n",
"Процент объектов класса \"medium\": 33.33%\n",
"Процент объектов класса \"high\": 33.33%\n",
"\n",
"Контрольная выборка: (45, 184)\n",
"Распределение выборки данных по классам в колонке \"Cost_category\":\n",
" Cost_category\n",
"low 15\n",
"medium 15\n",
"high 15\n",
"Name: count, dtype: int64\n",
"Процент объектов класса \"low\": 33.33%\n",
"Процент объектов класса \"medium\": 33.33%\n",
"Процент объектов класса \"high\": 33.33%\n",
"\n",
"Тестовая выборка: (48, 184)\n",
"Распределение выборки данных по классам в колонке \"Cost_category\":\n",
" Cost_category\n",
"low 16\n",
"medium 16\n",
"high 16\n",
"Name: count, dtype: int64\n",
"Процент объектов класса \"low\": 33.33%\n",
"Процент объектов класса \"medium\": 33.33%\n",
"Процент объектов класса \"high\": 33.33%\n",
"\n",
"Проверка необходимости аугментации выборок после применения метода oversampling:\n",
"Для обучающей выборки аугментация данных не требуется\n",
"Для контрольной выборки аугментация данных не требуется\n",
"Для тестовой выборки аугментация данных не требуется\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABigAAAH/CAYAAADNB1UNAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAACm90lEQVR4nOzdd3hUdd6G8WfSO4QamhB6h11YFJSiooKoWLFTLLirrqKIihWxsAo27LoKCNiAFZQiRYo0KdJ7kRpKEkJ6z5z3D94ZMySBJCRzZs7cn+vy2iWZzHwzmZz7TH5nztgMwzAEAAAAAAAAAADgRn5mDwAAAAAAAAAAAHwPCxQAAAAAAAAAAMDtWKAAAAAAAAAAAABuxwIFAAAAAAAAAABwOxYoAAAAAAAAAACA27FAAQAAAAAAAAAA3I4FCgAAAAAAAAAA4HYsUAAAAAAAAAAAALdjgQIAcF75+fmKj4/X4cOHzR4FFSw3N1cnTpzQsWPHzB4FAABcoLS0NB08eFAZGRlmjwIAACqIYRhKSkrS3r17zR6lUrBAAQAo1t69e/Xggw+qTp06CgoKUu3atdW1a1cZhmH2aF5hypQpOnjwoPPfEydOVFxcnHkDFbJ+/XrdddddqlGjhoKDg1WnTh3dcsstZo8FAIBHSk9P13vvvef8d3Jysj766CPzBirEMAx9/vnnuuSSSxQWFqaoqCjFxsZqypQpZo8GAIDH27Ztm2bOnOn896ZNmzRnzhzzBiokLS1NL7zwglq0aKGgoCBVr15dzZs31+7du80ercKVaYFi4sSJstlszv9CQkLUvHlzPfroozp58mRlzQhY3qhRo9SoUSNJf/2eFefHH39U3759VaNGDQUFBalu3boaMGCAFi9eXClzrVq1SqNGjVJycnKlXH9p7dixQ6NGjXL5Y68VHDx4UDabTUuXLpUk2Ww2TZw40dSZHH7//Xd16dJFixcv1rPPPqv58+dr4cKFmjlzZomPT7havny5nn76aR08eFDz58/XI488Ij+/yjsuoLSNnjVrli677DLt2LFDr7/+uhYuXKiFCxfqs88+q7TZAG9Go8+NRsMXhIaG6oUXXtDUqVN15MgRjRo1Sj///HOxl3X3c+a77rpL//znP9WqVStNnjxZCxcu1KJFi3TzzTdX+G0BZqLH50aPgfJJS0vTQw89pN9//1179+7V448/rq1bt5o9lk6dOqWuXbtq/PjxuvXWWzVr1iwtXLhQS5cudW4LrSSgPF80evRoxcbGKjs7WytWrNAnn3yiuXPnatu2bQoLC6voGQGfZxiG7rvvPk2cOFF/+9vf9OSTTyomJkbHjx/Xjz/+qCuvvFIrV65Ut27dKvR2V61apVdeeUWDBw9W1apVK/S6y2LHjh165ZVX1KtXL0tuiD1Nbm6uhgwZoubNm2vBggWqUqWK2SN5pSeeeEK9evVSbGysJOnJJ59UnTp1Kv12z9Xo7OxsPfDAA7rmmms0bdo0BQUFVfo8gNXRaBoN6/P399crr7yigQMHym63Kyoq6rxHV7rjOfPXX3+t77//XlOmTNFdd91VIdcJeCt6TI+B8ujatavzP0lq3ry5HnzwQZOnkkaMGKHjx49r9erVatOmjdnjVLpyLVD07dtXnTt3liQ98MADql69ut555x3NmjVLd955Z4UOCEB6++23NXHiRA0bNkzvvPOOy9Eizz//vCZPnqyAgHL9OqMC2e125ebmKiQkxOxRLsjPP/+s3bt3a9euXSxOXICWLVtq//792rZtm2rUqKEmTZq45XbP1ehjx44pOztbEydOZHECqCA02jtYpdEwz/Dhw3X77bfryJEjatWq1Xn/EOmO58xjx47VnXfeyeIEIHrsLegxPNHMmTO1Y8cOZWVlqV27dqY/V46Pj9ekSZP06aef+sTihFRB70FxxRVXSJIOHDggSUpKStJTTz2ldu3aKSIiQlFRUerbt682b95c5Guzs7M1atQoNW
/eXCEhIapTp45uvvlm7d+/X9JfL+cq6b9evXo5r2vp0qWy2Wz6/vvv9dxzzykmJkbh4eG64YYbdOTIkSK3vWbNGvXp00dVqlRRWFiYevbsqZUrVxb7Pfbq1avY2x81alSRy06ZMkWdOnVSaGioqlWrpjvuuKPY2z/X91aY3W7Xe++9pzZt2igkJES1a9fWQw89pNOnT7tcrlGjRrruuuuK3M6jjz5a5DqLm33s2LFF7lNJysnJ0csvv6ymTZsqODhYDRo00NNPP62cnJxi76vCevXqVeT6Xn/9dfn5+embb74p1/0xbtw4devWTdWrV1doaKg6deqk6dOnF3v7U6ZMUZcuXRQWFqbo6Gj16NFDCxYscLnMvHnz1LNnT0VGRioqKkr/+Mc/isw2bdo058+0Ro0auueee4qcS37w4MEuM0dHR6tXr15avnz5ee+nc8nKytKYMWPUsmVLjRs3rtiXst57773q0qWL899//vmnbrvtNlWrVk1hYWG65JJLij3K64MPPlCbNm2c90/nzp2d3/uoUaM0YsQISVJsbKzz+yrLS0Z37dqlAQMGqGbNmgoNDVWLFi30/PPPOz9/6NAhPfzww2rRooVCQ0NVvXp13XbbbUXO23/bbbdJki6//HLnHI6XeEpnfobdu3dXeHi4IiMj1a9fP23fvr3IPNOmTVPr1q0VEhKitm3b6scff9TgwYOLHGGSkZGh4cOHq0GDBgoODlaLFi00bty4Iu+9YLPZ9Oijj2rq1Klq06aNgoODNW/ePDVq1Ej9+/cvcvvZ2dmqUqWKHnrooVLfh2dzbOcc/wUHB6t58+YaM2ZMqd4bIj4+Xvfff79q166tkJAQdejQQZMmTXK5zO+//67Y2FjNmDFDTZo0UVBQkC666CI9/fTTysrKcl5u0KBBqlGjhvLy8orcztVXX60WLVq4zFz4Zyap2Pu+tL/fjRo10uDBg53/TktL06OPPqp69eopODhYzZo103/+8x/Z7XaXr3P8zAq77rrriswxffr0YmdOTk7WsGHDnI+Npk2b6s0333S5Hce2bOLEiQoPD9fFF1+sJk2a6JFHHpHNZnOZuzhnbwsDAwPVqFEjjRgxQrm5uc7LOV7avn79+hKvq1evXs5t3oEDB/T777+rTZs2uvrqqxUYGCibzSY/Pz+1aNFCGzdudPna/Px8vfTSS4qOjnbOEhERoRtvvLFcjX722Wdls9n04YcfOhsdFhamwMBAXXnllcrPz3e5n++44w6FhIQ4Z2zcuHGJ21MaTaNpNI329kYX18pjx46pUaNG6ty5s9LT050fL03LHb+z48aNK3Jbbdu2df7unz3zubalo0aNks1mcz52oqKiVL16dT3++OPKzs52uY38/Hy9+uqratKkiYKDg9WoUSM999xzxW6bSpqh8M/ecZmSticOjhkTExNdPr5+/fpiTwuyePFi5+OzatWq6t+/v3bu3FnsdUpS/fr11bVrVwUEBCgmJqbYfYWSZnrnnXckndkedenSRZMnT3Z5zhwQEKDIyMgSnzMPGzbMuV9Qp04d3XDDDdq2bZsaNGjg8vtXUo8d257IyEjZbDY1a9ZMAwYMcHnO/MMPPxR5Ph0SEqJatWopKirK5TlzcQ0pvD0q/N8DDzxQ5Hn3Sy+9RI//Hz2mx/T4L2b2uLKaKJXuOUivXr3Utm3bIl/reEwX/pmf/XxYOvOzO7ufhfcH3n33XTVs2FChoaHq2bOntm3bVuS2ytJFx3+RkZHq0qWLy/s4FJ7pfNuF4r6X4vaLynL/SNLHH3/sfMzVrVtXjzzySJFTsRXe/rZu3VqdOnXS5s2bi92uFqfXWc9Da9SooX79+hW5bx2/ByVxPLd3fA/r1q1zLuZ17txZISEhql69uu68804dPny4yNeX5edWmsfs2f3Lz8/Xtddeq2rVqmnHjh0uly3t8+vzqZDlY8cfKqpXry7pzIZ+5syZuu222xQbG6uTJ0/qs88+U8+ePbVjxw7VrV
tXklRQUKDrrrtOv/76q+644w49/vjjSktL08KFC7Vt2zaXo03vvPNOXXvttS63O3LkyGLnef3112Wz2fTMM88oPj5
"text/plain": [
"<Figure size 1500x500 with 3 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Метод приращения с избытком (oversampling)\n",
"def oversample(df: DataFrame, column: str) -> DataFrame:\n",
" X: DataFrame = pd.get_dummies(df.drop(column, axis=1))\n",
" y: DataFrame = df[column] # type: ignore\n",
" \n",
" smote = SMOTE()\n",
" X_resampled, y_resampled = smote.fit_resample(X, y) # type: ignore\n",
" \n",
" df_resampled: DataFrame = pd.concat([X_resampled, y_resampled], axis=1)\n",
" return df_resampled\n",
"\n",
"\n",
"# Приращение данных (oversampling)\n",
"df_train_oversampled: DataFrame = oversample(df_train, 'Cost_category')\n",
"df_val_oversampled: DataFrame = oversample(df_val, 'Cost_category')\n",
"df_test_oversampled: DataFrame = oversample(df_test, 'Cost_category')\n",
"\n",
"# Проверка сбалансированности выборок\n",
"print('Проверка сбалансированности выборок после применения метода oversampling:')\n",
"check_balance(df_train_oversampled, 'Обучающая выборка', 'Cost_category')\n",
"check_balance(df_val_oversampled, 'Контрольная выборка', 'Cost_category')\n",
"check_balance(df_test_oversampled, 'Тестовая выборка', 'Cost_category')\n",
"\n",
"# Проверка необходимости аугментации выборок\n",
"print('Проверка необходимости аугментации выборок после применения метода oversampling:')\n",
"print(f\"Для обучающей выборки аугментация данных {'не ' if not need_augmentation(df_train_oversampled, 'Cost_category', 'low', 'medium') else ''}требуется\")\n",
"print(f\"Для контрольной выборки аугментация данных {'не ' if not need_augmentation(df_val_oversampled, 'Cost_category', 'low', 'medium') else ''}требуется\")\n",
"print(f\"Для тестовой выборки аугментация данных {'не ' if not need_augmentation(df_test_oversampled, 'Cost_category', 'low', 'medium') else ''}требуется\")\n",
" \n",
"# Визуализация сбалансированности классов\n",
"visualize_balance(df_train_oversampled, df_val_oversampled, df_test_oversampled, 'Cost_category')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Конструирование признаков:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Конструирование признаков - определением признаков, которые войду в нашу обучающую модель\n",
"\n",
"Будем использовать метод конструирования признаков \"Унитарное кодирование категориальных признаков\" или one-hot-encoding. Он необходим для преобразования категориальных переменных в числовой формат."
]
},
{
"cell_type": "code",
"execution_count": 236,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Cost</th>\n",
" <th>Shares</th>\n",
" <th>Value ($)</th>\n",
" <th>Shares Total</th>\n",
" <th>Insider Trading_Baglino Andrew D</th>\n",
" <th>Insider Trading_DENHOLM ROBYN M</th>\n",
" <th>Insider Trading_Kirkhorn Zachary</th>\n",
" <th>Insider Trading_Musk Elon</th>\n",
" <th>Insider Trading_Musk Kimbal</th>\n",
" <th>Insider Trading_Taneja Vaibhav</th>\n",
" <th>...</th>\n",
" <th>SEC Form 4_Nov 30 04:42 PM</th>\n",
" <th>SEC Form 4_Oct 05 07:35 PM</th>\n",
" <th>SEC Form 4_Oct 31 07:06 PM</th>\n",
" <th>SEC Form 4_Sep 07 08:29 PM</th>\n",
" <th>SEC Form 4_Sep 07 08:33 PM</th>\n",
" <th>SEC Form 4_Sep 07 09:04 PM</th>\n",
" <th>SEC Form 4_Sep 12 09:44 PM</th>\n",
" <th>SEC Form 4_Sep 14 07:47 PM</th>\n",
" <th>SEC Form 4_Sep 30 07:03 PM</th>\n",
" <th>Cost_category</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>196.72</td>\n",
" <td>10455.0</td>\n",
" <td>2056775.0</td>\n",
" <td>203073.0</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>195.79</td>\n",
" <td>2466.0</td>\n",
" <td>482718.0</td>\n",
" <td>100458.0</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>195.79</td>\n",
" <td>1298.0</td>\n",
" <td>254232.0</td>\n",
" <td>65547.0</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.00</td>\n",
" <td>7138.0</td>\n",
" <td>0.0</td>\n",
" <td>102923.0</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.00</td>\n",
" <td>2586.0</td>\n",
" <td>0.0</td>\n",
" <td>66845.0</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0.00</td>\n",
" <td>16867.0</td>\n",
" <td>0.0</td>\n",
" <td>213528.0</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>20.91</td>\n",
" <td>10500.0</td>\n",
" <td>219555.0</td>\n",
" <td>74759.0</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>202.00</td>\n",
" <td>10500.0</td>\n",
" <td>2121000.0</td>\n",
" <td>64259.0</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>193.00</td>\n",
" <td>3750.0</td>\n",
" <td>723750.0</td>\n",
" <td>196661.0</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>20.91</td>\n",
" <td>10500.0</td>\n",
" <td>219555.0</td>\n",
" <td>74759.0</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>low</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10 rows × 184 columns</p>\n",
"</div>"
],
"text/plain": [
" Cost Shares Value ($) Shares Total Insider Trading_Baglino Andrew D \\\n",
"0 196.72 10455.0 2056775.0 203073.0 False \n",
"1 195.79 2466.0 482718.0 100458.0 False \n",
"2 195.79 1298.0 254232.0 65547.0 True \n",
"3 0.00 7138.0 0.0 102923.0 False \n",
"4 0.00 2586.0 0.0 66845.0 True \n",
"5 0.00 16867.0 0.0 213528.0 False \n",
"6 20.91 10500.0 219555.0 74759.0 True \n",
"7 202.00 10500.0 2121000.0 64259.0 True \n",
"8 193.00 3750.0 723750.0 196661.0 False \n",
"9 20.91 10500.0 219555.0 74759.0 True \n",
"\n",
" Insider Trading_DENHOLM ROBYN M Insider Trading_Kirkhorn Zachary \\\n",
"0 False True \n",
"1 False False \n",
"2 False False \n",
"3 False False \n",
"4 False False \n",
"5 False True \n",
"6 False False \n",
"7 False False \n",
"8 False True \n",
"9 False False \n",
"\n",
" Insider Trading_Musk Elon Insider Trading_Musk Kimbal \\\n",
"0 False False \n",
"1 False False \n",
"2 False False \n",
"3 False False \n",
"4 False False \n",
"5 False False \n",
"6 False False \n",
"7 False False \n",
"8 False False \n",
"9 False False \n",
"\n",
" Insider Trading_Taneja Vaibhav ... SEC Form 4_Nov 30 04:42 PM \\\n",
"0 False ... False \n",
"1 True ... False \n",
"2 False ... False \n",
"3 True ... False \n",
"4 False ... False \n",
"5 False ... False \n",
"6 False ... False \n",
"7 False ... False \n",
"8 False ... False \n",
"9 False ... False \n",
"\n",
" SEC Form 4_Oct 05 07:35 PM SEC Form 4_Oct 31 07:06 PM \\\n",
"0 False False \n",
"1 False False \n",
"2 False False \n",
"3 False False \n",
"4 False False \n",
"5 False False \n",
"6 False False \n",
"7 False False \n",
"8 False False \n",
"9 False False \n",
"\n",
" SEC Form 4_Sep 07 08:29 PM SEC Form 4_Sep 07 08:33 PM \\\n",
"0 False False \n",
"1 False False \n",
"2 False False \n",
"3 False False \n",
"4 False False \n",
"5 False False \n",
"6 False False \n",
"7 False False \n",
"8 False False \n",
"9 False False \n",
"\n",
" SEC Form 4_Sep 07 09:04 PM SEC Form 4_Sep 12 09:44 PM \\\n",
"0 False False \n",
"1 False False \n",
"2 False False \n",
"3 False False \n",
"4 False False \n",
"5 False False \n",
"6 False False \n",
"7 False False \n",
"8 False False \n",
"9 False False \n",
"\n",
" SEC Form 4_Sep 14 07:47 PM SEC Form 4_Sep 30 07:03 PM Cost_category \n",
"0 False False medium \n",
"1 False False medium \n",
"2 False False medium \n",
"3 False False low \n",
"4 False False low \n",
"5 False False low \n",
"6 False False low \n",
"7 False False medium \n",
"8 False False medium \n",
"9 False False low \n",
"\n",
"[10 rows x 184 columns]"
]
},
"execution_count": 236,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_encoded.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Дискретизация числовых признаков процесс преобразования непрерывных числовых значений в категориальные группы или интервалы (дискретные значения).\n",
"\n",
"В данном случае преобразование числовой колонки \"Cost\" уже было выполнено ранее для стратифицированного разбиения исходных данных на выборки (обучающую, контрольную и тестовую). Для этого использовался метод квартильной группировки."
]
},
{
"cell_type": "code",
"execution_count": 237,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Обучающая выборка:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Cost</th>\n",
" <th>Cost_category</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>195.79</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>923.57</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.00</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>748.11</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>18.44</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>875.23</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>992.27</td>\n",
" <td>high</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>1073.00</td>\n",
" <td>high</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>6.24</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>250.50</td>\n",
" <td>medium</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Cost Cost_category\n",
"0 195.79 medium\n",
"1 923.57 medium\n",
"2 0.00 low\n",
"3 748.11 medium\n",
"4 18.44 low\n",
"5 875.23 medium\n",
"6 992.27 high\n",
"7 1073.00 high\n",
"8 6.24 low\n",
"9 250.50 medium"
]
},
"execution_count": 237,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print('Обучающая выборка:')\n",
"df_train_oversampled[['Cost', 'Cost_category']].head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"«Ручной» синтез признаков процесс создания новых признаков на основе существующих данных. Это может включать в себя комбинирование нескольких признаков, использование математических операций (например, сложение, вычитание), а также создание полиномиальных или логарифмических признаков."
]
},
{
"cell_type": "code",
"execution_count": 238,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>Year</th>\n",
" <th>Quarter</th>\n",
" <th>Month</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2022-03-06</td>\n",
" <td>2022</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2022-03-06</td>\n",
" <td>2022</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2022-03-06</td>\n",
" <td>2022</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2022-03-05</td>\n",
" <td>2022</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2022-03-05</td>\n",
" <td>2022</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>2022-03-05</td>\n",
" <td>2022</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>2022-02-27</td>\n",
" <td>2022</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>2022-02-27</td>\n",
" <td>2022</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>2022-02-06</td>\n",
" <td>2022</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>2022-01-27</td>\n",
" <td>2022</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date Year Quarter Month\n",
"0 2022-03-06 2022 1 3\n",
"1 2022-03-06 2022 1 3\n",
"2 2022-03-06 2022 1 3\n",
"3 2022-03-05 2022 1 3\n",
"4 2022-03-05 2022 1 3\n",
"5 2022-03-05 2022 1 3\n",
"6 2022-02-27 2022 1 2\n",
"7 2022-02-27 2022 1 2\n",
"8 2022-02-06 2022 1 2\n",
"9 2022-01-27 2022 1 1"
]
},
"execution_count": 238,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Date'] = pd.to_datetime(df['Date']) # Преобразование в datetime\n",
"df['Year'] = df['Date'].dt.year # Год\n",
"df['Quarter'] = df['Date'].dt.quarter # Квартал\n",
"df['Month'] = df['Date'].dt.month # Месяц\n",
"\n",
"df[['Date', 'Year', 'Quarter', 'Month']].head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ну и наконец, масштабирование признаков на основе нормировки и стандартизации метод, который позволяет привести все числовые признаки к одинаковым или очень похожим диапазонам значений либо распределениям."
]
},
{
"cell_type": "code",
"execution_count": 239,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Cost</th>\n",
" <th>Shares</th>\n",
" <th>Value ($)</th>\n",
" <th>Shares Total</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-0.630340</td>\n",
" <td>-0.607179</td>\n",
" <td>-0.583446</td>\n",
" <td>-0.528366</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-0.632418</td>\n",
" <td>-0.635043</td>\n",
" <td>-0.594307</td>\n",
" <td>-0.604905</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>-0.632418</td>\n",
" <td>-0.639117</td>\n",
" <td>-0.595883</td>\n",
" <td>-0.630945</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>-1.069956</td>\n",
" <td>-0.618748</td>\n",
" <td>-0.597637</td>\n",
" <td>-0.603067</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>-1.069956</td>\n",
" <td>-0.634624</td>\n",
" <td>-0.597637</td>\n",
" <td>-0.629977</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>-1.069956</td>\n",
" <td>-0.584816</td>\n",
" <td>-0.597637</td>\n",
" <td>-0.520567</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>-1.023228</td>\n",
" <td>-0.607022</td>\n",
" <td>-0.596122</td>\n",
" <td>-0.624074</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>-0.618541</td>\n",
" <td>-0.607022</td>\n",
" <td>-0.583003</td>\n",
" <td>-0.631906</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>-0.638653</td>\n",
" <td>-0.630565</td>\n",
" <td>-0.592643</td>\n",
" <td>-0.533148</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>-1.023228</td>\n",
" <td>-0.607022</td>\n",
" <td>-0.596122</td>\n",
" <td>-0.624074</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Cost Shares Value ($) Shares Total\n",
"0 -0.630340 -0.607179 -0.583446 -0.528366\n",
"1 -0.632418 -0.635043 -0.594307 -0.604905\n",
"2 -0.632418 -0.639117 -0.595883 -0.630945\n",
"3 -1.069956 -0.618748 -0.597637 -0.603067\n",
"4 -1.069956 -0.634624 -0.597637 -0.629977\n",
"5 -1.069956 -0.584816 -0.597637 -0.520567\n",
"6 -1.023228 -0.607022 -0.596122 -0.624074\n",
"7 -0.618541 -0.607022 -0.583003 -0.631906\n",
"8 -0.638653 -0.630565 -0.592643 -0.533148\n",
"9 -1.023228 -0.607022 -0.596122 -0.624074"
]
},
"execution_count": 239,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# scaler = MinMaxScaler()\n",
"scaler = StandardScaler()\n",
"\n",
"# Применяем масштабирование к выбранным признакам\n",
"df[numeric_columns] = scaler.fit_transform(df[numeric_columns])\n",
"\n",
"df[numeric_columns].head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"FeatureTools - библиотека для автоматизированного создания признаков из структурированных данных."
]
},
{
"cell_type": "code",
"execution_count": 240,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"e:\\aim\\aimenv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Insider Trading</th>\n",
" <th>Relationship</th>\n",
" <th>Transaction</th>\n",
" <th>Cost</th>\n",
" <th>DAY(Date)</th>\n",
" <th>MONTH(Date)</th>\n",
" <th>WEEKDAY(Date)</th>\n",
" <th>YEAR(Date)</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>154</th>\n",
" <td>Musk Elon</td>\n",
" <td>CEO</td>\n",
" <td>Sale</td>\n",
" <td>1019.03</td>\n",
" <td>10</td>\n",
" <td>11</td>\n",
" <td>2</td>\n",
" <td>2021</td>\n",
" </tr>\n",
" <tr>\n",
" <th>155</th>\n",
" <td>Musk Elon</td>\n",
" <td>CEO</td>\n",
" <td>Sale</td>\n",
" <td>1048.46</td>\n",
" <td>10</td>\n",
" <td>11</td>\n",
" <td>2</td>\n",
" <td>2021</td>\n",
" </tr>\n",
" <tr>\n",
" <th>156</th>\n",
" <td>Musk Elon</td>\n",
" <td>CEO</td>\n",
" <td>Sale</td>\n",
" <td>1068.09</td>\n",
" <td>10</td>\n",
" <td>11</td>\n",
" <td>2</td>\n",
" <td>2021</td>\n",
" </tr>\n",
" <tr>\n",
" <th>152</th>\n",
" <td>Musk Elon</td>\n",
" <td>CEO</td>\n",
" <td>Sale</td>\n",
" <td>1098.24</td>\n",
" <td>11</td>\n",
" <td>11</td>\n",
" <td>3</td>\n",
" <td>2021</td>\n",
" </tr>\n",
" <tr>\n",
" <th>153</th>\n",
" <td>Musk Elon</td>\n",
" <td>CEO</td>\n",
" <td>Sale</td>\n",
" <td>1072.22</td>\n",
" <td>11</td>\n",
" <td>11</td>\n",
" <td>3</td>\n",
" <td>2021</td>\n",
" </tr>\n",
" <tr>\n",
" <th>151</th>\n",
" <td>Musk Elon</td>\n",
" <td>CEO</td>\n",
" <td>Sale</td>\n",
" <td>1029.67</td>\n",
" <td>12</td>\n",
" <td>11</td>\n",
" <td>4</td>\n",
" <td>2021</td>\n",
" </tr>\n",
" <tr>\n",
" <th>148</th>\n",
" <td>Musk Elon</td>\n",
" <td>CEO</td>\n",
" <td>Option Exercise</td>\n",
" <td>6.24</td>\n",
" <td>15</td>\n",
" <td>11</td>\n",
" <td>0</td>\n",
" <td>2021</td>\n",
" </tr>\n",
" <tr>\n",
" <th>149</th>\n",
" <td>Musk Elon</td>\n",
" <td>CEO</td>\n",
" <td>Sale</td>\n",
" <td>992.72</td>\n",
" <td>15</td>\n",
" <td>11</td>\n",
" <td>0</td>\n",
" <td>2021</td>\n",
" </tr>\n",
" <tr>\n",
" <th>150</th>\n",
" <td>Musk Elon</td>\n",
" <td>CEO</td>\n",
" <td>Sale</td>\n",
" <td>1015.85</td>\n",
" <td>15</td>\n",
" <td>11</td>\n",
" <td>0</td>\n",
" <td>2021</td>\n",
" </tr>\n",
" <tr>\n",
" <th>145</th>\n",
" <td>Musk Elon</td>\n",
" <td>CEO</td>\n",
" <td>Option Exercise</td>\n",
" <td>6.24</td>\n",
" <td>16</td>\n",
" <td>11</td>\n",
" <td>1</td>\n",
" <td>2021</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Insider Trading Relationship Transaction Cost DAY(Date) \\\n",
"Id \n",
"154 Musk Elon CEO Sale 1019.03 10 \n",
"155 Musk Elon CEO Sale 1048.46 10 \n",
"156 Musk Elon CEO Sale 1068.09 10 \n",
"152 Musk Elon CEO Sale 1098.24 11 \n",
"153 Musk Elon CEO Sale 1072.22 11 \n",
"151 Musk Elon CEO Sale 1029.67 12 \n",
"148 Musk Elon CEO Option Exercise 6.24 15 \n",
"149 Musk Elon CEO Sale 992.72 15 \n",
"150 Musk Elon CEO Sale 1015.85 15 \n",
"145 Musk Elon CEO Option Exercise 6.24 16 \n",
"\n",
" MONTH(Date) WEEKDAY(Date) YEAR(Date) \n",
"Id \n",
"154 11 2 2021 \n",
"155 11 2 2021 \n",
"156 11 2 2021 \n",
"152 11 3 2021 \n",
"153 11 3 2021 \n",
"151 11 4 2021 \n",
"148 11 0 2021 \n",
"149 11 0 2021 \n",
"150 11 0 2021 \n",
"145 11 1 2021 "
]
},
"execution_count": 240,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df: DataFrame = pd.read_csv(\"static/csv/TSLA.csv\")\n",
"\n",
"# Create a unique identifier for each row\n",
"df['Id'] = range(1, len(df) + 1)\n",
"\n",
"# Create the EntitySet\n",
"es = ft.EntitySet(id=\"Id\")\n",
"\n",
"# Add the DataFrame with its index and time index\n",
"es: EntitySet = es.add_dataframe(\n",
"    dataframe_name=\"trades\",\n",
"    dataframe=df,\n",
"    index=\"Id\",\n",
"    time_index=\"Date\"\n",
")\n",
"\n",
"# Generate features with Deep Feature Synthesis\n",
"feature_matrix, feature_defs = ft.dfs(\n",
"    entityset=es,\n",
"    target_dataframe_name='trades',\n",
"    max_depth=1\n",
")\n",
"\n",
"# Show the first 10 rows of the generated feature matrix\n",
"feature_matrix.head(10)"
]
},
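{
"cell_type": "markdown",
"metadata": {},
"source": [
"With `max_depth=1` and a single table, the generated columns `DAY(Date)`, `MONTH(Date)`, `WEEKDAY(Date)` and `YEAR(Date)` are plain calendar fields of the time index. As a rough illustration, the same values can be derived with the pandas `.dt` accessor. The sketch below uses a toy frame and assumes an ISO date format (`%Y-%m-%d`); the actual format in `TSLA.csv` is not shown here, so the `format` string is an assumption. Parsing dates with an explicit format beforehand may also avoid the per-element parsing fallback reported in the warnings above.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy rows; the date format is an assumption, not taken from TSLA.csv.\n",
"toy_trades = pd.DataFrame({'Date': ['2021-11-10', '2021-11-11', '2021-11-15']})\n",
"toy_trades['Date'] = pd.to_datetime(toy_trades['Date'], format='%Y-%m-%d')\n",
"\n",
"# Calendar fields equivalent to the date primitives in the feature matrix.\n",
"date_features = pd.DataFrame({\n",
"    'DAY(Date)': toy_trades['Date'].dt.day,\n",
"    'MONTH(Date)': toy_trades['Date'].dt.month,\n",
"    'WEEKDAY(Date)': toy_trades['Date'].dt.weekday,  # Monday = 0\n",
"    'YEAR(Date)': toy_trades['Date'].dt.year,\n",
"})\n",
"\n",
"print(date_features)\n",
"```\n",
"\n",
"Doing this by hand is only a cross-check; Featuretools becomes more useful once several related tables and aggregation primitives are involved."
]
},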
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Evaluating the quality of the feature set:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Predictive power: the ability of the feature set to predict the target variable. It is measured with metrics such as RMSE, MAE, and R², which show how accurately a model trained on these features predicts the target.\n",
"\n",
"Computation speed: the time needed to process the data and train the machine learning model.\n",
"\n",
"Reliability: the stability and reproducibility of results when the input data changes.\n",
"\n",
"Correlation: the strength of the relationship between the features and the target variable, and among the features themselves. High correlation with the target indicates potential predictive power, whereas high correlation among features can lead to multicollinearity and reduce model effectiveness.\n",
"\n",
"Integrity: a feature is not derived from other features."
]
},
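{
"cell_type": "markdown",
"metadata": {},
"source": [
"The correlation criterion can be checked numerically. The sketch below, on synthetic data, computes each feature's absolute correlation with the target and lists feature pairs whose mutual correlation exceeds a threshold. The feature names, the synthetic target, and the 0.9 threshold are all hypothetical choices made for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Synthetic data: 'value' is built to be nearly collinear with 'shares'.\n",
"rng = np.random.default_rng(1)\n",
"n = 200\n",
"shares = rng.normal(size=n)\n",
"toy_features = pd.DataFrame({\n",
"    'shares': shares,\n",
"    'value': 2.0 * shares + rng.normal(scale=0.01, size=n),\n",
"    'noise': rng.normal(size=n),\n",
"})\n",
"toy_target = pd.Series(3.0 * shares + rng.normal(scale=0.1, size=n))\n",
"\n",
"# Absolute correlation of each feature with the target.\n",
"target_corr = toy_features.corrwith(toy_target).abs()\n",
"\n",
"# Absolute pairwise correlation; keep each pair once (upper triangle).\n",
"feature_corr = toy_features.corr().abs()\n",
"upper = np.triu(np.ones(feature_corr.shape, dtype=bool), k=1)\n",
"pairs = feature_corr.where(upper).stack()\n",
"collinear_pairs = pairs[pairs > 0.9]\n",
"\n",
"print(target_corr)\n",
"print(collinear_pairs)\n",
"```\n",
"\n",
"Pearson correlation only captures linear relationships, so a low value here does not by itself show that a feature is useless to a tree-based model."
]
},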
{
"cell_type": "code",
"execution_count": 241,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model training time: 0.01 seconds\n",
"Validation RMSE: 190.15\n"
]
}
],
"source": [
"# Split a DataFrame into input features and the target column\n",
"def split_dataframe(dataframe: DataFrame, column: str) -> tuple[DataFrame, Series]:\n",
"    X_dataframe: DataFrame = dataframe.drop(columns=column)\n",
"    y_series: Series = dataframe[column]\n",
"\n",
"    return X_dataframe, y_series\n",
"\n",
"\n",
"# One-hot encode the training set, then split it into inputs and target\n",
"df_train_oversampled: DataFrame = pd.get_dummies(df_train_oversampled)\n",
"X_df_train, y_df_train = split_dataframe(df_train_oversampled, \"Cost\")\n",
"\n",
"# One-hot encode the validation set, then split it into inputs and target\n",
"df_val_oversampled: DataFrame = pd.get_dummies(df_val_oversampled)\n",
"X_df_val, y_df_val = split_dataframe(df_val_oversampled, \"Cost\")\n",
"\n",
"# One-hot encode the test set, then split it into inputs and target\n",
"df_test_oversampled: DataFrame = pd.get_dummies(df_test_oversampled)\n",
"X_df_test, y_df_test = split_dataframe(df_test_oversampled, \"Cost\")\n",
"\n",
"\n",
"# Linear regression model\n",
"model = LinearRegression()\n",
"\n",
"# Time the training step\n",
"start_time: float = time.perf_counter()\n",
"model.fit(X_df_train, y_df_train)\n",
"train_time: float = time.perf_counter() - start_time\n",
"\n",
"# Predict on the validation set and evaluate\n",
"predictions = model.predict(X_df_val)\n",
"# root_mean_squared_error returns RMSE, not MSE\n",
"rmse = root_mean_squared_error(y_df_val, predictions)\n",
"\n",
"print(f'Model training time: {train_time:.2f} seconds')\n",
"print(f'Validation RMSE: {rmse:.2f}')"
]
},
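{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calling `pd.get_dummies` on each split separately produces matching columns only when every split contains the same categories. A minimal sketch of one way to guard against a mismatch, reindexing the validation columns to the training columns, is shown below. The toy frames and `*_toy_*` names are hypothetical and are not the notebook's data.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy splits: the validation split has no 'Option Exercise' rows.\n",
"toy_train = pd.DataFrame({\n",
"    'Transaction': ['Sale', 'Option Exercise', 'Sale'],\n",
"    'Cost': [1019.03, 6.24, 1048.46],\n",
"})\n",
"toy_val = pd.DataFrame({\n",
"    'Transaction': ['Sale', 'Sale'],\n",
"    'Cost': [992.72, 1015.85],\n",
"})\n",
"\n",
"X_toy_train = pd.get_dummies(toy_train.drop(columns='Cost'))\n",
"X_toy_val = pd.get_dummies(toy_val.drop(columns='Cost'))\n",
"\n",
"# Add categories missing from validation as all-zero columns, drop extras,\n",
"# and put the columns in the same order as in training.\n",
"X_toy_val_aligned = X_toy_val.reindex(columns=X_toy_train.columns, fill_value=0)\n",
"\n",
"print(X_toy_val_aligned)\n",
"```\n",
"\n",
"An alternative with the same goal is to fit a scikit-learn `OneHotEncoder` with `handle_unknown='ignore'` on the training split and reuse it for the other splits."
]
},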
{
"cell_type": "code",
"execution_count": 242,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RMSE: 134.73396019154637\n",
"R²: 0.9090517989509861\n",
"MAE: 71.95763423238986\n",
"\n",
"Cross-validation RMSE: 141.69564978570725\n",
"\n",
"Train RMSE: 46.69276439077218\n",
"Train R²: 0.9906750460946525\n",
"Train MAE: 18.74249758908302\n",
"\n"
]
},
{
"data": {
"text/plain": [
"Text(0.5, 1.0, 'Actual vs. predicted cost')"
]
},
"execution_count": 242,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Random forest model\n",
"model = RandomForestRegressor()\n",
"\n",
"# Train the model\n",
"model.fit(X_df_train, y_df_train)\n",
"\n",
"# Predict on the test set and evaluate\n",
"y_predictions = model.predict(X_df_test)\n",
"\n",
"rmse = root_mean_squared_error(y_df_test, y_predictions)\n",
"r2 = r2_score(y_df_test, y_predictions)\n",
"mae = mean_absolute_error(y_df_test, y_predictions)\n",
"\n",
"print(f\"RMSE: {rmse}\")\n",
"print(f\"R²: {r2}\")\n",
"print(f\"MAE: {mae}\\n\")\n",
"\n",
"# Cross-validation: square root of the mean MSE over 5 folds\n",
"scores = cross_val_score(model, X_df_train, y_df_train, cv=5, scoring='neg_mean_squared_error')\n",
"rmse_cv = (-scores.mean())**0.5\n",
"print(f\"Cross-validation RMSE: {rmse_cv}\\n\")\n",
"\n",
"# Feature importances\n",
"feature_importances = model.feature_importances_\n",
"feature_names = X_df_train.columns\n",
"\n",
"# Overfitting check: evaluate on the training set\n",
"y_train_predictions = model.predict(X_df_train)\n",
"\n",
"rmse_train = root_mean_squared_error(y_df_train, y_train_predictions)\n",
"r2_train = r2_score(y_df_train, y_train_predictions)\n",
"mae_train = mean_absolute_error(y_df_train, y_train_predictions)\n",
"\n",
"print(f\"Train RMSE: {rmse_train}\")\n",
"print(f\"Train R²: {r2_train}\")\n",
"print(f\"Train MAE: {mae_train}\\n\")\n",
"\n",
"# Plot actual vs. predicted values\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(y_df_test, y_predictions, alpha=0.5)\n",
"plt.plot([y_df_test.min(), y_df_test.max()], [y_df_test.min(), y_df_test.max()], 'k--', lw=2)\n",
"plt.xlabel('Actual cost')\n",
"plt.ylabel('Predicted cost')\n",
"plt.title('Actual vs. predicted cost')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conclusion:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Model quality on the test set:\n",
"\n",
"RMSE (root mean squared error) on the test set is 134.73, the typical size of a prediction error in units of `Cost`.\n",
"R² (coefficient of determination) is 0.91, so the model explains about 91% of the variance in the target. This indicates high explanatory power.\n",
"MAE (mean absolute error) is 71.96, the average absolute deviation of predictions from the actual values.\n",
"\n",
"2. Cross-validation results:\n",
"\n",
"The cross-validation RMSE is 141.70, close to the test RMSE of 134.73, so the error estimate is fairly stable across data splits. Because the random forest is not seeded, these figures vary somewhat from run to run.\n",
"\n",
"3. Overfitting check:\n",
"\n",
"The training metrics (RMSE = 46.69, R² = 0.99, MAE = 18.74) are markedly better than the test metrics: the training RMSE is roughly a third of the test RMSE. A gap of this size suggests that the model overfits the training data to some degree."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "aimenv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}