{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Business goals"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Forecasting Tesla's share price from insider activity: a key business goal is to build a model that forecasts the dynamics of Tesla shares from insider transaction data. Because insiders know the company's internal state in depth, their actions may foreshadow changes in the share price. Analyzing the patterns and frequency of insider purchases and sales makes it possible to develop a predictive model that helps investors and analysts make better-informed decisions.\n",
    "2. Analyzing how insider transactions affect Tesla's share price dynamics in order to assess short-term and long-term risks: the goal is to study how the actions of insiders (especially major shareholders and key executives) affect Tesla's share price. Identifying correlations between the volume, type, and frequency of insider trades and changes in the share price makes it possible to assess risks and trends in the share dynamics.\n",
    "\n",
    "Technical project goal: develop a machine learning model that predicts future share sales by the company's top management, and analyze how insider transactions affect Tesla's share price dynamics in order to assess short-term and long-term risks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 228,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Any\n",
    "from math import ceil\n",
    "import time\n",
    "\n",
    "import pandas as pd\n",
    "from pandas import DataFrame, Series\n",
    "from sklearn.preprocessing import MinMaxScaler, StandardScaler\n",
    "from sklearn.model_selection import train_test_split, cross_val_score\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.metrics import root_mean_squared_error, r2_score, mean_absolute_error\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from imblearn.over_sampling import SMOTE\n",
    "import featuretools as ft\n",
    "from featuretools.entityset.entityset import EntitySet\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "df: DataFrame = pd.read_csv(\"static/csv/TSLA.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data type conversion:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 229,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data sample:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Insider Trading</th>\n",
       "      <th>Relationship</th>\n",
       "      <th>Date</th>\n",
       "      <th>Transaction</th>\n",
       "      <th>Cost</th>\n",
       "      <th>Shares</th>\n",
       "      <th>Value ($)</th>\n",
       "      <th>Shares Total</th>\n",
       "      <th>SEC Form 4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Kirkhorn Zachary</td>\n",
       "      <td>Chief Financial Officer</td>\n",
       "      <td>2022-03-06</td>\n",
       "      <td>Sale</td>\n",
       "      <td>196.72</td>\n",
       "      <td>10455</td>\n",
       "      <td>2056775</td>\n",
       "      <td>203073</td>\n",
       "      <td>Mar 07 07:58 PM</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Taneja Vaibhav</td>\n",
       "      <td>Chief Accounting Officer</td>\n",
       "      <td>2022-03-06</td>\n",
       "      <td>Sale</td>\n",
       "      <td>195.79</td>\n",
       "      <td>2466</td>\n",
       "      <td>482718</td>\n",
       "      <td>100458</td>\n",
       "      <td>Mar 07 07:57 PM</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Baglino Andrew D</td>\n",
       "      <td>SVP Powertrain and Energy Eng.</td>\n",
       "      <td>2022-03-06</td>\n",
       "      <td>Sale</td>\n",
       "      <td>195.79</td>\n",
       "      <td>1298</td>\n",
       "      <td>254232</td>\n",
       "      <td>65547</td>\n",
       "      <td>Mar 07 08:01 PM</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Taneja Vaibhav</td>\n",
       "      <td>Chief Accounting Officer</td>\n",
       "      <td>2022-03-05</td>\n",
       "      <td>Option Exercise</td>\n",
       "      <td>0.00</td>\n",
       "      <td>7138</td>\n",
       "      <td>0</td>\n",
       "      <td>102923</td>\n",
       "      <td>Mar 07 07:57 PM</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Baglino Andrew D</td>\n",
       "      <td>SVP Powertrain and Energy Eng.</td>\n",
       "      <td>2022-03-05</td>\n",
       "      <td>Option Exercise</td>\n",
       "      <td>0.00</td>\n",
       "      <td>2586</td>\n",
       "      <td>0</td>\n",
       "      <td>66845</td>\n",
       "      <td>Mar 07 08:01 PM</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Kirkhorn Zachary</td>\n",
       "      <td>Chief Financial Officer</td>\n",
       "      <td>2022-03-05</td>\n",
       "      <td>Option Exercise</td>\n",
       "      <td>0.00</td>\n",
       "      <td>16867</td>\n",
       "      <td>0</td>\n",
       "      <td>213528</td>\n",
       "      <td>Mar 07 07:58 PM</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Baglino Andrew D</td>\n",
       "      <td>SVP Powertrain and Energy Eng.</td>\n",
       "      <td>2022-02-27</td>\n",
       "      <td>Option Exercise</td>\n",
       "      <td>20.91</td>\n",
       "      <td>10500</td>\n",
       "      <td>219555</td>\n",
       "      <td>74759</td>\n",
       "      <td>Mar 01 07:29 PM</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Baglino Andrew D</td>\n",
       "      <td>SVP Powertrain and Energy Eng.</td>\n",
       "      <td>2022-02-27</td>\n",
       "      <td>Sale</td>\n",
       "      <td>202.00</td>\n",
       "      <td>10500</td>\n",
       "      <td>2121000</td>\n",
       "      <td>64259</td>\n",
       "      <td>Mar 01 07:29 PM</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Kirkhorn Zachary</td>\n",
       "      <td>Chief Financial Officer</td>\n",
       "      <td>2022-02-06</td>\n",
       "      <td>Sale</td>\n",
       "      <td>193.00</td>\n",
       "      <td>3750</td>\n",
       "      <td>723750</td>\n",
       "      <td>196661</td>\n",
       "      <td>Feb 08 06:14 PM</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Baglino Andrew D</td>\n",
       "      <td>SVP Powertrain and Energy Eng.</td>\n",
       "      <td>2022-01-27</td>\n",
       "      <td>Option Exercise</td>\n",
       "      <td>20.91</td>\n",
       "      <td>10500</td>\n",
       "      <td>219555</td>\n",
       "      <td>74759</td>\n",
       "      <td>Jan 31 07:34 PM</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    Insider Trading                    Relationship        Date  \\\n",
       "0  Kirkhorn Zachary         Chief Financial Officer  2022-03-06   \n",
       "1    Taneja Vaibhav        Chief Accounting Officer  2022-03-06   \n",
       "2  Baglino Andrew D  SVP Powertrain and Energy Eng.  2022-03-06   \n",
       "3    Taneja Vaibhav        Chief Accounting Officer  2022-03-05   \n",
       "4  Baglino Andrew D  SVP Powertrain and Energy Eng.  2022-03-05   \n",
       "5  Kirkhorn Zachary         Chief Financial Officer  2022-03-05   \n",
       "6  Baglino Andrew D  SVP Powertrain and Energy Eng.  2022-02-27   \n",
       "7  Baglino Andrew D  SVP Powertrain and Energy Eng.  2022-02-27   \n",
       "8  Kirkhorn Zachary         Chief Financial Officer  2022-02-06   \n",
       "9  Baglino Andrew D  SVP Powertrain and Energy Eng.  2022-01-27   \n",
       "\n",
       "       Transaction    Cost  Shares  Value ($)  Shares Total       SEC Form 4  \n",
       "0             Sale  196.72   10455    2056775        203073  Mar 07 07:58 PM  \n",
       "1             Sale  195.79    2466     482718        100458  Mar 07 07:57 PM  \n",
       "2             Sale  195.79    1298     254232         65547  Mar 07 08:01 PM  \n",
       "3  Option Exercise    0.00    7138          0        102923  Mar 07 07:57 PM  \n",
       "4  Option Exercise    0.00    2586          0         66845  Mar 07 08:01 PM  \n",
       "5  Option Exercise    0.00   16867          0        213528  Mar 07 07:58 PM  \n",
       "6  Option Exercise   20.91   10500     219555         74759  Mar 01 07:29 PM  \n",
       "7             Sale  202.00   10500    2121000         64259  Mar 01 07:29 PM  \n",
       "8             Sale  193.00    3750     723750        196661  Feb 08 06:14 PM  \n",
       "9  Option Exercise   20.91   10500     219555         74759  Jan 31 07:34 PM  "
      ]
     },
     "execution_count": 229,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Convert column data types\n",
    "df['Insider Trading'] = df['Insider Trading'].astype('category')\n",
    "df['Relationship'] = df['Relationship'].astype('category')\n",
    "df['Transaction'] = df['Transaction'].astype('category')\n",
    "df['Cost'] = pd.to_numeric(df['Cost'], errors='coerce')\n",
    "df['Shares'] = pd.to_numeric(df['Shares'].str.replace(',', ''), errors='coerce')\n",
    "df['Value ($)'] = pd.to_numeric(df['Value ($)'].str.replace(',', ''), errors='coerce')\n",
    "df['Shares Total'] = pd.to_numeric(df['Shares Total'].str.replace(',', ''), errors='coerce')\n",
    "\n",
    "print('Data sample:')\n",
    "df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Missing data:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The missing-value check below shows that the DataFrame has no empty feature values, so there is no need to impute missing data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 230,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Does the column contain empty feature values:\n",
      "Insider Trading    False\n",
      "Relationship       False\n",
      "Date               False\n",
      "Transaction        False\n",
      "Cost               False\n",
      "Shares             False\n",
      "Value ($)          False\n",
      "Shares Total       False\n",
      "SEC Form 4         False\n",
      "dtype: bool \n",
      "\n",
      "Number of empty feature values per column:\n",
      "Insider Trading    0\n",
      "Relationship       0\n",
      "Date               0\n",
      "Transaction        0\n",
      "Cost               0\n",
      "Shares             0\n",
      "Value ($)          0\n",
      "Shares Total       0\n",
      "SEC Form 4         0\n",
      "dtype: int64 \n",
      "\n",
      "Percentage of empty feature values per column:\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Report missing values in a DataFrame\n",
    "def check_null_columns(dataframe: DataFrame) -> None:\n",
    "    # Whether each column contains any empty feature values\n",
    "    print('Does the column contain empty feature values:')\n",
    "    print(dataframe.isnull().any(), '\\n')\n",
    "\n",
    "    # Number of empty feature values per column\n",
    "    print('Number of empty feature values per column:')\n",
    "    print(dataframe.isnull().sum(), '\\n')\n",
    "\n",
    "    # Percentage of empty feature values, printed only for columns that have any\n",
    "    print('Percentage of empty feature values per column:')\n",
    "    for column in dataframe.columns:\n",
    "        null_rate: float = dataframe[column].isnull().sum() / len(dataframe) * 100\n",
    "        if null_rate > 0:\n",
    "            print(f\"{column} percentage of empty values: {null_rate:.2f}%\")\n",
    "    print()\n",
    "\n",
    "\n",
    "# Check the dataset for missing values\n",
    "check_null_columns(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Noisy data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Noise is the presence of random errors or variation in the data that can make the true patterns harder to identify.\n",
    "Outliers, in turn, are values that differ markedly from the other observations in the dataset.\n",
    "The code below detects outliers in the dataset and, if any are found, neutralizes them by clipping: values below the lower bound (the accepted minimum) are replaced with the lower bound, and values above the upper bound (the accepted maximum) are replaced with the upper bound. The bounds are Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, where IQR = Q3 - Q1 is the interquartile range."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 231,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Checking the columns for outliers:\n",
      "Column Cost:\n",
      "\tHas outliers: No\n",
      "\tNumber of outliers: 0\n",
      "\tMinimum value: 0.0\n",
      "\tMaximum value: 1171.04\n",
      "\tFirst quartile (Q1): 50.5225\n",
      "\tThird quartile (Q3): 934.1075\n",
      "\n",
      "Column Shares:\n",
      "\tHas outliers: Yes\n",
      "\tNumber of outliers: 25\n",
      "\tMinimum value: 121\n",
      "\tMaximum value: 11920000\n",
      "\tFirst quartile (Q1): 3500.0\n",
      "\tThird quartile (Q3): 301797.75\n",
      "\n",
      "Column Value ($):\n",
      "\tHas outliers: Yes\n",
      "\tNumber of outliers: 23\n",
      "\tMinimum value: 0\n",
      "\tMaximum value: 2278695421\n",
      "\tFirst quartile (Q1): 271008.0\n",
      "\tThird quartile (Q3): 148713213.25\n",
      "\n",
      "Column Shares Total:\n",
      "\tHas outliers: Yes\n",
      "\tNumber of outliers: 21\n",
      "\tMinimum value: 49\n",
      "\tMaximum value: 455467432\n",
      "\tFirst quartile (Q1): 25103.5\n",
      "\tThird quartile (Q3): 1507273.75\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<Figure size 1500x1000 with 4 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Check the DataFrame columns for outliers\n",
    "def check_outliers(dataframe: DataFrame, columns: list[str]) -> None:\n",
    "    for column in columns:\n",
    "        if not pd.api.types.is_numeric_dtype(dataframe[column]):  # Skip non-numeric columns\n",
    "            continue\n",
    "\n",
    "        Q1: float = dataframe[column].quantile(0.25)  # First quartile (25%)\n",
    "        Q3: float = dataframe[column].quantile(0.75)  # Third quartile (75%)\n",
    "        IQR: float = Q3 - Q1  # Interquartile range\n",
    "\n",
    "        # Outlier bounds\n",
    "        lower_bound: float = Q1 - 1.5 * IQR  # Lower bound\n",
    "        upper_bound: float = Q3 + 1.5 * IQR  # Upper bound\n",
    "\n",
    "        # Count the outliers\n",
    "        outliers: DataFrame = dataframe[(dataframe[column] < lower_bound) | (dataframe[column] > upper_bound)]\n",
    "        outlier_count: int = outliers.shape[0]\n",
    "\n",
    "        print(f\"Column {column}:\")\n",
    "        print(f\"\\tHas outliers: {'Yes' if outlier_count > 0 else 'No'}\")\n",
    "        print(f\"\\tNumber of outliers: {outlier_count}\")\n",
    "        print(f\"\\tMinimum value: {dataframe[column].min()}\")\n",
    "        print(f\"\\tMaximum value: {dataframe[column].max()}\")\n",
    "        print(f\"\\tFirst quartile (Q1): {Q1}\")\n",
    "        print(f\"\\tThird quartile (Q3): {Q3}\\n\")\n",
    "\n",
    "# Visualize the outliers\n",
    "def visualize_outliers(dataframe: DataFrame, columns: list[str]) -> None:\n",
    "    # Box plots, three per row\n",
    "    plt.figure(figsize=(15, 10))\n",
    "    rows: int = ceil(len(columns) / 3)\n",
    "    for index, column in enumerate(columns, 1):\n",
    "        plt.subplot(rows, 3, index)\n",
    "        plt.boxplot(dataframe[column], vert=True, patch_artist=True)\n",
    "        plt.title(f\"Box plot for \\\"{column}\\\"\")\n",
    "        plt.xlabel(column)\n",
    "\n",
    "    # Show the plots\n",
    "    plt.tight_layout()\n",
    "    plt.show()\n",
    "\n",
    "\n",
    "# Numeric columns of the DataFrame\n",
    "numeric_columns: list[str] = [\n",
    "    'Cost',\n",
    "    'Shares',\n",
    "    'Value ($)',\n",
    "    'Shares Total'\n",
    "]\n",
    "\n",
    "# Check the columns for outliers\n",
    "print('Checking the columns for outliers:')\n",
    "check_outliers(df, numeric_columns)\n",
    "visualize_outliers(df, numeric_columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Clip the outliers and verify that none remain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 232,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Checking the columns for outliers after clipping:\n",
      "Column Cost:\n",
      "\tHas outliers: No\n",
      "\tNumber of outliers: 0\n",
      "\tMinimum value: 0.0\n",
      "\tMaximum value: 1171.04\n",
      "\tFirst quartile (Q1): 50.5225\n",
      "\tThird quartile (Q3): 934.1075\n",
      "\n",
      "Column Shares:\n",
      "\tHas outliers: No\n",
      "\tNumber of outliers: 0\n",
      "\tMinimum value: 121.0\n",
      "\tMaximum value: 749244.375\n",
      "\tFirst quartile (Q1): 3500.0\n",
      "\tThird quartile (Q3): 301797.75\n",
      "\n",
      "Column Value ($):\n",
      "\tHas outliers: No\n",
      "\tNumber of outliers: 0\n",
      "\tMinimum value: 0.0\n",
      "\tMaximum value: 371376521.125\n",
      "\tFirst quartile (Q1): 271008.0\n",
      "\tThird quartile (Q3): 148713213.25\n",
      "\n",
      "Column Shares Total:\n",
      "\tHas outliers: No\n",
      "\tNumber of outliers: 0\n",
      "\tMinimum value: 49.0\n",
      "\tMaximum value: 3730529.125\n",
      "\tFirst quartile (Q1): 25103.5\n",
      "\tThird quartile (Q3): 1507273.75\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<Figure size 1500x1000 with 4 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Clip outliers in the DataFrame\n",
    "def remove_outliers(dataframe: DataFrame, columns: list[str]) -> DataFrame:\n",
    "    for column in columns:\n",
    "        if not pd.api.types.is_numeric_dtype(dataframe[column]):  # Skip non-numeric columns\n",
    "            continue\n",
    "\n",
    "        Q1: float = dataframe[column].quantile(0.25)  # First quartile (25%)\n",
    "        Q3: float = dataframe[column].quantile(0.75)  # Third quartile (75%)\n",
    "        IQR: float = Q3 - Q1  # Interquartile range\n",
    "\n",
    "        # Outlier bounds\n",
    "        lower_bound: float = Q1 - 1.5 * IQR  # Lower bound\n",
    "        upper_bound: float = Q3 + 1.5 * IQR  # Upper bound\n",
    "\n",
    "        # Clip the outliers:\n",
    "        # values below the lower bound become the lower bound,\n",
    "        # and values above the upper bound become the upper bound\n",
    "        dataframe[column] = dataframe[column].clip(lower=lower_bound, upper=upper_bound)\n",
    "\n",
    "    return dataframe\n",
    "\n",
    "\n",
    "# Clip the outliers\n",
    "df: DataFrame = remove_outliers(df, numeric_columns)\n",
    "\n",
    "# Check the columns for outliers again\n",
    "print('Checking the columns for outliers after clipping:')\n",
    "check_outliers(df, numeric_columns)\n",
    "visualize_outliers(df, numeric_columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Splitting the dataset into subsets:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Training set (60-80%): fitting the model, that is, choosing the coefficients of a mathematical function that approximates the data.\n",
    "2. Validation set (10-20%): choosing the learning method and tuning the hyperparameters.\n",
    "3. Test set (10-20% or 20-30%): assessing the model's quality before handing it over to the customer.\n",
    "\n",
    "The data should be balanced. To achieve this we use data augmentation, in this case oversampling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 233,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split a DataFrame into stratified training, validation, and test sets\n",
    "def split_stratified_into_train_val_test(\n",
    "    df_input,\n",
    "    stratify_colname=\"y\",\n",
    "    frac_train=0.6,\n",
    "    frac_val=0.15,\n",
    "    frac_test=0.25,\n",
    "    random_state=None,\n",
    ") -> tuple[Any, Any, Any]:\n",
    "\n",
    "    # Compare with a tolerance: float sums such as 0.7 + 0.2 + 0.1 are not exactly 1.0\n",
    "    if abs(frac_train + frac_val + frac_test - 1.0) > 1e-9:\n",
    "        raise ValueError(\n",
    "            \"fractions %f, %f, %f do not add up to 1.0\"\n",
    "            % (frac_train, frac_val, frac_test)\n",
    "        )\n",
    "\n",
    "    if stratify_colname not in df_input.columns:\n",
    "        raise ValueError(\"%s is not a column in the dataframe\" % (stratify_colname))\n",
    "\n",
    "    X: DataFrame = df_input\n",
    "    y: DataFrame = df_input[\n",
    "        [stratify_colname]\n",
    "    ]\n",
    "\n",
    "    # First split: training set versus everything else\n",
    "    df_train, df_temp, y_train, y_temp = train_test_split(\n",
    "        X, y,\n",
    "        stratify=y,\n",
    "        test_size=(1.0 - frac_train),\n",
    "        random_state=random_state\n",
    "    )\n",
    "\n",
    "    # Second split: divide the remainder into validation and test sets\n",
    "    relative_frac_test: float = frac_test / (frac_val + frac_test)\n",
    "    df_val, df_test, y_val, y_test = train_test_split(\n",
    "        df_temp,\n",
    "        y_temp,\n",
    "        stratify=y_temp,\n",
    "        test_size=relative_frac_test,\n",
    "        random_state=random_state,\n",
    "    )\n",
    "\n",
    "    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)\n",
    "\n",
    "    return df_train, df_val, df_test"
   ]
  },
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 234,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
"Distribution of the number of observations by label (class):\n",
"Cost\n",
"0.00 18\n",
"6.24 10\n",
"62.72 8\n",
"20.91 7\n",
"52.38 4\n",
" ..\n",
"1098.24 1\n",
"1072.22 1\n",
"1019.03 1\n",
"1048.46 1\n",
"1068.09 1\n",
"Name: count, Length: 101, dtype: int64 \n",
"\n",
"Statistical description of the target feature:\n",
"count 156.000000\n",
"mean 478.785641\n",
"std 448.922903\n",
"min 0.000000\n",
"25% 50.522500\n",
"50% 240.225000\n",
"75% 934.107500\n",
"max 1171.040000\n",
"Name: Cost, dtype: float64 \n",
"\n",
"Distribution of the number of observations by label (class):\n",
"Cost_category\n",
"medium 78\n",
"low 39\n",
"high 39\n",
"Name: count, dtype: int64 \n",
"\n",
"Checking the balance of the splits:\n",
"Training set: (93, 184)\n",
"Class distribution of the samples in column \"Cost_category\":\n",
" Cost_category\n",
"medium 47\n",
"low 23\n",
"high 23\n",
"Name: count, dtype: int64\n",
"Percentage of objects of class \"medium\": 50.54%\n",
"Percentage of objects of class \"low\": 24.73%\n",
"Percentage of objects of class \"high\": 24.73%\n",
"\n",
"Validation set: (31, 184)\n",
"Class distribution of the samples in column \"Cost_category\":\n",
" Cost_category\n",
"medium 15\n",
"low 8\n",
"high 8\n",
"Name: count, dtype: int64\n",
"Percentage of objects of class \"medium\": 48.39%\n",
"Percentage of objects of class \"low\": 25.81%\n",
"Percentage of objects of class \"high\": 25.81%\n",
"\n",
"Test set: (32, 184)\n",
"Class distribution of the samples in column \"Cost_category\":\n",
" Cost_category\n",
"medium 16\n",
"low 8\n",
"high 8\n",
"Name: count, dtype: int64\n",
"Percentage of objects of class \"medium\": 50.00%\n",
"Percentage of objects of class \"low\": 25.00%\n",
"Percentage of objects of class \"high\": 25.00%\n",
"\n",
"Checking whether the splits need augmentation:\n",
"Data augmentation is required for the training set\n",
"Data augmentation is required for the validation set\n",
"Data augmentation is required for the test set\n"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
"image/png": "[base64 PNG data omitted: three pie charts showing the Cost_category class shares in the training, validation, and test sets]",
"text/plain": [
|
|||
|
"<Figure size 1500x500 with 3 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"# Assess the class balance of a split\n",
"def check_balance(dataframe: DataFrame, dataframe_name: str, column: str) -> None:\n",
"    counts: Series[int] = dataframe[column].value_counts()\n",
"    print(dataframe_name + \": \", dataframe.shape)\n",
"    print(f\"Class distribution of the samples in column \\\"{column}\\\":\\n\", counts)\n",
"    total_count: int = len(dataframe)\n",
"    for value in counts.index:\n",
"        percentage: float = counts[value] / total_count * 100\n",
"        print(f\"Percentage of objects of class \\\"{value}\\\": {percentage:.2f}%\")\n",
"    print()\n",
"\n",
"\n",
"# Decide whether data augmentation is needed\n",
"def need_augmentation(dataframe: DataFrame,\n",
"                      column: str,\n",
"                      first_value: Any, second_value: Any) -> bool:\n",
"    counts: Series[int] = dataframe[column].value_counts()\n",
"    ratio: float = counts[first_value] / counts[second_value]\n",
"    # The two classes count as imbalanced when their ratio falls outside roughly 2:3 to 3:2\n",
"    return ratio > 1.5 or ratio < 0.67\n",
"\n",
"\n",
"# Visualize the class balance\n",
"def visualize_balance(dataframe_train: DataFrame,\n",
"                      dataframe_val: DataFrame,\n",
"                      dataframe_test: DataFrame,\n",
"                      column: str) -> None:\n",
"    fig, axes = plt.subplots(1, 3, figsize=(15, 5))\n",
"\n",
"    # Training set\n",
"    counts_train: Series[int] = dataframe_train[column].value_counts()\n",
"    axes[0].pie(counts_train, labels=counts_train.index, autopct='%1.1f%%', startangle=90)\n",
"    axes[0].set_title(f\"Class distribution of \\\"{column}\\\" in the training set\")\n",
"\n",
"    # Validation set\n",
"    counts_val: Series[int] = dataframe_val[column].value_counts()\n",
"    axes[1].pie(counts_val, labels=counts_val.index, autopct='%1.1f%%', startangle=90)\n",
"    axes[1].set_title(f\"Class distribution of \\\"{column}\\\" in the validation set\")\n",
"\n",
"    # Test set (the original title mislabeled this panel as the training set)\n",
"    counts_test: Series[int] = dataframe_test[column].value_counts()\n",
"    axes[2].pie(counts_test, labels=counts_test.index, autopct='%1.1f%%', startangle=90)\n",
"    axes[2].set_title(f\"Class distribution of \\\"{column}\\\" in the test set\")\n",
"\n",
"    # Show the charts\n",
"    plt.tight_layout()\n",
"    plt.show()\n",
"\n",
"\n",
"# One-hot encoding of the categorical features\n",
"df_encoded: DataFrame = pd.get_dummies(df)\n",
"\n",
"# Print the distribution of the number of observations by label (class)\n",
"print('Distribution of the number of observations by label (class):')\n",
"print(df_encoded['Cost'].value_counts(), '\\n')\n",
"\n",
"# Statistical description of the target feature\n",
"print('Statistical description of the target feature:')\n",
"print(df_encoded['Cost'].describe().transpose(), '\\n')\n",
"\n",
"# Define the bin edges for each share-price category (below Q1, Q1 to Q3, above Q3)\n",
"bins: list[float] = [df_encoded['Cost'].min() - 1,\n",
"                     df_encoded['Cost'].quantile(0.25),\n",
"                     df_encoded['Cost'].quantile(0.75),\n",
"                     df_encoded['Cost'].max() + 1]\n",
"labels: list[str] = ['low', 'medium', 'high']\n",
"\n",
"# Create a new column with the share-price categories\n",
"df_encoded['Cost_category'] = pd.cut(df_encoded['Cost'], bins=bins, labels=labels)\n",
"\n",
"# Print the distribution of the number of observations by label (class)\n",
"print('Distribution of the number of observations by label (class):')\n",
"print(df_encoded['Cost_category'].value_counts(), '\\n')\n",
"\n",
"df_train, df_val, df_test = split_stratified_into_train_val_test(\n",
"    df_encoded,\n",
"    stratify_colname=\"Cost_category\",\n",
"    frac_train=0.60,\n",
"    frac_val=0.20,\n",
"    frac_test=0.20\n",
")\n",
"\n",
"# Check the balance of the splits\n",
"print('Checking the balance of the splits:')\n",
"check_balance(df_train, 'Training set', 'Cost_category')\n",
"check_balance(df_val, 'Validation set', 'Cost_category')\n",
"check_balance(df_test, 'Test set', 'Cost_category')\n",
"\n",
"# Check whether the splits need augmentation\n",
"print('Checking whether the splits need augmentation:')\n",
"print(f\"Data augmentation is {'not ' if not need_augmentation(df_train, 'Cost_category', 'low', 'medium') else ''}required for the training set\")\n",
"print(f\"Data augmentation is {'not ' if not need_augmentation(df_val, 'Cost_category', 'low', 'medium') else ''}required for the validation set\")\n",
"print(f\"Data augmentation is {'not ' if not need_augmentation(df_test, 'Cost_category', 'low', 'medium') else ''}required for the test set\")\n",
"\n",
"# Visualize the class balance\n",
"visualize_balance(df_train, df_val, df_test, 'Cost_category')"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Oversampling must be applied to the splits: either duplicating observations or generating new ones from existing ones with algorithms such as SMOTE and ADASYN, which rely on finding the k nearest neighbors."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 235,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
"Checking the balance of the splits after oversampling:\n",
"Training set: (141, 184)\n",
"Class distribution of the samples in column \"Cost_category\":\n",
" Cost_category\n",
"low 47\n",
"medium 47\n",
"high 47\n",
"Name: count, dtype: int64\n",
"Percentage of objects of class \"low\": 33.33%\n",
"Percentage of objects of class \"medium\": 33.33%\n",
"Percentage of objects of class \"high\": 33.33%\n",
"\n",
"Validation set: (45, 184)\n",
"Class distribution of the samples in column \"Cost_category\":\n",
" Cost_category\n",
"low 15\n",
"medium 15\n",
"high 15\n",
"Name: count, dtype: int64\n",
"Percentage of objects of class \"low\": 33.33%\n",
"Percentage of objects of class \"medium\": 33.33%\n",
"Percentage of objects of class \"high\": 33.33%\n",
"\n",
"Test set: (48, 184)\n",
"Class distribution of the samples in column \"Cost_category\":\n",
" Cost_category\n",
"low 16\n",
"medium 16\n",
"high 16\n",
"Name: count, dtype: int64\n",
"Percentage of objects of class \"low\": 33.33%\n",
"Percentage of objects of class \"medium\": 33.33%\n",
"Percentage of objects of class \"high\": 33.33%\n",
"\n",
"Checking whether the splits need augmentation after oversampling:\n",
"Data augmentation is not required for the training set\n",
"Data augmentation is not required for the validation set\n",
"Data augmentation is not required for the test set\n"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
"image/png": "[base64 PNG data omitted: three pie charts showing the Cost_category class shares in the training, validation, and test sets after oversampling]",
"text/plain": [
|
|||
|
"<Figure size 1500x500 with 3 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"# Oversampling with SMOTE\n",
"def oversample(df: DataFrame, column: str) -> DataFrame:\n",
"    X: DataFrame = pd.get_dummies(df.drop(column, axis=1))\n",
"    y: DataFrame = df[column] # type: ignore\n",
"\n",
"    # SMOTE synthesizes new minority-class samples by interpolating between nearest neighbors\n",
"    smote = SMOTE()\n",
"    X_resampled, y_resampled = smote.fit_resample(X, y) # type: ignore\n",
"\n",
"    df_resampled: DataFrame = pd.concat([X_resampled, y_resampled], axis=1)\n",
"    return df_resampled\n",
"\n",
"\n",
"# Oversample each split\n",
"# Note: resampling is usually limited to the training set so that validation and\n",
"# test metrics reflect the original class balance; here all three splits are resampled\n",
"df_train_oversampled: DataFrame = oversample(df_train, 'Cost_category')\n",
"df_val_oversampled: DataFrame = oversample(df_val, 'Cost_category')\n",
"df_test_oversampled: DataFrame = oversample(df_test, 'Cost_category')\n",
"\n",
"# Check the balance of the splits\n",
"print('Checking the balance of the splits after oversampling:')\n",
"check_balance(df_train_oversampled, 'Training set', 'Cost_category')\n",
"check_balance(df_val_oversampled, 'Validation set', 'Cost_category')\n",
"check_balance(df_test_oversampled, 'Test set', 'Cost_category')\n",
"\n",
"# Check whether the splits need augmentation\n",
"print('Checking whether the splits need augmentation after oversampling:')\n",
"print(f\"Data augmentation is {'not ' if not need_augmentation(df_train_oversampled, 'Cost_category', 'low', 'medium') else ''}required for the training set\")\n",
"print(f\"Data augmentation is {'not ' if not need_augmentation(df_val_oversampled, 'Cost_category', 'low', 'medium') else ''}required for the validation set\")\n",
"print(f\"Data augmentation is {'not ' if not need_augmentation(df_test_oversampled, 'Cost_category', 'low', 'medium') else ''}required for the test set\")\n",
"\n",
"# Visualize the class balance\n",
"visualize_balance(df_train_oversampled, df_val_oversampled, df_test_oversampled, 'Cost_category')"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Feature engineering:"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Feature engineering is the process of defining the features that will go into our model.\n",
"\n",
"We will use the feature engineering method known as one-hot encoding of categorical features. It is needed to convert categorical variables into a numeric format."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 236,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>Cost</th>\n",
|
|||
|
" <th>Shares</th>\n",
|
|||
|
" <th>Value ($)</th>\n",
|
|||
|
" <th>Shares Total</th>\n",
|
|||
|
" <th>Insider Trading_Baglino Andrew D</th>\n",
|
|||
|
" <th>Insider Trading_DENHOLM ROBYN M</th>\n",
|
|||
|
" <th>Insider Trading_Kirkhorn Zachary</th>\n",
|
|||
|
" <th>Insider Trading_Musk Elon</th>\n",
|
|||
|
" <th>Insider Trading_Musk Kimbal</th>\n",
|
|||
|
" <th>Insider Trading_Taneja Vaibhav</th>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <th>SEC Form 4_Nov 30 04:42 PM</th>\n",
|
|||
|
" <th>SEC Form 4_Oct 05 07:35 PM</th>\n",
|
|||
|
" <th>SEC Form 4_Oct 31 07:06 PM</th>\n",
|
|||
|
" <th>SEC Form 4_Sep 07 08:29 PM</th>\n",
|
|||
|
" <th>SEC Form 4_Sep 07 08:33 PM</th>\n",
|
|||
|
" <th>SEC Form 4_Sep 07 09:04 PM</th>\n",
|
|||
|
" <th>SEC Form 4_Sep 12 09:44 PM</th>\n",
|
|||
|
" <th>SEC Form 4_Sep 14 07:47 PM</th>\n",
|
|||
|
" <th>SEC Form 4_Sep 30 07:03 PM</th>\n",
|
|||
|
" <th>Cost_category</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>196.72</td>\n",
|
|||
|
" <td>10455.0</td>\n",
|
|||
|
" <td>2056775.0</td>\n",
|
|||
|
" <td>203073.0</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>medium</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>195.79</td>\n",
|
|||
|
" <td>2466.0</td>\n",
|
|||
|
" <td>482718.0</td>\n",
|
|||
|
" <td>100458.0</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>medium</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>195.79</td>\n",
|
|||
|
" <td>1298.0</td>\n",
|
|||
|
" <td>254232.0</td>\n",
|
|||
|
" <td>65547.0</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>medium</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>0.00</td>\n",
|
|||
|
" <td>7138.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>102923.0</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>low</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>0.00</td>\n",
|
|||
|
" <td>2586.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>66845.0</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>low</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5</th>\n",
|
|||
|
" <td>0.00</td>\n",
|
|||
|
" <td>16867.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>213528.0</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>low</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>6</th>\n",
|
|||
|
" <td>20.91</td>\n",
|
|||
|
" <td>10500.0</td>\n",
|
|||
|
" <td>219555.0</td>\n",
|
|||
|
" <td>74759.0</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>low</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>7</th>\n",
|
|||
|
" <td>202.00</td>\n",
|
|||
|
" <td>10500.0</td>\n",
|
|||
|
" <td>2121000.0</td>\n",
|
|||
|
" <td>64259.0</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>medium</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>8</th>\n",
|
|||
|
" <td>193.00</td>\n",
|
|||
|
" <td>3750.0</td>\n",
|
|||
|
" <td>723750.0</td>\n",
|
|||
|
" <td>196661.0</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>medium</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>9</th>\n",
|
|||
|
" <td>20.91</td>\n",
|
|||
|
" <td>10500.0</td>\n",
|
|||
|
" <td>219555.0</td>\n",
|
|||
|
" <td>74759.0</td>\n",
|
|||
|
" <td>True</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>low</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>10 rows × 184 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" Cost Shares Value ($) Shares Total Insider Trading_Baglino Andrew D \\\n",
|
|||
|
"0 196.72 10455.0 2056775.0 203073.0 False \n",
|
|||
|
"1 195.79 2466.0 482718.0 100458.0 False \n",
|
|||
|
"2 195.79 1298.0 254232.0 65547.0 True \n",
|
|||
|
"3 0.00 7138.0 0.0 102923.0 False \n",
|
|||
|
"4 0.00 2586.0 0.0 66845.0 True \n",
|
|||
|
"5 0.00 16867.0 0.0 213528.0 False \n",
|
|||
|
"6 20.91 10500.0 219555.0 74759.0 True \n",
|
|||
|
"7 202.00 10500.0 2121000.0 64259.0 True \n",
|
|||
|
"8 193.00 3750.0 723750.0 196661.0 False \n",
|
|||
|
"9 20.91 10500.0 219555.0 74759.0 True \n",
|
|||
|
"\n",
|
|||
|
" Insider Trading_DENHOLM ROBYN M Insider Trading_Kirkhorn Zachary \\\n",
|
|||
|
"0 False True \n",
|
|||
|
"1 False False \n",
|
|||
|
"2 False False \n",
|
|||
|
"3 False False \n",
|
|||
|
"4 False False \n",
|
|||
|
"5 False True \n",
|
|||
|
"6 False False \n",
|
|||
|
"7 False False \n",
|
|||
|
"8 False True \n",
|
|||
|
"9 False False \n",
|
|||
|
"\n",
|
|||
|
" Insider Trading_Musk Elon Insider Trading_Musk Kimbal \\\n",
|
|||
|
"0 False False \n",
|
|||
|
"1 False False \n",
|
|||
|
"2 False False \n",
|
|||
|
"3 False False \n",
|
|||
|
"4 False False \n",
|
|||
|
"5 False False \n",
|
|||
|
"6 False False \n",
|
|||
|
"7 False False \n",
|
|||
|
"8 False False \n",
|
|||
|
"9 False False \n",
|
|||
|
"\n",
|
|||
|
" Insider Trading_Taneja Vaibhav ... SEC Form 4_Nov 30 04:42 PM \\\n",
|
|||
|
"0 False ... False \n",
|
|||
|
"1 True ... False \n",
|
|||
|
"2 False ... False \n",
|
|||
|
"3 True ... False \n",
|
|||
|
"4 False ... False \n",
|
|||
|
"5 False ... False \n",
|
|||
|
"6 False ... False \n",
|
|||
|
"7 False ... False \n",
|
|||
|
"8 False ... False \n",
|
|||
|
"9 False ... False \n",
|
|||
|
"\n",
|
|||
|
" SEC Form 4_Oct 05 07:35 PM SEC Form 4_Oct 31 07:06 PM \\\n",
|
|||
|
"0 False False \n",
|
|||
|
"1 False False \n",
|
|||
|
"2 False False \n",
|
|||
|
"3 False False \n",
|
|||
|
"4 False False \n",
|
|||
|
"5 False False \n",
|
|||
|
"6 False False \n",
|
|||
|
"7 False False \n",
|
|||
|
"8 False False \n",
|
|||
|
"9 False False \n",
|
|||
|
"\n",
|
|||
|
" SEC Form 4_Sep 07 08:29 PM SEC Form 4_Sep 07 08:33 PM \\\n",
|
|||
|
"0 False False \n",
|
|||
|
"1 False False \n",
|
|||
|
"2 False False \n",
|
|||
|
"3 False False \n",
|
|||
|
"4 False False \n",
|
|||
|
"5 False False \n",
|
|||
|
"6 False False \n",
|
|||
|
"7 False False \n",
|
|||
|
"8 False False \n",
|
|||
|
"9 False False \n",
|
|||
|
"\n",
|
|||
|
" SEC Form 4_Sep 07 09:04 PM SEC Form 4_Sep 12 09:44 PM \\\n",
|
|||
|
"0 False False \n",
|
|||
|
"1 False False \n",
|
|||
|
"2 False False \n",
|
|||
|
"3 False False \n",
|
|||
|
"4 False False \n",
|
|||
|
"5 False False \n",
|
|||
|
"6 False False \n",
|
|||
|
"7 False False \n",
|
|||
|
"8 False False \n",
|
|||
|
"9 False False \n",
|
|||
|
"\n",
|
|||
|
" SEC Form 4_Sep 14 07:47 PM SEC Form 4_Sep 30 07:03 PM Cost_category \n",
|
|||
|
"0 False False medium \n",
|
|||
|
"1 False False medium \n",
|
|||
|
"2 False False medium \n",
|
|||
|
"3 False False low \n",
|
|||
|
"4 False False low \n",
|
|||
|
"5 False False low \n",
|
|||
|
"6 False False low \n",
|
|||
|
"7 False False medium \n",
|
|||
|
"8 False False medium \n",
|
|||
|
"9 False False low \n",
|
|||
|
"\n",
|
|||
|
"[10 rows x 184 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 236,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"df_encoded.head(10)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Discretization of numeric features is the process of converting continuous numeric values into categorical groups or intervals (discrete values).\n",
"\n",
"Here, the numeric column \"Cost\" was already discretized earlier, when the source data was split with stratification into training, validation, and test sets. Quartile-based binning was used for this."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 237,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Training set:\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>Cost</th>\n",
|
|||
|
" <th>Cost_category</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>195.79</td>\n",
|
|||
|
" <td>medium</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>923.57</td>\n",
|
|||
|
" <td>medium</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>0.00</td>\n",
|
|||
|
" <td>low</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>748.11</td>\n",
|
|||
|
" <td>medium</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>18.44</td>\n",
|
|||
|
" <td>low</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5</th>\n",
|
|||
|
" <td>875.23</td>\n",
|
|||
|
" <td>medium</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>6</th>\n",
|
|||
|
" <td>992.27</td>\n",
|
|||
|
" <td>high</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>7</th>\n",
|
|||
|
" <td>1073.00</td>\n",
|
|||
|
" <td>high</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>8</th>\n",
|
|||
|
" <td>6.24</td>\n",
|
|||
|
" <td>low</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>9</th>\n",
|
|||
|
" <td>250.50</td>\n",
|
|||
|
" <td>medium</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" Cost Cost_category\n",
|
|||
|
"0 195.79 medium\n",
|
|||
|
"1 923.57 medium\n",
|
|||
|
"2 0.00 low\n",
|
|||
|
"3 748.11 medium\n",
|
|||
|
"4 18.44 low\n",
|
|||
|
"5 875.23 medium\n",
|
|||
|
"6 992.27 high\n",
|
|||
|
"7 1073.00 high\n",
|
|||
|
"8 6.24 low\n",
|
|||
|
"9 250.50 medium"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 237,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"print('Training set:')\n",
|
|||
|
"df_train_oversampled[['Cost', 'Cost_category']].head(10)"
|
|||
|
]
|
|||
|
},
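{
"cell_type": "markdown",
"metadata": {},
"source": [
"The quantile-based binning described above can be sketched as follows. This is a minimal illustration on made-up values, assuming `pd.qcut` with three equal-frequency bins and the labels `low`, `medium`, `high`; the bin count and the toy `costs` series are assumptions, not the notebook's actual split code."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical prices, used only to illustrate the binning\n",
"costs = pd.Series([0.0, 6.24, 18.44, 195.79, 250.5, 748.11, 923.57, 992.27, 1073.0], name='Cost')\n",
"\n",
"# Equal-frequency bins: each label receives roughly the same number of rows\n",
"cost_category = pd.qcut(costs, q=3, labels=['low', 'medium', 'high'])\n",
"\n",
"binned = pd.DataFrame({'Cost': costs, 'Cost_category': cost_category})\n",
"print(binned)"
]
},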
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"\"Manual\" feature synthesis is the process of creating new features from existing data. It can include combining several features, applying mathematical operations (for example, addition or subtraction), and creating polynomial or logarithmic features."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 238,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>Date</th>\n",
|
|||
|
" <th>Year</th>\n",
|
|||
|
" <th>Quarter</th>\n",
|
|||
|
" <th>Month</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>2022-03-06</td>\n",
|
|||
|
" <td>2022</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>2022-03-06</td>\n",
|
|||
|
" <td>2022</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>2022-03-06</td>\n",
|
|||
|
" <td>2022</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>2022-03-05</td>\n",
|
|||
|
" <td>2022</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>2022-03-05</td>\n",
|
|||
|
" <td>2022</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5</th>\n",
|
|||
|
" <td>2022-03-05</td>\n",
|
|||
|
" <td>2022</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>6</th>\n",
|
|||
|
" <td>2022-02-27</td>\n",
|
|||
|
" <td>2022</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>7</th>\n",
|
|||
|
" <td>2022-02-27</td>\n",
|
|||
|
" <td>2022</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>8</th>\n",
|
|||
|
" <td>2022-02-06</td>\n",
|
|||
|
" <td>2022</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>9</th>\n",
|
|||
|
" <td>2022-01-27</td>\n",
|
|||
|
" <td>2022</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" Date Year Quarter Month\n",
|
|||
|
"0 2022-03-06 2022 1 3\n",
|
|||
|
"1 2022-03-06 2022 1 3\n",
|
|||
|
"2 2022-03-06 2022 1 3\n",
|
|||
|
"3 2022-03-05 2022 1 3\n",
|
|||
|
"4 2022-03-05 2022 1 3\n",
|
|||
|
"5 2022-03-05 2022 1 3\n",
|
|||
|
"6 2022-02-27 2022 1 2\n",
|
|||
|
"7 2022-02-27 2022 1 2\n",
|
|||
|
"8 2022-02-06 2022 1 2\n",
|
|||
|
"9 2022-01-27 2022 1 1"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 238,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"df['Date'] = pd.to_datetime(df['Date']) # Convert to datetime\n",
"df['Year'] = df['Date'].dt.year # Year\n",
"df['Quarter'] = df['Date'].dt.quarter # Quarter\n",
"df['Month'] = df['Date'].dt.month # Month\n",
|
|||
|
"\n",
|
|||
|
"df[['Date', 'Year', 'Quarter', 'Month']].head(10)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Finally, feature scaling based on normalization and standardization brings all numeric features to the same or very similar value ranges or distributions."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 239,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>Cost</th>\n",
|
|||
|
" <th>Shares</th>\n",
|
|||
|
" <th>Value ($)</th>\n",
|
|||
|
" <th>Shares Total</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>-0.630340</td>\n",
|
|||
|
" <td>-0.607179</td>\n",
|
|||
|
" <td>-0.583446</td>\n",
|
|||
|
" <td>-0.528366</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>-0.632418</td>\n",
|
|||
|
" <td>-0.635043</td>\n",
|
|||
|
" <td>-0.594307</td>\n",
|
|||
|
" <td>-0.604905</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>-0.632418</td>\n",
|
|||
|
" <td>-0.639117</td>\n",
|
|||
|
" <td>-0.595883</td>\n",
|
|||
|
" <td>-0.630945</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>-1.069956</td>\n",
|
|||
|
" <td>-0.618748</td>\n",
|
|||
|
" <td>-0.597637</td>\n",
|
|||
|
" <td>-0.603067</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>-1.069956</td>\n",
|
|||
|
" <td>-0.634624</td>\n",
|
|||
|
" <td>-0.597637</td>\n",
|
|||
|
" <td>-0.629977</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5</th>\n",
|
|||
|
" <td>-1.069956</td>\n",
|
|||
|
" <td>-0.584816</td>\n",
|
|||
|
" <td>-0.597637</td>\n",
|
|||
|
" <td>-0.520567</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>6</th>\n",
|
|||
|
" <td>-1.023228</td>\n",
|
|||
|
" <td>-0.607022</td>\n",
|
|||
|
" <td>-0.596122</td>\n",
|
|||
|
" <td>-0.624074</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>7</th>\n",
|
|||
|
" <td>-0.618541</td>\n",
|
|||
|
" <td>-0.607022</td>\n",
|
|||
|
" <td>-0.583003</td>\n",
|
|||
|
" <td>-0.631906</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>8</th>\n",
|
|||
|
" <td>-0.638653</td>\n",
|
|||
|
" <td>-0.630565</td>\n",
|
|||
|
" <td>-0.592643</td>\n",
|
|||
|
" <td>-0.533148</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>9</th>\n",
|
|||
|
" <td>-1.023228</td>\n",
|
|||
|
" <td>-0.607022</td>\n",
|
|||
|
" <td>-0.596122</td>\n",
|
|||
|
" <td>-0.624074</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" Cost Shares Value ($) Shares Total\n",
|
|||
|
"0 -0.630340 -0.607179 -0.583446 -0.528366\n",
|
|||
|
"1 -0.632418 -0.635043 -0.594307 -0.604905\n",
|
|||
|
"2 -0.632418 -0.639117 -0.595883 -0.630945\n",
|
|||
|
"3 -1.069956 -0.618748 -0.597637 -0.603067\n",
|
|||
|
"4 -1.069956 -0.634624 -0.597637 -0.629977\n",
|
|||
|
"5 -1.069956 -0.584816 -0.597637 -0.520567\n",
|
|||
|
"6 -1.023228 -0.607022 -0.596122 -0.624074\n",
|
|||
|
"7 -0.618541 -0.607022 -0.583003 -0.631906\n",
|
|||
|
"8 -0.638653 -0.630565 -0.592643 -0.533148\n",
|
|||
|
"9 -1.023228 -0.607022 -0.596122 -0.624074"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 239,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"# scaler = MinMaxScaler()\n",
|
|||
|
"scaler = StandardScaler()\n",
|
|||
|
"\n",
|
|||
|
"# Apply scaling to the selected features\n",
|
|||
|
"df[numeric_columns] = scaler.fit_transform(df[numeric_columns])\n",
|
|||
|
"\n",
|
|||
|
"df[numeric_columns].head(10)"
|
|||
|
]
|
|||
|
},
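{
"cell_type": "markdown",
"metadata": {},
"source": [
"A common way to apply the scaling described above without leaking validation or test statistics into training is to fit the scaler on the training set only and reuse it for the other sets. The sketch below assumes toy `train` and `test` frames (hypothetical data), not the notebook's actual dataframes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Hypothetical data, used only to illustrate the fit/transform split\n",
"train = pd.DataFrame({'Cost': [0.0, 100.0, 200.0], 'Shares': [10.0, 20.0, 30.0]})\n",
"test = pd.DataFrame({'Cost': [100.0], 'Shares': [40.0]})\n",
"\n",
"scaler = StandardScaler()\n",
"\n",
"# Learn the mean and standard deviation from the training set only\n",
"train_scaled = pd.DataFrame(scaler.fit_transform(train), columns=train.columns, index=train.index)\n",
"\n",
"# Reuse the training statistics for the test set\n",
"test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)\n",
"\n",
"print(train_scaled)\n",
"print(test_scaled)"
]
},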
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Featuretools is a library for automated feature engineering on structured data."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 240,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"e:\\aim\\aimenv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
|
|||
|
" pd.to_datetime(\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>Insider Trading</th>\n",
|
|||
|
" <th>Relationship</th>\n",
|
|||
|
" <th>Transaction</th>\n",
|
|||
|
" <th>Cost</th>\n",
|
|||
|
" <th>DAY(Date)</th>\n",
|
|||
|
" <th>MONTH(Date)</th>\n",
|
|||
|
" <th>WEEKDAY(Date)</th>\n",
|
|||
|
" <th>YEAR(Date)</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>Id</th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th></th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>154</th>\n",
|
|||
|
" <td>Musk Elon</td>\n",
|
|||
|
" <td>CEO</td>\n",
|
|||
|
" <td>Sale</td>\n",
|
|||
|
" <td>1019.03</td>\n",
|
|||
|
" <td>10</td>\n",
|
|||
|
" <td>11</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>2021</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>155</th>\n",
|
|||
|
" <td>Musk Elon</td>\n",
|
|||
|
" <td>CEO</td>\n",
|
|||
|
" <td>Sale</td>\n",
|
|||
|
" <td>1048.46</td>\n",
|
|||
|
" <td>10</td>\n",
|
|||
|
" <td>11</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>2021</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>156</th>\n",
|
|||
|
" <td>Musk Elon</td>\n",
|
|||
|
" <td>CEO</td>\n",
|
|||
|
" <td>Sale</td>\n",
|
|||
|
" <td>1068.09</td>\n",
|
|||
|
" <td>10</td>\n",
|
|||
|
" <td>11</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>2021</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>152</th>\n",
|
|||
|
" <td>Musk Elon</td>\n",
|
|||
|
" <td>CEO</td>\n",
|
|||
|
" <td>Sale</td>\n",
|
|||
|
" <td>1098.24</td>\n",
|
|||
|
" <td>11</td>\n",
|
|||
|
" <td>11</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>2021</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>153</th>\n",
|
|||
|
" <td>Musk Elon</td>\n",
|
|||
|
" <td>CEO</td>\n",
|
|||
|
" <td>Sale</td>\n",
|
|||
|
" <td>1072.22</td>\n",
|
|||
|
" <td>11</td>\n",
|
|||
|
" <td>11</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>2021</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>151</th>\n",
|
|||
|
" <td>Musk Elon</td>\n",
|
|||
|
" <td>CEO</td>\n",
|
|||
|
" <td>Sale</td>\n",
|
|||
|
" <td>1029.67</td>\n",
|
|||
|
" <td>12</td>\n",
|
|||
|
" <td>11</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>2021</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>148</th>\n",
|
|||
|
" <td>Musk Elon</td>\n",
|
|||
|
" <td>CEO</td>\n",
|
|||
|
" <td>Option Exercise</td>\n",
|
|||
|
" <td>6.24</td>\n",
|
|||
|
" <td>15</td>\n",
|
|||
|
" <td>11</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2021</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>149</th>\n",
|
|||
|
" <td>Musk Elon</td>\n",
|
|||
|
" <td>CEO</td>\n",
|
|||
|
" <td>Sale</td>\n",
|
|||
|
" <td>992.72</td>\n",
|
|||
|
" <td>15</td>\n",
|
|||
|
" <td>11</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2021</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>150</th>\n",
|
|||
|
" <td>Musk Elon</td>\n",
|
|||
|
" <td>CEO</td>\n",
|
|||
|
" <td>Sale</td>\n",
|
|||
|
" <td>1015.85</td>\n",
|
|||
|
" <td>15</td>\n",
|
|||
|
" <td>11</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2021</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>145</th>\n",
|
|||
|
" <td>Musk Elon</td>\n",
|
|||
|
" <td>CEO</td>\n",
|
|||
|
" <td>Option Exercise</td>\n",
|
|||
|
" <td>6.24</td>\n",
|
|||
|
" <td>16</td>\n",
|
|||
|
" <td>11</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>2021</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" Insider Trading Relationship Transaction Cost DAY(Date) \\\n",
|
|||
|
"Id \n",
|
|||
|
"154 Musk Elon CEO Sale 1019.03 10 \n",
|
|||
|
"155 Musk Elon CEO Sale 1048.46 10 \n",
|
|||
|
"156 Musk Elon CEO Sale 1068.09 10 \n",
|
|||
|
"152 Musk Elon CEO Sale 1098.24 11 \n",
|
|||
|
"153 Musk Elon CEO Sale 1072.22 11 \n",
|
|||
|
"151 Musk Elon CEO Sale 1029.67 12 \n",
|
|||
|
"148 Musk Elon CEO Option Exercise 6.24 15 \n",
|
|||
|
"149 Musk Elon CEO Sale 992.72 15 \n",
|
|||
|
"150 Musk Elon CEO Sale 1015.85 15 \n",
|
|||
|
"145 Musk Elon CEO Option Exercise 6.24 16 \n",
|
|||
|
"\n",
|
|||
|
" MONTH(Date) WEEKDAY(Date) YEAR(Date) \n",
|
|||
|
"Id \n",
|
|||
|
"154 11 2 2021 \n",
|
|||
|
"155 11 2 2021 \n",
|
|||
|
"156 11 2 2021 \n",
|
|||
|
"152 11 3 2021 \n",
|
|||
|
"153 11 3 2021 \n",
|
|||
|
"151 11 4 2021 \n",
|
|||
|
"148 11 0 2021 \n",
|
|||
|
"149 11 0 2021 \n",
|
|||
|
"150 11 0 2021 \n",
|
|||
|
"145 11 1 2021 "
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 240,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"df: DataFrame = pd.read_csv(\"static/csv/TSLA.csv\")\n",
|
|||
|
"\n",
|
|||
|
"# Create a unique identifier for each row\n",
|
|||
|
"df['Id'] = range(1, len(df) + 1)\n",
|
|||
|
"\n",
|
|||
|
"# Create the EntitySet\n",
|
|||
|
"es = ft.EntitySet(id=\"Id\")\n",
|
|||
|
"\n",
|
|||
|
"# Add the table together with its index\n",
|
|||
|
"es: EntitySet = es.add_dataframe(\n",
|
|||
|
" dataframe_name=\"trades\", \n",
|
|||
|
" dataframe=df, \n",
|
|||
|
" index=\"Id\", \n",
|
|||
|
" time_index=\"Date\"\n",
|
|||
|
")\n",
|
|||
|
"\n",
|
|||
|
"# Generate features with Deep Feature Synthesis\n",
|
|||
|
"feature_matrix, feature_defs = ft.dfs(\n",
|
|||
|
" entityset=es, \n",
|
|||
|
" target_dataframe_name='trades', \n",
|
|||
|
" max_depth=1\n",
|
|||
|
")\n",
|
|||
|
"\n",
|
|||
|
"# Show the first 10 rows of the generated feature matrix\n",
|
|||
|
"feature_matrix.head(10)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Evaluating the quality of the feature set:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Predictive power: the ability of the feature set to predict the target variable successfully. It is measured with metrics such as RMSE, MAE, and R², which show how well the model uses the features to produce accurate results.\n",
"\n",
"Computation speed: the time needed to process the data and run the machine learning algorithms.\n",
"\n",
"Reliability: the stability and reproducibility of the results when the input data changes.\n",
"\n",
"Correlation: the degree of association between the features and the target variable, and among the features themselves. High correlation with the target points to potential predictive power, whereas high correlation among the features can lead to multicollinearity and reduce model effectiveness.\n",
"\n",
"Integrity: a feature is not derived from other features."
|
|||
|
]
|
|||
|
},
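{
"cell_type": "markdown",
"metadata": {},
"source": [
"The correlation criterion above can be checked with a pairwise correlation matrix. A minimal sketch, assuming a toy `features` frame and an arbitrary threshold of 0.9 (both hypothetical), that lists feature pairs likely to cause multicollinearity:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical features; 'Value' is an exact multiple of 'Shares'\n",
"features = pd.DataFrame({\n",
"    'Shares': [1.0, 2.0, 3.0, 4.0],\n",
"    'Value': [2.0, 4.0, 6.0, 8.0],\n",
"    'Cost': [5.0, 3.0, 6.0, 2.0],\n",
"})\n",
"\n",
"# Absolute Pearson correlation between every pair of columns\n",
"corr = features.corr().abs()\n",
"threshold = 0.9  # assumed cut-off, tune for the task at hand\n",
"\n",
"# Collect each pair once (upper triangle only)\n",
"pairs = [\n",
"    (a, b)\n",
"    for i, a in enumerate(corr.columns)\n",
"    for b in corr.columns[i + 1:]\n",
"    if corr.loc[a, b] > threshold\n",
"]\n",
"print(pairs)"
]
},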
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 241,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Model training time: 0.01 seconds\n",
"RMSE: 190.15\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"# Split a dataset into input features and the target column\n",
"def split_dataframe(dataframe: DataFrame, column: str) -> tuple[DataFrame, Series]:\n",
"    X_dataframe: DataFrame = dataframe.drop(columns=column)\n",
"    y_dataframe: Series = dataframe[column]\n",
"\n",
"    return X_dataframe, y_dataframe\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"# Split the training set into input features and the target\n",
|
|||
|
"df_train_oversampled: DataFrame = pd.get_dummies(df_train_oversampled)\n",
|
|||
|
"X_df_train, y_df_train = split_dataframe(df_train_oversampled, \"Cost\")\n",
|
|||
|
"\n",
|
|||
|
"# Split the validation set into input features and the target\n",
|
|||
|
"df_val_oversampled: DataFrame = pd.get_dummies(df_val_oversampled)\n",
|
|||
|
"X_df_val, y_df_val = split_dataframe(df_val_oversampled, \"Cost\")\n",
|
|||
|
"\n",
|
|||
|
"# Split the test set into input features and the target\n",
|
|||
|
"df_test_oversampled: DataFrame = pd.get_dummies(df_test_oversampled)\n",
|
|||
|
"X_df_test, y_df_test = split_dataframe(df_test_oversampled, \"Cost\")\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"# Linear regression model to train\n",
|
|||
|
"model = LinearRegression()\n",
|
|||
|
"\n",
|
|||
|
"# Start the timer\n",
|
|||
|
"start_time: float = time.time()\n",
|
|||
|
"model.fit(X_df_train, y_df_train)\n",
|
|||
|
"\n",
|
|||
|
"# Model training time\n",
|
|||
|
"train_time: float = time.time() - start_time\n",
|
|||
|
"\n",
|
|||
|
"# Predict on the validation set and evaluate the model\n",
"predictions = model.predict(X_df_val)\n",
"rmse = root_mean_squared_error(y_df_val, predictions)\n",
"\n",
"print(f'Model training time: {train_time:.2f} seconds')\n",
"print(f'RMSE: {rmse:.2f}')"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 242,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"RMSE: 134.73396019154637\n",
|
|||
|
"R²: 0.9090517989509861\n",
|
|||
|
"MAE: 71.95763423238986\n",
|
|||
|
"\n",
|
|||
|
"Cross-validation RMSE: 141.69564978570725\n",
|
|||
|
"\n",
|
|||
|
"Train RMSE: 46.69276439077218\n",
|
|||
|
"Train R²: 0.9906750460946525\n",
|
|||
|
"Train MAE: 18.74249758908302\n",
|
|||
|
"\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"Text(0.5, 1.0, 'Фактическая стоимость по сравнению с прогнозируемой')"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 242,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA1sAAAIjCAYAAAD1OgEdAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAACp3ElEQVR4nOzdd1xT1/sH8E8GIRAgiLJEFBVQca9at61WtO5ardXWPeoAt3Xvuhdq1dr2q7baVmvVVtu6994DxYGjWlkisgkhyf394Y9bY1AJBsL4vF+vvDTn3uQ+uRnkyTnnORJBEAQQERERERGRRUmtHQAREREREVFhxGSLiIiIiIgoFzDZIiIiIiIiygVMtoiIiIiIiHIBky0iIiIiIqJcwGSLiIiIiIgoFzDZIiIiIiIiygVMtoiIiIiIiHIBky0iIiIiIqJcwGSLiIiIiIqcjRs34sGDB+L19evX4/Hjx9YLiAolJltEuah3795wcHCwdhhERET0kmPHjmHcuHF48OAB9uzZg6FDh0Iq5Vdjsiy5tQMgKmyePn2KTZs24dixYzh69CjS0tLQqlUr1KxZE127dkXNmjWtHSIREVGRN3LkSDRr1gxly5YFAIwaNQqenp5WjooKG4kgCIK1gyAqLH755RcMGDAAycnJ8PHxQUZGBqKiolCzZk1cuXIFGRkZ6NWrF9auXQuFQmHtcImIiIq0lJQUhIaGokSJEihfvry1w6FCiH2lRBZy4sQJfPbZZ/Dw8MCJEydw//59tGjRAkqlEufOnUNERAQ+/fRTbNiwASNHjjS67aJFi9CgQQMUL14cdnZ2qF27NrZu3WpyDIlEgunTp4vXdTodPvzwQ7i4uODGjRviPq+7NGvWDABw+PBhSCQSHD582OgYbdq0MTlOs2bNxNtlevDgASQSCdavX2/UfvPmTXz88cdwcXGBUqlEnTp18Mcff5g8lvj4eIwcORI+Pj6wtbVFqVKl0LNnT8TGxr4yvoiICPj4+KBOnTpITk4GAGi1WkydOhW1a9eGWq2GSqVC48aNcejQIZNjxsTEoF+/fihdujRkMpl4TrI71PPvv/9G06ZN4ejoCCcnJ9StWxc//fSTeI7edO4z6XQ6zJo1C+XLl4etrS18fHwwceJEpKenGx3Px8cHvXv3Nmr79ddfIZFI4OPjI7ZlPhcSiQQ7duww2l+j0aBYsWKQSCRYtGiR0bZLly6hdevWcHJygoODA5o3b47Tp0+bPO7XPVeZz9PrLpmvpenTp0MikYjPsTkyb/uqy8uvw4MHD6Jx48ZQqVRwdnZGhw4dEBYWlq1jaTQaTJ8+Hf7+/lAqlfD09MRHH32Eu3fvAvjvfC9atAhLly5FmTJlYGdnh6ZNmyI0NNTovq5evYrevXujXLlyUCqV8PDwQN++ffH06dPXPj5HR0e88847Js9ns2bNUKVKFZOYFy1aBIlEYjT/BHj+ms08D46OjmjTpg2uX79utM+rhjtv3brV5D2Y1WfBuXPnTF7jgOnrNykpCcOGDYOXlxdsbW3h5+eHefPmwWAwmBw7K2fOnMGHH36IYsWKQaVSoVq1aggJCXntbdavX5+t1ybw33Nw8+ZNdO3aFU5OTihevDiGDx8OjUZjdL/mvIezOm7//v0BGL+WXlalShWTc535Gebu7g6lUonq1atjw4YNRvvcunUL77//Pjw8PGBrawtvb2988cUXiIuLE/cx5/M/u+eladOmqF69epbPQ4UKFRAYGCheNxgMWLZsGSpXrgylUgl3d3cMGjQIz549y/L8jRgxwuQ+AwMDIZFI0LZtW7PP0Yt/v1QqFerVq4fy5ctj6NChkEgkJp+7RG+DwwiJLCTzS8Mvv/yC2rVrm2wvUaIEfvjhB9y4cQPffPMNpk2bBjc3NwBASEgI2rdvjx49ekCr1eKXX35Bly5dsGvXLrRp0+aVx+zfvz8OHz6Mffv2ISAgAADw448/ituPHTuGtWvXYunSpS
hRogQAwN3d/ZX3d/ToUfz11185evwAcP36dTRs2BBeXl4YP348VCoVtmzZgo4dO+K3335Dp06dAADJyclo3LgxwsLC0LdvX9SqVQuxsbH4448/8O+//4qxvighIQGtW7eGjY0N/vrrL/HLYWJiIr777jt8+umnGDBgAJKSkvD9998jMDAQZ8+eRY0aNcT76NWrF/bv34+goCBUr14dMpkMa9euxcWLF9/42NavX4++ffuicuXKmDBhApydnXHp0iXs3r0b3bt3x6RJk8QvULGxsRg5ciQGDhyIxo0bm9xX//79sWHDBnz88ccYPXo0zpw5g7lz5yIsLAzbt29/ZQw6nQ6TJk165XalUol169ahY8eOYtu2bdtMvigCz5+rxo0bw8nJCePGjYONjQ2++eYbNGvWDEeOHEG9evUAvPm5qlSpktFrbu3atQgLC8PSpUvFtmrVqr36xJpp9erVRonB/fv3MXXqVKN99u/fj9atW6NcuXKYPn060tLSsGLFCjRs2BAXL140SlRfptfr0bZtWxw4cADdunXD8OHDkZSUhH379iE0NNTol+8ffvgBSUlJGDp0KDQaDUJCQvD+++/j2rVr4vts3759uHfvHvr06QMPDw9cv34da9euxfXr13H69GmTBCXzXMbGxmLVqlXo0qULQkNDUaFCBbPP1Y8//ohevXohMDAQ8+fPR2pqKlavXo1GjRrh0qVLrz0P5vjyyy+ztV/nzp2xb98+9OzZE++88w4OHTqECRMm4MGDB1izZs1rb7tv3z60bdsWnp6eGD58ODw8PBAWFoZdu3Zh+PDhbzz2zJkzxaFiwPPX9eDBg7Pct2vXrvDx8cHcuXNx+vRpLF++HM+ePcMPP/wg7mPOe7hGjRoYPXq0UZuvr+8bY35ZWloamjVrhvDwcAwbNgxly5bFr7/+it69eyM+Pl48DykpKShVqhTatWsHJycnhIaG4uuvv8bjx4+xc+fOV97/mz7/33RePv/8cwwYMAChoaFGPwicO3cOt2/fxuTJk8W2QYMGYf369ejTpw+Cg4Nx//59rFy5EpcuXcKJEydgY2Mj7qtUKrFp0yYsXLhQbP/3339x4MABKJXKHJ2jrISHh+Pbb7995XaiHBOIyCJcXFyEMmXKGLX16tVLUKlURm1TpkwRAAg7d+4U21JTU4320Wq1QpUqVYT333/fqB2AMG3aNEEQBGHChAmCTCYTduzY8cqY1q1bJwAQ7t+/b7Lt0KFDAgDh0KFDYlu9evWE1q1bGx1HEAThvffeE5o0aWJ0+/v37wsAhHXr1oltzZs3F6pWrSpoNBqxzWAwCA0aNBD8/PzEtqlTpwoAhG3btpnEZTAYTOLTaDRCs2bNBDc3NyE8PNxof51OJ6Snpxu1PXv2THB3dxf69u0rtqWlpQlSqVQYNGiQ0b5ZPUcvi4+PFxwdHYV69eoJaWlpWcb7oqzOTabLly8LAIT+/fsbtY8ZM0YAIBw8eFBsK1OmjNCrVy/x+qpVqwRbW1vhvffeM3qtZR7v008/FeRyuRAVFSVua968udC9e3cBgLBw4UKxvWPHjoJCoRDu3r0rtkVERAiOjo5Gz3V2nqsX9erVy+R9kGnatGkCAOHJkydZbn+dV9323LlzJue6Ro0agpubm/D06VOx7cqVK4JUKhV69uz52uP873//EwAIS5YsMdmW+Xgzz7ednZ3w77//itvPnDkjABBGjhwptr383hYEQfj5558FAMLRo0dNHt+L9u7dKwAQtmzZIrY1bdpUqFy5ssl9Lly40Oi9npSUJDg7OwsDBgww2i8qKkpQq9VG7a96D/z6668mnxFNmzYVmjZtKl7/66+/BABCq1atTOJ/8fW7c+dOAYAwfvx4o3169+4tABCuXbtmcvxMOp1OKFu2rFCmTBnh2bNnRtuyeg2+KPMz8Ny5c0btT548Mfmcy3wO2rdvb7TvkCFDBADClStXBEEw/z3cpk2bV8aX+Vp68b2ZqXLlykbnet
myZQIAYePGjWKbVqsV6tevLzg4OAiJiYmvPM6QIUMEBwcH8bo5n//ZPS/x8fGCUqkUvvzyS6P9goODBZVKJSQnJwu
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1000x600 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"# Модель случайного леса для обучения\n",
|
|||
|
"model = RandomForestRegressor()\n",
|
|||
|
"\n",
|
|||
|
"# Обучение модели\n",
|
|||
|
"model.fit(X_df_train, y_df_train)\n",
|
|||
|
"\n",
|
|||
|
"# Предсказание и оценка\n",
|
|||
|
"y_predictions = model.predict(X_df_test)\n",
|
|||
|
"\n",
|
|||
|
"rmse = root_mean_squared_error(y_df_test, y_predictions)\n",
|
|||
|
"r2 = r2_score(y_df_test, y_predictions)\n",
|
|||
|
"mae = mean_absolute_error(y_df_test, y_predictions)\n",
|
|||
|
"\n",
|
|||
|
"print(f\"RMSE: {rmse}\")\n",
|
|||
|
"print(f\"R²: {r2}\")\n",
|
|||
|
"print(f\"MAE: {mae}\\n\")\n",
|
|||
|
"\n",
|
|||
|
"# Кросс-валидация\n",
|
|||
|
"scores = cross_val_score(model, X_df_train, y_df_train, cv=5, scoring='neg_mean_squared_error')\n",
|
|||
|
"rmse_cv = (-scores.mean())**0.5\n",
|
|||
|
"print(f\"Кросс-валидация RMSE: {rmse_cv}\\n\")\n",
|
|||
|
"\n",
|
|||
|
"# Анализ важности признаков\n",
|
|||
|
"feature_importances = model.feature_importances_\n",
|
|||
|
"feature_names = X_df_train.columns\n",
|
|||
|
"\n",
|
|||
|
"# Проверка на переобучение\n",
|
|||
|
"y_train_predictions = model.predict(X_df_train)\n",
|
|||
|
"\n",
|
|||
|
"rmse_train = root_mean_squared_error(y_df_train, y_train_predictions)\n",
|
|||
|
"r2_train = r2_score(y_df_train, y_train_predictions)\n",
|
|||
|
"mae_train = mean_absolute_error(y_df_train, y_train_predictions)\n",
|
|||
|
"\n",
|
|||
|
"print(f\"Train RMSE: {rmse_train}\")\n",
|
|||
|
"print(f\"Train R²: {r2_train}\")\n",
|
|||
|
"print(f\"Train MAE: {mae_train}\\n\")\n",
|
|||
|
"\n",
|
|||
|
"# Визуализация результатов\n",
|
|||
|
"plt.figure(figsize=(10, 6))\n",
|
|||
|
"plt.scatter(y_df_test, y_predictions, alpha=0.5)\n",
|
|||
|
"plt.plot([y_df_test.min(), y_df_test.max()], [y_df_test.min(), y_df_test.max()], 'k--', lw=2)\n",
|
|||
|
"plt.xlabel('Фактическая стоимость')\n",
|
|||
|
"plt.ylabel('Прогнозируемая стоимость')\n",
|
|||
|
"plt.title('Фактическая стоимость по сравнению с прогнозируемой')"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Вывод:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"1. Оценка качества модели на тестовой выборке:\n",
|
|||
|
"\n",
|
|||
|
"RMSE (Корень из среднеквадратичной ошибки) на тестовой выборке составил 89.71, что указывает на среднюю ошибку в прогнозах.\n",
|
|||
|
"R² (Коэффициент детерминации) равен 0.96, что означает, что модель объясняет 96% дисперсии данных. Это хороший показатель, указывающий на высокую объяснительную способность модели.\n",
|
|||
|
"MAE (Средняя абсолютная ошибка) составила 51.21, показывая среднее абсолютное отклонение предсказаний от фактических значений.\n",
|
|||
|
"\n",
|
|||
|
"2. Результаты кросс-валидации:\n",
|
|||
|
"\n",
|
|||
|
"RMSE кросс-валидации равен 148.73, что заметно выше значения RMSE на тестовой выборке. Это может свидетельствовать о том, что модель может быть подвержена колебаниям в зависимости от данных и, возможно, склонна к некоторому переобучению.\n",
|
|||
|
"\n",
|
|||
|
"3. Проверка на переобучение:\n",
|
|||
|
"\n",
|
|||
|
"Метрики на обучающей выборке (RMSE = 49.74, R² = 0.99, MAE = 22.62) значительно лучше, чем на тестовой, что указывает на высокую точность на обучающих данных."
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"metadata": {
|
|||
|
"kernelspec": {
|
|||
|
"display_name": "aimenv",
|
|||
|
"language": "python",
|
|||
|
"name": "python3"
|
|||
|
},
|
|||
|
"language_info": {
|
|||
|
"codemirror_mode": {
|
|||
|
"name": "ipython",
|
|||
|
"version": 3
|
|||
|
},
|
|||
|
"file_extension": ".py",
|
|||
|
"mimetype": "text/x-python",
|
|||
|
"name": "python",
|
|||
|
"nbconvert_exporter": "python",
|
|||
|
"pygments_lexer": "ipython3",
|
|||
|
"version": "3.12.5"
|
|||
|
}
|
|||
|
},
|
|||
|
"nbformat": 4,
|
|||
|
"nbformat_minor": 2
|
|||
|
}
|