{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"## Dataset: [Tesla Insider Trading](https://www.kaggle.com/datasets/ilyaryabov/tesla-insider-trading).\n",
"\n",
"### Dataset description:\n",
"The dataset is a sample of securities transactions in Tesla stock made by company insiders and is part of the larger project \"Insider Trading S&P500 – Inside Info\". The data covers transactions by major shareholders and company officers, including purchases, sales, and option exercises, from November 10, 2021 to July 27, 2022.\n",
"\n",
"---\n",
"\n",
"### Analysis of the data:\n",
"**Problem domain:**\n",
"The dataset concerns the analysis of insider trades in public companies and their effect on share pricing. Insider transactions, made by people with access to non-public information (such as executives, major shareholders, or board members), can signal future changes in the share price. Studying them helps show how information inside a company is reflected in the actions of key participants and can reveal behavioral patterns that affect the market.\n",
"\n",
"**Relevance:**\n",
"Analysis of insider trades becomes especially important when markets are volatile and uncertain. Investors, analysts, and companies use such data to better understand the signals sent by major shareholders and officers. Insider purchases and sales are often read as indicators of confidence in the company, which can significantly affect market expectations and forecasts.\n",
"\n",
"**Objects of observation:**\n",
"The objects of observation are Tesla insiders: people with significant influence over the company's management and information. Each object is described by several parameters, including position, transaction type, number of shares, and total transaction value.\n",
"\n",
"**Object attributes:**\n",
"- Insider Trading: Full name of the person who made the transaction.\n",
"- Relationship: Position or status of that person at Tesla.\n",
"- Date: Date on which the transaction was completed.\n",
"- Transaction: Transaction type.\n",
"- Cost: Price of one share at the time of the transaction.\n",
"- Shares: Number of shares involved in the transaction.\n",
"- Value ($): Total value of the transaction in US dollars.\n",
"- Shares Total: Total number of shares held by the person after the transaction.\n",
"- SEC Form 4: Date on which the transaction was recorded in SEC Form 4, the mandatory filing for insider trades.\n",
"\n",
"---\n",
"\n",
"### Business goals:\n",
"1. **For the regression task:**\n",
"Goal: predict the future price of Tesla shares from insider transactions. The share price (\"Cost\") depends on many factors, including the volume and type of transactions made by insiders. Identifying dependencies between transaction parameters (number of shares, total transaction value, the insider's position) and the share price can help investors make informed buy or sell decisions.\n",
"2. **For the classification task:**\n"
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"### Loading the data from a file into a DataFrame:"
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 316,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"from typing import Any, Tuple\n",
|
||
"\n",
|
||
"import pandas as pd\n",
|
||
"from pandas import DataFrame\n",
|
||
"import matplotlib.pyplot as plt\n",
|
||
"\n",
|
||
"\n",
"df: DataFrame = pd.read_csv('../static/csv/TSLA.csv')"
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"### Brief information about the DataFrame:"
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 317,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"<class 'pandas.core.frame.DataFrame'>\n",
|
||
"RangeIndex: 156 entries, 0 to 155\n",
|
||
"Data columns (total 9 columns):\n",
|
||
" # Column Non-Null Count Dtype \n",
|
||
"--- ------ -------------- ----- \n",
|
||
" 0 Insider Trading 156 non-null object \n",
|
||
" 1 Relationship 156 non-null object \n",
|
||
" 2 Date 156 non-null object \n",
|
||
" 3 Transaction 156 non-null object \n",
|
||
" 4 Cost 156 non-null float64\n",
|
||
" 5 Shares 156 non-null object \n",
|
||
" 6 Value ($) 156 non-null object \n",
|
||
" 7 Shares Total 156 non-null object \n",
|
||
" 8 SEC Form 4 156 non-null object \n",
|
||
"dtypes: float64(1), object(8)\n",
|
||
"memory usage: 11.1+ KB\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>count</th>\n",
|
||
" <th>mean</th>\n",
|
||
" <th>std</th>\n",
|
||
" <th>min</th>\n",
|
||
" <th>25%</th>\n",
|
||
" <th>50%</th>\n",
|
||
" <th>75%</th>\n",
|
||
" <th>max</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>Cost</th>\n",
|
||
" <td>156.0</td>\n",
|
||
" <td>478.785641</td>\n",
|
||
" <td>448.922903</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>50.5225</td>\n",
|
||
" <td>240.225</td>\n",
|
||
" <td>934.1075</td>\n",
|
||
" <td>1171.04</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" count mean std min 25% 50% 75% max\n",
|
||
"Cost 156.0 478.785641 448.922903 0.0 50.5225 240.225 934.1075 1171.04"
|
||
]
|
||
},
|
||
"execution_count": 317,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
"# Brief information about the DataFrame\n",
"df.info()\n",
"\n",
"# Descriptive statistics for the numeric columns\n",
"df.describe().transpose()"
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"### Data conversion:"
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 318,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Выборка данных:\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Insider Trading</th>\n",
|
||
" <th>Relationship</th>\n",
|
||
" <th>Transaction</th>\n",
|
||
" <th>Cost</th>\n",
|
||
" <th>Shares</th>\n",
|
||
" <th>Value ($)</th>\n",
|
||
" <th>Shares Total</th>\n",
|
||
" <th>Year</th>\n",
|
||
" <th>Month</th>\n",
|
||
" <th>Day</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>Kirkhorn Zachary</td>\n",
|
||
" <td>Chief Financial Officer</td>\n",
|
||
" <td>Sale</td>\n",
|
||
" <td>196.72</td>\n",
|
||
" <td>10455</td>\n",
|
||
" <td>2056775</td>\n",
|
||
" <td>203073</td>\n",
|
||
" <td>2022</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>6</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>Taneja Vaibhav</td>\n",
|
||
" <td>Chief Accounting Officer</td>\n",
|
||
" <td>Sale</td>\n",
|
||
" <td>195.79</td>\n",
|
||
" <td>2466</td>\n",
|
||
" <td>482718</td>\n",
|
||
" <td>100458</td>\n",
|
||
" <td>2022</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>6</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>Baglino Andrew D</td>\n",
|
||
" <td>SVP Powertrain and Energy Eng.</td>\n",
|
||
" <td>Sale</td>\n",
|
||
" <td>195.79</td>\n",
|
||
" <td>1298</td>\n",
|
||
" <td>254232</td>\n",
|
||
" <td>65547</td>\n",
|
||
" <td>2022</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>6</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>Taneja Vaibhav</td>\n",
|
||
" <td>Chief Accounting Officer</td>\n",
|
||
" <td>Option Exercise</td>\n",
|
||
" <td>0.00</td>\n",
|
||
" <td>7138</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>102923</td>\n",
|
||
" <td>2022</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>5</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>Baglino Andrew D</td>\n",
|
||
" <td>SVP Powertrain and Energy Eng.</td>\n",
|
||
" <td>Option Exercise</td>\n",
|
||
" <td>0.00</td>\n",
|
||
" <td>2586</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>66845</td>\n",
|
||
" <td>2022</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>5</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>5</th>\n",
|
||
" <td>Kirkhorn Zachary</td>\n",
|
||
" <td>Chief Financial Officer</td>\n",
|
||
" <td>Option Exercise</td>\n",
|
||
" <td>0.00</td>\n",
|
||
" <td>16867</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>213528</td>\n",
|
||
" <td>2022</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>5</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6</th>\n",
|
||
" <td>Baglino Andrew D</td>\n",
|
||
" <td>SVP Powertrain and Energy Eng.</td>\n",
|
||
" <td>Option Exercise</td>\n",
|
||
" <td>20.91</td>\n",
|
||
" <td>10500</td>\n",
|
||
" <td>219555</td>\n",
|
||
" <td>74759</td>\n",
|
||
" <td>2022</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>27</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>7</th>\n",
|
||
" <td>Baglino Andrew D</td>\n",
|
||
" <td>SVP Powertrain and Energy Eng.</td>\n",
|
||
" <td>Sale</td>\n",
|
||
" <td>202.00</td>\n",
|
||
" <td>10500</td>\n",
|
||
" <td>2121000</td>\n",
|
||
" <td>64259</td>\n",
|
||
" <td>2022</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>27</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>8</th>\n",
|
||
" <td>Kirkhorn Zachary</td>\n",
|
||
" <td>Chief Financial Officer</td>\n",
|
||
" <td>Sale</td>\n",
|
||
" <td>193.00</td>\n",
|
||
" <td>3750</td>\n",
|
||
" <td>723750</td>\n",
|
||
" <td>196661</td>\n",
|
||
" <td>2022</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>6</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>9</th>\n",
|
||
" <td>Baglino Andrew D</td>\n",
|
||
" <td>SVP Powertrain and Energy Eng.</td>\n",
|
||
" <td>Option Exercise</td>\n",
|
||
" <td>20.91</td>\n",
|
||
" <td>10500</td>\n",
|
||
" <td>219555</td>\n",
|
||
" <td>74759</td>\n",
|
||
" <td>2022</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>27</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Insider Trading Relationship Transaction Cost \\\n",
|
||
"0 Kirkhorn Zachary Chief Financial Officer Sale 196.72 \n",
|
||
"1 Taneja Vaibhav Chief Accounting Officer Sale 195.79 \n",
|
||
"2 Baglino Andrew D SVP Powertrain and Energy Eng. Sale 195.79 \n",
|
||
"3 Taneja Vaibhav Chief Accounting Officer Option Exercise 0.00 \n",
|
||
"4 Baglino Andrew D SVP Powertrain and Energy Eng. Option Exercise 0.00 \n",
|
||
"5 Kirkhorn Zachary Chief Financial Officer Option Exercise 0.00 \n",
|
||
"6 Baglino Andrew D SVP Powertrain and Energy Eng. Option Exercise 20.91 \n",
|
||
"7 Baglino Andrew D SVP Powertrain and Energy Eng. Sale 202.00 \n",
|
||
"8 Kirkhorn Zachary Chief Financial Officer Sale 193.00 \n",
|
||
"9 Baglino Andrew D SVP Powertrain and Energy Eng. Option Exercise 20.91 \n",
|
||
"\n",
|
||
" Shares Value ($) Shares Total Year Month Day \n",
|
||
"0 10455 2056775 203073 2022 3 6 \n",
|
||
"1 2466 482718 100458 2022 3 6 \n",
|
||
"2 1298 254232 65547 2022 3 6 \n",
|
||
"3 7138 0 102923 2022 3 5 \n",
|
||
"4 2586 0 66845 2022 3 5 \n",
|
||
"5 16867 0 213528 2022 3 5 \n",
|
||
"6 10500 219555 74759 2022 2 27 \n",
|
||
"7 10500 2121000 64259 2022 2 27 \n",
|
||
"8 3750 723750 196661 2022 2 6 \n",
|
||
"9 10500 219555 74759 2022 1 27 "
|
||
]
|
||
},
|
||
"execution_count": 318,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
"# Convert data types\n",
"df['Insider Trading'] = df['Insider Trading'].astype('category')  # Convert to category\n",
"df['Relationship'] = df['Relationship'].astype('category')  # Convert to category\n",
"df['Transaction'] = df['Transaction'].astype('category')  # Convert to category\n",
"df['Cost'] = pd.to_numeric(df['Cost'], errors='coerce')  # Convert to float\n",
"df['Shares'] = pd.to_numeric(df['Shares'].str.replace(',', ''), errors='coerce')  # Strip thousands separators, then convert to a number\n",
"df['Value ($)'] = pd.to_numeric(df['Value ($)'].str.replace(',', ''), errors='coerce')  # Strip thousands separators, then convert to a number\n",
"df['Shares Total'] = pd.to_numeric(df['Shares Total'].str.replace(',', ''), errors='coerce')  # Strip thousands separators, then convert to a number\n",
"\n",
"df['Date'] = pd.to_datetime(df['Date'], errors='coerce')  # Convert to datetime\n",
"df['Year'] = df['Date'].dt.year  # Year\n",
"df['Month'] = df['Date'].dt.month  # Month\n",
"df['Day'] = df['Date'].dt.day  # Day\n",
"df: DataFrame = df.drop(columns=['Date', 'SEC Form 4'])  # Drop the original date columns\n",
"\n",
"print('Выборка данных:')\n",
"df.head(10)"
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"## Business goal No. 1 (regression task).\n",
"\n",
"### Achievable level of model quality:\n",
"**Main regression metrics:**\n",
"- **Mean Absolute Error (MAE)**: the average absolute deviation between predicted and actual values.\n",
"It is easy to interpret, especially for financial data, where every dollar of error matters.\n",
"- **Mean Squared Error (MSE)**: the average of the squared deviations between predictions and true values. Because errors are squared, large misses are penalized more heavily than in MAE.\n",
"- **Coefficient of determination (R²)**: the share of the variance of the dependent variable that the model explains. R² is at most 1 (the closer to 1, the better) and becomes negative when the model predicts worse than the mean of the evaluated data.\n",
"\n",
"---\n",
"\n",
"### Choosing a baseline:\n",
"The baseline model predicts the mean of the target variable (Cost) computed on the training set. This simple, intuitive method serves as a minimal reference point for more complex models. It sets the initial error levels (MAE, MSE) and quality score (R²) that more complex models must improve on to justify their use.\n",
"\n",
"---"
]
|
||
},
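The three metrics above can be computed by hand. The following is a minimal, dependency-free sketch rather than the notebook's scikit-learn calls; the helper names `mae`, `mse`, `r2` and the small price lists are illustrative assumptions, not values from the dataset. It also shows that a mean-of-training-set baseline can score a negative R² on a test set.

```python
from statistics import fmean


def mae(y_true, y_pred):
    """Mean absolute error: average of |y - y_hat|."""
    return fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def mse(y_true, y_pred):
    """Mean squared error: average of (y - y_hat)^2."""
    return fmean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical share prices, chosen only for illustration.
y_train = [0.0, 20.0, 200.0, 900.0, 1100.0]
y_test = [0.0, 190.0, 1000.0]

# Baseline: predict the training mean for every test row.
baseline = [fmean(y_train)] * len(y_test)

print("Baseline MAE:", mae(y_test, baseline))
print("Baseline MSE:", mse(y_test, baseline))
print("Baseline R2:", r2(y_test, baseline))  # negative: the training mean differs from the test mean
```

R² is exactly 0 only when the constant prediction equals the mean of the evaluated data, so a baseline fitted on the training split usually lands slightly below 0 on the test split.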
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"### Splitting the data:"
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 319,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Baseline MAE: 417.78235887096776\n",
|
||
"Baseline MSE: 182476.07973024843\n",
|
||
"Baseline R²: -0.027074997920953914\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"from pandas.core.frame import DataFrame\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score\n",
"\n",
"\n",
"# Split the data into training and test sets\n",
"def split_into_train_test(\n",
"    df_input: DataFrame,\n",
"    stratify_colname: str = \"y\",\n",
"    frac_train: float = 0.8,\n",
"    random_state: int = 42,\n",
") -> Tuple[DataFrame, DataFrame, DataFrame, DataFrame]:\n",
"\n",
"    if stratify_colname not in df_input.columns:\n",
"        raise ValueError(\"%s is not a column in the dataframe\" % (stratify_colname))\n",
"\n",
"    if not (0 < frac_train < 1):\n",
"        raise ValueError(\"Fraction must be between 0 and 1.\")\n",
"\n",
"    X: DataFrame = df_input  # Contains all columns, including the target.\n",
"    y: DataFrame = df_input[\n",
"        [stratify_colname]\n",
"    ]  # DataFrame holding only the target column; no stratification is applied.\n",
"\n",
"    # Split original dataframe into train and test dataframes.\n",
"    X_train, X_test, y_train, y_test = train_test_split(\n",
"        X, y,\n",
"        test_size=(1.0 - frac_train),\n",
"        random_state=random_state\n",
"    )\n",
"\n",
"    return X_train, X_test, y_train, y_test\n",
"\n",
"\n",
"# Define the target and the input features\n",
"y_feature: str = 'Cost'\n",
"X_features: list[str] = df.drop(columns=y_feature).columns.tolist()\n",
"\n",
"# Split the data into training and test sets\n",
"X_df_train, X_df_test, y_df_train, y_df_test = split_into_train_test(\n",
"    df,\n",
"    stratify_colname=y_feature,\n",
"    frac_train=0.8,\n",
"    random_state=42\n",
")\n",
"\n",
"# Compute the baseline predictions (mean of the target on the training set)\n",
"baseline_predictions: list[float] = [float(y_df_train[y_feature].mean())] * len(y_df_test)\n",
"\n",
"# Evaluate the baseline model\n",
"print('Baseline MAE:', mean_absolute_error(y_df_test, baseline_predictions))\n",
"print('Baseline MSE:', mean_squared_error(y_df_test, baseline_predictions))\n",
"print('Baseline R²:', r2_score(y_df_test, baseline_predictions))"
]
|
||
},
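The split above returns a feature frame that still holds every column, the target included. As a point of comparison, here is a minimal, dependency-free sketch of a hold-out split that removes the target from the features before returning them. The function name `split_features_target`, the dict-based rows, and the choice of five sample rows (copied from the preview table) are illustrative assumptions, not part of the notebook.

```python
import random


def split_features_target(rows, target, frac_train=0.8, seed=42):
    """Shuffle rows, split them, and return features without the target column."""
    if not 0 < frac_train < 1:
        raise ValueError("frac_train must be between 0 and 1")

    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_train = int(len(rows) * frac_train)

    def pick(selected):
        # Features: every column except the target; labels: the target only.
        X = [{k: v for k, v in rows[i].items() if k != target} for i in selected]
        y = [rows[i][target] for i in selected]
        return X, y

    X_train, y_train = pick(indices[:n_train])
    X_test, y_test = pick(indices[n_train:])
    return X_train, X_test, y_train, y_test


# Five rows taken from the preview table (three columns only).
rows = [
    {"Transaction": "Sale", "Cost": 196.72, "Shares": 10455},
    {"Transaction": "Sale", "Cost": 195.79, "Shares": 2466},
    {"Transaction": "Option Exercise", "Cost": 0.00, "Shares": 7138},
    {"Transaction": "Option Exercise", "Cost": 20.91, "Shares": 10500},
    {"Transaction": "Sale", "Cost": 202.00, "Shares": 10500},
]

X_train, X_test, y_train, y_test = split_features_target(rows, target="Cost")
print(len(X_train), "training rows,", len(X_test), "test rows")
```

Keeping the target out of the features means any later preprocessing or model sees only the inputs it would have at prediction time.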
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"### Choice of models:\n",
"\n",
"The following models were selected for training:\n",
"1. **Random Forest**: An ensemble model built from many decision trees. It handles nonlinear dependencies and noisy data well and is relatively robust to overfitting.\n",
"2. **Linear Regression**: A simple model that assumes a linear relationship between the features and the target variable. It trains quickly and its results are easy to interpret.\n",
"3. **Gradient Boosting**: A powerful model that builds an ensemble of trees, each correcting the errors of the previous ones. It is effective on complex datasets and can reach high predictive accuracy.\n",
"\n",
"---"
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"### Building the pipeline:"
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 320,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"from sklearn.impute import SimpleImputer\n",
"from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.pipeline import Pipeline\n",
"\n",
"\n",
"# Numeric columns\n",
"num_columns: list[str] = [\n",
"    column\n",
"    for column in df.columns\n",
"    if df[column].dtype not in (\"category\", \"object\")\n",
"]\n",
"\n",
"# Categorical columns\n",
"cat_columns: list[str] = [\n",
"    column\n",
"    for column in df.columns\n",
"    if df[column].dtype in (\"category\", \"object\")\n",
"]\n",
"\n",
"# Fill in missing values\n",
"num_imputer = SimpleImputer(strategy=\"median\")\n",
"# Standardization\n",
"num_scaler = StandardScaler()\n",
"# Pipeline for numeric data\n",
"preprocessing_num = Pipeline(\n",
"    [\n",
"        (\"imputer\", num_imputer),\n",
"        (\"scaler\", num_scaler),\n",
"    ]\n",
")\n",
"\n",
"# Fill in missing values\n",
"cat_imputer = SimpleImputer(strategy=\"constant\", fill_value=\"unknown\")\n",
"# One-hot encoding\n",
"cat_encoder = OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False, drop=\"first\")\n",
"# Pipeline for categorical data\n",
"preprocessing_cat = Pipeline(\n",
"    [\n",
"        (\"imputer\", cat_imputer),\n",
"        (\"encoder\", cat_encoder),\n",
"    ]\n",
")\n",
"\n",
"# Transformer for feature preprocessing\n",
"features_preprocessing = ColumnTransformer(\n",
"    verbose_feature_names_out=False,\n",
"    transformers=[\n",
"        (\"prepocessing_num\", preprocessing_num, num_columns),\n",
"        (\"prepocessing_cat\", preprocessing_cat, cat_columns),\n",
"    ],\n",
"    remainder=\"passthrough\"\n",
")\n",
"\n",
"# Main data preprocessing pipeline\n",
"pipeline_end = Pipeline(\n",
"    [\n",
"        (\"features_preprocessing\", features_preprocessing),\n",
"    ]\n",
")"
]
|
||
},
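What the two preprocessing branches compute can be sketched without scikit-learn. The following is a simplified illustration, not the library's implementation: the helper names `standardize` and `one_hot_drop_first` are hypothetical, imputation is omitted, and the input values are the `Shares` and `Transaction` entries of the first four rows of the preview table.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Scale to zero mean and unit variance using the population standard deviation.

    Assumes the column is not constant (a zero deviation would divide by zero).
    """
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


def one_hot_drop_first(values):
    """One-hot encode, dropping the first category in sorted order as the reference level."""
    categories = sorted(set(values))
    kept = categories[1:]
    encoded = [[1.0 if v == c else 0.0 for c in kept] for v in values]
    return kept, encoded


shares = [10455.0, 2466.0, 1298.0, 7138.0]
transaction = ["Sale", "Sale", "Sale", "Option Exercise"]

scaled_shares = standardize(shares)
kept_categories, encoded_transaction = one_hot_drop_first(transaction)

print(scaled_shares)
print(kept_categories, encoded_transaction)
```

Dropping the first category leaves a single `Sale` indicator for a two-level column, which matches the lone `Transaction_Sale` column in the notebook's preprocessed output.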
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"### Demonstrating the pipeline:"
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 321,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Cost</th>\n",
|
||
" <th>Shares</th>\n",
|
||
" <th>Value ($)</th>\n",
|
||
" <th>Shares Total</th>\n",
|
||
" <th>Year</th>\n",
|
||
" <th>Month</th>\n",
|
||
" <th>Day</th>\n",
|
||
" <th>Insider Trading_DENHOLM ROBYN M</th>\n",
|
||
" <th>Insider Trading_Kirkhorn Zachary</th>\n",
|
||
" <th>Insider Trading_Musk Elon</th>\n",
|
||
" <th>Insider Trading_Musk Kimbal</th>\n",
|
||
" <th>Insider Trading_Taneja Vaibhav</th>\n",
|
||
" <th>Insider Trading_Wilson-Thompson Kathleen</th>\n",
|
||
" <th>Relationship_Chief Accounting Officer</th>\n",
|
||
" <th>Relationship_Chief Financial Officer</th>\n",
|
||
" <th>Relationship_Director</th>\n",
|
||
" <th>Relationship_SVP Powertrain and Energy Eng.</th>\n",
|
||
" <th>Transaction_Sale</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>-0.966516</td>\n",
|
||
" <td>-0.361759</td>\n",
|
||
" <td>-0.450022</td>\n",
|
||
" <td>-0.343599</td>\n",
|
||
" <td>0.715678</td>\n",
|
||
" <td>-0.506108</td>\n",
|
||
" <td>-0.400623</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>1.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>1.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>-1.074894</td>\n",
|
||
" <td>1.225216</td>\n",
|
||
" <td>-0.414725</td>\n",
|
||
" <td>-0.319938</td>\n",
|
||
" <td>-1.397276</td>\n",
|
||
" <td>0.801338</td>\n",
|
||
" <td>0.906673</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>1.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>-1.074894</td>\n",
|
||
" <td>1.211753</td>\n",
|
||
" <td>-0.415027</td>\n",
|
||
" <td>-0.320141</td>\n",
|
||
" <td>-1.397276</td>\n",
|
||
" <td>1.062828</td>\n",
|
||
" <td>-0.098939</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>1.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>1.167142</td>\n",
|
||
" <td>0.037499</td>\n",
|
||
" <td>1.023612</td>\n",
|
||
" <td>-0.325853</td>\n",
|
||
" <td>-1.397276</td>\n",
|
||
" <td>1.062828</td>\n",
|
||
" <td>-0.501184</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>1.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>1.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>1.217886</td>\n",
|
||
" <td>-0.075287</td>\n",
|
||
" <td>0.632973</td>\n",
|
||
" <td>-0.330205</td>\n",
|
||
" <td>-1.397276</td>\n",
|
||
" <td>1.062828</td>\n",
|
||
" <td>-0.501184</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>1.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>1.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>5</th>\n",
|
||
" <td>0.505872</td>\n",
|
||
" <td>-0.361021</td>\n",
|
||
" <td>-0.443679</td>\n",
|
||
" <td>-0.343698</td>\n",
|
||
" <td>0.715678</td>\n",
|
||
" <td>-0.767598</td>\n",
|
||
" <td>1.308918</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>1.0</td>\n",
|
||
" <td>1.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6</th>\n",
|
||
" <td>-1.088674</td>\n",
|
||
" <td>-0.357532</td>\n",
|
||
" <td>-0.450389</td>\n",
|
||
" <td>-0.342863</td>\n",
|
||
" <td>0.715678</td>\n",
|
||
" <td>0.278360</td>\n",
|
||
" <td>-0.903429</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>1.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>1.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>7</th>\n",
|
||
" <td>-0.692146</td>\n",
|
||
" <td>-0.355855</td>\n",
|
||
" <td>-0.445383</td>\n",
|
||
" <td>-0.343220</td>\n",
|
||
" <td>0.715678</td>\n",
|
||
" <td>0.801338</td>\n",
|
||
" <td>1.409480</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>1.0</td>\n",
|
||
" <td>1.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>8</th>\n",
|
||
" <td>-1.088674</td>\n",
|
||
" <td>-0.361181</td>\n",
|
||
" <td>-0.450389</td>\n",
|
||
" <td>-0.343649</td>\n",
|
||
" <td>-1.397276</td>\n",
|
||
" <td>1.062828</td>\n",
|
||
" <td>-0.903429</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>1.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>1.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>9</th>\n",
|
||
" <td>1.091997</td>\n",
|
||
" <td>-0.204531</td>\n",
|
||
" <td>0.114712</td>\n",
|
||
" <td>1.538166</td>\n",
|
||
" <td>0.715678</td>\n",
|
||
" <td>-1.029087</td>\n",
|
||
" <td>1.208357</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>1.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>1.0</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Cost Shares Value ($) Shares Total Year Month Day \\\n",
|
||
"0 -0.966516 -0.361759 -0.450022 -0.343599 0.715678 -0.506108 -0.400623 \n",
|
||
"1 -1.074894 1.225216 -0.414725 -0.319938 -1.397276 0.801338 0.906673 \n",
|
||
"2 -1.074894 1.211753 -0.415027 -0.320141 -1.397276 1.062828 -0.098939 \n",
|
||
"3 1.167142 0.037499 1.023612 -0.325853 -1.397276 1.062828 -0.501184 \n",
|
||
"4 1.217886 -0.075287 0.632973 -0.330205 -1.397276 1.062828 -0.501184 \n",
|
||
"5 0.505872 -0.361021 -0.443679 -0.343698 0.715678 -0.767598 1.308918 \n",
|
||
"6 -1.088674 -0.357532 -0.450389 -0.342863 0.715678 0.278360 -0.903429 \n",
|
||
"7 -0.692146 -0.355855 -0.445383 -0.343220 0.715678 0.801338 1.409480 \n",
|
||
"8 -1.088674 -0.361181 -0.450389 -0.343649 -1.397276 1.062828 -0.903429 \n",
|
||
"9 1.091997 -0.204531 0.114712 1.538166 0.715678 -1.029087 1.208357 \n",
|
||
"\n",
|
||
" Insider Trading_DENHOLM ROBYN M Insider Trading_Kirkhorn Zachary \\\n",
|
||
"0 0.0 0.0 \n",
|
||
"1 0.0 0.0 \n",
|
||
"2 0.0 0.0 \n",
|
||
"3 0.0 0.0 \n",
|
||
"4 0.0 0.0 \n",
|
||
"5 0.0 0.0 \n",
|
||
"6 0.0 0.0 \n",
|
||
"7 0.0 0.0 \n",
|
||
"8 0.0 0.0 \n",
|
||
"9 0.0 0.0 \n",
|
||
"\n",
|
||
" Insider Trading_Musk Elon Insider Trading_Musk Kimbal \\\n",
|
||
"0 0.0 0.0 \n",
|
||
"1 1.0 0.0 \n",
|
||
"2 1.0 0.0 \n",
|
||
"3 1.0 0.0 \n",
|
||
"4 1.0 0.0 \n",
|
||
"5 0.0 0.0 \n",
|
||
"6 0.0 0.0 \n",
|
||
"7 0.0 0.0 \n",
|
||
"8 0.0 0.0 \n",
|
||
"9 1.0 0.0 \n",
|
||
"\n",
|
||
" Insider Trading_Taneja Vaibhav Insider Trading_Wilson-Thompson Kathleen \\\n",
|
||
"0 1.0 0.0 \n",
|
||
"1 0.0 0.0 \n",
|
||
"2 0.0 0.0 \n",
|
||
"3 0.0 0.0 \n",
|
||
"4 0.0 0.0 \n",
|
||
"5 0.0 0.0 \n",
|
||
"6 1.0 0.0 \n",
|
||
"7 0.0 0.0 \n",
|
||
"8 1.0 0.0 \n",
|
||
"9 0.0 0.0 \n",
|
||
"\n",
|
||
" Relationship_Chief Accounting Officer \\\n",
|
||
"0 1.0 \n",
|
||
"1 0.0 \n",
|
||
"2 0.0 \n",
|
||
"3 0.0 \n",
|
||
"4 0.0 \n",
|
||
"5 0.0 \n",
|
||
"6 1.0 \n",
|
||
"7 0.0 \n",
|
||
"8 1.0 \n",
|
||
"9 0.0 \n",
|
||
"\n",
|
||
" Relationship_Chief Financial Officer Relationship_Director \\\n",
|
||
"0 0.0 0.0 \n",
|
||
"1 0.0 0.0 \n",
|
||
"2 0.0 0.0 \n",
|
||
"3 0.0 0.0 \n",
|
||
"4 0.0 0.0 \n",
|
||
"5 0.0 0.0 \n",
|
||
"6 0.0 0.0 \n",
|
||
"7 0.0 0.0 \n",
|
||
"8 0.0 0.0 \n",
|
||
"9 0.0 0.0 \n",
|
||
"\n",
|
||
" Relationship_SVP Powertrain and Energy Eng. Transaction_Sale \n",
|
||
"0 0.0 0.0 \n",
|
||
"1 0.0 0.0 \n",
|
||
"2 0.0 0.0 \n",
|
||
"3 0.0 1.0 \n",
|
||
"4 0.0 1.0 \n",
|
||
"5 1.0 1.0 \n",
|
||
"6 0.0 0.0 \n",
|
||
"7 1.0 1.0 \n",
|
||
"8 0.0 0.0 \n",
|
||
"9 0.0 1.0 "
|
||
]
|
||
},
|
||
"execution_count": 321,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
"# Apply the pipeline\n",
"preprocessing_result = pipeline_end.fit_transform(X_df_train)\n",
"preprocessed_df = pd.DataFrame(\n",
"    preprocessing_result,\n",
"    columns=pipeline_end.get_feature_names_out(),\n",
")\n",
"\n",
"preprocessed_df.head(10)"
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"### Training the models:\n",
"\n",
"Evaluation of the training results:\n",
"1. **Random Forest**:\n",
"   - Metrics:\n",
"     - Mean score: 0.9993.\n",
"     - Standard deviation: 0.00046.\n",
"   - Conclusion: A very high score, and the low standard deviation shows that it varies little between evaluations.\n",
"2. **Linear Regression**:\n",
"   - Metrics:\n",
"     - Mean score: 1.0.\n",
"     - Standard deviation: 0.0.\n",
"   - Conclusion: A perfect score. A zero standard deviation is not in itself evidence of overfitting; a score of exactly 1.0 more plausibly means that the target can be reconstructed exactly from the features.\n",
"3. **Gradient Boosting**:\n",
"   - Metrics:\n",
"     - Mean score: 0.9998.\n",
"     - Standard deviation: 0.00014.\n",
"   - Conclusion: Excellent results, with a high score and low variability.\n",
"\n",
"These scores should be read with caution: the feature matrix `X_df_train` still contains the target column `Cost` (it appears in the preprocessed output above), so values this close to 1 likely reflect target leakage rather than the models' ability to generalize."
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 322,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\preprocessing\\_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros\n",
|
||
" warnings.warn(\n",
|
||
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
||
" return fit_method(estimator, *args, **kwargs)\n",
|
||
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\preprocessing\\_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros\n",
|
||
" warnings.warn(\n",
|
||
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Random Forest:\n",
|
||
" Mean Score = 0.9992841344976828\n",
|
||
" Standard Deviation = 0.0004515288830049682\n",
|
||
"Linear Regression:\n",
|
||
" Mean Score = 1.0\n",
|
||
" Standard Deviation = 0.0\n",
|
||
"Gradient Boosting:\n",
|
||
" Mean Score = 0.9997688048426001\n",
|
||
" Standard Deviation = 0.0001416815109781245\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\preprocessing\\_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros\n",
|
||
" warnings.warn(\n",
|
||
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
||
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
||
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor\n",
|
||
"from sklearn.linear_model import LinearRegression\n",
|
||
"from sklearn.model_selection import cross_val_score\n",
|
||
"\n",
|
||
"\n",
|
||
"# Обучить модели\n",
|
||
"def train_models(X: DataFrame, y: DataFrame, \n",
|
||
" models: dict[str, Any]) -> dict[str, Any]:\n",
|
||
" results: dict[str, Any] = {}\n",
|
||
" for model_name, model in models.items():\n",
|
||
" # Создаем конвейер для каждой модели\n",
|
||
" model_pipeline = Pipeline(\n",
|
||
" [\n",
|
||
" (\"features_preprocessing\", features_preprocessing),\n",
|
||
" (\"model\", model) # Используем текущую модель\n",
|
||
" ]\n",
|
||
" )\n",
|
||
" \n",
|
||
" # Обучаем модель и вычисляем кросс-валидацию\n",
|
||
" scores = cross_val_score(model_pipeline, X, y, cv=5) # 5-кратная кросс-валидация\n",
|
||
" results[model_name] = {\n",
|
||
" \"mean_score\": scores.mean(),\n",
|
||
" \"std_dev\": scores.std()\n",
|
||
" }\n",
|
||
" \n",
|
||
" return results\n",
|
||
"\n",
|
||
"\n",
|
||
"models_regression: dict[str, Any] = {\n",
|
||
" \"Random Forest\": RandomForestRegressor(),\n",
|
||
" \"Linear Regression\": LinearRegression(),\n",
|
||
" \"Gradient Boosting\": GradientBoostingRegressor(),\n",
|
||
"}\n",
|
||
"\n",
|
||
"results: dict[str, Any] = train_models(X_df_train, y_df_train, models_regression)\n",
|
||
"\n",
|
||
"# Вывод результатов\n",
|
||
"for model_name, scores in results.items():\n",
|
||
" print(f\"\"\"{model_name}:\n",
|
||
" Mean Score = {scores['mean_score']}\n",
|
||
" Standard Deviation = {scores['std_dev']}\"\"\")"
|
||
]
|
||
},
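{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "A perfect cross-validated R² with zero spread, as Linear Regression shows above, is what one would expect when the target is an exact linear function of the features. The sketch below illustrates this on synthetic data only: `X_demo`, `y_noisy` and `y_leaky` are hypothetical names that do not come from this notebook, and the example does not prove that the same thing happens in the Tesla dataset.\n",
  "\n",
  "```python\n",
  "import numpy as np\n",
  "from sklearn.linear_model import LinearRegression\n",
  "from sklearn.model_selection import cross_val_score\n",
  "\n",
  "rng = np.random.default_rng(0)\n",
  "\n",
  "# Synthetic stand-in for the real data: 200 rows, 3 numeric features\n",
  "X_demo = rng.normal(size=(200, 3))\n",
  "noise = rng.normal(scale=5.0, size=200)\n",
  "\n",
  "# Honest target: depends on the features plus noise the model cannot see\n",
  "y_noisy = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + noise\n",
  "# Leaky target: an exact linear function of one feature column\n",
  "y_leaky = 3.0 * X_demo[:, 2] + 7.0\n",
  "\n",
  "# Default scoring for regressors is R², evaluated on held-out folds\n",
  "scores_noisy = cross_val_score(LinearRegression(), X_demo, y_noisy, cv=5)\n",
  "scores_leaky = cross_val_score(LinearRegression(), X_demo, y_leaky, cv=5)\n",
  "\n",
  "print('noisy target:', scores_noisy.mean(), scores_noisy.std())\n",
  "print('leaky target:', scores_leaky.mean(), scores_leaky.std())\n",
  "```\n",
  "\n",
  "If the real data behaves like the leaky case, one possible check is to drop feature columns one at a time and see whether the Linear Regression score stays at 1.0."
 ]
},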
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Проверка на тестовом наборе данных:\n",
|
||
"\n",
|
||
"Оценка результатов обучения:\n",
|
||
"1. Случайный лес (Random Forest):\n",
|
||
" - Показатели:\n",
|
||
" - MAE (обучение): 1.858\n",
|
||
" - MAE (тест): 4.489\n",
|
||
" - MSE (обучение): 10.959\n",
|
||
" - MSE (тест): 62.649\n",
|
||
" - R2 (обучение): 0.9999\n",
|
||
" - R2 (тест): 0.9997\n",
|
||
" - STD (обучение): 3.310\n",
|
||
" - STD (тест): 7.757\n",
|
||
" - Вывод: Случайный лес показывает великолепные значения R2 на обучающей и тестовой выборках, что свидетельствует о сильной способности к обобщению. Однако MAE и MSE на тестовой выборке значительно выше, чем на обучающей, что может указывать на некоторые проблемы с переобучением.\n",
|
||
"2. Линейная регрессия (Linear Regression):\n",
|
||
" - Показатели:\n",
|
||
" - MAE (обучение): 3.069e-13\n",
|
||
" - MAE (тест): 2.762e-13\n",
|
||
" - MSE (обучение): 1.437e-25\n",
|
||
" - MSE (тест): 1.196e-25\n",
|
||
" - R2 (обучение): 1.0\n",
|
||
" - R2 (тест): 1.0\n",
|
||
" - STD (обучение): 3.730e-13\n",
|
||
" - STD (тест): 3.444e-13\n",
|
||
" - Вывод: Высокие показатели точности и нулевые ошибки (MAE, MSE) указывают на то, что модель идеально подгоняет данные как на обучающей, так и на тестовой выборках. Однако это также может быть признаком переобучения.\n",
|
||
"3. Градиентный бустинг (Gradient Boosting):\n",
|
||
" - Показатели:\n",
|
||
" - MAE (обучение): 0.156\n",
|
||
" - MAE (тест): 3.027\n",
|
||
" - MSE (обучение): 0.075\n",
|
||
" - MSE (тест): 41.360\n",
|
||
" - R2 (обучение): 0.9999996\n",
|
||
" - R2 (тест): 0.9998\n",
|
||
" - STD (обучение): 0.274\n",
|
||
" - STD (тест): 6.399\n",
|
||
" - Вывод: Градиентный бустинг демонстрирует отличные результаты на обучающей выборке, однако MAE и MSE на тестовой выборке довольно высокие, что может указывать на определенное переобучение или необходимость улучшения настройки модели."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 323,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Модель: Random Forest\n",
|
||
"\tMAE_train: 1.8584435483870716\n",
|
||
"\tMAE_test: 4.489381249999976\n",
|
||
"\tMSE_train: 10.958770153225622\n",
|
||
"\tMSE_test: 62.643889510626195\n",
|
||
"\tR2_train: 0.9999465631134502\n",
|
||
"\tR2_test: 0.9996474059899577\n",
|
||
"\tSTD_train: 3.3095436106742198\n",
|
||
"\tSTD_test: 7.757028236410516\n",
|
||
"\n",
|
||
"Модель: Linear Regression\n",
|
||
"\tMAE_train: 3.0690862038154006e-13\n",
|
||
"\tMAE_test: 2.761679773755077e-13\n",
|
||
"\tMSE_train: 1.4370485712253764e-25\n",
|
||
"\tMSE_test: 1.19585889812782e-25\n",
|
||
"\tR2_train: 1.0\n",
|
||
"\tR2_test: 1.0\n",
|
||
"\tSTD_train: 3.7295840825107354e-13\n",
|
||
"\tSTD_test: 3.4438670391637766e-13\n",
|
||
"\n",
|
||
"Модель: Gradient Boosting\n",
|
||
"\tMAE_train: 0.15613772760448064\n",
|
||
"\tMAE_test: 3.027282706028462\n",
|
||
"\tMSE_train: 0.07499640211231481\n",
|
||
"\tMSE_test: 41.36034726227861\n",
|
||
"\tR2_train: 0.9999996343043813\n",
|
||
"\tR2_test: 0.9997672013852927\n",
|
||
"\tSTD_train: 0.2738547098596532\n",
|
||
"\tSTD_test: 6.3988297145358555\n",
|
||
"\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"import numpy as np\n",
|
||
"\n",
|
||
"from sklearn import metrics\n",
|
||
"\n",
|
||
"\n",
|
||
"# Оценка качества различных моделей на основе метрик\n",
|
||
"def evaluate_models(models, \n",
|
||
" pipeline_end, \n",
|
||
" X_train, y_train, \n",
|
||
" X_test, y_test) -> dict[str, dict[str, Any]]:\n",
|
||
" results: dict[str, dict[str, Any]] = {}\n",
|
||
" \n",
|
||
" for model_name, model in models.items():\n",
|
||
" # Создание пайплайна для текущей модели\n",
|
||
" model_pipeline = Pipeline(\n",
|
||
" [\n",
|
||
" (\"pipeline\", pipeline_end), \n",
|
||
" (\"model\", model),\n",
|
||
" ]\n",
|
||
" )\n",
|
||
" \n",
|
||
" # Обучение текущей модели\n",
|
||
" model_pipeline.fit(X_train, y_train)\n",
|
||
"\n",
|
||
" # Предсказание для обучающей и тестовой выборки\n",
|
||
" y_train_predict = model_pipeline.predict(X_train)\n",
|
||
" y_test_predict = model_pipeline.predict(X_test)\n",
|
||
"\n",
|
||
" # Вычисление метрик для текущей модели\n",
|
||
" metrics_dict: dict[str, Any] = {\n",
|
||
" \"MAE_train\": metrics.mean_absolute_error(y_train, y_train_predict),\n",
|
||
" \"MAE_test\": metrics.mean_absolute_error(y_test, y_test_predict),\n",
|
||
" \"MSE_train\": metrics.mean_squared_error(y_train, y_train_predict),\n",
|
||
" \"MSE_test\": metrics.mean_squared_error(y_test, y_test_predict),\n",
|
||
" \"R2_train\": metrics.r2_score(y_train, y_train_predict),\n",
|
||
" \"R2_test\": metrics.r2_score(y_test, y_test_predict),\n",
|
||
" \"STD_train\": np.std(y_train - y_train_predict),\n",
|
||
" \"STD_test\": np.std(y_test - y_test_predict),\n",
|
||
" }\n",
|
||
"\n",
|
||
" # Сохранение результатов\n",
|
||
" results[model_name] = metrics_dict\n",
|
||
" \n",
|
||
" return results\n",
|
||
"\n",
|
||
"\n",
|
||
"y_train = np.ravel(y_df_train) \n",
|
||
"y_test = np.ravel(y_df_test) \n",
|
||
"\n",
|
||
"result: dict[str, dict[str, Any]] = evaluate_models(models_regression,\n",
|
||
" pipeline_end,\n",
|
||
" X_df_train, y_train,\n",
|
||
" X_df_test, y_test)\n",
|
||
"\n",
|
||
"# Вывод результатов\n",
|
||
"for model_name, metrics_dict in result.items():\n",
|
||
" print(f\"Модель: {model_name}\")\n",
|
||
" for metric_name, value in metrics_dict.items():\n",
|
||
" print(f\"\\t{metric_name}: {value}\")\n",
|
||
" print()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Подбор гиперпараметров:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 324,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Fitting 3 folds for each of 36 candidates, totalling 108 fits\n",
|
||
"Лучшие параметры: {'max_depth': 20, 'min_samples_split': 2, 'n_estimators': 100}\n",
|
||
"Лучший результат (MSE): 188.5929593664171\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"from sklearn.model_selection import GridSearchCV\n",
|
||
"\n",
|
||
"\n",
|
||
"# Применение конвейера к данным\n",
|
||
"X_train_processing_result = pipeline_end.fit_transform(X_df_train)\n",
|
||
"X_test_processing_result = pipeline_end.transform(X_df_test)\n",
|
||
"\n",
|
||
"# Создание и настройка модели случайного леса\n",
|
||
"model = RandomForestRegressor()\n",
|
||
"\n",
|
||
"# Установка параметров для поиска по сетке\n",
|
||
"param_grid: dict[str, list[int | None]] = {\n",
|
||
" 'n_estimators': [50, 100, 200], # Количество деревьев\n",
|
||
" 'max_depth': [None, 10, 20, 30], # Максимальная глубина дерева\n",
|
||
" 'min_samples_split': [2, 5, 10] # Минимальное количество образцов для разбиения узла\n",
|
||
"}\n",
|
||
"\n",
|
||
"# Подбор гиперпараметров с помощью поиска по сетке\n",
|
||
"grid_search = GridSearchCV(estimator=model, \n",
|
||
" param_grid=param_grid,\n",
|
||
" scoring='neg_mean_squared_error', cv=3, n_jobs=-1, verbose=2)\n",
|
||
"\n",
|
||
"# Обучение модели на тренировочных данных\n",
|
||
"grid_search.fit(X_train_processing_result, y_train)\n",
|
||
"\n",
|
||
"# Результаты подбора гиперпараметров\n",
|
||
"print(\"Лучшие параметры:\", grid_search.best_params_)\n",
|
||
"# Меняем знак, так как берем отрицательное значение среднеквадратичной ошибки\n",
|
||
"print(\"Лучший результат (MSE):\", -grid_search.best_score_)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Сравнение наборов гиперпараметров:\n",
|
||
"\n",
|
||
"Результаты анализа показывают, что параметры из старой сетки обеспечивают значительно лучшее качество модели. Среднеквадратическая ошибка (MSE) на кросс-валидации для старых параметров составила 179.369, что существенно ниже, чем для новых параметров (1290.656). На тестовой выборке модель с новыми параметрами показала MSE 172.574, что сопоставимо с результатами модели со старыми параметрами, однако этот результат является случайным, так как новые параметры продемонстрировали плохую кросс-валидационную ошибку, указывая на недообучение. Таким образом, параметры из старой сетки более предпочтительны, так как они обеспечивают лучшее обобщение и меньшую ошибку."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 325,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Fitting 3 folds for each of 36 candidates, totalling 108 fits\n",
|
||
"Старые параметры: {'max_depth': 30, 'min_samples_split': 5, 'n_estimators': 50}\n",
|
||
"Лучший результат (MSE) на старых параметрах: 179.369172166932\n",
|
||
"\n",
|
||
"Новые параметры: {'max_depth': 5, 'min_samples_split': 10, 'n_estimators': 50}\n",
|
||
"Лучший результат (MSE) на новых параметрах: 1290.6561132979532\n",
|
||
"Среднеквадратическая ошибка (MSE) на тестовых данных: 172.57398236522087\n",
|
||
"Корень среднеквадратичной ошибки (RMSE) на тестовых данных: 13.136741695154885\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"image/png": "",
|
||
"text/plain": [
|
||
"<Figure size 1000x500 with 1 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Установка параметров для поиска по сетке для старых значений\n",
|
||
"old_param_grid: dict[str, list[int | None]] = {\n",
|
||
" 'n_estimators': [50, 100, 200], # Количество деревьев\n",
|
||
" 'max_depth': [None, 10, 20, 30], # Максимальная глубина дерева\n",
|
||
" 'min_samples_split': [2, 5, 10] # Минимальное количество образцов для разбиения узла\n",
|
||
"}\n",
|
||
"\n",
|
||
"# Подбор гиперпараметров с помощью поиска по сетке для старых параметров\n",
|
||
"old_grid_search = GridSearchCV(estimator=model, \n",
|
||
" param_grid=old_param_grid,\n",
|
||
" scoring='neg_mean_squared_error', cv=3, n_jobs=-1, verbose=2)\n",
|
||
"\n",
|
||
"# Обучение модели на тренировочных данных\n",
|
||
"old_grid_search.fit(X_train_processing_result, y_train)\n",
|
||
"\n",
|
||
"# Результаты подбора для старых параметров\n",
|
||
"old_best_params = old_grid_search.best_params_\n",
|
||
" # Меняем знак, так как берем отрицательное значение MSE\n",
|
||
"old_best_mse = -old_grid_search.best_score_\n",
|
||
"\n",
|
||
"\n",
|
||
"# Установка параметров для поиска по сетке для новых значений\n",
|
||
"new_param_grid: dict[str, list[int]] = {\n",
|
||
" 'n_estimators': [50],\n",
|
||
" 'max_depth': [5],\n",
|
||
" 'min_samples_split': [10]\n",
|
||
"}\n",
|
||
"\n",
|
||
"# Подбор гиперпараметров с помощью поиска по сетке для новых параметров\n",
|
||
"new_grid_search = GridSearchCV(estimator=model, \n",
|
||
" param_grid=new_param_grid,\n",
|
||
" scoring='neg_mean_squared_error', cv=2)\n",
|
||
"\n",
|
||
"# Обучение модели на тренировочных данных\n",
|
||
"new_grid_search.fit(X_train_processing_result, y_train)\n",
|
||
"\n",
|
||
"# Результаты подбора для новых параметров\n",
|
||
"new_best_params = new_grid_search.best_params_\n",
|
||
"# Меняем знак, так как берем отрицательное значение MSE\n",
|
||
"new_best_mse = -new_grid_search.best_score_\n",
|
||
"\n",
|
||
"# Обучение модели с лучшими параметрами для новых значений\n",
|
||
"model_best = RandomForestRegressor(**new_best_params)\n",
|
||
"model_best.fit(X_train_processing_result, y_train)\n",
|
||
"\n",
|
||
"# Прогнозирование на тестовой выборке\n",
|
||
"y_pred = model_best.predict(X_test_processing_result)\n",
|
||
"\n",
|
||
"# Оценка производительности модели\n",
|
||
"mse = metrics.mean_squared_error(y_test, y_pred)\n",
|
||
"rmse = np.sqrt(mse)\n",
|
||
"\n",
|
||
"# Вывод результатов\n",
|
||
"print(\"Старые параметры:\", old_best_params)\n",
|
||
"print(\"Лучший результат (MSE) на старых параметрах:\", old_best_mse)\n",
|
||
"print(\"\\nНовые параметры:\", new_best_params)\n",
|
||
"print(\"Лучший результат (MSE) на новых параметрах:\", new_best_mse)\n",
|
||
"print(\"Среднеквадратическая ошибка (MSE) на тестовых данных:\", mse)\n",
|
||
"print(\"Корень среднеквадратичной ошибки (RMSE) на тестовых данных:\", rmse)\n",
|
||
"\n",
|
||
"# Обучение модели с лучшими параметрами для старых значений\n",
|
||
"model_old = RandomForestRegressor(**old_best_params)\n",
|
||
"model_old.fit(X_train_processing_result, y_train)\n",
|
||
"\n",
|
||
"# Прогнозирование на тестовой выборке для старых параметров\n",
|
||
"y_pred_old = model_old.predict(X_test_processing_result)\n",
|
||
"\n",
|
||
"# Визуализация ошибок\n",
|
||
"plt.figure(figsize=(10, 5))\n",
|
||
"plt.plot(y_test, label='Реальные значения', marker='o', linestyle='-', color='black')\n",
|
||
"plt.plot(y_pred_old, label='Предсказанные значения (старые параметры)', marker='x', linestyle='--', color='blue')\n",
|
||
"plt.plot(y_pred, label='Предсказанные значения (новые параметры)', marker='s', linestyle='--', color='orange')\n",
|
||
"plt.xlabel('Объекты')\n",
|
||
"plt.ylabel('Значения')\n",
|
||
"plt.title('Сравнение реальных и предсказанных значений')\n",
|
||
"plt.legend()\n",
|
||
"plt.show()"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "aimenv",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.12.5"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 2
|
||
}
|