{
|
|||
|
"cells": [
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"## Dataset: [Tesla Insider Trading](https://www.kaggle.com/datasets/ilyaryabov/tesla-insider-trading).\n",
"\n",
"### Dataset description:\n",
"The dataset is a sample of transactions in Tesla securities made by insiders and is part of a larger project, \"Insider Trading S&P500 – Inside Info\". The data covers transactions by major shareholders and company officers, including purchases, sales, and option exercises, from 10 November 2021 to 27 July 2022.\n",
"\n",
"---\n",
"\n",
"### Analysis of the information:\n",
"**Problem domain:**\n",
"The dataset concerns the analysis of insider trades in public companies and their effect on share pricing. Insider transactions, made by people with access to non-public information (such as executives, major shareholders, or board members), can be indicators of future changes in the share price. Studying these transactions helps show how information inside the company is reflected in the actions of key participants, and can reveal behavior patterns that influence the markets.\n",
"\n",
"**Relevance:**\n",
"Analysis of insider trades becomes especially important when the market is highly volatile and uncertain. Investors, analysts, and companies use this data to better understand the signals sent by major shareholders and officers. Insider actions such as buying and selling shares are often treated as indicators of confidence in the company, which can significantly affect market expectations and forecasts.\n",
"\n",
"**Objects of observation:**\n",
"The objects of observation in the dataset are Tesla insiders: people with significant influence over the company's management and information. Each object is described by several parameters, including position, transaction type, number of shares, and total value of the trades.\n",
"\n",
"**Object attributes:**\n",
"- Insider Trading: full name of the person who made the transaction.\n",
"- Relationship: position or status of that person at Tesla.\n",
"- Date: the date the transaction was completed.\n",
"- Transaction: transaction type.\n",
"- Cost: price of one share at the time of the transaction.\n",
"- Shares: number of shares involved in the transaction.\n",
"- Value ($): total value of the transaction in US dollars.\n",
"- Shares Total: total number of shares held by the person after the transaction.\n",
"- SEC Form 4: the date the transaction was recorded on SEC Form 4, the filing required for reporting insider trades.\n",
"\n",
"---\n",
"\n",
"### Business goals:\n",
"1. **For the regression task:**\n",
"Goal: predict the future price of Tesla shares from insider transactions. The share price (\"Cost\") depends on many factors, including the volume and type of insider transactions. Identifying relationships between transaction parameters (number of shares, total deal value, the insider's position) and the share price may help investors make informed buy or sell decisions.\n",
"2. **For the classification task:**\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"### Loading the data from a file into a DataFrame:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 316,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"from typing import Any, Tuple\n",
|
|||
|
"\n",
|
|||
|
"import pandas as pd\n",
|
|||
|
"from pandas import DataFrame\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
"df: DataFrame = pd.read_csv('../static/csv/TSLA.csv')"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"### Brief information about the DataFrame:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 317,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"<class 'pandas.core.frame.DataFrame'>\n",
|
|||
|
"RangeIndex: 156 entries, 0 to 155\n",
|
|||
|
"Data columns (total 9 columns):\n",
|
|||
|
" # Column Non-Null Count Dtype \n",
|
|||
|
"--- ------ -------------- ----- \n",
|
|||
|
" 0 Insider Trading 156 non-null object \n",
|
|||
|
" 1 Relationship 156 non-null object \n",
|
|||
|
" 2 Date 156 non-null object \n",
|
|||
|
" 3 Transaction 156 non-null object \n",
|
|||
|
" 4 Cost 156 non-null float64\n",
|
|||
|
" 5 Shares 156 non-null object \n",
|
|||
|
" 6 Value ($) 156 non-null object \n",
|
|||
|
" 7 Shares Total 156 non-null object \n",
|
|||
|
" 8 SEC Form 4 156 non-null object \n",
|
|||
|
"dtypes: float64(1), object(8)\n",
|
|||
|
"memory usage: 11.1+ KB\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>count</th>\n",
|
|||
|
" <th>mean</th>\n",
|
|||
|
" <th>std</th>\n",
|
|||
|
" <th>min</th>\n",
|
|||
|
" <th>25%</th>\n",
|
|||
|
" <th>50%</th>\n",
|
|||
|
" <th>75%</th>\n",
|
|||
|
" <th>max</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>Cost</th>\n",
|
|||
|
" <td>156.0</td>\n",
|
|||
|
" <td>478.785641</td>\n",
|
|||
|
" <td>448.922903</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>50.5225</td>\n",
|
|||
|
" <td>240.225</td>\n",
|
|||
|
" <td>934.1075</td>\n",
|
|||
|
" <td>1171.04</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" count mean std min 25% 50% 75% max\n",
|
|||
|
"Cost 156.0 478.785641 448.922903 0.0 50.5225 240.225 934.1075 1171.04"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 317,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
"# Brief information about the DataFrame\n",
"df.info()\n",
"\n",
"# Descriptive statistics for the numeric columns\n",
"df.describe().transpose()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"### Data type conversion:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 318,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Выборка данных:\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>Insider Trading</th>\n",
|
|||
|
" <th>Relationship</th>\n",
|
|||
|
" <th>Transaction</th>\n",
|
|||
|
" <th>Cost</th>\n",
|
|||
|
" <th>Shares</th>\n",
|
|||
|
" <th>Value ($)</th>\n",
|
|||
|
" <th>Shares Total</th>\n",
|
|||
|
" <th>Year</th>\n",
|
|||
|
" <th>Month</th>\n",
|
|||
|
" <th>Day</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>Kirkhorn Zachary</td>\n",
|
|||
|
" <td>Chief Financial Officer</td>\n",
|
|||
|
" <td>Sale</td>\n",
|
|||
|
" <td>196.72</td>\n",
|
|||
|
" <td>10455</td>\n",
|
|||
|
" <td>2056775</td>\n",
|
|||
|
" <td>203073</td>\n",
|
|||
|
" <td>2022</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>6</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>Taneja Vaibhav</td>\n",
|
|||
|
" <td>Chief Accounting Officer</td>\n",
|
|||
|
" <td>Sale</td>\n",
|
|||
|
" <td>195.79</td>\n",
|
|||
|
" <td>2466</td>\n",
|
|||
|
" <td>482718</td>\n",
|
|||
|
" <td>100458</td>\n",
|
|||
|
" <td>2022</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>6</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>Baglino Andrew D</td>\n",
|
|||
|
" <td>SVP Powertrain and Energy Eng.</td>\n",
|
|||
|
" <td>Sale</td>\n",
|
|||
|
" <td>195.79</td>\n",
|
|||
|
" <td>1298</td>\n",
|
|||
|
" <td>254232</td>\n",
|
|||
|
" <td>65547</td>\n",
|
|||
|
" <td>2022</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>6</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>Taneja Vaibhav</td>\n",
|
|||
|
" <td>Chief Accounting Officer</td>\n",
|
|||
|
" <td>Option Exercise</td>\n",
|
|||
|
" <td>0.00</td>\n",
|
|||
|
" <td>7138</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>102923</td>\n",
|
|||
|
" <td>2022</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>5</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>Baglino Andrew D</td>\n",
|
|||
|
" <td>SVP Powertrain and Energy Eng.</td>\n",
|
|||
|
" <td>Option Exercise</td>\n",
|
|||
|
" <td>0.00</td>\n",
|
|||
|
" <td>2586</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>66845</td>\n",
|
|||
|
" <td>2022</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>5</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5</th>\n",
|
|||
|
" <td>Kirkhorn Zachary</td>\n",
|
|||
|
" <td>Chief Financial Officer</td>\n",
|
|||
|
" <td>Option Exercise</td>\n",
|
|||
|
" <td>0.00</td>\n",
|
|||
|
" <td>16867</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>213528</td>\n",
|
|||
|
" <td>2022</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>5</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>6</th>\n",
|
|||
|
" <td>Baglino Andrew D</td>\n",
|
|||
|
" <td>SVP Powertrain and Energy Eng.</td>\n",
|
|||
|
" <td>Option Exercise</td>\n",
|
|||
|
" <td>20.91</td>\n",
|
|||
|
" <td>10500</td>\n",
|
|||
|
" <td>219555</td>\n",
|
|||
|
" <td>74759</td>\n",
|
|||
|
" <td>2022</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>27</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>7</th>\n",
|
|||
|
" <td>Baglino Andrew D</td>\n",
|
|||
|
" <td>SVP Powertrain and Energy Eng.</td>\n",
|
|||
|
" <td>Sale</td>\n",
|
|||
|
" <td>202.00</td>\n",
|
|||
|
" <td>10500</td>\n",
|
|||
|
" <td>2121000</td>\n",
|
|||
|
" <td>64259</td>\n",
|
|||
|
" <td>2022</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>27</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>8</th>\n",
|
|||
|
" <td>Kirkhorn Zachary</td>\n",
|
|||
|
" <td>Chief Financial Officer</td>\n",
|
|||
|
" <td>Sale</td>\n",
|
|||
|
" <td>193.00</td>\n",
|
|||
|
" <td>3750</td>\n",
|
|||
|
" <td>723750</td>\n",
|
|||
|
" <td>196661</td>\n",
|
|||
|
" <td>2022</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>6</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>9</th>\n",
|
|||
|
" <td>Baglino Andrew D</td>\n",
|
|||
|
" <td>SVP Powertrain and Energy Eng.</td>\n",
|
|||
|
" <td>Option Exercise</td>\n",
|
|||
|
" <td>20.91</td>\n",
|
|||
|
" <td>10500</td>\n",
|
|||
|
" <td>219555</td>\n",
|
|||
|
" <td>74759</td>\n",
|
|||
|
" <td>2022</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>27</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" Insider Trading Relationship Transaction Cost \\\n",
|
|||
|
"0 Kirkhorn Zachary Chief Financial Officer Sale 196.72 \n",
|
|||
|
"1 Taneja Vaibhav Chief Accounting Officer Sale 195.79 \n",
|
|||
|
"2 Baglino Andrew D SVP Powertrain and Energy Eng. Sale 195.79 \n",
|
|||
|
"3 Taneja Vaibhav Chief Accounting Officer Option Exercise 0.00 \n",
|
|||
|
"4 Baglino Andrew D SVP Powertrain and Energy Eng. Option Exercise 0.00 \n",
|
|||
|
"5 Kirkhorn Zachary Chief Financial Officer Option Exercise 0.00 \n",
|
|||
|
"6 Baglino Andrew D SVP Powertrain and Energy Eng. Option Exercise 20.91 \n",
|
|||
|
"7 Baglino Andrew D SVP Powertrain and Energy Eng. Sale 202.00 \n",
|
|||
|
"8 Kirkhorn Zachary Chief Financial Officer Sale 193.00 \n",
|
|||
|
"9 Baglino Andrew D SVP Powertrain and Energy Eng. Option Exercise 20.91 \n",
|
|||
|
"\n",
|
|||
|
" Shares Value ($) Shares Total Year Month Day \n",
|
|||
|
"0 10455 2056775 203073 2022 3 6 \n",
|
|||
|
"1 2466 482718 100458 2022 3 6 \n",
|
|||
|
"2 1298 254232 65547 2022 3 6 \n",
|
|||
|
"3 7138 0 102923 2022 3 5 \n",
|
|||
|
"4 2586 0 66845 2022 3 5 \n",
|
|||
|
"5 16867 0 213528 2022 3 5 \n",
|
|||
|
"6 10500 219555 74759 2022 2 27 \n",
|
|||
|
"7 10500 2121000 64259 2022 2 27 \n",
|
|||
|
"8 3750 723750 196661 2022 2 6 \n",
|
|||
|
"9 10500 219555 74759 2022 1 27 "
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 318,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
"# Data type conversion\n",
"df['Insider Trading'] = df['Insider Trading'].astype('category')  # Convert to category\n",
"df['Relationship'] = df['Relationship'].astype('category')  # Convert to category\n",
"df['Transaction'] = df['Transaction'].astype('category')  # Convert to category\n",
"df['Cost'] = pd.to_numeric(df['Cost'], errors='coerce')  # Convert to float\n",
"df['Shares'] = pd.to_numeric(df['Shares'].str.replace(',', ''), errors='coerce')  # Strip thousands separators, then convert to numeric\n",
"df['Value ($)'] = pd.to_numeric(df['Value ($)'].str.replace(',', ''), errors='coerce')  # Strip thousands separators, then convert to numeric\n",
"df['Shares Total'] = pd.to_numeric(df['Shares Total'].str.replace(',', ''), errors='coerce')  # Strip thousands separators, then convert to numeric\n",
"\n",
"df['Date'] = pd.to_datetime(df['Date'], errors='coerce')  # Convert to datetime\n",
"df['Year'] = df['Date'].dt.year  # Year\n",
"df['Month'] = df['Date'].dt.month  # Month\n",
"df['Day'] = df['Date'].dt.day  # Day\n",
"df: DataFrame = df.drop(columns=['Date', 'SEC Form 4'])  # Drop the raw date columns\n",
"\n",
"print('Выборка данных:')\n",
"df.head(10)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"## Business goal 1 (regression task).\n",
"\n",
"### Achievable level of model quality:\n",
"**Main regression metrics:**\n",
"- **Mean Absolute Error (MAE)**: the average absolute deviation between predicted and actual values.\n",
"It is easy to interpret, especially for financial data, where every dollar of error matters.\n",
"- **Mean Squared Error (MSE)**: the average of the squared deviations between the model's predictions and the true values. Because the errors are squared, large misses are penalized more heavily. Suitable for assessing overall model quality.\n",
"- **Coefficient of determination (R²)**: the share of the variance of the dependent variable that the model explains. R² is at most 1 (the closer to 1, the better) and can be negative when the model predicts worse than a constant equal to the mean of the evaluated targets.\n",
"\n",
"---\n",
"\n",
"### Choosing a baseline:\n",
"The baseline model predicts the mean of the target variable (Cost) computed on the training set. This is a simple, intuitive method that serves as a minimal reference point for comparison with more complex models. The baseline establishes the initial error level (MAE, MSE) and quality score (R²) that more complex models must improve on to justify their use.\n",
"\n",
"---"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"### Splitting the data:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 319,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Baseline MAE: 417.78235887096776\n",
|
|||
|
"Baseline MSE: 182476.07973024843\n",
|
|||
|
"Baseline R²: -0.027074997920953914\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
"from pandas import DataFrame\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score\n",
"\n",
"\n",
"# Split the data into training and test sets\n",
"def split_into_train_test(\n",
"    df_input: DataFrame,\n",
"    stratify_colname: str = \"y\",\n",
"    frac_train: float = 0.8,\n",
"    random_state: int = 42,\n",
") -> Tuple[DataFrame, DataFrame, DataFrame, DataFrame]:\n",
"\n",
"    if stratify_colname not in df_input.columns:\n",
"        raise ValueError(\"%s is not a column in the dataframe\" % (stratify_colname))\n",
"\n",
"    if not (0 < frac_train < 1):\n",
"        raise ValueError(\"Fraction must be between 0 and 1.\")\n",
"\n",
"    X: DataFrame = df_input  # Contains all columns, including the target.\n",
"    y: DataFrame = df_input[[stratify_colname]]  # Single-column DataFrame with the target.\n",
"\n",
"    # Plain random split; no stratification is applied despite the parameter name.\n",
"    X_train, X_test, y_train, y_test = train_test_split(\n",
"        X, y,\n",
"        test_size=(1.0 - frac_train),\n",
"        random_state=random_state\n",
"    )\n",
"\n",
"    return X_train, X_test, y_train, y_test\n",
"\n",
"\n",
"# Define the target feature and the input features\n",
"# (X_features is not used by the split below, which keeps every column)\n",
"y_feature: str = 'Cost'\n",
"X_features: list[str] = df.drop(columns=y_feature).columns.tolist()\n",
"\n",
"# Split the data into training and test sets\n",
"X_df_train, X_df_test, y_df_train, y_df_test = split_into_train_test(\n",
"    df,\n",
"    stratify_colname=y_feature,\n",
"    frac_train=0.8,\n",
"    random_state=42\n",
")\n",
"\n",
"# Baseline predictions: the training-set mean of the target, repeated for every test row\n",
"baseline_predictions: list[float] = [y_df_train[y_feature].mean()] * len(y_df_test)\n",
"\n",
"# Evaluate the baseline model\n",
"print('Baseline MAE:', mean_absolute_error(y_df_test, baseline_predictions))\n",
"print('Baseline MSE:', mean_squared_error(y_df_test, baseline_predictions))\n",
"print('Baseline R²:', r2_score(y_df_test, baseline_predictions))"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"### Choosing the models to train:\n",
"\n",
"The following models were chosen for training:\n",
"1. **Random Forest**: an ensemble model built from many decision trees. It handles nonlinear relationships and noisy data well and is relatively robust to overfitting.\n",
"2. **Linear Regression**: a simple model that assumes a linear relationship between the features and the target variable. It trains quickly and its results are easy to interpret.\n",
"3. **Gradient Boosting**: a powerful model that builds an ensemble of trees, each correcting the errors of the previous ones. It is effective on complex datasets and can deliver high predictive accuracy.\n",
"\n",
"---"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"### Building the pipeline:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 320,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
"from sklearn.impute import SimpleImputer\n",
"from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.pipeline import Pipeline\n",
"\n",
"\n",
"# Numeric columns (df still contains the target 'Cost', so it is included here)\n",
"num_columns: list[str] = [\n",
"    column\n",
"    for column in df.columns\n",
"    if df[column].dtype not in (\"category\", \"object\")\n",
"]\n",
"\n",
"# Categorical columns\n",
"cat_columns: list[str] = [\n",
"    column\n",
"    for column in df.columns\n",
"    if df[column].dtype in (\"category\", \"object\")\n",
"]\n",
"\n",
"# Fill missing numeric values with the median\n",
"num_imputer = SimpleImputer(strategy=\"median\")\n",
"# Standardization\n",
"num_scaler = StandardScaler()\n",
"# Pipeline for numeric data\n",
"preprocessing_num = Pipeline(\n",
"    [\n",
"        (\"imputer\", num_imputer),\n",
"        (\"scaler\", num_scaler),\n",
"    ]\n",
")\n",
"\n",
"# Fill missing categorical values with a constant\n",
"cat_imputer = SimpleImputer(strategy=\"constant\", fill_value=\"unknown\")\n",
"# One-hot encoding\n",
"cat_encoder = OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False, drop=\"first\")\n",
"# Pipeline for categorical data\n",
"preprocessing_cat = Pipeline(\n",
"    [\n",
"        (\"imputer\", cat_imputer),\n",
"        (\"encoder\", cat_encoder),\n",
"    ]\n",
")\n",
"\n",
"# Transformer for feature preprocessing\n",
"features_preprocessing = ColumnTransformer(\n",
"    verbose_feature_names_out=False,\n",
"    transformers=[\n",
"        (\"preprocessing_num\", preprocessing_num, num_columns),\n",
"        (\"preprocessing_cat\", preprocessing_cat, cat_columns),\n",
"    ],\n",
"    remainder=\"passthrough\"\n",
")\n",
"\n",
"# Main data preprocessing pipeline\n",
"pipeline_end = Pipeline(\n",
"    [\n",
"        (\"features_preprocessing\", features_preprocessing),\n",
"    ]\n",
")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"### Demonstrating the pipeline:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 321,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>Cost</th>\n",
|
|||
|
" <th>Shares</th>\n",
|
|||
|
" <th>Value ($)</th>\n",
|
|||
|
" <th>Shares Total</th>\n",
|
|||
|
" <th>Year</th>\n",
|
|||
|
" <th>Month</th>\n",
|
|||
|
" <th>Day</th>\n",
|
|||
|
" <th>Insider Trading_DENHOLM ROBYN M</th>\n",
|
|||
|
" <th>Insider Trading_Kirkhorn Zachary</th>\n",
|
|||
|
" <th>Insider Trading_Musk Elon</th>\n",
|
|||
|
" <th>Insider Trading_Musk Kimbal</th>\n",
|
|||
|
" <th>Insider Trading_Taneja Vaibhav</th>\n",
|
|||
|
" <th>Insider Trading_Wilson-Thompson Kathleen</th>\n",
|
|||
|
" <th>Relationship_Chief Accounting Officer</th>\n",
|
|||
|
" <th>Relationship_Chief Financial Officer</th>\n",
|
|||
|
" <th>Relationship_Director</th>\n",
|
|||
|
" <th>Relationship_SVP Powertrain and Energy Eng.</th>\n",
|
|||
|
" <th>Transaction_Sale</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>-0.966516</td>\n",
|
|||
|
" <td>-0.361759</td>\n",
|
|||
|
" <td>-0.450022</td>\n",
|
|||
|
" <td>-0.343599</td>\n",
|
|||
|
" <td>0.715678</td>\n",
|
|||
|
" <td>-0.506108</td>\n",
|
|||
|
" <td>-0.400623</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>-1.074894</td>\n",
|
|||
|
" <td>1.225216</td>\n",
|
|||
|
" <td>-0.414725</td>\n",
|
|||
|
" <td>-0.319938</td>\n",
|
|||
|
" <td>-1.397276</td>\n",
|
|||
|
" <td>0.801338</td>\n",
|
|||
|
" <td>0.906673</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>-1.074894</td>\n",
|
|||
|
" <td>1.211753</td>\n",
|
|||
|
" <td>-0.415027</td>\n",
|
|||
|
" <td>-0.320141</td>\n",
|
|||
|
" <td>-1.397276</td>\n",
|
|||
|
" <td>1.062828</td>\n",
|
|||
|
" <td>-0.098939</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>1.167142</td>\n",
|
|||
|
" <td>0.037499</td>\n",
|
|||
|
" <td>1.023612</td>\n",
|
|||
|
" <td>-0.325853</td>\n",
|
|||
|
" <td>-1.397276</td>\n",
|
|||
|
" <td>1.062828</td>\n",
|
|||
|
" <td>-0.501184</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>1.217886</td>\n",
|
|||
|
" <td>-0.075287</td>\n",
|
|||
|
" <td>0.632973</td>\n",
|
|||
|
" <td>-0.330205</td>\n",
|
|||
|
" <td>-1.397276</td>\n",
|
|||
|
" <td>1.062828</td>\n",
|
|||
|
" <td>-0.501184</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5</th>\n",
|
|||
|
" <td>0.505872</td>\n",
|
|||
|
" <td>-0.361021</td>\n",
|
|||
|
" <td>-0.443679</td>\n",
|
|||
|
" <td>-0.343698</td>\n",
|
|||
|
" <td>0.715678</td>\n",
|
|||
|
" <td>-0.767598</td>\n",
|
|||
|
" <td>1.308918</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>6</th>\n",
|
|||
|
" <td>-1.088674</td>\n",
|
|||
|
" <td>-0.357532</td>\n",
|
|||
|
" <td>-0.450389</td>\n",
|
|||
|
" <td>-0.342863</td>\n",
|
|||
|
" <td>0.715678</td>\n",
|
|||
|
" <td>0.278360</td>\n",
|
|||
|
" <td>-0.903429</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>7</th>\n",
|
|||
|
" <td>-0.692146</td>\n",
|
|||
|
" <td>-0.355855</td>\n",
|
|||
|
" <td>-0.445383</td>\n",
|
|||
|
" <td>-0.343220</td>\n",
|
|||
|
" <td>0.715678</td>\n",
|
|||
|
" <td>0.801338</td>\n",
|
|||
|
" <td>1.409480</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>8</th>\n",
|
|||
|
" <td>-1.088674</td>\n",
|
|||
|
" <td>-0.361181</td>\n",
|
|||
|
" <td>-0.450389</td>\n",
|
|||
|
" <td>-0.343649</td>\n",
|
|||
|
" <td>-1.397276</td>\n",
|
|||
|
" <td>1.062828</td>\n",
|
|||
|
" <td>-0.903429</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>9</th>\n",
|
|||
|
" <td>1.091997</td>\n",
|
|||
|
" <td>-0.204531</td>\n",
|
|||
|
" <td>0.114712</td>\n",
|
|||
|
" <td>1.538166</td>\n",
|
|||
|
" <td>0.715678</td>\n",
|
|||
|
" <td>-1.029087</td>\n",
|
|||
|
" <td>1.208357</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" Cost Shares Value ($) Shares Total Year Month Day \\\n",
|
|||
|
"0 -0.966516 -0.361759 -0.450022 -0.343599 0.715678 -0.506108 -0.400623 \n",
|
|||
|
"1 -1.074894 1.225216 -0.414725 -0.319938 -1.397276 0.801338 0.906673 \n",
|
|||
|
"2 -1.074894 1.211753 -0.415027 -0.320141 -1.397276 1.062828 -0.098939 \n",
|
|||
|
"3 1.167142 0.037499 1.023612 -0.325853 -1.397276 1.062828 -0.501184 \n",
|
|||
|
"4 1.217886 -0.075287 0.632973 -0.330205 -1.397276 1.062828 -0.501184 \n",
|
|||
|
"5 0.505872 -0.361021 -0.443679 -0.343698 0.715678 -0.767598 1.308918 \n",
|
|||
|
"6 -1.088674 -0.357532 -0.450389 -0.342863 0.715678 0.278360 -0.903429 \n",
|
|||
|
"7 -0.692146 -0.355855 -0.445383 -0.343220 0.715678 0.801338 1.409480 \n",
|
|||
|
"8 -1.088674 -0.361181 -0.450389 -0.343649 -1.397276 1.062828 -0.903429 \n",
|
|||
|
"9 1.091997 -0.204531 0.114712 1.538166 0.715678 -1.029087 1.208357 \n",
|
|||
|
"\n",
|
|||
|
" Insider Trading_DENHOLM ROBYN M Insider Trading_Kirkhorn Zachary \\\n",
|
|||
|
"0 0.0 0.0 \n",
|
|||
|
"1 0.0 0.0 \n",
|
|||
|
"2 0.0 0.0 \n",
|
|||
|
"3 0.0 0.0 \n",
|
|||
|
"4 0.0 0.0 \n",
|
|||
|
"5 0.0 0.0 \n",
|
|||
|
"6 0.0 0.0 \n",
|
|||
|
"7 0.0 0.0 \n",
|
|||
|
"8 0.0 0.0 \n",
|
|||
|
"9 0.0 0.0 \n",
|
|||
|
"\n",
|
|||
|
" Insider Trading_Musk Elon Insider Trading_Musk Kimbal \\\n",
|
|||
|
"0 0.0 0.0 \n",
|
|||
|
"1 1.0 0.0 \n",
|
|||
|
"2 1.0 0.0 \n",
|
|||
|
"3 1.0 0.0 \n",
|
|||
|
"4 1.0 0.0 \n",
|
|||
|
"5 0.0 0.0 \n",
|
|||
|
"6 0.0 0.0 \n",
|
|||
|
"7 0.0 0.0 \n",
|
|||
|
"8 0.0 0.0 \n",
|
|||
|
"9 1.0 0.0 \n",
|
|||
|
"\n",
|
|||
|
" Insider Trading_Taneja Vaibhav Insider Trading_Wilson-Thompson Kathleen \\\n",
|
|||
|
"0 1.0 0.0 \n",
|
|||
|
"1 0.0 0.0 \n",
|
|||
|
"2 0.0 0.0 \n",
|
|||
|
"3 0.0 0.0 \n",
|
|||
|
"4 0.0 0.0 \n",
|
|||
|
"5 0.0 0.0 \n",
|
|||
|
"6 1.0 0.0 \n",
|
|||
|
"7 0.0 0.0 \n",
|
|||
|
"8 1.0 0.0 \n",
|
|||
|
"9 0.0 0.0 \n",
|
|||
|
"\n",
|
|||
|
" Relationship_Chief Accounting Officer \\\n",
|
|||
|
"0 1.0 \n",
|
|||
|
"1 0.0 \n",
|
|||
|
"2 0.0 \n",
|
|||
|
"3 0.0 \n",
|
|||
|
"4 0.0 \n",
|
|||
|
"5 0.0 \n",
|
|||
|
"6 1.0 \n",
|
|||
|
"7 0.0 \n",
|
|||
|
"8 1.0 \n",
|
|||
|
"9 0.0 \n",
|
|||
|
"\n",
|
|||
|
" Relationship_Chief Financial Officer Relationship_Director \\\n",
|
|||
|
"0 0.0 0.0 \n",
|
|||
|
"1 0.0 0.0 \n",
|
|||
|
"2 0.0 0.0 \n",
|
|||
|
"3 0.0 0.0 \n",
|
|||
|
"4 0.0 0.0 \n",
|
|||
|
"5 0.0 0.0 \n",
|
|||
|
"6 0.0 0.0 \n",
|
|||
|
"7 0.0 0.0 \n",
|
|||
|
"8 0.0 0.0 \n",
|
|||
|
"9 0.0 0.0 \n",
|
|||
|
"\n",
|
|||
|
" Relationship_SVP Powertrain and Energy Eng. Transaction_Sale \n",
|
|||
|
"0 0.0 0.0 \n",
|
|||
|
"1 0.0 0.0 \n",
|
|||
|
"2 0.0 0.0 \n",
|
|||
|
"3 0.0 1.0 \n",
|
|||
|
"4 0.0 1.0 \n",
|
|||
|
"5 1.0 1.0 \n",
|
|||
|
"6 0.0 0.0 \n",
|
|||
|
"7 1.0 1.0 \n",
|
|||
|
"8 0.0 0.0 \n",
|
|||
|
"9 0.0 1.0 "
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 321,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
"# Apply the preprocessing pipeline\n",
|
|||
|
"preprocessing_result = pipeline_end.fit_transform(X_df_train)\n",
|
|||
|
"preprocessed_df = pd.DataFrame(\n",
|
|||
|
" preprocessing_result,\n",
|
|||
|
" columns=pipeline_end.get_feature_names_out(),\n",
|
|||
|
")\n",
|
|||
|
"\n",
|
|||
|
"preprocessed_df.head(10)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"### Model training:\n",
"\n",
"Evaluation of the cross-validation results. `cross_val_score` is called without a `scoring` argument, so for these regressors each score is the coefficient of determination (R²) on a held-out fold, not a classification accuracy:\n",
"1. **Random Forest**:\n",
"    - Metrics:\n",
"        - Mean score: 0.9993.\n",
"        - Standard deviation: 0.00045.\n",
"    - Conclusion: A very high R² on the held-out folds, which indicates that the model generalizes well within the training data. The low standard deviation shows that the score is stable from fold to fold.\n",
"2. **Linear Regression**:\n",
"    - Metrics:\n",
"        - Mean score: 1.0.\n",
"        - Standard deviation: 0.0.\n",
"    - Conclusion: A perfect score on every fold. Each fold is scored on data the model did not see, so this is not by itself a sign of overfitting. It more likely means that the target is an exact linear function of the input features, so the feature set should be checked for target leakage before the result is trusted on new data.\n",
"3. **Gradient Boosting**:\n",
"    - Metrics:\n",
"        - Mean score: 0.9998.\n",
"        - Standard deviation: 0.00014.\n",
"    - Conclusion: Excellent results, with a high R² and low variability between folds. The model is also stable."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 322,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
|||
|
" return fit_method(estimator, *args, **kwargs)\n",
|
|||
|
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
|||
|
" return fit_method(estimator, *args, **kwargs)\n",
|
|||
|
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
|||
|
" return fit_method(estimator, *args, **kwargs)\n",
|
|||
|
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\preprocessing\\_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros\n",
|
|||
|
" warnings.warn(\n",
|
|||
|
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
|||
|
" return fit_method(estimator, *args, **kwargs)\n",
|
|||
|
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\base.py:1473: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n",
|
|||
|
" return fit_method(estimator, *args, **kwargs)\n",
|
|||
|
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\preprocessing\\_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros\n",
|
|||
|
" warnings.warn(\n",
|
|||
|
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
|||
|
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
|||
|
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
|||
|
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
|||
|
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
|||
|
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Random Forest:\n",
|
|||
|
" Mean Score = 0.9992841344976828\n",
|
|||
|
" Standard Deviation = 0.0004515288830049682\n",
|
|||
|
"Linear Regression:\n",
|
|||
|
" Mean Score = 1.0\n",
|
|||
|
" Standard Deviation = 0.0\n",
|
|||
|
"Gradient Boosting:\n",
|
|||
|
" Mean Score = 0.9997688048426001\n",
|
|||
|
" Standard Deviation = 0.0001416815109781245\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\preprocessing\\_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros\n",
|
|||
|
" warnings.warn(\n",
|
|||
|
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
|||
|
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n",
|
|||
|
"d:\\ULSTU\\Семестр 5\\AIM-PIbd-31-Masenkin-M-S\\aimenv\\Lib\\site-packages\\sklearn\\ensemble\\_gb.py:668: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
|
|||
|
" y = column_or_1d(y, warn=True) # TODO: Is this still required?\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
"from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"\n",
"# Train the models and score them with cross-validation\n",
"def train_models(X: DataFrame, y: DataFrame,\n",
"                 models: dict[str, Any]) -> dict[str, Any]:\n",
"    results: dict[str, Any] = {}\n",
"    # Flatten y to shape (n_samples,) to avoid DataConversionWarning\n",
"    y_flat = y.to_numpy().ravel()\n",
"    for model_name, model in models.items():\n",
"        # Build a pipeline for each model\n",
"        model_pipeline = Pipeline(\n",
"            [\n",
"                (\"features_preprocessing\", features_preprocessing),\n",
"                (\"model\", model)  # the current model\n",
"            ]\n",
"        )\n",
"\n",
"        # 5-fold cross-validation; with no scoring argument, regressors are scored by R²\n",
"        scores = cross_val_score(model_pipeline, X, y_flat, cv=5)\n",
"        results[model_name] = {\n",
"            \"mean_score\": scores.mean(),\n",
"            \"std_dev\": scores.std()\n",
"        }\n",
"\n",
"    return results\n",
"\n",
"\n",
"models_regression: dict[str, Any] = {\n",
"    \"Random Forest\": RandomForestRegressor(),\n",
"    \"Linear Regression\": LinearRegression(),\n",
"    \"Gradient Boosting\": GradientBoostingRegressor(),\n",
"}\n",
"\n",
"results: dict[str, Any] = train_models(X_df_train, y_df_train, models_regression)\n",
"\n",
"# Print the results\n",
"for model_name, scores in results.items():\n",
"    print(f\"\"\"{model_name}:\n",
"    Mean Score = {scores['mean_score']}\n",
"    Standard Deviation = {scores['std_dev']}\"\"\")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"### Evaluation on the test set:\n",
"\n",
"Comparison of the metrics on the training and test sets:\n",
"1. Random Forest:\n",
"    - Metrics:\n",
"        - MAE (train): 1.858\n",
"        - MAE (test): 4.489\n",
"        - MSE (train): 10.959\n",
"        - MSE (test): 62.644\n",
"        - R2 (train): 0.9999\n",
"        - R2 (test): 0.9996\n",
"        - STD (train): 3.310\n",
"        - STD (test): 7.757\n",
"    - Conclusion: Random Forest reaches an excellent R2 on both the training and test sets, which indicates strong generalization. However, MAE and MSE are noticeably higher on the test set than on the training set (MSE 62.644 against 10.959), which points to some overfitting.\n",
"2. Linear Regression:\n",
"    - Metrics:\n",
"        - MAE (train): 3.069e-13\n",
"        - MAE (test): 2.762e-13\n",
"        - MSE (train): 1.437e-25\n",
"        - MSE (test): 1.196e-25\n",
"        - R2 (train): 1.0\n",
"        - R2 (test): 1.0\n",
"        - STD (train): 3.730e-13\n",
"        - STD (test): 3.444e-13\n",
"    - Conclusion: The errors (MAE, MSE) are at the level of floating-point rounding on both the training and test sets, so the model reproduces the target exactly. Because the test error is as small as the training error, this is not overfitting. It more likely means that the target is an exact linear function of the features, which is worth checking for target leakage.\n",
"3. Gradient Boosting:\n",
"    - Metrics:\n",
"        - MAE (train): 0.156\n",
"        - MAE (test): 3.027\n",
"        - MSE (train): 0.075\n",
"        - MSE (test): 41.360\n",
"        - R2 (train): 0.9999996\n",
"        - R2 (test): 0.9998\n",
"        - STD (train): 0.274\n",
"        - STD (test): 6.399\n",
"    - Conclusion: Gradient Boosting fits the training set almost perfectly, but MAE and MSE are much higher on the test set (MSE 41.360 against 0.075). This gap points to some overfitting or to a need for further tuning of the model, even though the test R2 remains very high."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 323,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
"Model: Random Forest\n",
|
|||
|
"\tMAE_train: 1.8584435483870716\n",
|
|||
|
"\tMAE_test: 4.489381249999976\n",
|
|||
|
"\tMSE_train: 10.958770153225622\n",
|
|||
|
"\tMSE_test: 62.643889510626195\n",
|
|||
|
"\tR2_train: 0.9999465631134502\n",
|
|||
|
"\tR2_test: 0.9996474059899577\n",
|
|||
|
"\tSTD_train: 3.3095436106742198\n",
|
|||
|
"\tSTD_test: 7.757028236410516\n",
|
|||
|
"\n",
|
|||
"Model: Linear Regression\n",
|
|||
|
"\tMAE_train: 3.0690862038154006e-13\n",
|
|||
|
"\tMAE_test: 2.761679773755077e-13\n",
|
|||
|
"\tMSE_train: 1.4370485712253764e-25\n",
|
|||
|
"\tMSE_test: 1.19585889812782e-25\n",
|
|||
|
"\tR2_train: 1.0\n",
|
|||
|
"\tR2_test: 1.0\n",
|
|||
|
"\tSTD_train: 3.7295840825107354e-13\n",
|
|||
|
"\tSTD_test: 3.4438670391637766e-13\n",
|
|||
|
"\n",
|
|||
"Model: Gradient Boosting\n",
|
|||
|
"\tMAE_train: 0.15613772760448064\n",
|
|||
|
"\tMAE_test: 3.027282706028462\n",
|
|||
|
"\tMSE_train: 0.07499640211231481\n",
|
|||
|
"\tMSE_test: 41.36034726227861\n",
|
|||
|
"\tR2_train: 0.9999996343043813\n",
|
|||
|
"\tR2_test: 0.9997672013852927\n",
|
|||
|
"\tSTD_train: 0.2738547098596532\n",
|
|||
|
"\tSTD_test: 6.3988297145358555\n",
|
|||
|
"\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
"import numpy as np\n",
"\n",
"from sklearn import metrics\n",
"\n",
"\n",
"# Evaluate the models on a set of regression metrics\n",
"def evaluate_models(models,\n",
"                    pipeline_end,\n",
"                    X_train, y_train,\n",
"                    X_test, y_test) -> dict[str, dict[str, Any]]:\n",
"    results: dict[str, dict[str, Any]] = {}\n",
"\n",
"    for model_name, model in models.items():\n",
"        # Build a pipeline for the current model\n",
"        model_pipeline = Pipeline(\n",
"            [\n",
"                (\"pipeline\", pipeline_end),\n",
"                (\"model\", model),\n",
"            ]\n",
"        )\n",
"\n",
"        # Fit the current model\n",
"        model_pipeline.fit(X_train, y_train)\n",
"\n",
"        # Predict on the training and test sets\n",
"        y_train_predict = model_pipeline.predict(X_train)\n",
"        y_test_predict = model_pipeline.predict(X_test)\n",
"\n",
"        # Compute the metrics for the current model (STD is the standard deviation of the residuals)\n",
"        metrics_dict: dict[str, Any] = {\n",
"            \"MAE_train\": metrics.mean_absolute_error(y_train, y_train_predict),\n",
"            \"MAE_test\": metrics.mean_absolute_error(y_test, y_test_predict),\n",
"            \"MSE_train\": metrics.mean_squared_error(y_train, y_train_predict),\n",
"            \"MSE_test\": metrics.mean_squared_error(y_test, y_test_predict),\n",
"            \"R2_train\": metrics.r2_score(y_train, y_train_predict),\n",
"            \"R2_test\": metrics.r2_score(y_test, y_test_predict),\n",
"            \"STD_train\": np.std(y_train - y_train_predict),\n",
"            \"STD_test\": np.std(y_test - y_test_predict),\n",
"        }\n",
"\n",
"        # Store the results\n",
"        results[model_name] = metrics_dict\n",
"\n",
"    return results\n",
"\n",
"\n",
"# Flatten the targets to shape (n_samples,)\n",
"y_train = np.ravel(y_df_train)\n",
"y_test = np.ravel(y_df_test)\n",
"\n",
"result: dict[str, dict[str, Any]] = evaluate_models(\n",
"    models_regression,\n",
"    pipeline_end,\n",
"    X_df_train, y_train,\n",
"    X_df_test, y_test,\n",
")\n",
"\n",
"# Print the results\n",
"for model_name, metrics_dict in result.items():\n",
"    print(f\"Model: {model_name}\")\n",
"    for metric_name, value in metrics_dict.items():\n",
"        print(f\"\\t{metric_name}: {value}\")\n",
"    print()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"### Hyperparameter tuning:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 324,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Fitting 3 folds for each of 36 candidates, totalling 108 fits\n",
|
|||
"Best parameters: {'max_depth': 20, 'min_samples_split': 2, 'n_estimators': 100}\n",
"Best score (MSE): 188.5929593664171\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"\n",
"# Apply the pipeline to the data\n",
"X_train_processing_result = pipeline_end.fit_transform(X_df_train)\n",
"X_test_processing_result = pipeline_end.transform(X_df_test)\n",
"\n",
"# Create the random forest model\n",
"model = RandomForestRegressor()\n",
"\n",
"# Parameter grid for the search\n",
"param_grid: dict[str, list[int | None]] = {\n",
"    'n_estimators': [50, 100, 200],  # number of trees\n",
"    'max_depth': [None, 10, 20, 30],  # maximum tree depth\n",
"    'min_samples_split': [2, 5, 10]  # minimum number of samples required to split a node\n",
"}\n",
"\n",
"# Hyperparameter tuning with grid search\n",
"grid_search = GridSearchCV(estimator=model,\n",
"                           param_grid=param_grid,\n",
"                           scoring='neg_mean_squared_error', cv=3, n_jobs=-1, verbose=2)\n",
"\n",
"# Fit on the training data\n",
"grid_search.fit(X_train_processing_result, y_train)\n",
"\n",
"# Tuning results\n",
"print(\"Best parameters:\", grid_search.best_params_)\n",
"# Flip the sign because the scorer returns the negative mean squared error\n",
"print(\"Best score (MSE):\", -grid_search.best_score_)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
"### Comparison of hyperparameter sets:\n",
"\n",
"The results show that the best parameters found on the old grid give a much better model. Their cross-validated mean squared error (MSE) is 179.369, far below the 1290.656 obtained with the new parameters. Note that the old grid is scored with 3-fold and the new one with 2-fold cross-validation, so the two estimates are only roughly comparable. On the test set, the model with the new parameters reaches an MSE of 172.574, which is close to the cross-validated MSE of the old parameters (the test MSE of the old-parameter model is not computed here). This test result is most likely fortuitous, because the new parameters showed a poor cross-validation error, which points to underfitting. The parameters from the old grid are therefore preferable, since they give better generalization and a lower error."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 325,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Fitting 3 folds for each of 36 candidates, totalling 108 fits\n",
|
|||
"Old parameters: {'max_depth': 30, 'min_samples_split': 5, 'n_estimators': 50}\n",
"Best score (MSE) with the old parameters: 179.369172166932\n",
"\n",
"New parameters: {'max_depth': 5, 'min_samples_split': 10, 'n_estimators': 50}\n",
"Best score (MSE) with the new parameters: 1290.6561132979532\n",
"Mean squared error (MSE) on the test data: 172.57398236522087\n",
"Root mean squared error (RMSE) on the test data: 13.136741695154885\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
"text/plain": [
|
|||
|
"<Figure size 1000x500 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
"# Parameter grid for the old values\n",
"old_param_grid: dict[str, list[int | None]] = {\n",
"    'n_estimators': [50, 100, 200],  # number of trees\n",
"    'max_depth': [None, 10, 20, 30],  # maximum tree depth\n",
"    'min_samples_split': [2, 5, 10]  # minimum number of samples required to split a node\n",
"}\n",
"\n",
"# Grid search over the old parameters\n",
"old_grid_search = GridSearchCV(estimator=model,\n",
"                               param_grid=old_param_grid,\n",
"                               scoring='neg_mean_squared_error', cv=3, n_jobs=-1, verbose=2)\n",
"\n",
"# Fit on the training data\n",
"old_grid_search.fit(X_train_processing_result, y_train)\n",
"\n",
"# Results for the old parameters\n",
"old_best_params = old_grid_search.best_params_\n",
"# Flip the sign because the scorer returns the negative MSE\n",
"old_best_mse = -old_grid_search.best_score_\n",
"\n",
"\n",
"# Parameter grid for the new values (a single combination)\n",
"new_param_grid: dict[str, list[int]] = {\n",
"    'n_estimators': [50],\n",
"    'max_depth': [5],\n",
"    'min_samples_split': [10]\n",
"}\n",
"\n",
"# Grid search over the new parameters (note: 2 folds here, 3 folds above)\n",
"new_grid_search = GridSearchCV(estimator=model,\n",
"                               param_grid=new_param_grid,\n",
"                               scoring='neg_mean_squared_error', cv=2)\n",
"\n",
"# Fit on the training data\n",
"new_grid_search.fit(X_train_processing_result, y_train)\n",
"\n",
"# Results for the new parameters\n",
"new_best_params = new_grid_search.best_params_\n",
"# Flip the sign because the scorer returns the negative MSE\n",
"new_best_mse = -new_grid_search.best_score_\n",
"\n",
"# Train a model with the new parameters\n",
"model_best = RandomForestRegressor(**new_best_params)\n",
"model_best.fit(X_train_processing_result, y_train)\n",
"\n",
"# Predict on the test set\n",
"y_pred = model_best.predict(X_test_processing_result)\n",
"\n",
"# Evaluate the model\n",
"mse = metrics.mean_squared_error(y_test, y_pred)\n",
"rmse = np.sqrt(mse)\n",
"\n",
"# Print the results\n",
"print(\"Old parameters:\", old_best_params)\n",
"print(\"Best score (MSE) with the old parameters:\", old_best_mse)\n",
"print(\"\\nNew parameters:\", new_best_params)\n",
"print(\"Best score (MSE) with the new parameters:\", new_best_mse)\n",
"print(\"Mean squared error (MSE) on the test data:\", mse)\n",
"print(\"Root mean squared error (RMSE) on the test data:\", rmse)\n",
"\n",
"# Train a model with the best parameters from the old grid\n",
"model_old = RandomForestRegressor(**old_best_params)\n",
"model_old.fit(X_train_processing_result, y_train)\n",
"\n",
"# Predict on the test set with the old parameters\n",
"y_pred_old = model_old.predict(X_test_processing_result)\n",
"\n",
"# Plot the actual and predicted values\n",
"plt.figure(figsize=(10, 5))\n",
"plt.plot(y_test, label='Actual values', marker='o', linestyle='-', color='black')\n",
"plt.plot(y_pred_old, label='Predicted values (old parameters)', marker='x', linestyle='--', color='blue')\n",
"plt.plot(y_pred, label='Predicted values (new parameters)', marker='s', linestyle='--', color='orange')\n",
"plt.xlabel('Samples')\n",
"plt.ylabel('Values')\n",
"plt.title('Comparison of actual and predicted values')\n",
"plt.legend()\n",
"plt.show()"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"metadata": {
|
|||
|
"kernelspec": {
|
|||
|
"display_name": "aimenv",
|
|||
|
"language": "python",
|
|||
|
"name": "python3"
|
|||
|
},
|
|||
|
"language_info": {
|
|||
|
"codemirror_mode": {
|
|||
|
"name": "ipython",
|
|||
|
"version": 3
|
|||
|
},
|
|||
|
"file_extension": ".py",
|
|||
|
"mimetype": "text/x-python",
|
|||
|
"name": "python",
|
|||
|
"nbconvert_exporter": "python",
|
|||
|
"pygments_lexer": "ipython3",
|
|||
|
"version": "3.12.5"
|
|||
|
}
|
|||
|
},
|
|||
|
"nbformat": 4,
|
|||
|
"nbformat_minor": 2
|
|||
|
}
|