AIM-PIbd-32-Kaznacheeva-E-K/lab_3/Lab3.ipynb


2024-10-26 12:47:48 +04:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting started...\n",
"\n",
"*Assignment variant:* House sales in King County (variant 6)  \n",
"We begin by defining the business goals and the technical goals of the project.  \n",
"\n",
"### Business goals:  \n",
"1. Streamline house price appraisal  \n",
"\n",
"**Statement:** Build a model that estimates the price of a house automatically and accurately from its characteristics (such as living area, number of rooms, condition, and location).  \n",
"**Goal:** Make price estimates more accurate for real estate agencies and prospective buyers, and reduce the time and cost of each appraisal.  \n",
"\n",
"**Key performance indicators (KPIs):**  \n",
"*Prediction accuracy* (RMSE): bring the root mean squared error below 10% of the actual price. RMSE penalizes large deviations more heavily, so it is sensitive to large misses.  \n",
"*Mean absolute error* (MAE): bring the MAE down to 5% of the actual price or less. MAE measures the typical size of an error and is less sensitive to large deviations than RMSE.  \n",
"*Appraisal speed:* shorten the time needed to estimate a house price so that results arrive faster.  \n",
"*Availability:* deploy the model in a production system used by real estate agents.\n",
"\n",
"2. Optimize pre-sale renovation costs  \n",
"\n",
"**Statement:** Build a model that helps home sellers and real estate agencies identify which improvements or renovations add the most value to a house at the lowest cost. This avoids unnecessary spending and maximizes the profit from the sale.  \n",
"**Goal:** Reduce pre-sale renovation costs, recommend only the improvements that raise the property's value the most, and shorten the time spent deciding on renovations.  \n",
"\n",
"**Key performance indicators (KPIs):**  \n",
"*Return on investment* (ROI): sellers should gain at least 20% on top of every dollar invested in renovation. For example, if \\$10,000 is spent on repairs, the price of the house should rise by at least \\$12,000.  \n",
"*Average renovation cost per deal* (CPA): cut renovation spending by eliminating unnecessary expenses. For example, cap costs at \\$5,000 per house while maximizing the resulting price increase.  \n",
"*Faster decisions:* the model should cut the time needed to evaluate renovation options to a few minutes, which speeds up preparing the house for sale.\n",
"\n",
"### Technical goals for each business goal\n",
"\n",
"1. **Build a model that estimates house prices accurately.**  \n",
"*Data collection and preparation:* Clean the data by handling missing values, duplicates, and outliers (anomalous values in the price, sqft_living, and bedrooms columns). Convert the categorical variables (view, condition, waterfront) to numeric form with one-hot encoding. Scale the numeric features (normalization or standardization) so that they share a common scale. Split the dataset into training, validation, and test sets to prevent data leakage and to detect overfitting.  \n",
"*Model development and training:* Experiment with several machine learning algorithms (linear regression, decision trees, random forest, gradient boosting) for predicting property prices. Train each model on the training set and evaluate it with RMSE (Root Mean Square Error) and MAE (Mean Absolute Error). Assess the final models on the test set, aiming for the lowest MAE and RMSE.  \n",
"*Deployment:* Integrate the model into an existing system or build an API that gives real estate agencies and private sellers access to it. Create a web application or mobile interface for convenient use of the model and real-time predictions.\n",
"\n",
"2. **Build a model that recommends renovations.**  \n",
"*Data collection and preparation:* Collect data on renovation types and costs and on their effect on the final price of the house. Clean the renovation records by removing inaccurate or incomplete entries. Convert categorical features (renovation types such as roof replacement or window replacement) to numeric form with one-hot encoding. Split the data into training and test sets.  \n",
"*Model development and training:* Use regression models (linear regression, random forest) to predict how specific renovations affect the property's value. Evaluate two metrics: CPA (Cost Per Acquisition), the renovation cost per sale, and ROI (Return on Investment), the gain in value after renovation relative to its cost. Train the model to predict which changes bring the greatest increase in house value.  \n",
"*Deployment:* Create an interface where users enter the current state of the house and receive renovation recommendations with an ROI estimate. Build a recommender system for property sellers that proposes a set of renovations.\n"
]
},
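{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how the KPIs listed above could be checked. The helper names (`relative_errors`, `renovation_roi`) and the toy prices are hypothetical and are not part of the lab. Expressing RMSE and MAE as a share of the mean actual price is only one possible reading of the percent-of-price targets.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def relative_errors(y_true, y_pred):\n",
"    # Return (RMSE, MAE), each divided by the mean actual price\n",
"    y_true = np.asarray(y_true, dtype=float)\n",
"    y_pred = np.asarray(y_pred, dtype=float)\n",
"    errors = y_pred - y_true\n",
"    rmse = np.sqrt(np.mean(errors ** 2))\n",
"    mae = np.mean(np.abs(errors))\n",
"    mean_price = y_true.mean()\n",
"    return rmse / mean_price, mae / mean_price\n",
"\n",
"\n",
"def renovation_roi(cost, price_increase):\n",
"    # ROI = (gain in price - renovation cost) / renovation cost\n",
"    return (price_increase - cost) / cost\n",
"\n",
"\n",
"# Toy prices, for illustration only (not taken from the dataset)\n",
"actual = [200_000, 400_000, 600_000]\n",
"predicted = [210_000, 380_000, 630_000]\n",
"rel_rmse, rel_mae = relative_errors(actual, predicted)\n",
"print(f'RMSE: {rel_rmse:.1%} of the mean price (target: below 10%)')\n",
"print(f'MAE: {rel_mae:.1%} of the mean price (target: 5% or less)')\n",
"\n",
"# The example from the KPI list: 10,000 spent, price up by 12,000\n",
"print(f'ROI: {renovation_roi(10_000, 12_000):.0%}')"
]
},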
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',\n",
" 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',\n",
" 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',\n",
" 'lat', 'long', 'sqft_living15', 'sqft_lot15'],\n",
" dtype='object')\n"
]
}
],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib.ticker as ticker\n",
"import seaborn as sns\n",
"\n",
"# Load the dataset into a DataFrame and list its columns\n",
"df = pd.read_csv(\".//static//csv//kc_house_data.csv\")\n",
"print(df.columns)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>date</th>\n",
" <th>price</th>\n",
" <th>bedrooms</th>\n",
" <th>bathrooms</th>\n",
" <th>sqft_living</th>\n",
" <th>sqft_lot</th>\n",
" <th>floors</th>\n",
" <th>waterfront</th>\n",
" <th>view</th>\n",
" <th>...</th>\n",
" <th>grade</th>\n",
" <th>sqft_above</th>\n",
" <th>sqft_basement</th>\n",
" <th>yr_built</th>\n",
" <th>yr_renovated</th>\n",
" <th>zipcode</th>\n",
" <th>lat</th>\n",
" <th>long</th>\n",
" <th>sqft_living15</th>\n",
" <th>sqft_lot15</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>7129300520</td>\n",
" <td>20141013T000000</td>\n",
" <td>221900.0</td>\n",
" <td>3</td>\n",
" <td>1.00</td>\n",
" <td>1180</td>\n",
" <td>5650</td>\n",
" <td>1.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>7</td>\n",
" <td>1180</td>\n",
" <td>0</td>\n",
" <td>1955</td>\n",
" <td>0</td>\n",
" <td>98178</td>\n",
" <td>47.5112</td>\n",
" <td>-122.257</td>\n",
" <td>1340</td>\n",
" <td>5650</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>6414100192</td>\n",
" <td>20141209T000000</td>\n",
" <td>538000.0</td>\n",
" <td>3</td>\n",
" <td>2.25</td>\n",
" <td>2570</td>\n",
" <td>7242</td>\n",
" <td>2.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>7</td>\n",
" <td>2170</td>\n",
" <td>400</td>\n",
" <td>1951</td>\n",
" <td>1991</td>\n",
" <td>98125</td>\n",
" <td>47.7210</td>\n",
" <td>-122.319</td>\n",
" <td>1690</td>\n",
" <td>7639</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>5631500400</td>\n",
" <td>20150225T000000</td>\n",
" <td>180000.0</td>\n",
" <td>2</td>\n",
" <td>1.00</td>\n",
" <td>770</td>\n",
" <td>10000</td>\n",
" <td>1.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>6</td>\n",
" <td>770</td>\n",
" <td>0</td>\n",
" <td>1933</td>\n",
" <td>0</td>\n",
" <td>98028</td>\n",
" <td>47.7379</td>\n",
" <td>-122.233</td>\n",
" <td>2720</td>\n",
" <td>8062</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2487200875</td>\n",
" <td>20141209T000000</td>\n",
" <td>604000.0</td>\n",
" <td>4</td>\n",
" <td>3.00</td>\n",
" <td>1960</td>\n",
" <td>5000</td>\n",
" <td>1.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>7</td>\n",
" <td>1050</td>\n",
" <td>910</td>\n",
" <td>1965</td>\n",
" <td>0</td>\n",
" <td>98136</td>\n",
" <td>47.5208</td>\n",
" <td>-122.393</td>\n",
" <td>1360</td>\n",
" <td>5000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1954400510</td>\n",
" <td>20150218T000000</td>\n",
" <td>510000.0</td>\n",
" <td>3</td>\n",
" <td>2.00</td>\n",
" <td>1680</td>\n",
" <td>8080</td>\n",
" <td>1.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>8</td>\n",
" <td>1680</td>\n",
" <td>0</td>\n",
" <td>1987</td>\n",
" <td>0</td>\n",
" <td>98074</td>\n",
" <td>47.6168</td>\n",
" <td>-122.045</td>\n",
" <td>1800</td>\n",
" <td>7503</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 21 columns</p>\n",
"</div>"
],
"text/plain": [
" id date price bedrooms bathrooms sqft_living \\\n",
"0 7129300520 20141013T000000 221900.0 3 1.00 1180 \n",
"1 6414100192 20141209T000000 538000.0 3 2.25 2570 \n",
"2 5631500400 20150225T000000 180000.0 2 1.00 770 \n",
"3 2487200875 20141209T000000 604000.0 4 3.00 1960 \n",
"4 1954400510 20150218T000000 510000.0 3 2.00 1680 \n",
"\n",
" sqft_lot floors waterfront view ... grade sqft_above sqft_basement \\\n",
"0 5650 1.0 0 0 ... 7 1180 0 \n",
"1 7242 2.0 0 0 ... 7 2170 400 \n",
"2 10000 1.0 0 0 ... 6 770 0 \n",
"3 5000 1.0 0 0 ... 7 1050 910 \n",
"4 8080 1.0 0 0 ... 8 1680 0 \n",
"\n",
" yr_built yr_renovated zipcode lat long sqft_living15 \\\n",
"0 1955 0 98178 47.5112 -122.257 1340 \n",
"1 1951 1991 98125 47.7210 -122.319 1690 \n",
"2 1933 0 98028 47.7379 -122.233 2720 \n",
"3 1965 0 98136 47.5208 -122.393 1360 \n",
"4 1987 0 98074 47.6168 -122.045 1800 \n",
"\n",
" sqft_lot15 \n",
"0 5650 \n",
"1 7639 \n",
"2 8062 \n",
"3 5000 \n",
"4 7503 \n",
"\n",
"[5 rows x 21 columns]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Preview the first rows\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>price</th>\n",
" <th>bedrooms</th>\n",
" <th>bathrooms</th>\n",
" <th>sqft_living</th>\n",
" <th>sqft_lot</th>\n",
" <th>floors</th>\n",
" <th>waterfront</th>\n",
" <th>view</th>\n",
" <th>condition</th>\n",
" <th>grade</th>\n",
" <th>sqft_above</th>\n",
" <th>sqft_basement</th>\n",
" <th>yr_built</th>\n",
" <th>yr_renovated</th>\n",
" <th>zipcode</th>\n",
" <th>lat</th>\n",
" <th>long</th>\n",
" <th>sqft_living15</th>\n",
" <th>sqft_lot15</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>2.161300e+04</td>\n",
" <td>2.161300e+04</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>2.161300e+04</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>4.580302e+09</td>\n",
" <td>5.400881e+05</td>\n",
" <td>3.370842</td>\n",
" <td>2.114757</td>\n",
" <td>2079.899736</td>\n",
" <td>1.510697e+04</td>\n",
" <td>1.494309</td>\n",
" <td>0.007542</td>\n",
" <td>0.234303</td>\n",
" <td>3.409430</td>\n",
" <td>7.656873</td>\n",
" <td>1788.390691</td>\n",
" <td>291.509045</td>\n",
" <td>1971.005136</td>\n",
" <td>84.402258</td>\n",
" <td>98077.939805</td>\n",
" <td>47.560053</td>\n",
" <td>-122.213896</td>\n",
" <td>1986.552492</td>\n",
" <td>12768.455652</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>2.876566e+09</td>\n",
" <td>3.671272e+05</td>\n",
" <td>0.930062</td>\n",
" <td>0.770163</td>\n",
" <td>918.440897</td>\n",
" <td>4.142051e+04</td>\n",
" <td>0.539989</td>\n",
" <td>0.086517</td>\n",
" <td>0.766318</td>\n",
" <td>0.650743</td>\n",
" <td>1.175459</td>\n",
" <td>828.090978</td>\n",
" <td>442.575043</td>\n",
" <td>29.373411</td>\n",
" <td>401.679240</td>\n",
" <td>53.505026</td>\n",
" <td>0.138564</td>\n",
" <td>0.140828</td>\n",
" <td>685.391304</td>\n",
" <td>27304.179631</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>1.000102e+06</td>\n",
" <td>7.500000e+04</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>290.000000</td>\n",
" <td>5.200000e+02</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>290.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1900.000000</td>\n",
" <td>0.000000</td>\n",
" <td>98001.000000</td>\n",
" <td>47.155900</td>\n",
" <td>-122.519000</td>\n",
" <td>399.000000</td>\n",
" <td>651.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>2.123049e+09</td>\n",
" <td>3.219500e+05</td>\n",
" <td>3.000000</td>\n",
" <td>1.750000</td>\n",
" <td>1427.000000</td>\n",
" <td>5.040000e+03</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>3.000000</td>\n",
" <td>7.000000</td>\n",
" <td>1190.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1951.000000</td>\n",
" <td>0.000000</td>\n",
" <td>98033.000000</td>\n",
" <td>47.471000</td>\n",
" <td>-122.328000</td>\n",
" <td>1490.000000</td>\n",
" <td>5100.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>3.904930e+09</td>\n",
" <td>4.500000e+05</td>\n",
" <td>3.000000</td>\n",
" <td>2.250000</td>\n",
" <td>1910.000000</td>\n",
" <td>7.618000e+03</td>\n",
" <td>1.500000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>3.000000</td>\n",
" <td>7.000000</td>\n",
" <td>1560.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1975.000000</td>\n",
" <td>0.000000</td>\n",
" <td>98065.000000</td>\n",
" <td>47.571800</td>\n",
" <td>-122.230000</td>\n",
" <td>1840.000000</td>\n",
" <td>7620.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>7.308900e+09</td>\n",
" <td>6.450000e+05</td>\n",
" <td>4.000000</td>\n",
" <td>2.500000</td>\n",
" <td>2550.000000</td>\n",
" <td>1.068800e+04</td>\n",
" <td>2.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>4.000000</td>\n",
" <td>8.000000</td>\n",
" <td>2210.000000</td>\n",
" <td>560.000000</td>\n",
" <td>1997.000000</td>\n",
" <td>0.000000</td>\n",
" <td>98118.000000</td>\n",
" <td>47.678000</td>\n",
" <td>-122.125000</td>\n",
" <td>2360.000000</td>\n",
" <td>10083.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>9.900000e+09</td>\n",
" <td>7.700000e+06</td>\n",
" <td>33.000000</td>\n",
" <td>8.000000</td>\n",
" <td>13540.000000</td>\n",
" <td>1.651359e+06</td>\n",
" <td>3.500000</td>\n",
" <td>1.000000</td>\n",
" <td>4.000000</td>\n",
" <td>5.000000</td>\n",
" <td>13.000000</td>\n",
" <td>9410.000000</td>\n",
" <td>4820.000000</td>\n",
" <td>2015.000000</td>\n",
" <td>2015.000000</td>\n",
" <td>98199.000000</td>\n",
" <td>47.777600</td>\n",
" <td>-121.315000</td>\n",
" <td>6210.000000</td>\n",
" <td>871200.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id price bedrooms bathrooms sqft_living \\\n",
"count 2.161300e+04 2.161300e+04 21613.000000 21613.000000 21613.000000 \n",
"mean 4.580302e+09 5.400881e+05 3.370842 2.114757 2079.899736 \n",
"std 2.876566e+09 3.671272e+05 0.930062 0.770163 918.440897 \n",
"min 1.000102e+06 7.500000e+04 0.000000 0.000000 290.000000 \n",
"25% 2.123049e+09 3.219500e+05 3.000000 1.750000 1427.000000 \n",
"50% 3.904930e+09 4.500000e+05 3.000000 2.250000 1910.000000 \n",
"75% 7.308900e+09 6.450000e+05 4.000000 2.500000 2550.000000 \n",
"max 9.900000e+09 7.700000e+06 33.000000 8.000000 13540.000000 \n",
"\n",
" sqft_lot floors waterfront view condition \\\n",
"count 2.161300e+04 21613.000000 21613.000000 21613.000000 21613.000000 \n",
"mean 1.510697e+04 1.494309 0.007542 0.234303 3.409430 \n",
"std 4.142051e+04 0.539989 0.086517 0.766318 0.650743 \n",
"min 5.200000e+02 1.000000 0.000000 0.000000 1.000000 \n",
"25% 5.040000e+03 1.000000 0.000000 0.000000 3.000000 \n",
"50% 7.618000e+03 1.500000 0.000000 0.000000 3.000000 \n",
"75% 1.068800e+04 2.000000 0.000000 0.000000 4.000000 \n",
"max 1.651359e+06 3.500000 1.000000 4.000000 5.000000 \n",
"\n",
" grade sqft_above sqft_basement yr_built yr_renovated \\\n",
"count 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 \n",
"mean 7.656873 1788.390691 291.509045 1971.005136 84.402258 \n",
"std 1.175459 828.090978 442.575043 29.373411 401.679240 \n",
"min 1.000000 290.000000 0.000000 1900.000000 0.000000 \n",
"25% 7.000000 1190.000000 0.000000 1951.000000 0.000000 \n",
"50% 7.000000 1560.000000 0.000000 1975.000000 0.000000 \n",
"75% 8.000000 2210.000000 560.000000 1997.000000 0.000000 \n",
"max 13.000000 9410.000000 4820.000000 2015.000000 2015.000000 \n",
"\n",
" zipcode lat long sqft_living15 sqft_lot15 \n",
"count 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 \n",
"mean 98077.939805 47.560053 -122.213896 1986.552492 12768.455652 \n",
"std 53.505026 0.138564 0.140828 685.391304 27304.179631 \n",
"min 98001.000000 47.155900 -122.519000 399.000000 651.000000 \n",
"25% 98033.000000 47.471000 -122.328000 1490.000000 5100.000000 \n",
"50% 98065.000000 47.571800 -122.230000 1840.000000 7620.000000 \n",
"75% 98118.000000 47.678000 -122.125000 2360.000000 10083.000000 \n",
"max 98199.000000 47.777600 -121.315000 6210.000000 871200.000000 "
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Summary statistics for the numeric columns\n",
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"id 0\n",
"date 0\n",
"price 0\n",
"bedrooms 0\n",
"bathrooms 0\n",
"sqft_living 0\n",
"sqft_lot 0\n",
"floors 0\n",
"waterfront 0\n",
"view 0\n",
"condition 0\n",
"grade 0\n",
"sqft_above 0\n",
"sqft_basement 0\n",
"yr_built 0\n",
"yr_renovated 0\n",
"zipcode 0\n",
"lat 0\n",
"long 0\n",
"sqft_living15 0\n",
"sqft_lot15 0\n",
"dtype: int64\n"
]
},
{
"data": {
"text/plain": [
"id False\n",
"date False\n",
"price False\n",
"bedrooms False\n",
"bathrooms False\n",
"sqft_living False\n",
"sqft_lot False\n",
"floors False\n",
"waterfront False\n",
"view False\n",
"condition False\n",
"grade False\n",
"sqft_above False\n",
"sqft_basement False\n",
"yr_built False\n",
"yr_renovated False\n",
"zipcode False\n",
"lat False\n",
"long False\n",
"sqft_living15 False\n",
"sqft_lot15 False\n",
"dtype: bool"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Percentage of missing values per feature\n",
"for i in df.columns:\n",
"    null_rate = df[i].isnull().sum() / len(df) * 100\n",
"    if null_rate > 0:\n",
"        print(f'{i}: {null_rate:.2f}% of values are missing')\n",
"\n",
"# Count of missing values per column\n",
"print(df.isnull().sum())\n",
"\n",
"df.isnull().any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Good news: no column has missing values :)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Split the data into training, validation, and test sets"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set size: 17290\n",
"Validation set size: 4323\n",
"Test set size: 4323\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Split the data into training and test sets (80% training, 20% test)\n",
"train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)\n",
"\n",
"# Split off a validation set (80% training, 20% validation).\n",
"# NOTE: this call splits df again with the same test_size and random_state,\n",
"# so val_data contains exactly the same rows as test_data.\n",
"train_data, val_data = train_test_split(df, test_size=0.2, random_state=42)\n",
"\n",
"print(\"Training set size:\", len(train_data))\n",
"print(\"Validation set size:\", len(val_data))\n",
"print(\"Test set size:\", len(test_data))"
]
},
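{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above calls `train_test_split` twice on `df` with the same `test_size` and `random_state`, so the validation set it produces is the same as the test set. Below is a sketch of one common way to get three disjoint sets: hold out the test set first, then split the remainder again. The 60/20/20 proportions and the names `rest`, `train_part`, `val_part`, and `test_part` are assumptions made for illustration, chosen so that the variables used elsewhere in the notebook are not overwritten.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Hold out 20% of all rows as the test set\n",
"rest, test_part = train_test_split(df, test_size=0.2, random_state=42)\n",
"\n",
"# Split the remaining 80% into training and validation sets:\n",
"# 0.25 of the remainder is 20% of the full dataset\n",
"train_part, val_part = train_test_split(rest, test_size=0.25, random_state=42)\n",
"\n",
"print('Training set size:', len(train_part))\n",
"print('Validation set size:', len(val_part))\n",
"print('Test set size:', len(test_part))\n",
"\n",
"# The three sets should not share any rows\n",
"assert set(train_part.index).isdisjoint(val_part.index)\n",
"assert set(train_part.index).isdisjoint(test_part.index)\n",
"assert set(val_part.index).isdisjoint(test_part.index)"
]
},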
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean price in the training set: 537768.04794679\n",
"Mean price in the validation set: 549367.443673375\n",
"Mean price in the test set: 549367.443673375\n"
]
}
],
"source": [
"# Check how the target variable (price) is distributed in each split\n",
"# Plot the price distribution as a histogram with a KDE curve\n",
"def plot_price_distribution(data, title):\n",
"    sns.histplot(data['price'], kde=True)\n",
"    plt.title(title)\n",
"    plt.xlabel('Price')\n",
"    plt.ylabel('Frequency')\n",
"    plt.show()\n",
"\n",
"plot_price_distribution(train_data, 'Price distribution in the training set')\n",
"plot_price_distribution(val_data, 'Price distribution in the validation set')\n",
"plot_price_distribution(test_data, 'Price distribution in the test set')\n",
"\n",
"# Compare the mean of the target variable (price) across the splits\n",
"print(\"Mean price in the training set:\", train_data['price'].mean())\n",
"print(\"Mean price in the validation set:\", val_data['price'].mean())\n",
"print(\"Mean price in the test set:\", test_data['price'].mean())"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set size after oversampling and undersampling: 17620\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"from imblearn.over_sampling import RandomOverSampler\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"# Convert the target variable (price) into four categorical ranges using quartiles\n",
"train_data['price_category'] = pd.qcut(train_data['price'], q=4, labels=['low', 'medium', 'high', 'very_high'])\n",
"\n",
"# Plot the distribution of the price categories\n",
"sns.countplot(x=train_data['price_category'])\n",
"plt.title('Price category distribution in the training set')\n",
"plt.xlabel('Price category')\n",
"plt.ylabel('Frequency')\n",
"plt.show()\n",
"\n",
"# Balance the categories with RandomOverSampler (upsamples the smaller classes)\n",
"ros = RandomOverSampler(random_state=42)\n",
"X_train = train_data.drop(columns=['price', 'price_category'])\n",
"y_train = train_data['price_category']\n",
"\n",
"X_resampled, y_resampled = ros.fit_resample(X_train, y_train)\n",
"\n",
"# Plot the distribution of the price categories after oversampling\n",
"sns.countplot(x=y_resampled)\n",
"plt.title('Price category distribution after oversampling')\n",
"plt.xlabel('Price category')\n",
"plt.ylabel('Frequency')\n",
"plt.show()\n",
"\n",
"# Apply RandomUnderSampler to downsample the larger classes\n",
"# (after oversampling the classes are already equal in size, so the counts should not change)\n",
"rus = RandomUnderSampler(random_state=42)\n",
"X_resampled, y_resampled = rus.fit_resample(X_resampled, y_resampled)\n",
"\n",
"# Plot the distribution of the price categories after undersampling\n",
"sns.countplot(x=y_resampled)\n",
"plt.title('Price category distribution after undersampling')\n",
"plt.xlabel('Price category')\n",
"plt.ylabel('Frequency')\n",
"plt.show()\n",
"\n",
"# Print the size of the training set after balancing\n",
"print(\"Training set size after oversampling and undersampling:\", len(X_resampled))"
]
},
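{
"cell_type": "markdown",
"metadata": {},
"source": [
"The quartile edges in the cell above are computed on the training set only, so the validation and test sets do not receive a `price_category` column. A minimal sketch, assuming the same edges should be reused for the other splits: `pd.qcut(..., retbins=True)` also returns the bin edges, and `pd.cut` applies them to another frame. The toy frames `train` and `val` are hypothetical stand-ins for `train_data` and `val_data`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data standing in for train_data / val_data\n",
"train = pd.DataFrame({'price': [100, 200, 300, 400, 500, 600, 700, 800]})\n",
"val = pd.DataFrame({'price': [50, 250, 450, 900]})\n",
"\n",
"labels = ['low', 'medium', 'high', 'very_high']\n",
"\n",
"# Learn the quartile edges on the training set only\n",
"train['price_category'], edges = pd.qcut(train['price'], q=4, labels=labels, retbins=True)\n",
"\n",
"# Widen the outer edges so that prices outside the training range still get a category\n",
"edges[0], edges[-1] = -np.inf, np.inf\n",
"\n",
"# Reuse the training edges for the validation set\n",
"val['price_category'] = pd.cut(val['price'], bins=edges, labels=labels)\n",
"\n",
"print(train['price_category'].value_counts().sort_index())\n",
"print(val)\n",
"```\n",
"\n",
"With the outer edges widened to infinity, a validation price below the training minimum or above the training maximum falls into the first or last category instead of becoming `NaN`."
]
},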
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Feature engineering\n",
"\n",
"Next, we engineer features for each of the two tasks.\n",
"\n",
"**Tasks and technical goals**  \n",
"Task 1: Predicting property prices. Technical goal: build a machine learning model that accurately predicts the market value of a property.  \n",
"Task 2: Optimizing pre-sale renovation costs. Technical goal: build a machine learning model that recommends the renovations expected to add the most value to a house.\n",
"\n",
"**One-hot encoding**  \n",
"One-hot encoding converts each categorical feature into a set of binary indicator columns, one per category.\n",
"\n",
"**Discretization of numerical features**  \n",
"Discretization converts continuous numerical values into discrete categories or intervals (bins)."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Columns of train_data_encoded: ['id', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15', 'price_category', 'date_20140502T000000', 'date_20140503T000000', 'date_20140504T000000', ... [output truncated]\n",
"Columns of val_data_encoded: ['id', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15', 'date_20140502T000000', 'date_20140503T000000', 'date_20140505T000000', ... [output truncated]\n",
"Columns of test_data_encoded: ['id', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15', 'date_20140502T000000', 'date_20140503T000000', 'date_20140505T000000', ... [output truncated]\n"
]
}
],
"source": [
"# Feature engineering\n",
"# One-hot encoding of the categorical features\n",
"\n",
"# Columns treated as categorical\n",
"# Note: 'date' gets one indicator column per distinct sale date, so the splits\n",
"# end up with different column sets, as the printed column lists show\n",
"categorical_features = ['date', 'waterfront', 'view', 'condition']\n",
"\n",
"# Apply one-hot encoding to each split and to the full dataset\n",
"train_data_encoded = pd.get_dummies(train_data, columns=categorical_features)\n",
"val_data_encoded = pd.get_dummies(val_data, columns=categorical_features)\n",
"test_data_encoded = pd.get_dummies(test_data, columns=categorical_features)\n",
"df_encoded = pd.get_dummies(df, columns=categorical_features)\n",
"\n",
"print(\"Columns of train_data_encoded:\", train_data_encoded.columns.tolist())\n",
"print(\"Columns of val_data_encoded:\", val_data_encoded.columns.tolist())\n",
"print(\"Columns of test_data_encoded:\", test_data_encoded.columns.tolist())\n",
"\n",
"\n",
"# Discretization of a numerical feature: split the living area ('sqft_living') into 5 equal-width bins\n",
"# Note: each call bins over its own value range, so the bin edges differ between splits\n",
"train_data_encoded['sqtf'] = pd.cut(train_data_encoded['sqft_living'], bins=5, labels=False)\n",
"val_data_encoded['sqtf'] = pd.cut(val_data_encoded['sqft_living'], bins=5, labels=False)\n",
"test_data_encoded['sqtf'] = pd.cut(test_data_encoded['sqft_living'], bins=5, labels=False)\n",
"\n",
"# The same discretization of 'sqft_living' into 5 bins for the full dataset\n",
"df_encoded['sqtf'] = pd.cut(df_encoded['sqft_living'], bins=5, labels=False)"
]
},
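{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `pd.get_dummies` is applied to each split separately, a category that is absent from a split produces no column there, which is why the printed column lists differ. A minimal sketch, assuming the training columns are taken as the reference layout: `DataFrame.reindex` adds the missing indicator columns filled with zeros and drops indicators the training set never saw. The toy frames below are hypothetical.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy splits: 'view' takes different values in each of them\n",
"train = pd.DataFrame({'view': [0, 1, 2], 'sqft_living': [1000, 1500, 2000]})\n",
"val = pd.DataFrame({'view': [0, 3], 'sqft_living': [1200, 1800]})\n",
"\n",
"train_enc = pd.get_dummies(train, columns=['view'])\n",
"val_enc = pd.get_dummies(val, columns=['view'])\n",
"\n",
"# Use the training columns as the reference layout:\n",
"# missing indicators (view_1, view_2) are added as 0, unseen ones (view_3) are dropped\n",
"val_aligned = val_enc.reindex(columns=train_enc.columns, fill_value=0)\n",
"\n",
"print(train_enc.columns.tolist())\n",
"print(val_aligned.columns.tolist())\n",
"```\n",
"\n",
"Aligning to the training columns keeps the feature matrix identical in shape across splits, which a fitted model generally requires at prediction time."
]
},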
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ручной синтез\n",
"Создание новых признаков на основе экспертных знаний и логики предметной области. К примеру, для данных о продаже домов можно создать признак цена за квадратный фут."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"# Manual feature synthesis: price per square foot\n",
"# pandas aligns the result on the row index, so each split receives the values for its own rows\n",
"# Note: this feature is derived from the target ('price'), so it must not be used as an input when predicting price\n",
"train_data_encoded['price_per_sqft'] = df['price'] / df['sqft_living']\n",
"val_data_encoded['price_per_sqft'] = df['price'] / df['sqft_living']\n",
"test_data_encoded['price_per_sqft'] = df['price'] / df['sqft_living']\n",
"\n",
"# The same feature for the full dataset\n",
"df_encoded['price_per_sqft'] = df_encoded['price'] / df_encoded['sqft_living']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Feature scaling transforms numerical features so that they share a common scale. This matters for many machine learning algorithms that are sensitive to feature scale, such as regularized linear regression, support vector machines (SVM), and neural networks."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import StandardScaler, MinMaxScaler\n",
"\n",
"# Numerical features to scale (an illustrative subset)\n",
"numerical_features = ['bedrooms', 'sqft_living']\n",
"\n",
"# Fit the scaler on the training set only, then apply it to the validation and test sets\n",
"scaler = StandardScaler()\n",
"train_data_encoded[numerical_features] = scaler.fit_transform(train_data_encoded[numerical_features])\n",
"val_data_encoded[numerical_features] = scaler.transform(val_data_encoded[numerical_features])\n",
"test_data_encoded[numerical_features] = scaler.transform(test_data_encoded[numerical_features])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feature engineering with the Featuretools framework"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" id price bedrooms bathrooms sqft_living sqft_lot \\\n",
"9876 1219000473 164950.0 -0.395263 1.75 -0.555396 15330 \n",
"14982 6308000010 585000.0 -0.395263 2.50 0.238192 5089 \n",
"1464 3630120700 757000.0 -0.395263 3.25 1.230177 5283 \n",
"19209 1901600090 359000.0 1.752138 1.75 -0.147580 6654 \n",
"2039 3395040550 320000.0 -0.395263 2.50 -0.599484 2890 \n",
"... ... ... ... ... ... ... \n",
"13184 1523049207 220000.0 0.678437 2.00 -0.412109 8043 \n",
"5759 1954420170 580000.0 -0.395263 2.50 0.083883 7484 \n",
"8433 1721801010 225000.0 -0.395263 1.00 -0.312911 6120 \n",
"10253 2422049104 85000.0 -1.468964 1.00 -1.371028 9000 \n",
"11363 7701960990 870000.0 0.678437 2.50 1.230177 14565 \n",
"\n",
" floors grade sqft_above sqft_basement ... view_2 view_3 view_4 \\\n",
"9876 1.0 7 1080 490 ... False False False \n",
"14982 2.0 9 2290 0 ... False False False \n",
"1464 2.0 9 3190 0 ... False False False \n",
"19209 1.5 7 1940 0 ... False False False \n",
"2039 2.0 7 1530 0 ... False False False \n",
"... ... ... ... ... ... ... ... ... \n",
"13184 1.0 7 850 850 ... False False False \n",
"5759 2.0 8 2150 0 ... False False False \n",
"8433 1.0 6 1790 0 ... False False False \n",
"10253 1.0 6 830 0 ... False False False \n",
"11363 2.0 11 3190 0 ... False False False \n",
"\n",
" condition_1 condition_2 condition_3 condition_4 condition_5 sqtf \\\n",
"9876 False False True False False 0 \n",
"14982 False False True False False 0 \n",
"1464 False False True False False 1 \n",
"19209 False False False True False 0 \n",
"2039 False False True False False 0 \n",
"... ... ... ... ... ... ... \n",
"13184 False False True False False 0 \n",
"5759 False False True False False 0 \n",
"8433 False False True False False 0 \n",
"10253 False False True False False 0 \n",
"11363 False False True False False 1 \n",
"\n",
" price_per_sqft \n",
"9876 105.063694 \n",
"14982 255.458515 \n",
"1464 237.304075 \n",
"19209 185.051546 \n",
"2039 209.150327 \n",
"... ... \n",
"13184 129.411765 \n",
"5759 269.767442 \n",
"8433 125.698324 \n",
"10253 102.409639 \n",
"11363 272.727273 \n",
"\n",
"[224 rows x 400 columns]\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"e:\\MII\\laboratory\\mai\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" price bedrooms bathrooms sqft_living sqft_lot floors \\\n",
"id \n",
"7129300520 221900.0 3 1.00 1180 5650 1.0 \n",
"6414100192 538000.0 3 2.25 2570 7242 2.0 \n",
"5631500400 180000.0 2 1.00 770 10000 1.0 \n",
"2487200875 604000.0 4 3.00 1960 5000 1.0 \n",
"1954400510 510000.0 3 2.00 1680 8080 1.0 \n",
"\n",
" grade sqft_above sqft_basement yr_built ... view_2 view_3 \\\n",
"id ... \n",
"7129300520 7 1180 0 1955 ... False False \n",
"6414100192 7 2170 400 1951 ... False False \n",
"5631500400 6 770 0 1933 ... False False \n",
"2487200875 7 1050 910 1965 ... False False \n",
"1954400510 8 1680 0 1987 ... False False \n",
"\n",
" view_4 condition_1 condition_2 condition_3 condition_4 \\\n",
"id \n",
"7129300520 False False False True False \n",
"6414100192 False False False True False \n",
"5631500400 False False False True False \n",
"2487200875 False False False False False \n",
"1954400510 False False False True False \n",
"\n",
" condition_5 sqtf price_per_sqft \n",
"id \n",
"7129300520 False 0 188.050847 \n",
"6414100192 False 0 209.338521 \n",
"5631500400 False 0 233.766234 \n",
"2487200875 True 0 308.163265 \n",
"1954400510 False 0 303.571429 \n",
"\n",
"[5 rows x 402 columns]\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"e:\\MII\\laboratory\\mai\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"e:\\MII\\laboratory\\mai\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
" warnings.warn(\n"
]
}
],
"source": [
"import featuretools as ft\n",
"\n",
"# Preprocessing: remove duplicate house ids, because the EntitySet index must be unique\n",
"df = df.drop_duplicates(subset='id')\n",
"\n",
"# Show the training rows that share an id with another row\n",
"duplicates = train_data_encoded[train_data_encoded['id'].duplicated(keep=False)]\n",
"print(duplicates)\n",
"\n",
"# Keep the first occurrence of every id\n",
"df_encoded = df_encoded.drop_duplicates(subset='id', keep='first')\n",
"\n",
"# Create an EntitySet and add the dataframe with houses\n",
"es = ft.EntitySet(id='house_data')\n",
"es = es.add_dataframe(dataframe_name='houses', dataframe=df_encoded, index='id')\n",
"\n",
"# Generate features with Deep Feature Synthesis (DFS)\n",
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='houses', max_depth=2)\n",
"\n",
"# Print the first 5 rows of the generated feature matrix\n",
"print(feature_matrix.head())\n",
"\n",
"\n",
"def build_entityset(frame):\n",
"    # One EntitySet per split; ids are deduplicated, keeping the first occurrence\n",
"    frame = frame.drop_duplicates(subset='id', keep='first')\n",
"    return ft.EntitySet(id='house_data').add_dataframe(dataframe_name='houses', dataframe=frame, index='id')\n",
"\n",
"\n",
"# Generate features on the training set\n",
"train_data_encoded = train_data_encoded.drop_duplicates(subset='id', keep='first')\n",
"es = build_entityset(train_data_encoded)\n",
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='houses', max_depth=2)\n",
"\n",
"# Apply the same feature definitions to the validation and test sets.\n",
"# Each split needs its own EntitySet: the training EntitySet holds no validation or test rows.\n",
"val_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=build_entityset(val_data_encoded))\n",
"test_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=build_entityset(test_data_encoded))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluating the quality of each feature set  \n",
"\n",
"*Predictive power.* Metrics: RMSE, MAE, R².  \n",
"\n",
"*Methods:* Train the model on the training set and evaluate it on the validation and test sets.  \n",
"\n",
"*Computation speed.* Methods: measure the time taken by feature generation and model training.  \n",
"\n",
"*Reliability.* Methods: cross-validation, analysis of the model's sensitivity to changes in the data.  \n",
"\n",
"*Correlation.* Methods: analysis of the feature correlation matrix, removal of multicollinear features.  \n",
"\n",
"*Coherence.* Methods: checking the logical connection between the features and the target variable, interpreting the model results. "
]
},
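{
"cell_type": "markdown",
"metadata": {},
"source": [
"The correlation criterion listed above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `drop_correlated`, the 0.95 threshold and the toy dataframe are hypothetical and are not part of the lab code.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"\n",
"def drop_correlated(features, threshold=0.95):\n",
"    '''Drop one column from every pair whose absolute correlation exceeds the threshold.'''\n",
"    corr = features.corr().abs()\n",
"    # Keep only the upper triangle so that each pair is inspected once\n",
"    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))\n",
"    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]\n",
"    return features.drop(columns=to_drop), to_drop\n",
"\n",
"\n",
"# Toy data: the second column repeats the first one in other units\n",
"rng = np.random.default_rng(0)\n",
"area = rng.uniform(500, 4000, size=200)\n",
"toy = pd.DataFrame({\n",
"    'sqft_living': area,\n",
"    'sqft_living_m2': area * 0.092903,\n",
"    'bedrooms': rng.integers(1, 6, size=200),\n",
"})\n",
"\n",
"reduced, dropped = drop_correlated(toy)\n",
"print(dropped)\n",
"```\n",
"\n",
"Of two strongly correlated columns the later one is dropped, so the column order decides which feature survives. A different rule, for example keeping the feature that correlates more with the target, may suit this task better."
]
},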
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model training time: 5.18 seconds\n",
"Mean squared error: 125198557176601739264.00\n"
]
}
],
"source": [
"import time\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error\n",
"\n",
"# Separate the features from the target variable\n",
"X = feature_matrix.drop('price', axis=1)\n",
"y = feature_matrix['price']\n",
"\n",
"# One-hot encode categorical variables (convert categorical columns to numeric ones)\n",
"X = pd.get_dummies(X, drop_first=True)\n",
"\n",
"# Fill missing values with the column median\n",
"X.fillna(X.median(), inplace=True)\n",
"\n",
"# Split the data into training and validation sets\n",
"X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"# Train the model\n",
"model = LinearRegression()\n",
"\n",
"# Start the timer\n",
"start_time = time.time()\n",
"model.fit(X_train, y_train)\n",
"\n",
"# Model training time\n",
"train_time = time.time() - start_time\n",
"\n",
"# Predict on the validation set and compute the mean squared error\n",
"predictions = model.predict(X_val)\n",
"mse = mean_squared_error(y_val, predictions)\n",
"\n",
"print(f'Model training time: {train_time:.2f} seconds')\n",
"print(f'Mean squared error: {mse:.2f}')\n"
]
},
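{
"cell_type": "markdown",
"metadata": {},
"source": [
"The mean squared error printed above is measured in squared price units, which makes it hard to read. Its square root (roughly 1.1e10 for the value above) is in price units and is far larger than any house price, which suggests that the linear model is unstable on this feature set. A minimal sketch of RMSE, MAE and R² computed directly with NumPy follows; the helper name `regression_metrics` and the three price pairs are made up for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"\n",
"def regression_metrics(y_true, y_pred):\n",
"    '''Return MSE, RMSE, MAE and R² for two sequences of equal length.'''\n",
"    y_true = np.asarray(y_true, dtype=float)\n",
"    y_pred = np.asarray(y_pred, dtype=float)\n",
"    errors = y_true - y_pred\n",
"    mse = np.mean(errors ** 2)\n",
"    ss_res = np.sum(errors ** 2)                     # residual sum of squares\n",
"    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares\n",
"    return {\n",
"        'mse': mse,\n",
"        'rmse': np.sqrt(mse),\n",
"        'mae': np.mean(np.abs(errors)),\n",
"        'r2': 1.0 - ss_res / ss_tot,\n",
"    }\n",
"\n",
"\n",
"# Made-up actual and predicted prices\n",
"metrics = regression_metrics([200000, 300000, 400000], [210000, 290000, 400000])\n",
"print(metrics)\n",
"```\n",
"\n",
"RMSE and MAE are both expressed in the units of the target, so they can be compared with typical house prices directly."
]
},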
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RMSE: 17870.38470608543\n",
"R²: 0.9973762630189477\n",
"MAE: 5924.569330616996 \n",
"\n",
"Cross-validation RMSE: 34577.766841359786 \n",
"\n",
"Train RMSE: 12930.759734777745\n",
"Train R²: 0.9987426148033223\n",
"Train MAE: 2495.3698282637165\n",
"\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.metrics import r2_score, mean_absolute_error\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"# X and y were built in the previous cell from the training feature matrix\n",
"# (one-hot encoded, missing values filled with the median).\n",
"# Hold out 20% of the rows as a test set.\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"# Choose and train the model\n",
"model = RandomForestRegressor(random_state=42)\n",
"model.fit(X_train, y_train)\n",
"\n",
"# Predict on the test set and evaluate\n",
"y_pred = model.predict(X_test)\n",
"\n",
"# RMSE is the square root of MSE (the 'squared' argument is deprecated since scikit-learn 1.4)\n",
"rmse = mean_squared_error(y_test, y_pred) ** 0.5\n",
"r2 = r2_score(y_test, y_pred)\n",
"mae = mean_absolute_error(y_test, y_pred)\n",
"\n",
"print()\n",
"print(f\"RMSE: {rmse}\")\n",
"print(f\"R²: {r2}\")\n",
"print(f\"MAE: {mae} \\n\")\n",
"\n",
"# Cross-validation on the training part: square root of the mean MSE over 5 folds\n",
"scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')\n",
"rmse_cv = (-scores.mean())**0.5\n",
"print(f\"Cross-validation RMSE: {rmse_cv} \\n\")\n",
"\n",
"# Feature importances of the fitted model\n",
"feature_importances = model.feature_importances_\n",
"feature_names = X_train.columns\n",
"\n",
"# Overfitting check: the same metrics on the training data\n",
"y_train_pred = model.predict(X_train)\n",
"\n",
"rmse_train = mean_squared_error(y_train, y_train_pred) ** 0.5\n",
"r2_train = r2_score(y_train, y_train_pred)\n",
"mae_train = mean_absolute_error(y_train, y_train_pred)\n",
"\n",
"print(f\"Train RMSE: {rmse_train}\")\n",
"print(f\"Train R²: {r2_train}\")\n",
"print(f\"Train MAE: {mae_train}\")\n",
"print()\n",
"\n",
"# Plot actual against predicted prices; the dashed line marks a perfect prediction\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(y_test, y_pred, alpha=0.5)\n",
"plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)\n",
"plt.xlabel('Actual price')\n",
"plt.ylabel('Predicted price')\n",
"plt.title('Actual vs. predicted price')\n",
"plt.show()"
]
},
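{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the cell above the cross-validation RMSE is the square root of the mean fold MSE. This is not the same as the mean of the per-fold RMSE values, and it is never smaller than that mean. A small sketch with made-up fold errors (the `fold_mse` values are hypothetical):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical MSE values from three cross-validation folds\n",
"fold_mse = np.array([1.0e8, 4.0e8, 9.0e8])\n",
"\n",
"# Variant used in the notebook: square root of the mean MSE\n",
"rmse_of_mean = np.sqrt(fold_mse.mean())\n",
"\n",
"# Alternative: mean of the per-fold RMSE values\n",
"mean_of_rmse = np.sqrt(fold_mse).mean()\n",
"\n",
"print(round(rmse_of_mean, 2), round(mean_of_rmse, 2))\n",
"```\n",
"\n",
"The first variant gives more weight to the worst folds, so a single difficult fold can noticeably raise the reported value."
]
},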
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Conclusions and summary  \n",
"\n",
"**The random forest model (RandomForestRegressor)** showed satisfactory results when predicting house prices. The quality metrics suggest that the model is not severely overfitted and can be used for practical purposes, although the gap between the training, test, and cross-validation errors should be kept in mind.  \n",
"\n",
"*Prediction accuracy:* The model reaches a high R² on the test set (0.9974; 0.9987 on the training set), which means it explains most of the variance of the target (house price). However, the errors remain noticeable in absolute terms: test RMSE is about 17,870 and test MAE about 5,925 (training RMSE about 12,931 and training MAE about 2,495). RMSE is roughly three times MAE, so some predictions miss by a large amount, most likely for houses with very high or very low prices. Note also that the feature set includes `price_per_sqft`, which is computed from the target, so these scores are optimistic compared with what can be expected for houses whose price is unknown.  \n",
"\n",
"*Overfitting:* Test RMSE (about 17,870) is roughly 38% higher than training RMSE (about 12,931), while R² is almost the same on both sets, so the overfitting is moderate rather than severe. This gap should be monitored when new features are added or the model becomes more complex, to avoid fitting the training data too closely.  \n",
"\n",
"*Cross-validation:* The cross-validation RMSE (about 34,578) is roughly twice the test RMSE (about 17,870). This points to noticeable instability of the model across different subsets of the data. To make the model more robust, further hyperparameter tuning is worth doing.  \n",
"\n",
"*Recommendations:* Attention should be paid to additional processing of categorical features, to improving the feature engineering, and to possible optimization of the model (for example, through hyperparameter search) in order to improve prediction accuracy on extreme values.\n",
"\n",
"That seems to be everything :)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "mai",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}