{
|
|||
|
"cells": [
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"# Getting started\n",
"\n",
"*Assignment variant:* House sales in King County (variant 6) \n",
"First, we define the business goals and the technical goals of the project. \n",
"\n",
"### Business goals: \n",
"1. Optimize the house valuation process \n",
"\n",
"**Statement:** Build a model that automatically and accurately estimates the price of a house from its characteristics (such as living area, number of rooms, condition, and location). \n",
"**Goal:** Improve the accuracy of property valuations for agencies and prospective buyers, and cut the time and cost of each valuation by predicting prices more precisely. \n",
"\n",
"**Key performance indicators (KPIs):** \n",
"*Prediction accuracy* (RMSE): Bring the root mean square error below 10% of the actual price; RMSE is used because it penalizes large deviations in the estimate.\n",
"*Mean absolute error* (MAE): The model should predict prices with minimal error, bringing MAE down to 5% of the actual price or less even when large deviations in the estimates are taken into account. \n",
"*Valuation speed:* Shorten the time needed to value a house so that results arrive sooner.\n",
"*Availability:* Deploy the model in a production system used by real estate agents.\n",
"\n",
"2. Optimize pre-sale renovation costs \n",
"\n",
"**Statement:** Build a model that helps home sellers and real estate agencies identify which improvements or renovations yield the largest increase in a house's price at the lowest cost. This helps avoid unnecessary spending and maximize the profit from the sale. \n",
"**Goal:** Reduce pre-sale renovation costs, recommend only the improvements that raise the property's value the most, and shorten the time needed to decide on renovations. \n",
"\n",
"**Key performance indicators (KPIs):** \n",
"*Return on investment* (ROI): Sellers should gain at least 20% on every dollar invested in renovation. For example, if $10,000 is spent on repairs, the house price should rise by at least $12,000. \n",
"*Average renovation cost per deal* (CPA): Reduce renovation spending by eliminating unnecessary expenses, for example bringing costs down to $5,000 per house while keeping the largest possible price increase. \n",
"*Faster decision-making:* The model should cut the time needed to evaluate renovation options to a few minutes, which speeds up preparing the house for sale.\n",
"\n",
"### Technical goals of the project for each business goal\n",
"\n",
"1. **Build a model for accurate house price estimation.** \n",
"*Data collection and preparation:* Clean the data of missing values, duplicates, and outliers (anomalous values in the price, sqft_living, and bedrooms columns). Convert the categorical variables (view, condition, waterfront) to numeric form with one-hot encoding. Scale the numeric features (normalization or standardization) so that they share a common scale. Split the dataset into training, validation, and test sets to prevent data leakage and overfitting. \n",
"*Model development and training:* Experiment with several machine learning algorithms (linear regression, decision trees, random forest, gradient boosting) for predicting property prices. Train the models on the training set and measure quality with RMSE (Root Mean Square Error) and MAE (Mean Absolute Error). Evaluate the models on the test set, minimizing MAE and RMSE to obtain accurate price predictions. \n",
"*Model deployment:* Integrate the model into an existing system or develop an API that gives real estate agencies and private sellers access to it. Build a web application or mobile interface for using the model conveniently and getting predictions in real time.\n",
"\n",
"2. **Build a model for renovation recommendations.** \n",
"*Data collection and preparation:* Collect data on the types and costs of renovations and on their effect on the final house price. Clean the renovation data, removing inaccurate or incomplete records. Convert categorical renovation features (for example, roof replacement or window replacement) to numeric form with one-hot encoding. Split the data into training and test sets. \n",
"*Model development and training:* Use regression models (linear regression, random forest) to predict and model how specific renovations raise a property's value. Evaluate two metrics: CPA (Cost Per Acquisition), the renovation cost per sale, and ROI (Return on Investment), the price increase after renovation relative to its cost. Train the model to predict which changes bring the greatest benefit to the house price. \n",
"*Model deployment:* Build an interface where users enter information about the current state of a house and receive renovation recommendations with an ROI estimate. Build a recommender system for property sellers that proposes a set of renovations.\n"
]
|
|||
|
},
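{
"cell_type": "markdown",
"metadata": {},
"source": [
"The RMSE and MAE targets are stated as percentages of the price. The next cell is a minimal sketch of how such relative errors could be computed. Dividing by the mean actual price is an assumption (the text does not say how the percentage is normalized), and `relative_errors` and the toy prices are hypothetical, not part of the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def relative_errors(y_true, y_pred):\n",
"    # Hypothetical helper: RMSE and MAE as percentages of the mean actual price\n",
"    y_true = np.asarray(y_true, dtype=float)\n",
"    y_pred = np.asarray(y_pred, dtype=float)\n",
"    residuals = y_pred - y_true\n",
"    rmse = np.sqrt(np.mean(residuals ** 2))\n",
"    mae = np.mean(np.abs(residuals))\n",
"    mean_price = y_true.mean()\n",
"    return 100 * rmse / mean_price, 100 * mae / mean_price\n",
"\n",
"# Toy prices for illustration only (not taken from the dataset)\n",
"actual = [300_000, 450_000, 600_000]\n",
"predicted = [310_000, 430_000, 640_000]\n",
"rmse_pct, mae_pct = relative_errors(actual, predicted)\n",
"print(f'RMSE: {rmse_pct:.1f}% of mean price, MAE: {mae_pct:.1f}% of mean price')"
]
},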
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 23,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',\n",
|
|||
|
" 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',\n",
|
|||
|
" 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',\n",
|
|||
|
" 'lat', 'long', 'sqft_living15', 'sqft_lot15'],\n",
|
|||
|
" dtype='object')\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"import matplotlib.ticker as ticker\n",
|
|||
|
"import seaborn as sns\n",
|
|||
|
"\n",
"# Load the dataset into a DataFrame\n",
"df = pd.read_csv(\".//static//csv//kc_house_data.csv\")\n",
|
|||
|
"print(df.columns)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 24,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>id</th>\n",
|
|||
|
" <th>date</th>\n",
|
|||
|
" <th>price</th>\n",
|
|||
|
" <th>bedrooms</th>\n",
|
|||
|
" <th>bathrooms</th>\n",
|
|||
|
" <th>sqft_living</th>\n",
|
|||
|
" <th>sqft_lot</th>\n",
|
|||
|
" <th>floors</th>\n",
|
|||
|
" <th>waterfront</th>\n",
|
|||
|
" <th>view</th>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <th>grade</th>\n",
|
|||
|
" <th>sqft_above</th>\n",
|
|||
|
" <th>sqft_basement</th>\n",
|
|||
|
" <th>yr_built</th>\n",
|
|||
|
" <th>yr_renovated</th>\n",
|
|||
|
" <th>zipcode</th>\n",
|
|||
|
" <th>lat</th>\n",
|
|||
|
" <th>long</th>\n",
|
|||
|
" <th>sqft_living15</th>\n",
|
|||
|
" <th>sqft_lot15</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>7129300520</td>\n",
|
|||
|
" <td>20141013T000000</td>\n",
|
|||
|
" <td>221900.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>1.00</td>\n",
|
|||
|
" <td>1180</td>\n",
|
|||
|
" <td>5650</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>7</td>\n",
|
|||
|
" <td>1180</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1955</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98178</td>\n",
|
|||
|
" <td>47.5112</td>\n",
|
|||
|
" <td>-122.257</td>\n",
|
|||
|
" <td>1340</td>\n",
|
|||
|
" <td>5650</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>6414100192</td>\n",
|
|||
|
" <td>20141209T000000</td>\n",
|
|||
|
" <td>538000.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>2.25</td>\n",
|
|||
|
" <td>2570</td>\n",
|
|||
|
" <td>7242</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>7</td>\n",
|
|||
|
" <td>2170</td>\n",
|
|||
|
" <td>400</td>\n",
|
|||
|
" <td>1951</td>\n",
|
|||
|
" <td>1991</td>\n",
|
|||
|
" <td>98125</td>\n",
|
|||
|
" <td>47.7210</td>\n",
|
|||
|
" <td>-122.319</td>\n",
|
|||
|
" <td>1690</td>\n",
|
|||
|
" <td>7639</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>5631500400</td>\n",
|
|||
|
" <td>20150225T000000</td>\n",
|
|||
|
" <td>180000.0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>1.00</td>\n",
|
|||
|
" <td>770</td>\n",
|
|||
|
" <td>10000</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>6</td>\n",
|
|||
|
" <td>770</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1933</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98028</td>\n",
|
|||
|
" <td>47.7379</td>\n",
|
|||
|
" <td>-122.233</td>\n",
|
|||
|
" <td>2720</td>\n",
|
|||
|
" <td>8062</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>2487200875</td>\n",
|
|||
|
" <td>20141209T000000</td>\n",
|
|||
|
" <td>604000.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>3.00</td>\n",
|
|||
|
" <td>1960</td>\n",
|
|||
|
" <td>5000</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>7</td>\n",
|
|||
|
" <td>1050</td>\n",
|
|||
|
" <td>910</td>\n",
|
|||
|
" <td>1965</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98136</td>\n",
|
|||
|
" <td>47.5208</td>\n",
|
|||
|
" <td>-122.393</td>\n",
|
|||
|
" <td>1360</td>\n",
|
|||
|
" <td>5000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>1954400510</td>\n",
|
|||
|
" <td>20150218T000000</td>\n",
|
|||
|
" <td>510000.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>2.00</td>\n",
|
|||
|
" <td>1680</td>\n",
|
|||
|
" <td>8080</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>8</td>\n",
|
|||
|
" <td>1680</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1987</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98074</td>\n",
|
|||
|
" <td>47.6168</td>\n",
|
|||
|
" <td>-122.045</td>\n",
|
|||
|
" <td>1800</td>\n",
|
|||
|
" <td>7503</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>5 rows × 21 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" id date price bedrooms bathrooms sqft_living \\\n",
|
|||
|
"0 7129300520 20141013T000000 221900.0 3 1.00 1180 \n",
|
|||
|
"1 6414100192 20141209T000000 538000.0 3 2.25 2570 \n",
|
|||
|
"2 5631500400 20150225T000000 180000.0 2 1.00 770 \n",
|
|||
|
"3 2487200875 20141209T000000 604000.0 4 3.00 1960 \n",
|
|||
|
"4 1954400510 20150218T000000 510000.0 3 2.00 1680 \n",
|
|||
|
"\n",
|
|||
|
" sqft_lot floors waterfront view ... grade sqft_above sqft_basement \\\n",
|
|||
|
"0 5650 1.0 0 0 ... 7 1180 0 \n",
|
|||
|
"1 7242 2.0 0 0 ... 7 2170 400 \n",
|
|||
|
"2 10000 1.0 0 0 ... 6 770 0 \n",
|
|||
|
"3 5000 1.0 0 0 ... 7 1050 910 \n",
|
|||
|
"4 8080 1.0 0 0 ... 8 1680 0 \n",
|
|||
|
"\n",
|
|||
|
" yr_built yr_renovated zipcode lat long sqft_living15 \\\n",
|
|||
|
"0 1955 0 98178 47.5112 -122.257 1340 \n",
|
|||
|
"1 1951 1991 98125 47.7210 -122.319 1690 \n",
|
|||
|
"2 1933 0 98028 47.7379 -122.233 2720 \n",
|
|||
|
"3 1965 0 98136 47.5208 -122.393 1360 \n",
|
|||
|
"4 1987 0 98074 47.6168 -122.045 1800 \n",
|
|||
|
"\n",
|
|||
|
" sqft_lot15 \n",
|
|||
|
"0 5650 \n",
|
|||
|
"1 7639 \n",
|
|||
|
"2 8062 \n",
|
|||
|
"3 5000 \n",
|
|||
|
"4 7503 \n",
|
|||
|
"\n",
|
|||
|
"[5 rows x 21 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 24,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"# Preview the first rows\n",
"df.head()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 25,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>id</th>\n",
|
|||
|
" <th>price</th>\n",
|
|||
|
" <th>bedrooms</th>\n",
|
|||
|
" <th>bathrooms</th>\n",
|
|||
|
" <th>sqft_living</th>\n",
|
|||
|
" <th>sqft_lot</th>\n",
|
|||
|
" <th>floors</th>\n",
|
|||
|
" <th>waterfront</th>\n",
|
|||
|
" <th>view</th>\n",
|
|||
|
" <th>condition</th>\n",
|
|||
|
" <th>grade</th>\n",
|
|||
|
" <th>sqft_above</th>\n",
|
|||
|
" <th>sqft_basement</th>\n",
|
|||
|
" <th>yr_built</th>\n",
|
|||
|
" <th>yr_renovated</th>\n",
|
|||
|
" <th>zipcode</th>\n",
|
|||
|
" <th>lat</th>\n",
|
|||
|
" <th>long</th>\n",
|
|||
|
" <th>sqft_living15</th>\n",
|
|||
|
" <th>sqft_lot15</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>count</th>\n",
|
|||
|
" <td>2.161300e+04</td>\n",
|
|||
|
" <td>2.161300e+04</td>\n",
|
|||
|
" <td>21613.000000</td>\n",
|
|||
|
" <td>21613.000000</td>\n",
|
|||
|
" <td>21613.000000</td>\n",
|
|||
|
" <td>2.161300e+04</td>\n",
|
|||
|
" <td>21613.000000</td>\n",
|
|||
|
" <td>21613.000000</td>\n",
|
|||
|
" <td>21613.000000</td>\n",
|
|||
|
" <td>21613.000000</td>\n",
|
|||
|
" <td>21613.000000</td>\n",
|
|||
|
" <td>21613.000000</td>\n",
|
|||
|
" <td>21613.000000</td>\n",
|
|||
|
" <td>21613.000000</td>\n",
|
|||
|
" <td>21613.000000</td>\n",
|
|||
|
" <td>21613.000000</td>\n",
|
|||
|
" <td>21613.000000</td>\n",
|
|||
|
" <td>21613.000000</td>\n",
|
|||
|
" <td>21613.000000</td>\n",
|
|||
|
" <td>21613.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>mean</th>\n",
|
|||
|
" <td>4.580302e+09</td>\n",
|
|||
|
" <td>5.400881e+05</td>\n",
|
|||
|
" <td>3.370842</td>\n",
|
|||
|
" <td>2.114757</td>\n",
|
|||
|
" <td>2079.899736</td>\n",
|
|||
|
" <td>1.510697e+04</td>\n",
|
|||
|
" <td>1.494309</td>\n",
|
|||
|
" <td>0.007542</td>\n",
|
|||
|
" <td>0.234303</td>\n",
|
|||
|
" <td>3.409430</td>\n",
|
|||
|
" <td>7.656873</td>\n",
|
|||
|
" <td>1788.390691</td>\n",
|
|||
|
" <td>291.509045</td>\n",
|
|||
|
" <td>1971.005136</td>\n",
|
|||
|
" <td>84.402258</td>\n",
|
|||
|
" <td>98077.939805</td>\n",
|
|||
|
" <td>47.560053</td>\n",
|
|||
|
" <td>-122.213896</td>\n",
|
|||
|
" <td>1986.552492</td>\n",
|
|||
|
" <td>12768.455652</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>std</th>\n",
|
|||
|
" <td>2.876566e+09</td>\n",
|
|||
|
" <td>3.671272e+05</td>\n",
|
|||
|
" <td>0.930062</td>\n",
|
|||
|
" <td>0.770163</td>\n",
|
|||
|
" <td>918.440897</td>\n",
|
|||
|
" <td>4.142051e+04</td>\n",
|
|||
|
" <td>0.539989</td>\n",
|
|||
|
" <td>0.086517</td>\n",
|
|||
|
" <td>0.766318</td>\n",
|
|||
|
" <td>0.650743</td>\n",
|
|||
|
" <td>1.175459</td>\n",
|
|||
|
" <td>828.090978</td>\n",
|
|||
|
" <td>442.575043</td>\n",
|
|||
|
" <td>29.373411</td>\n",
|
|||
|
" <td>401.679240</td>\n",
|
|||
|
" <td>53.505026</td>\n",
|
|||
|
" <td>0.138564</td>\n",
|
|||
|
" <td>0.140828</td>\n",
|
|||
|
" <td>685.391304</td>\n",
|
|||
|
" <td>27304.179631</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>min</th>\n",
|
|||
|
" <td>1.000102e+06</td>\n",
|
|||
|
" <td>7.500000e+04</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>290.000000</td>\n",
|
|||
|
" <td>5.200000e+02</td>\n",
|
|||
|
" <td>1.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>1.000000</td>\n",
|
|||
|
" <td>1.000000</td>\n",
|
|||
|
" <td>290.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>1900.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>98001.000000</td>\n",
|
|||
|
" <td>47.155900</td>\n",
|
|||
|
" <td>-122.519000</td>\n",
|
|||
|
" <td>399.000000</td>\n",
|
|||
|
" <td>651.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>25%</th>\n",
|
|||
|
" <td>2.123049e+09</td>\n",
|
|||
|
" <td>3.219500e+05</td>\n",
|
|||
|
" <td>3.000000</td>\n",
|
|||
|
" <td>1.750000</td>\n",
|
|||
|
" <td>1427.000000</td>\n",
|
|||
|
" <td>5.040000e+03</td>\n",
|
|||
|
" <td>1.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>3.000000</td>\n",
|
|||
|
" <td>7.000000</td>\n",
|
|||
|
" <td>1190.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>1951.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>98033.000000</td>\n",
|
|||
|
" <td>47.471000</td>\n",
|
|||
|
" <td>-122.328000</td>\n",
|
|||
|
" <td>1490.000000</td>\n",
|
|||
|
" <td>5100.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>50%</th>\n",
|
|||
|
" <td>3.904930e+09</td>\n",
|
|||
|
" <td>4.500000e+05</td>\n",
|
|||
|
" <td>3.000000</td>\n",
|
|||
|
" <td>2.250000</td>\n",
|
|||
|
" <td>1910.000000</td>\n",
|
|||
|
" <td>7.618000e+03</td>\n",
|
|||
|
" <td>1.500000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>3.000000</td>\n",
|
|||
|
" <td>7.000000</td>\n",
|
|||
|
" <td>1560.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>1975.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>98065.000000</td>\n",
|
|||
|
" <td>47.571800</td>\n",
|
|||
|
" <td>-122.230000</td>\n",
|
|||
|
" <td>1840.000000</td>\n",
|
|||
|
" <td>7620.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>75%</th>\n",
|
|||
|
" <td>7.308900e+09</td>\n",
|
|||
|
" <td>6.450000e+05</td>\n",
|
|||
|
" <td>4.000000</td>\n",
|
|||
|
" <td>2.500000</td>\n",
|
|||
|
" <td>2550.000000</td>\n",
|
|||
|
" <td>1.068800e+04</td>\n",
|
|||
|
" <td>2.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>4.000000</td>\n",
|
|||
|
" <td>8.000000</td>\n",
|
|||
|
" <td>2210.000000</td>\n",
|
|||
|
" <td>560.000000</td>\n",
|
|||
|
" <td>1997.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>98118.000000</td>\n",
|
|||
|
" <td>47.678000</td>\n",
|
|||
|
" <td>-122.125000</td>\n",
|
|||
|
" <td>2360.000000</td>\n",
|
|||
|
" <td>10083.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>max</th>\n",
|
|||
|
" <td>9.900000e+09</td>\n",
|
|||
|
" <td>7.700000e+06</td>\n",
|
|||
|
" <td>33.000000</td>\n",
|
|||
|
" <td>8.000000</td>\n",
|
|||
|
" <td>13540.000000</td>\n",
|
|||
|
" <td>1.651359e+06</td>\n",
|
|||
|
" <td>3.500000</td>\n",
|
|||
|
" <td>1.000000</td>\n",
|
|||
|
" <td>4.000000</td>\n",
|
|||
|
" <td>5.000000</td>\n",
|
|||
|
" <td>13.000000</td>\n",
|
|||
|
" <td>9410.000000</td>\n",
|
|||
|
" <td>4820.000000</td>\n",
|
|||
|
" <td>2015.000000</td>\n",
|
|||
|
" <td>2015.000000</td>\n",
|
|||
|
" <td>98199.000000</td>\n",
|
|||
|
" <td>47.777600</td>\n",
|
|||
|
" <td>-121.315000</td>\n",
|
|||
|
" <td>6210.000000</td>\n",
|
|||
|
" <td>871200.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" id price bedrooms bathrooms sqft_living \\\n",
|
|||
|
"count 2.161300e+04 2.161300e+04 21613.000000 21613.000000 21613.000000 \n",
|
|||
|
"mean 4.580302e+09 5.400881e+05 3.370842 2.114757 2079.899736 \n",
|
|||
|
"std 2.876566e+09 3.671272e+05 0.930062 0.770163 918.440897 \n",
|
|||
|
"min 1.000102e+06 7.500000e+04 0.000000 0.000000 290.000000 \n",
|
|||
|
"25% 2.123049e+09 3.219500e+05 3.000000 1.750000 1427.000000 \n",
|
|||
|
"50% 3.904930e+09 4.500000e+05 3.000000 2.250000 1910.000000 \n",
|
|||
|
"75% 7.308900e+09 6.450000e+05 4.000000 2.500000 2550.000000 \n",
|
|||
|
"max 9.900000e+09 7.700000e+06 33.000000 8.000000 13540.000000 \n",
|
|||
|
"\n",
|
|||
|
" sqft_lot floors waterfront view condition \\\n",
|
|||
|
"count 2.161300e+04 21613.000000 21613.000000 21613.000000 21613.000000 \n",
|
|||
|
"mean 1.510697e+04 1.494309 0.007542 0.234303 3.409430 \n",
|
|||
|
"std 4.142051e+04 0.539989 0.086517 0.766318 0.650743 \n",
|
|||
|
"min 5.200000e+02 1.000000 0.000000 0.000000 1.000000 \n",
|
|||
|
"25% 5.040000e+03 1.000000 0.000000 0.000000 3.000000 \n",
|
|||
|
"50% 7.618000e+03 1.500000 0.000000 0.000000 3.000000 \n",
|
|||
|
"75% 1.068800e+04 2.000000 0.000000 0.000000 4.000000 \n",
|
|||
|
"max 1.651359e+06 3.500000 1.000000 4.000000 5.000000 \n",
|
|||
|
"\n",
|
|||
|
" grade sqft_above sqft_basement yr_built yr_renovated \\\n",
|
|||
|
"count 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 \n",
|
|||
|
"mean 7.656873 1788.390691 291.509045 1971.005136 84.402258 \n",
|
|||
|
"std 1.175459 828.090978 442.575043 29.373411 401.679240 \n",
|
|||
|
"min 1.000000 290.000000 0.000000 1900.000000 0.000000 \n",
|
|||
|
"25% 7.000000 1190.000000 0.000000 1951.000000 0.000000 \n",
|
|||
|
"50% 7.000000 1560.000000 0.000000 1975.000000 0.000000 \n",
|
|||
|
"75% 8.000000 2210.000000 560.000000 1997.000000 0.000000 \n",
|
|||
|
"max 13.000000 9410.000000 4820.000000 2015.000000 2015.000000 \n",
|
|||
|
"\n",
|
|||
|
" zipcode lat long sqft_living15 sqft_lot15 \n",
|
|||
|
"count 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 \n",
|
|||
|
"mean 98077.939805 47.560053 -122.213896 1986.552492 12768.455652 \n",
|
|||
|
"std 53.505026 0.138564 0.140828 685.391304 27304.179631 \n",
|
|||
|
"min 98001.000000 47.155900 -122.519000 399.000000 651.000000 \n",
|
|||
|
"25% 98033.000000 47.471000 -122.328000 1490.000000 5100.000000 \n",
|
|||
|
"50% 98065.000000 47.571800 -122.230000 1840.000000 7620.000000 \n",
|
|||
|
"75% 98118.000000 47.678000 -122.125000 2360.000000 10083.000000 \n",
|
|||
|
"max 98199.000000 47.777600 -121.315000 6210.000000 871200.000000 "
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 25,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"# Summary statistics for the numeric columns\n",
"df.describe()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 26,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"id 0\n",
|
|||
|
"date 0\n",
|
|||
|
"price 0\n",
|
|||
|
"bedrooms 0\n",
|
|||
|
"bathrooms 0\n",
|
|||
|
"sqft_living 0\n",
|
|||
|
"sqft_lot 0\n",
|
|||
|
"floors 0\n",
|
|||
|
"waterfront 0\n",
|
|||
|
"view 0\n",
|
|||
|
"condition 0\n",
|
|||
|
"grade 0\n",
|
|||
|
"sqft_above 0\n",
|
|||
|
"sqft_basement 0\n",
|
|||
|
"yr_built 0\n",
|
|||
|
"yr_renovated 0\n",
|
|||
|
"zipcode 0\n",
|
|||
|
"lat 0\n",
|
|||
|
"long 0\n",
|
|||
|
"sqft_living15 0\n",
|
|||
|
"sqft_lot15 0\n",
|
|||
|
"dtype: int64\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"id False\n",
|
|||
|
"date False\n",
|
|||
|
"price False\n",
|
|||
|
"bedrooms False\n",
|
|||
|
"bathrooms False\n",
|
|||
|
"sqft_living False\n",
|
|||
|
"sqft_lot False\n",
|
|||
|
"floors False\n",
|
|||
|
"waterfront False\n",
|
|||
|
"view False\n",
|
|||
|
"condition False\n",
|
|||
|
"grade False\n",
|
|||
|
"sqft_above False\n",
|
|||
|
"sqft_basement False\n",
|
|||
|
"yr_built False\n",
|
|||
|
"yr_renovated False\n",
|
|||
|
"zipcode False\n",
|
|||
|
"lat False\n",
|
|||
|
"long False\n",
|
|||
|
"sqft_living15 False\n",
|
|||
|
"sqft_lot15 False\n",
|
|||
|
"dtype: bool"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 26,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"# Percentage of missing values per feature\n",
"for col in df.columns:\n",
"    null_rate = df[col].isnull().sum() / len(df) * 100\n",
"    if null_rate > 0:\n",
"        print(f'{col}: {null_rate:.2f}% missing')\n",
"\n",
"# Count of missing values per column\n",
"print(df.isnull().sum())\n",
"\n",
"# Whether each column contains any missing value\n",
"df.isnull().any()"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"No missing values in any column :)"
]
|
|||
|
},
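{
"cell_type": "markdown",
"metadata": {},
"source": [
"Missing values are only one part of the cleaning step; the summary statistics also show extreme values (for example, a maximum of 33 bedrooms). The next cell is a minimal sketch of an outlier check, assuming the common 1.5 * IQR rule (the notebook does not prescribe a specific rule). `iqr_outlier_mask` is a hypothetical helper, and `df` is the DataFrame loaded earlier in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def iqr_outlier_mask(series, k=1.5):\n",
"    # True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]\n",
"    q1, q3 = series.quantile([0.25, 0.75])\n",
"    iqr = q3 - q1\n",
"    return (series < q1 - k * iqr) | (series > q3 + k * iqr)\n",
"\n",
"# Count flagged rows in the columns named in the technical goals\n",
"for col in ['price', 'sqft_living', 'bedrooms']:\n",
"    print(col, int(iqr_outlier_mask(df[col]).sum()))"
]
},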
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Split the data into training, validation, and test sets"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 27,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
"Training set size: 12967\n",
"Validation set size: 4323\n",
"Test set size: 4323\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Hold out 20% of the data as the test set\n",
"train_val_data, test_data = train_test_split(df, test_size=0.2, random_state=42)\n",
"\n",
"# Split the remaining 80% again: 25% of it (20% of the full data) becomes the validation set,\n",
"# so the three sets are disjoint and the overall split is 60/20/20\n",
"train_data, val_data = train_test_split(train_val_data, test_size=0.25, random_state=42)\n",
"\n",
"print(f'Training set size: {len(train_data)}')\n",
"print(f'Validation set size: {len(val_data)}')\n",
"print(f'Test set size: {len(test_data)}')"
]
|
|||
|
},
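{
"cell_type": "markdown",
"metadata": {},
"source": [
"The technical goals call for one-hot encoding of view, condition, and waterfront and for scaling the numeric features. The next cell is a minimal sketch of that step with scikit-learn, fitted on the training set only so that no statistics leak from the validation data. The choice of numeric columns is an assumption made for illustration; `train_data` and `val_data` come from the split cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
"\n",
"categorical = ['view', 'condition', 'waterfront']\n",
"# Assumed subset of numeric features, chosen only for illustration\n",
"numeric = ['sqft_living', 'sqft_lot', 'bedrooms', 'bathrooms']\n",
"\n",
"preprocess = ColumnTransformer([\n",
"    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),\n",
"    ('num', StandardScaler(), numeric),\n",
"])\n",
"\n",
"# Fit on the training set, then apply the same transformation to the validation set\n",
"X_train = preprocess.fit_transform(train_data[categorical + numeric])\n",
"X_val = preprocess.transform(val_data[categorical + numeric])\n",
"print(X_train.shape, X_val.shape)"
]
},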
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 28,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
"image/png": "(base64-encoded PNG data, truncated in the source)",
"text/plain": [
|
|||
|
"<Figure size 640x480 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "(base64-encoded PNG omitted: price distribution histogram)",
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 640x480 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "(base64-encoded PNG omitted: price distribution histogram)",
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 640x480 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Средняя цена в обучающей выборке: 537768.04794679\n",
|
|||
|
"Средняя цена в контрольной выборке: 549367.443673375\n",
|
|||
|
"Средняя цена в тестовой выборке: 549367.443673375\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"# Check how evenly the target variable (price) is distributed across the splits\n",
|
|||
|
"# Plot the price distribution of a split (histogram with a KDE curve)\n",
|
|||
|
"def plot_price_distribution(data, title):\n",
|
|||
|
" sns.histplot(data['price'], kde=True)\n",
|
|||
|
" plt.title(title)\n",
|
|||
|
" plt.xlabel('Цена')\n",
|
|||
|
" plt.ylabel('Частота')\n",
|
|||
|
" plt.show()\n",
|
|||
|
"\n",
|
|||
|
"plot_price_distribution(train_data, 'Распределение цены в обучающей выборке')\n",
|
|||
|
"plot_price_distribution(val_data, 'Распределение цены в контрольной выборке')\n",
|
|||
|
"plot_price_distribution(test_data, 'Распределение цены в тестовой выборке')\n",
|
|||
|
"\n",
|
|||
|
"# Compare the mean of the target variable (price) across the splits\n",
|
|||
|
"print(\"Средняя цена в обучающей выборке: \", train_data['price'].mean())\n",
|
|||
|
"print(\"Средняя цена в контрольной выборке: \", val_data['price'].mean())\n",
|
|||
|
"print(\"Средняя цена в тестовой выборке: \", test_data['price'].mean())"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 29,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "(base64-encoded PNG omitted: count plot of price categories in the training set)",
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 640x480 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "(base64-encoded PNG omitted: count plot of price categories after oversampling)",
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 640x480 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "(base64-encoded PNG omitted: count plot of price categories after undersampling)",
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 640x480 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Размер обучающей выборки после oversampling и undersampling: 17620\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"from imblearn.over_sampling import RandomOverSampler\n",
|
|||
|
"from imblearn.under_sampling import RandomUnderSampler\n",
|
|||
|
"\n",
|
|||
|
"# Bin the target variable (price) into four quantile-based categories; quartile bins are close to equal in size by construction\n",
|
|||
|
"train_data['price_category'] = pd.qcut(train_data['price'], q=4, labels=['low', 'medium', 'high', 'very_high'])\n",
|
|||
|
"\n",
|
|||
|
"# Plot the distribution of the price categories after binning\n",
|
|||
|
"sns.countplot(x=train_data['price_category'])\n",
|
|||
|
"plt.title('Распределение категорий цены в обучающей выборке')\n",
|
|||
|
"plt.xlabel('Категория цены')\n",
|
|||
|
"plt.ylabel('Частота')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"# Balance the categories with RandomOverSampler (duplicates samples from the minority classes)\n",
|
|||
|
"ros = RandomOverSampler(random_state=42)\n",
|
|||
|
"X_train = train_data.drop(columns=['price', 'price_category'])\n",
|
|||
|
"y_train = train_data['price_category']\n",
|
|||
|
"\n",
|
|||
|
"X_resampled, y_resampled = ros.fit_resample(X_train, y_train)\n",
|
|||
|
"\n",
|
|||
|
"# Plot the distribution of the price categories after oversampling\n",
|
|||
|
"sns.countplot(x=y_resampled)\n",
|
|||
|
"plt.title('Распределение категорий цены после oversampling')\n",
|
|||
|
"plt.xlabel('Категория цены')\n",
|
|||
|
"plt.ylabel('Частота')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"# Apply RandomUnderSampler to shrink the majority classes\n",
|
|||
|
"rus = RandomUnderSampler(random_state=42)\n",
|
|||
|
"X_resampled, y_resampled = rus.fit_resample(X_resampled, y_resampled)\n",
|
|||
|
"\n",
|
|||
|
"# Plot the distribution of the price categories after undersampling\n",
|
|||
|
"sns.countplot(x=y_resampled)\n",
|
|||
|
"plt.title('Распределение категорий цены после undersampling')\n",
|
|||
|
"plt.xlabel('Категория цены')\n",
|
|||
|
"plt.ylabel('Частота')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"# Print the size of the training set after balancing\n",
|
|||
|
"print(\"Размер обучающей выборки после oversampling и undersampling: \", len(X_resampled))"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Feature engineering\n",
|
|||
|
"\n",
|
|||
|
"Next, we engineer features for each of the two tasks.\n",
|
|||
|
"\n",
|
|||
|
"**Feature engineering process**  \n",
|
|||
|
"Task 1: Predicting house prices. Technical goal: build a machine learning model that accurately predicts a property's market value.  \n",
|
|||
|
"Task 2: Optimizing pre-sale renovation costs. Technical goal: build a machine learning model that recommends the renovations expected to add the most value to the house.\n",
|
|||
|
"\n",
|
|||
|
"**One-hot encoding**  \n",
|
|||
|
"One-hot encoding converts each categorical feature into a set of binary indicator columns, one per category.\n",
|
|||
|
"\n",
|
|||
|
"**Discretization of numeric features**  \n",
|
|||
|
"Discretization (binning) converts continuous numeric values into discrete categories or intervals (bins)."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 30,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Столбцы train_data_encoded: ['id', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15', 'price_category', 'date_20140502T000000', 'date_20140503T000000', 'date_20140504T000000', 'date_20140505T000000', 'date_20140506T000000', 'date_20140507T000000', 'date_20140508T000000', 'date_20140509T000000', 'date_20140510T000000', 'date_20140511T000000', 'date_20140512T000000', 'date_20140513T000000', 'date_20140514T000000', 'date_20140515T000000', 'date_20140516T000000', 'date_20140517T000000', 'date_20140518T000000', 'date_20140519T000000', 'date_20140520T000000', 'date_20140521T000000', 'date_20140522T000000', 'date_20140523T000000', 'date_20140524T000000', 'date_20140525T000000', 'date_20140526T000000', 'date_20140527T000000', 'date_20140528T000000', 'date_20140529T000000', 'date_20140530T000000', 'date_20140531T000000', 'date_20140601T000000', 'date_20140602T000000', 'date_20140603T000000', 'date_20140604T000000', 'date_20140605T000000', 'date_20140606T000000', 'date_20140607T000000', 'date_20140608T000000', 'date_20140609T000000', 'date_20140610T000000', 'date_20140611T000000', 'date_20140612T000000', 'date_20140613T000000', 'date_20140614T000000', 'date_20140615T000000', 'date_20140616T000000', 'date_20140617T000000', 'date_20140618T000000', 'date_20140619T000000', 'date_20140620T000000', 'date_20140621T000000', 'date_20140622T000000', 'date_20140623T000000', 'date_20140624T000000', 'date_20140625T000000', 'date_20140626T000000', 'date_20140627T000000', 'date_20140628T000000', 'date_20140629T000000', 'date_20140630T000000', 'date_20140701T000000', 'date_20140702T000000', 'date_20140703T000000', 'date_20140704T000000', 'date_20140705T000000', 'date_20140706T000000', 'date_20140707T000000', 'date_20140708T000000', 'date_20140709T000000', 'date_20140710T000000', 'date_20140711T000000', 'date_20140712T000000', 
'date_20140713T000000', 'date_20140714T000000', 'date_20140715T000000', 'date_20140716T000000', 'date_20140717T000000', 'date_20140718T000000', 'date_20140719T000000', 'date_20140720T000000', 'date_20140721T000000', 'date_20140722T000000', 'date_20140723T000000', 'date_20140724T000000', 'date_20140725T000000', 'date_20140726T000000', 'date_20140728T000000', 'date_20140729T000000', 'date_20140730T000000', 'date_20140731T000000', 'date_20140801T000000', 'date_20140802T000000', 'date_20140804T000000', 'date_20140805T000000', 'date_20140806T000000', 'date_20140807T000000', 'date_20140808T000000', 'date_20140809T000000', 'date_20140810T000000', 'date_20140811T000000', 'date_20140812T000000', 'date_20140813T000000', 'date_20140814T000000', 'date_20140815T000000', 'date_20140816T000000', 'date_20140817T000000', 'date_20140818T000000', 'date_20140819T000000', 'date_20140820T000000', 'date_20140821T000000', 'date_20140822T000000', 'date_20140823T000000', 'date_20140824T000000', 'date_20140825T000000', 'date_20140826T000000', 'date_20140827T000000', 'date_20140828T000000', 'date_20140829T000000', 'date_20140830T000000', 'date_20140831T000000', 'date_20140901T000000', 'date_20140902T000000', 'date_20140903T000000', 'date_20140904T000000', 'date_20140905T000000', 'date_20140906T000000', 'date_20140907T000000', 'date_20140908T000000', 'date_20140909T000000', 'date_20140910T000000', 'date_20140911T000000', 'date_20140912T000000', 'date_20140913T000000', 'date_20140914T000000', 'date_20140915T000000', 'date_20140916T000000', 'date_20140917T000000', 'date_20140918T000000', 'date_20140919T000000', 'date_20140920T000000', 'date_20140921T000000', 'date_20140922T000000', 'date_20140923T000000', 'date_20140924T000000', 'date_20140925T000000', 'date_20140926T000000', 'date_20140927T000000', 'date_20140928T000000', 'date_20140929T000000', 'date_20140930T000000', 'date_20141001T000000', 'date_20141002T000000', 'date_20141003T000000', 'date_20141004T000000', 'date_20141005T000000', 
'date_20141006T000000', 'date_20141007T000000', 'date_20141008T000000', 'date_20141009T000000', 'date_20141010T0
|
|||
|
"Столбцы val_data_encoded: ['id', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15', 'date_20140502T000000', 'date_20140503T000000', 'date_20140505T000000', 'date_20140506T000000', 'date_20140507T000000', 'date_20140508T000000', 'date_20140509T000000', 'date_20140510T000000', 'date_20140511T000000', 'date_20140512T000000', 'date_20140513T000000', 'date_20140514T000000', 'date_20140515T000000', 'date_20140516T000000', 'date_20140518T000000', 'date_20140519T000000', 'date_20140520T000000', 'date_20140521T000000', 'date_20140522T000000', 'date_20140523T000000', 'date_20140524T000000', 'date_20140525T000000', 'date_20140526T000000', 'date_20140527T000000', 'date_20140528T000000', 'date_20140529T000000', 'date_20140530T000000', 'date_20140531T000000', 'date_20140601T000000', 'date_20140602T000000', 'date_20140603T000000', 'date_20140604T000000', 'date_20140605T000000', 'date_20140606T000000', 'date_20140607T000000', 'date_20140609T000000', 'date_20140610T000000', 'date_20140611T000000', 'date_20140612T000000', 'date_20140613T000000', 'date_20140614T000000', 'date_20140615T000000', 'date_20140616T000000', 'date_20140617T000000', 'date_20140618T000000', 'date_20140619T000000', 'date_20140620T000000', 'date_20140621T000000', 'date_20140622T000000', 'date_20140623T000000', 'date_20140624T000000', 'date_20140625T000000', 'date_20140626T000000', 'date_20140627T000000', 'date_20140628T000000', 'date_20140629T000000', 'date_20140630T000000', 'date_20140701T000000', 'date_20140702T000000', 'date_20140703T000000', 'date_20140707T000000', 'date_20140708T000000', 'date_20140709T000000', 'date_20140710T000000', 'date_20140711T000000', 'date_20140712T000000', 'date_20140713T000000', 'date_20140714T000000', 'date_20140715T000000', 'date_20140716T000000', 'date_20140717T000000', 'date_20140718T000000', 'date_20140719T000000', 
'date_20140721T000000', 'date_20140722T000000', 'date_20140723T000000', 'date_20140724T000000', 'date_20140725T000000', 'date_20140727T000000', 'date_20140728T000000', 'date_20140729T000000', 'date_20140730T000000', 'date_20140731T000000', 'date_20140801T000000', 'date_20140802T000000', 'date_20140803T000000', 'date_20140804T000000', 'date_20140805T000000', 'date_20140806T000000', 'date_20140807T000000', 'date_20140808T000000', 'date_20140810T000000', 'date_20140811T000000', 'date_20140812T000000', 'date_20140813T000000', 'date_20140814T000000', 'date_20140815T000000', 'date_20140817T000000', 'date_20140818T000000', 'date_20140819T000000', 'date_20140820T000000', 'date_20140821T000000', 'date_20140822T000000', 'date_20140825T000000', 'date_20140826T000000', 'date_20140827T000000', 'date_20140828T000000', 'date_20140829T000000', 'date_20140831T000000', 'date_20140901T000000', 'date_20140902T000000', 'date_20140903T000000', 'date_20140904T000000', 'date_20140905T000000', 'date_20140907T000000', 'date_20140908T000000', 'date_20140909T000000', 'date_20140910T000000', 'date_20140911T000000', 'date_20140912T000000', 'date_20140913T000000', 'date_20140914T000000', 'date_20140915T000000', 'date_20140916T000000', 'date_20140917T000000', 'date_20140918T000000', 'date_20140919T000000', 'date_20140921T000000', 'date_20140922T000000', 'date_20140923T000000', 'date_20140924T000000', 'date_20140925T000000', 'date_20140926T000000', 'date_20140927T000000', 'date_20140929T000000', 'date_20140930T000000', 'date_20141001T000000', 'date_20141002T000000', 'date_20141003T000000', 'date_20141006T000000', 'date_20141007T000000', 'date_20141008T000000', 'date_20141009T000000', 'date_20141010T000000', 'date_20141012T000000', 'date_20141013T000000', 'date_20141014T000000', 'date_20141015T000000', 'date_20141016T000000', 'date_20141017T000000', 'date_20141018T000000', 'date_20141019T000000', 'date_20141020T000000', 'date_20141021T000000', 'date_20141022T000000', 'date_20141023T000000', 
'date_20141024T000000', 'date_20141027T000000', 'date_20141028T000000', 'date_20141029T000000', 'date_201410
|
|||
|
"Столбцы test_data_encoded: ['id', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15', 'date_20140502T000000', 'date_20140503T000000', 'date_20140505T000000', 'date_20140506T000000', 'date_20140507T000000', 'date_20140508T000000', 'date_20140509T000000', 'date_20140510T000000', 'date_20140511T000000', 'date_20140512T000000', 'date_20140513T000000', 'date_20140514T000000', 'date_20140515T000000', 'date_20140516T000000', 'date_20140518T000000', 'date_20140519T000000', 'date_20140520T000000', 'date_20140521T000000', 'date_20140522T000000', 'date_20140523T000000', 'date_20140524T000000', 'date_20140525T000000', 'date_20140526T000000', 'date_20140527T000000', 'date_20140528T000000', 'date_20140529T000000', 'date_20140530T000000', 'date_20140531T000000', 'date_20140601T000000', 'date_20140602T000000', 'date_20140603T000000', 'date_20140604T000000', 'date_20140605T000000', 'date_20140606T000000', 'date_20140607T000000', 'date_20140609T000000', 'date_20140610T000000', 'date_20140611T000000', 'date_20140612T000000', 'date_20140613T000000', 'date_20140614T000000', 'date_20140615T000000', 'date_20140616T000000', 'date_20140617T000000', 'date_20140618T000000', 'date_20140619T000000', 'date_20140620T000000', 'date_20140621T000000', 'date_20140622T000000', 'date_20140623T000000', 'date_20140624T000000', 'date_20140625T000000', 'date_20140626T000000', 'date_20140627T000000', 'date_20140628T000000', 'date_20140629T000000', 'date_20140630T000000', 'date_20140701T000000', 'date_20140702T000000', 'date_20140703T000000', 'date_20140707T000000', 'date_20140708T000000', 'date_20140709T000000', 'date_20140710T000000', 'date_20140711T000000', 'date_20140712T000000', 'date_20140713T000000', 'date_20140714T000000', 'date_20140715T000000', 'date_20140716T000000', 'date_20140717T000000', 'date_20140718T000000', 'date_20140719T000000', 
'date_20140721T000000', 'date_20140722T000000', 'date_20140723T000000', 'date_20140724T000000', 'date_20140725T000000', 'date_20140727T000000', 'date_20140728T000000', 'date_20140729T000000', 'date_20140730T000000', 'date_20140731T000000', 'date_20140801T000000', 'date_20140802T000000', 'date_20140803T000000', 'date_20140804T000000', 'date_20140805T000000', 'date_20140806T000000', 'date_20140807T000000', 'date_20140808T000000', 'date_20140810T000000', 'date_20140811T000000', 'date_20140812T000000', 'date_20140813T000000', 'date_20140814T000000', 'date_20140815T000000', 'date_20140817T000000', 'date_20140818T000000', 'date_20140819T000000', 'date_20140820T000000', 'date_20140821T000000', 'date_20140822T000000', 'date_20140825T000000', 'date_20140826T000000', 'date_20140827T000000', 'date_20140828T000000', 'date_20140829T000000', 'date_20140831T000000', 'date_20140901T000000', 'date_20140902T000000', 'date_20140903T000000', 'date_20140904T000000', 'date_20140905T000000', 'date_20140907T000000', 'date_20140908T000000', 'date_20140909T000000', 'date_20140910T000000', 'date_20140911T000000', 'date_20140912T000000', 'date_20140913T000000', 'date_20140914T000000', 'date_20140915T000000', 'date_20140916T000000', 'date_20140917T000000', 'date_20140918T000000', 'date_20140919T000000', 'date_20140921T000000', 'date_20140922T000000', 'date_20140923T000000', 'date_20140924T000000', 'date_20140925T000000', 'date_20140926T000000', 'date_20140927T000000', 'date_20140929T000000', 'date_20140930T000000', 'date_20141001T000000', 'date_20141002T000000', 'date_20141003T000000', 'date_20141006T000000', 'date_20141007T000000', 'date_20141008T000000', 'date_20141009T000000', 'date_20141010T000000', 'date_20141012T000000', 'date_20141013T000000', 'date_20141014T000000', 'date_20141015T000000', 'date_20141016T000000', 'date_20141017T000000', 'date_20141018T000000', 'date_20141019T000000', 'date_20141020T000000', 'date_20141021T000000', 'date_20141022T000000', 'date_20141023T000000', 
'date_20141024T000000', 'date_20141027T000000', 'date_20141028T000000', 'date_20141029T000000', 'date_20141
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"# Feature engineering\n",
"# One-hot encoding of categorical features\n",
"\n",
"# Categorical features to encode\n",
"categorical_features = ['date', 'waterfront', 'view', 'condition']\n",
"\n",
"# Apply one-hot encoding to each split and to the full dataset\n",
"train_data_encoded = pd.get_dummies(train_data, columns=categorical_features)\n",
"val_data_encoded = pd.get_dummies(val_data, columns=categorical_features)\n",
"test_data_encoded = pd.get_dummies(test_data, columns=categorical_features)\n",
"df_encoded = pd.get_dummies(df, columns=categorical_features)\n",
"\n",
"print(\"train_data_encoded columns:\", train_data_encoded.columns.tolist())\n",
"print(\"val_data_encoded columns:\", val_data_encoded.columns.tolist())\n",
"print(\"test_data_encoded columns:\", test_data_encoded.columns.tolist())\n",
"\n",
"# Encoding each split separately can yield different dummy columns,\n",
"# so align validation and test to the training columns\n",
"val_data_encoded = val_data_encoded.reindex(columns=train_data_encoded.columns, fill_value=False)\n",
"test_data_encoded = test_data_encoded.reindex(columns=train_data_encoded.columns, fill_value=False)\n",
"\n",
"# Discretization of a numerical feature: split the living area ('sqft_living') into 5 equal-width bins.\n",
"# The bin edges are learned on the training split and reused for validation and test\n",
"train_data_encoded['sqtf'], sqft_bins = pd.cut(train_data_encoded['sqft_living'], bins=5, labels=False, retbins=True)\n",
"sqft_bins[0], sqft_bins[-1] = float('-inf'), float('inf')  # keep out-of-range values in the outer bins\n",
"val_data_encoded['sqtf'] = pd.cut(val_data_encoded['sqft_living'], bins=sqft_bins, labels=False)\n",
"test_data_encoded['sqtf'] = pd.cut(test_data_encoded['sqft_living'], bins=sqft_bins, labels=False)\n",
"\n",
"# The same 5-bin discretization for the full dataset\n",
"df_encoded['sqtf'] = pd.cut(df_encoded['sqft_living'], bins=5, labels=False)"
|
|||
|
]
|
|||
|
},
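The discretization step in the cell above stays consistent across splits only if the bin edges are derived once, from the training data, and then reused. The following is a minimal pure-Python sketch of equal-width binning under that assumption. `make_edges` and `assign_bin` are hypothetical helpers (not pandas functions), the validation values are made up for illustration, and the bins are left-closed, so they will not match `pd.cut` exactly at the edges.

```python
def make_edges(values, n_bins):
    """Return n_bins + 1 equally spaced edges spanning the training values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def assign_bin(value, edges):
    """Map a value to a bin index; out-of-range values fall into the outer bins."""
    n_bins = len(edges) - 1
    if value <= edges[0]:
        return 0
    if value >= edges[-1]:
        return n_bins - 1
    width = edges[1] - edges[0]
    return min(int((value - edges[0]) / width), n_bins - 1)


# Living areas taken from the first rows of the dataset shown in this notebook
train_sqft = [1180, 2570, 770, 1960, 1680]
# Hypothetical validation values, including two outside the training range
val_sqft = [500, 1500, 4000]

edges = make_edges(train_sqft, n_bins=5)           # learned on train only
train_bins = [assign_bin(v, edges) for v in train_sqft]
val_bins = [assign_bin(v, edges) for v in val_sqft]  # same edges reused
print(edges, train_bins, val_bins)
```

Reusing the training edges means that a given living area always maps to the same bin index, whichever split the row belongs to.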
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
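"### Manual feature synthesis\n",
"New features are created from expert knowledge and the logic of the subject area. For house sales data, for example, one can derive the price per square foot. Note that this particular feature is computed from the target (`price`), so it is suited to descriptive analysis; passing it to a model that predicts the price leaks the target."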
|
|||
|
]
|
|||
|
},
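As a small illustration of the manual synthesis described above, the sketch below derives the price per square foot for two houses whose values appear in the notebook output, and then builds the model inputs without the target or anything derived from it. The helper names `add_price_per_sqft` and `model_inputs` and the `TARGET_DERIVED` set are assumptions introduced for this example, not part of the original notebook.

```python
TARGET = "price"
TARGET_DERIVED = {"price_per_sqft"}  # features computed from the target


def add_price_per_sqft(row):
    """Return a copy of the row with the derived price-per-square-foot feature."""
    enriched = dict(row)
    enriched["price_per_sqft"] = row["price"] / row["sqft_living"]
    return enriched


def model_inputs(row):
    """Drop the target and every feature derived from it before training."""
    return {k: v for k, v in row.items() if k != TARGET and k not in TARGET_DERIVED}


houses = [
    {"id": 7129300520, "price": 221900.0, "sqft_living": 1180, "bedrooms": 3},
    {"id": 6414100192, "price": 538000.0, "sqft_living": 2570, "bedrooms": 3},
]

enriched = [add_price_per_sqft(h) for h in houses]
features = [model_inputs(h) for h in enriched]
print(enriched[0]["price_per_sqft"])
print(sorted(features[0]))
```

Keeping the list of target-derived columns in one place makes it harder for such a feature to slip into the training matrix by accident.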
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 31,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
"# Manual feature synthesis: price per square foot, computed from each split's own columns\n",
"train_data_encoded['price_per_sqft'] = train_data_encoded['price'] / train_data_encoded['sqft_living']\n",
"val_data_encoded['price_per_sqft'] = val_data_encoded['price'] / val_data_encoded['sqft_living']\n",
"test_data_encoded['price_per_sqft'] = test_data_encoded['price'] / test_data_encoded['sqft_living']\n",
"\n",
"# The same feature for the full dataset\n",
"df_encoded['price_per_sqft'] = df_encoded['price'] / df_encoded['sqft_living']"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Feature scaling transforms numerical features so that they share a common scale. This matters for many machine learning algorithms that are sensitive to feature scale, such as linear models trained with regularization or gradient descent, support vector machines (SVM), and neural networks."
|
|||
|
]
|
|||
|
},
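To make the idea concrete, here is a minimal standard-library sketch of standardization (z-scoring) in which the mean and standard deviation are estimated on the training values only and then reused for validation data. The helper names and the validation values are assumptions for illustration. The population standard deviation is used, which is also what scikit-learn's `StandardScaler` computes.

```python
from statistics import mean, pstdev


def fit_standardizer(values):
    """Estimate the mean and population standard deviation on training data."""
    mu = mean(values)
    sigma = pstdev(values)
    # Guard against constant features, which would otherwise divide by zero
    return mu, (sigma if sigma > 0 else 1.0)


def standardize(values, mu, sigma):
    """Apply previously fitted statistics to any split."""
    return [(v - mu) / sigma for v in values]


# Living areas from the first rows of the dataset shown in this notebook
train_sqft = [1180, 2570, 770, 1960, 1680]
# Hypothetical validation values
val_sqft = [1500, 3000]

mu, sigma = fit_standardizer(train_sqft)
train_scaled = standardize(train_sqft, mu, sigma)
val_scaled = standardize(val_sqft, mu, sigma)  # no refitting on validation data
print(train_scaled)
print(val_scaled)
```

Fitting on the training split alone keeps information about the validation and test distributions out of the preprocessing step.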
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 32,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Standardize numerical features: fit on the training split only, then reuse the same statistics\n",
"numerical_features = ['bedrooms', 'sqft_living']\n",
"\n",
"scaler = StandardScaler()\n",
"train_data_encoded[numerical_features] = scaler.fit_transform(train_data_encoded[numerical_features])\n",
"val_data_encoded[numerical_features] = scaler.transform(val_data_encoded[numerical_features])\n",
"test_data_encoded[numerical_features] = scaler.transform(test_data_encoded[numerical_features])"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"# Feature engineering with the Featuretools framework"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 33,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
" id price bedrooms bathrooms sqft_living sqft_lot \\\n",
|
|||
|
"9876 1219000473 164950.0 -0.395263 1.75 -0.555396 15330 \n",
|
|||
|
"14982 6308000010 585000.0 -0.395263 2.50 0.238192 5089 \n",
|
|||
|
"1464 3630120700 757000.0 -0.395263 3.25 1.230177 5283 \n",
|
|||
|
"19209 1901600090 359000.0 1.752138 1.75 -0.147580 6654 \n",
|
|||
|
"2039 3395040550 320000.0 -0.395263 2.50 -0.599484 2890 \n",
|
|||
|
"... ... ... ... ... ... ... \n",
|
|||
|
"13184 1523049207 220000.0 0.678437 2.00 -0.412109 8043 \n",
|
|||
|
"5759 1954420170 580000.0 -0.395263 2.50 0.083883 7484 \n",
|
|||
|
"8433 1721801010 225000.0 -0.395263 1.00 -0.312911 6120 \n",
|
|||
|
"10253 2422049104 85000.0 -1.468964 1.00 -1.371028 9000 \n",
|
|||
|
"11363 7701960990 870000.0 0.678437 2.50 1.230177 14565 \n",
|
|||
|
"\n",
|
|||
|
" floors grade sqft_above sqft_basement ... view_2 view_3 view_4 \\\n",
|
|||
|
"9876 1.0 7 1080 490 ... False False False \n",
|
|||
|
"14982 2.0 9 2290 0 ... False False False \n",
|
|||
|
"1464 2.0 9 3190 0 ... False False False \n",
|
|||
|
"19209 1.5 7 1940 0 ... False False False \n",
|
|||
|
"2039 2.0 7 1530 0 ... False False False \n",
|
|||
|
"... ... ... ... ... ... ... ... ... \n",
|
|||
|
"13184 1.0 7 850 850 ... False False False \n",
|
|||
|
"5759 2.0 8 2150 0 ... False False False \n",
|
|||
|
"8433 1.0 6 1790 0 ... False False False \n",
|
|||
|
"10253 1.0 6 830 0 ... False False False \n",
|
|||
|
"11363 2.0 11 3190 0 ... False False False \n",
|
|||
|
"\n",
|
|||
|
" condition_1 condition_2 condition_3 condition_4 condition_5 sqtf \\\n",
|
|||
|
"9876 False False True False False 0 \n",
|
|||
|
"14982 False False True False False 0 \n",
|
|||
|
"1464 False False True False False 1 \n",
|
|||
|
"19209 False False False True False 0 \n",
|
|||
|
"2039 False False True False False 0 \n",
|
|||
|
"... ... ... ... ... ... ... \n",
|
|||
|
"13184 False False True False False 0 \n",
|
|||
|
"5759 False False True False False 0 \n",
|
|||
|
"8433 False False True False False 0 \n",
|
|||
|
"10253 False False True False False 0 \n",
|
|||
|
"11363 False False True False False 1 \n",
|
|||
|
"\n",
|
|||
|
" price_per_sqft \n",
|
|||
|
"9876 105.063694 \n",
|
|||
|
"14982 255.458515 \n",
|
|||
|
"1464 237.304075 \n",
|
|||
|
"19209 185.051546 \n",
|
|||
|
"2039 209.150327 \n",
|
|||
|
"... ... \n",
|
|||
|
"13184 129.411765 \n",
|
|||
|
"5759 269.767442 \n",
|
|||
|
"8433 125.698324 \n",
|
|||
|
"10253 102.409639 \n",
|
|||
|
"11363 272.727273 \n",
|
|||
|
"\n",
|
|||
|
"[224 rows x 400 columns]\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"e:\\MII\\laboratory\\mai\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
|
|||
|
" warnings.warn(\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
" price bedrooms bathrooms sqft_living sqft_lot floors \\\n",
|
|||
|
"id \n",
|
|||
|
"7129300520 221900.0 3 1.00 1180 5650 1.0 \n",
|
|||
|
"6414100192 538000.0 3 2.25 2570 7242 2.0 \n",
|
|||
|
"5631500400 180000.0 2 1.00 770 10000 1.0 \n",
|
|||
|
"2487200875 604000.0 4 3.00 1960 5000 1.0 \n",
|
|||
|
"1954400510 510000.0 3 2.00 1680 8080 1.0 \n",
|
|||
|
"\n",
|
|||
|
" grade sqft_above sqft_basement yr_built ... view_2 view_3 \\\n",
|
|||
|
"id ... \n",
|
|||
|
"7129300520 7 1180 0 1955 ... False False \n",
|
|||
|
"6414100192 7 2170 400 1951 ... False False \n",
|
|||
|
"5631500400 6 770 0 1933 ... False False \n",
|
|||
|
"2487200875 7 1050 910 1965 ... False False \n",
|
|||
|
"1954400510 8 1680 0 1987 ... False False \n",
|
|||
|
"\n",
|
|||
|
" view_4 condition_1 condition_2 condition_3 condition_4 \\\n",
|
|||
|
"id \n",
|
|||
|
"7129300520 False False False True False \n",
|
|||
|
"6414100192 False False False True False \n",
|
|||
|
"5631500400 False False False True False \n",
|
|||
|
"2487200875 False False False False False \n",
|
|||
|
"1954400510 False False False True False \n",
|
|||
|
"\n",
|
|||
|
" condition_5 sqtf price_per_sqft \n",
|
|||
|
"id \n",
|
|||
|
"7129300520 False 0 188.050847 \n",
|
|||
|
"6414100192 False 0 209.338521 \n",
|
|||
|
"5631500400 False 0 233.766234 \n",
|
|||
|
"2487200875 True 0 308.163265 \n",
|
|||
|
"1954400510 False 0 303.571429 \n",
|
|||
|
"\n",
|
|||
|
"[5 rows x 402 columns]\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"e:\\MII\\laboratory\\mai\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
|
|||
|
" pd.to_datetime(\n",
|
|||
|
"e:\\MII\\laboratory\\mai\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
|
|||
|
" warnings.warn(\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"import featuretools as ft\n",
"\n",
"# Preprocessing: remove rows that repeat a house id\n",
"df = df.drop_duplicates(subset='id')\n",
"\n",
"# Rows of the training split whose id occurs more than once (printed for inspection)\n",
"duplicates = train_data_encoded[train_data_encoded['id'].duplicated(keep=False)]\n",
"\n",
"# Keep the first occurrence of each id in the encoded full dataset\n",
"df_encoded = df_encoded.drop_duplicates(subset='id', keep='first')\n",
"\n",
"print(duplicates)\n",
"\n",
"# Create an EntitySet and add the houses dataframe, indexed by id\n",
"es = ft.EntitySet(id='house_data')\n",
"es = es.add_dataframe(dataframe_name='houses', dataframe=df_encoded, index='id')\n",
"\n",
"# Generate features with Deep Feature Synthesis.\n",
"# With a single dataframe, Featuretools lowers max_depth to 1 (see the warning in the output)\n",
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='houses', max_depth=2)\n",
"\n",
"# Show the first 5 rows of the generated feature matrix\n",
"print(feature_matrix.head())\n",
"\n",
"# Repeat the procedure for the training split: the index must be unique, so drop repeated ids first\n",
"train_data_encoded = train_data_encoded.drop_duplicates(subset='id', keep='first')\n",
"\n",
"es = ft.EntitySet(id='house_data')\n",
"es = es.add_dataframe(dataframe_name='houses', dataframe=train_data_encoded, index='id')\n",
"\n",
"# Generate features for the training split\n",
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='houses', max_depth=2)\n",
"\n",
"# Feature matrices for the validation and test splits.\n",
"# NOTE: `es` holds only the training rows indexed by house id, while instance_ids below are the row labels\n",
"# of the validation/test frames, so verify the ids before relying on these matrices\n",
"val_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=val_data_encoded.index)\n",
"test_feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es, instance_ids=test_data_encoded.index)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"### Assessing the quality of each feature set \n",
"\n",
"*Predictive power.* Metrics: RMSE, MAE, R². \n",
"\n",
"*Methods:* train the model on the training split and evaluate it on the validation and test splits. \n",
"\n",
"*Computation speed.* Methods: measure the time taken by feature generation and model training. \n",
"\n",
"*Reliability.* Methods: cross-validation and analysis of the model's sensitivity to changes in the data. \n",
"\n",
"*Correlation.* Methods: analyze the feature correlation matrix and remove multicollinear features. \n",
"\n",
"*Coherence.* Methods: check the logical link between the features and the target variable, and interpret the model's results. "
|
|||
|
]
|
|||
|
},
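The three predictive-power metrics listed above can be written out directly from their definitions. The sketch below is a plain-Python illustration, not the scikit-learn implementation. `regression_metrics` is a hypothetical helper, the actual prices come from the first rows of the dataset shown in this notebook, and the predicted prices are invented for the example.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE and R² from paired actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


y_true = [221900.0, 538000.0, 180000.0, 604000.0, 510000.0]
y_pred = [230000.0, 520000.0, 200000.0, 590000.0, 500000.0]  # hypothetical predictions

metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

RMSE penalizes large errors more heavily than MAE, so RMSE is never smaller than MAE, and a large gap between the two hints at a few badly mispredicted houses.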
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 34,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
"Model training time: 5.18 seconds\n",
"Mean squared error: 125198557176601739264.00\n"
]
}
],
"source": [
"import time\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error\n",
"\n",
"# Separate the features from the target variable\n",
"X = feature_matrix.drop('price', axis=1)\n",
"y = feature_matrix['price']\n",
"\n",
"# One-hot encode categorical columns so that all features are numeric\n",
"X = pd.get_dummies(X, drop_first=True)\n",
"\n",
"# Fill missing values with the column median\n",
"X.fillna(X.median(), inplace=True)\n",
"\n",
"# Split into training and validation sets\n",
"X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"# Train the model and time the fit\n",
"model = LinearRegression()\n",
"start_time = time.time()\n",
"model.fit(X_train, y_train)\n",
"train_time = time.time() - start_time\n",
"\n",
"# Predict on the validation set and compute the mean squared error\n",
"predictions = model.predict(X_val)\n",
"mse = mean_squared_error(y_val, predictions)\n",
"\n",
"print(f'Model training time: {train_time:.2f} seconds')\n",
"print(f'Mean squared error: {mse:.2f}')\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 35,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"e:\\MII\\laboratory\\mai\\Lib\\site-packages\\sklearn\\metrics\\_regression.py:492: FutureWarning: 'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.\n",
|
|||
|
" warnings.warn(\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"\n",
|
|||
|
"RMSE: 17870.38470608543\n",
|
|||
|
"R²: 0.9973762630189477\n",
|
|||
|
"MAE: 5924.569330616996 \n",
|
|||
|
"\n",
"Cross-validation RMSE: 34577.766841359786 \n",
|
|||
|
"\n",
|
|||
|
"Train RMSE: 12930.759734777745\n",
|
|||
|
"Train R²: 0.9987426148033223\n",
|
|||
|
"Train MAE: 2495.3698282637165\n",
|
|||
|
"\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"e:\\MII\\laboratory\\mai\\Lib\\site-packages\\sklearn\\metrics\\_regression.py:492: FutureWarning: 'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.\n",
|
|||
|
" warnings.warn(\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA0EAAAIjCAYAAADFthA8AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAC2mElEQVR4nOzdeVxU1fsH8M+dfZiBQXbEFXDJXbM0zazcU8sW+5WWpu25ZWVli7ZYaWppaqUtWtmubbaoWdriVmpquIOaCggoMMDsM/f8/uDL1RFQBsEB+bxfL3s1527PXJgZnjnnPEcSQggQERERERHVEapgB0BERERERHQhMQkiIiIiIqI6hUkQERERERHVKUyCiIiIiIioTmESREREREREdQqTICIiIiIiqlOYBBERERERUZ3CJIiIiIiIiOoUJkFERERERFSnMAkiIiIiogo5duwYlixZojw+fPgwPv744+AFRFRJTIKIqsFdd90Fs9kc7DCIiIiqlCRJGDNmDFatWoXDhw/j8ccfxx9//BHssIgCpgl2AEQXi5MnT+Ljjz/GH3/8gd9//x0OhwP9+/dHx44dceutt6Jjx47BDpGIiOi8JCQk4N5770X//v0BAPHx8Vi3bl1wgyKqBEkIIYIdBFFt99lnn+Hee+9FUVERmjRpAo/Hg+PHj6Njx47YsWMHPB4PRo4ciUWLFkGn0wU7XCIiovOSlpaGEydOoE2bNjCZTMEOhyhgHA5HdJ7Wr1+PO+64A3FxcVi/fj0OHTqE3r17w2Aw4O+//0ZGRgZuv/12fPDBB5g4caLfsbNmzUK3bt0QGRkJo9GISy+9FMuWLSt1DUmS8NxzzymPvV4vrrvuOkRERGD37t3KPmf7d/XVVwMA1q1bB0mSSn1zN3DgwFLXufrqq5XjShw+fBiSJPmNCQeAvXv34pZbbkFERAQMBgM6d+6M7777rtRzyc/Px8SJE9GkSRPo9Xo0aNAAI0aMwIkTJ8qNLyMjA02aNEHnzp1RVFQEAHC73ZgyZQouvfRSWCwWmEwm9OjRA2vXri11zezsbNx9991o1KgR1Gq1ck8qMmSxSZMmGDRoUKn2sWPHQpKkUu3p6ekYPXo0YmNjodfr0bp1a7z//vt++5Q8x7J+1mazGXfddZfyODc3F4899hjatm0Ls9mMsLAwDBgwADt27Dhn7MDZfy+aNGnit6/NZsOjjz6Khg0bQq/Xo0WLFpg1axYq+l3Z5s2bcd1116FevXowmUxo164d5s6dq2wvGSZ68OBB9OvXDyaTCfXr18cLL7xQ6hqBvDZK/qnVaiQkJOC+++5Dfn6+sk8g9xso/h19+OGHlfuQnJyMGTNmQJZlZZ+S18GsWbNKnbNNmzZ+r5tAXnNLliyBJEk4fPiw0rZq1Sp069YNISEhsFgsGDRoEFJSUkpdtyxOpxPPPfccmjdvDoPBgPj4eNx0001IS0s763FNmjQ56+/O6SRJwtixY/Hxxx+jRYsWMBgMuPTSS/H777+XOu8///yDAQMGICwsDGazGb169cKmTZv89im5B2X9O3bsGIDyhxwvW7aszHv95Zdf4tJLL4XRaERUVBTuuOMOpKen++3z3HPPoVWrVsrrrGvXrvjmm2/89inrPfHvv/+u9H1Zu3YtJEnC119/Xeq5fPLJJ5AkCRs3blTaKvI+W3L/dDodcnJy/LZt3LhRiXXLli0B36O77rpLed9ISkpCly5dkJubC6PRWOr3lqim43A4ovM0ffp0yLKMzz77DJdeemmp7VFRUfjwww+xe/duLFy4EFOnTkVMTAwAYO7cubj++usxfPhwuN1ufPbZZxg6dCi+//57DBw4sNxr3nPPPVi3bh1+/vlntGrVCgDw0UcfKdv/+OMPLFq0CK+//jqioqIAALGxseWe7/fff8ePP/5YqecPALt27UL37t2RkJCAJ598EiaTCV988QWGDBmC5cuX48YbbwQAFBUVoUePHtizZw9Gjx
6NTp064cSJE/juu+9w7NgxJdbTWa1WDBgwAFqtFj/++KPyh09BQQHeffdd3H777bj33ntRWFiI9957D/369cNff/2FDh06KOcYOXIk1qxZg3HjxqF9+/ZQq9VYtGgRtm3bVunnXJasrCx07dpV+eMnOjoaP/30E+6++24UFBTg4YcfDvicBw8exDfffIOhQ4eiadOmyMrKwsKFC9GzZ0/s3r0b9evXP+c5+vTpgxEjRvi1zZ49G3l5ecpjIQSuv/56rF27FnfffTc6dOiAVatWYdKkSUhPT8frr79+1mv8/PPPGDRoEOLj4zFhwgTExcVhz549+P777zFhwgRlP5/Ph/79+6Nr16549dVXsXLlSkydOhVerxcvvPCCsl8gr40bb7wRN910E7xeLzZu3IhFixbB4XD4vSYqym63o2fPnkhPT8f999+PRo0aYcOGDZg8eTIyMzMxZ86cgM9Zloq+5v744w9cd911aNy4MaZOnQqPx4M333wT3bt3x99//43mzZuXe6zP58OgQYPwyy+/4LbbbsOECRNQWFiIn3/+GSkpKUhKSjrrtTt06IBHH33Ur+3DDz/Ezz//XGrf3377DZ9//jnGjx8PvV6PN998E/3798dff/2FNm3aACh+n+jRowfCwsLw+OOPQ6vVYuHChbj66qvx22+/oUuXLn7nfOGFF9C0aVO/toiIiLPGXJYlS5Zg1KhRuOyyy/DKK68gKysLc+fOxfr16/HPP/8gPDwcQPGXADfeeCOaNGkCh8OBJUuW4Oabb8bGjRtx+eWXl3v+J554otxt57ovV199NRo2bIiPP/5YeZ8s8fHHHyMpKQlXXHEFgIq/z5ZQq9VYunSp35dvixcvhsFggNPprNQ9KsuUKVNKnY+oVhBEdF4iIiJE48aN/dpGjhwpTCaTX9uzzz4rAIgVK1YobXa73W8ft9st2rRpI6699lq/dgBi6tSpQgghJk+eLNRqtfjmm2/KjWnx4sUCgDh06FCpbWvXrhUAxNq1a5W2Ll26iAEDBvhdRwghrrnmGnHVVVf5HX/o0CEBQCxevFhp69Wrl2jbtq1wOp1KmyzLolu3bqJZs2ZK25QpUwQA8dVXX5WKS5blUvE5nU5x9dVXi5iYGJGamuq3v9frFS6Xy68tLy9PxMbGitGjRyttDodDqFQqcf/99/vtW9bPqCyNGzcWAwcOLNU+ZswYceZb6N133y3i4+PFiRMn/Npvu+02YbFYlJ93yXP88ssvS53XZDKJkSNHKo+dTqfw+Xx++xw6dEjo9XrxwgsvnDN+AGLMmDGl2gcOHOj3e/vNN98IAGLatGl++91yyy1CkqRS9/90Xq9XNG3aVDRu3Fjk5eX5bSv5uQpRfM8BiHHjxvltHzhwoNDpdCInJ0dpr8xro0S3bt1Eq1atlMeB3O8XX3xRmEwmsX//fr/9nnzySaFWq8WRI0eEEKdeBzNnzix1ztatW4uePXuWun5FXnNnvnYvvfRSYbFYxPHjx5V99u/fL7Rarbj55ptLXft077//vgAgXnvttVLbTv+5lCWQ33sAAoDYsmWL0vbff/8Jg8EgbrzxRqVtyJAhQqfTibS0NKUtIyNDhIaG+r3PlNyDv//+u9z4ynv9fvnll3732u12i5iYGNGmTRvhcDiU/b7//nsBQEyZMqXca2RnZwsAYtasWUpbz549/X62P/74owAg+vfvX+n7MnnyZKHX60V+fr7ftTUajd/vRkXfZ0vu3+233y7atm2rtNtsNhEWFiaGDRvmd38DuUcjR470e99ISUkRKpVK+V0u6zOHqKbicDii81RYWKj07JxNSU9MQUGB0mY0GpX/z8vLg9VqRY8ePcrtoZg/fz5eeeUVvPHGG7jhhhvOM/JiX331Ff7++29Mnz691LaYmBhl+El5cnNz8euvv+LWW29FYWEhTpw4gRMnTuDkyZPo168fDhw4oAypWL58Odq3b1/qG0sApYaSyLKMESNGYNOmTfjxxx9LfW
utVquV+VWyLCM3NxderxedO3f2u382mw2yLCMyMrJiN6SShBBYvnw5Bg8eDCGEch9OnDiBfv36wWq1lvq5nn6/Sv6
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1000x600 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"import matplotlib.pyplot as plt\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.metrics import r2_score, mean_absolute_error, root_mean_squared_error  # requires scikit-learn >= 1.4\n",
"from sklearn.model_selection import cross_val_score, train_test_split\n",
"\n",
"# X and y come from the previous cell (features and target of the training feature matrix).\n",
"# One-hot encode any remaining categorical columns\n",
"X = pd.get_dummies(X, drop_first=True)\n",
"\n",
"# Hold out 20% of the rows as a test set\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"# Choose and train the model\n",
"model = RandomForestRegressor(random_state=42)\n",
"model.fit(X_train, y_train)\n",
"\n",
"# Predict and evaluate on the held-out test set\n",
"y_pred = model.predict(X_test)\n",
"\n",
"rmse = root_mean_squared_error(y_test, y_pred)\n",
"r2 = r2_score(y_test, y_pred)\n",
"mae = mean_absolute_error(y_test, y_pred)\n",
"\n",
"print()\n",
"print(f\"RMSE: {rmse}\")\n",
"print(f\"R²: {r2}\")\n",
"print(f\"MAE: {mae} \\n\")\n",
"\n",
"# 5-fold cross-validation; RMSE is the square root of the mean fold MSE\n",
"scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')\n",
"rmse_cv = (-scores.mean())**0.5\n",
"print(f\"Cross-validation RMSE: {rmse_cv} \\n\")\n",
"\n",
"# Feature importances\n",
"feature_importances = model.feature_importances_\n",
"feature_names = X_train.columns\n",
"\n",
"# Overfitting check: the same metrics on the training set\n",
"y_train_pred = model.predict(X_train)\n",
"\n",
"rmse_train = root_mean_squared_error(y_train, y_train_pred)\n",
"r2_train = r2_score(y_train, y_train_pred)\n",
"mae_train = mean_absolute_error(y_train, y_train_pred)\n",
"\n",
"print(f\"Train RMSE: {rmse_train}\")\n",
"print(f\"Train R²: {r2_train}\")\n",
"print(f\"Train MAE: {mae_train}\")\n",
"print()\n",
"\n",
"# Plot actual against predicted prices\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(y_test, y_pred, alpha=0.5)\n",
"plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)\n",
"plt.xlabel('Actual price')\n",
"plt.ylabel('Predicted price')\n",
"plt.title('Actual vs. predicted price')\n",
"plt.show()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"# Conclusions and summary \n",
"\n",
"**The random forest model (RandomForestRegressor)** produced satisfactory results when predicting house prices. The quality metrics are high, but the gap between the training, test, and cross-validation errors described below should be taken into account before using the model in practice. \n",
"\n",
"*Prediction accuracy:* The model reaches a high R² on the test set (0.9974, against 0.9987 on the training set), which means it explains most of the variance of the target (the house price). RMSE and MAE on the test set are about 17,870 and 5,925 (12,931 and 2,495 on the training set), so individual predictions can still be noticeably off, especially for houses with very high or very low prices. The feature set also includes `price_per_sqft`, which is computed from the target, so these scores are likely optimistic. \n",
"\n",
"*Overfitting:* The test RMSE is about 38% higher than the training RMSE, and the test MAE is more than twice the training MAE, while R² stays almost unchanged. This indicates a moderate degree of fitting to the training data. The gap should be tracked when new features are added or the model is made more complex. \n",
"\n",
"*Cross-validation:* The cross-validated RMSE (about 34,578) is almost twice the test RMSE (about 17,870). This points to instability of the model across different subsets of the data. Further hyperparameter tuning may make the model more robust. \n",
"\n",
"*Recommendations:* Pay attention to additional processing of the categorical features, improve the feature engineering, and consider optimizing the model (for example, through hyperparameter search) to improve prediction accuracy at extreme values.\n",
"\n",
"That seems to be it :)"
|
|||
|
]
|
|||
|
}
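The gaps discussed in the conclusions can be checked with simple arithmetic on the metrics printed by the random forest cell. The sketch below only reuses those printed values. `relative_increase` is a hypothetical helper introduced for this example.

```python
def relative_increase(reference, value):
    """Fractional increase of value over reference (0.10 means 10% higher)."""
    return (value - reference) / reference


# Metrics copied from the recorded output of the random forest cell
train_rmse = 12930.759734777745
test_rmse = 17870.38470608543
cv_rmse = 34577.766841359786
train_mae = 2495.3698282637165
test_mae = 5924.569330616996

test_vs_train = relative_increase(train_rmse, test_rmse)
cv_vs_test = relative_increase(test_rmse, cv_rmse)
mae_ratio = test_mae / train_mae

print(f"Test RMSE vs. train RMSE: {test_vs_train:+.1%}")
print(f"CV RMSE vs. test RMSE:    {cv_vs_test:+.1%}")
print(f"Test MAE / train MAE:     {mae_ratio:.2f}x")
```

Reporting these ratios next to the raw metrics makes it easier to see at a glance how far the held-out and cross-validated errors drift from the training error.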
|
|||
|
],
|
|||
|
"metadata": {
|
|||
|
"kernelspec": {
|
|||
|
"display_name": "mai",
|
|||
|
"language": "python",
|
|||
|
"name": "python3"
|
|||
|
},
|
|||
|
"language_info": {
|
|||
|
"codemirror_mode": {
|
|||
|
"name": "ipython",
|
|||
|
"version": 3
|
|||
|
},
|
|||
|
"file_extension": ".py",
|
|||
|
"mimetype": "text/x-python",
|
|||
|
"name": "python",
|
|||
|
"nbconvert_exporter": "python",
|
|||
|
"pygments_lexer": "ipython3",
|
|||
|
"version": "3.12.6"
|
|||
|
}
|
|||
|
},
|
|||
|
"nbformat": 4,
|
|||
|
"nbformat_minor": 2
|
|||
|
}
|