{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lab Work No. 4\n",
"Variant 6: House sales in King County\n",
"\n",
"The regression task is to predict house prices, which helps realtors and analysts estimate the fair market value of properties.\n",
"\n",
"The classification task is to estimate the probability that a house's price is above or below the market median, and to assign houses to price categories (for example, low, medium, high). This helps reveal buyer preferences.\n",
"\n",
"Regression models are evaluated with MAE (mean absolute error) and R² (coefficient of determination); the target is an MAE below 10% of the mean price. For classification, the main metrics are accuracy and F1-score, with a target accuracy of about 80%.\n",
"\n",
"## Baselines for each task:\n",
"Regression: the median price (price.median()) is chosen as a stable baseline.\n",
"\n",
"Classification: the baseline is the mean predicted probability of the above-median class.\n",
"\n",
"## Analysis of machine learning algorithms:\n",
"\n",
"### Regression:\n",
"\n",
"Linear regression: suited to simple linear relationships.\n",
"\n",
"Decision tree: captures nonlinear relationships and complex patterns.\n",
"\n",
"Random forest: an ensemble method that generalizes well and handles outliers effectively.\n",
"\n",
"### Classification:\n",
"\n",
"Logistic regression: a simple model for binary classification.\n",
"\n",
"Support vector machine (SVM): effective on data with clearly separated classes.\n",
"\n",
"Gradient boosting: suited to complex, high-dimensional data; provides high accuracy.\n",
"\n",
"## Choice of three models for each task:\n",
"\n",
"Regression: linear regression, decision tree, random forest.\n",
"\n",
"Classification: logistic regression, support vector machine (SVM), gradient boosting."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',\n",
" 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',\n",
" 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',\n",
" 'lat', 'long', 'sqft_living15', 'sqft_lot15'],\n",
" dtype='object')\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>id</th>\n",
|
|||
|
" <th>date</th>\n",
|
|||
|
" <th>price</th>\n",
|
|||
|
" <th>bedrooms</th>\n",
|
|||
|
" <th>bathrooms</th>\n",
|
|||
|
" <th>sqft_living</th>\n",
|
|||
|
" <th>sqft_lot</th>\n",
|
|||
|
" <th>floors</th>\n",
|
|||
|
" <th>waterfront</th>\n",
|
|||
|
" <th>view</th>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <th>grade</th>\n",
|
|||
|
" <th>sqft_above</th>\n",
|
|||
|
" <th>sqft_basement</th>\n",
|
|||
|
" <th>yr_built</th>\n",
|
|||
|
" <th>yr_renovated</th>\n",
|
|||
|
" <th>zipcode</th>\n",
|
|||
|
" <th>lat</th>\n",
|
|||
|
" <th>long</th>\n",
|
|||
|
" <th>sqft_living15</th>\n",
|
|||
|
" <th>sqft_lot15</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>7129300520</td>\n",
|
|||
|
" <td>20141013T000000</td>\n",
|
|||
|
" <td>221900.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>1.00</td>\n",
|
|||
|
" <td>1180</td>\n",
|
|||
|
" <td>5650</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>7</td>\n",
|
|||
|
" <td>1180</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1955</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98178</td>\n",
|
|||
|
" <td>47.5112</td>\n",
|
|||
|
" <td>-122.257</td>\n",
|
|||
|
" <td>1340</td>\n",
|
|||
|
" <td>5650</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>6414100192</td>\n",
|
|||
|
" <td>20141209T000000</td>\n",
|
|||
|
" <td>538000.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>2.25</td>\n",
|
|||
|
" <td>2570</td>\n",
|
|||
|
" <td>7242</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>7</td>\n",
|
|||
|
" <td>2170</td>\n",
|
|||
|
" <td>400</td>\n",
|
|||
|
" <td>1951</td>\n",
|
|||
|
" <td>1991</td>\n",
|
|||
|
" <td>98125</td>\n",
|
|||
|
" <td>47.7210</td>\n",
|
|||
|
" <td>-122.319</td>\n",
|
|||
|
" <td>1690</td>\n",
|
|||
|
" <td>7639</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>5631500400</td>\n",
|
|||
|
" <td>20150225T000000</td>\n",
|
|||
|
" <td>180000.0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>1.00</td>\n",
|
|||
|
" <td>770</td>\n",
|
|||
|
" <td>10000</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>6</td>\n",
|
|||
|
" <td>770</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1933</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98028</td>\n",
|
|||
|
" <td>47.7379</td>\n",
|
|||
|
" <td>-122.233</td>\n",
|
|||
|
" <td>2720</td>\n",
|
|||
|
" <td>8062</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>2487200875</td>\n",
|
|||
|
" <td>20141209T000000</td>\n",
|
|||
|
" <td>604000.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>3.00</td>\n",
|
|||
|
" <td>1960</td>\n",
|
|||
|
" <td>5000</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>7</td>\n",
|
|||
|
" <td>1050</td>\n",
|
|||
|
" <td>910</td>\n",
|
|||
|
" <td>1965</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98136</td>\n",
|
|||
|
" <td>47.5208</td>\n",
|
|||
|
" <td>-122.393</td>\n",
|
|||
|
" <td>1360</td>\n",
|
|||
|
" <td>5000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>1954400510</td>\n",
|
|||
|
" <td>20150218T000000</td>\n",
|
|||
|
" <td>510000.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>2.00</td>\n",
|
|||
|
" <td>1680</td>\n",
|
|||
|
" <td>8080</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>8</td>\n",
|
|||
|
" <td>1680</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1987</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98074</td>\n",
|
|||
|
" <td>47.6168</td>\n",
|
|||
|
" <td>-122.045</td>\n",
|
|||
|
" <td>1800</td>\n",
|
|||
|
" <td>7503</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>5 rows × 21 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" id date price bedrooms bathrooms sqft_living \\\n",
|
|||
|
"0 7129300520 20141013T000000 221900.0 3 1.00 1180 \n",
|
|||
|
"1 6414100192 20141209T000000 538000.0 3 2.25 2570 \n",
|
|||
|
"2 5631500400 20150225T000000 180000.0 2 1.00 770 \n",
|
|||
|
"3 2487200875 20141209T000000 604000.0 4 3.00 1960 \n",
|
|||
|
"4 1954400510 20150218T000000 510000.0 3 2.00 1680 \n",
|
|||
|
"\n",
|
|||
|
" sqft_lot floors waterfront view ... grade sqft_above sqft_basement \\\n",
|
|||
|
"0 5650 1.0 0 0 ... 7 1180 0 \n",
|
|||
|
"1 7242 2.0 0 0 ... 7 2170 400 \n",
|
|||
|
"2 10000 1.0 0 0 ... 6 770 0 \n",
|
|||
|
"3 5000 1.0 0 0 ... 7 1050 910 \n",
|
|||
|
"4 8080 1.0 0 0 ... 8 1680 0 \n",
|
|||
|
"\n",
|
|||
|
" yr_built yr_renovated zipcode lat long sqft_living15 \\\n",
|
|||
|
"0 1955 0 98178 47.5112 -122.257 1340 \n",
|
|||
|
"1 1951 1991 98125 47.7210 -122.319 1690 \n",
|
|||
|
"2 1933 0 98028 47.7379 -122.233 2720 \n",
|
|||
|
"3 1965 0 98136 47.5208 -122.393 1360 \n",
|
|||
|
"4 1987 0 98074 47.6168 -122.045 1800 \n",
|
|||
|
"\n",
|
|||
|
" sqft_lot15 \n",
|
|||
|
"0 5650 \n",
|
|||
|
"1 7639 \n",
|
|||
|
"2 8062 \n",
|
|||
|
"3 5000 \n",
|
|||
|
"4 7503 \n",
|
|||
|
"\n",
|
|||
|
"[5 rows x 21 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 3,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
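"import pandas as pd\n",
"from sklearn import set_config\n",
"\n",
"# Make scikit-learn transformers return pandas DataFrames instead of NumPy arrays\n",
"set_config(transform_output=\"pandas\")\n",
"# Fixed seed, kept in one place for reproducible results\n",
"random_state = 42\n",
"\n",
"# Load the King County house sales data and preview the columns and first rows\n",
"df = pd.read_csv(\"./static/csv/kc_house_data.csv\")\n",
"print(df.columns)\n",
"df.head()"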
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 4,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"'X_train'"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>id</th>\n",
|
|||
|
" <th>date</th>\n",
|
|||
|
" <th>price</th>\n",
|
|||
|
" <th>bedrooms</th>\n",
|
|||
|
" <th>bathrooms</th>\n",
|
|||
|
" <th>sqft_living</th>\n",
|
|||
|
" <th>sqft_lot</th>\n",
|
|||
|
" <th>floors</th>\n",
|
|||
|
" <th>waterfront</th>\n",
|
|||
|
" <th>view</th>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <th>sqft_basement</th>\n",
|
|||
|
" <th>yr_built</th>\n",
|
|||
|
" <th>yr_renovated</th>\n",
|
|||
|
" <th>zipcode</th>\n",
|
|||
|
" <th>lat</th>\n",
|
|||
|
" <th>long</th>\n",
|
|||
|
" <th>sqft_living15</th>\n",
|
|||
|
" <th>sqft_lot15</th>\n",
|
|||
|
" <th>above_median_price</th>\n",
|
|||
|
" <th>price_category</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>20962</th>\n",
|
|||
|
" <td>1278000210</td>\n",
|
|||
|
" <td>20150311T000000</td>\n",
|
|||
|
" <td>110000.0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>1.00</td>\n",
|
|||
|
" <td>828</td>\n",
|
|||
|
" <td>4524</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1968</td>\n",
|
|||
|
" <td>2007</td>\n",
|
|||
|
" <td>98001</td>\n",
|
|||
|
" <td>47.2655</td>\n",
|
|||
|
" <td>-122.244</td>\n",
|
|||
|
" <td>828</td>\n",
|
|||
|
" <td>5402</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>12284</th>\n",
|
|||
|
" <td>2193300390</td>\n",
|
|||
|
" <td>20140923T000000</td>\n",
|
|||
|
" <td>624000.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>3.25</td>\n",
|
|||
|
" <td>2810</td>\n",
|
|||
|
" <td>11250</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>1130</td>\n",
|
|||
|
" <td>1980</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98052</td>\n",
|
|||
|
" <td>47.6920</td>\n",
|
|||
|
" <td>-122.099</td>\n",
|
|||
|
" <td>2110</td>\n",
|
|||
|
" <td>11250</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>7343</th>\n",
|
|||
|
" <td>4289900005</td>\n",
|
|||
|
" <td>20141230T000000</td>\n",
|
|||
|
" <td>1535000.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>3.25</td>\n",
|
|||
|
" <td>2850</td>\n",
|
|||
|
" <td>4100</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>1030</td>\n",
|
|||
|
" <td>1908</td>\n",
|
|||
|
" <td>2003</td>\n",
|
|||
|
" <td>98122</td>\n",
|
|||
|
" <td>47.6147</td>\n",
|
|||
|
" <td>-122.285</td>\n",
|
|||
|
" <td>2130</td>\n",
|
|||
|
" <td>4200</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>14247</th>\n",
|
|||
|
" <td>316000145</td>\n",
|
|||
|
" <td>20150325T000000</td>\n",
|
|||
|
" <td>235000.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>1.00</td>\n",
|
|||
|
" <td>1360</td>\n",
|
|||
|
" <td>7132</td>\n",
|
|||
|
" <td>1.5</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1941</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98168</td>\n",
|
|||
|
" <td>47.5054</td>\n",
|
|||
|
" <td>-122.301</td>\n",
|
|||
|
" <td>1280</td>\n",
|
|||
|
" <td>7175</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>16670</th>\n",
|
|||
|
" <td>629400480</td>\n",
|
|||
|
" <td>20140619T000000</td>\n",
|
|||
|
" <td>775000.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>2.75</td>\n",
|
|||
|
" <td>3010</td>\n",
|
|||
|
" <td>15992</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1996</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98075</td>\n",
|
|||
|
" <td>47.5895</td>\n",
|
|||
|
" <td>-121.994</td>\n",
|
|||
|
" <td>3330</td>\n",
|
|||
|
" <td>12333</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>88</th>\n",
|
|||
|
" <td>1332700270</td>\n",
|
|||
|
" <td>20140519T000000</td>\n",
|
|||
|
" <td>215000.0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>2.25</td>\n",
|
|||
|
" <td>1610</td>\n",
|
|||
|
" <td>2040</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1979</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98056</td>\n",
|
|||
|
" <td>47.5180</td>\n",
|
|||
|
" <td>-122.194</td>\n",
|
|||
|
" <td>1950</td>\n",
|
|||
|
" <td>2025</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>15031</th>\n",
|
|||
|
" <td>7129303070</td>\n",
|
|||
|
" <td>20140820T000000</td>\n",
|
|||
|
" <td>735000.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>2.75</td>\n",
|
|||
|
" <td>3040</td>\n",
|
|||
|
" <td>2415</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1966</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98118</td>\n",
|
|||
|
" <td>47.5188</td>\n",
|
|||
|
" <td>-122.256</td>\n",
|
|||
|
" <td>2620</td>\n",
|
|||
|
" <td>2433</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5234</th>\n",
|
|||
|
" <td>2432000130</td>\n",
|
|||
|
" <td>20150414T000000</td>\n",
|
|||
|
" <td>675000.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>1.75</td>\n",
|
|||
|
" <td>1660</td>\n",
|
|||
|
" <td>9549</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1956</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98033</td>\n",
|
|||
|
" <td>47.6503</td>\n",
|
|||
|
" <td>-122.198</td>\n",
|
|||
|
" <td>2090</td>\n",
|
|||
|
" <td>9549</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>19980</th>\n",
|
|||
|
" <td>774100475</td>\n",
|
|||
|
" <td>20140627T000000</td>\n",
|
|||
|
" <td>415000.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>2.75</td>\n",
|
|||
|
" <td>2600</td>\n",
|
|||
|
" <td>64626</td>\n",
|
|||
|
" <td>1.5</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2009</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98014</td>\n",
|
|||
|
" <td>47.7185</td>\n",
|
|||
|
" <td>-121.405</td>\n",
|
|||
|
" <td>1740</td>\n",
|
|||
|
" <td>64626</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3671</th>\n",
|
|||
|
" <td>8847400115</td>\n",
|
|||
|
" <td>20140723T000000</td>\n",
|
|||
|
" <td>590000.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>2.00</td>\n",
|
|||
|
" <td>2420</td>\n",
|
|||
|
" <td>208652</td>\n",
|
|||
|
" <td>1.5</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2005</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98010</td>\n",
|
|||
|
" <td>47.3666</td>\n",
|
|||
|
" <td>-121.978</td>\n",
|
|||
|
" <td>3180</td>\n",
|
|||
|
" <td>212137</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>17290 rows × 23 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" id date price bedrooms bathrooms \\\n",
|
|||
|
"20962 1278000210 20150311T000000 110000.0 2 1.00 \n",
|
|||
|
"12284 2193300390 20140923T000000 624000.0 4 3.25 \n",
|
|||
|
"7343 4289900005 20141230T000000 1535000.0 4 3.25 \n",
|
|||
|
"14247 316000145 20150325T000000 235000.0 4 1.00 \n",
|
|||
|
"16670 629400480 20140619T000000 775000.0 4 2.75 \n",
|
|||
|
"... ... ... ... ... ... \n",
|
|||
|
"88 1332700270 20140519T000000 215000.0 2 2.25 \n",
|
|||
|
"15031 7129303070 20140820T000000 735000.0 4 2.75 \n",
|
|||
|
"5234 2432000130 20150414T000000 675000.0 3 1.75 \n",
|
|||
|
"19980 774100475 20140627T000000 415000.0 3 2.75 \n",
|
|||
|
"3671 8847400115 20140723T000000 590000.0 3 2.00 \n",
|
|||
|
"\n",
|
|||
|
" sqft_living sqft_lot floors waterfront view ... sqft_basement \\\n",
|
|||
|
"20962 828 4524 1.0 0 0 ... 0 \n",
|
|||
|
"12284 2810 11250 1.0 0 0 ... 1130 \n",
|
|||
|
"7343 2850 4100 2.0 0 3 ... 1030 \n",
|
|||
|
"14247 1360 7132 1.5 0 0 ... 0 \n",
|
|||
|
"16670 3010 15992 2.0 0 0 ... 0 \n",
|
|||
|
"... ... ... ... ... ... ... ... \n",
|
|||
|
"88 1610 2040 2.0 0 0 ... 0 \n",
|
|||
|
"15031 3040 2415 2.0 1 4 ... 0 \n",
|
|||
|
"5234 1660 9549 1.0 0 0 ... 0 \n",
|
|||
|
"19980 2600 64626 1.5 0 0 ... 0 \n",
|
|||
|
"3671 2420 208652 1.5 0 0 ... 0 \n",
|
|||
|
"\n",
|
|||
|
" yr_built yr_renovated zipcode lat long sqft_living15 \\\n",
|
|||
|
"20962 1968 2007 98001 47.2655 -122.244 828 \n",
|
|||
|
"12284 1980 0 98052 47.6920 -122.099 2110 \n",
|
|||
|
"7343 1908 2003 98122 47.6147 -122.285 2130 \n",
|
|||
|
"14247 1941 0 98168 47.5054 -122.301 1280 \n",
|
|||
|
"16670 1996 0 98075 47.5895 -121.994 3330 \n",
|
|||
|
"... ... ... ... ... ... ... \n",
|
|||
|
"88 1979 0 98056 47.5180 -122.194 1950 \n",
|
|||
|
"15031 1966 0 98118 47.5188 -122.256 2620 \n",
|
|||
|
"5234 1956 0 98033 47.6503 -122.198 2090 \n",
|
|||
|
"19980 2009 0 98014 47.7185 -121.405 1740 \n",
|
|||
|
"3671 2005 0 98010 47.3666 -121.978 3180 \n",
|
|||
|
"\n",
|
|||
|
" sqft_lot15 above_median_price price_category \n",
|
|||
|
"20962 5402 0 0 \n",
|
|||
|
"12284 11250 1 1 \n",
|
|||
|
"7343 4200 1 2 \n",
|
|||
|
"14247 7175 0 0 \n",
|
|||
|
"16670 12333 1 2 \n",
|
|||
|
"... ... ... ... \n",
|
|||
|
"88 2025 0 0 \n",
|
|||
|
"15031 2433 1 2 \n",
|
|||
|
"5234 9549 1 1 \n",
|
|||
|
"19980 64626 0 1 \n",
|
|||
|
"3671 212137 1 1 \n",
|
|||
|
"\n",
|
|||
|
"[17290 rows x 23 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"'y_train'"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>above_median_price</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>20962</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>12284</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>7343</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>14247</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>16670</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>88</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>15031</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5234</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>19980</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3671</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>17290 rows × 1 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" above_median_price\n",
|
|||
|
"20962 0\n",
|
|||
|
"12284 1\n",
|
|||
|
"7343 1\n",
|
|||
|
"14247 0\n",
|
|||
|
"16670 1\n",
|
|||
|
"... ...\n",
|
|||
|
"88 0\n",
|
|||
|
"15031 1\n",
|
|||
|
"5234 1\n",
|
|||
|
"19980 0\n",
|
|||
|
"3671 1\n",
|
|||
|
"\n",
|
|||
|
"[17290 rows x 1 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"'X_test'"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>id</th>\n",
|
|||
|
" <th>date</th>\n",
|
|||
|
" <th>price</th>\n",
|
|||
|
" <th>bedrooms</th>\n",
|
|||
|
" <th>bathrooms</th>\n",
|
|||
|
" <th>sqft_living</th>\n",
|
|||
|
" <th>sqft_lot</th>\n",
|
|||
|
" <th>floors</th>\n",
|
|||
|
" <th>waterfront</th>\n",
|
|||
|
" <th>view</th>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <th>sqft_basement</th>\n",
|
|||
|
" <th>yr_built</th>\n",
|
|||
|
" <th>yr_renovated</th>\n",
|
|||
|
" <th>zipcode</th>\n",
|
|||
|
" <th>lat</th>\n",
|
|||
|
" <th>long</th>\n",
|
|||
|
" <th>sqft_living15</th>\n",
|
|||
|
" <th>sqft_lot15</th>\n",
|
|||
|
" <th>above_median_price</th>\n",
|
|||
|
" <th>price_category</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>11592</th>\n",
|
|||
|
" <td>2028701000</td>\n",
|
|||
|
" <td>20140529T000000</td>\n",
|
|||
|
" <td>635200.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>1.75</td>\n",
|
|||
|
" <td>1640</td>\n",
|
|||
|
" <td>4240</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>720</td>\n",
|
|||
|
" <td>1921</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98117</td>\n",
|
|||
|
" <td>47.6766</td>\n",
|
|||
|
" <td>-122.368</td>\n",
|
|||
|
" <td>1300</td>\n",
|
|||
|
" <td>4240</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>8984</th>\n",
|
|||
|
" <td>9406500530</td>\n",
|
|||
|
" <td>20140912T000000</td>\n",
|
|||
|
" <td>249000.0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>2.00</td>\n",
|
|||
|
" <td>1090</td>\n",
|
|||
|
" <td>1357</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1990</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98028</td>\n",
|
|||
|
" <td>47.7526</td>\n",
|
|||
|
" <td>-122.244</td>\n",
|
|||
|
" <td>1078</td>\n",
|
|||
|
" <td>1318</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>8280</th>\n",
|
|||
|
" <td>8097000330</td>\n",
|
|||
|
" <td>20140721T000000</td>\n",
|
|||
|
" <td>359950.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>2.75</td>\n",
|
|||
|
" <td>2540</td>\n",
|
|||
|
" <td>8604</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1991</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98092</td>\n",
|
|||
|
" <td>47.3209</td>\n",
|
|||
|
" <td>-122.185</td>\n",
|
|||
|
" <td>2260</td>\n",
|
|||
|
" <td>7438</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>792</th>\n",
|
|||
|
" <td>8081020370</td>\n",
|
|||
|
" <td>20140709T000000</td>\n",
|
|||
|
" <td>1355000.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>3.50</td>\n",
|
|||
|
" <td>3550</td>\n",
|
|||
|
" <td>11000</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>1290</td>\n",
|
|||
|
" <td>1999</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98006</td>\n",
|
|||
|
" <td>47.5506</td>\n",
|
|||
|
" <td>-122.134</td>\n",
|
|||
|
" <td>4100</td>\n",
|
|||
|
" <td>10012</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>10371</th>\n",
|
|||
|
" <td>7518507580</td>\n",
|
|||
|
" <td>20150502T000000</td>\n",
|
|||
|
" <td>581000.0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>1.00</td>\n",
|
|||
|
" <td>1170</td>\n",
|
|||
|
" <td>4080</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1909</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98117</td>\n",
|
|||
|
" <td>47.6784</td>\n",
|
|||
|
" <td>-122.386</td>\n",
|
|||
|
" <td>1560</td>\n",
|
|||
|
" <td>4586</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>16733</th>\n",
|
|||
|
" <td>7212650950</td>\n",
|
|||
|
" <td>20140708T000000</td>\n",
|
|||
|
" <td>336000.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>2.50</td>\n",
|
|||
|
" <td>2530</td>\n",
|
|||
|
" <td>8169</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1993</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98003</td>\n",
|
|||
|
" <td>47.2634</td>\n",
|
|||
|
" <td>-122.312</td>\n",
|
|||
|
" <td>2220</td>\n",
|
|||
|
" <td>8013</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>13151</th>\n",
|
|||
|
" <td>4365200620</td>\n",
|
|||
|
" <td>20150312T000000</td>\n",
|
|||
|
" <td>394000.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>1.00</td>\n",
|
|||
|
" <td>1450</td>\n",
|
|||
|
" <td>7930</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>300</td>\n",
|
|||
|
" <td>1923</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98126</td>\n",
|
|||
|
" <td>47.5212</td>\n",
|
|||
|
" <td>-122.371</td>\n",
|
|||
|
" <td>1040</td>\n",
|
|||
|
" <td>7740</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>11667</th>\n",
|
|||
|
" <td>4083304355</td>\n",
|
|||
|
" <td>20150318T000000</td>\n",
|
|||
|
" <td>675000.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>1.75</td>\n",
|
|||
|
" <td>1530</td>\n",
|
|||
|
" <td>3615</td>\n",
|
|||
|
" <td>1.5</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1913</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98103</td>\n",
|
|||
|
" <td>47.6529</td>\n",
|
|||
|
" <td>-122.334</td>\n",
|
|||
|
" <td>1650</td>\n",
|
|||
|
" <td>4200</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3683</th>\n",
|
|||
|
" <td>2891100820</td>\n",
|
|||
|
" <td>20140825T000000</td>\n",
|
|||
|
" <td>213500.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>1.00</td>\n",
|
|||
|
" <td>1220</td>\n",
|
|||
|
" <td>6000</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1968</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98002</td>\n",
|
|||
|
" <td>47.3245</td>\n",
|
|||
|
" <td>-122.209</td>\n",
|
|||
|
" <td>1420</td>\n",
|
|||
|
" <td>6000</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>12059</th>\n",
|
|||
|
" <td>952000640</td>\n",
|
|||
|
" <td>20141027T000000</td>\n",
|
|||
|
" <td>715000.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>1.50</td>\n",
|
|||
|
" <td>1670</td>\n",
|
|||
|
" <td>5060</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1925</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98126</td>\n",
|
|||
|
" <td>47.5671</td>\n",
|
|||
|
" <td>-122.379</td>\n",
|
|||
|
" <td>1670</td>\n",
|
|||
|
" <td>5118</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>4323 rows × 23 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
    "               id             date      price  bedrooms  bathrooms  \\\n",
    "11592  2028701000  20140529T000000   635200.0         4       1.75  \n",
    "8984   9406500530  20140912T000000   249000.0         2       2.00  \n",
    "8280   8097000330  20140721T000000   359950.0         3       2.75  \n",
    "792    8081020370  20140709T000000  1355000.0         4       3.50  \n",
    "10371  7518507580  20150502T000000   581000.0         2       1.00  \n",
    "...           ...              ...        ...       ...        ...  \n",
    "16733  7212650950  20140708T000000   336000.0         4       2.50  \n",
    "13151  4365200620  20150312T000000   394000.0         3       1.00  \n",
    "11667  4083304355  20150318T000000   675000.0         4       1.75  \n",
    "3683   2891100820  20140825T000000   213500.0         3       1.00  \n",
    "12059   952000640  20141027T000000   715000.0         3       1.50  \n",
    "\n",
    "       sqft_living  sqft_lot  floors  waterfront  view  ...  sqft_basement  \\\n",
    "11592         1640      4240     1.0           0     0  ...            720  \n",
    "8984          1090      1357     2.0           0     0  ...              0  \n",
    "8280          2540      8604     2.0           0     0  ...              0  \n",
    "792           3550     11000     1.0           0     2  ...           1290  \n",
    "10371         1170      4080     1.0           0     0  ...              0  \n",
    "...            ...       ...     ...         ...   ...  ...            ...  \n",
    "16733         2530      8169     2.0           0     0  ...              0  \n",
    "13151         1450      7930     1.0           0     0  ...            300  \n",
    "11667         1530      3615     1.5           0     0  ...              0  \n",
    "3683          1220      6000     1.0           0     0  ...              0  \n",
    "12059         1670      5060     2.0           0     2  ...              0  \n",
    "\n",
    "       yr_built  yr_renovated  zipcode      lat      long  sqft_living15  \\\n",
    "11592      1921             0    98117  47.6766  -122.368           1300  \n",
    "8984       1990             0    98028  47.7526  -122.244           1078  \n",
    "8280       1991             0    98092  47.3209  -122.185           2260  \n",
    "792        1999             0    98006  47.5506  -122.134           4100  \n",
    "10371      1909             0    98117  47.6784  -122.386           1560  \n",
    "...         ...           ...      ...      ...       ...            ...  \n",
    "16733      1993             0    98003  47.2634  -122.312           2220  \n",
    "13151      1923             0    98126  47.5212  -122.371           1040  \n",
    "11667      1913             0    98103  47.6529  -122.334           1650  \n",
    "3683       1968             0    98002  47.3245  -122.209           1420  \n",
    "12059      1925             0    98126  47.5671  -122.379           1670  \n",
    "\n",
    "       sqft_lot15  above_median_price  price_category  \n",
    "11592        4240                   1               1  \n",
    "8984         1318                   0               0  \n",
    "8280         7438                   0               1  \n",
    "792         10012                   1               2  \n",
    "10371        4586                   1               1  \n",
    "...           ...                 ...             ...  \n",
    "16733        8013                   0               1  \n",
    "13151        7740                   0               1  \n",
    "11667        4200                   1               1  \n",
    "3683         6000                   0               0  \n",
    "12059        5118                   1               2  \n",
    "\n",
    "[4323 rows x 23 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"'y_test'"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>above_median_price</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>11592</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>8984</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>8280</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>792</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>10371</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>16733</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>13151</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>11667</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3683</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>12059</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>4323 rows × 1 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
    "       above_median_price\n",
    "11592                   1\n",
    "8984                    0\n",
    "8280                    0\n",
    "792                     1\n",
    "10371                   1\n",
    "...                   ...\n",
    "16733                   0\n",
    "13151                   0\n",
    "11667                   1\n",
    "3683                    0\n",
    "12059                   1\n",
    "\n",
    "[4323 rows x 1 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
    "id                       int64\n",
    "date                    object\n",
    "price                  float64\n",
    "bedrooms                 int64\n",
    "bathrooms              float64\n",
    "sqft_living              int64\n",
    "sqft_lot                 int64\n",
    "floors                 float64\n",
    "waterfront               int64\n",
    "view                     int64\n",
    "condition                int64\n",
    "grade                    int64\n",
    "sqft_above               int64\n",
    "sqft_basement            int64\n",
    "yr_built                 int64\n",
    "yr_renovated             int64\n",
    "zipcode                  int64\n",
    "lat                    float64\n",
    "long                   float64\n",
    "sqft_living15            int64\n",
    "sqft_lot15               int64\n",
    "above_median_price       int64\n",
    "price_category        category\n",
    "dtype: object\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1000x600 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1000x600 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import numpy as np\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"import seaborn as sns\n",
|
|||
|
"\n",
|
|||
|
"from typing import Tuple\n",
|
|||
|
"import pandas as pd\n",
|
|||
|
"from pandas import DataFrame\n",
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"\n",
|
|||
|
"median_price = df['price'].median()\n",
|
|||
|
"df['above_median_price'] = np.where(df['price'] > median_price, 1, 0)\n",
|
|||
|
"\n",
|
|||
|
"X = df.drop(columns=['id', 'date', 'price', 'above_median_price'])\n",
|
|||
|
"y = df['above_median_price']\n",
|
|||
|
"\n",
|
|||
|
"df['price_category'] = pd.cut(df['price'], bins=[0, 300000, 700000, np.inf], labels=[0, 1, 2])\n",
|
|||
|
"\n",
|
|||
|
"X = df.drop(columns=['id', 'date', 'price', 'price_category'])\n",
|
|||
|
"\n",
    "def split_stratified_into_train_val_test(\n",
    "    df_input,\n",
    "    stratify_colname=\"y\",\n",
    "    frac_train=0.6,\n",
    "    frac_val=0.15,\n",
    "    frac_test=0.25,\n",
    "    random_state=None,\n",
    ") -> Tuple[DataFrame, DataFrame, DataFrame, DataFrame, DataFrame, DataFrame]:\n",
    "\n",
    "    # Compare with a tolerance: an exact float comparison can reject valid fractions.\n",
    "    if not np.isclose(frac_train + frac_val + frac_test, 1.0):\n",
    "        raise ValueError(\n",
    "            \"fractions %f, %f, %f do not add up to 1.0\"\n",
    "            % (frac_train, frac_val, frac_test)\n",
    "        )\n",
    "\n",
    "    if stratify_colname not in df_input.columns:\n",
    "        raise ValueError(\"%s is not a column in the dataframe\" % (stratify_colname))\n",
    "    X = df_input  # Contains all columns.\n",
    "    y = df_input[\n",
    "        [stratify_colname]\n",
    "    ]  # Dataframe of just the column on which to stratify.\n",
    "\n",
    "    # Split original dataframe into train and temp dataframes.\n",
    "    df_train, df_temp, y_train, y_temp = train_test_split(\n",
    "        X, y, stratify=y, test_size=(1.0 - frac_train), random_state=random_state\n",
    "    )\n",
    "\n",
    "    if frac_val <= 0:\n",
    "        assert len(df_input) == len(df_train) + len(df_temp)\n",
    "        return df_train, pd.DataFrame(), df_temp, y_train, pd.DataFrame(), y_temp\n",
    "    # Split the temp dataframe into val and test dataframes.\n",
    "    relative_frac_test = frac_test / (frac_val + frac_test)\n",
    "\n",
    "    df_val, df_test, y_val, y_test = train_test_split(\n",
    "        df_temp,\n",
    "        y_temp,\n",
    "        stratify=y_temp,\n",
    "        test_size=relative_frac_test,\n",
    "        random_state=random_state,\n",
    "    )\n",
    "\n",
    "    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)\n",
    "    return df_train, df_val, df_test, y_train, y_val, y_test\n",
|
|||
|
"\n",
|
|||
|
"X_train, X_val, X_test, y_train, y_val, y_test = split_stratified_into_train_val_test(\n",
|
|||
|
" df, stratify_colname=\"above_median_price\", frac_train=0.80, frac_val=0, frac_test=0.20, random_state=42\n",
|
|||
|
")\n",
|
|||
|
"\n",
|
|||
|
"display(\"X_train\", X_train)\n",
|
|||
|
"display(\"y_train\", y_train)\n",
|
|||
|
"\n",
|
|||
|
"display(\"X_test\", X_test)\n",
|
|||
|
"display(\"y_test\", y_test)\n",
|
|||
|
"\n",
|
|||
|
"print(df.dtypes)\n",
|
|||
|
"\n",
|
|||
|
"plt.figure(figsize=(10, 6))\n",
|
|||
|
"sns.histplot(df['price'], bins=50, kde=True)\n",
    "plt.title('Distribution of house prices')\n",
    "plt.xlabel('Price')\n",
    "plt.ylabel('Frequency')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"plt.figure(figsize=(10, 6))\n",
|
|||
|
"sns.boxplot(x='bedrooms', y='price', data=df)\n",
    "plt.title('Price versus number of bedrooms')\n",
    "plt.xlabel('Number of bedrooms')\n",
    "plt.ylabel('Price')\n",
|
|||
|
"plt.show()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
    "# Preprocessing pipelines"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 5,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"import numpy as np\n",
|
|||
|
"from sklearn.base import BaseEstimator, TransformerMixin\n",
|
|||
|
"from sklearn.compose import ColumnTransformer\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
    "from sklearn.pipeline import Pipeline\n",
    "\n",
    "\n",
    "# Build the preprocessing pipelines\n",
|
|||
|
"\n",
    "class HouseFeatures(BaseEstimator, TransformerMixin):\n",
    "    def __init__(self):\n",
    "        pass\n",
    "    def fit(self, X, y=None):\n",
    "        return self\n",
    "    def transform(self, X, y=None):\n",
    "        # Create new features. This step runs after the scaling step of the\n",
    "        # pipeline, so the ratio uses standardized values and can be negative.\n",
    "        X = X.copy()\n",
    "        X[\"Living_area_to_Lot_ratio\"] = X[\"sqft_living\"] / X[\"sqft_lot\"]\n",
    "        return X\n",
    "    def get_feature_names_out(self, features_in):\n",
    "        # Append the names of the new features\n",
    "        new_features = [\"Living_area_to_Lot_ratio\"]\n",
    "        return np.append(features_in, new_features, axis=0)\n",
|
|||
|
"\n",
|
|||
|
"\n",
    "# Numeric data handling. Numeric pipeline: fill missing values with the median, then standardize\n",
    "preprocessing_num_class = Pipeline(steps=[\n",
    "    ('imputer', SimpleImputer(strategy='median')),\n",
    "    ('scaler', StandardScaler())\n",
    "])\n",
    "\n",
    "preprocessing_cat_class = Pipeline(steps=[\n",
    "    ('imputer', SimpleImputer(strategy='most_frequent')),\n",
    "    ('onehot', OneHotEncoder(handle_unknown='ignore'))\n",
    "])\n",
    "\n",
    "columns_to_drop = [\"date\"]\n",
    "numeric_columns = [\"sqft_living\", \"sqft_lot\", \"above_median_price\"]\n",
    "cat_columns = []\n",
    "\n",
    "features_preprocessing = ColumnTransformer(\n",
    "    verbose_feature_names_out=False,\n",
    "    transformers=[\n",
    "        (\"prepocessing_num\", preprocessing_num_class, numeric_columns),\n",
    "        (\"prepocessing_cat\", preprocessing_cat_class, cat_columns),\n",
    "    ],\n",
    "    remainder=\"passthrough\"\n",
    ")\n",
    "\n",
    "drop_columns = ColumnTransformer(\n",
    "    verbose_feature_names_out=False,\n",
    "    transformers=[\n",
    "        (\"drop_columns\", \"drop\", columns_to_drop),\n",
    "    ],\n",
    "    remainder=\"passthrough\",\n",
    ")\n",
    "\n",
    "features_postprocessing = ColumnTransformer(\n",
    "    verbose_feature_names_out=False,\n",
    "    transformers=[\n",
    "        ('preprocessing_cat', preprocessing_cat_class, [\"price_category\"]),\n",
    "    ],\n",
    "    remainder=\"passthrough\",\n",
    ")\n",
    "\n",
    "pipeline_end = Pipeline(\n",
    "    [\n",
    "        (\"features_preprocessing\", features_preprocessing),\n",
    "        (\"custom_features\", HouseFeatures()),\n",
    "        (\"drop_columns\", drop_columns),\n",
    "    ]\n",
    ")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 6,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>sqft_living</th>\n",
|
|||
|
" <th>sqft_lot</th>\n",
|
|||
|
" <th>above_median_price</th>\n",
|
|||
|
" <th>id</th>\n",
|
|||
|
" <th>price</th>\n",
|
|||
|
" <th>bedrooms</th>\n",
|
|||
|
" <th>bathrooms</th>\n",
|
|||
|
" <th>floors</th>\n",
|
|||
|
" <th>waterfront</th>\n",
|
|||
|
" <th>view</th>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <th>sqft_basement</th>\n",
|
|||
|
" <th>yr_built</th>\n",
|
|||
|
" <th>yr_renovated</th>\n",
|
|||
|
" <th>zipcode</th>\n",
|
|||
|
" <th>lat</th>\n",
|
|||
|
" <th>long</th>\n",
|
|||
|
" <th>sqft_living15</th>\n",
|
|||
|
" <th>sqft_lot15</th>\n",
|
|||
|
" <th>price_category</th>\n",
|
|||
|
" <th>Living_area_to_Lot_ratio</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>20962</th>\n",
|
|||
|
" <td>-1.360742</td>\n",
|
|||
|
" <td>-0.262132</td>\n",
|
|||
|
" <td>-0.994693</td>\n",
|
|||
|
" <td>1278000210</td>\n",
|
|||
|
" <td>110000.0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>1.00</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1968</td>\n",
|
|||
|
" <td>2007</td>\n",
|
|||
|
" <td>98001</td>\n",
|
|||
|
" <td>47.2655</td>\n",
|
|||
|
" <td>-122.244</td>\n",
|
|||
|
" <td>828</td>\n",
|
|||
|
" <td>5402</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>5.191063</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>12284</th>\n",
|
|||
|
" <td>0.794390</td>\n",
|
|||
|
" <td>-0.094121</td>\n",
|
|||
|
" <td>1.005335</td>\n",
|
|||
|
" <td>2193300390</td>\n",
|
|||
|
" <td>624000.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>3.25</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>1130</td>\n",
|
|||
|
" <td>1980</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98052</td>\n",
|
|||
|
" <td>47.6920</td>\n",
|
|||
|
" <td>-122.099</td>\n",
|
|||
|
" <td>2110</td>\n",
|
|||
|
" <td>11250</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>-8.440052</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>7343</th>\n",
|
|||
|
" <td>0.837884</td>\n",
|
|||
|
" <td>-0.272723</td>\n",
|
|||
|
" <td>1.005335</td>\n",
|
|||
|
" <td>4289900005</td>\n",
|
|||
|
" <td>1535000.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>3.25</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>1030</td>\n",
|
|||
|
" <td>1908</td>\n",
|
|||
|
" <td>2003</td>\n",
|
|||
|
" <td>98122</td>\n",
|
|||
|
" <td>47.6147</td>\n",
|
|||
|
" <td>-122.285</td>\n",
|
|||
|
" <td>2130</td>\n",
|
|||
|
" <td>4200</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>-3.072292</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>14247</th>\n",
|
|||
|
" <td>-0.782270</td>\n",
|
|||
|
" <td>-0.196986</td>\n",
|
|||
|
" <td>-0.994693</td>\n",
|
|||
|
" <td>316000145</td>\n",
|
|||
|
" <td>235000.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>1.00</td>\n",
|
|||
|
" <td>1.5</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1941</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98168</td>\n",
|
|||
|
" <td>47.5054</td>\n",
|
|||
|
" <td>-122.301</td>\n",
|
|||
|
" <td>1280</td>\n",
|
|||
|
" <td>7175</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>3.971201</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>16670</th>\n",
|
|||
|
" <td>1.011860</td>\n",
|
|||
|
" <td>0.024330</td>\n",
|
|||
|
" <td>1.005335</td>\n",
|
|||
|
" <td>629400480</td>\n",
|
|||
|
" <td>775000.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>2.75</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1996</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98075</td>\n",
|
|||
|
" <td>47.5895</td>\n",
|
|||
|
" <td>-121.994</td>\n",
|
|||
|
" <td>3330</td>\n",
|
|||
|
" <td>12333</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>41.589045</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>88</th>\n",
|
|||
|
" <td>-0.510432</td>\n",
|
|||
|
" <td>-0.324180</td>\n",
|
|||
|
" <td>-0.994693</td>\n",
|
|||
|
" <td>1332700270</td>\n",
|
|||
|
" <td>215000.0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>2.25</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1979</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98056</td>\n",
|
|||
|
" <td>47.5180</td>\n",
|
|||
|
" <td>-122.194</td>\n",
|
|||
|
" <td>1950</td>\n",
|
|||
|
" <td>2025</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1.574534</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>15031</th>\n",
|
|||
|
" <td>1.044481</td>\n",
|
|||
|
" <td>-0.314813</td>\n",
|
|||
|
" <td>1.005335</td>\n",
|
|||
|
" <td>7129303070</td>\n",
|
|||
|
" <td>735000.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>2.75</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1966</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98118</td>\n",
|
|||
|
" <td>47.5188</td>\n",
|
|||
|
" <td>-122.256</td>\n",
|
|||
|
" <td>2620</td>\n",
|
|||
|
" <td>2433</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>-3.317784</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5234</th>\n",
|
|||
|
" <td>-0.456065</td>\n",
|
|||
|
" <td>-0.136611</td>\n",
|
|||
|
" <td>1.005335</td>\n",
|
|||
|
" <td>2432000130</td>\n",
|
|||
|
" <td>675000.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>1.75</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1956</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98033</td>\n",
|
|||
|
" <td>47.6503</td>\n",
|
|||
|
" <td>-122.198</td>\n",
|
|||
|
" <td>2090</td>\n",
|
|||
|
" <td>9549</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>3.338418</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>19980</th>\n",
|
|||
|
" <td>0.566046</td>\n",
|
|||
|
" <td>1.239169</td>\n",
|
|||
|
" <td>-0.994693</td>\n",
|
|||
|
" <td>774100475</td>\n",
|
|||
|
" <td>415000.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>2.75</td>\n",
|
|||
|
" <td>1.5</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2009</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98014</td>\n",
|
|||
|
" <td>47.7185</td>\n",
|
|||
|
" <td>-121.405</td>\n",
|
|||
|
" <td>1740</td>\n",
|
|||
|
" <td>64626</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0.456795</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3671</th>\n",
|
|||
|
" <td>0.370323</td>\n",
|
|||
|
" <td>4.836825</td>\n",
|
|||
|
" <td>1.005335</td>\n",
|
|||
|
" <td>8847400115</td>\n",
|
|||
|
" <td>590000.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>2.00</td>\n",
|
|||
|
" <td>1.5</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2005</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98010</td>\n",
|
|||
|
" <td>47.3666</td>\n",
|
|||
|
" <td>-121.978</td>\n",
|
|||
|
" <td>3180</td>\n",
|
|||
|
" <td>212137</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0.076563</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>17290 rows × 23 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
    "       sqft_living   sqft_lot  above_median_price          id      price  \\\n",
    "20962    -1.360742  -0.262132           -0.994693  1278000210   110000.0  \n",
    "12284     0.794390  -0.094121            1.005335  2193300390   624000.0  \n",
    "7343      0.837884  -0.272723            1.005335  4289900005  1535000.0  \n",
    "14247    -0.782270  -0.196986           -0.994693   316000145   235000.0  \n",
    "16670     1.011860   0.024330            1.005335   629400480   775000.0  \n",
    "...            ...        ...                 ...         ...        ...  \n",
    "88       -0.510432  -0.324180           -0.994693  1332700270   215000.0  \n",
    "15031     1.044481  -0.314813            1.005335  7129303070   735000.0  \n",
    "5234     -0.456065  -0.136611            1.005335  2432000130   675000.0  \n",
    "19980     0.566046   1.239169           -0.994693   774100475   415000.0  \n",
    "3671      0.370323   4.836825            1.005335  8847400115   590000.0  \n",
    "\n",
    "       bedrooms  bathrooms  floors  waterfront  view  ...  sqft_basement  \\\n",
    "20962         2       1.00     1.0           0     0  ...              0  \n",
    "12284         4       3.25     1.0           0     0  ...           1130  \n",
    "7343          4       3.25     2.0           0     3  ...           1030  \n",
    "14247         4       1.00     1.5           0     0  ...              0  \n",
    "16670         4       2.75     2.0           0     0  ...              0  \n",
    "...         ...        ...     ...         ...   ...  ...            ...  \n",
    "88            2       2.25     2.0           0     0  ...              0  \n",
    "15031         4       2.75     2.0           1     4  ...              0  \n",
    "5234          3       1.75     1.0           0     0  ...              0  \n",
    "19980         3       2.75     1.5           0     0  ...              0  \n",
    "3671          3       2.00     1.5           0     0  ...              0  \n",
    "\n",
    "       yr_built  yr_renovated  zipcode      lat      long  sqft_living15  \\\n",
    "20962      1968          2007    98001  47.2655  -122.244            828  \n",
    "12284      1980             0    98052  47.6920  -122.099           2110  \n",
    "7343       1908          2003    98122  47.6147  -122.285           2130  \n",
    "14247      1941             0    98168  47.5054  -122.301           1280  \n",
    "16670      1996             0    98075  47.5895  -121.994           3330  \n",
    "...         ...           ...      ...      ...       ...            ...  \n",
    "88         1979             0    98056  47.5180  -122.194           1950  \n",
    "15031      1966             0    98118  47.5188  -122.256           2620  \n",
    "5234       1956             0    98033  47.6503  -122.198           2090  \n",
    "19980      2009             0    98014  47.7185  -121.405           1740  \n",
    "3671       2005             0    98010  47.3666  -121.978           3180  \n",
    "\n",
    "       sqft_lot15  price_category  Living_area_to_Lot_ratio  \n",
    "20962        5402               0                  5.191063  \n",
    "12284       11250               1                 -8.440052  \n",
    "7343         4200               2                 -3.072292  \n",
    "14247        7175               0                  3.971201  \n",
    "16670       12333               2                 41.589045  \n",
    "...           ...             ...                       ...  \n",
    "88           2025               0                  1.574534  \n",
    "15031        2433               2                 -3.317784  \n",
    "5234         9549               1                  3.338418  \n",
    "19980       64626               1                  0.456795  \n",
    "3671       212137               1                  0.076563  \n",
    "\n",
    "[17290 rows x 23 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 6,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"preprocessing_result = pipeline_end.fit_transform(X_train)\n",
|
|||
|
"preprocessed_df = pd.DataFrame(\n",
|
|||
|
" preprocessing_result,\n",
|
|||
|
" columns=pipeline_end.get_feature_names_out(),\n",
|
|||
|
")\n",
|
|||
|
"\n",
|
|||
|
"preprocessed_df"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"# Building the set of classification models\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"from sklearn import ensemble, linear_model, naive_bayes, neighbors, neural_network, tree, svm\n",
"\n",
"class_models = {\n",
"    \"logistic\": {\"model\": linear_model.LogisticRegression(max_iter=150)},  # logistic regression\n",
"    # A RidgeClassifierCV entry under the same \"ridge\" key was silently overwritten by the entry below.\n",
"    # RidgeClassifierCV also has no predict_proba, which the evaluation loop calls, so only the\n",
"    # L2-penalized logistic regression with balanced class weights is kept.\n",
"    \"ridge\": {\"model\": linear_model.LogisticRegression(max_iter=150, solver='lbfgs', penalty=\"l2\", class_weight=\"balanced\")},\n",
"    \"decision_tree\": {  # decision tree\n",
"        \"model\": tree.DecisionTreeClassifier(max_depth=5, min_samples_split=10, random_state=random_state)\n",
"    },\n",
"\n",
"    \"knn\": {\"model\": neighbors.KNeighborsClassifier(n_neighbors=7)},  # k-nearest neighbours\n",
"    \"naive_bayes\": {\"model\": naive_bayes.GaussianNB()},  # Gaussian naive Bayes classifier\n",
"\n",
"    # gradient boosting (an ensemble of decision trees)\n",
"    \"gradient_boosting\": {\n",
"        \"model\": ensemble.GradientBoostingClassifier(n_estimators=210)\n",
"    },\n",
"\n",
"    # random forest (an ensemble of decision trees)\n",
"    \"random_forest\": {\n",
"        \"model\": ensemble.RandomForestClassifier(\n",
"            max_depth=5, class_weight=\"balanced\", random_state=random_state\n",
"        )\n",
"    },\n",
"    # multilayer perceptron (neural network)\n",
"    \"mlp\": {\n",
"        \"model\": neural_network.MLPClassifier(\n",
"            hidden_layer_sizes=(7,),\n",
"            max_iter=200,\n",
"            early_stopping=True,\n",
"            random_state=random_state,\n",
"        )\n",
"    },\n",
"}"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 8,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Model: logistic\n",
|
|||
|
"Model: ridge\n",
|
|||
|
"Model: decision_tree\n",
|
|||
|
"Model: knn\n",
|
|||
|
"Model: naive_bayes\n",
|
|||
|
"Model: gradient_boosting\n",
|
|||
|
"Model: random_forest\n",
|
|||
|
"Model: mlp\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import numpy as np\n",
"from sklearn import metrics\n",
"\n",
"# Fit every candidate behind the shared preprocessing pipeline and store its metrics\n",
"for model_name in class_models.keys():\n",
"    print(f\"Model: {model_name}\")\n",
"    model = class_models[model_name][\"model\"]\n",
"\n",
"    model_pipeline = Pipeline([(\"pipeline\", pipeline_end), (\"model\", model)])\n",
"    model_pipeline = model_pipeline.fit(X_train, y_train.values.ravel())\n",
"\n",
"    # Train labels come from predict(); test labels come from thresholding\n",
"    # the positive-class probability at 0.5\n",
"    y_train_predict = model_pipeline.predict(X_train)\n",
"    y_test_probs = model_pipeline.predict_proba(X_test)[:, 1]\n",
"    y_test_predict = np.where(y_test_probs > 0.5, 1, 0)\n",
"\n",
"    class_models[model_name][\"pipeline\"] = model_pipeline\n",
"    class_models[model_name][\"probs\"] = y_test_probs\n",
"    class_models[model_name][\"preds\"] = y_test_predict\n",
"\n",
"    # zero_division=1 reports precision as 1 when a model predicts no positives at all\n",
"    class_models[model_name][\"Precision_train\"] = metrics.precision_score(\n",
"        y_train, y_train_predict, zero_division=1\n",
"    )\n",
"    class_models[model_name][\"Precision_test\"] = metrics.precision_score(\n",
"        y_test, y_test_predict, zero_division=1\n",
"    )\n",
"    class_models[model_name][\"Recall_train\"] = metrics.recall_score(\n",
"        y_train, y_train_predict\n",
"    )\n",
"    class_models[model_name][\"Recall_test\"] = metrics.recall_score(\n",
"        y_test, y_test_predict\n",
"    )\n",
"    class_models[model_name][\"Accuracy_train\"] = metrics.accuracy_score(\n",
"        y_train, y_train_predict\n",
"    )\n",
"    class_models[model_name][\"Accuracy_test\"] = metrics.accuracy_score(\n",
"        y_test, y_test_predict\n",
"    )\n",
"    # ROC AUC is computed from probabilities, not from thresholded labels\n",
"    class_models[model_name][\"ROC_AUC_test\"] = metrics.roc_auc_score(\n",
"        y_test, y_test_probs\n",
"    )\n",
"    class_models[model_name][\"F1_train\"] = metrics.f1_score(y_train, y_train_predict)\n",
"    class_models[model_name][\"F1_test\"] = metrics.f1_score(y_test, y_test_predict)\n",
"    class_models[model_name][\"MCC_test\"] = metrics.matthews_corrcoef(\n",
"        y_test, y_test_predict\n",
"    )\n",
"    class_models[model_name][\"Cohen_kappa_test\"] = metrics.cohen_kappa_score(\n",
"        y_test, y_test_predict\n",
"    )\n",
"    class_models[model_name][\"Confusion_matrix\"] = metrics.confusion_matrix(\n",
"        y_test, y_test_predict\n",
"    )"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 9,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1200x1000 with 16 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import math\n",
"\n",
"from sklearn.metrics import ConfusionMatrixDisplay\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Two confusion matrices per row; ceil keeps enough axes when the number of models is odd\n",
"_, ax = plt.subplots(math.ceil(len(class_models) / 2), 2, figsize=(12, 10), sharex=False, sharey=False)\n",
"for index, key in enumerate(class_models.keys()):\n",
"    c_matrix = class_models[key][\"Confusion_matrix\"]\n",
"    disp = ConfusionMatrixDisplay(\n",
"        confusion_matrix=c_matrix, display_labels=[\"Less\", \"More\"]\n",
"    ).plot(ax=ax.flat[index])\n",
"    disp.ax_.set_title(key)\n",
"\n",
"plt.subplots_adjust(top=1, bottom=0, hspace=0.4, wspace=0.1)\n",
"plt.show()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 11,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<style type=\"text/css\">\n",
|
|||
|
"#T_115c5_row0_col0, #T_115c5_row0_col1, #T_115c5_row0_col2, #T_115c5_row0_col3, #T_115c5_row1_col0, #T_115c5_row1_col1, #T_115c5_row1_col2, #T_115c5_row1_col3, #T_115c5_row2_col0, #T_115c5_row2_col1, #T_115c5_row2_col2, #T_115c5_row2_col3, #T_115c5_row3_col0, #T_115c5_row3_col1, #T_115c5_row3_col2, #T_115c5_row3_col3, #T_115c5_row4_col0, #T_115c5_row4_col1, #T_115c5_row4_col2, #T_115c5_row4_col3, #T_115c5_row5_col0, #T_115c5_row5_col1 {\n",
|
|||
|
" background-color: #a8db34;\n",
|
|||
|
" color: #000000;\n",
|
|||
|
"}\n",
|
|||
|
"#T_115c5_row0_col4, #T_115c5_row0_col5, #T_115c5_row0_col6, #T_115c5_row0_col7, #T_115c5_row1_col4, #T_115c5_row1_col5, #T_115c5_row1_col6, #T_115c5_row1_col7, #T_115c5_row2_col4, #T_115c5_row2_col5, #T_115c5_row2_col6, #T_115c5_row2_col7, #T_115c5_row3_col4, #T_115c5_row3_col5, #T_115c5_row3_col6, #T_115c5_row3_col7, #T_115c5_row4_col4, #T_115c5_row4_col5, #T_115c5_row4_col6, #T_115c5_row4_col7 {\n",
|
|||
|
" background-color: #da5a6a;\n",
|
|||
|
" color: #f1f1f1;\n",
|
|||
|
"}\n",
|
|||
|
"#T_115c5_row5_col2 {\n",
|
|||
|
" background-color: #6ccd5a;\n",
|
|||
|
" color: #000000;\n",
|
|||
|
"}\n",
|
|||
|
"#T_115c5_row5_col3 {\n",
|
|||
|
" background-color: #6ece58;\n",
|
|||
|
" color: #000000;\n",
|
|||
|
"}\n",
|
|||
|
"#T_115c5_row5_col4 {\n",
|
|||
|
" background-color: #c43e7f;\n",
|
|||
|
" color: #f1f1f1;\n",
|
|||
|
"}\n",
|
|||
|
"#T_115c5_row5_col5 {\n",
|
|||
|
" background-color: #c5407e;\n",
|
|||
|
" color: #f1f1f1;\n",
|
|||
|
"}\n",
|
|||
|
"#T_115c5_row5_col6, #T_115c5_row5_col7 {\n",
|
|||
|
" background-color: #ce4b75;\n",
|
|||
|
" color: #f1f1f1;\n",
|
|||
|
"}\n",
|
|||
|
"#T_115c5_row6_col0 {\n",
|
|||
|
" background-color: #40bd72;\n",
|
|||
|
" color: #f1f1f1;\n",
|
|||
|
"}\n",
|
|||
|
"#T_115c5_row6_col1 {\n",
|
|||
|
" background-color: #38b977;\n",
|
|||
|
" color: #f1f1f1;\n",
|
|||
|
"}\n",
|
|||
|
"#T_115c5_row6_col2 {\n",
|
|||
|
" background-color: #7fd34e;\n",
|
|||
|
" color: #000000;\n",
|
|||
|
"}\n",
|
|||
|
"#T_115c5_row6_col3 {\n",
|
|||
|
" background-color: #75d054;\n",
|
|||
|
" color: #000000;\n",
|
|||
|
"}\n",
|
|||
|
"#T_115c5_row6_col4 {\n",
|
|||
|
" background-color: #be3885;\n",
|
|||
|
" color: #f1f1f1;\n",
|
|||
|
"}\n",
|
|||
|
"#T_115c5_row6_col5 {\n",
|
|||
|
" background-color: #b42e8d;\n",
|
|||
|
" color: #f1f1f1;\n",
|
|||
|
"}\n",
|
|||
|
"#T_115c5_row6_col6 {\n",
|
|||
|
" background-color: #cc4977;\n",
|
|||
|
" color: #f1f1f1;\n",
|
|||
|
"}\n",
|
|||
|
"#T_115c5_row6_col7 {\n",
|
|||
|
" background-color: #c8437b;\n",
|
|||
|
" color: #f1f1f1;\n",
|
|||
|
"}\n",
|
|||
|
"#T_115c5_row7_col0, #T_115c5_row7_col1, #T_115c5_row7_col2, #T_115c5_row7_col3 {\n",
|
|||
|
" background-color: #26818e;\n",
|
|||
|
" color: #f1f1f1;\n",
|
|||
|
"}\n",
|
|||
|
"#T_115c5_row7_col4, #T_115c5_row7_col5, #T_115c5_row7_col6, #T_115c5_row7_col7 {\n",
|
|||
|
" background-color: #4e02a2;\n",
|
|||
|
" color: #f1f1f1;\n",
|
|||
|
"}\n",
|
|||
|
"</style>\n",
|
|||
|
"<table id=\"T_115c5\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th class=\"blank level0\" > </th>\n",
|
|||
|
" <th id=\"T_115c5_level0_col0\" class=\"col_heading level0 col0\" >Precision_train</th>\n",
|
|||
|
" <th id=\"T_115c5_level0_col1\" class=\"col_heading level0 col1\" >Precision_test</th>\n",
|
|||
|
" <th id=\"T_115c5_level0_col2\" class=\"col_heading level0 col2\" >Recall_train</th>\n",
|
|||
|
" <th id=\"T_115c5_level0_col3\" class=\"col_heading level0 col3\" >Recall_test</th>\n",
|
|||
|
" <th id=\"T_115c5_level0_col4\" class=\"col_heading level0 col4\" >Accuracy_train</th>\n",
|
|||
|
" <th id=\"T_115c5_level0_col5\" class=\"col_heading level0 col5\" >Accuracy_test</th>\n",
|
|||
|
" <th id=\"T_115c5_level0_col6\" class=\"col_heading level0 col6\" >F1_train</th>\n",
|
|||
|
" <th id=\"T_115c5_level0_col7\" class=\"col_heading level0 col7\" >F1_test</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th id=\"T_115c5_level0_row0\" class=\"row_heading level0 row0\" >logistic</th>\n",
|
|||
|
" <td id=\"T_115c5_row0_col0\" class=\"data row0 col0\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row0_col1\" class=\"data row0 col1\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row0_col2\" class=\"data row0 col2\" >0.999767</td>\n",
|
|||
|
" <td id=\"T_115c5_row0_col3\" class=\"data row0 col3\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row0_col4\" class=\"data row0 col4\" >0.999884</td>\n",
|
|||
|
" <td id=\"T_115c5_row0_col5\" class=\"data row0 col5\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row0_col6\" class=\"data row0 col6\" >0.999884</td>\n",
|
|||
|
" <td id=\"T_115c5_row0_col7\" class=\"data row0 col7\" >1.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th id=\"T_115c5_level0_row1\" class=\"row_heading level0 row1\" >ridge</th>\n",
|
|||
|
" <td id=\"T_115c5_row1_col0\" class=\"data row1 col0\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row1_col1\" class=\"data row1 col1\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row1_col2\" class=\"data row1 col2\" >0.999651</td>\n",
|
|||
|
" <td id=\"T_115c5_row1_col3\" class=\"data row1 col3\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row1_col4\" class=\"data row1 col4\" >0.999826</td>\n",
|
|||
|
" <td id=\"T_115c5_row1_col5\" class=\"data row1 col5\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row1_col6\" class=\"data row1 col6\" >0.999826</td>\n",
|
|||
|
" <td id=\"T_115c5_row1_col7\" class=\"data row1 col7\" >1.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th id=\"T_115c5_level0_row2\" class=\"row_heading level0 row2\" >decision_tree</th>\n",
|
|||
|
" <td id=\"T_115c5_row2_col0\" class=\"data row2 col0\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row2_col1\" class=\"data row2 col1\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row2_col2\" class=\"data row2 col2\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row2_col3\" class=\"data row2 col3\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row2_col4\" class=\"data row2 col4\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row2_col5\" class=\"data row2 col5\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row2_col6\" class=\"data row2 col6\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row2_col7\" class=\"data row2 col7\" >1.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th id=\"T_115c5_level0_row3\" class=\"row_heading level0 row3\" >gradient_boosting</th>\n",
|
|||
|
" <td id=\"T_115c5_row3_col0\" class=\"data row3 col0\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row3_col1\" class=\"data row3 col1\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row3_col2\" class=\"data row3 col2\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row3_col3\" class=\"data row3 col3\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row3_col4\" class=\"data row3 col4\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row3_col5\" class=\"data row3 col5\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row3_col6\" class=\"data row3 col6\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row3_col7\" class=\"data row3 col7\" >1.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th id=\"T_115c5_level0_row4\" class=\"row_heading level0 row4\" >random_forest</th>\n",
|
|||
|
" <td id=\"T_115c5_row4_col0\" class=\"data row4 col0\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row4_col1\" class=\"data row4 col1\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row4_col2\" class=\"data row4 col2\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row4_col3\" class=\"data row4 col3\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row4_col4\" class=\"data row4 col4\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row4_col5\" class=\"data row4 col5\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row4_col6\" class=\"data row4 col6\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row4_col7\" class=\"data row4 col7\" >1.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th id=\"T_115c5_level0_row5\" class=\"row_heading level0 row5\" >naive_bayes</th>\n",
|
|||
|
" <td id=\"T_115c5_row5_col0\" class=\"data row5 col0\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row5_col1\" class=\"data row5 col1\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_115c5_row5_col2\" class=\"data row5 col2\" >0.786719</td>\n",
|
|||
|
" <td id=\"T_115c5_row5_col3\" class=\"data row5 col3\" >0.793953</td>\n",
|
|||
|
" <td id=\"T_115c5_row5_col4\" class=\"data row5 col4\" >0.893927</td>\n",
|
|||
|
" <td id=\"T_115c5_row5_col5\" class=\"data row5 col5\" >0.897525</td>\n",
|
|||
|
" <td id=\"T_115c5_row5_col6\" class=\"data row5 col6\" >0.880630</td>\n",
|
|||
|
" <td id=\"T_115c5_row5_col7\" class=\"data row5 col7\" >0.885144</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th id=\"T_115c5_level0_row6\" class=\"row_heading level0 row6\" >knn</th>\n",
|
|||
|
" <td id=\"T_115c5_row6_col0\" class=\"data row6 col0\" >0.872486</td>\n",
|
|||
|
" <td id=\"T_115c5_row6_col1\" class=\"data row6 col1\" >0.827473</td>\n",
|
|||
|
" <td id=\"T_115c5_row6_col2\" class=\"data row6 col2\" >0.857774</td>\n",
|
|||
|
" <td id=\"T_115c5_row6_col3\" class=\"data row6 col3\" >0.820930</td>\n",
|
|||
|
" <td id=\"T_115c5_row6_col4\" class=\"data row6 col4\" >0.866917</td>\n",
|
|||
|
" <td id=\"T_115c5_row6_col5\" class=\"data row6 col5\" >0.825815</td>\n",
|
|||
|
" <td id=\"T_115c5_row6_col6\" class=\"data row6 col6\" >0.865068</td>\n",
|
|||
|
" <td id=\"T_115c5_row6_col7\" class=\"data row6 col7\" >0.824189</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th id=\"T_115c5_level0_row7\" class=\"row_heading level0 row7\" >mlp</th>\n",
|
|||
|
" <td id=\"T_115c5_row7_col0\" class=\"data row7 col0\" >0.687500</td>\n",
|
|||
|
" <td id=\"T_115c5_row7_col1\" class=\"data row7 col1\" >0.615385</td>\n",
|
|||
|
" <td id=\"T_115c5_row7_col2\" class=\"data row7 col2\" >0.002558</td>\n",
|
|||
|
" <td id=\"T_115c5_row7_col3\" class=\"data row7 col3\" >0.003721</td>\n",
|
|||
|
" <td id=\"T_115c5_row7_col4\" class=\"data row7 col4\" >0.503355</td>\n",
|
|||
|
" <td id=\"T_115c5_row7_col5\" class=\"data row7 col5\" >0.503354</td>\n",
|
|||
|
" <td id=\"T_115c5_row7_col6\" class=\"data row7 col6\" >0.005098</td>\n",
|
|||
|
" <td id=\"T_115c5_row7_col7\" class=\"data row7 col7\" >0.007397</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
"<pandas.io.formats.style.Styler at 0x1be769da420>"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 11,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"class_metrics = pd.DataFrame.from_dict(class_models, \"index\")[\n",
|
|||
|
" [\n",
|
|||
|
" \"Precision_train\",\n",
|
|||
|
" \"Precision_test\",\n",
|
|||
|
" \"Recall_train\",\n",
|
|||
|
" \"Recall_test\",\n",
|
|||
|
" \"Accuracy_train\",\n",
|
|||
|
" \"Accuracy_test\",\n",
|
|||
|
" \"F1_train\",\n",
|
|||
|
" \"F1_test\",\n",
|
|||
|
" ]\n",
|
|||
|
"]\n",
|
|||
|
"class_metrics.sort_values(\n",
|
|||
|
" by=\"Accuracy_test\", ascending=False\n",
|
|||
|
").style.background_gradient(\n",
|
|||
|
" cmap=\"plasma\",\n",
|
|||
|
" low=0.3,\n",
|
|||
|
" high=1,\n",
|
|||
|
" subset=[\"Accuracy_train\", \"Accuracy_test\", \"F1_train\", \"F1_test\"],\n",
|
|||
|
").background_gradient(\n",
|
|||
|
" cmap=\"viridis\",\n",
|
|||
|
" low=1,\n",
|
|||
|
" high=0.3,\n",
|
|||
|
" subset=[\n",
|
|||
|
" \"Precision_train\",\n",
|
|||
|
" \"Precision_test\",\n",
|
|||
|
" \"Recall_train\",\n",
|
|||
|
" \"Recall_test\",\n",
|
|||
|
" ],\n",
|
|||
|
")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
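"Several models reach 100% accuracy. Perfect scores like these usually signal a problem rather than a perfect model. Overfitting would show up as a gap between the training and test metrics, but here the test scores are perfect as well, which more likely points to target leakage: the preprocessed feature table shown earlier still contains `above_median_price`, `price`, and `price_category`."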
|
|||
|
]
|
|||
|
},
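{
"cell_type": "markdown",
"metadata": {},
"source": [
"Perfect scores on both the training and the test split are worth checking for target leakage before they are trusted. The cell below is a minimal sketch of such a check and is not part of the original pipeline: `find_leaky_columns` is a hypothetical helper, the 0.95 threshold is an arbitrary assumption, and the data is a synthetic stand-in for the housing table. On the real data the same idea would be applied to `X_train` and `y_train`. A correlation filter only catches near-copies of the target, so columns the target was derived from, such as `price`, would still have to be excluded by hand."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"\n",
"def find_leaky_columns(features, target, threshold=0.95):\n",
"    # Hypothetical helper: numeric columns whose absolute Pearson correlation\n",
"    # with the target reaches the (assumed) threshold\n",
"    numeric = features.select_dtypes(include='number')\n",
"    correlations = numeric.corrwith(target).abs()\n",
"    return correlations[correlations >= threshold].index.tolist()\n",
"\n",
"\n",
"# Synthetic stand-in for the housing table (assumed values, not the real dataset)\n",
"rng = np.random.default_rng(0)\n",
"price = pd.Series(rng.uniform(100_000, 1_000_000, size=500))\n",
"demo_features = pd.DataFrame({\n",
"    'sqft_living': rng.normal(2000, 500, size=500),\n",
"    'price': price,\n",
"    'above_median_price': (price > price.median()).astype(int),\n",
"})\n",
"demo_target = demo_features['above_median_price']\n",
"\n",
"# A copy of the target left among the features is flagged by the check\n",
"print(find_leaky_columns(demo_features, demo_target))"
]
},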
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 12,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<style type=\"text/css\">\n",
|
|||
|
"#T_24a07_row0_col0, #T_24a07_row0_col1, #T_24a07_row1_col0, #T_24a07_row1_col1, #T_24a07_row2_col0, #T_24a07_row2_col1, #T_24a07_row3_col0, #T_24a07_row3_col1, #T_24a07_row4_col0, #T_24a07_row4_col1 {\n",
|
|||
|
" background-color: #a8db34;\n",
|
|||
|
" color: #000000;\n",
|
|||
|
"}\n",
|
|||
|
"#T_24a07_row0_col2, #T_24a07_row0_col3, #T_24a07_row0_col4, #T_24a07_row1_col2, #T_24a07_row1_col3, #T_24a07_row1_col4, #T_24a07_row2_col2, #T_24a07_row2_col3, #T_24a07_row2_col4, #T_24a07_row3_col2, #T_24a07_row3_col3, #T_24a07_row3_col4, #T_24a07_row4_col2, #T_24a07_row4_col3, #T_24a07_row4_col4, #T_24a07_row5_col2 {\n",
|
|||
|
" background-color: #da5a6a;\n",
|
|||
|
" color: #f1f1f1;\n",
|
|||
|
"}\n",
|
|||
|
"#T_24a07_row5_col0 {\n",
|
|||
|
" background-color: #6ece58;\n",
|
|||
|
" color: #000000;\n",
|
|||
|
"}\n",
|
|||
|
"#T_24a07_row5_col1 {\n",
|
|||
|
" background-color: #86d549;\n",
|
|||
|
" color: #000000;\n",
|
|||
|
"}\n",
|
|||
|
"#T_24a07_row5_col3 {\n",
|
|||
|
" background-color: #c5407e;\n",
|
|||
|
" color: #f1f1f1;\n",
|
|||
|
"}\n",
|
|||
|
"#T_24a07_row5_col4 {\n",
|
|||
|
" background-color: #c7427c;\n",
|
|||
|
" color: #f1f1f1;\n",
|
|||
|
"}\n",
|
|||
|
"#T_24a07_row6_col0 {\n",
|
|||
|
" background-color: #4cc26c;\n",
|
|||
|
" color: #000000;\n",
|
|||
|
"}\n",
|
|||
|
"#T_24a07_row6_col1 {\n",
|
|||
|
" background-color: #75d054;\n",
|
|||
|
" color: #000000;\n",
|
|||
|
"}\n",
|
|||
|
"#T_24a07_row6_col2 {\n",
|
|||
|
" background-color: #c8437b;\n",
|
|||
|
" color: #f1f1f1;\n",
|
|||
|
"}\n",
|
|||
|
"#T_24a07_row6_col3, #T_24a07_row6_col4 {\n",
|
|||
|
" background-color: #b42e8d;\n",
|
|||
|
" color: #f1f1f1;\n",
|
|||
|
"}\n",
|
|||
|
"#T_24a07_row7_col0, #T_24a07_row7_col1 {\n",
|
|||
|
" background-color: #26818e;\n",
|
|||
|
" color: #f1f1f1;\n",
|
|||
|
"}\n",
|
|||
|
"#T_24a07_row7_col2, #T_24a07_row7_col3, #T_24a07_row7_col4 {\n",
|
|||
|
" background-color: #4e02a2;\n",
|
|||
|
" color: #f1f1f1;\n",
|
|||
|
"}\n",
|
|||
|
"</style>\n",
|
|||
|
"<table id=\"T_24a07\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th class=\"blank level0\" > </th>\n",
|
|||
|
" <th id=\"T_24a07_level0_col0\" class=\"col_heading level0 col0\" >Accuracy_test</th>\n",
|
|||
|
" <th id=\"T_24a07_level0_col1\" class=\"col_heading level0 col1\" >F1_test</th>\n",
|
|||
|
" <th id=\"T_24a07_level0_col2\" class=\"col_heading level0 col2\" >ROC_AUC_test</th>\n",
|
|||
|
" <th id=\"T_24a07_level0_col3\" class=\"col_heading level0 col3\" >Cohen_kappa_test</th>\n",
|
|||
|
" <th id=\"T_24a07_level0_col4\" class=\"col_heading level0 col4\" >MCC_test</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th id=\"T_24a07_level0_row0\" class=\"row_heading level0 row0\" >logistic</th>\n",
|
|||
|
" <td id=\"T_24a07_row0_col0\" class=\"data row0 col0\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_24a07_row0_col1\" class=\"data row0 col1\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_24a07_row0_col2\" class=\"data row0 col2\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_24a07_row0_col3\" class=\"data row0 col3\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_24a07_row0_col4\" class=\"data row0 col4\" >1.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th id=\"T_24a07_level0_row1\" class=\"row_heading level0 row1\" >ridge</th>\n",
|
|||
|
" <td id=\"T_24a07_row1_col0\" class=\"data row1 col0\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_24a07_row1_col1\" class=\"data row1 col1\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_24a07_row1_col2\" class=\"data row1 col2\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_24a07_row1_col3\" class=\"data row1 col3\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_24a07_row1_col4\" class=\"data row1 col4\" >1.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th id=\"T_24a07_level0_row2\" class=\"row_heading level0 row2\" >decision_tree</th>\n",
|
|||
|
" <td id=\"T_24a07_row2_col0\" class=\"data row2 col0\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_24a07_row2_col1\" class=\"data row2 col1\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_24a07_row2_col2\" class=\"data row2 col2\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_24a07_row2_col3\" class=\"data row2 col3\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_24a07_row2_col4\" class=\"data row2 col4\" >1.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th id=\"T_24a07_level0_row3\" class=\"row_heading level0 row3\" >gradient_boosting</th>\n",
|
|||
|
" <td id=\"T_24a07_row3_col0\" class=\"data row3 col0\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_24a07_row3_col1\" class=\"data row3 col1\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_24a07_row3_col2\" class=\"data row3 col2\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_24a07_row3_col3\" class=\"data row3 col3\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_24a07_row3_col4\" class=\"data row3 col4\" >1.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th id=\"T_24a07_level0_row4\" class=\"row_heading level0 row4\" >random_forest</th>\n",
|
|||
|
" <td id=\"T_24a07_row4_col0\" class=\"data row4 col0\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_24a07_row4_col1\" class=\"data row4 col1\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_24a07_row4_col2\" class=\"data row4 col2\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_24a07_row4_col3\" class=\"data row4 col3\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_24a07_row4_col4\" class=\"data row4 col4\" >1.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th id=\"T_24a07_level0_row5\" class=\"row_heading level0 row5\" >naive_bayes</th>\n",
|
|||
|
" <td id=\"T_24a07_row5_col0\" class=\"data row5 col0\" >0.897525</td>\n",
|
|||
|
" <td id=\"T_24a07_row5_col1\" class=\"data row5 col1\" >0.885144</td>\n",
|
|||
|
" <td id=\"T_24a07_row5_col2\" class=\"data row5 col2\" >0.999566</td>\n",
|
|||
|
" <td id=\"T_24a07_row5_col3\" class=\"data row5 col3\" >0.794820</td>\n",
|
|||
|
" <td id=\"T_24a07_row5_col4\" class=\"data row5 col4\" >0.812098</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th id=\"T_24a07_level0_row6\" class=\"row_heading level0 row6\" >knn</th>\n",
|
|||
|
" <td id=\"T_24a07_row6_col0\" class=\"data row6 col0\" >0.825815</td>\n",
|
|||
|
" <td id=\"T_24a07_row6_col1\" class=\"data row6 col1\" >0.824189</td>\n",
|
|||
|
" <td id=\"T_24a07_row6_col2\" class=\"data row6 col2\" >0.910823</td>\n",
|
|||
|
" <td id=\"T_24a07_row6_col3\" class=\"data row6 col3\" >0.651606</td>\n",
|
|||
|
" <td id=\"T_24a07_row6_col4\" class=\"data row6 col4\" >0.651627</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th id=\"T_24a07_level0_row7\" class=\"row_heading level0 row7\" >mlp</th>\n",
|
|||
|
" <td id=\"T_24a07_row7_col0\" class=\"data row7 col0\" >0.503354</td>\n",
|
|||
|
" <td id=\"T_24a07_row7_col1\" class=\"data row7 col1\" >0.007397</td>\n",
|
|||
|
" <td id=\"T_24a07_row7_col2\" class=\"data row7 col2\" >0.497071</td>\n",
|
|||
|
" <td id=\"T_24a07_row7_col3\" class=\"data row7 col3\" >0.001427</td>\n",
|
|||
|
" <td id=\"T_24a07_row7_col4\" class=\"data row7 col4\" >0.012966</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
"<pandas.io.formats.style.Styler at 0x1be769d8d10>"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 12,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"class_metrics = pd.DataFrame.from_dict(class_models, \"index\")[\n",
|
|||
|
" [\n",
|
|||
|
" \"Accuracy_test\",\n",
|
|||
|
" \"F1_test\",\n",
|
|||
|
" \"ROC_AUC_test\",\n",
|
|||
|
" \"Cohen_kappa_test\",\n",
|
|||
|
" \"MCC_test\",\n",
|
|||
|
" ]\n",
|
|||
|
"]\n",
|
|||
|
"class_metrics.sort_values(by=\"ROC_AUC_test\", ascending=False).style.background_gradient(\n",
|
|||
|
" cmap=\"plasma\",\n",
|
|||
|
" low=0.3,\n",
|
|||
|
" high=1,\n",
|
|||
|
" subset=[\n",
|
|||
|
" \"ROC_AUC_test\",\n",
|
|||
|
" \"MCC_test\",\n",
|
|||
|
" \"Cohen_kappa_test\",\n",
|
|||
|
" ],\n",
|
|||
|
").background_gradient(\n",
|
|||
|
" cmap=\"viridis\",\n",
|
|||
|
" low=1,\n",
|
|||
|
" high=0.3,\n",
|
|||
|
" subset=[\n",
|
|||
|
" \"Accuracy_test\",\n",
|
|||
|
" \"F1_test\",\n",
|
|||
|
" ],\n",
|
|||
|
")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 13,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"'logistic'"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"best_model = str(class_metrics.sort_values(by=\"MCC_test\", ascending=False).iloc[0].name)\n",
|
|||
|
"\n",
|
|||
|
"display(best_model)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 17,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>id</th>\n",
|
|||
|
" <th>date</th>\n",
|
|||
|
" <th>price</th>\n",
|
|||
|
" <th>bedrooms</th>\n",
|
|||
|
" <th>bathrooms</th>\n",
|
|||
|
" <th>sqft_living</th>\n",
|
|||
|
" <th>sqft_lot</th>\n",
|
|||
|
" <th>floors</th>\n",
|
|||
|
" <th>waterfront</th>\n",
|
|||
|
" <th>view</th>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <th>sqft_basement</th>\n",
|
|||
|
" <th>yr_built</th>\n",
|
|||
|
" <th>yr_renovated</th>\n",
|
|||
|
" <th>zipcode</th>\n",
|
|||
|
" <th>lat</th>\n",
|
|||
|
" <th>long</th>\n",
|
|||
|
" <th>sqft_living15</th>\n",
|
|||
|
" <th>sqft_lot15</th>\n",
|
|||
|
" <th>above_median_price</th>\n",
|
|||
|
" <th>price_category</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>6863</th>\n",
|
|||
|
" <td>1124000050</td>\n",
|
|||
|
" <td>20140729T000000</td>\n",
|
|||
|
" <td>461000.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>1260</td>\n",
|
|||
|
" <td>8505</td>\n",
|
|||
|
" <td>1.5</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1951</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98177</td>\n",
|
|||
|
" <td>47.7181</td>\n",
|
|||
|
" <td>-122.371</td>\n",
|
|||
|
" <td>1480</td>\n",
|
|||
|
" <td>8100</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>1 rows × 23 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" id date price bedrooms bathrooms sqft_living \\\n",
|
|||
|
"6863 1124000050 20140729T000000 461000.0 4 1.0 1260 \n",
|
|||
|
"\n",
|
|||
|
" sqft_lot floors waterfront view ... sqft_basement yr_built yr_renovated \\\n",
|
|||
|
"6863 8505 1.5 0 0 ... 0 1951 0 \n",
|
|||
|
"\n",
|
|||
|
" zipcode lat long sqft_living15 sqft_lot15 above_median_price \\\n",
|
|||
|
"6863 98177 47.7181 -122.371 1480 8100 1 \n",
|
|||
|
"\n",
|
|||
|
" price_category \n",
|
|||
|
"6863 1 \n",
|
|||
|
"\n",
|
|||
|
"[1 rows x 23 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>sqft_living</th>\n",
|
|||
|
" <th>sqft_lot</th>\n",
|
|||
|
" <th>above_median_price</th>\n",
|
|||
|
" <th>id</th>\n",
|
|||
|
" <th>price</th>\n",
|
|||
|
" <th>bedrooms</th>\n",
|
|||
|
" <th>bathrooms</th>\n",
|
|||
|
" <th>floors</th>\n",
|
|||
|
" <th>waterfront</th>\n",
|
|||
|
" <th>view</th>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <th>sqft_basement</th>\n",
|
|||
|
" <th>yr_built</th>\n",
|
|||
|
" <th>yr_renovated</th>\n",
|
|||
|
" <th>zipcode</th>\n",
|
|||
|
" <th>lat</th>\n",
|
|||
|
" <th>long</th>\n",
|
|||
|
" <th>sqft_living15</th>\n",
|
|||
|
" <th>sqft_lot15</th>\n",
|
|||
|
" <th>price_category</th>\n",
|
|||
|
" <th>Living_area_to_Lot_ratio</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>6863</th>\n",
|
|||
|
" <td>-0.891006</td>\n",
|
|||
|
" <td>-0.162689</td>\n",
|
|||
|
" <td>1.005335</td>\n",
|
|||
|
" <td>1.124000e+09</td>\n",
|
|||
|
" <td>461000.0</td>\n",
|
|||
|
" <td>4.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>1.5</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1951.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>98177.0</td>\n",
|
|||
|
" <td>47.7181</td>\n",
|
|||
|
" <td>-122.371</td>\n",
|
|||
|
" <td>1480.0</td>\n",
|
|||
|
" <td>8100.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>5.476729</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>1 rows × 23 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" sqft_living sqft_lot above_median_price id price \\\n",
|
|||
|
"6863 -0.891006 -0.162689 1.005335 1.124000e+09 461000.0 \n",
|
|||
|
"\n",
|
|||
|
" bedrooms bathrooms floors waterfront view ... sqft_basement \\\n",
|
|||
|
"6863 4.0 1.0 1.5 0.0 0.0 ... 0.0 \n",
|
|||
|
"\n",
|
|||
|
" yr_built yr_renovated zipcode lat long sqft_living15 \\\n",
|
|||
|
"6863 1951.0 0.0 98177.0 47.7181 -122.371 1480.0 \n",
|
|||
|
"\n",
|
|||
|
" sqft_lot15 price_category Living_area_to_Lot_ratio \n",
|
|||
|
"6863 8100.0 1.0 5.476729 \n",
|
|||
|
"\n",
|
|||
|
"[1 rows x 23 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"'predicted: 1 (proba: [0. 1.])'"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"'real: 1'"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"preprocessing_result = pipeline_end.transform(X_test)\n",
|
|||
|
"preprocessed_df = pd.DataFrame(\n",
|
|||
|
" preprocessing_result,\n",
|
|||
|
" columns=pipeline_end.get_feature_names_out(),\n",
|
|||
|
")\n",
|
|||
|
"model = class_models[best_model][\"pipeline\"]\n",
|
|||
|
"\n",
|
|||
|
"example_id = 6863\n",
|
|||
|
"test = pd.DataFrame(X_test.loc[example_id, :]).T\n",
|
|||
|
"test_preprocessed = pd.DataFrame(preprocessed_df.loc[example_id, :]).T\n",
|
|||
|
"display(test)\n",
|
|||
|
"display(test_preprocessed)\n",
|
|||
|
"result_proba = model.predict_proba(test)[0]\n",
|
|||
|
"result = model.predict(test)[0]\n",
|
|||
|
"real = int(y_test.loc[example_id].values[0])\n",
|
|||
|
"display(f\"predicted: {result} (proba: {result_proba})\")\n",
|
|||
|
"display(f\"real: {real}\")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"# Новые гиперпараметры"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 20,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"c:\\Users\\user\\Desktop\\MII\\lab1para\\aim\\aimenv\\Lib\\site-packages\\numpy\\ma\\core.py:2881: RuntimeWarning: invalid value encountered in cast\n",
|
|||
|
" _data = np.array(data, dtype=dtype, copy=copy,\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from sklearn.model_selection import GridSearchCV\n",
|
|||
|
"\n",
|
|||
|
"optimized_model_type = \"random_forest\"\n",
|
|||
|
"\n",
|
|||
|
"random_forest_model = class_models[optimized_model_type][\"pipeline\"]\n",
|
|||
|
"\n",
|
|||
|
"param_grid = {\n",
|
|||
|
" \"model__n_estimators\": [10, 50, 100],\n",
|
|||
|
" \"model__max_features\": [\"sqrt\", \"log2\"],\n",
|
|||
|
" \"model__max_depth\": [5, 7, 10],\n",
|
|||
|
" \"model__criterion\": [\"gini\", \"entropy\"],\n",
|
|||
|
"}\n",
|
|||
|
"\n",
|
|||
|
"gs_optomizer = GridSearchCV(\n",
|
|||
|
" estimator=random_forest_model, param_grid=param_grid, n_jobs=-1\n",
|
|||
|
")\n",
|
|||
|
"gs_optomizer.fit(X_train, y_train.values.ravel())\n",
|
|||
|
"gs_optomizer.best_params_\n",
|
|||
|
"\n",
|
|||
|
"optimized_model = ensemble.RandomForestClassifier(\n",
|
|||
|
" random_state=random_state,\n",
|
|||
|
" criterion=\"gini\",\n",
|
|||
|
" max_depth=5,\n",
|
|||
|
" max_features=\"log2\",\n",
|
|||
|
" n_estimators=10,\n",
|
|||
|
")\n",
|
|||
|
"\n",
|
|||
|
"result = {}\n",
|
|||
|
"\n",
|
|||
|
"result[\"pipeline\"] = Pipeline([(\"pipeline\", pipeline_end), (\"model\", optimized_model)]).fit(X_train, y_train.values.ravel())\n",
|
|||
|
"result[\"train_preds\"] = result[\"pipeline\"].predict(X_train)\n",
|
|||
|
"result[\"probs\"] = result[\"pipeline\"].predict_proba(X_test)[:, 1]\n",
|
|||
|
"result[\"preds\"] = np.where(result[\"probs\"] > 0.5, 1, 0)\n",
|
|||
|
"\n",
|
|||
|
"result[\"Precision_train\"] = metrics.precision_score(y_train, result[\"train_preds\"])\n",
|
|||
|
"result[\"Precision_test\"] = metrics.precision_score(y_test, result[\"preds\"])\n",
|
|||
|
"result[\"Recall_train\"] = metrics.recall_score(y_train, result[\"train_preds\"])\n",
|
|||
|
"result[\"Recall_test\"] = metrics.recall_score(y_test, result[\"preds\"])\n",
|
|||
|
"result[\"Accuracy_train\"] = metrics.accuracy_score(y_train, result[\"train_preds\"])\n",
|
|||
|
"result[\"Accuracy_test\"] = metrics.accuracy_score(y_test, result[\"preds\"])\n",
|
|||
|
"result[\"ROC_AUC_test\"] = metrics.roc_auc_score(y_test, result[\"probs\"])\n",
|
|||
|
"result[\"F1_train\"] = metrics.f1_score(y_train, result[\"train_preds\"])\n",
|
|||
|
"result[\"F1_test\"] = metrics.f1_score(y_test, result[\"preds\"])\n",
|
|||
|
"result[\"MCC_test\"] = metrics.matthews_corrcoef(y_test, result[\"preds\"])\n",
|
|||
|
"result[\"Cohen_kappa_test\"] = metrics.cohen_kappa_score(y_test, result[\"preds\"])\n",
|
|||
|
"result[\"Confusion_matrix\"] = metrics.confusion_matrix(y_test, result[\"preds\"])"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 24,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<style type=\"text/css\">\n",
|
|||
|
"#T_c307c_row0_col0, #T_c307c_row0_col1, #T_c307c_row1_col0, #T_c307c_row1_col1 {\n",
|
|||
|
" background-color: #440154;\n",
|
|||
|
" color: #f1f1f1;\n",
|
|||
|
"}\n",
|
|||
|
"#T_c307c_row0_col2, #T_c307c_row0_col3, #T_c307c_row0_col4, #T_c307c_row1_col2, #T_c307c_row1_col3, #T_c307c_row1_col4 {\n",
|
|||
|
" background-color: #0d0887;\n",
|
|||
|
" color: #f1f1f1;\n",
|
|||
|
"}\n",
|
|||
|
"</style>\n",
|
|||
|
"<table id=\"T_c307c\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th class=\"blank level0\" > </th>\n",
|
|||
|
" <th id=\"T_c307c_level0_col0\" class=\"col_heading level0 col0\" >Accuracy_test</th>\n",
|
|||
|
" <th id=\"T_c307c_level0_col1\" class=\"col_heading level0 col1\" >F1_test</th>\n",
|
|||
|
" <th id=\"T_c307c_level0_col2\" class=\"col_heading level0 col2\" >ROC_AUC_test</th>\n",
|
|||
|
" <th id=\"T_c307c_level0_col3\" class=\"col_heading level0 col3\" >Cohen_kappa_test</th>\n",
|
|||
|
" <th id=\"T_c307c_level0_col4\" class=\"col_heading level0 col4\" >MCC_test</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th class=\"index_name level0\" >Name</th>\n",
|
|||
|
" <th class=\"blank col0\" > </th>\n",
|
|||
|
" <th class=\"blank col1\" > </th>\n",
|
|||
|
" <th class=\"blank col2\" > </th>\n",
|
|||
|
" <th class=\"blank col3\" > </th>\n",
|
|||
|
" <th class=\"blank col4\" > </th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th id=\"T_c307c_level0_row0\" class=\"row_heading level0 row0\" >Old</th>\n",
|
|||
|
" <td id=\"T_c307c_row0_col0\" class=\"data row0 col0\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_c307c_row0_col1\" class=\"data row0 col1\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_c307c_row0_col2\" class=\"data row0 col2\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_c307c_row0_col3\" class=\"data row0 col3\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_c307c_row0_col4\" class=\"data row0 col4\" >1.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th id=\"T_c307c_level0_row1\" class=\"row_heading level0 row1\" >New</th>\n",
|
|||
|
" <td id=\"T_c307c_row1_col0\" class=\"data row1 col0\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_c307c_row1_col1\" class=\"data row1 col1\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_c307c_row1_col2\" class=\"data row1 col2\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_c307c_row1_col3\" class=\"data row1 col3\" >1.000000</td>\n",
|
|||
|
" <td id=\"T_c307c_row1_col4\" class=\"data row1 col4\" >1.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
"<pandas.io.formats.style.Styler at 0x1be7a6418e0>"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 24,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"optimized_metrics = pd.DataFrame(columns=list(result.keys()))\n",
|
|||
|
"optimized_metrics.loc[len(optimized_metrics)] = pd.Series(\n",
|
|||
|
" data=class_models[optimized_model_type]\n",
|
|||
|
")\n",
|
|||
|
"optimized_metrics.loc[len(optimized_metrics)] = pd.Series(\n",
|
|||
|
" data=result\n",
|
|||
|
")\n",
|
|||
|
"optimized_metrics.insert(loc=0, column=\"Name\", value=[\"Old\", \"New\"])\n",
|
|||
|
"optimized_metrics = optimized_metrics.set_index(\"Name\")\n",
|
|||
|
"optimized_metrics[\n",
|
|||
|
" [\n",
|
|||
|
" \"Accuracy_test\",\n",
|
|||
|
" \"F1_test\",\n",
|
|||
|
" \"ROC_AUC_test\",\n",
|
|||
|
" \"Cohen_kappa_test\",\n",
|
|||
|
" \"MCC_test\",\n",
|
|||
|
" ]\n",
|
|||
|
"].style.background_gradient(\n",
|
|||
|
" cmap=\"plasma\",\n",
|
|||
|
" low=0.3,\n",
|
|||
|
" high=1,\n",
|
|||
|
" subset=[\n",
|
|||
|
" \"ROC_AUC_test\",\n",
|
|||
|
" \"MCC_test\",\n",
|
|||
|
" \"Cohen_kappa_test\",\n",
|
|||
|
" ],\n",
|
|||
|
").background_gradient(\n",
|
|||
|
" cmap=\"viridis\",\n",
|
|||
|
" low=1,\n",
|
|||
|
" high=0.3,\n",
|
|||
|
" subset=[\n",
|
|||
|
" \"Accuracy_test\",\n",
|
|||
|
" \"F1_test\",\n",
|
|||
|
" ],\n",
|
|||
|
")\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Мы видим изумительную точность новой модели. Модели не допускают никаких ошибок в предсказании."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"## Задача регресии: предсказание цены дома (price)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 25,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Среднее значение поля: 2079.8997362698374\n",
|
|||
|
" id date price bedrooms bathrooms sqft_living \\\n",
|
|||
|
"0 7129300520 20141013T000000 221900.0 3 1.00 1180 \n",
|
|||
|
"1 6414100192 20141209T000000 538000.0 3 2.25 2570 \n",
|
|||
|
"2 5631500400 20150225T000000 180000.0 2 1.00 770 \n",
|
|||
|
"3 2487200875 20141209T000000 604000.0 4 3.00 1960 \n",
|
|||
|
"4 1954400510 20150218T000000 510000.0 3 2.00 1680 \n",
|
|||
|
"\n",
|
|||
|
" sqft_lot floors waterfront view ... yr_built yr_renovated zipcode \\\n",
|
|||
|
"0 5650 1.0 0 0 ... 1955 0 98178 \n",
|
|||
|
"1 7242 2.0 0 0 ... 1951 1991 98125 \n",
|
|||
|
"2 10000 1.0 0 0 ... 1933 0 98028 \n",
|
|||
|
"3 5000 1.0 0 0 ... 1965 0 98136 \n",
|
|||
|
"4 8080 1.0 0 0 ... 1987 0 98074 \n",
|
|||
|
"\n",
|
|||
|
" lat long sqft_living15 sqft_lot15 above_median_price \\\n",
|
|||
|
"0 47.5112 -122.257 1340 5650 0 \n",
|
|||
|
"1 47.7210 -122.319 1690 7639 1 \n",
|
|||
|
"2 47.7379 -122.233 2720 8062 0 \n",
|
|||
|
"3 47.5208 -122.393 1360 5000 1 \n",
|
|||
|
"4 47.6168 -122.045 1800 7503 1 \n",
|
|||
|
"\n",
|
|||
|
" price_category average_price \n",
|
|||
|
"0 0 0 \n",
|
|||
|
"1 1 1 \n",
|
|||
|
"2 0 0 \n",
|
|||
|
"3 1 0 \n",
|
|||
|
"4 1 0 \n",
|
|||
|
"\n",
|
|||
|
"[5 rows x 24 columns]\n",
|
|||
|
"Статистическое описание DataFrame:\n",
|
|||
|
" id price bedrooms bathrooms sqft_living \\\n",
|
|||
|
"count 2.161300e+04 2.161300e+04 21613.000000 21613.000000 21613.000000 \n",
|
|||
|
"mean 4.580302e+09 5.400881e+05 3.370842 2.114757 2079.899736 \n",
|
|||
|
"std 2.876566e+09 3.671272e+05 0.930062 0.770163 918.440897 \n",
|
|||
|
"min 1.000102e+06 7.500000e+04 0.000000 0.000000 290.000000 \n",
|
|||
|
"25% 2.123049e+09 3.219500e+05 3.000000 1.750000 1427.000000 \n",
|
|||
|
"50% 3.904930e+09 4.500000e+05 3.000000 2.250000 1910.000000 \n",
|
|||
|
"75% 7.308900e+09 6.450000e+05 4.000000 2.500000 2550.000000 \n",
|
|||
|
"max 9.900000e+09 7.700000e+06 33.000000 8.000000 13540.000000 \n",
|
|||
|
"\n",
|
|||
|
" sqft_lot floors waterfront view condition \\\n",
|
|||
|
"count 2.161300e+04 21613.000000 21613.000000 21613.000000 21613.000000 \n",
|
|||
|
"mean 1.510697e+04 1.494309 0.007542 0.234303 3.409430 \n",
|
|||
|
"std 4.142051e+04 0.539989 0.086517 0.766318 0.650743 \n",
|
|||
|
"min 5.200000e+02 1.000000 0.000000 0.000000 1.000000 \n",
|
|||
|
"25% 5.040000e+03 1.000000 0.000000 0.000000 3.000000 \n",
|
|||
|
"50% 7.618000e+03 1.500000 0.000000 0.000000 3.000000 \n",
|
|||
|
"75% 1.068800e+04 2.000000 0.000000 0.000000 4.000000 \n",
|
|||
|
"max 1.651359e+06 3.500000 1.000000 4.000000 5.000000 \n",
|
|||
|
"\n",
|
|||
|
" ... sqft_basement yr_built yr_renovated zipcode \\\n",
|
|||
|
"count ... 21613.000000 21613.000000 21613.000000 21613.000000 \n",
|
|||
|
"mean ... 291.509045 1971.005136 84.402258 98077.939805 \n",
|
|||
|
"std ... 442.575043 29.373411 401.679240 53.505026 \n",
|
|||
|
"min ... 0.000000 1900.000000 0.000000 98001.000000 \n",
|
|||
|
"25% ... 0.000000 1951.000000 0.000000 98033.000000 \n",
|
|||
|
"50% ... 0.000000 1975.000000 0.000000 98065.000000 \n",
|
|||
|
"75% ... 560.000000 1997.000000 0.000000 98118.000000 \n",
|
|||
|
"max ... 4820.000000 2015.000000 2015.000000 98199.000000 \n",
|
|||
|
"\n",
|
|||
|
" lat long sqft_living15 sqft_lot15 \\\n",
|
|||
|
"count 21613.000000 21613.000000 21613.000000 21613.000000 \n",
|
|||
|
"mean 47.560053 -122.213896 1986.552492 12768.455652 \n",
|
|||
|
"std 0.138564 0.140828 685.391304 27304.179631 \n",
|
|||
|
"min 47.155900 -122.519000 399.000000 651.000000 \n",
|
|||
|
"25% 47.471000 -122.328000 1490.000000 5100.000000 \n",
|
|||
|
"50% 47.571800 -122.230000 1840.000000 7620.000000 \n",
|
|||
|
"75% 47.678000 -122.125000 2360.000000 10083.000000 \n",
|
|||
|
"max 47.777600 -121.315000 6210.000000 871200.000000 \n",
|
|||
|
"\n",
|
|||
|
" above_median_price average_price \n",
|
|||
|
"count 21613.000000 21613.00000 \n",
|
|||
|
"mean 0.497340 0.42752 \n",
|
|||
|
"std 0.500004 0.49473 \n",
|
|||
|
"min 0.000000 0.00000 \n",
|
|||
|
"25% 0.000000 0.00000 \n",
|
|||
|
"50% 0.000000 0.00000 \n",
|
|||
|
"75% 1.000000 1.00000 \n",
|
|||
|
"max 1.000000 1.00000 \n",
|
|||
|
"\n",
|
|||
|
"[8 rows x 22 columns]\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"from sklearn import set_config\n",
|
|||
|
"\n",
|
|||
|
"set_config(transform_output=\"pandas\")\n",
|
|||
|
"\n",
|
|||
|
"random_state = 42\n",
|
|||
|
"\n",
|
|||
|
"average_price = df['sqft_living'].mean()\n",
|
|||
|
"print(f\"Среднее значение поля: {average_price}\")\n",
|
|||
|
"\n",
|
|||
|
"# Создание новой колонки, указывающей, выше или ниже среднего значение цена закрытия\n",
|
|||
|
"df['average_price'] = (df['sqft_living'] > average_price).astype(int)\n",
|
|||
|
"\n",
|
|||
|
"# Удаление последней строки, где нет значения для следующего дня\n",
|
|||
|
"df.dropna(inplace=True)\n",
|
|||
|
"\n",
|
|||
|
"print(df.head())\n",
|
|||
|
"\n",
|
|||
|
"print(\"Статистическое описание DataFrame:\")\n",
|
|||
|
"print(df.describe())"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"'X_train'"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>id</th>\n",
|
|||
|
" <th>date</th>\n",
|
|||
|
" <th>price</th>\n",
|
|||
|
" <th>bedrooms</th>\n",
|
|||
|
" <th>bathrooms</th>\n",
|
|||
|
" <th>sqft_living</th>\n",
|
|||
|
" <th>sqft_lot</th>\n",
|
|||
|
" <th>floors</th>\n",
|
|||
|
" <th>waterfront</th>\n",
|
|||
|
" <th>view</th>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <th>sqft_basement</th>\n",
|
|||
|
" <th>yr_built</th>\n",
|
|||
|
" <th>yr_renovated</th>\n",
|
|||
|
" <th>zipcode</th>\n",
|
|||
|
" <th>lat</th>\n",
|
|||
|
" <th>long</th>\n",
|
|||
|
" <th>sqft_living15</th>\n",
|
|||
|
" <th>sqft_lot15</th>\n",
|
|||
|
" <th>above_median_price</th>\n",
|
|||
|
" <th>price_category</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>6325</th>\n",
|
|||
|
" <td>5467910190</td>\n",
|
|||
|
" <td>20140527T000000</td>\n",
|
|||
|
" <td>325000.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>1.75</td>\n",
|
|||
|
" <td>1780</td>\n",
|
|||
|
" <td>13095</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1983</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98042</td>\n",
|
|||
|
" <td>47.3670</td>\n",
|
|||
|
" <td>-122.152</td>\n",
|
|||
|
" <td>2750</td>\n",
|
|||
|
" <td>13095</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>13473</th>\n",
|
|||
|
" <td>9331800580</td>\n",
|
|||
|
" <td>20150310T000000</td>\n",
|
|||
|
" <td>257000.0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>1.00</td>\n",
|
|||
|
" <td>1000</td>\n",
|
|||
|
" <td>3700</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>200</td>\n",
|
|||
|
" <td>1929</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98118</td>\n",
|
|||
|
" <td>47.5520</td>\n",
|
|||
|
" <td>-122.290</td>\n",
|
|||
|
" <td>1270</td>\n",
|
|||
|
" <td>5000</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>17614</th>\n",
|
|||
|
" <td>2407000405</td>\n",
|
|||
|
" <td>20150226T000000</td>\n",
|
|||
|
" <td>228500.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>1.00</td>\n",
|
|||
|
" <td>1080</td>\n",
|
|||
|
" <td>7486</td>\n",
|
|||
|
" <td>1.5</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>90</td>\n",
|
|||
|
" <td>1942</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98146</td>\n",
|
|||
|
" <td>47.4838</td>\n",
|
|||
|
" <td>-122.335</td>\n",
|
|||
|
" <td>1170</td>\n",
|
|||
|
" <td>7800</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>16970</th>\n",
|
|||
|
" <td>5466700290</td>\n",
|
|||
|
" <td>20150108T000000</td>\n",
|
|||
|
" <td>288000.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>2.25</td>\n",
|
|||
|
" <td>2090</td>\n",
|
|||
|
" <td>7500</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>810</td>\n",
|
|||
|
" <td>1977</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98031</td>\n",
|
|||
|
" <td>47.3951</td>\n",
|
|||
|
" <td>-122.172</td>\n",
|
|||
|
" <td>1800</td>\n",
|
|||
|
" <td>7350</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>20868</th>\n",
|
|||
|
" <td>3026059361</td>\n",
|
|||
|
" <td>20150417T000000</td>\n",
|
|||
|
" <td>479000.0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>2.50</td>\n",
|
|||
|
" <td>1741</td>\n",
|
|||
|
" <td>1439</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>295</td>\n",
|
|||
|
" <td>2007</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98034</td>\n",
|
|||
|
" <td>47.7043</td>\n",
|
|||
|
" <td>-122.209</td>\n",
|
|||
|
" <td>2090</td>\n",
|
|||
|
" <td>10454</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>11964</th>\n",
|
|||
|
" <td>5272200045</td>\n",
|
|||
|
" <td>20141113T000000</td>\n",
|
|||
|
" <td>378000.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>1.50</td>\n",
|
|||
|
" <td>1000</td>\n",
|
|||
|
" <td>6914</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1947</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98125</td>\n",
|
|||
|
" <td>47.7144</td>\n",
|
|||
|
" <td>-122.319</td>\n",
|
|||
|
" <td>1000</td>\n",
|
|||
|
" <td>6947</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>21575</th>\n",
|
|||
|
" <td>9578500790</td>\n",
|
|||
|
" <td>20141111T000000</td>\n",
|
|||
|
" <td>399950.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>2.50</td>\n",
|
|||
|
" <td>3087</td>\n",
|
|||
|
" <td>5002</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2014</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98023</td>\n",
|
|||
|
" <td>47.2974</td>\n",
|
|||
|
" <td>-122.349</td>\n",
|
|||
|
" <td>2927</td>\n",
|
|||
|
" <td>5183</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5390</th>\n",
|
|||
|
" <td>7202350480</td>\n",
|
|||
|
" <td>20140930T000000</td>\n",
|
|||
|
" <td>575000.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>2.50</td>\n",
|
|||
|
" <td>2120</td>\n",
|
|||
|
" <td>4780</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2004</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98053</td>\n",
|
|||
|
" <td>47.6810</td>\n",
|
|||
|
" <td>-122.032</td>\n",
|
|||
|
" <td>1690</td>\n",
|
|||
|
" <td>2650</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>860</th>\n",
|
|||
|
" <td>1723049033</td>\n",
|
|||
|
" <td>20140620T000000</td>\n",
|
|||
|
" <td>245000.0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0.75</td>\n",
|
|||
|
" <td>380</td>\n",
|
|||
|
" <td>15000</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1963</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98168</td>\n",
|
|||
|
" <td>47.4810</td>\n",
|
|||
|
" <td>-122.323</td>\n",
|
|||
|
" <td>1170</td>\n",
|
|||
|
" <td>15000</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>15795</th>\n",
|
|||
|
" <td>6147650280</td>\n",
|
|||
|
" <td>20150325T000000</td>\n",
|
|||
|
" <td>315000.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>2.50</td>\n",
|
|||
|
" <td>3130</td>\n",
|
|||
|
" <td>5999</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2006</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98042</td>\n",
|
|||
|
" <td>47.3837</td>\n",
|
|||
|
" <td>-122.099</td>\n",
|
|||
|
" <td>3020</td>\n",
|
|||
|
" <td>5997</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>17290 rows × 23 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" id date price bedrooms bathrooms \\\n",
|
|||
|
"6325 5467910190 20140527T000000 325000.0 3 1.75 \n",
|
|||
|
"13473 9331800580 20150310T000000 257000.0 2 1.00 \n",
|
|||
|
"17614 2407000405 20150226T000000 228500.0 3 1.00 \n",
|
|||
|
"16970 5466700290 20150108T000000 288000.0 3 2.25 \n",
|
|||
|
"20868 3026059361 20150417T000000 479000.0 2 2.50 \n",
|
|||
|
"... ... ... ... ... ... \n",
|
|||
|
"11964 5272200045 20141113T000000 378000.0 3 1.50 \n",
|
|||
|
"21575 9578500790 20141111T000000 399950.0 3 2.50 \n",
|
|||
|
"5390 7202350480 20140930T000000 575000.0 3 2.50 \n",
|
|||
|
"860 1723049033 20140620T000000 245000.0 1 0.75 \n",
|
|||
|
"15795 6147650280 20150325T000000 315000.0 4 2.50 \n",
|
|||
|
"\n",
|
|||
|
" sqft_living sqft_lot floors waterfront view ... sqft_basement \\\n",
|
|||
|
"6325 1780 13095 1.0 0 0 ... 0 \n",
|
|||
|
"13473 1000 3700 1.0 0 0 ... 200 \n",
|
|||
|
"17614 1080 7486 1.5 0 0 ... 90 \n",
|
|||
|
"16970 2090 7500 1.0 0 0 ... 810 \n",
|
|||
|
"20868 1741 1439 2.0 0 0 ... 295 \n",
|
|||
|
"... ... ... ... ... ... ... ... \n",
|
|||
|
"11964 1000 6914 1.0 0 0 ... 0 \n",
|
|||
|
"21575 3087 5002 2.0 0 0 ... 0 \n",
|
|||
|
"5390 2120 4780 2.0 0 0 ... 0 \n",
|
|||
|
"860 380 15000 1.0 0 0 ... 0 \n",
|
|||
|
"15795 3130 5999 2.0 0 0 ... 0 \n",
|
|||
|
"\n",
|
|||
|
" yr_built yr_renovated zipcode lat long sqft_living15 \\\n",
|
|||
|
"6325 1983 0 98042 47.3670 -122.152 2750 \n",
|
|||
|
"13473 1929 0 98118 47.5520 -122.290 1270 \n",
|
|||
|
"17614 1942 0 98146 47.4838 -122.335 1170 \n",
|
|||
|
"16970 1977 0 98031 47.3951 -122.172 1800 \n",
|
|||
|
"20868 2007 0 98034 47.7043 -122.209 2090 \n",
|
|||
|
"... ... ... ... ... ... ... \n",
|
|||
|
"11964 1947 0 98125 47.7144 -122.319 1000 \n",
|
|||
|
"21575 2014 0 98023 47.2974 -122.349 2927 \n",
|
|||
|
"5390 2004 0 98053 47.6810 -122.032 1690 \n",
|
|||
|
"860 1963 0 98168 47.4810 -122.323 1170 \n",
|
|||
|
"15795 2006 0 98042 47.3837 -122.099 3020 \n",
|
|||
|
"\n",
|
|||
|
" sqft_lot15 above_median_price price_category \n",
|
|||
|
"6325 13095 0 1 \n",
|
|||
|
"13473 5000 0 0 \n",
|
|||
|
"17614 7800 0 0 \n",
|
|||
|
"16970 7350 0 0 \n",
|
|||
|
"20868 10454 1 1 \n",
|
|||
|
"... ... ... ... \n",
|
|||
|
"11964 6947 0 1 \n",
|
|||
|
"21575 5183 0 1 \n",
|
|||
|
"5390 2650 1 1 \n",
|
|||
|
"860 15000 0 0 \n",
|
|||
|
"15795 5997 0 1 \n",
|
|||
|
"\n",
|
|||
|
"[17290 rows x 23 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"'y_train'"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>average_price</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>6325</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>13473</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>17614</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>16970</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>20868</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>11964</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>21575</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5390</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>860</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>15795</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>17290 rows × 1 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" average_price\n",
|
|||
|
"6325 0\n",
|
|||
|
"13473 0\n",
|
|||
|
"17614 0\n",
|
|||
|
"16970 1\n",
|
|||
|
"20868 0\n",
|
|||
|
"... ...\n",
|
|||
|
"11964 0\n",
|
|||
|
"21575 1\n",
|
|||
|
"5390 1\n",
|
|||
|
"860 0\n",
|
|||
|
"15795 1\n",
|
|||
|
"\n",
|
|||
|
"[17290 rows x 1 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"'X_test'"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>id</th>\n",
|
|||
|
" <th>date</th>\n",
|
|||
|
" <th>price</th>\n",
|
|||
|
" <th>bedrooms</th>\n",
|
|||
|
" <th>bathrooms</th>\n",
|
|||
|
" <th>sqft_living</th>\n",
|
|||
|
" <th>sqft_lot</th>\n",
|
|||
|
" <th>floors</th>\n",
|
|||
|
" <th>waterfront</th>\n",
|
|||
|
" <th>view</th>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <th>sqft_basement</th>\n",
|
|||
|
" <th>yr_built</th>\n",
|
|||
|
" <th>yr_renovated</th>\n",
|
|||
|
" <th>zipcode</th>\n",
|
|||
|
" <th>lat</th>\n",
|
|||
|
" <th>long</th>\n",
|
|||
|
" <th>sqft_living15</th>\n",
|
|||
|
" <th>sqft_lot15</th>\n",
|
|||
|
" <th>above_median_price</th>\n",
|
|||
|
" <th>price_category</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>735</th>\n",
|
|||
|
" <td>2591820310</td>\n",
|
|||
|
" <td>20141006T000000</td>\n",
|
|||
|
" <td>365000.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>2.25</td>\n",
|
|||
|
" <td>2070</td>\n",
|
|||
|
" <td>8893</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1986</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98058</td>\n",
|
|||
|
" <td>47.4388</td>\n",
|
|||
|
" <td>-122.162</td>\n",
|
|||
|
" <td>2390</td>\n",
|
|||
|
" <td>7700</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2830</th>\n",
|
|||
|
" <td>7974200820</td>\n",
|
|||
|
" <td>20140821T000000</td>\n",
|
|||
|
" <td>865000.0</td>\n",
|
|||
|
" <td>5</td>\n",
|
|||
|
" <td>3.00</td>\n",
|
|||
|
" <td>2900</td>\n",
|
|||
|
" <td>6730</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>1070</td>\n",
|
|||
|
" <td>1977</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98115</td>\n",
|
|||
|
" <td>47.6784</td>\n",
|
|||
|
" <td>-122.285</td>\n",
|
|||
|
" <td>2370</td>\n",
|
|||
|
" <td>6283</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4106</th>\n",
|
|||
|
" <td>7701450110</td>\n",
|
|||
|
" <td>20140815T000000</td>\n",
|
|||
|
" <td>1038000.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>2.50</td>\n",
|
|||
|
" <td>3770</td>\n",
|
|||
|
" <td>10893</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1997</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98006</td>\n",
|
|||
|
" <td>47.5646</td>\n",
|
|||
|
" <td>-122.129</td>\n",
|
|||
|
" <td>3710</td>\n",
|
|||
|
" <td>9685</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>16218</th>\n",
|
|||
|
" <td>9522300010</td>\n",
|
|||
|
" <td>20150331T000000</td>\n",
|
|||
|
" <td>1490000.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>3.50</td>\n",
|
|||
|
" <td>4560</td>\n",
|
|||
|
" <td>14608</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1990</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98034</td>\n",
|
|||
|
" <td>47.6995</td>\n",
|
|||
|
" <td>-122.228</td>\n",
|
|||
|
" <td>4050</td>\n",
|
|||
|
" <td>14226</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>19964</th>\n",
|
|||
|
" <td>9510861140</td>\n",
|
|||
|
" <td>20140714T000000</td>\n",
|
|||
|
" <td>711000.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>2.50</td>\n",
|
|||
|
" <td>2550</td>\n",
|
|||
|
" <td>5376</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2004</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98052</td>\n",
|
|||
|
" <td>47.6647</td>\n",
|
|||
|
" <td>-122.083</td>\n",
|
|||
|
" <td>2250</td>\n",
|
|||
|
" <td>4050</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>13674</th>\n",
|
|||
|
" <td>6163900333</td>\n",
|
|||
|
" <td>20141110T000000</td>\n",
|
|||
|
" <td>338000.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>1.75</td>\n",
|
|||
|
" <td>1250</td>\n",
|
|||
|
" <td>7710</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1947</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98155</td>\n",
|
|||
|
" <td>47.7623</td>\n",
|
|||
|
" <td>-122.317</td>\n",
|
|||
|
" <td>1340</td>\n",
|
|||
|
" <td>7710</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>20377</th>\n",
|
|||
|
" <td>3528960020</td>\n",
|
|||
|
" <td>20140708T000000</td>\n",
|
|||
|
" <td>673000.0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>2.75</td>\n",
|
|||
|
" <td>2830</td>\n",
|
|||
|
" <td>3496</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2012</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98029</td>\n",
|
|||
|
" <td>47.5606</td>\n",
|
|||
|
" <td>-122.011</td>\n",
|
|||
|
" <td>2160</td>\n",
|
|||
|
" <td>3501</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>8805</th>\n",
|
|||
|
" <td>1687000220</td>\n",
|
|||
|
" <td>20141016T000000</td>\n",
|
|||
|
" <td>285000.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>2.50</td>\n",
|
|||
|
" <td>2434</td>\n",
|
|||
|
" <td>4400</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2007</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98001</td>\n",
|
|||
|
" <td>47.2874</td>\n",
|
|||
|
" <td>-122.283</td>\n",
|
|||
|
" <td>2434</td>\n",
|
|||
|
" <td>4400</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>10168</th>\n",
|
|||
|
" <td>4141400030</td>\n",
|
|||
|
" <td>20141201T000000</td>\n",
|
|||
|
" <td>605000.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>1.75</td>\n",
|
|||
|
" <td>2250</td>\n",
|
|||
|
" <td>10108</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1967</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98008</td>\n",
|
|||
|
" <td>47.5922</td>\n",
|
|||
|
" <td>-122.118</td>\n",
|
|||
|
" <td>2050</td>\n",
|
|||
|
" <td>9750</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2522</th>\n",
|
|||
|
" <td>1822500160</td>\n",
|
|||
|
" <td>20141212T000000</td>\n",
|
|||
|
" <td>356500.0</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>2.50</td>\n",
|
|||
|
" <td>2570</td>\n",
|
|||
|
" <td>11473</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2008</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>98003</td>\n",
|
|||
|
" <td>47.2809</td>\n",
|
|||
|
" <td>-122.296</td>\n",
|
|||
|
" <td>2430</td>\n",
|
|||
|
" <td>5997</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>4323 rows × 23 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" id date price bedrooms bathrooms \\\n",
|
|||
|
"735 2591820310 20141006T000000 365000.0 4 2.25 \n",
|
|||
|
"2830 7974200820 20140821T000000 865000.0 5 3.00 \n",
|
|||
|
"4106 7701450110 20140815T000000 1038000.0 4 2.50 \n",
|
|||
|
"16218 9522300010 20150331T000000 1490000.0 3 3.50 \n",
|
|||
|
"19964 9510861140 20140714T000000 711000.0 3 2.50 \n",
|
|||
|
"... ... ... ... ... ... \n",
|
|||
|
"13674 6163900333 20141110T000000 338000.0 3 1.75 \n",
|
|||
|
"20377 3528960020 20140708T000000 673000.0 3 2.75 \n",
|
|||
|
"8805 1687000220 20141016T000000 285000.0 4 2.50 \n",
|
|||
|
"10168 4141400030 20141201T000000 605000.0 4 1.75 \n",
|
|||
|
"2522 1822500160 20141212T000000 356500.0 4 2.50 \n",
|
|||
|
"\n",
|
|||
|
" sqft_living sqft_lot floors waterfront view ... sqft_basement \\\n",
|
|||
|
"735 2070 8893 2.0 0 0 ... 0 \n",
|
|||
|
"2830 2900 6730 1.0 0 0 ... 1070 \n",
|
|||
|
"4106 3770 10893 2.0 0 2 ... 0 \n",
|
|||
|
"16218 4560 14608 2.0 0 2 ... 0 \n",
|
|||
|
"19964 2550 5376 2.0 0 0 ... 0 \n",
|
|||
|
"... ... ... ... ... ... ... ... \n",
|
|||
|
"13674 1250 7710 1.0 0 0 ... 0 \n",
|
|||
|
"20377 2830 3496 2.0 0 0 ... 0 \n",
|
|||
|
"8805 2434 4400 2.0 0 0 ... 0 \n",
|
|||
|
"10168 2250 10108 1.0 0 0 ... 0 \n",
|
|||
|
"2522 2570 11473 2.0 0 0 ... 0 \n",
|
|||
|
"\n",
|
|||
|
" yr_built yr_renovated zipcode lat long sqft_living15 \\\n",
|
|||
|
"735 1986 0 98058 47.4388 -122.162 2390 \n",
|
|||
|
"2830 1977 0 98115 47.6784 -122.285 2370 \n",
|
|||
|
"4106 1997 0 98006 47.5646 -122.129 3710 \n",
|
|||
|
"16218 1990 0 98034 47.6995 -122.228 4050 \n",
|
|||
|
"19964 2004 0 98052 47.6647 -122.083 2250 \n",
|
|||
|
"... ... ... ... ... ... ... \n",
|
|||
|
"13674 1947 0 98155 47.7623 -122.317 1340 \n",
|
|||
|
"20377 2012 0 98029 47.5606 -122.011 2160 \n",
|
|||
|
"8805 2007 0 98001 47.2874 -122.283 2434 \n",
|
|||
|
"10168 1967 0 98008 47.5922 -122.118 2050 \n",
|
|||
|
"2522 2008 0 98003 47.2809 -122.296 2430 \n",
|
|||
|
"\n",
|
|||
|
" sqft_lot15 above_median_price price_category \n",
|
|||
|
"735 7700 0 1 \n",
|
|||
|
"2830 6283 1 2 \n",
|
|||
|
"4106 9685 1 2 \n",
|
|||
|
"16218 14226 1 2 \n",
|
|||
|
"19964 4050 1 2 \n",
|
|||
|
"... ... ... ... \n",
|
|||
|
"13674 7710 0 1 \n",
|
|||
|
"20377 3501 1 1 \n",
|
|||
|
"8805 4400 0 0 \n",
|
|||
|
"10168 9750 1 1 \n",
|
|||
|
"2522 5997 0 1 \n",
|
|||
|
"\n",
|
|||
|
"[4323 rows x 23 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"'y_test'"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>average_price</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>735</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2830</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4106</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>16218</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>19964</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>13674</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>20377</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>8805</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>10168</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2522</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>4323 rows × 1 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" average_price\n",
|
|||
|
"735 0\n",
|
|||
|
"2830 1\n",
|
|||
|
"4106 1\n",
|
|||
|
"16218 1\n",
|
|||
|
"19964 1\n",
|
|||
|
"... ...\n",
|
|||
|
"13674 0\n",
|
|||
|
"20377 1\n",
|
|||
|
"8805 1\n",
|
|||
|
"10168 1\n",
|
|||
|
"2522 1\n",
|
|||
|
"\n",
|
|||
|
"[4323 rows x 1 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"from typing import Optional, Tuple\n",
"\n",
"from pandas import DataFrame\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"\n",
"def split_into_train_test(\n",
"    df_input: DataFrame,\n",
"    target_colname: str = \"average_price\",\n",
"    frac_train: float = 0.8,\n",
"    random_state: Optional[int] = None,\n",
") -> Tuple[DataFrame, DataFrame, DataFrame, DataFrame]:\n",
"    # Validate the requested training fraction\n",
"    if not (0 < frac_train < 1):\n",
"        raise ValueError(\"Fraction must be between 0 and 1.\")\n",
"\n",
"    # Check that the target column is present\n",
"    if target_colname not in df_input.columns:\n",
"        raise ValueError(f\"{target_colname} is not a column in the DataFrame.\")\n",
"\n",
"    # Separate the features from the target variable\n",
"    X = df_input.drop(columns=[target_colname])  # features\n",
"    y = df_input[[target_colname]]  # target, kept as a one-column DataFrame\n",
"\n",
"    # Split the data into training and test sets\n",
"    X_train, X_test, y_train, y_test = train_test_split(\n",
"        X, y,\n",
"        test_size=(1.0 - frac_train),\n",
"        random_state=random_state\n",
"    )\n",
"\n",
"    return X_train, X_test, y_train, y_test\n",
"\n",
"\n",
"# Apply the function to split the data\n",
"X_train, X_test, y_train, y_test = split_into_train_test(\n",
"    df,\n",
"    target_colname=\"average_price\",\n",
"    frac_train=0.8,\n",
"    random_state=42  # make sure the desired random_state is set\n",
")\n",
"\n",
"display(\"X_train\", X_train)\n",
"display(\"y_train\", y_train)\n",
"\n",
"display(\"X_test\", X_test)\n",
"display(\"y_test\", y_test)"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"## Building a pipeline for the regression task"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 27,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
"import numpy as np\n",
"from sklearn.base import BaseEstimator, TransformerMixin\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"from sklearn.ensemble import RandomForestRegressor  # example regression model\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.pipeline import make_pipeline\n",
"\n",
"\n",
"class HouseFeatures(BaseEstimator, TransformerMixin):\n",
"    def __init__(self):\n",
"        pass\n",
"\n",
"    def fit(self, X, y=None):\n",
"        return self\n",
"\n",
"    def transform(self, X, y=None):\n",
"        # Create new features: ratio of living area to lot area.\n",
"        # X is indexed by column name, so it must be a DataFrame.\n",
"        X = X.copy()\n",
"        X[\"Square\"] = X[\"sqft_living\"] / X[\"sqft_lot\"]\n",
"        return X\n",
"\n",
"    def get_feature_names_out(self, features_in):\n",
"        # Append the names of the new features\n",
"        new_features = [\"Square\"]\n",
"        return np.append(features_in, new_features, axis=0)\n",
"\n",
"\n",
"# Columns to drop and columns to preprocess\n",
"columns_to_drop = [\"date\"]\n",
"num_columns = [\"bathrooms\", \"floors\", \"waterfront\", \"view\"]\n",
"cat_columns = []\n",
"\n",
"# Preprocessing for numeric data\n",
"num_imputer = SimpleImputer(strategy=\"median\")\n",
"num_scaler = StandardScaler()\n",
"preprocessing_num = Pipeline(\n",
"    [\n",
"        (\"imputer\", num_imputer),\n",
"        (\"scaler\", num_scaler),\n",
"    ]\n",
")\n",
"\n",
"# Preprocessing for categorical data\n",
"cat_imputer = SimpleImputer(strategy=\"constant\", fill_value=\"unknown\")\n",
"cat_encoder = OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False, drop=\"first\")\n",
"preprocessing_cat = Pipeline(\n",
"    [\n",
"        (\"imputer\", cat_imputer),\n",
"        (\"encoder\", cat_encoder),\n",
"    ]\n",
")\n",
"\n",
"# Feature preparation with ColumnTransformer\n",
"features_preprocessing = ColumnTransformer(\n",
"    verbose_feature_names_out=False,\n",
"    transformers=[\n",
"        (\"preprocessing_num\", preprocessing_num, num_columns),\n",
"        (\"preprocessing_cat\", preprocessing_cat, cat_columns),\n",
"    ],\n",
"    remainder=\"passthrough\"\n",
")\n",
"\n",
"# Drop unwanted columns\n",
"drop_columns = ColumnTransformer(\n",
"    verbose_feature_names_out=False,\n",
"    transformers=[\n",
"        (\"drop_columns\", \"drop\", columns_to_drop),\n",
"    ],\n",
"    remainder=\"passthrough\",\n",
")\n",
"\n",
"# Feature postprocessing\n",
"features_postprocessing = ColumnTransformer(\n",
"    verbose_feature_names_out=False,\n",
"    transformers=[\n",
"        (\"preprocessing_cat\", preprocessing_cat, [\"price_category\"]),\n",
"    ],\n",
"    remainder=\"passthrough\",\n",
")\n",
"\n",
"# Build the final pipeline\n",
"pipeline = Pipeline(\n",
"    [\n",
"        (\"features_preprocessing\", features_preprocessing),\n",
"        (\"drop_columns\", drop_columns),\n",
"        (\"custom_features\", HouseFeatures()),\n",
"        (\"model\", RandomForestRegressor())  # model to train\n",
"    ]\n",
")\n",
"\n",
"# Use the pipeline\n",
"def train_pipeline(X, y):\n",
"    pipeline.fit(X, y)"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 28,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Model: logistic\n",
|
|||
|
"MSE (train): 0.24060150375939848\n",
|
|||
|
"MSE (test): 0.23455933379597502\n",
|
|||
|
"MAE (train): 0.24060150375939848\n",
|
|||
|
"MAE (test): 0.23455933379597502\n",
|
|||
|
"R2 (train): 0.015780807725750634\n",
|
|||
|
"R2 (test): 0.045807954005714024\n",
|
|||
|
"STD (train): 0.48387852043102103\n",
|
|||
|
"STD (test): 0.4780359236045559\n",
|
|||
|
"----------------------------------------\n",
|
|||
|
"Model: ridge\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"c:\\Users\\user\\Desktop\\MII\\lab1para\\aim\\aimenv\\Lib\\site-packages\\sklearn\\linear_model\\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
|
|||
|
"STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
|
|||
|
"\n",
|
|||
|
"Increase the number of iterations (max_iter) or scale the data as shown in:\n",
|
|||
|
" https://scikit-learn.org/stable/modules/preprocessing.html\n",
|
|||
|
"Please also refer to the documentation for alternative solver options:\n",
|
|||
|
" https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
|
|||
|
" n_iter_i = _check_optimize_result(\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"MSE (train): 0.210989010989011\n",
|
|||
|
"MSE (test): 0.2035623409669211\n",
|
|||
|
"MAE (train): 0.210989010989011\n",
|
|||
|
"MAE (test): 0.2035623409669211\n",
|
|||
|
"R2 (train): 0.1369154775441198\n",
|
|||
|
"R2 (test): 0.17190433878207922\n",
|
|||
|
"STD (train): 0.45781332911823247\n",
|
|||
|
"STD (test): 0.4499815316182845\n",
|
|||
|
"----------------------------------------\n",
|
|||
|
"Model: decision_tree\n",
|
|||
|
"MSE (train): 0.0\n",
|
|||
|
"MSE (test): 0.0\n",
|
|||
|
"MAE (train): 0.0\n",
|
|||
|
"MAE (test): 0.0\n",
|
|||
|
"R2 (train): 1.0\n",
|
|||
|
"R2 (test): 1.0\n",
|
|||
|
"STD (train): 0.0\n",
|
|||
|
"STD (test): 0.0\n",
|
|||
|
"----------------------------------------\n",
|
|||
|
"Model: knn\n",
|
|||
|
"MSE (train): 0.1949681897050318\n",
|
|||
|
"MSE (test): 0.27989821882951654\n",
|
|||
|
"MAE (train): 0.1949681897050318\n",
|
|||
|
"MAE (test): 0.27989821882951654\n",
|
|||
|
"R2 (train): 0.20245122664507342\n",
|
|||
|
"R2 (test): -0.13863153417464114\n",
|
|||
|
"STD (train): 0.43948973967967464\n",
|
|||
|
"STD (test): 0.5264647910268833\n",
|
|||
|
"----------------------------------------\n",
|
|||
|
"Model: naive_bayes\n",
|
|||
|
"MSE (train): 0.26928860613071137\n",
|
|||
|
"MSE (test): 0.2690261392551469\n",
|
|||
|
"MAE (train): 0.26928860613071137\n",
|
|||
|
"MAE (test): 0.2690261392551469\n",
|
|||
|
"R2 (train): -0.10156840366079445\n",
|
|||
|
"R2 (test): -0.09440369772322943\n",
|
|||
|
"STD (train): 0.47316941542228536\n",
|
|||
|
"STD (test): 0.47206502931490235\n",
|
|||
|
"----------------------------------------\n",
|
|||
|
"Model: gradient_boosting\n",
|
|||
|
"MSE (train): 0.0\n",
|
|||
|
"MSE (test): 0.0\n",
|
|||
|
"MAE (train): 0.0\n",
|
|||
|
"MAE (test): 0.0\n",
|
|||
|
"R2 (train): 1.0\n",
|
|||
|
"R2 (test): 1.0\n",
|
|||
|
"STD (train): 0.0\n",
|
|||
|
"STD (test): 0.0\n",
|
|||
|
"----------------------------------------\n",
|
|||
|
"Model: random_forest\n",
|
|||
|
"MSE (train): 0.0\n",
|
|||
|
"MSE (test): 0.0\n",
|
|||
|
"MAE (train): 0.0\n",
|
|||
|
"MAE (test): 0.0\n",
|
|||
|
"R2 (train): 1.0\n",
|
|||
|
"R2 (test): 1.0\n",
|
|||
|
"STD (train): 0.0\n",
|
|||
|
"STD (test): 0.0\n",
|
|||
|
"----------------------------------------\n",
|
|||
|
"Model: mlp\n",
|
|||
|
"MSE (train): 0.4253903990746096\n",
|
|||
|
"MSE (test): 0.4353458246588018\n",
|
|||
|
"MAE (train): 0.4253903990746096\n",
|
|||
|
"MAE (test): 0.4353458246588018\n",
|
|||
|
"R2 (train): -0.7401279228791116\n",
|
|||
|
"R2 (test): -0.7709954936501442\n",
|
|||
|
"STD (train): 0.4959884986820156\n",
|
|||
|
"STD (test): 0.49782384226978177\n",
|
|||
|
"----------------------------------------\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"import numpy as np\n",
"from sklearn import metrics\n",
"from sklearn.pipeline import Pipeline\n",
"\n",
"# Check that the required variables are defined\n",
"if 'class_models' not in locals():\n",
"    raise ValueError(\"class_models is not defined\")\n",
"if 'X_train' not in locals() or 'X_test' not in locals() or 'y_train' not in locals() or 'y_test' not in locals():\n",
"    raise ValueError(\"Train/test data is not defined\")\n",
"\n",
"# Flatten the one-column targets into 1-D arrays\n",
"y_train = np.ravel(y_train)\n",
"y_test = np.ravel(y_test)\n",
"\n",
"# List for storing results\n",
"results = []\n",
"\n",
"# Iterate over the models and evaluate each one\n",
"for model_name in class_models.keys():\n",
"    print(f\"Model: {model_name}\")\n",
"\n",
"    # Get the model from the dictionary\n",
"    model = class_models[model_name][\"model\"]\n",
"\n",
"    # Build the pipeline\n",
"    model_pipeline = Pipeline([(\"pipeline\", pipeline_end), (\"model\", model)])\n",
"\n",
"    # Train the model\n",
"    model_pipeline.fit(X_train, y_train)\n",
"\n",
"    # Predict on the training and test sets\n",
"    y_train_predict = model_pipeline.predict(X_train)\n",
"    y_test_predict = model_pipeline.predict(X_test)\n",
"\n",
"    # Save the pipeline and the predictions\n",
"    class_models[model_name][\"pipeline\"] = model_pipeline\n",
"    class_models[model_name][\"preds\"] = y_test_predict\n",
"\n",
"    # Regression metrics computed on the predicted labels;\n",
"    # for 0/1 labels MSE and MAE both equal the misclassification rate\n",
"    class_models[model_name][\"MSE_train\"] = metrics.mean_squared_error(y_train, y_train_predict)\n",
"    class_models[model_name][\"MSE_test\"] = metrics.mean_squared_error(y_test, y_test_predict)\n",
"    class_models[model_name][\"MAE_train\"] = metrics.mean_absolute_error(y_train, y_train_predict)\n",
"    class_models[model_name][\"MAE_test\"] = metrics.mean_absolute_error(y_test, y_test_predict)\n",
"    class_models[model_name][\"R2_train\"] = metrics.r2_score(y_train, y_train_predict)\n",
"    class_models[model_name][\"R2_test\"] = metrics.r2_score(y_test, y_test_predict)\n",
"\n",
"    # Additional metrics: standard deviation of the residuals\n",
"    class_models[model_name][\"STD_train\"] = np.std(y_train - y_train_predict)\n",
"    class_models[model_name][\"STD_test\"] = np.std(y_test - y_test_predict)\n",
"\n",
"    # Print the results for the current model\n",
"    print(f\"MSE (train): {class_models[model_name]['MSE_train']}\")\n",
"    print(f\"MSE (test): {class_models[model_name]['MSE_test']}\")\n",
"    print(f\"MAE (train): {class_models[model_name]['MAE_train']}\")\n",
"    print(f\"MAE (test): {class_models[model_name]['MAE_test']}\")\n",
"    print(f\"R2 (train): {class_models[model_name]['R2_train']}\")\n",
"    print(f\"R2 (test): {class_models[model_name]['R2_test']}\")\n",
"    print(f\"STD (train): {class_models[model_name]['STD_train']}\")\n",
"    print(f\"STD (test): {class_models[model_name]['STD_test']}\")\n",
"    print(\"-\" * 40)  # separator between models"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Example of using the trained model (the regression pipeline) for prediction\n",
"\n",
"Hyperparameter tuning with grid search"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 30,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Fitting 5 folds for each of 36 candidates, totalling 180 fits\n",
|
|||
|
"Best parameters: {'max_depth': 10, 'min_samples_split': 10, 'n_estimators': 200}\n",
|
|||
|
"Best MSE: 0.14737693245118555\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split, GridSearchCV\n",
|
|||
|
"from sklearn.ensemble import RandomForestRegressor\n",
|
|||
|
"from sklearn.preprocessing import StandardScaler\n",
|
|||
|
"\n",
|
|||
|
"# Convert the date column to a datetime object and extract numeric features\n",
|
|||
|
"df['date'] = pd.to_datetime(df['date'], errors='coerce') # Coerce invalid dates to NaT\n",
|
|||
|
"df.dropna(subset=['date'], inplace=True) # Drop rows with invalid dates\n",
|
|||
|
"df['year'] = df['date'].dt.year\n",
|
|||
|
"df['month'] = df['date'].dt.month\n",
|
|||
|
"df['day'] = df['date'].dt.day\n",
|
|||
|
"\n",
|
|||
|
"# Prepare predictors and target\n",
|
|||
|
"X = df[['yr_built', 'year', 'month', 'day', 'price', 'price_category']]\n",
|
|||
|
"y = df['average_price']\n",
|
|||
|
"\n",
|
|||
|
"# Split data into training and testing sets\n",
|
|||
|
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Define model and parameter grid\n",
|
|||
|
"model = RandomForestRegressor()\n",
|
|||
|
"param_grid = {\n",
|
|||
|
" 'n_estimators': [50, 100, 200],\n",
|
|||
|
" 'max_depth': [None, 10, 20, 30],\n",
|
|||
|
" 'min_samples_split': [2, 5, 10]\n",
|
|||
|
"}\n",
|
|||
|
"\n",
|
|||
|
"# Hyperparameter tuning with GridSearchCV\n",
|
|||
|
"grid_search = GridSearchCV(estimator=model, param_grid=param_grid,\n",
|
|||
|
" scoring='neg_mean_squared_error', cv=5, n_jobs=-1, verbose=2)\n",
|
|||
|
"\n",
|
|||
|
"# Fit the model\n",
|
|||
|
"grid_search.fit(X_train, y_train)\n",
|
|||
|
"\n",
|
|||
|
"# Output the best parameters and score\n",
|
|||
|
"print(\"Best parameters:\", grid_search.best_params_)\n",
|
|||
|
"print(\"Best MSE:\", -grid_search.best_score_)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 5 folds for each of 36 candidates, totalling 180 fits\n",
"Old parameters: {'max_depth': 10, 'min_samples_split': 15, 'n_estimators': 200}\n",
"Best result (MSE) with the old parameters: 0.1472405057641472\n",
"\n",
"New parameters: {'max_depth': 10, 'min_samples_split': 10, 'n_estimators': 200}\n",
"Best result (MSE) with the new parameters: 0.149046701378161\n",
"Mean squared error (MSE) on the test data: 0.14438125797411974\n",
"Root mean squared error (RMSE) on the test data: 0.3799753386393908\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA1cAAAHWCAYAAACbsXOkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABuyUlEQVR4nO3deXhNV/v/8c9JyDwhJIImlJpCzAQVrSFaRWjNraGqk7FBDS1BnzbUFC1PlbaoVqkWVW31aWOeiXmsqqlIDCUhSEj27w+/nK8jCTlx0ki8X9d1Ljlrr732vbeT7NxZwzYZhmEIAAAAAPBA7HI7AAAAAADID0iuAAAAAMAGSK4AAAAAwAZIrgAAAADABkiuAAAAAMAGSK4AAAAAwAZIrgAAAADABkiuAAAAAMAGSK4AAAAAwAZIrgAAAADABkiugHzs6NGjeu2111SmTBk5OTnJw8NDDRo00NSpU3X9+vXcDu+RsXr1aplMJplMJn311VcZ1mnQoIFMJpMCAwMtypOTkzV16lRVr15dHh4e8vLyUuXKlfXqq6/q0KFD5npz5swxHyOj1+bNm3P0HAEAgFQgtwMAkDN++ukntW/fXo6OjurWrZsCAwOVnJys9evXa8iQIdq/f79mzpyZ22E+UpycnDR//ny9+OKLFuXHjx/Xxo0b5eTklG6f559/Xr/88os6d+6s3r176+bNmzp06JCWL1+u+vXrq0KFChb1x44dq9KlS6drp2zZsrY9GQAAkA7JFZAPHTt2TJ06dZK/v79Wrlyp4sWLm7f16dNHf/75p3766adcjPDR9Oyzz2rZsmW6cOGCvL29zeXz58+Xj4+PypUrp0uXLpnLt23bpuXLl+v999/XiBEjLNqaNm2aLl++nO4YzzzzjGrVqpVj5wAAADLHsEAgH/rwww919epVff755xaJVZqyZctqwIAB5vcmk0l9+/bV119/rfLly8vJyUk1a9bU2rVrLfY7ceKE3nzzTZUvX17Ozs4qUqSI2rdvr+PHj1vUu3uImouLi6pUqaLPPvvMol6PHj3k5uaWLr7vvvtOJpNJq1evtijfsmWLWrRoIU9PT7m4uCgkJEQbNmywqDN69GiZTCZduHDBonz79u0ymUyaM2eOxfEDAgIs6p06dUrOzs4ymUzpzuuXX37Rk08+KVdXV7m7u6tly5bav39/uvgz06ZNGzk6OmrRokUW5fPnz1eHDh1kb29vUX706FFJt4cM3s3e3l5FihTJ8rGz4vjx45kOK7z7WkhS48aNM6x75zWWpE8++USBgYFycXGxqPfdd9/dN6bTp0+rV69e8vPzk6Ojo0qXLq033nhDycnJ9x0KeWcse/bsUY8ePcxDZH19ffXyyy/r4sWLFsdL+/wcOnRIHTp0kIeHh4oUKaIBAwboxo0bFnXTvm8ykxZf2rVbuXKl7OzsNGrUKIt68+fPl8lk0ieffHLPa9G4cWM1btzYomzbtm3mc72fxo0bpxt2KkkTJ07M8P/4v//9rypXrixHR0f5+fmpT58+6RL6uz8D3t7eatmypfbt22dRLzeu1b0+F3ee6w8//KCWLVuaP2OPP/643nvvPaWkpKRrMzAwUDExMapfv76cnZ1VunRpzZgxw6JecnKyRo0apZo1a8rT01Ourq568skntWrVKot6d36/LV261GLbjRs3VKhQIZlMJk2cONFi2+nTp/Xyyy/Lx8dHjo6Oqly5sr744gvz9juHIWf2Gj16tCTrPu+3bt3Se++9p8cff1yOjo4KCAjQiBEjlJSUZFEvICDAfBw7Ozv5+vqqY8eOOnny5D3/z4D8gp4rIB/68ccfVaZMGdWvXz/L+6xZs0YLFy5U//795ejoqP/+979q0aKFtm7dav6FbNu2bdq4caM6deqkkiVL6vjx4/rkk0/UuHFjHThwQC4uLhZtTpkyRd7e3kpISNAXX3yh3r17KyAgQE2bNrX6nFauXKlnnnlGNWvWVEREhO
zs7DR79mw9/fTTWrdunerUqWN1mxkZNWpUul8qJGnevHnq3r27QkNDNX78eF27dk2ffPKJGjZsqJ07d6ZL0jLi4uKiNm3a6JtvvtEbb7whSdq9e7f279+vzz77THv27LGo7+/vL0n6+uuv1aBBAxUocP8f2fHx8ekSS5PJZFUi1rlzZz377LOSpJ9//lnffPNNpnUrVKigd955R5J04cIFvfXWWxbbFy5cqDfffFONGzdWv3795OrqqoMHD+qDDz64bxxnzpxRnTp1dPnyZb366quqUKGCTp8+re+++07Xrl1To0aNNG/ePHP9999/X5LM8Ugyfw/89ttv+uuvv9SzZ0/5+vqah8Xu379fmzdvTpecdOjQQQEBAYqMjNTmzZv10Ucf6dKlS/ryyy/vG3dmnn76ab355puKjIxUWFiYatSoobNnz6pfv35q2rSpXn/9davbHDp0aLbjuZfRo0drzJgxatq0qd544w0dPnxYn3zyibZt26YNGzaoYMGC5rppnwHDMHT06FFNnjxZzz777AP9Mm2La1WyZElFRkZalGX0eZ4zZ47c3NwUHh4uNzc3rVy5UqNGjVJCQoImTJhgUffSpUt69tln1aFDB3Xu3Fnffvut3njjDTk4OOjll1+WJCUkJOizzz4zD+W9cuWKPv/8c4WGhmrr1q2qVq2aRZtOTk6aPXu2wsLCzGWLFy/O8OdQXFyc6tWrZ05WixYtql9++UW9evVSQkKCBg4cqIoVK1p8X8ycOVMHDx7UlClTzGVVq1a1aDcrn/dXXnlFc+fO1QsvvKBBgwZpy5YtioyM1MGDB7VkyRKL9p588km9+uqrSk1N1b59+xQVFaUzZ85o3bp16c4JyHcMAPlKfHy8Iclo06ZNlveRZEgytm/fbi47ceKE4eTkZLRt29Zcdu3atXT7btq0yZBkfPnll+ay2bNnG5KMY8eOmcv++OMPQ5Lx4Ycfmsu6d+9uuLq6pmtz0aJFhiRj1apVhmEYRmpqqlGuXDkjNDTUSE1NtYindOnSRrNmzcxlERERhiTj/PnzFm1u27bNkGTMnj3b4vj+/v7m9/v27TPs7OyMZ555xiL+K1euGF5eXkbv3r0t2oyNjTU8PT3Tld9t1apVhiRj0aJFxvLlyw2TyWScPHnSMAzDGDJkiFGmTBnDMAwjJCTEqFy5snm/1NRUIyQkxJBk+Pj4GJ07dzamT59unDhxIt0x0q55Ri9HR8d7xpcm7f9o4sSJ5rIJEyak+79M06BBA+Opp54yvz927Fi6a9y5c2fDy8vLuH79eobX4166detm2NnZGdu2bUu37c7PQZqQkBAjJCQkw7Yy+ux+8803hiRj7dq15rK0z0/r1q0t6r755puGJGP37t3mMklGnz59Mo0/o++DxMREo2zZskblypWNGzduGC1btjQ8PDwy/D+93/n9/PPPhiSjRYsWRlZu53d/vtLc/X987tw5w8HBwWjevLmRkpJirjdt2jRDkvHFF19kGpNhGMaIESMMSca5c+fMZblxrbJyroaR8WfjtddeM1xcXIwbN25YtCnJmDRpkrksKSnJqFatmlGsWDEjOTnZMAzDuHXrlpGUlGTR3qVLlwwfHx/j5ZdfNpelfb907tzZKFCggBEbG2ve1qRJE6NLly6GJGPChAnm8l69ehnFixc3Lly4YNF+p06dDE9PzwzP5e6fc3fK6ud9165dhiTjlVdesag3ePBgQ5KxcuVKc5m/v7/RvXt3i3pdunQxXFxcMowByG8YFgjkMwkJCZIkd3d3q/YLDg5WzZo1ze8fe+wxtWnTRr/++qt5eIyzs7N5+82bN3Xx4kWVLVtWXl5e2rFjR7o2L126pAsXLuivv/7SlClTZG9vr5CQkHT1Lly4YPG6cuWKxfZdu3bpyJEj6tKliy5evGiul5iYqCZNmmjt2rVKTU212Oeff/6xaDM+Pv6+12D48OGqUaOG2rdvb1H+22+/6fLly+rcubNFm/b29qpbt2
664T730rx5cxUuXFgLFiyQYRhasGCBOnfunGFdk8mkX3/9Vf/5z39UqFAhffPNN+rTp4/8/f3VsWPHDOdcTZ8+Xb/
"text/plain": [
"<Figure size 1000x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn import metrics\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.model_selection import train_test_split, GridSearchCV\n",
"import matplotlib.pyplot as plt\n",
"\n",
"\n",
"# 1. Parameter grid for the old values\n",
"old_param_grid = {\n",
"    'n_estimators': [50, 100, 200],  # Number of trees\n",
"    'max_depth': [None, 10, 20, 30],  # Maximum tree depth\n",
"    'min_samples_split': [2, 10, 15]  # Minimum number of samples required to split a node\n",
"}\n",
"\n",
"# Hyperparameter tuning with grid search over the old grid\n",
"old_grid_search = GridSearchCV(estimator=RandomForestRegressor(),\n",
"                                param_grid=old_param_grid, scoring='neg_mean_squared_error', cv=5, n_jobs=-1, verbose=2)\n",
"\n",
"# Fit on the training data\n",
"old_grid_search.fit(X_train, y_train)\n",
"\n",
"# 2. Search results for the old grid\n",
"old_best_params = old_grid_search.best_params_\n",
"old_best_mse = -old_grid_search.best_score_  # Flip the sign: the scorer returns negative MSE\n",
"\n",
"# 3. Parameter grid for the new values (a single candidate, so this only cross-validates one configuration)\n",
"new_param_grid = {\n",
"    'n_estimators': [200],\n",
"    'max_depth': [10],\n",
"    'min_samples_split': [10]\n",
"}\n",
"\n",
"# Hyperparameter tuning with grid search over the new grid\n",
"new_grid_search = GridSearchCV(estimator=RandomForestRegressor(),\n",
"                                param_grid=new_param_grid, scoring='neg_mean_squared_error', cv=2)  # NOTE: 2 folds here vs. 5 above, so the two CV scores are not directly comparable\n",
"\n",
"# Fit on the training data\n",
"new_grid_search.fit(X_train, y_train)\n",
"\n",
"# 4. Search results for the new grid\n",
"new_best_params = new_grid_search.best_params_\n",
"new_best_mse = -new_grid_search.best_score_  # Flip the sign: the scorer returns negative MSE\n",
"\n",
"# 5. Train the model with the best parameters from the new grid\n",
"model_best = RandomForestRegressor(**new_best_params)\n",
"model_best.fit(X_train, y_train)\n",
"\n",
"# Predict on the test set\n",
"y_pred = model_best.predict(X_test)\n",
"\n",
"# Evaluate model performance\n",
"mse = metrics.mean_squared_error(y_test, y_pred)\n",
"rmse = np.sqrt(mse)\n",
"\n",
"# Print the results\n",
"print(\"Old parameters:\", old_best_params)\n",
"print(\"Best result (MSE) with the old parameters:\", old_best_mse)\n",
"print(\"\\nNew parameters:\", new_best_params)\n",
"print(\"Best result (MSE) with the new parameters:\", new_best_mse)\n",
"print(\"Mean squared error (MSE) on the test data:\", mse)\n",
"print(\"Root mean squared error (RMSE) on the test data:\", rmse)\n",
"\n",
"# Visualize the errors\n",
"plt.figure(figsize=(10, 5))\n",
"plt.bar(['Old parameters', 'New parameters'], [old_best_mse, new_best_mse], color=['blue', 'orange'])\n",
"plt.xlabel('Parameter selection')\n",
"plt.ylabel('Mean squared error (MSE)')\n",
"plt.title('MSE comparison for old and new parameters')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Comparing the old and new model parameters, the old settings give a slightly lower cross-validated mean squared error (0.1472 vs. 0.1490). The gap is about 1%, and the two scores were computed with different numbers of folds (cv=5 vs. cv=2) and with unseeded random forests, so it is better read as a rough tie than as a clear advantage.\n",
"\n",
"Main observations supporting the claim that the model is trained well:\n",
"\n",
"Consistency of MSE: The cross-validated MSE on the training data (about 0.147 to 0.149) and the MSE on the test data (0.1444) are very close, which indicates no noticeable overfitting. The model generalizes to unseen data, which is the desired outcome.\n",
"\n",
"Effectiveness of the old parameters: The best configuration from the old grid (max_depth=10, min_samples_split=15, n_estimators=200) has the lowest cross-validated MSE in this comparison.\n",
"\n",
"Effect of the new parameters: The new configuration differs from the old one only in min_samples_split (10 instead of 15); max_depth and n_estimators are unchanged. In this run, lowering min_samples_split did not improve the cross-validated MSE. Experiments like this are still a useful part of tuning, because they show how sensitive the model is to each hyperparameter.\n",
"\n",
"Overall, the model is trained well, though minor further improvements may be possible through finer hyperparameter tuning."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA04AAAHWCAYAAABACtmGAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOxdd5xcVfX/Tu8z2/tuym6S3RQICaQHCAQCJoSAEIpKAmooIoqKERRCkR/SBIJSVATFKCaUBEFDR2qAhPQtabvZXmbL7Oz08n5/nHf2vpmdLQklqO98PvvZ3ZlX7j333NPPuRpJkiSooIIKKqigggoqqKCCCiqoMChoj/UAVFBBBRVUUEEFFVRQQQUVvuqgGk4qqKCCCiqooIIKKqigggrDgGo4qaCCCiqooIIKKqigggoqDAOq4aSCCiqooIIKKqigggoqqDAMqIaTCiqooIIKKqigggoqqKDCMKAaTiqooIIKKqigggoqqKCCCsOAajipoIIKKqigggoqqKCCCioMA6rhpIIKKqigggoqqKCCCiqoMAyohpMKKqigggoqqKCCCiqooMIwoBpOKqigggoqqKDCFwb33HMPysvLEY/Hj/VQ/ifg1ltvhUajOdbDUEGFzwUuvvhiLF++/FgPox9Uw+kLhEceeQQajQYzZ8481kNRQYX/ali5ciU0Gs2QP9dee+2xHqYKKvzPQW9vL+6++26sXr0aWq1QOYbak0899RQ0Gg22bt36ZQ3zfxrefvttnH/++cjLy4PRaEROTg7OOeccPP/888d6aMcE6urqhpUnGo0Gbrf7WA/1fwJWr16N5557Djt37jzWQwEA6I/1AP6bYd26dRg9ejQ+/vhjHDhwAGVlZcd6SCqo8F8LJpMJf/jDH1J+961vfetLHo0KKqgAAH/84x8RjUZxySWXHOuhqJAC1qxZg9tvvx3jxo3DlVdeiVGjRqGzsxP//Oc/8fWvfx3r1q3DpZdeeqyHeUzgkksuwde+9rUBnz///PN44YUXjsGI/jfhhBNOwIknnoj7778ff/7zn4/1cFTD6YuC2tpafPDBB3j++edx5ZVXYt26dVizZs2xHpYKKvzXgl6vxze/+c2U36mGkwoqHBt48sknsXTpUpjN5mM9FBWS4Nlnn8Xtt9+OCy64AH/9619hMBj6v7vhhhvwyiuvIBKJHMMRHluYNm1aSply4MAB1XD6kmH58uVYs2YNHnnkEdjt9mM6FjVV7wuCdevWIT09HYsXL8YFF1yAdevWDbiGw8FPPfVU/2derxfTp0/HmDFj0NLSMqKQ8cqVK3Ho0CFoNBo88MADA97zwQcfQKPR4G9/+xsA4PDhw7jmmmswYcIEWCwWZGZm4sILL0RdXV3KuZx66qkp36sc96mnnorJkycPi5fB0jOWLFmC0aNHD8DNfffdN+izBsvj/stf/oLp06fDYrEgIyMDF198MRoaGoYd26OPPorjjz8eLpcLNpsNxx9/PJ544omEa1auXJly0z777LPQaDR4++23+z979913ceGFF6KkpAQmkwnFxcW4/vrrEQgEEu4dPXo0Vq5cmfDZ22+/PeB5APDRRx/hrLPOgsvlgtVqxSmnnIL3338/JV6S0wi2bt06YN1WrlyZgHcAaGhogMVigUajSaCJaDSKX/7ylxg/fjxMJlMCLQyVUnPfffdBo9Hg8OHDA7678cYbYTQa0d3dDQDYv38/vv71ryMvLw9msxlFRUW4+OKL4fF4Bn3+0QDj9+9//ztuuukm5OXlwWazYenSpSlpZSR4Zxhsv9x6660Drv3LX/6CGTNmwGq1Ij09HSeffDJeffXV/u9T0caqVatgNpsTaGPTpk1YvHgxCgoKYDKZUFpaijvuuAOxWCzh3quuugrjxo2D1WpFRkYGTjvtNLz77rsJ14z0WYPteV5vJe2MHj0aS5YsGXDttddeO2APD5dWyWlcyfzqX//6F+bPnw+bzQaHw4HFixdj79
69gz6H4fnnn8eMGTOQkZEBi8WC8vJy3H333ZAkqf+aI9lTu3btwsqVKzF27FiYzWbk5eXhiiuuQGdnZ8K9p556Kk499dSEz1LJBACorq7GBRdcgIyMDJjNZpx44ol48cUXU+IleS+63e4B9JeKd/b19SEvLy8l33n00UcxefJkWK3WBJp+9tlnk9GZALW1tdi1axcWLlw45HUjhTfffLN/jdPS0nDuueeiqqoq4RqeG/84HA7MmDEDGzduTLhuOJn1WdZiMLjvvvswZ84cZGZmwmKxYPr06SlxyHtg48aNmDx5MkwmEyZNmoTNmzcPuPa9997DSSedBLPZjNLSUjz++OMjGgsA3HzzzcjIyMAf//jHBKOJYdGiRViyZEk/vxzqh+lrpPoF0+s777yDK6+8EpmZmXA6nbjsssv65QHDYPyD4bPIy88LRirvWYc4dOgQFi1aBJvNhoKCAtx+++0JPAc4MnrRaDR48MEHB3xXXl6ekqf29PTghz/8IYqLi2EymVBWVoa77767vw5xpLon8OWs5RlnnAGfz4fXXntt0Hu/LFAjTl8QrFu3Dueffz6MRiMuueQSPProo/jkk09w0kknDXpPJBLB17/+ddTX1+P9999Hfn4+fD4fnn766f5rOESs/Ky0tBRjx47F3LlzsW7dOlx//fUDxuJwOHDuuecCAD755BN88MEHuPjii1FUVIS6ujo8+uijOPXUU1FZWQmr1TpgbOXl5fj5z38OgARx8ju+KnDnnXfi5ptvxvLly/Gd73wHHR0dePjhh3HyySdj+/btSEtLG/Rer9eLM888E6WlpZAkCevXr8d3vvMdpKWl4etf//oRj2XDhg3w+/24+uqrkZmZiY8//hgPP/wwGhsbsWHDhiN+3ptvvomzzz4b06dPx5o1a6DVavHkk0/2K78zZsw44memgltuuQXBYHDA5/fffz9uvvlmnHfeeVi9ejVMJhPeffdd/O53vxvyecuXL8dPf/pTrF+/HjfccEPCd+vXr8eZZ56J9PR0hMNhLFq0CKFQCN///veRl5eHpqYmvPTSS+jp6YHL5fpc5qeEO++8ExqNBqtXr0Z7ezsefPBBLFy4EDt27IDFYgFwdHgvKirCXXfdBYAU0quvvnrANbfddhtuvfVWzJkzB7fffjuMRiM++ugjvPnmmzjzzDNTjnfNmjV44okn8Pe//z1B6X7qqadgt9vxox/9CHa7HW+++SZuueUW9Pb24t577+2/LhwO45vf/CaKiorQ1dWFxx9/HGeddRaqqqpQUlJyRM/6KsHTTz+NFStWYNGiRbj77rvh9/vx6KOPYt68edi+ffsA54ASent7MXPmTKxYsQIGgwGbN2/Gz372M+j1evz4xz8+4rG89tprOHToEC6//HLk5eVh7969+N3vfoe9e/diy5YtR1y0v3fvXsydOxeFhYX42c9+BpvNhvXr12PZsmV47rnncN555x3xGFPB/fffj7a2tgGf//3vf8c111yDU089Fd///vdhs9lQVVWF//u//xv2mR988AEA8tyngmAwmLJOpK+vb8Bnr7/+Os4++2yMHTsWt956KwKBAB5++GHMnTsXn3766YA1ZhnpdrvxyCOP4MILL8SePXswYcKEYcc9GHzWtXjooYewdOlSfOMb30A4HMYzzzyDCy+8EC+99BIWL16ccO17772H559/Htdccw0cDgfWrl3brx9kZmYCAHbv3o0zzzwT2dnZuPXWWxGNRrFmzRrk5uYOO5f9+/ejuroaV1xxBRwOx5DXVlRUJOgcv/vd71BVVZXgqD3uuOMAHLl+ce211yItLQ233norampq8Oijj+Lw4cP9CvTRwpclLxmORN7HYjGcddZZmDVrFu655x5s3rwZa9asQTQaxe23395/3ZHQi9lsxpNPPokf/vCH/Z998MEHKR2Wfr8fp5xyCpqamnDllVeipKQEH3zwAW688Ua0tLTgwQcfRHZ29oh0TyV8UWsJABMnToTFYs
H777//ufG8owZJhc8dtm7dKgGQXnvtNUmSJCkej0tFRUXSD37wg4TramtrJQDSk08+KcXjcekb3/iGZLVapY8++mj
"text/plain": [
"<Figure size 1000x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# y_test_predict is expected to hold the old-parameter predictions computed in an earlier cell\n",
"plt.figure(figsize=(10, 5))\n",
"plt.scatter(range(len(y_test)), y_test, label=\"Actual values\", color=\"black\", alpha=0.5)\n",
"plt.scatter(range(len(y_test)), y_pred, label=\"Predicted (new parameters)\", color=\"blue\", alpha=0.5)\n",
"plt.scatter(range(len(y_test)), y_test_predict, label=\"Predicted (old parameters)\", color=\"red\", alpha=0.5)\n",
"plt.xlabel(\"Sample\")\n",
"plt.ylabel(\"Values\")\n",
"plt.legend()\n",
"plt.title(\"Actual vs. predicted values (new and old parameters)\")\n",
"plt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "aimenv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}