AIM-PIbd-32-Filippov-D-S/Lab_3/lab3.ipynb


2024-12-07 12:48:37 +04:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Лабораторная работа №3\n",
"\n",
"### Набор данных \"Наблюдения НЛО в США\""
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Для набора данных \"Наблюдения НЛО в США\" можно выделить несколько бизнес-целей и соответствующие технические задачи. Давайте рассмотрим этот процесс поэтапно.\n",
"# \n",
"# 1. Определение бизнес-целей\n",
"# Бизнес-цель 1: Прогнозирование местоположения и частоты наблюдений НЛО.\n",
"# Задача заключается в анализе географического распределения и времени наблюдений НЛО, чтобы определить, в каких местах и когда чаще всего происходят наблюдения.\n",
"# Бизнес-цель 2: Анализ факторов, влияющих на восприятие НЛО (например, форма, продолжительность, описание).\n",
"# Цель — понять, какие признаки, такие как форма НЛО, длительность наблюдения, могут быть связаны с более подробными или более эмоционально окрашенными отчетами.\n",
"# 2. Цели технического проекта для каждой бизнес-цели\n",
"# Цель для бизнес-цели 1: Создать модель, которая предскажет вероятное местоположение и время наблюдений на основе данных о предыдущих наблюдениях.\n",
"# Технические задачи:\n",
"# Прогнозирование местоположения и времени (классификация или регрессия).\n",
"# Кластеризация по географическому положению.\n",
"# Анализ временных рядов для выявления сезонных колебаний.\n",
"# Цель для бизнес-цели 2: Анализировать текстовые описания наблюдений НЛО для выявления ключевых паттернов и факторов.\n",
"# Технические задачи:\n",
"# Анализ текста с использованием методов обработки естественного языка (NLP).\n",
"# Классификация описаний по типам объектов или возможным объяснениям (например, возможный самолет или атмосферное явление)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['summary', 'city', 'state', 'date_time', 'shape', 'duration', 'stats',\n",
" 'report_link', 'text', 'posted', 'city_latitude', 'city_longitude'],\n",
" dtype='object')\n"
]
}
],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib.ticker as ticker\n",
"import seaborn as sns\n",
"\n",
"# Загрузка данных\n",
"df = pd.read_csv(\"../../datasets/nuforc_reports.csv\")\n",
"\n",
"# Срез данных, первые 15000 строк\n",
"df = df.iloc[:15000]\n",
"\n",
"# Вывод\n",
"print(df.columns)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>summary</th>\n",
" <th>city</th>\n",
" <th>state</th>\n",
" <th>date_time</th>\n",
" <th>shape</th>\n",
" <th>duration</th>\n",
" <th>stats</th>\n",
" <th>report_link</th>\n",
" <th>text</th>\n",
" <th>posted</th>\n",
" <th>city_latitude</th>\n",
" <th>city_longitude</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Viewed some red lights in the sky appearing to...</td>\n",
" <td>Visalia</td>\n",
" <td>CA</td>\n",
" <td>2021-12-15T21:45:00</td>\n",
" <td>light</td>\n",
" <td>2 minutes</td>\n",
" <td>Occurred : 12/15/2021 21:45 (Entered as : 12/...</td>\n",
" <td>http://www.nuforc.org/webreports/165/S165881.html</td>\n",
" <td>Viewed some red lights in the sky appearing to...</td>\n",
" <td>2021-12-19T00:00:00</td>\n",
" <td>36.356650</td>\n",
" <td>-119.347937</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Look like 1 or 3 crafts from North traveling s...</td>\n",
" <td>Cincinnati</td>\n",
" <td>OH</td>\n",
" <td>2021-12-16T09:45:00</td>\n",
" <td>triangle</td>\n",
" <td>14 seconds</td>\n",
" <td>Occurred : 12/16/2021 09:45 (Entered as : 12/...</td>\n",
" <td>http://www.nuforc.org/webreports/165/S165888.html</td>\n",
" <td>Look like 1 or 3 crafts from North traveling s...</td>\n",
" <td>2021-12-19T00:00:00</td>\n",
" <td>39.174503</td>\n",
" <td>-84.481363</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>seen dark rectangle moving slowly thru the sky...</td>\n",
" <td>Tecopa</td>\n",
" <td>CA</td>\n",
" <td>2021-12-10T00:00:00</td>\n",
" <td>rectangle</td>\n",
" <td>Several minutes</td>\n",
" <td>Occurred : 12/10/2021 00:00 (Entered as : 12/...</td>\n",
" <td>http://www.nuforc.org/webreports/165/S165810.html</td>\n",
" <td>seen dark rectangle moving slowly thru the sky...</td>\n",
" <td>2021-12-19T00:00:00</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>One red light moving switly west to east, beco...</td>\n",
" <td>Knoxville</td>\n",
" <td>TN</td>\n",
" <td>2021-12-10T19:30:00</td>\n",
" <td>triangle</td>\n",
" <td>20-30 seconds</td>\n",
" <td>Occurred : 12/10/2021 19:30 (Entered as : 12/...</td>\n",
" <td>http://www.nuforc.org/webreports/165/S165825.html</td>\n",
" <td>One red light moving switly west to east, beco...</td>\n",
" <td>2021-12-19T00:00:00</td>\n",
" <td>35.961561</td>\n",
" <td>-83.980115</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Bright, circular Fresnel-lens shaped light sev...</td>\n",
" <td>Alexandria</td>\n",
" <td>VA</td>\n",
" <td>2021-12-07T08:00:00</td>\n",
" <td>circle</td>\n",
" <td>NaN</td>\n",
" <td>Occurred : 12/7/2021 08:00 (Entered as : 12/0...</td>\n",
" <td>http://www.nuforc.org/webreports/165/S165754.html</td>\n",
" <td>Bright, circular Fresnel-lens shaped light sev...</td>\n",
" <td>2021-12-19T00:00:00</td>\n",
" <td>38.798958</td>\n",
" <td>-77.095133</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" summary city state \\\n",
"0 Viewed some red lights in the sky appearing to... Visalia CA \n",
"1 Look like 1 or 3 crafts from North traveling s... Cincinnati OH \n",
"2 seen dark rectangle moving slowly thru the sky... Tecopa CA \n",
"3 One red light moving switly west to east, beco... Knoxville TN \n",
"4 Bright, circular Fresnel-lens shaped light sev... Alexandria VA \n",
"\n",
" date_time shape duration \\\n",
"0 2021-12-15T21:45:00 light 2 minutes \n",
"1 2021-12-16T09:45:00 triangle 14 seconds \n",
"2 2021-12-10T00:00:00 rectangle Several minutes \n",
"3 2021-12-10T19:30:00 triangle 20-30 seconds \n",
"4 2021-12-07T08:00:00 circle NaN \n",
"\n",
" stats \\\n",
"0 Occurred : 12/15/2021 21:45 (Entered as : 12/... \n",
"1 Occurred : 12/16/2021 09:45 (Entered as : 12/... \n",
"2 Occurred : 12/10/2021 00:00 (Entered as : 12/... \n",
"3 Occurred : 12/10/2021 19:30 (Entered as : 12/... \n",
"4 Occurred : 12/7/2021 08:00 (Entered as : 12/0... \n",
"\n",
" report_link \\\n",
"0 http://www.nuforc.org/webreports/165/S165881.html \n",
"1 http://www.nuforc.org/webreports/165/S165888.html \n",
"2 http://www.nuforc.org/webreports/165/S165810.html \n",
"3 http://www.nuforc.org/webreports/165/S165825.html \n",
"4 http://www.nuforc.org/webreports/165/S165754.html \n",
"\n",
" text posted \\\n",
"0 Viewed some red lights in the sky appearing to... 2021-12-19T00:00:00 \n",
"1 Look like 1 or 3 crafts from North traveling s... 2021-12-19T00:00:00 \n",
"2 seen dark rectangle moving slowly thru the sky... 2021-12-19T00:00:00 \n",
"3 One red light moving switly west to east, beco... 2021-12-19T00:00:00 \n",
"4 Bright, circular Fresnel-lens shaped light sev... 2021-12-19T00:00:00 \n",
"\n",
" city_latitude city_longitude \n",
"0 36.356650 -119.347937 \n",
"1 39.174503 -84.481363 \n",
"2 NaN NaN \n",
"3 35.961561 -83.980115 \n",
"4 38.798958 -77.095133 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Для наглядности\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 15000 entries, 0 to 14999\n",
"Data columns (total 12 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 summary 14998 non-null object \n",
" 1 city 14961 non-null object \n",
" 2 state 14235 non-null object \n",
" 3 date_time 14560 non-null object \n",
" 4 shape 13082 non-null object \n",
" 5 duration 13598 non-null object \n",
" 6 stats 15000 non-null object \n",
" 7 report_link 15000 non-null object \n",
" 8 text 14999 non-null object \n",
" 9 posted 14560 non-null object \n",
" 10 city_latitude 12002 non-null float64\n",
" 11 city_longitude 12002 non-null float64\n",
"dtypes: float64(2), object(10)\n",
"memory usage: 1.4+ MB\n"
]
}
],
"source": [
"# Описание данных (основные статистические показатели)\n",
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Количество пропущенных значений в каждом столбце:\n",
"summary 74\n",
"city 382\n",
"state 9345\n",
"date_time 2668\n",
"shape 5922\n",
"duration 6492\n",
"stats 0\n",
"report_link 0\n",
"text 38\n",
"posted 2668\n",
"city_latitude 26804\n",
"city_longitude 26804\n",
"dtype: int64\n",
"summary Процент пустых значений: %0.05\n",
"city Процент пустых значений: %0.28\n",
"state Процент пустых значений: %6.82\n",
"date_time Процент пустых значений: %1.95\n",
"shape Процент пустых значений: %4.32\n",
"duration Процент пустых значений: %4.74\n",
"text Процент пустых значений: %0.03\n",
"posted Процент пустых значений: %1.95\n",
"city_latitude Процент пустых значений: %19.57\n",
"city_longitude Процент пустых значений: %19.57\n",
"Количество выбросов в столбце 'city_latitude': 1025\n",
"Количество выбросов в столбце 'city_longitude': 23\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA+IAAAISCAYAAABI/3XmAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAC1a0lEQVR4nOzdeXhU9dn/8c/MJDNZJ2FLAoIYxSqoiKKVVKWolKjoo5X2eWypW1GrP7QFrVpay4NopWoRN9THasVWqNVe1VZc2BSssqjRCILihgWFECTLTLaZyeT8/pg5JxmykGW2JO/Xdc0lmXPmzHfS9Jxzz31/v7fNMAxDAAAAAAAgLuyJHgAAAAAAAP0JgTgAAAAAAHFEIA4AAAAAQBwRiAMAAAAAEEcE4gAAAAAAxBGBOAAAAAAAcUQgDgAAAABAHBGIAwAAAAAQRwTiAAAAAADEEYE4+p3DDjtMl19+eaKH0efdc889Ovzww+VwODRu3LgeHWvevHmy2WzRGVgnffnll7LZbFqyZElUjxvPv78lS5bIZrPpyy+/jMv7AUBPcH2OD67P0RXvay3/P+k7CMTRq5knv3fffbfN7ZMmTdKxxx7b4/d5+eWXNW/evB4fp79YuXKlbr75Zp166ql68skndeedd0b9Pe6880698MILUT9uNKxfv17z5s1TVVXVQffdtm2b5s2bR7AMoE/h+pyc+vv1OV4efvjhpPmiAMmLQBz9zvbt2/XHP/6xS695+eWXddttt8VoRH3Pa6+9JrvdrieeeEKXXnqpzj333B4d79Zbb1V9fX3Ec8l8oV+/fr1uu+22NgPxA//+tm3bpttuu41AHEC/x/U59vr79TkWLrnkEtXX12vkyJHWcwTi6IyURA8AiDeXy5XoIXRZbW2tMjMzEz2MTisvL1d6erqcTmdUjpeSkqKUlL5xuuqNf38AEA+98fzI9bnvXJ+7y+FwyOFwJHoY6IXIiKPfOXBuTSAQ0G233aYjjzxSaWlpGjRokE477TStWrVKknT55Zdr8eLFkiSbzWY9TLW1tbrxxhs1YsQIuVwuHXXUUfrDH/4gwzAi3re+vl4///nPNXjwYGVnZ+u//uu/9PXXX8tms0WU1ZnzrbZt26Yf//jHGjBggE477TRJ0ubNm3X55Zfr8MMPV1pamgoKCvTTn/5U+/fvj3gv8xiffPKJfvKTnygnJ0dDhgzRb3/7WxmGoV27dumCCy6Q2+1WQUGBFi5c2KnfXWNjo26//XYdccQRcrlcOuyww/TrX/9aPp/P2sdms+nJJ59UbW2t9bs62LfCmzZt0rnnnqsBAwYoMzNTY8eO1f3339/q87R8j9raWj311FPWe1x++eV6/fXXZbPZ9Pzzz7d6j2XLlslms2nDhg2d+qxt6czvf968ebrpppskSYWFhdb4zIx3y7+/JUuW6Ic//KEk6YwzzrD2Xbt2rfU52yq5bGt+2NatW3XmmWcqPT1dw4cP1x133KGmpqY2P8crr7yi008/XZmZmcrOztbUqVO1devWbv9eACAauD5zfe6J1157zbq25ebm6oILLtBHH30UsY853s8++0yXX365cnNzlZOToyuuuEJ1dXUR+3b27+LAOeKHHXaYtm7dqnXr1lm/g0mTJrX5+2rvGJJkGIbuuOMODR8+XBkZGTrjjDPavVZXVVVp1qxZ1t/6qFGjdNddd7V7H4Dk0L+/wkKfUV1drW+++abV84FA4KCvnTdvnhYsWKArr7xS3/72t+XxePTuu+/qvffe0/e+9z397Gc/0+7du7Vq1Sr95S9/iXitYRj6r//6L73++uuaMWOGxo0bpxUrVuimm27S119/rUWLFln7Xn755Xr22Wd1ySWXaMKECVq3bp2mTp3a7rh++MMf6sgjj9Sdd95p3TSsWrVKX3zxha644goVFBRo69ateuyxx7R161Zt3Lix1cn9f/7nfzR69Gj9/ve/10svvaQ77rhDAwcO1P/93/
/pzDPP1F133aWlS5fql7/8pU4++WRNnDixw9/VlVdeqaeeeko/+MEPdOONN2rTpk1asGCBPvroI+vi+pe//EWPPfaY3n77bT3++OOSpO985zvtHnPVqlU677zzNHToUP3iF79QQUGBPvroIy1fvly/+MUv2nzNX/7yF+t/r6uvvlqSdMQRR2jChAkaMWKEli5dqu9///sRr1m6dKmOOOIIFRUVdfgZO9KZ3/9FF12kTz75RH/961+1aNEiDR48WJI0ZMiQVsebOHGifv7zn+uBBx7Qr3/9a40ePVqSrP92VllZmc444ww1NjbqV7/6lTIzM/XYY48pPT291b5/+ctfdNlll6m4uFh33XWX6urq9Mgjj+i0007T+++/r8MOO6zrvxgAaAfXZ67P8bg+r169Wuecc44OP/xwzZs3T/X19XrwwQd16qmn6r333mt1bfvv//5vFRYWasGCBXrvvff0+OOPKy8vT3fddZe1T1f/Lkz33Xefrr/+emVlZek3v/mNJCk/P7/Ln2nu3Lm64447dO655+rcc8/Ve++9pylTpsjv90fsV1dXp+9+97v6+uuv9bOf/UyHHnqo1q9frzlz5mjPnj267777uvzeiBMD6MWefPJJQ1KHj2OOOSbiNSNHjjQuu+wy6+fjjz/emDp1aofvM3PmTKOt/7u88MILhiTjjjvuiHj+Bz/4gWGz2YzPPvvMMAzDKCkpMSQZs2bNitjv8ssvNyQZ//u//2s997//+7+GJONHP/pRq/erq6tr9dxf//pXQ5LxxhtvtDrG1VdfbT3X2NhoDB8+3LDZbMbvf/976/nKykojPT094nfSltLSUkOSceWVV0Y8/8tf/tKQZLz22mvWc5dddpmRmZnZ4fHMMRUWFhojR440KisrI7Y1NTW1+jwtZWZmtjnmOXPmGC6Xy6iqqrKeKy8vN1JSUiJ+zwezY8cOQ5Lx5JNPWs919vd/zz33GJKMHTt2tNr/wL+/5557zpBkvP766632PfBvo71jzJo1y5BkbNq0yXquvLzcyMnJiRiH1+s1cnNzjauuuirieGVlZUZOTk6r5wGgu7g+c30+UCyvz+PGjTPy8vKM/fv3W8998MEHht1uNy699NJW4/3pT38acczvf//7xqBBg6yfu/J3Yf6tt7zmH3PMMcZ3v/vdVmNv6/fV1jHKy8sNp9NpTJ06NeL3/etf/9qQFPH7vf32243MzEzjk08+iTjmr371K8PhcBg7d+5s9X5IDpSmo09YvHixVq1a1eoxduzYg742NzdXW7du1aefftrl93355ZflcDj085//POL5G2+8UYZh6JVXXpEkvfrqq5Kk//f//l/Eftdff327x77mmmtaPdcyw9nQ0KBvvvlGEyZMkCS99957rfa/8sorrX87HA6ddNJJMgxDM2bMsJ7Pzc3VUUcdpS+++KLdsUihzypJN9xwQ8TzN954oyTppZde6vD1bXn//fe1Y8cOzZo1S7m5uRHbutsO5dJLL5XP59Pf//5367m//e1vamxs1E9+8pNuHdPU1d9/vLz88suaMGGCvv3tb1vPDRkyRNOnT4/Yb9WqVaqqqtKPfvQjffPNN9bD4XDolFNO0euvvx7voQPo47g+c302xer6vGfPHpWWluryyy/XwIEDrefHjh2r733ve9bvp6UD/zc8/fTTtX//fnk8Hknd+7uIptWrV8vv9+v666+P+H3PmjWr1b7PPfecTj/9dA0YMCDi2j558mQFg0G98cYbcRkzuo7SdPQJ3/72t3XSSSe1et48KXVk/vz5uuCCC/Stb31Lxx57rM4++2xdcsklnbpJ+M9//qNhw4YpOzs74nmztPg///mP9V+73a7CwsKI/UaNGtXusQ/cV5IqKip022236ZlnnlF5eXnEturq6lb7H3rooRE/5+TkKC0tzSqXbvn8gfPYDmR+hgPHXFBQoNzcXOuzdsXnn38uSVFpYWM6+uijdfLJJ2vp0qXWDc3SpUs1YcKEDn
/fndHV33+8/Oc//9Epp5zS6vmjjjoq4mfzZvbMM89s8zhutzv6gwPQr3F95vpsitX12fx8B17zpND/3itWrGi1qN6
"text/plain": [
"<Figure size 1500x1000 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Количество удаленных строк: 27832\n",
"Количество выбросов в столбце 'city_latitude': 38\n",
"Количество выбросов в столбце 'city_longitude': 0\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA+IAAAISCAYAAABI/3XmAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAADfGklEQVR4nOzdeXhU5dk/8O+ZfbLMZN8gCSFEdmSzmKqIikSgVltsy1sUUdTqi1rEFy2tpeBGxQJuVGtdwBa01l+1FVQIIKBlXwIBIoQIBALZl8k66/n9cWaGDAkhy8ycmcn3c11zQc48c+aZyWTOuc/9PPcjiKIogoiIiIiIiIj8QiF3B4iIiIiIiIh6EwbiRERERERERH7EQJyIiIiIiIjIjxiIExEREREREfkRA3EiIiIiIiIiP2IgTkRERERERORHDMSJiIiIiIiI/IiBOBEREREREZEfMRAnIiIiIiIi8iMG4tTr9OvXD7NmzZK7GyHv5ZdfRv/+/aFUKjFy5Mge7WvRokUQBME7Heuk06dPQxAErFq1yqv79efnb9WqVRAEAadPn/bL8xER9QSPz/7B47N3+ftYy7+T0MFAnIKa68tv37597d4/YcIEDBs2rMfP88UXX2DRokU93k9vsXHjRjz11FO47rrr8P777+PFF1/0+nO8+OKL+Oyzz7y+X2/YsWMHFi1ahNra2iu2PXbsGBYtWsRgmYhCCo/Pgam3H5/95c9//nPAXCigwMVAnHqd48eP469//WuXHvPFF19g8eLFPupR6NmyZQsUCgXeffddzJw5E1OmTOnR/p555hk0Nzd7bAvkA/2OHTuwePHidgPxSz9/x44dw+LFixmIE1Gvx+Oz7/X247Mv3HPPPWhubkZ6erp7GwNx6gyV3B0g8jetVit3F7qssbER4eHhcnej08rLy6HX66HRaLyyP5VKBZUqNL6ugvHzR0TkD8H4/cjjc+gcn7tLqVRCqVTK3Q0KQsyIU69z6dwaq9WKxYsXIysrCzqdDrGxsbj++uuRm5sLAJg1axZWrlwJABAEwX1zaWxsxJNPPonU1FRotVoMHDgQf/rTnyCKosfzNjc34/HHH0dcXBwiIyPx4x//GCUlJRAEwWNYnWu+1bFjx/DLX/4S0dHRuP766wEAhw8fxqxZs9C/f3/odDokJSXh/vvvR1VVlcdzufZx4sQJ3H333TAajYiPj8fvf/97iKKIs2fP4o477oDBYEBSUhKWLVvWqffOZrPhueeeQ2ZmJrRaLfr164ff/va3MJvN7jaCIOD9999HY2Oj+7260lXh3bt3Y8qUKYiOjkZ4eDhGjBiBV199tc3raf0cjY2NWL16tfs5Zs2aha+//hqCIODTTz9t8xxr166FIAjYuXNnp15rezrz/i9atAjz588HAGRkZLj758p4t/78rVq1Cj/72c8AADfddJO77datW92vs70hl+3NDzt69Chuvvlm6PV69O3bF88//zwcDke7r+PLL7/EDTfcgPDwcERGRmLq1Kk4evRot98XIiJv4PGZx+ee2LJli/vYFhUVhTvuuAMFBQUebVz9PXnyJGbNmoWoqCgYjUbcd999aGpq8mjb2c/FpXPE+/Xrh6NHj2Lbtm3u92DChAntvl+X2wcAiKKI559/Hn379kVYWBhuuummyx6ra2trMXfuXPdnfcCAAXjppZcuex5AgaF3X8KikFFXV4fKyso2261W6xUfu2jRIixZsgQPPPAAfvCDH8BkMmHfvn04cOAAbr31VvzqV7/C+fPnkZubi7/97W8ejxVFET/+8Y/x9ddfY/bs2Rg5ciQ2bNiA+fPno6SkBCtWrHC3nTVrFj7++GPcc889uPbaa7Ft2zZMnTr1sv362c9+hqysLLz44ovuk4bc3Fx8//33uO+++5CUlISjR4/i7bffxtGjR7Fr1642X+6/+MUvMHjwYPzxj3/E+vXr8fzzzyMmJgZ/+ctfcPPNN+Oll17CmjVr8H
//93+45pprMH78+A7fqwceeACrV6/GXXfdhSeffBK7d+/GkiVLUFBQ4D64/u1vf8Pbb7+NPXv24J133gEA/PCHP7zsPnNzc/GjH/0IycnJ+PWvf42kpCQUFBRg3bp1+PWvf93uY/72t7+5f18PPfQQACAzMxPXXnstUlNTsWbNGvzkJz/xeMyaNWuQmZmJ7OzsDl9jRzrz/v/0pz/FiRMn8OGHH2LFihWIi4sDAMTHx7fZ3/jx4/H444/jtddew29/+1sMHjwYANz/dlZpaSluuukm2Gw2/OY3v0F4eDjefvtt6PX6Nm3/9re/4d5770VOTg5eeuklNDU14c0338T111+PgwcPol+/fl1/Y4iILoPHZx6f/XF83rRpEyZPnoz+/ftj0aJFaG5uxuuvv47rrrsOBw4caHNs+/nPf46MjAwsWbIEBw4cwDvvvIOEhAS89NJL7jZd/Vy4vPLKK3jssccQERGB3/3udwCAxMTELr+mhQsX4vnnn8eUKVMwZcoUHDhwAJMmTYLFYvFo19TUhBtvvBElJSX41a9+hbS0NOzYsQMLFizAhQsX8Morr3T5uclPRKIg9v7774sAOrwNHTrU4zHp6enivffe6/756quvFqdOndrh88yZM0ds78/ls88+EwGIzz//vMf2u+66SxQEQTx58qQoiqK4f/9+EYA4d+5cj3azZs0SAYh/+MMf3Nv+8Ic/iADE//mf/2nzfE1NTW22ffjhhyIAcfv27W328dBDD7m32Ww2sW/fvqIgCOIf//hH9/aamhpRr9d7vCftycvLEwGIDzzwgMf2//u//xMBiFu2bHFvu/fee8Xw8PAO9+fqU0ZGhpieni7W1NR43OdwONq8ntbCw8Pb7fOCBQtErVYr1tbWureVl5eLKpXK432+klOnTokAxPfff9+9rbPv/8svvywCEE+dOtWm/aWfv3/+858iAPHrr79u0/bSz8bl9jF37lwRgLh79273tvLyctFoNHr0o76+XoyKihIffPBBj/2VlpaKRqOxzXYiou7i8ZnH50v58vg8cuRIMSEhQayqqnJvO3TokKhQKMSZM2e26e/999/vsc+f/OQnYmxsrPvnrnwuXJ/11sf8oUOHijfeeGObvrf3frW3j/LyclGj0YhTp071eL9/+9vfigA83t/nnntODA8PF0+cOOGxz9/85jeiUqkUi4uL2zwfBQYOTaeQsHLlSuTm5ra5jRgx4oqPjYqKwtGjR1FYWNjl5/3iiy+gVCrx+OOPe2x/8sknIYoivvzySwDAV199BQD43//9X492jz322GX3/fDDD7fZ1jrD2dLSgsrKSlx77bUAgAMHDrRp/8ADD7j/r1QqMXbsWIiiiNmzZ7u3R0VFYeDAgfj+++8v2xdAeq0AMG/ePI/tTz75JABg/fr1HT6+PQcPHsSpU6cwd+5cREVFedzX3eVQZs6cCbPZjE8++cS97R//+AdsNhvuvvvubu3Tpavvv7988cUXuPbaa/GDH/zAvS0+Ph4zZszwaJebm4va2lr8z//8DyorK903pVKJcePG4euvv/Z314koxPH4zOOzi6+OzxcuXEBeXh5mzZqFmJgY9/YRI0bg1ltvdb8/rV36O7zhhhtQVVUFk8kEoHufC2/atGkTLBYLHnvsMY/3e+7cuW3a/vOf/8QNN9yA6Ohoj2P7xIkTYbfbsX37dr/0mbqOQ9MpJPzgBz/A2LFj22x3fSl15Nlnn8Udd9yBq666CsOGDcNtt92Ge+65p1MnCWfOnEFKSgoiIyM9truGFp85c8b9r0KhQEZGhke7AQMGXHbfl7YFgOrqaixevBgfffQRysvLPe6rq6tr0z4tLc3jZ6PRCJ1O5x4u3Xr7pfPYLuV6DZf2OSkpCVFRUe7X2hVFRUUA4JUlbFwGDRqEa665BmvWrHGf0KxZswbXXntth+93Z3T1/feXM2fOYNy4cW22Dxw40ONn18nszTff3O5+DAaD9ztHRL0aj888Prv46v
jsen2XHvMA6fe9YcOGNkX1Ln3/o6OjAQA1NTUwGAzd+lx4k+s1ZWVleWyPj49399WlsLAQhw8fbncKHIA2n0cKHAz
"text/plain": [
"<Figure size 1500x1000 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Средняя цена в обучающей выборке: 38.655686510239775\n",
"Средняя цена в контрольной выборке: 38.6586177634688\n",
"Средняя цена в тестовой выборке: 38.61544768163524\n",
"\n",
"Стандартное отклонение цены в обучающей выборке: 5.380551235399826\n",
"Стандартное отклонение цены в контрольной выборке: 5.34170765011401\n",
"Стандартное отклонение цены в тестовой выборке: 5.3932492782181525\n",
"\n",
"Распределение по квартилам (обучающая):\n",
"0.25 34.269424\n",
"0.50 39.222500\n",
"0.75 42.284678\n",
"Name: city_latitude, dtype: float64\n",
"\n",
"Распределение по квартилам (контрольная):\n",
"0.25 34.286571\n",
"0.50 39.302247\n",
"0.75 42.277381\n",
"Name: city_latitude, dtype: float64\n",
"\n",
"Распределение по квартилам (тестовая):\n",
"0.25 34.194501\n",
"0.50 39.165900\n",
"0.75 42.286500\n",
"Name: city_latitude, dtype: float64\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA/YAAAIjCAYAAACpnIB8AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdeZiN9f/H8eeZfWHGMpixDdnGLmvIUvadtIhCKSkq0vJTIamkRSmlHRVfUdEihCgxRfadNIyRbWzDMPv9++PuHHPMaszMPWfm9biuc5373Odz3/f77Od9fzabYRgGIiIiIiIiIuKS3KwOQERERERERERyTom9iIiIiIiIiAtTYi8iIiIiIiLiwpTYi4iIiIiIiLgwJfYiIiIiIiIiLkyJvYiIiIiIiIgLU2IvIiIiIiIi4sKU2IuIiIiIiIi4MCX2IiIiwLlz5/j7779JSkqyOhQRkULDMAzOnDnDgQMHrA5FpFBTYi8iIkVSYmIir732Gg0bNsTb25uSJUtSo0YNVq1aZXVoLmHnzp0sXrzYcXvr1q0sWbLEuoAkT0VFRTF79mzH7UOHDjF37lzrApIC/Rm8cOECzz//PLVq1cLLy4vSpUtTs2ZN9u3bZ3VoIoWWh9UBiEhas2fP5r777nPc9vb2pnLlynTu3Jnx48dTrlw5C6MTcX3x8fF07tyZP/74gxEjRjB58mT8/Pxwd3enSZMmVofnEi5cuMBDDz1EcHAwpUuX5vHHH6dbt2706NHD6tAkD9hsNkaOHElISAi1atXi6aefplSpUgwaNMjq0IqsgvoZPH36NO3atSMyMpJHH32U1q1b4+XlhaenJ1WqVLE0NpHCTIm9SAH24osvUrVqVeLi4vj999+ZOXMmP/30Ezt37sTPz8/q8ERc1tSpU/nzzz9Zvnw57du3tzocl9SyZUvHBaBmzZo8+OCDFkcleaVChQo8+OCDdO3aFYCQkBDWrFljbVBFXEH9DD711FMcO3aM8PBw6tata3U4IkWGzTAMw+ogRMSZvcZ+48aNNG3a1LF+7NixTJs2jXnz5nH33XdbGKGI60pKSqJs2bI8/PDDvPzyy1aH4/J2797N5cuXqV+/Pl5eXlaHI3ns4MGDREdHU69ePfz9/a0ORyhYn8GTJ08SEhLCBx98UCBOMogUJepjL+JCbr31VgAiIiIAOHPmDE8++ST169enWLFiBAQE0K1bN7Zt25Zm27i4OF544QVq1qyJj48PISEh3HbbbRw8eBAw+0vabLYML6lrNdesWYPNZuOrr77i2WefJTg4GH9/f3r37s2RI0fSHPvPP/+ka9euBAYG4ufnR7t27Vi3bl26j7F9+/bpHv+FF15IU/bLL7+kSZMm+Pr6UqpUKQYMGJDu8TN7bKmlpKTw9ttvU7duXXx8fChXrhwPPfQQZ8+edSpXpUoVevbsmeY4o0aNSrPP9GJ//fXX0zynYDYPnzhxItWrV8fb25tKlSrx9NNPEx8fn+5zlVr79u2pV69emvVvvPEGNpuNQ4cOOa0/d+4co0ePplKlSnh7e1O9enWmTp1KSkqKo4z9eXvjjTfS7LdevXrpvie+/vrrDGMcOnRotpphVqlSxfH6uLm5ERwczF133UVkZGSW2wK8//771K1bF29vb8qXL8/IkSM5d+6c4/59+/Zx9uxZihcvTrt27fDz8yMwMJCePXuyc+dOR7nVq1djs9lYtGhRmmPMmzcPm81GeHi4I+ahQ4c6lbE/J6lrNdeuXcsdd9xB5cqVHa/xmDFjuHz5stO2L7zwQpr30ty5c2nUqBE+Pj6ULl2au+++O81zMnToUIoVK+a07uuvv04TB0CxYsXSxAzZ+1y1b9/e8frXqVOHJk2asG3btnQ/V9k1e/bsNO/VXbt2UbJkSXr27Ok0qOE///zDHXfcQalSpfDz8+Omm25K07c4s/dk6sduP25mF3vfcvvz+88//9ClSxf8/f0pX748L774IlfXk8
TGxjJ27FjHZ6xWrVq88cYbacplFkPqz5i9zF9//ZXp85jeewAyfh8sXLjQ8XoHBQVxzz33cPTo0TT7tH92q1WrRosWLThz5gy+vr7pfr+kF9PVn/0jR45ka/uhQ4dm+fqk3n7p0qW0adMGf39/ihcvTo8ePdi1a1ea/e7du5c777yTMmXK4OvrS61atXjuueeAK5+/zC6pn8fsPoepty9ZsiTt27dn7dq1aWLL6jsMrv8zePVvbVBQED169HD6DgTzN2zUqFEZ7ufqz+3GjRtJSUkhISGBpk2bZvp9BfDLL784Xq8SJUrQp08f9uzZ41TG/nrYX7OAgABH14O4uLg08ab+zU1KSqJ79+6UKlWK3bt3O9bPmjWLW2+9lbJly+Lt7U2dOnWYOXNmmtjc3NyYMGGC03r79//V5UWspqb4Ii7EnoSXLl0aMP/cLl68mDvuuIOqVaty4sQJPvzwQ9q1a8fu3bspX748AMnJyfTs2ZNVq1YxYMAAHn/8cS5cuMCKFSvYuXMn1apVcxzj7rvvpnv37k7HHTduXLrxvPzyy9hsNp555hlOnjzJ22+/TceOHdm6dSu+vr6A+cPYrVs3mjRpwsSJE3Fzc3P8oK5du5bmzZun2W/FihWZMmUKABcvXuThhx9O99jjx4/nzjvv5IEHHuDUqVO8++67tG3bli1btlCiRIk02wwfPpw2bdoA8O2336ZJ2B566CFHa4nHHnuMiIgIZsyYwZYtW1i3bh2enp7pPg/X4ty5c47HllpKSgq9e/fm999/Z/jw4dSuXZsdO3bw1ltvsX//fqcBkq7XpUuXaNeuHUePHuWhhx6icuXKrF+/nnHjxnHs2DHefvvtXDtWTrVp04bhw4eTkpLCzp07efvtt/n333/T/ROc2gsvvMCkSZPo2LEjDz/8MPv27WPmzJls3LjR8RqePn0aMN/XNWrUYNKkScTFxfHee+/RunVrNm7cSM2aNWnfvj2VKlVi7ty59OvXz+k4c+fOpVq1ao4msNm1cOFCLl26xMMPP0zp0qXZsGED7777LlFRUSxcuDDD7ebNm8c999xDw4YNmTJlCqdPn+add97h999/Z8uWLQQFBV1THBnJyefK7plnnsmVGOyOHDlC165dCQsLY8GCBXh4mH9ZTpw4QatWrbh06RKPPfYYpUuXZs6cOfTu3Zuvv/46zWuVlbZt2/LFF184bttbcdiTPIBWrVo5lpOTk+natSs33XQTr732GsuWLWPixIkkJSXx4osvAuYo4L1792b16tUMGzaMRo0asXz5cp566imOHj3KW2+9lW4sb731luO1zI/WJPbvu2bNmjFlyhROnDjB9OnTWbduXZav94QJE9IkVdciu9s/9NBDdOzY0XH73nvvpV+/ftx2222OdWXKlAHgiy++YMiQIXTp0oWpU6dy6dIlZs6cyc0338yWLVscJxe2b99OmzZt8PT0ZPjw4VSpUoWDBw/yww8/8PLLL3PbbbdRvXp1x/7HjBlD7dq1GT58uGNd7dq1gWt7DoOCghyvfVRUFNOnT6d79+4cOXLEUS4732EZudbPYFhYGM899xyGYXDw4EGmTZtG9+7ds30SNT3279dRo0bRpEkTXn31VU6dOpXu99XKlSvp1q0bN9xwAy+88AKXL1/m3XffpXXr1mzevDnNyaA777yTKlWqMGXKFP744w/eeecdzp49y+eff55hPA888ABr1qxhxYoV1KlTx7F+5syZ1K1bl969e+Ph4cEPP/zAI488QkpKCiNHjgTMypRHHnmEKVOm0LdvXxo3bsyxY8d49NFH6dixIyNGjMjx8ySSJwwRKXBmzZplAMbKlSuNU6dOGUeOHDHmz59vlC5d2vD19TWioqIMwzCMuLg4Izk52WnbiIgIw9vb23jxxRcd6z777DMDMKZNm5bmWCkpKY7tAOP1119PU6Zu3bpGu3btHLdXr15tAEaFChWMmJgYx/oFCxYYgDF9+nTHvmvUqGF06dLFcR
zDMIxLly4ZVatWNTp16pTmWK1atTLq1avnuH3q1CkDMCZOnOhYd+jQIcPd3d14+eWXnbbdsWOH4eHhkWb9gQMHDMC
"text/plain": [
"<Figure size 1200x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import numpy as np\n",
"\n",
"# Загрузка данных\n",
"df = pd.read_csv(\"../../datasets/nuforc_reports.csv\")\n",
"\n",
"\n",
"#5. Устранение пропущенных данных\n",
" \n",
"#Сведения о пропущенных данных\n",
"print(\"Количество пропущенных значений в каждом столбце:\")\n",
"print(df.isnull().sum())\n",
"\n",
"# Процент пропущенных значений признаков\n",
"for i in df.columns:\n",
" null_rate = df[i].isnull().sum() / len(df) * 100\n",
" if null_rate > 0:\n",
" print(f'{i} Процент пустых значений: %{null_rate:.2f}')\n",
"\n",
"\n",
"\n",
"#6. Проблемы набора данных\n",
" #5.1Выбросы: Возможны аномалии в значениях скорости или расстояния.\n",
" #Смещение: Данные могут быть смещены в сторону объектов, которые легче обнаружить (крупные, близкие).\n",
"\n",
"#7. Решения для обнаруженных проблем\n",
" #Выбросы: Идентификация и обработка выбросов через методы (например, IQR или Z-оценка).\n",
" #Смещение: Использование методов балансировки данных, таких как oversampling.\n",
"\n",
"#7.1 Проверка набора данных на выбросы\n",
"# Выбираем столбцы для анализа\n",
"columns_to_check = df.select_dtypes(include=np.number).columns.tolist()#['city_latitude' , 'sqft_living', 'bathrooms', 'yr_built']\n",
"def Emissions(columns_to_check):\n",
"\n",
" # Функция для подсчета выбросов\n",
" def count_outliers(df, columns):\n",
" outliers_count = {}\n",
" for col in columns:\n",
" Q1 = df[col].quantile(0.25)\n",
" Q3 = df[col].quantile(0.75)\n",
" IQR = Q3 - Q1\n",
" lower_bound = Q1 - 1.5 * IQR\n",
" upper_bound = Q3 + 1.5 * IQR\n",
" \n",
" # Считаем количество выбросов\n",
" outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]\n",
" outliers_count[col] = len(outliers)\n",
" \n",
" return outliers_count\n",
"\n",
" # Подсчитываем выбросы\n",
" outliers_count = count_outliers(df, columns_to_check)\n",
"\n",
" # Выводим количество выбросов для каждого столбца\n",
" for col, count in outliers_count.items():\n",
" print(f\"Количество выбросов в столбце '{col}': {count}\")\n",
" \n",
" # Создаем гистограммы\n",
" plt.figure(figsize=(15, 10))\n",
" for i, col in enumerate(columns_to_check, 1):\n",
" plt.subplot(2, 3, i)\n",
" sns.histplot(df[col], kde=True)\n",
" plt.title(f'Histogram of {col}')\n",
" plt.tight_layout()\n",
" plt.show()\n",
"Emissions(columns_to_check)\n",
"\n",
"#Признак miss_distance не имеет выбросов, \n",
"#признак absolute_magnitude имеет количество выбросов в приемлемом диапазоне\n",
"#для признаков est_diameter_min, est_diameter_max и relative_velocity необходимо использовать метод решения проблемы выбросов. \n",
"#Воспользуемся методом удаления наблюдений с такими выбросами:\n",
"# Выбираем столбцы для очистки\n",
"columns_to_clean = ['city_latitude']\n",
"\n",
"# Функция для удаления выбросов\n",
"def remove_outliers(df, columns):\n",
" for col in columns:\n",
" Q1 = df[col].quantile(0.25)\n",
" Q3 = df[col].quantile(0.75)\n",
" IQR = Q3 - Q1\n",
" lower_bound = Q1 - 1.5 * IQR\n",
" upper_bound = Q3 + 1.5 * IQR\n",
" \n",
" # Удаляем строки, содержащие выбросы\n",
" df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]\n",
" \n",
" return df\n",
"\n",
"# Удаляем выбросы\n",
"df_cleaned = remove_outliers(df, columns_to_clean)\n",
"\n",
"# Выводим количество удаленных строк\n",
"print(f\"Количество удаленных строк: {len(df) - len(df_cleaned)}\")\n",
"\n",
"df = df_cleaned\n",
"\n",
"#Оценим выбросы в выборке после усреднения:\n",
"Emissions(columns_to_clean)\n",
"\n",
"#Удалось избавиться от выбросов в соответствующих признаках как видно на диаграммах.\n",
"\n",
"\n",
"\n",
"#8. Разбиение данных на выборки\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"train_data, temp_data = train_test_split(df, test_size=0.3, random_state=42)\n",
"val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)\n",
"\n",
"# Средние значения цены\n",
"print(\"Средняя цена в обучающей выборке:\", train_data['city_latitude' ].mean())\n",
"print(\"Средняя цена в контрольной выборке:\", val_data['city_latitude' ].mean())\n",
"print(\"Средняя цена в тестовой выборке:\", test_data['city_latitude' ].mean())\n",
"print()\n",
"\n",
"# Стандартное отклонение цены\n",
"print(\"Стандартное отклонение цены в обучающей выборке:\", train_data['city_latitude' ].std())\n",
"print(\"Стандартное отклонение цены в контрольной выборке:\", val_data['city_latitude' ].std())\n",
"print(\"Стандартное отклонение цены в тестовой выборке:\", test_data['city_latitude' ].std())\n",
"print()\n",
"\n",
"# Проверка распределений по количеству объектов в диапазонах\n",
"print(\"Распределение по квартилам (обучающая):\")\n",
"print(train_data['city_latitude' ].quantile([0.25, 0.5, 0.75]))\n",
"print()\n",
"print(\"Распределение по квартилам (контрольная):\")\n",
"print(val_data['city_latitude' ].quantile([0.25, 0.5, 0.75]))\n",
"print()\n",
"print(\"Распределение по квартилам (тестовая):\")\n",
"print(test_data['city_latitude' ].quantile([0.25, 0.5, 0.75]))\n",
"\n",
"# Построение гистограмм для каждой выборки\n",
"plt.figure(figsize=(12, 6))\n",
"\n",
"sns.histplot(train_data['city_latitude' ], color='blue', label='Train', kde=True)\n",
"sns.histplot(val_data['city_latitude' ], color='green', label='Validation', kde=True)\n",
"sns.histplot(test_data['city_latitude' ], color='red', label='Test', kde=True)\n",
"\n",
"plt.legend()\n",
"plt.xlabel('city_latitude' )\n",
"plt.ylabel('Frequency')\n",
"plt.title('Распределение цены в обучающей, контрольной и тестовой выборках')\n",
"plt.show()\n",
"\n",
"\n",
"#9. Оценить сбалансированность выборок для каждого набора данных. Оценить необходимость использования методов приращения (аугментации) данных. \n",
"#Выводы по сбалансированности\n",
"#Если распределение классов примерно равно (например, 50%/50%), выборка считается сбалансированной, и аугментация данных не требуется.\n",
"#Если один из классов сильно доминирует (например, 90%/10%), выборка несбалансированная, и может потребоваться аугментация данных.\n",
"\n",
"#Выборки оказались недостаточно сбалансированными. Используем методы приращения данных с избытком и с недостатком:\n",
"\n"
]
},
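{
"cell_type": "markdown",
"metadata": {},
"source": [
"The IQR filter in the previous cell keeps only rows that lie inside the bounds, so rows with a missing `city_latitude` are discarded as well, because a comparison with NaN is False. The next cell is a minimal sketch, not part of the original lab, that counts outliers and missing values separately on a small made-up frame. The helper names `iqr_bounds` and `outlier_mask` and the `demo` values are assumptions introduced only for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"def iqr_bounds(series, k=1.5):\n",
"    # Lower and upper fences of the k * IQR rule; quantile() skips NaN values\n",
"    q1 = series.quantile(0.25)\n",
"    q3 = series.quantile(0.75)\n",
"    iqr = q3 - q1\n",
"    return q1 - k * iqr, q3 + k * iqr\n",
"\n",
"def outlier_mask(series, k=1.5):\n",
"    # True only for values strictly outside the fences; NaN compares as False\n",
"    lower, upper = iqr_bounds(series, k)\n",
"    return (series < lower) | (series > upper)\n",
"\n",
"# Hypothetical toy data: one clearly distant latitude and one missing value\n",
"demo = pd.DataFrame({'city_latitude': [34.0, 36.5, 38.8, 39.2, 41.0, 42.3, 64.8, np.nan]})\n",
"\n",
"is_outlier = outlier_mask(demo['city_latitude'])\n",
"is_missing = demo['city_latitude'].isna()\n",
"\n",
"print('outliers:', int(is_outlier.sum()))\n",
"print('missing:', int(is_missing.sum()))\n",
"\n",
"# Dropping only the outliers keeps the NaN row for a later imputation step;\n",
"# adding ~is_missing reproduces the behaviour of the range filter\n",
"only_outliers_removed = demo[~is_outlier]\n",
"both_removed = demo[~is_outlier & ~is_missing]\n",
"print(len(demo), len(only_outliers_removed), len(both_removed))"
]
},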
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Нет пропущенных данных."
]
},
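{
"cell_type": "markdown",
"metadata": {},
"source": [
"A statement about missing values can be checked per column with one vectorized expression. The next cell is a hedged sketch on a toy frame: `demo` and its values are made up, and only the column names are taken from the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy frame that reuses two column names from the dataset\n",
"demo = pd.DataFrame({\n",
"    'city': ['Visalia', None, 'Tecopa', 'Knoxville'],\n",
"    'city_latitude': [36.36, 39.17, np.nan, np.nan],\n",
"})\n",
"\n",
"# isna() gives a boolean frame; the column mean is the share of missing values\n",
"missing_share = demo.isna().mean().mul(100).round(2)\n",
"\n",
"# Show only the columns that actually contain gaps, largest share first\n",
"print(missing_share[missing_share > 0].sort_values(ascending=False))"
]
},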
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Разбиваем на выборки (обучающую, тестовую, контрольную)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Размеры выборок:\n",
"Обучающая выборка: 65464 записей\n",
"Валидационная выборка: 21822 записей\n",
"Тестовая выборка: 21822 записей\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABccAAAIjCAYAAADGGKM5AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAACbjklEQVR4nOzdeXyNd/7//2cSsokkQhaZRqSordZoSdWeJiIo1fZrqaW0wQQjOhgzqui0KWorSrW2FtXqqBYtYldC0WZsbQaNRkui1oOS9fr90V+uj9MkSiRO6jzut9u5Ta7r/TrX9bpOOl7nvPI+78vBMAxDAAAAAAAAAADYEUdbJwAAAAAAAAAAwL1GcxwAAAAAAAAAYHdojgMAAAAAAAAA7A7NcQAAAAAAAACA3aE5DgAAAAAAAACwOzTHAQAAAAAAAAB2h+Y4AAAAAAAAAMDu0BwHAAAAAAAAANgdmuPAXbh27ZpOnTqlixcv2joVFCN+rwAA2I9Lly7p+PHjys7OtnUqAACUerm5uTp37px++OEHW6cCFAua48AdWrlypdq1a6fy5cvLw8NDVapU0eTJk22dFu4Sv1cAAOxDVlaWJk+erAYNGsjFxUUVKlRQjRo1tHnzZlunBgBAqZSWlqbhw4crODhYzs7O8vX1VZ06dWSxWGydGnDXytg6AcCWjhw5ovj4eG3dulXnzp1TxYoV1aZNG/3zn/9U3bp188X/4x//0KRJk/Tkk0/q3XffVaVKleTg4KCHHnrIBtmjuPB7BYDitXjxYj3//PNW+3x9fVW3bl2NGjVKUVFRNsoM9i4jI0MRERHas2ePBg0apFdffVXu7u5ycnJSaGiordMDgPueg4PDbcVt3bpVrVu3LtlkcFuOHz+uNm3aKCsrS8OGDVPjxo1VpkwZubm5qVy5crZOD7hrNMdht1atWqUePXrIx8dHAwYMUEhIiE6ePKkFCxbok08+0YoVK9S1a1czfvv27Zo0aZLi4+P1j3/8w4aZozjxewWAkjNx4kSFhITIMAylp6dr8eLF6tChg9asWaOOHTvaOj3YoUmTJmnv3r3asGEDTRcAsIEPPvjAavv9999XQkJCvv21a9e+l2nhFgYOHChnZ2ft2bNHf/nLX2ydDlDsHAzDMGydBHCvnThxQvXr11eVKlW0Y8cO+fr6mmPnzp1TixYtdOrUKR08eFAPPvigJKlTp066cOGCdu3aZau0UQL4vQJA8cubOb5v3z41adLE3H/x4kX5+/vrmWee0bJly2yYIexRdna2/Pz8NHjwYL322mu2TgcAIGnIkCGaM2eOaE2VTgcOHFCTJk20ceNGPfHEE7ZOBygRrDkOuzRlyhT9+uuvmj9/vlVjXJIqVaqkd955R9euXbNac3rPnj16+OGH1b17d/n4+MjNzU2PPPKIVq9ebcZcvXpV5cqV09/+9rd85/zpp5/k5OSk+Ph4SVK/fv1UtWrVfHEODg4aP368uf3jjz/qr3/9q2rWrCk3NzdVrFhRzzzzjE6ePGn1vG3btsnBwUHbtm0z9+3bt09PPPGEypcvr3Llyql169bauXOn1fMWL14sBwcH7d+/39x37ty5fHlIUseOHfPlvHPnTj3zzDOqUqWKXFxcFBQUpLi4OF2/fj3ftX3yySdq0qSJypcvLwcHB/Px5ptv5ostKMe8h7u7u+rVq6f33nvPKq5fv37y8PC45bF+f12383vNc/bsWQ0YMED+/v5ydXVVgwYNtGTJEquYkydPmtc0ffp0BQcHy83NTa1atdLhw4fz5fv713Pp0qVydHTUG2+8Ye47ePCg+vXrpwcffFCurq4KCAhQ//79df78+VteKwCUNt7e3nJzc1OZMtZfXnzzzTf12GOPqWLFinJzc1NoaKg++eSTAo/x+5qQ97h5FnBezM21Mjc3V/Xr15eDg4MWL16c77hVq1Yt8Li/j73dXB0cHDRkyJB8+wuqpQXVg1OnTsnNzS3fdUjS22
+/rbp168rFxUWBgYGKjY3VpUuXrGJat26thx9+ON/533zzzXzHrFq1aoEz+YcMGZLv6++LFi1S27Zt5efnJxcXF9WpU0dz587N99zs7Gz9+9//1kMPPSQXFxer1/Tm9xwF6devn1V8hQoVCnwPU1jeeX7/3ig5OVkXL15U+fLl1apVK7m7u8vLy0sdO3bMV6Ml6dtvv1VUVJQ8PT3l4eGhdu3aac+ePVYxef+t7dixQwMHDlTFihXl6empPn365Luxd9WqVdWvXz+rfTExMXJ1dbV6//bZZ58pOjpagYGBcnFxUbVq1fTqq68qJyfnlq8bANyPMjIy9Morr6h69erm581Ro0YpIyMjX+zSpUv16KOPyt3dXRUqVFDLli21ceNGSYXX+bzHzXX42rVreumllxQUFCQXFxfVrFlTb775Zr4G/s3Pd3Jy0l/+8hfFxMRY1eTMzEyNGzdOoaGh8vLyUrly5dSiRQtt3bo1X/55nzerVKkiJycn89h/9Bn399fn6OiogIAA/b//9/+Umppqxtz8WbUw48ePt6r9e/bskaurq06cOGG+9wgICNDAgQN14cKFfM9fuXKlQkND5ebmpkqVKum5557Tzz//bBWT97n9hx9+UGRkpMqVK6fAwEBNnDjR6jXOy/fm92JXrlxRaGioQkJCdObMGXP/nbyXBH6PZVVgl9asWaOqVauqRYsWBY63bNlSVatW1bp168x958+f1/z58+Xh4aFhw4bJ19dXS5cu1VNPPaVly5apR48e8vDwUNeuXfXRRx9p2rRpcnJyMp//4YcfyjAM9erV645y3bdvn3bv3q3u3bvrgQce0MmTJzV37ly1bt1aR48elbu7e4HPO378uFq3bi13d3eNHDlS7u7uevfddxUeHq6EhAS1bNnyjvIozMqVK/Xrr79q8ODBqlixor7++mvNmjVLP/30k1auXGnGJSYm6tlnn1WDBg30xhtvyMvLS+fOnVNcXNxtn2v69OmqVKmSLBaLFi5cqBdffFFVq1ZVeHh4kfO/nd+rJF2/fl2tW7fW8ePHNWTIEIWEhGjlypXq16+fLl26lO8PIu+//76uXLmi2NhY3bhxQzNnzlTbtm116NAh+fv7F5jLxo0b1b9/fw0ZMsRqiZeEhAT98MMPev755xUQEKAjR45o/vz5OnLkiPbs2XPb6/YBwL12+fJlnTt3ToZh6OzZs5o1a5auXr2q5557zipu5syZ6ty5s3r16qXMzEytWLFCzzzzjNauXavo6OgCj51XEyTd1izgDz74QIcOHbplTMOGDfXSSy9JklJSUjRu3Lh8MUXJtSjGjRunGzdu5Ns/fvx4TZgwQeHh4Ro8eLCSk5M1d+5c7du3T7t27VLZsmWLLYeCzJ07V3Xr1lXnzp1VpkwZrVmzRn/961+Vm5ur2NhYM27q1Kl6+eWX1bVrV40ePVouLi7auXOn5s+ff1vnqVSpkqZPny7ptwkGM2fOVIcOHXTq1Cl5e3sXKfe8PyqPGTNGNWrU0IQJE3Tjxg3NmTNHzZs31759+8z7jRw5ckQtWrSQp6enRo0apbJly+qdd95R69attX37djVt2tTq2EOGDJG3t7fGjx9v/k5+/PFHs0FfkFdeeUULFizQRx99lO+POx4eHhoxYoQ8PDy0ZcsWjRs3ThaLRVOmTCnStQPAn1Fubq46d+6sr776SjExMapdu7YOHTqk6dOn63//+5/VhKYJEyZo/PjxeuyxxzRx4kQ5Oztr79692rJliyIiIjRjxgxdvXpVkvTdd9/p9ddf1z//+U9z+Za8BrRhGOrcubO2bt2qAQMGqGHDhtqwYYNGjhypn3/+2axNebp27aqnnnpK2dnZSkxM1Pz583X9+nVzmRiLxaL33ntPPXr00IsvvqgrV65owYIFioyM1Ndff62GDRuax+rbt682bdqkoUOHqkGDBnJyctL8+fP1zTff3Nbr1aJFC8XExCg3N1eHDx/WjBkzdPr06Xx/XL4T58
+f140bNzR48GC1bdtWgwYN0okTJzRnzhzt3btXe/fulYuLi6T/++bgI488ovj4eKWnp2vmzJnatWuXvv32W6v6nZO
"text/plain": [
"<Figure size 1800x600 with 3 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Separate the features from the target variable\n",
"X = df.drop(columns=['city_latitude'])  # Features: every column except 'city_latitude'\n",
"y = df['city_latitude']  # Target: city latitude\n",
"\n",
"# Split into training (60%), validation (20%) and test (20%) sets\n",
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)\n",
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
"\n",
"# Check the size of each set\n",
"print(\"Split sizes:\")\n",
"print(f\"Training set: {X_train.shape[0]} rows\")\n",
"print(f\"Validation set: {X_val.shape[0]} rows\")\n",
"print(f\"Test set: {X_test.shape[0]} rows\")\n",
"\n",
"# Plot the distribution of the target (latitude) in each set\n",
"plt.figure(figsize=(18, 6))\n",
"\n",
"plt.subplot(1, 3, 1)\n",
"plt.hist(y_train, bins=30, color='blue', alpha=0.7)\n",
"plt.title('Training set')\n",
"plt.xlabel('Latitude')\n",
"plt.ylabel('Count')\n",
"\n",
"plt.subplot(1, 3, 2)\n",
"plt.hist(y_val, bins=30, color='green', alpha=0.7)\n",
"plt.title('Validation set')\n",
"plt.xlabel('Latitude')\n",
"plt.ylabel('Count')\n",
"\n",
"plt.subplot(1, 3, 3)\n",
"plt.hist(y_test, bins=30, color='red', alpha=0.7)\n",
"plt.title('Test set')\n",
"plt.xlabel('Latitude')\n",
"plt.ylabel('Count')\n",
"\n",
"plt.show()"
]
},
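{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two `train_test_split` calls above give a 60/20/20 split: 40% of the rows are held out first, and that holdout is then halved. The arithmetic can be sketched without scikit-learn as shown here. This is only an illustration, not scikit-learn's implementation; the helper name `three_way_split` and its parameters are assumptions made for this example.\n",
"\n",
"```python\n",
"import random\n",
"\n",
"def three_way_split(rows, holdout=0.4, test_share=0.5, seed=42):\n",
"    # Shuffle a copy so the caller's data is left untouched\n",
"    shuffled = list(rows)\n",
"    random.Random(seed).shuffle(shuffled)\n",
"    n_holdout = round(len(shuffled) * holdout)  # rows held out of training\n",
"    n_test = round(n_holdout * test_share)      # part of the holdout used for testing\n",
"    train = shuffled[:len(shuffled) - n_holdout]\n",
"    held = shuffled[len(shuffled) - n_holdout:]\n",
"    return train, held[:n_holdout - n_test], held[n_holdout - n_test:]\n",
"\n",
"train, val, test = three_way_split(range(10000))\n",
"print(len(train), len(val), len(test))\n",
"```\n",
"\n",
"With 10,000 input rows the three parts contain 6,000, 2,000 and 2,000 rows, and no row appears in more than one part."
]
},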
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Balancing the samples**"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Split sizes:\n",
"Training set: 6000 rows\n",
"Validation set: 2000 rows\n",
"Test set: 2000 rows\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Separate the features from the target variable (first 10,000 rows only)\n",
"X = df.drop(columns=['city_latitude']).head(10000)  # Features: every column except 'city_latitude'\n",
"y = df['city_latitude'].head(10000)  # Target: city latitude\n",
"\n",
"# One-hot encode the categorical features\n",
"X = pd.get_dummies(X, drop_first=True)\n",
"\n",
"# Split into training (60%), validation (20%) and test (20%) sets\n",
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)\n",
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
"\n",
"# Check the size of each set\n",
"print(\"Split sizes:\")\n",
"print(f\"Training set: {X_train.shape[0]} rows\")\n",
"print(f\"Validation set: {X_val.shape[0]} rows\")\n",
"print(f\"Test set: {X_test.shape[0]} rows\")\n",
"\n",
"# Remove outliers: drop training rows whose latitude exceeds the 95th percentile\n",
"upper_limit = y_train.quantile(0.95)\n",
"mask = y_train <= upper_limit\n",
"X_train = X_train[mask]\n",
"y_train = y_train[mask]\n",
"\n",
"# Log-transform the target variable\n",
"y_train_log = np.log1p(y_train)\n",
"y_val_log = np.log1p(y_val)\n",
"y_test_log = np.log1p(y_test)\n",
"\n",
"# Standardize the features (the scaler is fitted on the training set only)\n",
"scaler = StandardScaler()\n",
"X_train_scaled = scaler.fit_transform(X_train)\n",
"X_val_scaled = scaler.transform(X_val)\n",
"X_test_scaled = scaler.transform(X_test)\n",
"\n",
"# Plot the distribution of the log-transformed training target\n",
"plt.figure(figsize=(10, 6))\n",
"plt.hist(y_train_log, bins=30, color='orange', alpha=0.7)\n",
"plt.title('Balanced training set (log transform)')\n",
"plt.xlabel('log1p(latitude)')\n",
"plt.ylabel('Count')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**One-hot encoding of categorical features**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data before one-hot encoding:\n",
" summary city state \\\n",
"0 Viewed some red lights in the sky appearing to... Visalia CA \n",
"1 Look like 1 or 3 crafts from North traveling s... Cincinnati OH \n",
"3 One red light moving switly west to east, beco... Knoxville TN \n",
"4 Bright, circular Fresnel-lens shaped light sev... Alexandria VA \n",
"5 I'm familiar with all the fakery and UFO sight... Fullerton CA \n",
"... ... ... ... \n",
"12494 star like stop and go satellite San Francisco CA \n",
"12495 Two Balls of Light sighted in Tulsa, Ok. Tulsa OK \n",
"12496 Highley reflective silver oval/disk seen in sk... St Louis MO \n",
"12497 While walking on broadwalk on south west corne... Tempe AZ \n",
"12498 was sitting on my front porch with granddaught... Elkhart KS \n",
"\n",
" date_time shape duration \\\n",
"0 2021-12-15T21:45:00 light 2 minutes \n",
"1 2021-12-16T09:45:00 triangle 14 seconds \n",
"3 2021-12-10T19:30:00 triangle 20-30 seconds \n",
"4 2021-12-07T08:00:00 circle NaN \n",
"5 2020-07-07T23:00:00 unknown 2 minutes \n",
"... ... ... ... \n",
"12494 2000-06-05T23:30:00 NaN 1 minute \n",
"12495 2000-06-06T08:15:00 circle about 5 minutes \n",
"12496 2000-06-06T17:22:00 oval one minute \n",
"12497 2000-06-07T22:00:00 unknown 4 seconds \n",
"12498 2000-06-07T23:48:00 circle 1to2 minutes \n",
"\n",
" stats \\\n",
"0 Occurred : 12/15/2021 21:45 (Entered as : 12/... \n",
"1 Occurred : 12/16/2021 09:45 (Entered as : 12/... \n",
"3 Occurred : 12/10/2021 19:30 (Entered as : 12/... \n",
"4 Occurred : 12/7/2021 08:00 (Entered as : 12/0... \n",
"5 Occurred : 7/7/2020 23:00 (Entered as : 07/07... \n",
"... ... \n",
"12494 Occurred : 6/5/2000 23:30 (Entered as : 06/05... \n",
"12495 Occurred : 6/6/2000 08:15 (Entered as : 06/06... \n",
"12496 Occurred : 6/6/2000 17:22 (Entered as : 06/06... \n",
"12497 Occurred : 6/7/2000 22:00 (Entered as : 060/7... \n",
"12498 Occurred : 6/7/2000 23:48 (Entered as : 6/7/2... \n",
"\n",
" report_link \\\n",
"0 http://www.nuforc.org/webreports/165/S165881.html \n",
"1 http://www.nuforc.org/webreports/165/S165888.html \n",
"3 http://www.nuforc.org/webreports/165/S165825.html \n",
"4 http://www.nuforc.org/webreports/165/S165754.html \n",
"5 http://www.nuforc.org/webreports/157/S157444.html \n",
"... ... \n",
"12494 http://www.nuforc.org/webreports/013/S13042.html \n",
"12495 http://www.nuforc.org/webreports/013/S13150.html \n",
"12496 http://www.nuforc.org/webreports/013/S13043.html \n",
"12497 http://www.nuforc.org/webreports/013/S13054.html \n",
"12498 http://www.nuforc.org/webreports/013/S13051.html \n",
"\n",
" text posted \\\n",
"0 Viewed some red lights in the sky appearing to... 2021-12-19T00:00:00 \n",
"1 Look like 1 or 3 crafts from North traveling s... 2021-12-19T00:00:00 \n",
"3 One red light moving switly west to east, beco... 2021-12-19T00:00:00 \n",
"4 Bright, circular Fresnel-lens shaped light sev... 2021-12-19T00:00:00 \n",
"5 I'm familiar with all the fakery and UFO sight... 2020-07-09T00:00:00 \n",
"... ... ... \n",
"12494 star like stop and go satellite A white star-l... 2000-06-21T00:00:00 \n",
"12495 Two Balls of Light sighted in Tulsa, Ok. At su... 2000-06-21T00:00:00 \n",
"12496 Highley reflective silver oval/disk seen in sk... 2000-06-21T00:00:00 \n",
"12497 On southwest corner of Tempe Lake heard and sa... 2000-06-21T00:00:00 \n",
"12498 was sitting on my front porch with granddaught... 2000-06-21T00:00:00 \n",
"\n",
" city_latitude city_longitude \n",
"0 36.356650 -119.347937 \n",
"1 39.174503 -84.481363 \n",
"3 35.961561 -83.980115 \n",
"4 38.798958 -77.095133 \n",
"5 33.877422 -117.924978 \n",
"... ... ... \n",
"12494 37.769992 -122.425394 \n",
"12495 36.109456 -95.935245 \n",
"12496 38.623825 -90.308528 \n",
"12497 33.414036 -111.920920 \n",
"12498 37.046000 -101.853100 \n",
"\n",
"[10000 rows x 12 columns]\n",
"\n",
"Данные после унитарного кодирования:\n",
" city_latitude city_longitude \\\n",
"0 36.356650 -119.347937 \n",
"1 39.174503 -84.481363 \n",
"3 35.961561 -83.980115 \n",
"4 38.798958 -77.095133 \n",
"5 33.877422 -117.924978 \n",
"... ... ... \n",
"12494 37.769992 -122.425394 \n",
"12495 36.109456 -95.935245 \n",
"12496 38.623825 -90.308528 \n",
"12497 33.414036 -111.920920 \n",
"12498 37.046000 -101.853100 \n",
"\n",
" summary_ A couple stopped me and told me &quot;Am I crazy or those lights are moving? ((Starlink satellites?)) \\\n",
"0 False \n",
"1 False \n",
"3 False \n",
"4 False \n",
"5 False \n",
"... ... \n",
"12494 False \n",
"12495 False \n",
"12496 False \n",
"12497 False \n",
"12498 False \n",
"\n",
" summary_ All total, I counted 15 aircraft flying in single file with no audible noise ((Starlink satellites?)) \\\n",
"0 False \n",
"1 False \n",
"3 False \n",
"4 False \n",
"5 False \n",
"... ... \n",
"12494 False \n",
"12495 False \n",
"12496 False \n",
"12497 False \n",
"12498 False \n",
"\n",
" summary_ I and 3 others observed 4 lights in the sky. \\\n",
"0 False \n",
"1 False \n",
"3 False \n",
"4 False \n",
"5 False \n",
"... ... \n",
"12494 False \n",
"12495 False \n",
"12496 False \n",
"12497 False \n",
"12498 False \n",
"\n",
" summary_ I kept watching and counted 10 objects that seemed to come from nowhere just appear out of the blackness. ((Starlink satellites?)) \\\n",
"0 False \n",
"1 False \n",
"3 False \n",
"4 False \n",
"5 False \n",
"... ... \n",
"12494 False \n",
"12495 False \n",
"12496 False \n",
"12497 False \n",
"12498 False \n",
"\n",
" summary_ I noticed the stars and another in a perfect line. ((Starlink satellites)) \\\n",
"0 False \n",
"1 False \n",
"3 False \n",
"4 False \n",
"5 False \n",
"... ... \n",
"12494 False \n",
"12495 False \n",
"12496 False \n",
"12497 False \n",
"12498 False \n",
"\n",
" summary_ I saw what looked like a star moving then I seen more. ((Starlink satellites?)) \\\n",
"0 False \n",
"1 False \n",
"3 False \n",
"4 False \n",
"5 False \n",
"... ... \n",
"12494 False \n",
"12495 False \n",
"12496 False \n",
"12497 False \n",
"12498 False \n",
"\n",
" summary_ I see 5 lights in a line, separated evenly, flying at the same speed. ((\"Starlink\" satellites??))((anonymous)) \\\n",
"0 False \n",
"1 False \n",
"3 False \n",
"4 False \n",
"5 False \n",
"... ... \n",
"12494 False \n",
"12495 False \n",
"12496 False \n",
"12497 False \n",
"12498 False \n",
"\n",
" summary_ We have witnessed multiple lights looking like satellites with varing degrees of brightness in orbit. ((Starlink satellites?)) \\\n",
"0 False \n",
"1 False \n",
"3 False \n",
"4 False \n",
"5 False \n",
"... ... \n",
"12494 False \n",
"12495 False \n",
"12496 False \n",
"12497 False \n",
"12498 False \n",
"\n",
" ... posted_2020-07-03T00:00:00 posted_2020-07-09T00:00:00 \\\n",
"0 ... False False \n",
"1 ... False False \n",
"3 ... False False \n",
"4 ... False False \n",
"5 ... False True \n",
"... ... ... ... \n",
"12494 ... False False \n",
"12495 ... False False \n",
"12496 ... False False \n",
"12497 ... False False \n",
"12498 ... False False \n",
"\n",
" posted_2020-07-23T00:00:00 posted_2020-07-31T00:00:00 \\\n",
"0 False False \n",
"1 False False \n",
"3 False False \n",
"4 False False \n",
"5 False False \n",
"... ... ... \n",
"12494 False False \n",
"12495 False False \n",
"12496 False False \n",
"12497 False False \n",
"12498 False False \n",
"\n",
" posted_2020-08-06T00:00:00 posted_2020-08-20T00:00:00 \\\n",
"0 False False \n",
"1 False False \n",
"3 False False \n",
"4 False False \n",
"5 False False \n",
"... ... ... \n",
"12494 False False \n",
"12495 False False \n",
"12496 False False \n",
"12497 False False \n",
"12498 False False \n",
"\n",
" posted_2020-08-27T00:00:00 posted_2020-09-04T00:00:00 \\\n",
"0 False False \n",
"1 False False \n",
"3 False False \n",
"4 False False \n",
"5 False False \n",
"... ... ... \n",
"12494 False False \n",
"12495 False False \n",
"12496 False False \n",
"12497 False False \n",
"12498 False False \n",
"\n",
" posted_2020-11-05T00:00:00 posted_2021-12-19T00:00:00 \n",
"0 False True \n",
"1 False True \n",
"3 False True \n",
"4 False True \n",
"5 False False \n",
"... ... ... \n",
"12494 False False \n",
"12495 False False \n",
"12496 False False \n",
"12497 False False \n",
"12498 False False \n",
"\n",
"[10000 rows x 53560 columns]\n"
]
}
],
"source": [
"print(\"Data before one-hot encoding:\")\n",
"print(df.head(10000))\n",
"\n",
"# One-hot encode the categorical features.\n",
"# High-cardinality string columns such as 'summary' and 'text' get one column per distinct value,\n",
"# which is why the encoded frame printed below has 53560 columns.\n",
"df_encoded = pd.get_dummies(df.head(10000), drop_first=True)\n",
"\n",
"print(\"\\nData after one-hot encoding:\")\n",
"print(df_encoded.head(10000))"
]
},
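{
"cell_type": "markdown",
"metadata": {},
"source": [
"The encoded frame above grows to 53,560 columns because every distinct string value becomes its own column. One common way to keep the result manageable is to encode only low-cardinality columns. The sketch below illustrates that idea in plain Python on made-up rows; the helper names `low_cardinality_columns` and `one_hot` and the `max_unique` threshold are assumptions for this example. Unlike `pd.get_dummies(..., drop_first=True)`, it keeps an indicator for every category.\n",
"\n",
"```python\n",
"def low_cardinality_columns(rows, max_unique=50):\n",
"    # Keep only the columns whose number of distinct values is small enough to encode\n",
"    columns = rows[0].keys()\n",
"    return [c for c in columns if len({r[c] for r in rows}) <= max_unique]\n",
"\n",
"def one_hot(rows, column):\n",
"    # One boolean indicator per distinct value, in sorted order for reproducibility\n",
"    values = sorted({r[column] for r in rows})\n",
"    return [{f'{column}_{v}': r[column] == v for v in values} for r in rows]\n",
"\n",
"# Made-up sample rows, for illustration only\n",
"rows = [\n",
"    {'shape': 'light', 'state': 'CA', 'summary': 'red lights in the sky'},\n",
"    {'shape': 'triangle', 'state': 'OH', 'summary': 'crafts from the north'},\n",
"    {'shape': 'light', 'state': 'CA', 'summary': 'a trail of lights'},\n",
"]\n",
"cols = low_cardinality_columns(rows, max_unique=2)\n",
"print(cols)\n",
"print(one_hot(rows, 'shape')[0])\n",
"```\n",
"\n",
"Here the free-text `summary` column is skipped because each of its values is unique, while `shape` and `state` are kept for encoding."
]
},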
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Discretization of numerical features**"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"22.1938\n",
"54.4333\n",
"32.23950000000001\n"
]
}
],
"source": [
"print(df['city_latitude'].min())\n",
"print(df['city_latitude'].max())\n",
"print(df['city_latitude'].max() - df['city_latitude'].min())"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data before discretization:\n",
"[22, 36, 49, 63, 77, inf]\n",
"\n",
"Data after discretization:\n",
" city_latitude price_bins\n",
"0 36.356650 36-49\n",
"1 39.174503 36-49\n",
"3 35.961561 22-36\n",
"4 38.798958 36-49\n",
"5 33.877422 22-36\n",
"6 36.141246 36-49\n",
"8 40.294123 36-49\n",
"10 40.698700 36-49\n",
"12 44.072800 36-49\n",
"13 42.312800 36-49\n"
]
}
],
"source": [
"print(\"Data before discretization:\")\n",
"#print(df.head(10))\n",
"\n",
"# Define the bin edges and labels for discretization.\n",
"# Each edge is min + max * fraction, so the upper edges (63 and 77) lie above the observed maximum (about 54.4).\n",
"lat_min = df['city_latitude'].min()\n",
"lat_max = df['city_latitude'].max()\n",
"edges = [round(lat_min + lat_max * frac) for frac in (0, 0.25, 0.5, 0.75, 1)]\n",
"bins = edges + [float('inf')]\n",
"print(bins)\n",
"\n",
"# Labels of the form 'low-high', plus an open-ended label for the last bin\n",
"labels = [f'{lo}-{hi}' for lo, hi in zip(edges[:-1], edges[1:])] + [f'{edges[-1]}+']\n",
"\n",
"# Apply the discretization (left-closed intervals)\n",
"df['price_bins'] = pd.cut(df['city_latitude'], bins=bins, labels=labels, right=False)\n",
"\n",
"print(\"\\nData after discretization:\")\n",
"print(df[['city_latitude', 'price_bins']].head(10))"
]
},
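{
"cell_type": "markdown",
"metadata": {},
"source": [
"The edges above are built from the minimum plus a fraction of the maximum, so the two upper bins stay empty for latitudes between 22.19 and 54.43. A possible alternative is equal-width binning over the observed range, where each edge is the minimum plus a fraction of (maximum - minimum). The sketch below shows that variant in plain Python; the helper names `equal_width_edges` and `bin_index` are assumptions for this example and are not part of the notebook.\n",
"\n",
"```python\n",
"def equal_width_edges(lo, hi, n_bins=4):\n",
"    # Edges run from lo to hi in n_bins equal steps\n",
"    step = (hi - lo) / n_bins\n",
"    return [lo + step * i for i in range(n_bins + 1)]\n",
"\n",
"def bin_index(value, edges):\n",
"    # Left-closed bins; the maximum value is placed in the last bin\n",
"    for i in range(len(edges) - 2):\n",
"        if value < edges[i + 1]:\n",
"            return i\n",
"    return len(edges) - 2\n",
"\n",
"# Minimum and maximum latitude as printed by the cell above\n",
"edges = equal_width_edges(22.1938, 54.4333)\n",
"print([round(e, 2) for e in edges])\n",
"print(bin_index(36.35665, edges))\n",
"```\n",
"\n",
"With these edges every bin covers part of the observed range, so none of them is empty by construction."
]
},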
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Manual feature synthesis**\n",
"\n",
"New features are created from expert knowledge and the logic of the subject area. For house sales data, for example, one could add a price-per-unit feature. In this notebook the synthesized feature is the latitude of a sighting relative to the mean latitude of its state."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data before feature synthesis:\n",
" summary city state \\\n",
"0 Viewed some red lights in the sky appearing to... Visalia CA \n",
"1 Look like 1 or 3 crafts from North traveling s... Cincinnati OH \n",
"3 One red light moving switly west to east, beco... Knoxville TN \n",
"4 Bright, circular Fresnel-lens shaped light sev... Alexandria VA \n",
"5 I'm familiar with all the fakery and UFO sight... Fullerton CA \n",
"6 I was driving up lakes mead towards the lake a... Las Vegas NV \n",
"8 Wing shaped craft seen at night, no lights, no... Orem UT \n",
"10 Yellow light floating across my grass as it de... Springfield NJ \n",
"12 A trail of star like lights moving from the W.... Janesville MN \n",
"13 Large bright ball not registering on any app a... Bangor MI \n",
"\n",
" date_time shape duration \\\n",
"0 2021-12-15T21:45:00 light 2 minutes \n",
"1 2021-12-16T09:45:00 triangle 14 seconds \n",
"3 2021-12-10T19:30:00 triangle 20-30 seconds \n",
"4 2021-12-07T08:00:00 circle NaN \n",
"5 2020-07-07T23:00:00 unknown 2 minutes \n",
"6 2020-04-23T03:00:00 oval 10 minutes \n",
"8 2020-04-18T23:00:00 other 10 seconds \n",
"10 2020-05-13T03:37:00 light 7 seconds \n",
"12 2020-04-18T21:05:00 light 15 minutes \n",
"13 2020-04-18T22:30:00 triangle 45 minutes \n",
"\n",
" stats \\\n",
"0 Occurred : 12/15/2021 21:45 (Entered as : 12/... \n",
"1 Occurred : 12/16/2021 09:45 (Entered as : 12/... \n",
"3 Occurred : 12/10/2021 19:30 (Entered as : 12/... \n",
"4 Occurred : 12/7/2021 08:00 (Entered as : 12/0... \n",
"5 Occurred : 7/7/2020 23:00 (Entered as : 07/07... \n",
"6 Occurred : 4/23/2020 03:00 (Entered as : 4/23... \n",
"8 Occurred : 4/18/2020 23:00 (Entered as : 04/1... \n",
"10 Occurred : 5/13/2020 03:37 (Entered as : 05/1... \n",
"12 Occurred : 4/18/2020 21:05 (Entered as : 04/1... \n",
"13 Occurred : 4/18/2020 22:30 (Entered as : 04/1... \n",
"\n",
" report_link \\\n",
"0 http://www.nuforc.org/webreports/165/S165881.html \n",
"1 http://www.nuforc.org/webreports/165/S165888.html \n",
"3 http://www.nuforc.org/webreports/165/S165825.html \n",
"4 http://www.nuforc.org/webreports/165/S165754.html \n",
"5 http://www.nuforc.org/webreports/157/S157444.html \n",
"6 http://www.nuforc.org/webreports/155/S155608.html \n",
"8 http://www.nuforc.org/webreports/155/S155512.html \n",
"10 http://www.nuforc.org/webreports/155/S155647.html \n",
"12 http://www.nuforc.org/webreports/155/S155497.html \n",
"13 http://www.nuforc.org/webreports/155/S155495.html \n",
"\n",
" text posted \\\n",
"0 Viewed some red lights in the sky appearing to... 2021-12-19T00:00:00 \n",
"1 Look like 1 or 3 crafts from North traveling s... 2021-12-19T00:00:00 \n",
"3 One red light moving switly west to east, beco... 2021-12-19T00:00:00 \n",
"4 Bright, circular Fresnel-lens shaped light sev... 2021-12-19T00:00:00 \n",
"5 I'm familiar with all the fakery and UFO sight... 2020-07-09T00:00:00 \n",
"6 I was driving up lakes mead towards the lake a... 2020-05-01T00:00:00 \n",
"8 Wing shaped craft seen at night, no lights, no... 2020-05-01T00:00:00 \n",
"10 Yellow light floating across my grass as it de... 2020-05-15T00:00:00 \n",
"12 A trail of star like lights moving from the W.... 2020-05-15T00:00:00 \n",
"13 Large bright ball not registering on any app a... 2020-05-15T00:00:00 \n",
"\n",
" city_latitude city_longitude price_bins \n",
"0 36.356650 -119.347937 36-49 \n",
"1 39.174503 -84.481363 36-49 \n",
"3 35.961561 -83.980115 22-36 \n",
"4 38.798958 -77.095133 36-49 \n",
"5 33.877422 -117.924978 22-36 \n",
"6 36.141246 -115.186592 36-49 \n",
"8 40.294123 -111.701685 36-49 \n",
"10 40.698700 -74.329600 36-49 \n",
"12 44.072800 -93.728600 36-49 \n",
"13 42.312800 -86.081300 36-49 \n",
"\n",
"Данные после синтеза признака 'relative_price':\n",
" city_latitude state relative_appearing\n",
"0 36.356650 CA 1.018610\n",
"1 39.174503 OH 0.969926\n",
"3 35.961561 TN 1.001705\n",
"4 38.798958 VA 1.026087\n",
"5 33.877422 CA 0.949149\n",
"6 36.141246 NV 0.968071\n",
"8 40.294123 UT 1.003457\n",
"10 40.698700 NJ 1.008448\n",
"12 44.072800 MN 0.970854\n",
"13 42.312800 MI 0.984641\n"
]
}
],
"source": [
"# Inspect the first rows before adding the feature\n",
"print(\"Data before feature synthesis:\")\n",
"print(df.head(10))\n",
"\n",
"# Mean latitude within each state\n",
"mean_latitude_by_state = df.groupby('state')['city_latitude'].transform('mean')\n",
"\n",
"# New feature 'relative_appearing': latitude relative to the state mean\n",
"df['relative_appearing'] = df['city_latitude'] / mean_latitude_by_state\n",
"\n",
"# Inspect the first rows after adding the feature\n",
"print(\"\\nData after synthesizing the 'relative_appearing' feature:\")\n",
"print(df[['city_latitude', 'state', 'relative_appearing']].head(10))"
]
},
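{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `groupby(...).transform('mean')` call above attaches the state mean to every row, so the division yields a per-row ratio. The same computation can be sketched in plain Python, which may make the grouping step easier to follow. The helper name `relative_to_group_mean` and the sample values are assumptions for this example.\n",
"\n",
"```python\n",
"from collections import defaultdict\n",
"\n",
"def relative_to_group_mean(values, groups):\n",
"    # Accumulate the sum and count per group, then divide each value by its group mean\n",
"    totals = defaultdict(float)\n",
"    counts = defaultdict(int)\n",
"    for v, g in zip(values, groups):\n",
"        totals[g] += v\n",
"        counts[g] += 1\n",
"    return [v / (totals[g] / counts[g]) for v, g in zip(values, groups)]\n",
"\n",
"# Made-up latitudes and states, for illustration only\n",
"lat = [36.0, 34.0, 40.0]\n",
"state = ['CA', 'CA', 'OH']\n",
"print(relative_to_group_mean(lat, state))\n",
"```\n",
"\n",
"A value above 1 means the sighting lies north of the state average, a value below 1 south of it, and a state with a single row always gets exactly 1."
]
},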
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Feature scaling via normalization and standardization**\n",
"\n",
"Feature scaling transforms numerical features so that they share a common scale. This matters for many machine learning algorithms that are sensitive to feature scale, such as regularized or gradient-based linear regression, support vector machines (SVM), and neural networks."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data before scaling:\n",
" city_latitude relative_city_latitude\n",
"0 36.356650 1.018610\n",
"1 39.174503 0.969926\n",
"3 35.961561 1.001705\n",
"4 38.798958 1.026087\n",
"5 33.877422 0.949149\n",
"\n",
"Data after normalization:\n",
" city_latitude relative_city_latitude\n",
"0 0.439301 0.475114\n",
"1 0.526705 0.349452\n",
"3 0.427046 0.431477\n",
"4 0.515056 0.494412\n",
"5 0.362401 0.295823\n",
"\n",
"Data after standardization:\n",
" city_latitude relative_city_latitude\n",
"0 -0.426560 0.533974\n",
"1 0.097536 -0.862875\n",
"3 -0.500043 0.048911\n",
"4 0.027688 0.748491\n",
"5 -0.887675 -1.459012\n"
]
}
],
"source": [
"from sklearn.preprocessing import MinMaxScaler, StandardScaler\n",
"\n",
"# New feature 'relative_city_latitude': latitude relative to the mean latitude of the state\n",
"mean_city_latitude_by_state = df.groupby('state')['city_latitude'].transform('mean')\n",
"df['relative_city_latitude'] = df['city_latitude'] / mean_city_latitude_by_state\n",
"\n",
"# Inspect the first rows before scaling\n",
"print(\"Data before scaling:\")\n",
"print(df[['city_latitude', 'relative_city_latitude']].head())\n",
"\n",
"# Min-max normalization to the [0, 1] range\n",
"min_max_scaler = MinMaxScaler()\n",
"df[['city_latitude', 'relative_city_latitude']] = min_max_scaler.fit_transform(df[['city_latitude', 'relative_city_latitude']])\n",
"\n",
"# Inspect the first rows after normalization\n",
"print(\"\\nData after normalization:\")\n",
"print(df[['city_latitude', 'relative_city_latitude']].head())\n",
"\n",
"# Standardization to zero mean and unit variance.\n",
"# It runs on the already normalized columns; z-scores are unaffected by the earlier min-max step.\n",
"standard_scaler = StandardScaler()\n",
"df[['city_latitude', 'relative_city_latitude']] = standard_scaler.fit_transform(df[['city_latitude', 'relative_city_latitude']])\n",
"\n",
"# Inspect the first rows after standardization\n",
"print(\"\\nData after standardization:\")\n",
"print(df[['city_latitude', 'relative_city_latitude']].head())"
]
},
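{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the two transformations above reduce to simple formulas: min-max normalization computes (x - min) / (max - min), and standardization computes (x - mean) / std with the population standard deviation. The plain-Python sketch below applies both to the five latitudes printed above; the helper names `min_max` and `z_score` are assumptions for this example. It also checks that standardizing the normalized values gives the same z-scores as standardizing the raw ones.\n",
"\n",
"```python\n",
"from statistics import mean, pstdev\n",
"\n",
"def min_max(values):\n",
"    # Rescale linearly so that the smallest value maps to 0 and the largest to 1\n",
"    lo, hi = min(values), max(values)\n",
"    return [(v - lo) / (hi - lo) for v in values]\n",
"\n",
"def z_score(values):\n",
"    # Centre on the mean and divide by the population standard deviation (ddof=0)\n",
"    mu, sigma = mean(values), pstdev(values)\n",
"    return [(v - mu) / sigma for v in values]\n",
"\n",
"lat = [36.35665, 39.174503, 35.961561, 38.798958, 33.877422]\n",
"normalized = min_max(lat)\n",
"# Largest difference between z-scores of the raw and of the normalized values\n",
"diff = max(abs(a - b) for a, b in zip(z_score(lat), z_score(normalized)))\n",
"print(min(normalized), max(normalized), diff < 1e-9)\n",
"```\n",
"\n",
"Because z-scores do not change under a positive linear rescaling, applying min-max normalization first only changes the intermediate values, not the standardized result."
]
},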
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Feature engineering with the Featuretools framework**"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\tumvu\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"c:\\Users\\tumvu\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"c:\\Users\\tumvu\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"c:\\Users\\tumvu\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"c:\\Users\\tumvu\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Built 23 features\n",
"Elapsed: 00:14 | Progress: 95%|█████████▌"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\tumvu\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Elapsed: 00:17 | Progress: 100%|██████████\n",
"Новые признаки, созданные с помощью Featuretools:\n",
" city state shape duration city_latitude \\\n",
"index \n",
"0 Visalia CA light 2 minutes -0.426560 \n",
"1 Cincinnati OH triangle 14 seconds 0.097536 \n",
"2 Knoxville TN triangle 20-30 seconds -0.500043 \n",
"3 Alexandria VA circle NaN 0.027688 \n",
"4 Fullerton CA unknown 2 minutes -0.887675 \n",
"\n",
" city_longitude price_bins relative_appearing relative_city_latitude \\\n",
"index \n",
"0 -119.347937 36-49 1.018610 0.775416 \n",
"1 -84.481363 36-49 0.969926 0.301550 \n",
"2 -83.980115 22-36 1.001705 0.977744 \n",
"3 -77.095133 36-49 1.026087 -0.177743 \n",
"4 -117.924978 22-36 0.949149 1.613646 \n",
"\n",
" DAY(date_time) ... NUM_CHARACTERS(stats) NUM_CHARACTERS(summary) \\\n",
"index ... \n",
"0 15 ... 174 91 \n",
"1 16 ... 180 114 \n",
"2 10 ... 182 108 \n",
"3 7 ... 166 127 \n",
"4 7 ... 167 134 \n",
"\n",
" NUM_CHARACTERS(text) NUM_WORDS(stats) NUM_WORDS(summary) \\\n",
"index \n",
"0 811 22 17 \n",
"1 302 22 22 \n",
"2 1633 22 20 \n",
"3 1813 21 18 \n",
"4 1275 21 28 \n",
"\n",
" NUM_WORDS(text) WEEKDAY(date_time) WEEKDAY(posted) YEAR(date_time) \\\n",
"index \n",
"0 147 2 6 2021 \n",
"1 59 3 6 2021 \n",
"2 304 4 6 2021 \n",
"3 322 1 6 2021 \n",
"4 233 1 3 2020 \n",
"\n",
" YEAR(posted) \n",
"index \n",
"0 2021 \n",
"1 2021 \n",
"2 2021 \n",
"3 2021 \n",
"4 2020 \n",
"\n",
"[5 rows x 23 columns]\n"
]
}
],
"source": [
"import featuretools as ft\n",
"\n",
"# Create a new feature 'relative_city_latitude': city latitude relative to the mean latitude of its state\n",
"mean_city_latitude_by_state = df.groupby('state')['city_latitude'].transform('mean')\n",
"df['relative_city_latitude'] = df['city_latitude'] / mean_city_latitude_by_state\n",
"\n",
"# Create an EntitySet\n",
"es = ft.EntitySet(id='jio_mart_items')\n",
"\n",
"# Add the data, explicitly creating an index column\n",
"es = es.add_dataframe(dataframe_name='items_data', dataframe=df, index='index', make_index=True)\n",
"\n",
"# Feature construction\n",
"features, feature_defs = ft.dfs(entityset=es, target_dataframe_name='items_data', verbose=True)\n",
"\n",
"# Inspect the first rows of the new features\n",
"print(\"New features created with Featuretools:\")\n",
"print(features.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Quality assessment**\n",
"\n",
"*Predictive power. Metrics:* RMSE, MAE, R² \n",
"\n",
"*Methods:* Train the model on the training set and evaluate it on the validation and test sets. \n",
"\n",
"*Computation speed. Methods:* Measure the time taken by feature generation and model training. \n",
"\n",
"*Reliability. Methods:* Cross-validation and analysis of the model's sensitivity to changes in the data. \n",
"\n",
"*Correlation. Methods:* Analyze the feature correlation matrix and remove multicollinear features. \n",
"\n",
"*Coherence. Methods:* Check the logical link between the features and the target variable, and interpret the model's results. "
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\tumvu\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\sklearn\\metrics\\_regression.py:492: FutureWarning: 'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"RMSE: 1.5489449749619157\n",
"R²: 0.924423816721939\n",
"MAE: 0.46826566221280963\n",
"Training Time: 37.429771184921265 seconds\n",
"Cross-validated RMSE: 1.535611497565107\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\tumvu\\AppData\\Local\\Temp\\ipykernel_54788\\399707436.py:70: FutureWarning: \n",
"\n",
"Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.\n",
"\n",
" sns.barplot(x='Importance', y='Feature', data=importance_df_top, palette='viridis')\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA8kAAAK9CAYAAAAXCC76AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAACjb0lEQVR4nOzde1yUZf7/8fcEgqMDoxCKbDqTCWpGotGWsakIidrqF0+paYalbgdzs1VcFzDooOVKme63zTUD3SxM8tDmWZM8QJglqFtpSwql+FU3ZSQVAef3R+P8nEAExCh8PR+P+/Fwrvu6rvtzj+3j+3173fc1BrvdbhcAAAAAANAN9V0AAAAAAAC/FIRkAAAAAAAcCMkAAAAAADgQkgEAAAAAcCAkAwAAAADgQEgGAAAAAMCBkAwAAAAAgAMhGQAAAAAAB0IyAAAAAAAOhGQAAAAAABwIyQAAVJPBYKjWkZGRcU3r+Pbbb5WUlKTf/va3at68uW688Ub17NlTmzZtqrT/qVOnNH78ePn5+alp06YKDw/X559/Xq1r9ezZ87L3+dVXX9XlbTm9/vrrSk1NvSZzX62ePXvqtttuq+8yau3IkSNKTExUTk5OfZcCAL9Y7vVdAAAAvxb//Oc/XT4vXrxYGzdurNDesWPHa1rHqlWr9PLLLys6OloPP/ywysrKtHjxYt1333166623NGbMGGffCxcu6P7771dubq6mTJmiG2+8Ua+//rp69uypzz77TIGBgVe83k033aSZM2dWaA8ICKjT+7ro9ddf14033qiYmJhrMv/17MiRI0pKSpLValVISEh9lwMAv0iEZAAAqmnUqFEunz/55BNt3LixQvu1Fh4eroKCAt14443Otscee0whISGaPn26S0hOT09XZmamli1bpiFDhkiSHnjgAQUFBenZZ5/VO++8c8Xrmc3mn/0e65rdbte5c+dkNBrru5R6UVZWpgsXLtR3GQDwq8Dj1gAA1KEffvhBf/rTn9S6dWt5enqqffv2mj17tux2u0s/g8GgCRMmaMmSJWrfvr0aN26sO+64Q1u3br3iNTp16uQSkCXJ09NT/fr103fffafTp08729PT09WyZUsNGjTI2ebn56cHHnhAq1atUklJyVXesVRSUqJnn31W7dq1k6enp1q3bq3Y2NgKc6ekpKhXr15q0aKFPD09deutt+rvf/+7Sx+r1ap///vf+vjjj52Pdffs2VOSlJiYKIPBUOH6qampMhgMOnTokMs8v//977V+/XqFhobKaDRq/vz5kn58/Pzpp592/h21a9dOL7/8cq1D5MW/y2XLlunWW2+V0WhUt27dtHfvXknS/Pnz1a5dOzVu3Fg9e/Z0qVP6/49wf/bZZ7rnnntkNBp1880364033qhwrWPHjunRRx9Vy5Yt1bhxY3Xu3FmLFi1y6XPo0CEZDAbNnj1bc+bM0S233CJPT0+9/vrruvPOOyVJY8aMcX6/Fx9t37Ztm4YOHao2bdo4/x4nTZqks2fPuswfExMjk8mkw4cPKzo6WiaTSX5+fpo8ebLKy8td+l64cEGvvfaagoOD1bhxY/n5+alPnz7atWuXS7+3335bd9xxh4xGo3x8fDR8+HB9++23Nf67AIC6wEoyAAB1xG63a8CAAdqyZYseffRRhYSEaP369ZoyZYoOHz6sV1991aX/xx9/rKVLl2rixInOENOnTx/t3LmzVu+9Hj16VE2aNFGTJk2cbbt371bXrl11ww2u/y7+29/+Vv/4xz904MABBQcHVzlveXm5Tpw44dLWuHFjmUwmXbhwQQMGDND27ds1fvx4dezYUXv37tWrr76qAwcOaOXKlc4xf//739WpUycNGDBA7u7u+te//qUnnnhCFy5c0JNPPilJmjNnjp566imZTCbFxcVJklq2bFnj70KS9u/frxEjRugPf/iDxo0bp/bt2+vMmTPq0aOHDh8+rD/84Q9q06aNMjMzNW3aNBUWFm
rOnDm1uta2bdv0wQcfOO9j5syZ+v3vf6/Y2Fi9/vrreuKJJ3Ty5EnNmjVLjzzyiD766COX8SdPnlS/fv30wAMPaMSIEXrvvff0+OOPy8PDQ4888ogk6ezZs+rZs6f+85//aMKECbr55pu1bNkyxcTE6NSpU/rjH//oMmdKSorOnTun8ePHy9PTUwMHDtTp06c1ffp0jR8/Xvfee68k6Z577pEkLVu2TGfOnNHjjz8uX19f7dy5U/PmzdN3332nZcuWucxdXl6uqKgo3XXXXZo9e7Y2bdqk5ORk3XLLLXr88ced/R599FGlpqaqb9++Gjt2rMrKyrRt2zZ98sknCg0NlSS9+OKLSkhI0AMPPKCxY8fq+PHjmjdvnrp3767du3erWbNmtfo7AYBaswMAgFp58skn7Zf+n9KVK1faJdlfeOEFl35DhgyxGwwG+3/+8x9nmyS7JPuuXbucbfn5+fbGjRvbBw4cWONavv76a3vjxo3tDz30kEt706ZN7Y888kiF/qtXr7ZLsq9bt67KeXv06OGs9dLj4Ycfttvtdvs///lP+w033GDftm2by7g33njDLsm+Y8cOZ9uZM2cqzB8VFWVv27atS1unTp3sPXr0qND32WeftVf2/7qkpKTYJdkPHjzobLNYLJXe3/PPP29v2rSp/cCBAy7tf/7zn+1ubm72goKCSr+Hi3r06GHv1KmTS5sku6enp8v158+fb5dk9/f3t9tsNmf7tGnTKtR68TtOTk52tpWUlNhDQkLsLVq0sJ8/f95ut9vtc+bMsUuyv/32285+58+ft3fr1s1uMpmc1zl48KBdkt3b29t+7Ngxl1o//fRTuyR7SkpKhXur7O9n5syZdoPBYM/Pz3e2Pfzww3ZJ9ueee86lb5cuXex33HGH8/NHH31kl2SfOHFihXkvXLhgt9vt9kOHDtnd3NzsL774osv5vXv32t3d3Su0A8DPgcetAQCoI2vWrJGbm5smTpzo0v6nP/1Jdrtda9eudWnv1q2b7rjjDufnNm3a6H/+53+0fv36Co+tVuXMmTMaOnSojEajXnrpJZdzZ8+elaenZ4UxjRs3dp6/EqvVqo0bN7ocsbGxkn5cfezYsaM6dOigEydOOI9evXpJkrZs2eKc59L3gYuKinTixAn16NFD33zzjYqKiqp9v9V18803KyoqyqVt2bJluvfee9W8eXOXeiMjI1VeXl6tx90rExERIavV6vx81113SZIGDx4sLy+vCu3ffPONy3h3d3f94Q9/cH728PDQH/7wBx07dkyfffaZpB//+/L399eIESOc/Ro1aqSJEyequLhYH3/8scucgwcPlp+fX7Xv4dK/nx9++EEnTpzQPffcI7vdrt27d1fo/9hjj7l8vvfee13u6/3335fBYNCzzz5bYezFx+aXL1+uCxcu6IEHHnD5+/D391dgYKDLfz8A8HPhcWsAAOpIfn6+AgICXEKR9P93u87Pz3dpr2xn6aCgIJ05c0bHjx+Xv7//Fa9ZXl6u4cOH64svvtDatWsr7DhtNBorfe/43LlzzvNX0rRpU0VGRlZ67uuvv9aXX3552TB27Ngx55937NihZ599VllZWTpz5oxLv6KiIpnN5ivWUhM333xzpfXu2bOnWvXWRJs2bVw+X7yX1q1bV9p+8uRJl/aAgAA1bdrUpS0oKEjSj+8Y33333crPz1dgYGCFR+cv999XZfdflYKCAk2fPl0ffPBBhfp++o8YF98vvlTz5s1dxuXl5SkgIEA+Pj6XvebXX38tu91+2V3WGzVqVKN7AIC6QEgGAOBXbNy4cfrwww+1ZMkS5+rtpVq1aqXCwsIK7RfbrvZnnC5cuKDg4GC98sorlZ6/GBLz8vIUERGhDh066JVXXlHr1q3l4eGhNWvW6NVXX63WplmVbdol6bKr7pX9A8CFCxd03333OVfCf+piMK0pNze3GrXbf7KR27VQk528y8vLdd999+n777/X1KlT1aFDBzVt2lSHDx
9WTExMhb+fy91XTV24cEEGg0Fr166tdE6TyVQn1wGAmiAkAwBQRywWizZt2qTTp0+7rCZ/9dVXzvOX+vrrryvMceD
"text/plain": [
"<Figure size 1000x800 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train RMSE: 0.4358844786848851\n",
"Train R²: 0.994185027626814\n",
"Train MAE: 0.12184558416960284\n",
"Корреляция: 0.96\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\tumvu\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\sklearn\\metrics\\_regression.py:492: FutureWarning: 'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.\n",
" warnings.warn(\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA0oAAAIjCAYAAAA9VuvLAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAADA8ElEQVR4nOzdd3iUVfr/8fczPckkk55QAoHQmwUVlSJKU1FXxc5asOGuvetvd8WOrt1VF8sKFvxace0rYu9iAUWlhA4JCSRkkkky/fn9ETMQEjSBhAnh87ouNHOedk9mkjz3nHPuY5imaSIiIiIiIiIxlngHICIiIiIi0t4oURIREREREdmGEiUREREREZFtKFESERERERHZhhIlERERERGRbShREhERERER2YYSJRERERERkW0oURIREREREdmGEiUREREREZFtKFESEWmnDMPgxhtvjHcYcTd69GhGjx4de7xq1SoMw2DWrFlxi2lb28bYXs7VXLNmzcIwDFatWtVq59zVr9NZZ51Ffn7+LrmWiOwZlCiJyB7hkUcewTAMhg0btsPnKCoq4sYbb2TBggWtF1g799FHH2EYRuyf3W6nZ8+enHHGGaxYsSLe4bXIF198wY033khFRUW8Q2mR9v6+e+6557j//vubte/bb7+t5F9EdhtKlERkjzB79mzy8/P55ptvKCws3KFzFBUVcdNNN7XbG9a2dMkll/DMM8/w2GOPMXHiRF544QX2339/ioqKdnks3bt3p7a2ltNPP71Fx33xxRfcdNNN7T5Rmjt3LnPnzo09bu/vu+0lSk29Tm+//TY33XTTLoxORGTHKVESkQ5v5cqVfPHFF9x7771kZWUxe/bseIe02xk5ciR//vOfmTJlCv/617+4++67KS8v56mnntruMdXV1W0Si2EYuFwurFZrm5w/3hwOBw6HI95h7LSO/jqJSMenRElEOrzZs2eTlpbGxIkTOeGEE7abKFVUVHD55ZeTn5+P0+mka9eunHHGGWzatImPPvqI/fffH4ApU6bEhqLVz7/Iz8/nrLPOanTObeebBINBbrjhBoYOHYrH4yEpKYmRI0fy4Ycftvh5lZSUYLPZmvyEfsmSJRiGwUMPPQRAKBTipptuonfv3rhcLjIyMhgxYgTvvfdei68LcNhhhwF1SSjAjTfeiGEY/PLLL5x22mmkpaUxYsSI2P7PPvssQ4cOJSEhgfT0dE455RTWrl3b6LyPPfYYBQUFJCQkcMABB/Dpp5822md7c18WL17MSSedRFZWFgkJCfTt25e//e1vsfiuvvpqAHr06BF7/baek9OaMf6eZ599lgMOOIDExETS0tIYNWpUgx6krd8zv/e+mzZtGna7nY0bNza6xvnnn09qaip+v79FsW3ttddeY+LEiXTu3Bmn00lBQQG33HILkUikQaxvvfUWq1evjsVWP09o29fprLPO4uGHHwZoMJyz/nkahsFHH33UIIbtvdb//e9/GTRoEC6Xi0GDBvHqq682+Ryi0Sj3338/AwcOxOVykZOTw9SpU9m8efMOf19EZM9hi3cAIiJtbfbs2Rx//PE4HA5OPfVU/v3vfzN//vzYDSiAz+dj5MiR/Prrr5x99tnsu+++bNq0iddff51169bRv39/br75Zm644QbOP/98Ro4cCcDBBx/colgqKyt54oknOPXUUznvvPOoqqriP//5DxMmTOCbb75h7733bva5cnJyOOSQQ3jxxReZNm1ag20vvPACVquVE088EahLFKZPn865557LAQccQGVlJd9++y3ff/8948aNa9FzAFi+fDkAGRkZDdpPPPFEevfuze23345pmgDcdttt/OMf/+Ckk07i3HPPZePGjfzrX/9i1KhR/PDDD6SmpgLwn//8h6lTp3LwwQdz2WWXsWLFCo455hjS09PJy8v73Xh+/PFHRo4cid1u5/zzzyc/P5/ly5fzxhtvcNttt3H88cezdOlS/u///o/77r
uPzMxMALKysnZZjAA33XQTN954IwcffDA333wzDoeDr7/+mg8++IDx48c32v/33ncjRozg5ptv5oUXXuCiiy6KHRMMBnn55ZeZNGkSLpfrD2PanlmzZuF2u7niiitwu9188MEH3HDDDVRWVnLXXXcB8Le//Q2v18u6deu47777AHC73U2eb+rUqRQVFfHee+/xzDPP7HBcc+fOZdKkSQwYMIDp06dTVlbGlClT6Nq1a5PXnDVrFlOmTOGSSy5h5cqVPPTQQ/zwww98/vnn2O32HY5DRPYApohIB/btt9+agPnee++Zpmma0WjU7Nq1q3nppZc22O+GG24wAXPOnDmNzhGNRk3TNM358+ebgDlz5sxG+3Tv3t0888wzG7Ufcsgh5iGHHBJ7HA6HzUAg0GCfzZs3mzk5OebZZ5/doB0wp02b9rvP79FHHzUB86effmrQPmDAAPOwww6LPd5rr73MiRMn/u65mvLhhx+agPnkk0+aGzduNIuKisy33nrLzM/PNw3DMOfPn2+apmlOmzbNBMxTTz21wfGrVq0yrVaredtttzVo/+mnn0ybzRZrDwaDZnZ2trn33ns3+P489thjJtDge7hy5cpGr8OoUaPM5ORkc/Xq1Q2uU//amaZp3nXXXSZgrly5ss1jbMqyZctMi8ViHnfccWYkEtlunNu+Z37vfXfQQQeZw4YNa9A2Z84cEzA//PDD341nazNnzmz0vampqWm039SpU83ExETT7/fH2iZOnGh279690b5NvU4XXnih2dStR/37bNuYmzrH3nvvbXbq1MmsqKiItc2dO9cEGsTx6aefmoA5e/bsBuf83//+12S7iMi2NPRORDq02bNnk5OTw6GHHgrUDfk5+eSTef755xsMIXrllVfYa6+9OO644xqdo354UGuwWq2x+SfRaJTy8nLC4TD77bcf33//fYvPd/zxx2Oz2XjhhRdibYsWLeKXX37h5JNPjrWlpqby888/s2zZsh2K++yzzyYrK4vOnTszceJEqqureeqpp9hvv/0a7HfBBRc0eDxnzhyi0SgnnXQSmzZtiv3Lzc2ld+/esSGH3377LaWlpVxwwQUN5uecddZZeDye341t48aNfPLJJ5x99tl069atwbbmvHa7IkaoGy4WjUa54YYbsFga/vnd0ffYGWecwddffx3r4YO693xeXh6HHHLIDp2zXkJCQuzrqqoqNm3axMiRI6mpqWHx4sU7de4dVVxczIIFCzjzzDMbfM/HjRvHgAEDGuz70ksv4fF4GDduXIPXdejQobjd7h0a7ioiexYlSiLSYUUiEZ5//nkOPfRQVq5cSWFhIYWFhQwbNoySkhLef//92L7Lly9n0KBBuySup556iiFDhsTmCmVlZfHWW2/h9XpbfK7MzEzGjBnDiy++GGt74YUXsNlsHH/88bG2m2++mYqKCvr06cPgwYO5+uqr+fHHH5t9nRtuuIH33nuPDz74gB9//JGioqImq8716NGjweNly5Zhmia9e/cmKyurwb9ff/2V0tJSAFavXg1A7969GxxfX47899SXKd/R129XxAh17zGLxdLohn5nnHzyyTidzti8O6/Xy5tvvsnkyZN3OsH/+eefOe644/B4PKSkpJCVlcWf//zn2HXiYXuvAUDfvn0bPF62bBler5fs7OxGr6vP54u9riIi26M5SiLSYX3wwQcUFxfz/PPP8/zzzzfaPnv27CbnheyI7d2URiKRBlW/nn32Wc466yyOPfZYrr76arKzs7FarUyfPr1Br0BLnHLKKUyZMoUFCxaw99578+KLLzJmzJjYPByAUaNGsXz5cl577TXmzp3LE088wX333ceMGTM499xz//AagwcPZuzYsX+439a9EFDXa2YYBu+8806T1c+2N59lV9odYtyetLQ0jjrqKGbPns0NN9zAyy+/TCAQiCU0O6qiooJDDjmElJQUbr75ZgoKCnC5XHz//fdce+21RKPRVnoGdX7v52dHRaNRsr
Ozt1u8pX5+mojI9ihREpEOa/bs2WRnZ8cqbW1tzpw5vPrqq8yYMYOEhAQKCgpYtGjR757v9z6hT0tLa3J9ntWrVzf
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error\n",
"from sklearn.model_selection import cross_val_score\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import time\n",
"import numpy as np\n",
"\n",
"# Загрузка данных\n",
"df = pd.read_csv(\"../../datasets/nuforc_reports.csv\").head(2000)\n",
"\n",
"# Создание нового признака 'relative_city_latitude'\n",
"mean_city_latitude_by_state = df.groupby('state')['city_latitude'].transform('mean')\n",
"df['relative_city_latitude'] = df['city_latitude'] / mean_city_latitude_by_state\n",
"\n",
"# Предобработка данных\n",
"# Преобразуем категориальные переменные в числовые\n",
"df = pd.get_dummies(df, drop_first=True)\n",
"\n",
"# Разделение данных на признаки и целевую переменную\n",
"X = df.drop('city_latitude', axis=1).dropna()\n",
"y = df['city_latitude'].dropna()\n",
"\n",
"# Разделение данных на обучающую и тестовую выборки\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"# Выбор модели\n",
"model = RandomForestRegressor(random_state=42)\n",
"\n",
"# Измерение времени обучения и предсказания\n",
"start_time = time.time()\n",
"\n",
"# Обучение модели\n",
"model.fit(X_train, y_train)\n",
"\n",
"# Предсказание и оценка\n",
"y_pred = model.predict(X_test)\n",
"\n",
"end_time = time.time()\n",
"training_time = end_time - start_time\n",
"\n",
"rmse = mean_squared_error(y_test, y_pred, squared=False)\n",
"r2 = r2_score(y_test, y_pred)\n",
"mae = mean_absolute_error(y_test, y_pred)\n",
"\n",
"print(f\"RMSE: {rmse}\")\n",
"print(f\"R²: {r2}\")\n",
"print(f\"MAE: {mae}\")\n",
"print(f\"Training Time: {training_time} seconds\")\n",
"\n",
"# Кросс-валидация\n",
"scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')\n",
"rmse_cv = (-scores.mean())**0.5\n",
"print(f\"Cross-validated RMSE: {rmse_cv}\")\n",
"\n",
"# Анализ важности признаков\n",
"feature_importances = model.feature_importances_\n",
"feature_names = X_train.columns\n",
"\n",
"importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})\n",
"importance_df = importance_df.sort_values(by='Importance', ascending=False)\n",
"\n",
"# Отобразим только топ-20 признаков\n",
"top_n = 20\n",
"importance_df_top = importance_df.head(top_n)\n",
"\n",
"plt.figure(figsize=(10, 8))\n",
"sns.barplot(x='Importance', y='Feature', data=importance_df_top, palette='viridis')\n",
"plt.title(f'Top {top_n} Feature Importance')\n",
"plt.xlabel('Importance')\n",
"plt.ylabel('Feature')\n",
"plt.show()\n",
"\n",
"# Проверка на переобучение\n",
"y_train_pred = model.predict(X_train)\n",
"\n",
"rmse_train = mean_squared_error(y_train, y_train_pred, squared=False)\n",
"r2_train = r2_score(y_train, y_train_pred)\n",
"mae_train = mean_absolute_error(y_train, y_train_pred)\n",
"\n",
"print(f\"Train RMSE: {rmse_train}\")\n",
"print(f\"Train R²: {r2_train}\")\n",
"print(f\"Train MAE: {mae_train}\")\n",
"\n",
"correlation = np.corrcoef(y_test, y_pred)[0, 1]\n",
"print(f\"Корреляция: {correlation:.2f}\")\n",
"\n",
"# Визуализация результатов\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(y_test, y_pred, alpha=0.5)\n",
"plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)\n",
"plt.xlabel('Actual city_latitude')\n",
"plt.ylabel('Predicted city_latitude')\n",
"plt.title('Actual vs Predicted city_latitude')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Conclusions and summary \n",
"\n",
"**Training time:**\n",
"\n",
"Fitting the model and predicting on the test set took about 37 seconds. For a sample of 2,000 rows this is a moderate time rather than a fast one.\n",
"\n",
"**Predictive power:**\n",
"\n",
"MAE (mean absolute error): 0.468 on the test set (0.122 on the training set). On unseen data, the predictions deviate from the actual values by about 0.47 on average, which may be an acceptable level of error.\n",
"\n",
"RMSE (root mean squared error): 1.549 on the test set (0.436 on the training set). This is the square root of the mean of the squared errors, so it penalizes large errors more heavily than MAE does.\n",
"\n",
"R² (coefficient of determination): 0.924 on the test set (0.994 on the training set). The model explains about 92% of the variance of the target variable on unseen data, which indicates high predictive power. Note that `relative_city_latitude` is computed from the target itself, so part of this accuracy may reflect target leakage.\n",
"\n",
"**Correlation:**\n",
"\n",
"The correlation (0.96) between the predicted and actual test values shows a strong linear relationship between them. This is consistent with the test R² and supports the conclusion that the predictions track the actual values closely.\n",
"\n",
"**Reliability (cross-validation):**\n",
"\n",
"Mean cross-validated RMSE: 1.536. This is close to the test RMSE (1.549), so the test estimate is stable across folds. Both values are well above the training RMSE (0.436), which points to some overfitting to the training data, although performance on unseen data remains high. \n",
"\n",
"The feature importance plot obtained from the random forest helps show which input variables have the greatest influence on the target variable (city_latitude)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}