{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Laboratory Work No. 3\n",
"\n",
"### Dataset \"UFO Sightings in the USA\""
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Several business goals, and the technical tasks that follow from them, can be identified for the \"UFO Sightings in the USA\" dataset. They are worked through step by step below.\n",
"# \n",
"# 1. Business goals\n",
"# Business goal 1: Predict the location and frequency of UFO sightings.\n",
"# The task is to analyze the geographic and temporal distribution of sightings to determine where and when they occur most often.\n",
"# Business goal 2: Analyze the factors that influence how a UFO sighting is perceived (for example shape, duration, description).\n",
"# The aim is to understand which features, such as the object's shape or the duration of the sighting, may be associated with more detailed or more emotionally charged reports.\n",
"# 2. Technical project goals for each business goal\n",
"# Goal for business goal 1: Build a model that predicts the likely location and time of sightings from data on previous sightings.\n",
"# Technical tasks:\n",
"# Predicting location and time (classification or regression).\n",
"# Clustering by geographic position.\n",
"# Time-series analysis to detect seasonal fluctuations.\n",
"# Goal for business goal 2: Analyze the text descriptions of UFO sightings to identify key patterns and factors.\n",
"# Technical tasks:\n",
"# Text analysis using natural language processing (NLP) methods.\n",
"# Classifying descriptions by object type or possible explanation (for example, a likely aircraft or an atmospheric phenomenon)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['summary', 'city', 'state', 'date_time', 'shape', 'duration', 'stats',\n",
" 'report_link', 'text', 'posted', 'city_latitude', 'city_longitude'],\n",
" dtype='object')\n"
]
}
],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib.ticker as ticker\n",
"import seaborn as sns\n",
"\n",
"# Load the data\n",
"df = pd.read_csv(\"../../datasets/nuforc_reports.csv\")\n",
"\n",
"# Keep only the first 15,000 rows\n",
"df = df.iloc[:15000]\n",
"\n",
"# Print the column names\n",
"print(df.columns)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>summary</th>\n",
" <th>city</th>\n",
" <th>state</th>\n",
" <th>date_time</th>\n",
" <th>shape</th>\n",
" <th>duration</th>\n",
" <th>stats</th>\n",
" <th>report_link</th>\n",
" <th>text</th>\n",
" <th>posted</th>\n",
" <th>city_latitude</th>\n",
" <th>city_longitude</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Viewed some red lights in the sky appearing to...</td>\n",
" <td>Visalia</td>\n",
" <td>CA</td>\n",
" <td>2021-12-15T21:45:00</td>\n",
" <td>light</td>\n",
" <td>2 minutes</td>\n",
" <td>Occurred : 12/15/2021 21:45 (Entered as : 12/...</td>\n",
" <td>http://www.nuforc.org/webreports/165/S165881.html</td>\n",
" <td>Viewed some red lights in the sky appearing to...</td>\n",
" <td>2021-12-19T00:00:00</td>\n",
" <td>36.356650</td>\n",
" <td>-119.347937</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Look like 1 or 3 crafts from North traveling s...</td>\n",
" <td>Cincinnati</td>\n",
" <td>OH</td>\n",
" <td>2021-12-16T09:45:00</td>\n",
" <td>triangle</td>\n",
" <td>14 seconds</td>\n",
" <td>Occurred : 12/16/2021 09:45 (Entered as : 12/...</td>\n",
" <td>http://www.nuforc.org/webreports/165/S165888.html</td>\n",
" <td>Look like 1 or 3 crafts from North traveling s...</td>\n",
" <td>2021-12-19T00:00:00</td>\n",
" <td>39.174503</td>\n",
" <td>-84.481363</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>seen dark rectangle moving slowly thru the sky...</td>\n",
" <td>Tecopa</td>\n",
" <td>CA</td>\n",
" <td>2021-12-10T00:00:00</td>\n",
" <td>rectangle</td>\n",
" <td>Several minutes</td>\n",
" <td>Occurred : 12/10/2021 00:00 (Entered as : 12/...</td>\n",
" <td>http://www.nuforc.org/webreports/165/S165810.html</td>\n",
" <td>seen dark rectangle moving slowly thru the sky...</td>\n",
" <td>2021-12-19T00:00:00</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>One red light moving switly west to east, beco...</td>\n",
" <td>Knoxville</td>\n",
" <td>TN</td>\n",
" <td>2021-12-10T19:30:00</td>\n",
" <td>triangle</td>\n",
" <td>20-30 seconds</td>\n",
" <td>Occurred : 12/10/2021 19:30 (Entered as : 12/...</td>\n",
" <td>http://www.nuforc.org/webreports/165/S165825.html</td>\n",
" <td>One red light moving switly west to east, beco...</td>\n",
" <td>2021-12-19T00:00:00</td>\n",
" <td>35.961561</td>\n",
" <td>-83.980115</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Bright, circular Fresnel-lens shaped light sev...</td>\n",
" <td>Alexandria</td>\n",
" <td>VA</td>\n",
" <td>2021-12-07T08:00:00</td>\n",
" <td>circle</td>\n",
" <td>NaN</td>\n",
" <td>Occurred : 12/7/2021 08:00 (Entered as : 12/0...</td>\n",
" <td>http://www.nuforc.org/webreports/165/S165754.html</td>\n",
" <td>Bright, circular Fresnel-lens shaped light sev...</td>\n",
" <td>2021-12-19T00:00:00</td>\n",
" <td>38.798958</td>\n",
" <td>-77.095133</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" summary city state \\\n",
"0 Viewed some red lights in the sky appearing to... Visalia CA \n",
"1 Look like 1 or 3 crafts from North traveling s... Cincinnati OH \n",
"2 seen dark rectangle moving slowly thru the sky... Tecopa CA \n",
"3 One red light moving switly west to east, beco... Knoxville TN \n",
"4 Bright, circular Fresnel-lens shaped light sev... Alexandria VA \n",
"\n",
" date_time shape duration \\\n",
"0 2021-12-15T21:45:00 light 2 minutes \n",
"1 2021-12-16T09:45:00 triangle 14 seconds \n",
"2 2021-12-10T00:00:00 rectangle Several minutes \n",
"3 2021-12-10T19:30:00 triangle 20-30 seconds \n",
"4 2021-12-07T08:00:00 circle NaN \n",
"\n",
" stats \\\n",
"0 Occurred : 12/15/2021 21:45 (Entered as : 12/... \n",
"1 Occurred : 12/16/2021 09:45 (Entered as : 12/... \n",
"2 Occurred : 12/10/2021 00:00 (Entered as : 12/... \n",
"3 Occurred : 12/10/2021 19:30 (Entered as : 12/... \n",
"4 Occurred : 12/7/2021 08:00 (Entered as : 12/0... \n",
"\n",
" report_link \\\n",
"0 http://www.nuforc.org/webreports/165/S165881.html \n",
"1 http://www.nuforc.org/webreports/165/S165888.html \n",
"2 http://www.nuforc.org/webreports/165/S165810.html \n",
"3 http://www.nuforc.org/webreports/165/S165825.html \n",
"4 http://www.nuforc.org/webreports/165/S165754.html \n",
"\n",
" text posted \\\n",
"0 Viewed some red lights in the sky appearing to... 2021-12-19T00:00:00 \n",
"1 Look like 1 or 3 crafts from North traveling s... 2021-12-19T00:00:00 \n",
"2 seen dark rectangle moving slowly thru the sky... 2021-12-19T00:00:00 \n",
"3 One red light moving switly west to east, beco... 2021-12-19T00:00:00 \n",
"4 Bright, circular Fresnel-lens shaped light sev... 2021-12-19T00:00:00 \n",
"\n",
" city_latitude city_longitude \n",
"0 36.356650 -119.347937 \n",
"1 39.174503 -84.481363 \n",
"2 NaN NaN \n",
"3 35.961561 -83.980115 \n",
"4 38.798958 -77.095133 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Preview the first rows\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 15000 entries, 0 to 14999\n",
"Data columns (total 12 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 summary 14998 non-null object \n",
" 1 city 14961 non-null object \n",
" 2 state 14235 non-null object \n",
" 3 date_time 14560 non-null object \n",
" 4 shape 13082 non-null object \n",
" 5 duration 13598 non-null object \n",
" 6 stats 15000 non-null object \n",
" 7 report_link 15000 non-null object \n",
" 8 text 14999 non-null object \n",
" 9 posted 14560 non-null object \n",
" 10 city_latitude 12002 non-null float64\n",
" 11 city_longitude 12002 non-null float64\n",
"dtypes: float64(2), object(10)\n",
"memory usage: 1.4+ MB\n"
]
}
],
"source": [
"# Dataset structure: column dtypes and non-null counts\n",
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of missing values in each column:\n",
"summary 74\n",
"city 382\n",
"state 9345\n",
"date_time 2668\n",
"shape 5922\n",
"duration 6492\n",
"stats 0\n",
"report_link 0\n",
"text 38\n",
"posted 2668\n",
"city_latitude 26804\n",
"city_longitude 26804\n",
"dtype: int64\n",
"summary Percentage of missing values: 0.05%\n",
"city Percentage of missing values: 0.28%\n",
"state Percentage of missing values: 6.82%\n",
"date_time Percentage of missing values: 1.95%\n",
"shape Percentage of missing values: 4.32%\n",
"duration Percentage of missing values: 4.74%\n",
"text Percentage of missing values: 0.03%\n",
"posted Percentage of missing values: 1.95%\n",
"city_latitude Percentage of missing values: 19.57%\n",
"city_longitude Percentage of missing values: 19.57%\n",
"Number of outliers in column 'city_latitude': 1025\n",
"Number of outliers in column 'city_longitude': 23\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1500x1000 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of rows removed: 27832\n",
"Number of outliers in column 'city_latitude': 38\n",
"Number of outliers in column 'city_longitude': 0\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1500x1000 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean city_latitude in the training set: 38.655686510239775\n",
"Mean city_latitude in the validation set: 38.6586177634688\n",
"Mean city_latitude in the test set: 38.61544768163524\n",
"\n",
"Standard deviation of city_latitude in the training set: 5.380551235399826\n",
"Standard deviation of city_latitude in the validation set: 5.34170765011401\n",
"Standard deviation of city_latitude in the test set: 5.3932492782181525\n",
"\n",
"Quartiles (training):\n",
"0.25 34.269424\n",
"0.50 39.222500\n",
"0.75 42.284678\n",
"Name: city_latitude, dtype: float64\n",
"\n",
"Quartiles (validation):\n",
"0.25 34.286571\n",
"0.50 39.302247\n",
"0.75 42.277381\n",
"Name: city_latitude, dtype: float64\n",
"\n",
"Quartiles (test):\n",
"0.25 34.194501\n",
"0.50 39.165900\n",
"0.75 42.286500\n",
"Name: city_latitude, dtype: float64\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1200x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import numpy as np\n",
"\n",
"# Load the data (this reloads the full dataset, not the 15,000-row slice used above)\n",
"df = pd.read_csv(\"../../datasets/nuforc_reports.csv\")\n",
"\n",
"\n",
"# 5. Handling missing data\n",
"\n",
"# Summary of missing data\n",
"print(\"Number of missing values in each column:\")\n",
"print(df.isnull().sum())\n",
"\n",
"# Percentage of missing values for each feature\n",
"for i in df.columns:\n",
"    null_rate = df[i].isnull().sum() / len(df) * 100\n",
"    if null_rate > 0:\n",
"        print(f'{i} Percentage of missing values: {null_rate:.2f}%')\n",
"\n",
"\n",
"\n",
"# 6. Dataset problems\n",
"# Outliers: the numeric columns (city_latitude, city_longitude) may contain anomalous values.\n",
"# Bias: the data may be skewed toward objects that are easier to notice (large, nearby).\n",
"\n",
"# 7. Solutions for the problems found\n",
"# Outliers: identify and handle outliers with a method such as the IQR rule or the Z-score.\n",
"# Bias: use data balancing methods such as oversampling.\n",
"\n",
"# 7.1 Check the dataset for outliers\n",
"# Select the columns to analyze: all numeric columns\n",
"columns_to_check = df.select_dtypes(include=np.number).columns.tolist()\n",
"def Emissions(columns_to_check):\n",
"\n",
"    # Count outliers per column using the 1.5 * IQR rule\n",
"    def count_outliers(df, columns):\n",
"        outliers_count = {}\n",
"        for col in columns:\n",
"            Q1 = df[col].quantile(0.25)\n",
"            Q3 = df[col].quantile(0.75)\n",
"            IQR = Q3 - Q1\n",
"            lower_bound = Q1 - 1.5 * IQR\n",
"            upper_bound = Q3 + 1.5 * IQR\n",
"\n",
"            # Count the rows that fall outside the bounds\n",
"            outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]\n",
"            outliers_count[col] = len(outliers)\n",
"\n",
"        return outliers_count\n",
"\n",
"    # Count the outliers (this reads the module-level df)\n",
"    outliers_count = count_outliers(df, columns_to_check)\n",
"\n",
"    # Print the number of outliers for each column\n",
"    for col, count in outliers_count.items():\n",
"        print(f\"Number of outliers in column '{col}': {count}\")\n",
"\n",
"    # Plot a histogram for each column\n",
"    plt.figure(figsize=(15, 10))\n",
"    for i, col in enumerate(columns_to_check, 1):\n",
"        plt.subplot(2, 3, i)\n",
"        sns.histplot(df[col], kde=True)\n",
"        plt.title(f'Histogram of {col}')\n",
"    plt.tight_layout()\n",
"    plt.show()\n",
"Emissions(columns_to_check)\n",
"\n",
"# city_longitude has a number of outliers within an acceptable range,\n",
"# while city_latitude has many more, so its outliers need to be handled.\n",
"# We remove the observations that contain such outliers:\n",
"# Select the columns to clean\n",
"columns_to_clean = ['city_latitude']\n",
"\n",
"# Remove rows whose value lies outside the 1.5 * IQR bounds.\n",
"# Rows with a missing value in the column are dropped as well, because comparisons with NaN are False.\n",
"def remove_outliers(df, columns):\n",
"    for col in columns:\n",
"        Q1 = df[col].quantile(0.25)\n",
"        Q3 = df[col].quantile(0.75)\n",
"        IQR = Q3 - Q1\n",
"        lower_bound = Q1 - 1.5 * IQR\n",
"        upper_bound = Q3 + 1.5 * IQR\n",
"\n",
"        # Keep only the rows inside the bounds\n",
"        df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]\n",
"\n",
"    return df\n",
"\n",
"# Remove the outliers\n",
"df_cleaned = remove_outliers(df, columns_to_clean)\n",
"\n",
"# Print the number of rows removed\n",
"print(f\"Number of rows removed: {len(df) - len(df_cleaned)}\")\n",
"\n",
"df = df_cleaned\n",
"\n",
"# Re-check the outliers after removal:\n",
"Emissions(columns_to_clean)\n",
"\n",
"# Most of the outliers in the cleaned feature are gone, as the charts show; any that remain come from the IQR bounds being recomputed on the cleaned data.\n",
"\n",
"\n",
"\n",
"#8. Разбиение данных на выборки\n",
|
|||
|
"\n",
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"\n",
|
|||
|
"train_data, temp_data = train_test_split(df, test_size=0.3, random_state=42)\n",
|
|||
|
"val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Mean latitude in each split\n",
|
|||
|
"print(\"Mean latitude in the training set:\", train_data['city_latitude'].mean())\n",
|
|||
|
"print(\"Mean latitude in the validation set:\", val_data['city_latitude'].mean())\n",
|
|||
|
"print(\"Mean latitude in the test set:\", test_data['city_latitude'].mean())\n",
|
|||
|
"print()\n",
|
|||
|
"\n",
|
|||
|
"# Standard deviation of latitude in each split\n",
|
|||
|
"print(\"Standard deviation of latitude in the training set:\", train_data['city_latitude'].std())\n",
|
|||
|
"print(\"Standard deviation of latitude in the validation set:\", val_data['city_latitude'].std())\n",
|
|||
|
"print(\"Standard deviation of latitude in the test set:\", test_data['city_latitude'].std())\n",
|
|||
|
"print()\n",
|
|||
|
"\n",
|
|||
|
"# Compare the quartiles of latitude across the splits\n",
|
|||
|
"print(\"Quartiles (training set):\")\n",
|
|||
|
"print(train_data['city_latitude'].quantile([0.25, 0.5, 0.75]))\n",
|
|||
|
"print()\n",
|
|||
|
"print(\"Quartiles (validation set):\")\n",
|
|||
|
"print(val_data['city_latitude'].quantile([0.25, 0.5, 0.75]))\n",
|
|||
|
"print()\n",
|
|||
|
"print(\"Quartiles (test set):\")\n",
|
|||
|
"print(test_data['city_latitude'].quantile([0.25, 0.5, 0.75]))\n",
|
|||
|
"\n",
|
|||
|
"# Plot a histogram for each split\n",
|
|||
|
"plt.figure(figsize=(12, 6))\n",
|
|||
|
"\n",
|
|||
|
"sns.histplot(train_data['city_latitude' ], color='blue', label='Train', kde=True)\n",
|
|||
|
"sns.histplot(val_data['city_latitude' ], color='green', label='Validation', kde=True)\n",
|
|||
|
"sns.histplot(test_data['city_latitude' ], color='red', label='Test', kde=True)\n",
|
|||
|
"\n",
|
|||
|
"plt.legend()\n",
|
|||
|
"plt.xlabel('city_latitude' )\n",
|
|||
|
"plt.ylabel('Frequency')\n",
|
|||
|
"plt.title('Latitude distribution in the training, validation, and test sets')\n",
|
|||
|
"plt.show()\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"# 9. Assess how balanced each split is and whether data augmentation is needed.\n",
|
|||
|
"# Conclusions on balance\n",
|
|||
|
"# If the classes are roughly equally represented (for example, 50%/50%), the sample is considered balanced and no augmentation is needed.\n",
|
|||
|
"# If one class strongly dominates (for example, 90%/10%), the sample is imbalanced and augmentation may be required.\n",
|
|||
|
"\n",
|
|||
|
"# The samples turned out to be insufficiently balanced, so we apply oversampling and undersampling:\n",
|
|||
|
"\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"There are no missing values."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Splitting into training, validation, and test sets"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 6,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Split sizes:\n",
|
|||
|
"Training set: 65464 records\n",
|
|||
|
"Validation set: 21822 records\n",
|
|||
|
"Test set: 21822 records\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "<base64-encoded PNG omitted>",
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1800x600 with 3 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"\n",
|
|||
|
"# Split into features and target\n",
|
|||
|
"X = df.drop(columns=['city_latitude'])  # Features: every column except 'city_latitude'\n",
|
|||
|
"# Target variable: city_latitude\n",
|
|||
|
"y = df['city_latitude']\n",
|
|||
|
"\n",
|
|||
|
"# Split into training (60%), validation (20%), and test (20%) sets\n",
|
|||
|
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)\n",
|
|||
|
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Check the split sizes\n",
|
|||
|
"print(\"Split sizes:\")\n",
|
|||
|
"print(f\"Training set: {X_train.shape[0]} records\")\n",
|
|||
|
"print(f\"Validation set: {X_val.shape[0]} records\")\n",
|
|||
|
"print(f\"Test set: {X_test.shape[0]} records\")\n",
|
|||
|
"\n",
|
|||
|
"# Plot the latitude distribution in each split\n",
|
|||
|
"plt.figure(figsize=(18, 6))\n",
|
|||
|
"\n",
|
|||
|
"plt.subplot(1, 3, 1)\n",
|
|||
|
"plt.hist(y_train, bins=30, color='blue', alpha=0.7)\n",
|
|||
|
"plt.title('Training set')\n",
|
|||
|
"plt.xlabel('Latitude')\n",
|
|||
|
"plt.ylabel('Count')\n",
|
|||
|
"\n",
|
|||
|
"plt.subplot(1, 3, 2)\n",
|
|||
|
"plt.hist(y_val, bins=30, color='green', alpha=0.7)\n",
|
|||
|
"plt.title('Validation set')\n",
|
|||
|
"plt.xlabel('Latitude')\n",
|
|||
|
"plt.ylabel('Count')\n",
|
|||
|
"\n",
|
|||
|
"plt.subplot(1, 3, 3)\n",
|
|||
|
"plt.hist(y_test, bins=30, color='red', alpha=0.7)\n",
|
|||
|
"plt.title('Test set')\n",
|
|||
|
"plt.xlabel('Latitude')\n",
|
|||
|
"plt.ylabel('Count')\n",
|
|||
|
"\n",
|
|||
|
"plt.show()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"**Balancing the samples**"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 7,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Split sizes:\n",
|
|||
|
"Training set: 6000 records\n",
|
|||
|
"Validation set: 2000 records\n",
|
|||
|
"Test set: 2000 records\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "<base64-encoded PNG omitted>",
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1000x600 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"import numpy as np\n",
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"from sklearn.preprocessing import StandardScaler\n",
|
|||
|
"\n",
|
|||
|
"# Split into features and target, using the first 10,000 rows\n",
|
|||
|
"X = df.drop(columns=['city_latitude']).head(10000)  # Features: every column except 'city_latitude'\n",
|
|||
|
"y = df['city_latitude'].head(10000)  # Target variable: city_latitude\n",
|
|||
|
"\n",
|
|||
|
"# Apply one-hot encoding to the categorical features\n",
|
|||
|
"X = pd.get_dummies(X, drop_first=True)\n",
|
|||
|
"\n",
|
|||
|
"# Split into training (60%), validation (20%), and test (20%) sets\n",
|
|||
|
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)\n",
|
|||
|
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Check the split sizes\n",
|
|||
|
"print(\"Split sizes:\")\n",
|
|||
|
"print(f\"Training set: {X_train.shape[0]} records\")\n",
|
|||
|
"print(f\"Validation set: {X_val.shape[0]} records\")\n",
|
|||
|
"print(f\"Test set: {X_test.shape[0]} records\")\n",
|
|||
|
"\n",
|
|||
|
"# Remove outliers from the training set (latitudes above the 95th percentile)\n",
|
|||
|
"upper_limit = y_train.quantile(0.95)\n",
|
|||
|
"X_train = X_train[y_train <= upper_limit]\n",
|
|||
|
"y_train = y_train[y_train <= upper_limit]\n",
|
|||
|
"\n",
|
|||
|
"# Log-transform the target variable\n",
|
|||
|
"y_train_log = np.log1p(y_train)\n",
|
|||
|
"y_val_log = np.log1p(y_val)\n",
|
|||
|
"y_test_log = np.log1p(y_test)\n",
|
|||
|
"\n",
|
|||
|
"# Standardize the features (the scaler is fitted on the training set only)\n",
|
|||
|
"scaler = StandardScaler()\n",
|
|||
|
"X_train_scaled = scaler.fit_transform(X_train)\n",
|
|||
|
"X_val_scaled = scaler.transform(X_val)\n",
|
|||
|
"X_test_scaled = scaler.transform(X_test)\n",
|
|||
|
"\n",
|
|||
|
"# Plot the latitude distribution in the balanced training set\n",
|
|||
|
"plt.figure(figsize=(10, 6))\n",
|
|||
|
"plt.hist(y_train_log, bins=30, color='orange', alpha=0.7)\n",
|
|||
|
"plt.title('Balanced training set (log-transformed)')\n",
|
|||
|
"plt.xlabel('Log of latitude')\n",
|
|||
|
"plt.ylabel('Count')\n",
|
|||
|
"plt.show()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"**One-hot encoding of categorical features**"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 8,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Data before one-hot encoding:\n",
|
|||
|
" summary city state \\\n",
|
|||
|
"0 Viewed some red lights in the sky appearing to... Visalia CA \n",
|
|||
|
"1 Look like 1 or 3 crafts from North traveling s... Cincinnati OH \n",
|
|||
|
"3 One red light moving switly west to east, beco... Knoxville TN \n",
|
|||
|
"4 Bright, circular Fresnel-lens shaped light sev... Alexandria VA \n",
|
|||
|
"5 I'm familiar with all the fakery and UFO sight... Fullerton CA \n",
|
|||
|
"... ... ... ... \n",
|
|||
|
"12494 star like stop and go satellite San Francisco CA \n",
|
|||
|
"12495 Two Balls of Light sighted in Tulsa, Ok. Tulsa OK \n",
|
|||
|
"12496 Highley reflective silver oval/disk seen in sk... St Louis MO \n",
|
|||
|
"12497 While walking on broadwalk on south west corne... Tempe AZ \n",
|
|||
|
"12498 was sitting on my front porch with granddaught... Elkhart KS \n",
|
|||
|
"\n",
|
|||
|
" date_time shape duration \\\n",
|
|||
|
"0 2021-12-15T21:45:00 light 2 minutes \n",
|
|||
|
"1 2021-12-16T09:45:00 triangle 14 seconds \n",
|
|||
|
"3 2021-12-10T19:30:00 triangle 20-30 seconds \n",
|
|||
|
"4 2021-12-07T08:00:00 circle NaN \n",
|
|||
|
"5 2020-07-07T23:00:00 unknown 2 minutes \n",
|
|||
|
"... ... ... ... \n",
|
|||
|
"12494 2000-06-05T23:30:00 NaN 1 minute \n",
|
|||
|
"12495 2000-06-06T08:15:00 circle about 5 minutes \n",
|
|||
|
"12496 2000-06-06T17:22:00 oval one minute \n",
|
|||
|
"12497 2000-06-07T22:00:00 unknown 4 seconds \n",
|
|||
|
"12498 2000-06-07T23:48:00 circle 1to2 minutes \n",
|
|||
|
"\n",
|
|||
|
" stats \\\n",
|
|||
|
"0 Occurred : 12/15/2021 21:45 (Entered as : 12/... \n",
|
|||
|
"1 Occurred : 12/16/2021 09:45 (Entered as : 12/... \n",
|
|||
|
"3 Occurred : 12/10/2021 19:30 (Entered as : 12/... \n",
|
|||
|
"4 Occurred : 12/7/2021 08:00 (Entered as : 12/0... \n",
|
|||
|
"5 Occurred : 7/7/2020 23:00 (Entered as : 07/07... \n",
|
|||
|
"... ... \n",
|
|||
|
"12494 Occurred : 6/5/2000 23:30 (Entered as : 06/05... \n",
|
|||
|
"12495 Occurred : 6/6/2000 08:15 (Entered as : 06/06... \n",
|
|||
|
"12496 Occurred : 6/6/2000 17:22 (Entered as : 06/06... \n",
|
|||
|
"12497 Occurred : 6/7/2000 22:00 (Entered as : 060/7... \n",
|
|||
|
"12498 Occurred : 6/7/2000 23:48 (Entered as : 6/7/2... \n",
|
|||
|
"\n",
|
|||
|
" report_link \\\n",
|
|||
|
"0 http://www.nuforc.org/webreports/165/S165881.html \n",
|
|||
|
"1 http://www.nuforc.org/webreports/165/S165888.html \n",
|
|||
|
"3 http://www.nuforc.org/webreports/165/S165825.html \n",
|
|||
|
"4 http://www.nuforc.org/webreports/165/S165754.html \n",
|
|||
|
"5 http://www.nuforc.org/webreports/157/S157444.html \n",
|
|||
|
"... ... \n",
|
|||
|
"12494 http://www.nuforc.org/webreports/013/S13042.html \n",
|
|||
|
"12495 http://www.nuforc.org/webreports/013/S13150.html \n",
|
|||
|
"12496 http://www.nuforc.org/webreports/013/S13043.html \n",
|
|||
|
"12497 http://www.nuforc.org/webreports/013/S13054.html \n",
|
|||
|
"12498 http://www.nuforc.org/webreports/013/S13051.html \n",
|
|||
|
"\n",
|
|||
|
" text posted \\\n",
|
|||
|
"0 Viewed some red lights in the sky appearing to... 2021-12-19T00:00:00 \n",
|
|||
|
"1 Look like 1 or 3 crafts from North traveling s... 2021-12-19T00:00:00 \n",
|
|||
|
"3 One red light moving switly west to east, beco... 2021-12-19T00:00:00 \n",
|
|||
|
"4 Bright, circular Fresnel-lens shaped light sev... 2021-12-19T00:00:00 \n",
|
|||
|
"5 I'm familiar with all the fakery and UFO sight... 2020-07-09T00:00:00 \n",
|
|||
|
"... ... ... \n",
|
|||
|
"12494 star like stop and go satellite A white star-l... 2000-06-21T00:00:00 \n",
|
|||
|
"12495 Two Balls of Light sighted in Tulsa, Ok. At su... 2000-06-21T00:00:00 \n",
|
|||
|
"12496 Highley reflective silver oval/disk seen in sk... 2000-06-21T00:00:00 \n",
|
|||
|
"12497 On southwest corner of Tempe Lake heard and sa... 2000-06-21T00:00:00 \n",
|
|||
|
"12498 was sitting on my front porch with granddaught... 2000-06-21T00:00:00 \n",
|
|||
|
"\n",
|
|||
|
" city_latitude city_longitude \n",
|
|||
|
"0 36.356650 -119.347937 \n",
|
|||
|
"1 39.174503 -84.481363 \n",
|
|||
|
"3 35.961561 -83.980115 \n",
|
|||
|
"4 38.798958 -77.095133 \n",
|
|||
|
"5 33.877422 -117.924978 \n",
|
|||
|
"... ... ... \n",
|
|||
|
"12494 37.769992 -122.425394 \n",
|
|||
|
"12495 36.109456 -95.935245 \n",
|
|||
|
"12496 38.623825 -90.308528 \n",
|
|||
|
"12497 33.414036 -111.920920 \n",
|
|||
|
"12498 37.046000 -101.853100 \n",
|
|||
|
"\n",
|
|||
|
"[10000 rows x 12 columns]\n",
|
|||
|
"\n",
|
|||
|
"Data after one-hot encoding:\n",
|
|||
|
" city_latitude city_longitude \\\n",
|
|||
|
"0 36.356650 -119.347937 \n",
|
|||
|
"1 39.174503 -84.481363 \n",
|
|||
|
"3 35.961561 -83.980115 \n",
|
|||
|
"4 38.798958 -77.095133 \n",
|
|||
|
"5 33.877422 -117.924978 \n",
|
|||
|
"... ... ... \n",
|
|||
|
"12494 37.769992 -122.425394 \n",
|
|||
|
"12495 36.109456 -95.935245 \n",
|
|||
|
"12496 38.623825 -90.308528 \n",
|
|||
|
"12497 33.414036 -111.920920 \n",
|
|||
|
"12498 37.046000 -101.853100 \n",
|
|||
|
"\n",
|
|||
|
" summary_ A couple stopped me and told me "Am I crazy or those lights are moving? ((Starlink satellites?)) \\\n",
|
|||
|
"0 False \n",
|
|||
|
"1 False \n",
|
|||
|
"3 False \n",
|
|||
|
"4 False \n",
|
|||
|
"5 False \n",
|
|||
|
"... ... \n",
|
|||
|
"12494 False \n",
|
|||
|
"12495 False \n",
|
|||
|
"12496 False \n",
|
|||
|
"12497 False \n",
|
|||
|
"12498 False \n",
|
|||
|
"\n",
|
|||
|
" summary_ All total, I counted 15 aircraft flying in single file with no audible noise ((Starlink satellites?)) \\\n",
|
|||
|
"0 False \n",
|
|||
|
"1 False \n",
|
|||
|
"3 False \n",
|
|||
|
"4 False \n",
|
|||
|
"5 False \n",
|
|||
|
"... ... \n",
|
|||
|
"12494 False \n",
|
|||
|
"12495 False \n",
|
|||
|
"12496 False \n",
|
|||
|
"12497 False \n",
|
|||
|
"12498 False \n",
|
|||
|
"\n",
|
|||
|
" summary_ I and 3 others observed 4 lights in the sky. \\\n",
|
|||
|
"0 False \n",
|
|||
|
"1 False \n",
|
|||
|
"3 False \n",
|
|||
|
"4 False \n",
|
|||
|
"5 False \n",
|
|||
|
"... ... \n",
|
|||
|
"12494 False \n",
|
|||
|
"12495 False \n",
|
|||
|
"12496 False \n",
|
|||
|
"12497 False \n",
|
|||
|
"12498 False \n",
|
|||
|
"\n",
|
|||
|
" summary_ I kept watching and counted 10 objects that seemed to come from nowhere just appear out of the blackness. ((Starlink satellites?)) \\\n",
|
|||
|
"0 False \n",
|
|||
|
"1 False \n",
|
|||
|
"3 False \n",
|
|||
|
"4 False \n",
|
|||
|
"5 False \n",
|
|||
|
"... ... \n",
|
|||
|
"12494 False \n",
|
|||
|
"12495 False \n",
|
|||
|
"12496 False \n",
|
|||
|
"12497 False \n",
|
|||
|
"12498 False \n",
|
|||
|
"\n",
|
|||
|
" summary_ I noticed the stars and another in a perfect line. ((Starlink satellites)) \\\n",
|
|||
|
"0 False \n",
|
|||
|
"1 False \n",
|
|||
|
"3 False \n",
|
|||
|
"4 False \n",
|
|||
|
"5 False \n",
|
|||
|
"... ... \n",
|
|||
|
"12494 False \n",
|
|||
|
"12495 False \n",
|
|||
|
"12496 False \n",
|
|||
|
"12497 False \n",
|
|||
|
"12498 False \n",
|
|||
|
"\n",
|
|||
|
" summary_ I saw what looked like a star moving then I seen more. ((Starlink satellites?)) \\\n",
|
|||
|
"0 False \n",
|
|||
|
"1 False \n",
|
|||
|
"3 False \n",
|
|||
|
"4 False \n",
|
|||
|
"5 False \n",
|
|||
|
"... ... \n",
|
|||
|
"12494 False \n",
|
|||
|
"12495 False \n",
|
|||
|
"12496 False \n",
|
|||
|
"12497 False \n",
|
|||
|
"12498 False \n",
|
|||
|
"\n",
|
|||
|
" summary_ I see 5 lights in a line, separated evenly, flying at the same speed. ((\"Starlink\" satellites??))((anonymous)) \\\n",
|
|||
|
"0 False \n",
|
|||
|
"1 False \n",
|
|||
|
"3 False \n",
|
|||
|
"4 False \n",
|
|||
|
"5 False \n",
|
|||
|
"... ... \n",
|
|||
|
"12494 False \n",
|
|||
|
"12495 False \n",
|
|||
|
"12496 False \n",
|
|||
|
"12497 False \n",
|
|||
|
"12498 False \n",
|
|||
|
"\n",
|
|||
|
" summary_ We have witnessed multiple lights looking like satellites with varing degrees of brightness in orbit. ((Starlink satellites?)) \\\n",
|
|||
|
"0 False \n",
|
|||
|
"1 False \n",
|
|||
|
"3 False \n",
|
|||
|
"4 False \n",
|
|||
|
"5 False \n",
|
|||
|
"... ... \n",
|
|||
|
"12494 False \n",
|
|||
|
"12495 False \n",
|
|||
|
"12496 False \n",
|
|||
|
"12497 False \n",
|
|||
|
"12498 False \n",
|
|||
|
"\n",
|
|||
|
" ... posted_2020-07-03T00:00:00 posted_2020-07-09T00:00:00 \\\n",
|
|||
|
"0 ... False False \n",
|
|||
|
"1 ... False False \n",
|
|||
|
"3 ... False False \n",
|
|||
|
"4 ... False False \n",
|
|||
|
"5 ... False True \n",
|
|||
|
"... ... ... ... \n",
|
|||
|
"12494 ... False False \n",
|
|||
|
"12495 ... False False \n",
|
|||
|
"12496 ... False False \n",
|
|||
|
"12497 ... False False \n",
|
|||
|
"12498 ... False False \n",
|
|||
|
"\n",
|
|||
|
" posted_2020-07-23T00:00:00 posted_2020-07-31T00:00:00 \\\n",
|
|||
|
"0 False False \n",
|
|||
|
"1 False False \n",
|
|||
|
"3 False False \n",
|
|||
|
"4 False False \n",
|
|||
|
"5 False False \n",
|
|||
|
"... ... ... \n",
|
|||
|
"12494 False False \n",
|
|||
|
"12495 False False \n",
|
|||
|
"12496 False False \n",
|
|||
|
"12497 False False \n",
|
|||
|
"12498 False False \n",
|
|||
|
"\n",
|
|||
|
" posted_2020-08-06T00:00:00 posted_2020-08-20T00:00:00 \\\n",
|
|||
|
"0 False False \n",
|
|||
|
"1 False False \n",
|
|||
|
"3 False False \n",
|
|||
|
"4 False False \n",
|
|||
|
"5 False False \n",
|
|||
|
"... ... ... \n",
|
|||
|
"12494 False False \n",
|
|||
|
"12495 False False \n",
|
|||
|
"12496 False False \n",
|
|||
|
"12497 False False \n",
|
|||
|
"12498 False False \n",
|
|||
|
"\n",
|
|||
|
" posted_2020-08-27T00:00:00 posted_2020-09-04T00:00:00 \\\n",
|
|||
|
"0 False False \n",
|
|||
|
"1 False False \n",
|
|||
|
"3 False False \n",
|
|||
|
"4 False False \n",
|
|||
|
"5 False False \n",
|
|||
|
"... ... ... \n",
|
|||
|
"12494 False False \n",
|
|||
|
"12495 False False \n",
|
|||
|
"12496 False False \n",
|
|||
|
"12497 False False \n",
|
|||
|
"12498 False False \n",
|
|||
|
"\n",
|
|||
|
" posted_2020-11-05T00:00:00 posted_2021-12-19T00:00:00 \n",
|
|||
|
"0 False True \n",
|
|||
|
"1 False True \n",
|
|||
|
"3 False True \n",
|
|||
|
"4 False True \n",
|
|||
|
"5 False False \n",
|
|||
|
"... ... ... \n",
|
|||
|
"12494 False False \n",
|
|||
|
"12495 False False \n",
|
|||
|
"12496 False False \n",
|
|||
|
"12497 False False \n",
|
|||
|
"12498 False False \n",
|
|||
|
"\n",
|
|||
|
"[10000 rows x 53560 columns]\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"print(\"Data before one-hot encoding:\")\n",
|
|||
|
"print(df.head(10000))\n",
|
|||
|
"\n",
|
|||
|
"# Apply one-hot encoding to the categorical features\n",
|
|||
|
"df_encoded = pd.get_dummies(df.head(10000), drop_first=True)\n",
|
|||
|
"\n",
|
|||
|
"print(\"\\nData after one-hot encoding:\")\n",
|
|||
|
"print(df_encoded.head(10000))"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"**Discretization of numeric features**"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 9,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"22.1938\n",
|
|||
|
"54.4333\n",
|
|||
|
"32.23950000000001\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"print(df['city_latitude'].min())\n",
|
|||
|
"print(df['city_latitude'].max())\n",
|
|||
|
"print(df['city_latitude'].max() - df['city_latitude'].min())"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 10,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Data before discretization:\n",
|
|||
|
"[22, 36, 49, 63, 77, inf]\n",
|
|||
|
"\n",
|
|||
|
"Data after discretization:\n",
|
|||
|
" city_latitude price_bins\n",
|
|||
|
"0 36.356650 36-49\n",
|
|||
|
"1 39.174503 36-49\n",
|
|||
|
"3 35.961561 22-36\n",
|
|||
|
"4 38.798958 36-49\n",
|
|||
|
"5 33.877422 22-36\n",
|
|||
|
"6 36.141246 36-49\n",
|
|||
|
"8 40.294123 36-49\n",
|
|||
|
"10 40.698700 36-49\n",
|
|||
|
"12 44.072800 36-49\n",
|
|||
|
"13 42.312800 36-49\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
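|
"print(\"Data before discretization:\")\n",
|
|||
|
"#print(df.head(10))\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"# Define the bin edges and labels for discretization.\n",
|
|||
|
"# Note: the edges are computed as min + max * fraction, so the upper bins lie above the observed maximum.\n",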
|
|||
|
"bins = [\n",
|
|||
|
"round(df['city_latitude'].min() + df['city_latitude'].max() * 0), \n",
|
|||
|
"round(df['city_latitude'].min() + df['city_latitude'].max() * 0.25), \n",
|
|||
|
"round(df['city_latitude'].min() + df['city_latitude'].max() * 0.50), \n",
|
|||
|
"round(df['city_latitude'].min() + df['city_latitude'].max() * 0.75),\n",
|
|||
|
"round(df['city_latitude'].min() + df['city_latitude'].max() * 1),\n",
|
|||
|
"float('inf')\n",
|
|||
|
"]\n",
|
|||
|
"print(bins)\n",
|
|||
|
"labels = [\n",
|
|||
|
"str(round(df['city_latitude'].min() + df['city_latitude'].max() * 0)) + '-' + str(round(df['city_latitude'].min() + df['city_latitude'].max() * 0.25)),\n",
|
|||
|
"str(round(df['city_latitude'].min() + df['city_latitude'].max() * 0.25)) + '-' + str(round(df['city_latitude'].min() + df['city_latitude'].max() * 0.5)),\n",
|
|||
|
"str(round(df['city_latitude'].min() + df['city_latitude'].max() * 0.5)) + '-' + str(round(df['city_latitude'].min() + df['city_latitude'].max() * 0.75)),\n",
|
|||
|
"str(round(df['city_latitude'].min() + df['city_latitude'].max() * 0.75)) + '-' + str(round(df['city_latitude'].min() + df['city_latitude'].max() * 1)),\n",
|
|||
|
"str(round(df['city_latitude'].min() + df['city_latitude'].max() * 1)) + '+'\n",
|
|||
|
"]\n",
|
|||
|
"\n",
|
|||
|
"# Apply the discretization\n",
|
|||
|
"df['price_bins'] = pd.cut(df['city_latitude'], bins=bins, labels=labels, right=False)\n",
|
|||
|
"\n",
|
|||
|
"print(\"\\nData after discretization:\")\n",
|
|||
|
"print(df[['city_latitude', 'price_bins']].head(10))"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"**Manual feature engineering**\n",
|
|||
|
"\n",
|
|||
|
"Creating new features based on expert knowledge and domain logic. For example, for house-sales data one could create a price-per-unit feature."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 11,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Data before feature synthesis:\n",
|
|||
|
" summary city state \\\n",
|
|||
|
"0 Viewed some red lights in the sky appearing to... Visalia CA \n",
|
|||
|
"1 Look like 1 or 3 crafts from North traveling s... Cincinnati OH \n",
|
|||
|
"3 One red light moving switly west to east, beco... Knoxville TN \n",
|
|||
|
"4 Bright, circular Fresnel-lens shaped light sev... Alexandria VA \n",
|
|||
|
"5 I'm familiar with all the fakery and UFO sight... Fullerton CA \n",
|
|||
|
"6 I was driving up lakes mead towards the lake a... Las Vegas NV \n",
|
|||
|
"8 Wing shaped craft seen at night, no lights, no... Orem UT \n",
|
|||
|
"10 Yellow light floating across my grass as it de... Springfield NJ \n",
|
|||
|
"12 A trail of star like lights moving from the W.... Janesville MN \n",
|
|||
|
"13 Large bright ball not registering on any app a... Bangor MI \n",
|
|||
|
"\n",
|
|||
|
" date_time shape duration \\\n",
|
|||
|
"0 2021-12-15T21:45:00 light 2 minutes \n",
|
|||
|
"1 2021-12-16T09:45:00 triangle 14 seconds \n",
|
|||
|
"3 2021-12-10T19:30:00 triangle 20-30 seconds \n",
|
|||
|
"4 2021-12-07T08:00:00 circle NaN \n",
|
|||
|
"5 2020-07-07T23:00:00 unknown 2 minutes \n",
|
|||
|
"6 2020-04-23T03:00:00 oval 10 minutes \n",
|
|||
|
"8 2020-04-18T23:00:00 other 10 seconds \n",
|
|||
|
"10 2020-05-13T03:37:00 light 7 seconds \n",
|
|||
|
"12 2020-04-18T21:05:00 light 15 minutes \n",
|
|||
|
"13 2020-04-18T22:30:00 triangle 45 minutes \n",
|
|||
|
"\n",
|
|||
|
" stats \\\n",
|
|||
|
"0 Occurred : 12/15/2021 21:45 (Entered as : 12/... \n",
|
|||
|
"1 Occurred : 12/16/2021 09:45 (Entered as : 12/... \n",
|
|||
|
"3 Occurred : 12/10/2021 19:30 (Entered as : 12/... \n",
|
|||
|
"4 Occurred : 12/7/2021 08:00 (Entered as : 12/0... \n",
|
|||
|
"5 Occurred : 7/7/2020 23:00 (Entered as : 07/07... \n",
|
|||
|
"6 Occurred : 4/23/2020 03:00 (Entered as : 4/23... \n",
|
|||
|
"8 Occurred : 4/18/2020 23:00 (Entered as : 04/1... \n",
|
|||
|
"10 Occurred : 5/13/2020 03:37 (Entered as : 05/1... \n",
|
|||
|
"12 Occurred : 4/18/2020 21:05 (Entered as : 04/1... \n",
|
|||
|
"13 Occurred : 4/18/2020 22:30 (Entered as : 04/1... \n",
|
|||
|
"\n",
|
|||
|
" report_link \\\n",
|
|||
|
"0 http://www.nuforc.org/webreports/165/S165881.html \n",
|
|||
|
"1 http://www.nuforc.org/webreports/165/S165888.html \n",
|
|||
|
"3 http://www.nuforc.org/webreports/165/S165825.html \n",
|
|||
|
"4 http://www.nuforc.org/webreports/165/S165754.html \n",
|
|||
|
"5 http://www.nuforc.org/webreports/157/S157444.html \n",
|
|||
|
"6 http://www.nuforc.org/webreports/155/S155608.html \n",
|
|||
|
"8 http://www.nuforc.org/webreports/155/S155512.html \n",
|
|||
|
"10 http://www.nuforc.org/webreports/155/S155647.html \n",
|
|||
|
"12 http://www.nuforc.org/webreports/155/S155497.html \n",
|
|||
|
"13 http://www.nuforc.org/webreports/155/S155495.html \n",
|
|||
|
"\n",
|
|||
|
" text posted \\\n",
|
|||
|
"0 Viewed some red lights in the sky appearing to... 2021-12-19T00:00:00 \n",
|
|||
|
"1 Look like 1 or 3 crafts from North traveling s... 2021-12-19T00:00:00 \n",
|
|||
|
"3 One red light moving switly west to east, beco... 2021-12-19T00:00:00 \n",
|
|||
|
"4 Bright, circular Fresnel-lens shaped light sev... 2021-12-19T00:00:00 \n",
|
|||
|
"5 I'm familiar with all the fakery and UFO sight... 2020-07-09T00:00:00 \n",
|
|||
|
"6 I was driving up lakes mead towards the lake a... 2020-05-01T00:00:00 \n",
|
|||
|
"8 Wing shaped craft seen at night, no lights, no... 2020-05-01T00:00:00 \n",
|
|||
|
"10 Yellow light floating across my grass as it de... 2020-05-15T00:00:00 \n",
|
|||
|
"12 A trail of star like lights moving from the W.... 2020-05-15T00:00:00 \n",
|
|||
|
"13 Large bright ball not registering on any app a... 2020-05-15T00:00:00 \n",
|
|||
|
"\n",
|
|||
|
" city_latitude city_longitude price_bins \n",
|
|||
|
"0 36.356650 -119.347937 36-49 \n",
|
|||
|
"1 39.174503 -84.481363 36-49 \n",
|
|||
|
"3 35.961561 -83.980115 22-36 \n",
|
|||
|
"4 38.798958 -77.095133 36-49 \n",
|
|||
|
"5 33.877422 -117.924978 22-36 \n",
|
|||
|
"6 36.141246 -115.186592 36-49 \n",
|
|||
|
"8 40.294123 -111.701685 36-49 \n",
|
|||
|
"10 40.698700 -74.329600 36-49 \n",
|
|||
|
"12 44.072800 -93.728600 36-49 \n",
|
|||
|
"13 42.312800 -86.081300 36-49 \n",
|
|||
|
"\n",
"Data after synthesizing feature 'relative_appearing':\n",
|
|||
|
" city_latitude state relative_appearing\n",
|
|||
|
"0 36.356650 CA 1.018610\n",
|
|||
|
"1 39.174503 OH 0.969926\n",
|
|||
|
"3 35.961561 TN 1.001705\n",
|
|||
|
"4 38.798958 VA 1.026087\n",
|
|||
|
"5 33.877422 CA 0.949149\n",
|
|||
|
"6 36.141246 NV 0.968071\n",
|
|||
|
"8 40.294123 UT 1.003457\n",
|
|||
|
"10 40.698700 NJ 1.008448\n",
|
|||
|
"12 44.072800 MN 0.970854\n",
|
|||
|
"13 42.312800 MI 0.984641\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"# Inspect the first rows before feature synthesis\n",
"print(\"Data before feature synthesis:\")\n",
"print(df.head(10))\n",
"\n",
"# Mean city latitude within each state\n",
"mean_latitude_by_state = df.groupby('state')['city_latitude'].transform('mean')\n",
"\n",
"# New feature 'relative_appearing': city latitude relative to the mean latitude of its state\n",
"df['relative_appearing'] = df['city_latitude'] / mean_latitude_by_state\n",
"\n",
"# Inspect the first rows after feature synthesis\n",
"print(\"\\nData after synthesizing feature 'relative_appearing':\")\n",
"print(df[['city_latitude', 'state', 'relative_appearing']].head(10))"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"**Feature scaling via normalization and standardization**\n",
"\n",
"Feature scaling transforms numeric features so that they share a common scale. This matters for many machine learning algorithms that are sensitive to feature scale, such as linear regression (notably with regularization or gradient-based training), support vector machines (SVM), and neural networks."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 12,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
"Data before scaling:\n",
|
|||
|
" city_latitude relative_city_latitude\n",
|
|||
|
"0 36.356650 1.018610\n",
|
|||
|
"1 39.174503 0.969926\n",
|
|||
|
"3 35.961561 1.001705\n",
|
|||
|
"4 38.798958 1.026087\n",
|
|||
|
"5 33.877422 0.949149\n",
|
|||
|
"\n",
"Data after normalization:\n",
|
|||
|
" city_latitude relative_city_latitude\n",
|
|||
|
"0 0.439301 0.475114\n",
|
|||
|
"1 0.526705 0.349452\n",
|
|||
|
"3 0.427046 0.431477\n",
|
|||
|
"4 0.515056 0.494412\n",
|
|||
|
"5 0.362401 0.295823\n",
|
|||
|
"\n",
"Data after standardization:\n",
|
|||
|
" city_latitude relative_city_latitude\n",
|
|||
|
"0 -0.426560 0.533974\n",
|
|||
|
"1 0.097536 -0.862875\n",
|
|||
|
"3 -0.500043 0.048911\n",
|
|||
|
"4 0.027688 0.748491\n",
|
|||
|
"5 -0.887675 -1.459012\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"from sklearn.preprocessing import MinMaxScaler, StandardScaler\n",
"\n",
"# New feature 'relative_city_latitude': city latitude relative to the mean latitude of its state\n",
"mean_city_latitude_by_state = df.groupby('state')['city_latitude'].transform('mean')\n",
"df['relative_city_latitude'] = df['city_latitude'] / mean_city_latitude_by_state\n",
"\n",
"# Inspect the first rows before scaling\n",
"print(\"Data before scaling:\")\n",
"print(df[['city_latitude', 'relative_city_latitude']].head())\n",
"\n",
"# Min-max normalization: rescales each feature to the [0, 1] range\n",
"min_max_scaler = MinMaxScaler()\n",
"df[['city_latitude', 'relative_city_latitude']] = min_max_scaler.fit_transform(df[['city_latitude', 'relative_city_latitude']])\n",
"\n",
"# Inspect the first rows after normalization\n",
"print(\"\\nData after normalization:\")\n",
"print(df[['city_latitude', 'relative_city_latitude']].head())\n",
"\n",
"# Standardization: zero mean and unit variance per feature.\n",
"# Note: this runs on the already normalized columns and overwrites them in place.\n",
"standard_scaler = StandardScaler()\n",
"df[['city_latitude', 'relative_city_latitude']] = standard_scaler.fit_transform(df[['city_latitude', 'relative_city_latitude']])\n",
"\n",
"# Inspect the first rows after standardization\n",
"print(\"\\nData after standardization:\")\n",
"print(df[['city_latitude', 'relative_city_latitude']].head())"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"**Feature engineering with the Featuretools framework**"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 13,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"c:\\Users\\tumvu\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
|
|||
|
" pd.to_datetime(\n",
|
|||
|
"c:\\Users\\tumvu\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
|
|||
|
" warnings.warn(\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Built 23 features\n",
|
|||
|
"Elapsed: 00:14 | Progress: 95%|█████████▌"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"c:\\Users\\tumvu\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\woodwork\\logical_types.py:841: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
|
|||
|
" series = series.replace(ww.config.get_option(\"nan_values\"), np.nan)\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Elapsed: 00:17 | Progress: 100%|██████████\n",
"New features created with Featuretools:\n",
|
|||
|
" city state shape duration city_latitude \\\n",
|
|||
|
"index \n",
|
|||
|
"0 Visalia CA light 2 minutes -0.426560 \n",
|
|||
|
"1 Cincinnati OH triangle 14 seconds 0.097536 \n",
|
|||
|
"2 Knoxville TN triangle 20-30 seconds -0.500043 \n",
|
|||
|
"3 Alexandria VA circle NaN 0.027688 \n",
|
|||
|
"4 Fullerton CA unknown 2 minutes -0.887675 \n",
|
|||
|
"\n",
|
|||
|
" city_longitude price_bins relative_appearing relative_city_latitude \\\n",
|
|||
|
"index \n",
|
|||
|
"0 -119.347937 36-49 1.018610 0.775416 \n",
|
|||
|
"1 -84.481363 36-49 0.969926 0.301550 \n",
|
|||
|
"2 -83.980115 22-36 1.001705 0.977744 \n",
|
|||
|
"3 -77.095133 36-49 1.026087 -0.177743 \n",
|
|||
|
"4 -117.924978 22-36 0.949149 1.613646 \n",
|
|||
|
"\n",
|
|||
|
" DAY(date_time) ... NUM_CHARACTERS(stats) NUM_CHARACTERS(summary) \\\n",
|
|||
|
"index ... \n",
|
|||
|
"0 15 ... 174 91 \n",
|
|||
|
"1 16 ... 180 114 \n",
|
|||
|
"2 10 ... 182 108 \n",
|
|||
|
"3 7 ... 166 127 \n",
|
|||
|
"4 7 ... 167 134 \n",
|
|||
|
"\n",
|
|||
|
" NUM_CHARACTERS(text) NUM_WORDS(stats) NUM_WORDS(summary) \\\n",
|
|||
|
"index \n",
|
|||
|
"0 811 22 17 \n",
|
|||
|
"1 302 22 22 \n",
|
|||
|
"2 1633 22 20 \n",
|
|||
|
"3 1813 21 18 \n",
|
|||
|
"4 1275 21 28 \n",
|
|||
|
"\n",
|
|||
|
" NUM_WORDS(text) WEEKDAY(date_time) WEEKDAY(posted) YEAR(date_time) \\\n",
|
|||
|
"index \n",
|
|||
|
"0 147 2 6 2021 \n",
|
|||
|
"1 59 3 6 2021 \n",
|
|||
|
"2 304 4 6 2021 \n",
|
|||
|
"3 322 1 6 2021 \n",
|
|||
|
"4 233 1 3 2020 \n",
|
|||
|
"\n",
|
|||
|
" YEAR(posted) \n",
|
|||
|
"index \n",
|
|||
|
"0 2021 \n",
|
|||
|
"1 2021 \n",
|
|||
|
"2 2021 \n",
|
|||
|
"3 2021 \n",
|
|||
|
"4 2020 \n",
|
|||
|
"\n",
|
|||
|
"[5 rows x 23 columns]\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"import featuretools as ft\n",
"\n",
"# Recompute 'relative_city_latitude'.\n",
"# Note: 'city_latitude' was standardized in place by the previous cell, so this ratio\n",
"# is taken over standardized values and differs from the ratio computed earlier.\n",
"mean_city_latitude_by_state = df.groupby('state')['city_latitude'].transform('mean')\n",
"df['relative_city_latitude'] = df['city_latitude'] / mean_city_latitude_by_state\n",
"\n",
"# Create the EntitySet\n",
"es = ft.EntitySet(id='jio_mart_items')\n",
"\n",
"# Add the dataframe, explicitly creating an index column\n",
"es = es.add_dataframe(dataframe_name='items_data', dataframe=df, index='index', make_index=True)\n",
"\n",
"# Run deep feature synthesis\n",
"features, feature_defs = ft.dfs(entityset=es, target_dataframe_name='items_data', verbose=True)\n",
"\n",
"# Inspect the first rows of the generated features\n",
"print(\"New features created with Featuretools:\")\n",
"print(features.head())"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"**Quality assessment**\n",
"\n",
"*Predictive power. Metrics:* RMSE, MAE, R² \n",
"\n",
"*Methods:* Train the model on the training set and evaluate it on the validation and test sets. \n",
"\n",
"*Computation speed. Methods:* Measure the time taken by feature generation and model training. \n",
"\n",
"*Reliability. Methods:* Cross-validation and analysis of the model's sensitivity to changes in the data. \n",
"\n",
"*Correlation. Methods:* Analyze the feature correlation matrix and remove multicollinear features. \n",
"\n",
"*Coherence. Methods:* Check the logical relationship between the features and the target variable; interpret the model's results. "
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 14,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"c:\\Users\\tumvu\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\sklearn\\metrics\\_regression.py:492: FutureWarning: 'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.\n",
|
|||
|
" warnings.warn(\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"RMSE: 1.5489449749619157\n",
|
|||
|
"R²: 0.924423816721939\n",
|
|||
|
"MAE: 0.46826566221280963\n",
|
|||
|
"Training Time: 37.429771184921265 seconds\n",
|
|||
|
"Cross-validated RMSE: 1.535611497565107\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"C:\\Users\\tumvu\\AppData\\Local\\Temp\\ipykernel_54788\\399707436.py:70: FutureWarning: \n",
|
|||
|
"\n",
|
|||
|
"Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.\n",
|
|||
|
"\n",
|
|||
|
" sns.barplot(x='Importance', y='Feature', data=importance_df_top, palette='viridis')\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
"image/png": "<base64-encoded PNG omitted; bar chart titled 'Top 20 Feature Importance'>",
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1000x800 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Train RMSE: 0.4358844786848851\n",
|
|||
|
"Train R²: 0.994185027626814\n",
|
|||
|
"Train MAE: 0.12184558416960284\n",
"Correlation: 0.96\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"c:\\Users\\tumvu\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\sklearn\\metrics\\_regression.py:492: FutureWarning: 'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.\n",
|
|||
|
" warnings.warn(\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
"image/png": "<base64-encoded PNG omitted; scatter plot titled 'Actual vs Predicted city_latitude'>",
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1000x600 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"from sklearn.ensemble import RandomForestRegressor\n",
|
|||
|
"from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error\n",
|
|||
|
"from sklearn.model_selection import cross_val_score\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"import seaborn as sns\n",
|
|||
|
"import time\n",
|
|||
|
"import numpy as np\n",
|
|||
|
"\n",
|
|||
|
"# Загрузка данных\n",
|
|||
|
"df = pd.read_csv(\"../../datasets/nuforc_reports.csv\").head(2000)\n",
|
|||
|
"\n",
|
|||
|
"# Создание нового признака 'relative_city_latitude'\n",
|
|||
|
"mean_city_latitude_by_state = df.groupby('state')['city_latitude'].transform('mean')\n",
|
|||
|
"df['relative_city_latitude'] = df['city_latitude'] / mean_city_latitude_by_state\n",
|
|||
|
"\n",
|
|||
|
"# Предобработка данных\n",
|
|||
|
"# Преобразуем категориальные переменные в числовые\n",
|
|||
|
"df = pd.get_dummies(df, drop_first=True)\n",
|
|||
|
"\n",
|
|||
|
"# Разделение данных на признаки и целевую переменную\n",
|
|||
|
"X = df.drop('city_latitude', axis=1).dropna()\n",
|
|||
|
"y = df['city_latitude'].dropna()\n",
|
|||
|
"\n",
|
|||
|
"# Разделение данных на обучающую и тестовую выборки\n",
|
|||
|
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Выбор модели\n",
|
|||
|
"model = RandomForestRegressor(random_state=42)\n",
|
|||
|
"\n",
|
|||
|
"# Измерение времени обучения и предсказания\n",
|
|||
|
"start_time = time.time()\n",
|
|||
|
"\n",
|
|||
|
"# Обучение модели\n",
|
|||
|
"model.fit(X_train, y_train)\n",
|
|||
|
"\n",
|
|||
|
"# Предсказание и оценка\n",
|
|||
|
"y_pred = model.predict(X_test)\n",
|
|||
|
"\n",
|
|||
|
"end_time = time.time()\n",
|
|||
|
"training_time = end_time - start_time\n",
|
|||
|
"\n",
|
|||
|
"rmse = mean_squared_error(y_test, y_pred, squared=False)\n",
|
|||
|
"r2 = r2_score(y_test, y_pred)\n",
|
|||
|
"mae = mean_absolute_error(y_test, y_pred)\n",
|
|||
|
"\n",
|
|||
|
"print(f\"RMSE: {rmse}\")\n",
|
|||
|
"print(f\"R²: {r2}\")\n",
|
|||
|
"print(f\"MAE: {mae}\")\n",
|
|||
|
"print(f\"Training Time: {training_time} seconds\")\n",
"\n",
"# Cross-validation (scores are negated MSE values, one per fold)\n",
"scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')\n",
"rmse_cv = (-scores.mean())**0.5\n",
"print(f\"Cross-validated RMSE: {rmse_cv}\")\n",
"\n",
"# Feature importance analysis\n",
"feature_importances = model.feature_importances_\n",
"feature_names = X_train.columns\n",
"\n",
"importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})\n",
"importance_df = importance_df.sort_values(by='Importance', ascending=False)\n",
"\n",
"# Show only the top 20 features\n",
"top_n = 20\n",
"importance_df_top = importance_df.head(top_n)\n",
"\n",
"plt.figure(figsize=(10, 8))\n",
"sns.barplot(x='Importance', y='Feature', data=importance_df_top, palette='viridis')\n",
"plt.title(f'Top {top_n} Feature Importance')\n",
"plt.xlabel('Importance')\n",
"plt.ylabel('Feature')\n",
"plt.show()\n",
"\n",
"# Overfitting check: compute the same metrics on the training set\n",
"y_train_pred = model.predict(X_train)\n",
"\n",
"# The squared=False argument is deprecated since scikit-learn 1.4,\n",
"# so the square root of the MSE is taken explicitly.\n",
"rmse_train = mean_squared_error(y_train, y_train_pred) ** 0.5\n",
"r2_train = r2_score(y_train, y_train_pred)\n",
"mae_train = mean_absolute_error(y_train, y_train_pred)\n",
"\n",
"print(f\"Train RMSE: {rmse_train}\")\n",
"print(f\"Train R²: {r2_train}\")\n",
"print(f\"Train MAE: {mae_train}\")\n",
"\n",
"# Pearson correlation between actual and predicted values on the test set\n",
"correlation = np.corrcoef(y_test, y_pred)[0, 1]\n",
"print(f\"Correlation: {correlation:.2f}\")\n",
"\n",
"# Visualize the results\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(y_test, y_pred, alpha=0.5)\n",
"plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)\n",
"plt.xlabel('Actual city_latitude')\n",
"plt.ylabel('Predicted city_latitude')\n",
"plt.title('Actual vs Predicted city_latitude')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Conclusions\n",
"\n",
"**Training time:**\n",
"\n",
"The timed block, which covers both fitting the model and predicting on the test set, took about 37 seconds. That is a moderate cost for a sample of at most 2,000 rows: the model is usable at this scale, but it is not especially fast.\n",
"\n",
"**Predictive ability:**\n",
"\n",
"MAE (Mean Absolute Error): 0.12184558416960284. This is the mean absolute difference between predicted and actual values, so predictions deviate from the true latitude by about 0.12 degrees on average, which is a small error.\n",
"\n",
"RMSE (Root Mean Squared Error): 0.4358844786848851. This is the square root of the mean squared error, expressed in the same units as the target. It is several times larger than the MAE, which suggests that a small number of predictions have much larger errors than the rest.\n",
"\n",
"R² (coefficient of determination): 0.994185027626814. This is a very high value: the model explains 99.4% of the variance of the target variable. It should be read with caution, however, because the feature relative_city_latitude is computed from city_latitude itself, so the model partly sees the target among its inputs.\n",
"\n",
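"The three metrics above can be reproduced by hand, which makes their definitions concrete. The following is a minimal sketch in plain Python; the function name `regression_metrics` and the latitude values are hypothetical and are not taken from the dataset or from scikit-learn.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"\n",
"def regression_metrics(y_true, y_pred):\n",
"    # Return (mae, rmse, r2) for two equal-length sequences of numbers.\n",
"    n = len(y_true)\n",
"    errors = [t - p for t, p in zip(y_true, y_pred)]\n",
"    ss_res = sum(e * e for e in errors)\n",
"    mean_true = sum(y_true) / n\n",
"    ss_tot = sum((t - mean_true) ** 2 for t in y_true)\n",
"    mae = sum(abs(e) for e in errors) / n\n",
"    rmse = math.sqrt(ss_res / n)\n",
"    r2 = 1 - ss_res / ss_tot\n",
"    return mae, rmse, r2\n",
"\n",
"\n",
"# Hypothetical latitudes in degrees: three exact predictions and one 2-degree miss.\n",
"y_true = [30.0, 35.0, 40.0, 45.0]\n",
"y_pred = [30.0, 35.0, 40.0, 43.0]\n",
"mae, rmse, r2 = regression_metrics(y_true, y_pred)\n",
"print(f'MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}')\n",
"```\n",
"\n",
"With a single large error among otherwise exact predictions, the RMSE (1.0) is twice the MAE (0.5), the same kind of gap that appears in the results above.\n",
"\n",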
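"**Correlation:**\n",
"\n",
"The correlation of 0.96 between predicted and actual values indicates a strong linear relationship between them. Correlation alone does not show that the predictions are accurate, since it ignores systematic offsets, so it should be read together with MAE and RMSE. Note also that a test R² of 0.994 implies a correlation of at least about 0.997 on the same predictions, so the 0.96 reported here likely comes from a different run and is worth rechecking.\n",
"\n",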
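"To illustrate why correlation is best read alongside the error metrics, the sketch below shifts every prediction by the same 5 degrees. It is plain Python with hypothetical helper names (`pearson_r`, `r2`) and made-up values: the correlation stays at 1 while R² drops to about 0.2.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"\n",
"def pearson_r(x, y):\n",
"    # Pearson correlation coefficient of two equal-length sequences.\n",
"    n = len(x)\n",
"    mean_x, mean_y = sum(x) / n, sum(y) / n\n",
"    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))\n",
"    var_x = sum((a - mean_x) ** 2 for a in x)\n",
"    var_y = sum((b - mean_y) ** 2 for b in y)\n",
"    return cov / math.sqrt(var_x * var_y)\n",
"\n",
"\n",
"def r2(y_true, y_pred):\n",
"    # Coefficient of determination: 1 - SS_res / SS_tot.\n",
"    mean_true = sum(y_true) / len(y_true)\n",
"    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))\n",
"    ss_tot = sum((t - mean_true) ** 2 for t in y_true)\n",
"    return 1 - ss_res / ss_tot\n",
"\n",
"\n",
"# Hypothetical latitudes: every prediction is off by the same 5 degrees.\n",
"y_true = [30.0, 35.0, 40.0, 45.0]\n",
"y_pred = [t + 5.0 for t in y_true]\n",
"r = pearson_r(y_true, y_pred)\n",
"score = r2(y_true, y_pred)\n",
"print(f'correlation={r:.3f} R2={score:.3f}')\n",
"```\n",
"\n",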
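"**Reliability (cross-validation):**\n",
"\n",
"Cross-validated RMSE: 1.535611497565107. This is considerably higher than the test RMSE of 0.4358844786848851, not lower, so the error varies noticeably between data splits. This result does not by itself rule out overfitting; the train metrics printed by the overfitting check should be compared with the test and cross-validated values before drawing that conclusion.\n",
"\n",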
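"The cross-validated RMSE in this notebook is the square root of the mean per-fold MSE, which hides how much individual folds differ. A minimal sketch in plain Python, using hypothetical per-fold scores in the negated-MSE form that `cross_val_score` returns for `scoring='neg_mean_squared_error'` (the numbers are made up, not the notebook's output):\n",
"\n",
"```python\n",
"import math\n",
"\n",
"# Hypothetical per-fold scores (negated MSE), one value per fold of a 5-fold split.\n",
"neg_mse_scores = [-1.0, -4.0, -9.0, -1.0, -1.0]\n",
"\n",
"# Aggregate value, computed the same way as rmse_cv in the notebook.\n",
"rmse_cv = math.sqrt(-sum(neg_mse_scores) / len(neg_mse_scores))\n",
"\n",
"# Per-fold RMSE shows how much the error changes from split to split.\n",
"fold_rmse = [math.sqrt(-s) for s in neg_mse_scores]\n",
"spread = max(fold_rmse) - min(fold_rmse)\n",
"print(f'rmse_cv={rmse_cv:.3f} fold_rmse={fold_rmse} spread={spread}')\n",
"```\n",
"\n",
"A large spread between folds suggests that the score depends strongly on how the data happens to be split.\n",
"\n",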
"The feature importance plot, obtained from the random forest model (RandomForestRegressor), shows which input variables contribute most to predicting the target variable (city_latitude)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}