lab2 probably done

2024-10-18 21:30:59 +04:00 · 2024-10-18 21:30:59 +04:00 · 7e459871e0
commit 7e459871e0
parent 726f644b68
5 changed files with 23724 additions and 8 deletions
--- a/lab_1/lab1.ipynb
+++ b/lab_1/lab1.ipynb
@@ -2,7 +2,7 @@
 "cells": [
  {
   "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
@@ -28,7 +28,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
@@ -37,7 +37,7 @@
       "<Axes: xlabel='smoker', ylabel='charges'>"
      ]
     },
-     "execution_count": 4,
+     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    },
@@ -65,7 +65,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
@@ -81,7 +81,7 @@
       "<Axes: title={'center': 'charges'}, xlabel='children'>"
      ]
     },
-     "execution_count": 9,
+     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    },
@@ -110,7 +110,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
@@ -119,7 +119,7 @@
       "<Axes: xlabel='age'>"
      ]
     },
-     "execution_count": 6,
+     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    },
@@ -146,7 +146,7 @@
 ],
 "metadata": {
  "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "aimenv",
   "language": "python",
   "name": "python3"
  },
--- a/Billionaires.csv
+++ b/Billionaires.csv
--- a/lab_2/car_price_prediction.csv
+++ b/lab_2/car_price_prediction.csv
--- a/lab_2/lab2.ipynb
+++ b/lab_2/lab2.ipynb
@@ -0,0 +1,506 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "1.\n",
+    "The following datasets were chosen:\n",
+    "1) Car data (17)\n",
+    "2) Mobile device data (18)\n",
+    "3) Billionaire data (19)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 85,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "cars_df = pd.read_csv(\"./car_price_prediction.csv\")\n",
+    "phones_df = pd.read_csv(\"./mobile phone price prediction.csv\")\n",
+    "rich_df = pd.read_csv(\"./Forbes Billionaires.csv\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "2.\n",
+    "Problem domains:\n",
+    "car_price_prediction.csv - car prices\n",
+    "mobile phone price prediction.csv - mobile phone prices\n",
+    "Forbes Billionaires.csv - data on billionaires"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "3.\n",
+    "Objects of observation\n",
+    "car_price_prediction.csv: cars;\n",
+    "mobile phone price prediction.csv: phones;\n",
+    "Forbes Billionaires.csv: billionaires;\n",
+    "\n",
+    "Attributes:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 86,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Index(['ID', 'Price', 'Levy', 'Manufacturer', 'Model', 'Prod. year',\n",
+      "       'Category', 'Leather interior', 'Fuel type', 'Engine volume', 'Mileage',\n",
+      "       'Cylinders', 'Gear box type', 'Drive wheels', 'Doors', 'Wheel', 'Color',\n",
+      "       'Airbags'],\n",
+      "      dtype='object')\n",
+      "Index(['Unnamed: 0', 'Name', 'Rating', 'Spec_score', 'No_of_sim', 'Ram',\n",
+      "       'Battery', 'Display', 'Camera', 'External_Memory', 'Android_version',\n",
+      "       'Price', 'company', 'Inbuilt_memory', 'fast_charging',\n",
+      "       'Screen_resolution', 'Processor', 'Processor_name'],\n",
+      "      dtype='object')\n",
+      "Index(['Rank ', 'Name', 'Networth', 'Age', 'Country', 'Source', 'Industry'], dtype='object')\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(cars_df.columns)\n",
+    "print(phones_df.columns)\n",
+    "print(rich_df.columns)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "I do not see any relationships between the objects.\n",
+    "\n",
+    "4.\n",
+    "For car_price_prediction.csv and mobile phone price prediction.csv, the business goal is to set a price that matches the existing market and the attributes of the object.\n",
+    "Forbes Billionaires.csv - identifying the most profitable types of business and proven ways of building capital\n",
+    "\n",
+    "5. \n",
+    "Price setting: the input is the product characteristics; the target feature is the price\n",
+    "Identifying...: the input is the type of business, country, and sources of income; the target feature is the Forbes rank\n",
+    "\n",
+    "6, 7. \n",
+    "Dataset problems\n",
+    "Noise:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 87,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "19237\n",
+      "1370\n",
+      "2600\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(cars_df.shape[0])\n",
+    "print(phones_df.shape[0])\n",
+    "print(rich_df.shape[0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Since the datasets are fairly large (more than 1000 rows each), noise should not strongly affect quality; the noise averages out\n",
+    "\n",
+    "Data bias, timeliness, and data leakage cannot be checked, because ready-made datasets were used"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 88,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "before  19237\n",
+      "ID 45576535.886104904 936591.4227992407\n",
+      "Price 18581.7495915248 191880.3101852926\n",
+      "Prod. year 2010.9471797575118 5.560374489543753\n",
+      "Cylinders 4.579460150135761 1.1950443346898312\n",
+      "Airbags 6.620695178600032 4.30661786316207\n",
+      "after  18729\n",
+      "\n",
+      "------------------\n",
+      "\n",
+      "before  1370\n",
+      "Unnamed: 0 684.5 395.6292456328273\n",
+      "Rating 4.374416058394161 0.2301756924899598\n",
+      "Spec_score 80.23430656934306 8.37392155180379\n",
+      "after  1359\n",
+      "\n",
+      "------------------\n",
+      "\n",
+      "before  2600\n",
+      "Rank  1269.5707692307692 728.1463636959434\n",
+      "Networth 4.8607499999999995 10.659670683623453\n",
+      "Age 64.25370226032736 13.195277077997176\n",
+      "after  2565\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Remove outliers: keep only rows within 3 standard deviations of each numeric column's mean\n",
+    "print(\"before \", cars_df.shape[0])\n",
+    "for column in cars_df.select_dtypes(include=['int', 'float']).columns:\n",
+    "    mean = cars_df[column].mean()\n",
+    "    std_dev = cars_df[column].std()\n",
+    "    print(column, mean, std_dev)\n",
+    "    \n",
+    "    lower_bound = mean - 3 * std_dev\n",
+    "    upper_bound = mean + 3 * std_dev\n",
+    "    \n",
+    "    cars_df = cars_df[(cars_df[column] <= upper_bound) & (cars_df[column] >= lower_bound)]\n",
+    "    \n",
+    "print(\"after \", cars_df.shape[0])\n",
+    "\n",
+    "print(\"\\n------------------\\n\")\n",
+    "\n",
+    "print(\"before \", phones_df.shape[0])\n",
+    "for column in phones_df.select_dtypes(include=['int', 'float']).columns:\n",
+    "    mean = phones_df[column].mean()\n",
+    "    std_dev = phones_df[column].std()\n",
+    "    print(column, mean, std_dev)\n",
+    "    \n",
+    "    lower_bound = mean - 3 * std_dev\n",
+    "    upper_bound = mean + 3 * std_dev\n",
+    "    \n",
+    "    phones_df = phones_df[(phones_df[column] <= upper_bound) & (phones_df[column] >= lower_bound)]\n",
+    "    \n",
+    "print(\"after \", phones_df.shape[0])\n",
+    "\n",
+    "print(\"\\n------------------\\n\")\n",
+    "\n",
+    "print(\"before \", rich_df.shape[0])\n",
+    "for column in rich_df.select_dtypes(include=['int', 'float']).columns:\n",
+    "    mean = rich_df[column].mean()\n",
+    "    std_dev = rich_df[column].std()\n",
+    "    print(column, mean, std_dev)\n",
+    "    \n",
+    "    lower_bound = mean - 3 * std_dev\n",
+    "    upper_bound = mean + 3 * std_dev\n",
+    "    \n",
+    "    rich_df = rich_df[(rich_df[column] <= upper_bound) & (rich_df[column] >= lower_bound)]\n",
+    "    \n",
+    "print(\"after \", rich_df.shape[0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The outliers that could have affected data quality (values more than three standard deviations from the column mean) were removed above. The samples are still large enough to work with"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "8.\n",
+    "In my view, the datasets are fairly informative given the number of rows and attributes. \n",
+    "Coverage, correspondence to real-world data, and label consistency cannot be verified (but I trust the dataset authors)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "9.\n",
+    "Check for missing values:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 89,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Series([], dtype: int64)\n",
+      "--------------\n",
+      "Android_version      442\n",
+      "Inbuilt_memory        19\n",
+      "fast_charging         82\n",
+      "Screen_resolution      2\n",
+      "Processor             28\n",
+      "dtype: int64\n",
+      "--------------\n",
+      "Series([], dtype: int64)\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(cars_df.isnull().sum().loc[lambda x: x>0])\n",
+    "print(\"--------------\")\n",
+    "print(phones_df.isnull().sum().loc[lambda x: x>0])\n",
+    "print(\"--------------\")\n",
+    "print(rich_df.isnull().sum().loc[lambda x: x>0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The phones dataset contains missing values. Unfortunately, they are all non-numeric, so we simply replace them with the mode"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 107,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Series([], dtype: int64)\n"
+     ]
+    }
+   ],
+   "source": [
+    "columns = [\"Android_version\", \"Inbuilt_memory\", \"fast_charging\", \"Screen_resolution\", \"Processor\"]\n",
+    "for column in columns:\n",
+    "    mode = phones_df[column].mode()[0]\n",
+    "    phones_df[column] = phones_df[column].fillna(mode)  # assign back instead of chained inplace fillna\n",
+    "    \n",
+    "print(phones_df.isnull().sum().loc[lambda x: x>0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "I am not sure how correct this is or how it will affect data quality, but dropping 400+ rows out of roughly 1300 would clearly be worse\n",
+    "\n",
+    "10. \n",
+    "Splitting the data into samples"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 113,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "18729 13110 2810 2809\n",
+      "18729\n",
+      "\n",
+      " 1359 951 204 204\n",
+      "1359\n",
+      "\n",
+      " 2565 1795 385 385\n",
+      "2565\n"
+     ]
+    }
+   ],
+   "source": [
+    "from sklearn.model_selection import train_test_split\n",
+    "\n",
+    "cars_train_df, cars_temp_df = train_test_split(cars_df, test_size=0.3, random_state=52)\n",
+    "cars_val_df, cars_test_df = train_test_split(cars_temp_df, test_size=0.5, random_state=52)\n",
+    "\n",
+    "phones_train_df, phones_temp_df = train_test_split(phones_df, test_size=0.3, random_state=52)\n",
+    "phones_val_df, phones_test_df = train_test_split(phones_temp_df, test_size=0.5, random_state=52)\n",
+    "\n",
+    "rich_train_df, rich_temp_df = train_test_split(rich_df, test_size=0.3, random_state=52)\n",
+    "rich_val_df, rich_test_df = train_test_split(rich_temp_df, test_size=0.5, random_state=52)\n",
+    "\n",
+    "print(cars_df.shape[0], cars_train_df.shape[0], cars_test_df.shape[0], cars_val_df.shape[0])\n",
+    "print(cars_val_df.shape[0] + cars_test_df.shape[0] + cars_train_df.shape[0])\n",
+    "print('\\n', phones_df.shape[0], phones_train_df.shape[0], phones_test_df.shape[0], phones_val_df.shape[0])\n",
+    "print(phones_val_df.shape[0] + phones_test_df.shape[0] + phones_train_df.shape[0])\n",
+    "print('\\n', rich_df.shape[0], rich_train_df.shape[0], rich_test_df.shape[0], rich_val_df.shape[0])\n",
+    "print(rich_val_df.shape[0] + rich_test_df.shape[0] + rich_train_df.shape[0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The data were split into training, test, and validation sets in a 70%-15%-15% ratio\n",
+    "\n",
+    "11. I took the percentages from the lecture; this is probably balanced"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 135,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "oversampling:\n",
+      "old_type\n",
+      "Old       13196\n",
+      "Normal    13196\n",
+      "New       13196\n",
+      "Name: count, dtype: int64\n",
+      "undersampling:\n",
+      "old_type\n",
+      "Old       2285\n",
+      "Normal    2285\n",
+      "New       2285\n",
+      "Name: count, dtype: int64\n"
+     ]
+    }
+   ],
+   "source": [
+    "from imblearn.over_sampling import RandomOverSampler\n",
+    "from imblearn.under_sampling import RandomUnderSampler\n",
+    "cars_df['old_type'] = pd.cut(cars_df['Prod. year'], bins=[1900, 2004, 2015, 2025], \n",
+    "                             labels=['Old', 'Normal', 'New'])\n",
+    "\n",
+    "y = cars_df['old_type']\n",
+    "x = cars_df.drop(columns=['Prod. year', 'old_type'])\n",
+    "\n",
+    "oversampler = RandomOverSampler(random_state=52)\n",
+    "x_resampled, y_resampled = oversampler.fit_resample(x, y)\n",
+    "\n",
+    "undersampler = RandomUnderSampler(random_state=52)\n",
+    "x_resampled_under, y_resampled_under = undersampler.fit_resample(x, y)\n",
+    "\n",
+    "print(\"oversampling:\")\n",
+    "print(pd.Series(y_resampled).value_counts())\n",
+    "\n",
+    "print(\"undersampling:\")\n",
+    "print(pd.Series(y_resampled_under).value_counts())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 136,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "oversampling:\n",
+      "rating_type\n",
+      "bad       838\n",
+      "normal    838\n",
+      "good      838\n",
+      "Name: count, dtype: int64\n",
+      "undersampling:\n",
+      "rating_type\n",
+      "bad       93\n",
+      "normal    93\n",
+      "good      93\n",
+      "Name: count, dtype: int64\n"
+     ]
+    }
+   ],
+   "source": [
+    "phones_df['rating_type'] = pd.cut(phones_df['Rating'], bins=[0, 4.0, 4.5, 5.0], \n",
+    "                             labels=[\"bad\", \"normal\", \"good\"])\n",
+    "\n",
+    "y = phones_df['rating_type']\n",
+    "x = phones_df.drop(columns=['Rating', 'rating_type'])\n",
+    "\n",
+    "oversampler = RandomOverSampler(random_state=42)\n",
+    "x_resampled, y_resampled = oversampler.fit_resample(x, y)\n",
+    "\n",
+    "undersampler = RandomUnderSampler(random_state=42)\n",
+    "x_resampled_under, y_resampled_under = undersampler.fit_resample(x, y)\n",
+    "\n",
+    "print(\"oversampling:\")\n",
+    "print(pd.Series(y_resampled).value_counts())\n",
+    "\n",
+    "print(\"undersampling:\")\n",
+    "print(pd.Series(y_resampled_under).value_counts())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 139,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "oversampling:\n",
+      "age_type\n",
+      "grown    1535\n",
+      "old      1535\n",
+      "young       0\n",
+      "Name: count, dtype: int64\n",
+      "undersampling:\n",
+      "age_type\n",
+      "grown    1030\n",
+      "old      1030\n",
+      "young       0\n",
+      "Name: count, dtype: int64\n"
+     ]
+    }
+   ],
+   "source": [
+    "rich_df['age_type'] = pd.cut(rich_df['Age'], bins=[0, 20, 60, 100], \n",
+    "                             labels=[\"young\", \"grown\", \"old\"])\n",
+    "\n",
+    "y = rich_df['age_type']\n",
+    "x = rich_df.drop(columns=['Age', 'age_type'])\n",
+    "\n",
+    "oversampler = RandomOverSampler(random_state=42)\n",
+    "x_resampled, y_resampled = oversampler.fit_resample(x, y)\n",
+    "\n",
+    "undersampler = RandomUnderSampler(random_state=42)\n",
+    "x_resampled_under, y_resampled_under = undersampler.fit_resample(x, y)\n",
+    "\n",
+    "print(\"oversampling:\")\n",
+    "print(pd.Series(y_resampled).value_counts())\n",
+    "\n",
+    "print(\"undersampling:\")\n",
+    "print(pd.Series(y_resampled_under).value_counts())"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "aimenv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
--- a/prediction.csv
+++ b/prediction.csv