lab2 probably done

frog24 2024-10-18 21:30:59 +04:00
parent 726f644b68
commit 7e459871e0
5 changed files with 23724 additions and 8 deletions


@ -2,7 +2,7 @@
"cells": [ "cells": [
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 3, "execution_count": 1,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -28,7 +28,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 4, "execution_count": 2,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -37,7 +37,7 @@
"<Axes: xlabel='smoker', ylabel='charges'>" "<Axes: xlabel='smoker', ylabel='charges'>"
] ]
}, },
"execution_count": 4, "execution_count": 2,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
}, },
@ -65,7 +65,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 9, "execution_count": 3,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -81,7 +81,7 @@
"<Axes: title={'center': 'charges'}, xlabel='children'>" "<Axes: title={'center': 'charges'}, xlabel='children'>"
] ]
}, },
"execution_count": 9, "execution_count": 3,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
}, },
@ -110,7 +110,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 6, "execution_count": 4,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -119,7 +119,7 @@
"<Axes: xlabel='age'>" "<Axes: xlabel='age'>"
] ]
}, },
"execution_count": 6, "execution_count": 4,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
}, },
@ -146,7 +146,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "aimenv",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },

File diff suppressed because it is too large

19238
lab_2/car_price_prediction.csv Normal file

File diff suppressed because it is too large

506
lab_2/lab2.ipynb Normal file

@ -0,0 +1,506 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1.\n",
"The following datasets were selected:\n",
"1) Car data (17)\n",
"2) Mobile device data (18)\n",
"3) Billionaire data (19)"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"cars_df = pd.read_csv(\"./car_price_prediction.csv\")\n",
"phones_df = pd.read_csv(\"./mobile phone price prediction.csv\")\n",
"rich_df = pd.read_csv(\"./Forbes Billionaires.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2.\n",
"Problem domains:\n",
"car_price_prediction.csv - car prices\n",
"mobile phone price prediction.csv - mobile phone prices\n",
"Forbes Billionaires.csv - data on billionaires"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3.\n",
"Objects of observation:\n",
"car_price_prediction.csv: cars;\n",
"mobile phone price prediction.csv: phones;\n",
"Forbes Billionaires.csv: billionaires;\n",
"\n",
"Attributes:"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['ID', 'Price', 'Levy', 'Manufacturer', 'Model', 'Prod. year',\n",
" 'Category', 'Leather interior', 'Fuel type', 'Engine volume', 'Mileage',\n",
" 'Cylinders', 'Gear box type', 'Drive wheels', 'Doors', 'Wheel', 'Color',\n",
" 'Airbags'],\n",
" dtype='object')\n",
"Index(['Unnamed: 0', 'Name', 'Rating', 'Spec_score', 'No_of_sim', 'Ram',\n",
" 'Battery', 'Display', 'Camera', 'External_Memory', 'Android_version',\n",
" 'Price', 'company', 'Inbuilt_memory', 'fast_charging',\n",
" 'Screen_resolution', 'Processor', 'Processor_name'],\n",
" dtype='object')\n",
"Index(['Rank ', 'Name', 'Networth', 'Age', 'Country', 'Source', 'Industry'], dtype='object')\n"
]
}
],
"source": [
"print(cars_df.columns)\n",
"print(phones_df.columns)\n",
"print(rich_df.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I do not see any relationships between the objects.\n",
"\n",
"4.\n",
"For car_price_prediction.csv and mobile phone price prediction.csv, the business goal is to set a price that matches the current market and the attributes of the object.\n",
"Forbes Billionaires.csv - identifying the most profitable types of business and proven ways of building capital\n",
"\n",
"5. \n",
"Price setting: the input is the product characteristics; the target feature is the price\n",
"Identifying...: the input is the type of business, country, and sources of income; the target feature is the Forbes rank\n",
"\n",
"6, 7. \n",
"Dataset problems\n",
"Noise:\n"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"19237\n",
"1370\n",
"2600\n"
]
}
],
"source": [
"print(cars_df.shape[0])\n",
"print(phones_df.shape[0])\n",
"print(rich_df.shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Так как набо дастаточно быльшие (более 1000 строк), то зашкмленность не будет иметь сильного влияние на качество, шумы усреднятся\n",
"\n",
"Смещение данных, актуальность и просачивание данных проверить представляетяс невозможным, так как был взят готовый сет данных"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"before: 19237\n",
"ID 45576535.886104904 936591.4227992407\n",
"Price 18581.7495915248 191880.3101852926\n",
"Prod. year 2010.9471797575118 5.560374489543753\n",
"Cylinders 4.579460150135761 1.1950443346898312\n",
"Airbags 6.620695178600032 4.30661786316207\n",
"after: 18729\n",
"\n",
"------------------\n",
"\n",
"before: 1370\n",
"Unnamed: 0 684.5 395.6292456328273\n",
"Rating 4.374416058394161 0.2301756924899598\n",
"Spec_score 80.23430656934306 8.37392155180379\n",
"after: 1359\n",
"\n",
"------------------\n",
"\n",
"before: 2600\n",
"Rank 1269.5707692307692 728.1463636959434\n",
"Networth 4.8607499999999995 10.659670683623453\n",
"Age 64.25370226032736 13.195277077997176\n",
"after: 2565\n"
]
}
],
"source": [
"# Drop rows that lie more than 3 standard deviations from a column mean (3-sigma rule).\n",
"print(\"before:\", cars_df.shape[0])\n",
"for column in cars_df.select_dtypes(include=['int', 'float']).columns:\n",
"    mean = cars_df[column].mean()\n",
"    std_dev = cars_df[column].std()\n",
"    print(column, mean, std_dev)\n",
"\n",
"    lower_bound = mean - 3 * std_dev\n",
"    upper_bound = mean + 3 * std_dev\n",
"\n",
"    cars_df = cars_df[(cars_df[column] <= upper_bound) & (cars_df[column] >= lower_bound)]\n",
"\n",
"print(\"after:\", cars_df.shape[0])\n",
"\n",
"print(\"\\n------------------\\n\")\n",
"\n",
"print(\"before:\", phones_df.shape[0])\n",
"for column in phones_df.select_dtypes(include=['int', 'float']).columns:\n",
"    mean = phones_df[column].mean()\n",
"    std_dev = phones_df[column].std()\n",
"    print(column, mean, std_dev)\n",
"\n",
"    lower_bound = mean - 3 * std_dev\n",
"    upper_bound = mean + 3 * std_dev\n",
"\n",
"    phones_df = phones_df[(phones_df[column] <= upper_bound) & (phones_df[column] >= lower_bound)]\n",
"\n",
"print(\"after:\", phones_df.shape[0])\n",
"\n",
"print(\"\\n------------------\\n\")\n",
"\n",
"print(\"before:\", rich_df.shape[0])\n",
"for column in rich_df.select_dtypes(include=['int', 'float']).columns:\n",
"    mean = rich_df[column].mean()\n",
"    std_dev = rich_df[column].std()\n",
"    print(column, mean, std_dev)\n",
"\n",
"    lower_bound = mean - 3 * std_dev\n",
"    upper_bound = mean + 3 * std_dev\n",
"\n",
"    rich_df = rich_df[(rich_df[column] <= upper_bound) & (rich_df[column] >= lower_bound)]\n",
"\n",
"print(\"after:\", rich_df.shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Выше были устранены выбросы, которые могли повлиять на качество данных. При этом выбока осталась достаточного размера для работы с ней"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"8.\n",
"На мой взгляд наборы довольно информативные учитывая кол-во строк и атрибутов. \n",
"Степень покрытия, соответсвие реальным данным и согласованность меток проверить не представляется возможным (но я верю составителям сетов)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"9.\n",
"Проверка на пропущенные значения:"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Series([], dtype: int64)\n",
"--------------\n",
"Android_version 442\n",
"Inbuilt_memory 19\n",
"fast_charging 82\n",
"Screen_resolution 2\n",
"Processor 28\n",
"dtype: int64\n",
"--------------\n",
"Series([], dtype: int64)\n"
]
}
],
"source": [
"print(cars_df.isnull().sum().loc[lambda x: x>0])\n",
"print(\"--------------\")\n",
"print(phones_df.isnull().sum().loc[lambda x: x>0])\n",
"print(\"--------------\")\n",
"print(rich_df.isnull().sum().loc[lambda x: x>0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"в датасете с телефонами нашлись пустые значения. Жаль, но они все не числовые, поэтому просто заменим на моду"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Series([], dtype: int64)\n"
]
}
],
"source": [
"columns = [\"Android_version\", \"Inbuilt_memory\", \"fast_charging\", \"Screen_resolution\", \"Processor\"]\n",
"for column in columns:\n",
"    # Fill gaps with the most frequent value; assign back instead of a chained inplace fillna.\n",
"    mode = phones_df[column].mode()[0]\n",
"    phones_df[column] = phones_df[column].fillna(mode)\n",
"\n",
"print(phones_df.isnull().sum().loc[lambda x: x>0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Не знаю насколько это правильно и как отразиться на качестве данных, но удалять 400+ строк их 1300 явно было бы хуже\n",
"\n",
"10. \n",
"Разбиение данных на выборки"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"18729 13110 2810 2809\n",
"18729\n",
"\n",
" 1359 951 204 204\n",
"1359\n",
"\n",
" 2565 1795 385 385\n",
"2565\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"cars_train_df, cars_temp_df = train_test_split(cars_df, test_size=0.3, random_state=52)\n",
"cars_val_df, cars_test_df = train_test_split(cars_temp_df, test_size=0.5, random_state=52)\n",
"\n",
"phones_train_df, phones_temp_df = train_test_split(phones_df, test_size=0.3, random_state=52)\n",
"phones_val_df, phones_test_df = train_test_split(phones_temp_df, test_size=0.5, random_state=52)\n",
"\n",
"rich_train_df, rich_temp_df = train_test_split(rich_df, test_size=0.3, random_state=52)\n",
"rich_val_df, rich_test_df = train_test_split(rich_temp_df, test_size=0.5, random_state=52)\n",
"\n",
"print(cars_df.shape[0], cars_train_df.shape[0], cars_test_df.shape[0], cars_val_df.shape[0])\n",
"print(cars_val_df.shape[0] + cars_test_df.shape[0] + cars_train_df.shape[0])\n",
"print('\\n', phones_df.shape[0], phones_train_df.shape[0], phones_test_df.shape[0], phones_val_df.shape[0])\n",
"print(phones_val_df.shape[0] + phones_test_df.shape[0] + phones_train_df.shape[0])\n",
"print('\\n', rich_df.shape[0], rich_train_df.shape[0], rich_test_df.shape[0], rich_val_df.shape[0])\n",
"print(rich_val_df.shape[0] + rich_test_df.shape[0] + rich_train_df.shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Данные были разбиты на обучающую, тестовую и контрольную выборки в отношении 70%-15%-15%\n",
"\n",
"11. Взял проценты из лекции, наверное это сбалансированно"
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"oversampling:\n",
"old_type\n",
"Old 13196\n",
"Normal 13196\n",
"New 13196\n",
"Name: count, dtype: int64\n",
"undersampling:\n",
"old_type\n",
"Old 2285\n",
"Normal 2285\n",
"New 2285\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"from imblearn.over_sampling import RandomOverSampler\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"cars_df['old_type'] = pd.cut(cars_df['Prod. year'], bins=[1900, 2004, 2015, 2025], \n",
" labels=['Old', 'Normal', 'New'])\n",
"\n",
"y = cars_df['old_type']\n",
"x = cars_df.drop(columns=['Prod. year', 'old_type'])\n",
"\n",
"oversampler = RandomOverSampler(random_state=52)\n",
"x_resampled, y_resampled = oversampler.fit_resample(x, y)\n",
"\n",
"undersampler = RandomUnderSampler(random_state=52)\n",
"x_resampled_under, y_resampled_under = undersampler.fit_resample(x, y)\n",
"\n",
"print(\"oversampling:\")\n",
"print(pd.Series(y_resampled).value_counts())\n",
"\n",
"print(\"undersampling:\")\n",
"print(pd.Series(y_resampled_under).value_counts())"
]
},
{
"cell_type": "code",
"execution_count": 136,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"oversampling:\n",
"rating_type\n",
"bad 838\n",
"normal 838\n",
"good 838\n",
"Name: count, dtype: int64\n",
"undersampling:\n",
"rating_type\n",
"bad 93\n",
"normal 93\n",
"good 93\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"phones_df['rating_type'] = pd.cut(phones_df['Rating'], bins=[0, 4.0, 4.5, 5.0], \n",
" labels=[\"bad\", \"normal\", \"good\"])\n",
"\n",
"y = phones_df['rating_type']\n",
"x = phones_df.drop(columns=['Rating', 'rating_type'])\n",
"\n",
"oversampler = RandomOverSampler(random_state=42)\n",
"x_resampled, y_resampled = oversampler.fit_resample(x, y)\n",
"\n",
"undersampler = RandomUnderSampler(random_state=42)\n",
"x_resampled_under, y_resampled_under = undersampler.fit_resample(x, y)\n",
"\n",
"print(\"oversampling:\")\n",
"print(pd.Series(y_resampled).value_counts())\n",
"\n",
"print(\"undersampling:\")\n",
"print(pd.Series(y_resampled_under).value_counts())"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"oversampling:\n",
"age_type\n",
"grown 1535\n",
"old 1535\n",
"young 0\n",
"Name: count, dtype: int64\n",
"undersampling:\n",
"age_type\n",
"grown 1030\n",
"old 1030\n",
"young 0\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"rich_df['age_type'] = pd.cut(rich_df['Age'], bins=[0, 20, 60, 100], \n",
" labels=[\"young\", \"grown\", \"old\"])\n",
"\n",
"y = rich_df['age_type']\n",
"x = rich_df.drop(columns=['Age', 'age_type'])\n",
"\n",
"oversampler = RandomOverSampler(random_state=42)\n",
"x_resampled, y_resampled = oversampler.fit_resample(x, y)\n",
"\n",
"undersampler = RandomUnderSampler(random_state=42)\n",
"x_resampled_under, y_resampled_under = undersampler.fit_resample(x, y)\n",
"\n",
"print(\"oversampling:\")\n",
"print(pd.Series(y_resampled).value_counts())\n",
"\n",
"print(\"undersampling:\")\n",
"print(pd.Series(y_resampled_under).value_counts())"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "aimenv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
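
Note on the outlier step in lab2.ipynb: the same 3-sigma loop is written out three times, once per dataset. A minimal sketch of how it could be factored into one helper is shown below. The function name `drop_three_sigma_outliers`, the parameter `k`, and the demo frame are assumptions for illustration and are not part of this commit. The sketch keeps the notebook's behavior of filtering column by column, so each column's mean and standard deviation are computed on the rows that survived the previous columns.

```python
import pandas as pd


def drop_three_sigma_outliers(df: pd.DataFrame, k: float = 3.0) -> pd.DataFrame:
    """Return a copy of df without rows farther than k standard deviations from a column mean.

    Hypothetical helper: columns are processed one after another, so the
    statistics of later columns are computed on the already filtered rows.
    """
    result = df
    for column in df.select_dtypes(include="number").columns:
        mean = result[column].mean()
        std_dev = result[column].std()
        lower_bound = mean - k * std_dev
        upper_bound = mean + k * std_dev
        # between() is inclusive on both ends, matching the <= / >= checks in the notebook
        result = result[result[column].between(lower_bound, upper_bound)]
    return result.copy()


# Made-up demo data: twenty ordinary prices and one extreme value.
demo = pd.DataFrame({"price": [10.0] * 20 + [1000.0], "label": ["a"] * 21})
cleaned = drop_three_sigma_outliers(demo)
print(len(demo), len(cleaned))  # prints: 21 20
```

With such a helper, each dataset would need a single call, for example `cars_df = drop_three_sigma_outliers(cars_df)`, which would also rule out mixing up the frame names between the three copies of the loop.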

File diff suppressed because it is too large