Compare commits

...

11 Commits
Lab_1 ... main

Author SHA1 Message Date
ecb1bf2d58 Merge pull request 'Lab 5' (#5) from Lab_5 into main
Reviewed-on: #5
2024-12-07 08:44:35 +04:00
6374426dba Merge pull request 'Fourth one ready' (#4) from Lab_4 into main
Reviewed-on: #4
2024-12-07 08:44:24 +04:00
Marselchi
515dbca7af Lab5Done 2024-12-06 22:08:48 +04:00
Marselchi
0872aa8c53 Fourth one ready 2024-12-06 21:20:39 +04:00
da73afc9a7 Merge pull request 'Lab_3' (#3) from Lab3 into main
Reviewed-on: #3
2024-11-30 09:41:03 +04:00
Marselchi
f0ed51cd94 now it's definitely done 2024-11-30 09:21:24 +04:00
Marselchi
636335b483 Lab3 2024-11-29 21:18:24 +04:00
d4c5d73d24 Merge pull request 'Lab_2' (#2) from Lab_2 into main
Reviewed-on: #2
2024-10-25 15:19:18 +04:00
9ba3b97db8 Merge pull request 'Lab_1' (#1) from Lab_1 into main
Reviewed-on: #1
2024-10-25 15:17:52 +04:00
Marselchi
61c81894ac Merge branch 'main' of https://git.is.ulstu.ru/Marselchii/AIM-PIbd-31-Sagirov-M-M into Lab_2 2024-10-19 20:34:26 +04:00
Marselchi
f4c2250241 Lab_2 fin 2024-10-19 20:31:08 +04:00
5 changed files with 2255 additions and 0 deletions

636
Lab_2/lab2.ipynb Normal file

@@ -0,0 +1,636 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lab2 Pibd-31 Sagirov M M"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Загрузка трёх датасетов"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"df = pd.read_csv(\"..//datasets//Lab_2//world-population-by-country-2020.csv\", sep=\",\")\n",
"df2 = pd.read_csv(\"..//datasets//Lab_2//Starbucks Dataset.csv\", sep=\",\")\n",
"df3 = pd.read_csv(\"..//datasets//Lab_2//students_adaptability_level_online_education.csv\", sep=\",\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Пункты 2-8 представленны далее в виде Markdown:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Пункт 2 - Проблемные области:\n",
"df - Население планеты\n",
"\n",
"df2 - Акции Starbucks\n",
"\n",
"df3 - Эффективность онлайн обучения"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Пункт 3 - Объекты, их аттрибуты и связи:\n",
"df - Объектом наблюдения является Страна. Имеет в себе аттрибуты такие как: место в списке, название, население на 2020 год, изменение за год, процентное изменение, плотность населения, а так же процент мигрантов и рождаемость. \n",
"\n",
"df2 - Объектом наблюдения является Акция. Имеет в себе аттрибуты такие как: дата торговли, цена на открытие биржи, высшая и низшая цена за день, цена на закрытии и объем торговли.\n",
"\n",
"df3 - Объектом наблюдения является Студент. Имеет в себе аттрибуты такие как: Уровень образования, тип обучения (платно/бесплатно), пол, возраст, расположение, финансовое состояние и уровень адаптируемости к онлайн обучению.\n",
"\n",
"Связей между объектами нет."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Пункт 4 - Бизнес-цели:\n",
"df - Для составления списка приоритетных стран для показа рекламы для миграции.\n",
"\n",
"df2 - Для выявления тенденций Акций Starbucks.\n",
"\n",
"df3 - Для решения о целесообразности введения онлайн обучения в учереждении."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Пункт 5 - Цели проектов:\n",
"df - Поступает нужны процент мигрантов, а на выходе страны с подходящим числом мигрантов.\n",
"\n",
"df2 - Поступает высшая стоимость сегодняшней акции, а на выходе предполагаемый процент завтра.\n",
"\n",
"df3 - Поступает список студентов с их состояниями, а на выходе вердикт оправдает ли себя ввод онлайн обучения. "
]
},
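{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the df project goal above (the 100,000 threshold is hypothetical). 'Migrants (net)' is stored as comma-formatted text, so it is parsed to numbers first:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Parse the comma-formatted migrant counts, then filter countries by a hypothetical threshold\n",
"migrants = pd.to_numeric(df['Migrants (net)'].astype(str).str.replace(',', ''), errors='coerce')\n",
"print(df.loc[migrants > 100000, 'Country (or dependency)'])"
]
},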
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Пункт 6 - Проблемы наборов:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Проверка на пропущенные значения:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"no 0\n",
"Country (or dependency) 0\n",
"Population 2020 0\n",
"Yearly Change 0\n",
"Net Change 0\n",
"Density (P/Km²) 0\n",
"Land Area (Km²) 0\n",
"Migrants (net) 34\n",
"Fert. Rate 0\n",
"Med. Age 0\n",
"Urban Pop % 0\n",
"World Share 0\n",
"dtype: int64\n",
"Date 0\n",
"Open 0\n",
"High 0\n",
"Low 0\n",
"Close 0\n",
"Adj Close 0\n",
"Volume 0\n",
"dtype: int64\n",
"Education Level 0\n",
"Institution Type 0\n",
"Gender 0\n",
"Age 0\n",
"Device 0\n",
"IT Student 0\n",
"Location 0\n",
"Financial Condition 0\n",
"Internet Type 0\n",
"Network Type 0\n",
"Flexibility Level 0\n",
"dtype: int64\n"
]
}
],
"source": [
"print(df.isnull().sum())\n",
"print(df2.isnull().sum())\n",
"print(df3.isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Как можно заметить, пустых данных почти нет, 34 пустых ячейки есть только в df в столбце с процентами мигрантов"
]
},
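{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch inspecting the affected rows (it relies only on the column shown in the output above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Rows where 'Migrants (net)' is missing\n",
"print(df[df['Migrants (net)'].isnull()].head())"
]
},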
{
"cell_type": "markdown",
"metadata": {},
"source": [
"df и df2 - неактуальные, так как в первом используется информация 4-х летней давности, а во втором - с 1992. "
]
},
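{
"cell_type": "markdown",
"metadata": {},
"source": [
"The df2 date range can be verified directly (a minimal sketch, assuming the 'Date' column parses as dates):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Oldest and newest trading dates in the Starbucks data\n",
"dates = pd.to_datetime(df2['Date'])\n",
"print(dates.min(), '-', dates.max())"
]
},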
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Пункт 7 - Примеры решений:\n",
"Для обоих датасетов решением будет полное обновление данных."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Пункт 8 - Оценка качества:\n",
"Информативность лучше всего у df и df3, так же как и степень покрытия. Реальным данным соответствует очень хорошо df2. Во всех датасетах метки хорошо согласованны."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Пункт 9 - Устранение пустых данных:\n",
"Устраним пустые данные в df путем удаления строк, в которых они присутствуют"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"df['Migrants (net)'] = df['Migrants (net)'].replace('', pd.NA)\n",
"df_cleaned = df.dropna(subset=['Migrants (net)'])\n",
"df_cleaned.to_csv(\"..//datasets//Lab_2//World_population_cleaned.csv\", index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"А теперь просто проведем действия с оставшимися наборами данных:\n",
"\n",
"В df2 поставим у всех записей цену при открытии на 12"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Date Open High Low Close Adj Close Volume\n",
"0 1992-06-26 12 0.347656 0.320313 0.335938 0.260703 224358400\n",
"1 1992-06-29 12 0.367188 0.332031 0.359375 0.278891 58732800\n",
"2 1992-06-30 12 0.371094 0.343750 0.347656 0.269797 34777600\n",
"3 1992-07-01 12 0.359375 0.339844 0.355469 0.275860 18316800\n",
"4 1992-07-02 12 0.359375 0.347656 0.355469 0.275860 13996800\n",
"... ... ... ... ... ... ... ...\n",
"8031 2024-05-17 12 78.000000 74.919998 77.849998 77.849998 14436500\n",
"8032 2024-05-20 12 78.320000 76.709999 77.540001 77.540001 11183800\n",
"8033 2024-05-21 12 78.220001 77.500000 77.720001 77.720001 8916600\n",
"8034 2024-05-22 12 81.019997 77.440002 80.720001 80.720001 22063400\n",
"8035 2024-05-23 12 80.699997 79.169998 79.260002 79.260002 4651418\n",
"\n",
"[8036 rows x 7 columns]\n"
]
}
],
"source": [
"df2['Open'] = 12\n",
"print(df2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"В df3 установим у всех средний по столбцу возраст"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Age\n",
"0 17.06556\n",
"1 17.06556\n",
"2 17.06556\n",
"3 17.06556\n",
"4 17.06556\n",
"... ...\n",
"1200 17.06556\n",
"1201 17.06556\n",
"1202 17.06556\n",
"1203 17.06556\n",
"1204 17.06556\n",
"\n",
"[1205 rows x 1 columns]\n"
]
}
],
"source": [
"Age_mean = df3['Age'].mean()\n",
"df3['Age'] = Age_mean\n",
"print(df3[['Age']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Пункт 10 - Разбиение"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train df: (140, 12), Validation df: (30, 12), Test df: (31, 12)\n",
"Train df2: (5625, 7), Validation df2: (1205, 7), Test df2: (1206, 7)\n",
"Train df3: (843, 11), Validation df3: (181, 11), Test df3: (181, 11)\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"train_df, temp_df = train_test_split(df_cleaned, test_size=0.3, random_state=22)\n",
"val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=22) \n",
"\n",
"train_df2, temp_df2 = train_test_split(df2, test_size=0.3, random_state=22)\n",
"val_df2, test_df2 = train_test_split(temp_df2, test_size=0.5, random_state=22)\n",
"\n",
"train_df3, temp_df3 = train_test_split(df3, test_size=0.3, random_state=22)\n",
"val_df3, test_df3 = train_test_split(temp_df3, test_size=0.5, random_state=22)\n",
"print(f\"Train df: {train_df.shape}, Validation df: {val_df.shape}, Test df: {test_df.shape}\")\n",
"print(f\"Train df2: {train_df2.shape}, Validation df2: {val_df2.shape}, Test df2: {test_df2.shape}\")\n",
"print(f\"Train df3: {train_df3.shape}, Validation df3: {val_df3.shape}, Test df3: {test_df3.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Пункт 11 - Проверка на сбалансированность"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Баланс Train df:\n",
"Распределение по столбцу 'Migrants (net)':\n",
" Migrants (net)\n",
"-800 2.857143\n",
"0 2.857143\n",
"-4,000 2.142857\n",
"40,000 2.142857\n",
"-5,000 2.142857\n",
" ... \n",
"52,000 0.714286\n",
"515 0.714286\n",
"50,000 0.714286\n",
"-39,858 0.714286\n",
"1,485 0.714286\n",
"Name: proportion, Length: 115, dtype: float64\n",
"\n",
"Баланс Validation df:\n",
"Распределение по столбцу 'Migrants (net)':\n",
" Migrants (net)\n",
"900 6.666667\n",
"-40,000 6.666667\n",
"36,400 3.333333\n",
"-67,152 3.333333\n",
"87,400 3.333333\n",
"-62,920 3.333333\n",
"16,000 3.333333\n",
"-451 3.333333\n",
"-2,803 3.333333\n",
"-4,000 3.333333\n",
"380 3.333333\n",
"11,370 3.333333\n",
"-163,313 3.333333\n",
"-1,500 3.333333\n",
"-1,000 3.333333\n",
"10,220 3.333333\n",
"-852 3.333333\n",
"168,694 3.333333\n",
"2,000 3.333333\n",
"-98,955 3.333333\n",
"-200 3.333333\n",
"47,800 3.333333\n",
"-60,000 3.333333\n",
"10,000 3.333333\n",
"3,000 3.333333\n",
"-14,837 3.333333\n",
"320 3.333333\n",
"-20,000 3.333333\n",
"Name: proportion, dtype: float64\n",
"\n",
"Баланс Test df:\n",
"Распределение по столбцу 'Migrants (net)':\n",
" Migrants (net)\n",
"-10,000 6.451613\n",
"1,000 3.225806\n",
"204,796 3.225806\n",
"-14,704 3.225806\n",
"-6,800 3.225806\n",
"3,911 3.225806\n",
"-16,053 3.225806\n",
"145,405 3.225806\n",
"4,800 3.225806\n",
"39,520 3.225806\n",
"71,560 3.225806\n",
"-4,806 3.225806\n",
"0 3.225806\n",
"-14,400 3.225806\n",
"4,000 3.225806\n",
"6,413 3.225806\n",
"120 3.225806\n",
"-8,353 3.225806\n",
"-116,858 3.225806\n",
"40,000 3.225806\n",
"260,650 3.225806\n",
"1,200 3.225806\n",
"-9,000 3.225806\n",
"1,351 3.225806\n",
"2,001 3.225806\n",
"-1,256 3.225806\n",
"-8,863 3.225806\n",
"-38,033 3.225806\n",
"-1,342 3.225806\n",
"5,000 3.225806\n",
"Name: proportion, dtype: float64\n",
"\n",
"Баланс Train df2:\n",
"Распределение по столбцу 'High':\n",
" High\n",
"0.804688 0.248889\n",
"0.742188 0.231111\n",
"0.765625 0.231111\n",
"0.789063 0.213333\n",
"0.757813 0.195556\n",
" ... \n",
"29.485001 0.017778\n",
"38.290001 0.017778\n",
"97.580002 0.017778\n",
"74.769997 0.017778\n",
"64.040001 0.017778\n",
"Name: proportion, Length: 4071, dtype: float64\n",
"\n",
"Баланс Validation df2:\n",
"Распределение по столбцу 'High':\n",
" High\n",
"0.851563 0.414938\n",
"0.757813 0.414938\n",
"0.726563 0.331950\n",
"0.812500 0.331950\n",
"0.914063 0.331950\n",
" ... \n",
"13.000000 0.082988\n",
"10.452500 0.082988\n",
"2.492188 0.082988\n",
"61.720001 0.082988\n",
"3.800781 0.082988\n",
"Name: proportion, Length: 1078, dtype: float64\n",
"\n",
"Баланс Test df2:\n",
"Распределение по столбцу 'High':\n",
" High\n",
"0.796875 0.414594\n",
"0.765625 0.414594\n",
"1.640625 0.331675\n",
"0.882813 0.331675\n",
"0.757813 0.331675\n",
" ... \n",
"8.085000 0.082919\n",
"10.595000 0.082919\n",
"7.540000 0.082919\n",
"99.470001 0.082919\n",
"88.540001 0.082919\n",
"Name: proportion, Length: 1088, dtype: float64\n",
"\n",
"Баланс Train df3:\n",
"Распределение по столбцу 'Flexibility Level':\n",
" Flexibility Level\n",
"Moderate 52.669039\n",
"Low 38.552788\n",
"High 8.778173\n",
"Name: proportion, dtype: float64\n",
"\n",
"Баланс Validation df3:\n",
"Распределение по столбцу 'Flexibility Level':\n",
" Flexibility Level\n",
"Moderate 50.276243\n",
"Low 44.198895\n",
"High 5.524862\n",
"Name: proportion, dtype: float64\n",
"\n",
"Баланс Test df3:\n",
"Распределение по столбцу 'Flexibility Level':\n",
" Flexibility Level\n",
"Moderate 49.723757\n",
"Low 41.436464\n",
"High 8.839779\n",
"Name: proportion, dtype: float64\n"
]
}
],
"source": [
"#Проверка на сбалансированность\n",
"def check_balance(df, target_column):\n",
" distribution = df[target_column].value_counts(normalize=True) * 100\n",
" print(f\"Распределение по столбцу '{target_column}':\\n\", distribution)\n",
"\n",
"# Для датасета 1\n",
"print(\"Баланс Train df:\")\n",
"check_balance(train_df, 'Migrants (net)') \n",
"print(\"\\nБаланс Validation df:\")\n",
"check_balance(val_df, 'Migrants (net)')\n",
"print(\"\\nБаланс Test df:\")\n",
"check_balance(test_df, 'Migrants (net)')\n",
"\n",
"# Для датасета 2\n",
"print(\"\\nБаланс Train df2:\")\n",
"check_balance(train_df2, 'High')\n",
"print(\"\\nБаланс Validation df2:\")\n",
"check_balance(val_df2, 'High')\n",
"print(\"\\nБаланс Test df2:\")\n",
"check_balance(test_df2, 'High')\n",
"\n",
"# Для датасета 3\n",
"print(\"\\nБаланс Train df3:\")\n",
"check_balance(train_df3, 'Flexibility Level')\n",
"print(\"\\nБаланс Validation df3:\")\n",
"check_balance(val_df3, 'Flexibility Level')\n",
"print(\"\\nБаланс Test df3:\")\n",
"check_balance(test_df3, 'Flexibility Level')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Наши выборки сбалансированно распределены. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Пункт 12 - Приращение данных:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"df - Oversampling:\n",
"Размер выборки после oversampling: (460, 11)\n",
"\n",
"df - Undersampling:\n",
"Размер выборки после undersampling: (115, 11)\n",
"\n",
"df2 - Oversampling:\n",
"Размер выборки после oversampling: (21868, 6)\n",
"\n",
"df2 - Undersampling:\n",
"Размер выборки после undersampling: (5467, 6)\n",
"\n",
"df3 - Oversampling:\n",
"Размер выборки после oversampling: (1332, 10)\n",
"\n",
"df3 - Undersampling:\n",
"Размер выборки после undersampling: (222, 10)\n"
]
}
],
"source": [
"from imblearn.over_sampling import RandomOverSampler\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"# Функция для выполнения oversampling\n",
"def apply_oversampling(X, y):\n",
" oversampler = RandomOverSampler(random_state=22)\n",
" X_resampled, y_resampled = oversampler.fit_resample(X, y)\n",
" print(f\"Размер выборки после oversampling: {X_resampled.shape}\")\n",
" return X_resampled, y_resampled\n",
"\n",
"# Функция для выполнения undersampling\n",
"def apply_undersampling(X, y):\n",
" undersampler = RandomUnderSampler(random_state=22)\n",
" X_resampled, y_resampled = undersampler.fit_resample(X, y)\n",
" print(f\"Размер выборки после undersampling: {X_resampled.shape}\")\n",
" return X_resampled, y_resampled\n",
"\n",
"# Пример для первого набора данных (df)\n",
"X_1 = train_df.drop(columns=['Migrants (net)']) \n",
"y_1 = train_df['Migrants (net)']\n",
"\n",
"print(\"df - Oversampling:\")\n",
"X_resampled1, y_resampled1 = apply_oversampling(X_1, y_1)\n",
"\n",
"print(\"\\ndf - Undersampling:\")\n",
"X_resampled1_under, y_resampled1_under = apply_undersampling(X_1, y_1)\n",
"\n",
"# Пример для второго набора данных (df2)\n",
"X_2 = train_df2.drop(columns=['Volume'])\n",
"y_2 = train_df2['Volume']\n",
"\n",
"print(\"\\ndf2 - Oversampling:\")\n",
"X_resampled2, y_resampled2 = apply_oversampling(X_2, y_2)\n",
"\n",
"print(\"\\ndf2 - Undersampling:\")\n",
"X_resampled2_under, y_resampled2_under = apply_undersampling(X_2, y_2)\n",
"\n",
"# Пример для третьего набора данных (df3)\n",
"X_3 = train_df3.drop(columns=['Flexibility Level'])\n",
"y_3 = train_df3['Flexibility Level']\n",
"\n",
"print(\"\\ndf3 - Oversampling:\")\n",
"X_resampled3, y_resampled3 = apply_oversampling(X_3, y_3)\n",
"\n",
"print(\"\\ndf3 - Undersampling:\")\n",
"X_resampled3_under, y_resampled3_under = apply_undersampling(X_3, y_3)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "aimenv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

BIN
Lab_2/requirments.txt Normal file

Binary file not shown.

783
Lab_3/Lab_3.ipynb Normal file

File diff suppressed because one or more lines are too long

548
Lab_4/Lab4.ipynb Normal file

@@ -0,0 +1,548 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Определим бизнес цели:\n",
"\n",
"1. Бизнес-цель: Оптимизация страховых тарифов для клиентов (регрессия)\n",
"Целевая: Charges\n",
"Признаки: Age BMI Region Children Sex Smoker\n",
"Достижимый уровень качества: MSE (среднеквадратичная ошибка): минимизация, ориентир в зависимости от разброса целевой переменной. R^2 > 0.6.\n",
"Ориентир: Прогноз среднего значения целевой переменной.\n",
"\n",
"2. Бизнес-цель: Определение клиентов с высоким риском заболеваний для профилактики (классификация)\n",
"Целевая: Smoker \n",
"Признаки: Age BMI Region Children Sex\n",
"Достижимый уровень качества: Accuracy (точность классификации) 70-80%%\n",
"Ориентир: DummyClassifier, предсказывающий самый частый класс, даст accuracy ~50-60%%."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"X_train_class: (1940, 7), y_train_class: (1940,)\n",
"X_train_reg: (1940, 8), y_train_reg: (1940,)\n"
]
}
],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
"from sklearn.compose import ColumnTransformer\n",
"\n",
"# Загрузка данных\n",
"data = pd.read_csv(\"..//datasets//Lab_1//Medical_insurance1.csv\",sep=',')\n",
"\n",
"# Преобразование данных для классификации\n",
"X_class = data[['Age', 'BMI', 'Children', 'Region', 'Sex']]\n",
"y_class = data['Smoker'] # Целевой столбец для классификации\n",
"\n",
"# Преобразование данных для регрессии\n",
"X_reg = data[['Age', 'BMI', 'Children', 'Region', 'Sex', 'Smoker']]\n",
"y_reg = data['Charges'] # Целевой столбец для регрессии\n",
"\n",
"# Кодирование категориальных данных\n",
"categorical_features = ['Region', 'Sex']\n",
"numerical_features = ['Age', 'BMI', 'Children']\n",
"\n",
"preprocessor_class = ColumnTransformer(\n",
" transformers=[\n",
" ('num', StandardScaler(), numerical_features),\n",
" ('cat', OneHotEncoder(drop='first'), categorical_features)\n",
" ])\n",
"\n",
"preprocessor_reg = ColumnTransformer(\n",
" transformers=[\n",
" ('num', StandardScaler(), numerical_features),\n",
" ('cat', OneHotEncoder(drop='first'), categorical_features),\n",
" ('smoker', OneHotEncoder(drop='first'), ['Smoker'])\n",
" ])\n",
"\n",
"X_class_scaled = preprocessor_class.fit_transform(X_class)\n",
"X_reg_scaled = preprocessor_reg.fit_transform(X_reg)\n",
"\n",
"X_train_class, X_test_class, y_train_class, y_test_class = train_test_split(\n",
" X_class_scaled, y_class, test_size=0.3, random_state=42\n",
")\n",
"X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(\n",
" X_reg_scaled, y_reg, test_size=0.3, random_state=42\n",
")\n",
"\n",
"# Проверка форматов данных\n",
"print(f\"X_train_class: {X_train_class.shape}, y_train_class: {y_train_class.shape}\")\n",
"print(f\"X_train_reg: {X_train_reg.shape}, y_train_reg: {y_train_reg.shape}\")\n"
]
},
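{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check for the accuracy target stated above, here is a minimal sketch of the DummyClassifier baseline (assuming the train/test split from the previous cell):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.dummy import DummyClassifier\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"# Baseline that always predicts the most frequent class\n",
"baseline = DummyClassifier(strategy='most_frequent')\n",
"baseline.fit(X_train_class, y_train_class)\n",
"print('Baseline accuracy:', accuracy_score(y_test_class, baseline.predict(X_test_class)))"
]
},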
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Теперь создадим по три модели и построим конвейер. \n",
"Для классификации:\n",
"1. Logistic Regression — базовая линейная модель для классификации.\n",
"2. Random Forest Classifier — ансамблевый метод на основе деревьев решений.\n",
"3. Gradient Boosting Classifier (XGBoost) — продвинутый бустинг для задач классификации.\n",
"Для регрессии:\n",
"1. Linear Regression — базовая линейная модель.\n",
"2. Random Forest Regressor — ансамблевый метод для регрессии.\n",
"3. Gradient Boosting Regressor (XGBoost) — продвинутый бустинг для регрессии."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Оценка моделей для классификации:\n",
"\n",
"Logistic Regression:\n",
" Accuracy: 0.8029\n",
" ROC-AUC: 0.5669\n",
"\n",
"Random Forest:\n",
" Accuracy: 0.9387\n",
" ROC-AUC: 0.9582\n",
"\n",
"SVC:\n",
" Accuracy: 0.8029\n",
" ROC-AUC: 0.6788\n",
"\n",
"Оценка моделей для регрессии:\n",
"\n",
"Linear Regression:\n",
" Mean Squared Error: 40004195.9424\n",
" R^2 Score: 0.7443\n",
"\n",
"Random Forest:\n",
" Mean Squared Error: 10687675.8724\n",
" R^2 Score: 0.9317\n",
"\n",
"SVR:\n",
" Mean Squared Error: 169727106.2359\n",
" R^2 Score: -0.0847\n",
"\n"
]
}
],
"source": [
"from sklearn.svm import SVC, SVR\n",
"from sklearn.linear_model import LogisticRegression, LinearRegression\n",
"from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.metrics import accuracy_score, roc_auc_score, mean_squared_error, r2_score\n",
"\n",
"# Конвейеры для классификации\n",
"pipelines_class = {\n",
" 'Logistic Regression': Pipeline([\n",
" ('scaler', StandardScaler()), # Масштабирование\n",
" ('classifier', LogisticRegression(random_state=42, max_iter=500))\n",
" ]),\n",
" 'Random Forest': Pipeline([\n",
" ('scaler', StandardScaler()),\n",
" ('classifier', RandomForestClassifier(random_state=42))\n",
" ]),\n",
" 'SVC': Pipeline([\n",
" ('scaler', StandardScaler()),\n",
" ('classifier', SVC(probability=True, random_state=42))\n",
" ])\n",
"}\n",
"\n",
"# Конвейеры для регрессии\n",
"pipelines_reg = {\n",
" 'Linear Regression': Pipeline([\n",
" ('scaler', StandardScaler()),\n",
" ('regressor', LinearRegression())\n",
" ]),\n",
" 'Random Forest': Pipeline([\n",
" ('scaler', StandardScaler()),\n",
" ('regressor', RandomForestRegressor(random_state=42))\n",
" ]),\n",
" 'SVR': Pipeline([\n",
" ('scaler', StandardScaler()),\n",
" ('regressor', SVR())\n",
" ])\n",
"}\n",
"\n",
"# Функция для оценки классификации\n",
"def evaluate_classification(pipelines, X_train, X_test, y_train, y_test):\n",
" print(\"Оценка моделей для классификации:\\n\")\n",
" for name, pipeline in pipelines.items():\n",
" pipeline.fit(X_train, y_train)\n",
" y_pred = pipeline.predict(X_test)\n",
" y_proba = pipeline.predict_proba(X_test)[:, 1] if hasattr(pipeline['classifier'], 'predict_proba') else None\n",
" acc = accuracy_score(y_test, y_pred)\n",
" roc_auc = roc_auc_score(y_test, y_proba) if y_proba is not None else None\n",
" print(f\"{name}:\")\n",
" print(f\" Accuracy: {acc:.4f}\")\n",
" if roc_auc is not None:\n",
" print(f\" ROC-AUC: {roc_auc:.4f}\")\n",
" print()\n",
"\n",
"# Функция для оценки регрессии\n",
"def evaluate_regression(pipelines, X_train, X_test, y_train, y_test):\n",
" print(\"Оценка моделей для регрессии:\\n\")\n",
" for name, pipeline in pipelines.items():\n",
" pipeline.fit(X_train, y_train)\n",
" y_pred = pipeline.predict(X_test)\n",
" mse = mean_squared_error(y_test, y_pred)\n",
" r2 = r2_score(y_test, y_pred)\n",
" print(f\"{name}:\")\n",
" print(f\" Mean Squared Error: {mse:.4f}\")\n",
" print(f\" R^2 Score: {r2:.4f}\")\n",
" print()\n",
"\n",
"# Оценка \n",
"evaluate_classification(pipelines_class, X_train_class, X_test_class, y_train_class, y_test_class)\n",
"evaluate_regression(pipelines_reg, X_train_reg, X_test_reg, y_train_reg, y_test_reg)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Теперь займемся настройкой гиперпараметров\n",
"GridSearchCV (кросс-валидация). Параметры: cv=5: 5 фолдов для кросс-валидации. n_jobs=-1: Используем все доступные процессоры для ускорения вычислений. verbose=1"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Настройка гиперпараметров для классификации:\n",
" Настройка модели: Logistic Regression\n",
"Fitting 5 folds for each of 8 candidates, totalling 40 fits\n",
" Лучшие параметры: {'classifier__C': 0.01, 'classifier__solver': 'lbfgs'}\n",
" Лучшая ROC-AUC: 0.5708\n",
"\n",
" Настройка модели: Random Forest\n",
"Fitting 5 folds for each of 27 candidates, totalling 135 fits\n",
" Лучшие параметры: {'classifier__max_depth': None, 'classifier__min_samples_split': 2, 'classifier__n_estimators': 200}\n",
" Лучшая ROC-AUC: 0.8578\n",
"\n",
" Настройка модели: SVC\n",
"Fitting 5 folds for each of 12 candidates, totalling 60 fits\n",
" Лучшие параметры: {'classifier__C': 1, 'classifier__gamma': 'scale', 'classifier__kernel': 'rbf'}\n",
" Лучшая ROC-AUC: 0.6190\n",
"\n",
"Настройка гиперпараметров для регрессии:\n",
" Настройка модели: Linear Regression\n",
"Fitting 5 folds for each of 1 candidates, totalling 5 fits\n",
" Лучшие параметры: {}\n",
" Лучший R^2: 0.7505\n",
"\n",
" Настройка модели: Random Forest\n",
"Fitting 5 folds for each of 27 candidates, totalling 135 fits\n",
" Лучшие параметры: {'regressor__max_depth': None, 'regressor__min_samples_split': 2, 'regressor__n_estimators': 200}\n",
" Лучший R^2: 0.9079\n",
"\n",
" Настройка модели: SVR\n",
"Fitting 5 folds for each of 12 candidates, totalling 60 fits\n",
" Лучшие параметры: {'regressor__C': 10, 'regressor__gamma': 'scale', 'regressor__kernel': 'linear'}\n",
" Лучший R^2: 0.5283\n",
"\n"
]
}
],
"source": [
"from sklearn.model_selection import GridSearchCV\n",
"from sklearn.metrics import make_scorer\n",
"\n",
"# Гиперпараметры для классификации\n",
"param_grids_class = {\n",
" 'Logistic Regression': {\n",
" 'classifier__C': [0.01, 0.1, 1, 10],\n",
" 'classifier__solver': ['liblinear', 'lbfgs']\n",
" },\n",
" 'Random Forest': {\n",
" 'classifier__n_estimators': [50, 100, 200],\n",
" 'classifier__max_depth': [None, 10, 20],\n",
" 'classifier__min_samples_split': [2, 5, 10]\n",
" },\n",
" 'SVC': {\n",
" 'classifier__C': [0.1, 1, 10],\n",
" 'classifier__kernel': ['linear', 'rbf'],\n",
" 'classifier__gamma': ['scale', 'auto']\n",
" }\n",
"}\n",
"\n",
"# Гиперпараметры для регрессии\n",
"param_grids_reg = {\n",
" 'Linear Regression': {}, # У линейной регрессии обычно мало гиперпараметров\n",
" 'Random Forest': {\n",
" 'regressor__n_estimators': [50, 100, 200],\n",
" 'regressor__max_depth': [None, 10, 20],\n",
" 'regressor__min_samples_split': [2, 5, 10]\n",
" },\n",
" 'SVR': {\n",
" 'regressor__C': [0.1, 1, 10],\n",
" 'regressor__kernel': ['linear', 'rbf'],\n",
" 'regressor__gamma': ['scale', 'auto']\n",
" }\n",
"}\n",
"\n",
"# Функция для настройки гиперпараметров классификации\n",
"def tune_hyperparameters_class(pipelines, param_grids, X_train, y_train):\n",
" best_models = {}\n",
" print(\"Настройка гиперпараметров для классификации:\")\n",
" for name, pipeline in pipelines.items():\n",
" print(f\" Настройка модели: {name}\")\n",
" grid_search = GridSearchCV(\n",
" pipeline, param_grids[name], scoring='roc_auc', cv=5, n_jobs=-1, verbose=1\n",
" )\n",
" grid_search.fit(X_train, y_train)\n",
" best_models[name] = grid_search.best_estimator_\n",
" print(f\" Лучшие параметры: {grid_search.best_params_}\")\n",
" print(f\" Лучшая ROC-AUC: {grid_search.best_score_:.4f}\\n\")\n",
" return best_models\n",
"\n",
"# Функция для настройки гиперпараметров регрессии\n",
"def tune_hyperparameters_reg(pipelines, param_grids, X_train, y_train):\n",
" best_models = {}\n",
" print(\"Настройка гиперпараметров для регрессии:\")\n",
" for name, pipeline in pipelines.items():\n",
" print(f\" Настройка модели: {name}\")\n",
" grid_search = GridSearchCV(\n",
" pipeline, param_grids[name], scoring='r2', cv=5, n_jobs=-1, verbose=1\n",
" )\n",
" grid_search.fit(X_train, y_train)\n",
" best_models[name] = grid_search.best_estimator_\n",
" print(f\" Лучшие параметры: {grid_search.best_params_}\")\n",
" print(f\" Лучший R^2: {grid_search.best_score_:.4f}\\n\")\n",
" return best_models\n",
"\n",
"# Настройка гиперпараметров\n",
"best_models_class = tune_hyperparameters_class(pipelines_class, param_grids_class, X_train_class, y_train_class)\n",
"best_models_reg = tune_hyperparameters_reg(pipelines_reg, param_grids_reg, X_train_reg, y_train_reg)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Теперь оценим их"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score\n",
"from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score\n",
"import numpy as np\n",
"\n",
"# Преобразуем текстовые метки в числовые\n",
"y_class_numeric = y_class.map({'no': 0, 'yes': 1})\n",
"\n",
"# Оценка качества классификации\n",
"for name, model in best_models_class.items():\n",
" y_pred_class = model.predict(X_class_scaled)\n",
"\n",
" # Преобразуем текстовые предсказания в числовые метки\n",
" # if isinstance(y_pred_class[0], str):\n",
" # y_pred_class = pd.Series(y_pred_class).map({'no': 0, 'yes': 1}).values\n",
"\n",
" # y_pred_proba = model.predict_proba(X_class_scaled)[:, 1] if hasattr(model, \"predict_proba\") else None\n",
"\n",
" # print(f\"Оценка качества для модели {name}:\")\n",
" # print(\"Accuracy:\", accuracy_score(y_class_numeric, y_pred_class))\n",
" # print(\"Precision:\", precision_score(y_class_numeric, y_pred_class))\n",
" # print(\"Recall:\", recall_score(y_class_numeric, y_pred_class))\n",
" # print(\"F1-Score:\", f1_score(y_class_numeric, y_pred_class))\n",
"\n",
" # if y_pred_proba is not None:\n",
" # print(\"ROC AUC:\", roc_auc_score(y_class_numeric, y_pred_proba))\n",
" # else:\n",
" # print(\"ROC AUC: Невозможно вычислить, модель не поддерживает predict_proba\")\n",
" # print(\"\\n\")\n",
"\n",
"# Оценка качества регрессии\n",
"for name, model in best_models_reg.items():\n",
" y_pred_reg = model.predict(X_reg_scaled)\n",
" # print(f\"Оценка качества для модели {name}:\")\n",
" # print(\"MAE:\", mean_absolute_error(y_reg, y_pred_reg))\n",
" # print(\"MSE:\", mean_squared_error(y_reg, y_pred_reg))\n",
" # print(\"RMSE:\", np.sqrt(mean_squared_error(y_reg, y_pred_reg)))\n",
" # print(\"R²:\", r2_score(y_reg, y_pred_reg))\n",
" # print(\"\\n\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Результат и оправдание метрик (из-за большого размера):\n",
"\n",
"\n",
"Классификация:\n",
"Оценка качества для модели Logistic Regression:\n",
"Accuracy: 0.7965367965367965\n",
"Precision: 0.0\n",
"F1-Score: 0.0\n",
"ROC AUC: 0.5788123779422346\n",
"\n",
"Оценка качества для модели Random Forest:\n",
"Accuracy: 0.9812409812409812\n",
"Precision: 0.9671532846715328\n",
"F1-Score: 0.9532374100719424\n",
"ROC AUC: 0.9922653921266317\n",
"\n",
"Оценка качества для модели SVC:\n",
"Accuracy: 0.7965367965367965\n",
"Precision: 0.0\n",
"F1-Score: 0.0\n",
"ROC AUC: 0.752295007194984\n",
"\n",
"\n",
"Регрессия:\n",
"Оценка качества для модели Linear Regression:\n",
"MAE: 4136.775081674497\n",
"MSE: 36800983.69176898\n",
"R²: 0.7506914796021513\n",
"\n",
"\n",
"Оценка качества для модели Random Forest:\n",
"MAE: 827.1310058445929\n",
"MSE: 4157251.954692241\n",
"R²: 0.9718366676710003\n",
"\n",
"\n",
"Оценка качества для модели SVR:\n",
"MAE: 3907.745018325371\n",
"MSE: 67849095.49493024\n",
"R²: 0.5403558298916683\n",
"\n",
"\n",
"\n",
"Для задачи классификации\n",
"Метрики:\n",
"Precision (Точность):\n",
"Это доля истинных положительных случаев среди всех предсказанных положительных случаев. Важна для задач, где важно минимизировать количество ложных срабатываний\n",
"Accuracy (доля правильных ответов):\n",
"Показывает, какая доля объектов была классифицирована правильно.\n",
"Уместна при сбалансированных классах.\n",
"ROC-AUC (площадь под кривой ошибок):\n",
"Учитывает баланс между чувствительностью и специфичностью.\n",
"Подходит для несбалансированных данных.\n",
"F1-Score:\n",
"Баланс между точностью и полнотой.\n",
"Полезна, если ошибки классификации одного из классов имеют больший вес.\n",
"\n",
"\n",
"Для задачи регрессии\n",
"Метрики:\n",
"\n",
"R² (коэффициент детерминации):\n",
"Оценивает долю объясненной дисперсии целевой переменной моделью.\n",
"Mean Absolute Error (MAE):\n",
"Среднее абсолютное отклонение предсказаний от истинных значений.\n",
"Удобна для интерпретации, так как измеряется в тех же единицах, что и целевая переменная.\n",
"Mean Squared Error (MSE):\n",
"Усиливает влияние больших ошибок.\n",
"Уместна, если крупные ошибки особенно нежелательны."
]
},
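{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, the same classification metrics can also be computed on the held-out test split alone (a minimal sketch, assuming the tuned Random Forest from the grid search above; the cell before scores the full dataset, which includes training rows):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score\n",
"\n",
"# Tuned classifier and its predictions on the held-out split only\n",
"rf = best_models_class['Random Forest']\n",
"y_pred_numeric = pd.Series(rf.predict(X_test_class)).map({'no': 0, 'yes': 1})\n",
"y_test_numeric = y_test_class.map({'no': 0, 'yes': 1})\n",
"\n",
"print('Accuracy:', accuracy_score(y_test_numeric, y_pred_numeric))\n",
"print('Precision:', precision_score(y_test_numeric, y_pred_numeric))\n",
"print('Recall:', recall_score(y_test_numeric, y_pred_numeric))\n",
"print('F1:', f1_score(y_test_numeric, y_pred_numeric))\n",
"# predict_proba column 1 corresponds to the 'yes' class\n",
"print('ROC-AUC:', roc_auc_score(y_test_numeric, rf.predict_proba(X_test_class)[:, 1]))"
]
},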
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Оценка смещения и дисперсии для classification:\n",
"\n",
"Модель: Logistic Regression\n",
"Смещение (bias): 0.2062\n",
"Дисперсия (variance): -0.0091\n",
"\n",
"Модель: Random Forest\n",
"Смещение (bias): 0.0005\n",
"Дисперсия (variance): 0.0608\n",
"\n",
"Модель: SVC\n",
"Смещение (bias): 0.2062\n",
"Дисперсия (variance): -0.0091\n",
"Оценка смещения и дисперсии для regression:\n",
"\n",
"Модель: Linear Regression\n",
"Смещение (bias): 0.2463\n",
"Дисперсия (variance): 0.0093\n",
"\n",
"Модель: Random Forest\n",
"Смещение (bias): 0.0095\n",
"Дисперсия (variance): 0.0585\n",
"\n",
"Модель: SVR\n",
"Смещение (bias): 0.4514\n",
"Дисперсия (variance): 0.0260\n"
]
}
],
"source": [
"def evaluate_bias_variance(models, X_train, X_test, y_train, y_test, task=\"classification\"):\n",
" print(f\"Оценка смещения и дисперсии для {task}:\")\n",
" for name, model in models.items():\n",
" if task == \"classification\":\n",
" train_score = model.score(X_train, y_train)\n",
" test_score = model.score(X_test, y_test)\n",
" else: # Для регрессии\n",
" train_score = r2_score(y_train, model.predict(X_train))\n",
" test_score = r2_score(y_test, model.predict(X_test))\n",
"\n",
" bias = 1 - train_score\n",
" variance = train_score - test_score\n",
"\n",
" print(f\"\\nМодель: {name}\")\n",
" print(f\"Смещение (bias): {bias:.4f}\")\n",
" print(f\"Дисперсия (variance): {variance:.4f}\")\n",
"\n",
"# Анализ смещения и дисперсии\n",
"evaluate_bias_variance(best_models_class, X_train_class, X_test_class, y_train_class, y_test_class, task=\"classification\")\n",
"evaluate_bias_variance(best_models_reg, X_train_reg, X_test_reg, y_train_reg, y_test_reg, task=\"regression\")\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "aimenv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

288
Lab_5/Lab_5.ipynb Normal file

File diff suppressed because one or more lines are too long