2024-10-19 20:31:08 +04:00

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lab2 Pibd-31 Sagirov M M"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading the three datasets"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Load the three datasets used throughout the lab\n",
"df = pd.read_csv(\"../datasets/Lab_2/world-population-by-country-2020.csv\", sep=\",\")\n",
"df2 = pd.read_csv(\"../datasets/Lab_2/Starbucks Dataset.csv\", sep=\",\")\n",
"df3 = pd.read_csv(\"../datasets/Lab_2/students_adaptability_level_online_education.csv\", sep=\",\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Items 2-8 are presented below as Markdown:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Item 2 - Problem domains:\n",
"df - World population\n",
"\n",
"df2 - Starbucks stock\n",
"\n",
"df3 - Effectiveness of online education"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Item 3 - Objects, their attributes, and relationships:\n",
"df - The object of observation is a Country. Its attributes include: rank in the list, name, population in 2020, yearly change, net change, population density, as well as net migration and fertility rate.\n",
"\n",
"df2 - The object of observation is a Stock. Its attributes include: trading date, opening price, daily high and low prices, closing price, and trading volume.\n",
"\n",
"df3 - The object of observation is a Student. Its attributes include: education level, institution type (paid/free), gender, age, location, financial condition, and level of adaptability to online education.\n",
"\n",
"There are no relationships between the objects."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Item 4 - Business goals:\n",
"df - To compile a list of priority countries for showing migration advertising.\n",
"\n",
"df2 - To identify trends in Starbucks stock.\n",
"\n",
"df3 - To decide whether introducing online education at an institution is worthwhile."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Item 5 - Project goals:\n",
"df - Input: the required level of migration; output: the countries with a matching number of migrants.\n",
"\n",
"df2 - Input: today's high price of the stock; output: the expected percentage for tomorrow.\n",
"\n",
"df3 - Input: a list of students with their conditions; output: a verdict on whether introducing online education will pay off."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Item 6 - Dataset problems:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Check for missing values:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"no 0\n",
"Country (or dependency) 0\n",
"Population 2020 0\n",
"Yearly Change 0\n",
"Net Change 0\n",
"Density (P/Km²) 0\n",
"Land Area (Km²) 0\n",
"Migrants (net) 34\n",
"Fert. Rate 0\n",
"Med. Age 0\n",
"Urban Pop % 0\n",
"World Share 0\n",
"dtype: int64\n",
"Date 0\n",
"Open 0\n",
"High 0\n",
"Low 0\n",
"Close 0\n",
"Adj Close 0\n",
"Volume 0\n",
"dtype: int64\n",
"Education Level 0\n",
"Institution Type 0\n",
"Gender 0\n",
"Age 0\n",
"Device 0\n",
"IT Student 0\n",
"Location 0\n",
"Financial Condition 0\n",
"Internet Type 0\n",
"Network Type 0\n",
"Flexibility Level 0\n",
"dtype: int64\n"
]
}
],
"source": [
"print(df.isnull().sum())\n",
"print(df2.isnull().sum())\n",
"print(df3.isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As the output shows, there is almost no missing data: the only gaps are 34 empty cells in df, in the `Migrants (net)` column."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"df and df2 are out of date: the first uses data that is four years old, and the second goes back to 1992."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Item 7 - Example solutions:\n",
"For both datasets the solution is a full refresh of the data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Item 8 - Quality assessment:\n",
"Informativeness is best for df and df3, as is the degree of coverage. df2 matches real-world data very well. Labels are consistent in all datasets."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Item 9 - Removing missing data:\n",
"Remove the missing data in df by dropping the rows that contain it"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
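"# Treat empty strings as missing values as well\n",
"df['Migrants (net)'] = df['Migrants (net)'].replace('', pd.NA)\n",
"# Drop the rows without a net migration value and save the result\n",
"df_cleaned = df.dropna(subset=['Migrants (net)'])\n",
"df_cleaned.to_csv(\"../datasets/Lab_2/World_population_cleaned.csv\", index=False)"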
]
},
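{
"cell_type": "markdown",
"metadata": {},
"source": [
"The balance check later in this notebook prints `Migrants (net)` values such as `-4,000` and `40,000`, which suggests the column is read as text with thousands separators. A minimal sketch, assuming that format, of converting such values to numbers before dropping gaps. The `sample` frame is made up for illustration and is not taken from the dataset.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical sample; the real values would come from df['Migrants (net)']\n",
"sample = pd.DataFrame({'Migrants (net)': ['-4,000', '40,000', None, '515']})\n",
"\n",
"# Strip thousands separators, then convert; unparseable entries become NaN\n",
"as_text = sample['Migrants (net)'].str.replace(',', '', regex=False)\n",
"sample['Migrants (net)'] = pd.to_numeric(as_text, errors='coerce')\n",
"\n",
"# Drop the rows that are still missing after the conversion\n",
"cleaned = sample.dropna(subset=['Migrants (net)'])\n",
"print(cleaned)\n",
"```"
]
},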
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now simply apply transformations to the remaining datasets:\n",
"\n",
"In df2, set the opening price of every record to 12"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Date Open High Low Close Adj Close Volume\n",
"0 1992-06-26 12 0.347656 0.320313 0.335938 0.260703 224358400\n",
"1 1992-06-29 12 0.367188 0.332031 0.359375 0.278891 58732800\n",
"2 1992-06-30 12 0.371094 0.343750 0.347656 0.269797 34777600\n",
"3 1992-07-01 12 0.359375 0.339844 0.355469 0.275860 18316800\n",
"4 1992-07-02 12 0.359375 0.347656 0.355469 0.275860 13996800\n",
"... ... ... ... ... ... ... ...\n",
"8031 2024-05-17 12 78.000000 74.919998 77.849998 77.849998 14436500\n",
"8032 2024-05-20 12 78.320000 76.709999 77.540001 77.540001 11183800\n",
"8033 2024-05-21 12 78.220001 77.500000 77.720001 77.720001 8916600\n",
"8034 2024-05-22 12 81.019997 77.440002 80.720001 80.720001 22063400\n",
"8035 2024-05-23 12 80.699997 79.169998 79.260002 79.260002 4651418\n",
"\n",
"[8036 rows x 7 columns]\n"
]
}
],
"source": [
"df2['Open'] = 12\n",
"print(df2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In df3, set the age of every record to the column mean"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Age\n",
"0 17.06556\n",
"1 17.06556\n",
"2 17.06556\n",
"3 17.06556\n",
"4 17.06556\n",
"... ...\n",
"1200 17.06556\n",
"1201 17.06556\n",
"1202 17.06556\n",
"1203 17.06556\n",
"1204 17.06556\n",
"\n",
"[1205 rows x 1 columns]\n"
]
}
],
"source": [
"Age_mean = df3['Age'].mean()\n",
"df3['Age'] = Age_mean\n",
"print(df3[['Age']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Item 10 - Splitting"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train df: (140, 12), Validation df: (30, 12), Test df: (31, 12)\n",
"Train df2: (5625, 7), Validation df2: (1205, 7), Test df2: (1206, 7)\n",
"Train df3: (843, 11), Validation df3: (181, 11), Test df3: (181, 11)\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"train_df, temp_df = train_test_split(df_cleaned, test_size=0.3, random_state=22)\n",
"val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=22) \n",
"\n",
"train_df2, temp_df2 = train_test_split(df2, test_size=0.3, random_state=22)\n",
"val_df2, test_df2 = train_test_split(temp_df2, test_size=0.5, random_state=22)\n",
"\n",
"train_df3, temp_df3 = train_test_split(df3, test_size=0.3, random_state=22)\n",
"val_df3, test_df3 = train_test_split(temp_df3, test_size=0.5, random_state=22)\n",
"print(f\"Train df: {train_df.shape}, Validation df: {val_df.shape}, Test df: {test_df.shape}\")\n",
"print(f\"Train df2: {train_df2.shape}, Validation df2: {val_df2.shape}, Test df2: {test_df2.shape}\")\n",
"print(f\"Train df3: {train_df3.shape}, Validation df3: {val_df3.shape}, Test df3: {test_df3.shape}\")"
]
},
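{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a categorical target such as `Flexibility Level`, `train_test_split` accepts a `stratify` argument that keeps the class shares roughly equal in every split. A minimal sketch on made-up data; the `toy` frame and its values are assumptions, not the lab dataset.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Toy data: 100 rows with an imbalanced label\n",
"toy = pd.DataFrame({\n",
"    'Age': range(100),\n",
"    'Flexibility Level': ['Moderate'] * 50 + ['Low'] * 40 + ['High'] * 10,\n",
"})\n",
"\n",
"# 70% train, then split the remaining 30% evenly into validation and test\n",
"train, temp = train_test_split(\n",
"    toy, test_size=0.3, random_state=22, stratify=toy['Flexibility Level'])\n",
"val, test = train_test_split(\n",
"    temp, test_size=0.5, random_state=22, stratify=temp['Flexibility Level'])\n",
"\n",
"# Class shares in the training part\n",
"print(train['Flexibility Level'].value_counts(normalize=True))\n",
"```"
]
},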
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Item 11 - Balance check"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Баланс Train df:\n",
"Распределение по столбцу 'Migrants (net)':\n",
" Migrants (net)\n",
"-800 2.857143\n",
"0 2.857143\n",
"-4,000 2.142857\n",
"40,000 2.142857\n",
"-5,000 2.142857\n",
" ... \n",
"52,000 0.714286\n",
"515 0.714286\n",
"50,000 0.714286\n",
"-39,858 0.714286\n",
"1,485 0.714286\n",
"Name: proportion, Length: 115, dtype: float64\n",
"\n",
"Баланс Validation df:\n",
"Распределение по столбцу 'Migrants (net)':\n",
" Migrants (net)\n",
"900 6.666667\n",
"-40,000 6.666667\n",
"36,400 3.333333\n",
"-67,152 3.333333\n",
"87,400 3.333333\n",
"-62,920 3.333333\n",
"16,000 3.333333\n",
"-451 3.333333\n",
"-2,803 3.333333\n",
"-4,000 3.333333\n",
"380 3.333333\n",
"11,370 3.333333\n",
"-163,313 3.333333\n",
"-1,500 3.333333\n",
"-1,000 3.333333\n",
"10,220 3.333333\n",
"-852 3.333333\n",
"168,694 3.333333\n",
"2,000 3.333333\n",
"-98,955 3.333333\n",
"-200 3.333333\n",
"47,800 3.333333\n",
"-60,000 3.333333\n",
"10,000 3.333333\n",
"3,000 3.333333\n",
"-14,837 3.333333\n",
"320 3.333333\n",
"-20,000 3.333333\n",
"Name: proportion, dtype: float64\n",
"\n",
"Баланс Test df:\n",
"Распределение по столбцу 'Migrants (net)':\n",
" Migrants (net)\n",
"-10,000 6.451613\n",
"1,000 3.225806\n",
"204,796 3.225806\n",
"-14,704 3.225806\n",
"-6,800 3.225806\n",
"3,911 3.225806\n",
"-16,053 3.225806\n",
"145,405 3.225806\n",
"4,800 3.225806\n",
"39,520 3.225806\n",
"71,560 3.225806\n",
"-4,806 3.225806\n",
"0 3.225806\n",
"-14,400 3.225806\n",
"4,000 3.225806\n",
"6,413 3.225806\n",
"120 3.225806\n",
"-8,353 3.225806\n",
"-116,858 3.225806\n",
"40,000 3.225806\n",
"260,650 3.225806\n",
"1,200 3.225806\n",
"-9,000 3.225806\n",
"1,351 3.225806\n",
"2,001 3.225806\n",
"-1,256 3.225806\n",
"-8,863 3.225806\n",
"-38,033 3.225806\n",
"-1,342 3.225806\n",
"5,000 3.225806\n",
"Name: proportion, dtype: float64\n",
"\n",
"Баланс Train df2:\n",
"Распределение по столбцу 'High':\n",
" High\n",
"0.804688 0.248889\n",
"0.742188 0.231111\n",
"0.765625 0.231111\n",
"0.789063 0.213333\n",
"0.757813 0.195556\n",
" ... \n",
"29.485001 0.017778\n",
"38.290001 0.017778\n",
"97.580002 0.017778\n",
"74.769997 0.017778\n",
"64.040001 0.017778\n",
"Name: proportion, Length: 4071, dtype: float64\n",
"\n",
"Баланс Validation df2:\n",
"Распределение по столбцу 'High':\n",
" High\n",
"0.851563 0.414938\n",
"0.757813 0.414938\n",
"0.726563 0.331950\n",
"0.812500 0.331950\n",
"0.914063 0.331950\n",
" ... \n",
"13.000000 0.082988\n",
"10.452500 0.082988\n",
"2.492188 0.082988\n",
"61.720001 0.082988\n",
"3.800781 0.082988\n",
"Name: proportion, Length: 1078, dtype: float64\n",
"\n",
"Баланс Test df2:\n",
"Распределение по столбцу 'High':\n",
" High\n",
"0.796875 0.414594\n",
"0.765625 0.414594\n",
"1.640625 0.331675\n",
"0.882813 0.331675\n",
"0.757813 0.331675\n",
" ... \n",
"8.085000 0.082919\n",
"10.595000 0.082919\n",
"7.540000 0.082919\n",
"99.470001 0.082919\n",
"88.540001 0.082919\n",
"Name: proportion, Length: 1088, dtype: float64\n",
"\n",
"Баланс Train df3:\n",
"Распределение по столбцу 'Flexibility Level':\n",
" Flexibility Level\n",
"Moderate 52.669039\n",
"Low 38.552788\n",
"High 8.778173\n",
"Name: proportion, dtype: float64\n",
"\n",
"Баланс Validation df3:\n",
"Распределение по столбцу 'Flexibility Level':\n",
" Flexibility Level\n",
"Moderate 50.276243\n",
"Low 44.198895\n",
"High 5.524862\n",
"Name: proportion, dtype: float64\n",
"\n",
"Баланс Test df3:\n",
"Распределение по столбцу 'Flexibility Level':\n",
" Flexibility Level\n",
"Moderate 49.723757\n",
"Low 41.436464\n",
"High 8.839779\n",
"Name: proportion, dtype: float64\n"
]
}
],
"source": [
"# Balance check: print the percentage share of each value in the target column\n",
"def check_balance(df, target_column):\n",
"    distribution = df[target_column].value_counts(normalize=True) * 100\n",
"    print(f\"Распределение по столбцу '{target_column}':\\n\", distribution)\n",
"\n",
"# Dataset 1\n",
"print(\"Баланс Train df:\")\n",
"check_balance(train_df, 'Migrants (net)') \n",
"print(\"\\nБаланс Validation df:\")\n",
"check_balance(val_df, 'Migrants (net)')\n",
"print(\"\\nБаланс Test df:\")\n",
"check_balance(test_df, 'Migrants (net)')\n",
"\n",
"# Dataset 2\n",
"print(\"\\nБаланс Train df2:\")\n",
"check_balance(train_df2, 'High')\n",
"print(\"\\nБаланс Validation df2:\")\n",
"check_balance(val_df2, 'High')\n",
"print(\"\\nБаланс Test df2:\")\n",
"check_balance(test_df2, 'High')\n",
"\n",
"# Dataset 3\n",
"print(\"\\nБаланс Train df3:\")\n",
"check_balance(train_df3, 'Flexibility Level')\n",
"print(\"\\nБаланс Validation df3:\")\n",
"check_balance(val_df3, 'Flexibility Level')\n",
"print(\"\\nБаланс Test df3:\")\n",
"check_balance(test_df3, 'Flexibility Level')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
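"The class shares of `Flexibility Level` in df3 are close across the train, validation, and test splits (Moderate about 50-53%, Low about 39-44%, High about 6-9%), so the splits are consistent with each other, although High is a clear minority class. For df and df2 the checked columns hold numeric quantities, so most values occur only a few times and the percentages say little about balance."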
]
},
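{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to make the balance check meaningful for a numeric column such as `High` is to group the values into bins first. A minimal sketch, not part of the lab itself, using `pd.qcut` on synthetic prices; the bin count of 4 and the generated data are arbitrary assumptions.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Synthetic prices standing in for a numeric column such as df2['High']\n",
"rng = np.random.default_rng(22)\n",
"prices = pd.Series(rng.uniform(0.3, 100.0, size=1000), name='High')\n",
"\n",
"# Split into 4 equal-frequency bins, then report each bin's share in percent\n",
"bins = pd.qcut(prices, q=4)\n",
"print(bins.value_counts(normalize=True).sort_index() * 100)\n",
"```"
]
},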
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Item 12 - Data augmentation:"
]
},
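{
"cell_type": "markdown",
"metadata": {},
"source": [
"`RandomOverSampler` repeats rows of the smaller classes until every class is as large as the biggest one, while `RandomUnderSampler` trims every class down to the size of the smallest one. A toy sketch on made-up data; `X` and `y` below are assumptions, not the lab datasets.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"import pandas as pd\n",
"from imblearn.over_sampling import RandomOverSampler\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"# Made-up imbalanced target: 7 rows of one class, 3 of the other\n",
"X = pd.DataFrame({'Age': range(10)})\n",
"y = pd.Series(['Low'] * 7 + ['High'] * 3)\n",
"\n",
"X_over, y_over = RandomOverSampler(random_state=22).fit_resample(X, y)\n",
"X_under, y_under = RandomUnderSampler(random_state=22).fit_resample(X, y)\n",
"\n",
"# Class counts before and after resampling\n",
"print(Counter(y), Counter(y_over), Counter(y_under))\n",
"```\n",
"\n",
"When the target has many distinct values, each distinct value is treated as its own class, so the resampled size equals the number of distinct values times the largest (or smallest) class size."
]
},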
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"df - Oversampling:\n",
"Размер выборки после oversampling: (460, 11)\n",
"\n",
"df - Undersampling:\n",
"Размер выборки после undersampling: (115, 11)\n",
"\n",
"df2 - Oversampling:\n",
"Размер выборки после oversampling: (21868, 6)\n",
"\n",
"df2 - Undersampling:\n",
"Размер выборки после undersampling: (5467, 6)\n",
"\n",
"df3 - Oversampling:\n",
"Размер выборки после oversampling: (1332, 10)\n",
"\n",
"df3 - Undersampling:\n",
"Размер выборки после undersampling: (222, 10)\n"
]
}
],
"source": [
"from imblearn.over_sampling import RandomOverSampler\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"# Helper that applies random oversampling\n",
"def apply_oversampling(X, y):\n",
"    oversampler = RandomOverSampler(random_state=22)\n",
"    X_resampled, y_resampled = oversampler.fit_resample(X, y)\n",
"    print(f\"Размер выборки после oversampling: {X_resampled.shape}\")\n",
"    return X_resampled, y_resampled\n",
"\n",
"# Helper that applies random undersampling\n",
"def apply_undersampling(X, y):\n",
"    undersampler = RandomUnderSampler(random_state=22)\n",
"    X_resampled, y_resampled = undersampler.fit_resample(X, y)\n",
"    print(f\"Размер выборки после undersampling: {X_resampled.shape}\")\n",
"    return X_resampled, y_resampled\n",
"\n",
"# Example for the first dataset (df)\n",
"X_1 = train_df.drop(columns=['Migrants (net)']) \n",
"y_1 = train_df['Migrants (net)']\n",
"\n",
"print(\"df - Oversampling:\")\n",
"X_resampled1, y_resampled1 = apply_oversampling(X_1, y_1)\n",
"\n",
"print(\"\\ndf - Undersampling:\")\n",
"X_resampled1_under, y_resampled1_under = apply_undersampling(X_1, y_1)\n",
"\n",
"# Example for the second dataset (df2)\n",
"X_2 = train_df2.drop(columns=['Volume'])\n",
"y_2 = train_df2['Volume']\n",
"\n",
"print(\"\\ndf2 - Oversampling:\")\n",
"X_resampled2, y_resampled2 = apply_oversampling(X_2, y_2)\n",
"\n",
"print(\"\\ndf2 - Undersampling:\")\n",
"X_resampled2_under, y_resampled2_under = apply_undersampling(X_2, y_2)\n",
"\n",
"# Example for the third dataset (df3)\n",
"X_3 = train_df3.drop(columns=['Flexibility Level'])\n",
"y_3 = train_df3['Flexibility Level']\n",
"\n",
"print(\"\\ndf3 - Oversampling:\")\n",
"X_resampled3, y_resampled3 = apply_oversampling(X_3, y_3)\n",
"\n",
"print(\"\\ndf3 - Undersampling:\")\n",
"X_resampled3_under, y_resampled3_under = apply_undersampling(X_3, y_3)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "aimenv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}