{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Начало лабораторной\n",
"\n",
"https://www.kaggle.com/datasets/nikhil1e9/goodreads-books?resource=download\n",
"Данный набор данных представляет книги с Goodreads\n",
"Примр цели — создание системы рекомендаций для книг, прогнозирование рейтингов для новых книг.\n",
"Входные данные: Название, Автор, Средняя оценка, Общее количество оценок, Количество добавлений на полки, Год публикации, Описание, Изображение"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Столбцы в Popular-Books:\n",
"Index(['Title', 'Author', 'Score', 'Ratings', 'Shelvings', 'Published',\n",
" 'Description', 'Image'],\n",
" dtype='object')\n"
]
}
],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"df_books = pd.read_csv(\".//static//csv//Popular-Books.csv\")\n",
"\n",
"print(\"Столбцы в Popular-Books:\")\n",
"print(df_books.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Посмотрим краткое содержание датасета."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Информация о датасете Popular-Books:\n",
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 27621 entries, 0 to 27620\n",
"Data columns (total 8 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Title 27621 non-null object \n",
" 1 Author 27621 non-null object \n",
" 2 Score 27621 non-null float64\n",
" 3 Ratings 27621 non-null int64 \n",
" 4 Shelvings 27621 non-null int64 \n",
" 5 Published 27621 non-null int64 \n",
" 6 Description 27549 non-null object \n",
" 7 Image 27621 non-null object \n",
"dtypes: float64(1), int64(3), object(4)\n",
"memory usage: 1.7+ MB\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Title</th>\n",
" <th>Author</th>\n",
" <th>Score</th>\n",
" <th>Ratings</th>\n",
" <th>Shelvings</th>\n",
" <th>Published</th>\n",
" <th>Description</th>\n",
" <th>Image</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>The English Assassin (Gabriel Allon, #2)</td>\n",
" <td>Daniel Silva</td>\n",
" <td>4.16</td>\n",
" <td>40122</td>\n",
" <td>44602</td>\n",
" <td>2002</td>\n",
" <td>The Unlikely Spy, Daniel Silva's extraordinary...</td>\n",
" <td>https://images-na.ssl-images-amazon.com/images...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Pompeii</td>\n",
" <td>Robert Harris</td>\n",
" <td>3.86</td>\n",
" <td>46097</td>\n",
" <td>64840</td>\n",
" <td>2003</td>\n",
" <td>With his trademark elegance and intelligence R...</td>\n",
" <td>https://images-na.ssl-images-amazon.com/images...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Notorious RBG: The Life and Times of Ruth Bade...</td>\n",
" <td>Irin Carmon</td>\n",
" <td>4.19</td>\n",
" <td>59670</td>\n",
" <td>171959</td>\n",
" <td>2015</td>\n",
" <td>You can't spell truth without Ruth.Only Ruth B...</td>\n",
" <td>https://images-na.ssl-images-amazon.com/images...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>The Abolition of Man</td>\n",
" <td>C.S. Lewis</td>\n",
" <td>4.11</td>\n",
" <td>34390</td>\n",
" <td>52770</td>\n",
" <td>1943</td>\n",
" <td>Alternative cover for ISBN: 978-0060652944The ...</td>\n",
" <td>https://images-na.ssl-images-amazon.com/images...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Portrait of a Murderer</td>\n",
" <td>Anne Meredith (Pseudonym)</td>\n",
" <td>3.38</td>\n",
" <td>1129</td>\n",
" <td>1739</td>\n",
" <td>1933</td>\n",
" <td>'Adrian Gray was born in May 1862 and met his ...</td>\n",
" <td>https://images-na.ssl-images-amazon.com/images...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Title \\\n",
"0 The English Assassin (Gabriel Allon, #2) \n",
"1 Pompeii \n",
"2 Notorious RBG: The Life and Times of Ruth Bade... \n",
"3 The Abolition of Man \n",
"4 Portrait of a Murderer \n",
"\n",
" Author Score Ratings Shelvings Published \\\n",
"0 Daniel Silva 4.16 40122 44602 2002 \n",
"1 Robert Harris 3.86 46097 64840 2003 \n",
"2 Irin Carmon 4.19 59670 171959 2015 \n",
"3 C.S. Lewis 4.11 34390 52770 1943 \n",
"4 Anne Meredith (Pseudonym) 3.38 1129 1739 1933 \n",
"\n",
" Description \\\n",
"0 The Unlikely Spy, Daniel Silva's extraordinary... \n",
"1 With his trademark elegance and intelligence R... \n",
"2 You can't spell truth without Ruth.Only Ruth B... \n",
"3 Alternative cover for ISBN: 978-0060652944The ... \n",
"4 'Adrian Gray was born in May 1862 and met his ... \n",
"\n",
" Image \n",
"0 https://images-na.ssl-images-amazon.com/images... \n",
"1 https://images-na.ssl-images-amazon.com/images... \n",
"2 https://images-na.ssl-images-amazon.com/images... \n",
"3 https://images-na.ssl-images-amazon.com/images... \n",
"4 https://images-na.ssl-images-amazon.com/images... "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(\"Информация о датасете Popular-Books:\")\n",
"df_books.info()\n",
"df_books.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Анализируем датафрейм при помощи \"ящика с усами\". Проверяет на пустые значения."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA78AAAImCAYAAACb/j2lAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABADklEQVR4nO3de5xVdb0//tdwv99hoLykIqDiBRLMa2qax8q0rEfe89LXS2q/RDPK1DyKWnlBxTh6jmmmR00lPZXmqbQ6WQqkeVQgQNFEBQZwAOUyXPbvDx+zzwwzAzOKAovn8/GYB3vW/qzPeu/F57Hm89pr7bUrSqVSKQAAAFBgrTZ2AQAAAPBBE34BAAAoPOEXAACAwhN+AQAAKDzhFwAAgMITfgEAACg84RcAAIDCE34BAAAoPOEXAACAwmuzsQsA4IN14oknZuLEifWWtW3bNn369MlBBx2Ub37zm+nevftGqg4A4MMh/AJsAXbeeedceuml5d9XrlyZF198Mdddd12mTp2ae+65JxUVFRuxQgCAD5bwC7AF6NKlS/bYY496y0aMGJF33nknN954Y5577rkGzwMAFInP/AJswYYOHZokeeONN5Ik06ZNyznnnJNPfOIT2WWXXbL//vvniiuuyPLly8vr1NTUZOzYsfnUpz6V3XbbLZ/73Ofyi1/8ovz8iSeemMGDBzf6M3v27CTJ6NGjc+KJJ+aBBx7IQQcdlGHDhuWrX/1qpk2bVq++N954I6NGjcrIkSOz++6756tf/WqmTJlSr83Pf/7zRrc1evToeu1+97vf5Ytf/GJ23XXX7LvvvrniiiuydOnS8vMTJkxosu4JEyY0u6bZs2c3WKf2NR988MHl3w8++OAGNY4aNSqDBw/O008/XV42ffr0nHHGGRk+fHiGDx+es88+O6+99lqD/8u1PfnkkznuuOPy8Y9/PHvttVfOP//8vPnmmw1eb+3/SVN1rVmzJrfeemsOPfTQDB06NIcddlh+9rOf1VvnxBNPzIknnlhv2dNPP93gtTz//PM57bTTstdee2X48OE588wzM2PGjCbXmT59eg455JAcc8wxjb7G9bVvbB+v/brX/n9JknvvvTeDBw/OTTfdVG87dX/Wfr0AbPqc+QXYgs2aNStJsvXWW2fevHk5/vjjs8cee+Tqq69Ou3bt8qc//Sm33357+vXrl9NPPz1JcsEFF+SPf/xjzjrrrOy+++754x//mNGjR6dt27b53Oc+l6ThZdZ/+MMfMn78+Hrbnjp1al5++eWMGjUq3bt3z4033pgTTjghjzzySPr165eFCxfmmGOOSceOHXPxxRenY8eO+elPf5rjjz8+DzzwQHbYYYckyfLly7Prrrvme9/7Xrnvc845p962fvnLX+aCCy7IEUcckW9+85t5/fXXc/3112fmzJm5/fbb613yPW7cuPTt2zdJUlVVVa+v5tb0XkyePDm//vWv6y2bNWtWjjnmmGy//fb5wQ9+kFWrVmX8+PE59thj8/DDD6d3796N9vXQQw/l29/+dj73uc/ljDPOyFtvvZUbb7wxX/nKV/KLX/yiyfUa8/3vfz8TJkzIGWeckWHDhmXSpEm58sors3jx4px99tnN7uepp57K1772tey111658sors2LFitxyyy055phj8vOf/7zRffejH/0oQ4cOzVlnndWsbbS0fWMWLVqUsWPHNvrcJZdckl122SXJu1dTALB5EX4BtgClUimrVq0q/75o0aJMnDgx48ePz7BhwzJ06NA8+eST2WmnnXLDDTeUJ/b77LNPnnzyyTz99NM5/fTTM3369Dz22GP57ne/m69+9atJkr333juvv/56nn766XL4Xfsy65dffrlBTUuWLMm//du/Zc8990yS7LbbbjnkkENy55135oILLshPf/rTVFdX55577slHP/rRJMkBBxyQz3zmM7nhhhty4403JkmWLVuWPn361Nteu3bt6r32a665Jvvvv3
+uueaa8vKPfexjOfnkk/PHP/4xBx54YHn5TjvtlK222ipJGpwVbW5NLbVmzZpcccUV2WWXXfLiiy+Wl48bNy4dO3bMHXfcUf4/2XvvvXPIIYfkP/7jP/Ltb3+70b6uueaa7Lfffrn22mvLy4cPH57PfOYzue2223LhhRc2q65Zs2bl5z//eUaNGlV+82O//fZLRUVFbrnllhx33HHp2bNns/q69tprs+222+bWW29N69aty30deuihufHGG3PDDTfUa//qq6/mz3/+c/7rv/4rO+6443r7b2n7ptx44435yEc+krfeeqvBcwMHDvTxAIDNmMueAbYAkyZNyi677FL+2WeffTJq1KgMHTo01157bSoqKrLffvvlrrvuSvv27TNz5sz8/ve/z/jx47Nw4cLU1NQkSf72t78lST796U/X6/+mm27K5Zdf3qKattpqq3LwTZJ+/fqVzywmyV//+tfstNNOqayszKpVq7Jq1aq0atUqBxxwQP7yl7+U13vzzTfTtWvXJrfz8ssvZ86cOTn44IPL/axatSojRoxIly5d8uSTTza75ubWlLwbQutur1QqNdnvvffem6qqqgZnUp966qmMHDkyHTp0KPfTpUuX7Lnnng22V2vWrFmpqqoqvxFRa5tttsmwYcMa3Pl7XZ566qmUSqUG++7ggw/OihUryuMh+b83WGp/1qxZU35u6dKlef7553P44YeXg2+SdOvWLQcddFCDmpYuXZrrr78+e+21V7OC7Lrar6uutU2fPj333XdfLr744vVuE4DNjzO/AFuAXXbZJZdddlmSpKKiIu3bt8+AAQPqXbq5Zs2aXHfddbn77ruzdOnSDBgwILvttlvat29fblNdXZ0kLbpstimVlZUNlvXu3bt85rO6ujqvvvpq+TLTtS1btiwdO3bM66+/3mSbujVfdtll5X1Q17x585pdc3NqqnXRRRfloosuqvd87dnitfu84YYbcuGFFza4lLa6ujqPPPJIHnnkkQbr9erVq8kak6RPnz4NnuvTp0+Dz0yvS21fn/3sZxt9fu7cueXHtW+wNGbJkiUplUpN1rRkyZJ6y84888x069YtDzzwQLPqXFf7hx56KA899FCz+rniiivy2c9+NsOGDWtWewA2L8IvwBagc+fO2XXXXdfZ5tZbb80dd9yRyy67LJ/+9KfLZ1O/9KUvldt069Ytybuffe3fv395+UsvvZTq6up8/OMfb3ZNjV1WOn/+/HKw7tq1a0aOHNnkJbrt2rXLmjVr8txzz+Xoo49ucju1NV944YUZOXJkg+fX/o7jdX3lU3NqqnXOOefUu5z65ptvzvTp0xusc8MNN2SbbbbJF7/4xQZnQLt27Zp99tknp5xySoP12rRp/E94jx49kry7L9dWVVXV7MuUk//bdz/96U/TuXPnBs9/5CMfKT+u+wZLkrz44ovlz3137do1FRUVTdZUW3OtCy+8ML/5zW/yjW98I3ffffd6P1+7rvYHHXRQvTPqf/jDHzJu3LgGfTz66KN54YUX6l0qDkCxuOwZgCTvXtI8cODAHH300eXgO3fu3EyfPr18qWhtuH388cfrrXvNNddkzJgxLdreK6+8kpdeeqn8+9y5c/Pss89m7733TpKMHDkys2bNynbbbZddd921/PPwww/ngQceSOvWrfPMM89k6dKl2WuvvZrczvbbb5/evXtn9uzZ9fqprKzMtddeWz4TWvsa616Wu7bm1FTrox/9aL02awe85N3LbO+///5cfPHFjYbukSNHZubMmdlpp53K/QwdOjR33HFHfvvb3zZa43bbbZe+ffvmV7/6Vb3lr732Wv7+979n+PDhTb6+tdVelv7WW2/Vey0LFy7MDTfcUD4znPzfGyy1P9ttt135uU6dOmXo0KF59NFHs3r16vLyJUuW5A9/+EODN02GDh2acePG5fXXX8+PfvSj9da5rvY9evSoV1djZ99ramrywx/+MGeffXb5ZmcAFI8zvwAkefeGUz
/+8Y9z6623Zo899sirr76aW265JTU1NeXLeYcMGZJ/+Zd/yY9+9KMsX748O+20U/70pz/liSeeaPRs2rqUSqWceea
"text/plain": [
"<Figure size 1200x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Проверка на пустые значения в наборе данных 'Popular Books':\n",
"Description 72\n",
"dtype: int64\n",
"\n",
"\n"
]
}
],
"source": [
"# Настройка стиля графиков\n",
"sns.set_theme(style=\"whitegrid\")\n",
"\n",
"plt.figure(figsize=(12, 6))\n",
"sns.boxplot(x='Score', data=df_books)\n",
"plt.title('Распределение оценок книг')\n",
"plt.xlabel('Оценка')\n",
"plt.show()\n",
"\n",
"# Проверка на пустые значения для каждого набора данных\n",
"def check_missing_values(dataframe, name):\n",
" missing_values = dataframe.isnull().sum()\n",
" print(f\"Проверка на пустые значения в наборе данных '{name}':\")\n",
" print(missing_values[missing_values > 0]) # Отображаем только столбцы с пропущенными значениями\n",
" print(\"\\n\")\n",
"\n",
"check_missing_values(df_books, \"Popular Books\") "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Удаляем все найденные пустые значения."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"В наборе данных 'Popular Books' было удалено 72 строк с пустыми значениями.\n"
]
}
],
"source": [
"# Функция для удаления строк с пустыми значениями\n",
"def drop_missing_values(dataframe, name):\n",
" before_shape = dataframe.shape # Размер до удаления\n",
" cleaned_dataframe = dataframe.dropna() # Удаляем строки с пустыми значениями\n",
" after_shape = cleaned_dataframe.shape # Размер после удаления\n",
" print(f\"В наборе данных '{name}' было удалено {before_shape[0] - after_shape[0]} строк с пустыми значениями.\")\n",
" return cleaned_dataframe\n",
"\n",
"cleaned_df_books = drop_missing_values(df_books, \"Popular Books\")"
]
},
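{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch, on a made-up toy DataFrame rather than the real dataset, of the two steps above: `isnull().sum()` counts missing values per column, and `dropna()` removes every row that has at least one. The `toy` frame and its values are assumptions for illustration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data: the second row has no Description\n",
"toy = pd.DataFrame({\n",
"    'Title': ['A', 'B', 'C'],\n",
"    'Description': ['first', None, 'third'],\n",
"})\n",
"\n",
"# Count missing values per column and keep only the non-zero counts\n",
"missing = toy.isnull().sum()\n",
"print(missing[missing > 0])\n",
"\n",
"# Drop every row that has at least one missing value\n",
"cleaned = toy.dropna()\n",
"print(len(toy) - len(cleaned), 'row(s) dropped')"
]
},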
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Очистка данных от шумов"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA1YAAAImCAYAAABQCRseAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAACLJUlEQVR4nO3dd3gU1f7H8c8GSKEkIQlVvFQTIAmQSO8oIl4VEL12BUVQ8doQFazoVbEBSrlKEbGAiqKAXhTlqohGmoD0Xi4dkgABUknm9we/XdnU7bPZvF/Pw6OZPTNzZvdsMt8553yPxTAMQwAAAAAAlwWZXQEAAAAAKO8IrAAAAADATQRWAAAAAOAmAisAAAAAcBOBFQAAAAC4icAKAAAAANxEYAUAAAAAbiKwAgAAAAA3EVgBKBcKr2UeCGubl/drKO/1BwDAkwisADht+/btevTRR9WlSxclJCSoa9eueuSRR7R161aPnys3N1evvPKKvv76a9u2HTt26JZbbvH4uay+/PJLxcXF2f1r0aKF2rVrp7vvvlt//PGHreykSZMUFxfn1PGPHDmiYcOG6eDBg27Xddu2bRowYIASEhL097//vdgyo0aNKnI9cXFxSkpK0rXXXqv333/f6fMW9xnExcVp0qRJLl1HeXHixAmNHTtWvXv3VkJCgtq3b69Bgwbphx9+MLtqDjtw4IB69eql9PR0SdJll11m1y6aN2+uDh066L777nPpO33u3DmNGjVKSUlJSk5O1vLlyz19CQ4rKCjQ559/rttuu00dOnRQcnKyrrvuOn300UfKzc21lbN+5w8cOGBKPS+77DKNGjXK48d9++23NWbMGI8fF0DxKptdAQDly44dO3TTTTepTZs2euaZZxQdHa0jR47o448/1o033qgPP/xQbdq08dj5jh07pg8++EBjx461bfvuu++0du1aj52jJJMnT1atWrUknb9BS01N1ZQpUzRo0CB98cUXat68uUvHTUlJ0dKlSz1SxylTpujQoUOaMmWKoqKiSixXq1YtTZ482fazYRhKTU3Vp59+qldffVUhISG69dZbHT5vcZ/BZ599prp16zp/EeVEdna2brvtNuXn52vYsGFq2LChTp8+rW+//Vb//Oc/9dRTT2nQoEFmV7NUhmFo9OjRGjRokF176dGjh4YPHy7pfGB07NgxzZw5U4MGDdKiRYsUHR3t8DmWLVumr776SsOHD1fnzp3VsmVLj1+HI7KysnTffffpzz//1C233KJ77rlHVapU0fLly/X666/rl19+0ZQpUxQcHGxK/S40efJkVa9e3ePHHTZsmK688kpdeeWV6tSpk8ePD8AegRUAp7z//vuqWbOmpk+frsqV//oV0rt3b/Xt21f//ve/NW3aNBNr6DktWrRQgwYN7La1bNlSV1xxhebMmaMXX3zRpJr95cSJE4qNjVWPHj1KLRccHFxswNuzZ0/17t1bX375pVOBVXE8GVD7o++++067du3S4sWL1ahRI9v23r17Kzs7WxMnTtTtt9+uSpUqmVfJMvzwww/avn273nvvPbvtUVFRRT6/xMRE9e7dW999951uu+02h89x8uRJSdLAgQN18cUXu1tll40dO1Zr1qzRRx99ZHdtXbt2VfPmzfXYY4/p008/1Z133mlaHa28FXyGhYVp0KBBGjt2rBYuXOiVcwD4C0MBATglNTVVhmGooKDAbnvVqlX11FNP6aqrrrLbPn/+fF133XVq3bq1evbsqXHjxtkNwVmyZIluvfVWJSUlKSEhQX379tXs2bMlnR+ydPnll0uSRo8ercsuu0yTJk2y9bxcOPSsoKBA06ZN0xVXXKGEhARdeeWV+uijj+zqcscdd2jkyJF66KGH1KZNG911111OX3+DBg1Us2ZNHTp0qMQyixYt0sCBA5WUlKQuXbroueee06lTpySdH3I0evRoSdLll19e6vCfY8eOafTo0erRo4datWqlG264Qf/973
9tr8fFxWnlypVatWqV4uLi9OWXXzp9PVWqVFFYWJgsFottW3Z2tsaNG6c+ffooISFBycnJuuuuu7RlyxZJKvEzuPD/V6xYobi4OP3++++6++671bp1a3Xp0kVvvPGG8vPzbec6c+aMnnvuOXXq1ElJSUl69NFHNWvWLLvhlf/73/903333qUOHDmrdurVuuummUnv8nn32WXXp0sXuPJL08ssvq0OHDsrLy1N2drbGjBmj7t2729pd4WCjsNTUVEkq0vYl6d5779Xw4cPt2va6det09913Kzk5WR07dtSIESN09OhR2+tlfb7W93Ty5MkaOHCgWrVqZXvfDx06pBEjRqh9+/Zq3bq1Bg0apM2bN5daf0maOnWqrrzySod6aSIiIord/vnnn+vqq69WQkKCevbsqUmTJtne61GjRtnadO/evXXHHXdIknJycjRlyhT17dtXiYmJ6tOnj6ZNm2b3Xpb0/czJydHrr7+uHj16KCEhQddee60WLVpUat3T09M1b948XX/99cUG/Ndcc43uvvtu1alTp8RjrF69Wrfffrtat26t9u3b68knn7QNn7RatWqVhgwZonbt2ikhIcH2O8p6XQcOHFBcXJy+/fZbPfTQQ0pKSlL79u31zDPPKDMz03acC4cCOrpPXl6e3nzzTXXv3l2tWrXSkCFDNH/+/CJDGq+55hrt2LFDP//8c6nvGQAPMADACbNnzzZiY2ON6667zvj444+NnTt3GgUFBcWW/fjjj43Y2Fjj6aefNn755Rdj9uzZRuvWrY1nn33WMAzD+Omnn4zY2FjjpZdeMlJSUowff/zRuOeee4zY2Fhj3bp1Rk5OjvH9998bsbGxxoQJE4xNmzYZhw8fNp566ikjNjbWWLt2rXH48GHDMAzj2WefNeLj442JEycay5YtM8aPH280b97cmDx5sq0+t99+u9GyZUtj1KhRRkpKivHrr78WW+958+YZsbGxxv79+4u8lp6ebjRv3tx44YUXDMMwjIkTJxqxsbG216dMmWLExcUZL7zwgu2a27dvb1x77bVGVlaWkZaWZkyYMMGIjY01vv/+e2Pfvn3F1uH48eNGt27djN69extfffWV8fPPPxsPPfSQERcXZyxYsMAwDMNYu3atMWDAAGPAgAHG2rVrjbS0tGKP9eSTTxq9evUy8vLybP9ycnKM/fv3G6+88ooRGxtrfPzxx7byDz74oNGpUyfj888/N1asWGHMnTvX6NKli3HVVVcZBQUFJX4GsbGxxsSJEw3DMIzly5cbsbGxRufOnY3JkycbKSkptnN98skntnPdcccdRtu2bY3Zs2cbP/30kzF06FAjISHB9p7m5+cbffv2Ne68807j559/Nn799Vdj2LBhRosWLYy9e/cWe72rVq0yYmNjjd9++822LT8/3+jSpYvtc3v22WeNXr16Gd98842xfPly4/XXXzdiY2ONL774othjGoZhbN261WjZsqXRtWtXY9KkScbatWuN3NzcYstu2rTJiI+PN2699Vbjhx9+ML777jvjiiuuMK6++mojLy/Poc/X+p7Gx8cbM2fONH766Sdj+/btRlpamtGtWzejT58+xsKFC40ffvjBuP322402bdoYO3fuLLH+u3btMmJjY41ly5bZbe/Vq5fxxBNP2LWNgwcPGk888YTRuXNnu3b17rvvGnFxcca//vUvY9myZca0adOMxMREY/To0YZhGMa+ffvs2veOHTuMgoICY/DgwUabNm2MGTNmGL/++qsxbtw4o0WLFsYzzzxjO3Zx38+CggJjyJAhRlJSkvH+++8bv/zyi/Hss88asbGxxldffVXitX7zzTdGbGys8fPPP5dY5kKFv/MrV6404uPjjSFDhhg//vij8dVXXxk9e/Y0rr76aiMrK8swDMPYsmWL0bJlS2PEiBHGsmXLjF9++cV4/PHHjdjYWOObb74xDMMw9u/fb8TGxhrt2rUzXn31VSMlJcX2Hr755pt2n8GTTz7p1D
6jRo0yEhISjKlTpxq//PKL8cQTT9i+O4V/d918883GiBEjHHovALiOwAqA09566y0jMTHRiI2NNWJjY40OHToYjz3
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Выбросы в Popular Books:\n",
" Title Author Score \\\n",
"32 The Nice Old Man and the Pretty Girl Italo Svevo 3.11 \n",
"96 The Killer Stewart Edward White 3.22 \n",
"156 The Three of Us Ore Agbaje-Williams 2.98 \n",
"197 Old Bugs H.P. Lovecraft 2.82 \n",
"249 Lair of the White Worm Bram Stoker 2.78 \n",
"... ... ... ... \n",
"27183 The Mysterious Ship H.P. Lovecraft 2.15 \n",
"27267 Murder in the Snow Gladys Mitchell 3.09 \n",
"27282 The End Samuel Beckett 3.18 \n",
"27527 Snuff Chuck Palahniuk 3.22 \n",
"27618 Mosquitoes William Faulkner 3.11 \n",
"\n",
" Ratings Shelvings Published \\\n",
"32 728 747 1926 \n",
"96 59 158 1919 \n",
"156 5197 24819 2023 \n",
"197 972 1142 1919 \n",
"249 4194 6625 1911 \n",
"... ... ... ... \n",
"27183 328 293 1902 \n",
"27267 784 1035 1950 \n",
"27282 1369 3025 1946 \n",
"27527 62483 95559 2008 \n",
"27618 1193 2331 1927 \n",
"\n",
" Description \\\n",
"32 ...the sin of an old man is equal to about two... \n",
"96 This book was converted from its physical edit... \n",
"156 Long-standing tensions between a husband, his ... \n",
"197 With the onset of Prohibition, the Sheehan Bil... \n",
"249 In a tale of ancient evil, Bram Stoker creates... \n",
"... ... \n",
"27183 \"The Mysterious Ship\" is a story story by Amer... \n",
"27267 A delight… An amateur sleuth to rival Miss Ma... \n",
"27282 'They didn't seem to take much interest in my ... \n",
"27527 From the master of literary mayhem and provoca... \n",
"27618 A delightful surprise, Faulkner wrote his seco... \n",
"\n",
" Image \n",
"32 https://images-na.ssl-images-amazon.com/images... \n",
"96 https://images-na.ssl-images-amazon.com/images... \n",
"156 https://images-na.ssl-images-amazon.com/images... \n",
"197 https://dryofg8nmyqjw.cloudfront.net/images/no... \n",
"249 https://images-na.ssl-images-amazon.com/images... \n",
"... ... \n",
"27183 https://images-na.ssl-images-amazon.com/images... \n",
"27267 https://images-na.ssl-images-amazon.com/images... \n",
"27282 https://images-na.ssl-images-amazon.com/images... \n",
"27527 https://images-na.ssl-images-amazon.com/images... \n",
"27618 https://images-na.ssl-images-amazon.com/images... \n",
"\n",
"[437 rows x 8 columns]\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA1cAAAImCAYAAAC/y3AgAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAACcE0lEQVR4nOzdd3hUVf4G8HdCMimkQYK0INUESA8JoUhnwUYRVBRBQCkrIIoiZVcEXRQsiAKKAiJIURGQ4qIgKyAaQZCW0EMTCC0JaaSSOb8/+M2YybQ7M3f6+3ken93MnLn3zMwZMifnvd+jEEIIEBERERERkVW8HN0BIiIiIiIid8DJFRERERERkQw4uSIiIiIiIpIBJ1dEREREREQy4OSKiIiIiIhIBpxcERERERERyYCTKyIiIiIiIhlwckVERERERCQDTq6IyOVV3wvdHfZGd/Xn4Or9J/lwLBCRJ+Hkiohkdfr0aUycOBEdO3ZETEwM7r//frz00ks4efKk7OcqLy/H22+/jS1btmhuO3PmDJ566inZz6W2YcMGREVFaf3XqlUrpKSk4Nlnn8Wff/6pabtgwQJERUWZdfxr165h9OjRuHLlitV9PXXqFPr374+YmBg89NBDettMnTpV5/lERUUhMTERffr0wRdffGH2efW9B1FRUViwYIFFz8NV3Lp1C7Nnz0bPnj0RExODtm3bYtiwYfjpp58c3TXJLl++jG7duiE3N1fnvieffBJRUVHYtm2b3sdeu3YNTz/9NGJjY9G+fXsUFhbqfD5tZdu2bXjuuefQoUMHJCQk4JFHHsEnn3yCoqIiTZt9+/YhKioK+/bts3l/9Bk6dCiGDh0q+3HXrVuH0aNHy35cIrKMt6M7QETu48yZMxg0aBASEhLw2muvISwsDNeuXcOqVavwxBNP4Msvv0RCQoJs57tx4wZWrFiB2bNna2778ccfcejQIdnOYcjChQtRp04dAIBKpUJ2djY+/vhjDBs2DOvWrUPLli0tOm5aWhp2794tSx8//vhjZGVl4eOPP0bt2rUNtqtTpw4WLlyo+VkIgezsbHz99deYM2cOfH19MXjwYMnn1fcefPPNN6hXr575T8JFlJaW4umnn0ZlZSVGjx6Nxo0bo7CwED/88APGjx+Pf/3rXxg2bJiju2mUEALTpk3DsGHDdMbLuXPncOjQIURGRuLrr79G7969dR6/YsUKHD58GO+99x7q1q2L/Px8nc+n3FQqFV599VX8+OOPGDhwIJ566inUrFkThw8fxueff44dO3Zg+fLlCA4OtlkfpJoxY4ZNjjtw4ECsXr0a69atw2OPPWaTcxCRdJxcEZFsvvjiC9SqVQtLliyBt/ff/7z07NkTDzzwAD755BMsXrzYgT2UT6tWrRAREaF1W+vWrfGPf/wDa9aswZtvvumgnv3t1q1biIyMRJcuXYy2UyqVeie9Xbt2Rc+ePbFhwwazJlf6yDmpdkY//vgjzp49i23btqFJkyaa23v27InS0lLMnz8fQ4YMQY0aNRzXSRN++uknnD59Gp9//rnOfRs2bEDDhg0xZswYTJo0CRcvXkTjxo212uTl5eGee+7RrJJevnzZ5n1eunQpvv/+eyxcuBD/+Mc/NLe3b98ebdu2xdNPP42PP/4Y06ZNs3lfTGnRooVNjqtQKDBmzBi8+eabeOSRR+Dn52eT8xCRNIwFEpFssrOzIYSASqXSuj0gIAD/+te/8OCDD2rdvnHjRjz66KOIj49H165dMXfuXJSXl2vu37FjBwYPHozExETExMTggQcewOrVqwHc/eLWo0cPAMC0adPQvXt3LFiwQLMCUzWGplKpsHjxYvzjH/9ATEwMevfujZUrV2r1ZejQoZg0aRImTJiAhIQEjBgxwuznHxERgVq1aiErK8tgm61bt2LAgAFITExEx44d8frrryM/Px/A3S+w6i+BPXr0wNSpUw0e58aNG5g2bRq6dOmCuLg4PPbYY/
jf//6nuT8qKgp//PEH9u/fj6ioKGzYsMHs5+Pj4wN/f38oFArNbaWlpZg7dy569eqFmJgYJCUlYcSIEThx4gQAGHwPqv5/dTzr999/x7PPPov4+Hh07NgR7733HiorKzXnKioqwuuvv4727dsjMTEREydOxPLly7Wiln/99Rf++c9/IjU1FfHx8Rg0aJDRlb/p06ejY8eOWucBgLfeegupqamoqKhAaWkpZs6cic6dO2vGnb4JR1XZ2dkAoDP2AWDMmDEYO3as1tg+fPgwnn32WSQlJaFdu3Z4+eWXcf36dc39pt5f9Wu6cOFCDBgwAHFxcZrXPSsrCy+//DLatm2L+Ph4DBs2DMePHzfafwD47LPP0Lt3byiVSq3bKysrsXHjRnTr1g09e/ZEQEAAvvnmG6023bt3x4YNG5CVlYWoqChMnTpV5/OpduDAAQwZMgTx8fFo27YtpkyZohVD3LBhA1q3bo1vv/0WHTt2RNu2bZGZmanT34qKCixbtgydO3fWmliptWnTBhMmTDA6qTl9+jTGjBmDpKQkJCUlYdy4cbh06ZJWm5MnT2L8+PFo164doqOj0alTJ8yaNQulpaWaNlFRUVi9ejX+/e9/o23btkhMTMSLL76oGReAbixQymMA4PPPP0ePHj0QFxeHJ598Ej///LNOvLFbt24oKyvD+vXrDT5XIrITQUQkk9WrV4vIyEjx6KOPilWrVonMzEyhUqn0tl21apWIjIwU//73v8Uvv/wiVq9eLeLj48X06dOFEELs3LlTREZGilmzZom0tDTx888/i5EjR4rIyEhx+PBhUVZWJrZv3y4iIyPFvHnzxLFjx8TVq1fFv/71LxEZGSkOHTokrl69KoQQYvr06SI6OlrMnz9f7NmzR3zwwQeiZcuWYuHChZr+DBkyRLRu3VpMnTpVpKWliV9//VVvv9evXy8iIyPFpUuXdO7Lzc0VLVu2FG+88YYQQoj58+eLyMhIzf0ff/yxiIqKEm+88YbmObdt21b06dNHlJSUiJycHDFv3jwRGRkptm/fLi5evKi3Dzdv3hSdOnUSPXv2FN99953YtWuXmDBhgoiKihKbNm0SQghx6NAh0b9/f9G/f39x6NAhkZOTo/dYU6ZMEd26dRMVFRWa/8rKysSlS5fE22+/LSIjI8WqVas07V944QXRvn178e2334p9+/aJtWvXio4dO4oHH3xQqFQqg+9BZGSkmD9/vhBCiL1794rIyEjRoUMHsXDhQpGWlqY511dffaU519ChQ0VycrJYvXq12Llzpxg1apSIiYnRvKaVlZXigQceEM8884zYtWuX+PXXX8Xo0aNFq1atxIULF/Q+3/3794vIyEjx22+/aW6rrKwUHTt21Lxv06dPF926dRPff/+92Lt3r3j33XdFZGSkWLdund5jCiHEyZMnRevWrcX9998vFixYIA4dOiTKy8v1tj127JiIjo4WgwcPFj/99JP48ccfxT/+8Q/x8MMPi4qKCknvr/o1jY6OFsuWLRM7d+4Up0+fFjk5OaJTp06iV69eYvPmzeKnn34SQ4YMEQkJCSIzM9Ng/8+ePSsiIyPFnj17dO77+eefRWRkpDh69KgQQoh//etfol27dqKsrEzrOY0aNUp07NhRHDp0SFy+fFnn8ymEEH/88YeIjo4Wzz33nPj555/Fd999J7p27SoefvhhUVJSIoT4+zP2wAMPiJ07d4oNGzbo/Xfk0KFDIjIyUqxevdrg86pKPe727t0rhBDi3LlzIjExUQwcOFBs375dbN26VfTp00d07NhRZGdnCyGEuH79ukhKShLPPvus2Llzp/jtt9/E7NmzRWRkpPjss8+03os2bdqIqVOnij179og1a9aI2NhYMXHiRE2bIUOGiCFDhpj1mAULFoiWLVuK9957T+zZs0e8/fbbIjY2Vut5qL3yyiti0KBBkl4LIrIdTq6ISFYffvih5pd/ZGSkSE1NFa+88oo4cuSIpk1lZaVo3769GD
t2rNZjly5dKh599FFRXl4ulixZIqZMmaJ1/61bt7S+1Fy6dElERkaK9evXa9pUn9CcO3dOREVFaX0REkKIefPmidj
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Визуализация перед очисткой\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(cleaned_df_books['Ratings'], cleaned_df_books['Score'])\n",
"plt.xlabel('Ratings')\n",
"plt.ylabel('Score')\n",
"plt.title('Scatter Plot of Ratings vs Score (Before Cleaning)')\n",
"plt.show()\n",
"\n",
"# Рассчитываем квартиль 1 (Q1) и квартиль 3 (Q3) для Score\n",
"Q1 = cleaned_df_books[\"Score\"].quantile(0.25)\n",
"Q3 = cleaned_df_books[\"Score\"].quantile(0.75)\n",
"\n",
"# Рассчитываем межквартильный размах (IQR)\n",
"IQR = Q3 - Q1\n",
"\n",
"# Определяем порог для выбросов\n",
"threshold = 1.5 * IQR\n",
"lower_bound = Q1 - threshold\n",
"upper_bound = Q3 + threshold\n",
"\n",
"# Фильтруем выбросы\n",
"outliers = (cleaned_df_books[\"Score\"] < lower_bound) | (cleaned_df_books[\"Score\"] > upper_bound)\n",
"\n",
"# Вывод выбросов\n",
"print(\"Выбросы в Popular Books:\")\n",
"print(cleaned_df_books[outliers])\n",
"\n",
"# Заменяем выбросы на медианные значения\n",
"median_score = cleaned_df_books[\"Score\"].median()\n",
"cleaned_df_books.loc[outliers, \"Score\"] = median_score\n",
"\n",
"# Визуализация данных после обработки\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(cleaned_df_books['Ratings'], cleaned_df_books['Score'])\n",
"plt.xlabel('Ratings')\n",
"plt.ylabel('Score')\n",
"plt.title('Scatter Plot of Ratings vs Score (After Cleaning)')\n",
"plt.show()"
]
},
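{
"cell_type": "markdown",
"metadata": {},
"source": [
"The IQR rule used above can be sketched as a small reusable helper. This is only an illustration: the name `iqr_bounds`, its parameter `k`, and the toy `scores` series are assumptions and not part of the original lab."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"def iqr_bounds(series, k=1.5):\n",
"    # Return the (lower, upper) bounds Q1 - k*IQR and Q3 + k*IQR\n",
"    q1 = series.quantile(0.25)\n",
"    q3 = series.quantile(0.75)\n",
"    iqr = q3 - q1\n",
"    return q1 - k * iqr, q3 + k * iqr\n",
"\n",
"# Hypothetical scores; 1.0 lies far below the others\n",
"scores = pd.Series([3.9, 4.0, 4.1, 4.2, 4.0, 1.0])\n",
"\n",
"lower, upper = iqr_bounds(scores)\n",
"is_outlier = (scores < lower) | (scores > upper)\n",
"\n",
"# Replace the flagged values with the median of the series\n",
"cleaned = scores.mask(is_outlier, scores.median())\n",
"print(cleaned.tolist())"
]
},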
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Разбиение набора данных на обучающую, контрольную и тестовую выборки"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Размер обучающей выборки: 16529\n",
"Размер контрольной выборки: 5510\n",
"Размер тестовой выборки: 5510\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Разделение на обучающую и тестовую выборки\n",
"train_df, test_df = train_test_split(cleaned_df_books, test_size=0.2, random_state=42)\n",
"\n",
"# Разделение обучающей выборки на обучающую и контрольную\n",
"train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)\n",
"\n",
"print(\"Размер обучающей выборки:\", len(train_df))\n",
"print(\"Размер контрольной выборки:\", len(val_df))\n",
"print(\"Размер тестовой выборки:\", len(test_df))"
]
},
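{
"cell_type": "markdown",
"metadata": {},
"source": [
"A short sketch of why the two-step split gives a 60/20/20 partition: the first call holds out 20% as the test set, and the second takes 25% of the remaining 80%, which is 20% of the original. The 100-row `toy` frame is a made-up example, not the book data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Hypothetical frame with 100 rows\n",
"toy = pd.DataFrame({'x': range(100)})\n",
"\n",
"# Step 1: hold out 20% of the rows as the test set\n",
"rest, test = train_test_split(toy, test_size=0.2, random_state=42)\n",
"\n",
"# Step 2: 25% of the remaining 80% becomes the validation set\n",
"train, val = train_test_split(rest, test_size=0.25, random_state=42)\n",
"\n",
"# Print the sizes of the three parts\n",
"print(len(train), len(val), len(test))"
]
},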
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Видим недостаток баланса:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Распределение Ratings в обучающей выборке:\n",
"Ratings\n",
"70 18\n",
"55 18\n",
"16100 18\n",
"61 18\n",
"162 17\n",
" ..\n",
"15370 1\n",
"4510 1\n",
"50015 1\n",
"24791 1\n",
"16244 1\n",
"Name: count, Length: 10844, dtype: int64\n",
"\n",
"Распределение Ratings в контрольной выборке:\n",
"Ratings\n",
"86 9\n",
"246 8\n",
"66 8\n",
"83 8\n",
"237 8\n",
" ..\n",
"15184 1\n",
"65771 1\n",
"6498 1\n",
"457617 1\n",
"316921 1\n",
"Name: count, Length: 4435, dtype: int64\n",
"\n",
"Распределение Ratings в тестовой выборке:\n",
"Ratings\n",
"136 11\n",
"100 11\n",
"159 8\n",
"55 8\n",
"71 8\n",
" ..\n",
"45669 1\n",
"2055 1\n",
"179534 1\n",
"16031 1\n",
"1108 1\n",
"Name: count, Length: 4428, dtype: int64\n",
"\n"
]
}
],
"source": [
"def check_balance(df, name):\n",
" counts = df['Ratings'].value_counts()\n",
" print(f\"Распределение Ratings в {name}:\")\n",
" print(counts)\n",
" print()\n",
"\n",
"check_balance(train_df, \"обучающей выборке\")\n",
"check_balance(val_df, \"контрольной выборке\")\n",
"check_balance(test_df, \"тестовой выборке\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Используем oversample и undersample"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Оверсэмплинг:\n",
"Распределение Ratings в обучающей выборке:\n",
"Ratings\n",
"2906 18\n",
"647 18\n",
"84803 18\n",
"52669 18\n",
"4880 18\n",
" ..\n",
"6093 18\n",
"2341 18\n",
"29423 18\n",
"93667 18\n",
"224935 18\n",
"Name: count, Length: 10844, dtype: int64\n",
"\n",
"Распределение Ratings в контрольной выборке:\n",
"Ratings\n",
"19873 9\n",
"224 9\n",
"1896 9\n",
"39208 9\n",
"9145 9\n",
" ..\n",
"10122 9\n",
"132 9\n",
"53626 9\n",
"17870 9\n",
"88623 9\n",
"Name: count, Length: 4435, dtype: int64\n",
"\n",
"Распределение Ratings в тестовой выборке:\n",
"Ratings\n",
"141477 11\n",
"1441 11\n",
"2471 11\n",
"17264 11\n",
"637349 11\n",
" ..\n",
"556 11\n",
"20224 11\n",
"24353 11\n",
"719 11\n",
"7381 11\n",
"Name: count, Length: 4428, dtype: int64\n",
"\n",
"Андерсэмплинг:\n",
"Распределение Ratings в обучающей выборке:\n",
"Ratings\n",
"9282201 1\n",
"1 1\n",
"2 1\n",
"3 1\n",
"4 1\n",
" ..\n",
"19 1\n",
"18 1\n",
"17 1\n",
"16 1\n",
"15 1\n",
"Name: count, Length: 10844, dtype: int64\n",
"\n",
"Распределение Ratings в контрольной выборке:\n",
"Ratings\n",
"5202524 1\n",
"8 1\n",
"9 1\n",
"2058282 1\n",
"1900499 1\n",
" ..\n",
"15 1\n",
"14 1\n",
"13 1\n",
"12 1\n",
"11 1\n",
"Name: count, Length: 4435, dtype: int64\n",
"\n",
"Распределение Ratings в тестовой выборке:\n",
"Ratings\n",
"9596885 1\n",
"5 1\n",
"9 1\n",
"10 1\n",
"11 1\n",
" ..\n",
"28 1\n",
"27 1\n",
"25 1\n",
"23 1\n",
"22 1\n",
"Name: count, Length: 4428, dtype: int64\n",
"\n"
]
}
],
"source": [
"from imblearn.over_sampling import RandomOverSampler\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"def oversample(df, target_column):\n",
" X = df.drop(target_column, axis=1)\n",
" y = df[target_column]\n",
" \n",
" oversampler = RandomOverSampler(random_state=42)\n",
" x_resampled, y_resampled = oversampler.fit_resample(X, y) # type: ignore\n",
" \n",
" resampled_df = pd.concat([x_resampled, y_resampled], axis=1)\n",
" return resampled_df\n",
"\n",
"def undersample(df, target_column):\n",
" X = df.drop(target_column, axis=1)\n",
" y = df[target_column]\n",
" \n",
" undersampler = RandomUnderSampler(random_state=42)\n",
" x_resampled, y_resampled = undersampler.fit_resample(X, y) # type: ignore\n",
" \n",
" resampled_df = pd.concat([x_resampled, y_resampled], axis=1)\n",
" return resampled_df\n",
"\n",
"train_df_oversampled = oversample(train_df, 'Ratings')\n",
"val_df_oversampled = oversample(val_df, 'Ratings')\n",
"test_df_oversampled = oversample(test_df, 'Ratings')\n",
"\n",
"train_df_undersampled = undersample(train_df, 'Ratings')\n",
"val_df_undersampled = undersample(val_df, 'Ratings')\n",
"test_df_undersampled = undersample(test_df, 'Ratings')\n",
"\n",
"# Проверка сбалансированности после oversampling\n",
"print(\"Оверсэмплинг:\")\n",
"check_balance(train_df_oversampled, \"обучающей выборке\")\n",
"check_balance(val_df_oversampled, \"контрольной выборке\")\n",
"check_balance(test_df_oversampled, \"тестовой выборке\")\n",
"\n",
"# Проверка сбалансированности после undersampling\n",
"print(\"Андерсэмплинг:\")\n",
"check_balance(train_df_undersampled, \"обучающей выборке\")\n",
"check_balance(val_df_undersampled, \"контрольной выборке\")\n",
"check_balance(test_df_undersampled, \"тестовой выборке\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://www.kaggle.com/datasets/gallo33henrique/bitcoin-btc-usd-stock-dataset\n",
"Данный набор данных относится к анализу и прогнозированию финансовых временных рядов, связанных с криптовалютами.\n",
"Примр цели — разработка модели машинного обучения для прогнозирования цен на основе временных рядов.\n",
"Входные данные: Дата, Цена открытия на начало торговли, Самая высокая цена, Самая низкая цена, Цена закрытия в конце торговли, Скорректированная цена закрытия, Количество проданных."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Столбцы в Popular-Books:\n",
"Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')\n"
]
}
],
"source": [
"df_btc = pd.read_csv(\".//static//csv//BTC-USD_stock_data.csv\")\n",
"\n",
"print(\"Столбцы в Popular-Books:\")\n",
"print(df_btc.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Посмотрим краткое содержание датасета"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Date Close Price_Change\n",
"0 2017-01-01 998.325012 up\n",
"1 2017-01-02 1021.750000 up\n",
"2 2017-01-03 1043.839966 up\n",
"3 2017-01-04 1154.729980 down\n",
"4 2017-01-05 1013.380005 down\n",
"\n",
"Информация о датасете BTC-USD:\n",
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 2836 entries, 0 to 2835\n",
"Data columns (total 8 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Date 2836 non-null object \n",
" 1 Open 2836 non-null float64\n",
" 2 High 2836 non-null float64\n",
" 3 Low 2836 non-null float64\n",
" 4 Close 2836 non-null float64\n",
" 5 Adj Close 2836 non-null float64\n",
" 6 Volume 2836 non-null int64 \n",
" 7 Price_Change 2836 non-null object \n",
"dtypes: float64(5), int64(1), object(2)\n",
"memory usage: 177.4+ KB\n",
"None\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>Open</th>\n",
" <th>High</th>\n",
" <th>Low</th>\n",
" <th>Close</th>\n",
" <th>Adj Close</th>\n",
" <th>Volume</th>\n",
" <th>Price_Change</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2017-01-01</td>\n",
" <td>963.658020</td>\n",
" <td>1003.080017</td>\n",
" <td>958.698975</td>\n",
" <td>998.325012</td>\n",
" <td>998.325012</td>\n",
" <td>147775008</td>\n",
" <td>up</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2017-01-02</td>\n",
" <td>998.617004</td>\n",
" <td>1031.390015</td>\n",
" <td>996.702026</td>\n",
" <td>1021.750000</td>\n",
" <td>1021.750000</td>\n",
" <td>222184992</td>\n",
" <td>up</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2017-01-03</td>\n",
" <td>1021.599976</td>\n",
" <td>1044.079956</td>\n",
" <td>1021.599976</td>\n",
" <td>1043.839966</td>\n",
" <td>1043.839966</td>\n",
" <td>185168000</td>\n",
" <td>up</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2017-01-04</td>\n",
" <td>1044.400024</td>\n",
" <td>1159.420044</td>\n",
" <td>1044.400024</td>\n",
" <td>1154.729980</td>\n",
" <td>1154.729980</td>\n",
" <td>344945984</td>\n",
" <td>down</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2017-01-05</td>\n",
" <td>1156.729980</td>\n",
" <td>1191.099976</td>\n",
" <td>910.416992</td>\n",
" <td>1013.380005</td>\n",
" <td>1013.380005</td>\n",
" <td>510199008</td>\n",
" <td>down</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date Open High Low Close \\\n",
"0 2017-01-01 963.658020 1003.080017 958.698975 998.325012 \n",
"1 2017-01-02 998.617004 1031.390015 996.702026 1021.750000 \n",
"2 2017-01-03 1021.599976 1044.079956 1021.599976 1043.839966 \n",
"3 2017-01-04 1044.400024 1159.420044 1044.400024 1154.729980 \n",
"4 2017-01-05 1156.729980 1191.099976 910.416992 1013.380005 \n",
"\n",
" Adj Close Volume Price_Change \n",
"0 998.325012 147775008 up \n",
"1 1021.750000 222184992 up \n",
"2 1043.839966 185168000 up \n",
"3 1154.729980 344945984 down \n",
"4 1013.380005 510199008 down "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Добавляем бинарную переменную 'Price_Change': 'up' - цена выросла, 'down' - цена упала\n",
"df_btc['Price_Change'] = df_btc['Close'].diff(-1).apply(lambda x: 'up' if x < 0 else 'down')\n",
"\n",
"# Удаляем строки с NaN значениями, возникшими из-за сдвига\n",
"df_btc.dropna()\n",
"\n",
"# Вывод первых строк для проверки\n",
"print(df_btc[['Date', 'Close', 'Price_Change']].head())\n",
"\n",
"print(\"\\nИнформация о датасете BTC-USD:\")\n",
"print(df_btc.info())\n",
"df_btc.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Анализируем датафрейм при помощи \"ящика с усами\". Проверяет на пустые значения."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA7YAAAImCAYAAABn6xZvAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAA/cElEQVR4nO3deZiVZf348c+wCYggYCxuCRKLMMOiIAIuIKKpfTVcEcE1wK1vkgnmkpqoKRIKAqKmkqYYImbq1zQLc0PEIguUEESwWBRhUpZheX5/cM35cRzEQSm4x9frurhqnuec59xn7jk4b577PKcgy7IsAAAAIFGVdvQAAAAA4KsQtgAAACRN2AIAAJA0YQsAAEDShC0AAABJE7YAAAAkTdgCAACQNGELAABA0oQtAAAASauyowcAsKP169cvXn/99bxtVatWjT322CO6d+8eP/jBD6JOnTo7aHRARESPHj3igw8+yH1dUFAQderUifbt28cPfvCDaNmyZUREDB06NB5//PGtHqtTp07xy1/+Mvf1/Pnz44EHHoiXXnopli5dGvXq1YsOHTrEgAEDcsf9PIsWLYojjzwybrrppujdu3eZ/dOmTYv+/fvHhAkT4uCDD85tnzx5cjz66KMxZ86c2LBhQ+y9997Rq1evOO+886JWrVq527Vo0SLveJUrV47ddtstWrduHf369Yvu3btvdXwAXxfCFiAiDjjggPjJT36S+3rdunXx97//PUaMGBGzZ8+Ohx9+OAoKCnbgCIHDDz88LrzwwoiIWL9+fSxdujR+8YtfxFlnnRVPP/101K9fPy688MI4/fTTc/cZM2ZMzJo1K0aPHp3btnk4/u53v4vLL788vvWtb8UFF1wQe++9dyxevDgeeOCBOPXUU2Ps2LHRtWvX7fo8Ro8eHePGjYtzzz03LrjggqhatWr87W9/i3vuuSf+9Kc/xcMPPxxVq1bN3f7kk0+OU045JSI2/d20bNmyeOyxx2LQoEFx5ZVXRv/+/bfr+ABSJGwBYtMvuu3atcvb1rFjx/j000/jjjvuiJkzZ5bZD/x31atXr8zrsLCwMHr27Bn/93//F3379o1999039t1337z7VKtWbYuv3/fffz+GDBkShx56aIwcOTIqV66c29erV6/o06dPDBkyJF544YWoVq3adnkOJSUlcffdd8d5550Xl156aW57ly5domnTpnHRRRfF888/H9/+9rdz+xo1alRm/Mcee2xccsklccstt0SPHj1i77333i7jA0iV99gCbEWbNm0iIuKf//xnRES8/fbbcfHFF0fnzp2jdevWceihh8YNN9wQa9asyd2npKQkRo4cGUceeWQUFRXF8ccfn7c0sl+/ftGiRYst/lm0aFFEbFpO2a9fv5g0aVJ079492rdvH2eddVa8/fbbeeP75z//GYMHD45OnTpF27Zt46yzzopZs2bl3ebRRx/d4mMNHTo073bPP/989O7dOwoLC6Nr165xww03xKpVq3L7J0+e/Lnjnjx5crnHtGjRojL3KX3OPXr0yH3do0ePMmMcPHhwtGjRIqZNm5bbNmfOnBg4cGB06NAhOnToEBdddFEsXLiwzFxu7bE+b1wrVqyIa665Jrp06RKFhYVx6qmnxquvvpp3vxYtWsSoUaPyto0aNarMEtLPevXVV6N3797Rrl27OPbYY+O5557L2z99+vQ477zzomPHjtGmTZvo0aNHjBo1KjZu3LjF8S5ZsiROPPHEOPTQQyNi08/Z0KFDY9y4cdGlS5c48MAD48ILL8xbzvtF4yyd80WLFm11/kvn6bPfiyzL4vTTT8/72V68eHF8//vfj86dO3/uz/+2+CpvE/jlL38ZJSUlcdVVV+VFbUREjRo1YsiQIXHSSSfFypUrv/RjfNYnn3wSa9asyc3j5g4//PC49NJLY5999inXsS699NJYt25dTJo0abuNDyBVztgCbMX8+fMjImKfffaJpUuXRt++fa
Ndu3Zx8803R7Vq1eLFF1+M++67Lxo0aBADBgyIiIjLLrsspk6dGhdccEG0bds2pk6dGkOHDo2qVavG8ccfHxFllz7/8Y9/jLFjx+Y99uzZs2PevHkxePDgqFOnTtxxxx1x5plnxtNPPx0NGjSI5cuXx+mnnx41atSIq6++OmrUqBEPPPBA9O3bNyZNmhT7779/RESsWbMmCgsL46qrrsod++KLL857rCeffDIuu+yy+M53vhM/+MEP4oMPPoif//znMXfu3LjvvvvylmGPHj06vvGNb0RExLJly/KOVd4xfRlvvPFGPPXUU3nb5s+fH6effno0bdo0fvazn8X69etj7Nix0adPn3jiiSeifv36X/rx1q5dG2eddVZ8+OGHcemll0aDBg3isccei/PPPz/uueeeOOSQQ770sf/1r3/FhRdeGAcffHD86Ec/imeeeSZ+8IMfxOOPPx7NmzePt99+O84+++w45phj4uc//3lkWRZPPvlkjB49Opo2bRrHHXdcmWOOHTs2atWqFddff31u2+9///uoW7duXHXVVbFx48a47bbbol+/fvHUU09FjRo1tmnMRxxxREycODEiyi7vrVev3hbv88QTT8Sf//znvG1DhgyJefPmxRVXXBF77713VK5ceYs//1uSZVmsX78+IiI2btwYH374Ydx+++2xxx575J3hLK8//elPccABB0TDhg23uP+QQw75SvO8JfXq1Yu2bdvGvffeG0uXLo2jjjoqOnToEPXq1YuqVavGoEGDyn2spk2bxp577hkzZszYrmMESJGwBYj8X5gjIlauXBmvv/56jB07Ntq3bx9t2rSJl19+OVq1ahW333577j16Xbp0iZdffjmmTZsWAwYMiDlz5sSzzz4bP/7xj+Oss86KiE2/HH/wwQcxbdq0XNh+dunzvHnzyozp3//+d4wbNy4OOuigiIgoKiqKnj17xoQJE+Kyyy6LBx54IFasWBEPP/xw7LXXXhERcdhhh8Wxxx4bt99+e9xxxx0REbF69erYY4898h5v82WVWZbF8OHD49BDD43hw4fntu+3335x9tlnx9SpU+OII47IbW/VqlVu2eNnz7CVd0zbauPGjXHDDTdE69at4+9//3tu++jRo6NGjRpx//335+bkkEMOiZ49e8Y999wTQ4YM+VKPF7Epyt5+++149NFHo23btrnn0q9fvxg+fHg89thjX/rYixYtis6dO8ett94atWrVirZt28bEiRPjlVdeyYVtly5d4tZbb41KlTYtruratWu88MILMW3atDJh+8knn8QTTzwRI0eOjKKiotz21atXx+TJk3NnAJs2bRrf/e53Y8qUKdGnT59tGnO9evVyAbu15b2lPv300xg+fHiZOfvrX/8ap59+epxwwgm5bVv6+d+SKVOmxJQpU/K2FRQUxK233vq5cb01ixcvjlatWm3z/b6qO+64Iy6//PLc8ykoKIhvfetbcdRRR8VZZ521TWeh99hjj/jwww//g6MFSIOwBYhNyz5bt26dt61SpUrRpUuXuP7666OgoCC6desW3bp1i3Xr1sXcuXNjwYIFMWfOnFi+fHnsvvvuERG5Mye9evXKO9Znl6qWx957752L2oiIBg0aRPv27WP69OkRsWkpa6tWraJhw4a5KK9UqVIcdthh8Zvf/CZ3v3/961+x2267fe7jzJs3LxYvXhwDBw7Mi/uOHTtGrVq14uWXX84L260p75giNsXq5o+XZdnnHveRRx6JZcuWxfXXX5+7eFBExGuvvRadOnWK6tWr545Vq1atOOigg+KVV14p15i39ly+8Y1vROvWrfPG2b1797jlllti5cqVuQD57HPZ0jLTzXXs2DE6duwYEZvODJeeiS49o33iiSfGiSeeGGvXro358+fHggULYvbs2bFhw4ZYt25d3rHWrl0bo0ePjgYNGuSWIZfq0KFD3rLWAw44IPbZZ5+YPn16XtiuX78+CgoKyizH/SrGjBkTdevWjT59+u
StFigsLIzf//738e1vfzuaNm0a1atX/8LvV6nu3bvHRRddFBGbfl6WL18ezzzzTFx22WWxevXqOPXUU7dpjJUrV44
"text/plain": [
"<Figure size 1200x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Проверка на пустые значения в наборе данных 'BTC-USD':\n",
"Series([], dtype: int64)\n",
"\n",
"\n"
]
}
],
"source": [
"plt.figure(figsize=(12, 6))\n",
"sns.boxplot(x='Close', data=df_btc)\n",
"plt.title('Распределение цен закрытия BTC-USD')\n",
"plt.xlabel('Цена закрытия (USD)')\n",
"plt.show()\n",
"\n",
"check_missing_values(df_btc, \"BTC-USD\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Видно, что выборка относительно сбалансированна, пустых значений нет."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA2cAAAJZCAYAAAAtTE0MAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAADXxklEQVR4nOzdeVxU9foH8M8MMMMMiyAgmCKihgqi4r6kmZlxSy2zbLNfZYu3bNMy29dbt1tZlravVnbLyixbsJv3luYOWqmIpSLhwioj4GzAOb8/aCYGZjln5gwzMJ/36+XrJTPnnPkOs3Ce832+z6MSRVEEERERERERBZQ60AMgIiIiIiIiBmdERERERERBgcEZERERERFREGBwRkREREREFAQYnBEREREREQUBBmdERERERERBgMEZERERERFREGBwRkREREREFAQYnBFRpyOKotufOyI+ByIios6PwRkR+dVvv/2GBQsWYPz48Rg0aBDOOOMM3HHHHSgqKlL8saxWK5588kmsXbvWftvvv/+Oyy+/XPHHslm9ejX69+/v8G/gwIEYOXIk5s6di4KCAvu2y5YtQ//+/WUdv6ysDDfeeCOOHj3q81j379+PCy+8EIMGDcJ5553ndJt77rnH4bkMGDAAQ4cOxfTp07F8+XKYzWbZj6vkc/Cn4uJiPPLII5gyZQoGDx6MSZMmYeHChW3eq5MnT8Y999wToFE6av169e/fH1lZWTjjjDOwaNEiHD9+3OMx+vfvj2XLlrXDaJv9/e9/xyeffALgr89Ey39Dhw7FBRdcgI8++sir43/99dc466yzMGjQIDz00ENKDl223bt3Y9GiRZg0aRIGDx6MKVOm4MEHH0RpaanDdu39GrTkzfeSFIcOHcLkyZNRW1ur+LGJOrPwQA+AiDqv33//HZdeeimGDh2KBx54AAkJCSgrK8MHH3yA2bNn47333sPQoUMVe7yKigqsWLEC//znP+235eXlYdeuXYo9hivLly9HUlISAEAQBFRVVeGll17C1VdfjU8//RQDBgzw6ribN2/Gjz/+qMgYX3rpJRw7dgwvvfQSunbt6nK7pKQkLF++HEDzc6mrq0N+fj5ee+01/PTTT1ixYgW0Wq3kx1XyOfjLd999h7vvvhunn346brrpJvTs2RNlZWVYsWIFZs+ejVdeeQXjx48P9DCdavl6AUBjYyOKi4vx7LPPYteuXfjqq68QGRnpcv+PP/4YKSkp7TFUrF69GuXl5Zg1a1abMQDN77f6+nps2LABDz/8MMLCwnDJJZfIeozHHnsMvXv3xlNPPYXk5GTFxi7XypUr8eSTT2L06NG488470a1bN5SUlOCtt97Cd999hxUrVnj9vaCkSy65BBMmTFD8uH369MHZZ5+Nf/zjH3j66acVPz5RZ8XgjIj85p133kF8fDzeeOMNhIf/9XUzZcoU5Obm4uWXX8brr78ewBEqZ+DAgejZs6fDbZmZmTjnnHPw4Ycf4rHHHgvQyP5SU1ODjIwMnHnmmW6302g0bYLmM888E0OGDMH8+fPx9ttv46abbvLjSNvXH3/8gcWLF2PChAlYunQpwsLC7PdNnToVl19+ORYvXoz//ve/0Gg0ARypc85erxEjRiAiIgKLFy/G+vXrcf7557vcX8kLJO6YzWY8++yzePjhh6FWOybutB7DxIkTUVRUhI8++kh2cGYwGDB+/HiMHj3a1yF7raCgAE888QSuvPJK3H///fbbR48ejSlTpuDCCy/Efffdh9WrVwdsjDYpKSl+C85vvPFGTJo0CVdffTWysrL88hhEnQ3TGonIb6qqqiCKIgRBcLhdr9fjvvvuw9/+9jeH29esWYOZM2diyJAhmDRpEpYsWQKr1Wq///vvv8cVV1yBnJwcDBo0CLm5uVi5ciUA4MiRIzj77LMBAPfeey8mT56MZcuW2WcUWqYNCYKA119/Heeccw4GDRqEc889F++//77DWK666ircdd
dduO222zB06FBce+21sp9/z549ER8fj2PHjrnc5ptvvsFFF12EnJwcjB8/Hg899BBOnjwJoHmW4d577wUAnH322W5T6SoqKnDvvffizDPPxODBg3HxxRdj/fr19vv79++P7du3Y8eOHejfv79XJ4VTpkzB0KFDHdLNmpqa8Prrr2PatGkYPHgwhg4dissuuwxbt271+Bw++eQTnH/++Rg0aBAmTZqEZcuWoampyeXjz507FxdddFGb22+++WbMmDEDAHDixAnceeedGD9+PLKzs3HBBRdgzZo1bp/X+++/D6vVigceeMAhMAMAnU6HxYsXY9asWfbXpbW6ujr885//xJQpU5CdnY1p06bh008/ddhmz549uPrqqzF8+HDk5OTgmmuuwc8//+ywTX5+PubMmYMhQ4Zg1KhRWLx4MU6cOOF27O5kZ2cDgD2d9J577sHVV1+Nhx9+GMOGDcN5552HpqamNil1FRUVWLx4McaOHYucnBzMmTPHYfZZyufHmc8++wwWiwVnnXWWpPHHxsZCpVI53Pbbb79h3rx5GDZsGIYNG4b58+fbUwS3bdtmT8976aWX0L9/fxw5cgQAsGnTJlxxxRUYPny4fSarZcrn6tWrkZmZiU8++QTjx4/HqFGjcODAAQDN3zsXXXQRsrOzMX78ePzjH/+A0Wh0O/a33noLMTExWLhwYZv7unbtinvuuQdnn322y+MYDAY89NBDGDduHLKzszF79mxs2bLFYZsTJ07g0Ucftadwjho1CvPnz7c/Z6D5e+z+++/H66+/jkmTJiE7OxuXXXYZfv31V/s2rdMapewDAD/88AMuuugiDB48GOeeey6++uornHPOOQ7vpaSkJIwZMwavvfaa298XEf2FwRkR+c2kSZNw7NgxXHbZZVi5ciUOHjxoLwqRm5uLmTNn2rdduXIlFi9ejKysLCxfvhw33ngj3n//ffzjH/8A0HwiMH/+fGRlZeHll1/GsmXLkJqaisceewy//PILunXrZg/EbrrpJixfvhyXXHIJLr74YgDNaVO2K/CPPPIIXnzxRcyYMQOvvvoqcnNz8eSTT+Kll15yGP+3336LqKgovPLKK7j++utlP/+amhrU1NSgV69eTu9/+eWXsXDhQgwdOhQvvvgi5s+fj3Xr1uGqq66C2WzGpEmT7DNUy5cvx8033+z0OFVVVbj44ouRn5+PBQsWYNmyZejRowfmz5+PL7/80v78MzMzkZmZiY8//hiTJk2S/XwAYPz48SgrK7Of8D/77LN4+eWXcemll+LNN9/E448/DoPBgNtvvx0mk8nlc3jttdfw4IMPYuzYsXj11Vdx5ZVX4o033sCDDz7o8rFnzJiBvXv3oqSkxH5bbW0tNmzYgAsuuAAAsGjRIhw8eBCPPvoo3njjDWRmZmLx4sX2YNGZjRs3IjMz02UK3NixY7FgwQJ72mpLZrMZV1xxBdauXYvrr78eL7/8MoYPH477778fr776KgCgvr4e119/PeLj47Fs2TI8//zzMJlMuO6661BXVwcA2LFjB6655hpERkZi6dKluO+++7B9+3b83//9n1fr/IDmNXQAHN5/+fn5OH78OF566SXceeedbYLRU6dO4fLLL8e2bduwaNEiLF++HFqtFnPnzsXhw4cBSP/8tPbll19i0qRJTmcfGxsb7f9qa2vx1VdfYcOGDZgzZ47D87nssstQXV2Nf/3rX3jiiSdQWlqKyy+/HNXV1cjKyrKnR1588cX4+OOP0a1bN6xZswZz585F9+7d8dxzz+Hee+/Frl27cOmll6K6utp+/KamJrz99tt44okncO+996Jv375Yu3Yt5s+fjz59+uCll17CLbfcgi+//BI333yzywI3oijip59+wtixY6HT6Zxuc95552H+/PnQ6/Vt7rNYLLj66quxfv16LFiwAMuXL0dKSgquv/56e4AmiiLmzZuHTZs24a677sJbb72FW265BVu2bMHDDz/scLx169Zh/fr1eO
CBB/Dcc8+hqqoKt956q9sLIZ722bp1K26++WZ0794dy5Ytw5VXXomHH37Y6RrH3Nxc/Pe//8WpU6dcPh4RtSASEfn
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(df_btc['Date'], df_btc['Close'])\n",
"plt.xlabel('Date')\n",
"plt.ylabel('Close Price')\n",
"plt.title('Scatter Plot of Date vs Close Price (Before Cleaning)')\n",
"plt.xticks(rotation=45)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Убираем шумы"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Выбросы в BTC-USD Stock Data:\n",
"Empty DataFrame\n",
"Columns: [Date, Open, High, Low, Close, Adj Close, Volume, Price_Change]\n",
"Index: []\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA2cAAAJZCAYAAAAtTE0MAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAADVz0lEQVR4nOzdeXhTZdoG8DtpmzTpQktbWiyllJ2WrUDZtMggYkcBRZRxBBdcRxkRUNQZHQd1XD4FNxbRARW3URREcCkuMwrK2ooKlIJAW8rWvbQlW9uT74+a2LRNck5ysjX377q8LknOOXmz9jznfd7nUZjNZjOIiIiIiIjIp5S+HgARERERERExOCMiIiIiIvILDM6IiIiIiIj8AIMzIiIiIiIiP8DgjIiIiIiIyA8wOCMiIiIiIvIDDM6IiIiIiIj8AIMzIiIiIiIiP8DgjIiCitlsdvjvQMTnQNQxfq6IKNAwOCMinzly5AgWLlyICy+8EIMHD8ZFF12EBQsWoLCwUPbHMplMeOqpp7Blyxbrbb/++iv+/Oc/y/5YFhs3bsSAAQNs/hs0aBCysrJwyy23ID8/37rt8uXLMWDAAEnHP3v2LO644w6cOnXK7bEePnwYV111FQYPHozLL7+8w20eeughm+cycOBADB8+HNOmTcOKFStgMBgkP66cz8GTioqKsGTJEkyePBlDhw7FxIkTsWjRonaf1UmTJuGhhx7y0ShttX2/BgwYgIyMDFx00UVYvHgxzpw54/QYAwYMwPLly70w2hZ/+ctf8OGHH7a7ff369RgwYAD+8pe/2N33ueeew+jRozF8+HBs2rQJ33zzDR588EFPDhdAYHw2LL9FJ0+elPW4NTU1mDhxIkpLS2U9LlEwC/X1AIgoOP3666/405/+hOHDh+ORRx5BXFwczp49i3feeQezZs3CW2+9heHDh8v2eOXl5Vi3bh2efvpp6225ubnYt2+fbI9hz4oVK5CQkAAAEAQBlZWVWLlyJW666SZ89NFHGDhwoEvH3bFjB7777jtZxrhy5UqcPn0aK1euRNeuXe1ul5CQgBUrVgBoeS719fXIy8vDq6++iu+//x7r1q2DWq0W/bhyPgdP+fLLL/HAAw+gX79+uOuuu9CjRw+cPXsW69atw6xZs/DKK6/gwgsv9PUwO9T6/QKApqYmFBUVYenSpdi3bx8+/fRThIeH293/gw8+QFJSkjeGio0bN6KsrAwzZ85sd9+GDRvQv39/bNu2DWfOnEH37t1t7j9y5AjWrFmDWbNm4corr0Tv3r1x7733enzMgfLZmDhxIj744AN069ZN1uPGxsbi5ptvxt///ne89dZbUCgUsh6fKBgxOCMin3jjjTcQGxuLf//73wgN/f2naPLkycjJycGqVavw2muv+XCE8hk0aBB69Ohhc1t6ejouvfRSvPfee3j88cd9NLLf1dTUoH///rj44osdbqdSqdoFzRdffDGGDRuGefPm4fXXX8ddd93lwZF614kTJ/Dggw8iOzsbL774IkJCQqz3TZkyBX/+85/x4IMP4r///S9UKpUPR9qxjt6vUaNGISwsDA8++CC++eYbXHHFFXb3l/MCiSMGgwFLly7FP//5TyiVtkk9x44dw08//YQ1a9Zg4cKF+OCDD7BgwQKbbWprawEAV1xxBUaNGuWVMQfSZ6Nr164OL7q44/rrr8crr7yCr776ClOmTPHIYxAFE6Y1EpFPVFZWwmw2QxAEm9u1Wi3+/ve/449//KPN7Zs2bcKMGTMwbNgwTJw4EcuWLYPJZLLe//XXX+P6669HZmYmBg8ejJycHLz77rsAgJMnT+KSSy4BAPztb3/DpEmTsHz5cuuMQuvULUEQ8Nprr+HSSy/F4MGDcdlll+Htt9+2GcsNN9yA+++/H/Pnz8fw4cMxd+5cyc+/R48eiI2NxenTp+1u8/nnn+Pqq69GZmYmLrzwQjz66KM4d+4cgJZZhr/97W8AgEsuuc
RhulR5eTn+9re/4eKLL8bQoUNxzTXX4JtvvrHeP2DAAOzZswd79+7FgAEDsHHjRsnPZ/LkyRg+fDjef/99623Nzc147bXXMHXqVAwdOhTDhw/Hddddh127djl9Dh9++CGuuOIKDB48GBMnTsTy5cvR3Nxs9/FvueUWXH311e1uv/vuuzF9+nQAQHV1Ne677z5ceOGFGDJkCK688kps2rTJ4fN6++23YTKZ8Mgjj9icfAOARqPBgw8+iJkzZ1rfl7bq6+vx9NNPY/LkyRgyZAimTp2Kjz76yGabAwcO4KabbsLIkSORmZmJm2++GT/99JPNNnl5eZgzZw6GDRuG0aNH48EHH0R1dbXDsTsyZMgQALCmkz700EO46aab8M9//hMjRozA5Zdfjubm5nZpjeXl5XjwwQcxbtw4ZGZmYs6cOTazz2K+Px3ZsGEDjEYj/vCHP3R4X5cuXTB27Fhcdtll+Oijj9DU1GS9f/ny5bjhhhsAADfddBMmTZqEG264AXv27MGePXswYMAA7N69G0BLEPfoo49i/PjxGDJkCGbNmoWdO3faPN6AAQOwYsUKXH311Rg6dKjNzGNr7n42jEYjnn32WVx88cUYPHgwpk2bhs8//9xmG4PBgGXLlmHKlCkYPHgwRowYgblz5+LQoUPWbR566CHcfPPN2LBhAy677DIMHjwYV155JbZt22bdpm1ao5h9AGDfvn2YPXs2hg8fjokTJ2LdunW4+eabbb6rKpUKl112GV599dUOnycRScPgjIh8YuLEiTh9+jSuu+46vPvuuzh27Jh18X5OTg5mzJhh3fbdd9/Fgw8+iIyMDKxYsQJ33HEH3n77bfzrX/8CAHz77beYN28eMjIysGrVKixfvhwpKSl4/PHH8fPPP6Nbt27WE6y77roLK1aswLXXXotrrrkGQEvq1rXXXgsAWLJkCV5++WVMnz4dq1evRk5ODp566imsXLnSZvxffPEFIiIi8Morr+C2226T/PxrampQU1ODnj17dnj/qlWrsGjRIgwfPhwvv/wy5s2bh61bt+KGG26AwWDAxIkTrTNUK1aswN13393hcSorK3HNNdcgLy8PCxcuxPLly5GcnIx58+Zh8+bN1uefnp6O9PR0fPDBB5g4caLk5wMAF154Ic6ePWs94V+6dClWrVqFP/3pT1izZg2eeOIJ1NbW4t5774Ver7f7HF599VX84x//wLhx47B69WrMnj0b//73v/GPf/zD7mNPnz4dBw8eRElJifW2uro6bNu2DVdeeSUAYPHixTh27Bgee+wx/Pvf/0Z6ejoefPBBa7DYke3btyM9PR2JiYkd3j9u3DgsXLjQmrbamsFgwPXXX48tW7bgtttuw6pVqzBy5Eg8/PDDWL16NQCgoaEBt912G2JjY7F8+XK88MIL0Ov1uPXWW1FfXw8A2Lt3L26++WaEh4fjxRdfxN///nfs2bMHN954o0vr/ICWdVIAbD5/eXl5OHPmDFauXIn77ruvXcBx/vx5/PnPf8bu3buxePFirFixAmq1GrfccguKi4sBiP/+tLV582ZMnDix3QxTU1MTNm/ejKlTpyIsLAwzZsxARUUF/vvf/1q3ufbaa/Hoo48CAB599FGsWLEC//znP20+0xkZGTAajbjpppvwzTffYOHChVixYgWSkpJw2223tQvQVq9ejWnTpuHll1/GZZdd1uGY3flsmM1mzJs3D++//z7mzp2LV155BZmZmVi4cKHNBYMHHngAGzZswB133IHXX38df/vb3/Drr7/ivvvusyl2cuDAAaxduxbz58/HypUrERISgnvuucduYChmn2PHjuHmm28GADz//PO455578Nprr9mslbXIycnBgQMHrJ8rInId0xqJyCeuv/56VFRUYO3atda0vtjYWFx00UW48cYbMXToUAAtV+JXrlyJyZMnW4MxANDr9fjss8/Q2NiIo0ePYsaMGXj44Yet92dmZmLMmDHYvXs3hg0bhkGDBgFoOR
lNT08HAOtaGkvqVlFREdavX49FixbhjjvuAABcdNFFUCgUePXVV3H99dcjNjYWABAWFobHHntMVLqSIAjWK/1GoxH
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Рассчитываем квартиль 1 (Q1) и квартиль 3 (Q3) для Close\n",
"Q1 = df_btc[\"Close\"].quantile(0.25)\n",
"Q3 = df_btc[\"Close\"].quantile(0.75)\n",
"\n",
"# Рассчитываем межквартильный размах (IQR)\n",
"IQR = Q3 - Q1\n",
"\n",
"# Определяем порог для выбросов\n",
"threshold = 1.5 * IQR\n",
"lower_bound = Q1 - threshold\n",
"upper_bound = Q3 + threshold\n",
"\n",
"# Фильтруем выбросы\n",
"outliers = (df_btc[\"Close\"] < lower_bound) | (df_btc[\"Close\"] > upper_bound)\n",
"\n",
"# Вывод выбросов\n",
"print(\"Выбросы в BTC-USD Stock Data:\")\n",
"print(df_btc[outliers])\n",
"\n",
"# Заменяем выбросы на медианные значения\n",
"median_close = df_btc[\"Close\"].median()\n",
"df_btc.loc[outliers, \"Close\"] = median_close\n",
"\n",
"# Визуализация данных после обработки\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(df_btc['Date'], df_btc['Close'])\n",
"plt.xlabel('Date')\n",
"plt.ylabel('Close Price')\n",
"plt.title('Scatter Plot of Date vs Close Price (After Cleaning)')\n",
"plt.xticks(rotation=45)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Разбиение набора данных на обучающую, контрольную и тестовую выборки"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Размер обучающей выборки: 1701\n",
"Размер контрольной выборки: 567\n",
"Размер тестовой выборки: 568\n"
]
}
],
"source": [
"# Разделение на обучающую и тестовую выборки\n",
"train_df, test_df = train_test_split(df_btc, test_size=0.2, random_state=42)\n",
"\n",
"# Разделение обучающей выборки на обучающую и контрольную\n",
"train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)\n",
"\n",
"print(\"Размер обучающей выборки:\", len(train_df))\n",
"print(\"Размер контрольной выборки:\", len(val_df))\n",
"print(\"Размер тестовой выборки:\", len(test_df))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Видим недостаток баланса:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Распределение Price_Change в обучающей выборке:\n",
"Price_Change\n",
"up 882\n",
"down 819\n",
"Name: count, dtype: int64\n",
"\n",
"Распределение Price_Change в контрольной выборке:\n",
"Price_Change\n",
"up 301\n",
"down 266\n",
"Name: count, dtype: int64\n",
"\n",
"Распределение Price_Change в тестовой выборке:\n",
"Price_Change\n",
"up 308\n",
"down 260\n",
"Name: count, dtype: int64\n",
"\n"
]
}
],
"source": [
"def check_balance(df, name):\n",
" counts = df['Price_Change'].value_counts()\n",
" print(f\"Распределение Price_Change в {name}:\")\n",
" print(counts)\n",
" print()\n",
"\n",
"check_balance(train_df, \"обучающей выборке\")\n",
"check_balance(val_df, \"контрольной выборке\")\n",
"check_balance(test_df, \"тестовой выборке\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Используем oversample и undersample"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Оверсэмплинг:\n",
"Распределение Price_Change в обучающей выборке:\n",
"Price_Change\n",
"up 882\n",
"down 882\n",
"Name: count, dtype: int64\n",
"\n",
"Распределение Price_Change в контрольной выборке:\n",
"Price_Change\n",
"down 301\n",
"up 301\n",
"Name: count, dtype: int64\n",
"\n",
"Распределение Price_Change в тестовой выборке:\n",
"Price_Change\n",
"down 308\n",
"up 308\n",
"Name: count, dtype: int64\n",
"\n",
"Андерсэмплинг:\n",
"Распределение Price_Change в обучающей выборке:\n",
"Price_Change\n",
"down 819\n",
"up 819\n",
"Name: count, dtype: int64\n",
"\n",
"Распределение Price_Change в контрольной выборке:\n",
"Price_Change\n",
"down 266\n",
"up 266\n",
"Name: count, dtype: int64\n",
"\n",
"Распределение Price_Change в тестовой выборке:\n",
"Price_Change\n",
"down 260\n",
"up 260\n",
"Name: count, dtype: int64\n",
"\n"
]
}
],
"source": [
"train_df_oversampled = oversample(train_df, 'Price_Change')\n",
"val_df_oversampled = oversample(val_df, 'Price_Change')\n",
"test_df_oversampled = oversample(test_df, 'Price_Change')\n",
"\n",
"train_df_undersampled = undersample(train_df, 'Price_Change')\n",
"val_df_undersampled = undersample(val_df, 'Price_Change')\n",
"test_df_undersampled = undersample(test_df, 'Price_Change')\n",
"\n",
"# Проверка сбалансированности после oversampling\n",
"print(\"Оверсэмплинг:\")\n",
"check_balance(train_df_oversampled, \"обучающей выборке\")\n",
"check_balance(val_df_oversampled, \"контрольной выборке\")\n",
"check_balance(test_df_oversampled, \"тестовой выборке\")\n",
"\n",
"# Проверка сбалансированности после undersampling\n",
"print(\"Андерсэмплинг:\")\n",
"check_balance(train_df_undersampled, \"обучающей выборке\")\n",
"check_balance(val_df_undersampled, \"контрольной выборке\")\n",
"check_balance(test_df_undersampled, \"тестовой выборке\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://www.kaggle.com/datasets/junaid512/random-student-data-set-for-education-purpose\n",
"Набор данных включает случайные данные о студентах, которые используются для целей моделирования в сфере образования.\n",
"Примр цели — образовательная аналитика.\n",
"Входные данные: Полные имена студентов, Класс/программа обучения, Возраст, IQ, Совокупный средний балл успеваемости, Навыки"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['Name', 'Class_or_Program', 'Age', 'Country', 'IQ', 'CGPA', 'Skill'], dtype='object')\n"
]
}
],
"source": [
"df_students = pd.read_csv(\".//static//csv//student_data_01.csv\")\n",
"\n",
"print(df_students.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Посмотрим краткое содержание датасета"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Информация о датасете BTC-USD:\n",
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 50000 entries, 0 to 49999\n",
"Data columns (total 7 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Name 50000 non-null object \n",
" 1 Class_or_Program 50000 non-null object \n",
" 2 Age 50000 non-null int64 \n",
" 3 Country 50000 non-null object \n",
" 4 IQ 50000 non-null int64 \n",
" 5 CGPA 50000 non-null float64\n",
" 6 Skill 50000 non-null object \n",
"dtypes: float64(1), int64(2), object(4)\n",
"memory usage: 2.7+ MB\n",
"None\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Class_or_Program</th>\n",
" <th>Age</th>\n",
" <th>Country</th>\n",
" <th>IQ</th>\n",
" <th>CGPA</th>\n",
" <th>Skill</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Catherine Owen</td>\n",
" <td>Arts</td>\n",
" <td>21</td>\n",
" <td>Tonga</td>\n",
" <td>105</td>\n",
" <td>3.18</td>\n",
" <td>Communication</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Melissa Wright</td>\n",
" <td>10th</td>\n",
" <td>24</td>\n",
" <td>United Arab Emirates</td>\n",
" <td>102</td>\n",
" <td>2.72</td>\n",
" <td>Leadership</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Laura Shaw</td>\n",
" <td>12th</td>\n",
" <td>18</td>\n",
" <td>Slovakia (Slovak Republic)</td>\n",
" <td>136</td>\n",
" <td>3.40</td>\n",
" <td>Communication</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Rodney Cummings</td>\n",
" <td>10th</td>\n",
" <td>17</td>\n",
" <td>Barbados</td>\n",
" <td>83</td>\n",
" <td>2.49</td>\n",
" <td>Problem-solving</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Barbara Hicks</td>\n",
" <td>12th</td>\n",
" <td>25</td>\n",
" <td>Canada</td>\n",
" <td>129</td>\n",
" <td>2.39</td>\n",
" <td>Communication</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Class_or_Program Age Country IQ \\\n",
"0 Catherine Owen Arts 21 Tonga 105 \n",
"1 Melissa Wright 10th 24 United Arab Emirates 102 \n",
"2 Laura Shaw 12th 18 Slovakia (Slovak Republic) 136 \n",
"3 Rodney Cummings 10th 17 Barbados 83 \n",
"4 Barbara Hicks 12th 25 Canada 129 \n",
"\n",
" CGPA Skill \n",
"0 3.18 Communication \n",
"1 2.72 Leadership \n",
"2 3.40 Communication \n",
"3 2.49 Problem-solving \n",
"4 2.39 Communication "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(\"\\nИнформация о датасете BTC-USD:\")\n",
"print(df_students.info())\n",
"df_students.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Анализируем датафрейм при помощи \"ящика с усами\". Проверяет на пустые значения."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA7YAAAImCAYAAABn6xZvAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABDo0lEQVR4nO3de/zX8+H///u7s3IIkfOIdVA6OFRaoYaZ9TGrhiGnTbGaET8xIzkfcsiK5ZA0bRhFM7PNYYytmrPPlkUl55RU0lmv3x8uvb7e3sk7H9ST6/Vycdl6vp6v5/Pxer8er3rfXs/n6/mqKJVKpQAAAEBB1VjbAwAAAID/C2ELAABAoQlbAAAACk3YAgAAUGjCFgAAgEITtgAAABSasAUAAKDQhC0AAACFJmwBAAAotFprewAAn0Xv3r0zadKkSstq166dRo0apWvXrjn55JOz0UYbraXR8UWaOXNmbrnlljz00EN58803s8EGG6RVq1b5yU9+kt133728Xu/evZMkv/nNb9bWUNcarw8Avm6ELVBYO++8cwYNGlT+87Jly/Lvf/87V155ZSZPnpzf/e53qaioWIsj5PP25JNPpl+/ftl4441z1FFHZYcddsjcuXNz++23p3fv3rn44otz8MEHr+1hrhO8PgD4OhG2QGGtv/76adu2baVle+yxR95///1cc801efbZZ6vcTnHNnTs3J598crbffvvcfPPNWW+99cq3fec730mfPn1yzjnnpHPnzmnUqNFaHOm6wesDgK8Tn7EFvnJatWqVJHnjjTeSJC+88EL69++fjh07pmXLlunSpUsuuOCCLF68uHyfpUuX5uqrr863v/3ttG7dOt27d8+4cePKt/fu3TvNmjVb5X+vvfZakuSMM85I7969c+edd6Zr165p165djj766LzwwguVxvfGG29kwIABad++fdq0aZOjjz46//nPfyqtc8cdd6xyX2eccUal9R544IH06NEju+yyS771rW/lggsuyMKFC8u3jx079hPHPXbs2GqP6bXXXqtyn5WPuVu3buU/d+vWrcoYBwwYkGbNmmXixInlZVOmTEnfvn2z6667Ztddd02/fv3y6quvVnkuP+ruu+/O22+/nV/84heVojZJatSokdNOOy1HHHFEFixYsMr7z5kzJ4MHD07Xrl3TqlWrtG/fPv369Ss/f0nyyiuv5IQTTkiHDh3Spk2bHHrooXnkkUfKty9evDjnnntu9tprr7Rq1SoHHHBAbrrpptWOe1WWLFmS4cOH54ADDsguu+yS/fffP9dff31WrFhRXqd379457bTTctJJJ6Vt27Y59thj13g/q/Lx18cn7ee9997LxRdfnH333Te77LJLunfvnjvvvLPStpYtW5YhQ4Zkr732SuvWrfPjH/84d999d5XXxdFHH51BgwZl1113zYEHHpgPPvigWs9H7969c8455+Taa69Nly5d0qZNmxx//PGZPXt27rrrruy3335p165djjnmmEr3W5UFCxbk/PPPT5cuXdK2bdv07Nkzf/vb38r7+aTXycSJE9OzZ88cdthhVbZ5zDHHlH9eL7/88irv/9HXR/Lpc3/ixIlVXi8rx7jy9Ppu3bp96t9Hjz/+eA4//PDstttu6dChQ0499dS8+eab5e19/O+GVq1a5Tvf+U7Gjx+/2p8jwLrKEVvgK2f69OlJkm233TZvv/12jjjiiLRt2zaXXHJJ6tSpk0cffTQ333xzNt988/Tp0ydJctppp+WRRx7JiSeemDZt2uSRRx7JGWeckdq1a6d79+5Jqp7a+be//S3XXXddpX1Pnjw506ZNy4ABA7LRRhvlmmuuyZFHHpn77rsvm2++eebMmZPDDjss6623Xs4+++yst956ueWWW3LEEUfkzjvvzI477pjkw4DaZZdd8stf/rK87f79+1fa1x/+8Iecdtpp+Z//+Z+cfPLJef3113PVVVflpZdeys0331zpNNNhw4Zls8
02S5LMmjWr0raqO6bP4oknnsgf//jHSsumT5+eww47LE2aNMmll16a5cuX57rrrsuPfvSj3HPPPdl0001Xua2///3vadSoUVq3br3K25s3b57mzZuv8rZSqZS+fftm3rx5Oe2009KoUaP897//zdVXX51BgwblpptuyooVK9K3b99svvnmueyyy1KrVq2MHj06J554Yv70pz/lG9/4Ri666KI89thjGThwYBo1apRHH300l112WRo2bJiePXtW62dSKpVywgkn5Jlnnkn//v3TvHnzTJw4MVdffXVeffXVnH/++eV1//SnP+Wggw7KddddVyl6/y8++vr4pP0sXrw4hx9+eN55552cdNJJ2XrrrfPAAw/krLPOyuzZs3PCCSckSc4555zce++9+dnPfpYWLVrk3nvvzdlnn11ln0888UTq1q2b4cOHZ+HChalRo8anPh8r3XvvvWnZsmUuvPDCvPXWWznvvPNy5JFHpm7duhk4cGAWLVqUc845J+edd16uv/76VT7mDz74IMcdd1xefvnlnHTSSWnSpEnGjRuXfv365ZZbbsmgQYPKb4gceuih6dWrV374wx8mSXbaaaf06tUr5557bmbMmJFvfOMbSZI333wzEydOzGWXXZbkw9dszZo189vf/ra832uvvTYvvfRSpZ/9Z5n7Hzds2LAsXbq0/Fo+8cQTs88++yRJNt9889x9990ZOHBgunfvnr59++bdd9/NNddck0MPPTTjxo2rtJ+VfzfMmzcvt912WwYOHJhddtklO+ywQ7XGArCuELZAYZVKpSxfvrz853nz5mXSpEm57rrr0q5du7Rq1SqPP/54WrRokaFDh2b99ddPknTq1CmPP/54Jk6cmD59+mTKlCn585//nF/84hc5+uijkyR77rlnXn/99UycOLEcth8/tXPatGlVxvTee+/l17/+dfkiRq1bt86+++6b0aNH57TTTsstt9ySuXPn5ne/+1223nrrJMlee+2VAw88MEOHDs0111yTJFm0aFEaNWpUaX916tSp9NiHDBmSLl26ZMiQIeXl22+/fY455pg88sgj5V90k6RFixbZZpttkqTKka3qjmlNrVixIhdccEFatmyZf//73+Xlw4YNy3rrrZdRo0aVn5M999wz++67b2688cYMHDhwldt76623yuNbU2+//XbWW2+9DBw4sPzcdOjQIa+88kpuv/32JMk777yTadOm5ac//Wn23nvvJB8+fysjIkkmTZqUb33rW/ne975X3kb9+vWrHSRJ8uijj+Yf//hHrrzyyvJ2vvWtb6VevXoZOnRojjrqqHzzm99M8uEFnwYPHlzpua+u6rw+Vvr4fn77299mypQpue2229KuXbskSZcuXbJ8+fJce+21OeywwzJ//vyMGzcuAwcOLB+17NKlS2bPnp3HHnus0liWL1+e8847L1tssUWSDy8A9mnPx0fvO2zYsPLFrv7yl7/k73//ex544IFynD/zzDO55557PvFn8eijj+bZZ5/N8OHDs++++yZJOnbsmFdffTUTJkyo8qbRFltsUem1171791xyySW55557ctJJJyVJ7rnnnjRo0CD77bdfkg9fs3Xr1q10v0022aTSdj/r3P+4nXfeOcn/ey1vt9125f2uWLEiQ4YMSefOnXPFFVeU77PyaPlNN92U008/vbz8o383bLnllnnooYcyefJkYQsUjrAFCutf//pXWrZsWWlZjRo10qlTp5x33nmpqKhI586d07lz5yxbtiwvvfRSZsyYkSlTpmTOnDlp2LBhkg8vSJQk+++/f6Vt/epXv1rjMW2zzTaVrsy7+eabp127dvnXv/6VJPnnP/+ZFi1apHHjxuXoqFGjRvbaa69KpwCuvNrvJ5k2bVreeuut9O3bt1K87LHHHll//fXz+OOPVwrb1anumJIPf2n+6P5KpdInbve2227LrFmzct555+WnP/1pefmECRPSvn371KtXr7yt9ddfP7vvvnv+8Y9/fO
L2atasmQ8++KBaj+njGjdunNGjR6dUKuW1117LjBkzMm3atDz11FPlaG3UqFF22mmnnH322XnsscfSuXPn7LXXXjn
"text/plain": [
"<Figure size 1200x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Проверка на пустые значения в наборе данных 'Student':\n",
"Series([], dtype: int64)\n",
"\n",
"\n"
]
}
],
"source": [
"plt.figure(figsize=(12, 6))\n",
"sns.boxplot(x='Class_or_Program', data=df_students)\n",
"plt.title('Распределение Class_or_Program студентов')\n",
"plt.xlabel('Class_or_Program')\n",
"plt.show()\n",
"\n",
"check_missing_values(df_students, \"Student\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Видно, что выборка относительно сбалансированна, пустых значений нет."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAzQAAAImCAYAAACFG89TAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd5gURd7A8W/3hM05J9IuaWGBBZYgUUBAJWPA7N2p58kZTwzvcaYzcCJiQAxnOhUVIxlBMQBKjkvOsDnnNKHr/WN2BjbBhtlZ0Po8D+/rTait7pqurl9VdZUihBBIkiRJkiRJkiRdgtS2zoAkSZIkSZIkSVJzyYBGkiRJkiRJkqRLlgxoJEmSJEmSJEm6ZMmARpIkSZIkSZKkS5YMaCRJkiRJkiRJumTJgEaSJEmSJEmSpEuWDGgkSZIkSZIkSbpkyYBGkiRJkiRJkqRLlgxoJKmW2nvN/h72nv09HIMkSZIkSVJ9ZEAjXdSOHDnCgw8+yJAhQ+jZsydDhw7lgQce4NChQ07/WyaTieeff57ly5c7Xjt69Cg33HCD0/+W3TfffEPXrl1r/OvevTtJSUn8+c9/ZseOHY7Pvv7663Tt2rVJ6WdmZnLXXXeRlpbW4rwePnyYKVOm0LNnT6666qrzfvbkyZM89dRTjBkzhl69ejFy5EgeeuihOuU2atQoHnvssRbn7VI0atSoOmWfkJDAFVdcwbx586iqqmrrLP4uPPbYY4waNarO63l5ecyfP5+rr76aPn36MHjwYG677TZWrVrVpPTff/99Hn74YQC2bNlSp0x79uzJ6NGjefHFF6moqGhy/pty3bW2rKwsXnzxRcaPH0/v3r0ZOnQod999N9u3b6/xuVtuuYVbbrmlTfJoL4MtW7Y4NV2TycT48ePZvXu3U9OVJMk59G2dAUlqyNGjR7n++uvp06cPs2fPJigoiMzMTD755BOuu+46PvroI/r06eO0v5ednc3//vc/XnjhBcdr3333Hbt27XLa32jIggULCAkJAUDTNHJzc3njjTe47bbb+Oqrr+jWrVuz0v3tt9/45ZdfnJLHN954g/T0dN544w0CAwMb/NzatWt55JFH6Ny5M3/729+Ijo4mMzOT//3vf1x33XW8+eabDBkyxCl5utSNGDGCe+65x/G/q6qq2LJlCwsXLiQtLY2XX365DXP3+3XgwAH++te/otPpuO2224iPj6ekpIR169bxj3/8gzVr1vDSSy9hMBjOm87x48d5++23WbZsWY3Xn3jiCXr06AFARUUFhw4d4rXXXiMnJ4e5c+c2Ka+Nve5a244dO5g5cyYBAQHceuutdOzYkcLCQhYvXswtt9zCCy+8wJQpU9osf3Y9evRg8eLFxMXFOTVdo9HIww8/zKOPPsrSpUtxd3d3avqSJLWMDGiki9YHH3xAQEAA//3vf9Hrz/5Ux4wZw/jx41m4cCHvvPNOG+bQebp37050dHSN1+Lj47niiiv49NNPeeaZZ9ooZ2cVFBTQpUsXRowY0eBnzpw5w6OPPsqwYcN45ZVX0Ol0jvfGjh3LDTfcwKOPPsqPP/6I0Wh0RbYvaoGBgXWC8oEDB5KZmck333zDY489RmhoaNtk7neqvLyc+++/n8DAQD766CP8/Pwc740ZM4bLL7+ce++9l44dO/LAAw+cN625c+cyYcIEwsLCarweFxdXo1wHDx5MSUkJb775Jk8++STe3t6Nzm9jrrvWVlhYyAMPPECHDh344IMP8PDwcLw3btw47rrrLp544gmGDh1KcHBwm+UTwNvb26kdXecaM2YMr7zyCp999hl/+tOfWuVvSJLUPHLKmXTRys3NRQiBpmk1Xvf09OT//u//uPLKK2u8vmTJEqZOnUrv3r0ZOXIk8+bNw2QyOd7/4YcfuPHGG0lMTKRnz56MHz+eRYsWAZCamsro0aMBePzxxxk1ahSvv/46CxYsAKBr1668/vrrgG0E5Z133uGKK66gZ8
+ejBs3jo8//rhGXm655RYefvhh7rvvPvr06dOsm190dDQBAQGkp6c3+JlVq1Yxbdo0EhMTGTJkCE888QRFRUWAbTrb448/DsDo0aPPO7UrOzubxx9/nBEjRtCrVy+uueYa1q1b53i/a9eubN26lW3bttG1a1e++eabetP5+OOPMZlMzJ49u0YwA+Dh4cGjjz7K9OnTHXmsLTU1lUceeYShQ4fSo0cPBg8ezCOPPEJBQYHjM/v27eO2226jX79+JCYmcvvtt9eYBpKfn88//vEPhgwZQkJCApMnT2bJkiUNHntDSkpKeOGFFxgzZgwJCQlMmDCBr776qsZnRo0axfPPP89tt91Gr169+Oc//9nkv1Ofnj17IoQgIyPjvH/nQuUGUFpayhNPPMHgwYNJTEzkwQcf5MMPP6wxfbGh32tjymPUqFEsWLCA559/noEDB5KYmMg//vEPysrKeOeddxg+fDj9+vXj3nvvrfG9c1VVVdGvXz/+85//1HjdYrEwaNAgnn32WeDCZd8YK1as4MyZMzz11FM1ghm7sWPHctVVV/Hhhx9SVlbWYDpHjhzh559/ZsKECY36u76+vnVeKyws5IknnuCyyy4jISGB6667jk2bNjneb+i6O3XqFPfddx9DhgyhT58+3HLLLTWmp6amptK1a1c++OADx/Swr7/+2pHvv/71r/Tt25e+ffsyc+ZMUlJSzpv3JUuWkJ2dzf/93//VCGYAVFXl4Ycf5qabbqK0tLTe7zemzrRarbzzzjtMmDCBXr160adPH2bMmMHmzZsdn3n99de54oor+Pnnn5k4caIjrXOv79pTzhrzHbCNtt1555307duXyy67jPnz5/P444/XmTo3ceJEPvjggxr3FkmS2p4MaKSL1siRI0lPT2fGjBksWrSI48ePOx5uHz9+PFOnTnV8dtGiRTz66KP06NGDBQsWcNddd/Hxxx87GkI///wzM2fOpEePHixcuJDXX3+dmJgYnnnmGfbs2UNoaKgjePnb3/7GggULuPbaa7nmmmsAWLx4Mddeey0ATz31FK+99hqTJk3irbfeYvz48Tz//PO88cYbNfK/evVqvLy8ePPNN7njjjuafPwFBQUUFBTQrl27et9fuHAhDz30EH369OG1115j5syZrFmzhltuuYXKykpGjhzJ3/72N8A2pe3cqU3nys3N5ZprrmH79u08+OCDvP7660RFRTFz5kzHVJrFixcTHx9PfHw8ixcvZuTIkfWmtWHDBuLj4+v0WNsNHjyYBx980DG97lwVFRXceuutHD9+nCeffJL33nuPW2+9lZUrVzJ//nzA1ji/4447CAgI4PXXX2f+/PlUVFTwl7/8hZKSEgBmzZrF8ePHefrpp/nvf/9LfHw8jz76aI2G0YVUVlZy4403snz5cu644w4WLlxIv379+Oc//8lbb71V47OLFi0iISGBhQsXOn4vLXXy5EkAYmJiGvw7jSk3gHvuuYfVq1dz7733Mn/+fMrKypg3b16dv1n799qY8rB7//33ycjIYP78+fztb39jxYoVTJ8+nY0bN/Lvf/+bhx56iHXr1vHaa6/Ve7xubm6MGzeO1atX11jA4tdff6WgoIDJkyc3quwb48cffyQ4OJjExMQGP3P11VdTUVHBr7/+2uBnli9fTkhISL2jAZqmYbFYsFgsVFRUsHPnTj766COmTJniGJ2pqqritttuY926dTz44IMsWLCA8PBw7rjjDkdQU991d+zYMaZNm0ZqaiqzZ8/mpZdeQlEUbrvtNrZu3VojH6+//jp33nknL774IkOGDOHkyZPMmDGDvLw8/vOf//Dcc8+RkpLCDTfcQF5eXoPHumHDBoKDg+nVq1e973fr1o1HH32UDh061Pt+Y+rMl156iYULF3L99dfz7rvv8u9//5vCwkLuv//+Gs8e5eTk8Mwzz3DrrbfyzjvvEB0dzaOPPsrx48cbzP+FvpOfn8/NN99MRkYGL7zwArNnz+a7775jxY
oVddIaP348WVlZdc61JEltS045ky5aN954Izk5Obz33nuOKVcBAQEMHTqUW2+91XFz1TSNN954gzFjxjgCGLA1kFe
"text/plain": [
"<Figure size 1000x600 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.preprocessing import LabelEncoder\n",
"\n",
"# Encode the categorical column as integers so it can be used with hexbin\n",
"label_encoder = LabelEncoder()\n",
"df_students['Class_or_Program_encoded'] = label_encoder.fit_transform(df_students['Class_or_Program'])\n",
"\n",
"# Visualize point density with hexbin\n",
"plt.figure(figsize=(10, 6))\n",
"plt.hexbin(df_students['Class_or_Program_encoded'], df_students['IQ'], gridsize=30, cmap='coolwarm')\n",
"# The hexbin color encodes the number of points per cell, not IQ\n",
"plt.colorbar(label='Number of points')\n",
"# Configure the X axis to show the text labels instead of the encoded numbers\n",
"plt.xticks(ticks=range(len(label_encoder.classes_)), labels=label_encoder.classes_)\n",
"plt.xlabel('Class_or_Program')\n",
"plt.ylabel('IQ')\n",
"plt.title('Hexbin Plot of Class_or_Program vs IQ (Before Cleaning)')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remove noise: replace IQ outliers, detected with the 1.5 × IQR rule, with the median"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Outliers in Student Data:\n",
"Empty DataFrame\n",
"Columns: [Name, Class_or_Program, Age, Country, IQ, CGPA, Skill, Class_or_Program_encoded]\n",
"Index: []\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAzQAAAImCAYAAACFG89TAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd5gURd7A8W/3hM05J9IuaWGBBZYgUUBAJWPA7N2p58kZTwzvcaYzcCJiQAxnOhUVIxlBMQBKjkvOsDnnNKHr/WN2BjbBhtlZ0Po8D+/rTait7pqurl9VdZUihBBIkiRJkiRJkiRdgtS2zoAkSZIkSZIkSVJzyYBGkiRJkiRJkqRLlgxoJEmSJEmSJEm6ZMmARpIkSZIkSZKkS5YMaCRJkiRJkiRJumTJgEaSJEmSJEmSpEuWDGgkSZIkSZIkSbpkyYBGkiRJkiRJkqRLlgxoJKmW2nvN/h72nv09HIMkSZIkSVJ9ZEAjXdSOHDnCgw8+yJAhQ+jZsydDhw7lgQce4NChQ07/WyaTieeff57ly5c7Xjt69Cg33HCD0/+W3TfffEPXrl1r/OvevTtJSUn8+c9/ZseOHY7Pvv7663Tt2rVJ6WdmZnLXXXeRlpbW4rwePnyYKVOm0LNnT6666qrzfvbkyZM89dRTjBkzhl69ejFy5EgeeuihOuU2atQoHnvssRbn7VI0atSoOmWfkJDAFVdcwbx586iqqmrrLP4uPPbYY4waNarO63l5ecyfP5+rr76aPn36MHjwYG677TZWrVrVpPTff/99Hn74YQC2bNlSp0x79uzJ6NGjefHFF6moqGhy/pty3bW2rKwsXnzxRcaPH0/v3r0ZOnQod999N9u3b6/xuVtuuYVbbrmlTfJoL4MtW7Y4NV2TycT48ePZvXu3U9OVJMk59G2dAUlqyNGjR7n++uvp06cPs2fPJigoiMzMTD755BOuu+46PvroI/r06eO0v5ednc3//vc/XnjhBcdr3333Hbt27XLa32jIggULCAkJAUDTNHJzc3njjTe47bbb+Oqrr+jWrVuz0v3tt9/45ZdfnJLHN954g/T0dN544w0CAwMb/NzatWt55JFH6Ny5M3/729+Ijo4mMzOT//3vf1x33XW8+eabDBkyxCl5utSNGDGCe+65x/G/q6qq2LJlCwsXLiQtLY2XX365DXP3+3XgwAH++te/otPpuO2224iPj6ekpIR169bxj3/8gzVr1vDSSy9hMBjOm87x48d5++23WbZsWY3Xn3jiCXr06AFARUUFhw4d4rXXXiMnJ4e5c+c2Ka+Nve5a244dO5g5cyYBAQHceuutdOzYkcLCQhYvXswtt9zCCy+8wJQpU9osf3Y9evRg8eLFxMXFOTVdo9HIww8/zKOPPsrSpUtxd3d3avqSJLWMDGiki9YHH3xAQEAA//3vf9Hrz/5Ux4wZw/jx41m4cCHvvPNOG+bQebp37050dHSN1+Lj47niiiv49NNPeeaZZ9ooZ2cVFBTQpUsXRowY0eBnzpw5w6OPPsqwYcN45ZVX0Ol0jvfGjh3LDTfcwKOPPsqPP/6I0Wh0RbYvaoGBgXWC8oEDB5KZmck333zDY489RmhoaNtk7neqvLyc+++/n8DAQD766CP8/Pwc740ZM4bLL7+ce++9l44dO/LAAw+cN625c+cyYcIEwsLCarweFxdXo1wHDx5MSUkJb775Jk8++STe3t6Nzm9jrrvWVlhYyAMPPECHDh344IMP8PDwcLw3btw47rrrLp544gmGDh1KcHBwm+UTwNvb26kdXecaM2YMr7zyCp999hl/+tOfWuVvSJLUPHLKmXTRys3NRQiBpmk1Xvf09OT//u//uPLKK2u8vmTJEqZOnUrv3r0ZOXIk8+bNw2QyOd7/4YcfuPHGG0lMTKRnz56MHz+eRYsWAZCamsro0aMBePzxxxk1ahSvv/46CxYsAKBr1668/vrrgG0E5Z133uGKK66gZ8
+ejBs3jo8//rhGXm655RYefvhh7rvvPvr06dOsm190dDQBAQGkp6c3+JlVq1Yxbdo0EhMTGTJkCE888QRFRUWAbTrb448/DsDo0aPPO7UrOzubxx9/nBEjRtCrVy+uueYa1q1b53i/a9eubN26lW3bttG1a1e++eabetP5+OOPMZlMzJ49u0YwA+Dh4cGjjz7K9OnTHXmsLTU1lUceeYShQ4fSo0cPBg8ezCOPPEJBQYHjM/v27eO2226jX79+JCYmcvvtt9eYBpKfn88//vEPhgwZQkJCApMnT2bJkiUNHntDSkpKeOGFFxgzZgwJCQlMmDCBr776qsZnRo0axfPPP89tt91Gr169+Oc//9nkv1Ofnj17IoQgIyPjvH/nQuUGUFpayhNPPMHgwYNJTEzkwQcf5MMPP6wxfbGh32tjymPUqFEsWLCA559/noEDB5KYmMg//vEPysrKeOeddxg+fDj9+vXj3nvvrfG9c1VVVdGvXz/+85//1HjdYrEwaNAgnn32WeDCZd8YK1as4MyZMzz11FM1ghm7sWPHctVVV/Hhhx9SVlbWYDpHjhzh559/ZsKECY36u76+vnVeKyws5IknnuCyyy4jISGB6667jk2bNjneb+i6O3XqFPfddx9DhgyhT58+3HLLLTWmp6amptK1a1c++OADx/Swr7/+2pHvv/71r/Tt25e+ffsyc+ZMUlJSzpv3JUuWkJ2dzf/93//VCGYAVFXl4Ycf5qabbqK0tLTe7zemzrRarbzzzjtMmDCBXr160adPH2bMmMHmzZsdn3n99de54oor+Pnnn5k4caIjrXOv79pTzhrzHbCNtt1555307duXyy67jPnz5/P444/XmTo3ceJEPvjggxr3FkmS2p4MaKSL1siRI0lPT2fGjBksWrSI48ePOx5uHz9+PFOnTnV8dtGiRTz66KP06NGDBQsWcNddd/Hxxx87GkI///wzM2fOpEePHixcuJDXX3+dmJgYnnnmGfbs2UNoaKgjePnb3/7GggULuPbaa7nmmmsAWLx4Mddeey0ATz31FK+99hqTJk3irbfeYvz48Tz//PO88cYbNfK/evVqvLy8ePPNN7njjjuafPwFBQUUFBTQrl27et9fuHAhDz30EH369OG1115j5syZrFmzhltuuYXKykpGjhzJ3/72N8A2pe3cqU3nys3N5ZprrmH79u08+OCDvP7660RFRTFz5kzHVJrFixcTHx9PfHw8ixcvZuTIkfWmtWHDBuLj4+v0WNsNHjyYBx980DG97lwVFRXceuutHD9+nCeffJL33nuPW2+9lZUrVzJ//nzA1ji/4447CAgI4PXXX2f+/PlUVFTwl7/8hZKSEgBmzZrF8ePHefrpp/nvf/9LfHw8jz76aI2G0YVUVlZy4403snz5cu644w4WLlxIv379+Oc//8lbb71V47OLFi0iISGBhQsXOn4vLXXy5EkAYmJiGvw7jSk3gHvuuYfVq1dz7733Mn/+fMrKypg3b16dv1n799qY8rB7//33ycjIYP78+fztb39jxYoVTJ8+nY0bN/Lvf/+bhx56iHXr1vHaa6/Ve7xubm6MGzeO1atX11jA4tdff6WgoIDJkyc3quwb48cffyQ4OJjExMQGP3P11VdTUVHBr7/+2uBnli9fTkhISL2jAZqmYbFYsFgsVFRUsHPnTj766COmTJniGJ2pqqritttuY926dTz44IMsWLCA8PBw7rjjDkdQU991d+zYMaZNm0ZqaiqzZ8/mpZdeQlEUbrvtNrZu3VojH6+//jp33nknL774IkOGDOHkyZPMmDGDvLw8/vOf//Dcc8+RkpLCDTfcQF5eXoPHumHDBoKDg+nVq1e973fr1o1HH32UDh061Pt+Y+rMl156iYULF3L99dfz7rvv8u9//5vCwkLuv//+Gs8e5eTk8Mwzz3DrrbfyzjvvEB0dzaOPPsrx48cbzP+FvpOfn8/NN99MRkYGL7zwArNnz+a7775jxY
oVddIaP348WVlZdc61JEltS045ky5aN954Izk5Obz33nuOKVcBAQEMHTqUW2+91XFz1TSNN954gzFjxjgCGLA1kFe
"text/plain": [
"<Figure size 1000x600 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Compute the first (Q1) and third (Q3) quartiles of IQ\n",
"Q1 = df_students[\"IQ\"].quantile(0.25)\n",
"Q3 = df_students[\"IQ\"].quantile(0.75)\n",
"\n",
"# Compute the interquartile range (IQR)\n",
"IQR = Q3 - Q1\n",
"\n",
"# Define the outlier bounds using the 1.5 * IQR rule\n",
"threshold = 1.5 * IQR\n",
"lower_bound = Q1 - threshold\n",
"upper_bound = Q3 + threshold\n",
"\n",
"# Build a boolean mask that flags the outliers\n",
"outliers = (df_students[\"IQ\"] < lower_bound) | (df_students[\"IQ\"] > upper_bound)\n",
"\n",
"# Print the outliers\n",
"print(\"Outliers in Student Data:\")\n",
"print(df_students[outliers])\n",
"\n",
"# Replace the outliers with the median value\n",
"median_iq = df_students[\"IQ\"].median()\n",
"df_students.loc[outliers, \"IQ\"] = median_iq\n",
"\n",
"# Visualize point density with hexbin\n",
"plt.figure(figsize=(10, 6))\n",
"plt.hexbin(df_students['Class_or_Program_encoded'], df_students['IQ'], gridsize=30, cmap='coolwarm')\n",
"# The hexbin color encodes the number of points per cell, not IQ\n",
"plt.colorbar(label='Number of points')\n",
"# Configure the X axis to show the text labels instead of the encoded numbers\n",
"plt.xticks(ticks=range(len(label_encoder.classes_)), labels=label_encoder.classes_)\n",
"plt.xlabel('Class_or_Program')\n",
"plt.ylabel('IQ')\n",
"plt.title('Hexbin Plot of Class_or_Program vs IQ (After Cleaning)')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Split the dataset into training, validation, and test sets"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set size: 30000\n",
"Validation set size: 10000\n",
"Test set size: 10000\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Hold out 20% of the data as the test set\n",
"train_df, test_df = train_test_split(df_students, test_size=0.2, random_state=42)\n",
"\n",
"# Split the remaining 80% into training (75%) and validation (25%) sets,\n",
"# which gives a 60/20/20 split overall\n",
"train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)\n",
"\n",
"print(\"Training set size:\", len(train_df))\n",
"print(\"Validation set size:\", len(val_df))\n",
"print(\"Test set size:\", len(test_df))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Checking the class distribution shows a slight imbalance: the per-class counts in each split are close but not equal"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Class_or_Program distribution in the training set:\n",
"Class_or_Program\n",
"12th 5072\n",
"11th 5070\n",
"10th 5067\n",
"Commerce 4988\n",
"Science 4915\n",
"Arts 4888\n",
"Name: count, dtype: int64\n",
"\n",
"Class_or_Program distribution in the validation set:\n",
"Class_or_Program\n",
"10th 1722\n",
"Science 1693\n",
"Arts 1676\n",
"12th 1661\n",
"11th 1637\n",
"Commerce 1611\n",
"Name: count, dtype: int64\n",
"\n",
"Class_or_Program distribution in the test set:\n",
"Class_or_Program\n",
"10th 1713\n",
"Science 1692\n",
"12th 1669\n",
"11th 1648\n",
"Commerce 1641\n",
"Arts 1637\n",
"Name: count, dtype: int64\n",
"\n"
]
}
],
"source": [
"def check_balance(df, name):\n",
"    # Print the number of rows per Class_or_Program value\n",
"    counts = df['Class_or_Program'].value_counts()\n",
"    print(f\"Class_or_Program distribution in the {name}:\")\n",
"    print(counts)\n",
"    print()\n",
"\n",
"check_balance(train_df, \"training set\")\n",
"check_balance(val_df, \"validation set\")\n",
"check_balance(test_df, \"test set\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Apply oversampling and undersampling to balance the classes"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Oversampling:\n",
"Class_or_Program distribution in the training set:\n",
"Class_or_Program\n",
"10th 5072\n",
"12th 5072\n",
"Science 5072\n",
"11th 5072\n",
"Commerce 5072\n",
"Arts 5072\n",
"Name: count, dtype: int64\n",
"\n",
"Class_or_Program distribution in the validation set:\n",
"Class_or_Program\n",
"Arts 1722\n",
"Science 1722\n",
"10th 1722\n",
"Commerce 1722\n",
"11th 1722\n",
"12th 1722\n",
"Name: count, dtype: int64\n",
"\n",
"Class_or_Program distribution in the test set:\n",
"Class_or_Program\n",
"Science 1713\n",
"11th 1713\n",
"Commerce 1713\n",
"12th 1713\n",
"10th 1713\n",
"Arts 1713\n",
"Name: count, dtype: int64\n",
"\n",
"Undersampling:\n",
"Class_or_Program distribution in the training set:\n",
"Class_or_Program\n",
"10th 4888\n",
"11th 4888\n",
"12th 4888\n",
"Arts 4888\n",
"Commerce 4888\n",
"Science 4888\n",
"Name: count, dtype: int64\n",
"\n",
"Class_or_Program distribution in the validation set:\n",
"Class_or_Program\n",
"10th 1611\n",
"11th 1611\n",
"12th 1611\n",
"Arts 1611\n",
"Commerce 1611\n",
"Science 1611\n",
"Name: count, dtype: int64\n",
"\n",
"Class_or_Program distribution in the test set:\n",
"Class_or_Program\n",
"10th 1637\n",
"11th 1637\n",
"12th 1637\n",
"Arts 1637\n",
"Commerce 1637\n",
"Science 1637\n",
"Name: count, dtype: int64\n",
"\n"
]
}
],
"source": [
"train_df_oversampled = oversample(train_df, 'Class_or_Program')\n",
"val_df_oversampled = oversample(val_df, 'Class_or_Program')\n",
"test_df_oversampled = oversample(test_df, 'Class_or_Program')\n",
"\n",
"train_df_undersampled = undersample(train_df, 'Class_or_Program')\n",
"val_df_undersampled = undersample(val_df, 'Class_or_Program')\n",
"test_df_undersampled = undersample(test_df, 'Class_or_Program')\n",
"\n",
"# Check the class balance after oversampling\n",
"print(\"Oversampling:\")\n",
"check_balance(train_df_oversampled, \"training set\")\n",
"check_balance(val_df_oversampled, \"validation set\")\n",
"check_balance(test_df_oversampled, \"test set\")\n",
"\n",
"# Check the class balance after undersampling\n",
"print(\"Undersampling:\")\n",
"check_balance(train_df_undersampled, \"training set\")\n",
"check_balance(val_df_undersampled, \"validation set\")\n",
"check_balance(test_df_undersampled, \"test set\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}