AIM-PIbd-32-Isaeva-A-I/lab_2/Lab2.ipynb

2024-11-15 18:57:46 +04:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lab 2\n",
"### Dataset 1. Comics on the Webtoon site"
]
},
{
"cell_type": "code",
"execution_count": 286,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['id', 'Name', 'Writer', 'Likes', 'Genre', 'Rating', 'Subscribers',\n",
" 'Summary', 'Update', 'Reading Link'],\n",
" dtype='object')\n"
]
}
],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from matplotlib.ticker import FuncFormatter\n",
"\n",
"df = pd.read_csv(\".//csv//Webtoon_Dataset.csv\")\n",
"print(df.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1) **Business goal:** analyze reader preferences to guide the creation of a comic\n",
"2) **Effect:** a more successful comic\n",
"3) **Technical goal:** determine which genres the Webtoon audience prefers\n",
"4) **Input data:** 'Subscribers', 'Rating', 'Genre'\n",
"5) **Target feature:** 'Genre'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"6) **Problems found:**<br/>\n",
"Noisy data. The \"Subscribers\" column stores its values as strings, and some of them carry the letter suffixes K and k. We convert them to numeric values.<br/>\n",
"Outliers: some comics have a very large number of subscribers.\n"
]
},
{
"cell_type": "code",
"execution_count": 287,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA+QAAAK9CAYAAACtq6aaAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOydd3hUxfeH3+01vSdAQgm9994RqYJgL2DDrw17ARv2hj+xK4odFVFEREQBpffeO6GE9LLJZvvu/f2xZJNld0NCC+q8z7PPk8zMnTt797Yzc87nyCRJkhAIBAKBQCAQCAQCgUBwUZHX9gAEAoFAIBAIBAKBQCD4LyIMcoFAIBAIBAKBQCAQCGoBYZALBAKBQCAQCAQCgUBQCwiDXCAQCAQCgUAgEAgEglpAGOQCgUAgEAgEAoFAIBDUAsIgFwgEAoFAIBAIBAKBoBYQBrlAIBAIBAKBQCAQCAS1gDDIBQKBQCAQCAQCgUAgqAWEQS4QCAQCgUAgEAgEAkEtIAxygUAgEAgEAsG/joyMDGQyGVOnTq3toQgEAkFIhEEuEAj+UXzxxRfIZLKQnxMnTlzU8RiNRsaPH39R9ykQCAT/ZIYOHUpUVBSSJPmVb9myBZlMRmpqasA2f/31FzKZjOnTp1+QMS1YsIApU6ZckL4FAoGgKpS1PQCBQCA4G55//nnq168fUB4dHV0LoxEIBAJBdenZsye///47O3fupFWrVr7yVatWoVQqOXbsGCdOnKBOnTp+deXbXggWLFjA+++/L4xygUBw0REGuUAg+EcyZMgQOnbsWNvDEAgEAkENKTeqV65cGWCQDx06lL/++ouVK1dy7bXX+upWrlxJTEwMzZo1u+jjFQgEgguJcFkXCAT/Sspd25cvX86dd95JTEwM4eHh3HzzzRQVFfm1/eWXXxg2bBjJycloNBoaNmzICy+8gNvt9mvn8Xh49NFHiYiIIC0tjYULF/rqHn/8ccLCwkhPT+f333/32278+PGkpaX5lR0/fhydTodMJiMjI8NXnpaWFuACP2HCBLRaLUuXLq3yO48fP75Kd/7Tt589ezYdOnRAp9MRGxvLjTfeSGZm5ln1+eyzz6JSqcjLywsY14QJE4iMjMRms/nKli5dGrS/04/T4cOHueqqq0hOTkYul/vatWzZMqCvWbNmMXnyZBITEzEYDIwcOZLjx4/79de3b1+/bcuZOnVqwG8B8MEHH9CiRQs0Gg3Jycncc889FBcXB2y/bt06nxuuwWCgdevWvP3229U6hpX3G+z3nz17dtBjczrFxcU0a9aMzp07Y7VafeXBzr97770Xo9HI5s2bfWXVvQ5qcgzT0tIYPnw4f/75J23btkWr1dK8eXPmzJkTsH35bx0dHY1er6dr16789ttvfm1OP280Gg2NGzfmlVdeCXB/Ph2Hw8EzzzxDhw4diIiIwGAw0KtXL/7++29fG0mSSEtL44orrgjY3mazERERwZ133ulXPmXKlKC/ad++fQOO2+llGzZs8LUv50znSuU+cnNzue2220hISECr1dKmTRu+/PJLv31UjqN+6623SE1NRafT0adPH3bu3OnX9lzvVdU9Vzt37oxarfatepezatUqevfuTefOnf3qPB4Pa9eupXv37r5jVVxczAMPPEDdunXRaDQ0atSI1157DY/HE3SfVX338ePH8/777wP+x7+csrIyHn74Yd++mjRpwtSpUwPOOZfLxQsvvEDDhg3RaDSkpaUxefJk7Ha7X7vy62LlypV07twZrVZLgwYN+Oqrr6o8bgKB4N+JWCEXCAT/au69914iIyOZMmUK+/bt48MPP+To0aO+F3vwGu9Go5GHHnoIo9HIX3/9xTPPPENJSQlvvPGGr6/XXnuNqVOnctNNN9GhQwcefPBBHA4Hv/32G23btuWll17i008/5corr2T37t1BXerLeeaZZ/wM1FA8++yzzJgxg1mzZgW8zAdDo9Hw6aef+pVt2L
CBd955x6/siy++4JZbbqFTp0688sor5OTk8Pbbb7Nq1Sq2bNlCZGRkjfq86aabeP7555k1axb33nuvr9zhcPDjjz8yZswYtFptwHgnT57sW/GaPn06x44d89W53W5GjhzJ0aNHeeCBB2jcuDEymYyXXnop6Hd/6aWXkMlkPP744+Tm5jJt2jQGDhzI1q1b0el0ZzhygUyZMoXnnnuOgQMHctddd/nOnw0bNrBq1SpUKhUAixYtYvjw4SQlJXH//feTmJjInj17mD9/Pvfffz933nknAwcO9DtWo0eP5sorr/SVxcXFBR2Dy+XiySefrNZ4IyMjmT9/Pl27dmXcuHHMmjXLz6go59133+XDDz9kzpw5tG/f3lde3eugphw4cIBrrrmG//3vf4wbN47PP/+cq666ioULFzJo0CAAcnJy6N69OxaLhYkTJxITE8OXX37JyJEj+fHHHxk9erRfn+XnjdVq9U3ExMfHc9ttt4UcR0lJCZ9++inXXXcdd9xxB6WlpcyYMYPBgwezfv162rZti0wm48Ybb+T111+nsLDQLwTm119/paSkhBtvvDFo/x9++CFGoxGASZMmVevYPP744wFlX3/9te/vFStWMH36dN566y1iY2MBSEhIAMBqtdK3b18OHjzIvffeS/369Zk9ezbjx4+nuLiY+++/36/fr776itLSUu655x5sNhtvv/02/fv3Z8eOHb4+g1Hde1VNzlWtVkuHDh1YuXKlr+z48eMcP36c7t27U1xc7DcZs2PHDkpKSnwr6xaLhT59+pCZmcmdd95JvXr1WL16NZMmTSIrK4tp06bV6LvfeeednDx5kkWLFvkdf/BO0owcOZK///6b2267jbZt2/LHH3/w6KOPkpmZyVtvveVre/vtt/Pll18yduxYHn74YdatW8crr7zCnj17+Pnnn/36PXjwIGPHjuW2225j3LhxfPbZZ4wfP54OHTrQokWLah1HgUDwL0ESCASCfxCff/65BEgbNmyoVrsOHTpIDofDV/76669LgPTLL7/4yiwWS8D2d955p6TX6yWbzSZJkiTZbDYpPj5euu6663xttm3bJikUCqlNmzaS3W6XJEmS8vPzpbCwMOn+++/3tRs3bpyUmprq+3/nzp2SXC6XhgwZIgHSkSNHfHWpqanSuHHjJEmSpI8//lgCpHffffeMx6V8PwaDIaB89uzZEiD9/fffkiRJksPhkOLj46WWLVtKVqvV127+/PkSID3zzDM17lOSJKlbt25Sly5d/NrNmTMnoJ0kSdKiRYskQFq2bJnfviofp3379kmA9Morr/ht26dPH6lFixa+///++28JkFJSUqSSkhJf+Q8//CAB0ttvvx1y23LeeOMNv98iNzdXUqvV0mWXXSa53W5fu/fee08CpM8++0ySJElyuVxS/fr1pdTUVKmoqMivT4/HE7AfSZIkQHr22WeD1lX+/SVJkj744ANJo9FI/fr18zs2VbFixQpJo9FITz75pCRJ/sf1999/lxQKhfTGG28EbFed60CSqn8My78PIP3000++MpPJJCUlJUnt2rXzlT3wwAMSIK1YscJXVlpaKtWvX19KS0vz/Qblv3Xl88lms0lyuVy6++67qzwuLpfLd52WU1RUJCUkJEi33nqrr6z8vPvwww/92o4cOVJKS0sL+F0nT54sAVJ+fr6vrEWLFlKfPn382vXp08evbMGCBRIgXX755VKo17Hy+1jlY1rOtGnTJED65ptvfGUOh0Pq1q2bZDQafdfCkSNHJEDS6XTSiRMnfG3XrVsnAdKDDz7oKzvbe5Uk1fxcffTRRyXAN6bvvvtO0mq1kt1ulxYsWCApFArfdyi/7latWiVJkiS98MILksFgkPbv3+/X5xNPPCEpFArp2LFjNf7u99xzT9DfYe7cuRIgvfjii37lY8eOlWQymXTw4EFJkiRp69atEiDdfvvtfu0eeeQRCZD++usvv2MHSMuXL/eV5ebmShqNRnr44YfPeO
wEAsG/C+GyLhAI/tVMmDDBt5IJcNddd6FUKlmwYIGvrPLqaWlpKfn5+fTq1QuLxcLevXsB7wpNbm6u36pm69at0Wq
"text/plain": [
"<Figure size 1200x800 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# remove the noise: convert suffixed strings (K, M, B) to numbers\n",
"def convert_str_to_float(value):\n",
"    # multiplier for each suffix; matching is case-insensitive, so 'M' and 'm' both mean millions\n",
"    multipliers = {'m': 1_000_000, 'k': 1_000, 'b': 1_000_000_000}\n",
"    if isinstance(value, str):\n",
"        lowered = value.lower()\n",
"        for suffix, factor in multipliers.items():\n",
"            if suffix in lowered:\n",
"                return float(lowered.replace(',', '').replace(suffix, '')) * factor\n",
"    return value\n",
"\n",
"# format axis ticks as thousands (K) or millions (M) so the scale is easier to read\n",
"def thousands(x, pos):\n",
"    if x >= 1_000_000:\n",
"        return f'{x / 1_000_000:.1f}M'\n",
"    else:\n",
"        return f'{x / 1_000:.1f}K'\n",
"\n",
"df['Subscribers'] = df['Subscribers'].apply(convert_str_to_float)\n",
"\n",
"plt.figure(figsize=(12, 8))\n",
"ax = sns.scatterplot(x='Subscribers', y='Rating', hue='Genre', data=df, palette='Set1')\n",
"# formatter for the x axis\n",
"ax.xaxis.set_major_formatter(FuncFormatter(thousands))\n",
"plt.title('Genre popularity among the Webtoon audience')\n",
"plt.xlabel('Number of subscribers')\n",
"plt.ylabel('Rating')\n",
"plt.legend(title='Genres')\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conclusions from the plot:\n",
"- The data are skewed toward smaller values; this can be addressed with oversampling and undersampling."
]
},
{
"cell_type": "code",
"execution_count": 288,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA+kAAAK9CAYAAABYVS0qAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3hUVfrA8e/0nt4hJPTeO6IgigoINuwF7L3rKuraXdeyP3tde1lXUdaKvSAgUlRAqdIJCellMpk+9/fHSGCYmZBAkpkk7+d5eB5yz507Z/p97znnfVWKoigIIYQQQgghhBAi5tSx7oAQQgghhBBCCCGCJEgXQgghhBBCCCHihATpQgghhBBCCCFEnJAgXQghhBBCCCGEiBMSpAshhBBCCCGEEHFCgnQhhBBCCCGEECJOSJAuhBBCCCGEEELECQnShRBCCCGEEEKIOCFBuhBCCCGEEEIIESckSBdCCCGEEO3etm3bUKlUPProo7HuihBCNEiCdCFEm/baa6+hUqmi/isoKGjV/litVmbPnt2q9ymEEG3Z1KlTSU5ORlGUkO2//fYbKpWKvLy8sNt89913qFQqXnzxxRbp0/z587n77rtb5NhCCHEg2lh3QAghmsO9995L165dw7anpKTEoDdCCCEaa/z48Xz++ef88ccfDBw4sH774sWL0Wq17Nixg4KCAjp37hzStue2LWH+/Pk888wzEqgLIWJCgnQhRLswZcoURowYEetuCCGEaKI9gfaiRYvCgvSpU6fy3XffsWjRIs4444z6tkWLFpGamkrfvn1bvb9CCNHSZLq7EKJD2DMt/scff+TSSy8lNTWVhIQEzjvvPCorK0P2/eijj5g2bRo5OTkYDAa6d+/Offfdh9/vD9kvEAhw8803k5iYSH5+Pl988UV92y233ILNZqNnz558/vnnIbebPXs2+fn5Idt27tyJyWRCpVKxbdu2+u35+flh0+cvueQSjEYjP/zwQ4OPefbs2Q0uBdj/9nPnzmX48OGYTCbS0tI455xz2LVr10Ed86677kKn01FaWhrWr0suuYSkpCRcLlf9th9++CHi8fZ/nrZs2cKpp55KTk4OarW6fr8BAwaEHevdd9/ltttuIysrC4vFwowZM9i5c2fI8SZOnBhy2z0effTRsNcC4Nlnn6V///4YDAZycnK48sorqaqqCrv90qVL66fwWiwWBg0axBNPPNGo53Df+430+s+dOzfic7O/qqoq+vbty6hRo3A6nfXbI73/rrrqKqxWK7/++mv9tsZ+DpryHObn53P88cfz1VdfMWTIEIxGI/369WPevHlht9/zWqekpGA2mxkzZgyfffZZyD77v28MBgO9evXiwQcfDJs6vT+Px8Odd97J8OHDSUxMxGKxcPjhh/P999/X76MoCvn5+Zxwwglht3e5XCQmJnLppZeGbL/77rsjvqYTJ04Me97237Z8+fL6/fc40Htl32OUlJRw4YUXkpmZidFoZPDgwbz++ush97HvuuzHHnuMvLw8TCYTEyZM4I8//gjZ91C/qxr7Xh01ahR6vb5+dHyPxYsXc8QRRzBq1KiQtkAgwM8//8y4cePqn6uqqiquu+46cnNzMRgM9OjRg4ceeohAIBDxPht67LNnz+aZZ54BQp//PRwOBzfeeGP9ffXu3ZtHH3007D3n8/m477776N69OwaDgfz8fG677TbcbnfIfns+F4sWLWLUqFEYjUa6devGG2+80eDzJoRov2QkXQjRoVx11VUkJSVx9913s2HDBp577jm2b99ef7IPwYDearVyww03YLVa+e6777jzzjupqanhkUceqT/WQw89xKOPPsq5557L8OHDuf766/F4PHz22WcMGTKEBx54gJdeeomTTz6ZtWvXRpyOv8edd94ZErRGc9ddd/Hyyy/z7rvvhp3gR2IwGHjppZdCti1fvpwnn3wyZNtrr73G+eefz8iRI3nwwQcpLi7miSeeYPHixf
z2228kJSU16Zjnnnsu9957L++++y5XXXVV/XaPx8P777/PKaecgtFoDOvvbbfdVj8y9uKLL7Jjx476Nr/fz4wZM9i+fTvXXXcdvXr1QqVS8cADD0R87A888AAqlYpbbrmFkpISHn/8cY4++mhWrlyJyWQ6wDMX7u677+aee+7h6KOP5vLLL69//yxfvpzFixej0+kA+Prrrzn++OPJzs7m2muvJSsri3Xr1vHpp59y7bXXcumll3L00UeHPFcnnXQSJ598cv229PT0iH3w+XzcfvvtjepvUlISn376KWPGjGHWrFm8++67IYHGHk899RTPPfcc8+bNY9iwYfXbG/s5aKo///yT008/ncsuu4xZs2bx6quvcuqpp/LFF18wefJkAIqLixk3bhx1dXVcc801pKam8vrrrzNjxgzef/99TjrppJBj7nnfOJ3O+oszGRkZXHjhhVH7UVNTw0svvcSZZ57JxRdfjN1u5+WXX+bYY49l2bJlDBkyBJVKxTnnnMPDDz9MRUVFyPKZTz75hJqaGs4555yIx3/uueewWq0AzJkzp1HPzS233BK27c0336z//8KFC3nxxRd57LHHSEtLAyAzMxMAp9PJxIkT2bRpE1dddRVdu3Zl7ty5zJ49m6qqKq699tqQ477xxhvY7XauvPJKXC4XTzzxBJMmTeL333+vP2Ykjf2uasp71Wg0Mnz4cBYtWlS/befOnezcuZNx48ZRVVUVcoHm999/p6ampn4Evq6ujgkTJrBr1y4uvfRSunTpwk8//cScOXMoKiri8ccfb9Jjv/TSSyksLOTrr78Oef4heOFmxowZfP/991x44YUMGTKEL7/8kptvvpldu3bx2GOP1e970UUX8frrrzNz5kxuvPFGli5dyoMPPsi6dev43//+F3LcTZs2MXPmTC688EJmzZrFK6+8wuzZsxk+fDj9+/dv1PMohGhHFCGEaMNeffVVBVCWL1/eqP2GDx+ueDye+u0PP/ywAigfffRR/ba6urqw21966aWK2WxWXC6XoiiK4nK5lIyMDOXMM8+s32fVqlWKRqNRBg8erLjdbkVRFKWsrEyx2WzKtddeW7/frFmzlLy8vPq///jjD0WtVitTpkxRAGXr1q31bXl5ecqsWbMURVGUF154QQGUp5566oDPy577sVgsYdvnzp2rAMr333+vKIqieDweJSMjQxkwYIDidDrr9/v0008VQLnzzjubfExFUZSxY8cqo0ePDtlv3rx5YfspiqJ8/fXXCqAsWLAg5L72fZ42bNigAMqDDz4YctsJEyYo/fv3r//7+++/VwClU6dOSk1NTf329957TwGUJ554Iupt93jkkUdCXouSkhJFr9crxxxzjOL3++v3e/rppxVAeeWVVxRFURSfz6d07dpVycvLUyorK0OOGQgEwu5HURQFUO66666Ibfu+/oqiKM8++6xiMBiUI488MuS5acjChQsVg8Gg3H777YqihD6vn3/+uaLRaJRHHnkk7HaN+RwoSuOfwz2PB1A++OCD+m3V1dVKdna2MnTo0Ppt1113nQIoCxcurN9mt9uVrl27Kvn5+fWvwZ7Xet/3k8vlUtRqtXLFFVc0+Lz4fL76z+kelZWVSmZmpnLBBRfUb9vzvnvuuedC9p0xY4aSn58f9rredtttCqCUlZXVb+vfv78yYcKEkP0mTJgQsm3+/PkKoBx33HFKtNOzPd9j+z6nezz++OMKoLz11lv12zwejzJ27FjFarXWfxa2bt2qAIrJZFIKCgrq9126dKkCKNdff339toP9rlKUpr9Xb775ZgWo79M777yjGI1Gxe12K/Pnz1c0Gk39Y9jzuVu8eLGiKIpy3333KRaLRdm4cWPIMW+99VZFo9EoO3bsaPJjv/LKKyO+Dh9++KECKPfff3/I9pkzZyoqlUrZtGmToiiKsnLlSgVQLrroopD9brrpJgVQvvvuu5DnDlB+/PHH+m0lJSWKwWBQbrzxxgM+d0KI9kemuw
shOpRLLrmkfsQT4PLLL0er1TJ//vz6bfuOstrtdsrKyjj88MOpq6tj/fr1QHAkp6SkJGT0c9CgQRiNRoYMGYJerwc
"text/plain": [
"<Figure size 1200x800 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# detect Rating outliers with the 1.5 * IQR rule\n",
"Q1 = df['Rating'].quantile(0.25)\n",
"Q3 = df['Rating'].quantile(0.75)\n",
"IQR = Q3 - Q1\n",
"\n",
"threshold = 1.5 * IQR\n",
"outliers = (df['Rating'] < (Q1 - threshold)) | (df['Rating'] > (Q3 + threshold))\n",
"\n",
"# replace the outliers with the median rating\n",
"median_rating = df['Rating'].median()\n",
"df.loc[outliers, 'Rating'] = median_rating\n",
"\n",
"plt.figure(figsize=(12, 8))\n",
"ax = sns.scatterplot(x='Subscribers', y='Rating', hue='Genre', data=df, palette='Set1')\n",
"# formatter for the x axis\n",
"ax.xaxis.set_major_formatter(FuncFormatter(thousands))\n",
"plt.title('Genre popularity among the Webtoon audience')\n",
"plt.xlabel('Number of subscribers')\n",
"plt.ylabel('Rating')\n",
"plt.legend(title='Genres')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"7) **Splitting** the data into training, validation, and test sets."
]
},
{
"cell_type": "code",
"execution_count": 289,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set size: 341\n",
"Validation set size: 114\n",
"Test set size: 114\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# hold out 20% of the data as the test set\n",
"train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)\n",
"\n",
"# split the remaining 80% into training (60% of all data) and validation (20% of all data)\n",
"train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)\n",
"\n",
"print(\"Training set size:\", len(train_df))\n",
"print(\"Validation set size:\", len(val_df))\n",
"print(\"Test set size:\", len(test_df))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"8) **Assessing how balanced the splits are.**<br/>"
]
},
{
"cell_type": "code",
"execution_count": 290,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Genre distribution in the training set:\n",
"Genre\n",
"Fantasy 59\n",
"Romance 51\n",
"Action 37\n",
"Drama 33\n",
"Slice of life 30\n",
"Comedy 29\n",
"Sci-fi 24\n",
"Thriller 19\n",
"Supernatural 18\n",
"Superhero 15\n",
"Horror 8\n",
"Sports 7\n",
"Historical 4\n",
"Mystery 3\n",
"Informative 3\n",
"Heartwarming 1\n",
"Name: count, dtype: int64\n",
"\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABSUAAAGJCAYAAAB8YFZgAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABfp0lEQVR4nO3deVhVVfv/8Q8gAjIKIk4oOOOYoilajhj6mEPa4FBqaYM5j+XzTa2sUMshe8zKFK00zQazzClKU3M2p8QxTMspTUU0UWH9/uji/DwCAofjOYjv13WdS/e0zr0Xe1j7Pmvv7WKMMQIAAAAAAAAAB3F1dgAAAAAAAAAA7i4kJQEAAAAAAAA4FElJAAAAAAAAAA5FUhIAAAAAAACAQ5GUBAAAAAAAAOBQJCUBAAAAAAAAOBRJSQAAAAAAAAAORVISAAAAAAAAgEORlAQAALjLXbp0SceOHdO5c+ecHQoAAADuEiQlAQAA7kKLFi1Sy5Yt5evrKx8fH5UtW1YTJ050dlgAAAC4S5CUBAAAKAB+/fVXPf744ypdurQ8PDxUqlQpde/eXb/++muGeV988UU9+uij8vX11cyZM7Vq1Sp9//33ev75550QOQAAAO5GLsYY4+wgAAAAYLsvv/xSXbt2VWBgoHr37q3w8HAdOXJEs2bN0tmzZ7VgwQI99NBDkqQ1a9aoWbNmio2N1YsvvujkyAEAAHC3IikJAABwBzt8+LBq1aqlsmXL6qefflJwcLBl2pkzZ3T//ffr2LFj2rVrl8qXL6927drp77//1vr1650YNQAAAO523L4NAABwB3vzzTd1+fJlffDBB1YJSUkqVqyY3n//fV26dMnyvMiNGzeqRo0a6tKliwIDA+Xl5aX69etr8eLFluWSk5Pl7e2tQYMGZfi+P/74Q25uboqNjZUk9erVS2FhYRnmc3Fx0csvv2wZ/v333/X888+rSpUq8vLyUlBQkB555BEdOXLEarnVq1fLxcVFq1evtozbsmWLWrVqJV9fX3l7e6tZs2Zau3at1XJz5syRi4uLtm7dahl35syZDHFI0oMPPpgh5rVr1+qRRx5R2bJl5eHhodDQUA0ZMkT//PNPhnX7/PPPVa9ePfn6+srFxcXyeeuttzLMCwAAgMwVcnYAAAAAsN0333yjsLAw3X///ZlOb9KkicLCwrR06VJJ0tmzZ/XBBx/Ix8dHAwcOVHBwsD755BN16tRJ8+bNU9euXeXj46OHHnpICxcu1OTJk+Xm5mYp79NPP5UxRt27d89VnFu2bNHPP/+sLl26qEyZMjpy5IhmzJihZs2aae/evSpSpEimyx06dEjNmjVTkSJFNGLECBUpUkQzZ85UdHS0Vq1apSZNmuQqjqwsWrRIly9fVt++fRUUFKTNmzfrnXfe0R9//KFFixZZ5tuwYYMeffRR1a5dW+PHj5e/v7/OnDmjIUOG2CUOAACAuwVJSQAAgDvUhQsXdPz4cXXo0OGW89WqVUtLlizRxYsXlf7knm+//VZNmzaVJD377LOKjIzU0KFD9fDDD8vd3V09evTQvHnztGrVKrVu3dpS1ieffKImTZqobNmykiRXV1fl5GlAbdu21cMPP2w1rl27doqKitIXX3yhJ554ItPlXnzxRaWkpGjz5s2qXr26JOnJJ59UlSpVNHToUKuekXkxYcIEeXl5WYafeeYZVaxYUf/973919OhRy/p+8803MsZo2bJlKlGihCTpyJEjJCUBAAByidu3AQAA7lAXL16UJPn6+t5yvvTpSUlJkqT69etbEpKS5OXlpeeff14nT57U9u3bJUnR0dEqVaqU5s2bZ5lvz5492rVrlx5//HHLuOLFi+v06dO6evXqLWO4MeF37do1nT17VhUrVlRAQIDlO2904cIFnT59WqtWrVJMTIwlISlJQUFB6tWrl7Zt26ZTp07d8ntz6sb4Ll26pDNnzqhRo0YyxuiXX36xTLt48aJcXV0VEBBgl+
8FAAC4W5GUBAAAuEOlJxvTk5NZuTl5WbVq1QzzRERESJLlGY+urq7q3r27Fi9erMuXL0uS5s2bJ09PTz3yyCOW5Ro1aqQrV67opZde0h9//KEzZ87ozJkzGcr/559/NGbMGIWGhsrDw0PFihVTcHCwzp8/rwsXLmSYv2PHjgoJCVFSUpKqVKmSbbx5dfToUfXq1UuBgYHy8fFRcHCwJXF7Y3xRUVFKS0vToEGDdPjwYZ05c0bnzp2zSwwAAAB3E5KSAAAAdyh/f3+VLFlSu3btuuV8u3btUunSpeXn52fVIzA7PXr0UHJyshYvXixjjObPn68HH3xQ/v7+lnnat2+vp556Sm+++aZCQ0MVHByc4YU7kjRgwAC9/vrrevTRR/XZZ59p5cqVWrVqlYKCgpSWlpZh/rfeektff/11jmPNi9TUVLVq1UpLly7VCy+8oMWLF2vVqlWaM2eOJFnF16VLFw0bNkxz5sxRxYoVFRwcrLp16zokTgAAgIKEZ0oCAADcwR588EHNnDlT69at03333Zdh+tq1a3XkyBE9++yzkqTw8HDt378/w3z79u2TJKu3UteoUUN16tTRvHnzVKZMGR09elTvvPNOhmVnzZqlMWPG6PDhw5YEXqtWrazm+fzzz9WzZ09NmjTJMu7KlSs6f/58pusVGRmppk2bysfHJ8fx2mr37t06cOCA5s6dqx49eljGr1q1KsO8rq6ueuutt7R7924lJibq3Xff1alTp6xuaQcAAED26CkJAABwBxsxYoS8vLz07LPP6uzZs1bT/v77bz333HOWN1dL0n/+8x9t3rxZP//8s2W+K1euaMaMGSpRooQiIyOtynjiiSe0cuVKTZ06VUFBQWrTpk2mcZQrV04tWrRQdHS0oqOjM0x3c3PL8EKcd955R6mpqVmum4uLix544AGtWLFCCQkJVus1d+5c1atXTyEhIVkun1Ppbxe/MT5jjN5+++1M53/nnXf0ww8/aN68eYqOjlbjxo3zHAMAAMDdhp6SAAAAd7BKlSpp7ty56t69u2rWrKnevXsrPDxcR44c0axZs3TmzBl9+umnqlChgiRp5MiRmjdvntq0aaOBAweqWLFi+uSTT7R3717NmzdPhQpZNw+7deumkSNH6quvvlLfvn3l7u5uU5wPPvigPv74Y/n7+6tatWrasGGDvv/+ewUFBd1yuXHjxmnFihVq2rSpBgwYoCJFimjmzJk6f/68Pv/88wzzb9iwwfJMy/QX+xw6dEjLly+3zPPXX3/pn3/+0fLly9W6dWtVrVpVFSpU0PDhw/Xnn3/Kz89PX3zxRabPivz11181cuRIvfzyy6pfv75NdQEAAACSkgAAAHe8Rx55RFWrVlVsbKwlERkUFKTmzZvrv//9r2rUqGGZNzg4WOvWrdMLL7ygd955RykpKapZs6a++uordejQIUPZISEheuCBB/Tdd9/piSeesDnGt99+W25ubpo3b56uXLmixo0b6/vvv1dMTMwtl6tWrZp++uknjRo1ShMnTlRaWprq1aunDz74QE2aNMkw/8CBAzOMmzdvntVbxNO1adNGxhi5u7vrm2++0cCBAxUbGytPT0899NBD6t+/v2rXrm2ZPyUlRd26dVO9evX04osv2lALAAAASOdibr6PBgAAALjBQw89pN27d+vQoUPODsVujhw5ovDw8Ay3lAMAAMAxeKYkAAAAsnTixAktXbo0T70kAQAAgJtx+zYAAAAySExM1Pr16/Xhhx/K3d3d8vbugsLLyyvbW8cBAABw+9BTEgAAABmsWbNGTzzxhBITEzV37lyVKFHC2SHZVUhIiNXLbwAAAOBYPFMSAAAAAAAAgEPRUxIAAAAAAACAQ5GUBAAAAAAAAOBQBf5FN2lpaTp+/Lh8fX3l4uLi7HAAAAAAAACAO4oxRhcvXlSpUqXk6mqfPo4FPil5/PhxhYaGOjsMAAAAAAAA4I527NgxlSlTxi5lFfikpK+vr6R/K83Pz8/J0Q
AAAAAAAAB3lqSkJIWGhlrybPZQ4JOS6bds+/n5kZQEAAAAAAAAbGTPRyPyohsAAAAAAAAADkVSEgAAAAAAAIBDkZQ
"text/plain": [
"<Figure size 1600x400 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def build_graph(df, column_name, title, xlabel):\n",
"    # bar chart of how often each value of column_name occurs\n",
"    genre_counts = df[column_name].value_counts()\n",
"    plt.figure(figsize=(16, 4))\n",
"    # passing hue with legend=False avoids the seaborn FutureWarning about palette without hue\n",
"    sns.barplot(x=genre_counts.index, y=genre_counts.values,\n",
"                hue=genre_counts.index, palette='viridis', legend=False)\n",
"    plt.title(title)\n",
"    plt.xlabel(xlabel)\n",
"    plt.ylabel('Count')\n",
"    plt.show()\n",
"\n",
"def check_balance(df, name):\n",
"    # print the number of rows per genre\n",
"    counts = df['Genre'].value_counts()\n",
"    print(f\"Genre distribution in {name}:\")\n",
"    print(counts)\n",
"    print()\n",
"\n",
"check_balance(train_df, \"the training set\")\n",
"build_graph(train_df, 'Genre', 'Training set', 'Genre')"
]
},
{
"cell_type": "code",
"execution_count": 291,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Genre distribution in the validation set:\n",
"Genre\n",
"Romance 20\n",
"Fantasy 19\n",
"Slice of life 13\n",
"Comedy 11\n",
"Drama 11\n",
"Superhero 8\n",
"Thriller 7\n",
"Action 6\n",
"Horror 6\n",
"Supernatural 6\n",
"Sci-fi 4\n",
"Mystery 2\n",
"Sports 1\n",
"Name: count, dtype: int64\n",
"\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABSgAAAGJCAYAAACJnt3QAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABdZ0lEQVR4nO3de3zP9f//8fvbzIYdhNkhMmdzZsKc5TA+EpFPSTmESoRIfXw+tBz6jEqkpBJTSZNyKjVq5UyOCzXHNofYHMJMHzPb8/dHv72/3nawzeY1c7teLq8Lr9fr+Xq+Hq/n+/1+vl+vx57v18tmjDECAAAAAAAAAAsUsToAAAAAAAAAAHcvEpQAAAAAAAAALEOCEgAAAAAAAIBlSFACAAAAAAAAsAwJSgAAAAAAAACWIUEJAAAAAAAAwDIkKAEAAAAAAABYhgQlAAAAAAAAAMuQoAQAAAAAAABgGRKUAAAAAAAAACxDghIAAKCQWrBggWw2m3bs2JFu3dy5c2Wz2dSjRw+lpKRYEB0AAADwNxKUAAAAd5lly5Zp6NChatWqlcLDw+Xk5GR1SAAAALiLkaAEAAC4i6xdu1Z9+vRRrVq19PXXX8vV1dXqkAAAAHCXI0EJAABwl4iKilL37t3l6+ur1atXy9PT02H9kiVLFBgYqOLFi6ts2bJ64okn9McffziUGTBggNzc3NLV/eWXX8pms2nt2rWSpLZt28pms2U5pbHZbBo+fLg+++wz1ahRQ66urgoMDNT69evT7Wf37t3q0qWLPDw85Obmpvbt22vr1q0ZHm9mMSxYsMChTJ06dW7admkx3ujBBx+Uv7+/w7I333xTzZs3V5kyZVS8eHEFBgbqyy+/TLdtYmKixowZo8qVK8vZ2dkhxrNnz940JgAAgMKiqNUBAAAAIP8dOXJEnTt3louLi1avXi1fX1+H9QsWLNDAgQN1//33KzQ0VPHx8Xr77be1adMm7d69W6VKlcrR/v7zn/9o8ODBkqSzZ8/qhRde0NNPP61WrVplWH7dunVavHixRowYIRcXF7333nvq3Lmztm3bZk8g/vrrr2rVqpU8PDz00ksvydnZWR988IHatm2rdevWqWnTpunqrVmzpv7zn/84xJHf3n77bT300EPq27evrl69qvDwcPXu3VvffPONunbtai83duxYvf/++xo0aJBatGghZ2dnLV26VMuWLcv3GAEAAAoSEpQAAACFXHx8vB577DHFx8erU6dOql69usP65ORkvfzyy6pTp47Wr19v/9l3y5Yt9eCDD2rGjBmaOHFijvbZsWNH+/9jY2P1wgsvKCgoSE888USG5fft26cdO3YoMDBQkvTYY4+pRo0aeuWVV7R06VJJ0vjx45WcnKyNGzeqcuXKkqR+/fqpRo0aeumll7Ru3TqHOq9duyZfX1/7PtPiyG8HDx5U8eLF7fPDhw9Xo0aN9NZbbzkkKFesWKHg4GB99NFH9mWHDx8mQQkAAO46/MQbAACgkBswYICOHz+uxx9/XGvWrNGSJUsc1u/YsUOnT5/Wc88953BPyq5du6pmzZpatWpVujrPnj3rMF26dOmWYgwKCrInJyXpvvvuU/fu3bV69WqlpKQoJSVFa9asUY8ePezJSUny9fXV448/ro0bNyohIcGhzqtXr8rFxeWm+05JSbEfx9WrVzMtd+XKlXTHnZycnK7c9cnJ8+fP6+LFi2rVqpV27drlUO7SpUsqU6bMTeMDAAAo7EhQAgAAFHJ//vmnFi5cqI8//lgNGjTQyJEjdfHiRfv6o0ePSpJq1KiRbtuaNWva16e5fPmyvLy8HKannnrqlmKsVq1aumXVq1fXX3/9pTNnzujMmTP666+/MowxICBAqampOn78uMPyCxcuZHi/zBvt37/ffhzFixdXjRo1tGjRonTl5s2bl+6416xZk67cN998o2bNmsnV1VWlS5eWl5eX5syZ49Dm0t9J2WXLlunLL7/UqVOndP
bsWf311183jRcAAKCw4SfeAAAAhdwbb7yh3r17S5I+/PBDNWvWTOPGjdN7772Xq/pcXV319ddfOyzbsGGDJk2adMux5qW4uDgFBwfftJy/v7/mzp0rSTp37pxmzZqlJ598UpUrV1azZs3s5bp3757uQTnjx49XXFycfX7Dhg166KGH1Lp1a7333nvy9fWVs7OzwsLC0iU9P/zwQ/Xp08f+2gAAANytSFACAAAUcq1bt7b///7779ewYcM0e/Zs9evXT82aNVPFihUlSQcOHNADDzzgsO2BAwfs69M4OTmpQ4cODssuXLhwSzEeOnQo3bKDBw+qRIkS8vLykiSVKFFCBw4cSFdu//79KlKkiCpUqGBfduLECV26dEkBAQE33XfJkiUdjqdVq1a69957tWbNGocEZfny5dMd98yZMx0SlF999ZVcXV21evVqh5+Xh4WFpduvv7+/Fi5cqLp16+qpp55Sjx499Mknn+jTTz+9acwAAACFCT/xBgAAuMu89tpr8vX11dNPP61r166pcePGKleunN5//30lJSXZy3333XeKjo52eLBLftmyZYvDPRqPHz+uFStWqFOnTnJycpKTk5M6deqkFStWKDY21l4uPj5eixYtUsuWLeXh4WFfHh4eLknpEq7ZkZqaKunvRGxOOTk5yWazKSUlxb4sNjZWy5cvT1f22rVr6tu3r2rXrq0ZM2aoQ4cODvfXBAAAuFswghIAAOAu4+7urnfeeUc9e/bU9OnT9fLLL2vatGkaOHCg2rRpoz59+ig+Pl5vv/22/P39b8uTr+vUqaPg4GCNGDFCLi4u9p+fX//08ClTpuj7779Xy5Yt9dxzz6lo0aL64IMPlJSUpNdff13S3wnLkJAQffTRR3rsscdUs2bNm+47MTFRERERkv6+X+esWbPk7Oycq8Rs165d9dZbb6lz5856/PHHdfr0ac2ePVtVq1bVnj17HMpOnDhRe/fu1e7du+Xs7JzjfQEAABQWJCgBAADuQg8//LC6d++uSZMm6Z///KcGDBigEiVKaOrUqXr55ZdVsmRJPfzww5o2bZpKlSqV7/G0adNGQUFBmjhxoo4dO6ZatWppwYIFqlevnr1M7dq1tWHDBo0bN06hoaFKTU1V06ZNtXDhQjVt2lSSdOTIEUVGRmrChAkaN25ctvZ99OhRdenSRZJUqlQp1a5dWytXrlSDBg1yfBwPPPCA5s2bp6lTp2rUqFGqVKmSpk2bptjYWIcE5caNGxUaGqr33ntP1atXz/F+AAAAChObMcZYHQQAAADuXjabTcOGDdO7775rdSgAAACwAPegBAAAAAAAAGAZEpQAAAAAAAAALEOCEgAAAAAAAIBleEgOAAAALMUt0QEAAO5ujKAEAAAAAAAAYBkSlAAAAAAAAAAsw0+8M5CamqqTJ0/K3d1dNpvN6nAAAAAAAACAO4oxRpcuXZKfn5+KFMl6jCQJygycPHlSFSpUsDoMAAAAAAAA4I52/PhxlS9fPssyJCgz4O7uLunvBvTw8LA4GgAAAAAAAODOkpCQoAoVKtjzbFkhQZmBtJ91e3h4kKAEAAAAAAAAcik7t0/kITkAAAAAAAAALEOCEgAAAAAAAIBlSFACAAAAAAAAsAwJSgAAAAAAAACWIUEJAAAAAAAAwDIkKAEAAAAAAABYhgQlAAAAAAAAAMtYmqAMDQ3V/fffL3d3d5UrV049evTQgQMHHMpcuXJFw4YNU5kyZeTm5qZevXopPj4+y3qNMXrllVfk6+ur4sWLq0OHDjp06FB+HgoAAAAAAACAXLA0Qblu3ToNGzZMW7du1ffff6/k5GR16tRJly9ftpd54YUX9PXXX2vJkiVat26dTp48qZ49e2ZZ7+uvv65Zs2bp/fff188//6ySJUsqODhYV65cye9DAgAAAAAAAJADNmOMsTqINGfOnFG5cuW0bt06tW7dWhcvXpSXl5cWLVqkRx55RJK0f/9+BQQEaMuWLWrWrFm6Oowx8v
Pz05gxY/Tiiy9Kki5evChvb28tWLBAjz322E3jSEhIkKenpy5evCgPD4+8PUgAAAAAAACgkMtJfq1A3YPy4sWLkqT
"text/plain": [
"<Figure size 1600x400 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"check_balance(val_df, \"the validation set\")\n",
"build_graph(val_df, 'Genre', 'Validation set', 'Genre')"
]
},
{
"cell_type": "code",
"execution_count": 292,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Genre distribution in the test set:\n",
"Genre\n",
"Romance 19\n",
"Fantasy 17\n",
"Drama 16\n",
"Comedy 12\n",
"Supernatural 9\n",
"Thriller 9\n",
"Horror 6\n",
"Slice of life 6\n",
"Sci-fi 4\n",
"Action 4\n",
"Mystery 4\n",
"Superhero 3\n",
"Informative 2\n",
"Sports 2\n",
"Heartwarming 1\n",
"Name: count, dtype: int64\n",
"\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABTAAAAGJCAYAAAC95xzMAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABf80lEQVR4nO3dZ3gV1fr38d9OgATSaCEQiIRO6E0wIE1K4CBFsCHSBFQOVRA9HJEg6AmIFBGOBSGAqCCKwBEEMQJKb4YiocXQpAmCISgtWc8Ln8yfTQopO2SQ7+e65oJpa9+zMjN7zb3XzDiMMUYAAAAAAAAAYENuuR0AAAAAAAAAAKSFBCYAAAAAAAAA2yKBCQAAAAAAAMC2SGACAAAAAAAAsC0SmAAAAAAAAABsiwQmAAAAAAAAANsigQkAAAAAAADAtkhgAgAAAAAAALAtEpgAAAAAAAAAbIsEJgAAAAAAAADbIoEJAACAVDkcjgwNa9euze1QAQAA8DeWJ7cDAAAAgD199NFHTuPz5s3T6tWrU0wPCQm5k2EBAADgHuMwxpjcDgIAAAD2N3DgQM2YMUM0HwEAAHAncQs5AAAAXOLq1asKDw9X+fLl5eHhoaCgIL300ku6evVqimXnz5+v+vXrq0CBAipUqJCaNGmib775RpIUHByc7i3rwcHBVjmXL1/W8OHDFRQUJA8PD1WqVElvvfVWiiTrzeu7u7urZMmSevbZZ3Xx4kVrmWvXrmn06NGqW7eu/Pz85OXlpcaNG2vNmjUp4j979qz69Omj++67T+7u7lbZ3t7erqlMAAAAWLiFHAAAANmWlJSkDh06aP369Xr22WcVEhKiPXv2aMqUKTp48KCWLFliLfvaa69pzJgxatiwocaOHat8+fJpy5Yt+u6779S6dWtNnTpVCQkJkqSYmBj95z//0b///W/rVvXkJKExRh06dNCaNWvUp08f1apVS6tWrdKIESP0yy+/aMqUKU4xPvLII+rcubNu3LihTZs26YMPPtCff/5p3RIfHx+vDz/8UF27dlW/fv106dIlzZo1S2FhYdq6datq1aplldWzZ099++23GjRokGrWrCl3d3d98MEH2rlzZw7WMgAAwL2JW8gBAACQIendQj5//nz17NlT69at04MPPmhNf//99/X8889rw4YNatiwoQ4fPqxKlSqpY8eO+vzzz+Xm9n83BBlj5HA4nMpdu3atmjdvrjVr1qhZs2ZO85YuXapOnTrp9ddf1yuvvGJNf+yxx/TFF1/o0KFDKleunKS/emCGh4drzJgx1nKNGjXSxYsX9dNPP0mSEhMTlZiYqHz58lnLXLx4UZUrV1a7du00a9YsSdKVK1fk5eWlfv366b333rOW7dWrlz7//HMr+QoAAADX4BZyAAAAZNuiRYsUEhKiypUr69y5c9bw0EMPSZJ1G/aSJUuUlJSk0aNHOyUvJaVIXt7OihUr5O7ursGDBztNHz58uIwx+vrrr52m//HHHzp37pxOnz6tL774Qrt27VKLFi2s+e7u7lbyMikpSb/99ptu3LihevXqOfWsvHz5spKSklSkSJFMxQsAAICs4RZyAAAAZNuhQ4cUExMjf3//VOefPXtWkhQbGys3NzdVqVIl25959OhRBQYGysfHx2l68q3mR48edZo+ceJETZw40Rpv06aNJkyY4LTM3LlzNWnSJO3fv1/Xr1+3ppcpU8b6f5EiRVShQgV9+OGHatq0qWrVqiU3N7dUn/UJAACA7COBCQAAgGxLSkpS9erVNXny5FTnBwUF3eGIUurevbt69OihpKQk/fzzzxo3bpwefvhhffvtt3I4HJo/f7569eqlTp06acSIESpWrJjc3d0VERGh2NhYp7IWLlyobt26KSwszGm6l5fXndwkAACAewIJTAAAAGRbuXLlrFuy07sVvFy5ckpKStK+ffucXoqTFaVLl9a3336rS5cuOfXC3L9/vzX/ZmXLllXLli2tcT
8/Pz311FPavHmzQkND9fnnn6ts2bJavHix0zaEh4en+OzatWtr5syZaty4scaOHasHHnhAEydO1IYNG7K1TQAAAEiJZ2ACAAAg2x5//HH98ssvmjlzZop5f/75py5fvixJ6tSpk9zc3DR27FglJSU5LZfZd0v+4x//UGJioqZPn+40fcqUKXI4HGrbtm266//555+SZN367e7uniKOLVu2aNOmTSnWjY+PV/fu3dWhQweNGjVKLVu2VIkSJTIVPwAAADKGHpgAAADItu7du+uzzz7T888/rzVr1qhRo0ZKTEzU/v379dlnn2nVqlWqV6+eypcvr1deeUXjxo1T48aN1blzZ3l4eGjbtm0KDAxUREREhj+zffv2at68uV555RUdOXJENWvW1DfffKOlS5dq6NCh1hvIk+3evVvz58+XMUaxsbGaNm2aSpUqpXr16kmSHn74YS1evFiPPPKI2rVrp7i4OL333nuqUqVKijeLDxgwQH/++ac+/PDD7FceAAAA0kUCEwAAANnm5uamJUuWaMqUKZo3b56+/PJLFShQQGXLltWQIUNUsWJFa9mxY8eqTJkyeuedd/TKK6+oQIECqlGjhrp3757pz1y2bJlGjx6thQsXKjIyUsHBwZo4caKGDx+eYvkvv/xSX375pRwOhwICAtS8eXO98cYb8vb2liT16tVLp0+f1vvvv69Vq1apSpUqmj9/vhYtWqS1a9da5SxYsEAff/yxvv76axUtWjRrFQYAAIAMc5jM3qsDAAAAAAAAAHcIz8AEAAAAAAAAYFskMAEAAAAAAADYFglMAAAAAAAAALZFAhMAAAAAAACAbZHABAAAAAAAAGBbJDABAAAAAAAA2Fae3A7AjpKSknTy5En5+PjI4XDkdjgAAAAAAADAXcUYo0uXLikwMFBubtnrQ0kCMxUnT55UUFBQbocBAAAAAAAA3NWOHz+uUqVKZasMEpip8PHxkfRXBfv6+uZyNAAAAAAAAMDdJT4+XkFBQVaeLTtIYKYi+bZxX19fEpgAAAAAAABAFrni8Yy8xAcAAAAAAACAbZHABAAAAAAAAGBbJDABAAAAAAAA2BYJTAAAAAAAAAC2RQITAAAAAAAAgG2RwAQAAAAAAABgWyQwAQAAAAAAANgWCUwAAAAAAAAAtkUCEwAAAAAAAIBtkcAEAAAAAAAAYFskMAEAAAAAAADYVp7cDuBuFVaud26HYAurYiNzOwQAAAAAAAD8jdEDEwAAAAAAAIBtkcAEAAAAAAAAYFskMAEAAAAAAADYFglMAAAAAAAAALZFAhMAAAAAAACAbZHABAAAAAAAAGBbJDABAAAAAAAA2BYJTAAAAAAAAAC2RQITAAAAAAAAgG2RwAQAAAAAAABgWyQwAQAAAAAAANgWCUwAAAAAAAAAtkUCEwAAAAAAAIBtkcAEAAAAAAAAYFskMAEAAAAAAADYFglMAAAAAAAAALZFAhMAAAAAAACAbZHABAAAAAAAAGBbJDABAAAAAAAA2FauJjC///57tW/fXoGBgXI4HFqyZInTfIfDkeowceLENMscM2ZMiuUrV66cw1sCAAAAAAAAICfkagLz8uXLqlmzpmbMmJHq/FOnTjkNs2fPlsPhUJcuXdItt2rVqk7rrV+/PifCBwAAAAAAAJDD8uTmh7dt21Zt27ZNc37x4sWdxpcuXarmzZurbNmy6ZabJ0+eFOsCAAAAAAAAuPvcNc/APHPmjJYvX64+ffrcdtlDhw4pMDBQZcuWVbdu3XTs2LF0l7969ari4+OdBgAAAAAAAAC5765JYM6dO1c+Pj7q3Llzuss1aNBAc+bM0cqVK/Xuu+8qLi5OjRs31qVLl9JcJyIiQn5+ftYQFBTk6vABAAAAAAAAZMFdk8CcPXu2unXrJk9Pz3SXa9u2rR577DHVqFFDYWFhWrFihS5evKjPPvsszXVGjhyp33//3RqOHz/u6vABAAAAAAAAZEGuPgMzo3744QcdOHBACxcuzP
S6BQsWVMWKFXX48OE0l/Hw8JCHh0d2QgQAAAAAAACQA+6KHpizZs1S3bp1VbNmzUyvm5CQoNjYWJUoUSIHIgMAAAA
"text/plain": [
"<Figure size 1600x400 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"check_balance(test_df, \"the test set\")\n",
"build_graph(test_df, 'Genre', 'Test set', 'Genre')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The genre distributions are broadly similar across the three splits, but the genres themselves are unevenly represented."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"9) Data augmentation with oversampling and undersampling"
]
},
{
"cell_type": "code",
"execution_count": 293,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Genre distribution in the training set after oversampling:\n",
"Genre\n",
"Action 59\n",
"Fantasy 59\n",
"Romance 59\n",
"Thriller 59\n",
"Sci-fi 59\n",
"Sports 59\n",
"Slice of life 59\n",
"Superhero 59\n",
"Mystery 59\n",
"Comedy 59\n",
"Horror 59\n",
"Drama 59\n",
"Supernatural 59\n",
"Historical 59\n",
"Informative 59\n",
"Heartwarming 59\n",
"Name: count, dtype: int64\n",
"\n"
]
}
],
"source": [
"from imblearn.over_sampling import RandomOverSampler\n",
"\n",
"def oversample(df):\n",
"    # separate the features from the target column\n",
"    X = df.drop('Genre', axis=1)\n",
"    y = df['Genre']\n",
"\n",
"    # randomly duplicate rows of the smaller genres until every genre matches the largest one\n",
"    oversampler = RandomOverSampler(random_state=42)\n",
"    X_resampled, y_resampled = oversampler.fit_resample(X, y)\n",
"\n",
"    resampled_df = pd.concat([X_resampled, y_resampled], axis=1)\n",
"    return resampled_df\n",
"\n",
"train_df_oversampled = oversample(train_df)\n",
"val_df_oversampled = oversample(val_df)\n",
"test_df_oversampled = oversample(test_df)\n",
"\n",
"# check the result\n",
"check_balance(train_df_oversampled, \"the training set after oversampling\")"
]
},
{
"cell_type": "code",
"execution_count": 294,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Genre distribution in the validation set after oversampling:\n",
"Genre\n",
"Drama 20\n",
"Comedy 20\n",
"Fantasy 20\n",
"Mystery 20\n",
"Romance 20\n",
"Horror 20\n",
"Superhero 20\n",
"Sports 20\n",
"Sci-fi 20\n",
"Action 20\n",
"Thriller 20\n",
"Supernatural 20\n",
"Slice of life 20\n",
"Name: count, dtype: int64\n",
"\n"
]
}
],
"source": [
"check_balance(val_df_oversampled, \"the validation set after oversampling\")"
]
},
{
"cell_type": "code",
"execution_count": 295,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Genre distribution in the test set after oversampling:\n",
"Genre\n",
"Fantasy 19\n",
"Drama 19\n",
"Supernatural 19\n",
"Sci-fi 19\n",
"Horror 19\n",
"Informative 19\n",
"Comedy 19\n",
"Romance 19\n",
"Thriller 19\n",
"Slice of life 19\n",
"Superhero 19\n",
"Sports 19\n",
"Mystery 19\n",
"Action 19\n",
"Heartwarming 19\n",
"Name: count, dtype: int64\n",
"\n"
]
}
],
"source": [
"check_balance(test_df_oversampled, \"the test set after oversampling\")\n"
]
},
{
"cell_type": "code",
"execution_count": 296,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Distribution of Genre in the training set after undersampling:\n",
"Genre\n",
"Action 1\n",
"Comedy 1\n",
"Drama 1\n",
"Fantasy 1\n",
"Heartwarming 1\n",
"Historical 1\n",
"Horror 1\n",
"Informative 1\n",
"Mystery 1\n",
"Romance 1\n",
"Sci-fi 1\n",
"Slice of life 1\n",
"Sports 1\n",
"Superhero 1\n",
"Supernatural 1\n",
"Thriller 1\n",
"Name: count, dtype: int64\n",
"\n"
]
}
],
"source": [
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"def undersample(df):\n",
"    X = df.drop('Genre', axis=1)\n",
"    y = df['Genre']\n",
"\n",
"    undersampler = RandomUnderSampler(random_state=42)\n",
"    X_resampled, y_resampled = undersampler.fit_resample(X, y)\n",
"\n",
"    resampled_df = pd.concat([X_resampled, y_resampled], axis=1)\n",
"    return resampled_df\n",
"\n",
"train_df_undersampled = undersample(train_df)\n",
"val_df_undersampled = undersample(val_df)\n",
"test_df_undersampled = undersample(test_df)\n",
"\n",
"check_balance(train_df_undersampled, \"training set after undersampling\")"
]
},
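{
"cell_type": "markdown",
"metadata": {},
"source": [
"Random undersampling works in the opposite direction: each class is cut down to the size of the smallest one. The sketch below illustrates that idea in plain Python under this assumption; it is not the imblearn implementation, and `random_undersample` and the toy rows are hypothetical. It also suggests why the training set above collapses to one row per genre: the rarest genre has a single comic in the training split, and that sets the target size for every class."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"from collections import Counter, defaultdict\n",
"\n",
"def random_undersample(rows, label_key, seed=42):\n",
"    # group the rows by class label\n",
"    rng = random.Random(seed)\n",
"    by_class = defaultdict(list)\n",
"    for row in rows:\n",
"        by_class[row[label_key]].append(row)\n",
"    # every class is shrunk to the size of the smallest class\n",
"    target = min(len(group) for group in by_class.values())\n",
"    resampled = []\n",
"    for group in by_class.values():\n",
"        # draw without replacement, so no row is repeated\n",
"        resampled.extend(rng.sample(group, target))\n",
"    return resampled\n",
"\n",
"# hypothetical toy data: three dramas and one horror comic\n",
"toy_rows = [{'Genre': 'Drama'}] * 3 + [{'Genre': 'Horror'}]\n",
"reduced = random_undersample(toy_rows, 'Genre')\n",
"print(Counter(row['Genre'] for row in reduced))"
]
},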
{
"cell_type": "code",
"execution_count": 297,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Distribution of Genre in the validation set after undersampling:\n",
"Genre\n",
"Action 1\n",
"Comedy 1\n",
"Drama 1\n",
"Fantasy 1\n",
"Horror 1\n",
"Mystery 1\n",
"Romance 1\n",
"Sci-fi 1\n",
"Slice of life 1\n",
"Sports 1\n",
"Superhero 1\n",
"Supernatural 1\n",
"Thriller 1\n",
"Name: count, dtype: int64\n",
"\n"
]
}
],
"source": [
"check_balance(val_df_undersampled, \"validation set after undersampling\")"
]
},
{
"cell_type": "code",
"execution_count": 298,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Distribution of Genre in the test set after undersampling:\n",
"Genre\n",
"Action 1\n",
"Comedy 1\n",
"Drama 1\n",
"Fantasy 1\n",
"Heartwarming 1\n",
"Horror 1\n",
"Informative 1\n",
"Mystery 1\n",
"Romance 1\n",
"Sci-fi 1\n",
"Slice of life 1\n",
"Sports 1\n",
"Superhero 1\n",
"Supernatural 1\n",
"Thriller 1\n",
"Name: count, dtype: int64\n",
"\n"
]
}
],
"source": [
"check_balance(test_df_undersampled, \"test set after undersampling\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dataset 2. Mobile phone usage"
]
},
{
"cell_type": "code",
"execution_count": 299,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['User_ID', 'Age', 'Gender', 'Total_App_Usage_Hours',\n",
" 'Daily_Screen_Time_Hours', 'Number_of_Apps_Used',\n",
" 'Social_Media_Usage_Hours', 'Productivity_App_Usage_Hours',\n",
" 'Gaming_App_Usage_Hours', 'Location'],\n",
" dtype='object')\n"
]
}
],
"source": [
"df = pd.read_csv(\".//csv//mobile_usage_behavioral_analysis.csv\")\n",
"print(df.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1) **Business goal:** for a research magazine, establish statistics on how much time people spend on social media.\n",
"2) **Effect:** increased reader interest in the magazine.\n",
"3) **Technical goal:** build a model that predicts social media usage as a function of age.\n",
"4) **Input data:** 'Age'\n",
"5) **Target feature:** 'Social_Media_Usage_Hours'"
]
},
{
"cell_type": "code",
"execution_count": 300,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAlgAAAK9CAYAAAD47xLtAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAADqwElEQVR4nOydeXwURfr/P5NjkswkmYQJ4VgJBBIF5QoLHiSAgBeKCvqDr7iuHK64yuHKqoj7FQFdUBRckfUWcHWNrge4ousBKAJeKFFQUAkLBAGJCclMkkkyJOnfH3xnZDJXV09Xz9Pd9X69fL1kutNTU1311KeeeuopiyRJEgQCgUAgEAgEqpEQ7wIIBAKBQCAQGA0hsAQCgUAgEAhURggsgUAgEAgEApURAksgEAgEAoFAZYTAEggEAoFAIFAZIbAEAoFAIBAIVEYILIFAIBAIBAKVEQJLIBAIBAKBQGWEwBIIBAKBQCBQGSGwBAKBQCCIwk8//YQ1a9b4/33gwAH885//jF+BBOQxtcBas2YNLBZL2P9++uknTcuTnp6OKVOmaPqdAoFAIIiOxWLBjBkz8N577+HAgQO48847sWXLlngXS0CYpHgXgAKLFi1Cfn5+0OcdOnSIQ2kEAoFAQI3f/OY3uPHGG3HJJZcAALp06YKPPvoovoUSkEYILABjxozB4MGD410MgUAgEBDmb3/7G2bNmoWqqir07dsXdrs93kUSEMbUS4Ry8S0lfvzxx7jpppvgdDqRmZmJ66+/HjU1NQH3vvnmm7jsssvQtWtXpKSkoFevXrjvvvvQ2toacF9bWxvuuOMOOBwO9OjRA++++67/2ty5c5GRkYHCwkL85z//Cfi7KVOmoEePHgGfHTp0CGlpabBYLDhw4ID/8x49egQtOU6fPh2pqalRZ14LFizAmWeeifT0dGRmZuLcc8/FunXrAu7ZsmULJkyYgLy8PKSkpKBbt2647bbb0NjYGFTmU5des7Ozcf755we510OV99VXX4XFYgn6zW1tbXj00UfRr18/pKamomPHjrjkkkvw5Zdf+u+xWCxYsGBBwN899NBDsFgsOP/88/2fffTRR/6yff311wH3Hz58GImJibBYLHjttdcCrm3atAnDhg2D3W5HVlYWrrzySuzZsyeoLg8fPowbbrjB3yby8/Nx8803w+v1Rl2mtlgs/rgPlncfCta//89//oMRI0YgIyMDmZmZGDJkCF566SX/9fPPPz9iuds/7/HHH8dZZ52FlJQUdO3aFTNmzEBtbW1QOQ8cOBD2me3vefjhhyP+5jfeeANnn302OnTogLS0NPTu3RsPPvggJEkKuK+srAxjxoxBZmYm0tPTMXr0aHz22WcB97R/VzabDf369cOzzz4bcN/OnTsxZcoU9OzZE6mpqejcuTOmTZuG6urqgPsWLFgAi8WCqqqqgM+//PLLgPcOsL27lpYW3H///Tj99NORkpISUOZT+0c4vv/+e0ycOBEdO3ZEWloazjjjDPzlL39RXF/tv7Oqqipk3wSAF198EWeffTZsNhuys7MxfPhwvP/++wH3/Oc///H3u4yMDFx22WX47rvvAuoqWp86tb6iPc/3TF/99+rVC+eccw6OHz8uu+8B0e2Frz1E+i+S3W7/9xkZGTj77LOD7DZw0q7+9re/RVpaGnJycnDdddfh8OHDQc+LNgacf/756Nu3L7766isMHToUaWlpyM/Px5NPPhlwn9frxfz58/Hb3/4WDocDdrsdw4YNw4cffhhUtmi2PVod+Ww7y3fyQniwGJg5cyaysrKwYMEC/PDDD3jiiSdw8OBB/wANnDQq6enpmDNnDtLT07Fp0ybMnz8fbrcbDz30kP9ZDz74IB5++GH8/ve/x29/+1vcdttt8Hq9ePvttzFw4ED89a9/xbPPPourrroKu3fvDrmE6WP+/PloamqKWv57770Xzz33HF555ZUAgRGKhoYGjB8/Hj169EBjYyPWrF
mDq6++Gp9++inOPvtsACc7qcfjwc033wyn04kvvvgCjz32GH766Se8+uqrAc/LycnBI488AuBksOijjz6KSy+9FIcOHUJWVlbIMrS0tAQZdh833HAD1qxZgzFjxuAPf/gDWlpasGXLFnz22WdhvZG1tbVYsmRJ2N+cmpqK1atX49FHH/V/9vzzz8NqtQbV74YNGzBmzBj07NkTCxYsQGNjIx577DEUFxdjx44dfmN85MgRnH322aitrcX06dPRu3dvHD58GK+99ho8Hg+GDx+OF154wf/cv/71rwAQ8LuHDh0atsxy3z3r369ZswbTpk3DWWedhXnz5iErKwtlZWV49913ce211/rvO+2004Lq9J133kFpaWnAZwsWLMDChQtxwQUX4Oabb/b3n+3bt2Pbtm1ITk4OKsP06dMxbNgwACeF0tq1a5l/n9vtxjnnnIPJkycjOTkZ7777Lu666y4kJSXhz3/+MwDgu+++w7Bhw5CZmYk777wTycnJeOqpp3D++edj8+bNOOeccwKe+cgjjyAnJwdutxurVq3CjTfeiB49euCCCy4AAHzwwQf473//i6lTp6Jz58747rvv8PTTT+O7777DZ599FiAUYyHcu1u2bBnuuecejB8/HnPnzkVKSgq2bNmCp59+Ouozd+7ciWHDhiE5ORnTp09Hjx49sG/fPrz11lv+tslaX3JZuHAhFixYgKFDh2LRokWwWq34/PPPsWnTJlx00UUAgBdeeAGTJ0/GxRdfjAcffBAejwdPPPEESkpKUFZWhh49euCmm27yvwsA+P3vf4/x48fjqquu8n/WsWNH2c8LB0vfk2MvrrrqKhQUFPj/5rbbbkOfPn0wffp0/2d9+vSJ+l0+e1JVVYXHH38cEyZMwLfffoszzjgDwMm+PXXqVAwZMgRLlizBsWPH8Oijj2Lbtm0oKyvz22M5YwAA1NTU4NJLL8XEiRMxadIk/Otf/8LNN98Mq9WKadOmATjZD5999llMmjQJN954I+rq6vDcc8/h4osvxhdffIGBAwf6nxfNtp9qL33t2tcnAaBTp07M38kNycSsXr1aAiBt375d1n2//e1vJa/X6/986dKlEgDpzTff9H/m8XiC/v6mm26SbDab1NTUJEmSJDU1NUm5ubnSpEmT/Pd88803UmJiojRgwACpublZkiRJqqqqkjIyMqRbb73Vf9/kyZOl7t27+//97bffSgkJCdKYMWMkANL+/fv917p37y5NnjxZkiRJeuqppyQA0mOPPRa1XkJRWVkpAZAefvjhiL91yZIlksVikQ4ePBi2zJIkSU8//bQEQPriiy9ClleSJOnxxx+XUlJSpJEjRwb8/aZNmyQA0uzZs4O+v62tzf//AKR7773X/+8777xTys3NlX77299KI0aM8H/+4YcfSgCkSZMmSU6n01//kiRJhYWF0rXXXisBkF599VX/5wMHDpRyc3Ol6upq/2fffPONlJCQIF1//fX+z66//nopISEhZBs7taw+RowYEVC2U2F597H8fW1trZSRkSGdc845UmNjY9gyjxgxQjrrrLOCvuehhx4KeF5lZaVktVqliy66SGptbfXft3LlSgmAtGrVqoC/37t3rwRAev755/2f3XvvvdKp5mr//v0SAOmhhx6K+JtDceaZZ0pjx471/3vcuHGS1WqV9u3b5//syJEjUkZGhjR8+HD/Zz47cGo9//jjjxIAaenSpf7PQvWL0tJSCYD08ccfB/2mX375JeDe7du3SwCk1atX+z9jeffnnXee1KdPn4B3JdfWDR8+XMrIyAjov5IU+N5Z66v9d/7yyy9BfXPv3r1SQkKCNH78+IA2cup319XVSVlZWdKNN94YcP3nn3+WHA5H0Oc+2n+XD5bnxdr35NqLU2lvD6PRvo9IkiS9//77EgDpX//6lyRJkuT1eqXc3Fypb9++AX17/fr1EgBp/vz5YZ8fagwYMWKEBEBatmyZ/7Pm5mb/7/WNly0tLQF2VZ
IkqaamRurUqZM0bdo0/2dybbuPUH3Sh9zv5IlYImRg+vTpATPtm2++GUlJSXjnnXf8n6Wlpfn/v66uDlVVVRg2bBg
"text/plain": [
"<Figure size 700x800 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# make sure the column is numeric; note that astype(float) does not remove missing values\n",
"df['Social_Media_Usage_Hours'] = df['Social_Media_Usage_Hours'].astype(float)\n",
"plt.figure(figsize=(7, 8))\n",
"\n",
"sns.scatterplot(x='Age', y='Social_Media_Usage_Hours', data=df)\n",
"plt.title('Social media usage versus age')\n",
"plt.xlabel('Age')\n",
"plt.ylabel('Time spent on social media, h')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset is reasonably balanced; there are no outliers and no missing values."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"7) Splitting into training, validation, and test sets."
]
},
{
"cell_type": "code",
"execution_count": 301,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set size: 600\n",
"Validation set size: 200\n",
"Test set size: 200\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# hold out 20% of the data as the test set\n",
"train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)\n",
"\n",
"# split the remaining 80% into training (60% of the total) and validation (20% of the total)\n",
"train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)\n",
"\n",
"print(\"Training set size:\", len(train_df))\n",
"print(\"Validation set size:\", len(val_df))\n",
"print(\"Test set size:\", len(test_df))"
]
},
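{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two-step split above yields a 60/20/20 partition because the second call takes 25% of the remaining 80%. The arithmetic can be sketched as follows; `split_sizes` is a hypothetical helper, and the rounding (the held-out part is rounded up) is an assumption meant to mirror how scikit-learn handles fractional sizes. For the 1,000 rows of this dataset it gives 600, 200 and 200, matching the sizes printed above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def split_sizes(n_rows, test_size=0.2, val_size=0.25):\n",
"    # first split: hold out the test part (rounded up, by assumption)\n",
"    n_test = math.ceil(n_rows * test_size)\n",
"    n_rest = n_rows - n_test\n",
"    # second split: take the validation part from what is left\n",
"    n_val = math.ceil(n_rest * val_size)\n",
"    n_train = n_rest - n_val\n",
"    return n_train, n_val, n_test\n",
"\n",
"# sizes of the training, validation and test sets for 1000 rows\n",
"print(split_sizes(1000))"
]
},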
{
"cell_type": "markdown",
"metadata": {},
"source": [
"8) **Assessing the balance of the splits.**<br/>"
]
},
{
"cell_type": "code",
"execution_count": 302,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Distribution of Social_Media_Usage_Hours in the training set:\n",
"Social_Media_Usage_Hours\n",
"1.98 5\n",
"1.97 5\n",
"0.20 5\n",
"2.42 4\n",
"4.29 4\n",
" ..\n",
"0.39 1\n",
"1.91 1\n",
"0.80 1\n",
"0.53 1\n",
"3.50 1\n",
"Name: count, Length: 342, dtype: int64\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Temp\\ipykernel_7960\\3933200995.py:4: FutureWarning: \n",
"\n",
"Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.\n",
"\n",
" sns.barplot(x=genre_counts.index, y=genre_counts.values, palette='viridis')\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABR8AAAGJCAYAAAADsUSRAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAACLCElEQVR4nOzdd3gVVf7H8c8tufem90Y6BAKEHnpAei8Ca0FBxQZixY6uinWRta+6urru6m9FWVfFuot17YuAgmBby4JiW0XpJZDk/P6Ic3InCRFCroD7fj3PfTKTOXfOmXbmzPfOzPEYY4wAAAAAAAAAoJl593cBAAAAAAAAAPwyEXwEAAAAAAAAEBEEHwEAAAAAAABEBMFHAAAAAAAAABFB8BEAAAAAAABARBB8BAAAAAAAABARBB8BAAAAAAAARATBRwAAAAAAAAARQfARAADgF27r1q1au3at1q9fv7+LAgAAgP8xBB8BAAB+gf72t79pyJAhio+PV1xcnPLz8/Xb3/52fxcLAAAA/2MIPgIAABwE3nvvPU2dOlU5OTkKBoNq0aKFpkyZovfee69e2tmzZ+uII45QfHy87r77bj333HN6/vnndeqpp+6HkgMAAOB/mccYY/Z3IQAAALB7jz76qI466iilpKToxBNPVFFRkdasWaN77rlH33//vRYsWKCJEydKkl5++WUNHDhQc+fO1ezZs/dzyQEAAPC/juAjAADAAezTTz9Vp06dlJ+fr1deeUXp6el22rp169S/f3+tXbtWK1euVMuWLTVu3Dj98MMPev311/djqQEAAIAaPHYNAABwALvuuuu0bds23XXXXa7AoySlpaXpD3/4g7Zu3Wrf57h48WJ16NBBkydPVkpKiqKjo9WjRw899thj9ntbtmxRbGyszjrrrHr5ffHFF/L5fJo7d64kadq0aSosLKyXzuPx6PLLL7fjn332mU499VSVlJQoOjpaqampOvzww7VmzRrX91566SV5PB699NJL9n9Lly7VsGHDFB8fr9jYWA0cOFCvvvqq63v33nuvPB6Pli1bZv+3bt26euWQpLFjx9Yr86uvvqrDDz9c+fn5CgaDysvL09lnn63t27fXW7aHH35Y3bt3V3x8vDwej/1cf/319dICAACgcf79XQAAAADs3pNPPqnCwkL179+/wemHHHKICgsL9fTTT0uSvv/+e911112Ki4vTmWeeqfT0dN1///2aNGmS5s+fr6OOOkpxcXGaOHGi/vrXv+rGG2+Uz+ez83vwwQdljNGUKVP2qpxLly7VG2+8ocmTJys3N1dr1qzRHXfcoYEDB+r9999XTExMg9/75JNPNHDgQMXExOj8889XTEyM7r77bg0dOlTPPfecDjnkkL0qx+787W9/07Zt2zRz5kylpqZqyZIluvXWW/XFF1/ob3/7m033r3/9S0cccYQ6d+6sa6+9VomJiVq3bp3OPvvsZikHAADA/xqCjwAAAAeojRs36quvvtKhhx7aaLpOnTrpiSee0ObNm+W8Ueepp57SgAEDJEkzZsxQWVmZzjnnHB122GGKiorSscceq/nz5+u5557TyJEj7bzuv/9+HXLIIcrPz5ckeb1e7clbesaMGaPDDjvM9b9x48apT58+euSRR3TMMcc0+L3Zs2eroqJCS5YsUWlpqSTp+OOPV0lJic455xzXnY77Yt68eYqOjrbj06dPV3FxsS6++GJ9/vnndnmffPJJGWP0j3/8Q1lZWZKkNWvWEHwEAABoIh67BgAAOEBt3rxZkhQfH99oOmf6pk2bJEk9evSwgUdJio6O1qmnnqpvvvlGb7/9tiRp6NChatGihebPn2/Tvfvuu1q5cqWmTp1q/5eRkaFvv/1WO3fubLQM4YG9Xbt26fvvv1dxcbGSkpJsnuE2btyob7/9Vs8995xGjBhhA4+SlJqaqmnTpumtt97Sf//730bz3VPh5du6davWrVunvn37yhij5cuX22mbN2
+W1+tVUlJSs+QLAADwv47gIwAAwAHKCSo6QcjdqRukbNu2bb007dq1kyT7Dkav16spU6boscce07Zt2yRJ8+fPVygU0uGHH26/17dvX+3YsUOXXHKJvvjiC61bt07r1q2rN//t27frsssuU15enoLBoNLS0pSenq4NGzZo48aN9dJPmDBBmZmZ2rRpk0pKSn6yvPvq888/17Rp05SSkqK4uDilp6fbAG14+fr06aPq6mqdddZZ+vTTT7Vu3TqtX7++WcoAAADwv4jgIwAAwAEqMTFR2dnZWrlyZaPpVq5cqZycHCUkJLju8Pspxx57rLZs2aLHHntMxhg98MADGjt2rBITE22a8ePH64QTTtB1112nvLw8paen1+v4RpLOOOMMXXPNNTriiCP00EMP6dlnn9Vzzz2n1NRUVVdX10t//fXX6/HHH9/jsu6LqqoqDRs2TE8//bQuvPBCPfbYY3ruued07733SpKrfJMnT9a5556re++9V8XFxUpPT1e3bt1+lnICAAD8EvHORwAAgAPY2LFjdffdd+u1115Tv3796k1/9dVXtWbNGs2YMUOSVFRUpH//+9/10n344YeS5OoFukOHDuratavmz5+v3Nxcff7557r11lvrffeee+7RZZddpk8//dQG6oYNG+ZK8/DDD+u4447TDTfcYP+3Y8cObdiwocHlKisr04ABAxQXF7fH5W2qVatW6aOPPtJ9992nY4891v7/ueeeq5fW6/Xq+uuv16pVq7R69Wr9/ve/13//+1/Xo+gAAADYc9z5CAAAcAA7//zzFR0drRkzZuj77793Tfvhhx90yimn2J6iJWn06NFasmSJ3njjDZtux44duuOOO5SVlaWysjLXPI455hg9++yzuvnmm5WamqpRo0Y1WI6CggINHjxYQ4cO1dChQ+tN9/l89TqmufXWW1VVVbXbZfN4PBo+fLieeeYZffDBB67luu+++9S9e3dlZmbu9vt7yunNO7x8xhjdcsstDaa/9dZb9eKLL2r+/PkaOnSoysvL97kMAAAA/6u48xEAAOAA1rp1a913332aMmWKOnbsqBNPPFFFRUVas2aN7rnnHq1bt04PPvigWrVqJUm64IILNH/+fI0aNUpnnnmm0tLSdP/99+v999/X/Pnz5fe7m39HH320LrjgAi1cuFAzZ85UVFRUk8o5duxY/eUvf1FiYqLat2+vf/3rX3r++eeVmpra6PeuuuoqPfPMMxowYIDOOOMMxcTE6O6779aGDRv08MMP10v/r3/9y75z0ulg55NPPtGiRYtsmu+++07bt2/XokWLNHLkSLVt21atWrXSeeedpy+//FIJCQl65JFHGnyX43vvvacLLrhAl19+uXr06NGkdQEAAIBaBB8BAAAOcIcffrjatm2ruXPn2oBjamqqBg0apIsvvlgdOnSwadPT0/Xaa6/pwgsv1K233qqKigp17NhRCxcu1KGHHlpv3pmZmRo+fLj+/ve/65hjjmlyGW+55Rb5fD7Nnz9fO3bsUHl5uZ5//nmNGDGi0e+1b99er7zyii666CL99re/VXV1tbp376677rpLhxxySL30Z555Zr3/zZ8/39Vrt2PUqFEyxigqKkpPPvmkzjzzTM2dO1ehUEgTJ07U6aefrs6dO9v0FRUVOvroo9W9e3fNnj27CWsBAAAAdXlM3edjAAAA8D9l4sSJWrVqlT755JP9XZRms2bNGhUVFdV7FBwAAAA/L975CAAA8D/s66+/1tNPP71Pdz0CAAAAu8Nj1wAAAP+DVq9erddff11//OMfFRUVZXvL/qWIjo7+yUe+AQAAEHnc+QgAAPA/6OWXX9Yxxxyj1atX67777lNWVtb+LlKzyszMdHVCAwAAgP2Ddz4CAAAAAAAAiAjufAQAAAAAAAAQEQQfAQAAAAAAAETEQd3hTHV1tb766ivFx8fL4/Hs7+IAAAAAAAAABxVjjDZv3qwWLVrI623++xQP6uDjV199pby8vP1dDA
AAAAAAAOCgtnbtWuXm5jb7fA/q4GN8fLykmpWTkJCwn0sDAAAAAAAAHFw2bdqkvLw8G2drbgd18NF51DohIYHgIwA
"text/plain": [
"<Figure size 1600x400 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def check_balance(df, name):\n",
"    counts = df['Social_Media_Usage_Hours'].value_counts()\n",
"    print(f\"Distribution of Social_Media_Usage_Hours in the {name}:\")\n",
"    print(counts)\n",
"    print()\n",
"\n",
"check_balance(train_df, \"training set\")\n",
"build_graph(train_df, 'Social_Media_Usage_Hours', 'Training', 'Time on social media, h')"
]
},
{
"cell_type": "code",
"execution_count": 303,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Distribution of Social_Media_Usage_Hours in the validation set:\n",
"Social_Media_Usage_Hours\n",
"3.36 3\n",
"2.71 3\n",
"1.13 3\n",
"2.92 3\n",
"3.64 3\n",
" ..\n",
"2.07 1\n",
"2.03 1\n",
"3.98 1\n",
"1.80 1\n",
"1.34 1\n",
"Name: count, Length: 156, dtype: int64\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Temp\\ipykernel_7960\\3933200995.py:4: FutureWarning: \n",
"\n",
"Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.\n",
"\n",
" sns.barplot(x=genre_counts.index, y=genre_counts.values, palette='viridis')\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABSoAAAGJCAYAAACNaw3tAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAACgxElEQVR4nOzdeZxN9f/A8fe9d2bu7DOG2WfMGMsw9t3YxjLWEVKSiCRaKCUtiiQVJaFQifJtQVIk2clSyJIlLahvIVlaMJYMZt6/P+Z7Pr97zZ3SRFe8no/HfTDnfT7nfM7n7O97zufaVFUFAAAAAAAAALzI7u0KAAAAAAAAAACJSgAAAAAAAABeR6ISAAAAAAAAgNeRqAQAAAAAAADgdSQqAQAAAAAAAHgdiUoAAAAAAAAAXkeiEgAAAAAAAIDXkagEAAAAAAAA4HUkKgEAAAAAAAB4HYlKAAAAAAAAAF5HohIAAOBfatq0aWKz2WTTpk0FYq+++qrYbDbp2LGj5ObmeqF2AAAAwF9DohIAAOAKM2fOHLnzzjulUaNGMnPmTHE4HN6uEgAAAPCnSFQCAABcQVauXCldu3aVtLQ0+fDDD8Xf39/bVQIAAAAuCIlKAACAK8TWrVulQ4cOEhsbK4sXL5awsDC3+Lvvvis1a9aUgIAAKVGihHTv3l3279/vNs4tt9wiwcHBBaY9e/ZssdlssnLlShERadKkidhstj/8WGw2m/Tv31/efvttSU1NFX9/f6lZs6asXr26wHy2bNkibdq0kdDQUAkODpbmzZvL+vXrPS5vYXWYNm2a2ziVKlX607az6ni+du3aSXJystuw5557TurXry/FixeXgIAAqVmzpsyePbtA2RMnTsj9998vKSkp4uvr61bHX3755U/rBAAAcLXx8XYFAAAA8Pd999130rp1a3E6nbJ48WKJjY11i0+bNk169eoltWvXlpEjR8qhQ4dk/Pjx8umnn8qWLVskPDz8L83v0Ucfldtuu01ERH755Re57777pG/fvtKoUSOP469atUreeecdueeee8TpdMqkSZOkdevWsmHDBpNI/PLLL6VRo0YSGhoqDz74oPj6+sorr7wiTZo0kVWrVkndunULTLd8+fLy6KOPutXjUhs/fry0b99eunXrJmfOnJGZM2dK586dZf78+ZKVlWXGe+CBB+Tll1+W3r17S4MGDcTX11fef/99mTNnziWvIwAAwL8RiUoAAIB/uUOHDsmNN94ohw4dkpYtW0q5cuXc4mfPnpWHHnpIKlWqJKtXrzavgzds2FDatWsnY8eOleHDh/+lebZo0cL8/4cffpD77rtP0tPTpXv37h7H37Fjh2zatElq1qwpIiI33nijpKamymOPPSbvv/++iIgMGTJEzp49K5988omkpKSIiEiPHj0kNTVVHnzwQVm1apXbNM+dOyexsbFmnlY9LrVdu3ZJQECA+bt///5So0YNef75590SlR988IG0atVKpkyZYoZ9++23JCoBAAAKwavfAAAA/3K33HKL7Nu3T2666SZZsmSJvPvuu27xTZs2yeHDh+Wuu+5y67MyKytLypcvLx999FGBaf7yyy9un+PHj/+tOqanp5skpYhIyZIlpUOHDrJ48WLJzc2V3NxcWbJkiXTs2NEkKUVEYmNj5aabbpJPPvlEsrOz3aZ55swZcTqdfzrv3NxcsxxnzpwpdLzTp08XWO6zZ88WGM81SXnkyBE5duyYNGrUSD7//HO38Y4fPy7Fixf/0/oBAAAgH4lKAACAf7nffvtN3nrrLfnPf/4j1apVkwEDBsixY8dMfM+ePSIikpqaWqBs+fLlTdxy8uRJiYyMdPvceuutf6uOZcuWLTCsXLlycurUKfn555/l559/llOnTnmsY4UKFSQvL0/27dvnNvzo0aMe+9M83zfffGOWIyAgQFJTU2X69OkFxps6dWqB5V6yZEmB8ebPny/16tUTf39/iYiIkMjISHnppZ
fc2lwkPzk7Z84cmT17thw4cEB++eUXOXXq1J/WFwAA4GrFq98AAAD/cqNHj5bOnTuLiMjkyZOlXr16MnjwYJk0aVKRpufv7y8ffvih27A1a9bIE0888bfrejEdPHhQWrVq9afjJScny6uvvioiIr/++qu88MILcvPNN0tKSorUq1fPjNehQ4cCP6gzZMgQOXjwoPl7zZo10r59e2ncuLFMmjRJYmNjxdfXV15//fUCyc/JkydL165dzboBAADAHyNRCQAA8C/XuHFj8//atWtLv379ZOLEidKjRw+pV6+eJCUliYjIzp07pVmzZm5ld+7caeIWh8MhmZmZbsOOHj36t+q4e/fuAsN27dolgYGBEhkZKSIigYGBsnPnzgLjffPNN2K32yUxMdEM+/HHH+X48eNSoUKFP513UFCQ2/I0atRI4uPjZcmSJW6JyoSEhALLPW7cOLdE5XvvvSf+/v6yePFit9fOX3/99QLzTU5OlrfeeksqV64st956q3Ts2FHeeOMNefPNN/+0zgAAAFcjXv0GAAC4wjz11FMSGxsrffv2lXPnzkmtWrUkKipKXn75ZcnJyTHjLVy4UL7++mu3H4C5VNatW+fWh+O+ffvkgw8+kJYtW4rD4RCHwyEtW7aUDz74QH744Qcz3qFDh2T69OnSsGFDCQ0NNcNnzpwpIlIg8Xoh8vLyRCQ/IftXORwOsdlskpuba4b98MMPMnfu3ALjnjt3Trp16yYVK1aUsWPHSmZmplv/mwAAAHDHE5UAAABXmJCQEHnxxRelU6dOMmbMGHnooYfkmWeekV69eklGRoZ07dpVDh06JOPHj5fk5OR/5JeyK1WqJK1atZJ77rlHnE6neS3d9dfGn3zySVm6dKk0bNhQ7rrrLvHx8ZFXXnlFcnJy5NlnnxWR/MTlsGHDZMqUKXLjjTdK+fLl/3TeJ06ckEWLFolIfn+eL7zwgvj6+hYpQZuVlSXPP/+8tG7dWm666SY5fPiwTJw4UcqUKSPbt293G3f48OHyxRdfyJYtW8TX1/cvzwsAAOBqQ6ISAADgCnTttddKhw4d5IknnpAbbrhBbrnlFgkMDJRRo0bJQw89JEFBQXLttdfKM888I+Hh4Ze8PhkZGZKeni7Dhw+XvXv3SlpamkybNk2qVKlixqlYsaKsWbNGBg8eLCNHjpS8vDypW7euvPXWW1K3bl0REfnuu+9k+fLlMnToUBk8ePAFzXvPnj3Spk0bEREJDw+XihUryrx586RatWp/eTmaNWsmU6dOlVGjRsm9994rpUqVkmeeeUZ++OEHt0TlJ598IiNHjpRJkyZJuXLl/vJ8AAAArkY2VVVvVwIAAABXLpvNJv369ZMJEyZ4uyoAAAC4jNFHJQAAAAAAAACvI1EJAAAAAAAAwOtIVAIAAAAAAADwOn5MBwAAAJcUXaIDAADgQvBEJQAAAAAAAACvI1EJAAAAAAAAwOuuule/8/Ly5KeffpKQkBCx2Wzerg4AAAAAAADwr6Kqcvz4cYmLixO7/eI9B3nVJSp/+uknSUxM9HY1AAAAAAAAgH+1ffv2SUJCwkWb3lWXqAwJCRGR/IYMDQ31cm0AAAAAAACAf5fs7GxJTEw0ebaL5apLVFqve4eGhpKoBAAAAAAAAIroYneryI/pAAAAAAAAAPA6EpUAAAAAAAAAvI5EJQAAAAAAAACvI1EJAAAAAAAAwOtIVAIAAAAAAADwOhKVAAAAAAAAALyORCUAAAAAAAAAr/NqovKll16SKlWqSGhoqISGhkp6erosXLjwD8u8++67Ur58efH395fKlSvLggUL/qHaAgAAAAAAALhUvJqoTEhIkFGjRsnmzZtl06ZN0qxZM+nQoYN8+eWXHsdfu3atdO3aVXr37i1btmyRjh07SseOHWXHjh3/cM0BAAAAAAAAXEw2VVVvV8JVRESEjB49Wnr37l0g1qVLFzl58qTMnz/fDKtXr55Uq1ZNXn755QuafnZ2toSFhcmxY8
ckNDT0otUbAAAAAAAAuBpcqvzaZdNHZW5ursycOVNOnjwp6enpHsdZt26dZGZmug1r1aqVrFu3rtDp5uTkSHZ2tts
"text/plain": [
"<Figure size 1600x400 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"check_balance(val_df, \"validation set\")\n",
"build_graph(val_df, 'Social_Media_Usage_Hours', 'Validation', 'Time on social media, h')"
]
},
{
"cell_type": "code",
"execution_count": 304,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Distribution of Social_Media_Usage_Hours in the test set:\n",
"Social_Media_Usage_Hours\n",
"4.40 3\n",
"2.83 3\n",
"2.00 2\n",
"4.16 2\n",
"1.48 2\n",
" ..\n",
"0.86 1\n",
"3.74 1\n",
"0.41 1\n",
"0.21 1\n",
"0.67 1\n",
"Name: count, Length: 169, dtype: int64\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Temp\\ipykernel_7960\\3933200995.py:4: FutureWarning: \n",
"\n",
"Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.\n",
"\n",
" sns.barplot(x=genre_counts.index, y=genre_counts.values, palette='viridis')\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABSsAAAGJCAYAAABiqWbTAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAACkc0lEQVR4nOzde5xN1f/48fc5Z86cud/NxcyYGff7dVwGuQ4TIyRCRReRUKQolwhFkvCJXFJRipRcihSiG5FrUYqPvpFc+hQGMczM+/fH/Pb6zDEHNZHTx+v5eJwHs9dea6+999p7r/0+e69jU1UVAAAAAAAAALjG7Ne6AgAAAAAAAAAgQrASAAAAAAAAgJcgWAkAAAAAAADAKxCsBAAAAAAAAOAVCFYCAAAAAAAA8AoEKwEAAAAAAAB4BYKVAAAAAAAAALwCwUoAAAAAAAAAXoFgJQAAAAAAAACvQLASAAAAAAAAgFcgWAkAAHCdstlsf+izbt26a11VAAAAXCd8rnUFAAAAcG289tprbn+/+uqrsmrVqkLTK1So8HdWCwAAANcxm6rqta4EAAAArr1+/frJtGnThO4hAAAArhVeAwcAAMAfkp2dLSNHjpTSpUuLy+WSxMREGTx4sGRnZxead968eVKnTh0JCAiQ8PBwadSokXz44YciIpKcnHzJ186Tk5NNOadPn5aHH35YEhMTxeVySbly5eTZZ58tFFAtmN/hcEh8fLz06tVLjh8/buY5d+6cjBgxQmrVqiWhoaESGBgoN9xwg6xdu7ZQ/Y8ePSo9evSQEiVKiMPhMGUHBQVdmY0JAAAAj3gNHAAAAJeVl5cnbdu2lc8++0x69eolFSpUkK+//lomTZok33//vSxZssTMO2rUKHniiSekfv36Mnr0aPH19ZWNGzfKRx99JC1btpTJkyfLqVOnRETk22+/lbFjx8rQoUPN6+ZWQFBVpW3btrJ27Vrp0aOHVK9eXT744AMZNGiQHDx4UCZNmuRWx5tvvlk6dOggOTk5smHDBpk1a5acOXPGvNaelZUls2fPlq5du0rPnj3l5MmT8tJLL0lGRoZs2rRJqlevbsq68847ZfXq1fLAAw9ItWrVxOFwyKxZs2Tr1q1XcSsDAACA18ABAAAgIpd+DXzevHly5513yscffywNGzY002fOnCm9e/eWzz//XOrXry979+6VcuXKSbt27eTtt98Wu/2/L/KoqthsNrdy161bJ02bNpW1a9dKkyZN3NKWLl0q7du3lyeffFKGDRtmpnfq1EkWLVoke/bskVKlSolI/pOVI0eOlCeeeMLM16BBAzl+/Ljs2rVLRERyc3MlNzdXfH19zTzHjx+X8uXLS2Zmprz00ksiInL27FkJDAyUnj17yowZM8y8d911l7z99tsm0AoAAIArj9fAAQAAcFlvvfWWVKhQQcqXLy//+c9/zKdZs2YiIuZV6iVLlkheXp6MGDHCLVApIoUClZezYsUKcTgc8uCDD7pNf/jhh0VV5f3333eb/vvvv8t//vMfOXz4sCxatEh27NghzZs3N+kOh8MEKvPy8uS3336TnJwcSU1NdXti8vTp05KXlyeRkZF/qr4AAAD463gNHAAAAJe1Z88e+fbbb6VYsWIe048ePSoiIv/+97/FbrdLxYoV//Iyf/zxRylevLgEBwe7TbdeF//xxx/dpk+YMEEmTJhg/r7xxhtl/PjxbvPMnTtXJk6cKLt375bz58+b6SkpKeb/kZGRUqZMGZk9e7Y0btxYqlevLna73ePYnAAAALiyCFYCAADgsvLy8qRKlSry3HPPeUxPTEz8m2tUWLdu3aR79+6Sl5cn+/btkzFjxkibNm1k9erVYrPZZN68eXLXXXdJ+/btZdCgQRIdHS0Oh0PGjRsn//73v93KevPNN+X222+XjIwMt+mBgYF/5yoBAABcdwhWAgAA4LJKlSplXqu+1OvcpUqVkry8PPnmm2/cfrCmKJKSkmT16t
Vy8uRJt6crd+/ebdILKlmypKSnp5u/Q0ND5bbbbpMvvvhC0tLS5O2335aSJUvKO++847YOI0eOLLTsGjVqyIsvvig33HCDjB49WurVqycTJkyQzz///C+tEwAAAC6NMSsBAABwWbfeeqscPHhQXnzxxUJpZ86ckdOnT4uISPv27cVut8vo0aMlLy/Pbb4/+7uOrVu3ltzcXJk6darb9EmTJonNZpNWrVpdMv+ZM2dERMzr2w6Ho1A9Nm7cKBs2bCiUNysrS7p16yZt27aV4cOHS3p6usTFxf2p+gMAAODP48lKAAAAXFa3bt1k4cKF0rt3b1m7dq00aNBAcnNzZffu3bJw4UL54IMPJDU1VUqXLi3Dhg2TMWPGyA033CAdOnQQl8slX375pRQvXlzGjRv3h5d50003SdOmTWXYsGHyf//3f1KtWjX58MMPZenSpTJgwADzS+CWr776SubNmyeqKv/+97/lX//6lyQkJEhqaqqIiLRp00beeecdufnmmyUzM1N++OEHmTFjhlSsWLHQL3z37dtXzpw5I7Nnz/7rGw8AAAB/GMFKAAAAXJbdbpclS5bIpEmT5NVXX5XFixdLQECAlCxZUvr37y9ly5Y1844ePVpSUlLk+eefl2HDhklAQIBUrVpVunXr9qeXuWzZMhkxYoS8+eab8sorr0hycrJMmDBBHn744ULzL168WBYvXiw2m01iYmKkadOm8tRTT0lQUJCIiNx1111y+PBhmTlzpnzwwQdSsWJFmTdvnrz11luybt06U86CBQvk9ddfl/fff1+ioqKKtsEAAABQJDb9s+/jAAAAAAAAAMBVwJiVAAAAAAAAALwCwUoAAAAAAAAAXoFgJQAAAAAAAACvQLASAAAAAAAAgFcgWAkAAAAAAADAKxCsBAAAAAAAAOAVfK51Bf5ueXl58vPPP0twcLDYbLZrXR0AAAAAAADgH0VV5eTJk1K8eHGx26/ss5DXXbDy559/lsTExGtdDQAAAAAAAOAf7cCBA5KQkHBFy7zugpXBwcEikr8xQ0JCrnFtAAAAAAAAgH+WrKwsSUxMNHG2K+m6C1Zar36HhIQQrAQAAAAAAACK6GoMscgP7AAAAAAAAADwCgQrAQAAAAAAAHgFgpUAAAAAAAAAvALBSgAAAAAAAABegWAlAAAAAAAAAK9AsBIAAAAAAACAVyBYCQAAAAAAAMArXNNg5fTp06Vq1aoSEhIiISEhkpaWJu+///4l87z11ltSvnx58fPzkypVqsiKFSv+ptoCAAAAAAAAuJquabAyISFBnn76admyZYts3rxZmjVrJu3atZNdu3Z5nH/9+vXStWtX6dGjh2zbtk3at28v7du3l507d/7NNQcAAAAAAABwpdlUVa91JQqKiIiQCRMmSI8ePQqlde7cWU6fPi3vvfeemVavXj2pXr26zJgx4w+Vn5WVJaGhoXLixAkJCQm5YvUGAAAAAAAArgdXM77mNWNW5ubmyoIFC+T06dOSlpbmcZ4NGzZIenq627SMjAzZsGHDRcvNzs6WrKwstw8AAAAAAAAA7+NzrSvw9ddfS1pampw9e1aCgoJk8eLFUrFiRY/zHj58WGJiYtymxcTEyOHDhy9a/rhx42TUqFFXtM4AAABAtQlPeJy+Y5Dn6QAAXGlzN7QuNO3ONH7bA/9s1/zJynLlysn27dtl48aNcv/998udd94p33zzzRUrf8iQIXLixAnzOXDgwBUrGwAAAAAAAMCVc82frPT19ZXSpUuLiEitWrXkyy+/lClTpsjMmTMLzRsbGytHjhxxm3bkyBGJjY29aPkul0tcLteVrTQAAAAAAACAK+6aP1l5oby8PMnOzvaYlpaWJmvWrHGbtmrVqouOcQkAAAAAAADgn+OaPlk5ZMgQadWqlZQoUUJOnjwpb7zxhqxbt04++OADERHp3r27xMfHy7hx40REpH///tK4cWOZOHGiZGZmyoIFC2Tz5s0ya9asa7kaAAAAAAAAAK
6AaxqsPHr0qHTv3l0OHTokoaGhUrVqVfnggw+kRYsWIiKyf/9+sdv/+/Bn/fr15Y033pDhw4fL0KFDpUyZMrJkyRK
"text/plain": [
"<Figure size 1600x400 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"check_balance(test_df, \"test set\")\n",
"build_graph(test_df, 'Social_Media_Usage_Hours', 'Test', 'Time on social media, h')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"9) Data augmentation with oversampling and undersampling"
]
},
{
"cell_type": "code",
"execution_count": 305,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Distribution of Social_Media_Usage_Hours in the training set after oversampling:\n",
"Social_Media_Usage_Hours\n",
"2 135\n",
"0 135\n",
"1 135\n",
"3 135\n",
"4 135\n",
"Name: count, dtype: int64\n",
"\n"
]
}
],
"source": [
"from imblearn.over_sampling import RandomOverSampler\n",
"\n",
"def oversample(df):\n",
"    X = df.drop('Social_Media_Usage_Hours', axis=1)\n",
"    # class labels y must be discrete, not continuous, so truncate the hours to whole numbers\n",
"    y = df['Social_Media_Usage_Hours'].astype(int)\n",
"\n",
"    oversampler = RandomOverSampler(random_state=42)\n",
"    X_resampled, y_resampled = oversampler.fit_resample(X, y)\n",
"\n",
"    resampled_df = pd.concat([X_resampled, y_resampled], axis=1)\n",
"    return resampled_df\n",
"\n",
"train_df_oversampled = oversample(train_df)\n",
"val_df_oversampled = oversample(val_df)\n",
"test_df_oversampled = oversample(test_df)\n",
"\n",
"check_balance(train_df_oversampled, \"training set after oversampling\")"
]
},
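{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cast `astype(int)` truncates toward zero, so every value from 2.00 up to 2.99 ends up in class 2, which is why only the five classes 0 to 4 remain after resampling. A small sketch of that binning in plain Python, using a few hour values that appear in the training-set distribution above; `int()` stands in for the pandas cast here, which is an assumption that holds for non-negative values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"# a few hour values taken from the value_counts output above\n",
"hours = [0.20, 1.97, 1.98, 2.42, 3.50, 4.29]\n",
"\n",
"# truncate each value to a whole number of hours to get a discrete class label\n",
"labels = [int(h) for h in hours]\n",
"\n",
"# count how many values fall into each one-hour bin\n",
"print(Counter(labels))"
]
},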
{
"cell_type": "code",
"execution_count": 306,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Distribution of Social_Media_Usage_Hours in the validation set after oversampling:\n",
"Social_Media_Usage_Hours\n",
"1 48\n",
"3 48\n",
"0 48\n",
"4 48\n",
"2 48\n",
"Name: count, dtype: int64\n",
"\n"
]
}
],
"source": [
"check_balance(val_df_oversampled, \"validation set after oversampling\")"
]
},
{
"cell_type": "code",
"execution_count": 307,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Distribution of Social_Media_Usage_Hours in the test set after oversampling:\n",
"Social_Media_Usage_Hours\n",
"3 46\n",
"4 46\n",
"2 46\n",
"1 46\n",
"0 46\n",
"Name: count, dtype: int64\n",
"\n"
]
}
],
"source": [
"check_balance(test_df_oversampled, \"test set after oversampling\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dataset 3. Factors affecting student performance"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://www.kaggle.com/datasets/lainguyn123/student-performance-factors"
]
},
{
"cell_type": "code",
"execution_count": 308,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['Hours_Studied', 'Attendance', 'Parental_Involvement',\n",
" 'Access_to_Resources', 'Extracurricular_Activities', 'Sleep_Hours',\n",
" 'Previous_Scores', 'Motivation_Level', 'Internet_Access',\n",
" 'Tutoring_Sessions', 'Family_Income', 'Teacher_Quality', 'School_Type',\n",
" 'Peer_Influence', 'Physical_Activity', 'Learning_Disabilities',\n",
" 'Parental_Education_Level', 'Distance_from_Home', 'Gender',\n",
" 'Exam_Score'],\n",
" dtype='object')\n"
]
}
],
"source": [
"df = pd.read_csv(\".//csv//StudentPerformanceFactors.csv\")\n",
"print(df.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1) **Business goal:** raise student performance by improving the factors that influence it.\n",
"2) **Effect:** better academic results and a more competitive educational institution.\n",
"3) **Technical goal:** build a machine learning model that predicts a student's performance (for example, exam scores).\n",
"4) **Input data:** 'Hours_Studied', 'Attendance', 'Extracurricular_Activities', 'Sleep_Hours'\n",
"5) **Target feature:** 'Exam_Score'"
]
},
{
"cell_type": "code",
"execution_count": 309,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA+QAAAK9CAYAAACtq6aaAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd1hT1/8H8HcmCYS9QZmiKKLi3loXzjprtc5q1X71q9W2dtfaae22tnV0qHW0rjq63HtPHLhQEQRkBrJ37u8Pf+RrTAIkJhDg83oenkfPuTk5Se69537uPYPFMAwDQgghhBBCCCGEVCt2TVeAEEIIIYQQQgipjyggJ4QQQgghhBBCagAF5IQQQgghhBBCSA2ggJwQQgghhBBCCKkBFJATQgghhBBCCCE1gAJyQgghhBBCCCGkBlBATgghhBBCCCGE1AAKyAkhhBBCCCGEkBpAATkhhBBCCCEupNPpUFxcDK1WCwAoLi6GQqGo9HVqtRp5eXkoLCx0dRUJITWEAnJCCCGEEEJc6Pjx4wgODsbOnTsBAMHBwXj99detbrtv3z48/fTT8PPzg1AoRGRkJF566aXqrC4hpBpxa7oChLiD5cuXY9u2bUhLS4NYLEZgYCCaNGmCqVOnYvz48WCz6d4VIYQQQhzTsmVL7N27Fy1atAAA7N27Fw0bNrTY7ocffsDs2bPRtWtXLFmyBJGRkQCA6Ojoaq0vIaT6sBiGYWq6EoTUtE6dOiE8PBy9evWCj48PysrKcOrUKfz+++949tln8dtvv9V0FQkhhBBSh2VkZCA5ORnPP/88fvjhB7BYrJquEiGkGlBATggeju3i8XgW6bNnz8Z3332HzMxMxMTEVH/FCCGEEFIvzJ49G3/++ScyMjKsXpMQQuom6odLCGCz4SsPwh/tsr5jxw4MGjQIERER8PDwQHx8PD788EMYDAaz1/bs2RMsFsv0FxQUhEGDBuHq1atm27FYLCxcuNAs7fPPPweLxULPnj3N0tVqNRYuXIjGjRtDIBAgPDwcI0aMwJ07dwAA9+7dA4vFwurVq81eN2vWLLBYLEyePNmUtnr1arBYLPD5fBQVFZltf/LkSVO9z507Z5a3efNmtGnTBkKhEEFBQRg/fjxyc3MtvrsbN25g9OjRCA4OhlAoRJMmTfD2228DABYuXGj23Vj7O3TokOl7bN68uUX5VVWV+k6ePBkikcjitVu2bDGrSzmNRoP33nsPjRo1goeHBxo2bIjXXnsNGo3GbDsWi4X//ve/FuUOHjzY7AaPtd9NJpOhTZs2iI2NxYMHD2xuB1j/fa25efMmevXqhbCwMFO9X3zxRYjFYtM2Wq0WCxYsQJs2beDr6wsvLy9069YNBw8erLBsADh69Cj69OmDoKAgCIVCpKSkYNmyZXj8vm/Pnj0t9u2PP/4YbDYbGzZsMKW9+OKLSEhIgKenJwICAtCrVy8cPXrU7HX2Ho/Dhg2zqPeMGTPAYrEs9jOj0YhvvvkGSUlJEAgECA0NxYwZM1BaWmq2XUxMDAYPHmxR7n//+98qP+GqbD+dPHlypcfMvXv3bJZf2esf38edcZyXy83NxZQpUxAaGgoPDw8kJSXhl19+Mdvm0KFDYLFY2LJli1m6SCSy2K+tfa+XL1/G5MmTERcXB4FAgLCwMEyZMgUlJSUWdbanPiwWC2lpaRav53A4Vut74MABdOvWDV5eXvDz88PQoUNx/fp1q3WYOnWqab+NjY3Ff/7zH2i1WtO5uaK/8nPA5MmTLW4W379/H0KhsNJ9whmvj4mJsfh9yr+7R/epo0eP4plnnkFUVJTp3DNv3jyoVCqLMquyT128eBEDBgyAj48PRCIRevfujVOnTplt8/j36OnpieTkZPz0008Vfqa7d++CxWLh66+/tsg7ceIEWCyWqdectXOZrfP0jRs3MGrUKAQEBEAgEKBt27am8eTlTp06hTZt2mDmzJmm/bN58+b48ccfrb7HF1
98YfNzlLezlanqtYqt11prn7/44gur+8+///5rOj68vb0xaNAgpKenWy07Jiamwn2/XFXP04D5cf3oX2Xt8a1btzB8+HD4+/tDKBSiXbt22L59e6XfDyFVRWPICXlEWVkZ9Ho9ZDIZzp8/jy+++AJjxoxBVFSUaZvVq1dDJBLh5ZdfhkgkwoEDB7BgwQJIpVJ8/vnnZuUlJibi7bffBsMwuHPnDr766isMHDgQ2dnZFdZh0aJFFukGgwGDBw/G/v37MWbMGLz00kuQyWTYu3cvrl69ivj4eKvl3b5926IxfxSHw8G6deswb948U9qqVasgEAigVqvNtl29ejWef/55tGvXDosWLUJBQQGWLFmC48eP4+LFi/Dz8wPw8OK4W7du4PF4mD59OmJiYnDnzh38+eef+PjjjzFixAg0atTIVO68efPQtGlTTJ8+3ZTWtGlTm3WuqqrW1x5GoxFPP/00jh07hunTp6Np06a4cuUKvv76a9y6dcspjbROp8PIkSORnZ2N48ePIzw83Oa2lf2+j1IoFGjQoAGGDBkCHx8fXL16Fd9//z1yc3Px559/AgCkUil++uknjB07FtOmTYNMJsPPP/+M1NRUnDlzBq1atbJZ/okTJxASEoJ33nkHHA4Hhw8fxsyZM3H58mUsW7bM5utWrVqFd955B19++SWee+45U7pWq8X48ePRoEEDiMVirFixAv3798f169dNx6Q9x6NAIMDff/+NwsJChISEAABUKhU2btwIgUBgUa8ZM2aY9qE5c+YgMzMT3333HS5evIjjx4877QlWVfbTGTNmoE+fPqbXTJgwAcOHD8eIESNMacHBwRW+j4eHh0UwcvbsWXz77bd21weo/DgHgIKCAnTs2NF0cyo4OBj//vsvpk6dCqlUirlz5z7BN/c/e/fuxd27d/H8888jLCwM6enpWLlyJdLT03Hq1ClTYGJvfQQCAVatWoUlS5aY0tasWQM+n29xfty3bx8GDBiAuLg4LFy4ECqVCkuXLkWXLl1w4cIF00V/Xl4e2rdvj7KyMkyfPh2JiYnIzc3Fli1boFQq0b17d6xdu9ZUbvl3+WhQ2rlzZ5vfxYIFCyzqZo8nfb01mzdvhlKpxH/+8x8EBgbizJkzWLp0KXJycrB582bTdlXZp9LT09GtWzf4+PjgtddeA4/Hw4oVK9CzZ08cPnwYHTp0MHvvr7/+GkFBQZBKpfjll18wbdo0xMTEmB1Pj4qLi0OXLl2wfv16s3YRANavXw9vb28MHTrUrs+fnp6OLl26IDIyEm+88Qa8vLywadMmDBs2DFu3bsXw4cMBACUlJTh37hy4XC5mzZqF+Ph4bN++HdOnT0dJSQneeOMNu963qhy5VrHX2rVrMWnSJKSmpmLx4sVQKpVYtmwZunbtiosXL1rthdiqVSu88sorAIDMzEwsWLDAYhtHztNvvfWW6Rpj5cqVFX5OsViM7t27QyaTYc6cOQgLC8O6deswYsQIrF+/HmPHjnXwGyHkEQwhxKRJkyYMANPfxIkTGZ1OZ7aNUqm0eN2MGTMYT09PRq1Wm9J69OjB9OjRw2y7t956iwHAFBYWmtIAMO+9957p/6+99hoTEhLCtGnTxuz1v/zyCwOA+eqrryze32g0MgzDMJmZmQwAZtWqVaa80aNHM82bN2caNmzITJo0yZS+atUqBgAzduxYJjk52ZSuUCgYHx8f5rnnnmMAMGfPnmUYhmG0Wi0TEhLCNG/enFGpVKbt//rrLwYAs2DBAlNa9+7dGW9vbyYrK8tqPR8XHR1tVrdH9ejRg0lKSrKaVxF76jtp0iTGy8vLoozNmzczAJiDBw+a0tauXcuw2Wzm6NGjZtsuX76cAcAcP37clAaAmTVrlkW5gwYNYqKjo03/f/R3MxqNzLhx4xhPT0/m9OnTZq+z5/etqpkzZzIikcj0f71ez2g0GrNtSktLmdDQUGbKlCl2l//2228zAJgjR46Y0h49Nv7++2+Gy+Uyr7
zySqVlnTlzhgHAbNmyxZRmz/GYlJTEtGjRgvniiy9M6WvXrmUaNGjAdOvWzWw/O3r0KAOAWb9+vVnZu3btskiPjo5
"text/plain": [
"<Figure size 1200x800 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(12, 8))\n",
"# scatter plot of attendance vs. hours studied, coloured by exam score\n",
"sns.scatterplot(x='Attendance', y='Hours_Studied', hue='Exam_Score', data=df, palette='viridis')\n",
"plt.title('Exam score as a function of attendance and weekly study hours')\n",
"plt.xlabel('Attendance')\n",
"plt.ylabel('Hours spent studying per week')\n",
"plt.legend(title='Exam score')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"6. Problems identified:\n",
"- The points are spread fairly evenly, so the data are not noisy.\n",
"- The points are distributed fairly evenly along the X and Y axes, so there is no bias.\n",
"- The plot shows some outliers, but they are not critical."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"7) Splitting the data into training, validation, and test sets.\n"
]
},
{
"cell_type": "code",
"execution_count": 310,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set size: 3963\n",
"Validation set size: 1322\n",
"Test set size: 1322\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# hold out 20% of the data as the test set\n",
"train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)\n",
"\n",
"# split the remaining 80% into training and validation sets;\n",
"# 0.25 * 0.8 = 0.2, so the final proportions are 60/20/20\n",
"train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)\n",
"\n",
"print(\"Training set size:\", len(train_df))\n",
"print(\"Validation set size:\", len(val_df))\n",
"print(\"Test set size:\", len(test_df))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"8) **Assessing the balance of the splits.**<br/>"
]
},
{
"cell_type": "code",
"execution_count": 311,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Distribution of Exam_Score in the training set:\n",
"Exam_Score\n",
"68 467\n",
"66 441\n",
"67 420\n",
"65 404\n",
"69 368\n",
"70 322\n",
"64 301\n",
"71 255\n",
"63 226\n",
"72 180\n",
"62 164\n",
"61 108\n",
"73 84\n",
"74 60\n",
"60 50\n",
"59 24\n",
"75 22\n",
"58 17\n",
"76 11\n",
"80 4\n",
"77 4\n",
"78 3\n",
"94 2\n",
"79 2\n",
"86 2\n",
"98 2\n",
"84 2\n",
"92 2\n",
"99 2\n",
"82 2\n",
"89 2\n",
"87 1\n",
"95 1\n",
"57 1\n",
"83 1\n",
"85 1\n",
"97 1\n",
"96 1\n",
"101 1\n",
"88 1\n",
"93 1\n",
"Name: count, dtype: int64\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Temp\\ipykernel_7960\\3933200995.py:4: FutureWarning: \n",
"\n",
"Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.\n",
"\n",
" sns.barplot(x=genre_counts.index, y=genre_counts.values, palette='viridis')\n"
]
},
{
"data": {
"image/png": "(base64-encoded PNG data omitted)",
"text/plain": [
"<Figure size 1600x400 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def check_balance(df, name):\n",
"    # count how many rows fall into each Exam_Score value\n",
"    counts = df['Exam_Score'].value_counts()\n",
"    print(f\"Distribution of Exam_Score in {name}:\")\n",
"    print(counts)\n",
"    print()\n",
"\n",
"check_balance(train_df, \"the training set\")\n",
"build_graph(train_df, 'Exam_Score', 'Training', 'Exam score')"
]
},
{
"cell_type": "code",
"execution_count": 312,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Distribution of Exam_Score in the validation set:\n",
"Exam_Score\n",
"66 166\n",
"67 155\n",
"68 145\n",
"69 138\n",
"65 127\n",
"64 106\n",
"70 104\n",
"71 69\n",
"63 68\n",
"72 54\n",
"62 46\n",
"61 37\n",
"74 27\n",
"73 24\n",
"60 14\n",
"59 11\n",
"75 9\n",
"76 4\n",
"58 2\n",
"82 2\n",
"94 2\n",
"86 2\n",
"88 1\n",
"56 1\n",
"79 1\n",
"84 1\n",
"97 1\n",
"95 1\n",
"57 1\n",
"91 1\n",
"78 1\n",
"100 1\n",
"Name: count, dtype: int64\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Temp\\ipykernel_7960\\3933200995.py:4: FutureWarning: \n",
"\n",
"Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.\n",
"\n",
" sns.barplot(x=genre_counts.index, y=genre_counts.values, palette='viridis')\n"
]
},
{
"data": {
"image/png": "(base64-encoded PNG data omitted)",
"text/plain": [
"<Figure size 1600x400 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"check_balance(val_df, \"the validation set\")\n",
"build_graph(val_df, 'Exam_Score', 'Validation', 'Exam score')"
]
},
{
"cell_type": "code",
"execution_count": 313,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Distribution of Exam_Score in the test set:\n",
"Exam_Score\n",
"65 148\n",
"68 147\n",
"66 144\n",
"67 142\n",
"69 118\n",
"70 116\n",
"64 94\n",
"71 84\n",
"63 77\n",
"72 70\n",
"62 54\n",
"73 33\n",
"61 26\n",
"74 19\n",
"75 17\n",
"60 13\n",
"59 5\n",
"58 3\n",
"57 2\n",
"55 1\n",
"89 1\n",
"88 1\n",
"98 1\n",
"97 1\n",
"77 1\n",
"80 1\n",
"76 1\n",
"87 1\n",
"93 1\n",
"Name: count, dtype: int64\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Temp\\ipykernel_7960\\3933200995.py:4: FutureWarning: \n",
"\n",
"Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.\n",
"\n",
" sns.barplot(x=genre_counts.index, y=genre_counts.values, palette='viridis')\n"
]
},
{
"data": {
"image/png": "(base64-encoded PNG data omitted)",
"text/plain": [
"<Figure size 1600x400 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"check_balance(test_df, \"the test set\")\n",
"build_graph(test_df, 'Exam_Score', 'Test', 'Exam score')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"9) Data augmentation using oversampling and undersampling"
]
},
{
"cell_type": "code",
"execution_count": 314,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Distribution of Exam_Score in the training set after oversampling:\n",
"Exam_Score\n",
"67 467\n",
"73 467\n",
"65 467\n",
"63 467\n",
"64 467\n",
"68 467\n",
"69 467\n",
"66 467\n",
"72 467\n",
"71 467\n",
"76 467\n",
"70 467\n",
"61 467\n",
"94 467\n",
"62 467\n",
"60 467\n",
"74 467\n",
"59 467\n",
"78 467\n",
"80 467\n",
"86 467\n",
"77 467\n",
"75 467\n",
"79 467\n",
"87 467\n",
"58 467\n",
"83 467\n",
"85 467\n",
"95 467\n",
"99 467\n",
"92 467\n",
"84 467\n",
"57 467\n",
"96 467\n",
"97 467\n",
"98 467\n",
"82 467\n",
"101 467\n",
"89 467\n",
"88 467\n",
"93 467\n",
"Name: count, dtype: int64\n",
"\n"
]
}
],
"source": [
"from imblearn.over_sampling import RandomOverSampler\n",
"\n",
"def oversample(df):\n",
"    # separate the features from the target column\n",
"    X = df.drop('Exam_Score', axis=1)\n",
"    y = df['Exam_Score']\n",
"\n",
"    # duplicate minority-class rows until every class matches the largest one\n",
"    oversampler = RandomOverSampler(random_state=42)\n",
"    X_resampled, y_resampled = oversampler.fit_resample(X, y)\n",
"\n",
"    # reattach the target column to the resampled features\n",
"    resampled_df = pd.concat([X_resampled, y_resampled], axis=1)\n",
"    return resampled_df\n",
"\n",
"train_df_oversampled = oversample(train_df)\n",
"val_df_oversampled = oversample(val_df)\n",
"test_df_oversampled = oversample(test_df)\n",
"\n",
"check_balance(train_df_oversampled, \"the training set after oversampling\")\n"
]
},
{
"cell_type": "code",
"execution_count": 315,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Distribution of Exam_Score in the validation set after oversampling:\n",
"Exam_Score\n",
"65 166\n",
"69 166\n",
"70 166\n",
"64 166\n",
"61 166\n",
"75 166\n",
"62 166\n",
"66 166\n",
"72 166\n",
"67 166\n",
"68 166\n",
"63 166\n",
"71 166\n",
"59 166\n",
"74 166\n",
"73 166\n",
"76 166\n",
"82 166\n",
"56 166\n",
"88 166\n",
"60 166\n",
"84 166\n",
"58 166\n",
"79 166\n",
"95 166\n",
"97 166\n",
"86 166\n",
"57 166\n",
"91 166\n",
"78 166\n",
"94 166\n",
"100 166\n",
"Name: count, dtype: int64\n",
"\n"
]
}
],
"source": [
"check_balance(val_df_oversampled, \"the validation set after oversampling\")"
]
},
{
"cell_type": "code",
"execution_count": 316,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Distribution of Exam_Score in the test set after oversampling:\n",
"Exam_Score\n",
"65 148\n",
"71 148\n",
"64 148\n",
"66 148\n",
"72 148\n",
"70 148\n",
"63 148\n",
"74 148\n",
"69 148\n",
"67 148\n",
"62 148\n",
"89 148\n",
"55 148\n",
"73 148\n",
"61 148\n",
"60 148\n",
"68 148\n",
"75 148\n",
"59 148\n",
"58 148\n",
"88 148\n",
"57 148\n",
"98 148\n",
"97 148\n",
"77 148\n",
"80 148\n",
"76 148\n",
"87 148\n",
"93 148\n",
"Name: count, dtype: int64\n",
"\n"
]
}
],
"source": [
"check_balance(test_df_oversampled, \"the test set after oversampling\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Scripts",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
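
The oversampling step in section 9 uses `RandomOverSampler`, which by default brings every class up to the size of the largest one (467 rows per `Exam_Score` value in the training set above). What follows is only a minimal standard-library sketch of that idea, not imbalanced-learn's implementation: the helper `random_oversample` and the toy rows are hypothetical and are not part of the notebook or the dataset.

```python
import random
from collections import Counter


def random_oversample(rows, label_key, seed=42):
    """Duplicate randomly chosen rows of each smaller class until every
    class is as large as the largest one (illustrative sketch only)."""
    rng = random.Random(seed)

    # group the rows by their class label
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    # every class is grown to the size of the largest class
    target = max(len(group) for group in by_class.values())

    resampled = []
    for group in by_class.values():
        resampled.extend(group)
        # draw the missing rows with replacement from the same class
        resampled.extend(rng.choices(group, k=target - len(group)))
    return resampled


# toy rows (hypothetical values, not taken from the dataset)
toy = (
    [{"Hours_Studied": 20, "Exam_Score": 68}] * 5
    + [{"Hours_Studied": 25, "Exam_Score": 70}] * 3
    + [{"Hours_Studied": 40, "Exam_Score": 98}] * 1
)

balanced = random_oversample(toy, "Exam_Score")
counts = Counter(row["Exam_Score"] for row in balanced)
print(counts)
```

Because the added rows are exact copies, a class with a single original row (such as the score 101 in the training set) ends up as the same row repeated many times, so the balanced counts say nothing about how varied each class really is.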