708 lines
180 KiB
Plaintext
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading the first dataset into a DataFrame (Steam games)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://www.kaggle.com/datasets/wajihulhassan369/steam-games-dataset. The dataset describes action games available on Steam. It is useful for studying gaming patterns, modeling prices, and exploring how game tags correlate with pricing approaches. It supports exploratory data analysis, building machine learning models, and research into the games industry. It includes the release date, a set of tags, and the review rating, so one can see which tags are most popular, what players like most in games, whether game quality has changed over time, and so on. For a business, such a dataset can help predict which kinds of games are worth investing in, so the company is less likely to lose money.\n",
"\n",
"Example goal: develop a PC game timed to the right phase of the market.\n",
"\n",
"Input data: release year, total sales.\n",
"\n",
"Target feature: how well games sell in the current market phase compared with previous phases."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['Name', 'Price', 'Release_date', 'Review_no', 'Review_type', 'Tags',\n",
"       'Description'],\n",
"      dtype='object')\n"
]
}
],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"df = pd.read_csv(\".//static//csv//steam_cleaned.csv\")\n",
"print(df.columns)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Convert the release date to datetime\n",
"df['Release_date'] = pd.to_datetime(df['Release_date'])\n",
"\n",
"# Visualize the data\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(df['Release_date'], df['Review_no'])\n",
"plt.xlabel('Release Date')\n",
"plt.ylabel('Review Number')\n",
"plt.title('Scatter Plot of Review Number vs Release Date')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Checking for noise reveals an outlier in 2014: the number of reviews there is extremely high."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Outliers are handled using quartile-based thresholds (in the next cell, values more than 1.5 × IQR beyond the quartiles are replaced with the median). The noise level is not very high. Data coverage is high and recent enough for the stated task."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Outliers:\n",
" Name Price Release_date Review_no \\\n",
"18 GUNDAM BREAKER 4 59.99 2024-08-29 1846.0 \n",
"22 LOCKDOWN Protocol 5.49 2024-07-22 2192.0 \n",
"34 CarX Street 19.99 2024-08-29 4166.0 \n",
"45 Harry Potter: Quidditch Champions 25.99 2024-09-03 1216.0 \n",
"61 SMITE 2 18.00 2024-08-27 1633.0 \n",
"... ... ... ... ... \n",
"7695 Dude Simulator 2 2.99 2018-07-28 1734.0 \n",
"7717 Golfing Over It with Alva Majo 2.39 2018-03-28 1367.0 \n",
"7740 Dungeon Siege II 4.99 2005-08-16 2274.0 \n",
"7765 Phantom Doctrine 12.99 2018-08-14 3538.0 \n",
"7768 NECROPOLIS: BRUTAL EDITION 19.99 2016-07-12 3668.0 \n",
"\n",
" Review_type Tags \\\n",
"18 Very Positive Action,Robots,Hack and Slash,RPG,Mechs,Action ... \n",
"22 Very Positive Multiplayer,Social Deduction,Conversation,Acti... \n",
"34 Mixed Racing,Open World,Automobile Sim,PvP,Multiplay... \n",
"45 Mostly Positive Action,Sports,Flight,Arcade,Third Person,Magic... \n",
"61 Mixed Action,MOBA,Third Person,Strategy,Adventure,Ca... \n",
"... ... ... \n",
"7695 Mixed Life Sim,Indie,Simulation,Racing,Action,Advent... \n",
"7717 Mostly Positive Difficult,Physics,Golf,Platformer,Precision Pl... \n",
"7740 Mostly Positive RPG,Fantasy,Action RPG,Hack and Slash,Singlepl... \n",
"7765 Mostly Positive Turn-Based Tactics,Strategy,Cold War,Stealth,R... \n",
"7768 Mixed Souls-like,Action Roguelike,Co-op,Adventure,Ro... \n",
"\n",
" Description \n",
"18 Create your own ultimate Gundam in the newest ... \n",
"22 A first person social deduction game, combinin... \n",
"34 Conquer mountain roads, highways, and city str... \n",
"45 Your next chapter takes flight! Immerse yourse... \n",
"61 Become a god and wage war in SMITE 2, the Unre... \n",
"... ... \n",
"7695 Dude Simulator 2 is an open world sandbox game... \n",
"7717 The higher you climb, the bigger the fall. \n",
"7740 NaN \n",
"7765 The year is 1983. The world teeters on the ver... \n",
"7768 NECROPOLIS: BRUTAL EDITION is a major update f... \n",
"\n",
"[1049 rows x 7 columns]\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"\n",
"# Convert the release date to datetime\n",
"df['Release_date'] = pd.to_datetime(df['Release_date'])\n",
"\n",
"# Compute the quartiles and the interquartile range (IQR) to detect outliers\n",
"Q1 = df['Review_no'].quantile(0.25)\n",
"Q3 = df['Review_no'].quantile(0.75)\n",
"IQR = Q3 - Q1\n",
"\n",
"# Outlier threshold: values more than 1.5 * IQR beyond the quartiles\n",
"threshold = 1.5 * IQR\n",
"outliers = (df['Review_no'] < (Q1 - threshold)) | (df['Review_no'] > (Q3 + threshold))\n",
"\n",
"# Print the outliers\n",
"print(\"Outliers:\")\n",
"print(df[outliers])\n",
"\n",
"# Handle the outliers\n",
"# Here the outliers are replaced with the median value\n",
"median_review_no = df['Review_no'].median()\n",
"df.loc[outliers, 'Review_no'] = median_review_no\n",
"\n",
"# Visualize the data after handling the outliers\n",
"plt.figure(figsize=(10, 6))\n",
"plt.scatter(df['Release_date'], df['Review_no'])\n",
"plt.xlabel('Release Date')\n",
"plt.ylabel('Review Number')\n",
"plt.title('Scatter Plot of Review Number vs Release Date (After Handling Outliers)')\n",
"plt.show()"
]
},
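To make the 1.5 × IQR rule from the cell above concrete, here is a small dependency-free sketch on made-up numbers. The function name `replace_outliers_with_median` and the sample values are illustrative assumptions, not part of the notebook. `statistics.quantiles(..., method="inclusive")` is used because it interpolates linearly between data points, which should correspond to the default behavior of pandas `quantile`.

```python
import statistics


def replace_outliers_with_median(values):
    """Replace values outside the 1.5 * IQR fences with the median."""
    # Quartiles with linear interpolation between data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    # As in the notebook, the median is taken over all values, outliers included
    median = statistics.median(values)
    return [median if (v < lower or v > upper) else v for v in values]


# Hypothetical review counts with one extreme value
review_counts = [1, 2, 3, 4, 100]
cleaned = replace_outliers_with_median(review_counts)
print(cleaned)
```

Dropping the flagged rows instead of overwriting them is another common choice. Replacing keeps the row count unchanged but flattens the tail of the distribution.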
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remove rows with missing values from the dataset."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Number of rows removed: 515\n",
"\n",
"DataFrame after removing rows with missing values:\n",
" Name Price Release_date \\\n",
"0 Black Myth: Wukong 59.99 2024-08-20 \n",
"2 Counter-Strike 2 0.00 2012-08-21 \n",
"4 Grand Theft Auto V 10.48 2015-04-14 \n",
"5 Red Dead Redemption 2 17.99 2019-12-05 \n",
"6 PUBG: BATTLEGROUNDS 0.00 2017-12-21 \n",
"... ... ... ... \n",
"7807 Monster Hunter World: Iceborne - MHW:I Monster... 2.99 2020-02-06 \n",
"7808 Gene Shift Auto: Deluxe Edition 8.99 2022-11-28 \n",
"7809 Run Ralph Run 0.45 2021-03-03 \n",
"7810 Quadroids 6.19 2024-02-22 \n",
"7811 Divekick 4.99 2013-08-20 \n",
"\n",
" Review_no Review_type \\\n",
"0 270.0 Overwhelmingly Positive \n",
"2 270.0 Very Positive \n",
"4 270.0 Very Positive \n",
"5 270.0 Very Positive \n",
"6 270.0 Mixed \n",
"... ... ... \n",
"7807 39.0 Positive \n",
"7808 16.0 Positive \n",
"7809 26.0 Mostly Positive \n",
"7810 15.0 Positive \n",
"7811 1118.0 Very Positive \n",
"\n",
" Tags \\\n",
"0 Mythology,Action RPG,Action,Souls-like,RPG,Com... \n",
"2 FPS,Shooter,Multiplayer,Competitive,Action,Tea... \n",
"4 Open World,Action,Multiplayer,Crime,Automobile... \n",
"5 Open World,Story Rich,Western,Adventure,Multip... \n",
"6 Survival,Shooter,Battle Royale,Multiplayer,FPS... \n",
"... ... \n",
"7807 Action \n",
"7808 Indie,Action,Free to Play,Battle Royale,Roguel... \n",
"7809 Adventure,Action,Puzzle,Arcade,Platformer,Shoo... \n",
"7810 Precision Platformer,Puzzle Platformer,2D Plat... \n",
"7811 Fighting,Indie,2D Fighter,Parody ,Local Multip... \n",
"\n",
" Description \n",
"0 Black Myth: Wukong is an action RPG rooted in ... \n",
"2 For over two decades, Counter-Strike has offer... \n",
"4 Grand Theft Auto V for PC offers players the o... \n",
"5 Winner of over 175 Game of the Year Awards and... \n",
"6 Play PUBG: BATTLEGROUNDS for free.\\n\\nLand on ... \n",
"... ... \n",
"7807 A monster figure you can use to decorate your ... \n",
"7808 Gene Shift Auto is a roguelike-inspired battle... \n",
"7809 Ralph is a smart dinosaur, and a great shooter. \n",
"7810 Quadroids is a single-player puzzle platformer... \n",
"7811 Divekick is the world’s first two-button fight... \n",
"\n",
"[7297 rows x 7 columns]\n"
]
}
],
"source": [
"# Drop rows with missing values\n",
"df_dropna = df.dropna()\n",
"\n",
"# Report how many rows were removed\n",
"num_deleted_rows = len(df) - len(df_dropna)\n",
"print(f\"\\nNumber of rows removed: {num_deleted_rows}\")\n",
"\n",
"print(\"\\nDataFrame after removing rows with missing values:\")\n",
"print(df_dropna)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now create the training, validation, and test splits."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set size: 4687\n",
"Validation set size: 1562\n",
"Test set size: 1563\n"
]
}
],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"# Note: this reloads the original CSV, so the outlier handling and dropna() above are not applied to the splits\n",
"df = pd.read_csv(\".//static//csv//steam_cleaned.csv\")\n",
"\n",
"# Hold out 40% of the rows; the remaining 60% form the training set\n",
"train_df, temp_df = train_test_split(df, test_size=0.4, random_state=42)\n",
"\n",
"# Split the held-out rows evenly into validation and test sets\n",
"val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)\n",
"\n",
"# Check the split sizes\n",
"print(\"Training set size:\", len(train_df))\n",
"print(\"Validation set size:\", len(val_df))\n",
"print(\"Test set size:\", len(test_df))\n",
"\n",
"# Save the splits to files (optional)\n",
"train_df.to_csv(\".//static//csv//train_data.csv\", index=False)\n",
"val_df.to_csv(\".//static//csv//val_data.csv\", index=False)\n",
"test_df.to_csv(\".//static//csv//test_data.csv\", index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Analyze how balanced the splits are."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Review_type distribution in the training set:\n",
"Review_type\n",
"Very Positive 2117\n",
"Mostly Positive 810\n",
"Mixed 797\n",
"Positive 710\n",
"Overwhelmingly Positive 209\n",
"Mostly Negative 15\n",
"Very Negative 2\n",
"Overwhelmingly Negative 1\n",
"Name: count, dtype: int64\n",
"Share of 'Mostly Positive' reviews: 17.28%\n",
"Share of 'Overwhelmingly Positive' reviews: 4.46%\n",
"\n",
"Review_type distribution in the validation set:\n",
"Review_type\n",
"Very Positive 708\n",
"Mostly Positive 290\n",
"Mixed 241\n",
"Positive 224\n",
"Overwhelmingly Positive 78\n",
"Mostly Negative 6\n",
"Very Negative 2\n",
"Name: count, dtype: int64\n",
"Share of 'Mostly Positive' reviews: 18.57%\n",
"Share of 'Overwhelmingly Positive' reviews: 4.99%\n",
"\n",
"Review_type distribution in the test set:\n",
"Review_type\n",
"Very Positive 713\n",
"Mostly Positive 276\n",
"Mixed 253\n",
"Positive 240\n",
"Overwhelmingly Positive 67\n",
"Mostly Negative 5\n",
"Very Negative 1\n",
"Name: count, dtype: int64\n",
"Share of 'Mostly Positive' reviews: 17.66%\n",
"Share of 'Overwhelmingly Positive' reviews: 4.29%\n",
"\n",
"Data augmentation is needed to balance the classes.\n",
"Data augmentation is needed to balance the classes.\n",
"Data augmentation is needed to balance the classes.\n"
]
}
],
"source": [
"train_df = pd.read_csv(\".//static//csv//train_data.csv\")\n",
"val_df = pd.read_csv(\".//static//csv//val_data.csv\")\n",
"test_df = pd.read_csv(\".//static//csv//test_data.csv\")\n",
"\n",
"# Assess class balance\n",
"def check_balance(df, name):\n",
"    counts = df['Review_type'].value_counts()\n",
"    print(f\"Review_type distribution in {name}:\")\n",
"    print(counts)\n",
"    # The two shares below are labeled by the classes that are actually looked up\n",
"    print(f\"Share of 'Mostly Positive' reviews: {counts['Mostly Positive'] / len(df) * 100:.2f}%\")\n",
"    print(f\"Share of 'Overwhelmingly Positive' reviews: {counts['Overwhelmingly Positive'] / len(df) * 100:.2f}%\")\n",
"    print()\n",
"\n",
"# Decide whether data augmentation is needed\n",
"def need_augmentation(df):\n",
"    counts = df['Review_type'].value_counts()\n",
"    # Rough check: compares only 'Mostly Positive' against 'Overwhelmingly Positive'\n",
"    ratio = counts['Mostly Positive'] / counts['Overwhelmingly Positive']\n",
"    if ratio > 1.5 or ratio < 0.67:\n",
"        print(\"Data augmentation is needed to balance the classes.\")\n",
"    else:\n",
"        print(\"Data augmentation is not required.\")\n",
"\n",
"check_balance(train_df, \"the training set\")\n",
"check_balance(val_df, \"the validation set\")\n",
"check_balance(test_df, \"the test set\")\n",
"\n",
"\n",
"\n",
"need_augmentation(train_df)\n",
"need_augmentation(val_df)\n",
"need_augmentation(test_df)"
]
},
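The check above compares only two of the eight labels. A more general check, sketched here under the assumption that the largest-to-smallest class ratio is the quantity of interest, looks at every class. The function name `imbalance_ratio` is hypothetical, the 1.5 threshold is reused from `need_augmentation`, and the counts are copied from the training-set output above.

```python
def imbalance_ratio(class_counts):
    """Return the ratio of the largest class size to the smallest one."""
    sizes = [n for n in class_counts.values() if n > 0]
    return max(sizes) / min(sizes)


# Class counts copied from the training-set output
train_counts = {
    "Very Positive": 2117,
    "Mostly Positive": 810,
    "Mixed": 797,
    "Positive": 710,
    "Overwhelmingly Positive": 209,
    "Mostly Negative": 15,
    "Very Negative": 2,
    "Overwhelmingly Negative": 1,
}

ratio = imbalance_ratio(train_counts)
needs_balancing = ratio > 1.5  # same threshold as in need_augmentation
print(ratio, needs_balancing)
```

With a single-row class in the training set, this ratio is dominated by the rarest label, so merging very rare labels before resampling may be worth considering.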
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The analysis shows that augmentation is needed: the ratios between review classes fall outside the acceptable range."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Oversampling:\n",
"Review_type distribution in the training set:\n",
"Review_type\n",
"Mostly Positive 2117\n",
"Mixed 2117\n",
"Very Positive 2117\n",
"Positive 2117\n",
"Overwhelmingly Positive 2117\n",
"Mostly Negative 2117\n",
"Very Negative 2117\n",
"Overwhelmingly Negative 2117\n",
"Name: count, dtype: int64\n",
"One or both classes (Positive/Negative) are missing.\n",
"\n",
"Review_type distribution in the validation set:\n",
"Review_type\n",
"Very Negative 708\n",
"Mostly Positive 708\n",
"Mixed 708\n",
"Overwhelmingly Positive 708\n",
"Overwhelmingly Negative 708\n",
"Positive 708\n",
"Mostly Negative 708\n",
"Very Positive 708\n",
"Name: count, dtype: int64\n",
"One or both classes (Positive/Negative) are missing.\n",
"\n",
"Review_type distribution in the test set:\n",
"Review_type\n",
"Very Negative 713\n",
"Mostly Positive 713\n",
"Overwhelmingly Positive 713\n",
"Mixed 713\n",
"Overwhelmingly Negative 713\n",
"Very Positive 713\n",
"Mostly Negative 713\n",
"Positive 713\n",
"Name: count, dtype: int64\n",
"One or both classes (Positive/Negative) are missing.\n",
"\n",
"Undersampling:\n",
"Review_type distribution in the training set:\n",
"Review_type\n",
"Mixed 1\n",
"Mostly Negative 1\n",
"Mostly Positive 1\n",
"Overwhelmingly Negative 1\n",
"Overwhelmingly Positive 1\n",
"Positive 1\n",
"Very Negative 1\n",
"Very Positive 1\n",
"Name: count, dtype: int64\n",
"One or both classes (Positive/Negative) are missing.\n",
"\n",
"Review_type distribution in the validation set:\n",
"Review_type\n",
"Mixed 2\n",
"Mostly Negative 2\n",
"Mostly Positive 2\n",
"Overwhelmingly Negative 2\n",
"Overwhelmingly Positive 2\n",
"Positive 2\n",
"Very Negative 2\n",
"Very Positive 2\n",
"Name: count, dtype: int64\n",
"One or both classes (Positive/Negative) are missing.\n",
"\n",
"Review_type distribution in the test set:\n",
"Review_type\n",
"Mixed 1\n",
"Mostly Negative 1\n",
"Mostly Positive 1\n",
"Overwhelmingly Negative 1\n",
"Overwhelmingly Positive 1\n",
"Positive 1\n",
"Very Negative 1\n",
"Very Positive 1\n",
"Name: count, dtype: int64\n",
"One or both classes (Positive/Negative) are missing.\n",
"\n"
]
}
],
"source": [
"import pandas as pd\n",
"from imblearn.over_sampling import RandomOverSampler\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"from sklearn.preprocessing import LabelEncoder\n",
"\n",
"# Load the data\n",
"train_df = pd.read_csv(\".//static//csv//train_data.csv\")\n",
"val_df = pd.read_csv(\".//static//csv//val_data.csv\")\n",
"test_df = pd.read_csv(\".//static//csv//test_data.csv\")\n",
"\n",
"# Convert categorical features to numeric codes\n",
"# Note: a new encoder is fitted on every call, so the codes are not comparable across splits\n",
"def encode(df):\n",
"    label_encoders = {}\n",
"    for column in df.select_dtypes(include=['object']).columns:\n",
"        if column != 'Review_type':  # Skip the target variable\n",
"            le = LabelEncoder()\n",
"            df[column] = le.fit_transform(df[column])\n",
"            label_encoders[column] = le\n",
"    return label_encoders\n",
"\n",
"# Convert the target variable to numeric codes\n",
"def encode_target(df):\n",
"    le = LabelEncoder()\n",
"    df['Review_type'] = le.fit_transform(df['Review_type'])\n",
"    return le\n",
"\n",
"# Apply the encoding\n",
"label_encoders = encode(train_df)\n",
"encode(val_df)\n",
"encode(test_df)\n",
"\n",
"# Encode the target variable\n",
"# Note: val_df and test_df get their own encoders here, yet le_target (fitted on train_df) decodes all splits below,\n",
"# so decoded labels for the validation and test sets may not match the original ones\n",
"le_target = encode_target(train_df)\n",
"encode_target(val_df)\n",
"encode_target(test_df)\n",
"\n",
"# Check the data types\n",
"def check_data_types(df):\n",
"    for column in df.columns:\n",
"        if df[column].dtype == 'object':\n",
"            print(f\"Column '{column}' contains string data.\")\n",
"\n",
"check_data_types(train_df)\n",
"check_data_types(val_df)\n",
"check_data_types(test_df)\n",
"\n",
"# Perform oversampling\n",
"def oversample(df):\n",
"    if 'Review_type' not in df.columns:\n",
"        print(\"Column 'Review_type' is missing.\")\n",
"        return df\n",
"\n",
"    X = df.drop('Review_type', axis=1)\n",
"    y = df['Review_type']\n",
"\n",
"    oversampler = RandomOverSampler(random_state=42)\n",
"    X_resampled, y_resampled = oversampler.fit_resample(X, y)\n",
"\n",
"    resampled_df = pd.concat([X_resampled, y_resampled], axis=1)\n",
"    return resampled_df\n",
"\n",
"# Perform undersampling\n",
"def undersample(df):\n",
"    if 'Review_type' not in df.columns:\n",
"        print(\"Column 'Review_type' is missing.\")\n",
"        return df\n",
"\n",
"    X = df.drop('Review_type', axis=1)\n",
"    y = df['Review_type']\n",
"\n",
"    undersampler = RandomUnderSampler(random_state=42)\n",
"    X_resampled, y_resampled = undersampler.fit_resample(X, y)\n",
"\n",
"    resampled_df = pd.concat([X_resampled, y_resampled], axis=1)\n",
"    return resampled_df\n",
"\n",
"# Apply oversampling and undersampling to each split\n",
"train_df_oversampled = oversample(train_df)\n",
"val_df_oversampled = oversample(val_df)\n",
"test_df_oversampled = oversample(test_df)\n",
"\n",
"train_df_undersampled = undersample(train_df)\n",
"val_df_undersampled = undersample(val_df)\n",
"test_df_undersampled = undersample(test_df)\n",
"\n",
"# Convert the target variable back to string labels\n",
"def decode_target(df, le_target):\n",
"    df['Review_type'] = le_target.inverse_transform(df['Review_type'])\n",
"\n",
"decode_target(train_df_oversampled, le_target)\n",
"decode_target(val_df_oversampled, le_target)\n",
"decode_target(test_df_oversampled, le_target)\n",
"\n",
"decode_target(train_df_undersampled, le_target)\n",
"decode_target(val_df_undersampled, le_target)\n",
"decode_target(test_df_undersampled, le_target)\n",
"\n",
"# Check the results\n",
"def check_balance(df, name):\n",
"    if 'Review_type' not in df.columns:\n",
"        print(f\"Column 'Review_type' is missing in {name}.\")\n",
"        return\n",
"\n",
"    counts = df['Review_type'].value_counts()\n",
"    print(f\"Review_type distribution in {name}:\")\n",
"    print(counts)\n",
"\n",
"    # Note: no plain 'Negative' label appears in these splits, so the else branch is taken\n",
"    if 'Positive' in counts and 'Negative' in counts:\n",
"        print(f\"Share of positive reviews: {counts['Positive'] / len(df) * 100:.2f}%\")\n",
"        print(f\"Share of negative reviews: {counts['Negative'] / len(df) * 100:.2f}%\")\n",
"    else:\n",
"        print(\"One or both classes (Positive/Negative) are missing.\")\n",
"    print()\n",
"\n",
"# Check the balance after oversampling\n",
"print(\"Oversampling:\")\n",
"check_balance(train_df_oversampled, \"the training set\")\n",
"check_balance(val_df_oversampled, \"the validation set\")\n",
"check_balance(test_df_oversampled, \"the test set\")\n",
"\n",
"# Check the balance after undersampling\n",
"print(\"Undersampling:\")\n",
"check_balance(train_df_undersampled, \"the training set\")\n",
"check_balance(val_df_undersampled, \"the validation set\")\n",
"check_balance(test_df_undersampled, \"the test set\")"
]
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"## 14,400 Classic Rock Tracks (with Spotify Data)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"https://www.kaggle.com/datasets/thebumpkin/14400-classic-rock-tracks-with-spotify-data This dataset, containing 1,200 unique albums and 14,400 tracks, is more than a collection: it chronicles the evolution of classic rock. Each track is cataloged across 18 columns, including core metadata such as track title, artist, album, and release year, alongside Spotify audio features that describe the sound of these timeless songs. A business goal could be to improve the marketing and promotion strategy for music tracks. One way this dataset could be useful to a business:\n",
|
|||
|
"Personalized recommendations: building algorithms that recommend music to users based on their preferences.\n",
|
|||
|
"Technical project goal: design and deploy a recommender system that predicts and recommends tracks to users based on their preferences and behavior.\n",
|
|||
|
"Input data:\n",
|
|||
|
"User data: user ID, listening history, track ratings, listening time, listening frequency.\n",
|
|||
|
"Track data: track attributes (title, artist, album, year, duration, danceability, energy, acousticness, etc.).\n",
|
|||
|
"Interaction data: when and how often a user interacts with specific tracks.\n",
|
|||
|
"Target feature:\n",
|
|||
|
"Recommendation: a Boolean variable indicating whether a given track should be recommended to the user (1 = recommend, 0 = do not recommend)."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 20,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Index(['Track', 'Artist', 'Album', 'Year', 'Duration', 'Time_Signature',\n",
|
|||
|
" 'Danceability', 'Energy', 'Key', 'Loudness', 'Mode', 'Speechiness',\n",
|
|||
|
" 'Acousticness', 'Instrumentalness', 'Liveness', 'Valence', 'Tempo',\n",
|
|||
|
" 'Popularity'],\n",
|
|||
|
" dtype='object')\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"import seaborn as sns\n",
|
|||
|
"df = pd.read_csv(\".//static//csv//UltimateClassicRock.csv\")\n",
|
|||
|
"print(df.columns)"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"metadata": {
|
|||
|
"kernelspec": {
|
|||
|
"display_name": "aimenv",
|
|||
|
"language": "python",
|
|||
|
"name": "python3"
|
|||
|
},
|
|||
|
"language_info": {
|
|||
|
"codemirror_mode": {
|
|||
|
"name": "ipython",
|
|||
|
"version": 3
|
|||
|
},
|
|||
|
"file_extension": ".py",
|
|||
|
"mimetype": "text/x-python",
|
|||
|
"name": "python",
|
|||
|
"nbconvert_exporter": "python",
|
|||
|
"pygments_lexer": "ipython3",
|
|||
|
"version": "3.12.5"
|
|||
|
}
|
|||
|
},
|
|||
|
"nbformat": 4,
|
|||
|
"nbformat_minor": 2
|
|||
|
}
|