{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Diamonds Prices**"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Unnamed: 0\n",
"carat\n",
"cut\n",
"color\n",
"clarity\n",
"depth\n",
"table\n",
"price\n",
"x\n",
"y\n",
"z\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"# Load the diamonds dataset\n",
"df = pd.read_csv(\".//static//csv//Diamonds Prices2022.csv\")\n",
"\n",
"# Print the name of every column\n",
"attributes = df.columns\n",
"for attribute in attributes:\n",
"    print(attribute)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Business goals**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**1. Predicting a diamond's price from its characteristics**\n",
"\n",
"**Business goal:** Build a model that predicts the price of a diamond from its characteristics (such as carat, clarity, color, and cut).\n",
"\n",
"**Technical goal:** Train a machine learning model that takes a diamond's parameters as input and returns a predicted price. The objective is to minimize the prediction error.\n",
"\n",
"**2. Classifying a diamond's quality from its characteristics**\n",
"\n",
"**Business goal:** Assign each diamond a quality grade (for example high, medium, or low) from its characteristics so that stones can be sorted automatically.\n",
"\n",
"**Technical goal:** Build a classification model that takes a diamond's characteristics as input and assigns it a quality category. The objective is to maximize classification accuracy."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Data preparation**"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Missing values per column:\n",
"Unnamed: 0 0\n",
"carat 0\n",
"cut 0\n",
"color 0\n",
"clarity 0\n",
"depth 0\n",
"table 0\n",
"price 0\n",
"x 0\n",
"y 0\n",
"z 0\n",
"dtype: int64\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"# Check for missing values\n",
"missing_data = df.isnull().sum()\n",
"print(\"Missing values per column:\")\n",
"print(missing_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"No missing values were found."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Splitting the dataset into training, validation, and test sets**"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sample sizes:\n",
"Training set: 32365 records\n",
"Validation set: 10789 records\n",
"Test set: 10789 records\n"
]
},
{
"data": {
"image/png": "(base64-encoded PNG data omitted)",
"text/plain": [
"<Figure size 1200x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Separate the features from the target variable\n",
"X = df.drop(columns=['price'])  # Features (all columns except 'price')\n",
"y = df['price']  # Target variable (price)\n",
"\n",
"# Split into training (60%), validation (20%), and test (20%) sets\n",
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)\n",
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
"\n",
"# Check the sample sizes\n",
"print(\"Sample sizes:\")\n",
"print(f\"Training set: {X_train.shape[0]} records\")\n",
"print(f\"Validation set: {X_val.shape[0]} records\")\n",
"print(f\"Test set: {X_test.shape[0]} records\")\n",
"\n",
"# Plot the price distribution of the training set (first panel of a 1x3 grid)\n",
"plt.figure(figsize=(12, 6))\n",
"plt.subplot(1, 3, 1)\n",
"plt.hist(y_train, bins=30, color='blue', alpha=0.7)\n",
"plt.title('Training set')\n",
"plt.xlabel('Price')\n",
"plt.ylabel('Count')\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data are imbalanced: the number of observations differs greatly between price ranges. The model may therefore predict prices for more expensive diamonds less accurately, because there are fewer such examples. We apply resampling (augmentation) methods to the training set."
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sample sizes:\n",
"Training set: 32365 records\n",
"Validation set: 10789 records\n",
"Test set: 10789 records\n"
]
},
{
"data": {
"image/png": "(base64-encoded PNG data omitted)",
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"import matplotlib.pyplot as plt\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"from imblearn.over_sampling import SMOTE\n",
"\n",
"# Separate the features from the target variable\n",
"X = df.drop(columns=['price'])  # Features (all columns except 'price')\n",
"y = df['price']  # Target variable (price)\n",
"\n",
"# Apply one-hot encoding to the categorical features\n",
"X = pd.get_dummies(X, drop_first=True)\n",
"\n",
"# Split into training (60%), validation (20%), and test (20%) sets\n",
"X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)\n",
"X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)\n",
"\n",
"# Check the sample sizes\n",
"print(\"Sample sizes:\")\n",
"print(f\"Training set: {X_train.shape[0]} records\")\n",
"print(f\"Validation set: {X_val.shape[0]} records\")\n",
"print(f\"Test set: {X_test.shape[0]} records\")\n",
"\n",
"# Apply RandomUnderSampler to shrink the larger classes\n",
"# Note: imblearn samplers treat y as class labels, so each distinct price value is handled as its own class\n",
"undersampler = RandomUnderSampler(random_state=42)\n",
"X_undersampled, y_undersampled = undersampler.fit_resample(X_train, y_train)\n",
"\n",
"# Apply SMOTE to improve the balance further\n",
"smote = SMOTE(random_state=42)\n",
"X_resampled, y_resampled = smote.fit_resample(X_undersampled, y_undersampled)\n",
"\n",
"# Plot the price distribution of the balanced training set\n",
"plt.figure(figsize=(10, 6))\n",
"plt.hist(y_resampled, bins=30, color='orange', alpha=0.7)\n",
"plt.title('Balanced training set')\n",
"plt.xlabel('Price')\n",
"plt.ylabel('Count')\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The training data are now much more balanced."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Feature engineering**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. One-hot encoding of categorical features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One-hot encoding was already applied in the previous step. Here we print the data before and after encoding."
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data before one-hot encoding:\n",
" Unnamed: 0 carat cut color clarity depth table price x y \\\n",
"0 1 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 \n",
"1 2 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 \n",
"2 3 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 \n",
"3 4 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 \n",
"4 5 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 \n",
"\n",
" z \n",
"0 2.43 \n",
"1 2.31 \n",
"2 2.31 \n",
"3 2.63 \n",
"4 2.75 \n",
"\n",
"Data after one-hot encoding:\n",
" Unnamed: 0 carat depth table price x y z cut_Good \\\n",
"0 1 0.23 61.5 55.0 326 3.95 3.98 2.43 False \n",
"1 2 0.21 59.8 61.0 326 3.89 3.84 2.31 False \n",
"2 3 0.23 56.9 65.0 327 4.05 4.07 2.31 True \n",
"3 4 0.29 62.4 58.0 334 4.20 4.23 2.63 False \n",
"4 5 0.31 63.3 58.0 335 4.34 4.35 2.75 True \n",
"\n",
" cut_Ideal ... color_H color_I color_J clarity_IF clarity_SI1 \\\n",
"0 True ... False False False False False \n",
"1 False ... False False False False True \n",
"2 False ... False False False False False \n",
"3 False ... False True False False False \n",
"4 False ... False False True False False \n",
"\n",
" clarity_SI2 clarity_VS1 clarity_VS2 clarity_VVS1 clarity_VVS2 \n",
"0 True False False False False \n",
"1 False False False False False \n",
"2 False True False False False \n",
"3 False False True False False \n",
"4 True False False False False \n",
"\n",
"[5 rows x 25 columns]\n"
]
}
],
"source": [
"import pandas as pd\n",
"df1 = pd.read_csv(\".//static//csv//Diamonds Prices2022.csv\")\n",
"\n",
"print(\"Data before one-hot encoding:\")\n",
"print(df1.head())\n",
"\n",
"# Apply one-hot encoding to the categorical features\n",
"df_encoded = pd.get_dummies(df1, drop_first=True)\n",
"\n",
"print(\"\\nData after one-hot encoding:\")\n",
"print(df_encoded.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The categorical columns `cut`, `color`, and `clarity` have been replaced by boolean indicator columns."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Discretization of numerical features\n"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data before discretization:\n",
" Unnamed: 0 carat cut color clarity depth table price x y \\\n",
"0 1 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 \n",
"1 2 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 \n",
"2 3 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 \n",
"3 4 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 \n",
"4 5 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 \n",
"\n",
" z \n",
"0 2.43 \n",
"1 2.31 \n",
"2 2.31 \n",
"3 2.63 \n",
"4 2.75 \n",
"\n",
"Data after discretization:\n",
" price price_bins\n",
"0 326 0-5k\n",
"1 326 0-5k\n",
"2 327 0-5k\n",
"3 334 0-5k\n",
"4 335 0-5k\n"
]
}
],
"source": [
"import pandas as pd\n",
"df1 = pd.read_csv(\".//static//csv//Diamonds Prices2022.csv\")\n",
"\n",
"print(\"Data before discretization:\")\n",
"print(df1.head())\n",
"\n",
"# Split the price into four intervals of width 5000 (left-closed, right-open)\n",
"bins = [0, 5000, 10000, 15000, 20000]\n",
"labels = ['0-5k', '5k-10k', '10k-15k', '15k-20k']\n",
"df1['price_bins'] = pd.cut(df1['price'], bins=bins, labels=labels, right=False)\n",
"\n",
"print(\"\\nData after discretization:\")\n",
"print(df1[['price', 'price_bins']].head())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The new `price_bins` column holds the price interval of each diamond."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3. Manual feature synthesis"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data before feature synthesis:\n",
" Unnamed: 0 carat cut color clarity depth table price x y \\\n",
"0 1 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 \n",
"1 2 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 \n",
"2 3 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 \n",
"3 4 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 \n",
"4 5 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 \n",
"\n",
" z \n",
"0 2.43 \n",
"1 2.31 \n",
"2 2.31 \n",
"3 2.63 \n",
"4 2.75 \n",
"\n",
"Data after synthesizing the 'price_per_carat' feature:\n",
" price carat price_per_carat\n",
"0 326 0.23 1417.391304\n",
"1 326 0.21 1552.380952\n",
"2 327 0.23 1421.739130\n",
"3 334 0.29 1151.724138\n",
"4 335 0.31 1080.645161\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"# Load the data\n",
"df1 = pd.read_csv(\".//static//csv//Diamonds Prices2022.csv\")\n",
"\n",
"# Inspect the first rows of the data\n",
"print(\"Data before feature synthesis:\")\n",
"print(df1.head())\n",
"\n",
"# Create the new feature 'price_per_carat' (price per carat)\n",
"df1['price_per_carat'] = df1['price'] / df1['carat']\n",
"\n",
"# Inspect the first rows after adding the feature\n",
"print(\"\\nData after synthesizing the 'price_per_carat' feature:\")\n",
"print(df1[['price', 'carat', 'price_per_carat']].head())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4. Feature scaling: normalization and standardization\n"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data before scaling:\n",
" price carat price_per_carat\n",
"0 326 0.23 1417.391304\n",
"1 326 0.21 1552.380952\n",
"2 327 0.23 1421.739130\n",
"3 334 0.29 1151.724138\n",
"4 335 0.31 1080.645161\n",
"\n",
"Data after normalization:\n",
" price carat price_per_carat\n",
"0 0.000000 0.006237 0.021828\n",
"1 0.000000 0.002079 0.029874\n",
"2 0.000054 0.006237 0.022087\n",
"3 0.000433 0.018711 0.005994\n",
"4 0.000487 0.022869 0.001757\n",
"\n",
"Data after standardization:\n",
" price carat price_per_carat\n",
"0 -0.904102 -1.198189 -1.287394\n",
"1 -0.904102 -1.240384 -1.220321\n",
"2 -0.903851 -1.198189 -1.285233\n",
"3 -0.902096 -1.071605 -1.419396\n",
"4 -0.901846 -1.029411 -1.454713\n"
]
}
],
"source": [
"import pandas as pd\n",
"from sklearn.preprocessing import MinMaxScaler, StandardScaler\n",
"df1 = pd.read_csv(\".//static//csv//Diamonds Prices2022.csv\")\n",
"\n",
"# Create the new feature 'price_per_carat' (price per carat)\n",
"df1['price_per_carat'] = df1['price'] / df1['carat']\n",
"\n",
"# Inspect the first rows before scaling\n",
"print(\"Data before scaling:\")\n",
"print(df1[['price', 'carat', 'price_per_carat']].head())\n",
"\n",
"# Normalize the features to the [0, 1] range (min-max scaling)\n",
"min_max_scaler = MinMaxScaler()\n",
"df1[['price', 'carat', 'price_per_carat']] = min_max_scaler.fit_transform(df1[['price', 'carat', 'price_per_carat']])\n",
"\n",
"# Inspect the first rows after normalization\n",
"print(\"\\nData after normalization:\")\n",
"print(df1[['price', 'carat', 'price_per_carat']].head())\n",
"\n",
"# Standardize the features (zero mean, unit variance)\n",
"# Note: this runs on the already normalized columns; z-scores are unchanged by that linear rescaling\n",
"standard_scaler = StandardScaler()\n",
"df1[['price', 'carat', 'price_per_carat']] = standard_scaler.fit_transform(df1[['price', 'carat', 'price_per_carat']])\n",
"\n",
"# Inspect the first rows after standardization\n",
"print(\"\\nData after standardization:\")\n",
"print(df1[['price', 'carat', 'price_per_carat']].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 5. Feature construction with the Featuretools framework"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\D\\semester5\\mii\\aimenv\\Lib\\site-packages\\featuretools\\entityset\\entityset.py:1733: UserWarning: index index not found in dataframe, creating new integer column\n",
" warnings.warn(\n",
"c:\\D\\semester5\\mii\\aimenv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"c:\\D\\semester5\\mii\\aimenv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"c:\\D\\semester5\\mii\\aimenv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"c:\\D\\semester5\\mii\\aimenv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"c:\\D\\semester5\\mii\\aimenv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"c:\\D\\semester5\\mii\\aimenv\\Lib\\site-packages\\woodwork\\type_sys\\utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" pd.to_datetime(\n",
"c:\\D\\semester5\\mii\\aimenv\\Lib\\site-packages\\featuretools\\synthesis\\deep_feature_synthesis.py:169: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Built 12 features\n",
"Elapsed: 00:00 | Progress: 100%|██████████\n",
"New features created with Featuretools:\n",
" Unnamed: 0 carat cut color clarity depth table price x \\\n",
"index \n",
"0 1 0.23 Ideal E SI2 61.5 55.0 326 3.95 \n",
"1 2 0.21 Premium E SI1 59.8 61.0 326 3.89 \n",
"2 3 0.23 Good E VS1 56.9 65.0 327 4.05 \n",
"3 4 0.29 Premium I VS2 62.4 58.0 334 4.20 \n",
"4 5 0.31 Good J SI2 63.3 58.0 335 4.34 \n",
"\n",
" y z price_per_carat \n",
"index \n",
"0 3.98 2.43 1417.391304 \n",
"1 3.84 2.31 1552.380952 \n",
"2 4.07 2.31 1421.739130 \n",
"3 4.23 2.63 1151.724138 \n",
"4 4.35 2.75 1080.645161 \n"
]
}
],
"source": [
"import pandas as pd\n",
"import featuretools as ft\n",
"df1 = pd.read_csv(\".//static//csv//Diamonds Prices2022.csv\")\n",
"\n",
"# Create the new feature 'price_per_carat'\n",
"df1['price_per_carat'] = df1['price'] / df1['carat']\n",
"\n",
"# Create an EntitySet\n",
"es = ft.EntitySet(id='diamonds')\n",
"\n",
"# Add the dataframe to the EntitySet\n",
"es = es.add_dataframe(dataframe_name='diamonds_data', dataframe=df1, index='index')\n",
"\n",
"# Build features with deep feature synthesis\n",
"features, feature_defs = ft.dfs(entityset=es, target_dataframe_name='diamonds_data', verbose=True)\n",
"\n",
"# Inspect the first rows of the new features\n",
"print(\"New features created with Featuretools:\")\n",
"print(features.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Quality evaluation**"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training time: 0.0694 seconds\n",
"MAE: 352.05\n",
"MSE: 363100.29\n",
"R^2: 0.98\n",
"Correlation: 0.99\n",
"Mean MSE (cross-validation): 885042.98\n"
]
}
],
"source": [
"import pandas as pd\n",
"import time\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score\n",
"import numpy as np\n",
"\n",
"# Load the data\n",
"df1 = pd.read_csv(\".//static//csv//Diamonds Prices2022.csv\")\n",
"\n",
"# Create the new feature 'price_per_carat'\n",
"# Note: this feature is computed from the target 'price', so it leaks the target into X\n",
"df1['price_per_carat'] = df1['price'] / df1['carat']\n",
"\n",
"# One-hot encoding\n",
"X = pd.get_dummies(df1.drop(columns=['price']), drop_first=True)\n",
"y = df1['price']\n",
"\n",
"# Split into training and test sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"# Quality evaluation\n",
"def evaluate_model(X_train, X_test, y_train, y_test):\n",
"    model = LinearRegression()\n",
"\n",
"    # Measure the training time\n",
"    start_time = time.time()\n",
"    model.fit(X_train, y_train)\n",
"    training_time = time.time() - start_time\n",
"\n",
"    # Predict on the test set\n",
"    y_pred = model.predict(X_test)\n",
"\n",
"    # Compute the metrics\n",
"    mae = mean_absolute_error(y_test, y_pred)\n",
"    mse = mean_squared_error(y_test, y_pred)\n",
"    r2 = r2_score(y_test, y_pred)\n",
"\n",
"    # Correlation between actual and predicted prices\n",
"    correlation = np.corrcoef(y_test, y_pred)[0, 1]\n",
"\n",
"    print(f\"Training time: {training_time:.4f} seconds\")\n",
"    print(f\"MAE: {mae:.2f}\")\n",
"    print(f\"MSE: {mse:.2f}\")\n",
"    print(f\"R^2: {r2:.2f}\")\n",
"    print(f\"Correlation: {correlation:.2f}\")\n",
"\n",
"# Evaluate the quality of the feature set\n",
"evaluate_model(X_train, X_test, y_train, y_test)\n",
"\n",
"# Additionally: assess robustness with cross-validation\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"cv_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='neg_mean_squared_error')\n",
|
|||
|
"print(f\"Среднее MSE (кросс-валидация): {-np.mean(cv_scores):.2f}\")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"**Вывод:**\n",
|
|||
|
"\n",
|
|||
|
"Время обучения: \n",
|
|||
|
"\n",
|
|||
|
"Время обучения модели составляет 0.0694 секунды, что является очень коротким. Это указывает на то, что модель обучается быстро и может эффективно обрабатывать данные.\n",
|
|||
|
"\n",
|
|||
|
"Предсказательная способность: \n",
|
|||
|
"\n",
|
|||
|
"MAE (Mean Absolute Error): 352.05 — это средняя абсолютная ошибка предсказаний модели. Значение MAE относительно невелико, что означает, что предсказанные значения в среднем отклоняются от реальных на 352.05. Это может быть приемлемым уровнем ошибки.\n",
|
|||
|
"\n",
|
|||
|
"MSE (Mean Squared Error): 363100.29 — это среднее значение квадратов ошибок. Хотя MSE высокое, оно также может быть связано с большими значениями целевой переменной (цен).\n",
|
|||
|
"\n",
|
|||
|
"R² (коэффициент детерминации): 0.98 — это очень высокий уровень, указывающий на то, что модель объясняет 98% вариации целевой переменной. Это свидетельствует о высокой предсказательной способности модели.\n",
|
|||
|
"\n",
|
|||
|
"Корреляция:\n",
|
|||
|
"\n",
|
|||
|
"Корреляция (0.99) между предсказанными и реальными значениями говорит о том, что предсказания модели имеют очень сильную линейную зависимость с реальными значениями. Это подтверждает, что модель хорошо обучена и делает точные прогнозы.\n",
|
|||
|
"\n",
|
|||
|
"Надежность (кросс-валидация):\n",
|
|||
|
"\n",
|
|||
|
"Среднее MSE (кросс-валидация): 885042.98 — это значительно выше, чем обычное MSE, что может указывать на потенциальные проблемы с переобучением. Модель может хорошо работать на обучающих данных, но ее производительность на новых данных (или в реальных условиях) может быть менее стабильной.\n",
|
|||
|
" \n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"metadata": {
|
|||
|
"kernelspec": {
|
|||
|
"display_name": "Python 3",
|
|||
|
"language": "python",
|
|||
|
"name": "python3"
|
|||
|
},
|
|||
|
"language_info": {
|
|||
|
"codemirror_mode": {
|
|||
|
"name": "ipython",
|
|||
|
"version": 3
|
|||
|
},
|
|||
|
"file_extension": ".py",
|
|||
|
"mimetype": "text/x-python",
|
|||
|
"name": "python",
|
|||
|
"nbconvert_exporter": "python",
|
|||
|
"pygments_lexer": "ipython3",
|
|||
|
"version": "3.12.6"
|
|||
|
}
|
|||
|
},
|
|||
|
"nbformat": 4,
|
|||
|
"nbformat_minor": 2
|
|||
|
}
|
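The gap between the test-set MSE and the cross-validated MSE reported in the notebook may come from how the folds are built rather than from overfitting. The sketch below illustrates that effect on synthetic data only. The `toy` table, its values, and the helper `mean_cv_mse` are made up for illustration and are not taken from the diamonds CSV; the assumption that the real file is ordered by price has not been verified here. It compares contiguous folds (what `cv=5` gives for a regressor) with shuffled folds on rows sorted by the target.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical stand-in for the diamonds table: price grows non-linearly
# with carat, and the rows are sorted by price (an assumption about the file).
rng = np.random.default_rng(42)
carat = rng.uniform(0.2, 3.0, size=1000)
price = 4000 * carat**2 + rng.normal(0, 300, size=1000)
toy = (
    pd.DataFrame({"carat": carat, "price": price})
    .sort_values("price")
    .reset_index(drop=True)
)

X = toy[["carat"]]
y = toy["price"]


def mean_cv_mse(cv):
    """Return the mean MSE of a linear regression over the given CV splitter."""
    scores = cross_val_score(
        LinearRegression(), X, y, cv=cv, scoring="neg_mean_squared_error"
    )
    return -scores.mean()


# Contiguous folds: each test fold covers one slice of the sorted price range.
mse_ordered = mean_cv_mse(KFold(n_splits=5))

# Shuffled folds: each test fold is a random sample of the whole range.
mse_shuffled = mean_cv_mse(KFold(n_splits=5, shuffle=True, random_state=42))

print(f"Mean MSE, contiguous folds: {mse_ordered:.2f}")
print(f"Mean MSE, shuffled folds:   {mse_shuffled:.2f}")
```

If shuffled folds on the real data bring the cross-validated MSE close to the test-set MSE, row ordering is the likely cause of the gap; if the gap remains, the model's instability deserves a closer look.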