{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Lab Work 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"#### Defining business and technical goals\n",
"1. Car price prediction\n",
"Business goal: optimize the pricing policy.\n",
"Technical goal: build a price prediction model.\n",
"\n",
"Feature engineering:\n",
"\n",
"- Engine volume: extract the numeric part of the Engine volume column (for example, 3.5) and add a binary feature for the presence of a turbocharger (Turbo).\n",
"- Car age: compute the age of the car as the difference between the current year and Prod. year.\n",
"- Drive wheels: encode the drive type (for example, 4x4, Front, Rear) with one-hot encoding.\n",
"- Category: convert the car category into numeric features with one-hot encoding.\n",
"- Technical characteristics: normalize numeric parameters such as mileage (Mileage) and the number of cylinders (Cylinders).\n",
"- Interior: encode the presence of a leather interior (Leather interior) as a binary feature.\n",
"\n",
"2. Car popularity classification\n",
"Business goal: study customer preferences.\n",
"Technical goal: determine how popular a car is.\n",
"\n",
"Feature engineering:\n",
"\n",
"- Safety rating: use the number of airbags (Airbags) to build a safety indicator for the car.\n",
"- Fuel type: encode Fuel type as a categorical feature (for example, Petrol, Diesel, Hybrid).\n",
"- Color: create a color-rarity feature based on how often each color occurs in the data.\n",
"- Maintenance cost: convert Levy to a numeric feature and handle missing values (for example, replace them with the mean or median).\n",
"- Market segment: group car categories (for example, Jeep, Hatchback) into a few segments (premium, economy, compact).\n",
"- Steering side: create binary features indicating the steering configuration (Left wheel/Right-hand drive)." ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((11542, 215), (3847, 215), (3848, 215), (11542,), (3847,), (3848,))" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [
"from sklearn.model_selection import train_test_split\n",
"import pandas as pd\n",
"\n",
"data = pd.read_csv(\"car_price_prediction.csv\")\n",
"# Preparing the data by removing unnecessary columns and handling categorical data\n",
"data_cleaned = data.copy()\n",
"\n",
"# Converting \"Levy\" and \"Mileage\" to numeric (handling non-numeric values like '-')\n",
"data_cleaned[\"Levy\"] = pd.to_numeric(data_cleaned[\"Levy\"], errors=\"coerce\")\n",
"data_cleaned[\"Mileage\"] = (\n",
"    data_cleaned[\"Mileage\"].str.replace(\" km\", \"\").str.replace(\" \", \"\").astype(float)\n",
")\n",
"\n",
"# Dropping columns that are identifiers or too detailed for prediction (like ID, Model)\n",
"data_cleaned = data_cleaned.drop([\"ID\", \"Model\"], 
axis=1)\n", "\n", "# Encoding categorical columns\n", "categorical_cols = data_cleaned.select_dtypes(include=\"object\").columns\n", "data_encoded = pd.get_dummies(data_cleaned, columns=categorical_cols, drop_first=True)\n", "\n", "# Splitting the data into features (X) and target (y)\n", "X = data_encoded.drop(\"Price\", axis=1)\n", "y = data_encoded[\"Price\"]\n", "\n", "# Splitting into training, validation, and testing datasets\n", "X_train, X_temp, y_train, y_temp = train_test_split(\n", " X, y, test_size=0.4, random_state=42\n", ") # 60% training data\n", "X_val, X_test, y_val, y_test = train_test_split(\n", " X_temp, y_temp, test_size=0.5, random_state=42\n", ") # 20% validation, 20% testing\n", "\n", "# Displaying the sizes of the datasets\n", "X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | Price | Levy | Manufacturer | Leather interior | Engine volume | Mileage | Cylinders | Gear box type | Doors | Airbags | ... | Fuel type_LPG | Fuel type_Petrol | Fuel type_Plug-in Hybrid | Wheel_Right-hand drive | Car age | Car age bins | Mileage bins | Turbo | Safety rating | Color rarity |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 13328 | 1.065632 | LEXUS | Yes | 1.357980 | -0.027813 | 1.180937 | Automatic | 04-May | 12 | ... | False | False | False | False | 15 | 11-20 | High | 0 | 0.750 | 0.197120 |\n",
"| 1 | 16621 | 0.240688 | CHEVROLET | No | 0.788363 | -0.027689 | 1.180937 | Tiptronic | 04-May | 8 | ... | False | True | False | False | 14 | 11-20 | Very High | 0 | 0.500 | 0.261631 |\n",
"| 2 | 8467 | NaN | HONDA | No | -1.148338 | -0.027524 | -0.485866 | Variator | 04-May | 2 | ... | False | True | False | True | 19 | 11-20 | Very High | 0 | 0.125 | 0.261631 |\n",
"| 3 | 3607 | -0.097084 | FORD | Yes | 0.218745 | -0.028165 | -0.485866 | Automatic | 04-May | 0 | ... | False | False | False | False | 14 | 11-20 | High | 0 | 0.000 | 0.233352 |\n",
"| 4 | 11726 | -0.997809 | HONDA | Yes | -1.148338 | -0.029757 | -0.485866 | Automatic | 04-May | 4 | ... | False | True | False | False | 11 | 11-20 | Medium | 0 | 0.250 | 0.197120 |\n",
"\n",
"5 rows × 35 columns
\n", "