{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lab 4. Supervised Learning.\n",
"\n",
"## Dataset: \"Dataset for heart attack analysis and prediction\".\n",
"\n",
"[**Link**](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease)\n",
"\n",
"### Dataset description\n",
"\n",
"**Problem domain**: The dataset comes from medical statistics and targets the analysis of factors associated with heart attack risk. This matters for predicting cardiovascular disease and for developing prevention strategies.\n",
"\n",
"**Relevance**: Cardiovascular diseases are among the leading causes of death worldwide. Analyzing data on lifestyle, health status, and hereditary factors helps identify the key predictors of cardiovascular disease. The dataset supports the analysis of such factors and can be useful for building predictive models aimed at reducing risk and enabling timely diagnosis.\n",
"\n",
"**Observation units**: Each record describes one person, including their health status, lifestyle, demographic characteristics, and the presence of certain diseases. The observation units are individual patients.\n",
"\n",
"**Attributes:**\n",
"- `HeartDisease`: whether the respondent has ever reported coronary heart disease or a heart attack (Yes/No).\n",
"- `BMI`: body mass index, numeric.\n",
"- `Smoking`: smoking (Yes/No).\n",
"- `AlcoholDrinking`: alcohol consumption (Yes/No).\n",
"- `Stroke`: history of stroke (Yes/No).\n",
"- `PhysicalHealth`: number of days per month when physical health was poor.\n",
"- `MentalHealth`: number of days per month when mental health was poor.\n",
"- `DiffWalking`: difficulty walking (Yes/No).\n",
"- `Sex`: sex (Male/Female).\n",
"- `AgeCategory`: age category (for example, 55-59, 80 or older).\n",
"- `Race`: race (for example, White, Black).\n",
"- `Diabetic`: diabetes status (Yes / No / No, borderline diabetes / Yes (during pregnancy)).\n",
"- `PhysicalActivity`: physical activity (Yes/No).\n",
"- `GenHealth`: general health (from Excellent to Poor).\n",
"- `SleepTime`: average hours of sleep per day.\n",
"- `Asthma`: asthma (Yes/No).\n",
"- `KidneyDisease`: kidney disease (Yes/No).\n",
"- `SkinCancer`: skin cancer (Yes/No)."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>HeartDisease</th>\n",
" <th>BMI</th>\n",
" <th>Smoking</th>\n",
" <th>AlcoholDrinking</th>\n",
" <th>Stroke</th>\n",
" <th>PhysicalHealth</th>\n",
" <th>MentalHealth</th>\n",
" <th>DiffWalking</th>\n",
" <th>Sex</th>\n",
" <th>AgeCategory</th>\n",
" <th>Race</th>\n",
" <th>Diabetic</th>\n",
" <th>PhysicalActivity</th>\n",
" <th>GenHealth</th>\n",
" <th>SleepTime</th>\n",
" <th>Asthma</th>\n",
" <th>KidneyDisease</th>\n",
" <th>SkinCancer</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>No</td>\n",
" <td>16.60</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>3.0</td>\n",
" <td>30.0</td>\n",
" <td>No</td>\n",
" <td>Female</td>\n",
" <td>55-59</td>\n",
" <td>White</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Very good</td>\n",
" <td>5.0</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>No</td>\n",
" <td>20.34</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>No</td>\n",
" <td>Female</td>\n",
" <td>80 or older</td>\n",
" <td>White</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Very good</td>\n",
" <td>7.0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>No</td>\n",
" <td>26.58</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>20.0</td>\n",
" <td>30.0</td>\n",
" <td>No</td>\n",
" <td>Male</td>\n",
" <td>65-69</td>\n",
" <td>White</td>\n",
" <td>Yes</td>\n",
" <td>Yes</td>\n",
" <td>Fair</td>\n",
" <td>8.0</td>\n",
" <td>Yes</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>No</td>\n",
" <td>24.21</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>No</td>\n",
" <td>Female</td>\n",
" <td>75-79</td>\n",
" <td>White</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Good</td>\n",
" <td>6.0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>No</td>\n",
" <td>23.71</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>28.0</td>\n",
" <td>0.0</td>\n",
" <td>Yes</td>\n",
" <td>Female</td>\n",
" <td>40-44</td>\n",
" <td>White</td>\n",
" <td>No</td>\n",
" <td>Yes</td>\n",
" <td>Very good</td>\n",
" <td>8.0</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" HeartDisease BMI Smoking AlcoholDrinking Stroke PhysicalHealth \\\n",
"0 No 16.60 Yes No No 3.0 \n",
"1 No 20.34 No No Yes 0.0 \n",
"2 No 26.58 Yes No No 20.0 \n",
"3 No 24.21 No No No 0.0 \n",
"4 No 23.71 No No No 28.0 \n",
"\n",
" MentalHealth DiffWalking Sex AgeCategory Race Diabetic \\\n",
"0 30.0 No Female 55-59 White Yes \n",
"1 0.0 No Female 80 or older White No \n",
"2 30.0 No Male 65-69 White Yes \n",
"3 0.0 No Female 75-79 White No \n",
"4 0.0 Yes Female 40-44 White No \n",
"\n",
" PhysicalActivity GenHealth SleepTime Asthma KidneyDisease SkinCancer \n",
"0 Yes Very good 5.0 Yes No Yes \n",
"1 Yes Very good 7.0 No No No \n",
"2 Yes Fair 8.0 Yes No No \n",
"3 No Good 6.0 No No Yes \n",
"4 Yes Very good 8.0 No No No "
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"df = pd.read_csv(\"./static/csv/heart_2020_cleaned.csv\")\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Business goal. Clustering task\n",
"\n",
"### Business goal description\n",
"\n",
"**Goal**: group patients to uncover hidden patterns in their health status. Patients are clustered by risk factors (body mass index, physical activity, age, sleep quality, and so on) to better understand which population groups need more attention in heart disease prevention."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reducing the dataset size\n",
"\n",
"The original dataset is large (more than 300,000 observations), so to keep visualization readable and processing fast we reduce it to 10,000 observations:"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Downsample while preserving the class proportions of the target variable\n",
"df, _ = train_test_split(df, train_size=10000, stratify=df['HeartDisease'], random_state=42)"
]
},
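{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of why `stratify` matters, using a hypothetical toy frame (`toy`, with an assumed 9:1 class imbalance) rather than the real dataset. The class shares in the stratified sample should match those in the full frame."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Hypothetical toy data standing in for the real dataset: 900 'No' and 100 'Yes' rows.\n",
"toy = pd.DataFrame({'HeartDisease': ['No'] * 900 + ['Yes'] * 100, 'BMI': range(1000)})\n",
"\n",
"# Draw a 100-row stratified sample with the same kind of call as above.\n",
"sample, _ = train_test_split(toy, train_size=100, stratify=toy['HeartDisease'], random_state=42)\n",
"\n",
"# Compare class shares in the full frame and in the sample.\n",
"print(toy['HeartDisease'].value_counts(normalize=True))\n",
"print(sample['HeartDisease'].value_counts(normalize=True))"
]
},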
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Removing the target feature before clustering"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"df = df.drop(columns=['HeartDisease'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data preprocessing\n",
"\n",
"We build a **preprocessing pipeline** that includes *standardization*, *one-hot encoding*, and *missing-value imputation*, and apply it to the dataset:"
]
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 52,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>BMI</th>\n",
|
|||
|
" <th>PhysicalHealth</th>\n",
|
|||
|
" <th>MentalHealth</th>\n",
|
|||
|
" <th>SleepTime</th>\n",
|
|||
|
" <th>Smoking_Yes</th>\n",
|
|||
|
" <th>AlcoholDrinking_Yes</th>\n",
|
|||
|
" <th>Stroke_Yes</th>\n",
|
|||
|
" <th>DiffWalking_Yes</th>\n",
|
|||
|
" <th>Sex_Male</th>\n",
|
|||
|
" <th>AgeCategory_25-29</th>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <th>Diabetic_Yes</th>\n",
|
|||
|
" <th>Diabetic_Yes (during pregnancy)</th>\n",
|
|||
|
" <th>PhysicalActivity_Yes</th>\n",
|
|||
|
" <th>GenHealth_Fair</th>\n",
|
|||
|
" <th>GenHealth_Good</th>\n",
|
|||
|
" <th>GenHealth_Poor</th>\n",
|
|||
|
" <th>GenHealth_Very good</th>\n",
|
|||
|
" <th>Asthma_Yes</th>\n",
|
|||
|
" <th>KidneyDisease_Yes</th>\n",
|
|||
|
" <th>SkinCancer_Yes</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>-0.392085</td>\n",
|
|||
|
" <td>1.326284</td>\n",
|
|||
|
" <td>-0.492347</td>\n",
|
|||
|
" <td>1.348988</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>0.253157</td>\n",
|
|||
|
" <td>-0.425660</td>\n",
|
|||
|
" <td>-0.366659</td>\n",
|
|||
|
" <td>0.646792</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>-0.961511</td>\n",
|
|||
|
" <td>-0.425660</td>\n",
|
|||
|
" <td>-0.492347</td>\n",
|
|||
|
" <td>-0.055403</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>0.749993</td>\n",
|
|||
|
" <td>-0.175382</td>\n",
|
|||
|
" <td>-0.492347</td>\n",
|
|||
|
" <td>-0.757599</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>-1.435763</td>\n",
|
|||
|
" <td>-0.425660</td>\n",
|
|||
|
" <td>-0.492347</td>\n",
|
|||
|
" <td>0.646792</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>9995</th>\n",
|
|||
|
" <td>-0.959898</td>\n",
|
|||
|
" <td>-0.425660</td>\n",
|
|||
|
" <td>-0.492347</td>\n",
|
|||
|
" <td>-0.055403</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>9996</th>\n",
|
|||
|
" <td>0.348330</td>\n",
|
|||
|
" <td>-0.425660</td>\n",
|
|||
|
" <td>-0.492347</td>\n",
|
|||
|
" <td>1.348988</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>9997</th>\n",
|
|||
|
" <td>-0.517907</td>\n",
|
|||
|
" <td>-0.425660</td>\n",
|
|||
|
" <td>-0.492347</td>\n",
|
|||
|
" <td>-0.055403</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>9998</th>\n",
|
|||
|
" <td>0.646754</td>\n",
|
|||
|
" <td>0.200034</td>\n",
|
|||
|
" <td>0.387473</td>\n",
|
|||
|
" <td>0.646792</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>9999</th>\n",
|
|||
|
" <td>0.441890</td>\n",
|
|||
|
" <td>-0.425660</td>\n",
|
|||
|
" <td>-0.492347</td>\n",
|
|||
|
" <td>-0.055403</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>1.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>10000 rows × 37 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
"text/plain": [
" BMI PhysicalHealth MentalHealth SleepTime Smoking_Yes \\\n",
"0 -0.392085 1.326284 -0.492347 1.348988 1.0 \n",
"1 0.253157 -0.425660 -0.366659 0.646792 1.0 \n",
"2 -0.961511 -0.425660 -0.492347 -0.055403 1.0 \n",
"3 0.749993 -0.175382 -0.492347 -0.757599 1.0 \n",
"4 -1.435763 -0.425660 -0.492347 0.646792 1.0 \n",
"... ... ... ... ... ... \n",
"9995 -0.959898 -0.425660 -0.492347 -0.055403 0.0 \n",
"9996 0.348330 -0.425660 -0.492347 1.348988 0.0 \n",
"9997 -0.517907 -0.425660 -0.492347 -0.055403 1.0 \n",
"9998 0.646754 0.200034 0.387473 0.646792 0.0 \n",
"9999 0.441890 -0.425660 -0.492347 -0.055403 1.0 \n",
"\n",
" AlcoholDrinking_Yes Stroke_Yes DiffWalking_Yes Sex_Male \\\n",
"0 0.0 0.0 0.0 1.0 \n",
"1 0.0 0.0 0.0 1.0 \n",
"2 0.0 0.0 0.0 0.0 \n",
"3 0.0 0.0 0.0 0.0 \n",
"4 0.0 0.0 0.0 0.0 \n",
"... ... ... ... ... \n",
"9995 0.0 0.0 0.0 1.0 \n",
"9996 0.0 0.0 0.0 1.0 \n",
"9997 0.0 0.0 0.0 0.0 \n",
"9998 0.0 0.0 1.0 0.0 \n",
"9999 0.0 0.0 0.0 0.0 \n",
"\n",
" AgeCategory_25-29 ... Diabetic_Yes Diabetic_Yes (during pregnancy) \\\n",
"0 0.0 ... 1.0 0.0 \n",
"1 0.0 ... 0.0 0.0 \n",
"2 0.0 ... 0.0 0.0 \n",
"3 0.0 ... 0.0 0.0 \n",
"4 0.0 ... 0.0 0.0 \n",
"... ... ... ... ... \n",
"9995 0.0 ... 0.0 0.0 \n",
"9996 0.0 ... 0.0 0.0 \n",
"9997 0.0 ... 0.0 0.0 \n",
"9998 0.0 ... 0.0 0.0 \n",
"9999 0.0 ... 0.0 0.0 \n",
"\n",
" PhysicalActivity_Yes GenHealth_Fair GenHealth_Good GenHealth_Poor \\\n",
"0 1.0 0.0 1.0 0.0 \n",
"1 1.0 0.0 0.0 0.0 \n",
"2 1.0 0.0 0.0 0.0 \n",
"3 1.0 0.0 1.0 0.0 \n",
"4 1.0 0.0 1.0 0.0 \n",
"... ... ... ... ... \n",
"9995 1.0 0.0 0.0 0.0 \n",
"9996 1.0 1.0 0.0 0.0 \n",
"9997 1.0 0.0 0.0 0.0 \n",
"9998 1.0 0.0 1.0 0.0 \n",
"9999 1.0 0.0 0.0 0.0 \n",
"\n",
" GenHealth_Very good Asthma_Yes KidneyDisease_Yes SkinCancer_Yes \n",
"0 0.0 0.0 1.0 1.0 \n",
"1 1.0 0.0 0.0 0.0 \n",
"2 0.0 0.0 0.0 0.0 \n",
"3 0.0 0.0 0.0 0.0 \n",
"4 0.0 0.0 0.0 0.0 \n",
"... ... ... ... ... \n",
"9995 0.0 0.0 0.0 0.0 \n",
"9996 0.0 0.0 0.0 0.0 \n",
"9997 0.0 0.0 0.0 1.0 \n",
"9998 0.0 0.0 0.0 1.0 \n",
"9999 1.0 0.0 0.0 0.0 \n",
"\n",
"[10000 rows x 37 columns]"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.impute import SimpleImputer\n",
"from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.pipeline import Pipeline\n",
"\n",
"# Split features into numeric and categorical\n",
"num_columns = [\n",
"    column\n",
"    for column in df.columns\n",
"    if df[column].dtype != \"object\"\n",
"]\n",
"cat_columns = [\n",
"    column\n",
"    for column in df.columns\n",
"    if df[column].dtype == \"object\"\n",
"]\n",
"\n",
"# Numeric features: median imputation and standardization\n",
"num_imputer = SimpleImputer(strategy=\"median\")\n",
"num_scaler = StandardScaler()\n",
"preprocessing_num = Pipeline(\n",
"    [\n",
"        (\"imputer\", num_imputer),\n",
"        (\"scaler\", num_scaler),\n",
"    ]\n",
")\n",
"\n",
"# Categorical features: fill missing values with \"unknown\" and one-hot encode\n",
"cat_imputer = SimpleImputer(strategy=\"constant\", fill_value=\"unknown\")\n",
"cat_encoder = OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False, drop=\"first\")\n",
"preprocessing_cat = Pipeline(\n",
"    [\n",
"        (\"imputer\", cat_imputer),\n",
"        (\"encoder\", cat_encoder),\n",
"    ]\n",
")\n",
"\n",
"# Combined feature preprocessing\n",
"features_preprocessing = ColumnTransformer(\n",
"    verbose_feature_names_out=False,\n",
"    transformers=[\n",
"        (\"preprocessing_num\", preprocessing_num, num_columns),\n",
"        (\"preprocessing_cat\", preprocessing_cat, cat_columns),\n",
"    ],\n",
"    remainder=\"passthrough\"\n",
")\n",
"\n",
"# Final pipeline\n",
"pipeline_end = Pipeline(\n",
"    [\n",
"        (\"features_preprocessing\", features_preprocessing),\n",
"    ]\n",
")\n",
"\n",
"preprocessing_result = pipeline_end.fit_transform(df)\n",
"preprocessed_df = pd.DataFrame(\n",
"    preprocessing_result,\n",
"    columns=pipeline_end.get_feature_names_out(),\n",
")\n",
"\n",
"preprocessed_df"
]
},
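{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate what the `drop` option of the encoder does, here is a small sketch on a hypothetical single-column frame (`toy`). It assumes scikit-learn 1.2 or newer, where the `sparse_output` parameter is available. Categories are sorted alphabetically, so the first one (`Excellent` here) is dropped and becomes the implicit baseline, which is why the output table above has no `GenHealth_Excellent` column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"\n",
"# Hypothetical toy column with three categories.\n",
"toy = pd.DataFrame({'GenHealth': ['Good', 'Fair', 'Excellent', 'Good']})\n",
"\n",
"# Same encoder options as in the pipeline, minus handle_unknown.\n",
"encoder = OneHotEncoder(sparse_output=False, drop='first')\n",
"encoded = encoder.fit_transform(toy)\n",
"\n",
"# Two columns remain; a row of zeros stands for the dropped category.\n",
"print(encoder.get_feature_names_out())\n",
"print(encoded)"
]
},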
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dimensionality reduction and data visualization\n",
"\n",
"We use the **PCA** (*principal component analysis*) dimensionality reduction algorithm to reduce the number of features to 2 and plot the data:"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.decomposition import PCA\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"pca = PCA(n_components=2)\n",
"data_pca = pca.fit_transform(preprocessed_df)\n",
"\n",
"# Convert the result to a DataFrame for plotting with scatterplot\n",
"df_pca = pd.DataFrame(data_pca, columns=['Principal Component 1', 'Principal Component 2'])\n",
"\n",
"# Plot the data after PCA\n",
"plt.figure(figsize=(10, 6))\n",
"sns.scatterplot(\n",
"    x='Principal Component 1',\n",
"    y='Principal Component 2',\n",
"    data=df_pca,\n",
"    alpha=0.6\n",
")\n",
"plt.title('Data after PCA', fontsize=14)\n",
"plt.xlabel('Principal component 1')\n",
"plt.ylabel('Principal component 2')\n",
"plt.grid(True)\n",
"plt.show()"
]
},
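{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two components rarely capture all of the variance, so it can help to check how much is retained. Below is a hedged sketch on synthetic data (the array `X` is an assumption, not the notebook's data). The same `explained_variance_ratio_` attribute is available on the fitted `pca` object above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.decomposition import PCA\n",
"\n",
"# Synthetic data: 5 features, the second one strongly correlated with the first.\n",
"rng = np.random.default_rng(42)\n",
"X = rng.normal(size=(200, 5))\n",
"X[:, 1] = X[:, 0] * 2 + rng.normal(scale=0.1, size=200)\n",
"\n",
"pca_demo = PCA(n_components=2).fit(X)\n",
"\n",
"# Share of total variance explained by each component, and their sum.\n",
"print(pca_demo.explained_variance_ratio_)\n",
"print(pca_demo.explained_variance_ratio_.sum())"
]
},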
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Choosing the number of clusters\n",
"\n",
"Two criteria are used to determine the optimal number of clusters:\n",
"1. **Inertia** (*elbow method*): shows how within-cluster variation decreases as the number of clusters grows.\n",
"2. **Silhouette coefficient**: measures how well each object fits its own cluster compared with the nearest neighboring cluster."
]
},
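{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of both criteria on synthetic data. It assumes `KMeans` as the clustering algorithm and uses `make_blobs` toy data with three centers; neither is taken from the notebook's dataset. Inertia always falls as `k` grows, so the point of interest is where the drop flattens, while a higher silhouette score indicates better-separated clusters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import KMeans\n",
"from sklearn.datasets import make_blobs\n",
"from sklearn.metrics import silhouette_score\n",
"\n",
"# Toy data with three well-separated groups (an assumption for illustration).\n",
"X, _ = make_blobs(n_samples=300, centers=3, random_state=42)\n",
"\n",
"# Fit KMeans for several k and report both criteria.\n",
"for k in range(2, 7):\n",
"    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)\n",
"    score = silhouette_score(X, model.labels_)\n",
"    print(f'k={k}: inertia={model.inertia_:.1f}, silhouette={score:.3f}')"
]
},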
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1200x500 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.cluster import KMeans\n",
"from sklearn.metrics import silhouette_score\n",
"\n",
"inertia = []\n",
"silhouette_scores = []\n",
"k_range = range(2, 11)\n",
"\n",
"# Fit KMeans for every k in the range and record inertia and silhouette score\n",
"for k in k_range:\n",
"    kmeans = KMeans(n_clusters=k, random_state=42)\n",
"    kmeans.fit(data_pca)\n",
"    inertia.append(kmeans.inertia_)\n",
"    silhouette_scores.append(silhouette_score(data_pca, kmeans.labels_))\n",
"\n",
"# Plots for choosing the optimal number of clusters\n",
"plt.figure(figsize=(12, 5))\n",
"\n",
"plt.subplot(1, 2, 1)\n",
"plt.plot(k_range, inertia, marker='o')\n",
"plt.title('Elbow Method')\n",
"plt.xlabel('Number of Clusters')\n",
"plt.ylabel('Inertia')\n",
"\n",
"plt.subplot(1, 2, 2)\n",
"plt.plot(k_range, silhouette_scores, marker='o')\n",
"plt.title('Silhouette Scores')\n",
"plt.xlabel('Number of Clusters')\n",
"plt.ylabel('Silhouette Score')\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Judging by the inertia plot, k=8 can be taken as the optimal number of clusters: inertia falls sharply up to this point, and beyond it the rate of decrease slows noticeably."
]
},
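{
"cell_type": "markdown",
"metadata": {},
"source": [
"The visual reading of the elbow plot can be cross-checked numerically. The sketch below is an illustration rather than part of the original analysis: `find_elbow` is a hypothetical helper that assumes the elbow is the point farthest from the straight line joining the first and last points of the normalized inertia curve. It reuses `k_range`, `inertia` and `silhouette_scores` from the cell above, and its estimate may differ from the value read off the plot."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def find_elbow(ks, values):\n",
"    # Hypothetical helper: scale both axes to [0, 1] so that they are comparable\n",
"    ks = np.asarray(ks, dtype=float)\n",
"    values = np.asarray(values, dtype=float)\n",
"    x = (ks - ks[0]) / (ks[-1] - ks[0])\n",
"    y = (values - values.min()) / (values.max() - values.min())\n",
"    # Perpendicular distance of every point from the chord between the end points\n",
"    x0, y0, x1, y1 = x[0], y[0], x[-1], y[-1]\n",
"    dist = np.abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)\n",
"    dist /= np.hypot(x1 - x0, y1 - y0)\n",
"    return int(ks[np.argmax(dist)])\n",
"\n",
"# Elbow estimate from the inertia curve\n",
"print('Elbow estimate:', find_elbow(list(k_range), inertia))\n",
"\n",
"# For comparison: the k with the highest silhouette score\n",
"best_k = k_range[int(np.argmax(silhouette_scores))]\n",
"print('Best k by silhouette score:', best_k)"
]
},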
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cluster analysis\n",
"\n",
"Two approaches are used:\n",
"1. **Hierarchical clustering**: agglomerative clustering.\n",
"2. **Non-hierarchical clustering**: the k-means algorithm."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Non-hierarchical clustering:**"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 800x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Non-hierarchical clustering: KMeans with 7 clusters\n",
"kmeans = KMeans(n_clusters=7, random_state=42)\n",
"kmeans_labels = kmeans.fit_predict(data_pca)\n",
"\n",
"# Visualize the KMeans clusters\n",
"plt.figure(figsize=(8, 6))\n",
"plt.scatter(data_pca[:, 0], data_pca[:, 1], c=kmeans_labels, cmap='viridis', alpha=0.6)\n",
"plt.title('KMeans Clustering')\n",
"plt.xlabel('PCA 1')\n",
"plt.ylabel('PCA 2')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Hierarchical clustering:**"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1000x700 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.cluster import AgglomerativeClustering\n",
"from scipy.cluster.hierarchy import dendrogram, linkage\n",
"\n",
"# Hierarchical (agglomerative) clustering with 7 clusters\n",
"hierarchical = AgglomerativeClustering(n_clusters=7)\n",
"hierarchical_labels = hierarchical.fit_predict(data_pca)\n",
"\n",
"# Visualization with a dendrogram (Ward linkage, truncated to 5 levels)\n",
"plt.figure(figsize=(10, 7))\n",
"linkage_matrix = linkage(data_pca, method='ward')\n",
"dendrogram(linkage_matrix, truncate_mode='level', p=5)\n",
"plt.title('Hierarchical Clustering Dendrogram')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.scatter(data_pca[:, 0], data_pca[:, 1], c=hierarchical_labels, cmap='viridis', alpha=0.6)\n",
"plt.title('Hierarchical Clustering')\n",
"plt.xlabel('PCA 1')\n",
"plt.ylabel('PCA 2')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluating clustering quality\n",
"\n",
"The metric used is the **silhouette coefficient**:"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Silhouette Score for KMeans: 0.39\n",
"Silhouette Score for Hierarchical Clustering: 0.34\n"
]
}
],
"source": [
"silhouette_kmeans = silhouette_score(data_pca, kmeans_labels)\n",
"silhouette_hierarchical = silhouette_score(data_pca, hierarchical_labels)\n",
"\n",
"print(f'Silhouette Score for KMeans: {silhouette_kmeans:.2f}')\n",
"print(f'Silhouette Score for Hierarchical Clustering: {silhouette_hierarchical:.2f}')"
]
},
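{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reminder of what the metric measures (a sketch under the usual definition, not part of the original analysis): for a sample $i$, let $a(i)$ be the mean distance to the other points of its own cluster and $b(i)$ the mean distance to the points of the nearest other cluster. Then $s(i) = \\frac{b(i) - a(i)}{\\max(a(i), b(i))}$, which ranges from -1 to 1, and the silhouette score is the mean of $s(i)$ over all samples. The cell below uses `silhouette_samples` from scikit-learn to break the KMeans score down by cluster; it assumes `data_pca` and `kmeans_labels` from the cells above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.metrics import silhouette_samples\n",
"\n",
"# Per-sample silhouette values for the KMeans labelling\n",
"sample_scores = silhouette_samples(data_pca, kmeans_labels)\n",
"\n",
"# Mean silhouette per cluster: low or negative values point to poorly separated clusters\n",
"for label in np.unique(kmeans_labels):\n",
"    cluster_scores = sample_scores[kmeans_labels == label]\n",
"    print(f'Cluster {label}: n={cluster_scores.size}, mean silhouette={cluster_scores.mean():.2f}')"
]
},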
{
"cell_type": "markdown",
"metadata": {},
"source": [
"K-Means achieved the higher silhouette score (0.39 versus 0.34 for hierarchical clustering) and can be recommended for further work with this data. The overall clustering quality is nonetheless only moderate, which suggests that the model or the feature set needs further tuning."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "aimenv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}