1318 lines
3.3 MiB
Plaintext
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Variant 2. Heart disease indicators"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataset contains data collected in the CDC's annual health survey of more than 400 thousand adults in the USA. It includes information on heart disease risk factors such as hypertension, high cholesterol, smoking, diabetes, obesity, physical inactivity, and alcohol abuse. It also covers respondents' health status, chronic conditions (for example diabetes, arthritis, asthma), physical activity level, mental health, and social and demographic characteristics such as sex, age, ethnicity, and place of residence. The dataset can be used to analyze and predict heart disease risk and to design prevention and public health programs."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Business goal\n",
"\n",
"Develop and deploy a machine learning model that clusters patients by their health data in order to identify groups at elevated risk of cardiovascular disease (CVD), such as heart attack and heart failure. This would make it possible to optimize preventive measures, improve resource management in medical institutions, and build effective long-term monitoring programs.\n",
"\n",
"### Key aspects of the business goal:\n",
"\n",
"1. **Early detection and prevention of cardiovascular disease:**\n",
"   - **Goal:** Use the machine learning model to identify patients at high risk of CVD at an early stage.\n",
"   - **Benefits:** Timely intervention and preventive measures such as better nutrition, more physical activity, and stress management.\n",
"   - **Outcome:** Fewer heart attacks and cases of heart failure thanks to preventive measures.\n",
"\n",
"2. **Optimizing the work of cardiology centers:**\n",
"   - **Goal:** Allocate resources (cardiologists, equipment, diagnostic tests) efficiently by clustering patients according to risk level.\n",
"   - **Benefits:** More rational use of resources, shorter waiting times for high-risk patients, and better quality of care.\n",
"   - **Outcome:** More efficient cardiology centers and better clinical outcomes.\n",
"\n",
"3. **Creating personalized monitoring programs:**\n",
"   - **Goal:** Develop long-term monitoring and early-intervention programs for patients at elevated risk of CVD.\n",
"   - **Benefits:** A personalized approach to monitoring that accounts for the characteristics and needs of each patient group.\n",
"   - **Outcome:** Regular monitoring of patients' health and timely treatment to prevent complications.\n",
"\n",
"4. **Improving clinical and economic indicators:**\n",
"   - **Goal:** Reduce the cost of treating cardiovascular disease through early detection and prevention.\n",
"   - **Benefits:** Fewer costly hospitalizations and surgeries, and a more efficient healthcare system overall.\n",
"   - **Outcome:** A significant reduction in CVD treatment costs and better financial performance for medical institutions.\n",
"\n",
"5. **Raising patient satisfaction:**\n",
"   - **Goal:** Provide patients with high-quality medical services and personalized care.\n",
"   - **Benefits:** Greater trust in medical institutions and better overall well-being and quality of life for patients.\n",
"   - **Outcome:** High patient satisfaction and active participation in prevention and monitoring programs.\n",
"\n",
"Thus, a machine learning model for clustering patients with cardiovascular problems would not only improve clinical and economic indicators but also raise patient satisfaction by providing high-quality, timely medical care."
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [],
"source": [
"from typing import Any, List\n",
"import math\n",
"import pandas as pd\n",
"from pandas import DataFrame, Series\n",
"from pprint import pprint\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"RANDOM_STATE = 34"
]
},
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 79,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"<class 'pandas.core.frame.DataFrame'>\n",
|
|||
|
"RangeIndex: 246022 entries, 0 to 246021\n",
|
|||
|
"Data columns (total 40 columns):\n",
|
|||
|
" # Column Non-Null Count Dtype \n",
|
|||
|
"--- ------ -------------- ----- \n",
|
|||
|
" 0 State 246022 non-null object \n",
|
|||
|
" 1 Sex 246022 non-null object \n",
|
|||
|
" 2 GeneralHealth 246022 non-null object \n",
|
|||
|
" 3 PhysicalHealthDays 246022 non-null float64\n",
|
|||
|
" 4 MentalHealthDays 246022 non-null float64\n",
|
|||
|
" 5 LastCheckupTime 246022 non-null object \n",
|
|||
|
" 6 PhysicalActivities 246022 non-null object \n",
|
|||
|
" 7 SleepHours 246022 non-null float64\n",
|
|||
|
" 8 RemovedTeeth 246022 non-null object \n",
|
|||
|
" 9 HadHeartAttack 246022 non-null object \n",
|
|||
|
" 10 HadAngina 246022 non-null object \n",
|
|||
|
" 11 HadStroke 246022 non-null object \n",
|
|||
|
" 12 HadAsthma 246022 non-null object \n",
|
|||
|
" 13 HadSkinCancer 246022 non-null object \n",
|
|||
|
" 14 HadCOPD 246022 non-null object \n",
|
|||
|
" 15 HadDepressiveDisorder 246022 non-null object \n",
|
|||
|
" 16 HadKidneyDisease 246022 non-null object \n",
|
|||
|
" 17 HadArthritis 246022 non-null object \n",
|
|||
|
" 18 HadDiabetes 246022 non-null object \n",
|
|||
|
" 19 DeafOrHardOfHearing 246022 non-null object \n",
|
|||
|
" 20 BlindOrVisionDifficulty 246022 non-null object \n",
|
|||
|
" 21 DifficultyConcentrating 246022 non-null object \n",
|
|||
|
" 22 DifficultyWalking 246022 non-null object \n",
|
|||
|
" 23 DifficultyDressingBathing 246022 non-null object \n",
|
|||
|
" 24 DifficultyErrands 246022 non-null object \n",
|
|||
|
" 25 SmokerStatus 246022 non-null object \n",
|
|||
|
" 26 ECigaretteUsage 246022 non-null object \n",
|
|||
|
" 27 ChestScan 246022 non-null object \n",
|
|||
|
" 28 RaceEthnicityCategory 246022 non-null object \n",
|
|||
|
" 29 AgeCategory 246022 non-null object \n",
|
|||
|
" 30 HeightInMeters 246022 non-null float64\n",
|
|||
|
" 31 WeightInKilograms 246022 non-null float64\n",
|
|||
|
" 32 BMI 246022 non-null float64\n",
|
|||
|
" 33 AlcoholDrinkers 246022 non-null object \n",
|
|||
|
" 34 HIVTesting 246022 non-null object \n",
|
|||
|
" 35 FluVaxLast12 246022 non-null object \n",
|
|||
|
" 36 PneumoVaxEver 246022 non-null object \n",
|
|||
|
" 37 TetanusLast10Tdap 246022 non-null object \n",
|
|||
|
" 38 HighRiskLastYear 246022 non-null object \n",
|
|||
|
" 39 CovidPos 246022 non-null object \n",
|
|||
|
"dtypes: float64(6), object(34)\n",
|
|||
|
"memory usage: 75.1+ MB\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>count</th>\n",
|
|||
|
" <th>mean</th>\n",
|
|||
|
" <th>std</th>\n",
|
|||
|
" <th>min</th>\n",
|
|||
|
" <th>25%</th>\n",
|
|||
|
" <th>50%</th>\n",
|
|||
|
" <th>75%</th>\n",
|
|||
|
" <th>max</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>PhysicalHealthDays</th>\n",
|
|||
|
" <td>246022.0</td>\n",
|
|||
|
" <td>4.119026</td>\n",
|
|||
|
" <td>8.405844</td>\n",
|
|||
|
" <td>0.00</td>\n",
|
|||
|
" <td>0.00</td>\n",
|
|||
|
" <td>0.00</td>\n",
|
|||
|
" <td>3.00</td>\n",
|
|||
|
" <td>30.00</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>MentalHealthDays</th>\n",
|
|||
|
" <td>246022.0</td>\n",
|
|||
|
" <td>4.167140</td>\n",
|
|||
|
" <td>8.102687</td>\n",
|
|||
|
" <td>0.00</td>\n",
|
|||
|
" <td>0.00</td>\n",
|
|||
|
" <td>0.00</td>\n",
|
|||
|
" <td>4.00</td>\n",
|
|||
|
" <td>30.00</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>SleepHours</th>\n",
|
|||
|
" <td>246022.0</td>\n",
|
|||
|
" <td>7.021331</td>\n",
|
|||
|
" <td>1.440681</td>\n",
|
|||
|
" <td>1.00</td>\n",
|
|||
|
" <td>6.00</td>\n",
|
|||
|
" <td>7.00</td>\n",
|
|||
|
" <td>8.00</td>\n",
|
|||
|
" <td>24.00</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>HeightInMeters</th>\n",
|
|||
|
" <td>246022.0</td>\n",
|
|||
|
" <td>1.705150</td>\n",
|
|||
|
" <td>0.106654</td>\n",
|
|||
|
" <td>0.91</td>\n",
|
|||
|
" <td>1.63</td>\n",
|
|||
|
" <td>1.70</td>\n",
|
|||
|
" <td>1.78</td>\n",
|
|||
|
" <td>2.41</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>WeightInKilograms</th>\n",
|
|||
|
" <td>246022.0</td>\n",
|
|||
|
" <td>83.615179</td>\n",
|
|||
|
" <td>21.323156</td>\n",
|
|||
|
" <td>28.12</td>\n",
|
|||
|
" <td>68.04</td>\n",
|
|||
|
" <td>81.65</td>\n",
|
|||
|
" <td>95.25</td>\n",
|
|||
|
" <td>292.57</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>BMI</th>\n",
|
|||
|
" <td>246022.0</td>\n",
|
|||
|
" <td>28.668136</td>\n",
|
|||
|
" <td>6.513973</td>\n",
|
|||
|
" <td>12.02</td>\n",
|
|||
|
" <td>24.27</td>\n",
|
|||
|
" <td>27.46</td>\n",
|
|||
|
" <td>31.89</td>\n",
|
|||
|
" <td>97.65</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" count mean std min 25% 50% \\\n",
|
|||
|
"PhysicalHealthDays 246022.0 4.119026 8.405844 0.00 0.00 0.00 \n",
|
|||
|
"MentalHealthDays 246022.0 4.167140 8.102687 0.00 0.00 0.00 \n",
|
|||
|
"SleepHours 246022.0 7.021331 1.440681 1.00 6.00 7.00 \n",
|
|||
|
"HeightInMeters 246022.0 1.705150 0.106654 0.91 1.63 1.70 \n",
|
|||
|
"WeightInKilograms 246022.0 83.615179 21.323156 28.12 68.04 81.65 \n",
|
|||
|
"BMI 246022.0 28.668136 6.513973 12.02 24.27 27.46 \n",
|
|||
|
"\n",
|
|||
|
" 75% max \n",
|
|||
|
"PhysicalHealthDays 3.00 30.00 \n",
|
|||
|
"MentalHealthDays 4.00 30.00 \n",
|
|||
|
"SleepHours 8.00 24.00 \n",
|
|||
|
"HeightInMeters 1.78 2.41 \n",
|
|||
|
"WeightInKilograms 95.25 292.57 \n",
|
|||
|
"BMI 31.89 97.65 "
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 79,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
"source": [
"df = pd.read_csv('csv/heart_2022_no_nans.csv')\n",
"\n",
"df.info()\n",
"df.describe().transpose()"
]
|
|||
|
},
|
|||
|
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [],
"source": [
"def get_null_columns_info(df: DataFrame) -> DataFrame:\n",
"    \"\"\"\n",
"    Return missing-value information for each column of the dataset:\n",
"    the column name, whether it has any nulls, and the share of nulls\n",
"    (a fraction of all rows, stored under \"Null Percent\").\n",
"    \"\"\"\n",
"    w = []\n",
"    df_len = len(df)\n",
"\n",
"    for column in df.columns:\n",
"        column_nulls = df[column].isnull()\n",
"        w.append([column, column_nulls.any(), column_nulls.sum() / df_len])\n",
"\n",
"    null_df = DataFrame(w).rename(columns={0: \"Column\", 1: \"Has Null\", 2: \"Null Percent\"})\n",
"\n",
"    return null_df"
]
},
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 81,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>Column</th>\n",
|
|||
|
" <th>Has Null</th>\n",
|
|||
|
" <th>Null Percent</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>State</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>Sex</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>GeneralHealth</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>PhysicalHealthDays</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>MentalHealthDays</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>5</th>\n",
|
|||
|
" <td>LastCheckupTime</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>6</th>\n",
|
|||
|
" <td>PhysicalActivities</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>7</th>\n",
|
|||
|
" <td>SleepHours</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>8</th>\n",
|
|||
|
" <td>RemovedTeeth</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>9</th>\n",
|
|||
|
" <td>HadHeartAttack</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>10</th>\n",
|
|||
|
" <td>HadAngina</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>11</th>\n",
|
|||
|
" <td>HadStroke</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>12</th>\n",
|
|||
|
" <td>HadAsthma</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>13</th>\n",
|
|||
|
" <td>HadSkinCancer</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>14</th>\n",
|
|||
|
" <td>HadCOPD</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>15</th>\n",
|
|||
|
" <td>HadDepressiveDisorder</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>16</th>\n",
|
|||
|
" <td>HadKidneyDisease</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>17</th>\n",
|
|||
|
" <td>HadArthritis</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>18</th>\n",
|
|||
|
" <td>HadDiabetes</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>19</th>\n",
|
|||
|
" <td>DeafOrHardOfHearing</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>20</th>\n",
|
|||
|
" <td>BlindOrVisionDifficulty</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>21</th>\n",
|
|||
|
" <td>DifficultyConcentrating</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>22</th>\n",
|
|||
|
" <td>DifficultyWalking</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>23</th>\n",
|
|||
|
" <td>DifficultyDressingBathing</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>24</th>\n",
|
|||
|
" <td>DifficultyErrands</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>25</th>\n",
|
|||
|
" <td>SmokerStatus</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>26</th>\n",
|
|||
|
" <td>ECigaretteUsage</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>27</th>\n",
|
|||
|
" <td>ChestScan</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>28</th>\n",
|
|||
|
" <td>RaceEthnicityCategory</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>29</th>\n",
|
|||
|
" <td>AgeCategory</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>30</th>\n",
|
|||
|
" <td>HeightInMeters</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>31</th>\n",
|
|||
|
" <td>WeightInKilograms</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>32</th>\n",
|
|||
|
" <td>BMI</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>33</th>\n",
|
|||
|
" <td>AlcoholDrinkers</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>34</th>\n",
|
|||
|
" <td>HIVTesting</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>35</th>\n",
|
|||
|
" <td>FluVaxLast12</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>36</th>\n",
|
|||
|
" <td>PneumoVaxEver</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>37</th>\n",
|
|||
|
" <td>TetanusLast10Tdap</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>38</th>\n",
|
|||
|
" <td>HighRiskLastYear</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>39</th>\n",
|
|||
|
" <td>CovidPos</td>\n",
|
|||
|
" <td>False</td>\n",
|
|||
|
" <td>0.0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" Column Has Null Null Percent\n",
|
|||
|
"0 State False 0.0\n",
|
|||
|
"1 Sex False 0.0\n",
|
|||
|
"2 GeneralHealth False 0.0\n",
|
|||
|
"3 PhysicalHealthDays False 0.0\n",
|
|||
|
"4 MentalHealthDays False 0.0\n",
|
|||
|
"5 LastCheckupTime False 0.0\n",
|
|||
|
"6 PhysicalActivities False 0.0\n",
|
|||
|
"7 SleepHours False 0.0\n",
|
|||
|
"8 RemovedTeeth False 0.0\n",
|
|||
|
"9 HadHeartAttack False 0.0\n",
|
|||
|
"10 HadAngina False 0.0\n",
|
|||
|
"11 HadStroke False 0.0\n",
|
|||
|
"12 HadAsthma False 0.0\n",
|
|||
|
"13 HadSkinCancer False 0.0\n",
|
|||
|
"14 HadCOPD False 0.0\n",
|
|||
|
"15 HadDepressiveDisorder False 0.0\n",
|
|||
|
"16 HadKidneyDisease False 0.0\n",
|
|||
|
"17 HadArthritis False 0.0\n",
|
|||
|
"18 HadDiabetes False 0.0\n",
|
|||
|
"19 DeafOrHardOfHearing False 0.0\n",
|
|||
|
"20 BlindOrVisionDifficulty False 0.0\n",
|
|||
|
"21 DifficultyConcentrating False 0.0\n",
|
|||
|
"22 DifficultyWalking False 0.0\n",
|
|||
|
"23 DifficultyDressingBathing False 0.0\n",
|
|||
|
"24 DifficultyErrands False 0.0\n",
|
|||
|
"25 SmokerStatus False 0.0\n",
|
|||
|
"26 ECigaretteUsage False 0.0\n",
|
|||
|
"27 ChestScan False 0.0\n",
|
|||
|
"28 RaceEthnicityCategory False 0.0\n",
|
|||
|
"29 AgeCategory False 0.0\n",
|
|||
|
"30 HeightInMeters False 0.0\n",
|
|||
|
"31 WeightInKilograms False 0.0\n",
|
|||
|
"32 BMI False 0.0\n",
|
|||
|
"33 AlcoholDrinkers False 0.0\n",
|
|||
|
"34 HIVTesting False 0.0\n",
|
|||
|
"35 FluVaxLast12 False 0.0\n",
|
|||
|
"36 PneumoVaxEver False 0.0\n",
|
|||
|
"37 TetanusLast10Tdap False 0.0\n",
|
|||
|
"38 HighRiskLastYear False 0.0\n",
|
|||
|
"39 CovidPos False 0.0"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 81,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"get_null_columns_info(df)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [],
"source": [
"def get_filtered_columns(df: DataFrame, no_numeric=False, no_text=False) -> list[str]:\n",
"    \"\"\"\n",
"    Return the column names that pass the filter: no_numeric skips\n",
"    numeric columns, no_text skips non-numeric ones.\n",
"    \"\"\"\n",
"    w = []\n",
"    for column in df.columns:\n",
"        if no_numeric and pd.api.types.is_numeric_dtype(df[column]):\n",
"            continue\n",
"        if no_text and not pd.api.types.is_numeric_dtype(df[column]):\n",
"            continue\n",
"        w.append(column)\n",
"    return w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualizing relationships"
]
},
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 83,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"['PhysicalHealthDays',\n",
|
|||
|
" 'MentalHealthDays',\n",
|
|||
|
" 'SleepHours',\n",
|
|||
|
" 'HeightInMeters',\n",
|
|||
|
" 'WeightInKilograms',\n",
|
|||
|
" 'BMI']"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 83,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"num_columns = get_filtered_columns(df, no_text=True)\n",
|
|||
|
"\n",
|
|||
|
"num_columns"
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"BMI is computed from HeightInMeters and WeightInKilograms (weight divided by height squared), so keeping those two features adds no new information. We drop them and keep only BMI."
]
},
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 84,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
"Columns for visualization:\n",
"['PhysicalHealthDays', 'MentalHealthDays', 'SleepHours', 'BMI']\n"
]
|
|||
|
}
|
|||
|
],
"source": [
"columns_to_drop = [\n",
"    'HeightInMeters',\n",
"    'WeightInKilograms'\n",
"]\n",
"\n",
"for col in columns_to_drop:\n",
"    if col in num_columns:\n",
"        num_columns.remove(col)\n",
"\n",
"print('Columns for visualization:')\n",
"print(num_columns)"
]
|
|||
|
},
|
|||
|
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [],
"source": [
"def draw_data_2d(\n",
"    df: pd.DataFrame,\n",
"    col1: int,\n",
"    col2: int,\n",
"    y: List | None = None,\n",
"    classes: List | None = None,\n",
"    subplot: Any | None = None,\n",
"):\n",
"    ax = None\n",
"    if subplot is None:\n",
"        _, ax = plt.subplots()\n",
"    else:\n",
"        ax = subplot\n",
"    scatter = ax.scatter(df[df.columns[col1]], df[df.columns[col2]], c=y, cmap=\"viridis\", alpha=0.7)\n",
"    ax.set(xlabel=df.columns[col1], ylabel=df.columns[col2])\n",
"    if classes is not None:\n",
"        ax.legend(scatter.legend_elements()[0], classes, loc=\"lower right\", title=\"Classes\")"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [],
"source": [
"def show_scatters_by_pairs(\n",
"    df: DataFrame,\n",
"    columns: List[str],\n",
"    y: List | None = None,\n",
"    y_names: List[str] | None = None) -> None:\n",
"    pairs_count = math.comb(len(columns), 2)\n",
"    plot_columns_count = 2\n",
"    plot_rows_count = math.ceil(pairs_count / plot_columns_count)\n",
"\n",
"    plt.figure(figsize=(plot_columns_count * 8, plot_rows_count * 8))\n",
"\n",
"    count = 0\n",
"    for i in range(len(columns)):\n",
"        for j in range(i + 1, len(columns)):\n",
"            count += 1\n",
"            print(columns[i], 'vs', columns[j])\n",
"            # Look columns up by name so df may contain columns not listed in `columns`\n",
"            draw_data_2d(\n",
"                df,\n",
"                df.columns.get_loc(columns[i]),\n",
"                df.columns.get_loc(columns[j]),\n",
"                y,\n",
"                y_names,\n",
"                subplot=plt.subplot(plot_rows_count, plot_columns_count, count)\n",
"            )\n",
"\n",
"    plt.tight_layout()\n",
"    plt.show()"
]
},
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 87,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"PhysicalHealthDays vs MentalHealthDays\n",
|
|||
|
"PhysicalHealthDays vs SleepHours\n",
|
|||
|
"PhysicalHealthDays vs BMI\n",
|
|||
|
"MentalHealthDays vs SleepHours\n",
|
|||
|
"MentalHealthDays vs BMI\n",
|
|||
|
"SleepHours vs BMI\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"name": "stderr",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"C:\\Users\\ns.potapov\\AppData\\Local\\Temp\\ipykernel_52300\\1030510231.py:14: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored\n",
|
|||
|
" scatter = ax.scatter(df[df.columns[col1]], df[df.columns[col2]], c=y, cmap=\"viridis\", alpha=0.7)\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
"image/png": "[truncated base64 PNG data omitted: 3x2 grid of pairwise scatter plots of PhysicalHealthDays, MentalHealthDays, SleepHours, and BMI]",
"text/plain": [
|
|||
|
"<Figure size 1600x2400 with 6 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"df_reduced = df[num_columns]\n",
|
|||
|
"\n",
|
|||
|
"FRACTION = 0.1\n",
|
|||
|
"\n",
|
|||
|
"df_reduced_sampled = df_reduced.sample(frac=FRACTION, random_state=RANDOM_STATE)\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"show_scatters_by_pairs(df_reduced_sampled, num_columns)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Standardizing the data for clustering"
]
},
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 88,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"scaler = StandardScaler()\n",
|
|||
|
"data_reduced_scaled = scaler.fit_transform(df_reduced_sampled)\n",
|
|||
|
"\n",
|
|||
|
"df_scaled = pd.DataFrame(data_reduced_scaled, columns=df_reduced_sampled.columns)"
|
|||
|
]
|
|||
|
},
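The standardization step can be sanity-checked on a tiny frame. The sketch below is only an illustration under stated assumptions: the `toy` frame, its values, and its index are hypothetical stand-ins for `df_reduced_sampled`, and passing `index=` to `pd.DataFrame` is an optional variation that the notebook itself does not use. It shows that each scaled column ends up with zero mean and unit (population) standard deviation, and that the sampled row labels can be kept if that alignment is wanted.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the non-contiguous index mimics a frame produced by .sample().
toy = pd.DataFrame(
    {"SleepHours": [6.0, 8.0, 7.0, 5.0], "BMI": [22.0, 31.0, 27.0, 24.0]},
    index=[17, 3, 42, 8],
)

scaled = StandardScaler().fit_transform(toy)

# Passing index=toy.index keeps the row labels aligned with the unscaled frame.
toy_scaled = pd.DataFrame(scaled, columns=toy.columns, index=toy.index)

# StandardScaler uses the population standard deviation (ddof=0).
print(toy_scaled.mean())
print(toy_scaled.std(ddof=0))
```

Without `index=`, the scaled frame gets a fresh 0..n-1 index, so any row labels printed from it are positions in the sample rather than labels from the original dataset.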
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Иерархическая агломеративная кластеризация"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 89,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"import numpy as np\n",
|
|||
|
"from sklearn import cluster\n",
|
|||
|
"from scipy.cluster import hierarchy\n",
|
|||
|
"\n",
|
|||
|
"def run_agglomerative(\n",
|
|||
|
" df: pd.DataFrame,\n",
|
|||
|
" num_clusters: int = 2\n",
|
|||
|
") -> cluster.AgglomerativeClustering:\n",
|
|||
|
" agglomerative = cluster.AgglomerativeClustering(\n",
|
|||
|
" n_clusters=num_clusters,\n",
|
|||
|
" compute_distances=True,\n",
|
|||
|
" )\n",
|
|||
|
" return agglomerative.fit(df)\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"def get_linkage_matrix(\n",
|
|||
|
" model: cluster.AgglomerativeClustering\n",
|
|||
|
" ) -> np.ndarray:\n",
|
|||
|
" counts = np.zeros(model.children_.shape[0])\n",
|
|||
|
" n_samples = len(model.labels_)\n",
|
|||
|
" for i, merge in enumerate(model.children_):\n",
|
|||
|
" current_count = 0\n",
|
|||
|
" for child_idx in merge:\n",
|
|||
|
" if child_idx < n_samples:\n",
|
|||
|
" current_count += 1\n",
|
|||
|
" else:\n",
|
|||
|
" current_count += counts[child_idx - n_samples]\n",
|
|||
|
" counts[i] = current_count\n",
|
|||
|
"\n",
|
|||
|
" return np.column_stack([model.children_, model.distances_, counts]).astype(float)\n",
|
|||
|
"\n",
|
|||
|
"def draw_dendrogram(linkage_matrix: np.ndarray):\n",
|
|||
|
" hierarchy.dendrogram(linkage_matrix, truncate_mode=\"level\", p=3)\n",
|
|||
|
" plt.xticks(fontsize=10, rotation=45)\n",
|
|||
|
" plt.tight_layout()"
|
|||
|
]
|
|||
|
},
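To make the linkage-matrix construction easier to follow, here is a minimal sketch on six points. The `points` array is hypothetical toy data, and the counting loop is a compact restatement of the same idea as `get_linkage_matrix`, not a replacement for it. It builds the four-column matrix (two children, merge distance, sample count) and cuts it into two flat clusters with `fcluster`.

```python
import numpy as np
from scipy.cluster import hierarchy
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: two well-separated groups of three points each.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.3],
    [5.0, 5.0], [5.2, 5.0], [5.0, 5.4],
])

# compute_distances=True makes the fitted model expose `distances_`.
model = AgglomerativeClustering(n_clusters=2, compute_distances=True).fit(points)

# For every merge, count how many original samples end up under the new node.
n_samples = len(model.labels_)
counts = np.zeros(model.children_.shape[0])
for i, merge in enumerate(model.children_):
    counts[i] = sum(
        1 if child < n_samples else counts[child - n_samples] for child in merge
    )

# SciPy linkage format: [child_1, child_2, distance, sample_count] per merge.
linkage_matrix = np.column_stack(
    [model.children_, model.distances_, counts]
).astype(float)

# Cut the tree into at most 2 flat clusters; labels start at 1.
flat = hierarchy.fcluster(linkage_matrix, 2, criterion="maxclust")
print(linkage_matrix.shape)
print(flat)
```

The last row of the matrix describes the final merge, so its count equals the total number of samples.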
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 90,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAnMAAAHWCAYAAAAciQ/OAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABTTklEQVR4nO3deVxU9f7H8c+ggqCggiyauIFbKoprlgsqalqWSYtmpi1aht7USrOs3IoWs6y0ut3U273a4q00rSzN3K6IS6FlXVNzwQTcIQRZP78//M1phkVFZpg58Ho+HvNQzjlzzme+58zw5sz5fo9FVVUAAABgSh6uLgAAAABXjzAHAABgYoQ5AAAAEyPMAQAAmBhhDgAAwMQIcwAAACZGmAMAADAxwhwAAICJVXV1AVejoKBAjh8/Lr6+vmKxWFxdDgAAQJmpqvz5559Sv3598fC48vNtpgxzx48fl9DQUFeXAQAA4HBJSUnSoEGDK17elGHO19dXRC6+WD8/PxdXAwAAUHbp6ekSGhpq5JwrZcowZ/1q1c/PjzAHAAAqlNJeQkYHCAAAABMjzAEAAJgYYQ4AAMDECHMAAAAmRpgDAAAwMcIcAACAiRHmAAAATIwwBwAAYGKEOQAAABMjzAEAAJgYYQ4AAMDECHMAAAAmRpgDAAAwMcIcAACAiVV1dQFAeVFVycrNd3UZACoJ72pVxGKxuLoMVAKEOVQKqiq3vxMvu46cdXUpACqJTo3qyPKHuxHo4HR8zYpKISs3nyAHoFztPHKWbwNQLjgzh0pn5/Ro8fGs4uoyAFRQmTn50mnOOleXgUqEMIdKx8ezivh4cugDACoGvmYFAAAwMcIcAACAiRHmAAAATIwwBwAAYGKEOQAAABMrVZiLi4uTzp07i6+vrwQFBcmQIUNk3759dstcuHBBYmNjJSAgQGrWrCkxMTGSmppqt8zRo0flpptuEh8fHwkKCpInnnhC8vLyyv5qAAAAKplShbmNGzdKbGysbNu2TdauXSu5ubnSv39/OX/+vLHMpEmTZNWqVbJ8+XLZuHGjHD9+XIYOHWrMz8/Pl5tuuklycnJk69at8s9//lOWLFkizz77rONeFQAAQCVRqsG21qxZY/fzkiVLJCgoSHbt2iU9e/aUtLQ0ef/992XZsmXSp08fERFZvHixtGrVSrZt2ybXXXedfPvtt/LLL7/IunXrJDg4WNq3by+zZ8+WqVOnyowZM8TT09Nxrw4AAKCCK9M1c2lpaSIi4u/vLyIiu3btktzcXImOjjaWadmypTRs2FDi4+NFRCQ+Pl7atm0rwcHBxjIDBgyQ9PR02bt3b7Hbyc7OlvT0dLsHAAAAyhDmCgoKZOLEiXLDDTdImzZtREQkJSVFPD09pXbt2nbLBgcHS0pKirGMbZCzzrfOK05cXJzUqlXLeISGhl5t2QAAABXKVYe52NhY+fnnn+Wjjz5yZD3FmjZtmqSlpRmPpKQkp28TAADADK7qBpXjx4+X1atXy6ZNm6RBgwbG9JCQEMnJyZFz587ZnZ1LTU2VkJAQY5nt27fbrc/a29W6TGFeXl7i5eV1NaUCAABUaKU6M6eqMn78ePn8889l/fr10qRJE7v5HTt2lGrVqsl3331nTNu3b58cPXpUunXrJiIi3bp1k59++klOnDhhLLN27Vrx8/OTa6+9tiyvBQAAoNIp1Zm52NhYWbZsmaxcuVJ8fX2Na9xq1aol3t7eUqtWLXnggQdk8uTJ4u/vL35+fjJhwgTp1q2bXHfddSIi0r9/f7n22mtl5MiR8vLLL0tKSopMnz5dYmNjOfsGAABQSqUKc2+//baIiERFRdlNX7x4sYwePVpERF577TXx8PCQmJgYyc7OlgEDBsjChQuNZatUqSKrV6+WcePGSbdu3aRGjRoyatQomTVrVtleCQAAQCVUqjCnqpddpnr16rJgwQJZsGBBics0atRIvvrqq9JsGgAAAMXg3qwAAAAmRp
gDAAAwMcIcAACAiRHmAAAATIwwBwAAYGKEOQAAABMjzAEAAJgYYQ4AAMDECHMAAAAmRpgDAAAwMcIcAACAiRHmAAAATIwwBwAAYGKEOQAAABMjzAEAAJgYYQ4AAMDECHMAAAAmRpgDAAAwMcIcAACAiRHmAAAATIwwBwAAYGKEOQAAABMjzAEAAJgYYQ4AAMDECHMAAAAmRpgDAAAwMcIcAACAiRHmAAAATIwwBwAAYGKEOQAAABMjzAEAAJgYYQ4AAMDECHMAAAAmRpgDAAAwMcIcAACAiZU6zG3atEkGDx4s9evXF4vFIitWrLCbb7FYin288sorxjKNGzcuMv/FF18s84sBAACobEod5s6fPy/t2rWTBQsWFDs/OTnZ7rFo0SKxWCwSExNjt9ysWbPslpswYcLVvQIAAIBKrGppnzBw4EAZOHBgifNDQkLsfl65cqX07t1bmjZtajfd19e3yLIAAAAoHadeM5eamipffvmlPPDAA0XmvfjiixIQECCRkZHyyiuvSF5eXonryc7OlvT0dLsHAAAAruLMXGn885//FF9fXxk6dKjd9L/97W/SoUMH8ff3l61bt8q0adMkOTlZ5s2bV+x64uLiZObMmc4sFQAAwJScGuYWLVokI0aMkOrVq9tNnzx5svH/iIgI8fT0lIceekji4uLEy8uryHqmTZtm95z09HQJDQ11XuEAAAAm4bQwt3nzZtm3b598/PHHl122a9eukpeXJ4cPH5YWLVoUme/l5VVsyAMAAKjsnHbN3Pvvvy8dO3aUdu3aXXbZxMRE8fDwkKCgIGeVAwAAUCGV+sxcRkaGHDhwwPj50KFDkpiYKP7+/tKwYUMRufg16PLly+XVV18t8vz4+HhJSEiQ3r17i6+vr8THx8ukSZPknnvukTp16pThpQAAAFQ+pQ5zO3fulN69exs/W69lGzVqlCxZskRERD766CNRVRk+fHiR53t5eclHH30kM2bMkOzsbGnSpIlMmjTJ7po4AAAAXJlSh7moqChR1UsuM3bsWBk7dmyx8zp06CDbtm0r7WYBAABQDO7NCgAAYGKEOQAAABMjzAEAAJgYYQ4AAMDECHMAAAAmRpgDAAAwMcIcAACAiRHmAAAATIwwBwAAYGKEOQAAABMjzAEAAJgYYQ4AAMDECHMAAAAmRpgDAAAwMcIcAACAiRHmAAAATIwwBwAAYGKEOQAAABMjzAEAAJgYYQ4AAMDECHMAAAAmRpgDAAAwMcIcAACAiRHmAAAATIwwBwAAYGKEOQAAABMjzAEAAJgYYQ4AAMDECHMAAAAmRpgDAAAwMcIcAACAiRHmAAAATIwwBwAAYGKEOQAAABMrdZjbtGmTDB48WOrXry8Wi0VWrFhhN3/06NFisVjsHjfeeKPdMmfOnJERI0aIn5+f1K5dWx544AHJyMgo0wsBAACojEod5s6fPy/t2rWTBQsWlLjMjTfeKMnJycbjww8/tJs/YsQI2bt3r6xdu1ZWr14tmzZtkrFjx5a+egAAgEquammfMHDgQBk4cOAll/Hy8pKQkJBi5/3666+yZs0a2bFjh3Tq1ElERN58800ZNGiQzJ07V+rXr1/akgAAACotp1wzt2HDBgkKCpIWLVrIuHHj5PTp08a8+Ph4qV27thHkRESio6PFw8NDEhISnFEOAABAhVXqM3OXc+ONN8rQoUOlSZMmcvDgQXnqqadk4MCBEh8fL1WqVJGUlBQJCgqyL6JqVfH395eUlJRi15mdnS3Z2dnGz+np6Y4uGwAAwJQcHuaGDRtm/L9t27YSEREhYWFhsmHDBunbt+9VrTMuLk5mzpzpqBIBAAAqDKcPTdK0aVOpW7euHDhwQEREQkJC5MSJE3bL5OXlyZkzZ0q8zm7atGmSlpZmPJKSkpxdNgAAgCk4/MxcYceOHZPTp09LvXr1RESkW7ducu7cOdm1a5d07NhRRETWr18vBQUF0rVr12LX4eXlJV5eXs4uFQBQCqoqWbn5ri
7D7WTm5BX7f/zFu1oVsVgsri6jwih1mMvIyDDOsomIHDp0SBITE8Xf31/8/f1l5syZEhMTIyEhIXLw4EGZMmWKhIe
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 640x480 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"tree = run_agglomerative(df_scaled)\n",
|
|||
|
"linkage_matrix = get_linkage_matrix(tree)\n",
|
|||
|
"draw_dendrogram(linkage_matrix)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Пробуем представить данные в виде 3 больших кластеров и визуализируем результаты иерархической кластеризации"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 91,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"PhysicalHealthDays vs MentalHealthDays\n",
|
|||
|
"PhysicalHealthDays vs SleepHours\n",
|
|||
|
"PhysicalHealthDays vs BMI\n",
|
|||
|
"MentalHealthDays vs SleepHours\n",
|
|||
|
"MentalHealthDays vs BMI\n",
|
|||
|
"SleepHours vs BMI\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAABjUAAAlUCAYAAABfY/AfAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3wc1bn/8c/M9tVKq+Ii94ILYGOK6abXQCAJJSQhkJBCEgK5CSHlkpBCbnIJ+aXnJqRDEmpI6DWAwRTTwZhq3KtkW3VX23fm/P5Y2bhIslbSaiX7+3699mVpZvzo2dndmdnzzDnHMsYYREREREREREREREREhji73AmIiIiIiIiIiIiIiIj0hooaIiIiIiIiIiIiIiIyLKioISIiIiIiIiIiIiIiw4KKGiIiIiIiIiIiIiIiMiyoqCEiIiIiIiIiIiIiIsOCihoiIiIiIiIiIiIiIjIsqKghIiIiIiIiIiIiIiLDgooaIiIiIiIiIiIiIiIyLHjLnUCpua7Lhg0bqKysxLKscqcjIiIiIjKsGWOIx+OMHTsW295z7pHS9woRERERkYHTn+8Vu31RY8OGDUyYMKHcaYiIiIiI7FbWrl3L+PHjy53GoNH3ChERERGRgdeX7xW7fVGjsrISKOycqqqqMmcjIiIiIjK8xWIxJkyYsPU6u9yuueYa7rjjDt555x1CoRBHHnkk1157LTNnzty6zXHHHceCBQu2+3+f//zn+f3vf9/rv6PvFSIiIiIiA6c/3yt2+6LGlq7hVVVV+vIhIiIiIjJAhsoQTAsWLODSSy/lkEMOIZ/P861vfYtTTjmFt956i4qKiq3bXXzxxfzgBz/Y+ns4HC7q7+h7hYiIiIjIwOvL94rdvqghIiIiIiK7r4ceemi732+44QZGjRrFyy+/zDHHHLN1eTgcpr6+frDTExERERGRAbbnzOwnIiIiIiK7vfb2dgBqa2u3W37TTTcxYsQIZs+ezZVXXkkymewxTiaTIRaLbfcQEREREZHyU08NERERERHZLbiuy1e+8hXmzZvH7Nmzty4///zzmTRpEmPHjmXx4sV885vfZMmSJdxxxx3dxrrmmmu4+uqrByNtEREREREpgmWMMeVOopRisRjRaJT29naNfSsiIiIi0k9D+fr6kksu4cEHH+Tpp59m/Pjx3W43f/58TjzxRJYtW8Zee+3V5TaZTIZMJrP19y0TGQ7F5y0iIiIiMtz053uFemqIiIiIiMiwd9lll3Hffffx5JNP9ljQADjssMMAeixqBAIBAoHAgOcpIiIiIiL9o6KGiIiIiIgMW8YYvvSlL3HnnXfyxBNPMGXKlF3+n0WLFgEwZsyYEmcnIiIiIiIDTUUNEREREREZti699FJuvvlm7r77biorK2lsbAQgGo0SCoVYvnw5N998M6effjp1dXUsXryYyy+/nGOOOYY5c+aUOXsRERERESmWihoiIiIiIjJsXXfddQAcd9xx2y2//vrrueiii/D7/Tz66KP88pe/JJFIMGHCBM455xyuuuqqMmQrIiIiIiL9paKGiIiIiIgMW8aYHtdPmDCBBQsWDFI2IiIiIiJSana5ExAREREREREREREREekNFTVERERERERERERERGRYUFFDRERERERERERERESGBRU1RERERERERERERERkWFBRQ0REREREREREREREhgUVNUREREREREREREREZFhQUUNERERERERERERERIYFFTVERERERERERERERGRYUFFDRERERERERERERESGBRU1RERERERERERERERkWFBRQ0REREREREREREREhgUVNUREREREREREREREZFhQUUNERERERERERERERIYFFTVERERERERERERERGRYUFFDRERERERERERERESGBRU1RERERERERERERERkWPCWO4Hd3Yp3ZlAbtDBAa9owde93BySuu/FJMF8CXLC+jT36owMSN9/wBxpTv8c1FqOCEw
mOu2tA4uYa3iaePhuP7ZK2DmH0xBsHJK67cSMpcwY2KeBcQvXfH5C4TuMLpLgAD+AQIVL/yoDEBXDj/wBnOQROxA4dPWBxW5bMwmvnyFLDiOnPD1hct+MOyC8G/zzs8MkDFze/GpwG8IzB9k4asLirnz8Ar53BODD+8LcHLG6pJFpeI9H8JL7QBGrGf6jc6ezSWy9eQjrxIq5rM2vv2wiNnTIgcW+79gxaNsexbZczL7qGsfseNSBxc9kczRtasT02I8bVYtsDU8vPZVM0r1uGZVnUjZ+O1xcYkLjZdJZlr67EuIbpc6fiD/oHJK4xDribwDjgGYVlDUzcDSvf5J9XX4LrWIybdQAf+e9fDUjcfC7HwjvuI9URZ/8Tj6d+8oQBiVsqxhhaN7aRSWWpHllFKBIqd0q79J0P/jct69cw9aADuOKP3yh3Oru0cfUmGlduZsSEWsbtNWbA4j5z1/O88ujrTN1/Eu+/eODOcaVyiP1hQoAD+IAn3NvLnJGIiEjfxDIZ2tIpIn4/taFwudMREdmjtKVTxDIZqgIBqoND//trVyxjjCl3EqUUi8WIRqO0t7dTVVU1aH93xTszmFgFlrX9cmNgUwbGTulbccPduBjMuV2vtH6OPfqMPsV1Gp5lfsN3uXf1XqzqiGKwGBlIcur4lXxgUpbQuIf7FDfX8DbG/SCeHdoRjYGN2ZmMm3xvn+IC5NbPYMf2SWMgZR9FZf1f+xw3u24GHs/OceM21NT3vSjlNn8Ccs/tsNQDke9gR87vc9zEihnsePwxBuIxi5q9l/Q5rtvyVcjet8NSG0KfwY5+ve9xM89B/JeQfwdwCzG9e0Plf2EHjuxz3PUv7k2k0hCuNIXPnYFkwiIRsxlz8NArbrQ3PkR89Xeorovj9ZrCa9buJ2cdz7j9f13u9Hby6pNnM2XCO0SCebYc1hwX1jWFmbr/oj7H/cc17+Ol+21WvR4gl7WxMFREXfY5JsXnv/+dPhc3ctkc829+mqf+9RwtjW1YFoydPoYTPnYUh58xF2vHg3Mv5XMZnrjxOp7814s0bcgAMHpikOM+Oo+jzrsYj6dv9wpkszn+8s0bWfCvZ+loTYCBcDTM0Wcfxud/+ok+FzeMMZB5DJO6F5y1gCkUNYKnQvCMPhc3Nqx8k19+6nLeejFCNm1jAI/HMHF6mnkfHscnv//HPsUF+MvXLmP8xJeYOCOFbUOsxcPiF8Zx0md+yqRZM/sct1TeeWEp//n7At59aTlO3iFcGeLwMw/m1E8eR1VdZbnT28kXDzybY8/cyOzDEnh9Lpm0zctPVLLouRn8auFfyp3eTl5/+m2uv+oWlr26EifvYntspsyewAXfPZdD33dQn+P+/erbuOmH/8Z1tr8MPvJDB3P1Hd/sb9oD7jD7w0SAHY9cDvB4GQob5bq+Lrc99XmLiAykxo44t7/1Jk+uXkk6n8dn2xw8dhzn7DubmXUjyp2eiMhubU17G7e/9QYL164h6zj4PR6OGD+Rc/edxeTqmkHPpz/X12Utalx33XVcd911rFq1CoBZs2bx3e9+l9NOOw2AdDrNFVdcwa233komk+HUU0/ld7/7HaNHj+713yjHl4/kqhn4/TsXNLYwBrJZCE8uroHc3bgRzC7u7Lf+jD36mKLiAtz0zAe5cdksHGNR7U9jY4jlAuRcD8ePXc0V+xkCY4vvXZHfMKPb/QDQlDu0T702uipobCtlnUik/roBjxu3oLoPhQ138xng9PD/It/vU2EjtWoG/h5uEI/HoHpmH/JtvhhyC7rfIHghdvV3io+bXgDtXwOTAPzQ2RcGsmCFIfpT7OBxRcfd8OLejKh38XjBdQEDWGDb4DrQ1OgZUoWN9saHcFu+Srgyj5u3cJzC8cLrMzh5i+bmQ5lwyD/KneZWi5+6gJlTX8Db+dnYctLY8tHeFAsyZubiouPedu0Z3HOdTfMGH7YNXq+LwSKftbBs2Gdekp899mDRcfO5PNdfdSvP3fsSXr
+XypoKXGOINcexbZszv3AKZ15yatFxHSfPP676Bk/duQqP1yJS7QFjiLW6YOCUC/fh3P/+QdG9QVzX5dtnXMNrj78
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 1600x2400 with 6 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"result = hierarchy.fcluster(linkage_matrix, 3, criterion=\"maxclust\")\n",
|
|||
|
"y_names = ['Кластер 1', 'Кластер 2', 'Кластер 3']\n",
|
|||
|
"\n",
|
|||
|
"show_scatters_by_pairs(df_reduced_sampled, num_columns, result, y_names)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### KMeans (неиерархическая четкая кластеризация) для сравнения"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 92,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"from typing import Tuple\n",
|
|||
|
"from sklearn.cluster import KMeans\n",
|
|||
|
"\n",
|
|||
|
"def print_cluster_result(\n",
|
|||
|
" df: pd.DataFrame,\n",
|
|||
|
" clusters_num: int,\n",
|
|||
|
" labels: np.ndarray,\n",
|
|||
|
" separator: str = \", \"):\n",
|
|||
|
" for cluster_id in range(clusters_num):\n",
|
|||
|
" cluster_indices = np.where(labels == cluster_id)[0]\n",
|
|||
|
" print(f\"Cluster {cluster_id + 1} ({len(cluster_indices)}):\")\n",
|
|||
|
" rules = [str(df.index[idx]) for idx in cluster_indices]\n",
|
|||
|
" print(separator.join(rules))\n",
|
|||
|
" print(\"\")\n",
|
|||
|
" print(\"--------\")\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"def run_kmeans(\n",
|
|||
|
" df: pd.DataFrame,\n",
|
|||
|
" num_clusters: int,\n",
|
|||
|
" random_state: int) -> Tuple[np.ndarray, np.ndarray]:\n",
|
|||
|
" kmeans = KMeans(n_clusters=num_clusters, random_state=random_state)\n",
|
|||
|
" labels = kmeans.fit_predict(df)\n",
|
|||
|
" return labels, kmeans.cluster_centers_"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 93,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Cluster 1 (6480):\n",
|
|||
|
"5, 7, 14, 15, 18, 24, 27, 29, 38, 42, 46, 49, 51, 56, 57, 60, 62, 63, 66, 67, 68, 71, 76, 80, 85, 88, 90, 91, 95, 96, 97, 105, 109, 113, 117, 123, 136, 138, 139, 143, 146, 150, 151, 152, 154, 158, 166, 171, 175, 188, 190, 192, 193, 195, 203, 210, 214, 228, 230, 234, 244, 246, 248, 249, 254, 256, 261, 263, 272, 274, 276, 277, 278, 279, 283, 286, 287, 295, 297, 302, 307, 310, 311, 312, 316, 317, 318, 327, 333, 340, 345, 347, 348, 356, 357, 363, 370, 371, 373, 383, 386, 387, 392, 393, 394, 397, 400, 407, 411, 420, 422, 423, 425, 429, 434, 438, 443, 445, 452, 453, 455, 458, 459, 464, 469, 472, 477, 479, 482, 487, 492, 498, 502, 503, 505, 506, 507, 509, 511, 519, 521, 525, 527, 535, 541, 545, 555, 560, 567, 568, 572, 573, 574, 575, 579, 580, 582, 584, 586, 587, 588, 595, 600, 601, 603, 604, 610, 611, 613, 621, 623, 625, 626, 628, 633, 634, 635, 643, 647, 648, 652, 657, 658, 659, 660, 664, 667, 671, 675, 680, 681, 684, 685, 688, 692, 699, 703, 711, 713, 715, 719, 721, 724, 726, 730, 736, 738, 740, 761, 766, 772, 774, 778, 779, 781, 784, 793, 798, 800, 809, 810, 818, 833, 839, 841, 849, 863, 864, 879, 880, 884, 888, 889, 890, 892, 920, 923, 931, 938, 942, 948, 952, 956, 957, 961, 963, 977, 982, 986, 994, 996, 997, 1004, 1006, 1007, 1010, 1012, 1016, 1018, 1020, 1024, 1032, 1039, 1041, 1042, 1049, 1054, 1056, 1058, 1064, 1067, 1068, 1070, 1076, 1080, 1081, 1091, 1099, 1100, 1107, 1108, 1113, 1115, 1125, 1127, 1129, 1131, 1132, 1140, 1141, 1147, 1152, 1156, 1158, 1173, 1179, 1180, 1181, 1183, 1187, 1188, 1191, 1192, 1198, 1201, 1206, 1213, 1217, 1218, 1219, 1222, 1235, 1236, 1242, 1248, 1252, 1254, 1258, 1260, 1261, 1262, 1265, 1270, 1274, 1275, 1277, 1281, 1284, 1285, 1287, 1292, 1294, 1306, 1307, 1308, 1313, 1318, 1324, 1329, 1331, 1332, 1333, 1346, 1350, 1370, 1371, 1372, 1385, 1387, 1391, 1402, 1403, 1407, 1408, 1409, 1412, 1413, 1414, 1416, 1420, 1421, 1422, 1428, 1440, 1444, 1445, 1446, 1448, 1456, 1459, 1463, 1464, 1466, 1471, 1474, 1483, 1484, 1488, 1495, 1497, 
1500, 1502, 1503, 1510, 1512, 1514, 1526, 1530, 1531, 1532, 1536, 1542, 1551, 1557, 1561, 1564, 1567, 1575, 1580, 1582, 1586, 1591, 1592, 1593, 1594, 1613, 1618, 1621, 1622, 1626, 1627, 1631, 1635, 1640, 1646, 1648, 1649, 1651, 1652, 1653, 1656, 1659, 1663, 1666, 1669, 1675, 1682, 1691, 1692, 1693, 1697, 1707, 1709, 1711, 1713, 1715, 1717, 1720, 1723, 1724, 1726, 1727, 1728, 1729, 1731, 1733, 1734, 1735, 1737, 1739, 1747, 1756, 1759, 1768, 1774, 1780, 1782, 1799, 1800, 1803, 1807, 1815, 1823, 1831, 1833, 1835, 1840, 1844, 1845, 1846, 1851, 1852, 1860, 1861, 1862, 1863, 1864, 1867, 1873, 1877, 1884, 1888, 1893, 1900, 1902, 1913, 1917, 1918, 1919, 1928, 1930, 1932, 1934, 1935, 1936, 1937, 1938, 1940, 1941, 1944, 1953, 1954, 1955, 1960, 1961, 1962, 1966, 1968, 1969, 1971, 1975, 1988, 1989, 1990, 1996, 1999, 2000, 2006, 2021, 2030, 2031, 2037, 2052, 2053, 2061, 2068, 2072, 2077, 2082, 2087, 2092, 2100, 2103, 2106, 2109, 2111, 2115, 2116, 2118, 2122, 2129, 2135, 2137, 2138, 2144, 2146, 2154, 2162, 2163, 2165, 2166, 2172, 2174, 2176, 2181, 2187, 2191, 2194, 2204, 2206, 2211, 2217, 2224, 2225, 2226, 2231, 2233, 2234, 2235, 2238, 2249, 2257, 2259, 2267, 2274, 2277, 2278, 2281, 2292, 2295, 2297, 2304, 2311, 2314, 2315, 2317, 2318, 2320, 2321, 2323, 2325, 2330, 2337, 2339, 2340, 2341, 2344, 2345, 2353, 2357, 2360, 2361, 2363, 2364, 2365, 2368, 2371, 2372, 2377, 2384, 2385, 2386, 2388, 2390, 2393, 2397, 2398, 2400, 2409, 2413, 2417, 2419, 2420, 2423, 2424, 2426, 2431, 2437, 2441, 2445, 2446, 2454, 2458, 2459, 2463, 2464, 2470, 2475, 2479, 2485, 2494, 2496, 2498, 2500, 2506, 2517, 2522, 2524, 2527, 2529, 2530, 2532, 2536, 2540, 2541, 2547, 2550, 2551, 2552, 2553, 2556, 2563, 2566, 2574, 2577, 2583, 2587, 2593, 2596, 2598, 2602, 2603, 2604, 2606, 2607, 2608, 2609, 2612, 2613, 2617, 2620, 2623, 2624, 2626, 2627, 2638, 2643, 2645, 2660, 2662, 2668, 2672, 2677, 2681, 2688, 2689, 2693, 2697, 2699, 2705, 2708, 2709, 2711, 2713, 2714, 2718, 2721, 2722, 2724, 2727, 2728, 2729, 2732, 
2733, 2739, 2740, 2743, 2745, 2750, 2753, 2756, 2771, 2776, 2780, 2781, 2784, 2787, 2789, 2790,
|
|||
|
"\n",
|
|||
|
"--------\n",
|
|||
|
"Cluster 2 (14328):\n",
|
|||
|
"0, 1, 3, 4, 6, 8, 9, 11, 13, 17, 20, 21, 22, 23, 25, 26, 28, 31, 33, 35, 36, 39, 40, 41, 44, 45, 47, 52, 54, 55, 58, 59, 61, 64, 65, 69, 70, 72, 73, 74, 77, 78, 79, 84, 86, 87, 89, 92, 93, 94, 98, 99, 100, 101, 103, 104, 107, 108, 110, 111, 112, 115, 116, 118, 119, 120, 122, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 137, 140, 142, 144, 145, 147, 148, 149, 153, 156, 157, 159, 160, 162, 163, 164, 167, 169, 172, 173, 174, 176, 179, 180, 182, 183, 184, 185, 186, 189, 191, 194, 196, 197, 198, 199, 200, 201, 202, 204, 207, 209, 211, 212, 213, 215, 216, 217, 218, 219, 220, 221, 223, 224, 225, 226, 227, 229, 231, 232, 233, 236, 237, 238, 239, 240, 241, 242, 243, 245, 247, 250, 251, 252, 253, 255, 258, 260, 262, 264, 265, 266, 267, 268, 269, 270, 271, 273, 280, 284, 285, 288, 289, 291, 292, 293, 294, 296, 298, 299, 300, 301, 303, 304, 305, 306, 308, 309, 313, 314, 315, 319, 320, 322, 323, 324, 325, 326, 328, 329, 331, 332, 334, 335, 336, 337, 338, 339, 341, 342, 343, 344, 349, 351, 352, 353, 354, 355, 358, 359, 360, 361, 362, 365, 366, 367, 368, 372, 374, 375, 376, 377, 379, 380, 381, 382, 384, 385, 388, 389, 395, 396, 398, 401, 402, 403, 404, 406, 408, 409, 410, 412, 413, 414, 415, 416, 417, 418, 419, 421, 424, 427, 428, 430, 432, 435, 437, 440, 442, 447, 448, 449, 450, 451, 454, 456, 457, 460, 461, 462, 463, 465, 468, 470, 471, 473, 474, 476, 480, 481, 483, 484, 485, 486, 488, 490, 491, 493, 494, 495, 496, 499, 500, 501, 504, 508, 510, 512, 514, 515, 516, 517, 518, 520, 524, 526, 529, 530, 531, 534, 537, 538, 540, 542, 543, 544, 547, 548, 549, 550, 552, 553, 554, 556, 557, 558, 559, 562, 563, 564, 566, 569, 571, 576, 577, 578, 581, 583, 585, 589, 590, 592, 593, 594, 596, 597, 598, 599, 602, 606, 607, 608, 609, 612, 615, 616, 617, 618, 619, 620, 622, 624, 627, 629, 630, 631, 632, 637, 638, 639, 640, 641, 642, 646, 649, 650, 651, 654, 655, 656, 661, 662, 665, 666, 669, 670, 672, 673, 674, 676, 677, 678, 679, 682, 683, 686, 687, 690, 691, 693, 694, 695, 696, 
697, 698, 700, 701, 702, 705, 707, 709, 712, 714, 716, 717, 718, 723, 725, 727, 728, 731, 732, 733, 734, 735, 737, 739, 741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 754, 755, 756, 757, 758, 759, 760, 762, 763, 764, 767, 768, 769, 770, 773, 775, 776, 777, 780, 782, 783, 785, 787, 788, 789, 791, 792, 794, 795, 797, 799, 801, 802, 803, 804, 805, 806, 808, 811, 813, 814, 815, 816, 817, 821, 823, 825, 826, 827, 828, 829, 830, 832, 834, 836, 837, 838, 840, 842, 845, 846, 848, 850, 851, 852, 855, 856, 857, 859, 860, 861, 862, 865, 866, 867, 869, 871, 872, 873, 874, 875, 876, 877, 878, 881, 883, 885, 886, 887, 891, 893, 894, 895, 896, 898, 899, 900, 901, 902, 903, 904, 905, 906, 907, 908, 910, 911, 912, 913, 917, 918, 919, 921, 922, 924, 925, 927, 928, 932, 934, 935, 936, 939, 940, 941, 943, 944, 945, 946, 947, 949, 950, 951, 953, 954, 955, 958, 962, 964, 965, 966, 969, 970, 972, 974, 975, 978, 979, 980, 981, 983, 985, 988, 989, 991, 992, 993, 998, 999, 1000, 1001, 1002, 1003, 1005, 1008, 1009, 1013, 1014, 1015, 1017, 1022, 1025, 1028, 1029, 1031, 1033, 1034, 1036, 1037, 1038, 1043, 1046, 1048, 1050, 1052, 1053, 1059, 1060, 1061, 1066, 1069, 1071, 1072, 1074, 1077, 1079, 1082, 1083, 1084, 1085, 1086, 1087, 1088, 1089, 1090, 1092, 1093, 1095, 1096, 1098, 1102, 1103, 1104, 1105, 1106, 1109, 1110, 1111, 1112, 1114, 1117, 1118, 1119, 1120, 1121, 1122, 1123, 1124, 1126, 1128, 1130, 1133, 1135, 1136, 1137, 1139, 1142, 1143, 1144, 1145, 1148, 1149, 1150, 1153, 1154, 1157, 1159, 1160, 1161, 1162, 1163, 1164, 1165, 1166, 1167, 1171, 1172, 1174, 1175, 1176, 1177, 1184, 1185, 1186, 1190, 1193, 1194, 1195, 1196, 1197, 1202, 1203, 1205, 1209, 1210, 1211, 1214, 1215, 1216, 1220, 1221, 1224, 1225, 1227, 1229, 1230, 1231, 1232, 1233, 1239, 1243, 1244, 1246, 1247, 1249, 1250, 1251, 1253, 1255, 1257, 1263, 1268, 1269, 1271, 1273, 1276, 1278, 1280, 1282, 1283, 1289, 1290, 1291, 1293, 1296, 1297, 1298, 1299, 1300, 1301, 1302, 1303, 1304, 1305, 1309, 1310, 1311, 1312, 1314, 1315, 
1317, 1319, 1320, 1321, 1322, 1323, 1328, 1334, 1335, 1336, 1337, 1338, 1339, 1340, 1343, 1344,
|
|||
|
"\n",
|
|||
|
"--------\n",
|
|||
|
"Cluster 3 (3794):\n",
|
|||
|
"2, 10, 12, 16, 19, 30, 32, 34, 37, 43, 48, 50, 53, 75, 81, 82, 83, 102, 106, 114, 121, 124, 141, 155, 161, 165, 168, 170, 177, 178, 181, 187, 205, 206, 208, 222, 235, 257, 259, 275, 281, 282, 290, 321, 330, 346, 350, 364, 369, 378, 390, 391, 399, 405, 426, 431, 433, 436, 439, 441, 444, 446, 466, 467, 475, 478, 489, 497, 513, 522, 523, 528, 532, 533, 536, 539, 546, 551, 561, 565, 570, 591, 605, 614, 636, 644, 645, 653, 663, 668, 689, 704, 706, 708, 710, 720, 722, 729, 752, 753, 765, 771, 786, 790, 796, 807, 812, 819, 820, 822, 824, 831, 835, 843, 844, 847, 853, 854, 858, 868, 870, 882, 897, 909, 914, 915, 916, 926, 929, 930, 933, 937, 959, 960, 967, 968, 971, 973, 976, 984, 987, 990, 995, 1011, 1019, 1021, 1023, 1026, 1027, 1030, 1035, 1040, 1044, 1045, 1047, 1051, 1055, 1057, 1062, 1063, 1065, 1073, 1075, 1078, 1094, 1097, 1101, 1116, 1134, 1138, 1146, 1151, 1155, 1168, 1169, 1170, 1178, 1182, 1189, 1199, 1200, 1204, 1207, 1208, 1212, 1223, 1226, 1228, 1234, 1237, 1238, 1240, 1241, 1245, 1256, 1259, 1264, 1266, 1267, 1272, 1279, 1286, 1288, 1295, 1316, 1325, 1326, 1327, 1330, 1341, 1342, 1361, 1365, 1366, 1373, 1375, 1389, 1393, 1400, 1431, 1453, 1460, 1467, 1470, 1475, 1477, 1479, 1480, 1492, 1496, 1511, 1513, 1519, 1521, 1527, 1529, 1540, 1543, 1544, 1549, 1550, 1565, 1573, 1576, 1577, 1584, 1595, 1600, 1603, 1608, 1610, 1614, 1616, 1617, 1632, 1643, 1650, 1654, 1667, 1668, 1670, 1674, 1680, 1688, 1698, 1705, 1732, 1736, 1762, 1766, 1778, 1779, 1783, 1789, 1796, 1812, 1814, 1820, 1825, 1829, 1854, 1874, 1880, 1881, 1882, 1895, 1896, 1897, 1903, 1916, 1920, 1926, 1931, 1943, 1949, 1956, 1959, 1978, 1984, 1987, 1991, 1995, 2024, 2029, 2042, 2045, 2048, 2058, 2063, 2067, 2086, 2089, 2102, 2105, 2107, 2112, 2113, 2119, 2125, 2126, 2131, 2140, 2142, 2151, 2155, 2158, 2169, 2178, 2202, 2207, 2208, 2210, 2212, 2215, 2221, 2239, 2242, 2244, 2254, 2258, 2261, 2263, 2264, 2266, 2268, 2296, 2301, 2322, 2327, 2328, 2331, 2348, 2369, 2373, 2374, 2375, 2378, 2383, 2402, 2404, 
2407, 2412, 2421, 2422, 2428, 2436, 2443, 2447, 2452, 2453, 2478, 2480, 2486, 2504, 2512, 2519, 2521, 2533, 2539, 2543, 2561, 2562, 2570, 2571, 2572, 2579, 2582, 2585, 2601, 2611, 2614, 2644, 2647, 2650, 2655, 2659, 2663, 2670, 2676, 2685, 2686, 2692, 2694, 2702, 2703, 2706, 2716, 2717, 2725, 2735, 2736, 2744, 2746, 2749, 2751, 2755, 2761, 2762, 2767, 2770, 2774, 2792, 2794, 2798, 2803, 2812, 2815, 2816, 2821, 2823, 2825, 2830, 2832, 2837, 2846, 2852, 2853, 2858, 2866, 2869, 2874, 2882, 2884, 2886, 2896, 2897, 2912, 2915, 2919, 2932, 2943, 2952, 2954, 2958, 2964, 2971, 2981, 2983, 2984, 2988, 2989, 3007, 3023, 3031, 3032, 3033, 3035, 3036, 3048, 3053, 3058, 3061, 3067, 3071, 3075, 3079, 3084, 3094, 3100, 3104, 3106, 3111, 3117, 3125, 3141, 3143, 3149, 3151, 3154, 3159, 3163, 3166, 3175, 3176, 3189, 3191, 3196, 3201, 3202, 3218, 3222, 3229, 3243, 3244, 3247, 3258, 3261, 3265, 3267, 3271, 3278, 3282, 3285, 3286, 3288, 3289, 3290, 3295, 3300, 3302, 3303, 3304, 3322, 3337, 3338, 3348, 3355, 3391, 3392, 3393, 3394, 3395, 3397, 3407, 3413, 3424, 3430, 3441, 3443, 3453, 3457, 3458, 3473, 3483, 3485, 3488, 3491, 3493, 3506, 3509, 3514, 3515, 3521, 3525, 3531, 3538, 3557, 3558, 3561, 3566, 3571, 3572, 3574, 3577, 3583, 3595, 3596, 3599, 3618, 3622, 3623, 3629, 3649, 3650, 3663, 3668, 3672, 3681, 3684, 3699, 3703, 3704, 3712, 3718, 3738, 3747, 3754, 3766, 3773, 3778, 3779, 3801, 3807, 3815, 3832, 3833, 3843, 3844, 3845, 3847, 3856, 3859, 3868, 3870, 3874, 3898, 3900, 3903, 3925, 3932, 3935, 3937, 3938, 3942, 3959, 3971, 4004, 4008, 4029, 4030, 4035, 4036, 4050, 4056, 4061, 4062, 4073, 4077, 4088, 4094, 4097, 4098, 4113, 4121, 4148, 4149, 4162, 4165, 4169, 4171, 4177, 4178, 4210, 4216, 4221, 4231, 4233, 4234, 4237, 4239, 4247, 4252, 4263, 4273, 4276, 4277, 4289, 4323, 4344, 4347, 4349, 4352, 4356, 4364, 4374, 4394, 4396, 4410, 4423, 4424, 4425, 4428, 4430, 4432, 4441, 4460, 4469, 4471, 4476, 4477, 4479, 4502, 4517, 4528, 4530, 4531, 4535, 4550, 4561, 4563, 4564, 4573, 4575, 
4589, 4592, 4594, 4597, 4598, 4614, 4621, 4622, 4642, 4643, 4645, 4648, 4654, 4662, 4669, 4
|
|||
|
"\n",
|
|||
|
"--------\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"array([[-0.26768418, -0.22724145, -0.33671544, 1.03735944],\n",
|
|||
|
" [-0.33771529, -0.31519476, 0.23113456, -0.5206252 ],\n",
|
|||
|
" [ 1.73265623, 1.57836843, -0.29417739, 0.18428677]])"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"labels, centers = run_kmeans(df_scaled, 3, RANDOM_STATE)\n",
|
|||
|
"print_cluster_result(df_scaled, 3, labels)\n",
|
|||
|
"display(centers)"
|
|||
|
]
|
|||
|
},
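Because KMeans is fitted on the standardized frame, the centroids it returns are in standardized units (z-scores), not in days, hours, or BMI points. A minimal sketch of mapping centroids back to the original units follows. The `raw` array, `toy_scaler`, and the two-cluster setup are hypothetical; the only library call relied on is `StandardScaler.inverse_transform`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: columns stand in for (SleepHours, BMI).
raw = np.array([
    [6.0, 22.0], [6.5, 23.0], [7.0, 21.5],
    [8.0, 35.0], [8.5, 36.0], [9.0, 34.0],
])

toy_scaler = StandardScaler()
scaled = toy_scaler.fit_transform(raw)

# Cluster in standardized space, as the notebook does.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# Undo the scaling so the centroids are expressed in the original units.
centers_original_units = toy_scaler.inverse_transform(kmeans.cluster_centers_)
print(centers_original_units)
```

Since standardization is an affine transform, each recovered centroid equals the mean of the unscaled rows assigned to that cluster, which makes the centroids easier to interpret and to overlay on plots of the unscaled data.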
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Визуализируем результаты"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 94,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"def draw_cluster_results(\n",
|
|||
|
" df: pd.DataFrame,\n",
|
|||
|
" col1: int,\n",
|
|||
|
" col2: int,\n",
|
|||
|
" labels: np.ndarray,\n",
|
|||
|
" cluster_centers: np.ndarray,\n",
|
|||
|
" subplot: Any | None = None,\n",
|
|||
|
"):\n",
|
|||
|
" ax = None\n",
|
|||
|
" if subplot is None:\n",
|
|||
|
" ax = plt\n",
|
|||
|
" else:\n",
|
|||
|
" ax = subplot\n",
|
|||
|
"\n",
|
|||
|
" centroids = cluster_centers\n",
|
|||
|
" u_labels = np.unique(labels)\n",
|
|||
|
"\n",
|
|||
|
" for i in u_labels:\n",
|
|||
|
" ax.scatter(\n",
|
|||
|
" df[labels == i][df.columns[col1]],\n",
|
|||
|
" df[labels == i][df.columns[col2]],\n",
|
|||
|
" label=i,\n",
|
|||
|
" )\n",
|
|||
|
"\n",
|
|||
|
" ax.scatter(centroids[:, col1], centroids[:, col2], s=80, color=\"k\")\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"def show_clusters_by_pairs(\n",
|
|||
|
" df: DataFrame,\n",
|
|||
|
" columns: List[str],\n",
|
|||
|
" labels: Any = None,\n",
|
|||
|
" centers: Any = None) -> None:\n",
|
|||
|
" pairs_count = math.comb(len(columns), 2)\n",
|
|||
|
" plot_columns_count = 2\n",
|
|||
|
" plot_rows_count = math.ceil(pairs_count / plot_columns_count) \n",
|
|||
|
"\n",
|
|||
|
" plt.figure(figsize=(plot_columns_count * 8, plot_rows_count * 8))\n",
|
|||
|
"\n",
|
|||
|
" count = 0\n",
|
|||
|
" for i in range(len(columns)):\n",
|
|||
|
" for j in range(i + 1, len(columns)):\n",
|
|||
|
" count += 1\n",
|
|||
|
" print(columns[i], 'vs', columns[j])\n",
|
|||
|
" draw_cluster_results(\n",
|
|||
|
" df,\n",
|
|||
|
" i, j,\n",
|
|||
|
" labels,\n",
|
|||
|
" centers, \n",
|
|||
|
" plt.subplot(plot_rows_count, plot_columns_count, count))\n",
|
|||
|
"\n",
|
|||
|
" plt.tight_layout()\n",
|
|||
|
" plt.show()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 95,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"PhysicalHealthDays vs MentalHealthDays\n",
|
|||
|
"PhysicalHealthDays vs SleepHours\n",
|
|||
|
"PhysicalHealthDays vs BMI\n",
|
|||
|
"MentalHealthDays vs SleepHours\n",
|
|||
|
"MentalHealthDays vs BMI\n",
|
|||
|
"SleepHours vs BMI\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAABjQAAAlUCAYAAACwoZshAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdfXhU9Z3//9fMxECGMOHG3EAJqb3BXhFqd1RaqFrQqOgq9aaKKBpxa1srpXb8tkR741pbDbadbanV3bUVoyimtt7/XFqjROuiRZ3VglwrrXVDqIREbjKEJAyZOb8/JgmZZCbMTObMnJM8H9eVa/ecefP23c85Z+ac8z43DsMwDAEAAAAAAAAAAFiYM9cFAAAAAAAAAAAAHA0NDQAAAAAAAAAAYHk0NAAAAAAAAAAAgOXR0AAAAAAAAAAAAJZHQwMAAAAAAAAAAFgeDQ0AAAAAAAAAAGB5NDQAAAAAAAAAAIDl5eW6gMEikYg++OADTZw4UQ6HI9flAAAAABjAMAwdOHBA06dPl9Np7eujOLYAAAAArCudYwvLNTQ++OADlZeX57oMAAAAAMNobm7WjBkzcl3GsDi2AAAAAKwvlWMLyzU0Jk6cKCn6P8Lj8eS4GgAAAAADBYNBlZeX9++3WxnHFgAAAIB1pXNsYbmGRt+t4B6Ph4MOAAAAwKLs8Agnji0AAAAA60vl2MLaD70FAAAAMCrdeeedOuWUUzRx4kSVlJTowgsv1LvvvhsTs2DBAjkcjpi/r33tazmqGAAAAECu0dAAAAAAkHUvvfSSbrjhBr322mt6/vnndfjwYZ199tk6ePBgTNx1112nXbt29f/dddddOaoYAAAAQK5Z7pFTAAAAAEa/DRs2xEw/8MADKikp0ZtvvqnTTz+9f77b7VZZWVlSOQ8dOqRDhw71TweDwcwUCwAAAMASuEMDAAAAQM61t7dLkqZMmRIz/+GHH9axxx6r2bNn6+abb1ZnZ2fCHHfeeaeKior6/8rLy02tGQAAAEB2OQzDMHJdxEDBYFBFRUVqb2/nxX0AAACAxZixvx6JRLR48WLt379fr7zySv/8//zP/1RFRYWmT5+uv/zlL1q1apXmzp2rxx9/PG6eeHdolJeXc2wBAAAAWFA6xxY8cgoAAABATt1www3aunVrTDNDkr7yla/0//9z5szRtGnTdOaZZ+q9997Txz/+8SF5xo0bp3HjxpleLwAAAIDc4JFTAAAAAHJmxYoVevbZZ7Vx40bNmDFj2NjPfvazkqS//e1v2SgNAAAAgMVwhwYAAACArDMMQ9/4xjf0xBNPqLGxUccdd9xR/81bb70lSZo2bZrJ1QEAAACwIhoaAAAAALLuhhtu0COPPKKnnnpKEydOVEtLiySpqKhIBQUFeu+99/TII4/ovPPO09SpU/WXv/xF3/rWt3T66afr05/+dI6rBwAAAJALNDQAAAAAZN29994rSVqwYEHM/LVr1+qaa65Rfn6+Ghoa9POf/1wHDx5UeXm5LrnkEn3ve9/LQbUAAAAArICGBgAAAICsMwxj2M/Ly8v10ksvZakaAAAAAHbAS8EBAAAAAAAAAIDl0dAAAAAAAAAAAACWR0MDAAAAAAAAAABYHg0NAAAAAAAAAABgeTQ0AAAAAAAAAACA5dHQAAAAAAAAAAAAlkdDAwAAAAAAAAAAWB4NDQAAAAAAAAAAYHk0NAAAAAAAAAAAgOXR0AAAAAAAAAAAAJZHQwMAAAAAAAAAAFgeDQ0AAAAAAAAAAGB5NDQAAAAAAAAAAIDl0dAAAAAAAAAAAACWR0MDAAAAAAAAAABYXl6uC7CjOXVzpLAkhyRDkkvaUr1lxHlf2fnfuv6Fr/VP33vmv+vUGZ8fcV5Juu3l2/S793/XP/2l476kW0+/dcR5H/jLA/rZ//ysf/qmf7pJ13z6mhHnlaRNO1/RV1+4vn/6P868V/NnnDrivHf89x1a/7f1/dNLP7FUt3z+lhHn3X
twr659/lp92PWhji04Vvefdb+mTJgy4ryS9OCWB/WTwE/6p7/t/baunnP1iPOaWXOoJ6T67fVqDjar3FOuJbOWKD8vf8R517z+S9237T/7p6+r/IpWnvKNEec1U3tnu1Y0rlDLwRaVTSjT3QvuVpG7KNdlDevJ7U/q+69+v3/69nm368JZF4447w82/kBP7Hiif/qimRfphwt/OOK8khSOhBVoDaits03F7mJ5S7xyOV0jzmvWumxWXsm8sfj1W7/WL97+Rf/0N0/8pr78mS+POK9kv+3ErDE201utb+mq/7qqf/qhcx/SZ0o+k7uCkmDmdmLmb6AZ5tTNGTIvE/ufAACYKhKWmjZJHbulwlKpYr5k8X0mABjt7Hg8O5DDMAwj10UMFAwGVVRUpPb2dnk8nlyXM8Sc++dITkNyOI7MNAwp4tCWa9M/qJxTNyfaHBmQtm96pAerZuWOd2DdJzM1xxlnh8OSNS+oX6A93XuGzJ86fqoalzSmnVeyZ83+N/yq21aniBHpn+d0OFVdWS3fyb6085q5nZjlvN+fp+aO5iHzywvL9dwlz+WgoqMza50z8zujoalBtZtrtbtzd/+8UnepaubWqKqiKu28Zq3LZuWVzBsLM5ef3bYTs8bYTGYuP7OYuZ2Y+RtoBqstP6vvrw9kp1oBYNTZ9rS0YZUU/ODIPM90adFqqXJx7uoCgDHMasez6eyvp/TIqXvvvVef/vSn5fF45PF4NG/ePP3Xf/1X/+fd3d264YYbNHXqVBUWFuqSSy7R7t27h8loL/3NjHicRvTzdPLWzZFhRM/LDmQoeg5/uIPYXOU+2r8bac1K1GczDMvVnOikiCTt6d6jBfUL0sqbTE1WrNn/hl9r31kbcwJKkiJGRGvfWSv/G/608pq5nZgl0UlaSWruaNZ5vz8vyxUdnVnrnJnfGQ1NDfI1+mJ+jCWptbNVvkafGpoa0spr1rpsVl7JvLEwc/nZbTsxa4zNZObyM4uZ24mZv4FmsOPyAwBA256Wfnt1bDNDkoK7ovO3PZ2bugBgDLPj8Ww8KTU0ZsyYodraWr355pt64403dMYZZ+iLX/yi3nnnHUnSt771LT3zzDN67LHH9NJLL+mDDz7QxRdfbErh2TanbkAzY+BdAwOnnamfbH9l53/3n6FNlFZGb1yKbnv5tqRy3/bybSnlfeAvD2Q0bqBNO1850sxIVLRhRONScMd/35HRuD57D+5NeFKkz57uPdp7cG9KeaXoY6YyGdfHzJpDPSHVbasbNqZuW51CPaGU8q55/ZdJrctrXv9lSnnN1N7ZnvAkbZ/mjma1d7ZnqaKje3L7kxmN6/ODjT/IaNxA4UhYtZtrZQxpdal/3urNqxWOhFPKa9a6bFZeybyx+PVbv85o3EB2207MGmMzvdX6VkbjssHM7cTM30AzJLtfSVMDAGApkXD0zow4+0z98zbUROMAAFlhx+PZRFJqaFxwwQU677zz9MlPflKzZs3Sj3/8YxUWFuq1115Te3u7fvOb38jv9+uMM87QSSedpLVr12rTpk167bXXEuY8dOiQgsFgzJ8lhRU9czr4bGqfvs9SXObXv/A16Shp5VDMuzWS9bv3f5dU7oHv1kjGwHdmZCJuoK++cH1S4zzw3RrJGPjOjEzE9bn2+WszGjfQwHdmZCIu1VrSqbl+e/2Qq2kHixgR1W+vTynvfdv+M6l1eeC7NXJtReOKjMZlw8B3ZmQirs/Ad2ZkIm6gQGtgyJUFAxky1NLZokBrIKW8Zq3LZuWVzBuLge/MyETcQHbbTswaYzMNfGdGJuKywcztxMzfQAAA0Ktp09A7M2IYUvAf0TgAQFbY8Xg2kZQaGgOFw2E9+uijOnjwoObNm6c333xThw8fVlXVkWdtfepTn9LMmTP16quvJsxz5513qqioqP+vvLw83ZLMleBEatpxGBU+7Powo3HZYGbNzcHhr7RONc7OWg62ZDQO8bV1tmU0ro9Z67KZ24hZY2Emu2
0ndhxjOzJzO7Hj7zYAALbTkeSjx5ONAwCM2Gg6nk25obFlyxYVFhZq3Lhx+trXvqYnnnhClZWVamlpUX5+viZNmhQ
      "text/plain": [
       "<Figure size 1600x2400 with 6 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_clusters_by_pairs(df_reduced_sampled, num_columns, labels, centers)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### PCA for visualizing the reduced-dimensionality data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {},
   "outputs": [
    {
     "data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABjYAAAJOCAYAAAAUHj4bAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3wc9Zk/8M9s711addmWezc2NgYMNhiMAdNCcwqmhuQgCSWQcLmjB0g4Uu7gSAgJzo8WjlCTgDEYg+nghnuRLVtWr9v7zvz+EFp7vbsqtlarlT7v10t/aL6zM89qR9I883yLIEmSBCIiIiIiIiIiIiIiojwgy3UAREREREREREREREREfcXCBhERERERERERERER5Q0WNoiIiIiIiIiIiIiIKG+wsEFERERERERERERERHmDhQ0iIiIiIiIiIiIiIsobLGwQEREREREREREREVHeYGGDiIiIiIiIiIiIiIjyBgsbRERERERERERERESUN1jYICIiIiIiIiIiIiKivMHCBhENaffeey8EQch1GBkdOHAAgiBg5cqVx/T6lStXQhAEHDhwYEDjGo6uvvpqjBo1Ktdh5NSoUaNw9dVX5zqMHuX7NZ0PP2MiIiLKL8xpaDjJ98+b9/tEwwcLG0R5pPsGovtLo9Fg/PjxuPnmm9Hc3Jyyf3NzM376059i4sSJ0Ol00Ov1mD17Nh588EG4XK6055g7dy4EQcCTTz6Z5XdDx+uFF17A7373u0E/r8vlgkajgSAI2Llz56Cff7j79NNPce+992b8Hc2VeDyOZ555BgsXLoTNZoNarcaoUaNwzTXXYP369YMWx1tvvYV777130M5HREREA4s5DR1psHMaQRBw8803p2x/6KGHIAgCrr32WoiimCj2CIKABx98MO2xvvOd70AQBBgMhmyHPSh4v09E+YaFDaI8dP/99+PZZ5/F448/jpNPPhlPPvkk5s+fj0AgkNjnq6++wtSpU/HEE09gwYIF+M1vfoPHHnsMs2bNwiOPPILLL7885bh79+7FV199hVGjRuH5558fzLdExyBXhY2XX34ZgiCgqKiI10kWfPrpp7jvvvvSJuq7d+/Gn/70p0GPKRgM4vzzz8e1114LSZLw7//+73jyySdx1VVX4bPPPsPcuXNRV1c3KLG89dZbuO+++7J2/Fz9jImIiEYa5jQE5C6nOdIjjzyCX/ziF1ixYgWefvppyGSHH5VpNBq8+OKLKa/x+/144403oNFoBjPUrOH9PhHlI0WuAyCi/lu6dCnmzJkDALj++utht9vxm9/8Bm+88QaWL18Ol8uFiy++GHK5HJs2bcLEiROTXv/LX/4y7T/y5557DoWFhXjsscdw6aWX4sCBA8c89U8oFIJKpUq6KaTh4bnnnsO5556LyspKvPDCCxl7MFEXv98PvV4/IMdSq9UDcpz+uuOOO7Bq1Sr89re/xS233JLUds899+C3v/1tTuIaKJIkIRQKQavV5uxnTERENNIwp6Gh4NFHH8Vdd92Fq666Cn/5y19SPutzzz0Xr776Kr7++mvMmDEjsf2NN95AJBLBOeecg/fff3+wwx5wvN8nonzE/85Ew8AZZ5wBAKipqQEA/PGPf0R9fT1+85vfpCQAAOB0OvEf//EfKdtfeOEFXHrppTj//PNhNpvxwgsv9On8H3zwAQRBwN/+9jf8x3/8B0pLS6HT6eDxeAAAX3zxBc455xyYzWbodDqcfvrp+OSTT1KO8/HHH+PEE0+ERqNBVVUV/vjHP6bs09P8r4IgpAxZra+vx3XXXYeSkhKo1WqMHj0aP/zhDxGJRBL7uFwu3HLLLSgvL4darcbYsWPxq1/9CqIoJh3L5XLh6quvhtlshsViwYoVK/o1XdD27dtxxhlnQKvVoqysDA8++GDKOYCum+TzzjsvEXNVVRUeeOABxOPxxD4LFy7Ev/71Lxw8eDAxRLo7YYtEIrj77rsxe/ZsmM1m6PV6LF
iwAGvXrk05V2NjI3bt2oVoNNqn91BbW4uPPvoIV155Ja688krU1NTg008/TbvvE088gTFjxkCr1WLu3Ln46KOPsHDhQixcuDBpv4MHD+KCCy6AXq9HYWEhbr31VrzzzjsQBAEffPBBj/H4/X7cfvvtic9uwoQJ+K//+i9IkpS0X/eQ85dffhmTJ0+GVqvF/PnzsXXrVgBdvzNjx46FRqPBwoUL084X25fruHv+5B07duDb3/42rFYrTj31VADAli1bcPXVV2PMmDHQaDQoKirCtddei/b29qTX33HHHQCA0aNHJz7b7niOnA92/fr1EAQBf/3rX1Ni7f75/fOf/0xsq6+vx7XXXgun0wm1Wo0pU6bgL3/5S48/XwCoq6vDH//4R5x11lkpSQ4AyOVy/PSnP0VZWVnGY6T73Tz6/QBANBrFfffdh3HjxkGj0cBut+PUU0/Fu+++C6BrnZUnnngicczur26iKOJ3v/sdpkyZAo1GA6fTiRtvvBGdnZ0p5z3//PPxzjvvYM6cOdBqtYm/N0fH1D1dxieffILbbrsNBQUF0Ov1uPjii9Ha2pp0XFEUce+996KkpAQ6nQ6LFi3Cjh07OI8vERFRHzCn6cKcZhSA7OY03X7zm9/gzjvvxHe/+10888wzaQtY8+fPx+jRo1Ouo+effx7nnHMObDZb2mO//fbbWLBgAfR6PYxGI8477zxs3749aZ++5AfA4RyjuroaV199NSwWC8xmM6655pqkEU4A8O677+LUU0+FxWKBwWDAhAkT8O///u89/hx4v8/7faJ8xREbRMPAvn37AAB2ux0A8Oabb0Kr1eLSSy/t8zG++OILVFdX45lnnoFKpcIll1yC559/vteboCM98MADUKlU+OlPf4pwOAyVSoX3338fS5cuxezZs3HPPfdAJpPhmWeewRlnnIGPPvoIc+fOBQBs3boVZ599NgoKCnDvvfciFovhnnvugdPp7MdPIllDQwPmzp0Ll8uF73//+5g4cSLq6+vx97//HYFAACqVCoFAAKeffjrq6+tx4403oqKiAp9++inuuusuNDY2JoZFS5KECy+8EB9//DF+8IMfYNKkSXjttdewYsWKPsXS1NSERYsWIRaL4ec//zn0ej2eeuopaLXalH1XrlwJg8GA2267DQaDAe+//z7uvvtueDwePProowCAX/ziF3C73airq0v0nume29Xj8eDpp5/G8uXLccMNN8Dr9eLPf/4zlixZgi+//BIzZ85MnOuuu+7CX//6V9TU1PSpJ9uLL74IvV6P888/H1qtFlVVVXj++edx8sknJ+335JNP4uabb8aCBQtw66234sCBA7joootgtVqTboj9fj/OOOMMNDY24ic/+QmKiorwwgsvpE1YjiZJEi644AKsXbsW1113HWbOnIl33nkHd9xxB+rr61N6FX300Ud48803cdNNNwEAHn74YZx//vm488478b//+7/4t3/7N3R2duLXv/41rr322qSeV329jrtddtllGDduHB566KFEkeXdd9/F/v37cc0116CoqAjbt2/HU089he3bt+Pzzz+HIAi45JJLsGfPHrz44ov47W9/C4fDAQAoKChIef9z5szBmDFj8H//938p1+FLL70Eq9WKJUuWAOiam/qkk05KFHgKCgrw9ttv47rrroPH40mbwHR7++23EYvF8L3vfa/Xz+R43XvvvXj44Ydx/fXXY+7cufB4PFi/fj02btyIs846CzfeeCMaGhrw7rvv4tlnn015/Y033oiVK1fimmuuwY9//GPU1NTg8ccfx6ZNm/DJJ59AqVQm9t29ezeWL1+OG2+8ETfccAMmTJjQY2w/+tGPYLVacc899+DAgQP43e9+h5tvvhkvvfRSYp+77roLv/71r7Fs2TIsWbIEX3/9NZYsWYJQKDRwPyQiIqJhijlNesxpBj6nAYDf//73uP322/Htb38bK1eu7HFUzvLly/Hcc8/hkUcegSAIaGtrw+rVq/Hss8
9i1apVKfs/++yzWLFiBZYsWYJf/epXCAQCePLJJ3Hqqadi06ZNiRj7kh8c6fLLL8fo0aPx8MMPY+PGjXj66adRWFi
      "text/plain": [
       "<Figure size 1600x600 with 2 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "import seaborn as sns\n",
    "\n",
    "# Project the scaled features onto the first two principal components for plotting\n",
    "pca = PCA(n_components=2)\n",
    "reduced_data = pca.fit_transform(data_reduced_scaled)\n",
    "\n",
    "plt.figure(figsize=(16, 6))\n",
    "\n",
    "# Left panel: the projection colored by the labels in result (agglomerative clustering, per the title)\n",
    "plt.subplot(1, 2, 1)\n",
    "sns.scatterplot(x=reduced_data[:, 0], y=reduced_data[:, 1], hue=result, palette='Set1', alpha=0.6)\n",
    "plt.title('PCA reduced data: Agglomerative Clustering')\n",
    "\n",
    "# Right panel: the same projection colored by the KMeans labels\n",
    "plt.subplot(1, 2, 2)\n",
    "sns.scatterplot(x=reduced_data[:, 0], y=reduced_data[:, 1], hue=labels, palette='Set1', alpha=0.6)\n",
    "plt.title('PCA reduced data: KMeans Clustering')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Inertia analysis for the elbow method (within-cluster sum of squared distances)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {},
   "outputs": [
    {
     "data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA2wAAAIjCAYAAAB/FZhcAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAACS2UlEQVR4nOzdeVxU9f7H8ffMsK8CyqIi4pKKSyYmUllZKC55tWzRrMxsM62Ubovdrkt182a/0hbT26YteivvvW1qKplpJeFe7rmguLCo7CD7/P5AJidAEYEZ4PV8PHjgnPM9Zz4z84V4d77n+zWYzWazAAAAAAB2x2jrAgAAAAAAlSOwAQAAAICdIrABAAAAgJ0isAEAAACAnSKwAQAAAICdIrABAAAAgJ0isAEAAACAnSKwAQAAAICdIrABAAAAgJ0isAEAAACAnSKwAQAA2KFFixbJYDBo8+bNti4FgA0R2ADgIpT/AWUwGPTTTz9V2G82mxUcHCyDwaCbbrrJBhUCAIDGhMAGADXg4uKiJUuWVNi+bt06HTt2TM7OzjaoCgAANDYENgCogSFDhmjp0qUqLi622r5kyRKFh4crMDDQRpUBAIDGhMAGADUwevRonT59WrGxsZZthYWF+s9//qM777yz0mNKS0s1d+5cde3aVS4uLgoICNBDDz2k9PR0S5u2bdtahlxW9tW2bVtL29zcXD3xxBMKDg6Ws7OzOnXqpP/7v/+T2Wyu8Nw//PBDleesrnvvvbfS42fMmGHV7vvvv1e/fv3k7u6uZs2aafjw4dqzZ49VmxkzZlR47rVr18rZ2VkPP/ywVZvzff3www+W4+fPn69u3brJzc3Nqs1//vOfar2+66+/vlqvT7IeGnvu1/XXX2/Vbtu2bRo0aJBatGhh1a46w2Wr+/lWp78cPnz4gu/lvffea/XaDh8+bHmO0tJS9ejRQwaDQYsWLbJsL+8TPXv2rFD/rFmzZDAY5OHhYbV94cKFuuGGG+Tv7y9nZ2eFhYVp/vz5lb4HVfXbc38Oyttc6HMu70+nTp2y2r558+YKr0u6uH785y8HBwerdkuXLlV4eLhcXV3VvHlz3XXXXTp+/Ph5661Kenq6+vTpo9atW2vfvn01OgeAhsXhwk0AAH/Wtm1bRUZG6t///rcGDx4sSfr222+VmZmpUaNG6Y033qhwzEMPPaRFixZp3Lhxeuyxx5SQkKC33npL27Zt088//yxHR0fNnTtXOTk5kqQ9e/bopZde0rPPPqsuXbpIkuWPX7PZrL/85S9au3atxo8fr549e2rVqlV68skndfz4cc2ZM6fSuh977DFdeeWVkqSPPvrIKnBWR/Pmza3Offfdd1vt/+677zR48GC1a9dOM2bM0JkzZ/Tmm2/q6quv1tatW63+0D7Xr7/+qhEjRmjIkCGaN2+eJOmWW25Rhw4dLG2mTJmiLl266MEHH7RsK39fPvvsMz3yyCO6/vrr9eijj8rd3d3y/l2M1q1ba9asWZKknJwcTZgw4bzt58yZo+bNm0uS/vGPf1jty8zM1ODBg2U2mxUTE6Pg4GDL67iQi/18BwwYoHvuucdq26uvvmr5nwEtWrTQxx9/bNn3v//9T1988YXVtvbt21dZz8cff6wdO3ZUus/BwUG7du3Stm3bdMUVV1i2L1q0SC4uLhXaz58/X127dtVf/vIXOTg46JtvvtEjjzyi0tJSTZw4sdLnOPdn4J133lFiYmKVtdaGi+3H8+fPtwqmRuMf/z+8/Gf+yiuv1KxZs5SSkqLXX39dP//8s7Zt26ZmzZpVu65Tp05pwIABSktL07p16877mQFoRMwAgGpbuHChWZJ506ZN5rfeesvs6elpzsvLM5vNZvNtt91m7t+/v9lsNptDQkLMQ4cOtRz3448/miWZFy9ebHW+lStXVrrdbDab165da5ZkXrt2bYV9X375pVmS+cUXX7Tafuutt5oNBoP5wIEDVttXr15tlmT+z3
/+Y9k2ceJE88X8Z2DMmDHm0NBQq22SzNOnT7c87tmzp9nf3998+vRpy7Zff/3VbDQazffcc49l2/Tp0y3PffjwYXNQUJD5mmuuMZ85c6bK5w8JCTGPHTu20n2jR482N2vWzOr48vdv6dKl1Xp9V111lblbt26WxydPnqzw+sq9++67ZknmI0eOWLZdd9115uuuu87yeNWqVWZJ5n//+98VXse5faMyF/P5SjJPnDixwjmGDh1qDgkJqfT8577/f1bexxMSEsxms9mcn59vbtOmjXnw4MFmSeaFCxda2o4dO9bs7u5uHjZsmHnSpEmW7T/++KPZ1dXVPGLECLO7u7vV+ct/Xs4VHR1tbteuXYXtsbGxZknmdevWWT3nua+rup9z+Ws+efKk1fZNmzZVeF0X24//fM5yhYWFZn9/f3O3bt2s+uayZcvMkszTpk07b83n/r5JSkoyd+3a1dyuXTvz4cOHz3scgMaFIZEAUEO33367zpw5o2XLlik7O1vLli2rcjjk0qVL5e3trQEDBujUqVOWr/DwcHl4eGjt2rUX9dwrVqyQyWTSY489ZrX9iSeekNls1rfffmu1PT8/X5IqveJRXYWFheedTCUpKUnbt2/XvffeK19fX8v2Hj16aMCAAVqxYkWFY06fPq3o6Gh5enrq66+/rnF92dnZcnNzu6TXl5+fX+3jCwsLJem870d2drYkyc/P76JrudjPty7NmzdPp0+f1vTp06tsc99992nJkiUqKCiQVDbs8ZZbbpG3t3eFtq6urpZ/Z2Zm6tSpU7ruuut06NAhZWZmWrWtzvtcLjs7W6dOnVJGRsZ526WlpVn9DP75OWvSj6uyefNmpaam6pFHHrHqW0OHDlXnzp21fPnyap3n2LFjuu6661RUVKT169crJCSk2jUAaPgIbABQQy1atFBUVJSWLFmi//3vfyopKdGtt95aadv9+/crMzNT/v7+atGihdVXTk6OUlNTL+q5jxw5opYtW8rT09Nqe/mwsSNHjlhtL79vp7I/oKsrIyOjwv1If65Jkjp16lRhX5cuXXTq1Cnl5uZabb/pppu0b98+ZWRkVHrvXXVFRkbqxIkTmjFjhhITEyv9Q/xCTp06Ve33pzwUnO/96N27txwdHTVjxgxt27bNEhBKS0sveP6L/XzrSmZmpl566SXFxMQoICCgynZDhw6Vg4ODvvrqK+Xm5urzzz/XuHHjKm37888/KyoqynJvWIsWLfTss89anu9c1Xmfy913331q0aKFfHx85OnpqTvvvFMpKSkV2nXq1Mnq5y8qKspqf036cVXOd67OnTtX+3O8++67lZqaqnXr1qlVq1bVOgZA48E9bABwCe6880498MADSk5O1uDBg6u8H6W0tFT+/v5avHhxpftbtGhRh1XKMoFEVfeQVUdycnKt/5/9vXv36ttvv9Xtt9+uJ554QgsXLqzReaZMmaJ9+/bphRde0MyZMy/6+MLCQiUlJWnAgAHVap+cnCwPDw+5u7tX2SYkJEQLFy7U448/rl69elnt69Gjx0XXaAsvv/yyjEajnnzySZ0+fbrKdo6Ojrrrrru0cOFC5eXlyc/PTzfccIPVPXKSdPDgQd14443q3LmzXnvtNQUHB8vJyUkrVqzQnDlzKoTZ5ORkSarWrKvTpk1Tv379VFRUpC1btuj5559XRkZGhSti//3vf+Xl5WV5/Pvvv1d575y9uOWWW/TRRx/p9ddft9xjCaDpILABwCW4+eab9dBDD+mXX37RZ599VmW79u3b67vvvtPVV19tNSSspkJCQvTdd98pOzvb6irM3r17LfvPtXnzZgUGBqp169Y1er6ioiIdOHBAgwYNOm9NkiqduW7v3r1q3rx5hYDz9ddfq1+/fpo1a5YmTZqku+66SzfeeONF1+fq6qp3331X27Ztk7e3t6ZPn65ff/1Vf/3rX6t1/K+//qqioiL17t27Wu13795tudp1PmPGjFFiYqJmzpypjz/+WD4+Pr
rrrrsueNzFfr514cSJE5aA4Onped7AJpVd4br88st19OhRjR07ttIZSL/55hsVFBTo66+/Vps2bSzbqxoSvHv3brV
      "text/plain": [
       "<Figure size 1000x600 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Fit KMeans for k = 1..22 and record the inertia (within-cluster sum of squared distances)\n",
    "inertias = []\n",
    "clusters_range = range(1, 23)\n",
    "for i in clusters_range:\n",
    "    kmeans = KMeans(n_clusters=i, random_state=RANDOM_STATE)\n",
    "    kmeans.fit(data_reduced_scaled)\n",
    "    inertias.append(kmeans.inertia_)\n",
    "\n",
    "# Plot inertia against k; the elbow of the curve suggests a suitable number of clusters\n",
    "plt.figure(figsize=(10, 6))\n",
    "plt.plot(clusters_range, inertias, marker='o')\n",
    "plt.title('Elbow method for choosing the optimal k')\n",
    "plt.xlabel('Number of clusters')\n",
    "plt.ylabel('Inertia')\n",
    "plt.grid(True)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Computing silhouette coefficients"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {},
   "outputs": [
    {
     "data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA1YAAAIjCAYAAAAAxIqtAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAACR9ElEQVR4nOzdd3hU1dbH8d9k0islPbTQQZqAICqCdEQFRVQsFDvKFeV6USxUFRUrXoTXioiFiwXBAlKtCErv0hFCElo6qXPeP8IMhNRJZjKZ8P08Tx6SM+ecWWfnZJg1e++1TYZhGAIAAAAAlJuHqwMAAAAAAHdHYgUAAAAAFURiBQAAAAAVRGIFAAAAABVEYgUAAAAAFURiBQAAAAAVRGIFAAAAABVEYgUAAAAAFURiBQAAAAAVRGIFAABQhUyaNEkmk0knTpxwdSgA7EBiBaBKmTNnjkwmk/76669Cj7377rsymUwaNGiQ8vLyKiWe6667Tg0aNLD7uNGjR8tkMjk+IAAAUCWRWAFwC19//bVGjRqlrl276vPPP5fZbHZ1SAAAADYkVgCqvNWrV2vo0KFq2bKlFi9eLF9fX1eHBAAAUACJFYAqbdOmTRo4cKCioqK0dOlShYSEFNpnwYIF6tChg/z8/BQaGqo777xTR48etT1+9OhRDR06VDExMfLx8VHDhg01btw4paamFjrXxx9/rLp166pGjRqaNm2abfv8+fMVHR2t0NBQvfTSS4WOW7p0qZo2barAwEA98sgjMgxDUn5S2KhRIwUHB2vs2LEFhjCuXr1aJpNJq1evLnCuAQMGyGQyadKkSbZtxc25+Ouvv2QymTRnzhzbtoMHDxbaJkkPP/ywTCaTRowYUWB7UlKSHn30UdWtW1c+Pj5q3LixXnrpJVkslkLnfOWVVwpde6tWrdS9e/cC11TSl/W6yjKPpEGDBoXiLYrFYtGbb76p1q1by9fXV2FhYerXr1+BIaUXtqkkTZ8+XSaTyRb/+UaMGFFi/B9++KFMJpM2btxY6NgXXnhBZrNZR48e1fbt23XDDTcoIiJCPj4+atGihZ5//nnl5OSU+lznfx08eFCS9M0332jAgAGKjo6Wj4+PGjVqpKlTp9o1PLZ79+4lXtv5rMNzL/y6sM02btyofv36KSwsrMB+1113XYmxnH9vvf7666pfv778/PzUrVs3bdu2rcC+W7Zs0YgRI9SwYUP5+voqMjJSd999t06ePFlgv1mzZqlt27YKCQlRQECA2rZtq/fff7/APiNGjFBgYGCheL744otCf5fdu3dXq1atSr0G699cYmKiwsLC1L17d9trgSTt3btXAQEBuvXWW0tsk6IcOnRIjRs3VqtWrZSQkGD38QCcz9PVAQBAcfbt26d+/frJx8dHS5cuVVRUVKF95syZo5EjR+qyyy7TtGnTlJCQoDfffFO//fabNm7cqBo1amjfvn1KSEjQv/71L9WsWVPbt2/XjBkztGLFCv3666/y8/OTJP32228aPny4rrjiCg0dOlQff/yx9u/frzNnzmjKlCl66qmn9OOPP+rJJ59UvXr1NHToUEnS/v37NWjQIDVu3FgvvPCClixZYntD//DDD+tf//qXNm7cqNdff11hYWEaP358sdf8888/6/vvv3d4W+7du1fvvvtuoe0ZGRnq1q2bjh49qgceeED16tXT77//rvHjx+vYsWN644037HqeFi1a6OOPP7b9/M4772jnzp16/fXXbdvatGlT7usozj333KM5c+aof//+uvfee5Wbm6tffvlFf/zxhzp27FjkMUlJSQWS56KEhoYWiP2uu+6yfX/zzTfr4Ycf1ieffKJLL720wHGffPKJunfvrpiYGFuy+Z///EcBAQH6888/NWHCBP3+++9avHixPDw89MADD6hXr14FnufGG2/UTTfdZNsWFhYmKf+eDwwM1NixYxUYGKiVK1dqwoQJSklJ0fTp08vcZnXq1LFdf1pamk
aNGlXi/q+//rpCQ0MlSc8//3yBx5KTk9W/f38ZhqGxY8eqbt26kqTHHnuszPHMnTtXqampevjhh5WZmak333xTPXr00NatWxURESFJWrZsmfbv36+RI0cqMjJS27dv1zvvvKPt27frjz/+sM1rTE1NVZ8+fdSoUSMZhqH//e9/uvfee1WjRg0NHjy4zDGVV3h4uGbNmqUhQ4borbfe0iOPPCKLxaIRI0YoKChIb7/9tl3n27dvn3r06KFatWpp2bJltt8DgCrGAIAq5MMPPzQkGd9++63RqFEjQ5LRp0+fIvfNzs42wsPDjVatWhlnzpyxbf/2228NScaECROKfZ5ly5YZkowpU6bYtt1www1GbGyskZmZaRiGYaSmphqxsbGGv7+/sX//fsMwDMNisRhXXnml0bZtW9txjzzyiBEUFGScOHHCMAzDyMnJMS6//HJDkrF27VrbfkOHDjXCw8Nt51+1apUhyVi1apVtn86dOxv9+/c3JBkTJ060bZ84caIhyTh+/HiB6/jzzz8NScaHH35o23bgwIFC22655RajVatWRt26dY3hw4fbtk+dOtUICAgw/v777wLnffLJJw2z2WwcPny4wDmnT59eqC0vueQSo1u3boW2G4ZhDB8+3Khfv36RjxV3TeerX79+gXiLsnLlSkOS8cgjjxR6zGKx2L6/sE3HjRtnhIeHGx06dCgy/jvuuMOIjY0tsO3CcwwdOtSIjo428vLybNs2bNhQqP0v9O677xqSjLlz5xb5+IXPc76MjIxC2x544AHD39/fdm+V5oorrjBatWpl+/n48ePFPqc11kOHDtm2devWrUCbLV261JBkfPbZZwWOrV+/vjFgwIASY7HeW35+fsaRI0ds29euXWtIMh577DHbtqKu/bPPPjMkGT///HOxz5Gbm2sEBwcbo0ePtm0bPny4ERAQUGjfBQsWFPq77Natm3HJJZeUeg0X/s6HDh1q+Pv7G3///bcxffp0Q5KxcOHCYs9jdf7fxs6dO43o6GjjsssuM06dOlXqsQBch6GAAKqkESNG6J9//tHtt9+uH3/8UQsWLCi0z19//aXExEQ99NBDBeZdDRgwQM2bN9d3331n25aTk6MTJ07Yvtq1a6eOHTsWOO+KFSt07bXXysfHR5IUGBioli1bKiwsTLGxsZJkq0q4efNm2/CjFStW6Oqrr1bt2rUlSZ6enurQoYMkqVOnTrbz33TTTUpMTCw0vMnqq6++0p9//qkXX3yxXG1WnPXr12vBggWaNm2aPDwKvuwvWLBAXbt2Vc2aNQu0T69evZSXl6eff/65wP4ZGRkF9jtx4kSFKzSeOnVKJ06cUHp6ermO//LLL2UymTRx4sRCjxVXmfHo0aN666239OyzzxY5HEySsrOzbfdCcYYNG6a4uDitWrXKtu2TTz6Rn59fgZ6RrKysAm02aNAgRUREFHlfl8bawyrl98ycOHFCXbt2VUZGhnbt2lWmc2RmZpZ5rmJ2drYkldgW1mG11r+B8hg0aJBiYmJsP3fq1EmdO3cu0IN7/rVnZmbqxIkTuvzyyyVJGzZsKHC+vLw8nThxQocOHdLrr7+ulJQUde3atdDzXng/FzVE+PzznThxwtYmpfnvf/+rkJAQ3XzzzXr22Wd11113aeDAgWU6VpK2bdumbt26qUGDBlq+fLlq1qxZ5mMBVD4SKwBV0qlTpzRv3jx99NFHateuncaMGaPk5OQC+xw6dEiS1KxZs0LHN2/e3Pa4lD/MLywsrMDXX3/9pb1790qSTp8+rfT09AJv7Ipj3eeff/6x/Vue486Xl5enp556SnfccYfDh8o9+eST6tq1a5FzXfbs2aMlS5YUahvrsLTExMQC+0+cOLHQvmV9M1+cZs2aKSwsTIGBgYqIiNAzzzxjV7K2b98+RUdHq1atWmU+ZuLEiYqOjtYDDzxQ7D5JSUnFJl1WvXv3VlRUlD755BNJ+XO9PvvsMw0cOFBBQUG2/T777LNC7Z
aQkGC7/+yxfft23XjjjQoJCVFwcLDCwsJ05513SlKhv5HinDhxosj5ikVJSkqSpBLbomPHjvLy8tKkSZO0ceNGWwJ
      "text/plain": [
       "<Figure size 1000x600 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from sklearn.metrics import silhouette_score\n",
    "\n",
    "# The silhouette score is undefined for a single cluster, so start from k = 2\n",
    "silhouette_scores = []\n",
    "for i in clusters_range[1:]:\n",
    "    kmeans = KMeans(n_clusters=i, random_state=RANDOM_STATE)\n",
    "    # Use a loop-local name so the earlier labels variable is not overwritten\n",
    "    cluster_labels = kmeans.fit_predict(data_reduced_scaled)\n",
    "    score = silhouette_score(data_reduced_scaled, cluster_labels)\n",
    "    silhouette_scores.append(score)\n",
    "\n",
    "plt.figure(figsize=(10, 6))\n",
    "plt.plot(clusters_range[1:], silhouette_scores, marker='o')\n",
    "plt.title('Silhouette coefficients for different k')\n",
    "plt.xlabel('Number of clusters')\n",
    "plt.ylabel('Silhouette coefficient')\n",
    "plt.grid(True)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean silhouette coefficient: 0.282\n"
     ]
    },
    {
     "data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA1kAAAJwCAYAAAB71at5AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3hUVfoH8O/0mUxNnfQeWkLvSBUQEAuooKgL2Laoq65tV3fXurusZS1r198u6FpXQVYRkSJFQKnSW3rvZUrK1Pv7I2ZkmElIwqTy/TxPHs05d+59ZzIJ951zzntEgiAIICIiIiIiooAQ93QARERERERE/QmTLCIiIiIiogBikkVERERERBRATLKIiIiIiIgCiEkWERERERFRADHJIiIiIiIiCiAmWURERERERAHEJIuIiIiIiCiAmGQREREREREFEJMsIiIiIiKiAGKSRUR+rVq1CiKRyOsrIiICM2bMwNdff93T4RER9YiWv4379+/3ajeZTBg3bhyUSiU2bNjQ5mNFIhF27tzp0y8IAuLi4iASiXDFFVd0SfxE1D2kPR0AEfVuTz31FJKSkiAIAsrLy7Fq1Spcfvnl+PLLL3kTQEQEwGw247LLLsORI0fw+eefY+7cuW0er1Qq8eGHH2Ly5Mle7du3b0dRUREUCkVXhktE3YBJFhG1ad68eRgzZozn+9tuuw1GoxEfffQRkywiuuhZLBbMmTMHhw4dwpo1azBv3rzzPubyyy/Hp59+in/+85+QSn++Ffvwww8xevRoVFVVdWXIRNQNOF2QiDrEYDBApVJ53Rjk5eVBJBJh1apVXsfeddddEIlEWL58uadtzZo1GDduHEJCQqBSqTBo0CA888wzEAQBALB161aIRCJ8/vnnPtf+8MMPIRKJ8P333wMAjhw5guXLlyM5ORlKpRKRkZG49dZbUV1d7Tf2xMREnymQIpEI27Zt8zrm7HgB4NNPP4VIJEJiYqKn7fTp07j00ksRGRkJhUKBuLg4/PrXv0ZNTY3nGLvdjsceewyjR4+GXq+HWq3GlClTsHXrVq/zt7x+zz//vE/MGRkZmD59ulfb9OnTfdr27dvneT5ns1qteOCBB5CcnAyZTOb1vM93I+fvOn/9618hFovx4Ycf+n0O/r7O9vzzz2PSpEkIDQ2FSqXC6NGj8dlnn/m9/vvvv49x48YhKCgIwcHBmDp1KjZu3Aig9Z9ly9fZPyu3242XXnoJ6enpUCqVMBqN+NWvfoXa2lqv6yUmJuKKK67Axo0bMWLECCiVSgwZMgRr1qzxiS0nJweLFi1CSEgIgoKCMGHCBHz11Vdex2zbts0rJoVCgQEDBmDFihWe93tbmpqa8MQTT2DAgAFQKpWIiorCNddcg+zs7DYfd77X5mxOpxNPP/00UlJSoFAokJiYiEcffRQ2m83vOe+77z6f682ZM8fv9LaKigrPhzJKpRLDhw/Hu+++63VMa+99f+/7559/HiKRCHl5eZ62J554ol3v5eXLl3u9JwDgpZdewqBBg6BQKBAZGYlf/epXXr+/7WG1WjF37lwcPHgQq1evxvz589v1uCVLlqC6uhqbNm3ytNntdnz22We48cYb/T6mve/j//3vf5g/fz6io6OhUCiQkpKCp59+Gi6Xy+u46dOnIyMjAydOnMCMGTMQFBSEmJgYPPvssz7XfuWVV5Cenu75XRwzZozP3wAi8saRLCJqk8lkQlVVFQRBQEVFBV555RVYrVbcfPPNbT4uKysL77zzjk+72WzG+PHjsWzZMshkMmzYsAF/+MMfIJVK8cADD2D69OmIi4vDBx98gIULF3o99oMPPkBKSgomTpwIANi0aRNycnJwyy23IDIyEsePH8fbb7+N48eP44cffvC5oQSAKVOm4Je//CUA4OTJk/jb3/7W5vNwOp344x//6NNeX1+P2NhYXHnlldDpdDh27Bhee+01FBcX48svv/Q81//7v//DkiVLcMcdd8
BiseBf//oX5syZg71792LEiBFtXrsjfv/73/ttf+ihh/Dmm2/itttuwyWXXAKZTIY1a9b4TWLPZ+XKlfjTn/6Ef/zjH63eCP7yl7/ElClTAMDvdV5++WVcddVVuOmmm2C32/Hxxx9j0aJFWLdundcN6pNPPoknnngCkyZNwlNPPQW5XI49e/bg22+/xWWXXYaXXnoJVqsVwM8/x0cffRSDBw8GAGg0Gs+5fvWrX2HVqlW45ZZbcM899yA3NxevvvoqfvzxR+zatQsymcxzbGZmJq6//nr8+te/xrJly7By5UosWrQIGzZswOzZswEA5eXlmDRpEhoaGnDPPfcgNDQU7777Lq666ip89tlnPu/blrgaGxvxySef4NFHH0VERARuu+22Vl9rl8uFK664Alu2bMENN9yAe++9FxaLBZs2bcKxY8eQkpLS5s9qxIgReOCBB7za3nvvPa+begC4/fbb8e677+K6667DAw88gD179mDFihU4efKkz89OqVTigw8+wHPPPed5zYqKirBlyxYolUqvYxsbGzF9+nRkZWXh7rvvRlJSEj799FMsX74cdXV1uPfee9uMv6v97W9/wx//+EdMnToVd911l+c9sWfPHuzZs6dd0/Xq6+sxb9487Nu3D5999lmHRvYTExMxceJEfPTRR56Rr6+//homkwk33HAD/vnPf/o8pr3v41WrVkGj0eD++++HRqPBt99+i8ceewxmsxnPPfec1zlra2sxd+5cXHPNNVi8eDE+++wz/P73v8fQoUM9cb3zzju45557cN111+Hee+9FU1MTjhw5gj179rT6d4CIAAhERH6sXLlSAODzpVAohFWrVnkdm5ubKwAQVq5c6WlbvHixkJGRIcTFxQnLli1r81pDhgwRrrjiCs/3jzzyiKBQKIS6ujpPW0VFhSCVSoXHH3/c09bQ0OBzro8++kgAIOzYscOnLyYmRrjllls832/dulUAIGzdutXTlpCQ4BXv66+/LigUCmHGjBlCQkJCm8/jzjvvFDQajed7p9Mp2Gw2r2Nqa2sFo9Eo3HrrrZ62ltfvueee8zlnenq6MG3aNK+2adOmebWtX79eACDMnTtXOPfPelRUlDBnzhyvtscff1wAIFRWVrb5fM6+zldffSVIpVLhgQce8HtsZmamAEB49913fa5ztnN/Zna7XcjIyBAuvfRSr3OJxWJh4cKFgsvl8jre7Xb7XNvfz7HFd999JwAQPvjgA6/2DRs2+LQnJCQIAITVq1d72kwmkxAVFSWMHDnS03bfffcJAITvvvvO02axWISkpCQhMTHRE7O/uJqamgSxWCzceeedPrGe7d///rcAQHjhhRd8+vy9BmdLSEgQ5s+f79N+1113ef08Dh06JAAQbr/9dq/jHnzwQQGA8O2333qdc/bs2UJYWJjw2WefedqffvppYdKkST7XfOmllwQAwvvvv+9ps9vtwsSJEwWNRiOYzWZBEFp/7/t73z/33HMCACE3N9fT1t738rJlyzy/v5WVlYJSqRQmT54sOBwOzzGrVq0SAAivvPJKm+dq+duYkJAgyGQyYe3atW0e7++x+/btE1599VVBq9V6ficWLVokzJgxQxAE359hR97H/v4u/upXvxKCgoKEpqYmT9u0adMEAMJ7773nabPZbEJkZKRw7bXXetquvvpqIT09vd3PkYiacbogEbXptddew6ZNm7Bp0ya8//77mDFjBm6//Xa/U6haHDhwAJ9++ilWrFgBsdj/n5mqqioUFRVh1apVyMrKwtSpUz19S5cuhc1m85pG9sknn8DpdHqNoKlUKs//NzU1oaqqChMmTAAAHDx40Oeadru9QwvKGxoa8NRTT+Huu+9GfHy832NMJhPKy8uxZcsWfPXVV17PQyKRQC6XA2ie6lNTUwOn04kxY8b4ja8zBEHAI488gmuvvRbjx4/36bdYLAgNDb2ga+zduxeLFy/Gtdde6/NJeAu73Q4A5319z/6Z1dbWwmQyYc
qUKV6vx9q1a+F2u/HYY4/5vH/8jU625dNPP4Ver8fs2bNRVVXl+Ro9ejQ0Go3P1M3o6GivkSidToelS5fixx9/RFl
      "text/plain": [
       "<Figure size 1000x700 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Final model: KMeans with 3 clusters on the scaled, reduced feature set\n",
    "kmeans = KMeans(n_clusters=3, random_state=9)\n",
    "df_clusters = kmeans.fit_predict(data_reduced_scaled)\n",
    "\n",
    "silhouette_avg = silhouette_score(data_reduced_scaled, df_clusters)\n",
    "print(f'Mean silhouette coefficient: {silhouette_avg:.3f}')\n",
    "\n",
    "# Project onto two principal components to visualize the cluster assignments\n",
    "pca = PCA(n_components=2)\n",
    "df_pca = pca.fit_transform(data_reduced_scaled)\n",
    "\n",
    "plt.figure(figsize=(10, 7))\n",
    "sns.scatterplot(x=df_pca[:, 0], y=df_pca[:, 1], hue=df_clusters, palette='viridis', alpha=0.7)\n",
    "plt.title('Cluster visualization with K-Means')\n",
    "plt.xlabel('First PCA component')\n",
    "plt.ylabel('Second PCA component')\n",
    "plt.legend(title='Cluster', loc='upper right')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The clusters overlap only partially in the PCA projection; together with the mean silhouette coefficient of 0.282, this suggests a reasonable, though not sharply separated, clustering for this dataset."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}