
2024-12-21 13:45:48 +04:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Variant 2. Heart Disease Indicators"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataset contains data collected in the CDC's annual health survey of more than 400,000 adults in the US. It includes information on various risk factors for heart disease, such as hypertension, high cholesterol, smoking, diabetes, obesity, lack of physical activity and alcohol abuse. It also contains data on the respondents' health status, chronic conditions (for example diabetes, arthritis, asthma), level of physical activity, mental health, and social and demographic characteristics such as sex, age, ethnicity and place of residence. The dataset can be used to analyse and predict the risk of heart disease, and to design prevention and public health programs."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Business Goal\n",
"\n",
"Develop and deploy a machine learning model that clusters patients based on their health data in order to identify groups at elevated risk of cardiovascular disease (CVD), such as heart attacks and heart failure. This should help optimize preventive measures, improve resource management in medical institutions, and build effective long-term monitoring programs.\n",
"\n",
"### Key aspects of the business goal:\n",
"\n",
"1. **Early detection and prevention of cardiovascular disease:**\n",
"   - **Goal:** Use a machine learning model to identify patients at high risk of CVD at an early stage.\n",
"   - **Benefits:** Timely intervention and preventive measures such as better nutrition, more physical activity, and stress management.\n",
"   - **Outcome:** Fewer heart attacks and cases of heart failure thanks to preventive action.\n",
"\n",
"2. **Optimizing the work of cardiology centers:**\n",
"   - **Goal:** Allocate resources (cardiologists, equipment, diagnostic tests) efficiently based on clustering patients by risk level.\n",
"   - **Benefits:** More rational use of resources, shorter waiting times for high-risk patients, and better quality of care.\n",
"   - **Outcome:** More efficient cardiology centers and better clinical outcomes.\n",
"\n",
"3. **Personalized monitoring programs:**\n",
"   - **Goal:** Develop long-term monitoring and early-intervention programs for patients at elevated risk of CVD.\n",
"   - **Benefits:** A personalized approach to monitoring that accounts for the characteristics and needs of each patient group.\n",
"   - **Outcome:** Regular monitoring of patients' health and timely treatment to prevent complications.\n",
"\n",
"4. **Improving clinical and economic indicators:**\n",
"   - **Goal:** Reduce the cost of treating cardiovascular disease through early detection and prevention.\n",
"   - **Benefits:** Fewer expensive hospitalizations and surgeries, and a more efficient healthcare system overall.\n",
"   - **Outcome:** Significantly lower CVD treatment costs and better financial performance of medical institutions.\n",
"\n",
"5. **Higher patient satisfaction:**\n",
"   - **Goal:** Provide patients with high-quality medical services and personalized care.\n",
"   - **Benefits:** Greater trust in medical institutions, and better overall well-being and quality of life for patients.\n",
"   - **Outcome:** High patient satisfaction and active participation in prevention and monitoring programs.\n",
"\n",
"Thus, a machine learning model for clustering patients with cardiovascular problems would help not only to improve clinical and economic indicators but also to raise patient satisfaction by providing high-quality, timely medical care."
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [],
"source": [
"from typing import Any, List\n",
"import math \n",
"import pandas as pd\n",
"from pandas import DataFrame, Series\n",
"from pprint import pprint\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"RANDOM_STATE = 34"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 246022 entries, 0 to 246021\n",
"Data columns (total 40 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 State 246022 non-null object \n",
" 1 Sex 246022 non-null object \n",
" 2 GeneralHealth 246022 non-null object \n",
" 3 PhysicalHealthDays 246022 non-null float64\n",
" 4 MentalHealthDays 246022 non-null float64\n",
" 5 LastCheckupTime 246022 non-null object \n",
" 6 PhysicalActivities 246022 non-null object \n",
" 7 SleepHours 246022 non-null float64\n",
" 8 RemovedTeeth 246022 non-null object \n",
" 9 HadHeartAttack 246022 non-null object \n",
" 10 HadAngina 246022 non-null object \n",
" 11 HadStroke 246022 non-null object \n",
" 12 HadAsthma 246022 non-null object \n",
" 13 HadSkinCancer 246022 non-null object \n",
" 14 HadCOPD 246022 non-null object \n",
" 15 HadDepressiveDisorder 246022 non-null object \n",
" 16 HadKidneyDisease 246022 non-null object \n",
" 17 HadArthritis 246022 non-null object \n",
" 18 HadDiabetes 246022 non-null object \n",
" 19 DeafOrHardOfHearing 246022 non-null object \n",
" 20 BlindOrVisionDifficulty 246022 non-null object \n",
" 21 DifficultyConcentrating 246022 non-null object \n",
" 22 DifficultyWalking 246022 non-null object \n",
" 23 DifficultyDressingBathing 246022 non-null object \n",
" 24 DifficultyErrands 246022 non-null object \n",
" 25 SmokerStatus 246022 non-null object \n",
" 26 ECigaretteUsage 246022 non-null object \n",
" 27 ChestScan 246022 non-null object \n",
" 28 RaceEthnicityCategory 246022 non-null object \n",
" 29 AgeCategory 246022 non-null object \n",
" 30 HeightInMeters 246022 non-null float64\n",
" 31 WeightInKilograms 246022 non-null float64\n",
" 32 BMI 246022 non-null float64\n",
" 33 AlcoholDrinkers 246022 non-null object \n",
" 34 HIVTesting 246022 non-null object \n",
" 35 FluVaxLast12 246022 non-null object \n",
" 36 PneumoVaxEver 246022 non-null object \n",
" 37 TetanusLast10Tdap 246022 non-null object \n",
" 38 HighRiskLastYear 246022 non-null object \n",
" 39 CovidPos 246022 non-null object \n",
"dtypes: float64(6), object(34)\n",
"memory usage: 75.1+ MB\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count</th>\n",
" <th>mean</th>\n",
" <th>std</th>\n",
" <th>min</th>\n",
" <th>25%</th>\n",
" <th>50%</th>\n",
" <th>75%</th>\n",
" <th>max</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>PhysicalHealthDays</th>\n",
" <td>246022.0</td>\n",
" <td>4.119026</td>\n",
" <td>8.405844</td>\n",
" <td>0.00</td>\n",
" <td>0.00</td>\n",
" <td>0.00</td>\n",
" <td>3.00</td>\n",
" <td>30.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>MentalHealthDays</th>\n",
" <td>246022.0</td>\n",
" <td>4.167140</td>\n",
" <td>8.102687</td>\n",
" <td>0.00</td>\n",
" <td>0.00</td>\n",
" <td>0.00</td>\n",
" <td>4.00</td>\n",
" <td>30.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>SleepHours</th>\n",
" <td>246022.0</td>\n",
" <td>7.021331</td>\n",
" <td>1.440681</td>\n",
" <td>1.00</td>\n",
" <td>6.00</td>\n",
" <td>7.00</td>\n",
" <td>8.00</td>\n",
" <td>24.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>HeightInMeters</th>\n",
" <td>246022.0</td>\n",
" <td>1.705150</td>\n",
" <td>0.106654</td>\n",
" <td>0.91</td>\n",
" <td>1.63</td>\n",
" <td>1.70</td>\n",
" <td>1.78</td>\n",
" <td>2.41</td>\n",
" </tr>\n",
" <tr>\n",
" <th>WeightInKilograms</th>\n",
" <td>246022.0</td>\n",
" <td>83.615179</td>\n",
" <td>21.323156</td>\n",
" <td>28.12</td>\n",
" <td>68.04</td>\n",
" <td>81.65</td>\n",
" <td>95.25</td>\n",
" <td>292.57</td>\n",
" </tr>\n",
" <tr>\n",
" <th>BMI</th>\n",
" <td>246022.0</td>\n",
" <td>28.668136</td>\n",
" <td>6.513973</td>\n",
" <td>12.02</td>\n",
" <td>24.27</td>\n",
" <td>27.46</td>\n",
" <td>31.89</td>\n",
" <td>97.65</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" count mean std min 25% 50% \\\n",
"PhysicalHealthDays 246022.0 4.119026 8.405844 0.00 0.00 0.00 \n",
"MentalHealthDays 246022.0 4.167140 8.102687 0.00 0.00 0.00 \n",
"SleepHours 246022.0 7.021331 1.440681 1.00 6.00 7.00 \n",
"HeightInMeters 246022.0 1.705150 0.106654 0.91 1.63 1.70 \n",
"WeightInKilograms 246022.0 83.615179 21.323156 28.12 68.04 81.65 \n",
"BMI 246022.0 28.668136 6.513973 12.02 24.27 27.46 \n",
"\n",
" 75% max \n",
"PhysicalHealthDays 3.00 30.00 \n",
"MentalHealthDays 4.00 30.00 \n",
"SleepHours 8.00 24.00 \n",
"HeightInMeters 1.78 2.41 \n",
"WeightInKilograms 95.25 292.57 \n",
"BMI 31.89 97.65 "
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('csv/heart_2022_no_nans.csv')\n",
"\n",
"df.info()\n",
"df.describe().transpose()"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [],
"source": [
"def get_null_columns_info(df: DataFrame) -> DataFrame:\n",
"    \"\"\"\n",
"    Return, for each column, whether it has missing values and what percentage of its rows are missing.\n",
"    \"\"\"\n",
"    rows = []\n",
"    df_len = len(df)\n",
"\n",
"    for column in df.columns:\n",
"        column_nulls = df[column].isnull()\n",
"        # Multiply by 100 so the value matches the \"Null Percent\" column name.\n",
"        rows.append([column, column_nulls.any(), column_nulls.sum() / df_len * 100])\n",
"\n",
"    return DataFrame(rows, columns=[\"Column\", \"Has Null\", \"Null Percent\"])"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Column</th>\n",
" <th>Has Null</th>\n",
" <th>Null Percent</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>State</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Sex</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>GeneralHealth</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>PhysicalHealthDays</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>MentalHealthDays</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>LastCheckupTime</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>PhysicalActivities</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>SleepHours</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>RemovedTeeth</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>HadHeartAttack</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>HadAngina</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>HadStroke</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>HadAsthma</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>HadSkinCancer</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>HadCOPD</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>HadDepressiveDisorder</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>HadKidneyDisease</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>HadArthritis</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>HadDiabetes</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>DeafOrHardOfHearing</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>BlindOrVisionDifficulty</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>DifficultyConcentrating</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>DifficultyWalking</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>DifficultyDressingBathing</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>DifficultyErrands</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>SmokerStatus</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>ECigaretteUsage</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>ChestScan</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>RaceEthnicityCategory</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>AgeCategory</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30</th>\n",
" <td>HeightInMeters</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31</th>\n",
" <td>WeightInKilograms</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32</th>\n",
" <td>BMI</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33</th>\n",
" <td>AlcoholDrinkers</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34</th>\n",
" <td>HIVTesting</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>35</th>\n",
" <td>FluVaxLast12</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36</th>\n",
" <td>PneumoVaxEver</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>37</th>\n",
" <td>TetanusLast10Tdap</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>38</th>\n",
" <td>HighRiskLastYear</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>39</th>\n",
" <td>CovidPos</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Column Has Null Null Percent\n",
"0 State False 0.0\n",
"1 Sex False 0.0\n",
"2 GeneralHealth False 0.0\n",
"3 PhysicalHealthDays False 0.0\n",
"4 MentalHealthDays False 0.0\n",
"5 LastCheckupTime False 0.0\n",
"6 PhysicalActivities False 0.0\n",
"7 SleepHours False 0.0\n",
"8 RemovedTeeth False 0.0\n",
"9 HadHeartAttack False 0.0\n",
"10 HadAngina False 0.0\n",
"11 HadStroke False 0.0\n",
"12 HadAsthma False 0.0\n",
"13 HadSkinCancer False 0.0\n",
"14 HadCOPD False 0.0\n",
"15 HadDepressiveDisorder False 0.0\n",
"16 HadKidneyDisease False 0.0\n",
"17 HadArthritis False 0.0\n",
"18 HadDiabetes False 0.0\n",
"19 DeafOrHardOfHearing False 0.0\n",
"20 BlindOrVisionDifficulty False 0.0\n",
"21 DifficultyConcentrating False 0.0\n",
"22 DifficultyWalking False 0.0\n",
"23 DifficultyDressingBathing False 0.0\n",
"24 DifficultyErrands False 0.0\n",
"25 SmokerStatus False 0.0\n",
"26 ECigaretteUsage False 0.0\n",
"27 ChestScan False 0.0\n",
"28 RaceEthnicityCategory False 0.0\n",
"29 AgeCategory False 0.0\n",
"30 HeightInMeters False 0.0\n",
"31 WeightInKilograms False 0.0\n",
"32 BMI False 0.0\n",
"33 AlcoholDrinkers False 0.0\n",
"34 HIVTesting False 0.0\n",
"35 FluVaxLast12 False 0.0\n",
"36 PneumoVaxEver False 0.0\n",
"37 TetanusLast10Tdap False 0.0\n",
"38 HighRiskLastYear False 0.0\n",
"39 CovidPos False 0.0"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_null_columns_info(df)"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [],
"source": [
"def get_filtered_columns(df: DataFrame, no_numeric=False, no_text=False) -> list[str]:\n",
"    \"\"\"\n",
"    Return the column names that pass the filter: no_numeric skips numeric columns, no_text skips non-numeric ones.\n",
"    \"\"\"\n",
"    selected = []\n",
"    for column in df.columns:\n",
"        is_numeric = pd.api.types.is_numeric_dtype(df[column])\n",
"        if no_numeric and is_numeric:\n",
"            continue\n",
"        if no_text and not is_numeric:\n",
"            continue\n",
"        selected.append(column)\n",
"    return selected"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualizing relationships"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['PhysicalHealthDays',\n",
" 'MentalHealthDays',\n",
" 'SleepHours',\n",
" 'HeightInMeters',\n",
" 'WeightInKilograms',\n",
" 'BMI']"
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"num_columns = get_filtered_columns(df, no_text=True)\n",
"\n",
"num_columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The BMI feature is derived from HeightInMeters and WeightInKilograms, so there is little point in keeping those two columns; we drop them and keep only BMI."
]
},
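{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of why the two columns are largely redundant, assuming the standard definition of BMI as weight in kilograms divided by height in metres squared. The `demo` frame and its three rows are made-up illustrative values, not records from the dataset.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical rows for illustration only (not taken from the dataset).\n",
"demo = pd.DataFrame({\n",
"    'HeightInMeters': [1.60, 1.70, 1.80],\n",
"    'WeightInKilograms': [64.0, 81.65, 95.25],\n",
"})\n",
"\n",
"# BMI is fully determined by the other two columns.\n",
"demo['BMI'] = demo['WeightInKilograms'] / demo['HeightInMeters'] ** 2\n",
"print(demo.round(2))\n",
"```\n",
"\n",
"Since BMI already combines both measurements, keeping all three columns would largely duplicate information in the distance computations used for clustering."
]
},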
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Columns for visualization:\n",
"['PhysicalHealthDays', 'MentalHealthDays', 'SleepHours', 'BMI']\n"
]
}
],
"source": [
"columns_to_drop = [\n",
"    'HeightInMeters',\n",
"    'WeightInKilograms'\n",
"]\n",
"\n",
"for col in columns_to_drop:\n",
"    if col in num_columns:\n",
"        num_columns.remove(col)\n",
"\n",
"print('Columns for visualization:')\n",
"print(num_columns)"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [],
"source": [
"def draw_data_2d(\n",
"    df: pd.DataFrame,\n",
"    col1: int,\n",
"    col2: int,\n",
"    y: List | None = None,\n",
"    classes: List | None = None,\n",
"    subplot: Any | None = None,\n",
"):\n",
"    \"\"\"Scatter-plot two columns of df, selected by position, optionally colored by the labels y.\"\"\"\n",
"    ax = subplot\n",
"    if ax is None:\n",
"        _, ax = plt.subplots()\n",
"    # Pass cmap only together with color data; otherwise matplotlib warns that cmap is ignored.\n",
"    cmap = \"viridis\" if y is not None else None\n",
"    scatter = ax.scatter(df[df.columns[col1]], df[df.columns[col2]], c=y, cmap=cmap, alpha=0.7)\n",
"    ax.set(xlabel=df.columns[col1], ylabel=df.columns[col2])\n",
"    if classes is not None:\n",
"        ax.legend(scatter.legend_elements()[0], classes, loc=\"lower right\", title=\"Classes\")"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [],
"source": [
"def show_scatters_by_pairs(\n",
"    df: DataFrame,\n",
"    columns: List[str],\n",
"    y: List | None = None,\n",
"    y_names: List[str] | None = None) -> None:\n",
"    \"\"\"Draw a scatter plot for every pair of the given columns.\"\"\"\n",
"    pairs_count = math.comb(len(columns), 2)\n",
"    plot_columns_count = 2\n",
"    plot_rows_count = math.ceil(pairs_count / plot_columns_count)\n",
"\n",
"    plt.figure(figsize=(plot_columns_count * 8, plot_rows_count * 8))\n",
"\n",
"    # draw_data_2d selects columns by position, so restrict df to `columns` first;\n",
"    # otherwise i and j would index into whatever columns df happens to have.\n",
"    df = df[columns]\n",
"\n",
"    count = 0\n",
"    for i in range(len(columns)):\n",
"        for j in range(i + 1, len(columns)):\n",
"            count += 1\n",
"            print(columns[i], 'vs', columns[j])\n",
"            draw_data_2d(\n",
"                df,\n",
"                i, j,\n",
"                y,\n",
"                y_names,\n",
"                subplot=plt.subplot(plot_rows_count, plot_columns_count, count)\n",
"            )\n",
"\n",
"    plt.tight_layout()\n",
"    plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"PhysicalHealthDays vs MentalHealthDays\n",
"PhysicalHealthDays vs SleepHours\n",
"PhysicalHealthDays vs BMI\n",
"MentalHealthDays vs SleepHours\n",
"MentalHealthDays vs BMI\n",
"SleepHours vs BMI\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\ns.potapov\\AppData\\Local\\Temp\\ipykernel_52300\\1030510231.py:14: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored\n",
" scatter = ax.scatter(df[df.columns[col1]], df[df.columns[col2]], c=y, cmap=\"viridis\", alpha=0.7)\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABjUAAAlUCAYAAABfY/AfAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdf3hcZZ3//9c587PTJGNDSGtpOwVCbC3CYsXaglwgVhaKflb0u65WWtzt6sov+eGuAp/lUvwU1HXZVde669aFFgu468Iukop2saLYEjUIXTClROyUUts0pp1kOp2f53z/GBKaNCmZZO6ZOcnzcV25rk7m7jvvuc+PuWfe59y35bquKwAAAAAAAAAAgBpnVzsBAAAAAAAAAACAsaCoAQAAAAAAAAAAPIGiBgAAAAAAAAAA8ASKGgAAAAAAAAAAwBMoagAAAAAAAAAAAE+gqAEAAAAAAAAAADyBogYAAAAAAAAAAPAEihoAAAAAAAAAAMAT/NVOwDTHcbRv3z7V19fLsqxqpwMAAAB4muu66u/v1+zZs2XbU+caKT5XAAAAAOUzkc8Vk76osW/fPs2dO7faaQAAAACTyssvv6w5c+ZUO42K4XMFAAAAUH7j+Vwx6Ysa9fX1koqd09DQUOVsAAAAAG/r6+vT3LlzB8fZ1XbXXXfpoYce0s6dOzVt2jQtW7ZMX/rSl/SmN71psM2FF16oJ554Ysj/+8QnPqF//ud/HvPf4XMFAAAAUD4T+Vwx6YsaA7eGNzQ08OEDAAAAKJNamYLpiSee0DXXXKNzzz1X+Xxet956q97znvfoN7/5jaZPnz7Y7i//8i91xx13DD6ORCIl/R0+VwAAAADlN57PFZO+qAEAAABg8nrssceGPL733nvV3Nysjo4OXXDBBYO/j0QimjVrVqXTAwAAAFBmU2dlPwAAAACTXiKRkCQ1NjYO+f2mTZvU1NSkM888U7fccotSqdQJ42QyGfX19Q35AQAAAFB93KkBAAAAYFJwHEc33HCDzjvvPJ155pmDv//IRz6iWCym2bNna8eOHfrMZz6jF154QQ899NCose666y59/vOfr0TaAAAAAEpgua7rVjsJk/r6+hSNRpVIJJj7FgAAAJigWh5ff/KTn9QPfvADPfnkk5ozZ86o7X784x/r4osvVldXl04//fQR22QyGWUymcHHAwsZ1uLrBgAAALxmIp8ruFMDAAAAgOdde+21evTRR/XTn/70hAUNSVqyZIkknbCoEQqFFAqFyp4nAAAAgImhqAEAAADAs1zX1XXXXaeHH35YP/nJT3Tqqae+7v955plnJElvfOMbDWcHAAAAoNwoagAAAADwrGuuuUb333+//vu//1v19fXav3+/JCkajWratGn67W9/q/vvv1+XXXaZTjrpJO3YsUM33nijLrjgAp111llVzh4AAABAqShqAAAAAPCsb37zm5KkCy+8cMjv77nnHl111VUKBoP6n//5H/3jP/6jjhw5orlz5+oDH/iA/u///b9VyBYAAADARFHUAAAAAOBZruue8Pm5c+fqiSeeqFA2AAAAAEyzq50AAAAAAAAAAADAWFDUAAAAAAAAAAAAnkBRAwAAAAAAAAAAeAJFDQAAAAAAAAAA4AkUNQAAAAAAAAAAgCdQ1AAAAAAAAAAAAJ5AUQMAAAAAAAAAAHgCRQ0AAAAAAAAAAOAJFDUAAAAAAAAAAIAnUNQAAAAAAAAAAACeQFEDAAAAAAAAAAB4AkUNAAAAAAAAAADgCRQ1AAAAAAAAAACAJ1DUAAAAAAAAAAAAnkBRAwAAAAAAAAAAeAJFDQAAAAAAAAAA4AkUNQAAAAAAAAAAgCf4q53AZDf/s23H/W73F1dMOO6OvQf0vn/61eDjR659m86aM3PCcdf9dLu+vLl38PHfXNaoqy9YOuG4D+3o1E33vzT4+O6PnKYrzlo44bim+uFffv4L3fX9g4OPb3nvyfrEeW+fcNzkkaxu/N
6zevlQSnNnRPQPHzxbddODE477X/+7Uzds+u3g439cebr+5C0LJhzXVL6SlM872rLzgPYn0poVDWv5gpny+ydeZ733l8/oc//5yuDjz33gFF117h9NOK4pqVROtz/6vPb0pjSvMaI7Ll+kSCRQ7bRG9eNdv9Of/9tvBh//25+/We9qPXXCcb++9Un9/Q8Tg49vviSq6y46f8JxHcfVru5+JVI5RSMBtTbXy7atCcc1tf+aimuqH/79mef1Nw/uHnz85T+brz/9o0UTjuu148JU/5ryUvchXfLVbcoVpIBP+uGnlum05hnVTuuETB0bJt/nTDA1rgQAoNK8Nn4CgMlkMpyDLdd13Wr98W9+85v65je/qd27d0uSFi1apNtvv12XXnqpJCmdTuvmm2/Wgw8+qEwmo0suuUTr1q3TzJlj/9K6r69P0WhUiURCDQ0NJl7GqEb64DlgIh9AievNuO/7+pPa8UriuN+fdUpUj1w3/i9vvZavJG1qj2vd1i71JLNyXFe2ZampLqirL2rRyiWxccc11RemfHR9u57s6jnu9+e3NOk7a5ZUIaMT89ox1xHv1YZtcXV1J5XNFxT0+9TSXKfVy2JaHGscd1xT+6+puKb6wdR289pxYap/TTn1s20aaeBnSfpdDZ4nJXPHhsn3ORNq6T2umuPrapqqrxsAys1r4ycAmExq6Rw8kfF1VYsa3//+9+Xz+XTGGWfIdV1t2LBBf/d3f6df//rXWrRokT75yU+qra1N9957r6LRqK699lrZtq2f//znY/4b1frwcaIPngPG8wGUuN6MO9oXJwPG+wWK1/KVil9OrW3rVK7gKOz3KeCzlCu4SucLCvhs3bZi4bi+pDLVF6aM9sXtgFr7Atdrx1xHvFdr2zp1OJVTc31I4YBP6VxBB5MZRacFdNuKheN6sza1/5qKa6ofTG03rx0XpvrXlNEKGgNqsbBh6tgw+T5nQq29x03VL/en6usGgHLy2vgJACaTWjsHT2R8XdU1Nd773vfqsssu0xlnnKHW1latXbtWdXV1euqpp5RIJPTtb39bd999t971rndp8eLFuueee7Rt2zY99dRT1Uz7dY3lg2cp7Qbs2HugrO0GrPvp9rK2G/DQjs6ythtgqh/+5ee/KGu7Ackj2RN+cSJJO15JKHkkW1Lc//rfnWVtN8BUvlJx+pB1W7uUKziqD/kV8tuyLUshv636kF+5QvH5fN4pKe69v3ymrO1MS6VyJ/ziVpKe7OpRKpWrUEYn9uNdvytruwFf3/pkWdsNcBxXG7bFdTiV0/yTIpoe8stnW5oe8ivWGFHiaE4bt8XlOKXV9E3tv6bimuqHf3/m+bK2G+C148JU/5ryUvehExY0JMl9tV2tMHVsmHyfM8HUuBIAgErz2vgJACaTyXYOrpmFwguFgh588EEdOXJES5cuVUdHh3K5nN797ncPtlmwYIHmzZun7dtH/3I9k8mor69vyM9kcezaEeVoN+DYNTTK0W7AsWtolKPdAFP9cOwaGuVoN+DG7z1b1nYDjl1DoxztSs2j1HwlacvOA+pJZhX2+2RbQ+fqsy1LYb9PPcmstuwsrSB17Boa5Whn2u2Pju3L3rG2M+3YNTTK0W7AsWtolKPdgF3d/erqTqq5PiRr2H5mWZZOrgvpxe6kdnX3lxTX1P5rKq6pfjh2DY1ytBvgtePCVP+acslXt5W1XSWYOjZMvs8BAIDReW38BACTyWQ7B1e9qPG///u/qqurUygU0l/91V/p4Ycf1pvf/Gbt379fwWBQb3jDG4a0nzlzpvbv3z9qvLvuukvRaHTwZ+7cuYZfAXBiLx9KlbWdaSbz3Z9Iy3FdBXwjLz4U8FlyXFf7E+mSY3vJnt6x9d1Y22GoRCqnbL6gcMA34vPhgE/ZfEGJEq/4N7X/moprqh9M8dpx4bX+zRXK264STB0bXntfBgBgsvDa+AkAJpPJdg6uelHjTW96k5555hm1t7frk5/8pFavXq3f/K
a0q32PdcsttyiRSAz+vPzyy2XMFijd3BmRsrYzzWS+s6Jh2VZxPvSR5ArFBWBnRcMlx/aSeY1j67uxtsNQ0UhAQX9
"text/plain": [
"<Figure size 1600x2400 with 6 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df_reduced = df[num_columns]\n",
"\n",
"FRACTION = 0.1\n",
"\n",
"df_reduced_sampled = df_reduced.sample(frac=FRACTION, random_state=RANDOM_STATE)\n",
"\n",
"\n",
"show_scatters_by_pairs(df_reduced_sampled, num_columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Standardizing the data for clustering"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [],
"source": [
"scaler = StandardScaler()\n",
"data_reduced_scaled = scaler.fit_transform(df_reduced_sampled)\n",
"\n",
"df_scaled = pd.DataFrame(data_reduced_scaled, columns=df_reduced_sampled.columns)"
]
},
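{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged illustration of what the scaling step does: `StandardScaler` subtracts each column's mean and divides by its population standard deviation (`ddof=0`), so every feature contributes on a comparable scale to the distances used in clustering. The toy column `x` below is a made-up example, not data from the notebook.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Hypothetical toy column for illustration only (not taken from the dataset).\n",
"x = np.array([[20.0], [25.0], [30.0], [35.0]])\n",
"\n",
"scaled = StandardScaler().fit_transform(x)\n",
"\n",
"# Manual z-score; np.std uses ddof=0 by default, matching StandardScaler.\n",
"manual = (x - x.mean(axis=0)) / x.std(axis=0)\n",
"\n",
"print(np.allclose(scaled, manual))\n",
"```\n",
"\n",
"Without this step, a wide-ranging feature such as BMI would weigh more heavily in Euclidean distances than a narrow one such as SleepHours."
]
},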
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Hierarchical agglomerative clustering"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn import cluster\n",
"from scipy.cluster import hierarchy\n",
"\n",
"def run_agglomerative(\n",
"    df: pd.DataFrame,\n",
"    num_clusters: int = 2\n",
") -> cluster.AgglomerativeClustering:\n",
"    agglomerative = cluster.AgglomerativeClustering(\n",
"        n_clusters=num_clusters,\n",
"        compute_distances=True,  # keep merge distances so a dendrogram can be drawn\n",
"    )\n",
"    return agglomerative.fit(df)\n",
"\n",
"\n",
"def get_linkage_matrix(\n",
"    model: cluster.AgglomerativeClustering\n",
") -> np.ndarray:\n",
"    # Count the samples under each merge node, as SciPy's linkage format requires.\n",
"    counts = np.zeros(model.children_.shape[0])\n",
"    n_samples = len(model.labels_)\n",
"    for i, merge in enumerate(model.children_):\n",
"        current_count = 0\n",
"        for child_idx in merge:\n",
"            if child_idx < n_samples:\n",
"                current_count += 1  # leaf node: a single sample\n",
"            else:\n",
"                current_count += counts[child_idx - n_samples]\n",
"        counts[i] = current_count\n",
"\n",
"    return np.column_stack([model.children_, model.distances_, counts]).astype(float)\n",
"\n",
"\n",
"def draw_dendrogram(linkage_matrix: np.ndarray):\n",
"    hierarchy.dendrogram(linkage_matrix, truncate_mode=\"level\", p=3)\n",
"    plt.xticks(fontsize=10, rotation=45)\n",
"    plt.tight_layout()"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAnMAAAHWCAYAAAAciQ/OAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABTTklEQVR4nO3deVxU9f7H8c+ggqCggiyauIFbKoprlgsqalqWSYtmpi1aht7USrOs3IoWs6y0ut3U273a4q00rSzN3K6IS6FlXVNzwQTcIQRZP78//M1phkVFZpg58Ho+HvNQzjlzzme+58zw5sz5fo9FVVUAAABgSh6uLgAAAABXjzAHAABgYoQ5AAAAEyPMAQAAmBhhDgAAwMQIcwAAACZGmAMAADAxwhwAAICJVXV1AVejoKBAjh8/Lr6+vmKxWFxdDgAAQJmpqvz5559Sv3598fC48vNtpgxzx48fl9DQUFeXAQAA4HBJSUnSoEGDK17elGHO19dXRC6+WD8/PxdXAwAAUHbp6ekSGhpq5JwrZcowZ/1q1c/PjzAHAAAqlNJeQkYHCAAAABMjzAEAAJgYYQ4AAMDECHMAAAAmRpgDAAAwMcIcAACAiRHmAAAATIwwBwAAYGKEOQAAABMjzAEAAJgYYQ4AAMDECHMAAAAmRpgDAAAwMcIcAACAiVV1dQFAeVFVycrNd3UZACoJ72pVxGKxuLoMVAKEOVQKqiq3vxMvu46cdXUpACqJTo3qyPKHuxHo4HR8zYpKISs3nyAHoFztPHKWbwNQLjgzh0pn5/Ro8fGs4uoyAFRQmTn50mnOOleXgUqEMIdKx8ezivh4cugDACoGvmYFAAAwMcIcAACAiRHmAAAATIwwBwAAYGKEOQAAABMrVZiLi4uTzp07i6+vrwQFBcmQIUNk3759dstcuHBBYmNjJSAgQGrWrCkxMTGSmppqt8zRo0flpptuEh8fHwkKCpInnnhC8vLyyv5qAAAAKplShbmNGzdKbGysbNu2TdauXSu5ubnSv39/OX/+vLHMpEmTZNWqVbJ8+XLZuHGjHD9+XIYOHWrMz8/Pl5tuuklycnJk69at8s9//lOWLFkizz77rONeFQAAQCVRqsG21qxZY/fzkiVLJCgoSHbt2iU9e/aUtLQ0ef/992XZsmXSp08fERFZvHixtGrVSrZt2ybXXXedfPvtt/LLL7/IunXrJDg4WNq3by+zZ8+WqVOnyowZM8TT09Nxrw4AAKCCK9M1c2lpaSIi4u/vLyIiu3btktzcXImOjjaWadmypTRs2FDi4+NFRCQ+Pl7atm0rwcHBxjIDBgyQ9PR02bt3b7Hbyc7OlvT0dLsHAAAAyhDmCgoKZOLEiXLDDTdImzZtREQkJSVFPD09pXbt2nbLBgcHS0pKirGMbZCzzrfOK05cXJzUqlXLeISGhl5t2QAAABXKVYe52NhY+fnnn+Wjjz5yZD3FmjZtmqSlpRmPpKQkp28TAADADK7qBpXjx4+X1atXy6ZNm6RBgwbG9JCQEMnJyZFz587ZnZ1LTU2VkJAQY5nt27fbrc/a29W6TGFeXl7i5eV1NaUCAABUaKU6M6eqMn78ePn8889l/fr10qRJE7v5HTt2lGrVqsl3331nTNu3b58cPXpUunXrJiIi3bp1k59++klOnDhhLLN27Vrx8/OTa6+9tiyvBQAAoNIp1Zm52NhYWbZsmaxcuVJ8fX2Na9xq1aol3t7eUqtWLXnggQdk8uTJ4u/vL35+fjJhwgTp1q2bXHfddSIi0r9/f7n22mtl5MiR8vLLL0tKSopMnz5dYmNjOfsGAABQSqUKc2+//baIiERFRdlNX7x4sYwePVpERF577TXx8PCQmJgYyc7OlgEDBsjChQuNZatUqSKrV6+WcePGSbdu3aRGjRoyatQomTVrVtleCQAAQCVUqjCnqpddpnr16rJgwQJZsGBBics0atRIvvrqq9JsGgAAAMXg3qwAAAAmRp
gDAAAwMcIcAACAiRHmAAAATIwwBwAAYGKEOQAAABMjzAEAAJgYYQ4AAMDECHMAAAAmRpgDAAAwMcIcAACAiRHmAAAATIwwBwAAYGKEOQAAABMjzAEAAJgYYQ4AAMDECHMAAAAmRpgDAAAwMcIcAACAiRHmAAAATIwwBwAAYGKEOQAAABMjzAEAAJgYYQ4AAMDECHMAAAAmRpgDAAAwMcIcAACAiRHmAAAATIwwBwAAYGKEOQAAABMjzAEAAJgYYQ4AAMDECHMAAAAmRpgDAAAwMcIcAACAiZU6zG3atEkGDx4s9evXF4vFIitWrLCbb7FYin288sorxjKNGzcuMv/FF18s84sBAACobEod5s6fPy/t2rWTBQsWFDs/OTnZ7rFo0SKxWCwSExNjt9ysWbPslpswYcLVvQIAAIBKrGppnzBw4EAZOHBgifNDQkLsfl65cqX07t1bmjZtajfd19e3yLIAAAAoHadeM5eamipffvmlPPDAA0XmvfjiixIQECCRkZHyyiuvSF5eXonryc7OlvT0dLsHAAAAruLMXGn885//FF9fXxk6dKjd9L/97W/SoUMH8ff3l61bt8q0adMkOTlZ5s2bV+x64uLiZObMmc4sFQAAwJScGuYWLVokI0aMkOrVq9tNnzx5svH/iIgI8fT0lIceekji4uLEy8uryHqmTZtm95z09HQJDQ11XuEAAAAm4bQwt3nzZtm3b598/PHHl122a9eukpeXJ4cPH5YWLVoUme/l5VVsyAMAAKjsnHbN3Pvvvy8dO3aUdu3aXXbZxMRE8fDwkKCgIGeVAwAAUCGV+sxcRkaGHDhwwPj50KFDkpiYKP7+/tKwYUMRufg16PLly+XVV18t8vz4+HhJSEiQ3r17i6+vr8THx8ukSZPknnvukTp16pThpQAAAFQ+pQ5zO3fulN69exs/W69lGzVqlCxZskRERD766CNRVRk+fHiR53t5eclHH30kM2bMkOzsbGnSpIlMmjTJ7po4AAAAXJlSh7moqChR1UsuM3bsWBk7dmyx8zp06CDbtm0r7WYBAABQDO7NCgAAYGKEOQAAABMjzAEAAJgYYQ4AAMDECHMAAAAmRpgDAAAwMcIcAACAiRHmAAAATIwwBwAAYGKEOQAAABMjzAEAAJgYYQ4AAMDECHMAAAAmRpgDAAAwMcIcAACAiRHmAAAATIwwBwAAYGKEOQAAABMjzAEAAJgYYQ4AAMDECHMAAAAmRpgDAAAwMcIcAACAiRHmAAAATIwwBwAAYGKEOQAAABMjzAEAAJgYYQ4AAMDECHMAAAAmRpgDAAAwMcIcAACAiRHmAAAATIwwBwAAYGKEOQAAABMrdZjbtGmTDB48WOrXry8Wi0VWrFhhN3/06NFisVjsHjfeeKPdMmfOnJERI0aIn5+f1K5dWx544AHJyMgo0wsBAACojEod5s6fPy/t2rWTBQsWlLjMjTfeKMnJycbjww8/tJs/YsQI2bt3r6xdu1ZWr14tmzZtkrFjx5a+egAAgEquammfMHDgQBk4cOAll/Hy8pKQkJBi5/3666+yZs0a2bFjh3Tq1ElERN58800ZNGiQzJ07V+rXr1/akgAAACotp1wzt2HDBgkKCpIWLVrIuHHj5PTp08a8+Ph4qV27thHkRESio6PFw8NDEhISnFEOAABAhVXqM3OXc+ONN8rQoUOlSZMmcvDgQXnqqadk4MCBEh8fL1WqVJGUlBQJCgqyL6JqVfH395eUlJRi15mdnS3Z2dnGz+np6Y4uGwAAwJQcHuaGDRtm/L9t27YSEREhYWFhsmHDBunbt+9VrTMuLk5mzpzpqBIBAAAqDKcPTdK0aVOpW7euHDhwQEREQkJC5MSJE3bL5OXlyZkzZ0q8zm7atGmSlpZmPJKSkpxdNgAAgCk4/MxcYceOHZPTp09LvXr1RESkW7ducu7cOdm1a5d07NhRRETWr18vBQUF0rVr12LX4eXlJV5eXs4uFQBQCqoqWbn5ri
7D7WTm5BX7f/zFu1oVsVgsri6jwih1mMvIyDDOsomIHDp0SBITE8Xf31/8/f1l5syZEhMTIyEhIXLw4EGZMmWKhIe
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tree = run_agglomerative(df_scaled)\n",
"linkage_matrix = get_linkage_matrix(tree)\n",
"draw_dendrogram(linkage_matrix)"
]
},
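{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the linkage-matrix format more concrete, here is a small sketch that uses SciPy's own `hierarchy.linkage` as a stand-in for the matrix built from the scikit-learn model. It assumes Ward linkage, which is also the scikit-learn default for `AgglomerativeClustering`. The six `points` are made-up toy values, not data from the notebook.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy.cluster import hierarchy\n",
"\n",
"# Hypothetical toy points forming three obvious pairs (not taken from the dataset).\n",
"points = np.array([\n",
"    [0.0, 0.0], [0.1, 0.0],\n",
"    [5.0, 5.0], [5.1, 5.0],\n",
"    [10.0, 0.0], [10.1, 0.0],\n",
"])\n",
"\n",
"# Each row of the linkage matrix is [child_a, child_b, merge_distance, sample_count].\n",
"linkage_matrix = hierarchy.linkage(points, method='ward')\n",
"print(linkage_matrix)\n",
"\n",
"# Cut the tree so that at most 3 flat clusters remain.\n",
"labels = hierarchy.fcluster(linkage_matrix, t=3, criterion='maxclust')\n",
"print(labels)\n",
"```\n",
"\n",
"Large jumps in the merge-distance column are what show up as long vertical branches in a dendrogram, and they are a common heuristic for choosing how many clusters to keep."
]
},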
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We try splitting the data into 3 large clusters and visualize the results of the hierarchical clustering."
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"PhysicalHealthDays vs MentalHealthDays\n",
"PhysicalHealthDays vs SleepHours\n",
"PhysicalHealthDays vs BMI\n",
"MentalHealthDays vs SleepHours\n",
"MentalHealthDays vs BMI\n",
"SleepHours vs BMI\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1600x2400 with 6 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"result = hierarchy.fcluster(linkage_matrix, 3, criterion=\"maxclust\")\n",
"y_names = ['Cluster 1', 'Cluster 2', 'Cluster 3']\n",
"\n",
"show_scatters_by_pairs(df_reduced_sampled, num_columns, result, y_names)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### KMeans (non-hierarchical hard clustering) for comparison"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [],
"source": [
"from typing import Tuple\n",
"\n",
"from sklearn.cluster import KMeans\n",
"\n",
"\n",
"def print_cluster_result(\n",
"    df: pd.DataFrame,\n",
"    clusters_num: int,\n",
"    labels: np.ndarray,\n",
"    separator: str = \", \",\n",
") -> None:\n",
"    \"\"\"Print the size and the row index labels of every cluster.\"\"\"\n",
"    for cluster_id in range(clusters_num):\n",
"        # Positions of the rows assigned to this cluster\n",
"        cluster_indices = np.where(labels == cluster_id)[0]\n",
"        print(f\"Cluster {cluster_id + 1} ({len(cluster_indices)}):\")\n",
"        members = [str(df.index[idx]) for idx in cluster_indices]\n",
"        print(separator.join(members))\n",
"        print(\"\")\n",
"        print(\"--------\")\n",
"\n",
"\n",
"def run_kmeans(\n",
"    df: pd.DataFrame,\n",
"    num_clusters: int,\n",
"    random_state: int,\n",
") -> Tuple[np.ndarray, np.ndarray]:\n",
"    \"\"\"Fit KMeans on df and return (labels, cluster centers).\"\"\"\n",
"    kmeans = KMeans(n_clusters=num_clusters, random_state=random_state)\n",
"    labels = kmeans.fit_predict(df)\n",
"    return labels, kmeans.cluster_centers_"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cluster 1 (6480):\n",
"5, 7, 14, 15, 18, 24, 27, 29, 38, 42, 46, 49, 51, 56, 57, 60, ... [index list truncated]\n",
"\n",
"--------\n",
"Cluster 2 (14328):\n",
"0, 1, 3, 4, 6, 8, 9, 11, 13, 17, 20, 21, 22, 23, 25, 26, ... [index list truncated]\n",
"\n",
"--------\n",
"Cluster 3 (3794):\n",
"2, 10, 12, 16, 19, 30, 32, 34, 37, 43, 48, 50, 53, 75, 81, 82, ... [index list truncated]\n",
"\n",
"--------\n"
]
},
{
"data": {
"text/plain": [
"array([[-0.26768418, -0.22724145, -0.33671544, 1.03735944],\n",
" [-0.33771529, -0.31519476, 0.23113456, -0.5206252 ],\n",
" [ 1.73265623, 1.57836843, -0.29417739, 0.18428677]])"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"labels, centers = run_kmeans(df_scaled, 3, RANDOM_STATE)\n",
"print_cluster_result(df_scaled, 3, labels)\n",
"display(centers)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Visualize the results"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"from typing import Any, List, Optional\n",
"\n",
"\n",
"def draw_cluster_results(\n",
"    df: pd.DataFrame,\n",
"    col1: int,\n",
"    col2: int,\n",
"    labels: np.ndarray,\n",
"    cluster_centers: np.ndarray,\n",
"    subplot: Optional[Any] = None,\n",
") -> None:\n",
"    \"\"\"Scatter one pair of columns colored by cluster; centroids are drawn in black.\"\"\"\n",
"    # Draw on the given axes, or on the current pyplot figure if none is passed.\n",
"    ax = plt if subplot is None else subplot\n",
"\n",
"    for i in np.unique(labels):\n",
"        mask = labels == i\n",
"        ax.scatter(\n",
"            df.loc[mask, df.columns[col1]],\n",
"            df.loc[mask, df.columns[col2]],\n",
"            label=i,\n",
"        )\n",
"\n",
"    # Centroid columns must follow the same order as the columns of df.\n",
"    ax.scatter(cluster_centers[:, col1], cluster_centers[:, col2], s=80, color=\"k\")\n",
"\n",
"\n",
"def show_clusters_by_pairs(\n",
"    df: pd.DataFrame,\n",
"    columns: List[str],\n",
"    labels: Any = None,\n",
"    centers: Any = None,\n",
") -> None:\n",
"    \"\"\"Plot the clusters for every pair of columns on a two-column grid.\"\"\"\n",
"    pairs_count = math.comb(len(columns), 2)\n",
"    plot_columns_count = 2\n",
"    plot_rows_count = math.ceil(pairs_count / plot_columns_count)\n",
"\n",
"    plt.figure(figsize=(plot_columns_count * 8, plot_rows_count * 8))\n",
"\n",
"    count = 0\n",
"    for i in range(len(columns)):\n",
"        for j in range(i + 1, len(columns)):\n",
"            count += 1\n",
"            print(columns[i], 'vs', columns[j])\n",
"            # i and j are positional: df.columns is assumed to match `columns`.\n",
"            draw_cluster_results(\n",
"                df,\n",
"                i, j,\n",
"                labels,\n",
"                centers,\n",
"                plt.subplot(plot_rows_count, plot_columns_count, count))\n",
"\n",
"    plt.tight_layout()\n",
"    plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"PhysicalHealthDays vs MentalHealthDays\n",
"PhysicalHealthDays vs SleepHours\n",
"PhysicalHealthDays vs BMI\n",
"MentalHealthDays vs SleepHours\n",
"MentalHealthDays vs BMI\n",
"SleepHours vs BMI\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 1600x2400 with 6 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"show_clusters_by_pairs(df_reduced_sampled, num_columns, labels, centers)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### PCA for visualization in reduced dimensionality"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1600x600 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.decomposition import PCA\n",
"import seaborn as sns\n",
"pca = PCA(n_components=2)\n",
"\n",
"reduced_data = pca.fit_transform(data_reduced_scaled)\n",
"\n",
"plt.figure(figsize=(16, 6))\n",
"plt.subplot(1, 2, 1)\n",
"sns.scatterplot(x=reduced_data[:, 0], y=reduced_data[:, 1], hue=result, palette='Set1', alpha=0.6)\n",
"plt.title('PCA reduced data: Agglomerative Clustering')\n",
"\n",
"plt.subplot(1, 2, 2)\n",
"sns.scatterplot(x=reduced_data[:, 0], y=reduced_data[:, 1], hue=labels, palette='Set1', alpha=0.6)\n",
"plt.title('PCA reduced data: KMeans Clustering')\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Inertia analysis for the elbow method (sum of squared distances)"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
   ],
   "source": [
    "# Elbow method: fit K-Means for k = 1..22 and record the inertia\n",
    "# (within-cluster sum of squared distances) for each k.\n",
    "inertias = []\n",
    "clusters_range = range(1, 23)\n",
    "for i in clusters_range:\n",
    "    kmeans = KMeans(n_clusters=i, random_state=RANDOM_STATE)\n",
    "    kmeans.fit(data_reduced_scaled)\n",
    "    inertias.append(kmeans.inertia_)\n",
    "\n",
    "plt.figure(figsize=(10, 6))\n",
    "plt.plot(clusters_range, inertias, marker='o')\n",
    "plt.title('Elbow method for choosing k')\n",
    "plt.xlabel('Number of clusters')\n",
    "plt.ylabel('Inertia')\n",
    "plt.grid(True)\n",
    "plt.show()"
]
},
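  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The elbow plot above is read by eye. As a rough complement, the relative drop in inertia between consecutive values of k can be tabulated; the elbow lies roughly where these drops level off. This is only a sketch: it assumes the `inertias` list and `clusters_range` from the previous cell, and the helper name `relative_inertia_drops` is introduced here for illustration and is not part of the original analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "def relative_inertia_drops(inertias):\n",
    "    # Fractional decrease in inertia when going from each k to the next one.\n",
    "    values = np.asarray(inertias, dtype=float)\n",
    "    return (values[:-1] - values[1:]) / values[:-1]\n",
    "\n",
    "\n",
    "drops = relative_inertia_drops(inertias)\n",
    "# drops[0] compares k=1 with k=2, so pair the drops with k = 2, 3, ...\n",
    "for k, drop in zip(list(clusters_range)[1:], drops):\n",
    "    print(f'k={k}: inertia decreased by {drop:.1%} relative to k={k - 1}')"
   ]
  },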
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Computing silhouette scores"
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA1YAAAIjCAYAAAAAxIqtAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAACR9ElEQVR4nOzdd3hU1dbH8d9k0islPbTQQZqAICqCdEQFRVQsFDvKFeV6USxUFRUrXoTXioiFiwXBAlKtCErv0hFCElo6qXPeP8IMhNRJZjKZ8P08Tx6SM+ecWWfnZJg1e++1TYZhGAIAAAAAlJuHqwMAAAAAAHdHYgUAAAAAFURiBQAAAAAVRGIFAAAAABVEYgUAAAAAFURiBQAAAAAVRGIFAAAAABVEYgUAAAAAFURiBQAAAAAVRGIFAABQhUyaNEkmk0knTpxwdSgA7EBiBaBKmTNnjkwmk/76669Cj7377rsymUwaNGiQ8vLyKiWe6667Tg0aNLD7uNGjR8tkMjk+IAAAUCWRWAFwC19//bVGjRqlrl276vPPP5fZbHZ1SAAAADYkVgCqvNWrV2vo0KFq2bKlFi9eLF9fX1eHBAAAUACJFYAqbdOmTRo4cKCioqK0dOlShYSEFNpnwYIF6tChg/z8/BQaGqo777xTR48etT1+9OhRDR06VDExMfLx8VHDhg01btw4paamFjrXxx9/rLp166pGjRqaNm2abfv8+fMVHR2t0NBQvfTSS4WOW7p0qZo2barAwEA98sgjMgxDUn5S2KhRIwUHB2vs2LEFhjCuXr1aJpNJq1evLnCuAQMGyGQyadKkSbZtxc25+Ouvv2QymTRnzhzbtoMHDxbaJkkPP/ywTCaTRowYUWB7UlKSHn30UdWtW1c+Pj5q3LixXnrpJVkslkLnfOWVVwpde6tWrdS9e/cC11TSl/W6yjKPpEGDBoXiLYrFYtGbb76p1q1by9fXV2FhYerXr1+BIaUXtqkkTZ8+XSaTyRb/+UaMGFFi/B9++KFMJpM2btxY6NgXXnhBZrNZR48e1fbt23XDDTcoIiJCPj4+atGihZ5//nnl5OSU+lznfx08eFCS9M0332jAgAGKjo6Wj4+PGjVqpKlTp9o1PLZ79+4lXtv5rMNzL/y6sM02btyofv36KSwsrMB+1113XYmxnH9vvf7666pfv778/PzUrVs3bdu2rcC+W7Zs0YgRI9SwYUP5+voqMjJSd999t06ePFlgv1mzZqlt27YKCQlRQECA2rZtq/fff7/APiNGjFBgYGCheL744otCf5fdu3dXq1atSr0G699cYmKiwsLC1L17d9trgSTt3btXAQEBuvXWW0tsk6IcOnRIjRs3VqtWrZSQkGD38QCcz9PVAQBAcfbt26d+/frJx8dHS5cuVVRUVKF95syZo5EjR+qyyy7TtGnTlJCQoDfffFO//fabNm7cqBo1amjfvn1KSEjQv/71L9WsWVPbt2/XjBkztGLFCv3666/y8/OTJP32228aPny4rrjiCg0dOlQff/yx9u/frzNnzmjKlCl66qmn9OOPP+rJJ59UvXr1NHToUEnS/v37NWjQIDVu3FgvvPCClixZYntD//DDD+tf//qXNm7cqNdff11hYWEaP358sdf8888/6/vvv3d4W+7du1fvvvtuoe0ZGRnq1q2bjh49qgceeED16tXT77//rvHjx+vYsWN644037HqeFi1a6OOPP7b9/M4772jnzp16/fXXbdvatGlT7usozj333KM5c+aof//+uvfee5Wbm6tffvlFf/zxhzp27FjkMUlJSQWS56KEhoYWiP2uu+6yfX/zzTfr4Ycf1ieffKJLL720wHGffPKJunfvrpiYGFuy+Z///EcBAQH6888/NWHCBP3+++9avHixPDw89MADD6hXr14FnufGG2/UTTfdZNsWFhYmKf+eDwwM1NixYxUYGKiVK1dqwoQJSklJ0fTp08vcZnXq1LFdf1pamk
aNGlXi/q+//rpCQ0MlSc8//3yBx5KTk9W/f38ZhqGxY8eqbt26kqTHHnuszPHMnTtXqampevjhh5WZmak333xTPXr00NatWxURESFJWrZsmfbv36+RI0cqMjJS27dv1zvvvKPt27frjz/+sM1rTE1NVZ8+fdSoUSMZhqH//e9/uvfee1WjRg0NHjy4zDGVV3h4uGbNmqUhQ4borbfe0iOPPCKLxaIRI0YoKChIb7/9tl3n27dvn3r06KFatWpp2bJltt8DgCrGAIAq5MMPPzQkGd9++63RqFEjQ5LRp0+fIvfNzs42wsPDjVatWhlnzpyxbf/2228NScaECROKfZ5ly5YZkowpU6bYtt1www1GbGyskZmZaRiGYaSmphqxsbGGv7+/sX//fsMwDMNisRhXXnml0bZtW9txjzzyiBEUFGScOHHCMAzDyMnJMS6//HJDkrF27VrbfkOHDjXCw8Nt51+1apUhyVi1apVtn86dOxv9+/c3JBkTJ060bZ84caIhyTh+/HiB6/jzzz8NScaHH35o23bgwIFC22655RajVatWRt26dY3hw4fbtk+dOtUICAgw/v777wLnffLJJw2z2WwcPny4wDmnT59eqC0vueQSo1u3boW2G4ZhDB8+3Khfv36RjxV3TeerX79+gXiLsnLlSkOS8cgjjxR6zGKx2L6/sE3HjRtnhIeHGx06dCgy/jvuuMOIjY0tsO3CcwwdOtSIjo428vLybNs2bNhQqP0v9O677xqSjLlz5xb5+IXPc76MjIxC2x544AHD39/fdm+V5oorrjBatWpl+/n48ePFPqc11kOHDtm2devWrUCbLV261JBkfPbZZwWOrV+/vjFgwIASY7HeW35+fsaRI0ds29euXWtIMh577DHbtqKu/bPPPjMkGT///HOxz5Gbm2sEBwcbo0ePtm0bPny4ERAQUGjfBQsWFPq77Natm3HJJZeUeg0X/s6HDh1q+Pv7G3///bcxffp0Q5KxcOHCYs9jdf7fxs6dO43o6GjjsssuM06dOlXqsQBch6GAAKqkESNG6J9//tHtt9+uH3/8UQsWLCi0z19//aXExEQ99NBDBeZdDRgwQM2bN9d3331n25aTk6MTJ07Yvtq1a6eOHTsWOO+KFSt07bXXysfHR5IUGBioli1bKiwsTLGxsZJkq0q4efNm2/CjFStW6Oqrr1bt2rUlSZ6enurQoYMkqVOnTrbz33TTTUpMTCw0vMnqq6++0p9//qkXX3yxXG1WnPXr12vBggWaNm2aPDwKvuwvWLBAXbt2Vc2aNQu0T69evZSXl6eff/65wP4ZGRkF9jtx4kSFKzSeOnVKJ06cUHp6ermO//LLL2UymTRx4sRCjxVXmfHo0aN666239OyzzxY5HEySsrOzbfdCcYYNG6a4uDitWrXKtu2TTz6Rn59fgZ6RrKysAm02aNAgRUREFHlfl8bawyrl98ycOHFCXbt2VUZGhnbt2lWmc2RmZpZ5rmJ2drYkldgW1mG11r+B8hg0aJBiYmJsP3fq1EmdO3cu0IN7/rVnZmbqxIkTuvzyyyVJGzZsKHC+vLw8nThxQocOHdLrr7+ulJQUde3atdDzXng/FzVE+PzznThxwtYmpfnvf/+rkJAQ3XzzzXr22Wd11113aeDAgWU6VpK2bdumbt26qUGDBlq+fLlq1qxZ5mMBVD4SKwBV0qlTpzRv3jx99NFHateuncaMGaPk5OQC+xw6dEiS1KxZs0LHN2/e3Pa4lD/MLywsrMDXX3/9pb1790qSTp8+rfT09AJv7Ipj3eeff/6x/Vue486Xl5enp556SnfccYfDh8o9+eST6tq1a5FzXfbs2aMlS5YUahvrsLTExMQC+0+cOLHQvmV9M1+cZs2aKSwsTIGBgYqIiNAzzzxjV7K2b98+RUdHq1atWmU+ZuLEiYqOjtYDDzxQ7D5JSUnFJl1WvXv3VlRUlD755BNJ+XO9PvvsMw0cOFBBQUG2/T777LNC7Z
aQkGC7/+yxfft23XjjjQoJCVFwcLDCwsJ05513SlKhv5HinDhxosj5ikVJSkqSpBLbomPHjvLy8tKkSZO0ceNGWwJ
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
   ],
   "source": [
    "from sklearn.metrics import silhouette_score\n",
    "\n",
    "# The silhouette score is undefined for a single cluster, so start at k = 2.\n",
    "silhouette_scores = []\n",
    "for i in clusters_range[1:]:\n",
    "    kmeans = KMeans(n_clusters=i, random_state=RANDOM_STATE)\n",
    "    labels = kmeans.fit_predict(data_reduced_scaled)\n",
    "    score = silhouette_score(data_reduced_scaled, labels)\n",
    "    silhouette_scores.append(score)\n",
    "\n",
    "plt.figure(figsize=(10, 6))\n",
    "plt.plot(clusters_range[1:], silhouette_scores, marker='o')\n",
    "plt.title('Silhouette scores for different k')\n",
    "plt.xlabel('Number of clusters')\n",
    "plt.ylabel('Silhouette score')\n",
    "plt.grid(True)\n",
    "plt.show()"
]
},
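  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The curve above can also be summarised numerically. The sketch below assumes the `silhouette_scores` list and `clusters_range` from the previous cell; the names `candidate_ks`, `best_index` and `best_k` are introduced here for illustration. The highest silhouette score is only one heuristic, so the value it reports should be weighed against the elbow plot and the interpretability of the clusters rather than taken as the definitive choice of k."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Scores were computed for k = 2..22, in the same order as clusters_range[1:].\n",
    "candidate_ks = list(clusters_range[1:])\n",
    "best_index = int(np.argmax(silhouette_scores))\n",
    "best_k = candidate_ks[best_index]\n",
    "print(f'Highest silhouette score: {silhouette_scores[best_index]:.3f} at k={best_k}')"
   ]
  },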
{
"cell_type": "code",
"execution_count": 100,
"metadata": {},
"outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean silhouette score: 0.282\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA1kAAAJwCAYAAAB71at5AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3hUVfoH8O/0mUxNnfQeWkLvSBUQEAuooKgL2Laoq65tV3fXurusZS1r198u6FpXQVYRkSJFQKnSW3rvZUrK1Pv7I2ZkmElIwqTy/TxPHs05d+59ZzIJ951zzntEgiAIICIiIiIiooAQ93QARERERERE/QmTLCIiIiIiogBikkVERERERBRATLKIiIiIiIgCiEkWERERERFRADHJIiIiIiIiCiAmWURERERERAHEJIuIiIiIiCiAmGQREREREREFEJMsIiIiIiKiAGKSRUR+rVq1CiKRyOsrIiICM2bMwNdff93T4RER9YiWv4379+/3ajeZTBg3bhyUSiU2bNjQ5mNFIhF27tzp0y8IAuLi4iASiXDFFVd0SfxE1D2kPR0AEfVuTz31FJKSkiAIAsrLy7Fq1Spcfvnl+PLLL3kTQEQEwGw247LLLsORI0fw+eefY+7cuW0er1Qq8eGHH2Ly5Mle7du3b0dRUREUCkVXhktE3YBJFhG1ad68eRgzZozn+9tuuw1GoxEfffQRkywiuuhZLBbMmTMHhw4dwpo1azBv3rzzPubyyy/Hp59+in/+85+QSn++Ffvwww8xevRoVFVVdWXIRNQNOF2QiDrEYDBApVJ53Rjk5eVBJBJh1apVXsfeddddEIlEWL58uadtzZo1GDduHEJCQqBSqTBo0CA888wzEAQBALB161aIRCJ8/vnnPtf+8MMPIRKJ8P333wMAjhw5guXLlyM5ORlKpRKRkZG49dZbUV1d7Tf2xMREnymQIpEI27Zt8zrm7HgB4NNPP4VIJEJiYqKn7fTp07j00ksRGRkJhUKBuLg4/PrXv0ZNTY3nGLvdjsceewyjR4+GXq+HWq3GlClTsHXrVq/zt7x+zz//vE/MGRkZmD59ulfb9OnTfdr27dvneT5ns1qteOCBB5CcnAyZTOb1vM93I+fvOn/9618hFovx4Ycf+n0O/r7O9vzzz2PSpEkIDQ2FSqXC6NGj8dlnn/m9/vvvv49x48YhKCgIwcHBmDp1KjZu3Aig9Z9ly9fZPyu3242XXnoJ6enpUCqVMBqN+NWvfoXa2lqv6yUmJuKKK67Axo0bMWLECCiVSgwZMgRr1qzxiS0nJweLFi1CSEgIgoKCMGHCBHz11Vdex2zbts0rJoVCgQEDBmDFihWe93tbmpqa8MQTT2DAgAFQKpWIiorCNddcg+zs7DYfd77X5mxOpxNPP/00UlJSoFAokJiYiEcffRQ2m83vOe+77z6f682ZM8fv9LaKigrPhzJKpRLDhw/Hu+++63VMa+99f+/7559/HiKRCHl5eZ62J554ol3v5eXLl3u9JwDgpZdewqBBg6BQKBAZGYlf/epXXr+/7WG1WjF37lwcPHgQq1evxvz589v1uCVLlqC6uhqbNm3ytNntdnz22We48cYb/T6mve/j//3vf5g/fz6io6OhUCiQkpKCp59+Gi6Xy+u46dOnIyMjAydOnMCMGTMQFBSEmJgYPPvssz7XfuWVV5Cenu75XRwzZozP3wAi8saRLCJqk8lkQlVVFQRBQEVFBV555RVYrVbcfPPNbT4uKysL77zzjk+72WzG+PHjsWzZMshkMmzYsAF/+MMfIJVK8cADD2D69OmIi4vDBx98gIULF3o99oMPPkBKSgomTpwIANi0aRNycnJwyy23IDIyEsePH8fbb7+N48eP44cffvC5oQSAKVOm4Je//CUA4OTJk/jb3/7W5vNwOp344x//6NNeX1+P2NhYXHnlldDpdDh27Bhee+01FBcX48svv/Q81//7v//DkiVLcMcdd8
BiseBf//oX5syZg71792LEiBFtXrsjfv/73/ttf+ihh/Dmm2/itttuwyWXXAKZTIY1a9b4TWLPZ+XKlfjTn/6Ef/zjH63eCP7yl7/ElClTAMDvdV5++WVcddVVuOmmm2C32/Hxxx9j0aJFWLdundcN6pNPPoknnngCkyZNwlNPPQW5XI49e/bg22+/xWWXXYaXXnoJVqsVwM8/x0cffRSDBw8GAGg0Gs+5fvWrX2HVqlW45ZZbcM899yA3NxevvvoqfvzxR+zatQsymcxzbGZmJq6//nr8+te/xrJly7By5UosWrQIGzZswOzZswEA5eXlmDRpEhoaGnDPPfcgNDQU7777Lq666ip89tlnPu/blrgaGxvxySef4NFHH0VERARuu+22Vl9rl8uFK664Alu2bMENN9yAe++9FxaLBZs2bcKxY8eQkpLS5s9qxIgReOCBB7za3nvvPa+begC4/fbb8e677+K6667DAw88gD179mDFihU4efKkz89OqVTigw8+wHPPPed5zYqKirBlyxYolUqvYxsbGzF9+nRkZWXh7rvvRlJSEj799FMsX74cdXV1uPfee9uMv6v97W9/wx//+EdMnToVd911l+c9sWfPHuzZs6dd0/Xq6+sxb9487Nu3D5999lmHRvYTExMxceJEfPTRR56Rr6+//homkwk33HAD/vnPf/o8pr3v41WrVkGj0eD++++HRqPBt99+i8ceewxmsxnPPfec1zlra2sxd+5cXHPNNVi8eDE+++wz/P73v8fQoUM9cb3zzju45557cN111+Hee+9FU1MTjhw5gj179rT6d4CIAAhERH6sXLlSAODzpVAohFWrVnkdm5ubKwAQVq5c6WlbvHixkJGRIcTFxQnLli1r81pDhgwRrrjiCs/3jzzyiKBQKIS6ujpPW0VFhSCVSoXHH3/c09bQ0OBzro8++kgAIOzYscOnLyYmRrjllls832/dulUAIGzdutXTlpCQ4BXv66+/LigUCmHGjBlCQkJCm8/jzjvvFDQajed7p9Mp2Gw2r2Nqa2sFo9Eo3HrrrZ62ltfvueee8zlnenq6MG3aNK+2adOmebWtX79eACDMnTtXOPfPelRUlDBnzhyvtscff1wAIFRWVrb5fM6+zldffSVIpVLhgQce8HtsZmamAEB49913fa5ztnN/Zna7XcjIyBAuvfRSr3OJxWJh4cKFgsvl8jre7Xb7XNvfz7HFd999JwAQPvjgA6/2DRs2+LQnJCQIAITVq1d72kwmkxAVFSWMHDnS03bfffcJAITvvvvO02axWISkpCQhMTHRE7O/uJqamgSxWCzceeedPrGe7d///rcAQHjhhRd8+vy9BmdLSEgQ5s+f79N+1113ef08Dh06JAAQbr/9dq/jHnzwQQGA8O2333qdc/bs2UJYWJjw2WefedqffvppYdKkST7XfOmllwQAwvvvv+9ps9vtwsSJEwWNRiOYzWZBEFp/7/t73z/33HMCACE3N9fT1t738rJlyzy/v5WVlYJSqRQmT54sOBwOzzGrVq0SAAivvPJKm+dq+duYkJAgyGQyYe3atW0e7++x+/btE1599VVBq9V6ficWLVokzJgxQxAE359hR97H/v4u/upXvxKCgoKEpqYmT9u0adMEAMJ7773nabPZbEJkZKRw7bXXetquvvpqIT09vd3PkYiacbogEbXptddew6ZNm7Bp0ya8//77mDFjBm6//Xa/U6haHDhwAJ9++ilWrFgBsdj/n5mqqioUFRVh1apVyMrKwtSpUz19S5cuhc1m85pG9sknn8DpdHqNoKlUKs//NzU1oaqqChMmTAAAHDx40Oeadru9QwvKGxoa8NRTT+Huu+9GfHy832NMJhPKy8uxZcsWfPXVV17PQyKRQC6XA2ie6lNTUwOn04kxY8b4ja8zBEHAI488gmuvvRbjx4/36bdYLAgNDb2ga+zduxeLFy/Gtdde6/NJeAu73Q4A5319z/6Z1dbWwmQyYc
qUKV6vx9q1a+F2u/HYY4/5vH/8jU625dNPP4Ver8fs2bNRVVXl+Ro9ejQ0Go3P1M3o6GivkSidToelS5fixx9/RFl
"text/plain": [
"<Figure size 1000x700 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
   ],
   "source": [
    "# Fit the final model with k = 3 and assign each sample to a cluster.\n",
    "kmeans = KMeans(n_clusters=3, random_state=9)\n",
    "df_clusters = kmeans.fit_predict(data_reduced_scaled)\n",
    "\n",
    "silhouette_avg = silhouette_score(data_reduced_scaled, df_clusters)\n",
    "print(f'Mean silhouette score: {silhouette_avg:.3f}')\n",
    "\n",
    "# Project onto two principal components for plotting only;\n",
    "# the clustering itself was fitted on data_reduced_scaled.\n",
    "pca = PCA(n_components=2)\n",
    "df_pca = pca.fit_transform(data_reduced_scaled)\n",
    "\n",
    "plt.figure(figsize=(10, 7))\n",
    "sns.scatterplot(x=df_pca[:, 0], y=df_pca[:, 1], hue=df_clusters, palette='viridis', alpha=0.7)\n",
    "plt.title('K-Means clusters in the PCA projection')\n",
    "plt.xlabel('First principal component')\n",
    "plt.ylabel('Second principal component')\n",
    "plt.legend(title='Cluster', loc='upper right')\n",
    "plt.show()"
]
},
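  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A two-component PCA scatter plot can exaggerate or hide cluster overlap, depending on how much variance the first two components retain. As a hedged check, the cell below reports that share and the size of each cluster. It assumes the fitted `pca` object and the `df_clusters` labels from the previous cell; the names `explained`, `cluster_ids` and `cluster_sizes` are introduced here for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Share of total variance captured by the two plotted components.\n",
    "explained = pca.explained_variance_ratio_\n",
    "print(f'Variance retained by the 2D projection: {explained.sum():.1%}')\n",
    "\n",
    "# Number of samples assigned to each cluster.\n",
    "cluster_ids, cluster_sizes = np.unique(df_clusters, return_counts=True)\n",
    "for cluster_id, size in zip(cluster_ids, cluster_sizes):\n",
    "    print(f'Cluster {cluster_id}: {size} points ({size / len(df_clusters):.1%})')"
   ]
  },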
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The clusters overlap to some extent in the two-component PCA projection. Together with the mean silhouette score of 0.282, this suggests that K-Means finds a usable but only moderately separated structure in this dataset."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}