AIM-PIbd-31-Danilov-V-V/Lab5/Lab5.ipynb
Владимир Данилов 33ec6ee8e6 Lab5
2025-03-14 15:49:38 +04:00


Start of the lab work

Loading data from a CSV file into a DataFrame

In [38]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
In [39]:
df = pd.read_csv("./static/csv/ds_salaries.csv")
print(df.columns)
Index(['work_year', 'experience_level', 'employment_type', 'job_title',
       'salary', 'salary_currency', 'salary_in_usd', 'employee_residence',
       'remote_ratio', 'company_location', 'company_size'],
      dtype='object')

Cleaning salary outliers

In [40]:
Q1 = df["salary_in_usd"].quantile(0.25)
Q3 = df["salary_in_usd"].quantile(0.75)
IQR = Q3 - Q1
threshold = 1.5 * IQR
lower_bound = Q1 - threshold
upper_bound = Q3 + threshold

outliers = (df["salary_in_usd"] < lower_bound) | (df["salary_in_usd"] > upper_bound)
print("Outliers:", df[outliers])

# Replace outlier salaries with the median (rows are kept, not dropped)
median_salary = df["salary_in_usd"].median()
df.loc[outliers, "salary_in_usd"] = median_salary
Outliers:       work_year experience_level employment_type  \
33         2023               SE              FT   
68         2023               SE              FT   
83         2022               EN              FT   
133        2023               SE              FT   
145        2023               SE              FT   
...         ...              ...             ...   
3522       2020               MI              FT   
3675       2021               EX              CT   
3697       2020               EX              FT   
3747       2021               MI              FT   
3750       2020               SE              FT   

                               job_title  salary salary_currency  \
33              Computer Vision Engineer  342810             USD   
68                     Applied Scientist  309400             USD   
83                          AI Developer  300000             USD   
133            Machine Learning Engineer  342300             USD   
145            Machine Learning Engineer  318300             USD   
...                                  ...     ...             ...   
3522                  Research Scientist  450000             USD   
3675            Principal Data Scientist  416000             USD   
3697            Director of Data Science  325000             USD   
3747  Applied Machine Learning Scientist  423000             USD   
3750                      Data Scientist  412000             USD   

      salary_in_usd employee_residence  remote_ratio company_location  \
33           342810                 US             0               US   
68           309400                 US             0               US   
83           300000                 IN            50               IN   
133          342300                 US             0               US   
145          318300                 US           100               US   
...             ...                ...           ...              ...   
3522         450000                 US             0               US   
3675         416000                 US           100               US   
3697         325000                 US           100               US   
3747         423000                 US            50               US   
3750         412000                 US           100               US   

     company_size  
33              M  
68              L  
83              L  
133             L  
145             M  
...           ...  
3522            M  
3675            S  
3697            L  
3747            L  
3750            L  

[63 rows x 11 columns]
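
The cell above applies Tukey's IQR rule and then overwrites the flagged salaries with the median. A minimal sketch of the same steps on a small made-up series (the values and the helper name `iqr_bounds` are illustrative assumptions, not taken from ds_salaries.csv) makes the arithmetic easy to check by hand:

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences of the IQR rule for a numeric series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy salaries (hypothetical values, not from the dataset)
toy = pd.Series([50_000, 60_000, 70_000, 80_000, 90_000, 400_000], dtype="float64")

lower, upper = iqr_bounds(toy)
is_outlier = (toy < lower) | (toy > upper)

# As in the notebook, the median is taken before replacement,
# so it is computed over the series that still contains the outliers.
cleaned = toy.mask(is_outlier, toy.median())

print(lower, upper)
print(cleaned.tolist())
```

Replacing with the median keeps the row count unchanged, which matters later because cluster labels are written back into `df` by position.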

Visualizing relationships

In [41]:
sns.set(style="whitegrid")
plt.figure(figsize=(16, 12))
plt.subplot(2, 2, 1)
sns.scatterplot(x=df['experience_level'], y=df['salary_in_usd'], alpha=0.6)
plt.title('Experience Level vs Salary')

plt.subplot(2, 2, 2)
sns.scatterplot(x=df['job_title'], y=df['salary_in_usd'], alpha=0.6)
plt.xticks(rotation=90)
plt.title('Job Title vs Salary')

plt.subplot(2, 2, 3)
sns.scatterplot(x=df['remote_ratio'], y=df['salary_in_usd'], alpha=0.6)
plt.title('Remote Ratio vs Salary')

plt.subplot(2, 2, 4)
sns.scatterplot(x=df['company_size'], y=df['salary_in_usd'], alpha=0.6)
plt.title('Company Size vs Salary')

plt.tight_layout()
plt.show()
[Figure: 2×2 grid of scatter plots: Experience Level vs Salary, Job Title vs Salary, Remote Ratio vs Salary, Company Size vs Salary]

Standardizing the data

In [42]:
# Standardize the numeric columns
numeric_df = df.select_dtypes(include=[np.number])
df_scaled = pd.DataFrame(StandardScaler().fit_transform(numeric_df), columns=numeric_df.columns)

# Reduce dimensionality to 2 components
pca = PCA(n_components=2)
kc_pca = pca.fit_transform(df_scaled)

# Visualization
plt.figure(figsize=(8, 6))
plt.scatter(kc_pca[:, 0], kc_pca[:, 1], alpha=0.6)
plt.title("PCA Visualization of data")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.show()
[Figure: PCA Visualization of data, Principal Component 1 vs Principal Component 2]
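
When data is projected onto two principal components, it helps to know how much of the total variance the projection keeps. The sketch below shows one way to check this with `explained_variance_ratio_`; it uses synthetic data as a stand-in for the notebook's numeric columns, so the array `X` and its shape are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 4 numeric features, two of them strongly correlated
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 2)),
])

X_scaled = StandardScaler().fit_transform(X)
pca_check = PCA(n_components=2).fit(X_scaled)

# Share of total variance carried by each retained component
ratios = pca_check.explained_variance_ratio_
print("Per component:", ratios)
print("Retained in 2D:", ratios.sum())
```

If the retained share is low, clusters that look mixed in the 2D scatter plot may still be well separated in the original feature space.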

Hierarchical clustering

In [43]:
features = df[['salary_in_usd', 'remote_ratio', 'work_year']]
scaled_features = StandardScaler().fit_transform(features)

# Build the dendrogram
linkage_matrix = linkage(scaled_features, method='ward')
plt.figure(figsize=(12, 8))
dendrogram(linkage_matrix, labels=df.index, leaf_rotation=90, leaf_font_size=10)
plt.title('Hierarchical clustering (dendrogram)')
plt.xlabel('Sample index')
plt.ylabel('Ward linkage distance')
plt.tight_layout()
plt.show()

result = fcluster(linkage_matrix, t=20, criterion='distance')
df['cluster_agg'] = result
[Figure: dendrogram of the hierarchical clustering]
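
The cell above cuts the tree at a fixed height (`t=20`, `criterion='distance'`), so the number of clusters depends on the data. If a specific number of clusters is wanted instead, `fcluster` also accepts `criterion='maxclust'`. The following is a minimal sketch on synthetic blobs (the points and blob centers are made up for illustration), not a claim about what the salary data would produce:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs of 30 points each (illustrative data)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in blob_centers])

Z = linkage(points, method="ward")

# Cut at a fixed merge height: the cluster count follows from the data
by_distance = fcluster(Z, t=20, criterion="distance")

# Ask for at most 3 flat clusters: the cut height follows from the count
by_count = fcluster(Z, t=3, criterion="maxclust")

print("Clusters from distance cut:", len(np.unique(by_distance)))
print("Clusters from maxclust cut:", len(np.unique(by_count)))
```

Checking `len(np.unique(...))` after a distance cut is a cheap way to see how many clusters the chosen threshold actually yields.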

Visualizing the clusters

In [44]:
plt.figure(figsize=(16, 12))
plt.subplot(2, 2, 1)
sns.scatterplot(x=df['remote_ratio'], y=df['salary_in_usd'], hue=result, palette='Set1', alpha=0.6)
plt.title('Remote Ratio vs Salary Clusters')

plt.subplot(2, 2, 2)
sns.scatterplot(x=df['work_year'], y=df['salary_in_usd'], hue=result, palette='Set1', alpha=0.6)
plt.title('Work Year vs Salary Clusters')

plt.subplot(2, 2, 3)
sns.scatterplot(x=df['salary_in_usd'], y=df['company_size'], hue=result, palette='Set1', alpha=0.6)
plt.title('Salary vs Company Size Clusters')

plt.subplot(2, 2, 4)
sns.scatterplot(x=df['remote_ratio'], y=df['work_year'], hue=result, palette='Set1', alpha=0.6)
plt.title('Remote Ratio vs Work Year Clusters')

plt.tight_layout()
plt.show()
[Figure: 2×2 grid of cluster scatter plots: Remote Ratio vs Salary, Work Year vs Salary, Salary vs Company Size, Remote Ratio vs Work Year]

KMeans clustering

In [45]:
features_used = ['salary_in_usd', 'remote_ratio', 'work_year']
data_to_scale = df[features_used]
data_scaled = StandardScaler().fit_transform(data_to_scale)

random_state = 9
kmeans = KMeans(n_clusters=3, random_state=random_state, n_init=10)
labels = kmeans.fit_predict(data_scaled)
centers = kmeans.cluster_centers_

# Visualize the clusters
plt.figure(figsize=(16, 12))
plt.subplot(1, 2, 1)
sns.scatterplot(x=kc_pca[:, 0], y=kc_pca[:, 1], hue=result, palette='Set1', alpha=0.6)
plt.title('PCA reduced data: Agglomerative Clustering')

plt.subplot(1, 2, 2)
sns.scatterplot(x=kc_pca[:, 0], y=kc_pca[:, 1], hue=labels, palette='Set1', alpha=0.6)
plt.title('PCA reduced data: KMeans Clustering')

plt.tight_layout()
plt.show()
[Figure: PCA reduced data, colored by Agglomerative Clustering (left) and KMeans Clustering (right)]

Optimal number of clusters (elbow method and silhouette coefficient)

In [46]:
inertias = []
silhouette_scores = []
k_range = range(2, 11)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(df_scaled)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(df_scaled, labels))

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(k_range, inertias, marker='o')
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.title("Elbow method")

plt.subplot(1, 2, 2)
plt.plot(k_range, silhouette_scores, marker='o')
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette coefficient")
plt.title("Silhouette score")
plt.show()
[Figure: elbow method (inertia vs number of clusters) and silhouette coefficient vs number of clusters]
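
The silhouette curve can also be read programmatically by taking the `k` with the highest score. The sketch below does this on synthetic blobs with four known groups (the data and centers are assumptions for illustration). It uses loop-local names (`km`, `labels_k`) so that variables such as `kmeans` and `labels` from earlier cells are not overwritten by the search.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic groups (illustrative data, not the salary dataset)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels_k = km.fit_predict(X)
    scores[k] = silhouette_score(X, labels_k)

# Pick the number of clusters with the highest mean silhouette
best_k = max(scores, key=scores.get)
print("Silhouette by k:", scores)
print("Best k:", best_k)
```

On real data the maximum is often less pronounced than on blobs, so the elbow plot and domain knowledge remain useful alongside this number.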

Visualizing the clusters

In [47]:
kmeans = KMeans(n_clusters=3, random_state=42)
df_clusters = kmeans.fit_predict(df_scaled)

# Assess clustering quality
silhouette_avg = silhouette_score(df_scaled, df_clusters)
print(f'Mean silhouette coefficient: {silhouette_avg:.3f}')

# Visualize the clusters
pca = PCA(n_components=2)
df_pca = pca.fit_transform(df_scaled)

plt.figure(figsize=(10, 7))
sns.scatterplot(x=df_pca[:, 0], y=df_pca[:, 1], hue=df_clusters, palette='viridis', alpha=0.7)
plt.title('K-Means clusters visualized with PCA')
plt.xlabel('First PCA component')
plt.ylabel('Second PCA component')
plt.legend(title='Cluster', loc='upper right')
plt.show()
Mean silhouette coefficient: 0.380
[Figure: K-Means clusters in the space of the first two PCA components]

Assessing clustering quality

In [48]:
# Note: `labels` was reassigned by the loop in In [46], so the K-Means score below
# is for the last fit there (k=10 on df_scaled), not the 3-cluster model from In [45].
silhouette_kmeans = silhouette_score(data_scaled, labels)
silhouette_agg = silhouette_score(data_scaled, result)
print(f"K-Means silhouette coefficient: {silhouette_kmeans}")
print(f"Agglomerative Clustering silhouette coefficient: {silhouette_agg}")
K-Means silhouette coefficient: 0.4333027060014635
Agglomerative Clustering silhouette coefficient: 0.4550226566701248
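
A silhouette comparison between two methods is easiest to interpret when both are fitted on the same features with the same number of clusters. The sketch below shows one possible setup using `AgglomerativeClustering` (imported at the top of the notebook) alongside `KMeans`, plus the adjusted Rand index to measure how much the two partitions agree. The data is synthetic and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs of 40 points each (illustrative data)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in blob_centers])

# Same features, same number of clusters for both methods
kmeans_labels = KMeans(n_clusters=3, random_state=9, n_init=10).fit_predict(X)
agg_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

sil_kmeans = silhouette_score(X, kmeans_labels)
sil_agg = silhouette_score(X, agg_labels)

# 1.0 means the two partitions are identical up to a renaming of the labels
ari = adjusted_rand_score(kmeans_labels, agg_labels)

print(f"K-Means silhouette: {sil_kmeans:.3f}")
print(f"Agglomerative silhouette: {sil_agg:.3f}")
print(f"Adjusted Rand index: {ari:.3f}")
```

Storing each method's labels under its own name, as here, also avoids scoring a stale or overwritten label array by accident.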