MAI_PIbd-33_Volkov_N.A./lab1/lab1.ipynb
2024-10-26 01:15:17 +04:00

Working with Pandas DataFrame

Working with data - reading and writing CSV

In [48]:
import pandas as pd

# Load the stroke dataset and use the "id" column as the row index.
df = pd.read_csv("data/healthcare-dataset-stroke-data.csv", index_col="id")

df.to_csv("test.csv")
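
A minimal sketch of checking that a write followed by a read returns the same frame. It uses a hypothetical three-row sample and an in-memory buffer instead of the real CSV file, so the names `sample`, `buffer`, and `restored` are illustrative assumptions, not part of the notebook.

```python
import io

import pandas as pd

# Hypothetical sample that mimics the dataset layout (not the real data).
sample = pd.DataFrame(
    {
        "id": [9046, 51676, 31112],
        "gender": ["Male", "Female", "Male"],
        "bmi": [36.6, None, 32.5],
    }
).set_index("id")

# to_csv writes the index as the first column, so read_csv needs
# index_col="id" to restore it as the index.
buffer = io.StringIO()
sample.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="id")

# equals() treats NaN values in the same positions as equal.
round_trip_ok = restored.equals(sample)
print(round_trip_ok)
```

Without `index_col="id"` on the read side, the identifier would come back as an ordinary column and the comparison would fail.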

Working with data - basic commands

In [49]:
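# Column names, dtypes, and non-null counts.
df.info()

# Summary statistics of the numeric columns, one row per column.
print(df.describe().transpose())

# Drop three categorical columns, then show the first and last five rows.
cleared_df = df.drop(["ever_married", "work_type", "Residence_type"], axis=1)
print(cleared_df.head())
print(cleared_df.tail())

# Sort rows by gender in ascending order, then show both ends of the result.
sorted_df = cleared_df.sort_values(by="gender")
print(sorted_df.head())
print(sorted_df.tail())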
<class 'pandas.core.frame.DataFrame'>
Index: 5110 entries, 9046 to 44679
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             5110 non-null   object 
 1   age                5110 non-null   float64
 2   hypertension       5110 non-null   int64  
 3   heart_disease      5110 non-null   int64  
 4   ever_married       5110 non-null   object 
 5   work_type          5110 non-null   object 
 6   Residence_type     5110 non-null   object 
 7   avg_glucose_level  5110 non-null   float64
 8   bmi                4909 non-null   float64
 9   smoking_status     5110 non-null   object 
 10  stroke             5110 non-null   int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 479.1+ KB
                    count        mean        std    min     25%     50%  \
age                5110.0   43.226614  22.612647   0.08  25.000  45.000   
hypertension       5110.0    0.097456   0.296607   0.00   0.000   0.000   
heart_disease      5110.0    0.054012   0.226063   0.00   0.000   0.000   
avg_glucose_level  5110.0  106.147677  45.283560  55.12  77.245  91.885   
bmi                4909.0   28.893237   7.854067  10.30  23.500  28.100   
stroke             5110.0    0.048728   0.215320   0.00   0.000   0.000   

                      75%     max  
age                 61.00   82.00  
hypertension         0.00    1.00  
heart_disease        0.00    1.00  
avg_glucose_level  114.09  271.74  
bmi                 33.10   97.60  
stroke               0.00    1.00  
       gender   age  hypertension  heart_disease  avg_glucose_level   bmi  \
id                                                                          
9046     Male  67.0             0              1             228.69  36.6   
51676  Female  61.0             0              0             202.21   NaN   
31112    Male  80.0             0              1             105.92  32.5   
60182  Female  49.0             0              0             171.23  34.4   
1665   Female  79.0             1              0             174.12  24.0   

        smoking_status  stroke  
id                              
9046   formerly smoked       1  
51676     never smoked       1  
31112     never smoked       1  
60182           smokes       1  
1665      never smoked       1  
       gender   age  hypertension  heart_disease  avg_glucose_level   bmi  \
id                                                                          
18234  Female  80.0             1              0              83.75   NaN   
44873  Female  81.0             0              0             125.20  40.0   
19723  Female  35.0             0              0              82.99  30.6   
37544    Male  51.0             0              0             166.29  25.6   
44679  Female  44.0             0              0              85.28  26.2   

        smoking_status  stroke  
id                              
18234     never smoked       0  
44873     never smoked       0  
19723     never smoked       0  
37544  formerly smoked       0  
44679          Unknown       0  
       gender   age  hypertension  heart_disease  avg_glucose_level   bmi  \
id                                                                          
72369  Female  14.0             0              0              65.41  19.5   
3135   Female  73.0             0              0              69.35   NaN   
563    Female  41.0             0              0             216.71  36.2   
19364  Female   7.0             0              0              74.96  18.8   
55459  Female  60.0             0              0              91.82  28.3   

        smoking_status  stroke  
id                              
72369          Unknown       0  
3135      never smoked       0  
563       never smoked       0  
19364          Unknown       0  
55459  formerly smoked       0  
      gender   age  hypertension  heart_disease  avg_glucose_level   bmi  \
id                                                                         
33622   Male  62.0             1              0             211.49  41.1   
51554   Male  42.0             0              0             177.91   NaN   
2296    Male  78.0             1              0              90.19   NaN   
13602   Male  73.0             1              0             102.06   NaN   
56156  Other  26.0             0              0             143.33  22.4   

        smoking_status  stroke  
id                              
33622          Unknown       0  
51554          Unknown       0  
2296           Unknown       0  
13602          Unknown       0  
56156  formerly smoked       0  
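
The `info()` output above reports 4909 non-null `bmi` values out of 5110 rows, so 201 values are missing. A minimal sketch of counting missing values per column directly, using a hypothetical four-row sample rather than the real dataset:

```python
import pandas as pd

# Hypothetical sample; the values are illustrative, not read from the CSV.
sample = pd.DataFrame(
    {
        "age": [67.0, 61.0, 80.0, 49.0],
        "bmi": [36.6, None, 32.5, None],
    }
)

# isna() marks missing cells; summing the booleans counts them per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Same arithmetic as reading info(): total rows minus the non-null count.
missing_bmi = len(sample) - sample["bmi"].count()
print(missing_bmi)
```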

Working with data - accessing elements

In [50]:
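# A single column as a Series.
print(df["age"])

# A single row, selected by its index label (the id).
print(df.loc[63864])

# A single cell: row label and column name.
print(df.loc[63864, "Residence_type"])

# A label slice: both endpoints are included, rows come in file order.
print(df.loc[63864:63898, ["age", "Residence_type"]])

# The first three rows by position.
print(df[0:3])

# The first row by position.
print(df.iloc[0])

# A positional slice: the stop position is excluded.
print(df.iloc[3:5, 0:2])

# The same rows and columns, given as explicit position lists.
print(df.iloc[[3, 4], [0, 1]])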
id
9046     67.0
51676    61.0
31112    80.0
60182    49.0
1665     79.0
         ... 
18234    80.0
44873    81.0
19723    35.0
37544    51.0
44679    44.0
Name: age, Length: 5110, dtype: float64
gender                  Male
age                     62.0
hypertension               0
heart_disease              0
ever_married             Yes
work_type            Private
Residence_type         Rural
avg_glucose_level     107.61
bmi                     31.3
smoking_status       Unknown
stroke                     0
Name: 63864, dtype: object
Rural
        age Residence_type
id                        
63864  62.0          Rural
24177  57.0          Urban
57274  14.0          Urban
37213  60.0          Rural
59992  63.0          Urban
...     ...            ...
65277  78.0          Rural
52679  82.0          Rural
36728  74.0          Urban
46797  31.0          Rural
63898  53.0          Urban

[198 rows x 2 columns]
       gender   age  hypertension  heart_disease ever_married      work_type  \
id                                                                             
9046     Male  67.0             0              1          Yes        Private   
51676  Female  61.0             0              0          Yes  Self-employed   
31112    Male  80.0             0              1          Yes        Private   

      Residence_type  avg_glucose_level   bmi   smoking_status  stroke  
id                                                                      
9046           Urban             228.69  36.6  formerly smoked       1  
51676          Rural             202.21   NaN     never smoked       1  
31112          Rural             105.92  32.5     never smoked       1  
gender                          Male
age                             67.0
hypertension                       0
heart_disease                      1
ever_married                     Yes
work_type                    Private
Residence_type                 Urban
avg_glucose_level             228.69
bmi                             36.6
smoking_status       formerly smoked
stroke                             1
Name: 9046, dtype: object
       gender   age
id                 
60182  Female  49.0
1665   Female  79.0
       gender   age
id                 
60182  Female  49.0
1665   Female  79.0
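
The difference between label slices and positional slices is easy to miss in the output above. A minimal sketch on a hypothetical five-row sample (the ids and values are copied from the first rows shown earlier, but the frame itself is an assumption for illustration):

```python
import pandas as pd

# Hypothetical sample with an unsorted, unique "id" index.
sample = pd.DataFrame(
    {
        "gender": ["Male", "Female", "Male", "Female", "Female"],
        "age": [67.0, 61.0, 80.0, 49.0, 79.0],
    },
    index=pd.Index([9046, 51676, 31112, 60182, 1665], name="id"),
)

# Label slice: both endpoints are included, rows are taken in stored order.
by_label = sample.loc[51676:60182, ["age"]]

# Positional slice: the stop position is excluded.
by_position = sample.iloc[1:3, [1]]

print(len(by_label), len(by_position))
```

The label slice works on an unsorted index only because the labels are unique; with duplicate labels pandas cannot tell where the slice should start or stop.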

Working with data - selection and grouping

In [51]:
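# Distinct values of the gender column.
s_values = df["gender"].unique()
print(s_values)

# Count the rows for each gender value and accumulate the total.
s_total = 0
for s_value in s_values:
    count = df[df["gender"] == s_value].shape[0]
    s_total += count
    print(s_value, "count =", count)
print("Total count = ", s_total)

# Rows per (bmi, smoking_status) pair; groupby leaves out rows with a missing bmi.
print(df.groupby(["bmi", "smoking_status"]).size().reset_index(name="Count"))  # type: ignore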
['Male' 'Female' 'Other']
Male count = 2115
Female count = 2994
Other count = 1
Total count =  5110
       bmi smoking_status  Count
0     10.3        Unknown      1
1     11.3        Unknown      1
2     11.5   never smoked      1
3     12.0        Unknown      1
4     12.3        Unknown      1
...    ...            ...    ...
1185  66.8        Unknown      1
1186  71.9   never smoked      1
1187  78.0         smokes      1
1188  92.0   never smoked      1
1189  97.6        Unknown      1

[1190 rows x 3 columns]
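
Grouping on the raw `bmi` values yields 1190 groups, most of them with a single row. A minimal sketch of one alternative, binning `bmi` with `pd.cut` before grouping. The sample frame, the bin edges, and the band labels are assumptions chosen for illustration and do not come from the notebook.

```python
import pandas as pd

# Hypothetical sample; values are illustrative.
sample = pd.DataFrame(
    {
        "bmi": [18.0, 22.5, 27.3, 31.0, 36.6, None],
        "smoking_status": [
            "Unknown",
            "never smoked",
            "smokes",
            "never smoked",
            "formerly smoked",
            "smokes",
        ],
    }
)

# Illustrative bin edges; intervals are closed on the right by default.
bmi_band = pd.cut(
    sample["bmi"],
    bins=[0, 18.5, 25, 30, 100],
    labels=["up to 18.5", "18.5-25", "25-30", "over 30"],
).rename("bmi_band")

# observed=True keeps only the band/status pairs that actually occur.
counts = (
    sample.groupby([bmi_band, "smoking_status"], observed=True)
    .size()
    .reset_index(name="Count")
)
print(counts)
```

The row with a missing `bmi` falls into no band and is left out of the result, just as in the raw grouping.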

Visualization - source data

In [52]:
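# Work on a copy of the three columns used in the plots.
data = df[["age", "work_type", "smoking_status"]].copy()
# smoking_status has no NaN values ("Unknown" is an ordinary string), so this drops nothing.
data.dropna(subset=["smoking_status"], inplace=True)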
print(data)
        age      work_type   smoking_status
id                                         
9046   67.0        Private  formerly smoked
51676  61.0  Self-employed     never smoked
31112  80.0        Private     never smoked
60182  49.0        Private           smokes
1665   79.0  Self-employed     never smoked
...     ...            ...              ...
18234  80.0        Private     never smoked
44873  81.0  Self-employed     never smoked
19723  35.0  Self-employed     never smoked
37544  51.0        Private  formerly smoked
44679  44.0       Govt_job          Unknown

[5110 rows x 3 columns]
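
The frame still has all 5110 rows after `dropna`, because `"Unknown"` is stored as a regular string. If the intent is to exclude rows with an unknown smoking status (an assumption, the notebook does not say so), a minimal sketch on a hypothetical sample could look like this:

```python
import pandas as pd

# Hypothetical sample; values are illustrative.
sample = pd.DataFrame(
    {
        "age": [67.0, 61.0, 44.0, 14.0],
        "smoking_status": ["formerly smoked", "never smoked", "Unknown", "Unknown"],
    }
)

# dropna alone removes nothing: "Unknown" is not a missing value to pandas.
rows_after_plain_dropna = len(sample.dropna(subset=["smoking_status"]))

# mask() replaces the matching entries with NaN, after which dropna applies.
is_unknown = sample["smoking_status"] == "Unknown"
cleaned = sample.assign(smoking_status=sample["smoking_status"].mask(is_unknown))
known = cleaned.dropna(subset=["smoking_status"])

print(rows_after_plain_dropna, len(known))
```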

Visualization - line chart

In [53]:
import matplotlib.pyplot as plt
average_age = data.groupby("smoking_status")["age"].mean()
average_age.plot(
    kind="line",
    marker="o",
    title="Average Age by Smoking Status",
    xlabel="Smoking Status",
    ylabel="Average Age",
)
plt.grid(True)
plt.show()

Visualization - bar chart

In [62]:
pivot_table = data.groupby(["work_type", "smoking_status"]).size().unstack()

pivot_table.plot(kind="bar", stacked=True, figsize=(10, 6))

plt.title("Smoking Status by Work Type")
plt.xlabel("Work Type")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.legend(title="Smoking Status")
plt.grid(axis="y")
plt.tight_layout()

plt.show()

Visualization - histogram

In [61]:
plt.hist(data["age"], bins=10, edgecolor="black")
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.grid(axis="y")
plt.show()

Visualization - box plot

In [56]:
import pandas as pd
import matplotlib.pyplot as plt

data = df[["age", "work_type", "smoking_status"]].copy()
data.dropna(subset=["smoking_status"], inplace=True)


plt.figure(figsize=(10, 6))

# Look up the statuses once so the boxes and tick labels share the same order.
statuses = data["smoking_status"].unique()
box_data = [data.loc[data["smoking_status"] == status, "age"] for status in statuses]
plt.boxplot(box_data)

plt.xticks(range(1, len(statuses) + 1), list(statuses))

plt.title("Box Plot of Age by Smoking Status")
plt.xlabel("Smoking Status")
plt.ylabel("Age")

plt.show()
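
With its default settings, `plt.boxplot` draws the box from the first to the third quartile and marks points lying more than 1.5 interquartile ranges beyond the box as outliers. A minimal sketch of that rule computed by hand, on hypothetical ages that are not taken from the dataset:

```python
import pandas as pd

# Hypothetical ages; 200.0 is deliberately implausible to produce an outlier.
ages = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 200.0])

# Quartiles and interquartile range.
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Points outside these fences are drawn as individual markers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(q1, q3, lower_fence, upper_fence)
print(outliers.tolist())
```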

Visualization - area chart

In [57]:
data = df[["age", "work_type", "smoking_status"]].copy()
data.dropna(subset=["smoking_status"], inplace=True)

grouped_data = (
    data.groupby(["work_type", "smoking_status"]).size().unstack(fill_value=0)
)

grouped_data.plot(kind="area", alpha=0.5, stacked=True)

plt.title("Area Chart of Smoking Status by Work Type")
plt.xlabel("Work Type")
plt.ylabel("Number of Observations")
plt.legend(title="Smoking Status")
plt.grid(True)

plt.show()

Visualization - scatter plot

In [58]:
plt.scatter(df["bmi"], df["avg_glucose_level"], alpha=0.5)
plt.title("BMI vs Average Glucose Level")
plt.xlabel("BMI")
plt.ylabel("Average Glucose Level")
plt.grid(True)
plt.show()

Visualization - pie chart

In [59]:
gender_counts = df["gender"].value_counts()

labels = [str(label) for label in gender_counts.index]

plt.figure(figsize=(8, 6))
plt.pie(gender_counts, labels=labels, autopct="%1.1f%%", startangle=90)
plt.title("Distribution of Gender")
plt.axis("equal")
plt.show()
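
The percentages that `autopct="%1.1f%%"` prints on the slices can be reproduced from the counts shown in the selection and grouping section (2994 Female, 2115 Male, 1 Other). A minimal sketch that hard-codes those counts instead of reading the CSV:

```python
import pandas as pd

# Counts copied from the printed output in the selection and grouping section.
gender_counts = pd.Series({"Female": 2994, "Male": 2115, "Other": 1})

# Share of each category in percent, formatted the same way as autopct.
shares = gender_counts / gender_counts.sum() * 100
labels = ["%1.1f%%" % share for share in shares]

print(dict(zip(shares.index, labels)))
```

The single "Other" row rounds to 0.0%, so its slice is effectively invisible in the chart.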