MAI_PIbd-33_Tikhonenkov_A_E/lab1.ipynb
2024-11-22 22:56:37 +04:00


Working with Pandas DataFrame

Working with data - reading and writing CSV

In [1]:
import pandas as pd

df = pd.read_csv("data/healthcare-dataset-stroke-data.csv", index_col="id")

df.to_csv("lab1.csv")
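The cell above reads the CSV with `index_col="id"` and writes the frame back out. A minimal sketch of that round trip follows, using an in-memory buffer and a hypothetical three-row stand-in for the dataset (the names `csv_text`, `df_small`, and `restored` are illustrative, not from the notebook). It suggests that the `id` index survives as long as the file is re-read with the same `index_col`.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the stroke dataset (illustrative values only).
csv_text = "id,gender,age\n9046,Male,67.0\n51676,Female,61.0\n31112,Male,80.0\n"

# index_col="id" turns the id column into the row index instead of a data column.
df_small = pd.read_csv(io.StringIO(csv_text), index_col="id")

# to_csv writes the index as the first column by default, so the ids are kept.
buffer = io.StringIO()
df_small.to_csv(buffer)

# Reading back with the same index_col restores an equivalent frame.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="id")
print(restored.equals(df_small))
```

Reading the written file without `index_col` would instead produce an ordinary `id` column plus a fresh 0-based index.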

Working with data - basic commands

In [2]:
df.info()

print(df.describe().transpose())

cleared_df = df.drop(["ever_married", "work_type", "Residence_type"], axis=1)
print(cleared_df.head())
print(cleared_df.tail())

sorted_df = cleared_df.sort_values(by="gender")
print(sorted_df.head())
print(sorted_df.tail())
<class 'pandas.core.frame.DataFrame'>
Index: 5110 entries, 9046 to 44679
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             5110 non-null   object 
 1   age                5110 non-null   float64
 2   hypertension       5110 non-null   int64  
 3   heart_disease      5110 non-null   int64  
 4   ever_married       5110 non-null   object 
 5   work_type          5110 non-null   object 
 6   Residence_type     5110 non-null   object 
 7   avg_glucose_level  5110 non-null   float64
 8   bmi                4909 non-null   float64
 9   smoking_status     5110 non-null   object 
 10  stroke             5110 non-null   int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 479.1+ KB
                    count        mean        std    min     25%     50%  \
age                5110.0   43.226614  22.612647   0.08  25.000  45.000   
hypertension       5110.0    0.097456   0.296607   0.00   0.000   0.000   
heart_disease      5110.0    0.054012   0.226063   0.00   0.000   0.000   
avg_glucose_level  5110.0  106.147677  45.283560  55.12  77.245  91.885   
bmi                4909.0   28.893237   7.854067  10.30  23.500  28.100   
stroke             5110.0    0.048728   0.215320   0.00   0.000   0.000   

                      75%     max  
age                 61.00   82.00  
hypertension         0.00    1.00  
heart_disease        0.00    1.00  
avg_glucose_level  114.09  271.74  
bmi                 33.10   97.60  
stroke               0.00    1.00  
       gender   age  hypertension  heart_disease  avg_glucose_level   bmi  \
id                                                                          
9046     Male  67.0             0              1             228.69  36.6   
51676  Female  61.0             0              0             202.21   NaN   
31112    Male  80.0             0              1             105.92  32.5   
60182  Female  49.0             0              0             171.23  34.4   
1665   Female  79.0             1              0             174.12  24.0   

        smoking_status  stroke  
id                              
9046   formerly smoked       1  
51676     never smoked       1  
31112     never smoked       1  
60182           smokes       1  
1665      never smoked       1  
       gender   age  hypertension  heart_disease  avg_glucose_level   bmi  \
id                                                                          
18234  Female  80.0             1              0              83.75   NaN   
44873  Female  81.0             0              0             125.20  40.0   
19723  Female  35.0             0              0              82.99  30.6   
37544    Male  51.0             0              0             166.29  25.6   
44679  Female  44.0             0              0              85.28  26.2   

        smoking_status  stroke  
id                              
18234     never smoked       0  
44873     never smoked       0  
19723     never smoked       0  
37544  formerly smoked       0  
44679          Unknown       0  
       gender   age  hypertension  heart_disease  avg_glucose_level   bmi  \
id                                                                          
72369  Female  14.0             0              0              65.41  19.5   
3135   Female  73.0             0              0              69.35   NaN   
563    Female  41.0             0              0             216.71  36.2   
19364  Female   7.0             0              0              74.96  18.8   
55459  Female  60.0             0              0              91.82  28.3   

        smoking_status  stroke  
id                              
72369          Unknown       0  
3135      never smoked       0  
563       never smoked       0  
19364          Unknown       0  
55459  formerly smoked       0  
      gender   age  hypertension  heart_disease  avg_glucose_level   bmi  \
id                                                                         
33622   Male  62.0             1              0             211.49  41.1   
51554   Male  42.0             0              0             177.91   NaN   
2296    Male  78.0             1              0              90.19   NaN   
13602   Male  73.0             1              0             102.06   NaN   
56156  Other  26.0             0              0             143.33  22.4   

        smoking_status  stroke  
id                              
33622          Unknown       0  
51554          Unknown       0  
2296           Unknown       0  
13602          Unknown       0  
56156  formerly smoked       0  
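The `info()` output above reports 4909 non-null `bmi` values out of 5110 rows, which implies 201 missing entries. The sketch below shows one way to count missing values per column directly, and also illustrates that `drop` returns a new frame rather than modifying the original. The four-row frame `sample` is a hypothetical stand-in, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row stand-in for the dataset (illustrative values only).
sample = pd.DataFrame(
    {
        "gender": ["Male", "Female", "Male", "Female"],
        "age": [67.0, 61.0, 80.0, 49.0],
        "bmi": [36.6, np.nan, 32.5, 34.4],
    },
    index=pd.Index([9046, 51676, 31112, 60182], name="id"),
)

# Count missing values in every column; only bmi has one here.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# drop returns a new DataFrame; the original still has the gender column.
trimmed = sample.drop(["gender"], axis=1)
print(list(sample.columns), list(trimmed.columns))
```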

Working with data - working with elements

In [13]:
print(df["age"])

print(df.loc[63864])

print(df.loc[63864, "Residence_type"])

print(df.loc[63864:63898, ["age", "Residence_type"]])

print(df[0:3])

print(df.iloc[0])

print(df.iloc[3:5, 0:2])

print(df.iloc[[3, 4], [0, 1]])
id
9046     67.0
51676    61.0
31112    80.0
60182    49.0
1665     79.0
         ... 
18234    80.0
44873    81.0
19723    35.0
37544    51.0
44679    44.0
Name: age, Length: 5110, dtype: float64
gender                  Male
age                     62.0
hypertension               0
heart_disease              0
ever_married             Yes
work_type            Private
Residence_type         Rural
avg_glucose_level     107.61
bmi                     31.3
smoking_status       Unknown
stroke                     0
Name: 63864, dtype: object
Rural
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[13], line 7
      3 print(df.loc[63864])
      5 print(df.loc[63864, "Residence_type"])
----> 7 print(df.loc[63864:63898, ["Возраст", "Residence_type"]])
      9 print(df[0:3])
     11 print(df.iloc[0])

File d:\Users\Leo\AppData\Local\pypoetry\Cache\virtualenvs\mai-S9i2J6c7-py3.12\Lib\site-packages\pandas\core\indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
   1182     if self._is_scalar_access(key):
   1183         return self.obj._get_value(*key, takeable=self._takeable)
-> 1184     return self._getitem_tuple(key)
   1185 else:
   1186     # we by definition only have the 0th axis
   1187     axis = self.axis or 0

File d:\Users\Leo\AppData\Local\pypoetry\Cache\virtualenvs\mai-S9i2J6c7-py3.12\Lib\site-packages\pandas\core\indexing.py:1377, in _LocIndexer._getitem_tuple(self, tup)
   1374 if self._multi_take_opportunity(tup):
   1375     return self._multi_take(tup)
-> 1377 return self._getitem_tuple_same_dim(tup)

File d:\Users\Leo\AppData\Local\pypoetry\Cache\virtualenvs\mai-S9i2J6c7-py3.12\Lib\site-packages\pandas\core\indexing.py:1020, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
   1017 if com.is_null_slice(key):
   1018     continue
-> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
   1021 # We should never have retval.ndim < self.ndim, as that should
   1022 #  be handled by the _getitem_lowerdim call above.
   1023 assert retval.ndim == self.ndim

File d:\Users\Leo\AppData\Local\pypoetry\Cache\virtualenvs\mai-S9i2J6c7-py3.12\Lib\site-packages\pandas\core\indexing.py:1420, in _LocIndexer._getitem_axis(self, key, axis)
   1417     if hasattr(key, "ndim") and key.ndim > 1:
   1418         raise ValueError("Cannot index with multidimensional key")
-> 1420     return self._getitem_iterable(key, axis=axis)
   1422 # nested tuple slicing
   1423 if is_nested_tuple(key, labels):

File d:\Users\Leo\AppData\Local\pypoetry\Cache\virtualenvs\mai-S9i2J6c7-py3.12\Lib\site-packages\pandas\core\indexing.py:1360, in _LocIndexer._getitem_iterable(self, key, axis)
   1357 self._validate_key(key, axis)
   1359 # A collection of keys
-> 1360 keyarr, indexer = self._get_listlike_indexer(key, axis)
   1361 return self.obj._reindex_with_indexers(
   1362     {axis: [keyarr, indexer]}, copy=True, allow_dups=True
   1363 )

File d:\Users\Leo\AppData\Local\pypoetry\Cache\virtualenvs\mai-S9i2J6c7-py3.12\Lib\site-packages\pandas\core\indexing.py:1558, in _LocIndexer._get_listlike_indexer(self, key, axis)
   1555 ax = self.obj._get_axis(axis)
   1556 axis_name = self.obj._get_axis_name(axis)
-> 1558 keyarr, indexer = ax._get_indexer_strict(key, axis_name)
   1560 return keyarr, indexer

File d:\Users\Leo\AppData\Local\pypoetry\Cache\virtualenvs\mai-S9i2J6c7-py3.12\Lib\site-packages\pandas\core\indexes\base.py:6200, in Index._get_indexer_strict(self, key, axis_name)
   6197 else:
   6198     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6200 self._raise_if_missing(keyarr, indexer, axis_name)
   6202 keyarr = self.take(indexer)
   6203 if isinstance(key, Index):
   6204     # GH 42790 - Preserve name from an Index

File d:\Users\Leo\AppData\Local\pypoetry\Cache\virtualenvs\mai-S9i2J6c7-py3.12\Lib\site-packages\pandas\core\indexes\base.py:6252, in Index._raise_if_missing(self, key, indexer, axis_name)
   6249     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6251 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 6252 raise KeyError(f"{not_found} not in index")

KeyError: "['Возраст'] not in index"
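The traceback above was produced by a run in which line 7 asked for a column labelled "Возраст", which is not among the DataFrame's columns, so `.loc` raised `KeyError`. One possible way to guard against this is to check the requested labels against `columns` before indexing. The sketch below uses a hypothetical three-row frame; the index value 100 and the presence of id 63898 are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical frame; ids 100 and 63898 are assumed purely for illustration.
frame = pd.DataFrame(
    {"age": [62.0, 45.0, 30.0], "Residence_type": ["Rural", "Urban", "Rural"]},
    index=pd.Index([63864, 100, 63898], name="id"),
)

wanted = ["Возраст", "Residence_type"]

# Labels that .loc would reject with KeyError.
missing = [name for name in wanted if name not in frame.columns]
print("missing columns:", missing)

# Keep only labels that exist, then slice rows by label.
# A label slice includes both endpoints and follows the stored row order.
present = [name for name in wanted if name in frame.columns]
subset = frame.loc[63864:63898, present]
print(subset)
```

Note that a label slice on an unsorted index only works when both endpoint labels exist and are unique, so the row range deserves the same kind of check as the column list.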

Working with data - selection and grouping

In [4]:
s_values = df["gender"].unique()
print(s_values)

s_total = 0
for s_value in s_values:
    count = df[df["gender"] == s_value].shape[0]
    s_total += count
    print(s_value, "count =", count)
print("Total count = ", s_total)

print(df.groupby(["bmi", "smoking_status"]).size().reset_index(name="Count"))  # type: ignore
['Male' 'Female' 'Other']
Male count = 2115
Female count = 2994
Other count = 1
Total count =  5110
       bmi smoking_status  Count
0     10.3        Unknown      1
1     11.3        Unknown      1
2     11.5   never smoked      1
3     12.0        Unknown      1
4     12.3        Unknown      1
...    ...            ...    ...
1185  66.8        Unknown      1
1186  71.9   never smoked      1
1187  78.0         smokes      1
1188  92.0   never smoked      1
1189  97.6        Unknown      1

[1190 rows x 3 columns]
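Grouping on the raw `bmi` values produces 1190 groups, most with a count of 1, because `bmi` is continuous. A possible alternative is to bin `bmi` with `pd.cut` before grouping, and to replace the manual counting loop with `value_counts`. The sketch below uses a hypothetical five-row frame; the bin edges and band labels are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical five-row stand-in (illustrative values only).
people = pd.DataFrame(
    {
        "gender": ["Male", "Female", "Female", "Other", "Female"],
        "bmi": [17.0, 22.5, 27.3, 31.0, 36.6],
        "smoking_status": [
            "Unknown", "never smoked", "smokes", "never smoked", "never smoked",
        ],
    }
)

# value_counts replaces the explicit loop over unique values.
gender_counts = people["gender"].value_counts()
print(gender_counts)

# Assumed bin edges; right=False makes each interval closed on the left.
bins = [0, 18.5, 25, 30, 100]
labels = ["<18.5", "18.5-25", "25-30", ">=30"]
people["bmi_band"] = pd.cut(people["bmi"], bins=bins, labels=labels, right=False)

# observed=True keeps only band/status combinations that actually occur.
band_counts = (
    people.groupby(["bmi_band", "smoking_status"], observed=True)
    .size()
    .reset_index(name="Count")
)
print(band_counts)
```

By default `groupby` skips rows whose key is missing, so rows without a `bmi` value would not appear in either version of the table.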

Visualization - source data

In [5]:
data = df[["age", "work_type", "smoking_status"]].copy()
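# smoking_status has no NaN values (missing entries are coded as "Unknown"),
# so this call removes no rows.
data.dropna(subset=["smoking_status"], inplace=True)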
print(data)
        age      work_type   smoking_status
id                                         
9046   67.0        Private  formerly smoked
51676  61.0  Self-employed     never smoked
31112  80.0        Private     never smoked
60182  49.0        Private           smokes
1665   79.0  Self-employed     never smoked
...     ...            ...              ...
18234  80.0        Private     never smoked
44873  81.0  Self-employed     never smoked
19723  35.0  Self-employed     never smoked
37544  51.0        Private  formerly smoked
44679  44.0       Govt_job          Unknown

[5110 rows x 3 columns]
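The result above still has all 5110 rows: `smoking_status` contains no NaN values, and unknown statuses are stored as the string "Unknown", which `dropna` does not treat as missing. If the intent is to exclude those rows, a boolean filter is one option. The sketch below assumes that intent and uses a hypothetical three-row frame.

```python
import pandas as pd

# Hypothetical three-row stand-in (illustrative values only).
records = pd.DataFrame(
    {
        "age": [67.0, 61.0, 44.0],
        "smoking_status": ["formerly smoked", "never smoked", "Unknown"],
    }
)

# dropna leaves the frame unchanged because "Unknown" is an ordinary string.
untouched = records.dropna(subset=["smoking_status"])
print(len(untouched))

# Filtering on the sentinel value removes the rows with an unknown status.
known_only = records[records["smoking_status"] != "Unknown"]
print(len(known_only))
```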

Visualization - line chart

In [14]:
import matplotlib.pyplot as plt
average_age = data.groupby("smoking_status")["age"].mean()
average_age.plot(
    kind="line",
    marker="o",
    title="Average Age by Smoking Status",
    xlabel="Smoking Status",
    ylabel="Average Age",
)
plt.grid(True)
plt.show()
[Figure: Average Age by Smoking Status (line chart)]

Visualization - bar chart

In [7]:
pivot_table = data.groupby(["work_type", "smoking_status"]).size().unstack()

pivot_table.plot(kind="bar", stacked=True, figsize=(10, 6))

plt.title("Smoking Status by Work Type")
plt.xlabel("Work Type")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.legend(title="Smoking Status")
plt.grid(axis='y')
plt.tight_layout()  

plt.show()
[Figure: Smoking Status by Work Type (stacked bar chart)]
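The table behind the stacked bars is built with `groupby(...).size().unstack()`. Without `fill_value`, combinations that never occur become NaN instead of 0. The sketch below compares that with `unstack(fill_value=0)` and with `pd.crosstab`, which builds the same count table in one call. The four-row frame `jobs` is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical four-row stand-in (illustrative values only).
jobs = pd.DataFrame(
    {
        "work_type": ["Private", "Private", "Govt_job", "children"],
        "smoking_status": ["smokes", "never smoked", "never smoked", "Unknown"],
    }
)

# Missing work_type/smoking_status combinations show up as NaN here.
sparse = jobs.groupby(["work_type", "smoking_status"]).size().unstack()

# fill_value=0 makes the absent combinations explicit zero counts.
filled = jobs.groupby(["work_type", "smoking_status"]).size().unstack(fill_value=0)

# crosstab produces the same zero-filled count table directly.
table = pd.crosstab(jobs["work_type"], jobs["smoking_status"])

print(sparse)
print(table)
```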

Visualization - histogram

In [8]:
plt.hist(data["age"], bins=10, edgecolor="black")
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.grid(axis="y")
plt.show()
[Figure: Age Distribution (histogram)]
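With `bins=10`, the age range is split into ten equal-width bins between the minimum and maximum. For the range reported by `describe()` above (0.08 to 82), each bin is (82 - 0.08) / 10 = 8.192 years wide. The sketch below reproduces the binning numerically with `np.histogram`, using a hypothetical sample of eight ages that spans the same range.

```python
import numpy as np

# Hypothetical sample spanning the min and max ages reported by describe().
ages = np.array([0.08, 14.0, 25.0, 45.0, 61.0, 67.0, 80.0, 82.0])

# Ten equal-width bins between ages.min() and ages.max().
counts, edges = np.histogram(ages, bins=10)

bin_width = edges[1] - edges[0]
print("bin width:", bin_width)
print("counts per bin:", counts)
```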

Visualization - box plot (box and whiskers)

In [9]:
import pandas as pd
import matplotlib.pyplot as plt

data = df[["age", "work_type", "smoking_status"]].copy()
data.dropna(subset=["smoking_status"], inplace=True)


plt.figure(figsize=(10, 6))

statuses = data["smoking_status"].unique()
box_data = [data.loc[data["smoking_status"] == status, "age"] for status in statuses]
plt.boxplot(box_data)

plt.xticks(range(1, len(statuses) + 1), list(statuses))

plt.title("Box Plot of Age by Smoking Status")
plt.xlabel("Smoking Status")
plt.ylabel("Age")

plt.show()
[Figure: Box Plot of Age by Smoking Status (box plot)]
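A box plot draws the quartiles as the box and, with matplotlib's default setting, extends the whiskers to the most extreme points within 1.5 times the interquartile range of the box; points beyond that are drawn as outliers. The sketch below computes those quantities for a hypothetical series of seven ages (the value 120.0 is deliberately unrealistic, to produce one outlier).

```python
import pandas as pd

# Hypothetical ages; 120.0 is included only to trigger an outlier.
ages = pd.Series([2.0, 25.0, 40.0, 45.0, 50.0, 61.0, 120.0])

# Quartiles with linear interpolation (the pandas default).
q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box, matching matplotlib's default whis=1.5.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences would be plotted individually as outliers.
outliers = ages[(ages < lower_fence) | (ages > upper_fence)].tolist()
print(q1, median, q3, lower_fence, upper_fence, outliers)
```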

Visualization - area chart

In [10]:
data = df[["age", "work_type", "smoking_status"]].copy()
data.dropna(subset=["smoking_status"], inplace=True)

grouped_data = (
    data.groupby(["work_type", "smoking_status"]).size().unstack(fill_value=0)
)

grouped_data.plot(kind="area", alpha=0.5, stacked=True)

plt.title("Area Chart of Smoking Status by Work Type")
plt.xlabel("Work Type")
plt.ylabel("Number of Observations")
plt.legend(title="Smoking Status")
plt.grid(True)

plt.show()
[Figure: Area Chart of Smoking Status by Work Type (stacked area chart)]

Visualization - scatter plot

In [11]:
plt.scatter(df["bmi"], df["avg_glucose_level"], alpha=0.5)
plt.title("BMI vs Average Glucose Level")
plt.xlabel("BMI")
plt.ylabel("Average Glucose Level")
plt.grid(True)
plt.show()
[Figure: BMI vs Average Glucose Level (scatter plot)]
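The scatter plot uses `df["bmi"]`, which has missing values, so rows without a `bmi` cannot be drawn. To put a number on the relationship the plot shows, one option is the Pearson correlation, which pandas computes over complete pairs only. The sketch below uses a hypothetical five-row frame with one missing `bmi`.

```python
import numpy as np
import pandas as pd

# Hypothetical five-row stand-in with one missing bmi (illustrative values only).
points = pd.DataFrame(
    {
        "bmi": [20.0, 25.0, np.nan, 30.0, 35.0],
        "avg_glucose_level": [80.0, 90.0, 100.0, 95.0, 110.0],
    }
)

# Rows with a missing bmi cannot be placed on the scatter plot.
complete = points.dropna(subset=["bmi"])

# Pearson correlation on the complete rows.
r = complete["bmi"].corr(complete["avg_glucose_level"])

# Series.corr skips incomplete pairs itself, so this gives the same value.
r_direct = points["bmi"].corr(points["avg_glucose_level"])
print(len(complete), r, r_direct)
```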

Visualization - pie chart

In [15]:
gender_counts = df["gender"].value_counts()

labels = [str(label) for label in gender_counts.index]

plt.figure(figsize=(8, 6))
plt.pie(gender_counts, labels=labels, autopct="%1.1f%%", startangle=90)
plt.title("Distribution of Gender")
plt.axis("equal")
plt.show()
[Figure: Distribution of Gender (pie chart)]
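The percentages that `autopct="%1.1f%%"` prints on the slices can be checked by hand from the gender counts reported earlier (Male 2115, Female 2994, Other 1). The sketch below applies the same format string to those counts; it suggests that the single "Other" record rounds to 0.0% and is effectively invisible in the pie.

```python
import pandas as pd

# Gender counts as printed in the selection and grouping cell.
gender_counts = pd.Series({"Female": 2994, "Male": 2115, "Other": 1})

# Share of each category in percent.
shares = gender_counts / gender_counts.sum() * 100

# Apply the same format string that autopct uses.
formatted = {label: "%1.1f%%" % value for label, value in shares.items()}
print(formatted)
```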