2024-12-21 15:33:58 +04:00


Laboratory Work 1

Dataset: Students' Exam Scores

Fields

  1. gender
  2. race/ethnicity
  3. parental level of education
  4. lunch
  5. test preparation course
  6. math score
  7. reading score
  8. writing score

Loading and saving data

In [1]:
import pandas as pd

df = pd.read_csv("data/StudentsPerformance.csv")
df.to_csv("data/StudentsPerformance_updated.csv", index=False)
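Passing `index=False` to `to_csv` keeps the row index out of the file, so reading the file back should give the same columns without an extra unnamed one. A minimal sketch of that round trip, using a hypothetical two-row sample and an in-memory buffer instead of the real CSV file:

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics part of the dataset's column layout.
sample = pd.DataFrame(
    {
        "gender": ["female", "male"],
        "math score": [72, 47],
    }
)

# Write to an in-memory buffer so the sketch does not need a data/ directory.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)  # index=False: do not write the row index
buffer.seek(0)

# Read the CSV text back and compare the columns with the original.
restored = pd.read_csv(buffer)
print(list(restored.columns))
print(restored)
```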

Getting information about the DataFrame

  1. General information about the DataFrame
In [2]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
  2. Summary statistics
In [3]:
df.describe()
Out[3]:
       math score  reading score  writing score
count  1000.00000    1000.000000    1000.000000
mean     66.08900      69.169000      68.054000
std      15.16308      14.600192      15.195657
min       0.00000      17.000000      10.000000
25%      57.00000      59.000000      57.750000
50%      66.00000      70.000000      69.000000
75%      77.00000      79.000000      79.000000
max     100.00000     100.000000     100.000000
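By default `describe()` summarizes only the numeric columns, which is why the five text columns are missing from the table above. A minimal sketch of summarizing the text columns as well, on a hypothetical three-row sample rather than the real dataset:

```python
import pandas as pd

# Hypothetical sample with two text columns and one numeric column.
sample = pd.DataFrame(
    {
        "gender": ["female", "female", "male"],
        "lunch": ["standard", "free/reduced", "standard"],
        "math score": [72, 69, 47],
    }
)

# Select the non-numeric columns explicitly, then describe them.
# For text columns describe() reports count, unique, top and freq.
categorical = sample.select_dtypes(exclude="number")
summary = categorical.describe()
print(summary)
```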

Getting information about the DataFrame columns

  1. Column names
In [4]:
df.columns
Out[4]:
Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course', 'math score', 'reading score',
       'writing score'],
      dtype='object')

Selecting individual rows and columns

  1. The "gender" column
In [5]:
df[["gender"]]
Out[5]:
gender
0 female
1 female
2 female
3 male
4 male
... ...
995 female
996 male
997 female
998 female
999 female

1000 rows × 1 columns

  2. Several columns
In [6]:
df[["race/ethnicity", "writing score"]]
Out[6]:
race/ethnicity writing score
0 group B 74
1 group C 88
2 group B 93
3 group A 44
4 group C 75
... ... ...
995 group E 95
996 group C 55
997 group C 65
998 group D 77
999 group D 86

1000 rows × 2 columns

  3. First row
In [7]:
df.iloc[[0]]
Out[7]:
   gender race/ethnicity parental level of education     lunch test preparation course  math score  reading score  writing score
0  female        group B           bachelor's degree  standard                    none          72             72             74
  4. Filtering by condition
In [8]:
df[df["writing score"] > 98]
Out[8]:
gender race/ethnicity parental level of education lunch test preparation course math score reading score writing score
106 female group D master's degree standard none 87 100 100
114 female group E bachelor's degree standard completed 99 100 100
165 female group C bachelor's degree standard completed 96 100 100
179 female group D some high school standard completed 97 100 100
377 female group D master's degree free/reduced completed 85 95 100
403 female group D high school standard completed 88 99 100
458 female group E bachelor's degree standard none 100 100 100
566 female group E bachelor's degree free/reduced completed 92 100 100
594 female group C bachelor's degree standard completed 92 100 99
625 male group D some college standard completed 100 97 99
685 female group E master's degree standard completed 94 99 100
712 female group D some college standard none 98 100 99
717 female group C associate's degree standard completed 96 96 99
903 female group D bachelor's degree free/reduced completed 93 100 100
916 male group E bachelor's degree standard completed 100 100 100
957 female group D master's degree standard none 92 100 100
962 female group E associate's degree standard none 100 100 100
970 female group D bachelor's degree standard none 89 100 100
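The filter above uses a single boolean condition. Several conditions can be combined with `&` (and) or `|` (or), with each condition in its own parentheses. A minimal sketch on a hypothetical four-row sample:

```python
import pandas as pd

# Hypothetical sample; the column names follow the dataset.
sample = pd.DataFrame(
    {
        "gender": ["female", "male", "female", "male"],
        "math score": [100, 100, 92, 60],
        "writing score": [100, 99, 99, 55],
    }
)

# Each comparison yields a boolean Series; & combines them element-wise.
mask = (sample["writing score"] > 98) & (sample["math score"] == 100)

# .loc applies the row mask and selects a subset of columns in one step.
top = sample.loc[mask, ["gender", "writing score"]]
print(top)
```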

Grouping and aggregating data

  1. Mean writing score by gender
In [9]:
df.groupby(["gender"])[["writing score"]].mean()
Out[9]:
        writing score
gender
female      72.467181
male        63.311203
  2. Grouping by parental level of education: sum of math scores, mean of reading and writing scores
In [10]:
df.groupby("parental level of education").agg({"math score": "sum", "reading score": "mean", "writing score": "mean"})
Out[10]:
                             math score  reading score  writing score
parental level of education
associate's degree                15070      70.927928      69.896396
bachelor's degree                  8188      73.000000      73.381356
high school                       12179      64.704082      62.448980
master's degree                    4115      75.372881      75.677966
some college                      15171      69.460177      68.840708
some high school                  11366      66.938547      64.888268
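In the result above, the "math score" column holds a sum while the other two hold means, which the column names do not show. One option is pandas named aggregation, which lets each output column carry a descriptive name. A minimal sketch on a hypothetical sample; the output names `math_total` and `reading_mean` are chosen for illustration:

```python
import pandas as pd

# Hypothetical sample; the column names follow the dataset.
sample = pd.DataFrame(
    {
        "parental level of education": ["high school", "high school", "master's degree"],
        "math score": [60, 70, 90],
        "reading score": [64, 66, 95],
    }
)

# Named aggregation: output_name=(source column, aggregation function).
grouped = sample.groupby("parental level of education").agg(
    math_total=("math score", "sum"),
    reading_mean=("reading score", "mean"),
)
print(grouped)
```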

Sorting data

  1. Sorting by math score in descending order
In [11]:
df.sort_values("math score", ascending=False)
Out[11]:
gender race/ethnicity parental level of education lunch test preparation course math score reading score writing score
451 female group E some college standard none 100 92 97
458 female group E bachelor's degree standard none 100 100 100
962 female group E associate's degree standard none 100 100 100
149 male group E associate's degree free/reduced completed 100 100 93
623 male group A some college standard completed 100 96 86
... ... ... ... ... ... ... ... ...
145 female group C some college free/reduced none 22 39 33
787 female group B some college standard none 19 38 32
17 female group B some high school free/reduced none 18 32 28
980 female group B high school free/reduced none 8 24 23
59 female group C some high school free/reduced none 0 17 10

1000 rows × 8 columns

  2. Sorting by several columns: math score ascending, reading score descending
In [12]:
df.sort_values(["math score", "reading score"], ascending=[True, False])
Out[12]:
gender race/ethnicity parental level of education lunch test preparation course math score reading score writing score
59 female group C some high school free/reduced none 0 17 10
980 female group B high school free/reduced none 8 24 23
17 female group B some high school free/reduced none 18 32 28
787 female group B some college standard none 19 38 32
145 female group C some college free/reduced none 22 39 33
... ... ... ... ... ... ... ... ...
916 male group E bachelor's degree standard completed 100 100 100
962 female group E associate's degree standard none 100 100 100
625 male group D some college standard completed 100 97 99
623 male group A some college standard completed 100 96 86
451 female group E some college standard none 100 92 97

1000 rows × 8 columns
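When only the first few rows of a sorted result are needed, `nlargest` and `nsmallest` avoid writing out the full sort. A minimal sketch on a hypothetical sample without tied scores:

```python
import pandas as pd

# Hypothetical sample; the column names follow the dataset.
sample = pd.DataFrame(
    {
        "gender": ["female", "male", "male", "female"],
        "math score": [72, 100, 47, 90],
    }
)

# The two rows with the highest math score, highest first.
top2 = sample.nlargest(2, "math score")

# The same rows via an explicit sort; ignore_index=True renumbers rows from 0.
top2_sorted = sample.sort_values("math score", ascending=False, ignore_index=True).head(2)

print(top2)
print(top2_sorted)
```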

Deleting rows/columns

  1. Deleting a column
In [13]:
df.drop("race/ethnicity", axis=1)
Out[13]:
gender parental level of education lunch test preparation course math score reading score writing score
0 female bachelor's degree standard none 72 72 74
1 female some college standard completed 69 90 88
2 female master's degree standard none 90 95 93
3 male associate's degree free/reduced none 47 57 44
4 male some college standard none 76 78 75
... ... ... ... ... ... ... ...
995 female master's degree standard completed 88 99 95
996 male high school free/reduced none 62 55 55
997 female high school free/reduced completed 59 71 65
998 female some college standard completed 68 78 77
999 female some college free/reduced none 77 86 86

1000 rows × 7 columns

  2. Deleting a row

In [14]:
df.drop(0, axis=0)
Out[14]:
gender race/ethnicity parental level of education lunch test preparation course math score reading score writing score
1 female group C some college standard completed 69 90 88
2 female group B master's degree standard none 90 95 93
3 male group A associate's degree free/reduced none 47 57 44
4 male group C some college standard none 76 78 75
5 female group B associate's degree standard none 71 83 78
... ... ... ... ... ... ... ... ...
995 female group E master's degree standard completed 88 99 95
996 male group C high school free/reduced none 62 55 55
997 female group C high school free/reduced completed 59 71 65
998 female group D some college standard completed 68 78 77
999 female group D some college free/reduced none 77 86 86

999 rows × 8 columns
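Both `drop` calls above return a new DataFrame and leave `df` unchanged, which is why later cells still show the "race/ethnicity" column and row 0. To keep the result, assign it to a name. A minimal sketch on a hypothetical three-row sample:

```python
import pandas as pd

# Hypothetical sample; the column names follow the dataset.
sample = pd.DataFrame(
    {
        "gender": ["female", "female", "male"],
        "race/ethnicity": ["group B", "group C", "group A"],
        "math score": [72, 69, 47],
    }
)

# drop() returns a modified copy; the original DataFrame keeps all its data.
trimmed = sample.drop(columns="race/ethnicity")  # same as drop("race/ethnicity", axis=1)
without_first = sample.drop(index=0)             # same as drop(0, axis=0)

print(sample.columns.tolist())
print(trimmed.columns.tolist())
print(without_first.index.tolist())
```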

Creating new columns

  1. Creating a new column with each student's average score across all subjects
In [15]:
df["average rating"] = (df["math score"] + df["reading score"] + df["writing score"]) / 3
print(df[["average rating"]])
     average rating
0         72.666667
1         82.333333
2         92.666667
3         49.333333
4         76.333333
..              ...
995       94.000000
996       57.333333
997       65.000000
998       74.333333
999       83.000000

[1000 rows x 1 columns]
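The same average can be computed with a row-wise `mean`, which avoids hard-coding the divisor if the list of subjects changes. A minimal sketch on the first two rows of the dataset as shown in the outputs above; the `score_columns` name is introduced here for illustration:

```python
import pandas as pd

# The first two rows of the dataset, scores only.
sample = pd.DataFrame(
    {
        "math score": [72, 69],
        "reading score": [72, 90],
        "writing score": [74, 88],
    }
)

# axis=1 averages across the selected columns for each row.
score_columns = ["math score", "reading score", "writing score"]
sample["average rating"] = sample[score_columns].mean(axis=1)
print(sample)
```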

Removing rows with missing values

  1. Dropping rows with NaN
In [16]:
df.dropna()
Out[16]:
gender race/ethnicity parental level of education lunch test preparation course math score reading score writing score average rating
0 female group B bachelor's degree standard none 72 72 74 72.666667
1 female group C some college standard completed 69 90 88 82.333333
2 female group B master's degree standard none 90 95 93 92.666667
3 male group A associate's degree free/reduced none 47 57 44 49.333333
4 male group C some college standard none 76 78 75 76.333333
... ... ... ... ... ... ... ... ... ...
995 female group E master's degree standard completed 88 99 95 94.000000
996 male group C high school free/reduced none 62 55 55 57.333333
997 female group C high school free/reduced completed 59 71 65 65.000000
998 female group D some college standard completed 68 78 77 74.333333
999 female group D some college free/reduced none 77 86 86 83.000000

1000 rows × 9 columns

  2. Filling missing values in a specific column
In [17]:
df.fillna({"writing score": df["writing score"].mean()}, inplace=True)

Filling missing values

  1. Filling with the mean (numeric columns only)
In [18]:
df.fillna(df.select_dtypes(include='number').mean(), inplace=True)
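According to the `df.info()` output, every column has 1000 non-null values, so `dropna` and `fillna` leave this dataset unchanged. A minimal sketch of their effect on a hypothetical sample that does contain missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in both text and numeric columns.
sample = pd.DataFrame(
    {
        "gender": ["female", None, "male"],
        "math score": [70.0, np.nan, 50.0],
        "writing score": [80.0, 60.0, np.nan],
    }
)

# dropna() keeps only the rows without any missing value.
dropped = sample.dropna()

# Column means of the numeric columns; fillna matches them by column name,
# so the text column is left as it is.
numeric_means = sample.select_dtypes(include="number").mean()
filled = sample.fillna(numeric_means)

print(dropped)
print(filled)
```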

Data visualization with Pandas and Matplotlib

  1. Line chart (plot). Distribution of math scores by gender
In [19]:
import matplotlib.pyplot as plt
df.plot(x="gender", y="math score", kind="line")

plt.xlabel("Gender")
plt.ylabel("Math score")
plt.title("Distribution of math scores by gender")

plt.show()
  2. Bar chart (bar). Mean math score by gender
In [20]:
# Group by gender and compute the mean math score
grouped_df = df.groupby('gender')['math score'].mean().reset_index()

grouped_df.plot(x='gender', y='math score', kind='bar', color=['blue', 'orange'])

plt.xlabel('Gender')
plt.ylabel('Mean math score')
plt.title('Mean math score by gender')

plt.show()
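If each bar should get its own color regardless of how pandas applies the `color` list, one option is to call Matplotlib's `Axes.bar` directly, which accepts one color per bar. A minimal sketch on a hypothetical sample; the `Agg` backend is selected only so that the example runs without a display:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample; the column names follow the dataset.
sample = pd.DataFrame(
    {
        "gender": ["female", "female", "male", "male"],
        "math score": [60, 70, 65, 75],
    }
)

# Mean math score per gender as a Series indexed by gender.
means = sample.groupby("gender")["math score"].mean()

# One bar per gender, one color per bar.
fig, ax = plt.subplots()
ax.bar(list(means.index), list(means.values), color=["blue", "orange"])
ax.set_xlabel("Gender")
ax.set_ylabel("Mean math score")
ax.set_title("Mean math score by gender")
```

In a notebook, the `matplotlib.use("Agg")` line can be dropped and `plt.show()` called at the end, as in the cells above.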
  3. Histogram (hist). Distribution of math scores
In [21]:
df["math score"].plot(kind="hist")

plt.xlabel("Math score")
plt.ylabel("Frequency")
plt.title("Distribution of math scores")

plt.show()
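A histogram counts how many values fall into each interval. The same counts can be obtained as a table with `pd.cut`. A minimal sketch on hypothetical scores; the bin edges 0, 50, 75 and 100 are an arbitrary choice for illustration:

```python
import pandas as pd

# Hypothetical math scores.
scores = pd.Series([0, 45, 60, 75, 90, 100], name="math score")

# Assign each score to an interval; include_lowest=True keeps the score 0
# inside the first interval.
bins = pd.cut(scores, bins=[0, 50, 75, 100], include_lowest=True)

# sort=False keeps the intervals in ascending order instead of sorting by count.
counts = bins.value_counts(sort=False)
print(counts)
```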
  4. Box plot (box). Math scores
In [22]:
df["math score"].plot(kind="box")

plt.ylabel("Math score")
plt.title("Box plot of math scores")

plt.show()
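A box plot draws the quartiles and, with Matplotlib's default settings, marks points farther than 1.5 interquartile ranges from the box as outliers. A minimal sketch of computing the same quantities by hand on hypothetical scores:

```python
import pandas as pd

# Hypothetical math scores with one unusually low value.
scores = pd.Series([10, 50, 60, 70, 80, 90], name="math score")

# Quartiles and interquartile range.
q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles; values outside are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

print(q1, q3, iqr)
print(outliers)
```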
  5. Area chart (area).
In [23]:
df.plot(x="parental level of education", y="math score", kind="area")

plt.xlabel("Parental level of education")
plt.ylabel("Math score")
plt.title("Math score by parental level of education")

plt.xticks(rotation=45)  # Rotate the x-axis labels for readability
plt.show()
  6. Scatter plot (scatter). Math scores versus reading scores
In [24]:
df.plot(kind="scatter", x="math score", y="reading score")

plt.xlabel("Math score")
plt.ylabel("Reading score")
plt.title("Math score vs. reading score")

plt.show()
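A scatter plot shows the relationship visually; the Pearson correlation coefficient summarizes it as a single number between -1 and 1. A minimal sketch on hypothetical scores, not on the real dataset:

```python
import pandas as pd

# Hypothetical scores that rise together.
sample = pd.DataFrame(
    {
        "math score": [40, 60, 80, 100],
        "reading score": [45, 58, 85, 97],
    }
)

# Series.corr computes the Pearson correlation by default.
r = sample["math score"].corr(sample["reading score"])
print(round(r, 3))
```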
  7. Pie chart (pie). Distribution of parental level of education
In [25]:
# Threshold below which rare values are merged
threshold = 0.02  # 2% threshold

# Count the occurrences of each value and the total
value_counts = df["parental level of education"].value_counts()
total_count = value_counts.sum()

# Split the counts into rare values (below the threshold) and the rest
other_values = value_counts[value_counts / total_count < threshold].sum()
main_values = value_counts[value_counts / total_count >= threshold].copy()

# Add the "Other" category only if some values fell below the threshold,
# so the chart does not get an empty 0.0% wedge
if other_values > 0:
    main_values["Other"] = other_values

# Draw the chart
main_values.plot(kind="pie", 
                 autopct='%1.1f%%',  # Percentages
                 startangle=90,      # Start angle
                 counterclock=False, # Clockwise
                 cmap="Set3",        # Color map
                 wedgeprops={'edgecolor': 'black'}) # Wedge borders

plt.title("Distribution of parental level of education (aggregated data)")
plt.subplots_adjust(left=0.3, right=0.7, top=0.9, bottom=0.1)
plt.show()
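Whether any category falls below the 2% threshold depends on the data. The sketch below wraps the same merging logic in a hypothetical helper, `collapse_rare` (a name introduced here, not part of pandas), and applies it to synthetic counts in which two categories are rare:

```python
import pandas as pd


def collapse_rare(counts: pd.Series, threshold: float) -> pd.Series:
    """Merge categories whose share is below `threshold` into "Other"."""
    shares = counts / counts.sum()
    main = counts[shares >= threshold].copy()
    rare_total = counts[shares < threshold].sum()
    # Add "Other" only when something was actually merged.
    if rare_total > 0:
        main["Other"] = rare_total
    return main


# Synthetic data: "c" and "d" each make up 1% of the values.
levels = pd.Series(["a"] * 50 + ["b"] * 48 + ["c"] + ["d"])
counts = levels.value_counts()

collapsed = collapse_rare(counts, threshold=0.02)
print(collapsed)
```

The returned Series can be passed to `.plot(kind="pie", ...)` in the same way as `main_values` above.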