Lab 1

Dataset: student exam scores

Fields

gender
race/ethnicity
parental level of education
lunch
test preparation course
math score
reading score
writing score

Loading and saving data

In [1]:

import pandas as pd

df = pd.read_csv("data/StudentsPerformance.csv")
df.to_csv("data/StudentsPerformance_updated.csv", index=False)
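The `index=False` argument above keeps the row index out of the saved file. A minimal sketch of why that matters, using a small hypothetical frame and an in-memory buffer instead of the real CSV file (both are assumptions made so the example runs on its own):

```python
import io

import pandas as pd

# Hypothetical two-row sample; the real lab reads data/StudentsPerformance.csv
sample = pd.DataFrame({"gender": ["female", "male"], "math score": [72, 47]})

# Save without the index and read the text back
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Save with the index (the default) and read the text back
buffer_with_index = io.StringIO()
sample.to_csv(buffer_with_index)
buffer_with_index.seek(0)
restored_with_index = pd.read_csv(buffer_with_index)

print(list(restored.columns))
print(list(restored_with_index.columns))
```

When the index is written, reading the file back produces an extra unnamed column, which is usually unwanted for a plain `RangeIndex`.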

Getting information about the DataFrame

General information about the DataFrame

In [2]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
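The `Non-Null Count` column reports 1000 non-null values for every column, so this dataset has no gaps. The same check can be written explicitly with `isna()`. The sketch below uses a small hypothetical frame with a few missing values added on purpose, only to show what the check returns when gaps exist:

```python
import pandas as pd

# Hypothetical sample with some of the dataset's column names;
# the None entries are inserted only for illustration.
sample = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "math score": [72, None, 90],
    "writing score": [74, 44, None],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# True if at least one cell in the whole frame is missing
has_missing = bool(sample.isna().any().any())
print(has_missing)
```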

Statistical summary

In [3]:

df.describe()

Out[3]:

	math score	reading score	writing score
count	1000.00000	1000.000000	1000.000000
mean	66.08900	69.169000	68.054000
std	15.16308	14.600192	15.195657
min	0.00000	17.000000	10.000000
25%	57.00000	59.000000	57.750000
50%	66.00000	70.000000	69.000000
75%	77.00000	79.000000	79.000000
max	100.00000	100.000000	100.000000
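By default `describe()` summarizes only the numeric columns, which is why the five text columns are absent from the table above. One way to summarize them as well, sketched on a hypothetical four-row sample, is to exclude the numeric columns:

```python
import pandas as pd

# Hypothetical sample with two text columns and one numeric column
sample = pd.DataFrame({
    "gender": ["female", "female", "male", "female"],
    "lunch": ["standard", "standard", "free/reduced", "standard"],
    "math score": [72, 69, 47, 90],
})

# Summary of the non-numeric columns: count, unique, top (mode), freq
text_summary = sample.describe(exclude=["number"])
print(text_summary)

# Frequency of each value in a single column
lunch_counts = sample["lunch"].value_counts()
print(lunch_counts)
```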

Getting information about the DataFrame columns

Column names

In [4]:

df.columns

Out[4]:

Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course', 'math score', 'reading score',
       'writing score'],
      dtype='object')

Selecting individual rows and columns

The "gender" column

In [5]:

df[["gender"]]

Out[5]:

	gender
0	female
1	female
2	female
3	male
4	male
...	...
995	female
996	male
997	female
998	female
999	female

1000 rows × 1 columns

Several columns

In [6]:

df[["race/ethnicity", "writing score"]]

Out[6]:

	race/ethnicity	writing score
0	group B	74
1	group C	88
2	group B	93
3	group A	44
4	group C	75
...	...	...
995	group E	95
996	group C	55
997	group C	65
998	group D	77
999	group D	86

1000 rows × 2 columns

First row

In [7]:

df.iloc[[0]]

Out[7]:

	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score
0	female	group B	bachelor's degree	standard	none	72	72	74
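The double brackets in `df[["gender"]]` and `df.iloc[[0]]` are what make the result a DataFrame. With single brackets the same selections return a Series. A short sketch on a hypothetical three-row sample:

```python
import pandas as pd

# Hypothetical sample with two of the dataset's columns
sample = pd.DataFrame({
    "gender": ["female", "female", "male"],
    "math score": [72, 69, 47],
})

column_as_frame = sample[["gender"]]   # list of labels -> DataFrame
column_as_series = sample["gender"]    # single label -> Series

row_as_frame = sample.iloc[[0]]        # list of positions -> one-row DataFrame
row_as_series = sample.iloc[0]         # single position -> Series indexed by column name

print(type(column_as_frame).__name__, type(column_as_series).__name__)
print(type(row_as_frame).__name__, type(row_as_series).__name__)
```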

Filtering by condition

In [8]:

df[df["writing score"] > 98]

Out[8]:

	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score
106	female	group D	master's degree	standard	none	87	100	100
114	female	group E	bachelor's degree	standard	completed	99	100	100
165	female	group C	bachelor's degree	standard	completed	96	100	100
179	female	group D	some high school	standard	completed	97	100	100
377	female	group D	master's degree	free/reduced	completed	85	95	100
403	female	group D	high school	standard	completed	88	99	100
458	female	group E	bachelor's degree	standard	none	100	100	100
566	female	group E	bachelor's degree	free/reduced	completed	92	100	100
594	female	group C	bachelor's degree	standard	completed	92	100	99
625	male	group D	some college	standard	completed	100	97	99
685	female	group E	master's degree	standard	completed	94	99	100
712	female	group D	some college	standard	none	98	100	99
717	female	group C	associate's degree	standard	completed	96	96	99
903	female	group D	bachelor's degree	free/reduced	completed	93	100	100
916	male	group E	bachelor's degree	standard	completed	100	100	100
957	female	group D	master's degree	standard	none	92	100	100
962	female	group E	associate's degree	standard	none	100	100	100
970	female	group D	bachelor's degree	standard	none	89	100	100
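Several conditions can be combined with `&` (and) or `|` (or), each wrapped in parentheses. The sketch below uses a hypothetical four-row sample and also shows an equivalent `query()` call, where column names containing spaces are wrapped in backticks:

```python
import pandas as pd

# Hypothetical sample with two of the dataset's columns
sample = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "writing score": [95, 99, 74, 44],
})

# Boolean mask: writing score above 90 AND gender is female
mask = (sample["writing score"] > 90) & (sample["gender"] == "female")
top_female = sample[mask]

# The same filter expressed as a query string
top_female_query = sample.query("`writing score` > 90 and gender == 'female'")

print(top_female)
```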

Grouping and aggregating data

Mean writing score by gender

In [9]:

df.groupby(["gender"])[["writing score"]].mean()

Out[9]:

	writing score
gender
female	72.467181
male	63.311203

Grouping by parental level of education: sum of math scores, mean of reading and writing scores

In [10]:

df.groupby("parental level of education").agg({"math score": "sum", "reading score": "mean", "writing score": "mean"})

Out[10]:

	math score	reading score	writing score
parental level of education
associate's degree	15070	70.927928	69.896396
bachelor's degree	8188	73.000000	73.381356
high school	12179	64.704082	62.448980
master's degree	4115	75.372881	75.677966
some college	15171	69.460177	68.840708
some high school	11366	66.938547	64.888268
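A sum depends on how many students fall into each group, so the `math score` totals above are not directly comparable between groups. One possible variation, sketched here on a hypothetical three-row sample, uses named aggregation to give the result columns explicit names and to add a per-group count (the names `math_total`, `reading_mean`, and `n_students` are illustrative):

```python
import pandas as pd

# Hypothetical sample with three of the dataset's columns
sample = pd.DataFrame({
    "parental level of education": ["high school", "high school", "master's degree"],
    "math score": [60, 70, 90],
    "reading score": [64, 66, 95],
})

# Named aggregation: new_column=(source_column, function)
summary = sample.groupby("parental level of education").agg(
    math_total=("math score", "sum"),
    reading_mean=("reading score", "mean"),
    n_students=("math score", "count"),
)
print(summary)
```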

Sorting data

Sorting by math score in descending order

In [11]:

df.sort_values("math score", ascending=False)

Out[11]:

	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score
451	female	group E	some college	standard	none	100	92	97
458	female	group E	bachelor's degree	standard	none	100	100	100
962	female	group E	associate's degree	standard	none	100	100	100
149	male	group E	associate's degree	free/reduced	completed	100	100	93
623	male	group A	some college	standard	completed	100	96	86
...	...	...	...	...	...	...	...	...
145	female	group C	some college	free/reduced	none	22	39	33
787	female	group B	some college	standard	none	19	38	32
17	female	group B	some high school	free/reduced	none	18	32	28
980	female	group B	high school	free/reduced	none	8	24	23
59	female	group C	some high school	free/reduced	none	0	17	10

1000 rows × 8 columns

Sorting by several columns: math score ascending, reading score descending

In [12]:

df.sort_values(["math score", "reading score"], ascending=[True, False])

Out[12]:

	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score
59	female	group C	some high school	free/reduced	none	0	17	10
980	female	group B	high school	free/reduced	none	8	24	23
17	female	group B	some high school	free/reduced	none	18	32	28
787	female	group B	some college	standard	none	19	38	32
145	female	group C	some college	free/reduced	none	22	39	33
...	...	...	...	...	...	...	...	...
916	male	group E	bachelor's degree	standard	completed	100	100	100
962	female	group E	associate's degree	standard	none	100	100	100
625	male	group D	some college	standard	completed	100	97	99
623	male	group A	some college	standard	completed	100	96	86
451	female	group E	some college	standard	none	100	92	97

1000 rows × 8 columns
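Sorting keeps the original row labels, which is why the index in the outputs above looks shuffled. A minimal sketch on a hypothetical four-row sample shows how to renumber the rows after sorting, and how `nlargest` picks the top rows without sorting the whole frame:

```python
import pandas as pd

# Hypothetical sample with two of the dataset's columns
sample = pd.DataFrame({
    "math score": [100, 47, 100, 72],
    "reading score": [92, 57, 100, 72],
})

# Sort as in the cell above, then replace the old labels with 0..n-1
ordered = sample.sort_values(
    ["math score", "reading score"], ascending=[True, False]
).reset_index(drop=True)
print(ordered)

# Two rows with the highest math score
top_two = sample.nlargest(2, "math score")
print(top_two)
```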

Dropping rows/columns

Dropping a column

In [13]:

df.drop("race/ethnicity", axis=1)

Out[13]:

	gender	parental level of education	lunch	test preparation course	math score	reading score	writing score
0	female	bachelor's degree	standard	none	72	72	74
1	female	some college	standard	completed	69	90	88
2	female	master's degree	standard	none	90	95	93
3	male	associate's degree	free/reduced	none	47	57	44
4	male	some college	standard	none	76	78	75
...	...	...	...	...	...	...	...
995	female	master's degree	standard	completed	88	99	95
996	male	high school	free/reduced	none	62	55	55
997	female	high school	free/reduced	completed	59	71	65
998	female	some college	standard	completed	68	78	77
999	female	some college	free/reduced	none	77	86	86

1000 rows × 7 columns

Dropping a row

In [14]:

df.drop(0, axis=0)

Out[14]:

	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score
1	female	group C	some college	standard	completed	69	90	88
2	female	group B	master's degree	standard	none	90	95	93
3	male	group A	associate's degree	free/reduced	none	47	57	44
4	male	group C	some college	standard	none	76	78	75
5	female	group B	associate's degree	standard	none	71	83	78
...	...	...	...	...	...	...	...	...
995	female	group E	master's degree	standard	completed	88	99	95
996	male	group C	high school	free/reduced	none	62	55	55
997	female	group C	high school	free/reduced	completed	59	71	65
998	female	group D	some college	standard	completed	68	78	77
999	female	group D	some college	free/reduced	none	77	86	86

999 rows × 8 columns
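Both `drop` calls above return a new DataFrame and leave `df` unchanged, which is why later cells still show the `race/ethnicity` column and row 0. To keep the result, assign it to a variable. A short sketch on a hypothetical three-row sample, using the `columns=` and `index=` keywords as an alternative to `axis`:

```python
import pandas as pd

# Hypothetical sample with two of the dataset's columns
sample = pd.DataFrame({
    "race/ethnicity": ["group B", "group C", "group A"],
    "math score": [72, 69, 47],
})

# Equivalent to drop("race/ethnicity", axis=1)
without_column = sample.drop(columns="race/ethnicity")

# Equivalent to drop([0, 1], axis=0): remove rows by index label
without_rows = sample.drop(index=[0, 1])

# The original frame is untouched
print(sample.shape, without_column.shape, without_rows.shape)
```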

Creating new columns

Creating a new column with each student's average score across all subjects

In [15]:

df["average rating"] = (df["math score"] + df["reading score"] + df["writing score"]) / 3
print(df[["average rating"]])

     average rating
0         72.666667
1         82.333333
2         92.666667
3         49.333333
4         76.333333
..              ...
995       94.000000
996       57.333333
997       65.000000
998       74.333333
999       83.000000

[1000 rows x 1 columns]
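The same average can be computed with a row-wise `mean`, which avoids hard-coding the divisor when the list of subjects changes. A sketch on a hypothetical two-row sample (the `score_columns` list is an illustrative helper):

```python
import pandas as pd

# Hypothetical sample with the three score columns
sample = pd.DataFrame({
    "math score": [72, 69],
    "reading score": [72, 90],
    "writing score": [74, 88],
})

score_columns = ["math score", "reading score", "writing score"]

# axis=1 averages across the columns of each row
sample["average rating"] = sample[score_columns].mean(axis=1)
print(sample[["average rating"]])
```

One difference to keep in mind: `mean` skips missing values by default, whereas the explicit sum divided by 3 yields `NaN` for any row with a gap.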

Dropping rows with missing values

Dropping rows with NaN

In [16]:

df.dropna()

Out[16]:

	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score	average rating
0	female	group B	bachelor's degree	standard	none	72	72	74	72.666667
1	female	group C	some college	standard	completed	69	90	88	82.333333
2	female	group B	master's degree	standard	none	90	95	93	92.666667
3	male	group A	associate's degree	free/reduced	none	47	57	44	49.333333
4	male	group C	some college	standard	none	76	78	75	76.333333
...	...	...	...	...	...	...	...	...	...
995	female	group E	master's degree	standard	completed	88	99	95	94.000000
996	male	group C	high school	free/reduced	none	62	55	55	57.333333
997	female	group C	high school	free/reduced	completed	59	71	65	65.000000
998	female	group D	some college	standard	completed	68	78	77	74.333333
999	female	group D	some college	free/reduced	none	77	86	86	83.000000

1000 rows × 9 columns

Filling missing values in a specific column

In [17]:

df.fillna({"writing score": df["writing score"].mean()}, inplace=True)

Filling missing values

Filling with the mean (numeric columns only)

In [18]:

df.fillna(df.select_dtypes(include='number').mean(), inplace=True)
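Because the dataset contains no missing values, the `dropna` and `fillna` calls above leave it unchanged (`dropna` still returns all 1000 rows). To see their effect, the sketch below applies the same calls to a small hypothetical frame with gaps inserted on purpose:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the missing entries are added only for illustration
sample = pd.DataFrame({
    "gender": ["female", None, "male"],
    "math score": [70.0, np.nan, 90.0],
    "writing score": [80.0, 60.0, np.nan],
})

# Keep only rows without any missing value
dropped = sample.dropna()

# Fill numeric gaps with the mean of the corresponding column;
# the text column is not touched because it has no entry in numeric_means
numeric_means = sample.select_dtypes(include="number").mean()
filled = sample.fillna(numeric_means)

print(dropped)
print(filled)
```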

Data visualization with Pandas and Matplotlib

Line plot (plot). Distribution of math scores by gender

In [19]:

import matplotlib.pyplot as plt
df.plot(x="gender", y="math score", kind="line")

plt.xlabel("Gender")
plt.ylabel("Math score")
plt.title("Distribution of math scores by gender")

plt.show()
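A line plot draws the rows in file order, so the two genders are interleaved along the x-axis. One possible alternative for comparing the groups, which is not part of the original lab, is a box plot per gender. The sketch below assumes a small hypothetical sample and selects the non-interactive `Agg` backend so that it runs without a display:

```python
import matplotlib

matplotlib.use("Agg")  # assumption: no display available; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample with two of the dataset's columns
sample = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female", "male"],
    "math score": [72, 47, 90, 76, 69, 62],
})

fig, ax = plt.subplots()

# One box per gender, drawn on the axes created above
sample.boxplot(column="math score", by="gender", ax=ax)

ax.set_xlabel("Gender")
ax.set_ylabel("Math score")
ax.set_title("Math score by gender")
fig.suptitle("")  # clear the automatic "Boxplot grouped by ..." heading
```

In an interactive session, a final `plt.show()` displays the figure.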


Bar chart (bar). Mean math score by gender

In [20]:

# Group by gender and compute the mean math score
grouped_df = df.groupby('gender')['math score'].mean().reset_index()

grouped_df.plot(x='gender', y='math score', kind='bar', color=['blue', 'orange'])

plt.xlabel('Gender')
plt.ylabel('Mean math score')
plt.title('Mean math score by gender')

plt.show()

Histogram (hist). Distribution of math scores

In [21]:

df["math score"].plot(kind="hist")

plt.xlabel("Math score")
plt.ylabel("Frequency")
plt.title("Distribution of math scores")

plt.show()

Box plot (box). Math scores

In [22]:

df["math score"].plot(kind="box")

plt.ylabel("Math score")
plt.title("Box plot of math scores")

plt.show()

Area chart (area).

In [23]:

df.plot(x="parental level of education", y="math score", kind="area")

plt.xlabel("Parental level of education")
plt.ylabel("Math score")
plt.title("Math score by parental level of education")

plt.xticks(rotation=45)  # Rotate the x-axis labels for readability
plt.show()

Scatter plot (scatter). Math scores versus reading scores

In [24]:

df.plot(kind="scatter", x="math score", y="reading score")

plt.xlabel("Math score")
plt.ylabel("Reading score")
plt.title("Math score vs. reading score")

plt.show()

Pie chart (pie). Distribution of parental level of education

In [25]:

# Threshold below which rare categories are merged
threshold = 0.02  # 2%

# Count the occurrences of each category and the total number of rows
value_counts = df["parental level of education"].value_counts()
total_count = value_counts.sum()

# Split the categories into those below and those at or above the threshold
other_values = value_counts[value_counts / total_count < threshold].sum()
main_values = value_counts[value_counts / total_count >= threshold].copy()

# Add an "Other" slice only if some categories were actually merged
if other_values > 0:
    main_values["Other"] = other_values

# Draw the chart
main_values.plot(kind="pie",
                 autopct='%1.1f%%',  # Percentage labels
                 startangle=90,      # Start angle
                 counterclock=False, # Clockwise
                 cmap="Set3",        # Color map
                 wedgeprops={'edgecolor': 'black'}) # Wedge borders

plt.title("Distribution of parental level of education (aggregated)")
plt.subplots_adjust(left=0.3, right=0.7, top=0.9, bottom=0.1)
plt.show()
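With only six education levels, the 2% threshold may never be reached on this dataset, so the merging step is easier to see on data that does contain rare categories. The sketch below wraps the same logic in a hypothetical helper, `lump_rare`, and applies it to made-up counts (the category labels and numbers are illustrative, not taken from the dataset):

```python
import pandas as pd


def lump_rare(counts: pd.Series, threshold: float) -> pd.Series:
    """Merge categories whose share is below `threshold` into "Other"."""
    shares = counts / counts.sum()
    main = counts[shares >= threshold].copy()
    rare_total = counts[shares < threshold].sum()
    # Add the extra slice only when something was actually merged
    if rare_total > 0:
        main["Other"] = rare_total
    return main


# Made-up category counts: two common and two rare categories
counts = pd.Series({"level A": 50, "level B": 45, "level C": 3, "level D": 2})

lumped = lump_rare(counts, threshold=0.05)
print(lumped)

# With a lower threshold nothing is merged and no "Other" entry appears
unchanged = lump_rare(counts, threshold=0.01)
print(unchanged)
```

The returned Series can be passed to `.plot(kind="pie", ...)` in the same way as `main_values`.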
