mai_pi-33_zakharov/notebooks/lab1.ipynb

Working with NumPy

In [270]:
import numpy as np

matrix = np.array([[4, 5, 0], [9, 9, 9]])
print("matrix = \n", matrix, "\n")

tmatrix = matrix.T
print("tmatrix = \n", tmatrix, "\n")

vector = np.ravel(matrix)
print("vector = \n", vector, "\n")

tvector = np.reshape(vector, (6, 1))
print("tvector = \n", tvector, "\n")

list_matrix = list(matrix)
print("list_matrix = \n", list_matrix, "\n")

str_matrix = str(matrix)
print("matrix as str = \n", str_matrix, "\n")

print("matrix type is", type(matrix), "\n")

print("vector type is", type(vector), "\n")

print("list_matrix type is", type(list_matrix), "\n")

print("str_matrix type is", type(str_matrix), "\n")

formatted_vector = "; ".join(map(str, vector))
print("formatted_vector = \n", formatted_vector, "\n")
matrix = 
 [[4 5 0]
 [9 9 9]] 

tmatrix = 
 [[4 9]
 [5 9]
 [0 9]] 

vector = 
 [4 5 0 9 9 9] 

tvector = 
 [[4]
 [5]
 [0]
 [9]
 [9]
 [9]] 

list_matrix = 
 [array([4, 5, 0]), array([9, 9, 9])] 

matrix as str = 
 [[4 5 0]
 [9 9 9]] 

matrix type is <class 'numpy.ndarray'> 

vector type is <class 'numpy.ndarray'> 

list_matrix type is <class 'list'> 

str_matrix type is <class 'str'> 

formatted_vector = 
 4; 5; 0; 9; 9; 9 
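
Two details of the conversions above are easy to miss. `np.ravel` returns a view of the original data when it can, while `ndarray.flatten` always copies. `list(matrix)` unpacks only the first axis, which is why the output is a list of arrays, whereas `matrix.tolist()` converts down to nested Python lists. A minimal sketch with the same matrix values (the variable names `view`, `copy`, `rows` and `nested` are illustrative only):

```python
import numpy as np

matrix = np.array([[4, 5, 0], [9, 9, 9]])

# ravel() returns a view for a contiguous array, so writes reach matrix
view = np.ravel(matrix)
view[0] = -1

# flatten() always returns an independent copy, so matrix is unaffected
copy = matrix.flatten()
copy[1] = -2

# list() splits only the first axis; tolist() yields plain nested lists
rows = list(matrix)
nested = matrix.tolist()

print(matrix[0, 0], matrix[0, 1])
print(type(rows[0]).__name__, type(nested[0]).__name__)
```

If later code must not modify `matrix`, `flatten()` or an explicit `.copy()` is the safer choice.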

Working with Pandas DataFrame

https://pandas.pydata.org/docs/user_guide/10min.html

Working with data - reading and writing CSV

In [271]:
import pandas as pd

df = pd.read_csv("../data/ds_salaries.csv")

df.to_csv("../data/test.csv")
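
By default `to_csv` also writes the index as an unnamed first column, so reading `test.csv` back would likely show an extra `Unnamed: 0` column. The sketch below shows the difference on a made-up two-row frame (not the real `ds_salaries.csv`), using an in-memory buffer instead of a file:

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for ds_salaries.csv
sample = pd.DataFrame(
    {"job_title": ["ML Engineer", "Data Analyst"], "salary": [30000, 65000]}
)

# Default: the RangeIndex is written as an unnamed first column
with_index = sample.to_csv()
# index=False: only the data columns are written
without_index = sample.to_csv(index=False)

reloaded_default = pd.read_csv(io.StringIO(with_index))
reloaded_clean = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_default.columns))
print(list(reloaded_clean.columns))
```

Whether to pass `index=False` depends on whether the index carries information worth keeping. For a plain `RangeIndex` it usually does not.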

Working with data - basic commands

In [272]:
df.info()
print(df.describe().transpose())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3755 entries, 0 to 3754
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           3755 non-null   int64 
 1   experience_level    3755 non-null   object
 2   employment_type     3755 non-null   object
 3   job_title           3755 non-null   object
 4   salary              3755 non-null   int64 
 5   salary_currency     3755 non-null   object
 6   salary_in_usd       3755 non-null   int64 
 7   employee_residence  3755 non-null   object
 8   remote_ratio        3755 non-null   int64 
 9   company_location    3755 non-null   object
 10  company_size        3755 non-null   object
dtypes: int64(4), object(7)
memory usage: 322.8+ KB
                count           mean            std     min       25%  \
work_year      3755.0    2022.373635       0.691448  2020.0    2022.0   
salary         3755.0  190695.571771  671676.500508  6000.0  100000.0   
salary_in_usd  3755.0  137570.389880   63055.625278  5132.0   95000.0   
remote_ratio   3755.0      46.271638      48.589050     0.0       0.0   

                    50%       75%         max  
work_year        2022.0    2023.0      2023.0  
salary         138000.0  180000.0  30400000.0  
salary_in_usd  135000.0  175000.0    450000.0  
remote_ratio        0.0     100.0       100.0  
In [273]:
cleared_df = df.drop(["work_year", "experience_level"], axis=1)
print(cleared_df.head(1))
print(cleared_df.tail(2))
  employment_type                 job_title  salary salary_currency  \
0              FT  Principal Data Scientist   80000             EUR   

   salary_in_usd employee_residence  remote_ratio company_location  \
0          85847                 ES           100               ES   

  company_size  
0            L  
     employment_type              job_title   salary salary_currency  \
3753              CT  Business Data Analyst   100000             USD   
3754              FT   Data Science Manager  7000000             INR   

      salary_in_usd employee_residence  remote_ratio company_location  \
3753         100000                 US           100               US   
3754          94665                 IN            50               IN   

     company_size  
3753            L  
3754            L  
In [274]:
sorted_df = cleared_df.sort_values(by="salary")
print(sorted_df.head(3))
print(sorted_df.tail(1))
     employment_type                      job_title  salary salary_currency  \
1548              FT                   AI Developer    6000             EUR   
573               FT  Autonomous Vehicle Technician    7000             USD   
2933              CT             Analytics Engineer    7500             USD   

      salary_in_usd employee_residence  remote_ratio company_location  \
1548           6304                 MK             0               MK   
573            7000                 GH             0               GH   
2933           7500                 BO            50               BO   

     company_size  
1548            S  
573             S  
2933            M  
     employment_type       job_title    salary salary_currency  salary_in_usd  \
3669              FT  Data Scientist  30400000             CLP          40038   

     employee_residence  remote_ratio company_location company_size  
3669                 CL           100               CL            L  

Working with data - accessing elements

In [275]:
print(df["salary"])
0         80000
1         30000
2         25500
3        175000
4        120000
         ...   
3750     412000
3751     151000
3752     105000
3753     100000
3754    7000000
Name: salary, Length: 3755, dtype: int64
In [276]:
print(df[0:3])
   work_year experience_level employment_type                 job_title  \
0       2023               SE              FT  Principal Data Scientist   
1       2023               MI              CT               ML Engineer   
2       2023               MI              CT               ML Engineer   

   salary salary_currency  salary_in_usd employee_residence  remote_ratio  \
0   80000             EUR          85847                 ES           100   
1   30000             USD          30000                 US           100   
2   25500             USD          25500                 US           100   

  company_location company_size  
0               ES            L  
1               US            S  
2               US            S  
In [277]:
print(df.loc[0])
work_year                                 2023
experience_level                            SE
employment_type                             FT
job_title             Principal Data Scientist
salary                                   80000
salary_currency                            EUR
salary_in_usd                            85847
employee_residence                          ES
remote_ratio                               100
company_location                            ES
company_size                                 L
Name: 0, dtype: object
In [278]:
print(df.loc[100, "employment_type"])
FT
In [279]:
print(df.loc[100:200, ["salary", "employment_type"]])
     salary employment_type
100  104300              FT
101  145000              FT
102   65000              FT
103  165000              FT
104  132300              FT
..      ...             ...
196  230000              FT
197  200000              FT
198  180000              FT
199  115000              FT
200  200000              FT

[101 rows x 2 columns]
In [280]:
print(df.iloc[0])
work_year                                 2023
experience_level                            SE
employment_type                             FT
job_title             Principal Data Scientist
salary                                   80000
salary_currency                            EUR
salary_in_usd                            85847
employee_residence                          ES
remote_ratio                               100
company_location                            ES
company_size                                 L
Name: 0, dtype: object
In [281]:
print(df.iloc[3:5, 0:2])
   work_year experience_level
3       2023               SE
4       2023               SE
In [282]:
print(df.iloc[[3, 4], [0, 1]])
   work_year experience_level
3       2023               SE
4       2023               SE
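
The outputs above hint at a difference in slicing: `df.loc[100:200]` returned 101 rows, while `df.iloc[3:5]` returned 2. `loc` slices by label and includes the end label, whereas `iloc` slices by position and excludes the end. A small sketch on a hypothetical frame whose index does not start at zero, so that labels and positions differ:

```python
import pandas as pd

# Hypothetical frame with a non-default index
frame = pd.DataFrame(
    {"salary": [10, 20, 30, 40], "employment_type": ["FT", "CT", "FT", "PT"]},
    index=[100, 101, 102, 103],
)

by_label = frame.loc[100:102, "salary"]  # labels 100, 101, 102 (end included)
by_position = frame.iloc[0:2, 0]         # positions 0 and 1 (end excluded)

print(by_label.tolist())
print(by_position.tolist())
```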

Working with data - selection and grouping

In [283]:
s_values = df["work_year"].unique()
print(s_values)
[2023 2022 2020 2021]
In [284]:
s_total = 0
for s_value in s_values:
    count = df[df["work_year"] == s_value].shape[0]
    s_total += count
    print(s_value, "count =", count)
print("Total count = ", s_total)
2023 count = 1785
2022 count = 1664
2020 count = 76
2021 count = 230
Total count =  3755
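
The per-value loop above can probably be replaced by a single `value_counts` call, which returns the same counts as a Series indexed by the distinct values. A minimal sketch on made-up years (the frame `years` is hypothetical, not the real dataset):

```python
import pandas as pd

# Hypothetical work_year column; the real data comes from ds_salaries.csv
years = pd.DataFrame({"work_year": [2023, 2022, 2023, 2020, 2022, 2023]})

# One count per distinct value, sorted by frequency in descending order
counts = years["work_year"].value_counts()

print(counts)
print("Total count =", counts.sum())
```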
In [285]:
print(df.groupby(["job_title", "experience_level"]).size().reset_index(name="total_count").sort_values(by="total_count"))  # type: ignore
                         job_title experience_level  total_count
1    3D Computer Vision Researcher               MI            1
2    3D Computer Vision Researcher               SE            1
11              Analytics Engineer               EN            1
8                     AI Scientist               EX            1
24   Autonomous Vehicle Technician               EN            1
..                             ...              ...          ...
77                   Data Engineer               MI          205
150      Machine Learning Engineer               SE          209
59                    Data Analyst               SE          380
108                 Data Scientist               SE          608
78                   Data Engineer               SE          718

[192 rows x 3 columns]

Visualization - Source data

In [286]:
data = df[["work_year", "salary", "employee_residence"]].copy()
data.dropna(subset=["employee_residence"], inplace=True)
print(data)
      work_year   salary employee_residence
0          2023    80000                 ES
1          2023    30000                 US
2          2023    25500                 US
3          2023   175000                 CA
4          2023   120000                 CA
...         ...      ...                ...
3750       2020   412000                 US
3751       2021   151000                 US
3752       2020   105000                 US
3753       2020   100000                 US
3754       2021  7000000                 IN

[3755 rows x 3 columns]

Visualization - Five-number summary

In [287]:
def q1(x):
    return x.quantile(0.250)


# median = quantile(0.5)
def q2(x):
    return x.quantile(0.5)


def q3(x):
    return x.quantile(0.750)


def iqr(x):
    return q3(x) - q1(x)


def low_iqr(x):
    return max(0, q1(x) - 1.5 * iqr(x))


def high_iqr(x):
    return q3(x) + 1.5 * iqr(x)

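# where() keeps every row and fills rows that fail the condition with NaN,
# which is why work_year is printed as a float in the output
data = data.where(data["salary"] < 3000000)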
quantiles = (
    data[["work_year", "salary"]]
    .groupby(["work_year"])
    .aggregate(["min", q1, q2, "median", q3, "max"])
)
print(quantiles)

iqrs = (
    data[["work_year", "salary"]]
    .groupby(["work_year"])
    .aggregate([low_iqr, iqr, high_iqr])
)
print(iqrs)

data.boxplot(column="salary", by="work_year")
           salary                                                   
              min        q1        q2    median        q3        max
work_year                                                           
2020.0     8000.0   48000.0   88000.0   88000.0  138000.0  1000000.0
2021.0     8760.0   59000.0  100000.0  100000.0  165000.0  2500000.0
2022.0     6000.0   95000.0  135000.0  135000.0  175000.0  2800000.0
2023.0     7000.0  107800.0  145000.0  145000.0  185000.0  1700000.0
           salary                    
          low_iqr       iqr  high_iqr
work_year                            
2020.0          0   90000.0  273000.0
2021.0          0  106000.0  324000.0
2022.0          0   80000.0  295000.0
2023.0          0   77200.0  300800.0
Out[287]:
<Axes: title={'center': 'salary'}, xlabel='work_year'>
[Figure: box plot of salary grouped by work_year]
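
The `2020.0`-style year labels in the tables come from the way the salary filter is applied in this cell. `DataFrame.where` keeps the original shape and replaces rows that fail the condition with NaN, which upcasts integer columns to float. Boolean indexing drops those rows instead and keeps the integer dtype. Since `groupby` skips NaN keys by default, both approaches should give the same groups. A sketch on made-up numbers (the frame and the `limit` name are illustrative):

```python
import pandas as pd

# Hypothetical data with one extreme salary
data = pd.DataFrame(
    {"work_year": [2020, 2021, 2021], "salary": [100, 5_000_000, 200]}
)
limit = 3_000_000

masked = data.where(data["salary"] < limit)  # same shape, failing rows -> NaN
filtered = data[data["salary"] < limit]      # failing rows are dropped

print(masked["work_year"].dtype, len(masked))
print(filtered["work_year"].dtype, len(filtered))
```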

Visualization - Histogram

In [288]:
df.plot.hist(column=["work_year"], bins=80)
Out[288]:
<Axes: ylabel='Frequency'>
[Figure: histogram of work_year, 80 bins]

Visualization - Scatter plot

In [289]:
df.plot.scatter(x="work_year", y="salary")

df.plot.scatter(x="experience_level", y="salary")
Out[289]:
<Axes: xlabel='experience_level', ylabel='salary'>
[Figure: scatter plot of salary against work_year]
[Figure: scatter plot of salary against experience_level]

Visualization - Bar chart

In [290]:
plot = (
    df.groupby(["work_year", "remote_ratio"])
    .size()
    .unstack()
    .plot.bar(color=["pink", "green", "red"])
)
[Figure: bar chart of record counts per work_year, one bar per remote_ratio value]

Visualization - Time series

In [291]:
from datetime import datetime
import matplotlib.dates as md

ts = pd.read_csv("../data/dollar.csv")
ts["date"] = ts.apply(lambda row: datetime.strptime(row["my_date"], "%d.%m.%Y"), axis=1)

plot = ts.plot.line(x="date", y="my_value")
plot.xaxis.set_major_locator(md.DayLocator(interval=10))
plot.xaxis.set_major_formatter(md.DateFormatter("%d.%m.%Y"))
plot.tick_params(axis="x", labelrotation=90)
[Figure: line plot of my_value over date]
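
The row-wise `apply` with `datetime.strptime` can usually be replaced by one vectorized `pd.to_datetime` call with an explicit format. The sketch below uses two made-up rows in the same day.month.year layout and the same column names as the cell above. The values are hypothetical, not taken from `dollar.csv`:

```python
import pandas as pd

# Hypothetical rows in the same layout as dollar.csv
ts = pd.DataFrame(
    {"my_date": ["01.02.2024", "15.02.2024"], "my_value": [90.1, 91.5]}
)

# Parse the whole column at once; the format makes day-first explicit
ts["date"] = pd.to_datetime(ts["my_date"], format="%d.%m.%Y")

print(ts["date"].dt.day.tolist())
print(ts["date"].dt.month.tolist())
```

Passing `format` avoids ambiguity between day-first and month-first dates, which matters for values such as `01.02.2024`.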