00 | Python Packages¶

numpy¶

随机数生成 ¶

In [1]:

Copied!





#对纯数字的list进行数学运算，用array
import numpy as np
a = [1,2,3,4]
d=np.array(a)

print(d+1)
print(d*2)
print(sum(d))
# 对纯数字的 list 进行数学运算，用 array
import numpy as np
a = [1,2,3,4]
d=np.array(a)

print(d+1)
print(d*2)
print(sum(d))

[2 3 4 5]
[2 4 6 8]
10

生成数组可以用

np.random.randint(start, end, (shape))
np.random.randn(shape)生成正态分布的随机数
np.arange(start, end, step)生成等差数列
np.repeat()重复数组
np.random.seed(23) 设置随机数种子

!!! note "H(head)&T(Tail) 代表硬币的正反面 "

In [2]:

Copied!





# 连续两次掷有4个面的骰子
sample=1000
x = np.random.randint(1,5,size=(2,sample))
print(x)
# 连续两次掷有 4 个面的骰子
sample=1000
x = np.random.randint(1,5,size=(2,sample))
print(x)

[[4 2 2 ... 4 4 1]
 [1 4 3 ... 3 1 2]]

In [3]:

Copied!

y = np.random.rand(5) # 默认范围是0到1
y
y = np.random.rand(5) # 默认范围是 0 到 1
y

Out[3]:

array([0.32375871, 0.63243135, 0.35551604, 0.06357434, 0.13069772])

与shuffle不同的是，permutation不会改变原数组

In [4]:

Copied!





np.random.shuffle(y) # 打乱顺序
y
np.random.permutation(y) # 打乱顺序,
y
np.random.shuffle(y) # 打乱顺序
y
np.random.permutation(y) # 打乱顺序,
y

Out[4]:

array([0.13069772, 0.32375871, 0.63243135, 0.35551604, 0.06357434])

In [5]:

Copied!

import numpy as np
arr = np.arange(9).reshape((3, 3))
np.random.permutation(arr) # 只对第一维度进行打乱
import numpy as np
arr = np.arange(9).reshape((3, 3))
np.random.permutation(arr) # 只对第一维度进行打乱

Out[5]:

array([[0, 1, 2],
       [6, 7, 8],
       [3, 4, 5]])

矩阵 ¶

生成矩阵 ¶

import numpy as np
np.array([1,2])
np.array([[1,2],[3,4]]) ## 注意这里的两个方括号


np.ones(4)  ## 全1矩阵       
np.zeros(4)  ## 全0矩阵
np.ones((2,4)) ## 注意这里的两层括号

np.eye(5) ## 单位阵

np.diag((1,2,4)) ## 对角阵

np.reshape()

np.empty()

np.random.randint(start,end,(shape))
np.diag(np.random.randint(10,size=3)) # 随机对角阵
np.array(np.random.randint(10,size=(3,4)))

#数组的修改
res[:, j, :, :] 切片

In [6]:

Copied!





import numpy as np
a=np.array(np.random.randint(10,size=(3,4)))
print(a)
a[1,::-2]  # 切片的应用
import numpy as np
a=np.array(np.random.randint(10,size=(3,4)))
print(a)
a[1,::-2]  # 切片的应用

[[8 4 0 6]
 [2 9 1 7]
 [3 5 3 3]]

Out[6]:

array([7, 9])

矩阵运算 ¶

@ 做矩阵乘法

* 做点乘 -> 对应相乘

matrix.T 转置

np.maximum(x, 0)
np.exp(-x)

array 在维度是 1 的时候 (1 行向量 or1 列向量 )，会自动转置，所以容易出现神奇错误，最好初始化的时候规定好

np.lib.stride_tricks.as_strided(source_list, shape=shape, strides=stride)
# stride 是数组在各个维度所对应的距离A
np.tensordot(A, B, [(1, 4, 5), (1,2, 3)])
#

广播机制 ¶

最神奇的机制：广播机制（还没有完全学会）

方程求解 ¶

In [7]:

Copied!





b = np.array(np.random.randint(20,size=(4,1)))
A = np.array(np.random.randint(20,size=(4,4)))
np.linalg.inv(A)

# 求解 Ax = b
np.linalg.solve(A,b)
np.linalg.inv(A)@b ## 更慢
b = np.array(np.random.randint(20,size=(4,1)))
A = np.array(np.random.randint(20,size=(4,4)))
np.linalg.inv(A)

# 求解 Ax = b
np.linalg.solve(A,b)
np.linalg.inv(A)@b ## 更慢

Out[7]:

array([[ 4.50684363],
       [-1.89630859],
       [-1.98257984],
       [ 4.23558689]])

矩阵分解 ¶

时间测量 ¶

在命令前面加上 %timeit 可以获得多次运行命令的时间

meshgrid 生成网格坐标 ¶

X, Y = np.meshgrid(x, y)

*xi：一个或多个一维数组，表示坐标轴上的点。如果只传入一个数组，则默认为 x 轴上的点，并会生成一个与该数组形状相同的 y 轴数组。
indexing：可选参数，用于指定网格的索引方式。默认为 'xy'，表示使用笛卡尔坐标系；也可以设置为 'ij'，表示使用矩阵坐标系。
sparse：可选参数，用于指定是否生成稀疏网格。默认为 False，表示生成密集网格；如果设置为 True，则只生成网格的行和列索引，而不生成完整的网格。
Python-Numpy 模块 Meshgrid 函数 - 知乎

In [8]:

Copied!





import numpy as np

# 生成二维网格
x = np.linspace(-5, 5, 100)  # x 轴上的点
y = np.linspace(-5, 5, 100)  # y 轴上的点
X, Y = np.meshgrid(x, y)
import numpy as np

# 生成二维网格
x = np.linspace(-5, 5, 100)  # x 轴上的点
y = np.linspace(-5, 5, 100)  # y 轴上的点
X, Y = np.meshgrid(x, y)

In [9]:

Copied!





import numpy as np
a=[['a','a','a'],['b','b','b']] ## 2个成员
b=[[1,2,3],[1,2,3]] ## 3个成员
## 纵向展开
print (np.vstack([a,b]),'\n Dimension: ',np.vstack([a,b]).shape)
## 横向展开
print(np.hstack([a,b]),'\n Dimension: ',np.hstack([a,b]).shape)
## 网格展开
print(np.dstack([a,b]),'\n Dimension: ',np.dstack([a,b]).shape)
import numpy as np
a=[['a','a','a'],['b','b','b']] ## 2个成员
b=[[1,2,3],[1,2,3]] ## 3个成员
## 纵向展开
print (np.vstack([a,b]),'\n Dimension: ',np.vstack([a,b]).shape)
## 横向展开
print(np.hstack([a,b]),'\n Dimension: ',np.hstack([a,b]).shape)
## 网格展开
print(np.dstack([a,b]),'\n Dimension: ',np.dstack([a,b]).shape)

[['a' 'a' 'a']
 ['b' 'b' 'b']
 ['1' '2' '3']
 ['1' '2' '3']] 
 Dimension:  (4, 3)
[['a' 'a' 'a' '1' '2' '3']
 ['b' 'b' 'b' '1' '2' '3']] 
 Dimension:  (2, 6)
[[['a' '1']
  ['a' '2']
  ['a' '3']]

 [['b' '1']
  ['b' '2']
  ['b' '3']]] 
 Dimension:  (2, 3, 2)

图像处理 ¶

把图片按照卷积核进行分割成多维的数组

In [10]:

Copied!





def get_feature_map(self, X, kh, kw, s):#向量化处理
    '''
    :param X:
    :param kh: height of kernel
    :param kw: width of kernel
    :param s: stride
    :return: 按k*k大小分割好的数组
    '''
    N, C, H, W = X.shape
    oh = (H - kh) // s + 1
    ow = (W - kw) // s + 1
    shape = (N, C, oh, ow, kh, kw)  #切割形状
    stride = (*X.strides[:2], X.strides[-2] * s, X.strides[-1] * s, *X.strides[-2:])#切割方式
    A = np.lib.stride_tricks.as_strided(X, shape=shape, strides=stride)
    return A
def get_feature_map(self, X, kh, kw, s):# 向量化处理
    '''
    :param X:
    :param kh: height of kernel
    :param kw: width of kernel
    :param s: stride
    :return: 按k*k大小分割好的数组
    '''
    N, C, H, W = X.shape
    oh = (H - kh) // s + 1
    ow = (W - kw) // s + 1
    shape = (N, C, oh, ow, kh, kw)  #切割形状
    stride = (*X.strides[:2], X.strides[-2] * s, X.strides[-1] * s, *X.strides[-2:])#切割方式
    A = np.lib.stride_tricks.as_strided(X, shape=shape, strides=stride)
    return A

pandas¶

Dataframe¶

表格
读入和写出
处理 csv,xls,json,sql

创建表格 ¶

从字典创建
zip 创建

In [11]:

Copied!





import pandas as pd
pd.options.mode.chained_assignment = None # 防止出现SettingWithCopyWarning
l1=[1,2,3,4]
l2=[21,18,22,19]
l3=['Adam','Bob','Cinda','David']

dict3={'ID':l1,'Age':l2,'Name':l3}
df = pd.DataFrame(dict3)
df
import pandas as pd
pd.options.mode.chained_assignment = None # 防止出现SettingWithCopyWarning
l1=[1,2,3,4]
l2=[21,18,22,19]
l3=['Adam','Bob','Cinda','David']

dict3={'ID':l1,'Age':l2,'Name':l3}
df = pd.DataFrame(dict3)
df

Out[11]:

	ID	Age	Name
0	1	21	Adam
1	2	18	Bob
2	3	22	Cinda
3	4	19	David

In [12]:

Copied!





# 简化写法
df = pd.DataFrame(list(zip(l1, l2,l3)),
               columns =['ID', 'Age', 'Name'])
df
# 简化写法
df = pd.DataFrame(list(zip(l1, l2,l3)),
               columns =['ID', 'Age', 'Name'])
df

Out[12]:

	ID	Age	Name
0	1	21	Adam
1	2	18	Bob
2	3	22	Cinda
3	4	19	David

读入读出 ¶

In [13]:

Copied!

hang = pd.read_csv("hang.csv",delimiter=";", quotechar='"')

hang.head(5) # 前5行
hang.tail(5) # 后5行

hang.dtypes # 所有字段的格式
hang.info()

hang.describe() # 把所有有关数字的信息做一个整理
hang = pd.read_csv("hang.csv",delimiter=";", quotechar='"')

hang.head(5) # 前5行
hang.tail(5) # 后5行

hang.dtypes # 所有字段的格式
hang.info()

hang.describe() # 把所有有关数字的信息做一个整理

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[13], line 1
----> 1 hang = pd.read_csv("hang.csv",delimiter=";", quotechar='"')
      3 hang.head(5) # 前5行
      4 hang.tail(5) # 后5行

File /opt/hostedtoolcache/Python/3.12.4/x64/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1026, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
   1013 kwds_defaults = _refine_defaults_read(
   1014     dialect,
   1015     delimiter,
   (...)
   1022     dtype_backend=dtype_backend,
   1023 )
   1024 kwds.update(kwds_defaults)
-> 1026 return _read(filepath_or_buffer, kwds)

File /opt/hostedtoolcache/Python/3.12.4/x64/lib/python3.12/site-packages/pandas/io/parsers/readers.py:620, in _read(filepath_or_buffer, kwds)
    617 _validate_names(kwds.get("names", None))
    619 # Create the parser.
--> 620 parser = TextFileReader(filepath_or_buffer, **kwds)
    622 if chunksize or iterator:
    623     return parser

File /opt/hostedtoolcache/Python/3.12.4/x64/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1620, in TextFileReader.__init__(self, f, engine, **kwds)
   1617     self.options["has_index_names"] = kwds["has_index_names"]
   1619 self.handles: IOHandles | None = None
-> 1620 self._engine = self._make_engine(f, self.engine)

File /opt/hostedtoolcache/Python/3.12.4/x64/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1880, in TextFileReader._make_engine(self, f, engine)
   1878     if "b" not in mode:
   1879         mode += "b"
-> 1880 self.handles = get_handle(
   1881     f,
   1882     mode,
   1883     encoding=self.options.get("encoding", None),
   1884     compression=self.options.get("compression", None),
   1885     memory_map=self.options.get("memory_map", False),
   1886     is_text=is_text,
   1887     errors=self.options.get("encoding_errors", "strict"),
   1888     storage_options=self.options.get("storage_options", None),
   1889 )
   1890 assert self.handles is not None
   1891 f = self.handles.handle

File /opt/hostedtoolcache/Python/3.12.4/x64/lib/python3.12/site-packages/pandas/io/common.py:873, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    868 elif isinstance(handle, str):
    869     # Check whether the filename is to be opened in binary mode.
    870     # Binary mode does not support 'encoding' and 'newline'.
    871     if ioargs.encoding and "b" not in ioargs.mode:
    872         # Encoding
--> 873         handle = open(
    874             handle,
    875             ioargs.mode,
    876             encoding=ioargs.encoding,
    877             errors=errors,
    878             newline="",
    879         )
    880     else:
    881         # Binary mode
    882         handle = open(handle, ioargs.mode)

FileNotFoundError: [Errno 2] No such file or directory: 'hang.csv'

In [14]:

Copied!





# 保存到excel
hang.to_excel("hang.xlsx", sheet_name="weather", index=False) # 转换成excel
# 从excel读取
hang = pd.read_excel("hang.xlsx", sheet_name="weather")
hang
# 保存到 excel
hang.to_excel("hang.xlsx", sheet_name="weather", index=False) # 转换成excel
# 从excel读取
hang = pd.read_excel("hang.xlsx", sheet_name="weather")
hang 

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[14], line 2
      1 # 保存到excel
----> 2 hang.to_excel("hang.xlsx", sheet_name="weather", index=False) # 转换成excel
      3 # 从excel读取
      4 hang = pd.read_excel("hang.xlsx", sheet_name="weather")

NameError: name 'hang' is not defined

表格操作 ¶

table['name'] # 选择一列
type(table['name'])

reduce=hang[['time','T']] # 选取某一列
reduce[reduce["T"] > 35]
reduce.iloc[9:14, 0:2] # 先选行，后选列
reduce.loc[reduce["T"] > 35, "time"]

df.mean()
df.var()
df.std()
df.sum()

df.values()
df.values().flatten()

表格合并 ¶

merge
concat

pd.merge(sub3, sub4,
        how='left', left_on='time', right_on='time')

pd.concat([sub1, sub2], axis=0)
pd.concat([sub1, sub2], axis=1)

简单绘图 ¶

hang['P'].plot()
(hang['Tx']-hang['Tn']).plot.hist()

时间信息画图 ¶

reduce["time"]=pd.to_datetime(reduce["time"],dayfirst=True)
reduce

reduce["month"]=reduce['time'].dt.month
reduce['year']=reduce['time'].dt.year
reduce['hour']=reduce['time'].dt.hour

reduce.groupby(["month","year"]).mean()

matplotlib & seaborn¶

seaborn: statistical data visualization — seaborn 0.13.2 documentation

statistical data visualization Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

直方图 ¶

In [15]:

Copied!





# A和B约定见面，各自将迟到0-1小时。先来者等待15分钟后将离开。
# A和B见到的概率多大？

import numpy as np
import collections
import matplotlib.pyplot as plt
import seaborn as sns

ab = np.random.rand(2,10000)
print(ab)

f, axs = plt.subplots(1,3)

sns.histplot(ab[0],stat='probability',ax=axs[0])
sns.histplot(ab[1],stat='probability',ax=axs[1])
sns.histplot(ab[0]-ab[1],stat='probability',ax=axs[2],color='yellow')
plt.tight_layout() # 调整子图之间的间距
# A 和 B 约定见面，各自将迟到 0-1 小时。先来者等待 15 分钟后将离开。
# A和B见到的概率多大？

import numpy as np
import collections
import matplotlib.pyplot as plt
import seaborn as sns

ab = np.random.rand(2,10000)
print(ab)

f, axs = plt.subplots(1,3)

sns.histplot(ab[0],stat='probability',ax=axs[0])
sns.histplot(ab[1],stat='probability',ax=axs[1])
sns.histplot(ab[0]-ab[1],stat='probability',ax=axs[2],color='yellow')
plt.tight_layout() # 调整子图之间的间距

[[0.85836148 0.76317309 0.54724431 ... 0.67767214 0.79309025 0.96454926]
 [0.79351607 0.63069391 0.21904404 ... 0.09584565 0.25092932 0.0833094 ]]

No description has been provided for this image

discrete 参数在 sns.histplot 函数中的作用是指定数据是否为离散型数据。设置 discrete=True 会使得直方图的每个条形对应一个离散值，而不是一个连续的区间

概率分布 ¶

In [16]:

Copied!





# 从分布的积分图形得到概率
f, axs = plt.subplots(1)

sns.ecdfplot(abs(ab[0]-ab[1]),ax=axs) # cumulative distribution function
x_special = 0.25

for line in axs.get_lines():
    x,y = line.get_data()
    ind = np.argwhere(x > x_special)[0,0]  # first index where x is larger than x_special
    axs.text(x_special,y[ind], f' {y[ind]:.4f}', ha='left', va='top') # maybe color=line.get_color()
axs.axvline(x_special, linestyle='--', color='#cfcfcf', lw=2, alpha=0.75)
plt.show()
# 从分布的积分图形得到概率
f, axs = plt.subplots(1)

sns.ecdfplot(abs(ab[0]-ab[1]),ax=axs) # cumulative distribution function
x_special = 0.25

for line in axs.get_lines():
    x,y = line.get_data()
    ind = np.argwhere(x > x_special)[0,0]  # first index where x is larger than x_special
    axs.text(x_special,y[ind], f' {y[ind]:.4f}', ha='left', va='top') # maybe color=line.get_color()
axs.axvline(x_special, linestyle='--', color='#cfcfcf', lw=2, alpha=0.75)
plt.show()

seaborn¶

seaborn 常见绘图学习总结（分布图） - 简书

Seaborn 常见绘图总结 _seaborn 绘图总结 csdn-CSDN 博客

relational Plot | 关系图 ¶

Scatter Plot | 散点图 ¶

lineplot¶

jointplot¶

x,y：代表待分析的成对变量，有两种模式，第一种模式：在参数 data 传入数据框时，x、y 均传入字符串，指代数据框中的变量名；第二种模式：在参数 data 为 None 时，x、y 直接传入两个一维数组，不依赖数据框
data：与上一段中的说明相对应，代表数据框，默认为 None
kind：字符型变量，用于控制展示成对变量相关情况的主图中的样式
color：控制图像中对象的色彩
height：控制图像为正方形时的边长
ratio：int 型，调节联合图与边缘图的相对比例，越大则边缘图越矮，默认为 5
space：int 型，用于控制联合图与边缘图的空白大小
xlim,ylim：设置 x 轴与 y 轴显示范围
joint_kws,marginal_kws,annot_kws：传入参数字典来分别精细化控制每个组件

详解 seaborn 中的 kdeplot、rugplot、distplot 与 jointplot_seaborn.kdeplot-CSDN 博客

Seaborn 系列 ( 三 )：分布统计绘图 (distribution)_displot-CSDN 博客

distplot¶

kdeplot¶

rugplot¶

sklearn¶

安装方法，注意名字不是 sklearn

pip install scikit-learn

pip install statsmodels -i https://pypi.tuna.tsinghua.edu.cn/simple/

statsmodels · PyPI

这个库执行需要翻墙，而且很容易连接不上