00 | Python Packages¶
numpy¶
Random number generation¶
In [1]:
# For arithmetic on a purely numeric list, convert it to an array first
import numpy as np
a = [1,2,3,4]
d=np.array(a)
print(d+1)
print(d*2)
print(sum(d))
[2 3 4 5]
[2 4 6 8]
10
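Arrays can be generated with:
- np.random.randint(start, end, (shape)): random integers in [start, end)
- np.random.randn(d0, d1, ...): normally distributed random numbers (the dimensions are passed as separate integers, not as a tuple)
- np.arange(start, end, step): an arithmetic sequence
- np.repeat(): repeats the elements of an array
- np.random.seed(23): sets the random seed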
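The generators listed above can be tried together in a few lines. This is only a small sketch; the shapes and the seed value are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(23)  # fix the seed so that the run is reproducible

ints = np.random.randint(1, 5, (2, 3))  # integers drawn from [1, 5), i.e. 1..4
normal = np.random.randn(2, 3)          # standard normal samples, shape (2, 3)
steps = np.arange(0, 10, 2)             # arithmetic sequence 0, 2, 4, 6, 8
rep = np.repeat([1, 2], 3)              # each element repeated 3 times

print(ints.shape, normal.shape)
print(steps, rep)
```

Re-seeding with the same value and repeating the same calls reproduces the same numbers.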
!!! note "H (head) and T (tail) denote the two sides of a coin"
In [2]:
# Roll a four-sided die twice in a row
sample=1000
x = np.random.randint(1,5,size=(2,sample))
print(x)
[[4 2 2 ... 4 4 1]
 [1 4 3 ... 3 1 2]]
In [3]:
y = np.random.rand(5) # uniform samples from [0, 1)
y
Out[3]:
array([0.32375871, 0.63243135, 0.35551604, 0.06357434, 0.13069772])
Unlike shuffle, which reorders the array in place, permutation returns a shuffled copy and does not modify the original array.
In [4]:
np.random.shuffle(y) # shuffles y in place
y
np.random.permutation(y) # returns a shuffled copy; y itself is unchanged
y
Out[4]:
array([0.13069772, 0.32375871, 0.63243135, 0.35551604, 0.06357434])
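The difference between the two functions is easy to check directly. A minimal sketch, using a small integer array instead of the random floats above:

```python
import numpy as np

np.random.seed(0)
y = np.arange(5)
original = y.copy()

p = np.random.permutation(y)             # new array, y is left alone
unchanged = np.array_equal(y, original)  # True: permutation did not touch y

result = np.random.shuffle(y)            # reorders y itself and returns None
print(unchanged, result, y)
```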
In [5]:
import numpy as np
arr = np.arange(9).reshape((3, 3))
np.random.permutation(arr) # only the first axis (the rows) is shuffled
Out[5]:
array([[0, 1, 2],
       [6, 7, 8],
       [3, 4, 5]])
Matrices¶
Creating matrices¶
import numpy as np
np.array([1,2])
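np.array([[1,2],[3,4]]) ## note the two levels of square brackets
np.ones(4) ## array of ones
np.zeros(4) ## array of zeros
np.ones((2,4)) ## note the two levels of parentheses: the shape is a tuple
np.eye(5) ## identity matrix
np.diag((1,2,4)) ## diagonal matrix
np.reshape(np.arange(6), (2,3)) ## same data, new shape
np.empty((2,3)) ## uninitialized array, the contents are arbitrary
np.random.randint(0,10,(2,3)) ## random integers in [0, 10) with the given shape
np.diag(np.random.randint(10,size=3)) # random diagonal matrix
np.array(np.random.randint(10,size=(3,4)))
# Modifying and slicing arrays
# res[:, j, :, :] selects index j along the second axis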
In [6]:
import numpy as np
a=np.array(np.random.randint(10,size=(3,4)))
print(a)
a[1,::-2] # slicing: row 1, every second element starting from the end
[[8 4 0 6]
 [2 9 1 7]
 [3 5 3 3]]
Out[6]:
array([7, 9])
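Matrix operations¶
- @ : matrix multiplication
- * : element-wise multiplication (corresponding entries are multiplied; this is not a matrix product)
- matrix.T : transpose
- np.maximum(x, 0) : element-wise maximum of x and 0
- np.exp(-x) : element-wise exponential of -x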
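A 1-D array of shape (n,) is neither a row vector nor a column vector: .T leaves it unchanged, and @ treats it as a row or as a column depending on which side it appears on. This easily leads to surprising errors, so it is best to fix the shape explicitly, (1, n) or (n, 1), when the array is created.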
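The operators and the 1-D pitfall described above can be seen on a tiny example. This is a sketch with hand-picked values, not taken from the original notes.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])

matmul = A @ B        # matrix product
elementwise = A * B   # entry-by-entry product

v = np.array([1, 2])        # 1-D array, shape (2,)
same_shape = v.T.shape      # transposing a 1-D array changes nothing

col = v.reshape(-1, 1)      # explicit column vector, shape (2, 1)
row = v.reshape(1, -1)      # explicit row vector, shape (1, 2)
inner = row @ col           # shape (1, 1)
outer = col @ row           # shape (2, 2)

print(matmul, elementwise, same_shape, inner, outer, sep="\n")
```

With the shapes made explicit, the inner and outer products can no longer be confused.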
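np.lib.stride_tricks.as_strided(source_list, shape=shape, strides=stride)
# strides gives, for each dimension, the number of bytes to step to reach the next element
np.tensordot(A, B, [(1, 4, 5), (1,2, 3)])
# contracts axes 1, 4, 5 of A with axes 1, 2, 3 of B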
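As a smaller illustration of the tensordot call above, the following sketch contracts two axes of a 3-D array with two axes of another and compares the result with np.einsum. The array shapes are made up for the example.

```python
import numpy as np

A = np.arange(2 * 3 * 4).reshape(2, 3, 4)
B = np.arange(3 * 4 * 5).reshape(3, 4, 5)

# Contract axes (1, 2) of A with axes (0, 1) of B; the remaining axes give shape (2, 5)
C = np.tensordot(A, B, axes=[(1, 2), (0, 1)])

# The same contraction written with explicit indices
C_ref = np.einsum("ijk,jkl->il", A, B)

print(C.shape, np.array_equal(C, C_ref))
```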
Broadcasting¶
- The most magical mechanism: broadcasting (I have not fully mastered it yet)
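A minimal sketch of what broadcasting does: shapes are compared from the last axis backwards, and an axis of length 1 (or a missing axis) is stretched to match the other operand. The arrays below are illustrative only.

```python
import numpy as np

a = np.arange(3).reshape(3, 1)  # shape (3, 1)
b = np.arange(4)                # shape (4,), treated as (1, 4)
table = a + b                   # broadcast to shape (3, 4): table[i, j] = i + j

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
centered = X - X.mean(axis=0)   # (2, 3) minus (3,): the column means are subtracted from every row

print(table)
print(centered)
```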
Solving linear equations¶
In [7]:
b = np.array(np.random.randint(20,size=(4,1)))
A = np.array(np.random.randint(20,size=(4,4)))
np.linalg.inv(A)
# Solve Ax = b
np.linalg.solve(A,b)
np.linalg.inv(A)@b ## slower, and usually less accurate than solve
Out[7]:
array([[ 4.50684363],
       [-1.89630859],
       [-1.98257984],
       [ 4.23558689]])
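The cell above uses random matrices, which makes the result hard to check by hand (and a random integer matrix can occasionally be singular). A small sketch with a fixed, invertible matrix chosen for illustration:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
b = np.array([[1.0],
              [2.0]])

x = np.linalg.solve(A, b)       # solves A x = b without forming the inverse
x_inv = np.linalg.inv(A) @ b    # same answer through the explicit inverse

residual = A @ x - b            # should be zero up to rounding
print(x.ravel(), np.abs(residual).max())
```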
Matrix decomposition¶
Timing¶
Prefix a command with %timeit to measure its running time averaged over many runs.
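%timeit is an IPython magic and only works in a notebook or IPython shell. Outside of those, the standard-library timeit module gives a comparable measurement. The sketch below is an assumption about how one might use it here; the workload (summing 10,000 numbers) is arbitrary.

```python
import timeit
import numpy as np

data = list(range(10_000))
arr = np.arange(10_000)

# Total time in seconds for 200 repetitions of each statement
t_list = timeit.timeit(lambda: sum(data), number=200)
t_arr = timeit.timeit(lambda: arr.sum(), number=200)

print(f"list sum: {t_list:.4f} s, array sum: {t_arr:.4f} s")
```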
meshgrid: generating grid coordinates¶
X, Y = np.meshgrid(x, y)
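- *xi: one or more 1-D arrays holding the points along each coordinate axis. If only one array is passed, the result contains just that one axis; no y array is generated.
- indexing: optional, selects the indexing convention of the grid. The default 'xy' is Cartesian indexing; 'ij' is matrix indexing.
- sparse: optional, selects whether a sparse grid is returned. The default False returns dense grids; with True each output has length 1 along all other dimensions, so the full grid is only formed by broadcasting.
- Reference: Python NumPy module, the Meshgrid function (Zhihu)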
In [8]:
import numpy as np
# Build a 2-D grid
x = np.linspace(-5, 5, 100) # points on the x axis
y = np.linspace(-5, 5, 100) # points on the y axis
X, Y = np.meshgrid(x, y)
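The effect of the indexing and sparse parameters is easiest to see on a grid with different lengths along x and y. A small sketch with illustrative sizes (3 points on x, 2 on y):

```python
import numpy as np

x = np.linspace(0, 1, 3)
y = np.linspace(0, 1, 2)

X, Y = np.meshgrid(x, y)                  # 'xy' indexing: shape (len(y), len(x))
I, J = np.meshgrid(x, y, indexing="ij")   # 'ij' indexing: shape (len(x), len(y))
Xs, Ys = np.meshgrid(x, y, sparse=True)   # shapes (1, 3) and (2, 1)

Z = X**2 + Y**2            # evaluated on the dense grid
Z_sparse = Xs**2 + Ys**2   # broadcasting rebuilds the same grid

print(X.shape, I.shape, Xs.shape, Ys.shape)
```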
In [9]:
import numpy as np
a=[['a','a','a'],['b','b','b']] ## 2 rows of 3 strings
b=[[1,2,3],[1,2,3]] ## 2 rows of 3 integers (converted to strings when stacked with a)
## stack vertically (along the rows)
print (np.vstack([a,b]),'\n Dimension: ',np.vstack([a,b]).shape)
## stack horizontally (along the columns)
print(np.hstack([a,b]),'\n Dimension: ',np.hstack([a,b]).shape)
## stack depth-wise (along a new third axis)
print(np.dstack([a,b]),'\n Dimension: ',np.dstack([a,b]).shape)
[['a' 'a' 'a']
 ['b' 'b' 'b']
 ['1' '2' '3']
 ['1' '2' '3']]
 Dimension:  (4, 3)
[['a' 'a' 'a' '1' '2' '3']
 ['b' 'b' 'b' '1' '2' '3']]
 Dimension:  (2, 6)
[[['a' '1']
  ['a' '2']
  ['a' '3']]

 [['b' '1']
  ['b' '2']
  ['b' '3']]]
 Dimension:  (2, 3, 2)
Image processing¶
- Split an image into a multi-dimensional array of patches, one per position of the convolution kernel
In [10]:
def get_feature_map(self, X, kh, kw, s):  # vectorized patch extraction
    '''
    :param X: input array of shape (N, C, H, W)
    :param kh: height of kernel
    :param kw: width of kernel
    :param s: stride
    :return: view of X split into kh*kw windows, shape (N, C, oh, ow, kh, kw)
    '''
    N, C, H, W = X.shape
    oh = (H - kh) // s + 1
    ow = (W - kw) // s + 1
    shape = (N, C, oh, ow, kh, kw)  # shape of the windowed view
    # byte strides: batch, channel, window row step, window column step, row and column inside a window
    stride = (*X.strides[:2], X.strides[-2] * s, X.strides[-1] * s, *X.strides[-2:])
    A = np.lib.stride_tricks.as_strided(X, shape=shape, strides=stride)
    return A
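To check that the stride arithmetic above really picks out the intended windows, the same logic can be written as a standalone function and compared with NumPy's sliding_window_view. This is a sketch: extract_patches is a hypothetical name, and the read-only flag is an added safety choice rather than part of the original method.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view


def extract_patches(X, kh, kw, s):
    """Return a read-only view of shape (N, C, oh, ow, kh, kw) on X."""
    N, C, H, W = X.shape
    oh = (H - kh) // s + 1
    ow = (W - kw) // s + 1
    shape = (N, C, oh, ow, kh, kw)
    strides = (*X.strides[:2], X.strides[-2] * s, X.strides[-1] * s, *X.strides[-2:])
    # writeable=False guards against accidental writes through overlapping windows
    return as_strided(X, shape=shape, strides=strides, writeable=False)


X = np.arange(2 * 1 * 5 * 5, dtype=float).reshape(2, 1, 5, 5)
patches = extract_patches(X, kh=3, kw=3, s=2)

# Reference: all 3x3 windows over the last two axes, then keep every second position
reference = sliding_window_view(X, (3, 3), axis=(2, 3))[:, :, ::2, ::2]

print(patches.shape, np.array_equal(patches, reference))
```

as_strided performs no bounds checking, so a wrong shape or stride silently reads unrelated memory; comparing against a safe reference on a small input is a cheap guard.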
pandas¶
DataFrame¶
- Tables
- Reading and writing
- Handles csv, xls, json, sql
Creating a table¶
- From a dictionary
- From zip
In [11]:
import pandas as pd
pd.options.mode.chained_assignment = None # silences SettingWithCopyWarning (the warning is hidden, not fixed)
l1=[1,2,3,4]
l2=[21,18,22,19]
l3=['Adam','Bob','Cinda','David']
dict3={'ID':l1,'Age':l2,'Name':l3}
df = pd.DataFrame(dict3)
df
Out[11]:
| | ID | Age | Name |
|---|---|---|---|
| 0 | 1 | 21 | Adam |
| 1 | 2 | 18 | Bob |
| 2 | 3 | 22 | Cinda |
| 3 | 4 | 19 | David |
In [12]:
# Shorter way: build the rows with zip
df = pd.DataFrame(list(zip(l1, l2,l3)),
                  columns =['ID', 'Age', 'Name'])
df
Out[12]:
| | ID | Age | Name |
|---|---|---|---|
| 0 | 1 | 21 | Adam |
| 1 | 2 | 18 | Bob |
| 2 | 3 | 22 | Cinda |
| 3 | 4 | 19 | David |
Reading and writing¶
In [13]:
hang = pd.read_csv("hang.csv",delimiter=";", quotechar='"')
hang.head(5) # first 5 rows
hang.tail(5) # last 5 rows
hang.dtypes # dtype of every column
hang.info()
hang.describe() # summary statistics of the numeric columns
FileNotFoundError: [Errno 2] No such file or directory: 'hang.csv'
In [14]:
# Save to Excel
hang.to_excel("hang.xlsx", sheet_name="weather", index=False) # write the table as an Excel sheet
# Read back from Excel
hang = pd.read_excel("hang.xlsx", sheet_name="weather")
hang
NameError: name 'hang' is not defined
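Since hang.csv is not available here, the read and write calls can still be tried on an in-memory file. The sketch below assumes a semicolon-delimited layout with time and T columns, which is a guess at the structure of the real file; the two rows are made-up sample values.

```python
import io
import pandas as pd

# Made-up sample standing in for hang.csv (semicolon-delimited, quoted time strings)
csv_text = 'time;T\n"01.07.2020 12:00";36.5\n"02.07.2020 12:00";33.1\n'

weather = pd.read_csv(io.StringIO(csv_text), delimiter=";", quotechar='"')

# Write back out with the same delimiter and read it in again
buffer = io.StringIO()
weather.to_csv(buffer, sep=";", index=False)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()), sep=";")

print(weather.dtypes)
print(round_trip.equals(weather))
```

Replacing io.StringIO(...) with a file path gives the on-disk version of the same round trip.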
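Table operations¶
table['name'] # select one column (returns a Series)
type(table['name'])
reduce=hang[['time','T']].copy() # select several columns; copy() so later assignments do not touch hang
reduce[reduce["T"] > 35] # rows where T > 35
reduce.iloc[9:14, 0:2] # by position: rows first, then columns
reduce.loc[reduce["T"] > 35, "time"] # by label: row mask first, then column name
df.mean(numeric_only=True) # numeric_only skips text columns such as Name
df.var(numeric_only=True)
df.std(numeric_only=True)
df.sum()
df.values # an attribute, not a method: the data as a NumPy array
df.values.flatten() # the same data flattened to 1-D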
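The selection calls above can be tried on a small table. A minimal sketch; the column names follow the notes, the values are invented.

```python
import pandas as pd

obs = pd.DataFrame({"time": ["a", "b", "c", "d"],
                    "T": [30.0, 36.0, 37.5, 20.0]})

hot = obs[obs["T"] > 35]                   # boolean mask keeps rows 1 and 2
block = obs.iloc[1:3, 0:2]                 # by position: rows 1..2, columns 0..1
hot_times = obs.loc[obs["T"] > 35, "time"] # by label: mask for rows, name for column
flat = obs[["T"]].values.flatten()         # 2-D values flattened to 1-D
mean_T = obs["T"].mean()

print(hot, block, hot_times, flat, mean_T, sep="\n")
```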
Merging tables¶
- merge
- concat
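# left join on the shared 'time' column: every row of sub3 is kept
pd.merge(sub3, sub4,
         how='left', left_on='time', right_on='time')
pd.concat([sub1, sub2], axis=0) # stack the rows of sub2 below sub1
pd.concat([sub1, sub2], axis=1) # place the columns side by side, aligned on the index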
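A small sketch of both operations on invented tables (left and right are hypothetical stand-ins for sub1 to sub4):

```python
import pandas as pd

left = pd.DataFrame({"time": [1, 2, 3], "T": [20, 21, 22]})
right = pd.DataFrame({"time": [1, 3], "P": [1000, 1005]})

# Left join: all rows of `left` survive, P is NaN where `right` has no match
merged = pd.merge(left, right, how="left", on="time")

rows = pd.concat([left, left], axis=0, ignore_index=True)  # 6 rows, same columns
cols = pd.concat([left, right], axis=1)                    # aligned on the row index

print(merged)
print(rows.shape, cols.shape)
```

Note that concat with axis=1 aligns on the index, not on the time column, so it is not a substitute for merge.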
Simple plotting¶
hang['P'].plot()
(hang['Tx']-hang['Tn']).plot.hist()
Plotting with time information¶
reduce["time"]=pd.to_datetime(reduce["time"],dayfirst=True)
reduce
reduce["month"]=reduce['time'].dt.month
reduce['year']=reduce['time'].dt.year
reduce['hour']=reduce['time'].dt.hour
reduce.groupby(["month","year"]).mean()
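The date handling above can be checked on a few hand-written rows. A sketch assuming day-first date strings, as the dayfirst=True argument suggests; the values are invented.

```python
import pandas as pd

obs = pd.DataFrame({
    "time": ["01/07/2020 12:00", "15/07/2020 13:00", "02/08/2020 12:00"],
    "T": [30.0, 34.0, 28.0],
})

obs["time"] = pd.to_datetime(obs["time"], dayfirst=True)  # 01/07 is 1 July, not 7 January
obs["month"] = obs["time"].dt.month
obs["year"] = obs["time"].dt.year
obs["hour"] = obs["time"].dt.hour

monthly = obs.groupby(["year", "month"])["T"].mean()  # mean temperature per month
print(monthly)
```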
matplotlib & seaborn¶
seaborn: statistical data visualization (seaborn 0.13.2 documentation)
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
Histograms¶
In [15]:
# A and B agree to meet; each arrives late by a random 0-1 hours.
# Whoever arrives first waits 15 minutes and then leaves.
# What is the probability that A and B meet?
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
ab = np.random.rand(2,10000)
print(ab)
f, axs = plt.subplots(1,3)
sns.histplot(ab[0],stat='probability',ax=axs[0])
sns.histplot(ab[1],stat='probability',ax=axs[1])
sns.histplot(ab[0]-ab[1],stat='probability',ax=axs[2],color='yellow')
plt.tight_layout() # adjust the spacing between subplots
[[0.85836148 0.76317309 0.54724431 ... 0.67767214 0.79309025 0.96454926]
 [0.79351607 0.63069391 0.21904404 ... 0.09584565 0.25092932 0.0833094 ]]
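The discrete parameter of sns.histplot specifies whether the data are discrete. With discrete=True, each bar of the histogram corresponds to a single discrete value (bins of width 1 centered on the data values) rather than to a continuous interval.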
Probability distributions¶
In [16]:
# Read the probability off the cumulative distribution plot:
# A and B meet when |a - b| < 0.25 h, so the ECDF of |a - b| at 0.25 is the answer
f, axs = plt.subplots(1)
sns.ecdfplot(abs(ab[0]-ab[1]),ax=axs) # empirical cumulative distribution function
x_special = 0.25
for line in axs.get_lines():
    x,y = line.get_data()
    ind = np.argwhere(x > x_special)[0,0] # first index where x is larger than x_special
    axs.text(x_special,y[ind], f' {y[ind]:.4f}', ha='left', va='top') # maybe color=line.get_color()
axs.axvline(x_special, linestyle='--', color='#cfcfcf', lw=2, alpha=0.75)
plt.show()
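The value read off the plot can be cross-checked without any plotting. On the unit square, the region where |a - b| >= 0.25 consists of two triangles with total area 0.75², so the exact probability is 1 - 0.75² = 0.4375. The sketch below compares that with a Monte Carlo estimate; the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.random((2, 200_000))  # arrival delays of A and B, in hours

wait = 0.25                                  # 15 minutes
estimate = np.mean(np.abs(a - b) < wait)     # fraction of trials in which they meet
exact = 1 - (1 - wait) ** 2                  # area argument on the unit square

print(f"estimate {estimate:.4f}, exact {exact:.4f}")
```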
Relational plot¶
Scatter plot¶
lineplot¶
jointplot¶
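- x, y: the pair of variables to analyze. There are two modes. In the first, a DataFrame is passed as data and x and y are strings naming its columns. In the second, data is None and x and y are passed directly as two 1-D arrays, with no DataFrame involved.
- data: the DataFrame referred to above; defaults to None
- kind: a string controlling the style of the main plot that shows how the pair of variables is related
- color: controls the color of the objects in the figure
- height: side length of the (square) figure
- ratio: int, the ratio of the joint plot height to the marginal plot height; the larger it is, the shorter the marginal plots; defaults to 5
- space: a number controlling the gap between the joint plot and the marginal plots
- xlim, ylim: the displayed range of the x and y axes
- joint_kws, marginal_kws, annot_kws: dictionaries of keyword arguments for fine control of each component (annot_kws appears to be accepted only by older seaborn releases)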
A detailed guide to kdeplot, rugplot, distplot and jointplot in seaborn (CSDN blog)
Seaborn series (3): distribution plots (CSDN blog)
distplot¶
kdeplot¶
rugplot¶
sklearn¶
Installation: note that the package name is scikit-learn, not sklearn
pip install scikit-learn
pip install statsmodels -i https://pypi.tuna.tsinghua.edu.cn/simple/
Installing this library may require a proxy to get past the firewall, and the connection fails easily.