Python Pandas

2020-07-11

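Python Basics

Pandas

Introduction to the Pandas library

Pandas data structures: Series and common operations

Pandas data structures: DataFrame and common operations

Summarizing and computing descriptive statistics

# A Series consists of a set of data and its associated index. It resembles a labeled array, and its basic operations resemble those of an ndarray and a dict
# A DataFrame consists of a set of columns that share a common index and is mainly used to represent two-dimensional data
# Operations on both data types: basic operations, arithmetic operations, feature operations, relational operations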

import pandas as pd
from pandas import Series,DataFrame

Series

Syntax: Series(data, index=index, name=name)

# Create a Series
import numpy as np
import pandas as pd
series1 = pd.Series([1, 3, 8, 10, 45])
print("Created from a list:\n {}".format(series1))

Created from a list:
 0     1
1     3
2     8
3    10
4    45
dtype: int64

ser2 = pd.Series([0,2,8,10],index=['a','b','c','d'])
print(ser2)

a     0
b     2
c     8
d    10
dtype: int64

# Create from a scalar
ser3 = pd.Series(30,index=['a','b','c','d'])
print(ser3,ser3.shape,np.shape(ser3),sep='\n')

a    30
b    30
c    30
d    30
dtype: int64
(4,)
(4,)

# Create from a dict
ser4 = pd.Series({'a':1, 'b':3, 'c':8, 'd':10, 'e':45})
print(ser4)

a     1
b     3
c     8
d    10
e    45
dtype: int64

# Use the keys listed in index as the labels; a key with no value in the dict shows NaN
ser5 = pd.Series({'a':1, 'b':3, 'c':8, 'd':10, 'e':45},index=['b','d','a','f'])
print(ser5)

b     3.0
d    10.0
a     1.0
f     NaN
dtype: float64
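
When the requested index contains a label that is missing from the dict, the value becomes NaN and the dtype is promoted to float64. A minimal sketch of how such gaps might be detected and filled; the fill value 0 and the variable names are assumptions made for illustration.

```python
import pandas as pd

data = {'a': 1, 'b': 3, 'c': 8, 'd': 10, 'e': 45}
# 'f' is not a key of data, so its value is NaN
ser = pd.Series(data, index=['b', 'd', 'a', 'f'])

missing = ser.isna()                    # boolean mask, True where the value is NaN
filled = ser.fillna(0).astype('int64')  # replace NaN with 0 (assumed default), restore an integer dtype

print(missing)
print(filled)
```

Whether 0 is a sensible replacement depends on the data; dropping the rows with ser.dropna() is the other common choice.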

# Create from an ndarray
ser6 = pd.Series(np.arange(5))
print(ser6)

0    0
1    1
2    2
3    3
4    4
dtype: int32

ser6 = pd.Series(np.arange(5),index=np.arange(10,5,-1))
print(ser6)

10    0
9     1
8     2
7     3
6     4
dtype: int32

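The index and values attributes of a Series

s.index returns the index
s.values returns the data as an array

The name attribute of a Series and of its index

Accessing the data in a Series

Access by position
Access by label
Access by slice
Access by boolean mask (conditional filtering)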

ser1 = pd.Series({'a':10,'b':20,'c':30,'d':40})
print(ser1)

a    10
b    20
c    30
d    40
dtype: int64

# Show the index
ser1.index

Index(['a', 'b', 'c', 'd'], dtype='object')

# Show the data
ser1.values

array([10, 20, 30, 40], dtype=int64)

# The name attribute of a Series and of its index
ser1.name = '序列对象'        # "series object"
ser1.index.name = '索引'      # "index"
print(ser1)

索引
a    10
b    20
c    30
d    40
Name: 序列对象, dtype: int64

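# Access by position (.iloc is explicit; plain ser1[0] on a labeled index is deprecated in recent pandas)
ser1.iloc[0]

# Access by label
ser1['b']

# Access several values
ser1[['a','b','c']]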

索引
a    10
b    20
c    30
Name: 序列对象, dtype: int64

ser1.iloc[[1,2,3]]

索引
b    20
c    30
d    40
Name: 序列对象, dtype: int64

# Access by slice
ser1[:2]

索引
a    10
b    20
Name: 序列对象, dtype: int64

Note the difference: a slice by label includes its end label.

ser1['a':'c']

索引
a    10
b    20
c    30
Name: 序列对象, dtype: int64
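
The difference between positional and label slices can be made explicit with .iloc and .loc. A small sketch, assuming the same four-element Series as above; the variable names are illustrative.

```python
import pandas as pd

ser = pd.Series({'a': 10, 'b': 20, 'c': 30, 'd': 40})

by_position = ser.iloc[0:2]   # half-open: positions 0 and 1
by_label = ser.loc['a':'c']   # closed: labels 'a', 'b' and 'c'

print(len(by_position), len(by_label))
```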

# Access by boolean mask (conditional filtering)
ser1[ser1>20]

索引
c    30
d    40
Name: 序列对象, dtype: int64

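Deleting data from a Series

drop returns a copy with the data removed; the source is unchanged
pop returns the removed value; the source is modified
Note: a Series with an automatic index is deleted from by its automatic index; a Series with a custom index, by its custom labels

Modifying data in a Series

Modify data through the index

Adding data to a Series

Add by assigning to a new index label
append adds to the end and returns a copy, leaving the source unchanged (Series.append was removed in pandas 2.0; pd.concat is the replacement)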
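
On pandas 2.0 and later, adding elements at the end of a Series can be sketched with pd.concat, which likewise returns a new object. The Series contents and variable names here are assumptions for illustration.

```python
import pandas as pd

ser = pd.Series({'a': 2, 'b': 2, 'c': 2})
extra = pd.Series([600, 100], index=['f', 'g'])

combined = pd.concat([ser, extra])                       # new Series; ser is unchanged
renumbered = pd.concat([ser, extra], ignore_index=True)  # discard the labels, use 0..n-1

print(combined)
print(renumbered)
```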

ser1 = pd.Series({'a':10,'b':20,'c':30,'d':40})
print(ser1)

a    10
b    20
c    30
d    40
dtype: int64

# Delete with drop: returns a copy, the source is unchanged
s = ser1.drop('a')
print(s)
print(ser1)

b    20
c    30
d    40
dtype: int64
a    10
b    20
c    30
d    40
dtype: int64

# Delete from the source data with pop
ser1.pop('a')
print(ser1)

b    20
c    30
d    40
dtype: int64

ser2 = pd.Series(range(10))
print(ser2)

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

ser2.pop(0)
print(ser2)

1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

ser3 = pd.Series(range(5))
a = ser3.drop([1,3])
print(ser3,a,sep='\n')

0    0
1    1
2    2
3    3
4    4
dtype: int64
0    0
2    2
4    4
dtype: int64

# Modify Series values
ser4 = pd.Series(range(10,50,10),index=list('abcd'))
print(ser4)

a    10
b    20
c    30
d    40
dtype: int64

ser4.iloc[0]=100   # by position
ser4['b']=200      # by label
print(ser4)

a    100
b    200
c     30
d     40
dtype: int64

# Modify several values at once
ser4[0:2] = 1
print(ser4)

a     1
b     1
c    30
d    40
dtype: int64

ser4['a':'c']=2
print(ser4)

a     2
b     2
c     2
d    40
dtype: int64

# Add a value
ser4['e']=500
print(ser4)

a      2
b      2
c      2
d     40
e    500
dtype: int64

ser4.iloc[4]  # positional access to the newly added element

new_ser = pd.Series([600,100])
s = pd.concat([ser4, new_ser])  # Series.append was removed in pandas 2.0
print(s)

a      2
b      2
c      2
d     40
e    500
0    600
1    100
dtype: int64

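s[0]  # label lookup: the index now contains the integer label 0

new_ser = pd.Series([600,100],index=('f','g'))
s = pd.concat([ser4, new_ser])  # Series.append was removed in pandas 2.0
print(s)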

a      2
b      2
c      2
d     40
e    500
f    600
g    100
dtype: int64

s.iloc[5]  # every label is now a string, so use .iloc for positional access

DataFrame

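A DataFrame consists of a set of columns that share a common index
A DataFrame is a tabular data type; each column may hold a different value type
A DataFrame is usually used for two-dimensional data, but can also express multidimensional data
A DataFrame has both a row index (index) and a column index (columns)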

                columns    $\rightarrow$    axis=1
index           index_0    $\rightarrow$    data_a    data_1    …    data_w
$\downarrow$    index_1    $\rightarrow$    data_b    data_2    …    data_x
axis=0          index_2    $\rightarrow$    data_c    data_3    …    data_y
                index_3    $\rightarrow$    data_d    data_4    …    data_z
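
The two axes can be illustrated with a reduction along each one. A minimal sketch; the small frame and its labels are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3),
                  index=['r0', 'r1'], columns=['A', 'B', 'C'])
#     A  B  C
# r0  0  1  2
# r1  3  4  5

col_sums = df.sum(axis=0)  # walk down the rows: one result per column
row_sums = df.sum(axis=1)  # walk across the columns: one result per row

print(col_sums)
print(row_sums)
```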

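# Create a DataFrame
import numpy as np
import pandas as pd
# Create from a two-dimensional ndarray
# No row or column labels given, so both are generated automatically
df1 = pd.DataFrame(np.random.randint(10,size=(2,5)))
print(df1)
# Specify the row and column labels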
df2 = pd.DataFrame(np.random.randint(10,size=(2,5)),index=list('ab'),columns=list('ABCDE'))
print(df2)

   0  1  2  3  4
0  6  3  2  8  2
1  5  8  5  2  4
   A  B  C  D  E
a  9  0  1  6  2
b  1  6  3  2  4

A one-dimensional list becomes a single column.

# Create from a list
df3 = pd.DataFrame([0,1,2,3,4])
print(df3)

df4 = pd.DataFrame([[1,2,3],[2,3,4]])
print(df4)

   0  1  2
0  1  2  3
1  2  3  4

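Creating from a dict

The dict keys become the column index; the row index is either the index carried by the dict values or an automatic index
The index and columns can also be specified explicitly at creation time

# Create from a dict: the keys become the column index, and the values must all have the same length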
dt = {'name':['张三','李四','王五'],'age':[20,25,35]}
df5 = pd.DataFrame(dt)
print(df5)

  name  age
0   张三   20
1   李四   25
2   王五   35

# Create from a dict of Series: the values may differ in length; missing values are filled with NaN
dt = {'name':pd.Series(['张三','李四','王五','赵六']),'age':pd.Series([20,25,35])}
df6 = pd.DataFrame(dt)
print(df6)

  name   age
0   张三  20.0
1   李四  25.0
2   王五  35.0
3   赵六   NaN

df7 = pd.DataFrame(dt,index=[2,3],columns=['name','score'])
print(df7)

  name score
2   王五   NaN
3   赵六   NaN

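Basic DataFrame attributes

index: the row index
columns: the column index
values: the elements
dtypes: the element type of each column
size: the number of elements
ndim: the number of dimensions
shape: the shape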

import numpy as np
import pandas as pd 
df1 = pd.DataFrame(np.random.randint(10,size=(2,5)))
print(df1)

   0  1  2  3  4
0  2  3  7  3  0
1  4  0  4  3  5

df1.index

RangeIndex(start=0, stop=2, step=1)

df1.columns

RangeIndex(start=0, stop=5, step=1)

df1.values

array([[2, 3, 7, 3, 0],
       [4, 0, 4, 3, 5]])

df1.dtypes

0    int32
1    int32
2    int32
3    int32
4    int32
dtype: object

df1.ndim

df1.shape

(2, 5)

df1.size
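
The three shape-related attributes are tied together: size is the product of the entries of shape, and ndim is the length of shape. A small sketch using a fixed 2 x 5 frame of zeros instead of random values (an assumption made for reproducibility).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((2, 5), dtype='int64'))

print(df.ndim)   # number of dimensions
print(df.shape)  # (rows, columns)
print(df.size)   # rows * columns
```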

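Querying and accessing data in a DataFrame

Access a single column
Access several columns
- Every column of a DataFrame is a Series object
Access several rows of a single column
- Works like the corresponding Series operations
Access a single value
Access several rows of several columns
Access several rows
- df[:][:5]
- head(n=5): the first n rows, 5 by default
- tail(n=5): the last n rows, 5 by default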

import numpy as np
import pandas as pd 
df = pd.DataFrame({'goods':['cokecola','eggplant','condom','apple','banana','milk'],'quantity':[12,3,1,5,8,10],'price':[20,12,50,5,4.5,49.9]})
print(df)

      goods  quantity  price
0  cokecola        12   20.0
1  eggplant         3   12.0
2    condom         1   50.0
3     apple         5    5.0
4    banana         8    4.5
5      milk        10   49.9

df['goods']

0    cokecola
1    eggplant
2      condom
3       apple
4      banana
5        milk
Name: goods, dtype: object

df[['goods','price']]

	goods	price
0	cokecola	20.0
1	eggplant	12.0
2	condom	50.0
3	apple	5.0
4	banana	4.5
5	milk	49.9

df['goods'][:2]

0    cokecola
1    eggplant
Name: goods, dtype: object

df[['goods','price']][:2]

	goods	price
0	cokecola	20.0
1	eggplant	12.0

df[:][:2]

	goods	quantity	price
0	cokecola	12	20.0
1	eggplant	3	12.0

df[:2][:1]

	goods	quantity	price
0	cokecola	12	20.0

df.head()

      goods  quantity  price
0  cokecola        12   20.0
1  eggplant         3   12.0
2    condom         1   50.0
3     apple         5    5.0
4    banana         8    4.5

df.tail()

      goods  quantity  price
1  eggplant         3   12.0
2    condom         1   50.0
3     apple         5    5.0
4    banana         8    4.5
5      milk        10   49.9

df.tail(2)

	goods	quantity	price
4	banana	8	4.5
5	milk	10	49.9

dt = [['张三','李四','王五','赵六'],[20,25,35]]
df6 = pd.DataFrame(dt)
print(df6)

    0   1   2     3
0  张三  李四  王五    赵六
1  20  25  35  None

df6[0][1]

df6[:][:1]

	0	1	2	3
0	张三	李四	王五	赵六

Note

df6[1][:] and df6[:][1] are chained selections, not a two-dimensional slice; df6.loc[:, 1] is the actual slice.

df6[1][:]

0    李四
1    25
Name: 1, dtype: object

df6[:][1]

0    李四
1    25
Name: 1, dtype: object
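
Chained brackets first pick a column (or a row range) and then index the intermediate result, whereas loc selects rows and columns in one step. A minimal sketch on a small frame similar to the one above; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([['张三', '李四', '王五'], [20, 25, 35]])

chained = df[1][:]      # select column 1, then slice the resulting Series
sliced = df.loc[:, 1]   # one two-dimensional selection: all rows, column 1
first_row = df[:][:1]   # a bare slice on a DataFrame selects rows, not columns

print(chained.equals(sliced))
print(first_row.shape)
```

Both forms read the same values here; loc is the safer habit when the selection is later used for assignment.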

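Filtering data by condition
- and (&), or (|), not (~), xor (^)
Slicing with loc and iloc
- DataFrame.loc[row index: label or condition, column index: label]
- When the row index is a slice, both endpoints are included
- DataFrame.iloc[row index: position, column index: position]
- When the row index is a slice, the start is included and the end is excluded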
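
The endpoint rule can be checked by applying the same slice 0:3 through both indexers. A short sketch, assuming a default integer index so that labels and positions coincide; the data is a reduced version of the goods table.

```python
import pandas as pd

df = pd.DataFrame({'goods': ['cokecola', 'eggplant', 'apple', 'banana', 'milk'],
                   'price': [20, 12, 5, 4.5, 49.9]})

closed = df.loc[0:3, 'goods']   # labels 0 through 3, end included: 4 rows
half_open = df.iloc[0:3, 0]     # positions 0 through 2, end excluded: 3 rows

print(len(closed), len(half_open))
```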

import numpy as np
import pandas as pd 
df = pd.DataFrame({'goods':['cokecola','eggplant','condom','apple','banana','milk'],'quantity':[12,3,1,5,8,10],'price':[20,12,50,5,4.5,49.9]})
print(df)

      goods  quantity  price
0  cokecola        12   20.0
1  eggplant         3   12.0
2    condom         1   50.0
3     apple         5    5.0
4    banana         8    4.5
5      milk        10   49.9

df[df.goods == 'milk']

	goods	quantity	price
5	milk	10	49.9

df[(df.goods == 'milk') & (df.price == 49.9)]

	goods	quantity	price
5	milk	10	49.9

df[(df.goods == 'milk') & (df.price == 50)]

	goods	quantity	price

df[df.price > 15]

	goods	quantity	price
0	cokecola	12	20.0
2	condom	1	50.0
5	milk	10	49.9
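
The remaining operators (|, ~, ^) combine boolean masks the same way & does. A sketch with two illustrative masks; the thresholds are arbitrary assumptions. Each comparison needs its own parentheses when written inline, because these operators bind more tightly than ==, < and >.

```python
import pandas as pd

df = pd.DataFrame({'goods': ['cokecola', 'eggplant', 'apple', 'banana', 'milk'],
                   'quantity': [12, 3, 5, 8, 10],
                   'price': [20, 12, 5, 4.5, 49.9]})

cheap = df.price < 10     # apple, banana
bulk = df.quantity >= 8   # cokecola, banana, milk

either = df[cheap | bulk]      # or: at least one condition holds
neither = df[~(cheap | bulk)]  # not: no condition holds
only_one = df[cheap ^ bulk]    # xor: exactly one condition holds

print(either.goods.tolist())
print(neither.goods.tolist())
print(only_one.goods.tolist())
```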

df.loc[:,'goods']

0    cokecola
1    eggplant
2      condom
3       apple
4      banana
5        milk
Name: goods, dtype: object

df.loc[0:3,['goods','price']]

	goods	price
0	cokecola	20.0
1	eggplant	12.0
2	condom	50.0
3	apple	5.0

df['goods'][1]

'eggplant'

df.loc[df.goods=='milk',:]

	goods	quantity	price
5	milk	10	49.9

df[df.goods=='milk']

	goods	quantity	price
5	milk	10	49.9

df[:][1:2]

	goods	quantity	price
1	eggplant	3	12.0

# Using iloc
df.iloc[1:3][0:6]

	goods	quantity	price
1	eggplant	3	12.0
2	condom	1	50.0

df.iloc[:,0]

0    cokecola
1    eggplant
2      condom
3       apple
4      banana
5        milk
Name: goods, dtype: object

df.iloc[0:3,0:2]

	goods	quantity
0	cokecola	12
1	eggplant	3
2	condom	1

df.iloc[0:2,[0,2]]

	goods	price
0	cokecola	20.0
1	eggplant	12.0

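Adding, deleting, and modifying DataFrame data

Add a column
- Create a new column label and assign it a constant or per-row values; the column is appended at the end
Insert a column
- df.insert(loc=position, column=column name, value=values to insert)
- The inserted value can be a constant, a list, or a Series
Add a row
- loc[row_index] appends a row at the end, or modifies the source data in place when the label already exists
- append adds a row at the end and returns a copy, leaving the source unchanged (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)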

import numpy as np
import pandas as pd 
df = pd.DataFrame({'goods':['cokecola','eggplant','condom','apple','banana','milk'],'quantity':[12,3,1,5,8,10],'price':[20,12,50,5,4.5,49.9]})
print(df)

      goods  quantity  price
0  cokecola        12   20.0
1  eggplant         3   12.0
2    condom         1   50.0
3     apple         5    5.0
4    banana         8    4.5
5      milk        10   49.9

# Add a new column
df['totle']=df['quantity']*df['price']
print(df)

      goods  quantity  price  totle
0  cokecola        12   20.0  240.0
1  eggplant         3   12.0   36.0
2    condom         1   50.0   50.0
3     apple         5    5.0   25.0
4    banana         8    4.5   36.0
5      milk        10   49.9  499.0

df['IsQualified']= True
print(df)

      goods  quantity  price  totle  IsQualified
0  cokecola        12   20.0  240.0         True
1  eggplant         3   12.0   36.0         True
2    condom         1   50.0   50.0         True
3     apple         5    5.0   25.0         True
4    banana         8    4.5   36.0         True
5      milk        10   49.9  499.0         True

# Insert a column at a given position
df.insert(2,'allQuantity',[100,50,30,100,100,200])
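
The three kinds of value that insert accepts (a list, a constant, a Series) can be sketched on a two-row frame. The column names 'shop' and 'unit' are made up for illustration; insert modifies the frame in place and returns None.

```python
import pandas as pd

df = pd.DataFrame({'goods': ['apple', 'banana'], 'price': [5.0, 4.5]})

df.insert(1, 'quantity', [5, 8])                             # list: one value per row
df.insert(0, 'shop', 'A')                                    # constant: repeated for every row
df.insert(len(df.columns), 'unit', pd.Series(['kg', 'kg']))  # Series: aligned on the row index

print(list(df.columns))
```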

# Insert a row
df.loc[6]=['shampoo',13,100,50,650,True]
print(df)

      goods  quantity  allQuantity  price  totle  IsQualified
0  cokecola        12          100   20.0  240.0         True
1  eggplant         3           50   12.0   36.0         True
2    condom         1           30   50.0   50.0         True
3     apple         5          100    5.0   25.0         True
4    banana         8          100    4.5   36.0         True
5      milk        10          200   49.9  499.0         True
6   shampoo        13          100   50.0  650.0         True

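# Append a row; DataFrame.append was removed in pandas 2.0, so pd.concat is used here
new_row = pd.DataFrame([{'goods':'orange','quantity':5,'allQuantity':100,'price':2.5,'totle':5*2.5,'IsQualified':True}])
df = pd.concat([df, new_row], ignore_index=True)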
print(df)

      goods  quantity  allQuantity  price  totle  IsQualified
0  cokecola        12          100   20.0  240.0         True
1  eggplant         3           50   12.0   36.0         True
2    condom         1           30   50.0   50.0         True
3     apple         5          100    5.0   25.0         True
4    banana         8          100    4.5   36.0         True
5      milk        10          200   49.9  499.0         True
6   shampoo        13          100   50.0  650.0         True
7    orange         5          100    2.5   12.5         True

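Deleting data from a DataFrame

Delete columns
- df.drop(labels=column name, axis=1, inplace=True)
- axis=1 selects columns (the default is axis=0); inplace=True modifies the source data (the default, inplace=False, returns a copy)
Delete rows
- df.drop(labels=row label, axis=0, inplace=True)
- axis=0 selects rows; inplace=True modifies the source data

Modifying data in a DataFrame
- Select the data with loc and assign to it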
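
The drop and loc notes above can be sketched as follows. The columns= and index= keywords are alternatives to labels plus axis; the sample data is a reduced goods table assumed for illustration.

```python
import pandas as pd

df = pd.DataFrame({'goods': ['apple', 'banana', 'milk'],
                   'quantity': [5, 8, 10],
                   'price': [5.0, 4.5, 49.9]})

# Without inplace=True, drop returns a new DataFrame and leaves df untouched
no_price = df.drop(columns='price')   # equivalent to df.drop('price', axis=1)
no_first = df.drop(index=[0])         # equivalent to df.drop([0], axis=0)

# Modify in place: select with loc, then assign
df.loc[df.price > 15, 'price'] = 15

print(no_price.columns.tolist(), no_first.index.tolist(), df.price.tolist())
```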

# Drop one column
df.drop(labels='IsQualified',axis=1,inplace=True)
print(df)

      goods  quantity  allQuantity  price  totle
0  cokecola        12          100   20.0  240.0
1  eggplant         3           50   12.0   36.0
2    condom         1           30   50.0   50.0
3     apple         5          100    5.0   25.0
4    banana         8          100    4.5   36.0
5      milk        10          200   49.9  499.0
6   shampoo        13          100   50.0  650.0
7    orange         5          100    2.5   12.5

df.drop(['allQuantity','totle'],axis=1,inplace=True)
print(df)

      goods  quantity  price
0  cokecola        12   20.0
1  eggplant         3   12.0
2    condom         1   50.0
3     apple         5    5.0
4    banana         8    4.5
5      milk        10   49.9
6   shampoo        13   50.0
7    orange         5    2.5

# Drop a row
df.drop(7,axis=0,inplace=True)
print(df)

      goods  quantity  price
0  cokecola        12   20.0
1  eggplant         3   12.0
2    condom         1   50.0
3     apple         5    5.0
4    banana         8    4.5
5      milk        10   49.9
6   shampoo        13   50.0

# Drop several rows
df.drop([5,6],axis=0,inplace=True)
print(df)

      goods  quantity  price
0  cokecola        12   20.0
1  eggplant         3   12.0
2    condom         1   50.0
3     apple         5    5.0
4    banana         8    4.5

# Modify a row
df.loc[4]=['orange',8,4.5]
print(df)

      goods  quantity  price
0  cokecola        12   20.0
1  eggplant         3   12.0
2    condom         1   50.0
3     apple         5    5.0
4    orange         8    4.5

# Modify a column
df.loc[:,'quantity']=10
print(df)

      goods  quantity  price
0  cokecola        10   20.0
1  eggplant        10   12.0
2    condom        10   50.0
3     apple        10    5.0
4    orange        10    4.5

# Modify selected values
df.loc[df.price>15,'price']=15
print(df)

      goods  quantity  price
0  cokecola        10   15.0
1  eggplant        10   12.0
2    condom        10   15.0
3     apple        10    5.0
4    orange        10    4.5

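Summary statistics in Pandas

Statistics for numeric data: NumPy-based
- pandas is built on NumPy, so NumPy functions can be used to compute descriptive statistics on the data
- For example: np.mean(df['colName'])

Function            Description
sum                 Sum of all elements or of the elements along one axis; axis=1 sums across each row
mean, median        Mean and median
std, var, ptp, cov  Standard deviation, variance, range (max - min), covariance
min, max            Minimum and maximum
argmin, argmax      Index of the minimum and of the maximum
cumsum              Cumulative sum
cumprod             Cumulative product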
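Statistics for numeric data: pandas descriptive statistics
- pandas offers more convenient statistical methods, for example detail['column name'].mean().
- The describe method returns, in one call, the non-null count, mean, quartiles, and standard deviation of every numeric feature in a DataFrame (along with the minimum and maximum).

Method    Description                   Method  Description
min       Minimum                       max     Maximum
mean      Mean                          ptp     Range (removed from pandas in 1.0; use max - min or np.ptp)
median    Median                        std     Standard deviation
var       Variance                      cov     Covariance
sem       Standard error of the mean    mode    Mode
skew      Sample skewness               kurt    Sample kurtosis
quantile  Quantile                      count   Non-null count
describe  Descriptive statistics        mad     Mean absolute deviation (removed in pandas 2.0)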
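
A short sketch of several of these methods on one column. Because ptp and mad are not available as methods in current pandas releases, the two lines marked below compute the same quantities by hand; that substitution is my assumption about what a reader on pandas 2.x would want.

```python
import pandas as pd

s = pd.Series([12, 3, 1, 5, 8, 10], name='quantity')

print(s.median())        # middle of the sorted values
print(s.quantile(0.25))  # first quartile, linear interpolation by default
print(s.count())         # number of non-null values

value_range = s.max() - s.min()             # stands in for ptp
mean_abs_dev = (s - s.mean()).abs().mean()  # stands in for mad

print(value_range, mean_abs_dev)
```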

df = pd.DataFrame({'goods':['cokecola','eggplant','condom','apple','banana','milk'],'quantity':[12,3,1,5,8,10],'price':[20,12,50,5,4.5,49.9]})
print(df)

      goods  quantity  price
0  cokecola        12   20.0
1  eggplant         3   12.0
2    condom         1   50.0
3     apple         5    5.0
4    banana         8    4.5
5      milk        10   49.9

np.sum(df['quantity'])

df['quantity'].sum()

df[['quantity','price']].describe()

	quantity	price
count	6.000000	6.000000
mean	6.500000	23.566667
std	4.230839	21.198742
min	1.000000	4.500000
25%	3.500000	6.750000
50%	6.500000	16.000000
75%	9.500000	42.425000
max	12.000000	50.000000

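Statistics for categorical data

- A frequency table describes the distribution of a categorical feature
  - In pandas, frequencies are computed with value_counts
- describe also supports descriptive statistics on category data
  - It returns four statistics: the non-null count, the number of categories, the most frequent category, and the count of that category
  - pandas provides the Categorical class; astype can convert a feature to the category dtype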

import pandas as pd 
df = pd.DataFrame({'id':range(4),'name':['Jack','Craig','Chuck','Jack'],'gender':['M','M','M','F'],'age':['13','34','24','4'],'weight':['45','60','55','30']})   
print(df)

   id   name gender age weight
0   0   Jack      M  13     45
1   1  Craig      M  34     60
2   2  Chuck      M  24     55
3   3   Jack      F   4     30

# Frequency counts
df['name'].value_counts()

Jack     2
Craig    1
Chuck    1
Name: name, dtype: int64

# Convert the column to the category dtype
df["name"] = df['name'].astype('category')
print(df['name'].describe(),"\n",df['name'].dtype)

count        4
unique       3
top       Jack
freq         2
Name: name, dtype: object 
 category
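
Once a column has the category dtype, the .cat accessor exposes the distinct categories and the integer code stored for each element. A minimal sketch on the same four names; the variable name is illustrative.

```python
import pandas as pd

names = pd.Series(['Jack', 'Craig', 'Chuck', 'Jack']).astype('category')

print(list(names.cat.categories))       # distinct categories, sorted
print(names.cat.codes.tolist())         # position of each element's category in that list
print(names.value_counts().to_dict())   # frequency of each category
```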
