Python Pandas Data Analysis - Day 5 - index.repeat, apply, map & applymap

nixiaole 2024-11-13 14:08:27 Knowledge Analysis

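Preface

As today's micro-headline post "Wristwatch" puts it: "No matter what, you still need a dream, or your whole life will drift by in a daze." It is really the old saying again: "People need dreams; otherwise, how are they any different from a salted fish?"

I am very fond of the quote below, and I am sharing it with you who love life: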

“You have to believe in yourself, challenge yourself, and push yourself until the very end; that’s the only way you’ll succeed.”

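Today is another day to KeepPush myself. Following @外星人玩Python, I will study the methods below and how to use them to manipulate data:

the DataFrame.index.repeat method;
the three methods apply, map and applymap.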

First, here is how the official documentation defines apply, map and applymap:

pandas.DataFrame.apply

DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)

Apply a function along an axis of the DataFrame.

Objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1).

By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.

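DataFrame.apply operates on the rows or columns of a DataFrame through a function that you pass in. The object handed to that function is a Series. Key points:

apply takes a function as its argument;
that function receives a Series, i.e. one row or one column of the DataFrame;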
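
The two points above can be checked with a small sketch. The DataFrame here is made-up illustration data, not the article's data set, and each lambda simply reports what apply hands to it.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Whichever axis is used, the function receives a Series
arg_type = df.apply(lambda s: type(s).__name__)

# axis=0 (default): each Series is a column, labelled by the DataFrame's index
labels_axis0 = df.apply(lambda s: list(s.index) == list(df.index))

# axis=1: each Series is a row, labelled by the DataFrame's columns
labels_axis1 = df.apply(lambda s: list(s.index) == list(df.columns), axis=1)

print(arg_type.tolist())
print(labels_axis0.tolist())
print(labels_axis1.tolist())
```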

pandas.Series.map

Series.map(arg, na_action=None)
Map values of Series according to input correspondence.

Used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.

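Series.map transforms each value in a Series according to the mapping you pass in. Key points:

map is the element-level way to process the data in a Series;
map accepts a function, a dict, or another Series;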
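
As a minimal sketch of the three kinds of argument listed above, the snippet below maps one made-up Series with a function, a dict, and another Series. The animal names and lookup values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
s = pd.Series(["cat", "dog", "bird"])

# Function: called once per element
by_func = s.map(len)

# Dict: values are looked up by key; keys missing from the dict become NaN
by_dict = s.map({"cat": "meow", "dog": "woof"})

# Series: values are looked up through the other Series' index
lookup = pd.Series({"cat": 4, "dog": 4, "bird": 2})
by_series = s.map(lookup)

print(by_func.tolist())
print(by_dict.tolist())
print(by_series.tolist())
```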

pandas.DataFrame.applymap

DataFrame.applymap(func, na_action=None, **kwargs)
Apply a function to a Dataframe elementwise.

This method applies a function that accepts and returns a scalar to every element of a DataFrame.

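applymap transforms every scalar element of a DataFrame through a function. Key points:

applymap takes a function as its argument;
that function receives one Python scalar value and returns the converted value;
Python scalar types here include numbers, strings, booleans, and date/time values;
since pandas 2.1, applymap is deprecated in favor of DataFrame.map, which has the same elementwise behavior;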
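
A minimal sketch of elementwise processing, on made-up data. One assumption to flag: pandas 2.1 renamed this operation to DataFrame.map and deprecated applymap, so the snippet picks whichever method the installed version provides.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({"A": [1, 2], "B": [10, 20]})

# DataFrame.map exists from pandas 2.1 on; older versions only have applymap
elementwise = df.map if hasattr(df, "map") else df.applymap

# The function sees one scalar at a time and returns one scalar
parity = elementwise(lambda v: "even" if v % 2 == 0 else "odd")

print(parity)
```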

Code Implementation

Case Study 1: Generating duplicated rows with DataFrame.index.repeat

# -*- coding: utf-8 -*-
"""
Created on Wed Dec 29 10:35:59 2021

@author: TXDMXCG

@外星人学Python - Chapter 06 - generate duplicated rows using index.repeat
"""

# Step1: Import the Libraries

import os
os.chdir("D:/21_SuccessFactor/Python Training/pandas专栏数据与源码/文章源码/文章源码/src/06/src")

import pandas as pd

# Step2: Get your data ready

def get_df():
    return pd.read_csv("data/data.csv", encoding=("utf8"), header = 0)

df = get_df()
df
# =============================================================================
# Out[8]: 
#           公司名  销售员  销售额  times
# 0    快讯信息有限公司   陈彬   86      2
# 1  合联电子传媒有限公司   陈萍   52      1
# 2    天益信息有限公司   王梅   77      5
# 3   易动力传媒有限公司   张林   80      4
# 4  鑫博腾飞传媒有限公司   王霞   86      3
# 5    明腾网络有限公司  潘桂英   50      1
# 6  黄石金承信息有限公司  张桂兰   63      2
# 7  兰金电子传媒有限公司   许华   79      3
# 8    趋势信息有限公司   李凯   81      1
# 9    超艺科技有限公司  华建平   67      4
# =============================================================================
#%%
# Step3: Start exploring your data

# Case Study 1: repeat each row as many times as its times value
# Take the first row using label indexing plus a slice
df.loc[0, :]
# =============================================================================
# Out[2]: 
# 公司名      快讯信息有限公司
# 销售员            陈彬
# 销售额            86
# times           2
# Name: 0, dtype: object
# =============================================================================

# Take the first 5 rows (a .loc label slice includes the end label 4)
df.loc[:4 , :]
# =============================================================================
# Out[3]: 
#           公司名 销售员  销售额  times
# 0    快讯信息有限公司  陈彬   86      2
# 1  合联电子传媒有限公司  陈萍   52      1
# 2    天益信息有限公司  王梅   77      5
# 3   易动力传媒有限公司  张林   80      4
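# 4  鑫博腾飞传媒有限公司  王霞   86      3
# =============================================================================

# Recap: taking the first 5 rows with IndexSlice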
# =============================================================================
# from pandas import IndexSlice as idx
# rows = idx[:4]
# df.loc[rows, :]
# =============================================================================

# Rows can be repeated by repeating their labels: the first row has times == 2, so it must appear twice, and the row labels are passed as a list
df.loc[[0,0], :]
# =============================================================================
# Out[52]: 
#         公司名 销售员  销售额  times
# 0  快讯信息有限公司  陈彬   86      2
# 0  快讯信息有限公司  陈彬   86      2
# =============================================================================

# Likewise, the 5th row (index label 4) has times == 3, which gives the result below
df.loc[[4,4,4], :]
# =============================================================================
# Out[53]: 
#           公司名 销售员  销售额  times
# 4  鑫博腾飞传媒有限公司  王霞   86      3
# 4  鑫博腾飞传媒有限公司  王霞   86      3
# 4  鑫博腾飞传媒有限公司  王霞   86      3
# =============================================================================

# Here is the idea: each row must be repeated according to its times value, and df['times'] gives the repeat count for every row
df['times']
# =============================================================================
# Out[54]: 
# 0    2
# 1    1
# 2    5
# 3    4
# 4    3
# 5    1
# 6    2
# 7    3
# 8    1
# 9    4
# Name: times, dtype: int64
# =============================================================================

# Use the repeat method of the index to build the repeated labels; first look at the default index values:
df.index.values
# Out[55]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64)

# repeat accepts a scalar and repeats every index label that many times
df.index.repeat(2)
# Out[56]: Int64Index([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9], dtype='int64')

# Passing the repeated index to df.loc gives the result below:
df.loc[ df.index.repeat(2) , :]

# Every row is repeated twice because of repeat(2)
# =============================================================================
# Out[82]: 
#           公司名  销售员  销售额  times
# 0    快讯信息有限公司   陈彬   86      2
# 0    快讯信息有限公司   陈彬   86      2
# 1  合联电子传媒有限公司   陈萍   52      1
# 1  合联电子传媒有限公司   陈萍   52      1
# 2    天益信息有限公司   王梅   77      5
# 2    天益信息有限公司   王梅   77      5
# 3   易动力传媒有限公司   张林   80      4
# 3   易动力传媒有限公司   张林   80      4
# 4  鑫博腾飞传媒有限公司   王霞   86      3
# 4  鑫博腾飞传媒有限公司   王霞   86      3
# 5    明腾网络有限公司  潘桂英   50      1
# 5    明腾网络有限公司  潘桂英   50      1
# 6  黄石金承信息有限公司  张桂兰   63      2
# 6  黄石金承信息有限公司  张桂兰   63      2
# 7  兰金电子传媒有限公司   许华   79      3
# 7  兰金电子传媒有限公司   许华   79      3
# 8    趋势信息有限公司   李凯   81      1
# 8    趋势信息有限公司   李凯   81      1
# 9    超艺科技有限公司  华建平   67      4
# 9    超艺科技有限公司  华建平   67      4
# =============================================================================

# repeat also accepts a list, provided its length equals the length of the Index
# Compute the length of the index with len
len(df.index)
# Out[81]: 10

# Build a custom list of repeat counts
times = range(1,11)
list(times)
# Out[57]: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Use repeat to build the repeated index that corresponds to the times list
df.index.repeat(times)
# =============================================================================
# Out[58]: 
# Int64Index([0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6,
#             6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8,
#             8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9],
#            dtype='int64')
# =============================================================================

# The actual approach: take the repeat counts from the DataFrame column 'times'
idx = df.index.repeat(df['times'])
idx
# =============================================================================
# Out[60]: 
# Int64Index([0, 0, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 6, 6, 7, 7, 7, 8,
#             9, 9, 9, 9],
#            dtype='int64')
# =============================================================================

# Get the actual repeated rows
df.loc[idx, :]
# =============================================================================
# Out[61]: 
#           公司名  销售员  销售额  times
# 0    快讯信息有限公司   陈彬   86      2
# 0    快讯信息有限公司   陈彬   86      2
# 1  合联电子传媒有限公司   陈萍   52      1
# 2    天益信息有限公司   王梅   77      5
# 2    天益信息有限公司   王梅   77      5
# 2    天益信息有限公司   王梅   77      5
# 2    天益信息有限公司   王梅   77      5
# 2    天益信息有限公司   王梅   77      5
# 3   易动力传媒有限公司   张林   80      4
# 3   易动力传媒有限公司   张林   80      4
# 3   易动力传媒有限公司   张林   80      4
# 3   易动力传媒有限公司   张林   80      4
# 4  鑫博腾飞传媒有限公司   王霞   86      3
# 4  鑫博腾飞传媒有限公司   王霞   86      3
# 4  鑫博腾飞传媒有限公司   王霞   86      3
# 5    明腾网络有限公司  潘桂英   50      1
# 6  黄石金承信息有限公司  张桂兰   63      2
# 6  黄石金承信息有限公司  张桂兰   63      2
# 7  兰金电子传媒有限公司   许华   79      3
# 7  兰金电子传媒有限公司   许华   79      3
# 7  兰金电子传媒有限公司   许华   79      3
# 8    趋势信息有限公司   李凯   81      1
# 9    超艺科技有限公司  华建平   67      4
# 9    超艺科技有限公司  华建平   67      4
# 9    超艺科技有限公司  华建平   67      4
# 9    超艺科技有限公司  华建平   67      4
# =============================================================================
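
The walkthrough above can be condensed into a self-contained sketch. The data is made up (it is not data/data.csv). The position-based variant and the reset_index call are additions of mine, not part of the original walkthrough: the label-based form relies on the index being unique, and the repeated labels are often not wanted afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "times" holds the repeat count for each row
df = pd.DataFrame({"name": ["a", "b", "c"], "times": [2, 0, 3]})

# Label-based version, as in the walkthrough (relies on a unique index)
by_label = df.loc[df.index.repeat(df["times"])]

# Position-based version, which does not depend on the index labels
positions = np.repeat(np.arange(len(df)), df["times"].to_numpy())
by_position = df.iloc[positions]

# reset_index(drop=True) replaces the repeated labels with 0..n-1
expanded = by_label.reset_index(drop=True)
print(expanded)
```

A repeat count of 0 drops the row entirely, which is why "b" does not appear in the result.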

Case Study 2: Returning a Series with apply

# -*- coding: utf-8 -*-
"""
Created on Wed Dec 29 15:22:22 2021

@author: TXDMXCG

@外星人学Python - Chapter 07 - apply, map & applymap
"""
import os
os.chdir("D:/21_SuccessFactor/Python Training/pandas专栏数据与源码/文章源码/文章源码/src/07/src")

import pandas as pd

# Step2: Get your data ready
def get_df():
    return pd.read_csv("data/1.csv", encoding=("utf8"), header = 0)

df = get_df()
df
# =============================================================================
# Out[86]: 
#     A    B    C
# 0   1   10   20
# 1   2   20   40
# 2   3   30   60
# 3   4   40   80
# 4   5   50  100
# 5   6   60  120
# 6   7   70  140
# 7   8   80  160
# 8   9   90  180
# 9  10  100  200
# =============================================================================
#%%
# Sum of each row, method 1: use an anonymous function
# Note: summing each row requires axis = 1, i.e. the horizontal direction
df.apply(lambda x: x.sum(), axis = 1)
# =============================================================================
# Out[98]: 
# 0     31
# 1     62
# 2     93
# 3    124
# 4    155
# 5    186
# 6    217
# 7    248
# 8    279
# 9    310
# dtype: int64
# =============================================================================
# Assign the result to column 'D'
df['D'] = df.apply(lambda x: x.sum(), axis = 1)
df
# =============================================================================
# Out[100]: 
#     A    B    C    D
# 0   1   10   20   31
# 1   2   20   40   62
# 2   3   30   60   93
# 3   4   40   80  124
# 4   5   50  100  155
# 5   6   60  120  186
# 6   7   70  140  217
# 7   8   80  160  248
# 8   9   90  180  279
# 9  10  100  200  310
# =============================================================================
# Method 2: pass the string 'sum' to apply. Note that df already contains the new column 'D', so 'D' is included in this sum
df['sum'] = df.apply('sum', axis = 1)
df
# =============================================================================
# Out[122]: 
#     A    B    C    D  sum
# 0   1   10   20   31   62
# 1   2   20   40   62  124
# 2   3   30   60   93  186
# 3   4   40   80  124  248
# 4   5   50  100  155  310
# 5   6   60  120  186  372
# 6   7   70  140  217  434
# 7   8   80  160  248  496
# 8   9   90  180  279  558
# 9  10  100  200  310  620
# =============================================================================
# Add a new row holding the sum of each column: axis = 0, the vertical direction
# A new row also needs a row index label
total = df.apply('sum', axis = 0)
df.loc['total'] = total
df
# =============================================================================
# Out[136]: 
#         A    B     C     D   sum
# 0       1   10    20    31    62
# 1       2   20    40    62   124
# 2       3   30    60    93   186
# 3       4   40    80   124   248
# 4       5   50   100   155   310
# 5       6   60   120   186   372
# 6       7   70   140   217   434
# 7       8   80   160   248   496
# 8       9   90   180   279   558
# 9      10  100   200   310   620
# total  55  550  1100  1705  3410
# =============================================================================
# Alternatively, append the row under the next integer label, 10
df = df.drop('total')
df.loc[len(df)] = total
df
# =============================================================================
# Out[141]: 
#      A    B     C     D   sum
# 0    1   10    20    31    62
# 1    2   20    40    62   124
# 2    3   30    60    93   186
# 3    4   40    80   124   248
# 4    5   50   100   155   310
# 5    6   60   120   186   372
# 6    7   70   140   217   434
# 7    8   80   160   248   496
# 8    9   90   180   279   558
# 9   10  100   200   310   620
# 10  55  550  1100  1705  3410
# =============================================================================
# When apply is called on a Series, no axis argument is needed
# Every row or column of a DataFrame is a Series, e.g. type(df['C']) is pandas.core.series.Series
df['C']
# =============================================================================
# Out[147]: 
# 0       20
# 1       40
# 2       60
# 3       80
# 4      100
# 5      120
# 6      140
# 7      160
# 8      180
# 9      200
# 10    1100
# Name: C, dtype: int64
# =============================================================================
# Equivalent to df.loc[:, 'C']

# Compute column 'D' as column 'C' * 2 with Series.apply (same result as df['C'] * 2)
df['D'] = df['C'].apply(lambda x: x * 2)
df
# =============================================================================
# Out[150]: 
#      A    B     C     D   sum
# 0    1   10    20    40    62
# 1    2   20    40    80   124
# 2    3   30    60   120   186
# 3    4   40    80   160   248
# 4    5   50   100   200   310
# 5    6   60   120   240   372
# 6    7   70   140   280   434
# 7    8   80   160   320   496
# 8    9   90   180   360   558
# 9   10  100   200   400   620
# 10  55  550  1100  2200  3410
# =============================================================================
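
In the session above, the 'sum' column ends up double-counting because column 'D' already existed when the second row sum was taken. A minimal sketch of one way to avoid that, on made-up data: restrict the calculation to an explicit list of value columns (the name value_cols is hypothetical). It also shows that the built-in DataFrame.sum gives the same row totals as apply with a lambda.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30], "C": [20, 40, 60]})
value_cols = ["A", "B", "C"]

# Row sums two ways: apply with a lambda, and the built-in sum
via_apply = df[value_cols].apply(lambda row: row.sum(), axis=1)
via_sum = df[value_cols].sum(axis=1)

df["D"] = via_sum
# Restricting to value_cols keeps the derived column D out of this total
df["sum"] = df[value_cols].sum(axis=1)

# Column totals as a new row labelled 'total'
df.loc["total"] = df.sum(axis=0)
print(df)
```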

Case Study 3: Returning a DataFrame with apply

# Case Study 3: returning a DataFrame with apply
# Define the function

def cal_mul_col(df_a):
    return df_a['A'] * 2, df_a['B'] * 3

# Reload the DataFrame
df = get_df()

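# Apply the custom function to get a new DataFrame: apply passes each row as a Series, cal_mul_col returns a tuple, and result_type = 'expand' turns the tuple into columns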
res = df.apply(cal_mul_col, axis = 1, result_type = 'expand')
type(res)
# Out[173]: pandas.core.frame.DataFrame

# Rename the column labels
res.columns = ['a_times_2', 'b_times_3']

# Add the new columns to df
df[res.columns] = res
df
# =============================================================================
# Out[174]: 
#     A    B    C  a_times_2  b_times_3
# 0   1   10   20          2         30
# 1   2   20   40          4         60
# 2   3   30   60          6         90
# 3   4   40   80          8        120
# 4   5   50  100         10        150
# 5   6   60  120         12        180
# 6   7   70  140         14        210
# 7   8   80  160         16        240
# 8   9   90  180         18        270
# 9  10  100  200         20        300
# =============================================================================
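
A variation on the case above, sketched on made-up data: when the function returns a labelled Series instead of a tuple, apply builds the DataFrame with those labels as column names, so the separate renaming step is not needed. The function name scaled is hypothetical.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

def scaled(row):
    # Returning a labelled Series lets apply name the new columns directly
    return pd.Series({"a_times_2": row["A"] * 2, "b_times_3": row["B"] * 3})

res = df.apply(scaled, axis=1)

# For comparison: a tuple with result_type="expand" yields integer column labels 0 and 1
expanded = df.apply(lambda row: (row["A"] * 2, row["B"] * 3),
                    axis=1, result_type="expand")

print(res)
print(expanded.columns.tolist())
```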

Case Study 4: Processing every element of a DataFrame with applymap

# Case Study 4: processing every element of a DataFrame with applymap
# Reload the DataFrame
df = get_df()
df.applymap(lambda x: x * 2)
# =============================================================================
# Out[181]: 
#     A    B    C
# 0   2   20   40
# 1   4   40   80
# 2   6   60  120
# 3   8   80  160
# 4  10  100  200
# 5  12  120  240
# 6  14  140  280
# 7  16  160  320
# 8  18  180  360
# 9  20  200  400
# =============================================================================

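# apply without an axis argument gives the same result here, but only because x * 2 also works on a whole column:
# apply (axis = 0 by default) passes each column as a Series, while applymap passes one scalar at a time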
df.apply(lambda x: x * 2)
# =============================================================================
# Out[183]: 
#     A    B    C
# 0   2   20   40
# 1   4   40   80
# 2   6   60  120
# 3   8   80  160
# 4  10  100  200
# 5  12  120  240
# 6  14  140  280
# 7  16  160  320
# 8  18  180  360
# 9  20  200  400
# =============================================================================
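
The two outputs above match only because multiplying by 2 behaves the same on a Series as on a scalar. The sketch below, on made-up data, shows what each method actually passes to the function and one computation (centering on the column mean) that needs the whole column. As an assumption about the installed version, it uses DataFrame.map where available (pandas 2.1+) and applymap otherwise.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})
elementwise = df.map if hasattr(df, "map") else df.applymap

# Same numbers, different mechanics
doubled_by_column = df.apply(lambda col: col * 2)   # col is a Series
doubled_by_element = elementwise(lambda v: v * 2)   # v is a scalar

# What each method hands to the function
apply_arg = df.apply(lambda col: type(col).__name__)
elem_is_series = elementwise(lambda v: isinstance(v, pd.Series))

# Column-aware logic only makes sense with apply
centered = df.apply(lambda col: col - col.mean())

print(apply_arg.tolist())
print(centered)
```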

Case Study 5: Processing Series data with map

# Case Study 5: processing Series data with map
# Load the data set
def get_df_1():
    return pd.read_csv("data/2.csv", encoding=("utf8"), header = 0)

df = get_df_1()
df
# =============================================================================
# Out[191]: 
#      货单号   数量    单价        折扣
# 0  S1533   10  1272  0.395971
# 1  N8906   20  3743  0.479961
# 2  S1243   30  4305  0.595302
# 3  G8578   40  3389  0.479461
# 4  P5546   50  1610  0.506235
# 5  U3625   60  1938  0.830859
# 6  B9530   70  1997  0.411044
# 7  P6370   80  3555  0.681629
# 8  C5095   90  4021  0.397100
# 9  X6917  100  2698  0.812741
# =============================================================================

# Define the dict
mapping = {'S1533':'Food', 'B9530':'Water'}

# Series.map accepts a dict: build a new column '产品类型' (product type) by converting the values of '货单号' (order number)
df['产品类型'] = df['货单号'].map(mapping)
df
# =============================================================================
# Out[195]: 
#      货单号   数量    单价        折扣   产品类型
# 0  S1533   10  1272  0.395971   Food
# 1  N8906   20  3743  0.479961    NaN
# 2  S1243   30  4305  0.595302    NaN
# 3  G8578   40  3389  0.479461    NaN
# 4  P5546   50  1610  0.506235    NaN
# 5  U3625   60  1938  0.830859    NaN
# 6  B9530   70  1997  0.411044  Water
# 7  P6370   80  3555  0.681629    NaN
# 8  C5095   90  4021  0.397100    NaN
# 9  X6917  100  2698  0.812741    NaN
# =============================================================================

# To avoid the NaN values, use Series.apply instead
# The lambda calls dict.get, whose second argument is returned when the key is missing
df['产品类型'] = df['货单号'].apply(lambda x: mapping.get(x, 'Others'))
df
# =============================================================================
# Out[203]: 
#      货单号   数量    单价        折扣    产品类型
# 0  S1533   10  1272  0.395971    Food
# 1  N8906   20  3743  0.479961  Others
# 2  S1243   30  4305  0.595302  Others
# 3  G8578   40  3389  0.479461  Others
# 4  P5546   50  1610  0.506235  Others
# 5  U3625   60  1938  0.830859  Others
# 6  B9530   70  1997  0.411044   Water
# 7  P6370   80  3555  0.681629  Others
# 8  C5095   90  4021  0.397100  Others
# 9  X6917  100  2698  0.812741  Others
# =============================================================================
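
Two ways of supplying a default for unmapped keys, as a minimal sketch on made-up data (three order numbers borrowed from the table above). Filling after the mapping with fillna is an alternative I am adding for comparison; it is not used in the original walkthrough.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
orders = pd.Series(["S1533", "N8906", "B9530"])
mapping = {"S1533": "Food", "B9530": "Water"}

# Plain dict mapping: keys missing from the dict become NaN
with_nan = orders.map(mapping)

# Option 1: fill the gaps after mapping
filled = orders.map(mapping).fillna("Others")

# Option 2: supply the default at lookup time with dict.get
via_get = orders.map(lambda key: mapping.get(key, "Others"))

print(filled.tolist())
print(via_get.tolist())
```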

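Bonus Study: Building a custom calculated column by passing an extra argument to Series.apply

# Pass an extra argument through Series.apply to build a custom calculated column
# Add a column '是否高销量' (high-volume flag): quantities above the threshold are marked 'high', the rest 'normal';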

def cal_high_num(x, std_num):
    return 'high' if x > std_num else 'normal'

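# Extra keyword arguments to apply are forwarded to the function, so the keyword name must match the function's second parameter (std_num)
df['是否高销量'] = df['数量'].apply(cal_high_num, std_num = 50)
df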
# =============================================================================
# Out[210]: 
#      货单号   数量    单价        折扣    产品类型   是否高销量
# 0  S1533   10  1272  0.395971    Food  normal
# 1  N8906   20  3743  0.479961  Others  normal
# 2  S1243   30  4305  0.595302  Others  normal
# 3  G8578   40  3389  0.479461  Others  normal
# 4  P5546   50  1610  0.506235  Others  normal
# 5  U3625   60  1938  0.830859  Others    high
# 6  B9530   70  1997  0.411044   Water    high
# 7  P6370   80  3555  0.681629  Others    high
# 8  C5095   90  4021  0.397100  Others    high
# 9  X6917  100  2698  0.812741  Others    high
# =============================================================================

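# The same kind of column can be built with a lambda that reads the threshold from an outer variable
# (this variant uses a threshold of 60 and the label 'mid', so its output differs from the one above)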
std = 60
df['是否高销量'] = df['数量'].apply(lambda x: 'high' if x > std else 'mid')
df
# =============================================================================
# Out[220]: 
#      货单号   数量    单价        折扣    产品类型 是否高销量
# 0  S1533   10  1272  0.395971    Food   mid
# 1  N8906   20  3743  0.479961  Others   mid
# 2  S1243   30  4305  0.595302  Others   mid
# 3  G8578   40  3389  0.479461  Others   mid
# 4  P5546   50  1610  0.506235  Others   mid
# 5  U3625   60  1938  0.830859  Others   mid
# 6  B9530   70  1997  0.411044   Water  high
# 7  P6370   80  3555  0.681629  Others  high
# 8  C5095   90  4021  0.397100  Others  high
# 9  X6917  100  2698  0.812741  Others  high
# =============================================================================
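
A minimal sketch of the ways to forward the extra argument, on made-up quantities. The function name label and its parameter names are hypothetical. The np.where line is an alternative of mine for comparison, not something the original uses; it produces the same labels without calling a Python function per element.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
qty = pd.Series([10, 50, 60, 100])

def label(x, threshold):
    return "high" if x > threshold else "normal"

# Extra keyword arguments are forwarded to the function by name
by_kwarg = qty.apply(label, threshold=50)

# Positional extras go through the args tuple
by_args = qty.apply(label, args=(50,))

# Vectorized alternative for a simple two-way condition
vectorized = pd.Series(np.where(qty > 50, "high", "normal"), index=qty.index)

print(by_kwarg.tolist())
```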

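Summary

DataFrame

apply operates on rows or columns, selected by axis (the default, axis = 0, passes each column). Its result matches applymap only when the function behaves the same on a whole Series as on each element;
apply can return a Series or a DataFrame;
applymap operates on every element of the DataFrame.

Series

apply operates on every element, and extra arguments can be forwarded to the function to build custom calculated columns;
map transforms elements and accepts a dict, a function, or another Series;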

Below is today's mind map.
