智能风控：原理、算法与工程实践 (Intelligent Risk Control: Principles, Algorithms, and Engineering Practice)

2.2 Feature Derivation Schemes

Two feature derivation schemes are in common use in the industry:

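❏ Automated feature crossing by algorithm. The resulting features are not interpretable, but this approach mines the features deeply and thoroughly, and it can easily expand a base of a few hundred dimensions to any number of dimensions. For example, XGBoost can be used to discretize features, the FM algorithm can be used to cross features, or a neural network can be used for representation learning, with its internal parameters then extracted and used as model inputs. In short, any procedure that raises the feature dimensionality and then merges the result with the original features for modeling can be regarded as feature derivation.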

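❏ Comparison across time. Calculation logic that spans time dimensions compares a feature over time and derives specific fields with business meaning. This approach is more interpretable, and it was one of the derivation methods that banks and credit card centers habitually used in earlier years.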
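As a minimal sketch of the first scheme, the snippet below raises the feature dimensionality by taking pairwise products and merging them with the original features. It is a simplified stand-in for illustration only, not the FM algorithm itself; the helper name `cross_features` and the sample feature names are hypothetical.

```python
from itertools import combinations

def cross_features(row):
    """Return the original features plus all pairwise products.

    `row` maps feature name -> numeric value (hypothetical helper).
    """
    derived = dict(row)  # keep the original features
    for a, b in combinations(sorted(row), 2):
        derived[a + '_x_' + b] = row[a] * row[b]
    return derived

# Hypothetical sample: 3 original features become 3 + 3 = 6 features.
sample = {'age': 30, 'income': 5.0, 'util': 0.4}
crossed = cross_features(sample)
print(len(sample), '->', len(crossed))  # 3 -> 6
```

With n original features this adds n(n-1)/2 crossed features, which is why the dimensionality grows so quickly.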

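Take a simple example. Suppose we compute each user's credit limit utilization rate and denote this feature ft. Expanding it along the time axis in monthly slices gives the utilization rate within 30 days before the application, ft1; within 30 to 60 days before the application, ft2; within 60 to 90 days before the application, ft3; and so on, up to 330 to 360 days before the application, ft12. Each user therefore has 12 features, as shown in Figure 2-1.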

Figure 2-1 Preview of the base features
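One way such monthly slices might be built is sketched below. It assumes, purely for illustration, that each record is a (days before application, balance, credit limit) tuple and that the utilization rate of a window is the mean of balance / limit over the records in that window; the source specifies neither. The names `monthly_utilization` and `records` are hypothetical.

```python
def monthly_utilization(records, n_months=12, prefix='ft'):
    """Bucket records into 30-day windows before the application date.

    records: iterable of (days_before_application, balance, credit_limit).
    Returns {prefix+'1': ..., ..., prefix+str(n_months): ...}; a window
    with no records gets None (missing).
    """
    buckets = {k: [] for k in range(1, n_months + 1)}
    for days_before, balance, limit in records:
        k = days_before // 30 + 1  # days 0-29 -> month 1, 30-59 -> month 2, ...
        if 1 <= k <= n_months and limit > 0:
            buckets[k].append(balance / limit)
    return {
        prefix + str(k): (sum(v) / len(v) if v else None)
        for k, v in buckets.items()
    }

# Hypothetical records for one user.
records = [(5, 2000, 10000), (20, 4000, 10000), (45, 5000, 10000)]
features = monthly_utilization(records)
print(features['ft1'], features['ft2'], features['ft3'])
```

Windows without any records are left as missing rather than set to 0, so that downstream NaN-aware aggregations can skip them.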

Next, we derive some features by hand from this time series, based on experience.

1) Count the months, among the most recent mth months, in which feature is greater than 0.

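        import numpy as np

        # Assumes `data` is a pandas DataFrame whose columns feature+'1', ...,
        # feature+'12' are stored contiguously, in that order.
        def Num(feature, mth):
            df = data.loc[:, feature+'1': feature+str(mth)]
            auto_value = np.where(df > 0, 1, 0).sum(axis=1)
            return feature + '_num' + str(mth), auto_value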

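Why use mth and feature in place of a concrete number of months and a concrete feature name? In industry, high-dimensional features are usually processed in batches, so every function should be flexible enough to let the feature and the number of months be specified freely. For the function Num, passing a different feature computes on a different feature, and passing a different mth aggregates over a different number of months. Iterating over every feature and every value of mth therefore derives deeper-level features.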
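The batch iteration described above can be sketched as follows. The toy `data` frame, the second feature name `gt`, and the chosen mth values are assumptions made for illustration; Num is repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Toy data: two hypothetical features, 'ft' and 'gt', each with 3 monthly columns.
data = pd.DataFrame({
    'ft1': [0.0, 0.5], 'ft2': [0.2, 0.0], 'ft3': [0.3, 0.0],
    'gt1': [1.0, 0.0], 'gt2': [0.0, 0.0], 'gt3': [2.0, 0.0],
})

def Num(feature, mth):
    df = data.loc[:, feature + '1': feature + str(mth)]
    auto_value = np.where(df > 0, 1, 0).sum(axis=1)
    return feature + '_num' + str(mth), auto_value

# Apply every (feature, mth) combination and collect the derived columns.
derived = {}
for feature in ['ft', 'gt']:
    for mth in [2, 3]:
        name, value = Num(feature, mth)
        derived[name] = value

# Merge the derived features with the original ones.
result = pd.concat([data, pd.DataFrame(derived, index=data.index)], axis=1)
print(result.columns.tolist())
```

Because each function returns a (name, values) pair, the same loop can be extended to a list of derivation functions without changing its structure.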

2) Compute the mean of feature over the most recent mth months.

        # Mean over the most recent mth months; NaN (missing) months are ignored.
        def Avg(feature, mth):
            df = data.loc[:, feature+'1': feature+str(mth)]
            auto_value = np.nanmean(df, axis=1)
            return feature + '_avg' + str(mth), auto_value

3) Within the most recent mth months, compute the number of months from the most recent month with feature > 0 to the present.

        def Msg(feature, mth):
            df = data.loc[:, feature+'1': feature+str(mth)]
            df_value = np.where(df > 0, 1, 0)
            auto_value = []
            for i in range(len(df_value)):
                row_value = df_value[i, :]
                if row_value.max() <= 0:
                    # No month has feature > 0: record the integer 0.
                    indexs = 0
                    auto_value.append(indexs)
                else:
                    # Count from month 1 (most recent) to the first month with feature > 0.
                    indexs = 1
                    for j in row_value:
                        if j > 0:
                            break
                        indexs += 1
                    auto_value.append(indexs)
            return feature + '_msg' + str(mth), auto_value
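The row-by-row loop in Msg can also be expressed with array operations. The following is a minimal sketch of an equivalent vectorized calculation, not the book's implementation; the function name `months_since_last_positive` is hypothetical, and the input is assumed to be a 2-D array whose first column is the most recent month.

```python
import numpy as np

def months_since_last_positive(values):
    """values: 2-D array, column 0 = most recent month.

    Returns, per row, the 1-based position of the first column > 0,
    or 0 when no column is > 0.
    """
    positive = np.asarray(values) > 0      # NaN compares as False
    first = positive.argmax(axis=1) + 1    # first True per row, 1-based
    return np.where(positive.any(axis=1), first, 0)

arr = np.array([[0.0, 0.3, 0.1],
                [0.0, 0.0, 0.0],
                [0.7, 0.0, 0.0]])
print(months_since_last_positive(arr))  # [2 0 1]
```

`argmax` on a boolean array returns the index of the first True, and returns 0 for an all-False row, which is why the `any` check is needed to separate "never positive" from "positive in month 1".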

4) Compute the ratio of the current month's feature to the mean of feature over the most recent mth months.

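        # Current month (feature+'1') divided by the mean of the last mth months.
        # A zero mean yields inf or NaN.
        def Cav(feature, mth):
            df = data.loc[:, feature+'1': feature+str(mth)]
            auto_value = df[feature+'1'] / np.nanmean(df, axis=1)
            return feature + '_cav' + str(mth), auto_value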

5) Within the most recent mth months, compute the maximum increase in feature between any two adjacent months.

        def Mai(feature, mth):
            arr = np.array(data.loc[:, feature+'1': feature+str(mth)])
            auto_value = []
            for i in range(len(arr)):
                df_value = arr[i, :]
                value_lst = []
                for k in range(len(df_value)-1):
                    # Month k is more recent than month k+1, so this is the increase.
                    minus = df_value[k] - df_value[k+1]
                    value_lst.append(minus)
                auto_value.append(np.nanmax(value_lst))
            return feature + '_mai' + str(mth), auto_value
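The nested loops in Mai compute adjacent-month differences, which can be written more compactly with array slicing. This is a sketch of an equivalent calculation under the same column convention (first column = most recent month), not the book's implementation; the function name `max_monthly_increase` is hypothetical. Like Mai, it needs at least two months of data.

```python
import numpy as np

def max_monthly_increase(values):
    """values: 2-D array, column 0 = most recent month.

    For each row, returns the largest value of (month k) - (month k+1),
    i.e. the largest increase from an earlier month to the next one.
    """
    arr = np.asarray(values, dtype=float)
    increase = arr[:, :-1] - arr[:, 1:]   # newer month minus older month
    return np.nanmax(increase, axis=1)    # NaN differences are ignored

arr = np.array([[0.5, 0.2, 0.4],
                [0.1, 0.1, 0.6]])
print(max_monthly_increase(arr))
```

In the second row the feature never increases, so the result is the largest non-positive difference (here 0.0) rather than a positive increase.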

6) Compute the range (maximum minus minimum) of feature across all of the most recent mth months.

        # Range over the most recent mth months; NaN (missing) months are ignored.
        def Ran(feature, mth):
            df = data.loc[:, feature+'1': feature+str(mth)]
            auto_value = np.nanmax(df, axis=1) - np.nanmin(df, axis=1)
            return feature + '_ran' + str(mth), auto_value