
Standardization

As a last bit of transformation, we need to standardize our input data. This allows us to compare models to see if one model is better than another. To do so, I wrote two different scaling algorithms:

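// scale centers column j on its median and divides by the interquartile range.
func scale(a [][]float64, j int) {
  // l and h are the 0.25 and 0.75 quantiles of column j; m is the median.
  l, m, h := iqr(a, 0.25, 0.75, j)
  s := h - l
  if s == 0 {
    s = 1 // avoid dividing by zero when the column has no spread
  }

  for _, row := range a {
    row[j] = (row[j] - m) / s
  }
}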

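// scaleStd centers column j on its mean and divides by the sample
// standard deviation. It needs the standard library "math" package.
func scaleStd(a [][]float64, j int) {
  var mean, variance, n float64
  for _, row := range a {
    mean += row[j]
    n++
  }
  mean /= n
  for _, row := range a {
    variance += (row[j] - mean) * (row[j] - mean)
  }
  variance /= (n - 1) // sample variance, hence n-1 rather than n

  // Divide by the standard deviation, not the variance.
  std := math.Sqrt(variance)
  if std == 0 {
    std = 1 // avoid dividing by zero when the column is constant
  }

  for _, row := range a {
    row[j] = (row[j] - mean) / std
  }
}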

If you come from the Python world of data science, the first scale function is essentially what scikit-learn's RobustScaler does. The second function is essentially StandardScaler, but with the variance adapted to work for sample data (it divides by n-1 rather than n).
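To make the difference between the two approaches concrete, here is a minimal, self-contained sketch. The names robustScale, standardScale, and quantile are stand-ins invented for this illustration, and the linearly interpolated quantile is an assumption; the iqr helper used in this chapter may compute its quartiles differently.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// quantile returns the q-th quantile (0 <= q <= 1) of an already sorted
// slice, interpolating linearly between neighbouring values.
// Hypothetical helper: not the chapter's iqr function.
func quantile(sorted []float64, q float64) float64 {
	if len(sorted) == 0 {
		return math.NaN()
	}
	pos := q * float64(len(sorted)-1)
	lo := int(math.Floor(pos))
	hi := int(math.Ceil(pos))
	frac := pos - float64(lo)
	return sorted[lo]*(1-frac) + sorted[hi]*frac
}

// robustScale centers column j on its median and divides by the
// interquartile range, in place.
func robustScale(a [][]float64, j int) {
	col := make([]float64, 0, len(a))
	for _, row := range a {
		col = append(col, row[j])
	}
	sort.Float64s(col)
	q1, med, q3 := quantile(col, 0.25), quantile(col, 0.5), quantile(col, 0.75)
	spread := q3 - q1
	if spread == 0 {
		spread = 1 // constant column: leave the centered values unscaled
	}
	for _, row := range a {
		row[j] = (row[j] - med) / spread
	}
}

// standardScale centers column j on its mean and divides by the sample
// standard deviation (n-1 in the denominator), in place.
func standardScale(a [][]float64, j int) {
	n := float64(len(a))
	if n < 2 {
		return // the sample standard deviation is undefined
	}
	var mean float64
	for _, row := range a {
		mean += row[j]
	}
	mean /= n
	var ss float64
	for _, row := range a {
		d := row[j] - mean
		ss += d * d
	}
	sd := math.Sqrt(ss / (n - 1))
	if sd == 0 {
		sd = 1
	}
	for _, row := range a {
		row[j] = (row[j] - mean) / sd
	}
}

func main() {
	// The same column, with one large outlier, scaled both ways.
	robust := [][]float64{{1}, {2}, {3}, {4}, {100}}
	standard := [][]float64{{1}, {2}, {3}, {4}, {100}}
	robustScale(robust, 0)
	standardScale(standard, 0)
	fmt.Println("robust:  ", robust)
	fmt.Println("standard:", standard)
}
```

With the robust version, the four ordinary values land between -1 and 0.5 and only the outlier stands apart (at 48.5), because the median and the quartiles are barely affected by it. The mean and standard deviation used by the standard version are both pulled toward the outlier.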

Each function takes the values in a given column (j) and rescales them in place, so that columns measured in very different units end up on a comparable scale. Also, note that the input to both scaling functions is [][]float64. This is where the tensor package comes in handy. A *tensor.Dense can be converted to [][]float64 without any extra allocations. A beneficial side effect is that you can mutate a and the tensor values will change as well. Essentially, [][]float64 acts as an iterator over the underlying tensor data.
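The aliasing behavior described above can be illustrated in plain Go, without the tensor package. This is only a sketch of the idea, not the tensor package's implementation: rowsOf is a hypothetical helper that builds row views over a flat backing slice, allocating just the outer slice of row headers while copying no element data.

```go
package main

import "fmt"

// rowsOf builds a [][]float64 whose rows are sub-slices of backing.
// Hypothetical helper: the element data is shared, never copied.
func rowsOf(backing []float64, rows, cols int) [][]float64 {
	out := make([][]float64, rows)
	for i := range out {
		out[i] = backing[i*cols : (i+1)*cols]
	}
	return out
}

func main() {
	// A 3x2 matrix stored row by row in one flat slice.
	backing := []float64{1, 2, 3, 4, 5, 6}
	m := rowsOf(backing, 3, 2)

	// Mutating column 1 through the row views...
	for _, row := range m {
		row[1] *= 10
	}

	// ...changes the flat backing data as well.
	fmt.Println(backing) // prints [1 20 3 40 5 60]
}
```

Because each row is a slice header pointing into the same memory, a function such as scale that writes to row[j] updates the original data directly.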

Our transform function now looks like this:

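// transform log-transforms the skewed numerical columns, then scales every
// numerical column. It returns the indices of the log-transformed columns.
func transform(it [][]float64, hdr []string, hints []bool) []int {
  var transformed []int
  for i, isCat := range hints {
    if isCat {
      continue // categorical columns are left alone
    }
    skewness := skew(it, i)
    if skewness > 0.75 {
      transformed = append(transformed, i)
      log1pCol(it, i)
    }
  }
  // Scale only the numerical (non-categorical) columns.
  for i, isCat := range hints {
    if !isCat {
      scale(it, i)
    }
  }
  return transformed
}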

Note that we only want to scale the numerical variables. The categorical variables could be scaled as well, but doing so makes little practical difference.
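The skewness check in the first loop of transform can be sketched on its own. The versions of skew and log1pCol below are assumptions written for this illustration: skew here is the moment-based coefficient m3 / m2^(3/2), which may not match the definition used by the chapter's own helper.

```go
package main

import (
	"fmt"
	"math"
)

// skew returns the moment-based skewness of column j.
// Assumption: g1 = m3 / m2^(3/2); the chapter's helper may differ.
func skew(a [][]float64, j int) float64 {
	n := float64(len(a))
	if n == 0 {
		return 0
	}
	var mean float64
	for _, row := range a {
		mean += row[j]
	}
	mean /= n
	var m2, m3 float64
	for _, row := range a {
		d := row[j] - mean
		m2 += d * d
		m3 += d * d * d
	}
	m2 /= n
	m3 /= n
	if m2 == 0 {
		return 0 // a constant column has no skew
	}
	return m3 / math.Pow(m2, 1.5)
}

// log1pCol replaces every value x in column j with log(1+x), in place.
func log1pCol(a [][]float64, j int) {
	for _, row := range a {
		row[j] = math.Log1p(row[j])
	}
}

func main() {
	// A right-skewed column: most values are small, one is large.
	data := [][]float64{{1}, {2}, {3}, {4}, {100}}
	before := skew(data, 0)
	if before > 0.75 { // same threshold as in transform
		log1pCol(data, 0)
	}
	fmt.Printf("skew before: %.3f, after: %.3f\n", before, skew(data, 0))
}
```

The log transform compresses the long right tail, so the skewness of the column drops. With a single extreme outlier, as in this toy column, it may still stay above the threshold.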