How to transform numerical variables?

Example code for the log, reciprocal, power, Box-Cox, Yeo-Johnson, and arcsin transformers of feature-engine. This post also answers the following questions:

  • How to read hourly Canadian weather data directly from the government climate data center?
  • How to transform a positive variable with LogTransformer?
  • How to transform any variable with LogCpTransformer?
  • How to transform a variable x to 1/x with ReciprocalTransformer?


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine import transformation as vt

# Load dataset
# Create a range of month-start dates (both endpoints are included)
download_dates = pd.date_range(start='2019-01-01', end='2020-01-01', freq='MS')

# URL taken from the Chrome DevTools console;
# the two {} placeholders are filled with the year and the month
base_url = ("https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&"
            "stationID=51442&Year={}&Month={}&Day=7&timeframe=1&submit=Download+Data")

# Build one remote URL per month from the base URL
list_of_url = [base_url.format(date.year, date.month) for date in download_dates]

# Download the monthly files and combine them into one DataFrame;
# ignore_index=True avoids repeated row labels across the monthly files
df = pd.concat((pd.read_csv(url) for url in list_of_url), ignore_index=True)

# Keep only the columns used in the examples below
keep_columns = ['Station Name', 'Longitude (x)', 'Latitude (y)',
                'Date/Time (LST)', 'Temp (°C)', 'Dew Point Temp (°C)',
                'Rel Hum (%)', 'Wind Spd (km/h)', 'Stn Press (kPa)']
data = df[keep_columns]
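
Before downloading anything, it can help to look at the URLs that the template produces. The sketch below rebuilds the list offline with the same station ID and date range as above; the names `dates`, `template`, and `urls` are illustrative and not part of the original code. Because pd.date_range includes both endpoints, the range covers 13 month-start dates (January 2019 through January 2020).

```python
import pandas as pd

# Same month-start range as in the loading code; both endpoints are included
dates = pd.date_range(start="2019-01-01", end="2020-01-01", freq="MS")

# Illustrative copy of the URL template (variable names are assumptions)
template = ("https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&"
            "stationID=51442&Year={}&Month={}&Day=7&timeframe=1&submit=Download+Data")
urls = [template.format(d.year, d.month) for d in dates]

print(len(urls))   # number of monthly files that would be requested
print(urls[0])     # URL for the first month in the range
```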

Apply the natural logarithm to two of the variables in the dataset. LogTransformer only works with strictly positive variables.


# set up the variable transformer
tf = vt.LogTransformer(variables=['Stn Press (kPa)', 'Rel Hum (%)'])

# fit the transformer, then transform the data
tf.fit(data)
data_t = tf.transform(data)
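
As a sanity check on what the log transformation does, the sketch below applies np.log to a small made-up DataFrame and inverts it with np.exp. It assumes the transformer uses the natural logarithm (base e); the `toy` values are invented and are not real station data.

```python
import numpy as np
import pandas as pd

# Made-up pressure and humidity values (illustrative only)
toy = pd.DataFrame({"Stn Press (kPa)": [100.2, 101.3, 99.8],
                    "Rel Hum (%)": [45.0, 80.0, 100.0]})

# The logarithm is only defined for strictly positive values
all_positive = bool((toy > 0).all().all())

logged = np.log(toy)        # element-wise natural logarithm
restored = np.exp(logged)   # exp undoes the log
```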

Apply the transformation log(x + C) to two of the variables, where C is a positive constant. With C="auto", the transformer automatically determines the quantity C that needs to be added to each variable.


# set up the variable transformer
tf = vt.LogCpTransformer(variables=['Temp (°C)', 'Dew Point Temp (°C)'], C="auto")

# fit the transformer, then transform the data
tf.fit(data)
data_t = tf.transform(data)
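
To see why adding a constant makes the logarithm usable on temperatures, the sketch below shifts a made-up array by hand. It assumes that C="auto" picks C = abs(min(x)) + 1, which is how the feature-engine documentation describes it; treat that rule as an assumption and check it against your installed version.

```python
import numpy as np

# Made-up temperatures that include zero and negative values
temps = np.array([-10.0, -2.5, 0.0, 5.0, 21.0])

# Assumed rule for C="auto": shift so that the smallest value becomes 1
C = abs(temps.min()) + 1

shifted = temps + C        # every value is now >= 1
log_cp = np.log(shifted)   # log(x + C) is defined for the whole array
```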

Apply the reciprocal transformation 1 / x to numerical variables.


# set up the variable transformer
tf = vt.ReciprocalTransformer(variables=['Stn Press (kPa)', 'Latitude (y)'])

# fit the transformer, then transform the data
tf.fit(data)
data_t = tf.transform(data)
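
The reciprocal is undefined at zero, so it is worth checking a variable for zeros before transforming it. The helper below is a hypothetical NumPy sketch of that check, not feature-engine's implementation; `reciprocal` and the sample values are invented for illustration.

```python
import numpy as np

def reciprocal(values):
    """Return 1 / x element-wise, refusing input that contains zeros."""
    values = np.asarray(values, dtype=float)
    if np.any(values == 0):
        raise ValueError("reciprocal is undefined for zero values")
    return 1.0 / values

pressure = np.array([100.0, 101.3, 99.8])   # made-up values
recip = reciprocal(pressure)

# Applying the reciprocal twice returns the original values
back = reciprocal(recip)
```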

Apply power or exponential transformations to numerical variables.



# set up the variable transformer
tf = vt.PowerTransformer(variables=['Stn Press (kPa)', 'Rel Hum (%)', 'Wind Spd (km/h)'], exp=0.5)

# fit the transformer, then transform the data
tf.fit(data)
data_t = tf.transform(data)
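
With exp=0.5 the power transformation is simply a square root. The sketch below shows the same arithmetic in NumPy on made-up wind speeds, and how raising the result to 1 / exp recovers the original values.

```python
import numpy as np

wind = np.array([0.0, 4.0, 9.0, 25.0])   # made-up wind speeds in km/h
exp = 0.5

powered = np.power(wind, exp)            # x ** 0.5, i.e. the square root
restored = np.power(powered, 1 / exp)    # raising to 1 / exp undoes it
```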

Apply the Box-Cox transformation to numerical variables:

y = (x**lmbda - 1) / lmbda,  for lmbda != 0
y = log(x),                  for lmbda = 0

With fit(), the transformer learns the optimal lambda for each variable. Box-Cox applies only to positive variables.



# set up the variable transformer
tf = vt.BoxCoxTransformer(variables=['Stn Press (kPa)', 'Rel Hum (%)'])

# fit the transformer, then transform the data
tf.fit(data)
data_t = tf.transform(data)
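
The Box-Cox formula can be written out directly for a fixed lambda. The sketch below is a plain NumPy version for illustration; it takes lambda as given and does not reproduce how fit() estimates the optimal value. The function name `box_cox` and the sample values are invented.

```python
import numpy as np

def box_cox(x, lmbda):
    """Box-Cox transform of strictly positive values for a given lambda."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1) / lmbda

humidity = np.array([35.0, 60.0, 85.0, 100.0])   # made-up values

as_log = box_cox(humidity, 0)      # lambda = 0 is the log case
shifted = box_cox(humidity, 1)     # lambda = 1 only subtracts 1
```

As lambda approaches 0, the first branch converges to log(x), which is why the two cases fit together smoothly.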

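Apply the Yeo-Johnson transformation to numerical variables. With fit(), the transformer learns the optimal lambda for each variable. Unlike Box-Cox, Yeo-Johnson also accepts zero and negative values.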



# set up the variable transformer
tf = vt.YeoJohnsonTransformer(variables=['Temp (°C)', 'Dew Point Temp (°C)'])

# fit the transformer, then transform the data
tf.fit(data)
data_t = tf.transform(data)
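
For reference, the standard Yeo-Johnson definition uses one branch for non-negative values and another for negative values. The sketch below writes it out in NumPy for a fixed lambda; it is an illustration of the formula, not the library's code, and it does not estimate lambda. The function name `yeo_johnson` and the sample temperatures are invented.

```python
import numpy as np

def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform for a given lambda; accepts any real values."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0

    # Branch for x >= 0: Box-Cox applied to x + 1
    if lmbda != 0:
        out[pos] = ((x[pos] + 1) ** lmbda - 1) / lmbda
    else:
        out[pos] = np.log1p(x[pos])

    # Branch for x < 0: mirrored form with exponent 2 - lambda
    if lmbda != 2:
        out[~pos] = -((-x[~pos] + 1) ** (2 - lmbda) - 1) / (2 - lmbda)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out

temps = np.array([-12.0, -0.5, 0.0, 3.0, 24.0])   # made-up temperatures
unchanged = yeo_johnson(temps, 1)                  # lambda = 1 leaves values as they are
```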

Apply the arcsin transformation to numerical variables. It takes the form arcsin(sqrt(x)), where x is a real number between 0 and 1.

Step 1: use DatetimeFeatures to extract “month”, “day_of_month”, and “day_of_year”.

Step 2: divide them by 12, 31, and 366 to scale them to the 0-1 range.

Step 3: apply arcsin(sqrt(x)).


from feature_engine.datetime import DatetimeFeatures
dtfs = DatetimeFeatures(
    variables="Date/Time (LST)",
    features_to_extract=["month", "day_of_month", "day_of_year"]
)

data_t0 = dtfs.fit_transform(data)

# scale each extracted feature to the 0-1 range
data_t0['Date/Time (LST)_month'] = data_t0['Date/Time (LST)_month'] / 12
data_t0['Date/Time (LST)_day_of_month'] = data_t0['Date/Time (LST)_day_of_month'] / 31
data_t0['Date/Time (LST)_day_of_year'] = data_t0['Date/Time (LST)_day_of_year'] / 366

# set up the variable transformer
tf = vt.ArcsinTransformer(variables=[
    'Date/Time (LST)_month',
    'Date/Time (LST)_day_of_month',
    'Date/Time (LST)_day_of_year',
])

# fit the transformer, then transform the data
tf.fit(data_t0)
data_t = tf.transform(data_t0)
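
The scaling and transformation steps can be checked by hand on the month feature. The sketch below uses plain NumPy rather than feature-engine, with calendar months 1 to 12 as input.

```python
import numpy as np

months = np.arange(1, 13)   # calendar months 1..12
scaled = months / 12        # fractions in the range (0, 1]

# arcsin(sqrt(x)) maps the 0-1 range onto 0..pi/2
transformed = np.arcsin(np.sqrt(scaled))
```

December (12 / 12 = 1) maps to pi / 2, and March (3 / 12 = 0.25) maps to pi / 6.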