How to create and add new features to the dataframe with feature-engine?

Example code for creating and adding new features to a DataFrame with feature-engine. It also answers the following questions:

  • How to read weather data directly from a website?
  • How to set a DatetimeIndex as the index?
  • How to apply basic functions to groups of features, returning one or more additional variables?
  • How to apply basic mathematical operations between a group of variables and one or more reference features, adding the resulting features to the dataframe?
  • How to create two new features from a numerical variable that better capture its cyclical nature?

Math features

Code


import numpy as np
import pandas as pd
from feature_engine.creation import CyclicalFeatures, MathFeatures, RelativeFeatures

# create a range of monthly dates
download_dates = pd.date_range(start='2019-01-01', end='2020-01-01', freq='MS')

# URL from the Chrome DevTools console, with {} placeholders for year and month
base_url = ("https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&"
            "stationID=51442&Year={}&Month={}&Day=7&timeframe=1&submit=Download+Data")

# create the list of remote URLs from the base URL
list_of_url = [base_url.format(date.year, date.month) for date in download_dates]

# download and combine the monthly files into one DataFrame
df = pd.concat((pd.read_csv(url) for url in list_of_url))

# keep the timestamp, temperature and dew point columns under shorter names
keep_columns = ['Date/Time (LST)', 'Temp (°C)', 'Dew Point Temp (°C)']
data = df[keep_columns].rename(
    columns={'Date/Time (LST)': 'dt_var', 'Temp (°C)': 'T', 'Dew Point Temp (°C)': 'Td'})

# parse the timestamps, use them as a DatetimeIndex and drop the original column
datetime_index = pd.DatetimeIndex(pd.to_datetime(data['dt_var']).values)
data = data.set_index(datetime_index).drop(columns='dt_var')

# keep the first 100 rows for a small example
data = data.head(100)
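
The indexing step can be tried without a network connection. The following is a minimal sketch on a made-up three-row frame (the values and the names `raw` and `indexed` are illustrative assumptions, not part of the downloaded data); it only shows that the parsed timestamps end up as a DatetimeIndex.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded weather table (values are made up).
raw = pd.DataFrame({
    "dt_var": ["2019-01-01 00:00", "2019-01-01 01:00", "2019-01-01 02:00"],
    "T": [3.1, 2.8, 2.5],
    "Td": [1.0, 0.9, 0.7],
})

# Parse the text timestamps, use them as the index, then drop the text column.
indexed = raw.set_index(pd.DatetimeIndex(pd.to_datetime(raw["dt_var"]).values))
indexed = indexed.drop(columns="dt_var")

print(type(indexed.index).__name__)
# Rows can now be looked up by timestamp.
print(indexed.loc[pd.Timestamp("2019-01-01 01:00"), "T"])
```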

Apply basic functions to groups of features, returning one or more additional variables as a result.


transformer = MathFeatures(
    variables=["T", "Td"],
    func=["sum", "prod", "min", "max", "std"],
)
data_t = transformer.fit_transform(data)
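
To see what these row-wise functions produce, here is a plain pandas sketch on made-up values. It is not feature-engine's implementation; the column names (`sum_T_Td` and so on) are an assumption modelled on the library's documented default of the function name followed by the variable names, and `std` follows the pandas default (sample standard deviation).

```python
import pandas as pd

# Made-up temperature (T) and dew point (Td) readings.
toy = pd.DataFrame({"T": [1.0, -2.0], "Td": [-1.0, -4.0]})
cols = toy[["T", "Td"]]

# Each new column aggregates T and Td across every row (axis=1).
math_sketch = toy.assign(
    sum_T_Td=cols.sum(axis=1),
    prod_T_Td=cols.prod(axis=1),
    min_T_Td=cols.min(axis=1),
    max_T_Td=cols.max(axis=1),
    std_T_Td=cols.std(axis=1),  # pandas default: sample std (ddof=1)
)
print(math_sketch)
```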

Give meaningful names to the new variables by passing new_variables_names, with one name per function in the same order as func.


transformer = MathFeatures(
    variables=["T", "Td"],
    func=["sum", "min", "max"],
    new_variables_names=["sum(T,Td)", "min(T,Td)", "max(T,Td)"],
)
data_t = transformer.fit_transform(data)

Pass existing functions, such as NumPy functions, to the func argument instead of strings.


transformer = MathFeatures(
    variables=["T", "Td"],
    func=[np.sum, np.prod, np.min, np.max, np.std],
)
data_t = transformer.fit_transform(data)

Relative features

Code


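# combine each variable with the reference T using each listed operation;
# the division-type operations (div, truediv, floordiv, mod) are undefined where T equals 0
transformer = RelativeFeatures(
    variables=["T", "Td"],
    reference=["T"],
    func=["add", "sub", "mul", "div", "truediv", "floordiv", "mod", "pow"],
)
data_t = transformer.fit_transform(data)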
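
The operations between a variable and a reference can be sketched in plain pandas. This is an illustration on made-up values, not the library's code; the `{variable}_{function}_{reference}` column naming is an assumption, and the reference values are chosen to be non-zero so the division-type operations are defined.

```python
import pandas as pd

# Made-up readings; T is the reference and contains no zeros.
toy = pd.DataFrame({"T": [2.0, 4.0], "Td": [1.0, -2.0]})

variables = ["Td"]
reference = ["T"]
funcs = ["sub", "div", "floordiv", "mod", "pow"]

relative_sketch = toy.copy()
for ref in reference:
    for func in funcs:
        for var in variables:
            # e.g. toy["Td"].sub(toy["T"]) computes Td - T element-wise
            relative_sketch[f"{var}_{func}_{ref}"] = getattr(toy[var], func)(toy[ref])

print(relative_sketch)
```

Using a variable as its own reference, as with T above, yields constant columns for some operations (T minus T is always 0), so it is usually more informative to relate different variables.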

Cyclical features

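For each variable, the transformer adds a sine and a cosine feature:

  • var_sin = sin(variable * (2. * pi / max_value))
  • var_cos = cos(variable * (2. * pi / max_value))

Unless set through the max_values argument, max_value is the maximum of the variable learned during fit.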

Code


transformer = CyclicalFeatures(
    variables=["T", "Td"],
    drop_original=False,
)
data_t = transformer.fit_transform(data)
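
The two formulas can also be checked by hand with NumPy. The sketch below uses a made-up `hour` column (an assumption chosen because it is clearly periodic, unlike the temperature columns above) and takes max_value as the column maximum.

```python
import numpy as np
import pandas as pd

# Made-up periodic variable: 0 and 24 are the same point on the cycle.
hours = pd.DataFrame({"hour": [0, 6, 12, 18, 24]})

max_value = hours["hour"].max()
angle = hours["hour"] * (2.0 * np.pi / max_value)

# var_sin and var_cos as defined by the formulas above.
hours["hour_sin"] = np.sin(angle)
hours["hour_cos"] = np.cos(angle)

print(hours.round(3))
```

The first and last rows receive the same (sin, cos) pair up to floating-point error, which is how the encoding places the end of a cycle next to its start.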