Example code for creating and adding new features to a DataFrame using feature-engine. It also answers the following questions:
- How to read weather data directly from a website?
- How to set a DatetimeIndex as the index?
- How to apply basic functions to groups of features, returning one or more additional variables?
- How to apply basic mathematical operations between a group of variables and one or more reference features, adding the resulting features to the DataFrame?
- How to create two new features from a numerical variable that better capture its cyclical nature?
Math features
Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.creation import MathFeatures
from feature_engine.creation import RelativeFeatures
from feature_engine.creation import CyclicalFeatures
# create range of monthly dates
download_dates = pd.date_range(start='2019-01-01', end='2020-01-01', freq='MS')
# URL from Chrome DevTools Console
# the two {} placeholders are filled with the year and the month
base_url = ("https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&"
            "stationID=51442&Year={}&Month={}&Day=7&timeframe=1&submit=Download+Data")
# create list of remote URL from base URL
list_of_url = [base_url.format(date.year, date.month) for date in download_dates]
# download and combine multiple files into one DataFrame
df = pd.concat((pd.read_csv(url) for url in list_of_url))
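# keep only the timestamp, temperature and dew point columns
keepcolumns = ['Date/Time (LST)', 'Temp (°C)', 'Dew Point Temp (°C)']
data = df[keepcolumns]
# shorten the column names
data = data.rename(columns={'Date/Time (LST)': 'dt_var', 'Temp (°C)': 'T', 'Dew Point Temp (°C)': 'Td'})
# parse the timestamps and use them as a DatetimeIndex
datetime_series = pd.to_datetime(data['dt_var'])
datetime_index = pd.DatetimeIndex(datetime_series.values)
data1 = data.set_index(datetime_index)
data1.drop('dt_var', axis=1, inplace=True)
# work with the first 100 rows only
data = data1.head(100)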
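The index step above can be tried without a network connection. The following is a minimal sketch on a made-up three-row frame; the frame `raw` and its values are invented for illustration, and only the column names `dt_var`, `T` and `Td` come from the code above.

```python
import pandas as pd

# Synthetic stand-in for the downloaded weather data (values are made up)
raw = pd.DataFrame({
    "dt_var": ["2019-01-01 00:00", "2019-01-01 01:00", "2019-01-01 02:00"],
    "T": [1.5, 1.0, 0.5],
    "Td": [-0.5, -1.0, -1.5],
})

# Parse the text column, use it as the index, then drop the original column
indexed = raw.set_index(pd.to_datetime(raw["dt_var"])).drop(columns="dt_var")
indexed.index.name = None

# A DatetimeIndex allows label-based selection by timestamp or by date string
one_hour = indexed.loc[pd.Timestamp("2019-01-01 01:00"), "T"]
whole_day = indexed.loc["2019-01-01"]
```

With a DatetimeIndex in place, a date string such as "2019-01-01" selects every row of that day, which is convenient for hourly data like this.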
Apply basic functions to groups of features, returning one or more additional variables as a result.
transformer = MathFeatures(
    variables=["T", "Td"],
    func=["sum", "prod", "min", "max", "std"],
)
data_t = transformer.fit_transform(data)
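To make the row-wise nature of these functions concrete, here is a minimal pandas-only sketch on a made-up frame. The frame `toy` and the new column names are illustrative choices, not output copied from feature-engine; each function is applied across T and Td within one row.

```python
import pandas as pd

# Made-up values standing in for temperature (T) and dew point (Td)
toy = pd.DataFrame({"T": [2.0, 4.0, -1.0], "Td": [1.0, 3.0, -3.0]})
cols = ["T", "Td"]

out = toy.copy()
# axis=1 combines the selected columns within each row
out["sum_T_Td"] = toy[cols].sum(axis=1)
out["prod_T_Td"] = toy[cols].prod(axis=1)
out["min_T_Td"] = toy[cols].min(axis=1)
out["max_T_Td"] = toy[cols].max(axis=1)
# pandas computes the sample standard deviation (ddof=1) by default
out["std_T_Td"] = toy[cols].std(axis=1)
```

With only two variables the standard deviation is based on two values per row, so it is mostly useful once more columns are combined.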
Give meaningful names to the new variables
transformer = MathFeatures(
    variables=["T", "Td"],
    func=["sum", "min", "max"],
    new_variables_names=["sum(T,Td)", "min(T,Td)", "max(T,Td)"],
)
data_t = transformer.fit_transform(data)
Pass existing functions to the func argument
transformer = MathFeatures(
    variables=["T", "Td"],
    func=[np.sum, np.prod, np.min, np.max, np.std],
)
data_t = transformer.fit_transform(data)
Relative features
Code
transformer = RelativeFeatures(
    variables=["T", "Td"],
    reference=["T"],
    func=["add", "sub", "mul", "div", "truediv", "floordiv", "mod", "pow"],
)
data_t = transformer.fit_transform(data)
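As a rough illustration of what "relative to a reference" means, the sketch below reproduces a few of these operations with plain pandas on made-up values. The frame `toy` and the new column names are hypothetical; the operation is always variable (left) against reference (right).

```python
import pandas as pd

# Made-up values; T is the reference, Td the variable being compared to it
toy = pd.DataFrame({"T": [4.0, 9.0, -2.0], "Td": [2.0, 3.0, -6.0]})
ref = toy["T"]

out = toy.copy()
out["Td_sub_T"] = toy["Td"].sub(ref)            # Td - T
out["Td_div_T"] = toy["Td"].div(ref)            # Td / T
out["Td_floordiv_T"] = toy["Td"].floordiv(ref)  # Td // T
out["Td_mod_T"] = toy["Td"].mod(ref)            # Td % T
```

A temperature in °C can be exactly zero, and plain pandas then returns inf or NaN for the division-based operations, so it may be worth checking the reference column for zeros before using "div", "truediv", "floordiv" or "mod".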
Cyclical features
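Each variable is replaced by a pair of features, where max_value is the maximum value of that variable:
- var_sin = sin(variable * (2. * pi / max_value))
- var_cos = cos(variable * (2. * pi / max_value))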
Code
transformer = CyclicalFeatures(
    variables=["T", "Td"],
    drop_original=False,
)
data_t = transformer.fit_transform(data)
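The sine and cosine formulas can be checked by hand with NumPy. The sketch below applies them to a hypothetical `hour` column, which is not part of the example above but shows the cyclical idea more clearly than temperature does; `max_value` is assumed to be the largest value in the column.

```python
import numpy as np
import pandas as pd

# Hypothetical hour-of-day values; 0 and 24 describe the same point in the cycle
hours = pd.DataFrame({"hour": [0, 6, 12, 18, 24]})

max_value = hours["hour"].max()
angle = hours["hour"] * (2.0 * np.pi / max_value)

# Two features per variable place each value on the unit circle
hours["hour_sin"] = np.sin(angle)
hours["hour_cos"] = np.cos(angle)
```

Hours 0 and 24 end up with (almost exactly) the same sine and cosine values, which is the property a raw numeric hour column lacks. For the weather data above, such a column could be derived from the DatetimeIndex, for example with data.index.hour.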