Example code for transforming continuous numerical variables into discrete variables using several methods. It also answers the following questions:
- How to directly read weather data from website?
- How to convert datetime column to DatetimeIndex?
- How to discretise continuous variables with the equal-frequency method?
- How to discretise continuous variables with equal-width intervals?
- How to discretise continuous variables with arbitrary, user-defined intervals?
- How to discretise continuous variables with a decision tree?
Prepare data and load functions
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.discretisation import EqualFrequencyDiscretiser
from feature_engine.discretisation import EqualWidthDiscretiser
from feature_engine.discretisation import ArbitraryDiscretiser
from feature_engine.discretisation import DecisionTreeDiscretiser
# create range of monthly dates
download_dates = pd.date_range(start='2019-01-01', end='2020-01-01', freq='MS')
# URL from Chrome DevTools Console
base_url = ("https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&"
            "stationID=51442&Year={}&Month={}&Day=7&timeframe=1&submit=Download+Data")  # {} placeholders take year and month
# create list of remote URLs from the base URL
list_of_url = [base_url.format(date.year, date.month) for date in download_dates]
# download and combine multiple files into one DataFrame
df = pd.concat((pd.read_csv(url) for url in list_of_url))
# keep only the timestamp, temperature and dew point columns
keepcolumns = ['Date/Time (LST)', 'Temp (°C)', 'Dew Point Temp (°C)']
data = df[keepcolumns]
data = data.rename(columns={'Date/Time (LST)': 'dt_var', 'Temp (°C)': 'T', 'Dew Point Temp (°C)': 'Td'})
# convert the datetime column to a DatetimeIndex
datetime_series = pd.to_datetime(data['dt_var'])
datetime_index = pd.DatetimeIndex(datetime_series.values)
data1 = data.set_index(datetime_index)
data1.drop('dt_var', axis=1, inplace=True)
# use the first 100 rows for the examples
data = data1.head(100)
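The datetime-to-index step can be tried without downloading anything. The following is a minimal sketch on a small synthetic frame; the `sample` name and its values are made up for illustration and are not part of the weather data.

```python
import pandas as pd

# Synthetic stand-in for the downloaded hourly data (values are made up)
sample = pd.DataFrame({
    "dt_var": ["2019-01-01 00:00", "2019-01-01 01:00", "2019-01-01 02:00"],
    "T": [-5.1, -5.6, -6.0],
    "Td": [-8.2, -8.9, -9.4],
})

# Parse the strings to datetimes, then move the column into the index
sample["dt_var"] = pd.to_datetime(sample["dt_var"])
sample = sample.set_index("dt_var")
sample.index.name = None  # drop the leftover column name from the index
```

Setting a datetime column as the index gives a `DatetimeIndex` directly, so the separate `drop` call used above is not needed in this variant.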
Equal frequency discretiser
Discretises the data into 10 equal-frequency intervals, labelled 0, 1, ..., 9.
# set up the discretisation transformer
disc = EqualFrequencyDiscretiser(q=10, variables=['T', 'Td'])
# fit the transformer
disc.fit(data)
# transform the data
data_t = disc.transform(data)
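To see what "equal frequency" means in practice, the same idea can be sketched with plain pandas. This uses `pd.qcut` on synthetic, skewed values; it illustrates the technique and is not a description of how `EqualFrequencyDiscretiser` is implemented.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed values: 0, 1, 4, 9, ..., 9801
values = pd.Series(np.arange(100, dtype=float) ** 2)

# Cut at the deciles so every bin receives the same number of observations
codes, edges = pd.qcut(values, q=10, labels=False, retbins=True)

# Number of observations that fall in each bin
counts = codes.value_counts().sort_index()
```

Each of the 10 bins holds 10 observations, while the bin edges are unevenly spaced because the values are skewed.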
Equal width discretiser
Sorts the variable values into contiguous intervals of equal size.
# set up the discretisation transformer
disc = EqualWidthDiscretiser(bins=10, variables=['T', 'Td'])
# fit the transformer
disc.fit(data)
# transform the data
data_t = disc.transform(data)
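For comparison with the equal-frequency method, here is a minimal sketch of equal-width binning using `np.linspace` and `pd.cut` on synthetic, skewed values. The edges are computed explicitly from the minimum and maximum; this is an illustration of the idea, not the internals of `EqualWidthDiscretiser`.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed values: 0, 1, 4, 9, ..., 9801
values = pd.Series(np.arange(100, dtype=float) ** 2)

# 11 evenly spaced edges between min and max define 10 bins of equal width
edges = np.linspace(values.min(), values.max(), 11)
codes = pd.cut(values, bins=edges, labels=False, include_lowest=True)

# Number of observations that fall in each bin
counts = codes.value_counts().sort_index()
```

Here every bin has the same width, but the counts differ: the low bins collect many more observations than the high ones.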
Arbitrary Discretiser
Sorts the variable values into contiguous intervals whose limits are defined arbitrarily by the user in a dictionary.
# set up the discretisation transformer
# np.Inf was removed in NumPy 2.0; np.inf works in all versions
user_dict = {'T': [-np.inf, -30, -20, -10, 0, 10, 20, 30, np.inf]}
disc = ArbitraryDiscretiser(binning_dict=user_dict, return_object=False, return_boundaries=False)
# fit the transformer
disc.fit(data)
# transform the data
data_t = disc.transform(data)
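The effect of user-defined limits can be checked on a handful of values with `pd.cut`. The temperatures below are made up, and the right-closed intervals are the pandas default; whether the boundaries should be closed on the right or the left is an assumption to verify for your own use.

```python
import numpy as np
import pandas as pd

# Made-up temperatures in °C, including values that sit exactly on a limit
temps = pd.Series([-35.0, -12.5, -0.1, 0.0, 4.2, 19.9, 31.0])

# Same limits as in user_dict; intervals are (-inf, -30], (-30, -20], ...
edges = [-np.inf, -30, -20, -10, 0, 10, 20, 30, np.inf]
codes = pd.cut(temps, bins=edges, labels=False)
```

With right-closed intervals, 0.0 falls in the same bin as -0.1, namely (-10, 0].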
Decision Tree Discretiser
Replaces numerical values by discrete values which are the predictions of a decision tree.
# split into training and test sets, using Td as the target
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Td'], axis=1),
    data['Td'], test_size=0.3, random_state=0)
# set up the discretisation transformer
disc = DecisionTreeDiscretiser(cv=3,
                               scoring='neg_mean_squared_error',
                               variables=['T'],
                               regression=True)
# fit the transformer on the training set only
disc.fit(X_train, y_train)
# transform both sets with the fitted transformer
train_t = disc.transform(X_train)
test_t = disc.transform(X_test)
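The idea behind tree-based discretisation can be sketched directly with scikit-learn: fit a shallow regression tree on a single feature and replace each value with the tree's prediction. The data below are synthetic, and `max_depth=2` is fixed by hand for illustration; the `cv` and `scoring` arguments above suggest the discretiser tunes the tree by cross-validation instead.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic feature and target with a roughly linear relationship
rng = np.random.default_rng(0)
x = rng.uniform(-20, 20, size=200).reshape(-1, 1)
y = 0.8 * x.ravel() - 3 + rng.normal(0, 1, size=200)

# A depth-2 tree has at most 4 leaves, so it yields at most 4 distinct predictions
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(x, y)

# Each value is replaced by the mean target of the leaf it falls into
x_disc = tree.predict(x)
n_levels = np.unique(x_disc).size
```

The 200 continuous values collapse to a few levels, one per leaf, and the split points are chosen to be informative about the target rather than about the feature's own distribution.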