How to transform continuous numerical variables into discrete variables?

Example code for transforming continuous numerical variables into discrete variables using several methods. It can also answer the following questions.

  • How to read weather data directly from a website?
  • How to convert a datetime column to a DatetimeIndex?
  • How to discretise continuous variables with the equal-frequency method?
  • How to discretise continuous variables with equal-width intervals?
  • How to discretise continuous variables with arbitrary, user-defined intervals?
  • How to discretise continuous variables with a decision tree?

Prepare data and load functions

Code


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.discretisation import EqualFrequencyDiscretiser
from feature_engine.discretisation import EqualWidthDiscretiser
from feature_engine.discretisation import ArbitraryDiscretiser
from feature_engine.discretisation import DecisionTreeDiscretiser

# create a range of month-start dates (January 2019 through January 2020)
download_dates = pd.date_range(start='2019-01-01', end='2020-01-01', freq='MS')

#  URL from Chrome DevTools Console
base_url = ("https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&"
              "stationID=51442&Year={}&Month={}&Day=7&timeframe=1&submit=Download+Data") #  add format option to year and month

#  create list of remote URL from base URL
list_of_url = [base_url.format(date.year, date.month) for date in download_dates]

#  download and combine multiple files into one DataFrame
df = pd.concat((pd.read_csv(url) for url in list_of_url))
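# keep the timestamp, temperature and dew point columns and give them short names
keepcolumns = ['Date/Time (LST)', 'Temp (°C)', 'Dew Point Temp (°C)']
data = df[keepcolumns].rename(columns={'Date/Time (LST)': 'dt_var', 'Temp (°C)': 'T', 'Dew Point Temp (°C)': 'Td'})

# parse the timestamp strings and use them as a DatetimeIndex
datetime_series = pd.to_datetime(data['dt_var'])
datetime_index = pd.DatetimeIndex(datetime_series.values)
data1 = data.set_index(datetime_index).drop(columns='dt_var')

# keep the first 100 complete rows (dropna guards against gaps in the station record,
# since the feature_engine discretisers reject missing values)
data = data1.dropna().head(100)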
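
The datetime-to-index step can also be checked in isolation. The sketch below is a minimal example on a made-up three-row table (the values are invented and `raw`, `indexed` and `same_day` are illustrative names); it assumes only pandas and shows that once the timestamps sit in a DatetimeIndex, rows can be selected by date string.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded table (values are made up)
raw = pd.DataFrame({
    "dt_var": ["2019-01-01 00:00", "2019-01-01 01:00", "2019-01-02 00:00"],
    "T": [-5.1, -5.4, -6.0],
})

# Parse the strings, then move the parsed column into the index;
# set_index removes 'dt_var' from the columns automatically
indexed = raw.assign(dt_var=pd.to_datetime(raw["dt_var"])).set_index("dt_var")

# A date string selects every row that falls on that day
same_day = indexed.loc["2019-01-01"]
print(type(indexed.index).__name__)
print(same_day)
```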

Equal frequency discretiser

Code

Discretise the data into 10 equal-frequency intervals, labelled 0, 1, ..., 9. Each interval holds roughly the same number of observations.


# set up the discretisation transformer
disc = EqualFrequencyDiscretiser(q=10, variables=['T', 'Td'])

# fit the transformer (learns the interval limits), then transform the data
disc.fit(data)
data_t = disc.transform(data)
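
To see what equal-frequency binning does without the weather download, here is a minimal sketch using `pandas.qcut` on made-up values. It is a stand-in for illustration, not a description of how feature_engine works internally; the series `values` and the choice of 4 bins are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up, skewed temperatures: 20 distinct values
values = pd.Series(np.arange(20, dtype=float) ** 2 / 10 - 15, name="T")

# labels=False returns the bin number (0..q-1); retbins=True also returns the edges
codes, edges = pd.qcut(values, q=4, labels=False, retbins=True)

counts = codes.value_counts().sort_index()
widths = np.diff(edges)
print(counts.tolist())  # same number of observations in every bin
print(widths)           # bin widths differ because the data are skewed
```

The bins hold equal counts but have unequal widths, which is the trade-off of this method compared with equal-width binning.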

Equal width discretiser

Sorts the values of each variable into contiguous intervals of equal width (here 10 bins).


# set up the discretisation transformer
disc = EqualWidthDiscretiser(bins=10, variables=['T', 'Td'])

# fit the transformer (learns the interval limits), then transform the data
disc.fit(data)
data_t = disc.transform(data)
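
As a rough illustration of equal-width binning, the sketch below builds the edges with `numpy.linspace` and assigns bins with `pandas.cut`. The values and the choice of 4 bins are made up, and this is a pandas stand-in rather than feature_engine's own implementation.

```python
import numpy as np
import pandas as pd

# Made-up temperatures, clustered near the low end
values = pd.Series([-12.0, -11.5, -11.0, -10.0, -9.0, -2.0, 8.0], name="T")
n_bins = 4

# Evenly spaced edges between the minimum and the maximum
edges = np.linspace(values.min(), values.max(), n_bins + 1)

# include_lowest=True keeps the minimum value inside the first bin
codes = pd.cut(values, bins=edges, labels=False, include_lowest=True)

print(edges)           # every bin is 5 degrees wide
print(codes.tolist())  # most values land in bin 0, and bin 2 stays empty
```

Every bin has the same width, but the number of observations per bin follows the shape of the data, so some bins can end up empty.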

Arbitrary discretiser

Sorts the variable values into contiguous intervals whose limits are defined by the user in a dictionary that maps each variable name to a list of interval limits.


# set up the discretisation transformer
# np.inf replaces np.Inf, which was removed in NumPy 2.0
user_dict = {'T': [-np.inf, -30, -20, -10, 0, 10, 20, 30, np.inf]}
disc = ArbitraryDiscretiser(binning_dict=user_dict, return_object=False, return_boundaries=False)

# fit the transformer, then transform the data
disc.fit(data)
data_t = disc.transform(data)
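
The same limits can be tried out with `pandas.cut` on a few made-up temperatures. This is a minimal sketch for checking which interval a value falls into; the sample values are invented, and the right-closed intervals shown are the `pandas.cut` default rather than a statement about feature_engine.

```python
import numpy as np
import pandas as pd

# Made-up temperatures chosen to sit on and between the limits
values = pd.Series([-35.0, -20.0, -0.5, 0.0, 12.3, 31.0], name="T")

# Same limits as user_dict['T']; the infinite ends catch any outliers
edges = [-np.inf, -30, -20, -10, 0, 10, 20, 30, np.inf]

# pandas.cut uses right-closed intervals by default, e.g. (-10, 0]
codes = pd.cut(values, bins=edges, labels=False)
intervals = pd.cut(values, bins=edges)

print(codes.tolist())
print(intervals.astype(str).tolist())
```

Note that a value lying exactly on a limit, such as 0.0, goes into the interval that ends at that limit.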

Decision tree discretiser

Replaces each numerical value with the prediction of a decision tree, so the transformed variable takes only a finite set of values. Here the tree is trained on T to predict Td.


# split into train and test sets; Td is the target the tree learns to predict
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Td'], axis=1),
    data['Td'], test_size=0.3, random_state=0)

# set up the discretisation transformer
disc = DecisionTreeDiscretiser(cv=3,
                               scoring='neg_mean_squared_error',
                               variables=['T'],
                               regression=True)

# fit on the training set only, then transform both sets
disc.fit(X_train, y_train)
train_t = disc.transform(X_train)
test_t = disc.transform(X_test)
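
The idea behind tree-based discretisation can be sketched directly with scikit-learn's `DecisionTreeRegressor`. The data below are synthetic (a made-up linear relation between T and Td plus noise), and the depth is fixed at 2 for simplicity instead of being chosen by cross-validation, so treat it as an illustration of the technique rather than a copy of what the transformer does.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: dew point roughly 3 degrees below temperature (assumption)
rng = np.random.default_rng(0)
T = rng.uniform(-20, 20, size=200)
Td = T - 3 + rng.normal(0, 1, size=200)

# A shallow tree on the single feature T; depth 2 allows at most 4 leaves
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(T.reshape(-1, 1), Td)

# Each T is replaced by the prediction of the leaf it falls into
T_binned = tree.predict(T.reshape(-1, 1))
n_levels = len(np.unique(T_binned))
print(n_levels)
```

Because every observation in a leaf receives the same prediction, the continuous variable collapses to at most as many distinct values as the tree has leaves.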