Boring linear forecast

Improving Performance by Adding Some Dummies

Author

Joost de Theije + LLM

Published

March 5, 2023

Modified

April 20, 2024

Abstract
Discover how the simple addition of dummy variables can transform a linear forecast from mundane to insightful. This article explores the significant impact of encoding seasonality into your models and provides a step-by-step guide on improving forecast accuracy.

1 TLDR;

  • Linear regression in itself is not performant for longer time-scales
  • Most systems have some capabilities for linear regression built-in
  • Adding dummy variables for datetime features(i.e. month, weekday etc.) adds predictive power

2 Introduction

Linear regression is a fundamental statistical model for determining linear relationships between variables, commonly used in Economics, Finance, and Social Sciences. Despite being considered basic, its ability to estimate direction and magnitude makes it valuable. The popularity of linear regression arises from its widespread availability, affordability, and ease of identifying optimal parameters.

For time series forecasting, dummy variables can be employed to expand the model’s capabilities. These dummies offer insights into temporal relationships by capturing seasonality and identifying one-off events such as price reductions or natural disasters. For instance, we can create dummies for specific datetime features like month or weekday/weekend status.

3 Imports

First we import all the libraries, the default data science libs and the linear model and metrics from sklearn.

Code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

4 Reading in the data

For this example, we’ll utilize the dataset from the Prophet package. The dataset is described in the Prophet docs as follows:

As an illustration, let’s examine a time series of the log daily page views for the Wikipedia page about Peyton Manning. We obtained this data using the Wikipedia trend package in R. Peyton Manning offers an excellent example because it demonstrates some of Prophet’s features such as multiple seasonality, changing growth rates, and the ability to model special days (like his playoff and Superbowl appearances).

Link to the documentation

Code
df_in = pd.read_csv(
    "https://raw.githubusercontent.com/facebook/prophet/main/examples/example_wp_log_peyton_manning.csv"
)
df_in = df_in.assign(ds=pd.to_datetime(df_in["ds"]))
df_in = df_in[(df_in["ds"] > "2012")]  # selecting data after 2012

To gain a better understanding of our dataset, let’s visualize it over time and identify any observable trends or patterns. This visualization provides insights into the patterns and trends in the data, which we can then try to incorporate into our model using seasonal dummies.

Code
plt.plot_date(
    x=df_in["ds"],
    y=df_in["y"],
    label="input timeseries",
    fmt="-",
)
plt.tick_params(axis="x", rotation=45)
plt.ylabel("target variable - $y$")
plt.title("daily visits ot Peyton Manning wiki on a daily basis (log)")
plt.show()

I’ve selected data spanning from 2012 for analyzing seasonal dummies. The initial observation is that there exists a dip around months 6 and 7. Early August hosts NFL exhibition games, preceding the start of the regular football season.

Also, we can observe a pattern over the year, it starts high then dips, and then and high again. This can be seen for the other years as well, so there is some repeating seasonality. Let us continue and train our first models. Starting with a simple ordinary linear regression and then adding dummies to see if they improve the performance of the model.

5 Train-test split

Code
# train test split
df_train = df_in[(df_in["ds"] > "2012") & (df_in["ds"] < "2015")]
df_test = df_in[(df_in["ds"] > "2015")]

To evaluate the performance of the model, let’s split the dataset into two parts: training data (from 2012 to 2015) and testing data (everything after 2015). The model will only be exposed to the training data and expected to generate predictions for the testing data. Once the predictions are obtained, we’ll calculate the performance using these predictions and the true observations.

Code
# visually inspect the train test split
plt.plot_date(
    x=df_train["ds"],
    y=df_train["y"],
    label="train",
    fmt="-",
)
plt.plot_date(
    x=df_test["ds"],
    y=df_test["y"],
    label="test",
    fmt="-",
)
plt.legend(loc="upper right")
plt.tick_params(axis="x", rotation=45)
plt.ylabel("$y$")
plt.title("data is splitted, everything before 2015 is train data after 2015 test")
plt.show()

6 Setting up the regression

Code
X_train = df_train["ds"].astype(int).to_numpy().reshape(-1, 1)
y_train = df_train["y"].to_numpy()

X_test = df_test["ds"].astype(int).to_numpy().reshape(-1, 1)
y_test = df_test["y"].to_numpy()

Preprocessing the data to prepare it for fitting the linear model: In this instance, we convert the date columns into a sequence of ever-increasing integers.

Code
# creating, fit, and inference
linear = LinearRegression()
linear.fit(X=X_train, y=y_train)
y_pred = linear.predict(X=X_test)

The linear model is swift to fit, taking approximately 4 milliseconds on my machine. Given the modest data size (though it’s significant for time series), this enables us to fit 1000 models within the span of a single coffee sip☕️.

Let’s visualize the model results by plotting all relevant components: train data, test data, and predictions. Additionally, we will calculate two error metrics – Mean Squared Error (mse) and Mean Absolute Error (mae) – to quantify the performance of the model.

Code
# calc error metrics
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

# visually inspect the prediction
plt.plot_date(
    x=df_train["ds"],
    y=df_train["y"],
    label="train",
    fmt="-",
)
plt.plot_date(
    x=df_test["ds"],
    y=df_test["y"],
    label="test",
    fmt="-",
)

plt.plot_date(
    x=df_test["ds"],
    y=y_pred,
    label="prediction",
    fmt="--",
)

plt.legend(loc="upper right")
plt.tick_params(axis="x", rotation=45)
plt.ylabel("$y$")
plt.title(f"linear regression applied (MSE= {mse:.3}, MAE={mae:.3})")
plt.show()

The linear model is represented by a plain straight line, as expected from a simple linear regression model. The green line passes through the orange data points roughly, moving in an intuitively correct direction. However, it fails to capture seasonality or other patterns present in the training set. With Mean Squared Error (MSE) and Mean Absolute Error (MAE) values around 0.6, we’ll use this as a baseline for further improvement by incorporating dummies.

mean squared error  = 0.617
mean absolute error = 0.615

7 Adding dummies

Let’s introduce dummy variables for the months to see whether we can enhance our model’s performance, both visually and numerically. We’ll create a separate column for each month, where a 1 indicates the occurrence of that month, and a 0 signifies its absence. By doing so, we embed temporal information directly into the model, allowing it to discern the specific impacts of each month and adjust future predictions accordingly. That’s the theory—now, let’s put it to the test!

Code
# creating dummies for the months
df_dummies = df_in.assign(
    month=df_in["ds"].dt.month.astype("category"),
    ds_int=df_in["ds"].astype(int),
)

not_dummy = {"y", "ds", "ds_int"}
to_dummy = list(set(df_dummies.columns) - not_dummy)

df_dummies = pd.get_dummies(data=df_dummies, columns=to_dummy)
all_features = list(set(df_dummies.columns) - {"y", "ds"})

# slicing the input in train test
df_train_dummies = df_dummies[(df_dummies["ds"] > "2012") & (df_dummies["ds"] < "2015")]
df_test_dummies = df_dummies[(df_dummies["ds"] > "2015")]

X_train = df_train_dummies.loc[:, all_features]
y_train = df_train_dummies[["y"]]

X_test = df_test_dummies.loc[:, all_features]
y_test = df_test_dummies[["y"]]

df_dummies.drop(columns=["ds_int", "y"]).sample(
    n=6,
    ignore_index=True,
    random_state=42,
)
ds month_1 month_2 month_3 month_4 month_5 month_6 month_7 month_8 month_9 month_10 month_11 month_12
0 2014-05-06 False False False False True False False False False False False False
1 2012-07-05 False False False False False False True False False False False False
2 2015-06-21 False False False False False True False False False False False False
3 2012-03-09 False False True False False False False False False False False False
4 2012-08-10 False False False False False False False True False False False False
5 2013-05-11 False False False False True False False False False False False False

We’ve added month dummies to our data—month_1 for January through month_12 for December. Next, we’ll train our model with these dummies and compare performance against the original model without them.

Code
# create the pipeline and fit pipeline
# scaler is there so that the coefs can be interpeted later
# pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline = make_pipeline(MinMaxScaler(), LinearRegression())

pipeline.fit(X=X_train, y=y_train)
y_pred_dummies = pipeline.predict(X=X_test)

mse_dummies = mean_squared_error(y_test, y_pred_dummies)
mae_dummies = mean_absolute_error(y_test, y_pred_dummies)

plt.plot_date(
    x=df_train["ds"],
    y=df_train["y"],
    label="train",
    fmt="-",
)

plt.plot_date(
    x=df_test["ds"],
    y=df_test["y"],
    label="test",
    fmt="-",
)

plt.plot_date(
    x=df_test["ds"],
    y=y_pred,
    label="prediction",
    fmt="--",
)

plt.plot_date(
    x=df_test["ds"],
    y=y_pred_dummies,
    label="prediction with dummies",
    fmt="--",
)

plt.legend(loc="upper right")
plt.tick_params(axis="x", rotation=45)
plt.ylabel("$y$")
plt.title(
    f"linear regression with dummies applied (mse= {mse_dummies:.3}, mae={mae_dummies:.3})"
)
plt.show()

Visually, our forecast has improved significantly. It now mirrors the ups and downs of the time series more closely and captures the overall trend more accurately. These enhancements are also evident in the error metrics, with the mean squared error improving by a factor of 1.89 and the mean absolute error by a factor of 1.54. Such substantial gains from simply incorporating ones and zeros are indeed impressive.

Code
print(f"mean squared error  = {mse_dummies:.3}")
print(f"improvement factor mse month dummies -> {mse/mse_dummies:.3}x")

print("-" * 79)

print(f"mean absolute error = {mae_dummies:.3}")
print(f"improvement factor mea month dummies -> {mae/mae_dummies:.3}x")
mean squared error  = 0.323
improvement factor mse month dummies -> 1.91x
-------------------------------------------------------------------------------
mean absolute error = 0.397
improvement factor mea month dummies -> 1.55x

8 Inspecting the seasonality

Now that we have encoded the information about the seasonality in the model, we can inspect that seasonality on its own. This will give us some insight into the inner workings of the underlying time series model. First, we access the coefficients of the linear model and put them into a separate DataFrame. Then, we need to scale them so that the relative difference is more apparent. Looking at the raw coefficients would not yield any meaningful information, as the scale is not relatable to the original problem.

Code
# pull coefs into a seperate df, to inspect the seasonality
lin_reg_coefs = (
    pd.DataFrame(
        data=pipeline["linearregression"].coef_,
        columns=X_train.columns,
    )
    .T.reset_index()
    .rename(columns={"index": "month", 0: "coefficient"})
)
# exclude the time col
lin_reg_coefs = lin_reg_coefs[lin_reg_coefs["month"] != "ds_int"]

# subtract mean to get the relative difference between the coefs
lin_reg_coefs["coefficient"] = (
    lin_reg_coefs["coefficient"] - lin_reg_coefs["coefficient"].mean()
)
Code
chart = sns.barplot(
    data=lin_reg_coefs,
    x="month",
    y="coefficient",
    color=sns.color_palette()[0],
    order=[f"month_{i}" for i in range(1, 13)],
)
plt.tick_params(axis="x", rotation=45)
plt.ylabel("")
plt.title("yearly seasonality")
plt.show()

We have a clear picture of the seasonality throughout the year. There is a dip in the middle of the year, followed by a significant uptick in January. This can be attributed to the Super Bowl, which is played in the first week of February and drives a lot of traffic to the wiki page. The dip in the middle of the year is also evident at the month_6 mark.

9 Recap

In this blog post, I have demonstrated that the performance of a simple linear regression for time series forecasting can be improved by a factor of 1.54 up to 1.89 by simply adding dummy variables for the months. The nice thing about this approach is that the linear regression is available in most systems with analytical capabilities (yes, even in Excel), and adding the dummy variables is so simple that you can do it in a SQL server. The added benefit is that the model fitting is quick, allowing you to retrain the model monthly, weekly, daily, or even hourly.

10 Encore

What if we were to dummies not just for the months, but also for other datetime features and really turn it up to eleven

Sure, let’s create the dummy variables for the given datetime features and then apply the Elastic Net linear model.

First, let’s create the dummy variables for the datetime features:

  • Month: This feature will create 12 dummy variables, one for each month.
  • Week: This feature will create 52 dummy variables, one for each week of the year.
  • Day of the Week: This feature will create 7 dummy variables, one for each day of the week.
  • Is Weekend: This feature will create 2 dummy variables, one for weekdays and one for weekends.
  • Quarter: This feature will create 4 dummy variables, one for each quarter of the year.

By creating these dummy variables, we will end up with a large number of features, which can lead to overfitting. This is where the Elastic Net linear model comes in handy.

The Elastic Net is a regularized linear regression model that combines the L1 (Lasso) and L2 (Ridge) regularization techniques. This helps to handle the large number of features and prevent overfitting. The Elastic Net model will automatically select the most important features and shrink the coefficients of the less important ones towards zero.

As an exercise, you can use the sklearn.linear_model.ElasticNetCV class to train the Elastic Net model on your dataset. This class will automatically tune the hyperparameters (alpha and l1_ratio) using cross-validation, which can help to find the optimal balance between the L1 and L2 regularization.

Remember to split your dataset into training and testing sets, and evaluate the performance of the Elastic Net model on the test set. This will give you a good idea of how well the model can handle the large number of features created by the dummy variables.

Code
# creating dummies for the months
df_dummies_all = df_in.assign(
    month=df_in["ds"].dt.month.astype("category"),
    week=df_in["ds"].dt.isocalendar().week.astype("category"),
    dayofweek=df_in["ds"].dt.dayofweek.astype("category"),
    is_weekend=(df_in["ds"].dt.dayofweek) >= 5,
    quarter=df_in["ds"].dt.quarter.astype("category"),
    ds_int=df_in["ds"].astype(int),
)

not_dummy = {"y", "ds", "ds_int"}
to_dummy = list(set(df_dummies_all.columns) - not_dummy)

df_dummies_all = pd.get_dummies(
    data=df_dummies_all,
    columns=to_dummy,
    drop_first=True,  # reduce the amount of cols with no additional info
)
all_features = list(set(df_dummies_all.columns) - {"y", "ds"})

# slicing the input in train test
df_train_dummies_all = df_dummies_all[
    (df_dummies_all["ds"] > "2012") & (df_dummies_all["ds"] < "2015")
]
df_test_dummies_all = df_dummies_all[(df_dummies_all["ds"] > "2015")]

X_train = df_train_dummies_all.loc[:, all_features]
y_train = df_train_dummies_all[["y"]]

X_test = df_test_dummies_all.loc[:, all_features]
y_test = df_test_dummies_all[["y"]]

df_dummies_all.drop(columns=["ds_int", "y"]).sample(
    n=6,
    ignore_index=True,
    random_state=42,
)
ds quarter_2 quarter_3 quarter_4 dayofweek_1 dayofweek_2 dayofweek_3 dayofweek_4 dayofweek_5 dayofweek_6 ... month_3 month_4 month_5 month_6 month_7 month_8 month_9 month_10 month_11 month_12
0 2014-05-06 True False False True False False False False False ... False False True False False False False False False False
1 2012-07-05 False True False False False True False False False ... False False False False True False False False False False
2 2015-06-21 True False False False False False False False True ... False False False True False False False False False False
3 2012-03-09 False False False False False False True False False ... True False False False False False False False False False
4 2012-08-10 False True False False False False True False False ... False False False False False True False False False False
5 2013-05-11 True False False False False False False True False ... False False True False False False False False False False

6 rows × 74 columns

Code
# utilzing an elasticnet linear model to compensate for the amount of features
elastic_params = {
    "l1_ratio": np.linspace(start=0.000001, stop=1, num=100),
    "cv": 7,
    "n_alphas": 1_00,
    "n_jobs": -1,
}

pipeline_all = make_pipeline(MinMaxScaler(), ElasticNetCV(**elastic_params))

pipeline_all.fit(X=X_train, y=y_train.to_numpy().ravel())
y_pred_dummies_all = pipeline_all.predict(X=X_test)

mse_dummies_all = mean_squared_error(y_test, y_pred_dummies_all)
mae_dummies_all = mean_absolute_error(y_test, y_pred_dummies_all)

plt.plot_date(
    x=df_test["ds"],
    y=df_test["y"],
    label="test",
    fmt="-",
    alpha=0.7,
)

plt.plot_date(
    x=df_test["ds"],
    y=y_pred_dummies,
    label="prediction with dummies",
    fmt="--",
)

plt.plot_date(
    x=df_test["ds"],
    y=y_pred_dummies_all,
    label="prediction with all dummies",
    fmt="--",
)


plt.legend(loc="upper right")
plt.tick_params(axis="x", rotation=45)
plt.ylabel("$y$")
plt.title(
    f"linear regression with dummies applied (mse= {mse_dummies_all:.3}, mae={mae_dummies_all:.3})"
)
plt.show()

Immediately, it becomes evident that the model captures more of the fine-grained movement of the time series. This is also reflected in the fact that both error metrics have improved.

Code
print(f"mean squared error = {mse_dummies_all:.3}")
print(f"improvement factor mse month dummies -> {mse/mse_dummies:.3}x")
print(f"improvement factor mse all dummies   -> {mse/mse_dummies_all:.3}x")
print("-" * 79)

print(f"mean absolute error = {mae_dummies_all:.3}")
print(f"improvement factor mea month dummies -> {mae/mae_dummies:.3}x")
print(f"improvement factor mea all dummies   -> {mae/mae_dummies_all:.3}x")
mean squared error = 0.262
improvement factor mse month dummies -> 1.91x
improvement factor mse all dummies   -> 2.35x
-------------------------------------------------------------------------------
mean absolute error = 0.356
improvement factor mea month dummies -> 1.55x
improvement factor mea all dummies   -> 1.73x