Data Science – Lag Features

Posted on January 19, 2023

Neville Dubash

A time series is a collection of values recorded at known points in time, e.g., the hourly energy consumption of a residential building, the daily use of a mobile application, or the hourly pressure at a valve. Time series data can be modeled in numerous ways, including ARIMA and neural network-based algorithms such as LSTM and GRU. Where computational cost is a concern, however, tree-based methods can be a great option.

To apply a tree model to time series data, some initial feature engineering is required to generate features corresponding to time. A tree model treats each row independently and has no built-in notion of order, so any information about when a value occurred, or what came before it, has to be supplied as explicit features. Feature engineering is an important part of data modeling in which informative features of the data are identified or derived and used to improve model performance. Feature engineering for time series differs from feature engineering for other types of data because the observations are sequential, and much of the useful information lies in how values change over time.
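As a rough illustration of generating features corresponding to time, the sketch below derives a few calendar features from a timestamp using only the Python standard library. The function name `time_features`, the particular features chosen (hour, day of week, month, weekend flag), and the sample timestamps are assumptions made for this example, not a prescribed set.

```python
from datetime import datetime, timedelta


def time_features(timestamp: datetime) -> dict:
    """Derive simple calendar features from a timestamp (illustrative choice)."""
    return {
        "hour": timestamp.hour,                  # 0-23, relevant to daily cycles
        "day_of_week": timestamp.weekday(),      # 0 = Monday ... 6 = Sunday
        "month": timestamp.month,                # 1-12, relevant to yearly cycles
        "is_weekend": timestamp.weekday() >= 5,  # Saturday or Sunday
    }


# Hypothetical hourly timestamps starting on a Saturday at 22:00.
start = datetime(2023, 1, 14, 22)
rows = [time_features(start + timedelta(hours=h)) for h in range(4)]

for row in rows:
    print(row)
```

Each resulting row can sit alongside the measured value as ordinary input columns, which is the form a tree model expects.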

Some distinguishing components of time series data are:
- The mean value
- Trends/changes in the value
- Seasonality/cyclical changes

One way to capture these aspects of a time series with a tree model is to introduce lag features. A lag feature is an additional value attached to each data point that is computed from prior timesteps, such as the value observed one step or one day earlier. This is useful because lag features carry information about what has previously occurred, which the model can use to predict future values. For example, in energy consumption modeling, a correlation between present electricity consumption and the meter values read the previous day suggests that it may be useful to generate, at each time point, features based on the readings from the previous day.
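A minimal sketch of how lag features might be built is shown below in plain Python. The helper name `add_lag_features`, the choice of lags (1 and 7 steps), and the meter readings are all invented for illustration.

```python
def add_lag_features(values, lags):
    """Return one row per timestep holding the current value and its lags.

    A lag of k at timestep t is the value observed at timestep t - k.
    Early timesteps without enough history get None for that lag.
    """
    rows = []
    for t, value in enumerate(values):
        row = {"value": value}
        for k in lags:
            row[f"lag_{k}"] = values[t - k] if t - k >= 0 else None
        rows.append(row)
    return rows


# Hypothetical daily meter readings (kWh), invented for this example.
readings = [30.1, 28.4, 31.0, 29.7, 33.2, 27.5, 26.9, 30.8]

# lag_1 is the previous day's reading; lag_7 is the reading one week earlier.
table = add_lag_features(readings, lags=[1, 7])

# Rows with a missing lag have incomplete history, so drop them before training.
train_rows = [row for row in table if None not in row.values()]

for row in train_rows:
    print(row)
```

With pandas, the same columns are commonly produced with `Series.shift(k)`. Either way, each lag only looks backward in time, so the features for a given row never include information from later timesteps.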

Research has shown that adding lag features like this to a data set can significantly improve the performance of time series models. Tree-based models may not typically be the first choice for state-of-the-art time series analysis, but they have desirable characteristics for practical applications such as real-time prediction: interpretability, simplicity, and lower computational cost than neural network models.