
Time Series Forecasting

A time series is a sequence of data points collected at regular intervals, such as daily, monthly, or quarterly sales figures or weekly temperature readings. By analyzing time series data, we can see how values change over time and make informed predictions about where they are heading.

Time series forecasting is a valuable tool applied across many fields. Companies use it to predict future sales or stock prices, which helps with inventory management and planning. Financial models rely on time series to estimate revenue, expenses, or cash flow, guiding budgeting and investment decisions. Weather forecasts use time series to predict temperature, rainfall, and other weather patterns, and economists analyze them to understand economic trends and market dynamics. This broad range of applications makes time series forecasting useful for data analysts, business professionals, and students of data science. Forecasting models are powerful tools for spotting patterns, tracking changes, and projecting future trends. By using time series forecasting, organizations learn from past data to estimate what might happen in the future, which helps them make better decisions and act with more confidence in their strategies and plans.

 

Statistical Models versus Machine Learning Models

There are two main approaches to time series forecasting: statistical models and machine learning models. Statistical models describe the structure of the data with explicit mathematical formulas. They identify patterns such as trends and seasonality, which are regular changes that recur at fixed intervals. Because they rely on a fixed structure, statistical models are easier to understand and work with. Machine learning models work differently: they don’t use predefined rules or equations. Instead, they learn patterns and relationships directly from historical data, including ones that traditional methods often miss. Machine learning algorithms reduce prediction errors through iterative learning, which makes them suitable for large, complex datasets with unexpected changes.

Machine learning algorithms are particularly effective when the data has complex relationships or when additional features, such as external factors, can improve predictions. Statistical models remain strong choices for datasets with clear trends or well-defined seasonality because of their simplicity and interpretability, while machine learning models are often the better option for very large or complex datasets or those with many influencing factors. You can use various tools and programming languages for time series forecasting; the most common are Python, R, and MATLAB, all of which have strong libraries for time series analysis.
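As a concrete illustration, here is a minimal Python sketch of the statistical approach, using the ExponentialSmoothing (Holt-Winters) model from the statsmodels library on a made-up quarterly sales series. The numbers are synthetic placeholders, not real sales figures.

# A minimal sketch of a classical statistical model: Holt-Winters
# exponential smoothing on a made-up quarterly sales series.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic quarterly data: upward trend plus a seasonal bump in Q4.
index = pd.date_range("2015-01-01", periods=24, freq="QS")
sales = pd.Series(1000 + 20 * np.arange(24) + np.tile([0, 30, 50, 120], 6),
                  index=index)

# Additive trend and additive seasonality with a four-quarter cycle.
model = ExponentialSmoothing(
    sales, trend="add", seasonal="add", seasonal_periods=4
).fit()

# Forecast the next eight quarters (two years ahead).
print(model.forecast(8))

Holt-Winters explicitly models the trend and the four-quarter seasonal cycle, which is exactly the kind of fixed structure described above.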

We can also use mixed methods to forecast time series, combining statistical and machine learning models.
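Here is the machine learning side, sketched with scikit-learn’s RandomForestRegressor trained on lagged values of the same kind of synthetic series, followed by a simple mixed forecast that averages the machine learning prediction with a Holt-Winters one. Again, everything here is illustrative rather than a real analysis.

# A minimal sketch of an ML model on lagged values, plus a simple
# "mixed" forecast that averages it with a statistical forecast.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic quarterly series, as in the previous sketch.
index = pd.date_range("2015-01-01", periods=24, freq="QS")
sales = pd.Series(1000 + 20 * np.arange(24) + np.tile([0, 30, 50, 120], 6),
                  index=index)

# Turn the series into a supervised-learning table of lagged values.
table = pd.DataFrame({"y": sales})
for lag in range(1, 5):
    table[f"lag_{lag}"] = sales.shift(lag)
table = table.dropna()
X, y = table.drop(columns="y"), table["y"]

rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X, y)

# One-step-ahead ML forecast from the last four observed quarters.
last_lags = pd.DataFrame([sales.iloc[-4:][::-1].to_numpy()], columns=X.columns)
ml_forecast = rf.predict(last_lags)[0]

# Statistical forecast for the same quarter, then a simple mixed method:
# average the two predictions.
hw = ExponentialSmoothing(sales, trend="add", seasonal="add",
                          seasonal_periods=4).fit()
mixed_forecast = (ml_forecast + hw.forecast(1).iloc[0]) / 2
print(round(ml_forecast, 1), round(mixed_forecast, 1))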

To make forecasting easier, you can also use AutoML (Automated Machine Learning), which automates the time-consuming parts of developing machine learning models. It tests different models on a dataset and selects the best one, saving time while maintaining accuracy. AutoML systems evaluate several models and configurations, helping users apply up-to-date forecasting techniques with minimal manual setup.
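Under the hood, this model-selection idea can be sketched by hand: fit a few candidate models, score each on a holdout period, and keep the one with the lowest error. The loop below only illustrates that idea on a synthetic series; it is not a real AutoML service.

# Fit several candidate models, score each on a holdout, keep the best.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

index = pd.date_range("2015-01-01", periods=24, freq="QS")
sales = pd.Series(1000 + 20 * np.arange(24) + np.tile([0, 30, 50, 120], 6),
                  index=index)

train, test = sales.iloc[:-8], sales.iloc[-8:]   # hold out the last two years

def rmse(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))

candidates = {
    # Seasonal naive baseline: repeat the last observed year of quarters.
    "seasonal_naive": np.tile(train.iloc[-4:].to_numpy(), 2),
    "holt_winters_additive": ExponentialSmoothing(
        train, trend="add", seasonal="add", seasonal_periods=4).fit().forecast(8),
    "holt_winters_multiplicative": ExponentialSmoothing(
        train, trend="add", seasonal="mul", seasonal_periods=4).fit().forecast(8),
}

scores = {name: rmse(test, fc) for name, fc in candidates.items()}
print(scores, "-> best model:", min(scores, key=scores.get))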

Testing Models

The graph below shows the results of different time series forecasting models applied to Adidas’s quarterly net sales data. I need to determine which model gives the best results for forecasting Adidas’s sales over the next few years. One way to do this is with an AutoML cloud service, such as Microsoft’s Azure AutoML: I would upload the data in a tabular format and let the system automatically pick the best models. Alternatively, I can use Python and run some code to test and identify the best models for forecasting my data.

Since I don’t have the future data (Adidas’s quarterly sales for 2025-2029), it’s challenging to test the models directly. To work around this, I can split my existing data to assess the models. For example, with Adidas’s quarterly net sales data from 2015-2024, I can use 2015-2020 as the training set and have each statistical and machine learning model forecast the sales for 2021-2024.
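A minimal sketch of that split in Python might look like the following; the CSV file name and column names are placeholders, since the actual dataset isn’t shown here.

# Backtesting split: train on 2015-2020, evaluate on 2021-2024.
import pandas as pd

# Hypothetical CSV with "quarter" and "net_sales" columns.
sales = (
    pd.read_csv("adidas_quarterly_net_sales.csv", parse_dates=["quarter"])
    .set_index("quarter")["net_sales"]
    .sort_index()
)

train = sales.loc["2015":"2020"]   # 2015-2020: fit the candidate models here
test = sales.loc["2021":"2024"]    # 2021-2024: held out, used only for scoring

print(len(train), "training quarters,", len(test), "test quarters")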

The graph below shows the actual data in black and the outcomes of various models in different colors. The Random Forest model (in light green) appears to be the best, as its line is the closest to the black line representing the actual data.

To assess which model performed best, I need to calculate the Root Mean Squared Error (RMSE), a metric that quantifies the average magnitude of the errors between predicted and actual values. It is calculated by taking the differences between the actual data and the forecasted values, squaring them so they are all positive, averaging the squared differences, and then taking the square root of the result. RMSE therefore gives a precise measure of how much, on average, the model’s predictions deviate from the actual values: the lower the RMSE, the better the model’s performance.

RMSE is calculated with the following formula:

\( \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \)

Where:

  • \( y_i \) is the actual value at time \( i \)
  • \( \hat{y}_i \) is the forecasted value at time \( i \)
  • \( n \) is the number of data points
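Translated directly into Python, the formula looks like this; the example numbers are made up purely to show the calculation.

# A direct translation of the RMSE formula above.
import numpy as np

def rmse(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))

# Four quarters of actual vs. forecasted sales, each off by +/-100:
print(rmse([5200, 5500, 5900, 6400], [5100, 5600, 6000, 6300]))  # 100.0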

RMSE Calculation for Models

The graph below shows the RMSE for various forecasting models.

The bar chart shows that the Random Forest model’s Mean Squared Error (MSE) is 147,314.63. Taking the square root of this value gives the Root Mean Squared Error (RMSE), approximately 383.8 million dollars. This means that, on average, the model’s predictions are off by about 384 million dollars compared to the actual sales data.
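The square-root step can be checked in one line of Python:

# Converting the reported MSE for the Random Forest model into an RMSE
# (both in millions).
import math

mse = 147_314.63
print(round(math.sqrt(mse), 2))   # 383.82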

To learn more about time series models, please look at this notebook: https://colab.research.google.com/drive/1G1Ylemquqons12_P0LM8Ih0F7PumegRP?usp=sharing