Time Series Forecasting with Transformers

Transformers rewrote NLP. Then computer vision. Now they're making serious inroads into time series forecasting — and as someone who came up through signal processing, I find this development both exciting and technically interesting.

Here's a practical breakdown of what works, what doesn't, and how to think about the problem.

Why Time Series Is Hard

Time series data violates the i.i.d. assumption that most ML methods are built on. Observations are not independent: each point is correlated with what came before. Patterns exist at multiple scales simultaneously (hourly, daily, seasonal). And the distribution shifts over time in ways that are hard to detect until your forecasts are already wrong.
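
To make the "not independent" point concrete, here is a minimal sketch that measures lag-1 autocorrelation on a synthetic series before and after shuffling. The signal (a 24-step cycle plus noise) and the helper name `lag_autocorrelation` are made up for illustration.

```python
import math
import torch

def lag_autocorrelation(series: torch.Tensor, lag: int = 1) -> float:
    """Pearson correlation between a 1-D series and itself shifted by `lag` (lag >= 1)."""
    a = series[:-lag] - series[:-lag].mean()
    b = series[lag:] - series[lag:].mean()
    return float((a * b).sum() / (a.norm() * b.norm()))

torch.manual_seed(0)
t = torch.arange(480, dtype=torch.float32)
# Synthetic "hourly" signal: a 24-step cycle plus a little noise (illustrative only).
series = torch.sin(2 * math.pi * t / 24) + 0.1 * torch.randn(480)
# Shuffling keeps the same values but destroys the temporal order.
shuffled = series[torch.randperm(480)]

ordered_ac = lag_autocorrelation(series)
shuffled_ac = lag_autocorrelation(shuffled)
print(f"lag-1 autocorrelation, original order: {ordered_ac:.2f}")
print(f"lag-1 autocorrelation, shuffled:       {shuffled_ac:.2f}")
```

The ordered series is strongly correlated with its own past, while the shuffled copy is not. That ordering is exactly the information a random train/test split would leak or destroy.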

The Transformer Architecture for Sequences

The core insight behind applying transformers to time series is that attention captures long-range dependencies more directly than RNNs do. Instead of passing a hidden state through time (which loses information over long sequences), attention computes the relationship between any two timesteps in a single step, regardless of how far apart they are. The price is compute and memory that grow quadratically with sequence length.
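
A stripped-down sketch of that idea: scaled dot-product attention weights computed over a random sequence. To keep it short, the input is used directly as both queries and keys, with none of the learned projections a real transformer layer has, so treat it as an illustration rather than a working layer.

```python
import math
import torch

def attention_weights(x: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product self-attention weights for x of shape (seq_len, d_model).

    Simplification: x serves as both queries and keys (no learned projections).
    """
    d_model = x.size(-1)
    scores = x @ x.transpose(-2, -1) / math.sqrt(d_model)  # (seq_len, seq_len)
    return torch.softmax(scores, dim=-1)  # each row sums to 1

torch.manual_seed(0)
seq_len, d_model = 200, 16
x = torch.randn(seq_len, d_model)
weights = attention_weights(x)

print(weights.shape)  # one weight for every pair of timesteps
# The last timestep has a direct weight on the first one, 199 steps earlier.
print(float(weights[-1, 0]))
```

The weight matrix has seq_len × seq_len entries, which is where both the direct long-range access and the quadratic cost come from.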

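import torch
import torch.nn as nn

class TimeSeriesTransformer(nn.Module):
    """Encoder-only transformer that predicts one value per input window."""

    def __init__(self, input_dim, d_model=64, nhead=4, num_layers=2, max_len=512):
        super().__init__()
        # Project each timestep's features into the model dimension.
        self.embedding = nn.Linear(input_dim, d_model)
        # Learnable positional embedding. Without it the encoder has no notion
        # of timestep order. max_len is an assumed upper bound on seq_len.
        self.pos_embedding = nn.Parameter(torch.randn(1, max_len, d_model) * 0.02)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.output = nn.Linear(d_model, 1)

    def forward(self, x):
        # x: (batch, seq_len, input_dim), with seq_len <= max_len
        x = self.embedding(x) + self.pos_embedding[:, : x.size(1)]
        x = self.transformer(x)
        return self.output(x[:, -1, :])  # predict from last timestep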
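
The model above takes input shaped (batch, seq_len, input_dim) and returns one value per window, so a raw series has to be cut into context windows with matching targets first. Here is one way that could look. The helper name `make_windows` and the choice to predict feature 0 are assumptions for the sketch, not part of the model.

```python
import torch

def make_windows(series: torch.Tensor, context_len: int, horizon: int = 1):
    """Split a (time, features) series into model inputs and targets.

    Returns inputs of shape (n, context_len, features) and targets of shape
    (n, 1), where each target is feature 0 taken `horizon` steps after the
    last timestep of its window.
    """
    n = series.size(0) - context_len - horizon + 1
    if n <= 0:
        raise ValueError("series is too short for this context_len and horizon")
    inputs = torch.stack([series[i : i + context_len] for i in range(n)])
    targets = series[context_len + horizon - 1 :, 0].unsqueeze(-1)
    return inputs, targets

# Toy series 0, 1, ..., 9 with a single feature.
series = torch.arange(10, dtype=torch.float32).unsqueeze(-1)
inputs, targets = make_windows(series, context_len=4, horizon=1)
print(inputs.shape, targets.shape)
print(inputs[0].squeeze(-1), targets[0])  # first window and the value that follows it
```

Windows built this way overlap heavily, so split train and validation by time before windowing, not after.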

Positional Encoding Matters More Than You Think

Standard sinusoidal positional encodings were designed for discrete token positions in text. For time series with irregular sampling or multi-scale periodicity, you need better options: learnable positional embeddings, or encodings that incorporate the actual timestamp as a continuous feature.
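
As a sketch of the timestamp-based option, the function below encodes real-valued timestamps as sine and cosine pairs at chosen periods, so irregular gaps between samples are represented faithfully. The function name `timestamp_encoding` and the daily and weekly periods are illustrative assumptions.

```python
import math
import torch

def timestamp_encoding(timestamps: torch.Tensor, periods: list[float]) -> torch.Tensor:
    """Encode real-valued timestamps as sin/cos pairs, one pair per period.

    timestamps: (batch, seq_len) in any unit (hours here); spacing may be irregular.
    periods: cycle lengths in the same unit, e.g. [24.0, 168.0] for daily and weekly.
    Returns a tensor of shape (batch, seq_len, 2 * len(periods)),
    laid out as [sin per period..., cos per period...].
    """
    p = torch.tensor(periods, dtype=timestamps.dtype)
    angles = 2 * math.pi * timestamps.unsqueeze(-1) / p  # (batch, seq_len, n_periods)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

# Irregularly sampled timestamps, in hours.
hours = torch.tensor([[0.0, 1.0, 2.5, 7.0, 24.0, 31.0]])
enc = timestamp_encoding(hours, periods=[24.0, 168.0])
print(enc.shape)
```

Hours 0 and 24 get the same daily phase but different weekly phases. The result can be concatenated to the input features or projected to the model dimension and added to the embedding.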

Patching: The Trick That Actually Works

Recent work (PatchTST and related models) showed that treating time series segments as patches, analogous to image patches in Vision Transformers, can substantially improve forecasting accuracy. Instead of attending over individual timesteps, the model attends over chunks of the signal. This is computationally cheaper, because attention cost grows with the square of the number of tokens and patching cuts that number by roughly the patch stride, and it is empirically stronger.

Treating signal segments as patches is to time series what ViT was to images — a simple idea with outsized impact.
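
A minimal sketch of the patching step for a univariate series, using `Tensor.unfold` to slice overlapping patches and a linear layer to embed each one. The class name and the patch length and stride values are illustrative. This is not a reproduction of PatchTST, which includes further details this sketch leaves out.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Cut a univariate series into (possibly overlapping) patches and embed each one."""

    def __init__(self, patch_len: int = 16, stride: int = 8, d_model: int = 64):
        super().__init__()
        self.patch_len = patch_len
        self.stride = stride
        self.proj = nn.Linear(patch_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len) -> patches: (batch, n_patches, patch_len)
        patches = x.unfold(dimension=-1, size=self.patch_len, step=self.stride)
        return self.proj(patches)  # (batch, n_patches, d_model)

x = torch.randn(8, 512)
tokens = PatchEmbedding()(x)
print(tokens.shape)  # 512 timesteps become (512 - 16) // 8 + 1 = 63 tokens
```

With these settings the encoder attends over 63 tokens instead of 512, so the attention matrix shrinks from 512 × 512 to 63 × 63 entries, roughly 66 times fewer.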

When Not to Use a Transformer

Transformers are powerful but not always the right tool. For short sequences (<100 timesteps), simpler models like N-BEATS or even LightGBM with good feature engineering often match transformer performance with a fraction of the compute. Use transformers when you have long sequences, multiple related time series (multivariate), or when you need to capture complex cross-series dependencies.

Practical Recommendations

Start with a strong baseline (ARIMA, Prophet, or XGBoost with lag features) before reaching for transformers
Use patching if your series is longer than ~500 timesteps
Normalize per series, not globally — time series have very different scales
Evaluate on multiple horizons, not just a single forecast length
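
The last two recommendations can be sketched in a few lines. Both helper names are hypothetical, and the numbers are toy values chosen only to make the behavior visible.

```python
import torch

def normalize_per_series(x: torch.Tensor, eps: float = 1e-8):
    """Standardize each series with its own mean and std.

    x: (n_series, time). Returns the normalized tensor plus the statistics
    needed to map forecasts back to the original scale. In practice, compute
    the statistics on the training portion only to avoid leaking the future.
    """
    mean = x.mean(dim=1, keepdim=True)
    std = x.std(dim=1, keepdim=True)
    return (x - mean) / (std + eps), mean, std

def mae_by_horizon(forecast: torch.Tensor, actual: torch.Tensor) -> torch.Tensor:
    """Mean absolute error at each forecast step; inputs are (n_windows, horizon)."""
    return (forecast - actual).abs().mean(dim=0)

# Two series on very different scales.
series = torch.stack([torch.linspace(0, 1, 100), torch.linspace(1000, 5000, 100)])
normed, mean, std = normalize_per_series(series)
print(normed.mean(dim=1), normed.std(dim=1))  # both rows end up near 0 and 1

# Toy forecasts for two windows over a 3-step horizon.
actual = torch.tensor([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])
forecast = torch.tensor([[1.0, 2.5, 5.0], [2.0, 3.5, 2.0]])
mae = mae_by_horizon(forecast, actual)
print(mae)  # error grows with the horizon in this toy example
```

Reporting the error per step rather than as one average shows whether a model that looks good one step ahead falls apart further out.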