Autoregressive time-series forecasting with a decoder-only transformer trained on
a derivative-based tokenization. The encoder differentiates the input, scales by
its observed range, and quantizes with a MASH (multi-stage noise-shaping) $\Sigma\Delta$ scheme; the decoder inverts the quantization and integrates back to the original scale.
See `papers/paper.pdf` (English) and `papers/paper_ru.pdf` (Russian) for the full write-up with proofs and benchmarks.
Two vocabulary-scaling laws hold at the tokenization level, independent of the model on top.

| quantity | per vocab doubling | reference |
|---|---|---|
| ideal CE | +1 bit | Prop. 2 |
| reconstruction error | halves | Thm. 1 |
| ordinal smoothing $\sigma$ | doubles | §4 |
Theorem 1 (MASH reconstruction, any $k$). Decoding the quantized token stream reconstructs the signal to within $\Delta/2$ of the input, uniformly in horizon $T$.
Proposition 2 (achievable CE). In the fine-quantization limit, the achievable cross-entropy grows by one bit per vocabulary doubling.
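For intuition, a minimal NumPy sketch of the $k = 1$ case (an illustration of the error-feedback idea behind Theorem 1, not the library's MASH implementation): memoryless rounding of the derivative lets the integrated reconstruction drift with $T$, while carrying the rounding error forward pins it at $\Delta/2$.

```python
import numpy as np

def sigma_delta_quantize(dx: np.ndarray, delta: float) -> np.ndarray:
    """First-order error-feedback quantizer: running sums of the output
    stay within delta/2 of running sums of the input, at any length."""
    q = np.empty_like(dx)
    err = 0.0
    for t, x in enumerate(dx):
        v = x + err                         # fold in the carried error
        q[t] = delta * np.round(v / delta)  # snap to the bin grid
        err = v - q[t]                      # |err| <= delta/2, always
    return q

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=10_000))   # a random walk
dx, delta = np.diff(series), 0.25

naive = delta * np.round(dx / delta)          # memoryless rounding
shaped = sigma_delta_quantize(dx, delta)

# Reconstruction = integrate the quantized derivative back up.
print(np.abs(np.cumsum(naive) - np.cumsum(dx)).max())   # drifts with T
print(np.abs(np.cumsum(shaped) - np.cumsum(dx)).max())  # <= delta/2 = 0.125
```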
- MASH $k$-th-order $\Sigma\Delta$: stable noise-shaping at arbitrary order $k$, tight $\Delta/2$ reconstruction bound uniformly in $T$. Falls back to the classical carry scheme when $V \le 2^k + 2$.
- Gaussian ordinal soft-CE: replaces one-hot targets with a discrete Gaussian over bin indices, $p^\text{soft}_i \propto \exp(-(i - y)^2 / 2\sigma^2)$. Teaches the model that "off by one bin is almost right" — matches the ordinal structure of derivative-quantized tokens. On M4 Hourly: sMAPE $18.58 \to 17.49$ at $\sigma = 2$ with no architectural change. A minimal sketch of these targets follows this list.
- Per-column derivative order: `order_of_derivative` can be an array of shape $(F,)$ giving each channel its own $k_c$. Mix $k = 0$ (raw) with $k = 1, 2, \ldots$ in a single model.
- GPT architecture niceties: RoPE, RMSNorm (no learnable scale), SwiGLU with soft limit, tied embeddings, logit softcap, Block Attention Residuals (Chen et al. 2026), Schedule-Free AMSGrad optimizer.
- Inference engine: KV-cache-accelerated generation, position-aware `force_schedule` for teacher-forced decoding (TimeXer-style exogenous covariates), best-val checkpoint auto-restore.
- NUMA-aware training: single-process CPU-affinity pin via `DiffGPT.train(pin_node=...)`, or one-worker-per-node DDP via `DiffGPT.train_numa` on multi-socket boxes.
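A minimal sketch of the soft-CE targets promised above (illustrative only; in the library this is switched on via `GPT(label_smoothing_sigma=...)`):

```python
import torch
import torch.nn.functional as F

def ordinal_soft_targets(y: torch.Tensor, vocab_size: int, sigma: float) -> torch.Tensor:
    """Discrete Gaussian over bin indices centered on the true bin.

    y: (B,) integer targets  ->  (B, V) soft target distributions.
    """
    idx = torch.arange(vocab_size, dtype=torch.float32)                  # (V,)
    logits = -((idx[None, :] - y[:, None].float()) ** 2) / (2 * sigma**2)
    return logits.softmax(dim=-1)   # normalizes p_i ∝ exp(-(i - y)^2 / 2σ²)

def soft_cross_entropy(model_logits: torch.Tensor, y: torch.Tensor, sigma: float) -> torch.Tensor:
    p = ordinal_soft_targets(y, model_logits.size(-1), sigma)
    return -(p * F.log_softmax(model_logits, dim=-1)).sum(dim=-1).mean()

# sigma -> 0 recovers plain one-hot CE; sigma = 2 was the M4 Hourly sweet spot.
loss = soft_cross_entropy(torch.randn(8, 256), torch.randint(0, 256, (8,)), sigma=2.0)
```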
A minimal end-to-end example:

```python
import numpy as np, pandas as pd, torch

from diff_gpt.diff_gpt import DiffGPT
from diff_gpt.model.gpt import GPT
from diff_gpt.data_loader import DiffDataFrameDataLoader
from diff_gpt.encoder_decoder import get_domain_of_definition
from diff_gpt.sampler.temperature import TemperatureSampler

# Your data as DataFrames (one per series).
dfs: list[pd.DataFrame] = [...]

# Compute domain (per-channel max |k-th diff|) from training data.
order = 1  # scalar, OR np.array([0, 1, 2]) for per-column
all_data = np.concatenate([df.to_numpy(dtype=np.float64) for df in dfs], axis=0)
domain = get_domain_of_definition(
    inp=all_data, order_of_derivative=order, use_decimal=False,
)
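# (For intuition, not the API: with order = 1 the domain is essentially the
# per-channel max |first difference|, i.e.
# np.abs(np.diff(all_data, axis=0)).max(axis=0).)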
# Construct GPT + DiffGPT wrapper.
base = GPT(
    vocab_size=256,
    n_embd=64,
    block_size=138,
    n_head=4,
    n_layer=2,
    label_smoothing_sigma=2.0,  # ordinal soft-CE
)
model = DiffGPT(
    model=base,
    order_of_derivative=order,
    domain_of_definition=domain,
    use_decimal=False,
)

loader = DiffDataFrameDataLoader(
    dfs=dfs,
    block_size=138,
    batch_size=32,
    vocab_size=256,
    order_of_derivative=order,
    domain_of_definition=domain,
    use_decimal=False,
    device="cuda",
    train_part=0.8,
)
model.train(loader=loader, max_iters=20_000, eval_interval=500)

# Point forecast (argmax).
context = dfs[0].iloc[-96:]
prediction = model.predict(
    df=context,
    max_new_points=48,
    sampler=TemperatureSampler(temperature=0.0),
)
```

Every trained model is already a probabilistic forecaster — the token head
outputs a calibrated distribution at each step (cross-entropy is a strictly
proper scoring rule). Point argmax is one readout; `num_samples` independent
temperature-1 trajectories give Monte-Carlo samples from the joint future
distribution, which we can reduce to any set of quantiles.
```python
from diff_gpt.sampler.temperature import TemperatureSampler
from diff_gpt.sampler.nucleus import NucleusSampler

# N Monte-Carlo trajectories, shape (N, H, F).
# Plain temperature-1 sampling — the most diverse readout.
samples = model.predict_samples(
    df=context,
    max_new_points=48,
    num_samples=100,
    sampler=TemperatureSampler(temperature=1.0),
)

# Nucleus top-p on top of temperature: drop the low-probability tail of
# each per-step distribution before sampling. Tightens forecast bands
# without any retraining (−25% MSIS on M4 Hourly at p=0.9).
nucleus = NucleusSampler(
    p=0.9, sampler=TemperatureSampler(temperature=1.0),
)
samples_nucleus = model.predict_samples(
    df=context, max_new_points=48, num_samples=100, sampler=nucleus,
)

# Or directly as quantile bands {0.1, 0.5, 0.9}: each a DataFrame of shape (H, F).
bands = model.predict_quantiles(
    df=context,
    max_new_points=48,
    quantiles=(0.1, 0.5, 0.9),
    num_samples=100,
    sampler=nucleus,
)
p10, p50, p90 = bands[0.1], bands[0.5], bands[0.9]
```

No architectural change and no auxiliary pinball-loss head: one model serves arbitrary quantile sets at inference time. Since the model predicts the derivative and the decoder integrates, confidence bands naturally widen with horizon as an integral of random-walk uncertainty.
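The same reduction done by hand (a sketch; `samples` is the `(N, H, F)` array from above) shows there is nothing exotic inside:

```python
import numpy as np

arr = np.asarray(samples)                  # (N, H, F) Monte-Carlo trajectories
p10, p50, p90 = np.quantile(arr, [0.1, 0.5, 0.9], axis=0)  # each (H, F)

# The 80% band widens with horizon: integrated derivative noise accumulates
# like a random walk, so (p90 - p10) grows roughly like sqrt(h).
band_width = (p90 - p10).mean(axis=1)      # (H,)
```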
First-pass M4 Hourly numbers (same model as the point-forecast row, no probabilistic-specific tuning; 414 series, horizon 48, N = 100 samples):
| metric | T = 1 | T = 1, top-p = 0.9 |
|---|---|---|
| sMAPE (median point) | 24.18 | 22.02 |
| MASE (median point) | 2.94 | 2.41 |
| CRPS | 332.0 | 293.6 |
| MSIS (α = 0.05) | 41.1 | 30.8 (−25%) |
| coverage@80 (nominal 0.80) | 0.946 | 0.919 |
Nucleus top-p = 0.9 at inference time (zero retraining) tightens every metric; MSIS improves by 25%. The remaining over-coverage reflects the fact that the model was tuned for argmax sMAPE; conformal calibration or a CRPS-tuned $\sigma$ / temperature should close most of the remaining gap to the M4 probabilistic winners (MSIS ≈ 13–15).
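For reference, a sample-based CRPS estimate in the standard energy form (a sketch of how such numbers can be computed from the `(N, H)` trajectories of a single channel, not necessarily the exact benchmark script):

```python
import numpy as np

def crps_from_samples(samples: np.ndarray, y: np.ndarray) -> float:
    """CRPS(F, y) ≈ E|X - y| - 0.5 E|X - X'| with X, X' ~ F independent.

    samples: (N, H) Monte-Carlo trajectories, y: (H,) ground truth.
    Returns the CRPS averaged over the H horizon steps.
    """
    term1 = np.abs(samples - y[None, :]).mean()
    term2 = np.abs(samples[:, None, :] - samples[None, :, :]).mean()
    return float(term1 - 0.5 * term2)
```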
Cross-entropy training also gives a calibrated per-step conditional
log-likelihood for free. Feed an observed signal through a single
forward pass and take the per-token negative log-likelihood as a surprise score:
```python
# df: observed multivariate signal, shape (T, F).
scores = model.anomaly_scores(df)

# scores: DataFrame, same shape and columns as df.
# First max_k rows are NaN (the derivative prefix); the first encoded
# token at column 0 is also NaN (no preceding context).
row_surprise = scores.sum(axis=1, skipna=True)  # per-step aggregate
anomalies = row_surprise[row_surprise > row_surprise.quantile(0.99)]
```

Regression-based forecasters have to approximate this with residual magnitudes under an assumed noise model; here the surprise score falls out of the trained likelihood directly.
ETTh1 oil-temperature demo (`benchmarks/anomaly_demo.py`):
train on 2880 clean hours, inject 5 synthetic ±5σ spikes into a 336-hour
test window, and rank positions by NLL. After only 2000 training iterations
the top-10 contains 3 of the 5 injected spikes (60% recall); the rest of
the top-10 are genuine irregular transitions in the original, un-injected signal.
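A sketch of that protocol (variable names hypothetical, and `"OT"` assumes the standard ETT oil-temperature column; the actual script is `benchmarks/anomaly_demo.py`):

```python
import numpy as np

rng = np.random.default_rng(0)

# test_df: 336-hour clean window; model trained on the 2880 clean hours.
spiked = test_df.copy()
pos = rng.choice(len(spiked), size=5, replace=False)
signs = rng.choice([-1.0, 1.0], size=5)
spiked.loc[spiked.index[pos], "OT"] += signs * 5 * test_df["OT"].std()  # ±5σ

scores = model.anomaly_scores(spiked)        # per-step NLL
top10 = scores["OT"].nlargest(10).index
recall = len(set(top10) & set(spiked.index[pos])) / 5
print(f"top-10 recall on injected spikes: {recall:.0%}")
```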
PyTorch 2's `torch.compile` fuses the GPT forward/backward into a handful
of kernels — a sizable step-time speedup at this model scale, once the
one-time graph-tracing cost is amortized. Two rules:

- Disable gradient checkpointing. `use_checkpoint=True` introduces graph breaks that defeat the fusion. Pass `use_checkpoint=False` to `GPT` when compiling.
- Compile the inner `GPT`, not the `DiffGPT` wrapper. `DiffGPT` is a numpy-side encoder/decoder shell around an `nn.Module`; `torch.compile` only operates on the latter. Compile before constructing the wrapper.
```python
base = GPT(
    vocab_size=256, n_embd=64, block_size=138, n_head=4, n_layer=2,
    label_smoothing_sigma=2.0,
    use_checkpoint=False,  # required alongside torch.compile
)

# GPU: max-autotune without cudagraphs — peak Triton/Inductor tuning,
# no CUDA-graph capture (which breaks under variable shapes, DDP, and
# KV-cache inference). CPU: max-autotune enables Inductor's CPU GEMM
# template / tile sweep; cudagraphs is a no-op on CPU either way.
compile_mode = "max-autotune-no-cudagraphs" if torch.cuda.is_available() else "max-autotune"
base = torch.compile(base, mode=compile_mode)

model = DiffGPT(
    model=base, order_of_derivative=order,
    domain_of_definition=domain, use_decimal=False,
)
model.train(loader=loader, max_iters=20_000, eval_interval=500)
```

Combined with `DiffGPT.train_numa`, compile inside the factory so every
worker DDP-wraps an already-compiled module:
```python
def build_gpt(rank: int) -> GPT:
    gpt = GPT(vocab_size=256, n_embd=64, block_size=138, n_head=4,
              n_layer=2, use_checkpoint=False)
    mode = "max-autotune-no-cudagraphs" if torch.cuda.is_available() else "max-autotune"
    return torch.compile(gpt, mode=mode)
```

The first 1–2 training iterations are dominated by tracing and kernel
compilation; benchmark step time from iteration 10 onward for a fair number.
Inference (`predict`, `predict_samples`, `anomaly_scores`) still works
on a compiled model — `OptimizedModule` delegates attribute access — but
the KV-cache generation path grows the sequence length by one token per
step, so the compiled graph is recompiled or specialized frequently and
the speedup is smaller than at training time.
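A minimal timing harness consistent with that advice (a sketch: it assumes a nanoGPT-style `model(x, y) -> (logits, loss)` forward, and uses plain AdamW in place of the library's Schedule-Free optimizer):

```python
import time
import torch

def bench_step(model: torch.nn.Module, xb, yb, iters: int = 50, warmup: int = 10) -> float:
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    t0 = 0.0
    for i in range(warmup + iters):
        if i == warmup:                     # skip tracing/compile iterations
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            t0 = time.perf_counter()
        _, loss = model(xb, yb)             # assumed nanoGPT-style signature
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters  # mean seconds per step

# compare: bench_step(eager_gpt, xb, yb) vs bench_step(compiled_gpt, xb, yb)
```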
On multi-socket CPU boxes, cross-socket memory traffic and thread migration dominate wall-clock for this size of model. Two entry points:
Single-process pin. Pass a NUMA node id to `train()` — it pins CPU affinity
and aligns `torch.set_num_threads` / `OMP_NUM_THREADS` to that node's core
count before the hot loop:

```python
model.train(loader=loader, max_iters=20_000, pin_node=0)
```

One process per NUMA node (DDP). For linear scaling across sockets,
`DiffGPT.train_numa` spawns one worker per detected NUMA node, pins each to
its CPUs, and wraps the inner `GPT` in `DistributedDataParallel`. Both
factories run inside the pinned worker so allocations land on local
memory; seed the loader by rank so workers draw different batches
(otherwise DDP averages identical gradients).
```python
import torch

from diff_gpt.diff_gpt import DiffGPT
from diff_gpt.model.gpt import GPT
from diff_gpt.data_loader import DiffDataFrameDataLoader

def build_gpt(rank: int) -> GPT:
    return GPT(vocab_size=256, n_embd=64, block_size=138,
               n_head=4, n_layer=2, label_smoothing_sigma=2.0)

def build_loader(rank: int, world: int) -> DiffDataFrameDataLoader:
    rng = torch.Generator().manual_seed(42 + rank)  # rank-dependent batches
    return DiffDataFrameDataLoader(
        dfs=dfs, block_size=138, batch_size=32, vocab_size=256,
        order_of_derivative=order, domain_of_definition=domain,
        use_decimal=False, device="cpu", train_part=0.8, rng=rng,
    )

val_loss, _ = DiffGPT.train_numa(
    model_factory=build_gpt,
    loader_factory=build_loader,
    train_kwargs={"max_iters": 20_000, "eval_interval": 500},
    checkpoint_path="best.pt",
)

# Rebuild DiffGPT around the trained weights:
gpt = build_gpt(0)
gpt.load_state_dict(torch.load("best.pt"))
model = DiffGPT(model=gpt, order_of_derivative=order,
                domain_of_definition=domain, use_decimal=False)
```

Backend: NCCL on CUDA, Gloo on CPU. Single-node systems short-circuit to the in-process pinning path — no spawn, no DDP overhead.
M4 Hourly: 414 series, horizon 48, per-series z-normalization, single global model, 20k iters, argmax inference.
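For reference, the metrics below are assumed to follow the standard M4 definitions (seasonal period $m = 24$ for hourly data):

$$\mathrm{sMAPE} = \frac{200}{h}\sum_{t=1}^{h}\frac{|y_t - \hat y_t|}{|y_t| + |\hat y_t|}, \qquad \mathrm{MASE} = \frac{\tfrac{1}{h}\sum_{t=1}^{h}|y_t - \hat y_t|}{\tfrac{1}{n-m}\sum_{t=m+1}^{n}|y_t - y_{t-m}|}.$$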
| config | sMAPE | MASE |
|---|---|---|
| baseline (one-hot) | 18.58 | — |
| + soft-CE, smaller $\sigma$ | 17.55 | 1.663 |
| + soft-CE, $\sigma = 2$ | 17.49 | 1.659 |
| + soft-CE, larger $\sigma$ | 17.35 | 1.949 (over-smoothed) |
Vocab sweep (U-curve, optimum at $V = 256$):

| $V$ | $\sigma$ | sMAPE |
|---|---|---|
| 64 | 0.5 | 19.96 |
| 128 | 1.0 | 19.39 |
| 256 | 2.0 | 17.49 |
| 512 | 4.0 | 18.21 |
Higher-order tokenization, ETTh1: 7 channels, seq_len = 96, pred_len = 96, chronological 12/4/4-month split.

| config | MSE | MAE |
|---|---|---|
|  | 0.9861 | 0.6056 |
|  | 0.9787 | 0.6030 |
|  | 0.9806 | 0.6034 |
ETTh1 is data-limited (~2 tokens/parameter), not loss-limited. Soft targets help most in data-starved regimes.
Install:

```
pip install git+https://github.com/fxeqxmulfx/diff-gpt
```

Development (sync the environment, run the tests):

```
uv sync
uv run --no-sync pytest
```

- youtube.com: Let's build GPT: from scratch, in code, spelled out
- github.com: nanoGPT
- github.com: gpt-oss
- github.com: gemma_pytorch
- github.com: schedule_free
- github.com: nanochat
- github.com: Attention-Residuals
- github.com: Time-Series-Library
- arxiv.org: Attention Is All You Need
- arxiv.org: Root Mean Square Layer Normalization
- arxiv.org: RoFormer: Enhanced Transformer with Rotary Position Embedding
- arxiv.org: GLU Variants Improve Transformer
- arxiv.org: The Road Less Scheduled
- arxiv.org: The Curious Case of Neural Text Degeneration
- arxiv.org: Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
- arxiv.org: iTransformer: Inverted Transformers are Effective for Time Series Forecasting
- arxiv.org: Rethinking the Inception Architecture for Computer Vision (label smoothing)