
HPO4TabPFN

Scaling Down TabPFN: How Model Size Affects Optimal Training Configurations

Institution: University of Freiburg
Course: Deep Learning Lab (DLL25)
Supervisor: Johannes Hog
Tags: TabPFN, NePS, Muon, AdamW, Lion, SHAP, TabArena, PyTorch

Overview

How do you scale down a TabPFN? And which hyperparameters matter most at different scales?

TabPFN is a transformer-based model that solves small tabular classification problems in under a second—but it's large. This study investigates how to create smaller, more efficient versions (NanoTabPFN) while maintaining competitive performance.
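For context, TabPFN ships with a scikit-learn-style interface. The snippet below is a minimal sketch of that workflow on Breast Cancer, one of this study's evaluation datasets; it assumes the public tabpfn package and its standard fit/predict API, not any code from this project.

```python
# Minimal TabPFN usage sketch (assumes the public `tabpfn` package).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()          # pretrained transformer; no gradient training here
clf.fit(X_train, y_train)         # "fit" essentially stores the context set
print(clf.score(X_test, y_test))  # prediction is a single forward pass
```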

Using random search with no prior assumptions, we systematically explored how scaling strategy (width vs. depth vs. compound), model size, and optimizer choice affect both performance and hyperparameter sensitivity. Our evaluation used the TabArena benchmark alongside classic datasets (Iris, Wine, Breast Cancer).

The key finding: smaller and deeper models can outperform larger, shallower configurations, challenging the assumption that more parameters always mean better performance.

Model Scale Configurations

We tested four model scales with different width/depth tradeoffs:

Scale   Strategy  Layers  Embedding  MLP Hidden  Parameters
Big     Base      8       192        768         4.9M
Medium  Width     4       192        768         2.6M
Medium  Depth     8       140        560         2.6M
Medium  Compound  6       160        640         2.6M
Small   Width     2       192        768         1.35M
Small   Depth     8       100        400         1.35M
Small   Compound  4       140        560         1.35M
Mini    Compound  6       64         192         0.37M
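These parameter counts can be sanity-checked with a back-of-envelope estimate. The sketch below assumes TabPFN-style layers with two attention blocks (across features and across samples) plus a two-layer MLP, and ignores embeddings, encoders, and norms, so it lands slightly below the table's totals.

```python
# Back-of-envelope parameter count for one NanoTabPFN scale, assuming
# two attention blocks per layer (inter-feature and inter-sample),
# each with Q/K/V/O projections, plus a two-layer MLP. Embeddings,
# encoders/decoders, and norms are ignored.
def approx_params(layers: int, d_embed: int, d_mlp: int) -> int:
    attention = 8 * d_embed ** 2   # 2 blocks x 4 projections x d^2
    mlp = 2 * d_embed * d_mlp      # up- and down-projection
    return layers * (attention + mlp)

print(f"{approx_params(8, 192, 768) / 1e6:.2f}M")  # ~4.72M vs. 4.9M (Big)
print(f"{approx_params(6, 64, 192) / 1e6:.2f}M")   # ~0.34M vs. 0.37M (Mini)
```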

Key Findings

  • Compound Small (1.35M) outperformed larger models—the Pareto-optimal configuration wasn't the biggest one
  • Muon optimizer dominated—consistently beat AdamW, Lion, and AdamW Schedule-Free across all scales
  • Deeper > Wider—at equal parameter counts, deeper models slightly outperformed wider ones
  • Batch size correlates with scale—larger models benefited from larger batch sizes
  • Learning rate is critical—SHAP analysis showed LR and model scale as the most important hyperparameters
  • Smaller models need higher LR—inverse relationship between model size and optimal learning rate

Results

Performance was evaluated on the TabArena benchmark. The Compound Small configuration achieved the best accuracy-to-parameters ratio.

1st: Compound Small
2nd: Depth Small
3rd: Width Small
Mean real-data accuracy: 0.97

SHAP Hyperparameter Importance

SHAP analysis across all model configurations revealed the relative importance of each hyperparameter:

  • Learning Rate (LR): Highest importance scores (46.4 average)—the single most critical hyperparameter
  • Optimizer: Second most important (36.6 average)—Muon's superiority was consistent
  • Batch Size: Lower but scale-dependent importance (17.0 average)—matters more for larger models

Interestingly, smaller models showed higher sensitivity to learning rate, while larger models were more robust to LR variations but more sensitive to batch size choices.
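The poster does not detail the SHAP pipeline itself, but a common recipe for hyperparameter importance is to fit a surrogate model on the HPO trials and average absolute SHAP values per hyperparameter. The sketch below follows that recipe; the random-forest surrogate and the column names are illustrative assumptions, not taken from the study.

```python
# Hedged sketch: SHAP-based hyperparameter importance over HPO trials.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

def shap_importance(configs: pd.DataFrame, scores: np.ndarray) -> pd.Series:
    """configs: one row per trial (e.g. log_lr, batch_size, optimizer_id,
    model_scale); scores: validation accuracy per trial."""
    surrogate = RandomForestRegressor(n_estimators=300, random_state=0)
    surrogate.fit(configs, scores)
    shap_values = shap.TreeExplainer(surrogate).shap_values(configs)
    # Mean |SHAP| per column = global importance of that hyperparameter.
    return pd.Series(np.abs(shap_values).mean(axis=0), index=configs.columns)
```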

Methodology

The experimental pipeline consisted of:

  • Search Space: Learning rate sampled log-uniformly from [1e-5, 1e-2], batch size {32, 64, 128, 256}, optimizer {AdamW, Muon, Lion, AdamW Schedule-Free}
  • HPO Framework: Random search via NePS with no prior assumptions, ensuring unbiased exploration (see the sketch after this list)
  • Validation: Synthetic validation dataset (1600 samples) for HPO, real datasets (Iris, Wine, Breast Cancer) for final evaluation
  • Benchmark: TabArena for standardized comparison across configurations
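As referenced in the list above, the search space can be declared with NePS's dict-based API. The sketch below uses argument names from recent NePS releases (they have changed between versions), and the evaluation function is a stub standing in for the actual NanoTabPFN training loop.

```python
# Hedged sketch of the HPO search space in NePS (argument names follow
# recent releases; older versions use e.g. neps.FloatParameter).
import neps

def evaluate_pipeline(learning_rate: float, batch_size: int, optimizer: str) -> float:
    # Train NanoTabPFN with this configuration on the synthetic
    # validation set and return the validation loss.
    raise NotImplementedError("plug in the NanoTabPFN training loop here")

pipeline_space = dict(
    learning_rate=neps.Float(lower=1e-5, upper=1e-2, log=True),
    batch_size=neps.Categorical(choices=[32, 64, 128, 256]),
    optimizer=neps.Categorical(
        choices=["adamw", "muon", "lion", "adamw_schedule_free"]
    ),
)

neps.run(
    evaluate_pipeline=evaluate_pipeline,
    pipeline_space=pipeline_space,
    root_directory="hpo_results",
    max_evaluations_total=100,  # budget is illustrative
)
# The study used random search; selecting the searcher explicitly is
# version-dependent in NePS and omitted here.
```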

Research Poster

DLL25 - Team 31 - HPO4TabPFN (poster PDF)

Team

Vladyslav Moroshan (Co-Author)
Nastaran Alipour (Co-Author)
Jude Mingay (Co-Author)
Johannes Hog (Supervisor)