SERIESFUSION.AI
Science Discovery for Humans | Curated by AI & Humans
Scaling Insight
28 papers
Speculative Decoding Scaling Laws (SDSL) provides a theoretical framework to predict optimal throughput hyperparameters for LLM inference systems before pre-training.
AI & ML
arxiv | Mar 13
Cyber-attack capabilities of AI models scale log-linearly with inference-time compute, with no plateau in sight.
AI & ML
arxiv | Mar 13
Adversarial prompt injection causes jailbreak success rates to transition from polynomial to exponential scaling with inference-time samples.
AI & ML
arxiv | Mar 13
Applying Rotary Positional Embeddings (RoPE) to only 10% of hidden dimensions is sufficient for full model convergence, enabling 10x memory savings in positional caches.
AI & ML
arxiv | Mar 13
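If rotating only a small slice of dimensions suffices, the position-dependent part of the cache only needs to cover that slice. A minimal pure-Python sketch of partial RoPE (the function name, 10% fraction, and base are illustrative assumptions, not the paper's implementation):

```python
import math

def apply_partial_rope(x, position, rope_frac=0.1, base=10000.0):
    """Rotate only the first `rope_frac` of dimensions; pass the rest through.

    x: one token's hidden vector as a flat list of floats.
    Only the rotated slice carries positional phase, so a cache can store
    the remaining ~90% of dimensions position-independently.
    """
    d = len(x)
    d_rope = int(d * rope_frac)
    d_rope -= d_rope % 2  # rotary pairs need an even count
    out = list(x)
    for i in range(0, d_rope, 2):
        theta = position / (base ** (i / d_rope))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

At position 0 the rotation is the identity, and dimensions beyond the rotated slice are never touched.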
Provides a learning-theoretic characterization of model collapse, proving exactly when replaying past outputs destroys model diversity.
AI & ML
arxiv | Mar 13
Exhaustive circuit mapping of a biological foundation model reveals massive redundancy and annotation bias.
AI & ML
arxiv | Mar 13
Establishes scaling laws for sampling compute in LLM Reinforcement Learning, providing a playbook for optimal parallel rollout and batch allocation.
AI & ML
arxiv | Mar 13
Discovers that as LLMs scale, their complex non-linear depth dynamics converge into accurate, low-order linear surrogates.
AI & ML
arxiv | Mar 16
Longitudinal evidence shows that output diversity is shrinking across successive ChatGPT versions, suggesting potential model collapse from synthetic data saturation.
AI & ML
arxiv | Mar 16
Adversarial test case evolution improves code reinforcement learning by creating harder, more discriminative verification signals that drive better model performance.
AI & ML
arxiv | Mar 16
Proves the existence of a 'distributional simplicity bias' in diffusion models, where low-order statistics are learned linearly while high-order correlations require cubic sample complexity.
AI & ML
arxiv | Mar 16
Factual selection in LLMs is driven by rotational dynamics on a hypersphere rather than scalar magnitude shifts, with the behavior emerging suddenly at the 1.6B parameter mark.
AI & ML
arxiv | Mar 17
Grokking is driven by a norm-driven representational phase transition with a predictable scaling law.
AI & ML
arxiv | Mar 17
Challenges the monotonic 'bigger is better' scaling paradigm by proving that institutional fitness peaks at an environment-dependent scale.
AI & ML
arxiv | Mar 17
Proposes spectral clipping to stabilize LLM training by addressing 'spectral spikes' in stochastic gradient noise that adaptive optimizers like AdamW fail to handle.
AI & ML
arxiv | Mar 17
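The idea can be sketched as bounding the top singular value of a gradient matrix. This pure-Python version estimates it by power iteration and rescales the matrix when it spikes above a threshold; the function and the global-rescaling rule are hypothetical stand-ins for the paper's exact operator:

```python
import math
import random

def spectral_clip(grad, max_sigma, iters=50, seed=0):
    """Rescale a gradient matrix so its largest singular value is <= max_sigma.

    grad: m x n matrix as a list of lists. The top singular value is
    estimated by power iteration on G^T G, then the matrix is scaled down
    if a 'spectral spike' exceeds the threshold.
    """
    m, n = len(grad), len(grad[0])
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iters):
        # one power-iteration step: v <- normalize(G^T (G v))
        w = [sum(grad[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [sum(grad[i][j] * w[i] for i in range(m)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    w = [sum(grad[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = math.sqrt(sum(x * x for x in w))  # top singular value estimate
    if sigma <= max_sigma:
        return grad
    scale = max_sigma / sigma
    return [[x * scale for x in row] for row in grad]
```

Unlike AdamW's per-coordinate normalization, this acts on the matrix spectrum as a whole, which is the failure mode the summary attributes to adaptive optimizers.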
Introduces Matrix-to-Matrix RNNs (M²RNN) with matrix-valued hidden states that outperform hybrid Transformers while using 3x smaller state sizes.
AI & ML
arxiv | Mar 17
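A matrix-valued hidden state packs n² scalars into each recurrence step. A generic sketch of one such update (the tanh cell and parameter names are assumptions for illustration; the summary does not specify M²RNN's actual gating or normalization):

```python
import math

def m2rnn_step(H, X, A, B):
    """One step of a matrix-to-matrix recurrence: H' = tanh(A @ H + X @ B).

    H (hidden state), X (input), A, B (parameters) are n x n matrices
    given as lists of lists.
    """
    n = len(H)

    def matmul(P, Q):
        return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    AH = matmul(A, H)
    XB = matmul(X, B)
    return [[math.tanh(AH[i][j] + XB[i][j]) for j in range(n)]
            for i in range(n)]
```

Because the state is a matrix rather than a vector, an n x n state carries n² values while the recurrence weights stay n x n, which is the capacity-per-parameter argument behind the smaller state sizes.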
The Infinite Problem Generator (IPG) uses executable code to synthesize and verify 100% accurate physics reasoning data, overcoming LLM hallucination in data scaling.
AI & ML
arxiv | Mar 17
Determines the optimal compute distribution for retrieval agents, showing that re-ranking depth is far more critical than query expansion strength.
AI & ML
arxiv | Mar 17
Provides the first theoretical proof that dataset distillation efficiently encodes the low-dimensional structure of non-linear tasks.
AI & ML
arxiv | Mar 17
Attention Residuals replace fixed-weight residual connections with softmax attention over preceding layers to prevent hidden-state dilution in deep LLMs.
AI & ML
arxiv | Mar 17
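The mechanism can be sketched as a softmax-weighted sum over all preceding layer outputs in place of the usual fixed identity skip. The per-layer scalar logits here are an illustrative simplification; the paper's gating may be computed per token:

```python
import math

def attention_residual(layer_outputs, scores):
    """Mix preceding layer outputs with softmax weights.

    layer_outputs: list of hidden vectors (lists of floats), one per
    earlier layer. scores: learned logits, one per layer. A fixed-weight
    residual corresponds to weights frozen at one-hot / uniform values.
    """
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(layer_outputs[0])
    return [sum(w * h[i] for w, h in zip(weights, layer_outputs))
            for i in range(dim)]
```

Letting deeper layers re-weight earlier states is what prevents the hidden stream from being diluted by many near-identity additions.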
This paper proves that increasing test-time compute via beam search can actually hurt LLM reasoning performance due to overestimation bias.
AI & ML
arxiv | Mar 17
Sparsity (MoE and GQA) is found to act as a critical regulator for variance propagation, mitigating the 'curse of depth' in LLMs.
AI & ML
arxiv | Mar 17
A factorial study on EHR foundation models reveals that joint encoding of code-attribute pairs (local binding) is the primary driver of performance and efficiency.
AI & ML
arxiv | Mar 18
Spectral Edge Dynamics (SED) provides an early-warning signal for grokking, predicting generalization up to 1,700 steps before it occurs.
AI & ML
arxiv | Mar 18
Demonstrates that massive scaling of diverse simulator resets can replace manual curriculum engineering for complex dexterous manipulation.
AI & ML
arxiv | Mar 18
Derives closed-form power-law scaling for hyperparameters like learning rate and batch size using modern optimization theory rather than expensive empirical sweeps.
AI & ML
arxiv | Mar 18
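In spirit, such closed-form rules let hyperparameters be computed from model size directly instead of swept. A toy sketch with placeholder constants (the reference values and exponents below are invented for illustration, not the paper's fitted law):

```python
def scaled_hparams(n_params, lr_ref=3e-4, n_ref=1e8, lr_exp=-1/3,
                   bs_ref=256, bs_exp=1/3):
    """Power-law extrapolation of learning rate and batch size.

    lr scales as N^lr_exp and batch size as N^bs_exp relative to a
    reference model of n_ref parameters. All constants are hypothetical.
    """
    ratio = n_params / n_ref
    lr = lr_ref * ratio ** lr_exp
    bs = round(bs_ref * ratio ** bs_exp)
    return lr, bs
```

With a rule like this, moving to an 8x larger model would halve the learning rate and double the batch size under the assumed cube-root exponents.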
Provides a geometric 'manifold envelopment' framework to explain why unsupervised RL for mathematical reasoning often collapses and how to stabilize it.
AI & ML
arxiv | Mar 18
The study provides a formal link showing that internal 'world model' representations in transformers are a direct byproduct of the predictive geometry of the training data.
AI & ML
arxiv | Mar 18