Efficiency Breakthrough

Achieves an 80% reduction in Chain-of-Thought (CoT) tokens while slightly increasing reasoning accuracy.

Extends LLM context from 32K to 128K by teaching models to selectively skip global attention for ~80% of tokens.

Knowledge-Aware Active Learning (KA2L) uses latent space probing to identify what an LLM doesn't know and generates targeted synthetic questions.

S-VGGT introduces structure-aware subscene decomposition to break the quadratic scaling bottleneck of 3D foundation models.

DSS-GAN is the first generative adversarial network to use a Mamba (State Space Model) backbone for high-quality image synthesis.

Synthetic videos of simple geometric shapes are more effective than massive real-world datasets for teaching video-language models fundamental temporal reasoning.

Anomaly detection can be performed directly using a primary model's internal neuron output ranges, eliminating the need for expensive external AD models.
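A minimal sketch of the range-based idea, under stated assumptions: the toy layer in `hidden_activations`, its `WEIGHTS`, and the helpers `fit_ranges` and `anomaly_score` are hypothetical and not taken from the paper. It records each neuron's minimum and maximum activation on normal inputs, then scores a new input by the fraction of neurons whose activation leaves that range.

```python
# Hypothetical toy layer: two inputs feeding three ReLU neurons with fixed weights.
WEIGHTS = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]


def hidden_activations(x):
    """Return the ReLU activations of the toy layer for a 2-D input."""
    return [max(0.0, w0 * x[0] + w1 * x[1]) for w0, w1 in WEIGHTS]


def fit_ranges(normal_inputs):
    """Record the (min, max) activation of every neuron over normal data."""
    per_input = [hidden_activations(x) for x in normal_inputs]
    per_neuron = zip(*per_input)
    return [(min(values), max(values)) for values in per_neuron]


def anomaly_score(x, ranges):
    """Fraction of neurons whose activation falls outside its recorded range."""
    acts = hidden_activations(x)
    outside = sum(1 for a, (lo, hi) in zip(acts, ranges) if a < lo or a > hi)
    return outside / len(ranges)


normal = [(0.1, 0.2), (0.4, 0.3), (0.5, 0.5), (0.2, 0.6)]
ranges = fit_ranges(normal)
in_range_score = anomaly_score((0.3, 0.4), ranges)
out_of_range_score = anomaly_score((5.0, 5.0), ranges)
```

The appeal of this kind of check is that it reuses activations the primary model already computes, so no second model has to be trained or served. How the paper calibrates or thresholds the ranges is not stated here.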

Truncated backpropagation for video decoding reduces the memory cost of fine-tuning video diffusion models from linear to constant.

ProbeFlow achieves 14.8x faster action decoding in Vision-Language-Action (VLA) models without any retraining.

Parallel multi-token prediction can be achieved in standard LLMs without training auxiliary models or modifying weights.

CARE provides a recipe for converting standard GQA models into high-efficiency Multi-head Latent Attention (MLA) architectures.

VideoAtlas enables navigation and reasoning over long-form video using compute that scales only logarithmically with video length.

MUD provides a faster, lower-overhead alternative to Muon for transformer training, achieving up to 2.6x higher throughput.

LoST introduces a semantic-first 3D tokenizer that reduces the token count for 3D shape generation by up to 99.9%.

RSM achieves 20x faster training for recursive reasoning models and enables test-time scaling for up to 20,000 refinement steps.

Reduces high-quality 3D head avatar creation time from over 24 hours to 0.5 seconds per frame.

Fuses categorical sampling into the LM-head matmul to eliminate logit materialization and speed up LLM decoding by up to 19%.
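One way such a fusion could work is the Gumbel-max trick: adding independent Gumbel noise to each logit and taking the argmax yields a sample from the softmax distribution, and an argmax can be tracked tile by tile. The sketch below is an assumption for illustration, not the paper's kernel; `sample_fused`, `sample_materialized`, and the `tile` parameter are hypothetical names, and plain Python lists stand in for GPU tiles.

```python
import math
import random


def gumbel_noise(rng):
    """Draw one standard Gumbel sample; the clamp avoids log(0)."""
    u = max(rng.random(), 1e-300)
    return -math.log(-math.log(u))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sample_materialized(hidden, lm_head, rng):
    """Reference path: build the full logit vector, then sample via Gumbel-max."""
    logits = [dot(hidden, row) for row in lm_head]
    scores = [logit + gumbel_noise(rng) for logit in logits]
    return max(range(len(scores)), key=scores.__getitem__)


def sample_fused(hidden, lm_head, rng, tile=4):
    """Fused path: walk the vocabulary in tiles, keeping only a running argmax."""
    best_score, best_token = -math.inf, -1
    for start in range(0, len(lm_head), tile):
        for token in range(start, min(start + tile, len(lm_head))):
            score = dot(hidden, lm_head[token]) + gumbel_noise(rng)
            if score > best_score:
                best_score, best_token = score, token
    return best_token


# Hypothetical toy sizes: hidden width 8, vocabulary of 10 tokens.
data_rng = random.Random(0)
hidden = [data_rng.uniform(-1, 1) for _ in range(8)]
lm_head = [[data_rng.uniform(-1, 1) for _ in range(8)] for _ in range(10)]
```

With the same noise stream, both paths pick the same token, but the fused path never stores more than one score at a time. That is the property that would let a kernel skip writing logits to memory.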

Achieves microsecond-level kinodynamic motion planning for high-DOF robots by using differential flatness to solve boundary value problems analytically.

Demonstrates that masked diffusion language models can be 21.8x more compute-efficient than traditional autoregressive models when scaled correctly.

Introduces Helium, a serving framework that treats agentic workflows as data query plans to optimize redundant LLM calls and KV caches.

Presents ZipCal, a model-agnostic calibration data selection strategy for pruning and quantization that is 240x faster than model-based methods.

VQKV uses Vector Quantization to achieve over 80% KV cache compression with almost zero loss in model performance.

FEAT is a linear-complexity foundation model designed specifically for extremely large-scale structured (tabular) data.

Enables stable 4-bit microscaling (MXFP4) quantization for Multi-modal LLMs, which previously suffered from performance collapse.

Low-precision optimizer states cause 'state staleness,' where updates round back to the stored values, but scheduled resets can fully recover the lost performance.
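The rounding effect can be reproduced in a few lines. This sketch illustrates only the staleness, not the paper's reset schedule; the helper names, the `beta` value, and the target of 1.2 are assumptions chosen for the demonstration. It runs an exponential moving average, as used for optimizer moments, once in float16 and once in full precision.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def run_ema(target, steps, beta=0.999, low_precision=False):
    """Track `target` with an EMA, optionally storing the state in float16."""
    state = 1.0
    for _ in range(steps):
        state = beta * state + (1.0 - beta) * target
        if low_precision:
            # The per-step change (about 0.0002) is smaller than half the
            # float16 spacing near 1.0 (about 0.0005), so this rounds back.
            state = to_fp16(state)
    return state


stale = run_ema(1.2, 5000, low_precision=True)
fresh = run_ema(1.2, 5000)
```

In this setup the float16 state stays at its initial value while the full-precision state converges toward the target, which is the "rounds back to stored values" behavior described above.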

GIST achieves O(N) complexity for Graph Transformers while maintaining gauge invariance, enabling scaling to meshes with 750K nodes.

Pretrained 3D generative models can be repurposed for high-quality part segmentation using less than 1% of the typical labeled data.

Truncated-Reasoning Self-Distillation (TRSD) allows models to maintain accuracy even when their chain-of-thought traces are heavily shortened.

The ICaRus architecture allows multiple different models to share a single, frozen KV cache for the same prompt.

Parallel associative scans achieve a 44x speedup in training continuous-time Spiking Neural Networks (SNNs).
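For intuition, here is a generic sketch of how an associative scan parallelizes a linear recurrence of the form h_t = a_t * h_{t-1} + b_t, such as a leaky state with per-step decay a_t and input b_t. It is not the paper's formulation: spike resets are not modeled, and the function names are hypothetical. Each round of the doubling loop could run in parallel on real hardware; plain Python only simulates the rounds.

```python
def combine(earlier, later):
    """Compose two affine maps h -> a*h + b, applying `earlier` first."""
    a1, b1 = earlier
    a2, b2 = later
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(a, b):
    """Baseline: step through the recurrence one time step at a time (h_0 = 0)."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def associative_scan(a, b):
    """Hillis-Steele inclusive scan: about log2(n) rounds of independent updates."""
    elems = list(zip(a, b))
    n = len(elems)
    step = 1
    while step < n:
        # Every position i >= step reads only values from the previous round,
        # so all updates within a round are independent of each other.
        elems = [
            elems[i] if i < step else combine(elems[i - step], elems[i])
            for i in range(n)
        ]
        step *= 2
    # With h_0 = 0, the offset term of each prefix map is the state h_t.
    return [offset for _, offset in elems]


decay = [0.9, 0.5, 0.8, 1.1, 0.7, 0.95, 0.6]
inputs = [1.0, 0.0, 2.0, -1.0, 0.5, 0.0, 3.0]
seq = sequential_scan(decay, inputs)
par = associative_scan(decay, inputs)
```

The scan works because composing affine maps is associative, so prefixes can be combined in any grouping. That turns a length-n sequential dependency into a logarithmic number of parallel rounds.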

RelayCaching eliminates redundant prefill computation in multi-agent systems by reusing the decoding-phase KV cache from previous agents.

Pretrained Transformers exhibit a pervasive inter-head linear structure where many attention heads can be reconstructed from a small set of peer heads.

FineRMoE extends MoE granularity to both intermediate and output dimensions, achieving a 136x increase in decoding throughput.

Distribution-Conditioned Diffusion Decoding enables high-fidelity image generation from pre-trained VLMs without expensive full-model retraining.

Qianfan-OCR introduces 'Layout-as-Thought,' enabling a 4B model to outperform 235B models on complex document parsing and layout analysis.

Achieves significant tool-selection accuracy gains in LLM semantic routers with zero added serving-time latency or cost.

A training-free acceleration method for diffusion language models that achieves a 4x speedup in image generation.

Implements bio-inspired 'mental-state dynamics' to achieve O(N) complexity in Vision Transformers.

Reduces the number of real-world robot rollouts needed for policy comparison by up to 70% using safe, anytime-valid inference.

Outperforms fine-tuned baselines in code optimization by using semantics-preserving transformations as a generative intermediate representation.

A 140M-parameter networking foundation model (PLUME) that outperforms frontier LLMs on protocol analysis by learning from native packet structures.

Replaces the quadratic cost of self-attention in Diffusion Transformers with a convection-diffusion PDE solved in the Fourier domain.

Implicit Maximum Likelihood Estimation (IMLE) achieves multimodal trajectory planning performance comparable to diffusion models while being 100x faster.

Greedy Information Projection (GIP) provides a fast, geometrically principled method for selecting training data that balances quality and diversity, achieving full-data performance with a fraction of the examples.

Traditional Spiking Neural Network (SNN) sparsity is a performance 'illusion' on GPUs; temporal aggregation is required for actual 13x speedups.

Enables training of CNNs from scratch in true 4-bit precision on commodity CPUs with virtually no loss in accuracy.

Introduces the FLUX preprocessing pipeline, which reduces LLM training compute by 34% by maximizing high-quality token retention.

Reduces the RAM requirement for speech neuroprosthesis CTC decoding from 320 GB to 10 GB without sacrificing accuracy.

Reveals that Graph-RAG performance is limited by reasoning failure rather than retrieval, and shows how to make an 8B model match a 70B baseline.

Amortizes iterative diffusion into a one-step trajectory policy for robotics using a novel 'Keyed Drift Field' objective.