Efficiency Breakthrough

375 papers · Page 7 of 8

Filter by desk: AI Computing Robotics Math Quantum Physics Space Earth Chemistry Engineering Ecology Biology Neuroscience Health Psychology Economics Society

Proposes a temporal mixed-precision framework for diffusion models that adaptively assigns bitwidths across different denoising timesteps.

Accelerates LLM inference by up to 1.8x using a training-free sparse pattern predictor based on SVD truncation of FFN gate matrices.

Unifies KV cache compression and sparse attention into a single 1-bit indexing structure, eliminating the need for external metadata or predictors.

Detects diffusion-generated images 126x faster than reconstruction-based methods by using Gaussian noise disturbance to exploit the statistical 'ease' of fitting synthetic data.

Enables model adaptation on edge devices and non-differentiable (quantized) models using a purely backpropagation-free optimization framework.

Achieves real-time, low-latency talking avatar generation at 34ms per frame using a one-step streaming diffusion framework.

Introduces ZoomUI, a trainless method for GUI grounding that uses inference-time scaling to anchor natural language instructions to interface elements.

FLORE achieves 1000x error reduction in linear sketching while being 100x faster than previous learning-based solutions.

SleepGate introduces a biologically inspired 'sleep cycle' for the KV cache to resolve proactive interference in long-context LLMs.

ASAP reduces LVLM computational FLOPs by ~80% with virtually no loss in performance using a training-free KV-Cache pruning recipe.

FlashHead is a drop-in replacement for the LM classification head that provides 1.75x inference speedup by treating vocabulary selection as a retrieval problem.

Reformulates diffusion sampling as a graph-theoretic planning problem that dynamically allocates compute to the most difficult denoising stages.

Generates novel, structurally plausible protein sequences from small alignments using a training-free stochastic attention mechanism on a standard laptop.

Adaptive computation for multimodal LLMs drastically reduces compute waste on easy cases while focusing on hard ones.

HO-SFL enables backprop-free fine-tuning on edge devices without the convergence penalty typical of zeroth-order methods.

RAZOR provides a lightweight, targeted unlearning framework for Transformers and Diffusion models without retraining.

Introduces an asynchronous Mixture-of-Transformers architecture for autonomous driving that decouples slow reasoning from fast action execution.

Achieves over 80% of full-resolution VLM performance while using only 1% of the original pixel budget through bio-inspired foveated sampling.

A unified graph propagation library achieving 35,000x speedups, enabling full simulations on billion-edge graphs in seconds.

AdaAnchor enables LLMs to perform multi-step reasoning entirely in latent space with an adaptive halting mechanism to optimize compute.

AnoleVLA replaces the standard Transformer backbone in robotic Vision-Language-Action models with Deep State Space Models for a 3x speedup.

Writer-R1-4B outperforms 100B+ parameter models in creative writing by utilizing memory-augmented self-reflection and fine-grained criteria generation.

Ultra-low-bitrate image compression achieves 50% bitrate savings by treating decoding as a 'next-frame' video prediction task using diffusion priors.

HapticVLA achieves tactile-aware robotic manipulation at 86.7% success rate without requiring any physical tactile sensors at inference time.

IConE enables stable self-supervised learning even at batch size 1, overcoming the memory bottlenecks of high-dimensional scientific and medical data.

FlashU is the first framework to accelerate unified multimodal models by exploiting the distinct neuron sets used for generation vs. understanding.

MeMix is a training-free, plug-and-play module that reduces 3D reconstruction error by up to 40% in long sequences by mitigating state drift.

PrismMirror is the first monocular human frontal view synthesis model to achieve real-time inference (24 FPS) without external geometric models.

A 4B parameter model matches a 120B parameter model in program verification through a rigorous data curation pipeline.

Bridges the gap between generative (MAE) and predictive (I-JEPA) self-supervised learning, achieving a 10% performance boost.

Accelerates state-of-the-art 3D human mesh recovery by over 10x, enabling real-time vision-only humanoid teleoperation.

Introduces Mixture-of-Depths Attention (MoDA) to solve signal degradation in deep LLMs with hardware-efficient implementation.

Achieves 1,000x speedups in Bayesian inverse problems by replacing repeated MCMC sampling with one-step preconditioned generative transport.

ActTail achieves 80% activation sparsity in LLMs with significantly lower perplexity degradation than uniform methods by using Heavy-Tailed Self-Regularization theory.

ReBalance is a training-free framework that dynamically modulates 'thinking' length in reasoning models to prune redundancy during overthinking and promote exploration during underthinking.

Achieves 100x speedup in robotic action generation by distilling iterative flow/diffusion models into a one-step policy without a pre-trained teacher.

Reduces Chain-of-Thought (CoT) compute costs by 14-55% by learning the optimal 'early-exit' points for Large Reasoning Models.

Accelerates Diffusion Transformers (DiTs) by 2x using a training-free framework that selectively reduces computation in non-aesthetic image regions.

Introduces a training-free framework that allows LLM agents to dynamically scale their reasoning depth based on a pre-defined token/tool budget.

Achieves a 98x speedup in LLM routing on AMD hardware using Flash Attention and prompt compression, enabling high-context classification without a dedicated GPU.

Modality-level disaggregation enables cost-optimal MLLM serving across heterogeneous GPUs over commodity PCIe, bypassing the need for expensive NVLink interconnects.

A hardware-algorithm co-design for Spiking Neural Networks achieves up to 69x energy efficiency gains using an SRAM-based Compute-in-Memory accelerator.

Achieves 4x visual token compression and 80% lower training cost while unifying multimodal comprehension and generation.

Adaptive VLM Routing reduces inference costs for Computer Use Agents by up to 78% with negligible accuracy loss.

Distills a 2B Vision-Language Retriever into a 70M text-only encoder for visual document retrieval with 50x lower latency.

CleanSight provides a training-free, test-time defense for backdoored vision-language models by detecting and pruning 'attention stealing' visual tokens.

Structured distillation for personalized agent memory achieves an 11x reduction in token count while preserving 96% of the retrieval quality of verbatim history.

Induces pretrained video models to perform SOTA image restoration using less than 2% of the training data required by specialized architectures.

Achieves 'zero-hyperparameter' circuit analysis by using a foundation model to perform in-context regression, bypassing hours of manual tuning.

Introduces Bilateral Context Conditioning to DeepSeek's GRPO, allowing models to cross-reference successful and failed reasoning traces during optimization.