SkeletonLLM allows frozen multimodal LLMs to reason about human motion by rendering skeleton sequences into their native visual modality.
arXiv · March 19, 2026 · 2603.18003
The Takeaway
Instead of training specialized motion encoders or quantizing skeletons into abstract tokens, this method uses a differentiable renderer to turn motion into video. This allows any visual MLLM to perform action recognition, captioning, and reasoning on motion data without specialized training.
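To make the idea concrete, here is a minimal sketch of the render-then-prompt pipeline, assuming a toy 2D skeleton. The joint layout, the `BONES` connectivity, and the plain PIL rasterization are illustrative stand-ins; the paper's actual differentiable renderer is not reproduced here.

```python
import numpy as np
from PIL import Image, ImageDraw

# Toy skeleton: 5 joints with a hypothetical bone list. The paper's actual
# joint layout and rendering details are not specified here.
BONES = [(0, 1), (1, 2), (1, 3), (1, 4)]

def render_frame(joints_2d, size=224):
    """Rasterize one pose (J x 2 array, coordinates in [0, 1]) to an RGB image."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    pts = [(int(x * (size - 1)), int(y * (size - 1))) for x, y in joints_2d]
    for a, b in BONES:  # draw bones as line segments
        draw.line([pts[a], pts[b]], fill="black", width=3)
    for x, y in pts:    # draw joints as dots
        draw.ellipse([x - 4, y - 4, x + 4, y + 4], fill="red")
    return img

# Toy motion: 16 frames of 5 joints drifting smoothly via a random walk.
rng = np.random.default_rng(0)
motion = (0.5 + 0.01 * rng.standard_normal((16, 5, 2)).cumsum(axis=0)).clip(0, 1)

frames = [render_frame(pose) for pose in motion]
# `frames` can now be passed to any off-the-shelf visual MLLM as a short video,
# alongside a text prompt such as "Describe the action in this clip."
```

Because the output is an ordinary video, the MLLM itself stays frozen; only the rendering step needs to be differentiable if one wants to optimize it end-to-end, as the paper does.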
From the abstract
Multimodal large language models (MLLMs) exhibit strong visual-language reasoning, yet remain confined to their native modalities and cannot directly process structured, non-visual data such as human skeletons. Existing methods either compress skeleton dynamics into lossy feature vectors for text alignment or quantize motion into discrete tokens that generalize poorly across heterogeneous skeleton formats. We present SkeletonLLM, which achieves universal skeleton understanding by translating arbitrary skeleton sequences into the model's native visual modality.