LARM: A Large Articulated-Object Reconstruction Model

Sylvia Yuan1, Ruoxi Shi1, Xinyue Wei1, Xiaoshuai Zhang2, Hao Su1, Minghua Liu2

1University of California San Diego    2Hillbot

Accepted to SIGGRAPH Asia 2025

Teaser
We introduce LARM, a feedforward model for 3D articulated object reconstruction. Given sparse-view images of an articulated object in two different joint states (e.g., drawer open and closed), LARM can synthesize views from arbitrary camera poses and novel joint configurations (e.g., the drawer half-open), enabling efficient generation of continuous articulation and viewpoint variations. Beyond novel view and state synthesis, LARM also supports explicit 3D mesh reconstruction.
Interactive gallery: six sparse-view input images for each of twelve example articulated objects, each captured at a single joint state.

Abstract

Modeling 3D articulated objects with realistic geometry, textures, and kinematics is essential for a wide range of applications. However, existing optimization-based reconstruction methods often require dense multi-view inputs and expensive per-instance optimization, limiting their scalability. Recent feedforward approaches offer faster alternatives but frequently produce coarse geometry, lack texture reconstruction, and rely on brittle, complex multi-stage pipelines. We introduce LARM, a unified feedforward framework that reconstructs 3D articulated objects from sparse-view images by jointly recovering detailed geometry, realistic textures, and accurate joint structures. LARM extends LVSM, a recent novel view synthesis (NVS) approach for static 3D objects, to the articulated setting by jointly reasoning over camera pose and articulation variation with a transformer-based architecture, enabling scalable and accurate novel view synthesis. In addition, LARM generates auxiliary outputs such as depth maps and part masks to facilitate explicit 3D mesh extraction and joint estimation. Our pipeline eliminates the need for dense supervision and supports high-fidelity reconstruction across diverse object categories. Extensive experiments demonstrate that LARM outperforms state-of-the-art methods in novel view and state synthesis as well as in 3D articulated object reconstruction, generating high-quality meshes that closely adhere to the input images.

Method Overview

LARM Pipeline Overview
LARM first patchifies the sparse, posed input images into tokens by concatenating the input RGB values, Plücker ray embeddings, and corresponding joint states. The target view to be synthesized is similarly represented by its Plücker ray embeddings and a target joint state, which are concatenated and tokenized. These input and target tokens are then fed into a decoder-only transformer model that predicts tokens used to regress the target view pixels. To enable explicit 3D reconstruction, LARM is also trained to produce additional outputs beyond RGB values, such as depth maps, foreground masks, and part masks.
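
A minimal PyTorch sketch of this tokenization and transformer is given below. It is written from the description above, not from the released code: the Plücker-ray construction, the scalar per-pixel joint-state channel, the patch size, the number of output part channels, and all module names are assumptions.

# Minimal sketch of LARM-style tokenization and the decoder-only transformer.
# All dimensions, the joint-state encoding (a scalar broadcast per pixel), and
# module names are assumptions made for illustration, not the released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def plucker_rays(K, c2w, h, w):
    """Per-pixel Plücker ray embedding (unit direction, moment) -> (6, h, w)."""
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)   # (h, w, 3) pixel coords
    dirs_cam = pix @ torch.linalg.inv(K).T                                 # camera-space ray directions
    dirs = F.normalize(dirs_cam @ c2w[:3, :3].T, dim=-1)                   # world-space, unit length
    origin = c2w[:3, 3].expand_as(dirs)                                    # camera center per pixel
    moment = torch.cross(origin, dirs, dim=-1)                             # Plücker moment o x d
    return torch.cat([dirs, moment], dim=-1).permute(2, 0, 1)              # (6, h, w)

class LARMSketch(nn.Module):
    def __init__(self, patch=16, dim=768, depth=12, heads=12, out_ch=3 + 1 + 1 + 8):
        super().__init__()
        self.patch = patch
        # input tokens: RGB (3) + Plücker (6) + joint state (1) per pixel
        self.embed_in = nn.Linear((3 + 6 + 1) * patch * patch, dim)
        # target tokens: Plücker (6) + target joint state (1) per pixel
        self.embed_tgt = nn.Linear((6 + 1) * patch * patch, dim)
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, depth)  # full self-attention over all tokens
        # each target token regresses a patch of RGB + depth + foreground mask + (assumed) 8 part channels
        self.head = nn.Linear(dim, out_ch * patch * patch)

    def tokens(self, x):
        # (B, C, H, W) -> (B, N, C * p * p) non-overlapping patch tokens
        p = self.patch
        return F.unfold(x, kernel_size=p, stride=p).transpose(1, 2)

    def forward(self, imgs, in_rays, in_joint, tgt_rays, tgt_joint):
        # imgs: (B, V, 3, H, W); rays: (B, V, 6, H, W) / (B, 6, H, W); joints: scalar states
        B, V, _, H, W = imgs.shape
        j_in = in_joint.view(B, V, 1, 1, 1).expand(B, V, 1, H, W)
        x_in = torch.cat([imgs, in_rays, j_in], dim=2).flatten(0, 1)           # (B*V, 10, H, W)
        tok_in = self.embed_in(self.tokens(x_in)).view(B, -1, self.embed_in.out_features)

        j_t = tgt_joint.view(B, 1, 1, 1).expand(B, 1, H, W)
        x_t = torch.cat([tgt_rays, j_t], dim=1)                                # (B, 7, H, W)
        tok_t = self.embed_tgt(self.tokens(x_t))

        out = self.blocks(torch.cat([tok_in, tok_t], dim=1))[:, tok_in.shape[1]:]  # keep target tokens
        patches = self.head(out).transpose(1, 2)                               # (B, out_ch*p*p, N_t)
        return F.fold(patches, (H, W), kernel_size=self.patch, stride=self.patch)  # (B, out_ch, H, W)

Because the Plücker rays already encode camera pose per pixel, no separate positional embedding is needed for cross-view reasoning; the joint-state channel plays the analogous role for articulation.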

Results

Comparison of Novel View and State Synthesis among PartRM, Paris, and our LARM

Novel view synthesis comparisons
For each shape, we showcase synthesized views at two novel joint states.

Comparison of 3D Articulated Object Reconstruction

3D Mesh Error Plot
Note that methods such as URDFormer and Articulate-Anything rely on part retrieval for reconstruction, often resulting in significant mismatches in geometry and texture compared to the input prompt. In contrast, our LARM model faithfully reconstructs high-quality textured meshes that closely align with the input prompts.
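
For context, the explicit meshes shown here build on LARM's auxiliary predictions (depth maps, foreground masks, and part masks) described in the method overview. The sketch below illustrates one plausible back-projection step that groups world-space points by predicted part; the helper name, the z-depth convention, and the downstream fusion and meshing choice (e.g., TSDF or Poisson surface reconstruction over these points) are assumptions, not the paper's exact procedure.

# Hedged sketch: back-project predicted depth maps into per-part world-space point
# clouds, which a separate surface-reconstruction step could then mesh.
import torch

def backproject_views(depths, fg_masks, part_ids, Ks, c2ws):
    """depths, fg_masks, part_ids: (V, H, W); Ks: (V, 3, 3); c2ws: (V, 4, 4).
    Assumes depths are z-depths along the camera axis. Returns {part id: (N_i, 3) points}."""
    parts = {}
    V, H, W = depths.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)      # (H, W, 3)
    for v in range(V):
        cam = (pix @ torch.linalg.inv(Ks[v]).T) * depths[v].unsqueeze(-1)     # camera-space points
        world = cam @ c2ws[v, :3, :3].T + c2ws[v, :3, 3]                      # rigid transform to world
        keep = fg_masks[v] > 0.5                                              # drop background pixels
        for pid in part_ids[v][keep].unique().tolist():
            sel = keep & (part_ids[v] == pid)
            parts.setdefault(int(pid), []).append(world[sel])
    return {pid: torch.cat(pts, dim=0) for pid, pts in parts.items()}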

Real-world Demo

Real-world reconstruction comparisons
We use an iPhone to capture sparse-view images of everyday articulated objects. The results demonstrate that LARM can effectively handle such inputs and predict accurate novel views across diverse camera poses and joint states.

Citation

@article{yuan2025larmlargearticulatedobjectreconstruction,
    title={LARM: A Large Articulated-Object Reconstruction Model}, 
    author={Yuan, Sylvia and Shi, Ruoxi and Wei, Xinyue and Zhang, Xiaoshuai and Su, Hao and Liu, Minghua},
    journal={arXiv preprint arXiv:2511.11563},
    year={2025},
}