arXiv:2505.02704

VGLD: Visually-Guided Linguistic Disambiguation for Monocular Depth Scale Recovery

Published on May 5, 2025

AI-generated summary

A visually guided linguistic disambiguation framework improves metric depth estimation by resolving textual ambiguity through high-level visual semantics and global linear transformations.

Abstract
Monocular depth estimation can be broadly categorized into two directions: relative depth estimation, which predicts normalized or inverse depth without absolute scale, and metric depth estimation, which aims to recover depth with real-world scale. While relative methods are flexible and data-efficient, their lack of metric scale limits their utility in downstream tasks. A promising solution is to infer absolute scale from textual descriptions. However, such language-based recovery is highly sensitive to natural language ambiguity, as the same image may be described differently across perspectives and styles. To address this, we introduce VGLD (Visually-Guided Linguistic Disambiguation), a framework that incorporates high-level visual semantics to resolve ambiguity in textual inputs. By jointly encoding both image and text, VGLD predicts a set of global linear transformation parameters that align relative depth maps with metric scale. This visually grounded disambiguation improves the stability and accuracy of scale estimation. We evaluate VGLD on representative models, including MiDaS and DepthAnything, using standard indoor (NYUv2) and outdoor (KITTI) benchmarks. Results show that VGLD significantly mitigates scale estimation bias caused by inconsistent or ambiguous language, achieving robust and accurate metric predictions. Moreover, when trained on multiple datasets, VGLD functions as a universal and lightweight alignment module, maintaining strong performance even in zero-shot settings. Code will be released upon acceptance.
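The core mechanism described above — jointly encoding image and text, predicting one global pair of linear parameters, and applying it to a relative depth map — can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation (the code has not yet been released): the CLIP-style encoders, the two-layer MLP head, and all dimensions are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

class GlobalScaleShiftHead(nn.Module):
    """Hypothetical sketch of a VGLD-style alignment module.

    Fuses an image embedding with a text embedding, predicts a single
    global (scale, shift) pair, and applies it linearly to a relative
    depth map to produce metric depth.
    """

    def __init__(self, img_dim: int = 512, txt_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # outputs (scale, shift)
        )

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor,
                rel_depth: torch.Tensor) -> torch.Tensor:
        # img_emb:   (B, img_dim)  high-level visual semantics, e.g. a CLIP image embedding
        # txt_emb:   (B, txt_dim)  embedding of the scene description
        # rel_depth: (B, H, W)     relative depth from MiDaS / DepthAnything
        params = self.mlp(torch.cat([img_emb, txt_emb], dim=-1))  # (B, 2)
        scale = params[:, 0].view(-1, 1, 1)
        shift = params[:, 1].view(-1, 1, 1)
        # One global linear transformation per image: metric = scale * relative + shift
        return scale * rel_depth + shift

# Usage sketch with random stand-ins for real encoder outputs:
head = GlobalScaleShiftHead()
img_emb = torch.randn(1, 512)
txt_emb = torch.randn(1, 512)        # e.g. embeds "a cluttered office desk seen from a chair"
rel_depth = torch.rand(1, 480, 640)  # normalized relative depth
metric_depth = head(img_emb, txt_emb, rel_depth)  # (1, 480, 640)
```

The point of the joint encoding is visible in the forward pass: a caption alone ("a room with a table") may carry little scale information and varies with wording, so the image embedding grounds the prediction; per the abstract, this is what stabilizes scale estimation across differently phrased descriptions of the same scene.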
