VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection Paper • 2505.20289 • Published May 26, 2025 • 10
X-Fusion: Introducing New Modality to Frozen Large Language Models Paper • 2504.20996 • Published Apr 29, 2025 • 13
YoChameleon: Personalized Vision and Language Generation Paper • 2504.20998 • Published Apr 29, 2025 • 12
Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features Paper • 2504.00557 • Published Apr 1, 2025 • 15
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models Paper • 2410.10818 • Published Oct 14, 2024 • 16
Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos Paper • 2410.02763 • Published Oct 3, 2024 • 7
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy Paper • 2406.20095 • Published Jun 28, 2024 • 18