SpaceVista
All-Scale Visual Spatial Reasoning from mm to km

Abstract

With the current surge in spatial reasoning research, researchers have made significant progress in understanding indoor scenes, but still struggle with more diverse applications. This paper aims to advance all-scale spatial reasoning by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive annotations for dataset curation; 2) the absence of all-scale modeling, which often leads to overfitting to single scenes. We introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm, as the first attempt to broaden the scope of all-scale spatial intelligence. Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across five spatial scales to create SpaceVista-1M, a dataset comprising one million spatial QAs spanning 19 diverse tasks. While specialist models offer valuable domain knowledge, they are often unreliable evaluators. Therefore, we build an all-scale benchmark with precise annotations by manually recording and retrieving videos. Nevertheless, naive training with SpaceVista-1M often yields suboptimal results due to potential knowledge conflicts. Accordingly, we introduce SpaceVista-7B, a spatial reasoning model that accepts inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across five benchmarks, including SpaceVista-Bench, demonstrate competitive performance, showcasing strong generalization across all scales and scenarios.

What is All-Scale Spatial Reasoning?

[Figure: data_graph]

Spatial reasoning is the ability to perceive, interpret, and act across spatial scales, from millimeter-sized components to distant aerial scenes. All-scale spatial reasoning is fundamental to next-generation intelligent systems and supports diverse applications: mm sensing for advanced manufacturing, cm and m perception for embodied agents, 10 m operation for autonomous driving, and 100 m for drone-based sensing.
Despite progress, existing work shows clear limitations in both model design and dataset coverage. Current scene perception research mostly targets indoor scenes, narrow object classes, and limited spatial ranges, and lacks training paradigms engineered for end-to-end, cross-scale reasoning. SpaceVista addresses this gap by presenting the first systematic optimization across both data and model dimensions to enable robust, full-scene spatial reasoning.

Dataset: SpaceVista-1M

[Figure: dataset construction pipeline]
The limited data and performance constraints in existing models necessitate the creation of a dataset with all-scale spatial context. We propose SpaceVista-1M, a diverse, real-world, all-scale reasoning dataset that is, to the best of our knowledge, the first of its kind. SpaceVista-1M primarily comprises diverse spatial reasoning question–answer pairs with rich semantic (category, rationale), 2D (mask, box, point), and 3D (depth, camera parameters, point cloud) annotations, obtained either natively or through processing. The construction pipeline in the figure above follows a step-by-step procedure of preparing, transforming, and generating, integrating specialized models to obtain an all-scale dataset.
[Figure: data_graph]

Basic information

We collect a large number of spatial reasoning videos from both open-source datasets and our own collected data. Specifically, we select tabletop, indoor, outdoor, and drone-view scenes, and design 19 types of spatial reasoning tasks covering all scales from millimeters to kilometers. The dataset contains diverse spatial reasoning question–answer pairs, enriched with semantic, 2D, and 3D annotations.
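
As an illustration of how one question–answer record with semantic, 2D, and 3D annotations might be organized, here is a minimal Python sketch. All class names, field names, and example values (`SpatialQA`, `Annotation2D`, `Annotation3D`, `SCALES`, the demo record) are hypothetical and are not taken from the released dataset. The five scale labels are an assumption based on the scene types and leaderboard columns mentioned on this page.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

# Assumed scale labels; the released dataset may name them differently.
SCALES = ("tiny_tabletop", "tabletop", "indoor", "outdoor", "drone")


@dataclass
class Annotation2D:
    """2D annotations (mask, box, point); every field is optional."""
    mask_path: Optional[str] = None
    box: Optional[Tuple[float, float, float, float]] = None  # x_min, y_min, x_max, y_max
    point: Optional[Tuple[float, float]] = None


@dataclass
class Annotation3D:
    """3D annotations (depth, camera parameters, point cloud); every field is optional."""
    depth_path: Optional[str] = None
    camera_params: Optional[dict] = None
    point_cloud_path: Optional[str] = None


@dataclass
class SpatialQA:
    """One spatial reasoning QA pair attached to a video (hypothetical layout)."""
    video_id: str
    scale: str
    task: str
    question: str
    answer: str
    category: str = ""   # semantic annotation
    rationale: str = ""  # semantic annotation
    ann_2d: Annotation2D = field(default_factory=Annotation2D)
    ann_3d: Annotation3D = field(default_factory=Annotation3D)

    def __post_init__(self):
        # Reject records whose scale label is not one of the assumed five.
        if self.scale not in SCALES:
            raise ValueError(f"unknown scale: {self.scale!r}")


# Made-up example record, for illustration only.
example = SpatialQA(
    video_id="demo_0001",
    scale="tabletop",
    task="object_size",
    question="How wide is the mug, in centimeters?",
    answer="8",
    category="mug",
    ann_2d=Annotation2D(box=(120.0, 80.0, 260.0, 240.0)),
)
```

Keeping the 2D and 3D annotations optional reflects the statement that annotations are obtained either natively or through processing, so a given record may carry only some of them.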

Characteristics

  • 5 spatial scales
  • 19 spatial reasoning task types
  • All-scale: from mm to km
  • 38,000 videos across diverse scenes
  • Over 50 subscene categories
  • 1 million QA pairs
  • Video Data with 3D Modeling
  • Comprehensive Annotations & Metadata

Evaluation

Although we perform limited manual filtering on open-source data, its suitability for accurately evaluating real-world perception remains uncertain. To address this, we collect higher-fidelity data comprising two types:

  • Measured, recorded, and manually annotated data for tiny and tabletop objects.
  • Existing videos enhanced through retrieval and verification of public information for indoor and outdoor scenes.

Why not simply train with all-scale data?

[Figure: data_graph]

Mixing different types of knowledge without distinction hinders, rather than facilitates, the model's reasoning, as shown in the figure above. This problem is known as knowledge conflict. In all-scale reasoning, the conflict appears when similar visual patterns are interpreted differently at different scales.

Model: SpaceVista-7B

[Figure: SpaceVista-7B model pipeline]
SpaceVista-7B ingests a question with videos and self-supervised dense features, encodes them, projects features to a shared space, and fuses them in an LLM through learnable interaction. A LoRA-like scale expert with a scale-aware router adapts the model to different spatial scales. Training uses reinforcement learning with stepwise rewards to align reasoning and final answers.
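
To make the idea of a LoRA-like scale expert with a scale-aware router more concrete, the following is a minimal pure-Python sketch of one linear layer. It is not the SpaceVista-7B implementation: the class name `ScaleExpertLinear`, the soft (softmax) routing computed from the input features, the choice of five experts, and all dimensions are assumptions made for illustration. The sketch computes `y = W x + sum_k g_k(x) * B_k A_k x`, where `W` is a frozen base weight, `(A_k, B_k)` is the low-rank pair for scale `k`, and `g(x)` are the router gates.

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, std=0.1):
    """Matrix of small Gaussian values, used only to initialise the sketch."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


class ScaleExpertLinear:
    """Frozen linear layer plus router-weighted low-rank (LoRA-style) experts."""

    def __init__(self, d_in, d_out, n_scales, rank, seed=0):
        rng = random.Random(seed)
        self.base = rand_matrix(d_out, d_in, rng)       # frozen weight W
        self.router = rand_matrix(n_scales, d_in, rng)  # one logit row per scale
        # One (A, B) pair per scale. B starts at zero, as in standard LoRA,
        # so the layer initially behaves exactly like the base layer.
        self.down = [rand_matrix(rank, d_in, rng) for _ in range(n_scales)]
        self.up = [[[0.0] * rank for _ in range(d_out)] for _ in range(n_scales)]

    def gates(self, x):
        """Soft assignment of the input to each scale expert (assumed design)."""
        return softmax(matvec(self.router, x))

    def forward(self, x):
        out = matvec(self.base, x)
        for g, a, b in zip(self.gates(x), self.down, self.up):
            delta = matvec(b, matvec(a, x))  # low-rank update B_k A_k x
            out = [o + g * d for o, d in zip(out, delta)]
        return out


layer = ScaleExpertLinear(d_in=8, d_out=4, n_scales=5, rank=2)
x = [0.5] * 8
gate_values = layer.gates(x)  # five non-negative weights that sum to 1
y = layer.forward(x)          # equals the base output until the B matrices are trained
```

Routing per scale lets each expert specialise without overwriting what the others learn, which is one plausible way to reduce the knowledge conflict described earlier. A hard (top-1) router keyed on a predicted scale label would be an equally reasonable variant.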

Experiment

[Figure: experiment-2]

Results overview. SpaceVista-7B achieves competitive results across all benchmarks, highlighting its advantages in spatial reasoning tasks. Although models such as LLAVA-Onevision-7B also perform competitively, SpaceVista-7B shows greater robustness and adaptability across tasks, which places it among the leading models for spatial reasoning.

Evaluation suite. We evaluate LLAVA-Onevision-7B, LLAVA-Next-Video-7B, InternVL3.5-8B, Qwen2.5-VL-7B, SpaceR-7B, Spatial-MLLM-4B, VILASR-7B, and our SpaceVista-7B on five spatial reasoning benchmarks: VSI-Bench, STI-Bench, MMSI-Bench, SPAR-Bench, and SpaceVista-Bench.

Leaderboard on SpaceVista-Bench

Medals (🥇, 🥈, 🥉) mark the top three overall performers on SpaceVista-Bench.
#   Model                   Source  Date     Overall  Tiny Tabletop  Tabletop  Indoor  Outdoor
3   GPT-5 🥉                Link    2025-08  33.7     32.2           20.3      39.0    43.0
8   GPT-4o                  Link    2024-05  26.9     21.7           13.3      34.3    38.3
2   Gemini-2.5-Pro 🥈       Link    2025-06  33.8     33.0           38.7      34.5    29.0
11  Gemini-2.5-Flash        Link    2025-06  24.4     20.7           30.0      19.9    26.9
6   Claude-Sonnet-4         Link    2025-05  29.7     27.3           19.3      38.1    34.1
10  Claude-Opus-4.1         Link    2025-08  26.4     21.7           29.5      24.3    30.0
5   Internvl3.5-38B         Link    2025-08  30.7     29.3           25.2      41.2    27.0
10  Internvl3.5-14B         Link    2025-08  26.4     27.7           22.3      31.3    24.3
4   Internvl3-78B           Link    2025-04  33.5     38.3           23.3      42.2    30.3
9   Internvl3-38B           Link    2025-04  26.5     18.7           14.3      34.8    38.0
13  GLM-4.5V                Link    2025-08  23.3     23.0           17.8      27.3    25.2
14  GLM-4.1V-Thinking       Link    2025-07  23.1     30.7           19.3      29.0    13.3
10  Qwen2.5VL-72B           Link    2025-01  26.4     27.7           20.3      29.6    28.0
7   Qwen2.5VL-32B           Link    2025-01  28.4     25.3           19.3      38.1    30.7
16  LLAVA-Onevision-72B     Link    2024-08  16.0     25.0           12.0      15.3    11.7
17  LLAVA-Onevision-7B      Link    2024-08  12.6     17.5           8.0       13.3    11.6
15  SpaceR                  Link    2025-04  21.2     12.9           17.3      34.9    19.8
12  Spatial-MLLM            Link    2025-05  24.2     17.3           20.3      36.1    23.1
1   SpaceVista-7B 🥇        Link    2025-09  36.7     33.4           37.1      42.2    34.1
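
The leaderboard can also be re-ranked by a single scale rather than by the overall score. The short Python sketch below does this for three rows copied from the table above. The helper name `rank_by` and the dictionary keys are hypothetical; only the numbers come from the leaderboard.

```python
# Three rows copied from the leaderboard; keys are illustrative names.
ROWS = [
    {"model": "GPT-5", "overall": 33.7, "tiny_tabletop": 32.2,
     "tabletop": 20.3, "indoor": 39.0, "outdoor": 43.0},
    {"model": "Gemini-2.5-Pro", "overall": 33.8, "tiny_tabletop": 33.0,
     "tabletop": 38.7, "indoor": 34.5, "outdoor": 29.0},
    {"model": "SpaceVista-7B", "overall": 36.7, "tiny_tabletop": 33.4,
     "tabletop": 37.1, "indoor": 42.2, "outdoor": 34.1},
]


def rank_by(rows, column, descending=True):
    """Return model names ordered by the given score column."""
    ordered = sorted(rows, key=lambda row: row[column], reverse=descending)
    return [row["model"] for row in ordered]


by_overall = rank_by(ROWS, "overall")
by_outdoor = rank_by(ROWS, "outdoor")
by_tabletop = rank_by(ROWS, "tabletop")
```

Among these three rows, SpaceVista-7B ranks first overall, while GPT-5 ranks first on Outdoor and Gemini-2.5-Pro ranks first on Tabletop, so per-scale rankings can differ from the overall order.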