LingBot-Map: Geometric Context Transformer for Streaming 3D Reconstruction

https://github.com/user-attachments/assets/fe39e095-af2c-4ec9-b68d-a8ba97e505ab


Quick Start

Installation

1. Create conda environment

conda create -n lingbot-map python=3.10 -y
conda activate lingbot-map

2. Install PyTorch (CUDA 12.8)

pip install torch==2.9.1 torchvision==0.24.1 --index-url https://download.pytorch.org/whl/cu128

For other CUDA versions, see PyTorch Get Started.

3. Install lingbot-map

pip install -e .

4. Install FlashInfer (recommended)

FlashInfer provides paged KV cache attention for efficient streaming inference:

# CUDA 12.8 + PyTorch 2.9
pip install flashinfer-python -i https://flashinfer.ai/whl/cu128/torch2.9/

For other CUDA/PyTorch combinations, see FlashInfer installation. If FlashInfer is not installed, the model can fall back to SDPA (PyTorch's native scaled dot-product attention) via the --use_sdpa flag.
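Before launching the demo, it can help to check whether FlashInfer is importable in the active environment. The following is a minimal sketch, not part of this repository: the helper name attention_flags is hypothetical, and it assumes the package installed as flashinfer-python is imported under the module name flashinfer.

```python
import importlib.util


def attention_flags(module_name: str = "flashinfer") -> list[str]:
    """Return extra demo.py flags based on whether the attention backend is importable.

    Hypothetical helper: it only decides whether to add --use_sdpa.
    """
    # find_spec returns None when the top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        return ["--use_sdpa"]
    return []


# Example: build the command line for the image demo.
command = ["python", "demo.py", "--model_path", "/path/to/checkpoint.pt",
           "--image_folder", "/path/to/images/", *attention_flags()]
print(" ".join(command))
```

The printed command includes --use_sdpa only when the flashinfer module cannot be found.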

5. Visualization dependencies (optional)

pip install -e ".[vis]"

Model Download

Model Name | Hugging Face Repository | ModelScope Repository | Description
lingbot-map | robbyant/lingbot-map | Robbyant/lingbot-map | Base model checkpoint (4.63 GB)
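One possible way to fetch the checkpoint from Hugging Face is the huggingface_hub package, which is not listed among this project's dependencies and would need to be installed separately. The sketch below is an assumption about workflow, not the authors' documented procedure: the function name download_checkpoint and the target directory are hypothetical, and the checkpoint file name inside the repository is not stated here, so the whole repository snapshot is downloaded.

```python
from pathlib import Path

# Hugging Face repository ID taken from the model table.
REPO_ID = "robbyant/lingbot-map"


def download_checkpoint(local_dir: str = "checkpoints/lingbot-map") -> Path:
    """Download the full repository snapshot and return the local directory.

    The default local_dir is an illustrative choice, not a project convention.
    """
    # Imported lazily so this module loads even if huggingface_hub is missing.
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
    return Path(path)
```

Calling download_checkpoint() downloads roughly 4.63 GB; the returned directory can then be inspected for the checkpoint file to pass as --model_path.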

Demo

Streaming Inference from Images

python demo.py --model_path /path/to/checkpoint.pt \
    --image_folder /path/to/images/

Streaming Inference from Video

python demo.py --model_path /path/to/checkpoint.pt \
    --video_path video.mp4 --fps 10

Streaming with Keyframe Interval

Use --keyframe_interval to reduce KV cache memory by keeping only every N-th frame as a keyframe. Non-keyframes still produce predictions but are not stored in the cache. This is useful for long sequences that exceed 320 frames.

python demo.py --model_path /path/to/checkpoint.pt \
    --image_folder /path/to/images/ --keyframe_interval 6
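To make the effect of the interval concrete, the sketch below lists which frame indices would be cached for a given N. It is an illustration only: the function split_keyframes is hypothetical, and it assumes keyframes are the frames whose index is a multiple of N starting from frame 0, which may differ from how demo.py selects them.

```python
def split_keyframes(num_frames: int, keyframe_interval: int) -> tuple[list[int], list[int]]:
    """Split frame indices into (cached keyframes, non-cached frames).

    Assumption: frame i is a keyframe when i % keyframe_interval == 0.
    """
    if keyframe_interval < 1:
        raise ValueError("keyframe_interval must be >= 1")
    keyframes = [i for i in range(num_frames) if i % keyframe_interval == 0]
    others = [i for i in range(num_frames) if i % keyframe_interval != 0]
    return keyframes, others


cached, uncached = split_keyframes(num_frames=12, keyframe_interval=6)
print(cached)    # [0, 6]
print(uncached)  # [1, 2, 3, 4, 5, 7, 8, 9, 10, 11]
```

Under this assumption, a 1,920-frame sequence with an interval of 6 keeps 320 frames in the cache while all 1,920 frames still receive predictions.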

Windowed Inference (for long sequences, >3000 frames)

python demo.py --model_path /path/to/checkpoint.pt \
    --video_path video.mp4 --fps 10 \
    --mode windowed --window_size 64
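As a rough picture of what a window size of 64 means for a long sequence, the sketch below partitions frame indices into consecutive windows. It is a simplified illustration: the function window_ranges is hypothetical, and it assumes non-overlapping windows, whereas the actual windowed mode may overlap windows or carry state between them.

```python
def window_ranges(num_frames: int, window_size: int = 64) -> list[tuple[int, int]]:
    """Return half-open (start, end) index ranges covering all frames.

    Assumption: windows are consecutive and do not overlap.
    """
    if window_size < 1:
        raise ValueError("window_size must be >= 1")
    return [(start, min(start + window_size, num_frames))
            for start in range(0, num_frames, window_size)]


print(window_ranges(150))        # [(0, 64), (64, 128), (128, 150)]
print(len(window_ranges(3000)))  # 47
```

The last window is shorter whenever the frame count is not a multiple of the window size.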

With Sky Masking

python demo.py --model_path /path/to/checkpoint.pt \
    --image_folder /path/to/images/ --mask_sky

Without FlashInfer (SDPA fallback)

python demo.py --model_path /path/to/checkpoint.pt \
    --image_folder /path/to/images/ --use_sdpa

License

This project is released under the Apache License 2.0. See the LICENSE file for details.

Citation

@article{chen2026geometric,
  title={Geometric Context Transformer for Streaming 3D Reconstruction},
  author={Chen, Lin-Zhuo and Gao, Jian and Chen, Yihang and Cheng, Ka Leong and Sun, Yipengjing and Hu, Liangxiao and Xue, Nan and Zhu, Xing and Shen, Yujun and Yao, Yao and Xu, Yinghao},
  journal={arXiv preprint arXiv:2604.14141},
  year={2026}
}

Acknowledgments

This work builds upon several excellent open-source projects:


Description
Mirror of github.com/Robbyant/lingbot-map — streaming 3D reconstruction