model June 02, 2026

Marlin‑2B: Tiny Video VLM for Structured Captioning & Temporal Grounding

Marlin‑2B is an open‑source, 2‑billion‑parameter video Vision‑Language Model (VLM) built on Qwen3.5‑2B and developed by the NemoStation team. It adds two developer‑friendly modes: `caption`, which produces dense scene‑plus‑event descriptions with second‑precise timestamps, and `find`, which takes a natural‑language query and returns the matching start‑end spans in a video. The model ships with custom `transformers` code, works with the standard HF API, and can run on a single consumer GPU using vLLM or swift‑deploy.

Marlin‑2B stands out on fine‑grained video captioning (CaReBench, DREAM‑1K) and natural‑language temporal grounding (TimeLens‑Bench), where it outperforms larger open models and matches proprietary systems such as Gemini‑2.5 while staying within a 2B‑parameter budget. Training used a curated mix of public video annotations (ActivityNet, Charades, LSMDC, and others), enriched with dense re‑annotations generated by Gemini‑3‑Flash and refined by human review. A two‑stage fine‑tuning pipeline followed, including SimPO preference optimization.
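
Temporal grounding results are commonly scored by the temporal intersection‑over‑union (IoU) between a predicted span and a reference span, often summarized as recall at a fixed IoU threshold. The sketch below illustrates that general idea only. Whether TimeLens‑Bench uses exactly this metric or this threshold is not stated here, and the function names `temporal_iou` and `recall_at_iou` are hypothetical helpers, not part of the model's code.

```python
def temporal_iou(pred, ref):
    """Intersection-over-union of two (start, end) spans given in seconds."""
    p_start, p_end = pred
    r_start, r_end = ref
    if p_end < p_start or r_end < r_start:
        raise ValueError("span end must not precede its start")
    # Overlap is zero when the spans are disjoint.
    intersection = max(0.0, min(p_end, r_end) - max(p_start, r_start))
    union = (p_end - p_start) + (r_end - r_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(pairs, threshold=0.5):
    """Fraction of (predicted, reference) pairs whose IoU meets the threshold.

    The 0.5 default is an illustrative assumption, not a benchmark setting.
    """
    if not pairs:
        return 0.0
    hits = sum(1 for pred, ref in pairs if temporal_iou(pred, ref) >= threshold)
    return hits / len(pairs)
```

For example, a predicted span of 0 to 10 seconds against a reference of 5 to 15 seconds overlaps for 5 seconds out of a 15‑second union, giving an IoU of one third.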

Marlin‑2B is ready for production use: it includes convenience methods that return parsed dictionaries, supports both caption and grounding tasks out of the box, and requires only a few standard Python packages (`transformers>=5.7.0`, `torch>=2.11.0`, `torchcodec`, `qwen-vl-utils`). The repository also provides a Gradio demo and clear installation instructions, making it easy for developers to integrate structured video understanding into their applications.
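
Because the version floors above are fairly recent, it may help to check an environment before loading the model. The following is a minimal sketch using only the standard library. The minimum versions are the ones quoted above, while `parse_version`, `meets_minimum`, and `check_environment` are hypothetical helper names, and the version parsing is deliberately simplified (it ignores pre‑release and local suffixes).

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions quoted in the text; None means no minimum was stated.
REQUIREMENTS = {
    "transformers": "5.7.0",
    "torch": "2.11.0",
    "torchcodec": None,
    "qwen-vl-utils": None,
}


def parse_version(text):
    """Turn '2.11.0+cu128' into (2, 11, 0), dropping any non-numeric suffix."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    # Pad short versions such as '2.11' so tuple comparison behaves as expected.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True when no minimum is set or the installed version is at least it."""
    if minimum is None:
        return True
    return parse_version(installed) >= parse_version(minimum)


def check_environment(requirements=REQUIREMENTS, lookup=version):
    """Return {package: problem} for every requirement that is not met."""
    problems = {}
    for name, minimum in requirements.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if not meets_minimum(installed, minimum):
            problems[name] = f"found {installed}, need >= {minimum}"
    return problems
```

Calling `check_environment()` with no arguments inspects the current interpreter's installed packages; an empty dictionary means every listed requirement is satisfied.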

Project Ideas

  1. Generate second‑precise scene and event captions for video content to improve accessibility and searchability.
  2. Build a temporal search tool that lets users type an event description and receive the exact start‑end timestamps in surveillance footage.
  3. Create an automated sports‑highlights extractor that identifies key actions (e.g., "goal", "slam dunk") and produces a timeline of events.
  4. Develop a video‑library indexing system that stores structured captions and timestamps, enabling fast keyword‑based navigation across large media collections.
  5. Implement a video‑question‑answering chatbot that uses the `find` mode to locate and quote relevant video segments in response to user queries.
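
Several of these ideas, the indexing system in particular, reduce to storing timestamped caption segments and looking them up by keyword. The sketch below shows one simple way to do that with an in‑memory inverted index. The `Segment` record shape (video id, start, end, text) is an assumption made for illustration and is not the model's documented output format; `CaptionIndex`, `tokenize`, and `format_timestamp` are likewise hypothetical names.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One captioned span; this shape is assumed, not the model's real schema."""
    video_id: str
    start: float  # seconds
    end: float    # seconds
    text: str


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def format_timestamp(seconds):
    """Render a second count as HH:MM:SS, truncating fractions."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


class CaptionIndex:
    """Inverted index from token to the segments whose caption contains it."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._segments = []

    def add(self, segment):
        seg_id = len(self._segments)
        self._segments.append(segment)
        for token in tokenize(segment.text):
            self._postings[token].add(seg_id)

    def search(self, query):
        """Return segments containing every query token, ordered by video and time."""
        tokens = tokenize(query)
        if not tokens:
            return []
        ids = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted(
            (self._segments[i] for i in ids),
            key=lambda s: (s.video_id, s.start),
        )
```

A usage example with made‑up segments:

```python
index = CaptionIndex()
index.add(Segment("match.mp4", 62.0, 70.0, "Striker scores a goal from the penalty spot"))
index.add(Segment("game.mp4", 3725.0, 3731.0, "Slam dunk over two defenders"))
for hit in index.search("slam dunk"):
    print(hit.video_id, format_timestamp(hit.start), format_timestamp(hit.end))
```

Exact token matching keeps the sketch short; a real system would likely add stemming or embedding‑based retrieval and a persistent store.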