Jagle: A 9M‑Instance Japanese Vision‑Language Dataset for VQA and Multimodal Training
Jagle is a large‑scale Japanese multimodal post‑training dataset released by the llm‑jp team. It contains roughly 9.2 million image‑text instances covering a wide range of tasks, most notably visual question answering (VQA). The dataset was assembled from several public sources, including the WAON image‑text pairs, Japanese‑photos, Wikipedia‑ja, FinePDFs‑Edu, PlotQA, and TAT‑QA. These sources were then enriched with VQA annotations generated by vision‑language models (e.g., Qwen3‑VL‑235B‑A22B‑Instruct) and by translation pipelines. Jagle was used to train the LLM‑jp‑4‑VL 9B‑beta model, and the accompanying paper reports performance gains on Japanese vision‑language benchmarks.
Because of copyright restrictions, the dataset is hosted on a server located in Japan and must be downloaded via GitLab (https://gitlab.llm-jp.nii.ac.jp/datasets/jagle). After cloning, users can load an individual subset with the 🤗 datasets library, for example `load_dataset("data/Jagle/japanese-photos-VQA", split="train")`. The dataset is provided under Article 30‑4 of the Japanese Copyright Act, and the individual sources carry their own licenses, including Apache 2.0, CC0, and CC‑BY variants.
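The loading step can be wrapped in a small helper. The following is a minimal sketch, assuming the repository has been cloned so that subsets sit under `data/Jagle/` relative to the working directory, as in the example path above. The helper names `jagle_subset_dir` and `load_jagle_subset` are hypothetical and not part of the dataset's tooling; only `load_dataset` comes from the 🤗 datasets library.

```python
from pathlib import Path


def jagle_subset_dir(subset: str, root: str = "data/Jagle") -> Path:
    """Build the local directory of one Jagle subset.

    The default root mirrors the example path in the text and is an
    assumption about where the GitLab clone was placed.
    """
    return Path(root) / subset


def load_jagle_subset(subset: str, root: str = "data/Jagle", split: str = "train"):
    """Load one subset from a local clone (hypothetical helper)."""
    subset_dir = jagle_subset_dir(subset, root)
    if not subset_dir.is_dir():
        # Fail early with a readable message instead of a loader error.
        raise FileNotFoundError(
            f"{subset_dir} not found; clone the Jagle repository first."
        )
    # Imported lazily so the path helper works without the library installed.
    from datasets import load_dataset

    return load_dataset(str(subset_dir), split=split)


# Example usage after cloning:
# ds = load_jagle_subset("japanese-photos-VQA")
# print(ds[0])
```

Checking for the directory before calling the loader makes a missing or misplaced clone easy to diagnose, which matters here because the data cannot be fetched automatically from a hub.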
Jagle is aimed at researchers and developers building Japanese vision‑language models, VQA systems, multimodal retrieval, or any downstream task that benefits from large‑scale image‑text data in Japanese. Its diverse source material and VQA pairs make it a useful resource for multimodal AI work in the Japanese language.
Project Ideas
- Fine‑tune a Japanese VQA model on Jagle to improve answer accuracy for Japanese image queries.
- Create a multimodal image search engine that retrieves Japanese images based on natural language queries using the dataset's image‑text pairs.
- Develop an automated Japanese image captioning system by training on Jagle's image‑text and VQA annotations.
- Build an educational QA chatbot that answers visual questions from Japanese textbooks and PDFs using the PDF‑derived VQA data in Jagle.
- Benchmark cross‑lingual vision‑language models by evaluating them on Jagle's Japanese VQA tasks and comparing the results with English counterparts.
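For the fine‑tuning and benchmarking ideas above, some way of scoring predicted answers against reference answers is needed. The following is a minimal sketch of an exact‑match scorer for Japanese answers. It is an illustrative assumption, not the evaluation protocol used in the Jagle paper, and the function names are hypothetical. NFKC normalization is applied so that full‑width and half‑width variants of the same character compare as equal.

```python
import unicodedata
from typing import Sequence


def normalize_answer(text: str) -> str:
    """Normalize an answer string before comparison.

    NFKC folds full-width digits and Latin letters to ASCII and
    half-width katakana to full-width; a trailing full stop is dropped.
    """
    text = unicodedata.normalize("NFKC", text)
    text = text.strip().lower()
    return text.rstrip("。.").strip()


def exact_match_accuracy(
    predictions: Sequence[str], references: Sequence[str]
) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    hits = sum(
        normalize_answer(p) == normalize_answer(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)
```

Exact match is strict and will undercount correct free‑form answers, so it is best suited to short answers such as counts, colors, or names; longer answers would need a softer metric.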