Daily Reports
Claude‑Sonnet 4.6 Reasoning Dataset: 799 Deep Thought Conversations
TeichAI/Claude-Sonnet-4.6-Reasoning-799x
The **Claude‑Sonnet‑4.6‑Reasoning‑799x** dataset, authored by TeichAI, contains 799 single‑turn user→assistant exchanges that focus exclusively on chain‑of‑thought reasoning. Each response averages ar...
Michael Hafftka Catalog Raisonné: 3.8K Paintings with Rich Metadata
Hafftka/michael-hafftka-catalog-raisonne
The **Michael Hafftka – Catalog Raisonné** dataset is a curated collection of roughly 3,800 digitized paintings by the American expressionist Michael Hafftka, spanning the period from the 1970s throug...
AutoMathText-V2 Dataset Overview Report
OpenSQZ/AutoMathText-V2
AutoMathText-V2 is a curated collection of 52 premium data sources spanning web content, mathematics, code, reasoning, formal proofs, and bilingual translation. It aggregates over 1.5 trillion tokens,...
Hacker News Dataset Report
open-index/hacker-news
The Hacker News Complete Archive mirrors every item posted on news.ycombinator.com from its inception in October 2006 through the present day, totaling over 47 million records. The data is stored in m...
OmniAction: A Massive Omni‑modal Dataset for Proactive Robot Manipulation
OpenMOSS-Team/OmniAction
The OpenMOSS-Team has released **OmniAction**, a large‑scale multimodal dataset designed for contextual instruction following in robotic manipulation. Hosted on HuggingFace, the dataset contains 141,1...
OmniCoder-9B: A 9B Coding Agent Fine‑Tuned on 425K Agentic Trajectories
Tesslate/OmniCoder-9B
OmniCoder-9B is a 9‑billion‑parameter coding agent released by Tesslate and built on top of Qwen3.5‑9B’s hybrid Gated‑Delta/standard‑attention architecture. It has been fine‑tuned with LoRA (r=64, alp...
Foundation-1: Structured Text‑to‑Sample Music Generator Takes Center Stage
RoyalCities/Foundation-1
Foundation-1, released by RoyalCities, is a next‑generation text‑to‑sample model fine‑tuned from Stability AI’s stable‑audio‑open‑1.0. Designed for modern music production, it interprets layered promp...
Screen‑Recording Dataset Powers Next‑Gen Desktop AI Agents
markov-ai/computer-use-large
The **Computer Use Large** dataset, released by *markov-ai*, contains 48,478 trimmed screen‑recording videos totalling roughly 12,300 hours of professional software usage. All videos are audio‑free an...
Uncensored Multimodal Power: Qwen3.5-35B-A3B Aggressive Variant
HauhauCS/Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive
The **Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive** model is an uncensored, aggressive fork of the original Qwen3.5‑35B‑A3B released by the community contributor HauhauCS. With over 210 k downloads...
Fish Audio S2 Pro – Multilingual TTS with Fine‑Grained Inline Control
fishaudio/s2-pro
Fish Audio S2 Pro is a state‑of‑the‑art text‑to‑speech (TTS) model released by the Fish Audio research team. It supports more than 80 languages, including tier‑1 coverage for English, Chinese, and Jap...
olmOCR-bench: The New Standard for PDF‑to‑Markdown OCR Evaluation
allenai/olmOCR-bench
olmOCR-bench, released by AllenAI, is a benchmark dataset comprising 1,403 PDF files and 7,010 unit test cases that capture the properties a high‑quality OCR system should preserve when converting PDF...
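The unit‑test framing means each PDF is paired with small, machine‑checkable properties rather than a full reference transcript. A minimal sketch of what such property checks might look like (the function names here are hypothetical illustrations, not the benchmark's actual API):

```python
# Hypothetical sketch of "unit test" style OCR checks in the spirit of
# olmOCR-bench: instead of diffing full transcripts, assert that specific
# properties hold in the converted Markdown output.
def check_contains(markdown: str, fragment: str) -> bool:
    """Pass if the OCR output preserved a known text fragment."""
    return fragment in markdown

def check_reading_order(markdown: str, first: str, second: str) -> bool:
    """Pass if `first` appears before `second` in the output."""
    i, j = markdown.find(first), markdown.find(second)
    return 0 <= i < j

# A toy "converted" page to run the checks against.
md = "# Title\n\nAlpha paragraph.\n\nBeta paragraph."
assert check_contains(md, "Alpha paragraph")
assert check_reading_order(md, "Alpha", "Beta")
```

Property checks like these are robust to harmless formatting variation between OCR systems, which full‑text diffing is not.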
BONES-SEED: Massive Multimodal Motion Dataset for Humanoid Robotics
bones-studio/seed
BONES-SEED (Skeletal Everyday Embodiment Dataset) is an open collection of 142,220 annotated human motion captures designed for humanoid robotics research. The dataset provides each motion in three sk...
FinePhrase: 1.35B Synthetic Samples for FAQ, Math, Tables & Tutorials
HuggingFaceFW/finephrase
FinePhrase is a massive synthetic dataset created by DataTrove using the SmolLM2-1.7B-Instruct model. It re‑writes source documents from the FineWeb‑Edu corpus into four distinct instructional formats...
NVIDIA Nemotron‑3 Super 120B FP8: Massive Context & Agentic Reasoning Model
nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
NVIDIA's Nemotron‑3 Super 120B‑A12B‑FP8 is a 120‑billion‑parameter large language model (with 12 B active parameters) released on March 11, 2026. Built on the Transformers library and tagged for text‑...
Open-RL: Verifiable STEM Reasoning Dataset for Outcome‑Supervised RL
TuringEnterprises/Open-RL
The **Open-RL** dataset, released by **TuringEnterprises** on March 2, 2026, offers a compact collection (<1K entries) of self‑contained, verifiable STEM reasoning problems spanning physics, mathemati...
Uncensored Power: Qwen3.5-9B Aggressive Model Goes Multimodal
HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive
The Qwen3.5-9B-Uncensored-HauhauCS-Aggressive model is a 9‑billion‑parameter language model released by HauhauCS that removes all refusal filters from the original Qwen3.5‑9B architecture. According t...
DyNativeGaussian_sequence: A Multimodal Text‑3D Dataset Gains Traction
LeeXiangNO1/DyNativeGaussian_sequence
The dataset "LeeXiangNO1/DyNativeGaussian_sequence" is a recently popular multimodal collection authored by LeeXiangNO1. It contains both textual and 3D data, as indicated by the tags "modality:text" ...
ALL Bench Leaderboard 2026: The First Unified Multi‑Modal AI Benchmark
FINAL-Bench/ALL-Bench-Leaderboard
The **ALL Bench Leaderboard 2026** dataset, curated by FINAL‑Bench, aggregates benchmark scores for more than 90 AI models across six modalities—LLMs, VLMs, autonomous agents, image generation, video ...
Unsloth’s GGUF‑Quantized Qwen3.5‑35B‑A3B: Vision‑Language Power on a Laptop
unsloth/Qwen3.5-35B-A3B-GGUF
The **unsloth/Qwen3.5-35B-A3B-GGUF** repository provides a GGUF‑quantized checkpoint of the Qwen3.5‑35B‑A3B model, repackaged by the Unsloth community. With a **pipeline tag of `image-text-to-text`**...
Unsloth Qwen3.5-9B GGUF Model – Trending Overview
unsloth/Qwen3.5-9B-GGUF
The unsloth/Qwen3.5-9B-GGUF model is a 9‑billion‑parameter multimodal (vision‑language) foundation model quantized to the GGUF format using Unsloth Dynamic 2.0, offering superior accuracy and low‑late...
Phi-4 Reasoning Vision 15B: Multimodal AI with Chain‑of‑Thought Power
microsoft/Phi-4-reasoning-vision-15B
Phi-4-Reasoning-Vision-15B is an open‑weight multimodal model released by Microsoft on March 4, 2026. It combines the Phi‑4‑Reasoning language backbone (5 B–15 B parameters) with a SigLIP‑2 vision enc...
Unsloth Qwen3.5-4B GGUF – Trending Multimodal LLM (2026-03-08)
unsloth/Qwen3.5-4B-GGUF
The unsloth‑quantized Qwen3.5‑4B GGUF model is a 4‑billion‑parameter causal language model with an integrated vision encoder. It supports a native context length of 262K tokens (extendable to >1M) and...
The Stack v2: Massive Multilingual Code Corpus for AI
bigcode/the-stack-v2
The Stack v2, released by the BigCode team, is a gargantuan dataset of source code harvested from over 600 programming languages. Tagged for text-generation tasks, it provides raw code files along wit...
Qwen3.5‑397B‑A17B: Ultra‑Large Multimodal Model Redefines Vision‑Language AI
Qwen/Qwen3.5-397B-A17B
Qwen3.5‑397B‑A17B is a next‑generation multimodal language model released by the Alibaba‑Qwen team. It is an image‑text‑to‑text model built on a causal decoder architecture and equipp...
LocoOperator-4B: A 4B‑Parameter Local Code‑Explorer Agent
LocoreMind/LocoOperator-4B
LocoOperator-4B is a 4‑billion‑parameter tool‑calling agent released by LocoreMind. It is built on the Qwen3‑4B‑Instruct‑2507 base model and distilled from the Qwen3‑Coder‑Next teacher using full‑para...
Qwen3.5-122B-A10B – A 122‑Billion‑Parameter Sparse Mixture‑of‑Experts Vision‑Language Model
Qwen/Qwen3.5-122B-A10B
Qwen3.5-122B-A10B is a 122‑billion‑parameter causal language model with a vision encoder that activates only ~10 B parameters per inference via a 256‑expert Mixture‑of‑Experts architecture (8 routed +...
CCN Dataset: Tabular Classification for Advanced Route Recommendation
GD-ML/CCN
The GD-ML/CCN dataset, released by the GD-ML team, supports the research paper *Towards Full Candidate Interaction: A Comprehensive Comparison Network for Better Route Recommendation*. It is a tabular...
Real Slop: 155k Real LLM Interactions for Dialogue & Safety Research
Solenopsisbot/real-slop
Real Slop is a Hugging Face dataset released by the user Solenopsisbot that aggregates 155,000 real‑world language model interactions in English. The entries span a variety of model families and are s...
GitHub Top Developer Source Code: 1.3M+ Files for Code Intelligence
ronantakizawa/github-top-code
The **GitHub Top Developer Source Code** dataset, authored by *ronantakizawa*, aggregates over **1.3 million source code files** contributed by the most highly ranked GitHub developers between 2015 an...
Coding Agent Conversations: 549 Sessions of AI Tool Use
peteromallet/dataclaw-peteromallet
The *Coding Agent Conversations* dataset (ID: `peteromallet/dataclaw-peteromallet`) is a collection of 549 logged sessions where large language models act as coding assistants. Each session records me...
OpenResearcher Dataset: Structured LLM Interaction Traces for Tool‑Use Research
OpenResearcher/OpenResearcher-Dataset
The OpenResearcher/OpenResearcher-Dataset is a curated collection of 6,102 multi‑turn conversational examples, each tied to a unique question (qid), a reference answer, and a detailed message log. The...
Common Corpus: 2.3 T Token Open Multilingual Text Dataset
PleIAs/common_corpus
Common Corpus, released by PleIAs and a network of partners, is currently the largest openly licensed text collection, containing 2.27 trillion tokens across more than a dozen languages. The dataset a...
ToolMind-Web-QA: Synthetic Multi‑Hop Web‑Search QA for Long‑Horizon Agents
Nanbeige/ToolMind-Web-QA
ToolMind-Web-QA is a publicly released, synthetic dataset created by Nanbeige for research on search‑augmented and long‑horizon search agents. It contains roughly 6,000 complex question‑answer pairs g...
DeepGen 1.0 Image Dataset: Small Yet Powerful Multimodal Training Resource
deepgenteam/DeepGen-1.0
The **deepgenteam/DeepGen-1.0** dataset is a lightweight image collection released by the DeepGen team. Hosted on Hugging Face, it follows the *imagefolder* format, is licensed under Apache‑2.0, and f...
Kimi K2.5: 1‑Trillion‑Parameter Multimodal Agent for Vision‑Language Reasoning
moonshotai/Kimi-K2.5
Kimi K2.5, released by Moonshot AI, is an open‑source, native multimodal model that bridges vision and language through a 1‑trillion‑parameter Mixture‑of‑Experts architecture. Built on top of the Kimi...
MolmoSpaces: A Rich Asset Hub for Robotics and Embodied AI
allenai/molmospaces
MolmoSpaces is a dataset released by the Allen Institute for AI (AI2) that bundles asset data for the MolmoSpaces project. It provides a comprehensive collection of 3‑D objects, robot models, scene de...
MiniMax-M2.5 Model Overview and Insights
MiniMaxAI/MiniMax-M2.5
MiniMax-M2.5 is the latest frontier model from MiniMax AI, excelling in coding, agentic tool use, search, and office work. Trained with reinforcement learning across hundreds of thousands of real-worl...
Fine-T2I: 6M High‑Quality Text‑Image Pairs for Open T2I Fine‑Tuning
ma-xu/fine-t2i
Fine‑T2I is a large‑scale, open dataset released by Xu Ma, Yitian Zhang, Qihua Dong, and Yun Fu from Northeastern University. It contains over 6.15 million text–image pairs (about 2 TB) organized in W...
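Assuming the shards follow the common WebDataset tar layout (an assumption here, not a statement from the dataset card), samples are plain tar members that share a key prefix, e.g. `000001.jpg` plus `000001.txt`. A self‑contained sketch using only the standard library, with a tiny in‑memory shard standing in for a real 2 TB one:

```python
import io
import tarfile

# Build a tiny in-memory tar shard with two (image, caption) samples.
# The payloads are fake placeholders, not real JPEG data.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in [("000001.jpg", b"\xff\xd8fake-jpeg"),
                          ("000001.txt", b"a red bicycle"),
                          ("000002.jpg", b"\xff\xd8fake-jpeg"),
                          ("000002.txt", b"a snowy street")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Read the shard back, grouping members into samples by key prefix.
buf.seek(0)
samples = {}
with tarfile.open(fileobj=buf, mode="r") as tar:
    for member in tar.getmembers():
        key, ext = member.name.rsplit(".", 1)
        samples.setdefault(key, {})[ext] = tar.extractfile(member).read()

print(samples["000001"]["txt"].decode())  # a red bicycle
```

In practice a streaming loader (e.g. the `webdataset` library) applies the same key‑grouping idea shard by shard, so the full corpus never has to be unpacked on disk.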
Qwen3‑TTS 1.7B CustomVoice: Real‑Time Multilingual Speech with Instruction‑Driven Style
Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
Qwen3‑TTS‑12Hz‑1.7B‑CustomVoice is a 1.7 B‑parameter text‑to‑speech model released by the Qwen team. It supports ten major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portu...
Chinese-Fineweb-Edu-V2.2 Dataset: Quickstart Guide
opencsg/Fineweb-Edu-Chinese-V2.2
The Chinese-Fineweb-Edu-V2.2 dataset provides high‑quality Chinese educational text for large‑language‑model pre‑training and instruction fine‑tuning. It is organized into three tiers of pre‑training ...
Kitchen Robotics: 600 Hours of Human Tele‑Operated Demonstrations
nvidia/PhysicalAI-Robotics-Kitchen-Sim-Demos
PhysicalAI‑Robotics‑Kitchen‑Sim‑Demos is a large‑scale dataset released by NVIDIA that captures 600 hours of human‑teleoperated manipulation in a simulated kitchen environment. The data spans 316 dist...
GLM-5: 744B LLM with Sparse Attention, Tool Use, and Long‑Context Capabilities
zai-org/GLM-5
GLM-5, released by the ZAI organization, is a massive multilingual language model targeting complex systems engineering and long‑horizon agentic tasks. It scales up to 744 B parameters (with 40 B acti...
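Several entries in this report describe sparse mixture‑of‑experts checkpoints that quote both total and active parameter counts (Qwen3.5‑122B‑A10B, Nemotron‑3 Super 120B‑A12B, GLM‑5). A back‑of‑envelope sketch of the per‑token density those figures imply, using only the numbers stated above:

```python
# Illustrative arithmetic only: (total params, active params per token)
# as quoted in the report entries above.
moe_models = {
    "Qwen/Qwen3.5-122B-A10B": (122e9, 10e9),
    "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8": (120e9, 12e9),
    "zai-org/GLM-5": (744e9, 40e9),
}
for name, (total, active) in moe_models.items():
    print(f"{name}: {active / total:.1%} of parameters active per token")
```

The implied densities are roughly 8.2%, 10.0%, and 5.4% respectively, which is why these models can quote per‑token compute far below what their headline parameter counts suggest.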
GLM-OCR: Multilingual, High‑Performance OCR for Complex Documents
zai-org/GLM-OCR
GLM-OCR is a multimodal OCR model built on the GLM‑V encoder‑decoder architecture, integrating the CogViT visual encoder and a lightweight cross‑modal connector with a GLM‑0.5B language decoder. It op...
Exploring the Massive International Travel Text Dataset
GD-ML/IntTravel_dataset
The GD-ML/IntTravel_dataset is a large‑scale text collection hosted on the Hugging Face Hub. According to its metadata, the dataset falls in the 100 M < size < 1 B range, is stored in CSV format, and ...
Moonworks Lunara Aesthetic II: High‑Quality Image Variation Dataset
moonworks/lunara-aesthetic-image-variations
The **Moonworks Lunara Aesthetic II** dataset, released by the creator *moonworks*, provides 2,854 paired images designed for research on image editing, image‑to‑image generation, and identity preserv...
DeepPlanning: Benchmarking Long‑Horizon Agentic Planning with Constraints
Qwen/DeepPlanning
DeepPlanning is a newly released dataset from Qwen that serves as a benchmark for evaluating the long‑horizon planning abilities of large language models (LLMs). It focuses on agentic tasks where mod...
Intern‑S1‑Pro: Trillion‑Scale Multimodal Scientific Reasoner Takes the Lead
internlm/Intern-S1-Pro
Intern‑S1‑Pro, released by the InternLM team, is a trillion‑parameter mixture‑of‑experts (MoE) foundation model that targets scientific multimodal reasoning. Tagged with **image‑text‑to‑text**, it acc...
Moltbook Annotated Posts & Submolts: A Rich Resource for Content Classification
TrustAIRLab/Moltbook
The Moltbook Dataset, released by TrustAIRLab, provides over 44,000 GPT‑5.2‑annotated posts and 12,209 submolts harvested from the agent social network Moltbook. Each post is labeled with one of nine ...
NVIDIA Personaplex 7B: Audio‑to‑Audio Model Gains Traction
nvidia/personaplex-7b-v1
The model nvidia/personaplex-7b-v1 is an audio‑to‑audio (speech‑to‑speech) model hosted on Hugging Face. It is built with the Moshi library and distributed in the safetensors format, indicating a focu...
TeichAI's Small Text Dataset for Claude‑4.5 Opus Reasoning Gains Traction
TeichAI/claude-4.5-opus-high-reasoning-250x
The dataset **TeichAI/claude-4.5-opus-high-reasoning-250x** is a compact collection of text entries (size category n<1K) stored in JSON format. Created by the user *TeichAI* on November 27, 2025, it h...
Qwen3-Coder-Next: The Trending Code‑Focused Text Generation Model
Qwen/Qwen3-Coder-Next
Qwen/Qwen3-Coder-Next is a newly released transformer model that targets text‑generation tasks, as indicated by its pipeline_tag. The model’s identifier – “Coder‑Next” – suggests a focus on programmin...
Tencent's CL-bench: A New Benchmark for Long-Context Text Generation
tencent/CL-bench
The CL-bench dataset, released by Tencent, is a recently trending English-language benchmark designed for text‑generation tasks that require handling long contexts. It contains between 1,000 and 10,00...