Screen‑Recording Dataset Powers Next‑Gen Desktop AI Agents
markov-ai/computer-use-large
The **Computer Use Large** dataset, released by *markov-ai*, contains 48,478 trimmed screen‑recording videos totalling roughly 12,300 hours of professional software usage, an average of about 15 minutes per video. All videos are audio‑free and have been cleaned to remove non‑screen content, which suits them to visual‑only modeling. The collection spans six major desktop applications (AutoCAD, Blender, Excel, Photoshop, Salesforce, and VS Code), each organized into a dedicated folder with an accompanying `metadata.jsonl` file that records each video's file name, software category, trimmed duration, and number of contiguous recording segments.
Each software category contributes between 7,800 and 11,500 videos, providing a balanced mix of CAD, 3D modeling, spreadsheet manipulation, image editing, CRM, and code editing workflows. The dataset is tagged for *video‑classification* and *robotics*, signaling its relevance for training computer‑use agents that interact with graphical user interfaces (GUIs) through actions such as clicking, typing, and scrolling. Because the videos are organized by category, it is straightforward to load a single category (e.g., `load_dataset("markov-ai/computer-use-large", "blender")`) or the entire corpus at once.
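As a minimal sketch of how the per‑category `metadata.jsonl` files could be inspected, the snippet below aggregates video counts, total hours, and segment counts per software category. The field names (`software`, `duration_seconds`, `num_segments`) and the sample records are assumptions made for illustration; the real keys and units should be checked against the actual files before use.

```python
import json
from collections import defaultdict

# Hypothetical field names: verify them against the real metadata.jsonl.
CATEGORY_KEY = "software"
DURATION_KEY = "duration_seconds"
SEGMENTS_KEY = "num_segments"


def summarize_metadata(lines):
    """Aggregate video count, total hours, and segment count per category."""
    summary = defaultdict(lambda: {"videos": 0, "hours": 0.0, "segments": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        stats = summary[record[CATEGORY_KEY]]
        stats["videos"] += 1
        stats["hours"] += record[DURATION_KEY] / 3600.0
        stats["segments"] += record[SEGMENTS_KEY]
    return dict(summary)


# Made-up records, for illustration only (not taken from the dataset).
sample_lines = [
    '{"file_name": "a.mp4", "software": "blender", "duration_seconds": 1800, "num_segments": 2}',
    '{"file_name": "b.mp4", "software": "blender", "duration_seconds": 5400, "num_segments": 1}',
    '{"file_name": "c.mp4", "software": "excel", "duration_seconds": 900, "num_segments": 3}',
]

summary = summarize_metadata(sample_lines)
for category, stats in sorted(summary.items()):
    print(category, stats)
```

Because `summarize_metadata` accepts any iterable of lines, an open file handle for a downloaded `metadata.jsonl` can be passed in directly instead of the sample list.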
Since its creation on March 12, 2026, the dataset has attracted over 106k downloads and 133 likes and reached a trending score of 78, which reflects strong community interest in building desktop‑automation and robotic‑process‑automation (RPA) models. Licensed under CC‑BY‑4.0, it can be reused with attribution in research and commercial projects that aim to teach AI systems how to operate real‑world software via visual demonstrations.
Project Ideas
- Train a video‑classification model to automatically identify which desktop application (e.g., AutoCAD, VS Code) appears in a screen‑recording.
- Develop a reinforcement‑learning desktop assistant that learns to replicate Blender workflow actions by observing the Blender video demonstrations.
- Create a self‑supervised model that predicts mouse clicks and keystrokes from video frames to automate repetitive Excel tasks.
- Build a benchmark suite for evaluating robotic process automation agents across the six software categories using the dataset's metadata and segment counts.
- Fine‑tune a transformer‑based video encoder to generate step‑by‑step textual guides from Photoshop screen recordings.
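For the application‑classification idea, a first practical step would be a train/validation split that preserves each application's share of the data. The sketch below shows one way to do that over metadata records; the `software` key, the `stratified_split` helper, and the toy records are hypothetical and stand in for whatever the real metadata provides.

```python
import random
from collections import defaultdict


def stratified_split(records, label_key="software", val_fraction=0.2, seed=0):
    """Split records into train/val lists, sampling each label separately."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record[label_key]].append(record)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        # Hold out at least one item per label when the label has more than one.
        n_val = max(1, round(len(group) * val_fraction)) if len(group) > 1 else 0
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val


# Toy records, for illustration only (hypothetical file names and key).
records = [
    {"file_name": f"{label}_{i}.mp4", "software": label}
    for label in ("autocad", "vscode")
    for i in range(10)
]

train, val = stratified_split(records)
print(len(train), len(val))
```

Splitting per label rather than over the whole pool keeps a smaller category from being under‑represented in validation, which matters here because category sizes range from roughly 7,800 to 11,500 videos.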