TAAC2026 Demo Recommendation Dataset (1K Samples) – A Quick Overview
The TAAC2026 data_sample_1000 dataset is a small, 1,000‑row sample of user‑item interaction records released for the TAAC2026 competition. Stored as a flat‑column Parquet file (~39 MB), it contains 120 columns covering identifiers, labels, timestamps, integer and dense user features, integer item features, and four groups of behavioral sequence features. All columns are top‑level, so the data can be read directly by pandas, polars, or the Hugging Face datasets library.
The dataset is explicitly tagged for "recommendation" and "timeseries", indicating its primary use for building recommendation models that can leverage temporal and sequential signals. The schema includes 46 scalar or array integer user features, 10 dense float user features, 14 item features, and 45 sequence columns (list<int64>) split across domains A‑D. Labels are provided via `label_type` and `label_time`, with a `timestamp` column for ordering events.
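To make the schema description concrete, here is a minimal sketch of bucketing column names into groups. Only the core column names and the `domain_*_seq_*` pattern come from the overview above. The lowercase domain letter, the helper `group_columns`, and the example column names are assumptions for illustration, not the dataset's documented naming.

```python
import re
from collections import defaultdict

# Columns named explicitly in the overview above.
CORE_COLUMNS = {"user_id", "item_id", "label_type", "label_time", "timestamp"}

# Matches the `domain_*_seq_*` pattern; the lowercase a-d letter is an assumption.
SEQ_PATTERN = re.compile(r"^domain_([a-d])_seq_(.+)$")


def group_columns(names):
    """Bucket column names into core, per-domain sequence, and other groups."""
    groups = defaultdict(list)
    for name in names:
        match = SEQ_PATTERN.match(name)
        if name in CORE_COLUMNS:
            groups["core"].append(name)
        elif match:
            groups[f"seq_domain_{match.group(1)}"].append(name)
        else:
            # User and item feature columns land here; their prefixes are not
            # stated in the overview, so no pattern is assumed for them.
            groups["other"].append(name)
    return dict(groups)


# Hypothetical column names for illustration only.
example = [
    "user_id", "item_id", "timestamp",
    "domain_a_seq_1", "domain_a_seq_2", "domain_d_seq_7",
    "some_user_feature",
]
groups = group_columns(example)
print({key: len(cols) for key, cols in groups.items()})
```

With the real file, the same function could be called on the column list of a loaded DataFrame, for example `group_columns(pd.read_parquet(path).columns)`, and the bucket sizes compared against the counts quoted above.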
The data is released under a CC‑BY‑NC‑4.0 license, which permits non‑commercial use with attribution. Because it is provided in Parquet format, it can be loaded with a single line of code via `datasets.load_dataset("TAAC2026/data_sample_1000")` or with pandas (`pd.read_parquet`). This accessibility, combined with the competition context, likely explains why the dataset is currently trending on Hugging Face.
Researchers and engineers can use the sample to prototype feature engineering pipelines, experiment with sequence‑aware recommendation algorithms (e.g., transformer‑based models), or benchmark baseline approaches before scaling to the full competition data.
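When benchmarking baselines on interaction logs, a common precaution is to split by time so that validation rows come after training rows. The sketch below shows one way to do that using the `timestamp` column. The function name `temporal_split`, the 20% holdout, and the toy rows are assumptions, and the dataset does not prescribe any particular split.

```python
def temporal_split(records, time_key="timestamp", holdout_fraction=0.2):
    """Sort rows by time and hold out the most recent fraction for validation."""
    ordered = sorted(records, key=lambda row: row[time_key])
    n_holdout = int(len(ordered) * holdout_fraction)
    cut = len(ordered) - n_holdout
    return ordered[:cut], ordered[cut:]


# Toy rows with made-up values; real rows could come from
# `pd.read_parquet(path).to_dict("records")`.
rows = [
    {"user_id": i, "item_id": 100 + i, "timestamp": ts}
    for i, ts in enumerate([50, 10, 90, 30, 70, 20, 100, 40, 80, 60])
]
train, valid = temporal_split(rows)
print(len(train), len(valid))
```

Holding out the latest events avoids training on interactions that occur after the ones being predicted, which a random split would allow.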
Project Ideas
- Train a simple collaborative‑filtering model using the `user_id`, `item_id`, and `label_type` columns as a baseline recommendation system.
- Build a sequence‑aware recommendation model that ingests the `domain_*_seq_*` list<int64> columns with a transformer or RNN to capture user behavior over time.
- Perform feature importance analysis on the 46 integer and 10 dense user features to identify which signals most influence the `label_type` outcome.
- Create a cold‑start recommendation pipeline that predicts interactions for new users solely from their integer and dense feature vectors.
- Visualize temporal interaction patterns by plotting `timestamp` versus `label_type` for different `item_id` groups to look for temporal trends, bearing in mind that 1,000 rows may be too few to confirm seasonality.
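
For the sequence‑aware idea above, the variable‑length `domain_*_seq_*` lists usually need to be brought to a fixed length before they can be batched for a transformer or RNN. The following is a minimal sketch of that preprocessing step. The helper `pad_or_truncate`, the padding value 0, and the assumption that the most recent events sit at the end of each list are illustrative choices, not properties confirmed by the dataset.

```python
def pad_or_truncate(seq, max_len, pad_value=0):
    """Return (padded_sequence, attention_mask), both of length max_len.

    Keeps the last max_len elements, assuming the list is in chronological
    order, and right-pads shorter sequences with pad_value.
    """
    seq = list(seq) if seq is not None else []
    # seq[-0:] would return the whole list, so guard the zero-length case.
    kept = seq[-max_len:] if max_len > 0 else []
    n_pad = max_len - len(kept)
    padded = kept + [pad_value] * n_pad
    mask = [1] * len(kept) + [0] * n_pad
    return padded, mask


short_padded, short_mask = pad_or_truncate([5, 6, 7], max_len=5)
long_padded, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6], max_len=4)
empty_padded, empty_mask = pad_or_truncate(None, max_len=2)
print(short_padded, short_mask)
print(long_padded, long_mask)
```

If 0 is a valid ID in a given sequence column, a different padding value, or shifting all IDs by one, would be needed so the model can distinguish padding from real events.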