Hacker News Dataset Report
The Hacker News Complete Archive mirrors every item posted to news.ycombinator.com from the site's inception in October 2006 through the present day, totaling over 47 million records. The data is stored as monthly Zstandard‑compressed Parquet files plus 5‑minute live blocks for the current day, so it can be queried efficiently with DuckDB, streamed via the 🤗 datasets library, or downloaded in bulk.

Five item types are represented (stories, comments, jobs, polls, and poll options); comments make up 87% of the corpus and stories 13%. Scores follow a steep power‑law distribution (median 0, mean 1.5, top score 6,015), and the average story receives roughly 24 comments.

Each record carries rich metadata (author, timestamps, URLs, score, comment tree, tokenized words), and per‑month statistics (record counts, file sizes, fetch/commit durations) accompany the archive. This comprehensive, regularly updated archive is well suited to large‑scale language‑model pre‑training, trend and sentiment analysis, community‑dynamics research, information‑retrieval benchmarking, and recommendation or ranking model development.
Project Ideas
- Fine‑tune a domain‑specific language model on HN discussion threads to improve code‑related question answering.
- Analyze the evolution of technology topics (e.g., AI, Rust, blockchain) over the past two decades using temporal score and submission trends.
- Build a graph‑based model of comment reply networks to study influence propagation and community structure on Hacker News.
- Create a benchmark dataset of real HN questions and accepted answers for evaluating retrieval‑augmented generation systems.
- Develop a recommendation engine that predicts front‑page popularity of newly submitted stories based on early voting patterns and metadata.
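Several of these ideas, especially popularity prediction and trend analysis, hinge on the skewed score distribution noted in the summary (median 0, mean 1.5): most submissions never leave zero, so naive averages and accuracy‑style metrics mislead. A self‑contained sketch with synthetic Pareto‑distributed scores (an assumption for illustration, not the real data) shows the median/mean gap such a heavy tail produces:

```python
import random
import statistics

random.seed(0)  # reproducible synthetic sample

# Synthetic stand-in for HN scores (NOT the real archive): a heavy-tailed
# Pareto draw, shifted so most items land at 0, mimics the reported
# power-law shape in which the median is 0 while the mean sits well above it.
scores = [int(random.paretovariate(1.5)) - 1 for _ in range(100_000)]

print("median:", statistics.median(scores))
print("mean:  ", round(statistics.mean(scores), 2))
```

Because the bulk of the mass sits at zero while a few outliers dominate the mean, models for these projects are usually evaluated with rank‑based or quantile metrics rather than mean error.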