Explore Race Reports from Reddit

The dashboard below is built from a dataset of race reports posted to Reddit. Each report is analyzed by AI to answer a set of questions and identify specific data points.

On the left, you can use a variety of filters to narrow down the full dataset by demographic, training, and performance characteristics. The search box also lets you find reports related to a specific race.

In the main panel, the “Race Reports” tab will show all of the relevant race reports. Clicking on “Details & AI Takeaways” will open up a larger summary and a link to the original race report. Use this to find individual race reports that you may be interested in reading.

The “Interactive Analytics” tab will show a series of graphs and visualizations. You can use the dropdown to color-code the visuals according to different characteristics. Each scatterplot has a button to toggle a larger full-screen mode.

Explore the dashboard or scroll to the bottom for more details on the data source and methodology.

How the Data Is Collected and Processed

The source data for this dashboard comes from two subreddits: r/AdvancedRunning and r/Marathon_Training.

The original dataset came from Academic Torrents because the Reddit API won’t allow you to scrape historical data. From that point on, I checked for new posts from the two subreddits each day and saved them to the database.
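As a rough illustration of the daily save step, here is a minimal sketch that assumes a SQLite database and a hypothetical `posts` table keyed on the Reddit post ID. The table name, columns, and the `save_new_posts` helper are all assumptions for illustration, not the actual pipeline. The point is that keying on the post ID makes the daily check safe to re-run without creating duplicates.

```python
import sqlite3
from typing import Iterable, Tuple

# Hypothetical row shape: (post_id, subreddit, title, body)
PostRow = Tuple[str, str, str, str]


def save_new_posts(conn: sqlite3.Connection, posts: Iterable[PostRow]) -> int:
    """Insert posts that are not already stored; return how many were added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id TEXT PRIMARY KEY, subreddit TEXT, title TEXT, body TEXT)"
    )
    before = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
    # INSERT OR IGNORE skips rows whose primary key already exists.
    conn.executemany(
        "INSERT OR IGNORE INTO posts (id, subreddit, title, body) "
        "VALUES (?, ?, ?, ?)",
        posts,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
    return after - before
```

Running the same batch twice adds nothing the second time, so overlapping daily checks do no harm.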

Originally, posts that included the phrase “Race Report” in the title or body were flagged for analysis. More recently, I added a third check for posts with the “Race Report” flair. Any post meeting at least one of these criteria was saved to the database.
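The three checks can be sketched roughly as follows. The `Post` class and `is_flagged` function are hypothetical names, and the case-insensitive matching is an assumption; the actual matching rules may differ.

```python
from dataclasses import dataclass
from typing import Optional

PHRASE = "race report"


@dataclass
class Post:
    """Hypothetical minimal representation of a Reddit post."""
    title: str
    body: str
    flair: Optional[str] = None


def is_flagged(post: Post) -> bool:
    """Return True if the post meets at least one of the three criteria."""
    # Assumption: matching ignores case.
    in_title = PHRASE in post.title.lower()
    in_body = PHRASE in post.body.lower()
    has_flair = (post.flair or "").strip().lower() == PHRASE
    return in_title or in_body or has_flair
```

A post only needs to pass one check, so a report with the flair but no matching phrase in the text is still picked up.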

Next, each flagged post was run through a generative AI model (Gemini Flash 3.1). First, the model determined whether or not the post was actually a race report. If it was, then it answered a series of questions about the race report, wrote a short summary, and identified three key takeaways.

Questions include:

  1. The name and distance of the race
  2. The age and gender of the runner
  3. The type of training plan followed and specific details or adjustments
  4. The peak mileage per week run during training
  5. The number of marathons previously run by the runner
  6. The brand and model of shoes worn for the race
  7. The target time and finish time
  8. Whether they ran a negative, neutral, or positive split
  9. Whether they set a PR or met their BQ
  10. A description of the weather conditions
  11. Whether the runner had cramping or other issues during the race

Initially, this produced a full dataset of about 3,700 posts, 3,000 identified race reports, and 1,700 marathon race reports. Each morning, new posts are scraped, processed, and added to the database.

When I first started this project in 2025, I wrote up an analysis of the data here. I revamped the entire data collection pipeline in 2026, and I wrote up a new analysis of the data here.

In the next few months, I plan to a) publish the full dataset to Kaggle to enable further analysis and b) create a method to flag data that was misidentified by the generative AI model.