Foodborne Illness — Reporting Disparities via Yelp

Analyzed 1.5 million Yelp restaurant reviews for food poisoning mentions to uncover public health underreporting disparities.

Role: Lead Data Scientist

Focus: Civic Tech · Data Visualization · Health Equity Analytics · NLP · Social Media Analytics

At a Glance

  • Utilized machine learning to identify communities where people report illness online (via reviews) but not through official channels, highlighting equity issues in reporting.
  • Foodborne illness reports rose with the number of restaurants and water violations, and peaked in summer and on Sundays and Mondays.
  • Foodborne illness reporting was higher in counties with higher income and education levels.

The Problem

  • Reviews vs Reports Gap: Many diners post about getting sick in online reviews but never notify health authorities, creating an unseen pool of incidents.
  • Potential disparities: Certain demographics or locales may rely on review sites instead of official channels, leaving public agencies unaware of issues in those populations.
  • Missed signals: Traditional surveillance fails to capture these crowdsourced complaints, risking that outbreaks in some communities go unnoticed.

The Solution

  • Scraped, machine-classified, and analyzed 1.5 million Yelp reviews for keywords and narratives indicating foodborne illness experiences at restaurants.
  • Compared the volume and pattern of Yelp-mentioned illness incidents with official food poisoning complaint data to quantify reporting gaps.
  • Highlighted geographic and demographic patterns where many people complain online but far fewer file official reports, enabling health departments to identify where outreach or education is needed.
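
One way to quantify the gap described above is a per-county ratio of Yelp-flagged illness reviews to official complaints. The sketch below is illustrative only: the county names, the counts, the smoothing constant, and the `reporting_gap` helper are all hypothetical and are not the project's data or code.

```python
from collections import Counter


def reporting_gap(yelp_counts, official_counts, smoothing=1.0):
    """Return {county: ratio} of Yelp-flagged reviews to official complaints.

    Additive smoothing keeps the ratio finite for counties that have
    no official complaints at all.
    """
    counties = set(yelp_counts) | set(official_counts)
    return {
        county: (yelp_counts.get(county, 0) + smoothing)
        / (official_counts.get(county, 0) + smoothing)
        for county in counties
    }


# Made-up counts, for illustration only.
yelp = Counter({"County A": 40, "County B": 5, "County C": 12})
official = Counter({"County A": 4, "County B": 5})

gaps = reporting_gap(yelp, official)

# Counties with the largest ratio are candidates for outreach.
ranked = sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)
```

A high ratio marks a county where people complain online far more often than they file official reports. A real analysis would probably also normalize by population or restaurant count.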

Architecture Overview

  • Data Collection: Implemented web scrapers to gather Yelp review data across numerous restaurants and regions, building a large dataset of customer feedback.
  • NLP Extraction: Used text processing to detect mentions of food poisoning symptoms or complaints within reviews, filtering out unrelated content.
  • Machine Learning: Trained a Support Vector Machine on manually labeled reviews to identify food-poisoning-related reviews and whether the illness was reported for the reviewer or for others.
  • Data Fusion: Matched the timeline and location of Yelp-indicated incidents with corresponding official public health reports to directly compare unofficial vs. official reporting rates.
  • Statistical Analysis: Analyzed seasonality and socio-demographic differences. Modeled factors (e.g., neighborhood income levels, restaurant types) associated with high Yelp complaint activity but low official reporting, to understand underlying causes of reporting disparities.
  • Visualization: Created charts and heatmaps to illustrate areas of underreporting, helping stakeholders easily grasp the extent and location of the gaps.
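
The NLP extraction step can be pictured as a keyword pre-filter that narrows the reviews before classification. This is a minimal sketch using only the Python standard library. The term list and the `is_candidate` helper are assumptions made for illustration; the write-up does not list the keywords the project actually used.

```python
import re

# Hypothetical term list; the project's actual keywords are not given.
ILLNESS_TERMS = [
    r"food\s+poisoning",
    r"got\s+sick",
    r"vomit\w*",
    r"diarrhea",
    r"nause\w*",
    r"stomach\s+(?:ache|cramps?|bug)",
]
ILLNESS_RE = re.compile(
    r"\b(?:" + "|".join(ILLNESS_TERMS) + r")\b", re.IGNORECASE
)


def is_candidate(review_text):
    """Return True if the review mentions any illness-related term."""
    return ILLNESS_RE.search(review_text) is not None


# Made-up reviews, for illustration only.
reviews = [
    "Great tacos, friendly staff.",
    "My husband got food poisoning after the oysters.",
    "I was vomiting all night after eating here.",
]
candidates = [r for r in reviews if is_candidate(r)]
```

A filter like this produces false positives (for example, "nobody got sick"), which is one reason to follow it with a trained classifier.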

Results and Impacts

  • Revealed hotspots where Yelp reviews indicated numerous illness incidents yet official reports were scarce, signaling critical underreporting in those areas.
  • Restaurant-heavy counties had the most online illness reports. This was the strongest pattern (r = 0.75).
  • Richer, more educated counties reported more illness online: reporting correlated positively with bachelor’s degree attainment (r = 0.44) and high-school graduation (r = 0.40).
  • Poorer counties reported less online: reporting correlated negatively with uninsurance (r = -0.28), unemployment (r = -0.23), poverty (r = -0.18), and SNAP use (r = -0.16).
  • Yelp-based food illness reporting looks useful, but it is biased toward restaurant-dense, higher-SES counties, so it should complement, not replace, traditional surveillance.
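
For readers unfamiliar with the r values above, the sketch below shows how a correlation of this kind is computed. It assumes the reported values are Pearson correlation coefficients, which the write-up does not state. The input numbers are made up and do not reproduce the project's results.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up per-county values, for illustration only.
restaurants_per_county = [10, 20, 30, 40, 50]
illness_reviews = [1, 3, 2, 5, 4]

r = pearson_r(restaurants_per_county, illness_reviews)
```

Values near 1 or -1 indicate a strong linear association and values near 0 a weak one. Correlation alone does not show why a county reports more or less.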

Skills and Tools Used

Technique/Skill | Tools/Implementation
Web Scraping | Automated extraction of Yelp review data (Python, BeautifulSoup)
NLP & Text Analysis | Keyword filtering and content analysis of review text
Machine Learning | Sequence of SVM classifiers to identify relevant content, party size, and impact
Data Integration | Linking crowdsourced review data with official public health databases
Statistical Insight | Analyzed socio-demographic patterns in reporting behavior
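
The sequence of SVM classifiers listed above can be sketched as a two-stage cascade: one model flags illness-related reviews, and a second model, applied only to flagged reviews, decides whether the reviewer or someone else was ill. The sketch assumes scikit-learn with TF-IDF features and a linear SVM. The write-up does not state the library, features, or kernel the project used, and the training texts and labels below are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def make_text_svm():
    """TF-IDF features feeding a linear SVM (an assumed configuration)."""
    return make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())


# Tiny made-up training set; a real model needs many labeled reviews.
texts = [
    "I got food poisoning from the chicken",
    "my friend was vomiting after the sushi",
    "I was sick all night after eating here",
    "lovely patio and quick service",
    "the burger was overpriced",
    "best pizza in town",
]
is_illness = [1, 1, 1, 0, 0, 0]

# Stage 1: illness-related or not.
stage1 = make_text_svm().fit(texts, is_illness)

# Stage 2: trained only on illness reviews; 1 = reviewer was ill, 0 = someone else.
illness_texts = [t for t, y in zip(texts, is_illness) if y == 1]
about_self = [1, 0, 1]
stage2 = make_text_svm().fit(illness_texts, about_self)


def classify(review):
    """Run the cascade and return a human-readable label."""
    if stage1.predict([review])[0] == 0:
        return "not illness-related"
    return "self" if stage2.predict([review])[0] == 1 else "other"


label = classify("I got sick after the oysters")
```

With so few examples the predictions themselves are not meaningful. The point is the structure, in which each later classifier sees only the reviews that the earlier one let through.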

Cross-Project Capabilities

  • Non-Traditional Data Utilization: Showcased the ability to convert consumer-generated content (reviews) into actionable health insights, a methodology applied in multiple projects.
  • Scalable Analysis: Gained experience handling millions of data points and extracting meaningful patterns, skills applicable to any large-scale data challenge.
  • Equity Perspective: Experience analyzing data for disparities and gaps informs efforts in other projects to ensure fair and inclusive data-driven interventions.

Published Papers/Tools

  • Peer-Reviewed Study: Published findings in a public health journal, detailing how Yelp review analysis can expose hidden food safety issues. Paper
  • Conference Presentation: Presented this work via a poster at a national public health informatics conference (2020), sharing the approach with a broader audience.