The Problem
- Reviews vs Reports Gap: Many diners post about getting sick in online reviews but never notify health authorities, creating an unseen pool of incidents.
- Potential disparities: Certain demographics or locales may rely on review sites instead of official channels, leaving public agencies unaware of issues in those populations.
- Missed signals: Traditional surveillance fails to capture these crowdsourced complaints, risking that outbreaks in some communities go unnoticed.
The Solution
- Scraped, machine-classified and analyzed millions of Yelp reviews for keywords and narratives indicating foodborne illness experiences at restaurants.
- Compared the volume and pattern of Yelp-mentioned illness incidents with official food poisoning complaint data to quantify reporting gaps.
- Highlighted geographic and demographic patterns where many people complain online but far fewer file official reports, enabling health departments to identify where outreach or education is needed.
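
The comparison between online mentions and official complaints can be illustrated with a small sketch. It is not the project's actual code: the function name `reporting_gap`, the county labels, and the toy incident lists are all hypothetical, and the ratio of Yelp mentions to official reports is just one plausible way to express a reporting gap.

```python
from collections import Counter


def reporting_gap(yelp_incidents, official_reports):
    """Count incidents per county in both sources and compute a Yelp-to-official ratio.

    Both arguments are iterables holding one county label per incident.
    (Hypothetical input shape, chosen only for illustration.)
    """
    yelp_counts = Counter(yelp_incidents)
    official_counts = Counter(official_reports)

    gaps = {}
    for county in sorted(set(yelp_counts) | set(official_counts)):
        y = yelp_counts[county]      # Counter returns 0 for missing keys
        o = official_counts[county]
        # The ratio is undefined when there are no official reports,
        # so store None instead of dividing by zero.
        ratio = y / o if o else None
        gaps[county] = {"yelp": y, "official": o, "ratio": ratio}
    return gaps


# Toy data (made up): county "A" has many online mentions but one official report.
yelp = ["A", "A", "A", "A", "B", "C"]
official = ["A", "B", "B"]
gaps = reporting_gap(yelp, official)
```

A high ratio, or a county with online mentions and no official reports at all, would flag a place where outreach might be worth considering.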
Architecture Overview
- Data Collection: Implemented web scrapers to gather Yelp review data across numerous restaurants and regions, building a large dataset of customer feedback.
- NLP Extraction: Used text processing to detect mentions of food poisoning symptoms or complaints within reviews, filtering out unrelated content.
- Machine Learning: Trained a Support Vector Machine on manually labeled reviews to identify food-poisoning-related reviews and whether the reviewer was reporting illness for themselves or for others.
- Data Fusion: Matched the timeline and location of Yelp-indicated incidents with corresponding official public health reports to directly compare unofficial vs. official reporting rates.
- Statistical Analysis: Analyzed seasonality and socio-demographic differences. Modeled factors (e.g., neighborhood income levels, restaurant types) associated with high Yelp complaint activity but low official reporting, to understand underlying causes of reporting disparities.
- Visualization: Created charts and heatmaps to illustrate areas of underreporting, helping stakeholders easily grasp the extent and location of the gaps.
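
As a rough illustration of the NLP extraction step above, the sketch below pre-filters reviews with a keyword pattern. The term list, the function names `mentions_illness` and `prefilter`, and the review dictionaries are assumptions made for this example; the project's actual vocabulary and data schema are not given here.

```python
import re

# Hypothetical keyword list; the real project vocabulary is not specified.
ILLNESS_TERMS = [
    "food poisoning",
    "vomit",
    "diarrhea",
    "nausea",
    "stomach cramps",
    "got sick",
]

# Anchor each term at a word boundary on the left only, so "vomit"
# also matches "vomiting" and "vomited".
ILLNESS_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in ILLNESS_TERMS) + r")",
    re.IGNORECASE,
)


def mentions_illness(review_text):
    """Return True if the review text contains any illness-related term."""
    return ILLNESS_PATTERN.search(review_text) is not None


def prefilter(reviews):
    """Keep only reviews whose text mentions a possible foodborne illness."""
    return [review for review in reviews if mentions_illness(review["text"])]


# Toy reviews (made up) to show the filter in use.
sample_reviews = [
    {"id": 1, "text": "Got sick after the tacos, vomiting all night."},
    {"id": 2, "text": "Great service and a lovely patio."},
    {"id": 3, "text": "Pretty sure it was FOOD POISONING from the oysters."},
    {"id": 4, "text": "The decor was sick!"},
]
candidates = prefilter(sample_reviews)
```

A keyword filter like this only narrows the candidate pool. It cannot tell a genuine illness report from a joke or a denial, which is the kind of distinction the trained classifier is meant to handle.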
Results and Impacts
- Revealed hotspots where Yelp reviews indicated numerous illness incidents yet official reports were scarce, signaling critical underreporting in those areas.
- Restaurant-heavy counties had the most online illness reports. This was the strongest pattern (r = 0.75).
- Richer, more educated counties reported more illness online: online reports correlated positively with bachelor’s degree attainment (r = 0.44) and high-school graduation rates (r = 0.40).
- Poorer counties reported less online: online reports correlated negatively with uninsurance (r = -0.28), unemployment (r = -0.23), poverty (r = -0.18), and SNAP use (r = -0.16).
- Yelp-based food illness reporting looks useful, but it is biased toward restaurant-dense, higher-SES counties, so it should complement, not replace, traditional surveillance.
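
The r values above read as Pearson correlation coefficients between a county-level attribute and online illness reports. The sketch below shows how such a coefficient is computed. The helper `pearson_r` and the toy numbers are illustrative assumptions, not the study's data or code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy county-level values (made up): restaurant counts vs. online illness reports.
restaurants = [10, 20, 30, 40]
online_reports = [1, 3, 2, 5]
r = pearson_r(restaurants, online_reports)
```

Values near +1 or -1 indicate a strong linear association; the socio-economic coefficients listed above are modest by comparison with the restaurant-density one.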
Skills and Tools Used
| Technique/Skill | Tools/Implementation |
|---|---|
| Web Scraping | Automated extraction of Yelp review data (Python, BeautifulSoup) |
| NLP & Text Analysis | Keyword filtering and content analysis of review text |
| Machine Learning | Sequence of SVM classifiers to identify relevant content, party size, and impact |
| Data Integration | Linking crowdsourced review data with official public health databases |
| Statistical Insight | Analyzed socio-demographic patterns in reporting behavior |
Cross-Project Capabilities
- Non-Traditional Data Utilization: Showcased the ability to convert consumer-generated content (reviews) into actionable health insights, a methodology applied in multiple projects.
- Scalable Analysis: Gained experience handling millions of data points and extracting meaningful patterns, skills applicable to any large-scale data challenge.
- Equity Perspective: Experience analyzing data for disparities and gaps informs efforts in other projects to ensure fair and inclusive data-driven interventions.
Published Papers/Tools
- Peer-Reviewed Study: Published findings in a public health journal, detailing how Yelp review analysis can expose hidden food safety issues.
- Conference Presentation: Presented this work via a poster at a national public health informatics conference (2020), sharing the approach with a broader audience.
