The Problem
- Traditional outbreak surveillance lags weeks behind real-world events, missing early signals of new threats.
- Critical signals spread across social media, web searches, and news are siloed and hard to monitor in real time.
- Novel bio-threats start with sparse, noisy digital indicators that conventional systems often overlook.
The Solution
- Unified data pipeline continuously ingests multi-source feeds (Twitter, Google Trends, news) for anomaly detection.
- Tiered filtering by pathogen priority and source credibility isolates the most credible weak signals.
- Automated NLP and statistical algorithms flag unusual symptom clusters or location-based spikes in real time.
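
As a rough illustration of the statistical flagging described above, the sketch below marks days whose mention count sits far above a trailing baseline. It is a minimal example under stated assumptions, not the project's actual algorithm: the function name `flag_spikes`, the 7-day window, the z-score threshold of 3, and the `min_sigma` floor are all hypothetical choices made for this example.

```python
from statistics import mean, stdev


def flag_spikes(counts, window=7, threshold=3.0, min_sigma=1.0):
    """Return indices of counts that exceed the trailing-window mean
    by more than `threshold` standard deviations.

    All parameter defaults are illustrative assumptions.
    """
    flagged = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        # Floor the deviation so a flat baseline does not divide by zero
        # or turn a one-mention change into a huge z-score.
        sigma = max(stdev(baseline), min_sigma)
        if (counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical daily counts of posts mentioning one symptom in one city.
daily_mentions = [4, 5, 3, 6, 4, 5, 4, 5, 21, 6]
spike_days = flag_spikes(daily_mentions)
```

A trailing window keeps the baseline local, so slow seasonal drift is less likely to be flagged than a sudden jump.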
Architecture Overview
- Modular cloud pipeline on AWS with engines for data extraction, enrichment, event detection, visualization, evaluation, and alert sharing.
- Real-time ingestion from APIs (Twitter, Google) and RSS feeds funnels into a processing queue with feedback loops to refine signal quality.
- Custom NLP enrichment and geo-clustering components clean data and add context (e.g., mapping tweets to locations) for accurate detection.
- An interactive dashboard displays mapped alerts and timelines, updating continuously for early situational awareness.
- A feedback mechanism updates keywords and data sources based on detected events, improving sensitivity over time.
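
One way to picture the keyword feedback loop is sketched below: terms that recur across posts already tied to a detected event are added to the watch list. This is an assumption-laden sketch rather than the system's real logic; `update_keywords`, the `min_count` cutoff, and the sample stopword list are hypothetical.

```python
import re
from collections import Counter


def update_keywords(keywords, flagged_posts, min_count=2, stopwords=frozenset()):
    """Return `keywords` extended with terms that recur across flagged posts.

    A term qualifies when it appears in at least `min_count` distinct posts
    and is neither a stopword nor already tracked.
    """
    known = set(keywords)
    seen = Counter()
    for post in flagged_posts:
        # Count each term once per post so one long post cannot dominate.
        terms = set(re.findall(r"[a-z]+", post.lower()))
        seen.update(terms - stopwords - known)
    new_terms = {term for term, n in seen.items() if n >= min_count}
    return known | new_terms


# Hypothetical posts attached to a detected event.
posts = [
    "High fever and jaundice reported downtown",
    "Jaundice cases after fever outbreak",
    "Rash and nausea",
]
stop = frozenset({"and", "after", "high", "reported", "cases"})
tracked = update_keywords({"fever", "rash"}, posts, stopwords=stop)
```

Here only "jaundice" appears in two posts, so it is the only term added. A real deployment would likely want human review before new terms go live.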
Results and Impacts
- Successfully flagged real events (e.g., Boston Marathon bombing, Hepatitis A outbreak) in retrospective tests, with signals surfacing earlier than official reports.
- Outperformed 69 other solutions to win DHS’s Hidden Signal Challenge, earning national recognition and stakeholder buy-in.
- Delivered a functional prototype dashboard tested by public health officials, validating the tool’s practical use for early outbreak warning.
Skills and Tools Used
| Skill/Tool Category | Application in Pandemic Pulse (Digital Biosurveillance) |
|---|---|
| Data Engineering & APIs | Python pipelines for real-time ingestion (Twitter API, Google Trends, RSS feeds) |
| Natural Language Processing | Regex filtering and keyword extraction to identify health threat signals |
| Cloud Infrastructure | AWS cloud deployment for scalable processing of high-volume streaming data |
| Algorithms & Analytics | Statistical anomaly detection, clustering algorithms, and custom social-media geolocation |
| Visualization & Dashboard | D3.js and map libraries to build an interactive outbreak monitoring dashboard |
| Collaboration & Management | Cross-sector coordination (federal agencies, health officials) and agile project management (Trello, GitHub) with strict data governance |
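
The regex filtering and keyword extraction row in the table above could look roughly like the following. It is a minimal sketch, assuming a small hand-picked symptom list. `SYMPTOM_TERMS`, `extract_symptoms`, and `filter_posts` are hypothetical names, not taken from the project.

```python
import re

# Hypothetical symptom vocabulary; a real list would be far larger.
SYMPTOM_TERMS = ["fever", "vomiting", "jaundice", "rash", "cough"]

# Whole-word, case-insensitive match on any listed term.
SYMPTOM_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, SYMPTOM_TERMS)) + r")\b",
    re.IGNORECASE,
)


def extract_symptoms(text):
    """Return distinct symptom terms in `text`, lowercased,
    in order of first appearance."""
    found = []
    for match in SYMPTOM_RE.finditer(text):
        term = match.group(1).lower()
        if term not in found:
            found.append(term)
    return found


def filter_posts(posts):
    """Keep only posts that mention a symptom, paired with the terms found."""
    results = []
    for post in posts:
        terms = extract_symptoms(post)
        if terms:
            results.append((post, terms))
    return results
```

Because the pattern matches whole words only, inflected forms such as "coughing" are not caught. Stemming or a broader term list would be needed to cover them.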
Cross-Project Capabilities
- Real-time integration of diverse, noisy data sources – a skill later applied to other domains (e.g., fusing IoT sensor feeds or hospital data streams).
- Applying AI to public health: experience aligning technical solutions with epidemiological needs, a theme continued in pandemic monitoring and clinical ML projects.
- Cross-sector collaboration: honed ability to unite government, healthcare, and tech partners, replicated in later projects like COVID-19 surveys and hospital safety models.
Published Papers/Tools
- Challenge White Paper: “Pandemic Pulse: Using Digital Exhaust to Detect Bio-threats” – internal paper outlining the winning methodology. Links: Challenge Announcement; Winner Announcement.
- Prototype Dashboard: Interactive web dashboard (internal) demonstrated multi-source outbreak alerts in one view for agencies.
- Related Research: Prior social-media surveillance studies (e.g., detecting foodborne illness from Twitter) provided the foundational methods used in this project.