The Problem
- Social platforms do not expose user age, which makes teen users on Twitter hard to identify.
- Massive, noisy tweet streams make isolating meaningful exposure signals difficult.
- The effect of online pro-smoking content on teen smoking behavior remains unclear.
The Solution
- Built a Twitter data pipeline to collect and classify smoking-related tweets at scale.
- Used machine learning to infer users’ ages (under 18 vs. adult) from tweet patterns.
- Modeled teen tweet-reading behavior to estimate how many pro-tobacco posts they see daily.
Architecture Overview
- Scalable data ingestion archived targeted tweets and user metadata via Twitter APIs.
- NLP and feature extraction cleaned text and prepared data for classification.
- An age classifier used “Happy Birthday” tweets to identify under-18 users (~80% accuracy).
- A probabilistic model (Poisson process) estimated how likely teens were to see key tweets.
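
The exposure step can be illustrated with a minimal sketch. It is not the model from the thesis; it only shows one common way a Poisson process yields a "probability of seeing a tweet". The assumptions are ours: a user checks the timeline as a homogeneous Poisson process, and a tweet stays visible near the top of the timeline for a fixed window. The names `prob_seen`, `expected_tweets_seen`, `checks_per_day`, and `visible_hours` are hypothetical.

```python
import math


def prob_seen(checks_per_day: float, visible_hours: float) -> float:
    """Probability of at least one timeline check while a tweet is visible.

    Assumption (ours): checks arrive as a homogeneous Poisson process, so
    the number of checks in a window of length t is Poisson(rate * t) and
    P(no check in the window) = exp(-rate * t).
    """
    if checks_per_day < 0 or visible_hours < 0:
        raise ValueError("rate and window must be non-negative")
    rate_per_hour = checks_per_day / 24.0
    return 1.0 - math.exp(-rate_per_hour * visible_hours)


def expected_tweets_seen(tweets_per_day: float,
                         checks_per_day: float,
                         visible_hours: float) -> float:
    """Expected number of tweets seen per day.

    Assumption (ours): every tweet has the same chance of being seen,
    so the expectation is the daily tweet count times prob_seen.
    """
    return tweets_per_day * prob_seen(checks_per_day, visible_hours)
```

In this sketch, a user who never checks the timeline has exposure probability 0, and exposure rises toward 1 as either the check rate or the visibility window grows. Real reading behavior is unlikely to be homogeneous over a day, so the constant rate is a simplification.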
Results and Impacts
- Estimated that about 36% of key influencers’ direct followers were likely under age 18.
- Underage followers saw a median of ~2.2 pro-tobacco tweets per day from these accounts.
- Revealed significant adolescent exposure to tobacco content online, prompting calls for social media oversight.
Skills and Tools Used
| Technique/Skill | Tools/Implementation |
|---|---|
| Social media mining | Twitter API for large-scale data collection |
| Natural language processing | Text cleaning and feature engineering |
| Machine learning | SVM and Random Forest for tweet/age classification |
| Network statistics | Probabilistic modeling of information exposure |
| Systems integration | Designed end-to-end data pipeline |
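
As a rough illustration of the "Text cleaning and feature engineering" row, the sketch below normalizes a tweet and turns it into bag-of-words counts that a classifier such as an SVM or Random Forest could consume. The specific cleaning steps (lowercasing, dropping URLs and @mentions, keeping hashtag words) and the names `clean_tweet` and `bag_of_words` are assumptions for illustration, not the project's documented preprocessing.

```python
import re
from collections import Counter

# Hypothetical cleaning rules; the actual pipeline's rules are not documented here.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z0-9']+")


def clean_tweet(text: str) -> str:
    """Lowercase, remove URLs and @mentions, and keep hashtag words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")  # keep the hashtag word, drop the symbol
    return " ".join(text.split())  # collapse runs of whitespace


def bag_of_words(text: str) -> Counter:
    """Count the tokens of the cleaned tweet."""
    return Counter(TOKEN_RE.findall(clean_tweet(text)))
```

Token counts like these are only one possible feature set; n-grams, posting times, or account metadata could be added in the same way.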
Cross-Project Capabilities
- Developed a reusable social media analysis pipeline, later applied to e-cigarette studies.
- Age inference methods generalize to identifying vulnerable groups in any online contagion setting.
- Integration of ML and network analysis here informed later projects on community and hotspot identification.
Published Papers/Tools
- PhD Thesis Chapter: Exposure of a Vulnerable Population to Smoking-Related Messaging on Twitter.
- Policy recommendations on social media oversight for youth protection.
- Custom Twitter surveillance pipeline (data collection, classification, visualization software).
- Recommendations for linking social media exposure with behavior, guiding future research.