Social Media Exposure Modeling

Publication · Policy

Analyzed Twitter data to quantify teens’ exposure to pro-tobacco messages, finding that about 36% of key influencers’ direct followers were likely under 18.

Role: Principal Investigator, Lead Data Scientist, System Architect & First Author

Focus: Behavior Monitoring · Exposure Modeling · ML Classification · NLP · Network Analytics · Public Health Intervention · Social Media Analytics · Youth Tobacco Prevention

Outcome: PhD thesis chapter, custom Twitter surveillance pipeline, and policy recommendations on social media oversight for youth protection.

At a Glance

  • Built a scalable pipeline to collect and classify smoking-related tweets and infer user age.
  • Estimated that underage followers saw a median of about 2.2 pro-tobacco tweets per day from key influencers.

The Problem

  • Twitter does not expose users’ ages, making it hard to identify teen users.
  • Massive, noisy tweet streams make isolating meaningful exposure signals difficult.
  • The effect of online pro-smoking content on teen smoking behavior remains unclear.

The Solution

  • Built a Twitter data pipeline to collect and classify smoking-related tweets at scale.
  • Used machine learning to infer users’ ages (under 18 vs. adult) from tweet patterns.
  • Modeled teen tweet-reading behavior to estimate how many pro-tobacco posts they see daily.
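
As a rough illustration of the age-inference step, the sketch below shows one way birthday greetings that state an age could be turned into under-18 vs. adult labels. The regular expression, the function names (`stated_age`, `age_label`), and the label strings are assumptions made for this example; they are not taken from the thesis, whose actual features and rules may differ.

```python
import re
from typing import Optional

# Matches greetings such as "Happy 16th birthday" or "happy 21st bday".
# Illustrative assumption only; not the pattern used in the thesis.
BIRTHDAY_AGE = re.compile(
    r"\bhappy\s+(\d{1,2})(?:st|nd|rd|th)\s+(?:birthday|bday|b-day)\b",
    re.IGNORECASE,
)


def stated_age(tweet_text: str) -> Optional[int]:
    """Return the age stated in a birthday greeting, or None if no age is stated."""
    match = BIRTHDAY_AGE.search(tweet_text)
    return int(match.group(1)) if match else None


def age_label(tweet_text: str, cutoff: int = 18) -> Optional[str]:
    """Map a birthday greeting to 'under_18' or 'adult'; None if no age is stated."""
    age = stated_age(tweet_text)
    if age is None:
        return None
    return "under_18" if age < cutoff else "adult"


if __name__ == "__main__":
    for text in ["Happy 16th birthday @sam!!", "HAPPY 21st BDAY bro", "happy birthday!"]:
        print(repr(text), "->", age_label(text))
```

Labels recovered this way cover only users who receive such greetings, so in practice they would more plausibly serve as training data for a classifier than as a complete age assignment.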

Architecture Overview

  • Scalable data ingestion archived targeted tweets and user metadata via Twitter APIs.
  • NLP and feature extraction cleaned text and prepared data for classification.
  • “Happy Birthday” tweets were used in an age classifier to identify under-18 users (~80% accuracy).
  • A probabilistic model (Poisson process) estimated the probability that teens see key tweets.
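
The Poisson-process idea in the last bullet can be sketched as follows. This is a minimal model under stated assumptions, not the thesis's actual parameterization: a follower's reading sessions are assumed to arrive as a Poisson process at `sessions_per_day`, and a tweet is assumed to stay visible in the timeline for `visible_days` after posting. Both parameter names and the visibility-window assumption are hypothetical.

```python
import math


def p_seen(sessions_per_day: float, visible_days: float) -> float:
    """Probability of at least one reading session while a tweet is still visible.

    For a Poisson process, P(no session in the window) = exp(-rate * window),
    so the tweet is seen with probability 1 - exp(-rate * window).
    """
    if sessions_per_day < 0 or visible_days < 0:
        raise ValueError("rate and window must be non-negative")
    return 1.0 - math.exp(-sessions_per_day * visible_days)


def expected_daily_exposure(
    tweets_per_day: float, sessions_per_day: float, visible_days: float
) -> float:
    """Expected number of tweets seen per day, assuming each tweet is seen independently."""
    return tweets_per_day * p_seen(sessions_per_day, visible_days)


if __name__ == "__main__":
    # Hypothetical inputs chosen only to exercise the functions.
    print(p_seen(sessions_per_day=2.0, visible_days=0.5))
    print(expected_daily_exposure(tweets_per_day=10, sessions_per_day=2.0, visible_days=0.5))
```

Under this sketch, exposure rises with both how often a follower checks the timeline and how long a tweet remains visible, and it saturates at the number of tweets posted.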

Results and Impacts

  • Estimated that about 36% of key influencers’ direct followers were likely under age 18.
  • Underage followers saw a median of ~2.2 pro-tobacco tweets per day from these accounts.
  • Revealed significant adolescent exposure to tobacco content online, prompting calls for social media oversight.

Skills and Tools Used

Technique/Skill | Tools/Implementation
Social media mining | Twitter API for large-scale data collection
Natural language processing | Text cleaning and feature engineering
Machine learning | SVM and Random Forest for tweet/age classification
Network statistics | Probabilistic modeling of information exposure
Systems integration | Designed end-to-end data pipeline

Cross-Project Capabilities

  • Developed a reusable social media analysis pipeline, later applied to e-cigarette studies.
  • The age inference methods generalize to identifying vulnerable groups in other online contagion settings.
  • Integration of ML and network analysis here informed later projects on community and hotspot identification.

Published Papers/Tools

  • PhD Thesis Chapter: Exposure of a Vulnerable Population to Smoking-Related Messaging on Twitter.
  • Policy recommendations on social media oversight for youth protection.
  • Custom Twitter surveillance pipeline (data collection, classification, visualization software).
  • Recommendations for linking social media exposure with behavior, guiding future research.