Name                               File Type   Size   Last Modified
Twitter-COVID-dataset---June2022                      06/25/2022 02:01 AM

Project Citation: 

Gupta, Raj, Vishwanath, Ajay, and Yang, Yinping. COVID-19 Twitter Dataset with Latent Topics, Sentiments and Emotions Attributes. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2022-06-25. https://doi.org/10.3886/E120321V12

Project Description

Summary: This paper describes a large global dataset on people’s discourse and responses to the COVID-19 pandemic over the Twitter platform. From 28 January 2020 to 1 June 2022, we collected and processed over 252 million Twitter posts from more than 29 million unique users using four keywords: “corona”, “wuhan”, “nCov” and “covid”. Leveraging probabilistic topic modelling and pre-trained machine learning-based emotion recognition algorithms, we labelled each tweet with seventeen attributes, including a) ten binary attributes indicating the tweet’s relevance (1) or irrelevance (0) to the top ten detected topics, b) five quantitative emotion attributes indicating the degree of intensity of the valence or sentiment (from 0: extremely negative to 1: extremely positive) and the degree of intensity of fear, anger, sadness and happiness emotions (from 0: not at all to 1: extremely intense), and c) two categorical attributes indicating the sentiment (very negative, negative, neutral or mixed, positive, very positive) and the dominant emotion (fear, anger, sadness, happiness, no specific emotion) the tweet is mainly expressing. We discuss the technical validity and report the descriptive statistics of these attributes, their temporal distribution, and geographic representation. The paper concludes with a discussion of the dataset’s usage in communication, psychology, public health, economics, and epidemiology.
Funding Sources: Agency for Science, Technology and Research (A*STAR) (ETPL/18-GAP050-R20A); Agency for Science, Technology and Research (A*STAR), Singapore (C210415006); National Medical Research Council, Ministry of Health, Singapore (COVID19RF-005); National Medical Research Council, Ministry of Health, Singapore (COVID19RF-0009)

Scope of Project

Subject Terms: COVID-19; coronavirus; pandemic; social media analytics; Twitter; topic modelling; sentiment analysis; emotion recognition; dataset
Geographic Coverage: Global
Time Period(s): 1/28/2020 – 9/1/2021
Universe: English-language Twitter posts labelled with topic, sentiment and emotion attributes using pre-trained algorithms
Data Type(s): geographic information system (GIS) data; observational data; other; text
Collection Notes: The latest dataset version (V12, June 2022) has the following main updates: a) full data coverage extended to 28 January 2020 – 1 June 2022 (2 years and 4 months), b) country-specific CSV file downloads covering 30 representative countries, c) new vaccine-related data covering 3 November 2021 to 1 June 2022 (8 months), and d) an updated discussion of the dataset’s usage.

Methodology

Response Rate: Existing observational data from Twitter users' public posts (tweets)
Sampling: From 28 January 2020 to 1 June 2022, we collected and processed over 252 million Twitter posts from more than 29 million unique users using four keywords: “corona”, “wuhan”, “nCov” and “covid”.
Data Source: Twitter standard search application programming interface (API)
Collection Mode(s): other; web scraping
Scales: Leveraging probabilistic topic modelling and pre-trained machine learning-based emotion recognition algorithms, we labelled each tweet with seventeen attributes, including a) ten binary attributes indicating the tweet’s relevance (1) or irrelevance (0) to the top ten detected topics, b) five quantitative emotion attributes indicating the degree of intensity of the valence or sentiment (from 0: extremely negative to 1: extremely positive) and the degree of intensity of fear, anger, sadness and happiness emotions (from 0: not at all to 1: extremely intense), and c) two categorical attributes indicating the sentiment (very negative, negative, neutral or mixed, positive, very positive) and the dominant emotion (fear, anger, sadness, happiness, no specific emotion) the tweet is mainly expressing.
Unit(s) of Observation: Individual tweet, with attributes that can be aggregated and analyzed by time stamp, location, topic, sentiment, emotion and so on
Geographic Unit: Geographic location is labelled at the country/region level
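
The seventeen attributes described under Scales, and the per-tweet aggregation mentioned under Unit(s) of Observation, can be illustrated with a minimal Python sketch. It is not the depositors' code and does not reflect the actual CSV layout: the record type `TweetRecord`, its field names (`date`, `country`, `topics`, `intensities`, `sentiment`, `emotion`) and both helper functions are hypothetical names chosen for illustration. Only the value ranges and category labels are taken from the description above.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# Category labels as listed in the Scales description.
SENTIMENT_LABELS = {
    "very negative", "negative", "neutral or mixed", "positive", "very positive",
}
EMOTION_LABELS = {"fear", "anger", "sadness", "happiness", "no specific emotion"}
# Hypothetical keys for the five intensity attributes (valence plus four emotions).
INTENSITY_FIELDS = ("valence", "fear", "anger", "sadness", "happiness")


@dataclass
class TweetRecord:
    """One labelled tweet; all field names here are assumptions."""
    date: str          # e.g. "2020-03-01"
    country: str       # country/region label
    topics: tuple      # ten 0/1 flags, one per detected topic
    intensities: dict  # five values in [0, 1], keyed by INTENSITY_FIELDS
    sentiment: str     # one of SENTIMENT_LABELS
    emotion: str       # one of EMOTION_LABELS


def validate(rec):
    """Return a list of problems; an empty list means the record fits the description."""
    problems = []
    if len(rec.topics) != 10 or any(t not in (0, 1) for t in rec.topics):
        problems.append("topics must be ten 0/1 flags")
    for name in INTENSITY_FIELDS:
        value = rec.intensities.get(name)
        if value is None or not 0.0 <= value <= 1.0:
            problems.append(f"{name} intensity must be in [0, 1]")
    if rec.sentiment not in SENTIMENT_LABELS:
        problems.append("unknown sentiment label")
    if rec.emotion not in EMOTION_LABELS:
        problems.append("unknown dominant-emotion label")
    return problems


def mean_intensity_by_day(records, field):
    """Average one intensity attribute over all tweets sharing the same date."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec.date].append(rec.intensities[field])
    return {day: mean(values) for day, values in buckets.items()}
```

Grouping on `rec.country` instead of `rec.date` would give the country/region-level view in the same way.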


This material is distributed exactly as it arrived from the data depositor. ICPSR has not checked or processed this material. Users should consult the investigator(s) if further information is desired.