PuReGoMe is a research project of Utrecht University and the Netherlands eScience Center. We analyze Dutch social media messages to assess the opinion of the public towards the COVID-19 pandemic measures taken by the Dutch government.
PuReGoMe uses three data sources for social media analysis. Our main data source is Dutch tweets from Twitter. We share the ids of most of the tweets used in the project. We use Dutch posts from Reddit and comments from Nu.nl to verify the results obtained by the tweet analysis.
After a month has ended, the data from the month are collected. Here are the steps taken for each data source:
- the tweets are taken from the crawler of twiqs.nl. They are stored in json format in hourly files. (this data set is not publicly available)
- the tweets are extracted from the json files and stored in csv files (six columns: id_str, in_reply_to_status_id_str, user, verified, text, location). The conversion is performed by the script query-text.py
- duplicate tweets are removed from the csv files produced in step 2 by the script text-unique.py
(it would be useful to combine steps 2 and 3 in the future)
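The note about combining steps 2 and 3 can be illustrated with a minimal sketch. It is not the code of query-text.py or text-unique.py. It assumes that each hourly file holds one JSON tweet object per line and that the fields follow the classic Twitter API layout (`user.screen_name`, `user.verified`, `user.location`); the function names and the choice to deduplicate on `id_str` are hypothetical as well.

```python
import csv
import json

# Column order follows the six columns named in the steps above.
COLUMNS = ["id_str", "in_reply_to_status_id_str", "user", "verified", "text", "location"]


def tweet_to_row(tweet):
    """Map one tweet object to the six csv columns (field layout is an assumption)."""
    user = tweet.get("user") or {}
    text = tweet.get("full_text") or tweet.get("text") or ""
    return {
        "id_str": tweet.get("id_str", ""),
        "in_reply_to_status_id_str": tweet.get("in_reply_to_status_id_str") or "",
        "user": user.get("screen_name", ""),
        "verified": user.get("verified", False),
        "text": " ".join(text.split()),  # collapse newlines so each tweet stays on one line
        "location": user.get("location") or "",
    }


def convert_unique(json_lines, out_file, key="id_str"):
    """Write one csv row per distinct value of `key`; return the number of rows written."""
    seen = set()
    writer = csv.DictWriter(out_file, fieldnames=COLUMNS)
    writer.writeheader()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = tweet_to_row(json.loads(line))
        if row[key] in seen:
            continue  # duplicate: already written
        seen.add(row[key])
        writer.writerow(row)
    return len(seen)
```

Passing `key="text"` would deduplicate on the message text instead of the tweet id; which of the two the project scripts use is not stated here.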
- In a new month directory, run the script get_subreddit_ids.py to automatically retrieve the subreddits of the Dutch corona reddits
- Copy the list of ids of the subreddit Megathread Coronavirus COVID-19 in Nederland (submissions_ids_thenetherlands.txt) from a previous month directory and manually add the ids of recent subreddits
- Run the script coronamessagesnl.py on the files submissions_ids_* to automatically retrieve the posts in the found subreddits
- Run the notebook reddit.ipynb to get all the posts from the monthly downloads directory and store them in the directory text
- Run code blocks 1, 3 and 4 of the notebook selenium-test.ipynb, after updating the name of the file in URLFILE in code block 1
- Restart the notebook and run code blocks 1 and 6, after changing the name of the new downloads directory in DATADIROUT in code block 6. This process takes many hours (even days) to complete
- The notebook can be copied and several copies can be run in parallel
- When the notebooks have finished: delete all pairs of files of sizes 1 and 3 in this month's directory (keep the ones with only size 1) and rerun the notebooks
- Repeat step 4 until no articles with comments are found
- go to (cd) the directory data/nunl
- run the script ../../scripts/combineNunlComments.sh to update the files in the main directory downloads
- run the notebook nunl-convert-data.ipynb to regenerate the data files in the directory text
- for fetching the article texts: run code blocks 1 and 2 of the notebook selenium-test.ipynb, after updating the variables URLFILE and OUTFILEMETADONELIST
PuReGoMe performs analysis on three different levels: by counting messages, by determining their polarity (sentiment) and by determining their stance with respect to anti-pandemic government measures.
Frequency analysis of tweets is performed in the notebook tweet-counts.ipynb. The notebook defines several pandemic queries, for example face mask, lockdown, social distancing and pandemic, where pandemic is a combination of 60+ relevant terms. The notebook produces a graph with the absolute daily frequencies of the tweets matching each of these pandemic queries.
Frequency analysis of the Nu.nl and Reddit data is included in the respective data generation notebooks nunl-convert-data.ipynb and reddit.ipynb.
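Counting the daily frequency of messages that match a query can be sketched as follows. This is an illustration, not the code of tweet-counts.ipynb: the regular expressions in `QUERIES` are hypothetical stand-ins for the notebook's queries, and the input is assumed to be a list of (date, text) pairs.

```python
import re
from collections import Counter

# Hypothetical example queries; the project's own query terms are not listed here.
QUERIES = {
    "face mask": r"mondkapje",
    "lockdown": r"lockdown",
    "social distancing": r"1[.,]5\s*m|anderhalve\s+meter",
}


def daily_counts(messages, pattern):
    """Count, per date, the messages whose text matches the regular expression."""
    regex = re.compile(pattern, re.IGNORECASE)
    counts = Counter()
    for date, text in messages:
        if regex.search(text):
            counts[date] += 1
    return dict(counts)
```

Calling `daily_counts(messages, QUERIES["lockdown"])` returns a dictionary from date to absolute frequency, which can then be plotted as a time series.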
Polarity analysis is the same as sentiment analysis. This analysis is performed by the notebook sentiment-pattern.ipynb, which uses the Python package Pattern for sentiment analysis of Dutch text (De Smedt & Daelemans, 2011). The notebook requires two types of input files: the csv files in the text directories of each of the data sources, and sentiment score files, which should be generated from these csv files with the script ../../scripts/sentiment-pattern-text.py. The polarity analysis of the different topics takes a lot of time and can be run in parallel. It produces time series graphs for all tweets, all pandemic tweets and several individual pandemic topics.
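Turning per-message polarity scores into a daily time series can be sketched as below. The scorer is passed in as a function: in the project that role is played by Pattern, while `toy_polarity` and `TOY_LEXICON` are hypothetical stand-ins so that the sketch runs without Pattern installed. The (date, text) input format is an assumption as well.

```python
from collections import defaultdict

# Toy lexicon, purely illustrative; it is not Pattern's Dutch sentiment lexicon.
TOY_LEXICON = {"goed": 0.5, "fijn": 0.6, "slecht": -0.6, "vreselijk": -0.9}


def toy_polarity(text):
    """Average the lexicon scores of the known words; 0.0 when no word is known."""
    scores = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def daily_mean_polarity(messages, scorer=toy_polarity):
    """Return {date: mean polarity} for an iterable of (date, text) pairs."""
    per_day = defaultdict(list)
    for date, text in messages:
        per_day[date].append(scorer(text))
    return {date: sum(scores) / len(scores) for date, scores in per_day.items()}
```

Because each day is scored independently, the input can be split by day or by topic and processed in parallel.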
Stance analysis is performed by the notebook fasttext.ipynb. The analysis originates from a model trained by fastText on manually labeled tweets. The notebook contains a section for searching for the best parameters of fastText using grid search but when the training data is unchanged this section can be skipped. The notebook has two main modes related to topics: analysis related to the social distancing policy and analysis related to the former (April 2020) face mask policy. These are the only two topics for which we have manually labeled training data. The graphs combine analysis for all three data sources used in the project: Twitter, Nu.nl and Reddit.
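Two preparatory steps for this kind of analysis can be sketched without fastText itself: writing labeled tweets in the `__label__` format that fastText expects for supervised training, and enumerating parameter combinations for a grid search. The function names, the label `SUPPORTS` and the lowercasing step are hypothetical; the notebook's own preprocessing may differ.

```python
import itertools
import re


def to_fasttext_line(label, text):
    """Format one labeled tweet as a fastText supervised training line."""
    clean = re.sub(r"\s+", " ", text).strip().lower()  # one tweet per line, lowercased
    return f"__label__{label} {clean}"


def parameter_grid(grid):
    """Yield one dict per combination of the parameter values in `grid`."""
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))
```

Each dictionary produced by `parameter_grid({"lr": [0.1, 0.2], "epoch": [10, 50]})` could be passed as keyword arguments to a training call, with the best combination selected on held-out labeled data.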
The notebook topic-analysis.ipynb is used to find new topics in a day of tweets. It compares the vocabulary of a day of tweets with the vocabulary of the preceding day.
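One way to compare the vocabularies of two consecutive days is sketched below. It ranks words by the ratio of their smoothed relative frequencies; this scoring choice, the whitespace tokenization and the function name are assumptions, not necessarily what topic-analysis.ipynb does.

```python
from collections import Counter


def new_topic_candidates(today_texts, yesterday_texts, top=10):
    """Rank today's words by how much more frequent they are than yesterday."""
    today = Counter(w for text in today_texts for w in text.lower().split())
    yesterday = Counter(w for text in yesterday_texts for w in text.lower().split())
    today_total = sum(today.values())
    yesterday_total = sum(yesterday.values())
    vocab_size = len(set(today) | set(yesterday))

    def ratio(word):
        # Add-one smoothing keeps words unseen yesterday from dividing by zero.
        p_today = (today[word] + 1) / (today_total + vocab_size)
        p_yesterday = (yesterday[word] + 1) / (yesterday_total + vocab_size)
        return p_today / p_yesterday

    # Highest ratio first; ties are broken alphabetically for a stable order.
    ranked = sorted(today, key=lambda w: (-ratio(w), w))
    return ranked[:top]
```

Words at the top of the ranking are frequent today but rare or absent the day before, which makes them candidates for a newly emerging topic.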
echo-chambers.ipynb is used for finding groups of users which collectively retweet similar content. The notebook found a group of a few hundred users retweeting right-wing propaganda. Further study needs to be done to check if this content has any effect on the findings of this project.
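A simple way to find users who retweet similar content is to compare their sets of retweeted tweet ids. The sketch below uses Jaccard overlap with a threshold; both the measure and the threshold value are assumptions for illustration, and the method used in echo-chambers.ipynb may be different.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(retweets, threshold=0.5):
    """Return user pairs whose retweet sets overlap at least `threshold`.

    `retweets` maps a user name to the set of tweet ids that user retweeted.
    """
    pairs = []
    for u, v in combinations(sorted(retweets), 2):
        if jaccard(retweets[u], retweets[v]) >= threshold:
            pairs.append((u, v))
    return pairs
```

Groups can then be formed by linking the pairs, for example as connected components of the resulting user graph. Comparing all pairs is quadratic in the number of users, so this sketch only suits small samples.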
geo-analysis.ipynb and geo-classification.ipynb can be used to divide the tweets into groups depending on the location of the tweeter. This only works for about half of the data set. Next, maps representing tweet data can be created with the notebook maps.ipynb.
corona-nl-totals.ipynb creates graphs of the number of infections, hospitalizations and deaths in The Netherlands based on data provided by the health organization RIVM.
Erik Tjong Kim Sang, Shihan Wang, Marijn Schraagen and Mehdi Dastani, Estimating Stances on Anti-Pandemic Measures By Social Media Analysis. ICT.OPEN2022 (poster), Amsterdam, The Netherlands, 2022.
Erik Tjong Kim Sang, Shihan Wang, Marijn Schraagen and Mehdi Dastani, Extracting Stances on Pandemic Measures from Social Media Data. 17th IEEE eScience Conference (poster), 2021.
Erik Tjong Kim Sang, Marijn Schraagen, Shihan Wang and Mehdi Dastani, Transfer Learning for Stance Analysis in COVID-19 Tweets. CLIN 2021. (data annotations)
Erik Tjong Kim Sang, Marijn Schraagen, Mehdi Dastani and Shihan Wang, Discovering Pandemic Topics on Twitter. DHBenelux 2021.
Erik Tjong Kim Sang, PuReGoMe: Social Media Analysis of the Pandemic. Lunch talk, Netherlands eScience Center, Amsterdam, The Netherlands, 11 February 2021.
Shihan Wang, Marijn Schraagen, Erik Tjong Kim Sang and Mehdi Dastani, Dutch General Public Reaction on Governmental COVID-19 Measures and Announcements in Twitter Data. Preprint report on arXiv.org, 21 December 2020.
Shihan Wang, Marijn Schraagen, Erik Tjong Kim Sang and Mehdi Dastani, Public Sentiment on Governmental COVID-19 Measures in Dutch Social Media. In: Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020 (NLP-COVID19-EMNLP), 20 November 2020.
Erik Tjong Kim Sang, PuReGoMe: Dutch Public Reaction on Governmental COVID-19 Measures and Announcements. Lunch talk, Netherlands eScience Center, Amsterdam, The Netherlands, 26 June 2020.
Shihan Wang, Public Sentiment during COVID-19 - Data Mining on Twitter data. Talk at the CLARIN Café, Utrecht, The Netherlands, 27 May 2020.
Redactie Emerce, Onderzoekers leiden publieke opinie coronamaatregelen af uit social media-data. Emerce, 12 May 2020 (in Dutch).
Nienke Vergunst, Researchers use social media data to analyse public sentiment about Coronavirus measures. University of Utrecht news message, 11 May 2020.
The project setup is documented in project_setup.md.
To install notebooks from the GitHub repository, do:
git clone https://github.com/puregome/notebooks.git
cd notebooks
python3 -m pip install .
If you want to contribute to the development of notebooks, have a look at the contribution guidelines.
This package was created with Cookiecutter and the NLeSC/python-template.