PuReGoMe

PuReGoMe is a research project of Utrecht University and the Netherlands eScience Center. We analyze Dutch social media messages to assess the opinion of the public towards the COVID-19 pandemic measures taken by the Dutch government.

Data

PuReGoMe uses three data sources for social media analysis. Our main data source consists of Dutch tweets from Twitter. We share the ids of most of the tweets used in the project. We use Dutch posts from Reddit and comments from Nu.nl to verify the results obtained by the tweet analysis.

After a month has ended, the data from the month are collected. Here are the steps taken for each data source:

Twitter

1. The tweets are taken from the crawler of twiqs.nl. They are stored in json format in hourly files (this data set is not publicly available).
2. The tweets are extracted from the json files and stored in csv files (six columns: id_str, in_reply_to_status_id_str, user, verified, text, location). The conversion is performed by the script query-text.py.
3. Duplicate tweets are removed from the csv files produced in step 2 by the script text-unique.py.

(it would be useful to combine steps 2 and 3 in the future)
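
The extraction and deduplication steps above can be sketched in a single pass. This is a minimal illustration, not the code of query-text.py or text-unique.py: it assumes that each hourly file holds one tweet per line in the Twitter API v1.1 layout (with a nested "user" object) and that duplicates are identified by id_str. The function names are hypothetical.

```python
import csv
import io
import json

# Column names as listed in step 2 above.
COLUMNS = ["id_str", "in_reply_to_status_id_str", "user", "verified", "text", "location"]


def tweet_to_row(tweet):
    """Flatten one tweet object into the six csv columns.

    Assumption: screen name, verified flag and location are stored in
    the nested "user" object, as in the Twitter API v1.1 format.
    """
    user = tweet.get("user", {})
    return {
        "id_str": tweet.get("id_str", ""),
        "in_reply_to_status_id_str": tweet.get("in_reply_to_status_id_str") or "",
        "user": user.get("screen_name", ""),
        "verified": user.get("verified", False),
        "text": tweet.get("text", ""),
        "location": user.get("location") or "",
    }


def json_lines_to_unique_rows(lines):
    """Convert json lines to rows, keeping the first tweet seen for each id."""
    seen_ids = set()
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        row = tweet_to_row(json.loads(line))
        if row["id_str"] in seen_ids:
            continue  # duplicate tweet (assumed: same id_str)
        seen_ids.add(row["id_str"])
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize the rows to csv text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling rows_to_csv(json_lines_to_unique_rows(open_file)) on an opened hourly file would produce the csv text for that hour. Keeping a set of seen ids in memory is a simple choice that works for a month of data only if the ids fit in memory.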

Reddit

1. In a new month directory, run the script get_subreddit_ids.py to automatically retrieve the subreddits of the Dutch corona reddits.
2. Copy the list of ids of the subreddit Megathread Coronavirus COVID-19 in Nederland (submissions_ids_thenetherlands.txt) from a previous month directory and manually add the ids of recent subreddits.
3. Run the script coronamessagesnl.py on the files submissions_ids_* to automatically retrieve the posts in the found subreddits.
4. Run the notebook reddit.ipynb to get all the posts from the monthly downloads directory and store them in the directory text.

Nu.nl

1. Run code blocks 1, 3 and 4 of the notebook selenium-test.ipynb, after updating the name of the file in URLFILE in code block 1.
2. Restart the notebook and run code blocks 1 and 6, after changing the name of the new downloads directory in DATADIROUT in code block 6. This process takes many hours (even days) to complete.
3. The notebook can be copied and several copies can be run in parallel.
4. When the notebooks have finished: delete all pairs of files of sizes 1 and 3 in this month's directory (keep the ones with only size 1) and rerun the notebooks.
5. Repeat step 4 until no articles with comments are found.
6. Go to (cd) the directory data/nunl.
7. Run the script ../../scripts/combineNunlComments.sh to update the files in the main directory downloads.
8. Run the notebook nunl-convert-data.ipynb to regenerate the data files in the directory text.
9. For fetching the article texts: run code blocks 1 and 2 of the notebook selenium-test.ipynb, after updating the variables URLFILE and OUTFILEMETADONELIST.

Analysis

PuReGoMe performs analysis on three different levels: by counting messages, by determining their polarity (sentiment) and by determining their stance with respect to anti-pandemic government measures.

Frequency analysis

Frequency analysis of tweets is performed in the notebook tweet-counts.ipynb. The notebook defines several pandemic queries, for example face mask, lockdown, social distancing and pandemic, where pandemic is a combination of 60+ relevant terms. The notebook produces a graph with the absolute daily frequencies of the tweets matching each of these pandemic queries.

Frequency analysis of the Nu.nl and Reddit data is included in the respective data generation notebooks nunl-convert-data.ipynb and reddit.ipynb.
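
Counting daily matches per query can be sketched as follows. This is an illustration only: the query terms in QUERIES are hypothetical stand-ins, since the actual terms are defined in tweet-counts.ipynb, and the input is assumed to be a sequence of (date, text) pairs.

```python
import re
from collections import Counter

# Hypothetical queries: the real query terms live in tweet-counts.ipynb.
QUERIES = {
    "face mask": re.compile(r"mondkapje", re.IGNORECASE),
    "lockdown": re.compile(r"lockdown", re.IGNORECASE),
}


def daily_counts(tweets, queries=QUERIES):
    """Count, per query and per day, the tweets whose text matches the query.

    `tweets` is an iterable of (date_string, text) pairs.
    A tweet that matches several queries is counted once for each of them.
    """
    counts = {name: Counter() for name in queries}
    for date, text in tweets:
        for name, pattern in queries.items():
            if pattern.search(text):
                counts[name][date] += 1
    return counts
```

The result maps each query name to a Counter of dates, which can be plotted directly as absolute daily frequencies.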

Polarity analysis

Polarity analysis is the same as sentiment analysis. This analysis is performed by the notebook sentiment-pattern.ipynb, which uses the Python package Pattern for sentiment analysis of Dutch text (De Smedt & Daelemans, 2011). The notebook requires two types of input files: the csv files in the text directories of each of the data sources, and sentiment score files, which should be generated from these csv files with the script ../../scripts/sentiment-pattern-text.py. The polarity analysis of the different topics takes a lot of time and can be run in parallel. It produces time series graphs for all tweets, all pandemic tweets and several individual pandemic topics.
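
The step from per-message sentiment scores to a time series can be sketched as a daily average. This is a minimal illustration under assumptions: the scores are taken as already computed (Pattern reports polarity between -1 and 1), the input is assumed to be (date, score) pairs, and the function name is hypothetical. The notebook may aggregate differently.

```python
from collections import defaultdict
from statistics import mean


def daily_mean_polarity(scored_messages):
    """Average the polarity scores of all messages posted on the same day.

    `scored_messages` is an iterable of (date_string, polarity) pairs.
    Returns a dict from date to mean polarity, ordered by date.
    """
    by_day = defaultdict(list)
    for date, score in scored_messages:
        by_day[date].append(score)
    return {date: mean(scores) for date, scores in sorted(by_day.items())}
```

Because each day is aggregated independently, the scoring of different months or topics can run in parallel and be merged afterwards.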

Stance analysis

Stance analysis is performed by the notebook fasttext.ipynb. The analysis originates from a model trained by fastText on manually labeled tweets. The notebook contains a section for searching for the best parameters of fastText using grid search, but when the training data is unchanged, this section can be skipped. The notebook has two main modes related to topics: analysis related to the social distancing policy and analysis related to the former (April 2020) face mask policy. These are the only two topics for which we have manually labeled training data. The graphs combine analysis for all three data sources used in the project: Twitter, Nu.nl and Reddit.
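
Two ingredients of this workflow can be sketched without running fastText itself: writing labeled tweets in the `__label__` format that fastText expects for supervised training, and enumerating the parameter combinations for a grid search. The label names and parameter values below are hypothetical; epoch, lr and wordNgrams are real fastText training parameters, but the ranges searched in fasttext.ipynb are not shown here.

```python
from itertools import product


def to_fasttext_line(label, text):
    """Format one labeled message as a fastText supervised training line.

    fastText reads one example per line, so line breaks and repeated
    whitespace inside the text are collapsed to single spaces.
    """
    cleaned = " ".join(text.split())
    return f"__label__{label} {cleaned}"


def parameter_grid(options):
    """Yield every combination of the given parameter values as a dict."""
    names = sorted(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical search space, for illustration only.
SEARCH_SPACE = {"epoch": [10, 20], "lr": [0.1, 0.2, 0.5], "wordNgrams": [1, 2]}
```

Each dict produced by parameter_grid(SEARCH_SPACE) could be passed as keyword arguments to a training call and scored on held-out labeled tweets, keeping the combination with the best score.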

Other analyses

The notebook topic-analysis.ipynb is used to find new topics in a day of tweets. It compares the vocabulary of a day of tweets with the vocabulary of the preceding day.
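
One way to compare the vocabularies of two consecutive days is to rank tokens by the ratio of their relative frequencies. The sketch below is an assumption about how such a comparison could work, not the method of topic-analysis.ipynb: it uses whitespace tokenization and add-one smoothing, and the function names are hypothetical.

```python
from collections import Counter


def token_counts(texts):
    """Count lowercase whitespace-separated tokens over all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def rising_tokens(today_texts, yesterday_texts, top_n=10):
    """Return the tokens whose relative frequency rose most since yesterday."""
    today = token_counts(today_texts)
    yesterday = token_counts(yesterday_texts)
    total_today = sum(today.values())
    total_yesterday = sum(yesterday.values())
    vocabulary_size = len(set(today) | set(yesterday))
    scores = {}
    for token, count in today.items():
        # Add-one smoothing keeps the ratio finite for tokens unseen yesterday.
        rel_today = (count + 1) / (total_today + vocabulary_size)
        rel_yesterday = (yesterday[token] + 1) / (total_yesterday + vocabulary_size)
        scores[token] = rel_today / rel_yesterday
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda token: (-scores[token], token))[:top_n]
```

Tokens at the top of the ranking are candidates for new topics. A real analysis would likely need better tokenization and a minimum frequency threshold to filter noise.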

echo-chambers.ipynb is used for finding groups of users which collectively retweet similar content. The notebook found a group of a few hundred users retweeting right-wing propaganda. Further study needs to be done to check if this content has any effect on the findings of this project.
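
Finding users who retweet similar content can be approached by comparing the sets of tweets each user retweeted. The sketch below uses Jaccard similarity between pairs of users. This is one plausible technique chosen for illustration, not necessarily what echo-chambers.ipynb does, and the threshold value is arbitrary.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_user_pairs(retweets, threshold=0.5):
    """Find pairs of users whose retweeted tweet sets overlap strongly.

    `retweets` maps a user name to the set of tweet ids the user retweeted.
    Returns (user_a, user_b, similarity) tuples with similarity >= threshold.
    """
    pairs = []
    for user_a, user_b in combinations(sorted(retweets), 2):
        score = jaccard(retweets[user_a], retweets[user_b])
        if score >= threshold:
            pairs.append((user_a, user_b, score))
    return pairs
```

The pairwise comparison is quadratic in the number of users, so for a full data set it would need to be restricted to active retweeters or replaced by a clustering method.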

geo-analysis.ipynb and geo-classification.ipynb can be used to divide the tweets into groups depending on the location of the tweeter. This only works for about half of the data set. Next, maps representing tweet data can be created with the notebook maps.ipynb.

corona-nl-totals.ipynb creates graphs of the number of infections, hospitalizations and deaths in the Netherlands based on data provided by the health organization RIVM.

Publications, talks and media coverage

Erik Tjong Kim Sang, Shihan Wang, Marijn Schraagen and Mehdi Dastani, Estimating Stances on Anti-Pandemic Measures By Social Media Analysis, ICT.OPEN2022 (poster), Amsterdam, The Netherlands, 2022

Erik Tjong Kim Sang, Shihan Wang, Marijn Schraagen and Mehdi Dastani, Extracting Stances on Pandemic Measures from Social Media Data. 17th IEEE eScience Conference (poster), 2021.

Erik Tjong Kim Sang, Marijn Schraagen, Shihan Wang and Mehdi Dastani, Transfer Learning for Stance Analysis in COVID-19 Tweets. CLIN 2021. (data annotations)

Erik Tjong Kim Sang, Marijn Schraagen, Mehdi Dastani and Shihan Wang, Discovering Pandemic Topics on Twitter. DHBenelux 2021.

Erik Tjong Kim Sang, PuReGoMe: Social Media Analysis of the Pandemic. Lunch talk, Netherlands eScience Center, Amsterdam, The Netherlands, 11 February 2021.

Shihan Wang, Marijn Schraagen, Erik Tjong Kim Sang and Mehdi Dastani, Dutch General Public Reaction on Governmental COVID-19 Measures and Announcements in Twitter Data. Preprint report on arXiv.org, 21 December 2020.

Shihan Wang, Marijn Schraagen, Erik Tjong Kim Sang and Mehdi Dastani, Public Sentiment on Governmental COVID-19 Measures in Dutch Social Media. In: Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020 (NLP-COVID19-EMNLP), 20 November 2020.

Erik Tjong Kim Sang, PuReGoMe: Dutch Public Reaction on Governmental COVID-19 Measures and Announcements. Lunch talk, Netherlands eScience Center, Amsterdam, The Netherlands, 26 June 2020.

Shihan Wang, Public Sentiment during COVID-19 - Data Mining on Twitter data. Talk at the CLARIN Café, Utrecht, The Netherlands, 27 May 2020.

Redactie Emerce, Onderzoekers leiden publieke opinie coronamaatregelen af uit social media-data. Emerce, 12 May 2020 (in Dutch).

Nienke Vergunst, Researchers use social media data to analyse public sentiment about Coronavirus measures. University of Utrecht news message, 11 May 2020.

Information added by the Python template

How to use notebooks

The project setup is documented in project_setup.md. Feel free to remove this document (and/or the link to this document) if you don't need it.

Installation

To install notebooks from the GitHub repository, do:

git clone https://github.com/puregome/notebooks.git
cd notebooks
python3 -m pip install .

Documentation

Include a link to your project's full documentation here.

Contributing

If you want to contribute to the development of notebooks, have a look at the contribution guidelines.

Credits

This package was created with Cookiecutter and the NLeSC/python-template.
