Collection, Management, and Analysis of Twitter Data
Posted on June 1, 2022 by R on Methods Bites in R bloggers
As a highly relevant platform for political and social online interaction, Twitter is increasingly analyzed by researchers. In January 2021, Twitter renewed its API, which now includes access to the full history of tweets for academic use. In this Methods Bites Tutorial, Andreas Küpfer (Technical University of Darmstadt & MZES) presents a walkthrough of the collection, management, and analysis of Twitter data.
After reading this blog post and engaging with the applied exercises, readers will be able to:
complete the academic research track application process for the Twitter API.
crawl tweets using customized queries based on the R package academictwitteR (Barrie and Ho 2021 ).
apply a selection of pre-processing steps to these tweets.
make decisions that minimize reproducibility issues with Twitter data and comply with Twitter's policies.
Note: This blog post provides a summary of Andreas’ workshop in the MZES Social Science Data Lab . The original workshop materials, including slides and scripts, are available from our GitHub . A live recording of the workshop is available on our YouTube channel .
Introduction to social media and Twitter API v2
Social media posts are full of potential for data mining and analysis. Despite the platform's problems with fake accounts and bots, it can be a fruitful source for tackling research questions across a range of disciplines, including the social sciences (e.g., Barberá 2015; Nguyen et al. 2021; Valle-Cruz et al. 2022; Sältzer 2022). Recognizing this potential, also for commercial use, platform providers increasingly restrict free access to such data.
Twitter in particular is an important data source given its richness of social and political interactions. For a long time, however, Twitter did not offer a free-of-charge option for a full-archive search of all tweets and users. The free version of API v1.1 was very limited, returning at most 3,200 tweets per user or only the past seven days of tweets. In addition, the range of available meta data 1 and implemented query 2 options was rather small. These limitations were lifted by the introduction of the redeveloped and rearranged Twitter API v2 in January 2021. 3 For academic purposes, Twitter opened up access to all available tweets and other objects posted on the platform at no monetary cost to the researcher.
While this blog post focuses on the retrieval of textual data, Twitter content certainly offers more. Looking at social network interactions (e.g., followers, likes, …) is just one of the opportunities beyond text to reveal valuable information. This can be, for example, the usage of follower networks to estimate ideological positions (e.g., Barberá 2015 ) or measuring the importance of a user in a social network based on social interaction data.
Academic research track application process
As Application Programming Interfaces (APIs) are powerful tools which allow access to vast databases full of information, companies offering them are increasingly careful about who is allowed to use them. While the previous version of the Twitter API provided access without a dedicated application (for a detailed description, see this Methods Bites tutorial ), the novel version requires you to go through an application process in which you share several details with Twitter, both about yourself and about the research project in which you intend to work with Twitter data.
Before getting access, you have to fulfill several formal prerequisites to be eligible for application:
You are either a master’s student, a doctoral candidate, a post-doc, a faculty member, or a research-focused employee at an academic institution or university.
You have a clearly defined research objective, and you have specific plans for how you intend to use, analyze, and share Twitter data from your research.
You will use this access for non-commercial purposes. 4
Furthermore, you need a Twitter account, which is also used to log in to the Twitter Developer Platform after a successful application. This portal lets you configure your API projects, keep an eye on your monthly tweet cap 5 , and more. A more detailed explanation of the prerequisites can be found on the Twitter API academic research track website.
The whole process is initiated by clicking Apply on the official Twitter API academic research track website. You’ll be asked to log in with your personal Twitter account.
Figure 1: Twitter Application Steps for academic research track API access.
The figure above visualizes the steps you have to complete before your application can finally be submitted for Twitter’s internal review:
Basic Info: such as phone number verification and country selection
Academic Profile: such as link to an official profile (department website or similar) and academic role
Project Details: such as information about findings, description of the project itself, and how the API should be used there (e.g. methodologies and how the outcomes will be shared)
Review: provides an overview of the previous steps
Terms: developer agreement and policy
Before starting, it is recommended to carefully read which career levels, project types, and data practices are not allowed to use the API and thus have a high chance of receiving a refusal. To give an example, if you plan to share the content of tweets publicly, you most probably won’t get access to the API, as this would violate the Twitter rules. Again, more detailed information about this can be found in the Twitter API academic research track and Developer Terms information guides.
Step one requests generic information about your Twitter account while in step two you have to provide information about your academic profile. This includes a link to a publicly available record on an official department website and information regarding the academic institution you are working in. The third step is the most sophisticated one: your research project. It asks for short paragraphs about the project in general, what and how Twitter data is used there, and how the outcome of your work is shared with the public. The last two steps, review and terms, do not require any user-specific input but provide an overview of all filled-in information as well as the chance to read the developer agreement and policy.
After submitting your application
After submitting your application, you receive a decision via the e-mail address connected with your Twitter account (usually) within a few days. However, according to Twitter, this process can take up to two weeks.
Your application may be rejected for two common reasons: First, the information you provided indicates a violation of the policy at some point, or second, you do not meet the requirements (as described above). Further explanations of possible next steps after a rejection can be found in the Developer Account Support FAQ .
As of writing this blog post (May 2022), submitting a reapplication for access using the same account is not possible.
Using the API
After your successful application, the Twitter Developer Portal lets you manage projects and their environments, generate API keys (“credentials” for API access), get an overview of real-time monthly tweet cap usage, check the available API endpoints and their specifics, and more.
After the creation of a project, an environment can be added and API keys generated.
Figure 2: API keys of an environment
The following keys are generated automatically and used depending on the API interface (e.g. the R package) at hand:
API key \(\approx\) username (also called consumer key)
API key secret \(\approx\) password (also called consumer secret)
Bearer token \(\approx\) special access token (also called an authentication token)
It is crucial to keep these keys private and not to push them to GitHub or similar services! Otherwise, someone else could gain access to your API account. Instead, store them locally or directly in an environment variable. The package we cover in detail later in this blog post guides you safely through this process.
However, in case you plan to use the keys in other applications, you can store them in different ways. The most common way in R is to add them to the .Renviron file. To do this comfortably, install the R package usethis and call usethis::edit_r_environ(), which lets you edit the .Renviron file in your home directory. There you can add tokens (or anything else you want to keep stored locally) using this format:
Key1=value1
Key2=value2
# ...
After saving the file you can access values by calling Sys.getenv("Key1") within your R application. More best practices on managing your secrets can be found on the website APIs for Social Scientists .
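As a minimal sketch, assuming you stored your bearer token in .Renviron under the hypothetical name TWITTER_BEARER, reading it back looks like this:

```r
# Assumes ~/.Renviron contains a line like: TWITTER_BEARER=<your token>
# (restart the R session after editing so the variable is picked up)
bearer <- Sys.getenv("TWITTER_BEARER")
if (!nzchar(bearer)) {
  message("TWITTER_BEARER is not set; check your .Renviron file")
}
```

The variable name is only an example; any key added to .Renviron can be read the same way.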
Postman is an easily accessible application to try out different queries, tokens, and more. Without any programming knowledge, you get the API results immediately. Here you can find an official tutorial to use Postman with the Twitter API.
However, there are several reasons why Postman cannot replace an R package and programming code, for example:
Building flexible queries (e.g., a list of users to retrieve tweets from)
Handling large responses which come split up during pagination
Public Twitter lists (e.g. https://twitter.com/i/lists/912241909002833921 ) 6
legislatoR R Package (Göbel and Munzert 2021 )
Afterward, we are ready to crawl our first tweets using a simple wrapper function (get_tweets_from_user()) that asks for a single user_id. get_all_tweets(), which is called inside this function, is the heart of our code. It manages the generation of queries for the API, deals with rate limits, and stores the data in JSON files (which can be transformed later).
In case you look for specific content, tweet types, or even topics, you can add another parameter to the package function: query. It allows you to narrow down your search by using specific strings. To give an example, one could look for English retweets containing the keywords putin or selenskij having a geo-location attached. This can be achieved by simply assigning the following string to the query parameter:
(putin OR selenskyj) -is:retweet lang:en has:geo
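Such query strings can also be built programmatically, which is handy when the keyword list changes. A small sketch (the keyword list is just the example from above):

```r
# build a Twitter API v2 query string from a vector of keywords
keywords <- c("putin", "selenskyj")
query <- paste0(
  "(", paste(keywords, collapse = " OR "), ")",
  " -is:retweet lang:en has:geo"
)
query
# [1] "(putin OR selenskyj) -is:retweet lang:en has:geo"
```

The resulting string can then be assigned to the query parameter of the package function.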
Beyond that, there exist many more parameters to customize the crawling method. All of them are documented in the official academictwitteR CRAN documentation of the package. However, in this tutorial I restrict the search to a Twitter user ID as well as a start and end date for the tweets we are interested in:
# function to retrieve tweets in a specific time period of a single user
# (a list of user IDs would be possible, but one should keep
# the max. query string length of 1,024 characters in mind)
get_tweets_from_user <- function(user_id, start_date, end_date) {
  academictwitteR::get_all_tweets(
    users        = user_id,
    start_tweets = start_date,
    end_tweets   = end_date,
    data_path    = "data/",   # tweets are stored as JSON files here
    bind_tweets  = FALSE
  )
}

The text of the crawled tweets can then be pre-processed, for example with the quanteda package:

dfm <- quanteda::tokens(tweets$text) %>%
  quanteda::dfm() %>%                              # removes capitalization
  quanteda::dfm_remove(stopwords("german")) %>%    # removes German stopwords
  quanteda::dfm_wordstem(language = "german")      # transforms words to their German word stems
# "wurd berlin ca miet- eigentumswohn umgewandelt #umwandlungsverbot"
The dfm() function (called above) returns a sparse document-feature matrix, which can be a fruitful starting point for a first word-frequency analysis:
head(dfm)
## Document-feature matrix of: 6 documents, 87 features (79.77% sparse) and 0 docvars.
##                      features
## docs                 leb plotzlich mehr schablon gut bos pass ???? #esk #miet
##   44608858             1         1    1        1   1   1    1    1    1     1
##   819914159915667456   0         0    0        0   0   0    0    0    0     0
##   1391875208           0         0    0        0   0   0    0    0    0     0
##   569832889            0         0    1        0   0   0    0    0    0     1
## ...
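For instance, quanteda's topfeatures() returns the most frequent features of such a matrix. Since the crawled tweets cannot be redistributed, the sketch below uses a tiny made-up toy corpus instead:

```r
library(quanteda)

# toy corpus of made-up German tweet texts (illustrative only)
toy <- c("die miete in berlin steigt", "berlin diskutiert die miete")
dfm_toy <- dfm(tokens(toy))

# counts of the most frequent features across all documents
topfeatures(dfm_toy, n = 3)
```

Applied to the dfm built above, the same call gives a quick overview of the dominant word stems in the crawled tweets.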
You are finally at the step of applying further methods to tackle your research question and getting deeper insights into your crawled tweets. There is much more to explore: You can find further text-as-data tutorials on our blog.
Reproducibility of research based on Twitter data
As reproducible results are one of the major requirements of research projects, it has to be discussed how this could affect your work with Twitter data. The Twitter development agreement includes a clear statement of what researchers are allowed to publish along with their work:
“Academic researchers are permitted to distribute an unlimited number of Tweet IDs and/or User IDs if they are doing so on behalf of an academic institution and for the sole purpose of non-commercial research. For example, you are permitted to share an unlimited number of Tweet IDs for the purpose of enabling peer review or validation of your research.” 7
This means that the content of tweets must not be shared publicly. As tweets can be deleted and accounts can be suspended, this poses a problem for subsequent researchers attempting to replicate the findings, since they won’t be able to recrawl such tweets via the API. However, there are also platforms like polititweet.org , which track public figures and on that basis justify the publication even of deleted tweets:
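A common pattern is therefore to publish only tweet IDs and let subsequent researchers "rehydrate" the full tweets via the API, for which academictwitteR offers hydrate_tweets(). A sketch with hypothetical IDs:

```r
library(academictwitteR)

# hypothetical tweet IDs, as they might ship with a replication archive
shared_ids <- c("1266565148607973377", "1266565222160500736")

# re-retrieve ("rehydrate") the full tweet objects via the API;
# tweets that have been deleted in the meantime cannot be recovered
tweets <- hydrate_tweets(shared_ids)
```

This keeps the shared archive within the developer agreement while still enabling validation of the original analysis, minus any tweets removed since the original crawl.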
Figure 3: polititweet.org section of the landing page
Source: polititweet.org landing page
To conclude, this does not make the decision of how to share which kind of data any easier. Still, one has to choose the best available option for sharing data without violating the Twitter rules, which, as of today, means at least sharing tweet IDs with the community.
This blog post provides a first glimpse into the academic research track Twitter API and the information richness of Twitter data. As there will certainly be further updates and changes to the API in the future, it helps that the available easy-to-use packages build on active user communities that keep them updated for the current Twitter API version. While there exist many powerful packages for the data-gathering step, researchers still need to think carefully about how to further process the crawled information, depending on their research question and method, as well as how to make their research accessible to the community in an open-science approach.
About the author
Andreas Küpfer is a graduate of the Mannheim Master in Data Science and a doctoral researcher at the Technical University of Darmstadt. His interdisciplinary research interests include text as data, applied machine learning, and substantive inference in the fields of political communication and political competition.
Barberá, Pablo. 2015. “Birds of the Same Feather Tweet Together: Bayesian Ideal Point Estimation Using Twitter Data.” Political Analysis 23 (1): 76–91. https://doi.org/10.1093/pan/mpu011 .
Barrie, Christopher, and Justin Chun-ting Ho. 2021. “academictwitteR: An R Package to Access the Twitter Academic Research Product Track v2 API Endpoint.” Journal of Open Source Software 6 (62): 3272. https://doi.org/10.21105/joss.03272 .
Göbel, Sascha, and Simon Munzert. 2021. “The Comparative Legislators Database.” British Journal of Political Science, 1–11. https://doi.org/10.1017/S0007123420000897 .
Nguyen, Thu T., Shaniece Criss, Eli K. Michaels, Rebekah I. Cross, Jackson S. Michaels, Pallavi Dwivedi, Dina Huang, et al. 2021. “Progress and Push-Back: How the Killings of Ahmaud Arbery, Breonna Taylor, and George Floyd Impacted Public Discourse on Race and Racism on Twitter.” SSM - Population Health 15: 100922. https://doi.org/10.1016/j.ssmph.2021.100922 .
Sältzer, Marius. 2022. “Finding the Bird’s Wings: Dimensions of Factional Conflict on Twitter.” Party Politics 28 (1): 61–70. https://doi.org/10.1177/1354068820957960 .
Valle-Cruz, David, Vanessa Fernandez, Asdrubal Lopez-Chau, and Rodrigo Sandoval Almazan. 2022. “Does Twitter Affect Stock Market Decisions? Financial Sentiment Analysis During Pandemics: A Comparative Study of the H1N1 and the COVID-19 Periods.” Cognitive Computation 14 (January). https://doi.org/10.1007/s12559-021-09819-8 .
Vliet, Livia van, Petter Törnberg, and Justus Uitermark. 2020. “The Twitter Parliamentarian Database: Analyzing Twitter Politics Across 26 Countries.” PLoS ONE 15.
Meta data is explanatory information, such as topical indicators or the language of a tweet, which enriches the actual tweet, image, or main object retrieved from the API ↩
Queries are filter operators that narrow down the set of tweets to be retrieved ↩
API stands for Application Programming Interface and, simply speaking, enables communication between software systems. ↩
There is a maximum number of tweets that can be retrieved via the API, which is reset once a month. ↩
Use such lists with caution, as they may not come from verified sources. ↩
You can find a detailed description of the content redistribution of Twitter data in the official developer policies . ↩