Code - Tweet Extraction from Command Line with Twarc2
Disclaimer: This post does not aim to cover every tiny detail about twarc2 or Twitter data; it is not a comprehensive article. It is a general overview of how to use twarc2 with academic access, and I'll leave links to some good sources where appropriate. I also give some great tips and tricks, so don't miss them!
At the beginning of 2021, our friend Jack from Twitter thought of academics and generously opened a new way for them to get Twitter data: the 'academic track'. This way, not only universities with abundant resources but also ordinary researchers like you and me can pull up to 10 million tweets per month from Twitter. Our per-minute and 15-minute limits are also higher. So, for example, you can now easily collect all the tweets from July 2016, or get tweets containing ‘Netflix’ between 2015 and 2018. If that is the kind of thing you need, I recommend opening an academic developer account.
It's a bit of a cliché, but first you need to open a Twitter Developer account as an academic. You can find some guidelines regarding this here. Basically, Twitter is trying to verify that you really are an academic. For this verification, your Google Scholar page or a page in your name on the university's website makes things easier. You also need to specify what kind of academic research you will use the data for. You usually get the email with the result 3-5 days after applying.
The main thing I want to talk about in this article is twarc2, which is essentially a command line tool but can also be used as a Python library. God, these guys have developed a great tool. Special thanks to Igor Brigadir, who has helped dozens of people through GitHub issues, Twitter direct messages, and whatever developer forum you can think of. Let me compare it with other programs first.
If you are an R user, you may be familiar with older options such as the rtweet package. However, that package does not support the much more advanced API v2 that Twitter released last year. In R, there is a new package compatible with the new academic access: academictwitteR. Since it is still in development, it had some problems, and I haven't been able to find much of an online community to get help. There were also still issues converting the JSON files pulled from Twitter to CSV the last time I checked (summer 2021, I guess).
If you're a Python user, you're luckier because there's already an advanced tool like Tweepy, and it's API v2 compatible. It's actually a pretty cool package, but personally I was still having problems.
Normally, tweet extraction can be a hassle even with a well-established API. Although the Twitter API v2 parameters are pretty comprehensive, they take a lot of tweaking. For example, whether you use R or Python, you normally put something like time.sleep() (or Sys.sleep() in R) in the code so that you don't exceed Twitter’s rate limits. Converting the JSON you get from the Twitter API to CSV is also troublesome: each user and each tweet appears only once in the response, in separate places, so they need to be matched back together. This process is called flattening. You also need to paginate when extracting tweets, which can mean writing detailed code to handle pagination. Anyway, let's not get bogged down in details. In summary, twarc2 makes these adjustments for us and pulls all the data from Twitter in a maximalist way, with around 75 variables, without you needing to specify anything separately. Not Python’s Tweepy, not R’s academictwitteR, not hand-written code: thumbs up, twarc2!
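To give a feel for the kind of plumbing twarc2 saves you from, here is a minimal sketch of pagination with a pause between requests. It is not how twarc2 is implemented; fetch_page is a hypothetical stand-in for a single API call, and the fake pages below only imitate the data / meta.next_token shape of a v2 search response so that the example runs without network access.

```python
import time
from typing import Callable, Iterator, Optional

# Hypothetical stand-in for one API call: takes a pagination token (or None
# for the first page) and returns a dict shaped like a v2 search response.
FetchPage = Callable[[Optional[str]], dict]


def collect_all(fetch_page: FetchPage, pause_seconds: float = 1.0) -> Iterator[dict]:
    """Yield every tweet, following next_token until no token is returned."""
    token = None
    while True:
        page = fetch_page(token)
        yield from page.get("data", [])
        token = page.get("meta", {}).get("next_token")
        if token is None:
            break
        time.sleep(pause_seconds)  # crude rate limiting between requests


# Fake two-page result (made-up data) so the sketch runs offline.
FAKE_PAGES = {
    None: {"data": [{"id": "1"}, {"id": "2"}], "meta": {"next_token": "abc"}},
    "abc": {"data": [{"id": "3"}], "meta": {}},
}

tweets = list(collect_all(lambda token: FAKE_PAGES[token], pause_seconds=0))
print([t["id"] for t in tweets])
```

A fixed pause is the simplest possible strategy; a real client would also have to react to the rate-limit responses the API sends back, which is exactly the part I am happy to leave to twarc2.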
We've taken too long, let’s go ahead.
First of all, you need to have Python installed on your computer. If it’s not installed, let's get you here.
Then we install twarc2 on our computer using pip. From now on, all commands are typed into the terminal (Command Prompt or PowerShell on Windows). Now open the terminal on your computer and paste the following command:
pip install --upgrade twarc
If you are a Mac user, you can use Homebrew instead (you need to have brew installed already):
brew install twarc
We give twarc2 our bearer token so that it can pull tweets through our developer account. The token is essentially a secret key that identifies our developer project to Twitter. As I mentioned in the beginning, if you have opened a developer account, you can find it in the BEARER TOKEN tab under the project after logging in to the developer portal. After that, we come back to the terminal and type:
twarc2 configure
After pressing ‘enter’, this command will ask us for our BEARER TOKEN. Here we paste the token that we copied from the Twitter Developer Portal. With this step, our bearer token is saved in twarc2's configuration file, so we don't need to enter it again.
Then let's pull our first tweets with a simple search. Let's take a quick look at what's been posted with the #netflix hashtag over the last 7 days:
twarc2 search '#netflix' --limit 300 tweets.jsonl
What does this command do? It tells twarc2 to pull tweets containing the hashtag #netflix. The quotes matter: without them, most shells treat everything from # onwards as a comment. Because we want to keep a small set, we have limited the maximum number of tweets to 300.
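If you want to peek inside the result before going further, a few lines of Python are enough. This is a minimal sketch that assumes the output is line-oriented JSON with one API response per line and the tweets under a "data" key, which is how I understand twarc2's search output; the sample file written below is made up so that the example runs on its own.

```python
import json
import tempfile
from pathlib import Path


def count_tweets(path: Path) -> int:
    """Count tweets in a line-oriented JSON file, one API response per line."""
    total = 0
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            response = json.loads(line)
            data = response.get("data", [])
            # A response may hold a list of tweets or a single tweet object.
            total += len(data) if isinstance(data, list) else 1
    return total


# Tiny made-up sample standing in for a real tweets.jsonl.
sample_lines = [
    {"data": [{"id": "1", "text": "a"}, {"id": "2", "text": "b"}]},
    {"data": [{"id": "3", "text": "c"}]},
]
sample_path = Path(tempfile.mkdtemp()) / "tweets.jsonl"
sample_path.write_text(
    "\n".join(json.dumps(item) for item in sample_lines), encoding="utf-8"
)

n_tweets = count_tweets(sample_path)
print(n_tweets)
```

To use it on your own data, point count_tweets at the file you just saved instead of the sample.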
Let's say you want to do a more complicated search. Let's go through another example for that.
twarc2 search --archive --limit 3000 --start-time 2022-01-01 --end-time 2022-03-01 '(#starwars OR #obi-wan OR (obi-wan kenobi)) lang:en -is:retweet' tweets.json
Now let's try to understand what this command does, piece by piece.
- search is the tweet search command.
- --archive tells twarc2 to search the full tweet archive rather than only the last 7 days, which requires a developer account with academic access.
- --start-time says what date I want it to start searching for tweets. For example, I asked it to get tweets since the new year, 2022.
- --end-time tells twarc2 what date I want it to finish searching. I told it to finish on the 1st of March. But the search probably won't cover the whole period anyway, because I set the limit to 3000: once it has collected 3000 tweets, it stops. As far as I know, results come back newest first, so it is the start date that it probably won't reach.
- The next part, inside the single quotes, is the query. Here you can build your search parameters. The most advanced resource I've found for this is this GitHub page. Recently, I also came across a tweet from Suhem Parack, and there seems to be a great tool for this by Twitter itself: the query builder. I am yet to try this tool, but at first glance it seems very cool, especially the feature where you can draw rectangles on the map and get tweets by these coordinates. Thirdly, the most beginner-friendly method is this: go to Twitter Advanced Search, fill the boxes with the filters you want, then copy the query from the search box at the top right.
- Finally, let me go over the content of this query. The above command will give me tweets with #starwars or #obi-wan or “obi-wan kenobi” in them. Since I specified lang:en, it will only give English tweets. Since I added -is:retweet, it will not give me retweets, which is usually recommended unless you’re studying who retweets whom. Otherwise, for example, you get 10k lines for a tweet that has been RTed 10k times.
- The tweets.json at the end sets the name and extension of the file the results are saved to. It is written to whichever directory you are in at that moment.
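If you build queries like this often, you may prefer to assemble them in code. The sketch below is only an illustration: build_query is a hypothetical helper of my own, not part of twarc2, and it covers just the operators used in the example above (OR, lang: and -is:retweet). shlex.join from the standard library (Python 3.8+) then shows how the finished command has to be quoted for the shell.

```python
import shlex


def build_query(any_of, lang=None, exclude_retweets=True):
    """Combine alternative terms with OR, then append optional filters."""
    # Multi-word terms are wrapped in parentheses so they stay grouped.
    terms = [f"({term})" if " " in term else term for term in any_of]
    query = "(" + " OR ".join(terms) + ")"
    if lang:
        query += f" lang:{lang}"
    if exclude_retweets:
        query += " -is:retweet"
    return query


query = build_query(["#starwars", "#obi-wan", "obi-wan kenobi"], lang="en")
command = [
    "twarc2", "search", "--archive", "--limit", "3000",
    "--start-time", "2022-01-01", "--end-time", "2022-03-01",
    query, "tweets.json",
]
print(query)
print(shlex.join(command))  # the query comes out wrapped in single quotes
```

The second print reproduces the command from the example, with the quotes around the query added automatically because it contains spaces, # and parentheses.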
I'm tired. But let’s keep going.
Let's say you searched for tweets according to one of the simple or complicated queries above and at the end of the day, you got a file named tweets.json.
If you noticed, this is a very complicated file because it is nested. Tweets, users, places... all scattered in different places. These need to be matched with each other. We call this process flattening. Thankfully, twarc2 handles this in a single line, converting the file to the classic rectangular, tabular format (the csv command comes from the twarc-csv plugin, so run pip install twarc-csv first if twarc2 does not recognize it):
twarc2 csv tweets.json tweets.csv
Congratulations! Now you have a dataset that you can easily export to your favourite data processing software!
After extracting the CSV file, I usually switch to R, but you can import it into Python and continue your analysis from there.
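To make flattening less abstract, here is a minimal sketch of the idea in Python: look up each tweet's author in the separate users list and put both on one row. It is an illustration under simplifying assumptions, not what twarc2 does internally; the response below is made up, and the output column names (such as author_username) are my own and will not match the columns in twarc2's CSV.

```python
def flatten(response: dict) -> list:
    """Attach author fields to each tweet, giving one flat row per tweet."""
    # Index users by id so each tweet's author can be looked up directly.
    users = {u["id"]: u for u in response.get("includes", {}).get("users", [])}
    rows = []
    for tweet in response.get("data", []):
        author = users.get(tweet.get("author_id"), {})
        rows.append({
            "id": tweet["id"],
            "text": tweet["text"],
            "author_id": tweet.get("author_id"),
            "author_username": author.get("username"),
        })
    return rows


# Made-up response: two tweets by the same user, who is listed only once.
response = {
    "data": [
        {"id": "10", "text": "first", "author_id": "7"},
        {"id": "11", "text": "second", "author_id": "7"},
    ],
    "includes": {"users": [{"id": "7", "username": "example_user"}]},
}
rows = flatten(response)
print(rows)
```

Notice that the user appears once in the input but on both rows of the output. That repetition is what turns the nested file into a rectangular table.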
Tweet Counts - Do This Before Extracting Tweets
If you are going to start a big search on Twitter, say you’ll pull a few dozen million tweets, be sure to check beforehand how many tweets match the parameters you want to search. You may find, for example, that 100 million tweets have been sent on the subject, which may be well above your quota. For tweets on the Afghanistan withdrawal in 2021, I got 16 million tweets, for example. But for the Ukraine case, in just a few weeks, that number is 168 million (of course I didn’t start pulling them, not yet at least). Or, if there are very few tweets, you may not be able to reach the sample size you are looking for. Thankfully, twarc2 has a command for that as well:
twarc2 searches --archive --start-time 2021-10-01 --end-time 2021-11-01 --counts-only query.txt tweet_counts.csv
This command tells me how many tweets were posted on which day, in CSV form, for the query I built earlier. This is an awesome thing. Here's what you need to pay attention to: you create a file called query.txt in the directory you are working in (it can have a different name, of course), then paste your search query inside that txt file. The query should be on a single line.
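Once you have the counts file, adding the numbers up and comparing them with your monthly quota takes only a few lines. This is a minimal sketch under assumptions: the column name day_count and the sample rows are made up for illustration, so check the header of your own tweet_counts.csv and pass the right column name.

```python
import csv
import io

MONTHLY_QUOTA = 10_000_000  # academic-track cap mentioned earlier in this post


def total_count(csv_text: str, count_column: str) -> int:
    """Sum one numeric column of a counts CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(int(row[count_column]) for row in reader)


# Made-up sample; real column names may differ, so check your file's header.
sample_csv = (
    "start,end,day_count\n"
    "2021-10-01,2021-10-02,1200\n"
    "2021-10-02,2021-10-03,800\n"
)
total = total_count(sample_csv, "day_count")
fits = total <= MONTHLY_QUOTA
print(total, fits)
```

For a real file, read its contents with open("tweet_counts.csv", encoding="utf-8").read() and pass that text in place of sample_csv.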
Do not hesitate to get in touch with me if there are any unclear points.