Analyzing Tweets in R with Quanteda and ggplot2
For this tutorial, you will need to have already collected and formatted your text/tweet data for analysis. You can find my process for querying and preparing tweets for analysis with the “twitteR” package HERE.
0.1 Before you begin, be sure you have installed these R packages. You can attach them now, but preserve the order when attaching.
You should also load your data into the workspace (if you arrived here from the data collection tutorial, you should have your data in the workspace already).
Code
load("/Library/Frameworks/R.framework/Versions/4.2/WP_EX_Proj/quantedaDemoFin.RData")
1 Step 1: Strip the tweet text
Special characters and punctuation need to be “stripped” (removed, leaving only raw text) prior to any text analysis, but especially so if the text data contain many special characters. Tweets are a good example, commonly including things like hashtags (#), emojis, and tags mentioning other users (@).
There are several methods available to strip the text of your tweets to remove punctuation, special characters, and even commonly used words like prepositions. Words you want to eliminate from your analysis are called “stop words,” and they are often removed separately from punctuation and special characters.
If you have your text data stored as a data file in your working/project directory, you can use the “readtext” function (from the readtext package, a companion to quanteda) to read the text into your R workspace, then strip it as described below.
If you arrived at this tutorial because you completed the previous tutorial on downloading and preparing tweet data with “twitteR,” you likely have dataframe and list objects storing your data in the R workspace. In this case, you will need to extract the text vector(s) from the dataframe(s) prior to stripping the tweet text.
Let’s start by extracting the text vector from the dataframe and storing it as a character vector in the workspace. You can call on variables within a dataframe with the dollar ($) symbol.
In this case, I want to extract the tweets from my keyword query of tweets mentioning “covid” so I identify the covid tweets dataframe (cov_tweet_df1), then I find the variable within that dataframe that contains the tweet text. The variable is conveniently named “text”, so I put the dollar sign after the name of the dataframe, then I type “text” to identify that variable for extraction.
I want to make sure my extracted data are in the correct format, character, so I use the function “as.character()” to wrap everything to the right of my arrow (what I am storing).
Remember the arrow points to the new object you plan to create. Here I call my new object “cov_stripTweet” before indicating that I will place inside of it a character vector containing the text variable in the cov_tweet_df1 dataframe.
Code
cov_stripTweet <- as.character(cov_tweet_df1$text) #extract the text column as a character vector
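As a small self-contained illustration of the `$` selector and `as.character()`, the sketch below uses a made-up dataframe. The object `toy_df` and its columns are hypothetical stand-ins, not the tutorial data, and the text column is deliberately stored as a factor to show why the coercion can matter.

```r
# Toy dataframe standing in for cov_tweet_df1 (names and values are made up)
toy_df <- data.frame(
  text = c("First tweet about #covid", "Second tweet @someone"),
  retweetCount = c(3, 0),
  stringsAsFactors = TRUE  # store text as a factor to show why as.character() helps
)

class(toy_df$text)  # a factor in this toy setup

# Pull the text column out with $ and coerce it to character
toy_text <- as.character(toy_df$text)

class(toy_text)
length(toy_text)
```

If the column is already a character vector, `as.character()` leaves it unchanged, so wrapping the extraction this way is harmless.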
Now you should have a new object in the “values” section of your workspace environment. I want to store my stripped text back in the dataframe, but I want to create a new, “stripped text” object first using the “text” object I just created. I like the “gsub” function {base} for stripping text. To use it, I call the function, then identify the pattern I want to strip, the replacement for that pattern (here, an empty string), and the character vector to change. I want to place the result back into the same character vector, so I also identify it to the left of my arrow.
I will make the text lowercase and perform two substitutions, then I will place my vector back into my dataframe as a new variable called “strippedT.”
Code
cov_stripTweet <- tolower(cov_stripTweet) #make text lowercase
cov_stripTweet <- gsub("https?://\\S+", "", cov_stripTweet) #remove urls (http or https, up to the next space)
cov_stripTweet <- gsub("[[:punct:]]", "", cov_stripTweet) #remove punctuation
cov_tweet_df1$strippedT <- cov_stripTweet #store as new variable in dataframe
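To see what each stripping step does, here is a minimal sketch on two made-up tweets. The vector `toy_tweets` is hypothetical, the URL pattern `https?://\\S+` is my assumption about what the URL step is meant to remove (a link up to the next whitespace), and the final whitespace cleanup is an optional extra.

```r
# Made-up tweets for illustration only
toy_tweets <- c("Stay SAFE! #COVID https://t.co/abc123 more text",
                "@someone Vaccines work.")

clean <- tolower(toy_tweets)                # make text lowercase
clean <- gsub("https?://\\S+", "", clean)   # drop urls up to the next whitespace
clean <- gsub("[[:punct:]]", "", clean)     # drop punctuation such as #, @, ! and .
clean <- gsub("\\s+", " ", trimws(clean))   # optional: collapse leftover spaces

clean
```

The URL step runs before the punctuation step on purpose: once punctuation is gone, the "://" that identifies a link is gone too, and the link text would be left behind as ordinary words.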
To check that it worked the way you intended, you can print the variable, or if you are using a program like RStudio, you can click on the dataframe in the workspace environment to open and view as a spreadsheet.
If you executed more than one keyword query in your Twitter data collection in the previous tutorial and have more than one dataframe, repeat this process for each dataframe with tweets you want to analyze.
2 Step 2: Make a text corpus
Raw text is computationally inefficient to analyze. To work around this limitation, many quantitative text analysis techniques use “corpus” objects and the “tokens” derived from them instead of raw text. Tokens are smaller units within a text, like words.
Let’s try constructing a text corpus using the “quanteda” package. I can use the “corpus()” function in quanteda to create a corpus based on the stripped text (“strippedT”) variable I just stored in my dataframe.
Code
covT_Qcorp <- corpus(cov_tweet_df1$strippedT)
I can view the contents of my quanteda corpus to get a better idea of what information I can explore with my data. Quickly view the contents of your “quanteda” corpus using the “summary” function.
Code
summary(covT_Qcorp)
In my corpus summary, I can see that I have 1000 documents, of which only the first 100 are shown (the default for “summary”), and that in addition to the “text” label identifying each document (tweet) I have counts of “types,” “tokens,” and “sentences.”
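The difference between “tokens” and “types” can be sketched in base R. This is only a rough stand-in that splits on whitespace, not quanteda's tokenizer, and the sentence in `toy_text` is made up.

```r
# Made-up, already-stripped text (illustration only)
toy_text <- "covid cases rise as covid testing expands"

# Split on whitespace to get word tokens (a rough stand-in for tokenization)
toy_tokens <- strsplit(toy_text, "\\s+")[[1]]

length(toy_tokens)          # number of tokens: every word occurrence counts
length(unique(toy_tokens))  # number of types: distinct words only
```

Here “covid” appears twice, so it contributes two tokens but only one type.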
It is a good time to save your workspace. Before moving to step 3, save your workspace to the working/project directory.
Code
save.image("~/DataPathToProjDirectory/NameTheWorkspace.RData")
3 Step 3: Tokenize the corpus and create a document-feature matrix (dfm)
Next I can further simplify the raw text from “strippedT” to make my results more meaningful and/or increase computational efficiency. In the previous step, one of the list items stored in the quanteda corpus included “Tokens”--we can use several commands with the pipe operator ( %>% ) to tokenize the text, transform the text through things like “stemming” the words to their root, and remove stopwords.
Let’s try removing stopwords from the corpus (to make the results more meaningful) and stemming the words to their root (to speed up our analyses) by constructing a document feature matrix using the “dfm()” function. I’ll call the dfm object “cov_dfm1” (numbering is useful in case I want to make another dfm later on).
To do this with the pipe operator, start with the tokens command. Here I note that I want to tokenize the quanteda corpus named “covT_Qcorp” and remove symbols. We already removed symbols when we stripped the text, but I like to do this a second time to catch any that were missed by the {base} package. I also specify to include the document variables (docvars).
Next, I use the pipe operator to layer commands in the order I want them to execute after tokens but before constructing the final new object.
I use the “dfm()” command to construct the dfm from the tokens object, then I layer on the “dfm_wordstem” command to stem the words to their root. I add one final layer to remove stopwords from the dfm using the “dfm_select()” command.
Code
cov_dfm1 <- tokens(covT_Qcorp,
                   what = "word",
                   remove_symbols = TRUE,
                   include_docvars = TRUE) %>%
  dfm() %>%
  dfm_wordstem() %>%
  dfm_select(pattern = stopwords("english"),
             selection = c("remove"),
             valuetype = c("fixed"))
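To make the idea of a document-feature matrix concrete, the sketch below builds a tiny one by hand in base R. It is not how quanteda stores a dfm (quanteda uses a sparse matrix), it skips stemming, and the documents in `toy_docs` and the three-word stopword list `toy_stop` are made up for illustration.

```r
# Made-up stripped tweets (illustration only)
toy_docs <- c(doc1 = "covid cases rise",
              doc2 = "covid testing and covid vaccines")

# Tokenize each document on whitespace
toy_tokens <- strsplit(toy_docs, "\\s+")

# Drop a tiny hand-written stopword list (hypothetical; real lists are much longer)
toy_stop <- c("and", "the", "a")
toy_tokens <- lapply(toy_tokens, function(tok) tok[!tok %in% toy_stop])

# Features are the distinct remaining words
features <- sort(unique(unlist(toy_tokens)))

# Count each feature per document: rows are documents, columns are features
toy_dfm <- t(vapply(toy_tokens,
                    function(tok) as.integer(table(factor(tok, levels = features))),
                    integer(length(features))))
colnames(toy_dfm) <- features

toy_dfm
```

Each row is one document and each column is one feature, so the column sums give the overall frequency of each word across all documents.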
Now I have a dfm, but the dfm does not have the metadata from the dataframe. I can specify the metadata for the dfm using the “meta()” function. In this tutorial, we’ve tended to put the commands on the right side of the arrow and the object on the left [NewObj <- command(old object)]. With the “meta()” command, we call the metadata on the right of the arrow, still showing that is what we want to place in the dfm, but we use the command on the dfm itself to the left of the arrow.
Let’s specify that the dataframe housing the text we used to make our corpus also houses the metadata.
Code
meta(cov_dfm1) <- cov_tweet_df1
I can check my dfm in RStudio by clicking on the environment object, then “meta,” then “user,” and I will see all of the variables from my dataframe stored here. I can also check this manually by printing the variable names contained in the user metadata in the dfm object using the “@” and “$” selectors.
Code
names(cov_dfm1@meta$user)
4 Step 4: Explore the dfm with plots
If you want to get some basic information about your data but aren’t ready to build a topic model, you can make some basic plots and other visualizations. Let’s make a wordcloud using the “quanteda.textplots” package. Use the “set.seed()” command to fix the random seed before plotting; otherwise the layout of the words will change every time you print the plot.
Code
library(quanteda.textplots)
set.seed(3162312)
textplot_wordcloud(cov_dfm1)
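The effect of “set.seed()” is easy to check with any function that draws random numbers. The sketch below uses base R's `sample()` as a stand-in for the random placement of words in the cloud; the object names are made up.

```r
set.seed(3162312)
first_draw <- sample(1:100, 5)

set.seed(3162312)                # same seed again
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)  # the two draws match

# Without resetting the seed, the generator has moved on,
# so this draw will almost certainly differ from the first two
third_draw <- sample(1:100, 5)
```

This is why the seed is set again immediately before each wordcloud call in this step.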
We can change the colors, set minimum counts for the number of documents a word must appear in to be included in the cloud, and add weights to help with readability.
Let’s change the colors, set the rotation, and select a minimum number of hits. You can choose the palette, and you can expand your choices with packages like RColorBrewer. To view the base graphics palettes, you can call “palette.pals()”. I’ll use the “ggplot2” palette, then I’ll show a brewer palette.
Code
set.seed(3162312)
textplot_wordcloud(cov_dfm1,
min_count = 10,
rotation = .25,
color = palette.colors(5, "ggplot2"))
Now with RColorBrewer’s “Set1” palette. You can view the brewer palettes by calling “display.brewer.all()”.
Code
library(RColorBrewer)
set.seed(3162312)
textplot_wordcloud(cov_dfm1,
min_count = 10,
rotation = .25,
color = RColorBrewer::brewer.pal(5,
"Set1"))
5 Step 5: Do some basic analysis with ‘top words’
There are also some basic analyses available for looking at things like top words by frequency. These require some further manipulation of the dfm object. Let’s make a dataframe object containing the top 50 words and their frequencies.
Code
covT_topWord <- as.data.frame(topfeatures(cov_dfm1, 50))
If we look at this dataframe, we only have 1 variable. Where are the words?
They are in the rownames. We need to make a new column storing that information so we can use it for analysis. We can do this with dplyr’s mutate command.
Code
covT_topWord <- covT_topWord %>%
  mutate(word = rownames(covT_topWord))
Now you should have two variables. There’s another issue here, though. The name of your frequency variable is probably something like “topfeatures(cov_dfm1, 50)”. There’s a lot going on in this variable name, and we want it to be easier to access. Let’s change it, remembering to use the format “new_name = old_name”.
Code
covT_topWord <- covT_topWord %>%
  rename(count = "topfeatures(cov_dfm1, 50)")
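As an alternative sketch, the same two-column dataframe can be built in one step in base R. It relies on the top-words result being a named numeric vector, with the words as the names. The vector `toy_top` below is made up and stands in for that result.

```r
# Made-up named counts standing in for the top-words result
toy_top <- c(covid = 120, vaccin = 45, test = 30)

toy_topWord <- data.frame(
  word  = names(toy_top),    # words come from the vector's names
  count = unname(toy_top),   # frequencies, with the names stripped off
  row.names = NULL
)

toy_topWord
```

This skips the awkward automatic column name, at the cost of not showing where the words were hiding in the first place.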
You should have a dataframe with two variables now, “count” with the frequencies, and “word” with the words pulled from the row names. Now we can plot our new dataframe.
We can do this with the base graphics package, or we can use ggplot2. Let’s start with the base graphics barplot.
Code
barplot(height = covT_topWord$count,
names.arg = covT_topWord$word)
Now with ggplot2 “geom_col”
Code
ggplot(covT_topWord, aes(x = word, y = count)) +
geom_col()
The ggplot2 plot contains all of the appropriate information, but is hard to read. How can we adjust it?
One thing we can do is rotate the words to fall on the Y axis with coord_flip().
Code
ggplot(covT_topWord, aes(x = word, y = count)) +
geom_col() +
coord_flip()
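Another adjustment that may help readability is ordering the bars by frequency instead of alphabetically. ggplot2 orders a discrete axis by factor levels, so one way to do this is to convert the word column to a factor whose levels are sorted by count. The sketch below shows only that preparation step in base R, on a made-up dataframe named `toy_topWord`.

```r
# Made-up top-word counts (illustration only)
toy_topWord <- data.frame(word = c("test", "covid", "vaccin"),
                          count = c(30, 120, 45))

# Order the factor levels by count, smallest first, so that after
# coord_flip() the most frequent word ends up at the top of the plot
toy_topWord$word <- factor(toy_topWord$word,
                           levels = toy_topWord$word[order(toy_topWord$count)])

levels(toy_topWord$word)
```

Passing a dataframe prepared this way to the same ggplot() call should draw the bars in count order.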
We can use geom_text() to add the words to the inside of the plot, next to the bars.
Code
ggplot(covT_topWord, aes(x = word, y = count)) +
geom_col() +
coord_flip() +
geom_text(aes(label = word),
position = position_dodge(1),
size = 2)
We can change the colors and sizes of the bars, labels, and background of the plot area.
Code
ggplot(covT_topWord, aes(x = word,
y = count)) +
geom_col(fill = "grey30") +
coord_flip() +
theme(text=element_text(size=8),
panel.background = element_rect(fill = c("black")),
panel.grid = element_line(color = c("black"))) +
geom_text(aes(label = word),
position = position_dodge(1),
size = 2,
color = c("white"))
We can add a title to the plot.
Code
ggplot(covT_topWord, aes(x = word,
y = count)) +
geom_col(fill = "grey30") +
coord_flip() +
theme(text=element_text(size=8),
panel.background = element_rect(fill = c("black")),
panel.grid = element_line(color = c("black"))) +
geom_text(aes(label = word),
position = position_dodge(1),
size = 2,
color = c("white")) +
ggtitle("Number of Tweets Containing Top Words from a Query of Tweets with the keyword COVID")
And we can change the font family, which may be necessary if you are making a plot for something like a publication.
Code
ggplot(covT_topWord, aes(x = word,
y = count)) +
geom_col(fill = "grey30") +
coord_flip() +
theme(text=element_text(size=8,
family = "serif"),
panel.background = element_rect(fill = c("black")),
panel.grid = element_line(color = c("black"))) +
geom_text(aes(label = word, family = "serif"),
position = position_dodge(1),
size = 2,
color = c("white")) +
ggtitle("Number of Tweets Containing Top Words from a Query of Tweets with the keyword COVID")
Remember to save your workspace before you exit so you can access your objects again at a later date!