Banner image with the title "Analyzing Twitter Data: obtaining and analyzing word frequencies with quanteda and ggplot2," alongside a wordcloud, Twitter interaction icons, and a barchart.

Analyzing Tweets: A Tidy Approach to Basic Analyses in R with Quanteda and ggplot2

Author

Heather Sue M. Rosen

For this tutorial, you will need to have already collected and formatted your text/tweet data for analysis. You can find my process for querying and preparing tweets for analysis with the “twitteR” package HERE.

0.1 Before you begin, be sure you have installed these R packages. You can attach them now, but preserve the order when attaching.

You should also load your data into the workspace (if you arrived here from the data collection tutorial, you should have your data in the workspace already).
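The package list is not spelled out on this page, but based on the functions used later in the tutorial, the block below is a reasonable sketch of what to attach, in order (quanteda first, then its plotting companion, then the tidy tools). Treat the exact list and order as an assumption.

```r
# Assumed package list, inferred from the functions used in this tutorial;
# install any missing ones first with install.packages().
library(quanteda)            # corpus(), tokens(), dfm()
library(quanteda.textplots)  # textplot_wordcloud()
library(dplyr)               # mutate(), rename(), the pipe %>%
library(ggplot2)             # ggplot(), geom_col()
library(RColorBrewer)        # brewer.pal()
```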

Code
load("/Library/Frameworks/R.framework/Versions/4.2/WP_EX_Proj/quantedaDemoFin.RData")

1 Step 1: Strip the tweet text

Special characters and punctuation need to be “stripped” (removed, leaving only raw text) prior to any text analysis, but especially so if the text data contain many special characters. Tweets are a good example, commonly including things like hashtags (#), emojis, and tags mentioning other users (@).

There are several methods available to strip the text of your tweets to remove punctuation, special characters, and even commonly used words like prepositions. Words you want to eliminate from your analysis are called “stop words,” and they are often removed separately from punctuation and special characters.
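As a minimal sketch of what stripping does, here is the idea applied to a made-up tweet using only base R. The tweet text and the regex patterns are illustrative assumptions, not the only options:

```r
# Hypothetical tweet text, for illustration only
raw <- "Check this out! #covid @user https://example.com"

step1 <- tolower(raw)                      # lowercase everything
step2 <- gsub("https?://\\S+", "", step1)  # drop URLs
step3 <- gsub("[[:punct:]]", "", step2)    # drop punctuation and symbols
trimws(step3)
# "check this out covid user"
```

Note that the URL must be removed before punctuation, since stripping punctuation first would destroy the `://` that the URL pattern matches on.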

If you have your text data stored as a text file in your working/project directory, you can use the “readtext” package (a companion to quanteda) to read the text into your R workspace before stripping it.

If you arrived at this tutorial because you completed the previous tutorial on downloading and preparing tweet data with “twitteR,” you likely have dataframe and list objects storing your data in the R workspace. In this case, you will need to extract the text vector(s) from the dataframe(s) prior to stripping the tweet text.

Let’s start by extracting the text vector from the dataframe and storing it as a character vector in the workspace. You can call on variables within a dataframe with the dollar ($) symbol.

In this case, I want to extract the tweets from my keyword query of tweets mentioning “covid” so I identify the covid tweets dataframe (cov_tweet_df1), then I find the variable within that dataframe that contains the tweet text. The variable is conveniently named “text”, so I put the dollar sign after the name of the dataframe, then I type “text” to identify that variable for extraction.

I want to make sure my extracted data are in the correct format, character, so I use the function “as.character()” to wrap everything to the right of my arrow (what I am storing).

Remember the arrow points to the new object you plan to create. Here I call my new object “cov_stripTweet” before indicating that I will place inside of it a character vector containing the text variable in the cov_tweet_df1 dataframe.

Code
cov_stripTweet <- as.character(cov_tweet_df1$text)

Now you should have a new object in the “values” section of your workspace environment. I want to store my stripped text back in the dataframe, but first I want to strip the “text” object I just created. I like the “gsub” function {base} for stripping text. To use it, I call the function, then identify the pattern I want to strip, the replacement for that pattern (here, an empty string), and the character vector to change. I want to place the result back into the same character vector, so I also identify it to the left of my arrow.

I will perform three substitutions, then I will place my vector back into my dataframe as a new variable called “strippedT”

Code
cov_stripTweet <- tolower(cov_stripTweet) #make text lowercase
cov_stripTweet <- gsub("https?://\\S+", "", cov_stripTweet)
  #remove url
cov_stripTweet <- gsub("[[:punct:]]", "", cov_stripTweet)
  #remove punctuation
cov_tweet_df1$strippedT <- cov_stripTweet 
  #store as new variable in dataframe

To check that it worked the way you intended, you can print the variable, or if you are using a program like RStudio, you can click on the dataframe in the workspace environment to open and view as a spreadsheet.

If you executed more than one keyword query in your Twitter data collection in the previous tutorial and have more than one dataframe, repeat this process for each dataframe with tweets you want to analyze.

2 Step 2: Make a text corpus

Raw text data are computationally inefficient to analyze. To work around this limitation, many quantitative text analysis techniques work with “corpus” objects storing “tokens” instead of using raw text. Tokens are smaller units within a text, like words.
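As a rough illustration of what tokenization means, here is a base-R sketch on a made-up sentence (quanteda’s tokens() is far more sophisticated, handling punctuation, symbols, and multi-word units):

```r
# Splitting a sentence into word tokens with base R only
txt <- "the vaccine rollout continues"
toks <- strsplit(txt, " ")[[1]]
toks
# "the" "vaccine" "rollout" "continues"
```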

Let’s try constructing a text corpus using the “quanteda” package. I can use the “corpus()” function in quanteda to create a corpus based on the stripped text (“strippedT”) variable I just stored in my dataframe.

Code
covT_Qcorp <- corpus(cov_tweet_df1$strippedT)

I can view the contents of my quanteda corpus to get a better idea of what information I can explore with my data. Quickly view the contents of your “quanteda” corpus using the “summary” function.

Code
summary(covT_Qcorp)

In my corpus summary, I can see that I have 1000 documents (only the first 100 are printed), and in addition to the “text” label identifying each document (tweet), I have counts of “types,” “tokens,” and “sentences.”
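The difference between “types” and “tokens” is easy to see with a small base-R sketch (the sentence is made up, not drawn from the corpus):

```r
# Tokens count every word occurrence; types count unique words
txt <- "covid cases rise as covid tests fall"
toks <- strsplit(txt, " ")[[1]]
length(toks)          # 7 tokens: every occurrence counts
length(unique(toks))  # 6 types: "covid" is counted only once
```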

It is a good time to save your workspace. Before moving to step 3, save your workspace to the working/project directory.

Code
save.image("~/DataPathToProjDirectory/NameTheWorkspace.RData")

3 Step 3: Tokenize the corpus and create a document-feature matrix (dfm)

Next I can further simplify the raw text from “strippedT” to make my results more meaningful and/or increase computational efficiency. In the previous step, the summary of the quanteda corpus included “Tokens.” We can chain several commands with the pipe operator ( %>% ) to tokenize the text, transform it (for example, by “stemming” words to their roots), and remove stopwords.

Let’s try removing stopwords from the corpus (to make the results more meaningful) and stemming the words to their root (to speed up our analyses) by constructing a document feature matrix using the “dfm()” function. I’ll call the dfm object “cov_dfm1” (numbering is useful in case I want to make another dfm later on).

To do this with the pipe operator, start with the tokens command. Here I note that I want to tokenize the quanteda corpus named “covT_Qcorp” and remove symbols. We already removed symbols when we stripped the text, but I like to do this a second time to catch any that were missed by the {base} package. I also specify to include the document variables (docvars).

Next, I use the pipe operator to layer commands in the order I want them to execute after tokens but before constructing the final new object.

I use the “dfm()” command to construct the dfm from the tokens object, then I layer on the “dfm_wordstem” command to stem the words to their root. I add one final layer to remove stopwords from the dfm using the “dfm_select()” command.

Code
cov_dfm1 <- tokens(covT_Qcorp,
                   what = "word",
                   remove_symbols = TRUE,
                   include_docvars = TRUE) %>%
  dfm() %>%
  dfm_wordstem() %>%
  dfm_select(pattern = stopwords("english"),
             selection = "remove",
             valuetype = "fixed")

Now I have a dfm, but the dfm does not have the metadata from the dataframe. I can attach the metadata using the “meta()” function. In this tutorial, we’ve tended to put the commands on the right side of the arrow and the new object on the left [NewObj <- command(old object)]. “meta()” works the other way around: it is a replacement function, so it wraps the dfm itself on the left of the arrow, and the dataframe supplying the metadata goes on the right.

Let’s specify that the dataframe housing the text we used to make our corpus also houses the metadata.

Code
meta(cov_dfm1) <- cov_tweet_df1

I can check my corpus in RStudio by clicking on the environment object, then “meta,” then “user,” and I will see all of the variables from my dataframe stored here. I can also check this manually by printing the variable names contained in the user metadata in the corpus object using the “@” and “$” selectors.

Code
names(cov_dfm1@meta$user)

4 Step 4: Explore the dfm with plots

If you want to get some basic information about your data but aren’t ready to build a topic model, you can make some basic plots and other visualizations. Let’s make a wordcloud using the “quanteda.textplots” package. Use the “set.seed()” command to fix the random seed so your plot is reproducible; otherwise you will generate a different layout every time you draw the plot.

Code
library(quanteda.textplots)
set.seed(3162312)
textplot_wordcloud(cov_dfm1)
Wordcloud of the 50 most frequent words in a set of 1000 tweets with the keyword covid. Text is blue Arial font, the words COVID and rt are large in the center.

We can change the colors, set minimum counts for the number of documents a word must appear in to be included in the cloud, and add weights to help with readability.

Let’s change the colors, set the rotation, and select a minimum number of hits. You can choose the palette, and you can expand your choices with packages like RColorBrewer. To view the base graphics palettes, you can call “palette.pals()”. I’ll use the “ggplot2” palette, then I’ll show a brewer palette.

Code
set.seed(3162312)
textplot_wordcloud(cov_dfm1,
                   min_count = 10,
                   rotation = .25,
                   color = palette.colors(5, "ggplot2"))
Wordcloud of the 50 most frequent words in a set of 1000 tweets with the keyword covid. Outer words are black Arial font, vaccine is red/orange and midsized towards the center of the cloud, covid and rt are turquoise and large in the center of the cloud

Now with RColorBrewer’s “Set1” palette. You can view the brewer palettes by calling “display.brewer.all()”

Code
library(RColorBrewer)
set.seed(3162312)
textplot_wordcloud(cov_dfm1,
                   min_count = 10,
                   rotation = .25,
                   color = RColorBrewer::brewer.pal(5, 
                                                    "Set1"))
Wordcloud of the 50 most frequent words in a set of 1000 tweets with the keyword covid. Outer words are red Arial font, vaccine is blue and midsized towards the center of the cloud, covid and rt are orange and large in the center of the cloud

5 Step 5: Do some basic analysis with ‘top words’

There are also some basic analyses available for looking at things like top words by frequency. These require some further manipulation of the dfm object. Let’s make a dataframe object containing the top 50 words and their frequencies.

Code
covT_topWord <- as.data.frame(topfeatures(cov_dfm1, 50))

If we look at this dataframe, we only have 1 variable. Where are the words?

They are in the rownames. We need to make a new column storing that information so we can use it for analysis. We can do this with dplyr’s mutate command.

Code
covT_topWord <- covT_topWord %>%
  mutate(word = rownames(covT_topWord))

Now you should have two variables. There’s another issue here, though. The name of your frequency variable is probably something like “topfeatures(cov_dfm1, 50)”. There’s a lot going on in this variable name, and we want it to be easier to access. Let’s change it with dplyr’s rename, remembering to use the format “new_name = old_name”.

Code
covT_topWord <- covT_topWord %>%
  rename(count = "topfeatures(cov_dfm1, 50)")

You should have a dataframe with two variables now, “count” with the frequencies, and “word” with the words pulled from the row names. Now we can plot our new dataframe.

We can do this with the base graphics package, or we can use ggplot2. Let’s start with the base graphics barplot.

Code
barplot(height = covT_topWord$count,
        names.arg = covT_topWord$word)
Barplot showing words by frequency, base graphics

Now with ggplot2 “geom_col”

Code
ggplot(covT_topWord, aes(x = word, y = count)) +
  geom_col()
Barplot showing words by frequency, ggplot2, no additional formatting so the words overlap and are not legible along the x axis. x axis is labeled word, y axis is labeled count and ranges from 0-1000. Plot area is very light grey with horizontal and vertical gridlines showing. The bars are dark grey.

The ggplot2 plot contains all of the appropriate information, but is hard to read. How can we adjust it?

One thing we can do is rotate the words to fall on the Y axis with coord_flip().

Code
ggplot(covT_topWord, aes(x = word, y = count)) +
  geom_col() +
  coord_flip()
Barplot showing words by frequency, ggplot2, flipped coordinates so words are along the y axis and count is along the x axis, making the words legible but crowded. The y axis is labeled word, the x axis is labeled count and ranges from 0-1000. Plot area is very light grey with horizontal and vertical gridlines showing. The bars are dark grey.

We can use geom_text() to add the words to the inside of the plot, next to the bars.

Code
ggplot(covT_topWord, aes(x = word, y = count)) +
  geom_col() +
  coord_flip() +
  geom_text(aes(label = word), 
            position = position_dodge(1),
            size = 2)
Barplot showing words by frequency, ggplot2, with annotations adding word labels to a second location (the bars themselves) and flipped coordinates so words are along the y axis and count is along the x axis, making the words legible but crowded. The y axis is labeled word, the x axis is labeled count and ranges from 0-1000. Plot area is very light grey with horizontal and vertical gridlines showing. The bars are dark grey. The word labels sit half on, half off of their respective bars and are black, making them barely legible.

We can change the colors and sizes of the bars, labels, and background of the plot area.

Code
ggplot(covT_topWord, aes(x = word, 
                         y = count)) +
  geom_col(fill = "grey30") +
  coord_flip() +
  theme(text=element_text(size=8),
        panel.background = element_rect(fill = c("black")),
        panel.grid = element_line(color = c("black"))) +
  geom_text(aes(label = word), 
            position = position_dodge(1),
            size = 2,
            color = c("white"))
Barplot showing words by frequency, ggplot2, with annotations adding word labels to a second location (the bars themselves) and flipped coordinates so words are along the y axis and count is along the x axis. The y axis is labeled word with label size reduced to eliminate overlap, the x axis is labeled count and ranges from 0-1000. Plot area is black with no gridlines showing. The bars are dark grey. The word labels sit half on, half off of their respective bars and are white, making it easy to identify the word whose frequency is represented by each bar.

We can add a title to the plot.

Code
ggplot(covT_topWord, aes(x = word, 
                         y = count)) +
  geom_col(fill = "grey30") +
  coord_flip() +
  theme(text=element_text(size=8),
        panel.background = element_rect(fill = c("black")),
        panel.grid = element_line(color = c("black"))) +
  geom_text(aes(label = word), 
            position = position_dodge(1),
            size = 2,
            color = c("white")) + 
  ggtitle("Number of Tweets Containing Top Words from a Query of Tweets with the keyword COVID")
Barplot showing words by frequency, ggplot2, annotations adding word labels to a second location, the bars themselves, and flipped coordinates so words are along the y axis and count is along the x axis, title at top of plot is number of tweets containing top words from a query of tweets with the keyword COVID, y axis is labeled word with label size reduced to eliminate overlap, x axis is labeled count and ranges from 0-1000. Plot area is black with no gridlines showing. The bars are dark grey. The word labels are half on, half off of their respective bars and are white, making it easy to identify the word whose frequency is represented by each bar.

And we can change the font family, which may be necessary if you are making a plot for something like a publication.

Code
ggplot(covT_topWord, aes(x = word, 
                         y = count)) +
  geom_col(fill = "grey30") +
  coord_flip() +
  theme(text=element_text(size=8,
                          family = "serif"),
        panel.background = element_rect(fill = c("black")),
        panel.grid = element_line(color = c("black"))) +
  geom_text(aes(label = word, family = "serif"), 
            position = position_dodge(1),
            size = 2,
            color = c("white")) + 
  ggtitle("Number of Tweets Containing Top Words from a Query of Tweets with the keyword COVID")
Barplot showing words by frequency, ggplot2, annotations adding word labels to a second location, the bars themselves, and flipped coordinates so words are along the y axis and count is along the x axis, title at top of plot is number of tweets containing top words from a query of tweets with the keyword COVID, y axis is labeled word with label size reduced to eliminate overlap, x axis is labeled count and ranges from 0-1000. Plot area is black with no gridlines showing. The bars are dark grey. The word labels are half on, half off of their respective bars and are white, making it easy to identify the word whose frequency is represented by each bar. Font is changed from Arial to Times New Roman

Remember to save your workspace before you exit so you can access your objects again at a later date!

