Combining and Manipulating Dataframes in R with dplyr
This tutorial covers tidy methods for combining and manipulating dataframes in R. For this tutorial, you should already have at least two dataframes loaded into your workspace.
I will use several dataframes containing tweets and associated metadata from five queries with COVID-related keywords. You can find the process for obtaining these data with the twitteR package HERE.
❗Note, due to recent changes to the Twitter API, you may need a paid business tier developer account to access and download tweets❗
You do not need twitter or text data to use these methods.
1 Before you begin, install and load the tidyverse in R.
We will use dplyr for the tutorial, which is part of the tidyverse. Calling the tidyverse will load dplyr.
Code
library(tidyverse)
2 You will also need to load your data into the workspace.
If your data are already formatted as an R dataframe (.RData), you can load them by calling your project directory, where your data should be stored.
If your data are in another format, there are several packages in R to handle external data:
readxl can load Microsoft Excel files
haven can read and write SPSS, Stata, and SAS files
readr can load csv files
you can learn more about the reticulate package and using python with R HERE
Tailor the code here to your own directory location and file names. I include two “files” in the example code here, but you may have several dataframes. Load them all prior to beginning the data manipulations.
Code
load("/User/ProjectDirectory/YourDataFrame1.RData")
load("/User/ProjectDirectory/YourDataFrame2.RData")
3 Step 1: Add a variable (column) to your dataframe
You can add columns to dataframes with the dplyr “mutate” function. I have five dataframes, and I want to combine them, but I also want to be able to identify the cases (rows) by query in my new dataframe. Before I combine the five dataframes, I will add a new variable (column) to each dataframe. I will use the same variable name so the transformation goes smoothly, but I will set the variable value to reflect the original keyword/dataframe.
I’m going to name this variable “keyw” and set the value to the string used in the keyword query to collect each dataframe.
I’ll start with my “covid” query dataframe. I do not create a new dataframe yet, so I give the objects on the left and right sides of my arrow the same name, the name of the original dataframe. In this case, my first dataframe is named “cov_tweet_df1” and I want to add a column with the label “keyw” and the value “covid”
Code
<- cov_tweet_df1 %>%
cov_tweet_df1 mutate(keyw = "covid")
Now I can check to see that my new variable is there. I can also pull the values for the first few rows using “head” to make sure the value is “covid”
Code
head(cov_tweet_df1$keyw)
[1] "covid" "covid" "covid" "covid" "covid" "covid"
Repeat this step for each of your dataframes.
Code
<- pand_tweet_df1 %>%
pand_tweet_df1 mutate(keyw = "pandemic")
<- prot_tweet_df1 %>%
prot_tweet_df1 mutate(keyw = "protect")
<- risk_tweet_df1 %>%
risk_tweet_df1 mutate(keyw = "risk")
<- viru_tweet_df1 %>%
viru_tweet_df1 mutate(keyw = "virus")
Remember, it is good practice to check the values. I will do this for the first few rows of each dataframe.
Code
head(pand_tweet_df1$keyw)
[1] "pandemic" "pandemic" "pandemic" "pandemic" "pandemic" "pandemic"
Code
head(prot_tweet_df1$keyw)
[1] "protect" "protect" "protect" "protect" "protect" "protect"
Code
head(risk_tweet_df1$keyw)
[1] "risk" "risk" "risk" "risk" "risk" "risk"
Code
head(viru_tweet_df1$keyw)
[1] "virus" "virus" "virus" "virus" "virus" "virus"
4 Step 2: Join the dataframes
I want to preserve all of my cases (rows) for each dataframe when combining. It is important that my variable (column) names match so I can use dplyr’s “full_join” function. All of my variables are the same, so I will not specify a “by” variable. If you have different variables in your dataframes, you must have at least one variable in common to bind the frames together. For example, if all of my dataframes had different variables, but I had still added the “keyw” variable to each frame, I could specify “by = join_by(keyw)”
I’ll start with combining two dataframes to create a new frame called “BigDat”
Code
<- cov_tweet_df1 %>%
BigDat full_join(pand_tweet_df1)
You can view, print, or obtain the summary of your new dataframe to check that your join worked. I can see that my “BigDat” dataframe has 18 variables and 2000 observations, reflecting my joining of two dataframes with the same 18 variables and 1000 observations (rows) each.
Now I will bind the remaining frames to “BigDat” .
Code
<- BigDat %>%
BigDat full_join(prot_tweet_df1)
Code
<- BigDat %>%
BigDat full_join(risk_tweet_df1)
Code
<- BigDat %>%
BigDat full_join(viru_tweet_df1)
My BigDat dataframe now has 5000 observations but still has the same 18 variables, reflecting the 5 dataframes with 1000 unique observations each and 18 shared variables.
5 Step 3: Create a new dataframe with fewer variables, selecting only the ones needed for analysis.
There are many times it will make sense to reduce the number of variables in your dataframe. If I am trying to reduce the size of my new dataframe, I might create another new frame, selecting only the variables of immediate interest from my larger joined dataframe.
In the case of the tweet data used in this example, there are many ways users could be identified (screenname, full tweet text, etc.). This data is sensitive and will not add anything to my analysis. Keeping the data also moves my research from exempt status for “human subjects” research, meaning I would need additional review of my project by and Institutional Review Board (IRB) to ensure my use of the human subjects for research is ethical. I can handle this situation two ways: I can delete the obviously identifying variables, or I can select variables I know I plan to use, excluding all others. Maybe I don’t quite know what variables will work best for my analysis yet, so let’s try selecting all but the obviously identifiable variables (effectively “deleting” the others).
I will store the smaller dataframe in a new object named “BabyDat”. When using dplyr’s “select” function, I can also change the order of the variables (columns) in the new dataframe without changing the order of the cases (rows) by typing the variables in the order I want them to appear in the frame. It makes sense to keep the “id” variable because it identifies unique tweets (cases/observations) without automatically identifying the user who posted the tweet. I’ll put this variable first, since it identifies my cases (stored in the rows). Next I will select the stripped text variable (created in THIS demo). I will also select the favorite and retweet count variables, the “created” variable indicating the date and time the tweet was posted, “replytoSID” which, if not “NA” , represents the id of the tweet being replied to (the case is a comment, “SID” is the id of the original post), “replytoUID” representing the user tied to an original tweet if a case is a comment, and my “keyw” variable representing the query each tweet was collected under.
Code
<- BigDat %>%
BabyDat select(id, strippedT, favoriteCount, retweetCount, created, replyToSID, replyToUID, keyw)
Now I have another new dataframe, still with 5000 observations (tweets, rows), but now only 8 variables instead of the 18 I had in the bigger dataframe.
6 Step 4: Rename variables
If you did not manually input your data, you probably have variables that were named by someone else (the developer, like these tweets, or the original researcher, in the case of secondary data). If you want to change the variable names to something else, you can use the dplyr “rename” function. For this function, the new name comes first, so that your code shows “new name = old name” . Here I will change the replyToSID and replyToUID variables to “reply2Tweet” and “reply2User” and the created variable to “Post_DateTime” . Can also shorten names of variables, like “favoriteCount” and “retweetCount” here, as long as my reduction does not limit my understanding of the variable (I still need to know my variable is measuring the number of favorites, for example, and not indicating a logical that it was (TRUE) or was not (FALSE) favorited). I’ll change these to “likes” and “RTs” since I know I now only have variables for total count of favorites/retweets (I dropped the others, they are still in BigDat).
Code
<- BabyDat %>%
BabyDat rename(likes = favoriteCount,
RTs = retweetCount,
Post_DateTime = created,
reply2Tweet = replyToSID,
reply2User = replyToUID)
I can check the names of the variables manually in the viewer if I am using something like RStudio, or I can use the “names” function to check the that the variables have been renamed correctly.
Code
names(BabyDat)
[1] "id" "strippedT" "likes" "RTs"
[5] "Post_DateTime" "reply2Tweet" "reply2User" "keyw"
7 Step 5: Grouping by Value of a Variable
Sometimes you will want to identify a grouping scheme prior to analysis. For example, this is helpful for identifying ties between nodes based on some shared characteristic in network data.
These data already have a grouping variable, “keyw,” but I need to identify the data as grouped to use this. Let’s try it. I’ll store the result in another dataframe called “BabyGroups” to preserve the ungrouped data.
Code
<- BabyDat %>%
BabyGroups group_by(keyw)
I can summarize the new dataframe to see the grouping
Code
summarise(BabyGroups)
# A tibble: 5 × 1
keyw
<chr>
1 covid
2 pandemic
3 protect
4 risk
5 virus
I can also use summarize to look at other information by group. For example, I can look at the mean number of favorites (likes) by group per tweet (case, row). Here I will use summarise to look at the mean number of favorites/likes, and separately, retweets/RTs, by keyword group. You can do this without storing, or you can store the information in a new object for later reference. I’ll store this information in a new object called “BbGrp_Avg_LikeRT” that will be in the form of a “tibble” (a tidy dataframe).
Code
<- BabyGroups %>%
BbGrp_Avg_LikeRT summarise(likes = mean(likes), RTs = mean(RTs))
I can print the new object to see the results.
Code
print(BbGrp_Avg_LikeRT)
# A tibble: 5 × 3
keyw likes RTs
<chr> <dbl> <dbl>
1 covid 0.057 830.
2 pandemic 0.144 821.
3 protect 0.095 540.
4 risk 0.1 338.
5 virus 0.178 689.
In this example there were many tweets with a total of 0 likes/favorites across all groups. On average, the tweets in the keyword “covid” group received the fewest likes/favorites when compared to tweets in the keyword “pandemic,” “protect,” “risk,” and “virus” groups (0.057), while the mean number of likes/favorites was highest for tweets in the keyword “virus” group (0.178).
The RT/retweet averages give information that is a little more useful. The average number of retweets per group ranged from ~338 to ~830. On average, tweets mentioning “risk” received the fewest number of retweets (338.267), while tthe keyword “covid” group had the highest mean number of retweets (829.807). The category with the mean number of tweets closest to the “covid” group was the “pandemic” group, where the average number of retweets was around 821.