animated banner image. Background is a photo purple-blue clouds. On the right, not animated, are four icons (top to bottom): a hot pink funnel with three spreadsheets going into it, a hot pink graphic of a computer monitor with a spreadsheet showing set against a pink background, a light pink icon of two small spreadsheets with arrows that meet in the middle pointing to a larger spreadsheet, a blue icon of a spreadsheet with a linegraph image in the bottom left corner and a bar graph image in the top right corner. On the right, there is a purple rectangle with hot pink crayon-esque border and yellow stars of several different sizes on the top left and bottom right corners; the border and stars are moving, the border wiggles, the stars twinkle. Over the rectangle are light blue words, "combining and manipulating dataframes in R" with the R set to the R computing logo. In the bottom left corner, there is a circle that looks as if it has been scribbled with marker. Over the scribble are light blue bubble letters reading "with dplyr"

Combining Dataframes and Manipulating Variables with dplyr in R


Combining and Manipulating Dataframes in R with dplyr

Combining and Manipulating Dataframes in R with dplyr

Author

Heather Sue M. Rosen

This tutorial covers tidy methods for combining and manipulating dataframes in R. For this tutorial, you should already have at least two dataframes loaded into your workspace.

I will use several dataframes containing tweets and associated metadata from five queries with COVID-related keywords. You can find the process for obtaining these data with the twitteR package HERE.
Note, due to recent changes to the Twitter API, you may need a paid business tier developer account to access and download tweets

You do not need twitter or text data to use these methods.

1 Before you begin, install and load the tidyverse in R.

We will use dplyr for the tutorial, which is part of the tidyverse. Calling the tidyverse will load dplyr.

Code
library(tidyverse)

2 You will also need to load your data into the workspace.

If your data are already formatted as an R dataframe (.RData), you can load them by calling your project directory, where your data should be stored.

If your data are in another format, there are several packages in R to handle external data:

  • readxl can load Microsoft Excel files

  • haven can read and write SPSS, Stata, and SAS files

  • readr can load csv files

  • you can learn more about the reticulate package and using python with R HERE

Tailor the code here to your own directory location and file names. I include two “files” in the example code here, but you may have several dataframes. Load them all prior to beginning the data manipulations.

Code
load("/User/ProjectDirectory/YourDataFrame1.RData")
load("/User/ProjectDirectory/YourDataFrame2.RData")

3 Step 1: Add a variable (column) to your dataframe

You can add columns to dataframes with the dplyr “mutate” function. I have five dataframes, and I want to combine them, but I also want to be able to identify the cases (rows) by query in my new dataframe. Before I combine the five dataframes, I will add a new variable (column) to each dataframe. I will use the same variable name so the transformation goes smoothly, but I will set the variable value to reflect the original keyword/dataframe.

I’m going to name this variable “keyw” and set the value to the string used in the keyword query to collect each dataframe.

I’ll start with my “covid” query dataframe. I do not create a new dataframe yet, so I give the objects on the left and right sides of my arrow the same name, the name of the original dataframe. In this case, my first dataframe is named “cov_tweet_df1” and I want to add a column with the label “keyw” and the value “covid”

Code
cov_tweet_df1 <- cov_tweet_df1 %>%
  mutate(keyw = "covid")

Now I can check to see that my new variable is there. I can also pull the values for the first few rows using “head” to make sure the value is “covid”

Code
head(cov_tweet_df1$keyw)
[1] "covid" "covid" "covid" "covid" "covid" "covid"

Repeat this step for each of your dataframes.

Code
pand_tweet_df1 <- pand_tweet_df1 %>%
  mutate(keyw = "pandemic")

prot_tweet_df1 <- prot_tweet_df1 %>%
  mutate(keyw = "protect")

risk_tweet_df1 <- risk_tweet_df1 %>%
  mutate(keyw = "risk")

viru_tweet_df1 <- viru_tweet_df1 %>%
  mutate(keyw = "virus")

Remember, it is good practice to check the values. I will do this for the first few rows of each dataframe.

Code
head(pand_tweet_df1$keyw)
[1] "pandemic" "pandemic" "pandemic" "pandemic" "pandemic" "pandemic"
Code
head(prot_tweet_df1$keyw)
[1] "protect" "protect" "protect" "protect" "protect" "protect"
Code
head(risk_tweet_df1$keyw)
[1] "risk" "risk" "risk" "risk" "risk" "risk"
Code
head(viru_tweet_df1$keyw)
[1] "virus" "virus" "virus" "virus" "virus" "virus"

4 Step 2: Join the dataframes

I want to preserve all of my cases (rows) for each dataframe when combining. It is important that my variable (column) names match so I can use dplyr’s “full_join” function. All of my variables are the same, so I will not specify a “by” variable. If you have different variables in your dataframes, you must have at least one variable in common to bind the frames together. For example, if all of my dataframes had different variables, but I had still added the “keyw” variable to each frame, I could specify “by = join_by(keyw)”

I’ll start with combining two dataframes to create a new frame called “BigDat”

Code
BigDat <- cov_tweet_df1 %>%
  full_join(pand_tweet_df1)

You can view, print, or obtain the summary of your new dataframe to check that your join worked. I can see that my “BigDat” dataframe has 18 variables and 2000 observations, reflecting my joining of two dataframes with the same 18 variables and 1000 observations (rows) each.

Now I will bind the remaining frames to “BigDat” .

Code
BigDat <- BigDat %>%
  full_join(prot_tweet_df1)
Code
BigDat <- BigDat %>%
  full_join(risk_tweet_df1)
Code
BigDat <- BigDat %>%
  full_join(viru_tweet_df1)

My BigDat dataframe now has 5000 observations but still has the same 18 variables, reflecting the 5 dataframes with 1000 unique observations each and 18 shared variables.

5 Step 3: Create a new dataframe with fewer variables, selecting only the ones needed for analysis.

There are many times it will make sense to reduce the number of variables in your dataframe. If I am trying to reduce the size of my new dataframe, I might create another new frame, selecting only the variables of immediate interest from my larger joined dataframe.

In the case of the tweet data used in this example, there are many ways users could be identified (screenname, full tweet text, etc.). This data is sensitive and will not add anything to my analysis. Keeping the data also moves my research from exempt status for “human subjects” research, meaning I would need additional review of my project by and Institutional Review Board (IRB) to ensure my use of the human subjects for research is ethical. I can handle this situation two ways: I can delete the obviously identifying variables, or I can select variables I know I plan to use, excluding all others. Maybe I don’t quite know what variables will work best for my analysis yet, so let’s try selecting all but the obviously identifiable variables (effectively “deleting” the others).

I will store the smaller dataframe in a new object named “BabyDat”. When using dplyr’s “select” function, I can also change the order of the variables (columns) in the new dataframe without changing the order of the cases (rows) by typing the variables in the order I want them to appear in the frame. It makes sense to keep the “id” variable because it identifies unique tweets (cases/observations) without automatically identifying the user who posted the tweet. I’ll put this variable first, since it identifies my cases (stored in the rows). Next I will select the stripped text variable (created in THIS demo). I will also select the favorite and retweet count variables, the “created” variable indicating the date and time the tweet was posted, “replytoSID” which, if not “NA” , represents the id of the tweet being replied to (the case is a comment, “SID” is the id of the original post), “replytoUID” representing the user tied to an original tweet if a case is a comment, and my “keyw” variable representing the query each tweet was collected under.

Code
BabyDat <- BigDat %>%
  select(id, strippedT, favoriteCount, retweetCount, created, replyToSID, replyToUID, keyw)

Now I have another new dataframe, still with 5000 observations (tweets, rows), but now only 8 variables instead of the 18 I had in the bigger dataframe.

6 Step 4: Rename variables

If you did not manually input your data, you probably have variables that were named by someone else (the developer, like these tweets, or the original researcher, in the case of secondary data). If you want to change the variable names to something else, you can use the dplyr “rename” function. For this function, the new name comes first, so that your code shows “new name = old name” . Here I will change the replyToSID and replyToUID variables to “reply2Tweet” and “reply2User” and the created variable to “Post_DateTime” . Can also shorten names of variables, like “favoriteCount” and “retweetCount” here, as long as my reduction does not limit my understanding of the variable (I still need to know my variable is measuring the number of favorites, for example, and not indicating a logical that it was (TRUE) or was not (FALSE) favorited). I’ll change these to “likes” and “RTs” since I know I now only have variables for total count of favorites/retweets (I dropped the others, they are still in BigDat).

Code
BabyDat <- BabyDat %>%
  rename(likes = favoriteCount, 
         RTs = retweetCount,
         Post_DateTime = created,
         reply2Tweet = replyToSID,
         reply2User = replyToUID)

I can check the names of the variables manually in the viewer if I am using something like RStudio, or I can use the “names” function to check the that the variables have been renamed correctly.

Code
names(BabyDat)
[1] "id"            "strippedT"     "likes"         "RTs"          
[5] "Post_DateTime" "reply2Tweet"   "reply2User"    "keyw"         

7 Step 5: Grouping by Value of a Variable

Sometimes you will want to identify a grouping scheme prior to analysis. For example, this is helpful for identifying ties between nodes based on some shared characteristic in network data.

These data already have a grouping variable, “keyw,” but I need to identify the data as grouped to use this. Let’s try it. I’ll store the result in another dataframe called “BabyGroups” to preserve the ungrouped data.

Code
BabyGroups <- BabyDat %>%
  group_by(keyw)

I can summarize the new dataframe to see the grouping

Code
summarise(BabyGroups)
# A tibble: 5 × 1
  keyw    
  <chr>   
1 covid   
2 pandemic
3 protect 
4 risk    
5 virus   

I can also use summarize to look at other information by group. For example, I can look at the mean number of favorites (likes) by group per tweet (case, row). Here I will use summarise to look at the mean number of favorites/likes, and separately, retweets/RTs, by keyword group. You can do this without storing, or you can store the information in a new object for later reference. I’ll store this information in a new object called “BbGrp_Avg_LikeRT” that will be in the form of a “tibble” (a tidy dataframe).

Code
BbGrp_Avg_LikeRT <- BabyGroups %>%
  summarise(likes = mean(likes), RTs = mean(RTs))

I can print the new object to see the results.

Code
print(BbGrp_Avg_LikeRT)
# A tibble: 5 × 3
  keyw     likes   RTs
  <chr>    <dbl> <dbl>
1 covid    0.057  830.
2 pandemic 0.144  821.
3 protect  0.095  540.
4 risk     0.1    338.
5 virus    0.178  689.

In this example there were many tweets with a total of 0 likes/favorites across all groups. On average, the tweets in the keyword “covid” group received the fewest likes/favorites when compared to tweets in the keyword “pandemic,” “protect,” “risk,” and “virus” groups (0.057), while the mean number of likes/favorites was highest for tweets in the keyword “virus” group (0.178).

The RT/retweet averages give information that is a little more useful. The average number of retweets per group ranged from ~338 to ~830. On average, tweets mentioning “risk” received the fewest number of retweets (338.267), while tthe keyword “covid” group had the highest mean number of retweets (829.807). The category with the mean number of tweets closest to the “covid” group was the “pandemic” group, where the average number of retweets was around 821.

Create a website or blog at WordPress.com

Discover more from Medical Sociology on Wheels

Subscribe now to keep reading and get access to the full archive.

Continue reading

Exit mobile version