As a Fellow for the Program for Advanced Research in the Social Sciences, I have the opportunity to teach students, faculty, and staff at Duke how to develop research designs, chooose quantitative methods, and implement those methods with statistical software.
Recently, a student asked me for help in calculating average score scales from multiple survey items. Since this provided a good opportunity to teach the student that there are multiple approaches to any programming problem and that each approach faces different trade-offs in terms of computational cost, verbosity, generality, and the opportunity for making mistakes, I put together a short gist I thought I’d share. Note that you should view the raw code since the $ signs in gists are converted to formulas on this site.
######################################################################################### | |
# This gist contains a quick walk-through of several ways to produce scales capturing | |
# the average value of one or more variables. Each row (observation) gets its own | |
# value. We'll assume your data are not fully "tidy." What I mean by this is that you | |
# have an observation for each row and you want to calculate that observation's value | |
# on the scale, but each variable that should go into your scale is in its own column. | |
######################################################################################### | |
######################################################################################### | |
# Set-up (packages, fake data, etc.) | |
######################################################################################### | |
# Load dplyr | |
library(dplyr) | |
# Set seed | |
set.seed(9) | |
# Create some fake data with variables x, y, z, w | |
# for example purposes | |
mydata <- data.frame(x = rnorm(10), | |
y = runif(10), | |
z = seq(1:10), | |
w = c(1,rep(NA,9))) | |
# Let's take a look at our raw data. Note | |
# that we have missing values in variable w. | |
mydata | |
######################################################################################### | |
# Scale creation | |
######################################################################################### | |
# We'll now create 2 scales. Each scale will | |
# capture the average of two variable in our data. | |
# Which variables? Let's choose two variables | |
# for each of the scales: scale1 is going | |
# to reflect the mean of variables named x and y, | |
# scale2 will reflect the average of variables z and y. | |
scale1_vars <- c("x","y") | |
scale2_vars <- c("y","z") | |
# Now let's generate our scales. If a row is missing a | |
# value for one of the variables, the average will be | |
# computed from the other variable if it is not also | |
# missing. If both values are missing, the scale value | |
# will be NA. | |
# The code below tells the mutate() function from the dplyr package | |
# that we want to generate new variables (scale1 and scale2) by | |
# calculating the average of the values observed in each row for | |
# only those columns containing the variables we previously included | |
# in scale1_vars or scale2_vars. A different scale is generated for | |
# each group of variables. | |
fulldf <- mydata %>% dplyr::mutate(scale1 = rowMeans(mydata[ ,scale1_vars], na.rm = TRUE), | |
scale2 = rowMeans(mydata[ ,scale2_vars], na.rm = TRUE)) | |
# A benefit of the above approach is that you can put any variables you | |
# want into scales1_vars or scales2_vars and you don't need to know how | |
# many variables you put in - the code will simply calculate the average | |
# across all of the columns. | |
# Now let's calculate the averages manually. We can do this by | |
# adding up the variables we want in our scale, then dividing by the | |
# number of variables to get the average of the variables. | |
fulldf <- fulldf %>% dplyr::mutate(scale1_manual = (x + y) / 2, | |
scale2_manual = (y + z) / 2) | |
# Let's make sure the manual approach and the other approach produce | |
# the same result. We can use the identical() function to tell us if the | |
# two variables are identical! TRUE if so, FALSE if not. | |
identical(fulldfscale1) # TRUE | |
identical(fulldfscale2) # TRUE | |
# Based on the above, our approach works! The former approach is nice because | |
# (1) you don't need to type out the # of variables you are including in the scale | |
# and so don't run the risk of forgetting to change the number you're dividing | |
# by if/when you change the number of items in your scale and (2) it calculates | |
# averages for rows using the available variables where there is no missingness | |
# rather than simply returning NA if ANY variable is missing. On the other hand, | |
# (a) it is more lines of code because you first say which sets of variables you | |
# want to include in your scales in some lines of code then actually generate | |
# the scales in some more code, (b) it's not so obvious mathematically what you | |
# are doing unless you immdiately see "rowMeans" and know that means it is | |
# calculating the average of the rows for each scale, and (c) you might want | |
# to drop any observation with missing data rather than simply use the available | |
# data to calculate an average. For example, if we make a scale using the variable | |
# with missing data (w) then we will find the two approaches produce different | |
# results: | |
# Approach 1 | |
scale3_vars <- c("w","z") | |
fulldf <- fulldf %>% dplyr::mutate(scale3 = rowMeans(fulldf[ ,scale3_vars], na.rm = TRUE)) | |
# Approach 2 | |
fulldf <- fulldf %>% dplyr::mutate(scale3_manual = (w + z) / 2) | |
# Test if identical | |
identical(fulldfscale3_manual) | |
# To see why, here are the computed scale values for the two different approaches: | |
fulldf$scale3 | |
fulldf$scale3_manual | |
# Note: there are more efficient ways of doing this as well, such as writing a | |
# function to implement the second approach simply by giving the function the | |
# variable names and scale name, while also building in functionality to let | |
# the user choose to include or omit variables with missing values. The | |
# examples above are reasonable ways of doing this for those just learning R | |
# and/or the Tidyverse. |