In this lab you will build multiple regression models, explore interactions between a pair of continuous predictors, and you will be introduced to the SEM package lavaan
. The lavaan
package can be used to fit regression models as well as systems of structural regression models.
The data you will be using for this consists of people currently involved in romantic relationships living in the city of Detroit. You can read more about the study that goes with this data here. We will be particularly interested in understanding predictors of relationship satisfaction including gender, how much tension they perceive in their relationships, and how positively they see their partners.
In this lab we will explore the data using the dplyr
package and visualize it using the ggplot2
package for data visualization. The data can be found at the URL given below. Later we will also use the lavaan
package, so please install it now.
Let’s load the packages.
library(dplyr)
library(ggplot2)
#install.packages("lavaan")
library(lavaan)
Please write your lab code and answers in a R Markdown file. You are to submit the knitted HTML file to Moodle when your lab is complete.
The data set for this lab consists of 296 persons involved heterosexual dating relationships from the city of Detroit.
Let’s load the data:
acitelli <- read.csv("http://www.science.smith.edu/~rgarcia/sds390-S19/acitelli-individual.csv")
Take a look at the data. You can use glimpse()
, head()
, View()
, however you like to look at your data frames. We have observations on 8 different variables, some are categorical and some numerical.
The gender
variable is coded -1(= Women) and 1 (= Men). You will change this next. The self_pos
is the mean of positive self variables and other_pos
is the mean of positive partner variables, where both variables the average of Cooperative, Mature, Friendly, Hard Working and Care about Others. Our main response variable, satisfaction
is the mean of four variables (theoretical range from 1 to 4). The variable tension
is the mean of two: perceived disagreement and perceived tension (theoretical range from 1 to 4). Lastly, simhob
is the perception of having similar hobbies to one’s partner (1 = very true; 0 = somewhat true; -1 = not at all true). We won’t be using this variable today, but you can play with it if you have time.
The gender
variable is currently coded as -1
for women and 1
for men, create two new indicator variables, one called woman
, and another man
.
How many observations are in each level of cuplid
? What are the counts of women and men in the data? Based on these two summaries, do you think the cuplid
variable means?
Select two variables in the data set and visualize their relationship with each other. Also, calculate an appropriate summary statistic for this relationship. Describe the relationship.
The most important research question we ask today is does perceiving tension in one’s relationship lead to lower relationship satisfaction. Let’s create a scatterplot to see if this appears to be the case:
ggplot(data = acitelli, aes(x = tension, y = satisfaction)) +
geom_point()
Before we draw conclusions about the trend, compare the number of observations in the data frame with the approximate number of points on the scatterplot. Is anything awry?
Replot the scatterplot, but this time using a geom
that corrects the problem or another visualization solution. Comment on your plot. Hint: You can jitter the points, create transparency, or check out geom_beeswarm()
from the ggbeeswarm
package for funzies!
Let’s see if the apparent trend in the plot is something more than natural variation. Fit a linear model called mod_t
to predict satisfaction score by tension. Interpret the slope and intercept of this model. Is tension a statistically significant predictor of satisfaction? Does it appear to be a practically significant predictor? Also, add a line of best fit to your visualization.
Use residual plots, plot(mod_t)
, to evaluate whether the conditions of least squares regression are reasonable. Comment on each of the regression assumptions—remember LINE: Linearity, Independence, Normality, and Error homogeneity.
If you suspected above that these participants might be in romantic relationships with each other, you’re correct! This indeed violates the independence assumption. For this lab we will be ignoring this fact, but we will return to it later in the course for path analysis.
The data set contains several relationship variables including the participants’ feelings about themselves and their partner. Let’s take a look at the relationship between one of these positivity variables, other_pos
, and tension
. Before running the code below:
other_pos
and tension
to be?Now run the following:
ggplot(data = acitelli, aes(x = other_pos, y = tension)) +
geom_jitter()
acitelli %>%
summarise(cor(other_pos, tension))
As expected the relationship is negative. In order to see if perceived tension is still a significant predictor of satisfaction after we’ve accounted for perceptions of other positivity, we can add the other_pos
term into the model.
Fit a multiple regression model called mod_t_op
with both tension and other positivity as predictors of satisfaction. Interpret the intercept and slope coefficients. Are they statistically significant predictors? Has the addition of other_pos
to the model changed the parameter estimate for tension
?
What is the R2 value for this model? Give its interpretation.
Now let’s add gender into the model that already contains tension
and other_pos
. First fit a model using the woman
indicator you created earlier. Call it mod_wo
. Then, fit a model with the original gender
variable. Call it mod_gen
.
woman
in mod_wo
and gender
in mod_gen
. What do the intercepts mean in these two models? Are the coefficients for tension
and other_pos
different across these two models?The interpretation of the coefficients in multiple regression is slightly different from that of simple regression. The estimate for tension
reflects how much more satisfied a person is expected be if they perceive tension in their relationship one point higher while holding all other variables constant. In this case, that translates into considering only people of the same gender, who feel the same amount of positivity towards their partner, with tension
scores that are one point apart.
Now we can ask if positive perceptions of your partner difuse the negative effect of tension on relationship satisfaction. That is, we can ask, is the effect of tension on satisfaction weakened (or zero) when other positivity is high? Questions like this are Answered with interaction terms. .
tension
and other_pos
to the model and leave gender
out of this model. Is there a statistically significant interaction of tension
and other_pos
? Can you give an interpretation? Hint: see below.To aid in interpretation of this interaction we can estimate the effect of tension
on satisfaction
when other_pos
is low (=1) and when other_pos
is high (=5). We will need to create two new variables, other_pos_low
and other_pos_high
to do this. For the low version, everyone who’s 1
will be made 0
, and for the high version, everyone who is 5
will be made 0
.
acitelli <- acitelli %>%
mutate(other_pos_low = other_pos - 1,
other_pos_high = other_pos - 5)
Then we simply re-run the model two more times and look at the main effect of tension
in each model.
mod_low <- lm(satisfaction ~ tension*other_pos_low, data = acitelli)
summary(mod_low)
mod_high <- lm(satisfaction ~ tension*other_pos_high, data = acitelli)
summary(mod_high)
Interpretation: When other positivity is low, we would expect a strong negative effect of tension on satisfaction, but when other positivity is high, we would still expect a statistically significant negative effect of tension on satisfaction, but it is far less strong.
lavaan
The package that we will be using to fit structural equation models is called lavaan
. lavaan
can be used to run path analyses (PA), confirmatory factor analyses (CFA), and the combination of the two, which is a SEM. It stands for Latent Variable Analysis. It is based a three step system where you first specify a model as a character string, second you pass this model to a function (along with the data) that finds parameter estimates, and finally you inspect the output. There are lavaan
tutorials online here.
SEM lavaan has three steps:
* Specify the model.
* Estimate that model.
* Examine the output.
lavaan
relational symbols used to specify a model:
* x ~ y + z — regression paths from y and z to x
* x ~~ y — covariance between x and y * x ~~ x — variance of x
We can actually use lavaan
to fit the mod_wo
model we ran above.
library(lavaan)
After installing lavaan
and loading it, use the following code to specify, fit, and examine your model output.
model <- 'satisfaction ~ tension + other_pos + woman'
fit <- sem(model, data = acitelli)
summary(fit)
Compare the path estimates obtained from lavaan
to the regression coefficient estimates you got using the lm()
function for mod_wo
. Are there any differences?
In order to add interactions to your model in lavaan
, you will need to first create the product term in your data set. For this last exercise, please add that term to your data set and fit the model you ran in exercise 11. Are the estimates similar across models?
This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted from a lab written by Mine Çetinkaya-Rundel and Andrew Bray.