Path Analysis

Romantic Relationships in Detroit Redux

In this lab you will build path analysis models that correspond to the multiple regression models you fit in Lab 1. To do this, you will be introduced to the SEM package lavaan. The lavaan package can be used to fit regression models as well as systems of structural regression models. You will also learn to extend the idea of multiple regression models to true Path Analysis involving multiple models.

The data you will use for this lab is the acitelli data set. It consists of heterosexual couples living in the city of Detroit. You can read more about the study that goes with this data here. We will be particularly interested in understanding how relationship satisfaction, gender, how much tension they perceive in their relationships, and how positively they see their partners, all relate in a model.

Getting Started

Load packages

In this lab we will explore the data using the dplyr package, visualize it using the ggplot2 package for data visualization, and restructure it using tidyr. So we are just going to load the tidyverse package that contains all of these things. The data can be found at the URL given below. Later we will also use the lavaan package, so please install it now.

Let’s load the packages.

library(tidyverse)
#install.packages("lavaan")
library(lavaan)

Creating a reproducible lab report

Please write your lab code and answers in a R Markdown file. You are to submit the knitted HTML file to Moodle when your lab is complete.

The data

The data set for this lab consists of 296 persons in 148 heterosexual dating relationships from the city of Detroit.

Let’s load the data:

acitelli <- read_csv("http://www.science.smith.edu/~rgarcia/sds390-S19/acitelli-individual.csv")

The gender variable is coded -1(= Women) and 1 (= Men). The self_pos is the mean of positive self variables and other_pos is the mean of positive partner variables, where both variables the average of Cooperative, Mature, Friendly, Hard Working and Care about Others. Our main response variable, satisfaction
is the mean of four variables (theoretical range from 1 to 4). The variable tension is the mean of two: perceived disagreement and perceived tension (theoretical range from 1 to 4). Lastly, simhob is the perception of having similar hobbies to one’s partner (1 = very true; 0 = somewhat true; -1 = not at all true). We won’t be using this variable today, but you can play with it if you have time.

Preparing the data

Take a look at the data. You can use glimpse(), head(), View(), however you like to look at your data frames. We have observations on 8 different variables, some are categorical and some numerical.

The gender variable is currently coded as -1 for women and 1 for men, let’s first change this variable into a more descriptive spring variable.

acitelli <- acitelli %>%
  mutate(gender_string  = ifelse(gender == -1, "woman", "man"))

Multiple linear regression

In lab 1 you found that that these participants might be in romantic relationships with each other and that this violates the independence assumption. For this lab we will be ignoring this fact, but we will return to it later in the course for path analysis.

The data set contains several relationship variables including the participants’ feelings about themselves and their partner. Let’s take a look at the relationship between one of these positivity variables, other_pos, and tension split by gender_string.

ggplot(data = acitelli, aes(x = other_pos, y = tension, color = gender_string)) +
  geom_jitter()

Now let’s create a multiple regression model predicting satisfaction that contains tension, gender, and other_pos as predictors. Call it reg_mod.

reg_mod <- lm(satisfaction ~ tension + other_pos + gender, data = acitelli) 

summary(reg_mod)

The interpretation of the coefficients in multiple regression is slightly different from that of simple regression. The estimate for tension reflects how much more satisfied a person is expected be if they perceive tension in their relationship one point higher while holding all other variables constant. In this case, that translates into considering only people of the same gender, who feel the same amount of positivity towards their partner, with tension scores that are one point apart.

It is hard to visualize multiple regression models. But we can use some of the other aesthetic mappings we know to include all variables. Start with the plot below.

ggplot(data = acitelli, aes(x = other_pos, y = satisfaction)) +
  geom_jitter() +
  geom_smooth(method = "lm", se = 0)

We can try to map the remaining variables, tension and gender_string to other features of the graph.

For this exercise, I would like you to map the tension variable to the color of the jittered points and also facet by gender_string. Hint: You will need to add aesthetic mappings directly to the geom_jitter() function.

Path Analysis

The package that we will be using to fit structural equation models is called lavaan. lavaan can be used to run path analyses (PA), confirmatory factor analyses (CFA), and the combination of the two, which is a SEM. It stands for Latent Variable Analysis. It is based a three step system where you first specify a model as a character string, second you pass this model to a function (along with the data) that finds parameter estimates, and finally you inspect the output. There are lavaan tutorials online here.

SEM lavaan has three steps:
* Specify the model.
* Estimate that model.
* Examine the output.

lavaan relational symbols used to specify a model:
* x ~ y + z — regression paths from y and z to x
* x ~~ y — covariance between x and y * x ~~ x — variance of x

We can actually use lavaan to fit the reg_mod model we ran above. After installing lavaan and loading it, use the following code to specify, fit, and examine your model output. Note that we need to be more explicit with lavaan and ask for the intercept.

The sem() function is doing the work here, estimating the model parameters.

model <- 'satisfaction ~ 1 + tension + other_pos + gender'

fit <- sem(model, data = acitelli)

summary(fit)

Compare the path estimates obtained from lavaan to the regression coefficient estimates you got using the lm() function for reg_mod. Are there any similarities?
How many known observations do we have in this model? How many unknowns (parameters we are trying to estimate)? What are the model degrees of freedom and can you find this quantity in the lavaan summary?

For fun, we can also use the semPaths() function from the semPlot package to make a diagram in R. Although this is not always the easiest solution. If you would like, install this package now. But running this code below is not required to answer the questions in this lab.

#install.packages("semPlot")
library(semPlot)

If we give semPaths() out fitted lavaan object, it will make a diagram for us. This is what it looks like without any added options.

semPaths(fit)

Those triangles are the intercepts (of the endogenous variable) and means (of the exogenous variables). The semPaths() function has many options. For example, run the following code and the triangles disappear, among other things.

semPaths(fit, what = "est", intercepts = FALSE, style = "lisrel", rotation = 2)

Check out the semPath() documentation but using the ?. This function is kind of a LOT. Honestly, I make my SEM figures in PowerPoint, Word, or in AMOS (another SEM software package connected to SPSS). But there are other options discussed in your textbook.

?semPaths

Modeling the Nonindependence

In lab 1 you found that these participants were in romantic relationships with each other. This violates the independence assumption for multiple regression. We now return to this issues and model the nonindependence using Path Analysis.

The first thing we need to do is change the observational unit from individuals to couples (i.e., dyads). While the individual data looks like this:

## # A tibble: 6 x 9
##   cuplid Yearsmar gender self_pos other_pos satisfaction tension simhob
##    <dbl>    <dbl>  <dbl>    <dbl>     <dbl>        <dbl>   <dbl>  <dbl>
## 1      3     8.20     -1      4.8       4.6         4        1.5      0
## 2      3     8.20      1      3.8       4           3.67     2.5      1
## 3     10    10.5      -1      4.6       3.8         3.17     4        0
## 4     10    10.5       1      4.2       4           3.67     2        0
## 5     11    -8.30     -1      5         4.4         3.83     2.5      0
## 6     11    -8.30      1      4.2       4.8         3.83     2.5      0
## # … with 1 more variable: gender_string <chr>

We want to create a data set called acitelli_dyad that looks like this:

## # A tibble: 6 x 12
##   cuplid Yearsmar other_pos_man other_pos_woman satisfaction_man
##    <dbl>    <dbl>         <dbl>           <dbl>            <dbl>
## 1      3     8.20           4               4.6             3.67
## 2     10    10.5            4               3.8             3.67
## 3     11    -8.30           4.8             4.4             3.83
## 4     17    -6.38           4.4             3.6             3.83
## 5     21    10.2            4.8             3.8             3.5 
## 6     22    15.0            4.6             5               4   
## # … with 7 more variables: satisfaction_woman <dbl>, self_pos_man <dbl>,
## #   self_pos_woman <dbl>, simhob_man <dbl>, simhob_woman <dbl>,
## #   tension_man <dbl>, tension_woman <dbl>

To make this happen, we will use the functions from the tidyr package, including gather(), unite(), and spread().

acitelli_dyad <- acitelli %>%
  gather(self_pos:simhob, key = "the_old_vars", value = "score") %>%
  select(-gender) %>%
  unite(the_new_vars, the_old_vars, gender_string, sep = "_", remove = TRUE) %>%
  spread(the_new_vars, score)

Try exploring what is happening at each step in this pipeline by rerunning it at several different stages and viewing the result.

Describe what the unite() function is doing. Why do you think we need this step, and and how does this relate to us dropping the gender variables in the select() function?

Now, we might want to fit the same multiple regression model the we fit above, but we can include both satisfaction_woman and satisfaction_man as endogenous variables. Recall that an endogenous variable is a variable in the model that is being directionally pointed at by other variables in the model. An endogenous variable can also point at other endogenous variables. On the other hand, exogenous variables only do the directional pointing.

If both satisfaction_woman and satisfaction_man are endogenous variables (think response variables), what do you think will be the explanatory variables for each response?
What parameter could be added to the model to indicate that the residuals of the two people in the couple might be related?

If we wanted to create a simple linear regression version of this model with only other_pos as a predictor, it might look like this.

model_slr <- 'satisfaction_woman  ~ 1 + other_pos_woman 
              satisfaction_man  ~ 1 + other_pos_man'

fit_slr <- sem(model_slr, data = acitelli_dyad)

summary(fit_slr)

Based on your answer to the exercise above, fit a Path Analysis model using lavaan that fits these two linear models to the data. It might help to sketch the model on a piece of paper first. Note that by default, lavaan will include correlations of all exogenous variables as well as the covariance of endogenous residuals.
Give interpretations of the values in the estimates column under Regressions and Covariances.

We can also use lavaan to fix parameters to be equal in our model.

summary(fit_slr)

## lavaan 0.6-3 ended normally after 24 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          7
## 
##   Number of observations                           148
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                      36.180
##   Degrees of freedom                                 2
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                        Estimate  Std.Err  z-value  P(>|z|)
##   satisfaction_woman ~                                    
##     other_pos_womn        0.278    0.066    4.197    0.000
##   satisfaction_man ~                                      
##     other_pos_man         0.345    0.061    5.662    0.000
## 
## Covariances:
##                         Estimate  Std.Err  z-value  P(>|z|)
##  .satisfaction_woman ~~                                    
##    .satisfactin_mn         0.098    0.018    5.487    0.000
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .satisfactn_wmn    2.412    0.284    8.505    0.000
##    .satisfactin_mn    2.140    0.263    8.133    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .satisfactn_wmn    0.233    0.027    8.602    0.000
##    .satisfactin_mn    0.163    0.019    8.602    0.000

Taking a look at our output from the model_slr above. We can see that the estimates of effects of other positivity on satisfaction are different for men and women. We have allowed them to be different. Letting these two numbers be different is the same as including the interaction of other_pos and gender in our model.

We can fix parameter to be equal in a Path Analysis model by including equality constraints. the sem() function will find path estimates for other_pos that are the same with the model specified below. Equality constraints are added to a model by naming a parameter with a character string, here I am calling it “b1”, and the adding the name next to the variable names in the model plus the * operator.

model_slr2 <- 'satisfaction_woman  ~ 1 + b1*other_pos_woman 
                satisfaction_man  ~ 1 + b1*other_pos_man'

fit_slr2 <- sem(model_slr2, data = acitelli_dyad)

summary(fit_slr2)

We can further fix the residual variances and the intercepts to be equal.

model_slr3 <- 'satisfaction_woman  ~ b0*1 + b1*other_pos_woman 
                satisfaction_man  ~ b0*1 + b1*other_pos_man
                satisfaction_woman ~~ v*satisfaction_woman
                satisfaction_man ~~ v*satisfaction_man'

fit_slr3 <- sem(model_slr3, data = acitelli_dyad)

summary(fit_slr3)

What are the degrees of freedom of the three slr models, model_slr, model_slr2, and model_slr3? What are the degrees of freedom changing?
Now do the same for your model where both other_pos and tension are exogenous predictors. Fit a model with the following equality constraints across men and women: equal coefficients for other_pos, equal coefficients for tension, and equal variances of the satsifaction residuals. Note: We are not constraining the two intercepts.
How does this model compare with reg_mod above? What do you think are some features of your Path Analysis model that make it different from the multiple regression model?

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted from a lab written by Mine Çetinkaya-Rundel and Andrew Bray.