## Romantic Relationships in Detroit Redux

In this lab you will build path analysis models that correspond to the multiple regression models you fit in Lab 1. To do this, you will be introduced to the SEM package `lavaan`. The `lavaan` package can be used to fit regression models as well as systems of structural regression models. You will also learn to extend the idea of multiple regression models to true Path Analysis involving multiple models.

The data you will use for this lab is the `acitelli` data set. It consists of heterosexual couples living in the city of Detroit. You can read more about the study that goes with this data here. We will be particularly interested in understanding how relationship satisfaction, gender, how much tension they perceive in their relationships, and how positively they see their partners, all relate in a model.

## Getting Started

In this lab we will explore the data using the `dplyr` package, visualize it using the `ggplot2` package for data visualization, and restructure it using `tidyr`. So we are just going to load the `tidyverse` package that contains all of these things. The data can be found at the URL given below. Later we will also use the `lavaan` package, so please install it now.

``````library(tidyverse)
#install.packages("lavaan")
library(lavaan)``````

### Creating a reproducible lab report

Please write your lab code and answers in a R Markdown file. You are to submit the knitted HTML file to Moodle when your lab is complete.

### The data

The data set for this lab consists of 296 persons in 148 heterosexual dating relationships from the city of Detroit.

``acitelli <- read_csv("http://www.science.smith.edu/~rgarcia/sds390-S19/acitelli-individual.csv")``

The `gender` variable is coded -1(= Women) and 1 (= Men). The `self_pos` is the mean of positive self variables and `other_pos` is the mean of positive partner variables, where both variables the average of Cooperative, Mature, Friendly, Hard Working and Care about Others. Our main response variable, `satisfaction`
is the mean of four variables (theoretical range from 1 to 4). The variable `tension` is the mean of two: perceived disagreement and perceived tension (theoretical range from 1 to 4). Lastly, `simhob` is the perception of having similar hobbies to one’s partner (1 = very true; 0 = somewhat true; -1 = not at all true). We won’t be using this variable today, but you can play with it if you have time.

## Preparing the data

Take a look at the data. You can use `glimpse()`, `head()`, `View()`, however you like to look at your data frames. We have observations on 8 different variables, some are categorical and some numerical.

The `gender` variable is currently coded as `-1` for women and `1` for men, let’s first change this variable into a more descriptive spring variable.

``````acitelli <- acitelli %>%
mutate(gender_string  = ifelse(gender == -1, "woman", "man"))``````

## Multiple linear regression

In lab 1 you found that that these participants might be in romantic relationships with each other and that this violates the independence assumption. For this lab we will be ignoring this fact, but we will return to it later in the course for path analysis.

The data set contains several relationship variables including the participants’ feelings about themselves and their partner. Let’s take a look at the relationship between one of these positivity variables, `other_pos`, and `tension` split by `gender_string`.

``````ggplot(data = acitelli, aes(x = other_pos, y = tension, color = gender_string)) +
geom_jitter()``````

Now let’s create a multiple regression model predicting `satisfaction` that contains `tension`, `gender`, and `other_pos` as predictors. Call it `reg_mod`.

``````reg_mod <- lm(satisfaction ~ tension + other_pos + gender, data = acitelli)

summary(reg_mod)``````

The interpretation of the coefficients in multiple regression is slightly different from that of simple regression. The estimate for `tension` reflects how much more satisfied a person is expected be if they perceive tension in their relationship one point higher while holding all other variables constant. In this case, that translates into considering only people of the same gender, who feel the same amount of positivity towards their partner, with `tension` scores that are one point apart.

It is hard to visualize multiple regression models. But we can use some of the other aesthetic mappings we know to include all variables. Start with the plot below.

``````ggplot(data = acitelli, aes(x = other_pos, y = satisfaction)) +
geom_jitter() +
geom_smooth(method = "lm", se = 0) ``````

We can try to map the remaining variables, `tension` and `gender_string` to other features of the graph.

1. For this exercise, I would like you to map the `tension` variable to the color of the jittered points and also facet by `gender_string`. Hint: You will need to add aesthetic mappings directly to the `geom_jitter()` function.

## Path Analysis

The package that we will be using to fit structural equation models is called `lavaan`. `lavaan` can be used to run path analyses (PA), confirmatory factor analyses (CFA), and the combination of the two, which is a SEM. It stands for Latent Variable Analysis. It is based a three step system where you first specify a model as a character string, second you pass this model to a function (along with the data) that finds parameter estimates, and finally you inspect the output. There are `lavaan` tutorials online here.

SEM lavaan has three steps:
* Specify the model.
* Estimate that model.
* Examine the output.

`lavaan` relational symbols used to specify a model:
* x ~ y + z — regression paths from y and z to x
* x ~~ y — covariance between x and y * x ~~ x — variance of x

We can actually use `lavaan` to fit the `reg_mod` model we ran above. After installing `lavaan` and loading it, use the following code to specify, fit, and examine your model output. Note that we need to be more explicit with `lavaan` and ask for the intercept.

The `sem()` function is doing the work here, estimating the model parameters.

``````model <- 'satisfaction ~ 1 + tension + other_pos + gender'

fit <- sem(model, data = acitelli)

summary(fit)``````
1. Compare the path estimates obtained from `lavaan` to the regression coefficient estimates you got using the `lm()` function for `reg_mod`. Are there any similarities?

2. How many known observations do we have in this model? How many unknowns (parameters we are trying to estimate)? What are the model degrees of freedom and can you find this quantity in the `lavaan` summary?

For fun, we can also use the `semPaths()` function from the `semPlot` package to make a diagram in R. Although this is not always the easiest solution. If you would like, install this package now. But running this code below is not required to answer the questions in this lab.

``````#install.packages("semPlot")
library(semPlot)``````

If we give `semPaths()` out fitted `lavaan` object, it will make a diagram for us. This is what it looks like without any added options.

``semPaths(fit)``

Those triangles are the intercepts (of the endogenous variable) and means (of the exogenous variables). The `semPaths()` function has many options. For example, run the following code and the triangles disappear, among other things.

``semPaths(fit, what = "est", intercepts = FALSE, style = "lisrel", rotation = 2)``

Check out the `semPath()` documentation but using the `?`. This function is kind of a LOT. Honestly, I make my SEM figures in PowerPoint, Word, or in AMOS (another SEM software package connected to SPSS). But there are other options discussed in your textbook.

``?semPaths``

### Modeling the Nonindependence

In lab 1 you found that these participants were in romantic relationships with each other. This violates the independence assumption for multiple regression. We now return to this issues and model the nonindependence using Path Analysis.

The first thing we need to do is change the observational unit from individuals to couples (i.e., dyads). While the individual data looks like this:

``````## # A tibble: 6 x 9
##   cuplid Yearsmar gender self_pos other_pos satisfaction tension simhob
##    <dbl>    <dbl>  <dbl>    <dbl>     <dbl>        <dbl>   <dbl>  <dbl>
## 1      3     8.20     -1      4.8       4.6         4        1.5      0
## 2      3     8.20      1      3.8       4           3.67     2.5      1
## 3     10    10.5      -1      4.6       3.8         3.17     4        0
## 4     10    10.5       1      4.2       4           3.67     2        0
## 5     11    -8.30     -1      5         4.4         3.83     2.5      0
## 6     11    -8.30      1      4.2       4.8         3.83     2.5      0
## # … with 1 more variable: gender_string <chr>``````

We want to create a data set called `acitelli_dyad` that looks like this:

``````## # A tibble: 6 x 12
##   cuplid Yearsmar other_pos_man other_pos_woman satisfaction_man
##    <dbl>    <dbl>         <dbl>           <dbl>            <dbl>
## 1      3     8.20           4               4.6             3.67
## 2     10    10.5            4               3.8             3.67
## 3     11    -8.30           4.8             4.4             3.83
## 4     17    -6.38           4.4             3.6             3.83
## 5     21    10.2            4.8             3.8             3.5
## 6     22    15.0            4.6             5               4
## # … with 7 more variables: satisfaction_woman <dbl>, self_pos_man <dbl>,
## #   self_pos_woman <dbl>, simhob_man <dbl>, simhob_woman <dbl>,
## #   tension_man <dbl>, tension_woman <dbl>``````

To make this happen, we will use the functions from the `tidyr` package, including `gather()`, `unite()`, and `spread()`.

``````acitelli_dyad <- acitelli %>%
gather(self_pos:simhob, key = "the_old_vars", value = "score") %>%
select(-gender) %>%
unite(the_new_vars, the_old_vars, gender_string, sep = "_", remove = TRUE) %>%

Try exploring what is happening at each step in this pipeline by rerunning it at several different stages and viewing the result.

1. Describe what the `unite()` function is doing. Why do you think we need this step, and and how does this relate to us dropping the gender variables in the `select()` function?

Now, we might want to fit the same multiple regression model the we fit above, but we can include both `satisfaction_woman` and `satisfaction_man` as endogenous variables. Recall that an endogenous variable is a variable in the model that is being directionally pointed at by other variables in the model. An endogenous variable can also point at other endogenous variables. On the other hand, exogenous variables only do the directional pointing.

1. If both `satisfaction_woman` and `satisfaction_man` are endogenous variables (think response variables), what do you think will be the explanatory variables for each response?

2. What parameter could be added to the model to indicate that the residuals of the two people in the couple might be related?

If we wanted to create a simple linear regression version of this model with only `other_pos` as a predictor, it might look like this.

``````model_slr <- 'satisfaction_woman  ~ 1 + other_pos_woman
satisfaction_man  ~ 1 + other_pos_man'

fit_slr <- sem(model_slr, data = acitelli_dyad)

summary(fit_slr)``````