Multiple Regression

Romantic Relationships in Detroit

In this lab you will build multiple regression models, explore interactions between a pair of continuous predictors, and you will be introduced to the SEM package lavaan. The lavaan package can be used to fit regression models as well as systems of structural regression models.

The data you will be using for this consists of people currently involved in romantic relationships living in the city of Detroit. You can read more about the study that goes with this data here. We will be particularly interested in understanding predictors of relationship satisfaction including gender, how much tension they perceive in their relationships, and how positively they see their partners.

Getting Started

Load packages

In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package for data visualization. The data can be found at the URL given below. Later we will also use the lavaan package, so please install it now.

Let’s load the packages.

library(dplyr)
library(ggplot2)
#install.packages("lavaan")
library(lavaan)

Creating a reproducible lab report

Please write your lab code and answers in a R Markdown file. You are to submit the knitted HTML file to Moodle when your lab is complete.

The data

The data set for this lab consists of 296 persons involved heterosexual dating relationships from the city of Detroit.

Let’s load the data:

acitelli <- read.csv("http://www.science.smith.edu/~rgarcia/sds390-S19/acitelli-individual.csv")

Take a look at the data. You can use glimpse(), head(), View(), however you like to look at your data frames. We have observations on 8 different variables, some are categorical and some numerical.

The gender variable is coded -1(= Women) and 1 (= Men). You will change this next. The self_pos is the mean of positive self variables and other_pos is the mean of positive partner variables, where both variables the average of Cooperative, Mature, Friendly, Hard Working and Care about Others. Our main response variable, satisfaction
is the mean of four variables (theoretical range from 1 to 4). The variable tension is the mean of two: perceived disagreement and perceived tension (theoretical range from 1 to 4). Lastly, simhob is the perception of having similar hobbies to one’s partner (1 = very true; 0 = somewhat true; -1 = not at all true). We won’t be using this variable today, but you can play with it if you have time.

Exploring and preparing the data

The gender variable is currently coded as -1 for women and 1 for men, create two new indicator variables, one called woman, and another man.
How many observations are in each level of cuplid? What are the counts of women and men in the data? Based on these two summaries, do you think the cuplid variable means?
Select two variables in the data set and visualize their relationship with each other. Also, calculate an appropriate summary statistic for this relationship. Describe the relationship.

Simple linear regression

The most important research question we ask today is does perceiving tension in one’s relationship lead to lower relationship satisfaction. Let’s create a scatterplot to see if this appears to be the case:

ggplot(data = acitelli, aes(x = tension, y = satisfaction)) +
  geom_point()

Before we draw conclusions about the trend, compare the number of observations in the data frame with the approximate number of points on the scatterplot. Is anything awry?

Replot the scatterplot, but this time using a geom that corrects the problem or another visualization solution. Comment on your plot. Hint: You can jitter the points, create transparency, or check out geom_beeswarm() from the ggbeeswarm package for funzies!
Let’s see if the apparent trend in the plot is something more than natural variation. Fit a linear model called mod_t to predict satisfaction score by tension. Interpret the slope and intercept of this model. Is tension a statistically significant predictor of satisfaction? Does it appear to be a practically significant predictor? Also, add a line of best fit to your visualization.
Use residual plots, plot(mod_t), to evaluate whether the conditions of least squares regression are reasonable. Comment on each of the regression assumptions—remember LINE: Linearity, Independence, Normality, and Error homogeneity.

Multiple linear regression

If you suspected above that these participants might be in romantic relationships with each other, you’re correct! This indeed violates the independence assumption. For this lab we will be ignoring this fact, but we will return to it later in the course for path analysis.

The data set contains several relationship variables including the participants’ feelings about themselves and their partner. Let’s take a look at the relationship between one of these positivity variables, other_pos, and tension. Before running the code below:

What do you expect the direction of this relationship between other_pos and tension to be?

Now run the following:

ggplot(data = acitelli, aes(x = other_pos, y = tension)) +
  geom_jitter()

acitelli %>% 
  summarise(cor(other_pos, tension))

As expected the relationship is negative. In order to see if perceived tension is still a significant predictor of satisfaction after we’ve accounted for perceptions of other positivity, we can add the other_pos term into the model.

Fit a multiple regression model called mod_t_op with both tension and other positivity as predictors of satisfaction. Interpret the intercept and slope coefficients. Are they statistically significant predictors? Has the addition of other_pos to the model changed the parameter estimate for tension?
What is the R² value for this model? Give its interpretation.

Now let’s add gender into the model that already contains tension and other_pos. First fit a model using the woman indicator you created earlier. Call it mod_wo. Then, fit a model with the original gender variable. Call it mod_gen.

Interpret the coefficients for woman in mod_wo and gender in mod_gen. What do the intercepts mean in these two models? Are the coefficients for tension and other_pos different across these two models?

The interpretation of the coefficients in multiple regression is slightly different from that of simple regression. The estimate for tension reflects how much more satisfied a person is expected be if they perceive tension in their relationship one point higher while holding all other variables constant. In this case, that translates into considering only people of the same gender, who feel the same amount of positivity towards their partner, with tension scores that are one point apart.

Interactions

Now we can ask if positive perceptions of your partner difuse the negative effect of tension on relationship satisfaction. That is, we can ask, is the effect of tension on satisfaction weakened (or zero) when other positivity is high? Questions like this are Answered with interaction terms. .

Add the interaction of tension and other_pos to the model and leave gender out of this model. Is there a statistically significant interaction of tension and other_pos? Can you give an interpretation? Hint: see below.

To aid in interpretation of this interaction we can estimate the effect of tension on satisfaction when other_pos is low (=1) and when other_pos is high (=5). We will need to create two new variables, other_pos_low and other_pos_high to do this. For the low version, everyone who’s 1 will be made 0, and for the high version, everyone who is 5 will be made 0.

acitelli <- acitelli %>%
  mutate(other_pos_low = other_pos - 1,
         other_pos_high = other_pos - 5)

Then we simply re-run the model two more times and look at the main effect of tension in each model.

mod_low <- lm(satisfaction ~ tension*other_pos_low, data = acitelli)

summary(mod_low)

mod_high <- lm(satisfaction ~ tension*other_pos_high, data = acitelli)

summary(mod_high)

Interpretation: When other positivity is low, we would expect a strong negative effect of tension on satisfaction, but when other positivity is high, we would still expect a statistically significant negative effect of tension on satisfaction, but it is far less strong.

Introduction to `lavaan`

The package that we will be using to fit structural equation models is called lavaan. lavaan can be used to run path analyses (PA), confirmatory factor analyses (CFA), and the combination of the two, which is a SEM. It stands for Latent Variable Analysis. It is based a three step system where you first specify a model as a character string, second you pass this model to a function (along with the data) that finds parameter estimates, and finally you inspect the output. There are lavaan tutorials online here.

SEM lavaan has three steps:
* Specify the model.
* Estimate that model.
* Examine the output.

lavaan relational symbols used to specify a model:
* x ~ y + z — regression paths from y and z to x
* x ~~ y — covariance between x and y * x ~~ x — variance of x

We can actually use lavaan to fit the mod_wo model we ran above.

library(lavaan)

After installing lavaan and loading it, use the following code to specify, fit, and examine your model output.

model <- 'satisfaction ~ tension + other_pos + woman'

fit <- sem(model, data = acitelli)

summary(fit)

Compare the path estimates obtained from lavaan to the regression coefficient estimates you got using the lm() function for mod_wo. Are there any differences?
In order to add interactions to your model in lavaan, you will need to first create the product term in your data set. For this last exercise, please add that term to your data set and fit the model you ran in exercise 11. Are the estimates similar across models?

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted from a lab written by Mine Çetinkaya-Rundel and Andrew Bray.