Despite women’s advances in the labor market, women on average still make less money than men, and women of color in particular are paid less for equal work relative to their white counterparts. The gap in pay between women and men is referred to as the gender pay gap. This lab investigates the factors that might lead people of all genders to be willing to participate in collective action aimed at decreasing this pay gap. These factors include people’s gender identity (i.e., how satisfied people feel with their gender group and how central this identity is to their self-concept), social dominance orientation (SDO; i.e., the extent to which people think some groups should have more social value than other groups), and their level of hostile sexism. Each of these constructs is measured with multiple items in the data set you will be using.
In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package. The data are stored in an SPSS file, so we will use the haven package to load them. Lastly, we will use the psych and lavaan packages to conduct factor analysis and SEM.
Let’s load the packages.
library(tidyverse)
library(GPArotation)
library(psych)
library(haven)
library(lavaan)
Please write your lab code and answers in an R Markdown file. Submit the knitted HTML file to Moodle when your lab is complete.
In this lab we will analyze the data from a study investigating what factors lead people to engage in collective action behaviors specifically surrounding gender pay equality.
The data were gathered from 272 respondents living in the United States. There are men (gender = 1) and women (gender = 0) in the data set. Collective action was measured with 5 items, collact1 to collact5, and the average of these items has already been computed; it is called collact. Hostile sexism is measured with 11 items (hs1 to hs11) and SDO with 4 items (sdo1 to sdo4). Gender identity is measured with 14 items, id1 to id14. As with collective action, scale scores for hostile sexism and SDO have also been created. The gender identity scale is actually composed of 5 subscales (idsat, idsolid, idcent, idih, idiss), which can be further averaged into 2 second-order subscales (idselfinv, idselfdef), and finally averaged into one overall scale score of all 14 items combined (identification). The data set also contains a self-efficacy scale that you can play with if you have time.
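The scale scores mentioned above are already in the data set, but if you ever need to build one yourself, recall that a scale score is simply the mean of its items, which rowMeans() computes. A minimal sketch on a toy data frame (the values below are made up purely for illustration):

```r
# Toy stand-in for the five collective action items (made-up values)
toy <- data.frame(
  collact1 = c(1, 4), collact2 = c(2, 5), collact3 = c(3, 3),
  collact4 = c(2, 4), collact5 = c(2, 4)
)

# Scale score = mean of the items, ignoring any missing responses
toy$collact <- rowMeans(toy[, paste0("collact", 1:5)], na.rm = TRUE)
toy$collact  # 2 4
```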
Let’s load the data:
sdo <- read_sav("http://www.science.smith.edu/~rgarcia/sds390-S19/sdo_identity.sav")
# Build an SPSS-style "variable view": one row per variable with its label.
# (haven stores SPSS variable labels as a "label" attribute on each column.)
variable_view <- tibble(
  variable = names(sdo),
  label = map_chr(sdo, ~ attr(.x, "label") %||% NA_character_)
)
We have observations on 52 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the variable_view data frame:
View(variable_view)
We can first take a look at all of the variables with glimpse().
glimpse(sdo)
We might also make some univariate histograms and scatterplots of the relationships between pairs of study variables.
We might want to look at the items individually, but also the scale scores.
qplot(x = sdo3, data = sdo, bins = 5)
qplot(x = sdo, data = sdo, bins = 5)
The sdo scale score appears to be skewed to the right. That’s not great for the purposes of using sdo as an endogenous variable in a path analysis where we estimate parameters with maximum likelihood estimation. But we can keep it exogenous, no problem.
Describe the shape of the collective action items as well as the collact scale score.
Create a data visualization that shows the relationship between two variables from this list: sdo, collact, hsexism, and identification. Next, create a second visualization that allows you to see if this relationship is different by gender.
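If you are unsure how to condition a plot on gender, one option in ggplot2 is faceting. Below is a sketch using one possible pair of variables (your own choice of variables may well differ; gender is wrapped in factor() so the panels are labeled discretely):

```r
# Scatterplot of hostile sexism vs. collective action, split into panels
# by gender (0 = women, 1 = men); this variable pair is just one option
ggplot(sdo, aes(x = hsexism, y = collact)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ factor(gender))
```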
Recall that for path analysis, each construct is a measured variable, that is, we’re only using the scale scores (means of items) in our model. Each endogenous variable has an error or “disturbance.”
Fit the path model depicted above using lavaan.
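If lavaan model syntax is new to you, a path model is specified as a string of regression formulas, one line per endogenous variable; lavaan adds a disturbance for each endogenous variable automatically. The sketch below is generic (x, m, and y are placeholders, not the variables in the figure):

```r
# Generic lavaan path model: x -> m -> y, plus a direct path x -> y
generic_model <- '
  m ~ x          # m is endogenous, so it gets a disturbance
  y ~ m + x      # y regressed on both m and x
'
# Fit it with sem(), substituting your own data frame:
# generic_fit <- sem(generic_model, data = your_data)
# summary(generic_fit, standardized = TRUE, fit.measures = TRUE)
```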
How many degrees of freedom does this model have? How many knowns and unknowns does this model have? Please list the different unknown parameters instead of simply giving the final number. Which pair of variables is the model specifying as unrelated? Is this reasonable based on what you know about these constructs?
Interpret the path estimates from this model. Are they in the direction you would expect?
We have not yet formally learned about model fit, but we have the idea that a \(\chi^2\)-test could be used. In your output, find where it gives the Model Fit Test Statistic. This is your \(\chi^2\) value, and you can see the corresponding p-value for that statistic. The null hypothesis here is that our model-implied variance-covariance matrix is not different from our sample variance-covariance matrix, so if we reject this null then we have evidence that our model does not fit the data. So we would like to fail to reject this null!
For the model above, the chi-square is NOT statistically significant, so we can claim to have a good-fitting model. Said another way, the model-implied covariance matrix is not significantly different from our observed variance-covariance matrix. We also have that our \(CFI\) statistic is greater than 0.95 and our \(RMSEA\) statistic is less than 0.08, both of which also indicate a good-fitting model.
In a results section you might write:
The model was a good fit to the data, \(\chi^2 (1)=2.682\), \(p = .101\), \(CFI = .964\), \(RMSEA = .079\).
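If you want just those fit indices without scanning the full summary() output, lavaan’s fitMeasures() function extracts them by name. Here the fitted object is assumed to be called path_fit; substitute whatever name you used in the exercise above:

```r
# Pull only the fit statistics reported in the results sentence above
fitMeasures(path_fit, c("chisq", "df", "pvalue", "cfi", "rmsea"))
```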
What if we want to remove measurement error from these constructs? To do this we would need sdo, identification, hostile sexism, and collective action to all be latent variables. Let us now build a full structural equation model (SEM) that incorporates all of the constructs’ items. The first step is to get the measurement model to fit well, so we will first run a confirmatory factor analysis (CFA) model. In this context the CFA model is referred to as the “Measurement Model”.
There are 14 items on the identification factor, 4 items on SDO, 11 items on hostile sexism and 5 items on collective action.
We can start exploring these items by making correlation heat maps, for example using the cor.plot() function from the psych package. Doing this we see that the collective action items are mostly all related to each other, but something is up with the 5th item.
matrix <- corr.test(select(sdo, collact1:collact5))$r
cor.plot(matrix)
We can see from the univariate histograms you made in Exercise 1 that the 5th collective action item is severely skewed to the left compared to the other collective action items. We should keep this in mind as a possible threat to multivariate normality.
If we do this for the identification items, we might suspect four or five different subscales.
To further investigate the identification items, we can run an exploratory factor analysis (EFA) extracting five factors and allowing for an oblique rotation.
cor_matrix <- corr.test(select(sdo, id1:id14))$r
fa <- fa(cor_matrix, nfactors = 5, rotate = "oblimin", fm = "pa")
round(fa$loadings, 2)
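The raw loading matrix can be hard to read. The psych print method for fa objects can sort items by their dominant factor and hide small loadings, which makes the subscale structure (and any poorly behaving item) easier to spot:

```r
# Sort items by factor and suppress loadings below .30 for readability
print(fa, cut = 0.30, sort = TRUE)
```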
Next we can try to specify the Measurement Model using CFA and lavaan. This model will have four latent variables: 1) identification, 2) hostile sexism, 3) collective action, and 4) social dominance orientation.
meas_model <- 'identity =~ id1 + id2 + id3 + id4 + id5 + id6 + id7 +
id8 + id9 + id10 + id11 + id12 + id13 + id14
hsexism =~ hs1 + hs2 + hs3 + hs4 + hs5 + hs6 + hs7 +
hs8 + hs9 + hs10 + hs11
collact =~ collact1 + collact2 + collact3 + collact4 + collact5
sdo =~ sdo1 + sdo2 + sdo3 + sdo4'
cfa_fit <- cfa(meas_model, data = sdo)
summary(cfa_fit, standardized = TRUE, fit.measures = TRUE)
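If you want to see only the standardized factor loadings rather than the full output, you can filter lavaan’s standardizedSolution() data frame down to the measurement (“=~”) rows:

```r
# Keep just the loading rows: latent variable (lhs), item (rhs),
# and the standardized loading (est.std)
standardizedSolution(cfa_fit) %>%
  filter(op == "=~") %>%
  select(lhs, rhs, est.std)
```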
We have PLENTY of degrees of freedom. It is generally difficult for a CFA to be underidentified unless you’re trying to specify factors with only 2 or even 1 indicator, or you’re adding a bunch of correlated errors. If you have big, healthy factors, you’ll be fine.
Based on the \(\chi^2\)-test, the \(CFI\), and the \(RMSEA\), is this model a good fit to the data?
What is the standardized factor loading from the item you identified as “bad” in the EFA? Going back to the correlation matrix heat map, what does this item look like there?
We must re-specify the measurement model before moving on to the structural paths. There’s NO hope for the structural equation model if the measurement model is not a good fit.
Try specifying a measurement model with the five identification subscales kept separate, as the EFA suggested. Remove item 8 entirely. Call this model meas_model_re1. Once you fit this model, examine the output. Has the fit improved? Give your reasoning.
Next, specify another measurement model called meas_model_re2 where you remove all of the items that have standardized loadings less than 0.6 in absolute value. Has the fit improved? Give your reasoning.
This meas_model_re2 model is your final measurement model, given that it has an adequate fit to the data. Note that these five identification subscales correspond to the components of group identification in Leach et al., 2008.
Your final model should look like this:
final_meas_model <- 'identity1 =~ id1 + id2 + id3
identity2 =~ id4 + id5 + id6 + id7
identity3 =~ id9 + id10
identity4 =~ id11 + id12
identity5 =~ id13 + id14
hsexism =~ hs2 + hs3 + hs7 + hs8 + hs9
collact =~ collact1 + collact2 + collact3 + collact4
sdo =~ sdo1 + sdo2 + sdo3 + sdo4'
Finally, we want to specify the structural part of the model, just as in the path analysis model you fit above. Note that you can add comments inside of your lavaan model specification code with the # operator.
Fit the model below using the sem() function. Now that we have five different identity factors, specify each one as causing collact, and sdo as a cause of identity2 only. Is this model a good fit to the data?
sem_model <- '#Measurement model
identity1 =~ id1 + id2 + id3
identity2 =~ id4 + id5 + id6 + id7
identity3 =~ id9 + id10
identity4 =~ id11 + id12
identity5 =~ id13 + id14
hsexism =~ hs2 + hs3 + hs7 + hs8 + hs9
collact =~ collact1 + collact2 + collact3 + collact4
sdo =~ sdo1 + sdo2 + sdo3 + sdo4
#Structural model
collact ~
hsexism ~ '
sem_fit <- sem(sem_model, data = sdo)
summary(sem_fit, standardized = TRUE, fit.measures = TRUE)
Which paths are statistically significant? Give interpretations of those significant paths.
Re-specify the model by trimming all of the non-significant paths. Call this model sem_model_re. Is sem_model_re a better fit to the data than sem_model?
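One way to answer this is a formal chi-square difference (likelihood ratio) test between the two nested models, which lavaan provides through anova(). The object names below assume you stored the trimmed model’s fit as sem_fit_re; rename to match your own code:

```r
# Fit the trimmed model, then compare it to the full model
sem_fit_re <- sem(sem_model_re, data = sdo)
anova(sem_fit_re, sem_fit)
```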
Starting with sem_model, try to find a re-specified model that meets the criteria for a “good fitting” model.
This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was written by Mine Çetinkaya-Rundel and Andrew Bray.