Despite women’s advances in the labor market, women on average still make less money than men, and women of color in particular are paid less for equal work relative to their white counterparts. The gap in pay between women and men is referred to as the gender pay gap. This lab investigates the factors that might lead people of all genders to be willing to participate in collective action aimed at decreasing this pay gap. These factors include people’s gender identity (i.e., how satisfied people feel with their gender group and how central this identity is to their self-concept), social dominance orientation (SDO; i.e., the extent to which people think some groups should have more social value than other groups), and their level of hostile sexism. Each of these constructs is measured with multiple items in the data set you will be using.
In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package. The data are stored in an SPSS file, so we will use the haven package to load them. Lastly, we will use the psych and lavaan packages to conduct factor analysis and SEM.
Let’s load the packages.
library(tidyverse)
library(GPArotation)
library(psych)
library(haven)
library(lavaan)
Please write your lab code and answers in an R Markdown file. Submit the knitted HTML file to Moodle when your lab is complete.
In this lab we will analyze the data from a study investigating what factors lead people to engage in collective action behaviors specifically surrounding gender pay equality.
The data were gathered from 272 respondents living in the United States. There are men (gender = 1) and women (gender = 0) in the data set. Collective action was measured with 5 items, collact1 to collact5, and the average of these items has already been computed; it is called collact. Hostile sexism is measured with 11 items (hs1 to hs11) and SDO with 4 items (sdo1 to sdo4). Gender identity is measured with 14 items, id1 to id14. As with collective action, scale scores for hostile sexism and SDO have also been created. The gender identity scale is actually composed of 5 subscales (idsat, idsolid, idcent, idih, idiss), which can be further averaged into 2 second-order subscales (idselfinv, idselfdef), and finally averaged into one overall scale score of all 14 items combined (identification). The data set also contains a self-efficacy scale that you can play with if you have time.
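The scale scores mentioned above are already in the data set, but if you ever need to build one yourself, recall that a scale score is simply the mean of its items, which rowMeans() computes. A minimal sketch on a toy data frame (the values below are made up purely for illustration):

```r
# Toy stand-in for the five collective action items (made-up values)
toy <- data.frame(
  collact1 = c(1, 4), collact2 = c(2, 5), collact3 = c(3, 3),
  collact4 = c(2, 4), collact5 = c(2, 4)
)

# Scale score = mean of the items, ignoring any missing responses
toy$collact <- rowMeans(toy[, paste0("collact", 1:5)], na.rm = TRUE)
toy$collact  # 2 4
```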
Let’s load the data:
sdo <- read_sav("http://www.science.smith.edu/~rgarcia/sds390-S19/sdo_identity.sav")
# Build an SPSS-style "variable view": one row per variable with its label.
# (haven stores SPSS variable labels as a "label" attribute on each column.)
variable_view <- tibble(
  variable = names(sdo),
  label = map_chr(sdo, ~ attr(.x, "label") %||% NA_character_)
)
We have observations on 52 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the variable_view data frame:
View(variable_view)
We can first take a look at all of the variables with glimpse().
glimpse(sdo)
We might also make some univariate histograms and scatterplots of the relationships between pairs of study variables.
We might want to look at the items individually, but also the scale scores.
qplot(x = sdo3, data = sdo, bins = 5)
qplot(x = sdo, data = sdo, bins = 5)
The sdo scale score appears to be skewed to the right. That’s not great for the purposes of using sdo as an endogenous variable in a path analysis where we estimate parameters with maximum likelihood estimation. But we can keep it exogenous, no problem.
Describe the shape of the collective action items as well as the collact scale score.
Create a data visualization that shows the relationship between two variables from this list: sdo, collact, hsexism, and identification. Next, create a second visualization that allows you to see if this relationship is different by gender.
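If you are unsure how to condition a plot on gender, one option in ggplot2 is faceting. Below is a sketch using one possible pair of variables (your own choice of variables may well differ; gender is wrapped in factor() so the panels are labeled discretely):

```r
# Scatterplot of hostile sexism vs. collective action, split into panels
# by gender (0 = women, 1 = men); this variable pair is just one option
ggplot(sdo, aes(x = hsexism, y = collact)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ factor(gender))
```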
Recall that for path analysis, each construct is a measured variable, that is, we’re only using the scale scores (means of items) in our model. Each endogenous variable has an error or “disturbance.”
Fit the path model depicted above using lavaan.
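If lavaan model syntax is new to you, a path model is specified as a string of regression formulas, one line per endogenous variable; lavaan adds a disturbance for each endogenous variable automatically. The sketch below is generic (x, m, and y are placeholders, not the variables in the figure):

```r
# Generic lavaan path model: x -> m -> y, plus a direct path x -> y
generic_model <- '
  m ~ x          # m is endogenous, so it gets a disturbance
  y ~ m + x      # y regressed on both m and x
'
# Fit it with sem(), substituting your own data frame:
# generic_fit <- sem(generic_model, data = your_data)
# summary(generic_fit, standardized = TRUE, fit.measures = TRUE)
```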
How many degrees of freedom does this model have? How many knowns and unknowns does this model have? Please list the different unknown parameters instead of simply giving the final number. Which pair of variables is the model specifying as unrelated? Is this reasonable based on what you know about these constructs?
Interpret the path estimates from this model. Are they in the direction you would expect?
We have not yet formally learned about model fit, but we have the idea that a \(\chi^2\)-test could be used. In your output, find where it gives the Model Fit Test Statistic. This is your \(\chi^2\) value, and you can see the corresponding p-value for that statistic. The null hypothesis here is that our model-implied variance-covariance matrix is not different from our sample variance-covariance matrix, so if we reject this null then we have evidence that our model does not fit the data. So we would like to fail to reject this null!
For the model above, the chi-square is NOT statistically significant, so we can claim to have a good-fitting model. Said another way, the model-implied covariance matrix is not significantly different from our observed variance-covariance matrix. We also have that our \(CFI\) statistic is greater than 0.95 and our \(RMSEA\) statistic is less than 0.08, both of which also indicate a good-fitting model.
In a results section you might write:
The model was a good fit to the data, \(\chi^2 (1)=2.682\), \(p = .101\), \(CFI = .964\), \(RMSEA = .079\).
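If you want just those fit indices without scanning the full summary() output, lavaan’s fitMeasures() function extracts them by name. Here the fitted object is assumed to be called path_fit; substitute whatever name you used in the exercise above:

```r
# Pull only the fit statistics reported in the results sentence above
fitMeasures(path_fit, c("chisq", "df", "pvalue", "cfi", "rmsea"))
```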
What if we want to remove measurement error from these constructs? To do this we would need sdo, identification, hostile sexism, and collective action to all be latent variables. Let us now build a full structural equation model (SEM) that incorporates all of the constructs’ items. The first step is to get the measurement model to fit well, so we will first run a confirmatory factor analysis (CFA) model. In this context the CFA model is referred to as the “Measurement Model”.
There are 14 items on the identification factor, 4 items on SDO, 11 items on hostile sexism and 5 items on collective action.
We can start exploring these items by making correlation heat maps, for example using the cor.plot() function from the psych package. Doing this we see that the collective action items are mostly all related to each other, but something is up with the 5th item.
matrix <- corr.test(select(sdo, collact1:collact5))$r
cor.plot(matrix)
We can see from the univariate histograms you made in Exercise 1 that the 5th collective action item is severely skewed to the left compared to the other collective action items. We should keep this in mind as a possible threat to multivariate normality.
If we do this for the identification items, we might suspect four or five different subscales.
To further investigate the identification items, we can run an exploratory factor analysis (EFA) extracting five factors and allowing for an oblique rotation.
cor_matrix <- corr.test(select(sdo, id1:id14))$r
fa <- fa(cor_matrix, nfactors = 5, rotate = "oblimin", fm = "pa")
round(fa$loadings, 2)
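The raw loading matrix can be hard to read. The psych print method for fa objects can sort items by their dominant factor and hide small loadings, which makes the subscale structure (and any poorly behaving item) easier to spot:

```r
# Sort items by factor and suppress loadings below .30 for readability
print(fa, cut = 0.30, sort = TRUE)
```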
Next we can try to specify the Measurement Model using CFA and lavaan. This model will have four latent variables: 1) identification, 2) hostile sexism, 3) collective action, and 4) social dominance orientation.
meas_model <- 'identity =~ id1 + id2 + id3 + id4 + id5 + id6 + id7 +
id8 + id9 + id10 + id11 + id12 + id13 + id14
hsexism =~ hs1 + hs2 + hs3 + hs4 + hs5 + hs6 + hs7 +
hs8 + hs9 + hs10 + hs11
collact =~ collact1 + collact2 + collact3 + collact4 + collact5
sdo =~ sdo1 + sdo2 + sdo3 + sdo4'
cfa_fit <- cfa(meas_model, data = sdo)
summary(cfa_fit, standardized = TRUE, fit.measures = TRUE)
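If you want to see only the standardized factor loadings rather than the full output, you can filter lavaan’s standardizedSolution() data frame down to the measurement (“=~”) rows:

```r
# Keep just the loading rows: latent variable (lhs), item (rhs),
# and the standardized loading (est.std)
standardizedSolution(cfa_fit) %>%
  filter(op == "=~") %>%
  select(lhs, rhs, est.std)
```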
We have PLENTY of degrees of freedom. It is generally difficult for a CFA to be underidentified unless you’re trying to specify factors with only 2 or even 1 indicator, or you’re adding a bunch of correlated errors. If you have big, healthy factors, you’ll be fine.
Based on the \(\chi^2\)-test, the \(CFI\), and the \(RMSEA\), is this model a good fit to the data?
What is the standardized factor loading from the item you identified as “bad” in the EFA? Going back to the correlation matrix heat map, what does this item look like there?
We must re-specify the measurement model before moving on to the structural paths. There’s NO hope for the structural equation model if the measurement model is not a good fit.
Try specifying a measurement model with the five identification subscales kept separate, as the EFA suggested. Remove item 8 entirely. Call this model meas_model_re1. Once you fit this model, examine the output. Has the fit improved? Give your reasoning.
Next, specify another measurement model called meas_model_re2 where you remove all of the items that have standardized loadings less than 0.6 in absolute value. Has the fit improved? Give your reasoning.
This meas_model_re2 model is your final measurement model, given that it has an adequate fit to the data. Note that these five identification subscales correspond to the components of group identification in Leach et al., 2008.
Your final model should look like this:
final_meas_model <- 'identity1 =~ id1 + id2 + id3
identity2 =~ id4 + id5 + id6 + id7
identity3 =~ id9 + id10
identity4 =~ id11 + id12
identity5 =~ id13 + id14
hsexism =~ hs2 + hs3 + hs7 + hs8 + hs9
collact =~ collact1 + collact2 + collact3 + collact4
sdo =~ sdo1 + sdo2 + sdo3 + sdo4'
Finally, we want to specify the structural part of the model, just as in the path analysis model you fit above. Note that you can add comments inside of your lavaan model specification code with the # operator.
Fit the model below using the sem() function. Now that we have five different identity factors, specify each one as causing collact, and sdo as a cause of identity2 only. Is this model a good fit to the data?
sem_model <- '#Measurement model
identity1 =~ id1 + id2 + id3
identity2 =~ id4 + id5 + id6 + id7
identity3 =~ id9 + id10
identity4 =~ id11 + id12
identity5 =~ id13 + id14
hsexism =~ hs2 + hs3 + hs7 + hs8 + hs9
collact =~ collact1 + collact2 + collact3 + collact4
sdo =~ sdo1 + sdo2 + sdo3 + sdo4
#Structural model
collact ~
hsexism ~ '
sem_fit <- sem(sem_model, data = sdo)
summary(sem_fit, standardized = TRUE, fit.measures = TRUE)
Which paths are statistically significant? Give interpretations of those significant paths.
Re-specify the model by trimming all of the non-significant paths. Call this model sem_model_re. Is sem_model_re a better fit to the data than sem_model?
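One way to answer this is a formal chi-square difference (likelihood ratio) test between the two nested models, which lavaan provides through anova(). The object names below assume you stored the trimmed model’s fit as sem_fit_re; rename to match your own code:

```r
# Fit the trimmed model, then compare it to the full model
sem_fit_re <- sem(sem_model_re, data = sdo)
anova(sem_fit_re, sem_fit)
```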
Starting with sem_model, try to find a re-specified model that meets the criteria for a “good fitting” model.
This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was written by Mine Çetinkaya-Rundel and Andrew Bray.