In this lab you will explore the technique of factor analysis to measure abstract constructs like personality features, aspirations for the future, and procrastination. You will again be using the `psych` package, but this time to conduct factor analysis. Then, you will use `lavaan` to fit confirmatory factor analysis models, also called measurement models.

The data you will be using for this lab was generated by **YOU**. As before, we will be constructing summaries and visualizing overall trends, not creating profiles of individuals.

In this lab we will load the data using the `readr` package, explore the data using the `dplyr` and `tidyr` packages, and visualize it using the `ggplot2` package. We’ll load all of these packages together with the `tidyverse` package. The data can be found at the URL given below. Later we will also use the `psych`, `lavaan`, and `GPArotation` packages.

Let’s load the packages.

```
library(tidyverse)
library(psych)
library(lavaan)
library(GPArotation)
```

Note that I recommend always loading the `psych` package after `tidyverse` because the two packages contain functions with the same name. Loading both packages *masks* the same-named functions in the first package loaded.
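If you ever need a function from the package whose version was masked, you can call it explicitly with the `::` operator. As a small sketch (`my_items` here is a placeholder data frame of scale items, not an object from this lab):

```r
# Both psych and ggplot2 (loaded via tidyverse) define a function
# called alpha(). With psych loaded last, a bare alpha() call means
# psych::alpha(). The :: operator removes any ambiguity:
psych::alpha(my_items)       # Cronbach's alpha from the psych package
ggplot2::alpha("red", 0.5)   # transparency helper from ggplot2
```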

Please write your lab code and answers in an R Markdown file. You are to submit the knitted HTML file to Moodle when your lab is complete.

The data set for this lab consists of responses from students in SDS 390. Refer to the survey you took here.

Let’s load the data using the `read_csv()` function from the `readr` package. The rest of the code in this chunk will clean the data: renaming variables, creating a data frame of variable labels called `variable_view`, and converting variable types to numeric.

```
personality <- read_csv("http://www.science.smith.edu/~rgarcia/sds390-S19/psychometrics.csv")
personality <- personality %>%
  slice(c(1, 4:n())) %>%   # keep the label row and the data rows
  select(ResponseId, age:loc29) %>%
  rename(r_exp = age,
         icecream = Q17,
         icecream_other = Q17_6_TEXT,
         movies = Q18,
         movies_other = Q18_8_TEXT,
         birth_order = Q19)
variable_view <- personality %>%
  slice(1) %>%
  gather(ResponseId:loc29, key = "variable", value = "label")
personality <- personality %>%
  slice(2:n()) %>%         # drop the label row, keep the data rows
  mutate_at(vars(maacl_1:loc29), as.numeric)
```

Take a look at the data. You can use `glimpse()`, `head()`, or `View()`, whichever way you like to look at your data frames. Recall that there is also a function called `names()` that gives you a relatively condensed look at all of your variables.

`names(personality)`

Last time we worked with this data set we looked at the procrastination items, `procrast_`, and the mood items, `maacl_`. This time you will work primarily with the career aspirations scale. First, familiarize yourself with the questions on the scale by looking at the `variable_view` data frame.

- Looking at `variable_view`, what questions were asked for `career_1` and for `career_8`? Reading through all ten questions, do you think these ten items could cluster into different subgroups? Why or why not? If yes, what clusters do you see?

Next we might start by looking at the correlations among the items. We can make a correlation matrix with the `corr.test()` function and can also visualize the correlations between the items with the `cor.plot()` function, both from the `psych` package.

First, let’s make a correlation matrix of the 10 `career_` items. The top matrix in the output is the correlation matrix and the bottom matrix contains the p-values for the tests of the null hypothesis \(H_0: \rho = 0\).

`corr.test(select(personality, career_1:career_10))`

It is sometimes handy to save the correlation matrix so that you can do other things with it, for example, make a heat map with `cor.plot()`.

- Save the output of `corr.test()` in an object called `cor_object`. Then, use the `names()` function to see the elements of the object you made. Which element do you think is the correlation matrix? What do you think `p` is? What do you think `ci` is?

We can look at specific elements of an object with the `$` operator. Just as with a data frame we can call out a specific variable with `$`, we can call out a specific element in any list-type object with `$`. To get the correlation matrix from our object `cor_object` we can refer to it like this:

`cor_object$r`

- Save the correlation matrix only in an object called `cor_matrix`. Also, save the matrix of standard errors of the correlation estimates in an object; call it whatever you want.
- Using `cor.plot()` from the `psych` package, create a heat map of the correlation matrix. Do the conceptual patterns among the items you discussed in exercise 1 have any correspondence to the patterns in the heat map? (It is OK to say that you do not think so; it is hard to tell.)
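As a minimal sketch of the heat map call (assuming you saved the correlation matrix as `cor_matrix` above):

```r
# Heat map of the saved correlation matrix; numbers = TRUE prints
# the correlation values in the cells.
cor.plot(cor_matrix, numbers = TRUE)
```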

It may be hard to see any sub-groupings among the items. This is exactly the task that **Exploratory Factor Analysis (EFA)** can help us with.

We will be using the `fa()` function in the `psych` package to do an EFA for the career aspiration items. The second argument to this function will be `nfactors =`, the number of factors we want to extract. Because this is an exploratory analysis, you may not have a number in mind. You *might* have a sense from the correlation matrix analyses you conducted above, but you might not.

One popular method for finding the number of factors, called the Kaiser method, involves counting the eigenvalues of the correlation matrix that are greater than 1. The number of eigenvalues greater than 1 corresponds to the number of components (linear combinations of items) that could explain more variance than one variable on its own. If an eigenvalue is less than 1, that component would not help us because it is worse than one variable just by itself.

We can calculate the eigenvalues (and corresponding eigenvectors) of *any* square matrix with the `eigen()` function. Like this:

`eigen(cor_matrix)`

- Just as above with the `cor_object`, save the output of the `eigen()` function as an object and extract the eigenvalues. How many eigenvalues of the correlation matrix are greater than 1?

I called my object `eigen_stuff`. I can use my eyeballs, but I can also use the `sum()` function to count up the eigenvalues greater than 1.

`sum(eigen_stuff$values > 1)`

Because there are 3, the Kaiser method would have us extract 3 factors. Also, the underlying factors of the career scale are likely correlated, so we want to ask for an **oblique rotation**. The oblique rotation options are “promax”, “oblimin”, “simplimax”, “bentlerQ”, and “geominQ”.

“Oblique” means that the factors are allowed to correlate, while “orthogonal” means they are constrained to be independent. We can ask for an **orthogonal rotation** with the options “none”, “varimax”, “quartimax”, “bentlerT”, and “geominT”.
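As a sketch of the difference, you could fit the same EFA twice, once with an oblique and once with an orthogonal rotation, and compare:

```r
# With an oblique rotation, fa() also estimates the factor correlation
# matrix (stored in the $Phi element); with varimax the factors are
# forced to be uncorrelated, so no Phi matrix is reported.
fa_oblique    <- fa(cor_matrix, nfactors = 3, rotate = "oblimin", fm = "pa")
fa_orthogonal <- fa(cor_matrix, nfactors = 3, rotate = "varimax", fm = "pa")
fa_oblique$Phi    # factor intercorrelations (oblique rotations only)
```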

We also want to ask for principal axis factoring, `fm = "pa"`. The other **estimation techniques** available are `fm = "minres"`, `fm = "wls"`, `fm = "gls"`, and `fm = "ml"`. Read the documentation of `fa()` to find out more.

```
fa_3 <- fa(cor_matrix, nfactors = 3, rotate = "oblimin", fm = "pa")
fa_3
```

- Which items appear to load on which factor? Are there any items that seem to load strongly on multiple factors (i.e., items that are cross-loaded)? Ignoring the items that have strong cross-loadings, looking back at the actual questions that were asked, can you give names to the three factors?

Because we were not able to achieve simple structure with three factors, we might want to choose a different number of factors. The Kaiser method gives you the maximum number of factors that is reasonable to extract, but we can consider going down to two.

Another popular method, besides the Kaiser method, for determining the number of factors to extract is to look at a dotplot of the eigenvalues. This plot is called a **scree plot**. The eigenvalues are mapped to the y-axis and the component number is mapped to the x-axis.

```
data.frame(x = 1:10, y = eigen_stuff$values) %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_line() +
  geom_hline(yintercept = 1, color = "red")
```

In the scree plot you are looking for an elbow-type pattern rather than a specific cutoff, although the horizontal line at 1 is still shown. So look for an elbow: a precipitous drop in the eigenvalues followed by a much smaller drop. Take the component number of the eigenvalue just before the much smaller drop as the number of factors to extract. This scree plot suggests extracting two factors.

- Run another EFA, this time extracting only two factors. Use an oblique rotation. Are you able to make sense of the factors in this model? Can you give them names? What is the correlation between the two factors? Give its interpretation.

Because we have some negatively loading items, we could reverse score them.

- Reverse code items 4, 7, and 10. Next, create another heat map of the correlation matrix using these reversed items, making sure to sort the items so items on the same factor are next to each other. Lastly, check to see if these two factors form reliable scales with the `alpha()` function.
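As a hedged sketch of reverse scoring (this assumes the career items run from 1 to 7; check `variable_view` to confirm the actual range, and note that the item grouping passed to `alpha()` below is only illustrative, not the answer):

```r
# Reverse code an item by subtracting it from max + min (8 on a 1-7 scale)
personality <- personality %>%
  mutate(career_4_r  = 8 - career_4,
         career_7_r  = 8 - career_7,
         career_10_r = 8 - career_10)

# Reliability of a hypothetical subscale built from a few of the items
alpha(select(personality, career_1, career_2, career_4_r))
```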

Perhaps, in sample 1, we did all of this EFA work and found two career aspiration constructs that were psychologically and statistically distinct from each other. A next step might be to confirm that these subscales exist in a new data set, sample 2. Further, we could test a model that asserts that the cross-loadings are all zero and see whether that model fits the sample variance-covariance matrix well. This would be a **Confirmatory Factor Analysis (CFA)**. In this lab we will fit the CFA, but we will leave checking model fit for a later class.

We will use `lavaan` to do CFA. To specify latent factors in `lavaan` you use the `=~` operator. For example, we could specify a 10-item procrastination factor with the following code.

```
cfa_model <- ' procrast =~ procrast_1 + procrast_2 + procrast_5 +
               procrast_7 + procrast_9 + procrast_10 +
               procrast_12 + procrast_15 + procrast_19 +
               procrast_20 '
```

We then fit the model with `cfa()` and ask to see the output with `summary()`.

```
cfa_fit <- cfa(cfa_model, data = personality)
summary(cfa_fit, standardized = TRUE)
```

We want to see that all of the standardized loadings are large, but not greater than 1. Let’s get rid of items 5, 10, and 20. Note that at the re-specification stage CFA becomes exploratory and much more like EFA. Over-fitting your model to your data is a danger.

```
cfa_mod_re <- ' procrast =~ procrast_1 + procrast_2 +
                procrast_7 + procrast_9 +
                procrast_12 + procrast_15 + procrast_19 '
cfa_fit_re <- cfa(cfa_mod_re, data = personality)
summary(cfa_fit_re, standardized = TRUE)
```

- Fit the two-factor CFA model for the career aspirations items based on your analyses above. This model is referred to as the **measurement model** in SEM. There is one standardized loading that is greater than 1 and one loading on the larger subscale that is quite small. Remove these items from the CFA and re-fit the model. What is the correlation between your two subscales? Give its interpretation. Does this match the factor correlation from the EFA above?
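A two-factor CFA in `lavaan` follows the same pattern as the one-factor model above. The item-to-factor assignments and factor names below are assumptions for illustration only; use the groupings from your own EFA:

```r
# Two latent factors; cfa() estimates their covariance by default.
cfa_two_mod <- ' factor1 =~ career_1 + career_2 + career_6
                 factor2 =~ career_3 + career_5 + career_9 '
cfa_two_fit <- cfa(cfa_two_mod, data = personality)
# The Std.all value on the factor1 ~~ factor2 row of the output
# is the estimated factor correlation.
summary(cfa_two_fit, standardized = TRUE)
```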

Lastly, we can make a plot of the CFA with `semPaths()` from the `semPlot` package.

```
library(semPlot)
semPaths(semPlotModel(cfa_fit), rotation = 2)
```

When we are happy with our measurement model, we can move on to full SEM models! For example:

```
sem_mod <- '# measurement model
            leadership =~ career_1 + career_2 + career_4 +
                          career_6 + career_10
            procrast =~ procrast_1 +
                        procrast_7 + procrast_9 +
                        procrast_12 + procrast_15 +
                        procrast_19
            # structural model
            leadership ~ procrast '
sem_fit <- sem(sem_mod, data = personality)
summary(sem_fit, standardized = TRUE)
```

`semPaths(semPlotModel(sem_fit), rotation = 2)`

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was written by Mine Çetinkaya-Rundel and Andrew Bray.