In this lab you will explore the technique of factor analysis to measure abstract constructs like personality features, aspirations for the future, and procrastination. You will again be using the `psych` package, but this time to conduct factor analysis. Then, you will use `lavaan` to fit confirmatory factor analysis models, also called measurement models.

The data you will be using for this lab was generated by **YOU**. As before, we will be constructing summaries and visualizing overall trends, not creating profiles of individuals.

In this lab we will load the data using the `readr` package, explore the data using the `dplyr` and `tidyr` packages, and visualize it using the `ggplot2` package. We’ll load all of these packages together with the `tidyverse` package. The data can be found at the URL given below. Later we will also use the `psych`, `lavaan`, and `GPArotation` packages.

Let’s load the packages.

```
library(tidyverse)
library(psych)
library(lavaan)
library(GPArotation)
```

Note that I recommend always loading the `psych` package after `tidyverse` because the two packages contain functions with the same name. Loading both packages *masks* the same-named functions in the first package loaded.
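If you ever need a function from the package whose version was masked, you can call it explicitly with the `::` operator. As a small sketch (`my_items` here is a placeholder data frame of scale items, not an object from this lab):

```r
# Both psych and ggplot2 (loaded via tidyverse) define a function
# called alpha(). With psych loaded last, a bare alpha() call means
# psych::alpha(). The :: operator removes any ambiguity:
psych::alpha(my_items)       # Cronbach's alpha from the psych package
ggplot2::alpha("red", 0.5)   # transparency helper from ggplot2
```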

Please write your lab code and answers in an R Markdown file. You are to submit the knitted HTML file to Moodle when your lab is complete.

The data set for this lab consists of responses from students in SDS 390. Refer to the survey you took here.

Let’s load the data using the `read_csv()` function from the `readr` package. The rest of the code in this chunk will clean the data: renaming variables, creating a data frame of variable labels called `variable_view`, and converting variable types to numeric.

```
personality <- read_csv("http://www.science.smith.edu/~rgarcia/sds390-S19/psychometrics.csv")
personality <- personality %>%
  slice(c(1, 4:n())) %>%   # keep the label row and the data rows
  select(ResponseId, age:loc29) %>%
  rename(r_exp = age,
         icecream = Q17,
         icecream_other = Q17_6_TEXT,
         movies = Q18,
         movies_other = Q18_8_TEXT,
         birth_order = Q19)
variable_view <- personality %>%
  slice(1) %>%
  gather(ResponseId:loc29, key = "variable", value = "label")
personality <- personality %>%
  slice(2:n()) %>%         # drop the label row, keep the data rows
  mutate_at(vars(maacl_1:loc29), as.numeric)
```

Take a look at the data. You can use `glimpse()`, `head()`, or `View()`, whichever way you like to look at your data frames. Recall that there is also a function called `names()` that gives you a relatively condensed look at all of your variables.

`names(personality)`

Last time we worked with this data set we looked at the procrastination items, `procrast_`, and the mood items, `maacl_`. This time you will work primarily with the career aspirations scale. First, familiarize yourself with the questions on the scale by looking at the `variable_view` data frame.

- Looking at `variable_view`, what questions were asked for `career_1` and for `career_8`? Reading through all ten questions, do you think these ten items could cluster into different subgroups? Why or why not? If yes, what clusters do you see?

Next we might start by looking at the correlations among the items. We can make a correlation matrix with the `corr.test()` function and can also visualize the correlations between the items with the `cor.plot()` function, both from the `psych` package.

First, let’s make a correlation matrix of the 10 `career_` items. The top matrix in the output is the correlation matrix and the bottom matrix contains the p-values for the tests of the null hypothesis \(H_0: \rho = 0\).

`corr.test(select(personality, career_1:career_10))`

It is sometimes handy to save the correlation matrix so that you can do other things with it, for example, make a heat map with `cor.plot()`.

- Save the output of `corr.test()` in an object called `cor_object`. Then, use the `names()` function to see the elements of the object you made. Which element do you think is the correlation matrix? What do you think `p` is? What do you think `ci` is?

We can look at specific elements of an object with the `$` operator. Just as with a data frame we can call out a specific variable with `$`, we can call out a specific element in any list-type object with `$`. To get the correlation matrix from our object `cor_object` we can refer to it like this:

`cor_object$r`

- Save the correlation matrix only in an object called `cor_matrix`. Also, save the matrix of standard errors of the correlation estimates in an object; call it whatever you want.
- Using `cor.plot()` from the `psych` package, create a heat map of the correlation matrix. Do the conceptual patterns among the items you discussed in exercise 1 have any correspondence to the patterns in the heat map? (It is OK to say that you do not think so; it is hard to tell.)
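As a minimal sketch of the heat map call (assuming you saved the correlation matrix as `cor_matrix` above):

```r
# Heat map of the saved correlation matrix; numbers = TRUE prints
# the correlation values in the cells.
cor.plot(cor_matrix, numbers = TRUE)
```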

It may be hard to see any sub-groupings among the items. This is exactly the task that **Exploratory Factor Analysis (EFA)** can help us with.

We will be using the `fa()` function in the `psych` package to do an EFA for the career aspiration items. The second argument to this function will be `nfactors =`, the number of factors we want to extract. Because this is an exploratory analysis, you may not have a number in mind. You *might* have a sense from the correlation matrix analyses you conducted above, but you might not.

One popular method for finding the number of factors, called the Kaiser method, involves counting the eigenvalues of the correlation matrix that are greater than 1. The number of eigenvalues greater than 1 corresponds to the number of components (linear combinations of items) that could explain more variance than one variable on its own. If an eigenvalue is less than 1, that component would not help us because it is worse than one variable just by itself.

We can calculate the eigenvalues (and corresponding eigenvectors) of *any* square matrix with the `eigen()` function. Like this:

`eigen(cor_matrix)`

- Just as above with the `cor_object`, save the output of the `eigen()` function as an object and extract the eigenvalues. How many eigenvalues of the correlation matrix are greater than 1?

I called my object `eigen_stuff`. I can use my eyeballs, but I can also use the `sum()` function to count up the eigenvalues greater than 1.

`sum(eigen_stuff$values > 1)`

Because there are 3, the Kaiser method would have us extract 3 factors. Also, the underlying factors of the career scale are likely correlated, so we want to ask for an **oblique rotation**. The oblique rotation options are “promax”, “oblimin”, “simplimax”, “bentlerQ”, and “geominQ”.

“Oblique” means that the factors are allowed to correlate, while “orthogonal” means they are constrained to be independent. We can ask for an **orthogonal rotation** with the options “none”, “varimax”, “quartimax”, “bentlerT”, and “geominT”.
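As a sketch of the difference, you could fit the same EFA twice, once with an oblique and once with an orthogonal rotation, and compare:

```r
# With an oblique rotation, fa() also estimates the factor correlation
# matrix (stored in the $Phi element); with varimax the factors are
# forced to be uncorrelated, so no Phi matrix is reported.
fa_oblique    <- fa(cor_matrix, nfactors = 3, rotate = "oblimin", fm = "pa")
fa_orthogonal <- fa(cor_matrix, nfactors = 3, rotate = "varimax", fm = "pa")
fa_oblique$Phi    # factor intercorrelations (oblique rotations only)
```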

We also want to ask for principal axis factoring, `fm = "pa"`. The other **estimation techniques** available are `fm = "minres"`, `fm = "wls"`, `fm = "gls"`, and `fm = "ml"`. Read the documentation of `fa()` to find out more.

```
fa_3 <- fa(cor_matrix, nfactors = 3, rotate = "oblimin", fm = "pa")
fa_3
```

- Which items appear to load on which factor? Are there any items that seem to load strongly on multiple factors (i.e., items that are cross-loaded)? Ignoring the items that have strong cross-loadings, looking back at the actual questions that were asked, can you give names to the three factors?

Because we were not able to achieve simple structure with three factors, we might want to choose a different number of factors. The Kaiser method gives you the maximum number of factors that is reasonable to extract, but we can consider going down to two.

Another popular method, besides the Kaiser method, for determining the number of factors to extract is to look at a dotplot of the eigenvalues. This plot is called a **scree plot**. The eigenvalues are mapped to the y-axis and the component number is mapped to the x-axis.

```
data.frame(x = 1:10, y = eigen_stuff$values) %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_line() +
  geom_hline(yintercept = 1, color = "red")
```

In the scree plot you are looking for an elbow-type pattern rather than a specific cutoff, although the horizontal line at 1 is still shown. So look for an elbow: a precipitous drop in the eigenvalues followed by a much smaller drop. Take the component number of the eigenvalue just before the much smaller drop as the number of factors to extract. This scree plot suggests extracting two factors.

- Run another EFA, this time extracting only two factors. Use an oblique rotation. Are you able to make sense of the factors in this model? Can you give them names? What is the correlation between the two factors? Give its interpretation.

Because we have some negatively loading items, we could reverse score them.

- Reverse code items 4, 7, and 10. Next, create another heat map of the correlation matrix using these reversed items, making sure to sort the items so items on the same factor are next to each other. Lastly, check to see if these two factors form reliable scales with the `alpha()` function.
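As a hedged sketch of reverse scoring (this assumes the career items run from 1 to 7; check `variable_view` to confirm the actual range, and note that the item grouping passed to `alpha()` below is only illustrative, not the answer):

```r
# Reverse code an item by subtracting it from max + min (8 on a 1-7 scale)
personality <- personality %>%
  mutate(career_4_r  = 8 - career_4,
         career_7_r  = 8 - career_7,
         career_10_r = 8 - career_10)

# Reliability of a hypothetical subscale built from a few of the items
alpha(select(personality, career_1, career_2, career_4_r))
```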

Perhaps, in sample 1, we did all of this EFA work and found two career aspiration constructs that were psychologically and statistically distinct from each other. A next step might be to confirm that these subscales exist in a new data set, sample 2. Further, we could test a model that asserts that the cross-loadings are all zero and see whether that model fits the sample variance-covariance matrix well. This would be a **Confirmatory Factor Analysis (CFA)**. In this lab we will fit the CFA, but we will leave checking model fit for a later class.

We will use `lavaan` to do CFA. To specify latent factors in `lavaan` you use the `=~` operator. For example, we could specify a 10-item procrastination factor with the following code.

```
cfa_model <- ' procrast =~ procrast_1 + procrast_2 + procrast_5 +
               procrast_7 + procrast_9 + procrast_10 +
               procrast_12 + procrast_15 + procrast_19 +
               procrast_20 '
```

We then fit the model with `cfa()` and ask to see the output with `summary()`.

```
cfa_fit <- cfa(cfa_model, data = personality)
summary(cfa_fit, standardized = TRUE)
```

We want to see that all of the standardized loadings are large, but not greater than 1. Let’s get rid of items 5, 10, and 20. Note that at the re-specification stage CFA becomes exploratory and much more like EFA. Over-fitting your model to your data is a danger.

```
cfa_mod_re <- ' procrast =~ procrast_1 + procrast_2 +
                procrast_7 + procrast_9 +
                procrast_12 + procrast_15 + procrast_19 '
cfa_fit_re <- cfa(cfa_mod_re, data = personality)
summary(cfa_fit_re, standardized = TRUE)
```

- Fit the two-factor CFA model for the career aspirations items based on your analyses above. This model is referred to as the **measurement model** in SEM. There is one standardized loading that is greater than 1 and one loading on the larger subscale that is quite small. Remove these items from the CFA and re-fit the model. What is the correlation between your two subscales? Give its interpretation. Does this match the factor correlation from the EFA above?
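A two-factor CFA in `lavaan` follows the same pattern as the one-factor model above. The item-to-factor assignments and factor names below are assumptions for illustration only; use the groupings from your own EFA:

```r
# Two latent factors; cfa() estimates their covariance by default.
cfa_two_mod <- ' factor1 =~ career_1 + career_2 + career_6
                 factor2 =~ career_3 + career_5 + career_9 '
cfa_two_fit <- cfa(cfa_two_mod, data = personality)
# The Std.all value on the factor1 ~~ factor2 row of the output
# is the estimated factor correlation.
summary(cfa_two_fit, standardized = TRUE)
```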

Lastly, we can make a plot of the CFA with `semPaths()` from the `semPlot` package.

```
library(semPlot)
semPaths(semPlotModel(cfa_fit), rotation = 2)
```

When we are happy with our measurement model, we can move on to full SEM models! For example:

```
sem_mod <- '# measurement model
            leadership =~ career_1 + career_2 + career_4 +
                          career_6 + career_10
            procrast =~ procrast_1 +
                        procrast_7 + procrast_9 +
                        procrast_12 + procrast_15 +
                        procrast_19
            # structural model
            leadership ~ procrast '
sem_fit <- sem(sem_mod, data = personality)
summary(sem_fit, standardized = TRUE)
```

`semPaths(semPlotModel(sem_fit), rotation = 2)`

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was written by Mine Çetinkaya-Rundel and Andrew Bray.