Dimension Reduction Tool - First Time Using

3 weeks ago
NakkyEkeanyanwu
New Contributor II

Good evening,

This is my first time using the Dimension Reduction tool, and unfortunately there aren't any Esri videos that cover it. I have census tract level data containing 30 different variables (covering race, age/sex, income, housing, health, total population, and education), and I was trying to reduce these variables by splitting them into 7 components using the Dimension Reduction tool. I ran the tool with 99 permutations and it ran fine; however, I'm finding it difficult to delineate which variables belong to which component. With that, these are my questions:

1. Using the eigenvalues table and chart (I have attached them), how can I tell which variables make up each component? The goal is to use these 7 components for a flood vulnerability analysis (I want to use an ANOVA to see if there are any differences across these components within and outside the flood zones).

2. My understanding of the broken-stick value from this documentation (https://pro.arcgis.com/en/pro-app/latest/tool-reference/spatial-statistics/how-dimension-reduction-w...) is that it tells me the optimal number of components. Based on my scree plot (also attached), the intersection was at 8 with 70% of the total variance explained, but in the output message window (also attached) a value of 3 was given. However, when I ran the tool without specifying a number of components, that value changed to 28 with 100% of the variance explained. Can some light be shed on this please, as it has me very confused?

3. Lastly, for Bartlett's test of sphericity, the number of components is 28 (see attached output message window). Does this mean that only 28 variables are relevant?

Any sort of assistance with this would be very useful, and I would be extremely grateful.

Also, if you feel there is a better way to go about this in ArcGIS Pro, I'm more than open to working with your suggestions.

Thank you.

2 Replies
NakkyEkeanyanwu
New Contributor II

@LaurenGriffin @EricKrause @Omar_A and anyone else on the team, any help would be appreciated. Thank you.

EricKrause
Esri Regular Contributor

Hi @NakkyEkeanyanwu,

I think the major confusion is that Dimension Reduction does not select a subset of the variables you provide. Instead, it uses all of the variables to construct new "components," and each component is a weighted sum of all the variables. As a very simple example, say you have four variables (A, B, C, and D) and you want to create one component (reducing the dimension from four to one). The component might look something like this (I am making up these coefficients):

Component = 0.7*A + 0.2*B + 0.6*C - 0.1*D

In essence, the component uses all variables, and the weights (the coefficients) indicate how "important" that particular variable is in the component.  These coefficients are the eigenvector of the component, and the associated eigenvalue indicates how much of the total variability of the four variables is captured in the component.  
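If it helps to see the mechanics outside the tool, here is a small numpy sketch with made-up data (this is not the ArcGIS tool itself, just the same idea): a component's coefficients are the entries of an eigenvector of the correlation matrix, and the matching eigenvalue is that component's share of the total variance.

```python
# Minimal PCA sketch on synthetic data (variables A-D are made up):
# the "loadings" of a component are the entries of an eigenvector of
# the correlation matrix, and its eigenvalue is the variance it captures.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))             # 200 tracts, 4 variables (A, B, C, D)
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each variable

corr = np.corrcoef(X, rowvar=False)       # 4x4 correlation matrix
eigvals, eigvecs = np.linalg.eigh(corr)   # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]         # sort components largest-first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The first component is a weighted sum of ALL variables; the weights
# (loadings) are the entries of the first eigenvector.
w = eigvecs[:, 0]
component1 = X @ w

# Each eigenvalue, divided by their sum, is that component's share of
# the total variance (the PCTVAR field in the tool's output table).
pct_var = eigvals / eigvals.sum() * 100
print("loadings for component 1:", np.round(w, 2))
print("percent of variance per component:", np.round(pct_var, 1))
```

Variables with large-magnitude loadings (positive or negative) are the ones that "make up" that component, which answers your first question: read the eigenvector coefficients, not a list of selected variables.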

Frequently, a large percent of the total variability of all variables can be captured in just a few components, and this is what drives things like the Broken stick and Bartlett's test methods.  They try to find a compromise between minimizing the number of components and maximizing the amount of variability that is captured by the components.  Determining how many components to create is the most difficult part of Principal Component Analysis, so various methodologies are performed to help you decide. In an ideal case, you see some components account for a large percent of variance (PCTVAR field), then a sudden drop in the percent.  However, for your data, I don't really see this; the variability captured by each component seems to drop steadily, and I think this is why Bartlett's method is recommending using a large number of components.  However, using 7 components certainly seems justifiable here as well.  Really, you could justify any number between 3 and 28.
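The broken-stick rule itself is simple to sketch. In this illustration (the eigenvalues are made up, not from your data), component k is kept only while its observed share of variance exceeds the expected share you would get from randomly breaking a "stick" of total variance into p pieces:

```python
# Hedged sketch of the broken-stick criterion described in the Esri docs.
# The eigenvalues below are made up for illustration.
import numpy as np

def broken_stick(p):
    # Expected proportion of variance for each of p components under the
    # broken-stick model: (1/p) * sum_{i=k..p} 1/i for component k.
    return np.array([sum(1.0 / i for i in range(k, p + 1)) / p
                     for k in range(1, p + 1)])

eigvals = np.array([3.1, 1.6, 0.7, 0.4, 0.2])  # made-up eigenvalues
prop = eigvals / eigvals.sum()                 # observed variance proportions
expected = broken_stick(len(eigvals))

# Keep components only while the observed proportion beats the model.
keep = 0
for obs, exp in zip(prop, expected):
    if obs > exp:
        keep += 1
    else:
        break
print("components to keep:", keep)
```

This is why the recommended number can differ from the scree-plot elbow: the two methods answer "how many components?" with different yardsticks, and with a steady decline like yours they can disagree widely.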

Regarding only 28 components explaining 100% of the variance: this means that two of the variables you provided are redundant, i.e., their information is fully accounted for by the other variables. If I'm reading your screenshots correctly, you use total population as a variable, and you also use the populations of particular subgroups. If the subgroup populations add up to the total population (or very close to it), then there is redundancy, since total population is captured by the sum of the subgroup populations. I suspect this is happening for two variables, resulting in 28 components that account for all of the variability.
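You can see this effect with three synthetic variables (not your census data) where a "total" is the exact sum of two "subgroups": the correlation matrix picks up a zero eigenvalue, so one fewer component already explains 100% of the variance.

```python
# Demonstration on synthetic data: an exactly redundant variable
# produces a (numerically) zero eigenvalue, so 2 components explain
# all of the variance of 3 variables.
import numpy as np

rng = np.random.default_rng(1)
sub1 = rng.normal(size=300)
sub2 = rng.normal(size=300)
total = sub1 + sub2                     # "total" = sum of the subgroups

X = np.column_stack([sub1, sub2, total])
corr = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]

# 3 variables, but the smallest eigenvalue is ~0: only 2 components
# are needed to capture 100% of the variance.
print(np.round(eigvals, 6))
```

Dropping total population (or dropping one subgroup) would remove that redundancy; it usually makes the loadings easier to interpret, too.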

Please let me know if you have any other questions.
