Cluster analysis for Categorical Data?

05-27-2016 02:11 PM
BijeshMishra
New Contributor

Hi, I want to run a cluster analysis on my categorical variables. I have five different variables, each rated on a 1-5 scale (1 lowest, 5 highest). Can I run cluster analysis on this data? If yes, can I analyze the variables together, or do I have to analyze each one separately? Which is the best tool for this: "Cluster and Outlier Analysis" or "Hot Spot Analysis"?

2 Replies
DanPatterson_Retired
MVP Emeritus

Many of your questions are answered in the help topics.  Was there a particular section that you think meets your data form and data analysis intent?

An overview of the Mapping Clusters toolset—ArcGIS Pro | ArcGIS for Desktop

The concepts section helps clarify technical details as well:

Mapping clusters and toolset concepts

MervynLotter
Occasional Contributor III

Your categorical data is on an ordinal scale from low to high, so I suspect it is OK to use in these tools. I am not aware of any specific scale requirements; the data simply needs a range of high and low values.

For each of your variables, do you want to identify statistically significant clusters of high values and statistically significant clusters of low values? If so, then the (Optimized) Hot Spot Analysis and Cluster and Outlier Analysis tools are the ones you need. Run both: I am not aware that either one is better than the other, they just tell you something different about your data. Each run analyzes a single variable, though, so the tools would need to be run once for each of your 5 variables.
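To make the "one run per variable" idea concrete, here is a minimal pure-Python sketch of a Getis-Ord Gi* z-score, the statistic behind hot spot analysis. It is not the ArcGIS tool itself: the 4x4 grid of points, the variable names `var_a` and `var_b`, the ratings, and the fixed distance band of 1.0 are all made up for illustration, and the weights are simple binary neighbors within the band.

```python
import math

def gi_star(points, values, band):
    """Gi* z-score per feature, using binary weights within `band` (self included)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for xi, yi in points:
        # Weight is 1 for every feature within the distance band, else 0.
        w = [1.0 if math.hypot(xi - xj, yi - yj) <= band else 0.0
             for xj, yj in points]
        sw = sum(w)
        sw2 = sum(wj * wj for wj in w)
        num = sum(wj * v for wj, v in zip(w, values)) - mean * sw
        den = s * math.sqrt((n * sw2 - sw * sw) / (n - 1))
        scores.append(num / den if den > 0 else 0.0)
    return scores

# Hypothetical data: a 4x4 grid of features with 1-5 ratings.
points = [(x, y) for x in range(4) for y in range(4)]

def rating(x, y):
    # High ratings in one corner, low ratings in the opposite corner.
    if x < 2 and y < 2:
        return 5
    if x >= 2 and y >= 2:
        return 1
    return 3

var_a = [rating(x, y) for x, y in points]
var_b = [6 - v for v in var_a]  # same pattern, reversed
ratings = {"var_a": var_a, "var_b": var_b}

# One run per variable, as described above.
z_scores = {name: gi_star(points, vals, band=1.0)
            for name, vals in ratings.items()}

for name, scores in z_scores.items():
    print(name, [round(z, 2) for z in scores])
```

Large positive z-scores mark features surrounded by high ratings, and large negative ones mark features surrounded by low ratings. The real tools also handle significance testing and the choice of distance band, which this sketch leaves out.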

I am not sure what question you are trying to answer, but if you have several datasets (one for each of your variables) that you want to combine in one analysis, then consider running the Grouping Analysis tool to create groups of features that share similar values. You may want to try this on your 5 raw categorical values as well as on the outputs of the (Optimized) Hot Spot Analysis. In the latter case, use the z-scores from each of the 5 runs as the analysis fields. Before you do so, you would need to create a new feature class, or add columns to your existing one, and import the z-scores from each of your Hot Spot or Cluster and Outlier Analysis runs (whichever you prefer). The z-scores represent the strength of the clustering observed in your data, not the areas with the highest and lowest values, so they will tell you something different about your dataset that is perhaps statistically more significant.
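As a rough illustration of grouping features by their z-scores from the five runs, here is a small pure-Python sketch. It uses a plain k-means as a stand-in, not the ArcGIS tool's implementation. The table of z-scores is hypothetical, and seeding the centers with the first k rows is a simplification that only works here because those rows happen to come from different groups.

```python
import math

def kmeans(rows, k, iterations=20):
    """Naive k-means; returns one group label per row."""
    centers = [list(r) for r in rows[:k]]  # simplistic seeding: first k rows
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assign each row to its nearest center (Euclidean distance).
        labels = [min(range(k), key=lambda c: math.dist(r, centers[c]))
                  for r in rows]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical table: one row per feature, one column per variable's z-score.
z_table = [
    [2.5, 2.1, 1.9, 2.8, 2.2],       # consistently hot
    [-2.3, -1.9, -2.6, -2.0, -2.4],  # consistently cold
    [0.1, -0.2, 0.3, 0.0, -0.1],     # not significant
    [2.2, 2.6, 2.0, 2.4, 1.8],
    [-2.0, -2.5, -2.1, -1.8, -2.2],
    [-0.3, 0.2, 0.1, -0.1, 0.2],
]

labels = kmeans(z_table, k=3)
print(labels)
```

Features that end up with the same label behave similarly across all five variables. With real data you would want a better seeding method and some way of judging how many groups are appropriate.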

Lastly, there are some great resources from previous years' UCs that you should watch if you have not already done so. Check out:

Spatial Data Mining: A Deep Dive into Cluster Analysis | Esri Video

These are also related and interesting tech sessions:

Applying Spatial Statistics: The Analysis Process in Action | Esri Video

From the 2014 UC

Spatial Statistics: Simple Ways to Do More with Your Data | Esri Video

Modeling Spatial Relationships Using Regression Analysis | Esri Video
