cluster analysis based on multiple variables?

AlexeiRowles · ‎03-15-2011

Hi,

Within a set region, I would like to be able to find clusters of parcels based on multiple variables such as property value and proportion of tree cover. Cluster analysis using Anselin Local Moran's I seems close to what I want to do, but it appears that clusters can only be formed against one variable at a time.

Am I on the right track? Any help or suggestions would be greatly appreciated.
Alexei

JeffreyEvans · ‎03-15-2011

I do not believe that local autocorrelation (Local Moran's I) is what you are after. Please see this thread for multivariate cluster analysis recommendations:

http://forums.arcgis.com/threads/20288-fuzzy-c-means-cluster-analysis

Cheers,
Jeff

LaurenRosenshein · ‎03-16-2011

Hi Alexei,

The post about fuzzy c-means clustering may help you move forward with your analysis, so definitely check that out. From there we link to a sample script that uses R to do some c-means cluster analysis. I also mention that we're working on a Group Similar Features tool for ArcGIS 10.1 that is an implementation of K-Means++ clustering, with optional space/time constraints...which we're really excited about!

Alternatively, GeoDa has Bivariate Global and Local Spatial Autocorrelation analysis, so you may want to check that out.

Hope this helps!

Lauren Rosenshein
Geoprocessing Product Engineer

AlexeiRowles · ‎03-16-2011

Jeff & Lauren,

Thank you both for your help. I knew there had to be more appropriate methods out there! I'll check out the leads you suggested and see how I go.

Lauren - the new tool you're developing really sounds like it would fit in with what I'm trying to do.

Thanks,
Alexei

JeffreyEvans · ‎03-16-2011

You should be aware that there are two known problems with K-Means and C-Means. First, finding the optimal partition is an NP-hard (non-deterministic polynomial-time hard) problem, so the algorithms rely on an iterative heuristic that can converge to a poor local optimum depending on the starting centers. K-Means++ does not remove the NP-hardness, but its seeding procedure largely mitigates the problem (yeah ESRI). The second issue is distortion of the Voronoi regions that define the partition vectors. This is because class centers are based on means, making the K-Means algorithm very sensitive to outliers, non-normality and variability in data ranges. A solution to this is to use the medoid rather than the mean to specify cluster centers. The medoid is the observation with the smallest total dissimilarity to the other members of its cluster and, as such, is far less sensitive to outliers and non-normality, although variables on very different scales should still be standardized. There is an implementation of K-Medoids (the pam function) in the R cluster package. ESRI's example K-Means script can be used as a template and modified to implement alternative models.
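As a rough illustration of why the choice of center matters, here is a minimal Python sketch comparing the mean with the medoid for a handful of parcels. The parcel values and the function names mean_center and medoid are made up for the example; they are not from any ESRI or R script.

```python
import math

def mean_center(points):
    """Arithmetic mean of each variable (the K-Means style center)."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def medoid(points):
    """The observed point with the smallest total distance to all others."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

# Hypothetical parcels: (land value in $1000s, proportion of tree cover).
# The last parcel is an outlier in land value.
parcels = [(100, 0.20), (110, 0.25), (105, 0.22), (95, 0.18), (2000, 0.21)]

print("mean  :", mean_center(parcels))   # pulled toward the outlier
print("medoid:", medoid(parcels))        # stays on a typical parcel
```

The mean is dragged far away from the four typical parcels, while the medoid remains one of them. Note that the unscaled land value dominates the distance here, which is why standardizing the variables first is usually a good idea.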

I am still skeptical about local autocorrelation as a solution for traditional clustering. The role of multivariate clustering is to find similarity in multivariate space and create n classes. The LISA statistic is designed to identify how values cluster in space (i.e., high-low, low-high, high-high, low-low). This becomes quite interesting in bivariate space and is technically clustering but is not considered multivariate clustering and is telling you something quite different. In your original posting it sounds like you want to cluster indicator variables into similar groups. This is exactly what multivariate clustering is designed to do.
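To make the contrast concrete, here is a minimal Python sketch of the local Moran statistic, I_i = (z_i / m2) * sum_j w_ij z_j, for a single variable with row-standardized contiguity weights. The tree cover values and the neighbor lists are hypothetical, and a real analysis would add a significance test, which is left out here.

```python
def local_moran(values, neighbors):
    """Return (I_i, quadrant) for each feature, for ONE variable only."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]          # deviations from the mean
    m2 = sum(d * d for d in z) / n          # population variance
    results = []
    for i, nbrs in enumerate(neighbors):
        # Spatial lag: average deviation of the neighbors (0 if no neighbors)
        lag = sum(z[j] for j in nbrs) / len(nbrs) if nbrs else 0.0
        i_local = z[i] * lag / m2
        quadrant = ("high" if z[i] > 0 else "low") + "-" + ("high" if lag > 0 else "low")
        results.append((i_local, quadrant))
    return results

# Hypothetical example: six parcels in a row, each adjacent to the next.
tree_cover_pct = [90, 80, 70, 10, 20, 30]
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]

for i, (i_local, quadrant) in enumerate(local_moran(tree_cover_pct, neighbors)):
    print(i, round(i_local, 3), quadrant)
```

The output labels each parcel by how its own value relates to its neighbors' values. Nothing in it groups parcels by similarity across several variables, which is the job of multivariate clustering.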

Attached is R code for finding the optimal number of clusters (K) and creating a final cluster model using K-Medoids. Good luck on your project.
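For readers without the attachment, the general idea can be sketched in plain Python. This is not the attached R script: it is a simplified alternating K-Medoids (not the full PAM swap search), and the use of average silhouette width to pick k is an assumption made for this illustration. The parcel values are hypothetical and assumed to be already standardized.

```python
import math

def assign(points, medoids):
    """Label each point with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: math.dist(p, points[medoids[c]]))
            for p in points]

def k_medoids(points, k, max_iter=100):
    """Alternating k-medoids; simpler than the swap search used by PAM."""
    medoids = [0]  # farthest-point seeding, an illustrative choice
    while len(medoids) < k:
        medoids.append(max(
            range(len(points)),
            key=lambda i: min(math.dist(points[i], points[m]) for m in medoids)))
    for _ in range(max_iter):
        labels = assign(points, medoids)
        updated = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            # New medoid: the member with the smallest total distance to the rest
            updated.append(min(
                members or [medoids[c]],
                key=lambda i: sum(math.dist(points[i], points[j]) for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return assign(points, medoids), medoids

def mean_silhouette(points, labels):
    """Average silhouette width; assumes at least two clusters and distinct points."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, p in enumerate(points):
        own = [j for j in clusters[labels[i]] if j != i]
        if not own:
            continue  # a singleton cluster contributes 0 by convention
        a = sum(math.dist(p, points[j]) for j in own) / len(own)
        b = min(sum(math.dist(p, points[j]) for j in idx) / len(idx)
                for lab, idx in clusters.items() if lab != labels[i])
        total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical standardized covariates for 12 parcels (two variables each)
parcels = [(0.0, 0.0), (0.5, 0.2), (0.2, 0.6), (0.7, 0.7),
           (8.0, 8.5), (8.4, 8.1), (8.9, 8.8), (8.3, 9.0),
           (0.3, 9.5), (0.8, 9.1), (0.1, 8.7), (0.6, 8.9)]

# Try several values of k and keep the one with the widest average silhouette
scores = {k: mean_silhouette(parcels, k_medoids(parcels, k)[0]) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
labels, medoids = k_medoids(parcels, best_k)
print(scores)
print(best_k, labels, medoids)
```

With roughly 20,000 parcels the all-pairs distance sums in this sketch would be slow, so a dedicated implementation is the better choice for real data.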

Cheers,
Jeff

AlexeiRowles · ‎03-17-2011

Thanks for the continued advice. I am new to a lot of this and it is quite a learning curve for me. I have been reading more material on this subject and I am still not sure of what method is most suitable. Jeff, the multivariate clustering you suggest sounds right, but I thought I should further clarify what I'm trying to do.

Within a region of around 1500 sq km, I have approximately 20,000 properties, a mix of farms and forest. Each property is a polygon of varying shape and size that I have characterised by a number of variables including area, land value, proportion of tree cover and proportion of land with slope > 15 %. I hope to determine clusters of similar properties based on all variables at once.

I do not know ahead of time how many clusters there will be; I'm hoping this will fall out of the analysis. I think I also need a distance element. For instance, there might be multiple clusters that are the same and would otherwise be grouped together, except that they are located too far apart in the region.

I hope this is clear. Thank you for all suggestions thus far - thought the extra detail above might help refine the best methodology for this work.

Alexei

JeffreyEvans · ‎03-18-2011

Alexei,
Just to add a further complication, it sounds like the most appropriate method for your problem may be nearest-neighbor imputation (often referred to as K nearest neighbor, or KNN). This would allow for the flexibility of a distance term. Given your elaboration of the problem at hand, the LISA statistic is quite inappropriate. LISA will only give you clustering based on neighbor contiguity (values in the immediate vicinity). Clustering statistics like K-Means will give you a specified number of clusters based on multivariate similarity of the covariates. In other words, every polygon will get assigned a cluster membership (1 to k) based on the characteristics (covariates) you have defined. The R script I provided will help with selecting the number of supported clusters (k).

Nearest-neighbor imputation, on the other hand, will return a specified number of similar polygons for each polygon. That is, if you specify k=4 you will get the four most similar polygons for each polygon. If you include x,y coordinates or a distance-based covariate then this will affect the multivariate distance that defines "how similar" a given candidate-neighbor (polygon) is.

See the following article: http://www.jstatsoft.org/v23/i10

Normally, in imputation analysis, you specify a training set (y) and each untrained polygon (x) is assigned a neighbor from y based on covariates that are present in both the y and x sides of the equation. However, it is possible to conduct an unsupervised imputation where every observation is assigned a neighbor and the remaining n-1 observations form the candidate-neighbor pool representing y. Sorry to add another, potentially confusing, alternative, but KNN analysis sounds like a viable option for your problem. You are not limited to R for KNN analysis. Various implementations of this method are available in a variety of statistical software.
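A minimal Python sketch of the unsupervised case, assuming plain Euclidean distance on z-scored covariates (one choice among several possible distance measures). The parcel values and the function names standardize and nearest_neighbors are hypothetical. It shows how adding x,y coordinates changes which polygon counts as most similar.

```python
import math

def standardize(rows):
    """Z-score each column so no variable dominates the distance.
    Assumes no column is constant (standard deviation > 0)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(r, means, sds)) for r in rows]

def nearest_neighbors(rows, k):
    """For each row, return the indices of the k most similar OTHER rows."""
    z = standardize(rows)
    result = []
    for i, p in enumerate(z):
        candidates = [j for j in range(len(z)) if j != i]  # the n-1 pool
        candidates.sort(key=lambda j: math.dist(p, z[j]))
        result.append(candidates[:k])
    return result

# Hypothetical parcels: (x in km, y in km, land value in $1000s, tree cover)
parcels = [(1, 1, 100, 0.20),
           (2, 1, 120, 0.25),
           (1, 2, 300, 0.80),
           (30, 30, 100, 0.20),   # same attributes as parcel 0, but far away
           (31, 29, 310, 0.75)]

attributes_only = [p[2:] for p in parcels]

print(nearest_neighbors(attributes_only, 1)[0])  # ignores location
print(nearest_neighbors(parcels, 1)[0])          # location included
```

Without coordinates, parcel 0 is matched to the distant parcel with identical attributes. With coordinates included, it is matched to the nearby parcel with similar attributes. How strongly location should count relative to the other covariates is a modelling decision; here every column simply gets equal weight after standardization.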

You could step into the spectral world by converting your polygons into rasters representing each covariate and then running an isocluster unsupervised classification. The assigned clusters would be spatially uniform within each polygon region, so you could just extract the cluster value. This would keep you in the ArcGIS world and be much simpler to implement. However, I cannot speak to the results.

Cheers,
Jeff

HamzaSiddiqui · ‎02-19-2014

Hi,

I have a similar problem. I have around 28 variables, a few of them categorical, and the data are in 3 MS Access tables with around 25,000 observations each on average.
I have no idea whether to apply clustering, CHAID or latent class analysis (using R packages such as mclust and poLCA). I am still in the initial phase of the project. Any suggestions would be highly appreciated.

---Thanks