przemyslaw.powroznik@student.uj.edu.pl

How to deal with nominal and categorical data in geostatistical analysis?

Discussion created by przemyslaw.powroznik@student.uj.edu.pl on Feb 28, 2018
Latest reply on Feb 28, 2018 by Dan_Patterson

Hello! I have a set of questionnaire data for my project. It's in point format, and the answers are either nominal (e.g. What's your hair color? Answers: blonde, black, brown, dyed) or on a Likert scale.

 

I want to:

* perform clustering of individual variables

 

* group the points into clusters based on multiple variables (Multivariate Clustering)

 

* test various hypotheses. I want to see how well my explanatory variables (numerical data such as distance from the city center, rent cost, crime rates) explain the variance in the questionnaire answers. I was planning to run OLS first to reduce the number of explanatory variables, and then use GWR to see what the relationship is between the remaining explanatory variables and the questionnaire answers.

 

My question is: how do I do this with nominal and categorical (Likert scale) data? I could probably handle it if the data were numerical, but as it isn't, I'm left kinda clueless.

 

What I've done so far: I've been browsing the help for the Spatial Statistics toolbox, and I've also watched the workshop videos on esri.com/videos. So far I have come up with two possible solutions:

 

a) For nominal data, I'd create a regular grid of squares or hexagons and count the number of people in each cell who responded in a certain way, then proceed with the analysis as usual, using the count as an explanatory variable. This approach, however, brings in a whole new set of questions. How do I pick the shape of my grid? How do I pick its size?
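
A minimal sketch of the counting step in a), assuming projected coordinates and a square grid. The function name `count_by_cell`, the sample points, and the 500 m cell size are made up for illustration; they say nothing about which grid shape or size is appropriate, and this is plain Python rather than any Spatial Statistics tool.

```python
from collections import Counter, defaultdict
import math


def count_by_cell(points, cell_size):
    """Count nominal answers per square grid cell.

    points: iterable of (x, y, answer) tuples in projected coordinates.
    Returns a dict mapping (col, row) -> Counter of answers in that cell.
    """
    counts = defaultdict(Counter)
    for x, y, answer in points:
        # The cell index is the integer part of the coordinate / cell size.
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        counts[cell][answer] += 1
    return dict(counts)


# Hypothetical sample: coordinates in metres, hair-colour answers.
sample = [
    (120.0, 80.0, "blonde"),
    (130.0, 95.0, "black"),
    (140.0, 60.0, "blonde"),
    (620.0, 510.0, "brown"),
]
cells = count_by_cell(sample, cell_size=500.0)
```

Each cell's counter could then be joined back to the grid polygons as one count field per answer category.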

 

b) For Likert-scale data, I'd convert the answers to -2, -1, 0, 1, 2, where -2 is "strongly disagree" and 2 is "strongly agree", then proceed as usual using the converted scale as explanatory variables. As far as I know, this shouldn't be done if I want to use GWR: the documentation for the GWR tool advises caution, as I could run into multicollinearity issues.
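
A minimal sketch of the recoding in b). Only the two endpoint labels come from the description above; the three intermediate labels ("disagree", "neutral", "agree") and the names `LIKERT_CODES` and `encode_likert` are assumptions for illustration.

```python
# Mapping from answer text to the -2..2 coding described in b).
# The intermediate labels are assumed; adjust them to the actual questionnaire.
LIKERT_CODES = {
    "strongly disagree": -2,
    "disagree": -1,
    "neutral": 0,
    "agree": 1,
    "strongly agree": 2,
}


def encode_likert(answers):
    """Convert Likert answer strings to integer codes (case-insensitive)."""
    return [LIKERT_CODES[a.strip().lower()] for a in answers]


coded = encode_likert(["Strongly agree", "neutral", "Disagree"])
```

An answer that is not in the mapping raises a `KeyError`, which makes unexpected or misspelled categories visible instead of silently coding them.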

 

Does anyone have any experience with data in this format? Are these approaches valid (especially given the issue with GWR)? If not, what else can I do to achieve my goals? How should I proceed? I'd also be grateful for any resources on this issue, be it papers or books.
