
How to deal with nominal and categorical data in geostatistical analysis?

02-28-2018 12:53 PM
PrzemyslawPowroznik
Emerging Contributor

Hello! I have a set of questionnaire data for my project. It's in point format, and the data is either nominal (e.g. What's your hair color? Answers: blonde, black, brown, dyed) or on a Likert scale.

I want to:

* perform clustering of individual variables

* group the points into clusters based on multiple variables (Multivariate Clustering)

* test various hypotheses. I want to see how my explanatory variables (numerical data such as distance from city center, rent cost, crime rates) can explain the variance in the answers in my questionnaires. I was planning to run OLS first to reduce the number of explanatory variables, and then use GWR to see what the relationship is between the remaining explanatory variables and the answers in the questionnaires.

My question is: how do I do this with nominal and ordinal (Likert scale) data? I could probably handle it if the data were numerical, but as it isn't, I am left kinda clueless.

What I've done so far: I've been browsing the help for the Spatial Statistics Toolbox. I've also watched workshop videos on esri.com/videos. So far I have come up with two possible solutions:

a) For nominal data I'd create a regular grid of squares or hexagons and count the number of people responding in a certain way, then proceed with the analysis as usual, using the count as an explanatory variable. This approach, however, brings in a whole new set of questions. How do I pick the shape of my grid? How do I pick its size?
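To make approach (a) concrete, here is a minimal sketch of the counting step in plain Python. It assumes projected coordinates and a square grid; the function name `grid_counts`, the sample points, and the 100-unit cell size are all hypothetical and only there for illustration. It does not answer the grid shape or size question, it only shows what the counts would look like.

```python
from collections import Counter, defaultdict
import math


def grid_counts(points, cell_size):
    """Count answers per square grid cell.

    points: iterable of (x, y, answer) tuples in projected coordinates.
    cell_size: cell edge length in the same units (an assumption to tune).
    Returns {(col, row): Counter({answer: count})}.
    """
    cells = defaultdict(Counter)
    for x, y, answer in points:
        # Integer column/row index of the cell containing the point.
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[cell][answer] += 1
    return cells


# Hypothetical sample: hair-color answers at four survey points.
sample = [
    (10.0, 10.0, "blonde"),
    (40.0, 20.0, "black"),
    (120.0, 30.0, "blonde"),
    (130.0, 90.0, "brown"),
]
counts = grid_counts(sample, cell_size=100.0)
for cell, answers in sorted(counts.items()):
    print(cell, dict(answers))
```

Re-running this with several cell sizes is a cheap way to see how sensitive the counts are to the grid choice before committing to one.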

b) For Likert scale data I'd convert it as follows: -2, -1, 0, 1, 2, where -2 is "strongly disagree" and 2 is "strongly agree", then proceed as usual using the converted scale as explanatory variables. As far as I know this shouldn't be done if I want to use GWR. The documentation for the GWR tool says that caution is advised, as I could encounter multicollinearity issues.

Does anyone have any experience with data in this format? Are these approaches valid (especially due to issue with GWR)? If not, what else can I do to achieve my goals? How should I proceed? I'd also be grateful for any resources on this issue - be it papers or books.

3 Replies
DanPatterson_Retired
MVP Emeritus

There are many books, some specifically dealing with the spatial domain.

Under no circumstances should you be tempted to ascribe numbers to categories so that you can use parametric statistics. If the data are nominal, then 'regression' anything is a no-go. You can test for association with the family of tests that Chi-square belongs to. As for spatial pattern, look at Join Count statistics if the borders are pre-existing; unlike Moran's I, it works on presence/absence data. There is nothing worse than trying to force data into a body of statistics for which it is not applicable: non-parametric statistics are for nominal/ordinal data, parametric statistics for interval/ratio data. Sadly, the GIS world as a whole has put a lot more effort into providing spatial statistical tools for the parametric world than for the non-parametric world. I always get leery when I hear the term 'advise against' rather than 'shouldn't use', probably because I have seen too many projects 'correlating' things like bus ridership with distance from the city core. When asked how they did it, the answer would be... "I simply converted the categories into numbers 1 to N so I could use regression..." sigh
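As an illustration of the association tests mentioned above, here is a minimal sketch of the Pearson Chi-square statistic for a contingency table, written in plain Python. The function name `chi_square` and the counts are hypothetical (two districts by two answer categories); it is a sketch of the textbook formula, not a substitute for a statistics package.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical counts: rows = two districts, columns = two answer categories.
observed = [[10, 20], [20, 10]]
stat, dof = chi_square(observed)
print(stat, dof)
```

The statistic is then compared against the Chi-square distribution with `dof` degrees of freedom. Note that the test assumes reasonably large expected counts in every cell, so sparse categories may need to be merged first.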

PrzemyslawPowroznik
Emerging Contributor

Thank you for your answer!

The usual tests have already been done by people more knowledgeable than I am. I have only been tasked with analyzing the data in the spatial domain. Allow me to rephrase your answer so I can make sure I understand it correctly (the language is still a little bit of a barrier for me).

I understand that it is absolutely forbidden to convert a Likert scale to integers and use that in any parametric statistics.

Most importantly, I wanted to see whether there are significant groups of similar respondents without using city district polygons as an input. We are curious whether these groups exist and extend beyond the boundaries of city districts.

To sum up your response - our only option is to work on Join Count results.

You've also mentioned that there are a lot of books available. I am not based in an English-speaking country, so I am not very familiar with good scientific literature. Can you recommend anything? I will at least check whether it's available in our library or on the web.

DanPatterson_Retired
MVP Emeritus

Cressie, and many of the resources linked from https://gis.stackexchange.com/questions/48754/learning-spatial-statistics

Many 'good' books are discipline specific, so even if they cover the spatial domain, the transference of examples across disciplines makes them a slog to wade through.

http://www.spatialanalysisonline.com/ is an online book which has good examples and is generally well done.

As for only recommending Join Count... that was an example of a solid statistical test that is directly applicable to nominal data in the spatial domain, but has lost its 'appeal' because of the misguided perception that the parametric alternatives are somehow superior (i.e. many of the tools in the spatial statistics tool set). This is more a case of the market being bigger for data that can be quantified (i.e. $, age, income, temperature) than for data that is qualitative in nature (i.e. preference, rankings of any sort, type, belongs to).

So tools to explore in the case of 'presence/absence' might start with some of those listed at
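For readers who have not met Join Count statistics, here is a minimal sketch in plain Python. It assumes a binary presence/absence variable (1 = "black", 0 = "white") and a list of neighbor pairs; the function names, the four units, and the neighbor list are all hypothetical. For point data, the neighbor pairs would have to be defined first (for example by distance or nearest neighbors), which is an assumption of this sketch and not something the statistic provides. The permutation step is one common way to judge significance, named here plainly as a swapped-in technique.

```python
import random


def join_counts(values, neighbors):
    """Count BB, BW and WW joins for a binary (presence/absence) variable.

    values: {unit_id: 0 or 1}
    neighbors: iterable of (unit_a, unit_b) pairs, each join listed once.
    """
    counts = {"BB": 0, "BW": 0, "WW": 0}
    for a, b in neighbors:
        s = values[a] + values[b]
        counts["BB" if s == 2 else "BW" if s == 1 else "WW"] += 1
    return counts


def bb_permutation_p(values, neighbors, n_perm=999, seed=0):
    """Pseudo p-value: share of random relabelings with at least as
    many BB joins as observed (a simple permutation test)."""
    rng = random.Random(seed)
    observed_bb = join_counts(values, neighbors)["BB"]
    ids = list(values)
    labels = [values[i] for i in ids]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        shuffled = dict(zip(ids, labels))
        if join_counts(shuffled, neighbors)["BB"] >= observed_bb:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical example: four units in a row, presence in the first two.
values = {"a": 1, "b": 1, "c": 0, "d": 0}
joins = [("a", "b"), ("b", "c"), ("c", "d")]
observed = join_counts(values, joins)
p_value = bb_permutation_p(values, joins)
print(observed, p_value)
```

A nominal variable with several categories can be handled by running this once per category (that category versus everything else), at the cost of multiple tests.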

http://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/an-overview-of-the-mapping-cluste... or

http://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/an-overview-of-the-analyzing-patt...

If you have data observed at a location, there is the age-old question: can I generalize and ascribe it to an area, and if so, how big an area? What about one of the hot-spot variants? Some look at the presence/absence issue... so if space is covered well, then the use of the tool may be prudent.

More importantly, did you collect data for all possible locations (within reason, obviously), or were some areas omitted?

The point I am making is: spend more time on analysing your project design in the context of spatial pattern. Was a reasonable job done in ensuring that you have a good spatial picture of what you are studying? A simple map from a well-designed study, just showing what you have without the need to perform some statistical test for significance, is far better than a p-value from a horribly designed project, even though it used the 'latest techniques'.

I won't be selling software anytime soon, but the exploratory data tools in ArcMap are good. However, when they have a whole section on Z-scores and p-values and nothing on what to do with nominal/ordinal data, then you can only assume that the tools are market driven (but I don't have time to test that hypothesis).

So good luck, and just avoid the pitfalls of following the 'latest'; the history of statistical analysis is littered with these.

As for stats-specific information, https://stats.stackexchange.com/ (the Cross Validated site on Stack Exchange) is good to glean for background on specific tests. The spatial domain is a little less well covered, but it is there.
