Theory and data setup for crime incident analysis

09-17-2017 02:58 PM
JamesMisencik1
New Contributor II

Hello, self-taught GIS analyst here (i.e. relative newbie), but not having any technical problems. Rather, I am having theoretical issues and trying to set up my data for analysis.

Questions are near the bottom.

I have a time-series of perhaps 800 crime incidents aggregated into a communities polygon layer (i.e. a polygon for each of about 700 communities). The communities layer has a number of attributes including type of economy, population, buildings, area, wealth, etc. These are my independent variables. Together they form the community "social structure." And my dependent variable is "crime incidents". "Community" is a 4th-tier administrative unit in my research area (1st-tier is country, 2nd state, 3rd county, 4th community).

My research question is "Which of the social structure I.V.s (or combination thereof) predicts the occurrence of crime incidents at the community level of analysis?"

Originally, I thought that I needed to do some sort of hotspot analysis. In an early attempt (using fishnet polygon instead of community polygon to aggregate crime incidents), however, the hotspots were the size of 3rd tier (county) administrative units. And any fool that is remotely familiar with my study area could tell you where the crime "hotspots" are at the county level. So that won't work. I think I realize now that this occurred because:

A feature with a high value is interesting but may not be a statistically significant hot spot. To be a statistically significant hot spot, a feature will have a high value and be surrounded by other features with high values as well (from Esri "How Hot Spot Analysis Works").

In other words, because the threshold between polygons with crime incidents was so great, the tool could only determine "statistically significant" areas that were roughly the size of the districts.
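To make the quoted definition concrete, here is a minimal sketch of a Getis-Ord Gi* z-score in plain Python. It is not the Esri tool itself: it assumes binary adjacency weights with each polygon included in its own window, and the row of 12 polygons and their counts are hypothetical.

```python
import math

def gi_star_scores(values, neighbors):
    """Return a Getis-Ord Gi* z-score for every polygon.

    values    -- incident count per polygon
    neighbors -- dict mapping polygon index to adjacent polygon indices
    Assumes binary weights, with each polygon included in its own window.
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for i in range(n):
        window = set(neighbors.get(i, ())) | {i}
        w_sum = len(window)  # sum of binary weights (also the sum of squared weights)
        local_sum = sum(values[j] for j in window)
        numerator = local_sum - mean * w_sum
        denominator = std * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores

# Hypothetical toy study area: 12 polygons in a row, each adjacent to the next.
counts = [0, 0, 8, 0, 0, 0, 0, 6, 7, 8, 0, 0]
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < len(counts)]
         for i in range(len(counts))}
scores = gi_star_scores(counts, chain)

isolated, clustered = scores[2], scores[8]
print(f"isolated high count:             z = {isolated:.2f}")
print(f"high count among high neighbors: z = {clustered:.2f}")
```

In this toy case the polygon with 8 incidents surrounded by zeros stays well under the usual 1.96 cutoff, while the polygon with 7 incidents flanked by 6 and 8 clears it. That is the behavior described in the quote: a high value alone is not enough.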

You see, all 800 or so crime incidents occur in perhaps 75 communities, meaning that the vast majority of communities have no crime incident data, and those that do have incidents often are surrounded by communities without incidents. I thought perhaps if I added a "zero" for crime incidents in these communities, that might improve the analysis by giving the tool more data to crunch. Now, I don't think that is the case because: 

In cases where many of the grid polygons within the study area contain zeros for the number of events, increase the polygon grid size, if appropriate, or remove those zero-count grid polygons prior to analysis.

Question #1: Why would I remove zeros for Hotspot Analysis? Doesn't that simply reduce my sample size?

Question #2: Do I really want to do a "hotspot analysis"? Since the "crime communities" are rather dispersed, can I not just symbolize each community polygon by crime incident count (i.e. since almost all surrounding communities for a "crime community" have zeros, any crime is a "hotspot")? Which leads to my final question...

Question #3: My research question above... How, or with what GIS tool, can I determine which community polygon social structural attributes have the most effect on crime incidents? For reasons of data limitations, I had to assign slightly imprecise geolocated crime incidents to one central community when in reality incidents probably fell across 3 or 4 adjacent communities. Therefore, I'd like to be able to use inferential statistics to not only analyze the influence of my social structure attributes in the specific "crime community" but also account for the social structural impact of communities adjacent to the "crime community" at some spatial threshold.
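One common way to bring adjacent communities into a model is a spatial lag: for each community, average an attribute over its neighbors and use that average as an extra explanatory variable. The sketch below assumes simple polygon adjacency with equal (row-standardized) weights; the function name, the wealth values, and the adjacency dictionary are all hypothetical.

```python
import math

def spatial_lag(values, neighbors):
    """Average each polygon's neighbors' values (row-standardized weights)."""
    lagged = []
    for i in range(len(values)):
        adjacent = neighbors.get(i, [])
        if adjacent:
            lagged.append(sum(values[j] for j in adjacent) / len(adjacent))
        else:
            # Island polygon with no neighbors: flag it rather than guess a value.
            lagged.append(math.nan)
    return lagged

# Hypothetical toy data: five communities in a row with a wealth index each.
wealth = [10.0, 20.0, 30.0, 40.0, 50.0]
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

wealth_lag = spatial_lag(wealth, adjacency)
print(wealth_lag)  # [20.0, 20.0, 30.0, 40.0, 40.0]
```

Each community then carries both its own wealth and its neighbors' average wealth, so a regression can estimate the two effects separately. A distance threshold instead of adjacency would only change how the neighbor lists are built.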

Since the crime incidents are time stamped and some of the community attributes also vary temporally, ideally I'd also like to incorporate this into the analysis (e.g. if a community experiences a crime at Time 1, does that influence probability of crime at Time 2?). But I don't want to get ahead of myself just now.   

Apologies for the length but any assistance would be GREATLY appreciated.

Thanks!

5 Replies
DanPatterson_Retired
MVP Emeritus

Why do you need to do hotspot analysis? You don't.... It is the latest wave and just a new(ish) way of looking at the same thing with a slightly different twist and name (variants of dot density mapping come to mind).

Try a 'choropleth' map... symbolize by incident classes using a graduated colour scheme.  No randomly assigning locations within zones, and you can select as large or small a zone as you need (this raises issues, but fewer than hot spot analysis does).
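As a rough illustration of "incident classes", the sketch below keeps zero as its own class and splits the nonzero counts into quantile classes. The class scheme, function names, and counts are assumptions for illustration only; any GIS package's graduated-colour classification would do the same job.

```python
import math

def class_breaks(counts, n_classes=3):
    """Upper bounds of quantile classes computed from the nonzero counts only."""
    nonzero = sorted(c for c in counts if c > 0)
    if not nonzero:
        return []
    m = len(nonzero)
    return [nonzero[math.ceil(k * m / n_classes) - 1]
            for k in range(1, n_classes + 1)]

def classify(count, breaks):
    """Class 0 is reserved for zero; classes 1..len(breaks) are graduated."""
    if count == 0:
        return 0
    for label, upper in enumerate(breaks, start=1):
        if count <= upper:
            return label
    return len(breaks)

# Hypothetical incident counts for 12 communities.
counts = [0, 0, 1, 2, 0, 5, 9, 0, 14, 30, 0, 0]
breaks = class_breaks(counts)
classes = [classify(c, breaks) for c in counts]
print(breaks)   # [2, 9, 30]
print(classes)  # [0, 0, 1, 1, 0, 2, 2, 0, 3, 3, 0, 0]
```

Giving zero its own class keeps the many incident-free zones visible on the map instead of letting them swallow the lowest colour class.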

Zeros are valid observations and often say more than a nonzero value does... just make sure each 0 is a real observation and not some goof-up from assigning 0 to zones that have no data.

Sort out 1 and 2 first. Then look at 3, with the reminder that there is a whole body of literature out there on where crimes are committed and by whom, and there may or may not be a spatial association.  There are lots of factors and you will find no one reason for all crimes, etc.

AdrianWelsh
MVP Honored Contributor

James,

This sounds like a really interesting project and you likely will get interesting results. That is a fantastic reply by Dan, and I agree that using a choropleth map will yield a much better result than a hot spot analysis. Forgive me if this was stated, but what is the end-result of your research? Is it a printed map? Are you working towards a web map? Is it a written report?

As for the time analysis, you can do some fun things with a time-slider type approach. If you're using ArcMap, here are some details on how to utilize this:

Using the Time Slider window—Help | ArcGIS for Desktop 

If you plan on publishing it to the web, you need to first make sure your layer is time-enabled and then you can utilize a time slider using a widget in web app builder:

Time Slider widget—Web AppBuilder for ArcGIS | ArcGIS 

Also, it might be helpful to move this thread out of the GeoNet help and into a more appropriate space (maybe https://community.esri.com/community/gis/managing-data?sr=search&searchId=1248b519-333a-4293-b469-85...‌?).

DanPatterson_Retired
MVP Emeritus

How about Analysis‌... I will move it there with a 'share' to Mapping

JamesMisencik1
New Contributor II

Thank you very much Dan and Adrian. And thanks for situating this in Analysis.

Questions/ideas numbered at bottom again.

I wanted to provide some updates to let you know I appreciated your assistance and in case anyone else stumbles across this thread.

I added zeros, since they are valid observations, and this certainly helped produce useful choropleth maps for me. My explanatory variables are cleaned, look pretty, and I've been able to start some Exploratory Regression and OLS in ArcGIS Pro (my preferred platform, although I am comfortable with desktop).

However, I've come to realize why adding zeros creates problems. And, nothing against the great ESRI team creating these example videos, but it seems a bit misleading to provide videos of crime incident statistical analysis where the zero counts have all been dropped, without really saying why they were dropped, but presumably dropped to produce statistically significant results (but not reflecting the real world). Zeros CAN be dropped (I've come to find out), but only if those zeros are normally distributed. Mine are not normally distributed and presumably most zeros people would like to drop are not normally distributed.

I have come to find: the problem with zeros is that OLS/GWR cannot handle bounded, non-continuous data for the D.V., which often shows up in counts. And there is no way to Log or Exp transform a count variable with many zeros into a normal distribution. This was a huge surprise to me when I learned this/figured out what was going on. To me, most interesting questions/data depend on counts and counts tend to have a lot of zeros (e.g. "Q: How many times last month did you rob a liquor store?" "A: Um, zero...").
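A quick way to see why no transformation helps: any monotonic transform moves all the zeros to the same new value, so the pile-up stays exactly as large. The sketch below uses hypothetical counts (90 zeros out of 100) and `log1p`, since a plain log is undefined at zero.

```python
import math

def share_at_minimum(values):
    """Fraction of observations sitting exactly at the smallest value."""
    lowest = min(values)
    return sum(1 for v in values if v == lowest) / len(values)

# Hypothetical zero-heavy counts: 90 communities with no incidents, 10 with some.
counts = [0] * 90 + [1, 1, 2, 2, 3, 4, 6, 9, 15, 40]
logged = [math.log1p(c) for c in counts]  # log(1 + count)

share_raw = share_at_minimum(counts)
share_log = share_at_minimum(logged)
print(share_raw, share_log)  # 0.9 0.9
```

The transform compresses the long right tail, but 90% of the observations still sit on a single value, which no bell curve can accommodate.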

Any suggestions? In no particular order, here are some options I am considering:

1. Interpolate crime to those zero count areas. Seems pretty shady to me, not sure it passes the sniff test, or whether it would even help because a majority of low-value counts is effectively the same thing as having majority zeros in this case.

2. Drop the zeros and just do spatial statistics (OLS/GWR) on those that have crime counts. The problem is all those nice, peaceful neighborhoods do not get to contribute their data to the outcome.

3. Ditch spatial statistics altogether and move over to SPSS and run the Poisson/zero-inflated regression models, which do not assume a normal curve. This is the option I am leaning towards; however, it pains me to not be able to take into account neighbors (and makes me second guess all those statistical findings running around in the world that cannot account for geography...).

4. Magic.
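For option 3, here is a minimal sketch of what a Poisson regression does, written in plain Python so the mechanics are visible. It fits log(mu) = b0 + b1 * x by Newton-Raphson for a single covariate. This is a plain Poisson model, not a zero-inflated one, and the data, function name, and covariate are hypothetical. Nothing stops the covariate from being a neighbor-averaged attribute, which would be one way to keep geography in the model.

```python
import math

def fit_poisson(x, y, max_iter=50, tol=1e-10):
    """Fit log(mu) = b0 + b1 * x by Newton-Raphson; returns (b0, b1)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start at the intercept-only fit
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Entries of the negative Hessian (Fisher information).
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system by hand.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1

# Hypothetical data: a covariate scored 0, 1, 2 and incident counts per community.
x = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y = [0, 0, 1, 0, 0, 1, 2, 1, 3, 2, 4, 3]

b0, b1 = fit_poisson(x, y)
fitted = [math.exp(b0 + b1 * xi) for xi in x]
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

Zeros stay in the data as ordinary observations, and the fitted means are always positive, which is exactly what OLS cannot guarantee for counts.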

Adrian, to answer your question, this is part of my dissertation. I was hoping that it would be a relatively small component. I wanted to round out the statistical portion with interviews/other qualitative data. But this portion is hanging me up. I will definitely consider publishing it to an ESRI storyboard/blog, etc. Thanks a lot for those suggestions.

DanPatterson_Retired
MVP Emeritus

As you point out, all types of regression and correlation analysis are abused since there is this perverse belief that parametric statistics are somehow 'better' than their non-parametric counterparts, which make no claims about the underlying distributions and/or forms of the data.  It is possible to transform data to obtain the desired normality, but then you are left interpreting something like the cube root of time versus the inverse of the crime rate squared.  Map your data... there is nothing wrong with just mapping the data to show the patterns.  Roll out your own Poisson model ... 10 other people in the world may need it ... Have you heard of join count statistics? The poor sibling of Moran's I.  There are other options in the literature historically that never garnered much attention.  A research section in your thesis would be an excellent start to dispel the myth that the latest and greatest is the coolest or most appropriate.  There is nothing wrong with mapping what you have and discussing it.  There is no need for inferential statistics unless you are trying to 'infer' something.  Descriptive statistics 'describe' the state that is observed.  A clear statement about what is seen is far better than trying to justify some obscure 'correlation' between variates that happen to randomly (or hopefully) correlate.
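For anyone who has not met join count statistics: reduce each polygon to crime / no crime, count the shared borders ("joins") where both sides have crime, and compare that count with what random shuffling of the same flags produces. The sketch below is a minimal permutation version under those assumptions; the row of 20 polygons and the function names are hypothetical.

```python
import random

def bb_join_count(flags, edges):
    """Number of joins whose two polygons both have crime (flag == 1)."""
    return sum(1 for a, b in edges if flags[a] and flags[b])

def join_count_test(flags, edges, n_permutations=999, seed=0):
    """Return the observed count and a one-sided permutation p-value."""
    rng = random.Random(seed)
    observed = bb_join_count(flags, edges)
    shuffled = list(flags)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if bb_join_count(shuffled, edges) >= observed:
            at_least_as_large += 1
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value

# Hypothetical toy data: 20 polygons in a row, four adjacent ones with crime.
flags = [0] * 20
flags[8:12] = [1, 1, 1, 1]
edges = [(i, i + 1) for i in range(len(flags) - 1)]

observed, p_value = join_count_test(flags, edges)
print(observed, p_value)
```

Because it only needs presence or absence, the statistic is indifferent to how skewed or zero-heavy the underlying counts are.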

So my suggestion

  • don't do 1... shady it would be
  • don't do 2... you should be studying the peaceful neighborhoods and dump all those with crimes
  • non-parametric statistics would be my suggestion.  don't forget neighbors may or may not share common characteristics, so sometimes proximity and association/similarity don't go hand-in-hand
  • not magic... just a clearly articulated argument, including reasons why that is better than pursuing spurious attempts to incorporate some inferential spatial component.  There are web pages in statistics devoted to showing spurious stats tests in both attribute and spatial analysis.... they make good reading and are good examples of why not to include something that you have no faith in, and no one else should either, unless there is justification to measure the strength of a demonstrated association and pattern.  Then again doing this without examining what you are doing or have found in spatial isolation may also be a venture in futility.
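As one example of the non-parametric route, a Mann-Whitney U statistic compares an attribute between communities with and without crime using only the ordering of the values, so no distribution is assumed. The sketch below is a minimal version; the population figures are hypothetical, and a full test would add a p-value from tables or permutation.

```python
def mann_whitney_u(sample_a, sample_b):
    """Count the (a, b) pairs where a exceeds b; ties count as half a pair."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# Hypothetical populations of communities with and without recorded crime.
with_crime = [5200, 8100, 9400, 12000]
without_crime = [900, 1500, 2300, 3100, 4800, 6000]

u = mann_whitney_u(with_crime, without_crime)
share = u / (len(with_crime) * len(without_crime))
print(u, round(share, 3))  # 23.0 0.958
```

The share is the proportion of pairs in which the crime community is the more populous one; a value near 0.5 would mean population tells the two groups apart no better than a coin flip.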

Statistics, maps and magic ... don't try to use them to mask a poorly articulated argument.... Good luck