POST
At present the z-scores are computed using the mathematics that we've documented (http://resources.arcgis.com/en/help/main/10.2/index.html#/How_Hot_Spot_Analysis_Getis_Ord_Gi_works/005p00000011000000/, http://resources.arcgis.com/en/help/main/10.2/index.html#/How_Cluster_and_Outlier_Analysis_Anselin_Local_Moran_s_I_works/005p00000012000000/). These formulas were obtained from the seminal articles about the methods (the articles are listed below). We are not using Monte Carlo methods and, at present, are not computing z-scores using permutation (conditional randomization). Our z-score calculations are based on the randomization null hypothesis (theoretical distribution)... they are not based on simulation or permutation.

P-values have a one-to-one correspondence with z-scores (i.e., a z-score of +/-1.96 will always equate to a p-value of 0.05). Our tools calculate z-scores and then translate those z-scores to p-values. Our tools report both z-score and p-value results.

Our empirical tests support the seminal work on Gi* by Getis and Ord who, in their 1992 paper, show that the statistic is asymptotically normal. Because z-scores have a normal distribution, people often ask us if it is valid to run Hot Spot Analysis (Gi*) on data that is skewed. The answer is yes, as long as the threshold distance you use is not too small or too large. How do we know? We start with very skewed data sets (like crime counts) and then compare the calculated p-values, based on the asymptotic z-scores, to the pseudo p-values obtained from permutations (conditional randomization). We found that with as few as 16 neighbors, the asymptotic results provided the same significance as the permutations did over 99.9% of the time. We tested this on over 10 different skewed data sets, including mixed discrete/continuous models.

In Anselin's article (citation below, page 99), the mathematics for calculating z-scores based on the randomization null hypothesis is given (equations 13, 14, and appendix A). The author indicates that a test for significant local spatial association may be based on these equations, but notes that the exact distribution is unknown; he suggests a conditional randomization alternative. Our empirical testing confirms that the permutation approach will be more accurate for this statistic when data is skewed; the Local Moran's I statistic does not appear to be asymptotically normal. We have already begun the development work to compute z-scores using permutation and will put this functionality into the next release of ArcGIS.

With the 10.1.2 release of ArcGIS we added a False Discovery Rate (FDR) p-value correction. We still report the uncorrected z-scores and p-values but use the correction to account for multiple testing and spatial dependency. For more about the FDR correction, please see: http://resources.arcgis.com/en/help/main/10.2/#/What_is_a_z_score_What_is_a_p_value/005p00000006000000/

Here are some additional resources:
• 1992 Getis and Ord paper: http://onlinelibrary.wiley.com/doi/10.1111/j.1538-4632.1992.tb00261.x/abstract
• 1995 Ord and Getis paper (this is the version of the Gi* we implement): http://onlinelibrary.wiley.com/doi/10.1111/j.1538-4632.1995.tb00912.x/abstract
• Seminal Anselin paper used as a basis for our Cluster and Outlier Analysis tool: Anselin, Luc. "Local Indicators of Spatial Association - LISA." Geographical Analysis 27, no. 2 (April 1995): 93-115.
• Very good article about FDR: Caldas de Castro, Marcia, and Burton H. Singer. "Controlling the False Discovery Rate: A New Application to Account for Multiple and Dependent Tests in Local Statistics of Spatial Association." Geographical Analysis 38 (2006): 180-208.

Please let me know if I have not answered your question.
Best wishes,
Lauren

Lauren M Scott, PhD
Esri
Geoprocessing, Spatial Statistics
Posted 05-29-2014 02:42 PM
POST
Hi Susanne,

Yes, you are correct... if you want to compare crime from one time period to another for the same location, it is important to use the same distance band. Please keep in mind that results from Hot Spot Analysis are correct for whatever distance band you use... When you don't have any criteria to help you select a particular distance band, you can use Incremental Spatial Autocorrelation, Calculate Distance Band from Neighbor Count, and/or Optimized Hot Spot Analysis to find an appropriate distance band for your analysis.

These are some of the strategies I would try if I had several years of crime data and wanted to compare the hot spot (and Global Moran's I) results:

1) I would use Optimized Hot Spot Analysis (OHSA) to find the optimal distance for each year and write down the result. Or for 10.1, I would use Incremental Spatial Autocorrelation (ISA). Suppose I see the following:

Year    "Optimal" distance from OHSA or ISA
2004    2301.345
2005    4043.223
2006    2290.456
2007    2310.987
2008    2301.842

Because most years have distances around 2300, if that distance seems to fit the scale of the question I'm asking, I would use that as the distance band for every year (even for 2005). (A small scripting sketch of this loop appears at the end of this post.)

2) If the distances above are all over the place, I would create a single feature class with crimes from all years, run OHSA (or ISA) on all crimes, and use whatever distance it returns consistently when I do my year-by-year analyses.

3) The best solution (not always possible) is to have a reason for selecting your distance band... if remediation/crime prevention will be neighborhood by neighborhood, for example, I might try to come up with a distance that best reflects neighborhood structure in my study area... or perhaps I could try to find theory or evidence to tell me the distances over which related crimes occur.

Sensitivity Analysis: Your goal is to make sure your model isn't overfit and that it predicts well across data samples. When a model is overfit, you will get a very different result by removing just a few observations. Here is one strategy (there certainly are others) to help you feel confident that you've found a trustworthy model:

1) Find a model for your full data set.
2) Randomly sample 50% of the data and make sure that when you apply the model in (1) to both 50% samples, you still have a properly specified model (a properly specified model is one that meets all of the assumptions of the OLS method).

I hope this helps!
Best wishes,
Lauren

Lauren M Scott, PhD
Esri
Geoprocessing, Spatial Statistics
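Strategy (1) above is easy to script. The sketch below is just one way to do it, assuming Python with arcpy, a made-up geodatabase path, and yearly point feature classes named crime_2004 through crime_2008 (all placeholders). Optimized Hot Spot Analysis reports the distance band it settles on in its geoprocessing messages, so the loop simply captures those messages for each year; verify parameter order against your ArcGIS version.

```python
import arcpy

arcpy.env.workspace = r"C:\data\crime.gdb"   # hypothetical workspace

for year in range(2004, 2009):
    in_fc = "crime_{0}".format(year)
    out_fc = "ohsa_{0}".format(year)
    # No Analysis Field: incidents are counted into a fishnet grid by default.
    arcpy.OptimizedHotSpotAnalysis_stats(in_fc, out_fc)
    print("=== {0} ===".format(year))
    print(arcpy.GetMessages())   # includes the "optimal" distance band the tool chose

# After picking one distance that fits most years (say roughly 2300), use it as a
# fixed distance band for every year, for example:
# arcpy.HotSpots_stats("crime_2004", "CRIME_CNT", "hotspots_2004",
#                      "FIXED_DISTANCE_BAND", "EUCLIDEAN_DISTANCE", "NONE", 2300)
```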
Posted 01-28-2014 08:59 AM
POST
If you have ArcGIS 10.2, you can try the new Optimized Hot Spot Analysis tool. To constrain your analysis to the road network you can:

1) Create a buffer around the road network and use this as your Bounding Polygon when you run Optimized Hot Spot Analysis (you might have to experiment a bit with different buffer widths).
2) Run Optimized Hot Spot Analysis on your accident point data and use the buffer polygon for the parameter called Bounding Polygons Where the Incidents Are Possible.

This will aggregate the events into fishnet polygons along the road network. It will automatically identify an appropriate scale of analysis and will apply an FDR correction to the results. Keep in mind that this answers the question: where are there lots of accidents? It does not take into account differences in the density of the road network or traffic volumes. There are no doubt many ways to analyze your data; I hope this suggestion is helpful.

Best wishes!
Lauren

Lauren M. Scott, PhD
Esri
Geoprocessing, Spatial Statistics
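Here is a minimal arcpy sketch of those two steps, assuming made-up dataset names and an arbitrary 200-meter buffer width (something to experiment with, not a recommendation); check the Optimized Hot Spot Analysis parameter order against your ArcGIS version.

```python
import arcpy

arcpy.env.workspace = r"C:\data\accidents.gdb"   # hypothetical workspace

# 1) Buffer the road network to define where accidents are possible.
arcpy.Buffer_analysis("roads", "roads_buffer", "200 Meters",
                      dissolve_option="ALL")

# 2) Run Optimized Hot Spot Analysis on the accident points, passing the buffer
#    as the Bounding Polygons Where the Incidents Are Possible parameter, so the
#    fishnet aggregation stays along the road corridor.
arcpy.OptimizedHotSpotAnalysis_stats(
    "accidents", "accident_hot_spots",
    "#",                                        # no Analysis Field: count incidents
    "COUNT_INCIDENTS_WITHIN_FISHNET_POLYGONS",
    "roads_buffer")
```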
Posted 01-23-2014 03:37 PM
POST
Polermo,

Another strategy 🙂

1) Make sure your polygon features have an ID field (so every polygon has a unique ID). Then do a Spatial Join on the points (Target = points, Join Features = polygons) to add that ID value to each point. Now each point "knows" which polygon it is in.
2) Run the Mean Center tool on the points. Use the polygon ID associated with each point as the Case Field. This will create a centroid for each polygon weighted by the points inside.

I hope I have correctly understood your objectives.
Best wishes!
Lauren

Lauren M Scott, PhD
Esri
Geoprocessing and Spatial Statistics
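A small arcpy sketch of those two steps, assuming the polygon ID field is called POLY_ID and that the layer names are placeholders:

```python
import arcpy

arcpy.env.workspace = r"C:\data\analysis.gdb"   # hypothetical workspace

# 1) Spatial Join: each point picks up the ID of the polygon it falls within.
arcpy.SpatialJoin_analysis("points", "polygons", "points_with_polyid",
                           "JOIN_ONE_TO_ONE", "KEEP_ALL",
                           match_option="WITHIN")

# 2) Mean Center with the polygon ID as the Case Field: one center per polygon,
#    located at the mean position of the points inside it.
arcpy.MeanCenter_stats("points_with_polyid", "polygon_point_centers",
                       "#", "POLY_ID")
```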
Posted 01-23-2014 01:47 PM
POST
Hi Susanne,

Both the Spatial Autocorrelation (Global Moran's I) and the Hot Spot Analysis (Getis-Ord Gi*) tools are asymptotically normal, so you do not need to transform your variables as long as you select a distance band that ensures every feature has at least a few neighbors and no feature has all other features as neighbors. If you have ArcGIS 10.2 or later, you can let the Optimized Hot Spot Analysis (OHSA) tool find an optimal distance value for you. Run OHSA on your polygons using your crime counts or ratios as your Analysis Field. Lots of information is written to the Results window, including what the tool identified as an optimal distance band (please see the second paragraph in this tool doc for more information: http://resources.arcgis.com/en/help/main/10.2/#/How_Optimized_Hot_Spot_Analysis_Works/005p00000057000000/). Use the same distance OHSA finds to be optimal when you run Spatial Autocorrelation. If you have an earlier version of ArcGIS, please let me know and I will send the instructions for finding an appropriate distance band. (A small scripting sketch of this workflow appears at the end of this post.)

For Exploratory Regression I usually only transform variables if I'm seeing curvilinear relationships... but it sometimes also helps if I'm having trouble finding an unbiased model. OLS regression does not require you to have normally distributed dependent or explanatory variables. It DOES require you to have normally distributed, unbiased model residuals. If Exploratory Regression finds passing models, you can be confident you have found a model that meets all of the requirements of the OLS method. Whenever I use Exploratory Regression to find my properly specified model, however, I will want to:

• Make sure all my candidate explanatory variables are supported by theory, or at least make sense or are supported by experts in the field.
• Run a sensitivity analysis to make sure my model is not overfit. There are a number of ways to do this. One way is to randomly divide your data into two parts. Find your model using half the data, and then make sure the model is still valid for the other half of the data (valid meaning that it meets all the requirements of the OLS method).

Here are some resources that may be helpful:
http://resources.arcgis.com/en/help/main/10.2/#/Regression_analysis_basics/005p00000023000000/ (especially the section called Regression Analysis Issues)
http://resources.arcgis.com/en/help/main/10.2/index.html#//005p00000053000000

I hope this helps! Very best wishes with your research,
Lauren

Lauren M Scott, PhD
Esri
Geoprocessing, spatial analysis, spatial statistics
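Here is a rough arcpy sketch of that first workflow, assuming a polygon feature class named tracts with a crime count field named CRIME_CNT (both placeholders); the distance value shown is invented, so substitute whatever OHSA actually reports in its messages, and verify parameter order for your ArcGIS version.

```python
import arcpy

arcpy.env.workspace = r"C:\data\crime.gdb"   # hypothetical workspace

# Let Optimized Hot Spot Analysis pick a distance band for the polygon counts.
arcpy.OptimizedHotSpotAnalysis_stats("tracts", "tracts_ohsa", "CRIME_CNT")
print(arcpy.GetMessages())       # the messages report the optimal distance band

optimal_distance = 2300          # placeholder: use the value OHSA reported

# Reuse that same distance band for Spatial Autocorrelation (Global Moran's I).
arcpy.SpatialAutocorrelation_stats(
    "tracts", "CRIME_CNT", "GENERATE_REPORT",
    "FIXED_DISTANCE_BAND", "EUCLIDEAN_DISTANCE", "ROW",
    optimal_distance)
```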
Posted 01-15-2014 01:29 PM
POST
Hi Anthonia,

Here are some suggestions that may help you answer your question. There are actually several tools that could help you analyze your presence/absence data. You could run Average Nearest Neighbor on all your surveyed households and compare that to Average Nearest Neighbor for just the positive households. This would help you answer questions about overall clustering of the positives. You could create a composite line graph of K Function results for all households surveyed and also for just the positive households to understand if there are any particular distances where clustering is statistically significant. If you have ArcGIS 10.2 or later, however, I recommend that you use the Optimized Hot Spot Analysis tool. This will give you information about where the disease-impacted households are clustered. Because you have both presence AND absence information, I recommend the following (a scripting sketch of these steps appears at the end of this post):

1) Run Optimized Hot Spot Analysis (OHSA) on all households surveyed. Do not use an Analysis Field. Select to aggregate to a fishnet grid (COUNT_INCIDENTS_WITHIN_FISHNET_POLYGONS). This will create a fishnet polygon grid that overlays the households that you surveyed, and for now, that's what you want! Let's call this fishnet polygon grid TheStudyArea. Because you have done the analysis on ALL households surveyed, you will get coverage even in locations with no PRESENCE... and that's what you want.

2) Now run Optimized Hot Spot Analysis again, only on the points (households) that were positive. This time select to aggregate incidents to aggregation polygons (COUNT_INCIDENTS_WITHIN_AGGREGATION_POLYGONS). Use TheStudyArea for your aggregation polygons (Polygons For Aggregating Incidents Into Counts). This analysis answers the question: where are there lots of positive households?... This might be important if you want to make sure you get resources to the places where they are most needed. Note, however, that this analysis does not control for the number of households surveyed... we are only answering where we have LOTS of households that are disease impacted. We will control for household density next.

3) Now do a spatial join so that each polygon grid cell in TheStudyArea gets a count representing the number of households that were surveyed (positive or negative) and also a count of the number of positive households. Add a new field to TheStudyArea and calculate it to be the ratio: positives divided by all households. Now run OHSA on this ratio. This will answer the question: where do we have unexpected clustering of disease-impacted households given the spatial distribution and density of the households surveyed?

I hope this information is helpful. If you have questions about the spatial statistics tools, I hope you will contact me (LScott@esri.com). If the documentation is not clear or is not complete, please point that out to me so that I may correct the oversight as quickly as possible.

Very best wishes with your research!
Lauren

Lauren M. Scott, PhD
Esri
Geoprocessing, Spatial Analysis, Spatial Statistics
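For reference, here is one possible arcpy sketch of steps 1-3, assuming point layers named households_all and households_positive in a made-up geodatabase; every name is a placeholder, the field renames are only there to keep the two join counts apart, and you should verify each tool's parameter order in your ArcGIS version.

```python
import arcpy

arcpy.env.workspace = r"C:\data\survey.gdb"   # hypothetical workspace

# 1) OHSA on ALL surveyed households, counted into a fishnet grid.
#    The output grid becomes the study area for the later steps.
arcpy.OptimizedHotSpotAnalysis_stats(
    "households_all", "TheStudyArea", "#",
    "COUNT_INCIDENTS_WITHIN_FISHNET_POLYGONS")

# 2) OHSA on only the POSITIVE households, aggregated into that same grid.
#    Answers: where are there lots of positive households?
arcpy.OptimizedHotSpotAnalysis_stats(
    "households_positive", "positive_hot_spots", "#",
    "COUNT_INCIDENTS_WITHIN_AGGREGATION_POLYGONS", "#", "TheStudyArea")

# 3) Count all surveyed and positive households per grid cell, compute the ratio,
#    and run OHSA on the ratio. Spatial Join adds a Join_Count field each time;
#    rename it after each join so the two counts stay distinguishable.
arcpy.SpatialJoin_analysis("TheStudyArea", "households_all", "grid_all")
arcpy.AlterField_management("grid_all", "Join_Count", "ALL_HH", "ALL_HH")
arcpy.SpatialJoin_analysis("grid_all", "households_positive", "grid_counts")
arcpy.AlterField_management("grid_counts", "Join_Count", "POS_HH", "POS_HH")

arcpy.AddField_management("grid_counts", "POS_RATIO", "DOUBLE")
# Every cell in the step 1 grid contains at least one surveyed household, so the
# denominator should not be zero.
arcpy.CalculateField_management(
    "grid_counts", "POS_RATIO",
    "!POS_HH! / float(!ALL_HH!)", "PYTHON_9.3")

arcpy.OptimizedHotSpotAnalysis_stats("grid_counts", "ratio_hot_spots", "POS_RATIO")
```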
Posted 01-15-2014 09:03 AM
POST
Is it possible that some of the records in your dataset are missing geometry? Try this: open a fresh, brand new ArcMap session and add your feature class to the TOC.

Test 1: Zoom to the layer... if everything is good so far, zooming will not change what you see at all. If suddenly your features are squished together very small, there are one or more features that are really far away from the majority of your features, and you will want to try to locate those stinkers.

If the zoom test worked and nothing unexpected happened:

Test 2: Using the interactive selection tool, select all of the features (just draw a selection box around all features). Then open the table for the feature class and see how many features it says are selected. If everything is good so far, ALL of the features will be selected. If the number selected is less than the number in the table, some of your features are missing geometry. Do a switch selection and either correct the geometry or remove those features from the feature class.

Still no clue about why the mean center is AWOL?

Test 3: Run the Mean Center tool. Open the table to see the values for the X and Y coordinates... do they look reasonable? If not:
** Use the Export Attributes to ASCII text file tool in the Spatial Stats toolbox, Utilities toolset.
** Check the X and Y coordinates for your features. You should see the expected number of records, and the X and Y coordinates should be reasonable. If they aren't, you will need to correct or delete the problem features.

Test 4: After running the Mean Center tool, zoom to the mean center output layer... did you find the mean center now?

If none of this works, I don't know what's wrong... perhaps you can send your data to Tech Support to see if they can reproduce the problem. Sorry you are having problems with the tool.

Best wishes!
Lauren
Esri
Geoprocessing and Analysis
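If you prefer to check for missing geometry with a quick script instead of the interactive selection in Test 2, a cursor can do it; the sketch below assumes a made-up feature class path.

```python
import arcpy

fc = r"C:\data\analysis.gdb\my_features"   # hypothetical feature class

total = 0
missing = 0
with arcpy.da.SearchCursor(fc, ["OID@", "SHAPE@"]) as cursor:
    for oid, shape in cursor:
        total += 1
        if shape is None:          # null geometry
            missing += 1
            print("Feature {0} has no geometry".format(oid))

print("{0} of {1} features are missing geometry".format(missing, total))
```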
Posted 03-29-2013 10:08 AM
POST
I may not understand what you actually want to do... so forgive me if I'm off the track. I'm understanding that you have 4 binary fields and want to know the average number of 1's (across those fields) for each respondent. So if respondent A has 2 zeros and 2 ones you would like to know that the average is 0.5. Is that correct? If so, you can use Calculate Field... create your new Mean1s field, then calculate it as the sum of the 4 fields, divided by 4. Hope this helps! Lauren Esri Geoprocessing and Analysis
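A tiny sketch of that Calculate Field suggestion, assuming the four binary fields are named Q1 through Q4 and the table path is a placeholder:

```python
import arcpy

table = r"C:\data\survey.gdb\responses"   # hypothetical table

arcpy.AddField_management(table, "Mean1s", "DOUBLE")
# Average of the four 0/1 fields: e.g. two 1s and two 0s -> 0.5.
arcpy.CalculateField_management(
    table, "Mean1s",
    "(!Q1! + !Q2! + !Q3! + !Q4!) / 4.0",
    "PYTHON_9.3")
```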
Posted 03-29-2013 09:46 AM
POST
Ash and I tried a CSV file and it worked fine. Hopefully clearing the cache will do the trick? Thanks, Lauren
Posted 03-22-2013 04:59 PM
POST
Hoping these ideas might be helpful:

1) Most tools work on a selection set, so you could do a query to select features with the same ID and then summarize distances.
2) If there are lots of clustered IDs and you don't know ahead of time what the values will be, you can create a model to loop through all of the unique values in a field (the ID field) and process them. You would use the Field Value iterator in ModelBuilder.
3) The Generate Near Table tool might provide the summary information you need.
4) If you run the Near tool to add a field with the distance to each feature's nearest neighbor, you could then run Summary Statistics, group by ID (case field), and compute the average, minimum, range, etc. statistics for each cluster (in other words, summarize on the distance field created when you ran Near). A small sketch of this appears at the end of this post.

Perhaps one of these solutions will work. Hope so.
Best wishes!
Lauren Scott
Esri
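Here is a small sketch of suggestion 4, assuming a point feature class with a cluster ID field named CLUSTER_ID (names and paths are placeholders):

```python
import arcpy

points = r"C:\data\analysis.gdb\clustered_points"   # hypothetical feature class

# Near adds NEAR_FID and NEAR_DIST fields: the distance from each point to its
# nearest neighboring point.
arcpy.Near_analysis(points, points)

# Summary Statistics on NEAR_DIST, grouped by the cluster ID.
arcpy.Statistics_analysis(
    points, r"C:\data\analysis.gdb\cluster_distance_stats",
    [["NEAR_DIST", "MEAN"], ["NEAR_DIST", "MIN"], ["NEAR_DIST", "MAX"]],
    "CLUSTER_ID")
```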
Posted 03-07-2013 02:40 PM