ArcGIS 10.3.1
Data Sets: Sub-Saharan African grids with resolutions ranging from ~250m (~260 million cells) to ~1km.
Objective 1: Using a set of rules... select research site data (~4200 points) based on the grid values at a user-generated point (script).
Objective 2: Define an inference space for the user-generated point and estimate crop production (script).
Knowing I needed the inference space, I selected the research sites spatially.
Not knowing the computing power of potential users... I didn't run the analysis in memory... so the analysis isn't quick. As a result... I was asked to also select research sites by table.
Spatial Analysis Description:
1. Apply rules to the grids (generates 0s for cells that don't match the criteria and 1s for cells that do).
2. Sum the grids (for a given cell: 0 = no grids match the criteria, up to 7 = all grids match).
3. Convert cells with value = 7 to a shapefile.
4. Intersect the shapefile with the research site points.
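The rule-and-sum step can be sketched in NumPy as a stand-in for the raster algebra; the two toy grids and thresholds below are invented for illustration:

```python
import numpy as np

# Two toy criterion grids standing in for the 7 rule rasters
# (values and thresholds are invented for illustration).
dem = np.array([[100, 800], [1200, 400]])
rain = np.array([[900, 300], [700, 1100]])

dem_ok = (dem > 300).astype(int)    # rule 1: 0 = fails, 1 = passes
rain_ok = (rain > 500).astype(int)  # rule 2

total = dem_ok + rain_ok            # per cell: 0 .. number of rules
inference = (total == 2)            # cells where ALL rules pass
```

With 7 grids the final mask would be `total == 7`, mirroring the "cell value = 7" step above.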
Tabular Analysis:
1. Extract grid values for the research site points (ExtractMultiValuesToPoints).
2. Apply the rules to the grid values (TableSelect).
3. Export to an Excel file.
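Applied tabularly, the same rules amount to a row filter over the extracted values — roughly as below (site ids, field names, and thresholds are invented):

```python
# Toy rows mimicking ExtractMultiValuesToPoints output
# (site ids, field names and values are invented).
sites = [
    {"id": 1, "DEM500": 650, "RAIN": 900},
    {"id": 2, "DEM500": 1400, "RAIN": 400},
    {"id": 3, "DEM500": 1100, "RAIN": 800},
]

# Equivalent of a TableSelect where clause such as
# "DEM500 > 700 AND DEM500 < 1300 AND RAIN > 500"
selected = [s for s in sites
            if 700 < s["DEM500"] < 1300 and s["RAIN"] > 500]
```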
Problem:
I compared the spatial and tabular output at a point. The totals were nearly identical (391 spatial; 392 tabular), but the overlap was only ~75% (i.e. 91 spatially-selected points didn't match the tabular analysis - which is - of course - the correct answer).
I didn't dice the rasters into identically sized / oriented cells due to file size considerations. So ArcGIS resampled all the rasters to the coarsest resolution (~1km) before summing the grids. Presumably this spatial "rounding" is the source of the error.
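A tiny sketch of why the resampling matters: when 16 fine (~250m) cells collapse into one coarse (~1km) cell, a single fine value can end up speaking for the whole block (nearest-neighbour shown; values invented):

```python
import numpy as np

# A 4x4 block of ~250m 0/1 cells that becomes ONE ~1km cell.
fine = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
])  # only 4 of 16 fine cells pass the rule

# Nearest-neighbour resampling keeps a single representative value:
coarse = fine[0, 0]   # the whole km cell now counts as passing

# Fine cells whose outcome was flipped by the coarsening:
flipped = int((fine != coarse).sum())
```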
Possible Solutions:
a) I could dice all the grids to 250m and make them overlap, and the mismatches would go away. But depending upon how far off the grids are relative to each other... shifting the cells could introduce significant error.
b) I could also create a set of points from the ~250m centroids (network points) and extract their associated grid values. For a given user-selected point, I would select network points using the tabular analysis described above and spatially identify the research sites that are associated with the selected network points [i.e. convert selected network points back to 250m raster cells, convert cells to shapefile and intersect shapefile with research site points].
As long as the 7 grids aren't aligned there will be mismatch errors. But the increased resolution of this analysis should decrease the number of mismatches.
Also... the productivity data could be extracted to the network points - eliminating the need to distribute productivity grids. But the user would still need the 7 grids to identify the grid values at the user-defined point.
c) Create a new ~250m integer grid from the network points. So each cell would have the 7 associated grid values. Then extract those 7 grid values to the user-identified point... which would eliminate mismatches.
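A minimal sketch of the option "c" lookup, assuming a regular ~250m grid with a known origin (the origin, cell size, and stored values below are all invented):

```python
import numpy as np

CELL = 250.0            # assumed cell size (m)
X0, Y0 = 0.0, 1000.0    # assumed top-left origin of the grid (invented)

# Pre-associated values: shape (rows, cols, 7) -> all 7 grid values per cell.
values = np.zeros((4, 4, 7))
values[1, 2] = [500, 800, 3, 1, 12, 0, 7]   # invented values for one cell

def lookup(x, y):
    """Return the 7 pre-extracted grid values for a user point."""
    col = int((x - X0) // CELL)
    row = int((Y0 - y) // CELL)
    return values[row, col]

vals = lookup(550.0, 700.0)   # this point falls in row 1, col 2
```

Because the 7 values are pre-associated with each cell, the user point inherits them directly and no per-grid resampling can introduce mismatches.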
Summary:
I'm leaning towards option "c"... but I'm concerned about efficiency and error generation.
Is point analysis ever preferable to raster analysis? (Predefining the relationship between the 7 grids at a series of points has to be worth some efficiency points.)
Are 7 single-valued grids more efficient to use than a single grid that has 7 values associated with each cell? The latter would likely require some sort of raster look-up function.
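One way to picture the single multi-value grid: stack the 7 arrays so each cell exposes all 7 values in one read (sizes and values invented):

```python
import numpy as np

# Seven single-value grids vs. one stacked "multi-value" grid.
grids = [np.full((3, 3), i) for i in range(7)]   # toy stand-ins
stacked = np.stack(grids)                        # shape (7, rows, cols)

# One slice returns all 7 values for a cell -- the "look-up":
cell_values = stacked[:, 1, 2]
```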
How would the raster look-up function analysis compare with the TableSelect analysis (with respect to efficiency)?
If I were to run the analysis in memory - would the above questions become moot?
Not sure I follow all of it, but here are some points, or points that need clarification.
Any further information as you see fit.
PS
Where exactly in SS-Africa? When you say production, which producing group are you referring to? What types of crops? Farmer selection is not necessarily done to maximize yield, for example.
Before responding to your points... let me clarify...
My code works... but I don't like the answer.
When I spatially identify the research sites that conform to a set of rules, ~25% of the selections don't meet the criteria. I know this because the selected table rows contain the actual grid values, and ~25% of those field values clearly don't match the criteria.
Because the file contains the actual grid values associated with each research site, I can also apply the rules directly to the shapefile table. And when I do that... ~25% of the tabular selections (though correct) aren't selected by the spatial analysis.
Varying grid cell size is the primary source of the spatial selection error: ArcGIS resamples cells to the coarsest size / grid orientation before doing the math (in this case... the inference space rules).
I can resolve the problem by changing the cell size / grid orientation of all the grids to match that of the highest resolution cell. The ~1km cell would become 16 ~250m cells and the ~500m cell would become 4 ~250m cells. That's option a.
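Option a, sketched with NumPy: each ~1km value is simply replicated into an aligned 4x4 block of ~250m cells (toy values):

```python
import numpy as np

km = np.array([[5, 9],
               [2, 7]])   # toy ~1km grid

# Dice each 1km cell into 4 x 4 = 16 aligned ~250m cells.
fine = np.kron(km, np.ones((4, 4), dtype=km.dtype))
```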
I can convert the highest resolution grid to points (file geodatabase), extract the 7 grid values associated with each point, and apply the inference space rules to the extracted values in the file geodatabase. The selected points represent the entire inference space. Then I'd use the selected points to identify the associated research site points. One approach: convert the selected points back into a ~250m grid, convert the grid to polygons, and intersect the polygons with the research sites. That's option b.
Option c: Convert the highest resolution grid to points (file geodatabase), extract the 7 grid values associated with each point and convert those points into an integer grid. (The grid format might not be the best raster format for this approach due to the number of cells and table values.)
Crop production is only relevant in that I'll need an areal representation of all locations that meet the inference space criteria (whether or not there are research sites at those cells) in order to characterize the production potential of that area. "All locations" means all of Sub-Saharan Africa.
The table also contains crop x nutrient modeling coefficients developed from research site data. Some coefficients are based on a considerable amount of research. Others are not. The modeling coefficients will allow a researcher to locate potential undeveloped production areas.
As for your comments...
>The Extract Values to Points tool extracts a grid value for a point and creates a new point file. But the Extract Multi Values to Points tool modifies the original point file (i.e. adds the extracted values to it).
However... I did run into a data corruption issue. I usually create point shapefiles from tab-delimited text files exported from Excel. The coefficient values are all floating point. But if the first row of coefficients happened to be integers, ArcGIS turned all the coefficient data into integers. Now I'm creating point files directly from Excel files (no extra rows or columns) and I haven't run into any problems.
>TableSelect worked for me. Here's the code for selecting extracted dem values from the research site point file (Data9-28-15.shp) using the "dem" value associated with a user-selected point:
# ======= 6 of 7 =========
demmax = 5818  # maximum value in the dem500 grid
print "dem500 resolution ~500m; Range: -155 - 5818; Selected dem500 value: " + str(dem)
# Decrease the amount to add (demAdd) if dem + 300 exceeds the grid maximum
if demmax - dem < 300:
    demAdd = demmax - dem
    ##print "Will add " + str(demAdd) + " instead of 300"
else:
    demAdd = 300
    ##print "Will add 300"
demGT = dem - 300
demLT = dem + demAdd
demField = arcpy.AddFieldDelimiters("Data9-28-15.shp", "DEM500")
# Create rule-based where clause for TableSelect
if dem < 700:
    whereClause = "{0} <= {1}".format(demField, 1000)
    print whereClause
else:
    whereClause = "{0} > {1} AND {0} < {2}".format(demField, demGT, demLT)
    print whereClause
# Select matching rows and output to a new table
arcpy.TableSelect_analysis("D:/New/.../Out/sandout", "D:/New/.../Out/demout", whereClause)
My arcpy/Python background consists of the 2-3 hour ESRI online short courses (Python Basics & Python Scripting for ArcGIS), a book, and a whole lot of internet searching. I don't know anything about Python/NumPy.
I think I've addressed your other points.
Maribeth... I will digest this tonight.
My query about the coverage was dealing with the raster representation... whether it is fairly continuous with few NoData areas, or spotty, with valued cells clustered around points. That has a lot to do with raster processing and efficiency (ergo my query about NumPy). Hence: are you trying to process the whole of SS-Africa as one unit, or are you interested in solutions that would tile it into sub-regions? No rush, this is obviously not one of those easy-answer questions. Should you need/want to take anything off-line, you can email me at my Uni email, Dan_Patterson @ Carleton.ca ... it is on my GeoNet profile if you forget.