Spatial join with HUGE datasets

ManuelFrias · ‎07-16-2015

How to make a spatial join with ArcGIS with these HUGE datasets?

Target: almost 3 million squares 500x500m size (the attribute table is basically empty)

Join feature: 12000 polylines

Operation: one to many

Match: intersect

All my attempts have failed:

From the Spatial Join tool in the Toolbox: after 15 min the output is a file where no join has been made (Join_Count field is 0 in all rows)

From the dialog box: It seems to work but my estimate is that it takes 4 days to complete - without any guarantee the output will be correct.

It works when I do a Spatial Join of a small area.

Any ideas on how to do this in a proper way?

I am using file geodatabase, ArcGIS 10.2, background processing, 23 GB RAM

(The idea is to overlay those polylines on the grid (the 500x500m squares). I will then run a Dissolve to find out how many lines are in each square. Finally, I will make a density map.)

XanderBakker · ‎07-16-2015

Would the Line Density tool (see "Line Density" in the ArcGIS for Desktop help) be a feasible alternative if you use a cell size of 500 m?

ManuelFrias · ‎07-16-2015

That would be an option, thanks!

The problem is that I HAVE to use the 500x500m grid file.

GabrielUpchurch1 · ‎07-16-2015

Hi Manuel,

To verify, do you have the 64-bit background geoprocessing patch installed? If you don't, you are still running in 32-bit and will have limited access to RAM (4 GB max I think).
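A quick way to see whether a given Python interpreter is 32-bit or 64-bit is to ask it for its pointer size. The following is only a minimal sketch using the standard library; `python_bitness` is a helper name made up for this example. As far as I know, ArcMap itself stays a 32-bit application even with the patch installed, so the Python window inside ArcMap would be expected to report 32 either way; the interesting check is the interpreter that background geoprocessing actually uses.

```python
import platform
import struct
import sys


def python_bitness():
    """Return 32 or 64, based on the pointer size of the running interpreter."""
    return struct.calcsize("P") * 8


# Report which interpreter is running and how wide its pointers are.
print("Python executable: {0}".format(sys.executable))
print("Python version:    {0}".format(platform.python_version()))
print("Bitness:           {0}-bit".format(python_bitness()))
```

Running the same lines from each Python installation on the machine shows which one is which.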

Theoretically, you should be able to use all 23 GB RAM when processing in 64-bit (at least what isn't being used by other processes). Of course, theory and reality are two different things and even with access to huge amounts of RAM, it doesn't necessarily mean that a process will succeed or be any faster. Other factors related to the processor, software, etc. will contribute to performance and the ability of a process to be completed successfully. I am being intentionally vague here because even after assisting many users with geoprocessing performance issues and questions, I still don't fully understand everything that occurs behind the scenes when a process is executed.

If you can run the process on a sample of the data but not on the entire dataset, I usually suggest subsetting the dataset, running the process on each subset, and then merging the results back together. In your case, you could subset the grid into manageable pieces and probably leave the lines alone. Many users will use ModelBuilder or Python to semi-automate the procedure. It isn't ideal, but as long as it won't somehow influence the final results, you know it will get the job done, and it is a lot safer than waiting 4 days. I once experienced a power outage on the third day of a geoprocessing operation and didn't have any type of battery backup to keep my machine alive.
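The subset, process, merge idea could be scripted roughly as below. This is a sketch under assumptions, not a tested workflow: it splits the grid by OBJECTID ranges (one possible way to subset), the function names, the chunk size and the output names `join_<n>` and `join_all` are all made up for illustration, and the arcpy part needs an ArcGIS installation to run. The two small helpers at the top are plain Python.

```python
import os


def oid_chunks(min_oid, max_oid, chunk_size):
    """Yield inclusive (start, end) OBJECTID ranges covering min_oid..max_oid."""
    start = min_oid
    while start <= max_oid:
        end = min(start + chunk_size - 1, max_oid)
        yield start, end
        start = end + 1


def where_clause(oid_field, start, end):
    """Build a SQL where clause that selects one chunk of grid cells."""
    return "{0} >= {1} AND {0} <= {2}".format(oid_field, start, end)


def spatial_join_in_chunks(grid_fc, lines_fc, out_gdb, chunk_size=100000):
    """Run Spatial Join on one chunk of grid cells at a time, then merge."""
    import arcpy  # imported here so the helpers above work without ArcGIS

    oid_field = arcpy.Describe(grid_fc).OIDFieldName
    delimited = arcpy.AddFieldDelimiters(grid_fc, oid_field)
    oids = [row[0] for row in arcpy.da.SearchCursor(grid_fc, ["OID@"])]

    pieces = []
    for i, (start, end) in enumerate(oid_chunks(min(oids), max(oids), chunk_size)):
        layer = "grid_chunk_{0}".format(i)
        arcpy.MakeFeatureLayer_management(
            grid_fc, layer, where_clause(delimited, start, end))
        out_fc = os.path.join(out_gdb, "join_{0}".format(i))
        # KEEP_COMMON drops cells that no line intersects, which keeps
        # each piece small; those cells simply have a count of zero.
        arcpy.SpatialJoin_analysis(layer, lines_fc, out_fc,
                                   "JOIN_ONE_TO_MANY", "KEEP_COMMON",
                                   match_option="INTERSECT")
        arcpy.Delete_management(layer)
        pieces.append(out_fc)

    arcpy.Merge_management(pieces, os.path.join(out_gdb, "join_all"))
```

Because every grid cell falls in exactly one OBJECTID range, no cell is processed twice, and each finished piece survives on disk if a later chunk fails.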

VinceAngelo · ‎07-16-2015

This question was also asked on GIS Stackexchange.

Please specify the license level of ArcGIS Desktop available (Advanced/Standard/Basic), since it significantly changes the difficulty of generating a result. Please also mention whether you have Spatial Analyst available, since Kernel Density takes just seconds.

- V

ManuelFrias · ‎07-17-2015

Thanks Xander Bakker, Gabriel Upchurch and Vince Angelo for your advice! I am experimenting with your ideas.

I have ArcGIS Desktop Advanced with Spatial Analyst (and all other extensions) available.

I am not sure the 64-bit background geoprocessing patch is installed. I run a tool called TrackBuilder that uses a 64-bit Python script, but when I check the Python version in my ArcGIS it says 32-bit.

To clarify, I must use the 500x500m polygon grid, so I don't understand how I can use the Kernel Density tool so that the result fits in that grid.

MF

ManuelFrias · ‎07-17-2015

By the way, yesterday I ran the Intersect tool with the whole dataset (remember, 3 million polygons). It worked with a small dataset, but now it has been running for 19 hours!

Is there any way to know it's actually doing something? The CPU shows 0% and very little RAM is being used. According to the Geoprocessing result window ArcGIS is "Assembling features"...

EDIT: it ended after 19 hours. It was doing something after all!

DanPatterson_Retired · ‎07-17-2015

To summarize your inputs:

you have about 3,000,000 cells of 500 m x 500 m size
it is the number of lines in each cell that you are interested in
you will produce a density map from the results of the counts per cell for the study area

Which gives rise to these questions:

is the study area in projected coordinates?
why is it the count per cell and not the length of the linear features per cell that is of interest?
if the data are not in projected coordinates, how do you interpret density in real world units? whether it be counts or length?
what is it that you need density for? sometimes the object under investigation will give rise to important questions about the utility of the measure, or to alternate approaches to gauge/investigate its pattern

Assuming your inputs and my questions, consider the scenario where one is interested in drainage density as it relates to surficial hydrology. From a purely practical perspective, we can consider the total length of the rivers/streams within a specific area (i.e. your cell), giving us a length/area dimension. Ideally, we should have the surface area of the streams per area, giving us a dimensionless unit which makes sense... Counts per area in this context make no sense, since we could have 1 stream intersecting a 500 x 500 m cell, but the coverage in terms of length or area could be anything (i.e. any length/area, or the cell could be completely covered by the stream).

So in short, Xander's solution scales this down on a cell by cell basis, but other suggestions depend on what is being examined and whether count/area is the key or whether other measures such as length/area or area/area are of interest.

ManuelFrias · ‎07-17-2015

Very relevant questions you pose, Dan Patterson,

We are making shipping density maps. The final result will be a map showing the number of lines per cell, i.e., the number of ships that have crossed each cell.
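To make "number of lines per cell" concrete: once a one-to-many join exists, the count is the number of distinct line IDs attached to each cell ID. The sketch below shows only that counting step in plain Python, on made-up sample pairs. It assumes each join row has been reduced to a (cell ID, line ID) pair, for example from the TARGET_FID and JOIN_FID fields of a one-to-many Spatial Join output; `lines_per_cell` is a name invented for this example.

```python
from collections import defaultdict


def lines_per_cell(join_rows):
    """Count distinct line IDs per grid cell from (cell_id, line_id) pairs."""
    lines_by_cell = defaultdict(set)
    for cell_id, line_id in join_rows:
        lines_by_cell[cell_id].add(line_id)
    return {cell_id: len(ids) for cell_id, ids in lines_by_cell.items()}


# Made-up sample: cell 10 is crossed by lines 1 and 2 (line 1 shows up twice,
# as it might if a track re-enters the cell), cell 11 by line 2 only.
sample = [(10, 1), (10, 2), (10, 1), (11, 2)]
counts = lines_per_cell(sample)
print("cell 10: {0}, cell 11: {1}".format(counts[10], counts[11]))
# prints "cell 10: 2, cell 11: 1"
```

Using a set per cell means a line is counted once per cell no matter how many times it appears in the join rows. Cells with no rows are simply missing from the result and can be treated as zero.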

We started out with AIS (Automatic Identification System) point data that was converted to lines with a tool called TrackBuilder. We could have used ArcGIS, but this tool gives the route for each ship with start and end time (sort of; it has its problems, though).

We are using a grid which is the EEA standard (European Environment Agency). The idea is to overlap this grid with other analyses we'll make in the future. Traditionally, we have used raster functions like kernel or point density. The problem with this approach is that it doesn't give the number of LINES per cell and the EEA grid cannot be used.

The main challenge we are facing, as I see it, is SIZE. This wouldn't be a problem if the size of the cells were bigger or the study area was smaller. It's, by the way, the Baltic Sea. For non-Europeans this is a comparison with USA:

DanPatterson_Retired · ‎07-17-2015

Manuel...interesting...which raises another question... You mentioned that it worked fine on a small data set. When you did this

1. did you select both the lines AND the cells to form a selection prior to doing the intersection, or
2. did you just select the lines OR the cells to form the selection
3. what proportion of one or the other was selected during that run... 1/10 or 1/4?

Those conditions mean that:

you could tile by repeating spatial selections prior to doing the intersect, with option (1) being the best choice.
By limiting the possible candidates in the intersect process, you have a real time estimate of how long it will take to finish the remainder.
The results of each selection and intersection can then be merged prior to doing your final density calculations.
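The tiling and the running time estimate can be sketched in plain Python. This is only an illustration under assumptions: the tile size is taken to be a multiple of the 500 m cell size and the extent is taken to start on a cell edge, so that tile edges coincide with cell edges; the function names and the example extent are made up. Each tile rectangle would then drive a spatial selection of cells and lines before the intersect.

```python
import math


def tile_extents(xmin, ymin, xmax, ymax, tile_size):
    """Yield (x0, y0, x1, y1) rectangles that cover the extent row by row."""
    nx = int(math.ceil((xmax - xmin) / float(tile_size)))
    ny = int(math.ceil((ymax - ymin) / float(tile_size)))
    for row in range(ny):
        for col in range(nx):
            x0 = xmin + col * tile_size
            y0 = ymin + row * tile_size
            # Clip the last row/column of tiles to the overall extent.
            yield (x0, y0, min(x0 + tile_size, xmax), min(y0 + tile_size, ymax))


def estimate_remaining_seconds(seconds_so_far, tiles_done, tiles_total):
    """Extrapolate remaining run time from the average time per finished tile."""
    if tiles_done == 0:
        raise ValueError("need at least one finished tile to estimate")
    return seconds_so_far / float(tiles_done) * (tiles_total - tiles_done)


# Made-up extent of 250 km x 150 km, split into 100 km tiles (200 cells wide).
tiles = list(tile_extents(0, 0, 250000, 150000, 100000))
remaining = estimate_remaining_seconds(600, 2, len(tiles))
print("{0} tiles, about {1:.0f} s remaining".format(len(tiles), remaining))
# prints "6 tiles, about 1200 s remaining"
```

The estimate assumes every tile costs about the same, which will not hold if some tiles are mostly empty sea and others cover busy shipping lanes. Cells on a tile edge touch two tiles, so selecting cells by where their centre falls is one way to avoid processing them twice.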

My feeling is that trying to do everything all at once scales the problem up, particularly if method (2) was used. If you have some thoughts on this, that would be interesting. This is beginning to sound like a case where parallel processing (not parallel processors, but multiple instances of ArcMap) would be nice.