How can I perform a spatial join in ArcGIS with these HUGE datasets?
Target: almost 3 million squares, 500x500 m in size (the attribute table is essentially empty)
Join feature: 12000 polylines
Operation: one to many
All my attempts have failed:
From the Spatial Join tool in the Toolbox: after 15 minutes the output is a file where no join has been made (the Join_Count field is 0 in all rows).
From the dialog box: it seems to work, but my estimate is that it would take 4 days to complete, with no guarantee that the output will be correct.
It works when I do a Spatial Join of a small area.
Any ideas on how to do this properly?
I am using a file geodatabase, ArcGIS 10.2, background processing, and 23 GB of RAM.
(The idea is to overlay those polylines on the grid of 500x500 m squares. I will then run a Dissolve to find out how many lines fall in each square and, finally, make a density map.)
To verify: do you have the 64-bit Background Geoprocessing patch installed? If you don't, you are still running in 32-bit and will have limited access to RAM (4 GB max, I think).
Theoretically, you should be able to use all 23 GB RAM when processing in 64-bit (at least what isn't being used by other processes). Of course, theory and reality are two different things and even with access to huge amounts of RAM, it doesn't necessarily mean that a process will succeed or be any faster. Other factors related to the processor, software, etc. will contribute to performance and the ability of a process to be completed successfully. I am being intentionally vague here because even after assisting many users with geoprocessing performance issues and questions, I still don't fully understand everything that occurs behind the scenes when a process is executed.
If you can run the process on a sample of the data but not on the entire dataset, I usually suggest subsetting the dataset, running the process on each subset, and then merging the results back together. In your case, you could subset the grid into manageable pieces and probably leave the lines alone. Many users use ModelBuilder or Python to semi-automate the procedure. It isn't ideal, but as long as it won't somehow influence the final results, you know it will get the job done, and it is a lot safer than waiting 4 days. I once experienced a power outage on the third day of a geoprocessing operation and didn't have any type of battery backup to keep my machine alive.
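As a rough illustration of the subsetting idea, here is a pure-Python sketch (no arcpy; the extent numbers are made up) of splitting the grid's extent into tiles, so that each tile's bounding box can be used to select a subset of squares and run the join on each subset separately before merging:

```python
# Sketch (assumed numbers): split a large grid extent into n x n tiles
# so each subset can be run through Spatial Join separately and the
# outputs merged afterwards.
def make_tiles(xmin, ymin, xmax, ymax, n):
    """Return bounding boxes (xmin, ymin, xmax, ymax) for an n x n split."""
    dx = (xmax - xmin) / n
    dy = (ymax - ymin) / n
    tiles = []
    for i in range(n):
        for j in range(n):
            tiles.append((xmin + i * dx, ymin + j * dy,
                          xmin + (i + 1) * dx, ymin + (j + 1) * dy))
    return tiles

# Example: a 1000 km x 1000 km extent split into 4 x 4 = 16 tiles
tiles = make_tiles(0, 0, 1_000_000, 1_000_000, 4)
print(len(tiles))   # 16
print(tiles[0])     # (0.0, 0.0, 250000.0, 250000.0)
```

Each bounding box could then drive a Select By Location (or an extent environment setting) before the per-tile join; the per-tile outputs are merged at the end.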
This question was also asked on GIS Stackexchange.
Please specify the license level of ArcGIS Desktop available (Advanced/Standard/Basic), since it significantly changes the difficulty of generating a result. Please also mention whether you have Spatial Analyst available, since Kernel Density takes just seconds.
I have ArcGIS Desktop Advanced with Spatial Analyst (and all other extensions) available.
I am not sure whether the 64-bit Background Geoprocessing patch is installed. I run a tool called TrackBuilder that uses a 64-bit Python script, but when I check the Python version in my ArcGIS it says 32-bit.
To clarify, I must use the 500x500 m polygon grid, so I don't understand how I can use the Kernel Density tool in a way that makes the result fit that grid.
By the way, yesterday I ran the Intersect tool on the whole dataset (remember, 3 million polygons). It worked with a small dataset, but now it has been running for 19 hours!
Is there any way to know whether it's actually doing something? The CPU shows 0% and very little RAM is being used. According to the Geoprocessing Results window, ArcGIS is "Assembling features"...
EDIT: it ended after 19 hours. It was doing something after all!
To summarize your inputs:
Which gives rise to these questions:
Assuming your inputs and my questions, consider a scenario in which one is interested in drainage density as it relates to surficial hydrology. From a purely practical perspective, we can take the total length of the rivers/streams within a specific area (i.e., your cell), giving us a length/area measure. Ideally, we would have the surface area of the streams per area, giving us a dimensionless unit, which makes sense. Counts per area make little sense in this context, since we could have one stream intersecting a 500x500 m cell while its coverage in terms of length or area could be anything (any length/area; the cell could even be completely covered by the stream).
So in short, Xander's solution scales this down on a cell-by-cell basis, but other suggestions depend on what is being examined: whether count/area is the key measure, or whether measures such as length/area or area/area are of interest.
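To make the count/area vs. length/area distinction concrete, here is a toy calculation with invented numbers: both cells are crossed by exactly one stream, so their count densities are identical, yet their length-based densities differ enormously.

```python
# Toy illustration (hypothetical numbers): two 500x500 m cells,
# each intersected by exactly one stream, but with very different
# total stream length inside the cell.
CELL_AREA = 500 * 500  # m^2

cells = {
    "A": {"stream_count": 1, "stream_length_m": 40.0},   # stream clips a corner
    "B": {"stream_count": 1, "stream_length_m": 700.0},  # stream meanders through
}

for name, c in cells.items():
    count_density = c["stream_count"] / CELL_AREA    # identical for A and B
    length_density = c["stream_length_m"] / CELL_AREA  # differs by a factor of ~17
    print(name, count_density, length_density)
```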
Very relevant questions you pose, Dan Patterson.
We are making shipping density maps. The final result will be a map showing the number of lines per cell, i.e., the number of ships that have crossed each cell.
We started out with AIS (Automatic Identification System) point data that was converted to lines with a tool called TrackBuilder. We could have used ArcGIS, but this tool gives the route for each ship with start and end times (sort of; it has its problems).
We are using a grid that follows the EEA (European Environment Agency) standard. The idea is to overlay this grid in other analyses we will make in the future. Traditionally, we have used raster functions like Kernel Density or Point Density. The problem with that approach is that it doesn't give the number of LINES per cell, and the EEA grid cannot be used.
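For what it's worth, the counting step after a one-to-many spatial join can be sketched in plain Python; the rows and field names below are hypothetical, standing in for the join output table that pairs each grid cell with every ship track crossing it.

```python
from collections import defaultdict

# Sketch (hypothetical data): each row of a one-to-many spatial join
# output pairs a grid-cell ID with the ID of a ship track (polyline)
# that crosses it. The Dissolve/summary step reduces this to one
# count of distinct ships per cell.
join_rows = [
    ("cell_001", "ship_A"),
    ("cell_001", "ship_B"),
    ("cell_001", "ship_A"),  # same ship crossing the cell twice
    ("cell_002", "ship_C"),
]

tracks_per_cell = defaultdict(set)
for cell_id, line_id in join_rows:
    tracks_per_cell[cell_id].add(line_id)  # a set keeps each ship once

counts = {cell: len(ships) for cell, ships in tracks_per_cell.items()}
print(counts)  # {'cell_001': 2, 'cell_002': 1}
```

Whether a ship crossing the same cell twice should count once or twice is exactly the kind of decision the length/area discussion above turns on; the set here counts it once.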
The main challenge we are facing, as I see it, is SIZE. This wouldn't be a problem if the cells were bigger or the study area were smaller. The study area, by the way, is the Baltic Sea. For non-Europeans, here is a comparison with the USA:
Manual... interesting... which raises another question. You mentioned that it worked fine on a small dataset. When you did this:
Those conditions mean that:
My feeling is that trying to do everything all at once scales the problem up, particularly if method (2) was used. If you have some thoughts on this, that would be interesting. This is beginning to sound like parallel processing (not parallel processors, but multiple instances of ArcMap) would be helpful.
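The multiple-instances idea boils down to running independent tiles concurrently and merging the outputs. A minimal pure-Python sketch of that pattern (a thread pool and a placeholder function stand in for real ArcMap sessions and geoprocessing calls):

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch (no arcpy): divide-and-conquer over grid tiles, run concurrently.
# process_tile stands in for a geoprocessing run on one subset of the
# grid (e.g. a Spatial Join); here it just returns a fake row count.
def process_tile(tile_id):
    # Placeholder for: open a session, run the join on this tile,
    # return the path/size of its output.
    return (tile_id, 1000 + tile_id)  # hypothetical per-tile row count

tiles = range(4)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_tile, tiles))  # preserves tile order

# "Merge" step: combine the per-tile outputs into one result
total_rows = sum(rows for _, rows in results)
print(total_rows)
```

In practice each worker would be a separate 64-bit Python process or ArcMap instance so that the workers don't contend for the same geodatabase locks.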