Seeking advice on how to best handle  join / field updating on ~45 million features

219
5
07-02-2012 08:31 AM
HarryBowman
New Contributor III
I have a feature class of 47 million rows representing streets.
It needs to have two fields appended to it, one for the city on the left side of the street, one for the right.
The city information is contained in two tables (40+ million rows each). These tables each have an ID field corresponding to the ID field on the feature class. They also have a city name field (FULL_NAME), which is the same in both tables.

I've tried a couple ways to get this information together, but things haven't worked so far:

Tried making table views, so that I don't have a duplicate name field. Make Table View seems to not be renaming the fields.
Tried joining the left table and using Calculate Field. It ran for three days and was 1% complete this morning. I'm sure I did something wrong, but suspect this is not the best approach anyway.
Tried Make Query Table, which ran in seconds, then hung up ArcMap. This seemed the most promising, actually.

Based on your experience, what would your recommend?
0 Kudos
5 Replies
KimOllivier
Regular Contributor II
MakeTableQuery does not do anything in 2 seconds, it just creates a query definition (and validates it?). As soon as you use it to open the table it will run out of memory. No geoprocessing tools in ArcMap work in human time with more than about a million records. It either runs out of memory and hangs in seconds forever or produces nothing. I do notice that at 10.1 there are some tools that clearly run a partitioning strategy behind the scenes to improve the scalability of some tools.

If a single process does not finish in the time it takes to drink a cup of coffee, interrupt the process and find a better way. Don't even wait for lunch!

I did manage to process file geodatabase tables (with only a tenth of your numbers) by partitioning into 12 parts and it only took a few minutes for each part after hanging on full queries to total 32 minutes.

The first option surely would be to open up the native database admin tools and run an SQL query there. You do not say what it is. Or you could upgrade to 10.1 which has a faster data_access module.

If you want to persist with geoprocessing tools, Python dictionaries work really well instead of a join, but pre 10.1 the UpdateCursor is too slow and will sabotage the effort. To get around that it can be quicker to dump the table as a CSV file in Python and load it back again. I have seen a cool solution using SQLite inside a Python script (SQLite is built into Python) instead of using a file geodatabase as an intermediate database.

Index Everything. Make sure that the keys have an attribute index.
0 Kudos
NobbirAhmed
Esri Regular Contributor
Hi Harry, I would like to help your case. I think it is possible to get your job done within ArcGIS. First of all, which version of ArcGIS you are using?

Let me be clear about your data. Of the two tables, is it that one of the tables contains records (IDs) of the cities on the left of the streets only and other table contains records (IDs) of the cities on the right? Is the ID of a street is same as the IDs of the cities on its both sides? Is my visualization of your data correct?

[ATTACH=CONFIG]15691[/ATTACH]
0 Kudos
HarryBowman
New Contributor III
Hi Harry, I would like to help your case. I think it is possible to get your job done within ArcGIS. First of all, which version of ArcGIS you are using?

Let me be clear about your data. Of the two tables, is it that one of the tables contains records (IDs) of the cities on the left of the streets only and other table contains records (IDs) of the cities on the right? Is the ID of a street is same as the IDs of the cities on its both sides? Is my visualization of your data correct?



Hello, Nobbir:
I am using ArcGIS 10.0, ArcView license.
Here is the data set up:
One file geodatabase, on a local drive
Streets feature class has a field, ID, that identifies the streets.
L_id_name table has two fields, FID (the ID of the street), FULL_NAME (the name of the city on the left side of the street)
R_id_name table has two fields, FID (the ID of the street), FULL_NAME (the name of the city on the right side of the street)

Here's what I have done and it seems to have worked - but I hope there is a Faster Way:
1. Built indices for ID on streets, FID on L_id_name, FID on R_id_name
2. Joined L_id_name to streets by FID/ID. Persisted this by GP Tool Copy Features as Test1
3. Joined R_id_name to Test1 by FID/ID.  Persisted this by GP Tool Copy Features as Test2
(At this point, I had a feature class containing two fields for the left and right city name.)
4. Created a model with Make Feature Layer feeding into Copy Features. I used this model to rename the fields from L_id_name.FULL_NAME, R_id_name.FULL_NAME to L_CITY and R_CITY, which is more user friendly.

So, in the end, I did finish! But it took some time:
15 minutes each for the three attribute indices
4 hours each for the copy features
I mistakenly ran the model in the foreground and forgot to time it, but I'd say 3-4 hours might be right

I think I could have been smart about it using some sort of multi-table join, then Make Feature Layer -> Copy Features and doing all the joining and renaming and copying in one go, but I got scared off by the duplicate field names (I didn't know how ArcGIS would handle that) and went with "one step at a time." I also didn't try chopping, running in parallel, and merging - again trading time for simplicity.

Thanks.
0 Kudos
NobbirAhmed
Esri Regular Contributor
... was there some smarter way to handle this?


Chopping (or divide and rule) does not always yield better results. You'll have to experiment with an optimum size to divide into. From our experience your results (times) are good enough. Compared to the number of features (40+ millions), 3-4 hours for each process is really good.
0 Kudos
HarryBowman
New Contributor III
Chopping (or divide and rule) does not always yield better results. You'll have to experiment with an optimum size to divide into. From our experience your results (times) are good enough. Compared to the number of features (40+ millions), 3-4 hours for each process is really good.


OK, thanks. In my original post, I said that Make Table View wasn't renaming the fields - that turned out to be user error, I guess, because Make Feature Layer certainly worked.
0 Kudos