Will ArcGIS Pro load a table with 860 million records into a geodatabase?

10-21-2024 02:01 PM
TimMinter
Frequent Contributor

I'm attempting to use ArcGIS Pro 3.3.1 to read a non-geodatabase PostgreSQL 15.8 database table with ~860 million rows, then write those rows to a geodatabase table. I'm using the Append GP tool to map the input table columns to the output geodatabase table fields and load the records.

When I try with an enterprise geodatabase target table in SQL Server 2019 and 2022, the process slows to a terrible crawl after ~262 million records are loaded. Same thing when I try with a file geodatabase target table, except there it looks more like the load stopped at ~262 million records instead of just slowing. Storage is fine, RAM looks happy, and CPU says ArcGIS Pro is doing something noticeable. Disk reads and writes drop off for the local attempts, and network transport drops off for the remote server attempts, when the load stalls.

I ensured that the only index on the target table columns was the ArcGIS ObjectID index, and that it's not clustered, in order to eliminate the SQL Server "load slows when unsorted data is loaded to a table with a clustered index" problem. I'm using the 32-bit ObjectID because I'm not getting anywhere near the ~2.14 billion limit that would bump me up into 64-bit ObjectID territory.
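For reference, the load is essentially this pattern (a sketch only; the connection paths and field names below are placeholders, not my actual schema):

```python
import arcpy

# Placeholder paths - a database connection to the PostgreSQL table and a target geodatabase table
src_table = r"C:\connections\postgres.sde\public.big_table"
tgt_table = r"C:\data\target.gdb\big_table"

# Build a field map from the target schema, then wire each source column to its target field
fms = arcpy.FieldMappings()
fms.addTable(tgt_table)
for tgt_field, src_field in [("EVENT_ID", "event_id"), ("EVENT_DATE", "event_date")]:
    fm = arcpy.FieldMap()
    fm.addInputField(src_table, src_field)
    out_f = fm.outputField
    out_f.name = tgt_field
    fm.outputField = out_f
    fms.replaceFieldMap(fms.findFieldMapIndex(tgt_field), fm)

arcpy.management.Append(inputs=src_table, target=tgt_table,
                        schema_type="NO_TEST", field_mapping=fms)
```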

Technically, the documentation and discussions in the community suggest that the file geodatabase itself can handle "unlimited" records.  Any thoughts on why ArcGIS Pro is getting stuck at ~262m?

cheers,

tim

9 Replies
BillFox
MVP Frequent Contributor

Hello Tim,

You said v3.3.1, but I did see this for 3.0 (maybe something similar):

https://support.esri.com/en-us/knowledge-base/table-view-record-limit-for-tables-over-2-5-million-ro...


VinceAngelo
Esri Esteemed Contributor

Tim --

Let's just say I would never try this.

I do know the underlying API is capable, because in the distant past (e.g. 8.2? 9.0?), a colleague used my 'asc2sde' utility to load 680m rows in under 18 hours, into Oracle, SQL Server, and PostgreSQL (three different servers, concurrently). Note that this was before native geometry was the usual solution.

That said, what does Append really get you here that raw SQL wouldn't? 

I'd be tempted to chunk the data into 5-10m-row SQL batches, then execute them in a native CLI utility (SQL*Plus/sqlcmd/psql), but I'd also want to benchmark using a DA SearchCursor with a collection of DA InsertCursor commands, and using a collection of arcpy.ArcSDESQLExecute cursors with INSERT SQL statements (with a del cursor every few million rows). If I were an FME guy, I'd also benchmark that, all across 50m rows, then use the one that presented the least difficulties.
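A minimal sketch of the cursor route, assuming a dense integer key column to chunk on (all names below are hypothetical):

```python
import arcpy

src = r"C:\connections\postgres.sde\public.big_table"   # hypothetical source table
tgt = r"C:\data\target.gdb\big_table"                    # hypothetical target table
fields = ["event_id", "event_date", "value"]             # hypothetical columns
chunk = 5_000_000                                        # ~5m rows per batch

lo = 0
while True:
    where = f"row_id > {lo} AND row_id <= {lo + chunk}"  # assumes a dense integer key "row_id"
    written = 0
    with arcpy.da.SearchCursor(src, fields, where_clause=where) as s_cur, \
         arcpy.da.InsertCursor(tgt, fields) as i_cur:
        for row in s_cur:
            i_cur.insertRow(row)
            written += 1
    if written == 0:      # stop once a whole batch comes back empty
        break
    lo += chunk
```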

I'd also strongly consider just leaving the data where it is, and adding a PostGIS geometry to it and moving forward with my project.
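If the source rows carry coordinates, the in-place PostGIS option is only a few statements. A rough psycopg2 sketch, assuming hypothetical lon/lat columns and table names:

```python
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me host=dbhost")  # hypothetical connection
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS postgis;")
    cur.execute("ALTER TABLE public.big_table "
                "ADD COLUMN IF NOT EXISTS geom geometry(Point, 4326);")
    # One big UPDATE for brevity; chunk it by key range in practice
    cur.execute("UPDATE public.big_table "
                "SET geom = ST_SetSRID(ST_MakePoint(lon, lat), 4326);")
    cur.execute("CREATE INDEX IF NOT EXISTS big_table_geom_idx "
                "ON public.big_table USING GIST (geom);")
conn.close()
```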

- V

TimMinter
Frequent Contributor

Well, I'm with you, Vince. However, my employer needs it done, and I still enjoy being employed, so the part of me that I sell for 2080 hours a year would certainly try this. The rest of me wouldn't. 🙂

I loaded 1.3 billion records from CSV files to PostgreSQL last week in about 24 minutes, so that's pretty cool.

In further poking about the web, I ran across a bunch of grumbling about how Python will just hang during larger data loads.  I'm assuming that ArcGIS Pro Append GP tool uses Python code, and I have a suspicion (that I won't pursue) that it's just another example of that behavior. If Append could swing it, it would get my records into a geodatabase table schema with fields ordered, named, and aliased per requirements without a bunch of re-messing about.

So, one of my principles is to configure before I code.  I hope I don't have to code up your hints, but I'm glad you shared them.  Some of my workmates noted that Data Interoperability (FME) extension has loaded larger recordsets for them, so I'm headed that direction now, hoping and expecting that SAFE didn't code up their tools with Python. I'll report back here with what worked, probably.

The table is not a feature class. It is raw material for defined and ad-hoc data products that can support spatio-temporal analysis, etc. by being related to a polygon feature class. So, big table becomes smaller table that meets someone's specialized requirements. We all know that I'm going to try Query Layer and/or Query Table on that big table whether it's wise or not.
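For the record, the Query Layer attempt will look something like this (the connection file, SQL, and unique ID column are placeholders):

```python
import arcpy

# Placeholder connection file and SQL; the real query relates the big table to the polygon feature class
arcpy.management.MakeQueryLayer(
    input_database=r"C:\connections\postgres.sde",
    out_layer_name="big_table_ql",
    query="SELECT row_id, event_date, area_id FROM public.big_table WHERE event_date >= '2024-01-01'",
    oid_fields="row_id",  # unique integer column to stand in for the ObjectID
)
```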

Cheers,
tim

MarcoBoeringa
MVP Alum

@TimMinter wrote:

In further poking about the web, I ran across a bunch of grumbling about how Python will just hang during larger data loads.  I'm assuming that ArcGIS Pro Append GP tool uses Python code, and I have a suspicion (that I won't pursue) that it's just another example of that behavior.


AFAIU, quite a lot of the core ArcGIS geoprocessing tools have been written in C++, not Python, but Vince can give a more informed and balanced answer.

I don't agree that "Python will just hang during larger data loads" is the cause of data load issues; see my response about OpenStreetMap data processing at planetary scale. Crappy programming and poor hardware / networks can cause major issues, though.

Marco

MarcoBoeringa
MVP Alum

Well, I can't vouch for any specific tools like Append, but my experience has shown you can handle multi-billion record spatial tables in PostgreSQL and create ArcGIS Query Layers for them in ArcGIS Pro. However, this experience has also shown that some ArcGIS geoprocessing tools are very efficient and can handle such large datasets, and others cannot. It all depends on the tool and its exact implementation.

I currently run a >4TB large OpenStreetMap database as a non-geodatabase PostgreSQL database (v17.0 currently) based on Facebook/Meta's "Daylight Map Distribution for OpenStreetMap", that includes nearly all of Google Open Buildings. The hardware I use is a refurbished HP Z840 workstation that I beefed up with 20TB of NVMe disks (RAID-0), half of it serving as superfast backup, and 512GB RAM.

The largest table, containing all buildings and other Polygon geometries of OpenStreetMap for the entire planet, has nearly 2.5 billion (!) records (yes, I am using 64-bit ObjectIDs). See the DBeaver screenshot below, with the actual styled data, exported to PDF from ArcGIS Pro, in the background.

To handle import and export, I have successfully used Python modules available in ArcGIS Pro, like sqlite3, pyodbc and psycopg2. The original import of the OpenStreetMap data uses osm2pgsql; subsequent processing steps use the other libraries.

All of these tools have proven capable of moving hundreds of millions of records in and out of the database without significant slowdown, and of creating >400M record File Geodatabases and SQLite spatial databases as exports, but it requires really good thought about how to handle stuff in your Python code. E.g. storing the ObjectIDs to process as ordinary Python number objects in some huge Python list will quickly have you run out of RAM with billions of records, even if you have 64GB or more available. I solved that issue by using numpy arrays and data types, which can be far more memory efficient.
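A small illustration of that difference (not my production code, just the memory comparison):

```python
import sys
import numpy as np

n = 10_000_000  # 10M ObjectIDs is enough to show the pattern

# Plain Python list: an 8-byte pointer per element plus a ~28-byte int object each
oid_list = list(range(n))
list_mb = (sys.getsizeof(oid_list) + sum(sys.getsizeof(i) for i in oid_list)) / 1024**2

# numpy int64 array: one contiguous block, 8 bytes per element
oid_array = np.arange(n, dtype=np.int64)
array_mb = oid_array.nbytes / 1024**2

print(f"Python list : ~{list_mb:,.0f} MB")
print(f"numpy array : ~{array_mb:,.0f} MB")
```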

But there is a plethora of issues to deal with if you want to write efficient Python / ArcPy code for handling these amounts of data, so I am not at all surprised some of the current ArcGIS geoprocessing tools fail to scale to those table sizes, likely because no developer ever tested them at those scales of data processing.

That said, using standard SQL and pyodbc / psycopg2, I have been able to e.g. run SQL 'UPDATE' statements that modify an entire >1B record table at a rate of 500M - 3.5B records per hour (up to ~1 million rows updated per second), depending on the operation performed, on this 2016 hardware using Python multi-threading options... Modern hardware should be able to easily double that.
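The basic pattern is key-range chunks, each handled by its own thread on its own connection. A rough sketch with hypothetical table and column names:

```python
from concurrent.futures import ThreadPoolExecutor
import psycopg2

DSN = "dbname=osm user=me host=localhost"   # hypothetical connection string
CHUNK = 1_000_000                           # rows per UPDATE statement

def update_range(bounds):
    lo, hi = bounds
    conn = psycopg2.connect(DSN)            # one connection per chunk/thread
    try:
        with conn, conn.cursor() as cur:    # "with conn" commits the transaction
            cur.execute(
                "UPDATE public.planet_polygons "
                "SET area_m2 = ST_Area(geom::geography) "   # hypothetical operation
                "WHERE objectid > %s AND objectid <= %s",
                (lo, hi),
            )
    finally:
        conn.close()

conn = psycopg2.connect(DSN)
with conn.cursor() as cur:
    cur.execute("SELECT max(objectid) FROM public.planet_polygons")
    max_id = cur.fetchone()[0]
conn.close()

ranges = [(lo, lo + CHUNK) for lo in range(0, max_id, CHUNK)]
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(update_range, ranges))    # psycopg2 releases the GIL while the server works
```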

[Screenshot: DBeaver record count of the planet-wide polygon table, with the styled data in ArcGIS Pro in the background]


TimMinter
Frequent Contributor

Thanks Marco, lots of potentially useful info there.  I'm thankful that I'm only dealing with 1.3b records on this one.  The PC I'm using for this work is just a June 2023 vintage Dell Precision workstation meeting or exceeding ArcGIS Pro recommended and/or optimal specs, so nothing noteworthy.

TimMinter
Frequent Contributor (Accepted Solution)

Ok, I have _an_ answer, maybe not _the_ answer. So, no, ArcGIS Pro will not do this for me when I'm using the Append tool. Data Interoperability extension claimed a successful load after 11 hours 54 minutes (ugh), all on the local machine with no network transfer happening. I still need to poke at it a bit to see if I agree with FME's "successful" claim.

So, that's that.  Thanks everyone.

tim

MarcoBoeringa
MVP Alum

@TimMinter wrote:

Data Interoperability extension claimed a successful load after 11 hours 54 minutes (ugh), all on the local machine with no network transfer happening. 


Considering you were loading 860M records to begin with, that 12-hour FME load doesn't sound too bad if it used database INSERTs (which it likely did, unless FME can use COPY).

Database INSERTs are way more expensive in terms of performance than COPY or UPDATEs in PostgreSQL. This is the nature of PostgreSQL and its technical implementation. Your timing calculates to a rate of about 72M records / h, which seems reasonable.
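To illustrate the two paths from psycopg2 (table, columns, and file name below are placeholders):

```python
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydb user=me")   # hypothetical connection
cur = conn.cursor()

# Path 1: batched INSERTs - every row goes through the full INSERT machinery
rows = [(1, "2024-01-01"), (2, "2024-01-02")]    # placeholder rows
execute_values(cur,
               "INSERT INTO public.big_table (row_id, event_date) VALUES %s",
               rows, page_size=10_000)

# Path 2: COPY - streams the rows in bulk with far less per-row overhead
with open("chunk_0001.csv") as f:                # placeholder CSV
    cur.copy_expert(
        "COPY public.big_table (row_id, event_date) FROM STDIN WITH (FORMAT csv)", f)

conn.commit()
cur.close()
conn.close()
```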

My custom multi-threaded code also doesn't get above about 150M INSERTs per hour. That's about double the rate, but not an order of magnitude faster or so. Considering Data Interoperability likely has more overhead due to all its inner circuitry, I think it might well be slower than the raw performance I am getting from custom code.

TimMinter
Frequent Contributor

Yeah, I was getting 1.471 million records/second (that's right) doing COPY from CSV to a PostgreSQL unlogged table, which straight up spoiled me. I'm glad I didn't have to get all chunky on the input data, and I'm glad that the local power company, Dell, Microsoft, SAFE, and Esri loaded the 680m records to the EGDB table in the 12 hours that I was not paying any attention at all.
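For anyone curious, the unlogged-table part of that COPY is roughly this (a psycopg2 sketch with placeholder names, not my exact script):

```python
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")   # hypothetical connection
with conn, conn.cursor() as cur:
    # UNLOGGED skips WAL writes, which is a big part of the COPY speed
    cur.execute("CREATE UNLOGGED TABLE public.staging_big "
                "(LIKE public.big_table);")       # columns only, no indexes yet
    with open("big_export.csv") as f:             # placeholder CSV export
        cur.copy_expert("COPY public.staging_big FROM STDIN "
                        "WITH (FORMAT csv, HEADER true)", f)
    # Optional afterwards: make the table crash-safe (this writes it through the WAL)
    cur.execute("ALTER TABLE public.staging_big SET LOGGED;")
conn.close()
```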

Cheers Marco,

tim