Building a Data Driven Organization, Part #9: Load 100 million features into ArcGIS Online

12-15-2021 05:53 AM
BruceHarold
Esri Regular Contributor

This may not be your exact problem (loading 100 million features into a hosted feature service in ArcGIS Online), but the subject here is loading big data into a hosted feature service in Online, and 100 million is the scale I'm talking about.

Here is what 100,000,000 features look like at the density I'm dealing with.

An extent about a city block across:

Layer not visible

Then turning layer visibility on:

Layer visible

The data is so big Pro 2.9 will not attempt to display it at scales smaller than 1:500.

Here is the backstory.  One of the Esri teams I work with occasionally tackles big data destined for a hosted feature service in Online.  The web being what it is, very large sharing transactions can fail due to network or other issues, at which point you have to recover from where the process broke or re-run the entire sharing process.  They wanted a process that works reliably and has a recovery mechanism for any failure.  They were headed down the Python path, which is fine for Pythonistas, but I'm the no-code guy in the office (let's say a recovering coder who occasionally lapses).  ArcGIS Data Interoperability to the rescue!

My test data is a CSV file of 143,751,910 rows.

Get Count

 

I took 100 million rows off the top, just as a nice round number.  The data has lat/lon values, so the ETL aspect is simply spatial enablement to make point features.  Now, how to send these to a point feature layer?
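(If you were doing that step in Python rather than Data Interoperability, the spatial enablement is essentially one geoprocessing call. A minimal sketch, assuming the coordinate columns are named "lon" and "lat" and the output file geodatabase already exists; the paths and field names are placeholders:)

```python
import arcpy

# Hypothetical paths and field names - adjust to match your CSV schema
csv_path = r"C:\Data\taxis.csv"
out_fc = r"C:\Data\Taxis.gdb\TaxiPoints"

# Turn the lat/lon columns into point features in WGS84
arcpy.management.XYTableToPoint(
    in_table=csv_path,
    out_feature_class=out_fc,
    x_field="lon",
    y_field="lat",
    coordinate_system=arcpy.SpatialReference(4326))
```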

A preliminary step was to take a sample of the data to file geodatabase and make a feature service from it; this just instantiates the target layer.  The downstream problem is then two-fold:

  • Load the data as performantly as possible
  • Use a methodology that supports recovery from failure

The Online team suggested the 'sweet spot' for loading data at this scale would be calling the feature service's Append endpoint with batches of 500K records in two concurrent processes, run 'after hours' US time.  No problem, here are my workspaces:

LoadTaxis

LoadTaxis creates zipped file geodatabases of 500K records each, then hands each data path off to LoadTaxisWorker in a maximum of two processes that run asynchronously.  I found the time it takes to cook the data was a good fit for how long it takes Online to eat it, a little less than 2 minutes per batch during low-load hours.  Watching the log file go by, there were a few instances of LoadTaxis pausing for a process slot to become available, but usually Online was ready for the next batch, so things were working as fast as my ETL could run.
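If you were scripting the equivalent driver in Python instead, the pattern is just "stream the CSV in 500K-row chunks and keep at most two batches in flight". A minimal sketch, where load_batch is a hypothetical stand-in for everything LoadTaxisWorker does:

```python
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

BATCH_SIZE = 500_000   # records per Append call, per the Online team's advice
MAX_WORKERS = 2        # at most two concurrent load processes

def load_batch(batch_id, frame):
    """Hypothetical stand-in for LoadTaxisWorker: write the chunk to a zipped
    file geodatabase, upload it, call Append, and clean up on success."""
    ...

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    pending = []
    for batch_id, frame in enumerate(
            pd.read_csv(r"C:\Data\taxis.csv", chunksize=BATCH_SIZE)):
        pending.append(pool.submit(load_batch, batch_id, frame))
        # Keep at most MAX_WORKERS batches in memory / in flight at once;
        # this is the "pause for a process slot" behavior described above
        while len(pending) >= MAX_WORKERS:
            pending.pop(0).result()
    for f in pending:
        f.result()   # re-raise any worker exception so failures are visible
```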

 

LoadTaxisWorker

LoadTaxisWorker uploads each zipped file geodatabase to Online, then calls the service's Append endpoint, triggering a load from the new file geodatabase item.  This starts a job in Online.  A looping custom transformer checks for job completion every 5 seconds (with a maximum check count of 42).  Note the HTTPCaller in the custom transformer needs a concurrency setting of 1 to work in a loop.  On successful completion the file geodatabase item is deleted, then a second child workspace, LoadTaxisCleanerUpper, is called to delete the input zipped file geodatabase in a separate process (otherwise open file handles would prevent deletion).

LoadTaxisCleanerUpper

You will see in LoadTaxisWorker that I have a couple of Emailer transformers (disabled in the download), and that the WorkspaceRunner that calls the cleanup process allows for a bump in the number of child processes so things don't hang.
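For the Python-minded, the worker step expressed with the ArcGIS API for Python looks roughly like this. This is a sketch only, not the workbench logic: the workbench calls the REST Append endpoint via HTTPCaller, while the sketch uses the library's append method instead, and the layer URL, item title, and source table name are assumptions. The Emailer and cleanup plumbing is omitted.

```python
import os
from arcgis.gis import GIS
from arcgis.features import FeatureLayer

gis = GIS("https://www.arcgis.com", "username", "password")   # your credentials
# Hypothetical target layer - use your hosted feature layer's REST endpoint
layer = FeatureLayer(
    "https://services.arcgis.com/xxxx/arcgis/rest/services/Taxis/FeatureServer/0", gis)

def load_zipped_gdb(zip_path, batch_id):
    # Upload the zipped file geodatabase as a temporary content item
    item = gis.content.add(
        {"title": f"taxi_batch_{batch_id}", "type": "File Geodatabase"},
        data=zip_path)
    # Append from the uploaded item into the target layer,
    # rolling back on failure to match the HTTPCaller settings
    ok = layer.append(
        item_id=item.id,
        upload_format="filegdb",
        source_table_name="TaxiPoints",   # assumed feature class name in the gdb
        rollback=True)
    if not ok:
        # Leave the Online item and the local zipfile in place as the paper trail
        raise RuntimeError(f"Append failed for batch {batch_id}")
    # Success: remove the temporary item and the local zipfile
    item.delete()
    os.remove(zip_path)
```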

Now, this being the web, you can get failures even when following best practices, but if you do, there is a paper trail built into the methodology.  The file geodatabases made for upload are not deleted from disk or as Online items like they are for successful jobs, so you can see any failed data:

File geodatabases not deleted

And here are file geodatabase items relating to failed Append jobs - not deleted either.

File geodatabase items not deleted

However, if you inspect the HTTPCaller that calls Append, you'll see I set rollback on failure to true, so all we have to do is:

  • Manually delete the failed file geodatabase items from Online content
  • Manually run LoadTaxisWorker using the failed file geodatabase zipfiles and batch IDs as input parameters (see the sketch below)
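Scripted, that recovery is just a loop over whatever failed, reusing the hypothetical load_zipped_gdb from the earlier sketch once the stale items are gone from Online (the batch IDs and paths below are placeholders):

```python
# Zipfiles left on disk by the failed Append jobs (hypothetical batch IDs/paths)
failed_batches = {
    12: r"C:\Temp\taxi_batch_12.gdb.zip",
    57: r"C:\Temp\taxi_batch_57.gdb.zip",
}

for batch_id, zip_path in failed_batches.items():
    load_zipped_gdb(zip_path, batch_id)   # re-run the worker for each failed batch
```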

So, after my hands-free run over the 100 million features, the tool reports that it executed without error...

LoadTaxis worked fine

...but the loaded feature count is short because of the failed jobs:

The target feature service didn't get all data

We can see that 7 failed jobs equate to the 3,500,000 features missing from the service.  Let's get them there!

After deleting the file geodatabase items in Online I can run each job manually like this:

Run LoadTaxisWorker manually

I actually prefer to run the LoadTaxisWorker tool in edit mode (i.e. in Workbench) so I can do things like turn the Emailers on to catch the action in detail.  There is also this issue to watch for, which I'll put in a spoiler tag:

Spoiler
Warning:  Append called by LoadTaxisWorker may take far longer than anticipated (hours, not minutes) if run at a busy time for Online, because Append is a shared resource and may get queued.  In this case the Emailer will tell you 210 seconds (or your preferred timeout) went by without Append finishing; the job will still complete eventually, but the cleanup operations will not trigger.  I experienced this and enabled the Emailer that sends the job URL so I could watch it manually.
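If you are scripting and want the same watchdog behavior, a polling loop against the Append job's status URL looks something like the sketch below. The URL and JSON field names are assumptions about the asynchronous Append pattern, so verify them against the actual response from your service:

```python
import time
import requests

def wait_for_append(job_status_url, token, checks=42, interval=5):
    """Poll an Append job until it completes, fails, or the check budget
    (42 checks x 5 seconds = 210 seconds by default) is exhausted."""
    for _ in range(checks):
        resp = requests.get(job_status_url, params={"f": "json", "token": token})
        status = resp.json().get("status", "")   # assumed field name
        if status == "Completed":
            return True
        if status == "Failed":
            return False
        time.sleep(interval)
    # Timed out: the job may still finish later, but cleanup should not run yet
    return False
```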

Anyway, after manually running your failed jobs you will have 100 million features in your service!

Yay, my 100 million features are loaded!

How hard was that?

The ETL tools are in the post download.  I used ArcGIS Pro 2.9 and the Data Interoperability extension.