
Is concurrent append supported?

02-11-2025 08:04 AM
MichaelBurbea1
Emerging Contributor

I'm currently trying to speed up a very slow upload of a point feature layer with approximately 11MM rows. There are 115 columns; approximately 10 are strings and the remainder are doubles/ints/floats (mostly doubles, and mostly NULL).

I am creating a file geodatabase for my data, using the ArcGIS API for Python to upload the file gdb, and then publishing it. Unfortunately it is painfully slow: ~4 hrs. It starts fast, but as the table gets larger, the publish rate drops drastically. (I have a polling monitor script that queries the published feature layer collection every 5 minutes; it starts at >300k rows per 5 minutes, but drops to around 20k by the 3 hr mark.)
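
For reference, the monitor is roughly the sketch below (the org URL, credentials, and item ID are placeholders, not the real values):

import time
from arcgis.gis import GIS

gis = GIS("<org url>", "<user>", "<pwd>")
# layers[0] is the published point layer; return_count_only asks the service for just the row count
layer = gis.content.get("<published item id>").layers[0]
while True:
    print(time.strftime("%Y-%m-%d %H:%M:%S"), layer.query(return_count_only=True))
    time.sleep(300)  # poll every 5 minutes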

As an experiment, I switched to uploading two file geodatabases: one that sets up the schema, and one that contains 8 roughly equivalent partitions. I upload the files, publish the 0-row gdb, and then run 8 different append operations in parallel. This is much faster (>1.2MM rows per 5 minutes). However, my failure rate is skyrocketing. (To be fair, the old 4 hr job often fails too.) I have to update my data monthly, and most of my other tables are much smaller (ranging from 400 rows to about 1.6MM for the remaining datasets).
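
Roughly, the parallel part of that experiment looks like this (sketch only, not the exact script; the item IDs and feature class name are placeholders):

from concurrent.futures import ThreadPoolExecutor
from arcgis.gis import GIS

gis = GIS("<org url>", "<user>", "<pwd>")
# target is the 0-row hosted layer published from the schema gdb
target = gis.content.get("<published empty layer item id>").layers[0]
# one uploaded file geodatabase item per partition
partition_item_ids = ["<gdb item id 1>", "<gdb item id 2>"]  # ... 8 in total

def append_partition(item_id: str):
    # append() runs server-side against the uploaded file gdb item
    return target.append(
        item_id=item_id,
        upload_format="filegdb",
        source_table_name="<feature class name>",
        upsert=False,
        rollback=False,
    )

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(append_partition, partition_item_ids))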


I've also tried a much narrower table, approximately 10 columns with the same number of rows, and I'm still getting atrocious publish speeds. I checked my network connection, and the file upload is not the problem. (The payload for the full table is only 2 GB zipped, and the upload usually finishes in under a minute since I have gigabit upload speed.) The publish or the append is what is problematically slow.


How does one upload very large datasets to ArcGIS Online without going crazy? I could probably spin up a new VM, set up Postgres, and build my table in less than 4 hours.

4 Replies
TonyAlmeida
MVP Regular Contributor

Could you post your code? That might give people some ideas on how to improve the process.

One thing that I know slows my processes down is Editor Tracking. Also, maybe try overwriting instead of appending.
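
Something along these lines, if you want to try the overwrite route (untested sketch; the item ID and gdb path are placeholders):

from arcgis.gis import GIS
from arcgis.features import FeatureLayerCollection

gis = GIS("<org url>", "<user>", "<pwd>")
item = gis.content.get("<hosted feature layer item id>")
flc = FeatureLayerCollection.fromitem(item)
# editor tracking settings should show up in flc.properties (editorTrackingInfo)
# if you want to confirm they're off
# overwrite() republishes the whole service from a new file geodatabase in one call
flc.manager.overwrite("path/to/new_data.gdb.zip")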

MichaelBurbea1
Emerging Contributor

The performance is abysmal either way I do it.
I've tried both append and publish. I use a view-swap technique to keep the app running while the layers are updated.
Here is the script that handles the initial section of the publish. It either uploads a single gdb and publishes it, or it uploads the initial empty point feature layer gdb plus the data gdb and publishes the empty feature layer. This finishes in about a minute for the empty case.
The feature class is also not set up for editor tracking, as it is not editable.

import truststore
import warnings
from urllib3.exceptions import InsecureRequestWarning
truststore.inject_into_ssl()
# Note: workaround to silence SSL warnings in the Esri API:
# https://github.com/Esri/arcgis-python-api/issues/2164
warnings.simplefilter("ignore", InsecureRequestWarning)
# end work around.
import argparse
import arcgis
import json
import io
import time
from typing import List, Optional, Tuple
from arcgis.gis import ItemProperties, ItemTypeEnum, Item
from arcgis.features import FeatureLayerCollection
from arcgis.features.managers import FeatureLayerCollectionManager
from concurrent.futures import wait, Future, FIRST_EXCEPTION
    
def main(
    gdb: str,
    data_gdb: Optional[str],
    view: str,
    user: str,
    pwd: str,
) -> None:
    """Main function to handle AGOL feature layer upload and view management."""
    gis = arcgis.GIS("<org url>", user, pwd)
    vwArr = gis.content.search(
        query = f"owner:{gis.users.me.username} title:{view} typekeywords:View",
        max_items=1
    )

    if vwArr:
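        # Existing view found: look up its current backing layer and flip the
        # A/B title suffix so the new publish does not collide with it.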
        vw = vwArr[0]
        oldFc = vw.related_items(rel_type = "Service2Service", direction = "reverse")[0]
        oldFc = gis.content.get(oldFc.id)
        tgt = oldFc.title[:-1] + ("A" if oldFc.title.casefold()[-1] == "b" else "B")
    else:
        tgt = view[2:] + "_A"
    existing = gis.content.search(
        query=f"owner:{gis.users.me.username} title:{tgt}"
    )
    for f in existing:
        f.delete()

    agol_fldr = gis.content.folders.get("<app specific folder here>")
    agol_gdb = agol_fldr.add(
        ItemProperties(title = tgt,item_type = ItemTypeEnum.FILE_GEODATABASE.value),
        file = gdb
    ).result()
    if data_gdb is not None:
        agol_fldr.add(
            ItemProperties(title = f"{tgt}_0",item_type = ItemTypeEnum.FILE_GEODATABASE.value),
            file = data_gdb
        ).result()
    fc = agol_gdb.publish()
    agol_gdb.delete()


def parse_args() -> argparse.Namespace:
    """Parse command line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--gdb", required = True, help = "The gdbs passed as a comma delimited list, expected either one or two expected")
    parser.add_argument("--view", required = True, help = "The arcgis view name")
    parser.add_argument("--user", required = True, help = "AGOL user name")
    parser.add_argument("--pwd", required = True, help = "AGOL password")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    zips = args.gdb.split(",")
    gdb = zips.pop(0)
    data = zips.pop(0) if len(zips) > 0 else None
    main(gdb, data, args.view, args.user, args.pwd)

BlakeTerhune
MVP Regular Contributor

Instead of trying to upload standalone data, can you make it available in a map service from ArcGIS Server? That way you can update it locally and there's no burden to upload all that data.

MichaelBurbea1
Emerging Contributor

We don't have ArcGIS Enterprise, and the data changes monthly. Generating the gdb takes on the order of minutes (maybe 10-ish). So it's really not the volume of the data that is the problem. I am trying to post to an ArcGIS Online service.
