I am writing a pipeline script that gets data from different sources using POST requests, adds it to a GeoDataFrame and processes the data with GeoPandas functions. Then I need to add the result to AGOL.
I don't use local files; everything is in memory.
Can I use gis.content.add to add a GeoJSON dictionary to AGOL? Or any other in-memory data, like a FeatureSet?
I have been trying this:
from arcgis.gis import GIS

gis = GIS("https://arcgis.com", username, password)
json_gdf = gdf.to_json()
data_properties = {
    'title': 'new_test',
    'description': 'test',
    'tags': 'test',
    'type': 'GeoJson'
}
flayer = gis.content.add(data_properties, json_gdf)
I get the following error:
flayer=gis.content.add(data_properties, json_gdf)
File "/mnt/c/Users//Documents/Github//.venv/lib/python3.8/site-packages/arcgis/gis/__init__.py", line 5141, in add
itemid = self._portal.add_item(
File "/mnt/c/Users//Documents/Github//.venv/lib/python3.8/site-packages/arcgis/gis/_impl/_portalpy.py", line 362, in add_item
raise RuntimeError("File(" + data + ") not found.")
In a way it makes sense, because ContentManager.add expects the following formats:
"Content can be a file (such as a service definition, shapefile, CSV, layer package, file geodatabase, geoprocessing package, map package) or it can be a URL (to an ArcGIS Server service, WMS service, or an application)."
What are my options here?
I also tried adding the geodataframe to AGOL directly using spatial.to_featurelayer - it works, but the field names are messed up, since that tool uses a shapefile as an intermediary (e.g. "SOURCE_FEATURE" becomes "source_fea"). That means I cannot append that data to an existing feature layer without having to write hundreds of field mapping rules.
NOTE: edit_features is not an option, the data is too big, and even chunked, it takes too long to add (and times out in Azure, but that's another story).
Take a look at arcgis.features.GeoAccessor.from_geodataframe. You can convert your geodataframe to a spatially enabled dataframe, which can then be written directly to a new layer using to_featurelayer. Using an intermediate shapefile shouldn't be necessary.
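Untested sketch of what I mean, assuming gdf is the GeoDataFrame from your pipeline, gis is the connection from your snippet above, and the title/tags are placeholders:

from arcgis.features import GeoAccessor

# convert in memory, no shapefile involved
sdf = GeoAccessor.from_geodataframe(gdf)

# publish the spatially enabled dataframe as a new hosted feature layer item
new_item = sdf.spatial.to_featurelayer(title='new_test', gis=gis, tags='test')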
There is also the append function, but I don't have any experience with using it.
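If you do go the append route, the rough pattern (going by the docs, not experience) is to upload the GeoJSON as an item first and then append from that item into the existing layer. A minimal sketch, assuming gis and json_gdf from your snippet, a hypothetical target_item pointing at the existing hosted feature layer, and that the layer has the Append capability enabled:

import tempfile

# append needs an uploaded item (or upload id) as its source, so the
# in-memory GeoJSON briefly touches disk as a temp file here
with tempfile.NamedTemporaryFile('w', suffix='.geojson', delete=False) as f:
    f.write(json_gdf)
    tmp_path = f.name

geojson_item = gis.content.add(
    {'title': 'append_source', 'type': 'GeoJson'}, data=tmp_path
)

# append the uploaded GeoJSON into the first layer of the existing service
target_lyr = target_item.layers[0]
target_lyr.append(item_id=geojson_item.id, upload_format='geojson', upsert=False)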
I know you said it's another story, but how does edit_features time out? How are you chunking it up when you try that?
Also: your post makes it sound like you want this data to get added to an existing layer, rather than be added as a new layer, so it's a bit confusing that you're focusing on the content.add function here. Is the end goal a separate layer, or edits to an existing service?
If the latter: are you simply adding new records to the layer, or are some rows being updated / deleted?
I am sorry for the confusion.
Yes, I am trying to add new records to an existing layer. I was using edit_features with good results, but since this is a massive dataset, it times out. I am rerunning now so I can add the exact error here.
How do I chunk? I wrote a function that basically iterates over the feature set and adds a couple hundred features at a time. Nothing fancy.
Documentation for edit_features also mentions: "When making large number (250+ records at once) of edits, append should be used over edit_features to improve performance and ensure service stability."
Tried arcgis.features.GeoAccessor.from_geodataframe and to_featurelayer, but it messes up the field names. It's not me using an intermediate shapefile; to_featurelayer uses one internally, and that is what mangles the field names. This makes the result unusable with the append tool (unless I write field mappings).
First thing: have you tried using sanitize_columns=False when you convert the dataframe? It defaults to True, and does have the potential to mess with your columns quite a bit, depending on the original names. But I know there are places where it will sanitize the column names and not let you choose otherwise.
For chunking your edits, I do the same thing, basically.
i = 0
chunk = 200
while i < len(sdf):
    fs = sdf.iloc[i:i+chunk].spatial.to_featureset()
    featurelayer.edit_features(adds=fs)
    i += chunk
I have a few services I regularly append to and edit, and some of them are quite large datasets (though I wouldn't call them "massive"), and this method seems to work just fine.
Finally, if you're working w/ JSON data, you can submit a list of dicts to the edit_features function, so long as they follow the format it expects. I have a couple of scripts that use that method for one reason or another.
feats = [
    {
        'attributes': {
            'some_attribute': 'a value',
            'another_attribute': 1002.14
        },
        'geometry': {
            'x': 44.05,
            'y': 27.2201,
            'spatialReference': {
                'wkid': 4326
            }
        }
    },
    {
        'attributes': {
            'some_attribute': 'a different value',
            'another_attribute': -1.04
        },
        'geometry': {
            'x': 41.3,
            'y': 28.912,
            'spatialReference': {
                'wkid': 4326
            }
        }
    }
]
featurelayer.edit_features(adds=feats)
I also do chunks with edit_features to avoid timeouts. In a recent version of the Python API they also added the future parameter, which allows asynchronous updates. I haven't had a chance to see what kind of effect that has on timeouts, though.
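Roughly like this, though again I haven't measured the effect on timeouts; I'm assuming the returned object behaves like the other async jobs in the API, where .result() blocks until the server finishes:

# submit the chunk without waiting for the server to finish
job = featurelayer.edit_features(adds=fs, future=True)

# ... build the next chunk while the edits run ...

result = job.result()  # block here to collect the edit results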
Josh
Documentation says sanitize_columns is set to False by default, but I just tried it nonetheless. This was the code:
from arcgis.features import GeoAccessor

sdf = GeoAccessor.from_geodataframe(gdf)
lyr = sdf.spatial.to_featurelayer(title='test', sanitize_columns=False)
That threw an error too (I'll add the exact message once I have it). In the meantime, this is the edit_features fallback I'm running:
try:
    add_result = Target_Layer.layers[0].edit_features(adds=fset)
except Exception:
    # fall back to adding the features one at a time
    for feature in fset.features:
        try:
            add_result = Target_Layer.layers[0].edit_features(adds=[feature])
        except Exception:
            # some error message
            pass
I'll see how this acts once deployed to Azure.