How best to write dataframes to ArcGIS Enterprise MS SQL Server using Python

TylerT · ‎06-26-2024

Hello,

I use python extensively to ETL data and write dataframes to our ArcGIS Enterprise MS SQL Server using GeoAccessor method to_featureclass. Given the limitations of table handling for GeoAccessor method to_featureclass mentioned here...

https://community.esri.com/t5/arcgis-api-for-python-ideas/additional-table-handling-for-geoaccessor-...

...what are folks doing to truncate then insert into tables. In other words, rather than dropping a table and creating a new table how does one delete all records (keeping schema/columns) and insert new rows.

Using the to_feature class as-is (dropping tables) constantly runs into locked schema tables, on top of probably not being a best practice.

Please chime in with your work flow...good, bad, or ugly.

Thank you,

Tyler

jcarlson · ‎07-05-2024

Hi, Tyler! Missed your post originally, but this is definitely something we used to do.

If your service does not have Sync enabled, you can use FeatureService.manager.truncate() to empty the table. Assuming your source and destination data types are the same, you can add new data to the layer using FeatureService.edit_features(), passing dataframe.spatial.to_featureset() into the adds parameter.

The API docs suggest not doing this with large numbers of features, so we break up the dataframe into batches.

We wrote up a generic version of this process, which you can find over here:

https://codeberg.org/kendall-county-gis/project-templates/src/branch/main/generic-etl/truncate-appen...

There's also an alternative in that file for if you do have sync enabled and Truncate is not an option for you. It's written with a file-based resource in mind, but the source dataframe could come from anywhere.

For the sake of including the code here, here's an even more generalized version:

from arcgis.gis import GIS
from arcgis.features import GeoAccessor, FeatureLayer

df = GeoAccessor.from_featureclass('some file path') # or replace with something else, like a SQL query or something

fl = FeatureLayer('url')

n = 0
while n < len(df):
    fs = df.iloc[n:n+200].spatial.to_featureset()
    service.edit_features(adds=fs)
    n += 200

WARNING: the ArcGIS Python API and Pandas really don't see eye-to-eye on how to handle datetime values. If you have any datetimes in your source data, be prepared for extra work and frustration to get those added. Your best bet is to convert the dates to their epoch integer, but be sure to use the pandas dtype Int (note the capital I) so that null values can still be passed through, in the event that your data has empty values in the datetime fields.

I said this was something we did, but we've since moved away from the full truncate/append. If you load the source and destination layers into their own dataframes, you can use pandas.DataFrame.compare(), or even the recently-added gis.features.GeoAccessor.compare() to identify adds, updates, and deletes, and only edit the rows that have changed between the two. For large layers with a low volume of edits, it can really speed up the process.

- Josh Carlson
Kendall County GIS

TylerT · ‎07-05-2024

@jcarlson,
Thanks for your reply. I had a major typo in my initial post subject, so you might not have missed it, rather skipped it due to the typo title. Anyway, I'll study your response and see if I can work it into our workflow.

It's too bad we didn't have something like this with GeoAccessor:

to_featureclass(location, how=append, has_z=None, has_m=None, sanitize_columns=True)

Parameter: how
Description: Optional String. Options include
drop_create_insert, truncate_append. The
default it append.

The location is an sde connection to an EGDB giving direct access to EGDB via sde connection instead of through the portal via GIS(). I don't have any real reason to want to avoid GIS(), other than why not cut out the middle man and write directly to EDGB?

Tyler
Enterprise 11.2

jcarlson · ‎07-05-2024

If the layer isn't using any kind of versioning, writing straight to the EGDB is doable. You don't need GIS at all, just pandas. I've never done it with geometry though, and we aren't using MSSQL, so it may be trickier in actual practice than in theory.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html

- Josh Carlson
Kendall County GIS

TylerT · ‎07-05-2024

Hi @jcarlson et al,

Following my initial post awaiting responses I developed the following non-spatial tables. I prefer using SqlAlchemy but other engines will do.

def df_write(df, tbl, engine):
    dtype_sa = <your data types dict>
    with engine.connect() as con:
        df.to_sql(name=tbl, con=con, if_exists='append', index=False, dtype=dtype_sa)
        print("*** SQL WRITTEN ***")
        
def truncate(tbl_name, pk, pk_dtype, engine):
    table = map_table_wo_pk(tbl_name, pk, pk_dtype, engine) #orm automapper func
    with orm.Session(engine) as session:
        session.query(table).delete()
        session.commit()

def df_trunc_insert(df, tbl, pk, pk_dtype, sde, engine):
    if tbl_exists(tbl):  
        print(f'Table "{tbl}" exists.  Truncating and inserting table.')
        truncate(tbl, pk, pk_dtype, engine)
        df['OBJECTID'] = np.arange(df.shape[0]) + 1 # OBJECTID must not be null
        df['GlobalID'] = [uuid.uuid4() for _ in np.arange(df.shape[0])] # GUID must not be null
        df_write(df, tbl, engine) 
    else:
        print(f'Table "{tbl}" does not exist.  Creating new table.')
        df.spatial.to_table(location=sde/tbl) #This method uppers() the table title and adds OBJECTID, also upper()
        arcpy.env.workspace = str(sde)
        print(f'Adding GUID/UUID')
        arcpy.management.AddGlobalIDs(tbl)

if __name__ == "__main__":
    df_trunc_insert(<df>, <tbl>, <pk>, <pk_dtype>, <sde (pathlib)>, <your engine>)

The idea here is upon initial table creation to use geoaccessor df.spatial.to_table(). df.spatial.to_table() checks data typing, registers the table with the EGDB, and maybe a few other things behind the curtains. Then, since df.spatial.to_table() doesn't have a truncate, revert to SqlAlchemy to truncate (hold schema) and insert for subsequent overwrites (if table exists). A few extra steps are required to repopulate ObjectID and repopulate GUID. And finally, take care of a little gotchya fact that df.spatial.to_table() doesn't set PKs, but SqlAlchemy automap reflection requires PKs...surmountable with proper automap setup.

I haven't setup a script for spatial tables yet, but do foresee issues with the geometry column and SqlAlchemy passing it to the EDGB properly.

I hoping to get as many ideas for best practices here. I have many production use cases, as I'm sure others do, for all of our automated dataframes to update our EGDB tables without dropping the EGDB table.

Tyler
Enterprise 11.2