I'm done with Spatially Enabled DataFrames

DrewLevitt · 02-11-2021

I recently updated to ArcGIS Pro 2.7.1. This new version of ArcGIS Pro comes with a new version of the arcgis package (version 1.8.3). And this new version of the arcgis package, regrettably, comes with new bugs for Spatially Enabled DataFrames - even beyond the myriad bugs I've previously encountered and solved or worked around.

Just to be clear, I'm talking about Spatially Enabled DataFrames (SEDFs), not Spatial DataFrames (SDFs). https://developers.arcgis.com/python/guide/introduction-to-the-spatially-enabled-dataframe/

I love, love love love, the idea of Spatially Enabled DataFrames - an easy way to get data from an Esri feature class into a pandas DataFrame and vice versa. But in practice, these have been the most cursed, unreliable, exasperating structures I've worked with for a long time.

First there was the snafu where sedf.spatial.to_featureclass() silently mutated the calling object, overwriting whatever index the SEDF had had with a simple integer index. Of course this breaks the ability to join data onto that SEDF based on the original index. Nowhere was this behavior documented or justified. Still, there were workarounds (generally, just call sedf.reset_index() to capture the index in a column), but talk about ungraceful and un-Pythonic.
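Here is a minimal sketch of that reset_index() workaround in plain pandas. The `export_and_reset` function is a hypothetical stand-in that only imitates the index overwrite described above; it is not the real to_featureclass(), and the column names and values are invented for illustration.

```python
import pandas as pd


def export_and_reset(df):
    """Hypothetical stand-in: imitates an export that overwrites the
    caller's index with a simple integer index, in place."""
    df.reset_index(drop=True, inplace=True)


# A frame keyed by a meaningful index, plus attributes to join on later.
parcels = pd.DataFrame(
    {"acres": [1.5, 2.0, 0.7]},
    index=pd.Index(["A-1", "B-2", "C-3"], name="parcel_id"),
)
owners = pd.DataFrame(
    {"parcel_id": ["A-1", "B-2", "C-3"], "owner": ["Ada", "Bo", "Cy"]}
)

# Without the workaround, the key is lost once the index is overwritten.
broken = parcels.copy()
export_and_reset(broken)

# With the workaround, capture the index in an ordinary column first.
safe = parcels.reset_index()
export_and_reset(safe)

# The key survived as a column, so the join still works.
joined = safe.merge(owners, on="parcel_id")
print(joined)
```

The cost is an extra column in the output, which is the price of keeping the key somewhere the export cannot overwrite it.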

Once upon a time, sedf.spatial.to_featureclass() could write out to an FC in the memory workspace - but no more, as of ArcGIS Pro version 2.5 or 2.6 (I forget which). Suddenly those straightforward calls were throwing errors that I could never fully resolve, so I had to rewrite a bunch of scripts to write out to FCs in the scratch GDB, which is slower and less elegant.

The latest twist as of arcgis package version 1.8.3 is that sedf.spatial.to_featureclass() mutates the data itself, casting all column names to LOWERCASE - in the output FC and in the calling object itself! Why? Why, why, why? This ruins both the calling SEDF and the output FC. In my case this led to a brand-new, mysterious crash that took me the better part of a full day to diagnose and fix.

The latest fix, by the way, was abandoning Spatially Enabled DataFrames entirely and falling back to some nice, simple, GP-tool only code from David Wasserman and Roland Viger. And that leads me back to the title of my post: I'm done with Spatially Enabled DataFrames. I have run headlong into one too many mysterious, subtle bugs that hold me up for many hours on end. I love the concept but the execution is just not there - and, as far as I can tell, it's getting worse, not better.

Have other folks had similar issues with Spatially Enabled DataFrames? Have you found any good fixes? Is there any way I can help improve SEDFs so they deliver on their amazing promise?

jcarlson · 02-12-2021

I've definitely got my share of gripes w/ the SEDF, and am hoping some of them get addressed in future versions. I'd submit a bug report on their GitHub page.

If you look into it, you'll see that the to_featureclass() method has a default parameter of sanitize_columns=True. Peeking into the code itself, we see:

    # sanitize
    if sanitize_columns:
        # logic
        _sanitize_column_names(geo, inplace=True)

And a level deeper, we see that sanitize_column_names defaults to "snake_case". Setting sanitize_columns to False yields a result like the following:

Hooray! Field names unaltered, and my input dataframe is the same.

The issue here seems to be the unmodifiable inplace=True. If you want to export the frame but need to ensure it's unmodified for further use, it's probably a good idea to just change your code to df.copy().spatial.to_featureclass(...), whether you ask it to sanitize your column names or not.
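That suggestion can be wrapped in a small helper. This is only a sketch: `export_unmodified` is a made-up name, not part of the arcgis package, and it assumes to_featureclass() accepts `location` and `sanitize_columns` keyword arguments as discussed in this thread (the latter appears to exist only in newer arcgis releases).

```python
def export_unmodified(sedf, location, sanitize_columns=False, **kwargs):
    """Export a copy of `sedf` so the caller's frame is never touched.

    Hypothetical helper. Assumes `sedf` is a Spatially Enabled DataFrame
    whose `.spatial.to_featureclass()` accepts `location` and
    `sanitize_columns`; any other keyword arguments are passed through.
    """
    # Any in-place changes made during the export land on this copy.
    working_copy = sedf.copy()
    return working_copy.spatial.to_featureclass(
        location=location,
        sanitize_columns=sanitize_columns,
        **kwargs,
    )
```

Calling `export_unmodified(df, some_output_path)` should then leave `df` alone even if the export sanitizes names or resets the index. The copy does cost extra memory on large frames.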

- Josh Carlson
Kendall County GIS

DrewLevitt · 02-12-2021

Josh, thanks very much for this helpful reply! Serves me right for not consulting the latest documentation. Out of curiosity, how were you able to peek into the code? The GitHub repo seems to contain only documentation and examples.

Speaking of which, I found the documentation to be a little vague on this. to_featureclass() mentions the sanitize_columns parameter, but there's no explicit connection within the documentation itself between that parameter and the sanitize_column_names() method.

In general, I would say that a method that exports data should not simultaneously mutate those data, either in the calling object or in the output data. That just feels like a weird design pattern. I think I'd rather see sedf.spatial.sanitize_column_names() be one method, and sedf.spatial.to_featureclass() be another, and if you want to sanitize your column names for output to a feature class, the approach should be to just call the two methods separately.

jcarlson · 02-12-2021

I use VSCode, but I'm sure any decent IDE does the same thing:

Right-click and choose "Go to Definition", or hit F12. That will open the specific Python file at the line that defines the function/class/method you've clicked on.

From there, you can browse further down and see what is being referenced by that code.

Note that you cannot make changes to the built-in ArcGIS Pro files, but if you're working with your own Python env, you can make some tweaks to the code, like setting "sanitize_columns" to default to False, or adding the ".copy()" so as to avoid modifying the input frame.

Of course, should you go that route, know that you're stepping outside of the accepted use of the code, but if you find it to be a useful improvement, suggest it as an enhancement on GitHub!

- Josh Carlson
Kendall County GIS

Anonymous User · 10-21-2021

@DrewLevitt @jcarlson thanks for sharing your opinion. We regret the surprises and inconvenience caused by the sanitizer logic. In the upcoming v2.0 release, we plan to switch the sanitizer to default to False. We also plan to make the sanitizer not mutate the original column names or indices of the calling DataFrame object. You can track the progress on https://github.com/Esri/arcgis-python-api/issues/923

DrewLevitt · 10-21-2021

Thank you, this is great to hear! Sounds like the long-term fix for us users will be to check the version of `arcgis` and pass `sanitize_columns=False` only if the version falls into the era when this parameter existed and defaulted to True. (Or, I guess, just to require that people using our code update their environment to `arcgis` version 2.0+.)
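A rough sketch of that version check might look like the following. The helper name `to_featureclass_kwargs` is made up, and the version bounds are assumptions based only on this thread (the parameter first bit me in 1.8.3, and the default is planned to change in 2.0), so adjust them to whatever the release notes actually say.

```python
import re


def to_featureclass_kwargs(arcgis_version):
    """Return extra keyword arguments for to_featureclass().

    Assumption (not confirmed): `sanitize_columns` exists and defaults
    to True from version 1.8.3 up to, but not including, 2.0.
    """
    # Keep only the leading numeric components, e.g. "1.8.3" -> (1, 8, 3).
    parts = tuple(int(p) for p in re.findall(r"\d+", arcgis_version)[:3])
    if (1, 8, 3) <= parts < (2, 0):
        return {"sanitize_columns": False}
    return {}


print(to_featureclass_kwargs("1.8.3"))  # {'sanitize_columns': False}
print(to_featureclass_kwargs("2.0.0"))  # {}
print(to_featureclass_kwargs("1.8.2"))  # {}
```

In a real script the result would be unpacked into the export call, something like `sedf.spatial.to_featureclass(location, **to_featureclass_kwargs(arcgis.__version__))`.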

Thanks again, Drew