
Stand-Alone Table to Pandas Data Frame

11-15-2023 12:04 PM
KellyTaylor
Regular Contributor

Hello,

I'm looking for the most efficient way to convert a stand-alone table to a pandas data frame. This table will ultimately be geocoded and saved as a feature class, but I need to manipulate the data and fields quite a bit before that. I find pandas to be the easiest and most efficient way to do the data manipulation.

Currently, I am using arcpy.conversion.TableToTable() to first convert the table to a csv file, and then pandas.read_csv() to convert to a data frame. The table has roughly 63,000 records and it is taking over an hour to do the TableToTable portion of the conversion. 

Is there a better way to do this? Pandas is so quick to read the csv, and I wish I could read the table directly into pandas without the intermediate csv.

Many Thanks

4 Replies
DanPatterson
MVP Esteemed Contributor

Try

TableToNumPyArray—ArcGIS Pro | Documentation

or

TableToArrowTable—ArcGIS Pro | Documentation

to see if they are faster since Pandas has conversions for both
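
Pandas builds a DataFrame directly from the structured NumPy array that TableToNumPyArray returns, so the whole round trip is two calls. A sketch (the helper and field list are my own names, and the arcpy call itself is untested here; the DataFrame construction is demonstrated on a synthetic structured array):

```python
import numpy as np
import pandas as pd

def table_to_dataframe(table, fields="*"):
    """Read a stand-alone table straight into pandas via a structured array."""
    # Deferred import so this module loads without arcpy installed.
    from arcpy.da import TableToNumPyArray
    arr = TableToNumPyArray(table, fields)  # structured NumPy array
    return pd.DataFrame(arr)

# The DataFrame construction works on any structured array, e.g.:
demo = np.array([(1, "Main St"), (2, "Oak Ave")],
                dtype=[("OBJECTID", "<i4"), ("ADDRESS", "<U20")])
df = pd.DataFrame(demo)
```

One caveat: TableToNumPyArray raises on nulls in numeric fields unless you pass skip_nulls or null_value. The Arrow route should be similar, something like arcpy.da.TableToArrowTable(table).to_pandas(), though I haven't timed either.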


... sort of retired...
0 Kudos
LaurencePerry
Emerging Contributor

Here's a function that should do the trick... 

Originally sourced from: https://gist.github.com/d-wasserman/e9c98be1d0caebc2935afecf0ba239a0?permalink_comment_id=3623359

def feature_class_to_dataframe(input_fc: str, input_fields: list = None, query: str = ""):
    """Converts a feature class to a pandas dataframe. If 
    no input fields are specified, all fields
    will be included. If a query is specified, only those
    features will be included in the dataframe.

    This is an excellent function to use when exploring data
    without having to queue up ArcGIS Pro. Particularly good
    for using pandas to generate unique field values.

    Args:
        input_fc (string): path to the input feature class
        input_fields (list, optional): List of fields for dataframe. 
            Defaults to None.
        query (str, optional): SQL where clause for the search cursor. Defaults to "".

    Returns:
        Pandas Dataframe: Dataframe of feature class
    """

    from arcpy import Describe, ListFields
    from arcpy.da import SearchCursor
    from pandas import DataFrame

    # get list of fields if desired fields specified
    OIDFieldName = Describe(input_fc).OIDFieldName
    if input_fields:
        final_fields = [OIDFieldName] + input_fields

    # use all fields if no fields specified
    else:
        final_fields = [field.name for field in ListFields(input_fc)]

    # read all rows with a search cursor, then build the dataframe
    data = [row for row in SearchCursor(
        input_fc, final_fields, where_clause=query)]
    fc_dataframe = DataFrame(data, columns=final_fields)

    # set index to object id
    fc_dataframe = fc_dataframe.set_index(OIDFieldName, drop=True)
    
    return fc_dataframe

DavidSolari
Frequent Contributor

This is a solid script; one minor tweak that can be handy with massive datasets is:

fc_dataframe = DataFrame((row for row in SearchCursor(input_fc, final_fields, query)), columns=final_fields)

This avoids creating a list for the data before it hits the DataFrame, saving a decent chunk of memory. Might even be faster in some cases.
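
For reference, DataFrame does accept any iterable of rows, so the one-liner runs as written. A quick standalone check with plain tuples in place of a cursor:

```python
import pandas as pd

rows = ((i, i * i) for i in range(5))  # generator of row tuples
df = pd.DataFrame(rows, columns=["n", "n_squared"])
```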

LaurencePerry
Emerging Contributor

For my own knowledge mostly - I think that should still be about the same in terms of memory expense, no? pandas is going to exhaust that generator expression and build the DataFrame from it before the intermediate rows are freed, I believe...

I think if you really wanted it to take as little memory as possible you might have to read a row in the cursor, convert that row to a temp dataframe, then concat that row into a main dataframe, and then delete the temp dataframe. Might be real slow...
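
A middle ground between those two extremes is to pull the rows in fixed-size chunks and concat once at the end. A sketch (function name and chunk size are my own; it works on any row iterable, a SearchCursor included, so it's testable with plain tuples):

```python
import itertools
import pandas as pd

def rows_to_dataframe_chunked(rows, columns, chunk_size=10_000):
    """Build a DataFrame from an iterable of row tuples, chunk by chunk."""
    it = iter(rows)
    frames = []
    while True:
        # islice pulls at most chunk_size rows without materializing the rest
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        frames.append(pd.DataFrame(chunk, columns=columns))
    if not frames:
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)

# Works the same whether rows is a SearchCursor or a plain list:
df = rows_to_dataframe_chunked(
    [(i, str(i)) for i in range(25)], ["oid", "label"], chunk_size=10)
```

Caveat: concat still needs all the chunk frames in memory at once, so this mainly avoids one big intermediate Python list; and yes, concatenating one row at a time would be painfully slow (quadratic copying).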
