Split by Attribute

JohnDye · ‎05-01-2013

I have a Feature Class with 3.2 million records which represents customers. Each customer is assigned to a particular store ID (DID)

So I'm trying to split the customer record into store ID datasets. That is, for each store ID (DID), create a dataset of the customers assigned to it, essentially splitting the FC by each unique attribute in the field.

I was able to pull up a list of the unique store IDs by using:

def Unique_Values(table, field):     with arcpy.da.SearchCursor(table, [field]) as cursor:         return sorted({row[0] for row in cursor})  fc = "C:\Users\username\Documents\ArcGIS\Default.gdb\MyCustomers" field = "DID"  UniqueValues (fc, field)

which resulted in a list of all of my unique store IDs, 957 in all.

What I'm not sure of however, is how to go about iterating through that list and using each value in the list in a where_clause in a "FeatureClassToFeatureClass_conversion" to export each unique dataset.

I know there are third party developed tools to do this, but they don't seem to work due to the number of records. I've tried every single one I could find.

ChrisSnyder · ‎05-01-2013

How about something like:

storeIdSet = set([r[0] for r in arcpy.da.SearchCursor(myFC, ["STORE_ID"])]) for storeId in storeIdSet:      outFC = r"C:\temp\test.gdb\store_" + str(storeId)      arcpy.Select_analysis(myFC, outFC, "STORE_ID = " + str(storeId))

View solution in original post

ChrisSnyder · ‎05-01-2013

How about something like:

storeIdSet = set([r[0] for r in arcpy.da.SearchCursor(myFC, ["STORE_ID"])]) for storeId in storeIdSet:      outFC = r"C:\temp\test.gdb\store_" + str(storeId)      arcpy.Select_analysis(myFC, outFC, "STORE_ID = " + str(storeId))

curtvprice · ‎05-01-2013

I was able to pull up a list of the unique store IDs by using:
def Unique_Values(table, field):
    with arcpy.da.SearchCursor(table, [field]) as cursor:
        return sorted({row[0] for row in cursor})

fc = "C:\Users\username\Documents\ArcGIS\Default.gdb\MyCustomers"
field = "DID"

UniqueValues (fc, field)
which resulted in a list of all of my unique store IDs, 957 in all.

Neat. I may steal that.

What I'm not sure of however, is how to go about iterating through that list and using each value in the list in a where_clause in a "FeatureClassToFeatureClass_conversion" to export each unique dataset.

Note, you could avoid scripting altogether and do this in ModelBuilder using Iterate Feature Selection.

But, if you want to do this in Python, it's a matter of making a layer and selecting on it:

lyr = arcpy.CreateFeatureLayer(management(fc, "lyr")
stores = UniqueValues(lyr, field)
for store in stores:
    where = "{0} = {1}".format(field, store)
    outFC = "store{0}".format(store)
    arcpy.Select_analysis(lyr, outFC, where)

UPDATE: Chris beat me to the post. Nice one-liner, Chris!

Just a note about my more verbose approach: If a GP tool gets a layer input, it can assume that the input exists and its field list is already in memory ready to roll. This can make a difference if you need to do 957 select/copy operations.

JohnDye · ‎05-01-2013

Chris,
I played with this notion earlier but couldn't get it working...

I modified your code a bit to export to a feature dataset and also provide print statements as it ran,
but all in all, right on the money.
Thanks a ton!

It's probably going to take several hours to crunch through all the data but it seems to be chugging
along. The only issue I see is that since I'm running it in the Python Window, it keeps adding each dataset to my TOC. I didnt really expect that. It's not a huge deal because drawing is paused, but nonetheless, something I didnt expect to happen.

So long as it crunches through all the data, I'm going to take a stab at turning it into an Addin later.

For anyone that wants the full code if you're having issues with large datasets, here's what I'm using:

import arcpy
from arcpy import env

# Create Workspace. 'Select_analysis' tool exports to the current workspace, so this is the directory 
# where your exported datasets will end up. I'm dropping mine in a Feature Dataset in the 'Default.gdb'
# on my system.
workspace = arcpy.env.workspace = "C:\Users\Username\My Documents\ArcGIS\Default.gdb\CustomersByStore"

# Create a variable to hold the FC you want to split by attribute.
fc = "C:\Users\Username\My Documents\ArcGIS\Default.gdb\Customers\MyCustomers"

# Instantiate a 'set' of unique values from the 'StoreID' field in the 'fc' dataset using the native 
# python 'set' class. Depending on how large your dataset is, this could take a bit of time.
# Mine was 3.2 million records and only took 3-4 minutes to run.
StoreSet = set([r[0] for r in arcpy.da.SearchCursor (fc, ["StoreID"])])

# Create the 'for' loop that will iterate through your BranchSet by unique ID.
for ID in BranchSet:
 # Create a variable to name each exported record according to the unique value. If your field is an 
 # integer, ensure you convert it to string. For some reason, just converting the integer to string
 # wasn't enough though, I had to also tack on a string ('Customers') as well before the conversion
 # to get it all working. By default, Select_analysis also exports to the current workspace, so if you
 # didn't set one as I did before, you need to hardcode the path here.
 OutFC = "Customers_" + str(DID)
 # Print which recordset the script is currently processing
 print "Exporting " + str(DID) + "..."
 # Export the dataset using the 'Select_analysis' function. Because my variable for 'OutFC' also 
 # holds the variable for my workspace, it drops all the datasets in my workspace. 
 arcpy.Select_analysis(fc, OutFC, "StoreID = " + str(DID))

print "Script Complete."

JamesCrandall · ‎05-01-2013

How about something like:

storeIdSet = set([r[0] for r in arcpy.da.SearchCursor(myFC, ["STORE_ID"])])
for storeId in storeIdSet:
     outFC = r"C:\temp\test.gdb\store_" + str(storeId)
     arcpy.Select_analysis(myFC, outFC, "STORE_ID = " + str(storeId))

Nice! What a very pythonic way to write. and I am stealing this 🙂

...so tough to remove the ArcObjects from my thinking!

JohnDye · ‎05-01-2013

Neat. I may steal that.

Hey Curt,
I thought about doing a MakeFeatureLayer and then running a series of SelectLayerbyAttributes but from my experience, that significantly increases processing time. Not a big deal with small to moderate sized datasets, but with 3.2 million records, I didn't want to play with those implications.

Unfortunately, I can't take credit for the UniqueValues definition either, I lifted it off of an old Cafe Python post. Someone commented on the post about leveraging the TableToNumPyArray function to get better performance, which made me really excited, but alas my feeble mind could not get it to work like I wanted.

curtvprice · ‎05-01-2013

Hey Curt,
I thought about doing a MakeFeatureLayer and then running a series of SelectLayerbyAttributes but from my experience, that significantly increases processing time. Not a big deal with small to moderate sized datasets, but with 3.2 million records, I didn't want to play with those implications.

I agree - I fixed the code above so you're just selecting off the layer directly with the Select_analysis tool.

If you added a field index you'd get the time back in spades when you ran the Select tool -- even faster!

Someone commented on the post about leveraging the TableToNumPyArray function to get better performance

Talk about gilding the lily. arcpy.da.SearchCursor is pretty darn fast. It's so fast, I wonder if Esri's team they used a numpy array to do it!

JohnDye · ‎05-01-2013

If you added a field index you'd get the time back in spades when you ran the Select tool -- even faster!

That's a good idea, may play with that tomorrow!

curtvprice · ‎05-01-2013

Here's my updated ListUnique script - works with 9x/10x.

Thanks guys, this is a big improvement to what I had before.
A cute tweak I learned that sorted(set()) returns a list.

def ListUnique(inTable,Field):
    """Create a list of unique values from a table/tableview.

    arguments

    inTable  Table or table view
    Field    Field name
    """
    Row, Rows = None, None
    try:
        try:
            # this will only work for 10.1
            import arcpy.da
            lstValues = \
                sorted({r[0] for r in
                    arcpy.da.SearchCursor(inTable, [Field])})
        except:
            import arcgisscripting
            gp = arcgisscripting.create(9.3)
            Rows = gp.SearchCursor(inTable,"","",Field,Field)
            Row = Rows.next()
            lstValues = []
            while Row:
                lstValues.append(Row.getValue(Field))
                Row = Rows.next()
            # unique-ize and sort the list
            lstValues = sorted(set(lstValues))
        return lstValues
    except:
        raise
    finally:
        if Row: del Row
        if Rows: del Rows

AmberPerenzin · ‎04-23-2014

How about something like:

storeIdSet = set([r[0] for r in arcpy.da.SearchCursor(myFC, ["STORE_ID"])])
for storeId in storeIdSet:
     outFC = r"C:\temp\test.gdb\store_" + str(storeId)
     arcpy.Select_analysis(myFC, outFC, "STORE_ID = " + str(storeId))

Chris, I am getting an error message when I use this script indicating the SQL expression used in the select_analysis tool is invalid. I know how to set the expression to select a specific crime type, but I do not understand how to write an expression that will iterate through each crime type and make a new .shp file for each crime type. How do you write a SQL expression that will iterate through values of a field?

This is how I have modified your code to fit my dataset:

AllCrimes = "CopyCopySample2011_2013.shp"
CrimeTypeSet = set([r[0] for r in arcpy.da.SearchCursor(AllCrimes, ["CrimeType"])])
for CrimeType in CrimeTypeSet:
    out_FC = "H:\\PythonOutput\\" + str(CrimeType)
    arcpy.Select_analysis(AllCrimes, out_FC, "CrimeType = " + str(CrimeType))

the error message reads:

"Executing: Select "H:\Sample Data\CopyCopySample2011_2013.shp" H:\PythonOutput\AgAssault.shp "CrimeType = AgAssault"
Start Time: Wed Apr 23 13:23:04 2014
ERROR 000358: Invalid expression CrimeType = AgAssault
A column was specified that does not exist.
A column was specified that does not exist.
Failed to execute (Select).
Failed at Wed Apr 23 13:23:05 2014 (Elapsed Time: 1.00 seconds)"

I have tried so many things and I am about to throw my computer out the window!! What am I doing wrong?