Can you summarize a table by one field and the find median value of another?

744
9
Jump to solution
04-11-2013 11:15 AM
SamHopkins
New Contributor III
I have a table of values collected at various monitoring stations. There's a row for each value with the station where and date when this value was recorded . I am trying to find a way to select all the values that fall within a certain date range and summarize them so there is only one value per station based on the median value. The "Summary Statistic" tool would work if I needed a sum or mean, but it does not have 'median' functionality. I was wondering if there was a python script that acts like the "Summary Statistic" tool but with the ability to find medians.

I'm using 10.0 at the ArcInfo level, any help is appreciated.
Tags (2)
0 Kudos
1 Solution

Accepted Solutions
by Anonymous User
Not applicable
I have altered the GetMedian function to use the average of the two medians whenever there is an even number of items in the list rather than returning 2 values. To get it to do a decimal value I just cast it as a float type. I also added a part to update the new table with the median values. This is untested but I think it should work. I have highlighted what you will need to change in red.


import arcpy, os, sys, traceback

def GetMedian(in_list):
    sorted_list = sorted(in_list)
    median = int(round(len(sorted_list) / 2))
    if len(sorted_list)%2==0:
        med_val = float(sorted_list[median-1]
                        + sorted_list[median] / 2)
    else:
        med_val = sorted_list[median]
    return med_val

# env settings
arcpy.env.overwriteOutput = True
arcpy.env.qualifiedFieldNames = False

# Local Variables
station_fc = r'C:\Path\to_your\Stations'
new_table = r'C:\Path\to_your\new_table'

# Get County list for query
print 'starting cursor'
rows = arcpy.SearchCursor(station_fc)
Stations = list(set(r.StationNam for r in rows))
del rows

lyr = arcpy.MakeFeatureLayer_management(station_fc,'stations')
values = {}

# Get Median UseValue for each station name
for st in Stations:
    query = '"StationNam" = \'%s\'' %st
    arcpy.SelectLayerByAttribute_management(lyr, 'NEW_SELECTION', query)
    use_vals = []
    rows = arcpy.SearchCursor(lyr)
    for row in rows:
        use_vals.append(row.StationNam)
    del row, rows
    median = GetMedian(use_vals)
    values[st] = median

# if you want to print out the values, otherwise comment out
for k,v in sorted(values.iteritems()):
    print '%s:\t%s' %(k,v)


# Here is how you can update the new table with the median value
# You will need to change the row.*** varaibles to match your field names
#
rows = arcpy.UpdateCursor(new_table)
for row in rows:
    name = row.StationName                         # set this to your station field in new table
    if name in values:
        row.UseValue_Median = values[name]        # change to your UseValue field in new table
        rows.upadateRow(row)
del row, rows

# report results
print 'Updated UseValue field with Median in %s' %os.path.basename(new_table)
print 'Done'
        

View solution in original post

0 Kudos
9 Replies
by Anonymous User
Not applicable
I'm sure there are many ways to do this, probably some better than this.  But here is how I would do it:

def GetMedian(in_list):
    sorted_list = sorted(in_list)
    median = int(round(len(sorted_list) / 2))
    if len(sorted_list)%2==0:
        med_val = [sorted_list[median-1],sorted_list[median]]
    else:
        med_val = sorted_list[median]
    return med_val

if __name__ == '__main__':

    vals = [3, 43, 17, 5, 4, 89, 107]
    median = GetMedian(vals)

    vals2 = [3, 66, 15, 12, 7, 11]
    m = GetMedian(vals2)

    print median
    print m



This yielded:

17
[11, 12]

Lists with an odd amount of values should return one median, while lists with an even amount of values should return the two middle values.

You can use a search cursor to go through your table and append all the values in a field to a list and search that for the median. So one way how you could do it with a table at Arc 10.0 would be something like this:

def GetMedian(in_list):
    sorted_list = sorted(in_list)
    median = int(round(len(sorted_list) / 2))
    if len(sorted_list)%2==0:
        med_val = [sorted_list[median-1],sorted_list[median]]
    else:
        med_val = sorted_list[median]
    return med_val

if __name__ == '__main__':
    
    # varibles
    table = r'G:\some\path\Summary_table.dbf'

    ValueList = []
    rows = arcpy.SearchCursor(table)
    for row in rows:
        ValueList.append(row.Value_Field)
    del row, rows

    median = GetMedian(ValueList)
    print median   # this will give you the median



Hope this helps.
0 Kudos
SamHopkins
New Contributor III
Thanks. I took a sample of my data and was able to get it to find the median. I'm still pretty new to python, how could I get it to find the median for each value of another field? I also played around with changing it so that in cases where the array has an even number of fields it finds the average between the median values but I couldn't get it to output a decimal value.

I've attached a sample pic of some of the data that I'm working with.  I want to find the median value from the UseValue field for each unique value in the StationName field. Is there a way that I could iterate through the unique StationName values to create a list of the UseValue for each, then update it so it overwrites each of the UseValues with the new median value? Perhaps an update cursor that would update values for each station without affecting the values for any other station? This way I could then run summary statistics on the table to get my desired output of one row for each station with it's median value.
0 Kudos
by Anonymous User
Not applicable
I have altered the GetMedian function to use the average of the two medians whenever there is an even number of items in the list rather than returning 2 values. To get it to do a decimal value I just cast it as a float type. I also added a part to update the new table with the median values. This is untested but I think it should work. I have highlighted what you will need to change in red.


import arcpy, os, sys, traceback

def GetMedian(in_list):
    sorted_list = sorted(in_list)
    median = int(round(len(sorted_list) / 2))
    if len(sorted_list)%2==0:
        med_val = float(sorted_list[median-1]
                        + sorted_list[median] / 2)
    else:
        med_val = sorted_list[median]
    return med_val

# env settings
arcpy.env.overwriteOutput = True
arcpy.env.qualifiedFieldNames = False

# Local Variables
station_fc = r'C:\Path\to_your\Stations'
new_table = r'C:\Path\to_your\new_table'

# Get County list for query
print 'starting cursor'
rows = arcpy.SearchCursor(station_fc)
Stations = list(set(r.StationNam for r in rows))
del rows

lyr = arcpy.MakeFeatureLayer_management(station_fc,'stations')
values = {}

# Get Median UseValue for each station name
for st in Stations:
    query = '"StationNam" = \'%s\'' %st
    arcpy.SelectLayerByAttribute_management(lyr, 'NEW_SELECTION', query)
    use_vals = []
    rows = arcpy.SearchCursor(lyr)
    for row in rows:
        use_vals.append(row.StationNam)
    del row, rows
    median = GetMedian(use_vals)
    values[st] = median

# if you want to print out the values, otherwise comment out
for k,v in sorted(values.iteritems()):
    print '%s:\t%s' %(k,v)


# Here is how you can update the new table with the median value
# You will need to change the row.*** varaibles to match your field names
#
rows = arcpy.UpdateCursor(new_table)
for row in rows:
    name = row.StationName                         # set this to your station field in new table
    if name in values:
        row.UseValue_Median = values[name]        # change to your UseValue field in new table
        rows.upadateRow(row)
del row, rows

# report results
print 'Updated UseValue field with Median in %s' %os.path.basename(new_table)
print 'Done'
        

0 Kudos
SamHopkins
New Contributor III
I have altered the GetMedian function to use the average of the two medians whenever there is an even number of items in the list rather than returning 2 values.  To get it to do a decimal value I just cast it as a float type.  I also added a part to update the new table with the median values.  This is untested but I think it should work.  I have highlighted what you will need to change in red.



Thank you! This was extremely helpful. I had to modify the code a bit to get it working but it's working well despite one problem and I've been able to implemented it in a model. The changes I made are because the stations data is actually a table from an access database that I'm joining to the feature class after adding the median values.

The problem I mentioned lies with calculating the mean of the two medians when the list has an even number of values.

This code is the section that doesn't work:

    if len(sorted_list)%2==0:
        med_val = float(sorted_list[median-1]
                       + sorted_list[median] / 2)


When I try to run it as a script tool in ArcGIS I get this error: "unsupported operand type(s) for +: 'NoneType' and 'float'"

Right now I'm just ignoring the problem and choosing one of the two medians, though this is not ideal as some stations only have a few points.



Here is the code with the changes that I've made:
import arcpy, os, sys, traceback

def GetMedian(in_list):
    sorted_list = sorted(in_list)
    median = int(round(len(sorted_list) / 2))
    print median  
    if len(sorted_list)%2==0:
        #med_val = float(sorted_list[median-1]
         #               + sorted_list[median] / 2)
        med_val = float(sorted_list[median])

    else:
        med_val = sorted_list[median]
    return med_val

# env settings
arcpy.env.overwriteOutput = True
arcpy.env.qualifiedFieldNames = False

# Local Variables
station_fc = arcpy.GetParameterAsText(0)
new_table = arcpy.GetParameterAsText(1)





# Get County list for query
print 'starting cursor'
rows = arcpy.SearchCursor(station_fc)
Stations = list(set(r.Station_ID for r in rows))
del rows

tbl = arcpy.MakeTableView_management(station_fc,'stations_tbl')

values = {}

# Get Median UseValue for each station name
for st in Stations:
    query = '"Station_ID" = \'%s\'' %st
    arcpy.SelectLayerByAttribute_management(tbl, 'NEW_SELECTION', query)
    use_vals = []
    rows = arcpy.SearchCursor(tbl)
    for row in rows:
        use_vals.append(row.UseValue)
    del row, rows
    median = GetMedian(use_vals)
    values[st] = median

# if you want to print out the values, otherwise comment out
for k,v in sorted(values.iteritems()):
    print '%s:\t%s' %(k,v)


# Here is how you can update the new table with the median value
# You will need to change the row.*** varaibles to match your field names
#
rows = arcpy.UpdateCursor(new_table)
for row in rows:
    name = row.Station_ID                         # set this to your station field in new table
    if name in values:
        row.UseValue_Median = values[name]        # change to your UseValue field in new table
        rows.updateRow(row)
del row, rows

# report results
print 'Updated UseValue field with Median in %s' %os.path.basename(new_table)
print 'Done'
0 Kudos
by Anonymous User
Not applicable
When I try to run it as a script tool in ArcGIS I get this error: "unsupported operand type(s) for +: 'NoneType' and 'float'"


This error usually occurs when there are NULL values (these have the data type of None in Python).  One thing you could do before running this is use an update cursor on your table to turn any NULL values to 0.


rows = arcpy.UpdateCursor(feature_class)
for row in rows:
    if row.UseValue == None:
        row.UseValue = 0
        rows.updateRow(row)

0 Kudos
SamHopkins
New Contributor III
This error usually occurs when there are NULL values (these have the data type of None in Python).  One thing you could do before running this is use an update cursor on your table to turn any NULL values to 0.


Thanks again, it seems to be working perfectly now.

Instead of having it set null values to zero I have them excluded from the list. This way they shouldn't affect the median.

    for row in rows:
        if row.UseValue == None:
            pass 
        else:
            use_vals.append(row.UseValue)
0 Kudos
by Anonymous User
Not applicable
Thanks again, it seems to be working perfectly now.

Instead of having it set null values to zero I have them excluded from the list. This way they shouldn't affect the median.

    for row in rows:
        if row.UseValue == None:
            pass 
        else:
            use_vals.append(row.UseValue)


Good thinking, that works even better!  Glad you have it all going now.
0 Kudos
by Anonymous User
Not applicable
Sam,

I hope you see this, but I was using the median function today on a data set and I found a logical error. 

This line:

med_val = float(sorted_list[median-1]
                        + sorted_list[median] / 2)


was wrong.  The addition of the two middle values needed to be inside the parenthesis like this:

def GetMedian(in_list):
    sorted_list = sorted(in_list)
    median = int(round(len(sorted_list) / 2))
    if len(sorted_list)%2==0:
        med_val = float(sorted_list[median-1]
                        + sorted_list[median]) / 2
    else:
        med_val = sorted_list[median]
    return med_val


Sorry about that, I should have done more testing!
0 Kudos
SamHopkins
New Contributor III
Sam,

I hope you see this, but I was using the median function today on a data set and I found a logical error. 

This line:

med_val = float(sorted_list[median-1]
                        + sorted_list[median] / 2)


was wrong.  The addition of the two middle values needed to be inside the parenthesis like this:

def GetMedian(in_list):
    sorted_list = sorted(in_list)
    median = int(round(len(sorted_list) / 2))
    if len(sorted_list)%2==0:
        med_val = float(sorted_list[median-1]
                        + sorted_list[median]) / 2
    else:
        med_val = sorted_list[median]
    return med_val


Sorry about that, I should have done more testing!


Thanks for finding that!

I also ran into an error this morning. I had a station that only had null values, resulting in an unpopulated list which created an error when trying to find the middle value.

To fix this I added a bit of code so that the GetMedian function is only run on populated lists.

   if len(use_vals)== 0:
        pass #If the list is empty (because all values are null and thus passed) the median is not found
    else:
        median = GetMedian(use_vals)
        values[st] = median
0 Kudos