Numpy Percentile error

4798
19
Jump to solution
10-09-2019 12:47 PM
GrantPalmer
New Contributor II

Hi all, running into an error and i'm not sure why when I am trying to rank the attribute field of a shapefile. I have a poly line shapefile of some streams that have an attribute field of some normalized data that is the data type of 'Double' and I am trying to rank these values by quartile. and store their rank in another attribute field. I know that you can use symbology > graduated colors > method: quantile with 4 classes and it is displaying my data correctly. However, I need to be able to have an attribute field with a rank to be able to use the data down the line. 

I have been using python for my processes so far but I am currently running into an error and i'm not sure why/cant really find an answer online anywhere else. Here is a sample of my code

maximum = max(row[0] for row in arcpy.da.SearchCursor(NHDFlowline_HUC12, ['Normalized_Linear']))
print(maximum)
minimum = min(row[0] for row in arcpy.da.SearchCursor(NHDFlowline_HUC12, ['Normalized_Linear']))
print(minimum)
arr = arcpy.da.FeatureClassToNumPyArray(NHDFlowline_HUC12, ('Normalized_Linear'))
p1 = np.percentile(arr, 25)
p2 = np.percentile(arr, 50)
p3 = np.percentile(arr, 75)
p4 = np.percentile(arr, 100)

with arcpy.da.UpdateCursor(NHDFlowline_HUC12, ['Linear_Rank', 'Normalized_Linear']) as cursor:
for row in cursor:
if minimum <= row[1] <= p1:
row[0] = 1
elif p1 < row[1] <= p2:
row[0] = 2
elif p2 < row[1] <= p3:
row[0] = 3
elif row[1] > p3:
row[0] = 4
cursor.updateRow(row)

first step should store a max and min value for the normalized data attribute and then create an array containing the values of my shapefile's attribute field 'Normalized_Linear' then the next steps are to assing values to p1 thru p4 as the breaks for the quartile and then use updateCursor to store in the rank. The resulting error is:

Traceback (most recent call last):
File "script path", line 143, in <module>
p1 = np.percentile(arr, 25, axis=None, out=None, overwrite_input=False, interpolation='linear', keepdims=False)
File "C:\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\numpy\lib\function_base.py", line 4269, in percentile
interpolation=interpolation)
File "C:\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\numpy\lib\function_base.py", line 4011, in _ureduce
r = func(a, **kwargs)
File "C:\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\numpy\lib\function_base.py", line 4386, in _percentile
x1 = take(ap, indices_below, axis=axis) * weights_below
TypeError: invalid type promotion

I am unsure of how to go about fixing this TypeError: invalid type promotion. I feel like it may have something to do with the data type being double but if so, I would like to know how to work around this.

Any help would be much appreciated

0 Kudos
19 Replies
DanPatterson_Retired
MVP Emeritus

I suspect you will have to wait for 'stu' and 'uts'.

Far less exciting and safe is...

np.version.version

Out[10]: '1.16.4'

b = arcpy.da.FeatureClassToNumPyArray(in_fc, 'Doubles')

b
array([(1.2,), (1.4,), (1.8,), (1.6,), (1.9,), (1.1,), (1.3,), (1.5,), (1.7,)],
      dtype=[('Doubles', '<f8')])

a = b.view(dtype=np.float64)  # ---- a view with a float64 dtype

np.percentile(a, (25, 50, 75))

array([1.3, 1.5, 1.7])
JoeBorgione
MVP Emeritus

I figured you'd have the right answer!

That should just about do it....
0 Kudos
GrantPalmer
New Contributor II

Awesome, this worked! 

Just as an aside, I fully understand why its not as exciting haha but why is this "unsafe"? 

Thank you again for your help Dan and Joe.

0 Kudos
GrantPalmer
New Contributor II

Also, one more thing, is there away while doing the percentile to ignore 0 values?

Currently one of the attributes that I am calculating a rank for using percentile is skipping rank 1 all together. I guess the the p1 percentile cut off is a value extremely close to 0 but still less than the next highest value of the attribute I am trying to rank. 

So all my 0 values get a rank of 0 then the first non 0 value is 1.32783428022391E-08 and that is falling within the 2nd percentile group p2 getting a rank of 2. 

0 Kudos
DanPatterson_Retired
MVP Emeritus

Grant... 

of course 

a = np.arange(0,11)

np.percentile(a, (1, 25, 50, 75))
array([0.1, 2.5, 5. , 7.5])

b = np.nonzero(a)
b
(array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=int64),)

np.percentile(b, (1, 25, 50, 75))
array([1.09, 3.25, 5.5 , 7.75])‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍
GrantPalmer
New Contributor II

Here is my updated code and the new issue I am getting. All of my variables are properly assigned in the beginning of the script not its not any issue with that I don't think. 

# Creates a new field for the Rank to be calculated into
arcpy.AddField_management(HUC12Rank, "Percent_Urban_Rank", "Double", 9,
                          field_alias="Percent_Urban_Rank", field_is_nullable="NULLABLE")
maximum_urbanlanduse = max(row[0] for row in arcpy.da.SearchCursor(HUC12Rank, ['Landuse_Area_Percent_Urban']))
print(maximum_urbanlanduse)
minimum_urbanlanduse = min(row[0] for row in arcpy.da.SearchCursor(HUC12Rank, ['Landuse_Area_Percent_Urban']))
print(minimum_urbanlanduse)

# This portion creates an array of the Landuse_Area_Percent_Urban attribute. The array is then searched for zeros and removed to get proper percentile breaks
# the numpy percentile function is ran for 25th 50th and 75th percentiles and those values are saved as variables to be called on once start ranking. We use UpdateCursor
# again here to populate the Percent_Urban_Rank field. We give all 0 fields of percent land use to be rank 0, then compare the Landuse_Area_Percent_Urban to the
# percentile values we stored as p1 p2 p3.

array_urban = arcpy.da.FeatureClassToNumPyArray(HUC12Rank, 'Landuse_Area_Percent_Urban')
array_urbanview = array_urban.view(dtype=numpy.double)
array_urban_nonzero = numpy.nonzero(array_urbanview)
p1urban = numpy.percentile(array_urban_nonzero, 25)
print(p1urban)
p2urban = numpy.percentile(array_urban_nonzero, 50)
print(p2urban)
p3urban = numpy.percentile(array_urban_nonzero, 75)
print(p3urban)

with arcpy.da.UpdateCursor(HUC12Rank, ['Percent_Urban_Rank', 'Landuse_Area_Percent_Urban']) as cursor:
    for row in cursor:
        if row[1] == 0:
            row[0] = 0
        elif minimum_urbanlanduse < row[1] <= p1urban:
            row[0] = 1
        elif p1urban < row[1] <= p2urban:
            row[0] = 2
        elif p2urban < row[1] <= p3urban:
            row[0] = 3
        elif row[1] > p3urban:
            row[0] = 4
        else:
            row[0] = 0
        cursor.updateRow(row)‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍
print('Done w Urban Percentile Rank')‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

My results of the print outs are: 

97.859185942202
0.0
3435.5
6312.0
8775.5‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍
Done w Urban Percentile Rank

There shouldn't be any value over 100 since these are percents, Max urban landuse being 97.859185942202, minimum of 0, but then the percentiles are all off. However, before adding the non-zero portion of the script, I was getting much nicer percentile breaks.

0 Kudos
DanPatterson_Retired
MVP Emeritus

That is one of the dangers of dumping 0.  If 0 is a real value, then it should be included in the percentiles.  If 0 is used to represent, 'nodata', then it should be converted to None and the nanpercentile used instead... se the equivalencies

a = np.array([np.nan, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
b = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

np.nanpercentile(a, (5, 25, 50, 75, 95))
array([1.45, 3.25, 5.5 , 7.75, 9.55])

np.percentile(b, (5, 25, 50, 75, 95))
array([1.45, 3.25, 5.5 , 7.75, 9.55])
GrantPalmer
New Contributor II

Is it possible that the numpy.nonzero step creates an array with 2 columns/rows? Such as an identifier column or row and then the actual value column/row? If so, that would make sense of my percentile breaks as my overall data set has 11226 total entries in it which may explain why my percentiles are shown to be as it is percentile-ing the identifier column/row. If that is the case, is there a way to specify which column/row of the array to perform the percentile possibly?

I am going to put a larger snippet of my code here to get a better sense of what it is I am trying to do (warning, lots to look at lol)

# Adds and calculates area for our new feature layer Reclass_NLCD_HUC12 for percent area calculations. Next after that we want to then sum up the area with the case fields of HUC12, NLCD_Land, and gridcode.
# That output is a table of attributes that each unique HUC 12 watershed has a area for all 4 reclassified values, gridcode values 1 thru 4. We then sum that table again this time with the only case field being HUC12
# and that is so we can have an overall total landuse area per HUC12. Next is to join the total landuse area back into the NLCD_HUC12_Gridcode_Area table and rename some fields to better understand our data and clean it up.
# arcpy.management.AddGeometryAttributes("Reclass_NLCD_HUC12", "AREA_GEODESIC", None, "SQUARE_KILOMETERS", None)
arcpy.management.AddGeometryAttributes(Reclass_NLCD_HUC12, "AREA_GEODESIC", None, "SQUARE_KILOMETERS",
                                       "GEOGCS['GCS_North_American_1983',DATUM['D_North_American_1983',SPHEROID['GRS_1980',6378137.0,298.257222101]],PRIMEM['Greenwich',0.0],UNIT['Degree',0.0174532925199433]]")
arcpy.analysis.Statistics(Reclass_NLCD_HUC12,
                          r"E:\2019 Grant Files\DatabaseProject_DataFolder\DatabaseFinal.gdb\NLCD_HUC12_Gridcode_Area",                                                     # Path Change #
                          "AREA_GEO SUM", "HUC_12;NLCD_Land_Cover_Class;gridcode")
NLCD_HUC12_Gridcode_Area = r"E:\2019 Grant Files\DatabaseProject_DataFolder\DatabaseFinal.gdb\NLCD_HUC12_Area"                                                              # Path Change #
arcpy.analysis.Statistics("NLCD_HUC12_Gridcode_Area",
                          r"E:\2019 Grant Files\DatabaseProject_DataFolder\DatabaseFinal.gdb\NLCD_HUC12_Area_Total",                                                        # Path Change #
                          "SUM_AREA_GEO SUM", "HUC_12")
NLCD_HUC12_Area_Total = r"E:\2019 Grant Files\DatabaseProject_DataFolder\DatabaseFinal.gdb\NLCD_HUC12_Area_Total"                                                           # Path Change #
arcpy.management.JoinField("NLCD_HUC12_Gridcode_Area", "HUC_12", "NLCD_HUC12_Area_Total", "HUC_12", "SUM_SUM_AREA_GEO")
arcpy.management.AlterField("NLCD_HUC12_Gridcode_Area", "SUM_AREA_GEO", "Landuse_Area", "Landuse_Area", "DOUBLE", 8,
                            "NULLABLE", False)
arcpy.management.AlterField("NLCD_HUC12_Gridcode_Area", "SUM_SUM_AREA_GEO", "Landuse_Area_Total", "Landuse_Area_Total",
                            "DOUBLE", 8, "NULLABLE", False)

# Gets a percent area of each NLCD reclassified value
arcpy.AddField_management("NLCD_HUC12_Gridcode_Area", "Percent_Area", "Double", 9,
                          field_alias="Percent_Area", field_is_nullable="NULLABLE")
arcpy.management.CalculateField("NLCD_HUC12_Gridcode_Area", "Percent_Area", "(!Landuse_Area!/!Landuse_Area_Total!)*100",
                                "PYTHON3", None)

# Creates an easy to edit and use HUC12 file copy so we do not damage the original file
arcpy.management.CopyFeatures('Huc12_layer',
                              r"E:\2019 Grant Files\DatabaseProject_DataFolder\DatabaseFinal.gdb\HUC12Rank", None, None,                                                    # Path Change #
                              None, None)
HUC12Rank = r"E:\2019 Grant Files\DatabaseProject_DataFolder\DatabaseFinal.gdb\HUC12Rank"                                                                                   # Path Change #
arcpy.MakeFeatureLayer_management(HUC12Rank, 'HUC12edit')

# Makes new tables based on a selection of the NLCD_HUC12_Gridcode_Area data table. These selections are going to be for Urban land use and then repeated for Agriculture landuse.
# These new tables will then be joined to our original HUC12 shapefile as new attributes for each HUC12 subwatershed.
arcpy.management.JoinField(HUC12Rank, "HUC_12", "NLCD_HUC12_Area_Total", "HUC_12", "Landuse_Area_Total")
arcpy.analysis.TableSelect("NLCD_HUC12_Gridcode_Area",
                           r"E:\2019 Grant Files\DatabaseProject_DataFolder\DatabaseFinal.gdb\NLCD_HUC12_Gridcode_Area_Urban",                                              # Path Change #
                           "gridcode = 2 And Percent_Area IS NOT NULL")
arcpy.analysis.TableSelect("NLCD_HUC12_Gridcode_Area",
                           r"E:\2019 Grant Files\DatabaseProject_DataFolder\DatabaseFinal.gdb\NLCD_HUC12_Gridcode_Area_Agriculture",                                        # Path Change #
                           "gridcode = 4 And Percent_Area IS NOT NULL")
arcpy.management.AlterField("NLCD_HUC12_Gridcode_Area_Urban", "Percent_Area", "Landuse_Area_Percent_Urban",
                            "Landuse_Area_Percent_Urban", "DOUBLE", 8, "NULLABLE", False)
arcpy.management.AlterField("NLCD_HUC12_Gridcode_Area_Agriculture", "Percent_Area", "Landuse_Area_Percent_Ag",
                            "Landuse_Area_Percent_Ag", "DOUBLE", 8, "NULLABLE", False)
arcpy.management.JoinField(HUC12Rank, "HUC_12", "NLCD_HUC12_Gridcode_Area_Urban", "HUC_12",
                           "Landuse_Area_Percent_Urban")
arcpy.management.JoinField(HUC12Rank, "HUC_12", "NLCD_HUC12_Gridcode_Area_Agriculture", "HUC_12",
                           "Landuse_Area_Percent_Ag")

# The now joined data from the Agriculture and Urban land use can now be given a rank with a similar process as the Flowlines earlier in the script

# Urban Ranking
# If a HUC 12 did not have any urban land use, e.g. middle of the rockies, the join field process in the last section would result in a NULL
# in the attribute field, this gives us trouble when trying to use the SearchCursor and UpdateCursor, so first we need to change the NULLs into
# zeros.
arcpy.management.CalculateField(HUC12Rank, "Landuse_Area_Percent_Urban", "updatevalue(!Landuse_Area_Percent_Urban!)",
                                "PYTHON3", r"def updatevalue(value):\n"
                                           r"    if (value == None):\n"
                                           r"        return '0'\n"
                                           r"    else:\n"
                                           r"        return value\n         ")

# Creates a new field for the Rank to be calculated into
arcpy.AddField_management(HUC12Rank, "Percent_Urban_Rank", "Double", 9,
                          field_alias="Percent_Urban_Rank", field_is_nullable="NULLABLE")
maximum_urbanlanduse = max(row[0] for row in arcpy.da.SearchCursor(HUC12Rank, ['Landuse_Area_Percent_Urban']))
print(maximum_urbanlanduse)
minimum_urbanlanduse = min(row[0] for row in arcpy.da.SearchCursor(HUC12Rank, ['Landuse_Area_Percent_Urban']))
print(minimum_urbanlanduse)

# This portion creates an array of the Landuse_Area_Percent_Urban attribute. The array is then searched for zeros and removed to get proper percentile breaks
# the numpy percentile function is ran for 25th 50th and 75th percentiles and those values are saved as variables to be called on once start ranking. We use UpdateCursor
# again here to populate the Percent_Urban_Rank field. We give all 0 fields of percent land use to be rank 0, then compare the Landuse_Area_Percent_Urban to the
# percentile values we stored as p1 p2 p3.
array_urban = arcpy.da.FeatureClassToNumPyArray(HUC12Rank, 'Landuse_Area_Percent_Urban')
array_urbanview = array_urban.view(dtype=numpy.double)
array_urban_nonzero = numpy.nonzero(array_urbanview)
print(array_urban_nonzero)
p1urban = numpy.percentile(array_urban_nonzero, 25)
print(p1urban)
p2urban = numpy.percentile(array_urban_nonzero, 50)
print(p2urban)
p3urban = numpy.percentile(array_urban_nonzero, 75)
print(p3urban)

with arcpy.da.UpdateCursor(HUC12Rank, ['Percent_Urban_Rank', 'Landuse_Area_Percent_Urban']) as cursor:
    for row in cursor:
        if row[1] == 0:
            row[0] = 0
        elif minimum_urbanlanduse < row[1] <= p1urban:
            row[0] = 1
        elif p1urban < row[1] <= p2urban:
            row[0] = 2
        elif p2urban < row[1] <= p3urban:
            row[0] = 3
        elif row[1] > p3urban:
            row[0] = 4
        else:
            row[0] = 0
        cursor.updateRow(row)
print('Done w Urban Percentile Rank')‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

The result of the array print at line line 80 is as follows

(array([    0,     1,     2, ..., 11223, 11224, 11225], dtype=int64),)

The array seems to not be holding the percent area for each subwatershed but rather just a identifier value

And as to your previous suggestion, numpy.nan what line would I insert that function into?

Full print out for this section of code:

97.859185942202
0.0
(array([    0,     1,     2, ..., 11223, 11224, 11225], dtype=int64),)
3435.5
6312.0
8775.5
Done w Urban Percentile Rank

Sorry for the trouble lol

0 Kudos
DanPatterson_Retired
MVP Emeritus

Grant, I may have failed to emphasize that nonzero return an array for each dimension of an array, so a slice is needed. 

numpy lessons continue...

z = np.array([0., 1., 0., 3, 4., 5])  #  ---- some data with zeros

z0 = np.nonzero(z)[0]  # ---- returns the indices in z and you need to slice the first dimension
array([1, 3, 4, 5], dtype=int64)


good_vals = z[z0]  # ---- now use the indices to pull out the nonzero values

np.percentile(good_vals, (25, 50, 75))  # ---- now you are ready for percentiles
array([2.5 , 3.5 , 4.25])
GrantPalmer
New Contributor II

You're the man, Dan

0 Kudos