Eliminating data outliers using Python Parser of ArcGIS Field Calculator?

GoldenJiang · ‎08-29-2016

For a while I've been using boxplot and definition query in ArcGIS to eliminate data outliers. It is a painful process when dealing with a lot of files and difficult to ensure the consistency.

I don't know much python but I want to try something like replacing outliers as `Null` in python parse of field calculator:

def function(x):
qnt = quantile(x, probs=c(.25, .75));
H = 1.5 * IQR(x);
y=x;
y[x < (qnt[1] - H)] = "";
y[x > (qnt[2] + H)] = "";
return y‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Then output in my new field function`( !Yield_kg__! )` but it turns out failure below:

 ERROR000539:Error running expression: function(0.0)
 Traceback (most recent call last)
 File "<expression>", line1, in <module>
 File "<string>', line2, in function
 NameError: globe name 'quantile' is not defined‍‍‍‍‍

Can anyone tell me how to fix this?

DanPatterson_Retired · ‎08-29-2016

numpy has a percentile, ergo, you can determine quantiles.... a function which you haven't defined, or indicated it's source.

>>> a = np.arange(20)
>>> p_20 = np.percentile(a, 20)
>>> p_80 = np.percentile(a, 80)
>>> print("p_20 {: 5.3f}, p_80 {: 5.3f}".format(p_20, p_80))
p_20  3.800, p_80  15.200
>>> # now just slice what you need
>>> between = ((a > p_20) & (a < p_80))
>>> a[between]
array([ 4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])
‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Now you can put this into a function, simply by importing numpy, reading the column, classifying your values if you want

GoldenJiang · ‎08-29-2016

Sorry about my python level ... ... the python in ArcGIS confuses me a bit.

I modified your code as I am thinking something simple as if-else statement.

import numpy as np
def Reclass(a):
    p_20 = np.percentile(a, 20)
    p_80 = np.percentile(a, 80)
    if (a > p_20 and a < p_80):
        return a
    else:
        return("")
‍‍‍‍‍‍‍‍

I copied this to the Python console in ArcGIS (I assume the function is held out), run it, and then run

Reclass(!field!)‍

for my 'newField' in the Parser of field calculator.

Also, I tried copying this to 'Pre-logic Script Code' and run `Reclass(!field!)` in my 'newField' box.

Both of them still show `Null`.

DanPatterson_Retired · ‎08-29-2016

This isn't meant to be a field calculator function...

Basically you need to work with cursors or numpy to answer many questions.

Array 'a' is all the records in a field from your table. This can be obtained via a searchcursor or using

TableToNumPyArray—Help | ArcGIS for Desktop or FeatureClassToNumPyArray plus others.

For example, partially following the help topic

input = "c:/data/usa.gdb/MyData/MyTable"
field1 = "FieldA"
a = arcpy.da.TableToNumPyArray(input, (field1))‍‍‍

but now I am going to cheat using some data from before... so I will substitute 'a' for my 'a'

a = np.arange(20)
p_20 = np.percentile(a, 20)
p_80 = np.percentile(a,80)
between = ((a > p_20) & (a < p_80))
result = a[between]
array([ 4, 5, 6, 7, ..., 12, 13, 14, 15])
‍‍‍‍‍‍

Now do what you with result... graph it, determine its new statistics... blah blah

If you want to reclass, that can be done as well... let me know and I can give an example of how to reclass the array 'a' in the first part if outside the p_20 to p_80 range as you seem to want to do. But, you can't use a string "", you can reclass to np.NaN (which is equivalent to not a number in Numpy, so it is a number... but its not... got it? )

DanPatterson_Retired · ‎08-29-2016

Ok... more numpy lessons

Lets assume p_20 = 5 and p_80 = 15 (because I am too lazy to recalculate them)

Start with your search cursor or table to numpy array thing and get 'a' (I am cheating again, order isn't important eithr)

So array 'b' represents the 'result' from above... it need not be in order, but probably will be... doesn't matter.

Word of the day... np.in1d ... what it finds is all the occurrences in 'a' of what is in 'b'. It gives you a boolean array (0's and 1's)

Next thing you do is use 'np.where' to find out where the matches are, and if found give it the value in 'a' otherwise assign it that np.NaN thing I mentioned before. The results is where things match and where they don't they are 'nan' (not a number, but really...).

If you really, really must get it back to your table, then you can

>>> a = np.arange(20)
>>> a
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
 18, 19])
>>> b = np.array([5,6,9,8,7,15,14,12,13,10,11])
>>> np.in1d(a,b)
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)
>>> np.where(np.in1d(a, b), a, np.NaN)
array([ nan, nan, nan, nan, nan, 5.000, 6.000, 7.000, 8.000, 9.000,
 10.000, 11.000, 12.000, 13.000, 14.000, 15.000, nan, nan, nan,
 nan])
‍‍‍‍‍‍‍‍‍‍‍

You can make a small table out of the result and use

ExtendTable—Help | ArcGIS for Desktop

to 'join' the resultant array back to the original table.

really its not hard

>>> outtable = np.zeros((len(result),), dtype=[('ID', 'int32'), ('Result', 'float64')])
>>> outtable['ID'] = np.arange(len(result))
>>> outtable['Result'] = result
>>> outtable
array([(0, nan), (1, nan), (2, nan), (3, nan), (4, nan), (5, 5.0), (6, 6.0),
 (7, 7.0), (8, 8.0), (9, 9.0), (10, 10.0), (11, 11.0), (12, 12.0),
 (13, 13.0), (14, 14.0), (15, 15.0), (16, nan), (17, nan), (18, nan),
 (19, nan)], 
 dtype=[('ID', '<i4'), ('Result', '<f8')])‍‍‍‍‍‍‍‍‍

Then you just do the extend table thing. Now don't let all that magic in the above table scare you, it is basic numpy stuff. Pandas (available in ArcGIS PRO) is a gentler version of numpy and handles some of this if you can't understand the magic.

What I did

outtable ..... make an empty array of zeros, which is the lengthe of the 'result' array (some magic, the comma placement is paramount

The table will have 2 columns, one called 'ID' (integer) and the other 'Result' (double, floating point number since I used NaN

outtable['ID'] just assigned number from 0 to the end into this column

outtable['Result'] same thing, just used the 'result' array for this column

On an IThingy so I can't actually do the join back to the original table, if you really really must.

Otherwise you can save the result out to a file using several options... which I wont' get into here.

GoldenJiang · ‎09-05-2016

Thanks Dan, sorry for the late reply. I need some time to know the basis.

>>> import numpy as np
>>> input = 'Stumps Pop'
... field1 = "Yld_Mass_D"
... a = arcpy.da.TableToNumPyArray(input, (field1))
... a = np.arange(20)
... p_20 = np.percentile(a, 25)
... p_80 = np.percentile(a, 75)
... between = ((a > p_20) & (a < p_80))
... result = a[between]
... print result‍‍‍‍‍‍‍‍‍‍

In this case, how do I access to my attribute column 'Yld_Mass_D' and assign new values to my new column 'NewField'?

(p.s. forgive my idiotic question... steep learning curve)

DanPatterson_Retired · ‎09-05-2016

Golden... you have to carry on in the block after the line

its really not that hard....

You have your 'result' now you have to create 'outtable as I have shown, then you need to use ExtendTable

ExtendTable—Help | ArcGIS for Desktop sort of like this

arcpy.da.ExtendTable("c:/folder/your.gdb/table_or_featureclass", "ObjectID", outtable, "ID")

where ObjectID is your object id field .... could be as shown or FID (shapefile) or whatever,

outtable is the array just created,

ID is the identification field that was created in outtable.

give it a shot... I can always repurpose a field calculator script, but learning both ways is better. Do it once, you have a template, which you can quickly repurpose.