Excel ... arrays, tables in ArcGIS Pro

DanPatterson_Retired · ‎12-12-2018

X The xlrd and openpyxl modules packaged with ArcGIS Pro is a pretty cool for working with Excel files... if you are stuck and have to use them. Pandas depends on them, so you could just use Pandas to do the data conversion, but, numpy can be called into play to do a pretty good and quick conversion while at the same time cleaning up the data on its way in to a table in ArcGIS Pro

target="_self">(1) ---- Spreadsheets gone bad
(2) ---- Spreadsheets to array
(3) ---- Spreadsheets to array to geodatabase table
(4) ---- Spreadsheets to geodatabase table via arcpy
(5) ---- Spreadsheets to Pandas to table
(6) ---- The code
(7) ---- End notes...
References:

------------------------------------------------------

(1) ---- Spreadsheets gone bad

Here is a spreadsheet with so flubs builtin.

Column A contains integers, but excel treats them as just floating point numbers without a decimal place.

Column D is just text but with leading/trailing spaces, cells with spaces, empty cells and just about anything that can go wrong when working with text.

Column E has floats but two of the cells are empty or worse... a space.

All these conditions need to be fixed.

As for fixing blank rows, missing column headers, data not in the upper left quadrant of a sheet, or data that share the page with other stuff (like graphs etc etc)… its not going to happen here. This discussion assumes that you have an understanding on how you should work with spreadsheet data if you have to.

------------------------------------------------------

(2) ---- Spreadsheets to array

So with a tad of code, the spreadsheet can be converted to a numpy structured/recarray.

During the process, numeric fields which are obviously integer get cast to the correct format.

Malformed text fields/columns are cleaned up. Leading/trailing spaces are removed and empty cells and/or those with nothing but spaces in them are replaced by 'None'.

Empty cells in numeric floating point fields are replaced with 'nan' (not a number). Sadly there isn't an equivalent for integers, so you will either have to upcast your integer data or provide a null value yourself.

Best approach... provide your own null/nodata values

------------------------------------------------------

(3) ---- Spreadsheets to array to geodatabase table

Now... The array can be converted to a geodatabase table using NumPyArrayToTable.

arcpy.da.NumPyArrayAsTable(array, path)

where

`array` is from the previous step

`path` the full path and name of the geodatabase table.

it comes in as expected. The dtypes are correct and the text column widths are as expected. Note that text column widths are twice the Unicode dtype width (ie U20 becomes 40 characters for field length)

------------------------------------------------------

(4) ---- Spreadsheets to geodatabase table via arcpy

Excel to Table does a good job on the data types, but it takes some liberty with the field length. This may be by design or useful or a pain, depending what you intend to do with the data subsequently.

You can even combine xlrd with arcpy to batch read multiple sheets at once using the code snippet in the reference link below.

(5) ---- Spreadsheets to Pandas to table

Yes pandas does it via xlrd, then the data to numpy arrays, then to series and dataframes, then out to geodatabase tables. So you can skip pandas altogether if you want.

The data types can be a bit unexpected however, and there is no cleaning up of text fields isn't carried out completely, blank/empty cells are translated to 'nan' (not a number?) but a space in a cell remains as such.

The data type for the text column is an 'object' dtype which is usually reserved for ragged arrays (ie mixed length or data type).

df['Text_null'].tolist()
[' a leading space', 'b', 'a trailing space ', nan, ' ', 'back_to_good', ' four_leading', 'b', 'a', 'done']

(6) ---- The code

I put the code in the link to my `gist` on GitHub in case code formatting on this site isn't fixed.

excel_np.py .... convert excel to a structured array

There are some things you can do to ensure a proper data type. The following demonstrates how one little blank cell can foul up a whole column or row of data.

def isfloat(a):
    """float check"""
    try:
        i = float(a)
        return i
    except ValueError:
        return np.nan
    
# ---- Take some numbers... but you forgot a value so the cell is empty ie ''
#
vals = [6, 9, 1, 3, '', 2, 7, 6, 6, 9]

# ---- convert it to an array... we will keep it a surprise for now
#
ar = np.asarray(vals)

# ---- use the helper function `isfloat` to see if there are numbers there
#
np.array([isfloat(i) for i in ar])

# ---- yup! they were all numbers except for the blank
#
array([ 6.,  9.,  1.,  3., nan,  2.,  7.,  6.,  6.,  9.])

# ---- if we hadn't checked we would have ended up with strings
#
array(['6', '9', '1', '3', '', '2', '7', '6', '6', '9'], dtype='<U11')‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

If you really need to conserve the integer data type, they you will have to some hurdle jumping to check for `nan` (aka not-a-number)

# ---- our list of integers with a blank resulted in a float array
#
np.isnan(ar)  # --- conversion resulted in one `nan`
array([False, False, False, False,  True, False, False, False, False, False])

# ---- assign an appropriate integer nodata value
#
ar = ar[np.isnan(ar)] = -999

# ---- cast the array to integer and you are now done
#
ar = ar.astype('int')
array([   6,    9,    1,    3, -999,    2,    7,    6,    6,    9])‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

(7) ---- End notes...

So the next time you need to work with spreadsheets and hope that the magic of xlrd, openpyxl or pandas (which uses both) can solve all your problems.... take the time to look at your data carefully and decide if it is truly in the format you want BEFORE you bring it into ArcGIS Pro as a table

arr = excel_np("c:/temp/x.xlsx")

arr
 
array([('aaa', '01', 1), ('bbb', '02', 2)],
      dtype=[('field1', '<U4'), ('field2', '<U4'), ('field3', '<i4')])

import arcpy
arcpy.da.NumPyArrayToTable(arr, r"C:\Your_spaceless_path_to\a.gdb\x_arr")‍‍‍‍‍‍‍‍‍