After spending a few months working through a Python for Data Science curriculum, I have gotten to like woriking with pandas for data prep in Python. Naturally, I wanted to try using it with arcpy. Usually, I would attempt large batch GIS data manipulation or data moving with FME. But FME, while it is powerful, often takes too many settings or mouse-clicks to remember. Seems like I can never remember some parameter that is only visible after drilling down some UI tree. If I script it, I see all my settings. I found a great example of getting a pandas Dataframe into a geodatabase here.
import arcpy
from arcpy import envenv.workspace = "my_favorite_workspace" #pointing to SDE connection file
#I already have a dataframe df that is loaded from SQL Server using pyodbc
#import pandas as pd
#import pyodbc#sql_conn = pyodbc.connect('''DRIVER={ODBC Driver 13 for SQL Server};
# SERVER=<myserver>;
# DATABASE=mydatabase>;
# Trusted_Connection=yes''')#query = "<myquery>"
#df = pd.read_sql(query, sql_conn,columns=['count'])
This all works like a charm. I can now do all kinds of data manipulation.Then I create a structured array in numpy.
#first you turn your dataframe into a simple np series
nparray = df.values
#then you create a structured numpy array, see here for documentation
npstruct = np.rec.fromrecords(nparray)
#then you create a list of field names from the dataframecolnames = df.dtypes.index.tolist()
#then you use that list to update column names in structured array
npstruct.dtype.names = tuple(colnames)
Again, everything works fine. I get my field names and I get a structured array that looks like it should based on the documentation.The final step is to use that array in arcpy.da.NumPyArrayToTable() to create a table. ESRI has additional documentation here.
arcpy.da.NumPyArrayToTable(np_struct, "newtable")
---------------------------------------------------------------------------RuntimeError Traceback (most recent call last)<ipython-input-31-14d9b2719bd8> in <module>----> 1 arcpy.da.NumPyArrayToTable(np_struct, "newtable")RuntimeError: create table
Need help figuring out what's the problem. I have my env.workspace pointing to an SDE connection. I've also tried using the same code to create "newtable.dbf" in a local directory. Don't think it's permissions either because the following is successful:
arcpy.CreateTable_management(env.workspace, 'new_table')
make sure you are using a full path to the table in the geodatabase... skip dbf altogether
arcpy.da.NumPyArrayToTable(array, "c:/data/gdb.gdb/out_table")
where array is the structured array, and out_table is the table name to use in the *.gdb.
PS overwriteOutput has no affect. It will fail miserably even if set, so use a different name if you have saved it before.
Thanks, Dan. I tried that and get the exact same error. Could it be that I have to actually create the table before loading data? Ha! I will try that next.
tbl = r"c:\path\to_npy_file\sample_20.npy" # save this to a volder
a = np.load(tbl)
a.dtype
dtype([('OBJECTID', '<i4'), ('f0', '<i4'), ('County', '<U2'), ('Town', '<U6'), ('Facility', '<U8'), ('Time', '<i4')])
a
array([( 1, 0, 'B', 'B_', 'Hall', 11), ( 2, 1, 'A', 'A_', 'Hall', 24),
( 3, 2, 'C', 'C_', 'Hosp', 43), ( 4, 3, 'A', 'B_', 'Hall', 43),
( 5, 4, 'B', 'B_', 'Hall', 16), ( 6, 5, 'B', 'A_', 'Hall', 8),
( 7, 6, 'A', 'C_', 'Hall', 26), ( 8, 7, 'B', 'C_', 'Hall', 31),
( 9, 8, 'C', 'C_', 'Hall', 7), (10, 9, 'A', 'A_', 'Hall', 58),
(11, 10, 'A', 'A_', 'Hosp', 20), (12, 11, 'C', 'A_', 'Hosp', 37),
(13, 12, 'C', 'B_', 'Hall', 36), (14, 13, 'A', 'B_', 'Hosp', 33),
(15, 14, 'C', 'C_', 'Hosp', 51), (16, 15, 'B', 'C_', 'Hosp', 53),
(17, 16, 'C', 'A_', 'Hosp', 21), (18, 17, 'C', 'C_', 'Hosp', 42),
(19, 18, 'A', 'B_', 'Hosp', 43), (20, 19, 'A', 'C_', 'Hall', 5)],
dtype=[('OBJECTID', '<i4'), ('f0', '<i4'), ('County', '<U2'), ('Town', '<U6'), ('Facility', '<U8'), ('Time', '<i4')])
Load that then try NumPyArrayToTable. put it into a gdb and give it a name
I know it works, I use this as a test file all the time.
Thanks, Dan, and happy new year! - I tried your above example and it works for me just fine. I had to use the full path to the SDE or GDB table. Don't understand why using only table name and env.workspace doesn't work but I can create a new table with correct field names and data types of the structured array.
However, I am still not getting anywhere with my own example. The whole purpose of this exercise was to be able to work with a dataframe in pandas, and get the pandas datatypes converted to numpy so that the arcgis function can create a table from it. By now, I suspect this method may be the culprit - scipy.org search returns no results for it at all.
In your case, the array you created "manually" looked like this:
When I use the above method on the df.values and convert back into ndarray, I get :
So although np.array() is supposed to turn the record array back into a regular nd.array, there is still that "numpy.record" in the dtype.
The more I think about it, this probably isn't a question for Geonet any longer but rather a numpy question. But if you have any suggestions, let me know.
Arne, I just don't use pandas, There is really no need to if you are just bringing arrays (aka tables and/or rasters) to get pandas involved at all. It really doesn't offer much more to numpy except for a gentler 'shell' to the interface, but for the most part, I haven't found anything that pandas offers that I can't already handle (including pivot tables, groupby operations and joins).
I have dozens of blog posts on working with numpy and arcpy for both raster and tabular data
Dan, thanks for chiming in again... I've been a little side tracked from looking at this again. I hear you. If you have a toolkit that does the trick, why learn another one that accomplishes the same. Especially in a world of wrappers around wrappers. But I kinda feel that way about pandas... I know it better than numpy and wanted to figure out how to do that. --- I did take a look at your blog just now and will be sure to return to it soon. Thanks.
version .24 of pandas will have a .to_numpy() method to convert a data frame to a numpy array. ArcGIS Pro 2.3.2 ships with version .23 so the method is not available yet. In the meantime you can use the following.
#df is the pandas dataframe
x = df.reset_index()
z = np.rec.fromrecords(x.values, names=x.columns.tolist())
arcpy.da.NumPyArrayToTable(z, out_table)
more info on this thread
https://stackoverflow.com/questions/13187778/convert-pandas-dataframe-to-numpy-array
Mark,
I am attempting to use your proposed solution and I get an error from Arcpy telling me that the the in_array is not a numpy array (TypeError: 'in_array' is not numpy array). Now my IDE recognizes it as an 'recarry'.
change the 2nd line to the following. You need to pass in the values of the dataframe, not the dataframe itself.
z = np.rec.fromrecords(x.values, names=x.columns.tolist())