I've been trying all sorts of methods for converting some very large CSV files (~20 million records) to FGDB Tables. Every method results in max RAM being exceeded (16GB @ 64-bit). Through ArcGIS Desktop I've tried Table to Table, Copy Rows, and Load Data. Through ArcPy I've tried the same tools, as well as Cursors. The only thing that I think will work is reading the data in chunks (around 100,000 records at a time), and inserting those "chunks" into the table. I've managed to use the Python Pandas library to read the data in chunks. I just need to figure out how to write the chunks to the GDB table. Any ideas?
Hi Eric Mahaffey,
If you first set up an empty table in the file geodatabase containing fields named as per your CSV file, you could insert the records from the pandas dataframe into the file geodatabase table using an arcpy.da.InsertCursor - something like this:
import pandas as pd
import arcpy

filename = r'C:\path\to\file.csv'
fc_table = r'C:\path\to\file.gdb\table'
chunksize = 100000

# Read the CSV in manageable chunks and insert each chunk into the FGDB table
for chunk in pd.read_csv(filename, chunksize=chunksize):
    keys = chunk.keys().tolist()  # column names - these must match the table's field names
    insert_cur = arcpy.da.InsertCursor(fc_table, keys)
    for row in chunk.itertuples(index=False):
        insert_cur.insertRow(row)
    del insert_cur  # release the cursor (and its lock on the table) after each chunk
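If you've not already built that empty table, one rough way to do it with arcpy is below - the field names and types are placeholders and would need to match your CSV columns exactly:

import arcpy
import os

gdb = r'C:\path\to\file.gdb'
table_name = 'table'

# Create the empty table in the file geodatabase
arcpy.CreateTable_management(gdb, table_name)

# Add one field per CSV column - the names and types below are examples only,
# and the names must match the CSV header exactly
table = os.path.join(gdb, table_name)
for field_name, field_type in [('SITE_ID', 'TEXT'), ('LAT', 'DOUBLE'), ('LONG', 'DOUBLE')]:
    arcpy.AddField_management(table, field_name, field_type)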
James,
Thank you so much for the quick response. I've integrated your solution into my script, and I'm getting an error "CParserError: Error tokenizing data. C error: Expected 22 fields in line 68, saw 23". I believe it's because the CSV file does not have an OBJECTID field. How do I tell the InsertCursor function to either skip the first field, or identify the specific fields to insert to?
~Eric
The insert cursor will never raise a CParserError about tokenizing data. That error is common to the pandas read_csv function. You have a row in your CSV file that isn't being parsed correctly.
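One way to find the offending rows is to have read_csv skip and warn about them rather than raising - a rough sketch; note that newer pandas (1.3+) replaces these parameters with on_bad_lines='warn':

import pandas as pd

filename = r'C:\path\to\file.csv'

# Older pandas: skip rows with the wrong number of fields and warn about each one,
# so the chunked read keeps going and the bad line numbers show up in the output
for chunk in pd.read_csv(filename, chunksize=100000,
                         error_bad_lines=False, warn_bad_lines=True):
    print(len(chunk))  # or carry on and insert the chunk as above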
Joshua,
I believe you're right. This data has been nothing but problems. I managed to force it by adding the usecols parameter within the pd.read_csv function. Then I started having "bad value" errors. So I converted all of my target fields to text with a large width just to get the data to go in. This worked. However, when I changed a couple of columns (i.e. LAT and LONG) back to Double type fields it caused a "RuntimeError: The value type is incompatible with the field type" error. I can't seem to catch a break.
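(For reference, the usecols / all-text read described above looks roughly like this - the column names are placeholders:)

import pandas as pd

filename = r'C:\path\to\file.csv'

# Read only the columns that exist in the FGDB table (names here are examples),
# and read everything as text so nothing is rejected at parse time
wanted = ['SITE_ID', 'LAT', 'LONG']
for chunk in pd.read_csv(filename, chunksize=100000, usecols=wanted, dtype=str):
    print(chunk.head())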
Eric,
The runtime error sounds like there is some text in the LAT and LONG fields. In the first instance I'd catch the exception and print out the records that are creating problems - then have a look for any common issues in the data and add a few lines of code to your script to fix those values before inserting them into the FGDB table.
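For example, one possible way to clean the coordinate columns before inserting - a sketch that assumes the columns are named LAT and LONG, using a hypothetical clean_coords helper:

import pandas as pd

def clean_coords(chunk, lat_col='LAT', lon_col='LONG'):
    # Coerce the coordinate columns to numbers; any value that won't convert
    # becomes NaN, and those rows are dropped before the insert
    chunk[lat_col] = pd.to_numeric(chunk[lat_col], errors='coerce')
    chunk[lon_col] = pd.to_numeric(chunk[lon_col], errors='coerce')
    return chunk.dropna(subset=[lat_col, lon_col])

# e.g. chunk = clean_coords(chunk) inside the read_csv loop, before the cursor insert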
With the addition of exception handling -
import pandas as pd
import arcpy

filename = r'C:\path\to\file.csv'
fgdb_table = r'C:\path\to\file.gdb\table'
chunksize = 100000

for chunk in pd.read_csv(filename, chunksize=chunksize):
    keys = chunk.keys().tolist()
    insert_cur = arcpy.da.InsertCursor(fgdb_table, keys)
    for row in chunk.itertuples(index=False):
        try:
            insert_cur.insertRow(row)
        except RuntimeError:
            # Print the problem row and carry on with the rest of the chunk
            print("ERROR - Failed to insert: {}".format(row))
    del insert_cur
If you've not already done so, you may also want to add a call to arcpy.TruncateTable_management() at the start of the script to clear out any data inserted on the previous run, and prevent data being duplicated in the FGDB table each time the script runs.
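Something along these lines, before the read_csv loop (using the same fgdb_table path as above):

import arcpy

fgdb_table = r'C:\path\to\file.gdb\table'

# Empty the target table so re-running the script doesn't duplicate rows
arcpy.TruncateTable_management(fgdb_table)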
James,
You're an absolute lifesaver! The try/except statement lets me see which records are having an issue, and it allows the process to continue through the errors. As suspected, it's pretty much all being caused by one value that has an extra comma in it, which throws the whole thing off. Fortunately I'm able to omit this value since I only need the majority of the data in each table. Now I can move on and process all the other tables that I have to do. Again, I can't thank you enough!
~Eric
"CParserError: Error tokenizing data. C error"
In most cases this error arises when you're using a separator (e.g. a comma ',') as the delimiter and a row contains more separators than expected - that is, more fields in the error row than are defined in the header. You need to either remove the additional field or remove the extra separator if it's there by mistake. The better solution is to investigate the offending file and fix it manually so you don't need to skip the error lines.
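A quick way to find those rows is to count the fields on each line with the standard csv module and report any line that doesn't match the header - a minimal sketch, assuming a comma-delimited file:

import csv

filename = r'C:\path\to\file.csv'

with open(filename, 'r') as f:
    reader = csv.reader(f)
    header = next(reader)
    expected = len(header)
    # Report every line whose field count differs from the header
    for lineno, row in enumerate(reader, start=2):
        if len(row) != expected:
            print("Line {}: expected {} fields, saw {}".format(lineno, expected, len(row)))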