da.InsertCursor and Parsing Large Text file

04-03-2019 01:27 PM
AlanDodson
New Contributor III

I have a very large text file (~5 GB, ~30 million lines) that I need to parse and then output some of the data to a new point feature class. I cannot figure out how to proceed with the da.InsertCursor.

I've created the feature class so the fields are in the required order, and an if statement parses out the required lines of the file. The units in the text file are feet, but I need rounded metres in the output, and the output field for that value (Tree_Height) must be Long. The round command returns a float, but the final value written to Tree_Height must be a long - can the float value be mapped into the Long field? Converting from feet to metres is simple multiplication, but I suspect my field types get mixed up at that point.
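What I'm aiming for, as a standalone sketch of just the unit step (0.3048 is the standard feet-to-metres factor; the helper name is made up, and your own factor may differ):

```python
FEET_TO_METRES = 0.3048  # standard international foot; substitute your factor if needed

def feet_to_long_metres(feet_text):
    """Parse a feet value read as text from the file and return it
    as a rounded Python int, suitable for writing to a Long field."""
    return int(round(float(feet_text) * FEET_TO_METRES))

print(feet_to_long_metres("100"))  # 100 ft is about 30 m
```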

The index positions of the Lat and Long fields in the source file are 5 and 6.

The index position of the Description field in the source file is 1.

The index position of the Tree_Height (in Feet) in the source file is 9.

The index position of the Code in the source file is 2 - this is the link between the user-inputted codes and the codes in the if statement.  The if statement works, but after that....

Can someone help me set up the da.InsertCursor so that these operations can (a) get done and (b) get done efficiently? I've been fumbling with lists of fields and tuples of things and am not making any progress. Have looked at help files and Penn State's online courses but still no joy.....

import arcpy, os

treesfile = arcpy.GetParameterAsText(0)
codes = arcpy.GetParameterAsText(1).split(",")   # user types in comma-delimited wildlife codes for the AOI
arcpy.env.workspace = arcpy.GetParameterAsText(2) 
sr = arcpy.SpatialReference(4326)

arcpy.env.overwriteOutput = True

Filtered_Trees = arcpy.CreateFeatureclass_management(arcpy.env.workspace, "Filtered_Trees", "POINT", "", "DISABLED", "DISABLED", sr)
arcpy.AddField_management(Filtered_Trees, "Lat", "DOUBLE", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")
arcpy.AddField_management(Filtered_Trees, "Long", "DOUBLE", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")
arcpy.AddField_management(Filtered_Trees, "Description", "TEXT", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")
arcpy.AddField_management(Filtered_Trees, "Tree_Height", "LONG", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")  # height value must be rounded and in Metres.
# there are many more fields but you get the idea

with open(treesfile, 'r') as file:
  for line in file:
    values = line.split("\t")
    if values[2] in codes and float(values[9]) >= 100:  
      # now I am stuck...
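Purely for illustration, the parse-and-filter step on its own, without arcpy, comes down to something like this (parse_line is a made-up helper name; the indexes are the ones listed above):

```python
# Field positions per the description above:
# Description = 1, Code = 2, Lat = 5, Long = 6, Tree height in feet = 9.
def parse_line(line, codes):
    """Split one tab-delimited line; return (lat, lon, description,
    height_ft) if its code is one we want, else None."""
    v = line.rstrip("\n").split("\t")
    if v[2] not in codes:
        return None
    return (float(v[5]), float(v[6]), v[1], float(v[9]))
```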
AlanDodson
New Contributor III

Hi Dan - attached is some sample data and another file that explains my goal in clearer terms. In my previous posts I'm sure I wasn't as succinct as I could've been. One thing to keep in mind, and that I am clearly getting hung up on, is that filtering the data first could weed out 99% of the lines from the input file. In my head this seems like the best way to process the data, but what I see in my head doesn't always translate easily into code, or into an understanding of how the most efficient code works.

I appreciate your posts on this and I've already learned a ton, thank you! 

curtvprice
MVP Esteemed Contributor

I think you should read 1000 lines into a list, drop the columns you don't need, and do your value calculations (using Python list operations or, as Dan suggests, converting the list to a numpy array, operating on the data blazingly fast, and converting back to a list). Then, once you have your data in Python lists (or tuples), use arcpy.da.InsertCursor to write your 1000 records. Then repeat.
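The read-1000/process/insert loop could be sketched with a small batching helper (pure Python; the InsertCursor call itself is omitted, and the helper name is illustrative):

```python
from itertools import islice

def batches(iterable, size=1000):
    """Yield successive lists of up to `size` items from any iterable,
    such as the lines of a large file, without loading it all at once."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# e.g. for rows in batches(open(path), 1000): process rows, then insert them
```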

AlanDodson
New Contributor III

All that sounds ideal - just to wrap my head around it now.  The ratio of input lines to output lines is very lopsided due to the filtering I need to do so I'm wondering how that will work.  I'm going to post some data in the morning and hopefully that will help. 

DanPatterson_Retired
MVP Emeritus

Curtis, since the array would contain coordinates, the processing 'result' can be converted directly to a point featureclass. Even without coordinates, it can be converted directly to a table (via arcpy.da.NumPyArrayToFeatureClass and arcpy.da.NumPyArrayToTable). If you had to process in batches (worst-case scenario), I would simply save the intermediate results to disk in *.npy format, then assemble them when all is done. You don't need insert or update cursors as long as the data format is known. For point featureclasses and/or a plain old gdb table, there is no need for cursors at all; processing polygon and polyline output requires a few extra steps, and I only use cursors for the geometry. I use arcpy.da.ExtendTable to bring in and join the array to featureclass tables (even taking care not to use OBJECTID as the join key).

Even if you like cursors, numpy lurks somewhere underneath them:

cur = arcpy.da.SearchCursor(in_fc, "*", None, SR)
cur._dtype  # ---- looks like an np.dtype to me

dtype([('OBJECTID', '<i4'), ('Shape', '<f8', (2,)), ('Id', '<i4'), ('Long_1', '<i4'),
       ('Short_1', '<i4'), ('Float_1', '<f4'), ('Double_1', '<f8'), ('Text_1', '<U10'),
       ('Shape_Length', '<f8'), ('Shape_Area', '<f8'), ('DT_str', '<U20')])
The same goes for the other access point to arrays, with the added bonus of skipping any row that hasn't been cleaned of flotsam such as nulls (and you can control what counts as 'null'):
a = arcpy.da.FeatureClassToNumPyArray(in_fc, "*", skip_nulls=True)
a

array([(1, [ 300015.   , 5000005.   ], 1, 1, 4, 1., 100., 'A 10 chars', 40.   , 100., '2019/03/28 00:00:00'),
       (3, [ 300010.495, 5000010.593], 3, 3, 6, 3., 300., 'C not null', 99.416, 182., '2019/03/30 00:00:00')],
      dtype=[('OBJECTID', '<i4'), ('Shape', '<f8', (2,)), ('Id', '<i4'), ('Long_1', '<i4'),
             ('Short_1', '<i4'), ('Float_1', '<f4'), ('Double_1', '<f8'), ('Text_1', '<U10'),
             ('Shape_Length', '<f8'), ('Shape_Area', '<f8'), ('DT_str', '<U20')])

Sadly, NumPy and SciPy are really underused in GIS analysis, let alone in data preparation. Cute pandas get all the attention.
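For reference, the list-to-structured-array step on its own (minus the arcpy conversion calls, which need ArcGIS) might look like this; the rows and dtype names here are illustrative, not from the thread's data:

```python
import numpy as np

# Parsed rows as (x, y, description, height_m) tuples from the text file.
rows = [(300015.0, 5000005.0, "oak", 30),
        (300010.5, 5000010.6, "fir", 28)]

# A structured dtype mirroring the output fields.
dtype = np.dtype([("X", "<f8"), ("Y", "<f8"),
                  ("Description", "<U32"), ("Tree_Height", "<i4")])

arr = np.array(rows, dtype=dtype)
# arr could now be handed to arcpy.da.NumPyArrayToFeatureClass
# (naming X/Y as the shape fields) to skip cursors entirely.
print(arr["Tree_Height"].tolist())
```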

curtvprice
MVP Esteemed Contributor

At times like this I miss awk.

AlanDodson
New Contributor III

Amen!

AlanDodson
New Contributor III

I have figured out a solution that gives me exactly what I want, and it is fast (enough). It parses the large file and writes the data I need in about 2.5 minutes, which is about 4 times faster than the external VB script it replaces, and it gives me a GIS output within my current workflow.

I did not use any of the array code that Dan mentioned, but I will be reading those answers very closely and applying whatever I can to future projects. Much appreciated, Dan Patterson!

In case anyone is in the same circumstance this is what I am using:

import arcpy

Input = arcpy.GetParameterAsText(0)
codes = arcpy.GetParameterAsText(1).split(",")   
arcpy.env.workspace = arcpy.GetParameterAsText(2) 
sr = arcpy.SpatialReference(4326)

arcpy.env.overwriteOutput = True

Output = arcpy.CreateFeatureclass_management(arcpy.env.workspace, "Output", "POINT", "", "DISABLED", "DISABLED", sr)
arcpy.AddField_management(Output, "LAT", "DOUBLE", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")
arcpy.AddField_management(Output, "LONG", "DOUBLE", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")
arcpy.AddField_management(Output, "IOH", "TEXT", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")
arcpy.AddField_management(Output, "MEASURE1", "LONG", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")
arcpy.AddField_management(Output, "MEASURE2", "LONG", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")
arcpy.AddField_management(Output, "MEASURE3", "LONG", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")
arcpy.AddField_management(Output, "CODE", "TEXT", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")
arcpy.AddField_management(Output, "TYPECODE", "LONG", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")
arcpy.AddField_management(Output, "TYPENAME", "TEXT", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")
arcpy.AddField_management(Output, "FEATCODE", "TEXT", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")
arcpy.AddField_management(Output, "FEATNAME", "TEXT", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")
arcpy.AddField_management(Output, "COL", "TEXT", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")
arcpy.AddField_management(Output, "MULT", "LONG", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")
arcpy.AddField_management(Output, "DATE3", "TEXT", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")
arcpy.AddField_management(Output, "DATE5", "TEXT", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")
arcpy.AddField_management(Output, "AORD", "TEXT", "", "", "", "", "NULLABLE", "NON_REQUIRED", "")

with open(Input, 'r') as file:
  fields = ["SHAPE@XY", "LAT", "LONG", "IOH", "MEASURE1", "MEASURE2", "MEASURE3",
            "CODE", "TYPECODE", "TYPENAME", "FEATCODE", "FEATNAME", "COL", "MULT",
            "DATE3", "DATE5", "AORD"]
  with arcpy.da.InsertCursor(Output, fields) as InsCur:
    for line in file:
      values = line.split("\t")
      height_m = float(values[9]) * 0.2691
      if values[2] in codes and height_m >= 75:   # filter first; ~99% of lines are rejected
        LAT = float(values[5])
        LONG = float(values[6])
        IOH = values[1]
        MEASURE1 = int(round(height_m))           # rounded metres for the LONG field
        MEASURE2 = int(round(float(values[10]) * 0.2691))
        MEASURE3 = int(round(float(values[11]) * 0.2691))
        CODE = values[2]
        TYPECODE = values[7]
        TYPENAME = values[8]
        FEATCODE = values[38]
        FEATNAME = values[39]
        COL = values[18]
        MULT = values[20]
        DATE3 = values[23]
        DATE5 = values[24]
        AORD = values[32]
        newGeom = arcpy.PointGeometry(arcpy.Point(LONG, LAT), sr)
        InsCur.insertRow([newGeom, LAT, LONG, IOH, MEASURE1, MEASURE2, MEASURE3,
                          CODE, TYPECODE, TYPENAME, FEATCODE, FEATNAME, COL, MULT,
                          DATE3, DATE5, AORD])

OutputCount = arcpy.GetCount_management(Output)
OutputCountInt = int(OutputCount.getOutput(0))
if OutputCountInt < 1:
  arcpy.AddMessage("There are no valid points in your AOI.")
else:
  arcpy.AddMessage("Points have been written to the Output feature class in your input database.")