I'm working with some data that is a ridiculously large table (184 fields) and I would like to asses how many instances of of each field are non-null. In other words if the values are more more often null, I'll drop the field(s). For now I've written a script that performs an iterative selection for each field, and as you can imagine with 184 fields and a few thousand records it's pretty slow. I tried a couple other approaches that failed, but I have to think there is a better way to get a count of records for which a given field is populated. Here is what I've done to date:
import arcpy
table = r'J:\some\path\to\file.gdb\tableName'
fields = []
for f in arcpy.ListFields(table):
fields.append(f.name)
arcpy.MakeTableView_management(table,'tv')
for f in fields:
select = f'{f} is not null'
arcpy.SelectLayerByAttribute_management('tv','NEW_SELECTION',select)
c = arcpy.GetCount_management('tv')
print('Field {} has {} non-null records'.format(f,c[0]))
Your current approach is the most "ArcGIS" way, i.e., using ArcGIS geoprocessing tools, of doing it. The slowness is caused more by having 184 fields, which is a ridiculously excessive number of fields even for a data warehouse, than your approach.
If memory is abundant, you could load the whole dataset into a Python or NumPy data structure and then count the number of non-null values. Counting in a Python or NumPy data structure should be much faster than your current approach, but you have to be able to load all the data into memory or paging will slow it down.
I wonder if just loading into memory then used the Arc way would be faster also.
arcpy.FeatureClassToFeatureClass_conversion(old.gdb, "in_memory", "tempFC")
doubtful since it doesn't write to memory
You raise an interesting point/question. At first, I thought the same as Dan but for different reasons. Given the way that Windows 10 caches files in spare system memory, I thought there was a good chance that running Select Layer by Attribute on the same data set over and over would be hitting a cache and therefore not sped up much by moving the data to in-memory. It turns out, I was wrong.
I did some testing on a feature class with a handful of columns but millions of rows of data. I created a feature layer and then ran successive "IS NOT NULL" selections against the columns. It turns out the first run was a bit slower but negligibly so. All of the selections took about 8 seconds. I then copied the feature class into memory workspace and repeated the test, and the selection ran in about 0.2 seconds. So, copying the data into memory made a huge difference.
The only caveat I would add is that in-memory won't help nearly as much if the data set can't actually fit in RAM. If copying the data set results in ArcGIS paging some of the data to disk because there isn't enough RAM, the selections could be even slower than leaving them in the file GDB.
Well I knew it would work since I do it all the time . Arc is very chatty on the network (or hard drive). So moving it to RAM eliminates all those calls to hardware. Plus I guessed he was hitting SDE which means a network call, a server call, and a hard drive spin.
To fit it in RAM use PyScripter 64 bit.
Hope that helps the OP.
Thanks guys- yes Doug Browning it does help, at least for future applications; this was a one and done sort of thing and I was able to press on.
I use spyder 64 bit as my ide, and ram is cheap these days so this approach seems pretty close to being universal.
Hi Joe,
As also stated by Joshua, iterating through 184 fields in arcpy is going to be slow. I suggest you leverage the arcgis module and the Spatially Enabled Data Frame (extension of pandas.DataFrame), maybe something like this:
infc = r"Path\To\Your\Data" sedf = pandas.DataFrame.spatial.from_featureclass(infc) #Creates a pandas.DataFrame from your featureclass idx = sedf.isnull() # Returns an index that is True for None, False otherwise idx.sum() # True = 1, False = 0; idx.sum() will summarize each column to tell you how many null values are in that column
That will give you something like this:
OBJECTID 0
ORIG_FID 0
DATA 8
SHAPE 0
dtype: int64
Small demo... you should use
flds = "*"
flds =['Parts', 'Int_nulls', 'Float_nulls', 'Text_nulls']
d0 = arcpy.da.TableToNumPyArray(in_fc, field_names=flds, skip_nulls=False, null_value=nd)
d1 = arcpy.da.TableToNumPyArray(in_fc, field_names=flds, skip_nulls=True)
d0
array([(2, 1, nan, 'a'),
(2, -999, 6.10, 'None'),
(1, -999, 2.10, 'None')],
dtype=[('Parts', '<i4'), ('Int_nulls', '<i4'),
('Float_nulls', '<f8'), ('Text_nulls', '<U5')])
d1
array([],
dtype=[('Parts', '<i4'), ('Int_nulls', '<i4'),
('Float_nulls', '<f8'), ('Text_nulls', '<U5')])
A table with 3 records and I samples some fields, I have a function to convert nulls of any kind so that they are converted from <null> to something useful (I blogged about this).
BUT!!! if you dump the nulls versus you keep the nulls
d1.shape
(0,)
d0.shape
(3,)
Way ******** faster than a search cursor
Is your solution done in Pro using only default python modules or did you need to add additional modules using the clone? If needing the clone, what is the module(s) that you needed to add to execute your solution?