Memory issues with a large dictionary

NeilAyres · ‎03-16-2016

I thought of asking this question over at SE, but those pro pythonistas will probably give me an answer I won't understand.

I am trying to load (via arcpy.da.SearchCursor) a bunch of tables. They are all related to each other via a series of attribute ids. The project is actually about a telecoms problem: tracing a fibre route from end to end, along with all the bits of kit that sit along the route.

Unfortunately, there are rather a lot of records to load.

My loading code below:

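import sys
import time

import arcpy

# table name -> fields to read; the first field is the table's id (used as the dict key)
tblInfoDict = {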
           "route" : ["routeid", "name"],
           "routedetail" : ["routedetailid", "routeid", "x_table", "x_id", "num"],
           "port" : ["portid", "x_table", "x_id", "num", "grp"],
           "fibermngr" : ["fibermngrid", "name", "x_table", "x_id", "fibermngrtypeid"],
           "building" : ["buildingid", "name", "gpslatitude", "gpslongitude"],
           "strand" : ["strandid", "x_table", "x_id", "num", "bundle", "color"],
           "span" : ["spanid", "spantypeid", "length", "locateid"],
           "cable" : ["cableid", "spanid", "spantypeid"],
           "enclosure" : ["enclosureid", "name", "x_table", "x_id", "enclosuretypeid"],
           "access_point" : ["access_pointid", "name", "typ", "gpslatitude", "gpslongitude"],
           "ductbank" : ["ductbankid", "name"],
           "spantype" : ["spantypeid", "name"],
           "innerduct" : ["innerductid", "ductbankid", "superductid"],
           "superduct" : ["superductid", "ductbankid"]
           }
# load a series of dictionaries
data_dict = {}
for tbl, flds in tblInfoDict.iteritems():
    print "Reading {}".format(tbl)
    t1 = time.time()
    data_dict[tbl] = {}
    temp_dict = {r[0] : r[1:] for r in arcpy.da.SearchCursor(tbl, flds)}
    print "Size {}".format(sys.getsizeof(temp_dict))
    data_dict[tbl].update(temp_dict)
    del temp_dict
    print "Total Size {}".format(sys.getsizeof(data_dict))
    t2 = time.time()
    print "Read took {:.2f} secs".format(t2 - t1)

I inserted the sys.getsizeof(object) to try and get a handle on what I was consuming.

The run window:

>>> 
Reading ductbank
Size 1573004
Total Size 140
Read took 1.33 secs
Reading access_point
Size 1573004
Total Size 140
Read took 0.62 secs
Reading spantype
Size 524
Total Size 140
Read took 0.10 secs
Reading superduct
Size 3145868
Total Size 140
Read took 1.24 secs
Reading enclosure
Size 393356
Total Size 140
Read took 0.27 secs
Reading strand
Size 50331788
Total Size 524
Read took 22.63 secs
Reading routedetail
Size 25165964
Total Size 524
Read took 7.39 secs
Reading building
Size 393356
Total Size 524
Read took 0.25 secs
Reading fibermngr
Size 393356
Total Size 524
Read took 0.22 secs
Reading span
Size 1573004
Total Size 524
Read took 0.43 secs
Reading cable
Size 1573004
Total Size 524
Read took 0.59 secs
Reading route
Size 1573004
Total Size 524
Read took 0.60 secs
Reading port
Size 25165964
Traceback (most recent call last):
  File "C:\Data\ESRI-SA\DarkFibreAfrica\Vodacom\Python\ProcessDBTables.py", line 48, in <module>
    data_dict[tbl].update(temp_dict)
MemoryError
>>> sys.version
'2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]'
>>>

So the output of sys.getsizeof(), when pointing at my intermediate temp_dict reports something real.

But the output for data_dict, which is accumulating the data, makes no sense whatsoever.

If anyone has insight into this, please explain.

So, to get to my question.

Could I run this script using the 64 bit version? And how?

Would I be able to get the entire large dict into memory using the 64 bit python?

Running this in v10.3.1 python 2.7.8

WesMiller · ‎03-16-2016

Neil, have you thought about doing this with numpy? See the ExtendTable page in the ArcGIS for Desktop help.

NeilAyres · ‎03-16-2016

Thanks Wes,

no I hadn't. But I don't think that is the answer to my problem.

The nested dictionary approach is what I need.

Those "x_table", "x_id" fields contain table names (now pointers to other parts of the nested dictionary).

With this structure and the correct flow from one to another, I should be able to recursively rip through this at the speed of light.
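A minimal sketch of that kind of traversal, under some loud assumptions: the sample rows are invented, the value tuples follow the field order in tblInfoDict with the leading id removed, and "x_table"/"x_id" are taken to name the parent record. The helper trace_parents and the PARENT_POS lookup are hypothetical names, and a loop is used here instead of recursion.

```python
# Hypothetical sample rows, keyed by id. Each tuple holds the remaining
# fields in the same order as tblInfoDict (id column removed).
data_dict = {
    "building": {10: ("HQ", -26.1, 28.0)},          # name, gpslatitude, gpslongitude
    "fibermngr": {7: ("FM-7", "building", 10, 3)},  # name, x_table, x_id, fibermngrtypeid
    "port": {501: ("fibermngr", 7, 1, "A")},        # x_table, x_id, num, grp
}

# Index of x_table inside each table's value tuple; x_id follows it.
PARENT_POS = {"port": 0, "fibermngr": 1}


def trace_parents(table, rec_id):
    """Follow (x_table, x_id) pointers upward and return the chain visited."""
    chain = [(table, rec_id)]
    seen = {(table, rec_id)}
    while table in PARENT_POS:
        row = data_dict.get(table, {}).get(rec_id)
        if row is None:
            break  # dangling pointer: stop rather than raise
        pos = PARENT_POS[table]
        table, rec_id = row[pos], row[pos + 1]
        if (table, rec_id) in seen:
            break  # guard against circular references in the data
        seen.add((table, rec_id))
        chain.append((table, rec_id))
    return chain


print(trace_parents("port", 501))
```

Each hop is a plain dictionary lookup, which is why this structure should be fast once everything fits in memory.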

JoshuaBixby · ‎03-16-2016

I may not be describing this perfectly, but the gist of your problem relates to how dictionaries "store" information. A Python dictionary doesn't hold the actual values; it holds pointers to objects that live elsewhere in memory. So a single call to sys.getsizeof() doesn't give you the total size of the data in the dictionary. It gives you the size of the dictionary structure itself, that is, the table of references to your data. Your data_dict only ever holds one entry per table, so its reported size barely moves however large the inner dictionaries get.

>>> from sys import getsizeof
>>> dic1 = {1:range(1)}
>>> dic1
{1: [0]}
>>> getsizeof(dic1)
140
>>> dic2 = {1:range(100)}
>>> dic2
{1: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ..., 98, 99]}
>>> getsizeof(dic2)
140
>>>

The Python 3 documentation for sys.getsizeof() references a recursive sizeof recipe that should work for you.
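As a rough illustration of the idea (a simplified sketch, not the recipe the documentation links to), a recursive size function could look like the following. The name total_size is made up for this example, and it only descends into dicts, lists, tuples and sets, so treat the result as an estimate.

```python
import sys


def total_size(obj, seen=None):
    """Approximate size in bytes of obj plus the objects it references."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0  # count shared objects only once
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += total_size(key, seen) + total_size(value, seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            size += total_size(item, seen)
    return size


small = {1: list(range(1))}
large = {1: list(range(100))}

# The shallow sizes match; the recursive sizes do not.
print("{} {}".format(sys.getsizeof(small), sys.getsizeof(large)))
print("{} {}".format(total_size(small), total_size(large)))
```

Calling total_size(data_dict) after each table is loaded should give a figure that actually grows, although walking millions of records this way is itself slow.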

NeilAyres · ‎03-16-2016

Joshua,

thanks.

That goes some way to explaining why sys.getsizeof(ob) doesn't really report the expanding size of my master dictionary.

But I am probably muddying the waters here.

How do I get the 64 bit python to work in this case?

I am using PyScripter as my interpreter.

JoshuaBixby · ‎03-16-2016

With ArcMap, the easiest way to access a 64-bit Python interpreter that works with ArcPy is to install 64-bit Background Geoprocessing. When you install 64-bit Background Geoprocessing, it installs a 64-bit Python interpreter/package. After the 64-bit Python interpreter is installed, you can point PyScripter to use it instead of the 32-bit one.

Just a note of caution/awareness. Since the 64-bit Python interpreter is installed last or after the 32-bit, your Python file associations will likely be pointed to the 64-bit interpreter. Although this isn't an issue most of the time, there are some limitations of the 64-bit ArcPy like no access to personal geodatabases. You can change your Python file type associations back to the 32-bit interpreter if you want.

Since ArcGIS Pro is natively 64-bit, you can also access 64-bit ArcPy by installing that application.
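Whichever route you take, it is worth confirming which interpreter your IDE is actually running. A small check using only the standard library might look like this (interpreter_bits is just an illustrative name):

```python
import platform
import struct
import sys


def interpreter_bits():
    """Return 32 or 64 depending on the pointer size of the running Python."""
    return struct.calcsize("P") * 8


print("Python {} is {}-bit".format(platform.python_version(), interpreter_bits()))
print("Executable: {}".format(sys.executable))
```

Run it from inside PyScripter, IDLE, or a command prompt; sys.executable shows which install was picked up.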

BKS · ‎10-24-2018

Last year I used PyScripter 32bit to write scripts leveraging arcpy. I had installed the 64 bit BG since some of my geoprocessing tasks required the access to greater memory space than 32bit allowed. Once I installed 64 bit BG all worked just fine.

I am looking to pick up these scripts and continue development but I can't seem to configure PyScripter 32 bit to run the 64 bit python interpreter. In the python interpreter window it only shows the 32 bit interpreter.

When I run python from a CMD window it brings up the AMD 64 bit version (in Win32). I can run my scripts from there but would prefer to run them from within PyScripter.

I've tried everything I can think of (apart from moving to PyCharm or ArcGIS Pro).

Currently using Pyscripter 3.4.2 (32 bit), ArcPy (64 bit BG), ArcGIS 10.4.1, Windows 10

Short version of my question to you: how exactly do you do this step? "After the 64-bit Python interpreter is installed, you can point PyScripter to use it instead of the 32-bit one."

See attached CMD window and PyScripter Interpreter window.

JoshuaBixby · ‎10-24-2018

I don't use PyScripter. Check out the Stack Overflow question "How to change the version of python that pyscripter uses", especially the last response.

NeilAyres · ‎03-17-2016

So, got this cracked.

I was confused: I hadn't installed the 10.3.1 64bit BG stuff.

I had an old 10.1 version instead. So, in the python cmd window, "import arcpy" worked, but not really. I tried doing this by hand, but couldn't see the tables inside the fgdb.

So, once I had the correct BG 64bit stuff installed, all is well. And it is pretty fast, and it all goes into my big dictionary with 5,221,157 records.

Although I can configure pyscripter on an external run to use the 64bit, it is a little hard to debug like that.

But I can also run the script from 64bit IDLE, and interact with my variables etc. So I will probably just carry on like that.