<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Large Dictionary Compression? in Python Questions</title>
    <link>https://community.esri.com/t5/python-questions/large-dictionary-compression/m-p/26162#M1958</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;What about using SQLite inside Python? This might manage the data better, and you can run an SQL query to do the matching instead of a dictionary.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;SQLite is built into Python and there are no 2GB size limits. Does it load everything into memory? &lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Attached is an example using SQLite to find duplicates in a large database where Python dictionaries overflowed. (Not written by me)&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Wed, 30 May 2012 07:08:03 GMT</pubDate>
    <dc:creator>KimOllivier</dc:creator>
    <dc:date>2012-05-30T07:08:03Z</dc:date>
    <item>
      <title>Large Dictionary Compression?</title>
      <link>https://community.esri.com/t5/python-questions/large-dictionary-compression/m-p/26157#M1953</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;I have a simple dictionary like this:&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;exampleDict[123444556] = (1785,2234544,3545456, 165765.47654)&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;where all the keys are integers and the values are either integers or floats.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;My issue is that I need to store/access about 20 million keys at a time, and I am running out of 32-bit memory. I'd rather do this in 32-bit Python as I need (or would like) access to arcpy for its FGDB table reading/writing abilities.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Anyone know of a way to somehow "compress" keys and/or values in a dictionary? I'm looking into the binascii module, and I see lots of methods to compress strings, but not ints or floats. Maybe you can't meaningfully compress these since they are already quite numeric?&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Anyone ever do something like this?&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 23 May 2012 17:34:39 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/large-dictionary-compression/m-p/26157#M1953</guid>
      <dc:creator>ChrisSnyder</dc:creator>
      <dc:date>2012-05-23T17:34:39Z</dc:date>
    </item>
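    <!--
Not from the thread, but a minimal sketch of one way the poster's numeric data could be packed without leaving pure Python: a sorted array.array of keys plus a parallel value list replaces most of a dict's per-entry overhead on 32-bit builds. All names here are illustrative.

```python
# Sketch (illustrative, not from the thread): pack millions of integer
# keys into a sorted array.array instead of a dict. Lookup is binary
# search via bisect; keys are assumed to fit a signed 64-bit int ('q').
from array import array
from bisect import bisect_left

keys = array('q')   # sorted integer keys, ~8 bytes each
vals = []           # parallel list of value tuples

def build(pairs):
    """Load (key, value_tuple) pairs once, in sorted key order."""
    for k, v in sorted(pairs):
        keys.append(k)
        vals.append(v)

def lookup(k):
    """dict-style lookup: return the value tuple for key k."""
    i = bisect_left(keys, k)
    if i < len(keys) and keys[i] == k:
        return vals[i]
    raise KeyError(k)

build([(123444556, (1785, 2234544, 3545456, 165765.47654)),
       (5, (1, 2, 3, 4.0))])
```

The value tuples could be packed into parallel array.array columns as well for a further saving, at the cost of a fixed schema.
    -->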
    <item>
      <title>Re: Large Dictionary Compression?</title>
      <link>https://community.esri.com/t5/python-questions/large-dictionary-compression/m-p/26158#M1954</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;If you're running out of memory, you're sort of out of luck because internally integers are already stored as space-efficiently as possible.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;You might want to consider some other key-value store, such as &lt;/SPAN&gt;&lt;A href="http://docs.python.org/library/anydbm.html"&gt;anydbm&lt;/A&gt;&lt;SPAN&gt; or even setting up a &lt;/SPAN&gt;&lt;A href="http://redis.io/"&gt;Redis server&lt;/A&gt;&lt;SPAN&gt; and &lt;/SPAN&gt;&lt;A href="https://github.com/andymccurdy/redis-py"&gt;talking to that from Python&lt;/A&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 23 May 2012 17:59:23 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/large-dictionary-compression/m-p/26158#M1954</guid>
      <dc:creator>JasonScheirer</dc:creator>
      <dc:date>2012-05-23T17:59:23Z</dc:date>
    </item>
    <item>
      <title>Re: Large Dictionary Compression?</title>
      <link>https://community.esri.com/t5/python-questions/large-dictionary-compression/m-p/26159#M1955</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;Thanks Jason - I'll look into those...&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 23 May 2012 18:02:52 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/large-dictionary-compression/m-p/26159#M1955</guid>
      <dc:creator>ChrisSnyder</dc:creator>
      <dc:date>2012-05-23T18:02:52Z</dc:date>
    </item>
    <item>
      <title>Re: Large Dictionary Compression?</title>
      <link>https://community.esri.com/t5/python-questions/large-dictionary-compression/m-p/26160#M1956</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;Jason, after looking at that stuff... Hmmm - seems a bit over my head, I think.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;But my workaround solution (not quite working 100% yet) is to just:&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;1. export the FGDB tables to .txt format (thankfully the txt versions are &amp;lt; 2GB!).&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;2. call 64-bit Python.exe as a subprocess (which actually seems to work).&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;3. have that 64-bit python.exe process read the "tables" (txt files) into dictionaries, do the analysis, and write the results out to .txt format.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;4. back in 32-bit "arcpy-compliant" Python land, read the analysis txt table back into FGDB table format, and then *** big inhale *** proceed with the rest of the script.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Here's to 64-bit :cool: and the hope that we may have a 64-bit version of ArcGIS some day!&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 24 May 2012 18:09:03 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/large-dictionary-compression/m-p/26160#M1956</guid>
      <dc:creator>ChrisSnyder</dc:creator>
      <dc:date>2012-05-24T18:09:03Z</dc:date>
    </item>
    <item>
      <title>Re: Large Dictionary Compression?</title>
      <link>https://community.esri.com/t5/python-questions/large-dictionary-compression/m-p/26161#M1957</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;Nice! Glad you got something working. 10.1 Server will be 64-bit out of the box.&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 24 May 2012 19:05:43 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/large-dictionary-compression/m-p/26161#M1957</guid>
      <dc:creator>JasonScheirer</dc:creator>
      <dc:date>2012-05-24T19:05:43Z</dc:date>
    </item>
    <item>
      <title>Re: Large Dictionary Compression?</title>
      <link>https://community.esri.com/t5/python-questions/large-dictionary-compression/m-p/26162#M1958</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;What about using SQLite inside Python? This might manage the data better, and you can run an SQL query to do the matching instead of a dictionary.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;SQLite is built into Python and there are no 2GB size limits. Does it load everything into memory? &lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Attached is an example using SQLite to find duplicates in a large database where Python dictionaries overflowed. (Not written by me)&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 30 May 2012 07:08:03 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/large-dictionary-compression/m-p/26162#M1958</guid>
      <dc:creator>KimOllivier</dc:creator>
      <dc:date>2012-05-30T07:08:03Z</dc:date>
    </item>
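    <!--
The attached example is not reproduced in this feed, so here is a minimal sketch of the suggestion itself using Python's built-in sqlite3 module: rows go into an indexed table (on disk, so nothing is limited by 32-bit process memory) and SQL does the matching. Table and column names are illustrative.

```python
# Sketch (illustrative): use the built-in sqlite3 module instead of a
# dictionary for key matching. Pass a file path instead of ":memory:"
# to keep the table on disk rather than in RAM.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE feats (key INTEGER, a INTEGER, b REAL)")
rows = [(1, 10, 1.5), (2, 20, 2.5), (1, 30, 3.5)]
con.executemany("INSERT INTO feats VALUES (?, ?, ?)", rows)
con.execute("CREATE INDEX idx_key ON feats (key)")  # fast key lookups

# Find duplicate keys, in the spirit of the attached example.
dups = con.execute(
    "SELECT key, COUNT(*) FROM feats GROUP BY key HAVING COUNT(*) > 1"
).fetchall()
```

The GROUP BY/HAVING query replaces the "seen it before?" dictionary test that overflows on large inputs.
    -->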
    <item>
      <title>Re: Large Dictionary Compression?</title>
      <link>https://community.esri.com/t5/python-questions/large-dictionary-compression/m-p/26163#M1959</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;That looks very interesting, Kim, although I don't have much hardcore SQL skill...&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;I think for my purposes I will stick with my 64-bit Python subprocess solution... I am using these large dictionaries to traverse/trace segments of a stream network, and speed is critical as there are so many features involved - eventually there will be hundreds of millions of features. I am comfortable writing my own code in Python to emulate fancy SQL-type stuff using dictionaries, and basically see dictionaries as a great and flexible format for creating my own RDBMS with whatever "custom" features I can dream up. I am amazed at the speed of these hash table-type structures - and it seems that the code you supplied uses some sort of formal SQL hash functionality (of which, sadly, I am totally ignorant!) - very cool.&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 31 May 2012 16:19:08 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/large-dictionary-compression/m-p/26163#M1959</guid>
      <dc:creator>ChrisSnyder</dc:creator>
      <dc:date>2012-05-31T16:19:08Z</dc:date>
    </item>
    <item>
      <title>Re: Large Dictionary Compression?</title>
      <link>https://community.esri.com/t5/python-questions/large-dictionary-compression/m-p/26164#M1960</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;You could also take a look at the shelve module - &lt;/SPAN&gt;&lt;A href="http://docs.python.org/library/shelve.html"&gt;http://docs.python.org/library/shelve.html&lt;/A&gt;&lt;SPAN&gt; It provides a filesystem-based dict-like class. Though as it's filesystem-based, it will probably be slower than your 64-bit Python subprocess method.&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 01 Jun 2012 01:40:45 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/large-dictionary-compression/m-p/26164#M1960</guid>
      <dc:creator>Luke_Pinner</dc:creator>
      <dc:date>2012-06-01T01:40:45Z</dc:date>
    </item>
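    <!--
A short sketch of the shelve suggestion, for reference: shelve gives a disk-backed, dict-like store, so size is bounded by disk rather than by 32-bit address space. Its keys must be strings, so the integer keys from the original question are converted; the file path here is illustrative.

```python
# Sketch (illustrative): a shelve store as a drop-in for the large dict.
# Values are pickled transparently; access is slower than an in-memory
# dict but the data lives on disk.
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "bigdict")
db = shelve.open(path)
db[str(123444556)] = (1785, 2234544, 3545456, 165765.47654)
value = db[str(123444556)]   # tuple round-trips intact via pickle
db.close()
```
    -->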
    <item>
      <title>Re: Large Dictionary Compression?</title>
      <link>https://community.esri.com/t5/python-questions/large-dictionary-compression/m-p/26165#M1961</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;Decided to finally install and test out the new 64 bit geoprocessing upgrade for 10.1 SP1. Works like a charm (except for the whole 32-bit exceptions thing, but that's okay and understandable... I never liked PGDB anyway!). Note the RAM usage in the attached screenshot (~27 GB max in use). So I can now have my huge Python dictionaries and eat arcpy too. I bet this was Jason S.' idea - thanks for implementing :).&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;[ATTACH=CONFIG]22090[/ATTACH]&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 22 Feb 2013 16:48:49 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/large-dictionary-compression/m-p/26165#M1961</guid>
      <dc:creator>ChrisSnyder</dc:creator>
      <dc:date>2013-02-22T16:48:49Z</dc:date>
    </item>
  </channel>
</rss>

