Persisting a 50 million record geopandas dataframe?

11-04-2022 08:54 AM
KeithDesertspring
New Contributor II

(Not sure if this is the best place to post my question. Please direct me to a better location if one exists!)

So I have a large XYZ point dataset stored in a GeoPandas dataframe that I'd like to persist to permanent storage and query from there. The dataset has about 50 million records and occupies about 10 GB when exported to SQLite (unfortunately it's too big to export to a single .db file without my system running out of memory). The data will also need to be spatially indexed, and I will be running queries against the entire dataset at once.

How would you accomplish something like this? Oracle SDE is available, but I'm looking for alternatives even if they sacrifice performance somewhat... mainly to avoid the bureaucracy associated with anything that involves servers at my organization. If I can store this on my C: drive, that would be perfect! I'm using ArcGIS Pro 2.9.

BTW - I'm finding that GeoPandas is easier to work with than arcpy for the tasks I'll be doing with this dataset, so hopefully whatever storage I ultimately use is reasonably compatible with pandas.

Thanks!

4 Replies
DanPatterson
MVP Esteemed Contributor

Did you see

Reading and Writing Files — GeoPandas 0.12.1+0.g195f70b.dirty documentation

See the reference to Fiona for all supported formats. There are also references to subqueries, etc.

PostGIS, Feather and Parquet are mentioned as well.
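Something along these lines for the Parquet route (a quick, untested sketch with made-up file and column names, not your actual data):

```python
import geopandas as gpd
from shapely.geometry import Point

# Tiny stand-in for the real 50M-record point dataset (hypothetical column names).
gdf = gpd.GeoDataFrame(
    {"z": [10.2, 11.5, 9.8]},
    geometry=[Point(1, 1), Point(2, 2), Point(3, 3)],
    crs="EPSG:4326",
)

# Parquet keeps the geometry column and lives as a single local file on disk.
gdf.to_parquet("points.parquet")

# Read it back later; limiting columns keeps memory down on wide tables.
points = gpd.read_parquet("points.parquet", columns=["geometry", "z"])
```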


... sort of retired...
KeithDesertspring
New Contributor II

Thanks Dan. I didn't realize you could selectively load records in those formats from disk. I'll give the documentation a thorough reading and see if those options are sufficient. 

"Since geopandas is powered by Fiona, which is powered by GDAL, you can take advantage of pre-filtering when loading in larger datasets. This can be done geospatially with a geometry or bounding box. You can also filter rows loaded with a slice. Read more at geopandas.read_file()."

by Anonymous User
Not applicable

You could probably normalize that table into smaller related tables, reducing duplicate values and saving some space, if you still want to keep it local.
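A rough sketch of what I mean, with a made-up repetitive column; pandas.factorize gives you the integer keys and the lookup values in one go:

```python
import pandas as pd

# Hypothetical wide table with a repetitive text column.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [5.0, 6.0, 7.0, 8.0],
    "surveyor": ["Smith", "Jones", "Smith", "Smith"],
})

# Replace the repeated strings with small integer keys...
codes, names = pd.factorize(df["surveyor"])
points = df.drop(columns="surveyor").assign(surveyor_id=codes)

# ...and keep each string only once in a lookup table.
surveyors = pd.DataFrame({"surveyor_id": range(len(names)), "name": names})
```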

KeithDesertspring
New Contributor II

I did consider normalization, but it doesn't help much since almost all of the columns change for every row. Good idea though!

I stumbled across Esri's own version of the GeoPandas dataframe, the "spatially enabled dataframe", which I plan to test today or tomorrow. Apparently the version of GeoPandas I'm using has a sub-optimal way of writing large volumes of data to SQLite: the data is written in small batches and passed through four different buffers. I more or less confirmed this by converting the geometry into a "dumb" column stored in a regular dataframe, which I could then write to a single SQLite file much more efficiently. So I'm hoping that Esri's solution packaged with Pro 2.9 is better optimized.
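To show what I mean by the "dumb" column route, here is a rough, untested sketch that keeps x/y/z as plain columns in SQLite and filters by bounding box; the file, table and column names are placeholders, not my actual dataset:

```python
import sqlite3
import pandas as pd

# Geometry is stored as plain x/y/z columns rather than a spatial type,
# so pandas.to_sql can write it directly to a local .sqlite file.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0],
    "y": [4.0, 5.0, 6.0],
    "z": [10.2, 11.5, 9.8],
})

con = sqlite3.connect("points.sqlite")
df.to_sql("points", con, if_exists="replace", index=False, chunksize=100_000)

# A plain composite index on x, y stands in for a spatial index for bbox queries.
con.execute("CREATE INDEX IF NOT EXISTS idx_points_xy ON points (x, y)")

# Bounding-box query straight back into pandas.
bbox = (1.5, 4.5, 3.0, 6.0)  # xmin, ymin, xmax, ymax
subset = pd.read_sql_query(
    "SELECT * FROM points WHERE x BETWEEN ? AND ? AND y BETWEEN ? AND ?",
    con, params=(bbox[0], bbox[2], bbox[1], bbox[3]),
)
con.close()
```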

Either way I plan to report back with my solution for anyone who comes across this in the future. 
