Hi there,
I am using an SEDF in ArcGIS Pro 3.1.0 on a regular basis. So far I have been constructing it anew each day from an Excel file, which is slow. Now I would like to save the data as a .parquet file, so I can quickly save the SEDF to disk and load it again whenever needed.
I tried two approaches.
1. with DASK
I used ddf = from_pandas(sdf, npartitions=1) and ddf.to_parquet(path_dir + 'test.parquet')
This throws an error:
ValueError: Failed to convert partition to expected pyarrow schema:
`ArrowInvalid("Could not convert bytearray(b'\\x01\\x01\\x00\\x00\\x00\\x00\\x00\\x00 \\xf2\\xbd!A\\x00\\x00\\x00@^\\x9cVA') with type bytearray: converting to null type", 'Conversion failed for column SHAPE with type geometry')`
Any idea how I could get past the error?
I found the geoparquet file type, but no way to use dask for that?
Apparently geopandas has some compatibility, but at the moment I can't install it as a package ...
2. with the ArcGIS API
As described here, you can apparently write parquet files containing geometry with the ArcGIS API, but I couldn't read them back with sdf = pd.DataFrame.spatial.from_parquet('test.parquet'), which tells me:
ValueError: Missing geo metadata in Parquet/Feather file.
Use pandas.read_parquet/read_feather() instead.
The doc already tells me:
if no geometry columns are read, this will raise a ValueError - you should use the pandas read_parquet method instead.
Which from my perspective doesn't make a lot of sense, since a geometry column is present ...
Edit1:
Changed the title to specify parquet.
Edit2:
Changed the title to specify feather/parquet/arrow
--------
In case this is useful, the output of sdf.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 293321 entries, 0 to 293324
Data columns (total 23 columns):
 #   Column           Non-Null Count   Dtype
---  ------           --------------   -----
 0   Date_Time_Alarm  293321 non-null  datetime64[ns]
 1   HE_Event_Num     292795 non-null  string
 2   Call_Sign        293321 non-null  string
 3   Station          293309 non-null  string
 4   E_Event_Num      293321 non-null  string
 5   Event_type       293321 non-null  string
 6   Destination      274531 non-null  string
 7   Orga_Name        293321 non-null  string
 8   Orga_Group       293321 non-null  string
 9   Event_min        293278 non-null  float32
 10  Transit_min      267680 non-null  float32
 11  Deployment_sec   291714 non-null  float32
 12  X                293321 non-null  float32
 13  Y                293321 non-null  float32
 14  Street           293219 non-null  string
 15  Housenumber      266766 non-null  string
 16  Address          275289 non-null  string
 17  DAT_TD_ND_CT     293321 non-null  string
 18  DAT_Weekday      293321 non-null  string
 19  Hour_CT_EVENT    293321 non-null  int16
 20  Month_CT_EVENT   293321 non-null  int16
 21  Year_CT_EVENT    293321 non-null  int16
 22  SHAPE            293321 non-null  geometry
dtypes: datetime64[ns](1), float32(5), geometry(1), int16(3), string(13)
memory usage: 51.1 MB
Solved! Go to Solution.
The fastest solution I could find so far is using feather. The format is not as highly compressed as parquet, but using
import pyarrow.feather as feather
feather.write_feather(sdf, path_dir +'test.feather')
sdf_f = pd.DataFrame.spatial.from_feather(path_dir +'test.feather', spatial_column='SHAPE', columns=None, use_threads=True)
I could write and read my sdf of shape (293321, 23) in about 45 seconds, as it allows reading to be parallelized across multiple threads.
To use feather:
feather is built on pyarrow as well, so your data needs to be structured accordingly.
[note edit below]
The biggest issue I had was that a pandas DataFrame accepts "object" columns, meaning a single column can hold mixed values. None of these formats (parquet, feather, arrow) will accept that, so you need to clean up and eliminate the "object" dtype. If you deal with NULL values, use pd.NA for string columns and np.nan for numerical ones.
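The cleanup described above can be sketched like this (toy data; the column names are illustrative, not from the real dataset):

```python
import numpy as np
import pandas as pd

# Mixed "object" columns are what trip up the arrow-based writers.
df = pd.DataFrame({
    "name": ["a", None, "c"],        # object dtype with a None mixed in
    "value": ["1.5", None, "3.0"],   # numbers stored as text
})
df["name"] = df["name"].astype("string")                    # NULLs become pd.NA
df["value"] = pd.to_numeric(df["value"]).astype("float32")  # NULLs become np.nan
```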
import pandas as pd
from arcgis.features import GeoAccessor, GeoSeriesAccessor
import pyarrow.parquet as pq
arrow_table = sdf.spatial.to_arrow()
pq.write_table(arrow_table, path_dir + 'test.parquet')
retrieved_arrow_table = pq.read_table(path_dir + 'test.parquet')
sdf_pq = pd.DataFrame.spatial.from_parquet(path_dir + 'test.parquet')
Edit:
Apparently one can use the object data type; that by itself appears to be no problem. I guess the data it holds may pose issues ... just a guess really.
Furthermore: from_feather only returns a pandas DataFrame. To reconstruct an SEDF one needs to use
sdf = pd.DataFrame.spatial.from_df(df, sr=25832, geometry_column='SHAPE')
some options
Part-3 Data IO with SeDF - Exporting Data | ArcGIS API for Python
Cheers. These are options, but unfortunately they are very slow in comparison with a working DASK + parquet combination.
Got it.
The entire conversion is based on pyarrow as far as I understand.
https://www.esri.com/arcgis-blog/products/arcgis-pro/developers/leverage-apache-arrow-in-arcgis-pro/
The above article was a great help to understand.
So the code would be something like this:
import pandas as pd
from arcgis.features import GeoAccessor, GeoSeriesAccessor
import pyarrow.parquet as pq
arrow_table = sdf.spatial.to_arrow()
pq.write_table(arrow_table, path_dir + 'test.parquet')
retrieved_arrow_table = pq.read_table(path_dir + 'test.parquet')
sdf_pq = pd.DataFrame.spatial.from_parquet(path_dir + 'test.parquet')
I could not yet figure out how to use DASK in this combination. To me it seems that DASK has no geometry datatype (yet).
Edit:
Constructing the sdf from parquet & arrow is quite slow; it takes my machine about 5 min 30 s.
dask-geopandas documentation — dask-geopandas
dask_geopandas.GeoDataFrame — dask-geopandas
geopandas seems to be the link for geometry needs