Hi there,
I am using an SEDF in ArcGIS Pro 3.1.0 on a regular basis. So far I have been constructing it anew each day from an Excel file, which is slow. Now I would like to save the data as a .parquet file, so I can quickly save the SEDF to disk and load it again whenever needed.
I tried two approaches.
1. with DASK
I used ddf = from_pandas(sdf, npartitions=1) and ddf.to_parquet(path_dir + 'test.parquet')
This throws an error:
ValueError: Failed to convert partition to expected pyarrow schema:
`ArrowInvalid("Could not convert bytearray(b'\\x01\\x01\\x00\\x00\\x00\\x00\\x00\\x00 \\xf2\\xbd!A\\x00\\x00\\x00@^\\x9cVA') with type bytearray: converting to null type", 'Conversion failed for column SHAPE with type geometry')`
Any idea how I could get past the error?
I found the geoparquet file type, but no way to use dask for that?
Apparently geopandas has some compatibility, but at the moment I can't install it as a package ...
2. with the ArcGIS API
As described here, you can apparently write parquet files containing geometry with the ArcGIS API, but I couldn't read them back with sdf = pd.DataFrame.spatial.from_parquet('test.parquet'), which tells me:
ValueError: Missing geo metadata in Parquet/Feather file.
Use pandas.read_parquet/read_feather() instead.
The doc already tells me:
if no geometry columns are read, this will raise a ValueError - you should use the pandas read_parquet method instead.
Which from my perspective doesn't make a lot of sense, since a geometry column is present ...
Edit1:
Changed the title to specify parquet.
Edit2:
Changed the title to specify feather/parquet/arrow
--------
In case this is useful, the output of sdf.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 293321 entries, 0 to 293324
Data columns (total 23 columns):
 #   Column           Non-Null Count   Dtype
---  ------           --------------   -----
 0   Date_Time_Alarm  293321 non-null  datetime64[ns]
 1   HE_Event_Num     292795 non-null  string
 2   Call_Sign        293321 non-null  string
 3   Station          293309 non-null  string
 4   E_Event_Num      293321 non-null  string
 5   Event_type       293321 non-null  string
 6   Destination      274531 non-null  string
 7   Orga_Name        293321 non-null  string
 8   Orga_Group       293321 non-null  string
 9   Event_min        293278 non-null  float32
 10  Transit_min      267680 non-null  float32
 11  Deployment_sec   291714 non-null  float32
 12  X                293321 non-null  float32
 13  Y                293321 non-null  float32
 14  Street           293219 non-null  string
 15  Housenumber      266766 non-null  string
 16  Address          275289 non-null  string
 17  DAT_TD_ND_CT     293321 non-null  string
 18  DAT_Weekday      293321 non-null  string
 19  Hour_CT_EVENT    293321 non-null  int16
 20  Month_CT_EVENT   293321 non-null  int16
 21  Year_CT_EVENT    293321 non-null  int16
 22  SHAPE            293321 non-null  geometry
dtypes: datetime64[ns](1), float32(5), geometry(1), int16(3), string(13)
memory usage: 51.1 MB
Solved! Go to Solution.
The fastest solution I could find so far is using feather. The format is not as highly compressed as parquet, but using
import pyarrow.feather as feather
feather.write_feather(sdf, path_dir +'test.feather')
sdf_f = pd.DataFrame.spatial.from_feather(path_dir +'test.feather', spatial_column='SHAPE', columns=None, use_threads=True)
I could write and read my sdf of shape (293321, 23) in about 45 seconds, as it allows reading to be parallelized across multiple threads.
To use feather:
feather is built on pyarrow as well, so your data needs to be structured accordingly.
[note edit below]
The biggest issue I had was that a pandas DataFrame accepts "object" columns, meaning a single column can hold mixed values. None of these formats (parquet, feather, arrow) will accept that, so you need to clean up and eliminate the "object" dtype. If you deal with NULL values, use pd.NA for string columns and np.nan for numerical ones.
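The cleanup described above can be sketched like this (toy data; the column names are illustrative, not from the real dataset):

```python
import numpy as np
import pandas as pd

# Mixed "object" columns are what trip up the arrow-based writers.
df = pd.DataFrame({
    "name": ["a", None, "c"],        # object dtype with a None mixed in
    "value": ["1.5", None, "3.0"],   # numbers stored as text
})
df["name"] = df["name"].astype("string")                    # NULLs become pd.NA
df["value"] = pd.to_numeric(df["value"]).astype("float32")  # NULLs become np.nan
```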
import pandas as pd
from arcgis.features import GeoAccessor, GeoSeriesAccessor
import pyarrow.parquet as pq
arrow_table = sdf.spatial.to_arrow()
pq.write_table(arrow_table, path_dir + 'test.parquet')
retrieved_arrow_table = pq.read_table(path_dir + 'test.parquet')
sdf_pq = pd.DataFrame.spatial.from_parquet(path_dir + 'test.parquet')
Edit:
Apparently one can use the object data type; that by itself appears to be no problem. I guess the data it holds may pose issues ... just a guess really.
Furthermore: from_feather only returns a pandas DataFrame. To reconstruct an SEDF one needs to use
sdf = pd.DataFrame.spatial.from_df(df, sr=25832, geometry_column='SHAPE')
some options
Part-3 Data IO with SeDF - Exporting Data | ArcGIS API for Python
Cheers. These are options, but unfortunately they are very slow in comparison with a working DASK + parquet combination.
Got it.
The entire conversion is based on pyarrow as far as I understand.
https://www.esri.com/arcgis-blog/products/arcgis-pro/developers/leverage-apache-arrow-in-arcgis-pro/
The above article was a great help to understand.
So the code would be something like this:
import pandas as pd
from arcgis.features import GeoAccessor, GeoSeriesAccessor
import pyarrow.parquet as pq
arrow_table = sdf.spatial.to_arrow()
pq.write_table(arrow_table, path_dir + 'test.parquet')
retrieved_arrow_table = pq.read_table(path_dir + 'test.parquet')
sdf_pq = pd.DataFrame.spatial.from_parquet(path_dir + 'test.parquet')
I could not yet figure out how to use DASK in this combination. To me it seems that DASK has no geometry datatype (yet).
Edit:
Constructing the sdf from parquet & arrow is quite slow; it takes my machine about 5 min 30 s.
dask-geopandas documentation — dask-geopandas
dask_geopandas.GeoDataFrame — dask-geopandas
geopandas seems to be the link for geometry needs