Pretty sure this is going to be an easy one, but I'm looking for the most efficient way to take a multipoint layer from a public feature service, where there is only ever one point in each feature, and convert it to point geometry when loading it into a DataFrame.
from geoanalytics.sql import functions as ST

url = "https://services6.arcgis.com/GB33F62SbDxJjwEL/arcgis/rest/services/Vicmap_Features_of_Interest/FeatureServer/1"
query = "feature_subtype = 'aged care'"
# Load the feature service, reproject to WGS 84 (EPSG:4326), and keep only aged care features.
aged_care_df = spark.read.format("feature-service").load(url) \
    .withColumn("shape", ST.transform("shape", 4326)) \
    .filter(query)
This returns a points geometry object:
I have tried using ST_GeometryN to then convert this to a point geometry type.
attempt1 = aged_care_df.select(ST.geometry_n("shape", 1))
But this returns a series of nulls:
I would like to avoid doing any string manipulation in Python, and I'm looking for the best way to convert these into point geometry types.
The longer-term goal is to end up with a DataFrame containing an array of points for the setStops parameter in the Create Routes tool.
I'm not sure if this is the most efficient, but it works...
The first thing I noticed was that the feature service you pointed to actually does have some true multipoint features. I checked like this:
# count number of points
aged_care_df\
    .select(ST.num_geometries("shape").alias("point_count"))\
    .sort("point_count", ascending=False)\
    .show()
+-----------+
|point_count|
+-----------+
| 2|
| 2|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
| 1|
+-----------+
only showing top 20 rows
You can see there are two records that have 2 points each. Interestingly, when I filter and print out just those two records, both points in each record have the same coordinates. They are still true multipoint geometries, though, so the conversion has to deal with them:
+-----------+--------------------------------------------------------------------------------------------+--------+
|point_count|shape |OBJECTID|
+-----------+--------------------------------------------------------------------------------------------+--------+
|2 |{"points":[[145.01641305966146,-36.58713183554724],[145.01641305966146,-36.58713183554724]]}|46169 |
|2 |{"points":[[147.59699327673584,-37.09840182302235],[147.59699327673584,-37.09840182302235]]}|46566 |
+-----------+--------------------------------------------------------------------------------------------+--------+
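To double-check that the two points in each of those records really are coincident, the geometry JSON printed above can be inspected outside Spark. This is only a plain-Python sketch on the two strings copied from the table; `has_coincident_points` is a hypothetical helper name, not part of GeoAnalytics Engine.

```python
import json

# Geometry JSON copied from the two multipoint records above, keyed by OBJECTID.
shapes = {
    46169: '{"points":[[145.01641305966146,-36.58713183554724],[145.01641305966146,-36.58713183554724]]}',
    46566: '{"points":[[147.59699327673584,-37.09840182302235],[147.59699327673584,-37.09840182302235]]}',
}

def has_coincident_points(shape_json: str) -> bool:
    """Return True if the multipoint has at least one point and all points share the same coordinates."""
    points = json.loads(shape_json)["points"]
    return len({tuple(p) for p in points}) == 1

coincident = {oid: has_coincident_points(s) for oid, s in shapes.items()}
print(coincident)
```

Both records report `True` here, which matches what the printed coordinates suggest.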
Since the multipoint geometries need to be exploded to convert to points (you can't just cast a multipoint to a point), I used ST_Points to create an array of points for each multipoint, then exploded those, and used ST_Point to turn the coordinates back into point geometries.
All in all, it looked like this - note that I drop the point_array that I created as an intermediate step, since I don't need that once I create the point geometry.
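from pyspark.sql import functions as F

# One row per point: explode the ST_Points array, then rebuild a point geometry from x and y.
aged_care_df2 = aged_care_df.select("*",
        F.explode(ST.points("shape")).alias("point_array"),
        ST.point(ST.x("point_array"), ST.y("point_array")).alias("geom_point"))\
    .drop("point_array")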
That gives me individual records with the original shape (multipoint) and the new geom_point (point). The two features that really are multipoint each produce two records with the same OBJECTID and other attributes.
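Because both points in those two features are identical, the exploded result ends up with exact duplicate rows for OBJECTID 46169 and 46566. The sketch below models the explode step in plain Python (no Spark or GeoAnalytics Engine involved) to show where the duplicates come from and one possible way to collapse them. `explode_multipoint` and `unique_points` are hypothetical names, and the records are trimmed to two fields for illustration.

```python
import json

# Two records modelled on the output above: 1519 has a single point,
# 46169 has two identical points. Assumes 2D coordinates (x, y only).
records = [
    {"OBJECTID": 1519, "shape": '{"points":[[149.130545553053,-36.220029363958275]]}'},
    {"OBJECTID": 46169, "shape": '{"points":[[145.01641305966146,-36.58713183554724],[145.01641305966146,-36.58713183554724]]}'},
]

def explode_multipoint(record):
    """Yield one output row per point, mimicking the explode-then-rebuild step."""
    for x, y in json.loads(record["shape"])["points"]:
        yield {"OBJECTID": record["OBJECTID"], "geom_point": {"x": x, "y": y}}

def unique_points(rows):
    """Keep the first row for each (OBJECTID, x, y) combination."""
    seen = set()
    result = []
    for row in rows:
        key = (row["OBJECTID"], row["geom_point"]["x"], row["geom_point"]["y"])
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result

exploded = [row for rec in records for row in explode_multipoint(rec)]
deduplicated = unique_points(exploded)

print(len(exploded))      # one row per point, duplicates included
print(len(deduplicated))  # repeated points collapsed per OBJECTID
```

In Spark, `DataFrame.dropDuplicates` is the usual tool for removing repeated rows; which columns to pass to it, and whether duplicates matter for the routing step at all, are assumptions to check against your data.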
Example of one record:
-RECORD 0----------------------------------------------------------------------
OBJECTID | 1519
ufi | 67865934
pfi | 1271613
feature_id | 1271613
parent_feature_id | null
feature_type | care facility
feature_subtype | aged care
feature_status | null
name | YALLAMBEE LODGE COOMA
name_label | Yallambee Lodge Cooma
parent_name | null
child_exists | null
auth_org_code | 110
auth_org_id | null
auth_org_verified | 2023-05-31 17:00:00
vmadd_pfi | null
vicnames_id | -1959170
vicnames_status_code | 11
theme1 | null
theme2 | null
state | NSW
create_date_pfi | 2021-06-15 04:20:27
superceded_pfi | null
feature_ufi | 67865934
feature_create_date_ufi | 2023-06-27 00:37:13
create_date_ufi | 2023-06-27 00:37:13
shape | {"points":[[149.130545553053,-36.220029363958275]]}
geom_point | {"x":149.130545553053,"y":-36.220029363958275}
only showing top 1 row
I hope this helps with some ideas on how to move forward with your project.
-Sarah.
Thank you loads Sarah, this works great