
What is the optimized way to execute spatial queries in Hive, using GIS Tools for Hadoop?

06-04-2023 04:38 AM
vigneshwaran
New Contributor

Dear Esri Team,

I am trying to implement spatial queries in Hive using GIS Tools for Hadoop, following the steps mentioned in the link below.
 
https://github.com/Esri/gis-tools-for-hadoop/tree/master/samples/point-in-polygon-aggregation-hive 
 
The queries are implemented as follows:
 
1. Circle range query
SELECT COUNT(*) FROM table_name WHERE
ST_Intersects(ST_Buffer(ST_Point(x1,y1), radius), ST_Point(x, y));

2. BBOX range query
SELECT COUNT(*) FROM table_name WHERE
ST_Contains(ST_GeomFromText('POLYGON ((x1 y1, x2 y2, .. , x1 y1))'), ST_Point(x, y));

3. k-NN query (k=10)
SELECT ST_Distance(ST_Point(x1, y1), ST_Point(x, y)) as distance, x, y FROM
table_name ORDER BY distance ASC LIMIT 10;
 
Kindly let me know the optimized way to implement the above queries (especially k-NN).
 
Thanks & Regards
Vigneswaran
1 Reply
VenkataKondepati
Occasional Contributor

Hi @vigneshwaran,
Hive can do these, but you’ll get much better performance with a two-stage filter (cheap spatial prefilter → exact test) and a coarse spatial index (grid/geohash/H3) to avoid full scans, especially for k-NN.

Below are practical patterns that work well with GIS Tools for Hadoop UDFs.

0) Table prep (one-time)

  • Store the table as ORC/Parquet, enable Tez and vectorization, and collect stats (a sample of the relevant session settings follows the example DDL below).

  • Add a point geometry column once, rather than building it per query.

  • Add a coarse spatial key (e.g., geohash/H3/quadkey) for pruning.

-- Example schema
CREATE TABLE pts (
  id BIGINT,
  x  DOUBLE,         -- lon
  y  DOUBLE,         -- lat
  geom STRING,       -- WKT point
  gh  STRING         -- geohash / tile id
)
STORED AS ORC;

-- Populate (example)
INSERT OVERWRITE TABLE pts
SELECT id, x, y,
       ST_AsText(ST_Point(x, y))                   AS geom,
       geohash(x, y, 7)                            AS gh       -- use your geohash/H3 UDF
FROM raw_pts;

ANALYZE TABLE pts COMPUTE STATISTICS;
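
For the engine settings in the prep list, a minimal sketch of session-level properties (stock Hive property names; availability depends on your Hive/Tez versions):

-- Session settings (adjust per cluster)
SET hive.execution.engine=tez;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.cbo.enable=true;

-- Column stats help the optimizer beyond the table-level ANALYZE above
ANALYZE TABLE pts COMPUTE STATISTICS FOR COLUMNS;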

If you don’t have a geohash/H3 UDF handy, precompute a simple grid key (e.g., floor(x/Δ), floor(y/Δ)).
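
For example, a hypothetical grid key at Δ = 0.01 degrees (tune Δ to your point density) could be built like this:

-- Hypothetical grid key: concatenated cell indices at Δ = 0.01 degrees
INSERT OVERWRITE TABLE pts
SELECT id, x, y,
       ST_AsText(ST_Point(x, y))                      AS geom,
       CONCAT(CAST(FLOOR(x / 0.01) AS STRING), '_',
              CAST(FLOOR(y / 0.01) AS STRING))        AS gh
FROM raw_pts;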

1) Circle range query (fast pattern)

Avoid buffering every row. Compute the circle once, prefilter by envelope, then run exact check.

-- Params
WITH q AS (
  SELECT ST_Point(%x1, %y1)                 AS qpt,
         %radius                            AS r
)
SELECT COUNT(*)
FROM pts
CROSS JOIN q
WHERE
  -- cheap prefilter: envelope test
  ST_EnvIntersects(
     ST_Envelope(ST_Buffer(q.qpt, q.r)),    -- computed once
     ST_Envelope(ST_Point(x, y))
  )
  -- exact test
  AND ST_Distance(ST_Point(x, y), q.qpt) <= q.r;

 

Why is this faster? The envelope test prunes most rows before the expensive exact distance check.

2) BBOX range query (fast pattern)

Use ST_EnvIntersects first, then (optionally) ST_Contains as exact test.

WITH bbox AS (
  SELECT ST_GeomFromText('POLYGON((x1 y1, x2 y2, x3 y3, x4 y4, x1 y1))') AS poly
)
SELECT COUNT(*)
FROM pts
CROSS JOIN bbox
WHERE ST_EnvIntersects(bbox.poly, ST_Point(x, y))
  AND ST_Contains(bbox.poly, ST_Point(x, y));   -- optional exact check

3) k-NN (k=10): scalable approach

Global ORDER BY distance LIMIT 10 forces a single reducer (slow). Instead, do candidate pruning with your spatial key, then a small sort.

3a) Precompute neighbors to search

For geohash/H3, get the cell containing (x1,y1) and its immediate neighbors (ring). If < k results, expand by another ring.

-- Example: find target cell + 8 neighbors
WITH q AS (
  SELECT %x1 AS qx, %y1 AS qy, geohash(%x1, %y1, 7) AS qgh
),
nbrs AS (
  SELECT q.qx, q.qy, n.gh AS ngh
  FROM q
  LATERAL VIEW explode(geohash_neighbors(q.qgh)) n AS gh   -- UDF returns center+neighbors
)
SELECT  distance, id, x, y
FROM (
  SELECT
    ST_Distance(ST_Point(x, y), ST_Point(qx, qy)) AS distance,
    id, x, y
  FROM pts
  JOIN nbrs ON pts.gh = nbrs.ngh
) cand
ORDER BY distance ASC
LIMIT 10;

If the 9 cells don’t yield 10 points, expand to the next ring (second-order neighbors) and rerun. In production, either drive this iteratively from a script, or over-fetch up front (e.g., two rings), which is usually still fast; one pragmatic over-fetch is sketched below.
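
A minimal over-fetch sketch, assuming gh is a plain geohash string: match candidates on a shorter prefix (the parent cell). This is simple but can still miss points just across the parent-cell boundary, so prefer the neighbor-ring join when exactness matters.

-- Over-fetch candidates via a coarser geohash prefix (6 chars instead of 7)
WITH q AS (
  SELECT %x1 AS qx, %y1 AS qy, geohash(%x1, %y1, 7) AS qgh
)
SELECT ST_Distance(ST_Point(x, y), ST_Point(qx, qy)) AS distance, id, x, y
FROM pts CROSS JOIN q
WHERE pts.gh LIKE CONCAT(SUBSTR(q.qgh, 1, 6), '%')
ORDER BY distance ASC
LIMIT 10;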

3b) No geohash? Use a bounding box window as a candidate filter

Start with a small window around (x1,y1), enlarge if needed.

WITH q AS (
  SELECT %x1 AS qx, %y1 AS qy, %win AS w
)
SELECT distance, id, x, y
FROM (
  SELECT ST_Distance(ST_Point(x, y), ST_Point(qx, qy)) AS distance, id, x, y
  FROM pts CROSS JOIN q
  WHERE x BETWEEN qx - w AND qx + w
    AND y BETWEEN qy - w AND qy + w
) cand
ORDER BY distance
LIMIT 10;



Pick w based on expected point density; if you get fewer than 10 results, double w and retry. A rough density estimate (below) can seed w.
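
As a crude heuristic, assuming points are spread fairly evenly, compute the average density over the data extent (points per square degree); a starting half-width for k = 10 is then roughly SQRT(10 / density).

-- Rough average density over the bounding extent (points per square degree)
SELECT COUNT(*) / ((MAX(x) - MIN(x)) * (MAX(y) - MIN(y))) AS density
FROM pts;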

Extra performance tips

  • Cache constants: put ST_Buffer(...) and ST_Point(x1, y1) in a CTE so they are computed once.

  • Partition/bucket on your spatial key (gh) to get partition pruning (a bucketed-table sketch follows this list).

  • Use SORT BY distance LIMIT 10 only after you have reduced candidates; otherwise the final LIMIT stage still funnels through a single reducer.

  • Consider converting hot datasets to Parquet sorted by a Z-order key on (lon, lat) or by the spatial key to improve locality.

  • If polygons are small and numerous, pre-tile them (one row per tile) and join on tile id before the exact ST_Intersects.
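
For the partition/bucket tip, a bucketed-table sketch (the bucket count is a guess; size it to your data volume, and older Hive versions need hive.enforce.bucketing=true before the insert):

-- Hypothetical bucketed copy of pts, clustered on the spatial key
CREATE TABLE pts_bkt (
  id BIGINT,
  x  DOUBLE,
  y  DOUBLE,
  geom STRING,
  gh  STRING
)
CLUSTERED BY (gh) SORTED BY (gh) INTO 64 BUCKETS
STORED AS ORC;

INSERT OVERWRITE TABLE pts_bkt
SELECT id, x, y, geom, gh FROM pts;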

Regards,
Venkat
