Partitioning Large Geometries to Improve Query Performance

by VinceAngelo, Esri Esteemed Contributor (09-15-2014)

A while back I needed to create a very efficient point-in-polygon tool, which needed to process millions of points daily, with up to ten thousand features in any single minute.  I quickly determined that simple queries against a database table (at ~20 milliseconds/query) weren't going to meet the requirements, so I started looking at the SE_STABLE interface in the ArcSDE 'C' API to use a RAM cache to query the data. First I had to build some tools to convert a table with one or more attributes into a "shape table" and an attribute cache.  And when I was done, I had a significantly faster tool -- with a mean query time of 0.5 milliseconds.
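
The SE_STABLE shape-table code itself isn't reproduced here, but the general idea of a RAM-resident point-in-polygon cache is simple to sketch. The snippet below is only an illustration using the Shapely library (not the actual ArcSDE shape-table implementation): load the polygons once, keep prepared geometries and their bounding boxes in memory, and answer each point query with an envelope pre-check followed by a full containment test.

```python
# Minimal sketch of a RAM-resident point-in-polygon cache.
# Illustrative only: this uses Shapely rather than the SE_STABLE
# shape-table interface described in the text.
from shapely.geometry import Point, shape
from shapely.prepared import prep

class PolygonCache:
    def __init__(self, features):
        # features: iterable of (attribute_value, GeoJSON-like geometry dict)
        self._entries = []
        for attr, geom in features:
            g = shape(geom)
            # store (bounding box, prepared geometry, attribute)
            self._entries.append((g.bounds, prep(g), attr))

    def query(self, x, y):
        """Return the attribute of the first polygon containing (x, y), or None."""
        pt = Point(x, y)
        for (minx, miny, maxx, maxy), pg, attr in self._entries:
            # cheap envelope rejection before the full containment test
            if minx <= x <= maxx and miny <= y <= maxy and pg.contains(pt):
                return attr
        return None
```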

But I wondered what would happen if I was able to reduce the time of the worst-performing queries (on Canada, Russia, the United States,... -- all the "large" countries with lots of vertices).  So I added some code to split any feature larger than some threshold into a number of tiles which, taken together, covered the same area but with far fewer vertices per feature.  I had to change my processing code to handle the higher incidence of "on the border" results, but the performance gain was significant.
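
The splitting itself lives inside that C code, but the technique is easy to sketch. The version below uses Shapely again, purely as an illustration; the vertex threshold and tile size are placeholders, not the values benchmarked below.

```python
# Rough sketch of grid partitioning: any polygon with more vertices than a
# threshold is clipped against a regular lat/lon grid, producing several
# smaller pieces that cover the same area with far fewer vertices each.
# The threshold and tile size here are illustrative placeholders.
import math
from shapely.geometry import box

def vertex_count(geom):
    # total exterior + interior-ring vertices of a (Multi)Polygon
    polys = geom.geoms if geom.geom_type == "MultiPolygon" else [geom]
    return sum(len(p.exterior.coords) +
               sum(len(r.coords) for r in p.interiors) for p in polys)

def partition(geom, tile_deg=30.0, max_vertices=200):
    """Yield the original geometry, or its grid-clipped pieces if it is 'large'."""
    if vertex_count(geom) <= max_vertices:
        yield geom
        return
    minx, miny, maxx, maxy = geom.bounds
    x0 = math.floor(minx / tile_deg) * tile_deg
    y0 = math.floor(miny / tile_deg) * tile_deg
    x = x0
    while x < maxx:
        y = y0
        while y < maxy:
            # in production code you would keep only the polygonal parts here
            piece = geom.intersection(box(x, y, x + tile_deg, y + tile_deg))
            if not piece.is_empty:
                yield piece
            y += tile_deg
        x += tile_deg
```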

Then I mistakenly queried an RDBMS table with one of these partitioned country layers, and discovered that I was able to achieve a significant performance gain in the database, without needing to use shape tables.  Of course, the RAM cache was still faster, but the database queries on partitioned tables were approaching the speed the cache had achieved without partitioning.

It's been a long time since that discovery, but I recently had cause to revisit that code, and ran a benchmark on my Linux PC using two country polygon layers, one at 1:50m scale and one at 1:15m scale. It seems that modern hardware is faster (big surprise, right?), but the benefits of partitioning during point-in-polygon queries remain:
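
(The numbers below were collected with the se_toolkit ascinfo tool described at the end of this post. Conceptually, though, the timing loop amounts to something like the following hypothetical harness, reusing the PolygonCache sketch above.)

```python
# Hypothetical timing loop in the spirit of the benchmark described above:
# generate N random lon/lat points, run them through the cache, and report
# total and mean per-query time.  The published numbers came from se_toolkit's
# ascinfo IDENTITY class, not from this script.
import random, time

def benchmark(cache, n=10_000, seed=42):
    rng = random.Random(seed)
    points = [(rng.uniform(-180.0, 180.0), rng.uniform(-90.0, 90.0))
              for _ in range(n)]
    start = time.perf_counter()
    hits = sum(1 for x, y in points if cache.query(x, y) is not None)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    print(f"{n} queries, {hits} hits, "
          f"total {elapsed_ms:.2f} ms, mean {elapsed_ms / n:.3f} ms/query")
```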

For the 1:50m Data & Maps country file (249 features, 0.46 MB storage), with 10,000 random queries:

| Tiling Scheme | Partitions | Storage (MB) | Init Time (ms) | Query Time (ms) | Total Time (ms) | Mean Query (ms/feat) |
|---|---|---|---|---|---|---|
| Database | 249 | 0.46 | 0.00 | 31269.66 | 31269.66 | 3.13 |
| None | 249 | 0.46 | 10.41 | 2107.54 | 2117.95 | 0.21 |
| 90 Degree | 286 | 0.48 | 50.64 | 747.91 | 798.55 | 0.08 |
| 60 Degree | 309 | 0.49 | 64.30 | 537.58 | 601.88 | 0.06 |
| 45 Degree | 335 | 0.50 | 72.47 | 519.99 | 592.46 | 0.06 |
| 40 Degree | 361 | 0.51 | 88.90 | 490.82 | 579.72 | 0.06 |
| 35 Degree | 397 | 0.52 | 143.21 | 414.53 | 557.74 | 0.06 |
| 30 Degree | 393 | 0.52 | 134.11 | 410.26 | 544.37 | 0.05 |
| 25 Degree | 463 | 0.54 | 160.34 | 390.83 | 551.17 | 0.06 |
| 20 Degree | 504 | 0.55 | 200.16 | 378.13 | 578.29 | 0.06 |
| 15 Degree | 623 | 0.60 | 288.28 | 362.54 | 650.82 | 0.07 |
| 12 Degree | 754 | 0.63 | 388.84 | 365.14 | 753.98 | 0.08 |
| 10 Degree | 899 | 0.69 | 568.12 | 377.09 | 945.21 | 0.10 |
| 5 Degree | 2026 | 1.01 | 1866.12 | 501.00 | 2367.12 | 0.24 |
| 4 Degree | 2800 | 1.24 | 2774.39 | 591.64 | 3366.03 | 0.34 |
| 3 Degree | 4268 | 1.67 | 4800.73 | 767.10 | 5567.83 | 0.56 |
| 2 Degree | 8196 | 2.82 | 10288.24 | 1243.70 | 11531.94 | 1.15 |
| 1 Degree | 27351 | 8.30 | 39975.24 | 3568.66 | 43543.90 | 4.35 |

For the 1:15m Data & Maps country file (249 features, 3.72 MB storage), with the same 10,000 random queries:

| Tiling Scheme | Partitions | Storage (MB) | Init Time (ms) | Query Time (ms) | Total Time (ms) | Mean Query (ms/feat) |
|---|---|---|---|---|---|---|
| Database | 249 | 3.72 | 0.00 | 94425.96 | 94425.96 | 9.44 |
| None | 249 | 3.72 | 44.28 | 9708.37 | 9752.65 | 0.98 |
| 90 Degree | 290 | 3.74 | 207.26 | 3031.72 | 3238.98 | 0.32 |
| 60 Degree | 314 | 3.75 | 309.86 | 1951.09 | 2260.95 | 0.23 |
| 45 Degree | 340 | 3.75 | 346.04 | 1764.42 | 2110.46 | 0.21 |
| 40 Degree | 369 | 3.76 | 410.53 | 1541.54 | 1952.07 | 0.20 |
| 35 Degree | 403 | 3.78 | 510.59 | 1075.17 | 1585.76 | 0.16 |
| 30 Degree | 403 | 3.77 | 485.25 | 1073.19 | 1558.44 | 0.16 |
| 25 Degree | 474 | 3.80 | 716.04 | 883.49 | 1599.53 | 0.16 |
| 20 Degree | 519 | 3.82 | 885.24 | 784.80 | 1670.04 | 0.17 |
| 15 Degree | 640 | 3.86 | 1258.04 | 607.32 | 1865.36 | 0.19 |
| 12 Degree | 771 | 3.90 | 1715.10 | 559.44 | 2274.54 | 0.23 |
| 10 Degree | 925 | 3.95 | 2475.66 | 530.02 | 3005.68 | 0.30 |
| 5 Degree | 2069 | 4.31 | 8018.14 | 586.19 | 8604.33 | 0.86 |
| 4 Degree | 2852 | 4.55 | 11924.08 | 671.96 | 12596.04 | 1.26 |
| 3 Degree | 4341 | 4.99 | 20588.77 | 848.96 | 21437.73 | 2.14 |
| 2 Degree | 8321 | 6.16 | 44082.18 | 1340.35 | 45422.53 | 4.54 |
| 1 Degree | 27684 | 11.75 | 170964.92 | 3889.86 | 174854.78 | 17.49 |

Back in 2006, when I first did this, the break-even point was at a tile size of 5 degrees, but now it seems that both the 1:50m and 1:15m layers are optimized within the cache using 30-degree tiling, and both tables also show significantly diminishing returns below a grid size of 10 degrees (where before it was closer to 1 degree).

Doctoring the code a bit to export the partitioned shapes, I was also able to re-run my database queries (on PostgreSQL 9.1.9, with the same 10,000 random queries):

| Tiling Scheme | 1:50m Feature Count | 1:50m Mean Vertex Count | 1:50m Mean Query (ms/feat) | 1:15m Feature Count | 1:15m Mean Vertex Count | 1:15m Mean Query (ms/feat) |
|---|---|---|---|---|---|---|
| None | 249 | 96.13 | 3.20 | 249 | 951.82 | 9.08 |
| 90 Degree | 286 | 84.77 | 2.71 | 290 | 818.34 | 4.51 |
| 60 Degree | 309 | 79.41 | 2.63 | 314 | 756.98 | 3.55 |
| 45 Degree | 335 | 73.66 | 2.63 | 340 | 699.31 | 3.31 |
| 30 Degree | 393 | 64.17 | 2.59 | 403 | 591.55 | 2.89 |
| 20 Degree | 504 | 51.79 | 2.58 | 519 | 461.38 | 2.70 |
| 15 Degree | 623 | 43.35 | 2.56 | 640 | 375.62 | 2.60 |
| 10 Degree | 899 | 32.23 | 2.54 | 925 | 262.23 | 2.53 |
| 5 Degree | 2026 | 17.74 | 2.54 | 2069 | 120.89 | 2.49 |
| 4 Degree | 2800 | 14.47 | 2.54 | 2852 | 89.42 | 2.49 |
| 3 Degree | 4268 | 11.44 | 2.55 | 4341 | 60.73 | 2.49 |
| 2 Degree | 8196 | 8.58 | 2.53 | 8321 | 34.41 | 2.48 |
| 1 Degree | 27351 | 6.22 | 2.55 | 27684 | 14.06 | 2.51 |
| 0.5 Degree | 98297 | 5.40 | 2.60 | 99213 | 7.63 | 2.56 |
| 0.25 Degree | 369492 | 5.13 | 2.61 | 372212 | 5.74 | 2.57 |
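
The database timings above came from the same se_toolkit tooling issuing SQL against PostgreSQL. Purely for illustration, a single point lookup of the kind being timed would look roughly like the following against a PostGIS-style table; the table name, column names, and SRID here are hypothetical, and this is not the benchmark code.

```python
# Hypothetical equivalent of one database point-in-polygon query, against a
# PostGIS-style table.  The actual benchmark used se_toolkit/ArcSDE on
# PostgreSQL 9.1.9; names and SRID below are made up for illustration.
import psycopg2

SQL = """
    SELECT cntry_name
      FROM world_partitioned
     WHERE ST_Contains(shape, ST_SetSRID(ST_MakePoint(%s, %s), 4326))
"""

def lookup(conn, lon, lat):
    """Return the country name containing the point, or None."""
    with conn.cursor() as cur:
        cur.execute(SQL, (lon, lat))
        row = cur.fetchone()
        return row[0] if row else None

# conn = psycopg2.connect("dbname=world")   # connection details are site-specific
# print(lookup(conn, -77.0, 38.9))
```

Note that partitioning doesn't change the query at all; it just means each candidate row the WHERE clause has to test carries far fewer vertices, which is where the per-query savings in the table come from.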

While running a database on modern hardware has reduced the ACID overhead to nearly 2 milliseconds, that's not good enough to match a simple RAM cache, which starts submillisecond and optimizes to a fifth to a twentieth of a millisecond per query (depending on the vertex density of the queried table).

A couple of lessons can be learned here:

  1. The more things change, the more they remain the same
    There are those who claim optimization is unnecessary on newer systems, because the database or other software "has been designed to handle that."  Even if that's true, it doesn't hurt to check.  And even if optimizers are smarter, it rarely hurts to loft softballs at them. In this case, as before, queries on a large number of less complicated polygons have a consistent advantage over a smaller number of more complicated polygons (at least, until the sheer number of features overwhelms the benefits of simplicity). In short, having good fundamentals and following best practices is always good policy.
  2. It's often important to re-evaluate optimization decisions with each new generation of technology
    While the benefits of efficiently evaluated queries don't change, the "sweet spot" that defines the most efficient set of parameters could very well change, so it doesn't hurt to keep benchmark data and procedures in an accessible archive, where they can be re-run as-is when a new set of hardware is available.  It's amazing what a 1.4 GHz (AMD) 6-core CPU with 8 GB of RAM can accomplish over a 2.6 GHz (Intel) two-CPU host with 2 GB of RAM -- especially when coupled with a fast disk with a large on-board cache.  Modern boxes should be faster, but it doesn't hurt to see a tangible gain, and to be able to measure it accurately.  Even if the drift isn't significant every time, it doesn't hurt to revisit optimization thresholds, just so you won't miss a 5x performance gain when it is available.

- V

BTW: The IDENTITY class of the se_toolkit ASCII data conversion tool ascinfo was used to compile these metrics, using 9.x Data & Maps country files and the attached control files (these control files leverage changes to META parameters available in the recently released build of se_toolkit-10.2, so use Build 20 or later to reproduce this effort).
