Using the ArcGIS API for Python, I'm comparing query performance against data in Oracle vs. a hosted feature layer whose data lives in the Spatiotemporal Big Data Store. We've got ArcGIS Enterprise (portal, hosted server, GeoEvent Server, data store) at version 10.6.1 running on Linux; each component is on its own machine that meets the recommended specs. Our Spatiotemporal Big Data Store has 3 nodes, each with 16 cores, 32 GB RAM, and 1 TB storage. I'm using ArcGIS API for Python v1.4.2. I used GeoEvent Server to load 450 million rows into the Spatiotemporal Big Data Store, and I made a table view of the same data in Oracle.
For one day at a time, my query gets all of the unique values in a field (myField) and counts up how many records there are for each unique value.
Using a cx_Oracle database connection I query my Oracle data like so:
select = """SELECT COUNT(myField) AS CNT, myField
            FROM MY_TABLE
            WHERE DATETIME >= TO_DATE('2018-08-01 00:00', 'YYYY-MM-DD HH24:MI')
              AND DATETIME <  TO_DATE('2018-08-02 00:00', 'YYYY-MM-DD HH24:MI')
            GROUP BY myField
            ORDER BY CNT DESC"""
unique_values = pandas.read_sql(select, <database connection>)
That query returns in 38 seconds and tells me I have 3.8 million unique values in myField for that day. When I put it in a for loop and run the query for each day in a month, the average query time is 31 seconds.
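For reference, the day-by-day loop can be sketched like this (a minimal sketch; MY_TABLE, myField, and the DATETIME column are taken from the question, and the actual cx_Oracle connection is left as a placeholder):

```python
import datetime

# Same GROUP BY query as above, with the day boundaries templated in.
SQL_TEMPLATE = """SELECT COUNT(myField) AS CNT, myField
FROM MY_TABLE
WHERE DATETIME >= TO_DATE('{start}', 'YYYY-MM-DD HH24:MI')
  AND DATETIME <  TO_DATE('{end}', 'YYYY-MM-DD HH24:MI')
GROUP BY myField
ORDER BY CNT DESC"""

def daily_ranges(year, month, days):
    """Yield (start, end) datetime pairs, one pair per day."""
    start = datetime.datetime(year, month, 1)
    for _ in range(days):
        end = start + datetime.timedelta(days=1)
        yield start, end
        start = end

def daily_sql(year, month, days):
    """Build one GROUP BY query string per day of the month."""
    fmt = '%Y-%m-%d %H:%M'
    return [SQL_TEMPLATE.format(start=s.strftime(fmt), end=e.strftime(fmt))
            for s, e in daily_ranges(year, month, days)]

queries = daily_sql(2018, 8, 31)

# Hypothetical usage with pandas and an open cx_Oracle connection:
# for sql in queries:
#     unique_values = pandas.read_sql(sql, connection)
```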
I expected my queries against the Spatiotemporal Big Data Store to be faster, but instead they are WAY slower: the same query takes 1 hr 12 min, and iterating over a month one day at a time yields an average query time of over an hour. I'm doing the query like this:
fl = arcgis.features.FeatureLayer('<url>/MapServer/0', gis)
time1 = datetime.datetime(2018, 8, 1)
time2 = time1 + timedelta(days=1)
unique_values = fl.query(
    where='1=1',
    time_filter=[time1, time2],
    out_fields='myField',
    return_distinct_values=True,
    return_geometry=False
)
The query returns exactly what I expect, which is the same thing the Oracle query returns. This is good. But it is way slower, at about 1 hr 15 min on average (again, roughly 30 seconds via Oracle).
How can I be sure I've optimized this query? Does Elasticsearch have some kind of per-query overhead that makes queries take a long time? In other words, should I only expect Elasticsearch to outperform Oracle when the scale of my data is huge, or should I not expect Elasticsearch to outperform Oracle at all for this type of query?
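One thing worth trying (an assumption on my part, not something I've benchmarked against the Spatiotemporal Big Data Store): the distinct-values query pulls every unique value back to the client, whereas the Oracle query does the COUNT/GROUP BY on the server. The ArcGIS REST API's query operation supports server-side aggregation via outStatistics and groupByFieldsForStatistics, which FeatureLayer.query exposes as out_statistics and group_by_fields_for_statistics. A minimal sketch of the parameters:

```python
def count_by_value_params(field):
    """Query kwargs asking the server to do the COUNT ... GROUP BY,
    mirroring the Oracle query instead of returning distinct values."""
    return {
        'where': '1=1',
        'out_statistics': [{
            'statisticType': 'count',
            'onStatisticField': field,
            'outStatisticFieldName': 'CNT',
        }],
        'group_by_fields_for_statistics': field,
        'return_geometry': False,
    }

params = count_by_value_params('myField')

# Hypothetical usage against the layer from the question:
# fl = arcgis.features.FeatureLayer('<url>/MapServer/0', gis)
# counts = fl.query(time_filter=[time1, time2], **params)
```

If the map service backed by the big data store honors outStatistics, this should cut the payload from millions of rows to one row per unique value.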
Incidentally, what does the dynamic_layer parameter in the FeatureLayer constructor do?
The built-in docstring (viewed via ctrl+tab in my Jupyter Notebook) says, "Optional dictionary. If the layer is given a dynamic layer definition, this will be added to the functions." What does that mean? There are no examples, and I can't figure out what this is for. I was hoping I could use it like a definition expression to limit the data in my feature layer to the time range in question, thereby making my query faster.
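For what it's worth, the dynamic_layer parameter appears to take the same JSON object the ArcGIS REST API uses for dynamic layers: a dict that points at a source map layer and can carry a definitionExpression. I haven't verified this against a Spatiotemporal Big Data Store layer, so treat the shape below as an assumption based on the REST dynamicLayer docs:

```python
# Assumed dynamic-layer JSON, per the ArcGIS REST API's dynamicLayer
# object: reference map layer 0 and filter it with a definition expression.
# The TIMESTAMP syntax for the expression is also an assumption and may
# differ depending on the backing data store.
dynamic_layer = {
    'id': 101,
    'source': {'type': 'mapLayer', 'mapLayerId': 0},
    'definitionExpression': "DATETIME >= TIMESTAMP '2018-08-01 00:00:00' "
                            "AND DATETIME < TIMESTAMP '2018-08-02 00:00:00'",
}

# Hypothetical usage:
# fl = arcgis.features.FeatureLayer('<url>/MapServer/0', gis,
#                                   dynamic_layer=dynamic_layer)
```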
Thanks for the post, not because I can answer it but because I have seen similar behavior and haven't taken the time to write it up. When it comes to all of the "distributed GIS" offerings (spatiotemporal data store, GeoAnalytics Server, raster analytics), we have not been impressed with the performance. Given the amount of money involved in deploying numerous virtual machines and licensing them with OS and GIS applications, I don't understand how "distributed GIS" is supposed to compete with a high-end local workstation conducting similar analyses.