AnsweredAssumed Answered

spatiotemporal bds REST query performance

Question asked by drc_ on Oct 4, 2018
Latest reply on Oct 4, 2018 by bixb0012

Using the ArcGIS API for Python I'm comparing query performance against data in Oracle vs a Hosted Feature Layer where the data is in the Spatiotemporal Big Datastore. We've got ArcGIS Enteprise (portal, hosted server, geoevent server, datastore) at version 10.6.1 running on Linux; each component is on it's own machine that meets the recommended specs. Our spatiotemporal big data store has 3 nodes each with 16 cores, 32GB RAM, and 1TB storage. I'm using ArcGIS API for Python v1.4.2. I used GeoEvent server to load 450 million rows into the Spatiotemporal Big Data Store. I made a table view of the same data in Oracle.

 

For one day at a time, my query gets all of the unique values in a field (myField) and counts up how many records there are for each unique value.

 

Using a cx_Oracle database connection I query my Oracle data like so:

 

select = """SELECT COUNT(myField) as CNT, myField FROM MY_TABLE
                  WHERE ((DATETIME >= TO_DATE('2018-08-01 00:00', 'YYYY-MM-DD HH24:MI')) AND
                  (DATETIME < TO_DATE('2018-08-02 00:00', 'YYYY-MM-DD HH24:MI')))
                  GROUP BY myField
                  ORDER BY CNT desc"""

unique_values = pandas.read_sql(select, <database connection>)

 

 

That query returns in 38 seconds and tells me that I have 3.8 million unique values for that day in myField. When I put that in a for loop and do that query for each day in a month the average query time is 31 seconds.

 

I expected my queries against the spatiotemporal big datastore to be faster but instead they are WAY slower. Doing that same query takes 1 hr 12 min. Iterating over a month one day at a time yields an average query time of 1 hr+. I'm doing the query like this:

 

fl = arcgis.features.FeatureLayer('<url>/MapServer/0', gis)
time1 = datetime.datetime(2018,8,1)
time2 = time1 + timedelta(days=1)
unique_values = fl.query(
                         where='1=1',
                         time_filter=[time1, time2],
                         out_fields='myField',
                         return_distinct_values=True,
                         return_geometry=False
                        )

 

The query returns exactly what I expect it to return, which is the same thing that the Oracle query returns. This is good. But it's waaaaaaay slower at about 1 hr 15 min on average (again, it takes 30 seconds via Oracle).

 

How can I be sure I've optimized this query? Does Elasticsearch have some kind of query overhead that makes queries take a long time (in other words, should I only expect Elasticsearch to outperform Oracle when the scale of my data is huge?). Or should I not expect Elasticsearch to outperform Oracle at all for this type of query?

 

Incidentally, what does the dynamic_layer parameter in the FeatureLayer constructor do?

 

arcgis.features module — arcgis 1.5.0 documentation 

 

The built-in docstring viewed via ctrl+tab in my Jupyter Notebook says, "Optional dictionary. If the layer is given a dynamic layer definition, this will be added to the functions.". What does that mean? There are no examples and I can't figure out what this is for. I was hoping I could use it like a definition expression to limit the data in my feature layer to the time range in question, thereby making my query faster.

Outcomes