Using the ArcGIS API for Python, I'm comparing query performance against data in Oracle vs. a hosted feature layer whose data lives in the Spatiotemporal Big Data Store. We've got ArcGIS Enterprise (Portal, hosted server, GeoEvent Server, Data Store) at version 10.6.1 running on Linux; each component is on its own machine that meets the recommended specs. Our Spatiotemporal Big Data Store has 3 nodes, each with 16 cores, 32 GB RAM, and 1 TB storage. I'm using ArcGIS API for Python v1.4.2. I used GeoEvent Server to load 450 million rows into the Spatiotemporal Big Data Store, and I made a table view of the same data in Oracle.
For one day at a time, my query gets all of the unique values in a field (myField) and counts up how many records there are for each unique value.
Using a cx_Oracle database connection, I query my Oracle data like so:
import cx_Oracle
import pandas

sql = """SELECT COUNT(myField) AS CNT, myField FROM MY_TABLE
         WHERE DATETIME >= TO_DATE('2018-08-01 00:00', 'YYYY-MM-DD HH24:MI')
           AND DATETIME <  TO_DATE('2018-08-02 00:00', 'YYYY-MM-DD HH24:MI')
         GROUP BY myField
         ORDER BY CNT DESC"""
unique_values = pandas.read_sql(sql, <database connection>)
That query returns in 38 seconds and tells me that I have 3.8 million unique values for that day in myField. When I put that in a for loop and do that query for each day in a month the average query time is 31 seconds.
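For reference, my month-long benchmark loop looks roughly like this (a sketch, not my exact script; `MY_TABLE`, `myField`, and the omitted connection are placeholders, and the actual `pandas.read_sql` call is left as a comment so the snippet stands alone):

```python
import datetime
from datetime import timedelta

def day_windows(year, month, days):
    """Build (start, end) datetime pairs, one per day."""
    start = datetime.datetime(year, month, 1)
    return [(start + timedelta(days=i), start + timedelta(days=i + 1))
            for i in range(days)]

windows = day_windows(2018, 8, 31)

# For each day, build the GROUP BY query for that 24-hour window.
for t1, t2 in windows:
    fmt = '%Y-%m-%d %H:%M'
    sql = f"""SELECT COUNT(myField) AS CNT, myField FROM MY_TABLE
              WHERE DATETIME >= TO_DATE('{t1.strftime(fmt)}', 'YYYY-MM-DD HH24:MI')
                AND DATETIME <  TO_DATE('{t2.strftime(fmt)}', 'YYYY-MM-DD HH24:MI')
              GROUP BY myField ORDER BY CNT DESC"""
    # elapsed-time measurement wraps a pandas.read_sql(sql, conn) call here
```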
I expected my queries against the Spatiotemporal Big Data Store to be faster, but instead they are far slower. The same query takes 1 hr 12 min, and iterating over a month one day at a time yields an average query time of over an hour. I'm doing the query like this:
import datetime
from datetime import timedelta
import arcgis

fl = arcgis.features.FeatureLayer('<url>/MapServer/0', gis)
time1 = datetime.datetime(2018, 8, 1)
time2 = time1 + timedelta(days=1)
unique_values = fl.query(
    where='1=1',
    time_filter=[time1, time2],
    out_fields='myField',
    return_distinct_values=True,
    return_geometry=False
)
The query returns exactly what I expect, matching the Oracle result, which is good. But it is far slower: about 1 hr 15 min on average, versus roughly 30 seconds via Oracle.
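One variation I've been considering (a sketch, untested against my layer): `FeatureLayer.query` also accepts `out_statistics` and `group_by_fields_for_statistics`, which would ask the server to do the per-value counting, closer to the Oracle GROUP BY, instead of returning distinct values. The statistics definition is plain JSON, so it can be built and sanity-checked without a live connection; whether this is actually faster against the Spatiotemporal Big Data Store is exactly what I'm unsure about.

```python
# Server-side GROUP BY sketch: count rows per unique myField value.
def count_by_field_params(field):
    """Build query kwargs for a count-per-unique-value statistic."""
    return {
        "where": "1=1",
        "out_statistics": [{
            "statisticType": "count",
            "onStatisticField": field,
            "outStatisticFieldName": "CNT",
        }],
        "group_by_fields_for_statistics": field,
        "return_geometry": False,
    }

params = count_by_field_params("myField")

# With a live FeatureLayer, this would be (untested):
# result = fl.query(time_filter=[time1, time2], **params)
# counts = {f.attributes["myField"]: f.attributes["CNT"]
#           for f in result.features}
```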
How can I be sure I've optimized this query? Does Elasticsearch have some kind of per-query overhead that makes individual queries take a long time (in other words, should I only expect Elasticsearch to outperform Oracle when the scale of my data is huge)? Or should I not expect Elasticsearch to outperform Oracle at all for this type of query?
Incidentally, what does the dynamic_layer parameter in the FeatureLayer constructor do?
arcgis.features module — arcgis 1.5.0 documentation
The built-in docstring viewed via ctrl+tab in my Jupyter Notebook says, "Optional dictionary. If the layer is given a dynamic layer definition, this will be added to the functions.". What does that mean? There are no examples and I can't figure out what this is for. I was hoping I could use it like a definition expression to limit the data in my feature layer to the time range in question, thereby making my query faster.