Hello, we currently have an M4 Premium DataStore in our organization. We have a custom JavaScript SDK application that is used by fewer than 1,000 users simultaneously, though the application does implement cache busting using the request interceptor to get around the 30-second cache on ArcGIS Online layers. Intermittently, we get a 200 response from a REST FeatureService query where the content of the response is
{"error":{"code":503,"message":"An error occurred.","details":[]}}
Incidentally, this response causes a CORS error because it doesn't have the appropriate headers, AND a JS SDK error (failure to load tile) because the SDK is expecting a valid PBF response.
Ignoring for a moment the best practices around cache busting, what are we to make of this error? Is it related to load on the DataStore (currently an M4), to something at the API level, or to something else?
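For reference, here's a minimal Python sketch of how the failure shows up outside the SDK (the query URL is a placeholder for one of our layers): an HTTP 200 whose body is the JSON error above rather than the PBF payload the SDK expects.

```python
import requests

# Placeholder FeatureService query URL - substitute one of your own layers.
QUERY_URL = ("https://services.arcgis.com/<org-id>/arcgis/rest/services/"
             "<layer>/FeatureServer/0/query")

# The JS SDK normally requests results as PBF; ask for the same thing here.
params = {"where": "1=1", "outFields": "*", "resultRecordCount": 1, "f": "pbf"}

resp = requests.get(QUERY_URL, params=params, timeout=30)
print("HTTP status:", resp.status_code)

# The failure mode: HTTP 200, but the body is a JSON error envelope like
# {"error": {"code": 503, ...}} instead of a PBF payload.
if resp.status_code == 200 and "json" in resp.headers.get("Content-Type", "").lower():
    body = resp.json()
    if "error" in body:
        err = body["error"]
        print("Embedded error:", err.get("code"), err.get("message"))
```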
We saw 503s and similar a few years ago when we were exceeding the organisation's request thresholds. We were expecting 429s and worked with Esri and @MarianneFarretta to improve this behaviour and the available Data Store health reports. We now get the expected responses and can better manage our interactions with the platform.
You can view the number of request units used, and the total allowed for your organisation, in the web traffic (it's reported in the response headers). The count resets at the end of each minute. If your numbers are consistently very high during normal use, it might suggest your caching practices are causing issues for the org's health. 1,000 users on an M4 shouldn't really have this problem, even with a large number of active staff and some ETL processes.
Thanks @ChristopherCounsell. I was wondering if it was a threshold thing, but Esri told us that with the M4 and that number of users we shouldn't be close. We also spoke with @MarianneFarretta on one of our support calls. We've looked at the DataStore metrics and the CPU is NOT plateauing, but we still intermittently receive those errors - it seems like a quota limit could be reached without stressing the DataStore on an M4, though.
Out of curiosity, when you say "view the number of request units and total against your organization", what exactly are you referring to? How would we do that?
That's great that you're working with Marianne and her team already; I don't think you could be in better hands.
The limits are reported in the response headers - they start with x-... You could open the browser console and look at the response headers on a feature layer request. Add those headers as columns in the Network view to see them for all traffic.
I wrote a crappy Python script to ping a service with a non-cacheable query at the end of each minute, then record the response headers. Though general usage through the day was high, it showed occasional spikes that could be attributed to ETL practices, and we could revise our behaviour. I'll DM it to you later. You might be able to do something similar during a period of no activity.
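Roughly, it's something along these lines (a simplified sketch rather than the exact script - the query URL is a placeholder, and the exact x-... header names depend on what your org's responses actually include):

```python
import csv
import datetime
import time

import requests

# Placeholder FeatureService endpoint - point this at one of your own layers.
QUERY_URL = ("https://services.arcgis.com/<org-id>/arcgis/rest/services/"
             "<layer>/FeatureServer/0/query")

OUT_FILE = "rate_limit_headers.csv"

with open(OUT_FILE, "a", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "status", "x_headers"])
    while True:
        # Sleep until ~2 seconds before the next minute boundary, so we sample
        # the counters just before they reset at the end of the minute.
        sleep_s = (58 - datetime.datetime.now().second) % 60
        time.sleep(sleep_s or 60)

        # A unique 'where' clause makes the query non-cacheable.
        ts = int(time.time())
        params = {
            "where": f"1=1 AND {ts}={ts}",
            "returnCountOnly": "true",
            "f": "json",
        }
        resp = requests.get(QUERY_URL, params=params, timeout=30)

        # Keep only the x-... headers, which is where the limit info appears.
        x_headers = {k: v for k, v in resp.headers.items()
                     if k.lower().startswith("x-")}
        writer.writerow([datetime.datetime.now().isoformat(),
                         resp.status_code, x_headers])
        f.flush()
```

Logging to a CSV makes it easy to line any spikes up against your ETL schedule afterwards.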
Though we worked with Esri on the expected behaviour of these limits, at the end of the day it was still our own anti-practices that caused it, whether that surfaced as a 503 or a 429. I don't know enough about your workflows, but 'cache busting' and '30 seconds' raise an eyebrow. It could be increasing the load or causing specific issues with how the cache is handled. Beyond my skillset.