What is the most appropriate health check configuration for ArcGIS Server when using an application load balancer in AWS?
The default guidance from Esri (e.g. the cfn scripts) is to use the :6443/arcgis/rest/info/healthCheck endpoint. However, this does not take into account the web adaptor, or any issues within the web adaptor / web server tier.
Another option would be to use the same endpoint but from :443/webadaptorName/rest/info/healthCheck. However, even if the healthCheck endpoint returns a broken response it is still code 200 and healthy.
What would be the best option? I was thinking an export request against sampleWorldCities.
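To make the export idea concrete, here is a rough sketch of what such a check might look like. The bbox, size and format values are my own placeholders, and I am assuming that `f=json` on the export operation returns a JSON object with an `href` on success and an `error` object on failure; `build_export_url` and `export_succeeded` are hypothetical helper names, not part of any Esri tooling.

```python
import json
from urllib.parse import urlencode


def build_export_url(base_url, service="SampleWorldCities"):
    """Build a small export request against a map service (parameters are illustrative)."""
    params = {
        "bbox": "-180,-90,180,90",  # assumed extent, in the service's spatial reference
        "size": "100,100",          # keep the image tiny so the check stays cheap
        "format": "png",
        "f": "json",                # ask for JSON metadata rather than the image itself
    }
    return f"{base_url.rstrip('/')}/rest/services/{service}/MapServer/export?{urlencode(params)}"


def export_succeeded(status, body):
    """Treat the export as healthy only if the body is JSON with an href and no error."""
    if status != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        # An HTML error page (or any non-JSON body) lands here.
        return False
    return isinstance(payload, dict) and "error" not in payload and bool(payload.get("href"))
```

The point of checking the body is that a 200 status on its own would not be trusted; whether the load balancer itself can evaluate this is a separate question.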
Requesting https://WA Host:443/webadaptorname/rest/info/healthCheck should be enough for ALB or NLB (your title indicates NLB but your post references ALB).
However, even if the healthCheck endpoint returns a broken response it is still code 200 and healthy.
So if you know Server is not healthy (after removing permissions to the config-store, for example) and you manually make a request to https://server.domain.com:6443/arcgis/rest/info/healthCheck, you should see a 404 or 500 error, depending on the state of things. Are you finding that making the same request through the WA returns a 200 instead? Making a request to the health check through the WA will 1) check whether the WA is accessible and 2) check whether the Server is accessible. The WA also checks Server's healthCheck URL itself, and only forwards requests once it has determined that the backend machine is healthy.
Sorry, yes it should be ALB.
When we set the health check to use the web adaptor we generally have no issues and, as you have suggested, it should be a great way to ensure that both the web adaptor and the site are operational. However, since we are using Chef to deploy and configure the web adaptor, there are certain instances where the web adaptor either doesn't configure at all or takes a few attempts to configure. During this configuration time, or when it doesn't configure at all, any request that goes to https://server.domain.com/webadaptorname/* receives an ArcGIS error page indicating that no server machines have been configured with the web adaptor. When viewing this page as a user through a browser it is obvious that it is an error message; however, the load balancer simply sees a 200 response from the ArcGIS error page and marks the instance as healthy. This has resulted in users seeing the above error page, because the instance is then put into production. If we just hit :6443/arcgis/rest/info/healthCheck we have the same problem of the instance being marked as healthy before the web adaptor has configured.
While we have not yet seen a web adaptor fail in our cloud deployment, we could envisage a situation where the web adaptor drops its configuration and starts to show the "no server machines configured with this web adaptor" error, while the instance is still marked as healthy because the Esri error page returns a 200 response.
Ah, so before the WA is even registered, navigating to https://wahost.domain.com/webadaptorname/rest/info/healthCheck, for example, returns a 200? I feel like that's a bug; if the WA is not registered, it shouldn't return a 200 for URLs that would normally be forwarded to Server. The URLs that you navigate through when registering the WA should be the only ones that return a 200.
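In the meantime, one possible workaround is a "deep" check that inspects the response body rather than trusting the status code. The sketch below assumes that requesting the healthCheck endpoint with `?f=json` returns `{"success": true}` when the site is healthy, and that the unregistered-WA error page is HTML rather than JSON; `is_healthy` and `probe` are hypothetical names, and the URL in the usage note is only a placeholder.

```python
import json
import urllib.error
import urllib.request


def is_healthy(status, body):
    """Healthy only if the status is 200 AND the body is JSON with success == true."""
    if status != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        # The web adaptor's HTML error page is not valid JSON, so it fails here
        # even though it was served with a 200 status.
        return False
    return isinstance(payload, dict) and payload.get("success") is True


def probe(url, timeout=5.0):
    """Fetch the URL and apply is_healthy; any network or HTTP error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_healthy(resp.status, resp.read())
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

For example, `probe("https://wahost.domain.com/webadaptorname/rest/info/healthCheck?f=json")` could be run on the instance itself as part of the Chef run or a scheduled task. As far as I know, an ALB health check matches on the status code only and does not look at the body, so the result of a script like this would need to be surfaced some other way, such as a small local endpoint that returns a non-200 status when the probe fails.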