Some background: we have a two-machine site (with machines that live in different datacenters) that shares its config store and directories on a separate file server. We are doing some testing to see what happens when we take one of the servers out of service. We have tested reboots, disconnecting from the network, stopping ArcGIS Server, etc.
We are seeing a difference in behavior depending on which server we take down. When we take down the 1st server (the one used to create the site), no services draw until that server has completely started up all the SOCs and its CPU has come back down. This can take 20-30 minutes, as there are a few hundred services.
When we take down the 2nd server (the one joined to the site), the effect isn't as significant. We do see a lag in draw times, but it is brief.
My questions are: is this architecture designed to be HA, or is the site dependent on the 1st machine being available?
Should we view the 1st machine as "Primary"?
Also of note: after a machine restarts, it stays in a mixed state:
configured state: stopped
realtime state: started
It seems to function OK at this point, but do we have to manually start the machine to correct the status?
Thanks for any insight; this has been confusing us for a while now.
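For anyone who wants to spot this mismatch automatically, here is a minimal sketch. It assumes the machine status comes back as JSON with `configuredState` and `realTimeState` keys (which is what I understand the Administrator API's machine status resource returns, but treat the key names as an assumption and check them against your version). The function names are made up for illustration, and nothing here calls the server.

```python
def state_mismatch(status: dict) -> bool:
    """Return True when the configured and real-time states disagree.

    `status` is assumed to look like:
        {"configuredState": "STOPPED", "realTimeState": "STARTED"}
    (key names are an assumption; verify against your server's response).
    """
    configured = str(status.get("configuredState", "")).upper()
    realtime = str(status.get("realTimeState", "")).upper()
    return configured != realtime


def describe(machine: str, status: dict) -> str:
    """Build a one-line, human-readable summary for a monitoring log."""
    flag = "MISMATCH" if state_mismatch(status) else "ok"
    return (
        f"{machine}: configured={status.get('configuredState')} "
        f"realtime={status.get('realTimeState')} [{flag}]"
    )


if __name__ == "__main__":
    # Hypothetical machine name; the states are the ones reported above.
    print(describe("GIS1.EXAMPLE.COM",
                   {"configuredState": "STOPPED", "realTimeState": "STARTED"}))
```

Feeding it the two values above flags the machine as mismatched, which is the condition we currently clear by starting the machine manually.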
What network load balancing and/or Web Adaptor setup do you have? I think in Esri's view, an HA multi-machine site would require multiple Web Adaptors that route requests to the GIS Server machines in a round-robin fashion and detect when a machine is unavailable. A network load balancer is also necessary to communicate with your pool of GIS Server machines. File-based data sources should be copied locally to each machine, not accessed over the network, for fast access. Finally, the file server that stores your config store and server directories should be configured for high availability/redundancy.
Have a look at this doc: Multiple-machine deployment with ArcGIS Web Adaptor—ArcGIS Server Administration (Windows) | ArcGIS ....
I hope this helps - this stuff is confusing. One question to consider might be, what is your service level agreement, aka the percentage of ArcGIS Server site uptime required to fulfill the business requirements of your organization?
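To put numbers on the SLA question, here is a rough sketch of the arithmetic. The function names and the 30-day period are my own choices for illustration, not anything from Esri.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float = 30.0) -> float:
    """Minutes of downtime an uptime target permits over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def achieved_uptime_percent(outage_minutes: float, period_days: float = 30.0) -> float:
    """Uptime percentage actually achieved given total outage minutes."""
    total_minutes = period_days * 24 * 60
    return 100 * (1 - outage_minutes / total_minutes)


if __name__ == "__main__":
    # Downtime budget for a few common targets over 30 days.
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min / 30 days")
    # One 30-minute outage, like the restart described in the question.
    print(f"30 min outage -> {achieved_uptime_percent(30):.2f}% uptime")
```

A 99.9% target over 30 days leaves about 43 minutes of downtime, so a single 30-minute restart fits inside that budget but two in the same month would not.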
The first thing that jumps out to me is that you have your two machines in separate data centers. You'll run into a lot of performance issues which may be related to the instability you see. Server is pretty chatty in talking to the config-store and directories, and if there's any latency in communicating with the config-store and directories location, you're going to run into problems.
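If you want a rough feel for that latency, a small probe like the one below can be run from each machine against the shared location. This is only a sketch: it times a create/write/stat/delete cycle on tiny files, which is not the same as Server's real access pattern, and the share path in the example is hypothetical. Point it at a scratch folder on the same file server rather than at the live config-store itself.

```python
import os
import statistics
import tempfile
import time


def probe_share_latency(directory: str, rounds: int = 20) -> dict:
    """Time small create/write/stat/delete cycles in `directory`.

    Returns the median and maximum round-trip time in milliseconds.
    """
    samples = []
    for _ in range(rounds):
        start = time.perf_counter()
        # Create a uniquely named probe file on the share.
        fd, path = tempfile.mkstemp(dir=directory, prefix="latency_probe_")
        try:
            os.write(fd, b"x")
            os.fsync(fd)  # ask the OS to push the write out to storage
        finally:
            os.close(fd)
        os.stat(path)     # metadata lookup
        os.remove(path)   # clean up the probe file
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "rounds": rounds,
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }


if __name__ == "__main__":
    # Hypothetical UNC path; substitute a scratch folder on your file server.
    print(probe_share_latency(r"\\fileserver\arcgis_scratch"))
```

Comparing the numbers from the machine in each data center should show whether one of them is at a clear disadvantage when talking to the file server.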
If the purpose of putting the machines in separate data centers is redundancy, you still have a single point of failure in the shared config-store and directories. If you want HA and redundancy, set up four machines, two in each data center, and replicate your services between the two deployments. That way, you have completely isolated environments, and in the event that the primary goes down, the secondary can resume normal operations without any dependencies on the primary.
To address your specific question: there is no primary/standby for Server, so the behavior you see is not correct. Following Micah Babinski's comments, a WA or LB in front of the servers should be aware of the health of the machines and route requests appropriately. You can use the health check URL for Server as the HTTP health check for an LB. The WA will already handle this.
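As an illustration of what a health-aware LB does, here is a minimal sketch of round-robin routing that skips machines failing their health check. It assumes the health check lives at `/arcgis/rest/info/healthcheck` on port 6443 and returns `{"success": true}` with `f=json`; confirm both against your Server version. The host names and function names are hypothetical, and a real LB does this for you, so this is only to show the logic.

```python
import json
import urllib.request
from typing import Callable, Iterable, Iterator


def healthcheck_url(host: str, port: int = 6443) -> str:
    """Build the (assumed) ArcGIS Server health check URL for a machine."""
    return f"https://{host}:{port}/arcgis/rest/info/healthcheck?f=json"


def is_healthy(host: str, timeout: float = 5.0) -> bool:
    """Return True only if the machine answers its health check with success."""
    try:
        with urllib.request.urlopen(healthcheck_url(host), timeout=timeout) as resp:
            return resp.status == 200 and json.load(resp).get("success") is True
    except (OSError, ValueError, AttributeError):
        # Connection errors, timeouts, HTTP errors, or unexpected payloads.
        return False


def route_requests(hosts: Iterable[str],
                   check: Callable[[str], bool] = is_healthy) -> Iterator[str]:
    """Yield the next healthy host in round-robin order, skipping unhealthy ones."""
    ring = list(hosts)
    i = 0
    while True:
        # Try each machine at most once per request before giving up.
        for _ in range(len(ring)):
            host = ring[i % len(ring)]
            i += 1
            if check(host):
                yield host
                break
        else:
            raise RuntimeError("no healthy GIS Server machines available")


if __name__ == "__main__":
    # Simulate the first machine being down, using a stub instead of real probes.
    down = {"gis1.example.com"}
    picker = route_requests(["gis1.example.com", "gis2.example.com"],
                            check=lambda h: h not in down)
    print([next(picker) for _ in range(3)])
```

With the first machine marked down, every request goes to the second one, and traffic returns to both as soon as the first passes its check again. Neither machine is treated as primary.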