I know you said you checked with your admins, but the first thing I would still check is the network activity logs for the period when this message starts appearing. If it really occurs on both ArcGIS Server boxes at exactly the same time, it is almost certainly some kind of network issue with connecting to that network appliance. Maybe they recently started testing IPv6, or the DNS server was updated, or something along those lines. I would try an nslookup of that server (arete.lc.gov) from each ArcGIS Server box when it says it can't connect to the config-store.
You could probably test this hypothesis by switching to a local config-store on both servers for one week and seeing whether the problem stops.
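If it helps, here is a rough Python sketch of that nslookup check that could be left running on each ArcGIS Server box, so there is a timestamped record to compare against the error log. It is only a sketch: the function names `check_dns` and `monitor`, the log format, and the 60-second interval are my own choices, and it uses the operating system's resolver through `socket.getaddrinfo` rather than calling nslookup itself.

```python
import socket
import time
from datetime import datetime


def check_dns(hostname, resolver=socket.getaddrinfo):
    """Resolve hostname once and return a one-line, timestamped log entry."""
    stamp = datetime.now().isoformat(timespec="seconds")
    try:
        results = resolver(hostname, None)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return f"{stamp} {hostname} FAILED {exc}"
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = sorted({info[4][0] for info in results})
    return f"{stamp} {hostname} OK {','.join(addresses)}"


def monitor(hostname, interval_s=60, count=None):
    """Print one log line per interval; count=None keeps going until interrupted."""
    done = 0
    while count is None or done < count:
        print(check_dns(hostname), flush=True)
        done += 1
        time.sleep(interval_s)
```

Running `monitor("arete.lc.gov")` on both boxes and redirecting the output to a file would show whether name resolution fails, or starts returning a different address, around the time the config-store errors begin.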
I think they have checked the network activity logs, but I'll confirm that. Also, the issue only seems to happen to one server or the other. As you can see in that log screenshot, the affected server is "Arete". The errors for that particular instance started at 8:00 AM, and the other server, "Col", did not start throwing errors until 8:14.
I've thought about doing the "local config-store" thing but am playing the waiting game for now. If I do choose to put the config store locally, what is the procedure for that? Just point to the new location in Manager and make sure the other server can see that location? Or do I need to do something more drastic like copy the config-store folder from the netapp to the local space first, then point to that in Manager?
I'd also check the Windows Server Event Logs for the time of the communication errors. It might be some kind of permissions issue.
For changing the config-store location, I should have been clearer: I was referring to keeping a local copy of the whole directory on each server. In that case I would just copy the whole config-store directory from the network location to a local drive on each server.
I've checked the Windows Server event logs as well, and there is nothing unusual to report.
Regarding the config-store thing... wouldn't what you mentioned above break the site? Don't both servers need to be able to interact with a shared config-store location? If I copy that folder to each server locally and each server only sees the copy on their respective machines... wouldn't that break how "the site" works in terms of the two machines managing the shared services?
I've never actually tried the local thing myself; it was just an idea to check for network issues. I was thinking you wouldn't change any configurations or publish or change any services during the testing, in which case the config-store data would all stay the same, I think. If so, would the two ArcGIS Servers still need to be able to modify the config-store directory? It could certainly break the site for a few minutes when you first test it, but I would think you would know immediately.
You are correct, though, that this kind of thing is not supported or recommended according to this documentation:
http://resources.arcgis.com/en/help/main/10.1/index.html#/Expanding_from_one_GIS_server_to_multiple_...
Along that line of reasoning... if the config-store is not being actively modified by map service changes or what have you, then the servers shouldn't have any trouble communicating with it at a time like the other morning at 8 AM, when nothing "change oriented" was going on. However, I think that ArcGIS Server communicates with the config-store on a regular basis, and one of the consequences of the servers not being able to communicate normally is this: on every occasion that this has occurred, if I try to go to our map service REST endpoints such as "http://maps.larimer.org/arcgis/rest/services", I'm presented with an ArcGIS Server login screen. Nothing I put in there for login/password works. So, for some reason, when communications with the config-store go down, the REST services don't know who is or isn't allowed to see them and respond by putting up a login screen.
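Since the login screen seems to be the visible symptom, one way to pin down exactly when it appears is to poll the REST endpoint and log what comes back. The Python sketch below is a guess at how that could look, not a tested monitor: `probe_rest` is a name I made up, and the `login_marker` default assumes that the login screen's HTML contains the word "password" while the normal services page does not, which would need to be checked against the real pages.

```python
import urllib.error
import urllib.request
from datetime import datetime


def probe_rest(url, opener=urllib.request.urlopen, timeout=10, login_marker="password"):
    """Fetch url once and return a one-line, timestamped log entry."""
    stamp = datetime.now().isoformat(timespec="seconds")
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:  # server answered with an error status
        return f"{stamp} {url} HTTP_ERROR {exc.code}"
    except OSError as exc:  # URLError, timeouts, refused connections
        return f"{stamp} {url} UNREACHABLE {exc}"
    # Assumption: only the login screen contains login_marker; verify on the real site.
    state = "LOGIN_PAGE" if login_marker in body.lower() else "OK"
    return f"{stamp} {url} {state} {status}"
```

Calling `probe_rest("http://maps.larimer.org/arcgis/rest/services")` every minute or so and comparing the first LOGIN_PAGE line against the first config-store error in the server log would show whether the login screen appears before, with, or after the communication errors.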
In any event, I do think that for testing purposes, if need be, I could copy the config-store folder to the C drive of one of the two servers, then repoint the config-store setting in Manager to that copy, making sure the other server can see the first server's C drive. If all goes well with that setup, then I can assume there is some issue with communicating with the netapp. If the same thing happens, then I can assume the issue is with ArcGIS Server itself.
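For the copy step in that test, something like the following Python sketch could do the copy and confirm that nothing was missed before anything is repointed in Manager. The function name `copy_config_store` and both paths in the usage note are placeholders of mine, and the sketch assumes nothing is writing to the config-store while the copy runs.

```python
import shutil
from pathlib import Path


def copy_config_store(src, dst):
    """Copy the config-store tree to dst and return the number of files copied."""
    src, dst = Path(src), Path(dst)
    shutil.copytree(src, dst)  # raises FileExistsError if dst already exists

    def rel_files(root):
        # Relative paths of every file under root, sorted for comparison.
        return sorted(p.relative_to(root) for p in root.rglob("*") if p.is_file())

    copied = rel_files(dst)
    if copied != rel_files(src):
        raise RuntimeError("copy does not match source; do not repoint Manager yet")
    return len(copied)
```

A call would look like `copy_config_store(r"\\netapp\share\config-store", r"C:\arcgis-test\config-store")`, with both paths replaced by the real locations. The check compares file names only, not contents, so it catches an interrupted copy but not a file that changed mid-copy.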
Do you have a development environment that can mimic the setup of your production environment with multiple machines? This is what my organization did so we could try to flush out multi-server environment issues in the development environment so it would not impact production users. I would suggest getting EDN licenses for your development environment to save cost.