Posting this to the community in the hope that it helps anyone who ends up in the same situation we did.
Problem Statement -
ArcGIS Server v10.7.1 automatically disconnected from a single-machine site due to configuration-store communication issues.
Problem Details -
ArcGIS Server v10.7.1 was running as a single-machine site. The config-store and server directories were on a shared file system in preparation for adding more machines at a later date. The server was federated to a 10.7.1 Portal through a Web Adaptor hosted on IIS and designated as the hosting server. The Portal data store runs on a standalone machine.
Observations indicate that some 'aggressive' security patching caused multiple restarts of both the ArcGIS Server machine and the underlying file system where the configuration store and directories are hosted. The server logs (newest entry first) show communication failures during the restart window, after which the server disconnected from the site:
<Msg time="2021-01-06T03:15:59,844" type="WARNING" code="7725" source="Server" process="1632" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Verify machine registration observer: Disconnecting from site.</Msg>
<Msg time="2021-01-06T03:15:59,844" type="WARNING" code="7725" source="Server" process="1632" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Verify machine registration observer: Unable to verify current machine information after retries.</Msg>
<Msg time="2021-01-06T03:15:54,844" type="WARNING" code="7725" source="Server" process="1632" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Verify machine registration observer: Unable to get machine '<MACHINE><DOMAIN>' information from configuration store, retrying.</Msg>
<Msg time="2021-01-06T03:15:49,797" type="WARNING" code="7725" source="Server" process="1632" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Verify machine registration observer: Unable to get machine '<MACHINE><DOMAIN>' information from configuration store, retrying.</Msg>
<Msg time="2021-01-06T03:15:44,766" type="WARNING" code="7725" source="Server" process="1632" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Verify machine registration observer: Unable to get machine '<MACHINE><DOMAIN>' information from configuration store, retrying.</Msg>
<Msg time="2021-01-06T03:15:39,719" type="WARNING" code="7725" source="Server" process="1632" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Verify machine registration observer: Unable to get machine '<MACHINE><DOMAIN>' information from configuration store, retrying.</Msg>
<Msg time="2021-01-06T03:15:34,672" type="WARNING" code="7725" source="Server" process="1632" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Verify machine registration observer: Unable to get machine '<MACHINE><DOMAIN>' information from configuration store, retrying.</Msg>
<Msg time="2021-01-06T03:15:29,625" type="WARNING" code="7725" source="Server" process="1632" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Verify machine registration observer: Unable to get machine '<MACHINE><DOMAIN>' information from configuration store, retrying.</Msg>
<Msg time="2021-01-06T03:15:29,406" type="INFO" code="7720" source="Server" process="1632" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">The cloud regions configuration file was deleted from this server machine. If you have other server machines in the site, please make sure that the file has been deleted from them as well.</Msg>
<Msg time="2021-01-06T03:15:28,797" type="SEVERE" code="6599" source="Admin" process="1632" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Failed to get the configuration of the server machine '<MACHINE><DOMAIN>'. Server machine '<MACHINE><DOMAIN>' is not registered with the Site.</Msg>
<Msg time="2021-01-06T03:12:40,695" type="VERBOSE" code="7714" source="Admin" process="6076" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Acquired 'machine-<MACHINE><DOMAIN>' workflow lock.</Msg>
<Msg time="2021-01-06T03:12:40,617" type="WARNING" code="7704" source="Server" process="6076" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Failed to stop web server. Cannot run program "cmd.exe" (in directory "C:\Program Files\ArcGIS\Server\framework\runtime\tomcat\bin"): CreateProcess error=19, The media is write protected</Msg>
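Since the log above is listed newest first, it can help to reorder it chronologically when working out the failure window. The following is only a rough sketch: it assumes every entry sits on one line and that requestID is the last attribute before the message text, as in the lines shown here. The function name parse_messages is made up for illustration.

```python
import re
from datetime import datetime

# Matches one <Msg ...>text</Msg> entry as it appears in the log excerpt above.
MSG_RE = re.compile(
    r'<Msg time="(?P<time>[^"]+)" type="(?P<type>[^"]+)" code="(?P<code>\d+)"'
    r'.*?requestID="[^"]*">(?P<text>.*?)</Msg>'
)


def parse_messages(log_text):
    """Return (timestamp, type, code, text) tuples, oldest entry first."""
    entries = []
    for m in MSG_RE.finditer(log_text):
        when = datetime.strptime(m["time"], "%Y-%m-%dT%H:%M:%S,%f")
        entries.append((when, m["type"], m["code"], m["text"]))
    # The file is newest first, so reverse before the stable sort; entries
    # that share a timestamp then come out in the order they were written.
    entries.reverse()
    return sorted(entries, key=lambda e: e[0])
```

Filtering the result for code "7725" lists the retry warnings in order, which in our case spanned roughly 30 seconds before the disconnect.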
We normally exclude these file systems from automatic patching to prevent issues like this one, but the exclusion list was inadvertently ignored. Our usual practice is to patch these file systems manually and shut down the ArcGIS Servers before the file system reboots.
The following issues were also discovered -
1) File Missing - \\<machine>\C$\Program Files\ArcGIS\Server\framework\etc\config-store-connection.xml
2) File Inaccurate - \\<machine>\C$\Program Files\ArcGIS\Server\framework\etc\machine-config.xml
Missing: <HTTPS>6443</HTTPS>
3) File Empty - \\<fileserver>\share\config-store\machines\<MACHINE>.<DOMAIN>.json
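The three findings above could be checked with a small script before and after any repair attempt. This is a minimal sketch, not an Esri tool: the function name check_site_files is hypothetical, the caller supplies the three paths, and it assumes the <HTTPS> element can appear anywhere inside machine-config.xml, since only the element itself is known from the finding above.

```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path


def check_site_files(connection_xml, machine_config_xml, machine_json):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # 1) config-store-connection.xml must exist.
    if not Path(connection_xml).is_file():
        problems.append(f"missing file: {connection_xml}")

    # 2) machine-config.xml must parse and carry a non-empty <HTTPS> port.
    #    Assumption: the element may sit at any depth, so search the whole tree.
    try:
        root = ET.parse(machine_config_xml).getroot()
        https = next(root.iter("HTTPS"), None)
        if https is None or not (https.text or "").strip():
            problems.append(f"no <HTTPS> port in: {machine_config_xml}")
    except (OSError, ET.ParseError) as exc:
        problems.append(f"unreadable XML {machine_config_xml}: {exc}")

    # 3) The machine's JSON file in the config-store must be non-empty, valid JSON.
    try:
        text = Path(machine_json).read_text(encoding="utf-8")
        if not text.strip():
            problems.append(f"empty file: {machine_json}")
        else:
            json.loads(text)
    except (OSError, ValueError) as exc:
        problems.append(f"unreadable JSON {machine_json}: {exc}")

    return problems
```

In our situation all three checks would have reported a problem.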
Resolution:
Ideally we would have had a second machine in the site and could simply have 'joined' this machine back to it. However, we had a configuration store on a shared path with no machines actively participating in it. We considered creating a new site on a local path, manually overlaying the existing config store (on the DFS), and then trying to 'move' it back. Alternatively, a webgisdr restore might have resolved it.
Instead, the following steps corrected the situation for us:
1) Shut down ArcGIS Server.
2) Copied a "config-store-connection.xml" file from a separate deployment into the installation path noted above, then updated it to reference the correct configuration store location (connectionString).
3) Manually edited the 'machine-config.xml' file to restore the missing <HTTPS>6443</HTTPS> entry.
4) Copied a <MACHINE>.<DOMAIN>.json file from a separate deployment into the configuration store path noted above, then updated it to reference the correct machine (machineName & adminURL). Alternatively, we could possibly have extracted this file from the last successful webgisdr backup.
5) Started ArcGIS Server.
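We made the edits in step 4 by hand, but the same change could be scripted. The sketch below rests on some assumptions: machineName and adminURL are top-level keys in the copied JSON file, the function name retarget_machine_json is hypothetical, and the caller supplies the correct values for the target machine. It keeps a .bak copy of the file before writing.

```python
import json
import shutil
from pathlib import Path


def retarget_machine_json(path, machine_name, admin_url):
    """Point a machine JSON file copied from another deployment at this machine."""
    path = Path(path)

    # Keep a backup next to the original before changing anything.
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)

    data = json.loads(path.read_text(encoding="utf-8"))

    # Assumption: both keys are top-level; stop rather than guess if they are not.
    for key in ("machineName", "adminURL"):
        if key not in data:
            raise KeyError(f"expected key {key!r} not found in {path}")

    data["machineName"] = machine_name
    data["adminURL"] = admin_url
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return backup
```

As with the manual steps, this should only be run while ArcGIS Server is shut down.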
The server started up correctly, the Web Adaptor stayed connected, and the server was still federated to the portal and designated as the hosting server. All server validation checks passed and existing services were still functional.
Hope this information helps if anyone runs into a similar issue.