KB - ArcGIS Server auto removed from site due to Config-Store Communication Issues - FIXED

01-20-2021 09:50 AM


Posting this to the community in the hope that it is useful to anyone who ends up in the same situation we did.

Problem Statement - 

ArcGIS Server v10.7.1 automatically disconnected from a single-machine site due to configuration-store communication issues.

Problem Details - 

ArcGIS Server was running v10.7.1 in a single-machine site.  The config-store and directories were on a shared file system in preparation for adding more machines at a later date.  The server was federated to a 10.7.1 Portal through a Web Adaptor hosted on IIS and designated as the hosting server.  The Portal data store runs on a standalone machine.

Observations indicate that some 'aggressive' security patching caused multiple restarts of both the ArcGIS Server machine and the underlying file system where the configuration store and directories are hosted.  Server logs (newest entry first) showed communication failures during the restart window, after which the server disconnected from the site:

<Msg time="2021-01-06T03:15:59,844" type="WARNING" code="7725" source="Server" process="1632" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Verify machine registration observer: Disconnecting from site.</Msg>
<Msg time="2021-01-06T03:15:59,844" type="WARNING" code="7725" source="Server" process="1632" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Verify machine registration observer: Unable to verify current machine information after retries.</Msg>
<Msg time="2021-01-06T03:15:54,844" type="WARNING" code="7725" source="Server" process="1632" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Verify machine registration observer: Unable to get machine '<MACHINE><DOMAIN>' information from configuration store, retrying.</Msg>
<Msg time="2021-01-06T03:15:49,797" type="WARNING" code="7725" source="Server" process="1632" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Verify machine registration observer: Unable to get machine '<MACHINE><DOMAIN>' information from configuration store, retrying.</Msg>
<Msg time="2021-01-06T03:15:44,766" type="WARNING" code="7725" source="Server" process="1632" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Verify machine registration observer: Unable to get machine '<MACHINE><DOMAIN>' information from configuration store, retrying.</Msg>
<Msg time="2021-01-06T03:15:39,719" type="WARNING" code="7725" source="Server" process="1632" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Verify machine registration observer: Unable to get machine '<MACHINE><DOMAIN>' information from configuration store, retrying.</Msg>
<Msg time="2021-01-06T03:15:34,672" type="WARNING" code="7725" source="Server" process="1632" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Verify machine registration observer: Unable to get machine '<MACHINE><DOMAIN>' information from configuration store, retrying.</Msg>
<Msg time="2021-01-06T03:15:29,625" type="WARNING" code="7725" source="Server" process="1632" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Verify machine registration observer: Unable to get machine '<MACHINE><DOMAIN>' information from configuration store, retrying.</Msg>
<Msg time="2021-01-06T03:15:29,406" type="INFO" code="7720" source="Server" process="1632" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">The cloud regions configuration file was deleted from this server machine. If you have other server machines in the site, please make sure that the file has been deleted from them as well.</Msg>
<Msg time="2021-01-06T03:15:28,797" type="SEVERE" code="6599" source="Admin" process="1632" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Failed to get the configuration of the server machine '<MACHINE><DOMAIN>'.  Server machine '<MACHINE><DOMAIN>' is not registered with the Site.</Msg>
<Msg time="2021-01-06T03:12:40,695" type="VERBOSE" code="7714" source="Admin" process="6076" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Acquired 'machine-<MACHINE><DOMAIN>' workflow lock.</Msg>
<Msg time="2021-01-06T03:12:40,617" type="WARNING" code="7704" source="Server" process="6076" thread="1" methodName="" machine="<MACHINE><DOMAIN>" user="" elapsed="" requestID="">Failed to stop web server. Cannot run program "cmd.exe" (in directory "C:\Program Files\ArcGIS\Server\framework\runtime\tomcat\bin"): CreateProcess error=19, The media is write protected</Msg>
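For anyone who wants to scan their own logs for the same pattern, here is a minimal sketch in Python. It assumes the log text consists of `<Msg ...>...</Msg>` records shaped like the ones above. The names `LogEntry`, `parse_entries` and `find_disconnects` are made up for this example and are not part of any Esri tooling. It uses a regular expression rather than an XML parser because the records are not wrapped in a single root element.

```python
import re
from typing import List, NamedTuple


class LogEntry(NamedTuple):
    time: str
    level: str
    code: str
    message: str


# One <Msg ...>text</Msg> record. The first three attributes are captured;
# the remaining name="value" attributes are skipped.
MSG_RE = re.compile(
    r'<Msg\s+time="(?P<time>[^"]*)"\s+type="(?P<level>[^"]*)"\s+code="(?P<code>[^"]*)"'
    r'(?:\s+\w+="[^"]*")*\s*>(?P<message>.*?)</Msg>',
    re.DOTALL,
)


def parse_entries(text: str) -> List[LogEntry]:
    """Return every log record found in the text, in file order."""
    return [
        LogEntry(m["time"], m["level"], m["code"], m["message"].strip())
        for m in MSG_RE.finditer(text)
    ]


def find_disconnects(entries: List[LogEntry]) -> List[LogEntry]:
    """Return the code 7725 records that report the machine leaving the site."""
    return [
        e for e in entries
        if e.code == "7725" and "Disconnecting from site" in e.message
    ]
```

Because the timestamps are in ISO order, `sorted(entries, key=lambda e: e.time)` puts the records into chronological order if the newest-first listing is hard to follow.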

 

We normally have these file systems excluded from automatic patching to prevent issues (like this) but the exclusion list was inadvertently ignored.  We normally patch these file systems manually and shut down ArcGIS Servers before the file system reboots. 

 

The following issues were also discovered - 

1) File Missing - \\<machine>\C$\Program Files\ArcGIS\Server\framework\etc\config-store-connection.xml

2) File Inaccurate - \\<machine>\C$\Program Files\ArcGIS\Server\framework\etc\machine-config.xml

Missing: <HTTPS>6443</HTTPS>

3) File Empty - \\<fileserver>\share\config-store\machines\<MACHINE>.<DOMAIN>.json
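
The three findings above can be turned into a quick health check. This is only a sketch, assuming Python 3 and read access to both locations. The function name `check_server_files` is hypothetical, and the `<HTTPS>` test is a plain text search for the element shown above, not a validation of the full file.

```python
import json
import re
from pathlib import Path
from typing import List


def check_server_files(etc_dir: Path, machine_json: Path) -> List[str]:
    """Return a list of problems; an empty list means all three checks passed."""
    problems = []

    # 1) The connection file must exist in framework\etc.
    conn = etc_dir / "config-store-connection.xml"
    if not conn.is_file():
        problems.append(f"missing: {conn}")

    # 2) machine-config.xml must exist and contain an <HTTPS>port</HTTPS> entry.
    machine_cfg = etc_dir / "machine-config.xml"
    if not machine_cfg.is_file():
        problems.append(f"missing: {machine_cfg}")
    else:
        text = machine_cfg.read_text(encoding="utf-8", errors="replace")
        if not re.search(r"<HTTPS>\s*\d+\s*</HTTPS>", text):
            problems.append(f"no <HTTPS> port entry: {machine_cfg}")

    # 3) The machine's JSON file in the config-store must be non-empty, valid JSON.
    if not machine_json.is_file():
        problems.append(f"missing: {machine_json}")
    elif machine_json.stat().st_size == 0:
        problems.append(f"empty: {machine_json}")
    else:
        try:
            json.loads(machine_json.read_text(encoding="utf-8"))
        except ValueError:
            problems.append(f"not valid JSON: {machine_json}")

    return problems
```

Call it with the `framework\etc` folder of the install and the machine's `.json` file under `config-store\machines`, then print whatever it returns.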

Resolution: 

Ideally we would have had a second machine in the site and could simply have 'joined' the site.  Instead, we had a configuration-store on a shared path with no machines actively participating in it.  We considered creating a new site on a local path, manually overlaying the existing config store (on the DFS), and then trying to 'move' it back.  Alternatively, a webgisdr restore may have resolved it.

Instead, we found the following corrected the situation for us.  

1) Shut down ArcGIS Server

2) Copied a "config-store-connection.xml" file from a separate deployment into the installation path noted above.  Updated file to reference the correct configuration store location (connectionString)

3) Manually edited the 'machine-config.xml' file to include the port 6443

4) Copied in a new <MACHINE>.<DOMAIN>.json file from a separate deployment into the configuration store path noted.  Updated file to reference the correct machine (machineName & adminURL).  Alternatively, we possibly could have extracted this from the last successful webgisdr backup file.

5) Start up Server
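
Step 4 can be scripted. This is a rough sketch only, assuming Python 3 and that ArcGIS Server is stopped while it runs. The function name `retarget_machine_json` is made up. Only the two fields named in step 4 (`machineName` and `adminURL`) are changed, and everything else in the copied file is carried over untouched. Other fields may also be machine-specific, so review the result by hand. Any existing (possibly empty) target file is kept as a `.bak` copy first.

```python
import json
import shutil
from pathlib import Path


def retarget_machine_json(src: Path, dest: Path, machine_name: str, admin_url: str) -> dict:
    """Copy a machine JSON from another deployment and point it at this machine."""
    data = json.loads(src.read_text(encoding="utf-8"))

    # The two fields called out in step 4; all other keys are left as copied.
    data["machineName"] = machine_name
    data["adminURL"] = admin_url

    # Keep whatever was there before (even an empty file) as <name>.bak.
    if dest.exists():
        shutil.copy2(dest, dest.with_name(dest.name + ".bak"))

    dest.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return data
```

The value passed as `admin_url` should be copied from a known-good record for the machine (for example, the last webgisdr backup) rather than guessed.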

 

The server did start up correctly, web-adaptor stayed connected, and was still federated to the portal and designated as the hosting server.  All server validation checks passed and existing services were still functional.  

 

Hope this information helps if anyone runs into the same or a similar issue.

Comments

@pfoppe 

Hi, thanks for sharing this. Just for clarification, was the Config-Store/Directories located on a DFS share? I'm wondering if that was a contributing factor here, since we don't recommend using a DFS file share, as mentioned in our documentation: https://enterprise.arcgis.com/en/server/latest/deploy/windows/choosing-a-nas-device.htm 

 

 

Hi @HarroldSompotan , 

 

Thanks for asking.  On this specific system it was on a UNC share.  This was one of our test systems.  One additional factor, I have a CNAME record applied to the windows host and the config store uses the CNAME alias.  That was done to allow us to move this around without having to try and remap everything through the Esri product.  

EX -

File share is hosted on \\servername\sharename

CNAME Alias applied and referenced as \\cname_alias\sharename

 

Most of our multi-machine deployments actually are mapped into DFS... many of these were created back at 10.1 and 10.2 when DFS was not prohibited, and in reviews with Esri staff at conferences (like Dev Summit) the DFS concern never came up.  In fact, there was positive feedback on the benefits DFS afforded us, primarily for moving things around on the back-end without having to go through the Esri product itself.

 

Looking at the doc link you provided, I see that DFS is no longer supported starting at the 10.7 release.  We recently upgraded all our sites from 10.6.1 to 10.8.1 and missed this new requirement (couldn't find it in the What's New).  This is concerning for us given that our large ArcGIS Server infrastructure is tied into DFS... so we expect quite a bit of work in front of us to accommodate this (new to us) requirement.

*sigh* ...  *double sigh*

-- edited --

I should have also mentioned...  We are using both DFS and aliases (at times) along with very short share names due to path length issues we've run into in the past.  If a publisher has a long service name, that ends up in the directories path multiple times... so the shorter the better...

I am having a similar issue after applying the 10.7.1 Security 2021 patch2 on my RHEL 7.8 server.

Every time I restart ArcGIS the machine-config.xml and config-store-connection.xml get deleted from .../framework/etc directory.  

We recently hit this issue on version 10.9.1, and the most probable cause is the supposedly non-disruptive patching of the NAS storage. I will update this post once I apply the solution mentioned by @pfoppe.

Good luck!!!  We have implemented this 'fix' a few times over the years.  
