Recently I fixed an ArcGIS Server instance that died after running out of disk space. The server.xml refresh done at service startup wrote out a 0-length file, which broke the site.
I thought I'd share my notes on how to fix this so others could benefit.
The file <installroot>\arcgis\server\framework\tomcat\conf\server.xml was empty because the startup refresh could not write it with 0 bytes of free disk space. This was indicated by the logs in <installroot>\arcgis\server\framework\tomcat\logs\catalina.0.log (that log stays empty unless you first rename conf\logging.properties.disabled to conf\logging.properties and then restart your server):
SEVERE: Parse fatal error at line [1] column [1] org.xml.sax.SAXParseException; systemId: file:/<installroot>/ArcGIS/Server/framework/runtime/tomcat/conf/server.xml; lineNumber: 1; columnNumber: 1; Premature end of file.
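A quick way to confirm this condition without reading the Tomcat logs is to check the file itself. The following is a minimal Python sketch, not an Esri tool; the function name server_xml_problem is made up for illustration, and you would pass it your own path to server.xml.

```python
import os
import xml.etree.ElementTree as ET


def server_xml_problem(path):
    """Return a short description of what is wrong with the file, or None if it parses.

    Hypothetical helper for illustration only.
    """
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        # This is the 0-length case described in this post.
        return "empty"
    try:
        ET.parse(path)
    except ET.ParseError as exc:
        return f"unparseable: {exc}"
    return None
```

A result of "empty" corresponds to the "Premature end of file" error at line 1, column 1 shown above.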
This meant the background Tomcat server that hosts the ArcGIS admin, manager, and rest/services pages would not start, as evidenced by repeating entries in the logs in C:\arcgisserver\logs\HOSTNAME.MYDOMAIN.COM\server:
<Msg time="2023-10-05T07:19:41,390" type="WARNING" code="7709" source="Server" process="9860" thread="1" methodName="" machine="BROKENHOSTNAME.MYDOMAIN.COM" user="" elapsed="" requestID="">The Web Server was found to be stopped when it should have been started. Restarting it.</Msg>
<Msg time="2023-10-05T07:20:41,686" type="WARNING" code="7709" source="Server" process="9860" thread="1" methodName="" machine="BROKENHOSTNAME.MYDOMAIN.COM" user="" elapsed="" requestID="">The Web Server was found to be stopped when it should have been started. Restarting it.</Msg>
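To see how often the restart warning repeats, the log lines can be tallied by message code. This is a rough Python sketch that assumes each entry sits on one line with the attributes in the same order as the two entries above; the name count_codes is hypothetical.

```python
import re
from collections import Counter

# Assumes the attribute order seen in the log excerpt: time, type, code.
MSG_RE = re.compile(
    r'<Msg\s+time="(?P<time>[^"]*)"\s+type="(?P<type>[^"]*)"\s+code="(?P<code>[^"]*)"'
)


def count_codes(lines):
    """Count log entries per message code (for example "7709")."""
    counts = Counter()
    for line in lines:
        match = MSG_RE.search(line)
        if match:
            counts[match.group("code")] += 1
    return counts
```

A count for code 7709 that keeps growing by about one per minute matches the pattern described here.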
As additional symptoms, the arcgis\manager and arcgis\admin endpoints were not reachable after service start, and ArcSOCs would launch and then crash after a few minutes. The Tomcat javaw process would start, fail, and then be launched again several times because nothing was answering on port 6443; the extra instances wrote additional catalina logs with the error "Caused by: java.net.BindException: Address already in use: bind". Low-memory warnings soon followed because so many parallel instances of Tomcat (<installroot>\arcgis\server\framework\runtime\jre\bin\javaw) were running.
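Whether anything is answering on port 6443 can be checked with a plain TCP connect. This is a small Python sketch using only the standard library; port_is_open is a hypothetical name, and a successful connect only shows that something is listening, not that the admin pages are healthy.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

For example, port_is_open("localhost", 6443) run on the server itself would be expected to return False while the site is in the state described above.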
This server.xml holds the only unencrypted copy of the web SSL certificate store’s password; other processes use an encrypted key to the same certificate store that is saved in keystorepass.dat. That left me no way to determine the test server’s old certificate store password: I could not decrypt it from keystorepass.dat, and I could not manually make a new encrypted keystorepass.dat to replace it. So I took the following steps to use a working environment’s configuration to repair the broken one:
1: I cleared a bunch of disk space so the issue wouldn't repeat...
2: I imported BROKENMACHINENAME.MYDOMAIN.COM's certificate into a working ArcGIS Server instance, using the same alias as indicated by the broken machine’s <installroot>\arcgis\server\framework\etc\machine-config.xml
3: I then copied the working machine's <installroot>\arcgis\server\framework\tomcat\conf\server.xml and <installroot>\arcgis\server\framework\etc\certificates\arcgis.keystore + keystorepass.dat to the broken test server, and edited the test server's new copy of server.xml to use the broken machine’s certificate alias instead of the working machine’s. The change is on the <Connector …> XML tag, where the keyAlias value goes from this:
<Connector … OTHER SETTINGS … keyAlias="workingHostname2023cert" keystoreFile="<installroot>\ArcGIS\Server\framework\etc\certificates\arcgis.keystore" keystorePass="..." >
To this:
<Connector … OTHER SETTINGS … keyAlias="brokenHostname2023cert" keystoreFile="<installroot>\ArcGIS\Server\framework\etc\certificates\arcgis.keystore" keystorePass="..." >
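If you would rather script the keyAlias change in step 3 than hand-edit the file, a targeted text substitution leaves the rest of server.xml untouched. This is a minimal Python sketch under the assumption that the alias appears as keyAlias="..." in double quotes; replace_key_alias is a hypothetical helper, and you should back up server.xml before writing any result to disk.

```python
import re


def replace_key_alias(xml_text, old_alias, new_alias):
    """Swap the keyAlias attribute value, leaving every other setting as-is.

    Raises ValueError if the old alias is not found, so a typo does not
    silently produce an unchanged file.
    """
    pattern = re.compile(r'(keyAlias\s*=\s*")' + re.escape(old_alias) + r'(")')
    new_text, count = pattern.subn(
        lambda m: m.group(1) + new_alias + m.group(2), xml_text
    )
    if count == 0:
        raise ValueError(f"keyAlias {old_alias!r} not found")
    return new_text
```

Working on the text rather than re-serializing the XML is a deliberate choice here: it avoids reordering attributes or dropping comments in a file that ArcGIS Server also rewrites itself.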
These steps should work to recover any broken ArcGIS Server that runs out of disk space, provided you see similar log entries and a 0-length server.xml. The server.xml is one of the few config files that get overwritten or updated during restarts even when you are not actively adjusting the configuration using the admin endpoint; most other disk usage is for temporary caches, etc. Some of you may not have another ArcGIS Server sitting around to do this with, but Esri could probably help make a replacement set of the three files for you, given a support ticket.
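Since the root cause was a full disk, it may be worth checking free space before restarting the service. This is a small Python sketch using the standard library; the helper names and the threshold you pass in are your own choices, not anything ArcGIS Server defines.

```python
import shutil


def free_gib(path):
    """Free space, in GiB, on the volume that contains path."""
    return shutil.disk_usage(path).free / 2**30


def has_headroom(path, min_free_gib):
    """True if the volume holding path has at least min_free_gib GiB free."""
    return free_gib(path) >= min_free_gib
```

Point it at the install root's drive, and at the drive holding C:\arcgisserver if that is a different volume.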
As an additional note, I initially imported the certificate to the working machine with a different alias from the one I originally used on the broken machine, thinking I could just adjust the server.xml keyAlias to use the new alias. Wrong. During startup ArcGIS adjusts the server.xml to use the alias saved in the config-store, so I got "Caused by: java.io.IOException: jsse.alias_no_key_entry" and saw it had restored the old keyAlias value into my server.xml from the broken machine’s <installroot>\arcgis\server\framework\etc\machine-config.xml file. It left other values alone, though. So I had to rename the alias in the keystore with:
keytool -changealias -keystore <installroot>\arcgis\server\framework\etc\certificates\arcgis.keystore -keypass <keystorePass value from the <Connector …> tag in server.xml> -alias <wrong_new_alias> -destalias <alias from machine-config.xml>
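If you need to repeat the alias rename on more than one machine, the same keytool call can be assembled in a script. This is only a sketch: it mirrors the flags used in the command in this post, build_changealias_cmd is a hypothetical name, and it builds the argument list without running anything.

```python
def build_changealias_cmd(keystore_path, key_password, wrong_alias, expected_alias):
    """Build the keytool argument list for renaming a keystore alias.

    Mirrors the flags used in this post; nothing is executed here.
    To run it, pass the list to subprocess.run(cmd, check=True).
    """
    return [
        "keytool",
        "-changealias",
        "-keystore", keystore_path,
        "-keypass", key_password,
        "-alias", wrong_alias,            # alias the certificate was imported under
        "-destalias", expected_alias,     # alias recorded in machine-config.xml
    ]
```

Passing the arguments as a list rather than one string avoids quoting problems with backslashes and spaces in Windows paths.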
I know a similar answer exists in this thread: Solved: ArcGIS Service started but ArcSOC and javaw proces... - Esri Community, but I thought more details with error text and steps regarding certificate issues may help others.
-Josh Dalton
