HA portal attempting to write to standby after failover

TonyContreras_Frisco_TX · ‎09-19-2023

I have a new Enterprise 11.1 setup with an HA portal. Portal1 was running as the primary, and Portal2 was the standby. I installed a patch for Portal on each server, one after the other. After both installs completed, Portal1 was again showing as primary; however, I was not able to edit or create new content consistently. I have many entries of "Failed to add or update item '<itemID>'. DBUtil.doUpdateTransaction(): failed," in the Portal Admin logs, citing Portal2 as the source machine.

After searching for this error, it looked like Portal was attempting to write to the read-only database of the standby Portal machine. Once I stopped the Portal Windows service on the standby, things began working as normal. When I started the service on the standby again, the same errors came back. I do not see the recovery.conf or promote.dat files in either server's db directory. The Portal Admin shows Portal1 as the primary.

Does anyone have suggestions to prevent Portal from attempting to write to the standby Portal machine?

@JonathanQuinn

JonathanQuinn · ‎09-19-2023

I'd expect the error to explicitly state that something was attempted against a read-only database:

org.postgresql.util.PSQLException: ERROR: cannot execute INSERT in a read-only transaction
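As a rough way to look for that message, the sketch below scans log lines for the PostgreSQL read-only rejection. It is only an illustration: `find_read_only_errors` is a hypothetical helper, and it assumes you have already read the relevant log into a list of strings.

```python
import re

# Matches PostgreSQL's read-only rejection, for example:
# "ERROR: cannot execute INSERT in a read-only transaction"
READ_ONLY_ERROR = re.compile(r"cannot execute (.+?) in a read-only transaction")


def find_read_only_errors(log_lines):
    """Return (line_number, sql_command) for each read-only rejection found."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        match = READ_ONLY_ERROR.search(line)
        if match:
            hits.append((number, match.group(1)))
    return hits
```

If this returns hits, something is sending writes to a database that is in standby (read-only) mode.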

You can:

Check nodes.properties to see which nodes are the primary and the standby, and validate that against the Machines API.
Check for the presence of the postgresql.auto.conf file, for example at "C:\arcgisportal\db\postgresql.auto.conf", and check its contents to make sure it knows the primary is the primary.
Check the config-store-connection.json file under framework\etc to make sure it lists both machine names and not localhost in the JDBC URL property
Check the arcgis#sharing.xml file under "C:\Program Files\ArcGIS\Portal\framework\runtime\tomcat\conf\Catalina\localhost" and make sure it also lists both machines and not localhost in the JDBC URL property
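The last two checks can be sketched as a small script. This is a minimal, unofficial illustration: `check_jdbc_urls` is a hypothetical helper, it works on the raw text of a file you have already read, and it assumes the JDBC URL appears in plain `jdbc:postgresql://host:port,host:port/db?params` form.

```python
import re

# host list, database name, optional query string
JDBC_URL = re.compile(r"jdbc:postgresql://([^/\s\"']+)/([^?\s\"']+)(?:\?([^\s\"']*))?")


def check_jdbc_urls(config_text, expected_hosts):
    """Return a list of problems found in the JDBC URLs inside config_text."""
    matches = list(JDBC_URL.finditer(config_text))
    if not matches:
        return ["no jdbc:postgresql URL found"]
    problems = []
    for match in matches:
        url = match.group(0)
        query = (match.group(3) or "").replace("&amp;", "&")
        # Multi-host URLs separate host:port pairs with commas.
        hosts = [entry.rsplit(":", 1)[0].lower() for entry in match.group(1).split(",")]
        if "localhost" in hosts:
            problems.append(f"{url}: uses localhost")
        for expected in expected_hosts:
            if expected.lower() not in hosts:
                problems.append(f"{url}: missing host {expected}")
        if "targetServerType=master" not in query.split("&"):
            problems.append(f"{url}: missing targetServerType=master")
    return problems
```

An empty list means the URL names every expected machine, avoids localhost, and ends with targetServerType=master. The script only reads text, so it cannot damage the configuration.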

TonyContreras_Frisco_TX · ‎09-19-2023

Thanks for the quick reply.

The nodes.properties has the machines listed, matching what the admin API shows.

Portal1's postgresql.auto.conf file has Portal2 listed for primary_conninfo, while Portal2 has Portal1 listed in the same place.

Both config-store-connection.json files have each machine listed; however, only Portal2's string has "/gwdb?targetServerType=master" at the end.

The arcgis#sharing.xml file on both servers has "jdbc:postgresql://localhost:7654/gwdb" set as the url property.

Would updating these files manually fix the issue, or is there a better way? I know that any unexpected formatting, or incorrect syntax can break things quickly and would prefer to not restore the VMs.

JonathanQuinn · ‎09-19-2023

I suspected that the arcgis#sharing.xml file was the culprit, since any request to create content uses that connection string. The JDBC URL should list both machines and end with /gwdb?targetServerType=master. The cleanest approach is to unregister the standby and re-register it. However, if you don't want to go that route, you can edit the arcgis#sharing.xml file, add /gwdb?targetServerType=master to Portal2's JDBC URL, and restart the standby.

TonyContreras_Frisco_TX · ‎09-19-2023

Thanks for your help. I had this issue in both test and production environments, so in Test I unregistered and re-registered the machine. That worked, so I then checked the arcgis#sharing.xml file and found the format below, which I copied and used to update the files on the production machines. I had to run the text editor as Administrator, of course.

url="jdbc:postgresql://PrimaryPortalServerName.domain.com:7654,StandbyPortalServerName.domain.com:7654/gwdb?targetServerType=master"
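For anyone assembling that value by hand, here is a minimal sketch that builds a URL in the same shape. `build_ha_jdbc_url` is a hypothetical helper, not part of Portal. The port 7654 and database name gwdb are the ones shown in this thread, and the host names are placeholders.

```python
def build_ha_jdbc_url(hosts, port=7654, database="gwdb"):
    """Build a multi-host PostgreSQL JDBC URL that targets the primary."""
    if len(hosts) < 2:
        raise ValueError("an HA URL needs both the primary and the standby host")
    # Each machine is listed as host:port, separated by commas.
    host_list = ",".join(f"{host}:{port}" for host in hosts)
    return f"jdbc:postgresql://{host_list}/{database}?targetServerType=master"


url = build_ha_jdbc_url([
    "PrimaryPortalServerName.domain.com",
    "StandbyPortalServerName.domain.com",
])
print(f'url="{url}"')
```

Generating the string avoids typos such as a missing comma or port. It is still worth copying the original file somewhere safe before editing it.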

After saving the change on each machine and starting the standby as Jonathan noted, it worked.

Is there a way to prevent this from happening on failover, or is it the luck of the draw?

JonathanQuinn · ‎09-20-2023

This is an issue with the Portal for ArcGIS 11.1 Enterprise Sites Security Patch in HA environments. BUG-000160830, "The 11.1 version of the Portal for ArcGIS Enterprise Sites Security Patch causes issues in highly available environments," has been logged to address it. We are going to re-issue the patch after correcting the problem. For those who already installed the Portal for ArcGIS 11.1 Enterprise Sites Security Patch and encountered this problem, there will be a tech article with the manual steps to restore the correct values in arcgis#sharing.xml.

TonyContreras_Frisco_TX · ‎09-26-2023

I read through the article posted here. It says that installing the corrected version of the patch will not fix the issue, and neither will uninstalling the patch. Does that mean I have to unregister and re-register the standby Portal machine every time we fail over, unless I do a clean install? Or is this a one-time issue, so that once things are back in order I won't have to worry about it again?

JonathanQuinn · ‎09-26-2023

No, this is a one-time issue that occurred after installing the patch. The problem won't occur with the re-issued patch, and failover won't be impacted.

TonyContreras_Frisco_TX · ‎09-26-2023

Thanks for the quick response. This is good news for those of us who have installed the patch.