WebGISDR taking down our Portal (full/incremental)

JoeWeyl · ‎02-25-2019

We have an HA configuration on Windows Server 2016: two Portal/Server/DataStore VMs and one VM for the file server. We have a large Portal (over 2,500 users), and when I run the WebGISDR tool, the machine we run it from (currently the primary) gets pegged at 100% CPU and close to 80% RAM. This in turn takes down our Portal for my user community and doesn't allow me to get a proper backup. So far the only way I have been successful in getting a backup completed is to plan a maintenance night to do a full backup and then attempt my incremental backups on the days following. This works for a couple of days, but then the incremental fails, telling me I need to run a full backup before I can run an incremental.

My questions:

Can I move the WebGISDR tool to the secondary machine, or to a different machine that still has the same network path for the backup? For example, when I know the primary is the active Portal, could I run WebGISDR from the secondary, or from the file server? Ultimately the backup is stored in S3 once the transfers are complete, but I haven't had a successful backup in over a week, and with our size that makes me nervous from the Esri perspective (we have snapshots in AWS and other backups from the IT side).

Since this backup issue started we have been getting a few different errors out of the WebGISDR tool as well: for example, that it can't see the path to the temporary location, or a java.net.SocketException from ArcGIS Server. I had an open case on the WebGISDR issue, but it was opened under another context, so I am going to contact support to figure out if a new case should be opened.

JonathanQuinn · ‎02-25-2019

As you may know, the DR tool will run all backups in parallel so the Portal, Server, and Data Store backups will be running at the same time. That, on top of normal usage, can certainly result in 100% CPU usage.

It's strange that the incremental backups run for a few days and then suddenly start failing with a message that a full backup needs to be taken. You're not restoring the backups, correct? Do you have a DR environment? Is the Data Store failing to create an incremental backup?

It doesn't matter where the DR tool is run from; it'll always figure out which portal is primary and run the backup on that portal machine.

Is the DR tool complaining that the SHARED_LOCATION path is not accessible?
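One quick way to check this outside the DR tool is to probe the path from the same account the scheduled job runs under. Below is a minimal Python sketch, not part of WebGISDR itself: the function name `check_shared_location` and the probe-file prefix are made up for illustration, and the path you pass in is whatever you have set as SHARED_LOCATION.

```python
import os
import tempfile


def check_shared_location(path: str) -> str:
    """Return 'ok', 'missing', or 'not writable' for a backup staging path.

    Hypothetical helper for troubleshooting; it only tests what the
    current account can see and write, nothing WebGISDR-specific.
    """
    # A UNC path or mapped drive that this account cannot resolve
    # shows up here as a missing directory.
    if not os.path.isdir(path):
        return "missing"
    try:
        # Create and immediately delete a small probe file to confirm
        # write access (the file is removed when the context exits).
        with tempfile.NamedTemporaryFile(dir=path, prefix="webgisdr_probe_"):
            pass
    except OSError:
        return "not writable"
    return "ok"
```

Run it as the account the job uses (for example from a scheduled task configured with that account) and pass the same path that is in the properties file. If it reports "missing" under one account and "ok" under another, the problem is likely permissions or drive mapping rather than the tool.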

JoeWeyl · ‎02-25-2019

Hi Jonathan -

Thanks for the feedback. If the backup can be run from anywhere, will that offset the CPU issues on the primary? It had complained about the SHARED_LOCATION not being accessible, but I was running it as a job under the Windows Admin account rather than the .\arcgis account. We don't have these VMs on our domain yet in AWS.

I saw your other post about how to use Sysinternals to make the shared location part of the configuration and drive mapping. I am getting that set up, but again, since these VMs aren't on the domain, I'm not sure of the impact on this part of the configuration.

At this time, getting the VMs onto our domain will take getting a good backup first, since we have to use the replication configuration of Portal to get this set up.

Joe

JonathanQuinn · ‎02-26-2019

The DR tool isn't doing much other than calling into APIs within Portal, Server, and Data Store. You'll likely see the javaw.exe processes for Portal and Server and a PG process for Data Store using most of the CPU, but it'd be good to verify that.
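If it helps with verifying that, here is a rough Python sketch that takes two snapshots of per-process CPU time and ranks the processes by how much CPU they consumed in between. It assumes the third-party psutil package is installed on the machine; the function names (`snapshot`, `top_cpu_consumers`, `main`) and the 30-second window are my own choices for illustration, not anything from the DR tool.

```python
import time
from typing import Dict, List, Tuple

# pid -> (process name, cumulative CPU seconds)
Snapshot = Dict[int, Tuple[str, float]]


def snapshot() -> Snapshot:
    """Record cumulative CPU seconds (user + system) for each process."""
    import psutil  # third-party; assumed to be installed

    snap: Snapshot = {}
    for proc in psutil.process_iter(attrs=["pid", "name", "cpu_times"]):
        times = proc.info["cpu_times"]
        if times is None:  # access denied for this process
            continue
        name = proc.info["name"] or "?"
        snap[proc.info["pid"]] = (name, times.user + times.system)
    return snap


def top_cpu_consumers(
    before: Snapshot, after: Snapshot, limit: int = 5
) -> List[Tuple[str, int, float]]:
    """Rank processes present in both snapshots by CPU seconds used between them."""
    deltas = [
        (name, pid, cpu_after - before[pid][1])
        for pid, (name, cpu_after) in after.items()
        if pid in before
    ]
    deltas.sort(key=lambda item: item[2], reverse=True)
    return deltas[:limit]


def main(window_seconds: int = 30) -> None:
    """Sample while a backup is running and print the busiest processes."""
    before = snapshot()
    time.sleep(window_seconds)
    after = snapshot()
    for name, pid, used in top_cpu_consumers(before, after):
        print(f"{name} (pid {pid}): {used:.1f} CPU seconds")
```

Calling `main()` on the primary while a backup is in progress should show whether the Portal and Server Java processes and the Data Store database process are the ones at the top, or whether something else is using the CPU.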

The DR tool zips the backups which may use up some CPU, but that's after the backups have been created so it'd only be contending with normal usage at that point.