I have a very tough disaster recovery issue going on. For context, we are on Windows 2012 with Portal 10.4 and no Web Adaptor; we are using port translation for 443->7443.
In doing some maintenance on our Portal, I realized we had our config-store and arcgisserver directories in a bad location, so I followed the steps to migrate those directories but something went wrong. When I finished the process, all of my Hosted services were stopped and I couldn't get them to start again.
We had snapshots of the servers (we have an HA configuration with 6 servers) and I tried bringing those back online, with the same result. Next, our team decided to go to backups of the VMs. Those images took over 24 hours to restore due to the size of our data, and when they came back online the Data Store worked. The ArcGIS Server Admin API worked, but Manager didn't, since the server is federated and Manager depends on Portal, and Portal didn't work. Postgres won't start and I get a 500 error out of Tomcat. Oh, and webgisdr didn't work for some reason, so I don't have that to fall back on.
We are at a loss here as to what direction to go. I have been working with support, but I wanted to see if anyone else had ever run into this and what you did to recover. Of course, this happens during the UC when most folks are in San Diego.
JQuinn-esristaff - Maybe you guys in Pro services have seen something like this?
Thanks Jonathan -
Support and I worked through it. There are a bunch of gotchas when it comes to restoring a VM image if the Portal, ArcGIS Server, and Data Store services are all running when the image is created. I think I have this worked out and possibly have everything back online. Still testing.
The problem Joe Weyl likely ran into is that when you take a VM snapshot of a machine running Portal, Server, and Data Store, it also captures the files that tell the software the process IDs of the running processes. Restoring the snapshot then causes issues because the software generates new PIDs but never removes the old PID files. Stop Portal on the machine and check the C:\Program Files\ArcGIS\Portal\framework\etc\pids folder for any .pid files.
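For anyone who wants to script that check, here is a minimal Python sketch that only lists leftover .pid files and does not delete anything. The folder is the one quoted above; the function name find_pid_files is made up for illustration, and the sketch assumes Portal has already been stopped before it is run.

```python
from pathlib import Path

# Folder quoted in the post above; adjust if Portal is installed elsewhere.
DEFAULT_PIDS_DIR = Path(r"C:\Program Files\ArcGIS\Portal\framework\etc\pids")


def find_pid_files(pids_dir=DEFAULT_PIDS_DIR):
    """Return a sorted list of .pid files found directly inside pids_dir.

    Hypothetical helper: it only reports files and never removes them.
    Returns an empty list if the folder does not exist.
    """
    pids_dir = Path(pids_dir)
    if not pids_dir.is_dir():
        return []
    return sorted(
        p for p in pids_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".pid"
    )


# Example usage (run only after Portal has been stopped):
#     for pid_file in find_pid_files():
#         print(pid_file)
```

Any file it reports while Portal is stopped is a candidate stale PID file left over from the snapshot; what to do with it is best confirmed with support.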
Jonathan is exactly correct. Those were the issues we ran into, and they made the recovery more complex. I am working with our SAs to set up scripts to stop the services before the snapshot is taken. It creates a small window of downtime on a global service, but we have an HA environment, which allows our Portal to keep working while these necessary disaster recovery steps are taken.
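A rough sketch of what such a pre-snapshot script might look like in Python, using the standard Windows `net stop` command. The service names and the stop order below are assumptions for illustration, not taken from our environment; confirm the real names in services.msc before using anything like this. The function names are also hypothetical.

```python
import subprocess

# ASSUMPTION: placeholder Windows service names and order; verify in services.msc.
SERVICES = ["Portal for ArcGIS", "ArcGIS Server", "ArcGIS Data Store"]


def build_stop_commands(services=SERVICES):
    """Build one `net stop <service>` command per service, in the given order."""
    return [["net", "stop", name] for name in services]


def stop_services(services=SERVICES, runner=subprocess.run):
    """Run each stop command and return {service name: return code}.

    A non-zero return code can mean the stop failed or that the service
    was not running, so inspect the result before taking the snapshot.
    `runner` is injectable so the logic can be exercised without Windows.
    """
    results = {}
    for cmd in build_stop_commands(services):
        completed = runner(cmd, capture_output=True, text=True)
        results[cmd[2]] = completed.returncode
    return results


# Example usage on the VM, from an elevated prompt, before the snapshot:
#     print(stop_services())
```

A matching `net start` step after the snapshot completes would bring the services back; that part is left out here.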