I'm currently part of a team responsible for the ArcGIS ecosystem at one of the larger DSOs in the Netherlands, and a little over a year ago we decided to professionalize our platform. So far this has been a massive success: from a random/loose environment with a server, some Flex apps, and "a dude" managing it all on a Friday afternoon, we have embraced Web GIS, with a core dev/ops scrum team and multiple teams developing on the ArcGIS platform. We have about 2500 pieces of content (on Enterprise 10.6.1), over 1000 named users "in use", and many more coming in through internal OAuth apps (some of our data is within reach of ~3.7 million users). Long story short: a blazing success for the Web GIS strategy, and we're far from done...
A long-standing issue is our backup strategy... WebGISDR has proven unstable, unreliable, and unable to handle certain scenarios (such as restoring a single layer). And recently it even stopped working altogether...
For some reason our logs are littered with:
Failed to export site. Export of the Portal repository failed. Failed to take a base backup of a PostgreSQL Database.
I wouldn't be a system design associate if I didn't understand some of the deeper workings of ArcGIS Portal (and, frankly, I ran a backup myself with pg_dump). Looking through Portal's PostgreSQL logs, it seems webgisdr is simply trying to use the wrong credentials (even though the data store, the federated ArcGIS Servers, and the portal content directory itself back up fine)... We're getting the following log entries in Portal's internal PostgreSQL database, coinciding with the backup action itself:
2018-11-20 05:57:00 PST: : FATAL: no pg_hba.conf entry for host "127.0.0.1", user "ARCGIS", database "ARCGIS", SSL off
This... completely puzzles me... and brings me to the following questions...
Note: I can replicate this behaviour with Portal's /portal/portaladmin/exportSite function.
The DR tool doesn't make a direct connection to the database to run a backup. It relies on Portal to do that, so it just submits a request to the exportSite endpoint and lets Portal handle creating the backup (which is actually done through pg_basebackup and not pg_dump). The error you see in the DB logs is not related to the backup. Portal consistently checks the connection to the database using pg_isready, and it doesn't provide credentials when doing so. The check therefore defaults to the user attempting to make the connection, which would be the account used to run the Portal service.
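To make "submits a request to the exportSite endpoint" a bit more concrete, here is a rough Python sketch that only builds such a request. The host name, port 7443, the /arcgis web context, the token, and the backup folder are all placeholders, and the `location` parameter reflects my understanding of the Portal Admin API's exportSite operation, so check it against the documentation for your version before relying on it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_export_site_request(portal_admin_url, token, backup_dir):
    """Build (but do not send) a POST request for Portal's exportSite operation."""
    url = portal_admin_url.rstrip("/") + "/exportSite"
    # Assumed parameters: 'location' is the folder Portal writes the export to,
    # 'token' is an administrator token, 'f=json' asks for a JSON reply.
    body = urlencode({"location": backup_dir, "token": token, "f": "json"})
    return Request(url, data=body.encode("utf-8"), method="POST")


if __name__ == "__main__":
    # Hypothetical values, replace with your own deployment's details.
    req = build_export_site_request(
        "https://portal.example.com:7443/arcgis/portaladmin",
        token="<admin token>",
        backup_dir=r"D:\backups\portal",
    )
    print(req.get_method(), req.full_url)
    # To actually send it, pass req to urllib.request.urlopen() and read the JSON reply.
```

Sending the same request by hand and reading the response is a quick way to confirm whether a failure sits in Portal itself rather than in the DR tool.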
Some organizations rely on VM backups, but they have to be timed correctly with file server backups so you capture the data of the entire deployment at a point in time. If you take backups at different times, you'll have differences in the dependencies between Portal, Server, and Data Store which will cause issues when you attempt to restore the backups.
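If you do go the VM/file-server route, a simple sanity check on the backup timestamps can catch the timing problem early. This is only a sketch: the component names, the timestamps, and the 5-minute tolerance are made-up assumptions, and you would have to decide what skew is acceptable for your deployment.

```python
from datetime import datetime, timedelta


def backups_are_aligned(timestamps, max_skew=timedelta(minutes=5)):
    """Return True if all component backups were taken within max_skew of each other."""
    times = list(timestamps.values())
    return max(times) - min(times) <= max_skew


if __name__ == "__main__":
    # Hypothetical backup completion times per component.
    taken = {
        "portal_vm": datetime(2018, 11, 20, 2, 0),
        "server_vm": datetime(2018, 11, 20, 2, 3),
        "file_server": datetime(2018, 11, 20, 8, 0),
    }
    if not backups_are_aligned(taken):
        print("Backups were taken too far apart to restore as one consistent set.")
```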
If the export also fails in Portaladmin, then the DR tool isn't the issue, but rather the Portal/database. You should be able to run the pg_basebackup command manually to see if that returns an error, or set the log level in Portal to DEBUG and see if you get more information out of the backup process. Ideally Portal would tell you why a backup can't be created, not only that it can't be created.
I'm also interested in whether you could expand on this sentence: "WebGISDR has proven unstable, unreliable..." Can you explain some of the issues you've run into?
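For the manual pg_basebackup test, something along these lines would capture the exit code and any error text. It is a sketch under assumptions: the executable path, database user, and target folder are placeholders, and port 7654 is assumed to be the port of Portal's internal PostgreSQL instance. The flags themselves (-h, -p, -U, -D, -Ft, -z, -P) are standard pg_basebackup options.

```python
import subprocess


def build_basebackup_cmd(exe, host, port, user, target_dir):
    """Assemble a pg_basebackup call that writes a gzip-compressed tar backup."""
    return [
        exe,
        "-h", host,
        "-p", str(port),
        "-U", user,
        "-D", target_dir,
        "-Ft",  # tar output format
        "-z",   # gzip the tar output
        "-P",   # report progress
    ]


def run_basebackup(cmd):
    """Run the command and return (exit code, stderr) so failures are visible."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stderr


if __name__ == "__main__":
    # Hypothetical values, adjust to your installation before running.
    cmd = build_basebackup_cmd(
        "pg_basebackup", "127.0.0.1", 7654, "dbuser", r"D:\pg_basebackup_test"
    )
    print(" ".join(cmd))
    # code, err = run_basebackup(cmd)  # uncomment to actually run the backup
```

Use the pg_basebackup binary that ships with Portal rather than another PostgreSQL installation, so the client and server versions match.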
Thank you for getting back to me on this. I ran pg_basebackup (and made sure to use Portal's own library), and it worked just fine:
PS G:\arcgis\portal\framework\runtime\pgsql\bin> .\pg_basebackup.exe -h 127.0.0.1 -p 7654 -U GISBeheer -D G:\pg_basebackup -Ft -z -P
43139/43139 kB (100%), 1/1 tablespace
NOTICE: pg_stop_backup complete, all required WAL segments have been archived
I won't be able to set the debug flag on the servers' own logging until 8 December, when our next scheduled maintenance window is, sadly... I'll see if I can squeeze one in, but honestly, it'd be much better if our staging environment/test/develop would start bugging out as well
At this stage, however, having run through all the rights etc... I'm at a loss (especially because, as stated before, the portal otherwise works excellently...)
As for your question regarding reliability, it's in part... well... stuff like this (we had restore issues before when internal SSL certificates had shifted...), the issue that single items can't be restored, and the issue that you can't migrate backups "up" a version if you decide to upgrade your ArcGIS environment... (e.g. when we went to 10.6.1, all our 10.6.0 backups became useless). Also, the tool seems to fail completely if even one "part" of the site export fails, and the fact that it won't back up GeoEvent is worrisome (it backs up the ArcGIS Server component of it, not the event server config itself). Ideally I'd have a tool which is granular (e.g. restore a single item, preferably without downtime), more robust, etc... We're currently working on Azure DevOps (formerly known as Visual Studio Team Services) to publish services/content straight from Git as part of a "build", but we're far from that (and using that as a "backup" to auto-rebuild the entire site in case of disaster won't do much for user-generated content).
If it works manually, then getting DEBUG logs is really the only way to sort out why it's failing, unfortunately.
In regards to your feedback, thanks for providing it, I appreciate it:
1) If I assume you're talking about the Server's SSL certificates, yes, that's an issue we're going to fix in the next version of the software.
2) Right, the backups are not created in a way that allows individual items to be extracted. We are looking into creating disconnected packages of items, which still may not help as you'd need to create packages of every item you're interested in restoring manually.
3) We want to support forward compatibility for backups so you can restore a 10.X backup to a later version, but we're not sure when we'll get around to it.
4) The tool is designed to back up or restore everything at a point in time. Given the many dependencies between the ArcGIS Data Store (data), ArcGIS Server (service using the data), and Portal (item that controls security), allowing it to continue when one component failed may cause discrepancies between items.
5) The GeoEvent definitions are not included, but any services created in the Server through GeoEvent or used within any definitions should be. Can you expand on this a bit?