portal HA 10.5 issue

01-22-2017 10:28 AM
AhmadAwada1
New Contributor II
In a Portal for ArcGIS 10.5 high availability scenario, I noticed that when the standby portal machine detects a failure of the primary machine, it communicates with the failed primary, drops its own (the standby's) "c:\arcgisportal\db" directory, and creates a new one (maybe based on information from the failed primary's db folder)!
I also noticed that IP communication between the two machines has to remain available during failover. In my case, if the primary machine is shut down, the standby machine will not start up until it can reach the failed machine again (even if the portal service on that machine is down).
Conclusion:
Is portal high availability only at the service level? That is, if both machines are up (network-wise) but one of the portal services is down, failover takes place, whereas if network access to the failed primary machine is lost, the standby machine will not start?
What is wrong with my configuration?
Here is the error I get when the standby machine has no network access to the failed primary machine:
"The portal has been initialized and configured but is not accessible. The internal portal database does not appear to be running or accepting connections. Restart the portal machine or machines and if the problem persists, contact Esri technical support (U.S.) or your distributor (customers outside the U.S.).</Msg>"
23 Replies
JonathanQuinn
Esri Notable Contributor

Hi Ahmad,

It doesn't appear that the standby Portal is completely promoting itself to primary. If it were, it would attempt to connect to its own database, not the previous primary's database. In the db folder of the standby, do you see a recovery.conf or a recovery.done file?
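
For anyone who wants to script that check, here is a minimal Python sketch. It only looks for the file names mentioned in this thread inside a given db folder. The function name `list_recovery_markers` is made up for illustration, and the folder path is the one from the original post, so adjust it to your own content directory.

```python
from pathlib import Path

# File names mentioned in this thread; nothing else is assumed about the folder.
MARKERS = ("recovery.conf", "recovery.conf.bak", "recovery.done")


def list_recovery_markers(db_dir):
    """Return the recovery-related file names that exist directly in db_dir."""
    db_path = Path(db_dir)
    return [name for name in MARKERS if (db_path / name).is_file()]
```

Calling `list_recovery_markers(r"c:\arcgisportal\db")` on each portal machine gives a quick way to compare what the primary and the standby currently have.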

AhmadAwada1
New Contributor II

Hi Jonathan,

First of all, thank you for the reply.

Yes, I can see a recovery.conf.bak file and a recovery.done file in that folder. What does that mean?

Let me update you with my latest observations on Portal HA (10.5). Can you test a scenario where both portal machines are shut down and you power on only the machine that had the "standby" role just before shutdown? In my case, I get this error:

"The portal has been initialized and configured but is not accessible. The internal portal database does not appear to be running or accepting connections. Restart the portal machine or machines and if the problem persists, contact Esri technical support (U.S.) or your distributor (customers outside the U.S.).</Msg>"

The standby machine never becomes primary, or even starts, until I power on the primary server!

JonathanQuinn
Esri Notable Contributor

It sounds like there are two problems. If you have a primary and a standby, both running and healthy, the standby should have a recovery.conf file, and possibly a recovery.done file as well. The presence of a recovery.done file tells the standby Portal it has successfully promoted itself to primary, which is why you don't see the failover in "normal" circumstances (where you simply stop the primary while the standby is running).

Your next problem is a current limitation that we're looking to improve in later releases, and it comes down to timing. If you stop the standby, the primary removes it from the HA configuration so that the primary knows the standby is down. Once the standby comes up again, the primary adds it back. If you stop the standby, then stop the primary, and finally start the standby, it won't promote itself to primary because it doesn't have the latest snapshot of the data from the primary database (it was down at the time). If it were to promote itself, you would lose any data created on the primary while the standby was down, since that data was never replicated from the primary to the standby. In your case, where you stop both machines, it sounds like a similar scenario. Depending on the timing of when you stop both, the standby may have stopped first, and then the primary. When you start the standby, it won't promote itself, as that could cause data loss.

Edit: it's fine if one or both machines have a recovery.done file within the db directory. That's an indication that the machine had, at some point, promoted itself to primary. The presence of the file won't affect future failovers or failbacks.
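
To make the timing limitation above easier to reason about, here is a toy Python model of the rule as described in this reply. It is not Portal's actual implementation. The class, its field names, and the function are invented for illustration, and the only thing modeled is "promote only when the primary is gone and the standby was still replicating when it lost the primary".

```python
from dataclasses import dataclass


@dataclass
class StandbyState:
    # Hypothetical fields, named for this illustration only.
    primary_reachable: bool          # can the standby reach the primary right now?
    in_sync_when_primary_lost: bool  # was the standby replicating when the primary went away?


def should_promote(state):
    """Toy rule: promote only if the primary is down and no data would be lost."""
    if state.primary_reachable:
        return False  # primary is healthy, so there is nothing to take over
    return state.in_sync_when_primary_lost
```

With this model, stopping the standby first and then the primary corresponds to `StandbyState(False, False)`, for which `should_promote` returns `False`. That matches the behavior reported in this thread, where the standby refuses to come up on its own.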

GirishYadav1
New Contributor III

Hi John,

I have an ArcGIS Enterprise base HA deployment on AWS. Per our policy, both EC2 instances (primary and standby) shut down every night, so I am facing the same issue of a simultaneous shutdown of the primary and standby, which messes up the HA configuration. Fortunately, I can control when each EC2 instance shuts down and starts up again (it doesn't have to be simultaneous). What are your recommendations regarding the shutdown and reboot order for the primary and secondary machines?

Thanks,

Girish

JonathanQuinn
Esri Notable Contributor

I would stop the standby, then the primary. When you need to start them, start the primary first, then the standby.
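
If the nightly schedule is generated by a script, the ordering above can be sketched in Python as follows. This only builds a list of planned actions and does not call AWS or Portal. The function name and the 10-minute gap between machines are assumptions chosen for illustration; pick a gap that suits how long your machines take to stop and start.

```python
from datetime import datetime, timedelta


def nightly_plan(stop_at, start_at, gap_minutes=10):
    """Return (time, action, machine) tuples: standby stops first, primary starts first."""
    gap = timedelta(minutes=gap_minutes)  # assumed spacing, not an Esri recommendation
    return [
        (stop_at, "stop", "standby"),
        (stop_at + gap, "stop", "primary"),
        (start_at, "start", "primary"),
        (start_at + gap, "start", "standby"),
    ]
```

For example, `nightly_plan(datetime(2017, 1, 22, 22, 0), datetime(2017, 1, 23, 6, 0))` yields four entries that can be fed into whatever scheduler actually stops and starts the instances.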

GirishYadav1
New Contributor III

Thanks, Jonathan Quinn! I will give it a try.

SandeepBurra4
New Contributor II

Dear Jonathan,

I would like to ask a question that would help me understand Portal HA better.

We followed the 10.5.1 documentation to configure Portal HA: "Configure a highly available portal—Portal for ArcGIS (Windows) Installation Guide (10.5) | ArcGIS E..."

One machine is shown as primary and the other as standby. In case the primary portal goes down, it takes about 3 minutes for the standby to become active, and until then the portal is not accessible.

Why does it say Active/Standby and not Active/Active? Is it because of a software license restriction, or is this how it is designed? When we checked the Esri documentation, it says Portal supports Active/Active.

Can you please provide more details on this? Looking forward to your reply.

Regards,

Sandeep B

JonathanQuinn
Esri Notable Contributor

Portal HA is active/active in that both web servers take requests. However, it's primary/standby at the database tier. Portal's internal database doesn't support multi-master, so it needs to be primary/standby. The standby is automatically promoted to primary when it detects that the primary is down. From 10.3.1 through 10.6, that takes minutes, unfortunately. At 10.6.1, we've improved the failover time to under a minute, typically under 30 seconds.
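
If you want to measure the failover time in your own environment, a rough Python sketch is below. It polls a check that you supply until the portal answers again and reports the elapsed time. The `is_up` callable is deliberately left to you (for example, a request against your own portal URL), and the function name and default intervals are assumptions for illustration.

```python
import time


def measure_outage(is_up, poll_seconds=5.0, timeout_seconds=600.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll is_up() until it returns True; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        if is_up():
            return clock() - start
        if clock() - start >= timeout_seconds:
            return None  # gave up waiting for the portal to come back
        sleep(poll_seconds)
```

Start it right after stopping the primary. The result is only accurate to within one polling interval, so use a short `poll_seconds` if you want to confirm a failover of under 30 seconds.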

SandeepBurra4
New Contributor II

Dear Jonathan,

Thank you for your reply. Can you please provide some more information on the issue below?

The federation is valid, but we are not able to start the hosted service in ArcGIS Server. Because of this, we cannot publish any hosted layer. Can you please advise on this?

Regards,

Sandeep B 
