PORTAL HA

09-27-2018 04:49 AM
SandeepBurra4
Occasional Contributor

Dear All,

I would like to ask one question which can help me understand more on Portal HA.

We followed the 10.5.1 documentation to configure Portal HA: Configure a highly available portal—Portal for ArcGIS (Windows) Installation Guide (10.5) | ArcGIS Enterprise

One is shown as Primary and the other is Standby. In case the primary portal goes down, it takes about 3 minutes for the standby to become active, and until then the portal is not accessible. Why does it say Active/Standby and not Active/Active? Is it because of a software license restriction, or is this how it is designed?

When we checked the Esri documentation, it says Portal supports Active/Active.

Can you please provide more details on this?

Looking forward to your response.

17 Replies
JonathanQuinn
Esri Notable Contributor

Your load balancer should be configured with an HTTP health check that checks the portal via the health check URL:

Health Check—ArcGIS REST API: Administer your portal | ArcGIS for Developers 

A simple TCP check of whether the machine is up won't suffice, because the service could be off while the machine is still available, and your LB would then treat the backend server as available as well.

Portal should be failing over automatically and promoting the standby to primary. Failover is dependent on whether the standby can reach the database on the primary machine. If you reach https://<original standby>:7443/arcgis/portaladmin and you see an error while the service on the original primary is off, double check the processes on the original primary and see if there are any orphaned portal processes, specifically postgres.exe.
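As a rough illustration of why an HTTP health check tells you more than a TCP check, the Python sketch below classifies a health check response. It assumes the endpoint is `/arcgis/portaladmin/healthCheck?f=json` on port 7443 and that a healthy portal answers HTTP 200 with a JSON body whose `status` is `success`; verify both against the Health Check documentation for your version. The function names and the host passed in are hypothetical.

```python
import json
import urllib.request


def is_healthy(status_code, body):
    """Return True only for HTTP 200 with a JSON body of {"status": "success"}."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        # An HTML error page or an empty body is not a healthy answer.
        return False
    return isinstance(payload, dict) and payload.get("status") == "success"


def probe(host, timeout=5.0, context=None):
    """Fetch the (assumed) health check URL on one portal machine and classify it."""
    url = f"https://{host}:7443/arcgis/portaladmin/healthCheck?f=json"
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
            return is_healthy(resp.status, resp.read().decode("utf-8", "replace"))
    except OSError:
        # Covers refused connections, timeouts, TLS failures and HTTP error codes.
        return False
```

A machine that accepts TCP connections but whose portal service is stopped would pass a port check and still fail `probe`, which is the case described above.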

CameronBlandy
Regular Contributor

Hi Jonathan,

Thanks for the information. We have tried to set up the health check URL on our load balancer but have encountered issues due to the fact we use IWA to authenticate. As we do not allow Anonymous access to our portal we are unable to access the health check URL from the load balancer (F5). Unless there is some other way to do it?

Ideally we would want to use the health check, as just doing a simple TCP check is not really sufficient, as you stated.

Currently portal01 is off so I cannot reach https://portal01:7443/arcgis/portaladmin. I can reach https://portal02:7443/arcgis/portaladmin.

How do I check the postgres processes on the primary machine? Simply in Task Manager?

I see the following:

How would I determine which one is an orphan? Should there be only one postgres process?

Cameron

JonathanQuinn
Esri Notable Contributor

You should be able to configure your load balancer either to expect 401 responses as a healthy response (and allow the web adaptors to check the health of the portal machines) or, ideally, to authenticate against the 401 challenge. I'm by no means an F5 expert, but I came across the following documentation:

https://support.f5.com/csp/article/K2167 

For an HTTP/HTTPS monitor to successfully use NTLM or NTLMv2 authentication, a monitor must meet the following configuration requirements:

  • The monitor must have a send string. Because it is necessary to use HTTP 1.1, at a minimum the send string should use a format similar to the following example:

    GET /<filename/path> HTTP/1.1\r\nHost: <hostname>

  • The monitor must have a receive string.
  • The monitor cannot be a reverse monitor.
  • The monitor must have a username. The acceptable username format may vary depending on your server implementation. Common username formats are as follows:
    • Simple username. For example, UserName
    • User Principal Name (UPN). For example, UserName@domain.local
    • Down-level logon name. For example, DOMAIN\UserName
  • The monitor must have a password.
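
To make the two options concrete, here is a small Python sketch. One helper builds a send string in the format quoted above, and the other shows the "treat 401 as alive" fallback. The path, host name, and function names are hypothetical placeholders, and the `\r\n` is kept as literal backslash sequences because that is how the quoted example writes it.

```python
def build_send_string(path, hostname):
    r"""Build a send string shaped like: GET /<path> HTTP/1.1\r\nHost: <hostname>"""
    if not path.startswith("/"):
        path = "/" + path
    # Literal backslash-r backslash-n, as it would be typed into the monitor.
    return "GET " + path + " HTTP/1.1\\r\\nHost: " + hostname


def web_tier_alive(status_code):
    """Fallback check: a 200 or a 401 challenge means the web tier answered.

    A 401 only proves that something responded; it does not prove the
    portal behind it is healthy, so an authenticated monitor is preferable.
    """
    return status_code in (200, 401)
```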

DevCentral 

Postgres forks processes, so you'll see more than one running. That's normal. Are you able to sign into the Portaladmin API on portal 2?
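
If Task Manager is awkward, the process list can also be summarised from the command line. The sketch below parses the CSV output of `tasklist /FO CSV /NH` and returns the PID of every `postgres.exe`; it assumes the default column order (image name first, PID second). Several PIDs are expected because of forking, so the count alone does not identify an orphan. The more useful signal is probably whether any are still listed while the portal service on that machine is stopped.

```python
import csv
import io
import subprocess


def postgres_pids(tasklist_csv):
    """Return PIDs of postgres.exe rows from `tasklist /FO CSV /NH` output."""
    pids = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        # Rows look like: "postgres.exe","4312","Services","0","12,340 K"
        if len(row) >= 2 and row[0].lower() == "postgres.exe":
            pids.append(int(row[1]))
    return pids


def current_postgres_pids():
    """Windows only: run tasklist and parse its output."""
    result = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    )
    return postgres_pids(result.stdout)
```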

DerrickFrese1
Emerging Contributor

We have a HA Portal site that is using 2 instances of ArcGIS Web Adaptor (installed on 2 different web servers) and a LB directing traffic to the web servers.  It looks like this:

My question is about health checks.  If one Portal machine is down, do we need a health check with the LB?  Or do the web adaptors see that the machine is down and handle it themselves?

Jonathan Quinn

JonathanQuinn
Esri Notable Contributor

In that example, client side traffic (blue lines) goes from the LB to the web adaptors and then to the portals. You'd want a health check from the LB to the web adaptors as well because that checks two things: 1) whether the web server is up and running 2) whether the web adaptor can communicate with the portal. For the internal traffic, (yellow lines), you'd also want a health check to make sure that the backend portal machines are healthy for the communication between federated servers and portal.
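
A sketch of the two sets of probe targets described above, assuming the health check is exposed at `portaladmin/healthCheck` both under the web adaptor context and on port 7443 of each portal machine. All host names and the web adaptor name `portal` are hypothetical.

```python
def health_check_targets(web_servers, portal_machines, web_adaptor="portal"):
    """Return probe URLs for the client-facing and internal traffic paths."""
    return {
        # LB -> web adaptor: checks the web server and its link to the portal.
        "client": [
            f"https://{host}/{web_adaptor}/portaladmin/healthCheck?f=json"
            for host in web_servers
        ],
        # Internal path -> portal machines: used by the federated servers.
        "internal": [
            f"https://{host}:7443/arcgis/portaladmin/healthCheck?f=json"
            for host in portal_machines
        ],
    }
```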

VishApte
Esri Contributor

Hi @JonathanQuinn ,

Is it possible to know how the "standby" portal determines that the primary portal is down, i.e.:

1. The Load Balancer or Web Adaptor finds that the health check URL on the primary portal doesn't succeed and sends the request to the standby portal, which then promotes itself to primary.

AND/OR 

2. Standby portal calls health check or other hidden URL or TCP request on primary every X seconds and makes Y attempts before determining that primary is down and promotes itself?

We have seen, occasionally at a large Enterprise site, the standby promoting itself to primary for no apparent reason. We even had one case where both machines came up as Primary, which was really tricky to resolve. We would like to "delay" the process by which the standby promotes itself to primary, with the understanding that the failover time would then be higher than 3 minutes. Is that possible?

 

Cheers,

Vish  

 

JonathanQuinn
Esri Notable Contributor

The standby is checking whether the database on primary is accessible. It does this using values defined in the ha-config.properties file:

https://enterprise.arcgis.com/en/portal/latest/administer/windows/configuring-a-highly-available-por...

Optionally, define the portal's failover properties. A highly available portal checks whether a failure has occurred with the portal machines. You can define the interval in seconds and frequency for checking machine status using the steps below. These properties must be changed on each machine in the portal and must be the same on both machines.

At 10.8 and earlier, these values weren't actually honored and the first failed check would promote the standby. At 10.8.1, we corrected that behavior and the values are now honored. You can increase them if there are network issues that cause the standby to promote itself even when the primary is healthy, although the network issues themselves should be investigated.
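
As a back-of-the-envelope aid, the sketch below reads the two failover values from the text of ha-config.properties and multiplies them for a rough estimate of how long the primary's database must be unreachable before the standby promotes itself (at 10.8.1 and later, where the values are honored). The key names `portal.ha.monitor.interval` and `portal.ha.monitor.frequency` are assumed here and the sample values are illustrative only; check your own file and the documentation linked above.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def promotion_delay_seconds(props,
                            interval_key="portal.ha.monitor.interval",
                            frequency_key="portal.ha.monitor.frequency"):
    """Rough estimate: seconds between checks times the number of checks."""
    return int(props[interval_key]) * int(props[frequency_key])


# Illustrative values only; the key names are assumptions, not confirmed.
sample = """
# failover monitoring
portal.ha.monitor.interval=30
portal.ha.monitor.frequency=5
"""
print(promotion_delay_seconds(parse_properties(sample)))  # 150
```

Raising either value lengthens the window before promotion, which trades slower failover for fewer spurious promotions.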

by Anonymous User
Not applicable

Thanks @JonathanQuinn 

Does the same apply to Data Store?
