Planning Load Balancer Configuration for Highly Available ArcGIS Enterprise

04-30-2020 04:57 PM
NoahMayer
Esri Contributor

Introduction

Organizations often require a certain level of system uptime for their ArcGIS Enterprise deployments, such as 99 percent of the time or higher. For these organizations, implementing a strategy to ensure high availability is crucial.

High Availability (HA), while related to Disaster Recovery (DR), is a separate concept. Generally, HA is focused on avoiding downtime for service delivery, whereas DR is focused on retaining the data and resources needed to restore a system to a previous acceptable state after a disaster.

This post focuses on best practices for configuring a local load balancer (Local Traffic Manager, LTM) in a single location for high availability. It does not cover configuring a Global Traffic Manager (GTM) for automatic failover between different locations.

To achieve high availability, you must reduce single points of failure through duplication and load balancing.

Load balancers act as a reverse proxy and distribute traffic to back-end servers. A third-party load balancer is required in a highly available ArcGIS Enterprise deployment to improve the capacity and reliability of the software. Load balancers handle client traffic to your portal and server sites, as well as internal traffic between the software components.

ArcGIS Web Adaptor

Though ArcGIS Web Adaptor is considered a load balancer, it’s inadequate to serve as the lone load balancer in a highly available deployment, since ArcGIS Web Adaptor also requires redundancy to achieve high availability.

ArcGIS Web Adaptor is an optional component, as load balancers can forward requests directly to your portal and server sites, yet it is a recommended component. The advantages of using web adaptors are:

  • It provides an easy way to configure a single URL for the system
  • It allows you to choose context names for the different system components, e.g. portal, server, mapping, etc.
  • It is integrated natively with other ArcGIS Enterprise software components, the portal and server sites, and will automatically handle health checks and configuration tasks, e.g. adding a new machine to a server site

Web Context URL

The web context URL is the public URL for the portal. Since every item in the portal (file, layer, map, or app) has a URL, the portal's WebContextURL property helps it construct the correct URLs on all resources it sends to the end user.
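If you need to set or correct the WebContextURL programmatically, the Portal Administrator API exposes a system properties update operation. The sketch below is a minimal, hedged example using only the Python standard library; the host name, context path, and token are placeholders, and you should confirm the endpoint against your own deployment before relying on it.

```python
import json
import urllib.parse
import urllib.request

def build_properties_payload(web_context_url: str, token: str) -> bytes:
    """Form-encoded body for the portaladmin system properties update operation."""
    return urllib.parse.urlencode({
        "properties": json.dumps({"WebContextURL": web_context_url}),
        "token": token,
        "f": "json",
    }).encode()

def set_web_context_url(portal_host: str, web_context_url: str, token: str) -> dict:
    # POST to the Portal Admin API (placeholder host); requires an admin token.
    url = f"https://{portal_host}:7443/arcgis/portaladmin/system/properties/update"
    req = urllib.request.Request(url, data=build_properties_payload(web_context_url, token))
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (placeholder values):
# set_web_context_url("portal.domain.com", "https://gis.company.com/portal", "<token>")
```

Remember that changing the web context URL after federation is not supported without redoing administrative tasks, so this is best done once, before federating server sites.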

External Access and DNS

The ArcGIS Enterprise portal supports only one DNS name for the public portal URL (the web context URL), and there is currently no supported way to change the web context URL without redoing administrative tasks, e.g. federating server sites with your portal. If your ArcGIS Enterprise deployment requires external access without VPN, e.g. for mobile users, contractors, partners, or agencies, or if you anticipate needing external access in the future, you must use an externally resolvable DNS name for the portal's web context URL, e.g. https://gis.company.com/portal.

To secure external access to ArcGIS Enterprise, it is common to host a load balancer in a DMZ and implement a Split Domain Name System (Split DNS): internal requests to the ArcGIS Enterprise DNS name (e.g. gis.company.com) resolve to an internal load balancer IP, so internal users stay behind the firewall, while external requests to the same DNS name resolve to an external (DMZ) load balancer IP.

Load balancing configuration with external access

URLs used in federation

Several different URLs are used in a highly available ArcGIS Enterprise deployment.

Services URL

This is the URL used by users and client applications to access ArcGIS Server sites. It’s the URL for the load balancer that handles ArcGIS Server traffic and passes requests either to the server site’s Web Adaptor or directly to the server machines.

Administrative URL

This URL is used by administrators, and internally by the portal, to access an ArcGIS Server site when performing administrative operations. It is also used when publishing GIS services that reference a registered data store, e.g. a SQL Server enterprise geodatabase, to a federated server site. This URL must direct to a load balancer; if the administrative URL points to a single machine in the server site and that machine is offline, federation will not work. It can be the same URL as the services URL, or a second load balancer virtual IP (VIP) for each federated server site's admin URL via port 6443. Configuring a dedicated VIP via port 6443 requires opening that port for administrators and publishers, but it also lets you disable administrative access through the web adaptor, providing additional security controls for the organization.

I recommend using the same URL as the services URL, since it simplifies configuration. ArcGIS Server administrative access will be controlled by ArcGIS Enterprise authentication and user roles, similar to administrative access to the portal, e.g. the ArcGIS Portal Directory (portaladmin) and Organization Settings. To use the web adaptor URL as the administrative URL, you must enable administrative access in the server's web adaptor.

Private portal URL

This is an internal URL used by your server sites to communicate with the portal. It must also direct to a load balancer and should be defined prior to federating. If you federate your server sites prior to setting the privatePortalURL, follow steps 8 and 9 in the topic Configure an existing deployment for high availability to update the URL within your deployment. Similar to the administrative URL, this can be the same as the public URL for the portal (the portal's web context URL), or it can be a second load balancer (VIP) via port 7443.

For configuration simplicity, I recommend using the portal's public URL as the private portal URL. If you choose to use a dedicated load balancer VIP via port 7443 for the private portal URL, you should configure the load balancer to check the health of the portal machines.

Load Balancer Configuration

Health Check Settings 

The most important capability to use is a Health Check. As described in portal’s health check documentation:

 

“The health check reports if the responding Portal for ArcGIS machine is able to receive and process requests. For example, before creating the portal, the health check URL reports the site is unavailable because it can't take requests at that time.”

 

ArcGIS Enterprise portal and server have health checks.

When you use ArcGIS Web Adaptors, the web adaptors take care of performing the health checks against the portal and the servers. In this case you can configure the load balancer with a basic TCP/443 health check or a static page health check, with standard health check settings for timeout, failure trigger, polling interval, and healthy threshold.

If you configure the load balancer to access portal and/or the servers directly, e.g. if you don’t include web adaptors in your architecture, or if you use dedicated load balancer URL for private portal URL (port 7443) or federated server administrative URL (port 6443), you should configure the load balancer to check the health of the portal and server machines.
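As a rough illustration of what such a direct health probe looks like, the sketch below builds the documented portal and server health check URLs and treats a timeout, a non-200 status, or an error body as an unhealthy poll. Host names are placeholders, and this is a simplified stand-in for your load balancer's own monitor, not a replacement for it.

```python
import json
import urllib.request

# Health check paths for direct (non-web-adaptor) access to portal and server.
HEALTH_PATHS = {
    "portal": ":7443/arcgis/portaladmin/healthCheck?f=json",
    "server": ":6443/arcgis/rest/info/healthCheck?f=json",
}

def health_url(component: str, host: str) -> str:
    """Build the health check URL for a portal or server machine."""
    return f"https://{host}{HEALTH_PATHS[component]}"

def is_healthy(url: str, timeout: float = 5.0) -> bool:
    # Mirror the load balancer's probe: a non-200 response, an error body,
    # or a timeout/connection failure all count as an unhealthy poll.
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                return False
            return "error" not in json.loads(resp.read())
    except (OSError, ValueError):
        return False

# Example (placeholder host):
# is_healthy(health_url("portal", "portal1.domain.com"))
```

Note the 5-second timeout default, matching the recommendation discussed below.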

There are some important considerations about health check settings on the load balancer. Most organizations using a load balancer use a static page as their health check (e.g. index.html) to determine whether the web server is healthy. This is a static file that only requires a disk fetch.  Also, most web servers tend to bottleneck with I/O rather than CPU. 

However, Esri’s ArcGIS Server is different in that our health check requires a very small amount of CPU because our check is more than just a disk fetch, as the software needs to determine whether certain processes are functional.

With health checks there is a timeout value on the poll. Most load balancer administrators set the timeout value very low because disk fetches are usually very fast (though they often leave some buffer for poor network latency).

When using a low timeout value with ArcGIS Server and when using a multi-machine setup, there is a chance one of the machines may exceed the low timeout value and a healthy machine will be removed.

Esri recommends a higher timeout value, ideally at least 5 seconds. Depending on the system, you might need to increase this value even more; monitor your environment and adjust it accordingly. This may seem like a high number to Network Administrators, since a health check on a simple page usually takes less than 10 ms (plus network latency).

However, it is critical for the load balancers to distinguish between when a machine is just slow vs. when a machine is truly dead and non-responsive.  If portal or server is truly down and not listening on a port at all, most load balancers will detect this even sooner than 5 seconds, so this timeout does not impact "normal" failures where a machine goes away.

The second consideration is the failure trigger - how many times does the poll need to fail before removing the machine from the load balancer. 

As a rule, Network Administrators set the failure trigger to be more than 1 failure because they do not want a single network glitch to take down the system.  As noted above, a small spike in CPU usage (which is common on ArcGIS Server systems) can cause a timeout, and it is not desirable to bring down a machine because of a single CPU spike on a single machine. 

Esri has done significant internal testing with load balancers and found that a setting of 5 failures dramatically reduced the number of false positives while still detecting true outages.  We found that a value of 3 was still highly susceptible to false positives.

The third consideration is the polling interval. Esri has found that a 30-second polling interval, coupled with a failure trigger of 5, was the sweet spot for detecting true failures while ignoring false positives.

With this combination, the expected value (using a statistics term), or mean time to detection, is 2 minutes and 15 seconds of downtime before detection, with a worst case of 2 minutes and 30 seconds. It is possible to aim for a lower mean time to detection, but the trade-off is receiving false positives.
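These figures follow from a simple model in which the failure occurs uniformly at random within a polling interval, so the first failed poll lands, on average, half an interval after the failure; the worked calculation below makes the arithmetic explicit.

```python
def detection_times(interval_s: float, failure_trigger: int) -> tuple[float, float]:
    """Mean and worst-case seconds of downtime before the load balancer
    removes a machine, assuming the failure occurs uniformly at random
    within a polling interval."""
    # On average the first failed poll comes interval/2 after the failure;
    # at worst the failure happens just after a poll, so a full interval passes.
    mean = interval_s / 2 + (failure_trigger - 1) * interval_s
    worst = failure_trigger * interval_s
    return mean, worst

mean, worst = detection_times(30, 5)
print(mean, worst)  # prints "135.0 150": 2 min 15 s mean, 2 min 30 s worst case
```

The same function shows the trade-off: detection_times(30, 3) gives a faster mean detection at the cost of more false positives.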

If a lower mean time to detection is preferred, you may need to increase capacity so there are sufficient resources to absorb both a true failure on one machine and a false positive on another without overloading the remaining machines.

The final setting regarding health checks is the healthy threshold for the load balancer to start sending requests again.  Esri does not have a recommendation for this and we have not observed many differences in this number, but we typically see 3 healthy consecutive polls before rejoining.

Throttling 

Throttling settings are worth considering.  ArcGIS Server is CPU-bound which means most of the request time is spent using the CPU rather than waiting for I/O. 

This means if there are 8 cores, ArcGIS Server can handle a little more than 8 simultaneous requests from a practical standpoint. If ArcGIS Server is busy, it starts queueing requests until there are hundreds of requests in line, and after that threshold it refuses connections. When there is a long backlog of requests it will result in a long wait time, but eventually the request gets processed.

ArcGIS Server has settings to control this behavior so that fewer of these abandoned requests are worked on, but it is also a best practice to control this at the load balancer level via its ability to throttle.

Esri does not have a numerical recommendation because it depends heavily on the specific architecture and the types of requests coming in, but it is typical to throttle at a significantly lower value than Network Admins would use for a typical web server.

This is beyond the control of the Network Administrator, but client applications should be written to handle a throttling event and retry at increasing time intervals (e.g. retry immediately the first time, wait one second the second time, wait 5 seconds the third time, and so on).
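A client-side retry handler following that pattern might look like the sketch below. The delay schedule mirrors the example above, and the request function, ThrottledError type, and sleep hook are illustrative placeholders so the pattern works with any HTTP client.

```python
import time

class ThrottledError(Exception):
    """Raised by `request` when the load balancer rejects the call (placeholder type)."""

def call_with_retries(request, delays=(0, 1, 5, 15), sleep=time.sleep):
    """Call `request()` and retry on a throttling error, waiting longer
    before each attempt (0 s, 1 s, 5 s, ...), as suggested above."""
    last_error = None
    for delay in delays:
        sleep(delay)
        try:
            return request()
        except ThrottledError as exc:
            last_error = exc
    raise last_error  # all retries exhausted
```

Injecting `sleep` keeps the backoff testable; in production you would leave the default and have `request` raise ThrottledError on the status code your load balancer returns when throttling.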

Sticky Sessions 

Esri does not recommend sticky sessions except in very rare circumstances. Sticky sessions can theoretically overload a machine, though in Esri's load tests with sticky sessions enabled this did not occur, and we have not received customer complaints about it. That said, since our software is stateless, we do not see the value in using sticky sessions.

Layer 4 vs. Layer 7 

The final setting to mention is whether the load balancer takes a layer 7 or a layer 4 approach. This can be a debate among Network Administrators, but below is a concise summary of the differences and advantages of each.

A layer 7 load balancer understands HTTP and HTTPS; it therefore decrypts HTTPS content and then re-encrypts it. Because it understands HTTP and HTTPS, it can cache content and save requests from going to the back-end server.

A layer 4 load balancer views all the traffic as TCP packets and has no idea what the packets mean; they could be FTP, HTTPS, or SMTP, but this does not matter to a layer 4 load balancer. As a result, it doesn't need to understand the HTTP payload and can be faster.

ArcGIS Server can work with either approach and there is no recommendation on this setting for Network Admins, but there are some pieces of information that an Admin will want to be aware of. 

ArcGIS Server payloads can be much larger than HTML pages, CSS, and JS content (the actual size depends heavily on the data the customer uses). This means there is more CPU load on a layer 7 load balancer decrypting and encrypting on a per-request basis.

Also, since much of ArcGIS Server data is dynamic and frequently changing, by default cache headers prohibit caching on the client and load balancer. If the data doesn't change much and the customer wants to use a layer 7 load balancer, they can change and control these cache settings. 
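One quick way to see whether a layer 7 load balancer could cache a given response is to inspect its Cache-Control header. The helper below is a deliberately simplified sketch (real caches also honor Expires, Vary, and other headers), and the example header values are illustrative rather than taken from a specific ArcGIS release.

```python
def is_cacheable(cache_control: str) -> bool:
    """Rough check of a Cache-Control header: directives like no-store,
    no-cache, or private forbid shared caching of the response."""
    directives = {d.strip().split("=")[0].lower()
                  for d in cache_control.split(",") if d.strip()}
    return not directives & {"no-store", "no-cache", "private"}

# A typical dynamic-service response forbids caching:
print(is_cacheable("no-cache, no-store, must-revalidate"))  # False
# A response with relaxed cache settings would allow it:
print(is_cacheable("public, max-age=3600"))  # True
```

If your data changes rarely and you run a layer 7 load balancer, relaxing these directives is what turns its caching capability into an actual saving on back-end requests.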

Summary of Recommendations

Health Check

  • If you configure your load balancers with ArcGIS Web Adaptors, you can configure the load balancer with basic TCP/443 health check or a static page health check against the web servers
  • If you configure the load balancer to access the portal and/or the servers directly (via ports 7443 and 6443), use the HTTPS health check endpoints, e.g. https://portal.domain.com:7443/arcgis/portaladmin/healthCheck?f=json for portal and https://server.domain.com:6443/arcgis/rest/info/healthCheck?f=json for server

Throttling

  • Use a throttle setting at a significantly lower value than for typical web servers

Sticky Sessions

  • Do not use sticky sessions

Certificates

ArcGIS Enterprise components come pre-configured with a self-signed server certificate, which allows the software to be tested initially and helps you quickly verify that your installation was successful. However, in almost all cases, an organization should request a certificate from a trusted certificate authority (CA) and configure the software to use it. The certificate can be signed by a corporate (internal) or commercial CA. A commercial (well-known) CA certificate must be used for externally resolvable DNS names, e.g. the load balancer VIP DNS; internal domain certificates can be used for internal servers.
For an ArcGIS Enterprise system with an externally resolvable DNS name, if the load balancer's SSL method is SSL passthrough (the load balancer does not decrypt and re-encrypt HTTPS content), the load balancer itself does not require a certificate, and a commercial CA certificate must be installed on the mapped servers (e.g. the web servers where the web adaptors are installed). If the load balancer's SSL method is SSL re-encryption, a commercial CA certificate must be installed on the load balancer.

34 Comments
RyanUthoff
Frequent Contributor

This is a very informative post, thank you!

I do have a question regarding the ArcGIS Server Administrative URL. Let's say we initially deployed ArcGIS Enterprise as a single machine deployment and initially configured the Administrative URL using the machine name. But later, we added a second machine to the site but did not re-configure the Administrative URL to point to the web adaptor URL. Obviously we want to change that Administrative URL to point to the web adaptor URL so the server is truly HA (we had the machine referenced in the URL offline for an extended period; luckily our services continued to work from the second server, but we did lose some functionality such as publishing S123 forms).

Are there any concerns with just changing that Administrative URL to the web adaptor URL? I see there is documentation regarding this, but I'm curious to see if there are any concerns or potential issues when changing the URL (such as broken feature/host feature services, etc.). All of our services are published using our web adaptor URL.

Configure an existing ArcGIS Enterprise deployment for high availability—Portal for ArcGIS (10.6) | ... 

NoahMayer
Esri Contributor

The Administrative URL should also point to a load balancer end point and not to a lone web adaptor machine, since then the web adaptor machine becomes a single point of failure.

For the Administrative URL you should either use the same URL as the services URL, e.g. a load balancer URL which forwards the requests to web adaptor machines (or to the lone web adaptor machine, if you don't have two web adaptor machines and load balancer), or a dedicated load balancer URL which forwards the requests to the ArcGIS Server machines via port 6443. I explained the difference between these two options in my post. 

To update the Administrative URL for existing federated sites:

  1. Go to portal's Sharing API (https://portal.domain.com/webadaptor/sharing/rest)
  2. Log in as administrator
  3. Go back to ArcGIS Portal Directory Home
  4. Click Portals, then Self
  5. Under the Child Resources section, click Servers, then select the Server ID of your federated server
  6. Update the Administrative URL
  7. Click Update Server
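The same Update Server operation can also be scripted against the Sharing API. The sketch below only builds the form parameters and endpoint; it is an assumption-laden illustration (the web adaptor path, server ID, token handling, and the exact set of required server-record fields are placeholders you would take from your own deployment's Portal Directory page).

```python
import json
import urllib.parse
import urllib.request

def update_server_params(admin_url: str, name: str, url: str,
                         is_hosted: bool, server_type: str, token: str) -> bytes:
    """Form body for the Portal Directory's Update Server operation.
    The operation expects the full server record, not just the field
    being changed, so the existing values must be passed back too."""
    return urllib.parse.urlencode({
        "name": name,
        "url": url,                # services URL
        "adminUrl": admin_url,     # the value being corrected
        "isHosted": json.dumps(is_hosted),
        "serverType": server_type,
        "token": token,
        "f": "json",
    }).encode()

def update_admin_url(portal_base: str, server_id: str, body: bytes) -> dict:
    # portal_base is a placeholder, e.g. "https://portal.domain.com/webadaptor"
    endpoint = f"{portal_base}/sharing/rest/portals/self/servers/{server_id}/update"
    with urllib.request.urlopen(urllib.request.Request(endpoint, data=body)) as resp:
        return json.loads(resp.read())
```

As in the manual steps, point adminUrl at a load balancer endpoint, never at a single machine.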
DavidHoy
Esri Contributor

Thank you Noah for an extremely useful post.
The recommendations regarding Load Balancer Health Check settings are especially useful as I have not seen these documented elsewhere.

I like the recommendation to use Split DNS when you have a site with external users. Often, for simplicity, Network Admins choose not to do this, and we end up with internal users' requests needing to go out through the firewall to the DMZ and then come back in again. Most times, this isn't really too much of a problem, but sometimes the organisation has a forward proxy and the web adaptor/load balancer address may need to be whitelisted. It can also add some extra latency to requests/responses.

DavidHoy
Esri Contributor

Further commentary on this great article - this time regarding stickiness.

A lesson we have learnt at some sites in Australia that have HA configuration is that if stickiness is not enabled, there can be problems caused during automated publishing, overwriting and/or removal of services. If the process takes a non-zero time, there can be stability problems while the "secondary" portal and/or Data Store nodes are synchronised. The scripting sends follow-up requests to the Portal/Server to check the action has completed, and if this follow-up request goes to a different node than the original request, the result may not be what is expected.

To avoid this, we are now recommending stickiness SHOULD be enabled.
From what @NoahMayer says in the article, Esri testing has not demonstrated that stickiness being enabled causes any issues, so we are confident in our logic. 

BenjaminBlackshear
Regular Contributor

The default polling interval at 10.8 and beyond is now 1 second when it used to be 30, and the default failure trigger is now 3 when it used to be 5. Are these new defaults now Esri's recommended values, or are the ones listed in this article the most current recommendations?

NoahMayer
Esri Contributor
Hi Benjamin,

What are you referring to with "the default polling interval at 10.8 and beyond is now 1 second when it used to be 30 and the default failure trigger is now 3 when it used to be 5"? Is there specific documentation that states this information?

BenjaminBlackshear
Regular Contributor

I am referring to this documentation for setting the portal-ha-config.properties:

Configure a highly available portal—Portal for ArcGIS | Documentation for ArcGIS Enterprise

are these the settings this article is referring to?

DavidHoy
Esri Contributor

Hi Benjamin,

The section in the help you are referring to is not about the load balancer; the settings referenced there are for internal portal machine-to-machine checking (not via the load balancer). I don't think there will be any reason to alter these from their defaults.
But... @NoahMayer, do you know if the old portal-portal config settings will be retained if upgrading from 10.7.1 to 10.8.1?

 

NoahMayer
Esri Contributor

@DavidHoy  From a check I did on an upgraded environment, old portal-portal config settings were retained after upgrading from 10.7.1 to 10.8.1

DavidHoy
Esri Contributor

@NoahMayer - So, I guess it is now being suggested we SHOULD alter existing upgraded environments so that the portal-portal config matches the 10.8+ default values

i.e. Check at 1 second intervals and 3 fails before failover

raja-gs
Occasional Contributor

Hello,

Thanks for the detailed article. I have a question related to HA setup we're implementing. Our setup doesn't have Web adapters and allows portal and server traffic directly on ports 7443 & 6443 respectively. For the private portal URL / admin access, we're setting up a private load balancer using a hostname, say, gisadmin.company.com. 

Should the load balancer always allow admin traffic via ports 6443 and 7443? Our network security team is reluctant to allow traffic to the F5 load balancer on ports other than the default 443. Note that the goal is to have a single load balancer/hostname and direct traffic to portal and server machines based on port number. I just want to know if there are any other approaches I might be missing.

e.g., gisadmin.company.com:7443/arcgis/portaladmin 

gisadmin.company.com:6443/arcgis/manager

DavidHoy
Esri Contributor

Hi,
are you locked in to the idea you want to direct traffic by using the ports?

This would mean you need to allow incoming requests in F5 on :6443 and :7443,

I am not surprised your Security team are unwilling - this would definitely not be a recommended practice in any site facing the public web.

If it is a new site and you have not yet locked in to using those paths, then

  • With F5, you could setup forwarding/rewrite rules that sent any requests for https://gis.company.com/portal  to portalmachine1:7443/arcgis & portalmachine2:7443/arcgis
    and requests for https://gis.company.com/server to agsmachine1:6443/arcgis & agsmachine2:6443/arcgis
    The WebContextURL for the Portal and the Server would need to be set to gis.company.com.
    • You could choose to have just the one load balancer in the DMZ and use its public address for the federation privatePortalURL and ServerAdminURL and allow users and machines to communicate via this path.

or

    • use an internal load balancer for the federation admin URLs that has the same targets, but uses admingis.company.com as incoming URL 
    • To have this work to allow administrative tasks (including re-publishing or updating an existing service) to be performed from client machines in your network, the admingis.company.com address would need to be resolvable from your client desktops as well as from the servers in your data centre.

If you are however stuck with using the /arcgis paths

  • It is possible to setup a list of addresses to look for to be routed to the  :7443 portal targets (e.g. /arcgis/home, /arcgis/portaladmin, /arcgis/sharing)
    and
    :6443 server targets (everything else)

I hope this has not been too convoluted

raja-gs
Occasional Contributor

Thanks, David. I think I wasn't clear 🙂 Ports 7443 & 6443 will NOT be exposed to public. As you said, they must go through public URL, gis.company.com.

I was referring to the internal Admin load balancer setup for admin communication / private URLs. I think it matches with the second option you explained i.e., to have admingis.company.com for private communication. For this internal admin balancer, my understanding is that it has to accept traffic on 6443 & 7443 and then redirect, based on port #, to portal & server machines accordingly. These URLs will also be used for federation. I'm wondering if I'm missing any other option - rather than allowing traffic on 7443/6443 for this internal load balancer. Thanks again for your time.

e.g., admingis.company.com:7443/arcgis/portaladmin 

admingis.company.com:6443/arcgis/manager

 

This setup is similar to the below recommendation from ESRI - without the web adapters.

rajags_0-1627358319399.png

 

DavidHoy
Esri Contributor

Sorry, I did miss that you were talking only about the internal Load Balancer.

Well - you could - as you say (and as the Esri diagram shows) use :7443/:6443 as incoming for the internal LB, but, as your IT group are not keen on that, you could instead use the strategy I suggested, since you have an F5 that is capable of URL rewrites:

direct requests coming in on gisadmin.organisation.com/portal to the portal servers on :7443 and  for gisadmin.organisation.com/server to the server site machines on :6443

you will need to set up the Health Checks for these as Noah's article suggests

raja-gs
Occasional Contributor

Got it, David. Thanks a lot for the explanation.

NewlandAgbenowosi1
New Explorer

@NoahMayer  and team, I have a question about the impact of not using the web adaptor in a linux deployment for ArcGIS Enterprise.  

We are considering forgoing the use of the web adaptor. It looks like, after we configure with tomcat, we still need to configure with a load balancer when we are implementing our high availability configuration.  In that light, given that the web adaptor is an optional component, can we discard it and use the architecture in figure 3?  I have included Figure 1 that I obtained from this blog (https://community.esri.com/t5/implementing-arcgis-blog/planning-load-balancer-configuration-for-high...).  Figure 2 is the approach we are considering for the case if we decide to use the web adaptor. 

Note that our site is internally facing so we do not require external access. The dotted lines in figures 2 and 3 refer to the federation of server with portal.

I would also appreciate it if anyone on this forum who has a Linux deployment could share any insights they may have. The insights may include any issues with migrating data from a Windows deployment to a new Linux deployment.

Thank you for your help.

 

 

Figure 1 (from Esri)

NewlandAgbenowosi1_0-1653483835933.png
Figure 2 (using web adaptor)

 

NewlandAgbenowosi1_1-1653483835953.png

Figure 3  (Preferred without the web adaptor)

 

NewlandAgbenowosi1_2-1653483835972.png

 

MichalGrinvald1
Occasional Explorer

Hi,

I have a question about accessing server manager on a specific machine.

I can access the server manager through the Load balancer, but, when trying to access it through one of the machines (either one) I get an "unauthorized" message. It does not matter whether I use the 6443 access point or the Web adaptor access point. The CA includes both the LB and the local machine name in the subject alternative names.

Is there a way of accessing the manager app of a specific machine in the HA configuration?

Thanks

Michal

NoahMayer
Esri Contributor

@NewlandAgbenowosi1 you can deploy highly available ArcGIS Enterprise without web adaptors. Please refer to this documentation: Integrate your portal with a reverse proxy or load balancer

Things to notice are the load balancer type (layer 3/4 or layer 7), port configuration (e.g. 443>7443 and 443>6443, and potentially 7443>7443 & 6443>6443), supported subdomains and site context names, the X-Forwarded-Host header and redirects, and health check configuration.

Another point to consider is how to configure your privatePortalURL and Server Administrative URLs. You want those URLs to point to a load balancer URL and not to a machine URL for high availability. These can be either the same URLs as the WebContextURL / services URLs, or separate load balancer services via the internal ports, e.g. 7443 & 6443.

NoahMayer
Esri Contributor

@MichalGrinvald1 I'm familiar with this issue but I'm still investigating it. In some cases, the solution provided for scenario 1 in this article, whitelisting the FQDN of the server, solved the issue for some customers, depending on their configuration.

MichalGrinvald1
Occasional Explorer

Thanks @NoahMayer , I have  tried both options, but the issue remains..

Did you find anything else?

Thanks

Michal

NoahMayer
Esri Contributor

@MichalGrinvald1  According to this document, Connect to Server Manager, this is expected behavior:

  • If you connect directly to ArcGIS Server, the URL is formatted https://gisserver.domain.com:6443/arcgis/manager. If the site includes multiple machines, this will be the URL of the machine you specified for the Administration URL when federating your site.

 

AnthonyRyanEQL
Frequent Contributor

Noah,

I’m a little confused with Portal HA. I have 2 EC2s in AWS with an ALB for portal setup. Followed the steps in https://enterprise.arcgis.com/en/portal/latest/administer/windows/configuring-a-highly-available-por... and the status of the machines using the portaladmin URL is primary/standby. Any request to the ALB for portal works if it's directed to the machine with primary status. Reading the documentation in the URL suggests traffic is distributed between both machines, but in my case it only works on the primary. At the moment, I have the standby up and running but removed from the target group of the ALB and rules. I must be missing a configuration somewhere. ArcGIS Enterprise 11.1 installed.

Any help in this area would be greatly appreciated as I was trying to find out how an alb could ping the statuses of the machines without a token to determine how to route the traffic to only the primary machine.

thanks

NoahMayer
Esri Contributor

@AnthonyRyanEQL The behavior you are describing suggests that your system is not healthy or there's some misconfiguration. In a healthy system requests should be distributed between both primary and standby portal machines. The role, primary/standby, is related to the portal database, which is active on the primary machine and passive on the standby machine, but portal web server is active on both machines thus both machines can accept requests. Regarding the health check, https://<host>:7443/arcgis/portaladmin/healthCheck?f=json, when the system is healthy both machines should return success. 

I would advise the following:

  1. Verify that the ArcGIS account (running the Portal for ArcGIS Windows service) on the standby machine can access the portal content directory and has full permissions on the directory
  2. Verify all required ports are opened between the portal machines
  3. Verify the standby machine can access the portal db on the primary machine (and vice versa) - check portal's and portal db's logs files on each machine
  4. Check that the portal index is the same on both machines, if not rebuild the index 
  5. Open a support case with Esri
rshihab
Occasional Contributor

Hi @AnthonyRyanEQL 

Did you add the private url in the portal

{
   "privatePortalURL": "https://lb.domain.com/portal",
   "WebContextURL": "https://lb.domain.com/portal"
}

Setting portal private url 

Regards 

 

AnthonyRyanEQL
Frequent Contributor

Hi @rshihab 

I sure did. We used https://portal-sbx.domain.com/arcgis and we have a Route 53 entry that resolves it to the ALB with a rule to forward to the specified target group which is configured with the static IP address attached to the EC2 instance

The other ALB & Route 53 configurations we have are the https://hosting-sbx.domain.com/arcgis, https://gp-sbx.domain.com/arcgis, https://un-sbx.domain.com/arcgis, and https://map-sbx.domain.com/arcgis server sites. These are federated with Portal and work as expected.

AnthonyRyanEQL
Frequent Contributor

Hi @NoahMayer 

Thanks for the information.

I can hit both machines' healthCheck endpoints and get a {"status":"success"} response returned.

Some information relating to your points, which I will double-check again:

1. Portal is running with a local Windows account on each machine, with the same name and permissions on both. We don't really have AD set up in AWS, but a lightweight version was configured. This is for the FSx file share that hosts the portal content directory for each machine. A drive is mapped with the credentials to allow each machine access to the file share.

2. We followed the port numbers from Ports used by Portal for ArcGIS—ArcGIS Enterprise | Documentation for ArcGIS Enterprise when setting up the security group for inter-machine comms

3. Are you able to point me in the right direction on how to do this? I did this many moons ago and can't remember where to look.

4. Same as 3, please.

5. I will if the above doesn't pan out.

Thanks again for your help on this matter.

NoahMayer
Esri Contributor

@AnthonyRyanEQL 

3. You can find portal log files under ...\arcgisportal\logs\<machine_fqdn>\portal, and portal db log files under ...\arcgisportal\logs\database\pg_log. Check whether there are any errors there indicating the machine cannot access the primary portal db, or any other errors that might suggest a problem with the configuration/environment
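
Scanning those directories for high-severity entries can save some scrolling. A minimal sketch (the marker strings are illustrative, not an exhaustive list of ArcGIS/PostgreSQL severity levels):

```python
import pathlib

def find_errors(log_dir, markers=("SEVERE", "ERROR", "FATAL")):
    """Collect lines containing any high-severity marker from *.log files
    found recursively under log_dir."""
    hits = []
    for path in sorted(pathlib.Path(log_dir).rglob("*.log")):
        for line in path.read_text(errors="ignore").splitlines():
            if any(marker in line for marker in markers):
                hits.append(f"{path.name}: {line.strip()}")
    return hits

# Example: find_errors(r"C:\arcgisportal\logs")  # path is an assumption
```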

4. Can you access https://<machine_fqdn>:7443/arcgis/portaladmin/ from the standby machine? If yes, you can check the index status from https://<machine_fqdn>:7443/arcgis/portaladmin/system/indexer on both machines and see that they match
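
If you save both machines' indexer status responses as JSON, a small helper makes any mismatch obvious. The response shape assumed below (an "indexes" list of entries with name and indexCount fields) is illustrative; adjust the keys to whatever your portal actually returns:

```python
def indexes_match(status_a, status_b):
    """Compare per-index counts from two machines' indexer status responses.

    Assumes each response is a dict carrying an "indexes" list of
    {"name": ..., "indexCount": ...} entries (illustrative shape).
    """
    def counts(status):
        return {ix["name"]: ix["indexCount"] for ix in status.get("indexes", [])}
    return counts(status_a) == counts(status_b)
```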

Another test (one that will validate all of the above) you might want to consider is stopping the primary portal service to trigger a portal role switch. The standby machine should promote itself to the primary role after a few minutes, and if everything works as expected after that happens, the configuration is correct. I would still check the logs first to see if anything there indicates a problem, but if everything in the logs looks good, I would test a role switchover.

Ranga_Tolapi
Frequent Contributor

Is this the right way to depict the internal load balancer between Portal for ArcGIS and ArcGIS Server for administrative communication?

[attached diagram: internal load balancer between Portal for ArcGIS and ArcGIS Server]

DavidHoy
Esri Contributor

Yes - this is one of the possible ways to set up the administrative paths.

But if setting up an internal load balancer is onerous, it is possible to just use the public load balancer addresses via port 443, i.e. privatePortalURL the same as WebContextURL, and the Server Admin URL the same as the Server Services URL.
But this does mean:
- Admin access must be enabled in the Server Web Adaptor
- The public load balancer must be accessible from the Portal and Server VMs (this isn't always the case when the public ALB is in a DMZ)

NoahMayer
Esri Contributor

@Ranga_Tolapi - the internal load balancer, which will usually be used for the portal's privatePortalURL and the federated servers' Administrative URL, handles intermachine communication between the federated servers and the portal machines, but also administrative operations, including publishing referenced services. So it's not only between Portal for ArcGIS and ArcGIS Server, but also between admin/publisher users and ArcGIS Enterprise.

As David mentioned, it is very common for our customers to use the external load balancer URL for these URLs as well, and then you don't need an internal load balancer. This is a simpler configuration. Some customers prefer to separate the external and internal load balancers for security and/or network considerations.

AnthonyRyanEQL
Frequent Contributor

@NoahMayer We had to change /healthCheck?f=json to /healthCheck?f=html, as an AWS ALB only evaluates the return status code and can't parse the message (e.g. a JSON string)

[screenshot: ALB target group health check settings]

NoahMayer
Esri Contributor

@AnthonyRyanEQL ArcGIS Enterprise components (portal and server) return status code 200 when they are accessible, with the health check result in the response body. Checking only the return status code will work in cases where portal and server are not accessible and return 4XX and 5XX status codes. But when they are accessible and the health check itself fails, e.g. ArcGIS Server returns a 200 status with a {"success": false} message, the load balancer will keep sending requests to them.
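
One workaround, if you want the ALB to see those body-level failures, is a small health shim that re-queries the component and surfaces an unhealthy body as a non-200 code the ALB can act on. The sketch below shows only the mapping logic (not the HTTP endpoint around it); the field names follow the portal ({"status": "success"}) and server ({"success": false}) responses mentioned above:

```python
import json

def alb_status(backend_status, body):
    """Map an ArcGIS health check response onto a status code an ALB can act
    on: 200 only when the component is reachable AND reports success."""
    if backend_status != 200:
        return backend_status          # pass 4XX/5XX through unchanged
    try:
        payload = json.loads(body)
    except ValueError:
        return 503                     # unparseable body -> treat as unhealthy
    if not isinstance(payload, dict):
        return 503
    ok = payload.get("status") == "success" or payload.get("success") is True
    return 200 if ok else 503
```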

dmangold
New Explorer

@NoahMayer Under the Throttling section of this article, you mention that "ArcGIS Server has settings to control this behavior". Are there settings beyond what is available for individual service configuration in Server Manager? I'm struggling with sporadic communication failures between a custom .NET app and a custom GP service. Requests to poll the GP service job status usually go through fine, but sporadically the first status request fails immediately with an error stating the app is "Unable to read data from the transport connection." When this happens, I can immediately access ArcGIS Server in a browser and manually check the job status. The GP jobs always run to completion without errors.

NoahMayer
Esri Contributor

@dmangold Why do you think the issue is related to throttling? I would advise you to open a case with Esri support to investigate this.

About the Author
I am a senior Enterprise Solutions Architect within Esri's Implementation Services Department. I have over 15 years of experience in technical consulting, solution design and system architecture.