
502 Bad Gateway error appears intermittently on ArcGIS Enterprise deployed by Cloud Builder on Azure

10-06-2022 04:50 AM
AhmedShehata3
Occasional Contributor

Hello,

I'm having a very serious and annoying issue with ArcGIS Enterprise installed by Azure Cloud Builder.

 

Issue:

The 502 Bad Gateway error appears very frequently and at random times, and the Portal becomes inaccessible. The issue first appeared a few months ago on the 10.8.1 Base Deployment, and it still persists after upgrading to 10.9.1.

 

Behaviour and workarounds:

Sometimes the Portal comes back up on its own, and sometimes I have to restart the Portal Windows service or the entire Portal machine. Sometimes it stays stable for weeks and then, all of a sudden, it fails several times a day. Usually the ArcGIS Server Manager URL keeps working normally, so it seems to be an issue with the Portal app/machine.

 

Troubleshooting:

I usually check the Event Viewer but can't find any lead; the same goes for the Portal Admin/machine logs. I also ran an SSL checker, and the certificate chain seems fine.

 

I have no idea what could cause this unstable behaviour and don't know where to start investigating. Does anyone have any idea?

Thanks

AhmedShehata3
Occasional Contributor

Hi @FraserHand ,

We deployed all components in West Europe.

I can't actually tell how long it takes to return the 502. How could I check that? And is it related to a specific item request? The EGIS is public and accessed by many users around the world.

Also, how did you manage to deploy only the AGW in a different region? Does Azure Cloud Builder let you decide? I can't remember!

FraserHand
Frequent Contributor

Morning,

If you open the dev tools in your browser (F12), load your portal and watch the requests go through, you can check the Time column whenever you see a 502 response. Our 502s were almost always at 60 secs.
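If you'd rather log this over hours than watch the F12 Network tab, a small script can record the same two numbers (status code and elapsed time) for each request. A minimal sketch using only the Python standard library; `probe` is a hypothetical helper, and the 180 s client-side timeout is an assumption chosen to sit above the 60 s mark mentioned above.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=180.0):
    """Send one GET to `url` and return (status, elapsed_seconds).

    status is the HTTP code, or None if no HTTP response came back at all
    (DNS failure, connection refused, client-side timeout).
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()  # include the body transfer in the timing
            status = resp.status
    except urllib.error.HTTPError as err:
        # A 502/504 from the gateway arrives as HTTPError; keep its code.
        status = err.code
    except OSError:
        # URLError, socket timeouts and resets are all OSError subclasses.
        status = None
    return status, round(time.monotonic() - start, 1)
```

Calling `probe("https://gis.example.com/portal/home/")` (hypothetical URL) once a minute from outside Azure and logging the results shows whether the 502s cluster at a fixed duration such as 60 s, which points at a timeout, or come back almost immediately, which points elsewhere.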

We haven't used Cloud Builder, as we have a dedicated Azure deployment pattern that Cloud Builder doesn't fit. If you have access to the Azure portal, you can create a public IP and an App Gateway in a new region, reuse the settings from your existing App GW, and then update your DNS (if you are using custom DNS). ArcGIS is mostly concerned with the WebContextUrl setting and with the X-Forwarded-Host header being set, so check those. Also review the Web Adaptor settings that Cloud Builder creates. If it doesn't work, you can still rewire back to the existing App GW. You may be a little constrained by your SLAs, as it would require a bit of an outage, and you'll need some Azure skills to deploy an App GW manually.
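Before rebuilding the gateway, it may be worth checking that the values mentioned above agree with each other. A rough sketch that compares the public URL users open, the Portal's WebContextUrl value (assumed to be copied by hand from the Portal configuration) and, optionally, the X-Forwarded-Host value the gateway sends. `context_mismatches` is a hypothetical helper that only checks scheme, host, port and path prefix; it is not an Esri validation rule.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def _parts(url):
    """Split a URL into (scheme, host, port, path segments), filling default ports."""
    s = urlsplit(url)
    scheme = s.scheme.lower()
    host = (s.hostname or "").lower()
    port = s.port or DEFAULT_PORTS.get(scheme)
    segments = [p for p in s.path.split("/") if p]
    return scheme, host, port, segments


def context_mismatches(public_url, web_context_url, forwarded_host=None):
    """List inconsistencies between the public Portal URL, WebContextUrl and
    (optionally) the X-Forwarded-Host value. An empty list means no mismatch found."""
    p_scheme, p_host, p_port, p_segs = _parts(public_url)
    c_scheme, c_host, c_port, c_segs = _parts(web_context_url)
    problems = []
    if p_scheme != c_scheme:
        problems.append(f"scheme {p_scheme!r} != {c_scheme!r}")
    if p_host != c_host:
        problems.append(f"host {p_host!r} != {c_host!r}")
    if p_port != c_port:
        problems.append(f"port {p_port} != {c_port}")
    # The public URL should live under the WebContextUrl path (e.g. /portal/home under /portal).
    if p_segs[: len(c_segs)] != c_segs:
        problems.append(f"path /{'/'.join(p_segs)} is not under /{'/'.join(c_segs)}")
    # Compare the forwarded host without any :port suffix.
    if forwarded_host is not None and forwarded_host.lower().split(":")[0] != c_host:
        problems.append(f"X-Forwarded-Host {forwarded_host!r} != {c_host!r}")
    return problems
```

For example, a WebContextUrl that still points at an internal `http://<ip>:7443/arcgis` address would be flagged on scheme, host, port and path at once.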

Thanks

NicholasSadowy
Occasional Contributor

I have a resolution for my specific problem. We were getting a 504 Application Gateway Timeout. The solution was increasing the timeout of both the listener and the health probe from 20 seconds to 120 seconds in the Application Gateway settings. 

FraserHand
Frequent Contributor

Thanks, I'll have a look at our settings. We changed the Tomcat timeout in our deployment from 60 secs to 180 secs and saw the response change from a 502 to a 504, which indicated it was something with Portal, but I will check our App GW settings.
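The 502 versus 504 shift can be reasoned about as "whichever side gives up first". The sketch below is a simplified model, not documented Application Gateway behaviour: it assumes the gateway answers 504 when its own request timeout expires first, and 502 when the backend (e.g. Tomcat) closes the connection first. `expected_status` and the numbers in the usage note are hypothetical.

```python
def expected_status(duration, gateway_timeout, backend_timeout):
    """Predict what a client sees for one slow request (simplified model).

    duration:        seconds Portal needs to produce the response
    gateway_timeout: seconds the Application Gateway waits for the backend
    backend_timeout: seconds before the backend (e.g. Tomcat) gives up and
                     closes the connection
    """
    if duration <= min(gateway_timeout, backend_timeout):
        return 200  # the response arrives before either side gives up
    if gateway_timeout <= backend_timeout:
        return 504  # the gateway gives up first: Gateway Timeout
    return 502      # the backend drops the connection first: Bad Gateway
```

With a hypothetical gateway timeout of 120 s, `expected_status(90, 120, 60)` gives 502 and `expected_status(150, 120, 180)` gives 504, the same pattern as raising Tomcat from 60 to 180 secs. Likewise `expected_status(30, 20, 60)` gives 504 while `expected_status(30, 120, 60)` gives 200, in line with raising the gateway timeout from 20 to 120 seconds.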

Thanks

AhmedShehata3
Occasional Contributor

@FraserHand  @NicholasSadowy  thank you for sharing your workarounds. I can see you both suggest it's a timeout error, and I'd like your comments on the points below so I can correlate the AGW settings with the Portal error (as it doesn't happen on the Server).

1- When the Portal gets hit by this error, it takes 1 second to return 502 on the Portal's Home page, not just for a specific service. The only workaround is restarting either the Portal Windows service or the entire VM. So, it's something with the Portal!  

2- When the Portal is publicly inaccessible, I can still access it from the VM/VNet using the FQDN:7443/arcgis URL format. However, the content items do not open normally. Is this the case for you? I'm asking because the last time this happened (on Friday), I checked all services and found one imported with an unsecured HTTP URL. After removing it, the Portal has been behaving normally so far, but I'm not sure it's the root cause.

3- In terms of frequency, it may take several weeks for the issue to happen, but when it does, it can recur very frequently, e.g. every hour.

4- Sometimes the Portal comes back up after 5 minutes, but in the last incident I had to restart the Portal manually to bring it back.
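Points 1 and 2 amount to a triage: compare what the public URL returns with what the FQDN:7443/arcgis URL returns from inside the VNet at the same moment. A rough sketch of that decision; `locate_failure` is hypothetical, the inputs are whatever status codes you observe (browser or script, None if nothing came back), and the labels are heuristics rather than a diagnosis.

```python
def locate_failure(public_status, internal_status):
    """Rough triage from two observations taken at the same moment.

    public_status:   HTTP status via the public URL (through the gateway)
    internal_status: HTTP status via https://<FQDN>:7443/arcgis from inside the VNet
    Either may be None if the request got no HTTP response at all.
    """
    def ok(status):
        return status is not None and 200 <= status < 400

    if ok(public_status):
        return "ok"
    if ok(internal_status):
        # Portal answers directly, so look at the gateway, its health probe
        # and the Web Adaptor path in between.
        return "gateway"
    # Portal does not answer even on port 7443.
    return "portal"
```

A "gateway" result while the public URL returns 502 would fit the health-probe and timeout explanations discussed in this thread; a "portal" result fits the cases where only a Portal restart helps.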

Please let me know if you experienced similar behavior. I'm sure the Portal is impacted somehow even if the AGW is the issue. 
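To find items like the one with the unsecured HTTP URL, one option is to search the portal's items and filter on the URL scheme. A minimal sketch, assuming you've saved the JSON returned by the Portal sharing REST search endpoint (an object with a `results` list whose entries carry `id`, `title` and `url`); `insecure_items` is a hypothetical helper, and since search results come back in pages you would run it on each page.

```python
import json
from urllib.parse import urlsplit


def insecure_items(search_json):
    """Return (id, title, url) for every item whose URL uses plain http://."""
    results = json.loads(search_json).get("results", [])
    found = []
    for item in results:
        url = item.get("url") or ""  # many item types (e.g. web maps) have no URL
        if urlsplit(url).scheme.lower() == "http":
            found.append((item.get("id"), item.get("title"), url))
    return found
```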

TylerSavoy
New Contributor

I am having the exact same issue. I had to restart my Portal VM twice today already. The VM no longer lets me start, stop or restart the Portal service, so I have to restart the VM. I also have an Azure Cloud Builder install. Any fix since the last post in October?

AhmedShehata3
Occasional Contributor

There's no fix yet, but the only workaround that has made things a bit better is increasing the timeout and interval parameters of the health probes in the Azure Application Gateway settings to 180 seconds or more.
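Raising the probe interval and timeout plausibly helps because it lengthens how long Portal must stay unresponsive before the gateway marks the backend unhealthy and starts returning 502 for every request. A rough estimate under a simplified model (probes every `interval` seconds, each failure known after `timeout` seconds, backend marked unhealthy after `unhealthy_threshold` consecutive failures); this is an assumption, not Azure's documented algorithm, and the helper names and the threshold of 3 in the example are hypothetical.

```python
def seconds_until_unhealthy(interval, timeout, unhealthy_threshold):
    """Rough seconds from the first failed probe being sent until the gateway
    has seen `unhealthy_threshold` consecutive failures (simplified model)."""
    if min(interval, timeout) <= 0 or unhealthy_threshold < 1:
        raise ValueError("interval/timeout must be > 0 and threshold >= 1")
    # The last failing probe is sent (threshold - 1) intervals after the first
    # and is declared failed `timeout` seconds later.
    return (unhealthy_threshold - 1) * interval + timeout


def stall_marks_unhealthy(stall, interval, timeout, unhealthy_threshold):
    """True if a Portal stall of `stall` seconds is long enough, under the same
    model, for the gateway to mark the backend unhealthy."""
    return stall >= seconds_until_unhealthy(interval, timeout, unhealthy_threshold)
```

With interval 30 s, timeout 30 s and threshold 3 the estimate is about 90 s; with 180/180/3 it is about 540 s, so a two-minute Portal hiccup would no longer take the backend out. The trade-off is that a genuine outage is also detected later.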
