ArcGIS Enterprise on Kubernetes Terraform Helm Provider Charts Error

02-11-2024 10:43 PM
JoseSousa1
New Contributor II

Hi guys,

I am deploying ArcGIS Enterprise 11.2 on Azure Kubernetes (v1.28.3) using Helm Charts v1.2.0.

If I unpack the chart (*.tgz), set the variables in the configure.yaml and values.yaml files, and then install it directly with helm, everything works fine.

However, the install fails intermittently when I use the Terraform Helm Provider, which points directly to the *.tgz chart file and sets the variables dynamically.

I know the variables are set properly, because the same configuration does succeed some of the time.

Error:

helm_release.arcgis: Creating...
helm_release.arcgis: Still creating... [10s elapsed]
helm_release.arcgis: Still creating... [20s elapsed]
helm_release.arcgis: Still creating... [30s elapsed]
helm_release.arcgis: Still creating... [40s elapsed]
helm_release.arcgis: Still creating... [50s elapsed]

│ Warning: Helm release "arcgis" was created but has a failed status. Use the `helm` command to investigate the error, correct it, then run Terraform again.

│ with helm_release.arcgis,
│ on helm-arcgis.tf line 1, in resource "helm_release" "arcgis":
│ 1: resource "helm_release" "arcgis" {

│ Error: failed pre-install: 1 error occurred:
│ * job failed: BackoffLimitExceeded

│ with helm_release.arcgis,

Details:

The pod that fails (e.g. arcgis-pre-install-hook-job-xss8n) logs the following:

...
Waiting for pods to startup...

POD                                                  STATUS
arcgis-ingress-controller-9cb458d7-spkgk             Running
arcgis-help-557765bcc5-kdh4s                         Running
arcgis-rest-administrator-api-84cc8d9b86-mmqk4       Running
arcgis-enterprise-manager-6c67864b47-hgbh9           Running

Waiting for site availability..............................Ready
-------------------------------------------------------------------------------
                           S U C C E S S !
-------------------------------------------------------------------------------
ArcGIS Enterprise 11.2 on Kubernetes 11.2.0.5207 has been successfully deployed.

Create the ingress or OpenShift route resource now, if you have not done so yet.

The secure route should direct traffic to the arcgis-ingress-nginx service
using either re-encrypt or passthrough for TLS termination.

If using a DNS alias, you should create a CNAME record that resolves
the DNS alias to the canonical router hostname for the cluster.

Use the following URL to access the ArcGIS Enterprise 11.2 on Kubernetes
Setup Wizard and configure your organization:

    https://myapp.cloud.mywebsite.co.nz/arcgis/manager

Use this URL to access ArcGIS Enterprise 11.2 on Kubernetes help:

    https://myapp.cloud.mywebsite.co.nz/arcgis/help/en/

------------------------------------------
Waiting for enterprise URL...
------------------------------------------
Waiting 900 seconds for https://myapp.cloud.mywebsite.co.nz/arcgis/admin...
Connection successful.
HELM_CONFIGURE=true
------------------------------------------
Running configure...
------------------------------------------

Properties for /tmp/decoded-configure.properties:

HELM_DEPLOY=true

K8S_NAMESPACE="arcgis"
ARCGIS_SITENAME="arcgis"

ARCGIS_ENTERPRISE_FQDN="myapp.cloud.mywebsite.co.nz"
CONTEXT="arcgis"
ROOT_ORG_BASE_URL="https://${ARCGIS_ENTERPRISE_FQDN}/${CONTEXT}/"

SYSTEM_ARCH_PROFILE="development"

# Hard-coded
LICENSE_FILE="/tmp/license.file.json"
LICENSE_TYPE_ID="creatorUT"

ENCRYPTION_KEYFILE=""

ADMIN_USERNAME="portaladmin"
ADMIN_PASSWORD="somepassword"
ADMIN_EMAIL="my_email@org.co.nz"
ADMIN_FIRST_NAME="Jose"
ADMIN_LAST_NAME="De Sousa"

SECURITY_QUESTION_INDEX=1
SECURITY_QUESTION_ANSWER="Esri"

# Hard-coded
CLOUD_CONFIG_JSON_FILENAME=""
LOG_SETTING=""
LOG_RETENTION_MAX_DAYS=""

RELATIONAL_STORAGE_TYPE="DYNAMIC"
RELATIONAL_STORAGE_SIZE="16Gi"
RELATIONAL_STORAGE_CLASS="arcgis-managed-premium-delete"
RELATIONAL_STORAGE_LABEL_1=""
RELATIONAL_STORAGE_LABEL_2=""

OBJECT_STORAGE_TYPE="DYNAMIC"
OBJECT_STORAGE_SIZE="32Gi"
OBJECT_STORAGE_CLASS="arcgis-managed-premium-delete"
OBJECT_STORAGE_LABEL_1=""
OBJECT_STORAGE_LABEL_2=""

MEMORY_STORAGE_TYPE="DYNAMIC"
MEMORY_STORAGE_SIZE="16Gi"
MEMORY_STORAGE_CLASS="arcgis-managed-premium-delete"
MEMORY_STORAGE_LABEL_1=""
MEMORY_STORAGE_LABEL_2=""

QUEUE_STORAGE_TYPE="DYNAMIC"
QUEUE_STORAGE_SIZE="16Gi"
QUEUE_STORAGE_CLASS="arcgis-managed-premium-delete"
QUEUE_STORAGE_LABEL_1=""
QUEUE_STORAGE_LABEL_2=""

INDEXER_STORAGE_TYPE="DYNAMIC"
INDEXER_STORAGE_SIZE="16Gi"
INDEXER_STORAGE_CLASS="arcgis-managed-premium-delete"
INDEXER_STORAGE_LABEL_1=""
INDEXER_STORAGE_LABEL_2=""

SHARING_STORAGE_TYPE="DYNAMIC"
SHARING_STORAGE_SIZE="16Gi"
SHARING_STORAGE_CLASS="arcgis-managed-premium-delete"
SHARING_STORAGE_LABEL_1=""
SHARING_STORAGE_LABEL_2=""

PROMETHEUS_STORAGE_TYPE="DYNAMIC"
PROMETHEUS_STORAGE_SIZE="30Gi"
PROMETHEUS_STORAGE_CLASS="arcgis-managed-premium-delete"
PROMETHEUS_STORAGE_LABEL_1=""
PROMETHEUS_STORAGE_LABEL_2=""

-------------------------------------------------------------------------------
Running Validation Checks...
-------------------------------------------------------------------------------
Checking operating system...
Checking property values...
Checking architecture profile...
Checking organization URL...

Error validating organization URL: https://myapp.cloud.mywebsite.co.nz/arcgis/admin
Error code from server: 502

Check your properties file for valid CONTEXT and ARCGIS_ENTERPRISE_FQDN.

ERROR: Failed to validate organization URL.
ERROR: Configure from helm failed

Note that when configure starts, it is meant to wait up to 15 minutes (900 seconds) for the admin URL to become available. As the log above shows, the admin URL is available (it comes up within a few seconds) and the script proceeds. Later, during the validation checks, the same admin URL fails to validate with a 502 error.

I have also set these two variables just in case. Originally I had only configureWaitTimeMin, as per the documentation, but that never worked.

    "install.configureWaitTimeMin"               = "15"
    "install.validationTimeout"                  = "900"

Adding the second line makes it work, but the result is still random: sometimes it works and sometimes it doesn't.

Any ideas about what is happening?

Thanks,

Jose

(Esri NZ)

1 Solution

Accepted Solutions
JoseSousa
Esri Contributor

Hi guys,

From what I can see the issue seems to be the following:

1 - App Gateway Ingress Controller (AGIC) is created in the Azure Kubernetes Service and configured with App Gateway. This creates an ingress controller with the routing in AGIC and pushes some default settings to App Gateway. The routing here is not fully functional because the 'arcgis' helm chart has not been installed yet.

2 - Once the helm chart starts the install, arcgis-ingress-nginx is created and this triggers App Gateway to update its settings including backend pools etc. The admin URL here is not functional yet.

3 - Once the helm chart install is complete it moves to 'configure'. It makes a call to check if the admin URL is available and because it's not, it waits up to 15 minutes (900 seconds).

4 - At some point, App Gateway installs the backend pools and HTTPS settings and the admin URL becomes functional, though App Gateway has not finished updating its settings yet. The 'arcgis' helm chart configure function starts working, thinking the admin URL is now functional.

5 - App Gateway interrupts the admin URL for about a second as it finalizes its update; its state now reads Started rather than Updating. Meanwhile, the 'arcgis' helm chart makes a second call later in the script to validate the organization admin URL, but this time it does not wait. If the second call lands exactly when App Gateway interrupts the service, the install fails. This is why the failure is random: when App Gateway does not interrupt the service at that moment, everything works fine.

It would be great if the second call could also wait a few minutes.
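
As a rough illustration of what a waiting second check could look like, here is a Python sketch. It is not the chart's actual script: the names http_status and wait_until_stable, the 5-second interval, and the "three successful responses in a row" rule are all assumptions made for the example. The idea is that a transient 502 resets a counter instead of failing the install.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout_s: float = 10.0) -> int:
    """Return the HTTP status code for url, or 0 if the connection itself fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 502 while the gateway is still updating
    except (urllib.error.URLError, OSError):
        return 0


def wait_until_stable(
    url: str,
    probe: Callable[[str], int] = http_status,
    timeout_s: float = 900.0,        # same 900-second budget as the first check
    interval_s: float = 5.0,         # hypothetical polling interval
    required_successes: int = 3,     # hypothetical stability threshold
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll url until it returns a 2xx/3xx status several times in a row."""
    deadline = time.monotonic() + timeout_s
    streak = 0
    while time.monotonic() < deadline:
        status = probe(url)
        if 200 <= status < 400:
            streak += 1
            if streak >= required_successes:
                return True
        else:
            streak = 0  # a transient 502 resets the count instead of failing
        sleep(interval_s)
    return False
```

Calling wait_until_stable("https://myapp.cloud.mywebsite.co.nz/arcgis/admin") in place of a single-shot validation would ride out a one-second interruption like the one described in step 5, at the cost of a few extra requests.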

Thanks,

Jose

3 Replies
dimesv
New Contributor II

Hi guys,

The lines below make no difference. It is random at this point: sometimes it works and sometimes it doesn't.

    "install.configureWaitTimeMin"               = "15"
    "install.validationTimeout"                  = "900"

 

I have an App Gateway Ingress Controller configured with arcgis-ingress-nginx and all seems to work well. Is it possible that the 2nd validation check is doing more than what the 1st check does? I can't understand why it works the first time and fails the second time if all it does is query the admin URL.

Is it possible for the 2nd check to also wait until whatever it is checking becomes available?

Thanks,

Jose 

CameronKroeker
Esri Contributor

Thanks @dimesv @JoseSousa @JoseSousa1, this is a good find and great feedback! In the next release of the Helm Charts (v1.3.0 for 11.3), we will improve this logic so that the second check on the admin URL waits until the URL is available before proceeding.