Portal 10.5 Index Mismatch (search)

FraserHand1 · ‎03-28-2017

Hi,

Has anyone come across a situation where the Portal search index drops items? This means things like symbology, web app templates, etc. start to vanish from Portal. Note that doing a reindex still does not result in a full match. I have tried starting from an empty index and reindexing, as described here:

http://support.esri.com/technical-article/000012663

and have run the indexer manually on the portal machine from here

C:\ArcGIS\Portal\tools\indexer>Indexer.bat -f -i F:\arcgisportal\index

I still end up with a mismatch on the search index.

Watching the indexer output, I see messages like

INFO: Query returned 7167 records.

0 [WARN] GWContent: - 'extent' parameter in item card is invalid

977 [WARN] GWContent: - 'extent' parameter in item card is invalid

2480 [WARN] GWContent: - 'extent' parameter in item card is invalid

6991 [WARN] GWContent: - 'extent' parameter in item card is invalid

which looks like the items are not being indexed? I had thought maybe it was permissions, but procmon didn't turn anything up.

Running ArcGIS Enterprise 10.5
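One way to see how many items these warnings touch is to tally them against the record count the indexer reports. Below is a minimal sketch in Python, assuming the indexer output has been captured as text and that its lines look like the excerpt above. The function name `summarize_indexer_log` and both patterns are illustrative assumptions, not part of any Esri tool, and a warning about `extent` does not by itself prove that the item was skipped.

```python
import re

# Patterns are based only on the excerpt quoted above; other versions may log differently.
RECORDS_RE = re.compile(r"INFO: Query returned (\d+) records\.")
WARN_RE = re.compile(r"^\s*(\d+) \[WARN\] (\w+): - (.+)$")


def summarize_indexer_log(text):
    """Return (records_reported, {warning_message: count}) for captured indexer output."""
    records = None
    warnings = {}
    for line in text.splitlines():
        m = RECORDS_RE.search(line)
        if m:
            records = int(m.group(1))
            continue
        m = WARN_RE.match(line)
        if m:
            message = m.group(3).strip()
            warnings[message] = warnings.get(message, 0) + 1
    return records, warnings


sample = """INFO: Query returned 7167 records.
0 [WARN] GWContent: - 'extent' parameter in item card is invalid
977 [WARN] GWContent: - 'extent' parameter in item card is invalid
2480 [WARN] GWContent: - 'extent' parameter in item card is invalid
6991 [WARN] GWContent: - 'extent' parameter in item card is invalid
"""

records, warnings = summarize_indexer_log(sample)
print(records, warnings)
```

If the number of warnings is far smaller than the gap between the database and index counts, the invalid extents are probably not the whole explanation.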

Portal is single machine with federated Server.

Thanks,

Fraser

Anonymous User · ‎12-12-2019

Just sharing that I have observed this behavior in 10.6.1 (HA) following an unscheduled/unplanned restart and following patching (where restarts occur) of the primary and secondary portal machines. Symptoms included the search indexes being out of sync and layers failing to load due to incorrect permissions in Web Apps.

The fix was:

1. Shut down secondary portal service
2. Shut down primary portal service
3. Start primary portal service
4. Reindex users, groups and then search on the primary machine
5. Start secondary portal service
6. Check index status (I found this to be quite a way out of sync)
7. Reindex users, groups and then search on the secondary machine
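
The "check index status" step can be sketched as a comparison of database counts against index counts. This is a minimal Python sketch, assuming the status is available as a dictionary with an `indexes` list whose entries carry `name`, `databaseCount` and `indexCount`. That shape is an assumption here and should be checked against the index status output of your Portal version. The function name `out_of_sync` is hypothetical and the sample numbers are purely illustrative.

```python
REINDEX_ORDER = ["users", "groups", "search"]  # order used in the steps above


def out_of_sync(status):
    """Return the names of indexes whose index count differs from the database count,
    in the order they should be reindexed."""
    mismatched = {
        entry["name"]
        for entry in status.get("indexes", [])
        if entry["databaseCount"] != entry["indexCount"]
    }
    ordered = [name for name in REINDEX_ORDER if name in mismatched]
    # Any index names not listed in REINDEX_ORDER go last, sorted for stable output.
    ordered += sorted(mismatched - set(REINDEX_ORDER))
    return ordered


# Illustrative values only, not taken from a real portal.
sample_status = {
    "indexes": [
        {"name": "users", "databaseCount": 51, "indexCount": 51},
        {"name": "groups", "databaseCount": 325, "indexCount": 320},
        {"name": "search", "databaseCount": 7167, "indexCount": 6990},
    ]
}
print(out_of_sync(sample_status))
```

Running the same check on each machine after its reindex gives a simple pass/fail signal before moving on to the next step.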

As mentioned, this was only something that was observed following either an unplanned restart or following patching of the two portal machines where the services are restarted.

KierenTinning2 · ‎12-13-2019

We found the same. We scheduled the patch process to shut down the standby, then shut down the primary and patch it first, and then patch and restart the standby system.

Unfortunately, this takes you out of HA for a short period of time, but worth it to avoid all the index problems that result. We're hoping that this issue has been addressed in 10.7.

BradKiep · ‎12-13-2019

We struggled with the index sync for years in 10.5. Since we have moved to 10.7 we have not noticed any issues. It has been so much better for HA.

MichaelRobb · ‎12-13-2019

True, except, to clarify, Portal is not HA (high availability). It is active/passive or, in Esri terms, primary/secondary.

Calls don't load balance; they all go to one side, the designated active side. There is an underlying PostgreSQL replication that occurs when switching. This is not HA. And you can really bugger it up if you, say, patch the OS and restart without allowing enough time for the replication to the other machine to take place. This would not be true HA (like AGS is).

KierenTinning2 · ‎12-16-2019

Absolutely agree. From my understanding, it is actually a problem with the selection of the underlying PostgreSQL technology and how it's implemented for the Datastore, which causes the primary/secondary scenario. I used HA in the sense that, when set up, the system should always "be in sync" and primary will fail over to secondary smoothly. It's when you try to recover too quickly that you have the issue with the two not "lining up", as it were. It's also why you can only have two portal machines.

It would have been good had Portal followed the same HA strategy as ArcGIS Server, but alas, we have what we have.

MichaelRobb · ‎12-16-2019

You are correct, it is due to the underlying PostgreSQL technology. Funny, I have another thread response going into details. I've had to deal with corruption of replication, which was quite fun. From an architecture standpoint, one has to really weigh the benefits against the risks of enabling active/passive Portal for what the business wants. To an EA, HA means all points are HA; otherwise it is not HA, even if some components are. With Portal in it, an Enterprise GIS stack would not classify as true HA. My approach is to reduce the number of single points of failure within reasonable cost and business benefit (e.g. SQL Always On / HANA in-memory HA is by no means cheap and will more than likely be single points of failure). In my experience, Portal, even at 10.7+, is just too sensitive to monthly OS patches and constant cycles or network traffic disruptions.

JonathanQuinn · ‎12-16-2019

While there are different architectures (active active or active passive) for HA, the same fundamental principles apply, which you've mentioned: eliminating single points of failure and being able to fail over and recover automatically without data loss. All internal components within portal meet those requirements. The content directory for portal poses a bit of a challenge as that can be set to a simple file share, but as long as your storage is highly available, you're fine. Taking advantage of cloud native storage makes that easier as well. Bottom line, ArcGIS Enterprise, including Portal, can certainly be configured as a highly available system.

There's also no problem sending traffic to both Portal machines. Internally, the secondary/standby portal will communicate with the database on primary.

Also, PG supports multiple standby machines, but there hasn't been a customer need to support it in portal. While you set up a multi-machine Server site for scalability as well as availability, you're not setting up an HA portal for scalability.

Being HA means you need to have HA practices as well. This means configuring your patching to ensure that there's always one healthy machine. If you patch the primary portal (p1), you need to make sure that the standby (p2) becomes primary and is healthy before patching p2. This is no different than any other relational database that operates as active/standby.

Michael Robb, can you describe the replication corruption you experienced?
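
The "always one healthy machine" rule can be sketched as a guard around the patch loop. This is a minimal Python sketch under stated assumptions: `patch` and `is_healthy` are hypothetical callables you would supply (for example, wrappers around your own patching tool and a health probe), the timeout and interval values are arbitrary, and the sketch does not model the promotion of the standby to primary.

```python
import time


def wait_until_healthy(is_healthy, machine, timeout_s=600, interval_s=10, sleep=time.sleep):
    """Poll is_healthy(machine) until it returns True or timeout_s has been waited."""
    waited = 0
    while not is_healthy(machine):
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
    return True


def rolling_patch(machines, patch, is_healthy, **wait_kwargs):
    """Patch machines one at a time, and only once every other machine is healthy."""
    patched = []
    for machine in machines:
        others = [m for m in machines if m != machine]
        if not all(wait_until_healthy(is_healthy, m, **wait_kwargs) for m in others):
            raise RuntimeError(f"refusing to patch {machine}: a peer is not healthy")
        patch(machine)
        patched.append(machine)
    return patched


# Demo with stand-ins: a patched machine needs three polls before it reports healthy.
polls_left = {"p1": 0, "p2": 0}


def fake_patch(machine):
    polls_left[machine] = 3


def fake_is_healthy(machine):
    if polls_left[machine] > 0:
        polls_left[machine] -= 1
        return False
    return True


order = rolling_patch(["p1", "p2"], fake_patch, fake_is_healthy, sleep=lambda s: None)
print(order)
```

The point of the guard is that the second machine is never touched until the first has come back and reported healthy, however short the patch itself was.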

MichaelRobb · ‎12-16-2019

Hi Jonathan,

Great to hear from you.

I guess the reason why I do not say Portal is HA is really that I lost that argument with corporate EAs. To them, HA means all components can work actively, not that part of it fails over. Yes, the web adaptors can both work, and Portal itself can as well. What cannot do HA is the darn PostgreSQL, which makes the Portal software not HA.

The content directory is not a challenge, as one can set it up on a NAS, which inherently has HA built in: typically three network channels (fibre/cable) with a complex RAID. So no problem there.

Correct, OS patching has different schedules for each of the machines I had set up, including the license server and the FO license server, pairs of the AGS servers, and one of each of the Portal and web servers, all through an NLB, all using VIPs. The corruption occurred when the VM team had patched and rebooted one side and begun patching the other, but the patch was small enough that the VM restart cycle time was shorter than the 4 minutes it takes for Portal to replicate over. It was all over at that point. I am aware that newer Portal versions have reduced this time, maybe to under two minutes.

(Which brings me to another point: not being able to configure Active Directory LDAP against a VIP, only IPs... but that is for another conversation.)

I'd love to describe the corruption that happened, but I don't think there is enough room on this forum. There have been a few cases on this (case #02200912), escalated to Redlands without resolution (Esri Case #02377858).

I am confident there is corruption within PostgreSQL, but have never had the time to look at this further.

I can come by and discuss if you are attending the International Developer Summit.

JonathanQuinn · ‎12-17-2019

Unfortunately, I won't be at the International Dev Summit. I can ask for a few things for troubleshooting, but only if you don't mind potentially duplicate troubleshooting efforts.

In terms of being HA, the standard isn't that the system is active/active at all times; it's that there are redundancies built into the software that can handle failover and continue working effectively. "Effectively" is likely open to interpretation (especially if you've run into issues with Portal HA previously) and is certainly dependent on version. Prior to 10.6.1, it took minutes for the standby portal to promote itself to primary. From 10.6.1 on, it typically occurs within 30 seconds. It seems like Oracle supports active/active where each database can accept write transactions, but SQL Server and SAP HANA seem to only support active/active as read-only systems, while being active/standby for read/write. They describe these setups under their own HA topic headings, so again, throughout the industry we will see HA covering more than just active/active.

https://help.sap.com/viewer/6b94445c94ae495c83a19646e7c3fd56/2.0.03/en-US/879d9dc46bb64ccda028872c86...