Hello People,
Today I will be writing about an issue we had on one of our Sitecore apps hosted on Azure. Everything was working fine until one fine morning our CD site started showing a "502 Bad Gateway" error page.
To my surprise, I hit the refresh button 4-5 times to double-check whether it was really down, but yes, it was down, and I had no clue why. We had not deployed anything, no new code, so why had it just stopped working?
Problem
CD sites suddenly started returning "502 Bad Gateway" out of the blue.
Troubleshooting & Solution
To resolve this issue we tried everything possible, but the very first thing I did was to blame the Azure infrastructure or the plan under which the app was running, because we had not deployed anything, no new code binaries, so why would it stop? I got on a call with our DevOps team to discuss the possibilities from an infrastructure point of view, and together we went through the following steps until the issue was resolved.
1) We started digging into the logs, and there were random exceptions present. I had no idea where these exceptions were coming from or why. I was also not too concerned with them, because the app was not responding at all, so I never thought the app could be down because of these exceptions.
2) I compared this Azure app's logs with another environment that was also on Azure (the production environment), and I did not find those exceptions there, which was a little surprising. We had a couple of exceptions like the ones below, for which I had no clue.
3) We went into that specific CD app's "Diagnose and solve problems" option in Azure and, within that, "Availability and Performance".
We found that platform availability was 100% but the app's availability was down to 0-5%; almost all requests to the app were turning into 502 Bad Gateway. Technically the app was totally down and not able to serve anything.
Now we got the idea that something was wrong behind the scenes and that it was not an Azure infrastructure issue.
4) To learn more about the nature of the exceptions, dependencies and call telemetry, I went into the "Application Insights" resource associated with the app, which showed repetitive exceptions. This also made us feel that traffic was definitely reaching the app, but because of these exceptions something was going wrong and the app was turning the gateway "bad".
5) Next we looked into the configuration of the application gateway, which is set up so that if the app is not able to respond within a specific time, Azure treats it as a bad gateway. Due to these exceptions, it was possible that the app was making calls that were not responding in time and hence the gateway was timing out.
It looked logical to us. Below is what we observed in Application Insights.
There was a large count of similar dependency exceptions. We corrected those configurations to get rid of the exceptions, and the exception count was reduced, but the app still did not come up; it was still giving the same "502 Bad Gateway".
6) We observed that the app was on Azure's S1 plan, so we decided to scale it up to the S3 plan to get more memory and see if that helped. To our surprise the app started working; "Bad Gateway" was gone, but the app's performance was still very poor. We concluded that something was wrong with memory, because the moment we increased the memory, the app worked.
Now it was time to find out what was going wrong with memory, and we had also seen memory-related exceptions in our earlier Azure app diagnosis.
7) We decided to capture a memory dump to see what might have gone wrong at the time of the crash, and we also raised tickets with Microsoft and Sitecore to get more insights.
The memory dump revealed some important information.
Memory Dump Analysis
Luckily, Sitecore also pointed in the same direction, memory, and these were the observations:
In the memory dump, the Garbage Collector was running, and two long-running ContentSearch indexing threads were waiting for GC to complete.
About 2.7 GB of memory was allocated to System.String instances, and these strings were being used by the EventQueue.
Additionally, one of our items that stores a big XML string had a huge 6 MB string in a field. This was then multiplied by the EventQueue, jobs, etc. and had easily taken 700+ MB of memory.
So the next thing was to look at the EventQueue, which had roughly 300K entries. This also explained why the app had been working fine before and then suddenly stopped: the EventQueue must have grown significantly over time.
So we cleared the EventQueue table in the Core, Master and Web databases.
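For reference, a minimal sketch of the kind of cleanup statement we ran is below; it assumes the default Sitecore EventQueue table and schema, and should be run against the Core, Master and Web databases (take backups first and confirm the column names against your version):

-- Assumed default Sitecore schema; run against the Core, Master and Web databases
DELETE FROM [EventQueue] WHERE [Created] < DATEADD(DAY, -1, GETUTCDATE());
-- or, if nothing in the queue is still needed by any instance:
-- TRUNCATE TABLE [EventQueue];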
There were a couple of other findings too, which were the following.
Our CD had "sitecore_web_index" using the "onPublishEndAsyncSingleInstance" strategy. This is the default configuration; however, in a scaled environment we should let only one instance, e.g. the CM, run this indexing strategy.
An improvement was made in Sitecore 9.2 with the addition of a new Indexing sub-role, which allows users to combine it with the desired ContentManagement role. Regardless of how many CM/CD instances you have, only one CM is supposed to perform indexing operations; the other instances should have this strategy set to "manual".
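For reference, on Sitecore 9.2+ the Indexing sub-role is declared through the role:define appSetting; a minimal sketch, assuming the instance below is the single CM that should own indexing:

<!-- Web.config appSettings on the one CM that performs indexing (example value) -->
<add key="role:define" value="ContentManagement, Indexing" />
<!-- on the other instances, e.g. CDs, it stays as -->
<add key="role:define" value="ContentDelivery" />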
So we changed our CD configs to have the following configuration:
<strategy ref="contentSearch/indexConfigurations/indexUpdateStrategies/manual" />
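For completeness, here is a minimal sketch of how that snippet can sit in a CD-only patch file; the file name is hypothetical and the exact node paths can differ by search provider and Sitecore version, so compare against your own index configuration before using it:

<!-- Hypothetical patch file, e.g. App_Config/Include/zCustom/CD.ManualIndexing.config -->
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/" xmlns:role="http://www.sitecore.net/xmlconfig/role/">
  <sitecore role:require="ContentDelivery">
    <contentSearch>
      <configuration>
        <indexes>
          <index id="sitecore_web_index">
            <strategies hint="list:AddStrategy">
              <!-- remove the publish-end strategy on CD and register manual instead -->
              <strategy ref="contentSearch/indexConfigurations/indexUpdateStrategies/onPublishEndAsyncSingleInstance">
                <patch:delete />
              </strategy>
              <strategy ref="contentSearch/indexConfigurations/indexUpdateStrategies/manual" />
            </strategies>
          </index>
        </indexes>
      </configuration>
    </contentSearch>
  </sitecore>
</configuration>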
Now, after making these changes, we restarted our CD instances, hit the site and, voila, instead of "Bad Gateway" we were seeing our site, and the gateway status turned from "Faulty" to "Healthy".
Solution
The EventQueue entries had grown very large and were being multiplied by those huge Sitecore items. Because of that, memory was filling up and the site was taking longer to come up than the defined gateway timeout period, hence the "Bad Gateway" error.
1) Delete the EventQueue table entries, and configure Sitecore to clean up the EventQueue so that in the future it only holds a limited number of records; Sitecore recommends keeping the EventQueue at around 1,000 records for optimal performance (a sketch of the cleanup agent configuration is shown after this list).
2) Always let the CM role perform indexing operations: change all CDs' indexing strategy to manual and keep only the CM performing indexing, as shown in the config change above.
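A minimal sketch of the cleanup agent configuration referred to in point 1 is below; the agent ships with Sitecore, but the interval and retention values here are examples, so check the defaults in your own Sitecore.config (under <scheduling>) before changing them:

<!-- In Sitecore.config, or via a patch, under <scheduling>; values below are examples -->
<agent type="Sitecore.Tasks.CleanupEventQueue, Sitecore.Kernel" method="Run" interval="04:00:00">
  <DaysToKeep>1</DaysToKeep>
</agent>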
After these changes we also scaled the app back down to S1, and everything is still working fine. Hope this helps someone facing the same issue.