
Sitecore Azure-hosted apps suddenly started giving 502 Bad Gateway error

 Hello People,

Today I am writing about an issue we had on one of our Sitecore apps hosted on Azure. Everything was working fine until one fine morning, when our CD site started returning a "502 Bad Gateway" error page.

To my surprise, it stayed that way: I hit the refresh button 4-5 times to double-check whether the site was really down, and it was. I had no clue why. We had not deployed anything, so why had it just stopped working?

Problem

CD sites suddenly started returning 502 Bad Gateway, out of the blue.

Troubleshooting & Solution

To resolve this issue we tried everything we could think of. The very first thing I did was blame the Azure infrastructure or the plan the app was running under: we had not deployed anything, no new code or binaries, so why would it stop? I got on a call with our DevOps team to discuss the possibilities from an infrastructure point of view, and we worked through the following steps together until the issue was resolved.

1) We started digging into the logs. There were random exceptions, and I had no idea why they were occurring or where they came from. I was also not too concerned about them, because the app was not responding at all, and I never thought these exceptions could be the reason the app was down.

2) I compared this Azure app's log with that of another environment that was also on Azure (the production environment) and did not find those exceptions there, which was a little surprising to me. We had a couple of exceptions for which I had no clue.

 

3) We opened the "Diagnose and solve problems" option for that specific CD app in Azure, and within it "Availability and Performance".

We found that platform availability was 100% but the app's availability was down to 0-5%: almost all requests to the app were turning into 502 Bad Gateway. Technically the app was totally down and was not able to serve anything.
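To make the difference between the two numbers concrete, here is a minimal Python sketch of how an app-level availability percentage can be derived from response status codes. The function name and the rule that any 5xx response counts as a failure are assumptions for illustration; how Azure calculates its own figure is not something we verified.

```python
def app_availability(status_codes):
    """Return the percentage of requests that did not fail with a 5xx status."""
    if not status_codes:
        return 100.0  # no traffic, so nothing failed
    failed = sum(1 for code in status_codes if 500 <= code <= 599)
    return 100.0 * (len(status_codes) - failed) / len(status_codes)


# Hypothetical sample: 19 of 20 requests come back as 502.
sample = [502] * 19 + [200]
print(f"{app_availability(sample):.1f}%")  # prints 5.0%
```

The platform can be fully available (the infrastructure is up and routing requests) while the app itself fails nearly every request, which is the picture we were looking at.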


 

Now we had the idea that something was wrong behind the scenes in the app itself, and that this was not an Azure infrastructure issue.

4) To learn more about the nature of the exceptions, the dependencies, and the call telemetry, I opened "Application Insights" for the associated app. It showed repetitive exceptions, which made us think that traffic was definitely reaching the app, but that because of these exceptions something was going wrong and the gateway was reporting "Bad Gateway".

5) Next we looked into the configuration of the app gateway. It was configured so that if the app could not come online within a specific time, Azure would treat it as a bad gateway. Because of these exceptions, it was possible that the app was making calls that did not respond in time, and hence the gateway was timing out.
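The timeout behaviour described above can be sketched roughly as follows. This is a simplified model, not the gateway's actual implementation: the names `ProbeResult`, `probe_ok` and `backend_healthy` are hypothetical, and the 30-second timeout and threshold of 3 failed probes are placeholder values, not the settings from our gateway.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    status_code: int         # HTTP status returned by the app
    response_seconds: float  # time the app took to answer the probe


def probe_ok(result, timeout_seconds):
    """A probe passes only if the app answers in time with a 2xx/3xx status."""
    in_time = result.response_seconds <= timeout_seconds
    return in_time and 200 <= result.status_code < 400


def backend_healthy(results, timeout_seconds=30.0, unhealthy_threshold=3):
    """Replay probe results in order and return the final health state."""
    healthy = True
    consecutive_failures = 0
    for result in results:
        if probe_ok(result, timeout_seconds):
            consecutive_failures = 0
            healthy = True
        else:
            consecutive_failures += 1
            if consecutive_failures >= unhealthy_threshold:
                healthy = False
    return healthy


# An app that eventually answers 200, but only after 45 seconds each time.
slow_startup = [ProbeResult(200, 45.0)] * 3
print(backend_healthy(slow_startup))  # prints False
```

The point of the model is that an app does not need to return errors to be marked unhealthy; being consistently slower than the timeout is enough.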

The timeout theory looked logical to us. In "Application Insights" we observed a large number of similar dependency exceptions. We corrected the related configuration to get rid of them, and the exception count went down, but the app still did not come up and kept giving the same "502 Bad Gateway".

6) We noticed that the app was on the S1 plan in Azure, so we decided to scale it up to the S3 plan to get more memory and see if that helped. To our surprise the app started working. "Bad Gateway" was gone, although the app's performance was still very poor. So we concluded that something was wrong with memory, because the moment we increased it, the app worked.

Now it was time to find out what was going wrong with the memory. We had also seen a memory-related exception in our earlier Azure app diagnosis.

7) We decided to capture a memory dump to see what might have gone wrong at the time of the crash. We also raised tickets with Microsoft and Sitecore to get more insights.

The memory dump revealed some important information 

Memory Dump Analysis

Luckily, Sitecore also pointed in the same direction, memory, and these were the observations:

In the memory dump, the garbage collector was running. Two long-running ContentSearch indexing threads were waiting for GC to complete.

About 2.7 GB of memory was allocated for System.String. These strings were being used by the event queue.
Additionally, one of our items, where we store a big XML string, had a huge 6 MB string in a field. This was then multiplied by the EventQueue, jobs, etc., and had easily taken 700+ MB of memory.
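As a rough sanity check on those numbers, here is a small Python calculation. It uses only the figures from the dump (a 6 MB string, roughly 700 MB attributed to it, and 2.7 GB of strings in total) and works out how many in-memory copies of that one string it would take to reach 700 MB. Treating the 700 MB as whole copies of the same string is a simplifying assumption; the dump did not give us an exact copy count.

```python
FIELD_SIZE_MB = 6             # size of the large XML string found in the dump
OBSERVED_USAGE_MB = 700       # memory attributed to copies of that string
STRING_HEAP_MB = 2.7 * 1024   # total System.String allocation (2.7 GB)


def copies_needed(total_mb, single_mb):
    """Number of whole copies of one string needed to reach total_mb."""
    return -(-total_mb // single_mb)  # ceiling division on integers


copies = copies_needed(OBSERVED_USAGE_MB, FIELD_SIZE_MB)
share = OBSERVED_USAGE_MB / STRING_HEAP_MB
print(f"about {copies} copies, {share:.0%} of the string memory")
# prints: about 117 copies, 25% of the string memory
```

So a little over a hundred queued or in-flight copies of a single oversized field value would be enough to account for about a quarter of the string memory we saw.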

So the next thing was to look at the EventQueue table. We had around 300K entries, which confirmed that this could be why the app had been working fine before and then stopped suddenly: the EventQueue must have grown significantly by that time.

So we cleared the EventQueue table in the core, master, and web databases.
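We did the cleanup directly against the SQL Server databases. The Python sketch below only illustrates the idea of trimming the queue by age, using an in-memory SQLite table as a stand-in. The table here has just an `Id` and a `Created` column, which is a simplification of the real EventQueue table, and the function name, the fixed timestamp, and the one-day retention window are all assumptions for the example.

```python
import sqlite3
from datetime import datetime, timedelta


def trim_event_queue(conn, keep_for, now):
    """Delete rows older than `keep_for` and return how many rows remain."""
    cutoff = (now - keep_for).isoformat()
    conn.execute("DELETE FROM EventQueue WHERE Created < ?", (cutoff,))
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM EventQueue").fetchone()[0]


# Build a stand-in table: one row per hour going back 10 days (240 rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EventQueue (Id INTEGER PRIMARY KEY, Created TEXT)")
now = datetime(2021, 1, 11, 0, 0, 0)  # arbitrary fixed time for the example
rows = [((now - timedelta(hours=h)).isoformat(),) for h in range(240)]
conn.executemany("INSERT INTO EventQueue (Created) VALUES (?)", rows)

remaining = trim_event_queue(conn, timedelta(days=1), now)
print(remaining)  # prints 25
```

Counting the rows before and after a cleanup like this is also a quick way to see whether the queue is growing again.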

There were a couple of other findings too, which follow:

Our CD had "sitecore_web_index" using the "onPublishEndAsyncSingleInstance" strategy.
This is the default configuration. However, in a scaled environment we should let only one instance, e.g. the CM, perform this indexing strategy.
 

An improvement was made in Sitecore 9.2 with the addition of a new Indexing sub-role. This allows users to combine it with the desired ContentManagement role.


Regardless of how many CM/CD instances you have, only one CM is supposed to perform indexing operations; the other instances should have this strategy set to "manual".

So we changed our CD configs to use the following configuration:

<strategy ref="contentSearch/indexConfigurations/indexUpdateStrategies/manual" />
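A check like this can also be scripted. The following Python sketch parses a config fragment and lists any strategies on an index that are not the manual one. The XML in `cd_config` is a hypothetical, heavily trimmed fragment written for this example rather than a real Sitecore config file, and the function name is made up.

```python
import xml.etree.ElementTree as ET

MANUAL_REF = "contentSearch/indexConfigurations/indexUpdateStrategies/manual"


def non_manual_strategies(config_xml, index_id):
    """Return strategy refs on the given index that are not the manual strategy."""
    root = ET.fromstring(config_xml)
    refs = []
    for index in root.iter("index"):
        if index.get("id") != index_id:
            continue
        for strategy in index.iter("strategy"):
            ref = strategy.get("ref", "")
            if ref != MANUAL_REF:
                refs.append(ref)
    return refs


# Hypothetical, trimmed fragment representing a CD instance before the change.
cd_config = """
<indexes>
  <index id="sitecore_web_index">
    <strategies>
      <strategy ref="contentSearch/indexConfigurations/indexUpdateStrategies/onPublishEndAsyncSingleInstance" />
    </strategies>
  </index>
</indexes>
"""
print(non_manual_strategies(cd_config, "sitecore_web_index"))
```

An empty list for every CD instance would mean that none of them is still trying to run its own indexing.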

After making these changes we restarted our CD instances, hit the site, and voila: instead of "Bad Gateway" we were seeing our site, and all gateways turned from "Faulty" to "Healthy".

Solution

The EventQueue had a very large number of entries, and these were being multiplied by those huge Sitecore items. Because of that, memory was filling up and the site was taking longer to come up than the defined gateway timeout period, hence the "Bad Gateway" error.

1) Delete the EventQueue table entries, and configure Sitecore to clean up the EventQueue so that it only holds a limited number of records in future. Sitecore recommends keeping the EventQueue at 1000 records for good performance.

2) Always let the CM role perform the indexing operations: change the indexing strategy on all CDs to manual and keep only the CM performing indexing.

After this change we also scaled the app back down to S1, and everything is still working fine. Hope this helps someone facing the same issue.
