
Zero to Hero: a real-life RCA of a 504 Gateway Time-out issue in a Sitecore Managed Cloud environment






Hello All,

The purpose of today's post is to share a real-life, burning, escalated scenario that was new to me: how I approached it, how big the escalations got, and what the outcome was.

Sitecore's goodwill was at stake, not because Sitecore is incapable of handling such issues, but simply because our environment was Sitecore Managed Cloud. Whatever issue comes up, be it infrastructure, back-end code, or front-end code, it is first pointed at as a Sitecore issue, and that is where our consultancy and experience play a role in proving that it is not.

Issue we faced

Out of the blue our site started returning "504 Gateway Time-out", and it was reported that almost everyone was getting this error. Yet whenever we browsed the site ourselves, everything looked good and we never saw a 504.


A 504 Gateway Time-out error means that the gateway forwarded the request to Sitecore's Content Delivery (CD) servers but did not get a response back from them in time, so it returned a time-out error.
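The gateway's behavior can be sketched as a simple timing check (a minimal illustration, not Azure's actual implementation; the 60-second timeout is an assumption based on the roughly one-minute limit we observed):

```python
# Minimal sketch of gateway time-out semantics: if the origin (CD server)
# does not respond within the gateway's timeout, the client gets a 504,
# even though the origin may still be busy processing the request.
GATEWAY_TIMEOUT_SECONDS = 60  # assumed limit; the actual AFD value is configurable

def gateway_status(origin_response_seconds: float,
                   timeout: float = GATEWAY_TIMEOUT_SECONDS) -> int:
    """Return the HTTP status the client would see from the gateway."""
    if origin_response_seconds > timeout:
        return 504  # gateway gave up waiting for the origin
    return 200      # origin answered in time

print(gateway_status(12.5))  # fast Layout Service response -> 200
print(gateway_status(75.0))  # choked Solr/Layout Service   -> 504
```

This also explains why nothing useful appears in the origin's own logs: the 504 is produced at the edge, before the origin ever finishes.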

One thing we knew was that something was wrong with the back end. But "back end" could be Sitecore, the databases, Solr, or Azure itself; it could be anything involved in processing the request/response. Our scope was very broad, and it felt like finding a needle in a haystack.

The side effect was that the customer was losing MAU (Monthly Active Users), the metric their business model runs on and their most important KPI.

 

Troubleshooting without an initial clue


We had to start somewhere, because we had no clue why this issue was occurring for certain users. The first thing we did was check the Sitecore logs to see what had happened, since the issue had hit other users and we were only informed after the fact.

Just to mention, we are on Sitecore 10.2 XM on a Managed Cloud set-up.

1) Sitecore Logs


We could not find any errors in the Sitecore logs that could lead to anything like this; the logs pointed at nothing specific.

(Later on, we found out why: the gateway timed out before Sitecore returned the response, so there was nothing for Sitecore to log.)

2) Azure Application Insights


Because we did not find anything in the Sitecore logs, we moved to Application Insights and used transaction search, where you can type any string and it will search the logs. This gave us something.

What we found were a lot of SolrExceptions in the time periods where the 504s were reported. From that we concluded there was an issue with Solr whenever a 504 happened, so we went to Solr and checked its logs.
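As a sketch, the equivalent Kusto query in Application Insights Logs would look roughly like this (the time window is an assumption to be narrowed to the outage; transaction search does this free-text matching for you):

```kusto
// Free-text search for Solr exceptions around the reported 504 windows.
union traces, exceptions
| where timestamp > ago(1d)          // assumed window; narrow to the outage
| where message has "SolrException" or outerMessage has "SolrException"
| summarize hits = count() by bin(timestamp, 5m)
| render timechart
```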

Solr was choking whenever these 504 scenarios were happening.




We now had a little clue about what was happening, so we investigated further into Solr's response-time and server-error graphs during those windows, and here is what we found.

Avg. Response Time




Clearly, the many exceptions we were seeing in the logs were visible in this graph too.

But our question was: Solr may be under-powered, but why does it work fine almost all the time, and what happens in this specific scenario that makes it choke?

So, we set up a theory around the scenario after getting on a call with the marketing team and understanding what exactly they were doing when it happened.

3) More from Application Insights

We observed that in the specific windows when this happened, all requests/responses took a huge time to load. We checked the Performance tab for the average time a request was taking in those windows, and what it revealed also pointed to Solr.

Some of the Layout Service queries were taking a huge time to return a response. All of our components read data from Solr, so if Solr chokes, responses will naturally be slow.


And if any request takes more than a minute, the gateway is bound to time out and return a 504 Gateway Time-out error.

Now we had to find what was causing this, because if we hit the same page ourselves it worked just fine, and outside those windows requests finished in perfect time. We had to drill down to find out whether it was really a code issue, or a caching issue where requests were hitting the servers all the time and timing out.

4) Sitecore Ticket

To be on the safe side we also created a Sitecore support ticket with these dumps. The initial findings likewise showed long-running, choking Layout Service queries; sometimes one component showed as slow and sometimes another, which was quite confusing,

and the memory and CPU analysis kept sending us back to BOX-1, the slow-running layout queries that were timing out the requests. But by now we suspected these could be side effects of the real cause: some kind of automatic traffic hitting the site that would choke any server.

5) Memory Dump & CPU analysis

We also enabled Auto-Heal, memory dumps, site slow-down reports, etc. We now had memory dumps collected, but all of them showed the Layout Service choking, which we already knew.

6) Rendering caching, performance enhancement & cache tuning

Because the reports also showed the Layout Service taking time, we had to dig into every rendering and review the implementation of third-party calls, APIM calls, database audits, etc. to make sure nothing was misconfigured in a way that could lead to this.

This was already an escalated P1 issue, and a full audit and performance-enhancement pass would take time, so we decided to remove some of the renderings to see what happened. Even after removing the renderings that depended on third-party APIs, the issue came back the very next evening.

Now we had to first understand the scenario of what exactly was happening, so we talked to the team reporting the issue. We found out that the issue occurred only when they were sending out marketing communications containing the website link.

We even increased DefaultHtmlCache and other cache values, as we had enough memory to cache things, but no luck with that either.
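For reference, html cache sizes in Sitecore are typically raised with a config patch like the one below (a sketch; the file name and the 100MB value are illustrative examples, not the values we used):

```xml
<!-- App_Config/Include/zCustom/CacheTuning.config (hypothetical file name) -->
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <settings>
      <!-- Default html cache size for sites that do not set their own -->
      <setting name="Caching.DefaultHtmlCacheSize">
        <patch:attribute name="value">100MB</patch:attribute>
      </setting>
    </settings>
  </sitecore>
</configuration>
```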

We knew that this many users were not clicking the link, yet our site was showing a lot of traffic. So something was happening when the SMS went out, and it was not a Sitecore or back-end issue from a configuration or code perspective.

So, we looked at it from a fresh perspective and set up our hypothesis. 

Setting up a hypothesis around the inputs from the marketing team


We now formed the theory that this happens when a lot of SMS are being sent: the marketing team was sending a lot of WhatsApp and SMS communications, and the site seemed to have outages at exactly the same time.

All the timings matched too; every time the site gave 504s, the marketing team had just sent communications.

So we now had something to think about. We had already done load testing on the website, server scaling was in place, and we were sure our set-up could handle the traffic.

7) Load Testing

Still, we decided to do load testing again, using the same URLs the marketing team was sending in WhatsApp or SMS campaigns, and all the results came back fine.
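"All results ok" can be made concrete with a simple latency summary; a minimal sketch of the kind of check we applied (the sample latencies and thresholds are made-up placeholders):

```python
# Summarize load-test latencies: if even the 95th percentile stays far
# below the gateway timeout, legitimate traffic alone cannot explain 504s.
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of a list of latencies (seconds)."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

latencies = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.1, 1.3, 1.6, 2.0]  # placeholder run
p95 = percentile(latencies, 95)
print(f"p95 = {p95:.1f}s, gateway timeout = 60s")
```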

So again we were sure it was not about legitimate traffic; something related to the SMS was producing a burst of traffic, without clicks, like a DDoS.

8) Server scaling

We decided to put more power on the CDs, so we introduced 4 additional instances, just to observe for a couple of days, take some reports out, and see what was happening.

Even after doing that, we observed the same number of 504s, and the data kept pointing to the same scenario: bursts of traffic within a few seconds.


 

9) Azure Front Door logs

One more report we pulled was an Azure diagnostic query to check what traffic was reaching AFD and which requests were returning 504. We exported that report to Excel, and something stood out.
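The diagnostic query looked roughly like this (a sketch against the classic Front Door `AzureDiagnostics` table; category and column names differ on AFD Standard/Premium):

```kusto
// Which client IPs produced bursts of 504s, bucketed in 10-second windows?
AzureDiagnostics
| where Category == "FrontdoorAccessLog"
| where httpStatusCode_s == "504"
| summarize hits = count() by clientIp_s, bin(TimeGenerated, 10s)
| order by hits desc
```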

There were certain IP ranges hitting the site within a few seconds of each other, which you can also see in the graph above.

Within seconds the site got 4-5 thousand hits. This is not user behavior; real users arrive with at least a few seconds' delay between them. Here are the IPs we found:

64.233.173.0/24

66.102.6.0/24

66.102.7.0/24

66.249.82.0/24

66.249.83.0/24

66.249.84.0/24

66.249.88.0/24

192.178.11.0/24

74.125.215.0/24


The majority of them are Google bots, but the question was: why were so many requests coming in, and only when communications were being sent directly to users' mobiles as SMS?

10) Further research yielded interesting facts


We went back to our SMS partner, who sends out the bulk SMS, and asked them for a report covering the outage dates. They had also sent out a circular saying they were seeing sudden high traffic spikes at exactly the times SMS were sent, and that those could not be users.

Our research on AI platforms and Google Search also revealed this behavior: links are pre-fetched by security, safe-browsing, and link-scanner services. We also got the following from our partner:

"Recent updates in Google Messages introduced enhanced security scanning and automatic link previews. These previews are generated by background preview agents that fetch the URL in advance to show a link preview to users.

Since these fetches imitate real user behaviour, they were being counted as actual clicks, resulting in inflated click numbers and sudden traffic spikes on your website.
 
Google uses a set of proxy/crawler IPs (including the ranges you shared) to pre-fetch URLs found in SMS messages. Because of this, you may observe:
Multiple requests to your URLs within 2–3 seconds
IPs from Google subnets appearing as traffic sources
Sudden server load spikes right after SMS campaigns
“Bot clicks” recorded even before users interact"

Wow, we were thrilled to learn about this behavior. We knew about Google bots crawling sites, but not about this automatic fetching of URLs that imitates a user click and makes the traffic look legitimate,

and the hypothesis we had established now had a proven theory behind it.

Solution


We now knew the actual cause was something else: the Layout Service choking, Solr choking, and the whole infrastructure getting overwhelmed were just side effects of these bot auto-clicks sending huge traffic.

1) We blocked the following user agents, along with the above IP ranges, on the WAF.

·       Mediapartners-Google

·       Googlebot

·       AdsBot-Google

·       Applebot

·       HeadlessChrome

·       YandexBot

·       facebookexternalhit

·       facebookcatalog

·       WhatsApp

·       YandexImageResizer

·       YandexMobileBot

·       YandexImages

·       YandexAccessibilityBot

·       YandexRenderResourcesBot

·       YandexUserproxy

·       GoogleMessages

·       okhttp

·       /web/snippet/

·       http4s-blaze

·       SkypeUriPreview Preview/0.5

·       Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36
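WAF user-agent rules like these are substring matches; their effect can be sketched in a few lines (the blocklist below is a subset of the list above, and the case-insensitive "Contains" matching is an assumption about how such a rule is typically configured):

```python
# Sketch of the WAF rule's effect: block a request if its User-Agent
# contains any blocklisted token (case-insensitive substring match,
# mirroring a WAF "Contains" condition).
BLOCKED_UA_TOKENS = [
    "Googlebot", "GoogleMessages", "HeadlessChrome",
    "facebookexternalhit", "WhatsApp", "okhttp",
]

def is_blocked(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(token.lower() in ua for token in BLOCKED_UA_TOKENS)

print(is_blocked("Mozilla/5.0 (compatible; Googlebot/2.1)"))   # True
print(is_blocked("Mozilla/5.0 (iPhone; CPU iPhone OS 17_0)"))  # False
```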

It has been a week, and everything seems to be working fine. We can see in the AFD reports that those IPs are now being blocked, and all the legitimate requests are returning a 200 status code.

We double-checked this behavior with Sitecore, and they shared the following links for the DDoS-style traffic mitigation already offered by Sitecore Managed Cloud:

How-to's - Sitecore Managed Cloud – DDoS attack mitigation steps

Support Information - Sitecore Managed Cloud Standard (MCS) PaaS 1.0 — DDoS IP Protection

The next action item is to put a rate-limit rule on the campaign URLs to make sure the site does not get overwhelmed, as these IP ranges may change and the user agents could change too.

Summary Points

When you work as a Sitecore partner, for the customer it is a Sitecore platform, and Sitecore's goodwill is at stake. The same happened in this scenario, just because it runs on Sitecore Managed Cloud.

But finally we were able to pull it off, and the customer was convinced that it was not a Sitecore issue :)

The most important lessons come from real pressure situations and from backing the intuition that comes with experience, especially when you have not faced a situation like this before. Here are the takeaways I will always call out when approaching these kinds of situations:

1) Holding your ground

2) Believe in your hypothesis

3) Never give up 

4) Use your intuition, which comes from experience

5) Find alternatives 

6) Start with fresh ideas

7) Take a break to get more fresh ideas

8) Think out of the box

9) Work as a team, to have more brains working

10) Don't leave any stone unturned 

Thank you Manglesh Vyas for being there whenever the AFD reports or Azure graphs were needed, and my back-end mate Kiran Sawant, who always made sure that any back-end change we wanted, be it caching or fine-tuning resolvers, was deployed and checked during this P1 issue.


 

