
Sitecore 10.2 - The Ghost 500 Error that Comes and Goes - How PrefetchData Ends Up in a Race Condition and Cache Corruption

Hello Friends,

I believe this blog post is very important for everyone who is running Sitecore 10.2, because this is one of those issues which is very tricky to catch, very scary when you see it live, and very satisfying when you finally understand what is happening underneath. I will share what we faced, how we reverse engineered the Sitecore kernel, what we found and how we resolved it.

Issue we started facing

Our customer was sending links to newly published pages in email and SMS campaigns. When someone clicked such a link, the first hit returned a 500 screen and a refresh worked fine. Users were dropping off on that first hit, and whenever we tried the link ourselves it always worked, so it was very hard to reproduce.

Our site started giving intermittent 500 errors with a YSOD showing one of the following two exceptions:

Exception 1: Index was outside the bounds of the array.

System.IndexOutOfRangeException:
   at System.Collections.Generic.List`1.Add (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at Sitecore.Data.DataProviders.PrefetchData.AddChildId (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataProviders.PrefetchData.AddChildrenForEmptyDefinition (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataProviders.Sql.SqlDataProvider.GetChildIDs (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataProviders.CompositeDataProvider+<DoGetChildIDs>d__94.MoveNext (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Common.EnumerableExtensions.ForEach (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataProviders.CompositeDataProvider.GetChildIDs (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataProviders.DataProvider.GetChildIDs (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataSource.GetChildIDs (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Nexus.Data.DataCommands.GetChildrenCommand.Execute (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Engines.EngineCommand`2.Execute (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.ItemProvider.GetChildren (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.ItemProvider.GetChildren (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Nexus.Data.DataCommands.ResolvePathCommand.ResolvePath (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Nexus.Data.DataCommands.ResolvePathCommand.ResolvePath (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Nexus.Data.DataCommands.ResolvePathCommand.Execute (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Engines.EngineCommand`2.Execute (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.ItemProvider.GetItem (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Pipelines.ItemProvider.GetItem.GetLanguageFallbackItem.Process (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at n/a (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Pipelines.CorePipeline.Run (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.DefaultItemManager.GetItem (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.DefaultItemManager.GetItem (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.ItemManager.GetItem (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Mvc.Pipelines.Response.GetPageItem.GetPageItemProcessor.GetItem (Sitecore.Mvc, Version=8.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Mvc.Pipelines.Response.GetPageItem.GetFromRouteUrl.Process (Sitecore.Mvc, Version=8.0.0.0, Culture=neutral, PublicKeyToken=null)
   at n/a (Sitecore.Mvc, Version=8.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Pipelines.CorePipeline.Run (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Pipelines.DefaultCorePipelineManager.Run (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Pipelines.DefaultCorePipelineManager.Run (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Mvc.Pipelines.PipelineService.RunPipeline (Sitecore.Mvc, Version=8.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Mvc.Pipelines.PipelineService.RunPipeline (Sitecore.Mvc, Version=8.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Mvc.Presentation.PageContext.GetItem (Sitecore.Mvc, Version=8.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Mvc.Presentation.PageContext.get_Item (Sitecore.Mvc, Version=8.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Mvc.ExperienceEditor.Pipelines.Request.RequestEnd.AddPageExtenders.Process (Sitecore.Mvc.ExperienceEditor, Version=10.0.0.0, Culture=neutral, PublicKeyToken=null)
   at n/a (Sitecore.Mvc.ExperienceEditor, Version=10.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Pipelines.CorePipeline.Run (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Pipelines.DefaultCorePipelineManager.Run (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Pipelines.DefaultCorePipelineManager.Run (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Mvc.Pipelines.PipelineService.RunPipeline (Sitecore.Mvc, Version=8.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Mvc.Routing.RouteHttpHandler.EndProcessRequest (Sitecore.Mvc, Version=8.0.0.0, Culture=neutral, PublicKeyToken=null)
   at System.Web.HttpApplication+CallHandlerExecutionStep.InvokeEndHandler (System.Web, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Web.HttpApplication+CallHandlerExecutionStep.OnAsyncHandlerCompletion (System.Web, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)

Exception 2: Value cannot be null. Parameter name: key

System.ArgumentNullException:
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at Sitecore.Caching.Generics.Cache`1+InnerBox.DoGetEntry (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Caching.Generics.Cache`1.GetValue (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Caching.Generics.Cache`1.ContainsKey (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataProviders.Sql.SqlDataProvider.EnsureChildrenPrefetched (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataProviders.Sql.SqlDataProvider.GetChildIDs (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataProviders.CompositeDataProvider+<DoGetChildIDs>d__94.MoveNext (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Common.EnumerableExtensions.ForEach (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataProviders.CompositeDataProvider.GetChildIDs (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataProviders.DataProvider.GetChildIDs (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataSource.GetChildIDs (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Nexus.Data.DataCommands.GetChildrenCommand.Execute (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Engines.EngineCommand`2.Execute (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.ItemProvider.GetChildren (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.ItemProvider.GetChildren (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Nexus.Data.DataCommands.ResolvePathCommand.ResolvePath (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Nexus.Data.DataCommands.ResolvePathCommand.ResolvePath (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Nexus.Data.DataCommands.ResolvePathCommand.Execute (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Engines.EngineCommand`2.Execute (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.ItemProvider.GetItem (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Pipelines.ItemProvider.GetItem.GetLanguageFallbackItem.Process (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at n/a (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Pipelines.CorePipeline.Run (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.DefaultItemManager.GetItem (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.DefaultItemManager.GetItem (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.ItemManager.GetItem (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Resources.Media.MediaRequest.GetMediaPath (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Resources.Media.MediaRequest.get_MediaUri (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Resources.Media.MediaProvider.ParseMediaRequest (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Resources.Media.MediaRequestHandler.GetMediaRequest (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Resources.Media.MediaRequestHandler.DoProcessRequest (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Resources.Media.MediaRequestHandler.ProcessRequest (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at System.Web.HttpApplication+CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute (System.Web, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Web.HttpApplication.ExecuteStepImpl (System.Web, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Web.HttpApplication.ExecuteStep (System.Web, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)

Side Effect

Because of this, our site pages were intermittently showing the custom 500 error page and our users were impacted badly. The most frustrating part: refresh the page and it works fine. This makes it very hard to reproduce and very easy for people to ignore, but believe me, do not ignore this one.

1. What is the issue, why is it intermittent and when does it come?

So first of all, let me explain what PrefetchData is and why it is there.

When Sitecore serves a page request, it needs to know the children of many items to resolve paths, render layouts, and process pipelines. Rather than hitting the SQL database for every single child lookup, Sitecore has a mechanism called PrefetchData which acts like a warm-up cache. It preloads child IDs of items into an in-memory collection so subsequent lookups are fast.

Now here is the interesting part — this prefetch cache gets populated on the very first request after an app pool start or recycle. And that is exactly where the problem lives.

Why is it intermittent? Because it only happens during that very small window of time when the app pool has just started and multiple concurrent requests are hitting the site at exactly the same moment. In that race window, multiple threads are trying to write child IDs into the same internal collection simultaneously.

When does it come?

  • After every app pool recycle (scheduled or memory-based)
  • After IIS restart
  • After deployment
  • First hit after idle timeout of app pool

Once that window passes and the cache is fully populated, subsequent requests read from a stable cache and everything works fine. That is why refresh fixes it — by the time you refresh, the cache is already built.
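If you want to catch this window on purpose instead of waiting for a campaign to hit it, a burst of parallel requests right after a recycle is one way to try. The following is only a minimal sketch under our own assumptions: ColdStartProbe, CountServerErrorsAsync and ProbeAsync are hypothetical names (nothing here comes from Sitecore), and the URL and concurrency level are whatever fits your environment. Because this is a race, a run without any 500 does not prove the issue is absent.

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper (not part of Sitecore): fires a burst of requests in
// parallel and counts how many come back with a 5xx status code.
public static class ColdStartProbe
{
    // sendRequest is any function that performs one request and returns its
    // HTTP status code. Keeping it abstract lets the counting logic be
    // exercised without a real web server.
    public static async Task<int> CountServerErrorsAsync(
        Func<Task<int>> sendRequest, int concurrency)
    {
        // ToArray() starts all requests before any of them is awaited.
        Task<int>[] pending = Enumerable.Range(0, concurrency)
            .Select(_ => sendRequest())
            .ToArray();

        int[] statusCodes = await Task.WhenAll(pending);
        return statusCodes.Count(code => code >= 500);
    }

    // Convenience wrapper for a real site: GET the same URL `concurrency` times.
    public static Task<int> ProbeAsync(HttpClient client, string url, int concurrency)
    {
        return CountServerErrorsAsync(async () =>
        {
            using (HttpResponseMessage response = await client.GetAsync(url))
            {
                return (int)response.StatusCode;
            }
        }, concurrency);
    }
}
```

Recycle the app pool, then immediately call ColdStartProbe.ProbeAsync with a page URL of your site and, say, 20 to 50 parallel requests. Any non-zero result right after the recycle that drops to zero on a second run matches the pattern described above.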

2. How we Reverse Engineered Sitecore.Kernel.dll and found the root cause

This is the most interesting part of this investigation.

We had two different DLLs — the original Sitecore 10.2 Kernel and the patched version from Sitecore's hotfix. We used dnSpy (a free open-source .NET decompiler) to decompile both and compare the exact code.

Tool used: dnSpy. You can download it from https://github.com/dnSpy/dnSpy

Steps we followed:

  1. Open dnSpy
  2. File → Open → Load original Sitecore.Kernel.dll
  3. File → Open → Load hotfixed Sitecore.Kernel.dll
  4. Navigate to: Sitecore.Data.DataProviders → PrefetchData class
  5. Look at the AddChildId method

Original Code (Broken)

public virtual void AddChildId(ID childId) => this._childIds.Add(childId);

That single line. That is the culprit. _childIds is a plain List<ID>, which is not thread-safe at all. When multiple threads call Add() simultaneously on a List<T>, you get exactly what we saw: IndexOutOfRangeException, because the internal array of List<T> gets corrupted during concurrent resize operations. The top frames of Exception 1 above (List`1.Add called from PrefetchData.AddChildId) point to this exact code.

Hotfixed Code (Fixed)

public virtual void AddChildId(ID childId)
{
    lock (this._childIdsLocker)
        this._childIds.Add(childId);
}

And somewhere in the class there is now:

private readonly object _childIdsLocker = new object();

This is the classic monitor lock pattern in C#. Only one thread can enter the lock block at a time. All other threads wait outside the door. So now when 10 concurrent requests come in during startup, they queue up neatly and add their child IDs one by one without corrupting each other.
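To see the difference outside Sitecore, here is a small stand-alone sketch. ChildIdBag and RaceDemo are hypothetical names of ours and use Guid instead of Sitecore's ID type; they only mimic the two versions of AddChildId shown above and are not the Kernel code itself.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for PrefetchData (not Sitecore code).
public class ChildIdBag
{
    private readonly List<Guid> _childIds = new List<Guid>();
    private readonly object _childIdsLocker = new object();

    // Mimics the original one-liner: no synchronization around List<T>.Add.
    public void AddUnsafe(Guid id)
    {
        _childIds.Add(id);
    }

    // Mimics the hotfixed version: only one writer at a time.
    public void AddLocked(Guid id)
    {
        lock (_childIdsLocker)
        {
            _childIds.Add(id);
        }
    }

    public int Count
    {
        get { lock (_childIdsLocker) { return _childIds.Count; } }
    }
}

public static class RaceDemo
{
    // Adds `total` IDs from parallel workers through the locked path
    // and returns how many ended up in the list.
    public static int FillLocked(int total)
    {
        var bag = new ChildIdBag();
        Parallel.For(0, total, i => bag.AddLocked(Guid.NewGuid()));
        return bag.Count;
    }

    // Same workload through the unlocked path. The outcome varies from run
    // to run: it may throw, silently lose items, or happen to succeed.
    public static string FillUnsafe(int total)
    {
        var bag = new ChildIdBag();
        try
        {
            Parallel.For(0, total, i => bag.AddUnsafe(Guid.NewGuid()));
            return "no exception, count = " + bag.Count;
        }
        catch (AggregateException ex)
        {
            // Parallel.For wraps exceptions thrown by the loop body.
            return "failed with " + ex.InnerException?.GetType().Name;
        }
    }
}
```

RaceDemo.FillLocked(100000) always returns 100000. RaceDemo.FillUnsafe(100000) is not deterministic, which is exactly why the production issue looked like a ghost: the same code path can pass many times and then fail once.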

How does Exception 2 relate to Exception 1?

This was a chained failure. Exception 1 was corrupting the prefetch data. Exception 2 (ArgumentNullException in ConcurrentDictionary.TryGetValue) was happening because EnsureChildrenPrefetched was trying to use the cache that had already been corrupted — a null ID was stored where a valid key was expected. Fix Exception 1 at source and Exception 2 disappears automatically. They are not two separate bugs, they are cause and effect.
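The message of Exception 2 is easy to reproduce on its own. The sketch below does not touch Sitecore at all; NullKeyDemo is a hypothetical name and the string key stands in for an ID. It only shows that looking up a null key in a ConcurrentDictionary throws ArgumentNullException for the parameter "key", which matches the text we saw on the YSOD.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical demo class (not Sitecore code).
public static class NullKeyDemo
{
    // Returns the parameter name reported by ArgumentNullException when a
    // null key is looked up, or null if no exception is thrown.
    public static string LookupNullKey()
    {
        var cache = new ConcurrentDictionary<string, int>();
        cache["home"] = 1;

        // Stands in for an ID that came back as null from a corrupted list.
        string corruptedKey = null;

        try
        {
            int value;
            cache.TryGetValue(corruptedKey, out value);
            return null;
        }
        catch (ArgumentNullException ex)
        {
            return ex.ParamName;
        }
    }
}
```

NullKeyDemo.LookupNullKey() returns "key". So once a null ID leaks out of the corrupted child list and is used as a cache key, this is the exception you get, even though the code that throws it is perfectly fine.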

3. Sitecore KB and Official Patch

Sitecore has acknowledged this as a known issue. The official KB article is:

Known Issues - Retrieving the child items of resource items is not thread-safe 

The patch is available on Sitecore's hotfix SharePoint portal under: Sitecore XP 10.2 → Sitecore 10.2.3 rev. 013888 PRE → Platform Patch

However, a word of caution here: the patch available on the KB link is quite a large upgrade patch and it can bring along many other changes which you may not want to introduce in a stable production CMS. We had the same experience with the aliases pipeline issue I described in an earlier blog post.

Our Recommendation

If you want a surgical fix without replacing the entire Sitecore.Kernel.dll, you can write a lightweight custom patch assembly that overrides just the AddChildId method using the virtual method override pattern. That way you get exactly what the hotfix does — just the thread safety lock — without any other changes.

Here is how:

Create a new Class Library project

using System.Collections.Generic;
using System.Reflection;
using Sitecore.Data;
using Sitecore.Data.DataProviders;

namespace Sitecore.Support.PrefetchDataFix
{
    public class ThreadSafePrefetchData : PrefetchData
    {
        // Private lock object, same idea as _childIdsLocker in the hotfixed Kernel.
        private readonly object _childIdsLocker = new object();

        // _childIds is private in PrefetchData, so it is read through reflection.
        // The FieldInfo is resolved once and reused for every instance.
        private static readonly FieldInfo ChildIdsField =
            typeof(PrefetchData).GetField(
                "_childIds",
                BindingFlags.NonPublic | BindingFlags.Instance);

        public ThreadSafePrefetchData(ItemDefinition itemDefinition, ID templateId)
            : base(itemDefinition, templateId)
        {
        }

        public override void AddChildId(ID childId)
        {
            var childIds = (List<ID>)ChildIdsField.GetValue(this);

            // Only one thread at a time may append to the underlying List<ID>.
            lock (_childIdsLocker)
            {
                childIds.Add(childId);
            }
        }
    }
}

Deploy the DLL to /bin/ and wire it up via a patch config file in App_Config/Include/Z.Custom/:

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <dataProviders>
      <main type="Sitecore.Data.DataProviders.Sql.SqlDataProvider, Sitecore.Kernel">
        <prefetchData>
          <patch:attribute name="type">Sitecore.Support.PrefetchDataFix.ThreadSafePrefetchData, Sitecore.Support.PrefetchDataFix</patch:attribute>
        </prefetchData>
      </main>
    </dataProviders>
  </sitecore>
</configuration>

Verify it is active using /sitecore/admin/showconfig.aspx — search for PrefetchData and confirm your type is showing.

Quick Temporary Fix while you prepare the proper patch

If you want to stop the bleeding immediately without any code change, you can create a config patch that disables the prefetch cache, so that this caching step is skipped completely.

This disables the prefetch cache entirely so the race never happens. There will be a slight cold-start slowdown after app pool recycles but no more 500s. Good enough to stabilize production while you work on the proper fix.

NOTE: It was our decision to patch only what was broken, so that we stayed in control. The official hotfix brings a number of DLLs and config changes which we wanted to avoid, as our instance was otherwise stable. If you think it suits your setup, you can go ahead and install the full hotfix from the KB link above.

Observation after fix

We observed the site for 24 hours after applying the thread-safety patch and there were zero 500 errors from this issue. Happy customer, stable site.
