Sitecore 10.2 - The Ghost 500 Error that Comes and Goes - How PrefetchData Ends Up in a Race Condition and Cache Corruption
Hello Friends,
I believe this blog post is very important for everyone who is running Sitecore 10.2, because this is one of those issues that is very tricky to catch, very scary when you see it live, and very satisfying when you finally understand what is happening underneath. I will share what we faced, how we reverse engineered the Sitecore kernel, what we found, and how we resolved it.
Issue we started facing
Our customer was sending links to newly published pages in email and SMS campaigns. When someone clicked such a link, the first hit returned a 500 error screen, and a refresh made it work. That first-hit failure was causing user drop-off. Whenever we hit the same link ourselves it always worked, so it was very hard to reproduce.
Our site started giving intermittent 500 errors, with a YSOD showing one of the following two exceptions:
Exception 1: Index was outside the bounds of the array.
System.IndexOutOfRangeException:
   at System.Collections.Generic.List`1.Add (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at Sitecore.Data.DataProviders.PrefetchData.AddChildId (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataProviders.PrefetchData.AddChildrenForEmptyDefinition (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataProviders.Sql.SqlDataProvider.GetChildIDs (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataProviders.CompositeDataProvider+<DoGetChildIDs>d__94.MoveNext (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Common.EnumerableExtensions.ForEach (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataProviders.CompositeDataProvider.GetChildIDs (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataProviders.DataProvider.GetChildIDs (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataSource.GetChildIDs (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Nexus.Data.DataCommands.GetChildrenCommand.Execute (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Engines.EngineCommand`2.Execute (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.ItemProvider.GetChildren (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.ItemProvider.GetChildren (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Nexus.Data.DataCommands.ResolvePathCommand.ResolvePath (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Nexus.Data.DataCommands.ResolvePathCommand.ResolvePath (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Nexus.Data.DataCommands.ResolvePathCommand.Execute (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Engines.EngineCommand`2.Execute (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.ItemProvider.GetItem (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Pipelines.ItemProvider.GetItem.GetLanguageFallbackItem.Process (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at n/a (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Pipelines.CorePipeline.Run (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.DefaultItemManager.GetItem (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.DefaultItemManager.GetItem (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.ItemManager.GetItem (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Mvc.Pipelines.Response.GetPageItem.GetPageItemProcessor.GetItem (Sitecore.Mvc, Version=8.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Mvc.Pipelines.Response.GetPageItem.GetFromRouteUrl.Process (Sitecore.Mvc, Version=8.0.0.0, Culture=neutral, PublicKeyToken=null)
   at n/a (Sitecore.Mvc, Version=8.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Pipelines.CorePipeline.Run (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Pipelines.DefaultCorePipelineManager.Run (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Pipelines.DefaultCorePipelineManager.Run (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Mvc.Pipelines.PipelineService.RunPipeline (Sitecore.Mvc, Version=8.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Mvc.Pipelines.PipelineService.RunPipeline (Sitecore.Mvc, Version=8.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Mvc.Presentation.PageContext.GetItem (Sitecore.Mvc, Version=8.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Mvc.Presentation.PageContext.get_Item (Sitecore.Mvc, Version=8.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Mvc.ExperienceEditor.Pipelines.Request.RequestEnd.AddPageExtenders.Process (Sitecore.Mvc.ExperienceEditor, Version=10.0.0.0, Culture=neutral, PublicKeyToken=null)
   at n/a (Sitecore.Mvc.ExperienceEditor, Version=10.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Pipelines.CorePipeline.Run (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Pipelines.DefaultCorePipelineManager.Run (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Pipelines.DefaultCorePipelineManager.Run (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Mvc.Pipelines.PipelineService.RunPipeline (Sitecore.Mvc, Version=8.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Mvc.Routing.RouteHttpHandler.EndProcessRequest (Sitecore.Mvc, Version=8.0.0.0, Culture=neutral, PublicKeyToken=null)
   at System.Web.HttpApplication+CallHandlerExecutionStep.InvokeEndHandler (System.Web, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Web.HttpApplication+CallHandlerExecutionStep.OnAsyncHandlerCompletion (System.Web, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
Exception 2: Value cannot be null. Parameter name: key
System.ArgumentNullException:
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at Sitecore.Caching.Generics.Cache`1+InnerBox.DoGetEntry (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Caching.Generics.Cache`1.GetValue (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Caching.Generics.Cache`1.ContainsKey (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataProviders.Sql.SqlDataProvider.EnsureChildrenPrefetched (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataProviders.Sql.SqlDataProvider.GetChildIDs (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataProviders.CompositeDataProvider+<DoGetChildIDs>d__94.MoveNext (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Common.EnumerableExtensions.ForEach (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataProviders.CompositeDataProvider.GetChildIDs (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataProviders.DataProvider.GetChildIDs (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.DataSource.GetChildIDs (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Nexus.Data.DataCommands.GetChildrenCommand.Execute (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Engines.EngineCommand`2.Execute (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.ItemProvider.GetChildren (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.ItemProvider.GetChildren (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Nexus.Data.DataCommands.ResolvePathCommand.ResolvePath (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Nexus.Data.DataCommands.ResolvePathCommand.ResolvePath (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Nexus.Data.DataCommands.ResolvePathCommand.Execute (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Engines.EngineCommand`2.Execute (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.ItemProvider.GetItem (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Pipelines.ItemProvider.GetItem.GetLanguageFallbackItem.Process (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at n/a (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Pipelines.CorePipeline.Run (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.DefaultItemManager.GetItem (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.DefaultItemManager.GetItem (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Data.Managers.ItemManager.GetItem (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Resources.Media.MediaRequest.GetMediaPath (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Resources.Media.MediaRequest.get_MediaUri (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Resources.Media.MediaProvider.ParseMediaRequest (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Resources.Media.MediaRequestHandler.GetMediaRequest (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Resources.Media.MediaRequestHandler.DoProcessRequest (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Sitecore.Resources.Media.MediaRequestHandler.ProcessRequest (Sitecore.Kernel, Version=17.0.0.0, Culture=neutral, PublicKeyToken=null)
   at System.Web.HttpApplication+CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute (System.Web, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Web.HttpApplication.ExecuteStepImpl (System.Web, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Web.HttpApplication.ExecuteStep (System.Web, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
Side Effect
Because of this error, our pages were intermittently showing the custom 500 error page and our users were badly impacted. The most frustrating part: refresh the page and it works fine. This makes it very hard to reproduce and very easy for people to ignore, but believe me, do not ignore this one.
1. What is the issue, why it is intermittent and when does it come?
So first of all, let me explain what PrefetchData is and why it is there.
When Sitecore serves a page request, it needs to know the children of many items to resolve paths, render layouts, and process pipelines. Rather than hitting the SQL database for every single child lookup, Sitecore has a mechanism called PrefetchData which acts like a warm-up cache. It preloads child IDs of items into an in-memory collection so subsequent lookups are fast.
Now here is the interesting part — this prefetch cache gets populated on the very first request after an app pool start or recycle. And that is exactly where the problem lives.
Why is it intermittent? Because it only happens during that very small window of time when the app pool has just started and multiple concurrent requests are hitting the site at exactly the same moment. In that race window, multiple threads are trying to write child IDs into the same internal collection simultaneously.
When does it come?
- After every app pool recycle (scheduled or memory-based)
- After IIS restart
- After deployment
- First hit after idle timeout of app pool
Once that window passes and the cache is fully populated, subsequent requests read from a stable cache and everything works fine. That is why refresh fixes it — by the time you refresh, the cache is already built.
2. How we Reverse Engineered Sitecore.Kernel.dll and found the root cause
This is the most interesting part of this investigation.
We had two different DLLs — the original Sitecore 10.2 Kernel and the patched version from Sitecore's hotfix. We used dnSpy (a free open-source .NET decompiler) to decompile both and compare the exact code.
Tool used: dnSpy. You can download it from: https://github.com/dnSpy/dnSpy
Steps we followed:
- Open dnSpy
- File → Open → Load original Sitecore.Kernel.dll
- File → Open → Load hotfixed Sitecore.Kernel.dll
- Navigate to: Sitecore.Data.DataProviders → PrefetchData class
- Look at the AddChildId method
Original Code (Broken)
public virtual void AddChildId(ID childId) => this._childIds.Add(childId);
That single line is the culprit. _childIds is a plain List<ID>, which is not thread-safe at all. When multiple threads call Add() simultaneously on a List<T>, you get exactly what we saw: an IndexOutOfRangeException, because the list's internal array and its size counter get out of step when several threads add and resize at the same time. The top two frames of Exception 1 above (List`1.Add called from PrefetchData.AddChildId) point to exactly this code.
Hotfixed Code (Fixed)
public virtual void AddChildId(ID childId) { lock (this._childIdsLocker) this._childIds.Add(childId); }
And somewhere in the class there is now:
private readonly object _childIdsLocker = new object();
This is the classic monitor lock pattern in C#. Only one thread can enter the lock block at a time. All other threads wait outside the door. So now when 10 concurrent requests come in during startup, they queue up neatly and add their child IDs one by one without corrupting each other.
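To see the difference outside Sitecore, here is a minimal standalone sketch that hammers a plain List<int> from several parallel workers, first without a lock and then with one. It does not use any Sitecore types; the class and method names (ListRaceDemo, UnsafeAdd, LockedAdd) are made up for this illustration. The unsynchronized run may throw or may silently lose items depending on timing, so no fixed output is claimed for it.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical demo class, not part of Sitecore.
public static class ListRaceDemo
{
    // Same shape as the original AddChildId: concurrent Add() with no lock.
    // Returns the final count, or -1 if List<T> threw during the run.
    public static int UnsafeAdd(int workers, int perWorker)
    {
        var list = new List<int>();
        try
        {
            Parallel.For(0, workers, w =>
            {
                for (int i = 0; i < perWorker; i++)
                {
                    list.Add(i);
                }
            });
        }
        catch (AggregateException)
        {
            // Parallel.For wraps worker exceptions (for example
            // IndexOutOfRangeException) in an AggregateException.
            return -1;
        }
        return list.Count;
    }

    // Same shape as the hotfixed AddChildId: every Add() is inside a lock.
    public static int LockedAdd(int workers, int perWorker)
    {
        var list = new List<int>();
        var locker = new object();
        Parallel.For(0, workers, w =>
        {
            for (int i = 0; i < perWorker; i++)
            {
                lock (locker)
                {
                    list.Add(i);
                }
            }
        });
        return list.Count;
    }

    public static void Main()
    {
        // Unsafe result varies from run to run: -1, or a count that may be short.
        Console.WriteLine("unsafe: " + UnsafeAdd(8, 100000));
        // Locked result is always workers * perWorker.
        Console.WriteLine("locked: " + LockedAdd(8, 100000));
    }
}
```

The locked version always ends with workers * perWorker items. That is the guarantee the hotfix adds to AddChildId.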
How does Exception 2 relate to Exception 1?
This was a chained failure. Exception 1 was corrupting the prefetch data. Exception 2 (ArgumentNullException in ConcurrentDictionary.TryGetValue) was happening because EnsureChildrenPrefetched was trying to use the cache that had already been corrupted — a null ID was stored where a valid key was expected. Fix Exception 1 at source and Exception 2 disappears automatically. They are not two separate bugs, they are cause and effect.
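The second exception message is easy to reproduce in isolation. The sketch below is plain .NET and only shows how ConcurrentDictionary reacts to a null key; it says nothing about Sitecore's Cache internals. The class name NullKeyDemo and the string keys are assumptions for illustration.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical demo class, not part of Sitecore.
public static class NullKeyDemo
{
    // Looks up a key and reports what happened as a string.
    public static string Lookup(string key)
    {
        var cache = new ConcurrentDictionary<string, int>();
        cache["home"] = 1;
        try
        {
            int value;
            return cache.TryGetValue(key, out value) ? "hit" : "miss";
        }
        catch (ArgumentNullException ex)
        {
            // A null key is rejected before any lookup happens.
            return "ArgumentNullException: " + ex.ParamName;
        }
    }

    public static void Main()
    {
        Console.WriteLine(Lookup("home")); // hit
        Console.WriteLine(Lookup("blog")); // miss
        Console.WriteLine(Lookup(null));   // ArgumentNullException: key
    }
}
```

So once a null ID ends up where a cache key is expected, every lookup with it fails in this way, which matches the "Parameter name: key" message in Exception 2.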
3. Sitecore KB and Official Patch
Sitecore has acknowledged this as a known issue. The official KB article is:
Known Issues - Retrieving the child items of resource items is not thread-safe
The patch is available on Sitecore's hotfix SharePoint portal under: Sitecore XP 10.2 → Sitecore 10.2.3 rev. 013888 PRE → Platform Patch
However, a word of caution here: the patch available on the KB link is quite a large upgrade patch, and it can bring along many other changes which you may not want to introduce in a stable production CMS. We had the same experience with the aliases pipeline issue I covered in an earlier blog post.
Our Recommendation
If you want a surgical fix without replacing the entire Sitecore.Kernel.dll, you can write a lightweight custom patch assembly that overrides just the AddChildId method using the virtual method override pattern. That way you get exactly what the hotfix does — just the thread safety lock — without any other changes.
Here is how:
Create a new Class Library project
using System.Collections.Generic;
using System.Reflection;
using Sitecore.Data;
using Sitecore.Data.DataProviders;

namespace Sitecore.Support.PrefetchDataFix
{
    public class ThreadSafePrefetchData : PrefetchData
    {
        // Private lock object, mirroring the hotfix.
        private readonly object _childIdsLocker = new object();

        // _childIds is private in the base class, so it is reached via reflection (looked up once).
        private static readonly FieldInfo ChildIdsField =
            typeof(PrefetchData).GetField(
                "_childIds",
                BindingFlags.NonPublic | BindingFlags.Instance
            );

        public ThreadSafePrefetchData(ItemDefinition itemDefinition, ID templateId)
            : base(itemDefinition, templateId)
        {
        }

        public override void AddChildId(ID childId)
        {
            var childIds = (List<ID>)ChildIdsField.GetValue(this);
            // Only one thread at a time may append to the list.
            lock (_childIdsLocker)
            {
                childIds.Add(childId);
            }
        }
    }
}
Deploy the DLL to /bin/ and wire it up via a patch config file in App_Config/Include/Z.Custom/:
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
Verify it is active using /sitecore/admin/showconfig.aspx: search for PrefetchData and confirm your type is showing.
Quick Temporary Fix while you prepare the proper patch
If you want to stop the bleeding immediately without any code change, you can create a config patch that turns off the prefetch cache configuration.
This disables the prefetch cache entirely so the race never happens. There will be a slight cold-start slowdown after app pool recycles but no more 500s. Good enough to stabilize production while you work on the proper fix.
NOTE: It was our decision to patch only what was broken, so that we stayed in control. The official hotfix brings a number of DLLs and config changes, which we wanted to avoid because our instance was otherwise stable. If that suits your setup, you can go ahead and install the full hotfix from the KB article.
Observation after fix
We observed the site for 24 hours after applying the thread-safety patch and there were zero 500 errors from this issue. Happy customer, stable site.

