I recently had a challenging production problem and a couple of friends in the SharePoint space have encouraged me to blog about it so here goes…
TLDR? Scroll down to Summary
Scenario: SharePoint 2010 Farm with a corpus of approximately 2.5 million documents using a 3rd party Remote Blob Storage provider. This farm utilizes search exclusively to retrieve documents for customers. When this farm was originally stood up with 1.2 million documents migrated in from a legacy system, the record destruction process was not started due to business reasons. (Refer to point #3 from a previous post I wrote about this ECM solution – ECM Tech Tips from a SharePoint Admin)
Fast-forward 4 years … the number of documents has now doubled and the destruction process has yet to be started. After some challenging conversations with legal, compliance and risk we finally hammered out the custom Information Management Policy that would meet the business needs and we set out to begin the destruction process. Due to not doing the destruction process on a regular (by default weekly) basis as is done with the built-in Information Management Policy process in SharePoint, we now had an accumulation of approximately 300,000 documents that would be destroyed on the first run of the Information Management Policy jobs in order to “catch up”.
To mitigate any risk, we scheduled a system outage for a long week-end when our end-users were not scanning in new documents nor searching against the content and I agreed to monitor the process throughout that time. Our Information Management Policies were set up by Content Type and for each, 2 built-in timer jobs were run to permanently destroy the records:
- Information Management Policy job
- Expiration Job
The process took approximately 30 hours to complete and worked exactly as planned with approximately 300,000 documents destroyed. So far, so good.
It has been my experience when that many documents are deleted that an Index Reset is sometimes best in order to clear things out and start “fresh”. I knew that when I did this I would lose search history and analytics and all content sources would require a full crawl once done, but I had the system outage time approved so I had pre-planned for the extra time required. On this farm the only time a Full crawl was done was on the initial migration of documents in from the legacy system. In the past 4 years, only incremental crawls have been scheduled.
I did an Index Reset.
We have 2 large content sources in this Search Service Application and I chose to do a full crawl on each of them one after the other rather than at the same time. I started the first full crawl. The crawl rate was reasonable (about 150 items per second) so I decided to let it go with plans to check back in a couple of hours to see how it was doing.
And that’s when things started to go sideways…
On my return, I noticed the CPU was pinned at almost 100% on the crawl server. In this farm, we have 1 crawl component, only 2 cores (I know, I know) and it was working REAL hard to keep up. This crawl server has an entry in the local hosts file pointing back to itself for the URLs being crawled so it wasn’t going thru the load-balanced farm webservers to crawl the content, only itself. I noticed most of the CPU cycles on the crawl server were spent on a w3wp.exe process in resource monitor. I thought this was likely normal since that worker process was associated with the web application being crawled (by tracing the unique PID of the process) and would be getting lots of requests from the crawl.
I let it go for awhile longer, but I could see the crawl rate was really low and dropping. By low I mean less than 10 items per second.
Although I could have configured an additional crawl component on one of the webservers in the farm to improve throughput for this one-time full crawl, I opted instead to call in the VM Administrator (always good to be buddies with this guy) to get an additional 6 cores added to the server as this crawl server was definitely under-provisioned (minimum 4 cores is the recommendation for this size farm). For good measure we threw another 8GB RAM on the server for a total of 16GB although I didn’t believe memory was the constraining resource in this situation.
What did I observe? No change. Well, the only change I noticed was now 8 cores were pinned at almost 100% instead of 2. Crawl rate was still low and dropping. Time was ticking…think Joanne think.
I turned up verbose logging in the farm, but to be honest I couldn’t see the problem. I was swimming in a sea of logs and needed someone to talk thru the problem with. I posted a quick question on Stack Exchange, but at this point I needed to get an answer fast. I made the decision to call in Microsoft Premier Support. To be clear, I don’t call them often or make that decision lightly. As a matter of fact I’ve only done it once before in my 10 or so years of working with SharePoint but I knew I needed help and at that point it was the lifeline I needed.
Working with a couple of their engineers, they could see some timeouts from the external blob store (fileserver) and the SharePoint search process was firing so many requests to the blob store that it couldn’t keep up. They recommended I change the timeout in Search from the default 60 seconds to 150 seconds. This in effect slowed down the number of new requests to the external blob store helping it to return the blobs before the timeout period elapsed.
That seemed to be just enough time for the blob store to return the blobs that were taking a long time. The first content source full crawl finished after running for approximately 5 hours with this change.
I thought that was the end of the problems. I started the second content source full crawl and monitored it as well. Although it started off with a crawl rate of about 200 items per second, it was dropping quickly over time and I was concerned it wouldn’t finish. Also, the CPU on all 8 cores was still pinned at 100%. I talked extra nice to the Windows Server Administrator and they put the crawl server in the highest performing VM resource pool so it could get all of the attention it needed. Finally the full crawl finished after an additional 5 hours although the crawl rate was extremely low. Something still didn’t seem right to me…
The server was moved back into its regular VM resource pool, 4 cores were removed to leave the crawl server back at 4 cores and the regular incremental crawl schedule was put into place to run during regular business hours. That was the end of the planned system outage. Regular business would start the next day…
The incremental crawls started the next morning and guess what? CPU was pinned on all 4 cores and incremental crawls that would normally take less than 5 minutes prior to the big purge were now taking up to an hour. Crawl rate was around 0.01 item per second. What is going on?
It was at this point we discovered the database statistics on some of the databases were out-of-date (several days old) and hadn’t been updated since the purge. Content database stats are normally updated by a SharePoint Health Analysis timer job each night. I don’t know the root cause of why the stats hadn’t been updated as no error was thrown in the ULS logs when that job ran. We decided to create a daily process in SQL to update the statistics for insurance in case the SharePoint timer job couldn’t update the stats by following the guidance in SharePoint and Auto Statistics guidance. (Thanks to @DanielGlenn for guidance on this) .
Checked the crawl times again… still running slow.
I could see it was the w3wp.exe worker process consuming almost all of the CPU again. I decided to dig into IIS for that process so I could see the requests flowing through it (although it took me a surprisingly long time to think to do this) What I saw were several requests from the URLs in the web application (normal activity during a crawl), but there were also 10 requests for the exact same document sitting in there!! Although more than 10 requests were attempted for that document, IIS is throttling the requests after 10 in an effort to reduce the performance impact on the server this particular request may be having. I tried to open up the document by navigating thru the SharePoint UI, but it wouldn’t come back. In fact, it hung my browser session. Hmmmm. I might be onto something…no matter what I did, I couldn’t open that document. I dug into the verbose logs once again searching for this filename and FINALLY I found the error that was coming from the external Remote Blob Storage fileserver when it was trying to find the blob. It couldn’t find it!
1 blob out of 2.5 million! My “needle in the haystack”.
Aha! I had to either recover that blob or delete the document and see if I could clear those requests from the worker process. I received approval from the business to delete the document, did an IISReset to clear the requests and monitored the incremental crawls. They were finally able to complete in under 5 minutes. The CPU was kept to under 40% and most of that is attributed to the mssdmn.exe process for search and not the w3wp.exe process. This finally makes sense and, I believe, is how it should be.
I’ve continued to monitor the crawls over the past few days and they’re back to performing as they were before the large purge and CPU on the crawl server is kept to around 40% during the crawls. Life is good. 🙂
In the end, there were three problems/fixes:
- Fix #1: Increase the search timeout from the default 60 seconds to a higher value (I went to 150 seconds). This helped prevent blob requests from overwhelming the remote blob content store. (TechNet Link)
- Fix #2: Ensure SQL statistics are being updated on your farm. (SharePoint and Auto Statistics guidance)
- Fix #3: Remove the bad blob causing the w3wp process to spin out of control by deleting the associated SharePoint document. This allowed the search process, mssdmn.exe, to crawl with good performance.
Well that was a rough few days, but I did learn some things so that’s the silver lining I’ve decided to take away. Hopefully there are pieces of this story that can help someone else, but if not it was certainly therapeutic to retell it. Another SharePoint battle scar for this girl.
Thanks for reading.