Working with search for years has given me a lot of experience with Content Sources, as well as a lot of questions about them, like “should I create one huge content source, or would it be better to split it up into smaller ones?” or “can I amass my small content sources into one big one?” or “how should I schedule the crawls for each of my content sources?”.
As usual, there is no “silver bullet” here, and the only general answer is “it depends…”
Hey, but I know the slogan “it depends” is not why you started to read this blog post, so let me give you a few ideas of what you should consider when facing these questions.
It’s important to know that sometimes you do not have any (out-of-the-box) choice: your content source is a single unit and cannot be split, even though it’s big. A big database is a typical example.
You may also think about amassing your small content sources into one big content source, for example when you have small file shares that are “similar enough” (how you define similarity is up to you).
In many cases, you have the option to split your huge content source into two or more smaller ones, or to amass the small ones into bigger ones. Sometimes you can do both in the same environment. Examples include:
- A huge file share that can get split up by subfolders (or subfolder groups);
- A SharePoint farm you can split up by Web Applications or Site Collections or even sites (although it’s a rare situation);
- A 3rd party document management system that can be split by repository;
- Small file shares, or small SharePoint sites, that can be treated as one big content source.
The real question is when to keep/create one big content source and when to create multiple smaller ones IF it’s possible to split.
Considerations to make here:
- Content Source types – In SharePoint, the following content source types are available: SharePoint Sites, Web Sites, File Shares, Exchange Public Folders, Line of Business Data, Custom Repository. These types cannot be mixed (for example, you cannot have a SharePoint site and a file share in the same content source), but within the same type you can add more than one start address of your choice. This might be a good option if you have multiple small content sources.
- Crawling time and schedule – The more changes you have since the last crawl, the longer the crawl takes. The more often you crawl, the fewer changes you have to process during an incremental crawl. The more often you run an incremental crawl, the less idle time your system has. But the more often you crawl, the more resources you consume on both the crawler and the source system, and the greater the chance that a crawl cannot finish before the next one is due to start. The result is worse content freshness and worse search performance than you expect.
Moreover, if you have multiple content sources, you have to align their schedules so that your system is not overloaded by multiple parallel crawls.
- Performance effect on the crawler components – This is an obvious one: crawling takes resources. The more you crawl, the more resources you take. If you crawl more content sources in parallel, it takes more resources. If you run one huge crawl, it takes resources for a longer time. If you don’t have enough resources, the crawl might fail or run “forever”, affecting other crawls.
- Performance effect on the source system – This is the one that is most often overlooked: crawling takes resources on the source system as well!
- Bandwidth – Crawling pulls data from the source system that will be processed on the indexer components. This data has to be transferred, and that takes bandwidth. In many cases, this is the bottleneck in the whole crawling process, even if the source system and the crawler perform well. The more crawls you run at the same time, and the more parallel threads they have, the more bandwidth is needed. Serialized crawls mean more balanced bandwidth requirements.
- Similar content sources? – At the same time, you might have similar content sources that should be treated the same. For example, if you have small file shares, you might “aggregate” them into one content source, so that their crawls can be managed together. You definitely have to do a detailed inventory for this.
- Live content vs. Archive – While “live” content changes often, archive content either doesn’t change at all or changes very rarely. While “live” content has to be crawled often, the archive doesn’t need frequent incremental crawls. Remember, after the initial full crawl the content is in the index, and because changes are rare, it can be considered pretty up-to-date. So, if you have a system of any kind with both live and archive content, you had better split them: crawl the live content often, while the archive needs no special attention after the initial full crawl.
- Automated jobs running on the content source – There are many systems where automated jobs create or update content. In most cases, these jobs are time-scheduled, running in the late evenings or early mornings, for example. As these jobs are predictable, we have two best practices here:
It isn’t an easy decision, is it?
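To make the scheduling considerations above a bit more tangible, here is a minimal Python sketch of two sanity checks: whether a crawl can finish before its next scheduled start, and which content sources would be crawling at the same time. The content source names, start times, intervals and durations are all made-up assumptions for illustration. This is not a SharePoint API, just plain arithmetic over one day, and it ignores crawls that run past midnight.

```python
from dataclasses import dataclass

DAY_MINUTES = 24 * 60


@dataclass
class CrawlPlan:
    """Hypothetical description of one content source's incremental crawl schedule."""
    name: str
    start_minute: int      # minutes after midnight when the first crawl starts
    interval_minutes: int  # time between two scheduled crawl starts
    duration_minutes: int  # estimated duration of one incremental crawl


def cannot_finish_in_time(plan: CrawlPlan) -> bool:
    """True if a crawl is still running when the next one is due to start."""
    return plan.duration_minutes >= plan.interval_minutes


def busy_minutes(plan: CrawlPlan) -> set:
    """All minutes of the day during which this content source is being crawled."""
    busy = set()
    start = plan.start_minute
    while start < DAY_MINUTES:
        end = min(start + plan.duration_minutes, DAY_MINUTES)
        busy.update(range(start, end))
        start += plan.interval_minutes
    return busy


def overlapping_pairs(plans: list) -> list:
    """Pairs of content sources whose crawls run in parallel at some point."""
    pairs = []
    for i, first in enumerate(plans):
        for second in plans[i + 1:]:
            if busy_minutes(first) & busy_minutes(second):
                pairs.append((first.name, second.name))
    return pairs


# Made-up example values, not measurements from a real farm.
plans = [
    CrawlPlan("intranet", start_minute=0, interval_minutes=60, duration_minutes=20),
    CrawlPlan("file shares", start_minute=30, interval_minutes=120, duration_minutes=25),
    CrawlPlan("archive", start_minute=120, interval_minutes=DAY_MINUTES, duration_minutes=90),
]

for plan in plans:
    if cannot_finish_in_time(plan):
        print(f"{plan.name}: crawl cannot finish before the next one starts")

for first, second in overlapping_pairs(plans):
    print(f"{first} and {second} crawl in parallel at some point")
```

With these assumed numbers, the hourly intranet crawl and the two-hourly file share crawl never collide, but the nightly archive crawl overlaps with both, which is a hint to move it or to shorten it.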
During the planning phase of a search project each of these points should be evaluated and the result would be something like this table:
|Source system|Type|Amount of content|Content Source(s)|
|---|---|---|---|
|http://intranet|SharePoint site|2,000,000|Local SharePoint Content|
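Since content source types cannot be mixed, a first pass over such an inventory could simply group the start addresses by type, as in the rough Python sketch below. Only `http://intranet` comes from the table above. The two file share paths are hypothetical, and the function is an illustration of the planning step, not something SharePoint provides.

```python
from collections import defaultdict

# The content source types available in SharePoint, as listed earlier in this post.
CONTENT_SOURCE_TYPES = {
    "SharePoint Sites",
    "Web Sites",
    "File Shares",
    "Exchange Public Folders",
    "Line of Business Data",
    "Custom Repository",
}


def group_by_type(inventory):
    """Group start addresses by content source type; one type per content source."""
    grouped = defaultdict(list)
    for address, source_type in inventory:
        if source_type not in CONTENT_SOURCE_TYPES:
            raise ValueError(f"Unknown content source type: {source_type!r}")
        grouped[source_type].append(address)
    return dict(grouped)


inventory = [
    ("http://intranet", "SharePoint Sites"),      # from the planning table
    (r"\\fileserver\teams", "File Shares"),       # hypothetical file share
    (r"\\fileserver\projects", "File Shares"),    # hypothetical file share
]

for source_type, addresses in group_by_type(inventory).items():
    print(f"{source_type}: {', '.join(addresses)}")
```

Each group is a candidate for one content source with several start addresses. Whether you keep a group together or split it further still depends on the considerations above.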