When talking about search, most people only consider technology: the search engine is a huge black box that does something invisible, something magic, and is expected to provide the relevant results.
However, it is important to understand that search is a complex process – with a great deal of human involvement.
Autumn is here, and the calendar is full of workshops and conferences again.
One of the sessions I am doing this season has the title “Findability in YOUR Organization”.
The reason I highlight the word *YOUR* is that each organization is unique, with different needs. Therefore, each organization needs a different approach and a different solution. There is no “one size fits all” when it comes to search.
To be able to serve the organization’s search needs, the first things to understand are the three search processes:
- Crawl & Index
During most of my career, I’ve been working as a consultant.
My 10+ years of experience tell me that being responsible for Search is a huge challenge in every organization. First and foremost, because Search is challenging in itself. Second, because each organization is unique and therefore needs a unique approach. There are too many components that have to fit together.
Content Panda delivers in-context help, training and support content in the user interface – right where and when you need it. Answers Where You Want Them.
I am honored to do a FREE webinar with Content Panda’s Co-Founder and CEO, Simeon Cathey on Thursday, April 21, 2016:
Working with search for years has given me a lot of experience with Content Sources, as well as a lot of questions about them, like “should I create one huge content source, or would it be better to split it up into smaller ones?” or “can I combine my small content sources into one big one?” or “how should I schedule the crawls for each of my content sources?”.
As usual, there is no “silver bullet” answer here; the only general answer is “it depends…”
Hey, but I know the slogan “it depends” is not why you started to read this blog post, so let me at least give you some ideas about what to consider when facing these questions.
You have to know, though: sometimes you don’t have any choice. Your content source is one big unit that cannot be split, for example a big database.
You might also think about combining your small content sources into one big one, for example when you have small file shares that are “similar enough” (what counts as similar must be your definition, again).
In many cases, you can split your huge content source into two or more smaller ones, and/or you can combine the small ones into big(ger) ones.
For example, a huge file share can get split up by the subfolders (or subfolder groups). Or a SharePoint farm can get split up by Web Applications or Site Collections or even sites (although it’s a rare situation). Or a 3rd party document management system can get split by the repositories.
Small file shares can be treated as one big content source, as can small SharePoint sites.
You get the point.
The real question is: when to keep/create one big content source and when to create multiple smaller ones IF it’s possible to split.
Considerations to make here:
- Content Source types – In SharePoint, the following content source types are available: SharePoint Sites, Web Sites, File Shares, Exchange Public Folders, Line of Business Data, Custom Repository. These types cannot be mixed (for example, you cannot have a SharePoint site and a file share in the same content source), but within the same type, you can add more than one start address of your choice. This might be a good option if you have multiple, small content sources.
- Crawling time and schedule – The more changes you have since the last crawl, the longer the crawl takes. The more often you crawl, the fewer changes you have to process during an incremental crawl. The more often you run an incremental crawl, the less idle time your system will have. But the more often you crawl, the more resources you consume on both the crawler and the source system. And the more often you crawl, the bigger the chance that a crawl cannot finish before the next one is due to start. The result is worse content freshness and worse search performance than you expect.
Moreover, if you have multiple content sources, you have to align their schedules so that your system is not overloaded by multiple parallel crawls.
- Performance effect on the crawler components – This is an obvious one: crawling takes resources. The more you crawl, the more resources you take. If you crawl more content sources in parallel, it takes more resources. If you run one huge crawl, it takes resources for a longer time. If you don’t have enough resources, the crawl might fail or run “forever”, affecting other crawls.
- Performance effect on the source system – This is usually the one considered least: crawling takes resources on the source system as well!
- Bandwidth – Crawling pulls data from the source system to be processed on the indexer components. This data has to be transferred, and this takes bandwidth. In many cases, this is the bottleneck in the whole crawling process, even if the source system and the crawler perform well. The more crawling processes you run at the same time, and the more parallel threads they have, the more bandwidth is needed. Serialized crawls mean more balanced bandwidth requirements.
- Similar content sources? – At the same time, you might have similar content sources that should be treated the same. For example, if you have small file shares, you might “aggregate” them, that is, collect them into one content source, so that their crawls can be managed together. You definitely have to do a detailed inventory for this.
- Live content vs. Archive – While “live” content changes often, archive content either doesn’t change at all or changes very rarely. While “live” content has to be crawled often, the archive doesn’t need incremental crawls to run very often. Remember, after the initial full crawl, the content is in the index, and due to the rare changes, it can be considered pretty up-to-date. So if you have a system (of any kind) with both live and archive content, you’d better split them and crawl the live content often, while the archive doesn’t need any special attention after the initial full crawl.
- Automated jobs running on the content source – There are many systems where automated jobs create or update content. In most cases, these jobs are time-scheduled, running in the late evenings or early mornings, for example. As these jobs are predictable, we have two best practices here:
It isn’t an easy decision, is it?
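To make the scheduling part of these considerations a bit more tangible, here is a minimal Python sketch of the two checks described above: does a crawl finish before the next one is due, and do the crawls of different content sources run in parallel? This is not a SharePoint API; the `CrawlPlan` class, the content source names and all the numbers are made up for illustration, and the durations would have to come from your own crawl logs. It also only looks at a single day and ignores crawls that run past midnight.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class CrawlPlan:
    """Hypothetical description of one content source's incremental crawl schedule."""
    name: str
    start_minute: int       # first crawl start, in minutes after midnight
    interval_minutes: int   # time between two crawl starts
    expected_duration: int  # estimated crawl duration in minutes (an assumption)


def overruns(plan):
    """True if a crawl is expected to still be running when the next one is due."""
    return plan.expected_duration >= plan.interval_minutes


def windows(plan, day_minutes=24 * 60):
    """All (start, end) crawl windows of one plan that start within a single day."""
    return [
        (start, start + plan.expected_duration)
        for start in range(plan.start_minute, day_minutes, plan.interval_minutes)
    ]


def overlapping_pairs(plans):
    """Names of content source pairs whose crawl windows intersect at least once."""
    pairs = []
    for a, b in combinations(plans, 2):
        if any(s1 < e2 and s2 < e1
               for s1, e1 in windows(a)
               for s2, e2 in windows(b)):
            pairs.append((a.name, b.name))
    return pairs


# Made-up example: live content crawled hourly, file shares every two hours,
# archive only once a day.
plans = [
    CrawlPlan("Intranet (live)", start_minute=0, interval_minutes=60, expected_duration=20),
    CrawlPlan("Project file shares", start_minute=30, interval_minutes=120, expected_duration=45),
    CrawlPlan("Archive", start_minute=200, interval_minutes=1440, expected_duration=25),
]

for plan in plans:
    status = "overruns its schedule" if overruns(plan) else "fits its schedule"
    print(f"{plan.name}: {status}")

print("Crawls running in parallel:", overlapping_pairs(plans))
```

With these made-up numbers, every crawl fits into its own interval, but the hourly intranet crawl collides with the file share crawl. Whether that is a problem depends on your resources and bandwidth; the point is only that the overlap becomes visible before you schedule anything.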
During the planning phase of a search project, each of these points should be evaluated, and the result would be something like this table:
||Amount of content
||Local SharePoint Content