All posts tagged best practices


How to make sure your user profiles show up in search

In my many years of working with enterprise search, the one thing companies want solved first is finding people. They might have an employee directory, or they might already be using SharePoint user profiles, but there are always tweaks that can make it better. It's not rocket science from a technical perspective; the hard part is figuring out which pieces of data about a person should be stored in the SharePoint user profile, where that data comes from (the age-old question of master data), and how you want to use this information in a findability scenario around your employees. Read more…


Search is more than Technology

When talking about search, most people only consider the technology: the search engine is a huge black box that does something invisible, something magical, and is expected to provide relevant results.

However, it is important to understand that search is a complex process – with a great deal of human involvement. Read more…


Search Processes

Autumn is here, and with it so many events: the calendar is full of workshops and conferences again.

One of the sessions I am doing this season has the title “Findability in YOUR Organization”.

There is a reason why I highlighted the word “YOUR”. Each organization is unique and has different needs from the others. Therefore, each organization requires a different approach and solution. There is no “one size fits all” when it comes to Search.

To be able to serve the organization’s search needs, the first thing to understand is these three search processes:

  • Crawl & Index
  • Query
  • Analytics

Read more…


“YOUR Search Success” Coaching Program

During most of my career, I’ve been working as a consultant.
My 10+ years of experience have taught me that being responsible for Search is a huge challenge in every organization. First and foremost, because Search is challenging in itself. Second, because each organization is unique and therefore needs a unique approach. There are too many components to fit together.

Read more…


Webinar – 10 Steps to Be Successful with Enterprise Search

Content Panda delivers in-context help, training and support content in the user interface – right where and when you need it.  Answers Where You Want Them.

I am honored to do a FREE webinar with Content Panda’s Co-Founder and CEO, Simeon Cathey on Thursday, April 21, 2016:

Read more…

How to Organize Content Sources – Best Practices

Working with search for years has given me a lot of experience with Content Sources, as well as a lot of questions about them, like “should I create one huge content source, or would it be better to split it up into smaller ones?” or “can I merge my small content sources into one big one?” or “how should I schedule the crawls for each of my content sources?”.

As usual, there is no “silver bullet” answer here, and the only general answer is “it depends…”.

Hey, but I know the slogan “it depends” is not why you started reading this blog post, so let me give you a few ideas about what to consider when facing these questions.

It’s important to know that sometimes you do not have any (out-of-the-box) choice: your content source is a single unit and cannot be split even though it’s big. A big database, for example.
You may also consider merging your small content sources into one big content source, for example when you have small file shares that are “similar enough” (with similarity defined by you).

In many cases, you have the option to split your huge content source into two or more smaller content sources, or to merge the small ones into bigger ones; sometimes you can do both in the same environment. Examples include:

  • A huge file share that can be split up by subfolders (or groups of subfolders);
  • A SharePoint farm that can be split up by Web Applications, Site Collections, or even sites (although that’s a rare situation);
  • A 3rd party document management system that can be split by its repositories;
  • Small file shares that can be treated as one big content source, and the same goes for small SharePoint sites.

The real question is when to keep (or create) one big content source and when to create multiple smaller ones, IF splitting is possible at all.

Considerations to make here:

  • Content Source types – In SharePoint, the following content source types are available: SharePoint Sites, Web Sites, File Shares, Exchange Public Folders, Line of Business Data, Custom Repository. These types cannot be mixed (for example, you cannot have a SharePoint site and a file share in the same content source), but within the same type you can add more than one start address of your choice. This might be a good option if you have multiple small content sources.
  • Crawling time and schedule – The more changes there have been since the last crawl, the longer the crawl takes. The more often you crawl, the fewer changes you have to process during an incremental crawl. The more often you run an incremental crawl, the less idle time your system will have. But the more often you crawl, the more resources you consume on both the crawler and the source system. And the more often you crawl, the bigger the chance that a crawl cannot finish before the next one is scheduled to start; the result is worse content freshness and worse search performance than you expect.
    Moreover, if you have multiple content sources, you have to align their schedules to keep your system from being overloaded by multiple parallel crawls (see the PowerShell scheduling sketch below).
  • Performance effect on the crawler components – This is an obvious one: crawling takes resources. The more you crawl, the more resources you take. If you crawl more content sources in parallel, it takes more resources. If you run one huge crawl, it takes resources for a longer time. If you don’t have enough resources, the crawl might fail or run “forever”, affecting other crawls.
  • Performance effect on the source system – This is usually the least considered one: crawling takes resources on the source system as well!
  • Bandwidth – Crawling pulls data from the source system to be processed on the indexer components. This data has to be transferred, and that takes bandwidth. In many cases, this is the bottleneck in the whole crawling process, even if the source system and the crawler perform well. The more crawl processes you run at the same time, and the more parallel threads they have, the more bandwidth is needed. Serialized crawls mean more balanced bandwidth requirements.
  • Similar content sources? – At the same time, you might have similar content sources that should be treated the same way. For example, if you have small file shares, you might “aggregate” them, collecting them into one content source, so that their crawls can be managed together. You definitely have to do a detailed inventory for this.
  • Live content vs. Archive – While “live” content changes often, archive content either doesn’t change at all or changes very rarely. While “live” content has to be crawled often, archive content doesn’t need incremental crawls very often. Remember, after the initial full crawl the content is in the index, and because it rarely changes, it can be considered pretty much up to date. So if you have a system (of any kind) with both live and archive content, you’d better split them: crawl the live content often, while the archive doesn’t need any special attention after the initial full crawl.
  • Automated jobs running on the content source – There are many systems where automated jobs create or update content. In most cases, these jobs are time-scheduled, running in the late evenings or early mornings, for example. As these jobs are predictable, we have two best practices here:

It isn’t an easy decision, is it?
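
To illustrate the crawl-schedule consideration above, here is a minimal PowerShell sketch. The Search Service Application name, the content source names “HR” (live content) and “Archive” (archive content), and the times are placeholder assumptions; adjust them to your own environment:

$SSA = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"

# "Live" content: incremental crawl every day at 22:00
Set-SPEnterpriseSearchCrawlContentSource -Identity "HR" -SearchApplication $SSA -ScheduleType Incremental -DailyCrawlSchedule -CrawlScheduleStartDateTime "22:00"

# Archive content: full crawl once a week, on Saturdays at 01:00
Set-SPEnterpriseSearchCrawlContentSource -Identity "Archive" -SearchApplication $SSA -ScheduleType Full -WeeklyCrawlSchedule -CrawlScheduleDaysOfWeek Saturday -CrawlScheduleStartDateTime "01:00"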

During the planning phase of a search project, each of these points should be evaluated, and the result would be something like this table:

Source system      Type             Amount of content   Content Source(s)
X:                 file share       20,000,000          Marketing (X:\Marketing), HR (X:\HR), IT (X:\IT)
Z:                 file share       15,000              Documents
Y:                 file share       100,000             Documents
http://intranet    SharePoint site  2,000,000           Local SharePoint Content
http://extranet    SharePoint site  150,000             Local SharePoint Content
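
If you prefer scripting over clicking through Central Administration, the content sources from the table above can be provisioned with PowerShell, roughly like this (just a sketch: the Search Service Application name and the UNC paths are placeholder assumptions):

$SSA = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"

# The huge X: file share, split into three content sources
New-SPEnterpriseSearchCrawlContentSource -SearchApplication $SSA -Type File -Name "Marketing" -StartAddresses "\\fileserver\Marketing"
New-SPEnterpriseSearchCrawlContentSource -SearchApplication $SSA -Type File -Name "HR" -StartAddresses "\\fileserver\HR"
New-SPEnterpriseSearchCrawlContentSource -SearchApplication $SSA -Type File -Name "IT" -StartAddresses "\\fileserver\IT"

# The small Y: and Z: file shares, merged into one content source with two start addresses
New-SPEnterpriseSearchCrawlContentSource -SearchApplication $SSA -Type File -Name "Documents" -StartAddresses "\\fileserver\Y,\\fileserver\Z"

# The intranet and the extranet, merged into one SharePoint content source
New-SPEnterpriseSearchCrawlContentSource -SearchApplication $SSA -Type SharePoint -Name "Local SharePoint Content" -StartAddresses "http://intranet,http://extranet"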

Word Breakers in SharePoint – from version to version

Recently, I got a question about file naming conventions versus Search: which characters can really be used as word breakers, e.g. for replacing spaces in file names (see Susan Hanley’s great post about file naming conventions here).

For example, let’s say you have a file named “My Cool Test Document.docx”. If you store this document in a file share or upload it to SharePoint and index it, you’ll be able to search for the words “cool” or “Test” or “Document”, and you’ll see this document in the result set.

However, using this file name in SharePoint document libraries will give you the crazy “%20” characters in the URL (see Susan’s post, again).

If we simply eliminate the spaces and use plain camel case in the file name (MyCoolTestDocument.docx), we won’t get the “%20”s, but the search engine will not be able to separate the words.

The third option is, of course, using some other character instead of spaces, for example underscore, like My_Cool_Test_Document.docx.

Well, I started to do some research and found that this area is not really well documented, so I decided to run some tests myself. I created some “lorem ipsum” documents with different names, using several special characters in place of the space (a sketch of how such test files can be generated follows the test details below).

My test details:

  • SharePoint versions I tested against
    • MOSS 2007 Enterprise
    • SP 2010 Enterprise, no FAST
    • FS4SP
    • SP2013 Enterprise
    • O365
  • Content sources:
    • File share
    • SharePoint document library
  • Characters used in file names: “-“, “_”, “.”, “&”, “%”, “+”, “#”
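
For reference, this is roughly how such test files can be generated (a sketch only: the target folder, the file names and the .txt extension are my assumptions; for the document library tests the files also have to be uploaded to SharePoint and crawled):

# Create one "lorem ipsum" test file per separator character
$separators = "-", "_", ".", "&", "%", "+", "#"
New-Item -Path "C:\WordBreakerTest" -ItemType Directory -Force | Out-Null
foreach ($sep in $separators) {
    $fileName = "My{0}Cool{0}Test{0}Document.txt" -f $sep
    Set-Content -Path (Join-Path "C:\WordBreakerTest" $fileName) -Value "Lorem ipsum dolor sit amet"
}
# Note: "&", "%" and "#" are valid in NTFS file names but are rejected by SharePoint document libraries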

The results are absolutely consistent with Susan’s recommendations:

  1. Content source: file share

Character   MOSS 2007   SP2010   FS4SP    SP2013
-           yes         yes      yes      NO
_           NO          yes      yes      yes
.           yes         yes      yes      yes
&           yes         yes      yes      yes
%           yes         yes      yes      yes
+           yes         yes      yes      yes
#           yes         yes      yes      yes
  2. Content source: SharePoint document library

Character   MOSS 2007   SP2010   FS4SP    SP2013   O365
-           yes         yes      yes      yes      yes
_           NO          yes      yes      yes      yes
.           yes         yes      yes      yes      yes
&           invalid     invalid  invalid  invalid  invalid
%           invalid     invalid  invalid  invalid  invalid
+           yes         yes      yes      yes      yes
#           invalid     invalid  invalid  invalid  invalid

In these tables, YES means the character works fine as a word breaker.

NO means the character does NOT work as a word breaker; the engine cannot split the words by it. For example, the underscore “_” cannot be used as a word breaker in MOSS 2007, so it’s better to use a different character on this old version of SharePoint.

It’s quite interesting that “-” is not a word breaker in SharePoint 2013 IF the content source is a file share.

Invalid means the character is not allowed in file names in any SharePoint document library.

One last note: there are also some issues with hit highlighting in MOSS 2007.

How to check the Crawl Status of a Content Source

As you know, I work with SharePoint and FAST Search a lot. I have a lot of tasks where I have to sit on the F5 button while a crawl is running and check its status: has it started? Is it still crawling? Is it finished yet?…

I have to hit F5 every minute. I’m too lazy for that, so I decided to write a PowerShell script that does nothing but check the crawl status of a Content Source and write it to the console. This way I can work on my second screen while the crawl is working and working and working, without me touching F5.

The script is pretty easy:

$SSA = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"
$ContentSource = $SSA | Get-SPEnterpriseSearchCrawlContentSource -Identity "My Content Source"

do {
    Write-Host $ContentSource.CrawlStatus (Get-Date).ToString() "-" $ContentSource.SuccessCount "/" $ContentSource.WarningCount "/" $ContentSource.ErrorCount
    Start-Sleep 5
} while ($true)

Yes, it works fine for FAST (FS4SP) Content Sources too.

How to Schedule Crawl Start/Pause in SharePoint 2010 Search by PowerShell

If your hardware is not strong enough, there’s a pretty common request to start the crawl in the evening and pause it the next morning, before the work day starts. Scheduling the start of a Full/Incremental Crawl is pretty easy from the admin UI, but you have to do a trick if you want to schedule the pause too. Here is my favorite trick: use PowerShell!

Here is what I do:

  1. Create a script to start/resume the crawl (CrawlStart.ps1).
  2. Create a script to pause the crawl (CrawlPause.ps1).
  3. Schedule the script CrawlStart.ps1 to run in the evening (like 6pm).
  4. Schedule the script CrawlPause.ps1 to run in the morning (like 6am).

Simple, right? 😉

Here are some more details.

First, we have to know how to add the SharePoint SnapIn to PowerShell. Here is the command we need: Add-PSSnapin Microsoft.SharePoint.PowerShell.

Second, we have to get the Content Source from our Search Service Application:

$SSA = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"
$ContentSource = $SSA | Get-SPEnterpriseSearchCrawlContentSource -Identity "My Content Source"

Then we have to know how to check the status of this content source’s crawl: $ContentSource.CrawlStatus. Here are the available values:

  • Idle
  • CrawlStarting
  • CrawlingIncremental / CrawlingFull
  • CrawlPausing
  • Paused
  • CrawlResuming
  • CrawlCompleting
  • CrawlStopping

Finally, we have to know how to start/pause/resume the crawling:

  • Start Full Crawl: $ContentSource.StartFullCrawl()
  • Start Incremental Crawl: $ContentSource.StartIncrementalCrawl()
  • Pause the current crawl: $ContentSource.PauseCrawl()
  • Resume the crawl: $ContentSource.ResumeCrawl()

That’s it. Here are the final scripts:

1. CrawlStart.ps1

Add-PSSnapin Microsoft.SharePoint.PowerShell

$SSA = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"
$ContentSource = $SSA | Get-SPEnterpriseSearchCrawlContentSource -Identity "My Content Source"

if ($ContentSource.CrawlStatus -eq "Idle") {
    $ContentSource.StartIncrementalCrawl()
    Write-Host "Starting Incremental Crawl"
}

if ($ContentSource.CrawlStatus -eq "Paused") {
    $ContentSource.ResumeCrawl()
    Write-Host "Resuming Incremental Crawl"
}

2. CrawlPause.ps1

Add-PSSnapin Microsoft.SharePoint.PowerShell

$SSA = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"
$ContentSource = $SSA | Get-SPEnterpriseSearchCrawlContentSource -Identity "My Content Source"

Write-Host $ContentSource.CrawlStatus

if (($ContentSource.CrawlStatus -eq "CrawlingIncremental") -or ($ContentSource.CrawlStatus -eq "CrawlingFull")) {
    $ContentSource.PauseCrawl()
    Write-Host "Pausing the current Crawl"
}

Write-Host $ContentSource.CrawlStatus

And finally, you have to schedule these scripts as Windows scheduled tasks, using actions like powershell -command "& 'C:\Scripts\CrawlStart.ps1'" to start and powershell -command "& 'C:\Scripts\CrawlPause.ps1'" to pause your crawl.
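
If you’d rather create the scheduled tasks from the command line than from the Task Scheduler UI, something like this should do it (a sketch: the task names and times are my assumptions, and I use -File instead of -Command to keep the quoting simple; run the tasks under an account that has permissions on the Search Service Application):

schtasks /Create /TN "Crawl Start" /SC DAILY /ST 18:00 /TR "powershell -File C:\Scripts\CrawlStart.ps1"
schtasks /Create /TN "Crawl Pause" /SC DAILY /ST 06:00 /TR "powershell -File C:\Scripts\CrawlPause.ps1"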

PS: These scripts work fine for FAST Content Sources in SharePoint too; in that case, you have to use the FAST Content SSA.

Enjoy!

PowerShell script for exporting Crawled Properties (FS4SP)

Recently, I was working with FAST Search Server 2010 for SharePoint (FS4SP) and had to provide a list of all crawled properties in the category MyCategory. Here is my pretty simple script that provides the list in a .CSV file:

$outputfile = "CrawledProperties.csv"

if (Test-Path $outputfile) { Clear-Content $outputfile }

foreach ($crawledproperty in (Get-FASTSearchMetadataCrawledProperty)) {
    $category = $crawledproperty.CategoryName
    if ($category -eq "MyCategory")
    {
        # Get the name and type of the crawled property
        $name = $crawledproperty.Name
        $type = $crawledproperty.VariantType
        switch ($type) {
            20      {$typestr = "Integer"}
            31      {$typestr = "Text"}
            11      {$typestr = "Boolean"}
            64      {$typestr = "DateTime"}
            default {$typestr = "Other"}
        }
        # Build the output: $name and $typestr
        $msg = $name + " " + $typestr
        Write-Output $msg | Out-File $outputfile -Append
    }
}

$msg = "Crawled properties have been exported to the file " + $outputfile
Write-Output ""
Write-Output $msg
Write-Output ""

SharePoint Summit 2011 Presentations

I’ve just uploaded my presentations at SharePoint Summit 2011 (Updated links to SlideShare!):

All feedback is welcome here!

How to Test your FAST Search Deployment?

Recently, I made a farm setup where SharePoint 2010 (SP2010) and FAST Search Server 2010 for SharePoint (FS4SP) had to be installed on separate boxes. After a successful installation, it’s always useful to do some testing before indexing the production content sources. In the case of FS4SP, it’s much easier than you’d think.

First, you have to push some content to the content collection. Follow these steps:

  1. Create a new document anywhere on your local machine, for example C:\FAST_test.txt
  2. Fill some content into this document, for example: Hello world, this is my FAST Test doc.
  3. Save the document.
  4. Run the Microsoft FAST Search Server 2010 for SharePoint shell.
  5. Run the following command: docpush -c <collection name> "<full path to a file>" (in my case: docpush -c sp C:\FAST_test.txt). (See the full docpush reference here.)

If this command runs successfully, your document has been pushed to the FAST content collection and can be queried. The next step is to test some queries:

  1. Open a browser on the FAST server and go to http://localhost:[base_port+280]. If you used the default base port number (13000), you should go to http://localhost:13280 (a small helper for this is sketched after these steps). This is the FAST Query Language (FQL) test page, so you can do some testing directly here.
  2. Search for a word contained in the document you’ve uploaded (C:\FAST_test.txt). For example, search for the word ‘world’ or ‘FAST’. The result set should contain the document you uploaded to the content collection before.
  3. Also, you can set some other parameters on the FQL testing page, for example language setting, debug info, etc.
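
To save yourself the port arithmetic from step 1, a tiny PowerShell helper like this can build the URL and open it in the default browser (assuming the default FS4SP base port of 13000):

$basePort = 13000                      # your FS4SP base port
$qrPort = $basePort + 280              # the FQL test page port
Start-Process ("http://localhost:" + $qrPort)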

FAST Search Query Language (FQL) test page

But this site (http://localhost:13280) is much more than a simple FQL testing page. In the top navigation, there are other useful functions too:

  • Log
  • Configuration
  • Control
  • Statistics
  • Exclusion List
  • Reference

I’ll dive deeper into these functions in a later post. Stay tuned!