Recently I’ve been working with a customer where my job was to make their SQL-based content management system searchable in SharePoint. Nice challenge. One of the most interesting parts was what I call the “time machine”.
Imagine a nice, big environment where a full crawl takes more than 2 weeks. There are several points during such a project where we need a full crawl, for example when working with managed properties. But if a full crawl takes that long, it’s always a pain. You know, when you can even go on holiday while it’s running 😉
We were getting close to the end of the project, incrementals were scheduled, etc., but it turned out there were some items that had been put into the database recently with an older “last modified date”. How can this happen? With some app, for example, or if users can work offline and upload their docs later (depending on the source system’s capabilities, sometimes these docs get the original time stamp, sometimes the current upload time as “last modified date”). If items arrive with ever-increasing “last modified dates”, incremental crawls are easy to do, but imagine this sequence:
- Full crawl, everything in the database gets crawled.
- Item1 has been added, last_modified_date = ‘2013-08-09 12:45:27’
- Item2 has been modified, last_modified_date = ‘2013-08-09 12:45:53’
- Incremental crawl at ‘2013-08-09 12:50:00’. Result: Item1 and Item2 crawled.
- Item3 has been added, last_modified_date = ‘2013-08-09 12:58:02’
- Incremental crawl at ‘2013-08-09 13:00:00’. Result: Item3 crawled.
- Item4 has been added by an external tool, last_modified_date = ‘2013-08-09 12:45:00’. Note that this time stamp is earlier than the previous crawl’s time.
- Incremental crawl at ‘2013-08-09 13:10:00’. Result: nothing gets crawled.
The reason is that Item4’s last_modified_date time stamp is older than the previous crawl, and the crawler assumes every change has happened since then (i.e. no time machine built into the backend 😉 ).
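To make the sequence above concrete, here is a small Python simulation. It is only a toy model, not how the SharePoint crawler is implemented: it assumes the incremental crawl simply picks up every item whose last_modified_date is later than the previous crawl time. The names `ts`, `incremental_crawl` and `database` are made up for this sketch, and the 12:00:00 end time of the full crawl is an assumption, since the sequence above does not give one.

```python
from datetime import datetime


def ts(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' time stamp."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def incremental_crawl(database, last_crawl_time):
    """Return the items whose last_modified_date is later than the previous crawl."""
    return sorted(name for name, modified in database.items()
                  if modified > last_crawl_time)


database = {}
last_crawl = ts("2013-08-09 12:00:00")  # assumed end of the full crawl

# Item1 added, Item2 modified, then the 12:50:00 incremental crawl.
database["Item1"] = ts("2013-08-09 12:45:27")
database["Item2"] = ts("2013-08-09 12:45:53")
first = incremental_crawl(database, last_crawl)
last_crawl = ts("2013-08-09 12:50:00")

# Item3 added, then the 13:00:00 incremental crawl.
database["Item3"] = ts("2013-08-09 12:58:02")
second = incremental_crawl(database, last_crawl)
last_crawl = ts("2013-08-09 13:00:00")

# Item4 arrives "from the past", then the 13:10:00 incremental crawl.
database["Item4"] = ts("2013-08-09 12:45:00")
third = incremental_crawl(database, last_crawl)

print(first)   # ['Item1', 'Item2']
print(second)  # ['Item3']
print(third)   # [] because Item4 is older than the previous crawl
```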
What to do now?
The first option is a full crawl. But:
- If a full crawl takes more than 2 weeks, it’s not always an option. We have to avoid it if possible.
- We can assume the very same thing can happen anytime in the future, i.e. docs appearing from the past, even from before the last crawl time. And a full crawl is not an option, see the previous point.
Obviously, the customer would like to see these “time travelling” items in the search results as well, but it looks like neither a full nor an incremental crawl is an option.
But consider this idea: what if we could trick the incremental crawl into thinking the previous crawl was not 10 minutes ago but a month ago (or two, or a year, depending on how old the newly appearing docs can be)? In this case, the incremental crawl would not check for items added or modified since the last incremental, but for everything changed in the last month (or two, or a year, etc.). Time machine, you know… 😉
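The idea can be sketched in the same toy Python model. Again, this is an illustration under assumptions, not the crawler's real logic: `crawl_since` is a made-up helper that returns everything modified after a given point in time, and the 30-day look-back is just an example value.

```python
from datetime import datetime, timedelta


def ts(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' time stamp."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def crawl_since(database, since):
    """Return the items whose last_modified_date is later than 'since'."""
    return sorted(name for name, modified in database.items()
                  if modified > since)


database = {
    "Item1": ts("2013-08-09 12:45:27"),
    "Item2": ts("2013-08-09 12:45:53"),
    "Item3": ts("2013-08-09 12:58:02"),
    "Item4": ts("2013-08-09 12:45:00"),  # the "time travelling" item
}
last_crawl = ts("2013-08-09 13:00:00")

# Normal incremental: only looks at changes after the previous crawl.
normal = crawl_since(database, last_crawl)

# "Time machine": pretend the previous crawl happened 30 days earlier.
shifted = crawl_since(database, last_crawl - timedelta(days=30))

print(normal)   # []
print(shifted)  # ['Item1', 'Item2', 'Item3', 'Item4']
```

Note the price in this model: the shifted crawl also picks up Item1, Item2 and Item3 again, so the further back you go, the more already-indexed items get re-crawled. Still a lot cheaper than two weeks of full crawl.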
Guess what? It’s possible. The solution is not official, not supported, but it works. The “only” thing you have to do is modify the relevant time stamps in the MSSCrawlURL table.
Why? Because the crawler determines the “last crawl time” from this table. If you set the time stamps back, the crawler thinks the previous crawl was long ago and goes back in time to get the changes from that longer period. And in this case, without doing a full crawl, you’ll get every item indexed, even the “time travelling” ones from the past.
P.S. The same can be done if you have last_modified_date values in the future. The best docs from the future I’ve seen so far were created in 2127…
The problem in this case is that as soon as you crawl any of these, the crawler considers 2127 as the year of the last crawl, and nothing created before that (i.e. in the present) will get crawled by any upcoming incrementals. Until 2127, of course 😉
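Here is the same effect as a toy Python sketch. It assumes, as described above, that the crawler's “last crawl” marker ends up following the newest time stamp it has seen. That is a simplification made for illustration, and the names `incremental_crawl`, `watermark`, `FutureDoc` and `PresentDoc` are invented for this sketch.

```python
from datetime import datetime


def ts(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' time stamp."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def incremental_crawl(database, watermark):
    """Crawl items newer than the watermark, then move the watermark forward.

    Assumption: the new watermark is the newest last_modified_date
    among the items just crawled (or the old one if nothing was crawled).
    """
    picked = sorted(name for name, modified in database.items()
                    if modified > watermark)
    new_watermark = max([watermark] + [database[name] for name in picked])
    return picked, new_watermark


watermark = ts("2013-08-09 13:00:00")

# A doc "from the future" gets crawled and drags the watermark to 2127.
database = {"FutureDoc": ts("2127-01-01 00:00:00")}
crawled_first, watermark = incremental_crawl(database, watermark)

# A perfectly normal doc added afterwards is now invisible to incrementals.
database["PresentDoc"] = ts("2013-08-09 13:05:00")
crawled_second, watermark = incremental_crawl(database, watermark)

print(crawled_first)   # ['FutureDoc']
print(watermark.year)  # 2127
print(crawled_second)  # []
```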