OCR’ing all of the PDF files in a SharePoint Document Library using PowerShell and Solid PDF Tools

A recent review of the PDF documents in our Document Control Library revealed that most were “image only” PDFs. We’ve run our document control system on different versions of SharePoint technologies since SharePoint Portal Server 2001. We are currently running SharePoint 2007. I’m surprised that no one previously noticed that most of our PDF files were not showing up in searches.

The question is: “How can we get all of these PDFs reprocessed to be searchable for a reasonable cost?” I spent some time reviewing various PDF tools and was surprised that few were “truly” scriptable in batch. Some tools would not run on a server or could not interface with an “HTTP Network Place”. There were some development kits that were marketed and priced for building your own PDF tools, and they carry a pretty spendy price tag. These were out of scope for this year’s budget. Some, like Acrobat, had a batching mechanism in their GUI. However, this would require us to:

  • Manually check out all of the PDFs
  • Download them into a folder structure
  • Batch them in the UI
  • Upload all of the OCR’ed PDFs
  • Manually check in all of the PDFs

I don’t think so! We are talking about tens of thousands of documents. This would be labor-intensive and prone to human error.

I’d just about given up finding something that would fit into the current budget when I received an email notice about Solid Converter PDF v5 being available. While reviewing the v5 features, I noticed Solid PDF Tools v5 was out and it was just $20 more than the upgrade to Solid Converter PDF v5. Lo and behold, Solid PDF Tools v5 claims that it has a scriptable interface. I got to looking further and discovered their script reference manual and developer blog. I downloaded the trial and verified it. The software worked, so I sent them my $59 upgrade fee and voilà, I had a “scriptable enough” tool that I could use to re-scan “image-only” PDF files to create “searchable” PDF files. I say “scriptable enough” because it gets the job done, but Solid PDF Tools v5 needs to load the GUI, load the splash screen, and display the UI for each file processed. This seems to add 10-15 seconds to the processing time of each file.

Now to convert the “image-only” PDFs in SharePoint. Once again, I decided to try PowerShell rather than write a C# program to interface with SharePoint. At the same time, I decided to give the PowerGUI Tools a try. I found the PowerGUI Script Editor to be quite useful for developing, debugging and running my script.

The “proof-of-concept” result is ~100 lines of PowerShell code that:

  • processes all of the webs in a SharePoint site
  • processes all of the folders and sub-folders
  • processes all of the PDF documents and sends them to OCR processing
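
The traversal described in the list above can be sketched roughly as follows. This is an illustration only, not the script itself: the function name `Invoke-ForEachPdf` and the placeholder URL are made up for the example, and the recursion relies only on the `Files` and `SubFolders` properties that SharePoint's `SPFolder` exposes.

```powershell
# Sketch only: walk a folder tree and call $action for every PDF file.
# Works on any object exposing .Files (items with a .Name) and .SubFolders,
# which is the shape of SharePoint's SPFolder.
function Invoke-ForEachPdf($folder, [scriptblock]$action) {
    foreach ($file in $folder.Files) {
        # -like is case-insensitive, so "report.PDF" matches too.
        if ($file.Name -like "*.pdf") {
            & $action $file
        }
    }
    foreach ($sub in $folder.SubFolders) {
        Invoke-ForEachPdf $sub $action
    }
}

# Hypothetical usage on a SharePoint server (the URL is a placeholder):
# $site = New-Object Microsoft.SharePoint.SPSite("http://servername")
# foreach ($web in $site.AllWebs) {
#     Invoke-ForEachPdf $web.RootFolder { param($f) Write-Host $f.Url }
#     $web.Dispose()
# }
# $site.Dispose()
```

Passing the per-file work in as a script block keeps the walk separate from the download, OCR and upload steps, so each can be tested on its own.
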

## References
[void][System.Reflection.Assembly]::LoadWithPartialName("Microsoft.SharePoint")
# System.IO types (FileStream, BinaryWriter) live in mscorlib, which is always
# loaded, so no separate assembly load is needed for them.

$SolidPDFTools = "$env:SolidPDFTools.exe" 
$LocalFileFolder= "D:\spwork\input";
$OCRWorkFolder= "D:\spwork\output";
$OCRWorkLogFolder= "D:\spwork\logs";
$OCRScriptFile= "D:\spwork\ocr.sdscript";

# Download a SharePoint file (SPFile) into a local work folder.
function script:write_local_file($file, $fileFolder) {
	$fs = New-Object System.IO.FileStream($(Join-Path $fileFolder $file.Name), [System.IO.FileMode]::Create)
	$bw = New-Object System.IO.BinaryWriter($fs);
	# OpenBinary() returns the whole document as a single byte array.
	[Byte[]] $binfile = $file.OpenBinary();
	$bw.Write($binfile);
	$bw.Close();
	$fs.Close();
}

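# OCR one downloaded file; paths are built from the input and output work folders.
function script:ocr_local_file($file, $in_fpath, $out_fpath) {
	# -replace takes a regex: "\\" matches one backslash and the replacement
	# is two, so every path separator ends up doubled (escaped).
	$infile = Join-Path $in_fpath $file.Name;
	$infile = $infile -replace("\\","\\");
	$outfile = Join-Path $out_fpath $file.Name;
	$outfile = $outfile -replace("\\","\\");
	$logfile = Join-Path $OCRWorkLogFolder "debug.log";
	$logfile = $logfile -replace("\\","\\");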
	$DBG='<> Trace';
	$INP='<

I do plan to add some logging, change the hard-coded variables, and look at using streams instead of Byte[] to be more flexible and scalable. I’ll need some error handling to deal with things like download or upload failures before I can run this in production. I’m also trying to determine if a PDF is “image only” or searchable. However, the “proof-of-concept” does work.
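
As a rough idea of what the stream-based version might look like, here is a minimal sketch that copies any readable stream to a local file in fixed-size chunks. The function name `Copy-StreamToFile` and the 64 KB default buffer are assumptions made for the example, and `try`/`finally` requires PowerShell 2.0 or later. The loop is written by hand because `Stream.CopyTo` is only available from .NET 4 onward.

```powershell
# Sketch only: copy a readable stream to a local file chunk by chunk, so the
# whole document never has to sit in memory as a single Byte[].
function Copy-StreamToFile([System.IO.Stream]$source, [string]$path, [int]$bufferSize = 65536) {
    $fs = New-Object System.IO.FileStream($path, [System.IO.FileMode]::Create)
    [long]$total = 0
    try {
        $buffer = New-Object Byte[] $bufferSize
        # Stream.Read returns 0 once the end of the stream is reached.
        while (($read = $source.Read($buffer, 0, $buffer.Length)) -gt 0) {
            $fs.Write($buffer, 0, $read)
            $total += $read
        }
    }
    finally {
        # Close the output file even if a read or write fails part-way.
        $fs.Close()
    }
    return $total
}

# Hypothetical usage with a SharePoint file ($file is an SPFile):
# $in = $file.OpenBinaryStream()
# Copy-StreamToFile $in (Join-Path $LocalFileFolder $file.Name)
# $in.Close()
```

The returned byte count could be written to a log or compared with the file's length as a basic check that the download completed.
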

This entry was posted in .NET, PowerShell, SharePoint, WSS.
  • http://www.psigen.com SharePoint Imaging

    No matter what metadata is thrown into columns, there is always a need for the conversion to searchable text. SharePoint Imaging solutions provide the means to have fully searchable PDFs in the repository.

    This was a great overview on how to use 3rd party products to enhance search!!

  • Bunny

    hello sir, how to run this script? I have already tried but got an error:

    Cannot find type [Microsoft.SharePoint.SPSite]: make sure the assembly containing this type is loaded.
    At :line:87 char:18
    + $site = New-Object <<<< Microsoft.SharePoint.SPSite("10.22.2.31");

    can you help me fix this error? thank's

  • david

    You will need to run the script on your SharePoint server or on a system where all of the supporting SharePoint .Net components are installed.

    At the top of the code, the line
    [void][System.Reflection.Assembly]::LoadWithPartialName("Microsoft.SharePoint")
    is used to load all of the SharePoint objects on the system.

    Does this answer your question?

  • Bunny

    not yet sir, sorry i’m newbie

    I got an error,

    Exception calling “.ctor” with “1” argument(s): “The Web application at http://10.22.2.31 could not be found. Verify that you have typed the URL correctly. If the URL should be serving existing content, the system administrator may need to add a new request URL mapping to the intended application.”
    At :line:88 char:18
    + $site = New-Object <<<< Microsoft.SharePoint.SPSite("http://10.22.2.31");

    do you have another URL for me to test this script?

  • Bunny

    Thanks for your answer.

    Now i’ve run the script on my sharepoint server, and sharepoint object can be successfully loaded.

    But i have another error, “The Web application at http://10.22.2.31 could not be found. Verify that you have typed the URL correctly. If the URL should be serving existing content, the system administrator may need to add a new request URL mapping to the intended application.”

    I think this is about the authorization problem, because I use an account that does not have permission to open a Sharepoint site. But this error still occur even though I use another account.

    I also tried another way to create site object like:
    $site = [Microsoft.SharePoint.WebControls.SPControl]::GetContextSite([System.Web.HttpContext]::Current);
    but the current HttpContext return null.

    Do you have any idea about this problem?

    Thanks.

  • david

    Have you tried using the server’s computer name instead of the IP address, e.g. http:// instead of http://10.22.2.31?

    SharePoint 2007 uses mappings to accept site/web application names. If you look at the Alternate Access Mappings in Central Administration, then I’m guessing that the server’s computer name will be shown and the server’s IP address will not be shown.

  • nuruddin

    hello sir,

    i’ve tried to run your script and it works nicely. but when it reaches the ocr_local_file function, it opens Solid PDF Tools and does nothing. then an error occurs when it runs upload_ocr_result, because ocr_local_file did not produce any file.

    when i check my pdf solid tools, it appear the “Recognize text using OCR” under “document” menu is disabled.

    is the error because “Recognize text using OCR” is disabled? how to make it enabled? what should I install?

    thanks

  • Yang

    very interesting post. did you install Solid Converter on sharepoint server? thanks,

  • david

    Yes, I installed Solid Converter on the SharePoint server. It was the easiest way to gain access to all of the installed SharePoint/.NET libraries.