David's technobabble

OCR’ing all of the PDF files in a SharePoint Document Library using PowerShell and Solid PDF Tools

A recent review of the PDF documents in our Document Control Library revealed that most were “image only” PDFs. We’ve run our document control system on different versions of SharePoint technologies since SharePoint Portal Server 2001, and we are currently running SharePoint 2007. I’m surprised that no one had previously noticed that most of our PDF files were not showing up in searches.

The question is: “How can we get all of these PDFs reprocessed to be searchable for a reasonable cost?” I spent some time reviewing various PDF tools and was surprised that few were “truly” scriptable in batch. Some tools would not run on a server or could not interface with an “HTTP Network Place”. There were some development kits marketed and priced for building your own PDF tools, and they carry a pretty spendy price tag; these were out of scope for this year’s budget. Some, like Acrobat, had a batching mechanism in their GUI. However, this would require us to:

  • Manually check out all of the PDFs
  • Download them into a folder structure
  • Batch them in the UI
  • Upload all of the OCR’ed PDFs
  • Manually check in all of the PDFs

I don’t think so! We are talking about tens of thousands of documents. This would be labor intensive and prone to human error.

I’d just about given up finding something that would fit into the current budget when I received an email notice about Solid Converter PDF v5 being available. While reviewing the v5 features, I noticed that Solid PDF Tools v5 was out and that it was just $20 more than the upgrade to Solid Converter PDF v5. Lo and behold, Solid PDF Tools v5 claims to have a scriptable interface. Looking further, I discovered their script reference manual and developer blog. I downloaded the trial and verified that the software worked, sent them my $59 upgrade fee, and voilà, I had a “scriptable enough” tool that could re-scan “image-only” PDF files to create “searchable” PDF files. I say “scriptable enough” because it gets the job done, but Solid PDF Tools v5 needs to load the GUI, load the splash screen, and display the UI for each file processed. This seems to add 10-15 seconds to the processing time of each file.
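
To put that per-file overhead in perspective, here is a back-of-the-envelope sketch. The document count is an assumed, hypothetical figure (the library holds "tens of thousands" of documents, so 20,000 is only a stand-in); the 10-15 second range is the one observed above.

```powershell
# Rough estimate of the total per-file GUI startup overhead.
# $documentCount is an assumed, hypothetical figure; the overhead range
# (10-15 seconds per file) is the one observed above.
$documentCount = 20000
$overheadSeconds = 10, 15

foreach ($seconds in $overheadSeconds) {
	# Total overhead in hours, rounded to one decimal place.
	$hours = [math]::Round(($documentCount * $seconds) / 3600, 1)
	Write-Output "$seconds s per file: about $hours hours of overhead"
}
```

With those assumed numbers the startup overhead alone comes to roughly 56 to 83 hours, so this is a job to leave running unattended over several days.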

Now to convert the “image-only” PDFs in SharePoint. Once again, I decided to try PowerShell rather than write a C# program to interface with SharePoint. At the same time, I decided to give the PowerGUI Tools a try. I found the PowerGUI Script Editor to be quite useful for developing, debugging and running my script.

The “proof-of-concept” result is ~100 lines of PowerShell code that:

  • processes all of the webs in a SharePoint site
  • processes all of the folders and sub-folders
  • processes all of the PDF documents and sends them to OCR processing

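## References
# Load the SharePoint object model (the script has to run on the SharePoint server).
# System.IO needs no explicit load; its types are already available.
[void][System.Reflection.Assembly]::LoadWithPartialName("Microsoft.SharePoint") 
 
# Name of the Solid PDF Tools executable, kept here for reference;
# ocr_local_file invokes the command by name and it is found through the PATH.
$SolidPDFTools = "SolidPDFTools.exe" 
$LocalFileFolder= "D:\spwork\input";
$OCRWorkFolder= "D:\spwork\output";
$OCRWorkLogFolder= "D:\spwork\logs";
$OCRScriptFile= "D:\spwork\ocr.sdscript";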
 
function script:write_local_file($file, $fileFolder) {
	$fs = New-Object System.IO.FileStream($(Join-Path $fileFolder $file.Name), [System.IO.FileMode]::Create)
	$bw = New-Object System.IO.BinaryWriter($fs);
	[Byte[]] $binfile = $file.OpenBinary();    
	$bw.Write($binfile);
	$bw.Close();
	$fs.close();
}
 
function script:ocr_local_file($file, $in_fpath, $out_fpath) {
	# Double every backslash in the paths, presumably because the script
	# format treats a single backslash inside (...) strings as an escape.
	$infile = (Join-Path $in_fpath $file.Name) -replace '\\', '\\';
	$outfile = (Join-Path $out_fpath $file.Name) -replace '\\', '\\';
	$logfile = (Join-Path $OCRWorkLogFolder "debug.log") -replace '\\', '\\';

	# Build the Solid PDF Tools script: trace settings, then a Create
	# command that OCRs the input file into a searchable output file.
	$DBG='<</Level/Verbose /Emit (starting) /FileName (' + $logfile + ') >> Trace';
	$INP='<</Input (' + $infile + ')';
	$OTP='/Output (' + $outfile + ')';
	$OCR='/OCR/Searchable';
	$CRE='>> Create';
	$XIT='EXIT';

	# Write the six script lines in one pass (ASCII, one line per element).
	$DBG, $INP, $OTP, $OCR, $CRE, $XIT | Out-File $OCRScriptFile -encoding ascii;
 
	# Run the script file through Solid PDF Tools (found through the PATH).
	$sTemp = & SolidPDFTools /i $OCRScriptFile /f script 
	Write-Host $sTemp;
}
 
function script:upload_ocr_result($file, $ocr_fpath) {
	# Read the OCR'ed file back from disk and save it over the SharePoint copy.
	$ocrfile = Join-Path $ocr_fpath $file.Name;
	$fs = New-Object System.IO.FileStream($ocrfile, [System.IO.FileMode]::Open)
	$br = New-Object System.IO.BinaryReader($fs);
	[Byte[]] $binfile = $br.ReadBytes($br.BaseStream.Length);
	$file.SaveBinary($binfile);
	$br.close();
	$fs.close();
}
 
function script:process_a_folder($folder) {
	$files = $folder.Files;
	foreach($file in $files) {
		# Match on the extension only, so names such as "x.pdf.bak" are skipped.
		if ($file.Name.ToLower().EndsWith(".pdf")) {
			# Skip files that someone already has checked out.
			if ($file.CheckOutStatus -eq "None") {
				$file.CheckOut();
				write_local_file $file $LocalFileFolder;
				ocr_local_file $file $LocalFileFolder $OCRWorkFolder;
				upload_ocr_result $file $OCRWorkFolder;
				$file.CheckIn("version has had OCR processing performed");
				$file.Update();
			}
		}
	}
	# Recurse into the sub-folders.
	$sub_folders = $folder.SubFolders;
	process_folders $sub_folders;
}
 
function script:process_folders($folders) {
	foreach($folder in $folders) {
		process_a_folder $folder;
	}
}
 
function script:append-path {
	$oldPath = get-content Env:\Path;
	$newPath = $oldPath + ";" + $args;
	set-content Env:\Path $newPath;
}
 
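# MAIN
append-path (resolve-path 'D:\Program Files\SolidDocuments\Solid PDF Tools\SPDFT').Path
 
$site = new-object Microsoft.SharePoint.SPSite("http://my-sp-server");   
$siteweb = $site.OpenWeb();   
# Note: Webs returns only the immediate sub-webs of the root web.
$webs = $siteweb.Webs;   
foreach($web in $webs) {   
	$folders = $web.Folders;
	process_folders $folders;
	# SPWeb and SPSite objects hold unmanaged resources, so release them explicitly.
	$web.Dispose();
}
$siteweb.Dispose();
$site.Dispose();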

I do plan to add some logging, change the hard-coded variables, and look at using streams instead of Byte[] to be more flexible and scalable. I’ll need some error handling to deal with things like download or upload failures before I can run this in production. I’m also still trying to work out how to determine whether a PDF is “image only” or searchable. However, the “proof-of-concept” does work.
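
On the “image only” question, one crude heuristic I may try is sketched below. It assumes that an image-only PDF carries no font resources, so a file whose raw bytes never mention /Font is probably a scan. This is only an assumption, not a reliable test: newer PDFs can keep their resource dictionaries inside compressed object streams, where the name is not visible as plain text. The function name Test-PdfHasFontResource is made up for this sketch.

```powershell
# Hypothetical helper (not part of the script above): returns $true when the
# raw bytes of a PDF contain the name /Font, which usually means the file
# carries real text and not only page images. Heuristic only; it can miss
# fonts that are declared inside compressed object streams.
function Test-PdfHasFontResource([string]$Path) {
	# Resolve to a full path, because .NET file APIs do not follow the
	# current PowerShell location.
	$fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
	$bytes = [System.IO.File]::ReadAllBytes($fullPath)
	# ASCII decoding turns non-ASCII bytes into '?', which is harmless here
	# because the search string itself is plain ASCII.
	$text = [System.Text.Encoding]::ASCII.GetString($bytes)
	return $text.Contains('/Font')
}
```

The idea would be to call this on the downloaded copy and only send the file to OCR when it returns $false. Anything it flags wrongly would still need a manual check.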
