Making OCR’ed PDFs Using the HP Digital Sending Software

We recently became aware that our fancy HP workgroup printers, which can copy a document and email the result as a PDF to a set of email addresses, only create image PDFs. None of the text in the PDF is searchable. After some investigation, we discovered that we needed to install the HP Digital Sending Software and let it perform OCR post-processing on the image PDF created by the printer.

This works well, except that the HP Digital Sending Software places the OCR’ed file in a single configurable folder on the print server’s local disk. You might be wondering: “What’s the problem with that? Just share the folder to all users and be done with it.” Well, my answer is: “Not so fast. Much of the material being scanned is confidential and should not be placed where an unauthorized user might be able to access it.”

After further analysis of how the HP Digital Sending Software works, we determined that every scanned document produces two files: the OCR’ed PDF file and an informational XML file. The XML file contains the information that the user provided during the scan. This gave us access to the “from email address”, the “to email address(es)”, and the OCR’ed PDF file name.
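
To make the idea concrete, here is a minimal Python sketch of pulling those three values out of the metadata file. It is not our C# code, and the element names (`ScanJob`, `From`, `To`, `FileName`) are purely hypothetical placeholders; the real HP file layout is not shown in this post, so adjust the names to whatever your files actually contain.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical sample: element names are assumptions, not the real HP schema.
SAMPLE_XML = """<ScanJob>
  <From>alice@example.com</From>
  <To>bob@example.com</To>
  <To>carol@example.com</To>
  <FileName>scan0001.pdf</FileName>
</ScanJob>"""


@dataclass
class ScanJob:
    sender: str        # the "from email address"
    recipients: list   # the "to email address(es)"
    pdf_name: str      # file name of the OCR'ed PDF


def parse_scan_job(xml_text: str) -> ScanJob:
    """Extract sender, recipients and PDF name from the metadata XML."""
    root = ET.fromstring(xml_text)
    sender = root.findtext("From", default="").strip()
    recipients = [el.text.strip() for el in root.findall("To") if el.text]
    pdf_name = root.findtext("FileName", default="").strip()
    if not (sender and recipients and pdf_name):
        raise ValueError("metadata file is missing a required field")
    return ScanJob(sender, recipients, pdf_name)


job = parse_scan_job(SAMPLE_XML)
print(job.sender, job.recipients, job.pdf_name)
```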

With this information, it was pretty straightforward to write a FileWatcher service in C# that would:

  • Handle the file changed events with a filter on “*.xml” files.
    I initially thought that I would be processing a file created event, but the HP software creates the XML file, then updates it.
  • Generate an email from the “from email address” to the “to email address(es)” with the OCR’ed pdf file as an attachment.
  • Delete the processed xml and pdf files from the folder.
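
The steps above can be sketched roughly as follows. This is a Python illustration rather than our C# service, and it makes several assumptions: the XML element names (`From`, `To`, `FileName`) are hypothetical, the subject and body text are made up, and instead of reacting to file changed events it simply scans the folder for “*.xml” files on each pass. A metadata file that is still half-written fails to parse and is left in place for the next pass, which covers the create-then-update behaviour in a different way.

```python
import xml.etree.ElementTree as ET
from email.message import EmailMessage
from pathlib import Path


def build_message(xml_path: Path):
    """Build the email described by one metadata file (hypothetical schema)."""
    root = ET.parse(xml_path).getroot()
    pdf_path = xml_path.parent / root.findtext("FileName").strip()
    msg = EmailMessage()
    msg["From"] = root.findtext("From").strip()
    msg["To"] = ", ".join(el.text.strip() for el in root.findall("To") if el.text)
    msg["Subject"] = f"Scanned document: {pdf_path.name}"
    msg.set_content("The searchable PDF from your scan is attached.")
    msg.add_attachment(pdf_path.read_bytes(), maintype="application",
                       subtype="pdf", filename=pdf_path.name)
    return msg, pdf_path


def process_job(xml_path: Path, send) -> None:
    """Send the PDF to its recipients, then delete both files."""
    msg, pdf_path = build_message(xml_path)
    send(msg)  # any callable taking an EmailMessage, e.g. SMTP.send_message
    pdf_path.unlink()
    xml_path.unlink()


def process_folder(folder: Path, send) -> int:
    """Handle every complete job in the folder; return how many were sent."""
    handled = 0
    for xml_path in sorted(folder.glob("*.xml")):
        try:
            process_job(xml_path, send)
            handled += 1
        except (ET.ParseError, AttributeError, OSError):
            # Metadata incomplete or PDF not readable yet: retry on a later pass.
            continue
    return handled
```

In practice `send` could be the `send_message` method of an `smtplib.SMTP` connection to your own mail relay, with `process_folder` called in a loop every few seconds. Because the files are deleted only after `send` returns, a failed send leaves the job in the folder rather than losing the scan.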

It was pretty easy to plug this gap. However, without our little “gap filler”, the HP Digital Sending Software is just technology and not a solution for a company. I wonder why HP doesn’t already provide tools with their Digital Sending Software to perform this function.

  • http://steve.heyvan.com Steve

    David — care to share the source code and build instructions to your “gap filler”? Sounds like a great solution.

  • Frank

    I was looking for free document scanning software on the internet. I had used Textbridge for the past 8 years with many versions of Windows and was not willing to buy another expensive scanning package. Then I found some interesting ones online, like Free OCR. Though not as good as commercial OCR software, they did produce promising results for me.

  • david

    Hi Frank,
    I needed something that I could script and run in a batch process. In my case, we have many thousands of unsearchable documents that have already been published in our SharePoint system. I needed to check out each document, convert it to a searchable PDF, and check it back in. I was unable to script the free software I tried in batch processes that work with SharePoint.

    For me, $99 for a tool that allowed me to do this was not a serious expense.

    David