One of the biggest problems with scanned documents is that the text they contain is generally not in a machine-editable format. This is a serious drawback for many users who would otherwise benefit from digitizing their documents, since it means that they can neither edit nor search the text in their scanned documents.
However, with the prevalence of OCR software, more and more scanner makers are beginning to include OCR technology in their software. OCR scanning software allows the user to scan documents and have them saved directly as text-searchable PDF files, or as editable text documents. Text-searchable PDFs are vastly superior to non-searchable PDF files, as finding specific information in a document is much simpler if you don't have to look through every page.
OCR can also be used for automated data entry. Using OCR scanning software, you can scan your documents and export the text to a database. This technique is often used for invoice and forms processing, and for EOB automation. ICR, which stands for intelligent character recognition, is a type of OCR used primarily for recognizing handwriting and converting it to editable text, and allows users to automatically enter handwritten documents as machine text. If you have standardized documents, such as forms, from which to extract specific information, you can use zonal OCR to specify exactly what data you want extracted and saved, and ignore the rest.
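The zonal OCR idea described above can be sketched in a few lines of Python. This is only an illustration, and it assumes that an OCR engine has already returned each recognized word together with its bounding box; the Word class, the zone names, and the coordinates are all hypothetical and do not come from any particular product.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word and its bounding box (hypothetical OCR output)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def words_in_zone(words, zone):
    """Join the words whose center point falls inside zone.

    zone is a (left, top, right, bottom) rectangle in page coordinates.
    """
    left, top, right, bottom = zone
    hits = []
    for word in words:
        cx = word.x + word.w / 2
        cy = word.y + word.h / 2
        if left <= cx <= right and top <= cy <= bottom:
            hits.append(word)
    # Approximate reading order: top to bottom, then left to right.
    hits.sort(key=lambda word: (word.y, word.x))
    return " ".join(word.text for word in hits)


def extract_fields(words, zones):
    """Return one extracted string per named zone; text outside zones is ignored."""
    return {name: words_in_zone(words, zone) for name, zone in zones.items()}
```

With a zone drawn around the invoice number and another around the total on a standardized form, extract_fields returns only those two values, ready to be written to a database, and everything else on the page is skipped.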
Organizations that deal with a large number of documents every day often use batch OCR software. With batch OCR software, the user can specify a large batch of files to be OCRd and let the program do the work. Another feature of some batch OCR software is a 'watched folder mode', which allows users to specify folders for the software to monitor; if a new document is scanned or placed into one of these folders, the software will automatically process it and place it in a specified output location. Some batch OCR products are server-based, and can automatically process scanned documents from any user on a network.
Due to its speed, accuracy, and wide range of applications, OCR software has become very prevalent in recent years. The users of OCR software range from government agencies to students. However, very few of these users know how OCR software works or where it came from.
The first patent for OCR technology was filed in 1929 by a man named Gustav Tauschek, who is best remembered for his punchcard-based calculating machines. Of course, Tauschek's patent wasn't for OCR software, since neither computers nor software existed, but rather for a mechanical device. His optical character recognition device used a photodetector and character templates, and required exact matches between the template and the character in order to work. Obviously, his was not a very accurate OCR device, but up until recently, OCR software operated on the same basic principles as Tauschek's machine.
For most of its history, OCR software relied on algorithms that did something called matrix matching. The OCR software would compare each character in a document to a template, and in this way perform character recognition. Matrix matching has one main drawback: since it relies on comparison between document characters and set font templates, it only works on fonts of which it has templates. For this reason, most matrix matching based OCR software only works for documents written in a few fonts.
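As a rough illustration of matrix matching, the Python sketch below compares a character bitmap against stored templates pixel by pixel. The tiny 3x5 templates, the recognize function, and the 0.9 threshold are assumptions made for this example; real products work on much larger bitmaps and are more sophisticated.

```python
# Hypothetical 3x5 templates for a single font; "1" is ink, "0" is background.
TEMPLATES = {
    "I": ("111", "010", "010", "010", "111"),
    "L": ("100", "100", "100", "100", "111"),
    "T": ("111", "010", "010", "010", "010"),
}


def match_score(glyph, template):
    """Return the fraction of pixels on which glyph and template agree."""
    total = 0
    same = 0
    for glyph_row, template_row in zip(glyph, template):
        for g, t in zip(glyph_row, template_row):
            total += 1
            same += g == t
    return same / total


def recognize(glyph, templates=TEMPLATES, threshold=0.9):
    """Return the best-matching character, or None if nothing is close enough."""
    best_char, best_score = None, 0.0
    for char, template in templates.items():
        score = match_score(glyph, template)
        if score > best_score:
            best_char, best_score = char, score
    return best_char if best_score >= threshold else None
```

A glyph that differs from its template by a single stray pixel is still recognized, but a character with no stored template (an 'H', for instance) scores poorly against every entry and comes back as None, which is exactly the drawback described above.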
Most current OCR software, instead of using matrix matching algorithms, uses feature extraction. Feature extraction algorithms attempt to ignore the variable elements of characters and only analyze their invariable elements. For example, the capital letter 'H' always consists of two parallel vertical lines with one horizontal line connecting them. Because it only analyzes the invariable elements of characters, OCR software using feature extraction only needs one prototype, consisting of invariable elements, for each character. The biggest advantage of feature extraction based OCR software is that it works on a wide variety of fonts. ICR, the handwriting-oriented form of OCR, also uses advanced feature extraction algorithms.
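The 'H' example can be turned into a small Python sketch. It is a toy, not how any specific product works: the stroke-counting features, the 0.8 fill threshold, and the three prototypes are assumptions chosen to show why one prototype per character can cover several fonts.

```python
def stroke_runs(fills, threshold=0.8):
    """Group consecutive rows or columns that are mostly ink into strokes."""
    runs, start = [], None
    for i, fill in enumerate(fills):
        if fill >= threshold and start is None:
            start = i
        elif fill < threshold and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(fills) - 1))
    return runs


def extract_features(glyph):
    """Return (vertical strokes, horizontal strokes, position of first bar)."""
    height, width = len(glyph), len(glyph[0])
    row_fill = [row.count("1") / width for row in glyph]
    col_fill = [sum(row[c] == "1" for row in glyph) / height for c in range(width)]
    vertical = stroke_runs(col_fill)
    horizontal = stroke_runs(row_fill)
    position = None
    if horizontal:
        top, bottom = horizontal[0]
        center = (top + bottom) / 2 / max(height - 1, 1)
        position = "top" if center < 1 / 3 else "bottom" if center > 2 / 3 else "middle"
    return (len(vertical), len(horizontal), position)


# One prototype per character, expressed only in invariable elements.
PROTOTYPES = {
    (2, 1, "middle"): "H",
    (1, 1, "bottom"): "L",
    (1, 1, "top"): "T",
}


def classify(glyph):
    """Return the character whose prototype matches the glyph's features."""
    return PROTOTYPES.get(extract_features(glyph))
```

Because the features ignore stroke thickness and bitmap size, a thin 5x5 'H' and a bold 7x6 'H' both reduce to two vertical strokes joined by a middle bar, so the same prototype recognizes both.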
OCR software has become extremely accurate, and on clean printed text it can generally achieve at least 99% character accuracy. Even at that level, a page of 2,000 characters will still contain around 20 errors, so accuracy depends heavily on the quality of the document on which the OCR software is being used. Clean, high resolution copies yield the best results. There is also evidence to suggest that OCR is more accurate when used on documents scanned in grayscale mode. Many OCR software products will attempt to optimize documents by deskewing (straightening) and despeckling (removing stray dots from) them before performing the OCR operations.
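Despeckling, one of the clean-up steps mentioned above, can be illustrated with a very simple rule: erase any dark pixel that has no dark neighbors. The Python sketch below assumes the page has already been converted to a grid of 0s and 1s; actual products are likely to use more robust filters, so treat this as a demonstration of the idea only.

```python
def despeckle(image):
    """Return a copy of a binary image with isolated ink pixels removed.

    image is a list of rows, each a list of 0 (background) or 1 (ink).
    A pixel counts as a speck when none of its 8 neighbours is ink.
    """
    height, width = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            neighbours = sum(
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                cleaned[y][x] = 0
    return cleaned
```

The function reads from the original grid and writes to a copy, so removing one speck never changes the neighbor count of another pixel during the same pass.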
Given the large number of documents businesses work with, they often have much more intensive OCR requirements than home users. For a home user, processing each document individually may not be much of an inconvenience, but businesses often have far too many documents for that to be a feasible option. There are several OCR products that are specifically designed for high volume applications.
The concept of batch OCR software is simple to grasp: it allows users to specify entire directories of files to be automatically processed by the OCR software. Once the job is set up and initiated, the software takes care of the rest and requires no further human input. Batch OCR software is also designed to be fast, but it is important to be aware that some software that works incredibly quickly is wildly inaccurate. Always be wary of companies that claim that their software can attain unbelievable speeds, and always insist on trying software before purchasing it. Trustworthy OCR companies generally provide free trial versions of their software. If a company doesn't allow you to try their software before buying, it could mean that they have something to hide.
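A batch job of this kind can be sketched in Python as a loop over a directory. The ocr_file argument stands in for whatever OCR engine is actually used; it is a hypothetical callable that takes a file path and returns the recognized text, and the list of file extensions is likewise an assumption.

```python
from pathlib import Path

# File types treated as scanned documents in this sketch (an assumption).
IMAGE_SUFFIXES = {".tif", ".tiff", ".png", ".jpg", ".jpeg", ".pdf"}


def batch_ocr(input_dir, output_dir, ocr_file):
    """Run ocr_file on every scan under input_dir and save the text.

    ocr_file is a placeholder for a real OCR engine: path in, text out.
    Returns (written output paths, list of (failed path, error message)).
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    done, failed = [], []
    for path in sorted(input_dir.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        try:
            text = ocr_file(path)
        except Exception as exc:  # keep going if one document fails
            failed.append((path, str(exc)))
            continue
        # Mirror the input folder layout under output_dir.
        target = output_dir / path.relative_to(input_dir).with_suffix(".txt")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")
        done.append(target)
    return done, failed
```

Once started, the job needs no further input, and the list of failures gives the user something to review afterwards. Note that in this simple version two scans with the same name but different extensions in the same folder would overwrite each other's output.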
Many high-volume OCR products are server based, and allow access from all authorized users on a network. For large organizations, this is much more convenient than locally installing OCR software on every machine, and also helps optimize document workflow. Some server-based OCR software can monitor network directories for new documents to OCR. If a file is placed in one of the specified directories, the software automatically processes it and places the resulting file in a specified output directory. This is a good way to ensure that all newly scanned documents are optimized and properly organized. Some software of this type can even integrate with existing workflow and document management systems.
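Directory monitoring of this sort is often implemented by polling, that is, by checking the folder at regular intervals. The Python sketch below shows that approach under some assumptions: ocr_file is a hypothetical stand-in for the OCR engine, the five-second interval is arbitrary, and a production tool would also need to make sure a file has finished being written before processing it.

```python
import time
from pathlib import Path


def poll_once(watch_dir, output_dir, seen, ocr_file):
    """Process files in watch_dir that have not been handled yet.

    seen is a set of file names already processed; it is updated in place.
    ocr_file is a placeholder for a real OCR engine: path in, text out.
    """
    watch_dir, output_dir = Path(watch_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    produced = []
    for path in sorted(watch_dir.iterdir()):
        if not path.is_file() or path.name in seen:
            continue
        seen.add(path.name)
        target = output_dir / (path.stem + ".txt")
        target.write_text(ocr_file(path), encoding="utf-8")
        produced.append(target)
    return produced


def watch(watch_dir, output_dir, ocr_file, interval=5.0):
    """Check the watched folder every `interval` seconds until interrupted."""
    seen = set()
    while True:
        poll_once(watch_dir, output_dir, seen, ocr_file)
        time.sleep(interval)
```

Keeping the single pass in poll_once separate from the endless loop in watch makes the logic easy to test, and the seen set ensures that a document is only processed the first time it appears.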
There are many options available when it comes to free OCR software. Depending on your OCR requirements, a free OCR program can be sufficient. If you have a high volume of documents or desire the best accuracy and speed, a free OCR download is not the answer. If, however, you are a casual user and only have a few documents to OCR, you may be able to get away with not paying anything. There are many free OCR software products available to download, so it can be difficult to find a safe and reasonably accurate one. It is therefore important to read user reviews before downloading anything.
As with any OCR software, it is very important to only use free OCR software that is reasonably accurate; otherwise, you'll just end up wasting your time. Free OCR software is often less accurate than proprietary OCR software, so it is important to check the results each time you OCR a document with free software. Free OCR software should not be used with documents that are difficult to read, or those that don't have clear contrast between the text and background. Some people suggest that scanning in grayscale mode yields more accurate OCR results, so if your free OCR software isn't performing well on some documents, you can give that a try.
There are two types of free OCR software available online: downloads and online tools. Before downloading any software, be sure that its source is reputable. Some free, downloadable OCR software is open source and allows programmers to modify it. To use free online OCR tools, you typically upload a file, wait for it to be processed, and either receive the newly searchable file in an email or download it from the website. The main drawback to online OCR tools is that they tend to be very slow.
OCR is especially useful when working with PDFs, because PDF files are commonly used to store scanned images of text files. PDF is a very widespread format, and so it becomes especially important to be able to OCR PDF files before sharing them with others. OCR makes PDF files text-searchable, so that you can instantly locate relevant text within a PDF document. This saves a great deal of time for everyone who has to work with PDF documents, and allows other important functions, such as text editing. OCR technology has added a new layer of usefulness to traditional PDF documents.
The creation of searchable PDF files is one of the main uses of OCR software. PDF files created from word processing programs are usually already text searchable, since they were created from a machine-editable text source, but other PDFs, such as those created from scanned documents, are not unless OCRd first. Once a PDF file is made searchable, users can enter text in their PDF reader's search bar and locate it without having to look through the whole document themselves.
In order for a PDF to be made text-searchable, the text it contains must be extracted and converted to machine-editable text; in other words, the file must be OCRd. This text is then used to create an invisible layer of text, which is added to the PDF. The invisible layer of text lines up with the visible text, which is actually just an image of text. Since the computer recognizes the invisible layer as text rather than as an image, it can effectively be used to search the visible text in the document.
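The relationship between the page image and its invisible text layer can be modeled in a few lines of Python. This is a conceptual sketch only and does not read or write real PDF files; the TextSpan class and the coordinates are hypothetical, and the search handles single words rather than phrases.

```python
from dataclasses import dataclass


@dataclass
class TextSpan:
    """One OCRd word in the invisible layer and where it sits on the page."""
    text: str
    box: tuple  # (left, top, right, bottom), aligned with the word in the image


def search_layer(layer, query):
    """Return the boxes of all words containing query, ignoring case.

    A viewer would highlight these boxes on top of the page image.
    """
    query = query.lower()
    return [span.box for span in layer if query in span.text.lower()]
```

The search runs against the hidden text, but the result is a set of positions on the visible image, which is why the two layers must line up and why OCR errors in the hidden layer lead to missed or misplaced search hits.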
Unlike common image formats, PDF can contain this layer of text, which is why image files such as JPEGs or TIFFs are converted to PDF files when they are OCRd with the intention of making them searchable. It is very important that the OCR software used to create the searchable PDF be accurate, since if it is not, the invisible, searchable text overlay will not actually correspond to the visible text. Of course, it is important that the PDF being made text-searchable be a clean, clear document; otherwise, even the most accurate OCR will have a hard time recognizing its text.
Depending on how many PDFs you need to make searchable, you may be able to get away with using a free online OCR tool. In order to use these services, you upload the file you want to OCR, and it either emails you the searchable file or allows you to download it. These tools are slow, and don't allow for any customization. More serious users can evaluate proprietary software through the use of free trial versions, which is the only real way to know what you are purchasing.
There are two ways to scan your documents directly as searchable PDF files: use scanner software that supports this option, or use OCR software with a 'watched folder' mode. Some scanners come with software with a direct OCR scan option, which means that as soon as you scan your documents, they are saved as searchable PDF files. The only problem with this is that the OCR software bundled with scanners is generally not as accurate as OCR-specific, stand-alone software. With OCR software that includes a watched folder mode, you can set up your software so that as soon as new files are scanned to your computer, the OCR software automatically processes them and converts them to searchable PDF files. Most software that includes this feature is server-based, and thus can monitor various specified folders on a network, so that documents scanned by any user on the network are automatically converted to searchable PDFs. Using this type of software will generally yield more accurate results, and will also enable you to turn previously scanned documents into searchable PDF files.
In order to create properly searchable PDF files, it is imperative that you have accurate OCR software. To verify the accuracy of the OCR software you are using to create your searchable PDFs, copy some text from the PDF and paste it into a text document. If the pasted text is identical to the visible text in the PDF, the OCR has done its job accurately. If, however, it is different, it means that the OCR process did not yield accurate results, and that you won't be able to accurately search the PDF. You should perform this test several times with several documents in order to be sure of your results. It is, of course, preferable to perform this test using free trial versions of software before purchasing, as this way you cannot be led astray by false accuracy claims.
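The copy-and-paste test above can be made more precise by measuring how far the pasted text is from a hand-typed reference. The Python sketch below uses edit distance (the number of single-character insertions, deletions, and substitutions) as the error count; this metric and the function names are choices made for this example, not something prescribed by any OCR vendor.

```python
def edit_distance(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_text):
    """Return the share of reference characters the OCR got right (0.0 to 1.0)."""
    # Collapse whitespace so line breaks added by copy and paste are not errors.
    reference = " ".join(reference.split())
    ocr_text = " ".join(ocr_text.split())
    if not reference:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(reference, ocr_text)
    return max(0.0, 1 - errors / len(reference))
```

For example, if the visible text reads "The quick brown fox" and the pasted text reads "The quick brovvn f0x", the distance is 3 edits over 19 characters, or roughly 84% accuracy. Running the comparison over several documents gives a figure that can be compared between products during a free trial.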
Below is only a small sample of available OCR software. The products are listed in no specific order, and we make no warranties as to the quality or accuracy of any of the software listed. This list will be updated periodically with other recommended software. If you know of a good program that you believe belongs on this list, please feel free to email the webmaster the link and/or information.