OCR
OCR PDFs into searchable and editable PDFs
Optical Character Recognition, or OCR, is a software process that enables images or printed text to be translated into machine-readable text. OCR is most commonly used when scanning paper documents to create electronic copies, but can also be performed on existing electronic documents (e.g. PDF).
Foxit PhantomPDF can detect whether a PDF file is scanned or image-based and, when you open such a file, suggests running OCR. You can also run OCR at any time to recognize the image-based text in a PDF.
To recognize image-based or scanned text in a PDF file, perform the following steps:
Click the Convert tab > Convert group > OCR > Current File. In the Select OCR Engine dialog box, specify the page range you need.
Choose the language used in your document. You can select multiple languages as well.
For the output type, check Searchable Text Image to make the image text selectable and searchable, or check Editable Text to make the image text editable in Foxit PhantomPDF. Then click OK to recognize the text.
· Searchable Text Image: During the OCR process, PhantomPDF analyzes the image text and substitutes words/characters that closely approximate it. The substitute words/characters are placed on an invisible layer of text in the PDF, which makes the image text selectable and searchable. If a substitution is uncertain, the text is marked as an OCR suspect, which needs to be corrected manually.
· Editable Text: During the OCR process, PhantomPDF compares the shape of the image text with the closest matching fonts installed on your system, and turns the image text into editable text.
Note: If you are prompted to download the OCR component after clicking OK, please click Yes to download and install it, or download it later from the link provided and install it by clicking Install Update in the Help tab. (Tip: For a plug-in in MSI format, double-click it to install it.)
(Optional) If you check Find All Suspect (Show all OCR results that may need to be changed.), the OCR suspects will be enclosed in red boxes so that you can check and correct them right after the recognition completes. To learn how to correct OCR suspects, refer to the instructions in "Find and Correct OCR Suspects".
If you choose Editable Text as the output type with the Find All Suspect (Show all OCR results that may need to be changed.) option selected, any recognized text that PhantomPDF is not certain about is marked as an OCR suspect, and the original image text is kept until you have manually handled all the OCR suspects. You can also deselect this option to turn the image text into editable text with no OCR suspects after recognition. If needed (e.g., if some text was not recognized correctly), you can then modify the text directly using the commands in the Edit tab.
A progress bar appears to show the recognition progress.
Use the search function. The text in your image or scanned document is now searchable.
Tip: Foxit PhantomPDF provides a Quick OCR command under the Home/Convert tab that recognizes all pages of a scanned or image-based PDF with the default or previous settings in one click.
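The idea of an OCR suspect described above can be illustrated with a small Python sketch. This is a conceptual illustration only, not PhantomPDF's implementation: the RecognizedWord type, the confidence score, and the 0.8 threshold are all hypothetical names and values chosen for the example. It assumes an OCR engine reports a certainty score for each word and that words scoring below a cutoff are set aside for manual review.

```python
from dataclasses import dataclass


@dataclass
class RecognizedWord:
    text: str          # the engine's best guess for the word
    confidence: float  # hypothetical score from 0.0 (guess) to 1.0 (certain)


def split_suspects(words, threshold=0.8):
    """Separate confidently recognized words from OCR suspects.

    The threshold is an assumed value for illustration only.
    """
    accepted = [w for w in words if w.confidence >= threshold]
    suspects = [w for w in words if w.confidence < threshold]
    return accepted, suspects


words = [
    RecognizedWord("Invoice", 0.98),
    RecognizedWord("t0tal", 0.41),
    RecognizedWord("due", 0.93),
]
accepted, suspects = split_suspects(words)

# The suspects are the words a reviewer would check and correct by hand.
print([w.text for w in suspects])
```

In this sketch, raising the threshold marks more words as suspects and lowering it marks fewer, which mirrors the trade-off between reviewing more text manually and accepting more uncertain results.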
To recognize text in multiple files:
Click the Convert tab > Convert group > OCR > Multiple Files.
In the OCR Multiple Files dialog box, click Add Files to add files or folders. Use Move up, Move down, and Remove to adjust the order of the files.
Click Output Options… and, in the Output Options dialog box, select the destination folder, choose how to name the new file, and specify whether to overwrite an existing one. Then click OK.
Click OK. After recognition, a message box appears to notify you that the recognition is finished.
Notes:
When you use the CJK OCR engine for the first time, the system prompts you to download and install the engine from the Foxit server.
If any unsupported files have been added, a “Remove unsupported file(s)” button appears in the OCR Multiple Files dialog box. Click the button to remove the unsupported file(s) and then continue.
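The batch workflow above (add files, set aside unsupported ones, then name each output and decide whether to overwrite) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not Foxit's logic: the plan_outputs function, the "_ocr" suffix, the "(1)" numbering scheme, and treating only .pdf as supported are all hypothetical choices made for the example. No files are read or written; the function only plans the output names.

```python
from pathlib import PurePosixPath

# Assumption for illustration: only PDF input is treated as supported.
SUPPORTED = {".pdf"}


def plan_outputs(files, dest, suffix="_ocr", existing=frozenset(), overwrite=False):
    """Return (plan, unsupported) for a batch run.

    plan is a list of (source, target) path strings; unsupported lists the
    inputs that would have to be removed before continuing.
    """
    plan, unsupported = [], []
    taken = set(existing)  # names already present in the destination folder
    for f in files:
        src = PurePosixPath(f)
        if src.suffix.lower() not in SUPPORTED:
            unsupported.append(f)
            continue
        target = PurePosixPath(dest) / f"{src.stem}{suffix}{src.suffix}"
        if not overwrite:
            # Pick a numbered name instead of replacing an existing file.
            n = 1
            while str(target) in taken:
                target = PurePosixPath(dest) / f"{src.stem}{suffix}({n}){src.suffix}"
                n += 1
        taken.add(str(target))
        plan.append((f, str(target)))
    return plan, unsupported


plan, unsupported = plan_outputs(["scans/a.pdf", "notes.txt"], "out")
print(plan)
print(unsupported)
```

With overwrite=True the planned target keeps its plain name even if a file of that name already exists, which corresponds to choosing to overwrite an existing file in the output options.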