How to Extract Text From Scanned PDFs (OCR)
Scanned PDFs contain images of text, not actual text. Learn how OCR (Optical Character Recognition) can make scanned documents searchable and editable.
Key Takeaways
- Optical Character Recognition (OCR) analyzes images of text and converts them into machine-readable characters.
- If you can't select or search text in a PDF, it's likely a scanned image.
- Several factors affect OCR quality: scan resolution, image quality, language, and font style.
- OCR output often contains minor errors, especially with unusual fonts or low-quality scans.
- Modern browser-based OCR uses WebAssembly-compiled engines (like Tesseract.js) that process documents entirely on your device.
What Is OCR?
Optical Character Recognition (OCR) analyzes images of text and converts them into machine-readable characters. When applied to a scanned PDF, OCR adds a hidden text layer behind each page image, making the document searchable and allowing copy-paste.
When You Need OCR
If you can't select or search text in a PDF, it's likely a scanned image. This is common with documents from older scanners, photographed pages, and PDFs created from fax transmissions.
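The same check can be automated. The sketch below is only an illustration: it assumes the third-party pypdf package for reading the PDF, and the function names and the 20-character threshold are hypothetical choices rather than anything standard. A PDF whose pages yield almost no embedded text is probably image-only and a candidate for OCR.

```python
from typing import Iterable


def extract_page_texts(pdf_path: str) -> list[str]:
    """Return the embedded text of each page (empty string if none).

    Assumes the third-party pypdf package is installed; it is imported
    lazily so that looks_scanned() below can be used without it.
    """
    from pypdf import PdfReader  # assumption: installed via `pip install pypdf`

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def looks_scanned(page_texts: Iterable[str], min_chars: int = 20) -> bool:
    """Guess that a PDF is image-only when no page has meaningful text.

    min_chars is an arbitrary illustrative threshold, not a standard value.
    """
    texts = list(page_texts)
    if not texts:
        return False  # no pages to judge
    return all(len(text.strip()) < min_chars for text in texts)
```

A typical call would be `looks_scanned(extract_page_texts("scan.pdf"))`, where the file name is a placeholder. Treat the result as a hint: a PDF that mixes scanned and digital pages will report False even though some pages still need OCR.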
OCR Accuracy Factors
Several factors affect OCR quality:
- Scan resolution: 300 DPI minimum; 600 DPI for small text.
- Image quality: Clean, high-contrast scans produce better results.
- Language: Latin-script languages can reach 99%+ character accuracy on clean printed scans; CJK scripts and handwriting are harder.
- Font style: Standard printed fonts are recognized well; decorative fonts less so.
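To check the resolution guideline on an existing scan, divide the image's pixel dimensions by the physical page size in inches. The following is a minimal sketch; the function names are hypothetical, and it assumes you already know the pixel size of the page image and the paper size it was scanned from.

```python
def effective_dpi(pixel_width: int, pixel_height: int,
                  page_width_in: float, page_height_in: float) -> float:
    """Return the lower of the horizontal and vertical pixels per inch."""
    if page_width_in <= 0 or page_height_in <= 0:
        raise ValueError("page dimensions must be positive")
    return min(pixel_width / page_width_in, pixel_height / page_height_in)


def resolution_advice(dpi: float, small_text: bool = False) -> str:
    """Compare a scan's DPI with the 300/600 DPI guidance in the list above."""
    required = 600 if small_text else 300
    if dpi >= required:
        return "ok"
    return f"rescan at {required} DPI or higher"
```

For example, a US Letter page (8.5 x 11 inches) stored as a 2550 x 3300 pixel image works out to 300 DPI, which meets the minimum for normal text but falls short of the 600 DPI suggested for small text.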
Post-OCR Cleanup
OCR output often contains minor errors, especially with unusual fonts or low-quality scans. Review the extracted text for common mistakes like confusing '1' with 'l', '0' with 'O', and misread punctuation.
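Part of that review can be narrowed down with a simple script. The sketch below is a rough heuristic, not an established algorithm: it flags tokens that mix letters and digits and contain one of the look-alike characters mentioned above, so a person can check them by hand. The function name and the character sets are illustrative assumptions.

```python
import re

# Characters OCR engines commonly confuse, following the examples above.
DIGIT_LOOKALIKES = set("lIO")   # letters that resemble 1 or 0
LETTER_LOOKALIKES = set("10")   # digits that resemble l or O


def suspicious_tokens(text: str) -> list[str]:
    """Return tokens that mix letters and digits in a way worth reviewing."""
    flagged = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        digits = sum(ch.isdigit() for ch in token)
        letters = len(token) - digits
        if digits == 0 or letters == 0:
            continue  # purely alphabetic or purely numeric: nothing to flag
        # In a mostly numeric token, look for stray letters; otherwise
        # look for stray digits inside what is probably a word.
        lookalikes = DIGIT_LOOKALIKES if digits > letters else LETTER_LOOKALIKES
        if any(ch in lookalikes for ch in token):
            flagged.append(token)
    return flagged
```

The function only flags tokens and never rewrites them, because legitimate strings such as "h1" also match. Punctuation errors are outside its scope and still need a manual read-through.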
Browser-Based OCR
Modern browser-based OCR uses WebAssembly-compiled engines (like Tesseract.js) that process documents entirely on your device. This means sensitive scanned documents never leave your computer.
Related Guides
How to Merge PDF Files Without Losing Quality
Combining multiple PDF documents into a single file is one of the most common document tasks. This guide walks you through merging PDFs while preserving bookmarks, links, and page formatting across all merged documents.
PDF Compression: Reducing File Size Without Sacrificing Quality
Large PDF files are difficult to share via email and slow to load on mobile devices. Learn how PDF compression works and how to strike the right balance between file size and visual quality.
PDF vs DOCX vs ODT: Choosing the Right Document Format
Each document format serves different purposes. PDF excels at preserving layout, DOCX is ideal for collaborative editing, and ODT offers open-source compatibility. This comparison helps you choose the right format for your workflow.
How to Split a PDF Into Individual Pages
Extracting specific pages from a large PDF is essential for sharing relevant sections without distributing the entire document. Learn how to split PDFs by page range, by bookmark, or into individual pages.
Fixing Common PDF Display Issues
PDFs sometimes display incorrectly — fonts may substitute, images may blur, or pages may appear blank. This troubleshooting guide covers the most common PDF rendering problems and their solutions.