PSPDFKit API OCR and Office Conversion Improvements
PSPDFKit API now ships with brand-new OCR and Office conversion engines. Earlier this year, PSPDFKit merged with ORPALIS, the company behind GdPicture.NET, and over the last few weeks we’ve been working to bring GdPicture.NET technology into PSPDFKit API to deliver significant improvements in performance and accuracy.
What Is GdPicture.NET?
GdPicture.NET is an all-in-one toolkit that provides complete PDF support, along with support for a number of other file formats, including Office, CAD, and images. It also ships with rich image processing and industry-leading OCR and document-understanding capabilities that use state-of-the-art artificial intelligence and machine learning algorithms. Over the coming months, we’ll be incorporating much of this technology into PSPDFKit API.
Why We Replaced the Previous Engines
Our previous OCR engine was based on the Tesseract open source project, and we used LibreOffice as the core of our Office conversion tools. This setup produced good-quality results, but these two underlying components limited it in certain respects.
The main issue with our OCR engine was the performance, which was only acceptable at best. In the case of Office conversion, our main pain point was that we were unable to effectively improve the conversion quality itself.
Performance and Accuracy
Both the new OCR and Office conversion engines process documents more quickly and more accurately than their predecessors. The OCR gain is especially large: we measured speedups of up to 7× over the previous engine, with the same or sometimes even better accuracy.
With Office conversion, we achieved better conversion results for many documents from our test set. We didn’t find any regressions in quality on the same set of documents.
Conclusion
We’re excited to bring you these huge improvements to our API tools, which you can already try for free.
We also invite you to read our blog posts, which explain in detail how to use our Office conversion tools.
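As a rough illustration of what asking the API to run OCR might look like, the sketch below only assembles a JSON instructions payload and prints it; it doesn’t contact the service. The `parts`/`actions` layout, the `ocr` action type, the `language` value, and the helper name `build_ocr_instructions` are assumptions made for this example, so check them against the PSPDFKit API reference before relying on them.

```python
import json


def build_ocr_instructions(part_name: str = "document", language: str = "english") -> dict:
    """Assemble an illustrative instructions payload: one input part, one OCR action.

    The field names here are assumptions for this sketch, not a confirmed schema.
    """
    return {
        # Name of the uploaded file part the instructions refer to (assumed field).
        "parts": [{"file": part_name}],
        # A single OCR step applied to the assembled document (assumed field).
        "actions": [{"type": "ocr", "language": language}],
    }


if __name__ == "__main__":
    # Print the payload as it could be sent alongside the uploaded file.
    print(json.dumps(build_ocr_instructions(), indent=2))
```

Building the payload in a small function keeps the request shape in one place, so it’s easy to adjust if the actual schema differs from what’s assumed here.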
Note that this is only a small glimpse into what’s possible with the combined powers of PSPDFKit API and GdPicture.NET. Stay tuned for the new capabilities and improvements that we’re planning to introduce soon.