Python Tesseract OCR: Extract text from images using pytesseract

    Extract text from images and scanned documents using Python and Tesseract OCR. This tutorial covers installation, text extraction, and preprocessing techniques. For searchable PDFs from scanned documents, see the Nutrient OCR API section.
    TL;DR

    Use Tesseract OCR with pytesseract to extract text from images. Preprocessing (grayscale, resizing, thresholding) improves accuracy. For searchable PDFs or batch processing, use Nutrient OCR API.

Python developers use Tesseract OCR with the pytesseract wrapper to extract text from images and scanned documents.

    What OCR does

    OCR extracts text from images and scanned documents. Common uses include:

    • Digitizing paper documents for search and archival
    • Automating data entry from forms and invoices
    • Making scanned PDFs searchable and copyable
    • Indexing document content for retrieval

    Tesseract OCR

Tesseract OCR is an open source OCR engine originally developed by Hewlett-Packard (1985–2006) and later sponsored by Google. Since version 4, it combines an LSTM-based neural network engine with traditional image processing to recognize text. Tesseract supports 100+ languages and works with Python, Java, and C++ (Apache 2.0 license).

    Pros and cons

    Pros

    • Free and open source
    • 100+ languages supported
    • Handles various fonts and text styles
    • Active community, regular updates

    Cons

    • Setup can be tricky on some systems
    • Accuracy drops with poor image quality or complex layouts
    • No built-in preprocessing — you handle that separately
    • Training required for non-standard fonts

    Prerequisites

    Before beginning, make sure you have the following installed on your system:

    1. Python 3.x
    2. Tesseract OCR
    3. pytesseract
    4. Pillow (Python Imaging Library)

    The pytesseract package is a wrapper for the Tesseract OCR engine that provides a simple interface for recognizing text in images. It can also be used as a standalone invocation script that calls the Tesseract binary directly.

    Installing Tesseract OCR

    To install Tesseract OCR, follow the instructions for your operating system:

    • Windows: Download and run the installer from GitHub.
    • macOS: brew install tesseract
    • Debian/Ubuntu: sudo apt install tesseract-ocr

    You can find installation instructions for other operating systems in the official Tesseract documentation.

    Setting up your Python OCR environment

    1. Create a new Python file in your favorite editor and name it ocr.py.
    2. Download the sample image used in this tutorial and save it in the same directory as the Python file.
    3. Install the required Python libraries using pip:
    pip install pytesseract pillow

    To verify that Tesseract OCR is properly installed and added to your system’s PATH, open a command prompt (Windows) or terminal (macOS/Linux) and run the following command:

    tesseract --version

    You’ll see Tesseract’s version number, along with some build information. If you run into problems with pytesseract, see the troubleshooting section later in this tutorial.
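You can run the same check from Python before attempting any OCR. The sketch below uses the standard library's shutil.which to look for the binary on PATH; the helper name tesseract_on_path is our own:

```python
import shutil

def tesseract_on_path(cmd="tesseract"):
    """Return True if the given executable can be found on the system PATH."""
    return shutil.which(cmd) is not None

if __name__ == "__main__":
    if tesseract_on_path():
        print("Tesseract found on PATH")
    else:
        print("Tesseract not found -- check your installation")
```

This catches the most common failure mode (a correct pip install but a missing or unfindable Tesseract binary) before pytesseract raises its own error.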

    Python Tesseract tutorial: Extract text from images

    Now that you’ve installed the pytesseract package, this section outlines how to use it to recognize text from an image.

    Import the necessary libraries and load the image you want to extract text from:

    import pytesseract
    from PIL import Image
    image_path = "path/to/your/image.jpg"
    image = Image.open(image_path)

    Extracting text from the image

    To extract text from the image, use the image_to_string() function from the pytesseract library:

    extracted_text = pytesseract.image_to_string(image)
    print(extracted_text)

    The image_to_string() function takes an image as an input and returns the recognized text as a string.

    Run the Python script to see the extracted text from the sample image:

    python3 ocr.py

    You’ll see the extracted text from the sample image printed in your terminal.

    Saving extracted text to a file

    If you want to save the extracted text to a file, use Python’s built-in file I/O functions:

    with open("output.txt", "w") as output_file:
        output_file.write(extracted_text)

    Advanced Python OCR techniques

    In addition to the basic usage, the pytesseract package provides several advanced options for configuring the OCR engine, outlined below.

    Configuring the OCR engine

    You can configure the OCR engine by passing a configuration string to the image_to_string() function. The string uses the same flags as the tesseract command line, separated by spaces.

    For example, the following configuration sets the language to English and uses page segmentation mode (--psm) 6, which treats the image as a single uniform block of text:

    config = '--psm 6 -l eng'
    text = pytesseract.image_to_string(image, config=config)

    You can also set the path to the Tesseract OCR engine executable using the pytesseract.pytesseract.tesseract_cmd variable. For example, if the Tesseract OCR engine is installed in a non-standard location, you can set the path to the executable using the following code:

    pytesseract.pytesseract.tesseract_cmd = '/path/to/tesseract'
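Typical install locations differ by platform. The sketch below picks a sensible default path per platform; default_tesseract_cmd is our own helper name, and the paths are commonly seen defaults that you should verify on your machine:

```python
import sys

# Common default install paths; verify these on your machine.
DEFAULT_PATHS = {
    "win32": r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    "darwin": "/opt/homebrew/bin/tesseract",
    "linux": "/usr/bin/tesseract",
}

def default_tesseract_cmd(platform=sys.platform):
    """Return a typical Tesseract executable path for the given platform."""
    for prefix, path in DEFAULT_PATHS.items():
        if platform.startswith(prefix):
            return path
    return "tesseract"  # fall back to whatever is on PATH

# Then, in your OCR script:
# pytesseract.pytesseract.tesseract_cmd = default_tesseract_cmd()
```

Keeping the path lookup in one helper makes scripts easier to move between a Windows workstation and a Linux server.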

    Handling multiple languages

    The Tesseract OCR engine supports more than 100 languages. You can recognize text in multiple languages by joining language codes with a plus sign (+) in the -l option.

    For example, the following configuration sets the language to English and French:

    config = '-l eng+fra'
    text = pytesseract.image_to_string(image, config=config)
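Since language codes are joined with + and PSM is a separate flag, a small helper keeps the config assembly in one place. This is a sketch; build_config is our own name, not part of pytesseract:

```python
def build_config(languages, psm=None):
    """Assemble a Tesseract config string from language codes and an optional PSM."""
    parts = []
    if languages:
        parts.append("-l " + "+".join(languages))
    if psm is not None:
        parts.append(f"--psm {psm}")
    return " ".join(parts)

# Example: English + French, treated as a single block of text.
config = build_config(["eng", "fra"], psm=6)
print(config)  # -l eng+fra --psm 6
```

You'd then pass the result as config=config to image_to_string(), exactly as in the examples above.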

    Improving OCR accuracy with image preprocessing

    To improve the accuracy of OCR, you can preprocess an image before running it through the OCR engine. Preprocessing techniques can help enhance the image quality and make it easier for the OCR engine to recognize text.

    Converting images to grayscale

    One common preprocessing technique is converting the image to grayscale, which can improve the contrast between the text and the background. Use the grayscale() function from the ImageOps module of the Pillow library to convert the input image to grayscale:

    from PIL import Image, ImageOps
    # Open an image.
    image = Image.open("path_to_your_image.jpg")
    # Convert image to grayscale.
    gray_image = ImageOps.grayscale(image)
    # Save or display the grayscale image.
    gray_image.show()
    gray_image.save("path_to_save_grayscale_image.jpg")
    (Figure: the original image of a blue lizard alongside its grayscale conversion.)

    Resizing the image for better accuracy

    Another preprocessing technique is to resize the image to a larger size. This can make the text in the image larger and easier for the OCR engine to recognize. Use the resize() method from the Pillow library to resize the image:

    # Resize the image.
    scale_factor = 2
    resized_image = gray_image.resize(
        (gray_image.width * scale_factor, gray_image.height * scale_factor),
        resample=Image.LANCZOS
    )

    In the code above, you resize gray_image by a scale factor of 2, so the new size is (width * scale_factor, height * scale_factor). The Lanczos resampling filter is used because it produces high-quality results when upscaling.
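If your inputs vary in size, blindly doubling very large scans wastes memory. You can compute capped target dimensions first; this is a sketch, where scaled_size is our own helper and the 4,000-pixel cap is an arbitrary example value:

```python
def scaled_size(width, height, scale_factor=2, max_width=4000):
    """Scale dimensions by scale_factor, capping the resulting width at max_width."""
    factor = min(scale_factor, max_width / width)
    return (round(width * factor), round(height * factor))

print(scaled_size(1200, 800))   # (2400, 1600)
print(scaled_size(3000, 2000))  # width capped at 4000
```

The result tuple can be passed straight to gray_image.resize(..., resample=Image.LANCZOS).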

    Applying adaptive thresholding

    Thresholding produces a more binary image with a clear separation between foreground text and background, which can improve OCR accuracy. Pillow doesn’t ship a true adaptive thresholding filter (OpenCV’s cv2.adaptiveThreshold provides one); as a simple stand-in, this tutorial applies the FIND_EDGES filter from the ImageFilter module of the Pillow library, which emphasizes text outlines:

    from PIL import Image, ImageOps, ImageFilter
    # Load the image.
    image = Image.open('image.png')
    # Convert the image to grayscale.
    gray_image = ImageOps.grayscale(image)
    # Resize the image to enhance details.
    scale_factor = 2
    resized_image = gray_image.resize(
        (gray_image.width * scale_factor, gray_image.height * scale_factor),
        resample=Image.LANCZOS
    )
    # Apply edge detection filter (find edges).
    thresholded_image = resized_image.filter(ImageFilter.FIND_EDGES)
    # Save or display the processed image.
    thresholded_image.show() # This will display the image.
    # thresholded_image.save('path_to_save_image') # This will save the image.
    (Figure: the original black-and-white text image alongside the filtered result with enhanced contrast.)
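For comparison, true thresholding maps each grayscale pixel to pure black or white around a cutoff. The pure-Python sketch below shows the idea on a list of pixel values; the threshold helper is our own name:

```python
def threshold(pixels, cutoff=128):
    """Map grayscale values (0-255) to pure black (0) or white (255)."""
    return [255 if p > cutoff else 0 for p in pixels]

row = [12, 130, 200, 90, 255]
print(threshold(row))  # [0, 255, 255, 0, 255]
```

With Pillow, the equivalent one-liner is gray_image.point(lambda p: 255 if p > 128 else 0), which is worth trying if the FIND_EDGES output hurts accuracy on your documents.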

    Finally, you can pass the preprocessed image to the OCR engine to extract the text. Use the image_to_string() method of the pytesseract package to extract the text from the preprocessed image:

    # Extract text from the preprocessed image.
    improved_text = pytesseract.image_to_string(thresholded_image)
    print(improved_text)

    Complete OCR script

    By using these preprocessing techniques, you can improve the accuracy of OCR and extract text from images more effectively.

    Here’s the complete code for the improved OCR script:

    from PIL import Image, ImageOps, ImageFilter
    import pytesseract
    # Define the path to your image.
    image_path = 'image.png'
    # Open the image.
    image = Image.open(image_path)
    # Convert image to grayscale.
    gray_image = ImageOps.grayscale(image)
    # Resize the image to enhance details.
    scale_factor = 2
    resized_image = gray_image.resize(
        (gray_image.width * scale_factor, gray_image.height * scale_factor),
        resample=Image.LANCZOS
    )
    # Apply adaptive thresholding using the `FIND_EDGES` filter.
    thresholded_image = resized_image.filter(ImageFilter.FIND_EDGES)
    # Extract text from the preprocessed image.
    improved_text = pytesseract.image_to_string(thresholded_image)
    # Print the extracted text.
    print(improved_text)
    # Optional: Save the preprocessed image for review.
    thresholded_image.save('preprocessed_image.jpg')

    Recognizing digits only

    Sometimes you only need to recognize digits from an image. You can set the --psm option to 6 to treat the image as a single block of text, and then use regular expressions to extract digits from the recognized text.

    For example, the following code recognizes digits from an image:

    import pytesseract
    from PIL import Image
    import re

    image_path = "image.png"
    image = Image.open(image_path)

    config = '--psm 6'
    text = pytesseract.image_to_string(image, config=config)

    digits = re.findall(r'\d+', text)
    print(digits)

    Here, you import the re module for working with regular expressions. Then, you use the re.findall() method to extract all the digits from the OCR output.
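The same pattern extends beyond whole numbers. For instance, to pull decimal amounts out of OCR output (the sample string below is fabricated for illustration):

```python
import re

# Example OCR output from an invoice (fabricated for illustration).
text = "Invoice #4821\nSubtotal: 104.50\nTax: 8.36\nTotal: 112.86"

# \d+(?:\.\d+)? matches both integers and decimals.
amounts = re.findall(r'\d+(?:\.\d+)?', text)
print(amounts)  # ['4821', '104.50', '8.36', '112.86']
```

Post-processing OCR output with regular expressions like this is often more robust than trying to make the engine itself emit only the characters you want.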

    Training Tesseract with custom data

    Training Tesseract with custom data enables you to fine-tune the OCR engine for specific use cases, fonts, or languages, improving its accuracy in recognizing characters and layouts that may not be well-represented in the default model. Tesseract’s neural network-based recognition engine requires structured training data to learn the distinct patterns of custom characters and fonts.

    To train Tesseract with custom data, you’ll need a dataset of images and corresponding text files that contain the desired output text. Tesseract provides tools, such as tesstrain and text2image, which assist in generating and labeling training data. Using these tools, you can create a custom language model that Tesseract can use to improve recognition accuracy for specific content.

    Though training Tesseract can be time-intensive, it offers significant benefits by tailoring the OCR engine to handle text in unique fonts, symbols, or languages, especially for specialized applications.

    Best practices

    1. Preprocess images — Grayscale, resize, threshold. Clean images produce better results.
    2. Set the right PSM — Page segmentation mode (--psm) affects how Tesseract interprets layout. Try different values for your document type.
    3. Specify the language — Use -l eng for English, and -l eng+fra for multiple languages.
    4. Train for custom fonts — Non-standard fonts need custom training data.
    5. Test on representative samples — Accuracy varies by document type. Test before deploying.
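The PSM advice above can be automated: run the same image through several modes and keep whichever yields the most text, a rough proxy for recognition quality. The sketch below takes the OCR function as a parameter so you can plug in pytesseract.image_to_string; choose_best_psm is our own helper name, not a pytesseract API:

```python
def choose_best_psm(ocr_fn, image, psms=(3, 4, 6, 11)):
    """Try several page segmentation modes and return (best_psm, text).

    Uses extracted-text length as a crude quality proxy.
    """
    best = (None, "")
    for psm in psms:
        text = ocr_fn(image, config=f"--psm {psm}")
        if len(text) > len(best[1]):
            best = (psm, text)
    return best

# With pytesseract you would call:
# psm, text = choose_best_psm(pytesseract.image_to_string, image)
```

Text length is only a heuristic; for production use, compare results against a known-good sample of your own documents instead.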

    Troubleshooting pytesseract imports

    If pytesseract fails to import, the issue is usually related to installation, environment configuration, or system paths. This section covers common causes and solutions.

    Common causes of pytesseract import errors

    1. Incorrect installation

      • Ensure pytesseract is installed in the correct Python environment.
      • Verify installation by running:
      pip show pytesseract

      If it’s not installed, install it using:

      pip install pytesseract
    2. Multiple Python versions

      If you have multiple versions of Python installed, ensure pytesseract is installed in the environment corresponding to the Python version you’re using.

      • Check your Python version with:
      python3 --version
      • Use the correct pip version:
      python3 -m pip install pytesseract
    3. Environment issues

      • If you’re using virtual environments, activate the correct environment before installing or running your script.
      • Check if the environment is activated:
      source your_env_name/bin/activate

      Install pytesseract within the activated environment.

    4. System path issues

    • Ensure the Python and pip paths are correctly set in your system environment variables.
    • Check your current Python path:
    which python3

    Additional tips

    • Reinstall pytesseract — If problems persist, try uninstalling and reinstalling pytesseract:
    pip uninstall pytesseract
    pip install pytesseract
    • Check the Tesseract installation — Ensure Tesseract is correctly installed on your system. You can verify this by running:
    tesseract --version
    • Upgrade pip — Sometimes upgrading pip can resolve issues:
    python3 -m pip install --upgrade pip
    • Install packages on managed environments — If you’re encountering issues installing packages due to an externally managed environment (like macOS with Homebrew), follow the steps outlined below.

      • Use a virtual environment:

      python3 -m venv myenv
      source myenv/bin/activate
      pip install pytesseract

      • Or use pipx:

      brew install pipx
      pipx install pytesseract

      • Override the restriction (not recommended):

      python3 -m pip install pytesseract --break-system-packages

    Check PEP 668 for details.

    Limitations of Tesseract

    • Accuracy can vary — While Tesseract OCR is generally accurate, the accuracy can vary depending on the quality of the input image, the language being recognized, and other factors. In some cases, the OCR output may contain errors or miss some text altogether.
    • Training is required for non-standard fonts — Tesseract OCR works well with standard fonts, but it may have difficulty recognizing non-standard fonts or handwriting. To improve recognition of these types of fonts, training data may need to be created and added to Tesseract’s data set.
    • Limited support for complex layouts — Tesseract OCR works best with images that contain simple layouts and clear text. If the image contains complex layouts, graphics, or tables, Tesseract may not be able to recognize the text accurately.
    • Limited support for languages — While Tesseract OCR supports many languages, it may not support all languages and scripts. If you need to recognize text in a language that isn’t supported by Tesseract, you may need to find an alternative OCR engine.
    • No built-in image preprocessing — While Tesseract OCR can recognize text from images, it doesn’t have built-in image preprocessing capabilities. Preprocessing tasks like resizing, skew correction, and noise removal may need to be done separately before passing the image to Tesseract.

    Nutrient API for OCR

    Tesseract extracts text. Nutrient’s OCR API creates searchable PDFs — the text layer is embedded in the PDF so users can search, select, and copy text.

    When to use Nutrient instead of Tesseract:

    • You need searchable PDFs, not just raw text
    • You’re processing batches of scanned documents
    • You want to merge multiple scanned pages into one PDF
    • You need 20 languages without installing language packs
    • You want consistent results without preprocessing each image

    The API is SOC 2 compliant, stores no document data, and offers 200 free credits/month to start.

    Requirements

    To get started, you’ll need:

    • A Nutrient API key
    • Python 3 and pip
    • The requests library

    To access your Nutrient API key, sign up for a free account. Once you’ve signed up, you can find your API key in the Dashboard > API keys section.

    You’ll use pip, Python’s package manager, to install Requests, an HTTP library that makes it easy to send HTTP requests.

    Install the requests library with the following command:

    python3 -m pip install requests

    Using the OCR API

    Follow the steps below to process a scanned document with optical character recognition (OCR) using the Nutrient API.

    1. Import required modules

    Begin by importing the necessary modules:

    import requests
    import json

    These modules handle HTTP requests and JSON serialization, respectively.

    2. Define the OCR instructions

    Set up the instructions for the OCR process in a dictionary format:

    data = {
        'instructions': json.dumps({
            'parts': [
                {'file': 'scanned'}
            ],
            'actions': [
                {
                    'type': 'ocr',
                    'language': 'english'
                }
            ]
        })
    }

    In this example:

    • "file": "scanned" references the uploaded file.
    • "type": "ocr" tells the API to apply OCR.
    • "language": "english" specifies the OCR language.
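Because instructions is plain JSON, you can build and sanity-check it in code before attaching it to the request. A minimal sketch using only the standard library:

```python
import json

instructions = {
    'parts': [{'file': 'scanned'}],
    'actions': [{'type': 'ocr', 'language': 'english'}],
}

# This is the string that goes into the multipart 'instructions' field.
payload = json.dumps(instructions)
print(payload)

# Round-trip to confirm the structure survives serialization.
assert json.loads(payload)['actions'][0]['type'] == 'ocr'
```

Building the dictionary separately like this also makes it easy to add more parts or actions programmatically later.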

    3. Send the OCR request to the Nutrient API

    Make a POST request to the https://api.nutrient.io/build endpoint:

    response = requests.request(
        'POST',
        'https://api.nutrient.io/build',
        headers={
            'Authorization': 'Bearer your_api_key_here'
        },
        files={
            'scanned': open('image.png', 'rb')
        },
        data={
            'instructions': json.dumps({
                'parts': [
                    {'file': 'scanned'}
                ],
                'actions': [
                    {
                        'type': 'ocr',
                        'language': 'english'
                    }
                ]
            })
        },
        stream=True
    )

    Replace 'your_api_key_here' with your actual API key. This request:

    • Sends the file under the name scanned.
    • Includes the OCR instructions.
    • Streams the response so you can handle large files efficiently.

    You can use the sample document here to test the OCR API.

    4. Save the OCR result to a file

    If the request is successful (status code 200), write the resulting file to disk:

    if response.ok:
        with open('result.pdf', 'wb') as fd:
            for chunk in response.iter_content(chunk_size=8096):
                fd.write(chunk)
    else:
        print(response.text)
        exit()

    This code:

    • Streams the OCR-processed output into result.pdf.
    • Handles errors by printing the response message if the request fails.

    Advanced OCR with Python: Merge multiple scanned pages into a searchable PDF using Nutrient API

    If you have a batch of scanned pages (like a multipage invoice or a document set), you can use the Nutrient OCR API to merge them into a single searchable PDF. This is a common scenario when replacing or augmenting Python Tesseract workflows with a scalable cloud API.

    Unlike using pytesseract on each page manually and merging PDFs afterward, this approach simplifies everything into one API call.

    Example: Merge four scanned images with OCR enabled

    Place your scanned images (page1.jpg, page2.jpg, etc.) in the same directory and use the following Python script:

    import requests
    import json

    response = requests.request(
        'POST',
        'https://api.nutrient.io/build',
        headers={
            'Authorization': 'Bearer your_api_key_here'
        },
        files={
            'page1': open('page1.jpg', 'rb'),
            'page2': open('page2.jpg', 'rb'),
            'page3': open('page3.jpg', 'rb'),
            'page4': open('page4.jpg', 'rb')
        },
        data={
            'instructions': json.dumps({
                'parts': [
                    {'file': 'page1'},
                    {'file': 'page2'},
                    {'file': 'page3'},
                    {'file': 'page4'}
                ],
                'actions': [
                    {
                        'type': 'ocr',
                        'language': 'english'
                    }
                ]
            })
        },
        stream=True
    )

    if response.ok:
        with open('merged_scanned.pdf', 'wb') as fd:
            for chunk in response.iter_content(chunk_size=8096):
                fd.write(chunk)
    else:
        print(response.text)
        exit()

    Replace 'your_api_key_here' with your actual API key.
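To merge an arbitrary number of pages, the field names and parts list can be generated from a filename list. The sketch below keeps file opening on the caller's side so the helper stays easy to test; build_merge_request is our own name, not part of the Nutrient SDK:

```python
import json

def build_merge_request(filenames, language='english'):
    """Build the multipart field names and OCR instructions for merging pages."""
    names = [f'page{i + 1}' for i in range(len(filenames))]
    instructions = {
        'parts': [{'file': name} for name in names],
        'actions': [{'type': 'ocr', 'language': language}],
    }
    # Caller opens the files:
    # files = {n: open(f, 'rb') for n, f in zip(names, filenames)}
    return names, json.dumps(instructions)

names, instructions = build_merge_request(['page1.jpg', 'page2.jpg', 'page3.jpg'])
print(names)  # ['page1', 'page2', 'page3']
```

You'd then pass the files mapping and data={'instructions': instructions} to the same POST request shown above.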

    When to use this instead of pytesseract

    pytesseract: Extract text from individual images, handle preprocessing yourself, write code to merge results.

    Nutrient API: Upload images, get a searchable PDF back. One API call handles OCR, merging, and PDF creation.

    Conclusion

    Tesseract with pytesseract handles basic text extraction. Preprocess images (grayscale, resize, threshold) for better accuracy. For searchable PDFs or batch processing, use the Nutrient OCR API.

    FAQ

    What is Tesseract OCR?

    Tesseract OCR is an open source engine for recognizing text in images and scanned documents. Originally developed by Hewlett-Packard and later sponsored by Google, it supports more than 100 languages and various text styles.

    How do I install Tesseract OCR in Python?

    To install Tesseract OCR, download the installer from GitHub for Windows, use brew install tesseract on macOS, or run sudo apt install tesseract-ocr on Debian/Ubuntu.

    How can I improve OCR accuracy?

    You can improve OCR accuracy by converting an image to grayscale, resizing it to make the text larger, and applying adaptive thresholding to enhance text contrast.

    Can Tesseract OCR handle multiple languages?

    Yes. Tesseract supports multiple languages. Use a plus sign (+) in the configuration string, like -l eng+fra for English and French.

    What are the limitations of Tesseract OCR?

    Tesseract’s limitations include varying accuracy based on image quality, difficulty with non-standard fonts, and limited support for complex layouts and languages. It also lacks built-in image preprocessing.

    How does Nutrient’s OCR API work?

    Upload scanned images or PDFs, get searchable PDFs back. The API supports 20 languages, preserves layout, and handles multipage documents. It’s SOC 2 compliant with 200 free credits/month.

    How do I use Nutrient’s OCR API?

    Install the requests library, and send a POST request to https://api.nutrient.io/build with your API key and document. The response is the searchable PDF.

    Can I merge multiple scanned pages into one searchable PDF using Nutrient?

    Yes. You can merge multiple images into a single searchable PDF. Adjust the file handling and instructions in your API request to include all pages.

    What is pytesseract in Python?

    pytesseract is a Python wrapper for the open source Tesseract OCR engine. It enables developers to extract text from images using a simple Python API.

    How do I use pytesseract to extract text from an image?

    Install the pytesseract and Pillow libraries, open the image using PIL.Image.open(), and pass it to pytesseract.image_to_string() to extract text.

    Hulya Masharipov

    Technical Writer

    Hulya is a frontend web developer and technical writer who enjoys creating responsive, scalable, and maintainable web experiences. She’s passionate about open source, web accessibility, cybersecurity, privacy, and blockchain.
