Extract Specific Information from Scanned PDF Files

Here are some rules, recommendations, and tips & tricks for reading scanned PDF files and extracting specific details from them using AI-powered text analysis.

Figure: PDF reader flowchart (photo by the author)

Key Features

The application automates the extraction of structured data from scanned PDF documents, particularly legal or contractual ones. It uses AI-powered text analysis via Google’s Gemini API to identify and extract specific details — such as contract terms, document type, signature dates, and answers to predefined questions — by analyzing relevant sections of the document.

PDF Text Extraction

Before extracting information, the application first reads the scanned PDF files. To do this, you must enable Google Document AI and create a Custom Extractor processor, which lets you either get fast predictions from a generative AI model or train your own processor from scratch based on your document structure.

The document is then sent for processing, where you can specify which pages should be analyzed. Once processed, the extracted text and attributes are retrieved and used for further analysis — such as identifying specific sections, extracting contract terms, signature dates, or answering predefined questions based on the document content.

from typing import Optional

from google.api_core.client_options import ClientOptions
from google.cloud import documentai


def process_document_sample(
    project_id: str,
    location: str,
    processor_id: str,
    file_path: str,
    mime_type: str,
    field_mask: Optional[str] = None,
) -> documentai.Document:
    # You must set the `api_endpoint` if you use a location other than "us".
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")

    client = documentai.DocumentProcessorServiceClient(client_options=opts)
    name = client.processor_path(project_id, location, processor_id)

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Load binary data
    raw_document = documentai.RawDocument(content=image_content, mime_type=mime_type)

    # For more information: https://cloud.google.com/document-ai/docs/reference/rest/v1/ProcessOptions

    # Optional: Additional configurations for processing.
    process_options = documentai.ProcessOptions(
        # Process only specific pages
        individual_page_selector=documentai.ProcessOptions.IndividualPageSelector(
            pages=[1, 3, 4, 8, 9, 10]
        )
    )

    # Configure the process request
    request = documentai.ProcessRequest(
        name=name,
        raw_document=raw_document,
        field_mask=field_mask,
        process_options=process_options,
    )

    result = client.process_document(request=request)

    # Return the processed document; `document.text` holds the extracted text
    return result.document
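Once the processed document comes back, its text and attributes still have to be pulled out before the analysis step. The helpers below are a minimal sketch of one way to do that, assuming the response follows the usual Document AI shape (`document.text`, `page.layout.text_anchor.text_segments`, and `document.entities` with `type_` and `mention_text`). The `fake_document` object and the `code_site` entity name are hypothetical stand-ins so the example runs without calling the API.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def layout_to_text(layout: Any, full_text: str) -> str:
    """Rebuild the text of one layout element from its text-anchor offsets."""
    return "".join(
        full_text[int(seg.start_index): int(seg.end_index)]
        for seg in layout.text_anchor.text_segments
    )


def pages_to_text(document: Any) -> List[str]:
    """Return one string per processed page, in page order."""
    return [layout_to_text(page.layout, document.text) for page in document.pages]


def entities_to_dict(document: Any) -> Dict[str, str]:
    """Map each entity type to its mention text (a later duplicate overwrites)."""
    return {entity.type_: entity.mention_text for entity in document.entities}


def _page(start: int, end: int) -> SimpleNamespace:
    """Build a hypothetical page whose layout covers text[start:end]."""
    segment = SimpleNamespace(start_index=start, end_index=end)
    anchor = SimpleNamespace(text_segments=[segment])
    return SimpleNamespace(layout=SimpleNamespace(text_anchor=anchor))


# Hypothetical stand-in for a processed document, for illustration only.
fake_document = SimpleNamespace(
    text="Page one.\nPage two.\n",
    pages=[_page(0, 10), _page(10, 20)],
    entities=[SimpleNamespace(type_="code_site", mention_text="12345A")],
)

page_texts = pages_to_text(fake_document)
attributes = entities_to_dict(fake_document)
```

With a real response, `pages_to_text(document)` would be called on the object returned by `process_document_sample`, and the joined page texts passed on to the analysis stage.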

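AI-Powered Information Extraction

In this stage, we use a Google Gemini model (gemini-1.5-pro in the code below) to analyze the extracted text. To do this, we configure an AI prompt tailored to extract the required information.

The model processes the extracted content and returns specific details, such as the site code, the agreement date, whether the lessor supplies energy to the client, how the charges are billed, whether a meter installation is requested, and the type of reference document.

This is the model configuration that we use for the analysis phase: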

import google.generativeai as genai

generation_config = {
  "temperature": 1,
  "top_p": 0.95,
  "top_k": 40,
  "max_output_tokens": 8192,
  "response_mime_type": "text/plain",
}
model = genai.GenerativeModel(
    model_name="gemini-1.5-pro",
    generation_config=generation_config )


def analyze_text_with_prompt(extracted_text: str):

    prompt = f"""
    Given the following extracted text from a contract:

    {extracted_text}

    Your task is to extract key information from contracts, so please analyze the text above to find the answers to the following questions:

    1. Extract the code site. The ID consists of a number or a number with letters.
    2. Extract the agreement date (Date de signature du document de référence) which is just after the 'fait à' text.
       When extracting date fields:
      - It must appear in DD/MM/YYYY format.
      - If a date is not clearly stated, do your best to infer it and note the uncertainty in the "analysis" field.
    3. From Part VII.4, Contrat précise que le Bailleur fournit l'énergie au Client? réponse (oui ou temporairement ou non)
    4. From Part VII.4, Modalités définies au contrat? réponse (avance sur charges ou facturation réelle ou forfait ou inclus dans le loyer ou non spécifiées)
    5. From Part VII.4, Demande d'installation d'un compteur déflacteur? réponse (oui ou non)
    6. From Part VII.4, Nature du document de référence? (Bail ou Convention domaine public ou Prêt à usage ou Avenant)
    Your final output must be **only** the following valid JSON structure **without formatting**:
  
      {{
        "Code_site": "...",
        "Agreement_Date": "...",
        "Contrat_précise": "...",
        "Modalités_définies": "...",
        "Demande_installation": "...",
        "Nature_document": "..."
      }}
    """
    response = model.generate_content(prompt)
    result = response.text
    return result
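The prompt asks for bare JSON, but a model may still wrap its answer in Markdown code fences or omit a key, so it is worth parsing the reply defensively before using it. The sketch below is one possible approach, not part of the original application: `parse_model_json` and `is_valid_date` are hypothetical helper names, and the key list simply mirrors the JSON structure in the prompt.

```python
import json
import re
from datetime import datetime
from typing import Dict

# Keys requested in the prompt's JSON structure.
EXPECTED_KEYS = (
    "Code_site",
    "Agreement_Date",
    "Contrat_précise",
    "Modalités_définies",
    "Demande_installation",
    "Nature_document",
)

# Matches a leading fence (three backticks plus an optional language tag)
# or a trailing fence (three backticks at the very end).
_FENCE = re.compile(r"^\s*`{3}[a-zA-Z]*\s*|\s*`{3}\s*$")


def parse_model_json(raw: str) -> Dict[str, str]:
    """Strip optional code fences, parse the JSON, and fill missing keys with ''."""
    cleaned = _FENCE.sub("", raw.strip())
    data = json.loads(cleaned)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return {
        key: "" if data.get(key) is None else str(data[key])
        for key in EXPECTED_KEYS
    }


def is_valid_date(value: str) -> bool:
    """Return True when the value parses as a day/month/year calendar date."""
    try:
        datetime.strptime(value, "%d/%m/%Y")
    except ValueError:
        return False
    return True
```

A typical call would be `parse_model_json(analyze_text_with_prompt(text))`, followed by `is_valid_date(parsed["Agreement_Date"])` to flag records that need manual review.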

Batch Processing

After reading and analyzing all PDF files, the application saves the results of all processed files into a single CSV file for further use or analysis.

Code_site | Agreement_Date | Contrat_précise | Modalités_définies | Demande_installation | Nature_document | File_Name
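The batch step can be sketched with Python's standard `csv` module. This is an illustrative example rather than the application's actual code: `write_results_csv` is a hypothetical helper, the column list follows the header above, and the sample row contains made-up values.

```python
import csv
import io
from typing import Dict, Iterable, TextIO

CSV_COLUMNS = [
    "Code_site",
    "Agreement_Date",
    "Contrat_précise",
    "Modalités_définies",
    "Demande_installation",
    "Nature_document",
    "File_Name",
]


def write_results_csv(rows: Iterable[Dict[str, str]], out: TextIO) -> int:
    """Write one CSV line per processed file and return the number of rows."""
    writer = csv.DictWriter(
        out,
        fieldnames=CSV_COLUMNS,
        restval="",             # missing keys become empty cells
        extrasaction="ignore",  # unexpected keys are dropped
    )
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


# Illustration with an in-memory buffer and a made-up record.
buffer = io.StringIO()
written = write_results_csv(
    [
        {
            "Code_site": "12345A",
            "Agreement_Date": "15/03/2021",
            "Contrat_précise": "oui",
            "Modalités_définies": "forfait",
            "Demande_installation": "non",
            "Nature_document": "Bail",
            "File_Name": "contract_001.pdf",
        }
    ],
    buffer,
)
```

To write a real file, open it with `open("results.csv", "w", newline="", encoding="utf-8")` and pass the file object instead of the buffer.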

Bonus:

This project is useful for organizations that need to process large volumes of contractual or legal documents efficiently. It automates tedious tasks like reading PDFs and extracting specific details, saving time and reducing human error.

Cost

Before initiating the project, it is essential to estimate the processing costs based on the expected document volume and AI usage. In this case, with 2,000 PDF files and only 10 pages processed per file (totaling 20,000 pages), we can calculate an approximate cost for OCR processing and AI analysis. This pre-evaluation allows for better budget planning and ensures transparency in resource allocation.

Google Document AI OCR Cost (for 20,000 pages):

Gemini Pro 2.5 Cost (for 2,000 files):

Total Estimated Cost:
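The arithmetic behind such an estimate can be sketched as follows. The unit prices in the example call are placeholders chosen only to make the calculation visible; they are not real Google prices and should be replaced with the rates from the current pricing pages. `estimate_cost` and `CostEstimate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CostEstimate:
    pages: int
    ocr_cost: float
    llm_cost: float

    @property
    def total(self) -> float:
        """Sum of the OCR and AI analysis costs."""
        return self.ocr_cost + self.llm_cost


def estimate_cost(
    num_files: int,
    pages_per_file: int,
    ocr_price_per_1000_pages: float,
    llm_price_per_file: float,
) -> CostEstimate:
    """Estimate OCR cost per page volume and AI analysis cost per file."""
    pages = num_files * pages_per_file
    ocr_cost = pages / 1000 * ocr_price_per_1000_pages
    llm_cost = num_files * llm_price_per_file
    return CostEstimate(pages=pages, ocr_cost=ocr_cost, llm_cost=llm_cost)


# 2,000 files with 10 pages each; both prices are PLACEHOLDERS, not real rates.
estimate = estimate_cost(
    num_files=2000,
    pages_per_file=10,
    ocr_price_per_1000_pages=1.0,
    llm_price_per_file=0.01,
)
```

Here `estimate.pages` is the 20,000 pages mentioned above, and `estimate.total` combines both cost components once real prices are filled in.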

