Example of how to extract tables, checkboxes, and text from forms and documents stored in an Azure Storage Account, process them using Azure Document Intelligence, and store the results in Cosmos DB for further analysis.

Demo: PDF Layout Extraction with Doc Intelligence (full-code approach)

Azure Storage + Document Intelligence + Function App + Cosmos DB

Costa Rica

GitHub: brown9804

Last updated: 2025-07-16


Important

This example uses a public network configuration and is intended for demonstration purposes only. It showcases how several Azure resources can work together to achieve the desired result. Consider the section below about Important Considerations for Production Environment. Please note that these demos are intended as a guide and are based on my personal experiences. For official guidance, support, or more detailed information, please refer to Microsoft's official documentation or contact Microsoft directly: Microsoft Sales and Support


How to extract layout elements from PDFs stored in an Azure Storage Account, process them using Azure Document Intelligence, and store the results in Cosmos DB for further analysis.

  1. Upload your PDFs to an Azure Blob Storage container.
  2. An Azure Function is triggered by the upload, which calls the Azure Document Intelligence Layout API to analyze the document structure.
  3. The extracted layout data (such as tables, checkboxes, and text) is parsed and subsequently stored in a Cosmos DB database, ensuring a seamless and automated workflow from document upload to data storage.

Note

Advantages of Document Intelligence for organizations handling large volumes of documents:

  • Utilizes natural language processing, computer vision, deep learning, and machine learning.
  • Handles structured, semi-structured, and unstructured documents.
  • Automates the extraction and transformation of layout data into usable formats like JSON or CSV.

Note

Azure Event Grid System Topics are free to create and manage. A System Topic is automatically created and managed by Azure for certain Azure services that emit events; it represents a source of events from an Azure resource (such as a Storage Account, Key Vault, or Azure Maps). You don't need to create or manage the topic yourself; Azure does it for you when you enable event publishing on a supported resource.

  • Emits predefined event types (e.g., Microsoft.Storage.BlobCreated, Microsoft.Resources.ResourceWriteSuccess).
  • You can attach event handlers (like Azure Functions, Logic Apps, Webhooks) to respond to these events.
  • Works seamlessly with serverless architectures for real-time automation.
    For example: Suppose you have a Storage Account and want to trigger a function every time a new blob is uploaded:
  • Azure automatically creates a System Topic for the Storage Account.
  • You subscribe to the BlobCreated event.
  • When a blob is uploaded, Event Grid routes the event to your Azure Function.
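
For illustration only, a minimal Python sketch of an Event Grid-triggered function (Functions v2 programming model) that reacts to BlobCreated events could look like the following. The function name and logged fields are assumptions, and this demo itself uses a Blob trigger instead:

```python
import logging

import azure.functions as func

app = func.FunctionApp()

# Hypothetical handler name; this demo uses a Blob trigger instead.
@app.event_grid_trigger(arg_name="event")
def on_blob_created(event: func.EventGridEvent):
    # Event Grid routes Microsoft.Storage.BlobCreated events from the
    # storage account's system topic to this function.
    payload = event.get_json()
    logging.info("Event type: %s", event.event_type)
    logging.info("Blob URL: %s", payload.get("url"))
```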

Important Considerations for Production Environment

Private Network Configuration

For enhanced security, consider configuring your Azure resources to operate within a private network. This can be achieved using Azure Virtual Network (VNet) to isolate your resources and control inbound and outbound traffic. Implementing private endpoints for services like Azure Blob Storage and Azure Functions can further secure your data by restricting access to your VNet.

Security

Ensure that you implement appropriate security measures when deploying this solution in a production environment. This includes:

  • Securing Access: Use Microsoft Entra ID (formerly Azure Active Directory) for authentication and role-based access control (RBAC) to manage permissions.
  • Managing Secrets: Store sensitive information such as connection strings and API keys in Azure Key Vault.
  • Data Encryption: Enable encryption for data at rest and in transit to protect sensitive information.
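
For instance, rather than storing the Cosmos DB key as a plain app setting, the function could fetch it from Key Vault at runtime. A minimal sketch, assuming the azure-identity and azure-keyvault-secrets packages and placeholder vault and secret names:

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Placeholder vault URL and secret name; replace with your own.
VAULT_URL = "https://<your-key-vault-name>.vault.azure.net/"

# DefaultAzureCredential picks up the Function App's managed identity when deployed.
credential = DefaultAzureCredential()
client = SecretClient(vault_url=VAULT_URL, credential=credential)

cosmos_db_key = client.get_secret("COSMOS-DB-KEY").value
```
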
Scalability

While this example provides a basic setup, you may need to scale the resources based on your specific requirements. Azure services offer various scaling options to handle increased workloads. Consider using:

  • Auto-scaling: Configure auto-scaling for Azure Functions and other services to automatically adjust based on demand.
  • Load Balancing: Use Azure Load Balancer or Application Gateway to distribute traffic and ensure high availability.
Cost Management

Monitor and manage the costs associated with your Azure resources. Use Azure Cost Management and Billing to track usage and optimize resource allocation.

Compliance

Ensure that your deployment complies with relevant regulations and standards. Use Azure Policy to enforce compliance and governance policies across your resources.

Disaster Recovery

Implement a disaster recovery plan to ensure business continuity in case of failures. Use Azure Site Recovery and backup solutions to protect your data and applications.

Prerequisites

  • An Azure subscription is required. All other resources, including instructions for creating a Resource Group, are provided in this workshop.
  • Contributor role assigned, or any custom role that allows you to manage all resources and deploy resources within the subscription.
  • If you choose to use the Terraform approach, please review the additional prerequisites in the Terraform guide.

Where to start?

  1. Please follow the Terraform guide to deploy the necessary Azure resources for the workshop.
  2. Since this method skips the manual creation of each resource, proceed directly to the configuration from Configure/Validate the Environment variables.

Important

Regarding networking, this example covers a public access configuration and a system-assigned managed identity. However, please ensure you review your privacy requirements and adjust network and access settings as necessary for your specific case.

Overview

Using Cosmos DB provides you with a flexible, scalable, and globally distributed database solution that can handle both structured and semi-structured data efficiently.

  • Azure Blob Storage: Store the PDF invoices.
  • Azure Functions: Trigger on new PDF uploads, extract data, and process it.
  • Azure SQL Database or Cosmos DB: Store the extracted data for querying and analytics.
Resource recommendations:

  • Azure Blob Storage: Use for storing the PDF files. This keeps your file storage separate from your data storage, which is a common best practice.
  • Azure SQL Database: Use if your data is highly structured and you need complex queries and transactions.
  • Azure Cosmos DB: Use if you need a globally distributed database with low latency and the ability to handle semi-structured data.

Function App Hosting Options

In the context of Azure Function Apps, a hosting option refers to the plan you choose to run your function app. This choice affects how your function app is scaled, the resources available to each function app instance, and the support for advanced functionalities like virtual network connectivity and container support.

Tip

  • Scale to Zero: Indicates whether the service can automatically scale down to zero instances when idle.
    • IDLE stands for:
      • I – Inactive
      • D – During
      • L – Low
      • E – Engagement
    • In other words, when the application is not actively handling requests or events (it's in a low-activity or paused state).
  • Scale Behavior: Describes how the service scales (e.g., event-driven, dedicated, or containerized).
  • Virtual Networking: Whether the service supports integration with virtual networks for secure communication.
  • Dedicated Compute & Reserved Cold Start: Availability of always-on compute to avoid cold starts and ensure low latency.
  • Max Scale Out (Instances): Maximum number of instances the service can scale out to.
  • Example AI Use Cases: Real-world scenarios where each plan excels.
Flex Consumption

  • Scale to Zero: Yes
  • Scale Behavior: Fast event-driven
  • Virtual Networking: Optional
  • Dedicated Compute & Reserved Cold Start: Optional (Always Ready)
  • Max Scale Out (Instances): 1000
  • Example AI Use Cases: Real-time data processing for AI models, high-traffic AI-powered APIs, event-driven AI microservices. Ideal for fraud detection, real-time recommendations, NLP, and computer vision services.

Consumption

  • Scale to Zero: Yes
  • Scale Behavior: Event-driven
  • Virtual Networking: Optional
  • Dedicated Compute & Reserved Cold Start: No
  • Max Scale Out (Instances): 200
  • Example AI Use Cases: Lightweight AI APIs, scheduled AI tasks, low-traffic AI event processing. Great for sentiment analysis, simple image recognition, and batch ML tasks.

Functions Premium

  • Scale to Zero: No
  • Scale Behavior: Event-driven with premium options
  • Virtual Networking: Yes
  • Dedicated Compute & Reserved Cold Start: Yes
  • Max Scale Out (Instances): 100
  • Example AI Use Cases: Enterprise AI applications, low-latency AI APIs, VNet integration. Ideal for secure, high-performance AI services like customer support and analytics.

App Service

  • Scale to Zero: No
  • Scale Behavior: Dedicated VMs
  • Virtual Networking: Yes
  • Dedicated Compute & Reserved Cold Start: Yes
  • Max Scale Out (Instances): Varies
  • Example AI Use Cases: AI-powered web applications, dedicated resources. Great for chatbots, personalized content, and intensive AI inference.

Container Apps Environment

  • Scale to Zero: No
  • Scale Behavior: Containerized microservices environment
  • Virtual Networking: Yes
  • Dedicated Compute & Reserved Cold Start: Yes
  • Max Scale Out (Instances): Varies
  • Example AI Use Cases: AI microservices architecture, containerized AI workloads, complex AI workflows. Ideal for orchestrating AI services like image processing, text analysis, and real-time analytics.

Function App: Configure/Validate the Environment variables

Note

This example uses a system-assigned managed identity to assign RBAC (role-based access control) roles.

  • Under Settings, go to Environment variables, and click + Add to add the following variables:

    • COSMOS_DB_ENDPOINT: Your Cosmos DB account endpoint 🡢 Review the existence of this, if not create it

    • COSMOS_DB_KEY: Your Cosmos DB account key 🡢 Review the existence of this, if not create it

    • COSMOS_DB_CONNECTION_STRING: Your Cosmos DB connection string 🡢 Review the existence of this, if not create it

    • invoicecontosostorage_STORAGE: Your Storage Account connection string 🡢 Review the existence of this, if not create it

    • FORM_RECOGNIZER_ENDPOINT: For example: https://<your-form-recognizer-endpoint>.cognitiveservices.azure.com/ 🡢 Review the existence of this, if not create it

    • FORM_RECOGNIZER_KEY: Your Document Intelligence key (Form Recognizer) 🡢 Review the existence of this, if not create it

    • FUNCTIONS_EXTENSION_VERSION: ~4 🡢 Review the existence of this, if not create it

    • WEBSITE_RUN_FROM_PACKAGE: 1 🡢 Review the existence of this, if not create it

    • FUNCTIONS_WORKER_RUNTIME: python 🡢 Review the existence of this, if not create it

    • FUNCTIONS_NODE_BLOCK_ON_ENTRY_POINT_ERROR: true (This setting ensures that all entry point errors are visible in your application insights logs). 🡢 Review the existence of this, if not create it

    • Click on Apply to save your configuration.

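Once applied, the function code can read these settings at runtime. A minimal sketch, assuming the variable names above:

```python
import os

# Names must match the app settings configured above.
form_recognizer_endpoint = os.environ["FORM_RECOGNIZER_ENDPOINT"]
form_recognizer_key = os.environ["FORM_RECOGNIZER_KEY"]
cosmos_db_endpoint = os.environ["COSMOS_DB_ENDPOINT"]
storage_connection = os.environ["invoicecontosostorage_STORAGE"]
```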

Function App: Develop the logic

  • You need to install VSCode

  • Install Python from the Microsoft Store.
  • Open VS Code and install these extensions: Python and Azure Tools.
  • Click on the Azure icon and sign in to your account. Allow the Azure Resources extension to sign in using Microsoft; it will open a browser window. After doing so, you will be able to see your subscriptions and resources.
  • Under Workspace, click on Create Function Project, and choose a path on your local computer in which to develop your function.
  • Choose the language; in this case it is Python.
  • Select the programming model version; for this example, let's use v2.
  • For the Python interpreter, let's use the one installed via the Microsoft Store.
  • Choose a template (e.g., Blob trigger) and configure it to trigger on new PDF uploads in your Blob container.

  • Provide a function name, like BlobTriggerContosoPDFInvoicesDocIntelligence.
  • Next, it will prompt you for the path of the blob container where you expect the function to be triggered after a file is uploaded. In this case it is pdfinvoices, as previously created.
  • Click on Create new local app settings, and then choose your subscription.

  • Choose Azure Storage Account for remote storage, and select one. I'll be using the invoicecontosostorage.

  • Then click on Open in the current window.
  • Now we need to update the function code to extract data from PDFs and store it in Cosmos DB. Use the following as an example:

    1. PDF Upload: A PDF file is uploaded to the Azure Blob Storage container (pdfinvoices).
    2. Trigger Azure Function: The upload triggers the Azure Function BlobTriggerContosoPDFLayoutsDocIntelligence.
    3. Initialize Clients: Sets up connections to Azure Document Intelligence and Cosmos DB.
      • Initializes the DocumentAnalysisClient using the FORM_RECOGNIZER_ENDPOINT and FORM_RECOGNIZER_KEY environment variables.
      • Initializes the CosmosClient using Azure Active Directory (AAD) via DefaultAzureCredential.
    4. Read PDF from Blob Storage: Reads the PDF content from the blob into a byte stream.
    5. Analyze PDF: Uses Azure Document Intelligence to analyze the layout of the PDF.
      • Calls begin_analyze_document with the prebuilt-layout model.
      • Waits for the analysis to complete and retrieves the layout result.
    6. Extract Layout Data: Parses and structures the layout data from the analysis result.
      • Extracts lines, tables, and selection marks from each page.
      • Logs styles (e.g., handwritten content) and organizes data into a structured dictionary.
    7. Save Data to Cosmos DB: Saves the structured layout data to Cosmos DB.
      • Ensures the database (ContosoDBDocIntellig) and container (Layouts) exist or creates them.
      • Inserts or updates the layout data using upsert_item.
    8. Logging (Process and Errors): Logs each step of the process, including success messages and detailed error handling for debugging and monitoring.
    • Update the function_app.py; see the code used in this demo (a condensed sketch is also included after this walkthrough).
    • Now, let's update the requirements.txt; see the file used in this demo (a plausible dependency list is sketched after this walkthrough).
    • Since this function has already been tested, you can deploy your code to the function app in your subscription. If you want to test first, you can run your function locally.

      • Click on the Azure icon.

      • Under workspace, click on the Function App icon.

      • Click on Deploy to Azure.

      • Select your subscription, your function app, and accept the prompt to overwrite.
      • After it completes, you will see the status in your terminal.
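
For reference, the following is only a condensed, hedged sketch of the flow described in steps 1-8 above, using the Functions v2 programming model with azure-ai-formrecognizer and azure-cosmos. The code linked in this demo is the authoritative version; details such as the partition key path and the document shape are assumptions:

```python
import logging
import os
import uuid

import azure.functions as func
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
from azure.cosmos import CosmosClient, PartitionKey
from azure.identity import DefaultAzureCredential

app = func.FunctionApp()

@app.blob_trigger(arg_name="myblob", path="pdfinvoices/{name}",
                  connection="invoicecontosostorage_STORAGE")
def BlobTriggerContosoPDFLayoutsDocIntelligence(myblob: func.InputStream):
    logging.info("Processing blob: %s", myblob.name)

    # Analyze the PDF layout with Document Intelligence (prebuilt-layout model).
    di_client = DocumentAnalysisClient(
        endpoint=os.environ["FORM_RECOGNIZER_ENDPOINT"],
        credential=AzureKeyCredential(os.environ["FORM_RECOGNIZER_KEY"]),
    )
    poller = di_client.begin_analyze_document("prebuilt-layout", document=myblob.read())
    result = poller.result()

    # Log styles, e.g. handwritten content.
    for style in result.styles or []:
        if style.is_handwritten:
            logging.info("Handwritten content detected (confidence %.2f)", style.confidence)

    # Extract lines, selection marks, and tables into a structured dictionary.
    layout = {"id": str(uuid.uuid4()), "fileName": myblob.name, "pages": [], "tables": []}
    for page in result.pages:
        layout["pages"].append({
            "pageNumber": page.page_number,
            "lines": [line.content for line in page.lines],
            "selectionMarks": [
                {"state": mark.state, "confidence": mark.confidence}
                for mark in (page.selection_marks or [])
            ],
        })
    for table in result.tables:
        layout["tables"].append({
            "rowCount": table.row_count,
            "columnCount": table.column_count,
            "cells": [{"row": c.row_index, "col": c.column_index, "content": c.content}
                      for c in table.cells],
        })

    # Save to Cosmos DB using AAD via the system-assigned managed identity.
    cosmos = CosmosClient(os.environ["COSMOS_DB_ENDPOINT"], credential=DefaultAzureCredential())
    database = cosmos.create_database_if_not_exists(id="ContosoDBDocIntellig")
    container = database.create_container_if_not_exists(
        id="Layouts", partition_key=PartitionKey(path="/fileName"))  # partition key path is an assumption
    container.upsert_item(layout)
    logging.info("Layout for %s saved to Cosmos DB.", myblob.name)
```

A plausible requirements.txt for this sketch (package names only; pin versions as needed):

```text
azure-functions
azure-ai-formrecognizer
azure-cosmos
azure-identity
```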

Important

If you need further assistance with the code, please click here to view all the function code.

Note

Please ensure that all specified roles are assigned to the Function App. The provided example uses a system-assigned managed identity for the Function App to facilitate the role assignment.

Test the solution

Important

Please ensure that the user or system admin responsible for uploading the PDFs to the blob container has the necessary permissions; an authorization error will occur if these roles are missing. In that case, go to Access Control (IAM), click on + Add, and select Add role assignment. Search for Storage Blob Data Contributor and click Next. Then click on Select members, search for your user or system admin, and finally click Review + assign.

Upload sample PDF invoices to the Blob container and verify that data is correctly ingested and stored in Cosmos DB.
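
If you prefer to upload a test file programmatically instead of through the portal steps below, here is a minimal sketch using the azure-storage-blob package; the connection string and file name are placeholders:

```python
from azure.storage.blob import BlobServiceClient

# Placeholder connection string and local file; replace with your own values.
conn_str = "<your-storage-connection-string>"
service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client("pdfinvoices")

with open("sample-invoice.pdf", "rb") as data:
    container.upload_blob(name="sample-invoice.pdf", data=data, overwrite=True)
```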

  • Click on Upload, then select Browse for files and choose your PDF invoices to be stored in the blob container, which will trigger the function app to parse them.

  • Check the logs and traces from your function with Application Insights.
  • Under Investigate, click on Performance. Filter by time range, and drill into the samples. Sort the results by date (if you have many, like in my case) and click on the last one.

  • Click on View all.
  • Check all the logs and traces generated, and review the parsed information.
  • Validate that the information was uploaded to Cosmos DB. Under Data Explorer, check your database.
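
You can also verify the ingested documents programmatically with a small query against the Layouts container. This sketch assumes the database and container names used above and AAD access via DefaultAzureCredential:

```python
from azure.cosmos import CosmosClient
from azure.identity import DefaultAzureCredential

client = CosmosClient("<your-cosmos-endpoint>", credential=DefaultAzureCredential())
container = client.get_database_client("ContosoDBDocIntellig").get_container_client("Layouts")

# List the processed files (cross-partition query).
for item in container.query_items(query="SELECT c.id, c.fileName FROM c",
                                  enable_cross_partition_query=True):
    print(item)
```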