Accelerating Document AI
Nicholas Broad
Enterprises are full of documents containing knowledge that isn't accessible by digital workflows. These documents range from letters and invoices to forms, reports, and receipts. With the improvements in text, vision, and multimodal AI, it's now possible to unlock that information. This post shows you how your teams can use open-source models to build custom solutions for free!
Document AI includes many data science tasks, from image classification and image-to-text to document question answering, table question answering, and visual question answering. This post starts with a taxonomy of use cases within Document AI and the best open-source models for those use cases. Next, the post focuses on licensing, data preparation, and modeling. Throughout this post, there are links to web demos, documentation, and models.
Use Cases
There are at least six general use cases for building Document AI solutions. These use cases differ in the kinds of documents they take as input and the outputs they produce. A combination of approaches is often necessary when solving enterprise Document AI problems.
What is Optical Character Recognition (OCR)?
Turning typed, handwritten, or printed text into machine-encoded text is known as Optical Character Recognition (OCR). It's a widely studied problem with many well-established open-source and commercial offerings. The figure shows an example of converting handwriting into text.
OCR is a backbone of Document AI use cases, as it's essential to transform text into something a computer can read. Some widely available OCR models that operate at the document level are EasyOCR and PaddleOCR. There are also models like TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, which runs on single text-line images. This model works together with a text detection model like CRAFT, which first identifies the individual "pieces" of text in a document in the form of bounding boxes. The relevant metrics for OCR are Character Error Rate (CER) and word-level precision, recall, and F1. Check out this Space to see a demonstration of CRAFT and TrOCR.
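As a rough sketch of what running TrOCR looks like with the transformers library, the snippet below assumes you already have a cropped image containing a single line of text (for example, one produced by a CRAFT text detector); the file name is a placeholder.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load a handwritten-text TrOCR checkpoint and its processor
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# TrOCR expects a crop containing a single line of text
# (a detector like CRAFT would normally supply these crops)
image = Image.open("text_line.png").convert("RGB")  # placeholder path

pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```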
What is Document Image Classification?
Classifying documents into the appropriate category, such as forms, invoices, or letters, is known as document image classification. Classification can use the document's image, its text, or both. The recent addition of multimodal models that use both the visual structure and the underlying text has dramatically increased classifier performance.
A basic approach is applying OCR to a document image, after which a BERT-like model is used for classification. However, relying on only a BERT model doesn't take any layout or visual information into account. The figure from the RVL-CDIP dataset shows how the visual structure differs across document types.
That's where models like LayoutLM and Donut come into play. By incorporating not only text but also visual information, these models can dramatically increase accuracy. For comparison, on RVL-CDIP, an important benchmark for document image classification, a BERT-base model achieves 89% accuracy by using the text. A DiT (Document Image Transformer) is a pure vision model (i.e., it does not take text as input) and can reach 92% accuracy. Models like LayoutLMv3 and Donut, which combine the text and visual information using a multimodal Transformer, can achieve 95% accuracy! These multimodal models are changing how practitioners solve Document AI use cases.
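To get a feel for the pure-vision baseline, here is a minimal sketch using the image-classification pipeline with the DiT checkpoint fine-tuned on RVL-CDIP; the image path is a placeholder.

```python
from transformers import pipeline

# DiT fine-tuned on RVL-CDIP predicts one of 16 document categories
# (letter, invoice, form, email, ...)
classifier = pipeline(
    "image-classification",
    model="microsoft/dit-base-finetuned-rvlcdip",
)

predictions = classifier("scanned_document.png")  # placeholder path
for pred in predictions:
    print(pred["label"], round(pred["score"], 3))
```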
What is Document Layout Analysis?
Document layout analysis is the task of determining the physical structure of a document, i.e., identifying the individual building blocks that make up a document, like text segments, headers, and tables. This task is often solved by framing it as an image segmentation/object detection problem. The model outputs a set of segmentation masks/bounding boxes, along with class names.
Models that are currently state-of-the-art for document layout analysis are LayoutLMv3 and DiT (Document Image Transformer). Both models serve as backbones in the classic Mask R-CNN framework for object detection. This document layout analysis Space illustrates how DiT can be used to identify text segments, titles, and tables in documents. An example of DiT detecting different parts of a document is shown below.
Document layout analysis with DiT.
Document layout analysis is typically evaluated with the mAP (mean average precision) metric, which is commonly used for object detection models. An important benchmark for layout analysis is the PubLayNet dataset. LayoutLMv3, the state of the art at the time of writing, achieves an overall mAP score of 0.951 ( source ).
What is Document Parsing?
A step beyond layout analysis is document parsing: identifying and extracting key information from a document, such as the names, items, and totals on an invoice. This LayoutLMv2 Space shows how to parse a document to recognize questions, answers, and headers.
The first version of LayoutLM (now known as LayoutLMv1) was released in 2020 and dramatically improved on existing benchmarks, and it's still one of the most popular models on the Hugging Face Hub for Document AI. LayoutLMv2 and LayoutLMv3 incorporate visual features during pre-training, which provides a further improvement. The LayoutLM family produced a step change in Document AI performance. For example, on the FUNSD benchmark dataset, a BERT model reaches an F1 score of 60%, but with LayoutLM, it is possible to get to 90%!
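As a sketch of the inference flow for FUNSD-style parsing with LayoutLMv3, the snippet below loads the base checkpoint with a freshly initialized token-classification head (7 labels, following FUNSD's BIO scheme), so in practice you would fine-tune it on FUNSD or load a checkpoint that already has been. The processor's built-in OCR requires pytesseract, and the file name is a placeholder.

```python
from PIL import Image
from transformers import LayoutLMv3ForTokenClassification, LayoutLMv3Processor

# The processor runs OCR (via pytesseract) to get words and bounding boxes
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")

# 7 labels = O, B/I-HEADER, B/I-QUESTION, B/I-ANSWER (FUNSD scheme);
# the head is untrained here -- fine-tune on FUNSD before relying on outputs
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=7
)

image = Image.open("form.png").convert("RGB")  # placeholder path
encoding = processor(image, return_tensors="pt")

outputs = model(**encoding)
predictions = outputs.logits.argmax(-1).squeeze().tolist()
print(predictions)  # one predicted label id per token
```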
LayoutLMv1 now has many successors. Donut builds on LayoutLM but can take the image as input, so it doesn't require a separate OCR engine. ERNIE-Layout was recently released with promising results; see the Space . For multilingual use cases, there are multilingual variants of LayoutLM, like LayoutXLM and LiLT . This figure from the LayoutLM paper shows LayoutLM analyzing some different documents.
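For the OCR-free route, a sketch along the lines of the transformers documentation for Donut, using the checkpoint fine-tuned on the CORD receipt dataset, looks roughly like this; the image path is a placeholder.

```python
import re

from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

image = Image.open("receipt.png").convert("RGB")  # placeholder path
pixel_values = processor(image, return_tensors="pt").pixel_values

# The task prompt tells the decoder which parsing schema to generate
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
)

# Strip special tokens and convert the generated sequence to structured JSON
sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(
    processor.tokenizer.pad_token, ""
)
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()
print(processor.token2json(sequence))
```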
Data scientists are finding that document layout analysis and extraction are key use cases for enterprises. Existing commercial solutions typically cannot handle the diversity of most enterprise data in both content and structure. Consequently, data science teams can often surpass commercial tools by fine-tuning their own models.
What is Table Detection, Extraction, and Table Structure Recognition?
Documents often contain tables, and most OCR tools don't work well out of the box on tabular data. Table detection is the task of identifying where tables are located, and table extraction creates a structured representation of that information. Table structure recognition is the task of identifying the individual pieces that make up a table, like rows, columns, and cells. Table functional analysis (FA) is the task of recognizing the keys and values of the table. The figure from the Table Transformer illustrates the difference between the various subtasks.
The approach for table detection and structure recognition is similar to document layout analysis: object detection models output a set of bounding boxes and corresponding classes.
The latest approaches, like Table Transformer, can perform table detection and table structure recognition with the same model. The Table Transformer is a DETR-like object detection model, trained on PubTables-1M (a dataset comprising one million tables). Evaluation for table detection and structure recognition typically uses the average precision (AP) metric. On PubTables-1M, the Table Transformer is reported to achieve an AP of 0.966 for table detection and an AP of 0.912 for table structure recognition + functional analysis.
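A minimal sketch of the detection stage with transformers could look like the following; the structure-recognition stage would swap in the microsoft/table-transformer-structure-recognition checkpoint and run on the cropped table, and the page image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained(
    "microsoft/table-transformer-detection"
)

image = Image.open("page.png").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes into (score, label, box) triples at a 0.7 threshold
results = processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=[image.size[::-1]]
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```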
Table detection and extraction is an exciting approach, but the results may differ on your data. In our experience, the quality and formatting of tables vary widely, and this affects how well the models perform. Additional fine-tuning on some custom data can greatly improve performance.
What is Document Question Answering (DocVQA)?
Question answering on documents has dramatically changed how people interact with AI. Recent advancements have made it possible to ask models to answer questions about an image - this is known as document visual question answering, or DocVQA for short. After being given a question, the model analyzes the image and responds with an answer. An example from the DocVQA dataset is shown in the figure below. The user asks, "Mention the ZIP code written?" and the model responds with the answer.
In the past, building a DocVQA system would often require multiple models working together. There could be separate models for analyzing the document layout, performing OCR, extracting entities, and then answering a question. The latest DocVQA models enable question-answering in an end-to-end manner, comprising only a single (multimodal) model.
DocVQA is typically evaluated using the Average Normalized Levenshtein Similarity (ANLS) metric. For more details regarding this metric, we refer to this guide . The current open-source state of the art on the DocVQA benchmark is LayoutLMv3, which achieves an ANLS score of 83.37. However, this model consists of a pipeline of OCR + multimodal Transformer. Donut solves the task in an end-to-end manner using a single encoder-decoder Transformer, without relying on OCR. Donut doesn't provide state-of-the-art accuracy but shows the great potential of the end-to-end approach using a generative T5-like model. Impira hosts an exciting Space that illustrates LayoutLM and Donut for DocVQA.
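As a quick sketch, the document-question-answering pipeline in transformers wraps the OCR + LayoutLM approach in a single call (pytesseract must be installed for the OCR step); the image path and question below are placeholders.

```python
from transformers import pipeline

# LayoutLM fine-tuned for document question answering (used in the Impira Space)
docqa = pipeline(
    "document-question-answering",
    model="impira/layoutlm-document-qa",
)

result = docqa(image="invoice.png", question="What is the invoice total?")  # placeholders
print(result)
```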
Visual question answering is compelling; however, there are many considerations for using it successfully. Having accurate training data, evaluation metrics, and post-processing is vital. For teams taking on this use case, be aware that getting DocVQA to work properly can be challenging. In some cases, responses can be unpredictable, and the model can "hallucinate" by giving an answer that doesn't appear within the document. Visual question answering models can also inherit biases from their data, raising ethical issues. Ensuring proper model setup and post-processing is integral to building a successful DocVQA solution.