Sanket Biswas

CV
h-index56
28papers
357citations
Novelty48%
AI Score34

28 Papers

CVAug 23, 2022Code
Doc2Graph: a Task Agnostic Document Understanding Framework based on Graph Neural Networks

Andrea Gemelli, Sanket Biswas, Enrico Civitelli et al.

Geometric Deep Learning has recently attracted significant interest in a wide range of machine learning fields, including document analysis. The application of Graph Neural Networks (GNNs) has become crucial in various document-related tasks since they can unravel important structural patterns, fundamental in key information extraction processes. Previous works in the literature propose task-driven models and do not take into account the full power of graphs. We propose Doc2Graph, a task-agnostic document understanding framework based on a GNN model, to solve different tasks given different types of documents. We evaluated our approach on two challenging datasets for key information extraction in form understanding, invoice layout analysis and table detection. Our code is freely accessible on https://github.com/andreagemelli/doc2graph.

CVMar 9, 2022
Text-DIAE: A Self-Supervised Degradation Invariant Autoencoders for Text Recognition and Document Enhancement

Mohamed Ali Souibgui, Sanket Biswas, Andres Mafla et al.

In this paper, we propose a Text-Degradation Invariant Auto Encoder (Text-DIAE), a self-supervised model designed to tackle two tasks, text recognition (handwritten or scene-text) and document image enhancement. We start by employing a transformer-based architecture that incorporates three pretext tasks as learning objectives to be optimized during pre-training without the usage of labeled data. Each of the pretext objectives is specifically tailored for the final downstream tasks. We conduct several ablation experiments that confirm the design choice of the selected pretext tasks. Importantly, the proposed model does not exhibit limitations of previous state-of-the-art methods based on contrastive losses, while at the same time requiring substantially fewer data samples to converge. Finally, we demonstrate that our method surpasses the state-of-the-art in existing supervised and self-supervised settings in handwritten and scene text recognition and document image enhancement. Our code and trained models will be made publicly available at~\url{ http://Upon_Acceptance}.

CVSep 3, 2024Code
Towards Generative Class Prompt Learning for Fine-grained Visual Recognition

Soumitri Chattopadhyay, Sanket Biswas, Emanuele Vivoli et al.

Although foundational vision-language models (VLMs) have proven to be very successful for various semantic discrimination tasks, they still struggle to perform faithfully for fine-grained categorization. Moreover, foundational models trained on one domain do not generalize well on a different domain without fine-tuning. We attribute these to the limitations of the VLM's semantic representations and attempt to improve their fine-grained visual awareness using generative modeling. Specifically, we propose two novel methods: Generative Class Prompt Learning (GCPL) and Contrastive Multi-class Prompt Learning (CoMPLe). Utilizing text-to-image diffusion models, GCPL significantly improves the visio-linguistic synergy in class embeddings by conditioning on few-shot exemplars with learnable class prompts. CoMPLe builds on this foundation by introducing a contrastive learning component that encourages inter-class separation during the generative optimization process. Our empirical results demonstrate that such a generative class prompt learning approach substantially outperform existing methods, offering a better alternative to few shot image recognition challenges. The source code will be made available at: https://github.com/soumitri2001/GCPL.

CVAug 5, 2023
FASTER: A Font-Agnostic Scene Text Editing and Rendering Framework

Alloy Das, Sanket Biswas, Prasun Roy et al.

Scene Text Editing (STE) is a challenging research problem, that primarily aims towards modifying existing texts in an image while preserving the background and the font style of the original text. Despite its utility in numerous real-world applications, existing style-transfer-based approaches have shown sub-par editing performance due to (1) complex image backgrounds, (2) diverse font attributes, and (3) varying word lengths within the text. To address such limitations, in this paper, we propose a novel font-agnostic scene text editing and rendering framework, named FASTER, for simultaneously generating text in arbitrary styles and locations while preserving a natural and realistic appearance and structure. A combined fusion of target mask generation and style transfer units, with a cascaded self-attention mechanism has been proposed to focus on multi-level text region edits to handle varying word lengths. Extensive evaluation on a real-world database with further subjective human evaluation study indicates the superiority of FASTER in both scene text editing and rendering tasks, in terms of model performance and efficiency. Our code will be released upon acceptance.

CVAug 24, 2023
Beyond Document Page Classification: Design, Datasets, and Challenges

Jordy Van Landeghem, Sanket Biswas, Matthew B. Blaschko et al.

This paper highlights the need to bring document classification benchmarking closer to real-world applications, both in the nature of data tested ($X$: multi-channel, multi-paged, multi-industry; $Y$: class distributions and label set variety) and in classification tasks considered ($f$: multi-page document, page stream, and document bundle classification, ...). We identify the lack of public multi-page document classification datasets, formalize different classification tasks arising in application scenarios, and motivate the value of targeting efficient multi-page document representations. An experimental study on proposed multi-page document classification datasets demonstrates that current benchmarks have become irrelevant and need to be updated to evaluate complete documents, as they naturally occur in practice. This reality check also calls for more mature evaluation methodologies, covering calibration evaluation, inference complexity (time-memory), and a range of realistic distribution shifts (e.g., born-digital vs. scanning noise, shifting page order). Our study ends on a hopeful note by recommending concrete avenues for future improvements.}

CVAug 27, 2024
FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting

Alloy Das, Sanket Biswas, Umapada Pal et al.

The proliferation of scene text in both structured and unstructured environments presents significant challenges in optical character recognition (OCR), necessitating more efficient and robust text spotting solutions. This paper presents FastTextSpotter, a framework that integrates a Swin Transformer visual backbone with a Transformer Encoder-Decoder architecture, enhanced by a novel, faster self-attention unit, SAC2, to improve processing speeds while maintaining accuracy. FastTextSpotter has been validated across multiple datasets, including ICDAR2015 for regular texts and CTW1500 and TotalText for arbitrary-shaped texts, benchmarking against current state-of-the-art models. Our results indicate that FastTextSpotter not only achieves superior accuracy in detecting and recognizing multilingual scene text (English and Vietnamese) but also improves model efficiency, thereby setting new benchmarks in the field. This study underscores the potential of advanced transformer architectures in improving the adaptability and speed of text spotting applications in diverse real-world settings. The dataset, code, and pre-trained models have been released in our Github.

CVSep 11, 2023
GlobalDoc: A Cross-Modal Vision-Language Framework for Real-World Document Image Retrieval and Classification

Souhail Bakkali, Sanket Biswas, Zuheng Ming et al.

Visual document understanding (VDU) has rapidly advanced with the development of powerful multi-modal language models. However, these models typically require extensive document pre-training data to learn intermediate representations and often suffer a significant performance drop in real-world online industrial settings. A primary issue is their heavy reliance on OCR engines to extract local positional information within document pages, which limits the models' ability to capture global information and hinders their generalizability, flexibility, and robustness. In this paper, we introduce GlobalDoc, a cross-modal transformer-based architecture pre-trained in a self-supervised manner using three novel pretext objective tasks. GlobalDoc improves the learning of richer semantic concepts by unifying language and visual representations, resulting in more transferable models. For proper evaluation, we also propose two novel document-level downstream VDU tasks, Few-Shot Document Image Classification (DIC) and Content-based Document Image Retrieval (DIR), designed to simulate industrial scenarios more closely. Extensive experimentation has been conducted to demonstrate GlobalDoc's effectiveness in practical settings.

CVOct 2, 2023
Harnessing the Power of Multi-Lingual Datasets for Pre-training: Towards Enhancing Text Spotting Performance

Alloy Das, Sanket Biswas, Ayan Banerjee et al.

The adaptation capability to a wide range of domains is crucial for scene text spotting models when deployed to real-world conditions. However, existing state-of-the-art (SOTA) approaches usually incorporate scene text detection and recognition simply by pretraining on natural scene text datasets, which do not directly exploit the intermediate feature representations between multiple domains. Here, we investigate the problem of domain-adaptive scene text spotting, i.e., training a model on multi-domain source data such that it can directly adapt to target domains rather than being specialized for a specific domain or scenario. Further, we investigate a transformer baseline called Swin-TESTR to focus on solving scene-text spotting for both regular and arbitrary-shaped scene text along with an exhaustive evaluation. The results clearly demonstrate the potential of intermediate representations to achieve significant performance on text spotting benchmarks across multiple domains (e.g. language, synth-to-real, and documents). both in terms of accuracy and efficiency.

CVOct 1, 2023
Diving into the Depths of Spotting Text in Multi-Domain Noisy Scenes

Alloy Das, Sanket Biswas, Umapada Pal et al.

When used in a real-world noisy environment, the capacity to generalize to multiple domains is essential for any autonomous scene text spotting system. However, existing state-of-the-art methods employ pretraining and fine-tuning strategies on natural scene datasets, which do not exploit the feature interaction across other complex domains. In this work, we explore and investigate the problem of domain-agnostic scene text spotting, i.e., training a model on multi-domain source data such that it can directly generalize to target domains rather than being specialized for a specific domain or scenario. In this regard, we present the community a text spotting validation benchmark called Under-Water Text (UWT) for noisy underwater scenes to establish an important case study. Moreover, we also design an efficient super-resolution based end-to-end transformer baseline called DA-TextSpotter which achieves comparable or superior performance over existing text spotting architectures for both regular and arbitrary-shaped scene text spotting benchmarks in terms of both accuracy and model efficiency. The dataset, code and pre-trained models will be released upon acceptance.

CVSep 21, 2022
A Few Shot Multi-Representation Approach for N-gram Spotting in Historical Manuscripts

Giuseppe De Gregorio, Sanket Biswas, Mohamed Ali Souibgui et al.

Despite recent advances in automatic text recognition, the performance remains moderate when it comes to historical manuscripts. This is mainly because of the scarcity of available labelled data to train the data-hungry Handwritten Text Recognition (HTR) models. The Keyword Spotting System (KWS) provides a valid alternative to HTR due to the reduction in error rate, but it is usually limited to a closed reference vocabulary. In this paper, we propose a few-shot learning paradigm for spotting sequences of a few characters (N-gram) that requires a small amount of labelled training data. We exhibit that recognition of important n-grams could reduce the system's dependency on vocabulary. In this case, an out-of-vocabulary (OOV) word in an input handwritten line image could be a sequence of n-grams that belong to the lexicon. An extensive experimental evaluation of our proposed multi-representation approach was carried out on a subset of Bentham's historical manuscript collections to obtain some really promising results in this direction.

LGDec 5, 2024Code
BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks

Juan Rodriguez, Xiangru Jian, Siba Smarak Panigrahi et al. · mila

Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows, extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .

CVFeb 17, 2024Code
GraphKD: Exploring Knowledge Distillation Towards Document Object Detection with Structured Graph Creation

Ayan Banerjee, Sanket Biswas, Josep Lladós et al.

Object detection in documents is a key step to automate the structural elements identification process in a digital or scanned document through understanding the hierarchical structure and relationships between different elements. Large and complex models, while achieving high accuracy, can be computationally expensive and memory-intensive, making them impractical for deployment on resource constrained devices. Knowledge distillation allows us to create small and more efficient models that retain much of the performance of their larger counterparts. Here we present a graph-based knowledge distillation framework to correctly identify and localize the document objects in a document image. Here, we design a structured graph with nodes containing proposal-level features and edges representing the relationship between the different proposal regions. Also, to reduce text bias an adaptive node sampling strategy is designed to prune the weight distribution and put more weightage on non-text nodes. We encode the complete graph as a knowledge representation and transfer it from the teacher to the student through the proposed distillation loss by effectively capturing both local and global information concurrently. Extensive experimentation on competitive benchmarks demonstrates that the proposed framework outperforms the current state-of-the-art approaches. The code will be available at: https://github.com/ayanban011/GraphKD.

CVMay 6, 2024Code
GeoContrastNet: Contrastive Key-Value Edge Learning for Language-Agnostic Document Understanding

Nil Biescas, Carlos Boned, Josep Lladós et al.

This paper presents GeoContrastNet, a language-agnostic framework to structured document understanding (DU) by integrating a contrastive learning objective with graph attention networks (GATs), emphasizing the significant role of geometric features. We propose a novel methodology that combines geometric edge features with visual features within an overall two-staged GAT-based framework, demonstrating promising results in both link prediction and semantic entity recognition performance. Our findings reveal that combining both geometric and visual features could match the capabilities of large DU models that rely heavily on Optical Character Recognition (OCR) features in terms of performance accuracy and efficiency. This approach underscores the critical importance of relational layout information between the named text entities in a semi-structured layout of a page. Specifically, our results highlight the model's proficiency in identifying key-value relationships within the FUNSD dataset for forms and also discovering the spatial relationships in table-structured layouts for RVLCDIP business invoices. Our code and pretrained models will be accessible on our official GitHub.

CVMay 8, 2023Code
SwinDocSegmenter: An End-to-End Unified Domain Adaptive Transformer for Document Instance Segmentation

Ayan Banerjee, Sanket Biswas, Josep Lladós et al.

Instance-level segmentation of documents consists in assigning a class-aware and instance-aware label to each pixel of the image. It is a key step in document parsing for their understanding. In this paper, we present a unified transformer encoder-decoder architecture for en-to-end instance segmentation of complex layouts in document images. The method adapts a contrastive training with a mixed query selection for anchor initialization in the decoder. Later on, it performs a dot product between the obtained query embeddings and the pixel embedding map (coming from the encoder) for semantic reasoning. Extensive experimentation on competitive benchmarks like PubLayNet, PRIMA, Historical Japanese (HJ), and TableBank demonstrate that our model with SwinL backbone achieves better segmentation performance than the existing state-of-the-art approaches with the average precision of \textbf{93.72}, \textbf{54.39}, \textbf{84.65} and \textbf{98.04} respectively under one billion parameters. The code is made publicly available at: \href{https://github.com/ayanban011/SwinDocSegmenter}{github.com/ayanban011/SwinDocSegmenter}

CVMay 1, 2023Code
SelfDocSeg: A Self-Supervised vision-based Approach towards Document Segmentation

Subhajit Maity, Sanket Biswas, Siladittya Manna et al.

Document layout analysis is a known problem to the documents research community and has been vastly explored yielding a multitude of solutions ranging from text mining, and recognition to graph-based representation, visual feature extraction, etc. However, most of the existing works have ignored the crucial fact regarding the scarcity of labeled data. With growing internet connectivity to personal life, an enormous amount of documents had been available in the public domain and thus making data annotation a tedious task. We address this challenge using self-supervision and unlike, the few existing self-supervised document segmentation approaches which use text mining and textual labels, we use a complete vision-based approach in pre-training without any ground-truth label or its derivative. Instead, we generate pseudo-layouts from the document images to pre-train an image encoder to learn the document object representation and localization in a self-supervised framework before fine-tuning it with an object detection model. We show that our pipeline sets a new benchmark in this context and performs at par with the existing methods and the supervised counterparts, if not outperforms. The code is made publicly available at: https://github.com/MaitySubhajit/SelfDocSeg

CVJan 25, 2022Code
DocEnTr: An End-to-End Document Image Enhancement Transformer

Mohamed Ali Souibgui, Sanket Biswas, Sana Khamekhem Jemni et al.

Document images can be affected by many degradation scenarios, which cause recognition and processing difficulties. In this age of digitization, it is important to denoise them for proper usage. To address this challenge, we present a new encoder-decoder architecture based on vision transformers to enhance both machine-printed and handwritten document images, in an end-to-end fashion. The encoder operates directly on the pixel patches with their positional information without the use of any convolutional layers, while the decoder reconstructs a clean image from the encoded patches. Conducted experiments show a superiority of the proposed model compared to the state-of the-art methods on several DIBCO benchmarks. Code and models will be publicly available at: \url{https://github.com/dali92002/DocEnTR}.

CVDec 20, 2023
A Unified Representation Framework for the Evaluation of Optical Music Recognition Systems

Pau Torras, Sanket Biswas, Alicia Fornés

Modern-day Optical Music Recognition (OMR) is a fairly fragmented field. Most OMR approaches use datasets that are independent and incompatible between each other, making it difficult to both combine them and compare recognition systems built upon them. In this paper we identify the need of a common music representation language and propose the Music Tree Notation (MTN) format, with the idea to construct a common endpoint for OMR research that allows coordination, reuse of technology and fair evaluation of community efforts. This format represents music as a set of primitives that group together into higher-abstraction nodes, a compromise between the expression of fully graph-based and sequential notation formats. We have also developed a specific set of OMR metrics and a typeset score dataset as a proof of concept of this idea.

CVJan 3, 2024
Synthetic dataset of ID and Travel Document

Carlos Boned, Maxime Talarmain, Nabil Ghanmi et al.

This paper presents a new synthetic dataset of ID and travel documents, called SIDTD. The SIDTD dataset is created to help training and evaluating forged ID documents detection systems. Such a dataset has become a necessity as ID documents contain personal information and a public dataset of real documents can not be released. Moreover, forged documents are scarce, compared to legit ones, and the way they are generated varies from one fraudster to another resulting in a class of high intra-variability. In this paper we trained state-of-the-art models on this dataset and we compare them to the performance achieved in larger, but private, datasets. The creation of this dataset will help to document image analysis community to progress in the task of ID document verification.

CVApr 12, 2025
NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding

Aniket Pal, Sanket Biswas, Alloy Das et al.

Understanding and reasoning over academic handwritten notes remains a challenge in document AI, particularly for mathematical equations, diagrams, and scientific notations. Existing visual question answering (VQA) benchmarks focus on printed or structured handwritten text, limiting generalization to real-world note-taking. To address this, we introduce NoTeS-Bank, an evaluation benchmark for Neural Transcription and Search in note-based question answering. NoTeS-Bank comprises complex notes across multiple domains, requiring models to process unstructured and multimodal content. The benchmark defines two tasks: (1) Evidence-Based VQA, where models retrieve localized answers with bounding-box evidence, and (2) Open-Domain VQA, where models classify the domain before retrieving relevant documents and answers. Unlike classical Document VQA datasets relying on optical character recognition (OCR) and structured data, NoTeS-BANK demands vision-language fusion, retrieval, and multimodal reasoning. We benchmark state-of-the-art Vision-Language Models (VLMs) and retrieval frameworks, exposing structured transcription and reasoning limitations. NoTeS-Bank provides a rigorous evaluation with NDCG@5, MRR, Recall@K, IoU, and ANLS, establishing a new standard for visual document understanding and reasoning.

CVJun 12, 2024
LayeredDoc: Domain Adaptive Document Restoration with a Layer Separation Approach

Maria Pilligua, Nil Biescas, Javier Vazquez-Corral et al.

The rapid evolution of intelligent document processing systems demands robust solutions that adapt to diverse domains without extensive retraining. Traditional methods often falter with variable document types, leading to poor performance. To overcome these limitations, this paper introduces a text-graphic layer separation approach that enhances domain adaptability in document image restoration (DIR) systems. We propose LayeredDoc, which utilizes two layers of information: the first targets coarse-grained graphic components, while the second refines machine-printed textual content. This hierarchical DIR framework dynamically adjusts to the characteristics of the input document, facilitating effective domain adaptation. We evaluated our approach both qualitatively and quantitatively using a new real-world dataset, LayeredDocDB, developed for this study. Initially trained on a synthetically generated dataset, our model demonstrates strong generalization capabilities for the DIR task, offering a promising solution for handling variability in real-world data. Our code is accessible on GitHub.

CVJun 12, 2024
DocSynthv2: A Practical Autoregressive Modeling for Document Generation

Sanket Biswas, Rajiv Jain, Vlad I. Morariu et al.

While the generation of document layouts has been extensively explored, comprehensive document generation encompassing both layout and content presents a more complex challenge. This paper delves into this advanced domain, proposing a novel approach called DocSynthv2 through the development of a simple yet effective autoregressive structured model. Our model, distinct in its integration of both layout and textual cues, marks a step beyond existing layout-generation approaches. By focusing on the relationship between the structural elements and the textual content within documents, we aim to generate cohesive and contextually relevant documents without any reliance on visual components. Through experimental studies on our curated benchmark for the new task, we demonstrate the ability of our model combining layout and textual information in enhancing the generation quality and relevance of documents, opening new pathways for research in document creation and automated design. Our findings emphasize the effectiveness of autoregressive models in handling complex document generation tasks.

CVJun 12, 2024
DistilDoc: Knowledge Distillation for Visually-Rich Document Applications

Jordy Van Landeghem, Subhajit Maity, Ayan Banerjee et al.

This work explores knowledge distillation (KD) for visually-rich document (VRD) applications such as document layout analysis (DLA) and document image classification (DIC). While VRD research is dependent on increasingly sophisticated and cumbersome models, the field has neglected to study efficiency via model compression. Here, we design a KD experimentation methodology for more lean, performant models on document understanding (DU) tasks that are integral within larger task pipelines. We carefully selected KD strategies (response-based, feature-based) for distilling knowledge to and from backbones with different architectures (ResNet, ViT, DiT) and capacities (base, small, tiny). We study what affects the teacher-student knowledge gap and find that some methods (tuned vanilla KD, MSE, SimKD with an apt projector) can consistently outperform supervised student training. Furthermore, we design downstream task setups to evaluate covariate shift and the robustness of distilled DLA models on zero-shot layout-aware document visual question answering (DocVQA). DLA-KD experiments result in a large mAP knowledge gap, which unpredictably translates to downstream robustness, accentuating the need to further explore how to efficiently obtain more semantic document layout awareness.

CVMay 6, 2024
SketchGPT: Autoregressive Modeling for Sketch Generation and Recognition

Adarsh Tiwari, Sanket Biswas, Josep Lladós

We present SketchGPT, a flexible framework that employs a sequence-to-sequence autoregressive model for sketch generation, and completion, and an interpretation case study for sketch recognition. By mapping complex sketches into simplified sequences of abstract primitives, our approach significantly streamlines the input for autoregressive modeling. SketchGPT leverages the next token prediction objective strategy to understand sketch patterns, facilitating the creation and completion of drawings and also categorizing them accurately. This proposed sketch representation strategy aids in overcoming existing challenges of autoregressive modeling for continuous stroke data, enabling smoother model training and competitive performance. Our findings exhibit SketchGPT's capability to generate a diverse variety of drawings by adding both qualitative and quantitative comparisons with existing state-of-the-art, along with a comprehensive human evaluation study. The code and pretrained models will be released on our official GitHub.

CVJan 27, 2022
DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer

Sanket Biswas, Ayan Banerjee, Josep Lladós et al.

Understanding documents with rich layouts is an essential step towards information extraction. Business intelligence processes often require the extraction of useful semantic content from documents at a large scale for subsequent decision-making tasks. In this context, instance-level segmentation of different document objects (title, sections, figures etc.) has emerged as an interesting problem for the document analysis and understanding community. To advance the research in this direction, we present a transformer-based model called \emph{DocSegTr} for end-to-end instance segmentation of complex layouts in document images. The method adapts a twin attention module, for semantic reasoning, which helps to become highly computationally efficient compared with the state-of-the-art. To the best of our knowledge, this is the first work on transformer-based document segmentation. Extensive experimentation on competitive benchmarks like PubLayNet, PRIMA, Historical Japanese (HJ) and TableBank demonstrate that our model achieved comparable or better segmentation performance than the existing state-of-the-art approaches with the average precision of 89.4, 40.3, 83.4 and 93.3. This simple and flexible framework could serve as a promising baseline for instance-level recognition tasks in document images.

CVJul 9, 2021
Graph-based Deep Generative Modelling for Document Layout Generation

Sanket Biswas, Pau Riba, Josep Lladós et al.

One of the major prerequisites for any deep learning approach is the availability of large-scale training data. When dealing with scanned document images in real world scenarios, the principal information of its content is stored in the layout itself. In this work, we have proposed an automated deep generative model using Graph Neural Networks (GNNs) to generate synthetic data with highly variable and plausible document layouts that can be used to train document interpretation systems, in this case, specially in digital mailroom applications. It is also the first graph-based approach for document layout generation task experimented on administrative document images, in this case, invoices.

CVJul 6, 2021
DocSynth: A Layout Guided Approach for Controllable Document Image Synthesis

Sanket Biswas, Pau Riba, Josep Lladós et al.

Despite significant progress on current state-of-the-art image generation models, synthesis of document images containing multiple and complex object layouts is a challenging task. This paper presents a novel approach, called DocSynth, to automatically synthesize document images based on a given layout. In this work, given a spatial layout (bounding boxes with object categories) as a reference by the user, our proposed DocSynth model learns to generate a set of realistic document images consistent with the defined layout. Also, this framework has been adapted to this work as a superior baseline model for creating synthetic document image datasets for augmenting real data during training for document layout analysis tasks. Different sets of learning objectives have been also used to improve the model performance. Quantitatively, we also compare the generated results of our model with real data using standard evaluation metrics. The results highlight that our model can successfully generate realistic and diverse document images with multiple objects. We also present a comprehensive qualitative analysis summary of the different scopes of synthetic image generation tasks. Lastly, to our knowledge this is the first work of its kind.

CVOct 24, 2018
Fault Area Detection in Leaf Diseases using k-means Clustering

Subhajit Maity, Sujan Sarkar, Avinaba Tapadar et al.

With increasing population the crisis of food is getting bigger day by day.In this time of crisis,the leaf disease of crops is the biggest problem in the food industry.In this paper, we have addressed that problem and proposed an efficient method to detect leaf disease.Leaf diseases can be detected from sample images of the leaf with the help of image processing and segmentation.Using k-means clustering and Otsu's method the faulty region in a leaf is detected which helps to determine proper course of action to be taken.Further the ratio of normal and faulty region if calculated would be able to predict if the leaf can be cured at all.

LGOct 23, 2018
A Statistical Approach to Adult Census Income Level Prediction

Navoneel Chakrabarty, Sanket Biswas

The prominent inequality of wealth and income is a huge concern especially in the United States. The likelihood of diminishing poverty is one valid reason to reduce the world's surging level of economic inequality. The principle of universal moral equality ensures sustainable development and improve the economic stability of a nation. Governments in different countries have been trying their best to address this problem and provide an optimal solution. This study aims to show the usage of machine learning and data mining techniques in providing a solution to the income equality problem. The UCI Adult Dataset has been used for the purpose. Classification has been done to predict whether a person's yearly income in US falls in the income category of either greater than 50K Dollars or less equal to 50K Dollars category based on a certain set of attributes. The Gradient Boosting Classifier Model was deployed which clocked the highest accuracy of 88.16%, eventually breaking the benchmark accuracy of existing works.