Faisal Shafait

CV
h-index19
19papers
695citations
Novelty51%
AI Score46

19 Papers

LGFeb 25
DocDjinn: Controllable Synthetic Document Generation with VLMs and Handwriting Diffusion

Marcel Lamott, Saifullah Saifullah, Nauman Riaz et al.

Effective document intelligence models rely on large amounts of annotated training data. However, procuring sufficient and high-quality data poses significant challenges due to the labor-intensive and costly nature of data acquisition. Additionally, leveraging language models to annotate real documents raises concerns about data privacy. Synthetic document generation has emerged as a promising, privacy-preserving alternative. We propose DocDjinn, a novel framework for controllable synthetic document generation using Vision-Language Models (VLMs) that produces annotated documents from unlabeled seed samples. Our approach generates visually plausible and semantically consistent synthetic documents that follow the distribution of an existing source dataset through clustering-based seed selection with parametrized sampling. By enriching documents with realistic diffusion-based handwriting and contextual visual elements via semantic-visual decoupling, we generate diverse, high-quality annotated synthetic documents. We evaluate across eleven benchmarks spanning key information extraction, question answering, document classification, and document layout analysis. To our knowledge, this is the first work demonstrating that VLMs can generate faithful annotated document datasets at scale from unlabeled seeds that can effectively enrich or approximate real, manually annotated data for diverse document understanding tasks. We show that with only 100 real training samples, our framework achieves on average $87\%$ of the performance of the full real-world dataset. We publicly release our code and 140k+ synthetic document samples.

CLFeb 15, 2024Code
LAPDoc: Layout-Aware Prompting for Documents

Marcel Lamott, Yves-Noel Weweler, Adrian Ulges et al.

Recent advances in training large language models (LLMs) using massive amounts of solely textual data lead to strong generalization across many domains and tasks, including document-specific tasks. Opposed to that there is a trend to train multi-modal transformer architectures tailored for document understanding that are designed specifically to fuse textual inputs with the corresponding document layout. This involves a separate fine-tuning step for which additional training data is required. At present, no document transformers with comparable generalization to LLMs are available That raises the question which type of model is to be preferred for document understanding tasks. In this paper we investigate the possibility to use purely text-based LLMs for document-specific tasks by using layout enrichment. We explore drop-in modifications and rule-based methods to enrich purely textual LLM prompts with layout information. In our experiments we investigate the effects on the commercial ChatGPT model and the open-source LLM Solar. We demonstrate that using our approach both LLMs show improved performance on various standard document benchmarks. In addition, we study the impact of noisy OCR and layout errors, as well as the limitations of LLMs when it comes to utilizing document layout. Our results indicate that layout enrichment can improve the performance of purely text-based LLMs for document understanding by up to 15% compared to just using plain document text. In conclusion, this approach should be considered for the best model choice between text-based LLM or multi-modal document transformers.

CVMay 31, 2019Code
Rethinking Table Recognition using Graph Neural Networks

Shah Rukh Qasim, Hassan Mahmood, Faisal Shafait

Document structure analysis, such as zone segmentation and table recognition, is a complex problem in document processing and is an active area of research. The recent success of deep learning in solving various computer vision and machine learning problems has not been reflected in document structure analysis since conventional neural networks are not well suited to the input structure of the problem. In this paper, we propose an architecture based on graph networks as a better alternative to standard neural networks for table recognition. We argue that graph networks are a more natural choice for these problems, and explore two gradient-based graph neural networks. Our proposed architecture combines the benefits of convolutional neural networks for visual feature extraction and graph networks for dealing with the problem structure. We empirically demonstrate that our method outperforms the baseline by a significant margin. In addition, we identify the lack of large scale datasets as a major hindrance for deep learning research for structure analysis and present a new large scale synthetic dataset for the problem of table recognition. Finally, we open-source our implementation of dataset generation and the training framework of our graph networks to promote reproducible research in this direction.

LGJul 8, 2018Code
Revisiting Distillation and Incremental Classifier Learning

Khurram Javed, Faisal Shafait

One of the key differences between the learning mechanism of humans and Artificial Neural Networks (ANNs) is the ability of humans to learn one task at a time. ANNs, on the other hand, can only learn multiple tasks simultaneously. Any attempts at learning new tasks incrementally cause them to completely forget about previous tasks. This lack of ability to learn incrementally, called Catastrophic Forgetting, is considered a major hurdle in building a true AI system. In this paper, our goal is to isolate the truly effective existing ideas for incremental learning from those that only work under certain conditions. To this end, we first thoroughly analyze the current state of the art (iCaRL) method for incremental learning and demonstrate that the good performance of the system is not because of the reasons presented in the existing literature. We conclude that the success of iCaRL is primarily due to knowledge distillation and recognize a key limitation of knowledge distillation, i.e, it often leads to bias in classifiers. Finally, we propose a dynamic threshold moving algorithm that is able to successfully remove this bias. We demonstrate the effectiveness of our algorithm on CIFAR100 and MNIST datasets showing near-optimal results. Our implementation is available at https://github.com/Khurramjaved96/incremental-learning.

DCDec 11, 2025
HLS4PC: A Parametrizable Framework For Accelerating Point-Based 3D Point Cloud Models on FPGA

Amur Saqib Pal, Muhammad Mohsin Ghaffar, Faisal Shafait et al.

Point-based 3D point cloud models employ computation and memory intensive mapping functions alongside NN layers for classification/segmentation, and are executed on server-grade GPUs. The sparse, and unstructured nature of 3D point cloud data leads to high memory and computational demand, hindering real-time performance in safety critical applications due to GPU under-utilization. To address this challenge, we present HLS4PC, a parameterizable HLS framework for FPGA acceleration. Our approach leverages FPGA parallelization and algorithmic optimizations to enable efficient fixed-point implementations of both mapping and NN functions. We explore several hardware-aware compression techniques on a state-of-the-art PointMLP-Elite model, including replacing FPS with URS, parameter quantization, layer fusion, and input-points pruning, yielding PointMLP-Lite, a 4x less complex variant with only 2% accuracy drop on ModelNet40. Secondly, we demonstrate that the FPGA acceleration of the PointMLP-Lite results in 3.56x higher throughput than previous works. Furthermore, our implementation achieves 2.3x and 22x higher throughput compared to the GPU and CPU implementations, respectively.

CVApr 29, 2021
TabAug: Data Driven Augmentation for Enhanced Table Structure Recognition

Umar Khan, Sohaib Zahid, Muhammad Asad Ali et al.

Table Structure Recognition is an essential part of end-to-end tabular data extraction in document images. The recent success of deep learning model architectures in computer vision remains to be non-reflective in table structure recognition, largely because extensive datasets for this domain are still unavailable while labeling new data is expensive and time-consuming. Traditionally, in computer vision, these challenges are addressed by standard augmentation techniques that are based on image transformations like color jittering and random cropping. As demonstrated by our experiments, these techniques are not effective for the task of table structure recognition. In this paper, we propose TabAug, a re-imagined Data Augmentation technique that produces structural changes in table images through replication and deletion of rows and columns. It also consists of a data-driven probabilistic model that allows control over the augmentation process. To demonstrate the efficacy of our approach, we perform experimentation on ICDAR 2013 dataset where our approach shows consistent improvements in all aspects of the evaluation metrics, with cell-level correct detections improving from 92.16% to 96.11% over the baseline.

LGNov 28, 2020
Short-Term Load Forecasting using Bi-directional Sequential Models and Feature Engineering for Small Datasets

Abdul Wahab, Muhammad Anas Tahir, Naveed Iqbal et al.

Electricity load forecasting enables the grid operators to optimally implement the smart grid's most essential features such as demand response and energy efficiency. Electricity demand profiles can vary drastically from one region to another on diurnal, seasonal and yearly scale. Hence to devise a load forecasting technique that can yield the best estimates on diverse datasets, specially when the training data is limited, is a big challenge. This paper presents a deep learning architecture for short-term load forecasting based on bidirectional sequential models in conjunction with feature engineering that extracts the hand-crafted derived features in order to aid the model for better learning and predictions. In the proposed architecture, named as Deep Derived Feature Fusion (DeepDeFF), the raw input and hand-crafted features are trained at separate levels and then their respective outputs are combined to make the final prediction. The efficacy of the proposed methodology is evaluated on datasets from five countries with completely different patterns. The results demonstrate that the proposed technique is superior to the existing state of the art.

CVMay 28, 2020
Two-stage framework for optic disc localization and glaucoma classification in retinal fundus images using deep learning

Muhammad Naseer Bajwa, Muhammad Imran Malik, Shoaib Ahmed Siddiqui et al.

With the advancement of powerful image processing and machine learning techniques, CAD has become ever more prevalent in all fields of medicine including ophthalmology. Since optic disc is the most important part of retinal fundus image for glaucoma detection, this paper proposes a two-stage framework that first detects and localizes optic disc and then classifies it into healthy or glaucomatous. The first stage is based on RCNN and is responsible for localizing and extracting optic disc from a retinal fundus image while the second stage uses Deep CNN to classify the extracted disc into healthy or glaucomatous. In addition to the proposed solution, we also developed a rule-based semi-automatic ground truth generation method that provides necessary annotations for training RCNN based model for automated disc localization. The proposed method is evaluated on seven publicly available datasets for disc localization and on ORIGA dataset, which is the largest publicly available dataset for glaucoma classification. The results of automatic localization mark new state-of-the-art on six datasets with accuracy reaching 100% on four of them. For glaucoma classification we achieved AUC equal to 0.874 which is 2.7% relative improvement over the state-of-the-art results previously obtained for classification on ORIGA. Once trained on carefully annotated data, Deep Learning based methods for optic disc detection and localization are not only robust, accurate and fully automated but also eliminates the need for dataset-dependent heuristic algorithms. Our empirical evaluation of glaucoma classification on ORIGA reveals that reporting only AUC, for datasets with class imbalance and without pre-defined train and test splits, does not portray true picture of the classifier's performance and calls for additional performance metrics to substantiate the results.

CVJan 8, 2020
Table Structure Extraction with Bi-directional Gated Recurrent Unit Networks

Saqib Ali Khan, Syed Muhammad Daniyal Khalid, Muhammad Ali Shahzad et al.

Tables present summarized and structured information to the reader, which makes table structure extraction an important part of document understanding applications. However, table structure identification is a hard problem not only because of the large variation in the table layouts and styles, but also owing to the variations in the page layouts and the noise contamination levels. A lot of research has been done to identify table structure, most of which is based on applying heuristics with the aid of optical character recognition (OCR) to hand pick layout features of the tables. These methods fail to generalize well because of the variations in the table layouts and the errors generated by OCR. In this paper, we have proposed a robust deep learning based approach to extract rows and columns from a detected table in document images with a high precision. In the proposed solution, the table images are first pre-processed and then fed to a bi-directional Recurrent Neural Network with Gated Recurrent Units (GRU) followed by a fully-connected layer with soft max activation. The network scans the images from top-to-bottom as well as left-to-right and classifies each input as either a row-separator or a column-separator. We have benchmarked our system on publicly available UNLV as well as ICDAR 2013 datasets on which it outperformed the state-of-the-art table structure extraction systems by a significant margin.

CVSep 3, 2019
Do Cross Modal Systems Leverage Semantic Relationships?

Shah Nawaz, Muhammad Kamran Janjua, Ignazio Gallo et al.

Current cross-modal retrieval systems are evaluated using R@K measure which does not leverage semantic relationships rather strictly follows the manually marked image text query pairs. Therefore, current systems do not generalize well for the unseen data in the wild. To handle this, we propose a new measure, SemanticMap, to evaluate the performance of cross-modal systems. Our proposed measure evaluates the semantic similarity between the image and text representations in the latent embedding space. We also propose a novel cross-modal retrieval system using a single stream network for bidirectional retrieval. The proposed system is based on a deep neural network trained using extended center loss, minimizing the distance of image and text descriptions in the latent space from the class centers. In our system, the text descriptions are also encoded as images which enabled us to use a single stream network for both text and images. To the best of our knowledge, our work is the first of its kind in terms of employing a single stream network for cross-modal retrieval systems. The proposed system is evaluated on two publicly available datasets including MSCOCO and Flickr30K and has shown comparable results to the current state-of-the-art methods.

AIJun 19, 2019
An Open-World Extension to Knowledge Graph Completion Models

Haseeb Shah, Johannes Villmow, Adrian Ulges et al.

We present a novel extension to embedding-based knowledge graph completion models which enables them to perform open-world link prediction, i.e. to predict facts for entities unseen in training based on their textual description. Our model combines a regular link prediction model learned from a knowledge graph with word embeddings learned from a textual corpus. After training both independently, we learn a transformation to map the embeddings of an entity's name and description to the graph-based embedding space. In experiments on several datasets including FB20k, DBPedia50k and our new dataset FB15k-237-OWE, we demonstrate competitive results. Particularly, our approach exploits the full knowledge graph structure even when textual descriptions are scarce, does not require a joint training on graph and text, and can be applied to any embedding-based link prediction model, such as TransE, ComplEx and DistMult.

CVApr 17, 2019
Converting a Common Document Scanner to a Multispectral Scanner

Zohaib Khan, Faisal Shafait, Ajmal Mian

We propose the construction of a prototype scanner designed to capture multispectral images of documents. A standard sheet-feed scanner is modified by disconnecting its internal light source and connecting an external multispectral light source comprising of narrow band light emitting diodes (LED). A document is scanned by illuminating the scanner light guide successively with different LEDs and capturing a scan of the document. The system is portable and can be used for potential applications in verification of questioned documents, cheques, receipts and bank notes.

CVJul 8, 2018
Distillation Techniques for Pseudo-rehearsal Based Incremental Learning

Haseeb Shah, Khurram Javed, Faisal Shafait

The ability to learn from incrementally arriving data is essential for any life-long learning system. However, standard deep neural networks forget the knowledge about the old tasks, a phenomenon called catastrophic forgetting, when trained on incrementally arriving data. We discuss the biases in current Generative Adversarial Networks (GAN) based approaches that learn the classifier by knowledge distillation from previously trained classifiers. These biases cause the trained classifier to perform poorly. We propose an approach to remove these biases by distilling knowledge from the classifier of AC-GAN. Experiments on MNIST and CIFAR10 show that this method is comparable to current state of the art rehearsal based approaches. The code for this paper is available at https://bit.ly/incremental-learning

LGJun 8, 2016
Efficient Estimation of k for the Nearest Neighbors Class of Methods

Aleksander Lodwich, Faisal Shafait, Thomas Breuel

The k Nearest Neighbors (kNN) method has received much attention in the past decades, where some theoretical bounds on its performance were identified and where practical optimizations were proposed for making it work fairly well in high dimensional spaces and on large datasets. From countless experiments of the past it became widely accepted that the value of k has a significant impact on the performance of this method. However, the efficient optimization of this parameter has not received so much attention in literature. Today, the most common approach is to cross-validate or bootstrap this value for all values in question. This approach forces distances to be recomputed many times, even if efficient methods are used. Hence, estimating the optimal k can become expensive even on modern systems. Frequently, this circumstance leads to a sparse manual search of k. In this paper we want to point out that a systematic and thorough estimation of the parameter k can be performed efficiently. The discussed approach relies on large matrices, but we want to argue, that in practice a higher space complexity is often much less of a problem than repetitive distance computations.

CVNov 29, 2015
Sparseness helps: Sparsity Augmented Collaborative Representation for Classification

Naveed Akhtar, Faisal Shafait, Ajmal Mian

Many classification approaches first represent a test sample using the training samples of all the classes. This collaborative representation is then used to label the test sample. It was a common belief that sparseness of the representation is the key to success for this classification scheme. However, more recently, it has been claimed that it is the collaboration and not the sparseness that makes the scheme effective. This claim is attractive as it allows to relinquish the computationally expensive sparsity constraint over the representation. In this paper, we first extend the analysis supporting this claim and then show that sparseness explicitly contributes to improved classification, hence it should not be completely ignored for computational gains. Inspired by this result, we augment a dense collaborative representation with a sparse representation and propose an efficient classification method that capitalizes on the resulting representation. The augmented representation and the classification method work together meticulously to achieve higher accuracy and lower computational time compared to state-of-the-art collaborative representation based classification approaches. Experiments on benchmark face, object and action databases show the efficacy of our approach.

CVMar 27, 2015
Discriminative Bayesian Dictionary Learning for Classification

Naveed Akhtar, Faisal Shafait, Ajmal Mian

We propose a Bayesian approach to learn discriminative dictionaries for sparse representation of data. The proposed approach infers probability distributions over the atoms of a discriminative dictionary using a Beta Process. It also computes sets of Bernoulli distributions that associate class labels to the learned dictionary atoms. This association signifies the selection probabilities of the dictionary atoms in the expansion of class-specific data. Furthermore, the non-parametric character of the proposed approach allows it to infer the correct size of the dictionary. We exploit the aforementioned Bernoulli distributions in separately learning a linear classifier. The classifier uses the same hierarchical Bayesian model as the dictionary, which we present along the analytical inference solution for Gibbs sampling. For classification, a test instance is first sparsely encoded over the learned dictionary and the codes are fed to the classifier. We performed experiments for face and action recognition; and object and scene-category classification using five public datasets and compared the results with state-of-the-art discriminative sparse representation approaches. Experiments show that the proposed Bayesian approach consistently outperforms the existing approaches.

CVMar 9, 2015
Representation Learning with Deep Extreme Learning Machines for Efficient Image Set Classification

Muhammad Uzair, Faisal Shafait, Bernard Ghanem et al.

Efficient and accurate joint representation of a collection of images, that belong to the same class, is a major research challenge for practical image set classification. Existing methods either make prior assumptions about the data structure, or perform heavy computations to learn structure from the data itself. In this paper, we propose an efficient image set representation that does not make any prior assumptions about the structure of the underlying data. We learn the non-linear structure of image sets with Deep Extreme Learning Machines (DELM) that are very efficient and generalize well even on a limited number of training samples. Extensive experiments on a broad range of public datasets for image set classification (Honda/UCSD, CMU Mobo, YouTube Celebrities, Celebrity-1000, ETH-80) show that the proposed algorithm consistently outperforms state-of-the-art image set classification methods both in terms of speed and accuracy.

CVOct 19, 2014
Dense 3D Face Correspondence

Syed Zulqarnain Gilani, Ajmal Mian, Faisal Shafait et al.

We present an algorithm that automatically establishes dense correspondences between a large number of 3D faces. Starting from automatically detected sparse correspondences on the outer boundary of 3D faces, the algorithm triangulates existing correspondences and expands them iteratively by matching points of distinctive surface curvature along the triangle edges. After exhausting keypoint matches, further correspondences are established by generating evenly distributed points within triangles by evolving level set geodesic curves from the centroids of large triangles. A deformable model (K3DM) is constructed from the dense corresponded faces and an algorithm is proposed for morphing the K3DM to fit unseen faces. This algorithm iterates between rigid alignment of an unseen face followed by regularized morphing of the deformable model. We have extensively evaluated the proposed algorithms on synthetic data and real 3D faces from the FRGCv2, Bosphorus, BU3DFE and UND Ear databases using quantitative and qualitative benchmarks. Our algorithm achieved dense correspondences with a mean localisation error of 1.28mm on synthetic faces and detected $14$ anthropometric landmarks on unseen real faces from the FRGCv2 database with 3mm precision. Furthermore, our deformable model fitting algorithm achieved 98.5% face recognition accuracy on the FRGCv2 and 98.6% on Bosphorus database. Our dense model is also able to generalize to unseen datasets.

CVFeb 6, 2014
Multispectral Palmprint Encoding and Recognition

Zohaib Khan, Faisal Shafait, Yiqun Hu et al.

Palmprints are emerging as a new entity in multi-modal biometrics for human identification and verification. Multispectral palmprint images captured in the visible and infrared spectrum not only contain the wrinkles and ridge structure of a palm, but also the underlying pattern of veins; making them a highly discriminating biometric identifier. In this paper, we propose a feature encoding scheme for robust and highly accurate representation and matching of multispectral palmprints. To facilitate compact storage of the feature, we design a binary hash table structure that allows for efficient matching in large databases. Comprehensive experiments for both identification and verification scenarios are performed on two public datasets -- one captured with a contact-based sensor (PolyU dataset), and the other with a contact-free sensor (CASIA dataset). Recognition results in various experimental setups show that the proposed method consistently outperforms existing state-of-the-art methods. Error rates achieved by our method (0.003% on PolyU and 0.2% on CASIA) are the lowest reported in literature on both dataset and clearly indicate the viability of palmprint as a reliable and promising biometric. All source codes are publicly available.