LGSep 22, 2022Code
Pretraining the Vision Transformer using self-supervised methods for vision based Deep Reinforcement LearningManuel Goulão, Arlindo L. Oliveira
The Vision Transformer architecture has shown to be competitive in the computer vision (CV) space where it has dethroned convolution-based networks in several benchmarks. Nevertheless, convolutional neural networks (CNN) remain the preferential architecture for the representation module in reinforcement learning. In this work, we study pretraining a Vision Transformer using several state-of-the-art self-supervised methods and assess the quality of the learned representations. To show the importance of the temporal dimension in this context we propose an extension of VICReg to better capture temporal relations between observations by adding a temporal order verification task. Our results show that all methods are effective in learning useful representations and avoiding representational collapse for observations from Atari Learning Environment (ALE) which leads to improvements in data efficiency when we evaluated in reinforcement learning (RL). Moreover, the encoder pretrained with the temporal order verification task shows the best results across all experiments, with richer representations, more focused attention maps and sparser representation vectors throughout the layers of the encoder, which shows the importance of exploring such similarity dimension. With this work, we hope to provide some insights into the representations learned by ViT during a self-supervised pretraining with observations from RL environments and which properties arise in the representations that lead to the best-performing agents. The source code will be available at: https://github.com/mgoulao/TOV-VICReg
CVJan 25, 2023
Connecting metrics for shape-texture knowledge in computer visionTiago Oliveira, Tiago Marques, Arlindo L. Oliveira · mit
Modern artificial neural networks, including convolutional neural networks and vision transformers, have mastered several computer vision tasks, including object recognition. However, there are many significant differences between the behavior and robustness of these systems and of the human visual system. Deep neural networks remain brittle and susceptible to many changes in the image that do not cause humans to misclassify images. Part of this different behavior may be explained by the type of features humans and deep neural networks use in vision tasks. Humans tend to classify objects according to their shape while deep neural networks seem to rely mostly on texture. Exploring this question is relevant, since it may lead to better performing neural network architectures and to a better understanding of the workings of the vision system of primates. In this work, we advance the state of the art in our understanding of this phenomenon, by extending previous analyses to a much larger set of deep neural network architectures. We found that the performance of models in image classification tasks is highly correlated with their shape bias measured at the output and penultimate layer. Furthermore, our results showed that the number of neurons that represent shape and texture are strongly anti-correlated, thus providing evidence that there is competition between these two types of features. Finally, we observed that while in general there is a correlation between performance and shape bias, there are significant variations between architecture families.
CLFeb 15, 2024Code
DE-COP: Detecting Copyrighted Content in Language Models Training DataAndré V. Duarte, Xuandong Zhao, Arlindo L. Oliveira et al. · berkeley, cmu
How can we detect if copyrighted content was used in the training process of a language model, considering that the training data is typically undisclosed? We are motivated by the premise that a language model is likely to identify verbatim excerpts from its training text. We propose DE-COP, a method to determine whether a piece of copyrighted content was included in training. DE-COP's core approach is to probe an LLM with multiple-choice questions, whose options include both verbatim text and their paraphrases. We construct BookTection, a benchmark with excerpts from 165 books published prior and subsequent to a model's training cutoff, along with their paraphrases. Our experiments show that DE-COP surpasses the prior best method by 9.6% in detection performance (AUC) on models with logits available. Moreover, DE-COP also achieves an average accuracy of 72% for detecting suspect books on fully black-box models where prior methods give approximately 4% accuracy. The code and datasets are available at https://github.com/LeiLiLab/DE-COP.
25.8CLMay 20
Sem-Detect: Semantic Level Detection of AI Generated Peer-ReviewsAndré V. Duarte, Brian Tufts, Aditya Oke et al.
How can we distinguish whether a peer review was written by a human or generated by an AI model? We argue that, in this setting, authorship should not be attributed solely from the textual features of a review, but also from the ideas, judgments, and claims it expresses. To this end, we propose Sem-Detect, an authorship detection method for peer reviews that operationalizes this principle by combining textual features with claim-level semantic analysis. Sem-Detect compares a target review against multiple AI-generated reviews of the same paper, leveraging the observation that different AI models tend to converge on similar points, while human reviewers introduce more unique and diverse ones. As a result, Sem-Detect is able to distinguish fully AI reviews from authentic human-written ones, including those that have been refined using an LLM but still reflect human judgment. Across a dataset of over 20,000 peer reviews from ICLR and NeurIPS conferences, Sem-Detect improves over the strongest baseline by 25.5% in TPR@0.1% FPR in the binary setting. Moreover, in the three-class scenario, we empirically show that LLM refinement preserves the semantic signals of human reviews, which remain distinct from the patterns exhibited by fully AI-generated text; as a result, fewer than 3.5% of LLM-refined human reviews are misclassified as AI-generated.
LGJul 5, 2023
Improving Address Matching using Siamese Transformer NetworksAndré V. Duarte, Arlindo L. Oliveira
Matching addresses is a critical task for companies and post offices involved in the processing and delivery of packages. The ramifications of incorrectly delivering a package to the wrong recipient are numerous, ranging from harm to the company's reputation to economic and environmental costs. This research introduces a deep learning-based model designed to increase the efficiency of address matching for Portuguese addresses. The model comprises two parts: (i) a bi-encoder, which is fine-tuned to create meaningful embeddings of Portuguese postal addresses, utilized to retrieve the top 10 likely matches of the un-normalized target address from a normalized database, and (ii) a cross-encoder, which is fine-tuned to accurately rerank the 10 addresses obtained by the bi-encoder. The model has been tested on a real-case scenario of Portuguese addresses and exhibits a high degree of accuracy, exceeding 95% at the door level. When utilized with GPU computations, the inference speed is about 4.5 times quicker than other traditional approaches such as BM25. An implementation of this system in a real-world scenario would substantially increase the effectiveness of the distribution process. Such an implementation is currently under investigation.
CVSep 25, 2024
Explicitly Modeling Pre-Cortical Vision with a Neuro-Inspired Front-End Improves CNN RobustnessLucas Piper, Arlindo L. Oliveira, Tiago Marques
While convolutional neural networks (CNNs) excel at clean image classification, they struggle to classify images corrupted with different common corruptions, limiting their real-world applicability. Recent work has shown that incorporating a CNN front-end block that simulates some features of the primate primary visual cortex (V1) can improve overall model robustness. Here, we expand on this approach by introducing two novel biologically-inspired CNN model families that incorporate a new front-end block designed to simulate pre-cortical visual processing. RetinaNet, a hybrid architecture containing the novel front-end followed by a standard CNN back-end, shows a relative robustness improvement of 12.3% when compared to the standard model; and EVNet, which further adds a V1 block after the pre-cortical front-end, shows a relative gain of 18.5%. The improvement in robustness was observed for all the different corruption categories, though accompanied by a small decrease in clean image accuracy, and generalized to a different back-end architecture. These findings show that simulating multiple stages of early visual processing in CNN early layers provides cumulative benefits for model robustness.
CVOct 16, 2023
Matching the Neuronal Representations of V1 is Necessary to Improve Robustness in CNNs with V1-like Front-endsRuxandra Barbulescu, Tiago Marques, Arlindo L. Oliveira
While some convolutional neural networks (CNNs) have achieved great success in object recognition, they struggle to identify objects in images corrupted with different types of common noise patterns. Recently, it was shown that simulating computations in early visual areas at the front of CNNs leads to improvements in robustness to image corruptions. Here, we further explore this result and show that the neuronal representations that emerge from precisely matching the distribution of RF properties found in primate V1 is key for this improvement in robustness. We built two variants of a model with a front-end modeling the primate primary visual cortex (V1): one sampling RF properties uniformly and the other sampling from empirical biological distributions. The model with the biological sampling has a considerably higher robustness to image corruptions that the uniform variant (relative difference of 8.72%). While similar neuronal sub-populations across the two variants have similar response properties and learn similar downstream weights, the impact on downstream processing is strikingly different. This result sheds light on the origin of the improvements in robustness observed in some biologically-inspired models, pointing to the need of precisely mimicking the neuronal representations found in the primate brain.
CVSep 19, 2025Code
Deep Feedback ModelsDavid Calhas, Arlindo L. Oliveira
Deep Feedback Models (DFMs) are a new class of stateful neural networks that combine bottom up input with high level representations over time. This feedback mechanism introduces dynamics into otherwise static architectures, enabling DFMs to iteratively refine their internal state and mimic aspects of biological decision making. We model this process as a differential equation solved through a recurrent neural network, stabilized via exponential decay to ensure convergence. To evaluate their effectiveness, we measure DFMs under two key conditions: robustness to noise and generalization with limited data. In both object recognition and segmentation tasks, DFMs consistently outperform their feedforward counterparts, particularly in low data or high noise regimes. In addition, DFMs translate to medical imaging settings, while being robust against various types of noise corruption. These findings highlight the importance of feedback in achieving stable, robust, and generalizable learning. Code is available at https://github.com/DCalhas/deep_feedback_models.
CVJul 14, 2025Code
Deep Recurrence for Dynamical Segmentation ModelsDavid Calhas, Arlindo L. Oliveira
While biological vision systems rely heavily on feedback connections to iteratively refine perception, most artificial neural networks remain purely feedforward, processing input in a single static pass. In this work, we propose a predictive coding inspired feedback mechanism that introduces a recurrent loop from output to input, allowing the model to refine its internal state over time. We implement this mechanism within a standard U-Net architecture and introduce two biologically motivated operations, softmax projection and exponential decay, to ensure stability of the feedback loop. Through controlled experiments on a synthetic segmentation task, we show that the feedback model significantly outperforms its feedforward counterpart in noisy conditions and generalizes more effectively with limited supervision. Notably, feedback achieves above random performance with just two training examples, while the feedforward model requires at least four. Our findings demonstrate that feedback enhances robustness and data efficiency, and offer a path toward more adaptive and biologically inspired neural architectures. Code is available at: github.com/DCalhas/feedback_segmentation.
CVMar 11, 2025Code
Are ECGs enough? Deep learning classification of pulmonary embolism using electrocardiogramsJoao D. S. Marques, Arlindo L. Oliveira
Pulmonary embolism is a leading cause of out of hospital cardiac arrest that requires fast diagnosis. While computed tomography pulmonary angiography is the standard diagnostic tool, it is not always accessible. Electrocardiography is an essential tool for diagnosing multiple cardiac anomalies, as it is affordable, fast and available in many settings. However, the availability of public ECG datasets, specially for PE, is limited and, in practice, these datasets tend to be small, making it essential to optimize learning strategies. In this study, we investigate the performance of multiple neural networks in order to assess the impact of various approaches. Moreover, we check whether these practices enhance model generalization when transfer learning is used to translate information learned in larger ECG datasets, such as PTB-XL, CPSC18 and MedalCare-XL, to a smaller, more challenging dataset for PE. By leveraging transfer learning, we analyze the extent to which we can improve learning efficiency and predictive performance on limited data. Code available at https://github.com/joaodsmarques/Are-ECGs-enough-Deep-Learning-Classifiers .
CLJun 25, 2024Code
LumberChunker: Long-Form Narrative Document SegmentationAndré V. Duarte, João Marques, Miguel Graça et al.
Modern NLP tasks increasingly rely on dense retrieval methods to access up-to-date and relevant contextual information. We are motivated by the premise that retrieval benefits from segments that can vary in size such that a content's semantic independence is better captured. We propose LumberChunker, a method leveraging an LLM to dynamically segment documents, which iteratively prompts the LLM to identify the point within a group of sequential passages where the content begins to shift. To evaluate our method, we introduce GutenQA, a benchmark with 3000 "needle in a haystack" type of question-answer pairs derived from 100 public domain narrative books available on Project Gutenberg. Our experiments show that LumberChunker not only outperforms the most competitive baseline by 7.37% in retrieval performance (DCG@20) but also that, when integrated into a RAG pipeline, LumberChunker proves to be more effective than other chunking methods and competitive baselines, such as the Gemini 1.5M Pro. Our Code and Data are available at https://github.com/joaodsmarques/LumberChunker
CVFeb 24, 2025Code
DIS-CO: Discovering Copyrighted Content in VLMs Training DataAndré V. Duarte, Xuandong Zhao, Arlindo L. Oliveira et al. · berkeley
How can we verify whether copyrighted content was used to train a large vision-language model (VLM) without direct access to its training data? Motivated by the hypothesis that a VLM is able to recognize images from its training corpus, we propose DIS-CO, a novel approach to infer the inclusion of copyrighted content during the model's development. By repeatedly querying a VLM with specific frames from targeted copyrighted material, DIS-CO extracts the content's identity through free-form text completions. To assess its effectiveness, we introduce MovieTection, a benchmark comprising 14,000 frames paired with detailed captions, drawn from films released both before and after a model's training cutoff. Our results show that DIS-CO significantly improves detection performance, nearly doubling the average AUC of the best prior method on models with logits available. Our findings also highlight a broader concern: all tested models appear to have been exposed to some extent to copyrighted content. Our code and data are available at https://github.com/avduarte333/DIS-CO
LGFeb 26
Engineering FAIR Privacy-preserving Applications that Learn Histories of DiseaseInes N. Duarte, Praphulla M. S. Bhawsar, Lee K. Mason et al.
A recent report on "Learning the natural history of human disease with generative transformers" created an opportunity to assess the engineering challenge of delivering user-facing Generative AI applications in privacy-sensitive domains. The application of these models, particularly for personalized healthcare tasks like predicting individual morbidity risk, is typically constrained by data privacy concerns. This project was accordingly designed as an in-browser model deployment exercise (an "App") testing the architectural boundaries of client-side inference generation (no downloads or installations). We relied exclusively on the documentation provided in the reference report to develop the model, specifically testing the "R" component of the FAIR data principles: Findability, Accessibility, Interoperability, and Reusability. The successful model deployment, leveraging ONNX and a custom JavaScript SDK, establishes a secure, high-performance architectural blueprint for the future of private generative AI in medicine.
CLOct 29, 2025
RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic PipelineAndré V. Duarte, Xuying li, Bin Zeng et al.
If we cannot inspect the training data of a large language model (LLM), how can we ever know what it has seen? We believe the most compelling evidence arises when the model itself freely reproduces the target content. As such, we propose RECAP, an agentic pipeline designed to elicit and verify memorized training data from LLM outputs. At the heart of RECAP is a feedback-driven loop, where an initial extraction attempt is evaluated by a secondary language model, which compares the output against a reference passage and identifies discrepancies. These are then translated into minimal correction hints, which are fed back into the target model to guide subsequent generations. In addition, to address alignment-induced refusals, RECAP includes a jailbreaking module that detects and overcomes such barriers. We evaluate RECAP on EchoTrace, a new benchmark spanning over 30 full books, and the results show that RECAP leads to substantial gains over single-iteration approaches. For instance, with GPT-4.1, the average ROUGE-L score for the copyrighted text extraction improved from 0.38 to 0.47 - a nearly 24% increase.
AINov 14, 2023
DeepThought: An Architecture for Autonomous Self-motivated SystemsArlindo L. Oliveira, Tiago Domingos, Mário Figueiredo et al.
The ability of large language models (LLMs) to engage in credible dialogues with humans, taking into account the training data and the context of the conversation, has raised discussions about their ability to exhibit intrinsic motivations, agency, or even some degree of consciousness. We argue that the internal architecture of LLMs and their finite and volatile state cannot support any of these properties. By combining insights from complementary learning systems, global neuronal workspace, and attention schema theories, we propose to integrate LLMs and other deep learning systems into an architecture for cognitive language agents able to exhibit properties akin to agency, self-motivation, even some features of meta-cognition.
CVJun 3, 2025
Explicitly Modeling Subcortical Vision with a Neuro-Inspired Front-End Improves CNN RobustnessLucas Piper, Arlindo L. Oliveira, Tiago Marques
Convolutional neural networks (CNNs) trained on object recognition achieve high task performance but continue to exhibit vulnerability under a range of visual perturbations and out-of-domain images, when compared with biological vision. Prior work has demonstrated that coupling a standard CNN with a front-end (VOneBlock) that mimics the primate primary visual cortex (V1) can improve overall model robustness. Expanding on this, we introduce Early Vision Networks (EVNets), a new class of hybrid CNNs that combine the VOneBlock with a novel SubcorticalBlock, whose architecture draws from computational models in neuroscience and is parameterized to maximize alignment with subcortical responses reported across multiple experimental studies. Without being optimized to do so, the assembly of the SubcorticalBlock with the VOneBlock improved V1 alignment across most standard V1 benchmarks, and better modeled extra-classical receptive field phenomena. In addition, EVNets exhibit stronger emergent shape bias and outperform the base CNN architecture by 9.3% on an aggregate benchmark of robustness evaluations, including adversarial perturbations, common corruptions, and domain shifts. Finally, we show that EVNets can be further improved when paired with a state-of-the-art data augmentation technique, surpassing the performance of the isolated data augmentation approach by 6.2% on our robustness benchmark. This result reveals complementary benefits between changes in architecture to better mimic biology and training-based machine learning approaches.
CVMar 13, 2025
CountPath: Automating Fragment Counting in Digital PathologyAna Beatriz Vieira, Maria Valente, Diana Montezuma et al.
Quality control of medical images is a critical component of digital pathology, ensuring that diagnostic images meet required standards. A pre-analytical task within this process is the verification of the number of specimen fragments, a process that ensures that the number of fragments on a slide matches the number documented in the macroscopic report. This step is important to ensure that the slides contain the appropriate diagnostic material from the grossing process, thereby guaranteeing the accuracy of subsequent microscopic examination and diagnosis. Traditionally, this assessment is performed manually, requiring significant time and effort while being subject to significant variability due to its subjective nature. To address these challenges, this study explores an automated approach to fragment counting using the YOLOv9 and Vision Transformer models. Our results demonstrate that the automated system achieves a level of performance comparable to expert assessments, offering a reliable and efficient alternative to manual counting. Additionally, we present findings on interobserver variability, showing that the automated approach achieves an accuracy of 86%, which falls within the range of variation observed among experts (82-88%), further supporting its potential for integration into routine pathology workflows.
CYOct 8, 2025
Leveraging LLMs to Streamline the Review of Public Funding ApplicationsJoao D. S. Marques, Andre V. Duarte, Andre Carvalho et al.
Every year, the European Union and its member states allocate millions of euros to fund various development initiatives. However, the increasing number of applications received for these programs often creates significant bottlenecks in evaluation processes, due to limited human capacity. In this work, we detail the real-world deployment of AI-assisted evaluation within the pipeline of two government initiatives: (i) corporate applications aimed at international business expansion, and (ii) citizen reimbursement claims for investments in energy-efficient home improvements. While these two cases involve distinct evaluation procedures, our findings confirm that AI effectively enhanced processing efficiency and reduced workload across both types of applications. Specifically, in the citizen reimbursement claims initiative, our solution increased reviewer productivity by 20.1%, while keeping a negligible false-positive rate based on our test set observations. These improvements resulted in an overall reduction of more than 2 months in the total evaluation time, illustrating the impact of AI-driven automation in large-scale evaluation workflows.
AIJan 31, 2025
The role of positional encodings in the ARC benchmarkGuilherme H. Bandeira Costa, Miguel Freire, Arlindo L. Oliveira
The Abstraction and Reasoning Corpus challenges AI systems to perform abstract reasoning with minimal training data, a task intuitive for humans but demanding for machine learning models. Using CodeT5+ as a case study, we demonstrate how limitations in positional encoding hinder reasoning and impact performance. This work further examines the role of positional encoding across transformer architectures, highlighting its critical influence on models of varying sizes and configurations. Comparing several strategies, we find that while 2D positional encoding and Rotary Position Embedding offer competitive performance, 2D encoding excels in data-constrained scenarios, emphasizing its effectiveness for ARC tasks
CVDec 20, 2024
The Role of Recurrency in Image Segmentation for Noisy and Limited Sample SettingsDavid Calhas, João Marques, Arlindo L. Oliveira
The biological brain has inspired multiple advances in machine learning. However, most state-of-the-art models in computer vision do not operate like the human brain, simply because they are not capable of changing or improving their decisions/outputs based on a deeper analysis. The brain is recurrent, while these models are not. It is therefore relevant to explore what would be the impact of adding recurrent mechanisms to existing state-of-the-art architectures and to answer the question of whether recurrency can improve existing architectures. To this end, we build on a feed-forward segmentation model and explore multiple types of recurrency for image segmentation. We explore self-organizing, relational, and memory retrieval types of recurrency that minimize a specific energy function. In our experiments, we tested these models on artificial and medical imaging data, while analyzing the impact of high levels of noise and few-shot learning settings. Our results do not validate our initial hypothesis that recurrent models should perform better in these settings, suggesting that these recurrent architectures, by themselves, are not sufficient to surpass state-of-the-art feed-forward versions and that additional work needs to be done on the topic.
CVApr 1, 2024
Finding Regions of Interest in Whole Slide Images Using Multiple Instance LearningMartim Afonso, Praphulla M. S. Bhawsar, Monjoy Saha et al.
Whole Slide Images (WSI), obtained by high-resolution digital scanning of microscope slides at multiple scales, are the cornerstone of modern Digital Pathology. However, they represent a particular challenge to AI-based/AI-mediated analysis because pathology labeling is typically done at slide-level, instead of tile-level. It is not just that medical diagnostics is recorded at the specimen level, the detection of oncogene mutation is also experimentally obtained, and recorded by initiatives like The Cancer Genome Atlas (TCGA), at the slide level. This configures a dual challenge: a) accurately predicting the overall cancer phenotype and b) finding out what cellular morphologies are associated with it at the tile level. To address these challenges, a weakly supervised Multiple Instance Learning (MIL) approach was explored for two prevalent cancer types, Invasive Breast Carcinoma (TCGA-BRCA) and Lung Squamous Cell Carcinoma (TCGA-LUSC). This approach was explored for tumor detection at low magnification levels and TP53 mutations at various levels. Our results show that a novel additive implementation of MIL matched the performance of reference implementation (AUC 0.96), and was only slightly outperformed by Attention MIL (AUC 0.97). More interestingly from the perspective of the molecular pathologist, these different AI architectures identify distinct sensitivities to morphological features (through the detection of Regions of Interest, RoI) at different amplification levels. Tellingly, TP53 mutation was most sensitive to features at the higher applications where cellular morphology is resolved.
IVMar 19, 2024
QUBIQ: Uncertainty Quantification for Biomedical Image Segmentation ChallengeHongwei Bran Li, Fernando Navarro, Ivan Ezhov et al.
Uncertainty in medical image segmentation tasks, especially inter-rater variability, arising from differences in interpretations and annotations by various experts, presents a significant challenge in achieving consistent and reliable image segmentation. This variability not only reflects the inherent complexity and subjective nature of medical image interpretation but also directly impacts the development and evaluation of automated segmentation algorithms. Accurately modeling and quantifying this variability is essential for enhancing the robustness and clinical applicability of these algorithms. We report the set-up and summarize the benchmark results of the Quantification of Uncertainties in Biomedical Image Quantification Challenge (QUBIQ), which was organized in conjunction with International Conferences on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2020 and 2021. The challenge focuses on the uncertainty quantification of medical image segmentation which considers the omnipresence of inter-rater variability in imaging datasets. The large collection of images with multi-rater annotations features various modalities such as MRI and CT; various organs such as the brain, prostate, kidney, and pancreas; and different image dimensions 2D-vs-3D. A total of 24 teams submitted different solutions to the problem, combining various baseline models, Bayesian neural networks, and ensemble model techniques. The obtained results indicate the importance of the ensemble models, as well as the need for further research to develop efficient 3D methods for uncertainty quantification methods in 3D segmentation tasks.
LGJan 8, 2022
Assessing Policy, Loss and Planning Combinations in Reinforcement Learning using a New Modular ArchitectureTiago Gaspar Oliveira, Arlindo L. Oliveira
The model-based reinforcement learning paradigm, which uses planning algorithms and neural network models, has recently achieved unprecedented results in diverse applications, leading to what is now known as deep reinforcement learning. These agents are quite complex and involve multiple components, factors that can create challenges for research. In this work, we propose a new modular software architecture suited for these types of agents, and a set of building blocks that can be easily reused and assembled to construct new model-based reinforcement learning agents. These building blocks include planning algorithms, policies, and loss functions. We illustrate the use of this architecture by combining several of these building blocks to implement and test agents that are optimized to three different test environments: Cartpole, Minigrid, and Tictactoe. One particular planning algorithm, made available in our implementation and not previously used in reinforcement learning, which we called averaged minimax, achieved good results in the three tested environments. Experiments performed with this architecture have shown that the best combination of planning algorithm, policy, and loss function is heavily problem dependent. This result provides evidence that the proposed architecture, which is modular and reusable, is useful for reinforcement learning researchers who want to study new environments and techniques.
CVDec 23, 2021
Assessing the Impact of Attention and Self-Attention Mechanisms on the Classification of Skin LesionsRafael Pedro, Arlindo L. Oliveira
Attention mechanisms have raised significant interest in the research community, since they promise significant improvements in the performance of neural network architectures. However, in any specific problem, we still lack a principled way to choose specific mechanisms and hyper-parameters that lead to guaranteed improvements. More recently, self-attention has been proposed and widely used in transformer-like architectures, leading to significant breakthroughs in some applications. In this work we focus on two forms of attention mechanisms: attention modules and self-attention. Attention modules are used to reweight the features of each layer input tensor. Different modules have different ways to perform this reweighting in fully connected or convolutional layers. The attention models studied are completely modular and in this work they will be used with the popular ResNet architecture. Self-Attention, originally proposed in the area of Natural Language Processing makes it possible to relate all the items in an input sequence. Self-Attention is becoming increasingly popular in Computer Vision, where it is sometimes combined with convolutional layers, although some recent architectures do away entirely with convolutions. In this work, we study and perform an objective comparison of a number of different attention mechanisms in a specific computer vision task, the classification of samples in the widely used Skin Cancer MNIST dataset. The results show that attention modules do sometimes improve the performance of convolutional neural network architectures, but also that this improvement, although noticeable and statistically significant, is not consistent in different settings. The results obtained with self-attention mechanisms, on the other hand, show consistent and significant improvements, leading to the best results even in architectures with a reduced number of parameters.
CVSep 26, 2021
Using Soft Labels to Model Uncertainty in Medical Image SegmentationJoão Lourenço Silva, Arlindo L. Oliveira
Medical image segmentation is inherently uncertain. For a given image, there may be multiple plausible segmentation hypotheses, and physicians will often disagree on lesion and organ boundaries. To be suited to real-world application, automatic segmentation systems must be able to capture this uncertainty and variability. Thus far, this has been addressed by building deep learning models that, through dropout, multiple heads, or variational inference, can produce a set - infinite, in some cases - of plausible segmentation hypotheses for any given image. However, in clinical practice, it may not be practical to browse all hypotheses. Furthermore, recent work shows that segmentation variability plateaus after a certain number of independent annotations, suggesting that a large enough group of physicians may be able to represent the whole space of possible segmentations. Inspired by this, we propose a simple method to obtain soft labels from the annotations of multiple physicians and train models that, for each image, produce a single well-calibrated output that can be thresholded at multiple confidence levels, according to each application's precision-recall requirements. We evaluated our method on the MICCAI 2021 QUBIQ challenge, showing that it performs well across multiple medical image segmentation tasks, produces well-calibrated predictions, and, on average, performs better at matching physicians' predictions than other physicians.
NCJul 1, 2021
Modelling Neuronal Behaviour with Time Series Regression: Recurrent Neural Networks on C. Elegans DataGonçalo Mestre, Ruxandra Barbulescu, Arlindo L. Oliveira et al.
Given the inner complexity of the human nervous system, insight into the dynamics of brain activity can be gained from understanding smaller and simpler organisms, such as the nematode C. Elegans. The behavioural and structural biology of these organisms is well-known, making them prime candidates for benchmarking modelling and simulation techniques. In these complex neuronal collections, classical, white-box modelling techniques based on intrinsic structural or behavioural information are either unable to capture the profound nonlinearities of the neuronal response to different stimuli or generate extremely complex models, which are computationally intractable. In this paper we show how the nervous system of C. Elegans can be modelled and simulated with data-driven models using different neural network architectures. Specifically, we target the use of state of the art recurrent neural networks architectures such as LSTMs and GRUs and compare these architectures in terms of their properties and their accuracy as well as the complexity of the resulting models. We show that GRU models with a hidden layer size of 4 units are able to accurately reproduce with high accuracy the system's response to very different stimuli.
IVJun 21, 2021
Encoder-Decoder Architectures for Clinically Relevant Coronary Artery SegmentationJoão Lourenço Silva, Miguel Nobre Menezes, Tiago Rodrigues et al.
Coronary X-ray angiography is a crucial clinical procedure for the diagnosis and treatment of coronary artery disease, which accounts for roughly 16% of global deaths every year. However, the images acquired in these procedures have low resolution and poor contrast, making lesion detection and assessment challenging. Accurate coronary artery segmentation not only helps mitigate these problems, but also allows the extraction of relevant anatomical features for further analysis by quantitative methods. Although automated segmentation of coronary arteries has been proposed before, previous approaches have used non-optimal segmentation criteria, leading to less useful results. Most methods either segment only the major vessel, discarding important information from the remaining ones, or segment the whole coronary tree based mostly on contrast information, producing a noisy output that includes vessels that are not relevant for diagnosis. We adopt a better-suited clinical criterion and segment vessels according to their clinical relevance. Additionally, we simultaneously perform catheter segmentation, which may be useful for diagnosis due to the scale factor provided by the catheter's known diameter, and is a task that has not yet been performed with good results. To derive the optimal approach, we conducted an extensive comparative study of encoder-decoder architectures trained on a combination of focal loss and a variant of generalized dice loss. Based on the EfficientNet and the UNet++ architectures, we propose a line of efficient and high-performance segmentation models using a new decoder architecture, the EfficientUNet++, whose best-performing version achieved average dice scores of 0.8904 and 0.7526 for the artery and catheter classes, respectively, and an average generalized dice score of 0.9234.
LGMay 6, 2021
Modeling the geospatial evolution of COVID-19 using spatio-temporal convolutional sequence-to-sequence neural networksMário Cardoso, André Cavalheiro, Alexandre Borges et al.
Europe was hit hard by the COVID-19 pandemic and Portugal was one of the most affected countries, having suffered three waves in the first twelve months. Approximately between Jan 19th and Feb 5th 2021 Portugal was the country in the world with the largest incidence rate, with 14-days incidence rates per 100,000 inhabitants in excess of 1000. Despite its importance, accurate prediction of the geospatial evolution of COVID-19 remains a challenge, since existing analytical methods fail to capture the complex dynamics that result from both the contagion within a region and the spreading of the infection from infected neighboring regions. We use a previously developed methodology and official municipality level data from the Portuguese Directorate-General for Health (DGS), relative to the first twelve months of the pandemic, to compute an estimate of the incidence rate in each location of mainland Portugal. The resulting sequence of incidence rate maps was then used as a gold standard to test the effectiveness of different approaches in the prediction of the spatial-temporal evolution of the incidence rate. Four different methods were tested: a simple cell level autoregressive moving average (ARMA) model, a cell level vector autoregressive (VAR) model, a municipality-by-municipality compartmental SIRD model followed by direct block sequential simulation and a convolutional sequence-to-sequence neural network model based on the STConvS2S architecture. We conclude that the convolutional sequence-to-sequence neural network is the best performing method, when predicting the medium-term future incidence rate, using the available information.
IVMar 4, 2021
Automated Detection of Coronary Artery Stenosis in X-ray Angiography using Deep Neural NetworksDinis L. Rodrigues, Miguel Nobre Menezes, Fausto J. Pinto et al.
Coronary artery disease leading up to stenosis, the partial or total blocking of coronary arteries, is a severe condition that affects millions of patients each year. Automated identification and classification of stenosis severity from minimally invasive procedures would be of great clinical value, but existing methods do not match the accuracy of experienced cardiologists, due to the complexity of the task. Although a number of computational approaches for quantitative assessment of stenosis have been proposed to date, the performance of these methods is still far from the required levels for clinical applications. In this paper, we propose a two-step deep-learning framework to partially automate the detection of stenosis from X-ray coronary angiography images. In the two steps, we used two distinct convolutional neural network architectures, one to automatically identify and classify the angle of view, and another to determine the bounding boxes of the regions of interest in frames where stenosis is visible. Transfer learning and data augmentation techniques were used to boost the performance of the system in both tasks. We achieved a 0.97 accuracy on the task of classifying the Left/Right Coronary Artery (LCA/RCA) angle view and 0.68/0.73 recall on the determination of the regions of interest, for LCA and RCA, respectively. These results compare favorably with previous results obtained using related approaches, and open the way to a fully automated method for the identification of stenosis severity from X-ray angiographies.
CVJul 19, 2018
Conditional Random Fields as Recurrent Neural Networks for 3D Medical Imaging SegmentationMiguel Monteiro, Mário A. T. Figueiredo, Arlindo L. Oliveira
The Conditional Random Field as a Recurrent Neural Network layer is a recently proposed algorithm meant to be placed on top of an existing Fully-Convolutional Neural Network to improve the quality of semantic segmentation. In this paper, we test whether this algorithm, which was shown to improve semantic segmentation for 2D RGB images, is able to improve segmentation quality for 3D multi-modal medical images. We developed an implementation of the algorithm which works for any number of spatial dimensions, input/output image channels, and reference image channels. As far as we know this is the first publicly available implementation of this sort. We tested the algorithm with two distinct 3D medical imaging datasets, we concluded that the performance differences observed were not statistically significant. Finally, in the discussion section of the paper, we go into the reasons as to why this technique transfers poorly from natural images to medical images.