Knowledge-Guided Data-Centric AI in Healthcare: Progress, Shortcomings, and Future Directions
This work tackles data scarcity in healthcare AI, offering incremental improvements through knowledge integration for better model training in medical domains.
The paper addresses the challenge of limited high-quality annotated data in medical image-based diagnosis by advocating for a data-centric AI approach, proposing knowledge-guided generative methods like GANs to incorporate domain knowledge for improved data generation.
The success of deep learning is largely due to the availability of large amounts of training data that cover a wide range of examples of a particular concept or meaning. In medicine, a diverse set of training data on a particular disease can lead to a model that predicts the disease accurately. Despite this potential, progress in image-based diagnosis has been limited by a lack of high-quality annotated data. This article highlights the importance of a data-centric approach to improving the quality of data representations, particularly when the available data is limited. To address this "small-data" issue, we discuss four methods for generating and aggregating training data: data augmentation, transfer learning, federated learning, and generative adversarial networks (GANs). We also propose knowledge-guided GANs, which incorporate domain knowledge into the training-data generation process. Given recent progress in large pre-trained language models, we believe it is possible to acquire high-quality knowledge that can improve the effectiveness of knowledge-guided generative methods.
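As an illustration of the simplest of the four methods, data augmentation, the following is a minimal sketch and not the authors' implementation. It assumes a grayscale image stored as a nested list of pixel intensities; the names `Image`, `flip_horizontal`, `rotate_90_clockwise`, and `augment` are hypothetical. It also assumes that flips and rotations preserve the diagnostic label, which may not hold when laterality or orientation is clinically meaningful.

```python
from typing import List

# Hypothetical representation: a grayscale image as rows of pixel intensities.
Image = List[List[int]]


def flip_horizontal(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rotate_90_clockwise(img: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    # Reversing the row order and then transposing yields a clockwise rotation.
    return [list(col) for col in zip(*img[::-1])]


def augment(img: Image) -> List[Image]:
    """Return the original image plus three simple geometric variants.

    Assumption: these transforms do not change the label of the image.
    """
    rotated = rotate_90_clockwise(img)
    return [img, flip_horizontal(img), rotated, flip_horizontal(rotated)]


if __name__ == "__main__":
    sample = [[1, 2], [3, 4]]
    for variant in augment(sample):
        print(variant)
```

Each call to `augment` turns one annotated image into four training examples that share the same label, which is the sense in which augmentation stretches a small dataset without new annotation effort.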