CV LGFeb 24, 2025

An Enhanced Large Language Model For Cross Modal Query Understanding System Using DL-KeyBERT Based CAZSSCL-MPGPT

arXiv:2502.17000v1h-index: 1

Originality Incremental advance

AI Analysis

This addresses accuracy and generalization issues in cross-modal systems for applications like image captioning, though it appears incremental with a focus on specific datasets.

The paper tackles the echo chamber effect in cross-modal query understanding systems by proposing an enhanced LLM-based framework using DL-KeyBERT-based CAZSSCL-MPGPT, achieving accuracies of 99.14% on COCO 2017 and 98.43% on vqav2-val datasets.

Large Language Models (LLMs) are advanced deep-learning models designed to understand and generate human language. They work together with models that process data like images, enabling cross-modal understanding. However, existing approaches often suffer from the echo chamber effect, where redundant visual patterns reduce model generalization and accuracy. Thus, the proposed system considered this limitation and developed an enhanced LLM-based framework for cross-modal query understanding using DL-KeyBERT-based CAZSSCL-MPGPT. The collected dataset consists of pre-processed images and texts. The preprocessed images then undergo object segmentation using Easom-You Only Look Once (E-YOLO). The object skeleton is generated, along with the knowledge graph using a Conditional Random Knowledge Graph (CRKG) technique. Further, features are extracted from the knowledge graph, generated skeletons, and segmented objects. The optimal features are then selected using the Fossa Optimization Algorithm (FOA). Meanwhile, the text undergoes word embedding using DL-KeyBERT. Finally, the cross-modal query understanding system utilizes CAZSSCL-MPGPT to generate accurate and contextually relevant image descriptions as text. The proposed CAZSSCL-MPGPT achieved an accuracy of 99.14187362% in the COCO dataset 2017 and 98.43224393% in the vqav2-val dataset.

View on arXiv PDF

Similar