Clyde Lee Giles

CL
h-index18
5papers
222citations
Novelty45%
AI Score33

5 Papers

CLFeb 23, 2023Code
Summaries as Captions: Generating Figure Captions for Scientific Documents with Automated Text Summarization

Chieh-Yang Huang, Ting-Yao Hsu, Ryan Rossi et al.

Good figure captions help paper readers understand complex scientific figures. Unfortunately, even published papers often have poorly written captions. Automatic caption generation could aid paper writers by providing good starting captions that can be refined for better quality. Prior work often treated figure caption generation as a vision-to-language task. In this paper, we show that it can be more effectively tackled as a text summarization task in scientific documents. We fine-tuned PEGASUS, a pre-trained abstractive summarization model, to specifically summarize figure-referencing paragraphs (e.g., "Figure 3 shows...") into figure captions. Experiments on large-scale arXiv figures show that our method outperforms prior vision methods in both automatic and human evaluations. We further conducted an in-depth investigation focused on two key challenges: (i) the common presence of low-quality author-written captions and (ii) the lack of clear standards for good captions. Our code and data are available at: https://github.com/Crowd-AI-Lab/Generating-Figure-Captions-as-a-Text-Summarization-Task.

CLJan 5, 2025Code
Multi-LLM Collaborative Caption Generation in Scientific Documents

Jaeyoung Kim, Jongho Lee, Hong-Jun Choi et al.

Scientific figure captioning is a complex task that requires generating contextually appropriate descriptions of visual content. However, existing methods often fall short by utilizing incomplete information, treating the task solely as either an image-to-text or text summarization problem. This limitation hinders the generation of high-quality captions that fully capture the necessary details. Moreover, existing data sourced from arXiv papers contain low-quality captions, posing significant challenges for training large language models (LLMs). In this paper, we introduce a framework called Multi-LLM Collaborative Figure Caption Generation (MLBCAP) to address these challenges by leveraging specialized LLMs for distinct sub-tasks. Our approach unfolds in three key modules: (Quality Assessment) We utilize multimodal LLMs to assess the quality of training data, enabling the filtration of low-quality captions. (Diverse Caption Generation) We then employ a strategy of fine-tuning/prompting multiple LLMs on the captioning task to generate candidate captions. (Judgment) Lastly, we prompt a prominent LLM to select the highest quality caption from the candidates, followed by refining any remaining inaccuracies. Human evaluations demonstrate that informative captions produced by our approach rank better than human-written captions, highlighting its effectiveness. Our code is available at https://github.com/teamreboott/MLBCAP

CLAug 25, 2020
Extractive Summarizer for Scholarly Articles

Athar Sefid, Clyde Lee Giles, Prasenjit Mitra

We introduce an extractive method that will summarize long scientific papers. Our model uses presentation slides provided by the authors of the papers as the gold summary standard to label the sentences. The sentences are ranked based on their novelty and their importance as estimated by deep neural networks. Our window-based extractive labeling of sentences results in the improvement of at least 4 ROUGE1-Recall points.

CLApr 4, 2020
Recognizing Long Grammatical Sequences Using Recurrent Networks Augmented With An External Differentiable Stack

Ankur Mali, Alexander Ororbia, Daniel Kifer et al.

Recurrent neural networks (RNNs) are a widely used deep architecture for sequence modeling, generation, and prediction. Despite success in applications such as machine translation and voice recognition, these stateful models have several critical shortcomings. Specifically, RNNs generalize poorly over very long sequences, which limits their applicability to many important temporal processing and time series forecasting problems. For example, RNNs struggle in recognizing complex context free languages (CFLs), never reaching 100% accuracy on training. One way to address these shortcomings is to couple an RNN with an external, differentiable memory structure, such as a stack. However, differentiable memories in prior work have neither been extensively studied on CFLs nor tested on sequences longer than those seen in training. The few efforts that have studied them have shown that continuous differentiable memory structures yield poor generalization for complex CFLs, making the RNN less interpretable. In this paper, we improve the memory-augmented RNN with important architectural and state updating mechanisms that ensure that the model learns to properly balance the use of its latent states with external memory. Our improved RNN models exhibit better generalization performance and are able to classify long strings generated by complex hierarchical context free grammars (CFGs). We evaluate our models on CGGs, including the Dyck languages, as well as on the Penn Treebank language modelling task, and achieve stable, robust performance across these benchmarks. Furthermore, we show that only our memory-augmented networks are capable of retaining memory for a longer duration up to strings of length 160.

CVNov 20, 2019
Sibling Neural Estimators: Improving Iterative Image Decoding with Gradient Communication

Ankur Mali, Alexander G. Ororbia, Clyde Lee Giles

For lossy image compression, we develop a neural-based system which learns a nonlinear estimator for decoding from quantized representations. The system links two recurrent networks that \help" each other reconstruct same target image patches using complementary portions of spatial context that communicate via gradient signals. This dual agent system builds upon prior work that proposed the iterative refinement algorithm for recurrent neural network (RNN)based decoding which improved image reconstruction compared to standard decoding techniques. Our approach, which works with any encoder, neural or non-neural, This system progressively reduces image patch reconstruction error over a fixed number of steps. Experiment with variants of RNN memory cells, with and without future information, find that our model consistently creates lower distortion images of higher perceptual quality compared to other approaches. Specifically, on the Kodak Lossless True Color Image Suite, we observe as much as a 1:64 decibel (dB) gain over JPEG, a 1:46 dB gain over JPEG 2000, a 1:34 dB gain over the GOOG neural baseline, 0:36 over E2E (a modern competitive neural compression model), and 0:37 over a single iterative neural decoder.