DB Jun 1
Less Is More? When Dataset Context Hurts LLM-Generated Dataset Descriptions
Lisa-Yao Gan, Arunav Das, Johanna Walker et al.
Dataset search and reuse are strongly constrained by the quality of metadata such as natural language descriptions, which are often sparse or inconsistent. Although large language models (LLMs) can generate such descriptions automatically, little empirical guidance exists on what makes a good dataset description and what dataset context LLMs actually need. We study these questions through a literature-grounded framework of dataset description quality and a large-scale ablation study using 252 datasets (1,336 CSV files) from the European data portal data.europa.eu. We generate descriptions with LLMs in three scenarios, a baseline and two ablations: (1) using only dataset titles, (2) titles and schema, and (3) titles, schema and representative data. We evaluate the descriptions with an LLM-as-a-judge framework and a semantic descriptive attribute analysis grounded in our quality dimensions. Our results reveal a consistent schema penalty: table schemas alone often degrade narrative quality, while representative data partially restores grounding without improving overall human-facing quality. We further show that different LLMs exhibit stable descriptive personas. These findings provide practical guidance for LLM-supported data publishing workflows.
AI Apr 4, 2025
Towards deployment-centric multimodal AI beyond vision and language
Xianyuan Liu, Jiayang Zhang, Shuo Zhou et al.
Multimodal artificial intelligence (AI) integrates diverse types of data via machine learning to improve understanding, prediction, and decision-making across disciplines such as healthcare, science, and engineering. However, most multimodal AI advances focus on models for vision and language data, while their deployability remains a key challenge. We advocate a deployment-centric workflow that incorporates deployment constraints early to reduce the likelihood of undeployable solutions, complementing data-centric and model-centric approaches. We also emphasise deeper integration across multiple levels of multimodality and multidisciplinary collaboration to significantly broaden the research scope beyond vision and language. To facilitate this approach, we identify common multimodal-AI-specific challenges shared across disciplines and examine three real-world use cases: pandemic response, self-driving car design, and climate change adaptation, drawing expertise from healthcare, social science, engineering, science, sustainability, and finance. By fostering multidisciplinary dialogue and open research practices, our community can accelerate deployment-centric development for broad societal impact.
LG Jan 16, 2021
Visual Analytics approach for finding spatiotemporal patterns from COVID19
Arunav Das
Bounce Back Loan is one of a number of UK business financial support schemes launched by the UK Government in 2020 during the pandemic lockdown. Through these schemes, struggling businesses receive financial support to weather the economic slowdown caused by the lockdown. Loans worth £43.5bn had been provided as of 17 Dec 2020. However, with no major checks on granting these loans and the looming prospect of loan losses from write-offs due to failed businesses and fraud, this paper theorizes about applying spatiotemporal modelling techniques to explore whether geospatial patterns and temporal analysis could aid the design of loan grant criteria for such schemes. Applying a Clustering and Visual Analytics framework to business demographics, survival rates and sector concentration shows Inner and Outer London spatial patterns in historic business failures and a reversal of those patterns under COVID-19, implying sector influence on spatial clusters. A combination of unsupervised clustering with multinomial logistic regression modelling on the research datasets, complemented by additional datasets on other support schemes, business structure and financial crime, is recommended for modelling business vulnerability to certain types of financial market or economic conditions. The limitations of clustering techniques for high-dimensional data are discussed, along with the relevance of an applicable model for continuing the research in next steps.