LGMar 31, 2025Code
Accelerating High-Efficiency Organic Photovoltaic Discovery via Pretrained Graph Neural Networks and Generative Reinforcement LearningJiangjie Qiu, Hou Hei Lam, Xiuyuan Hu et al. · tsinghua
Organic photovoltaic (OPV) materials offer a promising avenue toward cost-effective solar energy utilization. However, optimizing donor-acceptor (D-A) combinations to achieve high power conversion efficiency (PCE) remains a significant challenge. In this work, we propose a framework that integrates large-scale pretraining of graph neural networks (GNNs) with a GPT-2 (Generative Pretrained Transformer 2)-based reinforcement learning (RL) strategy to design OPV molecules with potentially high PCE. This approach produces candidate molecules with predicted efficiencies approaching 21\%, although further experimental validation is required. Moreover, we conducted a preliminary fragment-level analysis to identify structural motifs recognized by the RL model that may contribute to enhanced PCE, thus providing design guidelines for the broader research community. To facilitate continued discovery, we are building the largest open-source OPV dataset to date, expected to include nearly 3,000 donor-acceptor pairs. Finally, we discuss plans to collaborate with experimental teams on synthesizing and characterizing AI-designed molecules, which will provide new data to refine and improve our predictive and generative models.
MTRL-SCINov 23, 2025
CycleChemist: A Dual-Pronged Machine Learning Framework for Organic Photovoltaic DiscoveryHou Hei Lam, Jiangjie Qiu, Xiuyuan Hu et al.
Organic photovoltaic (OPV) materials offer a promising path toward sustainable energy generation, but their development is limited by the difficulty of identifying high performance donor and acceptor pairs with strong power conversion efficiencies (PCEs). Existing design strategies typically focus on either the donor or the acceptor alone, rather than using a unified approach capable of modeling both components. In this work, we introduce a dual machine learning framework for OPV discovery that combines predictive modeling with generative molecular design. We present the Organic Photovoltaic Donor Acceptor Dataset (OPV2D), the largest curated dataset of its kind, containing 2000 experimentally characterized donor acceptor pairs. Using this dataset, we develop the Organic Photovoltaic Classifier (OPVC) to predict whether a material exhibits OPV behavior, and a hierarchical graph neural network that incorporates multi task learning and donor acceptor interaction modeling. This framework includes the Molecular Orbital Energy Estimator (MOE2) for predicting HOMO and LUMO energy levels, and the Photovoltaic Performance Predictor (P3) for estimating PCE. In addition, we introduce the Material Generative Pretrained Transformer (MatGPT) to produce synthetically accessible organic semiconductors, guided by a reinforcement learning strategy with three objective policy optimization. By linking molecular representation learning with performance prediction, our framework advances data driven discovery of high performance OPV materials.
LGAug 2, 2025
RelMap: Reliable Spatiotemporal Sensor Data Visualization via Imputative Spatial InterpolationJuntong Chen, Huayuan Ye, He Zhu et al.
Accurate and reliable visualization of spatiotemporal sensor data such as environmental parameters and meteorological conditions is crucial for informed decision-making. Traditional spatial interpolation methods, however, often fall short of producing reliable interpolation results due to the limited and irregular sensor coverage. This paper introduces a novel spatial interpolation pipeline that achieves reliable interpolation results and produces a novel heatmap representation with uncertainty information encoded. We leverage imputation reference data from Graph Neural Networks (GNNs) to enhance visualization reliability and temporal resolution. By integrating Principal Neighborhood Aggregation (PNA) and Geographical Positional Encoding (GPE), our model effectively learns the spatiotemporal dependencies. Furthermore, we propose an extrinsic, static visualization technique for interpolation-based heatmaps that effectively communicates the uncertainties arising from various sources in the interpolated map. Through a set of use cases, extensive evaluations on real-world datasets, and user studies, we demonstrate our model's superior performance for data imputation, the improvements to the interpolant with reference data, and the effectiveness of our visualization design in communicating uncertainties.
HCMar 1, 2021
Deep Colormap Extraction from VisualizationsLin-Ping Yuan, Wei Zeng, Siwei Fu et al.
This work presents a new approach based on deep learning to automatically extract colormaps from visualizations. After summarizing colors in an input visualization image as a Lab color histogram, we pass the histogram to a pre-trained deep neural network, which learns to predict the colormap that produces the visualization. To train the network, we create a new dataset of 64K visualizations that cover a wide variety of data distributions, chart types, and colormaps. The network adopts an atrous spatial pyramid pooling module to capture color features at multiple scales in the input color histograms. We then classify the predicted colormap as discrete or continuous and refine the predicted colormap based on its color histogram. Quantitative comparisons to existing methods show the superior performance of our approach on both synthetic and real-world visualizations. We further demonstrate the utility of our method with two use cases,i.e., color transfer and color remapping.
GRAug 3, 2020
Exemplar-based Layout Fine-tuning for Node-link DiagramsJiacheng Pan, Wei Chen, Xiaodong Zhao et al.
We design and evaluate a novel layout fine-tuning technique for node-link diagrams that facilitates exemplar-based adjustment of a group of substructures in batching mode. The key idea is to transfer user modifications on a local substructure to other substructures in the whole graph that are topologically similar to the exemplar. We first precompute a canonical representation for each substructure with node embedding techniques and then use it for on-the-fly substructure retrieval. We design and develop a light-weight interactive system to enable intuitive adjustment, modification transfer, and visual graph exploration. We also report some results of quantitative comparisons, three case studies, and a within-participant user study.
CVJul 9, 2020
VisImages: A Fine-Grained Expert-Annotated Visualization DatasetDazhen Deng, Yihong Wu, Xinhuan Shu et al.
Images in visualization publications contain rich information, e.g., novel visualization designs and implicit design patterns of visualizations. A systematic collection of these images can contribute to the community in many aspects, such as literature analysis and automated tasks for visualization. In this paper, we build and make public a dataset, VisImages, which collects 12,267 images with captions from 1,397 papers in IEEE InfoVis and VAST. Built upon a comprehensive visualization taxonomy, the dataset includes 35,096 visualizations and their bounding boxes in the images.We demonstrate the usefulness of VisImages through three use cases: 1) investigating the use of visualizations in the publications with VisImages Explorer, 2) training and benchmarking models for visualization classification, and 3) localizing visualizations in the visual analytics systems automatically.
HCJun 23, 2020
ICE: Identify and Compare Event Sequence Sets through Multi-Scale Matrix and Unit VisualizationsSiwei Fu, Jian Zhao, Linping Yuan et al.
Comparative analysis of event sequence data is essential in many application domains, such as website design and medical care. However, analysts often face two challenges: they may not always know which sets of event sequences in the data are useful to compare, and the comparison needs to be achieved at different granularity, due to the volume and complexity of the data. This paper presents, ICE, an interactive visualization that allows analysts to explore an event sequence dataset, and identify promising sets of event sequences to compare at both the pattern and sequence levels. More specifically, ICE incorporates a multi-level matrix-based visualization for browsing the entire dataset based on the prefixes and suffixes of sequences. To support comparison at multiple levels, ICE employs the unit visualization technique, and we further explore the design space of unit visualizations for event sequence comparison tasks. Finally, we demonstrate the effectiveness of ICE with three real-world datasets from different domains.
CLMay 7, 2020
Quda: Natural Language Queries for Visual Data AnalyticsSiwei Fu, Kai Xiong, Xiaodong Ge et al.
The identification of analytic tasks from free text is critical for visualization-oriented natural language interfaces (V-NLIs) to suggest effective visualizations. However, it is challenging due to the ambiguity and complexity nature of human language. To address this challenge, we present a new dataset, called Quda, that aims to help V-NLIs recognize analytic tasks from free-form natural language by training and evaluating cutting-edge multi-label classification models. Our dataset contains $14,035$ diverse user queries, and each is annotated with one or multiple analytic tasks. We achieve this goal by first gathering seed queries with data analysts and then employing extensive crowd force for paraphrase generation and validation. We demonstrate the usefulness of Quda through three applications. This work is the first attempt to construct a large-scale corpus for recognizing analytic tasks. With the release of Quda, we hope it will boost the research and development of V-NLIs in data analysis and visualization.
HCAug 7, 2019
Task-Oriented Optimal Sequencing of Visualization ChartsDanqing Shi, Yang Shi, Xinyue Xu et al.
A chart sequence is used to describe a series of visualization charts generated in the exploratory analysis by data analysts. It provides information details in each chart as well as a logical relationship among charts. While existing research targets on generating chart sequences that match human's perceptions, little attention has been paid to formulate task-oriented connections between charts in a chart design space. We present a novel chart sequencing method based on reinforcement learning to capture the connections between charts in the context of three major analysis tasks, including correlation analysis, anomaly detection, and cluster analysis. The proposed method formulates a chart sequencing procedure as an optimization problem, which seeks an optimal policy to sequencing charts for the specific analysis task. In our method, a novel reward function is introduced, which takes both the analysis task and the factor of human cognition into consideration. We conducted one case study and two user studies to evaluate the effectiveness of our method under the application scenarios of visualization demonstration, sequencing charts for reasoning analysis results, and making a chart design choice. The study results showed the power of our method.
HCJul 12, 2019
Narvis: Authoring Narrative Slideshows for Introducing Data Visualization DesignsQianwen Wang, Zhen Li, Siwei Fu et al.
Visual designs can be complex in modern data visualization systems, which poses special challenges for explaining them to the non-experts. However, few if any presentation tools are tailored for this purpose. In this study, we present Narvis, a slideshow authoring tool designed for introducing data visualizations to non-experts. Narvis targets two types of end-users: teachers, experts in data visualization who produce tutorials for explaining a data visualization, and students, non-experts who try to understand visualization designs through tutorials. We present an analysis of requirements through close discussions with the two types of end-users. The resulting considerations guide the design and implementation of Narvis. Additionally, to help teachers better organize their introduction slideshows, we specify a data visualization as a hierarchical combination of components, which are automatically detected and extracted by Narvis. The teachers craft an introduction slideshow through first organizing these components, and then explaining them sequentially. A series of templates are provided for adding annotations and animations to improve efficiency during the authoring process. We evaluate Narvis through a qualitative analysis of the authoring experience, and a preliminary evaluation of the generated slideshows.
AIOct 21, 2018
Challenge AI Mind: A Crowd System for Proactive AI TestingSiwei Fu, Anbang Xu, Xiaotong Liu et al.
Artificial Intelligence (AI) has burrowed into our lives in various aspects; however, without appropriate testing, deployed AI systems are often being criticized to fail in critical and embarrassing cases. Existing testing approaches mainly depend on fixed and pre-defined datasets, providing a limited testing coverage. In this paper, we propose the concept of proactive testing to dynamically generate testing data and evaluate the performance of AI systems. We further introduce Challenge.AI, a new crowd system that features the integration of crowdsourcing and machine learning techniques in the process of error generation, error validation, error categorization, and error analysis. We present experiences and insights into a participatory design with AI developers. The evaluation shows that the crowd workflow is more effective with the help of machine learning techniques. AI developers found that our system can help them discover unknown errors made by the AI models, and engage in the process of proactive testing.