CVMar 13, 2023Code
Reference-Guided Large-Scale Face Inpainting with Identity and Texture ControlWuyang Luo, Su Yang, Weishan Zhang
Face inpainting aims at plausibly predicting missing pixels of face images within a corrupted region. Most existing methods rely on generative models learning a face image distribution from a big dataset, which produces uncontrollable results, especially with large-scale missing regions. To introduce strong control for face inpainting, we propose a novel reference-guided face inpainting method that fills the large-scale missing region with identity and texture control guided by a reference face image. However, generating high-quality results under imposing two control signals is challenging. To tackle such difficulty, we propose a dual control one-stage framework that decouples the reference image into two levels for flexible control: High-level identity information and low-level texture information, where the identity information figures out the shape of the face and the texture information depicts the component-aware texture. To synthesize high-quality results, we design two novel modules referred to as Half-AdaIN and Component-Wise Style Injector (CWSI) to inject the two kinds of control information into the inpainting processing. Our method produces realistic results with identity and texture control faithful to reference images. To the best of our knowledge, it is the first work to concurrently apply identity and component-level controls in face inpainting to promise more precise and controllable results. Code is available at https://github.com/WuyangLuo/RefFaceInpainting
CVJul 13, 2022
Context-Consistent Semantic Image Editing with Style-Preserved ModulationWuyang Luo, Su Yang, Hong Wang et al.
Semantic image editing utilizes local semantic label maps to generate the desired content in the edited region. A recent work borrows SPADE block to achieve semantic image editing. However, it cannot produce pleasing results due to style discrepancy between the edited region and surrounding pixels. We attribute this to the fact that SPADE only uses an image-independent local semantic layout but ignores the image-specific styles included in the known pixels. To address this issue, we propose a style-preserved modulation (SPM) comprising two modulations processes: The first modulation incorporates the contextual style and semantic layout, and then generates two fused modulation parameters. The second modulation employs the fused parameters to modulate feature maps. By using such two modulations, SPM can inject the given semantic layout while preserving the image-specific context style. Moreover, we design a progressive architecture for generating the edited content in a coarse-to-fine manner. The proposed method can obtain context-consistent results and significantly alleviate the unpleasant boundary between the generated regions and the known pixels.
LGJul 25, 2023
FedDRL: A Trustworthy Federated Learning Model Fusion Method Based on Staged Reinforcement LearningLeiming Chen, Weishan Zhang, Cihao Dong et al.
Traditional federated learning uses the number of samples to calculate the weights of each client model and uses this fixed weight value to fusion the global model. However, in practical scenarios, each client's device and data heterogeneity leads to differences in the quality of each client's model. Thus the contribution to the global model is not wholly determined by the sample size. In addition, if clients intentionally upload low-quality or malicious models, using these models for aggregation will lead to a severe decrease in global model accuracy. Traditional federated learning algorithms do not address these issues. To solve this probelm, we propose FedDRL, a model fusion approach using reinforcement learning based on a two staged approach. In the first stage, Our method could filter out malicious models and selects trusted client models to participate in the model fusion. In the second stage, the FedDRL algorithm adaptively adjusts the weights of the trusted client models and aggregates the optimal global model. We also define five model fusion scenarios and compare our method with two baseline algorithms in those scenarios. The experimental results show that our algorithm has higher reliability than other algorithms while maintaining accuracy.
CVMar 23, 2023
SIEDOB: Semantic Image Editing by Disentangling Object and BackgroundWuyang Luo, Su Yang, Xinjian Zhang et al.
Semantic image editing provides users with a flexible tool to modify a given image guided by a corresponding segmentation map. In this task, the features of the foreground objects and the backgrounds are quite different. However, all previous methods handle backgrounds and objects as a whole using a monolithic model. Consequently, they remain limited in processing content-rich images and suffer from generating unrealistic objects and texture-inconsistent backgrounds. To address this issue, we propose a novel paradigm, \textbf{S}emantic \textbf{I}mage \textbf{E}diting by \textbf{D}isentangling \textbf{O}bject and \textbf{B}ackground (\textbf{SIEDOB}), the core idea of which is to explicitly leverages several heterogeneous subnetworks for objects and backgrounds. First, SIEDOB disassembles the edited input into background regions and instance-level objects. Then, we feed them into the dedicated generators. Finally, all synthesized parts are embedded in their original locations and utilize a fusion network to obtain a harmonized result. Moreover, to produce high-quality edited images, we propose some innovative designs, including Semantic-Aware Self-Propagation Module, Boundary-Anchored Patch Discriminator, and Style-Diversity Object Generator, and integrate them into SIEDOB. We conduct extensive experiments on Cityscapes and ADE20K-Room datasets and exhibit that our method remarkably outperforms the baselines, especially in synthesizing realistic and diverse objects and texture-consistent backgrounds.
LGAug 22, 2023
Federated Learning in Big Model Era: Domain-Specific Multimodal Large ModelsZengxiang Li, Zhaoxiang Hou, Hui Liu et al.
Multimodal data, which can comprehensively perceive and recognize the physical world, has become an essential path towards general artificial intelligence. However, multimodal large models trained on public datasets often underperform in specific industrial domains. This paper proposes a multimodal federated learning framework that enables multiple enterprises to utilize private domain data to collaboratively train large models for vertical domains, achieving intelligent services across scenarios. The authors discuss in-depth the strategic transformation of federated learning in terms of intelligence foundation and objectives in the era of big model, as well as the new challenges faced in heterogeneous data, model aggregation, performance and cost trade-off, data privacy, and incentive mechanism. The paper elaborates a case study of leading enterprises contributing multimodal data and expert knowledge to city safety operation management , including distributed deployment and efficient coordination of the federated learning platform, technical innovations on data quality improvement based on large model capabilities and efficient joint fine-tuning approaches. Preliminary experiments show that enterprises can enhance and accumulate intelligent capabilities through multimodal model federated learning, thereby jointly creating an smart city model that provides high-quality intelligent services covering energy infrastructure safety, residential community security, and urban operation management. The established federated learning cooperation ecosystem is expected to further aggregate industry, academia, and research resources, realize large models in multiple vertical domains, and promote the large-scale industrial application of artificial intelligence and cutting-edge research on multimodal federated learning.
AIJan 27
CASTER: Breaking the Cost-Performance Barrier in Multi-Agent Orchestration via Context-Aware Strategy for Task Efficient RoutingShanyv Liu, Xuyang Yuan, Tao Chen et al.
Graph-based Multi-Agent Systems (MAS) enable complex cyclic workflows but suffer from inefficient static model allocation, where deploying strong models uniformly wastes computation on trivial sub-tasks. We propose CASTER (Context-Aware Strategy for Task Efficient Routing), a lightweight router for dynamic model selection in graph-based MAS. CASTER employs a Dual-Signal Router that combines semantic embeddings with structural meta-features to estimate task difficulty. During training, the router self-optimizes through a Cold Start to Iterative Evolution paradigm, learning from its own routing failures via on-policy negative feedback. Experiments using LLM-as-a-Judge evaluation across Software Engineering, Data Analysis, Scientific Discovery, and Cybersecurity demonstrate that CASTER reduces inference cost by up to 72.4% compared to strong-model baselines while matching their success rates, and consistently outperforms both heuristic routing and FrugalGPT across all domains.
CVMay 24, 2023
Predicting Token Impact Towards Efficient Vision TransformerHong Wang, Su Yang, Xiaoke Huang et al.
Token filtering to reduce irrelevant tokens prior to self-attention is a straightforward way to enable efficient vision Transformer. This is the first work to view token filtering from a feature selection perspective, where we weigh the importance of a token according to how much it can change the loss once masked. If the loss changes greatly after masking a token of interest, it means that such a token has a significant impact on the final decision and is thus relevant. Otherwise, the token is less important for the final decision, so it can be filtered out. After applying the token filtering module generalized from the whole training data, the token number fed to the self-attention module can be obviously reduced in the inference phase, leading to much fewer computations in all the subsequent self-attention layers. The token filter can be realized using a very simple network, where we utilize multi-layer perceptron. Except for the uniqueness of performing token filtering only once from the very beginning prior to self-attention, the other core feature making our method different from the other token filters lies in the predictability of token impact from a feature selection point of view. The experiments show that the proposed method provides an efficient way to approach a light weighted model after optimized with a backbone by means of fine tune, which is easy to be deployed in comparison with the existing methods based on training from scratch.
NIFeb 3, 2022
Dynamic Virtual Network Embedding Algorithm based on Graph Convolution Neural Network and Reinforcement LearningPeiying Zhang, Chao Wang, Neeraj Kumar et al.
Network virtualization (NV) is a technology with broad application prospects. Virtual network embedding (VNE) is the core orientation of VN, which aims to provide more flexible underlying physical resource allocation for user function requests. The classical VNE problem is usually solved by heuristic method, but this method often limits the flexibility of the algorithm and ignores the time limit. In addition, the partition autonomy of physical domain and the dynamic characteristics of virtual network request (VNR) also increase the difficulty of VNE. This paper proposed a new type of VNE algorithm, which applied reinforcement learning (RL) and graph neural network (GNN) theory to the algorithm, especially the combination of graph convolutional neural network (GCNN) and RL algorithm. Based on a self-defined fitness matrix and fitness value, we set up the objective function of the algorithm implementation, realized an efficient dynamic VNE algorithm, and effectively reduced the degree of resource fragmentation. Finally, we used comparison algorithms to evaluate the proposed method. Simulation experiments verified that the dynamic VNE algorithm based on RL and GCNN has good basic VNE characteristics. By changing the resource attributes of physical network and virtual network, it can be proved that the algorithm has good flexibility.
CVJan 26, 2022
TransPPG: Two-stream Transformer for Remote Heart Rate EstimateJiaqi Kang, Su Yang, Weishan Zhang
Non-contact facial video-based heart rate estimation using remote photoplethysmography (rPPG) has shown great potential in many applications (e.g., remote health care) and achieved creditable results in constrained scenarios. However, practical applications require results to be accurate even under complex environment with head movement and unstable illumination. Therefore, improving the performance of rPPG in complex environment has become a key challenge. In this paper, we propose a novel video embedding method that embeds each facial video sequence into a feature map referred to as Multi-scale Adaptive Spatial and Temporal Map with Overlap (MAST_Mop), which contains not only vital information but also surrounding information as reference, which acts as the mirror to figure out the homogeneous perturbations imposed on foreground and background simultaneously, such as illumination instability. Correspondingly, we propose a two-stream Transformer model to map the MAST_Mop into heart rate (HR), where one stream follows the pulse signal in the facial area while the other figures out the perturbation signal from the surrounding region such that the difference of the two channels leads to adaptive noise cancellation. Our approach significantly outperforms all current state-of-the-art methods on two public datasets MAHNOB-HCI and VIPL-HR. As far as we know, it is the first work with Transformer as backbone to capture the temporal dependencies in rPPGs and apply the two stream scheme to figure out the interference from backgrounds as mirror of the corresponding perturbation on foreground signals for noise tolerating.
LGDec 29, 2021
Feature-context driven Federated Meta-Learning for Rare Disease PredictionBingyang Chen, Tao Chen, Xingjie Zeng et al.
Millions of patients suffer from rare diseases around the world. However, the samples of rare diseases are much smaller than those of common diseases. In addition, due to the sensitivity of medical data, hospitals are usually reluctant to share patient information for data fusion citing privacy concerns. These challenges make it difficult for traditional AI models to extract rare disease features for the purpose of disease prediction. In this paper, we overcome this limitation by proposing a novel approach for rare disease prediction based on federated meta-learning. To improve the prediction accuracy of rare diseases, we design an attention-based meta-learning (ATML) approach which dynamically adjusts the attention to different tasks according to the measured training effect of base learners. Additionally, a dynamic-weight based fusion strategy is proposed to further improve the accuracy of federated learning, which dynamically selects clients based on the accuracy of each local model. Experiments show that with as few as five shots, our approach out-performs the original federated meta-learning algorithm in accuracy and speed. Compared with each hospital's local model, the proposed model's average prediction accuracy increased by 13.28%.
LGDec 29, 2021
Federated Learning for Cross-block Oil-water Layer IdentificationBingyang Chena, Xingjie Zenga, Weishan Zhang
Cross-block oil-water layer(OWL) identification is essential for petroleum development. Traditional methods are greatly affected by subjective factors due to depending mainly on the human experience. AI-based methods have promoted the development of OWL identification. However, because of the significant geological differences across blocks and the severe long-tailed distribution(class imbalanced), the identification effects of existing artificial intelligence(AI) models are limited. In this paper, we address this limitation by proposing a dynamic fusion-based federated learning(FL) for OWL identification. To overcome geological differences, we propose a dynamic weighted strategy to fuse models and train a general OWL identification model. In addition, an F1 score-based re-weighting scheme is designed and a novel loss function is derived theoretically to solve the data long-tailed problem. Further, a geological knowledge-based mask-attention mechanism is proposed to enhance model feature extraction. To our best knowledge, this is the first work to identify OWL using FL. We evaluate the proposed approach with an actual well logging dataset from the oil field and a public 3W dataset. Experimental results demonstrate that our approach significantly out-performs other AI methods.
LGNov 28, 2021
Fed2: Feature-Aligned Federated LearningFuxun Yu, Weishan Zhang, Zhuwei Qin et al.
Federated learning learns from scattered data by fusing collaborative models from local nodes. However, the conventional coordinate-based model averaging by FedAvg ignored the random information encoded per parameter and may suffer from structural feature misalignment. In this work, we propose Fed2, a feature-aligned federated learning framework to resolve this issue by establishing a firm structure-feature alignment across the collaborative models. Fed2 is composed of two major designs: First, we design a feature-oriented model structure adaptation method to ensure explicit feature allocation in different neural network structures. Applying the structure adaptation to collaborative models, matchable structures with similar feature information can be initialized at the very early training stage. During the federated learning process, we then propose a feature paired averaging scheme to guarantee aligned feature distribution and maintain no feature fusion conflicts under either IID or non-IID scenarios. Eventually, Fed2 could effectively enhance the federated learning convergence performance under extensive homo- and heterogeneous settings, providing excellent convergence speed, accuracy, and computation/communication efficiency.
CVMay 10, 2021
Video Anomaly Detection By The Duality Of Normality-Granted Optical FlowHongyong Wang, Xinjian Zhang, Su Yang et al.
Video anomaly detection is a challenging task because of diverse abnormal events. To this task, methods based on reconstruction and prediction are wildly used in recent works, which are built on the assumption that learning on normal data, anomalies cannot be reconstructed or predicated as good as normal patterns, namely the anomaly result with more errors. In this paper, we propose to discriminate anomalies from normal ones by the duality of normality-granted optical flow, which is conducive to predict normal frames but adverse to abnormal frames. The normality-granted optical flow is predicted from a single frame, to keep the motion knowledge focused on normal patterns. Meanwhile, We extend the appearance-motion correspondence scheme from frame reconstruction to prediction, which not only helps to learn the knowledge about object appearances and correlated motion, but also meets the fact that motion is the transformation between appearances. We also introduce a margin loss to enhance the learning of frame prediction. Experiments on standard benchmark datasets demonstrate the impressive performance of our approach.
DCApr 21, 2021
A Survey on Federated Learning and its Applications for Accelerating Industrial Internet of ThingsJiehan Zhou, Shouhua Zhang, Qinghua Lu et al.
Federated learning (FL) brings collaborative intelligence into industries without centralized training data to accelerate the process of Industry 4.0 on the edge computing level. FL solves the dilemma in which enterprises wish to make the use of data intelligence with security concerns. To accelerate industrial Internet of things with the further leverage of FL, existing achievements on FL are developed from three aspects: 1) define terminologies and elaborate a general framework of FL for accommodating various scenarios; 2) discuss the state-of-the-art of FL on fundamental researches including data partitioning, privacy preservation, model optimization, local model transportation, personalization, motivation mechanism, platform & tools, and benchmark; 3) discuss the impacts of FL from the economic perspective. To attract more attention from industrial academia and practice, a FL-transformed manufacturing paradigm is presented, and future research directions of FL are given and possible immediate applications in Industry 4.0 domain are also proposed.
AIMar 17, 2021
A Survey of Hybrid Human-Artificial Intelligence for Social ComputingWenxi Wang, Huansheng Ning, Feifei Shi et al.
Along with the development of modern computing technology and social sciences, both theoretical research and practical applications of social computing have been continuously extended. In particular with the boom of artificial intelligence (AI), social computing is significantly influenced by AI. However, the conventional technologies of AI have drawbacks in dealing with more complicated and dynamic problems. Such deficiency can be rectified by hybrid human-artificial intelligence (H-AI) which integrates both human intelligence and AI into one unity, forming a new enhanced intelligence. H-AI in dealing with social problems shows the advantages that AI can not surpass. This paper firstly introduces the concept of H-AI. AI is the intelligence in the transition stage of H-AI, so the latest research progresses of AI in social computing are reviewed. Secondly, it summarizes typical challenges faced by AI in social computing, and makes it possible to introduce H-AI to solve these challenges. Finally, the paper proposes a holistic framework of social computing combining with H-AI, which consists of four layers: object layer, base layer, analysis layer, and application layer. It represents H-AI has significant advantages over AI in solving social problems.
CVFeb 21, 2021
Risk Prediction on Traffic Accidents using a Compact Neural Model for Multimodal Information Fusion over Urban Big DataWenshan Wang, Su Yang, Weishan Zhang
Predicting risk map of traffic accidents is vital for accident prevention and early planning of emergency response. Here, the challenge lies in the multimodal nature of urban big data. We propose a compact neural ensemble model to alleviate overfitting in fusing multimodal features and develop some new features such as fractal measure of road complexity in satellite images, taxi flows, POIs, and road width and connectivity in OpenStreetMap. The solution is more promising in performance than the baseline methods and the single-modality data based solutions. After visualization from a micro view, the visual patterns of the scenes related to high and low risk are revealed, providing lessons for future road design. From city point of view, the predicted risk map is close to the ground truth, and can act as the base in optimizing spatial configuration of resources for emergency response, and alarming signs. To the best of our knowledge, it is the first work to fuse visual and spatio-temporal features in traffic accident prediction while advances to bridge the gap between data mining based urban computing and computer vision based urban perception.
DCSep 22, 2020
Dynamic Fusion based Federated Learning for COVID-19 DetectionWeishan Zhang, Tao Zhou, Qinghua Lu et al.
Medical diagnostic image analysis (e.g., CT scan or X-Ray) using machine learning is an efficient and accurate way to detect COVID-19 infections. However, sharing diagnostic images across medical institutions is usually not allowed due to the concern of patients' privacy. This causes the issue of insufficient datasets for training the image classification model. Federated learning is an emerging privacy-preserving machine learning paradigm that produces an unbiased global model based on the received updates of local models trained by clients without exchanging clients' local data. Nevertheless, the default setting of federated learning introduces huge communication cost of transferring model updates and can hardly ensure model performance when data heterogeneity of clients heavily exists. To improve communication efficiency and model performance, in this paper, we propose a novel dynamic fusion-based federated learning approach for medical diagnostic image analysis to detect COVID-19 infections. First, we design an architecture for dynamic fusion-based federated learning systems to analyse medical diagnostic images. Further, we present a dynamic fusion method to dynamically decide the participating clients according to their local model performance and schedule the model fusion-based on participating clients' training time. In addition, we summarise a category of medical diagnostic image datasets for COVID-19 detection, which can be used by the machine learning community for image analysis. The evaluation results show that the proposed approach is feasible and performs better than the default setting of federated learning in terms of model performance, communication efficiency and fault tolerance.
DCSep 6, 2020
Blockchain-based Federated Learning for Device Failure Detection in Industrial IoTWeishan Zhang, Qinghua Lu, Qiuyu Yu et al.
Device failure detection is one of most essential problems in industrial internet of things (IIoT). However, in conventional IIoT device failure detection, client devices need to upload raw data to the central server for model training, which might lead to disclosure of sensitive business data. Therefore, in this paper, to ensure client data privacy, we propose a blockchain-based federated learning approach for device failure detection in IIoT. First, we present a platform architecture of blockchain-based federated learning systems for failure detection in IIoT, which enables verifiable integrity of client data. In the architecture, each client periodically creates a Merkle tree in which each leaf node represents a client data record, and stores the tree root on a blockchain. Further, to address the data heterogeneity issue in IIoT failure detection, we propose a novel centroid distance weighted federated averaging (CDW\_FedAvg) algorithm taking into account the distance between positive class and negative class of each client dataset. In addition, to motivate clients to participate in federated learning, a smart contact based incentive mechanism is designed depending on the size and the centroid distance of client data used in local model training. A prototype of the proposed architecture is implemented with our industry partner, and evaluated in terms of feasibility, accuracy and performance. The results show that the approach is feasible, and has satisfactory accuracy and performance.
LGAug 15, 2020
Heterogeneous Federated LearningFuxun Yu, Weishan Zhang, Zhuwei Qin et al.
Federated learning learns from scattered data by fusing collaborative models from local nodes. However, due to chaotic information distribution, the model fusion may suffer from structural misalignment with regard to unmatched parameters. In this work, we propose a novel federated learning framework to resolve this issue by establishing a firm structure-information alignment across collaborative models. Specifically, we design a feature-oriented regulation method ({$Ψ$-Net}) to ensure explicit feature information allocation in different neural network structures. Applying this regulating method to collaborative models, matchable structures with similar feature information can be initialized at the very early training stage. During the federated learning process under either IID or non-IID scenarios, dedicated collaboration schemes further guarantee ordered information distribution with definite structure matching, so as the comprehensive model alignment. Eventually, this framework effectively enhances the federated learning applicability to extensive heterogeneous settings, while providing excellent convergence speed, accuracy, and computation/communication efficiency.
SEJul 31, 2019
uBaaS: A Unified Blockchain as a Service PlatformQinghua Lu, Xiwei Xu, Yue Liu et al.
Blockchain is an innovative distributed ledger technology which has attracted a wide range of interests for building the next generation of applications to address lack-of-trust issues in business. Blockchain as a service (BaaS) is a promising solution to improve the productivity of blockchain application development. However, existing BaaS deployment solutions are mostly vendor-locked: they are either bound to a cloud provider or a blockchain platform. In addition to deployment, design and implementation of blockchain-based applications is a hard task requiring deep expertise. Therefore, this paper presents a unified blockchain as a service platform (uBaaS) to support both design and deployment of blockchain-based applications. The services in uBaaS include deployment as a service, design pattern as a service and auxiliary services. In uBaaS, deployment as a service is platform agnostic, which can avoid lock-in to specific cloud platforms, while design pattern as a service applies design patterns for data management and smart contract design to address the scalability and security issues of blockchain. The proposed solutions are evaluated using a real-world quality tracing use case in terms of feasibility and scalability.
CVMay 27, 2018
Anomaly Detection and Localization in Crowded Scenes by Motion-field Shape Description and Similarity-based Statistical LearningXinfeng Zhang, Su Yang, Xinjian Zhang et al.
In crowded scenes, detection and localization of abnormal behaviors is challenging in that high-density people make object segmentation and tracking extremely difficult. We associate the optical flows of multiple frames to capture short-term trajectories and introduce the histogram-based shape descriptor referred to as shape contexts to describe such short-term trajectories. Furthermore, we propose a K-NN similarity-based statistical model to detect anomalies over time and space, which is an unsupervised one-class learning algorithm requiring no clustering nor any prior assumption. Firstly, we retrieve the K-NN samples from the training set in regard to the testing sample, and then use the similarities between every pair of the K-NN samples to construct a Gaussian model. Finally, the probabilities of the similarities from the testing sample to the K-NN samples under the Gaussian model are calculated in the form of a joint probability. Abnormal events can be detected by judging whether the joint probability is below predefined thresholds in terms of time and space, separately. Such a scheme can adapt to the whole scene, since the probability computed as such is not affected by motion distortions arising from perspective distortion. We conduct experiments on real-world surveillance videos, and the results demonstrate that the proposed method can reliably detect and locate the abnormal events in the video sequences, outperforming the state-of-the-art approaches.
CVFeb 28, 2018
Neural Aesthetic Image ReviewerWenshan Wang, Su Yang, Weishan Zhang et al.
Recently, there is a rising interest in perceiving image aesthetics. The existing works deal with image aesthetics as a classification or regression problem. To extend the cognition from rating to reasoning, a deeper understanding of aesthetics should be based on revealing why a high- or low-aesthetic score should be assigned to an image. From such a point of view, we propose a model referred to as Neural Aesthetic Image Reviewer, which can not only give an aesthetic score for an image, but also generate a textual description explaining why the image leads to a plausible rating score. Specifically, we propose two multi-task architectures based on shared aesthetically semantic layers and task-specific embedding layers at a high level for performance improvement on different tasks. To facilitate researches on this problem, we collect the AVA-Reviews dataset, which contains 52,118 images and 312,708 comments in total. Through multi-task learning, the proposed models can rate aesthetic images as well as produce comments in an end-to-end manner. It is confirmed that the proposed models outperform the baselines according to the performance evaluation on the AVA-Reviews dataset. Moreover, we demonstrate experimentally that our model can generate textual reviews related to aesthetics, which are consistent with human perception.