Chengyuan Zhang

MM
h-index71
28papers
334citations
Novelty48%
AI Score51

28 Papers

LGNov 28, 2022Code
Discovering Dynamic Patterns from Spatiotemporal Data with Time-Varying Low-Rank Autoregression

Xinyu Chen, Chengyuan Zhang, Xiaoxu Chen et al.

The problem of broad practical interest in spatiotemporal data analysis, i.e., discovering interpretable dynamic patterns from spatiotemporal data, is studied in this paper. Towards this end, we develop a time-varying reduced-rank vector autoregression (VAR) model whose coefficient matrices are parameterized by low-rank tensor factorization. Benefiting from the tensor factorization structure, the proposed model can simultaneously achieve model compression and pattern discovery. In particular, the proposed model allows one to characterize nonstationarity and time-varying system behaviors underlying spatiotemporal data. To evaluate the proposed model, extensive experiments are conducted on various spatiotemporal data representing different nonlinear dynamical systems, including fluid dynamics, sea surface temperature, USA surface temperature, and NYC taxi trips. Experimental results demonstrate the effectiveness of modeling spatiotemporal data and characterizing spatial/temporal patterns with the proposed model. In the spatial context, the spatial patterns can be automatically extracted and intuitively characterized by the spatial modes. In the temporal context, the complex time-varying system behaviors can be revealed by the temporal modes in the proposed model. Thus, our model lays an insightful foundation for understanding complex spatiotemporal data in real-world dynamical systems. The adapted datasets and Python implementation are publicly available at https://github.com/xinychen/vars.

LGMar 20, 2022
Forecasting Sparse Movement Speed of Urban Road Networks with Nonstationary Temporal Matrix Factorization

Xinyu Chen, Chengyuan Zhang, Xi-Le Zhao et al.

Movement speed data from urban road networks, computed from ridesharing vehicles or taxi trajectories, is often high-dimensional, sparse, and nonstationary (e.g., exhibiting seasonality). To address these challenges, we propose a Nonstationary Temporal Matrix Factorization (NoTMF) model that leverages matrix factorization to project high-dimensional and sparse movement speed data into low-dimensional latent spaces. This results in a concise formula with the multiplication between spatial and temporal factor matrices. To characterize the temporal correlations, NoTMF takes a latent equation on the seasonal differenced temporal factors using higher-order vector autoregression (VAR). This approach not only preserves the low-rank structure of sparse movement speed data but also maintains consistent temporal dynamics, including seasonality information. The learning process for NoTMF involves optimizing the spatial and temporal factor matrices along with a collection of VAR coefficient matrices. To solve this efficiently, we introduce an alternating minimization framework, which tackles a challenging procedure of estimating the temporal factor matrix using conjugate gradient method, as the subproblem involves both partially observed matrix factorization and seasonal differenced VAR. To evaluate the forecasting performance of NoTMF, we conduct extensive experiments on Uber movement speed datasets, which are estimated from ridesharing vehicle trajectories. These datasets contain a large proportion of missing values due to insufficient ridesharing vehicles on the urban road network. Despite the presence of missing data, NoTMF demonstrates superior forecasting accuracy and effectiveness compared to baseline models. Moreover, as the seasonality of movement speed data is of great concern, the experiment results highlight the significance of addressing the nonstationarity of movement speed data.

CVJan 23
A Cosine Network for Image Super-Resolution

Chunwei Tian, Chengyuan Zhang, Bob Zhang et al.

Deep convolutional neural networks can use hierarchical information to progressively extract structural information to recover high-quality images. However, preserving the effectiveness of the obtained structural information is important in image super-resolution. In this paper, we propose a cosine network for image super-resolution (CSRNet) by improving a network architecture and optimizing the training strategy. To extract complementary homologous structural information, odd and even heterogeneous blocks are designed to enlarge the architectural differences and improve the performance of image super-resolution. Combining linear and non-linear structural information can overcome the drawback of homologous information and enhance the robustness of the obtained structural information in image super-resolution. Taking into account the local minimum of gradient descent, a cosine annealing mechanism is used to optimize the training procedure by performing warm restarts and adjusting the learning rate. Experimental results illustrate that the proposed CSRNet is competitive with state-of-the-art methods in image super-resolution.

APApr 24, 2024
Learning Car-Following Behaviors Using Bayesian Matrix Normal Mixture Regression

Chengyuan Zhang, Kehua Chen, Meixin Zhu et al.

Learning and understanding car-following (CF) behaviors are crucial for microscopic traffic simulation. Traditional CF models, though simple, often lack generalization capabilities, while many data-driven methods, despite their robustness, operate as "black boxes" with limited interpretability. To bridge this gap, this work introduces a Bayesian Matrix Normal Mixture Regression (MNMR) model that simultaneously captures feature correlations and temporal dynamics inherent in CF behaviors. This approach is distinguished by its separate learning of row and column covariance matrices within the model framework, offering an insightful perspective into the human driver decision-making processes. Through extensive experiments, we assess the model's performance across various historical steps of inputs, predictive steps of outputs, and model complexities. The results consistently demonstrate our model's adeptness in effectively capturing the intricate correlations and temporal dynamics present during CF. A focused case study further illustrates the model's outperforming interpretability of identifying distinct operational conditions through the learned mean and covariance matrices. This not only underlines our model's effectiveness in understanding complex human driving behaviors in CF scenarios but also highlights its potential as a tool for enhancing the interpretability of CF behaviors in traffic simulations and autonomous driving systems.

APJun 17, 2025
Markov Regime-Switching Intelligent Driver Model for Interpretable Car-Following Behavior

Chengyuan Zhang, Cathy Wu, Lijun Sun

Accurate and interpretable car-following models are essential for traffic simulation and autonomous vehicle development. However, classical models like the Intelligent Driver Model (IDM) are fundamentally limited by their parsimonious and single-regime structure. They fail to capture the multi-modal nature of human driving, where a single driving state (e.g., speed, relative speed, and gap) can elicit many different driver actions. This forces the model to average across distinct behaviors, reducing its fidelity and making its parameters difficult to interpret. To overcome this, we introduce a regime-switching framework that allows driving behavior to be governed by different IDM parameter sets, each corresponding to an interpretable behavioral mode. This design enables the model to dynamically switch between interpretable behavioral modes, rather than averaging across diverse driving contexts. We instantiate the framework using a Factorial Hidden Markov Model with IDM dynamics (FHMM-IDM), which explicitly separates intrinsic driving regimes (e.g., aggressive acceleration, steady-state following) from external traffic scenarios (e.g., free-flow, congestion, stop-and-go) through two independent latent Markov processes. Bayesian inference via Markov chain Monte Carlo (MCMC) is used to jointly estimate the regime-specific parameters, transition dynamics, and latent state trajectories. Experiments on the HighD dataset demonstrate that FHMM-IDM uncovers interpretable structure in human driving, effectively disentangling internal driver actions from contextual traffic conditions and revealing dynamic regime-switching patterns. This framework provides a tractable and principled solution to modeling context-dependent driving behavior under uncertainty, offering improvements in the fidelity of traffic simulations, the efficacy of safety analyses, and the development of more human-centric ADAS.

CVNov 13, 2024
UIFormer: A Unified Transformer-based Framework for Incremental Few-Shot Object Detection and Instance Segmentation

Chengyuan Zhang, Yilin Zhang, Lei Zhu et al.

This paper introduces a novel framework for unified incremental few-shot object detection (iFSOD) and instance segmentation (iFSIS) using the Transformer architecture. Our goal is to create an optimal solution for situations where only a few examples of novel object classes are available, with no access to training data for base or old classes, while maintaining high performance across both base and novel classes. To achieve this, We extend Mask-DINO into a two-stage incremental learning framework. Stage 1 focuses on optimizing the model using the base dataset, while Stage 2 involves fine-tuning the model on novel classes. Besides, we incorporate a classifier selection strategy that assigns appropriate classifiers to the encoder and decoder according to their distinct functions. Empirical evidence indicates that this approach effectively mitigates the over-fitting on novel classes learning. Furthermore, we implement knowledge distillation to prevent catastrophic forgetting of base classes. Comprehensive evaluations on the COCO and LVIS datasets for both iFSIS and iFSOD tasks demonstrate that our method significantly outperforms state-of-the-art approaches.

CVSep 23, 2025
Bi-VLM: Pushing Ultra-Low Precision Post-Training Quantization Boundaries in Vision-Language Models

Xijun Wang, Junyun Huang, Rayyan Abdalla et al.

We address the critical gap between the computational demands of vision-language models and the possible ultra-low-bit weight precision (bitwidth $\leq2$ bits) we can use for higher efficiency. Our work is motivated by the substantial computational cost and memory requirements of VLMs, which restrict their applicability in hardware-constrained environments. We propose Bi-VLM, which separates model weights non-uniformly based on the Gaussian quantiles. Our formulation groups the model weights into outlier (salient) and multiple inlier (unsalient) subsets, ensuring that each subset contains a proportion of weights corresponding to its quantile in the distribution. We propose a saliency-aware hybrid quantization algorithm and use it to quantize weights by imposing different constraints on the scaler and binary matrices based on the saliency metric and compression objective. We have evaluated our approach on different VLMs. For the language model part of the VLM, our Bi-VLM outperforms the SOTA by 3%-47% on the visual question answering task in terms of four different benchmarks and three different models. For the overall VLM, our Bi-VLM outperforms the SOTA by 4%-45%. We also perform token pruning on the quantized models and observe that there is redundancy of image tokens 90% - 99% in the quantized models. This helps us to further prune the visual tokens to improve efficiency.

CYSep 11, 2025
Aesthetic Experience and Educational Value in Co-creating Art with Generative AI: Evidence from a Survey of Young Learners

Chengyuan Zhang, Suzhe Xu

This study investigates the aesthetic experience and educational value of collaborative artmaking with generative artificial intelligence (AI) among young learners and art students. Based on a survey of 112 participants, we examine how human creators renegotiate their roles, how conventional notions of originality are challenged, how the creative process is transformed, and how aesthetic judgment is formed in human--AI co-creation. Empirically, participants generally view AI as a partner that stimulates ideation and expands creative boundaries rather than a passive tool, while simultaneously voicing concerns about stylistic homogenization and the erosion of traditional authorship. Theoretically, we synthesize Dewey's aesthetics of experience, Ihde's postphenomenology, and actor--network theory (ANT) into a single analytical framework to unpack the dynamics between human creators and AI as a non-human actant. Findings indicate (i) a fluid subjectivity in which creators shift across multiple stances (director, dialogic partner, discoverer); (ii) an iterative, dialogic workflow (intent--generate--select--refine) that centers critical interpretation; and (iii) an educational value shift from technical skill training toward higher-order competencies such as critical judgment, cross-modal ideation, and reflexivity. We argue that arts education should cultivate a \emph{critical co-creation} stance toward technology, guiding learners to collaborate with AI while preserving human distinctiveness in concept formation, judgment, and meaning-making.

APJul 9, 2025
When Context Is Not Enough: Modeling Unexplained Variability in Car-Following Behavior

Chengyuan Zhang, Zhengbing He, Cathy Wu et al.

Modeling car-following behavior is fundamental to microscopic traffic simulation, yet traditional deterministic models often fail to capture the full extent of variability and unpredictability in human driving. While many modern approaches incorporate context-aware inputs (e.g., spacing, speed, relative speed), they frequently overlook structured stochasticity that arises from latent driver intentions, perception errors, and memory effects -- factors that are not directly observable from context alone. To fill the gap, this study introduces an interpretable stochastic modeling framework that captures not only context-dependent dynamics but also residual variability beyond what context can explain. Leveraging deep neural networks integrated with nonstationary Gaussian processes (GPs), our model employs a scenario-adaptive Gibbs kernel to learn dynamic temporal correlations in acceleration decisions, where the strength and duration of correlations between acceleration decisions evolve with the driving context. This formulation enables a principled, data-driven quantification of uncertainty in acceleration, speed, and spacing, grounded in both observable context and latent behavioral variability. Comprehensive experiments on the naturalistic vehicle trajectory dataset collected from the German highway, i.e., the HighD dataset, demonstrate that the proposed stochastic simulation method within this framework surpasses conventional methods in both predictive performance and interpretable uncertainty quantification. The integration of interpretability and accuracy makes this framework a promising tool for traffic analysis and safety-critical applications.

CVMar 25, 2025
BiPrompt-SAM: Enhancing Image Segmentation via Explicit Selection between Point and Text Prompts

Suzhe Xu, Jialin Peng, Chengyuan Zhang

Segmentation is a fundamental task in computer vision, with prompt-driven methods gaining prominence due to their flexibility. The Segment Anything Model (SAM) excels at point-prompted segmentation, while text-based models, often leveraging powerful multimodal encoders like BEIT-3, provide rich semantic understanding. However, effectively combining these complementary modalities remains a challenge. This paper introduces BiPrompt-SAM, a novel dual-modal prompt segmentation framework employing an explicit selection mechanism. We leverage SAM's ability to generate multiple mask candidates from a single point prompt and use a text-guided mask (generated via EVF-SAM with BEIT-3) to select the point-generated mask that best aligns spatially, measured by Intersection over Union (IoU). This approach, interpretable as a simplified Mixture of Experts (MoE), effectively fuses spatial precision and semantic context without complex model modifications. Notably, our method achieves strong zero-shot performance on the Endovis17 medical dataset (89.55% mDice, 81.46% mIoU) using only a single point prompt per instance. This significantly reduces annotation burden compared to bounding boxes and aligns better with practical clinical workflows, demonstrating the method's effectiveness without domain-specific training. On the RefCOCO series, BiPrompt-SAM attained 87.1%, 86.5%, and 85.8% IoU, significantly outperforming existing approaches. Experiments show BiPrompt-SAM excels in scenarios requiring both spatial accuracy and semantic disambiguation, offering a simple, effective, and interpretable perspective on multi-modal prompt fusion.

CVDec 28, 2024
DAVE: Diverse Atomic Visual Elements Dataset with High Representation of Vulnerable Road Users in Complex and Unpredictable Environments

Xijun Wang, Pedro Sandoval-Segura, Chengyuan Zhang et al.

Most existing traffic video datasets including Waymo are structured, focusing predominantly on Western traffic, which hinders global applicability. Specifically, most Asian scenarios are far more complex, involving numerous objects with distinct motions and behaviors. Addressing this gap, we present a new dataset, DAVE, designed for evaluating perception methods with high representation of Vulnerable Road Users (VRUs: e.g. pedestrians, animals, motorbikes, and bicycles) in complex and unpredictable environments. DAVE is a manually annotated dataset encompassing 16 diverse actor categories (spanning animals, humans, vehicles, etc.) and 16 action types (complex and rare cases like cut-ins, zigzag movement, U-turn, etc.), which require high reasoning ability. DAVE densely annotates over 13 million bounding boxes (bboxes) actors with identification, and more than 1.6 million boxes are annotated with both actor identification and action/behavior details. The videos within DAVE are collected based on a broad spectrum of factors, such as weather conditions, the time of day, road scenarios, and traffic density. DAVE can benchmark video tasks like Tracking, Detection, Spatiotemporal Action Localization, Language-Visual Moment retrieval, and Multi-label Video Action Recognition. Given the critical importance of accurately identifying VRUs to prevent accidents and ensure road safety, in DAVE, vulnerable road users constitute 41.13% of instances, compared to 23.71% in Waymo. DAVE provides an invaluable resource for the development of more sensitive and accurate visual perception algorithms in the complex real world. Our experiments show that existing methods suffer degradation in performance when evaluated on DAVE, highlighting its benefit for future video recognition research.

SPDec 7, 2020
A Novel Hybrid Framework for Hourly PM2.5 Concentration Forecasting Using CEEMDAN and Deep Temporal Convolutional Neural Network

Fuxin Jiang, Chengyuan Zhang, Shaolong Sun et al.

For hourly PM2.5 concentration prediction, accurately capturing the data patterns of external factors that affect PM2.5 concentration changes, and constructing a forecasting model is one of efficient means to improve forecasting accuracy. In this study, a novel hybrid forecasting model based on complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) and deep temporal convolutional neural network (DeepTCN) is developed to predict PM2.5 concentration, by modelling the data patterns of historical pollutant concentrations data, meteorological data, and discrete time variables' data. Taking PM2.5 concentration of Beijing as the sample, experimental results showed that the forecasting accuracy of the proposed CEEMDAN-DeepTCN model is verified to be the highest when compared with the time series model, artificial neural network, and the popular deep learning models. The new model has improved the capability to model the PM2.5-related factor data patterns, and can be used as a promising tool for forecasting PM2.5 concentrations.

ROMar 2, 2020
Spatiotemporal Learning of Multivehicle Interaction Patterns in Lane-Change Scenarios

Chengyuan Zhang, Jiacheng Zhu, Wenshuo Wang et al.

Interpretation of common-yet-challenging interaction scenarios can benefit well-founded decisions for autonomous vehicles. Previous research achieved this using their prior knowledge of specific scenarios with predefined models, limiting their adaptive capabilities. This paper describes a Bayesian nonparametric approach that leverages continuous (i.e., Gaussian processes) and discrete (i.e., Dirichlet processes) stochastic processes to reveal underlying interaction patterns of the ego vehicle with other nearby vehicles. Our model relaxes dependency on the number of surrounding vehicles by developing an acceleration-sensitive velocity field based on Gaussian processes. The experiment results demonstrate that the velocity field can represent the spatial interactions between the ego vehicle and its surroundings. Then, a discrete Bayesian nonparametric model, integrating Dirichlet processes and hidden Markov models, is developed to learn the interaction patterns over the temporal space by segmenting and clustering the sequential interaction data into interpretable granular patterns automatically. We then evaluate our approach in the highway lane-change scenarios using the highD dataset collected from real-world settings. Results demonstrate that our proposed Bayesian nonparametric approach provides an insight into the complicated lane-change interactions of the ego vehicle with multiple surrounding traffic participants based on the interpretable interaction patterns and their transition properties in temporal relationships. Our proposed approach sheds light on efficiently analyzing other kinds of multi-agent interactions, such as vehicle-pedestrian interactions. View the demos via https://youtu.be/z_vf9UHtdAM.

ROJul 17, 2019
A General Framework of Learning Multi-Vehicle Interaction Patterns from Videos

Chengyuan Zhang, Jiacheng Zhu, Wenshuo Wang et al.

Semantic learning and understanding of multi-vehicle interaction patterns in a cluttered driving environment are essential but challenging for autonomous vehicles to make proper decisions. This paper presents a general framework to gain insights into intricate multi-vehicle interaction patterns from bird's-eye view traffic videos. We adopt a Gaussian velocity field to describe the time-varying multi-vehicle interaction behaviors and then use deep autoencoders to learn associated latent representations for each temporal frame. Then, we utilize a hidden semi-Markov model with a hierarchical Dirichlet process as a prior to segment these sequential representations into granular components, also called traffic primitives, corresponding to interaction patterns. Experimental results demonstrate that our proposed framework can extract traffic primitives from videos, thus providing a semantic way to analyze multi-vehicle interaction patterns, even for cluttered driving scenarios that are far messier than human beings can cope with.

CVJun 5, 2019
PAC-GAN: An Effective Pose Augmentation Scheme for Unsupervised Cross-View Person Re-identification

Chengyuan Zhang, Lei Zhu, Shichao Zhang

Person re-identification (person Re-Id) aims to retrieve the pedestrian images of a same person that captured by disjoint and non-overlapping cameras. Lots of researchers recently focuse on this hot issue and propose deep learning based methods to enhance the recognition rate in a supervised or unsupervised manner. However, two limitations that cannot be ignored: firstly, compared with other image retrieval benchmarks, the size of existing person Re-Id datasets are far from meeting the requirement, which cannot provide sufficient pedestrian samples for the training of deep model; secondly, the samples in existing datasets do not have sufficient human motions or postures coverage to provide more priori knowledges for learning. In this paper, we introduce a novel unsupervised pose augmentation cross-view person Re-Id scheme called PAC-GAN to overcome these limitations. We firstly present the formal definition of cross-view pose augmentation and then propose the framework of PAC-GAN that is a novel conditional generative adversarial network (CGAN) based approach to improve the performance of unsupervised corss-view person Re-Id. Specifically, The pose generation model in PAC-GAN called CPG-Net is to generate enough quantity of pose-rich samples from original image and skeleton samples. The pose augmentation dataset is produced by combining the synthesized pose-rich samples with the original samples, which is fed into the corss-view person Re-Id model named Cross-GAN. Besides, we use weight-sharing strategy in the CPG-Net to improve the quality of new generated samples. To the best of our knowledge, we are the first try to enhance the unsupervised cross-view person Re-Id by pose augmentation, and the results of extensive experiments show that the proposed scheme can combat the state-of-the-arts.

MMOct 14, 2018
CNN-VWII: An Efficient Approach for Large-Scale Video Retrieval by Image Queries

Chengyuan Zhang, Yunwu Lin, Lei Zhu et al.

This paper aims to solve the problem of large-scale video retrieval by a query image. Firstly, we define the problem of top-$k$ image to video query. Then, we combine the merits of convolutional neural networks(CNN for short) and Bag of Visual Word(BoVW for short) module to design a model for video frames information extraction and representation. In order to meet the requirements of large-scale video retrieval, we proposed a visual weighted inverted index(VWII for short) and related algorithm to improve the efficiency and accuracy of retrieval process. Comprehensive experiments show that our proposed technique achieves substantial improvements (up to an order of magnitude speed up) over the state-of-the-art techniques with similar accuracy.

MMOct 1, 2018
SVS-JOIN: Efficient Spatial Visual Similarity Join over Multimedia Data

Chengyuan Zhang, Ruipeng Chen, Lei Zhu et al.

In the big data era, massive amount of multimedia data with geo-tags has been generated and collected by mobile smart devices equipped with mobile communications module and position sensor module. This trend has put forward higher request on large-scale of geo-multimedia data retrieval. Spatial similarity join is one of the important problem in the area of spatial database. Previous works focused on textual document with geo-tags, rather than geo-multimedia data such as geo-images. In this paper, we study a novel search problem named spatial visual similarity join (SVS-JOIN for short), which aims to find similar geo-image pairs in both the aspects of geo-location and visual content. We propose the definition of SVS-JOIN at the first time and present how to measure geographical similarity and visual similarity. Then we introduce a baseline inspired by the method for textual similarity join and a extension named SVS-JOIN$_G$ which applies spatial grid strategy to improve the efficiency. To further improve the performance of search, we develop a novel approach called SVS-JOIN$_Q$ which utilizes a quadtree and a global inverted index. Experimental evaluations on real geo-image datasets demonstrate that our solution has a really high performance.

MMSep 8, 2018
Efficient Multimedia Similarity Measurement Using Similar Elements

Chengyuan Zhang, Yunwu Lin, Lei Zhu et al.

Online social networking techniques and large-scale multimedia systems are developing rapidly, which not only has brought great convenience to our daily life, but generated, collected, and stored large-scale multimedia data. This trend has put forward higher requirements and greater challenges on massive multimedia data retrieval. In this paper, we investigate the problem of image similarity measurement which is used to lots of applications. At first we propose the definition of similarity measurement of images and the related notions. Based on it we present a novel basic method of similarity measurement named SMIN. To improve the performance of calculation, we propose a novel indexing structure called SMI Temp Index (SMII for short). Besides, we establish an index of potential similar visual words off-line to solve to problem that the index cannot be reused. Experimental evaluations on two real image datasets demonstrate that our solution outperforms state-of-the-art method.

MMAug 29, 2018
Efficient Region of Visual Interests Search for Geo-multimedia Data

Chengyuan Zhang, Yunwu Lin, Lei Zhu et al.

With the proliferation of online social networking services and mobile smart devices equipped with mobile communications module and position sensor module, massive amount of multimedia data has been collected, stored and shared. This trend has put forward higher request on massive multimedia data retrieval. In this paper, we investigate a novel spatial query named region of visual interests query (RoVIQ), which aims to search users containing geographical information and visual words. Three baseline methods are presented to introduce how to exploit existing techniques to address this problem. Then we propose the definition of this query and related notions at the first time. To improve the performance of query, we propose a novel spatial indexing structure called quadtree based inverted visual index which is a combination of quadtree, inverted index and visual words. Based on it, we design a efficient search algorithm named region of visual interests search to support RoVIQ. Experimental evaluations on real geo-image datasets demonstrate that our solution outperforms state-of-the-art method.

MMAug 20, 2018
An Efficient Approach for Geo-Multimedia Cross-Modal Retrieval

Lei Zhu, Jun Long, Chengyuan Zhang et al.

Due to the rapid development of mobile Internet techniques, cloud computation and popularity of online social networking and location-based services, massive amount of multimedia data with geographical information is generated and uploaded to the Internet. In this paper, we propose a novel type of cross-modal multimedia retrieval called geo-multimedia cross-modal retrieval which aims to search out a set of geo-multimedia objects based on geographical distance proximity and semantic similarity between different modalities. Previous studies for cross-modal retrieval and spatial keyword search cannot address this problem effectively because they do not consider multimedia data with geo-tags and do not focus on this type of query. In order to address this problem efficiently, we present the definition of $k$NN geo-multimedia cross-modal query at the first time and introduce relevant conceptions such as cross-modal semantic representation space. To bridge the semantic gap between different modalities, we propose a method named cross-modal semantic matching which contains two important component, i.e., CorrProj and LogsTran, which aims to construct a common semantic representation space for cross-modal semantic similarity measurement. Besides, we designed a framework based on deep learning techniques to implement common semantic representation space construction. In addition, a novel hybrid indexing structure named GMR-Tree combining geo-multimedia data and R-Tree is presented and a efficient $k$NN search algorithm called $k$GMCMS is designed. Comprehensive experimental evaluation on real and synthetic dataset clearly demonstrates that our solution outperforms the-state-of-the-art methods.

MMAug 8, 2018
Efficient Continuous Top-$k$ Geo-Image Search on Road Network

Chengyuan Zhang, Kesheng Cheng, Lei Zhu et al.

With the rapid development of mobile Internet and cloud computing technology, large-scale multimedia data, e.g., texts, images, audio and videos have been generated, collected, stored and shared. In this paper, we propose a novel query problem named continuous top-$k$ geo-image query on road network which aims to search out a set of geo-visual objects based on road network distance proximity and visual content similarity. Existing approaches for spatial textual query and geo-image query cannot address this problem effectively because they do not consider both of visual content similarity and road network distance proximity on road network. In order to address this challenge effectively and efficiently, firstly we propose the definition of geo-visual objects and continuous top-$k$ geo-visual objects query on road network, then develop a score function for search. To improve the query efficiency in a large-scale road network, we propose the search algorithm named geo-visual search on road network based on a novel hybrid indexing framework called VIG-Tree, which combines G-Tree and visual inverted index technique. In addition, an important notion named safe interval and results updating rule are proposed, and based on them we develop an efficient algorithm named moving monitor algorithm to solve continuous query. Experimental evaluation on real multimedia dataset and road network dataset illustrates that our solution outperforms state-of-the-art method.

LGJul 31, 2018
DFTerNet: Towards 2-bit Dynamic Fusion Networks for Accurate Human Activity Recognition

Zhan Yang, Osolo Ian Raymond, ChengYuan Zhang et al.

Deep Convolutional Neural Networks (DCNNs) are currently popular in human activity recognition applications. However, in the face of modern artificial intelligence sensor-based games, many research achievements cannot be practically applied on portable devices. DCNNs are typically resource-intensive and too large to be deployed on portable devices, thus this limits the practical application of complex activity detection. In addition, since portable devices do not possess high-performance Graphic Processing Units (GPUs), there is hardly any improvement in Action Game (ACT) experience. Besides, in order to deal with multi-sensor collaboration, all previous human activity recognition models typically treated the representations from different sensor signal sources equally. However, distinct types of activities should adopt different fusion strategies. In this paper, a novel scheme is proposed. This scheme is used to train 2-bit Convolutional Neural Networks with weights and activations constrained to {-0.5,0,0.5}. It takes into account the correlation between different sensor signal sources and the activity types. This model, which we refer to as DFTerNet, aims at producing a more reliable inference and better trade-offs for practical applications. Our basic idea is to exploit quantization of weights and activations directly in pre-trained filter banks and adopt dynamic fusion strategies for different activity types. Experiments demonstrate that by using dynamic fusion strategy can exceed the baseline model performance by up to ~5% on activity recognition like OPPORTUNITY and PAMAP2 datasets. Using the quantization method proposed, we were able to achieve performances closer to that of full-precision counterpart. These results were also verified using the UniMiB-SHAR dataset. In addition, the proposed method can achieve ~9x acceleration on CPUs and ~11x memory saving.

MMJul 8, 2018
A Filter of Minhash for Image Similarity Measures

Jun Long, Qunfeng Liu, Xinpan Yuan et al.

Image similarity measures play an important role in nearest neighbor search and duplicate detection for large-scale image datasets. Recently, Minwise Hashing (or Minhash) and its related hashing algorithms have achieved great performances in large-scale image retrieval systems. However, there are a large number of comparisons for image pairs in these applications, which may spend a lot of computation time and affect the performance. In order to quickly obtain the pairwise images that theirs similarities are higher than the specific threshold T (e.g., 0.5), we propose a dynamic threshold filter of Minwise Hashing for image similarity measures. It greatly reduces the calculation time by terminating the unnecessary comparisons in advance. We also find that the filter can be extended to other hashing algorithms, on when the estimator satisfies the binomial distribution, such as b-Bit Minwise Hashing, One Permutation Hashing, etc. In this pager, we use the Bag-of-Visual-Words (BoVW) model based on the Scale Invariant Feature Transform (SIFT) to represent the image features. We have proved that the filter is correct and effective through the experiment on real image datasets.

SIJun 23, 2018
Temporal Activity Path Based Character Correction in Social Networks

Jun Long, Lei Zhu, Zhan Yang et al.

Vast amount of multimedia data contains massive and multifarious social information which is used to construct large-scale social networks. In a complex social network, a character should be ideally denoted by one and only one vertex. However, it is pervasive that a character is denoted by two or more vertices with different names, thus it is usually considered as multiple, different characters. This problem causes incorrectness of results in network analysis and mining. The factual challenge is that character uniqueness is hard to correctly confirm due to lots of complicated factors, e.g. name changing and anonymization, leading to character duplication. Early, limited research has shown that previous methods depended overly upon supplementary attribute information from databases. In this paper, we propose a novel method to merge the character vertices which refer to as the same entity but are denoted with different names. With this method, we firstly build the relationship network among characters based on records of social activities participated, which are extracted from multimedia sources. Then define temporal activity paths (TAPs) for each character over time. After that, we measure similarity of the TAPs for any two characters. If the similarity is high enough, the two vertices should be considered to the same character. Based on TAPs, we can determine whether to merge the two character vertices. Our experiments shown that this solution can accurately confirm character uniqueness in large-scale social network.

MMJun 9, 2018
Hierarchical Information Quadtree: Efficient Spatial Temporal Image Search for Multimedia Stream

Chengyuan Zhang, Ruipeng Chen, Lei Zhu et al.

Massive amount of multimedia data that contain times- tamps and geographical information are being generated at an unprecedented scale in many emerging applications such as photo sharing web site and social networks applications. Due to their importance, a large body of work has focused on efficiently computing various spatial image queries. In this paper,we study the spatial temporal image query which considers three important constraints during the search including time recency, spatial proximity and visual relevance. A novel index structure, namely Hierarchical Information Quadtree(\hiq), to efficiently insert/delete spatial temporal images with high arrive rates. Base on \hiq an efficient algorithm is developed to support spatial temporal image query. We show via extensive experimentation with real spatial databases clearly demonstrate the efficiency of our methods.

MMJun 2, 2018
Efficient Interactive Search for Geo-tagged Multimedia Data

Jun Long, Lei Zhu, Chengyuan Zhang et al.

Due to the advances in mobile computing and multimedia techniques, there are vast amount of multimedia data with geographical information collected in multifarious applications. In this paper, we propose a novel type of image search named interactive geo-tagged image search which aims to find out a set of images based on geographical proximity and similarity of visual content, as well as the preference of users. Existing approaches for spatial keyword query and geo-image query cannot address this problem effectively since they do not consider these three type of information together for query. In order to solve this challenge efficiently, we propose the definition of interactive top-$k$ geo-tagged image query and then present a framework including candidate search stage , interaction stage and termination stage. To enhance the searching efficiency in a large-scale database, we propose the candidate search algorithm named GI-SUPER Search based on a new notion called superior relationship and GIR-Tree, a novel index structure. Furthermore, two candidate selection methods are proposed for learning the preferences of the user during the interaction. At last, the termination procedure and estimation procedure are introduced in brief. Experimental evaluation on real multimedia dataset demonstrates that our solution has a really high performance.

MMMay 29, 2018
Hierarchical One Permutation Hashing: Efficient Multimedia Near Duplicate Detection

Chengyuan Zhang, Yunwu Lin, Lei Zhu et al.

With advances in multimedia technologies and the proliferation of smart phone, digital cameras, storage devices, there are a rapidly growing massive amount of multimedia data collected in many applications such as multimedia retrieval and management system, in which the data element is composed of text, image, video and audio. Consequently, the study of multimedia near duplicate detection has attracted significant concern from research organizations and commercial communities. Traditional solution minwish hashing (\minwise) faces two challenges: expensive preprocessing time and lower comparison speed. Thus, this work first introduce a hashing method called one permutation hashing (\oph) to shun the costly preprocessing time. Based on \oph, a more efficient strategy group based one permutation hashing (\goph) is developed to deal with the high comparison time. Based on the fact that the similarity of most multimedia data is not very high, this work design an new hashing method namely hierarchical one permutation hashing (\hoph) to further improve the performance. Comprehensive experiments on real multimedia datasets clearly show that with similar accuracy \hoph is five to seven times faster than

CVJan 4, 2018
Crossing Generative Adversarial Networks for Cross-View Person Re-identification

Chengyuan Zhang, Lin Wu, Yang Wang

Person re-identification (\textit{re-id}) refers to matching pedestrians across disjoint yet non-overlapping camera views. The most effective way to match these pedestrians undertaking significant visual variations is to seek reliably invariant features that can describe the person of interest faithfully. Most of existing methods are presented in a supervised manner to produce discriminative features by relying on labeled paired images in correspondence. However, annotating pair-wise images is prohibitively expensive in labors, and thus not practical in large-scale networked cameras. Moreover, seeking comparable representations across camera views demands a flexible model to address the complex distributions of images. In this work, we study the co-occurrence statistic patterns between pairs of images, and propose to crossing Generative Adversarial Network (Cross-GAN) for learning a joint distribution for cross-image representations in a unsupervised manner. Given a pair of person images, the proposed model consists of the variational auto-encoder to encode the pair into respective latent variables, a proposed cross-view alignment to reduce the view disparity, and an adversarial layer to seek the joint distribution of latent representations. The learned latent representations are well-aligned to reflect the co-occurrence patterns of paired images. We empirically evaluate the proposed model against challenging datasets, and our results show the importance of joint invariant features in improving matching rates of person re-id with comparison to semi/unsupervised state-of-the-arts.