AI · Jan 5 · Code
OpenSocInt: A Multi-modal Training Environment for Human-Aware Social Navigation
Victor Sanchez, Chris Reinke, Ahamed Mohamed et al.
In this paper, we introduce OpenSocInt, an open-source software package providing a simulator for multi-modal social interactions and a modular architecture to train social agents. We describe the software package and showcase its usefulness via an experimental protocol based on the task of social navigation. Our framework allows for exploring the use of different perceptual features, their encoding and fusion, as well as the use of different agents. The software is already publicly available under GPL at https://gitlab.inria.fr/robotlearn/OpenSocInt/.
CV · Apr 13, 2023
Robust Multiview Multimodal Driver Monitoring System Using Masked Multi-Head Self-Attention
Yiming Ma, Victor Sanchez, Soodeh Nikan et al.
Driver Monitoring Systems (DMSs) are crucial for safe hand-over actions in Level-2+ self-driving vehicles. State-of-the-art DMSs leverage multiple sensors mounted at different locations to monitor the driver and the vehicle's interior scene and employ decision-level fusion to integrate these heterogeneous data. However, this fusion method may not fully utilize the complementarity of different data sources and may overlook their relative importance. To address these limitations, we propose a novel multiview multimodal driver monitoring system based on feature-level fusion through multi-head self-attention (MHSA). We demonstrate its effectiveness by comparing it against four alternative fusion strategies (Sum, Conv, SE, and AFF). We also present a novel GPU-friendly supervised contrastive learning framework, SuMoCo, to learn better representations. Furthermore, we annotate the test split of the DAD dataset at a finer granularity to enable the multi-class recognition of drivers' activities. Experiments on this enhanced database demonstrate that 1) the proposed MHSA-based fusion method (AUC-ROC: 97.0%) outperforms all baselines and previous approaches, and 2) training MHSA with patch masking can improve its robustness against modality/view collapses. The code and annotations are publicly available.
CV · Oct 17, 2022
Real-Time Driver Monitoring Systems through Modality and View Analysis
Yiming Ma, Victor Sanchez, Soodeh Nikan et al.
Driver distractions are known to be the dominant cause of road accidents. While monitoring systems can detect non-driving-related activities and facilitate reducing the risks, they must be accurate and efficient to be applicable. Unfortunately, state-of-the-art methods prioritize accuracy while ignoring latency because they leverage cross-view and multimodal videos in which consecutive frames are highly similar. Thus, in this paper, we pursue time-effective detection models by neglecting the temporal relation between video frames and investigate the importance of each sensing modality in detecting drivers' activities. Experiments demonstrate that 1) our proposed algorithms are real-time and can achieve similar performance (97.5% AUC-PR) with significantly reduced computation compared with video-based models; 2) the top view with the infrared channel is more informative than any other single modality. Furthermore, we enhance the DAD dataset by manually annotating its test set to enable multi-class classification. We also thoroughly analyze the influence of visual sensor types and their placements on the prediction of each class. The code and the new labels will be released.
SD · Jul 16, 2022
Visually-aware Acoustic Event Detection using Heterogeneous Graphs
Amir Shirian, Krishna Somandepalli, Victor Sanchez et al.
Perception of auditory events is inherently multimodal, relying on both audio and visual cues. A large number of existing multimodal approaches process each modality using modality-specific models and then fuse the embeddings to encode the joint information. In contrast, we employ heterogeneous graphs to explicitly capture the spatial and temporal relationships between the modalities and represent detailed information about the underlying signal. We use heterogeneous graphs to address the task of visually-aware acoustic event classification, as they provide a compact, efficient and scalable way to represent data. Through heterogeneous graphs, we show efficient modelling of intra- and inter-modality relationships at both spatial and temporal scales. Our model can easily be adapted to different scales of events through relevant hyperparameters. Experiments on AudioSet, a large benchmark, show that our model achieves state-of-the-art performance.
CL · May 1, 2024 · Code
WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting
Olly Styles, Sam Miller, Patricio Cerda-Mardini et al.
We introduce WorkBench: a benchmark dataset for evaluating agents' ability to execute tasks in a workplace setting. WorkBench contains a sandbox environment with five databases, 26 tools, and 690 tasks. These tasks represent common business activities, such as sending emails and scheduling meetings. The tasks in WorkBench are challenging as they require planning, tool selection, and often multiple actions. If a task has been successfully executed, one (or more) of the database values may change. The correct outcome for each task is unique and unambiguous, which allows for robust, automated evaluation. We call this key contribution outcome-centric evaluation. We evaluate five existing ReAct agents on WorkBench, finding they successfully complete as few as 3% of tasks (Llama2-70B), and just 43% for the best-performing (GPT-4). We further find that agents' errors can result in the wrong action being taken, such as an email being sent to the wrong person. WorkBench reveals weaknesses in agents' ability to undertake common business activities, raising questions about their use in high-stakes workplace settings. WorkBench is publicly available as a free resource at https://github.com/olly-styles/WorkBench.
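The outcome-centric idea described in this abstract can be illustrated with a minimal sketch. It is not taken from the WorkBench code base: the sandbox is modelled as a dictionary of tables, and a task counts as solved only if the state left behind by the agent equals the single expected state. All names here (`Sandbox`, `send_email`, `is_success`) and the example addresses are hypothetical.

```python
import copy

# Hypothetical sandbox: each "database" is a list of row dictionaries.
Sandbox = dict[str, list[dict[str, str]]]


def send_email(state: Sandbox, recipient: str, subject: str) -> None:
    """Hypothetical tool: its side effect is one new row in the email table."""
    state["email"].append({"to": recipient, "subject": subject})


def is_success(final_state: Sandbox, expected_state: Sandbox) -> bool:
    """Outcome-centric check: compare the resulting state, not the action trace."""
    return final_state == expected_state


initial: Sandbox = {"email": [], "calendar": []}

# Ground truth: the state after the one correct action.
expected = copy.deepcopy(initial)
send_email(expected, "sam@example.com", "Q3 report")

# An agent that emails the wrong person leaves the sandbox in the wrong state.
agent_state = copy.deepcopy(initial)
send_email(agent_state, "alex@example.com", "Q3 report")

print(is_success(agent_state, expected))  # False
```

Comparing states rather than tool-call sequences means any sequence of actions that reaches the expected state passes, while a harmful side effect such as an email to the wrong recipient fails automatically.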
CV · Jun 26, 2022
Video Anomaly Detection via Prediction Network with Enhanced Spatio-Temporal Memory Exchange
Guodong Shen, Yuqi Ouyang, Victor Sanchez
Video anomaly detection is a challenging task because most anomalies are scarce and non-deterministic. Many approaches investigate the reconstruction difference between normal and abnormal patterns, but neglect that anomalies do not necessarily correspond to large reconstruction errors. To address this issue, we design a Convolutional LSTM Auto-Encoder prediction framework with enhanced spatio-temporal memory exchange using bi-directionality and a higher-order mechanism. The bi-directional structure promotes learning the temporal regularity through forward and backward predictions. The unique higher-order mechanism further strengthens spatial information interaction between the encoder and the decoder. Considering the limited receptive fields in Convolutional LSTMs, we also introduce an attention module to highlight informative features for prediction. Anomalies are eventually identified by comparing the frames with their corresponding predictions. Evaluations on three popular benchmarks show that our framework outperforms most existing prediction-based anomaly detection methods.
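The final step, comparing frames with their predictions, might be sketched as follows. The abstract does not state which error measure or normalisation is used, so the mean squared error and the per-video min-max scaling below are assumptions, and the function names are hypothetical.

```python
def frame_error(frame: list[float], prediction: list[float]) -> float:
    """Mean squared error between a flattened frame and its prediction."""
    return sum((f - p) ** 2 for f, p in zip(frame, prediction)) / len(frame)


def anomaly_scores(frames: list[list[float]],
                   predictions: list[list[float]]) -> list[float]:
    """Min-max normalise per-frame errors to [0, 1] within one video (assumed)."""
    errors = [frame_error(f, p) for f, p in zip(frames, predictions)]
    lo, hi = min(errors), max(errors)
    if hi == lo:  # constant error: no frame stands out
        return [0.0 for _ in errors]
    return [(e - lo) / (hi - lo) for e in errors]


# Toy video of three two-pixel frames; the third deviates from its prediction.
frames = [[0.1, 0.2], [0.1, 0.2], [0.9, 0.8]]
predictions = [[0.1, 0.2], [0.1, 0.2], [0.1, 0.2]]
scores = anomaly_scores(frames, predictions)
```

Frames whose score exceeds a chosen threshold would then be flagged as anomalous.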
CV · Jul 27, 2022
Look at Adjacent Frames: Video Anomaly Detection without Offline Training
Yuqi Ouyang, Guodong Shen, Victor Sanchez
We propose a solution to detect anomalous events in videos without the need to train a model offline. Specifically, our solution is based on a randomly-initialized multilayer perceptron that is optimized online to reconstruct video frames, pixel-by-pixel, from their frequency information. Based on the information shifts between adjacent frames, an incremental learner is used to update the parameters of the multilayer perceptron after observing each frame, thus allowing anomalous events to be detected along the video stream. Traditional solutions that require no offline training are limited to operating on videos with only a few abnormal frames. Our solution breaks this limit and achieves strong performance on benchmark datasets.
CV · Jun 9, 2025 · Code
OptiScene: LLM-driven Indoor Scene Layout Generation via Scaled Human-aligned Data Synthesis and Multi-Stage Preference Optimization
Yixuan Yang, Zhen Luo, Tongsheng Ding et al.
Automatic indoor layout generation has attracted increasing attention due to its potential in interior design, virtual environment construction, and embodied AI. Existing methods fall into two categories: prompt-driven approaches that leverage proprietary LLM services (e.g., GPT APIs) and learning-based methods that train diffusion-based models on layout data. Prompt-driven methods often suffer from spatial inconsistency and high computational costs, while learning-based methods are typically constrained by coarse relational graphs and limited datasets, restricting their generalization to diverse room categories. In this paper, we revisit LLM-based indoor layout generation and present 3D-SynthPlace, a large-scale dataset that combines synthetic layouts generated via a 'GPT synthesize, Human inspect' pipeline, upgraded from the 3D-Front dataset. 3D-SynthPlace contains nearly 17,000 scenes, covering four common room types -- bedroom, living room, kitchen, and bathroom -- enriched with diverse objects and high-level spatial annotations. We further introduce OptiScene, a strong open-source LLM optimized for indoor layout generation, fine-tuned on our 3D-SynthPlace dataset through our two-stage training. For the warm-up stage I, we adopt supervised fine-tuning (SFT), in which the model is taught to first generate high-level spatial descriptions and then conditionally predict concrete object placements. For the reinforcing stage II, to better align the generated layouts with human design preferences, we apply multi-turn direct preference optimization (DPO), which significantly improves layout quality and generation success rates. Extensive experiments demonstrate that OptiScene outperforms traditional prompt-driven and learning-based baselines. Moreover, OptiScene shows promising potential in interactive tasks such as scene editing and robot navigation.
CV · Aug 8, 2024
Enhanced Prototypical Part Network (EPPNet) For Explainable Image Classification Via Prototypes
Bhushan Atote, Victor Sanchez
Explainable Artificial Intelligence (xAI) has the potential to enhance the transparency and trust of AI-based systems. Although accurate predictions can be made using Deep Neural Networks (DNNs), the process used to arrive at such predictions is usually hard to explain. In terms of perceptibly human-friendly representations, such as word phrases in text or super-pixels in images, prototype-based explanations can justify a model's decision. In this work, we introduce a DNN architecture for image classification, the Enhanced Prototypical Part Network (EPPNet), which achieves strong performance while discovering relevant prototypes that can be used to explain the classification results. This is achieved by introducing a novel cluster loss that helps to discover more relevant human-understandable prototypes. We also introduce a faithfulness score to evaluate the explainability of the results based on the discovered prototypes. Our score accounts not only for the relevance of the learned prototypes but also for the performance of a model. Our evaluations on the CUB-200-2011 dataset show that the EPPNet outperforms state-of-the-art xAI-based methods, in terms of both classification accuracy and explainability.
CV · Jun 6, 2024 · Code
LLplace: The 3D Indoor Scene Layout Generation and Editing via Large Language Model
Yixuan Yang, Junru Lu, Zixiang Zhao et al.
Designing 3D indoor layouts is a crucial task with significant applications in virtual reality, interior design, and automated space planning. Existing methods for 3D layout design either rely on diffusion models, which utilize spatial relationship priors, or heavily leverage the inferential capabilities of proprietary Large Language Models (LLMs), which require extensive prompt engineering and in-context exemplars via black-box trials. These methods often face limitations in generalization and dynamic scene editing. In this paper, we introduce LLplace, a novel 3D indoor scene layout designer based on a lightweight, fine-tuned open-source LLM, Llama3. LLplace circumvents the need for spatial relationship priors and in-context exemplars, enabling efficient and credible room layout generation based solely on user inputs specifying the room type and desired objects. We curated a new dialogue dataset based on the 3D-Front dataset, expanding the original data volume and incorporating dialogue data for adding and removing objects. This dataset can enhance the LLM's spatial understanding. Furthermore, through dialogue, LLplace activates the LLM's capability to understand 3D layouts and perform dynamic scene editing, enabling the addition and removal of objects. Our approach demonstrates that LLplace can effectively generate and edit 3D indoor layouts interactively and outperform existing methods in delivering high-quality 3D design solutions. Code and dataset will be released.
CV · Mar 14, 2024 · Code
CLIP-EBC: CLIP Can Count Accurately through Enhanced Blockwise Classification
Yiming Ma, Victor Sanchez, Tanaya Guha
We propose CLIP-EBC, the first fully CLIP-based model for accurate crowd density estimation. While the CLIP model has demonstrated remarkable success in addressing recognition tasks such as zero-shot image classification, its potential for counting has been largely unexplored due to the inherent challenges in transforming a regression problem, such as counting, into a recognition task. In this work, we investigate and enhance CLIP's ability to count, focusing specifically on the task of estimating crowd sizes from images. Existing classification-based crowd-counting frameworks have significant limitations, including the quantization of count values into bordering real-valued bins and the sole focus on classification errors. These practices result in label ambiguity near the shared borders and inaccurate prediction of count values. Hence, directly applying CLIP within these frameworks may yield suboptimal performance. To address these challenges, we first propose the Enhanced Blockwise Classification (EBC) framework. Unlike previous methods, EBC utilizes integer-valued bins, effectively reducing ambiguity near bin boundaries. Additionally, it incorporates a regression loss based on density maps to improve the prediction of count values. Within our backbone-agnostic EBC framework, we then introduce CLIP-EBC to fully leverage CLIP's recognition capabilities for this task. Extensive experiments demonstrate the effectiveness of EBC and the competitive performance of CLIP-EBC. Specifically, our EBC framework can improve existing classification-based methods by up to 44.5% on the UCF-QNRF dataset, and CLIP-EBC achieves state-of-the-art performance on the NWPU-Crowd test set, with an MAE of 58.2 and an RMSE of 268.5, representing improvements of 8.6% and 13.3% over the previous best method, STEERER. The code and weights are available at https://github.com/Yiming-M/CLIP-EBC.
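To make the contrast between real-valued and integer-valued bins concrete, here is a small sketch. With real-valued bins that share borders, a block count sitting on a border is ambiguous, whereas with integer-valued bins every integer count belongs to exactly one bin. The bin edges and representative values below are hypothetical and are not the ones used in the paper.

```python
# Hypothetical integer-valued bins for per-block counts, given as inclusive
# (low, high) pairs; high=None marks the open-ended last bin.
BINS = [(0, 0), (1, 1), (2, 2), (3, 4), (5, None)]

# Hypothetical representative count for each bin; the value for the open
# bin is an assumption made only for this illustration.
BIN_VALUES = [0.0, 1.0, 2.0, 3.5, 6.0]


def count_to_bin(count: int) -> int:
    """Return the index of the unique bin that contains an integer count."""
    for index, (low, high) in enumerate(BINS):
        if count >= low and (high is None or count <= high):
            return index
    raise ValueError("count must be a non-negative integer")


def expected_count(probabilities: list[float]) -> float:
    """Turn predicted bin probabilities into a count estimate for one block."""
    return sum(p * v for p, v in zip(probabilities, BIN_VALUES))


print(count_to_bin(4))                               # 3
print(expected_count([0.0, 0.0, 1.0, 0.0, 0.0]))     # 2.0
```

A count estimate of this kind is one way a classification output could also be fed to a regression-style loss, in the spirit of the count-aware loss mentioned in the abstract.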
CV · Aug 10, 2021 · Code
Multi-Camera Trajectory Forecasting with Trajectory Tensors
Olly Styles, Tanaya Guha, Victor Sanchez
We introduce the problem of multi-camera trajectory forecasting (MCTF), which involves predicting the trajectory of a moving object across a network of cameras. While multi-camera setups are widespread for applications such as surveillance and traffic monitoring, existing trajectory forecasting methods typically focus on single-camera trajectory forecasting (SCTF), limiting their use for such applications. Furthermore, using a single camera limits the field-of-view available, making long-term trajectory forecasting impossible. We address these shortcomings of SCTF by developing an MCTF framework that simultaneously uses all estimated relative object locations from several viewpoints and predicts the object's future location in all possible viewpoints. Our framework follows a Which-When-Where approach that predicts in which camera(s) the objects appear and when and where within the camera views they appear. To this end, we propose the concept of trajectory tensors: a new technique to encode trajectories across multiple camera views and the associated uncertainties. We develop several encoder-decoder MCTF models for trajectory tensors and present extensive experiments on our own database (comprising 600 hours of video data from 15 camera views) created particularly for the MCTF task. Results show that our trajectory tensor models outperform coordinate trajectory-based MCTF models and existing SCTF methods adapted for MCTF. Code is available from: https://github.com/olly-styles/Trajectory-Tensors
CV · May 1, 2020 · Code
Multi-Camera Trajectory Forecasting: Pedestrian Trajectory Prediction in a Network of Cameras
Olly Styles, Tanaya Guha, Victor Sanchez et al.
We introduce the task of multi-camera trajectory forecasting (MCTF), where the future trajectory of an object is predicted in a network of cameras. Prior works consider forecasting trajectories in a single camera view. Our work is the first to consider the challenging scenario of forecasting across multiple non-overlapping camera views. This has wide applicability in tasks such as re-identification and multi-target multi-camera tracking. To facilitate research in this new area, we release the Warwick-NTU Multi-camera Forecasting Database (WNMF), a unique dataset of multi-camera pedestrian trajectories from a network of 15 synchronized cameras. To accurately label this large dataset (600 hours of video footage), we also develop a semi-automated annotation method. An effective MCTF model should proactively anticipate where and when a person will re-appear in the camera network. In this paper, we consider the task of predicting the next camera in which a pedestrian will re-appear after leaving the view of another camera, and present several baseline approaches for this. The labeled database is available online: https://github.com/olly-styles/Multi-Camera-Trajectory-Forecasting.
CV · Sep 26, 2019 · Code
Multiple Object Forecasting: Predicting Future Object Locations in Diverse Environments
Olly Styles, Tanaya Guha, Victor Sanchez
This paper introduces the problem of multiple object forecasting (MOF), in which the goal is to predict future bounding boxes of tracked objects. In contrast to existing works on object trajectory forecasting which primarily consider the problem from a bird's-eye perspective, we formulate the problem from an object-level perspective and call for the prediction of full object bounding boxes, rather than trajectories alone. Towards solving this task, we introduce the Citywalks dataset, which consists of over 200k high-resolution video frames. Citywalks comprises footage recorded in 21 cities from 10 European countries in a variety of weather conditions and over 3.5k unique pedestrian trajectories. For evaluation, we adapt existing trajectory forecasting methods for MOF and confirm cross-dataset generalizability on the MOT-17 dataset without fine-tuning. Finally, we present STED, a novel encoder-decoder architecture for MOF. STED combines visual and temporal features to model both object-motion and ego-motion, and outperforms existing approaches for MOF. Code & dataset link: https://github.com/olly-styles/Multiple-Object-Forecasting
CV · Jan 9
TAPM-Net: Trajectory-Aware Perturbation Modeling for Infrared Small Target Detection
Hongyang Xie, Hongyang He, Victor Sanchez
Infrared small target detection (ISTD) remains a long-standing challenge due to weak signal contrast, limited spatial extent, and cluttered backgrounds. Despite performance improvements from convolutional neural networks (CNNs) and Vision Transformers (ViTs), current models lack a mechanism to trace how small targets trigger directional, layer-wise perturbations in the feature space, which is an essential cue for distinguishing signal from structured noise in infrared scenes. To address this limitation, we propose the Trajectory-Aware Mamba Propagation Network (TAPM-Net), which explicitly models the spatial diffusion behavior of target-induced feature disturbances. TAPM-Net is built upon two novel components: a Perturbation-guided Path Module (PGM) and a Trajectory-Aware State Block (TASB). The PGM constructs perturbation energy fields from multi-level features and extracts gradient-following feature trajectories that reflect the directionality of local responses. The resulting feature trajectories are fed into the TASB, a Mamba-based state-space unit that models dynamic propagation along each trajectory while incorporating velocity-constrained diffusion and semantically aligned feature fusion from word-level and sentence-level embeddings. Unlike existing attention-based methods, TAPM-Net enables anisotropic, context-sensitive state transitions along spatial trajectories while maintaining global coherence at low computational cost. Experiments on NUAA-SIRST and IRSTD-1K demonstrate that TAPM-Net achieves state-of-the-art performance in ISTD.
CV · Dec 18, 2023
Cross-Age Contrastive Learning for Age-Invariant Face Recognition
Haoyi Wang, Victor Sanchez, Chang-Tsun Li
Cross-age facial images are typically challenging and expensive to collect, making noise-free age-oriented datasets relatively small compared to widely-used large-scale facial datasets. Additionally, in real scenarios, images of the same subject at different ages are usually hard or even impossible to obtain. Both of these factors lead to a lack of supervised data, which limits the versatility of supervised methods for age-invariant face recognition, a critical task in applications such as security and biometrics. To address this issue, we propose a novel semi-supervised learning approach named Cross-Age Contrastive Learning (CACon). Thanks to the identity-preserving power of recent face synthesis models, CACon introduces a new contrastive learning method that leverages an additional synthesized sample from the input image. We also propose a new loss function in association with CACon to perform contrastive learning on a triplet of samples. We demonstrate that our method not only achieves state-of-the-art performance in homogeneous-dataset experiments on several age-invariant face recognition benchmarks but also outperforms other methods by a large margin in cross-dataset experiments.
CV · Jan 8, 2024
Data-Agnostic Face Image Synthesis Detection Using Bayesian CNNs
Roberto Leyva, Victor Sanchez, Gregory Epiphaniou et al.
Face image synthesis detection is gaining considerable attention because of the potential negative impact on society that this type of synthetic data brings. In this paper, we propose a data-agnostic solution to detect the face image synthesis process. Specifically, our solution is based on an anomaly detection framework that requires only real data to learn the inference process. It is therefore data-agnostic in the sense that it requires no synthetic face images. The solution uses the posterior probability with respect to the reference data to determine if new samples are synthetic or not. Our evaluation results using different synthesizers show that our solution is very competitive against the state-of-the-art, which requires synthetic data for training.
CV · Apr 20, 2025
Advancing Video Anomaly Detection: A Bi-Directional Hybrid Framework for Enhanced Single- and Multi-Task Approaches
Guodong Shen, Yuqi Ouyang, Junru Lu et al.
Despite the prevailing transition from single-task to multi-task approaches in video anomaly detection, we observe that many adopt sub-optimal frameworks for individual proxy tasks. Motivated by this, we contend that optimizing single-task frameworks can advance both single- and multi-task approaches. Accordingly, we leverage middle-frame prediction as the primary proxy task, and introduce an effective hybrid framework designed to generate accurate predictions for normal frames and flawed predictions for abnormal frames. This hybrid framework is built upon a bi-directional structure that seamlessly integrates both vision transformers and ConvLSTMs. Specifically, we utilize this bi-directional structure to fully analyze the temporal dimension by predicting frames in both forward and backward directions, significantly boosting the detection stability. Given the transformer's capacity to model long-range contextual dependencies, we develop a convolutional temporal transformer that efficiently associates feature maps from all context frames to generate attention-based predictions for target frames. Furthermore, we devise a layer-interactive ConvLSTM bridge that facilitates the smooth flow of low-level features across layers and time-steps, thereby strengthening predictions with fine details. Anomalies are eventually identified by scrutinizing the discrepancies between target frames and their corresponding predictions. Several experiments conducted on public benchmarks affirm the efficacy of our hybrid framework, whether used as a standalone single-task approach or integrated as a branch in a multi-task approach. These experiments also underscore the advantages of merging vision transformers and ConvLSTMs for video anomaly detection.
CV · Dec 21, 2024
Interact with me: Joint Egocentric Forecasting of Intent to Interact, Attitude and Social Actions
Tongfei Bian, Yiming Ma, Mathieu Chollet et al.
For efficient human-agent interaction, an agent should proactively recognize their target user and prepare for upcoming interactions. We formulate this challenging problem as the novel task of jointly forecasting a person's intent to interact with the agent, their attitude towards the agent and the action they will perform, from the agent's (egocentric) perspective. We propose SocialEgoNet, a graph-based spatiotemporal framework that exploits task dependencies through a hierarchical multitask learning approach. SocialEgoNet uses whole-body skeletons (keypoints from face, hands and body) extracted from only 1 second of video input for high inference speed. For evaluation, we augment an existing egocentric human-agent interaction dataset with new class labels and bounding box annotations. Extensive experiments on this augmented dataset, named JPL-Social, demonstrate that our model runs in real time and outperforms several competitive baselines (average accuracy across all tasks: 83.15%). The additional annotations and code will be available upon acceptance.
SP · May 21, 2024
Beyond Isolated Frames: Enhancing Sensor-Based Human Activity Recognition through Intra- and Inter-Frame Attention
Shuai Shao, Yu Guan, Victor Sanchez
Human Activity Recognition (HAR) has become increasingly popular with ubiquitous computing, driven by the popularity of wearable sensors in fields like healthcare and sports. While Convolutional Neural Networks (ConvNets) have significantly contributed to HAR, they often adopt a frame-by-frame analysis, concentrating on individual frames and potentially overlooking the broader temporal dynamics inherent in human activities. To address this, we propose the intra- and inter-frame attention model. This model captures both the nuances within individual frames and the broader contextual relationships across multiple frames, offering a comprehensive perspective on sequential data. We further enrich the temporal understanding by proposing a novel time-sequential batch learning strategy. This learning strategy preserves the chronological sequence of time-series data within each batch, ensuring the continuity and integrity of temporal patterns in sensor-based HAR.
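One plausible reading of the time-sequential batch learning strategy is sketched below: consecutive sensor windows are grouped into batches without shuffling inside a batch, so that each batch preserves chronological order. Shuffling the order of whole batches between epochs is an assumption added here, and the function name and parameters are hypothetical.

```python
import random


def time_sequential_batches(windows: list, batch_size: int, seed: int = 0) -> list:
    """Group consecutive windows into batches that keep their time order."""
    batches = [windows[i:i + batch_size]
               for i in range(0, len(windows), batch_size)]
    # Assumption: only the order of whole batches is randomised; the
    # chronological order inside each batch is left untouched.
    random.Random(seed).shuffle(batches)
    return batches


# Ten windows indexed by their position in time.
batches = time_sequential_batches(list(range(10)), batch_size=4)
```

By contrast, a conventional data loader that shuffles individual windows would mix distant time steps within one batch and break the continuity that the abstract aims to preserve.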
CV · Feb 22, 2025
DiffFake: Exposing Deepfakes using Differential Anomaly Detection
Sotirios Stamnas, Victor Sanchez
Traditional deepfake detectors have dealt with the detection problem as a binary classification task. This approach can achieve satisfactory results in cases where samples of a given deepfake generation technique have been seen during training, but can easily fail with deepfakes generated by other techniques. In this paper, we propose DiffFake, a novel deepfake detector that approaches the detection problem as an anomaly detection task. Specifically, DiffFake learns natural changes that occur between two facial images of the same person by leveraging a differential anomaly detection framework. This is done by combining pairs of deep face embeddings and using them to train an anomaly detection model. We further propose to train a feature extractor on pseudo-deepfakes with global and local artifacts, to extract meaningful and generalizable features that can then be used to train the anomaly detection model. We perform extensive experiments on five different deepfake datasets and show that our method can match and sometimes even exceed the performance of state-of-the-art competitors.
CV · Jan 3, 2025
From Age Estimation to Age-Invariant Face Recognition: Generalized Age Feature Extraction Using Order-Enhanced Contrastive Learning
Haoyi Wang, Victor Sanchez, Chang-Tsun Li et al.
Generalized age feature extraction is crucial for age-related facial analysis tasks, such as age estimation and age-invariant face recognition (AIFR). Despite the recent successes of models in homogeneous-dataset experiments, their performance drops significantly in cross-dataset evaluations. Most of these models fail to extract generalized age features as they only attempt to map extracted features to training age labels directly without explicitly modeling the natural ordinal progression of aging. In this paper, we propose Order-Enhanced Contrastive Learning (OrdCon), a novel contrastive learning framework designed explicitly for ordinal attributes like age. Specifically, to extract generalized features, OrdCon aligns the direction vector of two features with either the natural aging direction or its reverse to model the ordinal process of aging. To further enhance generalizability, OrdCon leverages a novel soft proxy matching loss as a second contrastive objective, ensuring that features are positioned around the center of each age cluster with minimal intra-class variance and proportionally away from other clusters. By modeling the aging process, the framework can enhance generalizability by improving the alignment of samples from the same class and reducing the divergence of direction vectors. We demonstrate that our proposed method achieves comparable results to state-of-the-art methods on various benchmark datasets in homogeneous-dataset evaluations for both age estimation and AIFR. In cross-dataset experiments, OrdCon outperforms other methods, reducing the mean absolute error by approximately 1.38 on average for the age estimation task and boosting the average accuracy for AIFR by 1.87%.
CV · Jan 8, 2024
Detecting Face Synthesis Using a Concealed Fusion Model
Roberto Leyva, Victor Sanchez, Gregory Epiphaniou et al.
Face image synthesis is gaining more attention in computer security due to concerns about its potential negative impacts, including those related to fake biometrics. Hence, building models that can detect the synthesized face images is an important challenge to tackle. In this paper, we propose a fusion-based strategy to detect face image synthesis while providing resiliency to several attacks. The proposed strategy uses a late fusion of the outputs computed by several undisclosed models by relying on random polynomial coefficients and exponents to conceal a new feature space. Unlike existing concealing solutions, our strategy requires no quantization, which helps to preserve the feature space. Our experiments reveal that our strategy achieves state-of-the-art performance while providing protection against poisoning, perturbation, backdoor, and reverse model attacks.
CV · Jun 24, 2025
ZIP: Scalable Crowd Counting via Zero-Inflated Poisson Modeling
Yiming Ma, Victor Sanchez, Tanaya Guha
Most crowd counting methods directly regress blockwise density maps using Mean Squared Error (MSE) losses. This practice has two key limitations: (1) it fails to account for the extreme spatial sparsity of annotations - over 95% of 8x8 blocks are empty across standard benchmarks, so supervision signals in informative regions are diluted by the predominant zeros; (2) MSE corresponds to a Gaussian error model that poorly matches discrete, non-negative count data. To address these issues, we introduce ZIP, a scalable crowd counting framework that models blockwise counts with a Zero-Inflated Poisson likelihood: a zero-inflation term learns the probability a block is structurally empty (handling excess zeros), while the Poisson component captures expected counts when people are present (respecting discreteness). We provide a generalization analysis showing a tighter risk bound for ZIP than MSE-based losses and DMCount provided that the training resolution is moderately large. To assess the scalability of ZIP, we instantiate it on backbones spanning over 100x in parameters/compute. Experiments on ShanghaiTech A & B, UCF-QNRF, and NWPU-Crowd demonstrate that ZIP consistently surpasses state-of-the-art methods across all model scales.
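For reference, a zero-inflated Poisson distribution with zero-inflation probability π and Poisson rate λ assigns P(0) = π + (1 − π)e^(−λ) and P(k) = (1 − π)λ^k e^(−λ) / k! for k ≥ 1. The sketch below evaluates the corresponding negative log-likelihood for a single block count. It illustrates the textbook distribution only; the function name is hypothetical and the exact loss used in the paper may differ.

```python
import math


def zip_nll(count: int, pi: float, lam: float) -> float:
    """Negative log-likelihood of one block count under a ZIP model.

    pi  : probability that the block is structurally empty, 0 <= pi < 1
    lam : Poisson rate for blocks that are not structurally empty, lam > 0
    """
    if count == 0:
        # A zero arises either structurally or from the Poisson component.
        return -math.log(pi + (1.0 - pi) * math.exp(-lam))
    # Positive counts can only come from the Poisson component.
    log_poisson = count * math.log(lam) - lam - math.lgamma(count + 1)
    return -(math.log(1.0 - pi) + log_poisson)


# An empty block is cheap to explain when pi is high, even if lam is large.
print(zip_nll(0, pi=0.9, lam=3.0) < zip_nll(0, pi=0.1, lam=3.0))  # True
```

Because the zero case has its own term, the many empty blocks can be absorbed by π instead of dragging λ towards zero, which matches the motivation given in the abstract.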
CVApr 18, 2025
RefComp: A Reference-guided Unified Framework for Unpaired Point Cloud CompletionYixuan Yang, Jinyu Yang, Zixiang Zhao et al.
The unpaired point cloud completion task aims to complete a partial point cloud by using models trained with no ground truth. Existing unpaired point cloud completion methods are class-aware, i.e., a separate model is needed for each object class. Since they have limited generalization capabilities, these methods perform poorly in real-world scenarios when confronted with a wide range of point clouds of generic 3D objects. In this paper, we propose a novel unpaired point cloud completion framework, namely the Reference-guided Completion (RefComp) framework, which attains strong performance in both the class-aware and class-agnostic training settings. The RefComp framework transforms the unpaired completion problem into a shape translation problem, which is solved in the latent feature space of the partial point clouds. To this end, we introduce the use of partial-complete point cloud pairs, which are retrieved by using the partial point cloud to be completed as a template. These point cloud pairs are used as reference data to guide the completion process. Our RefComp framework uses a reference branch and a target branch with shared parameters for shape fusion and shape translation via a Latent Shape Fusion Module (LSFM) to enhance the structural features along the completion pipeline. Extensive experiments demonstrate that the RefComp framework achieves not only state-of-the-art performance in the class-aware training setting but also competitive results in the class-agnostic training setting on both virtual scans and real-world datasets.
CVFeb 28, 2022
FusionCount: Efficient Crowd Counting via Multiscale Feature FusionYiming Ma, Victor Sanchez, Tanaya Guha
State-of-the-art crowd counting models follow an encoder-decoder approach. Images are first processed by the encoder to extract features. Then, to account for perspective distortion, the highest-level feature map is fed to extra components to extract multiscale features, which are the input to the decoder to generate crowd densities. However, in these methods, features extracted at earlier stages during encoding are underutilised, and the multiscale modules can only capture a limited range of receptive fields, albeit with considerable computational cost. This paper proposes a novel crowd counting architecture (FusionCount), which exploits the adaptive fusion of a large majority of encoded features instead of relying on additional extraction components to obtain multiscale features. Thus, it can cover a more extensive scope of receptive field sizes and lower the computational cost. We also introduce a new channel reduction block, which can extract saliency information during decoding and further enhance the model's performance. Experiments on two benchmark databases demonstrate that our model achieves state-of-the-art results with reduced computational complexity.
CVJan 24, 2022
Spectral-PQ: A Novel Spectral Sensitivity-Orientated Perceptual Compression Technique for RGB 4:4:4 Video DataLee Prangnell, Victor Sanchez
There exists an intrinsic relationship between the spectral sensitivity of the Human Visual System (HVS) and color perception; these intertwined phenomena are often overlooked in perceptual compression research. In general, most previously proposed visually lossless compression techniques exploit luminance (luma) masking including luma spatiotemporal masking, luma contrast masking and luma texture/edge masking. The perceptual relevance of color in a picture is often overlooked, which constitutes a gap in the literature. With regard to the spectral sensitivity phenomenon of the HVS, the color channels of raw RGB 4:4:4 data contain significant color-based psychovisual redundancies. These perceptual redundancies can be quantized via color channel-level perceptual quantization. In this paper, we propose a novel spatiotemporal visually lossless coding method named Spectral Perceptual Quantization (Spectral-PQ). With application for RGB 4:4:4 video data, Spectral-PQ exploits HVS spectral sensitivity-related color masking in addition to spatial masking and temporal masking; the proposed method operates at the Coding Block (CB) level and the Prediction Unit (PU) level in the HEVC standard. Spectral-PQ perceptually adjusts the Quantization Step Size (QStep) at the CB level if high variance spatial data in G, B and R CBs is detected and also if high motion vector magnitudes in PUs are detected. Compared with anchor 1 (HEVC HM 16.17 RExt), Spectral-PQ considerably reduces bitrates with a maximum reduction of approximately 81%. The Mean Opinion Score (MOS) in the subjective evaluations shows that Spectral-PQ successfully achieves perceptually lossless quality.
CVDec 19, 2021
Improving Face-Based Age Estimation with Attention-Based Dynamic Patch FusionHaoyi Wang, Victor Sanchez, Chang-Tsun Li
With the increasing popularity of convolutional neural networks (CNNs), recent works on face-based age estimation employ these networks as the backbone. However, state-of-the-art CNN-based methods treat each facial region equally, thus entirely ignoring the importance of some facial patches that may contain rich age-specific information. In this paper, we propose a face-based age estimation framework, called Attention-based Dynamic Patch Fusion (ADPF). In ADPF, two separate CNNs are implemented, namely the AttentionNet and the FusionNet. The AttentionNet dynamically locates and ranks age-specific patches by employing a novel Ranking-guided Multi-Head Hybrid Attention (RMHHA) mechanism. The FusionNet uses the discovered patches along with the facial image to predict the age of the subject. Since the proposed RMHHA mechanism ranks the discovered patches based on their importance, the length of the learning path of each patch in the FusionNet is proportional to the amount of information it carries (the longer, the more important). ADPF also introduces a novel diversity loss to guide the training of the AttentionNet and reduce the overlap among patches so that the diverse and important patches are discovered. Through extensive experiments, we show that our proposed framework outperforms state-of-the-art methods on several age estimation benchmark datasets.
CVAug 24, 2021
Joint Learning Architecture for Multiple Object Tracking and Trajectory ForecastingOluwafunmilola Kesa, Olly Styles, Victor Sanchez
This paper introduces a joint learning architecture (JLA) for multiple object tracking (MOT) and trajectory forecasting in which the goal is to predict objects' current and future trajectories simultaneously. Motion prediction is widely used in several state-of-the-art MOT methods to refine predictions in the form of bounding boxes. Typically, a Kalman filter provides short-term estimations to help trackers correctly predict objects' locations in the current frame. However, Kalman filter-based approaches cannot predict non-linear trajectories. We propose to jointly train a tracking and trajectory forecasting model and use the predicted trajectory forecasts for short-term motion estimates in lieu of linear motion prediction methods such as the Kalman filter. We evaluate our JLA on the MOTChallenge benchmark. Evaluation results show that JLA performs better for short-term motion prediction and reduces ID switches by 33%, 31%, and 47% in the MOT16, MOT17, and MOT20 datasets, respectively, in comparison to FairMOT.
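To make the limitation of linear motion prediction concrete, the sketch below shows only the prediction step of a constant-velocity model, the kind of linear motion model often paired with a Kalman filter in trackers. The state layout `[x, y, vx, vy]`, the function name, and the time step are assumptions for illustration; the abstract does not specify the motion model used by the compared trackers, and the covariance and measurement-update steps of a full Kalman filter are omitted.

```python
import numpy as np

def constant_velocity_predict(state, dt=1.0):
    """Propagate a hypothetical [x, y, vx, vy] state one step ahead.

    This is the mean-prediction step x' = F x of a constant-velocity model;
    covariance propagation and the measurement update are left out.
    """
    F = np.array(
        [
            [1.0, 0.0, dt, 0.0],   # x  <- x + vx * dt
            [0.0, 1.0, 0.0, dt],   # y  <- y + vy * dt
            [0.0, 0.0, 1.0, 0.0],  # vx unchanged
            [0.0, 0.0, 0.0, 1.0],  # vy unchanged
        ]
    )
    return F @ np.asarray(state, dtype=float)

# An object moving on a unit circle: position (1, 0), velocity (0, 1).
predicted = constant_velocity_predict([1.0, 0.0, 0.0, 1.0], dt=0.5)
true_position = np.array([np.cos(0.5), np.sin(0.5)])
# The linear prediction leaves the circle, so the error is non-zero.
error = np.linalg.norm(predicted[:2] - true_position)
```

Because the velocity is held fixed, the predicted position continues along the tangent while the true position curves away, which is the kind of non-linear motion the abstract says Kalman filter-based approaches cannot predict.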
CVJul 1, 2021
On the detection-to-track association for online multi-object trackingXufeng Lin, Chang-Tsun Li, Victor Sanchez et al.
Driven by recent advances in object detection with deep neural networks, the tracking-by-detection paradigm has gained increasing prevalence in the research community of multi-object tracking (MOT). It has long been known that appearance information plays an essential role in the detection-to-track association, which lies at the core of the tracking-by-detection paradigm. While most existing works consider the appearance distances between the detections and the tracks, they ignore the statistical information implied by the historical appearance distance records in the tracks, which can be particularly useful when a detection has similar distances with two or more tracks. In this work, we propose a hybrid track association (HTA) algorithm that models the historical appearance distances of a track with an incremental Gaussian mixture model (IGMM) and incorporates the derived statistical information into the calculation of the detection-to-track association cost. Experimental results on three MOT benchmarks confirm that HTA effectively improves the target identification performance with a small compromise to the tracking speed. Additionally, compared to many state-of-the-art trackers, the DeepSORT tracker equipped with HTA achieves better or comparable performance in terms of the balance of tracking quality and speed.
MMJun 13, 2021
Deep Learning for Predictive Analytics in Reversible SteganographyChing-Chun Chang, Xu Wang, Sisheng Chen et al.
Deep learning is regarded as a promising solution for reversible steganography. There is an accelerating trend of representing a reversible stego-system by monolithic neural networks, which bypass intermediate operations in traditional pipelines of reversible steganography. This end-to-end paradigm, however, suffers from imperfect reversibility. By contrast, the modular paradigm that incorporates neural networks into modules of traditional pipelines can stably guarantee reversibility with mathematical explainability. Prediction-error modulation is a well-established reversible steganography pipeline for digital images. It consists of a predictive analytics module and a reversible coding module. Given that reversibility is governed independently by the coding module, we narrow our focus to the incorporation of neural networks into the analytics module, which serves the purpose of predicting pixel intensities and plays a pivotal role in determining capacity and imperceptibility. The objective of this study is to evaluate the impacts of different training configurations upon the predictive accuracy of neural networks and provide practical insights. In particular, we investigate how different initialisation strategies for input images may affect the learning process and how different training strategies for dual-layer prediction respond to the problem of distributional shift. Furthermore, we compare the steganographic performance of various model architectures with different loss functions.
CVDec 2, 2020
Video Anomaly Detection by Estimating Likelihood of RepresentationsYuqi Ouyang, Victor Sanchez
Video anomaly detection is a challenging task not only because it involves solving many sub-tasks such as motion representation, object localization and action recognition, but also because it is commonly considered as an unsupervised learning problem that involves detecting outliers. Traditionally, solutions to this task have focused on the mapping between video frames and their low-dimensional features, while ignoring the spatial connections of those features. Recent solutions focus on analyzing these spatial connections by using hard clustering techniques, such as K-Means, or applying neural networks to map latent features to a general understanding, such as action attributes. In order to solve video anomaly detection in the latent feature space, we propose a deep probabilistic model to transform this task into a density estimation problem where latent manifolds are generated by a deep denoising autoencoder and clustered by expectation maximization. Evaluations on several benchmark datasets show the strengths of our model, achieving outstanding performance on challenging datasets.
CVJul 1, 2020
Age-Oriented Face Synthesis with Conditional Discriminator Pool and Adversarial Triplet LossHaoyi Wang, Victor Sanchez, Chang-Tsun Li
Vanilla Generative Adversarial Networks (GANs) are commonly used to generate realistic images depicting aged and rejuvenated faces. However, the performance of such vanilla GANs in the age-oriented face synthesis task is often compromised by the mode collapse issue, which may result in the generation of faces with minimal variations and poor synthesis accuracy. In addition, recent age-oriented face synthesis methods use the L1 or L2 constraint to preserve the identity information on synthesized faces, which implicitly limits the identity permanence capabilities when these constraints are associated with a trivial weighting factor. In this paper, we propose a method for the age-oriented face synthesis task that achieves a high synthesis accuracy with strong identity permanence capabilities. Specifically, to achieve a high synthesis accuracy, our method tackles the mode collapse issue with a novel Conditional Discriminator Pool (CDP), which consists of multiple discriminators, each targeting one particular age category. To achieve strong identity permanence capabilities, our method uses a novel Adversarial Triplet loss. This loss, which is based on the Triplet loss, adds a ranking operation to further pull the positive embedding towards the anchor embedding, resulting in significantly reduced intra-class variances in the feature space. Through extensive experiments, we show that our proposed method outperforms state-of-the-art methods in terms of synthesis accuracy and identity permanence capabilities, qualitatively and quantitatively.
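For reference, the standard Triplet loss that the Adversarial Triplet loss is said to build on can be sketched as follows. This shows only the baseline: the ranking operation that distinguishes the proposed loss is not specified in the abstract and is not reproduced here. The function name, the use of squared Euclidean distance, and the default margin are assumptions for illustration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss on single embeddings (hypothetical helper).

    Penalises the triplet unless the anchor is closer to the positive than
    to the negative by at least `margin`, using squared Euclidean distance.
    """
    anchor = np.asarray(anchor, dtype=float)
    positive = np.asarray(positive, dtype=float)
    negative = np.asarray(negative, dtype=float)

    d_ap = np.sum((anchor - positive) ** 2)  # anchor-to-positive distance
    d_an = np.sum((anchor - negative) ** 2)  # anchor-to-negative distance
    return float(max(0.0, d_ap - d_an + margin))
```

The loss is zero once the negative is pushed far enough away, so on its own it does not keep shrinking the anchor-to-positive distance; the abstract's ranking operation is described as pulling the positive embedding further towards the anchor.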
CVJun 6, 2020
Ensemble Network for Ranking Images Based on Visual AppealSachin Singh, Victor Sanchez, Tanaya Guha
We propose a computational framework for ranking images (group photos in particular) taken at the same event within a short time span. The ranking is expected to correspond with human perception of the overall appeal of the images. We hypothesize and provide evidence through subjective analysis that the factors that make an image appealing to humans are its emotional content, aesthetics and image quality. We propose a network which is an ensemble of three information channels, each predicting a score corresponding to one of the three visual appeal factors. For group emotion estimation, we propose a convolutional neural network (CNN) based architecture for predicting group emotion from images. This new architecture forces the network to put emphasis on the important regions in the images, and achieves comparable results to the state-of-the-art. Next, we develop a network for the image ranking task that combines group emotion, aesthetics and image quality scores. Owing to the unavailability of suitable databases, we created a new database of manually annotated group photos taken during various social events. We present experimental results on this database and other benchmark databases whenever available. Overall, our experiments show that the proposed framework can reliably predict the overall appeal of images with results closely corresponding to human ranking.
IVMay 16, 2020
HVS-Based Perceptual Color Compression of Image DataLee Prangnell, Victor Sanchez
In perceptual image coding applications, the main objective is to decrease, as much as possible, Bits Per Pixel (BPP) while avoiding noticeable distortions in the reconstructed image. In this paper, we propose a novel perceptual image coding technique, named Perceptual Color Compression (PCC). PCC is based on a novel model related to Human Visual System (HVS) spectral sensitivity and CIELAB Just Noticeable Color Difference (JNCD). We utilize this modeling to capitalize on the inability of the HVS to perceptually differentiate photons in very similar wavelength bands (e.g., distinguishing very similar shades of a particular color or different colors that look similar). The proposed PCC technique can be used with RGB (4:4:4) image data of various bit depths and spatial resolutions. In the evaluations, we compare the proposed PCC technique with a set of reference methods including Versatile Video Coding (VVC) and High Efficiency Video Coding (HEVC) in addition to two other recently proposed algorithms. Our PCC method attains considerable BPP reductions compared with all four reference techniques including, on average, 52.6% BPP reductions compared with VVC (VVC in All Intra still image coding mode). Regarding image perceptual reconstruction quality, PCC achieves a score of SSIM = 0.99 in all tests in addition to a score of MS-SSIM = 0.99 in all but one test. Moreover, MOS = 5 is attained in 75% of subjective evaluation assessments conducted.
MMMay 16, 2020
Spatiotemporal Adaptive Quantization for the Perceptual Video Coding of RGB 4:4:4 DataLee Prangnell, Victor Sanchez
Due to the spectral sensitivity phenomenon of the Human Visual System (HVS), the color channels of raw RGB 4:4:4 sequences contain significant psychovisual redundancies; these redundancies can be perceptually quantized. The default quantization systems in the HEVC standard are known as Uniform Reconstruction Quantization (URQ) and Rate Distortion Optimized Quantization (RDOQ); URQ and RDOQ are not perceptually optimized for the coding of RGB 4:4:4 video data. In this paper, we propose a novel spatiotemporal perceptual quantization technique named SPAQ. With application for RGB 4:4:4 video data, SPAQ exploits HVS spectral sensitivity-related color masking in addition to spatial masking and temporal masking; SPAQ operates at the Coding Block (CB) level and the Prediction Unit (PU) level. The proposed technique perceptually adjusts the Quantization Step Size (QStep) at the CB level if high variance spatial data in G, B and R CBs is detected and also if high motion vector magnitudes in PUs are detected. Compared with anchor 1 (HEVC HM 16.17 RExt), SPAQ considerably reduces bitrates with a maximum reduction of approximately 80%. The Mean Opinion Score (MOS) in the subjective evaluations, in addition to the SSIM scores, show that SPAQ successfully achieves perceptually lossless compression compared with anchors.
CVMay 9, 2019
Forecasting Pedestrian Trajectory with Machine-Annotated Training DataOlly Styles, Arun Ross, Victor Sanchez
Reliable anticipation of pedestrian trajectory is imperative for the operation of autonomous vehicles and can significantly enhance the functionality of advanced driver assistance systems. While significant progress has been made in the field of pedestrian detection, forecasting pedestrian trajectories remains a challenging problem due to the unpredictable nature of pedestrians and the huge space of potentially useful features. In this work, we present a deep learning approach for pedestrian trajectory forecasting using a single vehicle-mounted camera. Deep learning models that have revolutionized other areas in computer vision have seen limited application to trajectory forecasting, in part due to the lack of richly annotated training data. We address the lack of training data by introducing a scalable machine annotation scheme that enables our model to be trained using a large dataset without human annotation. In addition, we propose Dynamic Trajectory Predictor (DTP), a model for forecasting pedestrian trajectory up to one second into the future. DTP is trained using both human and machine-annotated data, and anticipates dynamic motion that is not captured by linear models. Experimental evaluation confirms the benefits of the proposed model.
CVJul 27, 2018
Fusion Network for Face-based Age EstimationHaoyi Wang, Xingjie Wei, Victor Sanchez et al.
Convolutional Neural Networks (CNNs) have been applied to age-related research as the core framework. Although faces are composed of numerous facial attributes, most works with CNNs still consider a face as a typical object and do not pay enough attention to facial regions that carry age-specific features for this particular task. In this paper, we propose a novel CNN architecture called Fusion Network (FusionNet) to tackle the age estimation problem. Apart from the whole face image, the FusionNet successively takes several age-specific facial patches as part of the input to emphasize the age-specific features. Through experiments, we show that the FusionNet significantly outperforms other state-of-the-art models on the MORPH II benchmark.
MMFeb 16, 2018
Coding Block-Level Perceptual Video Coding for 4:4:4 Data in HEVCLee Prangnell, Miguel Hernández-Cabronero, Victor Sanchez
There is an increasing consumer demand for high bit-depth 4:4:4 HD video data playback due to its superior perceptual visual quality compared with standard 8-bit subsampled 4:2:0 video data. Due to vast file sizes and associated bitrates, it is desirable to compress raw high bit-depth 4:4:4 HD video sequences as much as possible without incurring a discernible decrease in visual quality. In this paper, we propose a Coding Block (CB)-level perceptual video coding technique for HEVC named Full Color Perceptual Quantization (FCPQ). FCPQ is designed to adjust the Quantization Parameter (QP) at the CB level (i.e., the luma CB and the chroma Cb and Cr CBs) according to the variances of pixel data in each CB. FCPQ is based on the default perceptual quantization method in HEVC called AdaptiveQP. AdaptiveQP adjusts the QP of an entire CU based only on the spatial activity of the constituent luma CB. As demonstrated in this paper, by not accounting for the spatial activity of the constituent chroma CBs, as is the case with AdaptiveQP, coding performance can be significantly affected; this is because the variance of pixel data in a luma CB is notably different from the variances of pixel data in chroma Cb and Cr CBs. FCPQ, therefore, addresses this problem. In terms of coding performance, FCPQ achieves BD-Rate improvements of up to 39.5% (Y), 16% (Cb) and 29.9% (Cr) compared with AdaptiveQP.
MMOct 26, 2017
JND-Based Perceptual Video Coding for 4:4:4 Screen Content Data in HEVCLee Prangnell, Victor Sanchez
The JCT-VC standardized Screen Content Coding (SCC) extension in the HEVC HM RExt + SCM reference codec offers an impressive coding efficiency performance when compared with HM RExt alone; however, it is not significantly perceptually optimized. For instance, it does not include advanced HVS-based perceptual coding methods, such as JND-based spatiotemporal masking schemes. In this paper, we propose a novel JND-based perceptual video coding technique for HM RExt + SCM. The proposed method is designed to further improve the compression performance of HM RExt + SCM when applied to YCbCr 4:4:4 SC video data. In the proposed technique, luminance masking and chrominance masking are exploited to perceptually adjust the Quantization Step Size (QStep) at the Coding Block (CB) level. Compared with HM RExt 16.10 + SCM 8.0, the proposed method considerably reduces bitrates (Kbps), with a maximum reduction of 48.3%. In addition to this, the subjective evaluations reveal that SC-PAQ achieves visually lossless coding at very low bitrates.
MMDec 23, 2016
Cross-Color Channel Perceptually Adaptive Quantization for HEVCLee Prangnell, Miguel Hernández-Cabronero, Victor Sanchez
HEVC includes a Coding Unit (CU) level luminance-based perceptual quantization technique known as AdaptiveQP. AdaptiveQP perceptually adjusts the Quantization Parameter (QP) at the CU level based on the spatial activity of raw input video data in a luma Coding Block (CB). In this paper, we propose a novel cross-color channel adaptive quantization scheme which perceptually adjusts the CU level QP according to the spatial activity of raw input video data in the constituent luma and chroma CBs; i.e., the combined spatial activity across all three color channels (the Y, Cb and Cr channels). Our technique is evaluated in HM 16 with 4:4:4, 4:2:2 and 4:2:0 YCbCr JCT-VC test sequences. Both subjective and objective visual quality evaluations are undertaken during which we compare our method with AdaptiveQP. Our technique achieves considerable coding efficiency improvements, with maximum BD-Rate reductions of 15.9% (Y), 13.1% (Cr) and 16.1% (Cb) in addition to a maximum decoding time reduction of 11.0%.
AIOct 5, 2016
The Predictive Context Tree: Predicting Contexts and InteractionsAlasdair Thomason, Nathan Griffiths, Victor Sanchez
With a large proportion of people carrying location-aware smartphones, we have an unprecedented platform from which to understand individuals and predict their future actions. This work builds upon the Context Tree data structure that summarises the historical contexts of individuals from augmented geospatial trajectories, and constructs a predictive model for their likely future contexts. The Predictive Context Tree (PCT) is constructed as a hierarchical classifier, capable of predicting both the future locations that a user will visit and the contexts that a user will be immersed within. The PCT is evaluated over real-world geospatial trajectories, and compared against existing location extraction and prediction techniques, as well as a proposed hybrid approach that uses identified land usage elements in combination with machine learning to predict future interactions. Our results demonstrate that higher predictive accuracies can be achieved using this hybrid approach over traditional extracted location datasets, and the PCT itself matches the performance of the hybrid approach at predicting future interactions, while adding utility in the form of context predictions. Such a prediction system is capable of understanding not only where a user will visit, but also their context, in terms of what they are likely to be doing.
MMSep 21, 2016
Minimizing Compression Artifacts for High Resolutions with Adaptive Quantization Matrices for HEVCLee Prangnell, Victor Sanchez
Visual Display Units (VDUs), capable of displaying video data at High Definition (HD) and Ultra HD (UHD) resolutions, are frequently employed in a variety of technological domains. Quantization-induced video compression artifacts, which are usually unnoticeable in low resolution environments, are typically conspicuous on high resolution VDUs and video data. The default quantization matrices (QMs) in HEVC do not take into account specific display resolutions of VDUs or video data to determine the appropriate levels of quantization required to reduce unwanted compression artifacts. Therefore, we propose a novel, adaptive quantization matrix technique for the HEVC standard including Scalable HEVC (SHVC). Our technique, which is based on a refinement of the current QM technique in HEVC, takes into consideration specific display resolutions of the target VDUs in order to minimize compression artifacts. We undertake a thorough evaluation of the proposed technique by utilizing SHVC SHM 9.0 (two-layered bit-stream) and the BD-Rate and SSIM metrics. For the BD-Rate evaluation, the proposed method achieves maximum BD-Rate reductions of 56.5% in the enhancement layer. For the SSIM evaluation, our technique achieves a maximum structural improvement of 0.8660 vs. 0.8538.
MMSep 15, 2016
Color-Based Coding Unit Level Adaptive Quantization for HEVCLee Prangnell, Victor Sanchez
HEVC HM 16 includes a Coding Unit (CU) level perceptual quantization technique named AdaptiveQP. AdaptiveQP adjusts the Quantization Parameter (QP) at the CU level based on the spatial activity of samples in the four constituent NxN sub-blocks of the luma Coding Block (CB), which is contained within a 2Nx2N CU. In this paper, we propose C-BAQ, which, in contrast to AdaptiveQP, adjusts the CU level QP according to the spatial activity of samples in the four constituent NxN sub-blocks of both the luma and chroma CBs. By computing the sum of luma, chroma Cb and chroma Cr spatial activity in a CU, a richer reflection of spatial activity in the CU is attained. Therefore, a more appropriate CU level QP can be selected, thus leading to important improvements in terms of coding efficiency. We evaluate the proposed technique in HEVC HM 16.7 using 4:4:4, 4:2:2 and 4:2:0 YCbCr sequences. Both subjective and objective evaluations are undertaken during which we compare C-BAQ with AdaptiveQP. The objective evaluation reveals that C-BAQ attains a maximum BD-Rate reduction of 15.9% (Y), 13.1% (Cr) and 16.1% (Cb) in addition to a maximum decoding time reduction of 11.0%.
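The idea of summing spatial activity over the luma and chroma CBs can be sketched as below. The abstract does not say how the four NxN sub-block measurements are combined or how the summed activity is mapped to a QP, so this sketch assumes an activity of one plus the minimum sub-block variance and stops before any QP mapping. The function names `block_activity` and `cu_activity` are hypothetical.

```python
import numpy as np

def block_activity(cb):
    """Spatial activity of one coding block (assumed definition).

    Splits the block into its four quadrant sub-blocks and returns
    1 + the minimum of their sample variances. This combination rule is
    an assumption, not taken from the abstract.
    """
    cb = np.asarray(cb, dtype=float)
    h, w = cb.shape
    hh, hw = h // 2, w // 2
    sub_blocks = [cb[:hh, :hw], cb[:hh, hw:], cb[hh:, :hw], cb[hh:, hw:]]
    return 1.0 + min(float(s.var()) for s in sub_blocks)

def cu_activity(luma_cb, cb_cb, cr_cb):
    """Sum of luma, chroma Cb and chroma Cr spatial activity in a CU."""
    return block_activity(luma_cb) + block_activity(cb_cb) + block_activity(cr_cb)

# Flat luma with textured chroma: a luma-only measure would report no
# texture here, while the summed measure reflects the chroma detail.
rng = np.random.default_rng(0)
flat_luma = np.full((16, 16), 128.0)
textured_chroma = rng.integers(0, 256, size=(16, 16)).astype(float)
combined = cu_activity(flat_luma, textured_chroma, textured_chroma)
```

The chroma blocks may be smaller than the luma block (for example in 4:2:0 data); the quadrant split above works for any even block size.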
DSJun 14, 2016
Context Trees: Augmenting Geospatial Trajectories with ContextAlasdair Thomason, Nathan Griffiths, Victor Sanchez
Exposing latent knowledge in geospatial trajectories has the potential to provide a better understanding of the movements of individuals and groups. Motivated by such a desire, this work presents the context tree, a new hierarchical data structure that summarises the context behind user actions in a single model. We propose a method for context tree construction that augments geospatial trajectories with land usage data to identify such contexts. Through evaluation of the construction method and analysis of the properties of generated context trees, we demonstrate the foundation for understanding and modelling behaviour afforded. Summarising user contexts into a single data structure gives easy access to information that would otherwise remain latent, providing the basis for better understanding and predicting the actions and behaviours of individuals and groups. Finally, we also present a method for pruning context trees, for use in applications where it is desirable to reduce the size of the tree while retaining useful information.