GR Sep 26, 2023
Diffusion-based Holistic Texture Rectification and Synthesis
Guoqing Hao, Satoshi Iizuka, Kensho Hara et al.
We present a novel framework for rectifying occlusions and distortions in degraded texture samples from natural images. Traditional texture synthesis approaches focus on generating textures from pristine samples, which necessitate meticulous preparation by humans and are often unattainable in most natural images. These challenges stem from the frequent occlusions and distortions of texture samples in natural images due to obstructions and variations in object surface geometry. To address these issues, we propose a framework that synthesizes holistic textures from degraded samples in natural images, extending the applicability of exemplar-based texture synthesis techniques. Our framework utilizes a conditional Latent Diffusion Model (LDM) with a novel occlusion-aware latent transformer. This latent transformer not only effectively encodes texture features from partially-observed samples necessary for the generation process of the LDM, but also explicitly captures long-range dependencies in samples with large occlusions. To train our model, we introduce a method for generating synthetic data by applying geometric transformations and free-form mask generation to clean textures. Experimental results demonstrate that our framework significantly outperforms existing methods both qualitatively and quantitatively. Furthermore, we conduct comprehensive ablation studies to validate the different components of our proposed framework. Results are corroborated by a perceptual user study which highlights the efficiency of our proposed approach.
CV Jul 26, 2022
Adaptive occlusion sensitivity analysis for visually explaining video recognition networks
Tomoki Uchiyama, Naoya Sogi, Satoshi Iizuka et al.
This paper proposes a method for visually explaining the decision-making process of video recognition networks with a temporal extension of occlusion sensitivity analysis, called Adaptive Occlusion Sensitivity Analysis (AOSA). The key idea here is to occlude a specific volume of data by a 3D mask in an input 3D temporal-spatial data space and then measure the change degree in the output score. The occluded volume data that produces a larger change degree is regarded as a more critical element for classification. However, while the occlusion sensitivity analysis is commonly used to analyze single image classification, applying this idea to video classification is not so straightforward as a simple fixed cuboid cannot deal with complicated motions. To solve this issue, we adaptively set the shape of a 3D occlusion mask while referring to motions. Our flexible mask adaptation is performed by considering the temporal continuity and spatial co-occurrence of the optical flows extracted from the input video data. We further propose a novel method to reduce the computational cost of the proposed method with the first-order approximation of the output score with respect to an input video. We demonstrate the effectiveness of our method through various and extensive comparisons with the conventional methods in terms of the deletion/insertion metric and the pointing metric on the UCF101 dataset and the Kinetics-400 and 700 datasets.
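As a point of reference for the abstract above, the following is a minimal sketch of plain occlusion sensitivity on a spatio-temporal volume using a fixed cuboid mask, i.e. the baseline that AOSA is described as improving on, not the adaptive, optical-flow-driven mask itself. The function names, the cuboid size, and the toy scoring function are all hypothetical.

```python
from itertools import product

def occlusion_sensitivity(video, score_fn, size=(1, 2, 2), fill=0.0):
    """Score change caused by occluding each non-overlapping cuboid of `video`.

    video is a nested list indexed as video[t][y][x]; size is the cuboid (dt, dy, dx).
    Returns a dict mapping the cuboid origin (t0, y0, x0) to base_score - occluded_score.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    dt, dy, dx = size
    base = score_fn(video)
    heat = {}
    for t0, y0, x0 in product(range(0, T, dt), range(0, H, dy), range(0, W, dx)):
        occluded = [[row[:] for row in frame] for frame in video]  # deep copy
        for t, y, x in product(range(t0, min(t0 + dt, T)),
                               range(y0, min(y0 + dy, H)),
                               range(x0, min(x0 + dx, W))):
            occluded[t][y][x] = fill
        heat[(t0, y0, x0)] = base - score_fn(occluded)
    return heat

def toy_score(video):
    # Hypothetical "classifier" that only looks at the top-left 2x2 patch of every frame.
    return sum(frame[y][x] for frame in video for y in range(2) for x in range(2))

video = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]  # 2 frames of 4x4 pixels
heat = occlusion_sensitivity(video, toy_score)
```

Cuboids that cover the region the toy scorer depends on receive a large change, while all others receive zero. A fixed grid like this is exactly what struggles with complicated motion, which is the motivation the abstract gives for adapting the mask shape.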
CV Aug 19, 2023
Controllable Multi-domain Semantic Artwork Synthesis
Yuantian Huang, Satoshi Iizuka, Edgar Simo-Serra et al.
We present a novel framework for multi-domain synthesis of artwork from semantic layouts. One of the main limitations of this challenging task is the lack of publicly available segmentation datasets for art synthesis. To address this problem, we propose a dataset, which we call ArtSem, that contains 40,000 images of artwork from 4 different domains with their corresponding semantic label maps. We generate the dataset by first extracting semantic maps from landscape photography and then propose a conditional Generative Adversarial Network (GAN)-based approach to generate high-quality artwork from the semantic maps without necessitating paired training data. Furthermore, we propose an artwork synthesis model that uses domain-dependent variational encoders for high-quality multi-domain synthesis. The model is improved and complemented with a simple but effective normalization method, based on normalizing both the semantic and style jointly, which we call Spatially STyle-Adaptive Normalization (SSTAN). In contrast to previous methods that only take semantic layout as input, our model is able to learn a joint representation of both style and semantic information, which leads to better generation quality for synthesizing artistic images. Results indicate that our model learns to separate the domains in the latent space, and thus, by identifying the hyperplanes that separate the different domains, we can also perform fine-grained control of the synthesized artwork. By combining our proposed dataset and approach, we are able to generate user-controllable artwork that is of higher quality than existing
LG Mar 31, 2023
Time-series Anomaly Detection based on Difference Subspace between Signal Subspaces
Takumi Kanai, Naoya Sogi, Atsuto Maki et al.
This paper proposes a new method for anomaly detection in time-series data by incorporating the concept of difference subspace into singular spectrum analysis (SSA). The key idea is to monitor slight temporal variations of the difference subspace between two signal subspaces corresponding to the past and present time-series data, as an anomaly score. It is a natural generalization of the conventional SSA-based method, which measures the minimum angle between the two signal subspaces as the degree of change. By replacing the minimum angle with the difference subspace, our method boosts the performance while using the SSA-based framework, as it can capture the whole structural difference between the two subspaces in its magnitude and direction. We demonstrate our method's effectiveness through performance evaluations on public time-series datasets.
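The conventional baseline mentioned in this abstract, the minimum angle between past and present signal subspaces, can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the paper's difference-subspace score: the window length, subspace rank, function names, and the choice of `1 - cos^2` as the score are all assumptions made here.

```python
import numpy as np

def signal_subspace(series, window, rank):
    """Orthonormal basis (window x rank) of the top left singular vectors of the trajectory matrix."""
    series = np.asarray(series, dtype=float)
    cols = len(series) - window + 1
    hankel = np.stack([series[i:i + window] for i in range(cols)], axis=1)
    u, _, _ = np.linalg.svd(hankel, full_matrices=False)
    return u[:, :rank]

def min_angle_score(past, present, window=10, rank=2):
    """1 - cos^2 of the smallest canonical angle between the two signal subspaces."""
    u_past = signal_subspace(past, window, rank)
    u_pres = signal_subspace(present, window, rank)
    # Singular values of U_past^T U_pres are the cosines of the canonical angles.
    cosines = np.linalg.svd(u_past.T @ u_pres, compute_uv=False)
    return 1.0 - float(np.clip(cosines[0], 0.0, 1.0)) ** 2

t = np.arange(60)
score_same = min_angle_score(np.sin(0.3 * t), np.sin(0.3 * t + 1.0))  # same frequency, shifted phase
score_changed = min_angle_score(np.sin(0.3 * t), np.sin(1.2 * t))     # frequency change
```

A phase shift leaves the signal subspace of a sinusoid unchanged, so the first score stays near zero, whereas a frequency change moves the subspace and raises the score. The smallest angle is a single number; the abstract's argument is that a difference subspace retains more of the structural change than this scalar does.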
CV Nov 25, 2023
Occlusion Sensitivity Analysis with Augmentation Subspace Perturbation in Deep Feature Space
Pedro Valois, Koichiro Niinuma, Kazuhiro Fukui
Deep learning with neural networks has gained prominence in multiple life-critical applications like medical diagnoses and autonomous vehicle accident investigations. However, concerns about model transparency and biases persist. Explainable methods are viewed as the solution to address these challenges. In this study, we introduce the Occlusion Sensitivity Analysis with Deep Feature Augmentation Subspace (OSA-DAS), a novel perturbation-based interpretability approach for computer vision. While traditional perturbation methods use only occlusions to explain the model predictions, OSA-DAS extends standard occlusion sensitivity analysis by enabling the integration with diverse image augmentations. Distinctly, our method utilizes the output vector of a DNN to build low-dimensional subspaces within the deep feature vector space, offering a more precise explanation of the model prediction. The structural similarity between these subspaces encompasses the influence of diverse augmentations and occlusions. We test extensively on ImageNet-1k, and our class- and model-agnostic approach outperforms commonly used interpreters, setting it apart in the realm of explainable AI.
LG Sep 13, 2024
Second-order difference subspace
Kazuhiro Fukui, Pedro H. V. Valois, Lincon Souza et al.
Subspace representation is a fundamental technique in various fields of machine learning. Analyzing a geometrical relationship among multiple subspaces is essential for understanding subspace series' temporal and/or spatial dynamics. This paper proposes the second-order difference subspace, a higher-order extension of the first-order difference subspace between two subspaces that can analyze the geometrical difference between them. As a preliminary for that, we extend the definition of the first-order difference subspace to the more general setting that two subspaces with different dimensions have an intersection. We then define the second-order difference subspace by combining the concept of first-order difference subspace and principal component subspace (Karcher mean) between two subspaces, motivated by the second-order central difference method. We can understand that the first/second-order difference subspaces correspond to the velocity and acceleration of subspace dynamics from the viewpoint of a geodesic on a Grassmann manifold. We demonstrate the validity and naturalness of our second-order difference subspace by showing numerical results on two applications: temporal shape analysis of a 3D object and time series analysis of a biometric signal.
CL Dec 10, 2024
Frame Representation Hypothesis: Multi-Token LLM Interpretability and Concept-Guided Text Generation
Pedro H. V. Valois, Lincon S. Souza, Erica K. Shimomoto et al.
Interpretability is a key challenge in fostering trust for Large Language Models (LLMs), which stems from the complexity of extracting reasoning from a model's parameters. We present the Frame Representation Hypothesis, a theoretically robust framework grounded in the Linear Representation Hypothesis (LRH) to interpret and control LLMs by modeling multi-token words. Prior research explored LRH to connect LLM representations with linguistic concepts, but was limited to single token analysis. As most words are composed of several tokens, we extend LRH to multi-token words, thereby enabling usage on any textual data with thousands of concepts. To this end, we propose that words can be interpreted as frames, ordered sequences of vectors that better capture token-word relationships. Then, concepts can be represented as the average of word frames sharing a common concept. We showcase these tools through Top-k Concept-Guided Decoding, which can intuitively steer text generation using concepts of choice. We verify said ideas on Llama 3.1, Gemma 2, and Phi 3 families, demonstrating gender and language biases, exposing harmful content, but also potential to remediate them, leading to safer and more transparent LLMs. Code is available at https://github.com/phvv-me/frame-representation-hypothesis.git
CL Apr 12, 2025
From Punchlines to Predictions: A Metric to Assess LLM Performance in Identifying Humor in Stand-Up Comedy
Adrianna Romanowski, Pedro H. V. Valois, Kazuhiro Fukui
Comedy serves as a profound reflection of the times we live in and is a staple element of human interactions. In light of the widespread adoption of Large Language Models (LLMs), the intersection of humor and AI has become no laughing matter. Advancements in the naturalness of human-computer interaction correlate with improvements in AI systems' abilities to understand humor. In this study, we assess the ability of models to accurately identify humorous quotes from a stand-up comedy transcript. Stand-up comedy's unique comedic narratives make it an ideal dataset to improve the overall naturalness of comedic understanding. We propose a novel humor detection metric designed to evaluate LLMs amongst various prompts on their capability to extract humorous punchlines. The metric has a modular structure that offers three different scoring methods - fuzzy string matching, sentence embedding, and subspace similarity - to provide an overarching assessment of a model's performance. The model's results are compared against those of human evaluators on the same task. Our metric reveals that regardless of prompt engineering, leading models, ChatGPT, Claude, and DeepSeek, achieve scores of at most 51% in humor detection. Notably, this performance surpasses that of humans who achieve a score of 41%. The analysis of human evaluators and LLMs reveals variability in agreement, highlighting the subjectivity inherent in humor and the complexities involved in extracting humorous quotes from live performance transcripts. Code available at https://github.com/swaggirl9000/humor.
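Of the three scoring modes named in this abstract, fuzzy string matching is the easiest to illustrate. The sketch below uses Python's standard `difflib.SequenceMatcher`; the abstract does not say which matching routine or aggregation the authors use, so the best-match-then-average scheme and the function name `fuzzy_score` are assumptions made purely for illustration.

```python
from difflib import SequenceMatcher

def fuzzy_score(predicted, reference):
    """Average, over reference punchlines, of the best similarity ratio against any predicted quote."""
    if not reference:
        return 0.0

    def ratio(a, b):
        # Case-insensitive similarity in [0, 1].
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    best = [max((ratio(ref, p) for p in predicted), default=0.0) for ref in reference]
    return sum(best) / len(best)

reference = ["I told my plants a joke and they wilted"]
exact = fuzzy_score(["I told my plants a joke and they wilted"], reference)
partial = fuzzy_score(["told my plants a joke"], reference)
missing = fuzzy_score([], reference)
```

Fuzzy matching tolerates small transcription differences between a model's extracted quote and the annotated punchline, which exact string comparison would count as a miss.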
CV Mar 20
Improving Image-to-Image Translation via a Rectified Flow Reformulation
Satoshi Iizuka, Shun Okamoto, Kazuhiro Fukui
In this work, we propose Image-to-Image Rectified Flow Reformulation (I2I-RFR), a practical plug-in reformulation that recasts standard I2I regression networks as continuous-time transport models. While pixel-wise I2I regression is simple, stable, and easy to adapt across tasks, it often over-smooths ill-posed and multimodal targets, whereas generative alternatives often require additional components, task-specific tuning, and more complex training and inference pipelines. Our method augments the backbone input by channel-wise concatenation with a noise-corrupted version of the ground-truth target and optimizes a simple t-reweighted pixel loss. This objective admits a rectified-flow interpretation via an induced velocity field, enabling ODE-based progressive refinement at inference time while largely preserving the standard supervised training pipeline. In most cases, adopting I2I-RFR requires only expanding the input channels, and inference can be performed with a few explicit solver steps (e.g., 3 steps) without distillation. Extensive experiments across multiple image-to-image translation and video restoration tasks show that I2I-RFR generally improves performance across a wide range of tasks and backbones, with particularly clear gains in perceptual quality and detail preservation. Overall, I2I-RFR provides a lightweight way to incorporate continuous-time refinement into conventional I2I models without requiring a heavy generative pipeline.
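The few-step ODE refinement mentioned in this abstract can be pictured with a generic explicit Euler loop. The sketch below assumes the common rectified-flow convention x_t = (1 - t) * noise + t * target, under which a network that predicts the clean target induces the velocity (prediction - x_t) / (1 - t). The paper's exact parametrization, time direction, and loss weighting are not given in the abstract, so every name and formula here is an assumption.

```python
def refine(source, predict_target, steps=3, start=None):
    """Explicit Euler integration of dx/dt = (predict_target(source, x, t) - x) / (1 - t) on [0, 1].

    predict_target is a hypothetical stand-in for a backbone that sees the source image
    together with the current noisy estimate x and returns an estimate of the clean target.
    """
    x = list(start) if start is not None else [0.0] * len(source)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # the largest t used is (steps - 1) / steps, so 1 - t is never zero
        estimate = predict_target(source, x, t)
        velocity = [(e - xi) / (1.0 - t) for e, xi in zip(estimate, x)]
        x = [xi + dt * v for xi, v in zip(x, velocity)]
    return x

# Toy "backbone" that already knows the mapping: target = 2 * source.
perfect = lambda source, x, t: [2.0 * s for s in source]
result = refine([1.0, -0.5], perfect, steps=3, start=[0.3, 0.7])
```

With a perfect predictor the final Euler step lands on the target regardless of the starting noise. With an imperfect predictor, earlier steps give later ones a progressively cleaner input, which is the intuition behind progressive refinement.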
CV Oct 13, 2024
Point Cloud Novelty Detection Based on Latent Representations of a General Feature Extractor
Shizuka Akahori, Satoshi Iizuka, Ken Mawatari et al.
We propose an effective unsupervised 3D point cloud novelty detection approach, leveraging a general point cloud feature extractor and a one-class classifier. The general feature extractor consists of a graph-based autoencoder and is trained once on a point cloud dataset such as a mathematically generated fractal 3D point cloud dataset that is independent of normal/abnormal categories. The input point clouds are first converted into latent vectors by the general feature extractor, and then one-class classification is performed on the latent vectors. Compared to existing methods measuring the reconstruction error in 3D coordinate space, our approach utilizes latent representations where the shape information is condensed, which allows more direct and effective novelty detection. We confirm that our general feature extractor can extract shape features of unseen categories, eliminating the need for autoencoder re-training and reducing the computational burden. We validate the performance of our method through experiments on several subsets of the ShapeNet dataset and demonstrate that our latent-based approach outperforms the existing methods.
CV Aug 22, 2025
Attention Mechanism in Randomized Time Warping
Yutaro Hiraoka, Kazuya Okamura, Kota Suto et al.
This paper reveals that we can interpret the fundamental function of Randomized Time Warping (RTW) as a type of self-attention mechanism, a core technology of Transformers in motion recognition. Self-attention is a mechanism that enables models to identify and weigh the importance of different parts of an input sequential pattern. On the other hand, RTW is a general extension of Dynamic Time Warping (DTW), a technique commonly used for matching and comparing sequential patterns. In essence, RTW searches for optimal contribution weights for each element of the input sequential patterns to produce discriminative features. Although the two approaches look different, these contribution weights can be interpreted as self-attention weights. In fact, the two weight patterns look similar, producing a high average correlation of 0.80 across the ten smallest canonical angles. However, they work in different ways: RTW attention operates on an entire input sequential pattern, while self-attention focuses on only a local view, which is a subset of the input sequential pattern, because of the computational costs of the self-attention matrix. This targeting difference leads to an advantage of RTW over Transformers, as demonstrated by the 5% performance improvement on the Something-Something V2 dataset.
CV Mar 7, 2025
Separability Membrane: 3D Active Contour for Point Cloud Surface Reconstruction
Gulpi Qorik Oktagalu Pratamasunu, Guoqing Hao, Kazuhiro Fukui
This paper proposes Separability Membrane, a robust 3D active contour for extracting a surface from a 3D point cloud object. Our approach defines the surface of a 3D object as the boundary that maximizes the separability of point features, such as intensity, color, or local density, between its inner and outer regions based on Fisher's ratio. Separability Membrane identifies the exact surface of a 3D object by maximizing class separability while controlling the rigidity of the 3D surface model with an adaptive B-spline surface that adjusts its properties based on the local and global separability. A key advantage of our method is its ability to accurately reconstruct surface boundaries even when they are ambiguous due to noise or outliers, without requiring any training data or conversion to volumetric representation. Evaluations on a synthetic 3D point cloud dataset and the 3DNet dataset demonstrate the membrane's effectiveness and robustness under diverse conditions.
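The separability criterion in this abstract can be illustrated with a small two-class Fisher-style ratio on scalar point features. The normalized form used here (between-class variance divided by total variance, giving a value in [0, 1]) is one common choice and is an assumption; the abstract does not give the exact formula, and the function name is hypothetical.

```python
def separability(inner, outer):
    """Between-class variance over total variance for two non-empty groups of scalar features."""
    values = list(inner) + list(outer)
    mean = sum(values) / len(values)
    total = sum((v - mean) ** 2 for v in values)
    if total == 0.0:
        return 0.0  # all features identical: the regions cannot be separated
    mean_in = sum(inner) / len(inner)
    mean_out = sum(outer) / len(outer)
    between = len(inner) * (mean_in - mean) ** 2 + len(outer) * (mean_out - mean) ** 2
    return between / total

clean_boundary = separability([1.0, 1.0, 1.0], [0.0, 0.0, 0.0])  # regions differ completely
no_boundary = separability([0.0, 1.0], [0.0, 1.0])               # regions look the same
```

A candidate surface that splits the features cleanly scores 1, and one that splits nothing scores 0, so moving the surface to increase this value pulls it toward the true boundary without any training data.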
CV Nov 8, 2021
Grassmannian learning mutual subspace method for image set recognition
Lincon S. Souza, Naoya Sogi, Bernardo B. Gatto et al.
This paper addresses the problem of object recognition given a set of images as input (e.g., multiple camera sources and video frames). Convolutional neural network (CNN)-based frameworks do not exploit these sets effectively, processing a pattern as observed, not capturing the underlying feature distribution as it does not consider the variance of images in the set. To address this issue, we propose the Grassmannian learning mutual subspace method (G-LMSM), a NN layer embedded on top of CNNs as a classifier, that can process image sets more effectively and can be trained in an end-to-end manner. The image set is represented by a low-dimensional input subspace; and this input subspace is matched with reference subspaces by a similarity of their canonical angles, an interpretable and easy to compute metric. The key idea of G-LMSM is that the reference subspaces are learned as points on the Grassmann manifold, optimized with Riemannian stochastic gradient descent. This learning is stable, efficient and theoretically well-grounded. We demonstrate the effectiveness of our proposed method on hand shape recognition, face identification, and facial emotion recognition.
QM Mar 18, 2021
Discriminative Singular Spectrum Classifier with Applications on Bioacoustic Signal Recognition
Bernardo B. Gatto, Juan G. Colonna, Eulanda M. dos Santos et al.
Automatic analysis of bioacoustic signals is a fundamental tool to evaluate the vitality of our planet. Frogs and bees, for instance, may act like biological sensors providing information about environmental changes. This task, fundamental for ecological monitoring, still includes many challenges such as nonuniform signal length processing, degraded target signal due to environmental noise, and the scarcity of labeled samples for training machine learning models. To tackle these challenges, we present a bioacoustic signal classifier equipped with a discriminative mechanism to extract useful features for analysis and classification efficiently. The proposed classifier does not require a large amount of training data and handles nonuniform signal length natively. Unlike current bioacoustic recognition methods, which are task-oriented, the proposed model relies on transforming the input signals into vector subspaces generated by applying Singular Spectrum Analysis (SSA). Then, a subspace is designed to expose discriminative features. The proposed model shares end-to-end capabilities, which is desirable in modern machine learning systems. This formulation provides a segmentation-free and noise-tolerant approach to represent and classify bioacoustic signals and a highly compact signal descriptor inherited from SSA. The validity of the proposed method is verified using three challenging bioacoustic datasets containing anuran, bee, and mosquito species. Experimental results on these datasets show the competitive performance of the proposed method compared to commonly employed methods for bioacoustic signal classification in terms of accuracy.
LG Oct 29, 2019
Discriminant analysis based on projection onto generalized difference subspace
Kazuhiro Fukui, Naoya Sogi, Takumi Kobayashi et al.
This paper discusses a new type of discriminant analysis based on the orthogonal projection of data onto a generalized difference subspace (GDS). In our previous work, we have demonstrated that GDS projection works as the quasi-orthogonalization of class subspaces, which is an effective feature extraction for subspace based classifiers. Interestingly, GDS projection also works as a discriminant feature extraction through a similar mechanism to the Fisher discriminant analysis (FDA). A direct proof of the connection between GDS projection and FDA is difficult due to the significant difference in their formulations. To avoid the difficulty, we first introduce geometrical Fisher discriminant analysis (gFDA) based on a simplified Fisher criterion. Our simplified Fisher criterion is derived from a heuristic yet practically plausible principle: the direction of the sample mean vector of a class is in most cases almost equal to that of the first principal component vector of the class, under the condition that the principal component vectors are calculated by applying the principal component analysis (PCA) without data centering. gFDA can work stably even under few samples, bypassing the small sample size (SSS) problem of FDA. Next, we prove that gFDA is equivalent to GDS projection with a small correction term. This equivalence ensures GDS projection to inherit the discriminant ability from FDA via gFDA. Furthermore, to enhance the performances of gFDA and GDS projection, we normalize the projected vectors on the discriminant spaces. Extensive experiments using the extended Yale B+ database and the CMU face database show that gFDA and GDS projection have equivalent or better performance than the original FDA and its extensions.
CV Sep 26, 2019
Resolving Marker Pose Ambiguity by Robust Rotation Averaging with Clique Constraints
Shin-Fang Ch'ng, Naoya Sogi, Pulak Purkait et al.
Planar markers are useful in robotics and computer vision for mapping and localisation. Given a detected marker in an image, a frequent task is to estimate the 6DOF pose of the marker relative to the camera, which is an instance of planar pose estimation (PPE). Although there are mature techniques, PPE suffers from a fundamental ambiguity problem, in that there can be more than one plausible pose solution for a PPE instance. Especially when localisation of the marker corners is noisy, it is often difficult to disambiguate the pose solutions based on reprojection error alone. Previous methods choose between the possible solutions using heuristic criteria, or simply ignore ambiguous markers. We propose to resolve the ambiguities by examining the consistencies of a set of markers across multiple views. Our specific contributions include a novel rotation averaging formulation that incorporates long-range dependencies between possible marker orientation solutions that arise from PPE ambiguities. We analyse the combinatorial complexity of the problem, and develop a novel lifted algorithm to effectively resolve marker pose ambiguities, without discarding any marker observations. Results on real and synthetic data show that our method is able to handle highly ambiguous inputs, and provides more accurate and/or complete marker-based mapping and localisation.
LG Sep 4, 2019
Tensor Analysis with n-Mode Generalized Difference Subspace
Bernardo B. Gatto, Eulanda M. dos Santos, Alessandro L. Koerich et al.
The increasing use of multiple sensors, which produce a large amount of multi-dimensional data, requires efficient representation and classification methods. In this paper, we present a new method for multi-dimensional data classification that relies on two premises: 1) multi-dimensional data are usually represented by tensors, since this brings benefits from multilinear algebra and established tensor factorization methods; and 2) multilinear data can be described by a subspace of a vector space. The subspace representation has been employed for pattern-set recognition, and its tensor representation counterpart is also available in the literature. However, traditional methods do not use discriminative information of the tensors, degrading the classification accuracy. In this case, generalized difference subspace (GDS) provides an enhanced subspace representation by reducing data redundancy and revealing discriminative structures. Since GDS does not handle tensor data, we propose a new projection called n-mode GDS, which efficiently handles tensor data. We also introduce the n-mode Fisher score as a class separability index and an improved metric based on the geodesic distance for tensor data similarity. The experimental results on gesture and action recognition show that the proposed method outperforms methods commonly used in the literature without relying on pre-trained models or transfer learning.
CV Mar 14, 2019
Constrained Mutual Convex Cone Method for Image Set Based Recognition
Naoya Sogi, Rui Zhu, Jing-Hao Xue et al.
In this paper, we propose a method for image-set classification based on convex cone models. Image set classification aims to classify a set of images, which were usually obtained from video frames or multi-view cameras, into a target object. To accurately and stably classify a set, it is essential to represent structural information of the set accurately. There are various representative image features, such as histogram based features, HLAC, and Convolutional Neural Network (CNN) features. We should note that most of them have non-negativity and thus can be effectively represented by a convex cone. This leads us to introduce the convex cone representation to image-set classification. To establish a convex cone based framework, we mathematically define multiple angles between two convex cones, and then define the geometric similarity between the cones using the angles. Moreover, to enhance the framework, we introduce a discriminant space that maximizes the between-class variance (gaps) and minimizes the within-class variance of the projected convex cones onto the discriminant space, similar to the Fisher discriminant analysis. Finally, the classification is performed based on the similarity between projected convex cones. The effectiveness of the proposed method is demonstrated experimentally by using five databases: CMU PIE dataset, ETH-80, CMU Motion of Body dataset, Youtube Celebrity dataset, and a private database of multi-view hand shapes.
ML Jun 8, 2018
Text Classification based on Word Subspace with Term-Frequency
Erica K. Shimomoto, Lincon S. Souza, Bernardo B. Gatto et al.
Text classification has become indispensable due to the rapid increase of text in digital form. Over the past three decades, efforts have been made to approach this task using various learning algorithms and statistical models based on bag-of-words (BOW) features. Despite their simple implementation, BOW features lack semantic meaning representation. To solve this problem, neural networks started to be employed to learn word vectors, such as word2vec. Word2vec embeds word semantic structure into vectors, where the angle between vectors indicates the meaningful similarity between words. To measure the similarity between texts, we propose the novel concept of word subspace, which can represent the intrinsic variability of features in a set of word vectors. Through this concept, it is possible to model text from word vectors while holding semantic information. To incorporate the word frequency directly in the subspace model, we further extend the word subspace to the term-frequency (TF) weighted word subspace. Based on these new concepts, text classification can be performed under the mutual subspace method (MSM) framework. The validity of our modeling is shown through experiments on the Reuters text database, comparing the results to various state-of-the-art algorithms.
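The word-subspace idea in this abstract can be sketched with NumPy: a text is represented by the principal directions of its word vectors (computed without centering), and two texts are compared through the canonical angles between their subspaces, as in the mutual subspace method. The row scaling by the square root of the term frequency and the use of the mean squared cosine as the similarity are assumptions made for this sketch, and the tiny 3-dimensional "embeddings" are hypothetical stand-ins for word2vec vectors.

```python
import numpy as np

def word_subspace(vectors, dim, term_freq=None):
    """Basis (embedding_dim x dim) spanning the top principal directions of the word vectors (rows)."""
    x = np.asarray(vectors, dtype=float)
    if term_freq is not None:
        # Weighting each outer product x_i x_i^T by its frequency equals scaling rows by sqrt(tf).
        x = x * np.sqrt(np.asarray(term_freq, dtype=float))[:, None]
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return vt[:dim].T

def subspace_similarity(basis_a, basis_b):
    """Mean squared cosine of the canonical angles between two subspaces."""
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return float(np.mean(np.clip(cosines, 0.0, 1.0) ** 2))

doc_a = word_subspace([[1, 0, 0], [0, 1, 0]], dim=2)
doc_b = word_subspace([[0, 1, 0], [1, 0, 0]], dim=2, term_freq=[3, 1])
doc_c = word_subspace([[0, 0, 1]], dim=1)
same_topic = subspace_similarity(doc_a, doc_b)
other_topic = subspace_similarity(doc_a, doc_c)
```

Here `doc_a` and `doc_b` span the same plane despite different word order and frequencies, so their similarity is close to 1, while `doc_c` is orthogonal to both and scores close to 0. Classification under this framework would assign a text to the class whose reference subspace is most similar.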
CV May 31, 2018
A Method Based on Convex Cone Model for Image-Set Classification with CNN Features
Naoya Sogi, Taku Nakayama, Kazuhiro Fukui
In this paper, we propose a method for image-set classification based on convex cone models, focusing on the effectiveness of convolutional neural network (CNN) features as inputs. CNN features have non-negative values when using the rectified linear unit as an activation function. This naturally leads us to model a set of CNN features by a convex cone and measure the geometric similarity of convex cones for classification. To establish this framework, we sequentially define multiple angles between two convex cones by repeating the alternating least squares method and then define the geometric similarity between the cones using the obtained angles. Moreover, to enhance our method, we introduce a discriminant space, maximizing the between-class variance (gaps) and minimizing the within-class variance of the projected convex cones onto the discriminant space, similar to a Fisher discriminant analysis. Finally, classification is based on the similarity between projected convex cones. The effectiveness of the proposed method was demonstrated experimentally using a private, multi-view hand shape dataset and two public databases.