43.8 IV May 29
Training-Free Continuous Bitrate Control for Scalable Image Coding for Humans and Machines
Yui Tatsumi, Hiroshi Watanabe
Continuous variable-rate compression is highly demanded in real-world applications, but remains underexplored in scalable image coding for humans and machines. In this paper, we propose a training-free variable-rate scalable image coding framework. By adjusting quantization steps based on predicted scale values, the proposed method achieves continuous bitrate control while preserving high-scale information in the machine and enhancement layers. Experimental results demonstrate the effectiveness of the proposed method and highlight the importance of bitrate allocation between the two layers.
CV Apr 3, 2023
Accuracy Improvement of Object Detection in VVC Coded Video Using YOLO-v7 Features
Takahiro Shindo, Taiju Watanabe, Kein Yamada et al.
With advances in image recognition technology based on deep learning, automatic video analysis by Artificial Intelligence is becoming more widespread. As the amount of video used for image recognition increases, efficient compression methods for such video data are necessary. In general, when the image quality deteriorates due to image encoding, the image recognition accuracy also falls. Therefore, in this paper, we propose a neural-network-based approach that improves image recognition accuracy, especially object detection accuracy, by applying post-processing to the encoded video. Versatile Video Coding (VVC) is used as the video compression method, since it is the latest video coding method with the best encoding performance. The neural network is trained using the features of YOLO-v7, the latest object detection model. By using VVC as the video coding method and YOLO-v7 as the detection model, high object detection accuracy is achieved even at low bit rates. Experimental results show that the combination of the proposed method and VVC achieves better coding performance than regular VVC in object detection accuracy.
CV Aug 27, 2023
Image Coding for Machines with Object Region Learning
Takahiro Shindo, Taiju Watanabe, Kein Yamada et al.
Compression technology is essential for efficient image transmission and storage. With the rapid advances in deep learning, images are beginning to be used for image recognition as well as for human vision. For this reason, research has been conducted on image coding for image recognition, and this field is called Image Coding for Machines (ICM). There are two main approaches in ICM: the ROI-based approach and the task-loss-based approach. The former approach has the problem of requiring an ROI-map as input in addition to the input image. The latter approach has the problems of difficulty in learning the task-loss, and lack of robustness because the specific image recognition model is used to compute the loss function. To solve these problems, we propose an image compression model that learns object regions. Our model does not require additional information as input, such as an ROI-map, and does not use task-loss. Therefore, it is possible to compress images for various image recognition models. In the experiments, we demonstrate the versatility of the proposed method by using three different image recognition models and three different datasets. In addition, we verify the effectiveness of our model by comparing it with previous methods.
CV Mar 7, 2024
Image Coding for Machines with Edge Information Learning Using Segment Anything
Takahiro Shindo, Kein Yamada, Taiju Watanabe et al.
Image Coding for Machines (ICM) is an image compression technique for image recognition. This technique is essential due to the growing demand for image recognition AI. In this paper, we propose a method for ICM that focuses on encoding and decoding only the edge information of object parts in an image, which we call SA-ICM. This is a Learned Image Compression (LIC) model trained using edge information created by Segment Anything. Our method can be used for image recognition models with various tasks. SA-ICM is also robust to changes in input data, making it effective for a variety of use cases. Additionally, our method provides benefits from a privacy point of view, as it removes human facial information on the encoder's side, thus protecting one's privacy. Furthermore, this LIC model training method can be used to train Neural Representations for Videos (NeRV), which is a video compression model. By training NeRV using edge information created by Segment Anything, it is possible to create a NeRV that is effective for image recognition (SA-NeRV). Experimental results confirm the advantages of SA-ICM, presenting the best performance in image compression for image recognition. We also show that SA-NeRV is superior to ordinary NeRV in video compression for machines. Code is available at https://github.com/final-0/SA-ICM.
CV Sep 27, 2024
Neural Video Representation for Redundancy Reduction and Consistency Preservation
Taiga Hayami, Takahiro Shindo, Shunsuke Akamatsu et al.
Implicit neural representations (INRs) embed various signals into neural networks. They have gained attention in recent years because of their versatility in handling diverse signal types. In the context of video, INRs achieve video compression by embedding video signals directly into networks and compressing them. Conventional methods use as network inputs either an index that expresses the time of the frame or features extracted from individual frames. The latter method provides greater expressive capability as the input is specific to each video. However, the features extracted from frames often contain redundancy, which contradicts the purpose of video compression. Additionally, such redundancies make it challenging to accurately reconstruct high-frequency components in the frames. To address these problems, we focus on separating the high-frequency and low-frequency components of the reconstructed frame. We propose a video representation method that generates both the high-frequency and low-frequency components of the frame, using features extracted from the high-frequency components and temporal information, respectively. Experimental results demonstrate that our method outperforms the existing HNeRV method, achieving superior results in 96 percent of the videos.
IV Nov 8, 2025
Training-Free Adaptive Quantization for Variable Rate Image Coding for Machines
Yui Tatsumi, Ziyue Zeng, Hiroshi Watanabe
Image Coding for Machines (ICM) has become increasingly important with the rapid integration of computer vision into real-world applications. However, most ICM frameworks utilize learned image compression (LIC) models that operate at a fixed rate and require separate training for each target bitrate, which may limit their practical applications. Existing variable rate LIC approaches mitigate this limitation but typically depend on training, increasing computational cost and deployment complexity. Moreover, variable rate control has not been thoroughly explored for ICM. To address these challenges, we propose a training-free, adaptive quantization step size control scheme that enables flexible bitrate adjustment. By leveraging both channel-wise entropy dependencies and spatial scale parameters predicted by the hyperprior network, the proposed method preserves semantically important regions while coarsely quantizing less critical areas. The bitrate can be continuously controlled through a single parameter. Experimental results demonstrate the effectiveness of our proposed method, achieving up to 11.07% BD-rate savings over the non-adaptive variable rate method.
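The idea of steering quantization step sizes with predicted scale values can be illustrated with a toy example. This is a minimal sketch and not the authors' scheme: the linear mapping in `adaptive_steps`, the function names, and the parameter `q` are all assumptions made for illustration, and the channel-wise entropy dependencies mentioned in the abstract are not modeled. Elements with a large predicted scale keep a fine step, elements with a small scale get a coarser one, and a single parameter controls how aggressive the coarsening is.

```python
def adaptive_steps(scales, q):
    """Map predicted scale values to quantization step sizes.

    Hypothetical rule (an assumption, not taken from the paper): the element
    with the largest scale keeps step 1.0, and the step grows linearly
    towards 1.0 + q as the scale approaches zero.
    """
    s_max = max(scales)
    if s_max <= 0:
        return [1.0 + q for _ in scales]
    return [1.0 + q * (1.0 - s / s_max) for s in scales]


def quantize(latents, steps):
    """Uniform quantization with a per-element step size."""
    return [round(y / d) * d for y, d in zip(latents, steps)]


# Toy latent values and their predicted scales (illustrative numbers only).
latents = [4.3, -2.6, 0.4, 0.2]
scales = [4.0, 2.0, 0.5, 0.1]

fine = quantize(latents, adaptive_steps(scales, q=0.0))    # all steps are 1.0
coarse = quantize(latents, adaptive_steps(scales, q=2.0))  # low-scale elements use larger steps
```

Because `q` is a real number, the step sizes (and therefore the bitrate after entropy coding) vary continuously with it, which is the property the abstract emphasizes.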
70.6 CV Mar 27
Generation Is Compression: Zero-Shot Video Coding via Stochastic Rectified Flow
Ziyue Zeng, Xun Su, Haoyuan Liu et al.
Existing generative video compression methods use generative models only as post-hoc reconstruction modules atop conventional codecs. We propose Generative Video Codec (GVC), a zero-shot framework that turns a pretrained video generative model into the codec itself: the transmitted bitstream directly specifies the generative decoding trajectory, with no retraining required. To enable this, we convert the deterministic rectified-flow ODE of modern video foundation models into an equivalent SDE at inference time, unlocking per-step stochastic injection points for codebook-driven compression. Building on this unified backbone, we instantiate three complementary conditioning strategies: Image-to-Video (I2V) with adaptive tail-frame atom allocation, Text-to-Video (T2V) operating at near-zero side information as a pure generative prior, and First-Last-Frame-to-Video (FLF2V) with boundary-sharing GOP chaining for dual-anchor temporal control. Together, these variants span a principled trade-off space between spatial fidelity, temporal coherence, and compression efficiency. Experiments on standard benchmarks show that GVC achieves high-quality reconstruction below 0.002 bpp while supporting flexible bitrate control through a single hyperparameter.
LG Jul 6, 2022
Information Compression and Performance Evaluation of Tic-Tac-Toe's Evaluation Function Using Singular Value Decomposition
Naoya Fujita, Hiroshi Watanabe
We approximated the evaluation function for the game Tic-Tac-Toe by singular value decomposition (SVD) and investigated the effect of approximation accuracy on winning rate. We first prepared the perfect evaluation function of Tic-Tac-Toe and performed low-rank approximation by considering the evaluation function as a ninth-order tensor. We found that we can reduce the amount of information of the evaluation function by 70% without significantly degrading the performance. Approximation accuracy and winning rate were strongly correlated but not perfectly proportional. We also investigated how the decomposition method of the evaluation function affects the performance. We considered two decomposition methods: simple SVD regarding the evaluation function as a matrix and the Tucker decomposition by higher-order SVD (HOSVD). At the same compression ratio, the strategy with the approximated evaluation function obtained by HOSVD exhibited a significantly higher winning rate than that obtained by SVD. These results suggest that SVD can effectively compress board game strategies and an optimal compression method that depends on the game exists.
CV Oct 11, 2023
Point Cloud Denoising and Outlier Detection with Local Geometric Structure by Dynamic Graph CNN
Kosuke Nakayama, Hiroto Fukuta, Hiroshi Watanabe
The digitalization of society is rapidly developing toward the realization of the digital twin and metaverse. In particular, point clouds are attracting attention as a media format for 3D space. Point cloud data is contaminated with noise and outliers due to measurement errors. Therefore, denoising and outlier detection are necessary for point cloud processing. Among existing methods, PointCleanNet is effective for point cloud denoising and outlier detection. However, it does not consider the local geometric structure of the patch. We solve this problem by applying two types of graph convolutional layers designed based on the Dynamic Graph CNN. Experimental results show that the proposed methods outperform the conventional method in AUPR, which indicates outlier detection accuracy, and Chamfer Distance, which indicates denoising accuracy.
CV Dec 29, 2025
Contour Information Aware 2D Gaussian Splatting for Image Representation
Masaya Takabe, Hiroshi Watanabe, Sujun Hong et al.
Image representation is a fundamental task in computer vision. Recently, Gaussian Splatting has emerged as an efficient representation framework, and its extension to 2D image representation enables lightweight, yet expressive modeling of visual content. While recent 2D Gaussian Splatting (2DGS) approaches provide compact storage and real-time decoding, they often produce blurry or indistinct boundaries when the number of Gaussians is small due to the lack of contour awareness. In this work, we propose a Contour Information-Aware 2D Gaussian Splatting framework that incorporates object segmentation priors into Gaussian-based image representation. By constraining each Gaussian to a specific segmentation region during rasterization, our method prevents cross-boundary blending and preserves edge structures under high compression. We also introduce a warm-up scheme to stabilize training and improve convergence. Experiments on synthetic color charts and the DAVIS dataset demonstrate that our approach achieves higher reconstruction quality around object edges compared to existing 2DGS methods. The improvement is particularly evident in scenarios with very few Gaussians, while our method still maintains fast rendering and low memory usage.
CV Nov 11, 2025
Accurate and Efficient Surface Reconstruction from Point Clouds via Geometry-Aware Local Adaptation
Eito Ogawa, Taiga Hayami, Hiroshi Watanabe
Point cloud surface reconstruction has improved in accuracy with advances in deep learning, enabling applications such as infrastructure inspection. Recent approaches that reconstruct from small local regions rather than entire point clouds have attracted attention for their strong generalization capability. However, prior work typically places local regions uniformly and keeps their size fixed, limiting adaptability to variations in geometric complexity. In this study, we propose a method that improves reconstruction accuracy and efficiency by adaptively modulating the spacing and size of local regions based on the curvature of the input point cloud.
CV May 15, 2024
Scalable Image Coding for Humans and Machines Using Feature Fusion Network
Takahiro Shindo, Taiju Watanabe, Yui Tatsumi et al.
As image recognition models become more prevalent, scalable coding methods for machines and humans gain more importance. Applications of image recognition models include traffic monitoring and farm management. In these use cases, the scalable coding method proves effective because the tasks require occasional image checking by humans. Existing image compression methods for humans and machines meet these requirements to some extent. However, these compression methods are effective solely for specific image recognition models. We propose a learning-based scalable image coding method for humans and machines that is compatible with numerous image recognition models. We combine an image compression model for machines with a compression model, providing additional information to facilitate image decoding for humans. The features in these compression models are fused using a feature fusion network to achieve efficient image compression. Our method's additional information compression model is adjusted to reduce the number of parameters by enabling combinations of features of different sizes in the feature fusion network. Our approach confirms that the feature fusion network efficiently combines image compression models while reducing the number of parameters. Furthermore, we demonstrate the effectiveness of the proposed scalable coding method by evaluating the image compression performance in terms of decoded image quality and bitrate.
CV May 20, 2024
Refining Coded Image in Human Vision Layer Using CNN-Based Post-Processing
Takahiro Shindo, Yui Tatsumi, Taiju Watanabe et al.
Scalable image coding for both humans and machines is a technique that has gained a lot of attention recently. This technology enables the hierarchical decoding of images for human vision and image recognition models. It is a highly effective method when images need to serve both purposes. However, no research has yet incorporated the post-processing commonly used in popular image compression schemes into scalable image coding methods for humans and machines. In this paper, we propose a method to enhance the quality of decoded images for humans by integrating post-processing into a scalable coding scheme. Experimental results show that the post-processing improves compression performance. Furthermore, the effectiveness of the proposed method is validated through comparisons with traditional methods.
IV Jun 24, 2025
Explicit Residual-Based Scalable Image Coding for Humans and Machines
Yui Tatsumi, Ziyue Zeng, Hiroshi Watanabe
Scalable image compression is a technique that progressively reconstructs multiple versions of an image for different requirements. In recent years, images have increasingly been consumed not only by humans but also by image recognition models. This shift has drawn growing attention to scalable image compression methods that serve both machine and human vision (ICMH). Many existing models employ neural network-based codecs, known as learned image compression, and have made significant strides in this field by carefully designing the loss functions. In some cases, however, models are overly reliant on their learning capacity, and their architectural design is not sufficiently considered. In this paper, we enhance the coding efficiency and interpretability of the ICMH framework by integrating an explicit residual compression mechanism, which is commonly employed in resolution scalable coding methods such as JPEG2000. Specifically, we propose two complementary methods: Feature Residual-based Scalable Coding (FR-ICMH) and Pixel Residual-based Scalable Coding (PR-ICMH). These proposed methods are applicable to various machine vision tasks. Moreover, they provide flexibility to choose between encoder complexity and compression performance, making them adaptable to diverse application requirements. Experimental results demonstrate the effectiveness of our proposed methods, with PR-ICMH achieving up to 29.57% BD-rate savings over the previous work.
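The explicit pixel-residual idea can be shown with a toy example. This is only a sketch under strong assumptions: the abstract describes learned codecs, whereas here a plain uniform quantizer stands in for the enhancement-layer codec, and the function names and the `step` parameter are hypothetical. The base (machine-oriented) layer yields a decoded image, the enhancement layer codes the difference between the original and that decoded image, and the human-oriented reconstruction adds the decoded residual back.

```python
def encode_residual(original, base_decoded, step):
    """Quantize the pixel residual between the original and the base-layer output."""
    return [round((o - b) / step) for o, b in zip(original, base_decoded)]


def reconstruct(base_decoded, residual_q, step):
    """Add the dequantized residual back onto the base-layer output."""
    return [b + q * step for b, q in zip(base_decoded, residual_q)]


# Toy 1-D "images" (illustrative numbers only).
original = [100, 150, 200]
base_decoded = [96, 155, 190]   # what a machine-oriented base layer might decode
step = 4                        # enhancement-layer quantization step (assumed)

residual_q = encode_residual(original, base_decoded, step)
enhanced = reconstruct(base_decoded, residual_q, step)
```

With a uniform quantizer the reconstruction error of the enhanced image is bounded by half the step size, regardless of how far the base layer was from the original.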
CV Jul 16, 2025
InterpIoU: Rethinking Bounding Box Regression with Interpolation-Based IoU Optimization
Haoyuan Liu, Hiroshi Watanabe
Bounding box regression (BBR) is fundamental to object detection, where the regression loss is crucial for accurate localization. Existing IoU-based losses often incorporate handcrafted geometric penalties to address IoU's non-differentiability in non-overlapping cases and enhance BBR performance. However, these penalties are sensitive to box shape, size, and distribution, often leading to suboptimal optimization for small objects and undesired behaviors such as bounding box enlargement due to misalignment with the IoU objective. To address these limitations, we propose InterpIoU, a novel loss function that replaces handcrafted geometric penalties with a term based on the IoU between interpolated boxes and the target. By using interpolated boxes to bridge the gap between predictions and ground truth, InterpIoU provides meaningful gradients in non-overlapping cases and inherently avoids the box enlargement issue caused by misaligned penalties. Simulation results further show that IoU itself serves as an ideal regression target, while existing geometric penalties are both unnecessary and suboptimal. Building on InterpIoU, we introduce Dynamic InterpIoU, which dynamically adjusts interpolation coefficients based on IoU values, enhancing adaptability to scenarios with diverse object distributions. Experiments on COCO, VisDrone, and PASCAL VOC show that our methods consistently outperform state-of-the-art IoU-based losses across various detection frameworks, with particularly notable improvements in small object detection, confirming their effectiveness.
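The role of the interpolated box can be illustrated numerically. The following is a minimal sketch, not the paper's exact loss: the linear interpolation in `interpolate`, the fixed coefficient `alpha`, and the way the two IoU terms are summed are assumptions for illustration. It shows that when the prediction and the target do not overlap (IoU is zero), a box interpolated between them can still overlap the target, so an IoU term on that box remains informative.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def interpolate(pred, target, alpha):
    """Box that lies a fraction alpha of the way from pred to target."""
    return tuple((1.0 - alpha) * p + alpha * t for p, t in zip(pred, target))


def interp_iou_loss(pred, target, alpha=0.75):
    """Assumed form: plain IoU loss plus an IoU loss on the interpolated box."""
    interp = interpolate(pred, target, alpha)
    return (1.0 - iou(pred, target)) + (1.0 - iou(interp, target))


pred = (0.0, 0.0, 2.0, 2.0)
target = (4.0, 4.0, 6.0, 6.0)
interp = interpolate(pred, target, 0.75)   # (3.0, 3.0, 5.0, 5.0)
loss = interp_iou_loss(pred, target)
```

In a real training setup the interpolated box would be computed inside an automatic-differentiation framework so that gradients flow back to the predicted coordinates; this plain-Python version only evaluates the values.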
IV Apr 30, 2025
SR-NeRV: Improving Embedding Efficiency of Neural Video Representation via Super-Resolution
Taiga Hayami, Kakeru Koizumi, Hiroshi Watanabe
Implicit Neural Representations (INRs) have garnered significant attention for their ability to model complex signals in various domains. Recently, INR-based frameworks have shown promise in neural video compression by embedding video content into compact neural networks. However, these methods often struggle to reconstruct high-frequency details under stringent constraints on model size, which are critical in practical compression scenarios. To address this limitation, we propose an INR-based video representation framework that integrates a general-purpose super-resolution (SR) network. This design is motivated by the observation that high-frequency components tend to exhibit low temporal redundancy across frames. By offloading the reconstruction of fine details to a dedicated SR network pre-trained on natural images, the proposed method improves visual fidelity. Experimental results demonstrate that the proposed method outperforms conventional INR-based baselines in reconstruction quality, while maintaining a comparable model size.
CV Mar 23, 2025
Guided Diffusion for the Extension of Machine Vision to Human Visual Perception
Takahiro Shindo, Yui Tatsumi, Taiju Watanabe et al.
Image compression technology eliminates redundant information to enable efficient transmission and storage of images, serving both machine vision and human visual perception. For years, image coding focused on human perception has been well-studied, leading to the development of various image compression standards. On the other hand, with the rapid advancements in image recognition models, image compression for AI tasks, known as Image Coding for Machines (ICM), has gained significant importance. Therefore, scalable image coding techniques that address the needs of both machines and humans have become a key area of interest. Additionally, there is increasing demand for research applying the diffusion model, which can generate human-viewable images from a small amount of data, to image compression methods for human vision. Image compression methods that use diffusion models can partially reconstruct the target image by guiding the generation process with a small amount of conditioning information. Inspired by the diffusion model's potential, we propose a method for extending machine vision to human visual perception using guided diffusion. Utilizing the diffusion model guided by the output of the ICM method, we generate images for human perception from random noise. Guided diffusion acts as a bridge between machine vision and human vision, enabling transitions between them without any additional bitrate overhead. The generated images are then evaluated based on bitrate and image quality, and we compare their compression performance with other scalable image coding methods for humans and machines.
CV Nov 17, 2024
Time Step Generating: A Universal Synthesized Deepfake Image Detector
Ziyue Zeng, Haoyuan Liu, Dingjie Peng et al.
Currently, high-fidelity text-to-image models are being developed at an accelerating pace. Among them, Diffusion Models have led to a remarkable improvement in the quality of image generation, making it very challenging to distinguish between real and synthesized images. This simultaneously raises serious concerns regarding privacy and security. Some methods have been proposed to identify images generated by diffusion models through reconstruction. However, the inversion and denoising processes are time-consuming and heavily reliant on the pre-trained generative model. Consequently, if the pre-trained generative model encounters out-of-domain data, the detection performance declines. To address this issue, we propose a universal synthetic image detector, Time Step Generating (TSG), which does not rely on pre-trained models' reconstructing ability, specific datasets, or sampling algorithms. Our method utilizes a pre-trained diffusion model's network as a feature extractor to capture fine-grained details, focusing on the subtle differences between real and synthetic images. By controlling the time step t of the network input, we can effectively extract these distinguishing detail features. Then, those features can be passed through a classifier (e.g., ResNet), which efficiently detects whether an image is synthetic or real. We test the proposed TSG on the large-scale GenImage benchmark and it achieves significant improvements in both accuracy and generalizability.
CV Nov 1, 2024
Inter-Feature-Map Differential Coding of Surveillance Video
Kei Iino, Miho Takahashi, Hiroshi Watanabe et al.
In Collaborative Intelligence, a deep neural network (DNN) is partitioned and deployed at the edge and the cloud for bandwidth saving and system optimization. When a model input is an image, it has been confirmed that the intermediate feature map, the output from the edge, can be smaller than the input data size. However, its effectiveness has not been reported when the input is a video. In this study, we propose a method to compress the feature map of surveillance videos by applying inter-feature-map differential coding (IFMDC). IFMDC shows a compression ratio comparable to, or better than, that of HEVC applied to the input video when the accuracy reduction is small. Our method is especially effective for videos that are sensitive to image quality degradation when HEVC is applied.
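The core of differential coding between consecutive feature maps can be sketched in a few lines. This is an illustrative toy, not the authors' pipeline: the function names are hypothetical, feature maps are flattened to short integer lists (as if already quantized), and no entropy coding is applied. For a mostly static surveillance scene, consecutive feature maps are similar, so the differences are small and mostly zero, which is what makes them cheap to code.

```python
def encode_diff(feature_maps):
    """Keep the first feature map and store only frame-to-frame differences."""
    first = list(feature_maps[0])
    diffs = [
        [c - p for c, p in zip(cur, prev)]
        for prev, cur in zip(feature_maps, feature_maps[1:])
    ]
    return first, diffs


def decode_diff(first, diffs):
    """Rebuild every feature map by accumulating the differences."""
    maps = [list(first)]
    for d in diffs:
        maps.append([p + x for p, x in zip(maps[-1], d)])
    return maps


# Three consecutive (already quantized) feature maps from a static scene.
feature_maps = [
    [12, 0, 7, 3],
    [12, 0, 8, 3],
    [12, 1, 8, 3],
]

first, diffs = encode_diff(feature_maps)   # diffs are sparse: mostly zeros
restored = decode_diff(first, diffs)
```

Because the inputs are integers, the round trip is exact; any loss in a full system would come from the quantization applied before this step.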
SD Sep 5, 2025
Learning and composing of classical music using restricted Boltzmann machines
Mutsumi Kobayashi, Hiroshi Watanabe
Recently, software has been developed that uses machine learning to mimic the style of a particular composer, such as J. S. Bach. However, since such software often adopts machine learning models with complex structures, it is difficult to analyze how the software understands the characteristics of the composer's music. In this study, we adopted J. S. Bach's music for training of a restricted Boltzmann machine (RBM). Since the structure of RBMs is simple, it allows us to investigate the internal states after learning. We found that the learned RBM is able to compose music.
CV Jun 15, 2025
Structure-Preserving Patch Decoding for Efficient Neural Video Representation
Taiga Hayami, Kakeru Koizumi, Hiroshi Watanabe
Implicit neural representations (INRs) are the subject of extensive research, particularly in their application to modeling complex signals by mapping spatial and temporal coordinates to corresponding values. When handling videos, mapping compact inputs to entire frames or spatially partitioned patch images is an effective approach. This strategy better preserves spatial relationships, reduces computational overhead, and improves reconstruction quality compared to coordinate-based mapping. However, predicting entire frames often limits the reconstruction of high-frequency visual details. Additionally, conventional patch-based approaches based on uniform spatial partitioning tend to introduce boundary discontinuities that degrade spatial coherence. We propose a neural video representation method based on Structure-Preserving Patches (SPPs) to address such limitations. Our method separates each video frame into patch images of spatially aligned frames through a deterministic pixel-based splitting similar to PixelUnshuffle. This operation preserves the global spatial structure while allowing patch-level decoding. We train the decoder to reconstruct these structured patches, enabling a global-to-local decoding strategy that captures the global layout first and refines local details. This effectively reduces boundary artifacts and mitigates distortions from naive upsampling. Experiments on standard video datasets demonstrate that our method achieves higher reconstruction quality and better compression performance than existing INR-based baselines.
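The PixelUnshuffle-like splitting mentioned above can be made concrete with a small example. This is a sketch of the general operation only, with hypothetical function names and a single-channel frame stored as nested lists; it does not reproduce the paper's decoder. Each of the `r * r` patches is a sub-sampled copy of the whole frame, so every patch keeps the global layout rather than covering one spatial tile.

```python
def split_structured_patches(frame, r):
    """Split an H x W frame into r*r sub-sampled patches.

    Patch (i, j) collects the pixels frame[y][x] with y % r == i and x % r == j,
    so each patch spans the full frame at 1/r resolution.
    """
    h, w = len(frame), len(frame[0])
    if h % r or w % r:
        raise ValueError("frame height and width must be divisible by r")
    return [
        [row[j::r] for row in frame[i::r]]
        for i in range(r)
        for j in range(r)
    ]


def merge_structured_patches(patches, r):
    """Inverse of split_structured_patches: interleave the patches back."""
    ph, pw = len(patches[0]), len(patches[0][0])
    frame = [[None] * (pw * r) for _ in range(ph * r)]
    for k, patch in enumerate(patches):
        i, j = divmod(k, r)
        for y, row in enumerate(patch):
            for x, value in enumerate(row):
                frame[y * r + i][x * r + j] = value
    return frame


frame = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
patches = split_structured_patches(frame, 2)
restored = merge_structured_patches(patches, 2)
```

Since neighbouring pixels end up in different patches, there is no tile boundary inside the frame, which is consistent with the abstract's claim that this kind of split avoids the discontinuities of uniform spatial partitioning.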
CV May 26, 2025
Seed Selection for Human-Oriented Image Reconstruction via Guided Diffusion
Yui Tatsumi, Ziyue Zeng, Hiroshi Watanabe
Conventional methods for scalable image coding for humans and machines require the transmission of additional information to achieve scalability. A recent diffusion-based approach avoids this by generating human-oriented images from machine-oriented images without extra bitrate. However, it utilizes a single random seed, which may lead to suboptimal image quality. In this paper, we propose a seed selection method that identifies the optimal seed from multiple candidates to improve image quality without increasing the bitrate. To reduce the computational cost, selection is performed based on intermediate outputs obtained from early steps of the reverse diffusion process. Experimental results demonstrate that our proposed method outperforms the baseline, which uses a single random seed without selection, across multiple evaluation metrics.
CV Dec 22, 2024
Adapting Image-to-Video Diffusion Models for Large-Motion Frame Interpolation
Luoxu Jin, Hiroshi Watanabe
As video generation models have advanced significantly in recent years, we adopt large-scale image-to-video diffusion models for video frame interpolation. We present a conditional encoder designed to adapt an image-to-video model for large-motion frame interpolation. To enhance performance, we integrate a dual-branch feature extractor and propose a cross-frame attention mechanism that effectively captures both spatial and temporal information, enabling accurate interpolation of intermediate frames. Our approach demonstrates superior performance on the Fréchet Video Distance (FVD) metric when evaluated against other state-of-the-art approaches, particularly in handling large motion scenarios, highlighting advancements in generative-based methodologies.
CV Nov 10, 2024
Classification in Japanese Sign Language Based on Dynamic Facial Expressions
Yui Tatsumi, Shoko Tanaka, Shunsuke Akamatsu et al.
Sign language is a visual language expressed through hand movements and non-manual markers. Non-manual markers include facial expressions and head movements. These expressions vary across different nations. Therefore, specialized analysis methods for each sign language are necessary. However, research on Japanese Sign Language (JSL) recognition is limited due to a lack of datasets. The development of recognition models that consider both manual and non-manual features of JSL is crucial for precise and smooth communication with deaf individuals. In JSL, sentence types such as affirmative statements and questions are distinguished by facial expressions. In this paper, we propose a JSL recognition method that focuses on facial expressions. Our proposed method utilizes a neural network to analyze facial features and classify sentence types. Through the experiments, we confirm our method's effectiveness by achieving a classification accuracy of 96.05%.
CV Jun 15, 2024
Implicit Neural Representation for Videos Based on Residual Connection
Taiga Hayami, Hiroshi Watanabe
Video compression technology is essential for transmitting and storing videos. Many video compression methods reduce information in videos by removing high-frequency components and utilizing similarities between frames. Alternatively, implicit neural representations (INRs) for videos use networks to represent videos and compress them through model compression. A conventional method improves the quality of reconstruction by using frame features. However, the detailed representation of the frames can be improved. To improve the quality of reconstructed frames, we propose a method that uses low-resolution frames as a residual connection, which is considered effective for image reconstruction. Experimental results show that our method outperforms the existing method, HNeRV, in PSNR for 46 of the 49 videos.
CV Feb 13, 2024
Improving Image Coding for Machines through Optimizing Encoder via Auxiliary Loss
Kei Iino, Shunsuke Akamatsu, Hiroshi Watanabe et al.
Image coding for machines (ICM) aims to compress images for machine analysis using recognition models rather than human vision. Hence, in ICM, it is important for the encoder to recognize and compress the information necessary for the machine recognition task. There are two main approaches in learned ICM: optimization of the compression model based on task loss, and Region of Interest (ROI) based bit allocation. These approaches provide the encoder with the recognition capability. However, optimization with task loss becomes difficult when the recognition model is deep, and ROI-based methods often involve extra overhead during evaluation. In this study, we propose a novel training method for learned ICM models that applies auxiliary loss to the encoder to improve its recognition capability and rate-distortion performance. Our method achieves Bjontegaard Delta rate improvements of 27.7% and 20.3% in object detection and semantic segmentation tasks, respectively, compared to the conventional training method. © 2024 IEEE.
CVMay 30, 2023
VVC Extension Scheme for Object Detection Using Contrast ReductionTakahiro Shindo, Taiju Watanabe, Kein Yamada et al.
In recent years, video analysis using Artificial Intelligence (AI) has become widely used, owing to the remarkable development of image recognition technology based on deep learning. In 2019, the Moving Picture Experts Group (MPEG) started standardization of Video Coding for Machines (VCM) as a video coding technology for image recognition. In the framework of VCM, both higher image recognition accuracy and higher video compression performance are required. In this paper, we propose an extension scheme of video coding for object detection using Versatile Video Coding (VVC). Unlike video for human vision, video used for object detection does not require a large image size or high contrast. Downsampling the image reduces the amount of information to be transmitted, and decreasing the image contrast lowers the entropy of the image. Therefore, in our proposed scheme, the original image is reduced in size and contrast, then coded with the VVC encoder to achieve high compression performance. The output image from the VVC decoder is then restored to its original size using the bicubic method. Experimental results show that the proposed video coding scheme achieves better coding performance than regular VVC in terms of object detection accuracy.
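The claim in this abstract that lower contrast means lower entropy can be checked with a small sketch. The abstract does not say how contrast is reduced, so the linear scaling toward mid-gray below is an assumption, and the names `reduce_contrast`, `entropy_bits`, `factor`, and `pivot` are hypothetical. The test signal is synthetic, not data from the paper.

```python
import math
from collections import Counter


def reduce_contrast(pixels, factor, pivot=128):
    """Scale 8-bit pixel values toward `pivot` by `factor` (0 < factor <= 1).

    Assumed mapping for illustration; the paper's actual transform may differ.
    """
    return [min(255, max(0, round(pivot + factor * (p - pivot)))) for p in pixels]


def entropy_bits(pixels):
    """Shannon entropy (bits per pixel) of the empirical value histogram."""
    counts = Counter(pixels)
    n = len(pixels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


# Synthetic signal that uses every 8-bit level exactly once.
pixels = list(range(256))
low_contrast = reduce_contrast(pixels, factor=0.5)

print(entropy_bits(pixels))        # entropy of the full-range signal
print(entropy_bits(low_contrast))  # entropy after contrast reduction
```

Halving the contrast maps the 256 input levels onto roughly half as many output levels, so the histogram entropy drops. This is only a first-order proxy for the bitrate a real codec such as VVC would spend.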
CRNov 10, 2020
Proof of Authenticity of Logistics Information with Passive RFID Tags and BlockchainHiroshi Watanabe, Kenji Saito, Satoshi Miyazaki et al.
In tracing the (robotically automated) logistics of large quantities of goods, inexpensive passive RFID tags are preferred for cost reasons. Accordingly, among the many issues of RFID, security between such tags and readers has primarily been studied. However, the authenticity of data cannot be guaranteed if logistics services can give false information. Although the use of blockchain is often discussed, it is simply a recording system, so there is a risk that false records may be written to it. As a solution, we propose a design in which a digitally signing, location-constrained and tamper-evident reader atomically writes evidence to the blockchain along with its reading of and writing to a tag. By semi-formal modeling, we confirmed that the confidentiality and integrity of the information can be maintained throughout the system, and that digitally signed data can be verified later despite possible compromise of private keys or signature algorithms, or expiration of public key certificates. We also introduce a prototype design to show that our proposal is viable. This makes it possible to trace authentic logistics information using inexpensive passive RFID tags. Furthermore, by abstracting the reader/writer as a sensor/actuator, this model can be extended to IoT in general.
CVMar 18, 2020
Capsule GAN Using Capsule Network for Generator ArchitectureKanako Marusaki, Hiroshi Watanabe
This paper presents Capsule GAN, a generative adversarial network that uses Capsule Network not only in the discriminator but also in the generator. Recently, generative adversarial networks (GANs) have been intensively studied. However, generating images with GANs is difficult, and GANs sometimes generate poor-quality images. These GANs use convolutional neural networks (CNNs). However, CNNs have the defect that the relational information between features of the image may be lost. Capsule Network, proposed by Hinton in 2017, overcomes this defect of CNNs. The Capsule GAN reported in previous studies uses Capsule Network in the discriminator, but uses CNNs in the generator architecture, like DCGAN. This paper introduces two approaches to using Capsule Network in the generator. One is to use the DigitCaps layer from the discriminator as the input to the generator. The DigitCaps layer is the output layer of Capsule Network and holds the features of the discriminator's input images. The other is to use the reverse operation of the recognition process of Capsule Network in the generator. We compare the Capsule GAN proposed in this paper with a conventional GAN using CNNs and with a Capsule GAN that uses Capsule Network in the discriminator only. The datasets are MNIST, Fashion-MNIST and color images. We show that the proposed Capsule GAN outperforms both the GAN using CNNs and the GAN using Capsule Network in the discriminator only. The architecture of the Capsule GAN proposed in this paper is a basic architecture using Capsule Network. Therefore, existing improvement techniques for GANs can be applied to Capsule GAN.
CRJul 17, 2018
Can Blockchain Protect Internet-of-Things?Hiroshi Watanabe
In the Internet-of-Things, the number of connected devices is expected to be extremely large, i.e., more than a few tens of billions. It is, however, well known that security for the Internet-of-Things is still an open problem. In particular, it is difficult to certify the identification of connected devices and to prevent illegal spoofing. This is because conventional security technologies have advanced mainly to protect logical networks, not physical networks like the Internet-of-Things. In order to protect the Internet-of-Things with advanced security technologies, we propose a new concept (datachain layer), which is a well-designed combination of physical chip identification and blockchain. With the proposed solution of physical chip identification, the physical addresses of connected devices are uniquely connected to the logical addresses protected by blockchain.