Jiro Katto

IV
h-index: 24
21 papers
755 citations
Novelty: 48%
AI Score: 50

21 Papers

IV Mar 27, 2023 Code
Learned Image Compression with Mixed Transformer-CNN Architectures

Jinming Liu, Heming Sun, Jiro Katto

Learned image compression (LIC) methods have exhibited promising progress and superior rate-distortion performance compared with classical image compression standards. Most existing LIC methods are Convolutional Neural Networks-based (CNN-based) or Transformer-based, which have different advantages. Exploiting both advantages is a point worth exploring, which has two challenges: 1) how to effectively fuse the two methods? 2) how to achieve higher performance with a suitable complexity? In this paper, we propose an efficient parallel Transformer-CNN Mixture (TCM) block with a controllable complexity to incorporate the local modeling ability of CNN and the non-local modeling ability of transformers to improve the overall architecture of image compression models. Besides, inspired by the recent progress of entropy estimation models and attention modules, we propose a channel-wise entropy model with parameter-efficient swin-transformer-based attention (SWAtten) modules by using channel squeezing. Experimental results demonstrate our proposed method achieves state-of-the-art rate-distortion performances on three different resolution datasets (i.e., Kodak, Tecnick, CLIC Professional Validation) compared to existing LIC methods. The code is at https://github.com/jmliu206/LIC_TCM.

CV Aug 24, 2023
SCP: Spherical-Coordinate-based Learned Point Cloud Compression

Ao Luo, Linxin Song, Keisuke Nonaka et al.

In recent years, the task of learned point cloud compression has gained prominence. An important type of point cloud, the spinning LiDAR point cloud, is generated by spinning LiDAR on vehicles. This process results in numerous circular shapes and azimuthal angle invariance features within the point clouds. However, these two features have been largely overlooked by previous methodologies. In this paper, we introduce a model-agnostic method called Spherical-Coordinate-based learned Point cloud compression (SCP), designed to leverage the aforementioned features fully. Additionally, we propose a multi-level Octree for SCP to mitigate the reconstruction error for distant areas within the Spherical-coordinate-based Octree. SCP exhibits excellent universality, making it applicable to various learned point cloud compression techniques. Experimental results demonstrate that SCP surpasses previous state-of-the-art methods by up to 29.14% in point-to-point PSNR BD-Rate.
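To illustrate why spherical coordinates suit spinning LiDAR data, here is a minimal sketch (not the SCP implementation) of the Cartesian-to-spherical conversion. The function names and the synthetic ring are assumptions made for illustration. A circular scan line at a fixed height collapses to a single (rho, theta) pair, with only the azimuth phi varying.

```python
import math

def cartesian_to_spherical(x, y, z):
    """Return (rho, theta, phi): radius, polar angle from +z, azimuth."""
    rho = math.sqrt(x * x + y * y + z * z)
    if rho == 0.0:
        return 0.0, 0.0, 0.0
    # Clamp to guard against rounding slightly outside [-1, 1].
    theta = math.acos(max(-1.0, min(1.0, z / rho)))
    phi = math.atan2(y, x)
    return rho, theta, phi

def spherical_to_cartesian(rho, theta, phi):
    """Inverse of cartesian_to_spherical."""
    return (rho * math.sin(theta) * math.cos(phi),
            rho * math.sin(theta) * math.sin(phi),
            rho * math.cos(theta))

# Hypothetical scan ring: 8 points, 10 m horizontal range, 1.5 m below the sensor.
ring = [(10.0 * math.cos(2 * math.pi * k / 8),
         10.0 * math.sin(2 * math.pi * k / 8),
         -1.5) for k in range(8)]
ring_spherical = [cartesian_to_spherical(*p) for p in ring]
```

In this toy example every point in `ring_spherical` shares the same rho and theta, which is the kind of redundancy a spherical-coordinate octree can exploit.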

CV Feb 18, 2023
Multistage Spatial Context Models for Learned Image Compression

Fangzheng Lin, Heming Sun, Jinming Liu et al.

Recent state-of-the-art Learned Image Compression methods feature spatial context models, achieving great rate-distortion improvements over hyperprior methods. However, the autoregressive context model requires serial decoding, limiting runtime performance. The Checkerboard context model allows parallel decoding at the cost of reduced RD performance. We present a series of multistage spatial context models allowing both fast decoding and better RD performance. We split the latent space into square patches and decode serially within each patch while different patches are decoded in parallel. The proposed method features a decoding speed comparable to Checkerboard while matching and even outperforming the RD performance of Autoregressive. Inside each patch, the decoding order must be carefully decided, as a bad order negatively impacts performance; therefore, we also propose a decoding order optimization algorithm.
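The patch-based schedule described above can be sketched as follows. This is an illustrative assumption, not the paper's method: it uses a plain raster order inside each patch, whereas the paper proposes an optimized decoding order, and the function name `decoding_stages` is hypothetical.

```python
def decoding_stages(height, width, patch):
    """Assign each latent position the serial stage at which it is decoded.

    Positions with the same stage value lie in different patches and can be
    decoded in parallel. Raster order within a patch is an assumption.
    """
    return [[(i % patch) * patch + (j % patch) for j in range(width)]
            for i in range(height)]

# A 4x4 latent split into 2x2 patches needs 4 serial stages,
# versus 16 for fully serial (autoregressive) decoding.
stages = decoding_stages(4, 4, 2)
num_stages = max(max(row) for row in stages) + 1
```

The number of serial steps depends only on the patch size (patch squared), not on the latent resolution, which is what makes the scheme parallel-friendly.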

30.5 IV May 2
Evolution of NVENC Efficiency: A Longitudinal Analysis of HQ and UHQ Tuning Efficiency, Latency and Energy Trade-offs

Kasidis Arunruangsirilert, Jiro Katto

The rapid expansion of uplink-intensive applications necessitates video coding solutions that balance high Rate-Distortion (RD) efficiency with ultra-low latency. This paper presents a longitudinal performance analysis of NVIDIA hardware encoding (NVENC), spanning from Pascal to the emerging Blackwell generation. We specifically evaluate the operational viability of the new "Ultra High Quality" (UHQ) tuning mode against standard low-latency configurations. Our results demonstrate that while the Blackwell architecture breaks historical efficiency plateaus, achieving a 5.94% BD-Rate gain in standard modes and up to 22.79% in UHQ modes, these gains incur severe system-level penalties. We reveal that UHQ operates as a hybrid pipeline, offloading complexity to CUDA cores and enforcing aggressive temporal structures (up to 7 B-frames) that increase end-to-end latency by over 400% and GPU board power consumption by up to 40%. Consequently, while UHQ successfully bridges the quality gap with software encoders, its prohibitive serialization delay renders it unsuitable for interactive real-time communications, positioning it instead as a specialized solution for Video-on-Demand (VoD) transcoding.

CV Sep 3, 2022
Semantic Segmentation in Learned Compressed Domain

Jinming Liu, Heming Sun, Jiro Katto

Most machine vision tasks (e.g., semantic segmentation) are based on images encoded and decoded by image compression algorithms (e.g., JPEG). However, these decoded images in the pixel domain introduce distortion, and they are optimized for human perception, making the performance of machine vision tasks suboptimal. In this paper, we propose a method based on the compressed domain to improve segmentation tasks. i) A dynamic and a static channel selection method are proposed to reduce the redundancy of compressed representations that are obtained by encoding. ii) Two different transform modules are explored and analyzed to help the compressed representation be transformed as the features in the segmentation network. The experimental results show that we can save up to 15.8% bitrate compared with a state-of-the-art compressed domain-based work while saving up to about 83.6% bitrate and 44.8% inference time compared with the pixel domain-based method.

NI Dec 29, 2022
Pensieve 5G: Implementation of RL-based ABR Algorithm for UHD 4K/8K Content Delivery on Commercial 5G SA/NR-DC Network

Kasidis Arunruangsirilert, Bo Wei, Hang Song et al.

While the fifth-generation mobile network (5G) is being rolled out across the globe with the intention of delivering 4K/8K UHD video, Augmented Reality (AR), and Virtual Reality (VR) content to large numbers of users, coverage and throughput remain among the most significant issues, especially in rural areas, where only low-frequency-band 5G is being deployed. This calls for a high-performance adaptive bitrate (ABR) algorithm that can maximize user quality of experience (QoE) given 5G network characteristics and the data rate of UHD content. Recently, many newly proposed ABR techniques have been machine-learning based. Among them, Pensieve is one of the state-of-the-art techniques; it uses reinforcement learning to generate an ABR algorithm based on observations of past decision performance. By incorporating the context of the 5G network and UHD content, Pensieve has been optimized into Pensieve 5G. New QoE metrics that more accurately represent the QoE of UHD video streaming on different types of devices were proposed and used to evaluate Pensieve 5G against other ABR techniques, including the original Pensieve. Simulation results based on real 5G Standalone (SA) network throughput show that Pensieve 5G outperforms both conventional algorithms and Pensieve, with average QoE improvements of 8.8% and 14.2%, respectively. Additionally, Pensieve 5G also performed well on a commercial 5G NR-NR Dual Connectivity (NR-DC) network, despite being trained solely on data from the 5G Standalone (SA) network.

IV Aug 2, 2022
Streaming-capable High-performance Architecture of Learned Image Compression Codecs

Fangzheng Lin, Heming Sun, Jiro Katto

Learned image compression achieves state-of-the-art accuracy and compression ratios, but its relatively slow runtime performance limits its usage. While previous attempts at optimizing learned image codecs focused more on the neural model and entropy coding, we present an alternative method for improving the runtime performance of various learned image compression models. We introduce multi-threaded pipelining and an optimized memory model to enable asynchronous execution of GPU and CPU workloads, taking full advantage of computational resources. Our architecture alone already produces excellent performance without any change to the neural model itself. We also demonstrate that combining our architecture with previous tweaks to the neural models can further improve runtime performance. We show that our implementations excel in throughput and latency compared to the baseline and demonstrate the performance of our implementations by creating a real-time video streaming encoder-decoder sample application, with the encoder running on an embedded device.

NI Jul 23, 2023
UplinkNet: Practical Commercial 5G Standalone (SA) Uplink Throughput Prediction

Kasidis Arunruangsirilert, Jiro Katto

While 5G New Radio (NR) networks offer significant uplink throughput improvements, these gains are primarily realized when User Equipment (UE) connects to high-frequency millimeter wave (mmWave) bands. The growing demand for uplink-intensive applications, such as real-time UHD 4K/8K video streaming and Virtual Reality (VR)/Augmented Reality (AR) content, highlights the need for accurate uplink throughput prediction to optimize user Quality of Experience (QoE). In this paper, we introduce UplinkNet, a compact neural network designed to predict future uplink throughput using past throughput and RF parameters available through the Android API. With a model size limited to approximately 4,000 parameters, UplinkNet is suitable for IoT and low-power devices. The network was trained on real-world drive test data from commercial 5G Standalone (SA) networks in Tokyo, Japan, and Bangkok, Thailand, across various mobility conditions. To ensure practical implementation, the model uses only Android API data and was evaluated on unseen data against other models. Results show that UplinkNet achieves an average prediction accuracy of 98.9% and an RMSE of 5.22 Mbps, outperforming all other models while maintaining a compact size and low computational cost.

CV Nov 12, 2022
ABCAS: Adaptive Bound Control of spectral norm as Automatic Stabilizer

Shota Hirose, Shiori Maki, Naoki Wada et al.

Spectral Normalization is one of the best methods for stabilizing the training of Generative Adversarial Networks (GANs). Spectral Normalization limits the gradient of the discriminator between the distributions of real and fake data. However, even with this normalization, GAN training sometimes fails. In this paper, we reveal that a more severe restriction is sometimes needed depending on the training dataset, and we propose a novel stabilizer, called ABCAS, which offers an adaptive normalization method. Our method decides the discriminator's Lipschitz constant adaptively by checking the distance between the distributions of real and fake data. Our method improves the stability of GAN training and achieves a better Fréchet Inception Distance score for generated images. We also investigate suitable spectral norms for three datasets and present the results as an ablation study.

IV Aug 30, 2022
Learned Lossless Image Compression With Combined Autoregressive Models And Attention Modules

Ran Wang, Jinming Liu, Heming Sun et al.

Lossless image compression is an essential research field in image compression. Recently, learning-based image compression methods have achieved impressive performance compared with traditional lossless methods, such as WebP, JPEG2000, and FLIF. However, there are still many impressive lossy compression methods that can be applied to lossless compression. Therefore, in this paper, we explore the methods widely used in lossy compression and apply them to lossless compression. Inspired by the impressive performance of the Gaussian mixture model (GMM) in lossy compression, we build a lossless network architecture with GMM. In addition, noting the success of attention modules and autoregressive models, we propose to utilize attention modules and add an extra autoregressive model for raw images in our network architecture to boost performance. Experimental results show that our approach outperforms most classical lossless compression methods and existing learning-based methods.

45.6 IV May 16
Sustainable Real-Time 8K60 HEVC Encoding for V2X: Repurposing Legacy NVENC Hardware at the Vehicular Edge

Kasidis Arunruangsirilert, Jiro Katto

The rapid advancement of Vehicle-to-Everything (V2X) communications and Tele-Operated Driving (ToD) demands ultra-low-latency, 8K60 video telemetry. However, deploying modern hardware at the vehicular edge is frequently hindered by supply chain constraints, high power budgets, and growing e-waste concerns. This paper investigates a highly sustainable alternative: repurposing legacy NVIDIA Pascal GPUs for real-time 8K HEVC edge encoding. We demonstrate that triggering 2-Way Split Frame Encoding (SFE) on dual-NVENC GP104 and GP102 silicon successfully unlocks real-time 8K60 throughput with a negligible Rate-Distortion penalty of under 1%. Crucially, our micro-architectural analysis reveals that smaller GPU dies significantly outperform larger flagship models in both raw throughput and energy efficiency. Because fixed-function encoding forces general-purpose Streaming Multiprocessor (SM) cores to sustain maximum frequencies while remaining idle, GPUs with fewer CUDA cores waste drastically less power. While benchmarking against the state-of-the-art RTX PRO 6000 Blackwell highlights a generational compression efficiency gap, Pascal's functional HEVC architecture and native lack of B-frames align perfectly with ultra-low-latency V2X pipelines. Ultimately, repurposed mid-range Pascal GPUs present a highly capable, cost-effective, and e-waste mitigating solution for modern Intelligent Transportation Systems.

20.3 NI May 16
Transformer-Based MCS Prediction for 5G Multicast-Broadcast Services (MBS)

Kasidis Arunruangsirilert, Jiro Katto

The deployment of 5G Multicast-Broadcast Services (MBS) is emerging as a critical technology for spectral-efficient UHD content delivery and serving as a promising solution to modernize CATV deployment. However, unlike unicast networks that rely on RLC-AM with HARQ retransmissions, MBS broadcast operates in RLC Unacknowledged Mode (RLC-UM), where the absence of a feedback loop means packet loss is permanent and immediately impacts user QoE. Conventional link adaptation algorithms, designed for unicast, typically aggressively maximize throughput and fail in this risk-intolerant environment, resulting in severe video stalls and rebuffering. To address this, we propose a lightweight Transformer-based framework that predicts the success probability of all 28 MCS indices over an upcoming video segment horizon. Utilizing a unique commercial network dataset with 0.5 ms slot-level granularity, we train our model using a custom Asymmetric Safety Loss function that penalizes channel overestimation to prioritize link stability. Experimental results show that our approach achieves a reliability score of 86.89%, significantly outperforming standard AI baselines optimized for raw throughput (31.65%) while maintaining a safe conservative bias. Furthermore, the model is optimized for real-time applications, demonstrating an inference time of less than 0.07 ms on COTS 5G-era smartphones.
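The abstract does not give the form of the Asymmetric Safety Loss, so the following is only a hypothetical sketch of the general idea: an error that overestimates the channel is weighted more heavily than an equally sized underestimate. The function name, the squared-error form, and the `over_weight` value are all assumptions.

```python
def asymmetric_safety_loss(predicted, target, over_weight=4.0):
    """Mean squared error with a heavier weight on overestimation.

    predicted, target: sequences of success probabilities in [0, 1].
    over_weight: hypothetical multiplier applied when predicted > target.
    """
    total = 0.0
    for p, t in zip(predicted, target):
        error = p - t
        weight = over_weight if error > 0 else 1.0
        total += weight * error * error
    return total / len(predicted)

# Same absolute error (0.2), different direction.
over = asymmetric_safety_loss([0.9], [0.7])   # optimistic prediction
under = asymmetric_safety_loss([0.5], [0.7])  # conservative prediction
```

Under this weighting, a model trained to minimize the loss is pushed toward a conservative bias, which matches the risk-intolerant setting the abstract describes.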

IV Nov 20, 2024 Code
LMM-driven Semantic Image-Text Coding for Ultra Low-bitrate Learned Image Compression

Shimon Murai, Heming Sun, Jiro Katto

Supported by powerful generative models, low-bitrate learned image compression (LIC) models utilizing perceptual metrics have become feasible. Some of the most advanced models achieve high compression rates and superior perceptual quality by using image captions as sub-information. This paper demonstrates that using a large multi-modal model (LMM), it is possible to generate captions and compress them within a single model. We also propose a novel semantic-perceptual-oriented fine-tuning method applicable to any LIC network, resulting in a 41.58% improvement in LPIPS BD-rate compared to existing methods. Our implementation and pre-trained weights are available at https://github.com/tokkiwa/ImageTextCoding.

CV Mar 29, 2025 Code
Real-time Video Prediction With Fast Video Interpolation Model and Prediction Training

Shota Hirose, Kazuki Kotoyori, Kasidis Arunruangsirilert et al.

Transmission latency significantly affects users' quality of experience in real-time interaction and actuation. As latency is principally inevitable, video prediction can be utilized to mitigate the latency and ultimately enable zero-latency transmission. However, most of the existing video prediction methods are computationally expensive and impractical for real-time applications. In this work, we therefore propose real-time video prediction towards the zero-latency interaction over networks, called IFRVP (Intermediate Feature Refinement Video Prediction). Firstly, we propose three training methods for video prediction that extend frame interpolation models, where we utilize a simple convolution-only frame interpolation network based on IFRNet. Secondly, we introduce ELAN-based residual blocks into the prediction models to improve both inference speed and accuracy. Our evaluations show that our proposed models perform efficiently and achieve the best trade-off between prediction accuracy and computational speed among the existing video prediction methods. A demonstration movie is also provided at http://bit.ly/IFRVPDemo. The code will be released at https://github.com/FykAikawa/IFRVP.

IV Aug 19, 2021 Code
Learned Video Compression with Residual Prediction and Loop Filter

Chao Liu, Heming Sun, Jiro Katto et al.

In this paper, we propose a learned video codec with a residual prediction network (RP-Net) and a feature-aided loop filter (LF-Net). For the RP-Net, we exploit the residual of previous multiple frames to further eliminate the redundancy of the current frame residual. For the LF-Net, the features from residual decoding network and the motion compensation network are used to aid the reconstruction quality. To reduce the complexity, a light ResNet structure is used as the backbone for both RP-Net and LF-Net. Experimental results illustrate that we can save about 10% BD-rate compared with previous learned video compression frameworks. Moreover, we can achieve faster coding speed due to the ResNet backbone. This project is available at https://github.com/chaoliu18/RPLVC.

CV Dec 4, 2024
Lightweight Stochastic Video Prediction via Hybrid Warping

Kazuki Kotoyori, Shota Hirose, Heming Sun et al.

Accurate video prediction by deep neural networks, especially for dynamic regions, is a challenging task in computer vision for critical applications such as autonomous driving, remote working, and telemedicine. Due to inherent uncertainties, existing prediction models often struggle with the complexity of motion dynamics and occlusions. In this paper, we propose a novel stochastic long-term video prediction model that focuses on dynamic regions by employing a hybrid warping strategy. By integrating frames generated through forward and backward warpings, our approach effectively compensates for the weaknesses of each technique, improving the prediction accuracy and realism of moving regions in videos while also addressing uncertainty by making stochastic predictions that account for various motions. Furthermore, considering real-time predictions, we introduce a MobileNet-based lightweight architecture into our model. Our model, called SVPHW, achieves state-of-the-art performance on two benchmark datasets.

NI Sep 25, 2021
Adaptive video transmission using QUBO method and Digital Annealer based on Ising machine

Bo Wei, Hang Song, Jiro Katto

With video streaming accounting for a dramatically increasing share of total network traffic, it is critical to develop effective algorithms that improve the quality of content delivery services. Adaptive bitrate (ABR) control is the most essential technique: it determines the proper bitrate to choose based on network conditions, thus realizing high-quality video streaming. In this paper, a novel ABR strategy based on an Ising machine is proposed, using the quadratic unconstrained binary optimization (QUBO) method and Digital Annealer (DA) for the first time. The proposed method is evaluated by simulation with real-world measured throughput and compared with other state-of-the-art methods. Experimental results show that the proposed QUBO-based method can outperform the existing methods, demonstrating its superiority.
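As a rough illustration of how a bitrate choice can be posed as a QUBO, the sketch below encodes "pick exactly one bitrate level" with one binary variable per level and a one-hot penalty. The utility values, the penalty weight, and the brute-force solver are assumptions standing in for the paper's actual formulation and for the Digital Annealer.

```python
from itertools import product

def build_qubo(utilities, penalty):
    """Upper-triangular QUBO for: minimize -sum(u_i x_i) + penalty * (sum(x_i) - 1)^2.

    Using x_i^2 = x_i, the diagonal becomes -u_i - penalty and each
    off-diagonal pair gets 2 * penalty (the constant term is dropped).
    """
    n = len(utilities)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        q[i][i] = -utilities[i] - penalty
        for j in range(i + 1, n):
            q[i][j] = 2.0 * penalty
    return q

def energy(q, x):
    """Evaluate x^T Q x for an upper-triangular Q and binary vector x."""
    n = len(x)
    return sum(q[i][j] * x[i] * x[j] for i in range(n) for j in range(i, n))

def solve_brute_force(q):
    """Exhaustive search; only feasible for tiny problems."""
    return min(product((0, 1), repeat=len(q)), key=lambda x: energy(q, x))

# Hypothetical per-level utilities (e.g. quality minus expected rebuffering cost).
utilities = [1.0, 2.5, 2.0]
q = build_qubo(utilities, penalty=10.0)
best = solve_brute_force(q)
```

The penalty must exceed the largest utility so that selecting two levels at once is never cheaper than selecting one.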

IV Oct 25, 2020
A QP-adaptive Mechanism for CNN-based Filter in Video Coding

Chao Liu, Heming Sun, Jiro Katto et al.

Convolutional neural network (CNN)-based filters have achieved great success in video coding. However, in most previous works, individual models are needed for each quantization parameter (QP) band. This paper presents a generic method to help an arbitrary CNN-filter handle different quantization noise. We model the quantization noise problem and implement a feasible solution on CNN, which introduces the quantization step (Qstep) into the convolution. When the quantization noise increases, the ability of the CNN-filter to suppress noise improves accordingly. This method can be used directly to replace the (vanilla) convolution layer in any existing CNN-filters. By using only 25% of the parameters, the proposed method achieves better performance than using multiple models with VTM-6.3 anchor. Besides, an additional BD-rate reduction of 0.2% is achieved by our proposed method for chroma components.

IV Sep 6, 2020
A Convolutional Neural Network-Based Low Complexity Filter

Chao Liu, Heming Sun, Jiro Katto et al.

Convolutional Neural Network (CNN)-based filters have achieved significant performance in video artifact reduction. However, the high complexity of existing methods makes them difficult to apply in practice. In this paper, a CNN-based low-complexity filter is proposed. We utilize depthwise separable convolution (DSC) merged with batch normalization (BN) as the backbone of our proposed CNN-based network. Besides, a weight initialization method is proposed to enhance the training performance. To solve the well-known over-smoothing problem for inter frames, a frame-level residual mapping (RM) is presented. We quantitatively analyze mainstream methods such as frame-level and block-level filters and build our CNN-based filter with frame-level control to avoid the extra complexity and artificial boundaries caused by block-level control. In addition, a novel module called RM is designed to restore the distortion from the learned residuals. As a result, we can effectively improve the generalization ability of the learning-based filter and reach an adaptive filtering effect. Moreover, this module is flexible and can be combined with other learning-based filters. The experimental results show that our proposed method achieves a significant BD-rate reduction over H.265/HEVC. It achieves about a 1.2% BD-rate reduction and a 79.1% decrease in FLOPs compared with VR-CNN. Finally, measurements on H.266/VVC and ablation studies are also conducted to verify the effectiveness of the proposed method.
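To give a sense of why depthwise separable convolution lowers complexity, the arithmetic below compares weight counts for a single layer. The layer sizes are illustrative assumptions and are not taken from the paper, so the resulting ratio is not the 79.1% FLOPs figure reported above.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out

def dsc_params(kernel, c_in, c_out):
    """Weights in a depthwise (kernel x kernel per channel) plus pointwise (1x1) pair."""
    return kernel * kernel * c_in + c_in * c_out

# Hypothetical layer: 3x3 kernel, 64 input and 64 output channels.
standard = standard_conv_params(3, 64, 64)  # 3*3*64*64
separable = dsc_params(3, 64, 64)           # 3*3*64 + 64*64
ratio = separable / standard
```

For this example layer the separable version uses roughly one eighth of the weights, and the multiply-accumulate count per output position scales the same way.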

IV Nov 22, 2019
Dual Learning-based Video Coding with Inception Dense Blocks

Chao Liu, Heming Sun, Junan Chen et al.

In this paper, a dual learning-based method for intra coding is introduced for the PCS Grand Challenge. The method is mainly composed of two parts, intra prediction and reconstruction filtering, which use different network structures: the neural network-based intra prediction uses a fully connected network to predict the block, while the neural network-based reconstruction filtering utilizes convolutional networks. Unlike previous filtering works, we use a network with more powerful feature extraction capabilities in our reconstruction filtering network, and filtering is performed at the block level to achieve more accurate filtering compensation. To the best of our knowledge, among all learning-based methods, this is the first attempt to combine two different networks in one application, and we achieve state-of-the-art performance for the AI configuration on the HEVC test sequences. The experimental results show that our method leads to significant BD-rate savings for the 8 provided sequences compared to the HM-16.20 baseline (average 10.24% and 3.57% bitrate reductions for all-intra and random-access coding, respectively). For the HEVC test sequences, our model also achieves a 9.70% BD-rate saving compared to the HM-16.20 baseline for the all-intra configuration.

CV Apr 25, 2018
Deep Convolutional AutoEncoder-based Lossy Image Compression

Zhengxue Cheng, Heming Sun, Masaru Takeuchi et al.

Image compression has been investigated as a fundamental research topic for many decades. Recently, deep learning has achieved great success in many computer vision tasks, and is gradually being used in image compression. In this paper, we present a lossy image compression architecture, which utilizes the advantages of convolutional autoencoder (CAE) to achieve a high coding efficiency. First, we design a novel CAE architecture to replace the conventional transforms and train this CAE using a rate-distortion loss function. Second, to generate a more energy-compact representation, we utilize the principal components analysis (PCA) to rotate the feature maps produced by the CAE, and then apply the quantization and entropy coder to generate the codes. Experimental results demonstrate that our method outperforms traditional image coding algorithms, by achieving a 13.7% BD-rate decrement on the Kodak database images compared to JPEG2000. Besides, our method maintains a moderate complexity similar to JPEG2000.