MMJun 9, 2020Code
Hysia: Serving DNN-Based Video-to-Retail Applications in CloudHuaizheng Zhang, Yuanming Li, Qiming Ai et al.
Combining \underline{v}ideo streaming and online \underline{r}etailing (V2R) has been a growing trend recently. In this paper, we provide practitioners and researchers in multimedia with a cloud-based platform named Hysia for easy development and deployment of V2R applications. The system consists of: 1) a back-end infrastructure providing optimized V2R related services including data engine, model repository, model serving and content matching; and 2) an application layer which enables rapid V2R application prototyping. Hysia addresses industry and academic needs in large-scale multimedia by: 1) seamlessly integrating state-of-the-art libraries including NVIDIA video SDK, Facebook faiss, and gRPC; 2) efficiently utilizing GPU computation; and 3) allowing developers to bind new models easily to meet the rapidly changing deep learning (DL) techniques. On top of that, we implement an orchestrator for further optimizing DL model serving performance. Hysia has been released as an open source project on GitHub, and attracted considerable attention. We have published Hysia to DockerHub as an official image for seamless integration and deployment in current cloud environments.
39.8CVApr 29
A Multistage Extraction Pipeline for Long Scanned Financial Documents: An Empirical Study in Industrial KYC WorkflowsYuxuan Han, Yuanxing Zhang, Yushuo Wang et al.
Structured information extraction from long, multilingual scanned financial documents is a core requirement in industrial KYC and compliance workflows. These documents are typically non machine readable, noisy, and visually heterogeneous. They usually span dozens of pages while containing only sparse task relevant information. Although recent vision-language models achieve strong benchmark performance, directly applying them end to end to full financial reports often leads to unreliable extraction under real world conditions. We present a multistage extraction framework that integrates image preprocessing, multilingual OCR, hybrid page-level retrieval, and compact VLM-based structured extraction. The design separates page localization from multimodal reasoning, enabling more accurate extraction from complex multipage documents. We evaluated the framework on 120 production KYC documents comprising about 3000 multilingual scanned pages. Across multiple OCR-VLM combinations, the proposed pipeline consistently outperforms direct PDF-to-VLM baselines, improving field-level accuracy by up to 31.9 percentage points. The best configuration, PaddleOCR with MiniCPM2.6, achieves 87.27 percent accuracy. Ablation studies show that page-level retrieval is the dominant factor in performance improvements, particularly for complex financial statements and non-English documents.
CVMar 31, 2022
A Dataset of Images of Public Streetlights with Operational Monitoring using Computer Vision TechniquesIoannis Mavromatis, Aleksandar Stanoev, Pietro Carnelli et al.
A dataset of street light images is presented. Our dataset consists of $\sim350\textrm{k}$ images, taken from 140 UMBRELLA nodes installed in the South Gloucestershire region in the UK. Each UMBRELLA node is installed on the pole of a lamppost and is equipped with a Raspberry Pi Camera Module v1 facing upwards towards the sky and lamppost light bulb. Each node collects an image at hourly intervals for 24h every day. The data collection spans for a period of six months. Each image taken is logged as a single entry in the dataset along with the Global Positioning System (GPS) coordinates of the lamppost. All entries in the dataset have been post-processed and labelled based on the operation of the lamppost, i.e., whether the lamppost is switched ON or OFF. The dataset can be used to train deep neural networks and generate pre-trained models providing feature representations for smart city CCTV applications, smart weather detection algorithms, or street infrastructure monitoring. The dataset can be found at \url{https://doi.org/10.5281/zenodo.6046758}.
MMMar 10, 2017
Towards Wi-Fi AP-Assisted Content Prefetching for On-Demand TV Series: A Reinforcement Learning ApproachWen Hu, Yichao Jin, Yonggang Wen et al.
The emergence of smart Wi-Fi APs (Access Point), which are equipped with huge storage space, opens a new research area on how to utilize these resources at the edge network to improve users' quality of experience (QoE) (e.g., a short startup delay and smooth playback). One important research interest in this area is content prefetching, which predicts and accurately fetches contents ahead of users' requests to shift the traffic away during peak periods. However, in practice, the different video watching patterns among users, and the varying network connection status lead to the time-varying server load, which eventually makes the content prefetching problem challenging. To understand this challenge, this paper first performs a large-scale measurement study on users' AP connection and TV series watching patterns using real-traces. Then, based on the obtained insights, we formulate the content prefetching problem as a Markov Decision Process (MDP). The objective is to strike a balance between the increased prefetching&storage cost incurred by incorrect prediction and the reduced content download delay because of successful prediction. A learning-based approach is proposed to solve this problem and another three algorithms are adopted as baselines. In particular, first, we investigate the performance lower bound by using a random algorithm, and the upper bound by using an ideal offline approach. Then, we present a heuristic algorithm as another baseline. Finally, we design a reinforcement learning algorithm that is more practical to work in the online manner. Through extensive trace-based experiments, we demonstrate the performance gain of our design. Remarkably, our learning-based algorithm achieves a better precision and hit ratio (e.g., 80%) with about 70% (resp. 50%) cost saving compared to the random (resp. heuristic) algorithm.
MMApr 11, 2014
Enhancing User Experience for Multi-Screen Social TV Streaming over Wireless NetworksHuazi Zhang, Yichao Jin, Weiwen Zhang et al.
Recently, multi-screen cloud social TV is invented to transform TV into social experience. People watching the same content on social TV may come from different locations, while freely interact with each other through text, image, audio and video. This crucial virtual living-room experience adds social aspects into existing performance metrics. In this paper, we parse social TV user experience into three elements (i.e., inter-user delay, video quality of experience (QoE), and resource efficiency), and provide a joint analytical framework to enhance user experience. Specifically, we propose a cloud-based optimal playback rate allocation scheme to maximize the overall QoE while upper bounding inter-user delay. Experiment results show that our algorithm achieves near-optimal tradeoff between inter-user delay and video quality, and demonstrates resilient performance even under very fast wireless channel fading.