CVOct 27, 2022
A Novel Approach for Neuromorphic Vision Data Compression based on Deep Belief NetworkSally Khaidem, Mansi Sharma, Abhipraay Nevatia
A neuromorphic camera is an image sensor that emulates the human eyes capturing only changes in local brightness levels. They are widely known as event cameras, silicon retinas or dynamic vision sensors (DVS). DVS records asynchronous per-pixel brightness changes, resulting in a stream of events that encode the brightness change's time, location, and polarity. DVS consumes little power and can capture a wider dynamic range with no motion blur and higher temporal resolution than conventional frame-based cameras. Although this method of event capture results in a lower bit rate than traditional video capture, it is further compressible. This paper proposes a novel deep learning-based compression scheme for event data. Using a deep belief network (DBN), the high dimensional event data is reduced into a latent representation and later encoded using an entropy-based coding technique. The proposed scheme is among the first to incorporate deep learning for event compression. It achieves a high compression ratio while maintaining good reconstruction quality outperforming state-of-the-art event data coders and other lossless benchmark techniques.
CVOct 27, 2022
2T-UNET: A Two-Tower UNet with Depth Clues for Robust Stereo Depth EstimationRohit Choudhary, Mansi Sharma, Rithvik Anil
Stereo correspondence matching is an essential part of the multi-step stereo depth estimation process. This paper revisits the depth estimation problem, avoiding the explicit stereo matching step using a simple two-tower convolutional neural network. The proposed algorithm is entitled as 2T-UNet. The idea behind 2T-UNet is to replace cost volume construction with twin convolution towers. These towers have an allowance for different weights between them. Additionally, the input for twin encoders in 2T-UNet are different compared to the existing stereo methods. Generally, a stereo network takes a right and left image pair as input to determine the scene geometry. However, in the 2T-UNet model, the right stereo image is taken as one input and the left stereo image along with its monocular depth clue information, is taken as the other input. Depth clues provide complementary suggestions that help enhance the quality of predicted scene geometry. The 2T-UNet surpasses state-of-the-art monocular and stereo depth estimation methods on the challenging Scene flow dataset, both quantitatively and qualitatively. The architecture performs incredibly well on complex natural scenes, highlighting its usefulness for various real-time applications. Pretrained weights and code will be made readily available.
ROMar 12, 2022
Tactile-ViewGCN: Learning Shape Descriptor from Tactile Data using Graph Convolutional NetworkSachidanand V S, Mansi Sharma
For humans, our "senses of touch" have always been necessary for our ability to precisely and efficiently manipulate objects of all shapes in any environment, but until recently, not many works have been done to fully understand haptic feedback. This work proposed a novel method for getting a better shape descriptor than existing methods for classifying an object from multiple tactile data collected from a tactile glove. It focuses on improving previous works on object classification using tactile data. The major problem for object classification from multiple tactile data is to find a good way to aggregate features extracted from multiple tactile images. We propose a novel method, dubbed as Tactile-ViewGCN, that hierarchically aggregate tactile features considering relations among different features by using Graph Convolutional Network. Our model outperforms previous methods on the STAG dataset with an accuracy of 81.82%.
CVApr 16, 2022
A Robust and Scalable Attention Guided Deep Learning Framework for Movement Quality AssessmentAditya Kanade, Mansi Sharma, Manivannan Muniyandi
Physical rehabilitation programs frequently begin with a brief stay in the hospital and continue with home-based rehabilitation. Lack of feedback on exercise correctness is a significant issue in home-based rehabilitation. Automated movement quality assessment (MQA) using skeletal movement data (hereafter referred to as skeletal data) collected via depth imaging devices can assist with home-based rehabilitation by providing the necessary quantitative feedback. This paper aims to use recent advances in deep learning to address the problem of MQA. Movement quality score generation is an essential component of MQA. We propose three novel skeletal data augmentation schemes. We show that using the proposed augmentations for generating movement quality scores result in significant performance boosts over existing methods. Finally, we propose a novel transformer based architecture for MQA. Four novel feature extractors are proposed and studied that allow the transformer network to operate on skeletal data. We show that adding the attention mechanism in the design of the proposed feature extractor allows the transformer network to pay attention to specific body parts that make a significant contribution towards executing a movement. We report an improvement in movement quality score prediction of 12% on UI-PRMD dataset and 21% on KIMORE dataset compared to the existing methods.
CVJun 21, 2022
MEStereo-Du2CNN: A Novel Dual Channel CNN for Learning Robust Depth Estimates from Multi-exposure Stereo Images for HDR 3D ApplicationsRohit Choudhary, Mansi Sharma, Uma T et al.
Display technologies have evolved over the years. It is critical to develop practical HDR capturing, processing, and display solutions to bring 3D technologies to the next level. Depth estimation of multi-exposure stereo image sequences is an essential task in the development of cost-effective 3D HDR video content. In this paper, we develop a novel deep architecture for multi-exposure stereo depth estimation. The proposed architecture has two novel components. First, the stereo matching technique used in traditional stereo depth estimation is revamped. For the stereo depth estimation component of our architecture, a mono-to-stereo transfer learning approach is deployed. The proposed formulation circumvents the cost volume construction requirement, which is replaced by a ResNet based dual-encoder single-decoder CNN with different weights for feature fusion. EfficientNet based blocks are used to learn the disparity. Secondly, we combine disparity maps obtained from the stereo images at different exposure levels using a robust disparity feature fusion approach. The disparity maps obtained at different exposures are merged using weight maps calculated for different quality measures. The final predicted disparity map obtained is more robust and retains best features that preserve the depth discontinuities. The proposed CNN offers flexibility to train using standard dynamic range stereo data or with multi-exposure low dynamic range stereo sequences. In terms of performance, the proposed model surpasses state-of-the-art monocular and stereo depth estimation methods, both quantitatively and qualitatively, on challenging Scene flow and differently exposed Middlebury stereo datasets. The architecture performs exceedingly well on complex natural scenes, demonstrating its usefulness for diverse 3D HDR applications.
CVOct 4, 2022
A Novel Light Field Coding Scheme Based on Deep Belief Network & Weighted Binary Images for Additive Layered DisplaysSally Khaidem, Mansi Sharma
Light-field displays create an immersive experience by providing binocular depth sensation and motion parallax. Stacking light attenuating layers is one approach to implement a light field display with a broader depth of field, wide viewing angles and high resolution. Due to the transparent holographic optical element (HOE) layers, additive layered displays can be integrated into augmented reality (AR) wearables to overlay virtual objects onto the real world, creating a seamless mixed reality (XR) experience. This paper proposes a novel framework for light field representation and coding that utilizes Deep Belief Network (DBN) and weighted binary images suitable for additive layered displays. The weighted binary representation of layers makes the framework more flexible for adaptive bitrate encoding. The framework effectively captures intrinsic redundancies in the light field data, and thus provides a scalable solution for light field coding suitable for XR display applications. The latent code is encoded by H.265 codec generating a rate-scalable bit-stream. We achieve adaptive bitrate decoding by varying the number of weighted binary images and the H.265 quantization parameter, while maintaining an optimal reconstruction quality. The framework is tested on real and synthetic benchmark datasets, and the results validate the rate-scalable property of the proposed scheme.
CVJun 22, 2022
A High Resolution Multi-exposure Stereoscopic Image & Video Database of Natural ScenesRohit Choudhary, Mansi Sharma, Aditya Wadaskar
Immersive displays such as VR headsets, AR glasses, Multiview displays, Free point televisions have emerged as a new class of display technologies in recent years, offering a better visual experience and viewer engagement as compared to conventional displays. With the evolution of 3D video and display technologies, the consumer market for High Dynamic Range (HDR) cameras and displays is quickly growing. The lack of appropriate experimental data is a critical hindrance for the development of primary research efforts in the field of 3D HDR video technology. Also, the unavailability of sufficient real world multi-exposure experimental dataset is a major bottleneck for HDR imaging research, thereby limiting the quality of experience (QoE) for the viewers. In this paper, we introduce a diversified stereoscopic multi-exposure dataset captured within the campus of Indian Institute of Technology Madras, which is home to a diverse flora and fauna. The dataset is captured using ZED stereoscopic camera and provides intricate scenes of outdoor locations such as gardens, roadside views, festival venues, buildings and indoor locations such as academic and residential areas. The proposed dataset accommodates wide depth range, complex depth structure, complicate object movement, illumination variations, rich color dynamics, texture discrepancy in addition to significant randomness introduced by moving camera and background motion. The proposed dataset is made publicly available to the research community. Furthermore, the procedure for capturing, aligning and calibrating multi-exposure stereo videos and images is described in detail. Finally, we have discussed the progress, challenges, potential use cases and future research opportunities with respect to HDR imaging, depth estimation, consistent tone mapping and 3D HDR coding.
CVMay 10, 2023Code
Post-training Model Quantization Using GANs for Synthetic Data GenerationAthanasios Masouris, Mansi Sharma, Adrian Boguszewski et al.
Quantization is a widely adopted technique for deep neural networks to reduce the memory and computational resources required. However, when quantized, most models would need a suitable calibration process to keep their performance intact, which requires data from the target domain, such as a fraction of the dataset used in model training and model validation (i.e. calibration dataset). In this study, we investigate the use of synthetic data as a substitute for the calibration with real data for the quantization method. We propose a data generation method based on Generative Adversarial Networks that are trained prior to the model quantization step. We compare the performance of models quantized using data generated by StyleGAN2-ADA and our pre-trained DiStyleGAN, with quantization using real data and an alternative data generation method based on fractal images. Overall, the results of our experiments demonstrate the potential of leveraging synthetic data for calibration during the quantization process. In our experiments, the percentage of accuracy degradation of the selected models was less than 0.6%, with our best performance achieved on MobileNetV2 (0.05%). The code is available at: https://github.com/ThanosM97/gsoc2022-openvino
LGMay 13, 2021Code
OpenFL: An open-source framework for Federated LearningG Anthony Reina, Alexey Gruzdev, Patrick Foley et al.
Federated learning (FL) is a computational paradigm that enables organizations to collaborate on machine learning (ML) projects without sharing sensitive data, such as, patient records, financial data, or classified secrets. Open Federated Learning (OpenFL https://github.com/intel/openfl) is an open-source framework for training ML algorithms using the data-private collaborative learning paradigm of FL. OpenFL works with training pipelines built with both TensorFlow and PyTorch, and can be easily extended to other ML and deep learning frameworks. Here, we summarize the motivation and development characteristics of OpenFL, with the intention of facilitating its application to existing ML model training in a production environment. Finally, we describe the first use of the OpenFL framework to train consensus ML models in a consortium of international healthcare organizations, as well as how it facilitates the first computational competition on FL.
CVJun 21, 2022
An Integrated Representation & Compression Scheme Based on Convolutional Autoencoders with 4D DCT Perceptual Encoding for High Dynamic Range Light FieldsSally Khaidem, Mansi Sharma
The emerging and existing light field displays are highly capable of realistic presentation of 3D scenes on auto-stereoscopic glasses-free platforms. The light field size is a major drawback while utilising 3D displays and streaming purposes. When a light field is of high dynamic range, the size increases drastically. In this paper, we propose a novel compression algorithm for a high dynamic range light field which yields a perceptually lossless compression. The algorithm exploits the inter and intra view correlations of the HDR light field by interpreting it to be a four-dimension volume. The HDR light field compression is based on a novel 4DDCT-UCS (4D-DCT Uniform Colour Space) algorithm. Additional encoding of 4DDCT-UCS acquired images by HEVC eliminates intra-frame, inter-frame and intrinsic redundancies in HDR light field data. Comparison with state-of-the-art coders like JPEG-XL and HDR video coding algorithm exhibits superior compression performance of the proposed scheme for real-world light fields.
HCAug 3, 2025
Implicit Search Intent Recognition using EEG and Eye Tracking: Novel Dataset and Cross-User PredictionMansi Sharma, Shuang Chen, Philipp Müller et al.
For machines to effectively assist humans in challenging visual search tasks, they must differentiate whether a human is simply glancing into a scene (navigational intent) or searching for a target object (informational intent). Previous research proposed combining electroencephalography (EEG) and eye-tracking measurements to recognize such search intents implicitly, i.e., without explicit user input. However, the applicability of these approaches to real-world scenarios suffers from two key limitations. First, previous work used fixed search times in the informational intent condition -- a stark contrast to visual search, which naturally terminates when the target is found. Second, methods incorporating EEG measurements addressed prediction scenarios that require ground truth training data from the target user, which is impractical in many use cases. We address these limitations by making the first publicly available EEG and eye-tracking dataset for navigational vs. informational intent recognition, where the user determines search times. We present the first method for cross-user prediction of search intents from EEG and eye-tracking recordings and reach 84.5% accuracy in leave-one-user-out evaluations -- comparable to within-user prediction accuracy (85.5%) but offering much greater flexibility
CVAug 3, 2025
Distinguishing Target and Non-Target Fixations with EEG and Eye Tracking in Realistic Visual ScenesMansi Sharma, Camilo Andrés Martínez Martínez, Benedikt Emanuel Wirth et al.
Distinguishing target from non-target fixations during visual search is a fundamental building block to understand users' intended actions and to build effective assistance systems. While prior research indicated the feasibility of classifying target vs. non-target fixations based on eye tracking and electroencephalography (EEG) data, these studies were conducted with explicitly instructed search trajectories, abstract visual stimuli, and disregarded any scene context. This is in stark contrast with the fact that human visual search is largely driven by scene characteristics and raises questions regarding generalizability to more realistic scenarios. To close this gap, we, for the first time, investigate the classification of target vs. non-target fixations during free visual search in realistic scenes. In particular, we conducted a 36-participants user study using a large variety of 140 realistic visual search scenes in two highly relevant application scenarios: searching for icons on desktop backgrounds and finding tools in a cluttered workshop. Our approach based on gaze and EEG features outperforms the previous state-of-the-art approach based on a combination of fixation duration and saccade-related potentials. We perform extensive evaluations to assess the generalizability of our approach across scene types. Our approach significantly advances the ability to distinguish between target and non-target fixations in realistic scenarios, achieving 83.6% accuracy in cross-user evaluations. This substantially outperforms previous methods based on saccade-related potentials, which reached only 56.9% accuracy.
LGSep 11, 2025
Distinguishing Startle from Surprise Events Based on Physiological SignalsMansi Sharma, Alexandre Duchevet, Florian Daiber et al.
Unexpected events can impair attention and delay decision-making, posing serious safety risks in high-risk environments such as aviation. In particular, reactions like startle and surprise can impact pilot performance in different ways, yet are often hard to distinguish in practice. Existing research has largely studied these reactions separately, with limited focus on their combined effects or how to differentiate them using physiological data. In this work, we address this gap by distinguishing between startle and surprise events based on physiological signals using machine learning and multi-modal fusion strategies. Our results demonstrate that these events can be reliably predicted, achieving a highest mean accuracy of 85.7% with SVM and Late Fusion. To further validate the robustness of our model, we extended the evaluation to include a baseline condition, successfully differentiating between Startle, Surprise, and Baseline states with a highest mean accuracy of 74.9% with XGBoost and Late Fusion.
LGAug 7, 2025
Echo State Networks for Bitcoin Time Series PredictionMansi Sharma, Enrico Sartor, Marc Cavazza et al.
Forecasting stock and cryptocurrency prices is challenging due to high volatility and non-stationarity, influenced by factors like economic changes and market sentiment. Previous research shows that Echo State Networks (ESNs) can effectively model short-term stock market movements, capturing nonlinear patterns in dynamic data. To the best of our knowledge, this work is among the first to explore ESNs for cryptocurrency forecasting, especially during extreme volatility. We also conduct chaos analysis through the Lyapunov exponent in chaotic periods and show that our approach outperforms existing machine learning methods by a significant margin. Our findings are consistent with the Lyapunov exponent analysis, showing that ESNs are robust during chaotic periods and excel under high chaos compared to Boosting and Naïve methods.
AIDec 6, 2021
Tele-EvalNet: A Low-cost, Teleconsultation System for Home based Rehabilitation of Stroke Survivors using Multiscale CNN-LSTM ArchitectureAditya Kanade, Mansi Sharma, M. Manivannan
Technology has an important role to play in the field of Rehabilitation, improving patient outcomes and reducing healthcare costs. However, existing approaches lack clinical validation, robustness and ease of use. We propose Tele-EvalNet, a novel system consisting of two components: a live feedback model and an overall performance evaluation model. The live feedback model demonstrates feedback on exercise correctness with easy to understand instructions highlighted using color markers. The overall performance evaluation model learns a mapping of joint data to scores, given to the performance by clinicians. The model does this by extracting clinically approved features from joint data. Further, these features are encoded to a lower dimensional space with an autoencoder. A novel multi-scale CNN-LSTM network is proposed to learn a mapping of performance data to the scores by leveraging features extracted at multiple scales. The proposed system shows a high degree of improvement in score predictions and outperforms the state-of-the-art rehabilitation models.
IVAug 27, 2021
A Novel Hierarchical Light Field Coding Scheme Based on Hybrid Stacked Multiplicative Layers and Fourier Disparity Layers for Glasses-Free 3D DisplaysJoshitha Ravishankar, Mansi Sharma
This paper presents a novel hierarchical coding scheme for light fields based on transmittance patterns of low-rank multiplicative layers and Fourier disparity layers. The proposed scheme identifies multiplicative layers of light field view subsets optimized using a convolutional neural network for different scanning orders. Our approach exploits the hidden low-rank structure in the multiplicative layers obtained from the subsets of different scanning patterns. The spatial redundancies in the multiplicative layers can be efficiently removed by performing low-rank approximation at different ranks on the Krylov subspace. The intra-view and inter-view redundancies between approximated layers are further removed by HEVC encoding. Next, a Fourier disparity layer representation is constructed from the first subset of the approximated light field based on the chosen hierarchical order. Subsequent view subsets are synthesized by modeling the Fourier disparity layers that iteratively refine the representation with improved accuracy. The critical advantage of the proposed hybrid layered representation and coding scheme is that it utilizes not just spatial and temporal redundancies in light fields but efficiently exploits intrinsic similarities among neighboring sub-aperture images in both horizontal and vertical directions as specified by different predication orders. In addition, the scheme is flexible to realize a range of multiple bitrates at the decoder within a single integrated system. The compression performance of the proposed scheme is analyzed on real light fields. We achieved substantial bitrate savings and maintained good light field reconstruction quality.
CVMay 21, 2021
A Novel 3D-UNet Deep Learning Framework Based on High-Dimensional Bilateral Grid for Edge Consistent Single Image Depth EstimationMansi Sharma, Abheesht Sharma, Kadvekar Rohit Tushar et al.
The task of predicting smooth and edge-consistent depth maps is notoriously difficult for single image depth estimation. This paper proposes a novel Bilateral Grid based 3D convolutional neural network, dubbed as 3DBG-UNet, that parameterizes high dimensional feature space by encoding compact 3D bilateral grids with UNets and infers sharp geometric layout of the scene. Further, another novel 3DBGES-UNet model is introduced that integrate 3DBG-UNet for inferring an accurate depth map given a single color view. The 3DBGES-UNet concatenates 3DBG-UNet geometry map with the inception network edge accentuation map and a spatial object's boundary map obtained by leveraging semantic segmentation and train the UNet model with ResNet backbone. Both models are designed with a particular attention to explicitly account for edges or minute details. Preserving sharp discontinuities at depth edges is critical for many applications such as realistic integration of virtual objects in AR video or occlusion-aware view synthesis for 3D display applications.The proposed depth prediction network achieves state-of-the-art performance in both qualitative and quantitative evaluations on the challenging NYUv2-Depth data. The code and corresponding pre-trained weights will be made publicly available.
CVApr 19, 2021
A Hierarchical Coding Scheme for Glasses-free 3D Displays Based on Scalable Hybrid Layered Representation of Real-World Light FieldsJoshitha R, Mansi Sharma
This paper presents a novel hierarchical coding scheme for light fields based on transmittance patterns of low-rank multiplicative layers and Fourier disparity layers. The proposed scheme learns stacked multiplicative layers from subsets of light field views determined from different scanning orders. The multiplicative layers are optimized using a fast data-driven convolutional neural network (CNN). The spatial correlation in layer patterns is exploited with varying low ranks in factorization derived from singular value decomposition on a Krylov subspace. Further, encoding with HEVC efficiently removes intra-view and inter-view correlation in low-rank approximated layers. The initial subset of approximated decoded views from multiplicative representation is used to construct Fourier disparity layer (FDL) representation. The FDL model synthesizes second subset of views which is identified by a pre-defined hierarchical prediction order. The correlations between the prediction residue of synthesized views is further eliminated by encoding the residual signal. The set of views obtained from decoding the residual is employed in order to refine the FDL model and predict the next subset of views with improved accuracy. This hierarchical procedure is repeated until all light field views are encoded. The critical advantage of proposed hybrid layered representation and coding scheme is that it utilizes not just spatial and temporal redundancies, but efficiently exploits the strong intrinsic similarities among neighboring sub-aperture images in both horizontal and vertical directions as specified by different predication orders. Besides, the scheme is flexible to realize a range of multiple bitrates at the decoder within a single integrated system. The compression performance analyzed with real light field shows substantial bitrate savings, maintaining good reconstruction quality.
CVApr 10, 2021
A New Comprehensive Framework for Multi-Exposure Stereo Coding Utilizing Low Rank Tucker-ALS and 3D-HEVC TechniquesMansi Sharma, Jyotsana Grover
Display technology must offer high dynamic range (HDR) contrast-based depth induction and 3D personalization simultaneously. Efficient algorithms to compress HDR stereo data is critical. Direct capturing of HDR content is complicated due to the high expense and scarcity of HDR cameras. The HDR 3D images could be generated in low-cost by fusing low-dynamic-range (LDR) images acquired using a stereo camera with various exposure settings. In this paper, an efficient scheme for coding multi-exposure stereo images is proposed based on a tensor low-rank approximation scheme. The multi-exposure fusion can be realized to generate HDR stereo output at the decoder for increased realism and exaggerated binocular 3D depth cues. For exploiting spatial redundancy in LDR stereo images, the stack of multi-exposure stereo images is decomposed into a set of projection matrices and a core tensor following an alternating least squares Tucker decomposition model. The compact, low-rank representation of the scene, thus, generated is further processed by 3D extension of High Efficiency Video Coding standard. The encoding with 3D-HEVC enhance the proposed scheme efficiency by exploiting intra-frame, inter-view and the inter-component redundancies in low-rank approximated representation. We consider constant luminance property of IPT and Y'CbCr color space to precisely approximate intensity prediction and perceptually minimize the encoding distortion. Besides, the proposed scheme gives flexibility to adjust the bitrate of tensor latent components by changing the rank of core tensor and its quantization. Extensive experiments on natural scenes demonstrate that the proposed scheme outperforms state-of-the-art JPEG-XT and 3D-HEVC range coding standards.
MMApr 10, 2021
A Versatile Depth Video Encoding Scheme Based on Low-rank Tensor Modeling for Free Viewpoint VideoMansi Sharma, Jyotsana Grover
The compression quality losses of depth sequences determine quality of view synthesis in free-viewpoint video. The depth map intra prediction in 3D extensions of the HEVC applies intra modes with auxiliary depth modeling modes (DMMs) to better preserve depth edges and handle motion discontinuities. Although such modes enable high efficiency compression, but at the cost of very high encoding complexity. Skipping conventional intra coding modes and DMMs in depth coding limits practical applicability of the HEVC for 3D display applications. In this paper, we introduce a novel low-complexity scheme for depth video compression based on low-rank tensor decomposition and HEVC intra coding. The proposed scheme leverages spatial and temporal redundancy by compactly representing the depth sequence as a high-order tensor. Tensor factorization into a set of factor matrices following CANDECOMP PARAFAC (CP) decomposition via alternating least squares give a low-rank approximation of the scene geometry. Further, compression of factor matrices with HEVC intra prediction support arbitrary target accuracy by flexible adjustment of bitrate, varying tensor decomposition ranks and quantization parameters. The results demonstrate proposed approach achieves significant rate gains by efficiently compressing depth planes in low-rank approximated representation. The proposed algorithm is applied to encode depth maps of benchmark Ballet and Breakdancing sequences. The decoded depth sequences are used for view synthesis in a multi-view video system, maintaining appropriate rendering quality.
LGApr 8, 2021
OGGN: A Novel Generalized Oracle Guided Generative Architecture for Modelling Inverse Function of Artificial Neural NetworksMohammad Aaftab, Mansi Sharma
This paper presents a novel Generative Neural Network Architecture for modelling the inverse function of an Artificial Neural Network (ANN) either completely or partially. Modelling the complete inverse function of an ANN involves generating the values of all features that corresponds to a desired output. On the other hand, partially modelling the inverse function means generating the values of a subset of features and fixing the remaining feature values. The feature set generation is a critical step for artificial neural networks, useful in several practical applications in engineering and science. The proposed Oracle Guided Generative Neural Network, dubbed as OGGN, is flexible to handle a variety of feature generation problems. In general, an ANN is able to predict the target values based on given feature vectors. The OGGN architecture enables to generate feature vectors given the predetermined target values of an ANN. When generated feature vectors are fed to the forward ANN, the target value predicted by ANN will be close to the predetermined target values. Therefore, the OGGN architecture is able to map, inverse function of the function represented by forward ANN. Besides, there is another important contribution of this work. This paper also introduces a new class of functions, defined as constraint functions. The constraint functions enable a neural network to investigate a given local space for a longer period of time. Thus, enabling to find a local optimum of the loss function apart from just being able to find the global optimum. OGGN can also be adapted to solve a system of polynomial equations in many variables. The experiments on synthetic datasets validate the effectiveness of OGGN on various use cases.
MMJan 25, 2021
Latent Factor Modeling of Users Subjective Perception for Stereoscopic 3D Video RecommendationBalasubramanyam Appina, Mansi Sharma, Santosh Kumar
Numerous stereoscopic 3D movies are released every year to theaters and created large revenues. Despite the improvement in stereo capturing and 3D video post-production technology, stereoscopic artifacts which cause viewer discomfort continue to appear even in high-budget films. Existing automatic 3D video quality measurement tools can detect distortions in stereoscopic images or videos, but they fail to consider the viewer's subjective perception of those artifacts, and how these distortions affect their choices. In this paper, we introduce a novel recommendation system for stereoscopic 3D movies based on a latent factor model that meticulously analyse the viewer's subjective ratings and influence of 3D video distortions on their preferences. To the best of our knowledge, this is a first-of-its-kind model that recommends 3D movies based on stereo-film quality ratings accounting correlation between the viewer's visual discomfort and stereoscopic-artifact perception. The proposed model is trained and tested on benchmark Nama3ds1-cospad1 and LFOVIAS3DPh2 S3D video quality assessment datasets. The experiments revealed that resulting matrix-factorization based recommendation system is able to generalize considerably better for the viewer's subjective ratings.
CVSep 20, 2019
Multi-user Augmented Reality Application for Video Communication in Virtual SpaceKumar Mridul, M. Ramanathan, Kunal Ahirwar et al.
Communication is the most useful tool to impart knowledge, understand ideas, clarify thoughts and expressions, organize plan and manage every single day-to-day activity. Although there are different modes of communication, physical barrier always affects the clarity of the message due to the absence of body language and facial expressions. These barriers are overcome by video calling, which is technically the most advance mode of communication at present. The proposed work concentrates around the concept of video calling in a more natural and seamless way using Augmented Reality (AR). AR can be helpful in giving the users an experience of physical presence in each other's environment. Our work provides an entirely new platform for video calling, wherein the users can enjoy the privilege of their own virtual space to interact with the individual's environment. Moreover, there is no limitation of sharing the same screen space. Any number of participants can be accommodated over a single conference without having to compromise the screen size.