Vishnu Raj

IT
h-index139
16 papers
296 citations
Novelty 51%
AI Score 55

16 Papers

SD Apr 19 Code
Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation

Vaibhavi Lokegaonkar, Aryan Vijay Bhosale, Vishnu Raj et al.

Video-to-music (V2M) is the fundamental task of creating background music for an input video. Recent V2M models typically achieve audiovisual alignment by relying on visual conditioning alone, and they give the end user limited semantic and stylistic control. In this paper, we present Video-Robin, a novel text-conditioned video-to-music generation model that enables fast, high-quality, semantically aligned music generation for video content. To balance musical fidelity and semantic understanding, Video-Robin integrates autoregressive planning with diffusion-based synthesis. Specifically, an autoregressive module models global structure by semantically aligning visual and textual inputs to produce high-level music latents. These latents are subsequently refined into coherent, high-fidelity music using local Diffusion Transformers. By factoring semantically driven planning into diffusion-based synthesis, Video-Robin enables fine-grained creator control without sacrificing audio realism. Our proposed model outperforms both baselines that accept only video input and baselines conditioned on additional features, on in-distribution and out-of-distribution benchmarks, with a 2.21x inference speedup over SOTA. We will open-source everything upon paper acceptance.

LG Jul 4, 2022
Incorporating functional summary information in Bayesian neural networks using a Dirichlet process likelihood approach

Vishnu Raj, Tianyu Cui, Markus Heinonen et al.

Bayesian neural networks (BNNs) can account for both aleatoric and epistemic uncertainty. However, in BNNs the priors are often specified over the weights, which rarely reflects true prior knowledge in large and complex neural network architectures. We present a simple approach to incorporate prior knowledge in BNNs based on external summary information about the predicted classification probabilities for a given dataset. The available summary information is incorporated as augmented data and modeled with a Dirichlet process, and we derive the corresponding Summary Evidence Lower BOund. The approach is founded on Bayesian principles, and all hyperparameters have a proper probabilistic interpretation. We show how the method can inform the model about task difficulty and class imbalance. Extensive experiments show that, with negligible computational overhead, our method parallels and in many cases outperforms popular alternatives in accuracy, uncertainty calibration, and robustness against corruptions with both balanced and imbalanced data.

CV Dec 26, 2025
DPAR: Dynamic Patchification for Efficient Autoregressive Visual Generation

Divyansh Srivastava, Akshay Mehra, Pranav Maneriker et al.

Decoder-only autoregressive image generation typically relies on fixed-length tokenization schemes whose token counts grow quadratically with resolution, substantially increasing the computational and memory demands of attention. We present DPAR, a novel decoder-only autoregressive model that dynamically aggregates image tokens into a variable number of patches for efficient image generation. Our work is the first to demonstrate that next-token prediction entropy from a lightweight and unsupervised autoregressive model provides a reliable criterion for merging tokens into larger patches based on information content. DPAR makes minimal modifications to the standard decoder architecture, ensuring compatibility with multimodal generation frameworks and allocating more compute to the generation of high-information image regions. Further, we demonstrate that training with dynamically sized patches yields representations that are robust to patch boundaries, allowing DPAR to scale to larger patch sizes at inference. DPAR reduces token count by 1.81x and 2.06x at ImageNet 256 and 384 generation resolutions respectively, leading to a reduction of up to 40% in training FLOPs. Further, our method exhibits faster convergence and improves FID by up to 27.1% relative to baseline models.
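
The entropy-based merging idea in the abstract above can be illustrated with a small sketch. It is not DPAR's actual rule: the function name `entropy_patches`, the fixed `threshold`, and the `max_patch` cap are all assumptions made for illustration. The sketch assumes per-token next-token entropies are already available and starts a new patch whenever a token's entropy exceeds the threshold or the current patch is full.

```python
def entropy_patches(entropies, threshold, max_patch=4):
    """Group consecutive token indices into variable-size patches.

    Hypothetical rule (an assumption, not the paper's): a token whose
    entropy exceeds `threshold` opens a new patch, and a patch never
    holds more than `max_patch` tokens.
    """
    patches = []
    current = []
    for idx, h in enumerate(entropies):
        # Close the running patch at a high-entropy token or when it is full.
        if current and (h > threshold or len(current) >= max_patch):
            patches.append(current)
            current = []
        current.append(idx)
    if current:
        patches.append(current)
    return patches


if __name__ == "__main__":
    ents = [0.1, 0.1, 2.0, 0.1, 0.1, 0.1, 0.1, 0.1]
    print(entropy_patches(ents, threshold=1.0))
```

Under this rule, low-entropy runs collapse into fewer, larger patches, while high-information positions always begin a patch of their own.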

HC Dec 11, 2025
CompanionCast: A Multi-Agent Conversational AI Framework with Spatial Audio for Social Co-Viewing Experiences

Yiyang Wang, Chen Chen, Tica Lin et al.

Social presence is central to the enjoyment of watching content together, yet modern media consumption is increasingly solitary. We investigate whether multi-agent conversational AI systems can recreate the dynamics of shared viewing experiences across diverse content types. We present CompanionCast, a general framework for orchestrating multiple role-specialized AI agents that respond to video content using multimodal inputs, speech synthesis, and spatial audio. Distinctly, CompanionCast integrates an LLM-as-a-Judge module that iteratively scores and refines conversations across five dimensions (relevance, authenticity, engagement, diversity, personality consistency). We validate this framework through sports viewing, a domain with rich dynamics and strong social traditions, where a pilot study with soccer fans suggests that multi-agent interaction improves perceived social presence compared to solo viewing. We contribute: (1) a generalizable framework for orchestrating multi-agent conversations around multimodal video content, (2) a novel evaluator-agent pipeline for conversation quality control, and (3) exploratory evidence of increased social presence in AI-mediated co-viewing. We discuss challenges and future directions for applying this approach to diverse viewing contexts including entertainment, education, and collaborative watching experiences.

CV Sep 12, 2025
Event Camera Guided Visual Media Restoration & 3D Reconstruction: A Survey

Aupendu Kar, Vishnu Raj, Guan-Ming Su

Event cameras are bio-inspired sensors that asynchronously capture per-pixel brightness changes and output a stream of events encoding the polarity, location, and time of these changes. These sensors are advancing rapidly as an emerging field of study, driven by their low latency, reduced power consumption, and ultra-high capture rates. This survey explores the evolution of fusing event-stream capture with traditional frame-based capture, highlighting how this synergy significantly benefits various video restoration and 3D reconstruction tasks. The paper systematically reviews major deep learning contributions to image/video enhancement and restoration, focusing on two dimensions: temporal enhancement (such as frame interpolation and motion deblurring) and spatial enhancement (including super-resolution, low-light and HDR enhancement, and artifact reduction). This paper also explores how the 3D reconstruction domain evolves with the advancement of event-driven fusion. Diverse topics are covered, with in-depth discussions on recent works for improving visual quality under challenging conditions. Additionally, the survey compiles a comprehensive list of openly available datasets, enabling reproducible research and benchmarking. By consolidating recent progress and insights, this survey aims to inspire further research into leveraging event camera systems, especially in combination with deep learning, for advanced visual media restoration and enhancement.

LG Dec 12, 2024
Towards modeling evolving longitudinal health trajectories with a transformer-based deep learning model

Hans Moen, Vishnu Raj, Andrius Vabalas et al.

Health registers contain rich information about individuals' health histories. Here our interest lies in understanding how individuals' health trajectories evolve in a nationwide longitudinal dataset with coded features, such as clinical codes, procedures, and drug purchases. We introduce a straightforward approach for training a Transformer-based deep learning model in a way that lets us analyze how individuals' trajectories change over time. This is achieved by modifying the training objective and by applying a causal attention mask. We focus here on a general task of predicting the onset of a range of common diseases in a given future forecast interval. However, instead of providing a single prediction about diagnoses that could occur in this forecast interval, our approach enables the model to provide continuous predictions at every time point up until, and conditioned on, the time of the forecast period. We find that this model performs comparably to other models, including a bi-directional transformer model, in terms of basic prediction performance while at the same time offering promising trajectory modeling properties. We explore a couple of ways to use this model for analyzing health trajectories and aiding in early detection of events that forecast possible later disease onsets. We hypothesize that this method may be helpful in continuous monitoring of people's health trajectories and enabling interventions in ongoing health trajectories, as well as being useful in retrospective analyses.
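
A causal attention mask, as mentioned in the abstract above, restricts each time point to attend only to itself and earlier time points. The following is a minimal sketch of the general technique in plain Python, not the authors' implementation; the helper names `causal_mask` and `masked_softmax` are illustrative.

```python
import math


def causal_mask(n):
    """Lower-triangular mask: position i may attend to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax that assigns zero weight to masked-out positions."""
    out = []
    for row, allowed in zip(scores, mask):
        # Subtract the largest allowed score for numerical stability.
        m = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


if __name__ == "__main__":
    scores = [[0.0, 0.0, 0.0] for _ in range(3)]
    for row in masked_softmax(scores, causal_mask(3)):
        print(row)
```

Because no position sees the future, the model's output at every time point is a valid prediction conditioned only on the history up to that point.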

CV Nov 23, 2025
MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models

Xiyang Wu, Zongxia Li, Jihui Jin et al.

Vision Language Models (VLMs) perform well on standard video tasks but struggle with physics-driven reasoning involving motion dynamics and spatial interactions. This limitation reduces their ability to interpret real or AI-generated content (AIGC) videos and to generate physically consistent content. We present an approach that addresses this gap by translating physical-world context cues into interpretable representations aligned with VLMs' perception, comprehension, and reasoning. We introduce MASS-Bench, a comprehensive benchmark consisting of 4,350 real-world and AIGC videos and 8,361 free-form video question-answering pairs focused on physics-related comprehension tasks, with detailed annotations including visual detections, sub-segment grounding, and full-sequence 3D motion tracking of entities. We further present MASS, a model-agnostic method that injects spatial-temporal signals into the VLM language space via depth-based 3D encoding and visual grounding, coupled with a motion tracker for object dynamics. To strengthen cross-modal alignment and reasoning, we apply reinforcement fine-tuning. Experiments and ablations show that our refined VLMs outperform comparable and larger baselines, as well as prior state-of-the-art models, by 8.7% and 6.0%, achieving performance comparable to closed-source SoTA VLMs such as Gemini-2.5-Flash on physics reasoning and comprehension. These results validate the effectiveness of our approach.

AS Sep 25, 2025
Enhanced Generative Machine Listener

Vishnu Raj, Gouthaman KV, Shiv Gehlot et al.

We present GMLv2, a reference-based model designed for the prediction of subjective audio quality as measured by MUSHRA scores. GMLv2 introduces a Beta distribution-based loss to model the listener ratings and incorporates additional neural audio coding (NAC) subjective datasets to extend its generalization and applicability. Extensive evaluations on diverse test sets demonstrate that the proposed GMLv2 consistently outperforms widely used metrics, such as PEAQ and ViSQOL, both in terms of correlation with subjective scores and in reliably predicting these scores across diverse content types and codec configurations. Consequently, GMLv2 offers a scalable and automated framework for perceptual audio quality evaluation, poised to accelerate research and development in modern audio coding technologies.
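
One common way to build a Beta distribution-based loss is the negative log-likelihood of a rating under a predicted Beta(alpha, beta). The sketch below shows that form and is not taken from the GMLv2 paper: rescaling MUSHRA scores from 0–100 to the unit interval, the clipping constant `eps`, and the function name `beta_nll` are assumptions.

```python
import math


def beta_nll(score, alpha, beta, eps=1e-4):
    """Negative log-likelihood of a 0-100 rating under Beta(alpha, beta).

    Assumption: the rating is rescaled to (0, 1) and clipped by `eps`
    so that the logarithms stay finite at the endpoints.
    """
    y = min(max(score / 100.0, eps), 1.0 - eps)
    # log B(alpha, beta), the log of the Beta normalizing constant.
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return log_norm - (alpha - 1.0) * math.log(y) - (beta - 1.0) * math.log(1.0 - y)


if __name__ == "__main__":
    # A distribution concentrated near 0.8 penalizes a rating of 20 far more than 80.
    print(beta_nll(80, 8.0, 2.0), beta_nll(20, 8.0, 2.0))
```

A Beta likelihood respects the bounded rating scale, which a Gaussian loss on raw scores does not.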

LG Jun 13, 2020
Understanding Learning Dynamics of Binary Neural Networks via Information Bottleneck

Vishnu Raj, Nancy Nayak, Sheetal Kalyani

Compact neural networks are essential for affordable and power-efficient deep learning solutions. Binary Neural Networks (BNNs) take compactification to the extreme by constraining both weights and activations to two levels, $\{+1, -1\}$. However, training BNNs is not easy due to the discontinuity in activation functions, and the training dynamics of BNNs are not well understood. In this paper, we present an information-theoretic perspective of BNN training. We analyze BNNs through the Information Bottleneck principle and observe that the training dynamics of BNNs are considerably different from those of Deep Neural Networks (DNNs). While DNNs have separate empirical risk minimization and representation compression phases, our numerical experiments show that in BNNs, both these phases are simultaneous. Since BNNs have less expressive capacity, they tend to find efficient hidden representations concurrently with label fitting. Experiments on multiple datasets support these observations, and we see consistent behavior across different activation functions in BNNs.

SP Jan 25, 2020
Deep Reinforcement Learning based Blind mmWave MIMO Beam Alignment

Vishnu Raj, Nancy Nayak, Sheetal Kalyani

Directional beamforming is a crucial component for realizing robust wireless communication systems using millimeter wave (mmWave) technology. Beam alignment using brute-force search of the space introduces time overhead, while location-aided blind beam alignment adds hardware requirements to the system. In this paper, we introduce a method for blind beam alignment based on the RF fingerprints of user equipment obtained by the base stations. The proposed system performs blind beam alignment in a cellular environment with multiple base stations and multiple mobile users using deep reinforcement learning. We present a novel neural network architecture that can handle a mix of both continuous and discrete actions and use policy gradient methods to train the model. Our results show that the proposed method can achieve a data rate of up to four times that of the traditional method without any overhead.

IT Oct 13, 2019
Beyond 5G: Leveraging Cell Free TDD Massive MIMO using Cascaded Deep learning

Navaneet Athreya, Vishnu Raj, Sheetal Kalyani

This paper deals with the calibration of Time Division Duplexing (TDD) reciprocity in an Orthogonal Frequency Division Multiplexing (OFDM) based Cell Free Massive MIMO system where the responses of the Radio Frequency (RF) chains render the end-to-end channel non-reciprocal, even though the physical wireless channel is reciprocal. We further address the non-availability of the uplink channel estimates at locations other than pilot subcarriers and propose a single-shot solution to estimate the downlink channel at all subcarriers from the uplink channel at selected pilot subcarriers. We propose a cascade of two Deep Neural Networks (DNNs) to achieve the objective. The proposed method is easily scalable and removes the need for relative reciprocity calibration based on the cooperation of antennas, which usually introduces dependency in Cell Free Massive MIMO systems.

IT Apr 18, 2019
Design of Communication Systems using Deep Learning: A Variational Inference Perspective

Vishnu Raj, Sheetal Kalyani

Recent research in the design of end-to-end communication systems using deep learning has produced models which can outperform traditional communication schemes. Most of these architectures leverage autoencoders to design the encoder at the transmitter and the decoder at the receiver, and train them jointly by modeling transmit symbols as latent codes from the encoder. However, in communication systems, the receiver has to work with noise-corrupted versions of transmit symbols. Traditional autoencoders are not designed to work with latent codes corrupted with noise. In this work, we provide a framework to design end-to-end communication systems which accounts for the existence of noise-corrupted transmit symbols. The proposed method uses a deep neural architecture. An objective function for optimizing these models is derived based on the concepts of variational inference. Further, domain knowledge such as channel type can be systematically integrated into the objective. Through numerical simulation, the proposed method is shown to consistently produce models with better packing density, and to reach it faster, in multiple popular channel models compared to previous works leveraging deep learning models.

IT Apr 30, 2018
A Centralized Multi-stage Non-parametric Learning Algorithm for Opportunistic Spectrum Access

Thulasi Tholeti, Vishnu Raj, Sheetal Kalyani

Owing to the ever-increasing demand for wireless spectrum, Cognitive Radio (CR) was introduced as a technique to attain high spectral efficiency. As the number of secondary users (SUs) connecting to the cognitive radio network is on the rise, there is an imminent need for centralized algorithms that provide high throughput and energy efficiency of the SUs while ensuring minimum interference to the licensed users. In this work, we propose a multi-stage algorithm that (1) effectively assigns the available channel to the SUs, (2) employs a non-parametric learning framework to estimate the primary traffic distribution to minimize sensing, and (3) uses an adaptive framework to ensure that collisions with the primary user stay below the specified threshold. We provide comprehensive empirical validation of the method against other approaches.

LG Aug 5, 2017
An aggregating strategy for shifting experts in discrete sequence prediction

Vishnu Raj, Sheetal Kalyani

We study how we can adapt a predictor to a non-stationary environment with advice from multiple experts. We study the problem under complete feedback when the best expert changes over time from a decision-theoretic point of view. The proposed algorithm is based on the popular exponential weighting method with exponential discounting. We provide theoretical results bounding regret under the exponential discounting setting. An upper bound on regret is derived for the finite time horizon problem. Numerical verification on different real-life datasets is provided to show the utility of the proposed algorithm.
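
Exponential weighting with exponential discounting, as named in the abstract above, can be sketched as follows. This shows the general technique rather than the paper's exact algorithm: the 0/1 loss, the weighted-majority prediction rule, the parameter values, and the function name `discounted_exp_weights` are assumptions.

```python
import math


def discounted_exp_weights(expert_preds, outcomes, eta=2.0, gamma=0.9):
    """Predict a discrete sequence by weighted vote over expert advice.

    Each expert's weight is exp(-eta * discounted cumulative 0/1 loss);
    past losses are multiplied by `gamma` every round so that old
    mistakes fade and the forecaster can follow a shifting best expert.
    """
    n_experts = len(expert_preds[0])
    disc_loss = [0.0] * n_experts
    predictions = []
    for preds, y in zip(expert_preds, outcomes):
        # Shift by the minimum loss for numerical stability.
        base = min(disc_loss)
        weights = [math.exp(-eta * (loss - base)) for loss in disc_loss]
        votes = {}
        for p, w in zip(preds, weights):
            votes[p] = votes.get(p, 0.0) + w
        predictions.append(max(votes, key=votes.get))
        # Full feedback: every expert's loss is observed and discounted.
        disc_loss = [
            gamma * loss + (0.0 if p == y else 1.0)
            for loss, p in zip(disc_loss, preds)
        ]
    return predictions


if __name__ == "__main__":
    advice = [("a", "b")] * 40          # expert 0 says "a", expert 1 says "b"
    truth = ["a"] * 20 + ["b"] * 20     # the best expert switches at t = 20
    print("".join(discounted_exp_weights(advice, truth)))
```

With `gamma < 1`, the forecaster switches to the newly best expert a few rounds after the change instead of being anchored by the full loss history.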

IT Jul 31, 2017
Spectrum Access In Cognitive Radio Using A Two Stage Reinforcement Learning Approach

Vishnu Raj, Irene Dias, Thulasi Tholeti et al.

With the advent of the 5th generation of wireless standards and an increasing demand for higher throughput, methods to improve the spectral efficiency of wireless systems have become very important. In the context of cognitive radio, a substantial increase in throughput is possible if the secondary user can make smart decisions regarding which channel to sense and when or how often to sense. Here, we propose an algorithm to not only select a channel for data transmission but also to predict how long the channel will remain unoccupied so that the time spent on channel sensing can be minimized. Our algorithm learns in two stages: a reinforcement learning approach for channel selection and a Bayesian approach to determine the optimal duration for which sensing can be skipped. Comparisons with other learning methods are provided through extensive simulations. We show that the number of sensing operations is minimized with a negligible increase in primary interference; this implies that the secondary user spends less energy on sensing and achieves higher throughput by saving on sensing.

ML Jul 31, 2017
Taming Non-stationary Bandits: A Bayesian Approach

Vishnu Raj, Sheetal Kalyani

We consider the multi-armed bandit problem in non-stationary environments. Based on the Bayesian method, we propose a variant of Thompson Sampling which can be used in both rested and restless bandit scenarios. Applying discounting to the parameters of the prior distribution, we describe a way to systematically reduce the effect of past observations. Further, we derive the exact expression for the probability of picking sub-optimal arms. By increasing the exploitative value of Bayes' samples, we also provide an optimistic version of the algorithm. Extensive empirical analysis is conducted under various scenarios to validate the utility of the proposed algorithms. A comparison study with various state-of-the-art algorithms is also included.
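
The discounting idea in the abstract above can be sketched for Bernoulli rewards with Beta posteriors. This is a minimal illustration, not necessarily the paper's exact update: the choice to discount all arms every round, the Beta(1, 1) base prior, and the names `discounted_thompson_sampling` and `reward_fn` are assumptions.

```python
import random


def discounted_thompson_sampling(reward_fn, n_arms, horizon, gamma=0.95, seed=0):
    """Thompson Sampling with exponentially discounted Beta parameters.

    `reward_fn(arm, t)` is a hypothetical callback returning 0 or 1.
    Discounted success and failure counts are added to a Beta(1, 1)
    prior, so unused arms drift back toward the uniform prior.
    """
    rng = random.Random(seed)
    succ = [0.0] * n_arms   # discounted success counts
    fail = [0.0] * n_arms   # discounted failure counts
    choices = []
    for t in range(horizon):
        # Draw one posterior sample per arm and play the largest.
        samples = [rng.betavariate(succ[k] + 1.0, fail[k] + 1.0) for k in range(n_arms)]
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = reward_fn(arm, t)
        # Discount every arm's evidence, then add the new observation.
        succ = [gamma * s for s in succ]
        fail = [gamma * f for f in fail]
        succ[arm] += reward
        fail[arm] += 1 - reward
        choices.append(arm)
    return choices, succ, fail


if __name__ == "__main__":
    def switching_reward(arm, t):
        best = 0 if t < 200 else 1   # the best arm changes at t = 200
        return 1 if arm == best else 0

    picks, _, _ = discounted_thompson_sampling(switching_reward, 2, 400, gamma=0.9)
    print(sum(picks[300:]), "of the last 100 pulls chose arm 1")
```

Because the total discounted evidence is bounded by 1 / (1 - gamma), the posterior never becomes so concentrated that the sampler stops exploring, which is what lets it recover after the best arm changes.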