Tejasvi Ravi

AI
h-index: 2
4 papers
20 citations
Novelty: 28%
AI Score: 34

4 Papers

73.5 SD Mar 11
MoXaRt: Audio-Visual Object-Guided Sound Interaction for XR

Tianyu Xu, Sieun Kim, Qianhui Zheng et al.

In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt's core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g., faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixes of up to 5 concurrent sources (e.g., 2 voices + 3 instruments) with ~2 second processing latency. We validate MoXaRt through a technical evaluation on a new dataset of 30 one-minute recordings featuring concurrent speech and music, and a 22-participant user study. Empirical results indicate that our system significantly enhances speech intelligibility, yielding a 36.2% (p < 0.01) increase in listening comprehension within adversarial acoustic environments while substantially reducing cognitive load (p < 0.001), thereby paving the way for more perceptive and socially adept XR experiences.

61.7 SD Apr 10
MAGE: Modality-Agnostic Music Generation and Editing

Muhammad Usama Saleem, Tejasvi Ravi, Tianyu Xu et al.

Multimodal music creation requires models that can both generate audio from high-level cues and edit existing mixtures in a targeted manner. Yet most multimodal music systems are built for a single task and a fixed prompting interface, making their conditioning brittle when guidance is ambiguous, temporally misaligned, or partially missing. Common additive fusion or feature concatenation further weakens cross-modal grounding, often causing prompt drift and spurious musical content during generation and editing. We propose MAGE, a modality-agnostic framework that unifies multimodal music generation and mixture-grounded editing within a single continuous latent formulation. At its core, MAGE uses a Controlled Multimodal FluxFormer, a flow-based Transformer that learns controllable latent trajectories for synthesis and editing under any available subset of conditions. To improve grounding, we introduce Audio-Visual Nexus Alignment to select temporally consistent visual evidence for the audio timeline, and a cross-gated modulation mechanism that applies multiplicative control from aligned visual and textual cues to the audio latents, suppressing unsupported components rather than injecting them. Finally, we train with a dynamic modality-masking curriculum that exposes the model to text-only, visual-only, joint multimodal, and mixture-guided settings, enabling robust inference under missing modalities without training separate models. Experiments on the MUSIC benchmark show that MAGE supports effective multimodal-guided music generation and targeted editing, achieving competitive quality while offering a lightweight and flexible interface tailored to practical music workflows.
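The two mechanisms named in the abstract above, modality masking and multiplicative (gated) conditioning, can be illustrated with a toy sketch. This is not the MAGE implementation: the setting names, the modalities each setting keeps, the uniform sampling, and every function name below are assumptions made for illustration only, and plain Python lists stand in for latent tensors.

```python
import math
import random

# Hypothetical mapping from training setting to the modalities it keeps.
# The abstract lists the four settings; which inputs "mixture_guided"
# exposes is an assumption here.
SETTINGS = {
    "text_only": {"text"},
    "visual_only": {"visual"},
    "joint": {"text", "visual"},
    "mixture_guided": {"mixture"},
}


def sample_setting(rng):
    """Pick one setting uniformly (the paper's actual curriculum schedule is not given here)."""
    name = rng.choice(sorted(SETTINGS))
    return name, SETTINGS[name]


def mask_conditions(conditions, keep):
    """Replace dropped modalities with None so only the kept subset is visible."""
    return {k: (v if k in keep else None) for k, v in conditions.items()}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cross_gate(audio_latent, cue_logits):
    """Multiplicative control: scale each latent channel by a gate in (0, 1)."""
    return [a * sigmoid(g) for a, g in zip(audio_latent, cue_logits)]


def additive_fuse(audio_latent, cue_features):
    """Additive fusion, shown for contrast: cue features are summed into the latent."""
    return [a + c for a, c in zip(audio_latent, cue_features)]


if __name__ == "__main__":
    rng = random.Random(0)
    name, keep = sample_setting(rng)
    batch = {"text": "solo violin", "visual": [0.1, 0.9], "mixture": [0.3, 0.2]}
    print(name, mask_conditions(batch, keep))

    silent = [0.0, 0.0]      # latent channels with no audio content
    cue = [5.0, 5.0]         # strong conditioning signal
    print(cross_gate(silent, cue))     # gating cannot create content from zeros
    print(additive_fuse(silent, cue))  # addition injects the cue directly
```

The contrast in the last two calls is the point of the sketch: a multiplicative gate can only attenuate or pass what is already in the audio latent, whereas additive fusion can introduce content the audio does not support.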

AI Oct 21, 2024
Opportunities and Challenges of Generative-AI in Finance

Akshar Prabhu Desai, Ganesh Satish Mallya, Mohammad Luqman et al.

Gen-AI techniques can improve understanding of context and nuance in language modeling, translate between languages, handle large volumes of data, provide fast, low-latency responses, and be fine-tuned for various tasks and domains. In this manuscript, we present a comprehensive overview of the applications of Gen-AI techniques in the finance domain. In particular, we present the opportunities and challenges associated with the use of Gen-AI techniques. We also illustrate the methodologies that can be used to train Gen-AI techniques and present the application areas of Gen-AI technologies in the finance ecosystem. To the best of our knowledge, this work represents the most comprehensive summarization of Gen-AI techniques within the financial domain. The analysis is designed to give a deep overview of areas marked for substantial advancement while simultaneously pinpointing those warranting future prioritization. We also hope that this work will serve as a conduit between finance and other domains, thus fostering the cross-pollination of innovative concepts and practices.

AI Nov 10, 2024
Gen-AI for User Safety: A Survey

Akshar Prabhu Desai, Tejasvi Ravi, Mohammad Luqman et al.

Machine learning and data mining techniques (i.e., supervised and unsupervised techniques) are used across domains to detect user safety violations. Examples include classifiers used to detect whether an email is spam or a web page is requesting bank login information. However, existing ML/DM classifiers are limited in their ability to understand natural languages with respect to context and nuance. These challenges are overcome with the arrival of Gen-AI techniques, which also bring an inherent ability to translate between languages and to be fine-tuned across tasks and domains. In this manuscript, we provide a comprehensive overview of the work done using Gen-AI techniques for user safety. In particular, we first describe the domains (e.g., phishing, malware, content moderation, counterfeit, physical safety) across which Gen-AI techniques have been applied. Next, we describe how Gen-AI techniques can be used in conjunction with various data modalities (i.e., text, images, videos, audio, and executable binaries) to detect violations of user safety. Further, we also provide an overview of how Gen-AI techniques can be used in an adversarial setting. We believe that this work represents the first summarization of Gen-AI techniques for user safety.