Ritesh Kumar

CL
h-index40
37papers
6,813citations
Novelty21%
AI Score52

37 Papers

CLMay 7, 2022
UniMorph 4.0: Universal Morphology

Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa et al. · eth-zurich, microsoft-research

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements made on several fronts over the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g. missing gender and macron information. We have also amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.

CLApr 26, 2022Code
Developing Universal Dependency Treebanks for Magahi and Braj

Mohit Raj, Shyam Ratan, Deepak Alok et al.

In this paper, we discuss the development of treebanks for two low-resourced Indian languages - Magahi and Braj based on the Universal Dependencies framework. The Magahi treebank contains 945 sentences and Braj treebank around 500 sentences marked with their lemmas, part-of-speech, morphological features and universal dependencies. This paper gives a description of the different dependency relationship found in the two languages and give some statistics of the two treebanks. The dataset will be made publicly available on Universal Dependency (UD) repository (https://github.com/UniversalDependencies/UD_Magahi-MGTB/tree/master) in the next(v2.10) release.

CLMar 22, 2022Code
Demo of the Linguistic Field Data Management and Analysis System -- LiFE

Siddharth Singh, Ritesh Kumar, Shyam Ratan et al.

In the proposed demo, we will present a new software - Linguistic Field Data Management and Analysis System - LiFE (https://github.com/kmi-linguistics/life) - an open-source, web-based linguistic data management and analysis application that allows for systematic storage, management, sharing and usage of linguistic data collected from the field. The application allows users to store lexical items, sentences, paragraphs, audio-visual content with rich glossing / annotation; generate interactive and print dictionaries; and also train and use natural language processing tools and models for various purposes using this data. Since its a web-based application, it also allows for seamless collaboration among multiple persons and sharing the data, models, etc with each other. The system uses the Python-based Flask framework and MongoDB in the backend and HTML, CSS and Javascript at the frontend. The interface allows creation of multiple projects that could be shared with the other users. At the backend, the application stores the data in RDF format so as to allow its release as Linked Data over the web using semantic web technologies - as of now it makes use of the OntoLex-Lemon for storing the lexical data and Ligt for storing the interlinear glossed text and then internally linking it to the other linked lexicons and databases such as DBpedia and WordNet. Furthermore it provides support for training the NLP systems using scikit-learn and HuggingFace Transformers libraries as well as make use of any model trained using these libraries - while the user interface itself provides limited options for tuning the system, an externally-trained model could be easily incorporated within the application; similarly the dataset itself could be easily exported into a standard machine-readable format like JSON or CSV that could be consumed by other programs and pipelines.

CLJun 26, 2022
Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi

Ritesh Kumar, Siddharth Singh, Shyam Ratan et al.

In this paper we discuss an in-progress work on the development of a speech corpus for four low-resource Indo-Aryan languages -- Awadhi, Bhojpuri, Braj and Magahi using the field methods of linguistic data collection. The total size of the corpus currently stands at approximately 18 hours (approx. 4-5 hours each language) and it is transcribed and annotated with grammatical information such as part-of-speech tags, morphological features and Universal dependency relationships. We discuss our methodology for data collection in these languages, most of which was done in the middle of the COVID-19 pandemic, with one of the aims being to generate some additional income for low-income groups speaking these languages. In the paper, we also discuss the results of the baseline experiments for automatic speech recognition system in these languages.

CLOct 3, 2023
ProtoNER: Few shot Incremental Learning for Named Entity Recognition using Prototypical Networks

Ritesh Kumar, Saurabh Goyal, Ashish Verma et al.

Key value pair (KVP) extraction or Named Entity Recognition(NER) from visually rich documents has been an active area of research in document understanding and data extraction domain. Several transformer based models such as LayoutLMv2, LayoutLMv3, and LiLT have emerged achieving state of the art results. However, addition of even a single new class to the existing model requires (a) re-annotation of entire training dataset to include this new class and (b) retraining the model again. Both of these issues really slow down the deployment of updated model. \\ We present \textbf{ProtoNER}: Prototypical Network based end-to-end KVP extraction model that allows addition of new classes to an existing model while requiring minimal number of newly annotated training samples. The key contributions of our model are: (1) No dependency on dataset used for initial training of the model, which alleviates the need to retain original training dataset for longer duration as well as data re-annotation which is very time consuming task, (2) No intermediate synthetic data generation which tends to add noise and results in model's performance degradation, and (3) Hybrid loss function which allows model to retain knowledge about older classes as well as learn about newly added classes.\\ Experimental results show that ProtoNER finetuned with just 30 samples is able to achieve similar results for the newly added classes as that of regular model finetuned with 2600 samples.

CVFeb 3
VOILA: Value-of-Information Guided Fidelity Selection for Cost-Aware Multimodal Question Answering

Rahul Atul Bhope, K. R. Jayaram, Vinod Muthusamy et al.

Despite significant costs from retrieving and processing high-fidelity visual inputs, most multimodal vision-language systems operate at fixed fidelity levels. We introduce VOILA, a framework for Value-Of-Information-driven adaptive fidelity selection in Visual Question Answering (VQA) that optimizes what information to retrieve before model execution. Given a query, VOILA uses a two-stage pipeline: a gradient-boosted regressor estimates correctness likelihood at each fidelity from question features alone, then an isotonic calibrator refines these probabilities for reliable decision-making. The system selects the minimum-cost fidelity maximizing expected utility given predicted accuracy and retrieval costs. We evaluate VOILA across three deployment scenarios using five datasets (VQA-v2, GQA, TextVQA, LoCoMo, FloodNet) and six Vision-Language Models (VLMs) with 7B-235B parameters. VOILA consistently achieves 50-60% cost reductions while retaining 90-95% of full-resolution accuracy across diverse query types and model architectures, demonstrating that pre-retrieval fidelity selection is vital to optimize multimodal inference under resource constraints.

CVApr 23, 2023
An Efficient Ensemble Explainable AI (XAI) Approach for Morphed Face Detection

Rudresh Dwivedi, Ritesh Kumar, Deepak Chopra et al.

The extensive utilization of biometric authentication systems have emanated attackers / imposters to forge user identity based on morphed images. In this attack, a synthetic image is produced and merged with genuine. Next, the resultant image is user for authentication. Numerous deep neural convolutional architectures have been proposed in literature for face Morphing Attack Detection (MADs) to prevent such attacks and lessen the risks associated with them. Although, deep learning models achieved optimal results in terms of performance, it is difficult to understand and analyse these networks since they are black box/opaque in nature. As a consequence, incorrect judgments may be made. There is, however, a dearth of literature that explains decision-making methods of black box deep learning models for biometric Presentation Attack Detection (PADs) or MADs that can aid the biometric community to have trust in deep learning-based biometric systems for identification and authentication in various security applications such as border control, criminal database establishment etc. In this work, we present a novel visual explanation approach named Ensemble XAI integrating Saliency maps, Class Activation Maps (CAM) and Gradient-CAM (Grad-CAM) to provide a more comprehensive visual explanation for a deep learning prognostic model (EfficientNet-B1) that we have employed to predict whether the input presented to a biometric authentication system is morphed or genuine. The experimentations have been performed on three publicly available datasets namely Face Research Lab London Set, Wide Multi-Channel Presentation Attack (WMCA), and Makeup Induced Face Spoofing (MIFS). The experimental evaluations affirms that the resultant visual explanations highlight more fine-grained details of image features/areas focused by EfficientNet-B1 to reach decisions along with appropriate reasoning.

CLApr 6, 2022
Language Resources and Technologies for Non-Scheduled and Endangered Indian Languages

Ritesh Kumar, Bornini Lahiri

In the present paper, we will present a survey of the language resources and technologies available for the non-scheduled and endangered languages of India. While there have been different estimates from different sources about the number of languages in India, it could be assumed that there are more than 1,000 languages currently being spoken in India. However barring some of the 22 languages included in the 8th Schedule of the Indian Constitution (called the scheduled languages), there is hardly any substantial resource or technology available for the rest of the languages. Nonetheless there have been some individual attempts at developing resources and technologies for the different languages across the country. Of late, some financial support has also become available for the endangered languages. In this paper, we give a summary of the resources and technologies for those Indian languages which are not included in the 8th schedule of the Indian Constitution and/or which are endangered.

SDApr 6, 2022
Aggression in Hindi and English Speech: Acoustic Correlates and Automatic Identification

Ritesh Kumar, Atul Kr. Ojha, Bornini Lahiri et al.

In the present paper, we will present the results of an acoustic analysis of political discourse in Hindi and discuss some of the conventionalised acoustic features of aggressive speech regularly employed by the speakers of Hindi and English. The study is based on a corpus of slightly over 10 hours of political discourse and includes debates on news channel and political speeches. Using this study, we develop two automatic classification systems for identifying aggression in English and Hindi speech, based solely on an acoustic model. The Hindi classifier, trained using 50 hours of annotated speech, and English classifier, trained using 40 hours of annotated speech, achieve a respectable accuracy of over 73% and 66% respectively. In this paper, we discuss the development of this annotated dataset, the experiments for developing the classifier and discuss the errors that it makes.

AIMar 11
Trajectory-Informed Memory Generation for Self-Improving Agent Systems

Gaodan Fang, Vatche Isahagian, K. R. Jayaram et al.

LLM-powered agents face a persistent challenge: learning from their execution experiences to improve future performance. While agents can successfully complete many tasks, they often repeat inefficient patterns, fail to recover from similar errors, and miss opportunities to apply successful strategies from past executions. We present a novel framework for automatically extracting actionable learnings from agent execution trajectories and utilizing them to improve future performance through contextual memory retrieval. Our approach comprises four components: (1) a Trajectory Intelligence Extractor that performs semantic analysis of agent reasoning patterns, (2) a Decision Attribution Analyzer that identifies which decisions and reasoning steps led to failures, recoveries, or inefficiencies, (3) a Contextual Learning Generator that produces three types of guidance -- strategy tips from successful patterns, recovery tips from failure handling, and optimization tips from inefficient but successful executions, and (4) an Adaptive Memory Retrieval System that injects relevant learnings into agent prompts based on multi-dimensional similarity. Unlike existing memory systems that store generic conversational facts, our framework understands execution patterns, extracts structured learnings with provenance, and retrieves guidance tailored to specific task contexts. Evaluation on the AppWorld benchmark demonstrates consistent improvements, with up to 14.3 percentage point gains in scenario goal completion on held-out tasks and particularly strong benefits on complex tasks (28.5~pp scenario goal improvement, a 149\% relative increase).

LGJan 30, 2025Code
Navigating the Fragrance space Via Graph Generative Models And Predicting Odors

Mrityunjay Sharma, Sarabeshwar Balaji, Pinaki Saha et al.

We explore a suite of generative modelling techniques to efficiently navigate and explore the complex landscapes of odor and the broader chemical space. Unlike traditional approaches, we not only generate molecules but also predict the odor likeliness with ROC AUC score of 0.97 and assign probable odor labels. We correlate odor likeliness with physicochemical features of molecules using machine learning techniques and leverage SHAP (SHapley Additive exPlanations) to demonstrate the interpretability of the function. The whole process involves four key stages: molecule generation, stringent sanitization checks for molecular validity, fragrance likeliness screening and odor prediction of the generated molecules. By making our code and trained models publicly accessible, we aim to facilitate broader adoption of our research across applications in fragrance discovery and olfactory research.

CLApr 11, 2021Code
Unsupervised Learning of Explainable Parse Trees for Improved Generalisation

Atul Sahay, Ayush Maheshwari, Ritesh Kumar et al.

Recursive neural networks (RvNN) have been shown useful for learning sentence representations and helped achieve competitive performance on several natural language inference tasks. However, recent RvNN-based models fail to learn simple grammar and meaningful semantics in their intermediate tree representation. In this work, we propose an attention mechanism over Tree-LSTMs to learn more meaningful and explainable parse tree structures. We also demonstrate the superior performance of our proposed model on natural language inference, semantic relatedness, and sentiment analysis tasks and compare them with other state-of-the-art RvNN based methods. Further, we present a detailed qualitative and quantitative analysis of the learned parse trees and show that the discovered linguistic structures are more explainable, semantically meaningful, and grammatically correct than recent approaches. The source code of the paper is available at https://github.com/atul04/Explainable-Latent-Structures-Using-Attention.

CLFeb 6
Fine-Tuning and Evaluating Conversational AI for Agricultural Advisory

Sanyam Singh, Naga Ganesh, Vineet Singh et al.

Large Language Models show promise for agricultural advisory, yet vanilla models exhibit unsupported recommendations, generic advice lacking specific, actionable detail, and communication styles misaligned with smallholder farmer needs. In high stakes agricultural contexts, where recommendation accuracy has direct consequences for farmer outcomes, these limitations pose challenges for responsible deployment. We present a hybrid LLM architecture that decouples factual retrieval from conversational delivery: supervised fine-tuning with LoRA on expert-curated GOLDEN FACTS (atomic, verified units of agricultural knowledge) optimizes fact recall, while a separate stitching layer transforms retrieved facts into culturally appropriate, safety-aware responses. Our evaluation framework, DG-EVAL, performs atomic fact verification (measuring recall, precision, and contradiction detection) against expert-curated ground truth rather than Wikipedia or retrieved documents. Experiments across multiple model configurations on crops and queries from Bihar, India show that fine-tuning on curated data substantially improves fact recall and F1, while maintaining high relevance. Using a fine-tuned smaller model achieves comparable or better factual quality at a fraction of the cost of frontier models. A stitching layer further improves safety subscores while maintaining high conversational quality. We release the farmerchat-prompts library to enable reproducible development of domain-specific agricultural AI.

MTRL-SCIMay 4
From Knowledge to Action: Outcomes of the 2025 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry

Aritra Roy, Kevin Shen, Andrew MacBride et al.

Large language models (LLMs) are rapidly changing how researchers in materials science and chemistry discover, organize, and act on scientific knowledge. This paper analyzes a broad set of community-developed LLM applications in an effort to identify emerging patterns in how these systems can be used across the scientific research lifecycle. We organize the projects into two complementary categories: Knowledge Infrastructure, systems that structure, retrieve, synthesize, and validate scientific information; and Action Systems, systems that execute, coordinate, or automate scientific work across computational and experimental environments. The submissions reveal a shift from single-purpose LLM tools toward integrated, multi-agent workflows that combine retrieval, reasoning, tool use, and domain-specific validation. Prominent themes include retrieval-augmented generation as grounding infrastructure, persistent structured knowledge representations, multimodal and multilingual scientific inputs, and early progress toward laboratory-integrated closed-loop systems. Together, these results suggest that LLMs are evolving from general-purpose assistants into composable infrastructure for scientific reasoning and action. This work provides a community snapshot of that transition and a practical taxonomy for understanding emerging LLM-enabled workflows in materials science and chemistry.

CLMar 17, 2024
HarmPot: An Annotation Framework for Evaluating Offline Harm Potential of Social Media Text

Ritesh Kumar, Ojaswee Bhalla, Madhu Vanthi et al.

In this paper, we discuss the development of an annotation schema to build datasets for evaluating the offline harm potential of social media texts. We define "harm potential" as the potential for an online public post to cause real-world physical harm (i.e., violence). Understanding that real-world violence is often spurred by a web of triggers, often combining several online tactics and pre-existing intersectional fissures in the social milieu, to result in targeted physical violence, we do not focus on any single divisive aspect (i.e., caste, gender, religion, or other identities of the victim and perpetrators) nor do we focus on just hate speech or mis/dis-information. Rather, our understanding of the intersectional causes of such triggers focuses our attempt at measuring the harm potential of online content, irrespective of whether it is hateful or not. In this paper, we discuss the development of a framework/annotation schema that allows annotating the data with different aspects of the text including its socio-political grounding and intent of the speaker (as expressed through mood and modality) that together contribute to it being a trigger for offline harm. We also give a comparative analysis and mapping of our framework with some of the existing frameworks.

LGDec 26, 2023
Olfactory Label Prediction on Aroma-Chemical Pairs

Laura Sisson, Aryan Amit Barsainyan, Mrityunjay Sharma et al.

The application of deep learning techniques on aroma-chemicals has resulted in models more accurate than human experts at predicting olfactory qualities. However, public research in this domain has been limited to predicting the qualities of single molecules, whereas in industry applications, perfumers and food scientists are often concerned with blends of many molecules. In this paper, we apply both existing and novel approaches to a dataset we gathered consisting of labeled pairs of molecules. We present graph neural network models capable of accurately predicting the odor qualities arising from blends of aroma-chemicals, with an analysis of how variations in architecture can lead to significant differences in predictive power.

LGAug 13, 2025
An Explainable AI based approach for Monitoring Animal Health

Rahul Jana, Shubham Dixit, Mrityunjay Sharma et al.

Monitoring cattle health and optimizing yield are key challenges faced by dairy farmers due to difficulties in tracking all animals on the farm. This work aims to showcase modern data-driven farming practices based on explainable machine learning(ML) methods that explain the activity and behaviour of dairy cattle (cows). Continuous data collection of 3-axis accelerometer sensors and usage of robust ML methodologies and algorithms, provide farmers and researchers with actionable information on cattle activity, allowing farmers to make informed decisions and incorporate sustainable practices. This study utilizes Bluetooth-based Internet of Things (IoT) devices and 4G networks for seamless data transmission, immediate analysis, inference generation, and explains the models performance with explainability frameworks. Special emphasis is put on the pre-processing of the accelerometers time series data, including the extraction of statistical characteristics, signal processing techniques, and lag-based features using the sliding window technique. Various hyperparameter-optimized ML models are evaluated across varying window lengths for activity classification. The k-nearest neighbour Classifier achieved the best performance, with AUC of mean 0.98 and standard deviation of 0.0026 on the training set and 0.99 on testing set). In order to ensure transparency, Explainable AI based frameworks such as SHAP is used to interpret feature importance that can be understood and used by practitioners. A detailed comparison of the important features, along with the stability analysis of selected features, supports development of explainable and practical ML models for sustainable livestock management.

CLJul 22, 2025
Leveraging Synthetic Data for Question Answering with Multilingual LLMs in the Agricultural Domain

Rishemjit Kaur, Arshdeep Singh Bhankhar, Jashanpreet Singh Salh et al.

Enabling farmers to access accurate agriculture-related information in their native languages in a timely manner is crucial for the success of the agriculture field. Publicly available general-purpose Large Language Models (LLMs) typically offer generic agriculture advisories, lacking precision in local and multilingual contexts. Our study addresses this limitation by generating multilingual (English, Hindi, Punjabi) synthetic datasets from agriculture-specific documents from India and fine-tuning LLMs for the task of question answering (QA). Evaluation on human-created datasets demonstrates significant improvements in factuality, relevance, and agricultural consensus for the fine-tuned LLMs compared to the baseline counterparts.

CVApr 12, 2024
FaceFilterSense: A Filter-Resistant Face Recognition and Facial Attribute Analysis Framework

Shubham Tiwari, Yash Sethia, Ritesh Kumar et al.

With the advent of social media, fun selfie filters have come into tremendous mainstream use affecting the functioning of facial biometric systems as well as image recognition systems. These filters vary from beautification filters and Augmented Reality (AR)-based filters to filters that modify facial landmarks. Hence, there is a need to assess the impact of such filters on the performance of existing face recognition systems. The limitation associated with existing solutions is that these solutions focus more on the beautification filters. However, the current AR-based filters and filters which distort facial key points are in vogue recently and make the faces highly unrecognizable even to the naked eye. Also, the filters considered are mostly obsolete with limited variations. To mitigate these limitations, we aim to perform a holistic impact analysis of the latest filters and propose an user recognition model with the filtered images. We have utilized a benchmark dataset for baseline images, and applied the latest filters over them to generate a beautified/filtered dataset. Next, we have introduced a model FaceFilterNet for beautified user recognition. In this framework, we also utilize our model to comment on various attributes of the person including age, gender, and ethnicity. In addition, we have also presented a filter-wise impact analysis on face recognition, age estimation, gender, and ethnicity prediction. The proposed method affirms the efficacy of our dataset with an accuracy of 87.25% and an optimal accuracy for facial attribute analysis.

CLDec 3, 2021
Translating Politeness Across Cultures: Case of Hindi and English

Ritesh Kumar, Girish Nath Jha

In this paper, we present a corpus based study of politeness across two languages-English and Hindi. It studies the politeness in a translated parallel corpus of Hindi and English and sees how politeness in a Hindi text is translated into English. We provide a detailed theoretical background in which the comparison is carried out, followed by a brief description of the translated data within this theoretical model. Since politeness may become one of the major reasons of conflict and misunderstanding, it is a very important phenomenon to be studied and understood cross-culturally, particularly for such purposes as machine translation.

CLDec 3, 2021
Creating and Managing a large annotated parallel corpora of Indian languages

Ritesh Kumar, Shiv Bhusan Kaushik, Pinkey Nainwani et al.

This paper presents the challenges in creating and managing large parallel corpora of 12 major Indian languages (which is soon to be extended to 23 languages) as part of a major consortium project funded by the Department of Information Technology (DIT), Govt. of India, and running parallel in 10 different universities of India. In order to efficiently manage the process of creation and dissemination of these huge corpora, the web-based (with a reduced stand-alone version also) annotation tool ILCIANN (Indian Languages Corpora Initiative Annotation Tool) has been developed. It was primarily developed for the POS annotation as well as the management of the corpus annotation by people with differing amount of competence and at locations physically situated far apart. In order to maintain consistency and standards in the creation of the corpora, it was necessary that everyone works on a common platform which was provided by this tool.

CLNov 30, 2021
Challenges in Developing LRs for Non-Scheduled Languages: A Case of Magahi

Ritesh Kumar

Magahi is an Indo-Aryan Language, spoken mainly in the Eastern parts of India. Despite having a significant number of speakers, there has been virtually no language resource (LR) or language technology (LT) developed for the language, mainly because of its status as a non-scheduled language. The present paper describes an attempt to develop an annotated corpus of Magahi. The data is mainly taken from a couple of blogs in Magahi, some collection of stories in Magahi and the recordings of conversation in Magahi and it is annotated at the POS level using BIS tagset.

CLNov 30, 2021
Towards automatic identification of linguistic politeness in Hindi texts

Ritesh Kumar

In this paper I present a classifier for automatic identification of linguistic politeness in Hindi texts. I have used the manually annotated corpus of over 25,000 blog comments to train an SVM. Making use of the discursive and interactional approaches to politeness the paper gives an exposition of the normative, conventionalised politeness structures of Hindi. It is seen that using these manually recognised structures as features in training the SVM significantly improves the performance of the classifier on the test set. The trained system gives a significantly high accuracy of over 77% which is within 2% of human accuracy.

CLNov 19, 2021
The ComMA Dataset V0.2: Annotating Aggression and Bias in Multilingual Social Media Discourse

Ritesh Kumar, Enakshi Nandi, Laishram Niranjana Devi et al.

In this paper, we discuss the development of a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the "context" in which they occur. The context, here, is defined by the conversational thread in which a specific comment occurs and also the "type" of discursive role that the comment is performing with respect to the previous comment. The initial dataset, being discussed here (and made available as part of the ComMA@ICON shared task), consists of a total 15,000 annotated comments in four languages - Meitei, Bangla, Hindi, and Indian English - collected from various social media platforms such as YouTube, Facebook, Twitter and Telegram. As is usual on social media websites, a large number of these comments are multilingual, mostly code-mixed with English. The paper gives a detailed description of the tagset being used for annotation and also the process of developing a multi-label, fine-grained tagset that can be used for marking comments with aggression and bias of various kinds including gender bias, religious intolerance (called communal bias in the tagset), class/caste bias and ethnic/racial bias. We also define and discuss the tags that have been used for marking different the discursive role being performed through the comments, such as attack, defend, etc. We also present a statistical analysis of the dataset as well as results of our baseline experiments with developing an automatic aggression identification system using the dataset developed.

CYOct 29, 2021
Diagnosing Data from ICTs to Provide Focused Assistance in Agricultural Adoptions

Ashwin Singh, Mallika Subramanian, Anmol Agarwal et al.

In the last two decades, ICTs have played a pivotal role in empowering rural populations in India by making knowledge more accessible. Digital Green (DG) is one such ICT that employs a participatory approach with smallholder farmers to produce instructional videos that encompass content specific to them. With help of human mediators, they disseminate these videos using projectors to improve the adoption of agricultural practices. DG's web-based data tracker stores attendance and adoption logs of millions of farmers, videos screened and their demographic information. We leverage this data for a period of ten years between 2010-2020 across five states in India and use it to conduct a holistic evaluation of the ICT. First, we find disparities in adoption rates of farmers, following which we use statistical tests to identify different factors that lead to these disparities and gender-based inequalities. Second, to provide assistance to farmers facing challenges, we model the adoption of practices from a video as a prediction problem and experiment with different model architectures. Our classifier achieves accuracies ranging from 79% to 90% across the five states, demonstrating its potential for assisting future ethnographic investigations. Third, we use SHAP values in conjunction with our model for explaining the impact of various network, content and demographic features on adoption. Our research finds that farmers greatly benefit from past adopters of a video from their group and village. We also discover that videos with a low content-specificity benefit some farmers more than others. Next, we highlight the implications of our findings by translating them into recommendations for community building, revisiting participatory approach and mitigating inequalities. We conclude with a discussion on how our work can assist future investigations into the lived experiences of farmers.

AIAug 7, 2021
What a million Indian farmers say?: A crowdsourcing-based method for pest surveillance

Poonam Adhikari, Ritesh Kumar, S. R. S Iyengar et al.

Many different technologies are used to detect pests in the crops, such as manual sampling, sensors, and radar. However, these methods have scalability issues as they fail to cover large areas, are uneconomical and complex. This paper proposes a crowdsourced based method utilising the real-time farmer queries gathered over telephones for pest surveillance. We developed data-driven strategies by aggregating and analyzing historical data to find patterns and get future insights into pest occurrence. We showed that it can be an accurate and economical method for pest surveillance capable of enveloping a large area with high spatio-temporal granularity. Forecasting the pest population will help farmers in making informed decisions at the right time. This will also help the government and policymakers to make the necessary preparations as and when required and may also ensure food security.

CLJun 7, 2021
SIGTYP 2021 Shared Task: Robust Spoken Language Identification

Elizabeth Salesky, Badr M. Abdullah, Sabrina J. Mielke et al.

While language identification is a fundamental speech and language processing task, for many languages and language families it remains a challenging task. For many low-resource and endangered languages this is in part due to resource availability: where larger datasets exist, they may be single-speaker or have different domains than desired application scenarios, demanding a need for domain and speaker-invariant language identification systems. This year's shared task on robust spoken language identification sought to investigate just this scenario: systems were to be trained on largely single-speaker speech from one domain, but evaluated on data in other domains recorded from speakers under different recording circumstances, mimicking realistic low-resource scenarios. We see that domain and speaker mismatch proves very challenging for current methods which can perform above 95% accuracy in-domain, which domain adaptation can address to some degree, but that these conditions merit further investigation to make spoken language identification accessible in many scenarios.

LGSep 17, 2020
MSTREAM: Fast Anomaly Detection in Multi-Aspect Streams

Siddharth Bhatia, Arjit Jain, Pan Li et al.

Given a stream of entries in a multi-aspect data setting i.e., entries having multiple dimensions, how can we detect anomalous activities in an unsupervised manner? For example, in the intrusion detection setting, existing work seeks to detect anomalous events or edges in dynamic graph streams, but this does not allow us to take into account additional attributes of each entry. Our work aims to define a streaming multi-aspect data anomaly detection framework, termed MSTREAM which can detect unusual group anomalies as they occur, in a dynamic manner. MSTREAM has the following properties: (a) it detects anomalies in multi-aspect data including both categorical and numeric attributes; (b) it is online, thus processing each record in constant time and constant memory; (c) it can capture the correlation between multiple aspects of the data. MSTREAM is evaluated over the KDDCUP99, CICIDS-DoS, UNSW-NB 15 and CICIDS-DDoS datasets, and outperforms state-of-the-art baselines.

CLMar 16, 2020
Developing a Multilingual Annotated Corpus of Misogyny and Aggression

Shiladitya Bhattacharya, Siddharth Singh, Ritesh Kumar et al.

In this paper, we discuss the development of a multilingual annotated corpus of misogyny and aggression in Indian English, Hindi, and Indian Bangla as part of a project on studying and automatically identifying misogyny and communalism on social media (the ComMA Project). The dataset is collected from comments on YouTube videos and currently contains a total of over 20,000 comments. The comments are annotated at two levels - aggression (overtly aggressive, covertly aggressive, and non-aggressive) and misogyny (gendered and non-gendered). We describe the process of data collection, the tagset used for annotation, and issues and challenges faced during the process of annotation. Finally, we discuss the results of the baseline experiments conducted to develop a classifier for misogyny in the three languages.

IRAug 19, 2019
Tale of tails using rule augmented sequence labeling for event extraction

Ayush Maheshwari, Hrishikesh Patel, Nandan Rathod et al.

The problem of event extraction is a relatively difficult task for low resource languages due to the non-availability of sufficient annotated data. Moreover, the task becomes complex for tail (rarely occurring) labels wherein extremely less data is available. In this paper, we present a new dataset (InDEE-2019) in the disaster domain for multiple Indic languages, collected from news websites. Using this dataset, we evaluate several rule-based mechanisms to augment deep learning based models. We formulate our problem of event extraction as a sequence labeling task and perform extensive experiments to study and understand the effectiveness of different approaches. We further show that tail labels can be easily incorporated by creating new rules without the requirement of large annotated data.

IVJun 10, 2019
Alzheimer's Disease Brain MRI Classification: Challenges and Insights

Yi Ren Fung, Ziqiang Guan, Ritesh Kumar et al.

In recent years, many papers have reported state-of-the-art performance on Alzheimer's Disease classification with MRI scans from the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset using convolutional neural networks. However, we discover that when we split that data into training and testing sets at the subject level, we are not able to obtain similar performance, bringing the validity of many of the previous studies into question. Furthermore, we point out that previous works use different subsets of the ADNI data, making comparison across similar works tricky. In this study, we present the results of three splitting methods, discuss the motivations behind their validity, and report our results using all of the available subjects.

CVApr 16, 2019
A Comprehensive Study of Alzheimer's Disease Classification Using Convolutional Neural Networks

Ziqiang Guan, Ritesh Kumar, Yi Ren Fung et al.

A plethora of deep learning models have been developed for the task of Alzheimer's disease classification from brain MRI scans. Many of these models report high performance, achieving three-class classification accuracy of up to 95%. However, it is common for these studies to draw performance comparisons between models that are trained on different subsets of a dataset or use varying imaging preprocessing techniques, making it difficult to objectively assess model performance. Furthermore, many of these works do not provide details such as hyperparameters, the specific MRI scans used, or their source code, making it difficult to replicate their experiments. To address these concerns, we present a comprehensive study of some of the deep learning methods and architectures on the full set of images available from ADNI. We find that, (1) classification using 3D models gives an improvement of 1% in our setup, at the cost of significantly longer training time and more computation power, (2) with our dataset, pre-training yields minimal ($<0.5\%$) improvement in model performance, (3) most popular convolutional neural network models yield similar performance when compared to each other. Lastly, we briefly compare the effects of two image preprocessing programs: FreeSurfer and Clinica, and find that the spatially normalized and segmented outputs from Clinica increased the accuracy of model prediction from 63% to 89% when compared to FreeSurfer images.

CLMar 19, 2019
SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)

Marcos Zampieri, Shervin Malmasi, Preslav Nakov et al.

We present the results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval). The task was based on a new dataset, the Offensive Language Identification Dataset (OLID), which contains over 14,000 English tweets. It featured three sub-tasks. In sub-task A, the goal was to discriminate between offensive and non-offensive posts. In sub-task B, the focus was on the type of offensive content in the post. Finally, in sub-task C, systems had to detect the target of the offensive posts. OffensEval attracted a large number of participants and it was one of the most popular tasks in SemEval-2019. In total, about 800 teams signed up to participate in the task, and 115 of them submitted results, which we present and analyze in this report.

CLFeb 25, 2019
Predicting the Type and Target of Offensive Posts in Social Media

Marcos Zampieri, Shervin Malmasi, Preslav Nakov et al.

As offensive content has become pervasive in social media, there has been much research in identifying potentially offensive messages. However, previous work on this topic did not consider the problem as a whole, but rather focused on detecting very specific types of offensive content, e.g., hate speech, cyberbulling, or cyber-aggression. In contrast, here we target several different kinds of offensive content. In particular, we model the task hierarchically, identifying the type and the target of offensive messages in social media. For this purpose, we complied the Offensive Language Identification Dataset (OLID), a new dataset with tweets annotated for offensive content using a fine-grained three-layer annotation scheme, which we make publicly available. We discuss the main similarities and differences between OLID and pre-existing datasets for hate speech identification, aggression detection, and similar tasks. We further experiment with and we compare the performance of different machine learning models on OLID.

CLMar 26, 2018
Automatic Identification of Closely-related Indian Languages: Resources and Experiments

Ritesh Kumar, Bornini Lahiri, Deepak Alok et al.

In this paper, we discuss an attempt to develop an automatic language identification system for 5 closely-related Indo-Aryan languages of India, Awadhi, Bhojpuri, Braj, Hindi and Magahi. We have compiled a comparable corpora of varying length for these languages from various resources. We discuss the method of creation of these corpora in detail. Using these corpora, a language identification system was developed, which currently gives state of the art accuracy of 96.48\%. We also used these corpora to study the similarity between the 5 languages at the lexical level, which is the first data-based study of the extent of closeness of these languages.

CLMar 26, 2018
Aggression-annotated Corpus of Hindi-English Code-mixed Data

Ritesh Kumar, Aishwarya N. Reganti, Akshit Bhatia et al.

As the interaction over the web has increased, incidents of aggression and related events like trolling, cyberbullying, flaming, hate speech, etc. too have increased manifold across the globe. While most of these behaviour like bullying or hate speech have predated the Internet, the reach and extent of the Internet has given these an unprecedented power and influence to affect the lives of billions of people. So it is of utmost significance and importance that some preventive measures be taken to provide safeguard to the people using the web such that the web remains a viable medium of communication and connection, in general. In this paper, we discuss the development of an aggression tagset and an annotated corpus of Hindi-English code-mixed data from two of the most popular social networking and social media platforms in India, Twitter and Facebook. The corpus is annotated using a hierarchical tagset of 3 top-level tags and 10 level 2 tags. The final dataset contains approximately 18k tweets and 21k facebook comments and is being released for further research in the field.

SYNov 10, 2017
Cooperative control of multi-agent systems to locate source of an odor

Abhinav Sinha, Rishemjit Kaur, Ritesh Kumar et al.

This work targets the problem of odor source localization by multi-agent systems. A hierarchical cooperative control has been put forward to solve the problem of locating source of an odor by driving the agents in consensus when at least one agent obtains information about location of the source. Synthesis of the proposed controller has been carried out in a hierarchical manner of group decision making, path planning and control. Decision making utilizes information of the agents using conventional Particle Swarm Algorithm and information of the movement of filaments to predict the location of the odor source. The predicted source location in the decision level is then utilized to map a trajectory and pass that information to the control level. The distributed control layer uses sliding mode controllers known for their inherent robustness and the ability to reject matched disturbances completely. Two cases of movement of agents towards the source, i.e., under consensus and formation have been discussed herein. Finally, numerical simulations demonstrate the efficacy of the proposed hierarchical distributed control.