Sheikh Muhammad Sarwar

CL
h-index17
14papers
2,256citations
Novelty39%
AI Score33

14 Papers

CLMay 21, 2025
EcomScriptBench: A Multi-task Benchmark for E-commerce Script Planning via Step-wise Intention-Driven Product Association

Weiqi Wang, Limeng Cui, Xin Liu et al.

Goal-oriented script planning, or the ability to devise coherent sequences of actions toward specific goals, is commonly employed by humans to plan for typical activities. In e-commerce, customers increasingly seek LLM-based assistants to generate scripts and recommend products at each step, thereby facilitating convenient and efficient shopping experiences. However, this capability remains underexplored due to several challenges, including the inability of LLMs to simultaneously conduct script planning and product retrieval, difficulties in matching products caused by semantic discrepancies between planned actions and search queries, and a lack of methods and benchmark data for evaluation. In this paper, we step forward by formally defining the task of E-commerce Script Planning (EcomScript) as three sequential subtasks. We propose a novel framework that enables the scalable generation of product-enriched scripts by associating products with each step based on the semantic similarity between the actions and their purchase intentions. By applying our framework to real-world e-commerce data, we construct the very first large-scale EcomScript dataset, EcomScriptBench, which includes 605,229 scripts sourced from 2.4 million products. Human annotations are then conducted to provide gold labels for a sampled subset, forming an evaluation benchmark. Extensive experiments reveal that current (L)LMs face significant challenges with EcomScript tasks, even after fine-tuning, while injecting product purchase intentions improves their performance.

CLMar 28, 2024
Target Span Detection for Implicit Harmful Content

Nazanin Jafari, James Allan, Sheikh Muhammad Sarwar

Identifying the targets of hate speech is a crucial step in grasping the nature of such speech and, ultimately, in improving the detection of offensive posts on online forums. Much harmful content on online platforms uses implicit language especially when targeting vulnerable and protected groups such as using stereotypical characteristics instead of explicit target names, making it harder to detect and mitigate the language. In this study, we focus on identifying implied targets of hate speech, essential for recognizing subtler hate speech and enhancing the detection of harmful content on digital platforms. We define a new task aimed at identifying the targets even when they are not explicitly stated. To address that task, we collect and annotate target spans in three prominent implicit hate speech datasets: SBIC, DynaHate, and IHC. We call the resulting merged collection Implicit-Target-Span. The collection is achieved using an innovative pooling method with matching scores based on human annotations and Large Language Models (LLMs). Our experiments indicate that Implicit-Target-Span provides a challenging test bed for target span detection methods.

AIMay 27, 2025
RRO: LLM Agent Optimization Through Rising Reward Trajectories

Zilong Wang, Jingfeng Yang, Sreyashi Nag et al.

Large language models (LLMs) have exhibited extraordinary performance in a variety of tasks while it remains challenging for them to solve complex multi-step tasks as agents. In practice, agents sensitive to the outcome of certain key steps which makes them likely to fail the task because of a subtle mistake in the planning trajectory. Recent approaches resort to calibrating the reasoning process through reinforcement learning. They reward or penalize every reasoning step with process supervision, as known as Process Reward Models (PRMs). However, PRMs are difficult and costly to scale up with a large number of next action candidates since they require extensive computations to acquire the training data through the per-step trajectory exploration. To mitigate this issue, we focus on the relative reward trend across successive reasoning steps and propose maintaining an increasing reward in the collected trajectories for process supervision, which we term Reward Rising Optimization (RRO). Specifically, we incrementally augment the process supervision until identifying a step exhibiting positive reward differentials, i.e. rising rewards, relative to its preceding iteration. This method dynamically expands the search space for the next action candidates, efficiently capturing high-quality data. We provide mathematical groundings and empirical results on the WebShop and InterCode-SQL benchmarks, showing that our proposed RRO achieves superior performance while requiring much less exploration cost.

CLSep 10, 2021
AutoTriggER: Label-Efficient and Robust Named Entity Recognition with Auxiliary Trigger Extraction

Dong-Ho Lee, Ravi Kiran Selvam, Sheikh Muhammad Sarwar et al.

Deep neural models for named entity recognition (NER) have shown impressive results in overcoming label scarcity and generalizing to unseen entities by leveraging distant supervision and auxiliary information such as explanations. However, the costs of acquiring such additional information are generally prohibitive. In this paper, we present a novel two-stage framework (AutoTriggER) to improve NER performance by automatically generating and leveraging ``entity triggers'' which are human-readable cues in the text that help guide the model to make better decisions. Our framework leverages post-hoc explanation to generate rationales and strengthens a model's prior knowledge using an embedding interpolation technique. This approach allows models to exploit triggers to infer entity boundaries and types instead of solely memorizing the entity words themselves. Through experiments on three well-studied NER datasets, AutoTriggER shows strong label-efficiency, is capable of generalizing to unseen entities, and outperforms the RoBERTa-CRF baseline by nearly 0.5 F1 points on average.

IRSep 7, 2021
Mixed Attention Transformer for Leveraging Word-Level Knowledge to Neural Cross-Lingual Information Retrieval

Zhiqi Huang, Hamed Bonab, Sheikh Muhammad Sarwar et al.

Pretrained contextualized representations offer great success for many downstream tasks, including document ranking. The multilingual versions of such pretrained representations provide a possibility of jointly learning many languages with the same model. Although it is expected to gain big with such joint training, in the case of cross lingual information retrieval (CLIR), the models under a multilingual setting are not achieving the same level of performance as those under a monolingual setting. We hypothesize that the performance drop is due to the translation gap between query and documents. In the monolingual retrieval task, because of the same lexical inputs, it is easier for model to identify the query terms that occurred in documents. However, in the multilingual pretrained models that the words in different languages are projected into the same hyperspace, the model tends to translate query terms into related terms, i.e., terms that appear in a similar context, in addition to or sometimes rather than synonyms in the target language. This property is creating difficulties for the model to connect terms that cooccur in both query and document. To address this issue, we propose a novel Mixed Attention Transformer (MAT) that incorporates external word level knowledge, such as a dictionary or translation table. We design a sandwich like architecture to embed MAT into the recent transformer based deep neural models. By encoding the translation knowledge into an attention matrix, the model with MAT is able to focus on the mutually translated words in the input sequence. Experimental results demonstrate the effectiveness of the external knowledge and the significant improvement of MAT embedded neural reranking model on CLIR task.

CLJul 27, 2021
Unsupervised Domain Adaptation for Hate Speech Detection Using a Data Augmentation Approach

Sheikh Muhammad Sarwar, Vanessa Murdock

Online harassment in the form of hate speech has been on the rise in recent years. Addressing the issue requires a combination of content moderation by people, aided by automatic detection methods. As content moderation is itself harmful to the people doing it, we desire to reduce the burden by improving the automatic detection of hate speech. Hate speech presents a challenge as it is directed at different target groups using a completely different vocabulary. Further the authors of the hate speech are incentivized to disguise their behavior to avoid being removed from a platform. This makes it difficult to develop a comprehensive data set for training and evaluating hate speech detection models because the examples that represent one hate speech domain do not typically represent others, even within the same language or culture. We propose an unsupervised domain adaptation approach to augment labeled data for hate speech detection. We evaluate the approach with three different models (character CNNs, BiLSTMs and BERT) on three different collections. We show our approach improves Area under the Precision/Recall curve by as much as 42% and recall by as much as 278%, with no loss (and in some cases a significant gain) in precision.

CLMay 27, 2021
Corpus-Level Evaluation for Event QA: The IndiaPoliceEvents Corpus Covering the 2002 Gujarat Violence

Andrew Halterman, Katherine A. Keith, Sheikh Muhammad Sarwar et al.

Automated event extraction in social science applications often requires corpus-level evaluations: for example, aggregating text predictions across metadata and unbiased estimates of recall. We combine corpus-level evaluation requirements with a real-world, social science setting and introduce the IndiaPoliceEvents corpus--all 21,391 sentences from 1,257 English-language Times of India articles about events in the state of Gujarat during March 2002. Our trained annotators read and label every document for mentions of police activity events, allowing for unbiased recall evaluations. In contrast to other datasets with structured event representations, we gather annotations by posing natural questions, and evaluate off-the-shelf models for three different tasks: sentence classification, document ranking, and temporal aggregation of target events. We present baseline results from zero-shot BERT-based models fine-tuned on natural language inference and passage retrieval tasks. Our novel corpus-level evaluations and annotation approach can guide creation of similar social-science-oriented resources in the future.

CLMar 31, 2021
A Neighbourhood Framework for Resource-Lean Content Flagging

Sheikh Muhammad Sarwar, Dimitrina Zlatkova, Momchil Hardalov et al.

We propose a novel framework for cross-lingual content flagging with limited target-language data, which significantly outperforms prior work in terms of predictive performance. The framework is based on a nearest-neighbour architecture. It is a modern instantiation of the vanilla k-nearest neighbour model, as we use Transformer representations in all its components. Our framework can adapt to new source-language instances, without the need to be retrained from scratch. Unlike prior work on neighbourhood-based approaches, we encode the neighbourhood information based on query--neighbour interactions. We propose two encoding schemes and we show their effectiveness using both qualitative and quantitative analysis. Our evaluation results on eight languages from two different datasets for abusive language detection show sizable improvements of up to 9.5 F1 points absolute (for Italian) over strong baselines. On average, we achieve 3.6 absolute F1 points of improvement for the three languages in the Jigsaw Multilingual dataset and 2.14 points for the WUL dataset.

CLFeb 27, 2021
Detecting Harmful Content On Online Platforms: What Platforms Need Vs. Where Research Efforts Go

Arnav Arora, Preslav Nakov, Momchil Hardalov et al.

The proliferation of harmful content on online platforms is a major societal problem, which comes in many different forms including hate speech, offensive language, bullying and harassment, misinformation, spam, violence, graphic content, sexual abuse, self harm, and many other. Online platforms seek to moderate such content to limit societal harm, to comply with legislation, and to create a more inclusive environment for their users. Researchers have developed different methods for automatically detecting harmful content, often focusing on specific sub-problems or on narrow communities, as what is considered harmful often depends on the platform and on the context. We argue that there is currently a dichotomy between what types of harmful content online platforms seek to curb, and what research efforts there are to automatically detect such content. We thus survey existing methods as well as content moderation policies by online platforms in this light and we suggest directions for future work.

IRJul 2, 2019
Semantic Driven Fielded Entity Retrieval

Shahrzad Naseri, Sheikh Muhammad Sarwar, James Allan

A common approach for knowledge-base entity search is to consider an entity as a document with multiple fields. Models that focus on matching query terms in different fields are popular choices for searching such entity representations. An instance of such a model is FSDM (Fielded Sequential Dependence Model). We propose to integrate field-level semantic features into FSDM. We use FSDM to retrieve a pool of documents, and then to use semantic field-level features to re-rank those documents. We propose to represent queries as bags of terms as well as bags of entities, and eventually, use their dense vector representation to compute semantic features based on query document similarity. Our proposed re-ranking approach achieves significant improvement in entity retrieval on the DBpedia-Entity (v2) dataset over existing FSDM model. Specifically, for all queries we achieve 2.5% and 1.2% significant improvement in NDCG@10 and NDCG@100, respectively.

IRJun 17, 2019
A Multi-Task Architecture on Relevance-based Neural Query Translation

Sheikh Muhammad Sarwar, Hamed Bonab, James Allan

We describe a multi-task learning approach to train a Neural Machine Translation (NMT) model with a Relevance-based Auxiliary Task (RAT) for search query translation. The translation process for Cross-lingual Information Retrieval (CLIR) task is usually treated as a black box and it is performed as an independent step. However, an NMT model trained on sentence-level parallel data is not aware of the vocabulary distribution of the retrieval corpus. We address this problem with our multi-task learning architecture that achieves 16% improvement over a strong NMT baseline on Italian-English query-document dataset. We show using both quantitative and qualitative analysis that our model generates balanced and precise translations with the regularization effect it achieves from multi-task learning paradigm.

IRJun 12, 2018
Named Entity Recognition with Extremely Limited Data

John Foley, Sheikh Muhammad Sarwar, James Allan

Traditional information retrieval treats named entity recognition as a pre-indexing corpus annotation task, allowing entity tags to be indexed and used during search. Named entity taggers themselves are typically trained on thousands or tens of thousands of examples labeled by humans. However, there is a long tail of named entities classes, and for these cases, labeled data may be impossible to find or justify financially. We propose exploring named entity recognition as a search task, where the named entity class of interest is a query, and entities of that class are the relevant "documents". What should that query look like? Can we even perform NER-style labeling with tens of labels? This study presents an exploration of CRF-based NER models with handcrafted features and of how we might transform them into search queries.

IRJan 8, 2018
Term Relevance Feedback for Contextual Named Entity Retrieval

Sheikh Muhammad Sarwar, John Foley, James Allan

We address the role of a user in Contextual Named Entity Retrieval (CNER), showing (1) that user identification of important context-bearing terms is superior to automated approaches, and (2) that further gains are possible if the user indicates the relative importance of those terms. CNER is similar in spirit to List Question answering and Entity disambiguation. However, the main focus of CNER is to obtain user feedback for constructing a profile for a class of entities on the fly and use that to retrieve entities from free text. Given a sentence, and an entity selected from that sentence, CNER aims to retrieve sentences that have entities similar to query entity. This paper explores obtaining term relevance feedback and importance weighting from humans in order to improve a CNER system. We report our findings based on the efforts of IR researchers as well as crowdsourced workers.

IRAug 16, 2015
Two-stage Cascaded Classifier for Purchase Prediction

Sheikh Muhammad Sarwar, Mahamudul Hasan, Dmitry I. Ignatov

In this paper we describe our machine learning solution for the RecSys Challenge, 2015. We have proposed a time efficient two-stage cascaded classifier for the prediction of buy sessions and purchased items within such sessions. Based on the model, several interesting features found, and formation of our own test bed, we have achieved a reasonable score. Usage of Random Forests helps us to cope with the effect of the multiplicity of good models depending on varying subsets of features in the purchased items prediction and, in its turn, boosting is used as a suitable technique to overcome severe class imbalance of the buy-session prediction.