62.9CYMay 31
Beyond Access: Guided LLM Scaffolding for Independent Learning in Undergraduate StatisticsMohammad Amanlou, Yasaman Amou-Jafari, Mehrad Livian et al.
Large language models (LLMs) are increasingly entering students' learning practices, but their educational value depends on whether they support reasoning or enable task completion without engagement. This study examines guided LLM use in an undergraduate Probability and Statistics course, focusing on the gap between assigned access and actual interaction quality. In a four-week quasi-experimental summer program, students were organized into three balanced conditions: no LLM access, unrestricted LLM access, and guided LLM access. The guided condition used the same LLM platform as the unrestricted condition, but students received explicit training and rules promoting reasoning-focused help-seeking, stepwise hints, verification, and ethical use. All quizzes and the delayed final exam were completed without LLM or external assistance, allowing us to distinguish AI-supported practice performance from independent learning. Results show that guided use was associated with clearer learning-oriented interaction patterns than unrestricted access, especially in prioritizing reasoning over final answers and requesting stepwise support. Guided-LLM students showed stronger no-help quiz performance during the intervention phase, whereas unrestricted access appeared more useful for assisted practice completion than for consistently improving independent performance. Available time measures did not support a simple duration-based explanation, and self-assessment calibration suggested better alignment between perceived and demonstrated understanding in the Guided-LLM condition. Overall, LLM access alone appears to be an incomplete educational intervention. For Artificial Intelligence in Education (AIED), the central design challenge is to scaffold how students use LLMs so that these systems function as partners in reasoning rather than answer-getting tools.
CLJan 29Code
MasalBench: A Benchmark for Contextual and Cross-Cultural Understanding of Persian Proverbs in LLMsGhazal Kalhor, Behnam Bahrak
In recent years, multilingual Large Language Models (LLMs) have become an inseparable part of daily life, making it crucial for them to master the rules of conversational language in order to communicate effectively with users. While previous work has evaluated LLMs' understanding of figurative language in high-resource languages, their performance in low-resource languages remains underexplored. In this paper, we introduce MasalBench, a comprehensive benchmark for assessing LLMs' contextual and cross-cultural understanding of Persian proverbs, which are a key component of conversation in this low-resource language. We evaluate eight state-of-the-art LLMs on MasalBench and find that they perform well in identifying Persian proverbs in context, achieving accuracies above 0.90. However, their performance drops considerably when tasked with identifying equivalent English proverbs, with the best model achieving 0.79 accuracy. Our findings highlight the limitations of current LLMs in cultural knowledge and analogical reasoning, and they provide a framework for assessing cross-cultural understanding in other low-resource languages. MasalBench is available at https://github.com/kalhorghazal/MasalBench.
CLApr 18, 2022
UTNLP at SemEval-2022 Task 6: A Comparative Analysis of Sarcasm Detection Using Generative-based and Mutation-based Data AugmentationAmirhossein Abaskohi, Arash Rasouli, Tanin Zeraati et al.
Sarcasm is a term that refers to the use of words to mock, irritate, or amuse someone. It is commonly used on social media. The metaphorical and creative nature of sarcasm presents a significant difficulty for sentiment analysis systems based on affective computing. The methodology and results of our team, UTNLP, in the SemEval-2022 shared task 6 on sarcasm detection are presented in this paper. We put different models, and data augmentation approaches to the test and report on which one works best. The tests begin with traditional machine learning models and progress to transformer-based and attention-based models. We employed data augmentation based on data mutation and data generation. Using RoBERTa and mutation-based data augmentation, our best approach achieved an F1-sarcastic of 0.38 in the competition's evaluation phase. After the competition, we fixed our model's flaws and achieved an F1-sarcastic of 0.414.
CLNov 15, 2022
Persian Emotion Detection using ParsBERT and Imbalanced Data Handling ApproachesAmirhossein Abaskohi, Nazanin Sabri, Behnam Bahrak
Emotion recognition is one of the machine learning applications which can be done using text, speech, or image data gathered from social media spaces. Detecting emotion can help us in different fields, including opinion mining. With the spread of social media, different platforms like Twitter have become data sources, and the language used in these platforms is informal, making the emotion detection task difficult. EmoPars and ArmanEmo are two new human-labeled emotion datasets for the Persian language. These datasets, especially EmoPars, are suffering from inequality between several samples between two classes. In this paper, we evaluate EmoPars and compare them with ArmanEmo. Throughout this analysis, we use data augmentation techniques, data re-sampling, and class-weights with Transformer-based Pretrained Language Models(PLMs) to handle the imbalance problem of these datasets. Moreover, feature selection is used to enhance the models' performance by emphasizing the text's specific features. In addition, we provide a new policy for selecting data from EmoPars, which selects the high-confidence samples; as a result, the model does not see samples that do not have specific emotion during training. Our model reaches a Macro-averaged F1-score of 0.81 and 0.76 on ArmanEmo and EmoPars, respectively, which are new state-of-the-art results in these benchmarks.
CLFeb 9, 2023
A Large-Scale Analysis of Persian Tweets Regarding Covid-19 VaccinationTaha ShabaniMirzaei, Houmaan Chamani, Amirhossein Abaskohi et al.
The Covid-19 pandemic had an enormous effect on our lives, especially on people's interactions. By introducing Covid-19 vaccines, both positive and negative opinions were raised over the subject of taking vaccines or not. In this paper, using data gathered from Twitter, including tweets and user profiles, we offer a comprehensive analysis of public opinion in Iran about the Coronavirus vaccines. For this purpose, we applied a search query technique combined with a topic modeling approach to extract vaccine-related tweets. We utilized transformer-based models to classify the content of the tweets and extract themes revolving around vaccination. We also conducted an emotion analysis to evaluate the public happiness and anger around this topic. Our results demonstrate that Covid-19 vaccination has attracted considerable attention from different angles, such as governmental issues, safety or hesitancy, and side effects. Moreover, Coronavirus-relevant phenomena like public vaccination and the rate of infection deeply impacted public emotional status and users' interactions.
CLFeb 23
KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness CalibrationMohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari et al.
With the rise of large language models (LLMs), they have become instrumental in applications such as Retrieval-Augmented Generation (RAG). Yet evaluating these systems remains bottlenecked by the time and cost of building specialized assessment datasets. We introduce KNIGHT, an LLM-based, knowledge-graph-driven framework for generating multiple-choice question (MCQ) datasets from external sources. KNIGHT constructs a topic-specific knowledge graph, a structured and parsimonious summary of entities and relations, that can be reused to generate instructor-controlled difficulty levels, including multi-hop questions, without repeatedly re-feeding the full source text. This knowledge graph acts as a compressed, reusable state, making question generation a cheap read over the graph. We instantiate KNIGHT on Wikipedia/Wikidata while keeping the framework domain- and ontology-agnostic. As a case study, KNIGHT produces six MCQ datasets in History, Biology, and Mathematics. We evaluate quality on five criteria: fluency, unambiguity (single correct answer), topic relevance, option uniqueness, and answerability given the provided sources (as a proxy for hallucination). Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-specific and difficulty-controlled evaluation.
CLFeb 15
The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck PerspectiveAli Zahedzadeh, Behnam Bahrak
Large Language Models increasingly rely on self-explanations, such as chain of thought reasoning, to improve performance on multi step question answering. While these explanations enhance accuracy, they are often verbose and costly to generate, raising the question of how much explanation is truly necessary. In this paper, we examine the trade-off between sufficiency, defined as the ability of an explanation to justify the correct answer, and conciseness, defined as the reduction in explanation length. Building on the information bottleneck principle, we conceptualize explanations as compressed representations that retain only the information essential for producing correct answers.To operationalize this view, we introduce an evaluation pipeline that constrains explanation length and assesses sufficiency using multiple language models on the ARC Challenge dataset. To broaden the scope, we conduct experiments in both English, using the original dataset, and Persian, as a resource-limited language through translation. Our experiments show that more concise explanations often remain sufficient, preserving accuracy while substantially reducing explanation length, whereas excessive compression leads to performance degradation.
LGJan 7
ArtCognition: A Multimodal AI Framework for Affective State Sensing from Visual and Kinematic Drawing CuesBehrad Binaei-Haghighi, Nafiseh Sadat Sajadi, Mehrad Liviyan et al.
The objective assessment of human affective and psychological states presents a significant challenge, particularly through non-verbal channels. This paper introduces digital drawing as a rich and underexplored modality for affective sensing. We present a novel multimodal framework, named ArtCognition, for the automated analysis of the House-Tree-Person (HTP) test, a widely used psychological instrument. ArtCognition uniquely fuses two distinct data streams: static visual features from the final artwork, captured by computer vision models, and dynamic behavioral kinematic cues derived from the drawing process itself, such as stroke speed, pauses, and smoothness. To bridge the gap between low-level features and high-level psychological interpretation, we employ a Retrieval-Augmented Generation (RAG) architecture. This grounds the analysis in established psychological knowledge, enhancing explainability and reducing the potential for model hallucination. Our results demonstrate that the fusion of visual and behavioral kinematic cues provides a more nuanced assessment than either modality alone. We show significant correlations between the extracted multimodal features and standardized psychological metrics, validating the framework's potential as a scalable tool to support clinicians. This work contributes a new methodology for non-intrusive affective state assessment and opens new avenues for technology-assisted mental healthcare.
SIFeb 15
A Hybrid TGN-SEAL Model for Dynamic Graph Link PredictionNafiseh Sadat Sajadi, Behnam Bahrak, Mahdi Jafari Siavoshani
Predicting links in sparse, continuously evolving networks is a central challenge in network science. Conventional heuristic methods and deep learning models, including Graph Neural Networks (GNNs), are typically designed for static graphs and thus struggle to capture temporal dependencies. Snapshot-based techniques partially address this issue but often encounter data sparsity and class imbalance, particularly in networks with transient interactions such as telecommunication call detail records (CDRs). Temporal Graph Networks (TGNs) model dynamic graphs by updating node embeddings over time; however, their predictive accuracy under sparse conditions remains limited. In this study, we improve the TGN framework by extracting enclosing subgraphs around candidate links, enabling the model to jointly learn structural and temporal information. Experiments on a sparse CDR dataset show that our approach increases average precision by 2.6% over standard TGNs, demonstrating the advantages of integrating local topology for robust link prediction in dynamic networks.
CRMar 7
TopRank-Based Delivery Rate Optimization for Coded Caching under Non-Uniform DemandsMohammadsaber Bahadori, Seyed Pooya Shariatpanahi, Behnam Bahrak
We study the problem of coded caching with nonuniform file popularity under the setting where the popularity distribution is initially unknown. By reframing the problem, we propose a method inspired by an algorithm from the recommender-systems literature and multi-armed bandits. Unlike prior approaches, which focus on accurately estimating file popularities, our method ranks files relative to one another and partitions them into groups. This perspective is more consistent with the structure of prior approaches as well, since earlier methods also divided files into popular and non-popular groups after estimating their popularities. The proposed approach relies on differences in request counts between files as the basis for ranking, and under many conditions it outperforms the previous algorithm. In particular, we obtain significantly improved performance in scenarios where the number of users in the network is small, the cache storage capacity is limited, or the learning process of the true popularity of files based on observations is contaminated by exploratory or synthetic requests that do not match the true popularity distribution. In these cases, our policy achieves markedly better performance and attains sublinear regret.
CLNov 24, 2025
Can LLMs Faithfully Explain Themselves in Low-Resource Languages? A Case Study on Emotion Detection in PersianMobina Mehrazar, Mohammad Amin Yousefi, Parisa Abolfath Beygi et al.
Large language models (LLMs) are increasingly used to generate self-explanations alongside their predictions, a practice that raises concerns about the faithfulness of these explanations, especially in low-resource languages. This study evaluates the faithfulness of LLM-generated explanations in the context of emotion classification in Persian, a low-resource language, by comparing the influential words identified by the model against those identified by human annotators. We assess faithfulness using confidence scores derived from token-level log-probabilities. Two prompting strategies, differing in the order of explanation and prediction (Predict-then-Explain and Explain-then-Predict), are tested for their impact on explanation faithfulness. Our results reveal that while LLMs achieve strong classification performance, their generated explanations often diverge from faithful reasoning, showing greater agreement with each other than with human judgments. These results highlight the limitations of current explanation methods and metrics, emphasizing the need for more robust approaches to ensure LLM reliability in multilingual and low-resource contexts.
CLSep 24, 2025
Probing Gender Bias in Multilingual LLMs: A Case Study of Stereotypes in PersianGhazal Kalhor, Behnam Bahrak
Multilingual Large Language Models (LLMs) are increasingly used worldwide, making it essential to ensure they are free from gender bias to prevent representational harm. While prior studies have examined such biases in high-resource languages, low-resource languages remain understudied. In this paper, we propose a template-based probing methodology, validated against real-world data, to uncover gender stereotypes in LLMs. As part of this framework, we introduce the Domain-Specific Gender Skew Index (DS-GSI), a metric that quantifies deviations from gender parity. We evaluate four prominent models, GPT-4o mini, DeepSeek R1, Gemini 2.0 Flash, and Qwen QwQ 32B, across four semantic domains, focusing on Persian, a low-resource language with distinct linguistic features. Our results show that all models exhibit gender stereotypes, with greater disparities in Persian than in English across all domains. Among these, sports reflect the most rigid gender biases. This study underscores the need for inclusive NLP practices and provides a framework for assessing bias in other low-resource languages.
CLApr 10, 2021
UTNLP at SemEval-2021 Task 5: A Comparative Analysis of Toxic Span Detection using Attention-based, Named Entity Recognition, and Ensemble ModelsAlireza Salemi, Nazanin Sabri, Emad Kebriaei et al.
Detecting which parts of a sentence contribute to that sentence's toxicity -- rather than providing a sentence-level verdict of hatefulness -- would increase the interpretability of models and allow human moderators to better understand the outputs of the system. This paper presents our team's, UTNLP, methodology and results in the SemEval-2021 shared task 5 on toxic spans detection. We test multiple models and contextual embeddings and report the best setting out of all. The experiments start with keyword-based models and are followed by attention-based, named entity-based, transformers-based, and ensemble models. Our best approach, an ensemble model, achieves an F1 of 0.684 in the competition's evaluation phase.
CLFeb 25, 2021
Sentiment Analysis of Persian-English Code-mixed TextsNazanin Sabri, Ali Edalat, Behnam Bahrak
The rapid production of data on the internet and the need to understand how users are feeling from a business and research perspective has prompted the creation of numerous automatic monolingual sentiment detection systems. More recently however, due to the unstructured nature of data on social media, we are observing more instances of multilingual and code-mixed texts. This development in content type has created a new demand for code-mixed sentiment analysis systems. In this study we collect, label and thus create a dataset of Persian-English code-mixed tweets. We then proceed to introduce a model which uses BERT pretrained embeddings as well as translation models to automatically learn the polarity scores of these Tweets. Our model outperforms the baseline models that use Naïve Bayes and Random Forest methods.
LGJun 7, 2020
Hybrid Model for Anomaly Detection on Call Detail Records by Time Series ForecastingAryan Mokhtari, Leyla Sadighi, Behnam Bahrak et al.
Mobile network operators store an enormous amount of information like log files that describe various events and users' activities. Analysis of these logs might be used in many critical applications such as detecting cyber-attacks, finding behavioral patterns of users, security incident response, network forensics, etc. In a cellular network Call Detail Records (CDR) is one type of such logs containing metadata of calls and usually includes valuable information about contact such as the phone numbers of originating and receiving subscribers, call duration, the area of activity, type of call (SMS or voice call) and a timestamp. With anomaly detection, it is possible to determine abnormal reduction or increment of network traffic in an area or for a particular person. This paper's primary goal is to study subscribers' behavior in a cellular network, mainly predicting the number of calls in a region and detecting anomalies in the network traffic. In this paper, a new hybrid method is proposed based on various anomaly detection methods such as GARCH, K-means, and Neural Network to determine the anomalous data. Moreover, we have discussed the possible causes of such anomalies.
LGDec 22, 2019
A Regression Framework for Predicting User's Next Location using Call Detail RecordsMohammad Saleh Mahdizadeh, Behnam Bahrak
With the growth of using cell phones and the increase in diversity of smart mobile devices, a massive volume of data is generated continuously in the process of using these devices. Among these data, Call Detail Records, CDR, is highly remarkable. Since CDR contains both temporal and spatial labels, mobility analysis of CDR is one of the favorite subjects of study among the researchers. The user next location prediction is one of the main problems in the field of human mobility analysis. In this paper, we propose a data processing framework to predict user next location. We propose domain-specific data processing strategies and design a deep neural network model which is based on recurrent neurons and perform regression tasks. Using this prediction framework, the error of the prediction decreases from 74% to 55% in comparison to the worst and best performing traditional models. Methods, strategies, the framework and the results of this paper can be helpful in many applications such as urban planning and digital marketing.
LGDec 1, 2019
Real-time Travel Time Estimation Using Matrix FactorizationEbrahim Badrestani, Behnam Bahrak, Ali Elahi et al.
Estimating the travel time of any route is of great importance for trip planners, traffic operators, online taxi dispatching and ride-sharing platforms, and navigation provider systems. With the advance of technology, many traveling cars, including online taxi dispatch systems' vehicles are equipped with Global Positioning System (GPS) devices that can report the location of the vehicle every few seconds. This paper uses GPS data and the Matrix Factorization techniques to estimate the travel times on all road segments and time intervals simultaneously. We aggregate GPS data into a matrix, where each cell of the original matrix contains the average vehicle speed for a segment and a specific time interval. One of the problems with this matrix is its high sparsity. We use Alternating Least Squares (ALS) method along with a regularization term to factorize the matrix. Since this approach can solve the sparsity problem that arises from the absence of cars in many road segments in a specific time interval, matrix factorization is suitable for estimating the travel time. Our comprehensive evaluation results using real data provided by one of the largest online taxi dispatching systems in Iran, shows the strength of our proposed method.
LGOct 17, 2019
Proposing a Model for Predicting Passenger Origin-Destination in Online Taxi-Hailing SystemsPouria Golshanrad, Hamid Mahini, Behnam Bahrak
Due to the significance of transportation planning, traffic management, and dispatch optimization, predicting passenger origin-destination has emerged as a crucial requirement for intelligent transportation systems management. In this study, we present a model designed to forecast the origin and destination of travels within a specified time window. To derive meaningful travel flows, we employ K-means clustering in a four-dimensional space with a maximum cluster size constraint for origin and destination zones. Given the large number of clusters, we utilize non-negative matrix factorization to reduce the number of travel clusters. Furthermore, we implement a stacked recurrent neural network model to predict the travel count in each cluster. A comparison of our results with existing models reveals that our proposed model achieves a 5-7\% lower mean absolute percentage error (MAPE) for 1-hour time windows and a 14\% lower MAPE for 30-minute time windows.
SISep 15, 2019
Quantitative analysis of cryptocurrencies transaction graphAmir Pasha Motamed, Behnam Bahrak
Cryptocurrencies as a new way of transferring assets and securing financial transactions have gained popularity in recent years. Transactions in cryptocurrencies are publicly available, hence, statistical studies on different aspects of these currencies are possible. However, previous statistical analysis on cryptocurrencies transactions have been very limited and mostly devoted to Bitcoin, with no comprehensive comparison between these currencies. In this study, we intend to compare the transaction graph of Bitcoin, Ethereum, Litecoin, Dash, and Z-Cash, with respect to the dynamics of their transaction graphs over time, and discuss their properties. In particular, we observed that the growth rate of the nodes and edges of the transaction graphs, and the density of these graphs, are closely related to the price of these currencies. We also found that the transaction graph of these currencies is non-assortative, and the degree sequence of their transaction graph follows the power law distribution.
LGJan 29, 2019
Short-term Demand Forecasting for Online Car-hailing Services using Recurrent Neural NetworksAlireza Nejadettehad, Hamid Mahini, Behnam Bahrak
Short-term traffic flow prediction is one of the crucial issues in intelligent transportation system, which is an important part of smart cities. Accurate predictions can enable both the drivers and the passengers to make better decisions about their travel route, departure time and travel origin selection, which can be helpful in traffic management. Multiple models and algorithms based on time series prediction and machine learning were applied to this issue and achieved acceptable results. Recently, the availability of sufficient data and computational power, motivates us to improve the prediction accuracy via deep-learning approaches. Recurrent neural networks have become one of the most popular methods for time series forecasting, however, due to the variety of these networks, the question that which type is the most appropriate one for this task remains unsolved. In this paper, we use three kinds of recurrent neural networks including simple RNN units, GRU and LSTM neural network to predict short-term traffic flow. The dataset from TAP30 Corporation is used for building the models and comparing RNNs with several well-known models, such as DEMA, LASSO and XGBoost. The results show that all three types of RNNs outperform the others, however, more simple RNNs such as simple recurrent units and GRU perform work better than LSTM in terms of accuracy and training time.
CRApr 8, 2013
A GOST-like Blind Signature Scheme Based on Elliptic Curve Discrete Logarithm ProblemHossein Hosseini, Behnam Bahrak, Farzad Hessar
In this paper, we propose a blind signature scheme and three practical educed schemes based on elliptic curve discrete logarithm problem. The proposed schemes impart the GOST signature structure and utilize the inherent advantage of elliptic curve cryptosystems in terms of smaller key size and lower computational overhead to its counterpart public key cryptosystems such as RSA and ElGamal. The proposed schemes are proved to be secure and have less time complexity in comparison with the existing schemes.
CRApr 2, 2013
Security of Spectrum Learning in Cognitive RadiosBehnam Bahrak, Jung-Min Park
Due to delay and energy constraints, a cognitive radio may not be able to perform spectrum sensing in all available channels. Therefore, a sensing policy is needed to decide which channels to sense. The channel selection problem is the problem of designing such a sensing policy to maximize throughput while avoiding interference to primary users. The channel selection problem can be formulated as a reinforcement learning problem. Channel selection schemes that employ reinforcement machine learning algorithms are vulnerable to belief manipulation attacks that contaminate the knowledge base of the learning algorithms. In this paper, we analyze the security of channel selection algorithms that are based on reinforcement learning and propose mitigation techniques that make these algorithms more robust against belief manipulation attacks.