Behrouz Minaei-Bidgoli

CL
h-index: 38
19 papers
1,996 citations
Novelty: 33%
AI Score: 35

19 Papers

AI · Mar 26, 2023
Farspredict: A benchmark dataset for link prediction

Najmeh Torabian, Behrouz Minaei-Bidgoli, Mohsen Jahanshahi

Link prediction with knowledge graph embedding (KGE) is a popular method for knowledge graph completion. Furthermore, training KGEs on non-English knowledge graphs promotes knowledge extraction and knowledge graph reasoning in the context of these languages. However, non-English KGEs pose many challenges to learning a low-dimensional representation of a knowledge graph's entities and relations. This paper proposes "Farspredict", a Persian knowledge graph based on Farsbase (the most comprehensive knowledge graph in Persian). It also explains how the knowledge graph structure affects link prediction accuracy in KGE. To evaluate Farspredict, we implemented the popular KGE models on it and compared the results with Freebase. Given the analysis results, some optimizations on the knowledge graph are carried out to improve its functionality in the KGE. As a result, a new Persian knowledge graph is achieved. In our implementation, the KGE models on Farspredict outperform those on Freebase in many cases. Finally, we discuss which improvements could be effective in enhancing the quality of Farspredict and by how much.

CL · Oct 21, 2023
Emulating the Human Mind: A Neural-symbolic Link Prediction Model with Fast and Slow Reasoning and Filtered Rules

Mohammad Hossein Khojasteh, Najmeh Torabian, Ali Farjami et al.

Link prediction is an important task in addressing the incompleteness problem of knowledge graphs (KG). Previous link prediction models suffer from issues related to either performance or explanatory capability. Furthermore, models that are capable of generating explanations often struggle with erroneous paths or reasoning leading to the correct answer. To address these challenges, we introduce a novel Neural-Symbolic model named FaSt-FLiP (short for Fast and Slow Thinking with Filtered rules for Link Prediction task), inspired by two distinct aspects of human cognition: "commonsense reasoning" and "thinking, fast and slow." Our objective is to combine a logical and neural model for enhanced link prediction. To tackle the challenge of dealing with incorrect paths or rules generated by the logical model, we propose a semi-supervised method to convert rules into sentences. These sentences are then assessed with an NLI (Natural Language Inference) model, and incorrect rules are removed. Our approach to combining logical and neural models involves first obtaining answers from both the logical and neural models. These answers are subsequently unified using an Inference Engine module, which has been realized through both algorithmic implementation and a novel neural model architecture. To validate the efficacy of our model, we conducted a series of experiments. The results demonstrate the superior performance of our model in both link prediction metrics and the generation of more reliable explanations.

CL · Jun 22, 2023
Noor-Ghateh: A Benchmark Dataset for Evaluating Arabic Word Segmenters in Hadith Domain

Huda AlShuhayeb, Behrouz Minaei-Bidgoli, Mohammad E. Shenassa et al.

The Arabic language has numerous complex and rich morphological features, which are highly useful when analyzing traditional Arabic textbooks, especially in literary and religious contexts, and which help in understanding the meaning of those textbooks. Vocabulary separation means separating a word into its components, such as the root and affixes. In morphological datasets, the variety of markers and the number of data samples help to evaluate morphological techniques. In this paper, we present a standard dataset for analyzing Arabic segmentation tools, which includes approximately 223,690 words from the "Shariat al-Islam" book, labeled by human experts. In terms of volume and word variety, this dataset is superior to the other Hadith Arabic datasets, to the best of our knowledge. To evaluate the dataset, we applied different methods, including Farasa, Camel, and ALP, reported the annotation quality, and analyzed the benchmark specifications as well. This be

CL · Oct 4, 2025
Rezwan: Leveraging Large Language Models for Comprehensive Hadith Text Processing: A 1.2M Corpus Development

Majid Asgari-Bidhendi, Muhammad Amin Ghaseminia, Alireza Shahbazi et al.

This paper presents the development of Rezwan, a large-scale AI-assisted Hadith corpus comprising over 1.2M narrations, extracted and structured through a fully automated pipeline. Building on digital repositories such as Maktabat Ahl al-Bayt, the pipeline employs Large Language Models (LLMs) for segmentation, chain-text separation, validation, and multi-layer enrichment. Each narration is enhanced with machine translation into twelve languages, intelligent diacritization, abstractive summarization, thematic tagging, and cross-text semantic analysis. This multi-step process transforms raw text into a richly annotated research-ready infrastructure for digital humanities and Islamic studies. A rigorous evaluation was conducted on 1,213 randomly sampled narrations, assessed by six domain experts. Results show near-human accuracy in structured tasks such as chain-text separation (9.33/10) and summarization (9.33/10), while highlighting ongoing challenges in diacritization and semantic similarity detection. Comparative analysis against the manually curated Noor Corpus demonstrates the superiority of Najm in both scale and quality, with a mean overall score of 8.46/10 versus 3.66/10. Furthermore, cost analysis confirms the economic feasibility of the AI approach: tasks requiring over 229,000 hours of expert labor were completed within months at a fraction of the cost. The work introduces a new paradigm in religious text processing by showing how AI can augment human expertise, enabling large-scale, multilingual, and semantically enriched access to Islamic heritage.

CL · Jan 10, 2025
Bactrainus: Optimizing Large Language Models for Multi-hop Complex Question Answering Tasks

Iman Barati, Arash Ghafouri, Behrouz Minaei-Bidgoli

In recent years, the use of large language models (LLMs) has significantly increased, and these models have demonstrated remarkable performance in a variety of general language tasks. However, the evaluation of their performance in domain-specific tasks, particularly those requiring deep natural language understanding, has received less attention. In this research, we evaluate the ability of large language models in performing domain-specific tasks, focusing on the multi-hop question answering (MHQA) problem using the HotpotQA dataset. This task, due to its requirement for reasoning and combining information from multiple textual sources, serves as a challenging benchmark for assessing the language comprehension capabilities of these models. To tackle this problem, we have designed a two-stage selector-reader architecture, where each stage utilizes an independent LLM. In addition, methods such as Chain of Thought (CoT) and question decomposition have been employed to investigate their impact on improving the model's performance. The results of the study show that the integration of large language models with these techniques can lead to up to a 4% improvement in F1 score for finding answers, providing evidence of the models' ability to handle domain-specific tasks and their understanding of complex language.

CL · Feb 22, 2022
Evaluating Persian Tokenizers

Danial Kamali, Behrooz Janfada, Mohammad Ebrahim Shenasa et al.

Tokenization plays a significant role in the process of lexical analysis. Tokens become the input for other natural language processing tasks, like semantic parsing and language modeling. Natural Language Processing in Persian is challenging due to Persian's exceptional cases, such as half-spaces. Thus, it is crucial to have a precise tokenizer for Persian. This article provides a novel work by introducing the most widely used tokenizers for Persian and comparing and evaluating their performance on Persian texts using a simple algorithm with a pre-tagged Persian dependency dataset. After evaluating tokenizers with the F1-Score, the hybrid version of the Farsi Verb and Hazm with bounded morphemes fixing showed the best performance with an F1 score of 98.97%.
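
The abstract does not spell out the evaluation algorithm, so the following is only a minimal sketch of one common way to score a tokenizer with F1 against gold tokens. It assumes both tokenizations cover the same characters once whitespace is ignored, and it counts a predicted token as correct only when its character span matches a gold span exactly. The function names `token_spans` and `tokenization_f1` and the toy English example are hypothetical, not taken from the paper.

```python
def token_spans(tokens):
    """Map a token sequence to a set of (start, end) character offsets.

    Offsets are counted over the concatenated tokens, so whitespace
    between tokens is ignored (an assumption of this sketch).
    """
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def tokenization_f1(gold_tokens, predicted_tokens):
    """F1 over exact span matches between gold and predicted tokens."""
    gold = token_spans(gold_tokens)
    pred = token_spans(predicted_tokens)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example: the predicted tokenizer merges two gold tokens.
gold = ["I", "can", "not", "go"]
pred = ["I", "cannot", "go"]
score = tokenization_f1(gold, pred)
```

Here two of the three predicted tokens match gold spans (precision 2/3, recall 2/4), so `score` is 4/7. A Persian evaluation would additionally need a policy for half-spaces, which this sketch leaves out.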

CL · Jun 27, 2021
KGRefiner: Knowledge Graph Refinement for Improving Accuracy of Translational Link Prediction Methods

Mohammad Javad Saeedizade, Najmeh Torabian, Behrouz Minaei-Bidgoli

Link prediction is the task of predicting missing relations between entities of a knowledge graph. Recent work in link prediction has attempted to increase link prediction accuracy by using more layers in the neural network architecture. In this paper, we propose a novel method of refining the knowledge graph so that link prediction can be performed more accurately using relatively fast translational models. Translational link prediction models, such as TransE, TransH, and TransD, have less complexity than deep learning approaches. Our method uses the hierarchy of relationships and entities in the knowledge graph to add the entity information as auxiliary nodes to the graph and connect them to the nodes which contain this information in their hierarchy. Our experiments show that our method can significantly increase the performance of translational link prediction methods in H@10, MR, and MRR.
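
For readers unfamiliar with the terms in this abstract, the sketch below illustrates the standard TransE scoring idea (a triple is plausible when head + relation lies close to tail) and the three ranking metrics named above. It does not implement the refinement method itself. The toy two-dimensional embeddings, the entity names, and the helper names are assumptions made for illustration.

```python
import math


def transe_score(head, relation, tail):
    """TransE plausibility: negative L2 distance between head + relation and tail."""
    return -math.sqrt(
        sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    )


def rank_of_true_tail(head, relation, true_tail, entities):
    """1-based rank of the true tail among all candidate entities."""
    true_score = transe_score(head, relation, entities[true_tail])
    better = sum(
        1
        for name, vec in entities.items()
        if name != true_tail and transe_score(head, relation, vec) > true_score
    )
    return better + 1


def ranking_metrics(ranks, k=10):
    """Mean rank (MR), mean reciprocal rank (MRR), and Hits@k (H@k)."""
    n = len(ranks)
    return {
        "MR": sum(ranks) / n,
        "MRR": sum(1.0 / r for r in ranks) / n,
        f"H@{k}": sum(1 for r in ranks if r <= k) / n,
    }


# Hypothetical toy embeddings, chosen so that tehran + capital_of == iran.
entities = {"tehran": [1.0, 0.0], "iran": [1.0, 1.0], "paris": [0.0, 0.0]}
capital_of = [0.0, 1.0]

ranks = [rank_of_true_tail(entities["tehran"], capital_of, "iran", entities)]
metrics = ranking_metrics(ranks)
```

Lower MR is better, while higher MRR and H@10 are better, which is why the three are usually reported together.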

CL · Apr 15, 2021
A Sample-Based Training Method for Distantly Supervised Relation Extraction with Pre-Trained Transformers

Mehrdad Nasser, Mohamad Bagher Sajadi, Behrouz Minaei-Bidgoli

Multiple instance learning (MIL) has become the standard learning paradigm for distantly supervised relation extraction (DSRE). However, because relation extraction is performed at the bag level, MIL has significant hardware requirements for training when coupled with large sentence encoders such as deep transformer neural networks. In this paper, we propose a novel sampling method for DSRE that relaxes these hardware requirements. In the proposed method, we limit the number of sentences in a batch by randomly sampling sentences from the bags in the batch. However, this comes at the cost of losing valid sentences from bags. To alleviate the issues caused by random sampling, we use an ensemble of trained models for prediction. We demonstrate the effectiveness of our approach by using our proposed learning setting to fine-tune BERT on the widely used NYT dataset. Our approach significantly outperforms previous state-of-the-art methods in terms of AUC and P@N metrics.
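
The two ideas in this abstract, capping the number of sentences per batch by random sampling and averaging an ensemble at prediction time, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: sampling is uniform over the pooled sentences of the batch (so a bag may lose all of its sentences), each model is a hypothetical callable returning per-relation probabilities, and all names are made up for the example.

```python
import random


def sample_batch(bags, max_sentences, rng):
    """Cap the total number of sentences in a batch by uniform random sampling.

    `bags` maps a bag id to its list of sentences. Bags that lose every
    sentence are dropped from the result (an assumption of this sketch).
    """
    pool = [(bag_id, s) for bag_id, sents in bags.items() for s in sents]
    chosen = pool if len(pool) <= max_sentences else rng.sample(pool, max_sentences)
    sampled = {}
    for bag_id, sentence in chosen:
        sampled.setdefault(bag_id, []).append(sentence)
    return sampled


def ensemble_predict(models, bag):
    """Average per-relation probabilities over several trained models."""
    totals = {}
    for model in models:
        for relation, prob in model(bag).items():
            totals[relation] = totals.get(relation, 0.0) + prob
    return {relation: total / len(models) for relation, total in totals.items()}


# Toy batch of two bags with five sentences in total, capped at three.
bags = {"bag1": ["s1", "s2", "s3"], "bag2": ["s4", "s5"]}
sampled = sample_batch(bags, max_sentences=3, rng=random.Random(0))

# Stand-in "models" that return fixed probabilities for any bag.
models = [
    lambda bag: {"born_in": 0.8, "NA": 0.2},
    lambda bag: {"born_in": 0.6, "NA": 0.4},
]
averaged = ensemble_predict(models, bags["bag1"])
```

Because each model in the ensemble sees a different random subset during training, averaging their outputs is one plausible way to recover information that any single sampled batch dropped.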

CL · Apr 4, 2021
Interval Probabilistic Fuzzy WordNet

Yousef Alizadeh-Q, Behrouz Minaei-Bidgoli, Sayyed-Ali Hossayni et al.

The WordNet lexical database groups English words into sets of synonyms called "synsets." Synsets are utilized for several applications in the field of text mining. However, they have also been open to criticism: in reality, not all the members of a synset represent the meaning of that synset to the same degree, yet in practice they are all treated identically as members of the synset. Thus, the fuzzy version of synsets, called fuzzy-synsets (or fuzzy word-sense classes), was proposed and studied. In this study, we discuss why (type-1) fuzzy synsets (T1 F-synsets) do not properly model the membership uncertainty, and propose an upgraded version of fuzzy synsets in which membership degrees of word-senses are represented by intervals, similar to those in Interval Type 2 Fuzzy Sets (IT2 FS). We then argue that the IT2 FS theoretical framework is insufficient for the analysis and design of such synsets, and propose a new concept, called Interval Probabilistic Fuzzy (IPF) sets. Then we present an algorithm for constructing the IPF synsets in any language, given a corpus and a word-sense-disambiguation system. Utilizing our algorithm, the Open American National Corpus (OANC), and UKB word-sense disambiguation, we constructed and published the IPF synsets of WordNet for the English language.

CL · Jan 31, 2021
An Unsupervised Language-Independent Entity Disambiguation Method and its Evaluation on the English and Persian Languages

Majid Asgari-Bidhendi, Behrooz Janfada, Amir Havangi et al.

Entity Linking is one of the essential tasks of information extraction and natural language understanding. Entity linking mainly consists of two tasks: recognition and disambiguation of named entities. Most studies address these two tasks separately or focus only on one of them. Moreover, most of the state-of-the-art entity linking algorithms are either supervised, and so perform poorly in the absence of annotated corpora, or language-dependent, and so are not appropriate for multi-lingual applications. In this paper, we introduce an Unsupervised Language-Independent Entity Disambiguation (ULIED) method, which utilizes a novel approach to disambiguate and link named entities. Evaluation of ULIED on different English entity linking datasets as well as the only available Persian dataset illustrates that ULIED in most cases outperforms the state-of-the-art unsupervised multi-lingual approaches.

CL · Jul 24, 2020
IUST at SemEval-2020 Task 9: Sentiment Analysis for Code-Mixed Social Media Text using Deep Neural Networks and Linear Baselines

Soroush Javdan, Taha Shangipour ataei, Behrouz Minaei-Bidgoli

Sentiment Analysis is a well-studied field of Natural Language Processing. However, the rapid growth of social media and the noisy content within it pose significant challenges in addressing this problem with well-established methods and tools. One of these challenges is code-mixing, which means using different languages to convey thoughts in social media texts. Our group, named IUST (username: TAHA), participated in the SemEval-2020 shared task 9 on Sentiment Analysis for Code-Mixed Social Media Text, and we attempted to develop a system to predict the sentiment of a given code-mixed tweet. We used different preprocessing techniques and proposed different methods, ranging from NBSVM to more complicated deep neural network models. Our best performing method obtains an F1 score of 0.751 for the Spanish-English sub-task and 0.706 for the Hindi-English sub-task.

CL · May 13, 2020
PERLEX: A Bilingual Persian-English Gold Dataset for Relation Extraction

Majid Asgari-Bidhendi, Mehrdad Nasser, Behrooz Janfada et al.

Relation extraction is the task of extracting semantic relations between entities in a sentence. It is an essential part of some natural language processing tasks such as information extraction, knowledge extraction, and knowledge base population. The main motivations of this research stem from a lack of a dataset for relation extraction in the Persian language as well as the necessity of extracting knowledge from the growing big-data in the Persian language for different applications. In this paper, we present "PERLEX" as the first Persian dataset for relation extraction, which is an expert-translated version of the "Semeval-2010-Task-8" dataset. Moreover, this paper addresses Persian relation extraction utilizing state-of-the-art language-agnostic algorithms. We employ six different models for relation extraction on the proposed bilingual dataset, including a non-neural model (as the baseline), three neural models, and two deep learning models fed by multilingual-BERT contextual word representations. The experiments result in the maximum f-score 77.66% (provided by BERTEM-MTB method) as the state-of-the-art of relation extraction in the Persian language.

CL · May 4, 2020
FarsBase-KBP: A Knowledge Base Population System for the Persian Knowledge Graph

Majid Asgari-Bidhendi, Behrooz Janfada, Behrouz Minaei-Bidgoli

While most of the knowledge bases already support the English language, there is only one knowledge base for the Persian language, known as FarsBase, which is automatically created via semi-structured web information. Unlike English knowledge bases such as Wikidata, which have tremendous community support, the population of a knowledge base like FarsBase must rely on automatically extracted knowledge. Knowledge base population can let FarsBase keep growing in size as the system continues working. In this paper, we present a knowledge base population system for the Persian language, which extracts knowledge from unlabeled raw text crawled from the Web. The proposed system consists of a set of state-of-the-art modules such as an entity linking module as well as information and relation extraction modules designed for FarsBase. Moreover, a canonicalization system is introduced to link extracted relations to FarsBase properties. Then, the system uses knowledge fusion techniques with minimal intervention of human experts to integrate and filter the proper knowledge instances extracted by each module. To evaluate the performance of the presented knowledge base population system, we present the first gold dataset for benchmarking knowledge base population in the Persian language, which consists of 22,015 FarsBase triples verified by human experts. The evaluation results demonstrate the efficiency of the proposed system.

CL · Apr 22, 2020
ParsEL 1.0: Unsupervised Entity Linking in Persian Social Media Texts

Majid Asgari-Bidhendi, Farzane Fakhrian, Behrouz Minaei-Bidgoli

In recent years, social media data has grown exponentially and can be counted among the largest data repositories in the world. A large portion of this social media data is natural language text. However, natural language is highly ambiguous because entities are frequently mentioned through polysemous words or phrases. Entity linking is the task of linking the entity mentions in the text to their corresponding entities in a knowledge base. Recently, FarsBase, a Persian knowledge graph, has been introduced containing almost half a million entities. In this paper, we propose an unsupervised Persian Entity Linking system, the first entity linking system specially focused on the Persian language, which utilizes context-dependent and context-independent features. For this purpose, we also publish the first entity linking corpus of the Persian language, containing 67,595 words that have been crawled from social media texts of some popular channels in the Telegram messenger. The proposed method achieves an F-score of 86.94% for the Persian language, which is comparable with similar state-of-the-art methods in the English language.

CL · Jul 26, 2019
Pars-ABSA: an Aspect-based Sentiment Analysis dataset for Persian

Taha Shangipour Ataei, Kamyar Darvishi, Soroush Javdan et al.

Due to the increased availability of online reviews, sentiment analysis has witnessed booming interest from researchers. Sentiment analysis is a computational treatment of sentiment used to extract and understand the opinions of authors. While many systems were built to predict the sentiment of a document or a sentence, many others provide the necessary detail on various aspects of the entity (i.e. aspect-based sentiment analysis). Most of the available data resources were tailored to English and the other popular European languages. Although Persian is a language with more than 110 million speakers, to the best of our knowledge, there is a lack of public datasets on aspect-based sentiment analysis for Persian. This paper provides a manually annotated Persian dataset, Pars-ABSA, which is verified by 3 native Persian speakers. The dataset consists of 5,114 positive, 3,061 negative and 1,827 neutral data samples from 5,602 unique reviews. Moreover, as a baseline, this paper reports the performance of some state-of-the-art aspect-based sentiment analysis methods with a focus on deep learning, on Pars-ABSA. The obtained results are impressive compared to similar English state-of-the-art.

ML · Oct 9, 2016
A new selection strategy for selective cluster ensemble based on Diversity and Independency

Muhammad Yousefnezhad, Ali Reihanian, Daoqiang Zhang et al.

This research introduces a new strategy in cluster ensemble selection by using Independency and Diversity metrics. In recent years, Diversity and Quality, which are two metrics in the evaluation procedure, have been used for selecting basic clustering results in cluster ensemble selection. Although quality can improve the final results in a cluster ensemble, it cannot control the procedures of generating basic results, which causes a gap in prediction of the generated basic results' accuracy. Instead of quality, this paper introduces Independency as a supplementary method to be used in conjunction with Diversity. Therefore, this paper uses a heuristic metric, which is based on the procedure of converting code to graph in Software Testing, in order to calculate the Independency of two basic clustering algorithms. Moreover, a new modeling language, which we call the "Clustering Algorithms Independency Language" (CAIL), is introduced in order to generate graphs which depict the Independency of algorithms. Also, Uniformity, a new similarity metric, is introduced for evaluating the diversity of basic results. Our experimental results on a variety of standard data sets show that the proposed framework improves the accuracy of final results dramatically in comparison with other cluster ensemble methods.

SI · Apr 26, 2016
Evaluating the effect of topic consideration in identifying communities of rating-based social networks

Ali Reihanian, Behrouz Minaei-Bidgoli, Muhammad Yousefnezhad

Finding meaningful communities in social networks has attracted the attention of many researchers. The community structure of complex networks reveals both their organization and hidden relations among their constituents. Most research in the field of community detection focuses on the topological structure of the network without performing any content analysis. Nowadays, real-world social networks contain a vast range of information including shared objects, comments, following information, etc. In recent years, a number of studies have proposed approaches which consider both the contents that are interchanged in the networks and the topological structures of the networks in order to find more meaningful communities. In this research, extensive experiments demonstrate the effect of topic analysis in finding more meaningful communities in social networking sites in which users express their feelings toward different objects (like movies) by means of ratings.

IR · May 16, 2013
Multi-View Learning for Web Spam Detection

Ali Hadian, Behrouz Minaei-Bidgoli

Spam pages are designed to maliciously appear among the top search results by excessive usage of popular terms. Therefore, spam pages should be removed using an effective and efficient spam detection system. Previous methods for web spam classification used several features from various information sources (page contents, web graph, access logs, etc.) to detect web spam. In this paper, we follow a page-level classification approach to build fast and scalable spam filters. We show that each web page can be classified with satisfactory accuracy using only its own HTML content. In order to design a multi-view classification system, we used state-of-the-art spam classification methods with distinct feature sets (views) as the base classifiers. Then, a fusion model is learned to combine the output of the base classifiers and make the final prediction. Results show that multi-view learning significantly improves the classification performance, namely AUC by 22%, while providing linear speedup for parallel execution.
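
The fusion step described in this abstract can be pictured with a small sketch: each base classifier (one per view) emits a spam score for a page, and a second model is trained on those scores to make the final call. The abstract does not say which fusion learner is used, so a plain perceptron stands in here purely as an assumption. The two views, their scores, and the function names are hypothetical toy values.

```python
def train_fusion(view_scores, labels, max_epochs=1000):
    """Train a perceptron on base-classifier scores (1 = spam, 0 = not spam).

    Stops early once a full pass over the data makes no mistakes.
    """
    weights = [0.0] * len(view_scores[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for scores, label in zip(view_scores, labels):
            target = 1 if label == 1 else -1
            activation = sum(w * s for w, s in zip(weights, scores)) + bias
            if target * activation <= 0:
                # Standard perceptron update on a misclassified example.
                weights = [w + target * s for w, s in zip(weights, scores)]
                bias += target
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


def fuse(weights, bias, scores):
    """Final prediction from the per-view scores of one page."""
    return 1 if sum(w * s for w, s in zip(weights, scores)) + bias > 0 else 0


# Hypothetical scores from two views (e.g. content-based and link-based).
view_scores = [
    (0.9, 0.8), (0.8, 0.3), (0.7, 0.9),  # spam pages
    (0.2, 0.1), (0.3, 0.4), (0.1, 0.2),  # non-spam pages
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_fusion(view_scores, labels)
predictions = [fuse(weights, bias, s) for s in view_scores]
```

Because each base classifier only needs its own view, the base scores can be computed independently and in parallel before the inexpensive fusion step runs.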

LG · Jan 29, 2012
A Comparison Between Data Mining Prediction Algorithms for Fault Detection (Case study: Ahanpishegan co.)

Golriz Amooee, Behrouz Minaei-Bidgoli, Malihe Bagheri-Dehnavi

In the current competitive world, industrial companies seek to manufacture products of higher quality, which can be achieved by increasing reliability, maintainability and thus the availability of products. On the other hand, improvement in the product lifecycle is necessary for achieving high reliability. Typically, maintenance activities aim to reduce failures of industrial machinery and minimize the consequences of such failures. So industrial companies try to improve their efficiency by using different fault detection techniques. One strategy is to process and analyze previously generated data to predict future failures. The purpose of this paper is to detect wasted parts using different data mining algorithms and compare the accuracy of these algorithms. A combination of thermal and physical characteristics has been used and the algorithms were implemented on Ahanpishegan's current data to estimate the availability of its produced parts. Keywords: Data Mining, Fault Detection, Availability, Prediction Algorithms.