LGApr 24, 2017Code
GaKCo: a Fast GApped k-mer string Kernel using COuntingRitambhara Singh, Arshdeep Sekhon, Kamran Kowsari et al.
String Kernel (SK) techniques, especially those using gapped $k$-mers as features (gk), have obtained great success in classifying sequences like DNA, protein, and text. However, the state-of-the-art gk-SK runs extremely slow when we increase the dictionary size ($Σ$) or allow more mismatches ($M$). This is because current gk-SK uses a trie-based algorithm to calculate co-occurrence of mismatched substrings resulting in a time cost proportional to $O(Σ^{M})$. We propose a \textbf{fast} algorithm for calculating \underline{Ga}pped $k$-mer \underline{K}ernel using \underline{Co}unting (GaKCo). GaKCo uses associative arrays to calculate the co-occurrence of substrings using cumulative counting. This algorithm is fast, scalable to larger $Σ$ and $M$, and naturally parallelizable. We provide a rigorous asymptotic analysis that compares GaKCo with the state-of-the-art gk-SK. Theoretically, the time cost of GaKCo is independent of the $Σ^{M}$ term that slows down the trie-based approach. Experimentally, we observe that GaKCo achieves the same accuracy as the state-of-the-art and outperforms its speed by factors of 2, 100, and 4, on classifying sequences of DNA (5 datasets), protein (12 datasets), and character-based English text (2 datasets), respectively. GaKCo is shared as an open source tool at \url{https://github.com/QData/GaKCo-SVM}
CYJun 15, 2024
Transformative Influence of LLM and AI Tools in Student Social Media Engagement: Analyzing Personalization, Communication Efficiency, and Collaborative LearningMasoud Bashiri, Kamran Kowsari
The advent of Large Language Models (LLMs) and Artificial Intelligence (AI) tools has revolutionized various facets of our lives, particularly in the realm of social media. For students, these advancements have unlocked unprecedented opportunities for learning, collaboration, and personal growth. AI-driven applications are transforming how students interact with social media, offering personalized content and recommendations, and enabling smarter, more efficient communication. Recent studies utilizing data from UniversityCube underscore the profound impact of AI tools on students' academic and social experiences. These studies reveal that students engaging with AI-enhanced social media platforms report higher academic performance, enhanced critical thinking skills, and increased engagement in collaborative projects. Moreover, AI tools assist in filtering out distracting content, allowing students to concentrate more on educational materials and pertinent discussions. The integration of LLMs in social media has further facilitated improved peer-to-peer communication and mentorship opportunities. AI algorithms effectively match students based on shared academic interests and career goals, fostering a supportive and intellectually stimulating online community, thereby contributing to increased student satisfaction and retention rates. In this article, we delve into the data provided by UniversityCube to explore how LLMs and AI tools are specifically transforming social media for students. Through case studies and statistical analyses, we offer a comprehensive understanding of the educational and social benefits these technologies offer. Our exploration highlights the potential of AI-driven tools to create a more enriched, efficient, and supportive educational environment for students in the digital age.
CYJun 14, 2021
Toward a Knowledge Discovery Framework for Data Science Job Market in the United StatesMojtaba Heidarysafa, Kamran Kowsari, Masoud Bashiri et al.
The growth of the data science field requires better tools to understand such a fast-paced growing domain. Moreover, individuals from different backgrounds became interested in following a career as data scientists. Therefore, providing a quantitative guide for individuals and organizations to understand the skills required in the job market would be crucial. This paper introduces a framework to analyze the job market for data science-related jobs within the US while providing an interface to access insights in this market. The proposed framework includes three sub-modules allowing continuous data collection, information extraction, and a web-based dashboard visualization to investigate the spatial and temporal distribution of data science-related jobs and skills. The result of this work shows important skills for the main branches of data science jobs and attempts to provide a skill-based definition of these data science branches. The current version of this application is deployed on the web and allows individuals and institutes to investigate skills required for data science positions through the industry lens.
LGNov 9, 2020
Sparse Longitudinal Representations of Electronic Health Record Data for the Early Detection of Chronic Kidney Disease in Diabetic PatientsJinghe Zhang, Kamran Kowsari, Mehdi Boukhechba et al.
Chronic kidney disease (CKD) is a gradual loss of renal function over time, and it increases the risk of mortality, decreased quality of life, as well as serious complications. The prevalence of CKD has been increasing in the last couple of decades, which is partly due to the increased prevalence of diabetes and hypertension. To accurately detect CKD in diabetic patients, we propose a novel framework to learn sparse longitudinal representations of patients' medical records. The proposed method is also compared with widely used baselines such as Aggregated Frequency Vector and Bag-of-Pattern in Sequences on real EHR data, and the experimental results indicate that the proposed model achieves higher predictive performance. Additionally, the learned representations are interpreted and visualized to bring clinical insights.
SPOct 28, 2020
HHAR-net: Hierarchical Human Activity Recognition using Neural NetworksMehrdad Fazli, Kamran Kowsari, Erfaneh Gharavi et al.
Activity recognition using built-in sensors in smart and wearable devices provides great opportunities to understand and detect human behavior in the wild and gives a more holistic view of individuals' health and well being. Numerous computational methods have been applied to sensor streams to recognize different daily activities. However, most methods are unable to capture different layers of activities concealed in human behavior. Also, the performance of the models starts to decrease with increasing the number of activities. This research aims at building a hierarchical classification with Neural Networks to recognize human activities based on different levels of abstraction. We evaluate our model on the Extrasensory dataset; a dataset collected in the wild and containing data from smartphones and smartwatches. We use a two-level hierarchy with a total of six mutually exclusive labels namely, "lying down", "sitting", "standing in place", "walking", "running", and "bicycling" divided into "stationary" and "non-stationary". The results show that our model can recognize low-level activities (stationary/non-stationary) with 95.8% accuracy and overall accuracy of 92.8% over six labels. This is 3% above our best performing baseline.
IVJun 12, 2020
HMIC: Hierarchical Medical Image Classification, A Deep Learning ApproachKamran Kowsari, Rasoul Sali, Lubaina Ehsan et al.
Image classification is central to the big data revolution in medicine. Improved information processing methods for diagnosis and classification of digital medical images have shown to be successful via deep learning approaches. As this field is explored, there are limitations to the performance of traditional supervised classifiers. This paper outlines an approach that is different from the current medical image classification tasks that view the issue as multi-class classification. We performed a hierarchical classification using our Hierarchical Medical Image classification (HMIC) approach. HMIC uses stacks of deep learning models to give particular comprehension at each level of the clinical picture hierarchy. For testing our performance, we use biopsy of the small bowel images that contain three categories in the parent level (Celiac Disease, Environmental Enteropathy, and histologically normal controls). For the child level, Celiac Disease Severity is classified into 4 classes (I, IIIa, IIIb, and IIIC).
LGJun 11, 2020
Diagnosis and Analysis of Celiac Disease and Environmental Enteropathy on Biopsy Images using Deep Learning ApproachesKamran Kowsari
Celiac Disease (CD) and Environmental Enteropathy (EE) are common causes of malnutrition and adversely impact normal childhood development. Both conditions require a tissue biopsy for diagnosis and a major challenge of interpreting clinical biopsy images to differentiate between these gastrointestinal diseases is striking histopathologic overlap between them. In the current study, we propose four diagnosis techniques for these diseases and address their limitations and advantages. First, the diagnosis between CD, EE, and Normal biopsies is considered, but the main challenge with this diagnosis technique is the staining problem. The dataset used in this research is collected from different centers with different staining standards. To solve this problem, we use color balancing in order to train our model with a varying range of colors. Random Multimodel Deep Learning (RMDL) architecture has been used as another approach to mitigate the effects of the staining problem. RMDL combines different architectures and structures of deep learning and the final output of the model is based on the majority vote. CD is a chronic autoimmune disease that affects the small intestine genetically predisposed children and adults. Typically, CD rapidly progress from Marsh I to IIIa. Marsh III is sub-divided into IIIa (partial villus atrophy), Marsh IIIb (subtotal villous atrophy), and Marsh IIIc (total villus atrophy) to explain the spectrum of villus atrophy along with crypt hypertrophy and increased intraepithelial lymphocytes. In the second part of this study, we proposed two ways for diagnosing different stages of CD. Finally, in the third part of this study, these two steps are combined as Hierarchical Medical Image Classification (HMIC) to have a model to diagnose the disease data hierarchically.
SIApr 13, 2020
Gender Detection on Social Networks using Ensemble Deep LearningKamran Kowsari, Mojtaba Heidarysafa, Tolu Odukoya et al.
Analyzing the ever-increasing volume of posts on social media sites such as Facebook and Twitter requires improved information processing methods for profiling authorship. Document classification is central to this task, but the performance of traditional supervised classifiers has degraded as the volume of social media has increased. This paper addresses this problem in the context of gender detection through ensemble classification that employs multi-model deep learning architectures to generate specialized understanding from different feature spaces.
CLDec 9, 2019
Women in ISIS Propaganda: A Natural Language Processing Analysis of Topics and Emotions in a Comparison with Mainstream Religious GroupMojtaba Heidarysafa, Kamran Kowsari, Tolu Odukoya et al.
Online propaganda is central to the recruitment strategies of extremist groups and in recent years these efforts have increasingly extended to women. To investigate ISIS' approach to targeting women in their online propaganda and uncover implications for counterterrorism, we rely on text mining and natural language processing (NLP). Specifically, we extract articles published in Dabiq and Rumiyah (ISIS's online English language publications) to identify prominent topics. To identify similarities or differences between these texts and those produced by non-violent religious groups, we extend the analysis to articles from a Catholic forum dedicated to women. We also perform an emotional analysis of both of these resources to better understand the emotional components of propaganda. We rely on Depechemood (a lexical-base emotion analysis method) to detect emotions most likely to be evoked in readers of these materials. The findings indicate that the emotional appeal of ISIS and Catholic materials are similar
IVOct 7, 2019
CeliacNet: Celiac Disease Severity Diagnosis on Duodenal Histopathological Images Using Deep Residual NetworksRasoul Sali, Lubaina Ehsan, Kamran Kowsari et al.
Celiac Disease (CD) is a chronic autoimmune disease that affects the small intestine in genetically predisposed children and adults. Gluten exposure triggers an inflammatory cascade which leads to compromised intestinal barrier function. If this enteropathy is unrecognized, this can lead to anemia, decreased bone density, and, in longstanding cases, intestinal cancer. The prevalence of the disorder is 1% in the United States. An intestinal (duodenal) biopsy is considered the "gold standard" for diagnosis. The mild CD might go unnoticed due to non-specific clinical symptoms or mild histologic features. In our current work, we trained a model based on deep residual networks to diagnose CD severity using a histological scoring system called the modified Marsh score. The proposed model was evaluated using an independent set of 120 whole slide images from 15 CD patients and achieved an AUC greater than 0.96 in all classes. These results demonstrate the diagnostic power of the proposed model for CD severity classification using histological images.
LGApr 17, 2019
Text Classification Algorithms: A SurveyKamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa et al.
In recent years, there has been an exponential growth in the number of complex documents and texts that require a deeper understanding of machine learning methods to be able to accurately classify texts in many applications. Many machine learning approaches have achieved surpassing results in natural language processing. The success of these learning algorithms relies on their capacity to understand complex models and non-linear relationships within data. However, finding suitable structures, architectures, and techniques for text classification is a challenge for researchers. In this paper, a brief overview of text classification algorithms is discussed. This overview covers different text feature extractions, dimensionality reduction methods, existing algorithms and techniques, and evaluations methods. Finally, the limitations of each technique and their application in the real-world problem are discussed.
IVApr 10, 2019
Diagnosis of Celiac Disease and Environmental Enteropathy on Biopsy Images Using Color Balancing on Convolutional Neural NetworksKamran Kowsari, Rasoul Sali, Marium N. Khan et al.
Celiac Disease (CD) and Environmental Enteropathy (EE) are common causes of malnutrition and adversely impact normal childhood development. CD is an autoimmune disorder that is prevalent worldwide and is caused by an increased sensitivity to gluten. Gluten exposure destructs the small intestinal epithelial barrier, resulting in nutrient mal-absorption and childhood under-nutrition. EE also results in barrier dysfunction but is thought to be caused by an increased vulnerability to infections. EE has been implicated as the predominant cause of under-nutrition, oral vaccine failure, and impaired cognitive development in low-and-middle-income countries. Both conditions require a tissue biopsy for diagnosis, and a major challenge of interpreting clinical biopsy images to differentiate between these gastrointestinal diseases is striking histopathologic overlap between them. In the current study, we propose a convolutional neural network (CNN) to classify duodenal biopsy images from subjects with CD, EE, and healthy controls. We evaluated the performance of our proposed model using a large cohort containing 1000 biopsy images. Our evaluations show that the proposed model achieves an area under ROC of 0.99, 1.00, and 0.97 for CD, EE, and healthy controls, respectively. These results demonstrate the discriminative power of the proposed model in duodenal biopsies classification.
CVNov 15, 2018
From Videos to URLs: A Multi-Browser Guide To Extract User's Behavior with Optical Character RecognitionMojtaba Heidarysafa, James Reed, Kamran Kowsari et al.
Tracking users' activities on the World Wide Web (WWW) allows researchers to analyze each user's internet behavior as time passes and for the amount of time spent on a particular domain. This analysis can be used in research design, as researchers may access to their participant's behaviors while browsing the web. Web search behavior has been a subject of interest because of its real-world applications in marketing, digital advertisement, and identifying potential threats online. In this paper, we present an image-processing based method to extract domains which are visited by a participant over multiple browsers during a lab session. This method could provide another way to collect users' activities during an online session given that the session recorder collected the data. The method can also be used to collect the textual content of web-pages that an individual visits for later analysis
CLOct 17, 2018
Analysis of Railway Accidents' Narratives Using Deep LearningMojtaba Heidarysafa, Kamran Kowsari, Laura E. Barnes et al.
Automatic understanding of domain specific texts in order to extract useful relationships for later use is a non-trivial task. One such relationship would be between railroad accidents' causes and their correspondent descriptions in reports. From 2001 to 2016 rail accidents in the U.S. cost more than $4.6B. Railroads involved in accidents are required to submit an accident report to the Federal Railroad Administration (FRA). These reports contain a variety of fixed field entries including primary cause of the accidents (a coded variable with 389 values) as well as a narrative field which is a short text description of the accident. Although these narratives provide more information than a fixed field entry, the terminologies used in these reports are not easy to understand by a non-expert reader. Therefore, providing an assisting method to fill in the primary cause from such domain specific texts(narratives) would help to label the accidents with more accuracy. Another important question for transportation safety is whether the reported accident cause is consistent with narrative description. To address these questions, we applied deep learning methods together with powerful word embeddings such as Word2Vec and GloVe to classify accident cause values for the primary cause field using the text in the narratives. The results show that such approaches can both accurately classify accident causes based on report narratives and find important inconsistencies in accident reporting.
QMOct 10, 2018
Patient2Vec: A Personalized Interpretable Deep Representation of the Longitudinal Electronic Health RecordJinghe Zhang, Kamran Kowsari, James H. Harrison et al.
The wide implementation of electronic health record (EHR) systems facilitates the collection of large-scale health data from real clinical settings. Despite the significant increase in adoption of EHR systems, this data remains largely unexplored, but presents a rich data source for knowledge discovery from patient health histories in tasks such as understanding disease correlations and predicting health outcomes. However, the heterogeneity, sparsity, noise, and bias in this data present many complex challenges. This complexity makes it difficult to translate potentially relevant information into machine learning algorithms. In this paper, we propose a computational framework, Patient2Vec, to learn an interpretable deep representation of longitudinal EHR data which is personalized for each patient. To evaluate this approach, we apply it to the prediction of future hospitalizations using real EHR data and compare its predictive performance with baseline methods. Patient2Vec produces a vector space with meaningful structure and it achieves an AUC around 0.799 outperforming baseline methods. In the end, the learned feature importance can be visualized and interpreted at both the individual and population levels to bring clinical insights.
LGAug 23, 2018
An Improvement of Data Classification Using Random Multimodel Deep Learning (RMDL)Mojtaba Heidarysafa, Kamran Kowsari, Donald E. Brown et al.
The exponential growth in the number of complex datasets every year requires more enhancement in machine learning methods to provide robust and accurate data classification. Lately, deep learning approaches have achieved surpassing results in comparison to previous machine learning algorithms. However, finding the suitable structure for these models has been a challenge for researchers. This paper introduces Random Multimodel Deep Learning (RMDL): a new ensemble, deep learning approach for classification. RMDL solves the problem of finding the best deep learning structure and architecture while simultaneously improving robustness and accuracy through ensembles of deep learning architectures. In short, RMDL trains multiple randomly generated models of Deep Neural Network (DNN), Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) in parallel and combines their results to produce better result of any of those models individually. In this paper, we describe RMDL model and compare the results for image and text classification as well as face recognition. We used MNIST and CIFAR-10 datasets as ground truth datasets for image classification and WOS, Reuters, IMDB, and 20newsgroup datasets for text classification. Lastly, we used ORL dataset to compare the model performance on face recognition task.
LGMay 3, 2018
RMDL: Random Multimodel Deep Learning for ClassificationKamran Kowsari, Mojtaba Heidarysafa, Donald E. Brown et al.
The continually increasing number of complex datasets each year necessitates ever improving machine learning methods for robust and accurate categorization of these data. This paper introduces Random Multimodel Deep Learning (RMDL): a new ensemble, deep learning approach for classification. Deep learning models have achieved state-of-the-art results across many domains. RMDL solves the problem of finding the best deep learning structure and architecture while simultaneously improving robustness and accuracy through ensembles of deep learning architectures. RDML can accept as input a variety data to include text, video, images, and symbolic. This paper describes RMDL and shows test results for image and text data including MNIST, CIFAR-10, WOS, Reuters, IMDB, and 20newsgroup. These test results show that RDML produces consistently better performance than standard methods over a broad range of data types and classification problems.
LGSep 26, 2017
FSL-BM: Fuzzy Supervised Learning with Binary Meta-Feature for ClassificationKamran Kowsari, Nima Bari, Roman Vichr et al.
This paper introduces a novel real-time Fuzzy Supervised Learning with Binary Meta-Feature (FSL-BM) for big data classification task. The study of real-time algorithms addresses several major concerns, which are namely: accuracy, memory consumption, and ability to stretch assumptions and time complexity. Attaining a fast computational model providing fuzzy logic and supervised learning is one of the main challenges in the machine learning. In this research paper, we present FSL-BM algorithm as an efficient solution of supervised learning with fuzzy logic processing using binary meta-feature representation using Hamming Distance and Hash function to relax assumptions. While many studies focused on reducing time complexity and increasing accuracy during the last decade, the novel contribution of this proposed solution comes through integration of Hamming Distance, Hash function, binary meta-features, binary classification to provide real time supervised method. Hash Tables (HT) component gives a fast access to existing indices; and therefore, the generation of new indices in a constant time complexity, which supersedes existing fuzzy supervised algorithms with better or comparable results. To summarize, the main contribution of this technique for real-time Fuzzy Supervised Learning is to represent hypothesis through binary input as meta-feature space and creating the Fuzzy Supervised Hash table to train and validate model.
LGSep 24, 2017
HDLTex: Hierarchical Deep Learning for Text ClassificationKamran Kowsari, Donald E. Brown, Mojtaba Heidarysafa et al.
The continually increasing number of documents produced each year necessitates ever improving information processing methods for searching, retrieving, and organizing text. Central to these information processing methods is document classification, which has become an important application for supervised learning. Recently the performance of these traditional classifiers has degraded as the number of documents has increased. This is because along with this growth in the number of documents has come an increase in the number of categories. This paper approaches this problem differently from current document classification methods that view the problem as multi-class classification. Instead we perform hierarchical classification using an approach we call Hierarchical Deep Learning for Text classification (HDLTex). HDLTex employs stacks of deep learning architectures to provide specialized understanding at each level of the document hierarchy.
CVFeb 18, 2016
Weighted Unsupervised Learning for 3D Object DetectionKamran Kowsari, Manal H. Alassaf
This paper introduces a novel weighted unsupervised learning for object detection using an RGB-D camera. This technique is feasible for detecting the moving objects in the noisy environments that are captured by an RGB-D camera. The main contribution of this paper is a real-time algorithm for detecting each object using weighted clustering as a separate cluster. In a preprocessing step, the algorithm calculates the pose 3D position X, Y, Z and RGB color of each data point and then it calculates each data point's normal vector using the point's neighbor. After preprocessing, our algorithm calculates k-weights for each data point; each weight indicates membership. Resulting in clustered objects of the scene.
DBMar 22, 2015
Construction of FuzzyFind Dictionary using Golay Coding Transformation for Searching ApplicationsKamran Kowsari, Maryam Yammahi, Nima Bari et al.
Searching through a large volume of data is very critical for companies, scientists, and searching engines applications due to time complexity and memory complexity. In this paper, a new technique of generating FuzzyFind Dictionary for text mining was introduced. We simply mapped the 23 bits of the English alphabet into a FuzzyFind Dictionary or more than 23 bits by using more FuzzyFind Dictionary, and reflecting the presence or absence of particular letters. This representation preserves closeness of word distortions in terms of closeness of the created binary vectors within Hamming distance of 2 deviations. This paper talks about the Golay Coding Transformation Hash Table and how it can be used on a FuzzyFind Dictionary as a new technology for using in searching through big data. This method is introduced by linear time complexity for generating the dictionary and constant time complexity to access the data and update by new data sets, also updating for new data sets is linear time depends on new data points. This technique is based on searching only for letters of English that each segment has 23 bits, and also we have more than 23-bit and also it could work with more segments as reference table.
DBMar 1, 2015
Novel Metaknowledge-based Processing Technique for Multimedia Big Data clustering challengesNima Bari, Roman Vichr, Kamran Kowsari et al.
Past research has challenged us with the task of showing relational patterns between text-based data and then clustering for predictive analysis using Golay Code technique. We focus on a novel approach to extract metaknowledge in multimedia datasets. Our collaboration has been an on-going task of studying the relational patterns between datapoints based on metafeatures extracted from metaknowledge in multimedia datasets. Those selected are significant to suit the mining technique we applied, Golay Code algorithm. In this research paper we summarize findings in optimization of metaknowledge representation for 23-bit representation of structured and unstructured multimedia data in order to
DBMar 1, 2015
23-bit Metaknowledge Template Towards Big Data Knowledge Discovery and ManagementNima Bari, Roman Vichr, Kamran Kowsari et al.
The global influence of Big Data is not only growing but seemingly endless. The trend is leaning towards knowledge that is attained easily and quickly from massive pools of Big Data. Today we are living in the technological world that Dr. Usama Fayyad and his distinguished research fellows discussed in the introductory explanations of Knowledge Discovery in Databases (KDD) predicted nearly two decades ago. Indeed, they were precise in their outlook on Big Data analytics. In fact, the continued improvement of the interoperability of machine learning, statistics, database building and querying fused to create this increasingly popular science- Data Mining and Knowledge Discovery. The next generation computational theories are geared towards helping to extract insightful knowledge from even larger volumes of data at higher rates of speed. As the trend increases in popularity, the need for a highly adaptive solution for knowledge discovery will be necessary. In this research paper, we are introducing the investigation and development of 23 bit-questions for a Metaknowledge template for Big Data Processing and clustering purposes. This research aims to demonstrate the construction of this methodology and proves the validity and the beneficial utilization that brings Knowledge Discovery from Big Data.
LGDec 19, 2013
Comparison three methods of clustering: k-means, spectral clustering and hierarchical clusteringKamran Kowsari
Comparison of three kind of the clustering and find cost function and loss function and calculate them. Error rate of the clustering methods and how to calculate the error percentage always be one on the important factor for evaluating the clustering methods, so this paper introduce one way to calculate the error rate of clustering methods. Clustering algorithms can be divided into several categories including partitioning clustering algorithms, hierarchical algorithms and density based algorithms. Generally speaking we should compare clustering algorithms by Scalability, Ability to work with different attribute, Clusters formed by conventional, Having minimal knowledge of the computer to recognize the input parameters, Classes for dealing with noise and extra deposition that same error rate for clustering a new data, Thus, there is no effect on the input data, different dimensions of high levels, K-means is one of the simplest approach to clustering that clustering is an unsupervised problem.