SE May 27
Towards Understanding Barriers and Mitigation Strategies of Software Engineers with Non-traditional Educational and Occupational Backgrounds
Tavian Barnes, Ken Jen Lee, Cristina Tavares et al.
The traditional path to a software engineering career usually involves a post-secondary diploma in Software Engineering, Computer Science, or a related field. However, many individuals working as software engineers take a non-traditional path to their careers, starting from other industries or fields of study. This paper explores the barriers that individuals with non-traditional educational and occupational backgrounds face when pursuing a software engineering career and proposes potential strategies to overcome those barriers. A two-stage methodology was used, consisting of an exploratory study followed by a follow-up survey. The exploratory study consisted of a grounded-theory-based qualitative analysis of relevant Reddit data to yield a framework around the barriers and possible mitigation strategies. These findings were then supplemented through a follow-up survey. Understanding these barriers and what strategies could be effective is an important step towards making software engineering more accessible to individuals with non-traditional backgrounds. In addition to fostering functional diversity, this might also serve to tackle labor shortages within the software engineering industry.
AI Nov 20, 2023
GPT in Data Science: A Practical Exploration of Model Selection
Nathalia Nascimento, Cristina Tavares, Paulo Alencar et al.
There is increasing interest in leveraging Large Language Models (LLMs) for managing structured data and enhancing data science processes. Despite the potential benefits, this integration raises significant questions about the models' reliability and decision-making methodologies. It also highlights the importance of various factors in the model selection process, including the nature of the data, problem type, performance metrics, computational resources, interpretability vs. accuracy, assumptions about data, and ethical considerations. Our objective is to elucidate and express the factors and assumptions guiding GPT-4's model selection recommendations. We employ a variability model to depict these factors and use toy datasets to evaluate both the model and the implementation of the identified heuristics. By contrasting these outcomes with heuristics from other platforms, we aim to determine the effectiveness and distinctiveness of GPT-4's methodology. This research is committed to advancing our comprehension of AI decision-making processes, especially in the realm of model selection within data science. Our efforts are directed towards creating AI systems that are more transparent and comprehensible, contributing to a more responsible and efficient practice in data science.
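To make the idea of explicit model selection heuristics concrete, the following is a minimal sketch of how a few of the factors named above (problem type, sample size, interpretability) could be encoded as rules. The `Task` type, the `recommend` function, the model family names, and the 1000-sample threshold are all hypothetical illustrations; they are not the heuristics reported in the paper or those used by GPT-4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical description of a data science task."""
    problem_type: str            # "classification" or "regression"
    n_samples: int               # number of rows in the dataset
    needs_interpretability: bool


def recommend(task: Task) -> str:
    """Return a model family name from hand-written, illustrative rules."""
    if task.problem_type not in {"classification", "regression"}:
        raise ValueError(f"unsupported problem type: {task.problem_type}")
    is_classification = task.problem_type == "classification"

    # Rule 1: prefer simple linear models when interpretability matters.
    if task.needs_interpretability:
        return "logistic_regression" if is_classification else "linear_regression"

    # Rule 2: for small datasets, prefer lower-capacity models.
    # The threshold is an arbitrary assumption for this sketch.
    if task.n_samples < 1000:
        return "svm" if is_classification else "ridge_regression"

    # Rule 3: otherwise fall back to a tree ensemble.
    return "gradient_boosting"
```

Writing the rules as plain conditionals keeps each assumption visible and easy to compare against heuristics from another source.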
LG Nov 23, 2023
Extending Variability-Aware Model Selection with Bias Detection in Machine Learning Projects
Cristina Tavares, Nathalia Nascimento, Paulo Alencar et al.
Data science projects often involve various machine learning (ML) methods that depend on data, code, and models. One of the key activities in these projects is the selection of a model or algorithm that is appropriate for the data analysis at hand. ML model selection depends on several factors, which include data-related attributes such as sample size, functional requirements such as the prediction algorithm type, and non-functional requirements such as performance and bias. However, the factors that influence such selection are often neither well understood nor explicitly represented. This paper describes ongoing work on extending an adaptive variability-aware model selection method with bias detection in ML projects. The method involves: (i) modeling the variability of the factors that affect model selection using feature models based on heuristics proposed in the literature; (ii) instantiating our variability model with added features related to bias (e.g., bias-related metrics); and (iii) conducting experiments that illustrate the method in a case study based on a heart failure prediction project. The proposed approach aims to advance the state of the art by making explicit the factors that influence model selection, particularly those related to bias, as well as their interactions. The provided representations can transform model selection in ML projects into a non-ad hoc, adaptive, and explainable process.
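As a rough illustration of how a bias-related metric could act as an explicit factor in model selection, the sketch below computes the statistical parity difference for each candidate's predictions and then picks the most accurate candidate whose bias stays within a tolerance. The function names, the candidate data layout, and the 0.1 tolerance are assumptions made for this example; the paper's feature models and chosen metrics may differ.

```python
def statistical_parity_difference(preds, groups, privileged):
    """P(pred = 1 | unprivileged) - P(pred = 1 | privileged) for 0/1 predictions."""
    priv = [p for p, g in zip(preds, groups) if g == privileged]
    unpriv = [p for p, g in zip(preds, groups) if g != privileged]
    if not priv or not unpriv:
        raise ValueError("both groups must be non-empty")
    return sum(unpriv) / len(unpriv) - sum(priv) / len(priv)


def select_model(candidates, groups, privileged, max_abs_spd=0.1):
    """Pick the most accurate candidate whose |SPD| is within the tolerance.

    candidates maps a model name to (accuracy, predictions).
    Returns None when no candidate satisfies the bias constraint.
    """
    admissible = {
        name: accuracy
        for name, (accuracy, preds) in candidates.items()
        if abs(statistical_parity_difference(preds, groups, privileged)) <= max_abs_spd
    }
    if not admissible:
        return None
    return max(admissible, key=admissible.get)


# Hypothetical toy data: two candidates evaluated on four samples.
groups = ["a", "a", "b", "b"]
candidates = {
    "model_x": (0.9, [1, 1, 0, 0]),  # favors group "a" entirely
    "model_y": (0.8, [1, 0, 1, 0]),  # equal positive rate in both groups
}
chosen = select_model(candidates, groups, privileged="a")
```

Here the more accurate `model_x` is excluded because its positive predictions go only to the privileged group, so the bias constraint changes which model is selected.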
CL Apr 3, 2024
Assessing ML Classification Algorithms and NLP Techniques for Depression Detection: An Experimental Case Study
Giuliano Lorenzoni, Cristina Tavares, Nathalia Nascimento et al.
Depression has affected millions of people worldwide and has become one of the most common mental disorders. Early mental disorder detection can reduce costs for public health agencies and prevent other major comorbidities. Additionally, the shortage of specialized personnel is very concerning, since depression diagnosis is highly dependent on expert professionals and is time-consuming. Recent research has shown that machine learning (ML) and Natural Language Processing (NLP) tools and techniques have significantly benefited the diagnosis of depression. However, there are still several challenges in the assessment of depression detection approaches in which other conditions such as post-traumatic stress disorder (PTSD) are present. These challenges include assessing alternatives in terms of data cleaning and pre-processing techniques, feature selection, and appropriate ML classification algorithms. This paper tackles such an assessment based on a case study that compares different ML classifiers, specifically in terms of data cleaning and pre-processing, feature selection, parameter setting, and model choices. The case study is based on the Distress Analysis Interview Corpus - Wizard-of-Oz (DAIC-WOZ) dataset, which is designed to support the diagnosis of mental disorders such as depression, anxiety, and PTSD. Besides assessing alternative techniques, we were able to build Random Forest and XGBoost models with accuracy levels around 84%, which is significantly higher than the 72% accuracy reported for an SVM model in the comparable literature.