Praneeth Vadlapati

CL · Sep 27, 2024 · Code

LML-DAP: Language Model Learning a Dataset for Data-Augmented Prediction

Praneeth Vadlapati

Classification tasks are typically handled by Machine Learning (ML) models, which often fail to balance accuracy with interpretability. This paper introduces an explainable approach to classification using Large Language Models (LLMs). Unlike ML models, which rely heavily on data cleaning and feature engineering, this approach streamlines the process by using LLMs. The paper proposes "Language Model Learning (LML)," powered by a new method called "Data-Augmented Prediction (DAP)." The LLM classifies in much the same way as a human who manually explores and understands the data before deciding on a label. In the LML process, a dataset is summarized and evaluated to determine which features contribute most to each label. In the DAP process, the system uses the data summary and a row of the testing dataset to automatically generate a query that retrieves relevant rows from the dataset for context-aware classification. LML and DAP unlock new possibilities in areas that require explainable and context-aware decisions by ensuring satisfactory accuracy even with complex data. The system scored an accuracy above 90% in some test cases, confirming the effectiveness of the system and its potential to outperform ML models in various scenarios. The source code is available at https://github.com/Pro-GenAI/LML-DAP
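The DAP flow described in this abstract (data summary plus a test row, retrieval of relevant rows, then context-aware classification) can be illustrated with a minimal sketch. This is not the code from the linked repository. The function names `retrieve_similar_rows`, `build_prompt`, and `classify` are hypothetical, the LLM is passed in as a plain callable, and a simple exact-match overlap score stands in for the automatically generated retrieval query that the abstract mentions.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def retrieve_similar_rows(train_rows: List[Row], test_row: Row, k: int = 3) -> List[Row]:
    """Return the k labeled rows sharing the most feature values with test_row.

    Assumption: exact-match overlap is a stand-in for the LLM-generated
    query described in the abstract.
    """
    def overlap(row: Row) -> int:
        return sum(1 for key, value in test_row.items() if row.get(key) == value)

    # sorted() is stable, so ties keep their original dataset order.
    return sorted(train_rows, key=overlap, reverse=True)[:k]


def build_prompt(summary: str, context_rows: List[Row], test_row: Row) -> str:
    """Combine the dataset summary, retrieved rows, and the row to classify."""
    lines = ["Dataset summary:", summary, "", "Relevant labeled rows:"]
    lines += [str(row) for row in context_rows]
    lines += [
        "",
        "Row to classify:",
        str(test_row),
        "Answer with the label and a short explanation.",
    ]
    return "\n".join(lines)


def classify(
    summary: str,
    train_rows: List[Row],
    test_row: Row,
    llm: Callable[[str], str],
    k: int = 3,
) -> str:
    """Retrieve context rows, build the prompt, and ask the LLM for a label."""
    context_rows = retrieve_similar_rows(train_rows, test_row, k)
    return llm(build_prompt(summary, context_rows, test_row))
```

Any function that maps a prompt string to a response string can be passed as `llm`, so the sketch can be exercised with a stub before a real model client is wired in.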

CL · Jun 27, 2024 · Code

AutoPureData: Automated Filtering of Undesirable Web Data to Update LLM Knowledge

Praneeth Vadlapati

Up-to-date and reliable language models are consistently sought after and are essential in various applications. Typically, models are trained on a fixed dataset and then deployed globally. However, the knowledge of the models becomes outdated. Automatically updating AI knowledge with web data raises significant concerns about the model's safety and quality, because unsafe and undesirable text is widespread across the web. The purity of new data is therefore essential when updating the knowledge of language models if they are to remain reliable. This paper proposes AutoPureData, a system that automatically collects and purifies web data. In the experiment, the system loaded a sample of web data. Utilizing existing trusted AI models, it eliminated unsafe text with an accuracy of 97% and undesirable text with an accuracy of 86%, demonstrating the system's effectiveness in purifying the data. The system ensures that only meaningful and safe text can be used to update LLM knowledge. The pure text was then optimized and stored in a vector database for future querying. The LLM was found to be able to fetch new data from the vector database. The LLM writes the RAG query in English, even if the user's query is in another language, showing that the system can perform cross-lingual retrieval. This paper proposes a method to maintain the accuracy and relevance of up-to-date language models by ensuring that only purified data is used to update LLM knowledge. This work contributes to updating the knowledge of chatbots using meaningful and safe text, enhancing their utility across various industries, and potentially reducing the risks associated with outputs caused by unsafe or impure data. Code is available at github.com/Pro-GenAI/AutoPureData.
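The purification step described in this abstract (drop unsafe text, then drop undesirable text, keep the rest for storage) can be illustrated with a minimal sketch. This is not the code from the linked repository. The name `purify` and the two toy checkers are hypothetical; the keyword and length rules stand in for the trusted AI models the abstract refers to, and the vector database step is left out.

```python
from typing import Callable, Iterable, List, Tuple


def purify(
    texts: Iterable[str],
    is_unsafe: Callable[[str], bool],
    is_undesirable: Callable[[str], bool],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split texts into kept items and (text, reason) rejections.

    The safety check runs first, so a text that fails both checks
    is reported as "unsafe".
    """
    kept: List[str] = []
    rejected: List[Tuple[str, str]] = []
    for text in texts:
        if is_unsafe(text):
            rejected.append((text, "unsafe"))
        elif is_undesirable(text):
            rejected.append((text, "undesirable"))
        else:
            kept.append(text)
    return kept, rejected


# Hypothetical stand-ins for model-based classifiers (assumptions, not the
# checks used in the paper).
def toy_is_unsafe(text: str) -> bool:
    """Flag text containing a placeholder unsafe keyword."""
    return "attack" in text.lower()


def toy_is_undesirable(text: str) -> bool:
    """Flag very short text as low-value content."""
    return len(text.split()) < 3
```

Keeping the rejection reason next to each dropped text makes it possible to measure the two filters separately, in the spirit of the separate accuracy figures reported for unsafe and undesirable text.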
