IRFeb 7, 2024
Using text embedding models as text classifiers with medical dataRishabh Goel
The advent of Large Language Models (LLMs) is promising and LLMs have been applied to numerous fields. However, it is not trivial to implement LLMs in the medical field, due to the high standards for precision and accuracy. Currently, the diagnosis of medical ailments must be done by hand, as it is costly to build a sufficiently broad LLM that can diagnose a wide range of diseases. Here, we explore the use of vector databases and embedding models as a means of encoding and classifying text with medical text data without the need to train a new model altogether. We used various LLMs to generate the medical data, then encoded the data with a text embedding model and stored it in a vector database. We hypothesized that higher embedding dimensions coupled with descriptive data in the vector database would lead to better classifications and designed a robustness test to test our hypothesis. By using vector databases and text embedding models to classify a clinician's notes on a patient presenting with a certain ailment, we showed that these tools can be successful at classifying medical text data. We found that a higher embedding dimension did indeed yield better results, however, querying with simple data in the database was optimal for performance. We have shown in this study the applicability of text embedding models and vector databases on a small scale, and our work lays the groundwork for applying these tools on a larger scale.
LGMay 6, 2024
Transformer models as an efficient replacement for statistical test suites to evaluate the quality of random numbersRishabh Goel, YiZi Xiao, Ramin Ramezani
Random numbers are incredibly important in a variety of fields, and the need for their validation remains important for safety. A Quantum Random Number Generator (QRNG) can theoretically generate truly random numbers, however their quality still needs to be thoroughly validated. Generally, the task of validating random numbers has been delegated to different statistical tests such as the tests from the NIST Statistical Test Suite (STS), which are often slow and only perform one test at a time. Our work presents a deep learning model utilizing the Transformer architecture that 1) performs multiple NIST STS tests at once, and 2) runs much faster. This model outputs multi-label classification results on passing these statistical tests. We performed a thorough hyper-parameter optimization to converge on the best possible model and as a result, achieved a high degree of accuracy with a Macro F1-score of above 0.96. We also compared this model to a conventional deep learning method (Long Short Term Memory Recurrent Neural Networks) to quantify randomness and showed our model achieved similar performances while being much more efficient and scalable. The high performance and efficiency of this Transformer-based deep learning model showed that it can be a viable replacement for the NIST STS for validating random numbers.