Scientific Research Associate Pumas-AI, Inc. Bogota, Distrito Capital de Bogota, Colombia
Disclosure(s):
Juan J. Gonzalez: No financial relationships to disclose
The motivation for this work is to choose a Large Language Model (LLM) to build an AI assistant for scientists at Pumas-AI [1]. Focused on chat completion models, this study investigates the effectiveness of various LLMs in answering pharmacometrics-related inquiries. Beyond conventional choices such as GPT and Gemini [2], this research explores a broader spectrum of LLM options [3]. This approach not only enables the evaluation of diverse LLM types but also establishes a framework for assessing other model classes, such as code completion and embedding models [4].
Objectives: 1. Explore the LLM landscape: By examining various LLMs, including those specifically trained for different tasks, this study seeks to identify top-performing models that align with the project goals.
2. Ensure compliance and data privacy: By considering different LLMs, the project aims to allow the use of sensitive information, thereby expanding its potential scope and usefulness.
Methods: 1. Exploring LLMs: An initial exploration phase was conducted to compile a comprehensive list of potential models for evaluation.
2. Establishment of a Prompt Repository: A curated collection of prompts representing common pharmacometrics-related questions and queries was assembled, serving as a standardized test for comparison.
3. Benchmarking LLMs
3.1. Automated Model Evaluation using RAGAS (Retrieval Augmented Generation Assessment) [5]: RAGAS provided scores for answer correctness and faithfulness, assessing how close each LLM's output was to ground truth established beforehand by human experts.
3.2. Grading Output with Assistance from Another LLM: An external LLM was used to provide a quantitative metric and grade for each LLM's outputs, based on a rubric of relevant factors created by human experts.
3.3. Human Vetting: Manual evaluation by human experts to assess the relevance and accuracy of LLM outputs.
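The benchmarking loop in steps 3.1 and 3.2 can be sketched in a few lines of Python. This is a simplified illustration, not the actual evaluation code: real RAGAS metrics involve LLM and embedding calls, whereas here "answer correctness" is approximated by token-overlap F1 against the expert ground truth. The prompt, model names, and answers below are hypothetical.

```python
def token_f1(answer: str, ground_truth: str) -> float:
    """Token-overlap F1: a crude stand-in for a RAGAS-style correctness score."""
    pred = answer.lower().split()
    truth = ground_truth.lower().split()
    common = set(pred) & set(truth)
    if not common:
        return 0.0
    precision = len(common) / len(pred)
    recall = len(common) / len(truth)
    return 2 * precision * recall / (precision + recall)

# Hypothetical entry from the prompt repository (Methods step 2),
# with ground truth set beforehand by a human expert.
repository = [
    {
        "prompt": "What does clearance describe in pharmacokinetics?",
        "ground_truth": "Clearance describes the volume of plasma cleared of drug per unit time.",
    },
]

# Hypothetical outputs from two candidate models being compared.
model_outputs = {
    "model_a": "Clearance is the volume of plasma cleared of drug per unit time.",
    "model_b": "It relates to how the body removes a drug.",
}

# Score every model against the ground truth and rank them.
scores = {
    name: token_f1(answer, repository[0]["ground_truth"])
    for name, answer in model_outputs.items()
}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

In the study itself this automated score is only one signal; it is combined with rubric-based grading by an external LLM and final human vetting before a model is shortlisted.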
Results and Conclusions: This research identifies top-performing LLMs for an AI chat assistant for Pumas-AI. Through evaluation with RAGAS, grading with an external LLM, and human vetting, Mistral, GPT-4, and Claude emerged as models with strong potential. These LLMs are prospective candidates for integration into an AI assistant, enhancing Pumas-AI scientists' work by providing a reliable tool for decision-making and inquiries.
Citations: [1] E. Waisberg et al., “GPT-4: a new era of artificial intelligence in medicine,” Ir. J. Med. Sci., vol. 192, no. 6, pp. 3197–3200, 2023, doi: 10.1007/s11845-023-03377-8.
[2] E. Shin, Y. Yu, R. R. Bies, and M. Ramanathan, “Evaluation of ChatGPT and Gemini large language models for pharmacometrics with NONMEM,” J. Pharmacokinet. Pharmacodyn., Apr. 2024, doi: 10.1007/s10928-024-09921-y.
[4] C. Wang, J. Ong, C. Wang, H. Ong, R. Cheng, and D. Ong, “Potential for GPT Technology to Optimize Future Clinical Decision-Making Using Retrieval-Augmented Generation,” Ann. Biomed. Eng., Aug. 2023, doi: 10.1007/s10439-023-03327-6.
[5] “explodinggradients/ragas.” Exploding Gradients, May 13, 2024. Accessed: May 13, 2024. [Online]. Available: https://github.com/explodinggradients/ragas