Building, Deploying and Monitoring Large Language Models with Jinen Setpal


In this live episode, I’m speaking with Jinen Setpal, ML Engineer at DagsHub, about actually building, deploying, and monitoring large language model applications. We discuss DPT, a chatbot that runs in production on the DagsHub Discord server to help answer support questions, and the process and challenges involved in building it. We dive into evaluation methods, ways to reduce hallucinations, and much more. We also answer the audience’s great questions.

Highlights & Summary

In this episode of the MLOps podcast, the host, Dean, speaks with Jinen Setpal, a machine learning engineer at DagsHub. They discuss applications of large language models (LLMs), the challenges of working with them, such as hallucinations, the development stack and tools for building LLM applications, and monitoring LLMs in production. This post summarizes the key points from the episode.

Introduction to LLMs

Jinen explains that LLMs have been around for quite some time, but OpenAI’s release of ChatGPT marked the moment they became more than a theoretical concept: their utility and intelligence, along with their potential for real-world applications, became evident.

DPT – DagsHub’s Documentation Chatbot

Jinen discusses DPT, DagsHub’s documentation chatbot. DPT uses GPT-3.5 Turbo to answer user queries, grounding its responses in semantic search over DagsHub’s documentation, and relies on prompt engineering and domain adaptation to generate accurate and helpful answers.
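To make the moving parts concrete, here is a minimal sketch of that retrieve-then-prompt pattern using the OpenAI Python client. The toy corpus, helper functions, and prompt wording are illustrative assumptions, not DPT’s actual implementation:

```python
# Minimal sketch of the retrieve-then-prompt pattern described above.
# Assumes the OpenAI Python client (pip install openai numpy); the corpus,
# helper names, and prompt wording are illustrative, not DPT's code.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Toy documentation corpus; DPT indexes DagsHub's real docs instead.
docs = [
    "DagsHub hosts Git and DVC repositories for ML projects.",
    "You can log experiments to DagsHub with MLflow.",
]

def embed(texts):
    """Embed a list of strings with OpenAI's embedding endpoint."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vecs = embed(docs)

def answer(question):
    # Semantic search: cosine similarity between question and doc embeddings.
    q = embed([question])[0]
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    context = docs[int(np.argmax(sims))]

    # Prompt engineering: constrain the model to the retrieved context.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer using only the documentation excerpt provided."},
            {"role": "user",
             "content": f"Documentation:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("How do I track experiments on DagsHub?"))
```

Grounding the prompt in retrieved documentation is also the first line of defense against hallucinations, which comes up later in the conversation.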

Evaluating LLMs

Evaluating LLMs is challenging, and every existing metric has its limitations. Dean and Jinen discuss the trade-off between automated metrics and human evaluators: the latter are more reliable but also more expensive. They also touch on biases in evaluation and the need for careful annotation and quality control.
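As a concrete illustration of why automated metrics alone can mislead, the snippet below scores two candidate answers against a reference with ROUGE-L, using the rouge-score package; the example strings are invented for demonstration:

```python
# Illustration of an automated-metric failure mode: a fluent but wrong
# answer can outscore a correct paraphrase. Uses the rouge-score package
# (pip install rouge-score); the example strings are invented.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

reference = "DagsHub supports experiment tracking through MLflow."
correct = "You can track experiments on DagsHub using MLflow."
hallucinated = "DagsHub supports experiment tracking through TensorBoard."

for name, candidate in [("correct", correct), ("hallucinated", hallucinated)]:
    score = scorer.score(reference, candidate)["rougeL"].fmeasure
    print(f"{name}: ROUGE-L F1 = {score:.2f}")

# The hallucinated answer overlaps word-for-word with the reference, so it
# scores far higher than the correct paraphrase -- one reason human review
# remains the more reliable (if costlier) evaluation signal.
```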

They also highlight the importance of interpretability in LLM research, since it can surface biases and provide insight into model behavior. Jinen suggests that future metrics for LLMs should be grounded in intrinsic interpretability, which would allow unbiased estimates of model performance. Finally, they note the potential privacy and security concerns around LLMs: privacy-preserving techniques are still a work in progress, and fine-tuning and prompt engineering are currently the most common ways to work around LLM limitations.

They close the discussion by emphasizing that evaluating LLMs remains an open problem and that ongoing research and development is needed to improve metrics and to address privacy and security concerns.

Challenges of Hallucinations

Hallucinations occur when an LLM generates responses that are confident but inaccurate. Jinen attributes this to misaligned incentives: the models are trained to produce responses that sound plausible rather than responses that are strictly accurate. Domain adaptation and prompt engineering can mitigate hallucinations, but the problem remains open.
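One common prompt-engineering mitigation is to realign those incentives in the instructions themselves: explicitly tell the model that declining to answer is preferable to guessing. The sketch below shows a generic guarded prompt, reusing the chat-message format from the earlier example; it is not DPT’s actual prompt:

```python
# A common prompt-engineering pattern for reducing hallucinations:
# explicitly permit -- and instruct -- the model to decline when the
# retrieved context is insufficient. Generic sketch, not DPT's prompt.
GUARDED_SYSTEM_PROMPT = """\
You are a documentation assistant. Answer strictly from the excerpts
provided below. If the excerpts do not contain the answer, reply
"I don't know -- please ask in the support channel." Do not guess,
and do not invent features, commands, or URLs.
"""

def build_messages(context: str, question: str) -> list[dict]:
    """Assemble a chat payload that anchors the model to retrieved docs."""
    return [
        {"role": "system", "content": GUARDED_SYSTEM_PROMPT},
        {"role": "user",
         "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
    ]
```

Instructions like these shift the model’s incentive from sounding plausible toward admitting uncertainty, though they offer no hard guarantee.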

Privacy and Security

Privacy and security are aspects of LLMs that need to be taken into account. OpenAI and other providers have measures in place to protect user data, but self-hosting a model gives teams greater control over privacy and security. That control comes at a cost: self-hosting involves a trade-off between performance, scalability, and security.
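For teams that choose the self-hosting route, the entry point can be as simple as running an open-weights model with Hugging Face’s transformers library, so prompts and responses never leave their own infrastructure. The model name below is an illustrative assumption, not a recommendation from the episode:

```python
# Minimal sketch of the self-hosting option: run an open-weights model
# locally so prompts and responses never leave your infrastructure.
# Assumes the transformers library; the model name is illustrative.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b-instruct",  # any open instruction-tuned model
    device_map="auto",                  # place weights on available GPUs
)

out = generator(
    "How do I version a dataset with DVC?",
    max_new_tokens=128,
    do_sample=False,
)
print(out[0]["generated_text"])
```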

Monitoring LLMs in Production

Monitoring LLMs in production involves infrastructure management, scalability, and performance monitoring. Tools like Terraform, AWS Auto Scaling Groups, and ECS services can help streamline the monitoring process. However, evaluating and monitoring model performance is more complex and often requires manual intervention and human evaluation.
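The infrastructure half of that monitoring is well served by standard cloud tooling. As one hedged example, the snippet below pulls CPU utilization for an ECS service from CloudWatch with boto3; the cluster and service names are placeholders, and model-quality monitoring would still sit on top of this:

```python
# Sketch of the infrastructure side of monitoring: pull CPU utilization
# for an ECS service from CloudWatch with boto3 (pip install boto3).
# The cluster and service names below are placeholders.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "my-cluster"},   # placeholder
        {"Name": "ServiceName", "Value": "dpt-service"},  # placeholder
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,            # 5-minute buckets
    Statistics=["Average"],
)

# Print the last hour of datapoints in chronological order.
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'{point["Average"]:.1f}%')
```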

Conclusion

Large language models like ChatGPT have transformed the field of natural language processing and have vast potential across applications. However, challenges such as hallucinations and privacy and security concerns remain open problems. Monitoring LLMs in production requires a combination of infrastructure management, scalability work, and manual evaluation to ensure accurate and reliable results. As LLMs continue to evolve, advances in interpretability and privacy-preserving techniques will shape their future use and impact.

 
