Udemy
    •  
    •  
    •  
    •  
    •  
    •  
    •  
    •  
Turn what you know into an opportunity and reach millions around the world.
Learn More
Your cart is empty.
Keep shopping
Production LLM Deployment: vLLM,FastAPI,Modal and AI Chatbot
Rating: 4.1 out of 5(23 ratings)
297 students

Production LLM Deployment: vLLM,FastAPI,Modal and AI Chatbot

Production Grade LLM deployment and High-Load Inferencing with vLLm, Chatbots with Memory, Local Cache of Model Weights
Created byPetar Petkanov
Last updated 3/2025
English

What you'll learn

  • Master volume mapping to efficiently manage model storage, cut redundant data retrieval, optimize weight storage, and speed up access by using local storage str
  • Master deploying AI models with vLLM, handle thousands of requests, and design modular architectures for efficient model downloading and inference
  • Create a conversational AI chatbot using Python, integrating OpenAI's API for seamless, real-time chats with deployed language models
  • Use FastAPI and vLLM to build efficient, OpenAI-compatible APIs. Deploy REST API endpoints in containers for seamless AI model interactions with external apps
  • Use concurrency and synchronization for model management, ensuring high availability. Optimize GPU use to efficiently handle many parallel inference requests
  • Design scalable systems with efficient scaling via local model weights and storage. Secure apps using advanced authentication and token-based access control
  • Execute GPU or CPU intensive functions of your locally running application on a Modal powerful remote infrastructure
  • Deploy AI Models with a single command to run on a remote infrastructure defined in your application code
  • Implement Web APIs: Transform Python functions to web services using FastAPI in Modal, integrating with multi-language applications effectively

Course content

3 sections20 lectures5h 28m total length
  • Course Repository
  • Start Strong with Modal: Environment, Installation, and API Setup11:33

    First lesson introduces the Modal platform, which simplifies deploying and scaling machine learning models by automating infrastructure management, scaling, and cost optimization. It compares Modal to traditional platforms like AWS and emphasizes its serverless, pay-per-use approach. The lesson walks through setting up Modal on a local machine, including environment preparation, package installation, and authentication.

    Lesson Plan:

    1. Introduction to Modal:  Understand how Modal automates infrastructure, scaling, and deployment for machine learning workloads. Learn the benefits of Modal's serverless architecture.

    2. Conda Environment Setup: Create and activate conda environments for flexible Python version management.

    3. Modal Installation: Install the Modal Python package and manage dependencies using `requirements.txt`.

    4. Authentication and API Tokens: Authenticate Modal with a browser-based token setup for secure cloud access.

    5. Modal Console Overview: Navigate the Modal cloud panel to manage deployed applications, containers, and infrastructure.

    6. Cost Optimization:  Explore Modal’s pricing model and learn to utilize high-end GPUs efficiently

    7. Next Steps in Development: Prepare for coding with Modal APIs, decorators, lifecycle management, and entry points.

    This lesson lays the foundation for deploying applications seamlessly with Modal while avoiding manual infrastructure management.

  • Basics of Python Scripts in Modal: From Local Testing to Remote Deployment16:54

    In Lesson 2, the focus shifts from exploring the platform’s UI to writing and executing a simple script using the Modal platform. The lesson explains how to connect the environment, set up dependencies, and run Python code in both local and remote Modal environments. It emphasizes the importance of understanding Modal’s behaviour for deploying and executing serverless functions, especially how Modal treats Python methods and bytecode during remote execution.

    Lesson Plan:

    • Connecting to the Environment:  How to set up and choose the correct Python environment in Modal. Linking dependencies installed via  pip  to the Modal notebook environment.

    • Application and Function Naming:  The significance of naming applications in Modal as a namespace for resources. How application and method names affect execution and visibility.

    • Function Registration:  Using Python decorators `@app.function` to mark functions for Modal deployment.  Differentiating between locally and remotely executed functions.

    • Dependency Management:  Ensuring all dependencies are defined within the marked function, as only this bytecode gets deployed to Modal’s serverless environment.

    • Entry Points and Local Execution:  Defining an entry point function with `@app.local_entrypoint()` for local testing.  The difference between local and remote execution in Modal.

    • Running Scripts:  Executing scripts with `modal run <script>` instead of `python <script>`.  Providing input parameters via command-line arguments for Modal execution.

    • Remote Execution and Monitoring:  Executing functions remotely in Modal’s cloud infrastructure. Viewing logs, metrics, and container details in Modal’s developer panel.

    • Ephemeral Applications:  Introduction to ephemeral apps, their lifecycle

    • Advanced Script Execution: Specifying methods to execute remotely using `<script_name>::method_name>` format.  Sending multiple input variables to functions.

    • Platform Features:  Monitoring live and stopped apps, viewing logs, container lifecycle, and execution metrics.  Understanding how Modal tracks function calls and resource usage.

  • Ephemeral Apps in Modal: Deployment, Invocation, and Lifecycle Management10:08

    In this lesson, students learn about deploying applications using Modal, a platform for running serverless functions. The lesson transitions from running ephemeral applications locally to deploying them for remote execution. Students gain hands-on experience in setting up a Modal application, deploying it, and invoking its functions from another script or terminal. The lesson emphasizes how ephemeral apps operate, the ease of deployment, and the lazy initialization of functions, where they only activate when invoked. Additionally, cost-conscious deployment strategies are discussed, along with tips for extending container runtime and understanding the impact of continuous calls on pricing.

    Lesson Plan:

    • Ephemeral Applications:  Understanding what ephemeral apps are and how they function.  How ephemeral apps scale down and terminate after execution.

    • Entrypoints and Argument Parsing: Using modal run  and modal run script.py::function  to execute specific functions locally. Parsing arguments for dynamic inputs during execution.

    • Deploying Applications:  Using  modal deploy script.py  to deploy applications for remote execution. Viewing live deployments in the Cloud Panel Application.

    • Lazy Initialization:  How Modal uses lazy initialization for functions, activating them only upon invocation.

    • Calling Deployed Functions: Importing deployed functions in another Python process using modal.Function.from_name. Running functions remotely with .remote()

    • Container Lifecycle Management:  Understanding container states (idle, active) and their runtime limits. Techniques to extend container life using periodic calls, such as cron jobs.

    • Cost Management:  The financial implications of keeping containers alive for extended periods. Strategies to optimize cost-effectiveness when working with GPU-intensive workloads.

  • Deployment Basics: Setting Up Infrastructure and Exploring Local vs. Remote Runs14:45

    This lesson dives into defining basic infrastructure for applications using Modal, showcasing how to work with ephemeral and remote containers. It emphasizes the use of Modal decorators to streamline Docker-like operations, such as specifying images, dependencies, and running commands. The lesson demonstrates how to define a simple location-based function that fetches geolocation and weather data using APIs, both locally and remotely, highlighting the differences between the two environments. Additionally, it explains how Modal manages container creation, execution, and cleanup, as well as how to handle output and entry points for remote runs.

    Lesson Plan:

    • Defining Modal Infrastructure:  Using `@app.function` decorators to configure infrastructure like base images and dependencies.  Specifying images (e.g., `Image.debian_slim`) and automating pip installations.

    • Working with Containers:  Creating containers remotely and understanding how Modal executes and cleans them up.  Observing the differences between local and remote runs.

    • Executing Functions in Modal:  Defining entry points for Modal apps.  Running Python code remotely without manually managing deployments.  Using Modal's logging and context setup to receive execution feedback.

    • APIs for Location and Weather Data: Fetching geolocation data using  http://ip-api.com/json, retrieving weather data with  https://wttr.in  and parsing and formatting API responses for output.

    • Local vs. Remote Execution:  Running functions locally vs. running in Modal's distributed cloud.  How Modal assigns execution locations dynamically across global edge centers. Observing and debugging execution logs and container states via the Modal dashboard.

    • Deploying and Testing Code:  Saving and testing changes to avoid runtime errors.  Running Modal apps without explicitly defining entry points.  Programmatically running Modal apps within Jupyter notebooks.

    • Preparation for Web APIs:   Laying the groundwork for deploying API endpoints.  Brief introduction to integrating large language models and serving REST APIs.

  • Building Web APIs with Modal: FastAPI Integration and REST Endpoint Deployment20:39

    Lesson 5 discusses the use of the Modal platform to expose Web API endpoints, transforming Python functions into accessible services for other applications. This capability is beneficial when integrating with applications written in other languages, such as JavaScript or Java, which do not natively integrate with Modal Python Function. The lesson uses FastAPI underneath the Modal framework to facilitate the creation of these endpoints.

    Lesson Plan:

    • Endpoint Exposure:  Learn how to transform Python functions into web-accessible endpoints using the Modal platform.

    • Integration with FastAPI:  Understand how Modal uses FastAPI to implement endpoint serving, and why it’s necessary to include FastAPI in the container environment.

    • Temporary vs Permanent Deployment:  Grasp the difference between temporary (ephemeral) and permanent deployments using `modal serve` and `modal deploy`.

    • Interactive Documentation:  Experience how to utilize OpenAPI and Swagger for generating interactive API documentation, which aids in understanding and debugging APIs.

    • Container Management:  Learn how containers are managed in Modal, including how they are created and removed based on function calls and their ephemeral nature.

    • Performance Considerations:  Address the challenge of running resource-intensive code, with strategies to minimize loading times and maintain state across function calls.

  • Class-Based Deployment: Lifecycle Hooks, Resources and Dynamic Management26:11

    Lesson 6 dives into utilizing classes within the Modal platform to improve handling of expensive start-up tasks, such as loading heavy models. This approach leverages the `@modal.enter()` lifecycle hook, which allows specific initialization code to run when a container is started and a class instance is created. This lesson builds upon the concept of using Web API endpoints but focuses on optimizing performance by maintaining application state within a class. By doing so, it extends the life of containers and reduces the need to repeatedly initialize expensive resources.

    Lesson Plan:

    • Lifecycle Hooks:  Understand how to use the `@modal.enter()` lifecycle hook to manage initialization tasks when a container and class instance are first created. This technique helps in setting up resources efficiently right at the start.

    • Class-Based Deployment:  Learn to organize Modal applications using classes, which aids in structuring applications that require persistent or expensive initialization procedures.

    • Stateful Resources:  Discover methods to maintain application state across multiple requests within the same container, minimizing redundant initialization efforts and enhancing performance.

    • Container Management: Explore strategies for keeping containers alive longer than their default ephemeral duration using periodic tasks (e.g., cron jobs), ensuring resources remain accessible without frequent restarts.

    • Infrastructure Configuration:  Gain insights into configuring infrastructure details such as CPU and memory allocation through classes, exemplifying the concept of "Infrastructure as Code"

    • Dynamic Resource Management: See practical examples of balancing resource management with performance needs, knowing when and how to extend container life to minimize cold startup times.

    • Practical Application:  Look forward to applying these concepts in a practical project, like deploying machine learning models efficiently through Modal, leveraging periodic methods to refresh container life.

Requirements

  • Basic Python Skills: Familiarity with Python programming, as the course involves scripting and using Python-based tools.
  • Understanding of Machine Learning Concepts: A foundational grasp of machine learning principles and workflows will help in the application of deployment strategies.
  • Experience with Command Line Interfaces: Competence in using command line tools for installing packages and running scripts is beneficial.
  • Access to a Computer with Internet: A reliable computer setup with internet access is necessary to follow along with the cloud-based exercises and deployments.

Description

This course offers a blend of theoretical understanding and practical application with heavy hands-on lessons designed to transition learners from fundamentals to advanced deployment strategies. You will not only learn to deploy AI models in multiple ways, but also to build Chat Bot with Memory that will interact with our own production grade inference endpoint that will be able to support thousands of requests. Gain the expertise to deploy scalable, interactive AI applications with confidence and efficiency. Whether you're building apps for business, customer interaction, or personal projects, this course is your gateway to mastering AI model deployment. This course will equip you with the knowledge and skills to design robust inference services using cutting-edge tools such as the vLLM framework, FastAPI, and Modal.

What You Will Learn:

  • Strategic Volume Mapping for Efficient Model Management:  Understand how to map and manage storage volumes meticulously to reduce redundant data retrieval and optimize model weight storage.  Gain insights into leveraging local volumes for faster data access and persistent storage, minimizing unnecessary downloads from external repositories like Hugging Face.

  • Deploying High-Performance AI Models:  Master the deployment of machine learning models using the vLLM framework, supporting thousands of parallel inference requests for production-grade applications.  Learn to craft a modular architecture with distinct services for model downloading and inference tasks, reflecting modern software design practices.

  • Developing a Conversational AI Chat Application:  Transform theoretical knowledge into a tangible product by developing a simple Python script to manage chat interactions with deployed language models.  Integrate and authenticate using OpenAI's API client to experience seamless, real-time chat dialogue execution.

  • Building Robust APIs with FastAPI and vLLM :  Create and integrate APIs using FastAPI and vLLM to serve AI models efficiently, ensuring OpenAI-compatible interactions within a containerized infrastructure. Implement REST API endpoints for inferencing services to facilitate interactions with external applications through standardized interfaces.

  • Efficient Resource and Model Management: Employ concurrency and synchronization techniques to manage model data between services, ensuring high availability without excessive network traffic.  Optimize the use of GPUs and other hardware resources to handle a high number of parallel inference requests.

  • Scalable and Secure Service Design:  Design scalable systems that allow rapid initialization and efficient scaling through the strategic use of model weights and local storage.  Secure your application using advanced authentication protocols, including token-based access control to restrict API endpoint usage to authorized users.


Also this course provides an practical exploration of deploying and scaling machine learning models with only a few lines of Python decorators, using Modal's Infrastructure as a Code serverless platform and integration API's.

  • Introduction to Modal: Begin with an overview of Modal's innovative infrastructure management, which simplifies scaling and deployment by automating processes traditionally handled by platforms like AWS. Discover the benefits of serverless architecture and cost optimization strategies.

  • Environment Setup and Script Execution: Learn how to set up and connect your local environment to Modal, manage dependencies, and execute Python scripts in both local and remote settings. Understand Modal's unique approach to deploying serverless functions and the differences between local and remote execution.

  • Ephemeral and Deployed Applications: Transition from running ephemeral applications locally to deploying them for remote execution. Explore the lifecycle of Modal applications, lazy initialization, and container management, with a focus on cost-effective deployment strategies for high-performance workloads.

  • Defining Infrastructure and API Integration: Dive into configuring infrastructure using Modal decorators, manage Docker-like operations, and transform Python functions into web-accessible services using Modals integrated FastAPI. Learn to navigate container management and performance considerations for optimal runtime.

  • Advanced Deployment Techniques: Utilize classes and lifecycle hooks for efficient resource management, maintaining application state across requests, and extending container life. Gain insights into deploying machine learning models from Hugging Face and integrating large language models into your applications.

  • Authentication and Environment Configuration: Master the process of managing secrets for authentication, configuring GPU resources, and setting up container environments. Understand the importance of keeping containers and models ready for quick inference requests.

  • Full Deployment Workflow: Experience a complete workflow for deploying a machine learning model as a web service. From setup to ensuring service availability with cron jobs, observe best practices in container lifecycle management and DevOps automation.

Who this course is for:

  • This course is designed for software developers, and IT professionals who are looking to elevate their skills in deploying and scaling machine learning models in a cloud environment
  • Those who want to move beyond traditional infrastructure challenges like manual scaling and complex server setups and are interested in leveraging serverless architecture for streamlined operations.
  • Learners who appreciate a hands-on approach to learning, focusing on implementing real-world solutions involving API integration, container management, and cost-effective deployment strategies.
  • Individuals who wish to deepen their understanding of cloud-based technologies, specifically around optimizing machine learning workflows using platforms like Modal.