May 15, 2023

RuntimeClass Request in the Components of Kubeflow Pipeline

Introduction

Are you struggling to request a container runtime class in Kubeflow Pipelines components? You’re not alone. Currently, there is no built-in way to do it in Kubeflow Pipelines 1.8.0 with the kfp==1.8.6 SDK on a cluster managed by kops. This can be frustrating for developers who rely on these tools for pipeline management.

In this blog, we’ll discuss a solution to this problem that will help you streamline your workflow and save time. So, if you’re ready to learn how to request a docker runtime class in Kubeflow Pipeline components, let’s get started!

Problem Statement

When you run a Kubeflow pipeline, its components run as pods on the Kubernetes cluster, as specified in the pipeline. The default container runtime is suitable for CPU-only pods, but GPU workloads require the NVIDIA runtime.

The cluster management tool, kops, keeps runc as the cluster default runtime and also installs a RuntimeClass called nvidia. Because the nvidia runtime is not the default, you need to add runtimeClassName: nvidia to any Pod spec that runs a GPU workload. When writing Kubeflow Pipelines, however, there is no method or class to specify or request the nvidia runtime. This means that when you write a pipeline with GPU components, they will fail to execute on GPUs.

This is a major problem for developers who need to run GPU workloads. The default runtime is simply not suitable for GPU workloads and the lack of a method to specify the NVIDIA runtime makes it impossible to run these workloads on Kubeflow Pipelines. In the next section, we will discuss a solution to this problem.

Solution

Since we know that Kubeflow Pipelines generate Argo Workflow YAML files when we compile pipeline code, we can modify these YAML files to add the runtimeClassName field to the Podspec of components that require GPUs.

How it’s done:

1. Generate the Kubeflow Pipeline YAML. Compile the pipeline in your CI/CD tool to produce the YAML file. This file contains the configuration of every component.

2. Loop through the YAML. Use a YAML package to read the generated file in your CI/CD tool, then loop through it and check the resource requests of each component.

3. Patch the YAML. For any component that has a resource request of nvidia.com/gpu, add podSpecPatch: '{"runtimeClassName":"nvidia"}' to the component’s YAML.

The patching script loops through all the components in the pipeline and checks whether each one requests GPU resources. If it does, the script adds a podSpecPatch to the component with runtimeClassName set to nvidia. The patched pipeline is then written to a new file called patched_pipeline.yaml.
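A minimal sketch of steps 2 and 3 might look like the following. It assumes the compiled file has the usual Argo Workflow layout (spec.templates, each with an optional container.resources section) and that PyYAML is installed. The names patch_workflow and patch_file are illustrative and not part of kfp.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"
# podSpecPatch is a string field, so the patch is serialized as JSON.
RUNTIME_PATCH = json.dumps({"runtimeClassName": "nvidia"})


def requests_gpu(template: dict) -> bool:
    """Return True if the template's container asks for an NVIDIA GPU."""
    resources = (template.get("container") or {}).get("resources") or {}
    return any(
        GPU_RESOURCE in (resources.get(kind) or {})
        for kind in ("limits", "requests")
    )


def patch_workflow(workflow: dict) -> list:
    """Patch GPU templates in place and return the names that were patched."""
    patched = []
    for template in workflow.get("spec", {}).get("templates", []):
        # Assumption: a template that already has a podSpecPatch is left
        # alone rather than merged, to avoid overwriting someone's settings.
        if requests_gpu(template) and "podSpecPatch" not in template:
            template["podSpecPatch"] = RUNTIME_PATCH
            patched.append(template.get("name"))
    return patched


def patch_file(src: str, dst: str = "patched_pipeline.yaml") -> list:
    """Read the compiled pipeline, patch it, and write the result to dst."""
    import yaml  # PyYAML; imported here so the helpers above need only the stdlib

    with open(src) as f:
        workflow = yaml.safe_load(f)
    patched = patch_workflow(workflow)
    with open(dst, "w") as f:
        yaml.safe_dump(workflow, f, sort_keys=False)
    return patched
```

In a CI/CD job, this could be called as patch_file("pipeline.yaml") right after the compile step, where pipeline.yaml is whatever path your compile step writes to. The returned list of template names is handy for logging which components were patched.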

4. Upload the patched YAML. Once you’ve patched the YAML, upload it to Kubeflow Pipelines.

5. Run the pipeline. Create an experiment and run the pipeline. Under the GPU pods, you should be able to see the Pod spec. Copy it to a text editor and search for runtimeClassName: nvidia in it.
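The manual search in step 5 can also be scripted. The sketch below assumes you have saved the Pod spec to a string (as YAML or pretty-printed JSON, with one field per line). The helper name has_nvidia_runtime is hypothetical.

```python
import re

# Matches a line such as `runtimeClassName: nvidia` (YAML)
# or `"runtimeClassName": "nvidia",` (pretty-printed JSON).
_RUNTIME_LINE = re.compile(
    r'^\s*"?runtimeClassName"?\s*:\s*"?nvidia"?\s*,?\s*$',
    re.MULTILINE,
)


def has_nvidia_runtime(pod_spec_text: str) -> bool:
    """Return True if the Pod spec text sets runtimeClassName to nvidia."""
    return _RUNTIME_LINE.search(pod_spec_text) is not None
```

This is a plain text check, not a YAML parse, so it only confirms that the field appears on its own line somewhere in the text you paste in.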

That’s it! By patching the runtime class to nvidia in the workflow YAML, you can run GPU workloads on Kubeflow Pipelines.

Conclusion

It’s important to keep in mind that making changes to the runtime environment may have unintended consequences and may require additional debugging. However, with the solution presented in this blog post, you can ensure that your Kubeflow Pipeline components requesting GPU resources will use the correct runtime class and avoid failures during execution.

LINKS:

  1. https://kops.sigs.k8s.io/gpu/
  2. https://github.com/kubeflow/pipelines/issues/8052
  3. https://github.com/argoproj/argo-workflows/issues/7519
  4. https://github.com/kubernetes/kops/blob/master/hooks/nvidia-device-plugin/README.md

Author:

Ketan Gangal
Machine Learning Engineer

At IntellectAI, I work on the design, implementation and deployment of machine learning pipelines in production. I love working with open source tools like Kubeflow and MLflow.
