Machine Learning with Kubeflow on Amazon EKS with Amazon EFS
by Anjani Reddy, Suman Debnath, Daniel Rubinstein, and Narayana Vemburaj | on 22 NOV 2022 | in Amazon Elastic File System (EFS), Amazon Elastic Kubernetes Service, Artificial Intelligence, Compute, Storage, Technical How-to
Training machine learning models involves multiple steps, and it becomes more complex and time consuming when the training dataset is hundreds of gigabytes in size. Data scientists run large numbers of experiments that involve training and testing many models. Kubeflow provides ML capabilities to accelerate the training process and run simple, portable, and scalable machine learning workloads on Kubernetes.
Model parallelism is a distributed training method in which a deep learning model is partitioned across multiple devices, within or across instances. When data scientists adopt model parallelism, they also need to share the large training dataset across machine learning models.
In part 1 of this two-part blog series, we covered persistent storage for Kubernetes and an example workload that used Amazon Elastic Kubernetes Service (EKS) with Amazon Elastic File System (EFS) as persistent storage.
In this blog, we will walk through how you can use Kubeflow on Amazon EKS to implement model parallelism and use Amazon EFS as persistent storage to share datasets. You can use Kubeflow to build ML systems on top of Amazon EKS to build, train, tune, and deploy ML models for a wide variety of use cases, including computer vision, natural language processing, speech translation, and financial modeling. With Amazon EFS as the backend storage, you can also get better performance for your model training and inference.
Solution overview
The architecture uses Amazon EKS as the compute layer, where we create different `pods` to run our ML training jobs, and Amazon EFS as the storage layer to store our training datasets. We use Amazon ECR as the image repository that stores the ML training container images Amazon EKS pulls.
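For the Amazon ECR piece, the commands below sketch one way to push a training image to a private registry. The repository name, region, and tag are placeholder assumptions, not values from this walkthrough; adjust them to your environment.

```shell
# Hypothetical sketch: push a locally built ML training image to Amazon ECR.
# REPO_NAME and AWS_REGION are assumed placeholders -- substitute your own.
cat > push-to-ecr.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

REPO_NAME=ml-training          # assumed repository name
AWS_REGION=us-east-1           # adjust to your Region
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGISTRY="${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"

# Authenticate Docker to the private ECR registry
aws ecr get-login-password --region "$AWS_REGION" \
  | docker login --username AWS --password-stdin "$REGISTRY"

# Create the repository if it does not exist yet
aws ecr describe-repositories --repository-names "$REPO_NAME" --region "$AWS_REGION" \
  || aws ecr create-repository --repository-name "$REPO_NAME" --region "$AWS_REGION"

# Tag and push the locally built image
docker tag "${REPO_NAME}:latest" "${REGISTRY}/${REPO_NAME}:latest"
docker push "${REGISTRY}/${REPO_NAME}:latest"
EOF
chmod +x push-to-ecr.sh
```

Running `./push-to-ecr.sh` requires the AWS CLI and Docker to be configured with permissions to use ECR.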
Figure 1: Architecture of Kubeflow on Amazon EKS with Amazon EFS
As part of this architecture, we will do the following:
Install and configure Kubeflow on Amazon EKS.
Set up Amazon EFS as persistent storage with Kubeflow.
Create a Jupyter notebook on Kubeflow.
Perform a machine learning training using an ML (TensorFlow) image from Amazon ECR.
Prerequisites
Complete the following steps to create the EKS cluster and install the necessary tools:
Complete the initial AWS Cloud9 setup as described in the tutorial.
Install the prerequisite tools, such as kubectl and eksctl, which are used throughout this walkthrough.
Create the Kubernetes cluster using Amazon EKS by following the instructions here.
Verify your EKS cluster by running the following command:
$ kubectl get nodes -o=wide
Install and configure Kubeflow
We use the Kustomize tool to install Kubeflow. Use the following commands to install Kustomize (version 3.2.0):
$ wget -O kustomize https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_linux_amd64
$ chmod +x kustomize
$ sudo mv -v kustomize /usr/local/bin
Verify if Kustomize is installed properly:
$ kustomize version
Use the following commands to set up Kubeflow to deliver an end-to-end workflow for training ML models:
$ git clone https://github.com/aws-samples/amazon-efs-developer-zone.git
$ cd amazon-efs-developer-zone/application-integration/container/eks/kubeflow/manifests
$ while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
Verify the installation by ensuring the pods are in the Running state:
$ kubectl get pods -n cert-manager
$ kubectl get pods -n istio-system
$ kubectl get pods -n auth
$ kubectl get pods -n knative-eventing
$ kubectl get pods -n knative-serving
$ kubectl get pods -n kubeflow
$ kubectl get pods -n kubeflow-user-example-com
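Rather than checking each namespace by eye, you can wait for readiness in a loop. The helper below is a small sketch: the namespace list matches the commands above, but the 600-second timeout is an assumption you may need to adjust.

```shell
# Sketch of a readiness check for the Kubeflow installation.
# Writes a helper script; run it against the cluster once kubectl is configured.
cat > wait-for-kubeflow.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
for ns in cert-manager istio-system auth knative-eventing knative-serving \
          kubeflow kubeflow-user-example-com; do
  echo "Waiting for pods in namespace: $ns"
  kubectl wait --for=condition=Ready pods --all -n "$ns" --timeout=600s
done
echo "All Kubeflow pods are Ready"
EOF
chmod +x wait-for-kubeflow.sh
```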
Set up Amazon EFS as persistent storage with Kubeflow
Let’s look at the detailed steps involved in setting up the persistent storage.
1. Create an OIDC provider for the cluster
$ export CLUSTER_NAME=efsworkshop-eksctl
$ eksctl utils associate-iam-oidc-provider --cluster $CLUSTER_NAME --approve
2. Set up Amazon EFS using the automated script auto-efs-setup.py. The script applies default values for the file system name and performance mode, and does the following:
Installs the EFS CSI driver
Creates the IAM policy for the CSI driver
Creates an EFS file system
Creates a storage class for the cluster
3. Run auto-efs-setup.py:
$ cd ml/efs
$ pip install -r requirements.txt
$ python auto-efs-setup.py --region $AWS_REGION --cluster $CLUSTER_NAME --efs_file_system_name myEFS1
================================================================
 EFS Setup
================================================================
 Prerequisites Verification
================================================================
Verifying OIDC provider...
OIDC provider found
Verifying eksctl is installed...
eksctl found!
...
...
Setting up dynamic provisioning...
Editing storage class with appropriate values...
Creating storage class...
storageclass.storage.k8s.io/efs-sc created
Storage class created!
Dynamic provisioning setup done!
================================================================
 EFS Setup Complete
================================================================
4. Verify the storage class in the Kubernetes cluster:
$ kubectl get sc
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
efs-sc          efs.csi.aws.com         Delete          WaitForFirstConsumer   true                   96s
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  148m
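For reference, the storage class the script creates looks roughly like the manifest below. This is a sketch, not the script's exact output: the file system ID is a placeholder you would replace with your own, and the parameters reflect the EFS CSI driver's dynamic provisioning mode.

```shell
# Approximate shape of the storage class created by auto-efs-setup.py.
# fs-0123456789abcdef0 is a placeholder file system ID, not a real one.
cat > efs-sc.yaml <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap          # dynamic provisioning via EFS access points
  fileSystemId: fs-0123456789abcdef0
  directoryPerms: "700"
EOF
# Applied with: kubectl apply -f efs-sc.yaml
```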
Creating a Jupyter Notebook on Kubeflow
1. Run the following to port-forward Istio's Ingress-Gateway to local port 8080:
$ kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
...
...
2. In the Cloud9 console, select Tools > Preview > Preview Running Application to access the dashboard. You can select the pop-out window button to maximize the browser into a new tab.
Figure 2: AWS Cloud9 cloud-based IDE
Keep the current terminal running so you don’t lose access to the UI page.
3. Log in to the Kubeflow dashboard with the default user credentials: the default email address is user@example.com and the default password is 12341234.
Figure 3: Kubeflow dashboard
4. Create a Jupyter notebook by selecting Notebook and then New Server.
5. Name the notebook "notebook1", keep the rest of the settings at their defaults, scroll down, and select LAUNCH.
Figure 4: Kubeflow Dashboard (Notebook Creation)
At this point the EFS CSI driver should create an access point: the new notebook on Kubeflow internally creates a pod, which creates a PVC, which in turn provisions storage from the efs-sc storage class (the default storage class we selected for this EKS cluster). Wait for the notebook to reach the ready state.
Figure 5: Kubeflow dashboard (Notebook server)
6. Now we can use `kubectl` to check the PV (Persistent Volume) and PVC (Persistent Volume Claim) that were created under the hood.
$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                           STORAGECLASS   REASON   AGE
pvc-3d1806bc-984c-404d-9c2a-489408279bad   20Gi       RWO            Delete           Bound    kubeflow/minio-pvc                              gp2                     52m
pvc-8f638f2c-7493-461c-aee8-984760e233c2   10Gi       RWO            Delete           Bound    kubeflow-user-example-com/workspace-nootbook1   efs-sc                  5m16s
pvc-940c8ebf-5632-4024-a413-284d5d288592   10Gi       RWO            Delete           Bound    kubeflow/katib-mysql                            gp2                     52m
pvc-a8f5e29f-d29d-4d61-90a8-02beeb2c638c   20Gi       RWO            Delete           Bound    kubeflow/mysql-pv-claim                         gp2                     52m
pvc-af81feba-6fd6-43ad-90e4-270727e6113e   10Gi       RWO            Delete           Bound    istio-system/authservice-pvc                    gp2                     52m

$ kubectl get pvc -n kubeflow-user-example-com
NAME                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
workspace-nootbook1   Bound    pvc-8f638f2c-7493-461c-aee8-984760e233c2   10Gi       RWO            efs-sc         5m59s
Finally, you should be able to see the access point in the AWS console.
Figure 6: Amazon EFS console (for Access Point)
So, at this point you can make use of this Jupyter Notebook.
Perform a Machine Learning training
Next, let's create a PVC for our machine learning training dataset in ReadWriteMany mode. In the Kubeflow dashboard, go to Volumes → New Volume and create a new volume called dataset with efs-sc as the storage class.
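Equivalently, the same dataset volume can be created from the command line. This is a sketch assuming the efs-sc storage class from earlier and a 10Gi request; the dashboard may apply different defaults.

```shell
# Sketch: create the ReadWriteMany dataset PVC with kubectl instead of the dashboard.
cat > dataset-pvc.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dataset
  namespace: kubeflow-user-example-com
spec:
  accessModes:
    - ReadWriteMany        # multiple pods can share the dataset concurrently
  storageClassName: efs-sc
  resources:
    requests:
      storage: 10Gi
EOF
# Applied with: kubectl apply -f dataset-pvc.yaml
```

ReadWriteMany is the access mode that makes EFS valuable here: unlike EBS-backed volumes, the same file system can be mounted read-write by many training pods at once.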
Figure 7: Creating a new volume from the Kubeflow dashboard
Now, you can follow the instructions in the GitHub repository to use this persistent volume to store the training dataset and perform an ML training.
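To illustrate how a training job consumes the shared volume, here is a minimal hypothetical pod spec that mounts the dataset PVC. The image URI and training script path are placeholders, not values from the repository.

```shell
# Hypothetical TensorFlow training pod mounting the shared EFS-backed dataset PVC.
# The image URI and training command below are assumed placeholders.
cat > training-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: tf-training
  namespace: kubeflow-user-example-com
spec:
  restartPolicy: Never
  containers:
    - name: tensorflow
      image: 111122223333.dkr.ecr.us-east-1.amazonaws.com/ml-training:latest  # placeholder
      command: ["python", "/data/train.py"]   # placeholder training script
      volumeMounts:
        - name: dataset
          mountPath: /data                    # shared EFS-backed dataset
  volumes:
    - name: dataset
      persistentVolumeClaim:
        claimName: dataset
EOF
# Applied with: kubectl apply -f training-pod.yaml
```

Because the claim is ReadWriteMany, several such pods can mount /data simultaneously, which is what enables sharing one dataset across model-parallel training jobs.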
Cleaning up
To avoid incurring unwanted future charges, delete the Amazon EKS cluster by completing the steps covered in this documentation.
Conclusion
In this blog post, we walked you through how to set up Kubeflow for your machine learning workflow on Amazon EKS. We also covered how you can use Amazon EFS as a shared persistent file system to store your training datasets. We highlighted the value that Kubeflow on AWS provides through native AWS-managed service integrations for secure, scalable, and enterprise-ready AI and ML workloads. To get started with Kubeflow on AWS, refer to the available AWS-integrated deployment options in Kubeflow on AWS.
If you are new to Kubernetes and storage integration with Kubernetes, refer to part 1 of this two-part blog series.
TAGS: Amazon EKS , Amazon Elastic File System (Amazon EFS) , AWS Cloud Storage , Kubernetes , machine learning
Anjani Reddy
Anjani is a Specialist Technical Account Manager at AWS. She works with enterprise customers to provide operational guidance to innovate and build a secure, scalable cloud on the AWS platform. Outside of work, she is an Indian classical and salsa dancer, loves to travel, and volunteers for the American Red Cross and Hands on Atlanta.
Suman Debnath
Suman Debnath is a Principal Developer Advocate (Data Engineering) at Amazon Web Services, primarily focusing on Data Engineering, Data Analysis and Machine Learning. He is passionate about large scale distributed systems and is a vivid fan of Python. His background is in storage performance and tool development, where he has developed various performance benchmarking and monitoring tools.
Daniel Rubinstein
Daniel Rubinstein is a Software Development Engineer on the Amazon Elastic File System team. He is passionate about solving technology challenges, distributed systems, and storage. In his spare time, he enjoys outdoor activities and cooking.
Narayana Vemburaj
Narayana Vemburaj is a Senior Technical Account Manager at Amazon Web Services based in Atlanta, GA. He's passionate about cloud technologies and assists enterprise AWS customers in their cloud transformation journey. Outside of work he likes to spend time playing video games and watching science fiction movies.