E2E deployment of a production-ready NDv4 (A100) cluster targeting large deep learning training


https://techcommunity.microsoft.com/t5/azure-global/e2e-deployment-of-a-production-ready-ndv4-a100-cluster-targeting/ba-p/3580003



 


Introduction


The NDv4 series is very popular for running large deep learning training jobs, which require substantial floating-point performance and high interconnect bandwidth. In this article we will walk through the steps to deploy a complete, E2E, production-ready NDv4 cluster environment targeting large deep learning training.


The NDv4 cluster environment will consist of the following:



  • NDv4 compute nodes running the Ubuntu-HPC 18.04 marketplace image (pre-installed with InfiniBand drivers, GPU drivers, NCCL libraries/tests and MPI libraries)

  • CycleCloud 8.2.2 with SLURM (with PMIx support)

  • Premium SSD OS disks with larger capacity (60 GB)

  • Accelerated networking enabled on the NDv4 nodes

  • The 7 TB of local NVMe SSDs set up and mounted at /mnt/resource_nvme

  • Nodes automatically recover their configuration after a reboot (e.g. local NVMe SSDs remounted, GPU persistence mode enabled and GPU application clock frequencies set); a sketch of these steps is shown after this list

  • User home directories mounted on Azure NetApp Files

  • Container support via Pyxis + Enroot integration with the SLURM scheduler

  • Extensive automatic pre-job health checks enabled; unhealthy nodes are put into a DRAIN state (see Automated HPC/AI compute node healthchecks integrated with the SLURM scheduler)

  • CycleCloud autoscaling disabled

  • SLURM accounting enabled (MariaDB is deployed and accessed by the scheduler via a private endpoint)

  • The NDv4 node memory defined in SLURM relaxed to tolerate differences between the configured and actual physical memory

  • No public IPs; the NDv4 cluster is accessed via a Bastion landing zone

  • A Windows server (winbox) deployed to access the CycleCloud portal via Bastion
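
As a rough illustration of the reboot-recovery item above, here is a minimal sketch of the kind of steps the cluster-init scripts perform on each boot (the NVMe device names, the RAID-0 layout and the A100 application clock values of 1215 MHz memory / 1410 MHz graphics are assumptions; the actual azurehpc scripts are the reference):

# enable GPU persistence mode and set A100 application clocks (memory,graphics)
sudo nvidia-smi -pm 1
sudo nvidia-smi -ac 1215,1410
# stripe the local NVMe drives and mount them at /mnt/resource_nvme
sudo mdadm --create /dev/md0 --level 0 --raid-devices 8 /dev/nvme{0..7}n1
sudo mkfs.xfs /dev/md0
sudo mkdir -p /mnt/resource_nvme
sudo mount /dev/md0 /mnt/resource_nvme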


 


Architecture


 


[Figure: deploy_e2e_NDv4_cyclecloud.jpg]


 


NHC and checkpoint/restart automation


 


[Figure: inflection_ai_checkpoint_flow.jpg]


 


 


 


Deployment procedure


The azurehpc GitHub repository is used to deploy the complete NDv4 cluster environment.


 


Step 1 – Get the azurehpc GitHub repository


First, get the azurehpc repository; we will be working primarily in the experimental/deploy_cycle_slurm_ndv4 directory.


 


git clone https://github.com/Azure/azurehpc.git
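
After cloning, change into the working directory used throughout the rest of this article (a quick sketch, assuming the repository layout described above):

cd azurehpc/experimental/deploy_cycle_slurm_ndv4
ls    # the prereqs*.json and config*.json files referenced in the steps below live here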

 


Step 2 – Deploy Bastion and jumpbox


Deploy the Bastion and a jumpbox (bjumpbox); this will be your landing zone. See examples/bastion for how to deploy this and edit examples/bastion/bastion_ssh_bjumpbox.sh so that you can access your bjumpbox directly by running this script.


Note: You will need the appropriate authentication to your subscription on your bjumpbox (e.g. you may need to execute "az login <ARGS>" to authenticate).
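
For example, a minimal sketch of reaching the bjumpbox and authenticating (the subscription placeholder is not part of the repository; substitute your own):

./examples/bastion/bastion_ssh_bjumpbox.sh          # SSH to the bjumpbox through Bastion

# on the bjumpbox
az login                                            # authenticate to Azure
az account set --subscription "<subscription-id>"   # select the subscription to deploy into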


 


Step 3 – Deploy the prerequisites


From your jumpbox, deploy the prerequisites (VNet, key vault, peering to the Bastion landing zone and Azure NetApp Files). A prereqs.json configuration file is provided (edit this file before using it).


azhpc-build -c prereqs.json

Note: In the prereqs.json file, the uuid variable is just any unique set of characters/numbers that will be part of your key vault name (i.e. to make sure it is unique). Two Azure NetApp Files volumes are created: one for the user home directories and one additional volume for apps or data.
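
A quick way to confirm the prerequisites deployed as expected (a sketch; $RESOURCE_GROUP is assumed to hold the resource group name you set in prereqs.json):

az network vnet list -g "$RESOURCE_GROUP" -o table        # VNET peered to the Bastion landing zone
az keyvault list -g "$RESOURCE_GROUP" -o table            # key vault that will hold the generated passwords
az netappfiles account list -g "$RESOURCE_GROUP" -o table # Azure NetApp Files account hosting the two volumes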


 


Step 3b – Deploy MariaDB and private endpoint (only needed if SLURM accounting is enabled)


azhpc-build --no-vnet -c prereqs_sacct.json
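
To sanity-check this step, you can list the MariaDB server and its private endpoint (a sketch; resource names depend on your prereqs_sacct.json edits):

az mariadb server list -g "$RESOURCE_GROUP" -o table             # SLURM accounting database
az network private-endpoint list -g "$RESOURCE_GROUP" -o table   # private endpoint used by the scheduler
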

Step 4 – Prepare scripts for CycleCloud project generation


azurehpc will generate the CycleCloud projects defined in the config.json (no container support) or config_pyxis_enroot.json (container support via Pyxis + Enroot integration with SLURM) files, but the scripts referenced in the projects need to be in the deploy_cycle_slurm_ndv4/scripts directory.


cp ../gpu_optimizations/max_gpu_app_clocks.sh scripts
cp ../cc_slurm_nhc/cc_slurm_nhc/specs/default/cluster-init/files/* scripts
cp ../cc_slurm_pyxis_enroot/cc_slurm_pyxis_enroot/specs/default/cluster-init/files/* scripts
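
You can quickly verify the copies landed where project generation expects them (a sketch):

ls scripts    # should now contain max_gpu_app_clocks.sh plus the NHC and Pyxis/Enroot cluster-init files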

 


Step 5 – Deploy the NDv4 cluster with CycleCloud


Now we deploy the CycleCloud server and CycleCloud locker, generate the CycleCloud projects, upload the projects to the locker and create the NDv4 cluster. Two configuration files have been provided, config.json and config_pyxis_enroot.json.


 


azhpc-build --no-vnet -c config_pyxis_enroot.json

Note: The variable "projectstore" is the name of the storage account used by CycleCloud to store packages and projects (i.e. the CycleCloud locker). Make sure to use a unique name, and make sure the "storage" resource (e.g. storage account) you are deploying has the same name.
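
One way to generate a likely-unique locker name before editing the configuration file (a sketch; "cclocker" is an arbitrary prefix, and storage account names must be 3-24 lowercase alphanumeric characters):

suffix=$(head -c4 /dev/urandom | od -An -tx1 | tr -d ' \n')   # 8 random hex characters
echo "projectstore: cclocker${suffix}"                        # use this for both the variable and the storage resource name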


 


Step 6 – Start the NDv4 cluster


You have two options to start the NDv4 cluster: log in to the Windows server via Bastion and start it via the CycleCloud web portal, or log in to the jumpbox and start it via the CycleCloud CLI.


azhpc-connect jumpbox

cyclecloud start_cluster slurmcycle
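
Once started, you can confirm the cluster came up from the jumpbox (a sketch using the CycleCloud CLI):

cyclecloud show_cluster slurmcycle    # the scheduler, login and NDv4 node arrays should be reported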

 


Step 7 – Connect to the login node (login-1)


Now you can log in to the login node (login-1) and submit jobs via the SLURM scheduler.


 


From the jumpbox


cyclecloud connect login-1 -c slurmcycle
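
From the login node, a quick sanity check that SLURM sees the NDv4 nodes (a sketch; the "hpc" partition name follows the default CycleCloud SLURM template and is an assumption here):

sinfo                                         # list partitions and node states
srun -p hpc -N 2 --gpus-per-node=8 hostname   # allocate two NDv4 nodes and print their hostnames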

 


How to connect to the CycleCloud web portal


You will first need to retrieve the Windows box (winbox) and CycleCloud server passwords from the Azure key vault (they were added to the key vault using the prereqs.json config file). You will also need the cycleserver private IP address.


Fortunately, azurehpc has support for retrieving secrets from the Azure key vault.


azhpc-get secret.{{variables.key_vault}}.WinPassword

 azhpc-get secret.{{variables.key_vault}}.CycleAdminPassword

azhpc-get ip.cycleserver

Go to the Azure portal (to your resource group) and log in to winbox via Bastion, using "hpcadmin" for the username and the retrieved password. Then, from the Windows box, browse to the cycleserver private IP address and use the user "hpcadmin" and the retrieved password to access the cycleserver.


 


Manually start and delete NDv4 nodes


Autoscaling was disabled, so all NDv4 nodes in the cluster need to be manually added (up to the maximum number of cores you specified in the CycleCloud configuration) or deleted.


You can add or delete nodes via the CycleCloud web portal, but the recommended way is from the scheduler node using the CycleCloud-provided scripts.


 


First, log in to the scheduler from the jumpbox (via the CycleCloud CLI).


cyclecloud connect scheduler -c slurmcycle

Then, from the scheduler, to add node(s)


sudo /opt/cycle/slurm/resume_program.sh slurmcycle-hpc-pg0-[1-4]

To delete nodes


sudo /opt/cycle/slurm/suspend_program.sh slurmcycle-hpc-pg0-1
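
After resuming or suspending nodes, you can confirm their state from the scheduler using standard SLURM commands (a sketch):

sinfo -N -l                                # per-node state listing (idle, alloc, drain, ...)
scontrol show node slurmcycle-hpc-pg0-1    # detailed state for a single node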

 


Some GPU monitoring options


Moneo: Distributed GPU System Monitoring for AI Workflows, based on Prometheus and Grafana, can be easily integrated into this cluster. If you would prefer a more Azure-native approach leveraging the Azure Monitor service, see GPU Monitoring using Azure Monitor.


 


Conclusion


The deployment procedure outlined above allows you to quickly deploy a complete production-ready NDv4 cluster ideal for large deep learning training jobs. The azurehpc framework is very flexible and allows you to easily customize your deployment.
