Deployment guide: VMware vSphere Bitfusion on Dell EMC PowerEdge servers

Abstract: VMware vSphere Bitfusion is a software solution that you can deploy on Dell EMC PowerEdge R740xd and C4140 servers. The solution virtualizes hardware resources to provide a pool of shared resources that is accessible to any virtual machine in the network.
Revisions
Date: September 2020 — Description: Initial release

Acknowledgements
Authors: Vamsee Kotha, Jay Engh
Support: Gurupreet Kaushik, Sherry Keller

The information in this publication is provided "as is." Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Executive summary
Modern computational requirements have evolved and diverged with the wide acceptance and use of containers and virtualization. Today's workloads challenge the CPU-centric paradigm that has long been a key determinant of customers' server performance and utility needs. General-purpose CPUs no longer adequately handle these new workloads, whether they are artificial intelligence, machine learning, or virtual desktops.
1 Audience and scope
This deployment guide provides step-by-step instructions for deploying and configuring the VMware vSphere Bitfusion appliance on Dell EMC PowerEdge R740xd and C4140 rack servers. The guide makes certain assumptions about the prerequisite knowledge of the deployment personnel and the hardware they are using.
2 Overview
With the VMware vSphere Bitfusion software, graphics processing units (GPUs) are no longer isolated from other resources. GPUs are now shared in a virtualized pool of resources, and you can access them through any virtual machine in the infrastructure, as shown in Figure 1. Like processor and storage resources, GPU deployments can now benefit from optimized utilization, reduced CapEx and OpEx, and accelerated development and deployment of R&D resources.
Bitfusion GPU sharing model
Deployment of VMware Bitfusion on Dell EMC PowerEdge servers provides an infrastructure solution that combines best-in-class Dell EMC hardware with core VMware products. Virtualization of compute, storage, networking, and accelerators is delivered on a cluster of PowerEdge servers. The combination of VMware vSphere Bitfusion software and the Dell EMC PowerEdge hardware described in this document has been validated in Dell EMC labs.
3 Component overview
This section briefly describes the components that support VMware vSphere Bitfusion and their key capabilities, to help you deploy the software.
3.1 Dell EMC PowerEdge R740xd server
The PowerEdge R740xd server provides the benefit of scalable storage performance and data-set processing. This 2U, two-socket platform brings you the scalability and performance to adapt to a variety of applications.
Key capabilities:
• Unthrottled performance and superior thermal efficiency with a patent-pending interleaved GPU system design
• No-compromise (CPU + GPU) acceleration technology up to 500 TFLOPS/U using the NVIDIA Tesla V100 with NVLink
• 2.4 kW PSUs help future-proof for next-generation GPUs
Front view of a Dell EMC PowerEdge C4140
Rear view of a Dell EMC PowerEdge C4140
Internal view of a Dell EMC PowerEdge R740xd displaying an NVIDIA T4 GPU card
3.4 Dell EMC Networking S5248F-ON switch
The S5200F-ON series introduces optimized 25GbE and 100GbE open networking connectivity for servers and storage in demanding web and cloud environments. This innovative next-generation top-of-rack family of 25GbE switches provides optimized performance both in-rack and between racks, a cost-effective 50/100GbE leaf-spine fabric, and migration capabilities for future connectivity needs.
NVIDIA V100 for PCIe
NVIDIA V100 for NVLink
3.7 Mellanox ConnectX-5 dual-port 10/25GbE adapter
ConnectX-5 EN supports two ports of 25Gb Ethernet connectivity, sub-600 ns latency, and a very high message rate, plus PCIe switch and NVMe-over-Fabrics offloads, providing a high-performance and flexible solution for demanding applications and markets such as machine learning and data analytics.
4 Pre-deployment requirements
This section describes the pre-deployment requirements for configuring VMware vSphere Bitfusion. Figure 13 illustrates the components involved in creating the Bitfusion client-server cluster.
Bitfusion sample client-server cluster
4.1 GPU hosts
The Bitfusion OVA is deployed on the GPU hosts. The following are the Dell EMC customized images for VMware ESXi 7.
4.3 Bitfusion server and client software
The Bitfusion OVA is a VMware appliance prepackaged with GPU software and services. The Bitfusion client package runs on the virtual machines where applications make use of the GPU resources. To download the OVA and client package, see the Download VMware vSphere Bitfusion page after logging in to your My VMware account.
4.4 vCenter
Once the Bitfusion server OVA is deployed, select Bitfusion from the vCenter menu. vCenter Server 7.
4.7 Network services
Domain Name Service (DNS) is required for both forward and reverse name resolution. The IP addresses of the name servers, the search domains, and the hostnames of all the Bitfusion appliance virtual machines should be tested and verified for both forward and reverse lookups. Test the DNS entries using both their Fully Qualified Domain Name (FQDN) and their short name, or hostname. Time synchronization is critical to the Bitfusion server appliances.
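The forward/reverse lookup check described above can be sketched as a small script. This is a minimal sketch only: the FQDN in the example list is a placeholder for your own appliance hostnames, and the helper function name is illustrative, not part of any VMware tooling.

```python
# Minimal sketch of a forward/reverse DNS consistency check for the
# Bitfusion appliance hostnames. The hostname in the list below is a
# placeholder -- substitute your own appliance FQDNs.
import socket

def dns_round_trip_ok(hostname: str) -> bool:
    """Return True if forward and reverse lookups agree for hostname."""
    ip = socket.gethostbyname(hostname)                 # forward lookup
    reverse_name, _, addrs = socket.gethostbyaddr(ip)   # reverse lookup
    # Accept a match on either the resolved address or the short name.
    return ip in addrs or reverse_name.split(".")[0] == hostname.split(".")[0]

if __name__ == "__main__":
    for host in ["bitfusion-server-1.oseadc.local"]:    # placeholder FQDN
        try:
            status = "OK" if dns_round_trip_ok(host) else "MISMATCH"
        except OSError as exc:
            status = f"FAILED ({exc})"
        print(host, status)
```

Run the script from a machine on the same network as the appliances; a MISMATCH or FAILED result indicates a DNS entry that should be corrected before deployment.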
5 Solution overview
The following is the solution overview for the deployment instructions provided in the rest of the document.
5.1 Architecture
The GPU hosts and client cluster architecture shown in Figure 14 is the reference architecture for the use case described in the following section.
Figure legend: NVIDIA T4 Accelerator; NVIDIA V100 PCIe Accelerator; NVIDIA V100 SXM2 Accelerator; S5248F Network Switch; Client VM; Client Hypervisor; GPU Host Hypervisor; Client host
5.3 VLANs and IP subnet information
[Table not recoverable from extraction: VLAN IDs 96, 90, 10, 20, and 17; associated subnet and drive details (2 x 1.8TB SAS SSD, 6 x ...) are incomplete.]
6 Deployment and configuration
The following section provides step-by-step instructions to deploy the Bitfusion appliances on the GPU hosts, followed by a quick test that demonstrates GPU access by a remote VM, assuming all the prerequisites are met.
6.1 Verify the GPU host hardware configuration
Follow the steps:
1. Log in to iDRAC, browse to Configuration > BIOS Settings > System Profile Settings, and verify that the System Profile is set to Performance.
2.
3. Create a new distributed port group named pvrdma using the New Distributed Port Group wizard under the Bitfusion distributed switch created in the previous step. Set the Port Binding to Ephemeral and the VLAN to 90.
4. Create a VMkernel port for PVRDMA on all hosts by clicking Configure > Networking > VMkernel Adapters > Add Networking wizard. Create vmk1 on the GPU hosts and then create vmk3 on the client cluster.
5.
Tagging the VMkernel adapter
6.3.2 Add and manage hosts on the Bitfusion distributed switch
Follow the steps:
1. Attach the GPU hosts and client cluster hosts.
2. Assign the two RDMA NICs (vmnic6 and vmnic7) from each host.
3. Assign the VMkernel port created in the earlier step (vmk3 on client cluster hosts and vmk1 on GPU hosts) on each host to the pvrdma port group.
Adding and managing hosts on the Bitfusion distributed switch
6.4 Deploy the Open Virtual Appliance (OVA) to create the bitfusion-server-1 virtual machine
Follow the steps:
1. Use the Deploy OVF Template action on the C4140 to deploy the first appliance.
2. Select the appliance OVA file, bitfusion-server-2.0.0-11.ova.
3. Select the folder and provide the name for the first appliance, bitfusion-server-1.
4.
Enter the vCenter URL and the vCenter administrator account credentials
9. Extract the TLS certificate from the browser navigation pane. The hexadecimal thumbprint is case-sensitive.
a. For Google Chrome: click the lock icon (or Not Secure icon) to the left of the URL bar in the browser, and then click Certificate > Details > Thumbprint.
b.
Extracting the TLS certificate from the Google Chrome web browser navigation pane
10. Provide credentials for the user customer. This account is used to log in to the appliance for any troubleshooting.
11. Select the checkbox for the NVIDIA driver license agreement. The appliance must have connectivity to the internet to download the NVIDIA software.
Select the checkbox for NVIDIA driver license agreement
12. Provide information for the Network Adapter 1 settings.
• MTU – 9000
• Gateway – 100.71.x.x
• DNS – 100.71.x.x
• Search domain – oseadc.local
• NTP – 100.71.x.x
Configuring the network adapter 1 settings
13. Provide information for the Network Adapter 2 settings. This provides the appliance access to the data plane (pvrdma) for GPU traffic. Select the checkbox for configuring network adapter 2 settings.
• IPv4 address – 172.16.6.
Configure the network adapter 2 settings and select the checkbox
14. Click Next to complete the deployment configuration and wait for the task to complete. Refrain from powering on the virtual machine.
6.5 Edit the bitfusion-server-1 hardware settings
Follow the steps:
1. Under the Virtual Hardware tab, verify that the number of vCPUs is 8.
• Minimum number of vCPUs = 4 x the number of GPU devices attached to the appliance. In this case, 4 x 2 GPUs = 8.
2.
Editing the bitfusion-server-1 hardware settings
3. Click Add New Device and add two PCI devices. Select the PCI devices from the drop-down menu.
• 0000:18:00.0 | GV100GL V100 SXM2 16GB
• 0000:3b:00.
Adding PCI devices from the drop-down menu
4. Click Add New Device and select Network adapter.
• Browse and select the pvrdma network that was created on the Bitfusion distributed switch earlier.
Setting the adapter type to PVRDMA and the device protocol to RoCE v2
5. Under the Virtual Machine Options tab, select Advanced > Configuration Parameters, and then select Edit Configuration. Set the pciPassthru.64bitMMIOSizeGB parameter to 64.
• pciPassthru.64bitMMIOSizeGB = n, where n equals (number of cards x size of card in GB) rounded up to the NEXT power of 2. In this case, 2 x 16 GB = 32, rounded up to the next power of 2, which is 64.
6.
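The vCPU, memory, and MMIO sizing rules quoted in this section can be sketched as a small calculator. The function names are illustrative only and are not part of any VMware or Dell EMC tooling; the rules themselves are the ones stated in the steps above.

```python
# Hypothetical helpers implementing the appliance sizing rules quoted
# in this guide; the function names are illustrative only.

def min_vcpus(num_gpus: int) -> int:
    """Minimum vCPUs = 4 x the number of GPU devices passed through."""
    return 4 * num_gpus

def min_memory_gb(total_gpu_mem_gb: float) -> float:
    """Minimum memory = 1.5 x the aggregate GPU memory passed through."""
    return 1.5 * total_gpu_mem_gb

def mmio_size_gb(num_gpus: int, gpu_mem_gb: int) -> int:
    """pciPassthru.64bitMMIOSizeGB = (num_gpus * gpu_mem_gb) rounded up
    to the NEXT power of two (strictly greater than the product)."""
    total, size = num_gpus * gpu_mem_gb, 1
    while size <= total:
        size *= 2
    return size

# Two 16 GB V100 SXM2 cards, as in this deployment:
print(min_vcpus(2), min_memory_gb(32), mmio_size_gb(2, 16))  # 8 48.0 64
```

The printed values match the settings used in this guide: 8 vCPUs, 48 GB of reserved memory, and pciPassthru.64bitMMIOSizeGB = 64.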
Notification at the top of the window indicating successful deployment of Bitfusion
7. Select Menu > Bitfusion and wait for the Bitfusion plug-in GUI to load.
Bitfusion plug-in GUI for the bitfusion-server-1 virtual machine
8. Log in to the bitfusion-server-1 appliance using the same credentials used while deploying the appliance. Run the following command to edit the bitfusion-manager service file:
sudo vi /usr/lib/systemd/system/bitfusion-manager.service
Editing the bitfusion-manager service file
9. Save the file, and then use the following commands to reload the daemon and restart the bitfusion-manager service for the changes to take effect:
sudo systemctl daemon-reload
sudo systemctl restart bitfusion-manager
6.6 Deploy the Open Virtual Appliance (OVA) to create the bitfusion-server-2 virtual machine
Follow the steps:
1. Select the appliance OVA file, bitfusion-server-2.0.0-11.ova.
2.
• CIDR – 24
• MTU – 9000
6.7 Edit the bitfusion-server-2 virtual machine hardware settings
Follow the steps:
1. Under the Virtual Hardware tab, verify that the number of vCPUs is 8.
• Minimum number of vCPUs = 4 x the number of GPU devices attached to the appliance. In this case, 4 x 2 GPUs = 8.
2. Verify that the memory is set to 48 GB and select the checkbox Reserve all guest memory.
• Minimum GB of memory = 1.5 x the aggregate total of GPU memory on all GPU cards passed through. In this case, 1.
Select Enable Bitfusion from the Actions menu
8. A window pops up listing the options to enable the appliance as a client or a server. Select the radio button For a server, this will allow it to be used as a GPU server and click ENABLE. This adds guest variables informing the server that it is not the first GPU server in the Bitfusion cluster.
9. Power on the virtual machine and wait for the Bitfusion plug-in GUI to show the additional appliance and additional GPUs added to the cluster.
Bitfusion plug-in GUI for the bitfusion-server-2 virtual machine
10. Log in to the bitfusion-server-2 appliance using the same credentials used while deploying the appliance. Run the following command to edit the bitfusion-manager service file:
sudo vi /usr/lib/systemd/system/bitfusion-manager.service
Add the line Environment=BF_IB_GID_INDEX=1 at the end of the [Service] section.
11.
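After the edit, the end of the [Service] section of the unit file would look like the following sketch. Only the Environment line is prescribed by this guide; the other directives of the shipped unit file are represented by the comment and are left unchanged.

```ini
[Service]
# ... existing directives of bitfusion-manager.service remain unchanged ...
Environment=BF_IB_GID_INDEX=1
```

Because systemd caches unit files, the change takes effect only after the daemon-reload and service restart commands that follow.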
6. Verify the network adapter 2 settings:
• IPv4 address – 172.16.6.22
• CIDR – 24
• MTU – 9000
6.9 Edit the bitfusion-server-3 virtual machine hardware settings
Follow the steps:
1. Under the Virtual Hardware tab, verify that the number of vCPUs is 8.
• Minimum number of vCPUs = 4 x the number of GPU devices attached to the appliance. In this case, 4 x 2 GPUs = 8.
2. Verify that the memory is set to 48 GB and select the checkbox Reserve all guest memory.
• Minimum GB of memory = 1.
Bitfusion plug-in GUI for the bitfusion-server-3 virtual machine
10. In the Bitfusion user interface, verify that the total number of available GPUs is 6 and the allocation is 0 in the Cluster GPU Allocation graph.
All six available GPUs in the cluster
11. Log in to the bitfusion-server-3 appliance using the same credentials used while deploying the appliance.
sudo systemctl daemon-reload
sudo systemctl restart bitfusion-manager
6.10 Provide client cluster access to GPU resources
Follow the steps:
1. Enable Bitfusion on the powered-off Bitfusion client virtual machine. To do this, select Actions > Bitfusion > Enable. Select the radio button For a client, this will allow users to run Bitfusion workloads and click ENABLE.
2.
Listing all available GPUs in the cluster
7. Run Bitfusion commands over the PVRDMA network targeting each GPU server and verify that the available resources are listed. Use the following command:
bitfusion list_gpus -l 172.16.6.x
8. Verify the GPU allocation in the Bitfusion user interface by running the TensorFlow benchmark and assigning two GPUs using the Bitfusion command-line interface.
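Steps 7 and 8 can be sketched as the following client-side commands. This is a sketch only: the server address below is the bitfusion-server-3 data-plane address from this deployment, and the benchmark script name is an assumption, not a path shipped with the appliance.

```shell
# Sketch of the client-side verification; run inside a Bitfusion client VM.
# The benchmark script name is an assumption for this lab.

# List the GPUs visible through one specific GPU server over PVRDMA:
bitfusion list_gpus -l 172.16.6.22

# Borrow two GPUs from the shared pool and run a TensorFlow benchmark:
bitfusion run -n 2 -- python3 tf_cnn_benchmarks.py --model=resnet50
```

While the benchmark runs, the Cluster GPU Allocation graph in the Bitfusion user interface should show two of the six GPUs allocated.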
7 Getting help
7.1 Contacting Dell EMC
Dell EMC provides several online and telephone-based support and service options. Availability varies by country, region, and product, and some services may not be available in your area. To contact Dell EMC for sales, technical assistance, or customer service issues, see https://www.dell.com/contactdell.