A Slick New Way to Get More GPUs on Blade Servers

It should be no surprise that the popularity of accelerators in the datacenter continues to grow.  Last year I wrote a blog post on the GPU options for blade servers.  At that time, the options were limited to either small GPUs like the NVIDIA T4 or mezzanine-based GPUs.  The market really hasn’t added anything more in the GPU-on-blade space until now.

Dell Technologies and Liqid have created two new ways to expand GPU options in the modular space, and they give a glimpse into future datacenter designs: GPU over PCIe and GPU over Ethernet.  Let’s take a look at each.

GPU Expansion over PCIe

This design has 4 major components:

  • PowerEdge MX7000 chassis from Dell Technologies
  • PCIe expansion chassis from Liqid
  • PCIe Adapter from Liqid
  • Liqid Command Center software

I’ve covered the Dell Technologies PowerEdge MX in depth, but if you want a refresher, check out this original blog post.

The Liqid GPU expansion chassis is officially called the “LQD300x20x Expansion Chassis”.  It is a 7U chassis that offers 20 double-wide PCIe Gen3 x16 expansion slots, supporting up to 20 devices.  Note that I didn’t say GPUs, because this chassis is unique: it supports not just GPUs but also FPGAs, NICs or SSD add-in cards.  Unlike other GPU options in the market for blade servers, this expansion chassis supports common GPUs like the NVIDIA V100, A100, RTX and T4.
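To put “PCIe Gen3 x16” into numbers, here is a quick back-of-the-envelope sketch in Python.  It only uses the published Gen3 signalling rate (8 GT/s per lane) and 128b/130b encoding, so the result is a theoretical ceiling for a single slot, not a measured figure for this chassis.  The function name is my own, and the bandwidth of the link between the blade and the chassis is not something I have covered here, so treat this as slot-side math only.

```python
# Theoretical one-direction bandwidth of a PCIe Gen3 link (not a benchmark).
GEN3_GT_PER_S = 8.0      # PCIe Gen3 signalling rate per lane, in GT/s
ENCODING = 128 / 130     # Gen3 uses 128b/130b line encoding


def pcie_gen3_gbytes_per_s(lanes: int) -> float:
    """Return the theoretical per-direction bandwidth in GB/s (illustrative helper)."""
    # GT/s * encoding efficiency gives usable Gbit/s per lane; divide by 8 for bytes.
    return GEN3_GT_PER_S * ENCODING * lanes / 8


per_slot = pcie_gen3_gbytes_per_s(16)
print(f"Gen3 x16 slot: ~{per_slot:.2f} GB/s per direction")
```

Real-world throughput will land below this once protocol overhead is counted.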

The last piece of the solution is a PCIe Adapter, provided by Liqid.  This adapter sits within one of the hot-plug drive slots on the PowerEdge MX blade server.  Each blade server needing access to the Liqid expansion chassis needs its own adapter, which connects directly to the LQD300x20x chassis.

Once the PCIe devices are connected to the MX7000, the Liqid Command Center software provides the secret sauce that allows dynamic allocation of GPUs to the PowerEdge MX blade servers.

To me, there are a couple of major benefits in this solution compared to what has been in the market for GPUs on blade servers.  This solution can support 20 GPUs across 8 servers, 20 GPUs on 1 server, or any combination in between.
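To make the “20 across 8, 20 across 1, or anything in between” idea concrete, here is a toy Python model of a shared device pool.  To be clear, this is not the Liqid Command Center API; the class and method names (GpuPool, allocate, release) are made up for illustration, and the only numbers taken from above are the 20 slots and 8 servers.

```python
from dataclasses import dataclass, field

MAX_SLOTS = 20   # expansion slots in the chassis (from the post)
MAX_BLADES = 8   # servers sharing the chassis (from the post)


@dataclass
class GpuPool:
    """Toy model of a shared device pool. Hypothetical; not Liqid's API."""
    assigned: dict = field(default_factory=dict)  # slot number -> blade number

    def free_slots(self) -> list:
        """List the slot numbers not currently attached to any blade."""
        return [s for s in range(1, MAX_SLOTS + 1) if s not in self.assigned]

    def allocate(self, blade: int, count: int) -> list:
        """Attach `count` free slots to `blade` and return the slot numbers."""
        if not 1 <= blade <= MAX_BLADES:
            raise ValueError(f"blade must be 1..{MAX_BLADES}")
        free = self.free_slots()
        if count > len(free):
            raise RuntimeError(f"only {len(free)} slots free, {count} requested")
        chosen = free[:count]
        for slot in chosen:
            self.assigned[slot] = blade
        return chosen

    def release(self, blade: int) -> int:
        """Return every slot held by `blade` to the pool; report how many."""
        held = [s for s, b in self.assigned.items() if b == blade]
        for slot in held:
            del self.assigned[slot]
        return len(held)


pool = GpuPool()
pool.allocate(blade=1, count=20)   # all 20 devices on a single server
pool.release(blade=1)              # hand them back to the pool
for blade in range(1, MAX_BLADES + 1):
    # Spread the 20 devices over 8 servers: four blades get 3, four get 2.
    pool.allocate(blade, 3 if blade <= 4 else 2)
print(len(pool.free_slots()), "slots left unassigned")
```

The point of the sketch is simply that the devices belong to the pool rather than to any one server, so the split can change without touching hardware.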

Lastly, GPU over PCIe offers substantial GPU density without using any of the PowerEdge MX I/O fabric, unlike the Amulet Hotkey solution.

GPU Expansion over Ethernet (or Fabrics)

The second of the two GPU Expansion solutions is a bit different.  Instead of utilizing a Liqid expansion chassis, the design incorporates the Dell Technologies DSS 8440.  I’ve seen this solution listed as “GPU Expansion over Ethernet” and “GPU Expansion over Fabrics (GPU-oF)” but for this initial release, it focuses on Ethernet connectivity.

Here are the 3 major components:

  • PowerEdge MX7000 chassis from Dell Technologies
  • DSS 8440 from Dell Technologies
  • Liqid Command Center software

The DSS 8440 specs:

  • 4U, 19″ width
  • Up to 2 x Intel Xeon SP Processors (up to 24 cores per CPU, and a TDP up to 205W)
  • 10 drive bays: 6 fully flexible (NVMe or SATA) + 2 SATA + 2 NVMe
  • Supports the PERC 730P+
  • Up to 10 Dual Width Full Length PCIe slots and 8 Full Height + 1 low profile rear slots
  • Accelerator options:
    • 10 x NVIDIA V100 or V100S GPUs
    • 10 x NVIDIA Quadro RTX 8000 or RTX 6000 GPUs (with graphics capabilities)
    • 16 x NVIDIA T4 GPUs
    • 8 x Graphcore C2 IPU cards
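If you are sizing a deployment, the accelerator limits above are easy to capture in a few lines of Python.  This is just a planning sketch: the counts are copied from the list, the helper name is my own, and it assumes each DSS 8440 is filled with a single accelerator model, since I have not covered mixed configurations here.

```python
import math

# Maximum accelerators per DSS 8440, copied from the list above.
MAX_PER_DSS8440 = {
    "NVIDIA V100": 10,
    "NVIDIA V100S": 10,
    "NVIDIA Quadro RTX 8000": 10,
    "NVIDIA Quadro RTX 6000": 10,
    "NVIDIA T4": 16,
    "Graphcore C2 IPU": 8,
}


def dss8440_units_needed(model: str, count: int) -> int:
    """Return how many DSS 8440 units are needed for `count` of one model.

    Illustrative helper; assumes one accelerator model per unit.
    """
    if model not in MAX_PER_DSS8440:
        raise KeyError(f"no limit listed for {model!r}")
    if count < 0:
        raise ValueError("count must not be negative")
    return math.ceil(count / MAX_PER_DSS8440[model])


print(dss8440_units_needed("NVIDIA T4", 40))
```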

This architecture requires connectivity between the DSS 8440 and the PowerEdge MX7000’s Fabric A via the MX9116n Fabric Switching Engine (see image below).  Once the DSS 8440 is connected to the MX9116n 100GbE switching engine, the Liqid Command Center software discovers its devices and makes them available to any of the MX7000 blade servers.  Part of the uniqueness of this design is that the Liqid software carves up the GPU horsepower to be delivered to the server.  This model allows for GPU performance with limited CPU/DRAM on the blade servers, so it might be useful in cases where a lightweight server is required.

One important note with this design – it’s currently only supported on Linux.  VMware and Microsoft Windows support is pending.

 

Conclusion

I think this partnership between Dell Technologies and Liqid is a good example of how future datacenters may look.  This concept of “kinetic expansion” (I refuse to use the C-word) could be ideal for use in these categories:

  • Bare Metal Cloud as a Service
  • Artificial Intelligence
  • Media & Entertainment (e.g. supporting AI engineering in the day and video rendering at night)
  • Data Center Orchestration with Kubernetes, OpenStack, Slurm, etc.

If you are interested in getting more information on these solutions, check out the two summary documents below, then reach out to your Dell Technologies resources.

 

Kevin Houston (Twitter: @Kevin_Houston)
Kevin Houston is the founder and Editor-in-Chief of BladesMadeSimple.com.  He has over 23 years of experience in the x86 server marketplace.  Since 1997 Kevin has worked at several resellers in the Atlanta area, and has a vast array of competitive x86 server knowledge and certifications as well as an in-depth understanding of VMware virtualization.  Kevin has worked at Dell since August 2011 supporting enterprise server sales as a Principal Technologist.

 Disclaimer: The views presented in this blog are personal views and may or may not reflect any of the contributors’ employer’s positions. Furthermore, the content is not reviewed, approved or published by any employer.