Not every agency will need to implement an AI/ML data center, but if you find yourself being asked to design one, which supporting components should you consider? One of the primary reasons to build your own AI/ML data center is to meet business demands while also protecting sensitive data. In part, those business demands likely stem from the increased availability of open-source AI models. That is fair. But what should be considered in building one?

GPU-based systems are the heavy favorite for AI/ML scenarios, largely due to their ability to execute many tasks in parallel. While you are researching GPUs, you may encounter the technical phrase Remote Direct Memory Access (RDMA). Typically, the operating system handles data transfers. This effort requires the CPU to halt whatever it is doing, process the data transfer, and then resume what it was doing. This is commonly called a “context switch”. Context switches take time and can slow down other tasks. Not ideal, and especially not ideal when you have a heavy AI/ML effort underway. RDMA’s approach is to bypass the CPU, using network adapters to move data directly between the memory of one system and the memory of another.
RDMA may remind you of Direct Memory Access (DMA) technology. DMA was designed to address the same issue: taking data transfer tasks off the CPU for faster processing. However, DMA’s limitation is that it functions within a single system. RDMA adopts the same approach but extends the model across different systems on the network.
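To make the offloading idea concrete, here is a back-of-the-envelope sketch in Python. The per-transfer costs are made-up numbers chosen only to keep the arithmetic easy to follow; they are not measurements of any real hardware, and the names `TransferCost` and `cpu_time_spent` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TransferCost:
    """CPU time charged for one data transfer, in microseconds (made-up values)."""
    cpu_handled_us: float  # CPU pauses its work, moves the data, then resumes
    offloaded_us: float    # CPU only posts the request; the adapter moves the data


def cpu_time_spent(transfers: int, per_transfer_us: float) -> float:
    """Total CPU time, in microseconds, consumed by a batch of transfers."""
    return transfers * per_transfer_us


# Hypothetical figures, not benchmarks.
cost = TransferCost(cpu_handled_us=10.0, offloaded_us=1.0)
transfers = 1_000_000

cpu_path = cpu_time_spent(transfers, cost.cpu_handled_us)
offload_path = cpu_time_spent(transfers, cost.offloaded_us)

print(f"CPU-handled transfers: {cpu_path / 1_000_000:.1f} s of CPU time")
print(f"Offloaded transfers:   {offload_path / 1_000_000:.1f} s of CPU time")
print(f"CPU time freed up:     {(cpu_path - offload_path) / 1_000_000:.1f} s")
```

The point is the shape of the result rather than the specific values: the more transfers a workload makes, the more CPU time is returned to other tasks when the adapter does the moving.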
Aside from AI system environments, RDMA is applicable in High Performance Computing (HPC) environments and is heavily used by major cloud providers as well. RDMA does require specialized Network Interface Cards (NICs), and your network team will need to check the Ethernet cable type, the network switches, and the network protocols to ensure that all of them meet RDMA's needs. All of that may be more than you want to know, but share your intention with the network team and they should be able to identify any other changes to consider.
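As a starting point for that conversation, a quick check like the following can show whether a Linux server already has RDMA-capable adapters visible to the operating system. This is a minimal sketch, assuming a Linux host where the kernel's RDMA subsystem lists its devices under `/sys/class/infiniband` (the directory name is historical and is generally used for RDMA-over-Ethernet adapters too). The function name `list_rdma_devices` is hypothetical, and an empty result does not rule out hardware that simply lacks a loaded driver.

```python
from pathlib import Path


def list_rdma_devices(sysfs_root: str = "/sys/class/infiniband") -> list[str]:
    """Return the names of RDMA devices the Linux kernel has registered.

    Returns an empty list when the directory is missing, which usually
    means no RDMA driver is loaded (or the host is not running Linux).
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    return sorted(entry.name for entry in root.iterdir())


if __name__ == "__main__":
    devices = list_rdma_devices()
    if devices:
        print("RDMA devices found:", ", ".join(devices))
    else:
        print("No RDMA devices are visible to the operating system.")
```

Treat the output as a conversation starter with your network team and vendors, not as a full readiness check; switches, cabling, and protocol settings still need their own review.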
So, after getting deep in the details, what does RDMA get you? Well, it is faster and that is good, right? Faster data transfers where latency is measured in terms of microseconds. And the CPU? Oh, it is much happier to be able to work on other tasks. No (or less) data transfer requests? Yay!
Are there any limitations to using RDMA? In terms of systems/nodes, the limit varies depending on the (software) protocol used; however, recent deployments scale from tens of thousands to over 100,000 nodes. Your limitation in implementing something of that scale is more likely to be on the budget side than on the technology side.
RDMA is fascinating, and given the processing needs of AI, it is a technical approach that is very much needed. If you are in discussions about building an AI/ML environment, ask about adopting RDMA, but certainly reach out to vendors to help understand the infrastructure needs.
Dan Kempton is the Sr. IT Advisor at North Carolina Department of Information Technology. An accomplished IT executive with over 35 years of experience, Dan has worked nearly equally in the private sector, including startups and mid-to-large scale companies, and the public sector. His Bachelor’s and Master’s degrees in Computer Science fuel his curiosity about adopting and incorporating technology to reach business goals. His experience spans various technical areas including system architecture and applications. He has served on multiple technology advisory boards, ANSI committees, and he is currently an Adjunct Professor at the Industrial & Systems Engineering school at NC State University. He reports directly to the CIO for North Carolina, providing technical insight and guidance on how emerging technologies could address the state’s challenges.


