AI is the new paradigm shift within businesses, offering an exciting level of automation and creation tools that are revolutionizing industries. As exciting as this development is, far less attention has been paid to the changes in data center infrastructure necessary to support these complex system architectures. AI server racks face additional considerations and concerns that standard data center server racks do not contend with, primarily due to their much higher power consumption and heat dissipation.
[Table: AI Server Rack Options, with columns for Basic, Thermal, and Additional considerations]
AI Server Rack Architecture
Understanding the needs of AI server racks requires a quick dive into the changing landscape of data center architecture. Traditionally, the three-tier network design (access, aggregation, and core layers) met the needs of computational complexity and data traffic. Switches could deftly route signals up and down within a branch, but traffic between branches had to climb to the core and back down. In other words, the three-tier network was akin to parallel traffic lanes: data flows easily up and down a lane, but perpendicular movement across lanes is more challenging. Still, several factors made the three-tier design the dominant architecture for many years: its relatively discrete nature (a network section can be diagnosed or isolated without affecting other branches), its scalability, and its overall performance characteristics.
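To make the traffic-lane analogy concrete, here is a minimal sketch of hop counting in a three-tier tree. The layout (two access switches per aggregation switch) is an illustrative assumption, not a vendor spec:

```python
# A minimal sketch of three-tier hop counting, assuming an illustrative
# layout of two access switches per aggregation switch (not a vendor spec).

def three_tier_hops(src_access: int, dst_access: int,
                    access_per_agg: int = 2) -> int:
    """Inter-switch hops between servers on two access switches."""
    if src_access == dst_access:
        return 0  # same access switch: traffic never leaves it
    if src_access // access_per_agg == dst_access // access_per_agg:
        return 2  # same branch: access -> aggregation -> access
    return 4      # cross-branch: must climb to core and back down

print(three_tier_hops(0, 1))  # 2: "up and down the lane" is short
print(three_tier_hops(0, 5))  # 4: "perpendicular" traffic pays the core tax
```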
With the advent of cloud computing, the three-tier architecture was no longer adequate for massive data needs, much less the tremendous computational power necessary for AI. Enter the spine-leaf architecture: every leaf switch connects to every spine switch in a full mesh topology across two layers (spine switches form the backbone, while leaf switches sit closest to the servers). The foremost advantage is the massive interconnectivity, which limits bottlenecks and keeps any two leaves a single spine hop apart (i.e., leaf to spine to leaf). While this is an advantage for computational speed and power, it creates a more complex networking architecture, as scaling the system requires cabling every newly added leaf to every existing spine.
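A short sketch of the scaling math, with hypothetical switch counts, shows why full-mesh cabling grows quickly:

```python
# Minimal sketch of spine-leaf scaling math: a full mesh means every leaf
# connects to every spine, so link count grows multiplicatively. Switch
# counts are illustrative; real fabrics also weigh ports and oversubscription.

def fabric_links(spines: int, leaves: int) -> int:
    """Total leaf-to-spine links in a full-mesh spine-leaf fabric."""
    return spines * leaves

def links_added_by_new_leaf(spines: int) -> int:
    """Each new leaf must be cabled to every existing spine."""
    return spines

print(fabric_links(4, 16))         # 64 links for a modest 4-spine fabric
print(links_added_by_new_leaf(4))  # every added leaf needs 4 new cables
# Any leaf reaches any other leaf via one spine hop: leaf -> spine -> leaf.
```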
This shift means that cabling concerns exert significant pressure on the shape and proximity of AI server racks. At the same time, these servers’ power consumption limits rack density without extensive cooling. The longest links in an AI server rack are limited to approximately 100 m of fiber optic cable, which corresponds to roughly 0.5 microseconds of propagation delay. A superpod may contain up to 32 individual GPU servers; however, depending on the power capabilities of the facility, an organization may not be able to align all of its servers in a single row for latency savings.
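That 0.5-microsecond figure is easy to sanity-check: light in glass travels at roughly c divided by the fiber's group index. The sketch below assumes a typical group index of about 1.47:

```python
# Back-of-the-envelope check of the ~0.5 us figure: signals in fiber travel
# at roughly c / n, with a typical group index n ~ 1.47 (an assumption here).

C = 299_792_458     # speed of light in vacuum, m/s
GROUP_INDEX = 1.47  # typical optical fiber group index

def fiber_delay_us(length_m: float) -> float:
    """One-way propagation delay over a fiber run, in microseconds."""
    return length_m / (C / GROUP_INDEX) * 1e6

print(f"{fiber_delay_us(100):.2f} us")  # ~0.49 us for a 100 m link
print(f"{fiber_delay_us(50):.2f} us")   # ~0.25 us for a 50 m (VR-reach) link
```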
Transceivers’ Role in Server Performance
Transceiver selection also plays an outsized role in server rack performance. Because cabling lengths are limited to 100 m, the fiber itself is cheap and transceivers dominate the overall link cost. Parallel-fiber transceivers have a distinct advantage because they don’t require optical mux/demux operations (and the associated circuitry) for wavelength division multiplexing. Since that circuitry can add a non-negligible amount of power consumption and heat dissipation, parallel-fiber routing can reduce both energy usage and cooling costs. And while multimode fiber incurs greater cabling costs than single-mode fiber, multimode links still come out ahead, with the bulk of the savings due to the connectors. New VR-rated cables (short for “very short reach”), with a maximum reach of 50 m, can further improve power consumption and cost should the system routing support them.
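As a rough illustration of the power tradeoff, the sketch below compares fabric-wide transceiver draw for parallel-fiber versus WDM modules. The wattages and link count are invented placeholders, not vendor figures; only the shape of the comparison matters:

```python
# Hypothetical comparison of fabric-wide transceiver power draw. The wattage
# and link-count figures are placeholders to illustrate the tradeoff the
# article describes, not measured or vendor-published data.

PARALLEL_W = 10.0  # assumed draw for a parallel-fiber module, watts
WDM_W = 13.0       # assumed draw including mux/demux circuitry, watts

def fabric_power_kw(links: int, watts_per_module: float) -> float:
    """Total transceiver power for a fabric, two modules per link."""
    return links * 2 * watts_per_module / 1000

links = 512  # hypothetical fabric size
print(f"parallel: {fabric_power_kw(links, PARALLEL_W):.1f} kW")
print(f"WDM:      {fabric_power_kw(links, WDM_W):.1f} kW")
# The delta compounds: every watt of transceiver draw is also a watt of
# heat the cooling plant must remove.
```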
Another important factor in network design is the option of active optical cables (AOCs), which integrate the transceivers into the cable assembly rather than using standalone transceiver modules. One advantage of integrated transceivers over discrete modules is that AOCs don’t need the same “universality” to interface with various units; they eschew general interoperability for specificity. This lock-and-key system means installers don’t need experience with fiber cleaning and inspection beyond general quality checks. However, the same trait restricts system flexibility and maintenance, as replacing an AOC (due to failure or system upgrades) requires removing and rerouting the entire cable. Additionally, AOCs can experience a failure rate double that of discrete transceivers, greatly escalating lifetime system costs, with the further wrinkle that system downtime during AOC replacement detracts from computation time.
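A back-of-the-envelope model shows how a doubled failure rate compounds over a system's life. All prices, failure rates, and downtime costs below are hypothetical, and hardware prices are held equal to isolate the failure-rate effect:

```python
# Rough lifetime-cost sketch for AOCs vs. discrete transceivers. The prices,
# failure rates, and downtime cost are invented placeholders; the point is
# only that a doubled failure rate multiplies replacement AND downtime cost.

def lifetime_cost(unit_price: float, annual_failure_rate: float,
                  years: float, downtime_cost_per_failure: float) -> float:
    """Purchase price plus expected replacement and downtime costs."""
    expected_failures = annual_failure_rate * years
    return unit_price + expected_failures * (unit_price + downtime_cost_per_failure)

# Hypothetical inputs: the AOC link fails twice as often as a discrete one.
aoc      = lifetime_cost(unit_price=800, annual_failure_rate=0.04, years=5,
                         downtime_cost_per_failure=2_000)
discrete = lifetime_cost(unit_price=800, annual_failure_rate=0.02, years=5,
                         downtime_cost_per_failure=2_000)
print(f"AOC:      ${aoc:,.0f}")       # $1,360 under these assumptions
print(f"Discrete: ${discrete:,.0f}")  # $1,080 under these assumptions
```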
Your Contract Manufacturer Builds The Infrastructure for AI Servers
Designing and implementing an AI server rack can be challenging: data speed and cooling demands intensify as system complexity increases. Considering the highly competitive and rapidly evolving nature of AI systems (not to mention the expenses associated with cutting-edge technology), organizations cannot take any chances when building out their AI racks. Fortunately, VSE has the staff and expertise to guide all electronic manufacturing applications. Our engineers are committed to building electronics for our customers, including state-of-the-art enclosures that maximize system performance and reliability for stable uptimes. We’ve been realizing life-saving and life-changing electronic devices with our valued manufacturing partners for over forty years.