Microsoft Publishes Details about its 'Singularity' AI Infrastructure Service

Microsoft has revealed that it operates a planet-scale distributed scheduling service for AI workloads, which it has modestly dubbed “Singularity.”

Described in a paper authored by 26 Microsoft employees, Singularity aims to help the software giant control costs by driving high utilization for deep learning workloads.

The paper, entitled “Singularity: Planet-Scale, Preemptible and Elastic Scheduling of AI Workloads,” provides technical details about the service.

The service appears to give data scientists and AI practitioners a way to build and experiment with their models on a Microsoft-provided distributed infrastructure service built explicitly for AI.

It is worth mentioning that the authors listed on the newly published paper include Azure Chief Technical Officer Mark Russinovich and Partner Architect Rimma Nehme, who worked on Azure Cosmos DB before moving to work on AI in Azure.

“At the heart of Singularity is a novel, workload-aware scheduler that can transparently preempt and elastically scale deep learning workloads to drive high utilization without impacting their correctness or performance, across a global fleet of accelerators (e.g., GPUs, FPGAs),” the paper noted.  
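The quote's key claim is that preemption is transparent: a lower-priority job can be checkpointed and set aside when a higher-priority one arrives, then resumed later with no lost progress. The following toy sketch (not Microsoft's code; all names, such as `Scheduler` and `finish_running`, are illustrative) shows that behavior in miniature:

```python
# Toy sketch of priority-based, checkpointing preemption. A running job's
# state travels with it, so preempting it simply means parking it (state
# included) on the queue and resuming it later. Priorities are assumed
# distinct, so heap comparisons never reach the state dict.
import heapq

class Scheduler:
    def __init__(self):
        self.queue = []      # min-heap of (priority, name, state); lower wins
        self.running = None  # currently running (priority, name, state)

    def submit(self, priority, name, state=None):
        job = (priority, name, state or {"step": 0})
        if self.running is None:
            self.running = job
        elif priority < self.running[0]:
            heapq.heappush(self.queue, self.running)  # "checkpoint" & preempt
            self.running = job
        else:
            heapq.heappush(self.queue, job)

    def finish_running(self):
        """Retire the running job and resume the best queued one, if any."""
        done = self.running
        self.running = heapq.heappop(self.queue) if self.queue else None
        return done
```

For example, submitting a priority-1 job while a priority-5 job is running preempts the latter; once the urgent job finishes, the preempted job resumes from its saved state.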

In addition, company officials discussed plans to make field-programmable gate arrays (FPGAs) available to customers as a service. Microsoft went public about its “Project Brainwave” work in 2018, which was designed to provide fast AI processing in Azure.

That same year, Microsoft previewed Azure Machine Learning Hardware Accelerated Models powered by Brainwave in the cloud. This was considered the first step in making FPGA processing for AI workloads available to customers.

To make such transparency possible, the tech giant relies on what the authors call a “device proxy.” This tool “runs in its address space and has a one-to-one correspondence to a physical accelerator device. When a job worker initiates device APIs, they are intercepted and sent over the shared memory to the device proxy process that runs in a separate address space, and whose lifetime is decoupled from the lifetime of the worker process.”
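The pattern the authors describe can be sketched in a few lines. This is a hypothetical illustration, not Singularity's implementation: it uses a `multiprocessing` pipe in place of shared memory, and `DeviceHandle`, `device_proxy`, and the read/write operations are invented stand-ins for intercepted device APIs. The essential idea survives: the worker never touches the device directly, and the proxy process's lifetime is decoupled from the worker's.

```python
# Sketch of the "device proxy" pattern: worker-side device calls are
# intercepted and forwarded over IPC to a proxy process that owns the
# (simulated) accelerator state in a separate address space.
import multiprocessing as mp

def device_proxy(conn):
    """Proxy process: owns the simulated device state and serves requests."""
    device_memory = {}                      # stand-in for accelerator state
    while True:
        msg = conn.recv()
        if msg is None:                     # shutdown sentinel
            break
        op, key, value = msg
        if op == "write":
            device_memory[key] = value
            conn.send(None)                 # ack
        elif op == "read":
            conn.send(device_memory.get(key))

class DeviceHandle:
    """Worker-side stub: intercepts device calls, forwards them to the proxy."""
    def __init__(self, conn):
        self._conn = conn
    def write(self, key, value):
        self._conn.send(("write", key, value))
        self._conn.recv()                   # wait for ack
    def read(self, key):
        self._conn.send(("read", key, None))
        return self._conn.recv()

if __name__ == "__main__":
    parent_conn, child_conn = mp.Pipe()
    proxy = mp.Process(target=device_proxy, args=(child_conn,))
    proxy.start()                           # proxy outlives any one worker
    dev = DeviceHandle(parent_conn)
    dev.write("tensor0", [1.0, 2.0])
    print(dev.read("tensor0"))
    parent_conn.send(None)
    proxy.join()
```

Because device state lives in the proxy rather than in the worker, the worker process can be killed, checkpointed, or migrated without the device state going with it.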

With that decoupling in place, additional jobs can be scheduled more efficiently, keeping thousands of servers busy for more of the time. It also enables swift scaling, up or down, without disruption.
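The "scaling without disruption" point can be made concrete with a small sketch. This is an illustrative assumption, not Singularity's actual mechanism: if a job's work is partitioned across however many workers happen to be available, the scheduler can grow or shrink the worker count mid-job without changing what gets computed (the `partition` helper below is hypothetical).

```python
# Toy illustration of elastic scaling: the same set of work items can be
# dealt round-robin across any number of workers, so resizing the worker
# pool redistributes work without losing or duplicating any of it.
def partition(items, n_workers):
    """Split items round-robin into n_workers shards."""
    shards = [[] for _ in range(n_workers)]
    for i, item in enumerate(items):
        shards[i % n_workers].append(item)
    return shards

batches = list(range(8))
wide = partition(batches, 4)     # scaled up: four workers
narrow = partition(batches, 2)   # scaled down: two workers
# Either way, the union of shards is exactly the original workload.
assert sorted(sum(wide, [])) == batches
assert sorted(sum(narrow, [])) == batches
```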

“Singularity achieves a significant breakthrough in scheduling deep learning workloads, converting niche features such as elasticity into mainstream, always-on features that the scheduler can rely on for implementing stringent SLAs,” the paper concludes.