Pooling
Janction does this by integrating deep learning and other compute services into tools that support a microservices architecture, where each function runs in a separate container (Docker, for example). We start with a simple virtualization of GPU compute resources for use by different container applications: the physical GPU is split into multiple fixed-size vGPUs, each with an equal share of compute power and graphics memory. In practice, however, different AI models have very different requirements for compute power and graphics memory, so this uniform slicing approach cannot meet their diverse needs. The main GPU virtualization techniques are Time-slicing, Multi-Process Service (MPS), and Multi-Instance GPU (MIG).
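A minimal sketch of the limitation described above, using hypothetical model sizes: when a GPU is cut into equal fixed-size vGPUs, small models leave memory idle inside their slice while large models cannot fit into any single slice.

```python
# Minimal sketch (hypothetical numbers): why equal-size vGPU slices waste resources.
# A 24 GB GPU is cut into fixed 6 GB vGPUs; models with uneven memory needs either
# leave headroom idle or cannot fit at all.

GPU_MEMORY_GB = 24
VGPU_SIZE_GB = 6                       # fixed-size slice
vgpu_count = GPU_MEMORY_GB // VGPU_SIZE_GB  # 4 equal vGPUs

# Hypothetical per-model graphics-memory requirements (GB)
models = {"small-classifier": 2, "bert-base": 5, "llm-13b-int8": 14}

for name, need in models.items():
    if need <= VGPU_SIZE_GB:
        waste = VGPU_SIZE_GB - need
        print(f"{name}: fits in one vGPU, {waste} GB of the slice sits idle")
    else:
        print(f"{name}: needs {need} GB, cannot run on a single {VGPU_SIZE_GB} GB vGPU")
```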
Time-slicing uses a simple oversubscription strategy to invoke the GPU's time-slice scheduler, so that multiple CUDA applications execute concurrently by sharing the GPU in time. Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA Application Programming Interface (API) that allows multiple streams or CPU processes to issue CUDA kernel calls to the GPU at the same time, funneling them into a single shared context on the GPU. The Multi-Instance GPU (MIG) feature, available starting with the NVIDIA Ampere architecture, securely partitions a GPU into up to seven separate GPU instances, dividing the system's available GPU compute resources (streaming multiprocessors (SMs), GPU engines, and so on) so that different clients, such as virtual machines, containers, or processes, can run their applications on a single GPU with quality-of-service guarantees and fault isolation.
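As an illustration of the MIG option, the sketch below shells out to nvidia-smi to enable MIG mode and carve a GPU into two isolated instances. The flags follow the NVIDIA MIG user guide, and the profile ID shown assumes an A100-40GB; verify both against your driver version before relying on this.

```python
# Hedged sketch: partitioning an Ampere-class GPU into MIG instances via nvidia-smi.
# Flags and profile IDs are taken from the NVIDIA MIG user guide and should be
# checked against the installed driver; requires root privileges.
import subprocess

def run(cmd: list[str]) -> str:
    """Run a command and return its stdout, raising on failure."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# 1. Enable MIG mode on GPU 0 (may require a GPU reset to take effect).
run(["nvidia-smi", "-i", "0", "-mig", "1"])

# 2. List the GPU instance profiles the hardware supports
#    (e.g. 1g.5gb ... 7g.40gb on an A100-40GB).
print(run(["nvidia-smi", "mig", "-lgip"]))

# 3. Create two GPU instances from profile ID 9 (assumed to be 3g.20gb on an
#    A100-40GB) and a compute instance inside each (-C), yielding two isolated slices.
run(["nvidia-smi", "mig", "-i", "0", "-cgi", "9,9", "-C"])

# 4. Confirm the instances exist.
print(run(["nvidia-smi", "mig", "-lgi"]))
```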
Each microservice interoperates with the others in a lightweight, loosely coupled way via RESTful APIs. Janction provides Serverless container batch-management and scheduling interfaces as the backend of the entire distributed AI service; these interfaces automatically manage the allocation of compute resources, storage, and bandwidth in the GPU market in real time. Janction then builds cross-cluster compute-service routing protocols on top of VxLAN and applies appropriate scheduling policies (Gang scheduling, Fair-share scheduling, DRF scheduling) across the distributed GPU network, completing the microservice-based implementation of the Serverless distributed resource-pooling system.
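Of the scheduling policies named above, DRF (Dominant Resource Fairness) is the least self-explanatory, so here is a minimal sketch of its core loop. The cluster capacity and per-tenant demands are hypothetical; a real Janction scheduler would pull these figures from the GPU market in real time.

```python
# Minimal sketch of Dominant Resource Fairness (DRF): repeatedly give the next
# task to the tenant whose largest share of any single resource is smallest.

total = {"gpu": 8, "mem_gb": 64}                  # hypothetical cluster capacity
demands = {                                        # per-task demand of each tenant
    "tenant_a": {"gpu": 1, "mem_gb": 4},
    "tenant_b": {"gpu": 2, "mem_gb": 2},
}
allocated = {t: {"gpu": 0, "mem_gb": 0} for t in demands}
used = {"gpu": 0, "mem_gb": 0}

def dominant_share(tenant: str) -> float:
    """A tenant's largest share of any single resource (the DRF criterion)."""
    return max(allocated[tenant][r] / total[r] for r in total)

while True:
    # Pick the tenant with the smallest dominant share.
    tenant = min(demands, key=dominant_share)
    demand = demands[tenant]
    # Stop once that tenant's next task no longer fits in the remaining capacity.
    if any(used[r] + demand[r] > total[r] for r in total):
        break
    for r in total:
        used[r] += demand[r]
        allocated[tenant][r] += demand[r]

print(allocated)   # per-tenant GPU and memory allocation after convergence
```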
In Janction's GPU pooling design, the Janction Network serves as the Service Mesh for the pooling process and is mainly responsible for scheduling management, monitoring, and system security. Each GPU resource pool is created by deploying a Controller contract on the Janction Network, which handles service registration and discovery for each functional component, scheduling and allocation of vGPUs, metadata storage and management, and license management. These contract tasks are not executed entirely on-chain; some may be executed off-chain by calling the SDK.

Physical machines provided by GPU contributors are virtualized through the CUDA Driver and the Janction network, and managed with Docker. Renters of compute resources use vGPUs by calling the Janction Server API, and the process builds virtual routes for compute resources based on VxLAN. The Janction Server API wraps the Client Runtime and the surrounding ecosystem libraries, such as cudart, cublas, cudnn, cufft, and cusparse, for developing parallel computing programs; in addition, it provides users with functions for requesting, releasing, managing, and monitoring GPU resources. Beneath the application layer that executes tasks sits the framework layer: TensorFlow, PyTorch, Paddle, and other frameworks commonly used in AI training and inference.

For system data input, we designed a Data Plane composed mainly of IPFS (distributed storage) and memory, and physical machines communicate with the Data Plane directly through GPUDirect RDMA. GPUDirect with RDMA enables direct memory access between GPUs and other devices such as network adapters, storage systems, and other GPUs, allowing data to be transferred efficiently without CPU involvement, which reduces latency and increases bandwidth.
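To make the renter-side flow concrete, the sketch below shows how a compute-resource renter might request, monitor, and release a vGPU over REST. The base URL, endpoint paths, and JSON fields are purely illustrative assumptions, not the documented Janction Server API; the example uses the Python requests library.

```python
# Purely illustrative sketch of a renter-side workflow against the Janction Server API.
# Endpoint paths, field names, and the base URL are assumptions, not a documented interface.
import requests

BASE_URL = "https://api.janction.example/v1"   # hypothetical endpoint

# Request a vGPU sized for the workload's compute and graphics-memory needs.
resp = requests.post(f"{BASE_URL}/vgpus", json={
    "memory_gb": 16,          # hypothetical field: graphics memory requested
    "compute_units": 2,       # hypothetical field: share of SMs requested
    "framework": "pytorch",   # framework layer the container image should provide
})
resp.raise_for_status()
vgpu = resp.json()            # assumed to contain an id and connection info

try:
    # Monitor utilization while the training or inference task runs elsewhere.
    status = requests.get(f"{BASE_URL}/vgpus/{vgpu['id']}").json()
    print(status)
finally:
    # Release the vGPU so the pool can reschedule it for other renters.
    requests.delete(f"{BASE_URL}/vgpus/{vgpu['id']}")
```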