⚡Colocation of Idle Processor Computing Power

Beyond brokering deals between GPU providers and renters, Janction Network wants to put unused processor resources to work. No device uses 100% of its processor resources for the entire time it is powered on. For example, an RTX 4090 owner does not play games all the time, and the GPU is effectively idle while the owner is at work or studying. This is a typical Colocation scenario: the user's gaming session can be treated as an online job, and during the periods when it is not running, the user can make the processor available to run offline jobs. The same reasoning applies to CPU resources and to the ARM processors in mobile phones.

Janction hopes to make full use of the idle time of users' processors through efficient Colocation, using them to complete distributed offline jobs without affecting the user's experience. Scheduling decision methods for Colocation include whole-machine time-sharing multiplexing, partial resource sharing, and complete resource sharing. In a Colocation scenario with idle processor resources of different specifications, complete resource sharing best meets the requirements. Complete resource sharing is a dynamic scheduling decision process that reuses machine resources in a more efficient and automated way. It relies on timely and accurate resource prediction, the ability to respond quickly to resource changes, and a set of service safeguards that can be applied when the resource utilization level (the "water level") changes. Resource attribution is not predetermined, and decisions are based entirely on real-time metrics.

The Aggregator will be responsible for discovering and integrating the idle processor resources provided by users, containerizing and isolating these resources, and setting up reasonable scheduling policies, including dynamic allocation of all resources within a single machine, such as processor, memory, disk I/O, and network I/O.
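The idea of deciding from real-time metrics, with prediction and a safeguard when the utilization level changes, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `NodeMetrics` fields, the `safety_margin` value, and the function names are hypothetical and do not describe Janction's actual Aggregator.

```python
from dataclasses import dataclass


@dataclass
class NodeMetrics:
    """Real-time snapshot of one device (hypothetical fields)."""
    capacity: float          # total processor capacity, 1.0 means 100%
    online_usage: float      # share currently used by the owner's online job
    predicted_online: float  # forecast of online usage for the next interval


def offline_budget(m: NodeMetrics, safety_margin: float = 0.1) -> float:
    """Capacity that may be lent to offline jobs right now.

    Reserves the larger of current and predicted online usage plus a
    safety margin (an assumed value), so the owner's job keeps priority.
    """
    reserved = max(m.online_usage, m.predicted_online) + safety_margin
    return max(0.0, m.capacity - reserved)


def throttle_amount(m: NodeMetrics, offline_usage: float,
                    safety_margin: float = 0.1) -> float:
    """How much running offline work must be scaled back (0.0 if none).

    This plays the role of a safeguard: when online usage rises, the
    budget shrinks and offline jobs are asked to give capacity back.
    """
    budget = offline_budget(m, safety_margin)
    return max(0.0, offline_usage - budget)
```

In this sketch no capacity is assigned to offline work in advance. The budget is recomputed from each new snapshot, so a device whose owner starts a game sees its offline budget drop toward zero, and `throttle_amount` reports how much offline work has to yield.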

In terms of hybrid online and offline job scheduling, Google's Borg and Tencent's YARD use a unified scheduling architecture, i.e., online and offline jobs are scheduled by a single scheduler. Baidu Matrix and Ali Fuxi & Sigma instead schedule online and offline jobs separately, i.e., each type of job is handled by its own scheduler (Sigma and Fuxi are the original online and offline job management systems at Ali; Sorlaria and Normandy are the original online and offline job management systems at Baidu). The separate scheduling architecture reuses the code of the original systems and reduces the complexity of engineering implementation, but it has some limitations: (1) the online and offline job schedulers and the underlying resource manager do not share information, so additional components need to be introduced to coordinate online and offline resources (e.g., Level-0 in Fuxi & Sigma); (2) the system architecture is more complex than unified scheduling, and each node runs at least three agents, which brings additional resource overhead.

From the perspective of scheduling algorithms, Borg and Sigma adopt scheduling based on scoring rules, which include scoring by load, scoring by the amount of resource fragmentation, and scoring in combination with job priority. Fuxi, Matrix, and YARD improve on the YARN scheduling algorithm: they take nodes' spare resources into account during scheduling and use job profiles and node profiles to predict the resource demand of jobs and the future available resources of nodes.
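Scoring-rule scheduling of the kind described above can be sketched in a few lines. This is not the algorithm used by Borg or Sigma. The `Node` and `Job` types, the two scoring terms, and the weights `w_load` and `w_frag` are assumptions chosen only to show how load, fragmentation, and job priority can be combined.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    capacity: float  # total processor capacity
    used: float      # capacity already in use


@dataclass
class Job:
    name: str
    demand: float    # capacity the job needs
    priority: int    # higher value is scheduled first


def score(node: Node, job: Job,
          w_load: float = 0.7, w_frag: float = 0.3) -> Optional[float]:
    """Score a node for a job, or return None if the job does not fit."""
    free = node.capacity - node.used
    if job.demand > free:
        return None
    # Load rule: prefer lightly loaded nodes.
    load_score = 1.0 - node.used / node.capacity
    # Fragmentation rule: prefer placements that leave little unused capacity.
    frag_score = 1.0 - (free - job.demand) / node.capacity
    return w_load * load_score + w_frag * frag_score


def schedule(jobs: List[Job], nodes: List[Node]) -> Dict[str, Optional[str]]:
    """Place jobs in priority order on the best-scoring node.

    Updates node.used in place and maps each job name to a node name,
    or to None when no node has enough free capacity.
    """
    placement: Dict[str, Optional[str]] = {}
    for job in sorted(jobs, key=lambda j: j.priority, reverse=True):
        best, best_score = None, None
        for node in nodes:
            s = score(node, job)
            if s is not None and (best_score is None or s > best_score):
                best, best_score = node, s
        if best is not None:
            best.used += job.demand
        placement[job.name] = best.name if best is not None else None
    return placement
```

Priority enters through the order in which jobs are considered, so a high-priority job claims scarce capacity before a low-priority one. The weights decide the trade-off between spreading load and packing tightly. With equal weights and equal node capacities the two rules cancel out, which is why the sketch weights load more heavily.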

