Empowering the AI Revolution: The Global Distributed Resource Network of MATRIX (2/4)

April 9, 2024

4. The Technical Principle of MANTA

MANTA, which stands for MATRIX AI Network Training Assistant, is a distributed automatic machine learning platform built on the MATRIX high-performance blockchain. It is an application and deployment system for automatic machine learning (AutoML) accelerated by distributed computing. Its initial application areas are image classification in computer vision and 3D depth estimation tasks. Using an AutoML network search algorithm, MANTA searches for high-accuracy, low-latency deep models and leverages distributed computing to accelerate the search. The core functionality of MANTA is divided into two main parts: Automatic Machine Learning (AutoML) and Distributed Machine Learning.

4.1 Automatic Machine Learning - The Once-for-all (OFA) Network Search Algorithm

MATRIX's automatic machine learning adopts the Once-for-all (OFA) network search algorithm. The main idea is to construct a Super Net with an enormous network structure search space, from which a large number of sub-networks can be derived. Through a progressive pruning search strategy, the Super Net's network parameters are adjusted so that sub-network structures of different scales can be sampled from the Super Net without fine-tuning the sampled sub-networks, thus reducing the cost of network deployment.
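To make the sampling step concrete, here is a minimal Python sketch of drawing one sub-network configuration from an OFA-style search space. The dimension names (depth, kernel size, expansion ratio) follow the published OFA formulation, and the function and variable names are illustrative assumptions rather than MANTA's actual interface.

```python
import random

# Illustrative OFA-style search space (hypothetical names, not MANTA's API).
# Each searchable unit can vary in depth, per-layer kernel size, and width.
SEARCH_SPACE = {
    "depth": [2, 3, 4],          # number of layers per unit
    "kernel_size": [3, 5, 7],    # convolution kernel sizes
    "expand_ratio": [3, 4, 6],   # channel expansion ratios
}

def sample_subnet_config(num_units=5, space=SEARCH_SPACE):
    """Sample one sub-network configuration from the Super Net search space."""
    config = []
    for _ in range(num_units):
        depth = random.choice(space["depth"])
        layers = [
            {
                "kernel_size": random.choice(space["kernel_size"]),
                "expand_ratio": random.choice(space["expand_ratio"]),
            }
            for _ in range(depth)
        ]
        config.append({"depth": depth, "layers": layers})
    return config

# Because the Super Net is trained with a progressive pruning strategy, a
# sampled sub-network can reuse the shared weights directly, without
# fine-tuning.
subnet = sample_subnet_config()
print(subnet[0])
```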

For image classification, the network structure used by MANTA is shown in Image 2. This structure has 5 searchable units; the first convolutional layer in each unit uses a stride of 2 and all other convolutional layers use a stride of 1, reducing spatial dimensions while extracting multi-channel features. The network concludes with a SoftMax regression layer that outputs the probability of each object category, classifying the given images accordingly.

Image 2: Network Structure Model for Image Classification
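The following PyTorch-style sketch illustrates the described topology: five searchable units whose first convolution uses stride 2, followed by a SoftMax classification head. The channel widths, unit depth, and number of classes are placeholder assumptions, not MANTA's actual configuration.

```python
import torch
import torch.nn as nn

class SearchableUnit(nn.Module):
    """One searchable unit: the first conv uses stride 2 (downsampling),
    all remaining convs use stride 1, as described above."""
    def __init__(self, in_ch, out_ch, depth=3, kernel_size=3):
        super().__init__()
        layers = []
        for i in range(depth):
            stride = 2 if i == 0 else 1
            layers += [
                nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                          kernel_size, stride=stride, padding=kernel_size // 2),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            ]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

class ClassificationNet(nn.Module):
    """Five searchable units followed by a SoftMax classification head.
    Channel widths and the number of classes are illustrative placeholders."""
    def __init__(self, num_classes=1000, widths=(32, 64, 128, 256, 512)):
        super().__init__()
        units, in_ch = [], 3
        for w in widths:                      # 5 searchable units
            units.append(SearchableUnit(in_ch, w))
            in_ch = w
        self.units = nn.Sequential(*units)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_ch, num_classes)

    def forward(self, x):
        x = self.pool(self.units(x)).flatten(1)
        return torch.softmax(self.fc(x), dim=1)  # per-class probabilities

probs = ClassificationNet()(torch.randn(1, 3, 224, 224))
```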

The network structure used by MANTA for depth estimation is depicted in Image 3. This structure has 6 searchable units. The first three units mainly perform dimension reduction and extract multi-scale feature maps from the left and right views, with a Correlation Layer in between to compute the disparity cost between the left and right feature maps. The last three units extract further features from this disparity cost. Six transposed convolutional layers then gradually upscale the reduced feature maps, which are linked via U-shaped connections to the feature maps output by the preceding 6 units, predicting depth maps at different scales.

Image 3: Network Structure Model for Depth Estimation
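The Correlation Layer is the key component linking the left and right branches. Below is a minimal sketch of how such a layer might compute a disparity cost volume from left and right feature maps; the maximum disparity value and the tensor shapes are illustrative assumptions, not documented MANTA settings.

```python
import torch
import torch.nn.functional as F

def correlation_layer(left_feat, right_feat, max_disp=40):
    """Compute a disparity cost volume between left/right feature maps.

    For each candidate disparity d, the right feature map is shifted by d
    pixels and correlated (channel-wise dot product) with the left one,
    yielding one cost channel per disparity.
    """
    costs = []
    for d in range(max_disp):
        if d == 0:
            corr = (left_feat * right_feat).mean(dim=1, keepdim=True)
        else:
            shifted = F.pad(right_feat[:, :, :, :-d], (d, 0, 0, 0))
            corr = (left_feat * shifted).mean(dim=1, keepdim=True)
        costs.append(corr)
    return torch.cat(costs, dim=1)  # shape: (batch, max_disp, height, width)

left = torch.randn(1, 64, 60, 80)
right = torch.randn(1, 64, 60, 80)
cost_volume = correlation_layer(left, right)
```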

4.2 Distributed Machine Learning - Distributed Data/Model Parallel Algorithms

Within MANTA, GPU-based distributed parallel computing is used to further accelerate network search and training. The essence of MANTA's distributed machine learning is its distributed parallel algorithms, which support both data-parallel and model-parallel acceleration. The main idea of the distributed data parallel algorithm is to distribute each iteration's batch of data across different GPUs for forward and backward computation; in each iteration, every GPU samples the same sub-model. The distributed model parallel algorithm differs in that MANTA allows different nodes to sample different sub-networks in each iteration; after gradients are aggregated within each node, global gradient information is exchanged across nodes.
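Both schemes ultimately rely on an All-Reduce to average gradients across processes. A minimal sketch of that primitive is shown below, assuming PyTorch with a torch.distributed process group already initialized (e.g. via the NCCL backend); it is a generic building block, not MANTA's specific implementation.

```python
import torch.distributed as dist

def all_reduce_average(model):
    """Average every parameter gradient across all participating GPUs.

    Assumes a torch.distributed process group has already been initialized
    and that backward() has been called on each process.
    """
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
```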

4.2.1 The process of the distributed data parallel algorithm is as follows:

1. Sample 4 sub-networks from the Super Net.

2. For each sub-network, every GPU in each node performs the following operations:

a. Load a batch of training data of the configured Batch Size;

b. Perform forward and backward propagation on the current sub-network;

c. Accumulate gradient information from the backward propagation.

3. Aggregate and average the gradients accumulated on all GPUs, then broadcast the result to every GPU (an All-Reduce operation).

4. Each GPU updates its Super Net with the obtained global gradient information.

5. Return to step 1.
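A minimal sketch of one such iteration, assuming PyTorch with torch.distributed already initialized; supernet.sample_subnet and the cross-entropy loss are illustrative stand-ins for whatever interface and objective MANTA's Super Net actually uses. Broadcasting a shared seed is one way to keep every GPU's sub-network sampling synchronized, which is what makes plain gradient averaging valid here.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def data_parallel_step(supernet, optimizer, data_iter, num_subnets=4):
    """One iteration of the distributed data-parallel procedure above.

    Every GPU processes its own shard of the batch but samples the SAME
    sub-networks, synchronized here by broadcasting a shared seed.
    `supernet.sample_subnet()` is an assumed Super Net interface.
    """
    optimizer.zero_grad()

    # Keep sub-network sampling identical on all GPUs.
    seed = torch.randint(0, 2**31 - 1, (1,), device="cuda")
    dist.broadcast(seed, src=0)
    torch.manual_seed(int(seed.item()))

    for _ in range(num_subnets):                  # Step 1: 4 sub-networks
        subnet = supernet.sample_subnet()         # shares Super Net weights
        images, labels = next(data_iter)          # Step 2a: one local batch
        loss = F.cross_entropy(subnet(images.cuda()), labels.cuda())
        loss.backward()                           # Steps 2b-2c: accumulate

    world_size = dist.get_world_size()
    for p in supernet.parameters():               # Step 3: All-Reduce average
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

    optimizer.step()                              # Step 4: update Super Net
```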

4.2.2 The process of distributed model parallelism (taking 4 nodes as an example) is as follows:

1. GPUs located on the same node sample the same sub-network model, giving a total of 4 different sub-network models across the 4 nodes.

2. Every GPU on the same node performs the following operations:

a. Load a batch of training data of the configured Batch Size;

b. Perform forward and backward propagation on the sub-network sampled by the node;

c. Calculate the gradient information from backward propagation;

d. GPUs within the same node aggregate and average the gradients, storing them in the node's CPU memory.

3. The 4 nodes aggregate and average the gradient information obtained in step 2 and broadcast it to all GPUs.

4. Each GPU updates its Super Net with the obtained global gradient information.

5. Return to step 1.
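A corresponding sketch of one model-parallel iteration under the same assumptions; node_group is an intra-node process group created with dist.new_group(), and supernet.sample_subnet(seed=...) is again an assumed interface. The two-stage averaging mirrors steps 2d and 3: gradients are first averaged within a node, then across nodes.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def model_parallel_step(supernet, optimizer, data_iter,
                        node_group, gpus_per_node, step):
    """One iteration of the distributed model-parallel procedure above.

    GPUs within one node share a sub-network (seeded from the node rank),
    average gradients inside the node first, then across all nodes.
    `supernet.sample_subnet(seed=...)` is an assumed interface.
    """
    optimizer.zero_grad()
    world_size = dist.get_world_size()
    node_rank = dist.get_rank() // gpus_per_node

    # Step 1: each node samples its own sub-network (4 distinct models when
    # there are 4 nodes); the seed varies per iteration and per node.
    subnet = supernet.sample_subnet(seed=step * world_size + node_rank)

    # Steps 2a-2c: forward/backward on this node's sub-network.
    images, labels = next(data_iter)
    loss = F.cross_entropy(subnet(images.cuda()), labels.cuda())
    loss.backward()

    # Step 2d: average gradients among the GPUs of this node only.
    for p in supernet.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=node_group)
            p.grad /= gpus_per_node

    # Step 3: every GPU now holds its node's average, so summing over all
    # GPUs and dividing by world_size yields the mean of the node averages.
    for p in supernet.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

    optimizer.step()                              # Step 4: update Super Net
```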

Image 4: MANTA's Task Training Monitoring Interface
