A Bilingual Pretrained Model Based on MATRIX Mainnet
September 15, 2023
At the beginning of 2022, as we were wrapping up our work on Matrix 2.0 and beginning to brainstorm for Matrix 3.0, we proposed the prototype of Morpheus. Our idea at the time was to pretrain a high-precision bilingual (Chinese/English) model and serve it to users on the Matrix Mainnet through a distributed network. Two considerations motivated this idea. First, such models have a wide range of applications, yet high-quality models were rarely available to the public at the time. Second, with the Metaverse on its initial rise, we believed that a pretrained large language model would bring immense value to its development and would play a major role in Matrix 3.0.
However, we faced many challenges:
◆ Lack of computational resources: at the time, we did not have sufficient computational resources to complete the project;
◆ Lack of high-quality pretraining algorithms: high-quality pretraining algorithms for bilingual models still needed to be validated and improved;
◆ Lack of fast inference methods: fast inference is a prerequisite for running the model in a distributed manner on low-configuration GPU servers.
For the pretraining architecture, we chose the General Language Model (GLM) previously proposed by Tsinghua University, which has shown impressive performance on multiple tasks. After several rounds of intense debate, we ultimately decided to train a GLM model with 500 to 1,000 billion parameters. At this parameter scale, we can ensure high precision while keeping the hardware requirements manageable. For example, the model can run single-machine inference on an A100 server.
Although we made the decision and got started, we quickly realized that we had greatly underestimated the technical difficulty of training a model at the trillion-parameter scale. We ran into frequent random hardware failures, gradient explosions, unexpectedly high memory usage in the algorithm, the inability to recover from saved optimizer states, TCP congestion between machines, and many other unexpected bugs. Although the project was delayed several times, after more than a year we had largely completed the work according to plan. We would like to thank the Tsinghua University lab for providing the computational power to start training, and the several high-performance nodes on the Matrix Mainnet for providing stable computational support in the later stage of the project.
As of the first version, now in public beta, the Morpheus model has been trained on over 100 trillion text tokens, and its few-shot performance on the Massive Multitask Language Understanding (MMLU) benchmark has reached and surpassed the level of GPT-3.
In the current online version, Morpheus is a bilingual (Chinese and English) bidirectional language model with about 30 billion parameters. Its underlying architecture is based on the General Language Model (GLM), and it has been pretrained on over 100 trillion text tokens. Morpheus uses auto-regressive blank filling as its main pretraining objective: during pretraining, randomly chosen contiguous spans of text are masked, and the model predicts them auto-regressively.
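To make the blank-filling objective described above more concrete, the following is a minimal sketch of how a single training example could be built. It is an illustration under our own assumptions, not the Morpheus training code: the function name, the placeholder tokens `[MASK]`, `<sop>`, and `<eop>`, and the single-span simplification are all hypothetical.

```python
import random


def blank_fill_example(tokens, span_len, rng,
                       mask="[MASK]", sop="<sop>", eop="<eop>"):
    """Mask one contiguous span and build (context, decoder_input, target).

    All token names here are illustrative placeholders.
    """
    if not 0 < span_len <= len(tokens):
        raise ValueError("span_len must be between 1 and len(tokens)")
    # Pick a random start position for the span to be masked.
    start = rng.randrange(len(tokens) - span_len + 1)
    span = tokens[start:start + span_len]
    # Corrupted context: the whole span collapses into a single mask token.
    context = tokens[:start] + [mask] + tokens[start + span_len:]
    # The span is then predicted left to right, one token per step:
    # at step i the model sees decoder_input[:i + 1] and predicts target[i].
    decoder_input = [sop] + span
    target = span + [eop]
    return context, decoder_input, target


rng = random.Random(0)
tokens = "the model is trained on bilingual text".split()
context, decoder_input, target = blank_fill_example(tokens, span_len=2, rng=rng)
print(context)
print(decoder_input, "->", target)
```

The context is read bidirectionally, while the masked span is generated token by token; a real pipeline would sample several spans per sequence and vary their lengths.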
In actual training, Morpheus uses two different mask tokens, [MYTH] and [gMYTH], for short and long text generation respectively. In addition, it adopts recent techniques such as Rotary Position Embedding (RoPE), DeepNorm layer normalization, and Gaussian Error GLU (GeGLU). All of these designs and techniques contribute to the stable training and high-precision performance of Morpheus. Specifically, in the current public beta version, the model has 55 Transformer layers, a hidden dimension of 7,200, a maximum sequence length of 2,048, and a bilingual tokenizer based on icetk with a vocabulary of 100,000 tokens.
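As a rough sanity check on these figures, a widely used approximation puts a dense Transformer layer at about 12·h² parameters (about 4·h² for the attention projections and about 8·h² for the feed-forward block), where h is the hidden dimension. The sketch below applies that approximation to the numbers above. It is an estimate under that assumption, ignoring biases and normalization weights, and not an official parameter count; the function name is hypothetical.

```python
def estimate_dense_params(num_layers, hidden_size, vocab_size):
    """Approximate parameter count of a dense Transformer language model."""
    # ~4*h^2 for the Q/K/V/output projections plus ~8*h^2 for the feed-forward block.
    per_layer = 12 * hidden_size ** 2
    # Token embedding table (assumed shared with the output layer).
    embedding = vocab_size * hidden_size
    return num_layers * per_layer + embedding


total = estimate_dense_params(num_layers=55, hidden_size=7200, vocab_size=100_000)
print(f"{total / 1e9:.1f}B parameters")  # prints "34.9B parameters"
```

The estimate lands in the same range as the roughly 30 billion parameters quoted for the current version; the formula is too coarse to say more than that.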
Morpheus has been pretrained on over 100 trillion bilingual tokens (30 trillion English and 70 trillion Chinese). Its pretraining objective has two parts. The first part (about 90%) is self-supervised pretraining, i.e., auto-regressive blank filling on large-scale public corpora and some smaller Chinese corpora. The second part (about 10%) is multi-task instruction pretraining on sampled subsets of 50 different datasets from T0++ and DeepStruct, formatted as instruction-based, multi-task, multi-prompt sequence-to-sequence generation. This design enables Morpheus to perform zero-shot learning on other datasets and zero-shot transfer from English to Chinese.
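One simple way to realize such a 90/10 split is to draw the objective independently for each training example. The sketch below assumes exactly that; the text does not specify whether the split is measured per example or per token, so the sampling unit, the function name, and the labels are all hypothetical.

```python
import random


def sample_objective(rng, instruction_fraction=0.10):
    """Pick the pretraining objective for one example (assumed per-example mixing)."""
    if rng.random() < instruction_fraction:
        return "multi_task_instruction"
    return "blank_filling"


rng = random.Random(0)
counts = {"blank_filling": 0, "multi_task_instruction": 0}
for _ in range(10_000):
    counts[sample_objective(rng)] += 1

# The observed share of instruction examples should be close to 10%.
print(counts["multi_task_instruction"] / 10_000)
```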
At present, only the first public beta version of Morpheus has been launched, and its training and development are still ongoing. Taking into account the overall Matrix 3.0 plan and the current status of Morpheus, our work for the foreseeable future will focus on the following directions:
A. Further training of Morpheus: According to Chinchilla's estimate, the optimal number of training tokens for a trillion-parameter language model should be around 4.0T, about 20 times more than we have trained on so far. In addition, the parameter count of the current version of Morpheus is only about 40% of our first-stage goal, so Morpheus still has a lot of room for growth.
B. Using the Mixture-of-Experts (MoE) method to expand model size: The Mixture-of-Experts method has been proven to be an effective way to expand model parameters. However, MoE models do not perform as well as dense models at the same scale, and the current work on Morpheus is mainly based on dense models. To improve the model, we will start expanding Morpheus with MoE technology, for example through FastMoE and its accelerated version FasterMoE, to further increase its parameter scale and thereby achieve better performance.
C. Having personalized conversation and thinking abilities: Morpheus was designed with several goals in mind. One is to showcase our technical capabilities in artificial intelligence. Another is to make it an important piece of infrastructure for the Intelligent Contract generation platform. More importantly, in line with the objectives of MATRIX 3.0, we hope Morpheus can learn the personality and thinking style of its dialogue partners and build an effective model of them. We will start this work after several more iterations of Morpheus have been completed. This is a very challenging direction, and we believe there will be a large market for it in the future.
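For direction B, the core idea of Mixture-of-Experts can be sketched in a few lines: a gate scores every expert for each token, only the top-k experts are evaluated, and their outputs are combined using the normalized gate weights. The example below is a generic top-k gating sketch in plain Python under our own assumptions. It is not the FastMoE or FasterMoE API, and every name in it is hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def top_k_route(gate_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    if not 1 <= k <= len(gate_logits):
        raise ValueError("k must be between 1 and the number of experts")
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    # Renormalize the gate scores over the selected experts only.
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, gate_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs for one token vector x."""
    out = [0.0] * len(x)
    for idx, w in top_k_route(gate_logits, k):
        y = experts[idx](x)  # only k experts are evaluated for this token
        out = [o + w * yi for o, yi in zip(out, y)]
    return out


# Four toy "experts", each scaling its input by a different factor.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
gate_logits = [0.1, 2.0, -1.0, 2.0]
routes = top_k_route(gate_logits, k=2)
y = moe_forward([1.0, 1.0], gate_logits, experts, k=2)
print(routes, y)
```

Because each token activates only k experts, adding experts increases the total parameter count while the per-token compute stays roughly constant, which is what makes MoE attractive for scaling.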