New work from the Tsinghua Zhu Jun team: training Transformers with 4-bit integers to accelerate the arrival of AGI!
Reprinted from Xinzhiyuan
Editors: Aeneas, Run
Quantizing activations, weights, and gradients to 4 bits promises to accelerate neural network training.
However, existing 4-bit training methods require custom number formats that modern hardware does not support.
Recently, the Tsinghua Zhu Jun team proposed a Transformer training method that implements all matrix multiplications with INT4 arithmetic.
Training at ultra-low INT4 precision is very challenging. To achieve this goal, the researchers carefully analyzed the specific structures of activations and gradients in Transformers and designed dedicated quantizers for them.
For forward propagation, they identified activation outliers as the key challenge and proposed a Hadamard quantizer to suppress them.
For backward propagation, they exploit the structural sparsity of gradients by proposing bit splitting, and use leverage score sampling to quantize gradients accurately.
The new algorithm achieves competitive accuracy on a wide range of tasks, including natural language understanding, machine translation, and image classification.
The prototype linear operators run 2.2 times faster than their FP16 counterparts, and overall training is sped up by 35.1%.
Paper address:
https://arxiv.org/abs/2306.11987
Code address:
https://github.com/xijiu9/Train_Transformers_with_INT4
New INT4 training algorithm
Training neural networks is computationally demanding. Training with low-precision arithmetic (fully quantized training, FQT) promises to improve computational and memory efficiency.
FQT adds quantizers and dequantizers to the original full-precision computational graph and replaces expensive floating-point operations with cheaper low-precision ones.
FQT research aims to reduce the numerical precision of training without sacrificing too much convergence speed or accuracy.
The required numerical precision has been reduced from FP16 to FP8, INT32+INT8, and INT8+INT5.
FP8 training is implemented on Nvidia H100 GPUs with the Transformer Engine, accelerating the training of large-scale Transformers. Recently, the numerical precision of training has dropped to 4 bits.
However, these 4-bit training methods cannot be used directly for acceleration because they require custom number formats that modern hardware does not support.
Training at 4-bit precision faces two main challenges. First, the non-differentiable quantizers in forward propagation make the loss landscape rugged, where gradient-based optimizers can easily get stuck in local optima.
Second, gradients are only computed approximately at low precision. Such imprecise gradients slow down training and can even make it unstable or divergent.
In this work, the researchers propose a novel INT4 training algorithm for Transformers.
All costly linear operations in Transformer training can be written in the form of matrix multiplication (MM).
This MM form allows a more flexible quantizer design: by exploiting the specific structures of activations, weights, and gradients in Transformers, FP32 matrix multiplication can be approximated more closely.
The quantizers take full advantage of progress in randomized numerical linear algebra (RandNLA).
For forward propagation, the researchers found that activation outliers are the main cause of accuracy degradation.
To suppress outliers, they propose a Hadamard quantizer, which quantizes a transformed version of the activation matrix. The transformation is a block-diagonal Hadamard matrix; it spreads the information carried by outliers to neighboring entries in the matrix and thereby shrinks the numerical range of the outliers.
For backward propagation, they exploit the structural sparsity of activation gradients. The researchers found that the gradients of a few tokens are extremely large, while the gradients of most other tokens are very uniform, even more uniform than the quantization residuals of the larger gradients.
Therefore, rather than computing all the gradients, it is better to save the computational resources for computing the residuals of the larger gradients.
To exploit this sparsity, the researchers proposed bit splitting, which splits the gradient of each token into a high 4-bit part and a low 4-bit part.
Then leverage score sampling, an importance sampling technique from RandNLA, is used to select the most informative gradients.
Combining the forward- and backward-propagation quantization techniques, the researchers propose an algorithm that uses INT4 MMs for all linear operations in the Transformer, and they evaluate it on a variety of tasks, including natural language understanding, question answering, machine translation, and image classification.
Compared with existing 4-bit training algorithms, their algorithm achieves competitive or higher accuracy.
In addition, the algorithm is compatible with contemporary hardware such as GPUs, because it does not require custom number formats such as FP4 or logarithmic formats.
Their prototype quantization + INT4 MM operator implementation is 2.2 times faster than the FP16 MM baseline and improves training speed by 35.1%.
Related work
Fully quantized training
FQT research designs novel numerical formats and quantization algorithms that better approximate full-precision tensors.
The current research frontier is 4-bit FQT. It is challenging because of the large numerical range of gradients and the optimization difficulty of training quantized networks from scratch.
Because of these challenges, existing 4-bit FQT algorithms still lose 1-2.5% accuracy on some tasks and cannot run on contemporary hardware.
Other efficient training methods
Mixture-of-experts increases model capacity without increasing the training budget.
Structured dropout regularizes the model in a computationally efficient way. Efficient attention reduces the quadratic time complexity of computing attention.
Distributed training systems reduce training time by using more computing resources.
The researchers' work on reducing numerical precision is orthogonal to these directions.
Forward propagation
Neural network training is an iterative optimization process that computes stochastic gradients through forward and backward propagation.
The research team uses 4-bit integer (INT4) arithmetic to accelerate both forward and backward propagation.
Forward propagation is composed of linear operators and nonlinear operators (GeLU, normalization, softmax, etc.).
In their training procedure, all linear operators are accelerated with INT4 arithmetic, while the less compute-intensive nonlinear operators are kept in 16-bit floating point (FP16).
All linear operations in a Transformer can be written in the form of matrix multiplication (MM).
For ease of exposition, the paper considers accelerating a simple matrix multiplication of an activation matrix and a weight matrix.
The main use case for this type of MM is the fully connected layer.
Consider a Transformer whose input has shape (batch size S, sequence length T, dimension D).
The fully connected layer can then be expressed in this MM form, where X is the activation matrix of the N = ST tokens and W is the weight matrix.
For the attention layers, batched matrix multiplications (BMMs) may be required.
The proposed techniques can also be applied to BMMs.
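To make the MM form concrete, here is a minimal PyTorch sketch (not taken from the paper's code) showing how a fully connected layer over a (S, T, D) activation tensor reduces to a single matrix multiplication over N = ST token rows; the shapes and names are illustrative.

```python
import torch

# Illustrative shapes: batch size S, sequence length T, hidden dimension D, output dimension C.
S, T, D, C = 8, 128, 768, 3072

x = torch.randn(S, T, D)       # activations entering a fully connected layer
w = torch.randn(C, D)          # weight matrix of the fully connected layer

# Flatten the batch and sequence axes into N = S*T token rows,
# so the fully connected layer becomes a single matrix multiplication Y = X W^T.
x_flat = x.reshape(S * T, D)   # X: (N, D)
y_flat = x_flat @ w.t()        # Y: (N, C)

# Restore the (batch, sequence, output) layout expected by the rest of the network.
y = y_flat.reshape(S, T, C)
```

An INT4 kernel would operate on quantized versions of x_flat and w, but the flattening and the MM structure stay the same.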
Learned step size quantization
To accelerate forward propagation, the researchers adopt the learned step size quantizer (LSQ).
LSQ is a static quantization method: its quantization scale does not depend on the input, so it is cheaper than dynamic quantization methods, which must recompute the quantization scale at every iteration.
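As an illustration of what a learned-step quantizer looks like, the following PyTorch sketch implements LSQ-style fake quantization with a straight-through estimator. The gradient scaling and initialization follow the general LSQ recipe, but this is a simplified stand-in rather than the authors' forward-propagation kernel.

```python
import torch

def lsq_quantize(x: torch.Tensor, step: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Minimal LSQ-style fake quantization (signed, symmetric) for illustration.

    `step` is a learned scalar step size; a straight-through estimator lets
    gradients flow through the rounding. Not the paper's implementation.
    """
    qn, qp = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1

    # Gradient scale used by LSQ to keep step-size updates well conditioned.
    grad_scale = 1.0 / (x.numel() * qp) ** 0.5
    s = step * grad_scale + (step - step * grad_scale).detach()

    # Scale, clamp to the INT4 range, and round with a straight-through estimator.
    x_scaled = torch.clamp(x / s, qn, qp)
    x_int = x_scaled + (x_scaled.round() - x_scaled).detach()
    return x_int * s   # dequantized ("fake-quantized") tensor

# Usage: the step size is a trainable parameter, initialized from the data range (Qp = 7 for 4 bits).
x = torch.randn(512, 768, requires_grad=True)
step = torch.nn.Parameter(2 * x.detach().abs().mean() / (7 ** 0.5))
x_q = lsq_quantize(x, step)
x_q.sum().backward()   # gradients reach both x and step
```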
Activation outliers
(Figure: activation distribution when LSQ is applied to 4-bit FQT, showing outlier entries.)
As shown in the figure above, the activations contain some outlier entries whose magnitudes are much larger than those of the other entries. With only 4 bits, the quantizer must either truncate these outliers or assign a very coarse resolution to the remaining entries.
Unfortunately, Transformers tend to store information in these outliers, and such truncation seriously harms accuracy.
The outlier problem is especially pronounced when the training task is fine-tuning a pre-trained model on a new downstream task, because a pre-trained model contains more outliers than a randomly initialized one.
Hadamard quantization
The researchers propose Hadamard quantization (HQ) to address the outlier problem.
Its main idea is to quantize the matrices in another linear space that has fewer outliers.
The outliers in the activation matrix form a feature-wise structure:
they tend to concentrate in a few dimensions, that is, only a few columns of X are significantly larger than the others.
The Hadamard transform is a linear transform that can spread these outliers across the other entries.
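The sketch below illustrates the idea under simplifying assumptions: each block of feature dimensions is rotated by a normalized Hadamard matrix before INT4 quantization, which spreads an outlier column over its block. The block size, scaling granularity, and function names are illustrative, not the paper's exact HQ-MM configuration.

```python
import torch

def hadamard(k: int) -> torch.Tensor:
    """Normalized Hadamard matrix of size k (k must be a power of two)."""
    h = torch.ones(1, 1)
    while h.shape[0] < k:
        h = torch.cat([torch.cat([h, h], dim=1),
                       torch.cat([h, -h], dim=1)], dim=0)
    return h / k ** 0.5

def hadamard_quantize(x: torch.Tensor, block: int = 32, num_bits: int = 4):
    """Sketch of Hadamard quantization: rotate each `block`-sized group of
    feature dimensions with a Hadamard matrix, then quantize to INT4."""
    n, d = x.shape
    assert d % block == 0
    h = hadamard(block)

    # Block-diagonal Hadamard transform: each group of `block` columns is rotated independently.
    x_rot = (x.reshape(n, d // block, block) @ h).reshape(n, d)

    # Simple symmetric INT4 quantization of the rotated activations.
    qp = 2 ** (num_bits - 1) - 1
    scale = x_rot.abs().max() / qp
    x_int = torch.clamp((x_rot / scale).round(), -qp - 1, qp).to(torch.int8)
    return x_int, scale, h   # an INT4 kernel would consume x_int directly

x = torch.randn(1024, 768)
x[:, 5] *= 50                  # an artificial outlier feature column
x_int, scale, h = hadamard_quantize(x)
```

Because the normalized Hadamard matrix is orthogonal, transforming the weights with the same block-diagonal matrix leaves the product unchanged, so the INT4 MM can be performed directly on the transformed operands; the sketch shows only the activation side.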
Backward propagation
This section discusses accelerating backward propagation, i.e., the computation of activation gradients and weight gradients, with INT4 arithmetic.
The structural sparsity of gradients
The researchers noticed that the gradient matrix is often very sparse during training, and the sparsity has the following structure:
a few rows (i.e., tokens) have large entries, while most other rows are close to the zero vector.
This structural sparsity stems from the heavy over-parameterization of modern neural networks.
For almost the entire training process, the network operates in the over-parameterized regime and fits most of the training data well, except for a few hard examples.
Therefore, the (activation) gradients for well-fitted data points are close to zero.
The researchers found that, for pre-training tasks, the structural sparsity appears quickly after only a few epochs of training.
For fine-tuning tasks, the gradients remain sparse throughout the entire training process.
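As a quick diagnostic of this structure, one can measure how many token rows of a gradient matrix have negligible norm. The sketch below is illustrative only and uses synthetic data; it is not part of the training algorithm.

```python
import torch

def gradient_row_sparsity(grad_y: torch.Tensor, threshold: float = 1e-3) -> float:
    """Fraction of token rows whose gradient norm is negligible.

    `grad_y` is the gradient of the loss w.r.t. a layer output, one row per
    token (shape N x C). In a well-fitted, over-parameterized network most
    rows are close to zero while a few rows dominate.
    """
    row_norms = grad_y.norm(dim=1)
    return (row_norms < threshold * row_norms.max()).float().mean().item()

# Illustrative: a synthetic gradient where ~2% of tokens carry almost all the signal.
n, c = 4096, 768
grad_y = 1e-4 * torch.randn(n, c)
hot = torch.randperm(n)[: n // 50]
grad_y[hot] += torch.randn(len(hot), c)
print(f"near-zero rows: {gradient_row_sparsity(grad_y):.1%}")
```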
Bit splitting and leverage score sampling
How can a gradient quantizer be designed to compute MMs accurately during backpropagation by exploiting this structural sparsity?
The high-level idea is that many rows of the gradient are so small that they have little effect on the parameter gradients, yet they waste a lot of computation.
On the other hand, the large rows cannot be represented accurately with INT4.
The method therefore drops some of the small rows and uses the saved computation to represent the large rows more accurately.
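A minimal sketch of the two ideas is shown below: bit splitting represents the gradient as a coarse INT4 part plus an INT4 quantization of the residual, and a row-norm proxy for leverage score sampling picks a subset of token rows to estimate the weight-gradient MM. The scaling granularity, sampling proxy, and function names are assumptions for illustration, not the paper's LSS-MM kernel.

```python
import torch

def bit_split_int4(g: torch.Tensor):
    """Bit splitting sketch: a coarse INT4 part ("high 4 bits") plus an INT4
    quantization of the residual ("low 4 bits"). Per-matrix scales for brevity."""
    qp = 7
    s_hi = g.abs().max() / qp
    g_hi = torch.clamp((g / s_hi).round(), -8, qp)
    resid = g - g_hi * s_hi
    s_lo = resid.abs().max() / qp
    g_lo = torch.clamp((resid / s_lo).round(), -8, qp)
    return (g_hi, s_hi), (g_lo, s_lo)

def lss_matmul(a: torch.Tensor, b: torch.Tensor, k: int) -> torch.Tensor:
    """Estimate a.T @ b by sampling k row pairs with probability proportional
    to ||a_i|| * ||b_i|| (a cheap proxy for leverage scores), reweighting by
    1 / (k * p_i) so the estimate stays unbiased. Small rows are rarely
    sampled, so little computation is spent on them."""
    scores = a.norm(dim=1) * b.norm(dim=1)
    probs = scores / scores.sum()
    idx = torch.multinomial(probs, k, replacement=True)
    w = 1.0 / (k * probs[idx])
    return (a[idx] * w.unsqueeze(1)).t() @ b[idx]

# Illustrative weight-gradient estimate grad_w ~= grad_y.T @ x using only k sampled rows.
n, c, d = 4096, 3072, 768
grad_y = 1e-4 * torch.randn(n, c)
grad_y[torch.randperm(n)[:64]] += torch.randn(64, c)   # a few "hot" tokens
x = torch.randn(n, d)
(g_hi, s_hi), (g_lo, s_lo) = bit_split_int4(grad_y)
grad_w_est = lss_matmul(g_hi * s_hi + g_lo * s_lo, x, k=512)
```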
Experiments
The researchers evaluated the INT4 training algorithm on a variety of tasks, including language model fine-tuning, machine translation, and image classification.
They implemented the proposed HQ-MM and LSS-MM algorithms with CUDA and CUTLASS.
They replaced all floating-point linear operators with the INT4 implementation, except that the embedding layer simply uses LSQ and the final classifier layer is kept at full precision.
Finally, the researchers used the default architecture, optimizer, scheduler, and hyperparameters for all evaluated models.
Accuracy of converged models
The researchers compared the accuracy of the converged models on various tasks in the table below.
(Table: accuracy comparison of methods including FP, INT8, FP4 (LSQ+LUQ), and the proposed HQ forward propagation with LSS backward propagation, HQ+LSS.)
There is no publicly available implementation of "ultra-low", so its performance is listed only for the machine translation tasks reported in its original paper.
Except for the large machine translation task and the large vision transformer task, each run was repeated three times, and the standard deviation is reported as a subscript in the table.
The researchers did not use any form of knowledge distillation or data augmentation.
Ablation study
To study the effectiveness of the forward and backward methods separately, the researchers evaluated each of them while keeping the computation in the other direction in FP16.
The results are shown in the following figure.
Computing and memory efficiency
Finally, the researchers demonstrated the potential of their method to accelerate neural network training by evaluating their prototype implementation.
Their implementation is not yet fully optimized.
The researchers also did not fuse the linear operators with the nonlinearities and normalizations.
Therefore, the results cannot fully reflect the potential of the INT4 training algorithm.
A fully optimized implementation requires substantial engineering effort, which is beyond the scope of the paper.
Conclusion
The researchers have proposed a hardware-friendly INT4 training method for Transformers.
By analyzing the properties of MMs in Transformers, they proposed the HQ and LSS methods to quantize activations and gradients while maintaining accuracy.
On several important tasks, their method performs on par with or better than existing INT4 methods.
The work may be extended beyond Transformers to other architectures dominated by MMs, such as MLP-Mixer, graph neural networks, and recurrent neural networks.
This is a direction for their future research.
Broader impact: The researchers' algorithm can improve efficiency and reduce the energy consumption of training neural networks, which helps reduce the carbon emissions caused by deep learning.
However, efficient training algorithms may also facilitate the development of large language models and malicious AI applications that pose safety risks to humans,
for example, models and applications that can be used to generate false content.
Limitations: The main limitation of this work is that it only accelerates models dominated by large matrix multiplications (linear layers) and cannot accelerate convolutional layers.
Moreover, the proposed method does not yet work well for extremely large models such as OPT-175B.
To the researchers' knowledge, even INT8 training remains an open problem for these very large models.
References:
https://arxiv.org/abs/2306.11987