
What kind of storage is needed in the era of large models?

Tech 2023-07-18 19:01:35 Source: Network

[Global Network reporter Zhang Yang] With the rapid development of science and technology, we are moving ever deeper into an era defined by new technologies such as big data, artificial intelligence, and cloud computing. In today's era of "large models" in particular, artificial intelligence has become the key variable and a strategic lever driving a new round of technological revolution, industrial transformation, and social development, empowering industry after industry and profoundly changing how people live, how industries are structured, how work is done, and where technology is headed.

However, the way large models are being built today is somewhat like the idiom of a "castle in the air." In that story, a rich man wants only the third floor, not the first and second floors the workers must build beneath it. The "large model" is that third floor, especially when industry models focus on specific fields and specific scenarios to solve specific problems, empower industries, and attract public attention.

But to keep the third floor from collapsing overnight, and even to let the building grow taller, special attention must be paid to the solidity of the first and second floors. Computing power, algorithms, storage, frameworks, talent, and other factors together determine how great a building can become.

Storage becomes the cornerstone of large model development

The importance of computing power, algorithms, and data to the development of artificial intelligence has long been well known, yet storage, the carrier of that data, is often overlooked. In fact, the development of artificial intelligence resembles that of computer systems: both follow the classic barrel theory, in which the shortest stave sets the water level, so a weakness in any one component severely constrains overall performance. Training a large model requires a huge amount of data exchange; if storage performance is weak, a single training run can take far longer than it should, seriously restricting the development and iteration of large models.

In fact, many enterprises have already begun to recognize the enormous challenges that storage systems face as large model applications are developed and put into use.

First, data preparation takes a long time: data sources are scattered and collection is slow. Collection requires copying raw data from many data sources across regions, and the variety of data formats and protocols makes the process complex and time-consuming. On top of that, the traditional method of shipping hard disks can take several weeks, while transferring data over the public network is expensive. How to break through data silos and shorten collection time is the first challenge storage systems face in the era of large models.
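To make the scale concrete, here is a rough back-of-envelope sketch. The 100 TB dataset size is the scale cited below; the 1 Gbps link speed is an illustrative assumption, not a figure from the article.

```python
# Back-of-envelope: why moving raw training data over the public network is slow.
dataset_bytes = 100e12            # assumed 100 TB of raw data (scale cited in the article)
link_bits_per_s = 1e9             # assumed dedicated 1 Gbps public link
seconds = dataset_bytes * 8 / link_bits_per_s
print(f"{seconds / 86400:.1f} days of continuous transfer")   # ~9.3 days, before retries or overhead
```

At that rate a single collection run already takes more than a week, which is why disk shipping, expensive leased lines, and data silos dominate the preparation timeline.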

Second, the data preprocessing cycle is long. Raw data collected from the network cannot be fed directly into AI model training; the diverse, multi-format data must first be cleaned, deduplicated, filtered, and otherwise processed, a stage the industry calls "data preprocessing." Compared with traditional single-modal small-model training, multimodal large models need more than 1,000 times as much training data. For a typical 100 TB class large-model dataset, preprocessing takes more than 10 days and accounts for about 30% of the entire AI data pipeline. Preprocessing is also highly concurrent and consumes an enormous amount of computing power. How to shorten preprocessing time by the most economical means is the second problem that urgently needs solving.
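As a rough illustration of what this stage involves, the minimal Python sketch below cleans, filters, and deduplicates raw text records. It is a toy example under simple assumptions; production pipelines are distributed and far more elaborate, and none of this code comes from Huawei.

```python
import hashlib

def preprocess(records, min_len=32):
    """Clean, filter, and exact-deduplicate an iterable of raw text records."""
    seen = set()
    for text in records:
        text = text.strip()                                    # clean: trim whitespace/noise
        if len(text) < min_len:                                # filter: drop too-short samples
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:                                     # de-duplicate exact repeats
            continue
        seen.add(digest)
        yield text

# Usage (hypothetical file name):
# cleaned = list(preprocess(open("raw_corpus.txt", encoding="utf-8")))
```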

Third, loading the training set is slow, training is easily interrupted, and data recovery takes a long time. Compared with traditional models, the parameters and training datasets of large models grow exponentially, so quickly loading datasets made up of massive numbers of small files and cutting GPU idle time is crucial. Mainstream pre-trained models already have hundreds of billions of parameters, and frequent parameter tuning, network instability, server failures, and other factors destabilize training, making interruptions and rework likely. A checkpoint mechanism is therefore needed so that training falls back to a restore point rather than to the starting point. Today, recovering from a checkpoint can take on the order of a day, which sharply lengthens the overall training cycle. With single checkpoints growing ever larger and hourly checkpoint frequency expected in the future, how to reduce checkpoint recovery time has to be considered carefully.
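The checkpoint mechanism referred to here amounts to periodically persisting the full training state so a job can resume from the last restore point. A generic PyTorch-style sketch is shown below; it is illustrative only, not Huawei's implementation.

```python
import torch

def save_checkpoint(model, optimizer, step, path="ckpt.pt"):
    # Persist everything needed to resume exactly where training stopped.
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, path)

def load_checkpoint(model, optimizer, path="ckpt.pt"):
    # Fall back to the restore point instead of the initial point.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]                    # continue the training loop from this step
```

For a model with hundreds of billions of parameters, the saved state (weights plus optimizer state) can reach terabyte scale, so how fast the storage layer can write and read these files directly determines how much each interruption costs.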

Finally, the threshold for putting large models into production is high, system construction is complex, and inference is not sufficiently timely or accurate. When a large model is used for inference, the latest data and the enterprise's private data must be connected to the model to improve the timeliness and accuracy of its answers and to avoid hallucinations. If that new data were instead fed back into further training and fine-tuning on the GPU cluster, training would take too long and cost too much, so more efficient ways of dynamically updating the model's data are needed.
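One common way to connect fresh and private data to a model without retraining it is retrieval-augmented generation, in which relevant documents are fetched from storage at inference time and placed in the prompt. The sketch below is a generic illustration of the idea, not a specific product API; embed() and generate() are assumed placeholders for an embedding model and a large model.

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=3):
    # Rank stored documents by cosine similarity to the query embedding.
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return [docs[i] for i in np.argsort(-sims)[:k]]

def answer(question, docs, doc_vecs, embed, generate):
    # The base model stays frozen; only the retrieved context is fresh.
    context = "\n".join(retrieve(embed(question), doc_vecs, docs))
    prompt = f"Answer using only this context:\n{context}\n\nQ: {question}\nA:"
    return generate(prompt)
```

Because the knowledge lives in the document store rather than in the model weights, updating it is a storage operation measured in seconds, not a training run measured in days.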

Data determines the level of AI intelligence

In response to these challenges, Huawei, following the AI development trend of the large model era, launched the OceanStor A310 deep learning data lake storage and the FusionCube A3000 training/inference hyper-converged appliance on July 14, targeting large model applications in different industries and scenarios.

In Huawei's view, the challenges enterprises face in developing large models point precisely to the storage problems that can be solved in a targeted way. First, to address the difficulty of data collection, Huawei data storage has built a data fabric capability: through its Global File System it provides a globally unified data view and scheduling across systems, regions, and clouds, shortening data collection from days to hours. This improves data transfer efficiency and breaks down data silos, so scattered data can be used by large models as if it were stored on a single device.

To address the long preprocessing cycle, Huawei uses near-storage computing to prepare data and reduce data movement: the storage itself can be configured with computing power, accelerating data preparation and freeing CPU and GPU resources in the training cluster.

Facing slow training set loading, easily interrupted training, and long data recovery times, Huawei shortens recovery through preprocessing acceleration and high-performance, high-bandwidth acceleration of training set loading.

The two new AI storage products built on these capabilities provide storage solutions for foundation model training, industry model training, and model training and inference in segmented scenarios.

Among them, the OceanStor A310 deep learning data lake storage targets the data lake scenarios of foundation and industry large models, managing massive data across the entire AI pipeline, from data collection and preprocessing to model training, inference, and application. A single 5U OceanStor A310 frame delivers the industry's highest bandwidth of 400 GB/s and up to 12 million IOPS, scales linearly to 4,096 nodes, and offers lossless interoperability across multiple protocols. Its Global File System (GFS) enables cross-region intelligent data fabric, simplifying data collection, while near-storage computing performs preprocessing close to the data, reducing data movement and improving preprocessing efficiency by 30%.
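Taking the article's figures at face value, a quick back-of-envelope calculation shows what that bandwidth means for the 100 TB class dataset discussed earlier (illustrative arithmetic only):

```python
dataset_bytes = 100e12     # 100 TB class dataset, as cited earlier in the article
frame_bandwidth = 400e9    # 400 GB/s per 5U frame, per the quoted spec
minutes = dataset_bytes / frame_bandwidth / 60
print(f"{minutes:.1f} minutes to stream the full dataset")   # ~4.2 minutes
```

At that rate, a full pass over the raw dataset through one frame takes minutes rather than hours, and aggregate bandwidth grows further as nodes are added.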

The FusionCube A3000 training/inference hyper-converged appliance targets industry large model training and inference scenarios for models at the 10-billion-parameter level. It integrates OceanStor A300 high-performance storage nodes, training/inference nodes, switching equipment, AI platform software, and management and O&M software, giving large model partners a turnkey deployment experience with one-stop delivery. It is ready out of the box, with deployment completed within 2 hours. Training/inference nodes and storage nodes can be scaled out independently to match models of different sizes. At the same time, the FusionCube A3000 uses high-performance containers to let multiple model training and inference tasks share GPUs, raising resource utilization from 40% to over 70%. It supports two flexible business models: Huawei's Ascend one-stop solution, and a one-stop solution from third-party partners with open computing, networking, and AI platform software.

Zhou Yuefeng, President of Huawei's Data Storage Product Line, said: "In the era of large models, data determines the height of AI intelligence. As the carrier of data, data storage has become key infrastructure for AI large models. Huawei data storage will keep innovating, providing diverse solutions and products for the era of AI large models and working with partners to promote AI empowerment across thousands of industries."

Looking further ahead, IT infrastructure such as storage, computing, and networking in the era of large models is bound to be reshaped further around new demands. Only when the AI industry stands on a solid foundation and sturdy first and second floors can it climb higher and take in the finer scenery of the AI era.

