Alibaba Cloud goes down across the board! Takeout orders fail, checkouts fail... the 99.99% high-availability myth is shattered!

Tech 2023-11-14 09:13:07 Source: Network

At around 6 p.m. on November 12, 2023, Alibaba Cloud suffered another major outage. The previous one came on December 18, 2022, when a large-scale service interruption hit Availability Zone C of Alibaba Cloud's Hong Kong region, significantly affecting many customers' businesses and spreading to further cloud services in that zone, such as EBS, OSS, and RDS.

This time, the impact was even greater. For more than an hour, Ele.me users could not place orders, delivery riders could not log in to the system, parking-lot barriers would not lift, and supermarkets could not check customers out!

Weibo was flooded with complaints, and headlines declared that "Alibaba Cloud's entire product line has collapsed". Hashtags such as "Taobao is down", "Taobao is down again", "Xianyu is down", and "DingTalk is down" climbed the trending list one after another.

Based on the information available so far, this was probably not an ordinary outage but a historic one.

On the evening of Double 11, Taobao had a brief outage that passed quickly. On the evening of the 12th, however, multiple Alibaba apps, including Taobao, Xianyu, DingTalk, Alibaba Cloud Drive, Ele.me, Tmall Genie, Cainiao, Quark, and Yuque, became inaccessible or behaved abnormally.

Yuque probably has the least to say. Last month, on October 23, Yuque, another Alibaba product, suffered a "P0"-level incident that left the platform inaccessible and unusable for nearly 8 hours (from around 14:10 to 21:45).

After that incident, Yuque gave users half a year of membership as compensation. It is not clear whether similar compensation will be offered this time.

Affected areas:

North China 2 (Beijing), North China 6 (Ulanqab), North China 1 (Qingdao), East China 2 (Shanghai), South China 2 (Heyuan), North China 3 (Zhangjiakou), Hong Kong, India (Mumbai), United States (Silicon Valley), South China 1 (Shenzhen), United Kingdom (London), South Korea (Seoul), Japan (Tokyo), United Arab Emirates (Dubai), Southwest 1 (Chengdu), South China 3 (Guangzhou), Singapore, Australia (Sydney), Malaysia (Kuala Lumpur), North China 5 (Hohhot), Indonesia (Jakarta), United States (Virginia), Philippines (Manila), Thailand (Bangkok), East China 1 (Hangzhou), South China 1 Financial Cloud.

According to Alibaba Cloud's Health Status Page, this was not a fault in one availability zone but a major global fault that spared almost no region. It affected not only the regions serving Alibaba Cloud's own businesses but also the financial cloud and government cloud offered to external customers. Worse still, no service was spared: all of them went down.

Timeline of the incident (about three and a half hours of disruption, 17:44-21:11):

17: Alibaba Cloud has confirmed that the cause of the fault is related to an underlying service component, and engineers are working on it urgently.

18: After work by engineers, console and API services in regions such as Hangzhou and Beijing have been restored, and those in other regions are being restored gradually.

19:20: Engineers have restarted component services in batches, and console and API services in most regions have been restored.

19:43: The abnormal control service components have all been restarted. Apart from a few cloud products (such as Message Queue MQ and Message Service MNS) that still need work, the consoles and API services of all other cloud products have been restored.

20: Message Queue MQ in regions such as Beijing and Hangzhou has been restarted, and the remaining regions are recovering gradually.

At 21:11, all affected cloud products had been restored. Because the fault affected data for some cloud products (such as monitoring and billing), delivery of that data may be delayed; this does not affect business operations.

Industry practitioners were shocked by this failure, saying they had not heard of a cloud computing outage on this scale since they entered the field.

Technology chat groups are currently full of anxiety and anger, because in this situation users have no way to rescue themselves and can only wait for Alibaba Cloud to recover.

Alibaba Cloud's market share is huge: just over ten days ago, at the 2023 Yunqi Conference, Alibaba Group Chairman Cai Chongxin (Joseph Tsai) noted that 80% of China's technology companies and half of its large-model companies run on Alibaba Cloud. The impact of this failure is therefore very significant.

Cloud products have worked their way into every aspect of our lives, and affected netizens piled on with Double 11-themed jokes.

Industry peers also joked bluntly that Alibaba's drive to "reduce costs and increase efficiency" is already paying off as "reduce costs and increase laughter" (the two phrases sound alike in Chinese).

On the surface this incident was a technical failure, but systems are still maintained by people.

The incident has prompted speculation among netizens, some of whom ask whether Alibaba Cloud's large-scale layoffs of older employees in August left it short of key personnel and so led to this failure.

However, this is only speculation, and there is no conclusive evidence for it. The incident should not be judged on speculation alone.

One netizen said that, from a certain perspective, Alibaba's outage may be a good thing: it shows that some veterans still need to stay.

Others asked whether Alibaba's heavy losses this time are an aftereffect of cost reduction and efficiency improvement.

From Alibaba Cloud's perspective, this failure is very "un-Alibaba Cloud". After all, Alibaba Cloud has always prided itself on security, stability, and high availability. Such a large-scale, long-lasting, and wide-ranging failure is a fatal blow to its brand image. This is no longer a matter of "sacrificing a programmer to the heavens"; it would probably take "sacrificing a CEO". Unfortunately, Alibaba Cloud currently does not have a CEO.

An even bigger headache is the blizzard of compensation claims it will have to face afterwards.

Employee perspective

The string of incidents is being blamed on the cost-reduction and efficiency drive that pushed talent out of the company, including layoffs of "code apes" (programmers) to save money.

Losing the year-end bonus is the least of it; pay cuts are more likely. At the very least, someone will be sacrificed to the heavens for this P0. Sensible Alibaba employees have already started updating their resumes.

Customer perspective

Previously, ByteDance went down for one hour at noon and lost 2 "small targets" (slang for about 200 million yuan). This outage hit Alibaba Cloud during the Double 11 period; under a 99.99% SLA, one hour of downtime corresponds to 10% compensation.
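
To put the 99.99% figure in context, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the 30-day measurement period and the helper names are assumptions and do not come from Alibaba Cloud's SLA terms, and the 10% compensation figure quoted above is not derived here.

```python
from datetime import datetime

# Assumption: availability is measured over a 30-day month (43,200 minutes).
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60


def allowed_downtime_minutes(sla_percent: float,
                             period_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Downtime budget implied by an availability target over one period."""
    return period_minutes * (1 - sla_percent / 100)


def availability_percent(downtime_minutes: float,
                         period_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Availability actually achieved after a given amount of downtime."""
    return 100 * (1 - downtime_minutes / period_minutes)


def outage_minutes(start: str, end: str) -> float:
    """Length of a same-day outage given 'HH:MM' start and end times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


if __name__ == "__main__":
    budget = allowed_downtime_minutes(99.99)
    nov12 = outage_minutes("17:44", "21:11")  # window reported in the timeline
    print(f"99.99% budget per 30-day month: {budget:.2f} min")
    print(f"Availability after a 1-hour outage: {availability_percent(60):.3f}%")
    print(f"Nov 12 window: {nov12:.0f} min -> {availability_percent(nov12):.3f}%")
```

Under these assumptions, a 99.99% target leaves a budget of about 4.32 minutes of downtime per month, so a one-hour outage, let alone the 207-minute window reported for November 12, exceeds it many times over.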

Going forward, Alibaba Cloud may no longer be synonymous with high performance and high availability, and it faces the prospect of customers changing their orders. After this episode, customers' blind faith in Alibaba Cloud is likely to be shattered.

Brand perspective

Rivals such as AWS, Huawei Cloud, and Tencent Cloud are already rubbing their hands. As the old saying goes, "strike while your opponent is ill". If they do not make a grab for customers now, when will they?

Network security perspective

In fact, ChatGPT has also suffered a run of network security incidents recently. On the same afternoon, ChatGPT and other services malfunctioned, and the causes of the downtime were under investigation; OpenAI later said the problem had been resolved and that services were running normally. As early as Thursday, ChatGPT had detected signs of a DDoS (Distributed Denial of Service) attack, in which attackers control many machines to strike simultaneously, with the aim of preventing normal users from using the service. On November 9, OpenAI's ChatGPT and API services suffered severe interruptions that left them unusable for users and developers, and 16 hours later ChatGPT had still not fully recovered.

As users and outside observers, we should look at how this incident was handled. Alibaba Cloud's engineers acted quickly, restarting components and gradually restoring services.

Their efforts and proactive response deserve recognition. At the same time, we need to recognize that cloud services are complex systems in which failures are inevitable, and even technologically advanced companies cannot rule them out entirely.

Sometimes incidents are not entirely within people's control and may be caused by hardware failures, software errors, or many other factors.

The key is for the company to be transparent and responsible: to release information promptly, respond actively to faults, and learn from them to improve system stability.

We should therefore view this incident objectively, acknowledge the engineers' rapid response and efficient handling of the fault, and encourage Alibaba Cloud to further strengthen its fault prevention and emergency response capabilities so as to ensure user data security and service stability.

At the same time, as users, we should prepare our own data backups and disaster recovery plans to reduce the impact of failures.

No service is perfectly reliable. The remedy is to strengthen the disaster tolerance of underlying service components and to deploy across multiple locations, multiple data centers, or even multiple clouds, so as to be prepared.
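
As a rough illustration of what a multi-location or multi-cloud setup means on the client side, here is a minimal Python sketch of ordered failover. The provider functions are hypothetical stand-ins written for this example, not real cloud SDK calls, and the sketch only helps if the backup does not share the dependency that failed.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllProvidersFailed(Exception):
    """Raised when every configured provider has failed."""


def call_with_failover(providers: Sequence[tuple[str, Callable[[], T]]]) -> tuple[str, T]:
    """Try each (name, call) pair in order and return the first success."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call()
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def primary_cloud() -> str:
    # Hypothetical stand-in for a request to the primary cloud; simulates an outage.
    raise ConnectionError("control plane unavailable")


def secondary_cloud() -> str:
    # Hypothetical stand-in for the same request sent to a second region or cloud.
    return "order accepted"


if __name__ == "__main__":
    used, result = call_with_failover([
        ("primary", primary_cloud),
        ("secondary", secondary_cloud),
    ])
    print(used, result)
```

The design choice here is deliberately simple: providers are tried in a fixed order and the first success wins. A production setup would also need data replication, health checks, and timeouts, none of which are shown.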

Alibaba Cloud's large-scale failure reminds us that cloud services are not flawless. Through active response, continuous improvement, and user participation, however, service quality can be raised to a higher level and cloud computing can continue to develop.

What do you think about this incident? Feel free to leave a comment!
