Driving AI to new horizons with an all-scenario native, full-stack solution

2018.12.28 By Dang Wenshuan, Huawei Chief Strategy Architect

At HUAWEI CONNECT 2018, Huawei Rotating Chairman Eric Xu announced Huawei's AI strategy and full-stack, all-scenario AI portfolio. Here we take a detailed look at what the strategy involves and the features and benefits of each of the solutions.  

Huawei's AI portfolio includes four layers: Ascend, CANN, MindSpore, and the Application Enablement layer. Ascend is a series of AI IP and chipsets based on a unified, scalable architecture. There are five chips in the series: Max, Mini, Lite, Tiny, and Nano. CANN is a chip operator library and highly automated operator development toolkit. MindSpore is a unified training and inference framework that supports device, edge, and cloud (both standalone and collaborative). The Application Enablement layer provides full-pipeline services, hierarchical APIs, and pre-integrated solutions.

This article will set out the purposes, architectures, key technologies, products, and services of Huawei's full-stack AI portfolio's four layers from the bottom up.

Ascend layer

Ascend is the IP and chipset layer and the foundation of the full-stack solution. It aims to provide optimal performance at minimal cost for all scenarios.

Organizations tend to choose different applications for different deployment scenarios. As each possible scenario is unique and equally important, we set out with the goal of providing the best performance at minimal cost in any scenario. However, this wasn’t an easy task. The required compute performance for different deployment scenarios might range from 20 MOPS for headphones, to 20 TOPS for IP cameras, to 200 TOPS for cloud, a dynamic range of about 10 million times. Equally, power budgets might range from 1 mW to over 200 W, a dynamic range of over 200,000 times. Expected model sizes could range from as small as 10 KB to more than 300 MB. And latency requirements can differ by up to 100 times.
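The dynamic ranges quoted above follow directly from the endpoint figures. The short Python sketch below redoes the arithmetic; the variable names are illustrative, and decimal unit prefixes (1 MB = 1,000 KB) are an assumption, since the text does not say which convention it uses.

```python
# Endpoint figures quoted in the text, each pair expressed in its smaller unit.
compute_mops = (20, 200 * 10**6)   # 20 MOPS (headphones) to 200 TOPS (cloud)
power_mw = (1, 200 * 10**3)        # 1 mW to 200 W
model_kb = (10, 300 * 10**3)       # 10 KB to 300 MB (decimal prefixes assumed)


def dynamic_range(low, high):
    """Ratio between the largest and the smallest requirement."""
    return high / low


compute_range = dynamic_range(*compute_mops)
power_range = dynamic_range(*power_mw)
model_range = dynamic_range(*model_kb)

print(f"compute: {compute_range:,.0f}x")
print(f"power:   {power_range:,.0f}x")
print(f"model:   {model_range:,.0f}x")
```

With these endpoints, compute spans seven orders of magnitude and power more than five, which is the gap a single architecture has to cover.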

Inference is required in all scenarios. But is training or learning needed in scenarios other than cloud and edge? In our view, yes: data privacy protection is our highest priority, so local training and learning are required wherever there are privacy concerns. To cope with such a huge dynamic range of requirements, we’ve developed a range of IPs and chipsets, from Ascend Nano, the smallest, to Ascend Max, which is used in the cloud.

Whether or not to adopt a unified architecture when developing a series of IPs and chips is a crucial decision. True, the benefits of a unified architecture are obvious. You only have to develop the operator once and then you can use it in any scenario, with a guaranteed consistent development and debugging experience across scenarios. And more importantly, once an algorithm has been developed for a chip, you can smoothly migrate it to other IPs or chips for other scenarios.

Yet a unified architecture also poses unprecedented challenges. One way to achieve huge computational scalability is the scale-out method: an architecture optimized for a small or the smallest computing scenario is developed first and then scaled out to match the largest computing scenario. However, this inevitably increases chip size and power dissipation beyond acceptable limits.

Another choice is the scale-in approach. This involves first designing an architecture optimized for a large or the largest computing scenario, and then using fine partitioning to match the smallest computing scenario. However, this unavoidably leads to highly complex task scheduling and software design, and may result in a failure to achieve low power dissipation targets due to current leakage.

In addition, memory bandwidth and latency vary hugely between scenarios, and computing power must always be matched to them to avoid poor utilization. You will also face power and area constraints for on-chip and inter-chip interconnects.

Backed by years of experience in chip design and a deep understanding of our customers, Huawei selected the unified Da Vinci architecture to develop the Ascend chips. Three unique, key technologies – scalable computing, scalable memory, and scalable interconnections – make the unified architecture possible.

To achieve highly scalable computing power, we first designed a scalable cube, which acts as an ultrafast matrix computing unit. In its maximum configuration (16x16x16), the cube can perform 4,096 FP16 MAC operations per clock cycle. Given the huge dynamic range that needs to be supported, we believe that the 16x16x16 Cube is the sweet spot between performance and power dissipation. With a 16x16x16 configuration, the cube's scale-in capability, and efficient multicore stacking, it's possible to support all scenarios with one architecture. 

For lower computing power use cases, the cube can be gradually scaled down to 16x16x1, which provides 256 MAC operations per cycle. This flexibility alongside one instruction set provides a successful balance between computing power and power dissipation. By supporting multiple precisions, each task can be executed most efficiently. 
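The per-cycle figures for the two cube configurations are simply the product of the three dimensions. A minimal Python check follows; the function name is illustrative and not part of any Huawei toolkit.

```python
def macs_per_cycle(m, k, n):
    """Multiply-accumulate operations needed for an (m x k) by (k x n) tile product."""
    return m * k * n


full_cube = macs_per_cycle(16, 16, 16)   # maximum configuration
small_cube = macs_per_cycle(16, 16, 1)   # scaled-down configuration

print(full_cube, small_cube, full_cube // small_cube)  # prints: 4096 256 16
```

Scaling the third dimension from 16 down to 1 therefore reduces the work done per cycle by a factor of 16 while the other two dimensions stay the same.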

Due to the extremely high computational density, the integrity of the power supply is critical when the circuit is operating at full speed. Picosecond current control technology meets this critical requirement. 

The Da Vinci Core also has an integrated ultra-high-bit vector processor unit and a scalar processor unit. This varied compute design allows the Da Vinci architecture to support calculations outside the matrix and adapt to potential neural network calculation types in the future.

To support highly scalable memory, each Da Vinci Core is equipped with dedicated SRAMs with fixed functions and variable capacities to accommodate different computing power scenarios. These memories are designed to be explicit to low-level software. Thus, it’s possible to use the auto-tiling plan to achieve fine-grained control of data multiplexing and optimally balance performance and power dissipation to suit different scenarios. 
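To make the idea of tiling concrete, here is a minimal pure-Python sketch of a blocked matrix multiplication that works through the operands one 16x16x16 block at a time. It illustrates the general technique only: how Ascend's auto-tiling actually chooses and schedules tiles is not described in this article, and the function names are hypothetical.

```python
TILE = 16  # edge length of the largest cube configuration quoted in the text


def matmul_tiled(a, b, tile=TILE):
    """Multiply a (m x k) by b (k x n), one tile-sized block at a time."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # One block: at most tile**3 multiply-accumulates, all of
                # which touch the same small slices of a, b, and c.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for p in range(k0, min(k0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


def matmul_naive(a, b):
    """Reference implementation used only to check the tiled version."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Shapes that are deliberately not multiples of the tile size.
a = [[(i + j) % 7 for j in range(35)] for i in range(20)]
b = [[(i * j) % 5 for j in range(18)] for i in range(35)]
assert matmul_tiled(a, b) == matmul_naive(a, b)
```

Because each block reuses the same slices of the inputs many times, a small, software-visible memory next to the compute unit can serve most accesses. The tile size is the knob that trades buffer capacity against traffic to larger, slower memory.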

For data center applications, the on-chip, ultra-high bandwidth Mesh network connects multiple Da Vinci Cores. This ensures extremely low latency communication between cores and between the core and other IPs. Thanks to an L2 buffer with up to 4 TByte/s of bandwidth and a 1.2 TByte/s HBM, the high-density computing core's performance can be fully utilized. Leveraging 2.5D packaging technology, the Ascend 910 chip integrates eight dies – a standout feature. These include compute, HBM and IO. 

As the world's first all-scenario AI IP and chip series, the Ascend series offers the best energy efficiency ratio in all scenarios, from extremely low power to high computing power scenarios.

CANN layer

The CANN (Compute Architecture for Neural Networks) layer lies above the chip layer. It offers a chip operator library and operator development toolkits. Aiming to provide optimal development efficiency and operator performance, it meets the requirements created by the booming growth in academic research and industry applications.

AI academic research has flourished since 2009, and the number of machine learning papers published has increased at the rate of Moore's Law. 

Gartner's digital disruption scale outlines five levels of digital disruption in an enterprise: enhance, expand, transform, reinvent, and revolutionize. According to Gartner, enterprises will mainly be using AI to "enhance" their existing services in 2018. Within just five years, they’ll be using it to "transform" and "reinvent" them.

Never before have we seen a technology capable of making such a huge impact so quickly. We’re entering an era of "dual prosperity" where both AI academic research and AI industry applications will flourish.

From a technical perspective, this means we’ll see sustained, rapid growth in the diversity of operators. There will be many possible factors behind this diversity: the variety of applications, models, and networks; their potential use in forward or backward algorithms; or their accuracy and high resource budgets. 

Yet the languages used today to develop operators are either high-performance or easy to develop with, but not both. For future AI applications and development, it will be crucial to have toolkits that deliver both high operator performance and high development efficiency. This is why Huawei chips support CANN.

CANN provides a high-performance CCE (Cube Compute Engine) operator library. The key component of CANN is its Tensor Engine, a highly automated operator development tool. The Tensor Engine enables users to easily develop customized operators (CCE lib-extension) on the Ascend chip using a unified domain specific language (DSL) interface (TE Interface) and tools such as preset high-level template encapsulation and auto performance tuning. 

The Tensor Engine is a unified tool aimed at Huawei and non-Huawei developers alike. Huawei engineers focus on developing extreme performance operators. Huawei-developed operators can be found in the CCE lib (CCE library). Non-Huawei developers can use the same toolkit to develop the operators they need. These operators can be found in the CCE lib-extension. CANN also fully supports the use of TVM to produce operators. These are also included in the CCE lib-extension. 

With these two libraries and the Tensor Engine toolkit, CANN supports all major machine learning and deep learning frameworks. Reduce_sum is an example of a Huawei-developed operator. It's often used in TensorFlow, and was also developed on the Ascend chip. Writing it in a general DSL requires 63 lines of code (LOC), whereas the Tensor Engine DSL needs only 22 LOC to achieve the same functions. This represents a nearly threefold increase in development efficiency. This extreme-performance operator can be found in the CCE lib.
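For readers unfamiliar with the operator, the sketch below shows what a reduce_sum computes, using plain Python on a 2-D list. It is not Tensor Engine DSL code, which this article does not show, and it ignores everything that makes a production operator hard (tiling, precision, hardware scheduling). The last lines redo the lines-of-code comparison from the figures quoted above.

```python
def reduce_sum(rows, axis=None):
    """Sum a 2-D list of numbers.

    axis=None sums every element, axis=0 collapses the row dimension
    (one total per column), axis=1 collapses the column dimension
    (one total per row).
    """
    if axis is None:
        return sum(sum(r) for r in rows)
    if axis == 0:
        return [sum(col) for col in zip(*rows)]
    if axis == 1:
        return [sum(r) for r in rows]
    raise ValueError("axis must be None, 0, or 1")


x = [[1, 1, 1], [1, 1, 1]]
print(reduce_sum(x))          # prints: 6
print(reduce_sum(x, axis=0))  # prints: [2, 2, 2]
print(reduce_sum(x, axis=1))  # prints: [3, 3]

# Lines of code quoted in the text for the two DSLs.
general_dsl_loc, te_dsl_loc = 63, 22
loc_ratio = general_dsl_loc / te_dsl_loc
print(f"{loc_ratio:.2f}x fewer lines")
```

The ratio comes out a little under 3, which matches the "nearly threefold" figure in the text.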

MindSpore

MindSpore is a unified training/inference framework built to be design-friendly, operations-friendly, and adaptable to all scenarios.

There are many AI frameworks available on the market, but none fully meet our expectations. We believe that AI frameworks should be design-friendly, for instance, by offering significantly reduced training time and costs, and operationally efficient, for instance, by consuming minimal resources and delivering maximum energy efficiency ratios. More importantly, we believe they should be adaptable to each scenario, including device, edge, and cloud.

In the future, AI will be highly dynamic. This isn’t just because of the direction of academic research: world-leading computer scientist Professor Michael Jordan points out that the frontiers of AI research are heading towards dynamic environments AI, secure AI, and AI-specific architecture due to trends such as mission-critical AI, personalized AI, cross-organizational AI, and AI demand outpacing Moore's Law. Nor is it just because of the explosive development of industry AI, in which the computational demand of a single neural network has increased 300,000 times in just six years.
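To put the 300,000-fold figure in perspective, the following Python sketch converts it into a doubling time and compares it with a Moore's Law pace. The 24-month doubling period used for Moore's Law is an assumption (the common "every two years" reading), not a figure from this article.

```python
import math

growth = 300_000   # increase in compute demand quoted in the text
months = 6 * 12    # "just six years"

doublings = math.log2(growth)        # how many times demand doubled
doubling_time = months / doublings   # months per doubling

moore_doubling_months = 24           # assumed Moore's Law doubling period
moore_growth = 2 ** (months / moore_doubling_months)

print(f"{doublings:.1f} doublings, one every {doubling_time:.1f} months")
print(f"Moore's Law pace over the same period: {moore_growth:.0f}x")
```

Under these assumptions, demand doubled roughly every four months, while a two-year doubling period would have delivered only an eightfold increase over the same six years.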

The most important factor is GDPR, which has been applied to organizations across the world since May 25, 2018. We long ago mastered the management of land, industry, and many other things besides. Yet we lack effective methods for managing data, long seen as a new type of resource. GDPR is the first comprehensive attempt to manage data at the government level. As data is key to AI, the impact of GDPR is clearly historic.

Understanding these factors, Huawei believes that the AI framework should be a unified training/inference framework, so that training and inferencing can take place anywhere, and a consistent development experience can be maintained regardless of scenario – be that standalone or collaborative; on device, edge or cloud; or collaboration between device and cloud, edge and cloud, or device and edge.

MindSpore is this unified AI framework. The complete framework is planned for release in the second quarter of 2019. It will include core subsystems, such as a model library, graph compute, and tuning toolkit; a unified, distributed architecture for machine learning, deep learning, and reinforcement learning; a flexible program interface; and support for multiple languages.

MindSpore's size can vary to suit different environments. For example, the small version of the framework for on-device learning is probably the smallest full-feature AI framework ever. The total framework is less than 2 MB and requires less than 50 MB of RAM. Plus it needs 5 times less ROM than its closest competitor solution. The on-device MindSpore framework will be available in the first quarter of 2019. 

Meanwhile, Ascend Cluster is probably the largest distributed training system in the world at present. It connects 1,024 Ascend 910 chips – which offer the greatest computing density on a chip – in a single computing cluster, providing 256 petaFLOPS of ultra-high computing power to enable model training at unprecedented speeds and training goals to be achieved in minutes or even seconds. And with 32 TB HBM, it’s easy to develop new models that are larger than ever before if needed.
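The cluster totals above can be broken down per chip with simple division. The sketch below assumes compute and HBM are spread evenly across the 1,024 chips, which the article does not state, and shows both decimal and binary unit prefixes because the text does not say which it uses.

```python
CHIPS = 1024
CLUSTER_PFLOPS = 256   # cluster compute quoted in the text
CLUSTER_HBM_TB = 32    # cluster HBM quoted in the text


def per_chip(cluster_total, chips, step):
    """Convert a cluster total in a large unit to a per-chip value in the
    next smaller unit (step is 1000 for decimal, 1024 for binary prefixes)."""
    return cluster_total * step / chips


tflops_decimal = per_chip(CLUSTER_PFLOPS, CHIPS, 1000)
tflops_binary = per_chip(CLUSTER_PFLOPS, CHIPS, 1024)
hbm_gb_decimal = per_chip(CLUSTER_HBM_TB, CHIPS, 1000)
hbm_gb_binary = per_chip(CLUSTER_HBM_TB, CHIPS, 1024)

print(f"compute per chip: {tflops_decimal} to {tflops_binary} TFLOPS")
print(f"HBM per chip:     {hbm_gb_decimal} to {hbm_gb_binary} GB")
```

Either way, each chip contributes on the order of 250 TFLOPS and about 32 GB of HBM to the cluster.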

In addition, with Huawei's OMG (Offline Model Generator) and OME (Offline Model Engine), models trained, or to be trained, using major open source frameworks can work on Ascend.

Application Enablement layer

The Application Enablement layer is a machine learning PaaS that provides full process services, hierarchical APIs, and pre-integrated solutions. It was designed to meet the unique needs of different developers and make AI adoption easier.

As AI becomes increasingly advanced, machines will continue to reach and surpass human performance in increasing numbers of tasks. As a result, AI will gradually redefine application development by increasingly replacing tasks that were developed in the traditional way. As such, applications will comprise AI-related software alongside traditional software. 

These two types of software are very different in terms of development, testing, and maintenance. The AI-related parts of applications actually require many services, which are often isolated from each other, if they’re available. Moreover, highly skilled data scientists are also frequently required to develop AI-related components. These resources aren’t generally available either. This makes application and solution development very challenging, something which holds back the pace of AI adoption.

With the Application Enablement layer, our aim is to simplify the AI-related parts of applications as much as possible, by providing complete full-pipeline services, layered APIs, and pre-integrated solutions. Our complete, full-process service is called ModelArts. It combines all the services needed to generate models, from data acquisition, to model training, to adaptation, into a one-stop service. Two of these services are important to highlight: adaptation and ExeML.

The first service to highlight is adaptation. Given the unavoidable performance deterioration of trained models, it’s especially important to constantly monitor the system and quickly implement necessary updates. We’re developing the following five sub-services to do this. 

  • Data distribution monitoring
  • Accuracy monitoring
  • Smart retraining data generation
  • Auto model adaptation
  • Local learning
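As an illustration of the first sub-service, data distribution monitoring, the sketch below compares the distribution of one feature at training time with its distribution in production using the population stability index (PSI), a common drift statistic. This is a generic technique chosen for illustration. The article does not say how ModelArts implements monitoring, and the function names, bin edges, and alert threshold here are all assumptions.

```python
import math


def histogram(values, edges):
    """Fraction of values in each bin [edges[i], edges[i+1]); the last bin
    also includes its upper edge. Values outside the edges count toward the
    total but fall in no bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return [c / len(values) for c in counts]


def psi(reference, live, edges, eps=1e-6):
    """Population stability index between two samples of one feature."""
    total = 0.0
    for p, q in zip(histogram(reference, edges), histogram(live, edges)):
        p, q = max(p, eps), max(q, eps)   # avoid log(0) on empty bins
        total += (q - p) * math.log(q / p)
    return total


edges = [0.0, 0.25, 0.5, 0.75, 1.0]
training = [i / 100 for i in range(100)]          # spread over all four bins
production = [0.5 + i / 200 for i in range(100)]  # shifted into the top two

ALERT_THRESHOLD = 0.2   # illustrative rule-of-thumb value, not from the source
drift = psi(training, production, edges)
print(f"PSI = {drift:.2f}, retrain? {drift > ALERT_THRESHOLD}")
```

A monitor of this kind would run per feature on a schedule. A score above the threshold could then trigger the later steps in the list, such as generating retraining data and adapting the model.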

The second service to spotlight is ExeML. Given the tremendous capabilities of the large-scale training system Ascend Cluster, we believe that it's possible to take machine learning automation to a new level. 

In addition to automating model building and optimizing the training process, ExeML was designed to automate execution-oriented model generation and adaptation to the deployment environment, including optimizing inference latency, hardware resources, and operator adaptation for specific deployment environments.

As such, ExeML is the first system for generating environments that has been designed from the start to offer optimal generation performance at minimal cost. It will be available in the second quarter of 2019 along with the Ascend 910.

AI is still in the early stages of development, and pre-integrated solutions are crucial for simplifying AI adoption. Launched a year ago, Huawei Cloud EI already has a wealth of pre-integrated solutions for cities, manufacturing, logistics, and health. It’s worth underlining the fact that our full-stack combination is not a closed system, but based on open architecture. Huawei Cloud EI also supports GPUs.

We believe that to drive AI adoption and truly realize AI everywhere, AI solutions must be all-scenario native. Harnessing Huawei's all-scenario native, full-stack AI portfolio, we’re ready to provide AI support to every person, home, and organization, and together drive AI to new horizons.
