
PyTorch: An Ever-Growing AI Framework Among AI Practitioners

2023.01.12

Abstract

Deep learning is a subset of Artificial Intelligence (AI) that focuses on using neural networks to enable machines to mimic the human brain. Carrying out such processes on computers requires specialized frameworks, and AI practitioners need to be familiar with them. PyTorch [1] is an AI framework developed by Facebook in 2016 that enables neural networks to be implemented on computers for research and development in the domain of AI. However, there are other AI frameworks on the market, developed by other institutes, that provide different functionalities to AI practitioners. Despite PyTorch [1] being relatively new, it has been widely adopted by researchers and developers worldwide for various reasons. Which framework is better has long been a matter of debate, much like which programming language is better. In this paper, the focus is placed on the factors that make an AI framework widely adopted, such as community, ease of use, speed, flexibility, control, and functionality.

Introduction

As interest in deep learning has grown over the years, the number of AI tools and frameworks has grown with it. AI frameworks are of great importance in the AI world, as they provide an interface through which practitioners can implement the mathematical operations of deep learning models on computers.


Figure 1. High-level mathematical representations of different types of neural networks.

For the sake of simplicity, Figure 1 shows some of the most common types of neural networks, although many more exist. The purpose of this figure is to show that, at a theoretical level, every neural network is built on solid mathematical foundations. Converting these mathematical operations into computer programs involves a great deal of programming just to ensure the implementation is correct. Since these mathematical foundations are already laid out, AI practitioners should spend less time on ground-up programming and more time on research and development. This is where AI frameworks come into play: the majority of the mathematical operations take place in the backend, so AI practitioners only need to focus on putting components together, research, analysis, and so on.
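To make this concrete, here is a minimal sketch (assuming PyTorch [1] is installed; the layer sizes and batch are arbitrary illustrations, not from this paper) of how a practitioner composes a small network from prebuilt components while the framework executes the underlying linear algebra in the backend:

```python
import torch
import torch.nn as nn

# A small feed-forward network assembled from prebuilt components;
# the matrix multiplications and activations run in the framework backend.
model = nn.Sequential(
    nn.Linear(784, 128),  # fully connected layer: 784 inputs -> 128 units
    nn.ReLU(),            # non-linear activation
    nn.Linear(128, 10),   # output layer, e.g. for 10 classes
)

x = torch.randn(32, 784)   # a batch of 32 random input vectors
logits = model(x)          # forward pass handled by the framework
print(logits.shape)        # torch.Size([32, 10])
```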

Some of the most widely used frameworks are TensorFlow [2], Caffe [3], Theano [4], etc. Their common approach is that, for whatever model the end user wants to implement, a static dataflow graph representing the computations to be executed is built ahead of time. A major downside of this approach is that the resulting computation graph is static in nature, which compromises ease of use, debugging, control, etc.

Static dataflow graph computation is a concern because advancements in AI require intensive research efforts. Research involves designing many experiments, which often requires access to every part of the neural network and, in some cases, dynamically tweaking the model and the hyperparameters under which it operates. If an AI framework provides such flexibility, it becomes much easier for researchers to draw conclusions about experiments and model behavior. However, since a large number of AI frameworks are based on static dataflow graphs, it is very difficult for researchers to tweak models and hyperparameters on the go. PyTorch [1] is based on dynamic graph computation, which largely solves these problems and is one of the prime reasons for its adoption among AI practitioners.
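As a minimal sketch of what dynamic graph computation enables (the module and the branching threshold below are hypothetical illustrations, not from this paper): because PyTorch [1] rebuilds the graph on every forward pass, ordinary Python control flow can change the computation from one input to the next.

```python
import torch
import torch.nn as nn

class DynamicNet(nn.Module):
    """Hypothetical module whose depth depends on the input itself."""

    def __init__(self):
        super().__init__()
        self.shallow = nn.Linear(16, 4)
        self.deep = nn.Sequential(
            nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)
        )

    def forward(self, x):
        # Ordinary Python branching: the graph is built anew on every
        # call, so different inputs can take entirely different paths.
        if x.norm() > 4.0:  # illustrative threshold
            return self.deep(x)
        return self.shallow(x)

net = DynamicNet()
print(net(torch.randn(1, 16)).shape)  # torch.Size([1, 4]) on either path
```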

Scientific Computation in Deep Learning


Figure 2. State-of-the-art computation pillars of deep learning.

AI frameworks must be built on top of the following pillars of scientific computation in deep learning.

1. Multi-Dimensional Data Processing – Owing to the volume and dimensionality of its data, deep learning often deals with high-dimensional data, which can be thought of mathematically as multi-dimensional arrays and matrices. Performing mathematical operations on such data requires dedicated libraries such as NumPy [5] (illustrated in the sketch after this list).

2. Python Ecosystem – The Python programming language has been widely adopted by AI practitioners for numerous reasons: it is a high-level, open-source language with built-in data structures, support for extensive libraries, and more. This is why any AI framework must be part of the Python ecosystem. The scientific community initially relied on proprietary software such as MATLAB but eventually moved toward open-source scientific libraries such as NumPy [5], Pandas [6], and SciPy [7], which allow practitioners to implement machine learning models and perform analysis. These well-established libraries are part of the Python ecosystem.

3. Automatic Differentiation – One of the biggest ideas behind deep learning is the loss function: an objective function defined when training a deep learning model, through which the model learns patterns from data. Training involves computing a large number of derivatives, because it is through derivatives that an objective function is minimized or maximized, which is the goal of training. Derivatives are usually computed at every layer and every connection of the model. With automatic differentiation, this intensive computation of derivatives is fully automated; the autograd engine in PyTorch [1] provides this functionality (also shown in the sketch after this list).

4. Hardware Accelerators – Alongside the availability of big data, hardware accelerators are the other main reason behind the rise of deep learning. Accelerators such as Graphics Processing Units (GPUs) provide computing power and parallel-processing capabilities. AI frameworks such as PyTorch [1] and TensorFlow [2] are equipped with kernels that allow them to utilize hardware accelerators (see the sketch below).
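The following minimal sketch ties pillars 1, 3, and 4 together (the tensor shapes and the toy objective are illustrative assumptions, not from this paper):

```python
import torch

# Pillar 1: multi-dimensional data represented as tensors.
x = torch.randn(64, 3, 32, 32)  # e.g. a batch of 64 small RGB images

# Pillar 4: move computation onto a hardware accelerator when present.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = x.to(device)

# Pillar 3: automatic differentiation via the autograd engine.
w = torch.randn(3 * 32 * 32, 10, device=device, requires_grad=True)
loss = (x.flatten(1) @ w).pow(2).mean()  # a toy scalar objective
loss.backward()                          # derivatives computed automatically
print(w.grad.shape)                      # torch.Size([3072, 10])
```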

Understanding the growth of PyTorch

One measure of an AI framework's dominance among AI practitioners is the number of papers that use the framework at major AI conferences.


Figure 3. Percentage of papers that use PyTorch [1] with respect to both PyTorch [1] and TensorFlow [2].

The legend in Figure 3 lists top AI conferences that are highly sought after by AI practitioners. The x-axis denotes the year, and the y-axis denotes the ratio of papers mentioning PyTorch [1] to papers mentioning either PyTorch [1] or TensorFlow [2]. Regardless of the conference, the share of papers mentioning PyTorch [1] has been rising since 2018, and by 2019 PyTorch [1] accounted for more than 50% of such papers at the majority of conferences.


Figure 4. Number of unique mentions of PyTorch [1] and TensorFlow [2] in the papers submitted to the top AI conferences.

In Figure 4, the solid lines show the growth of PyTorch [1] and the dotted lines show the growth of TensorFlow [2]. In 2018, PyTorch [1] was in the minority, with far fewer unique mentions than TensorFlow [2]. By 2019, however, the growth rate of papers mentioning PyTorch [1] had increased sharply, while the growth rate of TensorFlow [2] was negative at conferences such as NAACL, ACL, and ICLR; PyTorch [1], meanwhile, never experienced a negative growth rate at any of the conferences. At general machine learning conferences such as ICML and ICLR, PyTorch [1] is more popular than TensorFlow [2]. Computer vision and natural language processing are the two main fields of deep learning, and at their conferences PyTorch [1] heavily outnumbers TensorFlow [2]: by a ratio of 2:1 at computer vision conferences and 3:1 at natural language processing conferences.

Clearly, there must be several reasons behind the rise of PyTorch [1], given how many researchers are adopting the framework over time. Some of the key reasons are as follows:

    • Dynamic graph definitions – as mentioned, the majority of frameworks, such as TensorFlow [2], use static graphs, meaning graphs have to be defined statically before running a model. In PyTorch [1], however, nodes can be defined, changed, and executed dynamically. Such frameworks can support the research and development of dynamic neural networks. For example, incremental learning is an upcoming research area in deep learning whose goal is to enable neural networks to learn continuously from new incoming data, as opposed to traditional models that are trained beforehand on big data and never learn on the fly. In such scenarios, the architecture and the learning parameters of the model may need to adapt dynamically over time, and frameworks based on dynamic graphs are far better suited to this than static ones.

    • Debugging – compared to software engineering, debugging in deep learning is much harder, as most deep learning bugs are invisible. In software engineering, a large part of the code base is designed and written by the engineers themselves, giving them more control: if an error occurs, it is fairly straightforward to debug a program they largely wrote. In deep learning, by contrast, a large part of the modelling is handled by the framework, since most forward- and backward-propagation operations take place in the backend. This reduces the coding effort but takes away a large part of the control from end users, making debugging more challenging. In PyTorch [1], since computation is based on dynamic graphs, standard Python debugging tools such as ipdb and pdb can be used, along with print statements (see the sketch after this list). This is not the case for frameworks such as TensorFlow [2]: with static graphs, not all Python code can be debugged directly, so special tools are required. Furthermore, PyTorch [1] provides a low-level API, meaning practitioners need to make more effort to build models, but in return gain more control, which makes debugging somewhat easier than in frameworks with high-level APIs. Following the growth of PyTorch [1], an enhanced version of TensorFlow, known as TensorFlow 2.0, was released in 2019.

    • API – as mentioned, PyTorch [1] provides a low-level API, which brings several advantages. Researchers often perform deep dives into models, which requires granular access for tweaking and experimenting. This is a big reason why researchers prefer PyTorch [1] over other frameworks: granular access to models is key to innovation in deep learning applications such as computer vision and natural language processing.

    • Data parallelism – PyTorch [1] offers declarative data parallelism that makes GPU utilization effortless. The declaration torch.nn.DataParallel can be used to wrap any module and parallelize it over the batch dimension, leveraging multiple GPUs (see the sketch after this list).

    • Production – based on the results shown in Figure 3 and Figure 4, it is evident that PyTorch [1] is a growing framework in research; when it comes to production, however, it is not PyTorch [1] but TensorFlow [2] that takes the lead. One reason is simply that PyTorch [1] was released after TensorFlow [2], but there are several others. Researchers largely focus on producing a novel concept in a research area, which often involves experimenting on small datasets with a few GPUs. While a novelty is a step forward in the research world, at the industrial level this is not large-scale enough. For example, reducing the run time of a program by a few minutes may not matter much to researchers, but the same reduction could save companies large amounts of spending. In production, the focus must cover both servers and edge devices: some companies run servers where the Python runtime overhead is very high, the Python interpreter cannot be embedded into a mobile binary, and so on. These are requirements and restrictions that researchers do not face, so frameworks such as TensorFlow [2] win in this area, with solutions that address industrial constraints; TensorFlow Lite and TensorFlow Serving already cater to mobile and serving scenarios. Historically, PyTorch [1] has fallen behind here, although with developments such as PyTorch Lightning the community seems to be catching up.
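As a minimal sketch of the debugging and data parallelism points above (the tiny module, shapes, and device logic are hypothetical illustrations): a plain print statement works inside forward thanks to eager execution, and torch.nn.DataParallel wraps the module to split each batch across available GPUs.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(100, 10)

    def forward(self, x):
        # Eager execution: plain print statements (or pdb.set_trace())
        # work inside the forward pass while the model runs.
        print("forward received a batch of shape", x.shape)
        return self.fc(x)

model = TinyNet()
if torch.cuda.device_count() > 1:
    # Replicates the module and splits each input batch across GPUs,
    # gathering the outputs back on the default device.
    model = nn.DataParallel(model).to("cuda")

out = model(torch.randn(8, 100))
print(out.shape)  # torch.Size([8, 10])
```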

Decision guide on choosing an AI framework

As mentioned, the purpose of an AI framework is to ensure smooth research and development of AI models, so there is no right or wrong answer as to which framework is best. It all depends on user needs and various other scenarios.

1. Community – A framework's maturity often depends on how large and strong its community is. The larger the community, the more support the framework receives: for example, its discussion forums will be populated with diverse topics, Q&As, etc. Developers often rely on online sources for syntax, tutorials, and more, so the larger the community surrounding a framework, the greater the abundance of such materials online. A large-scale open-source repository is also an important feature of a mature framework community.

2. Documentation – Well-detailed, comprehensive documentation makes it very easy for developers to learn the functionality of a framework at a technical level and serves as a reference. This aspect is correlated with community support: the larger the community, the more people maintain and update the framework's documentation on a regular basis.

3. Learning curve and visibility – Developers also need to consider whether a framework is low level or high level, and whether they will use it mostly for research or production. Low-level frameworks such as PyTorch [1] require more coding than high-level frameworks but in return allow more control over the program and easier debugging. However, if developers want to quickly implement a model and get quick results, a high-level framework such as Keras will probably be more useful.

4. Hardware support – Many developers simply download an AI framework onto their local machines (laptops or desktops) to start using it. Depending on the use case, however, the framework may also need to run on microcontrollers, in the cloud, on smart edge devices, etc. It is therefore important to check whether a framework can run on devices with various processors such as CPUs, GPUs, and TPUs (see the sketch after this list). The more devices an AI framework supports, the higher the chance of it being used by more developers.
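As a small illustration of point 4 (a minimal sketch; the layer and shapes are arbitrary assumptions), the same PyTorch [1] script can target whichever processor is present:

```python
import torch

# Device-agnostic setup: the same script runs on a laptop CPU or on a
# GPU-equipped server without code changes.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(20, 5).to(device)   # arbitrary toy layer
batch = torch.randn(16, 20, device=device)  # data created on-device
print(model(batch).shape, "on", device)
```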

Transition of PyTorch to the Linux Foundation

Following the inception of PyTorch [1] in 2016 at Meta (then Facebook), merely six years later, in 2022, the framework was handed over by Meta to the Linux Foundation. Until then, while PyTorch [1] was under Meta's supervision, the entire project was maintained by engineers and scientists working for Meta. With this move, PyTorch [1] has been democratized, and the global community will be able to contribute to the framework. The overall objective is to accelerate progress in AI research.

PyTorch's [1] adoption rate has been very high, and the team at Meta is not big enough to manage the demands of its global user community. Another reason PyTorch [1] was put under global community control is to avoid potential conflicts of interest as it is adopted across various industries. Going forward, the creation of the PyTorch [1] foundation means decisions will be made transparently by a diverse board that includes many AI leaders, among them Meta, Amazon, Google, Microsoft, and Nvidia, all of whom have helped get the community to where it is today. Figure 5 below shows the growing interest in PyTorch [1] in some of the countries with the most advanced AI practices.

Figure 5. AI framework interest over 5 years across countries with large efforts toward AI initiatives: (a) framework interest in Canada, (b) framework interest in the USA, (c) framework interest in China, (d) framework interest in the UK.

Figure 5 shows interest over the last five years in the most commonly used frameworks in a few countries where AI research and development is very advanced. The blue line denotes TensorFlow [2], the orange line PyTorch [1], and the grey line Keras; the graphs were created using Google Trends. One pattern is clear in every country: five years ago, TensorFlow [2] was the leading AI framework, but after a certain point PyTorch [1] became more searched than TensorFlow [2], whose interest declined over time, while interest in Keras stayed nearly the same.

Linux Foundation and the Cloud Native Computing Foundation

The Cloud Native Computing Foundation (CNCF) is also part of the Linux Foundation. Cloud native is an approach that empowers organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. The aim of the CNCF is to drive adoption of this paradigm by sustaining an ecosystem of open-source, vendor-neutral projects; in this way, the CNCF democratizes state-of-the-art patterns and makes these technologies accessible to everyone.

This is where Huawei sets a very good example: its internal I.T. department runs 8 data centers with 800+ applications on 100,000+ virtual machines (VMs) serving 180,000 users. As the number of applications grew further, the cost and efficiency of managing and deploying these VM-based apps became challenging in such a highly complex distributed system. These challenges pushed Huawei toward a more agile practice, which is one of the main reasons Huawei is today one of the biggest contributors to containerization technologies. Having recognized the complexity of its distributed system, the timing was right to look into containerization, and Huawei began moving its internal applications to run on Kubernetes, cutting operating expenses by around 20-30%.

As a founding and platinum member of the CNCF, Huawei has actively joined industrial and ecosystem development. Drawing on its practice and experience with Kubernetes, Huawei has applied Kubernetes in many business scenarios, rolled out ServiceStage and Cloud Container Engine (CCE) on its own public cloud, and was in the first wave to gain the Kubernetes Certified Service Provider qualification. At the same time, as one of the earliest adopters of Kubernetes, Huawei keeps feeding back to the community and has participated actively in more than 10 SIGs, such as Federation, Architecture, Auth, Resource Work, and ContainerPolicy, on discussion, design, and development efforts. At present, Huawei's overall contribution to the community ranks first in China and fifth in the world, with one steering committee member seat.

Future of AI frameworks

In the research domain, PyTorch [1] is the dominant framework and is now focusing on doing the same in the industrial market, while TensorFlow [2], dominant in industry, is working to win a broader audience among researchers. Usually, once a software product or framework is highly established in industry, changes are harder to make, because industry adopts change more slowly than researchers do.

In the future, many PhD researchers will graduate well versed in PyTorch [1] and move into either industry or academia. Given this trend, two things will need to start working in tandem:

    • Production capabilities of PyTorch [1] will need to start maturing.

    • Companies may start focusing on hiring candidates with skills in PyTorch [1].

For PyTorch [1] to retain its users and maintain a strong grip on the community, it needs to mature its production capabilities. If it does, PyTorch [1] will become a dominant framework in both academia and industry, which will in turn lead companies to hire candidates with PyTorch [1] skills as they begin productionizing models built in PyTorch [1]. With the release of PyTorch [1] 1.5 in April 2020, an initial version of TorchServe was also released: a library for serving PyTorch [1] models in production. TorchServe offers features such as multi-model serving, monitoring metrics, model versioning for A/B testing, and RESTful endpoints for application integration, and it supports machine learning environments including Kubernetes, Amazon SageMaker, etc. Meta and AWS are working together to enhance PyTorch's [1] capabilities over time, with Meta currently using infrastructure provided by AWS to do so.
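As a hedged sketch of the RESTful endpoints mentioned above (it assumes a model has already been archived and registered with a local TorchServe instance; the model name my_classifier and the input file are placeholders, not from this paper), a client can request predictions over HTTP:

```python
import requests  # third-party HTTP client: pip install requests

# TorchServe exposes one inference endpoint per registered model on
# port 8080 by default; "my_classifier" and "kitten.jpg" are placeholders.
with open("kitten.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:8080/predictions/my_classifier",
        data=f.read(),
    )
print(resp.status_code, resp.text)  # e.g. class scores from the handler
```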

References

[1] A. Paszke et al., “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” Adv. Neural Inf. Process. Syst. (NeurIPS), 2019. [Online]. Available: http://arxiv.org/abs/1912.01703.

[2] M. Abadi et al., “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems,” 2016, [Online]. Available: http://arxiv.org/abs/1603.04467.

[3] Y. Jia et al., “Caffe,” in Caffe: Convolutional Architecture for Fast Feature Embedding, Nov. 2014, pp. 675–678, doi: 10.1145/2647868.2654889.

[4] The Theano Development Team et al., “Theano: A Python framework for fast computation of mathematical expressions,” pp. 1–19, 2016, [Online]. Available: http://arxiv.org/abs/1605.02688.

[5] C. R. Harris et al., “Array programming with NumPy,” Nature, vol. 585, no. 7825, pp. 357–362, Sep. 2020, doi: 10.1038/s41586-020-2649-2.

[6] W. McKinney, “Data Structures for Statistical Computing in Python,” Proc. 9th Python Sci. Conf., pp. 56–61, 2010, doi: 10.25080/majora-92bf1922-00a.

[7] P. Virtanen et al., “SciPy 1.0: fundamental algorithms for scientific computing in Python,” Nat. Methods, vol. 17, no. 3, pp. 261–272, Mar. 2020, doi: 10.1038/s41592-019-0686-2.

Applied Intelligence Office - Huawei International