
# Joint Message Passing and Autoencoder for Deep Learning

This MPA enabling feature would allow not only a coarse learning during the training cycle but also an adaptive inference (reasoning) during the inference cycle.

*Authors (all from Huawei 6G research team): Yiqun Ge ¹, Wuxian Shi ¹, Jian Wang ², Rong Li ², Wen Tong ¹*

1. Ottawa Wireless Advanced System Competency Centre
2. Wireless Technology Lab

## 1 Introduction

### 1.1 AE-based Global Transceiver

There are two different paradigms for applying the deep neural network (DNN) to a wireless transceiver design: block-by-block optimization (shown in Figure 1) and end-to-end (E2E) or global optimization (shown in Figure 2).

Figure 1 AE-based block-by-block transceiver

Figure 2 AE-based global transceiver

Since 2017, much effort has been devoted to studying the AE-based global transceiver, which is a natural and straightforward step in applying native AI for wireless communications after observing potential benefits from AE-based block-by-block optimization. Many papers have been published to discuss its feasibility and benchmark its performance in terms of inference against other DNN structures and training methods.

A classical physical-layer transceiver assumes reliable information reconstruction at the receiver as a universal transmission goal. To tackle varying hostile channels, it estimates the current channel by reference signals and then equalizes estimated channel distortions.

In comparison, the AE-based global transceiver simultaneously optimizes both the deep neural layers in the transmitter and those in the receiver for a given source and a given goal in a given channel environment. Specifically, the AE-based global transceiver extracts the most essential information associated with a given goal, including information reconstruction. If the goal is simpler than reconstruction, a smaller amount of essential information needs to be extracted, saving some transmission resources. If the AE-based global transceiver can learn nearly all the possible radio channel realization scenarios into its neurons in a given channel environment, it can save more transmission resources by reducing reference signals and control message overheads while achieving the same block error rate (BLER) performance. Both savings are paramount for a wireless system to achieve higher spectral efficiency. Figure 3 illustrates that the AE-based transceiver bridges three factors: source distribution, goal orientation, and channel environment.

Figure 3 Source-, goal-, and channel-orientation for higher spectrum efficiency

Moreover, the AE-based global transceiver would facilitate an optimal design flow of the transceiver in machine learning (ML)-based data-driven learning methods.

### 1.2 Current Issues with the AE-based Global Transceiver

Among the technical hurdles that need to be overcome for a simple and straightforward AE-based global transceiver, we underscore three key issues.

**Issue 1:** Training an AE by a stochastic gradient descent (SGD)-based backward propagation algorithm demands one or more differentiable channel model layers that connect the deep neural layers in the transmitter and those in the receiver, as illustrated in Figure 4. Because a true channel must include many non-linear components (such as digital/analog predistortion and conversion) and non-differentiable stages (such as upsampling and downsampling), the deep neural layers in the transceiver are trained for a compromised channel model rather than a true one, which might cause performance loss during the inference cycle with a true channel.

Figure 4 Simplified and differentiable channel model layer(s)

**Issue 2:** All hidden or intermediate layers are trained according to the posterior probability of their input signals, as shown in Figure 5. In the AE-based global transceiver, the first of the deep neural layers in the receiver is an intermediate layer whose input signal is subject to the current channel distortion. The impact of this distortion inevitably propagates forward through all the deep neural layers in the receiver. If the channel changes drastically beyond what was expected during training, the receiver becomes obsolete during inference.

**Issue 3:** A lack of interpretability from one neural layer to another hides the information about which neurons and connections between the layers are critical for final learning accuracy. Goodfellow et al. give an example where a DNN classifier that has been well trained with noise-free images may misclassify a noisy image of a panda as a gibbon, as shown on the right of Figure 6. This example implies that a DNN-based classifier heavily relies on some "shortcuts" or "localities" (i.e., some pixels in the image of a panda) for its final decision. If the shortcuts are intact, the classification will be correct. If the shortcuts are perturbed, the classification will be incorrect. Furthermore, noise-triggered panda-to-gibbon misclassification happens only occasionally with additive random noise, indicating that the DNN bets on the "shortcuts" remaining intact through a noisy channel. The reality that a DNN is vulnerable to additive random noise could be disastrous to the application of DNN in wireless transceiver design.

Figure 5 Deep neural layers in the receiver depend on the posterior probability of their input signals

Figure 6 AE-based global transceiver and adversarial attack with additive noise

### 1.3 Out of Distribution (OOD) and Generalization

The existing solutions to these issues are unfortunately hindered by the practical requirements of low energy, low latency, and reduced overhead in the devices and infrastructure of wireless communications. It is too costly for the AE-based transceiver to accumulate data, augment it, and re-train itself in a dynamic environment. Moreover, doing so runs against the DNN's "Once-for-All" strategy, i.e., to learn once and to work for as long as possible, which is needed to meet realistic energy consumption requirements.

All three issues are rooted in the same core problem: DNN's poor generalization against the random variation of wireless channels. Because no model (even a superior channel model) can exhaustively capture all possible scenarios of radio propagation, the AE is destined to confront the so-called out-of-distribution (OOD) or outlier problem in reality.

We address this problem in our proposal: adapting the AE-based transceiver to variations of the real-world random wireless channel. An AE-based transceiver framework should provide sufficient generalization against OOD cases in a data-driven manner.

A native-AI contains two different cycles: training and inference (or reasoning). Training needs a massive number of raw data samples and is typically performed with offline computing. Conversely, inference is performed on real-world data sample(s) on the fly.

Inference can be interpreted as either interpolation or extrapolation. Because extrapolation is far less reliable than interpolation, inference accuracy depends on the distributions of true data samples. If a true data sample represented by the blue circles in Figure 7 is within an area to be interpolated by the training data samples represented by the orange circles, the trained AE-based transceiver is likely to handle it properly, as shown on the left of Figure 7. Otherwise, the transceiver would be unlikely to conduct a reliable inference over an outlier data sample, as illustrated by the "unmodeled outlier" dot (in blue) on the right of Figure 7.

Figure 7 Outlier data samples degrade inference accuracy
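The gap between interpolation and extrapolation can be illustrated with a toy regression. The following is only a minimal sketch: the sine target, the degree-5 polynomial standing in for a trained model, and the query ranges are all illustrative assumptions and are not taken from the transceiver described in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training samples are drawn only from [0, 1], the "modeled" region.
x_train = rng.uniform(0.0, 1.0, size=200)
y_train = np.sin(2.0 * np.pi * x_train)

# A degree-5 polynomial stands in for the trained model (an assumption).
coeffs = np.polyfit(x_train, y_train, deg=5)

def abs_error(x):
    """Absolute error of the fitted model at the query points x."""
    return np.abs(np.polyval(coeffs, x) - np.sin(2.0 * np.pi * x))

x_inside = np.linspace(0.1, 0.9, 50)    # queries inside the training range
x_outside = np.linspace(1.5, 2.0, 50)   # outlier queries beyond the training range

err_inside = abs_error(x_inside).max()
err_outside = abs_error(x_outside).max()
print(f"max error inside: {err_inside:.4f}, outside: {err_outside:.4f}")
```

In this toy setting the model is accurate wherever training samples surround the query and degrades sharply once the query leaves that region, which is the behaviour attributed to outliers above.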

In a wireless context, these outliers are often caused by random variation of channels, especially when the channel involved in the inference cycle is shifting away from the channel model used by the training cycle. As inference proceeds over time, increasingly frequent outliers reshape the distribution of the received signal, which Bengio explains as poor generalization of deep learning. Although current remedies may include some extra training, such as transfer learning, attention-based recurrent networks, or reinforcement learning, none are practical or realistic given the low-energy, low-latency, and reduced control overhead requirements of future wireless communications.

To make the AE-based global transceiver viable, our primary task is to find an effective and real-time method against outliers or OOD during the inference cycle.

In this paper, we will enable the AE-based global transceiver with a message-passing-algorithm (MPA) based precoder layer to improve generalization performance in dynamic channel variations. In section 2, based on the core idea of dimensional transformations, we propose the insertion of a dimension reduction layer into the AE framework. In section 3, we describe how to train the MPA-enabled AE-based transceiver. In section 4, we look at some applications of the MPA-enabled AE-based transceiver. In section 5, we provide our conclusions.

## 2 Deep MPA Precoder

In this section, we introduce the dimension reduction layer appended to the transmitter as an example and explain how the MPA-based precoder works and how it improves generalization performance against variations of random channels. First, a dimension reduction layer is appended to the deep neural layers in the transmitter to conduct a linear dimension reduction transformation from L (the number of dimensions extended by the preceding deep neural layers) to N (the degrees of freedom of the current channel). We assume that the current N-degree-of-freedom channel measurement is available to the transmitter. In practice, this is realized by frequent channel feedback from the receivers or by uplink/downlink channel reciprocity. The dimension reduction layer appended to the transmitter is particularly interesting for the downlink, in which the deep neural layers in the terminal receivers are supposed to remain unchanged for as long as possible.

Tuning a linear dimension reduction transformation by MPA enhances generalization against outliers caused by the variations of channels during the inference cycle. Tuning or adjusting the coefficients of the dimension reduction layers is an iterative algorithm that includes forward message passing from functional nodes to variable nodes and backward message passing from variable nodes to functional nodes. Before describing the iterative adjusting procedure over a dimension reduction layer, we will introduce two widely used native-AI technologies: support vector machine (SVM) and attention DNN.

To better describe the working mechanism of the introduced transceiver, we first provide the key parameters used in this paper.

### 2.1 Forward Iteration (Intonation) and Non-linear SVM

An SVM is a supervised machine learning model used for data classification, regression, and outlier detection. In general, an SVM model is composed of a non-linear dimension extension function \(\varphi(\cdot)\); a linear combination function \(f(\vec{x})=\mathbf{w}\cdot \varphi (\vec{x})+\vec{b}\); and a binary classification function \(\operatorname{sign}(\cdot)\), where \(\vec{x}\) is the input data, \(\mathbf{w}\) is the weight coefficient vector, and \(\vec{b}\) is the bias vector, as shown in Figure 8. The objective of SVM is to find the maximum-margin hyperplane that divides the data samples into classes, as shown in Figure 9.

Geometrically, \(\mathbf{w}\cdot \varphi (\vec{x})+\vec{b}\) forms an N-dimensional hyperplane. Some hyperplanes are better than others, and the quality of a hyperplane can be measured by its margin. In the example shown in Figure 9, the hyperplane on the right is better than the one on the left, because the one on the right has a larger margin.
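As a rough illustration of the three components, the sketch below lifts XOR-style points with a hypothetical \(\varphi(\cdot)\) that appends the product of the two coordinates, applies a hand-picked (untrained) hyperplane, and evaluates the geometric margin. The feature map, the weights, and the scalar bias are assumptions chosen for illustration only.

```python
import numpy as np

def phi(x):
    """Non-linear dimension extension: lift 2-D inputs to 3-D (hypothetical choice)."""
    x = np.atleast_2d(x)
    return np.column_stack([x[:, 0], x[:, 1], x[:, 0] * x[:, 1]])

def svm_decision(x, w, b):
    """Linear combination f(x) = w . phi(x) + b followed by the sign classifier."""
    return np.sign(phi(x) @ w + b)

# XOR-style data: not linearly separable in the original 2-D input space.
X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, -1.0, 1.0])

# Hand-picked hyperplane in the lifted space (illustrative, not trained).
w = np.array([0.0, 0.0, 1.0])
b = 0.0

pred = svm_decision(X, w, b)

# Geometric margin of this hyperplane: min_i y_i * f(x_i) / ||w||.
margin = np.min(y * (phi(X) @ w + b)) / np.linalg.norm(w)
```

The points cannot be separated by a line in the input space, but a single hyperplane separates them after the dimension extension, which is the effect the SVM structure relies on.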

Table 1 System parameters

Figure 8 A non-linear SVM for binary classification

Figure 9 SVM function defines a hyperplane to classify users

The mathematical description of SVM optimization is as follows:

\(\left\langle\mathbf{w}^{*}, \vec{b}^{*}\right\rangle=\underset{\mathbf{w}, \vec{b}}{\operatorname{argmin}}\left(l\left(\vec{y}, \widehat{\vec{y}}\right)+\frac{\|\mathbf{w}\|^{2}}{2}\right)\),

where \(l\left ( \vec{y},\widehat{\vec{y}} \right )\) is a given loss measurement function (like a training goal in DNN). There are several ways to approach the optimal solution \(\left \langle \mathbf{w}^{*},\vec{b}^{*} \right \rangle\), such as direct MSE (minimum square error) or SGD. SVM is the predecessor of DNN.
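The optimization can be sketched with plain subgradient descent. This is an assumption-laden toy: the hinge loss is used as the loss \(l\) (the text leaves the loss unspecified), the feature map is a hypothetical one that appends the product of the two coordinates, the bias is a scalar, and the step size and iteration count are arbitrary.

```python
import numpy as np

def phi(X):
    """Hypothetical dimension extension: append the product x1 * x2."""
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])

def objective(w, b, Phi, y):
    """Hinge loss (assumed choice of l) plus the ||w||^2 / 2 regulariser."""
    hinge = np.maximum(0.0, 1.0 - y * (Phi @ w + b)).sum()
    return hinge + 0.5 * np.dot(w, w)

X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, -1.0, 1.0])
Phi = phi(X)

w, b, lr = np.zeros(3), 0.0, 0.1
start = objective(w, b, Phi, y)
for _ in range(200):
    active = y * (Phi @ w + b) < 1.0        # samples that violate the margin
    grad_w = w - (y[active][:, None] * Phi[active]).sum(axis=0)
    grad_b = -y[active].sum()
    w, b = w - lr * grad_w, b - lr * grad_b

pred = np.sign(Phi @ w + b)
final = objective(w, b, Phi, y)
```

Only \(\mathbf{w}\) and the bias are updated here while \(\varphi(\cdot)\) and \(\operatorname{sign}(\cdot)\) stay fixed.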

The study of non-linear SVM tells us three things:

- Non-linear dimensional extension transformation \(\varphi(\cdot)\) followed by linear dimension reduction transformation \(\mathbf{w}\cdot \varphi (\vec{x})+\vec{b}\) can improve classification accuracy, laying a foundation for the MPA-enabled AE-based global transceiver, as shown in Figure 10.
- The linear dimension reduction transformation \(\mathbf{w}\cdot \varphi (\vec{x})+\vec{b}\) tunes a hyperplane that separates classes, revealing the mechanism of a forward dimension reduction layer.
- The hyperplane (\(\mathbf{w}\cdot \varphi (\vec{x})+\vec{b}\)) is tuned with fixed \(\varphi(\cdot)\) and \(\operatorname{sign}(\cdot)\), which indicates a tandem-like training scheme with both the dimension reduction layer and other deep neural layers, as well as an adjustable inference, as shown in Figure 17 later in section 3.1.

Figure 10 MPA-enabled AE vs non-linear SVM

Figure 11 Dimension reduction layer with SVM

Hence, the dimension reduction layer in both training and inference cycles keeps adjusting an intermediate hyperplane that helps final classification by the deep neural layers in the receiver. In a wireless context, the hyperplane must be adjusted according to current variations of channels.

Taking advantage of the dimensional transformation of SVM, we can transform the dimension of the transmitter's output, L, to the dimension of the communication channel measurement, N. Figure 11 shows the detailed forward iteration with SVM. In particular, the input of the dimension reduction layer is an L-dimensional feature matrix \(F=\left [ f_{1}, f_{2},\ldots, f_{L}\right ]\), where \(f_{i}\) is the i-th K-dimensional input feature vector. The output of the dimension reduction layer is an N-dimensional feature matrix \(T=\left [ t_{1}, t_{2},\ldots, t_{N}\right ]\), where \(t_{i}\) is the i-th K-dimensional output feature vector. When the output feature vectors are transmitted via communication channels, the received signal is given by

\(\boldsymbol{r}_{i}=\sum_{l=1}^{L} \alpha_{l, i} \cdot \boldsymbol{f}_{l} \cdot \boldsymbol{h}_{i}+\boldsymbol{n}_{i}, \quad i=1, \ldots, N\),

where \( \alpha _{l,i} \) is the coefficient of the connection between neuron l and neuron i.

Based on the preceding description, we can conclude that the forward sub-iteration will keep fine-tuning the hyperplane of the SVM model in both the training and inference phases for the given transmitter's feature matrix F, channel state information H, noise vector N, and received signal R, as shown in Figure 12.
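The forward relation can be written as two matrix operations. The sketch below assumes real-valued signals, one scalar gain \(h_{i}\) per channel dimension, and Gaussian noise. The sizes, the random values, and the noise level are placeholders, not parameters from this paper.

```python
import numpy as np

rng = np.random.default_rng(1)
K, L, N = 4, 8, 3   # feature length, extended dimensions, channel degrees of freedom

F = rng.standard_normal((K, L))   # columns f_1..f_L produced by the transmitter layers
alpha = rng.random((L, N))        # alpha[l, i]: weight from feature l to channel dimension i
h = rng.standard_normal(N)        # one real gain per channel dimension (assumption)
sigma = 0.1                       # noise standard deviation (assumption)

def forward(F, alpha, h, noise):
    """r_i = sum_l alpha[l, i] * f_l * h_i + n_i for i = 1..N."""
    T = F @ alpha                 # t_i = sum_l alpha[l, i] * f_l, shape (K, N)
    return T * h + noise          # scale column i by h_i, then add the noise

noise = sigma * rng.standard_normal((K, N))
R = forward(F, alpha, h, noise)
```

Because the layer is a plain matrix product, it stays linear and differentiable in the forward direction.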

Figure 12 Hyperplane is adjusted over time

### 2.2 Backward Iteration and Attention DNN

The dimension reduction layer needs to be trained in a standalone mode rather than in a connected mode with backpropagation from the receiver. For this purpose, we use an attention DNN in the backward sub-iteration.

An attention DNN is an efficient approach that measures the similarity of two features with different dimensions. Figure 13 depicts the structure of the attention DNN. The input is the received signal R. The attention operation is conducted by computing the inner product of each \( r_{i} \) with an attention coefficient \(c_{l} \), i.e., \(\left \langle r_{i},c_{l} \right \rangle\). This inner product implies the similarity of the signal \( r_{i} \) and the attention coefficient \( c_{l} \), which is normalized by a softmax layer as

\( \alpha _{l, i}=\frac{\mathrm{e}^{\left\langle r_{i}, c_{l}\right\rangle}}{\sum_{n=1}^{N} \mathrm{e}^{\left\langle r_{n}, c_{l}\right\rangle}}, \quad i=1, \ldots, N.\)

Then, the output of the attention DNN is given by

\( z_{l}=\sum_{i=1}^{N} \alpha _{l, i} \cdot \boldsymbol{r}_{i}, l=1, \ldots, L\).

We note that the number of attentions is smaller than the number of received signals, i.e., L < N.

Figure 13 Structure of the attention DNN

The attention DNN can be employed in the dimension reduction layer for backpropagation. In particular, each extracted feature vector \( f_{l} \) can be used as an attention coefficient. Then, in the backward sub-iteration, the coefficient \( \alpha _{l,i} \) can be given by

\( \alpha _{l, i}=\frac{\mathrm{e}^{\left\langle r_{i}, f_{l}\right\rangle}}{\sum_{n=1}^{N} \mathrm{e}^{\left\langle r_{n}, f_{l}\right\rangle}}, i=1, \ldots, N, l=1, \ldots, L\).

The attention DNN allows a number of attentions \(c_{l}\), \(l=1,2,3,\ldots,L\), from which it generates L combinations \(z_{l}\), \(l=1,2,3,\ldots,L\). In a practical attention DNN, the number of attentions is smaller than the number of captured features, i.e., L < N, because in reality a great number of attentions cannot be held simultaneously.

In the MPA layer, to measure the similarity between the received signal space \(\left [ r_{1}, r_{2},...,r_{N} \right ]\) and extracted feature space \(\left [ f_{1}, f_{2},...,f_{L} \right ]\), we can borrow the preceding method. Suppose that the deep neural layers in the transmitter are well trained to extract \(\left [ f_{1}, f_{2},...,f_{L} \right ]\) for a specific goal. We can consider one feature \( f_{l} \) as one attention. This is how we compute the coefficients (\( \alpha _{l,i} \) , l=1,2,...,L, i=1,2,3,...,N) of the dimension reduction layer in the backward iteration.
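The backward computation amounts to a row-wise softmax over inner products. The following sketch assumes real-valued vectors stored as matrix columns. The function names and the random test data are hypothetical, and subtracting the row maximum before the exponential is a standard numerical safeguard that leaves the softmax value unchanged.

```python
import numpy as np

def attention_coefficients(R, F):
    """alpha[l, i] = softmax over i of <r_i, f_l>.

    R: (K, N) received vectors r_i as columns; F: (K, L) features f_l as columns.
    """
    scores = F.T @ R                                     # scores[l, i] = <r_i, f_l>
    scores = scores - scores.max(axis=1, keepdims=True)  # stabilise exp; result unchanged
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

def attention_output(R, alpha):
    """z_l = sum_i alpha[l, i] * r_i, returned as the columns of a (K, L) matrix."""
    return R @ alpha.T

rng = np.random.default_rng(2)
K, L, N = 4, 2, 5
R = rng.standard_normal((K, N))
F = rng.standard_normal((K, L))

alpha = attention_coefficients(R, F)
Z = attention_output(R, alpha)
```

Each row of `alpha` sums to one, so every feature \( f_{l} \) distributes a unit weight across the N received vectors.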

Figure 14 \( \alpha _{l,i} \) represents the weight of each received feature associated with extracted feature \( f_{l} \)

Since \( \alpha _{l,i} \) measures the similarity between the extracted feature \( f_{l} \) and the received feature \( r_{i} \), it can also serve as the scaling weight from the extracted feature \( f_{l} \) to the transmitted feature \( t_{i} \).

### 2.3 Standalone MPA Iteration

Figure 15 illustrates the overall MPA iteration. The forward part on the left is equivalent to a non-linear SVM, whereas the backward part on the right is equivalent to an attention DNN.

The MPA iteration is standalone. It is independent of the SGD-based backward propagation of the original AE.

Figure 15 Standalone MPA iteration

Figure 16 MPA-enabled AE-based global transceiver

This means that when the MPA iterates its coefficients, it assumes that the remaining deep neural layers in both the transmitter and the receiver are frozen, as shown in Figure 16.

## 3 Global Tandem Learning

### 3.1 Coarse Learning

The insertion of a dimension reduction layer divides the training into two training agents: one is a standalone agent for the dimension reduction layer, while the other is backward propagation for the deep neural layers in the transceiver.

The two training agents work in tandem, as illustrated in Figure 17:

- In tandem stage 1, the dimension reduction layers are fixed (as shown on the left of Figure 17). As dimension reduction layers are linear and differentiable in forward intonation, they can pass the gradients backward to have deep neural layers trained by backward propagation.
- In tandem stage 2, the deep neural layers are fixed (as shown on the right of Figure 17). The dimension reduction layers are trained by the standalone MPA, just as they are trained by a non-linear SVM.

Tandem stage 1 and stage 2 alternate until the training converges. The detailed procedure is summarized in Algorithm 1.
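The alternation between the two stages can be sketched on a toy linear autoencoder. This is not Algorithm 1: single linear maps `W_tx` and `W_rx` stand in for the deep neural layers, the channel is a fixed vector of real gains without noise, the goal is mean-square reconstruction, and the learning rate and round counts are arbitrary. All of these are assumptions made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(3)
K, D, L, N = 64, 6, 8, 4             # samples, source dim, extended dim, channel dim
S = rng.standard_normal((K, D))      # toy source batch
h = rng.uniform(0.5, 1.5, size=N)    # toy channel gains used for training (assumption)

W_tx = 0.1 * rng.standard_normal((D, L))   # stands in for the transmitter layers
W_rx = 0.1 * rng.standard_normal((N, D))   # stands in for the receiver layers
alpha = np.full((L, N), 1.0 / N)           # dimension reduction layer, uniform start

def softmax_rows(scores):
    scores = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

def forward(W_tx, alpha, W_rx):
    F = S @ W_tx                     # (K, L) extracted features, columns f_l
    R = (F @ alpha) * h              # (K, N) received signal, noise omitted
    return F, R, R @ W_rx            # reconstruction of the source, (K, D)

def loss(W_tx, alpha, W_rx):
    return np.mean((forward(W_tx, alpha, W_rx)[2] - S) ** 2)

lr = 0.05
history = [loss(W_tx, alpha, W_rx)]
for _ in range(20):
    # Stage 1: alpha is frozen; gradients pass through it to both linear maps.
    for _ in range(25):
        F, R, S_hat = forward(W_tx, alpha, W_rx)
        G = 2.0 * (S_hat - S) / S.size           # d(loss) / d(S_hat)
        grad_rx = R.T @ G
        grad_tx = S.T @ (((G @ W_rx.T) * h) @ alpha.T)
        W_rx -= lr * grad_rx
        W_tx -= lr * grad_tx
    # Stage 2: W_tx and W_rx are frozen; alpha is refreshed by the attention rule.
    F, R, _ = forward(W_tx, alpha, W_rx)
    alpha = softmax_rows(F.T @ R)
    history.append(loss(W_tx, alpha, W_rx))
```

The point of the sketch is the schedule rather than the numbers: gradient steps never touch `alpha`, and the attention refresh never touches `W_tx` or `W_rx`.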

### 3.2 Inference Cycle Adaptation

Because it is trained in a standalone mode, the dimension reduction layer can be tuned for each transmission. The MPA-enabled AE-based global transceiver adapts its transmitter to the variation of channels during the inference cycle and keeps the receiver unchanged.

On the left of Figure 18 is the AE-based global transceiver without the dimension reduction layers. The AE-based transceiver cannot change its neuron coefficients during the inference cycle, even if the channel environment and/or source distribution has already changed. Its performance relies on the odds that the varying channel is in the interpolation range; otherwise, the transceiver has to accumulate sufficient new data sets either to start a new training cycle or to apply transfer learning to some of the deep neural layers, both of which are detrimental to applying the AE-based transceiver in a fast-changing environment.

On the right of Figure 18 is the MPA-enabled AE-based global transceiver. The transmitter keeps adjusting the dimension reduction layer(s) for each transmission. Since the dimension reduction layer is an intermediate layer of the transceiver, updating its coefficients according to the current channel can significantly enhance generalization. Moreover, the dimension reduction layers are adjusted on the fly by the real-world channel measurement without any labeled data. The detailed procedure is summarized in Algorithm 2.
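A per-transmission adaptation loop might look like the sketch below. It is not Algorithm 2: the random-walk channel drift, the three sub-iterations per transmission, the warm start from the previous coefficients, and the function name `adapt_alpha` are assumptions. What it preserves from the text is that only the dimension reduction coefficients change and that no labeled data is used.

```python
import numpy as np

def softmax_rows(scores):
    scores = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

def adapt_alpha(F, h_measured, alpha, num_iters=3):
    """Refresh the dimension reduction coefficients for one transmission."""
    for _ in range(num_iters):
        R_pred = (F @ alpha) * h_measured   # forward: signal predicted from the measurement
        alpha = softmax_rows(F.T @ R_pred)  # backward: attention of each f_l over r_1..r_N
    return alpha

rng = np.random.default_rng(4)
K, L, N = 4, 6, 3
F = rng.standard_normal((K, L))      # output of the frozen transmitter layers
alpha = np.full((L, N), 1.0 / N)     # coefficients carried over between transmissions
h = np.ones(N)

alphas = []
for _ in range(5):
    h = h + 0.3 * rng.standard_normal(N)   # channel drifts between transmissions (toy model)
    alpha = adapt_alpha(F, h, alpha)       # uses only the channel measurement, no labels
    alphas.append(alpha.copy())
    T = F @ alpha                          # precoded signal handed to the channel
```

The receiver does not appear in the loop at all, which mirrors the claim that it stays unchanged while the transmitter adapts.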

Figure 17 Tandem training cycle

Figure 18 AE-based transceiver, with or without MPA enabled.

Figure 19 Coarse learning and adaptive inference

### 3.3 Advantages of MPA-Enabled AE-based Global Transceiver

In addition to adapting inference to the changing channel, the dimension reduction layer delivers other significant advantages.

First, differentiable channel model simplification harms the performance of the AE-based transceiver, because the AE is trained for the simplified channel model rather than the true one. The performance loss is due to the offset between the simplified channel model used during the training cycle and the true channel faced during the inference cycle. If the offset increases beyond expectation, the entire AE-based transceiver becomes obsolete. Two remedies have been proposed to mitigate this performance degradation. The first is based on reinforcement learning (RL), which keeps recording the channel states and keeps training the policy DNN and/or evaluation DNN from time to time. However, RL is too complex for a high-dimensional problem such as a wireless system, whose dimensionality is in fact much larger than that handled by AlphaGo. Therefore, the RL-based adjustment mechanism is impracticable. The second is based on the generative adversarial network (GAN), which learns as many channel scenarios as possible into a big DNN model. This is an empirical method that can never be proven to cover all the channel scenarios.

The MPA-enabled AE takes a different technology path. Because MPA adjusts the coefficients of the dimension reduction layer on the fly for each data transmission as a function of the current channel measurement during the inference cycle, the training cycle can make do with a coarse channel model. This is what we call "coarse learning". This advantage is sometimes hard to demonstrate by simulations in which the same or a similar channel model is emulated in both the training and inference cycles. However, it can be demonstrated in real field tests.

Second, the MPA-enabled AE-based transceiver can work with the GAN-based channel model. Our experience tells us that most channels relate to the user's position and environmental topologies such as high buildings, hills, roads, and so on. The reference proposes to use a conditional GAN to model unknown channels and achieves good performance. This method can be used for developing a channel model that supports our training cycle.

In the inference cycle, we suggest relying on pilot-based channel estimation, channel measurements and feedback, or channel reciprocity to obtain the latest channel condition. Furthermore, MPA is well known to benefit from sparsity and to tolerate bias and offsets, which is the reason the LDPC decoder can work. This means that it is not necessary to measure the channels on full dimensionality; measuring a part of it is enough. Even with an estimation error, our scheme can be robust in terms of overall performance. Residual errors can be tackled by the receiving deep neural layers, which are capable of tolerating errors. Because the dimension reduction layers are adjusted during the inference and training cycles and used as a precoder for the overall transmitting chain, the receiving deep neural layers do not need to be retrained, which is a huge advantage in terms of saving power and extending the battery life of the user's device.

Figure 19 highlights the complete picture of our proposed scheme. For the transceiver algorithm architecture, we have chosen an appropriate MPA-adjustable dimension reduction to address the information bottleneck at the channel input and minimize the SGD gradient overhead for the training cycle. During the training cycle, we can use a generated channel model. A tandem training scheme is used to back propagate the gradients through the frozen dimension reduction layer, freeze the autoencoder layers, and iterate the dimension reduction layers. During the inference cycle, we freeze the autoencoder layers and iterate the dimension reduction layers by true channel measurements or feedback.

Figure 20 Flexible insertions of MPA layer(s)

## 4 Future Directions

### 4.1 Flexible MPA Enabling Scheme

The dimension reduction layers can be flexibly inserted into an AE-based transceiver for various problems.

In one-to-one single-user communication, shown in the top part of Figure 20, we applied this method of flexibly inserting dimension reduction layers into an AE-based transceiver to design a high-order modulation scheme, a massive MIMO scheme, a predistortion scheme, and so on. These designs achieved good results in time-varying channel conditions.

This method can also be used in the multi-user communication scheme. For example, in a downlink MU-MIMO context in the middle part of Figure 20, we can jointly adapt the MPA layer.

For uplink MU-MIMO, each user transmitter can adapt the MPA layer individually, while the base station receiver can adapt the dimension reduction layer prior to the DNN receiver, as shown in the bottom part of Figure 20.

By introducing the dimension reduction layer in the autoencoder structure based on deep learning, we can extend the current DNN framework into a more generalized one to enhance generalization against OOD cases.

### 4.2 Training with a Complex Channel Model

Another research direction is to use a more complex channel model during the training cycle. Most probably, the channel model would be generated by a DNN that takes the surrounding topological information as input, as illustrated in Figure 21.

The MPA iteration is still valid because the DNN that infers the channel can be considered as a non-linear function \( Ch\left ( \cdot ,enh;\vartheta _{ch} \right )\).

Figure 21 Training with a DNN-based channel model

Figure 22 During the inference cycle, we suggest using another DNN to generate the current environment parameters.

Nevertheless, if the DNN-based channel model is huge, it will probably be too costly to use this model for inference in an iterative way. Research on how to reduce the size of the DNN by model distillation is required.

For the inference cycle, the environmental parameters can be results from another DNN that deduces the surrounding topological information from the current channel measurements and feedback, as shown in Figure 22.

## 5 Conclusion

In this paper, we address the fundamental issue in the AE-based global transceiver, which is poor generalization against variation of random channels, by introducing an MPA-enabled feature. MPA can adapt the AE-based transceiver during the inference cycle to the current channel measurement. Thanks to MPA, the learning cycle can tolerate more simplification of the channel model. The introduction of MPA into AE architecture is based on solid dimensional transformation technology widely used in classic wireless systems, and the implementation of MPA is a mature and efficient technology in wireless communications.
