Fast Polar Codes for Terabits-Per-Second Throughput Communications
Targeting high-throughput and low-power communications, we have implemented two successive cancellation (SC) decoders for polar codes.
Authors (all from Huawei 6G research team): Jiajie Tong, Xianbin Wang, Qifan Zhang, Huazi Zhang, Rong Li, Jun Wang
Higher throughput has always been a primary target during the evolution of mobile communications. Driven by high data rate applications such as virtual/augmented reality (VR/AR) applications, the sixth generation of wireless technology (6G) requires a peak throughput of 1 Tbps. This is roughly a 50×–100× increase over the 10–20 Gbps throughput targeted for 5G standards.
To support such a high data rate, we need to propose a new physical layer design to reduce implementation complexity, save energy, and improve spectral efficiency. This is particularly true when the peak throughput requirement is imposed on a resource-constrained device (e.g., with limited processing power, storage, and energy supply). As channel coding is widely known to consume a substantial proportion of computational resources, it poses a bottleneck for extreme throughput. To this end, channel coding is one of the most relevant physical layer technologies used to guarantee 1-Tbps peak throughput for 6G.
Polar codes, defined by Arikan, are a class of linear block codes with a generator matrix \(G_{N}\) of size N, defined by \( G_{N} \triangleq F^{\otimes n} \), in which \( N=2^{n} \) and \( F^{\otimes n} \) denotes the n-th Kronecker power of \( F=\begin{bmatrix} 1 & 0\\ 1 & 1\end{bmatrix} \). Successive cancellation (SC) is a basic decoding algorithm for polar codes.
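As a concrete illustration, \(G_{N}\) can be computed by repeated Kronecker products. The minimal NumPy sketch below (the function name is ours) builds the generator matrix for any \(N=2^{n}\):

```python
import numpy as np

def polar_generator_matrix(n: int) -> np.ndarray:
    """G_N = F^{(x)n}: the n-th Kronecker power of the 2x2 kernel F."""
    F = np.array([[1, 0],
                  [1, 1]], dtype=np.uint8)
    G = np.array([[1]], dtype=np.uint8)
    for _ in range(n):
        G = np.kron(G, F)  # Kronecker product, applied n times
    return G

# For n = 3 we get the 8x8 generator matrix of a length-8 polar code
G8 = polar_generator_matrix(3)
print(G8.shape)  # (8, 8)
```

Note that \(G_{N}\) is lower-triangular with an all-ones last row, which is what the recursive structure of SC decoding exploits.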
Although the SC decoding algorithm seems unsuitable for high-throughput applications due to its serial nature, state-of-the-art SC decoders managed to significantly simplify and parallelize the decoding process such that the area efficiency of SC decoding has far exceeded that of belief propagation (BP) decoding for low-density parity-check (LDPC) codes. In particular, these works represent SC decoding as a binary tree traversal, as shown in Figure 1a, with each subtree therein representing a shorter polar code. The original SC decoding algorithm traverses the tree by visiting all the nodes and edges, leading to high decoding latency. Simplified SC decoders can fast decode certain subtrees (shorter polar codes) and thus "prune" those subtrees. The resulting decoding latency is largely determined by the number of remaining edges and nodes in the pruned binary tree. Several tree-pruning techniques have been proposed; to achieve 1-Tbps throughput, however, more aggressive techniques are needed on both the decoding and encoding sides.
Figure 1 (a) Decoding architecture as a binary tree; (b) Node v received/response information
This paper introduces a novel polar code construction method, referred to as "fast polar codes", to facilitate parallelized processing at an SC decoder. In contrast to some existing decoding-only techniques, we take a joint encoding-decoding optimization approach. Similar to existing methods, our main ideas can be better understood from the binary tree traversal perspective. They include (a) pruning more subtrees, (b) replacing some non-prunable subtrees with other fast-decodable short codes of the same code rates, and then pruning these "grafted" subtrees, and (c) eliminating the remaining non-prunable subtrees by altering their code rates. As can be seen, both (b) and (c) involve a modified code construction. Consequently, we are able to fast decode any subtree (short code) of a certain size, without sacrificing parallelism.
The algorithmic contributions are summarized below:
For code length N=1024 and code rate R=0.875, the proposed fast polar codes enable parallel decoding of all length-16 nodes. The proposed decoding algorithm reduces node visits by 55% and edge visits by 43.5% when compared with the original polar codes, with a performance cost of under 0.3 dB. Two types of decoder hardware are designed to evaluate the area efficiency and energy efficiency.
The implementation contributions are summarized below:
Following standard notation, node v in a tree is directly connected to a parent node \(p_{v}\), a left child node \(v_{l}\), and a right child node \(v_{r}\). The stage of node v is defined by the number of edges between it and its nearest leaf node. All leaf nodes are at stage s=0. The set of nodes of the subtree rooted at node v is denoted by \(V_{v}\). As such, \(V_{root}\) denotes the full binary decoding tree. The set of all leaf nodes is denoted by U, the index of a leaf u is denoted by i(u), and the indices of U are denoted by i(U). Meanwhile, the set of the leaf nodes in subtree \(V_{v}\) is denoted by \(U_{v}\), and the indices of \(U_{v}\) are denoted by \( i\left ( U_{v} \right ) \).
The set of all information bit positions is denoted by I and that of all frozen bit positions by \(I^{C}\). The set of the information bit positions in subtree \(V_{v}\) is denoted by \(I_{v}\) and the remaining frozen bit positions therein by \( I_{v}^{C}\).
If \(I_{v}\) matches one of the known patterns, pattern-based simplified decoding can be triggered to process the node in parallel rather than bit by bit. From the binary tree traversal perspective, none of the child nodes of v need to be traversed. As a result, decoding latency is reduced.
The existing pattern-based simplified decoding includes four different types. Node v is a Rate-1 node if all leaves in the subtree \(V_{v}\) are information bits, and a Rate-0 node if all leaves in the subtree \(V_{v}\) are frozen bits. To improve the decoder's efficiency, prior works define the single parity check (SPC) and repetition (REP) nodes. We can employ pattern-specific parallel processing for each type of node. However, we need to identify and exploit additional special nodes or patterns for improved latency reduction.
In this paper, we present four new types of nodes: SPC-2, REP-2, RPC, and PCR. The corresponding fast decoding methods are described in Section 3.
Pattern-based simplified decoding skips the traversal of any subtree that matches one of the above patterns.
Currently, there are eight pattern types to cover eight code rates in a subtree: \( \left \{0, \frac{1}{M},\frac{2}{M},\frac{3}{M},\frac{M-3}{M},\frac{M-2}{M},\frac{M-1}{M},1 \right \} \) . In other words, nodes with other code rates cannot be decoded fast, and we need to work on the following two parameters.
For M=8, the ratio of simplified nodes is 8/9, with only one unsupported code rate \( \left ( \frac{4}{8} \right ) \), but the degree of parallelism is only 8. For M=16, the ratio of simplified nodes reduces to 8/17, leaving a wide gap of nine unsupported code rates \( \left ( \frac{4}{16},\ldots,\frac{12}{16} \right ) \), but the degree of parallelism doubles.
To cover medium code rates, we need to find patterns that can be fast decoded with good BLER performance. Unfortunately, to the best of our knowledge, there exists no parallel decoding method for polar codes with code rates between \( \frac{M-3}{M} \) and \( \frac{3}{M} \). The good news, however, is that the outer codes represented by a subtree can be replaced by any codes, as shown in many previous works. A good solution involves removing the polar nodes with code rate falling into the gap, and grafting a different code that allows for fast decoding.
BCH codes are ideal candidates due to their good minimum distance property and fast hard-input decoding algorithms. If the error correcting capability is t, it is easy to design BCH codes with a minimum Hamming distance larger than 2 × t. This leads to good BLER performance. Meanwhile, the Berlekamp-Massey (BM) algorithm can decode a BCH code with t=1 or t=2 within a few clock cycles. When grafted to polar codes as fast-decodable nodes, hard decisions are applied to the LLRs from the inner polar codes (parent nodes) before being sent to the outer BCH codes (child nodes). Here, the BCH codes are called "BCH nodes".
But BCH codes cannot immediately solve our problem. They only support a few code rates and code lengths, meaning they cannot cover all code rates within the gap. For the degree of parallelism M=16, the target code length is \( 2^{4} \), so the nearest BCH code length is 15. Meanwhile, BCH codes only support code rates \( \frac{7}{15} \) and \( \frac{11}{15} \) within the gap, and the corresponding numbers of information bits are k=7 and k=11.
To overcome these issues, we must first extend the code length to 16 bits. For BCH codes with k=7 and t=2, the original codes can correct two error bits, and we add an additional bit to be the parity check of all BCH code bits. The proposed two-step hard decoding works as follows: When the hard decision incurs three bit errors and one has the minimum amplitude, the SPC bit can help correct one error bit first. The remaining two error bits can then be corrected by the BM algorithm. However, the same SPC extension does not work for BCH codes with k=11 and t=1. This is because if there are two or more bit errors in the node, the SPC function and the BM algorithm both fail. Alternatively, if there is one error, the SPC decoding failure will lead to further errors during BM decoding. Instead of SPC extension, we repeat one BCH code bit to improve its reliability.
Now that we have grafted two types of BCH nodes, pattern-based decoding can support 10 code rates. The ratio of simplified nodes increases to 10/17, and the maximum gap reduces to \(\frac{4}{16}\). Figure 3 shows the code rates supported by pattern-based decoding for a degree of parallelism M=16.
Figure 2 Nodes (code rates) supporting fast decoding for degree of parallelism M=16
Figure 3 BLER Performance comparison between GA and fast polar code construction
Even with the inclusion of BCH nodes, the fast decoding algorithm cannot cover all the code rates of length-16 subtrees. As the second part of the solution, we propose the construction of fast polar codes to avoid the "slow" nodes, using only the existing ten patterns. Here "fast" resembles the speed of fast SC decoding, but is achieved by altering the code construction instead of the decoding. Our demonstration shows that it greatly reduces decoding latency and increases throughput with only a slight performance loss.
The following steps show how to construct fast polar codes using only the supported node patterns, whose code rates are non-contiguous:
1. Employ traditional methods such as Gaussian approximation (GA) or polarization weight (PW) to build polar codes with the parameters of code length N and code rate R.
2. Split all N synthesized sub-channels into N/16 segments. Each segment constitutes a 16-bit block code, equivalent to a subtree with 16 leaf nodes.
3. Identify all "slow" segments which do not match the supported code rates or patterns. Re-allocate the code rates among segments to match the nearest supported code rate or pattern, which has K information bits.
4. If the number of information bits of the current segment exceeds or falls short of K, we remove or add a few information bits according to reliability. Apply this process to the remaining "slow" segments until all segments become fast-decodable.
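The re-allocation in steps 3 and 4 can be sketched as a greedy procedure. The snippet below is an illustrative stand-in for Algorithm 1 (which is not reproduced here), assuming the ten supported per-segment information-bit counts for M=16 implied by the text (rates {0, 1, 2, 3, 13, 14, 15, 16}/16 plus the two BCH nodes with k=7 and k=11); the function name and tie-breaking rule are ours:

```python
# Supported information-bit counts per 16-bit segment (the ten patterns).
SUPPORTED = [0, 1, 2, 3, 7, 11, 13, 14, 15, 16]

def snap_segments(counts):
    """Snap each segment's info-bit count to the nearest supported value,
    then repair the total by nudging segments between adjacent supported
    counts so the overall code rate is preserved (greedy sketch)."""
    snapped = [min(SUPPORTED, key=lambda k: abs(k - c)) for c in counts]
    delta = sum(counts) - sum(snapped)   # info bits still to place (+) or drop (-)
    while delta != 0:
        best = None
        for i, s in enumerate(snapped):
            idx = SUPPORTED.index(s)
            for j in (idx - 1, idx + 1):       # adjacent supported counts
                if 0 <= j < len(SUPPORTED):
                    move = SUPPORTED[j] - s
                    # prefer the move that best closes the deficit, then the cheapest
                    key = (abs(delta - move), abs(move))
                    if best is None or key < best[0]:
                        best = (key, i, SUPPORTED[j], move)
        _, i, new, move = best
        snapped[i], delta = new, delta - move
    return snapped

# e.g. baseline GA segment counts [16, 15, 12, 9] become all-supported counts
print(snap_segments([16, 15, 12, 9]))
```

In a full construction, the bits moved between segments would be chosen by sub-channel reliability, as step 4 states; this sketch tracks only the counts.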
The resulting code is coined as "fast polar code". A detailed description of the construction algorithm for fast polar codes can be found in Algorithm 1.
Take code length N=1024 and code rate R=0.875 as an example. We count the number of fast-decodable nodes to be visited, \(f_{+/-}\) functions to be executed, and edges to be traversed. These numbers provide a good estimate for SC decoding latency, and are thus used to compare the construction proposed in this section with the GA construction in Table 1. As we can see, the numbers of traversed nodes and edges are reduced by 55% and 43.5%, respectively, while the number of \(f_{+/-}\) function executions is reduced by only 8.9%. Note that the former two parameters have a greater influence than the \(f_{+/-}\) functions, because they cannot be parallelized in any form.
Table 1 Comparison of visited nodes, executed \(f_{+/-}\) functions, and traversed edges between the GA and fast polar code constructions
It is worth noting that the proposed fast polar code construction algorithm reallocates the code rates of some nodes against their actual capacity which is derived from channel polarization. This inevitably incurs a BLER performance loss. We run simulations to evaluate the loss, and Figure 3 compares the BLER curves of both constructions under code length N=1024 and code rates R={0.75, 0.8125, 0.875, 0.9375}. There is a maximum of 0.3 dB loss at BLER \(10^{-2}\) between GA polar codes and the fast polar codes when adopting QPSK modulation.
In this section, we describe the algorithms used to support fast decoding of the newly defined SPC-2, REP-2, RPC, and PCR nodes. For BCH nodes, we employ the classic BM algorithm which takes hard inputs and supports hardware-friendly fast decoding.
Each fast-decodable node v at stage s can be viewed as an outer code of length \( M=2^{s}\). The code bits of v as an outer code are denoted by \( X_{v} \), with M bits.
For a dual-SPC node v, we divide its code bits \( X_{v} \) into two groups, \(X_{v}^{even}\) with even-numbered indices, and \(X_{v}^{odd}\) with odd-numbered indices. According to the definition of an SPC-2 node, there are two parity-check bits in the subtree \( V_{v} \) , and the corresponding parity functions p[0] and p[1] can be written as
\(\left\{\begin{array}{l} p[0]: \oplus x=0, x \in X_{v} \\ p[1]: \oplus x=0, x \in X_{v}^{o d d} \end{array}\right.\)
We add the two parity functions to get a parity function p[2]:
\(p[2]=p[0] \oplus p[1]: \oplus x=0, x \in X_{v}^{\text {even }}\)
Since the two parity functions p[1] and p[2] involve two disjoint sets of code bits, the decoding of an SPC-2 node can be parallelized into two SPC nodes, each of which inherits half of the elements from \( X_{v} \). We can reuse two SPC decoding modules to fast decode the SPC-2 node.
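A minimal sketch of this split, assuming the standard SPC decoder described later in the implementation section (hard-decide all bits, then flip the minimum-amplitude position if the parity check fails) and the usual convention that a negative LLR means bit 1:

```python
import numpy as np

def spc_decode(llr):
    """Hard-decide, then satisfy the single parity check by flipping the
    least-reliable (minimum-amplitude) bit if the check fails."""
    bits = (llr < 0).astype(np.uint8)
    if bits.sum() % 2:                      # parity check failed
        bits[np.argmin(np.abs(llr))] ^= 1   # flip minimum-amplitude position
    return bits

def spc2_decode(llr):
    """An SPC-2 node splits into two independent SPC codes over the
    even- and odd-indexed code bits; decode them in parallel."""
    bits = np.empty(len(llr), dtype=np.uint8)
    bits[0::2] = spc_decode(llr[0::2])
    bits[1::2] = spc_decode(llr[1::2])
    return bits
```

Both halves satisfy their own parity constraint, so the two SPC modules can run side by side with no data exchange.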
For a dual-REP node v, we divide its code bits \( X_{v} \) into two groups, \(X_{v}^{even}\) with even-numbered indices, and \(X_{v}^{odd}\) with odd-numbered indices. There are two information bits in the subtree \( V_{v} \) according to the definition of a REP-2 node, and these are denoted by \( U_{M-2} \) and \( U_{M-1} \).
It can be easily verified that \(X_{v}^{odd}\) are the repetition of \( U_{M-1}\), and \(X_{v}^{even}\) are the repetition of \( U_{M-2} \oplus U_{M-1} \). Consequently, we can divide a length-M dual-REP node into two length-M/2 REP nodes, and we can reuse two REP decoding modules in parallel to fast decode the REP-2 node.
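Under the same LLR sign convention as above (negative means bit 1), this split can be sketched as follows; the function names are ours:

```python
import numpy as np

def rep_decode(llr):
    """Repetition node: sum the LLRs and hard-decide the single bit."""
    return np.uint8(llr.sum() < 0)

def rep2_decode(llr):
    """A REP-2 node splits into two repetition codes: the odd-indexed code
    bits repeat u_{M-1}, the even-indexed bits repeat u_{M-2} XOR u_{M-1}."""
    u_m1 = rep_decode(llr[1::2])            # u_{M-1}
    u_m2 = rep_decode(llr[0::2]) ^ u_m1     # (u_{M-2} ^ u_{M-1}) ^ u_{M-1}
    return u_m2, u_m1                       # (u_{M-2}, u_{M-1})
```

The two LLR sums are independent, so the two REP modules can run in parallel, with a single XOR to undo the combination afterwards.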
For an RPC node v, we divide its code bits \( X_{v} \) into four groups as follows:
\( X_{v}^{i}=\left\{x \in X_{v}, \bmod (i(x), 4)=i\right\}, i \in\{0,1,2,3\} \)  (1)
According to the definition of an RPC node, there are three parity-check bits in the subtree \(V_{v}\), and the parity functions p[0], p[1], and p[2] can be written as
\(\left\{\begin{array}{l} p[0]: \oplus x=0, x \in X_{v}^{0} \cup X_{v}^{1} \cup X_{v}^{2} \cup X_{v}^{3} \\ p[1]: \oplus x=0, x \in X_{v}^{1} \cup X_{v}^{3} \\ p[2]: \oplus x=0, x \in X_{v}^{2} \cup X_{v}^{3} \end{array}\right.\)
We add the latter two parity functions to get parity function p[3]:
\(p[3]=p[1] \oplus p[2]: \oplus x=0, x \in X_{v}^{1} \cup X_{v}^{2}\)
And we add this parity function to the first one to get parity function p[4]:
\(p[4]=p[0] \oplus p[3]: \oplus x=0, x \in X_{v}^{0} \cup X_{v}^{3}\)
We define \( \hat{c}_{i}=\oplus x,\; x \in X_{v}^{i},\; i \in \{0,1,2,3\} \). According to parity functions p[1] to p[4], we can easily verify that the following relationship holds true:
\( \hat{c}_{1} \oplus \hat{c}_{3}=\hat{c}_{2} \oplus \hat{c}_{3}=\hat{c}_{1} \oplus \hat{c}_{2}=\hat{c}_{0} \oplus \hat{c}_{3}=0 \) (2)
Equation (2) implies the existence of a virtual repetition code of rate \(\frac{1}{4}\), because:
\( \hat{c}_{0}=\hat{c}_{1}=\hat{c}_{2}=\hat{c}_{3}=0 \) or
\( \hat{c}_{0}=\hat{c}_{1}=\hat{c}_{2}=\hat{c}_{3}=1 \)
where \( \hat{c}_{0},\hat{c}_{1},\hat{c}_{2},\hat{c}_{3}\) are the virtual repeated code bits. Given the above knowledge, the decoding algorithm for an RPC node at stage s, where s ≥ 2, can be easily derived as Algorithm 2.
For a PCR node v, we divide its code bits \( X_{v} \) into four groups, just like in (1). There are three information bits in this node according to the definition of a PCR node, and these are denoted by \(U_{M-3}\), \(U_{M-2}\), and \(U_{M-1}\). We define \(c_{i}\), i ∈ {0, 1, 2, 3} according to the following equation:
\( \left[\begin{array}{llll} c_{0} & c_{1} & c_{2} & c_{3} \end{array}\right]=\left[\begin{array}{llll} 0 & u_{M-3} & u_{M-2} & u_{M-1} \end{array}\right] \times G_{4} \)  (3)
It can be easily verified that \( X_{v}^{0} \) are the repetition of \(c_{0}\), \( X_{v}^{1} \) of \(c_{1}\), \( X_{v}^{2} \) of \(c_{2}\), and \( X_{v}^{3} \) of \(c_{3}\). Consequently, we divide the input signal \(\alpha_{v}\) into four groups according to the indices, and combine the input signals within each group into four enhanced signals \(\Delta _{i}\), i ∈ {0, 1, 2, 3}, as in a REP node.
Equation (3) implies the existence of a virtual single parity check code of rate \(\frac{3}{4}\) , with virtual code bits \(C_{i}\), i ∈ {0, 1, 2, 3}, so that we can reuse the SPC module to decode it. A detailed description of PCR decoding is provided in Algorithm 3.
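Keeping the same LLR conventions, PCR decoding can be sketched as below; this is an illustrative stand-in for Algorithm 3 (not reproduced here). It uses the fact that over GF(2) the kernel power \(G_{4}=F^{\otimes 2}\) is its own inverse, so the information bits can be recovered from the virtual code bits by one more multiplication with \(G_{4}\):

```python
import numpy as np

G4 = np.array([[1, 0, 0, 0],
               [1, 1, 0, 0],
               [1, 0, 1, 0],
               [1, 1, 1, 1]], dtype=np.uint8)   # G_4 = F^{(x)2}

def pcr_decode(llr):
    """PCR node: each index-mod-4 group repeats a virtual code bit c_i, and
    [c0 c1 c2 c3] is a rate-3/4 SPC codeword (c0^c1^c2^c3 = 0). Combine each
    group's LLRs as in a REP node, SPC-decode the four enhanced LLRs, then
    invert equation (3) to recover the information bits."""
    enhanced = np.array([llr[i::4].sum() for i in range(4)])
    c = (enhanced < 0).astype(np.uint8)
    if c.sum() % 2:                          # enforce the virtual SPC
        c[np.argmin(np.abs(enhanced))] ^= 1
    # [0 u_{M-3} u_{M-2} u_{M-1}] = c x G4 (G4 is involutory over GF(2))
    u = c @ G4 % 2
    return u[1], u[2], u[3]                  # u_{M-3}, u_{M-2}, u_{M-1}
```

The group-wise LLR sums and the four-input SPC step both map directly onto the REP and SPC modules the decoder already has.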
We designed two types of hardware architectures to verify performance, area efficiency, and energy efficiency.
Both of the decoder implementations mentioned above adopt successive cancellation algorithms accelerated by pattern-based fast decoding. The maximum degrees of parallelization are 128 for SPC and SPC-2 nodes, and 256 for R1 nodes. All other nodes enjoy a degree of parallelism of 16.
We can observe that there are several large SPC nodes in the right half of the binary tree. As described, these SPC nodes need to be processed with a higher degree of parallelism to achieve a higher throughput. The SPC decoding algorithm is very simple, and can be explained as follows. First, obtain the signs of an SPC node’s input signals. Then find the minimum amplitude of input signals and record its position. Next, perform a parity check on the signs. If it passes, return these signs. Otherwise, reverse the sign of the recorded minimum-amplitude position and return the updated signs.
To process a large SPC node, a circuit is required to locate a minimum amplitude from a large number of input signals. The traditional pairwise comparison method requires a circuit of depth \( log_{2}(M) \), where M is the number of amplitudes to be compared. Finding the smallest among, for example, 128 amplitudes requires 7 comparison steps. Considering that clock frequency is set to 1 GHz, it is very challenging to meet the timing constraints and complete all comparisons in one clock cycle.
To address these issues, we advocate a parallel comparison architecture to replace the traditional one. For node v at stage s, its input signals \( \alpha_{v}\) include \( M=2^{s} \) elements, the amplitudes of which are denoted as [A0 A1 ··· AM-1]. Each amplitude has x-bit quantization, and we fill the x-bit quantized binary vectors into the columns of a matrix as follows:
\[ \begin{bmatrix} b_{0}^{0} & b_{1}^{0} & \cdots & b_{M-1}^{0} \\ b_{0}^{1} & b_{1}^{1} & \cdots & b_{M-1}^{1} \\ \vdots & \vdots & \ddots & \vdots \\ b_{0}^{x-1} & b_{1}^{x-1} & \cdots & b_{M-1}^{x-1} \end{bmatrix} \]
If we rewrite the matrix with respect to its row vectors, we have \( [B_{0} \cdots B_{j} \cdots B_{x-1}]^{T} \), in which \(B_{j}=\left[b_{0}^{j} \cdots b_{i}^{j} \cdots b_{M-1}^{j}\right]\), j ∈ {0, 1, ···, x − 1} is a row vector. We propose Algorithm 4 to determine the minimum-amplitude position through a reverse mask D, in which the bit "1" indicates the minimum. The parallel comparison algorithm reduces the comparison logic depth from \( log_{2}(M) \) to 1. However, the reverse mask D may contain two or more ones, which means that the input signals \( \alpha_{v}\) include two or more minimum amplitudes; this would cause a decoding error. To prevent this error, we can apply an additional circuit to ensure the uniqueness of the selected minimum position.
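Algorithm 4 itself is not reproduced here, but its functional behavior can be sketched as a bit-plane scan over the rows \(B_{j}\); the sequential loop below models what the hardware evaluates with parallel mask logic (the function name is ours):

```python
import numpy as np

def min_mask(amps, x):
    """Bit-plane sketch of the parallel comparison: amps are x-bit amplitudes.
    Scan bit planes MSB-first; the reverse mask D keeps a '1' for every
    position still tied for the minimum. Ties leave multiple ones, which is
    the ambiguity the text resolves with an extra uniqueness circuit."""
    D = np.ones(len(amps), dtype=np.uint8)
    for j in reversed(range(x)):             # MSB to LSB
        b = (amps >> j) & 1                  # bit plane B_j
        if np.any((D == 1) & (b == 0)):      # some surviving candidate has a 0
            D &= (b ^ 1)                     # drop candidates with a 1 here
    return D

amps = np.array([5, 3, 7, 3, 6], dtype=np.uint8)
print(min_mask(amps, 3))  # ones at both positions holding the minimum value 3
```

Each bit plane needs only wide AND/OR gates over the mask, which is the source of the depth reduction claimed above.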
An attractive property of polar codes is that SC decoding works well under low-precision quantization (4 bits to 6 bits). Lower precision quantization is the key to higher throughput, as it effectively reduces implementation complexity.
Figure 4 Performance comparison between Floating Point and Fixed Point
There are two types of quantization numbers — one for channel LLR and another for internal LLR. We first test a case with 6-bit input quantization and 6-bit internal quantization. According to Figure 4, this setting achieves the same performance as floating-point. The second case is 5-bit quantization and 5-bit internal quantization, which incurs < 0.1 dB loss. Finally, 4-bit input quantization and 5-bit internal quantization incurs < 0.2 dB loss. In this paper, we evaluate the physical implementation result under 5-bit quantization for both input and internal signals to strike a good balance between complexity and throughput.
At the same time, we also compare the BLER performance between the original SPC and parallelized SPC. None of the quantization schemes result in a harmful loss.
We carry out FPGA implementations of both the recursive and unrolled architectures, and convert the results to physical (ASIC) implementations.
According to the FPGA synthesis results, the recursive decoder uses 10170 LUTs and 12772 FFs, while the unrolled decoder uses 66192 LUTs and 55187 FFs. Both decoders avoid the use of memories, making it easy to convert the FPGA results to ASIC. Converted to 16 nm technology, the recursive decoder's estimated synthesis area and layout size are 0.032 mm^{2} and 0.045 mm^{2}, respectively, at a clock frequency of 1 GHz. The unrolled decoder's estimated synthesis area and layout size are 0.17 mm^{2} and 0.30 mm^{2}, respectively, at a clock frequency of 1.20 GHz.
The key performance indicators (KPIs) are reported in this section. First of all, we evaluate the area efficiency using the following equation:
AreaEff ( Gbps/mm^{2} )= InfoSize (bits)/(Latency (ns)×Area (mm^{2} ))
The recursive decoder takes 40 clock cycles to decode one packet under fast polar code construction with code length N=1024 and code rate R=0.875. As such, the throughput is (1024 bits × 1 GHz)/40 cycles=25.6 Gbps for coded bits, and ((1024 × 0.875) bits × 1 GHz)/40 cycles=22.4 Gbps for information bits. Converting to 16 nm process, the area efficiency for coded bits is 561 Gbps/mm^{2}.
The unrolled decoder takes 25 clock cycles to decode one packet. It is fully pipelined, meaning a new packet of decoded results is generated every cycle after the first 25 clock cycles of the first packet's processing time. The throughput is thus 1024 bits × 1.2 GHz=1229 Gbps for coded bits, and (1024 × 0.875) bits × 1.2 GHz=1075 Gbps for information bits. Converting to 16 nm process, the area efficiency for coded bits is 4096 Gbps/mm^{2}.
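As a sanity check, the throughput figures above can be recomputed in a few lines (N, R, cycle counts, and clock frequencies are the values reported in this section):

```python
N, R = 1024, 0.875

# Recursive decoder: 40 cycles per packet at 1 GHz
rec_coded = N * 1e9 / 40 / 1e9        # coded-bit throughput, Gbps
rec_info  = N * R * 1e9 / 40 / 1e9    # information-bit throughput, Gbps

# Unrolled decoder: fully pipelined, one packet per cycle at 1.2 GHz
unr_coded = N * 1.2e9 / 1e9           # 1228.8 Gbps, reported as 1229
unr_info  = N * R * 1.2e9 / 1e9       # 1075.2 Gbps, reported as 1075

print(rec_coded, rec_info, unr_coded, unr_info)
```

The pipelined decoder's ~48x throughput advantage comes entirely from retiring one packet per cycle instead of one per 40 cycles.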
Table 3 Comparison with high-throughput polar decoders
We further evaluate the power consumption and decoding energy per bit through a simulation in which 200 packets are decoded. The process, voltage, and temperature (PVT) condition for evaluation is the TT corner at 0.8 V and 20°C. The recursive decoder's power consumption is 30.9 mW, and decoding each bit costs 1.21 pJ of energy on average; the unrolled decoder's power consumption is 784 mW, and decoding each bit costs 0.63 pJ of energy on average.
We also compare the decoding throughput, area efficiency, and power consumption with several other high-throughput decoders, and present the results in Table 3. From the KPIs, we conclude that unrolled decoders are more suitable for scenarios requiring extremely high throughput, but they only support a fixed code length and rate. Recursive decoders are much smaller, making them better suited to resource-constrained devices, while also providing flexible code rates and lengths — a desirable property for wireless communications.
In this paper, we propose a new construction method for fast polar codes, which is solely composed of fast-decodable special nodes at length 16. Viewing the decoding process as a binary tree traversal, fast polar codes reduce node visits by 55%, \( f_{+/-} \) calculations by 8.9%, and edge traversals by 43.5% over the original polar construction at code length N=1024 and code rate R=0.875, at the cost of a slight BLER performance loss.
We implement two types of decoders for the fast polar codes. The recursive decoder supports flexible code lengths and code rates, with code lengths of up to 1024. This decoder's layout area is only 0.045 mm^{2}, and it provides a coded-bit throughput of 25.6 Gbps with an area efficiency of 561 Gbps/mm^{2}.
The unrolled decoder only supports one code length N=1024 and one code rate R=0.875. However, the fully pipelined structure leads to hardware offering ultra-high area efficiency and low decoding power consumption. The estimated layout area of this decoder is 0.3 mm^{2}, and it can provide a code bit throughput of 1229 Gbps with an area efficiency as high as 4096 Gbps/mm^{2}.
These results indicate that fast polar codes can meet the high-throughput demand of next-generation wireless communication systems, and that recursive and unrolled hardware designs can be adopted to satisfy different system requirements.