Abstract: This article uses a real case, the Boolean argmax algorithm in NumPy, to describe the process of optimizing an algorithm with NEON intrinsics and the key techniques involved. Compared with C code optimized by the compiler alone, performance improves by about 80%. The content should be helpful for developers who need NEON to implement high-performance computing.
Hardware configuration: Kunpeng (ARM64) server
For a one-dimensional vector, NumPy provides an argmax function that returns the index of the maximum element in the array.
import numpy as np
a = np.array([3, 1, 2, 4, 6, 1])
# Get the index of the maximum element of a. The maximum value is 6,
# at index 4 (indices start from 0).
b = np.argmax(a)
print(b)
If the vector consists only of 0s and 1s, then argmax returns the index of the first 1 that appears; this is the Boolean vector case we are going to discuss. NumPy implements it as follows:
static int
BOOL_argmax(npy_bool *ip, npy_intp n, npy_intp *max_ind,
            PyArrayObject *NPY_UNUSED(aip))
{
    npy_intp i = 0;
    for (; i < n; i++) {
        if (ip[i]) {
            *max_ind = i;
            return 0;
        }
    }
    *max_ind = 0;
    return 0;
}
The basic idea of the algorithm is simple: the one-dimensional array ip is scanned linearly, and as soon as a 1 is found, its index is stored in max_ind and the function returns. For arrays with small data volumes this is efficient. In big-data workloads, however, the array may contain long runs of 0s, and this function becomes a performance hot spot. How should it be optimized? The NEON SIMD instruction set unique to the ARM platform promises to improve computing efficiency (officially, up to four times), so it does not hurt to try.
int32_t sign_mask(uint8x16_t input)
{
    /* Shift amounts: a negative shift in vshl is a right shift, so byte i
       of each half is shifted right by (7 - i), moving its MSB to bit i. */
    const int8_t __attribute__((aligned(16))) xr[8] = {-7, -6, -5, -4, -3, -2, -1, 0};
    uint8x8_t mask_and = vdup_n_u8(0x80);
    int8x8_t mask_shift = vld1_s8(xr);
    uint8x8_t lo = vget_low_u8(input);
    uint8x8_t hi = vget_high_u8(input);
    /* Keep only the MSB of each byte, then move it to bit position i. */
    lo = vand_u8(lo, mask_and);
    lo = vshl_u8(lo, mask_shift);
    hi = vand_u8(hi, mask_and);
    hi = vshl_u8(hi, mask_shift);
    /* Three pairwise adds fold the 8 bytes of each half into one byte;
       each byte contributes a distinct bit, so the sums never carry. */
    lo = vpadd_u8(lo, lo);
    lo = vpadd_u8(lo, lo);
    lo = vpadd_u8(lo, lo);
    hi = vpadd_u8(hi, hi);
    hi = vpadd_u8(hi, hi);
    hi = vpadd_u8(hi, hi);
    return ((hi[0] << 8) | (lo[0] & 0xFF));
}
static int
BOOL_argmax(npy_bool *ip, npy_intp n, npy_intp *max_ind,
            PyArrayObject *NPY_UNUSED(aip))
{
    npy_intp i = 0;
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
    uint8x16_t zero = vdupq_n_u8(0);
    /* Process 32 bytes per iteration; the n % 32 tail is left to the
       scalar loop below. */
    for (; i < n - (n % 32); i += 32) {
        uint8x16_t d1 = vld1q_u8((uint8_t *)&ip[i]);
        uint8x16_t d2 = vld1q_u8((uint8_t *)&ip[i + 16]);
        /* vceqq_u8 sets a byte to 0xFF where it equals zero. If all 32
           bytes are zero, the minimum of the two results is all 0xFF
           and the mask is 0xFFFF; anything else means this block
           contains a 1, so drop to the scalar loop to locate it. */
        d1 = vceqq_u8(d1, zero);
        d2 = vceqq_u8(d2, zero);
        if (sign_mask(vminq_u8(d1, d2)) != 0xFFFF) {
            break;
        }
    }
#endif
    for (; i < n; i++) {
        if (ip[i]) {
            *max_ind = i;
            return 0;
        }
    }
    *max_ind = 0;
    return 0;
}
There are two key points in the implementation: the main loop tests 32 bytes at a time against zero (vceqq_u8 on each half, with vminq_u8 merging the two comparison results), and sign_mask collapses the 16-byte comparison result into a 16-bit integer so that a single compare decides whether the block contains a 1.
After the implementation, check the effect. The benchmark test result of NumPy is as follows:
       before       after         ratio
    [3f11db40]   [00b21d1b]
     <master>    <neon-argmax>
-   161±0.3μs    47.7±0.5μs      0.30   bench_reduce.ArgMax.time_argmax(<class 'bool'>)
The optimization effect is obvious: performance improved by about 70% (from 161 μs down to 47.7 μs). Is that the end? After the code was submitted to the community, reviewers familiar with OpenCV pointed out that performance could be pushed further. Did you notice how many instructions sign_mask uses? The mask operation does not actually require separate low-half and high-half accumulate-and-shift sequences; it can be simplified into the following code:
int32_t _mm_movemask_epi8_neon(uint8x16_t input)
{
    /* Per-byte left-shift amounts 0..7, duplicated for both halves. */
    int8x8_t m0 = vcreate_s8(0x0706050403020100ULL);
    /* Move each byte's MSB down to bit 0, then up to bit (lane % 8). */
    uint8x16_t v0 = vshlq_u8(vshrq_n_u8(input, 7), vcombine_s8(m0, m0));
    /* Pairwise widening adds fold each 8-byte half into one sum. */
    uint64x2_t v1 = vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(v0)));
    return (int)vgetq_lane_u64(v1, 0) + ((int)vgetq_lane_u64(v1, 1) << 8);
}
After this optimization, performance improved by a further 8%, and the community finally accepted the submission.