Background: A compiler can automatically vectorize parallelizable code blocks using the parallel instruction set of the target platform, but this optimization is heavily constrained: in most cases the optimal instruction pipeline is not generated, and the SIMD capability of x86/ARM CPUs is not fully exploited. The current mainstream approach is therefore to hand-write assembly or intrinsic code to produce the optimal parallel code segments. However, writing a separate implementation for each instruction set architecture hurts maintainability and scalability, and a community's maintenance cost grows with the amount of such code. A general instruction-set optimization solution is urgently needed, which is the universal SIMD (USIMD) optimization introduced below.
It is easy to write the following C code to compute the product of all the elements of a vector:
int MultiplyIntList(int const *l1, int n)
{
    int s = 1;
    while (n--) {
        s *= (*l1++);
    }
    return s;
}
Since this is a simple loop structure, SIMD technology is often used to accelerate it. If you are familiar with the ARM NEON instruction set, you can rewrite it as follows:
int MultiplyIntList(int const *l1, int n)
{
    int s = 1;
#ifdef HAVE_NEON
    int dim4 = n >> 2;
    n &= 3;
    int32x4_t sum_vec = vdupq_n_s32(1);
    int32x4_t data_vec;
    for (; dim4 > 0; dim4--) {
        data_vec = vld1q_s32(l1);
        sum_vec = vmulq_s32(sum_vec, data_vec);
        l1 += 4;
    }
    s = vgetq_lane_s32(sum_vec, 0) * vgetq_lane_s32(sum_vec, 1) *
        vgetq_lane_s32(sum_vec, 2) * vgetq_lane_s32(sum_vec, 3);
#endif
    while (n--) {
        s *= (*l1++);
    }
    return s;
}
The parallelized code performs more than 50% better than the plain C version. If you are familiar with the SSE instructions, you can adapt the code for x86 platforms as follows:
int MultiplyIntList(int const *l1, int n)
{
    int s = 1;
#ifdef HAVE_NEON
    ...
#elif HAVE_SSE
    int dim4 = n >> 2;
    n &= 3;
    __m128i sum_vec = _mm_set1_epi32(1);
    __m128i data_vec;
    for (; dim4 > 0; dim4--) {
        data_vec = _mm_loadu_si128((const __m128i *)l1);
        sum_vec = _mm_mullo_epi32(sum_vec, data_vec);
        l1 += 4;
    }
    data_vec = _mm_shuffle_epi32(sum_vec, _MM_SHUFFLE(2, 3, 0, 1));
    sum_vec = _mm_mullo_epi32(sum_vec, data_vec);
    data_vec = _mm_shuffle_epi32(sum_vec, _MM_SHUFFLE(1, 0, 3, 2));
    sum_vec = _mm_mullo_epi32(sum_vec, data_vec);
    s = _mm_cvtsi128_si32(sum_vec);
#endif
    while (n--) {
        s *= (*l1++);
    }
    return s;
}
The code's performance is also improved on the x86 platform. However, x86 supports not only SSE but also other instruction sets such as SSE2, SSE3, SSE4.1, AVX2, and AVX512 (which itself has multiple variants), so the code becomes fragmented.
In addition, IBM PowerPC has its own VSX instruction set. A complete multi-platform SIMD optimization may therefore end up looking like the following:
int MultiplyIntList(int const *l1, int n)
{
    int s = 1;
#ifdef HAVE_NEON
    ...
#elif HAVE_SSE
    ...
#elif HAVE_SSE2
    ...
#elif HAVE_AVX2
    ...
#elif HAVE_VSX
    ...
#endif
    while (n--) {
        s *= (*l1++);
    }
    return s;
}
Does it drive you crazy? And this is just the evolution of one simple function; real cases are often far more complex than a vector product. Hand-writing assembly or intrinsics for SIMD instruction orchestration is therefore not recommended: maintainability is sacrificed for performance, and the skill requirements on programmers are high. To address this problem, OpenCV implements the USIMD approach at its hardware abstraction layer.
Proven in practice, the approach has matured, its feasibility is widely recognized, and it has recently been promoted in the NumPy community. Under the USIMD framework, the vector-product code looks roughly like the following:
int MultiplyIntList(int const *l1, int n)
{
    int s = 1;
#if NPY_SIMD
    int dim4 = n >> 2;
    n &= 3;
    npyv_s32 sum_vec = npyv_setall_s32(1);
    npyv_s32 data_vec;
    for (; dim4 > 0; dim4--) {
        data_vec = npyv_load_s32(l1);
        sum_vec = npyv_mul_s32(sum_vec, data_vec);
        l1 += 4;
    }
    npy_int32 lanes[4];
    npyv_store_s32(lanes, sum_vec);
    s = lanes[0] * lanes[1] * lanes[2] * lanes[3];
#endif
    while (n--) {
        s *= (*l1++);
    }
    return s;
}
This code can be used across x86, ARM, and PowerPC, and the optimal SIMD performance is obtained on all platforms.
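Note that the sketch above still assumes four 32-bit lanes per vector, whereas NPY_SIMD may enable wider registers (for example 256-bit AVX2 with eight int32 lanes). Real USIMD code therefore takes the lane count from the framework. Below is a hedged variant of the loop, assuming the npyv_nlanes_s32 lane-count constant and s32 variants of the load/multiply/store intrinsics analogous to the APIs listed later in this article:
#if NPY_SIMD
    // vstep = number of 32-bit lanes of the vector width chosen at build time
    const int vstep = npyv_nlanes_s32;
    npyv_s32 sum_vec = npyv_setall_s32(1);
    for (; n >= vstep; n -= vstep, l1 += vstep) {
        sum_vec = npyv_mul_s32(sum_vec, npyv_load_s32(l1));
    }
    // reduce the vector to a scalar product lane by lane
    npy_int32 lanes[npyv_nlanes_s32];
    npyv_store_s32(lanes, sum_vec);
    for (int i = 0; i < vstep; i++) {
        s *= lanes[i];
    }
#endif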
NumPy is an open-source numerical computing extension for Python. It stores and processes large matrices far more efficiently than Python's nested list structures, supports multi-dimensional arrays and matrix operations, and provides a large library of mathematical functions for array computation. NumPy contributed to producing the world's first photo of a black hole and is widely used in machine learning and image processing. Before USIMD was introduced, the community had already implemented many SSE- and AVX-based optimizations, and maintaining that code imposed severe technical debt. However, there is light at the end of the tunnel: Sayed Adel, a major contributor to OpenCV, proposed NEP 38 and implemented the related functionality to improve NumPy's running efficiency on PowerPC. You can find the detailed description of this solution in SIMD Optimizations. The main steps are as follows:
Before building, configure the required optimizations through command-line arguments:
- --cpu-baseline: the minimum necessary optimization baseline.
- --cpu-dispatch: the non-baseline optimizations to be dispatched.
The build system then:
- Checks the compiler and platform architecture, and caches some intermediate results to speed up rebuilds.
- Tests the compiler to determine which of the requested CPU features it can actually provide.
- Generates the header file _cpu_dispatch.h, which contains all instruction-set definitions verified in the previous step, together with the C definitions behind the Python-level module attributes __cpu_baseline__ and __cpu_dispatch__ (an illustrative sketch of this header follows this list).
- Handles the dispatchable sources. A dispatchable source is a special C file that can be compiled multiple times with different compiler flags and C definitions, which in turn change the code-execution path. The instruction sets enabled for each compilation object are declared, per the configuration statements, in a C comment starting with the special mark @targets at the top of each dispatchable source. If optimization is disabled with the command-line argument --disable-optimization, dispatchable sources are treated as ordinary C sources.
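To make the generated header concrete, _cpu_dispatch.h roughly contains definitions like the following. This is only an illustrative sketch under the assumption of an SSE3 baseline with AVX2/AVX512F dispatch; the exact macro names and values are produced by the build system and may differ:
/* illustrative sketch of a generated _cpu_dispatch.h (names and values are examples only) */
#define NPY_HAVE_SSE   1   /* baseline features verified for the compiler */
#define NPY_HAVE_SSE2  1
#define NPY_HAVE_SSE3  1
#define NPY_WITH_CPU_BASELINE  "SSE SSE2 SSE3"   /* surfaced as the __cpu_baseline__ attribute */
#define NPY_WITH_CPU_DISPATCH  "AVX2 AVX512F"    /* surfaced as the __cpu_dispatch__ attribute */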
NumPy processes dispatchable sources in four steps:
(A) Recognition: Like source templates and F2PY, a dispatchable source must carry the special extension *.dispatch.c to mark it as a dispatchable C file.
(B) Parsing and validation: The configuration statements of each dispatchable source filtered out in the previous step are parsed and validated to determine the required optimizations.
(C) Packaging: This is the approach adopted by NumPy's infrastructure, which has proved flexible enough to compile a single source multiple times with different C definitions and flags that affect the code path. It works by creating a temporary C source for each required optimization, which declares the relevant C definitions and pulls in the original source through the C directive #include (a hedged sketch of such a wrapper follows this list).
(D) Dispatchable configuration header: The build system generates a configuration header for each dispatchable source. The header mainly contains two abstract C macros that identify the generated objects; they can be used at runtime to dispatch the appropriate symbols from the generated objects of any C source, and also for forward declarations.
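For step (C), the temporary wrapper created for each required optimization is conceptually very small. The following is a hedged sketch; the real file names, include paths, and macro names are chosen by the build system and may differ:
/* build/.../einsum.dispatch.avx2.c -- an auto-generated wrapper (all names here are illustrative) */
#define NPY__CPU_TARGET_CURRENT AVX2   /* tells the included source which target it is being built for */
#define NPY__CPU_TARGET_AVX2
#define NPY__CPU_TARGET_FMA3
#include "einsum.dispatch.c"           /* the original dispatchable source */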
Einstein summation convention (also known as einsum) is a notational convention proposed by Einstein in 1916. It can express, among other things, matrix transposition, matrix multiplication, the trace, tensor products, and array summation; for example, einsum('ij,jk->ik', A, B) is simply the matrix product of A and B. It is a universal function and one of the most useful functions in NumPy. Thanks to its expressive power and intelligent loop handling, it can beat the ordinary array functions in both speed and memory efficiency. So what is the trick behind this fast einsum function? You will find a large amount of SSE-related code in its source:
#if EINSUM_USE_SSE1 && @float32@
/* Use aligned instructions if possible */
if (EINSUM_IS_SSE_ALIGNED(data0) && EINSUM_IS_SSE_ALIGNED(data1) &&
EINSUM_IS_SSE_ALIGNED(data_out)) {
/* Unroll the loop by 8 */
while (count >= 8) {
count -= 8;
/**begin repeat2
* #i = 0, 4#
*/
a = _mm_mul_ps(_mm_load_ps(data0+@i@), _mm_load_ps(data1+@i@));
b = _mm_add_ps(a, _mm_load_ps(data_out+@i@));
_mm_store_ps(data_out+@i@, b);
/**end repeat2**/
data0 += 8;
data1 += 8;
data_out += 8;
}
/* Finish off the loop */
goto finish_after_unrolled_loop;
}
#elif EINSUM_USE_SSE2 && @float64@
/* Use aligned instructions if possible */
if (EINSUM_IS_SSE_ALIGNED(data0) && EINSUM_IS_SSE_ALIGNED(data1) &&
EINSUM_IS_SSE_ALIGNED(data_out)) {
/* Unroll the loop by 8 */
while (count >= 8) {
count -= 8;
/**begin repeat2
* #i = 0, 2, 4, 6#
*/
a = _mm_mul_pd(_mm_load_pd(data0+@i@), _mm_load_pd(data1+@i@));
b = _mm_add_pd(a, _mm_load_pd(data_out+@i@));
_mm_store_pd(data_out+@i@, b);
/**end repeat2**/
data0 += 8;
data1 += 8;
data_out += 8;
}
/* Finish off the loop */
goto finish_after_unrolled_loop;
}
#endif
/* Unroll the loop by 8 */
while (count >= 8) {
count -= 8;
#if EINSUM_USE_SSE1 && @float32@
/**begin repeat2
* #i = 0, 4#
*/
a = _mm_mul_ps(_mm_loadu_ps(data0+@i@), _mm_loadu_ps(data1+@i@));
b = _mm_add_ps(a, _mm_loadu_ps(data_out+@i@));
_mm_storeu_ps(data_out+@i@, b);
/**end repeat2**/
#elif EINSUM_USE_SSE2 && @float64@
/**begin repeat2
* #i = 0, 2, 4, 6#
*/
a = _mm_mul_pd(_mm_loadu_pd(data0+@i@), _mm_loadu_pd(data1+@i@));
b = _mm_add_pd(a, _mm_loadu_pd(data_out+@i@));
_mm_storeu_pd(data_out+@i@, b);
/**end repeat2**/
#else
/**begin repeat2
* #i = 0, 1, 2, 3, 4, 5, 6, 7#
*/
data_out[@i@] = @to@(@from@(data0[@i@]) *
@from@(data1[@i@]) +
@from@(data_out[@i@]));
/**end repeat2**/
#endif
After the SSE and SSE2 optimizations, performance improves, but the code is not portable: a separate set of ARM and VSX code would be needed to achieve the same effect. Migrating the code to USIMD achieves the cross-platform result. The migration procedure is as follows:
The external API of the einsum module is PyArray_EinsteinSum, but only get_sum_of_products_function inside it involves the SIMD parallel algorithms. Therefore, that function is extracted as einsum_get_sum_of_products_function:
static sum_of_products_fn
get_sum_of_products_function(int nop, int type_num, npy_intp itemsize, npy_intp const *fixed_strides)
{
#ifndef NPY_DISABLE_OPTIMIZATION
#include "einsum.dispatch.h"
#endif
NPY_CPU_DISPATCH_CALL(return einsum_get_sum_of_products_function,
(nop, type_num, itemsize, fixed_strides))
}
Extract all subfunctions related to the SIMD parallel algorithms into the file einsum.dispatch.c.src. This file is the key to dispatching: during compilation, multiple objects for different instruction sets are generated from it according to __cpu_baseline__ and __cpu_dispatch__. The default dispatch targets need to be specified in the file header. For example, to generate SSE2, AVX2, and AVX512F objects on x86 and NEON and ASIMD objects on ARM, add the following to the header:
/**
* @targets $maxopt baseline
* SSE2 (AVX2 FMA3) AVX512F
* NEON ASIMD ASIMDHP
*/
Modify the APIs. USIMD provides the following common macro definitions:
Macro Definition | Description |
---|---|
NPY_SIMD | The parallel processing width in bits: 128 for SSE/VSX/NEON, 256 for AVX2. |
NPY_SIMD_F64 | Indicates whether float64 processing is supported. |
npyv_add_u8 | Unsigned 8-bit addition function. USIMD currently provides more than 260 function APIs. |
npyv_u8 | Unsigned 8-bit data type. USIMD currently provides more than 20 data type APIs. |
You can use the preceding macros to modify SSE code as follows:
#if NPY_SIMD && @float32@
/* Use aligned instructions if possible */
if (EINSUM_IS_ALIGNED(data0) && EINSUM_IS_ALIGNED(data1) &&
EINSUM_IS_ALIGNED(data_out)) {
/* Unroll the loop by UNROLL_BY */
while (count >= UNROLL_BY) {
count -= UNROLL_BY;
// two simd pack in one round
for(int i = 0; i < 2; i++) {
pack_size = i * UNROLL_BY / 2;
a = npyv_mul_f32(npyv_loada_f32(data0+pack_size), npyv_loada_f32(data1+pack_size));
b = npyv_add_f32(a, npyv_loada_f32(data_out+pack_size));
npyv_storea_f32(data_out+pack_size, b);
}
data0 += UNROLL_BY;
data1 += UNROLL_BY;
data_out += UNROLL_BY;
}
/* Finish off the loop */
goto finish_after_unrolled_loop;
}
#elif NPY_SIMD_F64 && @float64@
/* Use aligned instructions if possible */
if (EINSUM_IS_ALIGNED(data0) && EINSUM_IS_ALIGNED(data1) &&
EINSUM_IS_ALIGNED(data_out)) {
/* Unroll the loop by UNROLL_BY */
while (count >= UNROLL_BY) {
count -= UNROLL_BY;
for (int i = 0; i < 4; i++) {
pack_size = i * UNROLL_BY / 4;
a = npyv_mul_f64(npyv_loada_f64(data0+pack_size), npyv_loada_f64(data1+pack_size));
b = npyv_add_f64(a, npyv_loada_f64(data_out+pack_size));
npyv_storea_f64(data_out+pack_size, b);
}
data0 += UNROLL_BY;
data1 += UNROLL_BY;
data_out += UNROLL_BY;
}
/* Finish off the loop */
goto finish_after_unrolled_loop;
}
#endif
/* Unroll the loop by UNROLL_BY */
while (count >= UNROLL_BY) {
count -= UNROLL_BY;
#if NPY_SIMD && @float32@
for (int i = 0; i < 2; i++) {
pack_size = i * UNROLL_BY / 2;
a = npyv_mul_f32(npyv_load_f32(data0+pack_size), npyv_load_f32(data1+pack_size));
b = npyv_add_f32(a, npyv_load_f32(data_out+pack_size));
npyv_store_f32(data_out+pack_size, b);
}
#elif NPY_SIMD_F64 && @float64@
for (int i = 0; i < 4; i++) {
pack_size = i * UNROLL_BY / 4;
a = npyv_mul_f64(npyv_load_f64(data0+pack_size), npyv_load_f64(data1+pack_size));
b = npyv_add_f64(a, npyv_load_f64(data_out+pack_size));
npyv_store_f64(data_out+pack_size, b);
}
#else
for (int i = 0; i < UNROLL_BY; i++) {
data_out[i] = @to@(@from@(data0[i]) *
@from@(data1[i]) +
@from@(data_out[i]));
}
#endif
Execute test cases and benchmarks.
einsum submodule triggered | master | einsum-usimd | ratio |
---|---|---|---|
sum_of_products_stride0_contig_outstride0_two | 110±0.5μs | 100±0.5μs | 0.90 |
sum_of_products_contig_stride0_outstride0_two | 110±0.5μs | 101±0.5μs | 0.90 |
sum_of_products_contig_stride0_outcontig_two | 23.7±1ms | 20.2±0.3ms | 0.85 |
sum_of_products_contig_two | 152±0.8μs | 140±2μs | 0.91 |
sum_of_products_stride0_contig_outcontig_two | 946±10μs | 841±20μs | 0.89 |
contig_contig_outstride0_two | 522±5μs | 353±7μs | 0.68 |
sum_of_products_contig_outstride0_one | 871±10μs | 502±4μs | 0.58 |
Modifying the code with USIMD brings the following advantages: a single implementation covers x86, ARM, and PowerPC; the per-instruction-set code paths disappear, which greatly reduces maintenance cost; and, as the benchmarks above show, performance matches or exceeds the hand-written SSE version.
The USIMD framework consists of three parts: compile time, runtime, and API. The first part is compile time.
At compile time, the build first checks whether --disable-optimization is set. If it is, the normal compile-and-link flow is followed. Otherwise, the build scans the source tree for *.dispatch.c files, initializes the underlying API library, and determines the architecture platform (x86, ARM, or PowerPC) from the build environment. If a *.dispatch.c file has already been dispatched for the specified features (cached in the temp file under the build directory), the dispatch files are added to the source compilation tree for subsequent compilation and linking. If there is no cache (a first build, or the feature baseline was modified), the configuration header, which defines the features to be dispatched, is generated from the features requested by --cpu-baseline and --cpu-dispatch and the support available in the compiler. Take the following target configuration as an example:
/**
* @targets $maxopt baseline
* SSE2 (AVX2 FMA3) AVX512F
* NEON NEON_VFPV4
*/
With a compiler that supports AVX2 with FMA3 as well as AVX512F, the build generates the following:
#define NPY__CPU_DISPATCH_CALL(CHK, CB, ...) \
NPY__CPU_DISPATCH_EXPAND_(CB((CHK(AVX512_SKX)), AVX512F, __VA_ARGS__)) \
NPY__CPU_DISPATCH_EXPAND_(CB((CHK(AVX)&&CHK(F16C)&&CHK(FMA3)&&CHK(AVX2)), AVX2, __VA_ARGS__))
The code means the same as the following:
if (NPY_CPU_HAVE(AVX512F)) {
    return XXX_avx512f(...);
} else if (NPY_CPU_HAVE(AVX2) && NPY_CPU_HAVE(FMA3)) {
    return XXX_avx2__fma3(...);
} else {
    // fallback to the baseline
    return XXX(...);
}
The compiler has now generated optimized objects for the various features. But the CPU of the machine the package eventually runs on may not support AVX, or even SSE. What then? This is where the runtime part comes in.
The following code is used at runtime to check whether the machine supports a feature:
x86 platform:
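/* Read the XCR0 control register via xgetbv (leaf 0); its bits tell which vector register states (SSE/AVX) the operating system saves and restores. */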
#if defined(_MSC_VER) || defined (__INTEL_COMPILER)
return _xgetbv(0);
#elif defined(__GNUC__) || defined(__clang__)
unsigned int eax, edx;
__asm(".byte 0x0F, 0x01, 0xd0" : "=a"(eax), "=d"(edx) : "c"(0));
return eax;
#else
return 0;
#endif
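The value returned above is XCR0, and interpreting it is what tells us whether the operating system actually manages the wider register state. A minimal sketch of that check (the helper name is illustrative, not NumPy's):
/* Sketch: interpret the XCR0 value returned above (helper name is illustrative). */
static int os_saves_avx_state(unsigned int xcr0)
{
    /* bit 1 = XMM (SSE) state, bit 2 = YMM (AVX) state; the OS must manage both */
    return (xcr0 & 0x06) == 0x06;
}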
PowerPC platform:
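/* Query the kernel's ELF auxiliary vector for the CPU capability bits; give up if VSX is not reported. */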
unsigned int hwcap = getauxval(AT_HWCAP);
if ((hwcap & PPC_FEATURE_HAS_VSX) == 0)
return;
hwcap = getauxval(AT_HWCAP2);
ARM platform:
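/* Prefer getauxval(); if it is unavailable, read /proc/self/auxv directly, and finally fall back to /proc/cpuinfo. */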
if (getauxval != 0) {
hwcap = getauxval(NPY__HWCAP);
#ifdef __arm__
hwcap2 = getauxval(NPY__HWCAP2);
#endif
} else {
unsigned long auxv[2];
int fd = open("/proc/self/auxv", O_RDONLY);
if (fd >= 0) {
while (read(fd, &auxv, sizeof(auxv)) == sizeof(auxv)) {
if (auxv[0] == NPY__HWCAP) {
hwcap = auxv[1];
}
#ifdef __arm__
else if (auxv[0] == NPY__HWCAP2) {
hwcap2 = auxv[1];
}
#endif
// detect the end
else if (auxv[0] == 0 && auxv[1] == 0) {
break;
}
}
close(fd);
}
}
if (hwcap == 0 && hwcap2 == 0) {
/*
* try parsing with /proc/cpuinfo, if sandboxed
* failback to compiler definitions
*/
if(!get_feature_from_proc_cpuinfo(&hwcap, &hwcap2)) {
return 0;
}
}
Through the preceding detection, all the CPU features of the running machine can be identified. For each supported feature, the corresponding flag is enabled, so that the matching generated object is dispatched at runtime.
Underlying APIs can be classified into the following types:
Instruction Category | Example |
---|---|
Arithmetic | npyv_add_u8 |
Conversion | npyv_cvt_u8_b8 |
Memory | npyv_load_f32 |
Reordering | npyv_combinel_u8 |
Bitwise operation | npyv_shl_u16 |
Miscellaneous | npyv_zero_u8 |
Data type | npyv_u8 |
The SSE intrinsics of the x86 platform serve as the reference for the API. If another platform lacks a corresponding instruction, the operation is emulated with a sequence of its native instructions.
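This emulation is needed even within x86: for example, SSE2 has no packed 32-bit lane multiply (_mm_mullo_epi32 only arrived with SSE4.1). The following is a minimal sketch of how such an operation can be emulated on SSE2; it illustrates the idea only and is not NumPy's actual implementation (the helper name is made up):
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Illustrative lane-wise 32-bit multiply for SSE2, where _mm_mullo_epi32 is unavailable. */
static __m128i mul_s32_sse2_sketch(__m128i a, __m128i b)
{
    /* 64-bit products of lanes 0 and 2 */
    __m128i even = _mm_mul_epu32(a, b);
    /* shift both operands down by one 32-bit lane, then multiply lanes 1 and 3 */
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    /* keep the low 32 bits of each 64-bit product and interleave them back in order */
    __m128i even_lo = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    __m128i odd_lo  = _mm_shuffle_epi32(odd, _MM_SHUFFLE(0, 0, 2, 0));
    return _mm_unpacklo_epi32(even_lo, odd_lo);
}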