
USIMD: Best Practice of Full-Platform Performance Optimization

Aug 25, 2021

Background: A compiler can automatically vectorize parallelizable code blocks using the parallel instruction set of a platform, but this optimization is heavily constrained: in most cases the optimal instruction pipeline is not generated, and the SIMD capability of x86/ARM CPUs is not fully exploited. The current mainstream approach is therefore to hand-write assembly or intrinsic code to produce the optimal parallel code segment. However, writing a separate implementation for each instruction set architecture hurts maintainability and scalability, and a community's maintenance cost grows as such code accumulates. A general instruction set optimization solution is urgently needed, and that is the USIMD optimization introduced below.

Beginning of USIMD

It is easy to write the following C language code to find the product of all elements of a vector.

int MultiplyIntList(int const *l1, int n)
{
    int s = 1;
    while (n--) {
        s *= (*l1++);
    }
    return s;
}

Since it is a simple loop, SIMD is a natural way to accelerate it. If you are familiar with the ARM NEON instruction set, you can rewrite it as follows:

int MultiplyIntList(int const *l1, int n)
{
    int s = 1;
#ifdef HAVE_NEON
    int dim4 = n >> 2;   /* number of full 4-lane blocks */
    n &= 3;              /* elements left for the scalar tail */
    int32x4_t sum_vec = vdupq_n_s32(1);
    int32x4_t data_vec;
    for (; dim4 > 0; dim4--) {
        data_vec = vld1q_s32(l1);
        sum_vec = vmulq_s32(sum_vec, data_vec);
        l1 += 4;
    }
    /* multiply the four lanes together */
    s = vgetq_lane_s32(sum_vec, 0) * vgetq_lane_s32(sum_vec, 1) *
        vgetq_lane_s32(sum_vec, 2) * vgetq_lane_s32(sum_vec, 3);
#endif
    while (n--) {
        s *= (*l1++);
    }
    return s;
}

The parallelized code is more than 50% faster than the plain C version. If you are familiar with SSE instructions, you can adapt the code for x86 platforms as follows:

int MultiplyIntList(int const *l1, int n)
{
    int s = 1;
#ifdef HAVE_NEON
    ...
#elif defined(HAVE_SSE)
    int dim4 = n >> 2;
    n &= 3;
    __m128i sum_vec = _mm_set1_epi32(1);
    __m128i data_vec;
    for (; dim4 > 0; dim4--) {
        data_vec = _mm_loadu_si128((const __m128i *)l1);
        sum_vec = _mm_mullo_epi32(sum_vec, data_vec);  /* _mm_mullo_epi32 needs SSE4.1 */
        l1 += 4;
    }
    /* multiply the four lanes together */
    data_vec = _mm_shuffle_epi32(sum_vec, _MM_SHUFFLE(2, 3, 0, 1));
    sum_vec = _mm_mullo_epi32(sum_vec, data_vec);
    data_vec = _mm_shuffle_epi32(sum_vec, _MM_SHUFFLE(1, 0, 3, 2));
    sum_vec = _mm_mullo_epi32(sum_vec, data_vec);
    s = _mm_cvtsi128_si32(sum_vec);
#endif
    while (n--) {
        s *= (*l1++);
    }
    return s;
}

The code performance is also improved on the x86 platform. However, the x86 platform supports not only SSE, but also other instruction sets such as SSE2, SSE3, SSE4.1, AVX2, and AVX512 (which has multiple versions). As a result, the code is fragmented.


In addition, IBM PowerPC has its own VSX instruction set. Therefore, a complete multi-platform SIMD optimization solution may be similar to the following code:

int MultiplyIntList(int const *l1, int n)
{
    int s = 1;
#ifdef HAVE_NEON
    ...
#elif HAVE_SSE
    ...
#elif HAVE_SSE2
    ...
#elif HAVE_AVX2
    ...
#elif HAVE_VSX
    ...
#endif
    while (n--) {
        s *= (*l1++);
    }
    return s;
}

Does it drive you crazy yet? The preceding is only the evolution of a trivial function; real cases are far more complex than a vector product. Hand-orchestrating SIMD instructions in assembly or intrinsics is therefore not recommended: maintainability is sacrificed for performance, and the skill demanded of programmers is high. To address this problem, OpenCV implemented the USIMD practice at its hardware abstraction layer:

  • Provide a set of infrastructure for abstracting intrinsics.
  • Use the compiler macro and CPU detection to convert abstract intrinsics functions into specific intrinsics calls during compilation.
  • Use the CPU feature detection code to further limit the available instruction sets and select an optimal loop path at runtime.

This practice has since matured, its feasibility is widely recognized, and it has recently been adopted by the NumPy community. The vector product written against the USIMD framework looks like the following:

int MultiplyIntList(int const *l1, int n)
{
    int s = 1;
#ifdef NPY_SIMD
    int dim4 = n >> 2;   /* assumes 4 lanes for brevity; real code uses npyv_nlanes_s32 */
    n &= 3;
    npyv_s32 sum_vec = npyv_setall_s32(1);
    npyv_s32 data_vec;
    for (; dim4 > 0; dim4--) {
        data_vec = npyv_load_s32(l1);
        sum_vec = npyv_mul_s32(sum_vec, data_vec);
        l1 += 4;
    }
    /* horizontal multiply-reduce of the lanes (illustrative helper) */
    s = npyv_mulr_s32(sum_vec);
#endif
    while (n--) {
        s *= (*l1++);
    }
    return s;
}

This code can be used across x86, ARM, and PowerPC, and the optimal SIMD performance is obtained on all platforms.

Rise of USIMD

NumPy is an open-source numerical computing extension for Python. It stores and processes large matrices far more efficiently than Python's nested lists, supports a large number of array and matrix operations, and provides an extensive library of mathematical functions for arrays. NumPy contributed to generating the world's first image of a black hole and is widely used in machine learning and image processing. Before USIMD was introduced, the community had accumulated many SSE- and AVX-specific optimizations, and maintaining that code had become serious technical debt. There was, however, light at the end of the tunnel: Sayed Adel, a major contributor to OpenCV, proposed NEP 38 and implemented the related functionality, initially to improve NumPy's efficiency on PowerPC. You can find a detailed description of this solution in SIMD Optimizations. The main steps are as follows:

1. Configuration

Before building, configure the required optimization with command-line parameters (see the example after this list):

  • --cpu-baseline: the minimal set of required optimizations (the baseline)
  • --cpu-dispatch: additional, non-baseline optimizations that are dispatched at runtime
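
For example, a build that requires SSE2 and SSE3 unconditionally and additionally dispatches AVX2 and AVX512F kernels could be configured roughly as follows (the feature names here are illustrative; the defaults are usually sufficient):

python setup.py build --cpu-baseline="sse2 sse3" --cpu-dispatch="avx2 avx512f"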

2. Discovering the Environment

Check the compiler and platform architecture, and cache some intermediate results to speed up the rebuild.

3. Verifying the Requested Optimization

Test the compiler to determine which of the requested CPU features it can actually support.
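
Support is typically probed by compiling tiny feature-specific test files with the corresponding compiler flags, in the spirit of numpy/distutils/checks/cpu_avx2.c. The sketch below is illustrative rather than NumPy's actual probe:

/* Illustrative compiler probe: if this file compiles and links with the AVX2
 * flags, the compiler is considered able to generate AVX2 code. */
#include <immintrin.h>

int main(void)
{
    __m256i a = _mm256_abs_epi16(_mm256_setzero_si256());
    return _mm_cvtsi128_si32(_mm256_castsi256_si128(a));
}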

4. Generating Main Configuration Header

Generate the header file _cpu_dispatch.h that contains all instruction set definitions verified in the previous step.

The header also provides the C definitions behind NumPy's Python-level module attributes __cpu_baseline__ and __cpu_dispatch__.
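
The generated header is essentially a list of feature macros. A rough sketch is shown below; the actual contents are generated per build and depend on the compiler:

/* _cpu_dispatch.h -- sketch only; real contents are generated at build time */
#define NPY_HAVE_SSE    1
#define NPY_HAVE_SSE2   1
#define NPY_HAVE_AVX2   1
/* ... one NPY_HAVE_<FEATURE> definition per verified feature ... */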

5. Dispatchable Sources and Configuration Statements

The dispatchable source is a special C file that can be compiled multiple times with different compiler flags and C definitions, which in turn affect the code path. The instruction sets enabled for each compilation object are controlled by configuration statements, which must be declared in a C comment starting with the special mark @targets at the top of each dispatchable source. If optimization is disabled with the command parameter --disable-optimization, the dispatchable source is treated as an ordinary C source.

NumPy processes dispatchable sources in four steps:

  • (A) Recognition: Like source templates and F2PY, dispatchable sources are recognized by the special extension *.dispatch.c.

  • (B) Parsing and verification: The configuration statements of each dispatchable source found in the previous step are parsed and verified to determine the required optimizations.

  • (C) Packaging: This is the approach adopted by NumPy's infrastructure, which has proved flexible enough to compile a single source multiple times with different C definitions and flags that affect the code path. It is done by creating a temporary C source for each required optimization: the temporary file declares the relevant C definitions and pulls in the original source through the C directive #include.

  • (D) Dispatchable configuration header: The infrastructure generates a configuration header for each dispatchable source. The header mainly contains two abstract C macros used to identify the generated objects; at runtime they dispatch the required symbols from the objects generated from any C source, and they are also used for forward declarations. (A sketch of a complete dispatchable source follows.)
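
Putting (A) through (D) together, a dispatchable source looks roughly like the sketch below. The function name and target list are illustrative (this is not NumPy's actual einsum code); each compilation pass suffixes the defined symbols with the current target, e.g. einsum_get_sum_of_products_function_AVX2:

/**
 * @targets $maxopt baseline
 * SSE2 AVX2 AVX512F
 * NEON ASIMD
 */
/* einsum.dispatch.c.src -- illustrative sketch of a dispatchable source */
NPY_NO_EXPORT sum_of_products_fn
NPY_CPU_DISPATCH_CURFX(einsum_get_sum_of_products_function)
(int nop, int type_num, npy_intp itemsize, npy_intp const *fixed_strides)
{
    /* kernels written with the npyv_* universal intrinsics go here */
    return NULL; /* placeholder body */
}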

Best USIMD Practice

Einstein summation convention (also known as einsum) is a notational convention proposed by Einstein in 1916. It can express, among other things, matrix transposition, matrix multiplication, the trace, tensor products, and array summation. It is a universal function and one of the most useful functions in NumPy: thanks to its expressiveness and smart loop handling, it can beat the usual array functions in both speed and memory efficiency. What is the trick behind this fast einsum function? You can find a large amount of SSE-related code in its source:

#if EINSUM_USE_SSE1 && @float32@
    /* Use aligned instructions if possible */
    if (EINSUM_IS_SSE_ALIGNED(data0) && EINSUM_IS_SSE_ALIGNED(data1) &&
        EINSUM_IS_SSE_ALIGNED(data_out)) {
        /* Unroll the loop by 8 */
        while (count >= 8) {
            count -= 8;

/**begin repeat2
 * #i = 0, 4#
 */
            a = _mm_mul_ps(_mm_load_ps(data0+@i@), _mm_load_ps(data1+@i@));
            b = _mm_add_ps(a, _mm_load_ps(data_out+@i@));
            _mm_store_ps(data_out+@i@, b);
/**end repeat2**/
            data0 += 8;
            data1 += 8;
            data_out += 8;
        }

        /* Finish off the loop */
        goto finish_after_unrolled_loop;
    }
#elif EINSUM_USE_SSE2 && @float64@
    /* Use aligned instructions if possible */
    if (EINSUM_IS_SSE_ALIGNED(data0) && EINSUM_IS_SSE_ALIGNED(data1) &&
        EINSUM_IS_SSE_ALIGNED(data_out)) {
        /* Unroll the loop by 8 */
        while (count >= 8) {
            count -= 8;

/**begin repeat2
 * #i = 0, 2, 4, 6#
 */
            a = _mm_mul_pd(_mm_load_pd(data0+@i@), _mm_load_pd(data1+@i@));
            b = _mm_add_pd(a, _mm_load_pd(data_out+@i@));
            _mm_store_pd(data_out+@i@, b);
/**end repeat2**/
            data0 += 8;
            data1 += 8;
            data_out += 8;
        }

        /* Finish off the loop */
        goto finish_after_unrolled_loop;
    }
#endif

    /* Unroll the loop by 8 */
    while (count >= 8) {
        count -= 8;

#if EINSUM_USE_SSE1 && @float32@
/**begin repeat2
 * #i = 0, 4#
 */
        a = _mm_mul_ps(_mm_loadu_ps(data0+@i@), _mm_loadu_ps(data1+@i@));
        b = _mm_add_ps(a, _mm_loadu_ps(data_out+@i@));
        _mm_storeu_ps(data_out+@i@, b);
/**end repeat2**/
#elif EINSUM_USE_SSE2 && @float64@
/**begin repeat2
 * #i = 0, 2, 4, 6#
 */
        a = _mm_mul_pd(_mm_loadu_pd(data0+@i@), _mm_loadu_pd(data1+@i@));
        b = _mm_add_pd(a, _mm_loadu_pd(data_out+@i@));
        _mm_storeu_pd(data_out+@i@, b);
/**end repeat2**/
#else
/**begin repeat2
 * #i = 0, 1, 2, 3, 4, 5, 6, 7#
 */
        data_out[@i@] = @to@(@from@(data0[@i@]) *
                             @from@(data1[@i@]) +
                             @from@(data_out[@i@]));
/**end repeat2**/
#endif

Optimized with the SSE and SSE2 instruction sets, this code performs well but is not portable: a separate set of ARM (NEON) and VSX code would be needed to obtain the same effect elsewhere. Porting it to USIMD makes the optimization cross-platform. The procedure is as follows:

  1. Extract the function to be dispatched. The external API of the einsum module is PyArray_EinsteinSum, but only get_sum_of_products_function inside it involves the SIMD parallel algorithm, so that function is extracted as einsum_get_sum_of_products_function:
static sum_of_products_fn
get_sum_of_products_function(int nop, int type_num, npy_intp itemsize, npy_intp const *fixed_strides)
{
    #ifndef NPY_DISABLE_OPTIMIZATION
        #include "einsum.dispatch.h"
    #endif
    NPY_CPU_DISPATCH_CALL(return einsum_get_sum_of_products_function,
        (nop, type_num, itemsize, fixed_strides))
}
  2. Move all subfunctions related to the SIMD parallel algorithm into the einsum.dispatch.c.src file. This file is the key to dispatching: during compilation it is built multiple times into objects for different instruction sets, according to __cpu_baseline__ and __cpu_dispatch__. The default dispatch targets must be declared in the file header. For example, to generate SSE2/AVX2/AVX512F objects on x86 and NEON/ASIMD objects on ARM, add the following to the header:

    /**
     * @targets $maxopt baseline
     * SSE2 (AVX2 FMA3) AVX512F
     * NEON ASIMD ASIMDHP
     */
  3. Modify the API calls. USIMD provides the following common macros and APIs:

    Macro/API | Description
    --- | ---
    NPY_SIMD | Parallel SIMD width in bits: 128 for SSE/VSX/NEON, 256 for AVX2.
    NPY_SIMD_F64 | Whether float64 processing is supported.
    npyv_add_u8 | Unsigned 8-bit addition function API. USIMD currently provides more than 260 function APIs.
    npyv_u8 | Unsigned 8-bit data type. USIMD currently provides more than 20 data type APIs.

    You can use the preceding macros to modify SSE code as follows:

    #if NPY_SIMD && @float32@
        /* Use aligned instructions if possible */
        if (EINSUM_IS_ALIGNED(data0) && EINSUM_IS_ALIGNED(data1) &&
            EINSUM_IS_ALIGNED(data_out)) {
            /* Unroll the loop by UNROLL_BY */
            while (count >= UNROLL_BY) {
                count -= UNROLL_BY;
                // two simd pack in one round
                for(int i = 0; i < 2; i++) {
                    pack_size = i * UNROLL_BY / 2;
                    a = npyv_mul_f32(npyv_loada_f32(data0+pack_size), npyv_loada_f32(data1+pack_size));
                    b = npyv_add_f32(a, npyv_loada_f32(data_out+pack_size));
                    npyv_storea_f32(data_out+pack_size, b);
                }
                data0 += UNROLL_BY;
                data1 += UNROLL_BY;
                data_out += UNROLL_BY;
            }
    
            /* Finish off the loop */
            goto finish_after_unrolled_loop;
        }
    #elif NPY_SIMD_F64 && @float64@
        /* Use aligned instructions if possible */
        if (EINSUM_IS_ALIGNED(data0) && EINSUM_IS_ALIGNED(data1) &&
            EINSUM_IS_ALIGNED(data_out)) {
            /* Unroll the loop by UNROLL_BY */
            while (count >= UNROLL_BY) {
                count -= UNROLL_BY;
                for (int i  = 0; i < 4; i++) {
                    pack_size = i * UNROLL_BY / 4;
                    a = npyv_mul_f64(npyv_loada_f64(data0+pack_size), npyv_loada_f64(data1+pack_size));
                    b = npyv_add_f64(a, npyv_loada_f64(data_out+pack_size));
                    npyv_storea_f64(data_out+pack_size, b);
                }
                data0 += UNROLL_BY;
                data1 += UNROLL_BY;
                data_out += UNROLL_BY;
            }
            /* Finish off the loop */
            goto finish_after_unrolled_loop;
        }
    #endif
    
        /* Unroll the loop by UNROLL_BY */
        while (count >= UNROLL_BY) {
            count -= UNROLL_BY;
    
    #if NPY_SIMD && @float32@
            for (int i  = 0; i < 2; i++) {
                pack_size = i * UNROLL_BY / 2;
                a = npyv_mul_f32(npyv_load_f32(data0+pack_size), npyv_load_f32(data1+pack_size));
                b = npyv_add_f32(a, npyv_load_f32(data_out+pack_size));
                npyv_store_f32(data_out+pack_size, b);
            }
    #elif NPY_SIMD_F64 && @float64@
            for (int i  = 0; i < 4; i++) {
                pack_size = i * UNROLL_BY / 4;
                a = npyv_mul_f64(npyv_load_f64(data0+pack_size), npyv_load_f64(data1+pack_size));
                b = npyv_add_f64(a, npyv_load_f64(data_out+pack_size));
                npyv_store_f64(data_out+pack_size, b);
            }
    #else
        for (int i = 0; i < UNROLL_BY; i++) {
            data_out[i] = @to@(@from@(data0[i]) *
                                     @from@(data1[i]) +
                                     @from@(data_out[i]));
        }
    #endif
  4. Run the test cases and benchmarks.
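
The comparison against master can be produced with NumPy's asv-based benchmark runner, for example (the benchmark module filter is illustrative):

python runtests.py --bench-compare master bench_linalg

The results of the einsum benchmarks before and after the USIMD rewrite: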

einsum sub-module triggered | master | einsum-usimd | ratio
--- | --- | --- | ---
sum_of_products_stride0_contig_outstride0_two | 110±0.5μs | 100±0.5μs | 0.90
sum_of_products_contig_stride0_outstride0_two | 110±0.5μs | 101±0.5μs | 0.90
sum_of_products_contig_stride0_outcontig_two | 23.7±1ms | 20.2±0.3ms | 0.85
sum_of_products_contig_two | 152±0.8μs | 140±2μs | 0.91
sum_of_products_stride0_contig_outcontig_two | 946±10μs | 841±20μs | 0.89
contig_contig_outstride0_two | 522±5μs | 353±7μs | 0.68
sum_of_products_contig_outstride0_one | 871±10μs | 502±4μs | 0.58

Rewriting the code with USIMD brings the following advantages:

  • Cross-platform capability: you only need to learn the USIMD APIs, not the instruction set differences between platforms.
  • Maintainability: the USIMD APIs are covered by cross-platform tests and provide a reliable, well-tuned underlying implementation, so only one set of loop code has to be written and maintained.
  • Higher level of abstraction: scalability is greatly improved.

Principle: USIMD Implementation Mechanism

The USIMD framework consists of three parts: compile time, runtime, and the API layer. Let us start with compile time.


First, check whether optimization is disabled with --disable-optimization. If it is, follow the normal compile-and-link process. Otherwise, scan the source tree for *.dispatch.c files, initialize the underlying API library, and determine the architecture (x86, ARM, or PowerPC) from the build environment. If a *.dispatch.c file has already been dispatched for the requested features (cached in the temp file under the build directory), the dispatched files are simply added to the compilation tree for subsequent compilation and linking. If there is no cache (a first build, or the feature baseline has been modified), the configuration header, which defines the features to be dispatched, is generated from the features listed in --cpu-baseline and --cpu-dispatch and from what the compiler supports. A baseline example is as follows:

/**
 * @targets $maxopt baseline
 * SSE2 (AVX2 FMA3) AVX512F
 * NEON NEON_VFPV4
 */

After compilation with a compiler that supports AVX2 together with FMA3, as well as AVX512F, the following is generated:

#define NPY__CPU_DISPATCH_CALL(CHK, CB, ...) \
 NPY__CPU_DISPATCH_EXPAND_(CB((CHK(AVX512_SKX)), AVX512F, __VA_ARGS__)) \
 NPY__CPU_DISPATCH_EXPAND_(CB((CHK(AVX)&&CHK(F16C)&&CHK(FMA3)&&CHK(AVX2)), AVX2, __VA_ARGS__))

The generated macro is equivalent to the following logic:

if (NPY_CPU_HAVE(AVX512F)) {
   return XXX_avx512f(...);
} else if (NPY_CPU_HAVE(AVX2) && NPY_CPU_HAVE(FMA3)) {
   return XXX_avx2__fma3(...);
} else {
    // fallback to the baseline
    return XXX(...);
}

At this point the compiler has produced optimized objects for each requested feature. However, the CPU of the machine the binary eventually runs on may not support AVX or even SSE. What then? This is where the runtime part comes in.


The following code is used at runtime to check whether the machine supports a feature:

x86 platform:

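/* Read XCR0 via xgetbv: the OS must have enabled saving of the extended (AVX) register state. */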
#if defined(_MSC_VER) || defined (__INTEL_COMPILER)
    return _xgetbv(0);
#elif defined(__GNUC__) || defined(__clang__)
    unsigned int eax, edx;
    __asm(".byte 0x0F, 0x01, 0xd0" : "=a"(eax), "=d"(edx) : "c"(0));
    return eax;
#else
    return 0;
#endif

PowerPC platform:

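/* getauxval(AT_HWCAP / AT_HWCAP2) reports the kernel-visible CPU features, such as VSX. */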
unsigned int hwcap = getauxval(AT_HWCAP);
if ((hwcap & PPC_FEATURE_HAS_VSX) == 0)
     return;
hwcap = getauxval(AT_HWCAP2);

ARM platform:

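/* Use getauxval() when available; otherwise read /proc/self/auxv directly. */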
if (getauxval != 0) {
        hwcap = getauxval(NPY__HWCAP);
    #ifdef __arm__
        hwcap2 = getauxval(NPY__HWCAP2);
    #endif
    } else {
        unsigned long auxv[2];
        int fd = open("/proc/self/auxv", O_RDONLY);
        if (fd >= 0) {
            while (read(fd, &auxv, sizeof(auxv)) == sizeof(auxv)) {
                if (auxv[0] == NPY__HWCAP) {
                    hwcap = auxv[1];
                }
            #ifdef __arm__
                else if (auxv[0] == NPY__HWCAP2) {
                    hwcap2 = auxv[1];
                }
            #endif
                // detect the end
                else if (auxv[0] == 0 && auxv[1] == 0) {
                    break;
                }
            }
            close(fd);
        }
    }
    if (hwcap == 0 && hwcap2 == 0) {
        /*
         * Try parsing /proc/cpuinfo; if that is not possible (e.g. sandboxed),
         * fall back to compiler definitions.
         */
        if(!get_feature_from_proc_cpuinfo(&hwcap, &hwcap2)) {
            return 0;
        }
    }

Through the preceding detection, all CPU features of the running machine are identified. For each supported feature, the corresponding flag is enabled, and the matching optimized object is dispatched at runtime.

Underlying APIs can be classified into the following types:

Instruction Category | Example
--- | ---
Arithmetic | npyv_add_u8
Conversion | npyv_cvt_u8_b8
Memory | npyv_load_f32
Reordering | npyv_combinel_u8
Bitwise operation | npyv_shl_u16
Miscellaneous | npyv_zero_u8
Data type | npyv_u8

x86 SSE serves as the reference for these APIs; when another platform has no corresponding instruction, the operation is emulated with a short sequence of that platform's native instructions.
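
To illustrate the emulation idea (a generic sketch, not NumPy's actual implementation), a lane-wise 32-bit multiply can be built from SSE2 primitives on targets where _mm_mullo_epi32 (SSE4.1) is not available:

#include <emmintrin.h>  /* SSE2 */

/* Emulate a 4 x int32 lane-wise multiply using only SSE2:
 * _mm_mul_epu32 multiplies lanes 0 and 2 into 64-bit products, so run it
 * twice (the second time on the odd lanes shifted down) and interleave
 * the low 32 bits of each product. */
static inline __m128i mullo_epi32_sse2(__m128i a, __m128i b)
{
    __m128i even = _mm_mul_epu32(a, b);                      /* lanes 0, 2 */
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4));      /* lanes 1, 3 */
    return _mm_unpacklo_epi32(
        _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0)),
        _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0)));
}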