This site uses cookies. By continuing to browse the site you are agreeing to our use of cookies. Read our privacy policy

Open-Source Software Optimization: Optimizing Compilation Rules to Improve Software Performance

Aug 25, 2021

Based on the analysis of the Go community's optimization solution for floating-point variable comparison on the ARM64 platform, this document describes how to add compilation rules to make the compiler more intelligent and obtain better instruction combinations to improve the software execution speed.

A compiler is used to translate the high-level language code into the low-level language code. To facilitate optimization, the compiler converts the source code into the intermediate representation (IR). Many compilation optimization processes are performed in this form. The following describes how to optimize the performance by adding compilation rules to the compiler.

The Go language compiler is usually used to compile Go language code. It includes multiple compilation processes, such as syntax analysis, AST transformation, static single assignment (SSA) pass, and machine code generation. After the SSA IR is generated, multiple compilation optimization process passes are performed. Each pass converts the SSA function. For example, deadcode elimination detects and deletes the code that will not be executed and useless variables. In all passes, lower converts the IR from architecture-independent (such as x86 and ARM) to architecture-related based on the compiled optimization rules. This is implemented by adding a large number of compilation rules and is the main focus in the following.

1. Comparison of Floating-point Variables

Floating-point number is widely used in application development, for example, indicating a decimal value or bonus point. The floating-point number is often compared with 0. For example, when product information is entered into the database, you can check whether the entered amount is greater than 0 to prevent errors. When a customer buys a product, the system checks whether the account balance is greater than 0. If yes, the system queries the product information and records order information. In this way, invalid or malicious requests can be excluded at the beginning of the transaction.

Many live broadcast websites hold annual activities to display the accumulated value of gifts sent by users during the activities. Top-ranking users will be listed. floating-point numbers are often used to indicate accumulated value. When an activity starts, items with bonus points less than or equal to 0 need to be shielded by using the following functions:

func comp(x float64, arr []int) {
    for i := 0; i < len(arr); i++ {
        if x > 0 {
            arr[i] = 1
        }
    }
}

Use the Go compile tool to view the assembly code of the function. Some useless code is omitted.

go tool compile -S main.go
"".comp STEXT size=80 args=0x20 locals=0x0 leaf
        0x0000 00000 (main.go:3)        TEXT    "".comp(SB), LEAF|NOFRAME|ABIInternal, $0-32
#-------------------------Obtain the data from the stack to the register.------------------------------
..................................
        0x0000 00000 (main.go:4)        MOVD    "".arr+16(FP), R0         // Obtain the array length information to the register R0.
..................................
        0x0004 00004 (main.go:4)        MOVD    "".arr+8(FP), R1           // Obtain the array address to the register R1.
        0x0008 00008 (main.go:4)        FMOVD   "".x(FP), F0                 // Place parameter x in the register F0.
        0x000c 00012 (main.go:4)        MOVD    ZR, R2                          // ZR indicates 0. Here, R2 is cleared.
#---------------------------for loop execution logic----------------------------------
        0x0010 00016 (main.go:4)        JMP     24                                   // In the first loop, the system directly skips to the condition comparison stage without increasing i.
        0x0014 00020 (main.go:4)        ADD     $1, R2, R2                       // i++
        0x0018 00024 (main.go:4)        CMP     R0, R2                            // Compare i < len(arr).
        0x001c 00028 (main.go:4)        BGE     68                                   // When i == len(arr), branch to the end.
#--------if x > 0---------
        0x0020 00032 (main.go:5)        FMOVD   ZR, F1                         // Copy 0 to the floating-point register F1.
        0x0024 00036 (main.go:5)        FCMPD   F1, F0                          // Compare values in the floating-point registers F0 and F1.
#--------arr[i] = 1-------
        0x0028 00040 (main.go:5)        CSET    GT, R3                            // If F0 > F1: R3 = 1
        0x002c 00044 (main.go:5)        CBZ     R3, 60                             // R3 == 1. That is, if x <= 0, branch to 60.
        0x0030 00048 (main.go:6)        MOVD    $1, R3                         // x > 0
         0x0034 00052 (main.go:6)        MOVD    R3, (R1)(R2<<3)         // Set the slice value to 1.
        0x0038 00056 (main.go:6)        JMP     20                                  // Branch to 20, that is, the i++ operation is performed cyclically.
#--------If x <= 0, branch to i++----
        0x003c 00060 (main.go:6)        MOVD    $1, R3                         // x <= 0
        0x0040 00064 (main.go:5)        JMP     20
...........................................................................
...........................................................................

In the preceding code, 0 is placed in the register F1, and then the FCMPD command is used to compare the variable value x in the register F0 with the value 0 in the register F1.

You may wonder why the comparison between a floating-point variable and constant 0 can be performed only after the variable is placed in a register. There are two usages of the floating-point comparison instruction FCMP of ARMV8:

  1. Compare values in two floating-point registers.
  2. Compare a value in a floating-point register with the value 0.

For the FCMP instruction, although the floating-point number must be placed in a register to compare with almost all constants, the comparison with 0 is a special case. You do not need to place 0 in a floating-point register. Instead, you can use FCMP F0, $(0) for comparison. Therefore, the assembly code generated above is not optimal.

2. Optimizing Compilation Rules to Improve the Floating-point Variable Comparison Performance

This problem is not complex but often occurs, and the compiler cannot even optimize it. How do we solve this problem? The following is a compilation rule optimization case of comparing floating-point variables with 0 in the Go language community. It improves the compiler performance by simply adding compilation rules.

image

After the optimization, all floating-point variables will benefit from the comparison operation with 0. To clearly view the SSA IR and the changes before and after the optimization, use the Go compiler tool to view the detailed compilation process. The compiler records the detailed compilation process to the ssa.html file. After opening the file using a browser, you can view the IR modified by each SSA pass. The following figure shows the compilation rule before optimization.

image

There are many SSA pass processes. Notice the following figure, which is the final form after the SSA pass is executed, and v24 and v20 in the figure. Before optimization, constant 0 is placed in register F1, and array elements are placed in register F0. Then, the FCMPD floating-point comparison instruction is called to compare the values in F1 and F0 and update the status register based on the comparison result.

image

Obviously, the purpose of the optimization now is to convert the two instructions FMOVD $(0.0), F1 and FCMPD F1, F0 into one instruction FCMPD $(0.0), F0.

In this optimization, the SSA compilation rule makes the compiler more intelligent by using the S-expression. S-expression finds the matched expression and converts it to another expression that is more efficient or architecture-related, which is ideal for the operation of a compiler, as shown in the following figure:

image

The preceding optimization makes the compiler more intelligent because it has learned the following conversion rules:

// Optimize the comparison between the floating-point number and 0 to the expression FCMP $(0.0), Fn.
(FCMPS x (FMOVSconst [0])) -> (FCMPS0 x)                              // Compare float32 x with constant 0 -> FCMPS0 x.
(FCMPS (FMOVSconst [0]) x) -> (InvertFlags (FCMPS0 x))         // Compare constant 0 with float32 x -> Invert the result of (FCMPS0 x).
(FCMPD x (FMOVDconst [0])) -> (FCMPD0 x)                            // Compare float64 x with constant 0 -> FCMPD0 x.
(FCMPD (FMOVDconst [0]) x) -> (InvertFlags (FCMPD0 x))       // Compare constant 0 with float64 x -> Invert the result of (FCMPD0 x).
-------------------Rule for inverting comparison result-----------------------------------
(LessThanF (InvertFlags x)) -> (GreaterThanF x)                        // Invert (a < b) -> a > b.
(LessEqualF (InvertFlags x)) -> (GreaterEqualF x)                      // Invert (a <= b) -> a >= b.
(GreaterThanF (InvertFlags x)) -> (LessThanF x)                        // Invert (a > b) -> a < b.
(GreaterEqualF (InvertFlags x)) -> (LessEqualF x)                      // Invert (a >= b) -> a <= b.

Note: Common expression type fields such as type <type>, variable value [auxint], non-numeric variable value {aux}, and condition expression [&& extra conditions]: are omitted in the preceding example. Explore more if you are interested.

You may have come to realize that there are two simplified operation code FCMPS0 and FCMPD0 in the rules. Why are they better? They indicate the comparison between float32 and 0 and the comparison between float64 and 0. The following figure shows the specific meanings.

image

Now, you have learned each part of the compiler SSA rule optimization. Let us summarize the above optimization rules of SSA compiler into the following architecture. In the Go compiler, compilation rule optimization is crucial to SSA pass, which improves some common expressions that are irrelevant to the architecture and make them more efficient. For example, for the redundant condition judgment negation expression, remove the invert operation and directly perform invert on the judgment condition. For example, convert invert(<=) to >, and convert the architecture-independent expression to the expression related to the architecture (ARM64 or x86).

image

Add the preceding compilation rules, update the compiler to the latest version, and generate the SSA pass execution result diagram. You can see that the two instructions are combined into one.

image

3. Code Description

The code optimization is described as follows:

  1. Define operation code. Two pieces of operation code are added to compare the floating-point number with 0.
//--------------Define the input parameter mask of a register. fp indicates all floating-point registers.----------------
fp1flags  = regInfo{inputs: []regMask{fp}}
..............................
//--------------------------Add two pieces of operation code to compare the floating-point number with 0.----------------------------
// Define the operation FCMPS0, compare the parameter (float32) in the floating-point register with 0, and use the assembly instruction FCMPS.
{name: "FCMPS0", argLength: 1, reg: fp1flags, asm: "FCMPS", typ: "Flags"},
// Define the operation FCMPD0, compare the parameter (float64) in the floating-point register with 0, and use the assembly instruction FCMPD.
{name: "FCMPD0", argLength: 1, reg: fp1flags, asm: "FCMPD", typ: "Flags"},
  1. Automatically generate opGen.go based on the operation code.
{
    name:   "FCMPS0",                       // Operation name
    argLen: 1,                                     // Number of parameters
    asm:    arm64.AFCMPS,                 // Machine instruction, which is the FCMPS of the ARM64 platform
    reg: regInfo{
        inputs: []inputInfo{                  // Supported input parameter register
            {0, 9223372034707292160},  // F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F14 F15 F16 F17 F18 F19 F20 F21 F22 F23 F24 F25 F26 F27 F28 F29 F30 F31
        },
    },
},
{
    name:   "FCMPD0",                       // Operation name
    argLen: 1,                                     // Number of parameters
    asm:    arm64.AFCMPD,                 // Machine instruction, which is the FCMPD of the ARM64 platform
    reg: regInfo{
        inputs: []inputInfo{                  // Supported input parameter register
            {0, 9223372034707292160},  // F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F14 F15 F16 F17 F18 F19 F20 F21 F22 F23 F24 F25 F26 F27 F28 F29 F30 F31
        },
    },
},
  1. Convert the operation code FCMPS0 and FCMPD0 to the prog format. Each prog corresponds to a specific instruction, which is used during linking.
   case ssa.OpARM64FCMPS0,                          // FCMPS0 -> FCMPS $(0.0), F0
        ssa.OpARM64FCMPD0:                            // FCMPD0 -> FCMPD $(0.0), F0
        p := s.Prog(v.Op.Asm())                          // FCMPS | FCMPD
        p.From.Type = obj.TYPE_FCONST             // $(0.0) of constant type
        p.From.Val = math.Float64frombits(0)    // $(0.0) to be compared
        p.Reg = v.Args[0].Reg()                           // The second source operand, that is, the floating-point register F0 for comparison
  1. Add new SSA compilation rules to ARM64.rules.
// Optimize the comparison between the floating-point number and 0 to the expression FCMP $(0.0), Fn.
(FCMPS x (FMOVSconst [0])) -> (FCMPS0 x)                               // Compare float32 x with constant 0 -> FCMPS0 x.
(FCMPS (FMOVSconst [0]) x) -> (InvertFlags (FCMPS0 x))          // Comparison between constant 0 and float32 x -> Invert the result of (FCMPS0 x).
(FCMPD x (FMOVDconst [0])) -> (FCMPD0 x)                             // Comparison between float64 x and constant 0 -> FCMPD0 x.
(FCMPD (FMOVDconst [0]) x) -> (InvertFlags (FCMPD0 x))        // Compare constant 0 with float64 x -> Invert the result of (FCMPD0 x).
-------------------Rule for inverting comparison result-----------------------------------
(LessThanF (InvertFlags x)) -> (GreaterThanF x)                        // Invert (a < b) -> a > b.
(LessEqualF (InvertFlags x)) -> (GreaterEqualF x)                      // Invert (a <= b) -> a >= b.
(GreaterThanF (InvertFlags x)) -> (LessThanF x)                        // Invert (a > b) -> a < b.
(GreaterEqualF (InvertFlags x)) -> (LessEqualF x)                      // Invert (a >= b) -> a <= b.
  1. Automatically generate Go conversion code based on ARM64.rules:
//--------------The following rules are matched one by one in the lower pass. After the matching, the conversion is performed.----------------
case OpARM64FCMPD:
    return rewriteValueARM64_OpARM64FCMPD_0(v)
case OpARM64FCMPS:
    return rewriteValueARM64_OpARM64FCMPS_0(v)
case OpARM64GreaterEqualF:
    return rewriteValueARM64_OpARM64GreaterEqualF_0(v)
case OpARM64GreaterThanF:
    return rewriteValueARM64_OpARM64GreaterThanF_0(v)
case OpARM64LessEqualF:
    return rewriteValueARM64_OpARM64LessEqualF_0(v)
case OpARM64LessThanF:
    return rewriteValueARM64_OpARM64LessThanF_0(v)

//------------------------If x(float64) is compared with 0, convert to FCMPD0 x.------------------------
func rewriteValueARM64_OpARM64FCMPD_0(v *Value) bool {
    b := v.Block
    _ = b
    // match: (FCMPD x (FMOVDconst [0]))
    // cond:
    // result: (FCMPD0 x)
    for {
        _ = v.Args[1]
        x := v.Args[0]
        v_1 := v.Args[1]
        if v_1.Op != OpARM64FMOVDconst {
            break
        }
        if v_1.AuxInt != 0 {                                                                   // If it is not compared with 0, exit the process.
            break
        }
        v.reset(OpARM64FCMPD0)                                                      // Change the OpARM64FCMPD instruction to OpARM64FCMPD0.
        v.AddArg(x)
        return true
    }
    // match: (FCMPD (FMOVDconst [0]) x)
    // cond:
    // result: (InvertFlags (FCMPD0 x))
    for {
        _ = v.Args[1]
        v_0 := v.Args[0]
        if v_0.Op != OpARM64FMOVDconst {
            break
        }
         if v_0.AuxInt != 0 {                                                                    // If it is not compared with 0, exit the process.
            break
        }
        x := v.Args[1]
        v.reset(OpARM64InvertFlags)                                                   // Change the OpARM64FCMPD instruction to OpARM64InvertFlags.
        v0 := b.NewValue0(v.Pos, OpARM64FCMPD0, types.TypeFlags) // Add a value indicating the OpARM64FCMPD0 instruction. SSA represents a value.
        v0.AddArg(x)
        v.AddArg(v0)
        return true
    }
    return false
}

//------------------------If x(float32) is compared with 0, convert to FCMPS0 x.------------------------
func rewriteValueARM64_OpARM64FCMPS_0(v *Value) bool {
    b := v.Block
    _ = b
    // match: (FCMPS x (FMOVSconst [0]))
    // cond:
    // result: (FCMPS0 x)
    for {
        _ = v.Args[1]
        x := v.Args[0]
        v_1 := v.Args[1]
        if v_1.Op != OpARM64FMOVSconst {
            break
        }
        if v_1.AuxInt != 0 {                              // If the operand is not 0, exit the process.
            break
        }
        v.reset(OpARM64FCMPS0)                  // Change the OpARM64FCMPS instruction to OpARM64FCMPS0.
        v.AddArg(x)
        return true
    }
    // match: (FCMPS (FMOVSconst [0]) x)
    // cond:
    // result: (InvertFlags (FCMPS0 x))
    for {
        _ = v.Args[1]
        v_0 := v.Args[0]
        if v_0.Op != OpARM64FMOVSconst {
            break
        }
        if v_0.AuxInt != 0 {
            break
        }
        x := v.Args[1]
        v.reset(OpARM64InvertFlags)
        v0 := b.NewValue0(v.Pos, OpARM64FCMPS0, types.TypeFlags)
        v0.AddArg(x)
        v.AddArg(v0)
        return true
    }
    return false
}

//--------------Comparison between floating-point numbers with the invert flag: Convert invert(x >= 0) to x <= 0.----------------
func rewriteValueARM64_OpARM64GreaterEqualF_0(v *Value) bool {
    // match: (GreaterEqualF (InvertFlags x))
    // cond:
    // result: (LessEqualF x)
    for {
        v_0 := v.Args[0]
        if v_0.Op != OpARM64InvertFlags { // If the inversion is not required, the conversion rule is not applicable. Exit and continue the subsequent rules.
            break
        }
        x := v_0.Args[0]
        v.reset(OpARM64LessEqualF)         // Change the OpARM64GreaterEqualF instruction to OpARM64LessEqualF.
        v.AddArg(x)
        return true
    }
    return false
}

//--------------Comparison between floating-point numbers with the invert flag: Convert invert(x > 0) to x < 0.-----------------
func rewriteValueARM64_OpARM64GreaterThanF_0(v *Value) bool {
    // match: (GreaterThanF (InvertFlags x))
    // cond:
    // result: (LessThanF x)
    for {
        v_0 := v.Args[0]
        if v_0.Op != OpARM64InvertFlags { // If the inversion is not required, the conversion rule is not applicable. Exit and continue the subsequent rules.
            break
        }
        x := v_0.Args[0]
        v.reset(OpARM64LessThanF)          // Change the OpARM64GreaterThanF instruction to OpARM64LessThanF.
        v.AddArg(x)
        return true
    }
    return false
}

//-------------Comparison between floating-point numbers with the invert flag: Convert invert(x <= 0) to x >= 0.----------------
func rewriteValueARM64_OpARM64LessEqualF_0(v *Value) bool {
    // match: (LessEqualF (InvertFlags x))
    // cond:
    // result: (GreaterEqualF x)
    for {
        v_0 := v.Args[0]
        if v_0.Op != OpARM64InvertFlags { // If the inversion is not required, the conversion rule is not applicable. Exit and continue the subsequent rules.
            break
        }
        x := v_0.Args[0]
        v.reset(OpARM64GreaterEqualF)    // Change the OpARM64LessEqualF instruction to OpARM64GreaterEqualF.
        v.AddArg(x)
        return true
    }
    return false
}

//-------------Comparison between floating-point numbers with the invert flag: Convert invert(x < 0) to x > 0.----------------
func rewriteValueARM64_OpARM64LessThanF_0(v *Value) bool {
    // match: (LessThanF (InvertFlags x))
    // cond:
    // result: (GreaterThanF x)
    for {
        v_0 := v.Args[0]
        if v_0.Op != OpARM64InvertFlags { // If the inversion is not required, the conversion rule is not applicable. Exit and continue the subsequent rules.
            break
        }
        x := v_0.Args[0]
        v.reset(OpARM64GreaterThanF)     // Change the OpARM64LessThanF instruction to OpARM64GreaterThanF.
        v.AddArg(x)
        return true
    }
    return false
}

4. Hands-on Lab

Follow the instructions to learn how compilation rule optimization makes the compiler become more intelligent.

  • Environment preparation
  1. Hardware configuration: Kunpeng (ARM64) Linux ECS- general computing-plus KC1 kc1.2xlarge.2 (8-core | 16 GB)
  2. Go language release 1.12.1 - 1.12.17. For details about how to prepare the development environment, see Configuring Go in the ARM64 Development Environment.
  3. Go language GitHub source code repository. Git installation and use is used for version control.
  4. Test code
  5. Compilation rule code generation tool
  • Procedure
# Prepare a test directory, for example, /usr/local/src/.
cd /usr/local/src
# Pull the test case code.
git clone https://github.com/OptimizeLab/sample
# Access compile/ssa/opt_float_cmp_0_by_SSA_rule/src.
cd /usr/local/src/sample/compile/ssa/opt_float_cmp_0_by_SSA_rule/src
# The Go language release 1.12 does not contain the optimized compilation rule. Therefore, use the Go compiler provided by the release.
# Obtain and view ssa.html before optimization.
GOSSAFUNC=comp go tool compile main.go
# Run the Go benchmark instruction to test the performance and record the test result in the before-ssa-bench.txt file.
go test -bench BenchmarkFloatCompare -count=5 > before-ssa-bench.txt
# Use the optimized Go compiler to obtain ssa.html.
# Find a directory for storing the Go source code repository, for example, /usr/local/src/exp.
mkdir /usr/local/src/exp
cd /usr/local/src/exp
# Use the git tool to pull the code repository of golang on the GitHub code hosting platform.
git clone https://github.com/golang/go
# The latest source code contains the optimization. Therefore, you can directly compile the code to obtain the latest Go compiler.
# Access the source code directory.
cd /usr/local/src/exp/go/src
# Compile the Go source code to generate the Go development environment.
bash ./make.bash
# Switch to the test code directory.
cd /usr/local/src/sample/compile/ssa/opt_float_cmp_0_by_SSA_rule/src
# Specify the GOROOT directory and set GOSSAFUNC to the function to be displayed (for example, comp) to generate the optimized ssa.html file.
GOROOT=/usr/local/src/exp/go; GOSSAFUNC=comp go tool compile main.go
# Run the Go benchmark instruction to test the performance and record the test result in the after-ssa-bench.txt file.
GOROOT=/usr/local/src/exp/go; go test -bench BenchmarkFloatCompare -count=5 > after-ssa-bench.txt
# benchstat comparison result
benchstat before-ssa-bench.txt after-ssa-bench.txt
# When developing a new compilation rule, you only need to compile the .rules and Ops.go files and use the compilation rule code generation tool to generate the rule conversion execution code.
# The files that are automatically generated include opGen.go and rewriteARM64.go.
cd /usr/local/src/exp/go/src/cmd/compile/internal/ssa/gen
go run *.go

5. Result

In the preceding case, the performance of comparing all floating-point variables with 0 on the ARM64 platform is finally optimized. Although the improvement is small, massive code that meets the preceding rules is optimized without any modification by the user. Therefore, the optimization is valuable.