Article · blog.kaving.me · 18 min read

Resolving a 25% Performance Regression in LLVM for RISC-V


AI Summary

I recently tackled a performance regression in LLVM for RISC-V targets that caused a significant 24% slowdown compared to GCC. The regression stemmed from a recent LLVM commit that inadvertently disrupted a narrowing optimization. The commit improved the `isKnownExactCastIntToFP` function so that an `fpext` to double of a `sitofp x to float` could be folded into a direct `uitofp x to double` cast. However, removing the `fpext` broke the downstream optimization in `visitFPTrunc`, so the backend emitted `fdiv.d` instead of `fdiv.s`, an instruction with considerably higher cycle latency.
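The shape of the affected code can be sketched with a hypothetical C++ reduction (the function name and values are illustrative, not taken from the original benchmark):

```cpp
// Hypothetical reduction of the affected pattern. `(float)x` is a
// uitofp to float, the implicit promotion to double for the divide is
// an fpext, and the implicit conversion of the result back to float on
// return is an fptrunc. Before the regressing commit, InstCombine's
// visitFPTrunc narrowed this chain to a single-precision divide
// (fdiv float), which the RISC-V backend lowers to fdiv.s rather
// than fdiv.d.
float scaled_reciprocal(unsigned x) {
    return 1.0 / (float)x;  // 1.0 is a double literal, forcing a double divide
}
```

For values that are exactly representable in float, both the narrowed and the unnarrowed forms compute the same result, which is what makes the transformation legal.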

To address this, I extended the `getMinimumFPType` function with range analysis, enabling it to recognize that `fptrunc (uitofp x to double) to float` can be reduced to `uitofp x to float`, restoring the optimization. During my analysis I compared LLVM's output to GCC's on the benchmark in question: LLVM required about 8% more cycles on the SiFive P550 CPU. The key difference was LLVM's use of `fdiv.d`, which I identified as a recent regression by comparing against a prior LLVM build.
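The range analysis rests on when an integer-to-float conversion is exact. A minimal sketch of the magnitude condition (my own simplification, not LLVM's actual check):

```cpp
#include <cstdint>

// Sufficient condition for `uitofp x to float` being exact: the value
// fits in float's 24-bit significand, so no rounding occurs. LLVM's
// analysis is more general (e.g. it can use known bits); this sketches
// only the simple magnitude case.
bool castToFloatIsExact(uint32_t x) {
    return x <= (1u << 24);  // 2^24 itself is exactly representable
}
```

When this holds, converting through double and truncating back to float cannot change the value, so the wider intermediate type is unnecessary.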

The SiFive P550's out-of-order execution further complicated the analysis: the core partially hides the latency of the `fdiv.d` instruction by dispatching independent instructions around it. My investigation initially pointed at a recent change in the RISC-V backend, but none of the commits there stood out, so I turned instead to the LLVM IR produced by the optimization pipeline.

The IR from the older, unaffected build showed the double already narrowed to a float by the end of the middle-end, confirming that the middle-end, not the backend, was responsible for the regression. The cause was the removal of the `fpext` instruction that `visitFPTrunc` relied on to perform the narrowing. I proposed modifying `isKnownExactCastIntToFP` to accept an explicit type parameter, allowing the exactness analysis to be performed against a type other than the cast's own destination.

With guidance from the community, I refined my approach, creating a variant `canBeCastedExactlyIntToFP` to perform the actual analysis. This change allowed the `InstCombiner` to optimize the `sitofp` and `fpext` pattern into a single `uitofp` operation, restoring the performance. After implementing the fix, the benchmark executed in 1.67 billion cycles, a 25% improvement, demonstrating the effectiveness of the patch.
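The shape of that refactor can be illustrated with a toy model (the names mirror the patch, but the signatures and types here are simplified stand-ins, not LLVM's):

```cpp
// Toy model of the split: the new helper takes the FP type to analyse
// against, so callers such as visitFPTrunc can ask whether the cast
// would still be exact at a *narrower* destination type.
struct FPSemantics { unsigned significandBits; };  // simplified stand-in

bool canBeCastedExactlyIntToFP(unsigned intValueBits, FPSemantics fp) {
    // Exact when every significant bit of the integer fits in the
    // destination significand.
    return intValueBits <= fp.significandBits;
}

bool isKnownExactCastIntToFP(unsigned intValueBits, FPSemantics fp) {
    // The original query becomes a thin wrapper over the helper.
    return canBeCastedExactlyIntToFP(intValueBits, fp);
}
```

With the analysis decoupled from the instruction's own type, the `sitofp` + `fpext` pair can be proven equivalent to a single `uitofp` at the wider type, and the later `fptrunc` can be proven safe to fold away.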

Key Concepts

Narrowing Optimization

Narrowing is a compiler optimization that replaces operations on a wider type with operations on a narrower one — for example, performing a division in float instead of double — when the result is provably unchanged, reducing the cost of the computation.
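For instance (an illustrative example, not from the article's benchmark), a divide carried out at double precision and immediately truncated back to float can be performed directly in float:

```cpp
// Wide form: divide at double precision, then truncate (fptrunc).
// On RISC-V this lowers to fdiv.d plus conversions.
float divideWide(float a, float b) {
    return (float)((double)a / (double)b);
}

// Narrowed form: a single-precision divide (fdiv.s on RISC-V),
// which completes in fewer cycles on typical FPUs.
float divideNarrow(float a, float b) {
    return a / b;
}
```

For float operands the two forms agree, because double's 53-bit significand is wide enough that the intermediate rounding of a single division does not change the final float result.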

LLVM Middle-End

The LLVM middle-end is a stage in the LLVM compiler pipeline responsible for optimizing intermediate representation (IR) code. It applies various transformations to improve performance and efficiency before code generation.

Category

Programming

Summarized by Mente
