Resolving a 25% Performance Regression in LLVM for RISC-V
AI Summary
I recently tackled a performance regression in LLVM for RISC-V targets that caused a 24% slowdown relative to GCC. The regression stemmed from a recent LLVM commit that inadvertently disrupted a narrowing optimization. The commit improved the `isKnownExactCastIntToFP` function so that `fpext (sitofp x to float) to double` is folded into a direct `uitofp x to double` cast. However, this change broke a downstream optimization in `visitFPTrunc`, causing `fdiv.d` to be emitted instead of `fdiv.s`, an instruction with higher cycle latency.
To address this, I extended the `getMinimumFPType` function with range analysis, enabling it to recognize that `fptrunc (uitofp x to double) to float` can be reduced to `uitofp x to float`, thus restoring the optimization. During my analysis, I compared LLVM's performance to GCC's on a specific benchmark, noting that LLVM required about 8% more cycles on the SiFive P550 CPU. The key difference was LLVM's use of `fdiv.d`, which I identified as a recent regression by comparing against an older LLVM build.
The SiFive P550's out-of-order execution further complicated the diagnosis, as the core partially hides the latency of the `fdiv.d` instruction by dispatching independent instructions around it. My investigation first led me to suspect a recent change in the RISC-V backend, but none of the commits stood out. Instead, I turned to the LLVM IR produced by the optimization pipeline.
Inspecting the IR showed that the narrowing from double to float no longer happened by the end of the middle-end, confirming that the middle-end was responsible for the regression. The regression was due to the removal of the `fpext` instruction that `visitFPTrunc` relied on for its optimization. I proposed modifying `isKnownExactCastIntToFP` to accept an explicit type parameter, allowing the analysis to be performed against a caller-supplied type.
With guidance from the community, I refined my approach, introducing a variant, `canBeCastedExactlyIntToFP`, to perform the actual analysis. This change allows the `InstCombiner` to optimize the `sitofp`-plus-`fpext` pattern into a single `uitofp` operation while preserving the downstream narrowing, restoring the lost performance. After implementing the fix, the benchmark executed in 1.67 billion cycles, a 25% improvement, demonstrating the effectiveness of the patch.
Key Concepts
Narrowing optimization is a compiler technique that replaces a wider data type with a narrower one, such as performing a computation in float instead of double, when doing so provably does not change the result; the narrower operations are typically cheaper to execute.
The LLVM middle-end is a stage in the LLVM compiler pipeline responsible for optimizing intermediate representation (IR) code. It applies various transformations to improve performance and efficiency before code generation.
Category
Programming
Summarized by Mente