Modern computer architectures provide SIMD (single instruction, multiple data) instruction set extensions that operate on multiple data elements at once. Although SIMD is most commonly used in numerical applications such as video codecs, graphics rendering, and scientific computing, SIMD techniques also speed up basic data processing tasks such as those implemented by libc functions. While other libc implementations already provide SIMD-enhanced variants of standard libc functions, the FreeBSD libc largely does not. The objective of this project by Robert Clausecker is to provide such SIMD-enhanced versions of relevant libc functions and thus improve the performance of software linked against it. As these libc functions are used by most software available for FreeBSD, the enhancements are expected to benefit a wide range of programs. The primary focus of the project is amd64, with the aim of producing SIMD-optimised implementations based on the architecture levels defined by the x86_64 psABI.
Multiple implementations of a routine are planned where the routine benefits from additional instructions available at higher architecture levels. This usually means one routine for baseline or x86-64-v2 and one routine each for x86-64-v3 and x86-64-v4. A benchmark suite is planned to measure the impact of these routines on libc performance. In future work, these routines could be adapted for i386 or ported to other architectures, including arm64 (ASIMD, SVE) and ppc64/ppc64le, if there is sufficient interest.
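To make the notion of architecture levels concrete, the following is a minimal sketch of how a program could work out which level the running CPU supports and which routine variant that implies. It is an illustration only, not the project's mechanism: it assumes a GCC- or Clang-compatible compiler and uses the `__builtin_cpu_supports` builtin, it checks only a subset of the features each level requires, and the names `x86_level`, `detect_x86_level`, and `routine_level` are hypothetical.

```c
#include <assert.h>

/* Hypothetical enum; the values mirror the psABI level names. */
enum x86_level {
    X86_64_BASELINE = 1,
    X86_64_V2,
    X86_64_V3,
    X86_64_V4
};

/*
 * Return the highest level whose checked features are all present.
 * Only a subset of each level's required features is tested here.
 * Builds for other architectures or compilers report the baseline.
 */
enum x86_level
detect_x86_level(void)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();

    /* x86-64-v2 adds, among others, SSE3 through SSE4.2 and POPCNT. */
    if (!(__builtin_cpu_supports("sse3") &&
          __builtin_cpu_supports("ssse3") &&
          __builtin_cpu_supports("sse4.1") &&
          __builtin_cpu_supports("sse4.2") &&
          __builtin_cpu_supports("popcnt")))
        return X86_64_BASELINE;

    /* x86-64-v3 adds, among others, AVX, AVX2, BMI1, BMI2, and FMA. */
    if (!(__builtin_cpu_supports("avx") &&
          __builtin_cpu_supports("avx2") &&
          __builtin_cpu_supports("bmi") &&
          __builtin_cpu_supports("bmi2") &&
          __builtin_cpu_supports("fma")))
        return X86_64_V2;

    /* x86-64-v4 adds the AVX-512 F, BW, CD, DQ, and VL subsets. */
    if (!(__builtin_cpu_supports("avx512f") &&
          __builtin_cpu_supports("avx512bw") &&
          __builtin_cpu_supports("avx512cd") &&
          __builtin_cpu_supports("avx512dq") &&
          __builtin_cpu_supports("avx512vl")))
        return X86_64_V3;

    return X86_64_V4;
#else
    return X86_64_BASELINE;
#endif
}

/*
 * Map a CPU level to the level of the routine that would be used,
 * assuming one shared routine for baseline and v2 and one each for
 * v3 and v4.
 */
enum x86_level
routine_level(enum x86_level cpu)
{
    assert(cpu >= X86_64_BASELINE && cpu <= X86_64_V4);
    if (cpu >= X86_64_V3)
        return cpu;
    return X86_64_BASELINE;
}
```

Calling `routine_level(detect_x86_level())` then yields the variant a selection mechanism of this shape would pick on the current machine. A real implementation would likely read CPUID directly and might also weigh the hardware constraints that can make the higher levels unattractive.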
Technical Details
The optimised routines are planned to be implemented in assembly to ensure toolchain independence. For dynamically linked executables, the ifunc mechanism is planned to be used to select the best implementation of each routine at runtime. If possible, an environment variable will be queried to permit the user to select a different architecture level or to disable SIMD enhancements altogether. For statically linked executables, or when the function is called directly (e.g. from inside the libc through a hidden alias), the plan is to provide dispatch trampolines. On the first call to the trampoline, the call resolves to a dispatch function that determines which implementation to use. The dispatch function writes the target of the dispatch into the function pointer used by the trampoline and then tail calls the selected routine. On subsequent calls, the selected routine is called directly. Both mechanisms will be implemented in a thread-safe and async-signal-safe manner.
The best implementation is usually the one using the highest architecture level supported by the CPU. However, hardware constraints such as thermal licensing and AVX-SSE transition penalties may render architecture levels v3 and v4 unattractive on some processors.
Implementations may be written such that they read past the end of a string, as long as the read does not cross a page boundary. Such an overrun is harmless unless a segment limit is set, but it may confuse analysis tools such as valgrind. Reading past the end in this way is needed above all for good performance on NUL-terminated strings, whose length is not known in advance.
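The dispatch trampoline scheme described above can be sketched in C as follows. This is a minimal illustration under stated assumptions rather than the project's implementation, which is planned in assembly: all `demo_` names are hypothetical, the "SIMD" variant is a scalar stand-in, `demo_have_simd` always reports no SIMD support, and `demo_dispatch_runs` exists only so that the behaviour can be observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef size_t (*demo_strlen_fn)(const char *);

/* Counts dispatcher invocations; present only for observation. */
atomic_int demo_dispatch_runs;

/* Plain scalar implementation. */
static size_t
demo_strlen_scalar(const char *s)
{
    const char *p = s;

    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

/* Stand-in for a SIMD variant; a real one would be written in assembly. */
static size_t
demo_strlen_simd(const char *s)
{
    return demo_strlen_scalar(s);
}

/* Hypothetical feature check; this sketch always reports "no SIMD". */
static int
demo_have_simd(void)
{
    return 0;
}

static size_t demo_strlen_dispatch(const char *s);

/* The function pointer used by the trampoline starts at the dispatcher. */
static _Atomic(demo_strlen_fn) demo_strlen_target = demo_strlen_dispatch;

/*
 * First-call path: choose an implementation, store it in the function
 * pointer, then call it. Racing threads all store the same value, and
 * no locks are taken, so a relaxed atomic store is enough here.
 */
static size_t
demo_strlen_dispatch(const char *s)
{
    demo_strlen_fn impl;

    impl = demo_have_simd() ? demo_strlen_simd : demo_strlen_scalar;
    assert(impl != NULL);
    atomic_fetch_add_explicit(&demo_dispatch_runs, 1, memory_order_relaxed);
    atomic_store_explicit(&demo_strlen_target, impl, memory_order_relaxed);
    return impl(s);
}

/* The trampoline: load the current target and call through it. */
size_t
demo_strlen(const char *s)
{
    demo_strlen_fn f;

    f = atomic_load_explicit(&demo_strlen_target, memory_order_relaxed);
    return f(s);
}
```

In this sketch, the first call to `demo_strlen` goes through the dispatcher and every later call reaches the selected routine directly, so `demo_dispatch_runs` stays at 1 when called from a single thread. An assembly trampoline would typically be a single indirect jump through the function pointer, which also makes the dispatcher's final call a true tail call.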
Documentation
The presence of SIMD-enhanced functions will be documented in a new manual page, simd(7). This page will explain to the user how the libc chooses which implementation to use and how to configure this behaviour. Other manual pages such as environ(7), string(3), and bstring(3) will be enhanced with cross-references and additional information as appropriate. Internal documentation will be produced, explaining the dispatch and function selection mechanisms. As it is not planned to make these mechanisms available to user code, no end-user documentation for them will be produced. Additional documentation on the benchmark and test setup may be produced as needed. A final report describing the techniques used and the performance improvements achieved may be produced.