WebAssembly is a technology famous of boosting performance of web applications, but what might be the best practice to do achieve maximum performance and what is the cost is what we are going to discuss in this post.
To begin with, let’s talk about how developers develop WebAssembly applications based on the result of The State of WebAssembly 2022. The top 3 languages used to develop WebAssembly are Rust, JavaScript and C/C++. (JavaScript come in second place due to QuickJS, a tiny JS VM, is ported to WebAssembly and boosts JavaScript runtime performance) As Rust and C/C++ are top languages, it is common that developers develop Rust or C/C++ programs and compile to WebAssembly.
Take C/C++ as an example, usually you might use GCC to compile like this:
gcc -o main main.c
When you want to compile to WebAssembly, you might use Emscripten:
emcc -o main.js main.c
# or
emcc -o main.wasm main.c
Then you can use NodeJS or runtime like wasmtime or wasmer to run the program.
But is that the end of the story? Maybe not. Let’s use a matrix multiplication as example to benchmark:
JavaScript Version (Source Code):
// in_b is transposed to speed up multiplication.
const mulMats = (out, in_a, in_b, n) => {
for (let i = 0; i < n; i++) {
for (let j = 0; j < n; j++) {
out.arr[i*n+j] = 0;
for (let k = 0; k < n; k++) {
out.arr[i*n+j] += in_a.arr[i*n+k] * in_b.arr[j*n+k];
}
}
}
};
C Version (Source Code):
// in_b is transposed to speed up multiplication.
void multiply_mats(int* out, int* in_a, int* in_b, int n) {
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
out[i*n+j] = 0;
for (int k = 0; k < n; k++) {
out[i*n+j] += in_a[i*n+k] * in_b[j*n+k];
}
}
}
}
The performance of JavaScript version, C version and WebAssembly version:
Version | Time to Complete | Remark |
---|---|---|
JavaScript | 5.768s | NodeJS v16.16.0 |
C | 7.697s | GCC 12.2.0 |
WebAssembly | 6.865s | Emscripten 3.1.18 |
It might be unexpected to see that JavaScript version is the fastest one, it is simply because this is NOT a fair comparison as we didn’t use flags to optimize performance.
One of the most significant flag we can use here is -O3
, let’s add this flag
and below is the result:
Version | Time to Complete | Remark |
---|---|---|
JavaScript | 5.768s | NodeJS v16.16.0 |
C | 0.401s | GCC 12.2.0 |
WebAssembly | 2.012s | Emscripten 3.1.18 |
With a simple flag, now WebAssembly version is around 65% faster!
But how might we improve the performance even more? Let’s discuss what options we have.
Methods to Improve Performance
There are mainly three approaches to improve performance:
- More optimization flags
- Use multi-thread
- Use SIMD intrinsics
More optimization flags
More optimization flags like -flto
might improve performance, but sometimes it
might introduce side effects, so it is essential to do more tests when using
more flags.
Use multi-thread
WebAssembly supports multi-thread, so you can use libraries like pthread to achieve multi-threading. It works but with a major drawback: SharedArrayBuffer.
SharedArrayBuffer is an essential technology behind multi-threading in WebAssembly. It enables threads (which is web worker in browser) to communicate with each other. But it also consumes more memory and suffers from browser compatibility issue. (Mainly mobile browsers)
SharedArrayBuffer compatibility: https://caniuse.com/sharedarraybuffer
Use SIMD intrinsics
SIMD (Single Instruction, Multiple Data) is a type of parallel processing
supported by WebAssembly. In Emscripten, you can add flag -msimd128
to
enable SIMD and improve performance for free:
Version | Time to Complete |
---|---|
JavaScript | 5.768s |
C | 0.401s |
WebAssembly | 2.012s |
WebAssembly (w/ -msimd128) | 0.278s |
-O3
flag comes with auto-vectorization, so it is better to use-msimd128
with-O3
Wow, now our performance is so good that it is even faster than C version! But it is not the end of the story as we haven’t used our secret weapon: SIMD intrinsics.
SIMD intrinsics are low-level functions to enable developers to write assembly code in a more user-friendly way. It is supported not only in Emscripten, but also other major languages like Rust and AssemblyScript. With SIMD intrinsics, we can rewrite our code to improve performance further more:
// in_b is transposed to speed up multiplication.
void multiply_mats(int* out, int* in_a, int* in_b, int n) {
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
out[i*n+j] = 0;
int sum_arr[] = {0, 0, 0, 0};
v128_t sum = wasm_v128_load(sum_arr);
for (int k = 0; k < n; k+=4) {
v128_t a = wasm_v128_load(&in_a[i*n+k]);
v128_t b = wasm_v128_load(&in_b[j*n+k]);
v128_t prod = wasm_i32x4_mul(a, b);
sum = wasm_i32x4_add(sum, prod);
}
v128_t sum_duo = wasm_i32x4_add(sum, wasm_i32x4_shuffle(sum, sum, 2, 3, 0, 0));
v128_t sum_one = wasm_i32x4_add(sum_duo, wasm_i32x4_shuffle(sum_duo, sum_duo, 1, 0, 0, 0));
out[i*n+j] += wasm_i32x4_extract_lane(sum_one, 0);
}
}
}
Version | Time to Complete |
---|---|
JavaScript | 5.768s |
C | 0.401s |
WebAssembly | 2.012s |
WebAssembly (w/ -msimd128) | 0.278s |
WebAssembly w/ SIMD intrinsics (w/ -msimd128) | 0.245s |
As the result, we are 95.8% faster than the original JavaScript version!
Conclusion
In this post, we visited a few methods to improve WebAssembly performance, and
finally we achieve a much better performance by using SIMD intrinsics. Although
it looks perfect, but there is actually a pretty steep learning curve to learn
SIMD intrinsics and master it. And you might only get less than 1% performance
improvement if your program is already auto-vectorized by the compiler. So the
best bet is to use -O3 -msimd128
flags first and rewrite the most computing
intensive parts in your code using SIMD intrinsics to improve the performance
little by little.
Hope you enjoy this post. :) Feel free to leave comments and try to run the code using the source code in this repository: https://github.com/jeromewu/wasm-perf