  1. WiredTiger
  2. WT-12011

Speed up crc32c on arm64

    • Storage Engines
    • 8
    • 2024-02-06 tapioooooooooooooca

      My testing shows that we can improve the performance of crc32c on neoverse-n1/graviton2 by making the main loop process 16 bytes per iteration rather than 8. This enables using the ldp (load pair) instruction rather than ldr (load register). It also reduces the per-byte loop overhead. This gets us from ~16GB/s to 18.3GB/s, which seems to be about the limit of how fast a single core can issue loads. I saw no additional advantage when going to 32 bytes at a time.

      Additional improvements:

      • Stop trying to align before entering the main loop. It doesn't seem to improve perf vs. just letting the main loop do unaligned loads.
      • Use a cascade of if (remaining & 4)/if (remaining & 2)/if (remaining & 1) blocks to handle the tail in at most 3 crc instructions rather than looping byte-at-a-time. (At most 4 once a case for remaining & 8 is added after expanding the main loop to 16 bytes.)
      • Use the __crc32cX(...) (where X is one of b-yte, h-alfword, w-ord, or d-oubleword) intrinsics from #include <arm_acle.h> rather than inline asm.
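
The bound on tail instructions in the second bullet can be illustrated with a small sketch. tailWidths is a hypothetical helper, not part of the code in this ticket; it only lists which operand widths the cascade would consume for a given leftover count, assuming the 16-byte main loop has already reduced the count below 16.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration helper: returns the operand widths (in bytes)
// that the tail cascade would feed to the crc32c instructions.
// Each set bit in the leftover count maps to exactly one instruction.
inline std::vector<int> tailWidths(std::size_t remaining) {
    assert(remaining < 16);  // assumption: the main loop consumed all 16-byte chunks
    std::vector<int> widths;
    for (int w : {8, 4, 2, 1}) {
        if (remaining & static_cast<std::size_t>(w))
            widths.push_back(w);
    }
    return widths;
}
```

A leftover of 15 bytes is the worst case and takes four instructions (8 + 4 + 2 + 1), where a byte-at-a-time loop would take fifteen.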

      Here's the code I found worked well. It's C++ designed to update a member variable _val with the incremental hash, but it should be easy to translate to C with whichever API you want.

          void addBytes(const void* start, size_t bytes) {
              auto p = static_cast<const char*>(start);
      
              // Unfortunately our compiler tries to update _val as it goes rather than keeping it in
              // a register.
              auto reg = _val;
      
              // Do chunks of 16 bytes at a time. (faster than doing 8)
              while (bytes >= 16) {
                  auto a = loadAt<uint64_t>(p);
                  auto b = loadAt<uint64_t>(p + 8);
                  reg = __crc32cd(reg, a);
                  reg = __crc32cd(reg, b);
                  p += 16;
                  bytes -= 16;
              }
      
              // Now pick off the tails.
              if (bytes & 8) {
                  reg = __crc32cd(reg, loadAt<uint64_t>(p));
                  p += 8;
              }
              if (bytes & 4) {
                  reg = __crc32cw(reg, loadAt<uint32_t>(p));
                  p += 4;
              }
              if (bytes & 2) {
                  reg = __crc32ch(reg, loadAt<uint16_t>(p));
                  p += 2;
              }
              if (bytes & 1) {
                  reg = __crc32cb(reg, loadAt<uint8_t>(p));
              }
      
              // Finally, put the register back.
              _val = reg;
          }
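
For anyone who wants to try the loop outside the original class, here is a minimal self-contained sketch of it as a free function. Several things in it are assumptions rather than anything confirmed by this ticket: loadAt<T> is not shown above, so it is assumed to be an unaligned load done with memcpy; the names crcSoft, crcStep, crc32cUpdate, crc32c and checkKnownValue are made up for illustration; the all-ones initial value and final XOR are the usual CRC-32C framing, not necessarily how _val is managed; and the bitwise fallback exists only so the sketch builds where __ARM_FEATURE_CRC32 is not defined (it assumes a little-endian host).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>
#endif

// Assumed definition of the ticket's loadAt<T>: an unaligned load via memcpy.
template <typename T>
inline T loadAt(const char* p) {
    T v;
    std::memcpy(&v, p, sizeof(T));
    return v;
}

// Bitwise CRC-32C (reflected polynomial 0x82F63B78) over the low nbytes bytes
// of v, least significant byte first. Slow; fallback and reference only.
inline uint32_t crcSoft(uint32_t crc, uint64_t v, std::size_t nbytes) {
    for (std::size_t i = 0; i < nbytes; ++i) {
        crc ^= static_cast<uint8_t>(v >> (8 * i));
        for (int k = 0; k < 8; ++k)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

#if defined(__ARM_FEATURE_CRC32)
// One overload per operand width, mapping onto the ACLE intrinsics.
inline uint32_t crcStep(uint32_t c, uint64_t v) { return __crc32cd(c, v); }
inline uint32_t crcStep(uint32_t c, uint32_t v) { return __crc32cw(c, v); }
inline uint32_t crcStep(uint32_t c, uint16_t v) { return __crc32ch(c, v); }
inline uint32_t crcStep(uint32_t c, uint8_t v) { return __crc32cb(c, v); }
#else
template <typename T>
inline uint32_t crcStep(uint32_t c, T v) { return crcSoft(c, v, sizeof(T)); }
#endif

// Same structure as addBytes above, but the running state is passed in and
// returned instead of living in a member variable.
inline uint32_t crc32cUpdate(uint32_t reg, const void* start, std::size_t bytes) {
    auto p = static_cast<const char*>(start);

    // Main loop: 16 bytes per iteration.
    while (bytes >= 16) {
        auto a = loadAt<uint64_t>(p);
        auto b = loadAt<uint64_t>(p + 8);
        reg = crcStep(reg, a);
        reg = crcStep(reg, b);
        p += 16;
        bytes -= 16;
    }

    // Tail cascade: at most one step per set bit in the leftover count.
    if (bytes & 8) { reg = crcStep(reg, loadAt<uint64_t>(p)); p += 8; }
    if (bytes & 4) { reg = crcStep(reg, loadAt<uint32_t>(p)); p += 4; }
    if (bytes & 2) { reg = crcStep(reg, loadAt<uint16_t>(p)); p += 2; }
    if (bytes & 1) { reg = crcStep(reg, loadAt<uint8_t>(p)); }
    return reg;
}

// Assumed framing: initial value and final XOR are both 0xFFFFFFFF.
inline uint32_t crc32c(const void* data, std::size_t len) {
    return ~crc32cUpdate(~0u, data, len);
}

// Standard CRC-32C check value for the ASCII string "123456789".
inline void checkKnownValue() {
    assert(crc32c("123456789", 9) == 0xE3069283u);
}
```

With GCC or Clang, the intrinsic path is normally selected by building for a target that includes the CRC extension (for example -march=armv8-a+crc); otherwise the sketch silently uses the slow fallback, which is useful for checking the chunking logic but says nothing about throughput.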
      

            Assignee: Chenhao Qu (chenhao.qu@mongodb.com)
            Reporter: Mathias Stearn (mathias@mongodb.com)