Investigate WiredTiger checksum errors in mongod integration branch perf tests


    • Storage Engines, Storage Engines - Server Integration
    • 1.016
    • WhatThePelly - 2025-09-02, SE Persistence - 2025-08-15
    • 8
    • Not Needed

      While running performance tests on the disagg mongod integration branch, I saw sporadic WiredTiger failures in the YCSB workloads indicating a checksum error:

      {"t":{"$date":"2025-08-13T04:00:06.563+00:00"},"s":"E",  "c":"WT",       "id":22435,   "ctx":"conn70","msg":"WiredTiger error message","attr":{"error":0,"message":{"ts_sec":1755057606,"ts_usec":563828,"thread":"7839:0xffffb6657980","session_dhandle_name":"file:collection-4b15b48b-0f49-4d6c-9174-a4a345c9615a.wt_stable","session_name":"WT_CURSOR.search","category":"WT_VERB_DEFAULT","log_id":1000000,"category_id":12,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"__block_disagg_read_checksum_err:47:collection-4b15b48b-0f49-4d6c-9174-a4a345c9615a.wt_stable: read checksum error for 0B block at page 1147405, lsn 7537909630182098037: block header checksum of 2366917735 doesn't match expected checksum of 6b8e8f16"}}}
      {"t":{"$date":"2025-08-13T04:00:06.563+00:00"},"s":"E",  "c":"WT",       "id":22435,   "ctx":"conn70","msg":"WiredTiger error message","attr":{"error":0,"message":{"ts_sec":1755057606,"ts_usec":563941,"thread":"7839:0xffffb6657980","session_dhandle_name":"file:collection-4b15b48b-0f49-4d6c-9174-a4a345c9615a.wt_stable","session_name":"WT_CURSOR.search","category":"WT_VERB_DEFAULT","log_id":1000000,"category_id":12,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"__wt_bm_corrupt_dump:77:{0: 1147405, 0, 0x6b8e8f16}: empty buffer, no dump available"}}}
      {"t":{"$date":"2025-08-13T04:00:06.563+00:00"},"s":"E",  "c":"WT",       "id":22435,   "ctx":"conn70","msg":"WiredTiger error message","attr":{"error":-31802,"message":{"ts_sec":1755057606,"ts_usec":563967,"thread":"7839:0xffffb6657980","session_dhandle_name":"file:collection-4b15b48b-0f49-4d6c-9174-a4a345c9615a.wt_stable","session_name":"WT_CURSOR.search","category":"WT_VERB_DEFAULT","log_id":1000000,"category_id":12,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"__block_disagg_read_multiple:239:collection-4b15b48b-0f49-4d6c-9174-a4a345c9615a.wt_stable: fatal read error","error_str":"WT_ERROR: non-specific WiredTiger error","error_code":-31802}}}
      {"t":{"$date":"2025-08-13T04:00:06.564+00:00"},"s":"E",  "c":"WT",       "id":22435,   "ctx":"conn70","msg":"WiredTiger error message","attr":{"error":-31804,"message":{"ts_sec":1755057606,"ts_usec":563992,"thread":"7839:0xffffb6657980","session_dhandle_name":"file:collection-4b15b48b-0f49-4d6c-9174-a4a345c9615a.wt_stable","session_name":"WT_CURSOR.search","category":"WT_VERB_DEFAULT","log_id":1000000,"category_id":12,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"__block_disagg_read_multiple:239:the process must exit and restart","error_str":"WT_PANIC: WiredTiger library panic","error_code":-31804}}}
      {"t":{"$date":"2025-08-13T04:00:06.564+00:00"},"s":"F",  "c":"ASSERT",   "id":23089,   "ctx":"conn70","msg":"Fatal assertion","attr":{"msgid":50853,"location":"src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp:645:9:int mongo::{anonymous}::mdb_handle_error_with_startup_suppression(WT_EVENT_HANDLER*, WT_SESSION*, int, const char*)"}}
      {"t":{"$date":"2025-08-13T04:00:06.564+00:00"},"s":"F",  "c":"ASSERT",   "id":23090,   "ctx":"conn70","msg":"\n\n***aborting after fassert() failure\n\n"}
      

      The failure is not deterministic and appears across various YCSB workloads. This ticket is to investigate and fix the issue. Here's an example of a failing YCSB workload with the checksum error.
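
      When triaging many failing runs, it can help to pull the checksum details out of the log lines mechanically. The following is a minimal sketch, not part of the original report: the helper name `parse_checksum_error` is hypothetical, and it assumes the message text keeps the shape shown in the log above. It also assumes the "block header checksum" is printed in decimal and the "expected checksum" in hexadecimal, which is inferred only from the digits in the message (2366917735 vs. 6b8e8f16).

```python
import json
import re

# Matches the message text shown in the __block_disagg_read_checksum_err log line above.
# Assumption: the wording of this message is stable across builds.
CHECKSUM_RE = re.compile(
    r"read checksum error for (?P<size>\d+)B block at page (?P<page>\d+), "
    r"lsn (?P<lsn>\d+): block header checksum of (?P<actual>\d+) "
    r"doesn't match expected checksum of (?P<expected>[0-9a-fA-F]+)"
)


def parse_checksum_error(line):
    """Return checksum-mismatch details from one mongod JSON log line, or None.

    Hypothetical helper for triage; not part of mongod or WiredTiger.
    """
    entry = json.loads(line)
    message = entry.get("attr", {}).get("message", {})
    if not isinstance(message, dict):
        return None
    match = CHECKSUM_RE.search(message.get("msg", ""))
    if match is None:
        return None
    # Assumption: the header checksum is decimal and the expected one is hex.
    actual = int(match["actual"])
    expected = int(match["expected"], 16)
    return {
        "table": message.get("session_dhandle_name"),
        "size_bytes": int(match["size"]),
        "page": int(match["page"]),
        "lsn": int(match["lsn"]),
        # Normalize both values to 8-digit hex so they can be compared directly.
        "actual_hex": f"{actual:08x}",
        "expected_hex": f"{expected:08x}",
    }
```

      Under the decimal/hex assumption, the first log line above yields `actual_hex` of `8d144c67` against `expected_hex` of `6b8e8f16`, with `size_bytes` equal to 0. That zero size matches the "empty buffer, no dump available" message on the following line.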

            Assignee: Nic Hollingum
            Reporter: Ali Mir
            Votes: 0
            Watchers: 13