disagg: Investigate/fix spurious bug using modify (potential data corruption)

XMLWordPrintableJSON

    • Storage Engines, Storage Engines - Transactions
    • SE Transactions - 2025-10-10
    • 5

      A spurious failure running the disagg hook:

      test_timestamp20.py: [pid:73697]: None ... [pid:73697]: test_timestamp20.test_timestamp20.test_timestamp20_modify -s 0 (string-row): starting
      [2025/09/11 23:44:08.908] test_timestamp20.py: [pid:73697]:     Replacing, old uri = "table:test_timestamp20"
      [2025/09/11 23:44:08.908] test_timestamp20.py: [pid:73697]:     Replacing, new uri = "layered:test_timestamp20"
      [2025/09/11 23:44:08.908] test_timestamp20.py: [pid:73697]: test_timestamp20.test_timestamp20.test_timestamp20_modify -s 0 (string-row): 18.10 seconds
      [2025/09/11 23:44:08.908] test_timestamp20.py: [pid:73697]: ERROR in test_timestamp20.test_timestamp20.test_timestamp20_modify -s 0 (string-row)
      [2025/09/11 23:44:08.908] test_timestamp20.py: FAIL
      [2025/09/11 23:44:08.908] test_timestamp20.py: [pid:73697]: None ... [pid:73697]: test_timestamp20.test_timestamp20.test_timestamp20_modify -s 1 (column): starting
      [2025/09/11 23:44:08.908] test_timestamp20.py: [pid:73697]: test_timestamp20.test_timestamp20.test_timestamp20_modify -s 1 (column): 2.04 seconds
      [2025/09/11 23:44:08.908] test_timestamp20.py: ok
      [2025/09/11 23:44:08.908] test_timestamp20.py: [pid:73697]: None ... [pid:73697]: test_timestamp20.test_timestamp20.test_timestamp20_modify -s 2 (column-fix): starting
      [2025/09/11 23:44:08.908] test_timestamp20.py: [pid:73697]: test_timestamp20.test_timestamp20.test_timestamp20_modify -s 2 (column-fix): 1.03 seconds
      [2025/09/11 23:44:08.908] test_timestamp20.py: ok
      [2025/09/11 23:44:08.908] test_timestamp20.py: [pid:73697]: None ... [pid:73697]: test_timestamp20.test_timestamp20.test_timestamp20_standard -s 0 (string-row): starting
      [2025/09/11 23:44:08.908] test_timestamp20.py: [pid:73697]:     Replacing, old uri = "table:test_timestamp20"
      [2025/09/11 23:44:08.908] test_timestamp20.py: [pid:73697]:     Replacing, new uri = "layered:test_timestamp20"
      [2025/09/11 23:44:08.908] test_timestamp20.py: [pid:73697]: test_timestamp20.test_timestamp20.test_timestamp20_standard -s 0 (string-row): 12.08 seconds
      [2025/09/11 23:44:08.908] test_timestamp20.py: ok
      [2025/09/11 23:44:08.908] test_timestamp20.py: [pid:73697]: None ... [pid:73697]: test_timestamp20.test_timestamp20.test_timestamp20_standard -s 1 (column): starting
      [2025/09/11 23:44:08.909] test_timestamp20.py: [pid:73697]: test_timestamp20.test_timestamp20.test_timestamp20_standard -s 1 (column): 2.04 seconds
      [2025/09/11 23:44:08.909] test_timestamp20.py: ok
      [2025/09/11 23:44:08.909] test_timestamp20.py: [pid:73697]: None ... [pid:73697]: test_timestamp20.test_timestamp20.test_timestamp20_standard -s 2 (column-fix): starting
      [2025/09/11 23:44:08.909] test_timestamp20.py: [pid:73697]: test_timestamp20.test_timestamp20.test_timestamp20_standard -s 2 (column-fix): 7.04 seconds
      [2025/09/11 23:44:08.909] test_timestamp20.py: ok
      [2025/09/11 23:44:08.909] test_timestamp20.py:
      [2025/09/11 23:44:08.909] test_timestamp20.py: ======================================================================
      [2025/09/11 23:44:08.909] test_timestamp20.py: [pid:73697]: FAIL: None
      [2025/09/11 23:44:08.909] test_timestamp20.py: ----------------------------------------------------------------------
      [2025/09/11 23:44:08.909] test_timestamp20.py: [pid:73697]: Traceback (most recent call last):
      [2025/09/11 23:44:08.909] test_timestamp20.py:   File "/data/mci/a2607e1e3900e25787216c2600f2238e/wiredtiger/test/suite/wttest.py", line 247, in _callTestMethod
      [2025/09/11 23:44:08.909] test_timestamp20.py:     method()
      [2025/09/11 23:44:08.909] test_timestamp20.py:   File "/data/mci/a2607e1e3900e25787216c2600f2238e/wiredtiger/test/suite/test_timestamp20.py", line 207, in test_timestamp20_modify
      [2025/09/11 23:44:08.909] test_timestamp20.py:     self.assertEqual(old_reader_cursor[self.get_key(i)], expected)
      [2025/09/11 23:44:08.909] test_timestamp20.py: AssertionError: 'aaaa[191 chars]aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa[255 chars]aaaa' != 'aaaa[191 chars]aaaaaCaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa[255 chars]aaaa'
      [2025/09/11 23:44:08.909] test_timestamp20.py: Diff is 1415 characters long. Set self.maxDiff to None to see it.
      [2025/09/11 23:44:08.909] test_timestamp20.py:
      [2025/09/11 23:44:08.909] test_timestamp20.py: ----------------------------------------------------------------------
      [2025/09/11 23:44:08.909] test_timestamp20.py: Ran 6 tests in 42.332s
      [2025/09/11 23:44:08.909] test_timestamp20.py:
      [2025/09/11 23:44:08.909] test_timestamp20.py: FAILED (failures=1)
      [2025/09/11 23:44:08.933] ERROR: test_timestamp20.py process returned 1 

      Looking at test_timestamp20.py, I see that in line 159, a single char 'C' is inserted.

                  self.assertEqual(cursor.modify([wiredtiger.Modify('C', 200, 1)]), 0) 

      (So the "aaaaaCaaaaa...." doesn't come out of the blue).  This modify is supposed to happen at timestamp 30, and after checkpoints/rollbacks, this comparison came out unexpected.  I didn't delve into the test logic, but note that this is spurious - I could not get it to reproduce on my workstation, and rerunning the task in evergreen "fixed" the problem.

      I haven't seen other bugs that look like DC, and the fact that this is modify stands out, I don't know how heavily tested that is.

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Donald Anderson
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated: