Timing bugs in test_layered* python tests


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Security Level: Public (Available to anyone on the web)
    • Storage Engines - Foundations

      test_layered04 (as with many other tests) has sleeps in it. Take out the sleeps, and it fails. It does something like:

      • open cursor, insert 150k records, close cursor. (sleep occasionally during insertion)
      • sleep 1
      • open cursor, count records, close cursor
      • make sure the number of records read == 150k

      With the sleeps removed, the final check fails because the count comes up short. The longer the test sleeps, the closer the count gets to 150k.
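To quantify how the count approaches 150k as the wait grows, the fixed `sleep 1` could be replaced by a polling loop that records the count over time. This is only a sketch: `count_records` is a hypothetical callable standing in for the "open cursor, count records, close cursor" step, and nothing here uses the WiredTiger API.

```python
import time


def poll_count(count_records, expected, timeout=10.0, interval=0.1,
               clock=time.monotonic, sleep=time.sleep):
    """Call count_records() until it returns `expected` or `timeout` elapses.

    Returns a list of (elapsed_seconds, count) samples, one per call.
    count_records is a hypothetical stand-in for the cursor-count step;
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    start = clock()
    samples = []
    while True:
        count = count_records()
        elapsed = clock() - start
        samples.append((elapsed, count))
        # Stop on success, or give up once the deadline has passed.
        if count == expected or elapsed >= timeout:
            return samples
        sleep(interval)
```

The returned samples show whether the count is still climbing when the test would otherwise have read it, which separates "data not visible yet" from "data lost".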

      test_layered06 fails at almost the same point; it is only slightly different:

      • open a second connection to be a follower, and a session
        • the follower session is not used before the crash.
      • open a cursor, insert 300k records, reset and close the cursor (sleep occasionally during insertion)
      • sleep 2
      • open cursor, count records, close cursor

      In the last step, it crashes with this error:

      [1731354394:7176][21453:0xfffff7ff54c0], test_oligarch06.test_oligarch06.test_oligarch06(100k), file:test_oligarch06.wt_stable, WT_CURSOR.next: [WT_VERB_DEFAULT][ERROR]: __block_disagg_read_checksum_err, 48: test_oligarch06.wt_stable: read checksum error for 2286B block at page 225, ckpt 1: block header checksum of 1948959266 (2) doesn't match expected checksum of 884395128 (1)
      

      The relevant part is that the number in parentheses next to each checksum is the reconciliation_id. So the "rec-id" found in PALI was 2, and we were asking for 1. This is all for the same file_id, same page_id, and same checkpoint_id (checkpoint 1). I've confirmed in the LMDB storage that we have three versions of this page, at rec-ids 0, 1 and 2. They are all full versions, no deltas.
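When scanning many test logs for this failure, it can help to pull the checksum and rec-id pairs out of the message mechanically. The sketch below is based only on the single log line quoted above; the regular expression, the function name, and the field names (`found_*` for the block header, `want_*` for the expected values) are illustrative assumptions and may not match other variants of the message.

```python
import re

# Pattern for the tail of the checksum error message. Derived from one
# sample line only, so other message variants may not match.
CHECKSUM_ERR = re.compile(
    r"read checksum error for (?P<size>\d+)B block at page (?P<page>\d+), "
    r"ckpt (?P<ckpt>\d+): block header checksum of (?P<found_sum>\d+) "
    r"\((?P<found_rec>\d+)\) doesn't match expected checksum of "
    r"(?P<want_sum>\d+) \((?P<want_rec>\d+)\)"
)


def parse_checksum_error(line):
    """Return the numeric fields of a checksum error line as a dict, or None."""
    m = CHECKSUM_ERR.search(line)
    if m is None:
        return None
    return {name: int(value) for name, value in m.groupdict().items()}


line = ("test_oligarch06.wt_stable: read checksum error for 2286B block at "
        "page 225, ckpt 1: block header checksum of 1948959266 (2) doesn't "
        "match expected checksum of 884395128 (1)")
info = parse_checksum_error(line)
# A found rec-id greater than the requested one is the pattern described above.
newer_than_requested = info["found_rec"] > info["want_rec"]
```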

      How can it be that we are asking for an earlier reconciliation version than we've previously written?

      Theory: Is it possible that we didn't update the new checksum/rec-id pair in the cookie in the parent internal page?  So we're asking for the old one, since we didn't record the new one. 
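The theory can be illustrated with a toy model. Everything in it is hypothetical: `ToyStore`, the `parent` dict, and the use of CRC32 are stand-ins, not WiredTiger or PALI internals. It also assumes that a read returns the newest stored version of a page, which is inferred from the error message rather than confirmed.

```python
import zlib


class ToyStore:
    """Hypothetical page store that keeps every full image of each page."""

    def __init__(self):
        self.versions = {}  # page_id -> list of payloads, index == rec_id

    def write(self, page_id, payload):
        """Append a full page image and return its (checksum, rec_id) cookie."""
        history = self.versions.setdefault(page_id, [])
        history.append(payload)
        return zlib.crc32(payload), len(history) - 1

    def read(self, page_id, cookie):
        """Fetch the newest image and verify it against the caller's cookie."""
        history = self.versions[page_id]
        payload, rec_id = history[-1], len(history) - 1
        found = (zlib.crc32(payload), rec_id)
        if found != cookie:
            raise ValueError(f"checksum error: found {found}, expected {cookie}")
        return payload


store = ToyStore()
parent = {}  # page_id -> cookie, standing in for the parent internal page

store.write(225, b"v0")                 # rec-id 0
parent[225] = store.write(225, b"v1")   # parent records rec-id 1
store.write(225, b"v2")                 # rec-id 2 written, parent NOT updated

try:
    store.read(225, parent[225])
    stale_detected = False
except ValueError:
    stale_detected = True  # the reader asked for rec-id 1 but found rec-id 2
```

In this model the store ends up with rec-ids 0, 1 and 2 while the reader still asks for 1, which matches the observed symptom. That only shows the theory is consistent with the log, not that it is the actual cause.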

      UPDATE: These tests are now named test_layered04.py, test_layered06.py etc.

            Assignee:
            Unassigned
            Reporter:
            Donald Anderson
            Votes:
            0
            Watchers:
            6
