-
Type:
Bug
-
Resolution: Unresolved
-
Priority:
Major - P3
-
None
-
Affects Version/s: None
-
Component/s: None
-
None
-
Storage Engines - Transactions
-
None
-
None
While implementing parallel checkpoints, or at least trying to resolve test failures on the branch I came across a failure in test_drop01, the test logs:
[pid:69414]: ERROR in test_drop01.test_drop01.test_drop_hs_truncate
F
======================================================================
FAIL: test_drop01.test_drop01.test_drop_hs_truncate
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/ubuntu/other-wt-repro/test/suite/wttest.py", line 161, in wrapper
return func(self, *args, **kwargs)
File "/home/ubuntu/other-wt-repro/test/suite/wttest.py", line 309, in _callTestMethod
method()
File "/home/ubuntu/other-wt-repro/test/suite/test_drop01.py", line 68, in test_drop_hs_truncate
self.assertEqual(self.get_hs_size(), 2)
AssertionError: 0 != 2
----------------------------------------------------------------------
Ran 1 test in 78.737s
FAILED (failures=1)
[Inferior 1 (process 69414) exited with code 01]
This assertion is saying it counted zero records in the history store when it should've seen two. Now initially I believed this failure to be a history store reconciliation oddity that parallel checkpoints weren't playing nice with, but it turns out the actual cause is quite different.
This failure mode can be reproduced on develop by modifying the test slightly:
diff --git a/test/suite/test_drop01.py b/test/suite/test_drop01.py
index 2f6417ab15..1d7c2efcd9 100644
--- a/test/suite/test_drop01.py
+++ b/test/suite/test_drop01.py
@@ -39,11 +39,13 @@ class test_drop01(wttest.WiredTigerTestCase):
self.session.commit_transaction('commit_timestamp=' + self.timestamp_str(timestamp))
def get_hs_size(self):
- cursor = self.session.open_cursor("file:WiredTigerHS.wt")
+ sess = self.conn.open_session()
+ cursor = sess.open_cursor("file:WiredTigerHS.wt")
size = 0
while cursor.next() == 0:
size += 1
cursor.close()
+ sess.close()
return size
uri = 'table:test_drop01'
Essentially, by creating a new session prior to counting the records in the history the test fails. The reason for this is that the cursor opened by the session does not get WT_CURSTD_IGNORE_TOMBSTONE set on it and thus treats every history store record as invisible. This indicates a bug, I think every cursor opened on the history store should have that flag set. Now the reason this passes on develop currently is that the history store cursor opened by the checkpoint prior to counting the history store records never has WT_CURSTD_IGNORE_TOMBSTONE un-set from it. So effectively when the checkpoint completes the cursor is cached and then re-used when re-opened later in the test. This also is likely a bug.
I'm not entirely clear on the semantics expected when a user opens a cursor directly on the history store. I don't think MongoDB does this either so I don't think there are external implications but effectively the bug impacts what records are visible in the history store, and hides them artificially so its a somewhat concerning issue.
The fix may be as simple as adding a WT_CURSTD_IGNORE_TOMBSTONE in cur_file open, and clearing that flag on cache.
Scope:
- Investigate validity of issue
- Fix, and add regression testing