[SERVER-34590] oplog visibility issues with round_to_oldest Created: 20/Apr/18  Updated: 29/Oct/23  Resolved: 27/Apr/18

Status: Closed
Project: Core Server
Component/s: Storage, WiredTiger
Affects Version/s: 3.7.5
Fix Version/s: 4.0.0-rc0

Type: Bug Priority: Major - P3
Reporter: Louis Williams Assignee: Backlog - Storage Execution Team
Resolution: Fixed Votes: 0
Labels: nyc
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on WT-4057 round_to_oldest should establish txn ... Closed
Related
related to SERVER-32876 Don't stall ftdc due to WT cache full Closed
related to SERVER-34192 Secondary reads during batch applicat... Closed
Assigned Teams:
Storage Execution
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Storage NYC 2018-05-07
Participants:
Linked BF Score: 72

 Description   

A frequent build failure has been identified since SERVER-34192 Secondary reads during batch application that causes "dbhash mismatch" errors consistently in master which result in missing documents on secondaries.

When SERVER-32876 Don't stall FTDC due to WT cache full is reverted, the errors go away. This patch was reverted in 3.6 because it caused dbhash mismatch errors.

The current belief is that the previous synchronization in WiredTigerSnapshotManager preventing opening transactions concurrently was removed. However, this change should be correct without data inconsistency issues.

Now that the synchronization for opening transactions on the oplog is gone, we believe there is a latent bug exposed that is preventing this concurrent behavior of opening transactions and then subsequently setting a read timestamp on them.

The diff I believe is responsible for this failure:

 void WiredTigerSnapshotManager::beginTransactionOnOplog(WiredTigerOplogManager* oplogManager,
                                                         WT_SESSION* session) const {
     invariantWTOK(session->begin_transaction(session, nullptr));
     auto rollbacker =
         MakeGuard([&] { invariant(session->rollback_transaction(session, nullptr) == 0); });
 
-    stdx::lock_guard<stdx::mutex> lock(_mutex);
     auto allCommittedTimestamp = oplogManager->getOplogReadTimestamp();
     invariant(Timestamp(static_cast<unsigned long long>(allCommittedTimestamp)).asULL() ==
               allCommittedTimestamp);
     auto status = setTransactionReadTimestamp(
-        Timestamp(static_cast<unsigned long long>(allCommittedTimestamp)), session);
+        Timestamp(static_cast<unsigned long long>(allCommittedTimestamp)),
+        session,
+        true /* roundToOldest */);
 
-    // If we failed to set the read timestamp, we assume it is due to the oldest_timestamp racing
-    // ahead.  Rather than synchronizing for this rare case, if requested, throw a
-    // WriteConflictException which will be retried.
-    if (!status.isOK() && status.code() == ErrorCodes::BadValue) {
-        throw WriteConflictException();
-    }
     fassert(50771, status);
     rollbacker.Dismiss();
}



 Comments   
Comment by Daniel Gottlieb (Inactive) [ 27/Apr/18 ]

This was fixed in the WT drop including WT-4057

Comment by Eric Milkie [ 25/Apr/18 ]

An amendment to my previous comment: It turns out that WiredTiger actually does reset the snapshot on a transaction when setting a read timestamp. The bug is that for round_to_oldest, it is doing the rounding after it sets the snapshot the second time. The fix may be to ensure the rounding occurs prior to setting the snapshot. I filed WT-4057 in light of this. The proposed code changes in that ticket make the repro for this ticket no longer fail.

Comment by Eric Milkie [ 24/Apr/18 ]

I think I figured this out. The problem is that begin_transaction() establishes the snapshot isolation that affects what is visible to the transaction. Later on, a call to timestamp_transaction() adds a read timestamp to the session, which further affects visibility. At the point where timestamp_transaction() runs, there is an ability to change the read timestamp from what was provided to what the current, in-the-moment oldest_timestamp value is ("round_to_oldest" behavior). This is the problematic behavior. I believe round_to_oldest would also be broken when setting the read timestamp inside begin_transaction() itself, but the race window would be a lot smaller.
What can happen is this:
(on a primary with two secondaries, A and B)
1. oldest_timestamp is 0
2. write 2 has committed, write 3 is in flight, write 4 is committed
3. oplog read timestamp is currently "2".
4. Secondary A begins a read on the oplog; it calls begin_transaction() which sets up its snapshot isolation to hide visibility for write 3 since it is not committed yet.
5. write 3 commits.
6. oplog read timestamp moves to "4".
7. Secondary B begins a read on the oplog; it calls begin_transaction() (all writes are visible), then calls timestamp_transaction() with read timestamp 4, then reads oplog entries "2","3","4" and applies them, and notifies the primary node.
8. the primary node recalculates the majority level to "4" and sets oldest_timestamp to "4".
9. Secondary A finally calls timestamp_transaction() with its read timestamp of "2". round_to_oldest rounds the read timestamp to "4". We now have a transaction with a read timestamp of "4" but which cannot see write "3" due to snapshot isolation rules. This secondary will read oplog entries "2","4".

It sounds like using round_to_oldest as currently implemented is not possible, unless we can move its oldest-timestamp-observing behavior to occur before the snapshot isolation is set up for the transaction.

Generated at Thu Feb 08 04:37:11 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.