Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 4.0.0-rc0
Affects Version/s: 3.7.5
Component/s: Storage, WiredTiger
Labels:
- nyc

Assigned Teams:

Storage Execution
Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Sprint:
Storage NYC 2018-05-07
Linked BF Score:
72
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Since ~~SERVER-34192~~ (Secondary reads during batch application), a frequent build failure has been identified on master: consistent "dbhash mismatch" errors, with documents missing on secondaries.

When ~~SERVER-32876~~ (Don't stall FTDC due to WT cache full) is reverted, the errors go away. The same patch was reverted in 3.6 because it caused dbhash mismatch errors.

The current belief is that the patch removed the synchronization in WiredTigerSnapshotManager that previously prevented transactions from being opened concurrently. On its own, however, that removal should be correct and should not cause data inconsistency.

With the synchronization for opening transactions on the oplog gone, we believe a latent bug has been exposed: opening transactions concurrently and then setting a read timestamp on them does not behave correctly.

The diff I believe is responsible for this failure:

 void WiredTigerSnapshotManager::beginTransactionOnOplog(WiredTigerOplogManager* oplogManager,
                                                         WT_SESSION* session) const {
     invariantWTOK(session->begin_transaction(session, nullptr));
     auto rollbacker =
         MakeGuard([&] { invariant(session->rollback_transaction(session, nullptr) == 0); });

-    stdx::lock_guard<stdx::mutex> lock(_mutex);
     auto allCommittedTimestamp = oplogManager->getOplogReadTimestamp();
     invariant(Timestamp(static_cast<unsigned long long>(allCommittedTimestamp)).asULL() ==
               allCommittedTimestamp);
     auto status = setTransactionReadTimestamp(
-        Timestamp(static_cast<unsigned long long>(allCommittedTimestamp)), session);
+        Timestamp(static_cast<unsigned long long>(allCommittedTimestamp)),
+        session,
+        true /* roundToOldest */);

-    // If we failed to set the read timestamp, we assume it is due to the oldest_timestamp racing
-    // ahead.  Rather than synchronizing for this rare case, if requested, throw a
-    // WriteConflictException which will be retried.
-    if (!status.isOK() && status.code() == ErrorCodes::BadValue) {
-        throw WriteConflictException();
-    }
     fassert(50771, status);
     rollbacker.Dismiss();
 }
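
To make the suspected race easier to picture, here is a minimal toy model of it. It is only a sketch under stated assumptions: `ToyStore`, `ToyReader`, `openReader`, and `visibleAfterRace` are hypothetical names invented for illustration and are not WiredTiger or MongoDB APIs, and the visibility rule (a write is visible only if it was committed when the snapshot was taken and its commit timestamp is at or below the read timestamp) is a simplification. The model assumes the snapshot is fixed at begin_transaction while the read timestamp is rounded up to the oldest timestamp afterwards, and it contrasts that with the ordering named in the WT-4057 title, where the snapshot is established after the rounded read timestamp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <set>
#include <vector>

// Toy model only: none of these types exist in WiredTiger or MongoDB.
struct ToyWrite {
    std::uint64_t txnId;
    std::uint64_t commitTs;
};

struct ToyStore {
    std::uint64_t oldestTs = 0;
    std::vector<ToyWrite> committed;

    void commit(std::uint64_t txnId, std::uint64_t commitTs) {
        committed.push_back({txnId, commitTs});
    }
};

struct ToyReader {
    std::set<std::uint64_t> visibleTxns;  // transactions committed when the snapshot was taken
    std::uint64_t readTs = 0;

    void takeSnapshot(const ToyStore& store) {
        visibleTxns.clear();
        for (const ToyWrite& w : store.committed) {
            visibleTxns.insert(w.txnId);
        }
    }

    // A write counts only if it passes both the snapshot check and the timestamp check.
    std::size_t countVisible(const ToyStore& store) const {
        std::size_t n = 0;
        for (const ToyWrite& w : store.committed) {
            if (visibleTxns.count(w.txnId) != 0 && w.commitTs <= readTs) {
                ++n;
            }
        }
        return n;
    }
};

// concurrentWork stands in for other threads running between "begin" and
// the point where the read timestamp is set.
ToyReader openReader(ToyStore& store,
                     std::uint64_t requestedTs,
                     bool snapshotAfterTimestamp,
                     const std::function<void()>& concurrentWork) {
    ToyReader reader;
    if (!snapshotAfterTimestamp) {
        reader.takeSnapshot(store);  // snapshot fixed at "begin"
    }
    concurrentWork();
    reader.readTs = std::max(requestedTs, store.oldestTs);  // round up to oldest
    assert(reader.readTs >= store.oldestTs);
    if (snapshotAfterTimestamp) {
        reader.takeSnapshot(store);  // snapshot taken after rounding
    }
    return reader;
}

// A writer commits at timestamp 5 and the oldest timestamp advances to 5
// while the reader, which asked for timestamp 3, is still being opened.
std::size_t visibleAfterRace(bool snapshotAfterTimestamp) {
    ToyStore store;
    ToyReader reader = openReader(store, /*requestedTs=*/3, snapshotAfterTimestamp, [&] {
        store.commit(/*txnId=*/1, /*commitTs=*/5);
        store.oldestTs = 5;
    });
    return reader.countVisible(store);
}
```

In this model, `visibleAfterRace(false)` leaves the reader with a read timestamp of 5 but a snapshot that predates the commit at timestamp 5, so the write is missed; `visibleAfterRace(true)` sees it. Whether this matches the actual failure mechanism is an assumption drawn from the description and the WT-4057 title, not something the ticket confirms.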

depends on

WT-4057 round_to_oldest should establish txn snapshot after establishing rounded read timestamp

Closed

related to

SERVER-32876 Don't stall ftdc due to WT cache full

Closed

SERVER-34192 Secondary reads during batch applications

Closed

Assignee: [DO NOT USE] Backlog - Storage Execution Team
Reporter: Louis Williams
Participants: [DO NOT USE] Backlog - Storage Execution Team, Daniel Gottlieb, Eric Milkie, Louis Williams
Votes: 0
Watchers: 8

Created: Apr 20 2018 03:03:10 PM UTC
Updated: Oct 29 2023 10:32:32 PM UTC
Resolved: Apr 27 2018 03:33:28 PM UTC
Confidence Status Last Update: 25/Apr/18 2:57 PM
