[SERVER-33812] First initial sync oplog read batch fetched may be empty; do not treat as an error. Created: 12/Mar/18  Updated: 29/Oct/23  Resolved: 18/Apr/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.6.6, 3.7.6

Type: Bug Priority: Critical - P2
Reporter: Geert Bosch Assignee: Benety Goh
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-31679 Increase in disk i/o for writes to re... Closed
is related to SERVER-30927 Use readConcern afterClusterTime for ... Closed
is related to SERVER-30977 Need to sign cluster times for unshar... Closed
is related to SERVER-31007 Calculate rollback time limit correctly Closed
is related to SERVER-34768 Rollback can fail if run against a la... Closed
is related to SERVER-35256 Do not treat it as an error if the fi... Closed
is related to SERVER-42910 Oplog query with higher timestamp but... Closed
is related to SERVER-29213 Have KVWiredTigerEngine implement Sto... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.6, v3.4
Steps To Reproduce:

diff --git a/src/mongo/db/storage/wiredtiger/wiredtiger_oplog_manager.cpp b/src/mongo/db/storage/wiredtiger/wiredtiger_oplog_manager.cpp
index f8c6c70196..13330a526e 100644
--- a/src/mongo/db/storage/wiredtiger/wiredtiger_oplog_manager.cpp
+++ b/src/mongo/db/storage/wiredtiger/wiredtiger_oplog_manager.cpp
@@ -198,6 +198,7 @@ void WiredTigerOplogManager::_oplogJournalThreadLoop(WiredTigerSessionCache* ses
         // Publish the new timestamp value.
         _setOplogReadTimestamp(lk, newTimestamp);
         lk.unlock();
+        sleepmillis(1500);
 
         // Wake up any await_data cursors and tell them more data might be visible now.
         oplogRecordStore->notifyCappedWaitersIfNeeded();

Cherry-picking f23bcbfa6d08c24b5570b3b29641f96babfc6a34 onto v3.6 also reproduces the bug with the RHEL-62 enterprise builder (the required one on evergreen), though I haven't been able to reproduce locally without inserting extra delays.

Sprint: Repl 2018-04-23
Participants:
Linked BF Score: 52

 Description   

Currently we depend on initial sync never receiving an empty first batch from its oplog query. However, an empty first batch is possible, depending on timing in _oplogJournalThreadLoop. The fix for SERVER-31679 makes the empty batch more likely, which is why this issue is currently blocking that ticket's backport to 3.6.
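
For context, a rough sketch of the shape of the oplog tailing command involved (the helper name and exact fields are assumptions for illustration, not the actual OplogFetcher code). The begin timestamp is discovered by a reverse scan of the sync source's oplog, which is not gated by oplog visibility, so this forward query can legitimately return an empty first batch while visibility catches up:

    #include "mongo/bson/bsonobjbuilder.h"
    #include "mongo/bson/timestamp.h"

    // Hypothetical sketch of the tailable, awaitData find command (run against
    // the "local" database) used to tail the sync source's oplog starting at a
    // previously discovered timestamp.
    mongo::BSONObj makeOplogTailingCommand(mongo::Timestamp beginTs) {
        return BSON("find" << "oplog.rs"
                           << "filter" << BSON("ts" << BSON("$gte" << beginTs))
                           << "tailable" << true
                           << "awaitData" << true);
    }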



 Comments   
Comment by Siyuan Zhou [ 20/Aug/19 ]

SERVER-42910 is filed for the fix for the aforementioned read concern issue.

Comment by Siyuan Zhou [ 19/Aug/19 ]

milkie, while working on SERVER-42219 I happened to make the transition to primary faster, and I then observed test failures in which a node that should roll back actually has a higher timestamp than its sync source, e.g. the old primary has (ts: 9, term: 1) while the new primary has (ts: 8, term: 2). In this case an empty batch is desired, but the query failed with MaxTimeMSExpired. MaxTimeMS was added in SERVER-35200, but I don't think that's the root cause; my theory is that the query times out while waiting for afterClusterTime. In production the old primary will very likely roll back once new writes arrive with an even higher timestamp, perhaps from the periodic no-op writer, but this is still a liveness issue.

Some ideas to fix this:

  1. Revert the behavioral change made in this ticket outside of initial sync. Unnecessary rollbacks would then be possible and may cause spurious build failures.
  2. Check the top of the oplog before waiting for the read concern on the oplog query.
  3. Change the query to include the term, either in the predicate or in the metadata, and detect this scenario explicitly (a rough sketch follows this list).
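
A rough sketch of what idea (3) could look like if the term were folded into the query predicate (purely illustrative: the helper name, headers, and filter shape are assumptions, not a reviewed proposal):

    #include "mongo/bson/bsonobjbuilder.h"
    #include "mongo/db/repl/optime.h"

    // Hypothetical: express "optime >= lastOpTimeFetched" using (term, timestamp)
    // ordering instead of the timestamp alone. A node whose timestamp is higher
    // but whose term is lower would then still receive the sync source's
    // newer-term entries (rather than an empty batch or a timeout) and could
    // detect the divergence explicitly.
    mongo::BSONObj makeOplogQueryFilter(const mongo::repl::OpTime& last) {
        return BSON("$or" << BSON_ARRAY(
                        BSON("t" << BSON("$gt" << last.getTerm()))
                        << BSON("t" << last.getTerm()
                                    << "ts" << BSON("$gte" << last.getTimestamp()))));
    }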

None of these sounds satisfactory to me. Ideally, the sync source would wait only for all oplog slots to be filled rather than waiting for afterClusterTime. As for rollback, the failures of SERVER-42219 are only observed on 4.2+, so I don't expect to backport further than that. CC judah.schvimer and william.schultz.

Additionally, it would be nice if the syncing node sent both the timestamp and the term to the sync source, so that choosing a sync source could be simpler and checking the rollbackId would no longer be needed. This proposal is out of scope but may affect the trade-offs among the solutions above.

Comment by Githook User [ 22/May/18 ]

Author:

{'username': 'benety', 'name': 'Benety Goh', 'email': 'benety@mongodb.com'}

Message: SERVER-33812 OplogFetcher always uses an afterOpTime read concern

This avoids issues with the oplog query being rejected in mixed version
clusters when the sync source is not fully upgraded to 3.6.

(cherry picked from commit 46e2583c36856e0d377fcb35f2208a0ac516f031)
Branch: v3.6
https://github.com/mongodb/mongo/commit/d0b69c0fbc2fd3cbc732301307078d716468ac75
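
Roughly, "always uses an afterOpTime read concern" amounts to the following when building the oplog find command (a sketch under assumptions: the helper name and builder plumbing are illustrative, not the actual commit). The syncing node asks the sync source to wait until the last fetched optime is visible before returning the first batch, and afterOpTime, unlike afterClusterTime, is not rejected by sync sources that are not fully upgraded to 3.6:

    #include "mongo/bson/bsonobjbuilder.h"
    #include "mongo/db/repl/optime.h"

    // Hedged sketch: attach an afterOpTime read concern ({ts, t}) to the
    // oplog find command so the sync source waits for that optime to become
    // visible before answering.
    void appendOplogReadConcern(mongo::BSONObjBuilder* cmdBob,
                                const mongo::repl::OpTime& lastOpTimeFetched) {
        cmdBob->append("readConcern",
                       BSON("afterOpTime" << BSON("ts" << lastOpTimeFetched.getTimestamp()
                                                       << "t" << lastOpTimeFetched.getTerm())));
    }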

Comment by Githook User [ 02/May/18 ]

Author:

{'email': 'benety@mongodb.com', 'name': 'Benety Goh', 'username': 'benety'}

Message: Revert "SERVER-33812 always add afterClusterTime to oplog tailing query"

This reverts commit 5e10e0f84bfc225686547b732fec0d5fb104a6f8.
Branch: v3.6
https://github.com/mongodb/mongo/commit/7ce10aea42bc2adb8376955c1b47499f77899c1c

Comment by Githook User [ 24/Apr/18 ]

Author:

{'email': 'benety@mongodb.com', 'username': 'benety', 'name': 'Benety Goh'}

Message: SERVER-33812 always add afterClusterTime to oplog tailing query

(cherry picked from commit 46e2583c36856e0d377fcb35f2208a0ac516f031)
Branch: v3.6
https://github.com/mongodb/mongo/commit/5e10e0f84bfc225686547b732fec0d5fb104a6f8

Comment by Githook User [ 24/Apr/18 ]

Author:

{'email': 'benety@mongodb.com', 'username': 'benety', 'name': 'Benety Goh'}

Message: SERVER-33812 remove rollback_too_new.js

roll_back_local_operations_test provides coverage for this error condition.

(cherry picked from commit 91fc5673cab5d1267fd805f1375577df9072ea1b)
Branch: v3.6
https://github.com/mongodb/mongo/commit/cc4e45a4ef3a7a4fbc329050b369adacd76af7d7

Comment by Githook User [ 18/Apr/18 ]

Author:

{'email': 'benety@mongodb.com', 'username': 'benety', 'name': 'Benety Goh'}

Message: SERVER-33812 always add afterClusterTime to oplog tailing query
Branch: master
https://github.com/mongodb/mongo/commit/46e2583c36856e0d377fcb35f2208a0ac516f031

Comment by Benety Goh [ 17/Apr/18 ]

Instead of handling an empty initial batch returned by the query, we will address the issue by always providing a read concern with afterClusterTime. We can safely remove the term comparison because SERVER-30977 is complete.

https://github.com/mongodb/mongo/blob/91fc5673cab5d1267fd805f1375577df9072ea1b/src/mongo/db/repl/oplog_fetcher.cpp#L370

oplog_fetcher.cpp, lines 370-376:

    // TODO(SERVER-30977): Remove the term comparison when this ticket is fixed.
    if (term == lastOpTimeFetched.getTerm()) {
        cmdBob.append("readConcern", BSON("afterClusterTime" << lastOpTimeFetched.getTimestamp()));
    }

    return cmdBob.obj();
}
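
Roughly, the change that landed on master (a sketch based on the snippet above, not the literal commit diff) drops the term guard so the read concern is always attached:

    // Sketch: unconditionally request afterClusterTime so the sync source waits
    // for oplog visibility before returning the first batch, instead of guarding
    // the read concern behind the term comparison.
    cmdBob.append("readConcern", BSON("afterClusterTime" << lastOpTimeFetched.getTimestamp()));

    return cmdBob.obj();
}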

Comment by Githook User [ 17/Apr/18 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-33812 remove rollback_too_new.js

roll_back_local_operations_test provides coverage for this error condition.
Branch: master
https://github.com/mongodb/mongo/commit/91fc5673cab5d1267fd805f1375577df9072ea1b

Comment by Githook User [ 13/Apr/18 ]

Author:

{'email': 'benety@mongodb.com', 'name': 'Benety Goh', 'username': 'benety'}

Message: SERVER-33812 provide non-empty first batch to OplogFetcher test case

This test checks the metadata handling in the query results and should
not be affected by whether the first batch is empty.
Branch: master
https://github.com/mongodb/mongo/commit/35c2a58bda1b90fbc25a3204290ca23c1dc7e00b

Comment by Daniel Gottlieb (Inactive) [ 13/Mar/18 ]

Is this only being observed when a node is querying the oplog of an active secondary?

I'm only guessing here, but is it possible that a node doing a GTE query in steady state chose a server based on that server's reported last applied (or some other replication-gossiped value), but when performing the query, the first batch returned is empty because visibility hasn't caught up?

If so, I'm optimistic that SERVER-33846 will clean this up. If this is happening when issuing queries against the primary, I'm less certain how we can be getting into this situation. Perhaps when a node steps up to primary it needs to forcibly refresh its oplog visibility.

Comment by Eric Milkie [ 13/Mar/18 ]

You’re correct. I misspoke earlier. The first batch is entirely dependent on the query, regardless of awaitdata or tailable.

Comment by Spencer Brody (Inactive) [ 13/Mar/18 ]

I think I understand why relying on the batch not being empty is problematic in initial sync, because as you say we decide what optime to start our query from by doing a reverse scan which doesn't respect oplog visibility.

For steady state, however, your assertion that any tailable awaitData query can return an empty batch doesn't seem right/safe. In steady state, getting an empty first batch on a new oplog tailing GTE query is used to indicate a need to roll back, since the optime queried for in the GTE is the last one we got in our previous batch and thus must be visible. If we got an empty batch in that case while there were oplog documents that matched our query, that would be wrong and would trigger unnecessary rollbacks.

Comment by Eric Milkie [ 13/Mar/18 ]

Note that replication is formulating the tailable/awaitdata initial sync "find" query by specifying an oplog entry that is not yet visible, because it discovered this entry by doing a reverse-cursor find on the oplog. Reverse collection scans ($natural: -1) are not subject to oplog visibility rules. Therefore, the awaitdata query cannot be expected to always return data within the specified awaitdata timeout.
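
To make the contrast concrete, a rough sketch (helper name and headers are assumptions) of the reverse natural-order scan described above; it ignores oplog visibility, so the entry it returns may not yet be visible to the forward tailable/awaitData query built from it:

    #include "mongo/bson/bsonobjbuilder.h"

    // Hypothetical sketch: discover the newest oplog entry with a reverse
    // natural-order scan. Because this read is not subject to oplog visibility
    // rules, the returned entry may still be invisible to a forward query.
    mongo::BSONObj makeLastOplogEntryCommand() {
        return BSON("find" << "oplog.rs"
                           << "sort" << BSON("$natural" << -1)
                           << "limit" << 1);
    }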

Comment by Eric Milkie [ 13/Mar/18 ]

The semantics of the awaitData query parameter, combined with the tailable query parameter, are that any batch can come back empty, including the first one.

Comment by Spencer Brody (Inactive) [ 13/Mar/18 ]

It seems like erroring if the oplog batch is empty is correct behavior. The question in my mind is why the query is getting an empty batch in the first place. geert.bosch?
