[SERVER-27005] Write error revalidate logic needs to wait for lastVisibleOpTime to be committed Created: 11/Nov/16  Updated: 18/Dec/16  Resolved: 18/Dec/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Randolph Tan Assignee: Nathan Myers
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-24630 Mongos erroneously advances config op... Closed
is related to SERVER-20487 w:majority writes that return an erro... Closed
Operating System: ALL
Sprint: Sharding 2017-01-02
Participants:
Linked BF Score: 26

 Description   

A concrete example of this is during applyOps failure:

https://github.com/mongodb/mongo/blob/r3.4.0-rc3/src/mongo/s/catalog/sharding_catalog_client_impl.cpp#L1325-L1335

The issue here is that when the applyOps fails, for example because the current primary steps down, we don't know whether the write was applied. If the write was in fact replicated, a retried applyOps will fail because the precondition no longer holds. However, if we then inspect the document to revalidate, we may not see the post-write state because it is not yet in the committed snapshot. A small control-flow sketch of this race follows.
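
Illustrative only: a minimal, self-contained C++ sketch of the race, using hypothetical stand-ins (retryApplyOps, committedSnapshotSeesWrite) rather than the real catalog client calls in sharding_catalog_client_impl.cpp.

#include <iostream>

enum class Status { OK, PreConditionFailed, NetworkError };

// Hypothetical stand-in: the retried applyOps against the new primary.
Status retryApplyOps() { return Status::PreConditionFailed; }

// Hypothetical stand-in: a read from the committed snapshot (readConcern
// majority) that checks for the post-write chunk document.
bool committedSnapshotSeesWrite() { return false; }

int main() {
    // The first attempt failed with a network error after the stepdown, so we
    // do not know whether the write was applied.
    Status status = retryApplyOps();
    if (status == Status::PreConditionFailed) {
        // The write did replicate (hence the precondition failure), but the
        // revalidating read may still miss it because the write is not yet in
        // the committed snapshot.
        if (!committedSnapshotSeesWrite()) {
            std::cout << "revalidation races the commit point: "
                         "error reported even though the write applied\n";
        }
    }
    return 0;
}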

A fix was attempted previously (SERVER-20487) that always advanced the lastVisibleOpTime whenever a write was attempted, so the client could use it for readAfterOpTime. However, this becomes an issue when the set loses its primary and never advances the opTime (SERVER-24630).

One proposed solution is a hybrid: instead of advancing the global config opTime, have the current request either use the last returned visibleOpTime as the readAfterOpTime on its next query, or make it wait for replication using getLastError with the returned visibleOpTime. Both options are sketched below.
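
Illustrative only: a sketch of the two options, assuming hypothetical helpers visibleOpTimeFromFailedWrite(), readChunksAfterOpTime(), and waitForReplication(); none of these names are the real server API.

#include <cstdint>
#include <iostream>

struct OpTime {
    std::int64_t ts;
    std::int64_t term;
};

// Hypothetical: the failed write attempt still reports the visibleOpTime it observed.
OpTime visibleOpTimeFromFailedWrite() { return {1478822400, 3}; }

// Hypothetical: option A, a read that is only answered once the node has
// applied 'after' (readAfterOpTime scoped to this one request).
bool readChunksAfterOpTime(const OpTime& after) {
    (void)after;
    return true;
}

// Hypothetical: option B, block until 'opTime' is majority-replicated
// (getLastError-style wait), then reread normally.
void waitForReplication(const OpTime& opTime) { (void)opTime; }

int main() {
    const OpTime visible = visibleOpTimeFromFailedWrite();

    // Option A: no global config opTime is advanced; only this request waits.
    bool sawCommit = readChunksAfterOpTime(visible);

    // Option B: wait for replication explicitly before revalidating.
    waitForReplication(visible);

    std::cout << (sawCommit ? "commit visible" : "commit not visible") << "\n";
    return 0;
}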



 Comments   
Comment by Randolph Tan [ 18/Dec/16 ]

I see. Closing this ticket then.

Comment by Kaloian Manassiev [ 18/Dec/16 ]

PrepareConfigsFailed is an error that can only be returned from SCCC. A precondition failure returns BadValue. Therefore we can never enter the branch that calls undoDonateChunk.

Comment by Randolph Tan [ 18/Dec/16 ]

kaloian.manassiev I haven't tried to repro it, but the scenario you explained is different from the one in my previous comment. I saw a similar situation in a v3.2 build failure for split, and it can also happen for moveChunk. This is the scenario in more detail (a control-flow sketch follows the list):

1. applyOps is sent to the config servers.
2. The config server primary steps down.
3. The shard retries the applyOps.
4. The previous applyOps happens to have replicated to the new primary, so the retried applyOps fails with "preCondition failed".
5. The validation logic checks the config with readConcern majority, but since the write has not yet been committed, it does not see the new chunk. In this case, it returns the original preCondition failed error (ref).
6. So it goes to the "if (applyOpsStatus == ErrorCodes::PrepareConfigsFailed)" branch and calls undoDonateChunk, rather than the else branch that fasserts. (ref)
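
Illustrative only: a sketch of the flow in steps 5-6, with hypothetical stand-ins for the revalidating majority read and the two recovery paths. (Per kaloian.manassiev's reply above, the precondition failure actually surfaces as BadValue, so this branch may not be reachable in practice.)

#include <cstdlib>
#include <iostream>

enum class ErrorCodes { OK, PrepareConfigsFailed, BadValue };

// Hypothetical: majority read of config.chunks; returns false while the
// retried applyOps is not yet in the committed snapshot (step 5).
bool majorityReadSeesNewChunk() { return false; }

// Hypothetical stand-ins for the two recovery paths referenced in step 6.
void undoDonateChunk() { std::cout << "undoDonateChunk: donor keeps the chunk\n"; }
[[noreturn]] void fassertFailed() { std::abort(); }

int main() {
    // Step 5: the revalidating majority read misses the committed write, so
    // the original "preCondition failed" error is kept.
    ErrorCodes applyOpsStatus = ErrorCodes::PrepareConfigsFailed;

    if (!majorityReadSeesNewChunk()) {
        // Step 6: the error lands in the PrepareConfigsFailed branch, which
        // undoes the donation instead of reaching the else branch that fasserts.
        if (applyOpsStatus == ErrorCodes::PrepareConfigsFailed) {
            undoDonateChunk();
        } else {
            fassertFailed();
        }
    }
    return 0;
}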

Comment by Kaloian Manassiev [ 16/Dec/16 ]

I looked at the way the migration commit works in 3.2 and I can confirm that this will not cause a problem for it. Specifically, the commit works like this (a rough sketch follows the list):

  1. Donor shard 'majority' writes the applyOps for the chunk hand-off
  2. If there is a network error, refresh the chunk list with a 'majority' read. There are two possible outcomes here:
    1. The refresh hits an up-to-date config server node and sees the committed chunk. Because this is a 'majority' read, it will not be rolled back, so the migration is successful.
    2. The refresh hits a stale config server node and does not see the chunk commit. In this case, we will crash the shard.
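
Illustrative only: a compact sketch of this 3.2 commit flow, with hypothetical stand-ins (majorityWriteApplyOps, majorityRefreshSeesCommittedChunk, crashShard) for the real donor-shard code.

#include <cstdlib>
#include <iostream>

enum class WriteResult { OK, NetworkError };

// Hypothetical: the donor's w:majority applyOps for the chunk hand-off (step 1).
WriteResult majorityWriteApplyOps() { return WriteResult::NetworkError; }

// Hypothetical: the majority-read refresh of the chunk list (step 2).
bool majorityRefreshSeesCommittedChunk() { return true; }

// Hypothetical: the crash path for outcome 2.2.
[[noreturn]] void crashShard() { std::abort(); }

int main() {
    // 1. Donor shard 'majority' writes the applyOps for the chunk hand-off.
    if (majorityWriteApplyOps() == WriteResult::NetworkError) {
        // 2. On a network error, refresh the chunk list with a 'majority' read.
        if (majorityRefreshSeesCommittedChunk()) {
            // 2.1 An up-to-date config node shows the committed chunk; since
            //     the read is 'majority', it cannot be rolled back: success.
            std::cout << "migration commit confirmed\n";
        } else {
            // 2.2 A stale config node does not show the commit: crash the shard.
            crashShard();
        }
    }
    return 0;
}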

For the split/merge cases renctan mentioned above, this situation is benign because it does not impact routing or filtering, and subsequent metadata operations will perform a full refresh with the most up-to-date opTime.

Given that this is not a correctness problem (see above) and that there is only one Evergreen build failure report, I am going to close this as Won't Fix.

Comment by Randolph Tan [ 02/Dec/16 ]

schwerin After re-examining this, I believe the issue is benign in the split/merge case, but it can cause lost writes for moveChunk. This is because in this scenario the applyOps is applied successfully on the config servers, while the shard undoes the increment on its shard version metadata. This is bad because any mongos that has not yet seen the new config update will continue sending writes to the shard even though it officially doesn't own the chunk anymore (and the shard thinks it still owns it).

Comment by Randolph Tan [ 11/Nov/16 ]

Correction: in split and merge, a write to config.changelog happens afterwards on success, so the opTime will have advanced and the top-level waitForWriteConcern will properly wait for w: majority.

Comment by Randolph Tan [ 11/Nov/16 ]

Note: v3.4 has a slightly different issue, since the applyOps is now initiated from within the config servers and uses local read concern to revalidate. Because the revalidation does not perform any write, it may not wait correctly for the { w: majority } replication.

Comment by Kaloian Manassiev [ 11/Nov/16 ]

Alternatively, we can always do a majority write after a failure in order to get the most up-to-date committed and visible opTimes. This is what we do during the migration commit in 3.4. A sketch of the idea is below.
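
Illustrative only: a sketch of the alternative, assuming hypothetical helpers doMajorityNoopWrite() and revalidateAfter(); the real 3.4 migration-commit code differs.

#include <cstdint>
#include <iostream>

struct OpTime {
    std::int64_t ts;
    std::int64_t term;
};

// Hypothetical: a trivial w:majority write issued only to learn an
// up-to-date committed/visible opTime after the failed commit.
OpTime doMajorityNoopWrite() { return {1478822500, 3}; }

// Hypothetical: revalidating read issued only after 'opTime' is committed.
bool revalidateAfter(const OpTime& opTime) {
    (void)opTime;
    return true;
}

int main() {
    // The majority write cannot complete until earlier oplog entries are also
    // majority-committed, so a revalidation gated on its opTime will see the
    // earlier applyOps if that write was in fact applied.
    const OpTime committed = doMajorityNoopWrite();
    std::cout << (revalidateAfter(committed) ? "commit visible" : "commit absent")
              << "\n";
    return 0;
}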
