[SERVER-25126] Return a different error code if the step down occurs after the write Created: 18/Jul/16  Updated: 16/May/17  Resolved: 25/Aug/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.3.12

Type: Task Priority: Major - P3
Reporter: Siyuan Zhou Assignee: Siyuan Zhou
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-24968 Make sure we don't retry non-idempote... Closed
Related
Backwards Compatibility: Fully Compatible
Sprint: Repl 18 (08/05/16), Repl 2016-08-29
Participants:

 Description   

We need to return a different error code if the step down occurs after vs before the write. NotMaster should always mean "no write occurred." We should audit and eliminate all cases where NotMaster gets returned even though a write happened.



 Comments   
Comment by Githook User [ 25/Aug/16 ]

Author:

{u'username': u'visualzhou', u'name': u'Siyuan Zhou', u'email': u'siyuan.zhou@mongodb.com'}

Message: SERVER-25126 Return a different error code if the step down occurs after the write
Branch: master
https://github.com/mongodb/mongo/commit/6b571fa314a9c5d193d362570bb58064d1d1fb0f

Comment by Eric Milkie [ 16/Aug/16 ]

I think for (1) above, if NotMaster is returned, the incomplete index build is reverted, even though some writes have happened. So I think it would still be okay to return NotMaster for index builds. For the others, I do not believe they rollback or revert any of the writes that have already happened.

The code is batch_downconvert.cpp is simply propagating a NotMaster error from shard to client, without changing the circumstances of its occurrence. So you're right, we will need to add a new case to handle the new error code that may come from a shard.

Comment by Siyuan Zhou [ 16/Aug/16 ]

Open question: NotMaster is used in batch_downconvert.cpp, shall we add another case here? CC milkie.

Comment by Siyuan Zhou [ 16/Aug/16 ]

Audited all occurrences of NotMaster. Most occurrences only check primary-ship once at the beginning.

1. CmdCreateIndex checks if the node is still the primary during execution and may return NotMaster several times. It seems like all checks except the first one happen after some writes.

2. mapreduce checks twice primary-ship, one in prepTempCollection(), the other in insert().

3. cloner may return NotMaster for stepdown at several different stages as well.

4. findAndModify may return NotMaster at different stages when checking checkCanAcceptWritesForDatabase() .

5. UpdateStage::restoreUpdateState returns NotMaster when it's not primary on restore. The comment says "We may have stepped down during the yield."

6. In write_ops_exec.cpp, primary-ship is checked in assertCanWrite_inlock().

The plan is to change all occurrence in findAndModify, update stage, write_ops_exec.cpp and get_executor.cpp to use the new error code PrimarySteppedDown, because we've checked the primary-ship at the beginning for every command in dbcommands.cpp. For the first three commands, I'll change the return error except when it's clear no writes have happened.

Generated at Thu Feb 08 04:08:20 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.