[SERVER-46118] Make migration_util helpers save the repl set term before finalizing migration Created: 12/Feb/20  Updated: 25/Mar/20  Resolved: 28/Feb/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Randolph Tan Assignee: Randolph Tan
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-47012 Make it more obvious that retryIdempo... Closed
Sprint: Sharding 2020-02-24, Sharding 2020-03-09
Participants:

 Description   

Instead of saving it right before every step:

https://github.com/mongodb/mongo/blob/56721bbc268e137a20974d9b6d880caaca189541/src/mongo/db/s/migration_util.cpp#L452
https://github.com/mongodb/mongo/blob/56721bbc268e137a20974d9b6d880caaca189541/src/mongo/db/s/migration_util.cpp#L514



 Comments   
Comment by Esha Maharishi (Inactive) [ 25/Mar/20 ]

When I did the write-up above, I missed that the "term check" also checks if the node is primary, meaning the local writes will not keep getting retried if the node is secondary.

I amended SERVER-47012 to just make it more obvious that this is the case.

Comment by Esha Maharishi (Inactive) [ 20/Mar/20 ]

Filed SERVER-47012.

Comment by Esha Maharishi (Inactive) [ 20/Mar/20 ]

Per stakeholder demo, the "retrying indefinitely while secondary" bug is bad enough that we should try to fix it, particularly because a warning message will keep being logged.

Comment by Esha Maharishi (Inactive) [ 19/Mar/20 ]

For more context:

If the term is saved at the beginning of each operation's loop, the operation must be safe to execute if the term changed just before the term was saved.

There are two important cases:

  1. The node stepped down just before the term was saved, meaning the operation must be safe to execute while the node is not primary.
  2. The node stepped down and back up just before the term was saved, meaning the operation must be safe to execute concurrently with an identical operation that was spawned as part of stepup recovery.

Here are all the operations run in the loop currently, and why they are safe in both cases:

    • If the node stepped down just before the term was saved, the write or waiting for writeConcern will fail. **I think this means the write will keep getting retried indefinitely while the node is secondary, because the term check will keep passing.
    • If the node stepped down and up just before the term was saved, the operation will succeed because the node is primary again, but that's ok because the write is idempotent.
  • deleteRangeDeletionTaskOnRecipient / markAsReadyRangeDeletionTaskOnRecipient / ensureChunkVersionIsGreaterThan: sends a remote write (not a retryable write)
    • This write is to communicate a decision that was made durable already, and the write is idempotent, so is fine.
  • advanceTransactionOnRecipient: sends a remote dummy retryable write to bump the remote node's txnNumber for the session
    • Bumping a transaction number is idempotent, so is fine.
  • refreshFilteringMetadataUntilSuccess: does a cache refresh
    • Since the cache refresh protocol is different for a primary and secondary, this is actually vulnerable to the bug described in SERVER-45646.
Comment by Esha Maharishi (Inactive) [ 28/Feb/20 ]

Closing as Won't Fix since we will instead do SERVER-46395.

Generated at Thu Feb 08 05:10:33 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.