[SERVER-14551] Runner yield during migration cleanup (removeRange) results in fassert
Created: 14/Jul/14  Updated: 15/Nov/14  Resolved: 21/Jul/14

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.4.10, 2.6.3
Fix Version/s: 2.6.4

Type: Bug
Priority: Major - P3
Reporter: Greg Studer
Assignee: Randolph Tan
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-15798 Helpers::removeRange does not check i... Closed
related to SERVER-14261 stepdown during migration range delet... Closed
related to SERVER-16115 Helpers::removeRange should check if ... Closed
Tested
Operating System: ALL
Steps To Reproduce:

Trigger a yield and stepdown during a migration cleanup.

 Description   
Issue Status as of Aug 6, 2014

ISSUE SUMMARY
If a runner performing chunk migration cleanup yields and the node steps down from primary during the yield, the runner resumes under the assumption that the node is still primary and attempts to write to the oplog, which triggers a fatal assertion.

The only configurations affected by this issue are sharded clusters where shards are replica sets, the balancer is enabled, and chunk migrations have occurred.

USER IMPACT
Under the conditions described above, the cleanup operation hits a fatal assertion and the node that was primary shuts down.

WORKAROUNDS
N/A

AFFECTED VERSIONS
MongoDB 2.6 production releases up to 2.6.3 are affected by this issue.

FIX VERSION
The fix is included in the 2.6.4 production release.

RESOLUTION DETAILS
During cleanup, always check the replica set status after yielding and abort the cleanup operation if the node is no longer primary.

Original description

The removeRange helper used by migration cleanup does not re-check replica set state after using a YIELD_AUTO cursor. If a yield occurs and the node steps down during it, logOp() will fail (correctly) with an fassert().

We need to either not yield or re-check replica set state before deleting the document.

Affects v2.4, does not affect v2.7 due to changes in yield behavior.



 Comments   
Comment by Randolph Tan [ 21/Jul/14 ]

Background:
There is an fassert check in the code, right before the server inserts an operation into the oplog, to make sure that the server is a primary. For this ticket, the fassert was triggered when the server changed state from primary to non-primary while removeRange (which is currently used by the cleanupOrphaned command and by migration cleanup) was iterating over the documents to be deleted. This can only happen when the cleanup thread yields, since the server needs to acquire the global write lock to step down.

Fix:
Right after retrieving a single document, check whether a yield occurred and, if so, make sure that the server is still primary before processing the write.
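
The check described above can be sketched as a small standalone simulation. This is not the server code: NodeState, the logOp signature, the afterFetch callback, and the recheckAfterYield flag are all hypothetical names introduced for illustration, and the real fassert (which aborts the process) is modeled here by throwing an exception so that the behavior can be observed.

```cpp
#include <cstddef>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the node's replica set state.
struct NodeState {
    bool isPrimary = true;
};

// Hypothetical stand-in for logOp(): the real server fasserts when a
// non-primary writes to the oplog; this sketch throws instead of aborting.
void logOp(const NodeState& node, std::vector<int>& oplog, int docId) {
    if (!node.isPrimary) {
        throw std::logic_error("fassert: oplog write on non-primary");
    }
    oplog.push_back(docId);
}

// Deletes the documents one at a time and returns how many were deleted.
// afterFetch(i) models the window right after document i is retrieved, in
// which the cursor may yield; it returns true if a yield happened.
std::size_t removeRange(NodeState& node,
                        const std::vector<int>& docs,
                        std::vector<int>& oplog,
                        const std::function<bool(std::size_t)>& afterFetch,
                        bool recheckAfterYield) {
    std::size_t deleted = 0;
    for (std::size_t i = 0; i < docs.size(); ++i) {
        const bool yielded = afterFetch(i);
        // The fix: after a yield, confirm the node is still primary and
        // abort the cleanup instead of attempting the oplog write.
        if (recheckAfterYield && yielded && !node.isPrimary) {
            std::cerr << "warning: not primary anymore, aborting cleanup\n";
            break;
        }
        logOp(node, oplog, docs[i]);
        ++deleted;
    }
    return deleted;
}
```

With recheckAfterYield set to false, a stepdown during the yield reaches logOp and raises the simulated fassert; with it set to true, the loop stops cleanly and leaves the remaining documents in place.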

Tested by hand.

Setup:

  • 2 shards, each a 2-node replica set.
  • Artificially inserted code to yield for 5 seconds right after iterating to the next document (https://github.com/mongodb/mongo/blob/r2.6.3/src/mongo/db/dbhelpers.cpp#L383).
  • Spawn another shell process that kills the secondary right after moveChunk is called, so that the primary steps down (the donor shard should hold a few documents; tested with 3).
  • Make sure that the warning "not primary anymore" is logged.
Comment by Githook User [ 21/Jul/14 ]

Author: Randolph Tan (renctan) <randolph@10gen.com>

Message: SERVER-14551 Runner yield during migration cleanup (removeRange) results in fassert
Branch: v2.6
https://github.com/mongodb/mongo/commit/c4bfb68647de2835aec1888b55a47ab2813c6e20
