[SERVER-14261] stepdown during migration range delete can abort mongod Created: 16/Jun/14 Updated: 11/Mar/15 Resolved: 29/Jul/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.6.1 |
| Fix Version/s: | 2.6.4, 2.7.3 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kay Agahd | Assignee: | Randolph Tan |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Completed: | |
| Participants: | |
| Description |
| Comments |
| Comment by Ramon Fernandez Marina [ 28/Aug/14 ] | ||||||||
|
kay.agahd@idealo.de, the workaround does prevent the crash reported in this ticket. The crash you reported in Regards, | ||||||||
| Comment by Kay Agahd [ 08/Aug/14 ] | ||||||||
This workaround didn't work for us. waitForDelete=true didn't help, and even with the balancer disabled, mongo crashed. | ||||||||
| Comment by Githook User [ 21/Jul/14 ] | ||||||||
|
Author: {u'username': u'renctan', u'name': u'Randolph Tan', u'email': u'randolph@10gen.com'} Message: | ||||||||
| Comment by Randolph Tan [ 23/Jun/14 ] | ||||||||
|
Fixed by commit: | ||||||||
| Comment by Kay Agahd [ 17/Jun/14 ] | ||||||||
|
Thanks Eric, much appreciated! | ||||||||
| Comment by Eric Milkie [ 17/Jun/14 ] | ||||||||
|
The error can happen whenever a node steps down for any reason. By not executing the replSetFreezeCommand, you are likely, but not guaranteed, to avoid the problem. | ||||||||
| Comment by Kay Agahd [ 16/Jun/14 ] | ||||||||
|
If the node stepped down for other reasons, it wouldn't have received a replSetFreezeCommand, since we only send this command manually. So the error shouldn't occur then. Does this make sense? | ||||||||
| Comment by Randolph Tan [ 16/Jun/14 ] | ||||||||
|
kay.agahd@idealo.de You can do that as well, but the issue can still occur if the node steps down for other reasons. | ||||||||
| Comment by Kay Agahd [ 16/Jun/14 ] | ||||||||
|
Thank you for the workaround Randolph! | ||||||||
| Comment by Randolph Tan [ 16/Jun/14 ] | ||||||||
|
kay.agahd@idealo.de One way to work around this issue while waiting for the fix is to change the balancer setting so that it always performs synchronous migration cleanup. This can be done by setting the _waitForDelete field in the config.settings document to true like this:
The side effect is that migrations will now take longer to complete, because each one will always wait for the deletion to finish. | ||||||||
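The snippet from the original comment is not preserved in this export. As a minimal sketch of the setting described above, assuming the balancer settings live in the config.settings document with _id "balancer" and that pymongo is used against a mongos, the update could look like the following. The function names are hypothetical and only illustrate the shape of the change.

```python
def wait_for_delete_update(enabled=True):
    """Build the (filter, update) pair for the balancer settings document."""
    # Assumption: the balancer settings document in config.settings has _id "balancer".
    query = {"_id": "balancer"}
    update = {"$set": {"_waitForDelete": bool(enabled)}}
    return query, update


def apply_wait_for_delete(client, enabled=True):
    """Apply the setting through a client object (e.g. a pymongo MongoClient on a mongos)."""
    query, update = wait_for_delete_update(enabled)
    # upsert=True creates the balancer document if it does not exist yet.
    return client["config"]["settings"].update_one(query, update, upsert=True)
```

Calling apply_wait_for_delete(client, False) would switch back to asynchronous cleanup once a fixed version is installed.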
| Comment by Randolph Tan [ 16/Jun/14 ] | ||||||||
|
This is a bug and it is scheduled to be fixed in 2.6.3. | ||||||||
| Comment by Kay Agahd [ 16/Jun/14 ] | ||||||||
|
Am I right that this is a bug? Will you fix it, or should we avoid the replSetFreeze command? | ||||||||
| Comment by Randolph Tan [ 16/Jun/14 ] | ||||||||
|
Based on the logs, the node stepped down after receiving a replSetFreeze command. The rangeDeleter thread asserted because the node was no longer primary, but the exception was never caught, so it terminated the server. | ||||||||
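The failure mode described above can be illustrated with a small model. This is only a sketch of the general pattern (catching the step-down error at the worker thread boundary), not the actual server fix. All names here are hypothetical. In Python an uncaught exception in a thread does not terminate the process, whereas in the server it did, so the code shows the handling pattern rather than reproducing the crash.

```python
import threading


class NotPrimaryError(Exception):
    """Hypothetical stand-in for the 'node is no longer primary' assertion."""


def delete_range(is_primary):
    # The delete asserts that the node is still primary before proceeding.
    if not is_primary():
        raise NotPrimaryError("stepped down during range delete")
    return "deleted"


def range_deleter_worker(is_primary, results):
    """Run one delete and catch the step-down error at the thread boundary."""
    try:
        results.append(delete_range(is_primary))
    except NotPrimaryError as exc:
        # Record the abort instead of letting the exception escape the thread.
        results.append("aborted: %s" % exc)


def run_once(is_primary):
    """Run a single delete on a worker thread and return its outcome."""
    results = []
    worker = threading.Thread(target=range_deleter_worker, args=(is_primary, results))
    worker.start()
    worker.join()
    return results[0]
```

With this structure, a step-down during the delete produces an aborted result for that one task and leaves the rest of the process running.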
| Comment by Kay Agahd [ 16/Jun/14 ] | ||||||||
|
Greg, I've uploaded the log file mongod.log.1.tgz to you:
If you need more information, please let me know. | ||||||||
| Comment by Greg Studer [ 16/Jun/14 ] | ||||||||
|
Is there any more of the log file you can provide us from this mongod - ideally something close to an hour before/after? This looks like it happened due to a replica set state change while waiting for replication, but more context would help us further confirm this. |