[SERVER-29293] Recipient shard fails to abort migration on stepdown Created: 19/May/17 Updated: 30/Oct/23 Resolved: 14/Jul/17
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.4.4, 3.5.10 |
| Fix Version/s: | 3.4.11, 3.5.11 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Wayne Wang | Assignee: | Nathan Myers |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v3.4 |
| Steps To Reproduce: | No reliable steps to reproduce are known. |
| Participants: | |
| Case: | (copied to CRM) |
| Description | condition: |
| Comments |
| Comment by Githook User [ 09/Nov/17 ] |

Author: Nathan Myers <nathan.myers@10gen.com>
Message: (cherry picked from commit 6d921f47c0fcb29266c57286f1dd0d411cc459f0)
| Comment by Githook User [ 14/Jul/17 ] |

Author: Nathan Myers <nathan.myers@10gen.com>
Message:
| Comment by Nathan Myers [ 29/Jun/17 ] |

BTW, I would bet that the 600-minute timeout was supposed to be 600 seconds.
| Comment by Nathan Myers [ 29/Jun/17 ] |

This log is from a shard primary preparing to request documents in a chunk from a donor shard. The relevant part of the (filtered) log seems to be:

This is followed, ten hours later and after many primary step-ups and step-downs, by the migration aborting. By then, the structure asserted about no longer exists, so the assertion is correct. The failure appears to be that the migration is not aborted when the "Periodic reload of shard registry" fails, or, at the latest, when the receiving host steps down as primary.
| Comment by Wayne Wang [ 27/Jun/17 ] |

The same problem occurs almost every day, so I must manually restart the mongod process on the corresponding shard:

[root@203-01 ~]# ps -e l | grep mongo
[root@203-01 ~]# /home/mongodb3.4.4/bin/mongod --shardsvr --replSet shard1ReplSet --port 22001 --dbpath /home/mongodb3.4.4/shard1/data --logpath /home/mongodb3.4.4/shard1/log/shard1.log --fork --nojournal
| Comment by Ramon Fernandez Marina [ 19/May/17 ] |

Thanks for your report wayne80, and for uploading the full logs. This is MongoDB 3.4.4, and here are the relevant debug symbols. The demangled stack trace is below:

Stay tuned for updates. Regards,