[SERVER-21192] Failed shutdown of removed shard member Created: 29/Oct/15 Updated: 14/Apr/16 Resolved: 11/Dec/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.2.0-rc1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Cailin Nelson | Assignee: | Siyuan Zhou |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Repl C (11/20/15), Repl D (12/11/15), Repl E (01/08/16) |
| Participants: |
| Description |
|
Not sure which of these steps might be significant:
1. Start with a 2-shard CSRS cluster. Each shard is PSA.

What I observed was:
1. At step 4, the removed node did not shut down. Here is the process - you can see it has been running for many hours:
2. At step 5, the error message I got was unusual. I got:

I was expecting the usual message that indicates that mongod recognized that there was still a lock file from a previously running process.

Logs attached:
|
| Comments |
| Comment by Siyuan Zhou [ 11/Dec/15 ] |
|
Hi cailin.nelson and eric.daniels@10gen.com, this issue has not recurred in the past month and we cannot reproduce it. Would you mind if I close it as "Cannot reproduce" for now? We can reopen it if it happens again in the future. |
| Comment by Siyuan Zhou [ 29/Oct/15 ] |
|
Looking into the logs. Here are some observations and questions:
1. It seems the cluster was upgraded through the CSRS upgrade process, as the log has several "changing hosts" messages.
However, mongod hung during or after the shutdown of replication. Between "Stopping replication applier threads" and "now exiting", mongod also shuts down sharding.
I'll keep investigating to narrow down the root cause. If this happens again, I would like to attach gdb to the process to see where it actually hangs. |
| Comment by Eric Milkie [ 29/Oct/15 ] |
|
The log tells me it hung while waiting for replication to shut down. We'll look into this. I think we may have switched the order of initialization so that we open the sockets prior to checking for the lock file. I'll file a ticket about that. |