[SERVER-21192] Failed shutdown of removed shard member Created: 29/Oct/15  Updated: 14/Apr/16  Resolved: 11/Dec/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.0-rc1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Cailin Nelson Assignee: Siyuan Zhou
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongodb.1.log, Text File mongodb.2.log
Issue Links: Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl C (11/20/15), Repl D (12/11/15), Repl E (01/08/16)
Participants:

 Description   

I'm not sure which of these steps is significant (a rough command sketch follows the list):

1. Start with a 2-shard CSRS cluster. Each shard is a PSA (primary, secondary, arbiter) replica set.
2. Add a third shard.
3. Remove one secondary (S) from the second shard.
4. Shut down the removed secondary.
5. Eventually give up waiting, "assume" the node has shut down, and attempt to restart it.
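
Roughly, the commands behind steps 2-5 look like the following. This is just a sketch: the host names, ports, and replica set names are placeholders rather than the exact members from my cluster; the config path is the one from the ps output below.

# Step 2: add the third shard (run against a mongos; "shard3/..." is a placeholder seed list).
mongo mongos-host:27017 --eval 'sh.addShard("shard3/shard3-a:27020")'
# Step 3: remove the secondary from the second shard's replica set (run against that shard's primary).
mongo shard2-primary:27018 --eval 'rs.remove("shard2-secondary:27019")'
# Step 4: ask the removed secondary to shut down cleanly.
mongo shard2-secondary:27019/admin --eval 'db.shutdownServer()'
# Step 5: after giving up on the shutdown, attempt a restart with the same config file.
mongod -f /tmp/data/red_red_1_4/automation-mongod.conf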

What I observed was:

1. At step 4, the removed node did not shut down.

Here is the process - you can see it has been running for many hours:

[red_red_1_4]$ ps -ef | grep "/tmp/data" | grep 87564
  501 87564     1   0  4:43PM ??         7:00.96 /var/lib/mongodb-mms-automation/mongodb-osx-x86_64-3.2.0-rc1/bin/mongod -f /tmp/data/red_red_1_4/automation-mongod.conf
[red_red_1_4]$ cat mongod.lock 
87564
[red_red_1_4]$ date
Thu Oct 29 01:18:31 UTC 2015

2. At step 5, the error message I got was unusual:

2015-10-29T01:08:39.668+0000 E NETWORK  [initandlisten] listen(): bind() failed errno:48 Address already in use for socket: 0.0.0.0:28004
2015-10-29T01:08:39.668+0000 E NETWORK  [initandlisten]   addr already in use
2015-10-29T01:08:39.668+0000 E STORAGE  [initandlisten] Failed to set up sockets during startup.
2015-10-29T01:08:39.668+0000 I CONTROL  [initandlisten] dbexit:  rc: 48

I was expecting the usual message indicating that mongod had detected a lock file from a previously running process.
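
As a point of comparison, a minimal pre-restart check along these lines would have caught the still-running process. This is just a sketch of my own (assuming the dbpath is /tmp/data/red_red_1_4), not something mongod or the automation agent actually does:

# Sketch: refuse to restart while the PID recorded in mongod.lock is still alive.
pid=$(cat /tmp/data/red_red_1_4/mongod.lock)
if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
  echo "mongod (pid $pid) is still running; not restarting"
else
  mongod -f /tmp/data/red_red_1_4/automation-mongod.conf
fi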

Logs attached:

  • mongodb.1.log - steps 1 through 4, including the failed shutdown
  • mongodb.2.log - step 5 - the attempt to restart even though the old process is still running


 Comments   
Comment by Siyuan Zhou [ 11/Dec/15 ]

Hi cailin.nelson and eric.daniels@10gen.com,

It seems that this issue hasn't happened in the past month and we cannot reproduce it. Would you mind if I close it as "Cannot Reproduce" for now? We can reopen it if it happens again in the future.

Comment by Siyuan Zhou [ 29/Oct/15 ]

Looking into the logs. Here are some observations and questions.

1. It seems the cluster was upgraded through the CSRS upgrade process, as the log has several "changing hosts" messages.
2. How did you remove the secondary from the second shard? I assumed it was a reconfig that omitted the secondary, but I cannot find log messages about that.
3. This mongod (cailinmac:28004) took over as primary in term 2 and stepped down after seeing term 3, which seems suspicious in a stable environment. I also observed that the node was quite slow around 20:40:03 when replying to heartbeats. The failover and slowness may not be relevant to this issue.
4. A normal shutdown by signal looks like the following.

[js_test:upsert_sharded] 2015-10-29T18:30:01.082+0000 c20014| 2015-10-29T18:30:01.081+0000 I CONTROL  [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
[js_test:upsert_sharded] 2015-10-29T18:30:01.083+0000 c20014| 2015-10-29T18:30:01.081+0000 I FTDC     [signalProcessingThread] Stopping full-time diagnostic data capture
[js_test:upsert_sharded] 2015-10-29T18:30:01.083+0000 c20014| 2015-10-29T18:30:01.082+0000 I REPL     [signalProcessingThread] Stopping replication applier threads
[js_test:upsert_sharded] 2015-10-29T18:30:02.086+0000 c20014| 2015-10-29T18:30:02.086+0000 I CONTROL  [signalProcessingThread] now exiting
[js_test:upsert_sharded] 2015-10-29T18:30:02.087+0000 c20014| 2015-10-29T18:30:02.086+0000 I NETWORK  [signalProcessingThread] shutdown: going to close listening sockets...
[js_test:upsert_sharded] 2015-10-29T18:30:02.087+0000 c20014| 2015-10-29T18:30:02.086+0000 I NETWORK  [signalProcessingThread] closing listening socket: 29
[js_test:upsert_sharded] 2015-10-29T18:30:02.087+0000 c20014| 2015-10-29T18:30:02.086+0000 I NETWORK  [signalProcessingThread] closing listening socket: 30
[js_test:upsert_sharded] 2015-10-29T18:30:02.088+0000 c20014| 2015-10-29T18:30:02.086+0000 I NETWORK  [signalProcessingThread] removing socket file: /tmp/mongodb-20014.sock
[js_test:upsert_sharded] 2015-10-29T18:30:02.088+0000 c20014| 2015-10-29T18:30:02.086+0000 I NETWORK  [signalProcessingThread] shutdown: going to flush diaglog...
[js_test:upsert_sharded] 2015-10-29T18:30:02.088+0000 c20014| 2015-10-29T18:30:02.086+0000 I NETWORK  [signalProcessingThread] shutdown: going to close sockets...
[js_test:upsert_sharded] 2015-10-29T18:30:02.088+0000 c20014| 2015-10-29T18:30:02.086+0000 I STORAGE  [signalProcessingThread] WiredTigerKVEngine shutting down
[js_test:upsert_sharded] 2015-10-29T18:30:02.088+0000 c20014| 2015-10-29T18:30:02.087+0000 I NETWORK  [conn1] end connection 127.0.0.1:60671 (1 connection now open)
[js_test:upsert_sharded] 2015-10-29T18:30:02.089+0000 c20014| 2015-10-29T18:30:02.087+0000 I NETWORK  [conn7] end connection 10.65.153.62:33179 (0 connections now open)
[js_test:upsert_sharded] 2015-10-29T18:30:02.267+0000 c20014| 2015-10-29T18:30:02.266+0000 I STORAGE  [signalProcessingThread] shutdown: removing fs lock...
[js_test:upsert_sharded] 2015-10-29T18:30:02.267+0000 c20014| 2015-10-29T18:30:02.266+0000 I CONTROL  [signalProcessingThread] dbexit:  rc: 0

However, this mongod hung during or after the shutdown of replication. Between "Stopping replication applier threads" and "now exiting", mongod also shuts down sharding.

2015-10-29T01:04:47.402+0000 I COMMAND  [conn445] terminating, shutdown command received
2015-10-29T01:04:47.402+0000 I FTDC     [conn445] Stopping full-time diagnostic data capture
2015-10-29T01:04:47.406+0000 I REPL     [conn445] Stopping replication applier threads
2015-10-29T01:04:47.425+0000 I STORAGE  [conn443] got request after shutdown()
2015-10-29T01:04:47.426+0000 I STORAGE  [conn444] got request after shutdown()
2015-10-29T01:04:51.688+0000 I STORAGE  [conn236] got request after shutdown()
2015-10-29T01:04:51.688+0000 I STORAGE  [conn237] got request after shutdown()
2015-10-29T01:05:46.493+0000 I STORAGE  [conn226] got request after shutdown()
2015-10-29T01:06:46.205+0000 I STORAGE  [conn223] got request after shutdown()

I'll keep investigating to narrow down the root cause. If this happens again, I would like to attach gdb to the process to see where it actually hangs.
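
Roughly, that would look like the following (pid 87564 taken from the ps output above; since this is an OS X host, lldb may have to stand in for gdb):

# Dump all thread backtraces from the hung mongod, then detach.
gdb -p 87564 -batch -ex 'thread apply all bt' > mongod-87564-stacks.txt
# lldb equivalent on OS X:
# lldb -p 87564 -o 'thread backtrace all' -o 'detach' -o 'quit' > mongod-87564-stacks.txt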

Comment by Eric Milkie [ 29/Oct/15 ]

The log tells me it hung while waiting for replication to shut down. We'll look into this.

I think we may have switched the order of initialization so that we open the sockets prior to checking for the lock file. I'll file a ticket about that.
