[SERVER-21118] A slaveDelay'd member in sleep gets stuck in shutdown Created: 26/Oct/15  Updated: 07/Dec/16  Resolved: 23/Aug/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.0.0, 3.2.0-rc0
Fix Version/s: 3.3.12

Type: Bug Priority: Major - P3
Reporter: Joanna Cheng Assignee: Mathias Stearn
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-25546 Handle slaveDelay as part of op batch... Closed
related to SERVER-21004 Implement wait objects that support o... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:
  1. Start a 3 node replica set with one member delayed
  2. Issue a write to trigger the delayed member to go into a sleep
  3. Try to issue db.shutdownServer() (or send a signal) on the delayed node
Sprint: Repl 2016-08-29
Participants:

 Description   

When a delayed secondary sees an oplog entry it doesn't need to apply yet, it goes into a sleep.

2015-10-26T12:00:27.413+1100 W REPL     [rsSync] replSet slavedelay causing a long sleep of 269 seconds

If the member is attempted to be shut down during this sleep, it gets stuck.

This happens in 3.0.7 and 3.2.0-rc0 (note the timestamps)

2015-10-26T12:00:27.413+1100 W REPL     [rsSync] replSet slavedelay causing a long sleep of 269 seconds
2015-10-26T12:00:27.930+1100 I NETWORK  [initandlisten] connection accepted from 10.8.1.231:58065 #5 (1 connection now open)
2015-10-26T12:00:27.932+1100 I REPL     [ReplicationExecutor] replSetElect voting yea for Joannas-MacBook-Pro.local:25217 (0)
2015-10-26T12:00:28.050+1100 I NETWORK  [initandlisten] connection accepted from 10.8.1.231:58066 #6 (2 connections now open)
2015-10-26T12:00:28.209+1100 I REPL     [ReplicationExecutor] Member Joannas-MacBook-Pro.local:25217 is now in state PRIMARY
2015-10-26T12:00:57.984+1100 I NETWORK  [conn5] end connection 10.8.1.231:58065 (1 connection now open)
2015-10-26T12:00:57.985+1100 I NETWORK  [initandlisten] connection accepted from 10.8.1.231:58111 #7 (2 connections now open)
2015-10-26T12:00:58.128+1100 I NETWORK  [conn6] end connection 10.8.1.231:58066 (1 connection now open)
2015-10-26T12:00:58.129+1100 I NETWORK  [initandlisten] connection accepted from 10.8.1.231:58112 #8 (2 connections now open)
2015-10-26T12:01:07.048+1100 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:58129 #9 (3 connections now open)
2015-10-26T12:01:07.065+1100 I COMMAND  [conn9] terminating, shutdown command received
2015-10-26T12:01:07.066+1100 I REPL     [conn9] Stopping replication applier threads
2015-10-26T12:01:08.019+1100 I STORAGE  [conn7] got request after shutdown()
2015-10-26T12:01:08.152+1100 I STORAGE  [conn8] got request after shutdown()
2015-10-26T12:04:57.563+1100 I CONTROL  [conn9] now exiting
2015-10-26T12:04:57.564+1100 I NETWORK  [conn9] shutdown: going to close listening sockets...
2015-10-26T12:04:57.564+1100 I NETWORK  [conn9] closing listening socket: 7
2015-10-26T12:04:57.565+1100 I NETWORK  [conn9] closing listening socket: 8
2015-10-26T12:04:57.565+1100 I NETWORK  [conn9] removing socket file: /tmp/mongodb-25219.sock
2015-10-26T12:04:57.569+1100 I NETWORK  [conn9] shutdown: going to flush diaglog...
2015-10-26T12:04:57.569+1100 I NETWORK  [conn9] shutdown: going to close sockets...
2015-10-26T12:04:57.569+1100 I STORAGE  [conn9] shutdown: waiting for fs preallocator...
2015-10-26T12:04:57.570+1100 I STORAGE  [conn9] shutdown: final commit...
2015-10-26T12:04:57.585+1100 I JOURNAL  [conn9] journalCleanup...
2015-10-26T12:04:57.585+1100 I JOURNAL  [conn9] removeJournalFiles
2015-10-26T12:04:57.586+1100 I JOURNAL  [conn9] Terminating durability thread ...
2015-10-26T12:04:57.684+1100 I JOURNAL  [journal writer] Journal writer thread stopped
2015-10-26T12:04:57.684+1100 I JOURNAL  [durability] Durability thread stopped
2015-10-26T12:04:57.685+1100 I STORAGE  [conn9] shutdown: closing all files...
2015-10-26T12:04:57.687+1100 I STORAGE  [conn9] closeAllFiles() finished
2015-10-26T12:04:57.687+1100 I STORAGE  [conn9] shutdown: removing fs lock...
2015-10-26T12:04:57.688+1100 I CONTROL  [conn9] dbexit:  rc: 0

This is a regression - running the same test in 2.6 allows the mongod to shut down

2015-10-26T11:47:13.262+1100 [rsSync] replSet slavedelay sleep long time: 3600
2015-10-26T11:47:23.345+1100 [conn125] end connection 10.8.1.231:56443 (1 connection now open)
2015-10-26T11:47:23.345+1100 [initandlisten] connection accepted from 10.8.1.231:56472 #127 (2 connections now open)
2015-10-26T11:47:24.456+1100 [conn126] end connection 10.8.1.231:56445 (1 connection now open)
2015-10-26T11:47:24.457+1100 [initandlisten] connection accepted from 10.8.1.231:56475 #128 (2 connections now open)
2015-10-26T11:47:25.345+1100 [initandlisten] connection accepted from 127.0.0.1:56476 #129 (3 connections now open)
2015-10-26T11:47:50.514+1100 [conn129] assertion 13435 not master and slaveOk=false ns:test.system.namespaces query:{}
2015-10-26T11:47:53.417+1100 [conn127] end connection 10.8.1.231:56472 (2 connections now open)
2015-10-26T11:47:53.417+1100 [initandlisten] connection accepted from 10.8.1.231:56523 #130 (3 connections now open)
2015-10-26T11:47:54.518+1100 [conn128] end connection 10.8.1.231:56475 (2 connections now open)
2015-10-26T11:47:54.519+1100 [initandlisten] connection accepted from 10.8.1.231:56524 #131 (3 connections now open)
2015-10-26T11:47:55.487+1100 [conn129] assertion 13435 not master and slaveOk=false ns:admin.system.namespaces query:{}
2015-10-26T11:47:56.207+1100 [conn129] terminating, shutdown command received
2015-10-26T11:47:56.207+1100 [conn129] dbexit: shutdown called
2015-10-26T11:47:56.208+1100 [conn129] shutdown: going to close listening sockets...
2015-10-26T11:47:56.208+1100 [conn129] closing listening socket: 9
2015-10-26T11:47:56.208+1100 [conn129] closing listening socket: 10
2015-10-26T11:47:56.209+1100 [conn129] removing socket file: /tmp/mongodb-25219.sock
2015-10-26T11:47:56.211+1100 [conn129] shutdown: going to flush diaglog...
2015-10-26T11:47:56.211+1100 [conn129] shutdown: going to close sockets...
2015-10-26T11:47:56.212+1100 [conn129] shutdown: waiting for fs preallocator...
2015-10-26T11:47:56.212+1100 [rsBackgroundSync] Socket recv() errno:9 Bad file descriptor 10.8.1.231:25217
2015-10-26T11:47:56.212+1100 [conn129] shutdown: lock for final commit...
2015-10-26T11:47:56.212+1100 [conn131] end connection 10.8.1.231:56524 (2 connections now open)
2015-10-26T11:47:56.212+1100 [rsBackgroundSync] SocketException: remote: 10.8.1.231:25217 error: 9001 socket exception [RECV_ERROR] server [10.8.1.231:25217]
2015-10-26T11:47:56.212+1100 [conn130] end connection 10.8.1.231:56523 (2 connections now open)
2015-10-26T11:47:56.213+1100 [conn129] shutdown: final commit...
2015-10-26T11:47:56.214+1100 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: Joannas-MacBook-Pro.local:25217
2015-10-26T11:47:56.223+1100 [conn129] shutdown: closing all files...
2015-10-26T11:47:56.223+1100 [conn129] closeAllFiles() finished
2015-10-26T11:47:56.224+1100 [conn129] journalCleanup...
2015-10-26T11:47:56.224+1100 [conn129] removeJournalFiles
2015-10-26T11:47:56.225+1100 [conn129] shutdown: removing fs lock...
2015-10-26T11:47:56.225+1100 [conn129] dbexit: really exiting now



 Comments   
Comment by Githook User [ 22/Sep/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-21118 test that we don't hang at shutdown with slaveDelay

Part of the SERVER-26016 testing work.
Branch: master
https://github.com/mongodb/mongo/commit/bf86770c8a5de97b30bc008ad59e34de99065c60

Comment by Mathias Stearn [ 23/Aug/16 ]

Fixed by SERVER-25546. The secondaries now wait at most 1 second then verify that they should keep sleeping or stop due to shutdown or reconfig with shorter slaveDelay, etc.

Comment by Mathias Stearn [ 22/Aug/16 ]

I think this will be addressed by my work on SERVER-25546.

Comment by Eric Milkie [ 03/Nov/15 ]

Hi Scott,
Can you describe how it is a blocker for you? I may have some workaround ideas, but I'd like to understand how it affects you first. Thanks!

Comment by Scott Simpson [ 03/Nov/15 ]

This is a blocker for us deploying a new version of Mongo in production at Upwork. Would appreciate an update.

Generated at Thu Feb 08 03:56:21 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.