Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-21118

A slaveDelay'd member in sleep gets stuck in shutdown

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • 3.3.12
    • Affects Version/s: 3.0.0, 3.2.0-rc0
    • Component/s: Replication
    • None
    • Fully Compatible
    • ALL
      1. Start a 3 node replica set with one member delayed
      2. Issue a write to trigger the delayed member to go into a sleep
      3. Try to issue db.shutdownServer() (or send a signal) on the delayed node
    • Repl 2016-08-29

      When a delayed secondary sees an oplog entry it doesn't need to apply yet, it goes into a sleep.

      2015-10-26T12:00:27.413+1100 W REPL     [rsSync] replSet slavedelay causing a long sleep of 269 seconds
      

      If the member is attempted to be shut down during this sleep, it gets stuck.

      This happens in 3.0.7 and 3.2.0-rc0 (note the timestamps)

      2015-10-26T12:00:27.413+1100 W REPL     [rsSync] replSet slavedelay causing a long sleep of 269 seconds
      2015-10-26T12:00:27.930+1100 I NETWORK  [initandlisten] connection accepted from 10.8.1.231:58065 #5 (1 connection now open)
      2015-10-26T12:00:27.932+1100 I REPL     [ReplicationExecutor] replSetElect voting yea for Joannas-MacBook-Pro.local:25217 (0)
      2015-10-26T12:00:28.050+1100 I NETWORK  [initandlisten] connection accepted from 10.8.1.231:58066 #6 (2 connections now open)
      2015-10-26T12:00:28.209+1100 I REPL     [ReplicationExecutor] Member Joannas-MacBook-Pro.local:25217 is now in state PRIMARY
      2015-10-26T12:00:57.984+1100 I NETWORK  [conn5] end connection 10.8.1.231:58065 (1 connection now open)
      2015-10-26T12:00:57.985+1100 I NETWORK  [initandlisten] connection accepted from 10.8.1.231:58111 #7 (2 connections now open)
      2015-10-26T12:00:58.128+1100 I NETWORK  [conn6] end connection 10.8.1.231:58066 (1 connection now open)
      2015-10-26T12:00:58.129+1100 I NETWORK  [initandlisten] connection accepted from 10.8.1.231:58112 #8 (2 connections now open)
      2015-10-26T12:01:07.048+1100 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:58129 #9 (3 connections now open)
      2015-10-26T12:01:07.065+1100 I COMMAND  [conn9] terminating, shutdown command received
      2015-10-26T12:01:07.066+1100 I REPL     [conn9] Stopping replication applier threads
      2015-10-26T12:01:08.019+1100 I STORAGE  [conn7] got request after shutdown()
      2015-10-26T12:01:08.152+1100 I STORAGE  [conn8] got request after shutdown()
      2015-10-26T12:04:57.563+1100 I CONTROL  [conn9] now exiting
      2015-10-26T12:04:57.564+1100 I NETWORK  [conn9] shutdown: going to close listening sockets...
      2015-10-26T12:04:57.564+1100 I NETWORK  [conn9] closing listening socket: 7
      2015-10-26T12:04:57.565+1100 I NETWORK  [conn9] closing listening socket: 8
      2015-10-26T12:04:57.565+1100 I NETWORK  [conn9] removing socket file: /tmp/mongodb-25219.sock
      2015-10-26T12:04:57.569+1100 I NETWORK  [conn9] shutdown: going to flush diaglog...
      2015-10-26T12:04:57.569+1100 I NETWORK  [conn9] shutdown: going to close sockets...
      2015-10-26T12:04:57.569+1100 I STORAGE  [conn9] shutdown: waiting for fs preallocator...
      2015-10-26T12:04:57.570+1100 I STORAGE  [conn9] shutdown: final commit...
      2015-10-26T12:04:57.585+1100 I JOURNAL  [conn9] journalCleanup...
      2015-10-26T12:04:57.585+1100 I JOURNAL  [conn9] removeJournalFiles
      2015-10-26T12:04:57.586+1100 I JOURNAL  [conn9] Terminating durability thread ...
      2015-10-26T12:04:57.684+1100 I JOURNAL  [journal writer] Journal writer thread stopped
      2015-10-26T12:04:57.684+1100 I JOURNAL  [durability] Durability thread stopped
      2015-10-26T12:04:57.685+1100 I STORAGE  [conn9] shutdown: closing all files...
      2015-10-26T12:04:57.687+1100 I STORAGE  [conn9] closeAllFiles() finished
      2015-10-26T12:04:57.687+1100 I STORAGE  [conn9] shutdown: removing fs lock...
      2015-10-26T12:04:57.688+1100 I CONTROL  [conn9] dbexit:  rc: 0
      

      This is a regression - running the same test in 2.6 allows the mongod to shut down

      2015-10-26T11:47:13.262+1100 [rsSync] replSet slavedelay sleep long time: 3600
      2015-10-26T11:47:23.345+1100 [conn125] end connection 10.8.1.231:56443 (1 connection now open)
      2015-10-26T11:47:23.345+1100 [initandlisten] connection accepted from 10.8.1.231:56472 #127 (2 connections now open)
      2015-10-26T11:47:24.456+1100 [conn126] end connection 10.8.1.231:56445 (1 connection now open)
      2015-10-26T11:47:24.457+1100 [initandlisten] connection accepted from 10.8.1.231:56475 #128 (2 connections now open)
      2015-10-26T11:47:25.345+1100 [initandlisten] connection accepted from 127.0.0.1:56476 #129 (3 connections now open)
      2015-10-26T11:47:50.514+1100 [conn129] assertion 13435 not master and slaveOk=false ns:test.system.namespaces query:{}
      2015-10-26T11:47:53.417+1100 [conn127] end connection 10.8.1.231:56472 (2 connections now open)
      2015-10-26T11:47:53.417+1100 [initandlisten] connection accepted from 10.8.1.231:56523 #130 (3 connections now open)
      2015-10-26T11:47:54.518+1100 [conn128] end connection 10.8.1.231:56475 (2 connections now open)
      2015-10-26T11:47:54.519+1100 [initandlisten] connection accepted from 10.8.1.231:56524 #131 (3 connections now open)
      2015-10-26T11:47:55.487+1100 [conn129] assertion 13435 not master and slaveOk=false ns:admin.system.namespaces query:{}
      2015-10-26T11:47:56.207+1100 [conn129] terminating, shutdown command received
      2015-10-26T11:47:56.207+1100 [conn129] dbexit: shutdown called
      2015-10-26T11:47:56.208+1100 [conn129] shutdown: going to close listening sockets...
      2015-10-26T11:47:56.208+1100 [conn129] closing listening socket: 9
      2015-10-26T11:47:56.208+1100 [conn129] closing listening socket: 10
      2015-10-26T11:47:56.209+1100 [conn129] removing socket file: /tmp/mongodb-25219.sock
      2015-10-26T11:47:56.211+1100 [conn129] shutdown: going to flush diaglog...
      2015-10-26T11:47:56.211+1100 [conn129] shutdown: going to close sockets...
      2015-10-26T11:47:56.212+1100 [conn129] shutdown: waiting for fs preallocator...
      2015-10-26T11:47:56.212+1100 [rsBackgroundSync] Socket recv() errno:9 Bad file descriptor 10.8.1.231:25217
      2015-10-26T11:47:56.212+1100 [conn129] shutdown: lock for final commit...
      2015-10-26T11:47:56.212+1100 [conn131] end connection 10.8.1.231:56524 (2 connections now open)
      2015-10-26T11:47:56.212+1100 [rsBackgroundSync] SocketException: remote: 10.8.1.231:25217 error: 9001 socket exception [RECV_ERROR] server [10.8.1.231:25217]
      2015-10-26T11:47:56.212+1100 [conn130] end connection 10.8.1.231:56523 (2 connections now open)
      2015-10-26T11:47:56.213+1100 [conn129] shutdown: final commit...
      2015-10-26T11:47:56.214+1100 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: Joannas-MacBook-Pro.local:25217
      2015-10-26T11:47:56.223+1100 [conn129] shutdown: closing all files...
      2015-10-26T11:47:56.223+1100 [conn129] closeAllFiles() finished
      2015-10-26T11:47:56.224+1100 [conn129] journalCleanup...
      2015-10-26T11:47:56.224+1100 [conn129] removeJournalFiles
      2015-10-26T11:47:56.225+1100 [conn129] shutdown: removing fs lock...
      2015-10-26T11:47:56.225+1100 [conn129] dbexit: really exiting now
      

            Assignee:
            mathias@mongodb.com Mathias Stearn
            Reporter:
            joanna.cheng@mongodb.com Joanna Cheng
            Votes:
            0 Vote for this issue
            Watchers:
            10 Start watching this issue

              Created:
              Updated:
              Resolved: