Core Server / SERVER-24753

The balancer thread initialization is not interruptible

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Fixed
    • Affects Version/s: 3.3.9
    • Fix Version/s: 3.3.9
    • Component/s: Sharding
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Sprint:
      Sharding 17 (07/15/16)
    • Linked BF Score:
      0

      Description

      When the balancer thread starts, it tries to read the list of shards, contact the shards, and acquire the balancer distributed lock. If any of these operations fails, it sleeps for up to 60 seconds.

      This sleep cannot be interrupted, so it blocks replication stepdown and causes stepdown to fail with the following error:

      [js_test:replmonitor_bad_seed] 2016-06-16T20:19:19.945-0500 assert: command failed: {
      [js_test:replmonitor_bad_seed] 2016-06-16T20:19:19.945-0500 	"ok" : 0,
      [js_test:replmonitor_bad_seed] 2016-06-16T20:19:19.945-0500 	"errmsg" : "Could not acquire the global shared lock within the amount of time specified that we should step down for",
      [js_test:replmonitor_bad_seed] 2016-06-16T20:19:19.945-0500 	"code" : 50
      [js_test:replmonitor_bad_seed] 2016-06-16T20:19:19.945-0500 } : undefined
      

      The call stacks show this thread, blocked in mongo::Balancer::joinThread():

       [2016/06/23 10:51:26.123] Thread 39 (Thread 0x7fcf8c9d5700 (LWP 4803)):
       [2016/06/23 10:51:26.123] #0  0x00007fcfc38162fd in pthread_join () from /lib64/libpthread.so.0
       [2016/06/23 10:51:26.123] #1  0x00007fcfc7ec9a37 in std::thread::join() ()
       [2016/06/23 10:51:26.123] #2  0x00007fcfc72745f8 in mongo::Balancer::joinThread() ()
       [2016/06/23 10:51:26.123] #3  0x00007fcfc6f73b50 in mongo::repl::ReplicationCoordinatorExternalStateImpl::shardingOnDrainingStateHook(mongo::OperationContext*) ()
       [2016/06/23 10:51:26.123] #4  0x00007fcfc6f894e0 in mongo::repl::ReplicationCoordinatorImpl::signalDrainComplete(mongo::OperationContext*) ()
       [2016/06/23 10:51:26.123] #5  0x00007fcfc6ffc9b8 in mongo::repl::SyncTail::oplogApplication() ()
       [2016/06/23 10:51:26.124] #6  0x00007fcfc6fe6d95 in mongo::repl::runSyncThread(mongo::repl::BackgroundSync*) ()
       [2016/06/23 10:51:26.124] #7  0x00007fcfc7ec9af0 in execute_native_thread_routine ()
       [2016/06/23 10:51:26.124] #8  0x00007fcfc3815aa1 in start_thread () from /lib64/libpthread.so.0
       [2016/06/23 10:51:26.124] #9  0x00007fcfc3562aad in clone () from /lib64/libc.so.6
      

      waiting on the balancer initialization, which is asleep in mongo::sleepmicros():

       [2016/06/23 10:51:33.020] Thread 7 (Thread 0x7f03bf243700 (LWP 5701)):
       [2016/06/23 10:51:33.020] #0  0x00007f03fd5af00d in nanosleep () from /lib64/libpthread.so.0
       [2016/06/23 10:51:33.020] #1  0x00007f040120c4d0 in mongo::sleepmicros(long long) ()
       [2016/06/23 10:51:33.021] #2  0x00007f040100d5af in mongo::Balancer::_mainThread() ()
       [2016/06/23 10:51:33.021] #3  0x00007f0401c5baf0 in execute_native_thread_routine ()
       [2016/06/23 10:51:33.021] #4  0x00007f03fd5a7aa1 in start_thread () from /lib64/libpthread.so.0
       [2016/06/23 10:51:33.021] #5  0x00007f03fd2f4aad in clone () from /lib64/libc.so.6
      
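      The difference between a plain sleep and an interruptible one can be illustrated with a minimal sketch. It assumes a condition variable is an acceptable way to wait; InterruptibleSleeper and stopWorkerDuringBackoff are hypothetical names used only for illustration and are not the code that went into the server for this ticket.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical helper (not MongoDB's implementation): a back-off sleep
// that another thread can cut short.
class InterruptibleSleeper {
public:
    // Blocks for up to `timeout`. Returns true if interrupt() was called
    // (before or during the wait), false if the full timeout elapsed.
    bool sleepFor(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(_mutex);
        // The predicate protects against spurious wakeups and against an
        // interrupt that arrived before the wait started.
        return _cv.wait_for(lock, timeout, [this] { return _interrupted; });
    }

    // Wakes any thread blocked in sleepFor(); later calls return at once.
    void interrupt() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _interrupted = true;
        }
        _cv.notify_all();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _interrupted = false;
};

// Simulates the situation in the stack traces: a worker backs off for up
// to 60 seconds while another thread asks it to stop and then joins it.
// Returns how long the stop-and-join took, in milliseconds.
long long stopWorkerDuringBackoff() {
    InterruptibleSleeper sleeper;
    std::thread worker([&sleeper] {
        // Stand-in for a failed initialization attempt followed by a back-off.
        sleeper.sleepFor(std::chrono::seconds(60));
    });

    const auto start = std::chrono::steady_clock::now();
    sleeper.interrupt();  // With a plain sleep there would be nothing to signal.
    worker.join();        // Returns promptly instead of after the full back-off.
    const auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
}
```

      With a plain nanosleep-based wait, as in the second stack, the joining thread has no way to wake the sleeper and must wait out the remainder of the back-off. In the sketch, the join completes as soon as the worker observes the interrupt.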

            People

            Assignee:
            kaloian.manassiev Kaloian Manassiev
            Reporter:
            kaloian.manassiev Kaloian Manassiev
            Votes:
            0
            Watchers:
            4