Core Server / SERVER-26134

Building sharding config database indexes holds up leaving drain mode

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major - P3
    • Affects Version/s: 3.3.12
    • Component/s: Sharding

      The config database's indexes are built while the config server replica set (CSRS) primary is still in drain mode, which means the builds happen under the global X lock. This holds up exiting drain mode and also blocks any incoming requests (such as the oplog reads that initial sync depends on).
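
      For illustration, here is a minimal, hypothetical C++ sketch of the blocking behavior, using std::shared_mutex as a stand-in for the server's global lock (this is not MongoDB's actual lock manager): while the index build holds the lock in exclusive (X) mode, a reader such as an oplog getMore cannot acquire even a shared mode until the build finishes.

      // Sketch only: models a writer holding a global exclusive lock across an
      // index build, starving a concurrent reader (analogous to the oplog
      // getMore done by initial sync).
      #include <chrono>
      #include <iostream>
      #include <shared_mutex>
      #include <thread>

      std::shared_mutex globalLock;  // stand-in for the global lock

      void buildConfigIndexes() {
          std::unique_lock xLock(globalLock);  // global X: exclusive
          std::cout << "index build: holding global X lock\n";
          std::this_thread::sleep_for(std::chrono::seconds(3));  // simulated build
          std::cout << "index build: done, releasing lock\n";
      }

      void oplogGetMore() {
          auto start = std::chrono::steady_clock::now();
          std::shared_lock isLock(globalLock);  // reader blocks here until X is released
          auto waited = std::chrono::duration_cast<std::chrono::milliseconds>(
              std::chrono::steady_clock::now() - start);
          // The wait observed here corresponds to timeAcquiringMicros in the logs below.
          std::cout << "getMore: acquired shared lock after " << waited.count() << " ms\n";
      }

      int main() {
          std::thread writer(buildConfigIndexes);
          std::this_thread::sleep_for(std::chrono::milliseconds(100));
          std::thread reader(oplogGetMore);
          writer.join();
          reader.join();
      }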

      This is causing initial sync failures of this kind:

      [js_test:min_optime_recovery_on_failed_move_chunk_commit] 2016-09-16T02:42:21.061+0000 c22013| 2016-09-16T02:42:21.061+0000 I -        [initial sync] Fatal assertion 40088 ExceededTimeLimit: Operation timed out, request was RemoteCommand 42 -- target:WIN-K6DURTQEFVN:22011 db:local expDate:2016-09-16T02:42:11.640+0000 cmd:{ getMore: 17405408007, collection: "oplog.rs", maxTimeMS: 5000, term: 0, lastKnownCommittedOpTime: { ts: Timestamp 0|0, t: -1 } } at src\mongo\db\repl\replication_coordinator_impl.cpp 614
      

      Note that the getMore command in the log below took about 18 seconds to acquire the global lock:

      c22011| 2016-09-16T02:42:01.843+0000 I REPL     [ReplicationExecutor] My optime is most up-to-date, skipping catch-up and completing transition to primary.
      c22011| 2016-09-16T02:42:05.900+0000 I INDEX    [rsSync] build index on: config.chunks properties: { v: 2, unique: true, key: { ns: 1, min: 1 }, name: "ns_1_min_1", ns: "config.chunks" }
      ...
      c22011| 2016-09-16T02:42:18.855+0000 I INDEX    [rsSync] build index done.  scanned 0 total records. 2 secs
      c22011| 2016-09-16T02:42:21.043+0000 I SHARDING [Balancer] CSRS balancer is starting
      c22011| 2016-09-16T02:42:21.044+0000 I REPL     [rsSync] transition to primary complete; database writes are now permitted
      c22011| 2016-09-16T02:42:21.049+0000 I COMMAND  [conn7] command local.oplog.rs command: getMore { getMore: 17405408007, collection: "oplog.rs", maxTimeMS: 5000, term: 0, lastKnownCommittedOpTime: { ts: Timestamp 0|0, t: -1 } } originatingCommand: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1473993702000|1 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 0 } planSummary: COLLSCAN cursorid:17405408007 keysExamined:0 docsExamined:0 numYields:0 nreturned:0 reslen:352 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 18627803 } }, Database: { acquireCount: { r: 2 } }, oplog: { acquireCount: { r: 2 } } } protocol:op_command 19408ms
      c22011| 2016-09-16T02:42:31.102+0000 I SHARDING [Balancer] CSRS balancer is now stopped
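
      As a sanity check on the numbers above, the lock wait reported in timeAcquiringMicros accounts for nearly all of the command's 19408 ms runtime and far exceeds the getMore's 5000 ms maxTimeMS. A minimal arithmetic sketch, with the values copied from the log line above:

      #include <iostream>

      int main() {
          const long long timeAcquiringMicros = 18627803;  // from the getMore log line
          const long long totalMillis = 19408;             // total command duration
          const long long maxTimeMS = 5000;                // getMore deadline

          double waitMillis = timeAcquiringMicros / 1000.0;  // 18627.8 ms
          std::cout << "global lock wait: " << waitMillis << " ms ("
                    << waitMillis / totalMillis * 100 << "% of the command)\n";  // ~96%
          std::cout << "exceeds maxTimeMS by " << waitMillis - maxTimeMS << " ms\n";
      }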
      

            Assignee: backlog-server-sharding (Backlog - Sharding Team)
            Reporter: Kaloian Manassiev (kaloian.manassiev@mongodb.com)
            Votes: 0
            Watchers: 7
