The config database's indexes are built while the CSRS config server's primary is still in draining mode, which means they are built under the global X lock. This delays exiting draining mode and also blocks any incoming requests that need the global lock (such as the oplog getMore issued by initial sync).
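As a toy sketch of the blocking pattern (this is not MongoDB's actual lock manager, just an illustration of exclusive vs. shared acquisition on a single global lock), a thread holding the lock in X mode for the duration of an index build stalls a reader that only needs a shared acquisition, the way the initial-sync oplog getMore is stalled here:

    // Toy illustration only: "index build" holds an exclusive lock,
    // "oplog getMore" needs a shared acquisition and must wait it out.
    #include <chrono>
    #include <iostream>
    #include <shared_mutex>
    #include <thread>

    int main() {
        std::shared_mutex globalLock;

        // "Index build": takes the global lock in X mode and holds it.
        std::thread indexBuild([&] {
            std::unique_lock<std::shared_mutex> x(globalLock);
            std::this_thread::sleep_for(std::chrono::seconds(2));  // simulated build
        });

        std::this_thread::sleep_for(std::chrono::milliseconds(100));

        // "Oplog getMore": only needs shared access, but waits for X to release.
        auto start = std::chrono::steady_clock::now();
        {
            std::shared_lock<std::shared_mutex> r(globalLock);
        }
        auto waited = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start);
        std::cout << "reader waited " << waited.count() << " ms\n";

        indexBuild.join();
        return 0;
    }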
This is causing initial sync failures of this kind:
[js_test:min_optime_recovery_on_failed_move_chunk_commit] 2016-09-16T02:42:21.061+0000 c22013| 2016-09-16T02:42:21.061+0000 I - [initial sync] Fatal assertion 40088 ExceededTimeLimit: Operation timed out, request was RemoteCommand 42 -- target:WIN-K6DURTQEFVN:22011 db:local expDate:2016-09-16T02:42:11.640+0000 cmd:{ getMore: 17405408007, collection: "oplog.rs", maxTimeMS: 5000, term: 0, lastKnownCommittedOpTime: { ts: Timestamp 0|0, t: -1 } } at src\mongo\db\repl\replication_coordinator_impl.cpp 614
Note that the getMore command in the log below took over 18 seconds to acquire the global lock:
c22011| 2016-09-16T02:42:01.843+0000 I REPL [ReplicationExecutor] My optime is most up-to-date, skipping catch-up and completing transition to primary.
c22011| 2016-09-16T02:42:05.900+0000 I INDEX [rsSync] build index on: config.chunks properties: { v: 2, unique: true, key: { ns: 1, min: 1 }, name: "ns_1_min_1", ns: "config.chunks" }
...
c22011| 2016-09-16T02:42:18.855+0000 I INDEX [rsSync] build index done. scanned 0 total records. 2 secs
c22011| 2016-09-16T02:42:21.043+0000 I SHARDING [Balancer] CSRS balancer is starting
c22011| 2016-09-16T02:42:21.044+0000 I REPL [rsSync] transition to primary complete; database writes are now permitted
c22011| 2016-09-16T02:42:21.049+0000 I COMMAND [conn7] command local.oplog.rs command: getMore { getMore: 17405408007, collection: "oplog.rs", maxTimeMS: 5000, term: 0, lastKnownCommittedOpTime: { ts: Timestamp 0|0, t: -1 } } originatingCommand: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1473993702000|1 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 0 } planSummary: COLLSCAN cursorid:17405408007 keysExamined:0 docsExamined:0 numYields:0 nreturned:0 reslen:352 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 18627803 } }, Database: { acquireCount: { r: 2 } }, oplog: { acquireCount: { r: 2 } } } protocol:op_command 19408ms
c22011| 2016-09-16T02:42:31.102+0000 I SHARDING [Balancer] CSRS balancer is now stopped
Issue links:
- is related to SERVER-27854 (initiate replicasets faster in replicaset.py) - Closed