The config database's indexes are built while the CSRS config server's primary is still in draining mode, which means they are built under the global X lock. This delays exiting draining mode and also blocks any incoming requests that need the global lock (such as the oplog getMore issued by initial sync).
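As a toy sketch of the blocking pattern (this is not MongoDB's actual lock manager, just an illustration of exclusive vs. shared acquisition on a single global lock), a thread holding the lock in X mode for the duration of an index build stalls a reader that only needs a shared acquisition, the way the initial-sync oplog getMore is stalled here:

    // Toy illustration only: "index build" holds an exclusive lock,
    // "oplog getMore" needs a shared acquisition and must wait it out.
    #include <chrono>
    #include <iostream>
    #include <shared_mutex>
    #include <thread>

    int main() {
        std::shared_mutex globalLock;

        // "Index build": takes the global lock in X mode and holds it.
        std::thread indexBuild([&] {
            std::unique_lock<std::shared_mutex> x(globalLock);
            std::this_thread::sleep_for(std::chrono::seconds(2));  // simulated build
        });

        std::this_thread::sleep_for(std::chrono::milliseconds(100));

        // "Oplog getMore": only needs shared access, but waits for X to release.
        auto start = std::chrono::steady_clock::now();
        {
            std::shared_lock<std::shared_mutex> r(globalLock);
        }
        auto waited = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start);
        std::cout << "reader waited " << waited.count() << " ms\n";

        indexBuild.join();
        return 0;
    }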
This is causing initial sync failures of this kind:
[js_test:min_optime_recovery_on_failed_move_chunk_commit] 2016-09-16T02:42:21.061+0000 c22013| 2016-09-16T02:42:21.061+0000 I - [initial sync] Fatal assertion 40088 ExceededTimeLimit: Operation timed out, request was RemoteCommand 42 -- target:WIN-K6DURTQEFVN:22011 db:local expDate:2016-09-16T02:42:11.640+0000 cmd:{ getMore: 17405408007, collection: "oplog.rs", maxTimeMS: 5000, term: 0, lastKnownCommittedOpTime: { ts: Timestamp 0|0, t: -1 } } at src\mongo\db\repl\replication_coordinator_impl.cpp 614
Note that the getMore command in the log below took over 18 seconds to acquire the global lock:
c22011| 2016-09-16T02:42:01.843+0000 I REPL [ReplicationExecutor] My optime is most up-to-date, skipping catch-up and completing transition to primary.
c22011| 2016-09-16T02:42:05.900+0000 I INDEX [rsSync] build index on: config.chunks properties: { v: 2, unique: true, key: { ns: 1, min: 1 }, name: "ns_1_min_1", ns: "config.chunks" }
...
c22011| 2016-09-16T02:42:18.855+0000 I INDEX [rsSync] build index done. scanned 0 total records. 2 secs
c22011| 2016-09-16T02:42:21.043+0000 I SHARDING [Balancer] CSRS balancer is starting
c22011| 2016-09-16T02:42:21.044+0000 I REPL [rsSync] transition to primary complete; database writes are now permitted
c22011| 2016-09-16T02:42:21.049+0000 I COMMAND [conn7] command local.oplog.rs command: getMore { getMore: 17405408007, collection: "oplog.rs", maxTimeMS: 5000, term: 0, lastKnownCommittedOpTime: { ts: Timestamp 0|0, t: -1 } } originatingCommand: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1473993702000|1 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 0 } planSummary: COLLSCAN cursorid:17405408007 keysExamined:0 docsExamined:0 numYields:0 nreturned:0 reslen:352 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 18627803 } }, Database: { acquireCount: { r: 2 } }, oplog: { acquireCount: { r: 2 } } } protocol:op_command 19408ms
c22011| 2016-09-16T02:42:31.102+0000 I SHARDING [Balancer] CSRS balancer is now stopped
Issue links:
- is related to SERVER-27854 (initiate replicasets faster in replicaset.py) - Closed