[SERVER-33285] Add "thread time waiting for response from replication sync source" stats Created: 13/Feb/18  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Diagnostics, Replication
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Dmitry Agranat Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 3
Labels: SWDI, former-quick-wins
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
Assigned Teams:
Replication
Participants:

 Description   

When there is an increase in replicated write ops to a secondary, serverStatus includes stats for reader threads waiting for a global lock (Global timeAcquiringMicros r), but it does not report how that contention affects the thread reading the oplog to replicate to another secondary (in the case of chained replication).
It would be good to have a "thread time waiting for a response from the replication sync source" stat, since that information is useful for diagnosing problematically lagging nodes.
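As a rough illustration of what the requested stat could measure, here is a minimal sketch of a requester-side accumulator: it times each oplog batch fetch from the sync source and exposes the running total in a serverStatus-like document. All names (SyncSourceWaitStats, waitTimeMicros, batchesReceived) are invented for illustration and are not part of any actual MongoDB interface.

```python
import time

class SyncSourceWaitStats:
    """Hypothetical accumulator for the proposed metric: total thread time
    spent waiting for oplog batches from the replication sync source."""

    def __init__(self):
        self.total_wait_micros = 0
        self.batches = 0

    def timed_fetch(self, fetch_batch):
        """Run fetch_batch() (e.g. an oplog getMore against the sync source)
        and charge the elapsed wall-clock time to the wait counter."""
        start = time.monotonic()
        batch = fetch_batch()
        elapsed = time.monotonic() - start
        self.total_wait_micros += int(elapsed * 1_000_000)
        self.batches += 1
        return batch

    def server_status(self):
        # Shape loosely modelled on serverStatus sub-documents;
        # field names are invented for illustration.
        return {
            "syncSource": {
                "waitTimeMicros": self.total_wait_micros,
                "batchesReceived": self.batches,
            }
        }
```

Because the timer wraps the whole fetch on the requesting node, the counter naturally includes network latency as well as the sync source's own processing time, which is the quantity relevant to diagnosing lag.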



 Comments   
Comment by Alyson Cabral (Inactive) [ 04/Apr/18 ]

Got it, so this seems completely unrelated to Global timeAcquiringMicros r.

This is about providing a measurement of sync source speed (or health) relative to the requesting server. spencer tess.avitabile Let's discuss in the next product/repl triage meeting to determine if this is the best way to measure relative sync source perf.

Let's discuss how hard this is by itself. I could also see this being a part of a larger project making chaining better/smarter.

Comment by Geert Bosch [ 04/Apr/18 ]

The time acquiring the global lock is just one component of the total time it takes a node to process a request. In addition, you'd really want to measure the operation time on the node issuing the request, as that also includes network latency. For example, suppose you have a local sync source that returns batches in 50ms and one halfway across the world that returns them in 10ms: it would still be faster to read from the local node in this case, because the round-trip network latency to the remote node outweighs its faster processing.
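The arithmetic behind this example can be made concrete. The sketch below uses the comment's numbers plus an assumed one-way latency of roughly 80ms for the cross-world link (that latency figure is an assumption, not from the ticket):

```python
def observed_batch_time_ms(processing_ms, one_way_latency_ms):
    """Time the requesting node actually waits for a batch: the sync
    source's processing time plus the network round trip."""
    return processing_ms + 2 * one_way_latency_ms

# Local sync source: slower to produce a batch, negligible latency.
local = observed_batch_time_ms(processing_ms=50, one_way_latency_ms=0.5)

# Remote sync source: faster processing, but ~80 ms each way.
remote = observed_batch_time_ms(processing_ms=10, one_way_latency_ms=80)

# local == 51 ms, remote == 170 ms: the local node still wins.
```

This is why a stat measured on the requesting node (wall-clock wait per batch) is more useful for sync source selection than a stat measured on the sync source itself.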

Comment by Asya Kamsky [ 04/Apr/18 ]

Isn't this already available on the secondary where this getMore is running?

2017-10-05T05:05:43.518+0000 I COMMAND  [conn859458] command local.oplog.rs command: getMore { getMore: 34741198607, collection: "oplog.rs" } originatingCommand: { find: "oplog.rs", filter: {}, tailable: true, awaitData: true } planSummary: COLLSCAN cursorid:34741198607 keysExamined:0 docsExamined:265 numYields:3 nreturned:265 reslen:89018 locks:{ Global: { acquireCount: { r: 8 }, acquireWaitCount: { r: 3 }, timeAcquiringMicros: { r: 2212456 } }, Database: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 7301 } }, oplog: { acquireCount: { r: 4 } } } protocol:op_query 2341ms
2017-10-09T16:01:12.650+0000 I COMMAND  [conn859458] command local.oplog.rs command: getMore { getMore: 34741198607, collection: "oplog.rs" } originatingCommand: { find: "oplog.rs", filter: {}, tailable: true, awaitData: true } planSummary: COLLSCAN cursorid:34741198607 keysExamined:0 docsExamined:383 numYields:2 nreturned:383 reslen:1514172 locks:{ Global: { acquireCount: { r: 6 }, acquireWaitCount: { r: 2 }, timeAcquiringMicros: { r: 8603146 } }, Database: { acquireCount: { r: 3 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 23687 } }, oplog: { acquireCount: { r: 3 } } } protocol:op_query 8768ms

Here is a secondary that's slow to service getMores due to high global timeAcquiringMicros... right?
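The log lines above can indeed be mined for this today. A hypothetical parser (the regex and function names are mine, not part of any MongoDB tooling) that pulls the global-lock wait and total duration out of such a getMore line:

```python
import re

# Matches the Global timeAcquiringMicros value and the trailing
# "protocol:... NNNNms" duration in a getMore log line.
LINE_RE = re.compile(
    r"Global:.*?timeAcquiringMicros: \{ r: (?P<acquire_us>\d+) \}"
    r".*?protocol:\S+ (?P<total_ms>\d+)ms"
)

def lock_wait_fraction(log_line):
    """Return (acquire_ms, total_ms, fraction of the op spent waiting
    on the global lock), or None if the line does not match."""
    m = LINE_RE.search(log_line)
    if not m:
        return None
    acquire_ms = int(m.group("acquire_us")) / 1000
    total_ms = int(m.group("total_ms"))
    return acquire_ms, total_ms, acquire_ms / total_ms
```

Run against the first log line quoted above (timeAcquiringMicros r: 2212456, 2341ms total), roughly 95% of the operation is spent waiting on the global lock, which supports the reading that the secondary is slow to service getMores because of lock acquisition.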

Comment by Alyson Cabral (Inactive) [ 04/Apr/18 ]

Or will that essentially become the value we are asking for here?

Comment by Alyson Cabral (Inactive) [ 04/Apr/18 ]

With the addition of snapshot reads during oplog application on secondaries, is the Global timeAcquiringMicros r serverStatus field going away? geert.bosch

Comment by Gregory McKeon (Inactive) [ 26/Mar/18 ]

ping alyson.cabral

Generated at Thu Feb 08 04:32:55 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.