[SERVER-57232] Increase in slow queries in ConnectionsBuildup case in Mongos on v4.4 Created: 26/May/21  Updated: 29/Mar/22  Resolved: 29/Mar/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Tyler Seip (Inactive) Assignee: Daniel Morilha (Inactive)
Resolution: Duplicate Votes: 0
Labels: servicearch-q4-2021, servicearch-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
is duplicated by SERVER-57230 mongos doesn't distribute secondary r... Closed
Related
Operating System: ALL
Sprint: Service Arch 2022-1-24, Service Arch 2022-2-07, Service Arch 2022-2-21, Service Arch 2022-03-07, Service Arch 2022-03-21, Service Arch 2022-04-04
Participants:

 Description   

The comment below, transcribed from the linked ticket, describes a worrying increase in the number of slow queries in the ConnectionsBuildup workload. Our performance should be bottlenecked by networking (ConnectionPools in particular), but the connection stats indicate that it is actually bottlenecked elsewhere. In this ticket, investigate what caused this apparent regression and file a ticket to fix it.



 Comments   
Comment by Daniel Morilha (Inactive) [ 29/Mar/22 ]

Removing story points and closing as a duplicate of SERVER-57230, which better describes the problem.

Comment by Daniel Morilha (Inactive) [ 21/Mar/22 ]

The proposed fixes for SERVER-57230 by vojislav.stojkovic seem to also address the issue reported here.

Comment by Daniel Morilha (Inactive) [ 27/Jan/22 ]

Not actively working on it as of now.

Comment by Daniel Morilha (Inactive) [ 25/Jan/22 ]

Another interesting observation comes from the sizes of the logs generated by each mongod. Compared to v4.2, master shows the load unevenly distributed toward one (assumed) secondary:

  

$ find . -type f -path *connections_buildup*mongod.log -exec ls -lah {} + | awk '{ print $5 "\t" $9 }';
152K ./master/WorkloadOutput/reports-2022-01-20T22:14:41.399424+00:00/connections_buildup/configsvr.0/mongod.log
239K ./master/WorkloadOutput/reports-2022-01-20T22:14:41.399424+00:00/connections_buildup/mongod.0/mongod.log
189K ./master/WorkloadOutput/reports-2022-01-20T22:14:41.399424+00:00/connections_buildup/mongod.1/mongod.log
856K ./master/WorkloadOutput/reports-2022-01-20T22:14:41.399424+00:00/connections_buildup/mongod.2/mongod.log
63K ./v4.2/WorkloadOutput/reports-2022-01-20T23_24_48.228302+00_00/connections_buildup/configsvr.0/mongod.log
107K ./v4.2/WorkloadOutput/reports-2022-01-20T23_24_48.228302+00_00/connections_buildup/mongod.0/mongod.log
73K ./v4.2/WorkloadOutput/reports-2022-01-20T23_24_48.228302+00_00/connections_buildup/mongod.1/mongod.log
73K ./v4.2/WorkloadOutput/reports-2022-01-20T23_24_48.228302+00_00/connections_buildup/mongod.2/mongod.log
$

 

So perhaps ticket SERVER-57230 deserves special attention first.
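
(For reference, a rough sketch of how the skew could be broken down further. The paths mirror the listing above and the grep pattern assumes the 4.4+ structured JSON log format, so treat it as illustrative rather than the exact command used here:)

# Count "Slow query" entries per master-branch mongod to see whether the
# log-size imbalance toward mongod.2 is driven by slow-query logging itself.
$ for f in ./master/WorkloadOutput/reports-*/connections_buildup/mongod.*/mongod.log; do
    printf '%s\t%s\n' "$(grep -c '"Slow query"' "$f")" "$f";
  done

If the count for mongod.2 dominates, the size difference is most likely slow-query logging rather than some other component.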

Comment by Daniel Morilha (Inactive) [ 24/Jan/22 ]

I've tried to reproduce this by running the Evergreen sys-perf project against master (currently pointing to 5.3) and against 4.2, where the initial baseline came from. Interestingly, the mongos logs for v4.2 were a fraction of the size. I then scanned through the code base to see where the specific "slow query" log line is generated and found it here. Just from following the logic, I am not certain what causes a query / operation to be considered slow. Across the 215,000 log lines, durationMillis ranged from 100 to 363.
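
(If this is the standard slow-operation logging path, the cutoff is presumably the slowms / operationProfiling.slowOpThresholdMs threshold, whose default is 100 ms, which would line up with the observed minimum of 100. A rough sketch for pulling the durationMillis distribution out of a 4.4+ structured (JSON) mongos log, assuming jq is available; the file name is a placeholder:)

# Extract durationMillis from each "Slow query" log line and tally the values.
$ jq 'select(.msg == "Slow query") | .attr.durationMillis' mongos.log \
    | sort -n | uniq -c | sort -rn | head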

Comment by Daniel Morilha (Inactive) [ 20/Jan/22 ]

In my previous update I forgot to mention that SERVER-57230 was also created out of the same investigation and might overlap with this one. It might be wise to understand / fix that problem, which is much better defined, before tackling this one.

Comment by Daniel Morilha (Inactive) [ 20/Jan/22 ]

Met with tyler.seip and george.wangensteen yesterday to discuss everything behind this ticket. In summary, there was a dramatic increase in "slow queries" while running the ConnectionsBuildup performance test between version 4.2 and version 4.4. The reasons are inconclusive at this point, but might be related to the introduction of Preferred Initial Sync Source or Hedged Reads.

The first steps in the investigation are to diff the two branches, check whether anything obvious stands out, and try to reproduce the issue operationally.
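
(A rough sketch of what that diff could look like — the tags, paths, and parameter below are assumptions for illustration, not the exact commands used in this investigation:)

# Compare the networking / executor code between the two release points.
$ git log --oneline r4.2.13..r4.4.0 -- src/mongo/executor/ src/mongo/client/ | wc -l
$ git diff --stat r4.2.13 r4.4.0 -- src/mongo/executor/network_interface_tl.cpp

# To rule Hedged Reads in or out, the 4.4 mongos can (if memory serves) be started
# with hedged reads disabled via the readHedgingMode server parameter:
$ mongos --configdb <csrs>/<host:port> --setParameter readHedgingMode=off

Re-running the workload with hedged reads off would quickly confirm or exclude that hypothesis.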

Comment by Daniel Morilha (Inactive) [ 18/Jan/22 ]

Bringing it to the active sprint per blake.oler's priority suggestion.

Comment by Tyler Seip (Inactive) [ 01/Jun/21 ]

Other relevant comment

Comment by Tyler Seip (Inactive) [ 27/May/21 ]

I have one more (useful) patch, which reduces the target set in the NITL (NetworkInterfaceTL) to one host. It confirms to me beyond a shadow of a doubt that this issue is in the NITL and not the RSM (ReplicaSetMonitor):

<screenshot removed>

I'd also like to point out that SERVER-54278 apparently hit v4.2 only, and I believe it is included in the r4.2.13 patch we have here. Thus I think we're comparing a build without opportunistic reads to one with opportunistic reads. Even so, this is not the effect that opportunistic reads are supposed to produce in practice.

My general suspicion is that we acquire idle connections and then return them faster after the response because they might be executed on the baton, starting here. Connections that we do not use are always returned on the reactor. This makes me wonder whether it's worth making a form of ConnectionPool::get() that also takes an executor pointer or a baton. We might be able to substantially inline our connection pooling, because we'd be inline-queuing on a baton that is only processing one operation. On the flip side, we might contend much more for the pool lock.

I still don't have a great explanation for the isMaster burst, though. Perhaps we are late-binding unused connections until the command response arrives in some esoteric way? The full connPoolStats output from here might answer this more definitively. We take the forFTDC branch on mongos, so the result is not all that informative right now:

<screenshot removed>

We could run patches with that branch disabled and get a better picture of connections that are not inUse but are either available or refreshing. For instance, we could be trying and failing to refresh connections to d2 for some bad reason.

— from Benjamin Caimano
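
(A sketch of how the full connPoolStats picture discussed above could be pulled without waiting on FTDC — the host is a placeholder; connPoolStats run directly against the mongos should include the per-host breakdown:)

$ mongo --host <mongos-host>:27017 --quiet \
    --eval 'printjson(db.adminCommand({ connPoolStats: 1 }))'

Comparing the per-host available / refreshing / inUse counts during the isMaster burst would show whether we really are trying and failing to refresh connections to d2.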
