[SERVER-21907] mongos has more than 50 connections open to the primary of each replica-set shard Created: 15/Dec/15  Updated: 08/Jan/24  Resolved: 05/Oct/16

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: 3.2.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Jonathan Reams
Resolution: Cannot Reproduce Votes: 0
Labels: PM-314
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Operating System: ALL
Steps To Reproduce:

python buildscripts/resmoke.py --executor concurrency_sharded jstests/concurrency/fsm_all_sharded_replication_legacy_config_servers.js --storageEngine wiredTiger

Some basic stats about the connections

$ while true; do lsof -p <pid_of_mongos1,pid_of_mongos2> | grep "\->.*:2001[03].*ESTABLISHED" | wc -l; sleep 10; done
73
88
94
102
122
125
125
146
156
156
156
161
161
169
186
204
205
205
205
205
205
203
197
208
211
210
210
208
206
205
205
205
205
209
209
203
200
201
211
211
211
211
211
211
211
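
For easier correlation with the workload timeline, a variant of the same loop can timestamp each sample and split the count per shard primary. This is only a sketch: the pids are the same placeholders as above, and ports 20010 and 20013 are assumed from the :2001[03] grep pattern.

$ # sketch: per-primary counts with timestamps; ports 20010/20013 assumed from the pattern above
$ while true; do echo "$(date +%T) shard0=$(lsof -p <pid_of_mongos1,pid_of_mongos2> | grep '\->.*:20010.*ESTABLISHED' | wc -l) shard1=$(lsof -p <pid_of_mongos1,pid_of_mongos2> | grep '\->.*:20013.*ESTABLISHED' | wc -l)"; sleep 10; done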

Sprint: Platforms 18 (08/05/16), Platforms 2016-08-26, Platforms 2016-09-19, Platforms 2016-10-10
Participants:
Linked BF Score: 0

 Description   

fsm_all_sharded_replication_legacy_config_servers.js runs the FSM workloads against a sharded cluster with 2 mongos processes, 2 3-node replica-set shards, and 3 legacy config servers. None of the FSM workloads run (individually) with more than 20 threads. In this test, half of the clients communicate solely with one of the mongos processes, and the other half communicate solely with the other mongos process.

As the test is running, the mongos goes from having ~30 connections open to the primary of each of the replica-set shards to over 50. If this behavior is expected and desirable, then I think we should consider scaling down the number of task executors in the pool when test commands are enabled, to avoid overwhelming our test hosts.
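
As a cross-check on the lsof counts, the growth should also be visible from the mongos itself. A minimal sketch, using <mongos_port> as a placeholder for the mongos port, is to poll connPoolStats from the shell and watch the per-host totals:

$ # sketch: <mongos_port> is a placeholder; connPoolStats reports the mongos's outgoing connection pools per host
$ mongo --port <mongos_port> --eval 'printjson(db.adminCommand({connPoolStats: 1}))'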



 Comments   
Comment by Jonathan Reams [ 05/Oct/16 ]

I've tried to reproduce this a number of ways, on a number of revisions and major versions, both locally and on a spawn host, and have never been able to. I'm going to close this as Cannot Reproduce; if we see this again, we can re-open.

Comment by Mira Carey [ 12/Jul/16 ]

dan@10gen.com,

Possibly. Depends on whether we spike and then drop down. If there's a sustained load, that's what we'd expect, but if we have some tests that have many more concurrent operations than others, SERVER-25006 might be the fix.

Comment by Daniel Pasette (Inactive) [ 12/Jul/16 ]

mira.carey@mongodb.com, SERVER-25006?

Comment by Max Hirschhorn [ 15/Dec/15 ]
Comparison with 3.0 behavior

As the test is running, the mongos goes from having ~20 connections open to the primary of each of the replica-set shards to a little over 30.

$ while true; do lsof -p <pid_of_mongos1,pid_of_mongos2> | grep "\->.*:2001[03].*ESTABLISHED" | wc -l; sleep 10; done
14
39
44
45
84
91
92
98
97
97
97
97
101
102
104
104
104
104
107
108
108
108
108
108
108
108
108
108
108
108
108
110
111
111
112
112
112
112
119
126
126
126
126
126
126
(...test not run to completion...)

To run fsm_all_sharded_replication_legacy_config_servers.js using a 3.0 mongod, a 3.0 mongos, and a 3.2 mongo shell, apply the following patch to work around commands and failpoints that were introduced in 3.2.

diff --git a/buildscripts/resmokeconfig/suites/concurrency_sharded.yml b/buildscripts/resmokeconfig/suites/concurrency_sharded.yml
index b5bd8b7..132aba3 100644
--- a/buildscripts/resmokeconfig/suites/concurrency_sharded.yml
+++ b/buildscripts/resmokeconfig/suites/concurrency_sharded.yml
@@ -9,4 +9,3 @@ executor:
     config:
       shell_options:
         nodb: ''
-        readMode: commands
diff --git a/jstests/concurrency/fsm_libs/assert.js b/jstests/concurrency/fsm_libs/assert.js
index d6de79b..15196b3 100644
--- a/jstests/concurrency/fsm_libs/assert.js
+++ b/jstests/concurrency/fsm_libs/assert.js
@@ -29,6 +29,7 @@ var AssertLevel = (function() {
     }
 
     return {
+        NONE: new AssertLevel(-1),
         ALWAYS: new AssertLevel(0),
         OWN_COLL: new AssertLevel(1),
         OWN_DB: new AssertLevel(2),
diff --git a/jstests/concurrency/fsm_libs/cluster.js b/jstests/concurrency/fsm_libs/cluster.js
index 6e6738e..464acfb 100644
--- a/jstests/concurrency/fsm_libs/cluster.js
+++ b/jstests/concurrency/fsm_libs/cluster.js
@@ -146,11 +146,6 @@ var Cluster = function(options) {
                     oplogSize: 1024,
                     verbose: verbosityLevel
                 };
-                shardConfig.rsOptions = {
-                    // Specify a longer timeout for replSetInitiate, to ensure that
-                    // slow hardware has sufficient time for file pre-allocation.
-                    initiateTimeout: REPL_SET_INITIATE_TIMEOUT_MS,
-                }
             }
 
             st = new ShardingTest(shardConfig);
diff --git a/jstests/concurrency/fsm_libs/runner.js b/jstests/concurrency/fsm_libs/runner.js
index e3240a3..e6beb78 100644
--- a/jstests/concurrency/fsm_libs/runner.js
+++ b/jstests/concurrency/fsm_libs/runner.js
@@ -595,9 +595,6 @@ var runner = (function() {
         var bgThreadMgr = new ThreadManager(clusterOptions, { composed: false });
 
         var cluster = new Cluster(clusterOptions);
-        if (cluster.isSharded()) {
-            useDropDistLockFailPoint(cluster, clusterOptions);
-        }
         cluster.setup();
 
         // Clean up the state left behind by other tests in the concurrency suite
diff --git a/jstests/concurrency/fsm_libs/worker_thread.js b/jstests/concurrency/fsm_libs/worker_thread.js
index 8fb4130..89ec6f2 100644
--- a/jstests/concurrency/fsm_libs/worker_thread.js
+++ b/jstests/concurrency/fsm_libs/worker_thread.js
@@ -23,7 +23,8 @@ var workerThread = (function() {
         var myDB;
         var configs = {};
 
-        globalAssertLevel = args.globalAssertLevel;
+        // Hack to disable assertions in the workloads.
+        globalAssertLevel = AssertLevel.NONE;
 
         try {
             if (Cluster.isStandalone(args.clusterOptions)) {
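
Assuming the diff above is saved to a file in a 3.2 source checkout (legacy-30.patch is just an illustrative name) and that the 3.0 mongod and mongos binaries are the ones picked up on the PATH, the suite can then be re-run with the same resmoke invocation as in Steps To Reproduce:

$ # sketch: the patch file name is hypothetical; 3.0 binaries assumed to be on the PATH
$ git apply legacy-30.patch
$ python buildscripts/resmoke.py --executor concurrency_sharded jstests/concurrency/fsm_all_sharded_replication_legacy_config_servers.js --storageEngine wiredTiger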
