[SERVER-14755] Tests that are failing due to replica set slowness Created: 30/Jul/14  Updated: 18/Aug/14  Resolved: 15/Aug/14

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 2.7.5

Type: Improvement Priority: Critical - P2
Reporter: Spencer Brody (Inactive) Assignee: Spencer Brody (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Participants:
Linked BF Score: 0

 Description   

All the recent changes to replication code have slowed down some things in replication temporarily. Because of this there has been an increase in flaky test failures due to timing out waiting for various replication events (electing a primary is the most common). Rather than 1 ticket per failure like this, this ticket will be an umbrella ticket for all these types of failures. For each of these tests we should either remove them and replace them with unit tests, or verify that they pass consistently after we've switched fully over to the new replication code, which should speed things back up.



 Comments   
Comment by Githook User [ 01/Aug/14 ]

Author:

{u'username': u'milkie', u'name': u'Eric Milkie', u'email': u'milkie@10gen.com'}

Message: Revert "SERVER-14755 increase ReplSetTest timeouts to prevent spurious failures while using HybridCoordinator"

This reverts commit 9ec7d68a97dc54f534e95959e62cafcab38bd440.
Branch: master
https://github.com/mongodb/mongo/commit/243e7efa12db0a92129199388d03197b3415b6cf

Comment by Githook User [ 01/Aug/14 ]

Author:

{u'username': u'milkie', u'name': u'Eric Milkie', u'email': u'milkie@10gen.com'}

Message: SERVER-14442 SERVER-14755 fix repl deadlock at shutdown
Branch: master
https://github.com/mongodb/mongo/commit/5d870c10dcc796ef3fcbf9d65414988c44a767b6

Comment by Eric Milkie [ 01/Aug/14 ]

I have determined that the cause of the failures is due to mongod shutdown hanging. Since our testing harness shutdown eventually times out and kills the process without signaling a test failure, this type of problem is not immediately obvious from test logs. I'll be reverting the commit that broke this.

Comment by J Rassi [ 01/Aug/14 ]

9ec7d68a didn't help.

See the recent history (most recent first):
2ea0749 dannenberg SERVER=14443 implement buildsIndexes() in ReplicationCoordintorImpl
5dd5518 dannenberg SERVER=14443 implement isMasterForReportingPurposes() in ReplicationCoordintorImpl
89c7ad5 dannenberg SERVER-14443 implement canAcceptWritesForDatabase() and canServeReadsFor() in ReplicationCoordinatorImpl
6cc8c43 dannenerg SERVER-14443 correct modeReplSet detection in ReplicationCoordinatorImpl::getReplicationMode()
ce38e09 stbrody SERVER-14451 Implement processReplSetFreeze in ReplicationCoordinatorImpl

Notice how ce38e09c (bottom of list) is clean on replsets/sharding, and 2ea07493 (top of list) is not clean. From that, I infer that one of the first four commits is bad. mattd@10gen.com, can you double-check my work and then assess which of the four is the most likely to be the bad commit, and then perform a revert?

Note that 25-50 failures of the replsets suite came in overnight (generated from MCI's bisect attempt, which is still running), in addition to a number of failures across other test suites.

Comment by Githook User [ 31/Jul/14 ]

Author:

{u'username': u'dannenberg', u'name': u'matt dannenberg', u'email': u'matt.dannenberg@10gen.com'}

Message: SERVER-14755 increase ReplSetTest timeouts to prevent spurious failures while using HybridCoordinator
Branch: master
https://github.com/mongodb/mongo/commit/9ec7d68a97dc54f534e95959e62cafcab38bd440

Comment by J Rassi [ 31/Jul/14 ]

Bumping priority. The current volume of test noise on master from this isn't sustainable.

Comment by J Rassi [ 31/Jul/14 ]

Here are test failures that have failed due to replication-related timeouts, from the past 24 hours.

cloneDb.js: https://mci.10gen.com/ui/task/mongodb_mongo_master_osx_108_dur_off_6d66597265e158ec16dd31f823f0d474194519e7_14_07_30_03_47_06_replicasets_osx_108_dur_off

mongos_slaveok.js:
https://mci.10gen.com/ui/task/mongodb_mongo_master_osx_108_debug_b494ade9a9461d16ab3812640ca72ee9c9e4345c_14_07_31_15_00_02_aggregation_osx_108_debug

repl1.js: https://mci.10gen.com/ui/task/mongodb_mongo_master_linux_64_debug_duroff_f7df4c7dcfabbadb4fde07018f85d00bc8ba47a0_14_07_30_21_17_09_replication_linux_64_debug_duroff
https://mci.10gen.com/ui/task/mongodb_mongo_master_linux_64_b494ade9a9461d16ab3812640ca72ee9c9e4345c_14_07_31_15_00_02_replication_linux_64
https://mci.10gen.com/ui/task/mongodb_mongo_master_windows_64_2k8_debug_ce38e09ca894cd66e22f0c7d3be2ece78ba2834d_14_07_31_15_57_05_replication_windows_64_2k8_debug
https://mci.10gen.com/ui/task/mongodb_mongo_master_linux_64_debug_b494ade9a9461d16ab3812640ca72ee9c9e4345c_14_07_31_15_00_02_replication_auth_linux_64_debug

repl6.js:
https://mci.10gen.com/ui/task/mongodb_mongo_master_linux_64_debug_b494ade9a9461d16ab3812640ca72ee9c9e4345c_14_07_31_15_00_02_replication_linux_64_debug

replsetadd_profile.js: https://mci.10gen.com/ui/task/mongodb_mongo_master_osx_108_cxx11_debug_9e72a5b850f7bc31453828db522f4e0ecbfdb691_14_07_30_17_52_06_replicasets_osx_108_cxx11_debug
https://mci.10gen.com/ui/task/mongodb_mongo_master_osx_108_cxx11_debug_9e72a5b850f7bc31453828db522f4e0ecbfdb691_14_07_30_17_52_06_replicasets_osx_108_cxx11_debug

replsets_priority1.js: https://mci.10gen.com/ui/task/mongodb_mongo_master_amazon_b494ade9a9461d16ab3812640ca72ee9c9e4345c_14_07_31_15_00_02_slow1_amazon
https://mci.10gen.com/ui/task/mongodb_mongo_master_ubuntu1204_b494ade9a9461d16ab3812640ca72ee9c9e4345c_14_07_31_15_00_02_slow1_ubuntu1204
https://mci.10gen.com/ui/task/mongodb_mongo_master_solaris_64_bit_ce38e09ca894cd66e22f0c7d3be2ece78ba2834d_14_07_31_15_57_05_slow1_solaris_64_bit
https://mci.10gen.com/ui/task/mongodb_mongo_master_ubuntu1404_b494ade9a9461d16ab3812640ca72ee9c9e4345c_14_07_31_15_00_02_slow1_ubuntu1404

rollback_too_new.js: https://mci.10gen.com/ui/task/mongodb_mongo_master_osx_108_cxx11_debug_791e2e00806449a01e7dec4d9b969392e183f393_14_07_30_17_38_51_replicasets_osx_108_cxx11_debug

stepdown.js: https://mci.10gen.com/ui/task/mongodb_mongo_master_osx_108_debug_7b56f68f2b81c46231335b52921ad65ec01aebe8_14_07_30_16_05_07_replicasets_osx_108_debug

sync2.js:
https://mci.10gen.com/ui/task/mongodb_mongo_master_amazon_ce38e09ca894cd66e22f0c7d3be2ece78ba2834d_14_07_31_15_57_05_replicasets_auth_amazon
https://mci.10gen.com/ui/task/mongodb_mongo_master_enterprise_debian71_64_b494ade9a9461d16ab3812640ca72ee9c9e4345c_14_07_31_15_00_02_replicasets_auth_enterprise_debian71_64

Comment by Spencer Brody (Inactive) [ 30/Jul/14 ]

replsets/replset1.js
81676bfa36 OS X 10.8 DEBUG replicasets
https://mci.10gen.com/ui/task/mongodb_mongo_master_osx_108_debug_81676bfa36c68b1247f0e08b666e33c3e3875755_14_07_29_20_45_06_replicasets_osx_108_debug
http://buildlogs.mongodb.org/mci_0.9_osx-108-debug/builds/56696/test/replicasets_0/replset1.js

	
 m31000| 2014-07-29T17:52:28.303-0400 [rsMgr] replSet PRIMARY
2014-07-29 17:52:29 EDT	
 m31001| 2014-07-29T17:52:29.541-0400 [rsHealthPoll] replSet member bs-osx108-5:31000 is now in state PRIMARY
 m31002| 2014-07-29T17:52:29.746-0400 [rsHealthPoll] replSet member bs-osx108-5:31000 is now in state PRIMARY
2014-07-29 17:52:33 EDT	
 m31001| 2014-07-29T17:52:33.307-0400 [conn3] end connection 127.0.0.1:55651 (2 connections now open)
 m31001| 2014-07-29T17:52:33.308-0400 [initandlisten] connection accepted from 127.0.0.1:55663 #6 (3 connections now open)
2014-07-29 17:52:49 EDT	
2014-07-29T17:52:49.761-0400 [initandlisten] connection accepted from 127.0.0.1:55510 #73 (1 connection now open)
2014-07-29 17:55:00 EDT	
assert.soon failed: function () {
          replTest.getMaster(); // Reload who the current slaves are.
          var slaves = replTest.liveNodes.slaves;
          var len = slaves.length;
          var ready = true;
          for(var i=0; i<len; i++) {
              var isMaster = slaves[i].getDB("admin").runCommand({ismaster: 1});
              var arbiter = isMaster['arbiterOnly'] == undefined ? false : isMaster['arbiterOnly'];
              ready = ready && ( isMaster['secondary'] || arbiter );
          }
          return ready;
      }, msg:Awaiting secondaries
Error: assert.soon failed: function () {
          replTest.getMaster(); // Reload who the current slaves are.
          var slaves = replTest.liveNodes.slaves;
          var len = slaves.length;
          var ready = true;
          for(var i=0; i<len; i++) {
              var isMaster = slaves[i].getDB("admin").runCommand({ismaster: 1});
              var arbiter = isMaster['arbiterOnly'] == undefined ? false : isMaster['arbiterOnly'];
              ready = ready && ( isMaster['secondary'] || arbiter );
          }
          return ready;
      }, msg:Awaiting secondaries
    at Error (<anonymous>)
    at doassert (src/mongo/shell/assert.js:11:14)
    at Function.assert.soon (src/mongo/shell/assert.js:180:60)
    at ReplSetTest.awaitSecondaryNodes (src/mongo/shell/replsettest.js:351:10)
    at ReplSetTest.initiate (src/mongo/shell/replsettest.js:451:10)
    at doTest (/data/mci/shell/src/jstests/replsets/replset1.js:22:14)
    at /data/mci/shell/src/jstests/replsets/replset1.js:139:1
2014-07-29T17:55:00.080-0400 Error: assert.soon failed: function () {
          replTest.getMaster(); // Reload who the current slaves are.
          var slaves = replTest.liveNodes.slaves;
          var len = slaves.length;
          var ready = true;
          for(var i=0; i<len; i++) {
              var isMaster = slaves[i].getDB("admin").runCommand({ismaster: 1});
              var arbiter = isMaster['arbiterOnly'] == undefined ? false : isMaster['arbiterOnly'];
              ready = ready && ( isMaster['secondary'] || arbiter );
          }
          return ready;
      }, msg:Awaiting secondaries
    at Error (<anonymous>)
    at doassert (src/mongo/shell/assert.js:11:14)
    at Function.assert.soon (src/mongo/shell/assert.js:180:60)
    at ReplSetTest.awaitSecondaryNodes (src/mongo/shell/replsettest.js:351:10)
    at ReplSetTest.initiate (src/mongo/shell/replsettest.js:451:10)
    at doTest (/data/mci/shell/src/jstests/replsets/replset1.js:22:14)
    at /data/mci/shell/src/jstests/replsets/replset1.js:139:1 at src/mongo/shell/assert.js:13
failed to load: /data/mci/shell/src/jstests/replsets/replset1.js

Generated at Thu Feb 08 03:35:53 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.