[SERVER-10344] Race condition when starting up new master/slave cluster. Was: repl4.js failing on Linux 64-bit Weekly Slow Tests Created: 26/Jul/13  Updated: 14/Apr/16  Resolved: 26/Aug/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matt Kangas Assignee: Unassigned
Resolution: Done Votes: 0
Labels: tertiary
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-10090 repl4.js failing on Nightly Linux 32-... Closed
Related
related to SERVER-10374 repl_sync_only_db_with_special_chars.... Closed
Operating System: ALL
Steps To Reproduce:

time scons --dd --sharedclient all
python buildscripts/cleanbb.py
./buildscripts/smoke.py --mode=files --auth jstests/repl/repl4.js

Participants:

 Description   

jstests/repl/repl4.js was failing when run with auth enabled.

I believe this is actually a timing bug in master/slave replication, not directly an auth issue. It seems to have something to do with setting up a slave that only syncs a single DB from the master.

If you put a "sleep(5000)" after line 13 in repl4.js (the line that starts the primary), then the test passes. Likewise, if you switch the order of lines 19 and 20 (the lines that write to two DBs, one that's synced and one that isn't), then the test also passes.

When the test fails, this shows up in the logs of the slave:

m31001| Fri Jul 26 15:30:01.972 [replslave] repl:   nextOpTime Jul 26 15:30:01 51f2ce39:1 > syncedTo Dec 31 19:00:00 0:0
m31001| repl:   time diff: 1374867001sec
m31001| repl:   tailing: 0
m31001| repl:   data too stale, halting replication
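The "data too stale" line can be read directly from the optimes in the log: the slave's recorded `syncedTo` is still the zero optime (it never synced anything), while the next op the master can hand it is already at `51f2ce39:1`, so the slave concludes it has missed entries and halts. The following is a minimal sketch of that comparison with illustrative names (not the server's actual C++ implementation); an optime is modeled as `{t: seconds, i: increment}`:

```javascript
// Hedged sketch of the master/slave staleness decision. Names and the
// optime shape {t, i} are illustrative, not taken from the server source.
function optimeGreater(a, b) {
  // Optimes compare lexicographically: seconds first, then increment.
  return a.t > b.t || (a.t === b.t && a.i > b.i);
}

function checkStale(nextOpTime, syncedTo) {
  if (optimeGreater(nextOpTime, syncedTo)) {
    // The oldest op the master can supply is already past our sync point,
    // so the gap can never be filled: halt replication.
    return { stale: true, diffSec: nextOpTime.t - syncedTo.t };
  }
  return { stale: false, diffSec: 0 };
}

// From the log above: nextOpTime 51f2ce39:1 vs. syncedTo 0:0 (never synced).
const res = checkStale({ t: 0x51f2ce39, i: 1 }, { t: 0, i: 0 });
// res.stale === true, res.diffSec === 1374867001, matching "time diff" above
```

Note that the 1374867001-second "time diff" is just the hex timestamp `51f2ce39` measured against epoch zero, which is why a freshly started slave that loses this race looks maximally stale.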



ORIGINAL DESCRIPTION:

Linux 64-bit Weekly Slow Tests Build #256 July 14 rev f204f7f

Linux 64-bit Weekly Slow Tests Build #261 July 21 rev 9bf7075

Linux 64-bit Weekly Slow Tests Build #262 July 23 rev 37f7f30

(#263 was interrupted)

Linux 64-bit Weekly Slow Tests Build #264 July 25 rev 25395ab

All of these failed with a final error similar to SERVER-10090 (which affects only 32-bit boxes):

Thu Jul 25 14:21:13.822 assert.soon failed: function () { 
                return s.getDB( db )[ coll ].find().count() == count; 
                }, msg:undefined at src/mongo/shell/assert.js:7

But prior to that failure, many errors like the following occur.

assert.soon failed: function () {
            // Set authenticated to stop an infinite recursion from getDB calling
            // back into authenticate.
            conn.authenticated = true;
            print ("Authenticating to admin database as " +
                   jsTestOptions().adminUser + " with mechanism " +
                   DB.prototype._defaultAuthenticationMechanism +
                   " on connection: " + conn);
            conn.authenticated = conn.getDB('admin').auth({
                user: jsTestOptions().adminUser,
                pwd: jsTestOptions().adminPassword
            });
            return conn.authenticated;
        }, msg:Authenticating connection: connection to 127.0.0.1:31001
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at doassert (src/mongo/shell/assert.js:6:5)
    at Function.assert.soon (src/mongo/shell/assert.js:174:60)
    at Object.jsTest.authenticate (src/mongo/shell/utils.js:437:16)
    at Mongo.getDB (src/mongo/shell/mongo.js:38:16)
    at /data/buildslaves/Linux_64bit_Weekly_Slow_Tests/mongo/jstests/repl/repl4.js:5:26
    at Function.assert.soon (src/mongo/shell/assert.js:168:17)
    at soonCount (/data/buildslaves/Linux_64bit_Weekly_Slow_Tests/mongo/jstests/repl/repl4.js:4:12)
    at doTest (/data/buildslaves/Linux_64bit_Weekly_Slow_Tests/mongo/jstests/repl/repl4.js:22:5)
    at /data/buildslaves/Linux_64bit_Weekly_Slow_Tests/mongo/jstests/repl/repl4.js:39:5

As far as I can tell, everything happening and failing here is an authentication attempt.



 Comments   
Comment by Spencer Brody (Inactive) [ 20/Oct/14 ]

Replication refactor didn't fix the problem. Assigning to Eric to re-triage.

Comment by Eric Milkie [ 28/Aug/14 ]

Before 2.8 release, we should try re-enabling these tests and see if our refactor has solved the race condition.

Comment by Andrew Morrow (Inactive) [ 22/Aug/13 ]

This issue was also causing repl_sync_only_db_with_special_chars.js to fail (see SERVER-10374). When this issue is resolved, the test should be re-enabled (see the commit linked in SERVER-10374 for details).

Comment by auto [ 22/Aug/13 ]

Author: Andrew Morrow (acmorrow) <acm@10gen.com>

Message: SERVER-10374 Disable sync only with special chars test due to SERVER-10344
Branch: master
https://github.com/mongodb/mongo/commit/c5a876e531ef57c3b7e60ec53ef9617bc7475969

Comment by auto [ 31/Jul/13 ]

Author: Spencer T Brody (stbrody) <spencer@10gen.com>

Message: SERVER-10344 Skip repl/repl4.js because of race condition in master/slave
Branch: v2.4
https://github.com/mongodb/mongo/commit/a57449ba7630559b5a9f4f090253057d0677b8b1

Comment by auto [ 31/Jul/13 ]

Author: Spencer T Brody (stbrody) <spencer@10gen.com>

Message: SERVER-10344 Skip repl/repl4.js because of race condition in master/slave
Branch: master
https://github.com/mongodb/mongo/commit/46f9caf7217bfc411301b22ecac80a67c79611b7

Comment by Matt Kangas [ 30/Jul/13 ]

After discussing with Spencer, we've identified (a) a probable server bug, which may or may not be worth fixing, and (b) a deficiency in "repl" tests that prevents us from working around it.

The probable server bug: start up a master/slave pair of mongods, then write to the primary before the secondary is "ready" (for some definition of ready). The secondary gives up with "data too stale, halting replication".

The test deficiency: "repl" jstests need a function analogous to the "awaitReplication" function available to "replsets" tests.

Assuming we don't solve the server problem, here's how we might write "awaitReplication" for master/slave tests.

Secondaries in M/S replication do not create oplogs by default, but they do record how far they have synced in the `db.sources` collection, which could serve as the marker here. It might be sufficient to test that `db.sources.find()` returns nonzero results. Or we may need to check the `syncedTo` field of the latest record on the secondary and ensure it matches the primary's latest oplog entry.
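A hedged sketch of what such an "awaitReplication" helper for master/slave jstests might look like. The polling structure mirrors assert.soon; the two server accessors are injected as callbacks here so the sketch is self-contained — in a real helper, `getMasterOptime` would read the last entry of the master's oplog and `getSlaveSyncedTo` would read the `syncedTo` field from the slave's `db.sources` collection (hypothetical wiring, not an existing shell function):

```javascript
// Sketch of an awaitReplication analog for master/slave tests. The accessor
// callbacks stand in for real queries against the running mongods; optimes
// are modeled as {t: seconds, i: increment}. All names are illustrative.
function awaitMasterSlaveReplication(getMasterOptime, getSlaveSyncedTo,
                                     timeoutMs = 30000, intervalMs = 100) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const masterOp = getMasterOptime();
    const syncedTo = getSlaveSyncedTo();
    // Caught up once the slave's syncedTo is at least the master's latest
    // optime (compare seconds first, then increment).
    if (syncedTo &&
        (syncedTo.t > masterOp.t ||
         (syncedTo.t === masterOp.t && syncedTo.i >= masterOp.i))) {
      return true;
    }
    // The real shell helper would use sleep(intervalMs); a busy-wait keeps
    // this sketch dependency-free.
    const until = Date.now() + intervalMs;
    while (Date.now() < until) {}
  }
  throw new Error("slave did not catch up within " + timeoutMs + "ms");
}
```

In repl4.js, calling such a helper between starting the slave and issuing the first writes would close the race the same way awaitReplication does for replsets tests.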

I am now also convinced that all failures in SERVER-10090 are exactly this same failure, and I will mark that as a dup of this ticket.

Comment by auto [ 29/Jul/13 ]

Author: Spencer T Brody (stbrody) <spencer@10gen.com>

Message: SERVER-10344 Skip racy test in auth passthrough
Branch: master
https://github.com/mongodb/mongo/commit/761a59942350db8523bddc105ac5509c8aac18d0

Comment by Matt Kangas [ 26/Jul/13 ]

Confirmed that I can repro the failure on a Linux 64-bit debug build, but only with --auth enabled. As you say, I think it's really a timing issue.

The test was already disabled on Windows by Eric with the comment "skip racy test on Windows". Shall we skip this also when --auth is enabled?

Comment by Spencer Brody (Inactive) [ 26/Jul/13 ]

The commit that broke this is https://github.com/mongodb/mongo/commit/3f4169d7144fe1300129f5b0ccda1b75b8f76cef.

I believe this is actually a timing bug in master/slave replication, not directly an auth issue.
If you put a "sleep(5000)" after line 16 in repl4.js (the line that starts the secondary), then the test passes. The main problem seems to be this:

m31001| Fri Jul 26 15:30:01.972 [replslave] repl:   nextOpTime Jul 26 15:30:01 51f2ce39:1 > syncedTo Dec 31 19:00:00 0:0
m31001| repl:   time diff: 1374867001sec
m31001| repl:   tailing: 0
m31001| repl:   data too stale, halting replication

Which shows up in the log output shortly after the secondary is started.

Generated at Thu Feb 08 03:22:55 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.