[SERVER-5899] Timeout waiting for database name to replicate to slave (master/slave) Created: 22/May/12  Updated: 06/Dec/22  Resolved: 12/Feb/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ian Whalen (Inactive) Assignee: Backlog - Replication Team
Resolution: Done Votes: 0
Labels: tertiary
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Windows 64-bit


Issue Links:
Related
is related to SERVER-4852 Master/Slave initial sync doesn't wor... Closed
Assigned Teams:
Replication
Operating System: ALL
Participants:

 Description   

m31001| Tue May 22 08:30:51 [replslave] repl ASSERTION failed : syncedTo < nextOpTime
 m31001| Tue May 22 08:30:51 [replslave] repl syncTo:     May 22 08:30:36 4fbb86ec:1
 m31001| Tue May 22 08:30:51 [replslave] repl nextOpTime: May 22 08:30:36 4fbb86ec:1
 m31001| Tue May 22 08:30:51 [replslave]   Assertion failure false src\mongo\db\repl.cpp 914
 m31001| Tue May 22 08:30:51 [replslave] replMain AssertionException assertion src\mongo\db\repl.cpp:914
 m31001| Tue May 22 08:30:51 [replslave] repl: sleep 60 sec before next pass
assert.soon failed: function () {
    return s.getDB("zzz").c.count();
}, msg:undefined
Error("Printing Stack Trace")@:0
()@src/mongo/shell/utils.js:37
("assert.soon failed: function () {\n    return s.getDB(\"zzz\").c.count();\n}, msg:undefined")@src/mongo/shell/utils.js:58
((function () {return s.getDB("zzz").c.count();}))@src/mongo/shell/utils.js:167
checkTwice("dbNAme")@D:\10gen\buildslaves\mongo\Windows_64bit\mongo\jstests\repl\dbcase.js:41
@D:\10gen\buildslaves\mongo\Windows_64bit\mongo\jstests\repl\dbcase.js:96
 
Tue May 22 08:31:11 uncaught exception: assert.soon failed: function () {
    return s.getDB("zzz").c.count();
}, msg:undefined
failed to load: D:\10gen\buildslaves\mongo\Windows_64bit\mongo\jstests\repl\dbcase.js

http://buildbot.mongodb.org/builders/Windows%2064-bit/builds/4723/steps/test_10/logs/stdio



 Comments   
Comment by Eric Milkie [ 12/Feb/18 ]

master/slave is being removed.

Comment by Andy Schwerin [ 23/May/12 ]

I'm not sure that this bug is a regression, though it is certainly not desirable. Since when the bug manifests, the slave stops replicating forever, I propose that we try to make the failure fatal to the slave. It occurs during "initial sync", so restarting the slave is probably the best and safest way to get to a clean state, anyways.

Comment by Andy Schwerin [ 23/May/12 ]

From the logs, the master drops "dbNAme". The slave tries to do an initial sync of "dbNAme", causing "dbNAme" to be reopened in a read lock, which fails because the database was just deleted (needs investigation). This causes the operation on conn22 of the master to fail, causing the replslave thread on the slave to raise a SyncException and sleep 10 seconds. At this point, it has already observed a sync time (syncedTo) for an operation it never applied.

m31000| Tue May 22 08:30:40 [conn1] dropDatabase dbNAme
 m31000| Tue May 22 08:30:41 [conn1] removeJournalFiles
 m31000| Tue May 22 08:30:41 [conn22] opening db:  dbNAme
 m31000| Tue May 22 08:30:41 [conn22] Assertion: 15927:can't open database in a read lock. if db was just closed, consider retrying the query. might otherwise indicate an internal error
 m31000| Tue May 22 08:30:41 [conn22] assertion 15927 can't open database in a read lock. if db was just closed, consider retrying the query. might otherwise indicate an internal error ns:dbNAme.system.namespaces query:{}
 m31000| Tue May 22 08:30:41 [conn22] { $err: "can't open database in a read lock. if db was just closed, consider retrying the query. might otherwise indicate an internal error", code: 15927 }
 m31000| Tue May 22 08:30:41 [conn22] query dbNAme.system.namespaces ntoreturn:0 keyUpdates:0 exception: can't open database in a read lock. if db was just closed, consider retrying the query. might otherwise indicate an internal error code:15927 reslen:176 564ms
 m31001| Tue May 22 08:30:41 [replslave]  dbNAme resync of dbNAme from 127.0.0.1:31000 failed query failed dbNAme.system.namespaces
 m31001| Tue May 22 08:30:41 [replslave] caught SyncException
 m31001| Tue May 22 08:30:41 [replslave] repl: sleep 10 sec before next pass

When the slave starts trying to sync again, it notices that the initial sync of "dbNAme" failed, and tries to do it again. At this point, it sees the wrong value of syncedTo, and raises a verify() exception, which puts it to sleep for 60 seconds, causing this test to fail. However, the real problem is that the sync will never succeed at this point. The slave must be restarted. This should probably, as a result, be an fassert().

 m31001| Tue May 22 08:30:51 [replslave] repl: syncing from host:127.0.0.1:31000
 m31001| Tue May 22 08:30:51 [replslave] An earlier initial clone of 'dbNAme' did not complete, now resyncing.
 m31001| Tue May 22 08:30:51 [replslave] resync: dropping database dbNAme
 m31001| Tue May 22 08:30:51 [replslave] info journal _rotate called inside dbMutex - ok but should be somewhat rare
 m31001| Tue May 22 08:30:51 [replslave] removeJournalFiles
 m31001| Tue May 22 08:30:51 [replslave] resync: cloning database dbNAme to get an initial copy
 m31000| Tue May 22 08:30:51 [initandlisten] connection accepted from 127.0.0.1:54186 #23 (3 connections now open)
 m31001| Tue May 22 08:30:51 [replslave] build index dbNAme.c { _id: 1 }
 m31001| Tue May 22 08:30:51 [replslave] 	 fastBuildIndex dupsToDrop:0
 m31001| Tue May 22 08:30:51 [replslave] build index done.  scanned 1 total records. 0 secs
 m31001| Tue May 22 08:30:51 [replslave] resync: done with initial clone for db: dbNAme
 m31000| Tue May 22 08:30:51 [conn23] end connection 127.0.0.1:54186 (2 connections now open)
 m31001| Tue May 22 08:30:51 [replslave] repl ASSERTION failed : syncedTo < nextOpTime
 m31001| Tue May 22 08:30:51 [replslave] repl syncTo:     May 22 08:30:36 4fbb86ec:1
 m31001| Tue May 22 08:30:51 [replslave] repl nextOpTime: May 22 08:30:36 4fbb86ec:1
 m31001| Tue May 22 08:30:51 [replslave]   Assertion failure false src\mongo\db\repl.cpp 914
 m31001| Tue May 22 08:30:51 [replslave] replMain AssertionException assertion src\mongo\db\repl.cpp:914
 m31001| Tue May 22 08:30:51 [replslave] repl: sleep 60 sec before next pass

Comment by Andy Schwerin [ 23/May/12 ]

I believe that both of these bugs occur roughly for the reason Aaron outlines in SERVER-4852. The master races closing (in this issue dropping) a database against the slave performing its initial sync.

Comment by Andy Schwerin [ 22/May/12 ]

The "assert.soon()" in the test fails because replication on the slave pauses for 60 seconds after a non-fatal assertion failure. Checking to see if this assertion failure is "expected behavior", in which case assert.soon() needs to wait longer, or demonstrates an error cropping up in master-slave replication.

Generated at Thu Feb 08 03:10:11 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.