I'm not sure this bug is a regression, though it is certainly not desirable. Because the slave stops replicating forever once the bug manifests, I propose that we make the failure fatal to the slave. The failure occurs during "initial sync", so restarting the slave is probably the best and safest way to get back to a clean state anyway.
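As a minimal sketch of what "fatal" could look like here (this paraphrases the log output below, it is not the actual repl.cpp code, and the assertion id is a placeholder):

    // Minimal sketch, not the actual src/mongo/db/repl.cpp code. Today the
    // check is a non-fatal assertion that throws, sleeps the replslave
    // thread for 60 seconds, and retries forever. fassert() aborts the
    // process instead, forcing the restart that is needed anyway to get
    // back to a clean state. The assertion id 16000 is a placeholder.
    fassert(16000, syncedTo < nextOpTime);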
From the logs: the master drops "dbNAme". The slave then attempts an initial sync of "dbNAme", which causes "dbNAme" to be reopened while a read lock is held; this fails because the database was just deleted (why it is reopened in a read lock needs investigation; see the sketch after the log excerpt below). The failed query causes the operation on conn22 of the master to return an error, which in turn causes the replslave thread on the slave to raise a SyncException and sleep for 10 seconds. At this point, the slave has already recorded a sync time (syncedTo) for an operation it never applied.
m31000| Tue May 22 08:30:40 [conn1] dropDatabase dbNAme
m31000| Tue May 22 08:30:41 [conn1] removeJournalFiles
m31000| Tue May 22 08:30:41 [conn22] opening db: dbNAme
m31000| Tue May 22 08:30:41 [conn22] Assertion: 15927:can't open database in a read lock. if db was just closed, consider retrying the query. might otherwise indicate an internal error
m31000| Tue May 22 08:30:41 [conn22] assertion 15927 can't open database in a read lock. if db was just closed, consider retrying the query. might otherwise indicate an internal error ns:dbNAme.system.namespaces query:{}
m31000| Tue May 22 08:30:41 [conn22] { $err: "can't open database in a read lock. if db was just closed, consider retrying the query. might otherwise indicate an internal error", code: 15927 }
m31000| Tue May 22 08:30:41 [conn22] query dbNAme.system.namespaces ntoreturn:0 keyUpdates:0 exception: can't open database in a read lock. if db was just closed, consider retrying the query. might otherwise indicate an internal error code:15927 reslen:176 564ms
m31001| Tue May 22 08:30:41 [replslave] dbNAme resync of dbNAme from 127.0.0.1:31000 failed query failed dbNAme.system.namespaces
m31001| Tue May 22 08:30:41 [replslave] caught SyncException
m31001| Tue May 22 08:30:41 [replslave] repl: sleep 10 sec before next pass
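For context on the conn22 failure: opening a database requires a write lock, and the query path here holds only a read lock, so once the database has been dropped the query cannot transparently re-open it. A hypothetical sketch of that check follows; the helper names (getDbForQuery, dbIfOpen, openDatabase, isReadLocked) are invented for illustration, and only the uassert code and message are taken from the log.

    // Hypothetical sketch of the master-side check behind error 15927.
    // Helper names are invented; only the uassert message/code are real.
    Database* getDbForQuery(const char* dbname) {
        Database* db = dbIfOpen(dbname); // returns NULL if not open
        if (db == NULL) {
            // Re-opening a just-dropped database needs a write lock; the
            // query on dbNAme.system.namespaces holds only a read lock.
            uassert(15927,
                    "can't open database in a read lock. if db was just "
                    "closed, consider retrying the query. might otherwise "
                    "indicate an internal error",
                    !isReadLocked());
            db = openDatabase(dbname); // only reached under a write lock
        }
        return db;
    }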
When the slave next tries to sync, it notices that the initial clone of "dbNAme" did not complete and resyncs the database. It then sees the stale value of syncedTo and raises a verify() exception, which puts the thread to sleep for 60 seconds and causes this test to fail. The real problem, however, is that the sync will never succeed at this point (see the sketch after the log excerpt below); the slave must be restarted. As a result, this should probably be an fassert().
m31001| Tue May 22 08:30:51 [replslave] repl: syncing from host:127.0.0.1:31000
m31001| Tue May 22 08:30:51 [replslave] An earlier initial clone of 'dbNAme' did not complete, now resyncing.
m31001| Tue May 22 08:30:51 [replslave] resync: dropping database dbNAme
m31001| Tue May 22 08:30:51 [replslave] info journal _rotate called inside dbMutex - ok but should be somewhat rare
m31001| Tue May 22 08:30:51 [replslave] removeJournalFiles
m31001| Tue May 22 08:30:51 [replslave] resync: cloning database dbNAme to get an initial copy
m31000| Tue May 22 08:30:51 [initandlisten] connection accepted from 127.0.0.1:54186 #23 (3 connections now open)
m31001| Tue May 22 08:30:51 [replslave] build index dbNAme.c { _id: 1 }
m31001| Tue May 22 08:30:51 [replslave] fastBuildIndex dupsToDrop:0
m31001| Tue May 22 08:30:51 [replslave] build index done. scanned 1 total records. 0 secs
m31001| Tue May 22 08:30:51 [replslave] resync: done with initial clone for db: dbNAme
m31000| Tue May 22 08:30:51 [conn23] end connection 127.0.0.1:54186 (2 connections now open)
m31001| Tue May 22 08:30:51 [replslave] repl ASSERTION failed : syncedTo < nextOpTime
m31001| Tue May 22 08:30:51 [replslave] repl syncTo: May 22 08:30:36 4fbb86ec:1
m31001| Tue May 22 08:30:51 [replslave] repl nextOpTime: May 22 08:30:36 4fbb86ec:1
m31001| Tue May 22 08:30:51 [replslave] Assertion failure false src\mongo\db\repl.cpp 914
m31001| Tue May 22 08:30:51 [replslave] replMain AssertionException assertion src\mongo\db\repl.cpp:914
m31001| Tue May 22 08:30:51 [replslave] repl: sleep 60 sec before next pass
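The log above shows why the slave can never make progress from here: syncedTo and nextOpTime are identical (both May 22 08:30:36 4fbb86ec:1), so the strict syncedTo < nextOpTime invariant fails on every pass. A simplified illustration, with hypothetical helper names, of the loop the replslave thread is stuck in:

    // Simplified illustration of the stuck state; loadSyncedTo() and
    // firstOpTimeFromMaster() are hypothetical stand-ins for the actual
    // repl.cpp logic. Because syncedTo was advanced for an op that was
    // never applied, the first op the master returns has the same optime,
    // the verify() below throws a non-fatal AssertionException, and the
    // thread sleeps 60 seconds and repeats, forever.
    OpTime syncedTo   = loadSyncedTo();
    OpTime nextOpTime = firstOpTimeFromMaster();
    verify(syncedTo < nextOpTime); // fails every pass; never recovers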
The assert.soon() in the test fails because replication on the slave pauses for 60 seconds after this non-fatal assertion failure. I am checking whether the assertion failure is "expected behavior", in which case assert.soon() simply needs to wait longer, or demonstrates a real error in master-slave replication.