[SERVER-13435] dropdb_race.js in small_oplog (causes replication to halt on slave) Created: 25/Mar/14  Updated: 11/Jul/16  Resolved: 01/Apr/14

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 2.6.0-rc3

Type: Bug Priority: Major - P3
Reporter: J Rassi Assignee: Eric Milkie
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-13351 Slaves halt replication if oplog getm... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

444bb62c24 / Windows 64-bit 2008R2+ / jsCore_small_oplog
https://mci.10gen.com/ui/task_log_raw/mongodb_mongo_v2.6_windows_64_2k8_444bb62c24786a65334f4be214d1154dde00c71b_14_03_24_14_29_20_jsCore_small_oplog_windows_64_test?type=T
http://buildlogs.mongodb.org/mci_0.9_windows_64_2k8/builds/42921/test/jsCore_small_oplog_0/dropdb_race.js

In the small_oplog suite, this test issues the closeAllDatabases command on the master while the slave's getmore into oplog.rs is yielding. When the getmore returns from the yield, the dbtemprelease() destructor acquires a Context on the local database, which attempts to re-open the local database (which had been closed during the yield); this trips an massert, since opening a database requires a write lock. The slave gets back an error from the getmore call, and halts replication on the slave entirely (the "autoresync" server option is not set, so replication is never re-attempted). This prevents any further tests in the suite from running, as small_oplog waits for the slave to catch up to the master inbetween tests.

Assuming the above analysis is correct, possible remedies include any of:

  • make closeAllDatabases not close "local"
  • make dbtemprelease detect this failure
  • turn on autoresync
  • remove this test from the small_oplog suite

Test log excerpt:

2014-03-24T16:43:11.561+0000 [conn987] build index on: jstests_dur_droprace.foo properties: { v: 1, key: { x: 1.0 }, name: "x_1", ns: "jstests_dur_droprace.foo" }
2014-03-24T16:43:11.561+0000 [conn987] 	 building index using bulk method
2014-03-24T16:43:11.561+0000 [conn987] build index done.  scanned 3 total records. 0 secs
2014-03-24T16:43:11.562+0000 [conn987] DatabaseHolder::closeAll path:/data/db/sconsTests
2014-03-24T16:43:11.793+0000 [conn987] command admin.$cmd command: closeAllDatabases { closeAllDatabases: 1.0 } keyUpdates:0 numYields:0 locks(micros) W:231463 reslen:436 231ms
2014-03-24T16:43:11.793+0000 [conn4] opening db:  local
2014-03-24T16:43:11.793+0000 [conn4] Assertion: 15927:can't open database in a read lock. if db was just closed, consider retrying the query. might otherwise indicate an internal error
2014-03-24T16:43:12.690+0000 [conn4] mongod.exe    ...\src\mongo\util\stacktrace.cpp(169)                           mongo::printStackTrace+0x43
2014-03-24T16:43:12.690+0000 [conn4] mongod.exe    ...\src\mongo\util\log.cpp(158)                                  mongo::logContext+0x9c
2014-03-24T16:43:12.690+0000 [conn4] mongod.exe    ...\src\mongo\util\assert_util.cpp(183)                          mongo::msgasserted+0xfb
2014-03-24T16:43:12.690+0000 [conn4] mongod.exe    ...\src\mongo\db\catalog\database_holder.cpp(65)                 mongo::DatabaseHolder::getOrCreate+0x34e
2014-03-24T16:43:12.690+0000 [conn4] mongod.exe    ...\src\mongo\db\client.cpp(357)                                 mongo::Client::Context::_finishInit+0x85
2014-03-24T16:43:12.690+0000 [conn4] mongod.exe    ...\src\mongo\db\db.h(70)                                        mongo::dbtemprelease::~dbtemprelease+0x49
2014-03-24T16:43:12.691+0000 [conn4] mongod.exe    ...\src\mongo\db\clientcursor.cpp(225)                           mongo::ClientCursor::staticYield+0x29b
2014-03-24T16:43:12.691+0000 [conn4] mongod.exe    ...\src\mongo\db\query\plan_executor.cpp(83)                     mongo::PlanExecutor::getNext+0xf7
2014-03-24T16:43:12.691+0000 [conn4] mongod.exe    ...\src\mongo\db\query\new_find.cpp(204)                         mongo::newGetMore+0x413
2014-03-24T16:43:12.691+0000 [conn4] mongod.exe    ...\src\mongo\db\instance.cpp(730)                               mongo::receivedGetMore+0x67b
2014-03-24T16:43:12.691+0000 [conn4] mongod.exe    ...\src\mongo\db\instance.cpp(436)                               mongo::assembleResponse+0x32c
2014-03-24T16:43:12.691+0000 [conn4] mongod.exe    ...\src\mongo\db\db.cpp(204)                                     mongo::MyMessageHandler::process+0x111
2014-03-24T16:43:12.691+0000 [conn4] mongod.exe    ...\src\mongo\util\net\message_server_port.cpp(210)              mongo::PortMessageServer::handleIncomingMsg+0x677
2014-03-24T16:43:12.691+0000 [conn4] mongod.exe    ...\src\third_party\boost\libs\thread\src\win32\thread.cpp(185)  boost::`anonymous namespace'::thread_start_function+0x21
2014-03-24T16:43:12.691+0000 [conn4] mongod.exe    f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(314)      _callthreadstartex+0x17
2014-03-24T16:43:12.691+0000 [conn4] mongod.exe    f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(292)      _threadstartex+0x7f
2014-03-24T16:43:12.691+0000 [conn4] kernel32.dll                                                                   BaseThreadInitThunk+0xd
2014-03-24T16:43:12.691+0000 [conn4] local.oplog.$main 
2014-03-24T16:43:12.692+0000 [conn987] note: not profiling because db went away - probably a close on: testing
2014-03-24T16:43:12.692+0000 [conn4] getmore local.oplog.$main cursorid:18746693299 ntoreturn:0 keyUpdates:0 exception: can't open database in a read lock. if db was just closed, consider retrying the query. might otherwise indicate an internal error code:15927 numYields:5 locks(micros) r:899043 nreturned:1 reslen:176 10025ms
2014-03-24T16:43:12.698+0000 [conn987] dropDatabase jstests_dur_droprace starting
2014-03-24T16:43:12.745+0000 [conn987] removeJournalFiles
2014-03-24T16:43:12.764+0000 [conn987] dropDatabase jstests_dur_droprace finished
2014-03-24T16:43:12.766+0000 [FileAllocator] allocating new datafile /data/db/sconsTests\jstests_dur_droprace.ns, filling with zeroes...
2014-03-24T16:43:13.880+0000 [FileAllocator] done allocating datafile /data/db/sconsTests\jstests_dur_droprace.ns, size: 16MB,  took 1.114 secs
2014-03-24T16:43:13.881+0000 [FileAllocator] allocating new datafile /data/db/sconsTests\jstests_dur_droprace.0, filling with zeroes...
2014-03-24T16:43:19.602+0000 [FileAllocator] done allocating datafile /data/db/sconsTests\jstests_dur_droprace.0, size: 64MB,  took 5.719 secs
2014-03-24T16:43:19.603+0000 [conn987] build index on: jstests_dur_droprace.foo properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "jstests_dur_droprace.foo" }
2014-03-24T16:43:19.603+0000 [conn987] 	 added index to empty collection
2014-03-24T16:43:19.604+0000 [conn987] insert jstests_dur_droprace.foo query: { _id: ObjectId('533060a05e3a4439beec0f21') } ninserted:1 keyUpdates:0 numYields:0 locks(micros) w:6837265 6837ms
2014-03-24T16:43:19.604+0000 [conn987] command jstests_dur_droprace.$cmd command: insert { insert: "foo", documents: [ { _id: ObjectId('533060a05e3a4439beec0f21') } ], ordered: true } keyUpdates:0 numYields:0 locks(micros) w:55 reslen:56 6837ms
2014-03-24T16:43:19.611+0000 [conn987] build index on: jstests_dur_droprace.foo properties: { v: 1, key: { x: 1.0 }, name: "x_1", ns: "jstests_dur_droprace.foo" }
2014-03-24T16:43:19.611+0000 [conn987] 	 building index using bulk method
2014-03-24T16:43:19.611+0000 [conn987] build index done.  scanned 3 total records. 0 secs
2014-03-24T16:43:19.614+0000 [conn987] dropDatabase jstests_dur_droprace starting
2014-03-24T16:43:19.666+0000 [conn987] removeJournalFiles
2014-03-24T16:43:19.682+0000 [conn987] dropDatabase jstests_dur_droprace finished
2014-03-24T16:43:19.684+0000 [FileAllocator] allocating new datafile /data/db/sconsTests\jstests_dur_droprace.ns, filling with zeroes...
2014-03-24T16:43:12.692+0000 [replslave] op: { $err: "can't open database in a read lock. if db was just closed, consider retrying the query. might otherwise indicate an internal error", code: 15927 }
2014-03-24T16:43:12.692+0000 [replslave] halting replication
2014-03-24T16:43:12.692+0000 [replslave] caught SyncException
2014-03-24T16:43:12.692+0000 [replslave] repl: sleep 10 sec before next pass
2014-03-24T16:43:22.319+0000 [FileAllocator] done allocating datafile /data/db/sconsTests\jstests_dur_droprace.ns, size: 16MB,  took 2.633 secs
2014-03-24T16:43:22.693+0000 [replslave] all sources dead: sync error: no ts found querying remote oplog record, sleeping for 5 seconds
2014-03-24T16:43:27.694+0000 [replslave] all sources dead: sync error: no ts found querying remote oplog record, sleeping for 5 seconds

Suite log excerpt:

 [2014/03/24 16:43:02.400]  *******************************************
 [2014/03/24 16:43:02.404]          Test : drop_index.js ...
 [2014/03/24 16:43:02.404]       Command : ./mongo.exe --port 27999 --authenticationMechanism MONGODB-CR --writeMode commands --eval testingReplication = true; D:\data\mci\git@github.commongodb\mongo.git\mongodb-mongo-v2.6\jstests\core\drop_index.js
 [2014/03/24 16:43:02.404]          Date : Mon Mar 24 16:43:02 2014
 [2014/03/24 16:43:02.502]                 (output suppressed; see http://buildlogs.mongodb.org/build/53305ef0d2a60f71ad0009ef/test/53306096d2a60f732a0002da/)
 [2014/03/24 16:43:02.620]                   219.9998 ms
 [2014/03/24 16:43:02.704]
 [2014/03/24 16:43:02.704] waiting for slave (28000) to catch up to master (27999)
 [2014/03/24 16:43:02.704] caught up!
 [2014/03/24 16:43:02.704] replication ok for 9750 collections
 [2014/03/24 16:43:02.704] skipping D:\data\mci\git@github.commongodb\mongo.git\mongodb-mongo-v2.6\jstests\core\dropdb.js
 [2014/03/24 16:43:02.708] waiting for slave (28000) to catch up to master (27999)
 [2014/03/24 16:43:02.708] caught up!
 [2014/03/24 16:43:02.708] replication ok for 9878 collections
 [2014/03/24 16:43:02.708]  *******************************************
 [2014/03/24 16:43:02.708]          Test : dropdb_race.js ...
 [2014/03/24 16:43:02.708]       Command : ./mongo.exe --port 27999 --authenticationMechanism MONGODB-CR --writeMode commands --eval testingReplication = true; D:\data\mci\git@github.commongodb\mongo.git\mongodb-mongo-v2.6\jstests\core\dropdb_race.js
 [2014/03/24 16:43:02.708]          Date : Mon Mar 24 16:43:02 2014
 [2014/03/24 16:43:02.817]                 (output suppressed; see http://buildlogs.mongodb.org/build/53305ef0d2a60f71ad0009ef/test/53306096d2a60f737900008e/)
 [2014/03/24 16:44:05.459]                     1.0459 minutes
 [2014/03/24 16:49:10.494] timed out
 [2014/03/24 16:49:10.494] timed out
 [2014/03/24 16:49:10.494]
 [2014/03/24 16:49:16.531] 154 tests succeeded
 [2014/03/24 16:49:16.531] 591 tests didn't get run
 [2014/03/24 16:49:16.534] replication ok for 9878 collections
 [2014/03/24 16:49:16.624] timed out
 [2014/03/24 16:49:16.624] Traceback (most recent call last):
 [2014/03/24 16:49:16.624]   File "buildscripts/smoke.py", line 1267, in <module>
 [2014/03/24 16:49:16.624]     main()
 [2014/03/24 16:49:16.624]   File "buildscripts/smoke.py", line 1251, in main
 [2014/03/24 16:49:16.624]     run_tests(tests)
 [2014/03/24 16:49:16.624]   File "buildscripts/smoke.py", line 696, in run_tests
 [2014/03/24 16:49:16.624]     master.wait_for_repl()
 [2014/03/24 16:49:16.624]   File "buildscripts/smoke.py", line 306, in wait_for_repl
 [2014/03/24 16:49:16.627]     Connection(port=self.port).testing.smokeWait.insert({}, w=2, wtimeout=5*60*1000)
 [2014/03/24 16:49:16.627]   File "C:\Python27\lib\site-packages\pymongo\collection.py", line 362, in insert
 [2014/03/24 16:49:16.627]     self.database.connection)
 [2014/03/24 16:49:16.627]   File "C:\Python27\lib\site-packages\pymongo\message.py", line 248, in _do_batched_insert
 [2014/03/24 16:49:16.627]     client._send_message(_insert_message(EMPTY.join(data), safe), safe)
 [2014/03/24 16:49:16.627]   File "C:\Python27\lib\site-packages\pymongo\mongo_client.py", line 969, in _send_message
 [2014/03/24 16:49:16.627]     rv = self.__check_response_to_last_error(response)
 [2014/03/24 16:49:16.627]   File "C:\Python27\lib\site-packages\pymongo\mongo_client.py", line 911, in __check_response_to_last_error
 [2014/03/24 16:49:16.627]     raise OperationFailure(details["err"], details["code"])
 [2014/03/24 16:49:16.627] pymongo.errors.OperationFailure: timeout
 [2014/03/24 16:49:16.656] Running script task for command 
${killall_mci|pkill mongo; pkill -9 mongod; pkill -9 mongos; pkill -f buildlogger.py; pkill -f smoke.py}
 in directory /data/mci/git@github.commongodb/mongo.git/mongodb-mongo-v2.6
 [2014/03/24 16:49:16.656] Full command: taskkill /IM mongod.exe /IM mongos.exe /IM mongo.exe /IM test.exe /IM buildlogger.py /IM smoke.py /IM python.exe /IM cl.exe /f
 [2014/03/24 16:49:16.658] Command successfully started and appended to running commands



 Comments   
Comment by Githook User [ 01/Apr/14 ]

Author:

{u'username': u'milkie', u'name': u'Eric Milkie', u'email': u'milkie@10gen.com'}

Message: SERVER-13435 do not run dropdb_race.js in smalloplog

(cherry picked from commit cffd39b33fd0c6e311237cacc7360da3ea776a21)
Branch: v2.6
https://github.com/mongodb/mongo/commit/d2eedb1dda23d7aba2f4f38a69a05790cf706d72

Comment by Githook User [ 01/Apr/14 ]

Author:

{u'username': u'milkie', u'name': u'Eric Milkie', u'email': u'milkie@10gen.com'}

Message: SERVER-13435 do not run dropdb_race.js in smalloplog
Branch: master
https://github.com/mongodb/mongo/commit/cffd39b33fd0c6e311237cacc7360da3ea776a21

Comment by J Rassi [ 25/Mar/14 ]

SERVER-13351 filed.

One of two things need to happen now:

  • SERVER-13351 needs to be resolved, or
  • dropdb_race.js needs to not be run under small_oplog

Bouncing to milkie.

Comment by J Rassi [ 25/Mar/14 ]

Eliot Horowitz wrote:
master/slave should just retry the oplog query. doesn't require a resync

Hence, I will file a bug against master/slave replication and link it here.

Comment by J Rassi [ 25/Mar/14 ]

Assigning back to self, reached out to eliot via email for next steps.

Comment by J Rassi [ 25/Mar/14 ]

mattd@10gen.com: please triage.

Generated at Thu Feb 08 03:31:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.