[SERVER-11370] applyOps should not allow yields Created: 25/Oct/13  Updated: 11/Jul/16  Resolved: 04/Feb/14

Status: Closed
Project: Core Server
Component/s: Concurrency, Replication, Sharding
Affects Version/s: 2.5.3
Fix Version/s: 2.6.0-rc0

Type: Bug Priority: Major - P3
Reporter: J Rassi Assignee: J Rassi
Resolution: Done Votes: 0
Labels: 26qa
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-11432 Audit usage of checkForInterrupt(false) Closed
is related to SERVER-11038 Interrupting renameCollection via shu... Closed
Operating System: ALL

 Description   

[ed: original title "Interrupting applyOps can lead to out-of-sync replicas"]

The applyOps command acquires a global lock, applies the provided operations one by one, and then (if run on a primary member) logs the command invocation to the oplog. However, the command can be interrupted after only some of the operations have been applied, in which case the changes already made are persisted to disk but never written to the oplog, and thus never replicated to the other members.
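
[ed: a simplified paraphrase of the command flow described above, for illustration only; apart from applyOperation_inlock() and logOp(), the names and signatures here are not the actual source]

// Sketch of ApplyOpsCmd::run(): the global write lock is held throughout, each
// operation is applied in turn, and only afterwards is the command logged to
// the oplog. The hazard is the window inside the loop: any of the per-op
// writes can reach a yield point, and an interrupt taken there abandons the
// command after some writes are already durable, with no oplog entry left
// behind to replicate them.
bool run(const std::string& dbname, const BSONObj& cmdObj, BSONObjBuilder& result) {
    std::vector<BSONElement> ops = cmdObj["applyOps"].Array();
    for (size_t i = 0; i < ops.size(); i++) {
        // may call deleteObjects()/updateObjects(), which can yield and check
        // for interrupt (see the stack trace below)
        applyOperation_inlock(ops[i].Obj(), /*convertUpdateToUpsert*/ true, /*fromRepl*/ true);
    }
    // reached only if every operation applied without interruption; on a
    // primary this is what makes the changes visible to the other members
    logOp("c", (dbname + ".$cmd").c_str(), cmdObj);
    return true;
}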

The command can be interrupted, for example, while processing a delete operation: applyOperation_inlock() calls deleteObjects() with god=false, which, even on a simple _id query, can force a yield if the associated index entry is determined not to be resident in physical memory.
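
[ed: an illustrative sketch of the yield decision on that path; apart from checkForInterrupt(false) and likelyInPhysicalMemory(), which appear in the stack trace and comments below, the helper names here are invented for clarity and this is not the actual delete.cpp source]

// Per the description above, the yield only arises because deleteObjects() is
// invoked with god=false. The yield is taken when the record to load is judged
// not to be resident in physical memory, and staticYield() then checks for
// interrupt with heedMutex=false before releasing the lock, which is where the
// repro below hits "interrupted at shutdown" (code 11600).
bool maybeYield(ClientCursor* cc, Record* recordToLoad, bool god) {
    if (god)
        return false;                           // yield machinery is only reached with god=false
    if (likelyInPhysicalMemory(recordToLoad))
        return false;                           // record is resident; no need to yield
    killCurrentOp.checkForInterrupt(false);     // may uassert(11600) mid-applyOps
    return yieldLockAndReacquire(cc);           // a group commit may run while the lock is free
}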

See the following shell session. After processing the first operation in applyOps, a normal shutdown was initiated with SIGINT, followed by a restart. Note the change in the collection contents, but the absence of an oplog entry.

> db.foo.insert([{_id:1},{_id:2}])
> var remove1 = { "ts" : Timestamp(1382671580, 1), "h" : NumberLong("8276060108251241091"), "v" : 2, "op" : "d", "ns" : "test.foo", "b" : true, "o" : { "_id" : 1 } }
> var remove2 = { "ts" : Timestamp(1382673262, 1), "h" : NumberLong("-5171031159600005302"), "v" : 2, "op" : "d", "ns" : "test.foo", "b" : true, "o" : { "_id" : 2 } }
> db.runCommand({applyOps:[remove1,remove2]})
{
	"errmsg" : "exception: interrupted at shutdown",
	"code" : 11600,
	"ok" : 0
}
> db.foo.find()
2013-10-24T23:56:09.545-0400 DBClientCursor::init call() failed
Error: error doing query: failed
> db.foo.find()
2013-10-24T23:56:10.419-0400 trying reconnect to 127.0.0.1:27017
2013-10-24T23:56:10.419-0400 reconnect 127.0.0.1:27017 ok
{ "_id" : 2 }
> db.getSiblingDB("local").oplog.rs.count({op:"c","o.applyOps":{$exists:true}})
0

Stack trace for above, at interruption:

(gdb) bt
#0  mongo::uasserted (msgid=11600, msg=0x100bc9565 "interrupted at shutdown") at assert_util.cpp:159
#1  0x00000001002bf14a in mongo::KillCurrentOp::checkForInterrupt (this=0x100ec1180, heedMutex=false) at kill_current_op.cpp:131
#2  0x000000010011c924 in std::auto_ptr<mongo::LockMongoFilesShared>::auto_ptr () at /usr/include/c++/4.2.1/memory:379
#3  0x000000010011c924 in mongo::ClientCursor::staticYield (micros=100, ns=@0x13be1afa0, rec=0x137b940b0) at clientcursor.cpp:381
#4  0x000000010011ccf6 in mongo::ClientCursor::yield (this=0x10423a570, micros=100, recordToLoad=0x137b940b0) at clientcursor.cpp:767
#5  0x000000010011ce28 in mongo::ClientCursor::yieldSometimes (this=0x10423a540, need=mongo::ClientCursor::WillNeed, yielded=0x1) at clientcursor.cpp:781
#6  0x00000001002e1a93 in mongo::deleteObjects (ns=@0x13be1b470, pattern=@0x13be1b470, god=false, logop=false, justOne=true) at delete.cpp:106
#7  0x00000001004447da in mongo::applyOperation_inlock (op=@0x13be1c050, convertUpdateToUpsert=true, fromRepl=true) at oplog.cpp:603
#8  0x000000010012dace in mongo::ApplyOpsCmd::run (this=0x13be1c6a0, dbname=@0x13be1c6a0, cmdObj=@0x13be1c6a0, unnamed_arg=0, errmsg=@0x13be1c6a0, result=@0x13be1c6a0, fromRepl=false) at apply_ops.cpp:116
#9  0x00000001001aa635 in mongo::_execCommand (c=0x100ebfc60, dbname=@0x13be1cac8, cmdObj=@0xffffffff00000001, queryOptions=0, errmsg=@0x13be1caa8, result=@0x13be1d010, fromRepl=false) at dbcommands.cpp:1850
#10 0x00000001001ab622 in mongo::Command::execCommand (c=0x100ebfc60, client=@0x13be1cc40, queryOptions=1004653632, cmdns=0x13be1cc40 "���;\001", cmdObj=@0x13be1cc40, result=@0x13be1d010, fromRepl=false) at dbcommands.cpp:2017
#11 0x00000001001ac02c in mongo::_runCommands (ns=0x13be1ce90 " ��;\001", _cmdobj=@0x13be1ce90, b=@0x13be1ce90, anObjBuilder=@0x13be1ce90, queryOptions=0, fromRepl=false) at dbcommands.cpp:2081
#12 0x000000010030612e in mongo::runCommands (ns=0x100ebfc60 "0��", jsobj=@0x13be1d0c8, curop=@0x104398b00, b=@0x13be1d010, anObjBuilder=@0x13be1ccb8, queryOptions=0) at query.cpp:69
#13 0x0000000100306c1d in mongo::runQuery (m=@0x13be1d0c8, q=@0x13be1d5e0, curop=@0x13be1d5e0, result=@0x13be1d5e0) at query.cpp:1064
#14 0x0000000100294746 in receivedQuery [inlined] () at /Users/rassi/work/mongo/src/mongo/db/instance.cpp:280
#15 0x0000000100294746 in mongo::assembleResponse (m=@0x13be1d958, dbresponse=@0x13be1db60, remote=@0x4) at instance.cpp:443
#16 0x00000001000075d4 in mongo::MyMessageHandler::process (this=0x13be1d958, m=@0x13be1dc70, port=0x13be1dc70, le=0x13be1dc70) at db.cpp:199
#17 0x00000001005de601 in boost::shared_ptr<mongo::Socket>::operator-> () at /Users/rassi/work/mongo/src/third_party/boost/boost/smart_ptr/shared_ptr.hpp:209
#18 0x00000001005de601 in mongo::PortMessageServer::handleIncomingMsg (arg=0x13be1d958) at message_server_port.cpp:210
#19 0x0000000100645fd5 in thread_proxy (param=0x10435b9b0) at thread.cpp:121
#20 0x00007fff82126782 in _pthread_start ()
#21 0x00007fff821131c1 in thread_start ()



 Comments   
Comment by Githook User [ 04/Feb/14 ]

Author: Jason Rassi (jrassi) <rassi@10gen.com>

Message: SERVER-11370 Prevent applyOps from yielding
Branch: master
https://github.com/mongodb/mongo/commit/51e8eee65cdff953b283fe0b9bb3515e298a0e83

Comment by J Rassi [ 29/Oct/13 ]

Creating separate tickets for the two items above.

New name for this ticket: applyOps should not allow yielding after first write.
Name for new ticket SERVER-11432: Audit usage of checkForInterrupt(false).

Comment by J Rassi [ 25/Oct/13 ]

Summarizing discussion with schwerin / eliot:

  • It is an error for the applyOps command to yield the write lock after the first write.
  • Separately, it is an error for staticYield to call checkForInterrupt(false) on behalf of any caller with a recursive lock (staticYield is only a yield point for threads that do not hold a recursive lock). Other calls to checkForInterrupt(false) need auditing.
Comment by J Rassi [ 25/Oct/13 ]

Using checkForInterrupt(true) isn't a fix for the underlying issue. The issue is that a yield was allowed at all. In order for the applyOps contract to be satisfied, no group commit can happen between the first write and the logOp("c"). Thus, it is an error for applyOps to ever release the lock after the first write.

To illustrate: I can still repro the issue with the same applyOps+SIGINT scenario above even if the interrupt check in staticYield is changed to checkForInterrupt(true). The execution path changes in the following way: instead of applyOps releasing the lock at the interrupt point, applyOps instead releases the lock moments later during the actual dbtemprelease(). In either case: the newly-released lock is immediately handed off to signalProcessingThread which, upon acquisition, assumes it is safe to group commit and exit the process.

So, the way I see it, it should always be safe for a yield point to be an interrupt point, since we allow group commits at yield points.
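
[ed: a minimal sketch of the constraint described in this comment, with hypothetical names; the real change is the commit linked in the Githook comment above, not this code]

// Hypothetical RAII guard illustrating "no lock release between the first
// write and logOp('c')". ScopedNoYields and setYieldsAllowed() are invented
// for this sketch; they are not the actual patch.
class ScopedNoYields {
public:
    explicit ScopedNoYields(CurOp* op) : _op(op) { _op->setYieldsAllowed(false); }
    ~ScopedNoYields() { _op->setYieldsAllowed(true); }   // restored even if an op throws
private:
    CurOp* _op;
};

// Inside ApplyOpsCmd::run(), before the per-operation loop:
//     ScopedNoYields guard(cc().curop());
// ClientCursor::yieldSometimes()/staticYield() would then decline to yield
// (and hence to release the lock) while the flag is set, so no group commit
// can slip in between the first write and the trailing logOp("c", ...).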

Comment by Eliot Horowitz (Inactive) [ 25/Oct/13 ]

Ok, so the real bug is that staticYield should not be called with false.

It has some fallout: in update and delete, after each op, we need to reset that variable after the actual isolation (i.e. after logOp).

Comment by J Rassi [ 25/Oct/13 ]

staticYield() calls killCurrentOp.checkForInterrupt(false), which specifically ignores that note.

In the repro, you have to get likelyInPhysicalMemory() to return false to force the yield – I think the simplest way to do that is with a fail point.
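
[ed: a rough paraphrase of the heedMutex distinction being discussed in this thread; member and helper names are approximate, not the actual kill_current_op.cpp source]

// With heedMutex=true, the "write in progress" note made on the Client object
// defers any kill until the current op finishes; staticYield() passes false,
// so that note is bypassed and the shutdown interrupt fires mid-applyOps.
void checkForInterrupt(bool heedMutex) {
    Client& c = cc();
    if (heedMutex && c.hasWrittenSinceCheckpoint())    // approximate name for the "note"
        return;                                        // kill deferred until the op completes
    if (globalKillRequested())                         // e.g. clean shutdown via SIGINT
        uasserted(11600, "interrupted at shutdown");
    if (c.curop()->killPending())
        uasserted(11601, "operation was interrupted");
}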

Comment by Eliot Horowitz (Inactive) [ 25/Oct/13 ]

Can you double check this?
When I test, things work as designed.

When a write happens, a note is made on the Client object, which will block all kill attempts until that op is finished.

I verified a few ways and can't see what you're seeing.
