[SERVER-7952] killop of replication get more can cause replication operations to be skipped Created: 16/Dec/12  Updated: 11/Jul/16  Resolved: 10/Feb/13

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.0.2, 2.3.2
Fix Version/s: 2.4.0-rc1

Type: Bug Priority: Critical - P2
Reporter: Ron Avnur Assignee: Aaron Staple
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Operating System: ALL
Steps To Reproduce:

Test:

replTest = new ReplSetTest( { name:'test', nodes:3 } );
nodes = replTest.startSet();
replTest.initiate();
master = replTest.getMaster();
secondary = replTest.getSecondary();
db = master.getDB( 'test' );
db.test.save( { a:0 } );
replTest.awaitReplication();
 
// Start a parallel shell to insert new documents on the primary.
inserter = startParallelShell( "for( i = 1; i < 1e5; ++i ) { db.test.save( { a:i } ); sleep( 1 ); }" );
 
// Periodically kill replication get mores.
for( i = 0; i < 1e3; ++i ) {
    ops = db.currentOp();
    printjson( ops );
    for( j in ops.inprog ) {
        op = ops.inprog[ j ];
        if ( op.ns == 'local.oplog.rs' && op.op == 'getmore' ) {
            db.killOp( op.opid );
        }
    }
    sleep( 100 );
}
 
// Wait for the inserter to finish.
inserter();
 
function allReplicated() {
    count = secondary.getDB( 'test' ).test.count();
    if ( count == 1e5 ) {
        // Return true if the count is as expected.
        return true;
    }
 
    // Identify and print the missing a-values.
    found = {};
    c = secondary.getDB( 'test' ).test.find().sort( { $natural:1 } );
    while( c.hasNext() ) {
        found[ '' + c.next().a ] = true;
    }
    missing = [];
    for( i = 0; i < 1e5; ++i ) {
        if ( !( ( '' + i ) in found ) ) {
            missing.push( i );
        }
    }
    print( 'count: ' + count + ' missing: ' + missing );
    return false;
}
 
// Wait for all documents to be replicated.
assert.soon( allReplicated );
 

Participants:

 Description   

Updated diagnosis (aaron):

Several factors combine to allow a client killop of a replication get more operation to cause replication to skip operations sent to a secondary.

1) get more
Get more is used to retrieve batches of results from a client cursor. If an exception (such as a killop exception) is thrown during a get more request, an OP_REPLY <http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wire-protocol/#op-reply> is sent containing the following:

  • responseFlags == 0 (neither QueryFailure nor CursorNotFound is set)
  • cursorID == 0
  • numberReturned == 0

The client cursor this get more was issued against remains alive with its original cursor id, and is ready to return additional documents. When an error occurs and aborts a get more, the cursor the get more iterates over is left at its position as of the time the error occurred, and will continue iterating from that position on subsequent requests. Since a get more that errors out will return no results, any documents iterated over in a get more implementation before an error is thrown will never be sent to the client.
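The loss described above can be sketched with a minimal in-memory model (hypothetical names, not the real server classes): documents the get more implementation iterated over before the killop exception advance the cursor's position, yet the error reply carries zero documents, so those documents are never delivered.

```javascript
// Toy model of the server-side get more path described in part 1).
class ClientCursorModel {
  constructor(docs) {
    this.docs = docs;
    this.pos = 0; // server-side position, persists across get more requests
  }

  // Collect up to batchSize docs; killAt >= 0 simulates a killop exception
  // thrown after killAt documents were already iterated over.
  getMore(batchSize, killAt = -1) {
    const batch = [];
    for (let n = 0; n < batchSize && this.pos < this.docs.length; ++n) {
      if (n === killAt) {
        // Error reply as described above: responseFlags == 0, cursorID == 0,
        // numberReturned == 0. The docs already pulled into this batch are
        // dropped, but this.pos keeps its advance.
        return { responseFlags: 0, cursorID: 0, numberReturned: 0, batch: [] };
      }
      batch.push(this.docs[this.pos++]);
    }
    return { responseFlags: 0, cursorID: 1, numberReturned: batch.length, batch };
  }
}

const cur = new ClientCursorModel([0, 1, 2, 3, 4, 5]);
const killedReply = cur.getMore(3, 2); // killed after iterating docs 0 and 1
const nextReply = cur.getMore(3);      // resumes at position 2
```

Here killedReply.batch is empty while nextReply.batch is [2, 3, 4]: documents 0 and 1 were consumed by the killed request but never sent to the client.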

2) DBClientCursor::more()
This cursor method retrieves a new batch of results when the previous batch is exhausted. The implementation:

  • Sets an internal flag based on responseFlags.QueryFailure, but does nothing with that flag.
  • Ignores the value of cursorID unless responseFlags.CursorNotFound is set.

Thus, after one get more request produces an error, DBClientCursor continues iterating over the same cursor on subsequent calls to more(). As described in part 1), this can result in skipping cursor results.

3) replication
The replication implementations (both replica set and master/slave) routinely call more() twice in a row, in particular to handle the case where a tailable cursor is exhausted on the first more() call but populated on the second. Calling more() twice in a row can trigger the document skipping described in part 2).
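The interaction of parts 1) through 3) can be sketched as follows; this is a toy model with assumed names (makeServerCursor, syncOnce), not the real C++ code, showing how a swallowed-error more() retry resumes past dropped documents:

```javascript
// Server-side tailable cursor (part 1): a kill mid-batch loses the docs
// already iterated, but the position keeps its advance.
function makeServerCursor(oplog) {
  let pos = 0;
  return {
    getMore(batchSize, killed) {
      const batch = [];
      while (batch.length < batchSize && pos < oplog.length) {
        batch.push(oplog[pos++]);
        if (killed) return []; // exception: batch dropped, pos stays advanced
      }
      return batch;
    },
  };
}

// Repl-style consumer: more() swallows the error and just sees an empty
// batch (part 2), and the loop calls it twice in a row to re-check a
// tailable cursor (part 3).
function syncOnce(cursor, applied, killFirstMore) {
  let batch = cursor.getMore(2, killFirstMore);
  if (batch.length === 0) batch = cursor.getMore(2, false); // immediate retry
  for (const op of batch) applied.push(op);
}

const applied = [];
const cursor = makeServerCursor(['op0', 'op1', 'op2', 'op3']);
syncOnce(cursor, applied, true); // first more() is killed after reading op0
syncOnce(cursor, applied, false);
```

After both passes, applied contains op1 through op3 only: op0 was consumed by the killed get more and then silently skipped by the retry, which is the replication gap this ticket describes.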

I recommend:

  • On error, have get more kill its ClientCursor and return a clear indication of the error to the client (using the QueryFailure flag if appropriate)
  • Have DBClientCursor::more() handle errors and validate all flags and attributes for unexpected values. It seems like throwing in these cases would be consistent with existing behavior.
  • Ensure that the replication code (including master/slave) handles such exceptions correctly.

I don't know if we want to make all of these changes right now though.

Also, a somewhat related issue with replication and get more is described here <https://jira.mongodb.org/browse/SERVER-4853>.

-----------------------------------------------------

If a user uses db.killOp() to kill a replication operation on a primary, the secondary might ultimately miss the operation.



 Comments   
Comment by auto [ 10/Feb/13 ]

Author:

aaron <aaron@10gen.com>, 2013-01-30T23:10:50Z

Message: SERVER-7952 Kill an authorized ClientCursor after processGetMore throws.
Branch: master
https://github.com/mongodb/mongo/commit/186f91d83d158adb06e024e7d3a1f4f5673ef5c9

Comment by Eliot Horowitz (Inactive) [ 28/Jan/13 ]

Yes, try that first.
If that's what you end up doing, file another ticket for 2.5.w to change getMore

Comment by Aaron Staple [ 28/Jan/13 ]

That makes sense to me. Though the wire protocol page states:

bit 1 (QueryFailure): Set when query failed. Results consist of one document containing an "$err" field describing the failure.

I don't know what the consequences would be of setting QueryFailure without an $err document.

In terms of an absolute minimum change, everything may work correctly for replication if we just make the get more implementation kill its ClientCursor when there is an error (rather than allowing subsequent reads on the same ClientCursor). Should I look into exploring that as a short-term fix?
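As a toy illustration of that short-term fix (assumed names, not the actual patch): if the killed get more invalidates its cursor, the next get more reports CursorNotFound instead of silently resuming, so the consumer knows to re-establish the cursor from its last applied position rather than skip documents.

```javascript
// Sketch of a cursor that is killed on error rather than left readable.
function makeKillOnErrorCursor(docs) {
  let pos = 0;
  let alive = true;
  return {
    getMore(batchSize, killed) {
      if (!alive) return { cursorNotFound: true, batch: [] };
      const batch = [];
      while (batch.length < batchSize && pos < docs.length) {
        batch.push(docs[pos++]);
        if (killed) {
          alive = false; // the fix: kill the cursor when the request errors
          return { cursorNotFound: false, batch: [] };
        }
      }
      return { cursorNotFound: false, batch };
    },
  };
}

const cur = makeKillOnErrorCursor(['op0', 'op1']);
const killedReply = cur.getMore(2, true);  // killed after reading op0
const retryReply = cur.getMore(2, false);  // now fails loudly
```

With this change the retry sees retryReply.cursorNotFound === true instead of a quietly advanced cursor, so op0 cannot be skipped unnoticed.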

Comment by Eliot Horowitz (Inactive) [ 28/Jan/13 ]

ok - so this should be setting QueryFailure I think.

Comment by Aaron Staple [ 28/Jan/13 ]

I think get more has always been implemented this way.

According to SERVER-4853:

Historically we didn't have an error flag, I think because we didn't want to check for $err on every doc.

$err alone doesn't make sense, as then you have to check not just the first doc.
This should be a new error flag.
It can just be a generic "error" flag, and the details can live inside $err in the first document.
But a flag absolutely is needed.

Comment by Eliot Horowitz (Inactive) [ 28/Jan/13 ]

Definitely trying to keep scope small right now.
Why isn't an $err returned, that seems like the real bug?

Comment by Aaron Staple [ 28/Jan/13 ]

eliot: In this case get more does not return an error or $err; it just returns an empty batch. There is some discussion in SERVER-4853 stating that getMore should not return a $err document. I agree we should change more() and repl (where necessary), but we should probably have the wire protocol part nailed down before doing that. Also, short term, we might want to keep in mind minimizing the scale of changes made in release candidates.

Comment by Eliot Horowitz (Inactive) [ 28/Jan/13 ]

Aaron - we should be returning an error with $err though, right?
So we should definitely change the repl code and more() code.

Comment by Aaron Staple [ 28/Jan/13 ]

eliot: Could you advise on the "I recommend" section in the description above, in particular on potential wire protocol changes for get more?

Comment by Aaron Staple [ 23/Jan/13 ]

From a look at a related case, it appears the issue here relates to killing the query or getMore the secondary uses to read oplog ops from the primary, not to killing a write operation that is going to be replicated.

Comment by Aaron Staple [ 23/Jan/13 ]

I believe right now a write op can only be interrupted with killop if the write op calls killCurrentOp.checkForInterrupt( false ), or if the write op relinquishes the mutex, for example when yielding. (Right now ClientCursor::staticYield() calls checkForInterrupt( false ) internally.) I think the goal with this design is to ensure that individual storage layer modifications performed by a write operation are not committed until the data store is in a consistent state.

For replication to work correctly in the face of killop, we would additionally want to make sure that a write operation cannot be interrupted after a document is modified and before the modification of that document is added to the oplog.

ron.avnur@10gen.com, is there a specific case you have in mind where this can happen, or does this ticket represent an audit of the write operations to see if/where it can happen and fixing those cases?

Comment by Dwight Merriman [ 18/Dec/12 ]

what kind of operation?

Comment by Dwight Merriman [ 18/Dec/12 ]

That's weird. I believe we throw an exception when killed; perhaps a throw is happening between the write and the logTheOp() call?

Generated at Thu Feb 08 03:16:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.