[SERVER-7952] killop of replication get more can cause replication operations to be skipped Created: 16/Dec/12 Updated: 11/Jul/16 Resolved: 10/Feb/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.0.2, 2.3.2 |
| Fix Version/s: | 2.4.0-rc1 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Ron Avnur | Assignee: | Aaron Staple |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Operating System: | ALL |
| Steps To Reproduce: | Test:
|
| Participants: |
| Description |
|
Updated diagnosis (aaron): Several factors combine to allow a client killop of a replication get more operation to cause replication to skip operations sent to a secondary.

1) get more

The client cursor this get more was issued against remains alive with its original cursor id, and is ready to return additional documents. When an error occurs and aborts a get more, the cursor the get more iterates over is left at its position as of the time the error occurred, and will continue iterating from that position on subsequent requests. Since a get more that errors out will return no results, any documents iterated over in a get more implementation before an error is thrown will never be sent to the client.

2) DBClientCursor::more()

3) replication

I recommend:

I don't know if we want to make all of these changes right now though. Also, a somewhat related issue with replication and get more is described here <https://jira.mongodb.org/browse/SERVER-4853>.

-----------------------------------------------------

If a user uses db.killOp() to kill a replication operation on a primary, the secondary might ultimately miss the operation. |
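The failure mode in the diagnosis above can be illustrated with a toy model. This is only a sketch in Python, not the server's C++ code: `ToyCursor`, `KilledError`, `replicate`, and the `kill_after` parameter are hypothetical names invented for the illustration. It assumes, per the diagnosis and the comments below, that a killed get more returns an empty batch with no error and that the cursor keeps the position it had reached when the kill happened.

```python
class KilledError(Exception):
    """Stands in for the exception raised when killOp interrupts an operation."""


class ToyCursor:
    """Hypothetical stand-in for a server-side cursor over the oplog."""

    def __init__(self, docs):
        self.docs = list(docs)
        self.pos = 0  # position persists across get more calls

    def get_more(self, batch_size, kill_after=None):
        """Return up to batch_size docs; an interrupted call returns an empty batch."""
        batch = []
        try:
            for i in range(batch_size):
                if self.pos >= len(self.docs):
                    break
                if kill_after is not None and i == kill_after:
                    raise KilledError()
                batch.append(self.docs[self.pos])
                self.pos += 1  # the cursor advances before the batch is sent
        except KilledError:
            # The partial batch is discarded, but self.pos is not rewound.
            return []
        return batch


def replicate(cursor, kill_plan):
    """Toy secondary: keeps issuing get more against the same cursor."""
    applied = []
    for kill_after in kill_plan:
        applied.extend(cursor.get_more(batch_size=3, kill_after=kill_after))
    return applied


oplog = ["op1", "op2", "op3", "op4", "op5", "op6"]
# The first get more is killed after iterating over two documents.
applied = replicate(ToyCursor(oplog), kill_plan=[2, None, None])
print(applied)
```

In this model the reader never sees `op1` and `op2`: they were iterated over before the kill, dropped with the aborted batch, and the next get more resumes from `op3`.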
| Comments |
| Comment by auto [ 10/Feb/13 ] | ||||||
|
Author: aaron <aaron@10gen.com> (2013-01-30T23:10:50Z) Message: | ||||||
| Comment by Eliot Horowitz (Inactive) [ 28/Jan/13 ] | ||||||
|
Yes, try that first. | ||||||
| Comment by Aaron Staple [ 28/Jan/13 ] | ||||||
|
That makes sense to me. Though the wire protocol page states:
I don't know what the consequences would be of setting QueryFailure without an $err document. In terms of absolute minimum change, everything may work correctly for replication if we just make the get more implementation kill its ClientCursor when there is an error (rather than allow subsequent reads on the same ClientCursor). Should I look into exploring that as a short term fix? | ||||||
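A toy sketch of the short-term fix proposed in this comment, in Python rather than the server's C++. `ToyServer`, `CursorNotFound`, `query`, `replicate`, and `killed_at` are hypothetical names for illustration only. It also assumes that a reader whose cursor has disappeared re-queries from the last operation it applied; that recovery step is an assumption of the sketch and is not stated in this ticket.

```python
class CursorNotFound(Exception):
    """Raised when a get more names a cursor id the server no longer holds."""


class ToyServer:
    """Hypothetical server holding cursors over a fixed oplog."""

    def __init__(self, oplog):
        self.oplog = list(oplog)
        self.cursors = {}  # cursor id -> next index into self.oplog
        self.next_id = 1

    def query(self, start):
        """Open a cursor positioned at index start and return its id."""
        cursor_id = self.next_id
        self.next_id += 1
        self.cursors[cursor_id] = start
        return cursor_id

    def get_more(self, cursor_id, batch_size, killed_at=None):
        if cursor_id not in self.cursors:
            raise CursorNotFound(cursor_id)
        pos = self.cursors[cursor_id]
        batch = []
        for i in range(batch_size):
            if pos >= len(self.oplog):
                break
            if killed_at is not None and i == killed_at:
                # Proposed fix: kill the cursor instead of leaving it advanced.
                del self.cursors[cursor_id]
                return []
            batch.append(self.oplog[pos])
            pos += 1
        self.cursors[cursor_id] = pos
        return batch


def replicate(server, kill_plan):
    """Toy secondary that re-queries from its last applied op (assumed behavior)."""
    applied = []
    cursor_id = server.query(start=0)
    for killed_at in kill_plan:
        try:
            batch = server.get_more(cursor_id, 3, killed_at)
        except CursorNotFound:
            cursor_id = server.query(start=len(applied))
            batch = server.get_more(cursor_id, 3)
        applied.extend(batch)
    return applied


server = ToyServer(["op1", "op2", "op3", "op4", "op5", "op6"])
# The first get more is killed after iterating over two documents.
applied = replicate(server, kill_plan=[2, None, None, None])
print(applied)
```

Because the killed cursor no longer exists, the next get more fails loudly instead of silently resuming past the dropped documents, and the reader in this model recovers every operation.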
| Comment by Eliot Horowitz (Inactive) [ 28/Jan/13 ] | ||||||
|
ok - so this should be setting QueryFailure I think. | ||||||
| Comment by Aaron Staple [ 28/Jan/13 ] | ||||||
|
I think get more has always been implemented this way. According to
| ||||||
| Comment by Eliot Horowitz (Inactive) [ 28/Jan/13 ] | ||||||
|
Definitely trying to keep scope small right now. | ||||||
| Comment by Aaron Staple [ 28/Jan/13 ] | ||||||
|
eliot: In this case get more does not return an error or $err; it just returns an empty batch. There is some discussion in | ||||||
| Comment by Eliot Horowitz (Inactive) [ 28/Jan/13 ] | ||||||
|
Aaron - we should be returning an error with $err though right? | ||||||
| Comment by Aaron Staple [ 28/Jan/13 ] | ||||||
|
eliot: Could you advise on the "I recommend" section in the description above? In particular, on potential wire protocol changes for get more. | ||||||
| Comment by Aaron Staple [ 23/Jan/13 ] | ||||||
|
From a look at a related case it looks like the issue here relates to killing the query or getMore the secondary is using to read oplog ops from the primary, not to killing a write operation that is going to be replicated. | ||||||
| Comment by Aaron Staple [ 23/Jan/13 ] | ||||||
|
| ||||||
| Comment by Dwight Merriman [ 18/Dec/12 ] | ||||||
|
what kind of operation? | ||||||
| Comment by Dwight Merriman [ 18/Dec/12 ] | ||||||
|
that's weird. i believe we throw an exception when killed. perhaps a throw is happening between the write and the logTheOp() call? |