[SERVER-19768] Failed applyOps command does not create an oplog entry even with some successful writes Created: 05/Aug/15  Updated: 06/Dec/22  Resolved: 05/Apr/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.1.1
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Kamran K. Assignee: Backlog - Storage Execution Team
Resolution: Duplicate Votes: 3
Labels: 32qa, RF, fuzzer-blacklist, idempotency
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-28594 non-atomic applyOps should log each i... Closed
Related
related to SERVER-23326 applyOps should use a single WriteUni... Closed
related to SERVER-25994 Allow applyOps to validate authorizat... Closed
is related to SERVER-23390 Missing collection during replication... Closed
is related to SERVER-17634 do not apply replicated insert operat... Closed
Assigned Teams:
Storage Execution
Operating System: ALL
Sprint: Repl 2016-11-21
Participants:

 Description   

If an applyOps command is run with some valid ops and some invalid ops, the valid ones can be actually applied, but then the applyOps fails and no oplog entries are generated for the writes performed as part of the valid ops.

The manifestation of the bug is a result of the change from SERVER-17634, so it only affects 3.1.1+.

Server output:

[repl writer worker 0] writer worker caught exception:  :: caused by :: 26 Failed to apply insert due to missing collection: { ts: Timestamp 1438736773000|1, h: 2778434085248541821, v: 2, op: "i", ns: "test.foo", o: { _id: ObjectId('55c1618530cd6c7ac35af4b7') } } on: { ts: Timestamp 1438736773000|1, h: 2778434085248541821, v: 2, op: "i", ns: "test.foo", o: { _id: ObjectId('55c1618530cd6c7ac35af4b7') } }
[repl writer worker 0] Fatal Assertion 16360
[repl writer worker 0] 
 
 ***aborting after fassert() failure


Repro script:

// Start up a replica set with at least one secondary
var rst = new ReplSetTest({nodes: 2});
rst.startSet();
rst.initiate();
 
var t = rst.getPrimary().getDB('test').foo;
t.drop();
 
var ops = [
    // The first op is valid and results in one document being written to 'test.foo'
    {op: 'u', ns: t.getFullName(), o2: {_id: 0}},
    // The second op is invalid
    {op: 'f', ns: t.getFullName(), o2: {}}
];
 
// The command fails, but a 'test.foo' collection is created on the primary and a document is inserted into the collection.
// A 'test.foo' collection is not created on the secondary because the update op is not written to the oplog.
t.runCommand({applyOps: ops});
 
// This follow-up insert will cause an fassert on the secondary:
// 'Failed to apply insert due to missing collection'
t.insert({});
 
rst.awaitReplication();
rst.stopSet();



 Comments   
Comment by Spencer Brody (Inactive) [ 18/Nov/16 ]

This issue only affects the applyOps command, which is internal-only and not designed for general use. Internal users of this command use it correctly with only valid ops, so there's no need to fix this.

Comment by Spencer Brody (Inactive) [ 29/Mar/16 ]

adq, while the issue you encountered is probably not the same as this one as this one only applies to user-issued applyOps commands, it's possible that there exists another bug in this space as we've had a few other users report similar issues. I filed SERVER-23390 to consolidate our investigation into this other issue. Any details about what happened in your case, especially if you can reproduce the issue, would be very helpful if you could attach to that ticket.

Comment by Andrew de Quincey [ 01/Feb/16 ]

ok, we will retest once the other 3.2 bug we're waiting on has been fixed. ta!

Comment by Eric Milkie [ 29/Jan/16 ]

Since this ticket is concerning the behavior of the applyOps command, I'm afraid that it is unrelated to your issue.
I don't believe we have other outstanding issues for that particular assertion, so we would be grateful to have you file a new ticket with the details of the problem.

Comment by Andrew de Quincey [ 29/Jan/16 ]

Eric, we used a normal update command from pymongo and ran into "Fatal Assertion 16360" on one of our secondary nodes when it replicated to it.

Comment by Eric Milkie [ 29/Jan/16 ]

Hi Alan.
Am I to understand you correctly that your application is using the applyOps command, and some but not all of the ops that you are passing to the command are routinely failing to apply?

Comment by Alan Jackson [ 28/Jan/16 ]

Hi.
We ran into this issue on 3.2
Our application dynamically creates databases, and then collections within them by issuing inserts/upserts.
Is this going to be fixed?

Comment by Scott Hernandez (Inactive) [ 05/Aug/15 ]

I forgot that an "update" in applyOps implicitly defaults to {upsert:true} , unless overridden – which leads to the update inserting a doc, as you noticed, in your example.

BTW. ApplyOps is supposed to copy the whole command to the oplog, more or less, independent of if it fails in the middle.

Comment by Scott Hernandez (Inactive) [ 05/Aug/15 ]

Yeah, if the applyOps isn't being written to the oplog since it doesn't complete successfully that would lead to work being done but not replicated, like the collection creation.

Comment by Kamran K. [ 05/Aug/15 ]

Sorry for the confusion. I hadn't noticed that the update op actually creates a collection and inserts a document on the primary despite not writing any entries to the oplog. The insertion of the empty document with t.insert() then causes the collection to have two documents and for the oplog to have these entries:

{ "ts" : Timestamp(1438746894, 1), "h" : NumberLong("5166445252306839291"), "v" : 2, "op" : "n", "ns" : "", "o" : { "msg" : "initiating set" } }
{ "ts" : Timestamp(1438746944, 1), "h" : NumberLong("1009357400624115533"), "v" : 2, "op" : "i", "ns" : "test.foo", "o" : { "_id" : ObjectId("55c189400ddaae3b512baa4d") } }

The insert then fails to apply on the secondary, leading to the fassert.

Comment by Scott Hernandez (Inactive) [ 05/Aug/15 ]

I think what you are describing is that the update, which is a no-op, doesn't add an oplog entry to create the collection (which is now required), and/or the update fails since the collection doesn't exist, but the collection is implicitly created and isn't rollbacked, in the storage layer, since the update doesn't actually write anything.

And I think the bug is the latter, not the former.

Generated at Thu Feb 08 03:52:01 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.