[SERVER-20677] Failed collMod during replication results in fassert shutdown Created: 29/Sep/15  Updated: 05/Apr/17  Resolved: 16/Nov/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.1.8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Cailin Nelson Assignee: Siyuan Zhou
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-26118 CollMod 'c' op: ignore metadata error... Closed
Related
related to SERVER-23919 Database/Collection drop during initi... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   
State at time of failure:
  • cloud-dev-0, PRIMARY, mmapv1, 3.0.6
  • cloud-dev-1, SECONDARY, mmapv1, 3.1.8
  • cloud-dev-2, STARTUP2, wt, 3.1.8
Log Snippet

2015-09-28T22:29:57.211+0000 E REPL     [repl writer worker 0] failed to apply update: { ts: Timestamp 1443455631000|18, h: -9014016430135857895, v: 2, op: "u", ns: "mmsdbjobs.data.jobsProcessor", o2: { _id: ObjectId('56096289e4b0e5c46d2025c3') }, o: { $set: { expireAt: new Date(1443459111840), updated: 1443455631840, status: "COMPLETED" } } }
2015-09-28T22:29:57.211+0000 I REPL     [repl writer worker 0] adding missing object
2015-09-28T22:29:57.271+0000 I REPL     [repl writer worker 0] missing object not found on source. presumably deleted later in oplog
2015-09-28T22:29:57.271+0000 I REPL     [repl writer worker 0] o2: { _id: ObjectId('56096289e4b0e5c46d2025c3') }
2015-09-28T22:29:57.271+0000 I REPL     [repl writer worker 0] o firstfield: $set
2015-09-28T22:29:57.285+0000 E REPL     [repl writer worker 4] failed to apply update: { ts: Timestamp 1443455664000|17, h: 6934536607052632399, v: 2, op: "u", ns: "mmsdbconfig.config.agentSessions", o2: { _id: "grabar-0.grabar-test.55d2398ce4b0b44f5b06273b.mongo.plumbing-44c2b15f292bece3cc3fb87c9bee564c-5421c55ce4b0aa879d8b8328" }, o: { $set: { lud: new Date(1443455664494), uc: 6393 } } }
2015-09-28T22:29:57.285+0000 I REPL     [repl writer worker 4] adding missing object
2015-09-28T22:29:57.317+0000 I REPL     [repl writer worker 4] missing object not found on source. presumably deleted later in oplog
2015-09-28T22:29:57.317+0000 I REPL     [repl writer worker 4] o2: { _id: "grabar-0.grabar-test.55d2398ce4b0b44f5b06273b.mongo.plumbing-44c2b15f292bece3cc3fb87c9bee564c-5421c55ce4b0aa879d8b8328" }
2015-09-28T22:29:57.317+0000 I REPL     [repl writer worker 4] o firstfield: $set
2015-09-28T22:29:57.452+0000 E REPL     [repl writer worker 10] failed to apply update: { ts: Timestamp 1443455719000|198, h: 849378850158100444, v: 2, op: "u", ns: "mmsdbconfig.config.agentSessions", o2: { _id: "grabar-0.grabar-test.55d2398ce4b0b44f5b06273b.mongo.plumbing-44c2b15f292bece3cc3fb87c9bee564c-5421c55ce4b0aa879d8b8328" }, o: { $set: { lud: new Date(1443455719516), uc: 6394 } } }
2015-09-28T22:29:57.452+0000 I REPL     [repl writer worker 10] adding missing object
2015-09-28T22:29:57.505+0000 I REPL     [repl writer worker 10] missing object not found on source. presumably deleted later in oplog
2015-09-28T22:29:57.505+0000 I REPL     [repl writer worker 10] o2: { _id: "grabar-0.grabar-test.55d2398ce4b0b44f5b06273b.mongo.plumbing-44c2b15f292bece3cc3fb87c9bee564c-5421c55ce4b0aa879d8b8328" }
2015-09-28T22:29:57.505+0000 I REPL     [repl writer worker 10] o firstfield: $set
2015-09-28T22:29:57.625+0000 E REPL     [repl writer worker 4] failed to apply update: { ts: Timestamp 1443455774000|625, h: -8965142833625862340, v: 2, op: "u", ns: "mmsdbconfig.config.agentSessions", o2: { _id: "grabar-0.grabar-test.55d2398ce4b0b44f5b06273b.mongo.plumbing-44c2b15f292bece3cc3fb87c9bee564c-5421c55ce4b0aa879d8b8328" }, o: { $set: { lud: new Date(1443455774497), uc: 6395 } } }
2015-09-28T22:29:57.625+0000 I REPL     [repl writer worker 4] adding missing object
2015-09-28T22:29:57.678+0000 I REPL     [repl writer worker 4] missing object not found on source. presumably deleted later in oplog
2015-09-28T22:29:57.678+0000 I REPL     [repl writer worker 4] o2: { _id: "grabar-0.grabar-test.55d2398ce4b0b44f5b06273b.mongo.plumbing-44c2b15f292bece3cc3fb87c9bee564c-5421c55ce4b0aa879d8b8328" }
2015-09-28T22:29:57.678+0000 I REPL     [repl writer worker 4] o firstfield: $set
2015-09-28T22:29:57.882+0000 E REPL     [repl writer worker 5] failed to apply update: { ts: Timestamp 1443455829000|608, h: 6458318593359235959, v: 2, op: "u", ns: "mmsdbconfig.config.agentSessions", o2: { _id: "grabar-0.grabar-test.55d2398ce4b0b44f5b06273b.mongo.plumbing-44c2b15f292bece3cc3fb87c9bee564c-5421c55ce4b0aa879d8b8328" }, o: { $set: { lud: new Date(1443455829500), uc: 6396 } } }
2015-09-28T22:29:57.882+0000 I REPL     [repl writer worker 5] adding missing object
2015-09-28T22:29:57.918+0000 I REPL     [repl writer worker 5] missing object not found on source. presumably deleted later in oplog
2015-09-28T22:29:57.919+0000 I REPL     [repl writer worker 5] o2: { _id: "grabar-0.grabar-test.55d2398ce4b0b44f5b06273b.mongo.plumbing-44c2b15f292bece3cc3fb87c9bee564c-5421c55ce4b0aa879d8b8328" }
2015-09-28T22:29:57.919+0000 I REPL     [repl writer worker 5] o firstfield: $set
2015-09-28T22:29:57.936+0000 E REPL     [repl writer worker 6] Failed command { collMod: "data.rrd", noPadding: true, usePowerOf2Sizes: false } on mmsdbrrd-raw-dbs-PT5S-PT2H-20150928T1555 with status NamespaceNotFound ns does not exist during oplog application
2015-09-28T22:29:57.936+0000 I REPL     [repl writer worker 6] adding missing object
2015-09-28T22:29:57.958+0000 F REPL     [repl writer worker 6] cannot fetch missing document without _id field: { ts: Timestamp 1443455834000|31, h: -7038049490327803840, v: 2, op: "c", ns: "mmsdbrrd-raw-dbs-PT5S-PT2H-20150928T1555.$cmd", o: { collMod: "data.rrd", noPadding: true, usePowerOf2Sizes: false } }
2015-09-28T22:29:57.958+0000 I -        [repl writer worker 6] Fatal Assertion 28742
2015-09-28T22:29:57.958+0000 I -        [repl writer worker 6] 
 
***aborting after fassert() failure

Full Logs


 Comments   
Comment by Siyuan Zhou [ 16/Nov/16 ]

Closing as a dup to SERVER-26118.

Comment by Siyuan Zhou [ 11/Nov/16 ]

Yes. I believe this is fixed by SERVER-26118. The specific test case is here, which runs collMod again after the drop. I think we can close this ticket now or waiting for the backport to 3.2.

The unexpected missing doc fetch is fixed as a side-effect of this commit.

Comment by Spencer Brody (Inactive) [ 11/Nov/16 ]

siyuan.zhou, can you confirm that this has been fixed by your recent work?

Comment by Eric Milkie [ 03/May/16 ]

Even after SERVER-23919 goes out, this will continue to be an issue, as we are not building indexes prior to processing ops in the oplog. This issue will be solved by the initial sync project scheduled for version 3.4, however.

Comment by Scott Hernandez (Inactive) [ 25/Apr/16 ]

Thanks for the timeline; that identified the cause, I believe.

The short of it is that a dropDatabase can cause the initial sync to not copy the collection, but later see a collMod for that collection in the oplog (before the dropDatabase) which will then fail the initial sync.

I have created SERVER-23919 to address it.

Comment by Scott Hernandez (Inactive) [ 25/Apr/16 ]

John, do you know why the collection didn't exist? Does it exist on the other replica nodes?

Could that collection/db be created/deleted during the initial sync?

Comment by Scott Hernandez (Inactive) [ 29/Sep/15 ]

Cool, then this shouldn't be much of an issue in practice, and the behavior is (probably) correct right now, but we can provide a better error message for sure.

Comment by Cailin Nelson [ 29/Sep/15 ]

I think the likely reason that the collection didn't exist is this SERVER-20638.

SERVER-20638 has led us to have very different lists of databases on the mmapv1 primary versus the mmapv1 secondary. We've also switched which is primary/secondary which further confuses the state of the system.

Comment by Scott Hernandez (Inactive) [ 29/Sep/15 ]

The cause of the underlying problem is the error applying the collMod command:

Failed command { collMod: "data.rrd", noPadding: true, usePowerOf2Sizes: false } on mmsdbrrd-raw-dbs-PT5S-PT2H-20150928T1555 with status NamespaceNotFound ns does not exist during oplog application

The missing document not containing an _id is just how it tried to recover when the command failed, which shouldn't have.

I'm not sure of the reason the collection doesn't exist, and will have to look a but more into how that could have happened.

Generated at Thu Feb 08 03:54:56 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.