[SERVER-38724] Under rare conditions, updates to capped collections can cause SECONDARY members to fail Created: 20/Dec/18  Updated: 19/Apr/22  Resolved: 22/Jul/20

Status: Closed
Project: Core Server
Component/s: Replication, TTL
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Mark Brinsmead Assignee: Gregory Noma
Resolution: Duplicate Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
duplicates SERVER-21700 Do not relax constraints during stead... Closed
is duplicated by SERVER-38741 Mongodb replset member fails to start... Closed
Related
related to SERVER-8972 extent layout differences can cause i... Closed
is related to SERVER-37172 rollbackViaRefetch should not update ... Closed
Operating System: ALL
Sprint: Execution Team 2020-07-27
Participants:
Case:

 Description   
Issue Status as of April 13, 2021

ISSUE DESCRIPTION AND IMPACT
When multiple (size-neutral) updates are made to a single document in a capped collection, it is possible for a secondary to abort on errors like:

2018-12-18T15:19:00.857-0700 F REPL     [repl writer worker 7] writer worker caught exception: 10003 Cannot change the size of a document in a capped collection: 33 != 61 on: { ts: Timestamp 1545171540000|1, t: 3, h: 7402520059475909511, v: 2, op: "u", ns: "test.my_capped", o2: { _\
id: ObjectId('5c12e0f1ea151ba34bd0ae11') }, o: { $set: { y: "a size-neutral value" } } }
2018-12-18T15:19:00.863-0700 I -        [repl writer worker 7] Fatal assertion 16359 CannotGrowDocumentInCappedNamespace: Cannot change the size of a document in a capped collection: 33 != 61 at src/mongo/db/repl/sync_tail.cpp 1061 

Additional issue: when an update is made to a single document in a capped collection and the document is about to be removed as part of the collection cap, it is possible for a secondary to abort with a similar error.
DIAGNOSIS AND AFFECTED VERSIONS
This can occur when a document is overwritten on the Secondary before it is overwritten on the Primary, and is then subject to two size-neutral updates on two fields.
On the mmapV1 storage engine, differences in storage fragmentation between the Primary and the Secondary are one of the known triggers of this issue.

This can also occur when a document is deleted by capped collection enforcement on the Secondary before an update to that document is replicated from the Primary.

This affects all MongoDB versions 3.2 through 4.4. SERVER-21700 fixes this issue in MongoDB 5.0.

REMEDIATION AND WORKAROUNDS

  • If this occurs, resync the affected Secondaries.
  • Ensure that capped collections are the same size across the whole replica set.
  • Do not update documents that are frequently rolled over capped collections. Consider using a TTL for this use-case instead.
Original description

When multiple (size-neutral) updates are made to a single document in a capped collection, it is possible for a secondary to abort on errors like:

2018-12-18T15:19:00.857-0700 F REPL     [repl writer worker 7] writer worker caught exception: 10003 Cannot change the size of a document in a capped collection: 33 != 61 on: { ts: Timestamp 1545171540000|1, t: 3, h: 7402520059475909511, v: 2, op: "u", ns: "test.my_capped", o2: { _\
id: ObjectId('5c12e0f1ea151ba34bd0ae11') }, o: { $set: { y: "a size-neutral value" } } }
2018-12-18T15:19:00.863-0700 I -        [repl writer worker 7] Fatal assertion 16359 CannotGrowDocumentInCappedNamespace: Cannot change the size of a document in a capped collection: 33 != 61 at src/mongo/db/repl/sync_tail.cpp 1061 



 Comments   
Comment by Eric Milkie [ 22/Jul/20 ]

Work on SERVER-21700 has fixed this issue. Updates to documents that are about to be deleted on a primary will be ignored by secondary nodes if such documents have already been deleted by the time the secondary begins to apply such operations.

Comment by raffis [ 12/Nov/19 ]

Well 1 and 2 is not an option since you will end up with the same problem. And the third option usually requires re engineering. 
I've moved capped collections to the local database. Collections in local do not get synced to secondaries therefore you wont run into this bug.

Comment by raffis [ 12/Feb/19 ]

Hi Brian

Sounds good, We have no problem to upgrade to 3.6 or even higher.

 

Cheers

Raffi

Comment by Brian Lane [ 11/Feb/19 ]

Hi raffis,

You can't configure this. We would need to do code changes on our side for this and then backport to older versions.

We would look to only backport this change as far as 3.6 once completed.

-Brian

Comment by raffis [ 04/Feb/19 ]

Hi Brian

 

No TTL is not an alternative at all. The capped collection is used because of the tailable functionality.

 

> If you can upgrade to a later version of MongoDB and migrate to WiredTiger as your storage engine, we may be able to address this issue by skipping the capped update check on secondaries

 

That would be fine, how can I configure this? Can't find any relevant documentation.

What I did now is to place this collection in the local db, which does not get touched by the replset mechanism.

 

Comment by Brian Lane [ 25/Jan/19 ]

Hi raffis,

You might want to consider a TTL index on an uncapped collection as an alternative.

If you can upgrade to a later version of MongoDB and migrate to WiredTiger as your storage engine, we may be able to address this issue by skipping the capped update check on secondaries. This check is a hard requirement with MMAPv1, but with WiredTiger we may be able to relax this requirement.

FYI: We have deprecated MMAPv1 as of version 4.0.

-Brian

Comment by raffis [ 24/Jan/19 ]

Hi Brian

It' s 3.4.18 and mmapv1.

Let me know if you need something else.

 

Cheers

Comment by Brian Lane [ 24/Jan/19 ]

Hi raffis

Thanks for the additional details.

Can you also let me know which version of MongoDB you are using as well as which storage engine - WiredTiger / MMAPv1.

Cheers!
-Brian

Comment by raffis [ 22/Jan/19 ]

Hi Brian

 

Well the collection where the problem occurs is heavily used by https://github.com/gyselroth/mongodb-php-task-scheduler

This is queue job processor and the capped collection makes perfectly sense there.  Theoretically the library could be changed to use the newer watch functionality on the oplog but still, a capped collection makes perfectly sense. (And it would need some time to release a newer version for this functionality).

 

The bug occurs now (randomly) daily. Basically our replset is not functional at all. I can resync entire slave nodes and quite fast they crash with the error mentioned above.

Interesting fact: In all cases the consumer crashed at { $set:

{ status: 3, ended: new Date(1545406647813) }

}  while a a new datetime object is set.

(The library makes other changes as well before changing this end timestamp, for example setting status to 1 without any date changes).

 

$ cat /var/log/mongodb/mongod.log*gz | gunzip | grep -a "Cannot change"; cat /var/log/mongodb/mongod.log* | grep "Cannot change" -a

2019-01-03T10:25:08.782+0100 F REPL [repl writer worker 11] writer worker caught exception: 10003 Cannot change the size of a document in a capped collection: 71 != 86 on: { ts: Timestamp 1546507508000|2, h: -496745303623253444, v: 2, op: "u", ns: "cloudfs.taskscheduler.jobs", o2:

{ _id: ObjectId('5c2c8373cd9132000910d522') }

, o: { $set:

{ status: 3, ended: new Date(1546507507990) }

} }
2019-01-03T10:25:08.905+0100 I - [repl writer worker 11] Fatal assertion 16359 CannotGrowDocumentInCappedNamespace: Cannot change the size of a document in a capped collection: 71 != 86 at src/mongo/db/repl/sync_tail.cpp 1087
2019-01-03T13:27:55.107+0100 I STORAGE [initandlisten] exception in initAndListen: 10003 Cannot change the size of a document in a capped collection: 71 != 86, terminating

 

2018-12-21T16:37:27.830+0100 F REPL [repl writer worker 5] writer worker caught exception: 10003 Cannot change the size of a document in a capped collection: 71 != 86 on: { ts: Timestamp 1545406647000|4, h: -308296782776005596, v: 2, op: "u", ns: "cloudfs.taskscheduler.jobs", o2:

{ _id: ObjectId('5c1d087b95aa0a02386e9843') }

, o: { $set:

{ status: 3, ended: new Date(1545406647813) }

} }
2018-12-21T16:37:27.830+0100 I - [repl writer worker 5] Fatal assertion 16359 CannotGrowDocumentInCappedNamespace: Cannot change the size of a document in a capped collection: 71 != 86 at src/mongo/db/repl/sync_tail.cpp 1087
2018-12-21T18:33:15.567+0100 F REPL [repl writer worker 2] writer worker caught exception: 10003 Cannot change the size of a document in a capped collection: 71 != 86 on: { ts: Timestamp 1545406647000|4, h: -308296782776005596, v: 2, op: "u", ns: "cloudfs.taskscheduler.jobs", o2:

{ _id: ObjectId('5c1d087b95aa0a02386e9843') }

, o: { $set:

{ status: 3, ended: new Date(1545406647813) }

} }
2018-12-21T18:33:15.567+0100 I - [repl writer worker 2] Fatal assertion 16359 CannotGrowDocumentInCappedNamespace: Cannot change the size of a document in a capped collection: 71 != 86 at src/mongo/db/repl/sync_tail.cpp 1087
2018-12-21T18:54:41.777+0100 F REPL [repl writer worker 14] writer worker caught exception: 10003 Cannot change the size of a document in a capped collection: 71 != 86 on: { ts: Timestamp 1545406647000|4, h: -308296782776005596, v: 2, op: "u", ns: "cloudfs.taskscheduler.jobs", o2:

{ _id: ObjectId('5c1d087b95aa0a02386e9843') }

, o: { $set:

{ status: 3, ended: new Date(1545406647813) }

} }
2018-12-21T18:54:41.777+0100 I - [repl writer worker 14] Fatal assertion 16359 CannotGrowDocumentInCappedNamespace: Cannot change the size of a document in a capped collection: 71 != 86 at src/mongo/db/repl/sync_tail.cpp 1087
2019-01-21T08:43:21.477+0100 I STORAGE [initandlisten] exception in initAndListen: 10003 Cannot change the size of a document in a capped collection: 71 != 86, terminating
2019-01-21T18:37:32.433+0100 F REPL [repl writer worker 11] writer worker caught exception: 10003 Cannot change the size of a document in a capped collection: 71 != 86 on: { ts: Timestamp 1548092252000|2, h: 8891639960730045349, v: 2, op: "u", ns: "cloudfs.taskscheduler.jobs", o2:

{ _id: ObjectId('5c4592d8a7621c000a270723') }

, o: { $set:

{ status: 3, ended: new Date(1548092252319) }

} }
2019-01-21T18:37:32.433+0100 I - [repl writer worker 11] Fatal assertion 16359 CannotGrowDocumentInCappedNamespace: Cannot change the size of a document in a capped collection: 71 != 86 at src/mongo/db/repl/sync_tail.cpp 1087
2019-01-17T07:23:58.732+0100 F REPL [repl writer worker 6] writer worker caught exception: 10003 Cannot change the size of a document in a capped collection: 71 != 86 on: { ts: Timestamp 1547706238000|2, h: -1287491416624820144, v: 2, op: "u", ns: "cloudfs.taskscheduler.jobs", o2:

{ _id: ObjectId('5c3ecdfd81fa83000704e0f6') }

, o: { $set:

{ status: 3, ended: new Date(1547706238359) }

} }
2019-01-17T07:23:58.770+0100 I - [repl writer worker 6] Fatal assertion 16359 CannotGrowDocumentInCappedNamespace: Cannot change the size of a document in a capped collection: 71 != 86 at src/mongo/db/repl/sync_tail.cpp 1087
2019-01-17T07:52:46.610+0100 I STORAGE [initandlisten] exception in initAndListen: 10003 Cannot change the size of a document in a capped collection: 71 != 86, terminating
2019-01-18T10:36:02.219+0100 F REPL [repl writer worker 0] writer worker caught exception: 10003 Cannot change the size of a document in a capped collection: 71 != 86 on: { ts: Timestamp 1547804162000|2, h: -9155582286729571842, v: 2, op: "u", ns: "cloudfs.taskscheduler.jobs", o2:

{ _id: ObjectId('5c404c817047201a535de75f') }

, o: { $set:

{ status: 3, ended: new Date(1547804162166) }

} }
2019-01-18T10:36:02.219+0100 I - [repl writer worker 0] Fatal assertion 16359 CannotGrowDocumentInCappedNamespace: Cannot change the size of a document in a capped collection: 71 != 86 at src/mongo/db/repl/sync_tail.cpp 1087
2019-01-19T10:36:03.302+0100 F REPL [repl writer worker 13] writer worker caught exception: 10003 Cannot change the size of a document in a capped collection: 71 != 86 on: { ts: Timestamp 1547890562000|3, h: -9125484584779178179, v: 2, op: "u", ns: "cloudfs.taskscheduler.jobs", o2:

{ _id: ObjectId('5c419e02863a75000b7a54b9') }

, o: { $set:

{ status: 3, ended: new Date(1547890562963) }

} }
2019-01-19T10:36:03.315+0100 I - [repl writer worker 13] Fatal assertion 16359 CannotGrowDocumentInCappedNamespace: Cannot change the size of a document in a capped collection: 71 != 86 at src/mongo/db/repl/sync_tail.cpp 1087

 

Comment by Brian Lane [ 21/Jan/19 ]

Hi raffis,

Have you considered using a TTL index instead? Here is a tutorial on how to do this.

If you are unable to use TTL indexes would be interested in understanding why.

-Brian

Comment by raffis [ 18/Jan/19 ]

This happened now three times in a row within 3 days. Definitely not rare.

Comment by raffis [ 17/Jan/19 ]

This is certainly not a rare condition. See SERVER-38741, We encountered this bug multiple times over the last 1-2 months. Resync the affected nodes helps but it takes ages for big nodes.

Comment by Danny Hatcher (Inactive) [ 20/Dec/18 ]

SERVER-8972 is the root cause of this issue but that ticket just mentioned possible data inconsistency. This ticket shows that that inconsistency can cause a server to crash.

Generated at Thu Feb 08 04:49:48 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.