[SERVER-38724] Under rare conditions, updates to capped collections can cause SECONDARY members to fail Created: 20/Dec/18 Updated: 19/Apr/22 Resolved: 22/Jul/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, TTL |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Mark Brinsmead | Assignee: | Gregory Noma |
| Resolution: | Duplicate | Votes: | 3 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Operating System: | ALL |
| Sprint: | Execution Team 2020-07-27 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
Issue Status as of April 13, 2021

ISSUE DESCRIPTION AND IMPACT
Additional issue: when an update is made to a single document in a capped collection and the document is about to be removed as part of the collection cap, it is possible for a secondary to abort with a similar error. This can also occur when a document is deleted by capped-collection enforcement on the secondary before an update to that document is replicated from the primary. This affects all MongoDB versions 3.2 through 4.4.

REMEDIATION AND WORKAROUNDS

Original description
When multiple (size-neutral) updates are made to a single document in a capped collection, it is possible for a secondary to abort with errors like "Cannot change the size of a document in a capped collection".
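For context, a minimal shell sketch of the kind of workload described; the collection name, sizes, and field names are illustrative (not taken from this ticket), and this is not a deterministic reproduction:

// Illustrative only: a small capped collection so that cap enforcement
// constantly removes the oldest documents.
db.createCollection("jobs", { capped: true, size: 16384, max: 50 });

// Producer: keep inserting jobs so the cap keeps deleting old ones.
for (var i = 0; i < 1000; i++) {
  db.jobs.insertOne({ seq: i, status: 0, created: new Date() });
}

// Consumer: size-neutral updates (an int32 overwritten by another int32).
// Per this ticket, a secondary can still abort while applying such updates
// if the target document is about to be, or already has been, removed by
// capped-collection enforcement.
db.jobs.updateOne({ seq: 990 }, { $set: { status: 1 } });
db.jobs.updateOne({ seq: 990 }, { $set: { status: 2 } });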
|
| Comments |
| Comment by Eric Milkie [ 22/Jul/20 ] |
|
Work on |
| Comment by raffis [ 12/Nov/19 ] |
|
Well, 1 and 2 are not an option since you will end up with the same problem, and the third option usually requires re-engineering. |
| Comment by raffis [ 12/Feb/19 ] |
|
Hi Brian, Sounds good. We have no problem upgrading to 3.6 or even higher.
Cheers Raffi |
| Comment by Brian Lane [ 11/Feb/19 ] |
|
Hi raffis, You can't configure this. We would need to make code changes on our side and then backport them to older versions. We would look to backport this change only as far back as 3.6 once completed. -Brian |
| Comment by raffis [ 04/Feb/19 ] |
|
Hi Brian
No, a TTL index is not an alternative at all. The capped collection is used because of the tailable-cursor functionality.
> If you can upgrade to a later version of MongoDB and migrate to WiredTiger as your storage engine, we may be able to address this issue by skipping the capped update check on secondaries
That would be fine; how can I configure this? I can't find any relevant documentation. What I did for now is place this collection in the local db, which is not touched by the replica set mechanism.
|
| Comment by Brian Lane [ 25/Jan/19 ] |
|
Hi raffis, You might want to consider a TTL index on an uncapped collection as an alternative. If you can upgrade to a later version of MongoDB and migrate to WiredTiger as your storage engine, we may be able to address this issue by skipping the capped update check on secondaries. This check is a hard requirement with MMAPv1, but with WiredTiger we may be able to relax this requirement. FYI: We have deprecated MMAPv1 as of version 4.0. -Brian |
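In shell terms, the TTL alternative described here could look roughly like this (collection name, field name, and expiry are illustrative): an ordinary uncapped collection whose documents carry an indexed date field, so old documents are removed by the TTL monitor rather than by capped-collection size enforcement.

// Hypothetical sketch of the TTL-index alternative.
db.createCollection("jobs");
// The TTL monitor deletes documents once "created" is older than
// expireAfterSeconds (checked roughly once a minute).
db.jobs.createIndex({ created: 1 }, { expireAfterSeconds: 3600 });

// Documents must include the indexed date field to be eligible for expiry.
db.jobs.insertOne({ status: 0, created: new Date() });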
| Comment by raffis [ 24/Jan/19 ] |
|
Hi Brian, It's 3.4.18 and MMAPv1. Let me know if you need anything else.
Cheers |
| Comment by Brian Lane [ 24/Jan/19 ] |
|
Hi raffis, Thanks for the additional details. Can you also let me know which version of MongoDB you are using, as well as which storage engine (WiredTiger or MMAPv1)? Cheers! |
| Comment by raffis [ 22/Jan/19 ] |
|
Hi Brian
Well, the collection where the problem occurs is heavily used by https://github.com/gyselroth/mongodb-php-task-scheduler. This is a queue/job processor, and a capped collection makes perfect sense there. Theoretically the library could be changed to use the newer watch functionality on the oplog, but a capped collection still makes perfect sense. (And it would take some time to release a newer version with this functionality.)
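The "watch functionality" mentioned here is presumably the change streams API (MongoDB 3.6+ on a replica set). A rough sketch of consuming newly inserted jobs from an ordinary, uncapped collection this way (the namespace matches the log lines below; the pipeline is illustrative):

// Sketch only: consume new jobs via a change stream instead of a tailable
// cursor on a capped collection. Requires MongoDB 3.6+ and a replica set.
var jobs = db.getSiblingDB("cloudfs").getCollection("taskscheduler.jobs");
var stream = jobs.watch([{ $match: { operationType: "insert" } }]);

while (!stream.isExhausted()) {
  if (stream.hasNext()) {
    printjson(stream.next().fullDocument); // the newly inserted job document
  }
}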
The bug now occurs (randomly) daily. Basically our replica set is not functional at all. I can resync entire secondary nodes, and quite quickly they crash again with the error mentioned above. Interesting fact: in all cases the consumer crashed at { $set: { status: 3, ended: new Date(1545406647813) }}, i.e. while a new datetime object is set. (The library makes other changes as well before changing this end timestamp, for example setting status to 1 without any date changes.)
$ cat /var/log/mongodb/mongod.log*gz | gunzip | grep -a "Cannot change"; cat /var/log/mongodb/mongod.log* | grep "Cannot change" -a
2019-01-03T10:25:08.782+0100 F REPL [repl writer worker 11] writer worker caught exception: 10003 Cannot change the size of a document in a capped collection: 71 != 86 on: { ts: Timestamp 1546507508000|2, h: -496745303623253444, v: 2, op: "u", ns: "cloudfs.taskscheduler.jobs", o2: { _id: ObjectId('5c2c8373cd9132000910d522') }, o: { $set: { status: 3, ended: new Date(1546507507990) }} }
2018-12-21T16:37:27.830+0100 F REPL [repl writer worker 5] writer worker caught exception: 10003 Cannot change the size of a document in a capped collection: 71 != 86 on: { ts: Timestamp 1545406647000|4, h: -308296782776005596, v: 2, op: "u", ns: "cloudfs.taskscheduler.jobs", o2: { _id: ObjectId('5c1d087b95aa0a02386e9843') }, o: { $set: { status: 3, ended: new Date(1545406647813) }} }
..., o: { $set: { status: 3, ended: new Date(1548092252319) }} }
..., o: { $set: { status: 3, ended: new Date(1547706238359) }} }
..., o: { $set: { status: 3, ended: new Date(1547804162166) }} }
..., o: { $set: { status: 3, ended: new Date(1547890562963) }} }
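Whatever the underlying replication race, the 15-byte difference in these log lines (71 vs. 86) matches the $set adding an "ended" datetime field that is not present in the secondary's copy of the document: 1 BSON type byte + 6 bytes for the "ended\0" key + 8 bytes for the int64 milliseconds value. A hypothetical illustration using the shell's Object.bsonsize() helper (the document shape is invented, not the library's actual schema):

// Invented document shapes, only meant to show the 15-byte delta.
var withoutEnded = { _id: ObjectId(), status: 1 };
var withEnded    = { _id: ObjectId(), status: 3, ended: new Date() };

Object.bsonsize(withEnded) - Object.bsonsize(withoutEnded);
// 15 = 1 (BSON type byte) + 6 ("ended\0" key) + 8 (int64 milliseconds)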
|
| Comment by Brian Lane [ 21/Jan/19 ] |
|
Hi raffis, Have you considered using a TTL index instead? Here is a tutorial on how to do this. If you are unable to use TTL indexes, we would be interested in understanding why. -Brian |
| Comment by raffis [ 18/Jan/19 ] |
|
This has now happened three times in a row within 3 days. Definitely not rare. |
| Comment by raffis [ 17/Jan/19 ] |
|
This is certainly not a rare condition. See |
| Comment by Danny Hatcher (Inactive) [ 20/Dec/18 ] |
|
|