[SERVER-34188] CappedPositionLost errors in very large oplog Created: 29/Mar/18  Updated: 27/Oct/23  Resolved: 22/Apr/18

Status: Closed
Project: Core Server
Component/s: Querying, WiredTiger
Affects Version/s: 3.2.11
Fix Version/s: None

Type: Question Priority: Minor - P4
Reporter: Stefanos Boglou Assignee: Dmitry Agranat
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Debian
Debian packages


Attachments: PNG File wt_cache.png    
Participants:

 Description   

Hello,

I am using the oplog collection to create incremental backups from a MongoDB replica (Secondary).

Sometimes, during busy periods on the replica set (with an undersized WiredTiger cache on the secondary), a very small percentage of the backups fail with the following error from the MongoDB server:

29-Mar 06:01 + oplog_backup
29-Mar 07:06 since filter: {'ts': {'$gt': Timestamp(1522278803, 94)}}
29-Mar 07:06 Traceback (most recent call last):
29-Mar 07:06   File "oplog_backup", line 99, in <module>
29-Mar 07:06     main()
29-Mar 07:06   File "oplog_backup", line 74, in main
29-Mar 07:06     for doc in oplog.find(filter).sort("$natural"):
29-Mar 07:06   File "/usr/lib/python2.7/dist-packages/pymongo/cursor.py", line 1114, in next
29-Mar 07:06     if len(self.__data) or self._refresh():
29-Mar 07:06   File "/usr/lib/python2.7/dist-packages/pymongo/cursor.py", line 1036, in _refresh
29-Mar 07:06     self.__collation))
29-Mar 07:06   File "/usr/lib/python2.7/dist-packages/pymongo/cursor.py", line 928, in __send_message
29-Mar 07:06     helpers._check_command_response(doc['data'][0])
29-Mar 07:06   File "/usr/lib/python2.7/dist-packages/pymongo/helpers.py", line 210, in _check_command_response
29-Mar 07:06     raise OperationFailure(msg % errmsg, code, response)
29-Mar 07:06 pymongo.errors.OperationFailure: Executor error during find command: CappedPositionLost: CollectionScan died due to position in capped collection being deleted. Last seen record id: RecordId(<InsertLongNumber>)

From the pymongo traceback it seems that the CappedPositionLost error originates from the MongoDB secondary server.
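
For context, the read loop in the traceback boils down to the standard incremental oplog pattern; the sketch below is illustrative only (the hostname and the checkpoint timestamp are placeholders, not the actual backup script):

# Illustrative sketch of the oplog read shown in the traceback; not the real script.
# The hostname and the starting checkpoint are placeholders.
from bson.timestamp import Timestamp
from pymongo import MongoClient

client = MongoClient("mongodb://secondary.example.net:27017")  # assumed host
oplog = client.local["oplog.rs"]

last_ts = Timestamp(1522278803, 94)  # checkpoint from the previous run

# $natural order follows insertion order in the capped oplog collection.
for doc in oplog.find({"ts": {"$gt": last_ts}}).sort("$natural"):
    last_ts = doc["ts"]  # advance the checkpoint as entries are written out
    # ... append doc to the incremental backup ...

As an aside, if the driver version in use supports it, passing oplog_replay=True to find() may let the server seek to the start timestamp instead of scanning the oplog from the beginning.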

Log from mongod:

2018-03-29T07:06:19.873+0300 W QUERY    [conn492712] Could not restore RecordCursor for CollectionScan: local.oplog.rs
2018-03-29T07:06:19.873+0300 E QUERY    [conn492712] Plan executor error during find command: DEAD, stats: { stage: "COLLSCAN", filter: { ts: { $gt: Timestamp 1522278803000|94 } }, nReturned: 0, executionTimeMillisEstimate: 1690, works: 1241191, advanced: 0, needTime: 1241190, needYield: 0, saveState: 9697, restoreState: 9697, isEOF: 0, invalidates: 0, direction: "forward", docsExamined: 1241189 }
2018-03-29T07:06:19.874+0300 I COMMAND  [conn492712] command local.oplog.rs command: find { find: "oplog.rs", filter: { ts: { $gt: Timestamp 1522278803000|94 } }, sort: { $natural: 1 } } keyUpdates:0 writeConflicts:0 numYields:9697 reslen:256 locks:{ Global: { acquireCount: { r: 19396 }, acquireWaitCount: { r: 2519 }, timeAcquiringMicros: { r: 3889033515 } }, Database: { acquireCount: { r: 9698 } }, oplog: { acquireCount: { r: 9698 } } } protocol:op_query 3890891ms

The replication info shows an oplog size that covers a period of more than ten days, and each incremental snapshot covers only 20 minutes to a couple of hours, so I do not see how a record can be missing from the oplog.

configured oplog size:   9502.277587890625MB
log length start to end: 989089secs (274.75hrs)
oplog first event time:  Sun Mar 18 2018 05:08:54 GMT+0200 (EET)
oplog last event time:   Thu Mar 29 2018 16:53:43 GMT+0300 (EEST)
now:                     Thu Mar 29 2018 16:54:01 GMT+0300 (EEST)
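
For completeness, the same window can be computed directly from local.oplog.rs with PyMongo; a minimal sketch, where the hostname is a placeholder and a direct connection to the secondary is assumed:

# Sketch: compute the oplog window, similar to the replication info above.
# The hostname is a placeholder.
from pymongo import MongoClient

client = MongoClient("mongodb://secondary.example.net:27017")  # assumed host
oplog = client.local["oplog.rs"]

first = oplog.find_one(sort=[("$natural", 1)])   # oldest retained entry
last = oplog.find_one(sort=[("$natural", -1)])   # newest entry

window = last["ts"].time - first["ts"].time
print("oplog first event time: %s" % first["ts"].as_datetime())
print("oplog last event time:  %s" % last["ts"].as_datetime())
print("log length start to end: %d secs (%.2f hrs)" % (window, window / 3600.0))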

Is this normal behavior of mongod?



 Comments   
Comment by Dmitry Agranat [ 22/Apr/18 ]

Hi vfxcode,

Thank you for the update; I am glad to hear that the backups run much better and much faster after the upgrade.

You wrote that the periods with no diagnostic data seem to be related to https://jira.mongodb.org/browse/WT-3150 and that it seems to affect your masters as well. I did not find any evidence that the reported issue is related to WT-3150.

You also noted that what seems weird is the error (CollectionScan died due to position in capped collection being deleted), not that the operation was just slow (the expected behavior). As mentioned in the previous comment, the CappedPositionLost error was a result of the pressure on the undersized WiredTiger cache.

Thanks,
Dima

Comment by Stefanos Boglou [ 19/Apr/18 ]

The servers are of different sizes because they have different roles.
Each MongoDB replication setup has 4 servers: 2 identical production servers, 1 for snapshots/backups, and 1 for off-site disaster recovery.
For the two backup servers we do not care about performance, hence the much smaller cache size.

The periods with no diagnostic data seem to be related to this WT-3150 issue, which seems to affect our masters as well.
We upgraded to 3.2.17 and the backups now seem to work much better and much faster.

What seems weird is the error (CollectionScan died due to position in capped collection being deleted), not that the operation was just slow (the expected behavior).

Comment by Dmitry Agranat [ 17/Apr/18 ]

Hi vfxcode,

Thank you for uploading the diagnostic.data; it was very useful.

What I think is happening here is a result of pressure on the WiredTiger cache, which is undersized. The CappedPositionLost error, in this case, is just a symptom.

  • It is not clear how much data overall you are trying to read from the oplog, but at point A we can see ~255 MB/s. I assume that over the period between A and B we could have tried to read a few GB of oplog into the WT cache.
  • We can see that between A and B the WiredTiger cache is struggling to keep to the desired threshold of 80%. During this time the application threads start to help the eviction process, which throttles other operations going into WiredTiger. This ensures that evictions, especially dirty-page evictions, take priority when the server is busy or is interfacing with a slow disk.

I've noticed that you have different configurations across the members of the replica set: the primary is configured with 193 GB RAM / 80 GB WiredTiger cache, while the reported secondary is configured with 64 GB RAM / 10 GB WiredTiger cache. This goes against our best practice of configuring all members of a replica set identically.

So you are correct about the undersized WiredTiger cache on the secondary.
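
As an illustration, the cache pressure described above can be observed in the wiredTiger.cache section of serverStatus; a minimal sketch (the hostname is a placeholder; run it against each member to compare configurations):

# Sketch: read WiredTiger cache metrics via serverStatus on one member.
# The hostname is a placeholder; run against each member to compare.
from pymongo import MongoClient

client = MongoClient("mongodb://secondary.example.net:27017")  # assumed host
cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]

configured = cache["maximum bytes configured"]
in_cache = cache["bytes currently in the cache"]
dirty = cache["tracked dirty bytes in the cache"]

print("configured cache: %.1f GB" % (configured / 1024.0 ** 3))
print("cache fill:       %.1f%% of configured" % (100.0 * in_cache / configured))
print("dirty:            %.1f%% of configured" % (100.0 * dirty / configured))
# Eviction normally tries to keep the cache below ~80% full; sustained values
# above that, as in the A-B window described above, indicate cache pressure.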

Please note that the SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion, please post on the mongodb-user group or Stack Overflow with the mongodb tag; a question like this, involving more discussion, would be best posted on the mongodb-user group.

Thanks,
Dima

Comment by Stefanos Boglou [ 02/Apr/18 ]

I have uploaded the two files to the link you provided.
One for the primary and one for the affected node.

Comment by Kelsey Schubert [ 02/Apr/18 ]

Hi vfxcode,

I've created a secure upload portal for you to use. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time.

Thank you,
Kelsey

Comment by Stefanos Boglou [ 02/Apr/18 ]

Hello!

Thank you for your response. There is nothing except noise from the NETWORK component and slow commands on the primary, and just NETWORK on the affected node, during that time window (at least +/- 1 hour).

I can provide you with the diagnostic data; however, this database has sensitive data and I would prefer not to send this file publicly.

Is there a place or way for me to send the files privately?

Thank you in advance,
Stefanos

Comment by Kelsey Schubert [ 31/Mar/18 ]

Hi vfxcode,

Thank you for reporting this issue. So that we can continue to investigate, would you please upload the complete log files and an archive of the diagnostic.data from both the affected node and the primary of the replica set?

Thank you,
Kelsey
