[SERVER-34188] CappedPositionLost errors in very large oplog Created: 29/Mar/18 Updated: 27/Oct/23 Resolved: 22/Apr/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying, WiredTiger |
| Affects Version/s: | 3.2.11 |
| Fix Version/s: | None |
| Type: | Question | Priority: | Minor - P4 |
| Reporter: | Stefanos Boglou | Assignee: | Dmitry Agranat |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Debian |
| Attachments: | |
| Participants: | |
| Description |
|
Hello, I am using the oplog collection to create incremental backups from a MongoDB replica (secondary). Sometimes, during busy periods on the replica set, with an undersized WiredTiger cache on the secondary, a very small percentage of the backups fail with the following error from the MongoDB server.
From pymongo, it seems that the CappedPositionLost error originates from the MongoDB secondary server. Log from mongod:
Replication info shows an oplog size covering a period of ten days, and each incremental snapshot spans 20 minutes up to a couple of hours, so I do not see how a record could be missing from the oplog.
Is this normal behavior for mongod? |
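A minimal sketch of the kind of pymongo oplog-tailing backup pass described above, including handling for the failure the reporter sees. The connection string and helper names (`backup_since`, `dump_entry`) are illustrative assumptions, not from the ticket:

```python
# Hypothetical incremental backup pass over the oplog with pymongo.
from pymongo import MongoClient, CursorType
from pymongo.errors import OperationFailure

client = MongoClient("mongodb://secondary.example:27017/")  # assumed host
oplog = client.local["oplog.rs"]

def backup_since(last_ts, dump_entry):
    """Copy oplog entries newer than last_ts; return the newest ts copied."""
    cursor = oplog.find(
        {"ts": {"$gt": last_ts}},
        cursor_type=CursorType.TAILABLE_AWAIT,
        oplog_replay=True,
    )
    try:
        for entry in cursor:
            dump_entry(entry)        # write the entry to the backup archive
            last_ts = entry["ts"]
    except OperationFailure as exc:
        # Under pressure the server can invalidate the cursor's position in
        # the capped oplog ("CollectionScan died due to position in capped
        # collection being deleted"); the next pass must restart from a ts
        # that still exists in the oplog.
        print("oplog read failed, will retry on the next pass:", exc)
    finally:
        cursor.close()
    return last_ts
```

Because the cursor's position in the capped oplog can be invalidated at any time, a pass written this way has to persist the last timestamp it copied and resume from there on the next run.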
| Comments |
| Comment by Dmitry Agranat [ 22/Apr/18 ] |
|
Hi vfxcode, Thank you for the update; glad to hear that after the upgrade the backups seem to work much better and much faster.
I did not find any evidence that the reported issue is related to the other issue you referenced.
As mentioned in the previous comment, the CappedPositionLost error was a result of the pressure on the undersized WiredTiger cache. Thanks, |
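For completeness, a rough sketch (assumed host name; not part of the ticket) of how the ten-day oplog window mentioned in the description can be checked from pymongo, similar to what rs.printReplicationInfo() reports. Even with a large window, a cursor can still lose its position if the server invalidates it under pressure, which is the symptom discussed here:

```python
# Estimate the oplog window on the secondary from the oldest and newest entries.
from pymongo import MongoClient

client = MongoClient("mongodb://secondary.example:27017/")  # assumed host
oplog = client.local["oplog.rs"]

first = oplog.find_one(sort=[("$natural", 1)])   # oldest oplog entry
last = oplog.find_one(sort=[("$natural", -1)])   # newest oplog entry

window_seconds = last["ts"].time - first["ts"].time
print("oplog window: %.1f hours" % (window_seconds / 3600.0))
```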
| Comment by Stefanos Boglou [ 19/Apr/18 ] |
|
The servers are of different sizes because they have different roles. The periods with no diagnostic data seem to be related to this issue, which appears to affect our masters as well. What seemed weird was the error itself (CollectionScan died due to position in capped collection being deleted), not that it was merely slow (the expected behavior). |
| Comment by Dmitry Agranat [ 17/Apr/18 ] |
|
Hi vfxcode, Thank you for uploading the diagnostic.data; it was very useful. What I think is happening here is a result of pressure on the WiredTiger cache, which is undersized. The CappedPositionLost error, in this case, is just a symptom.
I've noticed that you have different configurations between the members of the replica set. The primary is configured with 193GB RAM / 80GB WiredTiger cache, while the reported secondary is configured with 64GB RAM / 10GB WiredTiger cache. This goes against our best practice of configuring all members of the replica set identically. So you are correct about the undersized WiredTiger cache on the secondary. Please note that the SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion, please post on the mongodb-user group or Stack Overflow with the mongodb tag. A question like this, involving more discussion, would best be posted on the mongodb-user group. Thanks, |
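A small sketch of how cache pressure like this can be observed from pymongo by comparing WiredTiger cache usage against the configured maximum. The statistic names are taken from typical serverStatus output and the host is assumed; verify both against your build:

```python
# Compare WiredTiger cache usage against the configured maximum via serverStatus.
from pymongo import MongoClient

client = MongoClient("mongodb://secondary.example:27017/")  # assumed host
cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]

used = cache["bytes currently in the cache"]
limit = cache["maximum bytes configured"]
dirty = cache["tracked dirty bytes in the cache"]

print("cache used: %.1f%% of %.1f GB (dirty %.1f%%)" % (
    100.0 * used / limit,
    limit / 1024.0 ** 3,
    100.0 * dirty / limit,
))
```

Sustained usage near the configured maximum, together with a high dirty percentage, is the kind of pressure described in this comment.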
| Comment by Stefanos Boglou [ 02/Apr/18 ] |
|
I have uploaded the two files to the link you provided. |
| Comment by Kelsey Schubert [ 02/Apr/18 ] |
|
Hi vfxcode, I've created a secure upload portal for you to use. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time. Thank you, |
| Comment by Stefanos Boglou [ 02/Apr/18 ] |
|
Hello! Thank you for your response. There is nothing except noise from the NETWORK facility and slow commands on the primary, and just NETWORK on the affected node, during that time window (+/- 1 hour at least). I can provide you with the diagnostic data; however, this database contains sensitive data and I would prefer not to post the file publicly. Is there a place or way for me to send the files privately? Thank you in advance, |
| Comment by Kelsey Schubert [ 31/Mar/18 ] |
|
Hi vfxcode, Thank you for reporting this issue. So we can continue to investigate, would you please upload the complete log files and an archive of the diagnostic.data from both the affected node and the primary of the replica set? Thank you, |