[SERVER-32827] Initial sync can fail when syncing a capped collection if the capped collection rolls over on the sync source Created: 22/Jan/18  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Querying, Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Sergey Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Master: MongoDB 3.4.5
Slave: MongoDB 3.4.10 (tried 3.4.5, the result was the same)
Storage engine: WiredTiger
Total DB size: 533 GB (/var/lib/mongodb dir size)
oplog size: 50 GB (log length start to end: 6-15 hours on our workload)
Hardware: HP DL360, CPU - 1x Xeon E5-2640 v3 @ 2.60GHz, 378 GB RAM, 4xSAS 2.5" 15K RAID 10, 1Gbit LAN.


Attachments: HTML File mongo_initial_sync_20     File test.tar.gz    
Issue Links:
Depends
depends on SERVER-16049 Replicate capped collection deletes e... Closed
is depended on by TOOLS-1636 mongodump fails when capped collectio... Waiting (Blocked)
Duplicate
is duplicated by SERVER-33652 Restarting oplog query due to error: ... Closed
Related
related to SERVER-12293 initial sync of a capped collection c... Backlog
related to TOOLS-1636 mongodump fails when capped collectio... Waiting (Blocked)
Assigned Teams:
Replication
Operating System: ALL

 Description   

There is a problem with initial sync. Several attempts have failed with the following error:
CappedPositionLost: CollectionScan died due to position in capped collection being deleted

The sizes of the capped collections on which the errors occurred: 30 - 100 GB.
On our workload the capped collections' "capacity" (the time before each document is deleted) varies between 24 and 60 hours.
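
For reference, the error itself is not specific to initial sync: any collection scan whose saved position is overwritten when a capped collection wraps around dies the same way. Below is a minimal sketch in Python/pymongo that provokes it (names and sizes are made up for illustration, and it assumes a mongod listening on localhost:27017):

    from pymongo import MongoClient
    from pymongo.errors import OperationFailure

    client = MongoClient("localhost", 27017)
    db = client.capped_demo
    db.drop_collection("events")
    # A small cap so the collection wraps around quickly.
    db.create_collection("events", capped=True, size=1024 * 1024)
    events = db.events

    # Fill the collection about halfway.
    events.insert_many([{"seq": i, "pad": "x" * 1024} for i in range(512)])

    # Start a collection scan but only pull the first batch, leaving the
    # cursor positioned near the head of the capped collection.
    cursor = events.find(batch_size=10)
    next(cursor)

    # Wrap the collection so the cursor's saved position is overwritten.
    events.insert_many([{"seq": i, "pad": "x" * 1024} for i in range(2048)])

    try:
        for _ in cursor:  # the next getMore finds its position deleted
            pass
    except OperationFailure as exc:
        # Expect code 136, codeName "CappedPositionLost".
        print(exc.code, exc.details.get("codeName"))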

Here is some more detailed information about the collections:

CappedPositionLost errors  Collection        Capped size  Capacity
6                          DB1.collection1   40 GB        2.46 weeks
3                          DB1.collection14  37 GB        min 56 h
3                          DB1.collection2   30 GB        min 36 h
9                          DB1.collection9   100 GB       min 24 h

The logs for the 7 failed attempts to perform the initial sync are attached.

Currently only one live instance is left in the replica set on our production system. Please help us bring the replica set back up.



 Comments   
Comment by Louis Williams [ 11/Oct/21 ]

Moving back to "Open" because SERVER-16049, which this ticket depends on, was fixed in 5.0.

Comment by Judah Schvimer [ 15/Mar/21 ]

Hi sombrafam@gmail.com,

Thank you for reaching out! SERVER-16049, which is a prerequisite for fixing this, is currently in progress. Once it is complete we will investigate whether any further work is needed to fix this bug.

Thanks,
Judah

Comment by Erlon Cruz [ 14/Mar/21 ]

Hi folks, what would it take to fix this bug? We have a customer with this problem and I need to understand whether it is easily fixable or would require large structural changes to Mongo.

Comment by Sergey [ 23/Jan/18 ]

I reproduced the problem in a test environment. Please see the attached test.tar.gz. It contains a script to reproduce the problem and the logs of a test run from my computer. m1 and m2 are two replicas. In the test a new replica (m3) is added, and MongoDB produces the same error (CappedPositionLost) that we had in our production environment.

test.sh is a script that reproduces the problem where the initial sync of a new replica fails.

The problem occurs when there is a capped collection with a secondary index and a high insert rate, and a new replica performs an initial sync from an existing member of the replica set. By the time the new replica finishes building the indexes for the capped collection, the collection's data has already been washed out by new data, so the new replica reports a CappedPositionLost error and the initial sync fails.

How to run the test:
1. Install Docker. It is required to throttle the new replica's MongoDB instance so that it builds indexes long enough for the data in the capped collection to be washed out by new data.
2. Put test.sh in a new folder.
3. cd into the new folder.
4. Run ./test.sh. All the logs and diagnostic.data are captured automatically and put into the current folder. test.log is a copy of the console output. m*.log are the logs of the MongoDB instances; m3.log is the log of the new replica that failed to join the replica set. The logs of the new replica are also printed to stdout.

test.sh has been tested on a MacBook Pro 15" 2016 with an Intel Core i7 and on a Dell notebook with an Intel Core i5, on Docker v17. If you have a slow CPU, please increase the DOCKER_NEW_REPLICA_CPUS parameter; otherwise, decrease it.
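
For readers who want the gist without running Docker, here is a rough sketch (in Python/pymongo, not taken from test.sh; all names, sizes, and timings are illustrative) of the workload half of the scenario: a capped collection with a secondary index plus a fast writer on the sync source. The replica-set setup and CPU throttling that test.sh automates are only noted in comments:

    import threading
    import time
    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)  # assumed to be the sync source
    db = client.repro

    db.drop_collection("capped_with_index")
    db.create_collection("capped_with_index", capped=True, size=64 * 1024 * 1024)
    coll = db.capped_with_index
    coll.create_index("ts")  # the secondary index the new replica must build

    stop = threading.Event()

    def writer():
        # Insert as fast as possible so the capped collection wraps while
        # the new member is still cloning data and building indexes.
        seq = 0
        while not stop.is_set():
            coll.insert_many([{"ts": time.time(), "seq": seq + j, "pad": "x" * 4096}
                              for j in range(100)])
            seq += 100

    t = threading.Thread(target=writer)
    t.start()

    # Here test.sh adds the CPU-throttled third member (m3) to the replica
    # set; its initial sync should eventually fail with CappedPositionLost
    # once the collection wraps past the cloner's cursor position.
    time.sleep(120)
    stop.set()
    t.join()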

Comment by Mark Agarunov [ 22/Jan/18 ]

Hello plsfixmymongo,

Thank you for the report. To get a better idea of what may be causing the CappedPositionLost error, could you please provide the following:

  • The complete log files from all affected mongod nodes.
  • An archive (tar or zip) of the $dbpath/diagnostic.data directory from all affected mongod nodes.

This should give us some insight into this behavior.
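
For the second item, one way to produce the archive is sketched below (Python standard library; the dbpath shown is an assumption, adjust it to the node's actual $dbpath):

    import tarfile

    dbpath = "/var/lib/mongodb"  # replace with the node's real $dbpath
    with tarfile.open("diagnostic_data.tar.gz", "w:gz") as tar:
        tar.add(f"{dbpath}/diagnostic.data", arcname="diagnostic.data")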

Thanks,
Mark

Comment by Sergey [ 22/Jan/18 ]

The capped collections have secondary indexes.

Comment by Sergey [ 22/Jan/18 ]

This issue is probably related to TOOLS-1636.
