[SERVER-42078] mongodb is unable to replay the oplog and restarts the initial-sync Created: 03/Jul/19 Updated: 02/Apr/21 Resolved: 16/Jul/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.6.13 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kay Agahd | Assignee: | Eric Sedor |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL | ||||||||
| Participants: | |||||||||
| Description |
|
We are unable to add a new replica to an existing replica set because the initial sync restarts after cloning is done. Here are some related log snippets from the new replica. I can send the complete log files from both the sync source and the new replica to a confidential destination.
Some more details:
Until a few days ago, both servers were running successfully in a replica set on Linux (Jessie) with mongodb v3.4 and ext4. We then wanted to upgrade them, which was successful for the first server, which is now Primary. The oplog size on both servers was increased from 50 GB to 200 GB in order to widen the oplog window, which now varies from 30 to 45 hours. The cloning process takes roughly 8 hours. We thought we hit the same bug as described in |
| Comments |
| Comment by Alec Istomin [ 02/Apr/21 ] | ||||||||
|
I've run into a similar issue on 3.6.16 and was able to sync after following instructions from https://jira.mongodb.org/browse/SERVER-18721 (sudo sysctl net.ipv4.tcp_tw_reuse=1 net.ipv4.ip_local_port_range="10000 60999") The scope for the reference:
| ||||||||
| Comment by Eric Sedor [ 16/Jul/19 ] | ||||||||
|
Thanks for clarifying, kay.agahd@idealo.de, we are glad to hear. The "Fetching missing document/update of non-mod failed/Missing document not found" messages indicate cases where documents are being obtained (or found to not exist) after the initial collection clone during the initial sync process. A delay in syncing (and/or a longer oplog) could lead to more of these messages, but by themselves they aren't typically a cause for concern. Possibly because of the missing Primary diagnostic data, we don't have good evidence that could confirm or rule out a bug. I am going to close this ticket, but if this happens again please do reach out and we will be happy to investigate further. | ||||||||
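For anyone wanting to check how often these messages appear in their own logs, here is a minimal sketch that counts the three message fragments quoted in the comment above. Only those three substrings come from this ticket. The function name, the sample lines, and their layout are hypothetical illustrations, not real mongod output.

```python
# Message fragments quoted in this ticket. Matching is by plain substring,
# so the exact layout of the surrounding log line does not matter.
PATTERNS = (
    "Fetching missing document",
    "update of non-mod failed",
    "Missing document not found",
)


def count_initial_sync_messages(lines):
    """Return a dict mapping each fragment to the number of lines containing it."""
    counts = {pattern: 0 for pattern in PATTERNS}
    for line in lines:
        for pattern in PATTERNS:
            if pattern in line:
                counts[pattern] += 1
    return counts


# Hypothetical sample lines (layout assumed, for illustration only).
sample_lines = [
    "2019-07-03T05:41:50 I REPL Fetching missing document: {...}",
    "2019-07-03T05:41:51 I REPL Fetching missing document: {...}",
    "2019-07-03T05:41:52 E REPL update of non-mod failed: {...}",
    "2019-07-03T05:41:53 I REPL Missing document not found on source",
    "2019-07-03T05:41:54 I NETWORK connection accepted",
]

message_counts = count_initial_sync_messages(sample_lines)
for pattern, count in message_counts.items():
    print(f"{count:6d}  {pattern}")
```

To run it against a real log, pass an open file handle in place of `sample_lines`. Comparing the counts between a failed and a successful sync attempt gives a rough sense of whether the messages correlate with the restart at all.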
| Comment by Kay Agahd [ 12/Jul/19 ] | ||||||||
|
Hi eric.sedor, thanks for your investigation. If we had known that an initial sync would become so problematic, we could have taken a snapshot of the data partition before the upgrade and restored it afterwards. But we haven't had such problems yet (we've been using mongodb in production since v1.6), so we didn't take a snapshot. Yes, it's true that since the replSets are back in sync we don't see a discrepancy in IO metrics. | ||||||||
| Comment by Eric Sedor [ 11/Jul/19 ] | ||||||||
|
I wanted to note that the upgrade instructions from 3.4 to 3.6 do not require an initial sync. We have been most readily able to examine logs for the sync attempt on 3.6. From 2019-07-03T05:41:45 to 2019-07-03T05:42:01 on the Primary, getMore operations on the oplog began taking more than 2 seconds each. This suggests a load issue on the Primary. However, the diagnostic data provided for the Primary begins on 7/4, so we aren't able to confirm that. Do you have diagnostic data for 7/3 from the Primary at the time? Finally, do I understand correctly from your last message that after upgrading all nodes to 4.0 you are no longer seeing a discrepancy in IO metrics? | ||||||||
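As a rough way to spot the slow oplog reads described above, the sketch below scans log lines for getMore operations on oplog.rs whose reported duration exceeds a threshold. It assumes that each slow-operation line starts with a timestamp and ends with a duration such as `2310ms`. That layout, the function name, and the sample lines are assumptions for illustration and should be checked against the actual mongod log.

```python
import re

# Assumes the duration is the last token on the line, e.g. "... 2310ms".
DURATION_RE = re.compile(r"\s(\d+)ms\s*$")


def slow_oplog_getmores(lines, threshold_ms=2000):
    """Return (timestamp, duration_ms) for oplog getMore lines above the threshold."""
    hits = []
    for line in lines:
        if "getMore" not in line or "oplog.rs" not in line:
            continue
        match = DURATION_RE.search(line)
        if match is None:
            continue
        duration_ms = int(match.group(1))
        if duration_ms > threshold_ms:
            # Assumes the timestamp is the first space-separated token.
            timestamp = line.split(" ", 1)[0]
            hits.append((timestamp, duration_ms))
    return hits


# Hypothetical sample lines (layout assumed, for illustration only).
sample_lines = [
    '2019-07-03T05:41:45.120+0200 I COMMAND [conn12] command local.oplog.rs '
    'command: getMore { getMore: 123, collection: "oplog.rs" } 2310ms',
    '2019-07-03T05:41:46.300+0200 I COMMAND [conn12] command local.oplog.rs '
    'command: getMore { getMore: 123, collection: "oplog.rs" } 150ms',
    '2019-07-03T05:41:47.900+0200 I COMMAND [conn31] command test.items '
    'command: find { find: "items" } 5000ms',
]

slow_hits = slow_oplog_getmores(sample_lines)
for timestamp, duration_ms in slow_hits:
    print(timestamp, duration_ms)
```

The 2000 ms default mirrors the "more than 2 seconds" figure mentioned above. A cluster of hits within a short time window would be consistent with load on the Primary, though diagnostic data would still be needed to confirm it.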
| Comment by Kay Agahd [ 09/Jul/19 ] | ||||||||
|
Files have been uploaded. Interestingly, the errors documented above have occurred quite as often:
I've uploaded the log files of the successful initial-sync process as well. | ||||||||
| Comment by Danny Hatcher (Inactive) [ 08/Jul/19 ] | ||||||||
|
I've generated a Secure Upload Portal that only MongoDB engineers can access. Please provide the mongod logs and "diagnostic.data" folders for both the source and destination. | ||||||||
| Comment by Kay Agahd [ 07/Jul/19 ] | ||||||||
|
We wonder whether mongodb v3.4 under Jessie with ext4 uses the I/O system more efficiently than mongodb v3.6/v4.0 under Stretch with xfs. | ||||||||
| Comment by Kay Agahd [ 07/Jul/19 ] | ||||||||
|
We upgraded the servers from mongodb v3.6.13 to v4.0.10 while keeping FCV=3.6 so that we could easily downgrade once the new replica had successfully joined the replica set. However, the issue remains. It's the same as described above. |