[SERVER-42078] mongodb is unable to replay the oplog and restarts the initial-sync Created: 03/Jul/19  Updated: 02/Apr/21  Resolved: 16/Jul/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.6.13
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kay Agahd Assignee: Eric Sedor
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File mongo-echo01-01_iostat_last7days.png     PNG File mongo-echo01-01_opcounters_last7days.png     PNG File mongo-echo01-02_iostat_last7days.png     PNG File mongo-echo01-02_opcounters_last7days.png     PNG File v3.4_iostat.png     PNG File v3.4_opcounters.png     PNG File v3.6_iostat.png     PNG File v3.6_opcounters.png    
Issue Links:
Related
is related to SERVER-18721 Initial-sync can exhaust available po... Closed
Operating System: ALL
Participants:

 Description   

We are unable to add a new replica to an existing replica set because the initial sync restarts after cloning is done.
The log says that the new member can no longer connect to the sync source, so it drops all databases and restarts the initial sync by connecting to the same sync source that had been declared "unavailable" just a few seconds earlier.

Here are some related log snippets from the new replica. I can send the complete log files from both the sync source and the new replica to a confidential destination.

# Cloning is finished:
2019-07-03T05:25:09.689+0200 I REPL     [repl writer worker 0] Finished cloning data: OK. Beginning oplog replay.
2019-07-03T05:25:09.690+0200 I REPL     [replication-250] Applying operations until { : Timestamp(1562124309, 5) } before initial sync can complete. (starting at { : Timestamp(1562098892, 137) })
2019-07-03T05:25:09.802+0200 I STORAGE  [replication-260] createCollection: local.replset.oplogTruncateAfterPoint with generated UUID: 8488b127-0f54-4294-b9a1-d7ec9acafd48
 
# many of these "update of non-mod failed", "Fetching missing document" and "Missing document not found on source" follow:
2019-07-03T05:25:11.508+0200 E REPL     [repl writer worker 5] update of non-mod failed: { ts: Timestamp(1562098895, 15), t: 23, h: -7086754239136309992, v: 2, op: "u", <CUT-OUT> }
2019-07-03T05:25:11.508+0200 I REPL     [repl writer worker 5] Fetching missing document: { ts: Timestamp(1562098895, 15), t: 23, h: -7086754239136309992, v: 2, op: "u", <CUT-OUT> }
2019-07-03T05:25:11.510+0200 I REPL     [repl writer worker 5] Missing document not found on source; presumably deleted later in oplog. o first field: _id, o2: { _id: { offerKey: "aebd8b896c20b361df37a4e0d2a51200", listIdentifier: "5621756", siteId: 1 } }
 
# a few seconds later it "Can no longer connect to initial sync source":
2019-07-03T05:25:57.079+0200 E REPL     [replication-259] Failed to apply batch due to 'Location15916: error applying batch :: caused by :: Can no longer connect to initial sync source: mongo-echo01-01.db00.tst06.eu.idealo.com:27017'
2019-07-03T05:25:57.098+0200 I ACCESS   [conn1480] Successfully authenticated as principal nagios on admin from client 10.113.1.242:40168
2019-07-03T05:25:57.101+0200 I NETWORK  [conn1480] end connection 10.113.1.242:40168 (9 connections now open)
2019-07-03T05:25:57.111+0200 I REPL     [replication-250] Finished fetching oplog during initial sync: CallbackCanceled: oplog fetcher shutting down. Last fetched optime and hash: { ts: Timestamp(1562124342, 998), t: 23 }[8473475730077512578]
2019-07-03T05:25:57.111+0200 I REPL     [replication-250] Initial sync attempt finishing up.
 
# ...and drops all databases:
2019-07-03T05:25:57.112+0200 E REPL     [replication-250] Initial sync attempt failed -- attempts left: 9 cause: Location15916: error applying batch :: caused by :: Can no longer connect to initial sync source: mongo-echo01-01.db00.tst06.eu.idealo.com:27017
2019-07-03T05:25:58.112+0200 I REPL     [replication-259] Starting initial sync (attempt 2 of 10)
2019-07-03T05:25:58.112+0200 I STORAGE  [replication-259] Finishing collection drop for local.temp_oplog_buffer (no UUID).
2019-07-03T05:25:58.904+0200 I STORAGE  [replication-259] createCollection: local.temp_oplog_buffer with no UUID.
2019-07-03T05:25:58.910+0200 I REPL     [replication-259] sync source candidate: mongo-echo01-01.db00.tst06.eu.idealo.com:27017
2019-07-03T05:25:59.022+0200 I STORAGE  [replication-259] dropAllDatabasesExceptLocal 44
 
# ...and connects anew to the initial sync source within a few milliseconds:
2019-07-03T05:26:16.025+0200 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to mongo-echo01-01.db00.tst06.eu.idealo.com:27017, took 35ms (3 connections now open to mongo-echo01-01.db00.tst06.eu.idealo.com:27017)

Some more details:
Both servers (sync source and new replica) have identical hardware, OS and configuration:

  • 56 CPUs
  • 128 GB RAM
  • 4.5 TB HD
  • Linux 64 bit, stretch 9.9, xfs
  • mongodb v3.6.13, WiredTiger

Until a few days ago, both servers were running successfully in a replica set with Linux Jessie, mongodb v3.4 and ext4. We then wanted to upgrade them, which was successful for the first server, which is now Primary. The oplog sizes of both servers have been increased from 50 GB to 200 GB in order to increase the oplog window, which now varies from 30 to 45 hours. The cloning process takes roughly 8 hours.
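For reference, on v3.6 with WiredTiger the oplog can be resized online; the resize was done roughly along these lines (illustrative only, assuming a local connection on the default port and omitting authentication; replSetResizeOplog takes the new size in megabytes, 200 GB ≈ 204800 MB):

# resize the oplog to ~200 GB (size is given in MB)
mongo --port 27017 --eval 'db.adminCommand({ replSetResizeOplog: 1, size: 204800 })'
# verify the new maximum size (reported in bytes)
mongo --port 27017 --eval 'db.getSiblingDB("local").oplog.rs.stats().maxSize'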

We thought we had hit the same bug described in SERVER-18721, so we set "net.ipv4.tcp_tw_reuse = 1". However, we did not observe an increase in TIME_WAIT connections, so our issue seems to have a different cause.
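For completeness, the setting was applied in the usual sysctl way, roughly as follows (the file name below is arbitrary; adjust as needed):

# apply at runtime
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
# persist across reboots (file name is arbitrary)
echo 'net.ipv4.tcp_tw_reuse = 1' | sudo tee /etc/sysctl.d/99-tcp-tw-reuse.conf
sudo sysctl --system
# count sockets currently in TIME_WAIT (skipping the header line)
ss -tan state time-wait | tail -n +2 | wc -l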



 Comments   
Comment by Alec Istomin [ 02/Apr/21 ]

I've run into a similar issue on 3.6.16 and was able to sync after following instructions from https://jira.mongodb.org/browse/SERVER-18721 (sudo sysctl net.ipv4.tcp_tw_reuse=1 net.ipv4.ip_local_port_range="10000 60999")

For reference, the scale of the issue:

$ grep "presumably deleted later in oplog" mongod.log |wc -l
155792
$ grep "Fetching missing document:" mongod.log |wc -l
155792
$ grep "failed to apply update:" mongod.log |wc -l
147990
$ grep "update of non-mod failed" mongod.log |wc -l
7802

Comment by Eric Sedor [ 16/Jul/19 ]

Thanks for clarifying, kay.agahd@idealo.de; we are glad to hear it.

The "Fetching missing document/update of non-mod failed/Missing document not found" messages indicate cases where documents are being obtained (or found to not exist) after the initial collection clone during the initial sync process. A delay in syncing (and/or a longer oplog) could lead to more of these messages, but by themselves they aren't typically a cause for concern.

Possibly because of the missing Primary diagnostic data, we don't have good enough evidence to either confirm or rule out a bug.

I am going to close this ticket, but if this happens again please do reach out and we will be happy to investigate further.

Comment by Kay Agahd [ 12/Jul/19 ]

Hi eric.sedor, thanks for your investigation.
The reason an initial sync was required is that we needed to upgrade the operating system from Jessie to Stretch, so the whole data partition was wiped. Therefore we have no diagnostic data from the primary from before 7/3.

If we had known that an initial sync would become so problematic, we could have taken a snapshot of the data partition before the upgrade and restored it afterwards. But we had never had such problems before (we've been using mongodb in production since v1.6), so we didn't take a snapshot.

Yes, it's true that since the replSets are back in sync we don't see a discrepancy in IO metrics.

Comment by Eric Sedor [ 11/Jul/19 ]

Hi kay.agahd@idealo.de,

I wanted to note that the upgrade instructions from 3.4 to 3.6 do not require an initial sync.

We have been most readily able to examine logs for the sync attempt on 3.6. From 2019-07-03T05:41:45 to 2019-07-03T05:42:01 on the Primary, getMore operations on the oplog began taking more than 2 seconds each. This suggests a load issue on the Primary. However, the diagnostic data provided for the Primary begins on 7/4, so we aren't able to confirm that. Do you have diagnostic data for 7/3 from the Primary at the time?
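If it helps to cross-check on your side, slow oplog reads show up in the Primary's mongod log as slow getMore entries; a rough way to count them (the exact log line format varies between versions, so treat this as a sketch):

# count getMore operations on the oplog that were logged as taking 1000 ms or more
grep 'command local.oplog.rs command: getMore' mongod.log | grep -E ' [0-9]{4,}ms$' | wc -l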

Finally, do I understand correctly from your last message that after upgrading all nodes to 4.0 you are no longer seeing a discrepancy in IO metrics?

Comment by Kay Agahd [ 09/Jul/19 ]

Files have been uploaded.
In the meantime we got the replicaSet up & running with mongodb v4.0. We then downgraded it to v3.6 and stepped down the Primary in order to downgrade it to v3.6 as well.
Some more details:
At 2019-07-07T16:28:00.928+0200 a new clone process started on mongo-echo01-02. At 2019-07-08T05:39:56.028+0200 mongo-echo01-02 switched from RECOVERING to SECONDARY status. From 8:30 the I/O load returned to a level similar to that under v3.4. From 10:30 to 12:00 "iostat utilisation" was close to 100% again but then decreased permanently to 50-60% (similar to v3.4). At about 15:57 mongo-echo01-02 (SECONDARY) was downgraded from v4.0 to v3.6 and restarted. Around 16:07 mongo-echo01-02 became the primary, so that mongo-echo01-01, which had been primary until then, could be downgraded to v3.6 as well. "iostat utilisation" is still at a low level.
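The step-down itself was done in the usual way, roughly (host shown only for illustration):

# ask the current primary to step down for 60 seconds so another member can take over
mongo --host mongo-echo01-01.db00.tst06.eu.idealo.com --port 27017 --eval 'rs.stepDown(60)'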

Interestingly, the errors documented above occurred just as often during this sync:

 TST [16:55][root@mongo-echo01-02.db00.tst06:/home/kay.agahd]# ✔ grep "Fetching missing document" shared04_v4.log.1 | wc -l
28022
 TST [16:56][root@mongo-echo01-02.db00.tst06:/home/kay.agahd]# ✔ grep "update of non-mod failed" shared04_v4.log.1 | wc -l
28005
 TST [16:56][root@mongo-echo01-02.db00.tst06:/home/kay.agahd]# ✔ grep "Missing document not found" shared04_v4.log.1 | wc -l
16708

I've uploaded the log files of the successful initial-sync process as well.

Comment by Danny Hatcher (Inactive) [ 08/Jul/19 ]

I've generated a Secure Upload Portal that only MongoDB engineers can access. Please provide the mongod logs and "diagnostic.data" folders for both the source and destination.
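For reference, something along these lines will bundle them up (the paths are illustrative defaults; substitute your actual systemLog.path and dbPath):

# hypothetical default paths; adjust to your deployment
tar czf mongo-diagnostics.tar.gz /var/log/mongodb/mongod.log* /var/lib/mongodb/diagnostic.data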

Comment by Kay Agahd [ 07/Jul/19 ]

We wonder whether mongodb v3.4 under Jessie with ext4 uses the I/O system more efficiently than mongodb v3.6/v4.0 under Stretch with xfs.
Please see the attached screenshots. They show the opcounters and iostat-utilisation of the old dbs (v3.4) and the new dbs (v3.6, and since today v4.0). The underlying hardware is still the same, although the hostnames were changed during the upgrade.
As you can see from the opcounters (the graph is stacked), the workload is roughly the same. However, the I/O system of the upgraded dbs is much more saturated than before the upgrade to v3.6. The I/O is even more saturated (almost always at 100%) since we upgraded to v4.0 (on 7/7 at 00:00).
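For context, the utilisation graphs are based on iostat-style device utilisation (%util), i.e. roughly what an extended iostat run reports:

# extended per-device statistics (including %util) every 5 seconds
iostat -x 5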

Comment by Kay Agahd [ 07/Jul/19 ]

We upgraded the servers from mongodb v3.6.13 to v4.0.10 while keeping FCV=3.6 in order to be able to downgrade easily once the new replica had successfully joined the replica set. However, the issue remains; it is the same as described above.
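For reference, the FCV was checked and pinned roughly like this (illustrative, local connection without authentication shown):

# check the current featureCompatibilityVersion
mongo --port 27017 --eval 'db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })'
# keep it at 3.6 so a binary downgrade remains possible
mongo --port 27017 --eval 'db.adminCommand({ setFeatureCompatibilityVersion: "3.6" })'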
