[SERVER-37468] Problem with initial sync when adding a member to a replica set in 3.2.21 Created: 04/Oct/18  Updated: 28/Dec/18  Resolved: 28/Dec/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Guillaume LAPOUGE Assignee: Kelsey Schubert
Resolution: Duplicate Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-38425 Oplog Visibility Query is a collectio... Closed
Related
is related to SERVER-30724 Initial sync might miss ops that were... Closed
Operating System: ALL
Participants:
Case:

 Description   

Hi,

Since I migrated a replica set to MongoDB 3.2.21, I can't add a member because the initial sync does not work.

It looks like a bug that was supposed to have been corrected in 3.2.21, but apparently it was not.

In the logs of the member I am trying to add, I see these messages:

2018-10-03T16:28:43.553+0200 I NETWORK [rsSync] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [192.168.50.22:27017]
2018-10-03T16:28:43.553+0200 I NETWORK [rsSync] DBClientCursor::init call() failed
2018-10-03T16:28:43.555+0200 E REPL [rsSync] 10276 DBClientBase::findN: transport error: sf06-b.essos.lan:27017 ns: local.oplog.rs query: { ts: { $gte: Timestamp 1538576857000|17 } }
2018-10-03T16:28:43.555+0200 E REPL [rsSync] initial sync attempt failed, 9 attempts remaining
2018-10-03T16:28:48.555+0200 I REPL [rsSync] initial sync pending
2018-10-03T16:28:48.556+0200 I REPL [ReplicationExecutor] syncing from: sf06-b.essos.lan:27017
2018-10-03T16:28:48.559+0200 I REPL [rsSync] initial sync ensuring correct oplog visibility. Starting at { ts: Timestamp 1538576928000|18, t: -1 }
2018-10-03T16:29:18.559+0200 I NETWORK [rsSync] Socket recv() timeout 192.168.50.22:27017
2018-10-03T16:29:18.559+0200 I NETWORK [rsSync] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [192.168.50.22:27017]
2018-10-03T16:29:18.559+0200 I NETWORK [rsSync] DBClientCursor::init call() failed
2018-10-03T16:29:18.560+0200 E REPL [rsSync] 10276 DBClientBase::findN: transport error: sf06-b.essos.lan:27017 ns: local.oplog.rs query: { ts: { $gte: Timestamp 1538576928000|18 } }
2018-10-03T16:29:18.560+0200 E REPL [rsSync] initial sync attempt failed, 8 attempts remaining
2018-10-03T16:29:23.561+0200 I REPL [rsSync] initial sync pending
2018-10-03T16:29:23.561+0200 I REPL [ReplicationExecutor] syncing from: sf06-b.essos.lan:27017
2018-10-03T16:29:23.566+0200 I REPL [rsSync] initial sync ensuring correct oplog visibility. Starting at { ts: Timestamp 1538576932000|7, t: -1 }
2018-10-03T16:29:53.566+0200 I NETWORK [rsSync] Socket recv() timeout 192.168.50.22:27017
2018-10-03T16:29:53.566+0200 I NETWORK [rsSync] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [192.168.50.22:27017]
2018-10-03T16:29:53.566+0200 I NETWORK [rsSync] DBClientCursor::init call() failed
2018-10-03T16:29:53.568+0200 E REPL [rsSync] 10276 DBClientBase::findN: transport error: sf06-b.essos.lan:27017 ns: local.oplog.rs query: { ts: { $gte: Timestamp 1538576932000|7 } }
2018-10-03T16:29:53.568+0200 E REPL [rsSync] initial sync attempt failed, 7 attempts remaining
2018-10-03T16:29:58.568+0200 I REPL [rsSync] initial sync pending
2018-10-03T16:29:58.568+0200 I REPL [ReplicationExecutor] syncing from: sf06-b.essos.lan:27017
2018-10-03T16:29:58.572+0200 I REPL [rsSync] initial sync ensuring correct oplog visibility. Starting at { ts: Timestamp 1538576998000|18, t: -1 }
2018-10-03T16:30:00.586+0200 E REPL [rsSync] 11602 operation was interrupted
2018-10-03T16:30:00.587+0200 E REPL [rsSync] initial sync attempt failed, 6 attempts remaining
2018-10-03T16:30:02.420+0200 I NETWORK [conn14] end connection 192.168.50.22:57160 (19 connections now open)
2018-10-03T16:30:02.601+0200 I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to sf06-b.essos.lan:27017
2018-10-03T16:30:02.601+0200 I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2018-10-03T16:30:02.602+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to sf06-b.essos.lan:27017
2018-10-03T16:30:02.603+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to sf06-b.essos.lan:27017, took 2ms (1 connections now open to sf06-b.essos.lan:27017)
2018-10-03T16:30:02.604+0200 I REPL [ReplicationExecutor] Member sf06-b.essos.lan:27017 is now in state SECONDARY
2018-10-03T16:30:04.424+0200 I REPL [ReplicationExecutor] replSetElect voting yea for sf06-a.essos.lan:27017 (4)
2018-10-03T16:30:04.603+0200 I REPL [ReplicationExecutor] Member sf06-a.essos.lan:27017 is now in state PRIMARY
2018-10-03T16:30:05.587+0200 I REPL [rsSync] initial sync pending
2018-10-03T16:30:05.587+0200 I REPL [ReplicationExecutor] syncing from: sf06-b.essos.lan:27017
2018-10-03T16:30:05.592+0200 I REPL [rsSync] initial sync ensuring correct oplog visibility. Starting at { ts: Timestamp 1538576999000|19, t: -1 }
2018-10-03T16:30:08.720+0200 I NETWORK [initandlisten] connection accepted from 192.168.50.22:57178 #21 (20 connections now open)
2018-10-03T16:30:15.617+0200 I NETWORK [initandlisten] connection accepted from 192.168.50.23:47300 #22 (21 connections now open)
^C
[root@sf06-c ~]# tail /var/log/mongodb/mongod.log -F
2018-10-03T16:30:40.745+0200 I NETWORK [initandlisten] connection accepted from 192.168.50.185:45334 #24 (23 connections now open)
2018-10-03T16:31:06.007+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:43886 #25 (24 connections now open)
2018-10-03T16:31:10.599+0200 I NETWORK [rsSync] Socket recv() timeout 192.168.50.22:27017
2018-10-03T16:31:10.599+0200 I NETWORK [rsSync] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [192.168.50.22:27017]
2018-10-03T16:31:10.599+0200 I NETWORK [rsSync] DBClientCursor::init call() failed
2018-10-03T16:31:10.600+0200 E REPL [rsSync] 10276 DBClientBase::findN: transport error: sf06-b.essos.lan:27017 ns: local.oplog.rs query: { ts: { $gte: Timestamp 1538576999000|19 } }
2018-10-03T16:31:10.600+0200 E REPL [rsSync] initial sync attempt failed, 4 attempts remaining
2018-10-03T16:31:15.600+0200 I REPL [rsSync] initial sync pending
2018-10-03T16:31:15.600+0200 I REPL [ReplicationExecutor] syncing from: sf06-a.essos.lan:27017 by request
2018-10-03T16:31:15.604+0200 I REPL [rsSync] initial sync ensuring correct oplog visibility. Starting at { ts: Timestamp 1538577047000|79, t: -1 }
2018-10-03T16:31:45.604+0200 I NETWORK [rsSync] Socket recv() timeout 192.168.50.23:27017
2018-10-03T16:31:45.604+0200 I NETWORK [rsSync] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [192.168.50.23:27017]
2018-10-03T16:31:45.604+0200 I NETWORK [rsSync] DBClientCursor::init call() failed
2018-10-03T16:31:45.606+0200 E REPL [rsSync] 10276 DBClientBase::findN: transport error: sf06-a.essos.lan:27017 ns: local.oplog.rs query: \{ ts: { $gte: Timestamp 1538577047000|79 } }
2018-10-03T16:31:45.606+0200 E REPL [rsSync] initial sync attempt failed, 3 attempts remaining
2018-10-03T16:31:50.606+0200 I REPL [rsSync] initial sync pending
2018-10-03T16:31:50.607+0200 I REPL [ReplicationExecutor] syncing from: sf06-b.essos.lan:27017
2018-10-03T16:31:50.712+0200 I REPL [rsSync] initial sync ensuring correct oplog visibility. Starting at { ts: Timestamp 1538577110000|31, t: -1 }
2018-10-03T16:32:20.712+0200 I NETWORK [rsSync] Socket recv() timeout 192.168.50.22:27017
2018-10-03T16:32:20.712+0200 I NETWORK [rsSync] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [192.168.50.22:27017]
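
For reference, the oplog visibility query that keeps timing out can be re-issued by hand from the node being added, to check whether the sync source answers it at all. This is only an illustrative sketch (host and timestamp values are copied from the log lines above; Timestamp(1538576857, 17) is the seconds/increment form of "Timestamp 1538576857000|17", and the legacy mongo shell is assumed):

$ mongo --host sf06-b.essos.lan:27017 --eval '
    // Same shape as the query in the findN transport error above
    db.getSiblingDB("local").getCollection("oplog.rs")
      .find({ ts: { $gte: Timestamp(1538576857, 17) } })
      .limit(1).toArray()
  '

If this call also hangs for ~30 seconds, the slowness is on the server side of the oplog query rather than in the network path between the nodes.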

 



 Comments   
Comment by Kelsey Schubert [ 18/Dec/18 ]

Hi all,

Given the impact of this issue and its recent introduction to the 3.2 series, we've resolved this issue under SERVER-38425 and have just published MongoDB 3.2.22, containing just this fix. MongoDB 3.2.22 is a recommended upgrade for all 3.2 users. We appreciate the impact that this issue had on each of your deployments and thank you for taking the time to provide feedback on this ticket.
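
For a yum-based installation like the one described in the comments below, the upgrade would look roughly as follows (a sketch only; the package name is assumed to follow the same mongodb-enterprise-<version> pattern used in the downgrade comment, so substitute the mongodb-org packages for Community builds):

$ sudo yum install mongodb-enterprise-3.2.22   # then restart mongod on each member, one node at a time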

Please note that MongoDB 3.2 remains EOL, and future patch releases will not be provided. Therefore, we strongly recommend that users on 3.2 upgrade to more recent and supported versions of MongoDB.

Thanks again,
Kelsey

Comment by Dave Muysson [ 27/Nov/18 ]

Add us to the list of affected users as well. We upgraded two of our large clusters to 3.2.21 recently while making our way towards 3.4 and then ultimately 3.6. We hit this while rebuilding a node in our production environment.

It is very unfortunate this case was closed. If there is no desire to investigate or fix this, I would recommend pulling 3.2.21 entirely due to this issue.

Experiencing this behaviour right when you are trying to recover a production node is very unpleasant...

We will be rolling back to 3.2.20 across the board as we do not want to risk being unable to recover a failed node in our production environment.

Comment by Hailin Hu [ 23/Oct/18 ]

I hit a similar issue and downgraded to 3.2.20, which works.

SERVER-30724 might be relevant according to the release notes.
SERVER-30724 Initial sync might miss ops that were in flight when it started.

Comment by Kelsey Schubert [ 19/Oct/18 ]

Please note that MongoDB 3.2 has reached "end of life", and we have not heard any reports of issues like this on more recent versions of MongoDB. Therefore my recommendation would be to upgrade to a more recent version of MongoDB. If you continue to encounter this issue after upgrading please let us know and we will continue to investigate.

Thank you,
Kelsey

Comment by Daniel Z. [ 11/Oct/18 ]

Hi Guillaume LAPOUGE
I had a very similar issue (almost identical), and this is a bug for sure. I downgraded MongoDB to version 3.2.19 (I chose that version somewhat at random; maybe 3.2.20 would also work in your case) and the "C" instance started replicating without any problems. Interestingly, in my setup the "A" and "B" instances are still on 3.2.21 and only the problematic "C" got 3.2.19.
 
My commands:

$ yum erase $(rpm -qa | grep mongodb)
 
$ yum install mongodb-enterprise-3.2.19

Tip. Restore your "mongod.conf" file after installation.
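
A minimal sketch of that downgrade with the configuration preserved, assuming the default config path /etc/mongod.conf and a systemd-managed mongod service:

$ sudo cp /etc/mongod.conf /root/mongod.conf.bak     # keep a copy before the erase
$ sudo yum erase $(rpm -qa | grep mongodb)
$ sudo yum install mongodb-enterprise-3.2.19
$ sudo cp /root/mongod.conf.bak /etc/mongod.conf     # restore the configuration
$ sudo systemctl start mongod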

Good Luck!

Regards,

Danny

Comment by Guillaume LAPOUGE [ 09/Oct/18 ]

Hi,

I have uploaded log files for both nodes.

In the sf06-b log file you can see that the sf06-c node isn't online; that corresponds to the time when the mongod service was started on sf06-c.

Comment by Nick Brewer [ 08/Oct/18 ]

glapouge Can you provide the corresponding logs as well? You can upload them to our secure portal if you'd prefer not to make them publicly accessible.

-Nick

Comment by Guillaume LAPOUGE [ 08/Oct/18 ]

diagnostics data files are posted.

Comment by Nick Brewer [ 08/Oct/18 ]

glapouge Thanks. To determine the cause of the slowness, we'll need a complete set of logs from both of the mongods when the failed initial sync is occurring, as well as archives (tar or zip) of the dbpath/diagnostic.data directory on each of the nodes.
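
A sketch of how those archives might be produced (the dbpath is assumed here to be the RPM default /var/lib/mongo; use the storage.dbPath value from mongod.conf if it differs):

$ tar -czf sf06-c-diagnostic.tar.gz -C /var/lib/mongo diagnostic.data    # repeat on each node
$ tar -czf sf06-c-mongod-log.tar.gz /var/log/mongodb/mongod.log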

-Nick

Comment by Guillaume LAPOUGE [ 08/Oct/18 ]

When testing with mtr between nodes B and C, I did not detect any network performance issue.

 

HOST: sf06-b.essos.lan Loss% Snt Last Avg Best Wrst StDev
1. sf06-c.essos.lan 0.0% 10 0.3 0.2 0.2 0.3 0.0

HOST: sf06-c.essos.lan Loss% Snt Last Avg Best Wrst StDev
1.|-- sf06-b.essos.lan 0.0% 10 0.2 0.2 0.2 0.3 0.0

 

Comment by Nick Brewer [ 05/Oct/18 ]

glapouge The connection is initially successful, but it appears to be quite slow:

2018-10-05T10:03:11.623+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to sf06-c.essos.lan:27017, took 2023ms (1 connections now open to sf06-c.essos.lan:27017)

Given that these machines appear to be on the same local network, a connection time of 2023ms seems unusual. You can use a tool like mtr to determine the average response time from the affected node to the primary and secondary.
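
For example (a sketch; 10 probe cycles in report mode, matching the 10-packet reports posted in the 08/Oct/18 comment above):

$ mtr --report --report-cycles 10 sf06-b.essos.lan    # run from the affected node towards each sync source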

-Nick

Comment by Guillaume LAPOUGE [ 05/Oct/18 ]

I also tried changing the sync source in the replica set from B (secondary) to A (primary); I have the same issue.
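
The "syncing from: sf06-a.essos.lan:27017 by request" line in the log above is consistent with a sync-source override; a sketch of how that is typically requested from the mongo shell on the node performing the initial sync (node C):

$ mongo --eval 'db.adminCommand({ replSetSyncFrom: "sf06-a.essos.lan:27017" })'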

Comment by Guillaume LAPOUGE [ 05/Oct/18 ]

The TCP port 27017 is fully open.

We already have the A and B nodes in this replica set, and replication works perfectly between these nodes.

Node C needs to perform an initial sync.

The first log in the bug description is from node C.

Node C is trying to synchronize with node B.
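
For context, the member-add step that triggers this initial sync is run against the current primary, roughly like this (a sketch; hostnames and port taken from this ticket):

$ mongo --host sf06-a.essos.lan:27017 --eval 'rs.add("sf06-c.essos.lan:27017")'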

This is an extract from node B:

 

2018-10-05T10:03:09.600+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to sf06-c.essos.lan:27017
2018-10-05T10:03:11.043+0200 I SHARDING [LockPinger] cluster sgridfscfg01.essos.lan:27019,sgridfscfg02.essos.lan:27019,sgridfscfg03.essos.lan:27019 pinged successfully at 2018-10-05T10:03:10.796+0200 by distributed lock pinger 'sgridfscfg01.essos.lan:27019,sgridfscfg02.essos.lan:27019,sgridfscfg03.essos.lan:27019/sf06-b.essos.lan:27017:1538576543:738138034', sleeping for 30000ms
2018-10-05T10:03:11.623+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to sf06-c.essos.lan:27017, took 2023ms (1 connections now open to sf06-c.essos.lan:27017)
2018-10-05T10:03:11.623+0200 I REPL [ReplicationExecutor] Member sf06-c.essos.lan:27017 is now in state STARTUP
2018-10-05T10:03:11.625+0200 I NETWORK [initandlisten] connection accepted from 192.168.50.65:60814 #730 (24 connections now open)
2018-10-05T10:03:11.634+0200 I NETWORK [ReplicaSetMonitorWatcher] Successfully connected to sf06-c.essos.lan:27017 (1 connections now open to sf06-c.essos.lan:27017 with a 5 second timeout)
2018-10-05T10:03:11.635+0200 I NETWORK [initandlisten] connection accepted from 192.168.50.65:60818 #731 (25 connections now open)
2018-10-05T10:03:11.636+0200 I NETWORK [conn731] end connection 192.168.50.65:60818 (24 connections now open)
2018-10-05T10:03:11.644+0200 I NETWORK [initandlisten] connection accepted from 192.168.50.65:60820 #732 (25 connections now open)
2018-10-05T10:03:11.653+0200 I NETWORK [conn730] end connection 192.168.50.65:60814 (24 connections now open)
2018-10-05T10:03:13.624+0200 I REPL [ReplicationExecutor] Member sf06-c.essos.lan:27017 is now in state STARTUP2
2018-10-05T10:03:13.744+0200 I NETWORK [initandlisten] connection accepted from 192.168.50.65:60822 #733 (25 connections now open)
2018-10-05T10:03:41.325+0200 I SHARDING [LockPinger] cluster sgridfscfg01.essos.lan:27019,sgridfscfg02.essos.lan:27019,sgridfscfg03.essos.lan:27019 pinged successfully at 2018-10-05T10:03:41.043+0200 by distributed lock pinger 'sgridfscfg01.essos.lan:27019,sgridfscfg02.essos.lan:27019,sgridfscfg03.essos.lan:27019/sf06-b.essos.lan:27017:1538576543:738138034', sleeping for 30000ms
2018-10-05T10:03:49.107+0200 I NETWORK [initandlisten] connection accepted from 192.168.50.65:60824 #734 (26 connections now open)
2018-10-05T10:04:11.606+0200 I SHARDING [LockPinger] cluster sgridfscfg01.essos.lan:27019,sgridfscfg02.essos.lan:27019,sgridfscfg03.essos.lan:27019 pinged successfully at 2018-10-05T10:04:11.325+0200 by distributed lock pinger 'sgridfscfg01.essos.lan:27019,sgridfscfg02.essos.lan:27019,sgridfscfg03.essos.lan:27019/sf06-b.essos.lan:27017:1538576543:738138034', sleeping for 30000ms
2018-10-05T10:04:24.114+0200 I NETWORK [initandlisten] connection accepted from 192.168.50.65:60826 #735 (27 connections now open)
2018-10-05T10:04:41.886+0200 I SHARDING [LockPinger] cluster sgridfscfg01.essos.lan:27019,sgridfscfg02.essos.lan:27019,sgridfscfg03.essos.lan:27019 pinged successfully at 2018-10-05T10:04:41.606+0200 by distributed lock pinger 'sgridfscfg01.essos.lan:27019,sgridfscfg02.essos.lan:27019,sgridfscfg03.essos.lan:27019/sf06-b.essos.lan:27017:1538576543:738138034', sleeping for 30000ms

 

Node C was correctly detected.

 

 

 

 

Comment by Nick Brewer [ 04/Oct/18 ]

glapouge This is often the result of a connectivity issue - have you tested connectivity between the primary and this node?

It would be useful to see the mongod logs from both the primary and the node you were attempting to add. If you'd prefer to keep this information private, we can generate a secure portal for you to upload it to.

-Nick
