[SERVER-31994] rs.add caused a connection increase and service impact. Created: 16/Nov/17  Updated: 16/Nov/21  Resolved: 09/Jan/18

Status: Closed
Project: Core Server
Component/s: Admin
Affects Version/s: 3.0.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: hyeonghwa park Assignee: Mark Agarunov
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File MongoJira_31994.tar.gz     PNG File primary_2017-11-21 13.54.52.png     PNG File primary_2017-11-21 14.00.24.png     PNG File secondary1_2017-11-21 14.01.14.png     PNG File secondary1_2017-11-21 14.01.45.png     PNG File secondary2_2017-11-21 14.02.26.png     PNG File secondary2_2017-11-21 14.03.01.png     PNG File 스크린샷 2017-11-16 12.16.41.png     PNG File 스크린샷 2017-11-16 12.17.00.png    
Operating System: ALL
Participants:

 Description   

I would like to report strange behaviour here,
on MongoDB 3.0.6 with WiredTiger.

# Architecture
mongos: 3
config: 3
replica set: shard3
members: primary, secondary#1, secondary#2, new secondary#3 (dump)

I do not understand why the connection count increased and the service was impacted.

Summary
After rs.add, the connection count increased, the service was affected, and errors occurred on the app server side.

 
 # connections: primary 1.242k (avg) > 7.192k (max), secondary#1 617k (avg) > 0k
 # app server: Unexpected exception occurred: com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message
 
 
//task config
rs.add( { _id: 3, "host": "secondary#3 (dump)", "priority": 0, "hidden": true } )
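For what it's worth, the resulting member document can be sanity-checked once the reconfig propagates; a minimal sketch, not from the ticket (`primaryHost` is a placeholder), run from a shell with access to the primary:

```shell
# Sketch only: confirm member _id 3 was added with priority 0 and
# hidden true before initial sync starts.  primaryHost is a placeholder.
mongo --host primaryHost --eval '
  var m = rs.conf().members.filter(function (x) { return x._id === 3; })[0];
  printjson({ host: m.host, priority: m.priority, hidden: m.hidden });
'
```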
 
 
// new secondary#3(dump) mongod.log
2017-11-15T11:21:48.638+0900 I REPL     [rsSync] replSet warning did not receive a valid config yet, sleeping 5 seconds
2017-11-15T11:21:53.638+0900 I REPL     [rsSync] ******
2017-11-15T11:21:53.638+0900 I REPL     [rsSync] creating replication oplog of size: 50000MB...
2017-11-15T11:21:53.646+0900 I REPL     [rsSync] ******
2017-11-15T11:21:53.646+0900 I REPL     [rsSync] initial sync pending
2017-11-15T11:21:53.666+0900 I REPL     [rsSync] initial sync drop all databases
2017-11-15T11:21:53.666+0900 I REPL     [rsSync] initial sync clone all databases
2017-11-15T11:21:53.672+0900 I REPL     [rsSync] initial sync cloning db: admin
2017-11-15T11:21:53.723+0900 I REPL     [rsSync] initial sync cloning db: shop                              <<<<<<<<<<<<<<<<<<<<<<<<<<<<< connection(primary:1.242k(avg)>7.192k(max), secondary#1:617k(avg)>0k), app server( Timeout > reconnect > spike )
 
 D STORAGE  [rsSync] WT begin_transaction
 D STORAGE  [rsSync] WT commit_transaction
 D STORAGE  [rsSync] WT begin_transaction
 D STORAGE  [rsSync] WT commit_transaction
 D STORAGE  [rsSync] WT begin_transaction
 D STORAGE  [rsSync] WT commit_transaction
 D STORAGE  [rsSync] WT begin_transaction
 D STORAGE  [rsSync] WT commit_transaction
 D STORAGE  [rsSync] WT begin_transaction
 D STORAGE  [rsSync] WT commit_transaction
 D STORAGE  [rsSync] WT begin_transaction
 D STORAGE  [rsSync] WT commit_transaction
 D STORAGE  [rsSync] WT begin_transaction
 
2017-11-15T12:15:18.110+0900 I REPL     [rsSync] initial sync cloning db: test
2017-11-15T12:15:18.138+0900 I REPL     [rsSync] initial sync data copy, starting syncup
2017-11-15T12:15:18.139+0900 I REPL     [rsSync] oplog sync 1 of 3
2017-11-15T12:15:18.140+0900 D REPL     [rsSync] Applying oplog entries from Nov 15 11:21:48:1 until Nov 15 12:14:46:3
2017-11-15T12:15:18.931+0900 D REPL     [rsSync] replication batch size is 46
2017-11-15T12:15:18.965+0900 D REPL     [rsSync] replication batch size is 332
2017-11-15T12:15:19.003+0900 D REPL     [rsSync] SyncTail applied 378 entries (42714 bytes) and finished at opTime Nov 15 12:14:46:3
2017-11-15T12:15:19.003+0900 I REPL     [rsSync] oplog sync 2 of 3
2017-11-15T12:15:19.004+0900 I REPL     [rsSync] initial sync building indexes
2017-11-15T12:15:19.004+0900 I REPL     [rsSync] initial sync cloning indexes for : admin
2017-11-15T12:15:19.031+0900 I REPL     [rsSync] initial sync cloning indexes for : shop
2017-11-15T12:35:41.554+0900 I REPL     [rsSync] initial sync cloning indexes for : test
2017-11-15T12:35:41.569+0900 I REPL     [rsSync] oplog sync 3 of 3
2017-11-15T12:35:41.577+0900 D REPL     [rsSync] Applying oplog entries from Nov 15 12:14:46:3 until Nov 15 12:34:47:a
2017-11-15T12:35:41.577+0900 D REPL     [rsSync] replication batch size is 152
2017-11-15T12:35:41.884+0900 D REPL     [rsSync] SyncTail applied 152 entries (17176 bytes) and finished at opTime Nov 15 12:34:47:a
2017-11-15T12:35:41.898+0900 I REPL     [rsSync] initial sync finishing up
2017-11-15T12:35:41.898+0900 I REPL     [rsSync] replSet set minValid=5a0bb5d7:a
2017-11-15T12:35:41.919+0900 I REPL     [rsSync] initial sync done
2017-11-15T12:35:49.292+0900 D REPL     [rsSync] replication batch size is 1
2017-11-15T12:35:49.294+0900 D REPL     [rsSync] replication batch size is 6
2017-11-15T12:36:47.109+0900 D REPL     [rsSync] replication batch size is 1
2017-11-15T12:36:47.111+0900 D REPL     [rsSync] replication batch size is 7
2017-11-15T12:37:47.121+0900 D REPL     [rsSync] replication batch size is 1
2017-11-15T12:37:47.124+0900 D REPL     [rsSync] replication batch size is 7
2017-11-15T12:37:47.126+0900 D REPL     [rsSync] replication batch size is 2
2017-11-15T12:38:47.139+0900 D REPL     [rsSync] replication batch size is 1
2017-11-15T12:38:47.141+0900 D REPL     [rsSync] replication batch size is 8
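As an aside, the gap between the "cloning db: shop" and "cloning db: test" lines can be confirmed directly from the log timestamps above; a minimal sketch, assuming GNU date (the two timestamps are excerpted from the log):

```shell
# Sketch only (GNU date assumed): how long did cloning the "shop" db take?
start="2017-11-15T11:21:53.723+0900"   # initial sync cloning db: shop
end="2017-11-15T12:15:18.110+0900"     # initial sync cloning db: test
# strip fractional seconds, which some date builds reject
s=$(date -d "$(echo "$start" | sed 's/\.[0-9]*//')" +%s)
e=$(date -d "$(echo "$end" | sed 's/\.[0-9]*//')" +%s)
echo "$(( e - s )) seconds"
# prints: 3205 seconds (about 53 minutes spent cloning "shop")
```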

Thanks for any feedback,
park hyeonghwa



 Comments   
Comment by Mark Agarunov [ 09/Jan/18 ]

Hello parkhyeonghwa,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Thanks,
Mark

Comment by Mark Agarunov [ 27/Dec/17 ]

Hello parkhyeonghwa,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please provide the diagnostic data?

Thanks,
Mark

Comment by Mark Agarunov [ 28/Nov/17 ]

Hello parkhyeonghwa,

Thank you for the additional information. I suspect the SocketException errors are likely unrelated to this issue, however the diagnostic data should confirm that. I've put this ticket into "Waiting for User Input" until you have a chance to collect the data.

Thanks,
Mark

Comment by hyeonghwa park [ 22/Nov/17 ]

Hi Mark,
Okay, got it. Thanks.
(I will attach the data when it is ready.)

Also, after changing the log level, the following SocketException entries occurred.
Is this a problem?

// change logLevel from 0 to 2
db.runCommand( { setParameter: 1, logLevel: 2 } )
 
 
2017-11-22T08:22:25.843+0900 D NETWORK  [conn6781824] SocketException: remote: 10.10.100.180:43887 error: 9001 socket exception [CLOSED] server [10.10.100.180:43887]
2017-11-22T08:22:30.681+0900 D NETWORK  [conn6781829] SocketException: remote: 10.100.100.199:47384 error: 9001 socket exception [CLOSED] server [10.100.100.199:47384]
2017-11-22T08:22:37.613+0900 D NETWORK  [conn6781826] SocketException: remote: 10.10.10.192:56180 error: 9001 socket exception [CLOSED] server [10.10.10.192:56180]
2017-11-22T08:22:48.580+0900 D NETWORK  [conn6781827] SocketException: remote: 10.10.100.176:52514 error: 9001 socket exception [CLOSED] server [10.10.100.176:52514]
2017-11-22T08:22:55.865+0900 D NETWORK  [conn6781828] SocketException: remote: 10.10.100.180:43891 error: 9001 socket exception [CLOSED] server [10.10.100.180:43891]
2017-11-22T08:23:00.681+0900 D NETWORK  [conn6781833] SocketException: remote: 10.100.100.199:47872 error: 9001 socket exception [CLOSED] server [10.100.100.199:47872]
2017-11-22T08:23:07.633+0900 D NETWORK  [conn6781830] SocketException: remote: 10.10.10.192:56186 error: 9001 socket exception [CLOSED] server [10.10.10.192:56186]
2017-11-22T08:23:18.600+0900 D NETWORK  [conn6781831] SocketException: remote: 10.10.100.176:52518 error: 9001 socket exception [CLOSED] server [10.10.100.176:52518]
2017-11-22T08:23:25.886+0900 D NETWORK  [conn6781832] SocketException: remote: 10.10.100.180:43894 error: 9001 socket exception [CLOSED] server [10.10.100.180:43894]
2017-11-22T08:23:30.684+0900 D NETWORK  [conn6781837] SocketException: remote: 10.100.100.199:48366 error: 9001 socket exception [CLOSED] server [10.100.100.199:48366]
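These repeating CLOSED exceptions can be tallied per client IP to see which app servers are cycling connections; a minimal sketch, assuming the log above is saved as mongod.log (a hypothetical path):

```shell
# Sketch only: count CLOSED socket exceptions per remote client IP.
# $7 is the "ip:port" field in lines like the ones above; mongod.log
# is a hypothetical path.
awk '/SocketException: remote:/ {
    split($7, a, ":")
    count[a[1]]++
}
END { for (ip in count) print ip, count[ip] }' mongod.log | sort
```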

Thanks,
hyeongwa

Comment by Mark Agarunov [ 21/Nov/17 ]

Hello parkhyeonghwa,

Thank you for the response. The serverStatus metric collection should not have a noticeable effect on performance. Additionally, the tcmalloc:true parameter causes the tcmalloc data to be collected in addition to the data serverStatus normally collects, not instead of it. These metrics should give us a better idea of why you're seeing this behavior.

Thanks,
Mark

Comment by hyeonghwa park [ 21/Nov/17 ]

Hi Mark,
I appreciate your reply.

1. Is there any performance problem with enabling tcmalloc stats (db.serverStatus({tcmalloc:true}))?
https://jira.mongodb.org/browse/SERVER-19941 < no problem, right?
(Since the problem occurred during the rs.add operation and there is currently no problem,
I think checking the tcmalloc stats now is meaningless.)
2. I have attached the system and mongo charts from when the problem occurred.
Occurrence time: 10/24 07:30~08:00

regards,
hyeongwa

Comment by Mark Agarunov [ 20/Nov/17 ]

Hello parkhyeonghwa,

Thank you for providing these logs. Unfortunately I'm not seeing anything concrete as to the reason, and will likely need the diagnostic data to diagnose this issue. To collect this data, please run the following while you are experiencing the issue, for at least 15 minutes (longer is better however), and upload the generated ss.log and iostat.log files:

delay=1
mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:true}))); sleep(1000*${delay:?})}" >ss.log &
iostat -k -t -x ${delay:?} >iostat.log &

Thanks,
Mark

Comment by hyeonghwa park [ 20/Nov/17 ]

Hi Mark,
Sorry for the late reply.

Please check the attached file.

Regards,
hyeongwa

Comment by Mark Agarunov [ 17/Nov/17 ]

Hello parkhyeonghwa,

Thank you for the additional information. My apologies, the lack of diagnostic data in 3.0 slipped my mind in my response. To collect this data, please run the following while you are experiencing the issue, for at least 15 minutes (longer is better however), and upload the generated ss.log and iostat.log files:

delay=1
mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:true}))); sleep(1000*${delay:?})}" >ss.log &
iostat -k -t -x ${delay:?} >iostat.log &

Additionally, I'm unfortunately not seeing the log files; could you please attach those to the ticket as well?

Thanks,
Mark

Comment by hyeonghwa park [ 17/Nov/17 ]

Hi Mark,
I appreciate your feedback.

I wanted to add a secondary for dumps to each shard,
but a shard problem occurred (only shard2 caused problems),
and when I ran the same test again, the same problem did not recur.

  //shard1
  rs.add( { _id: 3, "host": "secondary1(dump)", "priority": 0, "hidden": true } )
  //shard2
  rs.add( { _id: 3, "host": "secondary2(dump)", "priority": 0, "hidden": true } )
  //shard3
  rs.add( { _id: 3, "host": "secondary3(dump)", "priority": 0, "hidden": true } )

Please check the following log files.

1)
#mongos,config,mongod log
  time: 07:00~07:59:00
 
#mongos,config
  //grep -v Successfully authenticated
  mongos1.log
  mongos2.log
  mongos3.log
  config.log.1
  config.log.2
  config.log.3
#mongod
  //grep -v Successfully authenticated
  primary.log
  secondary1.log
  secondary2.log
  secondary3.log
  //SyncSourceFeedback error sending update
  I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update, response: { ok: 0.0, errmsg: "Received replSetUpdatePosition for node with memberId 4 whose config version of 10 doesn't match our config version of 9", code: 93 }

# dbpath/diagnostic.data
Not available: diagnosticDataCollectionEnabled is new in version 3.2, and we are currently on 3.0.6, so there is no diagnostic data to attach.

Regards,
hyeongwa

Comment by Mark Agarunov [ 16/Nov/17 ]

Hello parkhyeonghwa,

Thank you for the report. To get a better idea of what may be causing this, could you please provide the following:

  • The complete log files from all of the affected mongod,mongos, and config server nodes
  • An archive (tar or zip) of the $dbpath/diagnostic.data directory from all affected mongod nodes.

This should give some insight into why you are experiencing this behavior.

Thanks,
Mark

Generated at Thu Feb 08 04:28:50 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.