[SERVER-35738] Mongodb bad performance causes heartbeat loss Created: 22/Jun/18  Updated: 04/Sep/18  Resolved: 01/Aug/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Zihang Cui Assignee: Nick Brewer
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File 2018-06-17T04-primary.log     Text File 2018-06-17T04-secondary-with-priority-0.log     Text File 2018-06-17T04-secondary.log     Zip Archive backup.zip     Zip Archive primary.zip     Zip Archive secondary.zip    
Operating System: ALL
Steps To Reproduce:

We don't know.

Participants:

 Description   

The secondary became primary through an election when it could not find the primary.

Before this, many time-consuming operations happened on the primary node, as can be seen in the attached files.

Primary 10.0.208.149, important log entries:
2018-06-17T04:06:44.004 end connection 10.0.189.100:42238 (2829 connections now open)
2018-06-17T04:06:45.770 end connection 10.0.189.100:36520 (2827 connections now open)
2018-06-17T04:06:51.82 [ReplicationExecutor] Error in heartbeat request to 10.0.189.100:27017; ExceededTimeLimit: Operation timed out

2018-06-17T04:06:58.242 connection accepted from 10.0.189.100:39064 #206971
2018-06-17T04:06:58.242 [NetworkInterfaceASIO-Replication-0] Successfully connected to 10.0.189.100:27017

2018-06-17T04:06:59.888 [ReplicationExecutor] Member 10.0.189.100:27017 is now in state SECONDARY
2018-06-17T04:06:59.890 [initandlisten] connection accepted from 10.0.189.100:39066 #2069721 (2831 connections now open)
2018-06-17T04:06:59.890 end connection 10.0.189.100:39065 (2830 connections now open)

Secondary 10.0.189.100, important log entries:

2018-06-17T04:06:41.460 [ReplicationExecutor] could not find member to sync from

2018-06-17T04:06:58.239+0000 [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2018-06-17T04:06:58.240+0000 I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected

2018-06-17T04:06:58.243+0000 I REPL [ReplicationExecutor] VoteRequester: Got no vote from 10.0.208.149:27017 because: candidate's data is staler than mine, resp:

{ term: 95, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
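For reference, the 10000ms window in the "Starting an election" message corresponds to the replica set's configured election timeout. A quick way to check the configured timeouts is sketched below (an assumption-laden example, not from the original report: the host shown is the primary from the logs above, and the legacy 3.2 mongo shell is assumed):

# Print the replica set "settings" document, which holds electionTimeoutMillis
# (protocolVersion 1) or heartbeatTimeoutSecs (protocolVersion 0):
mongo --host 10.0.208.149 --port 27017 --eval 'printjson(rs.conf().settings)'

# Show each member's health, state, and last-heartbeat timestamps as seen by this node:
mongo --host 10.0.208.149 --port 27017 --eval 'printjson(rs.status().members)'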



 Comments   
Comment by Zihang Cui [ 02/Aug/18 ]

I checked the current and available connections in the mongo shell via:

db.serverStatus().connections

/* 1 */ { "current" : 3664, "available" : 47536, "totalCreated" : NumberLong(2378823) }

 

Comment by Nick Brewer [ 01/Aug/18 ]

cuizihang It's difficult to compare the diagnostic data against the logs you've provided, as the dates are much older (from when this ticket was first opened). That said, I am seeing multiple entries similar to this in the primary logs:

SocketException handling request, closing client connection: 9001 socket exception 

There are a few potential causes for this message, but I would start by looking at your operating system limits to see if they are configured according to our production notes. It is possible that your ulimit settings for files are too low.
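For example, the effective limits of the running mongod process can be inspected with standard Linux tools (a sketch only, assuming a single mongod process on the host):

# "Max open files" is the limit that bounds the number of connections and file handles:
cat /proc/$(pidof mongod)/limits | grep "Max open files"

# Soft open-files limit of the shell that launches mongod:
ulimit -n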

From your responses, I do not see anything to indicate a bug in the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag. A question like this involving more discussion would be best posted on the mongodb-user group.

-Nick

Comment by Zihang Cui [ 26/Jul/18 ]

Would you please archive (tar or zip) the $dbpath/diagnostic.data directory from each node in the replica set, and attach them to this ticket?

--------------------------------------------------------------------------------------------
I have uploaded the three archives mentioned above: primary.zip, secondary.zip, and backup.zip.

I hope they lead to some new discoveries.

Thank you.

Comment by Zihang Cui [ 26/Jul/18 ]

Secondary became primary again.

Log from the old primary, 10.0.208.149 (now the secondary after stepping down):
2018-07-26T04:01:00.023+0000 I REPL [ReplicationExecutor] Error in heartbeat request to 10.0.189.100:27017; ExceededTimeLimit: Operation timed out
2018-07-26T04:01:04.254+0000 I REPL [ReplicationExecutor] stepping down from primary, because a new term has begun: 98
2018-07-26T04:01:04.322+0000 I REPL [replExecDBWorker-1] transition to SECONDARY
2018-07-26T04:01:05.413+0000 I REPL [ReplicationExecutor] Member 10.0.189.100:27017 is now in state PRIMARY

Comment by Nick Brewer [ 25/Jun/18 ]

cuizihang Sure - we'll await your findings.

Regards,
Nick

Comment by Zihang Cui [ 25/Jun/18 ]

$> ls /data/db_dir/mongodb/diagnostic.data/ -lrth
total 105M
-rw-r--r-- 1 mongodb mongodb 10M Jun 20 17:57 metrics.2018-06-20T04-33-26Z-00000
-rw-r--r-- 1 mongodb mongodb 10M Jun 21 06:30 metrics.2018-06-20T17-57-02Z-00000
-rw-r--r-- 1 mongodb mongodb 10M Jun 21 17:01 metrics.2018-06-21T06-30-18Z-00000
-rw-r--r-- 1 mongodb mongodb 10M Jun 22 03:47 metrics.2018-06-21T17-01-49Z-00000
-rw-r--r-- 1 mongodb mongodb 10M Jun 22 14:32 metrics.2018-06-22T03-47-12Z-00000
-rw-r--r-- 1 mongodb mongodb 10M Jun 23 01:27 metrics.2018-06-22T14-32-26Z-00000
-rw-r--r-- 1 mongodb mongodb 10M Jun 23 12:13 metrics.2018-06-23T01-27-54Z-00000
-rw-r--r-- 1 mongodb mongodb 10M Jun 23 23:09 metrics.2018-06-23T12-13-41Z-00000
-rw-r--r-- 1 mongodb mongodb 10M Jun 24 10:10 metrics.2018-06-23T23-09-26Z-00000
-rw-r--r-- 1 mongodb mongodb 10M Jun 24 21:05 metrics.2018-06-24T10-10-00Z-00000
-rw-r--r-- 1 mongodb mongodb 4.8M Jun 25 02:20 metrics.2018-06-24T21-05-35Z-00000

Sorry, the files from Jun 17 had already been deleted by the mongod daemon. We will attach them to this ticket when the accident happens again. Would you please keep this Jira in "WAITING FOR USER"?
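If it would help keep older metrics.* files around, the diagnostic.data retention size can be raised; the sketch below is an assumption (that the FTDC size parameter is adjustable at runtime on 3.2, and the 512 MB value is an arbitrary example):

# Current FTDC directory size limit (defaults to roughly 100 MB):
mongo --eval 'printjson(db.adminCommand({getParameter: 1, diagnosticDataCollectionDirectorySizeMB: 1}))'

# Raise the limit so older metrics files survive until they can be attached:
mongo --eval 'printjson(db.adminCommand({setParameter: 1, diagnosticDataCollectionDirectorySizeMB: 512}))'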

Thank you.

Comment by Nick Brewer [ 22/Jun/18 ]

cuizihang

Would you please archive (tar or zip) the $dbpath/diagnostic.data directory from each node in the replica set, and attach them to this ticket?
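For example, something like the following on each member would do it (a sketch; the dbpath is taken from the directory listing elsewhere in this ticket and may differ per node):

# Creates <hostname>-diagnostic.data.tar.gz in the current directory:
tar czf "$(hostname)-diagnostic.data.tar.gz" -C /data/db_dir/mongodb diagnostic.data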

Thank you,
Nick

Comment by Zihang Cui [ 22/Jun/18 ]

"Mongodb bad performance causes heartbeat lose" in the title of this issue is our guess to cause election, but we can't be sure.
