[SERVER-30652] PRIMARY switching over to SECONDARY frequently Created: 14/Aug/17  Updated: 09/Feb/18  Resolved: 18/Jan/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.4.4
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Tanveer Madan Marate Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

 Description   

Hi,
We deployed a 3 node replica set (1 PRIMARY, 1 SECONDARY and 1 ARBITER) for POC purposes.
When trying to load around 100K collections into the database, the SECONDARY could not keep up with the load, fell out of sync, and shut down.
The load continued since the PRIMARY was still available, but the PRIMARY then crashed with the symptoms below.
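For context, a minimal sketch of how a PRIMARY-SECONDARY-ARBITER set like this could be initiated from the mongo shell; xsj-db1, xsj-db2 and port 27030 are taken from the log excerpts below, while the set name and the first host are assumptions:

// Run once against the node intended to become PRIMARY.
rs.initiate({
  _id: "rs0",  // assumed replica set name
  members: [
    { _id: 0, host: "xsj-db0:27030" },                    // hypothetical PRIMARY host
    { _id: 1, host: "xsj-db1:27030" },                    // SECONDARY (from the logs)
    { _id: 2, host: "xsj-db2:27030", arbiterOnly: true }  // ARBITER (from the logs)
  ]
})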

1. Throughout the load, we see errors like the following:
a. [conn270741] thread over memory limit, cleaning up, current: 498k
b. Socket say send() Broken pipe
c. Fri Aug 11 03:08:21.466 I COMMAND [conn165804] serverStatus was very slow:

{ after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opLatencies: 0, after opcounters: 0, after opcountersRepl: 0, after repl: 6589, after security: 6589, after sharding: 6589, after storageEngine: 6589, after tcmalloc: 6589, after wiredTiger: 6589, at end: 6589 }
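Each field in that log line is the cumulative elapsed time in milliseconds after collecting the named section, so the jump from "after opcountersRepl: 0" to "after repl: 6589" shows the repl section alone stalled for roughly 6.5 seconds. A quick way to measure this interactively from the mongo shell (a sketch using only standard shell helpers):

// Time serverStatus and inspect the connection counters it returns.
var t0 = new Date();
var s = db.serverStatus();
print("serverStatus took " + (new Date() - t0) + " ms");
printjson(s.connections);  // { current, available, totalCreated }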

2. We see that the PRIMARY transitioned to SECONDARY multiple times (around 14 times in a day); each time an election took place and the node transitioned back to PRIMARY.

Fri Aug 11 03:03:32.034 D REPL [ReplicationExecutor] Scheduling heartbeat to xsj-db1:27030 at 2017-08-11T10:03:33.978Z
Fri Aug 11 03:03:32.041 I REPL [ReplicationExecutor] Member xsj-db2:27030 is now in state ARBITER
Fri Aug 11 03:03:32.041 D REPL [ReplicationExecutor] Scheduling heartbeat to xsj-db2:27030 at 2017-08-11T10:03:34.041Z
Fri Aug 11 03:03:32.042 I REPL [replExecDBWorker-0] transition to SECONDARY

Fri Aug 11 03:03:43.143 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms

Fri Aug 11 03:03:43.297 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 26
Fri Aug 11 03:03:43.298 I REPL [ReplicationExecutor] transition to PRIMARY

All the while, we have checked and confirmed that the ARBITER remained up.
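The "seen no PRIMARY in the past 10000ms" message corresponds to the replica set's electionTimeoutMillis setting, which defaults to 10000 ms. As a sketch (the new value here is an illustrative assumption, not a recommendation from this ticket), the timeout can be raised so that a briefly overloaded PRIMARY is not deposed as quickly:

// From the mongo shell, connected to the current PRIMARY.
var cfg = rs.conf();
cfg.settings = cfg.settings || {};           // settings may be absent in older configs
cfg.settings.electionTimeoutMillis = 20000;  // assumed example value; default is 10000
rs.reconfig(cfg);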

3. After the 14th switchover to SECONDARY, no election takes place and the number of connections increases to 32k, even though the maximum number of connections previously seen was only around 415. After reaching 32k connections the database hangs, and the error below is recorded continuously until the database process crashes:

Fri Aug 11 22:35:42.361 I - [thread1] pthread_create failed: Resource temporarily unavailable
Fri Aug 11 22:35:42.365 I - [thread1] failed to create service entry worker thread for 172.19.154.189:9621
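The pthread_create failure indicates mongod hit an operating-system resource limit (such as the ulimit on processes/threads) while spawning a worker thread for yet another incoming connection; the net.maxIncomingConnections server setting can cap this. A monitoring sketch for the mongo shell, using the standard serverStatus connection counters (the 5-second interval is an arbitrary choice):

// Poll the connection counters every 5 seconds to watch the climb toward 32k.
while (true) {
  var c = db.serverStatus().connections;
  print(new Date().toISOString() + " current=" + c.current +
        " available=" + c.available + " totalCreated=" + c.totalCreated);
  sleep(5000);  // sleep() is built into the mongo shell
}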

Can you please suggest what action should be taken during such occurrences?

Thanks,
Tanveer



 Comments   
Comment by Kelsey Schubert [ 18/Jan/18 ]

Hi tanveermadan@gmail.com,

Sorry for the delay in getting back to you. From our investigation, we have not identified a bug in MongoDB contributing to this behavior. For MongoDB-related support discussion, please post on the mongodb-user group or Stack Overflow with the mongodb tag; a question like this, which involves ongoing discussion, is best posted to the mongodb-user group.

Kind regards,
Kelsey

Comment by Tanveer Madan Marate [ 14/Aug/17 ]

Hi Thomas,

I have uploaded the log files and also the diagnostic.data directory.
Please let me know if any other files are required.

Thanks,
Tanveer

Comment by Kelsey Schubert [ 14/Aug/17 ]

Hi tanveermadan@gmail.com,

Thanks for reporting this behavior. So that we can better understand what is going on here, would you please provide the complete log files and an archive of the diagnostic.data directory in the $dbpath for both the primary and secondary nodes?

I've created a secure upload portal for you to use. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time.

Kind regards,
Thomas
