[SERVER-17845] Running the dropDatabase command can cause an election to occur Created: 01/Apr/15  Updated: 03/Apr/15  Resolved: 03/Apr/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.0.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Robert Stam Assignee: Amalia Hawkins
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: 27019.log, db27017.log, db27018.log (text files)
Issue Links:
Duplicates: SERVER-12235 Don't require a database read on ever... (Closed)
Backwards Compatibility: Fully Compatible
Operating System: Windows
Sprint: Security 1 04/03/15
Participants:

 Description   

We are encountering unexpected elections when running the .NET driver unit tests against a replica set (with auth and SSL enabled, if that makes a difference).

Apparently, running the dropDatabase command can cause an unexpected election to occur.
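
For reference, the failing pattern is roughly the following. The actual tests use the .NET driver; this pymongo sketch is only an illustration, and the connection string, credentials, and database name are hypothetical:

# Reproduction sketch only: the real tests use the .NET driver. Connection
# details, credentials, and the database name below are made up.
from pymongo import MongoClient
from pymongo.errors import AutoReconnect, OperationFailure

client = MongoClient(
    "mongodb://testuser:testpass@localhost:27017,localhost:27018,localhost:27019/"
    "?replicaSet=rs0&ssl=true&authSource=admin"
)

# Each test run creates a scratch database, writes to it, then drops it.
db = client["Tests04011621"]
db["docs"].insert_one({"x": 1})

try:
    # On 3.0 with MMAPv1 the drop stalled the primary for ~20 seconds, long
    # enough for the other members' heartbeats to time out and force an election.
    client.drop_database("Tests04011621")
except (AutoReconnect, OperationFailure):
    # The step-down surfaces in the driver as a "not master" / connection error.
    print("primary stepped down while dropping the database")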

From the attached logs:

On member 27017:

2015-04-01T16:22:18.013+0000 D REPL     [ReplicationExecutor] localhost:27018 thinks that we are down because they cannot send us heartbeats.
2015-04-01T16:22:18.013+0000 I REPL     [ReplicationExecutor] Member localhost:27018 is now in state PRIMARY
2015-04-01T16:22:18.013+0000 I REPL     [ReplicationExecutor] replset: another primary seen with election time 551c1b39:1 my election time is 551c1ab4:1
2015-04-01T16:22:18.013+0000 I REPL     [ReplicationExecutor] stepping down; another primary was elected more recently
2015-04-01T16:22:18.013+0000 D REPL     [ReplicationExecutor] Scheduling heartbeat to localhost:27018 at 2015-04-01T16:22:20.013Z
2015-04-01T16:22:18.013+0000 I REPL     [ReplicationExecutor] Stepping down from primary in response to heartbeat

And on member 27018:

2015-04-01T16:22:16.553+0000 I NETWORK  [ReplExecNetThread-0] Socket recv() timeout  127.0.0.1:27017
2015-04-01T16:22:16.553+0000 I NETWORK  [ReplExecNetThread-0] SocketException: remote: 127.0.0.1:27017 error: 9001 socket exception [RECV_TIMEOUT] server [127.0.0.1:27017] 
2015-04-01T16:22:16.553+0000 I NETWORK  [ReplExecNetThread-0] DBClientCursor::init call() failed
2015-04-01T16:22:16.553+0000 D -        [ReplExecNetThread-0] User Assertion: 10276:DBClientBase::findN: transport error: localhost:27017 ns: local.$cmd query: { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D5F5F73797374656D2C723D534D4E376A716762326245447A315847506C4D2B50695A6E2B6E627338315834) }
2015-04-01T16:22:16.740+0000 D REPL     [ReplExecNetThread-0] Network status of sending replSetHeartbeat to localhost:27017 was Location10276 DBClientBase::findN: transport error: localhost:27017 ns: local.$cmd query: { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D5F5F73797374656D2C723D534D4E376A716762326245447A315847506C4D2B50695A6E2B6E627338315834) }
2015-04-01T16:22:16.740+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to localhost:27017; Location10276 DBClientBase::findN: transport error: localhost:27017 ns: local.$cmd query: { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D5F5F73797374656D2C723D534D4E376A716762326245447A315847506C4D2B50695A6E2B6E627338315834) }
2015-04-01T16:22:16.740+0000 D REPL     [ReplicationExecutor] Scheduling heartbeat to localhost:27017 at 2015-04-01T16:22:18.740Z
2015-04-01T16:22:16.740+0000 I REPL     [ReplicationExecutor] Standing for election
...
2015-04-01T16:22:17.209+0000 I REPL     [ReplicationExecutor] replSet election succeeded, assuming primary role
2015-04-01T16:22:17.209+0000 I REPL     [ReplicationExecutor] transition to PRIMARY

So 27018 has taken over the primary role from 27017.

Here's what appears to have bogged member 27017 down so badly that 27018 concluded it was down:

2015-04-01T16:21:59.390+0000 D COMMAND  [conn27] run command Tests04011621.$cmd { dropDatabase: 1 }
2015-04-01T16:21:59.390+0000 I COMMAND  [conn27] dropDatabase Tests04011621 starting
2015-04-01T16:21:59.390+0000 D STORAGE  [conn27] dropDatabase Tests04011621
...
2015-04-01T16:22:18.777+0000 I JOURNAL  [conn27] journalCleanup...
2015-04-01T16:22:18.778+0000 I JOURNAL  [conn27] removeJournalFiles
2015-04-01T16:22:18.778+0000 I STORAGE  [DataFileSync] flushing mmaps took 17177ms  for 10 files
2015-04-01T16:22:18.814+0000 D JOURNAL  [conn27] removeJournalFiles end
2015-04-01T16:22:19.019+0000 I JOURNAL  [conn27] journalCleanup...
2015-04-01T16:22:19.019+0000 I JOURNAL  [conn27] removeJournalFiles
2015-04-01T16:22:19.019+0000 D JOURNAL  [conn27] removeJournalFiles end
2015-04-01T16:22:19.047+0000 D STORAGE  [conn27] remove file D:\jenkins\workspace\mongo-csharp-driver-test-windows\label\windows64\mc\replica_set\mo\auth\ms\30-release\mssl\ssl\artifacts\data\db27017\Tests04011621.ns
2015-04-01T16:22:19.156+0000 I COMMAND  [conn27] dropDatabase Tests04011621 finished
2015-04-01T16:22:19.174+0000 I COMMAND  [conn27] command Tests04011621.$cmd command: dropDatabase { dropDatabase: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:64 locks:{} 19784ms
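
The drop held the primary up for almost 20 seconds, and the 17-second mmap flush alone exceeds the default 10-second heartbeat timeout. As a stopgap until the underlying issue is fixed, one could raise the replica set's heartbeatTimeoutSecs; this is only a sketch of that workaround, not something proposed in this ticket, and the connection details are hypothetical:

# Workaround sketch only, not the fix discussed in the comments: widen the
# heartbeat timeout so a ~20s dropDatabase stall no longer looks like a dead primary.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")  # hypothetical setup

# Fetch the current replica set config, bump its version, and raise the timeout.
cfg = client.admin.command("replSetGetConfig")["config"]
cfg["version"] += 1
cfg.setdefault("settings", {})["heartbeatTimeoutSecs"] = 30  # default is 10

# Reconfigure; this must run against the current primary.
client.admin.command("replSetReconfig", cfg)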



 Comments   
Comment by Matt Dannenberg [ 02/Apr/15 ]

Cherry-picking the commit from SERVER-12235 fixed the problem for me. Reassigning to amalia.hawkins@10gen.com since she had mentioned wanting to add a small patch on top of that one.

Comment by Bernie Hackett [ 02/Apr/15 ]

Oh, I see. That's definitely not the problem then.

Comment by Jeffrey Yemin [ 01/Apr/15 ]

This is happening on the dedicated Windows machines. I don't think there are disk space issues there.

Comment by Bernie Hackett [ 01/Apr/15 ]

Jenkins has been having issues with running out of disk space recently. I wonder if this is related.

Comment by Robert Stam [ 01/Apr/15 ]

I don't know. We run our tests on Jenkins, which is configured to run the entire replica set (and the unit tests) on a single machine.

Comment by Eric Milkie [ 01/Apr/15 ]

Does the same problem happen if you run the tests with each node on a separate physical machine? I am curious whether a lack of system resources is part of the issue.
