[SERVER-17034] Deadlock between poorly-formed copydb and reading admin.system.users for localhost exception check Created: 24/Jan/15  Updated: 25/Jan/17  Resolved: 24/Apr/15

Status: Closed
Project: Core Server
Component/s: Replication, Security
Affects Version/s: 3.0.0-rc6
Fix Version/s: 3.1.2

Type: Bug Priority: Major - P3
Reporter: Bernie Hackett Assignee: Spencer Brody (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File 27019.log     Text File backtrace.txt     File db27017.log     File db27018.log    
Issue Links:
Depends
is depended on by PYTHON-832 Re-enable copy_database tests in auth... Closed
Related
related to SERVER-17180 Don't create a connection back to our... Closed
related to SERVER-12235 Don't require a database read on ever... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Security 1 04/03/15, Sharding 2 04/24/15
Participants:

 Description   

I started seeing "not master" errors when running PyMongo's test suite in Jenkins against a replica set with auth enabled. You can see an example here:

https://jenkins.10gen.com/view/Python/job/mongo-python-driver-v2.8-auth/58/label=linux64,mongodb_configuration=replica_set,mongodb_server=27-nightly,python_language_version=2.4/console

Please note that the entire replica set runs on the same host. When the test run starts, the primary is localhost:27017 and the secondaries are localhost:27018 and localhost:27019.
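
For reproduction purposes, here is a minimal sketch of bringing up a single-host, auth-enabled replica set like this one. The paths, ports, and set name are hypothetical and the actual Jenkins harness is not shown here; this is only meant to illustrate the kind of deployment involved:

import subprocess
import time
from pymongo import MongoClient

for port in (27017, 27018, 27019):
    subprocess.Popen([
        "mongod", "--replSet", "repl0",
        "--port", str(port),
        "--dbpath", "/data/rs%d" % port,       # hypothetical dbpath
        "--keyFile", "/data/keyfile",          # keyFile implies auth and enables intra-cluster auth
        "--logpath", "/data/rs%d.log" % port,  # hypothetical log path
    ])

time.sleep(5)  # crude wait for the mongods to start listening

# Initiate the set through the node that should become the primary
# (allowed without credentials thanks to the localhost exception).
MongoClient("localhost", 27017).admin.command("replSetInitiate", {
    "_id": "repl0",
    "members": [
        {"_id": 0, "host": "localhost:27017"},
        {"_id": 1, "host": "localhost:27018"},
        {"_id": 2, "host": "localhost:27019"},
    ],
})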

At some point during the test run, the secondaries can no longer communicate with the primary. For example:

2015-01-23T23:51:14.708+0000 I NETWORK  [ReplExecNetThread-1] Socket recv() timeout  127.0.0.1:27017
2015-01-23T23:51:14.708+0000 I NETWORK  [ReplExecNetThread-1] SocketException: remote: 127.0.0.1:27017 error: 9001 socket exception [RECV_TIMEOUT] server [127.0.0.1:27017]
2015-01-23T23:51:14.708+0000 I NETWORK  [ReplExecNetThread-1] DBClientCursor::init call() failed
2015-01-23T23:51:14.708+0000 D -        [ReplExecNetThread-1] User Assertion: 10276:DBClientBase::findN: transport error: 127.0.0.1:27017 ns: local.$cmd query: { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D5F5F73797374656D2C723D2F574F375956673641772F2B77766B6D71783862314C676B4D454B494D303342) }
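
For what it's worth, the saslStart payload in these messages is just hex-encoded SCRAM client-first data, and decoding it shows that it is the internal __system user (keyFile/intra-cluster auth) whose authentication is timing out. For example, in Python:

import binascii

payload = ("6E2C2C6E3D5F5F73797374656D2C723D"
           "2F574F375956673641772F2B77766B6D71783862314C676B4D454B494D303342")
print(binascii.unhexlify(payload).decode("utf-8"))
# prints: n,,n=__system,r=/WO7YVg6Aw/+wvkmqx8b1LgkMEKIM03B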

One of the secondaries is elected the new primary, but almost immediately it is able to connect to the old primary again:

2015-01-23T23:51:15.688+0000 I REPL     [ReplicationExecutor] Standing for election
2015-01-23T23:51:15.688+0000 D REPL     [ReplicationExecutor] Scheduling replSetFresh to 127.0.0.1:27019
2015-01-23T23:51:15.688+0000 D REPL     [ReplExecNetThread-1] Network status of sending replSetFresh to 127.0.0.1:27019 was OK
2015-01-23T23:51:15.688+0000 D REPL     [ReplicationExecutor] FreshnessChecker: Got response from 127.0.0.1:27019 of { opTime: new Date(6107688497026433025), fresher: false, veto: false, ok: 1.0 }
2015-01-23T23:51:15.688+0000 I REPL     [ReplicationExecutor] replSet info electSelf
2015-01-23T23:51:15.688+0000 D REPL     [ReplicationExecutor] Scheduling replSetElect to 127.0.0.1:27019
2015-01-23T23:51:15.689+0000 D REPL     [ReplExecNetThread-2] Network status of sending replSetElect to 127.0.0.1:27019 was OK
2015-01-23T23:51:15.689+0000 D REPL     [ReplicationExecutor] replSet elect res: { vote: 1, round: ObjectId('54c2de73e420360a414ded2e'), ok: 1.0 }
2015-01-23T23:51:15.689+0000 I REPL     [ReplicationExecutor] replSet election succeeded, assuming primary role
2015-01-23T23:51:15.689+0000 I REPL     [ReplicationExecutor] transition to PRIMARY
2015-01-23T23:51:15.761+0000 D NETWORK  [conn41] SocketException: remote: 127.0.0.1:44333 error: 9001 socket exception [CLOSED] server [127.0.0.1:44333]
2015-01-23T23:51:15.761+0000 I NETWORK  [conn41] end connection 127.0.0.1:44333 (2 connections now open)
2015-01-23T23:51:16.651+0000 D REPL     [rsSync] Removing temporary collections from auth_test
2015-01-23T23:51:16.651+0000 D REPL     [rsSync] Removing temporary collections from pymongo_test
2015-01-23T23:51:16.652+0000 D REPL     [rsSync] Removing temporary collections from pymongo_test1
2015-01-23T23:51:16.652+0000 D REPL     [rsSync] Removing temporary collections from admin
2015-01-23T23:51:16.653+0000 I REPL     [rsSync] transition to primary complete; database writes are now permitted
2015-01-23T23:51:16.692+0000 D REPL     [ReplicationExecutor] Scheduling replSetHeartbeat to 127.0.0.1:27019
2015-01-23T23:51:16.692+0000 D REPL     [ReplExecNetThread-1] Network status of sending replSetHeartbeat to 127.0.0.1:27019 was OK
2015-01-23T23:51:16.693+0000 D REPL     [ReplicationExecutor] Choosing to remain primary
2015-01-23T23:51:16.693+0000 D REPL     [ReplicationExecutor] Scheduling heartbeat to 127.0.0.1:27019 at 2015-01-23T23:51:18.693Z
2015-01-23T23:51:16.709+0000 D REPL     [ReplicationExecutor] Scheduling replSetHeartbeat to 127.0.0.1:27017
2015-01-23T23:51:16.709+0000 D COMMAND  [ConnectBG] BackgroundJob starting: ConnectBG
2015-01-23T23:51:16.709+0000 D NETWORK  [ReplExecNetThread-2] connected to server 127.0.0.1:27017 (127.0.0.1)
2015-01-23T23:51:16.753+0000 D REPL     [ReplExecNetThread-2] Network status of sending replSetHeartbeat to 127.0.0.1:27017 was OK
2015-01-23T23:51:16.753+0000 I REPL     [ReplicationExecutor] Member 127.0.0.1:27017 is now in state PRIMARY
2015-01-23T23:51:16.753+0000 I REPL     [ReplicationExecutor] replset: another primary seen with election time 54c2ddef:1 my election time is 54c2de73:1
2015-01-23T23:51:16.753+0000 I REPL     [ReplicationExecutor] another PRIMARY detected and it should step down since it was elected earlier than me
2015-01-23T23:51:16.753+0000 D REPL     [ReplicationExecutor] Scheduling heartbeat to 127.0.0.1:27017 at 2015-01-23T23:51:18.753Z
2015-01-23T23:51:16.753+0000 I REPL     [ReplicationExecutor] Requesting 127.0.0.1:27017 step down from primary
2015-01-23T23:51:16.753+0000 D REPL     [ReplicationExecutor] Scheduling replSetStepDown to 127.0.0.1:27017
2015-01-23T23:51:16.753+0000 D REPL     [ReplExecNetThread-1] Network status of sending replSetStepDown to 127.0.0.1:27017 was OK
2015-01-23T23:51:16.753+0000 D REPL     [ReplicationExecutor] replset: stepdown of primary(127.0.0.1:27017) succeeded with response -- { ok: 0.0, errmsg: "stepdown period must be longer than secondaryCatchUpPeriodSecs", code: 2 }
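
For reference, the rejected replSetStepDown at the end of this block is the new primary asking the old primary to step down: the command takes a stepdown window in seconds plus an optional secondaryCatchUpPeriodSecs, and the window must be longer than the catch-up period or the server returns exactly the error shown above. A minimal sketch of a well-formed request via PyMongo, with connection details and credentials assumed:

from pymongo import MongoClient
from pymongo.errors import ConnectionFailure
from bson.son import SON

client = MongoClient("localhost", 27017)         # the old primary
client.admin.authenticate("admin", "password")   # assumed credentials

try:
    # 60-second stepdown window, waiting up to 10 seconds for a secondary to
    # catch up; 60 > 10, so this form is not rejected with the error above.
    client.admin.command(SON([("replSetStepDown", 60),
                              ("secondaryCatchUpPeriodSecs", 10)]))
except ConnectionFailure:
    # Servers of this era close their connections when the stepdown succeeds,
    # so a network error here does not necessarily mean the command failed.
    pass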

The old primary notices the problem and steps down:

2015-01-23T23:51:16.829+0000 D REPL     [ReplicationExecutor] Scheduling replSetHeartbeat to 127.0.0.1:27018
2015-01-23T23:51:16.829+0000 D NETWORK  [ReplExecNetThread-2] polling for status of connection to 127.0.0.1:27018, no events
2015-01-23T23:51:16.830+0000 D REPL     [ReplExecNetThread-2] Network status of sending replSetHeartbeat to 127.0.0.1:27018 was OK
2015-01-23T23:51:16.830+0000 I REPL     [ReplicationExecutor] Member 127.0.0.1:27018 is now in state PRIMARY
2015-01-23T23:51:16.830+0000 I REPL     [ReplicationExecutor] replset: another primary seen with election time 54c2de73:1 my election time is 54c2ddef:1
2015-01-23T23:51:16.830+0000 I REPL     [ReplicationExecutor] stepping down; another primary was elected more recently
2015-01-23T23:51:16.830+0000 D REPL     [ReplicationExecutor] Scheduling heartbeat to 127.0.0.1:27018 at 2015-01-23T23:51:18.830Z
2015-01-23T23:51:16.830+0000 I REPL     [ReplicationExecutor] Stepping down from primary in response to heartbeat
2015-01-23T23:51:16.830+0000 I REPL     [replCallbackWithGlobalLock-0] transition to SECONDARY

The strange thing is that when the connection problem occurs, the original primary also can't connect to itself, as evidenced by messages like this:

2015-01-23T23:51:12.827+0000 D REPL     [ReplicationExecutor] Scheduling replSetHeartbeat to 127.0.0.1:27018
2015-01-23T23:51:12.827+0000 D REPL     [ReplExecNetThread-2] Network status of sending replSetHeartbeat to 127.0.0.1:27018 was OK
2015-01-23T23:51:12.828+0000 D REPL     [ReplicationExecutor] Choosing to remain primary
2015-01-23T23:51:12.828+0000 D REPL     [ReplicationExecutor] Scheduling heartbeat to 127.0.0.1:27018 at 2015-01-23T23:51:14.828Z
2015-01-23T23:51:14.635+0000 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:45337 #119 (21 connections now open)
2015-01-23T23:51:14.677+0000 D REPL     [ReplicationExecutor] Scheduling replSetHeartbeat to 127.0.0.1:27019
2015-01-23T23:51:14.677+0000 D NETWORK  [ReplExecNetThread-2] polling for status of connection to 127.0.0.1:27019, no events
2015-01-23T23:51:14.677+0000 D REPL     [ReplExecNetThread-2] Network status of sending replSetHeartbeat to 127.0.0.1:27019 was OK
2015-01-23T23:51:14.677+0000 D REPL     [ReplicationExecutor] Choosing to remain primary
2015-01-23T23:51:14.677+0000 D REPL     [ReplicationExecutor] Scheduling heartbeat to 127.0.0.1:27019 at 2015-01-23T23:51:16.677Z
2015-01-23T23:51:14.764+0000 I NETWORK  [conn114] Socket recv() timeout  127.0.0.1:27017
2015-01-23T23:51:14.764+0000 I NETWORK  [conn114] SocketException: remote: 127.0.0.1:27017 error: 9001 socket exception [RECV_TIMEOUT] server [127.0.0.1:27017]
2015-01-23T23:51:14.764+0000 I NETWORK  [conn114] DBClientCursor::init call() failed
2015-01-23T23:51:14.764+0000 D -        [conn114] User Assertion: 10276:DBClientBase::findN: transport error: localhost:27017 ns: local.$cmd query: { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D5F5F73797374656D2C723D345330677A724D726C6E6878466C4B485845323848304F3346554C6A444A6731) }
2015-01-23T23:51:14.764+0000 I ACCESS   [conn114] can't authenticate to localhost:27017 (127.0.0.1) failed as internal user, error: DBClientBase::findN: transport error: localhost:27017 ns: local.$cmd query: { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D5F5F73797374656D2C723D345330677A724D726C6E6878466C4B485845323848304F3346554C6A444A6731) }

Since this only appears to happen when auth is enabled, one theory is that somewhere in the networking code a lock is held during authentication that prevents new connections from being made, or that auth itself hangs so the "client" (another replica set member in this case) never gets a response to saslStart. A hanging saslStart would be consistent with the RECV_TIMEOUT errors seen when calling it.

This problem occurs with the latest nightly as well as the nightly from the 20th (the oldest nightly we still have on Jenkins), and presumably the few nightlies in between. Logs are attached for all replica set members from one of the failing test runs.
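
For context on the title and resolution, the linked tickets suggest the deadlock arises when a copydb causes the node to open a connection back to itself (SERVER-17180) while every new localhost connection reads admin.system.users for the localhost-exception check (SERVER-12235). A rough sketch of the kind of copy_database call the PYTHON-832 tests exercise, using PyMongo 2.8's copy_database helper; the database names and credentials are assumed and the actual failing test is not shown in this ticket:

from pymongo import MongoClient

client = MongoClient("localhost", 27017)         # current primary
client.admin.authenticate("admin", "password")   # assumed credentials

# Passing from_host pointing back at the same node makes the server open a
# connection to itself to perform the copy; that new localhost connection is
# presumably what runs into the admin.system.users read described above.
client.copy_database("pymongo_test", "pymongo_test_copy",
                     from_host="localhost:27017",
                     username="admin", password="password")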



 Comments   
Comment by Spencer Brody (Inactive) [ 24/Apr/15 ]

Yes, the real fix was completed for SERVER-12235, which has already been backported to 3.0.3.

Comment by Bernie Hackett [ 24/Apr/15 ]

Is this fix going to be backported to 3.0.x?

Comment by Eric Milkie [ 26/Jan/15 ]

Handing off to Spencer for further investigation.

Comment by Eric Milkie [ 26/Jan/15 ]

Someone else logged in and terminated my connections; they appear to be running the test suite now.

Comment by Eric Milkie [ 26/Jan/15 ]

I'm trying out the tests on jsl8 until Andy gets in.
