[SERVER-17034] Deadlock between poorly-formed copydb and reading admin.system.users for localhost exception check Created: 24/Jan/15 Updated: 25/Jan/17 Resolved: 24/Apr/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Security |
| Affects Version/s: | 3.0.0-rc6 |
| Fix Version/s: | 3.1.2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Bernie Hackett | Assignee: | Spencer Brody (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Security 1 04/03/15, Sharding 2 04/24/15 |
| Description |
|
I started seeing "not master" errors when running PyMongo's test suite in Jenkins against a replica set with auth enabled. You can see an example here: Please note that the entire replica set is running on the same host. When the test run starts, the primary is localhost:27017 and the secondaries are localhost:27018 and localhost:27019. At some point during the test run, the secondaries can no longer communicate with the primary. For example:
One of the secondaries is elected the new primary, but almost immediately it is able to connect to the old primary again:
The old primary notices the problem and steps down:
The strange thing is that, when the connection problem occurs, the original primary also can't connect to itself, as evidenced by messages like this:
Since this only appears to happen when auth is enabled, one theory is that somewhere in the networking code a lock is held during authentication and prevents new connections from being made. Another is that auth itself is hanging, so the "client" (another replica set member in this case) never gets a response to saslStart. The RECV_TIMEOUT error when calling saslStart suggests that a hanging saslStart is a real possibility. This problem occurs with the latest nightly as well as the nightly from the 20th (the oldest nightly we still have on Jenkins), and presumably with the few nightlies in between. Logs are attached for all replica set members from one of the failing test runs. |
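The first theory in the description (a lock held across a stalled authentication step blocks every new connection) can be illustrated with a small, self-contained Python sketch. This is a toy model only, not MongoDB or PyMongo code: `ToyServer`, `authenticate`, `accept_connection`, and `run_demo` are hypothetical names invented for illustration, and the single lock is an assumption standing in for whatever resource the real server might contend on.

```python
import threading


class ToyServer:
    """Hypothetical stand-in for a server process; not MongoDB code."""

    def __init__(self):
        # Assumption: one shared lock guards both auth and new connections.
        self._lock = threading.Lock()

    def authenticate(self, auth_started, auth_may_finish):
        # The lock is held for the whole authentication step. If the step
        # stalls (simulated by waiting on an event), the lock stays held.
        with self._lock:
            auth_started.set()
            auth_may_finish.wait()

    def accept_connection(self, timeout):
        # A new connection needs the same lock. Returns False if the lock
        # cannot be obtained within `timeout` seconds, which a client would
        # observe as a receive timeout.
        acquired = self._lock.acquire(timeout=timeout)
        if acquired:
            self._lock.release()
        return acquired


def run_demo(timeout=0.2):
    server = ToyServer()
    auth_started = threading.Event()
    auth_may_finish = threading.Event()

    worker = threading.Thread(
        target=server.authenticate, args=(auth_started, auth_may_finish)
    )
    worker.start()
    auth_started.wait()  # make sure the auth thread already holds the lock

    # While auth is stalled, a new connection attempt times out.
    while_stalled = server.accept_connection(timeout)

    # Once auth completes and releases the lock, connections succeed again.
    auth_may_finish.set()
    worker.join()
    after_release = server.accept_connection(timeout)

    return while_stalled, after_release


if __name__ == "__main__":
    print(run_demo())
```

In this model the connection attempt fails only while the stalled step holds the lock and succeeds as soon as the lock is released, which resembles the brief loss and quick recovery of connectivity between members described above. It does not establish that this is the actual cause.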
| Comments |
| Comment by Spencer Brody (Inactive) [ 24/Apr/15 ] |
|
Yes, the real fix was completed for |
| Comment by Bernie Hackett [ 24/Apr/15 ] |
|
Is this fix going to be backported to 3.0.x? |
| Comment by Eric Milkie [ 26/Jan/15 ] |
|
Handing off to Spencer for further investigation. |
| Comment by Eric Milkie [ 26/Jan/15 ] |
|
Someone else logged in and terminated my connections; they appear to be running the test suite now. |
| Comment by Eric Milkie [ 26/Jan/15 ] |
|
I'm trying out the tests on jsl8 until Andy gets in. |