[SERVER-18096] Shard primary incorrectly reuses closed sockets after relinquish and re-election Created: 17/Apr/15 Updated: 19/May/15 Resolved: 07/May/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Networking |
| Affects Version/s: | 2.6.9 |
| Fix Version/s: | 2.6.10 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kevin Pulo | Assignee: | Kevin Pulo |
| Resolution: | Done | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
|
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Sharding 3 05/15/15 |
| Participants: |
| Description |
|
When a shard primary relinquishes, it closes all incoming and outgoing connections. This is normal and necessary. However, if it later becomes primary again, it incorrectly tries to reuse the (now closed) outgoing sockets to the configsvrs and to other shards' members (ReplicaSetMonitorWatcher). Because these fds have been closed and are no longer valid, the logfile fills with "Bad file descriptor" (errno = EBADF) messages. The connections are not automatically re-established, so subsequent chunk migrations fail (and probably other operations that require the shards to write to the configsvrs). The actual impact depends on whether the FROM or TO shard has "bounced" (step-down/step-up).
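The failure mode described above can be illustrated with a toy model. This is only a sketch and not the mongod code or the actual fix: `OutgoingConnectionCache`, `forget_on_close` and `send_after_bounce` are hypothetical names, and `socket.socketpair()` stands in for a real connection to a configsvr. It shows a cache that closes its sockets on step-down but keeps handing out the stale entries, versus one that forgets them and reconnects.

```python
import errno
import socket


class OutgoingConnectionCache:
    """Hypothetical stand-in for a shard's cached outgoing connections."""

    def __init__(self, forget_on_close):
        # forget_on_close=False models the buggy behaviour described above.
        self.forget_on_close = forget_on_close
        self._conns = {}   # peer name -> local end of the connection
        self._peers = []   # remote ends, kept only so they stay open

    def get(self, peer):
        # Reuse a cached socket if one exists, otherwise "connect".
        if peer not in self._conns:
            local, remote = socket.socketpair()
            self._conns[peer] = local
            self._peers.append(remote)
        return self._conns[peer]

    def close_all(self):
        # Models relinquishing primary: every outgoing socket is closed.
        for sock in self._conns.values():
            sock.close()
        if self.forget_on_close:
            # Drop the stale entries so the next get() reconnects.
            self._conns.clear()


def send_after_bounce(forget_on_close):
    """Send once as primary, bounce, then send again; return the outcome."""
    cache = OutgoingConnectionCache(forget_on_close)
    cache.get("configsvr").sendall(b"ping")      # works while primary
    cache.close_all()                             # step-down
    try:
        cache.get("configsvr").sendall(b"ping")  # after re-election
        return "ok"
    except OSError as exc:
        return errno.errorcode[exc.errno]
```

Calling `send_after_bounce(False)` reuses the closed socket and reports the errno name of the resulting error, while `send_after_bounce(True)` reconnects and succeeds. The point of the sketch is only that closing a socket without invalidating the cached handle leaves every later caller with a dead fd.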
Initially the failures occur early in the migration process, but subsequent migrations fail later in the process, notably after documents have been transferred (causing orphaned documents). In some of these failures,
Attached are a jstest reproducer and a wrapper script suitable for "git bisect run".
This only affects 2.6; it has been incidentally fixed in 3.0. Using git bisect shows that commit fbbb0d2a1d845728cd714272199a652573e2f27d (
I have confirmed that the following hunk alone is sufficient to fix the problem:
Given that this is a very simple fix for a logic bug of moderately high impact, can this please be backported to the v2.6 branch? |
| Comments |
| Comment by Githook User [ 07/May/15 ] |
|
Author: Kevin Pulo <kevin.pulo@mongodb.com>
Message: Signed-off-by: Spencer T Brody <spencer@mongodb.com> |
| Comment by Kevin Pulo [ 17/Apr/15 ] |
|
Example of the impact of bouncing the FROM shard (with a simple reconfig in 2.6):
|