[SERVER-18096] Shard primary incorrectly reuses closed sockets after relinquish and re-election Created: 17/Apr/15  Updated: 19/May/15  Resolved: 07/May/15

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: 2.6.9
Fix Version/s: 2.6.10

Type: Bug Priority: Major - P3
Reporter: Kevin Pulo Assignee: Kevin Pulo
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File shard_primary_relinquish_migrate.js     File shard_primary_relinquish_migrate.sh    
Issue Links:
Related
related to SERVER-15022 TO-shard fails to accept new chunk in... Closed
related to SERVER-15593 Initial autosplit heuristics are very... Closed
related to SERVER-17066 cleanupOrphaned misses orphans after ... Closed
is related to SERVER-18358 Add shard_primary_relinquish_migrate ... Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 3 05/15/15
Participants:

 Description   

When a shard primary relinquishes, it closes all incoming and, importantly, all outgoing connections. This is normal and necessary. However, if it later becomes primary again, it will incorrectly try to reuse the (now closed) outgoing sockets to the configsvrs and to the other shards' members (via the ReplicaSetMonitorWatcher).

Since these fds have been closed and are no longer valid, this causes a profusion of "Bad file descriptor" (errno = EBADF) messages in the logfile. Worse, the connections are not automatically re-established, so subsequent chunk migrations fail (as, most likely, do other operations that require the shards to write to the configsvrs).
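
For illustration, here is a minimal standalone sketch (not MongoDB code) of where those messages come from: any write on a descriptor that has already been closed fails immediately with errno = EBADF, so each attempted reuse of a stale outgoing socket logs this error rather than transparently reconnecting.

#include <cerrno>
#include <cstdio>
#include <cstring>
#include <unistd.h>

int main() {
    int fds[2];
    if (pipe(fds) != 0) return 1;   // any descriptor will do; a pipe is simplest
    close(fds[0]);
    close(fds[1]);                  // both ends are now closed

    // Reusing the closed descriptor, analogous to the shard primary reusing
    // its old outgoing sockets after re-election:
    if (write(fds[1], "x", 1) == -1) {
        printf("write failed: errno=%d (%s)\n", errno, strerror(errno));
        // prints: write failed: errno=9 (Bad file descriptor)
    }
    return 0;
}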

The actual impact depends on whether the FROM or TO shard has "bounced" (step-down/step-up).

  • FROM shard bounce => next 4 migrations fail
  • TO shard bounce => next 3 migrations fail
  • FROM and TO shard bounce => next 8 migrations fail

Initially the failures occur early in the migration process, but subsequent migrations fail later in the process, notably after documents have already been transferred (leaving orphaned documents). In some of these failures, SERVER-17066 means that the resulting orphans cannot be cleaned up by cleanupOrphaned.

Attached are a jstest reproducer (shard_primary_relinquish_migrate.js) and a wrapper script (shard_primary_relinquish_migrate.sh) suitable for use with "git bisect run".

This only affects 2.6; it has been incidentally fixed in 3.0. git bisect identifies commit fbbb0d2a1d845728cd714272199a652573e2f27d (SERVER-15593) as the fix; however, that ticket addresses a different problem, and the bulk of the commit is unrelated to this issue.

I have confirmed that the following hunk alone is sufficient to fix the problem:

diff --git a/src/mongo/util/net/sock.cpp b/src/mongo/util/net/sock.cpp
index 8e9517f..e649a43 100644
--- a/src/mongo/util/net/sock.cpp
+++ b/src/mongo/util/net/sock.cpp
@@ -824,6 +824,12 @@ namespace mongo {
     // isStillConnected() polls the socket at max every Socket::errorPollIntervalSecs to determine
     // if any disconnection-type events have happened on the socket.
     bool Socket::isStillConnected() {
+        if (_fd == -1) {
+            // According to the man page, poll will respond with POLLNVAL for invalid or
+            // unopened descriptors, but this doesn't seem to be properly implemented on
+            // some platforms - it can return 0 events and 0 for revents. Hence this workaround.
+            return false;
+        }
 
         if ( errorPollIntervalSecs < 0 ) return true;
         if ( ! isPollSupported() ) return true; // nothing we can do
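
For reference, here is a minimal standalone sketch (not MongoDB code) of the poll() behaviour behind this guard. POSIX specifies that a negative fd in the pollfd array is simply ignored (revents is set to 0), and reporting POLLNVAL for a closed but non-negative descriptor appears to vary by platform; either way, isStillConnected() may see no events at all for a dead socket and fall through to "still connected", which is why the explicit _fd == -1 check is needed. The probe() helper below is hypothetical and only mirrors the polling step.

#include <poll.h>
#include <unistd.h>
#include <cstdio>

// Hypothetical helper mirroring just the polling step of Socket::isStillConnected().
static void probe(int fd, const char* label) {
    pollfd p = {};
    p.fd = fd;
    p.events = POLLIN | POLLERR | POLLHUP;
    int n = poll(&p, 1, 0);   // non-blocking poll of a single descriptor
    printf("%s: poll()=%d revents=0x%x (POLLNVAL=0x%x)\n",
           label, n, (unsigned)p.revents, (unsigned)POLLNVAL);
}

int main() {
    int fds[2];
    if (pipe(fds) != 0) return 1;
    close(fds[0]);
    close(fds[1]);

    probe(fds[1], "closed fd");   // may report POLLNVAL, depending on the platform
    probe(-1, "fd == -1");        // per POSIX: entry ignored, poll() returns 0, revents == 0
    return 0;
}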

Given that this is a very simple fix for a logic bug of moderately high impact, can this please be backported to the v2.6 branch?



 Comments   
Comment by Githook User [ 07/May/15 ]

Author: Kevin Pulo <kevin.pulo@mongodb.com>

Message: SERVER-18096: don't try to reuse closed socket fds

Signed-off-by: Spencer T Brody <spencer@mongodb.com>
Branch: v2.6
https://github.com/mongodb/mongo/commit/7c51c3a17457f46aa55c4c419c15add471d4e232

Comment by Kevin Pulo [ 17/Apr/15 ]

Example of the impact of bouncing the FROM shard (with a simple reconfig in 2.6):

shard01:PRIMARY> var cfg = rs.conf()
shard01:PRIMARY> cfg.members[0].priority = 2
2
shard01:PRIMARY> rs.reconfig(cfg)
2015-04-14T12:57:34.871+1000 DBClientCursor::init call() failed
2015-04-14T12:57:34.872+1000 trying reconnect to 127.0.0.1:18059 (127.0.0.1) failed
2015-04-14T12:57:34.872+1000 reconnect 127.0.0.1:18059 (127.0.0.1) ok
reconnected to server after rs command (which is normal)

mongos> sh.moveChunk("test.test", {_id: MinKey}, "shard02")
{
        "cause" : {
                "errmsg" : "exception: all servers down/unreachable when querying: genique:18062,genique:18063,genique:18064",
                "code" : 8002,
                "ok" : 0
        },
        "code" : 8002,
        "ok" : 0,
        "errmsg" : "move failed"
}
mongos> sh.moveChunk("test.test", {_id: MinKey}, "shard02")
{
        "cause" : {
                "ok" : 0,
                "errmsg" : "moveChunk could not contact to: shard shard02 to start transfer :: caused by :: 9001 socket exception [SEND_ERROR] server [127.0.1.1:18059] "
        },
        "ok" : 0,
        "errmsg" : "move failed"
}
mongos> sh.moveChunk("test.test", {_id: MinKey}, "shard02")
{
        "cause" : {
                "cause" : {
                },
                "ok" : 0,
                "errmsg" : "_recvChunkCommit failed!"
        },
        "ok" : 0,
        "errmsg" : "move failed"
}
mongos> sh.moveChunk("test.test", {_id: MinKey}, "shard02")
{
        "cause" : {
                "ok" : 0,
                "errmsg" : "Failed to send migrate commit to configs because { $err: \"SyncClusterConnection::findOne prepare failed:  genique:18062 (127.0.1.1) failed:9001 socket exception [SEND_ERROR] server [127.0.1.1:18062]  genique:...\", code: 13104 }"
        },
        "ok" : 0,
        "errmsg" : "move failed"
}
mongos> sh.moveChunk("test.test", {_id: MinKey}, "shard02")
{ "millis" : 1506, "ok" : 1 }
