Type: Bug
Resolution: Done
Priority: Major - P3
Fix Version/s: 2.6.10
Affects Version/s: 2.6.9
Component/s: Networking
Labels: None

Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 3 05/15/15
CAR Domain/s: None


When a shard primary steps down (relinquishes its primary role), it closes all incoming and outgoing connections. This is normal and necessary. However, if it later becomes primary again, it incorrectly tries to reuse the now-closed outgoing sockets to the configsvrs and to members of other shards (ReplicaSetMonitorWatcher).
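The reuse problem described above can be illustrated with a minimal sketch of a cached outgoing connection that reconnects when its descriptor has been closed. This is not the MongoDB connection code: `CachedConn`, `acquire`, `closeNow`, and `openScratchFd` are hypothetical names, and a pipe stands in for a real socket.

```cpp
#include <functional>
#include <unistd.h>
#include <utility>

// Hypothetical cached outgoing connection (illustration only, not MongoDB code).
// It stores -1 once the descriptor has been closed and reconnects lazily.
class CachedConn {
public:
    explicit CachedConn(std::function<int()> connect)
        : _connect(std::move(connect)), _fd(-1), _connects(0) {}
    ~CachedConn() { closeNow(); }
    CachedConn(const CachedConn&) = delete;
    CachedConn& operator=(const CachedConn&) = delete;

    // Closes the descriptor (as happens on step-down) and records that fact.
    void closeNow() {
        if (_fd != -1) {
            ::close(_fd);
            _fd = -1;
        }
    }

    // Returns a descriptor, reconnecting first if the cached one was closed.
    // If the connect callback fails and returns -1, that -1 is returned as-is.
    int acquire() {
        if (_fd == -1) {
            _fd = _connect();
            ++_connects;
        }
        return _fd;
    }

    int connects() const { return _connects; }

private:
    std::function<int()> _connect;
    int _fd;
    int _connects;
};

// Stand-in for "open a connection": returns the write end of a fresh pipe,
// or -1 if the pipe could not be created.
int openScratchFd() {
    int fds[2];
    if (::pipe(fds) != 0) return -1;
    ::close(fds[0]);
    return fds[1];
}
```

With this shape, a caller that does `conn.closeNow()` on step-down and `conn.acquire()` after step-up gets a fresh descriptor instead of the stale one.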

Since these fds have been closed and are no longer valid, the logfile fills with "Bad file descriptor" (errno = EBADF) messages. The connections are not automatically re-established, so subsequent chunk migrations fail (as, probably, do other operations that require the shards to write to the configsvrs).
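For reference, the EBADF behaviour itself is easy to reproduce outside the server. The following is a small POSIX sketch, assuming a pipe as a stand-in for a socket; the helper names are hypothetical and not taken from the MongoDB source.

```cpp
#include <cerrno>
#include <unistd.h>

// Writes one byte to a descriptor that has already been closed and returns
// the resulting errno (0 if the write unexpectedly succeeded).
int errnoFromWriteToClosedFd() {
    int fds[2];
    if (::pipe(fds) != 0) return errno;  // could not create the pipe at all
    ::close(fds[0]);
    ::close(fds[1]);                     // fds[1] no longer names an open file
    errno = 0;
    const char byte = 'x';
    ssize_t n = ::write(fds[1], &byte, 1);
    return n == -1 ? errno : 0;
}

// Same check for the sentinel value -1, which never names an open file.
int errnoFromWriteToMinusOne() {
    errno = 0;
    const char byte = 'x';
    ssize_t n = ::write(-1, &byte, 1);
    return n == -1 ? errno : 0;
}
```

Both helpers are expected to return EBADF, which `strerror` renders as "Bad file descriptor" on typical platforms.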

The actual impact depends on whether the FROM or TO shard has "bounced" (step-down/step-up).

FROM shard bounce => next 4 migrations fail
TO shard bounce => next 3 migrations fail
FROM and TO shard bounce => next 8 migrations fail

Initially the failures occur early in the migration process, but subsequent migrations fail later in the process, notably after documents have been transferred (causing orphaned documents). In some of these failures, SERVER-17066 means that the resulting orphans cannot be cleaned up by cleanupOrphaned.

Attached are a jstest reproducer and wrapper script suitable for "git bisect run".

This only affects 2.6; it was incidentally fixed in 3.0. Using git bisect shows that commit fbbb0d2a1d845728cd714272199a652573e2f27d (SERVER-15593) fixed the issue. However, that ticket addresses a different problem, and the bulk of the commit is completely unrelated.

I have confirmed that the following hunk alone is sufficient to fix the problem:

diff --git a/src/mongo/util/net/sock.cpp b/src/mongo/util/net/sock.cpp
index 8e9517f..e649a43 100644
--- a/src/mongo/util/net/sock.cpp
+++ b/src/mongo/util/net/sock.cpp
@@ -824,6 +824,12 @@ namespace mongo {
     // isStillConnected() polls the socket at max every Socket::errorPollIntervalSecs to determine
     // if any disconnection-type events have happened on the socket.
     bool Socket::isStillConnected() {
+        if (_fd == -1) {
+            // According to the man page, poll will respond with POLLVNAL for invalid or
+            // unopened descriptors, but it doesn't seem to be properly implemented in
+            // some platforms - it can return 0 events and 0 for revent. Hence this workaround.
+            return false;
+        }

         if ( errorPollIntervalSecs < 0 ) return true;
         if ( ! isPollSupported() ) return true; // nothing we can do

Given that this is a very simple fix for a logic bug of moderately high impact, can this please be backported to the v2.6 branch?
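As a standalone illustration of why the `_fd == -1` guard in the hunk matters: POSIX specifies that poll() ignores entries whose fd is negative and sets their revents to 0, so a poll-based liveness check on a closed socket wrapper sees no error event at all. (The constant the commit comment refers to is POLLNVAL.) The sketch below is not the MongoDB implementation; `pollOnce` and `looksConnected` are hypothetical names.

```cpp
#include <poll.h>

// Runs a zero-timeout poll() on fd. Returns false if poll() itself failed;
// otherwise stores the reported events in revents and returns true.
bool pollOnce(int fd, short& revents) {
    struct pollfd pfd;
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    if (::poll(&pfd, 1, 0) < 0) return false;
    revents = pfd.revents;
    return true;
}

// Liveness check loosely modelled on the hunk: short-circuit on the closed
// sentinel, then treat POLLERR, POLLHUP, or POLLNVAL as "disconnected".
bool looksConnected(int fd) {
    if (fd == -1) return false;  // without this, poll() would report nothing
    short revents = 0;
    if (!pollOnce(fd, revents)) return false;
    return (revents & (POLLERR | POLLHUP | POLLNVAL)) == 0;
}
```

Calling `pollOnce(-1, revents)` succeeds with `revents == 0`, so only the explicit check lets `looksConnected(-1)` return false.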


Attachments:
shard_primary_relinquish_migrate.js (4 kB, Apr 17 2015 03:00:01 AM UTC)
shard_primary_relinquish_migrate.sh (0.6 kB, Apr 17 2015 03:00:01 AM UTC)

Issue Links:
is related to: SERVER-18358 Add shard_primary_relinquish_migrate jstest from SERVER-18096 (Backlog)
related to: SERVER-15022 TO-shard fails to accept new chunk in inactive clusters after first reconfig (Closed)
related to: SERVER-15593 Initial autosplit heuristics are very aggressive when config servers are down (Closed)
related to: SERVER-17066 cleanupOrphaned misses orphans after failed chunk migration (Closed)
links to: Pull Request

Assignee: Kevin Pulo
Reporter: Kevin Pulo
Participants: Githook User, Kevin Pulo
Votes: 1
Watchers: 5

Created: Apr 17 2015 03:00:01 AM UTC
Updated: May 19 2015 06:17:51 PM UTC
Resolved: May 07 2015 03:23:15 PM UTC
