  1. Core Server
  2. SERVER-7271

On transient config server failure, mongod should not abort

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4.2, 2.5.0
    • Component/s: Sharding
    • Labels:
      None

      Description

      Certain failure modes of the SyncClusterConnection to the config server are safe failures: we are sure nothing has been written. Aborting mongod in these cases is incorrect, because our state is fully known (the migration failed). Roll back and continue instead.

      Original description:

      If the config server is offline for a short time during the critical section, causing a write error, but we are able to reconnect later, we should A) verify the write failure on all servers and B) locally roll back the new version, aborting the migration. Currently we fail hard and rely on the mongod restart to reload the config.

      Note this is different from the case when the config servers go down and stay down, since in that case we are unable to read the state of the metadata and so have to terminate.
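
The behaviour requested above can be sketched as a small decision function. This is only an illustration of the description, not the code from the actual fix: `resolve_commit`, `CommitOutcome`, and both callbacks are hypothetical names, and versions are simplified to plain integers.

```python
from enum import Enum
from typing import Callable, Optional


class CommitOutcome(Enum):
    COMMITTED = "committed"            # config servers show the new version
    ROLLED_BACK = "rolled_back"        # write verifiably failed; migration aborted, process keeps running
    MUST_TERMINATE = "must_terminate"  # metadata state is unknown, so continuing is unsafe


def resolve_commit(
    expected_version: int,
    previous_version: int,
    read_config_version: Callable[[], Optional[int]],
    rollback_local_version: Callable[[int], None],
) -> CommitOutcome:
    """Decide what to do after the commit write to the config servers errored.

    read_config_version is a hypothetical callback returning the version the
    config servers report, or None when they cannot be read at all.
    """
    observed = read_config_version()
    if observed is None:
        # Config servers went down and stayed down: state cannot be verified.
        return CommitOutcome.MUST_TERMINATE
    if observed == expected_version:
        # The write landed despite the reported error.
        return CommitOutcome.COMMITTED
    if observed == previous_version:
        # Safe failure: nothing was written, so undo the local version bump.
        rollback_local_version(previous_version)
        return CommitOutcome.ROLLED_BACK
    # Assumption of this sketch: any other version is treated as unknown state.
    return CommitOutcome.MUST_TERMINATE
```

Only the last branch goes beyond what the description says; treating an unexpected version as unknown state is a conservative choice made for this sketch.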

        Activity

        sverch Shaun Verch (Inactive) added a comment - - edited

        Setup: 9 servers, 3 shards with 3 replica set members each, MongoDB 2.2.

        One member (member1) of a replica set went down when its host crashed. That host also runs one of the three config servers. After that, another member (member2) crashed, apparently because it could not reach the crashed server. This is the log from member2:

            Tue Sep 25 13:34:56 [conn538] DBClientCursor::init call() failed
            Tue Sep 25 13:34:56 [conn538] scoped connection to config1:27019,config2:27019,config3:27019 not being returned to the pool
            Tue Sep 25 13:34:56 [conn538] warning: 13104 SyncClusterConnection::findOne prepare failed: 10276 DBClientBase::findN: transport error: config3:27019 ns: admin.$cmd query: { fsync: 1 } config3:27019:{}
            Tue Sep 25 13:34:56 [conn538] warning: moveChunk commit outcome ongoing: { applyOps: [ { op: "u", b: false, ns: "config.chunks", o: { _id: "db.coll1-uuid_"38f9dbbe-86ec-444b-9e6a-483eab0f9bb2"_id_ObjectId('50444151e4b0c4a3a8c5cf74')", lastmod: Timest$
            Tue Sep 25 13:34:57 [rsHealthPoll] couldn't connect to member1:27018: couldn't connect to server member1:27018
            Tue Sep 25 13:34:59 [rsHealthPoll] couldn't connect to member1:27018: couldn't connect to server member1:27018
            Tue Sep 25 13:35:01 [rsHealthPoll] couldn't connect to member1:27018: couldn't connect to server member1:27018
            Tue Sep 25 13:35:01 [rsHealthPoll] couldn't connect to member1:27018: couldn't connect to server member1:27018
            Tue Sep 25 13:35:01 [rsHealthPoll] couldn't connect to member1:27018: couldn't connect to server member1:27018
            Tue Sep 25 13:35:03 [rsHealthPoll] couldn't connect to member1:27018: couldn't connect to server member1:27018
            Tue Sep 25 13:35:05 [rsHealthPoll] couldn't connect to member1:27018: couldn't connect to server member1:27018
            Tue Sep 25 13:35:06 [conn538] ERROR: moveChunk commit failed: version is at907|1||000000000000000000000000 instead of 908|1||50604b9fb961dd917fdc2316
            Tue Sep 25 13:35:06 [conn538] ERROR: TERMINATING
            Tue Sep 25 13:35:06 dbexit:
            Tue Sep 25 13:35:06 [conn538] shutdown: going to close listening sockets...
            Tue Sep 25 13:35:06 [conn538] closing listening socket: 6
            Tue Sep 25 13:35:06 [conn538] closing listening socket: 7
            Tue Sep 25 13:35:06 [conn538] shutdown: going to flush diaglog...
            Tue Sep 25 13:35:06 [conn538] shutdown: going to close sockets...
            Tue Sep 25 13:35:06 [conn538] shutdown: waiting for fs preallocator...
            Tue Sep 25 13:35:06 [conn538] shutdown: lock for final commit...
            Tue Sep 25 13:35:06 [conn538] shutdown: final commit...
            Tue Sep 25 13:35:06 [conn1] end connection member2_IP:41925 (21 connections now open)
            Tue Sep 25 13:35:06 [initandlisten] now exiting
            Tue Sep 25 13:35:06 dbexit: ; exiting immediately

        In this case, a config server went down in the middle of a moveChunk command, and the remaining config servers became read-only. The moveChunk command attempted to write the new config with applyOps, then read the config back, found the old version (907|1 instead of the expected 908|1), and terminated mongod.
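
The version check visible in the "moveChunk commit failed" log line can be reproduced with a short sketch. It assumes the version strings have the layout `major|minor||epoch`, which is inferred from the log output; `ChunkVersion` and `parse_chunk_version` are illustrative names defined here, not the server's API.

```python
from typing import NamedTuple


class ChunkVersion(NamedTuple):
    major: int
    minor: int
    epoch: str


def parse_chunk_version(text: str) -> ChunkVersion:
    """Parse a version string such as '908|1||50604b9fb961dd917fdc2316'."""
    # Split off the epoch first, then the major and minor counters.
    numbers, _, epoch = text.partition("||")
    major, minor = numbers.split("|")
    return ChunkVersion(int(major), int(minor), epoch)


# Values copied from the log line above.
observed = parse_chunk_version("907|1||000000000000000000000000")
expected = parse_chunk_version("908|1||50604b9fb961dd917fdc2316")

# The commit counts as applied only if the config servers report the new version.
commit_applied = (observed.major, observed.minor) == (expected.major, expected.minor)
```

With the values from the log, `commit_applied` is false: the config servers still report major version 907, so the applyOps write did not take effect.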

        auto auto (Inactive) added a comment -

        Author: Alberto Lerner <alerner@10gen.com>
        Date: 2013-03-31T20:08:37Z

        Message: SERVER-7271 Do not exit if a transient config server error aborts a migration.
        Branch: master
        https://github.com/mongodb/mongo/commit/356f8a74eb794432e8f01afc0557b78219636cb8

        auto auto (Inactive) added a comment -

        Author: Alberto Lerner <alerner@10gen.com>
        Date: 2013-03-31T20:08:37Z

        Message: SERVER-7271 Do not exit if a transient config server error aborts a migration.

        Conflicts:

        src/mongo/client/distlock.cpp
        Branch: v2.4
        https://github.com/mongodb/mongo/commit/4a472d8df3a07679edfe36339490a50830408843

        auto auto (Inactive) added a comment -

        Author: Dan Pasette <dan@10gen.com>
        Date: 2013-04-02T16:06:20Z

        Message: SERVER-7271 - Fix compile error with ScopedDbConnection
        Branch: v2.4
        https://github.com/mongodb/mongo/commit/81212176da8a2b0e3791f0644b562eb66c000219

        auto auto (Inactive) added a comment -

        Author: Alberto Lerner <alerner@10gen.com>
        Date: 2013-04-09T19:59:22Z

        Message: SERVER-7271 Improve error reporting when migrate commit fails.
        Branch: master
        https://github.com/mongodb/mongo/commit/0077dc8ecb111075177c66abc8a5b7809a8547a0

        auto auto (Inactive) added a comment -

        Author: Alberto Lerner <alerner@10gen.com>
        Date: 2013-04-09T19:59:22Z

        Message: SERVER-7271 Improve error reporting when migrate commit fails.
        Branch: v2.4
        https://github.com/mongodb/mongo/commit/ab76660a759117bf60d5b54d0b0d257d105fb55b


          People

          • Votes: 4
          • Watchers: 8
