Core Server / SERVER-14863

Mongos ReplicaSetMonitorWatcher continues to monitor drained/removed shard


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Gone away
    • Affects Version/s: 2.4.10, 2.6.3, 2.7.5
    • Fix Version/s: None
    • Component/s: Sharding
    • Sharding
    • Operating System: ALL

    Steps To Reproduce

      Create a sharded cluster with multiple mongos processes connected to it. For example, using mlaunch:

      mlaunch init --sharded 2 --replicaset 3 --mongos 4
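
      A quick way to confirm that both shards registered before continuing (listing only the shard names, since the host strings depend on your local port layout):

      mongos> db.adminCommand({listShards: 1}).shards.map(function(s) { return s._id; })
      [ "shard01", "shard02" ]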

      Enable sharding on a database and shard a collection:

      > ./mongo
      MongoDB shell version: 2.7.5-pre-
      connecting to: test
      mongos> sh.enableSharding("foo")
      { "ok" : 1 }
      mongos> sh.shardCollection("foo.dummydata", {name: "hashed"})
      { "collectionsharded" : "foo.dummydata", "ok" : 1 }

      Insert some dummy data:

      mongos> for (var i = 1; i <= 2000; i++) db.dummydata.insert( { name : i, foo: "Lorem ipsum dolor sic amet." } )
      WriteResult({ "nInserted" : 1 })
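
      As a quick check that the hashed key spread the documents across both shards, the shell's getShardDistribution() helper reports per-shard document and chunk counts (the exact numbers will vary from run to run):

      mongos> db.dummydata.getShardDistribution()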

      Increase the logging on each of your mongos processes to logging level 4:

      > ./mongo --port 27017
      MongoDB shell version: 2.7.5-pre-
      connecting to: 127.0.0.1:27017/test
      mongos> use admin
      switched to db admin
      mongos> db.runCommand({setParameter:1, logLevel: 4})
      { "was" : 0, "ok" : 1 }
      mongos> quit()
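
      Doing this on every router by hand gets tedious; assuming mlaunch's default layout put the four mongos processes on ports 27017-27020, a one-line shell loop does the same thing (adjust the ports to match your setup):

      > for port in 27017 27018 27019 27020; do ./mongo --quiet --port $port --eval 'printjson(db.getSiblingDB("admin").runCommand({setParameter: 1, logLevel: 4}))'; done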

      Start the draining process:

      mongos> use admin
      switched to db admin
      mongos> db.runCommand({removeShard:"shard02"})
      {
              "msg" : "draining started successfully",
              "state" : "started",
              "shard" : "shard02",
              "ok" : 1
      }
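
      While the migrations are running, re-issuing the same command reports the draining progress. The counts below are only illustrative; the exact numbers depend on how far the balancer has got:

      mongos> db.runCommand({removeShard: "shard02"})
      {
              "msg" : "draining ongoing",
              "state" : "ongoing",
              "remaining" : {
                      "chunks" : NumberLong(12),
                      "dbs" : NumberLong(0)
              },
              "ok" : 1
      }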

      After the chunks have finished draining, run it a second time to remove the shard:

      mongos> use admin
      switched to db admin
      mongos> db.runCommand({removeShard: "shard02"})
      {
              "msg" : "removeshard completed successfully",
              "state" : "completed",
              "shard" : "shard02",
              "ok" : 1
      }
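
      At this point the cluster metadata should list only the remaining shard, which you can confirm from any mongos:

      mongos> db.adminCommand({listShards: 1}).shards.map(function(s) { return s._id; })
      [ "shard01" ]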

      Run flushRouterConfig on each mongos:

      > ./mongo --port 27018
      MongoDB shell version: 2.7.5-pre-
      connecting to: 127.0.0.1:27018/test
      mongos> use admin
      switched to db admin
      mongos> db.adminCommand({flushRouterConfig: 1})
      { "flushed" : true, "ok" : 1 }
      mongos> quit()
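
      As with the log level change, this can be looped over all of the routers (again assuming the four mongos processes are on ports 27017-27020):

      > for port in 27017 27018 27019 27020; do ./mongo --quiet --port $port --eval 'printjson(db.adminCommand({flushRouterConfig: 1}))'; done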

      The mongos through which you performed the removal now only checks replica set shard01:

      > grep "checking replica set" mongos_27017.log | tail -n 10
      2014-08-12T12:20:03.373+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T12:20:13.377+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T12:20:23.382+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T12:20:33.387+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T12:20:43.392+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T12:20:53.396+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T12:21:03.401+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T12:21:13.406+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T12:21:23.413+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T12:21:33.417+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01

      However, the other mongos processes will still have a ReplicaSetMonitorWatcher checking for shard02:

      > grep "checking replica set" mongos_27018.log | tail -n 10
      2014-08-12T12:22:33.512+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard02
      2014-08-12T12:22:33.515+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T12:22:43.520+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard02
      2014-08-12T12:22:43.523+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T12:22:53.527+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard02
      2014-08-12T12:22:53.535+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T12:23:03.539+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard02
      2014-08-12T12:23:03.541+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T12:23:13.546+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard02
      2014-08-12T12:23:13.549+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01

      After restarting an affected mongos, it no longer monitors the removed shard:

      > mlaunch stop 27018
      1 node stopped.
      > mlaunch start 27018
      launching: /Users/victorhooi/code/mongo/mongos on port 27018
      > cd data/mongos/
      > grep "checking replica set" mongos_27018.log | tail -n 10
      2014-08-12T14:01:30.956+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T14:01:40.962+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T14:01:50.974+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T14:02:00.980+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T14:02:10.985+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T14:02:20.991+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T14:02:30.997+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T14:02:41.002+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T14:02:51.007+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01
      2014-08-12T14:03:01.014+1000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: shard01


    Description

      We have a MongoDB sharded cluster with two shards and multiple mongos processes connected to it.

      Through one of the mongos processes, we drain and then remove one of the shards. We also run flushRouterConfig on the other mongos processes.

      The other mongos processes continue to run a ReplicaSetMonitorWatcher that checks the removed shard. Restarting each affected mongos seems to be the only way to get it to recognise that the shard has been removed.

      I have tested the above behaviour against 2.4.10, 2.6.3 and 2.7.5 (Git version c184143fa4d8a4fdf4fdc684404d4aad3e55794b).
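
      For what it's worth, the stale monitor can also be observed without raising the log level: the connPoolStats output on a mongos includes a replicaSets section listing the sets it is still tracking, and on an affected router shard02 still appears there (the exact field layout varies a little between versions):

      mongos> db.adminCommand({connPoolStats: 1})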

    People

      Assignee: backlog-server-sharding (Backlog - Sharding Team)
      Reporter: victor.hooi (Victor Hooi)
      Votes: 9
      Watchers: 16