Core Server / SERVER-3739

mongos: "too many attempts to update config, failing"

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • None
    • Affects Version/s: 1.8.1
    • Component/s: Sharding
    • Labels:
      None
    • Environment:
      sharded cluster with three shards, three config servers, connecting through mongos
    • ALL

      Not sure what's happening here, but mongos seems to be very confused about which databases exist. It threw errors at the application with the message "too many attempts to update config, failing", and at the same time this appeared in the mongos log:

      Thu Sep  1 07:41:44 [conn6] SyncClusterConnection connecting to [richcollconf03:28100]
      Thu Sep  1 07:43:43 [conn2] couldn't find database [complete_20110828] in config db
      Thu Sep  1 07:43:43 [conn2]      put [complete_20110828] on: richcollshard2:richcollshard2/richcolldb03.byburt.com:27017,richcolldb04
      Thu Sep  1 07:46:10 [LockPinger] dist_lock pinged successfully for: richassembler03.byburt.com:1314862900:1804289383
      Thu Sep  1 07:47:59 [mongosMain] dbexit: received signal 15 rc:0 received signal 15
      

      Then it died.

      Running "show dbs" in the mongo console while connected to the mongos clearly shows that the database in question exists.
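The mismatch described above (a database that "show dbs" lists but that mongos cannot find in the config db) could be checked with something like the following. This is only a sketch: `missing_from_config` and `check_cluster` are hypothetical helper names, and it assumes a Python driver such as pymongo that can reach the mongos and that the sharding metadata lives in `config.databases` with the database name as `_id`.

```python
def missing_from_config(listed_names, config_docs):
    """Return listed databases that have no entry in config.databases."""
    known = {doc["_id"] for doc in config_docs}
    internal = {"admin", "config", "local"}  # internal databases, not user data
    return sorted(
        name for name in listed_names
        if name not in known and name not in internal
    )


def check_cluster(uri):
    """Hypothetical wrapper: compare the database listing with config.databases."""
    # Imported lazily so missing_from_config works without a driver installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    listed = client.list_database_names()
    docs = client["config"]["databases"].find({}, {"_id": 1})
    return missing_from_config(listed, docs)
```

A non-empty result from `check_cluster` pointed at the mongos would name the databases that exist on a shard but are absent from the sharding metadata.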

      This is not the first problem we've encountered where mongos is confused about which databases exist, and frankly we're getting scared of using sharding because it's so easily corrupted. I haven't found or heard of any way to fix the problem other than wiping the whole cluster and starting over.

      If you're wondering about the date in the database name, we use an application-side partitioning scheme, mostly because we need to remove old data, but also partly because it's so easy to get a corrupted sharding config, and in such a case we want as little of our active data in that database as possible.
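A date-based partitioning scheme of this kind might look roughly like the sketch below. It assumes one database per day named `complete_YYYYMMDD`, inferred from the names in the logs; the actual granularity and retention period are not stated in the report, so `keep_days` and both function names are hypothetical.

```python
from datetime import date, datetime, timedelta

PREFIX = "complete"  # taken from the database names seen in the logs


def db_name_for(day, prefix=PREFIX):
    """Build the database name for a given date, e.g. complete_20110901."""
    return f"{prefix}_{day:%Y%m%d}"


def expired_databases(names, today, keep_days, prefix=PREFIX):
    """Return the partition databases older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        head, _, stamp = name.rpartition("_")
        if head != prefix:
            continue  # not one of our partition databases
        try:
            day = datetime.strptime(stamp, "%Y%m%d").date()
        except ValueError:
            continue  # suffix is not a date, leave it alone
        if day < cutoff:
            expired.append(name)
    return sorted(expired)
```

Removing old data then amounts to dropping each database returned by `expired_databases`, which keeps the damage from a corrupted config limited to one partition.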

      This may be related to SERVER-3738, which happened at roughly the same time.

      This is some more context from the mongos logs:

      Thu Sep  1 07:41:41 [conn7] SyncClusterConnection connecting to [richcollconf01:28100]
      Thu Sep  1 07:41:41 [conn7] SyncClusterConnection connecting to [richcollconf02:28100]
      Thu Sep  1 07:41:41 [conn3] SyncClusterConnection connecting to [richcollconf01:28100]
      Thu Sep  1 07:41:41 [conn7] SyncClusterConnection connecting to [richcollconf03:28100]
      Thu Sep  1 07:41:41 [conn3] SyncClusterConnection connecting to [richcollconf02:28100]
      Thu Sep  1 07:41:41 [conn3] SyncClusterConnection connecting to [richcollconf03:28100]
      Thu Sep  1 07:41:41 [conn8] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
      Thu Sep  1 07:41:42 [conn7] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
      Thu Sep  1 07:41:42 [conn5] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
      Thu Sep  1 07:41:42 [conn2] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
      Thu Sep  1 07:41:43 [conn6] warning: splitChunk failed - cmd: { splitChunk: "complete_20110901.exposures", keyPattern: { _id: 1 }, min: { _id: MinKey }, max: { _id: "3LLLLL" }, from: "richcollshard2/richcolldb03.byburt.com:27017,richcolldb
      Thu Sep  1 07:41:43 [conn7] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
      Thu Sep  1 07:41:43 [conn4] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
      Thu Sep  1 07:41:43 [conn5] warning: splitChunk failed - cmd: { splitChunk: "complete_20110901.exposures", keyPattern: { _id: 1 }, min: { _id: MinKey }, max: { _id: "3LLLLL" }, from: "richcollshard2/richcolldb03.byburt.com:27017,richcolldb
      Thu Sep  1 07:41:43 [conn5] SyncClusterConnection connecting to [richcollconf01:28100]
      Thu Sep  1 07:41:43 [conn5] SyncClusterConnection connecting to [richcollconf02:28100]
      Thu Sep  1 07:41:43 [conn5] SyncClusterConnection connecting to [richcollconf03:28100]
      Thu Sep  1 07:41:43 [conn4] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
      Thu Sep  1 07:41:43 [conn6] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
      Thu Sep  1 07:41:43 [conn6] SyncClusterConnection connecting to [richcollconf01:28100]
      Thu Sep  1 07:41:43 [conn6] SyncClusterConnection connecting to [richcollconf02:28100]
      Thu Sep  1 07:41:43 [conn6] SyncClusterConnection connecting to [richcollconf03:28100]
      Thu Sep  1 07:41:43 [conn6] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 1
      Thu Sep  1 07:41:43 [conn2] autosplitted complete_20110901.exposures shard: ns:complete_20110901.exposures at: richcollshard2:richcollshard2/richcolldb03.byburt.com:27017,richcolldb04 lastmod: 9|2 min: { _id: "SSSSSO" } max: { _id: "WEEEE9
      Thu Sep  1 07:41:43 [conn6] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
      Thu Sep  1 07:41:43 [conn4] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
      Thu Sep  1 07:41:43 [conn5] autosplitted complete_20110901.exposures shard: ns:complete_20110901.exposures at: richcollshard2:richcollshard2/richcolldb03.byburt.com:27017,richcolldb04 lastmod: 11|3 min: { _id: "001F5CLQTKHLAAE4" } max: { _
      Thu Sep  1 07:41:43 [conn7] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
      Thu Sep  1 07:41:43 [conn2] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
      Thu Sep  1 07:41:43 [conn8] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
      Thu Sep  1 07:41:43 [conn4] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
      Thu Sep  1 07:41:43 [mongosMain] connection accepted from 127.0.0.1:34480 #10
      Thu Sep  1 07:41:43 [mongosMain] connection accepted from 127.0.0.1:34481 #11
      Thu Sep  1 07:41:44 [conn6] SyncClusterConnection connecting to [richcollconf01:28100]
      Thu Sep  1 07:41:44 [conn6] SyncClusterConnection connecting to [richcollconf02:28100]
      Thu Sep  1 07:41:44 [conn6] SyncClusterConnection connecting to [richcollconf03:28100]
      Thu Sep  1 07:43:43 [conn2] couldn't find database [complete_20110828] in config db
      Thu Sep  1 07:43:43 [conn2]      put [complete_20110828] on: richcollshard2:richcollshard2/richcolldb03.byburt.com:27017,richcolldb04
      Thu Sep  1 07:46:10 [LockPinger] dist_lock pinged successfully for: richassembler03.byburt.com:1314862900:1804289383
      Thu Sep  1 07:47:59 [mongosMain] dbexit: received signal 15 rc:0 received signal 15
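To pull the relevant events out of a longer mongos log, a small filter along these lines could help. It is a sketch that assumes the line layout shown in the excerpt above (timestamp, bracketed thread name, message); the function name and regular expressions are illustrative, not part of any MongoDB tool.

```python
import re

# Matches lines such as: "Thu Sep  1 07:43:43 [conn2] <message>"
LINE = re.compile(
    r"^\s*(?P<ts>\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d) "
    r"\[(?P<thread>[^\]]+)\] (?P<msg>.*)$"
)
MISSING_DB = re.compile(r"couldn't find database \[(?P<db>[^\]]+)\] in config db")


def find_missing_db_events(lines):
    """Return (timestamp, thread, database) for each 'couldn't find database' line."""
    events = []
    for line in lines:
        parsed = LINE.match(line)
        if not parsed:
            continue  # skip lines that do not follow the expected layout
        hit = MISSING_DB.search(parsed.group("msg"))
        if hit:
            events.append((parsed.group("ts"), parsed.group("thread"), hit.group("db")))
    return events
```

Feeding it the lines of the mongos log file gives a list of the databases mongos failed to find and when, which can then be compared against what the config servers actually hold.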
      

            Assignee:
            greg_10gen Greg Studer
            Reporter:
            iconara Theo Hultberg
            Votes:
            2
            Watchers:
            5
