Core Server / SERVER-94141

Sporadic ReshardingCriticalSectionTimeout error

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 7.0 Required
    • Component/s: None
    • Operating System: ALL
    • Environment: Reproduced in 7.0.11 on FreeBSD 14.1 with default package set.

      The bug is that we sometimes encounter this error during resharding, after about 90 minutes. Background and details follow.

      command:

      db.adminCommand({
      	reshardCollection: "pb.work_dates",
      	key: {feedyard_id: "hashed", work_date: 1},
      	zones: [
      		{min: {feedyard_id: MinKey, work_date: MinKey},
      		 max: {feedyard_id: NumberLong("-4611686018427387904"), work_date: MinKey}, zone: "zone1"},
      		{min: {feedyard_id: NumberLong("-4611686018427387904"), work_date: MinKey},
      		 max: {feedyard_id: NumberLong("0"), work_date: MinKey}, zone: "zone2"},
      		{min: {feedyard_id: NumberLong("0"), work_date: MinKey},
      		 max: {feedyard_id: NumberLong("4611686018427387904"), work_date: MinKey}, zone: "zone3"},
      		{min: {feedyard_id: NumberLong("4611686018427387904"), work_date: MinKey},
      		 max: {feedyard_id: MaxKey, work_date: MaxKey}, zone: "zone4"},
      	],
      });
      

      output:

      {
              "ok" : 0,
              "errmsg" : "Resharding critical section timed out.",
              "code" : 342,
              "codeName" : "ReshardingCriticalSectionTimeout",
              "$clusterTime" : {
                      "clusterTime" : Timestamp(1724769403, 2245),
                      "signature" : {
                              "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                              "keyId" : NumberLong(0)
                      }
              },
              "operationTime" : Timestamp(1724769403, 2245)
      }
      

      Our current sharding scheme is suboptimal in a few critical ways, so we're planning an operation to reshard nearly all of our collections (22 in total), and just one of them hits this error. We have a test cluster that we can revert to a starting point with ZFS snapshots, which lets us run the exact same resharding process many times with controlled variations and test our project's code against the result.
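
      For reference, while these repeated runs are in flight the operation's progress can be watched with a $currentOp aggregation along these lines. This is only a sketch based on the standard resharding monitoring pipeline; nothing here is specific to us except the pb.work_dates namespace.

      db.getSiblingDB("admin").aggregate([
      	{$currentOp: {allUsers: true, localOps: false}},
      	{$match: {type: "op", "originatingCommand.reshardCollection": "pb.work_dates"}},
      ]);
      // Useful fields in the output include remainingOperationTimeEstimatedSecs,
      // documentsCopied, and oplogEntriesApplied.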

      One of our collections, called work_dates, can't be resharded by merely running reshardCollection. Doing so NEVER works and throws ReshardingCriticalSectionTimeout. However, if we drop two of the indexes (it doesn't matter which), it ALWAYS works. If we drop only one index, it SOMETIMES works, depending on which snapshot we clone from production; it never seems to depend on which index we drop. To dispel suspicions of coincidence: the "always", "sometimes", and "never" claims each come from around 10 attempts before claiming consistency.
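
      For illustration, the workaround looks roughly like this. It's only a sketch: the index key patterns below are hypothetical placeholders, since it doesn't seem to matter which two indexes we drop.

      const coll = db.getSiblingDB("pb").work_dates;
      // Drop two secondary indexes (placeholder key patterns, not our real ones).
      coll.dropIndex({pen_id: 1, work_date: 1});
      coll.dropIndex({lot_id: 1, work_date: 1});
      // Reshard with the same command as above; with two indexes gone it succeeds.
      db.adminCommand({
      	reshardCollection: "pb.work_dates",
      	key: {feedyard_id: "hashed", work_date: 1},
      	// zones: [...] as in the command above
      });
      // Recreate the dropped indexes afterwards.
      coll.createIndex({pen_id: 1, work_date: 1});
      coll.createIndex({lot_id: 1, work_date: 1});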

      Here's a brief idea of the "shape" of the database. Everything is sharded across 5 replica sets (4 shards plus the config server set), each a 3-member replica set (a primary and two secondaries, no arbiters). Currently the balancer chooses the ranges; during the resharding we specify explicit zone ranges to put exactly 1/4 of the range on each shard, as in the command above. work_dates has 6.4M documents, 262 GiB, and 7 indexes; all indexes besides the required _id are two-field compound indexes.

      All of the other collections reshard perfectly. Of those that work, one has more indexes (9) and one has more documents (281M), but none has more bytes. work_dates probably has the most bytes in indexes, which I suspect is why it's the only one to trigger ReshardingCriticalSectionTimeout.
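
      The "most bytes in indexes" suspicion can be checked with something like the following sketch, which just compares totalIndexSize() across the collections in our pb database; the output values aren't reproduced here.

      // Print total index bytes per collection to compare against work_dates.
      const pb = db.getSiblingDB("pb");
      pb.getCollectionNames().forEach(function (name) {
      	const gib = pb.getCollection(name).totalIndexSize() / (1024 * 1024 * 1024);
      	print(name + ": " + gib.toFixed(1) + " GiB of indexes");
      });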

      Although I've no idea what the error means, the fact that others reshard perfectly, and that we can make it work by dropping any two indexes, gives me confidence that it's not something weird about our configuration, but is a genuine bug.

            Assignee:
            Rhea Thorne (rhea.thorne@mongodb.com)
            Reporter:
            Leif Pedersen (leif@ofwilsoncreek.com)
            Votes:
            0
            Watchers:
            6

              Created:
              Updated:
              Resolved: