Too long migration critical section on conf replicaset failover (~15 minutes)

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: 8.0.17
    • Component/s: None
    • ALL
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      We had a cluster outage for 15 minutes during a loss of connectivity to one availability zone.

      Timeline:

      • at 13:42:24, the migration of the chunk from `rs04` to `rs12` began;
      • at 13:42:26, the completion of the migration of the chunk began (Migration successfully entered critical section);
      • at 13:42 lost of connectivity to conf replicaset PRIMARY (`mongo-conf-c`);
      • at 13:42:35 PRIMARY made a failover from `mongo-conf-c` to `mongo-conf-b` on the conf replicaset;
      • requests affecting `rs04` and `rs12` started to stick and keep connecting to clients;
      • clients were disconnected from mongos by timeout, but for mongos they were still connected ("connectionCount":10000);
      • at 13:58:23 the mongod migration process concluded that it could not reach the old PRIMARY on the conf replicaset (`mongo-conf-c`, `HostUnreachable: onInvoke :: caused by :: Connection timed out`);
      • at 13:58:24 the migration of the chunk leave critical section (Finished critical section);
      • requests that were waiting for the chunk migration started to complete (~15 minutes after they started);
      • the situation has returned to normal without external interference;
      • at 13:58:36 migration finished
      • a couple of hours later, the lost connectivity to availability zone was restored.

      An example of mongos client connection errors:

      mongos-01: 2026-02-18 13:44:03.000+00 I NETWORK [listener] {"t":{"$date":"2026-02-18T13:44:03.000+00:00"},"s":"I",  "c":"NETWORK",  "id":22942,   "svc":"-", "ctx":"listener","msg":"Connection refused because there are too many open connections","attr":{"remote":"[3fff::2054:a494:0]:60846","isLoadBalanced":false,"uuid":{"uuid":{"$uuid":"4c0f3fa4-b0d8-4c3a-9d05-4f426a49a42a"}},"connectionId":10223070,"connectionCount":10000}}
      

      Failover on conf replicaset:

      mongo-conf-a: 2026-02-18 13:42:35.536+00 I ELECTION [ReplCoord-3412] {"t":{"$date":"2026-02-18T13:42:35.536+00:00"},"s":"I",  "c":"ELECTION", "id":4615652, "svc":"S", "ctx":"ReplCoord-3412","msg":"Starting an election, since we've seen no PRIMARY in election timeout period","attr":{"electionTimeoutPeriodMillis":10000}}
      mongo-conf-a: 2026-02-18 13:42:35.546+00 I REPL [ReplCoord-3412] {"t":{"$date":"2026-02-18T13:42:35.546+00:00"},"s":"I",  "c":"REPL",     "id":21358,   "svc":"S", "ctx":"ReplCoord-3412","msg":"Replica set state transition","attr":{"newState":"PRIMARY","oldState":"SECONDARY"}}
      mongo-conf-b: 2026-02-18 13:42:35.555+00 I REPL [ReplCoord-1062] {"t":{"$date":"2026-02-18T13:42:35.555+00:00"},"s":"I",  "c":"REPL",     "id":21215,   "svc":"S", "ctx":"ReplCoord-1062","msg":"Member is in new state","attr":{"hostAndPort":"mongo-conf-a:27019","newState":"PRIMARY"}}
      mongo-conf-c: 2026-02-18 13:42:35.727+00 I REPL [ReplCoord-1393] {"t":{"$date":"2026-02-18T13:42:35.728+00:00"},"s":"I",  "c":"REPL",     "id":21358,   "svc":"S", "ctx":"ReplCoord-1393","msg":"Replica set state transition","attr":{"newState":"SECONDARY","oldState":"PRIMARY"}}
      

      Mongod chunk migration logs:

      mongo-rs12-a: 2026-02-18 13:42:24.357+00 I MIGRATE [migrateThread] {"t":{"$date":"2026-02-18T13:42:24.357+00:00"},"s":"I",  "c":"MIGRATE",  "id":22000,   "svc":"S", "ctx":"migrateThread","msg":"Starting receiving end of chunk migration","attr":{"chunkMin":{"_id":"0e8f102c7df0a60a646bf35753102cd9_c2d597d3362c4a2b9a272cfba2fd69d2"},"chunkMax":{"_id":"0e92b9e8207c055fe2b972928d201e6f_0f7a2e55f0714d6f8e351a2319615967"},"namespace":"data-storage.calculations","fromShard":"rs04","sessionId":"rs04_rs12_6995c1c0b013c341feda9939","migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}}}}
      mongo-rs12-a: 2026-02-18 13:42:24.362+00 I SHARDING [ChunkMigrationFetchers-0] {"t":{"$date":"2026-02-18T13:42:24.363+00:00"},"s":"I",  "c":"SHARDING", "id":6718405, "svc":"S", "ctx":"ChunkMigrationFetchers-0","msg":"Chunk migration data fetch start","attr":{"migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}}}}
      mongo-rs12-a: 2026-02-18 13:42:24.516+00 I SHARDING [ChunkMigrationFetchers-0] {"t":{"$date":"2026-02-18T13:42:24.517+00:00"},"s":"I",  "c":"SHARDING", "id":6718416, "svc":"S", "ctx":"ChunkMigrationFetchers-0","msg":"Chunk migration initial clone fetch end","attr":{"migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}},"batchSize":16760571,"fetchMillis":153}}
      mongo-rs12-a: 2026-02-18 13:42:24.714+00 I SHARDING [ChunkMigrationFetchers-0] {"t":{"$date":"2026-02-18T13:42:24.715+00:00"},"s":"I",  "c":"SHARDING", "id":6718416, "svc":"S", "ctx":"ChunkMigrationFetchers-0","msg":"Chunk migration initial clone fetch end","attr":{"migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}},"batchSize":16769086,"fetchMillis":198}}
      mongo-rs12-a: 2026-02-18 13:42:24.845+00 I SHARDING [ChunkMigrationFetchers-0] {"t":{"$date":"2026-02-18T13:42:24.845+00:00"},"s":"I",  "c":"SHARDING", "id":6718416, "svc":"S", "ctx":"ChunkMigrationFetchers-0","msg":"Chunk migration initial clone fetch end","attr":{"migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}},"batchSize":16769570,"fetchMillis":129}}
      mongo-rs12-a: 2026-02-18 13:42:25.019+00 I SHARDING [ChunkMigrationFetchers-0] {"t":{"$date":"2026-02-18T13:42:25.019+00:00"},"s":"I",  "c":"SHARDING", "id":6718416, "svc":"S", "ctx":"ChunkMigrationFetchers-0","msg":"Chunk migration initial clone fetch end","attr":{"migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}},"batchSize":16766606,"fetchMillis":174}}
      mongo-rs12-a: 2026-02-18 13:42:25.276+00 I SHARDING [ChunkMigrationFetchers-0] {"t":{"$date":"2026-02-18T13:42:25.276+00:00"},"s":"I",  "c":"SHARDING", "id":6718416, "svc":"S", "ctx":"ChunkMigrationFetchers-0","msg":"Chunk migration initial clone fetch end","attr":{"migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}},"batchSize":16768773,"fetchMillis":256}}
      mongo-rs12-a: 2026-02-18 13:42:25.536+00 I SHARDING [ChunkMigrationFetchers-0] {"t":{"$date":"2026-02-18T13:42:25.536+00:00"},"s":"I",  "c":"SHARDING", "id":6718416, "svc":"S", "ctx":"ChunkMigrationFetchers-0","msg":"Chunk migration initial clone fetch end","attr":{"migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}},"batchSize":16773618,"fetchMillis":259}}
      mongo-rs12-a: 2026-02-18 13:42:25.788+00 I SHARDING [ChunkMigrationFetchers-0] {"t":{"$date":"2026-02-18T13:42:25.788+00:00"},"s":"I",  "c":"SHARDING", "id":6718416, "svc":"S", "ctx":"ChunkMigrationFetchers-0","msg":"Chunk migration initial clone fetch end","attr":{"migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}},"batchSize":16769642,"fetchMillis":251}}
      mongo-rs12-a: 2026-02-18 13:42:26.012+00 I SHARDING [ChunkMigrationFetchers-0] {"t":{"$date":"2026-02-18T13:42:26.012+00:00"},"s":"I",  "c":"SHARDING", "id":6718416, "svc":"S", "ctx":"ChunkMigrationFetchers-0","msg":"Chunk migration initial clone fetch end","attr":{"migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}},"batchSize":16766925,"fetchMillis":224}}
      mongo-rs12-a: 2026-02-18 13:42:26.187+00 I SHARDING [ChunkMigrationFetchers-0] {"t":{"$date":"2026-02-18T13:42:26.187+00:00"},"s":"I",  "c":"SHARDING", "id":6718416, "svc":"S", "ctx":"ChunkMigrationFetchers-0","msg":"Chunk migration initial clone fetch end","attr":{"migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}},"batchSize":16769014,"fetchMillis":174}}
      mongo-rs12-a: 2026-02-18 13:42:26.276+00 I SHARDING [ChunkMigrationFetchers-0] {"t":{"$date":"2026-02-18T13:42:26.276+00:00"},"s":"I",  "c":"SHARDING", "id":6718416, "svc":"S", "ctx":"ChunkMigrationFetchers-0","msg":"Chunk migration initial clone fetch end","attr":{"migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}},"batchSize":14318607,"fetchMillis":88}}
      mongo-rs12-a: 2026-02-18 13:42:26.276+00 I SHARDING [migrateThread] {"t":{"$date":"2026-02-18T13:42:26.277+00:00"},"s":"I",  "c":"SHARDING", "id":6718401, "svc":"S", "ctx":"migrateThread","msg":"Shutting down and joining inserter threads for migration {migrationId}","attr":{"migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}}}}
      mongo-rs12-a: 2026-02-18 13:42:26.276+00 I SHARDING [ChunkMigrationFetchers-0] {"t":{"$date":"2026-02-18T13:42:26.276+00:00"},"s":"I",  "c":"SHARDING", "id":6718404, "svc":"S", "ctx":"ChunkMigrationFetchers-0","msg":"Chunk migration initial clone complete","attr":{"migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}},"durationMicros":612}}
      mongo-rs12-a: 2026-02-18 13:42:26.349+00 I SHARDING [migrateThread] {"t":{"$date":"2026-02-18T13:42:26.350+00:00"},"s":"I",  "c":"SHARDING", "id":6718415, "svc":"S", "ctx":"migrateThread","msg":"Inserter threads for migration {migrationId} joined","attr":{"migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}}}}
      mongo-rs04-a: 2026-02-18 13:42:26.367+00 I MIGRATE [MoveChunk] {"t":{"$date":"2026-02-18T13:42:26.367+00:00"},"s":"I",  "c":"MIGRATE",  "id":22017,   "svc":"S", "ctx":"MoveChunk","msg":"Migration successfully entered critical section","attr":{"migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}}}}
      mongo-rs12-a: 2026-02-18 13:42:26.371+00 I MIGRATE [migrateThread] {"t":{"$date":"2026-02-18T13:42:26.371+00:00"},"s":"I",  "c":"MIGRATE",  "id":22004,   "svc":"S", "ctx":"migrateThread","msg":"Waiting for replication to catch up before entering critical section","attr":{"migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}}}}
      mongo-rs12-a: 2026-02-18 13:42:26.371+00 I MIGRATE [migrateThread] {"t":{"$date":"2026-02-18T13:42:26.371+00:00"},"s":"I",  "c":"MIGRATE",  "id":22005,   "svc":"S", "ctx":"migrateThread","msg":"Chunk data replicated successfully.","attr":{"migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}}}}
      mongo-rs12-a: 2026-02-18 13:42:26.371+00 I MIGRATE [migrateThread] {"t":{"$date":"2026-02-18T13:42:26.372+00:00"},"s":"I",  "c":"MIGRATE",  "id":22008,   "svc":"S", "ctx":"migrateThread","msg":"Migration commit succeeded flushing to secondaries","attr":{"namespace":"data-storage.calculations","chunkMin":{"_id":"0e8f102c7df0a60a646bf35753102cd9_c2d597d3362c4a2b9a272cfba2fd69d2"},"chunkMax":{"_id":"0e92b9e8207c055fe2b972928d201e6f_0f7a2e55f0714d6f8e351a2319615967"},"migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}}}}
      mongo-rs04-a: 2026-02-18 13:43:00.332+00 I NETWORK [ReplicaSetMonitor-TaskExecutor] {"t":{"$date":"2026-02-18T13:43:00.332+00:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "svc":"-", "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Host failed in replica set","attr":{"replicaSet":"mongocfg","host":"mongo-conf-c:27019","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Timed out connecting to mongo-conf-c:27019 after 20000ms"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"mongo-conf-c:27019","success":false,"errorMessage":"HostUnreachable: Timed out connecting to mongo-conf-c:27019 after 20000ms"}}}}
      mongo-rs04-a: 2026-02-18 13:43:00.332+00 I - [ReplicaSetMonitor-TaskExecutor] {"t":{"$date":"2026-02-18T13:43:00.332+00:00"},"s":"I",  "c":"-",        "id":4333222, "svc":"-", "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"RSM received error response","attr":{"host":"mongo-conf-c:27019","error":"HostUnreachable: Timed out connecting to mongo-conf-c:27019 after 20000ms","replicaSet":"mongocfg","response":{}}}
      mongo-rs04-a: 2026-02-18 13:58:21.681+00 I CONNPOOL [ReplicaSetMonitor-TaskExecutor] {"t":{"$date":"2026-02-18T13:58:21.682+00:00"},"s":"I",  "c":"CONNPOOL", "id":22572,   "svc":"-", "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Dropping all pooled connections","attr":{"hostAndPort":"mongo-conf-c:27019","error":"HostUnreachable: Timed out connecting to mongo-conf-c:27019 after 20000ms"}}
      mongo-rs04-a: 2026-02-18 13:58:23.961+00 I NETWORK [Sharding-Fixed-3963] {"t":{"$date":"2026-02-18T13:58:23.962+00:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "svc":"-", "ctx":"Sharding-Fixed-3963","msg":"Host failed in replica set","attr":{"replicaSet":"mongocfg","host":"mongo-conf-c:27019","error":{"code":6,"codeName":"HostUnreachable","errmsg":"onInvoke :: caused by :: Connection timed out"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"mongo-conf-c:27019","success":false,"errorMessage":"HostUnreachable: onInvoke :: caused by :: Connection timed out"}}}}
      mongo-rs04-a: 2026-02-18 13:58:23.961+00 I NETWORK [MoveChunk] {"t":{"$date":"2026-02-18T13:58:23.962+00:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "svc":"S", "ctx":"MoveChunk","msg":"Host failed in replica set","attr":{"replicaSet":"mongocfg","host":"mongo-conf-c:27019","error":{"code":6,"codeName":"HostUnreachable","errmsg":"onInvoke :: caused by :: Connection timed out"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"mongo-conf-c:27019","success":false,"errorMessage":"HostUnreachable: onInvoke :: caused by :: Connection timed out"}}}}
      mongo-rs04-a: 2026-02-18 13:58:24.010+00 I MIGRATE [MoveChunk] {"t":{"$date":"2026-02-18T13:58:24.011+00:00"},"s":"I",  "c":"MIGRATE",  "id":22018,   "svc":"S", "ctx":"MoveChunk","msg":"Migration succeeded and updated collection placement version","attr":{"updatedCollectionPlacementVersion":"3356347|1||686657ebe0c1a741c290886f||Timestamp(1751537643, 13)","migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}}}}
      mongo-rs04-a: 2026-02-18 13:58:24.010+00 I MIGRATE [MoveChunk] {"t":{"$date":"2026-02-18T13:58:24.011+00:00"},"s":"I",  "c":"MIGRATE",  "id":6107801, "svc":"S", "ctx":"MoveChunk","msg":"Exiting commit critical section","attr":{"migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}},"durationMillis":957627}}
      mongo-rs04-a: 2026-02-18 13:58:24.068+00 I MIGRATE [MoveChunk] {"t":{"$date":"2026-02-18T13:58:24.068+00:00"},"s":"I",  "c":"MIGRATE",  "id":6107802, "svc":"S", "ctx":"MoveChunk","msg":"Finished critical section","attr":{"migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}},"durationMillis":957711}}
      mongo-rs04-a: 2026-02-18 13:58:24.068+00 I MIGRATE [MoveChunk] {"t":{"$date":"2026-02-18T13:58:24.068+00:00"},"s":"I",  "c":"MIGRATE",  "id":23893,   "svc":"S", "ctx":"MoveChunk","msg":"MigrationCoordinator delivering decision to self and to recipient","attr":{"decision":"committed","migrationId":{"uuid":{"$uuid":"0544d33b-0a5a-4de3-999d-56efe0709519"}}}}
      mongo-rs04-a: 2026-02-18 13:58:36.414+00 I SHARDING [MoveChunk] {"t":{"$date":"2026-02-18T13:58:36.415+00:00"},"s":"I",  "c":"SHARDING", "id":7627801, "svc":"S", "ctx":"MoveChunk","msg":"Migration finished","attr":{"migrationId":"0544d33b-0a5a-4de3-999d-56efe0709519","totalTimeMillis":972135,"docsCloned":10988,"bytesCloned":165176199,"cloneTime":2046}}
      

            Assignee:
            Kenan Ali
            Reporter:
            Artem Navrotskiy
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Created:
              Updated: