Core Server / SERVER-66106

sharded_moveChunk_partitioned.js failed moveChunk check may be incorrect (only pre-6.0)

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 5.0.15
    • Affects Version/s: 4.2.22, 4.4.18, 5.0.13
    • Component/s: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Sprint: Sharding EMEA 2022-06-13, Sharding EMEA 2022-06-27, Sharding EMEA 2022-07-11, Sharding EMEA 2022-07-25, Sharding EMEA 2022-08-08, Sharding EMEA 2022-08-22, Sharding EMEA 2022-09-05, Sharding EMEA 2022-09-19, Sharding EMEA 2022-11-28, Sharding EMEA 2022-12-12, Sharding EMEA 2022-12-26, Sharding EMEA 2023-01-09, Sharding EMEA 2023-01-23, Sharding EMEA 2023-02-06, Sharding EMEA 2023-02-20
    • 0

      [DISCLAIMER] This is a very rare test-only race condition that has been observed only once and can happen only on pre-v6.0 versions, because starting from v6.0 user-issued moveChunk commands are transformed into moveRange operations that are not persisted on the CSRS.

      When a thread executes a moveChunk state targeting a chunk C, it performs the following actions (see the sketch after this list):

      1. Retrieve the number of docs present in C on the donor before the move
      2. Call moveChunk on C
      3. Retrieve the number of docs present in C on the donor after the move
      4. If the move failed:
        • Check that the number of docs in C on the donor before and after the move is the same
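
      For reference, a minimal sketch of that moveChunk state in mongo shell / jstests style. The collection name, chunk bounds, destination shard and counting scope are illustrative assumptions, not taken from the actual sharded_moveChunk_partitioned.js workload:

          // Hypothetical sketch of the moveChunk state described above. The real
          // workload only counts the docs of chunk C owned by the donor; here the
          // whole collection is counted for brevity.
          function moveChunkState(db, collName) {
              const ns = db[collName].getFullName();
              const bounds = [{_id: MinKey}, {_id: MaxKey}];  // chunk C (illustrative bounds)

              // 1. Number of docs in C on the donor before the move.
              const numDocsBefore = db[collName].countDocuments({});

              // 2. Attempt the move; it may legitimately fail (e.g. donor already busy).
              const res = db.adminCommand({moveChunk: ns, bounds: bounds, to: 'shard0001'});

              // 3. Number of docs in C on the donor after the move.
              const numDocsAfter = db[collName].countDocuments({});

              // 4. If the move reportedly failed, assert the donor still owns every doc;
              //    this is the assumption that breaks in the race described below.
              if (!res.ok) {
                  assert.eq(numDocsBefore, numDocsAfter,
                            'doc count changed on donor although moveChunk reported failure');
              }
          }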

      The following problem was observed in an Evergreen failure:

      • (During step 2) The move fails because the donor is already serving another moveChunk
      • The CSRS primary steps down right before calling the cleanup for the aborted migration, so it fails to delete the recovery info:
      "Failed to remove recovery info","attr":{"error":"NotWritablePrimary: Not-primary error while processing 'delete' operation  on 'config' database via fire-and-forget command execution."}
      • A new CSRS primary very quickly steps up, the move is resent and succeeds:
      "Balancer scheduler recovery complete. Switching to regular execution"
      • (During step 3) The FSM thread retrieves the document count after the "failed" move (which in reality succeeded on the second attempt) and throws an exception because the numbers of documents before and after the move do not match
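
      Purely as an illustration of why the check is too strict (and not necessarily the fix that was committed for this ticket), step 4 of the sketch above could also accept the case where the donor no longer owns any document of C, i.e. the "failed" move was actually retried and completed by the balancer recovery path:

          // Illustrative only: a more tolerant version of step 4, assuming the donor
          // may have lost the whole chunk because the move reported as failed was
          // later retried and completed after the CSRS failover.
          function checkDonorDocCount(moveRes, numDocsBefore, numDocsAfter) {
              if (!moveRes.ok) {
                  assert(numDocsAfter === numDocsBefore || numDocsAfter === 0,
                         'unexpected donor doc count after a failed moveChunk: ' + numDocsAfter);
              }
          }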

            Assignee: Enrico Golfieri (enrico.golfieri@mongodb.com)
            Reporter: Pierlauro Sciarelli (pierlauro.sciarelli@mongodb.com)
            Votes: 0
            Watchers: 5
