[SERVER-66106] sharded_moveChunk_partitioned.js failed moveChunk check may be incorrect (only pre-6.0) Created: 02/May/22  Updated: 29/Oct/23  Resolved: 09/Feb/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.2.22, 4.4.18, 5.0.13
Fix Version/s: 5.0.15

Type: Bug Priority: Major - P3
Reporter: Pierlauro Sciarelli Assignee: Enrico Golfieri
Resolution: Fixed Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding EMEA 2022-06-13, Sharding EMEA 2022-06-27, Sharding EMEA 2022-07-11, Sharding EMEA 2022-07-25, Sharding EMEA 2022-08-08, Sharding EMEA 2022-08-22, Sharding EMEA 2022-09-05, Sharding EMEA 2022-09-19, Sharding EMEA 2022-11-28, Sharding EMEA 2022-12-12, Sharding EMEA 2022-12-26, Sharding EMEA 2023-01-09, Sharding EMEA 2023-01-23, Sharding EMEA 2023-02-06, Sharding EMEA 2023-02-20
Participants:
Linked BF Score: 0

 Description   

[DISCLAIMER] This is a very rare test-only race condition that has been observed only once and can happen only on pre-v6.0 versions: starting from v6.0, moveChunk commands issued by users are transformed into moveRange operations that are not persisted on the CSRS.

When a thread executes the moveChunk state targeting a chunk C, it performs the following actions:

  1. Retrieve the number of docs present in C on the donor before the move
  2. Call moveChunk on C
  3. Retrieve the number of docs present in C on the donor after the move
  4. In case the move failed:
    • Check that the number of docs in C on the donor is the same before and after the move

The following problem has been observed in an Evergreen failure:

  • (During step 2) The move fails because the donor is already serving another moveChunk
  • The CSRS primary steps down right before calling the cleanup for the aborted migration, failing to delete the recovery info

"Failed to remove recovery info","attr":{"error":"NotWritablePrimary: Not-primary error while processing 'delete' operation  on 'config' database via fire-and-forget command execution."} 

  • A new CSRS node steps up very quickly; the move is reissued and succeeds

"Balancer scheduler recovery complete. Switching to regular execution"

  • (During step 3) The FSM thread retrieves the number of documents after the "failed" move (which in reality succeeded on the second attempt) and throws an exception because the numbers of documents before and after the move do not match
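One way the test's failure check could tolerate this race is sketched below. This is an illustration only, not the patch that landed on v5.0; the function name and parameters are hypothetical. On a reported failure, accept either an unchanged donor count or evidence that the migration actually committed (the donor was drained and config.chunks shows a new owner).

```javascript
// Sketch only -- not the actual fix. On a "failed" move, also accept the
// CSRS-stepdown outcome: recovery info survived the stepdown, the new primary
// re-ran the migration, and it committed despite the error the FSM thread saw.
function failedMoveIsConsistent(docsBefore, docsAfter, ownerBefore, ownerAfter) {
    if (docsBefore === docsAfter && ownerBefore === ownerAfter) {
        return true; // the move genuinely did not happen
    }
    // Donor fully drained and the chunk changed owner: the "failed" move
    // succeeded on the second attempt.
    return docsAfter === 0 && ownerAfter !== ownerBefore;
}
```

The alternative discussed in the comments, denylisting the test from stepdown suites, avoids the race entirely rather than tolerating it.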


 Comments   
Comment by Githook User [ 09/Feb/23 ]

Author:

{'name': 'Enrico Golfieri', 'email': 'enrico.golfieri@mongodb.com', 'username': 'enricogolfieri'}

Message: SERVER-66106 sharded_moveChunk_partitioned.js failed moveChunk check may be incorrect (only pre-6.0)
Branch: v5.0
https://github.com/mongodb/mongo/commit/ff04e4b935075c049db641ab59371cda350d15ea

Comment by Connie Chen [ 05/Jan/23 ]

Moving this back to open as there isn't a linked dependency

Comment by Sergi Mateo Bellido [ 17/Nov/22 ]

The suggestion after team discussion is to fix the test or blacklist it from stepdown suites.

Generated at Thu Feb 08 06:04:30 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.