[SERVER-30636] RangeDeleter assert failed because of replication lag Created: 14/Aug/17 Updated: 30/Oct/23 Resolved: 28/Aug/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.2.14, 3.4.7 |
| Fix Version/s: | 3.2.17, 3.4.9 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Zhang Youdong | Assignee: | Nathan Myers |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||
| Backwards Compatibility: | Fully Compatible | ||||
| Operating System: | ALL | ||||
| Backport Requested: |
v3.2
|
||||
| Participants: | |||||
| Case: | (copied to CRM) | ||||
| Description |
|
The problem is quite the same as
I dug into the source code and found the cause. Helpers::removeRange tries to ignore the write concern timeout error by comparing the error code with ErrorCodes::ExceededTimeLimit.
However, awaitReplication returns ErrorCodes::WriteConcernFailed when it times out because of replication lag.
I am trying to create a pull request to fix this, but the master branch has changed a lot, while the latest 3.2 and 3.4 releases both still have the problem. I am not sure why removeRange needs to assert when awaitReplication fails. Maybe we can remove the assert or change the error code that awaitReplication returns. |
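The mismatch described above can be illustrated with a minimal, self-contained C++ sketch. It is not the server code: `ErrorCode`, `Status`, `awaitReplicationStub`, `toleratedAsReported`, and `toleratedWithWriteConcernFailed` are hypothetical stand-ins, and only the two error code names are taken from the report. The second predicate shows one possible adjustment, assuming the intent is to tolerate a replication timeout.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for the error codes named in the report.
enum class ErrorCode { OK, ExceededTimeLimit, WriteConcernFailed };

// Simplified status object (assumption: not the server's Status class).
struct Status {
    ErrorCode code;
    std::string reason;
    bool isOK() const { return code == ErrorCode::OK; }
};

// Stand-in for awaitReplication: when the secondaries lag past the
// timeout, it reports WriteConcernFailed, as the report describes.
Status awaitReplicationStub(bool secondariesCaughtUp) {
    if (secondariesCaughtUp) {
        return {ErrorCode::OK, ""};
    }
    return {ErrorCode::WriteConcernFailed, "waiting for replication timed out"};
}

// The check as the report describes it: only ExceededTimeLimit is
// treated as an ignorable timeout.
bool toleratedAsReported(const Status& s) {
    return s.isOK() || s.code == ErrorCode::ExceededTimeLimit;
}

// One possible adjustment (an assumption, not the merged patch):
// also treat WriteConcernFailed as an ignorable timeout.
bool toleratedWithWriteConcernFailed(const Status& s) {
    return toleratedAsReported(s) || s.code == ErrorCode::WriteConcernFailed;
}
```

With these stand-ins, a lagging replica produces a status that the first predicate rejects and the second accepts, which is the gap between what the caller checks for and what the callee returns.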
| Comments |
| Comment by bhargav [ 16/Oct/17 ] |
|
After upgrading to 3.4, the replica set is not working; the log output is below. We have a fresh dataset that contains only the system log and table. { "ok" : 0, "errmsg" : "'10.10.10.25:27017' has data already, cannot initiate set.", "code" : 110, "codeName" : "CannotInitializeNodeWithData" } We followed the same steps as with the previous version. Thanks, |
| Comment by Githook User [ 29/Aug/17 ] |
|
Author: {'name': 'Nathan Myers', 'username': 'nathan-myers-mongo', 'email': 'nathan.myers@10gen.com'} Message: Merge pull request zyd_com/ Backport |
| Comment by Nathan Myers [ 28/Aug/17 ] |
|
zyd_com Yes, the PR does completely fix the problem. It will go out in 3.4.8, and is scheduled for a 3.2 release. Thank you again for the patch and the work behind it. |
| Comment by Githook User [ 28/Aug/17 ] |
|
Author: {'username': 'nathan-myers-mongo', 'name': 'Nathan Myers', 'email': 'nathan.myers@10gen.com'}Message: |
| Comment by Githook User [ 28/Aug/17 ] |
|
Author: {'username': 'nathan-myers-mongo', 'name': 'Nathan Myers', 'email': 'nathan.myers@10gen.com'}Message: |
| Comment by Zhang Youdong [ 23/Aug/17 ] |
|
@Nathan Myers What is the progress on this issue? Does the PR fix the problem? |
| Comment by Nathan Myers [ 17/Aug/17 ] |
|
No, your fix is correct. Without it, the range deleter will skip deleting remaining documents in the range. |
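Why a thrown assertion leaves documents behind can be sketched in a few lines of C++. This is a simplified model under stated assumptions, not the range deleter itself: `AssertionException`, `waitForReplicationOrThrow`, `deleteRangeUncaught`, and `deleteRangeTolerant` are hypothetical names, and each element of the input vector says whether the replication wait after that document times out.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the exception raised by a failed status assertion.
struct AssertionException : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Stand-in for the replication wait: throws when the wait times out.
void waitForReplicationOrThrow(bool timesOut) {
    if (timesOut) {
        throw AssertionException("waiting for replication timed out");
    }
}

// The exception escapes the loop, so documents after the first timeout
// are never visited. Returns how many documents were deleted.
std::size_t deleteRangeUncaught(const std::vector<bool>& timesOutAt) {
    std::size_t deleted = 0;
    try {
        for (bool timesOut : timesOutAt) {
            ++deleted;                            // document removed locally
            waitForReplicationOrThrow(timesOut);  // a throw exits the loop
        }
    } catch (const AssertionException&) {
        // Handled outside the loop: the rest of the range is skipped.
    }
    return deleted;
}

// The timeout is handled inside the loop, so deletion continues.
std::size_t deleteRangeTolerant(const std::vector<bool>& timesOutAt) {
    std::size_t deleted = 0;
    for (bool timesOut : timesOutAt) {
        ++deleted;
        try {
            waitForReplicationOrThrow(timesOut);
        } catch (const AssertionException&) {
            // Replication is lagging; keep deleting the remaining documents.
        }
    }
    return deleted;
}
```

For a four-document range where the wait times out after the second document, the first function stops after two deletions and the second completes all four.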
| Comment by Zhang Youdong [ 17/Aug/17 ] |
|
@Nathan Myers I thought the assertion was by design, so the PR just skips the WriteConcernFailed error code. It is much better to fix the problem by catching the exception. Thank you. |
| Comment by Nathan Myers [ 16/Aug/17 ] |
|
Thank you for the PR. I agree that this should fix the immediate problem. I have investigated the master branch, and this problem cannot arise in the new range deletion system in 3.5. In releases 3.4 and 3.2, the massertStatusOK statement (like its replacement in 3.4, uassertStatusOK) throws an exception |
| Comment by Zhang Youdong [ 16/Aug/17 ] |
|
I created a pull request for v3.4, and it can also be used for v3.2. For details, see https://github.com/mongodb/mongo/pull/1171 |
| Comment by Nathan Myers [ 15/Aug/17 ] |
|
Thank you, YZ. A PR just for 3.2 and/or 3.4 would be welcome. The 3.6 code is very different, but I will see if it needs similar attention. |
| Comment by Ramon Fernandez Marina [ 14/Aug/17 ] |
|
Thanks for the detailed report zyd_com, we're investigating. |