[SERVER-26791] move/split/mergeChunk commands do a full metadata refresh on the shard
Created: 26/Oct/16  Updated: 01/Nov/17  Resolved: 23/Jan/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.10, 3.4.0-rc1
Fix Version/s: 3.4.2, 3.5.2

Type: Bug
Priority: Major - P3
Reporter: Kaloian Manassiev
Assignee: Randolph Tan
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: createChunks_v3-4.js, r3-4-0-rc2_test.tgz, r3-4-0_test.tgz
Issue Links:
Backports
Duplicate
is duplicated by SERVER-26913 Make move/merge/split chunk operation... Closed
Related
related to SERVER-25652 Slow chunk migrations when there are ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v3.4
Sprint: Sharding 2016-12-12, Sharding 2017-01-02, Sharding 2017-02-13

 Description   

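The move/split/mergeChunk commands on the shards perform a full metadata refresh before starting the operation, instead of an incremental one. Because a full refresh reloads every chunk of the collection from the config servers, its cost grows with the total chunk count, and for clusters with a large number of chunks this can cause a significant delay.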
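To illustrate the difference, here is a minimal Python sketch under simplifying assumptions: chunks are modeled as hypothetical `(min_key, max_key, (major, minor))` tuples with integer keys, epochs are ignored, and `full_refresh` / `incremental_refresh` are illustrative names, not functions from the MongoDB source. An incremental refresh applies only chunks whose version is newer than the highest version the shard already has, so a single split touches two chunk documents instead of all of them.

```python
from typing import Dict, List, Tuple

# Hypothetical model (not MongoDB's real types): a chunk is
# (min_key, max_key, (major, minor)). Epochs are deliberately ignored here.
Version = Tuple[int, int]
Chunk = Tuple[int, int, Version]


def full_refresh(config_chunks: List[Chunk]) -> Dict[int, Chunk]:
    """Rebuild the shard's chunk map from every chunk of the collection."""
    return {chunk[0]: chunk for chunk in config_chunks}


def incremental_refresh(cached: Dict[int, Chunk],
                        config_chunks: List[Chunk]) -> Tuple[Dict[int, Chunk], int]:
    """Apply only chunks newer than the cached collection version.

    Returns the updated chunk map and how many chunk documents were applied.
    """
    # The collection version is the highest chunk version the shard knows about.
    since = max((chunk[2] for chunk in cached.values()), default=(0, 0))
    diff = sorted((c for c in config_chunks if c[2] > since), key=lambda c: c[2])
    updated = dict(cached)
    for new_min, new_max, version in diff:
        # A split or merge replaces existing ranges: drop every cached chunk
        # overlapping [new_min, new_max) before inserting the newer chunk.
        for old_min, (_, old_max, _) in list(updated.items()):
            if old_min < new_max and new_min < old_max:
                del updated[old_min]
        updated[new_min] = (new_min, new_max, version)
    return updated, len(diff)


if __name__ == "__main__":
    # 1000 contiguous chunks [0, 10), [10, 20), ... with versions 1|0 .. 1|999.
    chunks = [(i * 10, i * 10 + 10, (1, i)) for i in range(1000)]
    cache = full_refresh(chunks)
    # Simulate a split of [0, 10) into [0, 5) and [5, 10) with newer versions.
    after_split = [c for c in chunks if c[0] != 0] + [(0, 5, (1, 1000)), (5, 10, (1, 1001))]
    cache, applied = incremental_refresh(cache, after_split)
    print(applied, len(cache), cache == full_refresh(after_split))  # prints: 2 1001 True
```

The incremental result matches what a full refresh would produce, while reading only the two changed chunks.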



 Comments   
Comment by Githook User [ 20/Jan/17 ]

Author: Randolph Tan (renctan) <randolph@10gen.com>

Message: SERVER-26791 Shard metadata commands should perform partial refresh as much as possible

(cherry picked from commit 6add2c82bacecd5f54613ebf4be1553f3b046cbc)
Branch: v3.4
https://github.com/mongodb/mongo/commit/774514cfad04866b42a1deb27b48488dec0f7520

Comment by Githook User [ 19/Jan/17 ]

Author: Randolph Tan (renctan) <randolph@10gen.com>

Message: SERVER-26791 remove deleted test from blacklist
Branch: master
https://github.com/mongodb/mongo/commit/c35c342e4f4d59db4803f7c1707d9998bec7b793

Comment by Githook User [ 19/Jan/17 ]

Author: Randolph Tan (renctan) <randolph@10gen.com>

Message: Revert "Revert "SERVER-26791 Shard metadata commands should perform partial refresh as much as possible""

This reverts commit d3e67186d1f9c633e8e69ebb7bf2418d3850688a.
Branch: master
https://github.com/mongodb/mongo/commit/0e9947736fa66f1a997dd43fea6d1f854bb79518

Comment by Githook User [ 19/Jan/17 ]

Author: Siyuan Zhou (visualzhou) <siyuan.zhou@mongodb.com>

Message: Revert "SERVER-26791 Shard metadata commands should perform partial refresh as much as possible"

This reverts commit 6add2c82bacecd5f54613ebf4be1553f3b046cbc.
Branch: master
https://github.com/mongodb/mongo/commit/d3e67186d1f9c633e8e69ebb7bf2418d3850688a

Comment by Githook User [ 19/Jan/17 ]

Author: Randolph Tan (renctan) <randolph@10gen.com>

Message: SERVER-26791 Shard metadata commands should perform partial refresh as much as possible
Branch: master
https://github.com/mongodb/mongo/commit/6add2c82bacecd5f54613ebf4be1553f3b046cbc

Comment by Kaloian Manassiev [ 06/Dec/16 ]

Yes, I agree. I was just wondering whether it is safe to make all refreshes incremental, with a back-off to a full reload on epoch mismatch, or whether there are cases where we must force a complete reload.

I think this is safe as long as we always include the version epoch in the incremental refresh query. This, combined with the check we do on the config.collections entry, ensures that a refresh never loads chunks belonging to different incarnations of the same collection (that is, chunks from different epochs).
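A minimal Python sketch of the approach discussed in this thread, under stated assumptions: `ChunkDoc`, `CachedMetadata`, `EpochMismatch`, `incremental_refresh`, `full_reload`, `refresh`, and the `allow_full_reload` flag are hypothetical names, not MongoDB APIs; epochs are plain strings rather than ObjectIds; and the config.chunks query is simulated by filtering a list. The incremental path first compares the cached epoch with the config.collections entry, and the simulated query filters on the epoch as well as the version, so chunks from another incarnation of the collection are never mixed in. The `allow_full_reload` flag models the two choices raised in the thread: fail the operation (acceptable for move/split/merge) or back off to a full reload (for a path such as _recvChunkStart, where failing could leave the shard unable to recover).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ChunkDoc:
    """Hypothetical stand-in for a config.chunks document."""
    min_key: int
    max_key: int
    major: int
    minor: int
    epoch: str  # stands in for the collection epoch (an ObjectId in MongoDB)


@dataclass
class CachedMetadata:
    """Hypothetical shard-side cache for one incarnation of a collection."""
    epoch: str
    chunks: Dict[int, ChunkDoc] = field(default_factory=dict)

    def version(self) -> Tuple[int, int]:
        return max(((c.major, c.minor) for c in self.chunks.values()), default=(0, 0))


class EpochMismatch(Exception):
    """The collection was dropped and recreated since the cache was built."""


def full_reload(coll_epoch: str, config_chunks: List[ChunkDoc]) -> CachedMetadata:
    # Load every chunk of the current incarnation from scratch.
    chunks = {c.min_key: c for c in config_chunks if c.epoch == coll_epoch}
    return CachedMetadata(coll_epoch, chunks)


def incremental_refresh(cached: CachedMetadata, coll_epoch: str,
                        config_chunks: List[ChunkDoc]) -> CachedMetadata:
    # Compare against the epoch from the config.collections entry first.
    if coll_epoch != cached.epoch:
        raise EpochMismatch(f"cached epoch {cached.epoch}, current {coll_epoch}")
    since = cached.version()
    # The epoch is part of the simulated query, so chunks from another
    # incarnation of the collection can never be applied to this cache.
    diff = sorted((c for c in config_chunks
                   if c.epoch == cached.epoch and (c.major, c.minor) > since),
                  key=lambda c: (c.major, c.minor))
    chunks = dict(cached.chunks)
    for new in diff:
        # Drop cached chunks that overlap the newer chunk's range.
        for key, old in list(chunks.items()):
            if old.min_key < new.max_key and new.min_key < old.max_key:
                del chunks[key]
        chunks[new.min_key] = new
    return CachedMetadata(cached.epoch, chunks)


def refresh(cached: CachedMetadata, coll_epoch: str, config_chunks: List[ChunkDoc],
            allow_full_reload: bool) -> CachedMetadata:
    try:
        return incremental_refresh(cached, coll_epoch, config_chunks)
    except EpochMismatch:
        if not allow_full_reload:
            raise  # e.g. move/split/merge: simply fail the operation
        return full_reload(coll_epoch, config_chunks)  # e.g. a recovery path
```

Keeping the fallback decision with the caller lets move/split/merge report the drop-and-recreate instead of acting on a different incarnation, while recovery paths still converge on the new epoch.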

Comment by Randolph Tan [ 06/Dec/16 ]

Hm... I think you might be right. I am a little concerned that, without the full reload on epoch mismatch, there might be a scenario where a shard has no other way to recover. For example, if _recvChunkStart encounters this, the shard will never accept chunks for this collection until you restart it.

Comment by Kaloian Manassiev [ 06/Dec/16 ]

Do you mean that we should include the retry logic used by onStaleShardVersion in move/split/merge, or make it part of every refresh and have it back off to a full refresh in this case?

For move/split/merge it is fine to fail the operation on epoch mismatch, because once the collection has been dropped and recreated, the requested operation no longer means anything.

Comment by Randolph Tan [ 05/Dec/16 ]

It looks like we should be able to switch to incremental refresh: the refresh code already checks for epoch mismatch, so it should be safe. We might need to add some extra code to handle the epoch-mismatch case itself.

Generated at Thu Feb 08 04:13:13 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.