[SERVER-28248] stale secondary pings the primary to refresh chunks and waits for updates to propagate
Created: 08/Mar/17  Updated: 06/Dec/17  Resolved: 06/Jul/17

| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 3.5.10 |
| Type: | Task |
| Priority: | Major - P3 |
| Reporter: | Dianna Hohensee (Inactive) |
| Assignee: | Esha Maharishi (Inactive) |
| Resolution: | Done |
| Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Sharding 2017-06-19, Sharding 2017-07-10, Sharding 2017-07-31 |
| Participants: | |
| Description |

The secondary must know how long to wait for the latest chunk metadata to replicate to it, so that it can refresh successfully. The command will return the optime of the last chunk metadata write on the primary. The shard primary must always stash the optime of the last chunk metadata write from each refresh, so that it can return it if a secondary asks. If the command causes the primary to refresh, the primary will refresh and then return the optime from that chunk metadata write. The secondary must send the expectedShardVersion in the command to the primary, in case the primary must go to the config server to refresh.
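A minimal sketch of the sequence this implies, assuming placeholder helpers (askPrimaryToRefresh, waitUntilOpTimeHasReplicated, refreshLocalChunkMetadataFromDisk) and a simplified OpTime; these names are illustrative, not the actual command or server API:

```cpp
#include <cstdint>
#include <string>

// Simplified stand-in for the server's repl::OpTime.
struct OpTime {
    std::uint64_t term = 0;
    std::uint64_t timestamp = 0;
};

// Hypothetical helpers (placeholders): the first sends the new command to the
// primary, the second blocks until the given optime has been applied locally.
OpTime askPrimaryToRefresh(const std::string& nss, std::uint64_t expectedShardVersion) {
    (void)nss;
    (void)expectedShardVersion;
    return OpTime{};  // stub
}
void waitUntilOpTimeHasReplicated(const OpTime& opTime) { (void)opTime; }  // stub
void refreshLocalChunkMetadataFromDisk(const std::string& nss) { (void)nss; }  // stub

// The secondary-side sequence described in this ticket.
void refreshStaleSecondary(const std::string& nss, std::uint64_t expectedShardVersion) {
    // 1. Ping the primary with the expected shard version; the primary refreshes
    //    if needed and returns the optime of its last chunk metadata write.
    const OpTime lastMetadataWrite = askPrimaryToRefresh(nss, expectedShardVersion);

    // 2. Wait until that optime has replicated to this secondary, so the
    //    persisted chunk metadata is guaranteed to be present locally.
    waitUntilOpTimeHasReplicated(lastMetadataWrite);

    // 3. Only now load the (up-to-date) metadata from the local copy.
    refreshLocalChunkMetadataFromDisk(nss);
}
```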
| Comments |
| Comment by Githook User [ 06/Jul/17 ] |

Author: Esha Maharishi (EshaMaharishi, esha.maharishi@mongodb.com)
Message:
| Comment by Githook User [ 05/Jul/17 ] |

Author: Kaloian Manassiev (kaloianm, kaloian.manassiev@mongodb.com)
Message:
| Comment by Githook User [ 30/Jun/17 ] |

Author: Kaloian Manassiev (kaloianm, kaloian.manassiev@mongodb.com)
Message: And rename the class to NamespaceMetadataChangeNotifications to better
| Comment by Esha Maharishi (Inactive) [ 12/Jun/17 ] |

Meh, I dunno. I think as long as this map exists somewhere, and a client thread can specify its wantedVersion (possibly something we'll have to wire in), things should work...
| Comment by Esha Maharishi (Inactive) [ 12/Jun/17 ] |

dianna.hohensee, ok, we were thinking of making the OpObserver on the secondary update a map on the chunk loader, something like map<string, pair<Notification, ChunkVersion>> latestVersion, where the string is the collection name (or later the UUID). The notification would be signaled whenever latestVersion is updated, and client threads would wait for a signal for their desired collection. Based on what you said about the secondary shard's OpObserver invalidating cached chunk metadata on the CatalogCache, perhaps it makes more sense for the CatalogCache to own this map?
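A rough, self-contained illustration of such a map, using std:: primitives in place of the server's Notification and ChunkVersion types; the class and method names here are placeholders, not actual server code:

```cpp
#include <condition_variable>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>

using ChunkVersion = std::uint64_t;  // simplified stand-in for the real ChunkVersion

class CollectionVersionNotifications {
public:
    // Called from the secondary's OpObserver when a chunk metadata write for
    // 'nss' is applied: record the new version and wake any waiters.
    void onMetadataWriteApplied(const std::string& nss, ChunkVersion newVersion) {
        std::lock_guard<std::mutex> lk(_mutex);
        _latestVersion[nss] = newVersion;
        _cv.notify_all();
    }

    // Called by a client thread that wants at least 'wantedVersion' for 'nss'.
    void waitForVersion(const std::string& nss, ChunkVersion wantedVersion) {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [&] {
            auto it = _latestVersion.find(nss);
            return it != _latestVersion.end() && it->second >= wantedVersion;
        });
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    std::map<std::string, ChunkVersion> _latestVersion;
};
```

Whether this map would live on the chunk loader or on the CatalogCache is exactly the open question in the comment.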
| Comment by Dianna Hohensee (Inactive) [ 12/Jun/17 ] |

The secondary will not have a wantedVersion to send to the shard chunk loader for a refresh. For example, a secondary shard's OpObserver will invalidate the cached chunk metadata on the CatalogCache. The CatalogCache will eventually get a caller asking for the chunk metadata, see that its cache is invalid (needsRefresh), and then tell the shard chunk loader to refresh, without telling it which version to refresh to.
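A simplified sketch of that flow, with stand-in types (CatalogCacheSketch, ShardChunkLoaderStub) rather than the real CatalogCache and loader interfaces; the point is only that the refresh call carries no target version:

```cpp
#include <map>
#include <string>

// Stand-in for the cached routing information (chunk map, shard version, etc. elided).
struct CachedCollectionMetadata {
    bool needsRefresh = true;
};

struct ShardChunkLoaderStub {
    // Note there is no wanted-version parameter: the loader simply fetches
    // whatever is newest, so the secondary has no concrete version to wait for.
    CachedCollectionMetadata refresh(const std::string& nss) {
        (void)nss;
        CachedCollectionMetadata fresh;
        fresh.needsRefresh = false;
        return fresh;
    }
};

class CatalogCacheSketch {
public:
    explicit CatalogCacheSketch(ShardChunkLoaderStub* loader) : _loader(loader) {}

    // Called from the secondary's OpObserver when chunk metadata changes.
    void invalidate(const std::string& nss) {
        _cache[nss].needsRefresh = true;
    }

    // Called by an operation that needs routing information.
    CachedCollectionMetadata getCollectionMetadata(const std::string& nss) {
        auto& entry = _cache[nss];
        if (entry.needsRefresh) {
            entry = _loader->refresh(nss);  // untargeted refresh
        }
        return entry;
    }

private:
    ShardChunkLoaderStub* _loader;
    std::map<std::string, CachedCollectionMetadata> _cache;
};
```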
| Comment by Kaloian Manassiev [ 07/Jun/17 ] |

The first caller which installs the notification must call into the primary to force it to do a refresh (even if it is a no-op refresh).
| Comment by Esha Maharishi (Inactive) [ 07/Jun/17 ] |

I don't think it works if you don't hold the mutex while reading the collection :/ Consider: Thread 1 reads the collection and finds it stale. Even if some second client thread has installed the notification, so that it does get signaled by the OpObserver, the first client thread can get stuck: the metadata write (and the OpObserver's signal) can land after Thread 1's read but before Thread 1 installs and waits on its own notification, so Thread 1 ends up waiting for a signal that has already fired.
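An illustrative-only sketch of that problematic interleaving (not a fix), with placeholder names; because the collection read cannot happen while holding the notification mutex, the signal can slip in between the check and the wait:

```cpp
#include <condition_variable>
#include <mutex>

std::mutex refreshMutex;
std::condition_variable refreshCv;
bool refreshSignaled = false;  // stands in for "our notification was signaled"

// Placeholder for reading the chunks collection; must happen outside the mutex.
bool readCollectionAndFindStale() {
    return true;  // stub: pretend the metadata looked stale
}

void waitForRefreshRacy() {
    if (readCollectionAndFindStale()) {  // (1) Thread 1: metadata looks stale
        // *** The metadata write and the OpObserver's signal-and-remove can
        // *** happen right here, between the read above and the wait below.
        std::unique_lock<std::mutex> lk(refreshMutex);
        refreshCv.wait(lk, [] { return refreshSignaled; });  // (2) may block forever
    }
}
```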
| Comment by Kaloian Manassiev [ 07/Jun/17 ] |

This looks good to me, and it seems like a better option than adding a synchronous refresh option to the CatalogCache, which is what I was trying to propose yesterday. I have one suggestion though, because I see a possible race condition between checking whether the persisted version in the admin.system.chunks collection has been reached and proceeding to wait for the notification. This check is not atomic, because you can't hold a mutex while reading the collection. A similar race condition exists with purging the notification entries. What I suggest (see the sketch below): in the OpObserver, just take the _refreshNotifications mutex, check if there is an entry for the namespace and, if there is, signal it and remove it right away. That way we don't need to worry about purging or about associating the notification with a particular version; it is just a plain old notification that something in the collection has changed. I would also check whether we already have such a mechanism in place for capped collections or for exhaust cursors that could be reused (although I doubt it).
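A minimal, self-contained sketch of that suggestion, with std:: primitives and placeholder names standing in for the server's Notification type and the _refreshNotifications member:

```cpp
#include <condition_variable>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// One-shot notification a waiter can block on.
struct OneShotNotification {
    std::mutex m;
    std::condition_variable cv;
    bool set = false;

    void signal() {
        std::lock_guard<std::mutex> lk(m);
        set = true;
        cv.notify_all();
    }
    void wait() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return set; });
    }
};

class RefreshNotifications {
public:
    // Waiter side: install the notification before forcing the primary
    // refresh, then wait on the returned object.
    std::shared_ptr<OneShotNotification> install(const std::string& nss) {
        std::lock_guard<std::mutex> lk(_mutex);
        auto& entry = _refreshNotifications[nss];
        if (!entry)
            entry = std::make_shared<OneShotNotification>();
        return entry;
    }

    // OpObserver side: if there is an entry for the namespace, signal it and
    // remove it right away; no version bookkeeping and no separate purge step.
    void onChunkMetadataWrite(const std::string& nss) {
        std::shared_ptr<OneShotNotification> toSignal;
        {
            std::lock_guard<std::mutex> lk(_mutex);
            auto it = _refreshNotifications.find(nss);
            if (it == _refreshNotifications.end())
                return;
            toSignal = it->second;
            _refreshNotifications.erase(it);
        }
        toSignal->signal();
    }

private:
    std::mutex _mutex;
    std::map<std::string, std::shared_ptr<OneShotNotification>> _refreshNotifications;
};
```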
| Comment by Esha Maharishi (Inactive) [ 07/Jun/17 ] |

Since we realized we'll need to plug in some notification machinery on the primary to wait for the (async) writes to the chunks collection, I thought it might be easier to just:

Suggested implementation: put a map<NamespaceString, Notification> (where the key is the full namespace, or later the database name + UUID?) in the ShardServerCatalogCacheLoader. Then add a signalCollectionRefreshed() and a waitForCollectionRefresh() to it. Only the thread that waits for the signal adds entries to the map; a sketch follows below.
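A hedged sketch of that interface on the loader: the two method names follow the comment, while everything else (a string alias for NamespaceString, a bool guarded by a condition variable) is a simplification rather than the real server types:

```cpp
#include <condition_variable>
#include <map>
#include <mutex>
#include <string>

using NamespaceString = std::string;  // simplified stand-in

class ShardServerCatalogCacheLoaderSketch {
public:
    // Called once the (async) write of refreshed chunk metadata has committed.
    void signalCollectionRefreshed(const NamespaceString& nss) {
        std::lock_guard<std::mutex> lk(_mutex);
        auto it = _refreshed.find(nss);
        if (it != _refreshed.end()) {
            it->second = true;
            _cv.notify_all();
        }
    }

    // Only the thread that intends to wait adds an entry to the map.
    void waitForCollectionRefresh(const NamespaceString& nss) {
        std::unique_lock<std::mutex> lk(_mutex);
        _refreshed.emplace(nss, false);
        _cv.wait(lk, [&] {
            auto it = _refreshed.find(nss);
            return it == _refreshed.end() || it->second;
        });
        _refreshed.erase(nss);
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    std::map<NamespaceString, bool> _refreshed;  // stands in for map<NamespaceString, Notification>
};
```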
| Comment by Dianna Hohensee (Inactive) [ 31/May/17 ] |

There are two ways to do this: an optimized solution and an unoptimized one. Whatever command is used, the secondary will send an nss and a shardVersion to the primary, and then expect to get back an optime upon which to wait, to ensure that the latest metadata has finished propagating to the secondary before the secondary attempts to load it.

Where the code should go:

Unoptimized solution:

Optimized solution:

Either approach is complicated by the fact that forcing a refresh doesn't force persistence of that metadata. The shard chunk loader persists metadata asynchronously, so it seems like we must wire in an additional notification somehow to wait for a particular Task to finish in the chunk loader. Perhaps expose a function on ShardServerCatalogCacheLoader, "Notification& registerNotification(nss, version)", and whenever a Task completes, check whether we have a notification registered and whether the Task applied the desired version for that nss (see the sketch below). This actually means that there may not be a way to avoid making a new command.

For either solution, I think we will actually need the secondary to know which version it should be refreshing to when its metadata is invalidated, say by a moveChunk moving out a chunk on the primary and then updating its metadata, which will (soon) cause an OpObserver on the secondary to invalidate the secondary's chunk metadata cache for that collection. This will be done in
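A rough sketch of the registerNotification(nss, version) idea, again with std:: primitives standing in for the server's Notification, ChunkVersion, and Task types; all names are illustrative:

```cpp
#include <condition_variable>
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <string>

using ChunkVersion = std::uint64_t;  // simplified stand-in

// Notification a caller blocks on until its desired version has been persisted.
struct VersionNotification {
    std::mutex m;
    std::condition_variable cv;
    bool reached = false;

    void wait() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return reached; });
    }
};

class LoaderTaskNotifications {
public:
    // A caller that needs 'version' persisted for 'nss' registers here and
    // then waits on the returned notification.
    std::shared_ptr<VersionNotification> registerNotification(const std::string& nss,
                                                              ChunkVersion version) {
        std::lock_guard<std::mutex> lk(_mutex);
        auto notification = std::make_shared<VersionNotification>();
        _waiters.emplace(nss, Waiter{version, notification});
        return notification;
    }

    // Called whenever the loader finishes persisting a Task: signal every
    // waiter for this nss whose desired version has now been applied.
    void onTaskComplete(const std::string& nss, ChunkVersion persistedVersion) {
        std::lock_guard<std::mutex> lk(_mutex);
        auto range = _waiters.equal_range(nss);
        for (auto it = range.first; it != range.second;) {
            if (persistedVersion >= it->second.wantedVersion) {
                {
                    std::lock_guard<std::mutex> nlk(it->second.notification->m);
                    it->second.notification->reached = true;
                }
                it->second.notification->cv.notify_all();
                it = _waiters.erase(it);
            } else {
                ++it;
            }
        }
    }

private:
    struct Waiter {
        ChunkVersion wantedVersion;
        std::shared_ptr<VersionNotification> notification;
    };

    std::mutex _mutex;
    std::multimap<std::string, Waiter> _waiters;
};
```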