[SERVER-28248] stale secondary pings the primary to refresh chunks and waits for updates to propagate Created: 08/Mar/17  Updated: 06/Dec/17  Resolved: 06/Jul/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 3.5.10

Type: Task Priority: Major - P3
Reporter: Dianna Hohensee (Inactive) Assignee: Esha Maharishi (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-29239 Make shard secondaries refresh from s... Closed
Gantt Dependency
has to be done before SERVER-28948 open up secondaries to checking shard... Closed
Backwards Compatibility: Fully Compatible
Sprint: Sharding 2017-06-19, Sharding 2017-07-10, Sharding 2017-07-31
Participants:

 Description   

The secondary must know how long to wait for the latest chunk metadata to replicate to it, so it can refresh successfully.

The command will return the optime of the last chunk metadata write on the primary. The shard primary must always stash the optime of its last chunk metadata write from each refresh, so that it can return it when a secondary calls. If the command causes the primary to refresh, it will refresh and return the optime from that chunk metadata write operation. The secondary must send the expectedShardVersion in the command to the primary, in case the primary must go to the config server to refresh.
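
For illustration only, the request the secondary sends might look like the sketch below. The command name "_refreshChunkMetadata" and the reply field "lastMetadataWriteOpTime" are invented placeholders, not names chosen by this ticket.

// Built on the stale secondary and sent to the shard primary.
// Request:  { _refreshChunkMetadata: "db.coll", expectedShardVersion: <ChunkVersion> }
// Reply:    { ok: 1, lastMetadataWriteOpTime: <optime of the primary's last metadata write> }
BSONObj makeRefreshRequest(const NamespaceString& nss, const ChunkVersion& expectedShardVersion) {
    BSONObjBuilder cmd;
    cmd.append("_refreshChunkMetadata", nss.ns());                        // placeholder command name
    cmd.append("expectedShardVersion", expectedShardVersion.toString());  // serialization glossed over
    return cmd.obj();
}
// The secondary then waits for lastMetadataWriteOpTime to be applied locally before it
// reads the persisted chunk metadata.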



 Comments   
Comment by Githook User [ 06/Jul/17 ]

Author: Esha Maharishi <esha.maharishi@mongodb.com> (EshaMaharishi)

Message: SERVER-28248 stale secondary pings the primary to refresh chunks and waits for updates to propagate
Branch: master
https://github.com/mongodb/mongo/commit/5020fa1cb2ac9364f29f52bd14b49224ddc43c93

Comment by Githook User [ 05/Jul/17 ]

Author: Kaloian Manassiev <kaloian.manassiev@mongodb.com> (kaloianm)

Message: SERVER-28248 Fix move construction of NamespaceMetadataChangeNotifications::ScopedNotification
Branch: master
https://github.com/mongodb/mongo/commit/390bb2badbc53345945b83fdcb2402f3f9cb4964

Comment by Githook User [ 30/Jun/17 ]

Author: Kaloian Manassiev <kaloian.manassiev@mongodb.com> (kaloianm)

Message: SERVER-28248 Use Notification<void> in the VersionNotifications class

And rename the class to NamespaceMetadataChangeNotifications to better
reflect its new simplified purpose.
Branch: master
https://github.com/mongodb/mongo/commit/ebb1e4c6192bcf440ff008222689ec5f0a7a2d57

Comment by Esha Maharishi (Inactive) [ 12/Jun/17 ]

Meh, I dunno. I think as long as this map exists somewhere, and a client thread can specify its wantedVersion (possibly something we'll have to wire in), things should work...

Comment by Esha Maharishi (Inactive) [ 12/Jun/17 ]

dianna.hohensee, ok, we were thinking of making the OpObserver on the secondary update a map on the chunk loader, something like

map<string, pair<Notification, ChunkVersion>> latestVersion

where the string is the collection name (or later UUID).

The notification would be signaled when latestVersion is updated, and client threads would wait on the signal for their desired collection.

Based on what you said about the secondary shard's OpObserver invalidating cached chunk metadata on the CatalogCache, perhaps it makes more sense for the CatalogCache to own this map?
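
A minimal sketch of that shape (illustrative names; whichever object ends up owning the map, the idea is the same):

// Keyed by collection name (or, later, UUID).
stdx::mutex _latestVersionMutex;
std::map<std::string, std::pair<std::shared_ptr<Notification<void>>, ChunkVersion>> _latestVersion;

// OpObserver on the secondary, after a chunk metadata write for 'ns' is applied:
//   under _latestVersionMutex, store the new ChunkVersion, signal the Notification, and
//   replace it with a fresh one (each Notification can only be signaled once).
// Client thread wanting 'wantedVersion' for 'ns':
//   under _latestVersionMutex, copy out the shared Notification, release the mutex, wait on
//   it, then re-check the stored ChunkVersion and repeat until it is >= wantedVersion.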

Comment by Dianna Hohensee (Inactive) [ 12/Jun/17 ]

SERVER-29239 will need notification objects as well, so the secondary can avoid sleepmillis polling while it waits for an already-started refresh to finish before reading the shard's persisted chunk metadata. So I'll add the notifications map in SERVER-29239, make it generically usable for both scenarios, and include cleanup of the map when collections are dropped. This ticket and SERVER-29239 will have separate code using the notifications.

Comment by Dianna Hohensee (Inactive) [ 12/Jun/17 ]

The secondary will not have a wantedVersion to send to the shard chunk loader for refresh. For example, a secondary shard OpObserver will invalidate the cached chunk metadata on the CatalogCache. The CatalogCache will eventually get a caller asking for the chunk metadata, and see its cache is invalid (needsRefresh), then tell the shard chunk loader to refresh, without telling it what to refresh to.
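
Roughly the flow being described, as a sketch (the exact CatalogCache entry point isn't specified here; invalidateShardedCollection is used only for illustration):

// Secondary shard OpObserver, on an applied write to the persisted chunk metadata for 'nss':
void onChunkMetadataWriteApplied(OperationContext* opCtx, const NamespaceString& nss) {
    // Mark the in-memory routing info stale; no target version is known at this point.
    Grid::get(opCtx)->catalogCache()->invalidateShardedCollection(nss.ns());
}
// The next caller that asks the CatalogCache for this collection sees needsRefresh, and the
// CatalogCache tells the shard chunk loader to reload from the persisted metadata, without
// telling it which version it must reach.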

Comment by Kaloian Manassiev [ 07/Jun/17 ]

The first caller that installs the notification must call into the primary to force it to do a refresh (even if it is a noop refresh).

Comment by Esha Maharishi (Inactive) [ 07/Jun/17 ]

I don't think it works if you don't hold the mutex while reading the collection :/

Consider:

  1. Thread 1 reads the collection and finds it stale.
  2. The collection gets refreshed, and the OpObserver thread sees no entry for the collection, so it doesn't signal anything.
  3. Thread 1 locks the mutex, installs a notification, and waits on it.

Even if some second client thread has installed the notification so it does get signaled by the OpObserver, the first client thread can get stuck:

  1. Thread 1 reads the collection and finds it stale.
  2. Thread 2 reads the collection and finds it stale.
  3. Thread 2 locks the mutex and installs the notification.
  4. The OpObserver signals the notification, then removes it from the map.
  5. Thread 2 wakes up and sees the refreshed chunks.
  6. Thread 1 locks the mutex, installs a new notification, and waits on it.

kaloian.manassiev?

Comment by Kaloian Manassiev [ 07/Jun/17 ]

This looks good to me and like a better option than adding a synchronous refresh option to the CatalogCache, which is what I was trying to propose yesterday.

I have one suggestion though, because I see a possible race condition between checking whether the persisted version in the admin.system.chunks collection has been reached and proceeding to wait for the notification. This check is not atomic because you can't hold a mutex while reading the collection. A similar race condition exists with purging the notification entries.

What I suggest is:

  • Check if the on-disk version is LT what is expected and if not, no need to wait
  • Else, take the mutex protecting _refreshNotifications, install a shared_ptr notification if necessary (or if one already exists, get it), unlock the mutex and start waiting on the notification
  • When signaled, loop again and keep doing the same until the on-disk version becomes GTE what is expected

In the OpObserver, just take the _refreshNotifications mutex, check if there is an entry for the namespace and, if there is, signal it and remove it right away. That way we don't need to worry about purging or associating the notification with a particular version; it's just a plain old notification that something in the collection has changed. I would also check whether we already have such a mechanism in place for capped collections or for exhaust cursors, which could be reused (although I doubt it).
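
Putting this suggestion together with the missed-signal concern in the comment above: one race-free arrangement is to install (or join) the shared notification before reading the on-disk version, so a refresh can't complete unnoticed between the check and the wait. A sketch, reusing the member names from the proposal below and a hypothetical readOnDiskShardVersion() helper:

void waitForOnDiskVersion(OperationContext* opCtx, const NamespaceString& nss, const ChunkVersion& expected) {
    while (true) {
        // Install (or join) the notification first, so any refresh from here on will signal it.
        std::shared_ptr<Notification<void>> notif;
        {
            stdx::lock_guard<stdx::mutex> lg(_mutex);
            auto& entry = _refreshNotifications[nss];
            if (!entry) {
                entry = std::make_shared<Notification<void>>();
            }
            notif = entry;
        }

        // Read the persisted metadata outside the mutex (the collection read can't happen
        // under the lock). If it has already caught up, we're done; a leftover entry in the
        // map is harmless and gets erased on the next signal or on dropCollection.
        if (readOnDiskShardVersion(opCtx, nss) >= expected) {  // hypothetical helper
            return;
        }

        // Otherwise wait for the OpObserver to report that something in the collection
        // changed, then loop and re-check the on-disk version.
        notif->get(opCtx);
    }
}

// OpObserver side, exactly as described above: lock _mutex, look up nss and, if an entry
// exists, signal it and erase it. Spurious wakeups are handled by the re-check in the loop.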

Comment by Esha Maharishi (Inactive) [ 07/Jun/17 ]

Since we realized we'll need to plug in some notification machinery on the primary to wait for the (async) writes to the chunks collection, I thought it might be easier to just:

  • add an OpObserver that signals a notification when a particular collection has been refreshed, along with the refreshed-to version
  • wait (in a loop) on one thread for the signal that the collection has been refreshed to the wantedVersion
  • ping the primary on another thread using any existing command with the wanted shardVersion attached

Suggested implementation:

Put a map<NamespaceString, shared_ptr<Notification>> (where the key is the full namespace, or later database name + UUID?) in the ShardServerCatalogCacheLoader.

Then add a signalCollectionRefresh() and waitForCollectionRefreshIfNeeded() to it. Only the thread that waits to get the signal adds entries to the map.

stdx::mutex _mutex;

std::map<NamespaceString, std::shared_ptr<Notification<ChunkVersion>>> _refreshNotifications;

void waitForCollectionRefreshIfNeeded(OperationContext* opCtx, const NamespaceString& nss, ChunkVersion wantedVersion) {
    stdx::lock_guard<stdx::mutex> lock(_mutex);
    // Read the local data on disk to see if we are already at wantedVersion.

    // If not, schedule an asynchronous task to ping the primary and start waiting for the refresh notification.
    // Because we are under the lock between checking the local data and waiting,
    // we are guaranteed to receive the signal after we start waiting
    // even if the writes come in before we go to sleep,
    // because the OpObserver won't be able to take the lock to send the signal until we sleep.

    if (_refreshNotifications.find(nss) == _refreshNotifications.end()) {
        // This is the first thread to wait for this collection's refresh. Add an entry to the map.
        _refreshNotifications.emplace(nss, std::make_shared<Notification<ChunkVersion>>());
    }

    ChunkVersion foundVersion;
    do {
        // The wait is interruptible, because opCtx is passed.
        // TODO: do we have to unlock _mutex manually around this wait?
        foundVersion = _refreshNotifications.find(nss)->second->get(opCtx);
    } while (foundVersion < wantedVersion);
}

void signalCollectionRefresh(const NamespaceString& nss, ChunkVersion newVersion) {
    stdx::lock_guard<stdx::mutex> lock(_mutex);
    auto it = _refreshNotifications.find(nss);
    if (it == _refreshNotifications.end()) {
        // No one is waiting or has ever waited for this collection's refresh.
        return;
    }
    it->second->signal(newVersion);

    // Replace the notification with a new one, since each can only be signaled once.
    it->second = std::make_shared<Notification<ChunkVersion>>();
}

// Called on dropCollection.
void purgeCollectionEntry(const NamespaceString& nss) {
    stdx::lock_guard<stdx::mutex> lock(_mutex);
    _refreshNotifications.erase(nss);
}

Comment by Dianna Hohensee (Inactive) [ 31/May/17 ]

There are two ways to do this: an optimized solution and an unoptimized one. Whatever command is used, the secondary will send an nss and shardVersion to the primary and expect back an optime to wait on, which ensures that the latest metadata has finished propagating to the secondary before the secondary attempts to load it.

Where the code should go:

  • place an anonymous function on the shard chunk loader that takes a namespace and version. This function will run the command against the primary, parse the command response for the optime, and then wait on that optime before returning. It probably just needs to return a Status (a rough shape is sketched after this list).
  • Note: there is as yet no place to call this function, because SERVER-29239 is adding that code. The new function will be the first thing that the secondary shard chunk loader does on refresh.
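
The sketch below shows one possible shape for that function; the command name, reply field, and helpers (runCommandAgainstPrimary, makeRefreshRequest, waitForOpTimeLocally) are placeholders, not decisions made by this ticket.

// Stored on the shard chunk loader; the secondary invokes it at the start of a refresh.
auto refreshFromPrimaryAndWait = [](OperationContext* opCtx,
                                    const NamespaceString& nss,
                                    const ChunkVersion& expectedShardVersion) -> Status {
    // 1. Ask the primary to refresh if it is staler than expectedShardVersion and to report
    //    the optime of its last chunk metadata write (hypothetical command, see Description).
    auto swReply = runCommandAgainstPrimary(opCtx, makeRefreshRequest(nss, expectedShardVersion));
    if (!swReply.isOK()) {
        return swReply.getStatus();
    }

    // 2. Parse the optime of the primary's last chunk metadata write out of the reply.
    auto swOpTime = repl::OpTime::parseFromOplogEntry(swReply.getValue()["lastMetadataWriteOpTime"].Obj());
    if (!swOpTime.isOK()) {
        return swOpTime.getStatus();
    }

    // 3. Wait for that optime to be applied locally before the caller reads the persisted
    //    metadata (see the wait sketch under "Unoptimized solution" below).
    return waitForOpTimeLocally(opCtx, swOpTime.getValue());
};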

Unoptimized solution:

  • use setShardVersion to force a metadata refresh on the primary, if the primary is staler than the version sent. Parse the command response for the primary's last applied optime (supposedly this is returned; I have not confirmed). Wait on that optime. ReplicationCoordinator and read_concern.cpp have wait-for-optime functions that may work for this use case, or contain code that can be used to create a new helper function (a sketch follows this bullet).
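
For the "wait on that optime" piece, one rough sketch using the read concern machinery mentioned above; the exact helper names have shifted across server versions, so treat this as an approximation rather than the final code:

Status waitForOpTimeLocally(OperationContext* opCtx, const repl::OpTime& opTime) {
    // Build afterOpTime-style read concern args and let the replication coordinator block
    // until this node's last applied optime reaches the primary's metadata-write optime.
    repl::ReadConcernArgs readConcern(opTime, repl::ReadConcernLevel::kLocalReadConcern);
    return repl::ReplicationCoordinator::get(opCtx)->waitUntilOpTimeForRead(opCtx, readConcern);
}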

Optimized solution:

  • If we wish to use the more optimized approach, a new command must be created to retrieve the collection-specific metadata write optimes from the shard chunk loader, as well as new code for the shard chunk loader to store the optimes of its writes per collection.

Either approach is complicated by the fact that forcing a refresh doesn't force persistence of that metadata. The shard chunk loader persists metadata asynchronously, so it seems like we must wire in an additional notification somehow to wait for a particular Task to finish in the chunk loader. Perhaps expose a function on ShardServerCatalogCacheLoader, "Notification& registerNotification(nss, version)", and whenever a Task completes, check whether we have a notification registered and whether the task applied the desired version for nss (a sketch follows below). This actually means that there may not be a way to avoid making a new command.
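
A sketch of that registration hook; everything here is illustrative (the loader's real internals and member names are not from the code):

// Illustrative members on ShardServerCatalogCacheLoader:
stdx::mutex _notifMutex;
std::map<NamespaceString, std::vector<std::pair<ChunkVersion, std::shared_ptr<Notification<void>>>>> _taskWaiters;

// Called before a caller starts waiting for the metadata of 'nss' to reach 'version'.
std::shared_ptr<Notification<void>> registerNotification(const NamespaceString& nss, const ChunkVersion& version) {
    stdx::lock_guard<stdx::mutex> lg(_notifMutex);
    auto notif = std::make_shared<Notification<void>>();
    _taskWaiters[nss].emplace_back(version, notif);
    return notif;
}

// Called whenever a persistence Task for 'nss' completes, having applied 'persistedVersion'.
void onTaskCompleted(const NamespaceString& nss, const ChunkVersion& persistedVersion) {
    stdx::lock_guard<stdx::mutex> lg(_notifMutex);
    auto& waiters = _taskWaiters[nss];
    for (auto it = waiters.begin(); it != waiters.end();) {
        if (persistedVersion >= it->first) {
            it->second->set();  // the waiter's desired version is now persisted; wake it
            it = waiters.erase(it);
        } else {
            ++it;
        }
    }
}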


For either solution, I think we will actually need the secondary to know which version it should be refreshing to when its metadata is invalidated, say by a moveChunk moving a chunk out on the primary and then updating its metadata, which will (soon) cause an OpObserver on the secondary to invalidate the secondary's chunk metadata cache for that collection. This will be done in SERVER-29437 by changing ShardCollectionType::refreshSequenceNumber to a ChunkVersion, and does not hold up this ticket.
