Refreshes on secondaries are done in two steps:
- The secondary node asks the primary node to do a refresh with local write concern.
- The secondary node waits until the changes done by the primary are replicated.
The issue is how we implement the second step: we wait on the logical time (just a time, without the term component) associated with the oplog entry generated on the primary node. Note that this entry hasn't been majority committed, so it is entirely possible that another node steps up, does some writes, and this entry is eventually rolled back. Later, the secondary node might fetch an oplog entry with a logical time greater than the one it was waiting for, and it will assume that it has the changes associated with the refresh on the primary. However, that's not true: the entry it fetched belongs to the new primary's branch of history, and the refresh writes were rolled back.
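To make the race concrete, here is a minimal toy model of it. It is only a sketch: `OpTime`, `wait_satisfied_time_only` and the term/time values are hypothetical names and numbers chosen for illustration, not the server's actual types or code.

```python
from typing import List, NamedTuple


class OpTime(NamedTuple):
    """Hypothetical stand-in for an oplog position: (term, logical time)."""
    term: int
    ts: int


def wait_satisfied_time_only(last_applied: OpTime, target: OpTime) -> bool:
    # Models the wait described above: only the time component is compared,
    # the term is ignored.
    return last_applied.ts >= target.ts


def oplog_contains(oplog: List[OpTime], target: OpTime) -> bool:
    # Ground truth: did the secondary actually replicate that exact entry?
    return target in oplog


# The old primary (term 1) wrote the refresh entry at time 10, but it was
# never majority committed and got rolled back after a step-up.
refresh_entry = OpTime(term=1, ts=10)

# The secondary replicated from the new primary (term 2) instead.
secondary_oplog = [OpTime(1, 9), OpTime(2, 11), OpTime(2, 12)]
last_applied = secondary_oplog[-1]

time_only_ok = wait_satisfied_time_only(last_applied, refresh_entry)
has_refresh = oplog_contains(secondary_oplog, refresh_entry)
```

In this model the time-only wait reports success even though the refresh entry is not in the secondary's oplog, which is exactly the mismatch described above.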
Under that scenario, the ShardServerCatalogCacheLoader (SSCCL) might return metadata associated with a CollectionVersion older than what the CatalogCache already knows. The CatalogCache will then try to combine the SSCCL result with its local metadata, creating an inconsistent routing history: it will potentially contain the collection metadata we got from the SSCCL but the chunks we already had in the CatalogCache. Thus, the problem is that the new routing history has stale collection information.
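The resulting inconsistency can be illustrated with a toy merge. This is a sketch under assumptions: `LoaderResult`, `RoutingHistory` and `naive_combine` are hypothetical, the collection version is simplified to an integer, and the merge rule is a simplification that does not reproduce the real CatalogCache logic.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LoaderResult:
    """Hypothetical shape of what the loader hands back."""
    collection_version: int
    allow_migrations: bool
    chunks: Tuple[str, ...]


@dataclass(frozen=True)
class RoutingHistory:
    """Hypothetical shape of what the cache keeps."""
    collection_version: int
    allow_migrations: bool
    chunks: Tuple[str, ...]


def naive_combine(cached: RoutingHistory, loaded: LoaderResult) -> RoutingHistory:
    # Assumed merge rule: collection-level fields always come from the loader,
    # while chunks are replaced only when the loader is strictly newer.
    if loaded.collection_version > cached.collection_version:
        chunks, version = loaded.chunks, loaded.collection_version
    else:
        chunks, version = cached.chunks, cached.collection_version
    return RoutingHistory(version, loaded.allow_migrations, chunks)


# The cache already knows version 7, where migrations are disallowed.
cached = RoutingHistory(7, False, ("c1@7", "c2@7"))

# The loader returns stale data (version 5, migrations still allowed).
stale = LoaderResult(5, True, ("c1@5",))

combined = naive_combine(cached, stale)
```

Here `combined` keeps the version 7 chunks but carries the version 5 value of `allow_migrations`, i.e. stale collection information attached to an up-to-date chunk set.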
We believe that this could be problematic for the two fields in config.collections that are mutable and replicated to shards: allowMigrations and reshardingFields.
What is the behavior under this scenario? In the 5.0 binary we would hit one of these two invariants stating that we found different collection information for the same collection version. In 5.1 or more recent versions the CatalogCache is not going to hit an invariant, so users might experience incorrect execution of ongoing DDL operations.
Affected versions: I took a look at 3.6 and it already has this problem.