Core Server / SERVER-47549

Change streams can't keep causal consistency when moveChunk happens

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Querying
    • Labels: None

      I want to use a change stream to listen to all change events from a source sharded cluster and apply them to a target MongoDB deployment, so that the target behaves like a secondary of the source. However, after going through the source code, I think a change stream cannot keep causal consistency when a moveChunk happens. For a better explanation, I drew a picture, attached to this issue.
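
      To show the kind of replication loop I mean, here is a minimal sketch using pymongo; the connection strings, database and collection names are placeholders, not part of this report.

{code:python}
# Minimal sketch of tailing the source cluster's change stream and applying
# events to the target; connection strings and namespace are placeholders.
from pymongo import MongoClient

source = MongoClient("mongodb://mongos.example.net:27017")   # sharded source, via mongos
target = MongoClient("mongodb://target.example.net:27017")   # target deployment

with source["app"]["orders"].watch(full_document="updateLookup") as stream:
    for change in stream:
        op = change["operationType"]
        coll = target["app"]["orders"]
        if op in ("insert", "replace", "update"):
            doc = change.get("fullDocument")
            if doc is not None:
                # Upsert the post-image so the target converges to the source state.
                coll.replace_one(change["documentKey"], doc, upsert=True)
        elif op == "delete":
            coll.delete_one(change["documentKey"])
        # change["_id"] is the resume token; persist it to resume after restarts.
{code}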

      In this picture, there are two shards. mongos opens two cursors to fetch change stream events, which are a transformation of the oplog entries on each mongod. mongos then runs a merge stage that sorts the events by "ts + uuid + documentKey". Everything is fine even when a moveChunk happens, because the hybrid logical clock keeps causal consistency across shards. However, there is one corner case that may cause a bug: what if shard2's cursor is so much faster than shard1's that, in my moveChunk case, shard2 has already fetched past "ts=A2" while shard1 has not yet reached "ts=A1"? If shard2's event is emitted first and shard1 only catches up afterwards, the writes for key A are applied as "set a = 2" and then "set a = 1", so the target ends with a = 1, while the real final value is 2.
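
      To make the race concrete, here is a toy model of the merge (not server code; the shard names, timestamps and values are made up to mirror the picture above). Emitting events in arrival order inverts the two writes to key A, while a merge that waits for both cursors to pass the relevant cluster time keeps them in causal order:

{code:python}
import heapq

# Toy model of the mongos merge stage; events are (sort_key, op) where the
# sort key is (clusterTime, uuid, documentKey) as described above.
shard1_events = [((5, "uuid-A", "A"), "set a = 1")]   # earlier write, slow cursor
shard2_events = [((7, "uuid-A", "A"), "set a = 2")]   # later write, fast cursor

def emit_in_arrival_order():
    # The feared behaviour: shard2's batch is delivered first, so the
    # consumer applies a = 2 and then a = 1 and ends with the wrong value.
    return shard2_events + shard1_events

def emit_after_both_cursors_caught_up():
    # Once every shard cursor has advanced past clusterTime 7, a k-way
    # merge by sort key restores the causal order.
    return list(heapq.merge(shard1_events, shard2_events, key=lambda e: e[0]))

print([op for _, op in emit_in_arrival_order()])             # ['set a = 2', 'set a = 1']
print([op for _, op in emit_after_both_cursors_caught_up()]) # ['set a = 1', 'set a = 2']
{code}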

      There may be another case, not related to the previous example, that loses data:
      Since v3.6, MongoDB uses logical time for timestamps, which keeps causal consistency, so different mongos nodes may hold different cluster times. When opening a change stream, mongos uses its local logical time if no afterClusterTime option is given. Can this cause data loss? For example, suppose mongos1's time is 10:00, mongos2's is 10:02, shard1's is 09:59 and shard2's is 10:01. If a user sends the change stream command to mongos1, mongos1 sends the aggregate command to each shard starting from 10:00, and the shard1 oplog entries from 09:59 to 10:00 will be lost. I also found SERVER-31767 (https://jira.mongodb.org/browse/SERVER-31767); does that mean this problem has been solved since v4.1.1 by a global point in time?
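
      For reference, drivers can already pin the start point explicitly with startAtOperationTime (MongoDB 4.0+), so every shard scans from the same cluster time rather than one derived from a mongos's local clock. A pymongo sketch, where the connection string, namespace and timestamp value are only illustrative:

{code:python}
# Sketch: open the change stream at an explicit cluster time so all shards
# start from the same point (startAtOperationTime, MongoDB >= 4.0).
# The timestamp below is a placeholder; in practice it would come from the
# operationTime of an earlier command reply or from a saved resume token.
from bson.timestamp import Timestamp
from pymongo import MongoClient

client = MongoClient("mongodb://mongos.example.net:27017")
start = Timestamp(1588000000, 1)  # (seconds since epoch, increment)

with client["app"]["orders"].watch(start_at_operation_time=start) as stream:
    for change in stream:
        print(change["operationType"], change["documentKey"])
{code}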

      In my opinion, it would be useful to add a "wait policy" on mongos: for example, if shard2 returns events with ts=05:10, mongos caches them until the oplog entries older than 05:10 from all other shards have been received, and only then replies to the user.
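
      A rough sketch of what I mean (the class and method names below are mine, not a proposed patch): the merge keeps a per-shard high-water mark of the newest cluster time it has seen and only releases buffered events that are no newer than the slowest shard, so no shard's events can be overtaken.

{code:python}
import heapq

class WaitingMerger:
    """Hypothetical mongos-side merge that buffers change events per shard and
    releases them only once every shard cursor has advanced at least as far."""

    def __init__(self, shard_ids):
        self.high_water = {s: 0 for s in shard_ids}  # newest clusterTime seen per shard
        self.buffer = []                              # min-heap of (sort_key, event)

    def on_event(self, shard_id, cluster_time, uuid, document_key, event):
        self.high_water[shard_id] = max(self.high_water[shard_id], cluster_time)
        heapq.heappush(self.buffer, ((cluster_time, uuid, document_key), event))
        return self._drain()

    def on_idle(self, shard_id, cluster_time):
        # Idle shards still have to advance their high-water mark (e.g. via
        # periodic no-op oplog entries), otherwise the merge stalls forever.
        self.high_water[shard_id] = max(self.high_water[shard_id], cluster_time)
        return self._drain()

    def _drain(self):
        # Only events no newer than the slowest shard are safe to emit.
        safe_until = min(self.high_water.values())
        out = []
        while self.buffer and self.buffer[0][0][0] <= safe_until:
            out.append(heapq.heappop(self.buffer)[1])
        return out

m = WaitingMerger(["shard1", "shard2"])
print(m.on_event("shard2", 7, "uuid-A", "A", "set a = 2"))  # [] - held back, shard1 unknown
print(m.on_event("shard1", 5, "uuid-A", "A", "set a = 1"))  # ['set a = 1']
print(m.on_idle("shard1", 8))                               # ['set a = 2']
{code}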

            Assignee: Backlog - Triage Team
            Reporter: vinllen chen
