[SERVER-75993] Change Streams: Document key is not fully captured Created: 12/Apr/23  Updated: 30/Jun/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: 6.0.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ajay Mathias Assignee: Backlog - Query Execution
Resolution: Unresolved Votes: 1
Labels: changestreams
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-75607 Change Stream: Shard keys not capture... Closed
Assigned Teams:
Query Execution
Operating System: ALL
Steps To Reproduce:

Environment used:

MongoDB Atlas M30 cluster, version: 6.0.5

Participants:

 Description   

The DocumentKey field of the change event does not capture a shard key if the value of the shard key is null

Consider a collection emp with a shard key name.

Inserting a document with no value for name, like
{{db.emp.insertOne({"_id":"1001"}) 
}}produces a change event with DocumentKey that contains only the _id field
fullDocument=Document{{_id=1002, documentKey={"_id": "1001"}}}

Similarly, updating the name field does not have the previous value of the shard key in the document key. That is,
db.emp.updateOne({"_id":"1001","name":null},{$set:{"name":"Adam") }}also produces a change event with DocumentKey that contains only the _id field{{{}
documentKey={"_id": "1001"}, updateDescription=UpdateDescription{removedFields=[], updatedFields=

{"name": "Adam"}

, truncatedArrays=[]}{}}}

In both these operations, the DocumentKey field should contain a field for the name as well, like documentKey={"_id": "1001", "name": null}

 

In our use case, we rely on the DocumentKey field to know the shard keys and their previous values. As a workaround we couldn't depend on retrieving the shard key from the config database when Change Streams is run as there is a possibility shard keys may have been refined between when the operation was performed and Change Streams was run.

Expected behavior:
Provide all the shard keys in DocumentKey even if their value is null.



 Comments   
Comment by Bernard Gorman [ 18/May/23 ]

Hi ajay2589@gmail.com,

Thank you for raising this issue to our attention. The behaviour you're observing is due to the semantics of null and missing, and the fact that our query language treats a query for null as matching both documents where a field is explicitly null and where that field is absent.

When you first insert the document {"_id":"1001"}, the value of "name" is NOT actually null in this case, it is missing. When we started allowing shard keys to be omitted, we took the approach of treating them as-if those keys were null for data-distribution and query-routing purposes, so documents where a shard key field is missing clump together with documents where the field is null.

We did this, as mentioned above, because queries for null match documents where that field is explicitly null or where it is missing, which is a special behaviour of our query language. This is why your update on {"_id":"1001", "name":null} matches this document; if you were to instead use a query that actually does match only null:

db.emp.updateOne({"_id":"1001", $expr: {$eq: ["$name", null]}}, {$set:{"name":"Adam"}})

... then it would not match or update this document.

But a document with an explicitly null field and one in which the field is missing are still in two distinct states, which is what the documentKey in the change stream output is indicating. We cannot simply report a non-existent shard key field as null, because then use-cases which rely upon distinguishing between shard keys that are present or absent would have no way to differentiate between these two states.

As shown below, if the name field is set to explicitly null then subsequent updates will indeed show that field in the documentKey. Incidentally, the reason that the name field is still absent in the update documentKey even though you set {name: "Adam"} is because the documentKey is from before the update occurred - otherwise, there would be no way to tell which document the update applied to.

[direct: mongos] test> db.emp.insertOne({ "_id": "1001" })
 
[direct: mongos] test> var csCursor = db.emp.watch()
 
// Note that we're explicitly setting the value of "name" from <missing> to <null> here.
// If it was already null, this would be a no-op and no update would occur.
[direct: mongos] test> db.emp.updateOne({"_id":"1001", "name": null}, {$set:{"name": null}})
[direct: mongos] test> csCursor.next()
{
  _id: {
    _data: '...'
  },
  operationType: 'update',
  clusterTime: Timestamp({ t: 1684401795, i: 1 }),
  wallTime: ISODate("2023-05-18T09:23:15.126Z"),
  ns: { db: 'test', coll: 'emp' },
  documentKey: { _id: '1001' }, // "name" still missing because documentKey is *before* update
  updateDescription: {
    updatedFields: { name: null },
    removedFields: [],
    truncatedArrays: []
  }
}
 
[direct: mongos] test> db.emp.updateOne({"_id":"1001", "name": null},{$set:{"name": "Adam"} })
[direct: mongos] test> csCursor.next()
{
  _id: {
    _data: '...'
  },
  operationType: 'update',
  clusterTime: Timestamp({ t: 1684401826, i: 29 }),
  wallTime: ISODate("2023-05-18T09:23:46.925Z"),
  ns: { db: 'test', coll: 'emp' },
  documentKey: { name: null, _id: '1001' }, // "name" is now reported as explicitly null
  updateDescription: {
    updatedFields: { name: 'Adam' },
    removedFields: [],
    truncatedArrays: []
  }
}

I hope this helps to clarify the behaviour you're seeing. We plan to revisit our null and missing semantics in the future, so I'll add this ticket to the set of issues for consideration as part of that effort.

Generated at Thu Feb 08 06:31:33 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.