[SERVER-49161] Change stream event with empty fullDocument field for existing document Created: 29/Jun/20  Updated: 11/Oct/21  Resolved: 05/Oct/21

Status: Closed
Project: Core Server
Component/s: Change streams, Sharding
Affects Version/s: 4.0.19, 4.2.8, 5.0.0, 4.4.7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Artem Navrotskiy Assignee: Rishab Joshi (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

I tested on host:

  • OS: Ubuntu 20.04 LTS
  • CPU: Intel i7-8700K
  • RAM: 64Gb
  • SSD: Samsung NVMe 500Gb

Attachments: File change_streams_empty_document.js    
Issue Links:
Depends
depends on SERVER-55309 Re-order post-image lookup stage when... Closed
Related
is related to SERVER-55309 Re-order post-image lookup stage when... Closed
Operating System: ALL
Steps To Reproduce:

On Linux host:

Sprint: Query 2020-11-30, Query 2020-12-14, Query 2020-12-28, Query 2021-01-11, Query 2021-01-25, Query Execution 2021-02-22, Query Execution 2021-03-08, Query Execution 2021-03-22, Query Execution 2021-04-05, Query Execution 2021-04-19, Sharding EMEA 2021-07-12, QE 2021-10-18
Participants:

 Description   

On investigation flaky test in our codebase I found strange issue: sometimes I got change stream update event with empty fullDocument field for existing document.

I reproduced this behaviour by looping with same collection:

  • drop collection;
  • shard collection;
  • open change stream;
  • run background thread which add ~2000 records and then update ~100 records;
  • read change stream and check then update event has non-empty fullDocument.

Environment details:

  • one shard from single-node replicaset;
  • single-node configuration replicaset;
  • 3 mongos.

On my computer, a spike of errors occurs in the first 10-30 iterations. Then the probability of errors greatly decreases.

I wrote a simple program with Golang for reproducing this issue (main.go). This program uses go.mongodb.org/mongo-driver v1.3.4 for MongoDB access but originally this issue was reproduced with github.com/globalsign/mgo fork.



 Comments   
Comment by Rishab Joshi (Inactive) [ 05/Oct/21 ]

I ran the scenarios multiple times with/without change stream optimization feature flag. I can see that everytime with feature flag disabled the issue happens and with the feature flag enabled the issue does not happen. It looks like the post-image stage reordering work fixed the issue. I can't see this issue happening now and hence marking it as a duplicate of the post image work done as part of the change stream optimization project.

Comment by Tommaso Tocci [ 15/Jul/21 ]

I've been able to reproduce the problem using this js test: change_streams_empty_document.js you can run it through resmoke by using:

resmoke.py --mongosSetParameters="{logComponentVerbosity: {sharding: 3}}" --suites=sharding <test>

From what I understood this is what is happening:

  1. mongosB has information about collection "db.myColl" with UUID_1
  2. "db.myColl" gets dropped and recreated through mongosA and the new collection has UUID_2
  3. a ChangeStream with "fullDocument" option is opened for "db.myColl" on mongosA (that has still stale metadata for the collection)
  4. An update event arrives to mongosB through the opened ChangeStream
  5. mognosB calls "MongosProcessInterface::[lookupSingleDocument" passing the UUID_2 contained in the resume token of the change stream event|https://github.com/mongodb/mongo/blob/331759573b9e60847a605e4d09292293e7cfa472/src/mongo/db/pipeline/document_source_change_stream_lookup_post_image.cpp#L133-L134].
  6. At this point mongosB will realize that the UUID_2 received with the event don't match the one on the cached routing table (UUID_1) and it will through  a NamepsaceNotFound exception
  7. The exception will be catched by the lookupSingleDocument function that will simply return an empty "fullDocument".

In general the information that we get through the ChangeStream could be misaligned with respect to the metadata that we have cached on the mongos. In fact we could have:

  • ChangeStream behind local metadata: mongos receives events for an old collection that have been already dropped and for which we don't have any information cached anymore.
  • ChangeStream ahead of local metadata: mongos receives events for a new collection for which we don't still have cached information

The problem described in this ticket is a symptom of the latter.

Unfortunately by just comparing the received UUID with the cached one we can't differentiate between the two situations, because we don't know which of the two UUID is newer. Recently we added a timestamp field to the collection entry that can be used to properly compare collection's metadata.

I'm not sure if we can use the oplog entry timestamp to campare it with the collection metadata one, otherwise another solution solution  would be to add the collection timestamp along with the UUID in the ShangeStream events.

Generated at Thu Feb 08 05:19:06 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.