[SERVER-35038] Violations of linearizability in MongoDB 3.6.4 Created: 17/May/18  Updated: 13/Jun/18  Resolved: 13/Jun/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Denis Rystsov Assignee: William Schultz (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Sprint: Repl 2018-06-18
Participants:

 Description   

Hey folks,

It looks like I found a violation of linearizability in MongoDB 3.6.4 - it returns stale data after a network partition is healed.

The docker scripts with repro steps: https://github.com/rystsov/consistency-mongodb
Logs: https://github.com/rystsov/consistency-mongodb/tree/master/experiments/may-17

Structure of the test:

  • Four nodes in the system
  • Three of them host a MongoDB replica set
  • The same three nodes host a thin Node.js application that talks with the DB and exposes an HTTP interface to read, create and update a key/value pair
  • The fourth node is a client that works with the HTTP interface of each node and detects consistency violations

The client has three async/await coroutines that sequentially update the value of their keys using an increasing sequence. Each coroutine works with its own key and node: updates to key1 go to node1, key2 to node2 and key3 to node3. There are also nine reading coroutines: three per node, reading the keys "key1", "key2" and "key3".

All reads were done with "linearizable" read concern and "majority" write concern.
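The writer/reader structure above can be sketched with a stand-in in-memory store (hypothetical names, not the actual repro code; the real test issues writes with "majority" write concern and "linearizable" reads against the HTTP apps):

```javascript
// Stand-in for the HTTP key-value interface; the real test talks to a
// Node.js app in front of MongoDB.
const store = new Map();
const kv = {
  async write(key, value) { store.set(key, value); },
  async read(key) { return store.get(key); },
};

// Writer coroutine: updates its own key with an increasing sequence.
async function writer(key, count) {
  for (let v = 0; v < count; v++) {
    await kv.write(key, String(v)); // real test: "majority" write concern
  }
}

// Reader coroutine: repeatedly reads a key; values must never go backwards
// when the store is linearizable and the single writer is monotonic.
async function reader(key, count) {
  let prev = -1;
  for (let i = 0; i < count; i++) {
    const v = Number(await kv.read(key)); // real test: "linearizable" reads
    if (!Number.isNaN(v) && v < prev) throw new Error("stale read: " + v);
    if (!Number.isNaN(v)) prev = v;
  }
}

async function main() {
  // Three writers (one per key/node) and a reader per key, interleaved.
  await Promise.all([
    writer("key1", 100), writer("key2", 100), writer("key3", 100),
    reader("key1", 100), reader("key2", 100), reader("key3", 100),
  ]);
  console.log("no violations"); // the in-memory store is trivially linearizable
}
main();
```

With the in-memory store this always passes; the point of the repro is that the same invariant check is run against the replica set across partitions.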

It looks like the bug may be in the replication layer of the database or in the MongoDB Node.js driver (3.1.0-beta4).



 Comments   
Comment by William Schultz (Inactive) [ 13/Jun/18 ]

Hello rystsov, thanks for the bug report. I see now that you've already acknowledged a bug in your testing setup, but I figured I might as well follow up just for reference, since I had already done some investigation on this.

After reading through your code and reproducing your experiments myself, I had a few thoughts. As I understand it, in the histories generated by your test scripts, write-start events are paired with write-ack events and, similarly, read-start events are paired with read-ack events. The writer threads execute their updates sequentially, and the values being updated increase monotonically. This all seems reasonable.

There is something that appears problematic to me, however, with how you detect linearizability violations. At the beginning of the read function, you try to verify linearizability with the following procedure. You first look for the index of the last acknowledged write attempt to key K in that key's write history; call that index i_last. Then, you do a linearizable read of key K and save that value; call it v_read. Finally, you step forward through the history of update attempts for key K, starting at i_last, looking for an update with a value that matches v_read. If you find one, the read function finishes without raising an error. If you don't find an update with a value equal to v_read, you consider it an error, producing the 'never written or stale data' message.

This verification method, though, doesn't appear to take into account previous writes that received no definite acknowledgement or error. Writes that never return a response should be allowed to take effect at any later time without violating linearizability.
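To make the flaw concrete, here is a sketch (with hypothetical names, not your actual code) of the verification procedure as described above, run against a history in which the attempts 371..374 never receive an acknowledgement and only 375 does:

```javascript
// Sketch of the described check: start at the last acknowledged write
// attempt (i_last) and scan forward for a value matching the read.
function flawedCheck(writeHistory, vRead) {
  let iLast = -1;
  for (let i = 0; i < writeHistory.length; i++) {
    if (writeHistory[i].acked) iLast = i;
  }
  for (let i = Math.max(iLast, 0); i < writeHistory.length; i++) {
    if (writeHistory[i].value === vRead) return true; // read looks fine
  }
  return false; // flagged as "never written or stale data"
}

// Attempts 371..374 never got an ack; only 375 did.
const history = [371, 372, 373, 374, 375, 376].map(v => ({
  value: String(v),
  acked: v === 375,
}));

// A read of 375 passes, but a later read of 371 is flagged as a
// violation even though the pending 371 write may still take effect.
console.log(flawedCheck(history, "375")); // true
console.log(flawedCheck(history, "371")); // false: a false positive
```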

As an example of this, consider the event history from your may-17 experiment. You claim that a linearizability violation is observed in the following set of events (filtered to include only key3 events on node1, with log line numbers included for reference):

7654: {"key":"key3","from":"node1","event":"read-start"}
7671: {"key":"key3","from":"node1","value":"375","event":"read-ack"}
7672: {"key":"key3","from":"node1","event":"read-start"}
7686: {"key":"key3","from":"node1","value":"371","event":"read-ack"}

We see two non-concurrent reads of key3 from node1: the earlier one returns 375 and the later one returns 371. You claim that this cannot happen because each writer thread updates its key sequentially and monotonically. What is not shown, however, are the write-related events for the relevant values (371 and 375), which happen a short time before. This is the part of the event sequence for (key3, node1) read/write events leading up to the violation you described:

7601:{"key":"key3","value":"371","event":"write-start"}
7606:{"key":"key3","from":"node1","event":"read-start"}
7613:{"key":"key3","value":"372","event":"write-start"}
7618:{"key":"key3","from":"node1","event":"read-start"}
7625:{"key":"key3","value":"373","event":"write-start"}
7630:{"key":"key3","from":"node1","event":"read-start"}
7637:{"key":"key3","value":"374","event":"write-start"}
7642:{"key":"key3","from":"node1","event":"read-start"}
7649:{"key":"key3","value":"375","event":"write-start"}
7654:{"key":"key3","from":"node1","event":"read-start"}
7667:{"key":"key3","event":"write-ack"}
7668:{"key":"key3","value":"376","event":"write-start"}
7671:{"key":"key3","from":"node1","value":"375","event":"read-ack"}
7672:{"key":"key3","from":"node1","event":"read-start"}
7686:{"key":"key3","from":"node1","value":"371","event":"read-ack"}

It is clear that the updates to key3 for 371..374 all went unacknowledged. If an invocation of a write never receives a response, it is valid for it to take effect at any later time. In particular, it would be valid for the key3=371 update to take effect immediately after the read-ack of key3=375 completed, causing the next read to return 371, which is exactly what we see in this history. Does this illustrate the same issue that you mentioned as the "bug in your testing setting"?
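A checker that accounts for this could look as follows (again a hedged sketch with hypothetical names): treat any write that received neither an acknowledgement nor an error as pending, and accept a read value that matches the last acknowledged write, any later attempt, or any pending earlier attempt:

```javascript
// A value is a legal read result if it comes from the last acknowledged
// write, any later attempt, or an earlier attempt that never got a
// response (such a write may take effect at any later time).
function checkRead(writeHistory, vRead) {
  let iLast = -1;
  for (let i = 0; i < writeHistory.length; i++) {
    if (writeHistory[i].acked) iLast = i;
  }
  return writeHistory.some((w, i) =>
    (i >= iLast || w.pending) && w.value === vRead
  );
}

// The (key3, node1) history above: 371..374 pending, 375 acked.
const history = [
  { value: "371", pending: true },
  { value: "372", pending: true },
  { value: "373", pending: true },
  { value: "374", pending: true },
  { value: "375", acked: true },
  { value: "376", pending: true },
];

// Both observed reads are now accepted: 375 from the acked write,
// 371 from a pending write that may take effect late.
console.log(checkRead(history, "375")); // true
console.log(checkRead(history, "371")); // true
console.log(checkRead(history, "370")); // false: never attempted
```

This is a simplification: a full linearizability checker (for example Knossos, as used by Jepsen) also enforces real-time ordering constraints between operations, but it captures the specific false positive discussed here.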

Comment by Kelsey Schubert [ 13/Jun/18 ]

Thanks for following up, rystsov! I'm glad you were able to determine the root cause of the issue.

Comment by Denis Rystsov [ 13/Jun/18 ]

Everything works as expected; it was a bug in my testing setting. Sorry for the bother.

Comment by Denis Rystsov [ 12/Jun/18 ]

Also, I managed to reproduce the issue with Jepsen: https://github.com/rystsov/consistency-mongodb/tree/jepsen

Comment by Andy Schwerin [ 24/May/18 ]

Thanks for the report, rystsov, and thanks for sharing the code of your reproduction. It will take us a little while to thoroughly analyze the repro, but you've given us a lot to get started with.

Comment by Denis Rystsov [ 24/May/18 ]

By the way, I've updated the repo and reimplemented the HTTP API using the Java driver, and I was still able to run into stale data; logs: https://github.com/rystsov/consistency-mongodb/tree/master/experiments/may-24. The only difference is that I saw violations only after I isolated/rejoined a primary twice (see the repro steps).

So it looks like the problem is in the replication layer rather than in the Node.js driver.

Comment by Denis Rystsov [ 18/May/18 ]

Unexpected behavior: MongoDB returned stale data after a formerly isolated primary rejoined the cluster.

Details:

I was updating key key3 via node3 and reading it from all nodes (each node in the replica set also hosted an app exposing a key-value read/write HTTP interface). At the beginning of the experiment node1 was primary; I isolated it from the network using iptables rules, waited 20 seconds, and then rejoined node1 to the cluster. Just after that, the client read key key3 via node1 and received data older than the last known written value at the moment of the request.

Logs: https://github.com/rystsov/consistency-mongodb/tree/master/experiments/may-17

  • "node{1-3}.log" are the mongod logs for each node
  • "events" contains the start and end events of read/write operations, ordered by wall clock (Node.js is single-threaded, so there is no ordering ambiguity)

Using "events" we can see that the client first read "key3=375" and then read "key3=371", which should be impossible because there is only a single writer whose updates form an increasing sequence, so every subsequent read must return a value no smaller than the previous one:

{"key":"key3","from":"node1","event":"read-start"}
...
{"key":"key3","from":"node1","value":"375","event":"read-ack"}
{"key":"key3","from":"node1","event":"read-start"}
...
{"key":"key3","from":"node1","value":"371","event":"read-ack"}

Versions:

macOS : 10.13.3 (17D47)
Docker: 18.03.0-ce, build 0520e24
Ubuntu: 17.10
MongoDB: v3.6.4
MongoDB Node.js driver: 3.1.0-beta4

Generated at Thu Feb 08 04:38:38 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.