[SERVER-35038] Violations of linearizability in MongoDB 3.6.4 Created: 17/May/18 Updated: 13/Jun/18 Resolved: 13/Jun/18
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Denis Rystsov | Assignee: | William Schultz (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Operating System: | ALL |
| Sprint: | Repl 2018-06-18 |
| Participants: |
| Description |
|
Hey folks, it looks like I found a violation of linearizability in MongoDB 3.6.4: it returns stale data after a network partition is healed. Docker scripts with repro steps: https://github.com/rystsov/consistency-mongodb

Structure of the test:
The client has three async/await coroutines that sequentially update the value of their own key with an increasing sequence of numbers. Each writer coroutine works with its own key and node: updates to key1 go to node1, key2 to node2, and key3 to node3. There are also nine reader coroutines: three coroutines per node, reading the "key1", "key2", and "key3" keys. All reads use the "linearizable" read concern and all writes use the "majority" write concern. It looks like the bug may be in the replication layer of the database or in the MongoDB driver for Node.js (3.1.0-beta4).
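For reference, a minimal sketch of that structure (hypothetical names and URIs, not the actual repro code), using the Node.js driver with majority-acknowledged writes and linearizable reads:

```javascript
const { MongoClient } = require('mongodb');

// Writer coroutine: sequentially updates a single key on a single node,
// using an increasing sequence of values and the "majority" write concern.
async function writer(nodeUri, key) {
  const client = await MongoClient.connect(nodeUri);
  const coll = client.db('test').collection('kv');
  for (let value = 1; value <= 1000; value++) {
    await coll.updateOne(
      { _id: key },
      { $set: { value } },
      { upsert: true, w: 'majority' }
    );
  }
  await client.close();
}

// Reader coroutine: repeatedly reads a single key from a single node
// with the "linearizable" read concern.
async function reader(nodeUri, key) {
  const client = await MongoClient.connect(nodeUri);
  const coll = client
    .db('test')
    .collection('kv', { readConcern: { level: 'linearizable' } });
  for (let i = 0; i < 1000; i++) {
    const doc = await coll.findOne({ _id: key });
    console.log(`${nodeUri} ${key} = ${doc ? doc.value : null}`);
  }
  await client.close();
}

// key1 is written via node1, key2 via node2, key3 via node3, and every key
// is read from every node (nine reader coroutines in total), e.g.:
//   writer('mongodb://node3:27017', 'key3');
//   reader('mongodb://node1:27017', 'key3');
```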
| Comments |
| Comment by William Schultz (Inactive) [ 13/Jun/18 ] | |||||||||||||||||||
|
Hello rystsov, thanks for the bug report. I see now that you've already acknowledged a bug in your testing setup, but I figured I might as well follow up for reference, since I had already done some investigation on this. After reading through your code and reproducing your experiments myself, I had a few thoughts.

As I understand it, in the histories generated by your test scripts, write-start events are paired with write-ack events and, similarly, read-start events are paired with read-ack events. The writer threads execute their updates sequentially, and the values being written increase monotonically. This all seems reasonable. There is something that appears problematic to me, however, in how you detect linearizability violations. At the beginning of this read function, you try to verify linearizability with the following procedure. You first look for the index of the last acknowledged write attempt to key K in that key's write history; call that index i_last. Then you do a linearizable read of key K and save the returned value; call it v_read. Finally, you step forward through the history of update attempts for key K, starting at i_last, looking for an update whose value matches v_read. If you find one, the read function finishes without raising an error. If you don't find an update with a value equal to v_read, you consider it an error and produce the 'never written or stale data' message.

This verification method, though, doesn't appear to take into account earlier writes that received no definite acknowledgement or error. Writes that never return a response should be allowed to take effect at any later time without violating linearizability.
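As a rough illustration (hypothetical names, not the actual test code), the check described above amounts to something like the following:

```javascript
// Rough illustration of the verification procedure described above
// (hypothetical names, not the actual test code).
//
// writeHistory: array of { value, acked } entries for key K, in the order
//               the writer attempted them.
// vRead:        the value returned by a linearizable read of key K.
function checkRead(writeHistory, vRead) {
  // i_last: index of the last acknowledged write attempt to key K.
  let iLast = -1;
  for (let i = 0; i < writeHistory.length; i++) {
    if (writeHistory[i].acked) iLast = i;
  }
  // Step forward from i_last, looking for an update matching v_read.
  for (let i = Math.max(iLast, 0); i < writeHistory.length; i++) {
    if (writeHistory[i].value === vRead) return 'ok';
  }
  // Problem: unacknowledged writes *before* i_last are never considered,
  // even though they may legally take effect at any later time.
  return 'never written or stale data';
}
```

With a write history like 371 (no ack), 372 (no ack), 373 (no ack), 374 (no ack), 375 (acked), a subsequent read returning 371 would be reported as 'never written or stale data', even though linearizability permits it.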
As an example of this, consider the events history from your may-17 experiment. You claim that the linearizability violation is observed in the following set of events (filtered to only include key3/node1 events, with line numbers included for reference):

We see two non-concurrent reads of key3 from node1. The earlier one returns 375 and the later one returns 371. You claim that this cannot be possible because each writer thread updates its key sequentially and monotonically. What is not shown, however, are the write-related events for the relevant values, i.e. 371 and 375, which happen a short time before. This is the part of the event sequence for (key3, node1) read/write events leading up to the violation you described:
It is clear that the updates to key3 for values 371 through 374 all went unacknowledged. If an invocation of a write never receives a response, it is valid for that write to take effect at any later time. This means it would be valid for the key3=371 update to occur immediately after the read-ack of key3=375 completed, causing the next read to return 371, which is what we see in this history. Does this illustrate the same issue that you mentioned as the "bug in your testing setup"?
| Comment by Kelsey Schubert [ 13/Jun/18 ] | |||||||||||||||||||
|
Thanks for following up, rystsov! I'm glad you were able to determine the root cause of the issue.
| Comment by Denis Rystsov [ 13/Jun/18 ] | |||||||||||||||||||
|
Everything works as expected; it was a bug in my testing setup. Sorry for bothering you.
| Comment by Denis Rystsov [ 12/Jun/18 ] | |||||||||||||||||||
|
I also managed to reproduce the issue with Jepsen: https://github.com/rystsov/consistency-mongodb/tree/jepsen
| Comment by Andy Schwerin [ 24/May/18 ] | |||||||||||||||||||
|
Thanks for the report, rystsov, and thanks for sharing the code of your reproduction. It will take us a little while to thoroughly analyze the repro, but you've given us a lot to get started with.
| Comment by Denis Rystsov [ 24/May/18 ] | |||||||||||||||||||
|
By the way, I've updated the repo and implemented the HTTP API using the Java driver, and I was still able to run into stale data; logs: https://github.com/rystsov/consistency-mongodb/tree/master/experiments/may-24. The only difference is that I was able to see violations after isolating/rejoining a primary twice (see the repro steps). So it looks like the problem is in the replication layer rather than in the Node.js driver.
| Comment by Denis Rystsov [ 18/May/18 ] | |||||||||||||||||||
|
Unexpected behavior: MongoDB returned stale data when a formerly isolated primary rejoined the cluster.

Details: I was updating the key key3 via node3 and reading it from all nodes (each node in the replica set was also hosting an app exposing a key-value read/write HTTP interface). At the beginning of the experiment node1 was the primary. I isolated it from the network using iptables rules, waited for 20 seconds, and rejoined node1 to the cluster. Just after I did, the client read the key key3 via node1 and received data that was older than the last known written data at the moment of the request.

Logs: https://github.com/rystsov/consistency-mongodb/tree/master/experiments/may-17
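A minimal sketch of the per-node key-value HTTP shim described above (hypothetical routes and names, assuming Express and the Node.js driver; not the actual app code):

```javascript
// Minimal sketch of the per-node key-value HTTP app described above
// (hypothetical routes and names; not the actual app code).
const express = require('express');
const { MongoClient } = require('mongodb');

async function startApp(mongoUri, port) {
  const client = await MongoClient.connect(mongoUri);
  const db = client.db('test');
  const reads = db.collection('kv', { readConcern: { level: 'linearizable' } });
  const writes = db.collection('kv');

  const app = express();

  // Linearizable read of a single key.
  app.get('/read/:key', async (req, res) => {
    const doc = await reads.findOne({ _id: req.params.key });
    res.json({ value: doc ? doc.value : null });
  });

  // Majority-acknowledged write of a single key.
  app.get('/write/:key/:value', async (req, res) => {
    await writes.updateOne(
      { _id: req.params.key },
      { $set: { value: Number(req.params.value) } },
      { upsert: true, w: 'majority' }
    );
    res.json({ ok: true });
  });

  app.listen(port);
}
```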
Using the "events" log we can see that the client first read "key3=375" and then read "key3=371", which is impossible because there is only a single writer whose updates form an increasing sequence, so every subsequent read must return a value greater than the previous one:
Versions: macOS: 10.13.3 (17D47)