[SERVER-27127] drop() regression heisenbug Created: 19/Nov/16 Updated: 22/Nov/16 Resolved: 21/Nov/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | 3.4.0-rc1, 3.4.0-rc3, 3.4.0-rc4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | David Golden | Assignee: | Eric Milkie |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Steps To Reproduce: | Install libmongoc 1.5.0-rc6, checkout the branch from the github link above, build the mongocxx driver and run tests repeatedly. |
| Sprint: | Storage 2016-12-12 |
| Participants: |
| Description |
|
In developing mongocxx, I stumbled across a bug where a collection drop followed immediately by a count on the collection with an empty document filter doesn't always return a count of zero. This occurs in fixture code, so the collection is being repeatedly filled and dropped for different tests. The problem was seen on some (but only some) test runs on 3.4.0-rc1, 3.4.0-rc3, and 3.4.0-rc4 (3.4.0-rc2 was not tested). The problem did not occur on 3.2.9 or 3.2.11. Replacing the drop with a multi-document delete does not exhibit the problem on any of the 3.4 RCs. I don't have time to do a more complete investigation, but I wanted to file a ticket so that it isn't lost and others can explore it. The failing test code can be seen here: https://github.com/xdg/mongo-cxx-driver/blob/CXX-894-server-bug-demo/src/mongocxx/test/collection.cpp#L98-L103 Tests were run on my MacBook. |
| Comments |
| Comment by Eric Milkie [ 22/Nov/16 ] | |||||||
|
I think the crux of the problem is defining "writes arriving prior to time X", since "unacknowledged write concern" is not something one can use with strict ordering guarantees. When do you define a write as having arrived? | |||||||
| Comment by David Golden [ 21/Nov/16 ] | |||||||
|
FWIW, I agree the 3.4 behavior is technically "correct" (sequential consistency) but may be surprising to customers who still tend to think transactionally. Given historical scrutiny of our consistency guarantees, the bigger question is what model we want to provide for collection drops. Do we think customers want a linearizable drop? I.e. "At time X, I dropped the collection, so I know that no writes arriving prior to time X are in the collection." Under the current 3.4 behavior, how could someone achieve an equivalent? Spin on drop until a count with linearizable read concern is zero? Should we document that in the release notes? I'm happy either way we choose, but I'd like to make sure we're changing behavior versus expectations intentionally, not coincidentally. | |||||||
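The "spin on drop until a count sees zero" workaround mentioned above could be sketched as follows. This is a hypothetical helper, not driver code: `coll` stands in for any collection handle exposing `drop()` and `count()`, a real implementation would issue the count with linearizable read concern, and the in-memory `FakeCollection` exists only to make the sketch self-contained and runnable.

```python
import time

def drop_and_wait(coll, timeout=5.0, poll_interval=0.05):
    """Drop a collection, then poll until a count observes zero documents.

    Hypothetical helper: with a real driver, the count would be issued
    with linearizable read concern so the zero result is not stale.
    Returns True once the drop is observed, False if the timeout expires.
    """
    coll.drop()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if coll.count() == 0:
            return True
        time.sleep(poll_interval)
    return False

# Minimal in-memory stand-in, used only to exercise the helper.
class FakeCollection:
    def __init__(self, docs):
        self.docs = list(docs)
    def drop(self):
        self.docs.clear()
    def count(self):
        return len(self.docs)

coll = FakeCollection([{"x": 1}, {"x": 2}])
assert drop_and_wait(coll) is True
assert coll.count() == 0
```

In a test fixture, the timeout bounds how long setup waits for the drop to become visible before failing, rather than letting racing writes leak into the next test.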
| Comment by Eric Milkie [ 21/Nov/16 ] | |||||||
|
I believe the 3.4 behavior is still okay. Thanks for looking into the code, rassi. | |||||||
| Comment by J Rassi [ 21/Nov/16 ] | |||||||
|
In the C++ reproduction case, the unacknowledged inserts on line 50 (performed by server thread N) are racing with the collection drop on line 27 (performed by server thread N+1). When the drop wins the race, the "Count after drop was 2" message is printed. If the inserts win the race, the "Count after drop was 0" message is printed. Though the reproduction case contains a client-side race, I can confirm that the server version does reliably affect which thread wins this race. I've traced the difference between 3.2 and 3.4 server behavior to the following:
The server changes were made under a separate SERVER ticket. When I reproduce locally on a build not incorporating that fix, the inserts reliably win the race (count after drop is 0). When I reproduce locally on a build incorporating the fix, the drop reliably wins the race (count after drop is 2). See also the attached shell repro (server27127.js), which is similar (but not identical) to the C++ driver repro. The shell repro also exhibits different behavior on 3.3.4 and below versus 3.3.5 and above. milkie/redbeard0531: could one of you take a look at David's C++ repro case and say whether we're okay with shipping the 3.4 behavior? I personally don't see any correctness issue with dropping locks in an insert between implicit collection creation and the actual insert, but I would like one of you to speak on behalf of the server team here. | |||||||
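The race described above can be replayed deterministically with a toy model. Everything here (`Server`, `replay`) is invented for illustration and is not MongoDB code; it only shows why the two interleavings of the unacknowledged inserts and the drop produce counts of 2 and 0 respectively.

```python
# Toy model of the interleaving between two "server threads":
# thread N applies unacknowledged inserts, thread N+1 applies a drop.
class Server:
    def __init__(self):
        self.docs = []
    def insert(self, doc):   # unacknowledged insert (thread N)
        self.docs.append(doc)
    def drop(self):          # collection drop (thread N+1)
        self.docs.clear()
    def count(self):
        return len(self.docs)

def replay(ops):
    """Apply a fixed interleaving of operations and return the final count."""
    s = Server()
    for op in ops:
        op(s)
    return s.count()

inserts = [lambda s: s.insert({"x": 1}), lambda s: s.insert({"x": 2})]
drop = lambda s: s.drop()

# Inserts win the race: the drop removes them, so the count is 0.
assert replay(inserts + [drop]) == 0
# Drop wins the race: the inserts land afterwards, so the count is 2.
assert replay([drop] + inserts) == 2
```

The 3.2-vs-3.4 difference amounts to which of these two interleavings the server's locking reliably produces.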
| Comment by David Golden [ 19/Nov/16 ] | |||||||
|
I have a repro I'll attach – it occurs after an unordered bulk insert with write concern set to unacknowledged. The repro also demonstrates that deleting all documents can occasionally fail, though apparently less often. Also, note that the drop/count() occurs on a subsequent client connection from the bulk insert. I.e.:
| |||||||
| Comment by David Golden [ 19/Nov/16 ] | |||||||
|
Standalone, enterprise edition – just my normal "keep mongod in background for development" arrangement. Config file:
Startup options: --setParameter enableTestCommands=1 --enableMajorityReadConcern --fork
When confirming the bug, I started mongod with an empty data directory. | |||||||
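For reference, the same startup flags could be expressed in a mongod YAML config file roughly as follows. This is a hedged sketch only, not the reporter's actual config (which is not shown above); the dbPath is a placeholder.

```yaml
# Hypothetical YAML equivalent of the startup flags listed above.
processManagement:
  fork: true
setParameter:
  enableTestCommands: 1
replication:
  enableMajorityReadConcern: true
storage:
  dbPath: /path/to/empty/datadir   # placeholder; an empty dir was used
```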
| Comment by Andy Schwerin [ 19/Nov/16 ] | |||||||
|
Are these tests run against a standalone, a single-node replset, a multi-node replset, or a sharded configuration? |