[SERVER-21275] Document not found due to WT commit visibility issue
| Created: | 03/Nov/15 | Updated: | 20/Sep/17 | Resolved: | 01/Dec/15 |
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | 3.2.0-rc2 |
| Fix Version/s: | 3.0.8, 3.2.0-rc4 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Jonathan Abrahams | Assignee: | Mathias Stearn |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Completed: | |
| Sprint: | Repl C (11/20/15), QuInt D (12/14/15) |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
Issue Status as of Dec 10, 2015

ISSUE SUMMARY
Due to a commit visibility issue in the WiredTiger storage engine, a query could fail to find a document that should have been visible to it.

USER IMPACT
Deployments where a WiredTiger node is or was used as a source of data may be affected. This includes:

- deployments where a WiredTiger node is or ever was primary
- deployments where a WiredTiger secondary is or ever was used as a source for chained replication

MMAPv1-only deployments are not affected by this issue. Mixed storage engine deployments are not affected when WiredTiger nodes never become primary, or when WiredTiger secondaries are not used as a source for chained replication.

WORKAROUNDS
Users experiencing the "Fatal Assertion 16360" error may restart the affected node to fix the issue, but this condition may recur, so upgrading to 3.0.8 is strongly recommended.

AFFECTED VERSIONS
MongoDB 3.0 releases prior to 3.0.8 and 3.2.0 release candidates prior to 3.2.0-rc4, when running WiredTiger.

FIX VERSION
The fix is included in the 3.0.8 production release and in 3.2.0-rc4.

Original description

A new test is being introduced into the FSM tests to check the dbHash of the DB (and its collections) on all replica set nodes during these phases of the workload (…). Before the dbHash is computed, cluster.awaitReplication() is invoked to ensure that all nodes in the replica set have caught up. During the development of this test it was noticed that infrequent failures would occur for the remove_and_bulk_insert workload on the wiredTiger storage engine.
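For reference, a minimal sketch of this kind of consistency check, written against the mongo shell's ReplSetTest harness (this is not the actual FSM test; the collection and document used here are hypothetical):

```javascript
// Sketch: verify that all replica set nodes agree on the dbHash of a
// database once replication has caught up. ReplSetTest, awaitReplication(),
// and the dbHash command are all part of the server's jstests shell harness.
var rst = new ReplSetTest({nodes: 3});
rst.startSet();
rst.initiate();

var primary = rst.getPrimary();
assert.writeOK(primary.getDB("test").coll.insert({_id: 1, x: 1}));

// Ensure every node has applied all operations from the primary.
rst.awaitReplication();

// After awaitReplication(), the dbHash of the database must match on every
// node; a mismatch is exactly the kind of failure this test detects.
var hashes = rst.nodes.map(function(node) {
    return node.getDB("test").runCommand({dbHash: 1}).md5;
});
hashes.forEach(function(h) {
    assert.eq(hashes[0], h, "dbHash mismatch across replica set nodes");
});

rst.stopSet();
```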
| Comments |
| Comment by ITWEBTF SAXOBANK [ 15/Dec/15 ] |
Hi Dan,

The tests we run will …

There are around 500 tests that run on local developer machines (Intel i7 CPU, 16 GB RAM, and at least a 250 GB SSD), non-clustered. The particular test I saw failing …

The failure I saw was that the added sub-document was not there. I hope you can use this info. Feel free to copy it to another issue if that provides better tracking. If you do so, I can add the actual documents.

Best regards,
| Comment by Daniel Pasette (Inactive) [ 14/Dec/15 ] |
Hi itwebtf@saxobank.com, we do run the same tests on MMAPv1 that we run on WT. This particular issue is limited to the WiredTiger storage engine. I would be very interested in understanding your test case, but it would be a separate issue. If you can describe your cluster details and workload setup, please do so; it would be best to get the details into a new SERVER issue, though.
| Comment by ITWEBTF SAXOBANK [ 14/Dec/15 ] |
Today I saw a very similar issue on MMAPv1 on MongoDB 3.2, a local non-clustered installation. I have tests that all write to the database and then assert on the result. Only once have I seen the read return the value from before the last write operation. I cannot reproduce it, and I cannot prove that this is not a problem in our code, but the code and the tests have run unchanged for more than a year against a 2.4.9 database. I suppose that whatever tests you run on WiredTiger are run on MMAPv1 as well, so you should eventually see this issue too.
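For concreteness, the pattern being described is roughly the following (a hypothetical sketch against a standalone mongod; the collection and field names are illustrative, not the reporter's actual tests):

```javascript
// Sketch: write-then-assert against a single non-clustered mongod. With an
// acknowledged write on a standalone server, the subsequent read should
// always observe the updated value.
var coll = db.getSiblingDB("test").readYourWrites;
coll.drop();

assert.writeOK(coll.insert({_id: 1, value: "before"}));
assert.writeOK(coll.update({_id: 1}, {$set: {value: "after"}}));

// The symptom reported above: very rarely, this read returned "before",
// i.e. the value prior to the last write operation.
var doc = coll.findOne({_id: 1});
assert.eq("after", doc.value, "read returned the value before the last write");
```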
| Comment by Daniel Pasette (Inactive) [ 24/Nov/15 ] |
Resolved as duplicate of …
| Comment by Benety Goh [ 18/Nov/15 ] |
Logs associated with data_db_dbhash_0212513.tar.gz.
| Comment by Benety Goh [ 18/Nov/15 ] |
Attached /data/db contents from a failed run_dbhash.sh run under Linux (git hash 0212513).
| Comment by Jonathan Abrahams [ 05/Nov/15 ] |
Attached is a script that can reproduce the issue. Caution: the failure happens infrequently, so the script has to be run multiple times. Also attached is the log from the failed run. It was run in this manner: …
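The attached script and its exact invocation are what matter for reproducing this. As a rough illustration only (not the attached script), the remove_and_bulk_insert workload, which lives in the server's jstests/concurrency/fsm_workloads suite, alternates operations of roughly this shape; the document counts and field names below are hypothetical:

```javascript
// Illustrative only: the kind of operation mix remove_and_bulk_insert
// exercises, run concurrently by the FSM framework in the real test.
var coll = db.getSiblingDB("test").remove_and_bulk_insert;

for (var i = 0; i < 100; ++i) {
    // Remove every document in the collection ...
    coll.remove({});

    // ... then bulk-insert a fresh batch.
    var bulk = coll.initializeUnorderedBulkOp();
    for (var j = 0; j < 100; ++j) {
        bulk.insert({x: j});
    }
    assert.writeOK(bulk.execute());
}
```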