[SERVER-17114] WiredTiger capped collection cursor needs to go DEAD when restoreState to a deleted doc Created: 29/Jan/15 Updated: 18/Sep/15 Resolved: 03/Feb/15
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.0.0-rc7 |
| Fix Version/s: | 3.0.0-rc8 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Michael Grundy | Assignee: | Eric Milkie |
| Resolution: | Done | Votes: | 0 |
| Labels: | 28qa |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Steps To Reproduce: | 3-member replica set on c3.2xlarge instances (8 CPU, 15 GB RAM). Socialite load workload: java -jar target/socialite-0.0.1-SNAPSHOT.jar load --users 10000000 --maxfollows 1000 --messages 2000 --threads 32 sample-config.yml |
| Participants: | |
| Description |
Running a heavy insert workload on a three-node replica set, the secondaries fell behind. One quit with Fatal Assertion 18750; the other stayed in RECOVERING. It appears that after going into RECOVERING the secondary tried syncing from the other secondary.
| Comments |
| Comment by Githook User [ 04/Feb/15 ] |
Author: Eric Milkie (milkie) <milkie@10gen.com>
Message: (cherry picked from commit 0e523eec39056da5b0dfb3826bbe27b4f376deb6)
| Comment by Githook User [ 03/Feb/15 ] |
Author: Eric Milkie (milkie) <milkie@10gen.com>
Message:
| Comment by Eric Milkie [ 02/Feb/15 ] |
The fassert appears to be due to holes in the oplog: a range of documents is missing from both secondaries. The missing ranges differ between the two secondaries, but they partially overlap and correspond to a period when the primary was slow (as determined by slow-query logging on the primary). The running theory is that the slowness triggered something that caused the secondaries' getMore operations on their tailable cursors to miss oplog documents.
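For context, here is a minimal sketch (in Python with pymongo, not the server's C++ oplog fetcher) of how a secondary-style tailable cursor follows a sync source's oplog. The connection URI and the choice of starting timestamp are illustrative assumptions only.

```python
# Minimal sketch of tailing a sync source's oplog with a tailable/awaitData
# cursor (pymongo). Illustrative only -- not the actual replication code.
from pymongo import MongoClient, CursorType

client = MongoClient("mongodb://10.93.30.138:27017")  # sync source (assumed)
oplog = client.local["oplog.rs"]

# Resume point: in real replication this is the secondary's last applied
# optime; here we take the newest entry so the example is self-contained.
last_ts = oplog.find().sort("$natural", -1).limit(1).next()["ts"]

cursor = oplog.find(
    {"ts": {"$gte": last_ts}},
    cursor_type=CursorType.TAILABLE_AWAIT,  # cursor stays open across getMore calls
)

for entry in cursor:
    # Each getMore returns the next batch of ops in insertion order. If the
    # sync source's capped oplog truncates documents between getMore calls and
    # the cursor is not invalidated (goes DEAD) as this ticket requires, a
    # batch can silently skip ops -- producing the "holes" described above.
    print(entry["ts"], entry["op"], entry.get("ns"))
```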
| Comment by Michael Grundy [ 29/Jan/15 ] |
shard2-02 went into recovery first according to the logs. Updating the description.
| Comment by Michael Grundy [ 29/Jan/15 ] |
This is the log from shard2-03 when the shard2-02 (10.93.30.138) secondary went into recovery:
And here's where shard2-03 went into recovery:
Will attach complete logs.
| Comment by Eric Milkie [ 29/Jan/15 ] |
michael.grundy@10gen.com, can you attach the log from the other secondary in this set?
| Comment by Eric Milkie [ 29/Jan/15 ] |
What I think is happening here is that the secondary we are choosing as a sync source happens to delete the op we are looking for from its capped oplog collection right before we query for it. The code currently isn't set up to handle this situation. Instead of rolling back, we could run another query for the first (oldest) op in the sync source's oplog; if it is ahead of our last applied op, we should transition to (or remain in) RS102 RECOVERING and not attempt rollback, as in the sketch below.
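A minimal sketch of that staleness check, in Python rather than the server's C++, using hypothetical names (`should_enter_recovering`, `my_last_ts`) purely for illustration:

```python
# Sketch of the proposed check: if the oldest op remaining in the sync
# source's capped oplog is newer than the last op we applied, the op we need
# has already been truncated, so enter RECOVERING instead of rolling back.
# Hypothetical helper -- not the actual server implementation.
from pymongo import MongoClient


def should_enter_recovering(sync_source_uri, my_last_ts):
    """Return True if our last applied op is no longer present in the sync
    source's oplog (i.e. we are too stale to sync from it)."""
    source = MongoClient(sync_source_uri)
    oplog = source.local["oplog.rs"]
    # Oldest entry still present in the sync source's capped oplog.
    first = oplog.find().sort("$natural", 1).limit(1).next()
    # bson Timestamps compare by (time, increment).
    return first["ts"] > my_last_ts
```

If the check returned true, the node would stay in (or move to) RECOVERING and re-evaluate its sync source later, rather than taking the rollback path that led to the fassert seen here.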