[SERVER-16119] Hang during initial sync of 2.8.0-rc0 (wiredtiger) Created: 13/Nov/14 Updated: 11/Jul/16 Resolved: 19/Nov/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | 2.8.0-rc0 |
| Fix Version/s: | 2.8.0-rc1 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Cailin Nelson | Assignee: | Kaloian Manassiev |
| Resolution: | Done | Votes: | 0 |
| Labels: | wiredtiger | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Operating System: | ALL | ||||||||||||||||
| Participants: | |||||||||||||||||
| Description |
|
Trying to do an initial sync of a 2.8.0rc0 secondary from a 2.6.3 primary, got the following error:
Command line options for the 2.8.0rc0 secondary:
|
| Comments |
| Comment by Githook User [ 02/Dec/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}Message: Revert " This reverts commit fe363010e4b2b953c9876608861c02df4c8662ec, which stalls | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 02/Dec/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}Message: Revert " This reverts commit fe363010e4b2b953c9876608861c02df4c8662ec, which stalls | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 27/Nov/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'erh', u'name': u'Eliot Horowitz', u'email': u'eliot@10gen.com'}Message: | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kaloian Manassiev [ 19/Nov/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The hang has been fixed, but we now hit the issue in | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Eric Milkie [ 18/Nov/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I figured out why the initial sync is fasserting with "oplog back in time". Part of the problem was introduced when we added initial sync op batching. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 15/Nov/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}Message: | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kaloian Manassiev [ 14/Nov/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
OK, I think I see the problem. It looks like we use the point where we return sessions to the cache as an opportunity to update the size of the collection. Which means we are trying to commit transactions under that lock:
This happens while the original WT transaction is still active, I believe. Thus if there are a lot of these active, commit will block waiting for them to drain, but they will never drain because they are blocked on writing the size information. I think the correct fix is like Eliot suggested, to first make this lock only protect the actual removal from the session pool (this would allow pending transactions to drain). And also it would be best if the size storing happens after we have dropped the session so we don't hold any active WiredTiger state. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Cailin Nelson [ 14/Nov/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Here is a longer log, of another run, left overnight: https://dropbox.10gen.com/cailin/2014-11-14-08-36/wiredtiger_sync_3.log | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Scott Hernandez (Inactive) [ 14/Nov/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
It is still building the index as of the end of that log file. Parts of serverstatus status require a dblock which is probably why it is blocking. db.currentOp() should yield more info, or just watching the log also. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Cailin Nelson [ 14/Nov/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Tried this again.
At the conclusion of this log file the process was up and responsive in the shell, but db.serverStatus() would not return results:
EDIT: I was looking through the log file and noticed that there were still some serverStatus calls that looked like they were coming from an Automation Agent. I realized I had forgotten to turn off all the Automation Agents associated with this replica set. So, I tried it one more time with all Automation Agents off. I got the same result as before - the LockManager warnings start after about 10 minutes. I'll let this one go for another 30 minutes, so that we can satisfy ourselves that it never recovers from this state. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Daniel Pasette (Inactive) [ 13/Nov/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
log segment from the end of the file. Should be noted that this is a build after RC0: 2014-11-13T19:46:51.559+0000 I CONTROL [conn36] db version v2.8.0-rc1-pre- | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Daniel Pasette (Inactive) [ 13/Nov/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
attaching backtrace for all threads | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Cailin Nelson [ 13/Nov/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I tried the sync again, and it failed again, but this time at a different point, and with a different backtrace:
|