[SERVER-25255] After upgrading to 3.2.8 (from 3.0.9) startup appears to hang with lots of disk reads to the local collection. Created: 25/Jul/16 Updated: 26/Aug/16 Resolved: 26/Aug/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, WiredTiger |
| Affects Version/s: | 3.2.8 |
| Fix Version/s: | None |
| Type: | Question | Priority: | Major - P3 |
| Reporter: | Paul Ridgway | Assignee: | Unassigned |
| Resolution: | Incomplete | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Participants: | |||||||||
| Description |
|
After upgrading to 3.2.8 (from 3.0.9) startup appears to hang with lots of disk reads to the local collection. fatrace shows lots of: mongod(87260): R /srv/mongodb/local/collection-2--6617514398526579318.wt The file is a ~230gb file which is likely made up of the oplog. This all happens after the log lines:
and on another run with debug:
With verbose logging on it just shows WT Journal flushing at a regular rate. Why is this happening and how long might it take for the node to come back online? Will it try and read the whole collection / op log? |
| Comments |
| Comment by Bartosz Debski [ 24/Aug/16 ] |
|
I think we can close this as we have raised a new issue : https://jira.mongodb.org/browse/SERVER-25773 |
| Comment by Kelsey Schubert [ 28/Jul/16 ] |
|
Hi paul.ridgway, So we can continue to investigate, would you please upload the complete logs? I have created a secure upload portal for you to use. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time. Thank you, |
| Comment by Paul Ridgway [ 26/Jul/16 ] |
|
The EBS volumes are far from 'cold' - in fact they've been in use for months. 'Warming' the whole volume will take a while but I'll dd the local db / op log collection before hand and see if that helps. Some logging on progress/activity would be useful considering with log level of 5 it was not clear - but I appreciate that is not likely to be backported. |
| Comment by Ramon Fernandez Marina [ 26/Jul/16 ] |
|
paul.ridgway, I think this could be the same issue as described in |
| Comment by Paul Ridgway [ 25/Jul/16 ] |
|
I should add - we waited about 10 mins for anything to happen but had to revert as the uncertainty on what was happening introduced too much risk. And Disk read was around 20Mb/sec (AWS EBS gp2 volume) - pretty much 100% of all activity. |