[SERVER-25255] After upgrading to 3.2.8 (from 3.0.9) startup appears to hang with lots of disk reads to the local collection. Created: 25/Jul/16  Updated: 26/Aug/16  Resolved: 26/Aug/16

Status: Closed
Project: Core Server
Component/s: Replication, WiredTiger
Affects Version/s: 3.2.8
Fix Version/s: None

Type: Question
Priority: Major - P3
Reporter: Paul Ridgway
Assignee: Unassigned
Resolution: Incomplete
Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-25773 Upgrade from 3.0.9 to 3.2.9 discards ... Closed
Participants:

 Description   

After upgrading to 3.2.8 (from 3.0.9) startup appears to hang with lots of disk reads to the local collection.

fatrace shows lots of:

mongod(87260): R /srv/mongodb/local/collection-2--6617514398526579318.wt

The file is ~230 GB and is likely the oplog collection.
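For reference, a minimal pymongo sketch to confirm the oplog size and which WiredTiger file backs it (the connection string is a placeholder, the node must be accepting connections, and the exact collStats field names may vary by server version):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection string
stats = client.local.command("collStats", "oplog.rs")
print(stats.get("size"), stats.get("maxSize"))   # current and maximum oplog size in bytes
print(stats.get("wiredTiger", {}).get("uri"))    # table name, which maps to the collection-*.wt file on disk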

This all happens after the log lines:

2016-07-25T13:12:15.345+0000 I STORAGE  [initandlisten] Placing a marker at optime Jul 25 11:21:00:37f
2016-07-25T13:12:15.345+0000 I STORAGE  [initandlisten] Placing a marker at optime Jul 25 12:55:11:25b
2016-07-25T13:12:25.296+0000 I NETWORK  [websvr] admin web console waiting for connections on port 28017
2016-07-25T13:12:25.302+0000 I REPL     [initandlisten] Did not find local voted for document at startup;  NoMatchingDocument: Did not find replica set lastVote document in local.replset.election

and on another run with debug:

2016-07-25T15:12:49.643+0000 I REPL     [initandlisten] Did not find local voted for document at startup;  NoMatchingDocument: Did not find replica set lastVote document in local.replset.election
2016-07-25T15:12:49.643+0000 D REPL     [initandlisten] returning minvalid: (term: -1, timestamp: May 24 18:17:53:802)({ ts: Timestamp 1464113873000|2050, t: -1 }) -> (term: -1, timestamp: Jul 25 15:04:29:36)({ ts: Timestamp 1469459069000|54, t: -1 })
2016-07-25T15:12:49.643+0000 D REPL     [initandlisten] Recovering from a failed apply batch, start:{ ts: Timestamp 1464113873000|2050, t: -1 }

With verbose logging enabled, it just shows the WiredTiger journal flushing at a regular rate.
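For context, the "Recovering from a failed apply batch" line above refers to the minvalid document that mongod keeps in local.replset.minvalid; a small pymongo sketch to inspect it on a member that is up (the connection string is a placeholder, and the document's field layout may differ between 3.0 and 3.2):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection string

# The minvalid document records the oplog range the server believes it still
# has to apply before its data is considered consistent.
print(client.local["replset.minvalid"].find_one())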

Why is this happening and how long might it take for the node to come back online?

Will it try to read the whole collection / oplog?



 Comments   
Comment by Bartosz Debski [ 24/Aug/16 ]

I think we can close this as we have raised a new issue: https://jira.mongodb.org/browse/SERVER-25773

Comment by Kelsey Schubert [ 28/Jul/16 ]

Hi paul.ridgway,

So we can continue to investigate, would you please upload the complete logs?

I have created a secure upload portal for you to use. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time.

Thank you,
Thomas

Comment by Paul Ridgway [ 26/Jul/16 ]

The EBS volumes are far from 'cold' - in fact they've been in use for months.

'Warming' the whole volume will take a while, but I'll dd the local db / oplog collection beforehand and see if that helps.
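A rough sketch of such a warming pass (Python instead of dd, purely illustrative; the path is taken from the fatrace output above):

CHUNK = 16 * 1024 * 1024  # read in 16 MiB chunks
path = "/srv/mongodb/local/collection-2--6617514398526579318.wt"

# Sequentially read the whole file once so the underlying EBS blocks are
# fetched before mongod is started; the data itself is discarded.
with open(path, "rb", buffering=0) as f:
    while f.read(CHUNK):
        pass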

Some logging of progress/activity would be useful, since even at log level 5 it was not clear what was happening - but I appreciate that is not likely to be backported.

Comment by Ramon Fernandez Marina [ 26/Jul/16 ]

paul.ridgway, I think this could be the same issue as described in SERVER-23935. Please see this comment for more details.

Comment by Paul Ridgway [ 25/Jul/16 ]

I should add - we waited about 10 minutes for anything to happen but had to revert, as the uncertainty about what was happening introduced too much risk. Disk read was around 20Mb/sec (AWS EBS gp2 volume) - pretty much 100% of all activity.
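For a rough sense of scale (assuming the ~230 GB file would have to be read in full and the rate above means roughly 20 megabytes per second):

file_gb = 230
rate_mb_per_s = 20
hours = file_gb * 1024 / rate_mb_per_s / 3600
print(round(hours, 1))  # ~3.3 hours before the node could come back online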
