[SERVER-29230] Journal files accumulating on ReplicaSet Secondary Created: 16/May/17  Updated: 26/Feb/18  Resolved: 26/Feb/18

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.4.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Alberto Ornaghi Assignee: Susan LoVerso
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2017-05-16 at 14.17.43.png     Text File journal_dir.txt    
Issue Links:
Documented
is documented by DOCS-10562 Add warning about using WiredTiger wi... Closed
Duplicate
duplicates WT-3327 Checkpoints can hang if time runs bac... Closed
is duplicated by SERVER-29308 CLONE - Journal files accumulating Closed
Related
Operating System: ALL
Participants:

 Description   

We have a replica set composed of a primary, a secondary and an arbiter, deployed with the automation agent from MongoDB Cloud Manager.
Everything is fine on the primary, but we noticed that the free disk space on the secondary is constantly decreasing; checking the filesystem, we discovered that the journal directory is the culprit.
We performed a full resync of the secondary, but after 24 hours the journal directory occupies 380 GB (we have 485 GB of data). See the attached graph of free disk space over 24 hours (primary on the left, secondary on the right).

Currently there are 3926 journal files in the journal directory (see attachment).
The primary is not accumulating journal files and behaves as expected.



 Comments   
Comment by Kelsey Schubert [ 26/Feb/18 ]

Reopening to change the resolution, to clarify that this work was completed in WT-3327 and the issue has been resolved in MongoDB 3.2.17 and 3.4.6.

Comment by Alberto Ornaghi [ 21/May/17 ]

Just for reference (in case someone else needs it in the future), running this:

echo 2dd1ce17-079e-403c-b352-a1921ee207ee > /sys/bus/vmbus/drivers/hv_util/unbind

will resolve the "Time has been changed" problem.
Hopefully it will no longer trigger the WT bug.
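
For context, a minimal sketch for confirming which VMBus device GUIDs are currently bound to hv_util before unbinding, assuming a standard sysfs layout (the GUID above is specific to this host):

ls /sys/bus/vmbus/drivers/hv_util/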

Comment by Alberto Ornaghi [ 21/May/17 ]

It took 4.5 hours to recover, but now it seems OK (210 MB of journal). It's rebuilding the index in the foreground, so I have to wait some more time...
Once the secondary is up I will do the same on the primary.

The timesyncd daemon was already stopped, but the "Time has been changed" message appears even in that situation.
It seems to be a problem with Azure Hyper-V:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1676635?comments=all

Unfortunately I can only wait for a kernel patch for this... the perfect storm.

Comment by Susan LoVerso [ 21/May/17 ]

There is a WiredTiger verbose option that will give progress for recovery. You can turn it on by adding this:

--wiredTigerEngineConfigString=',verbose=(recovery_progress)'

However, the checkpoint after recovery is where your issue is and the potential hang. Can you turn off the time daemon or change its setting so that it updates much less frequently, particularly during this recovery process?
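
For reference, a minimal sketch of the equivalent config-file form, and of stopping the time daemon on a systemd-based host (assuming the standard storage.wiredTiger.engineConfig.configString option in mongod.conf and the systemd-timesyncd service; adjust to your setup):

storage:
  wiredTiger:
    engineConfig:
      configString: "verbose=(recovery_progress)"

systemctl stop systemd-timesyncd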

Comment by Alberto Ornaghi [ 20/May/17 ]

Last week I tried it and it seemed stuck on recovering the journal files... 200 GB of journal to recover is a lot...

How can I check the recovery progress?
Will it help to configure mongod.conf this way:

systemLog:
  verbosity: 0
  component:
    storage:
      verbosity: 1
      journal:
        verbosity: 3

Comment by Susan LoVerso [ 20/May/17 ]

For the moment, if you restart the node it will replay the journal and then perform a checkpoint, and that checkpoint will allow it to remove the old journal files. The situation you have is that the code assumes time goes forward; the system is stuck in checkpoint in a very long sleep because it computed a time difference when time went backward. Assuming the checkpoint that happens after the restart and journal replay completes, it will then be able to remove the files.

I suggest doing this one node at a time.
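
A minimal sketch for watching the journal directory drain after the restart (the path /data/rs0_1/journal is taken from the error message quoted elsewhere in this ticket and may differ on your system):

watch -n 60 'ls /data/rs0_1/journal | wc -l; du -sh /data/rs0_1/journal'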

Comment by Alberto Ornaghi [ 20/May/17 ]

I created an index on the primary and it replicated to the secondary.
I've noticed that the new index is 4 GB but the total index size grew from 105 GB to 144 GB. Very strange. I've checked the journal and now it is stuck on both the primary and the secondary (234 GB on the primary and 193 GB on the secondary). How can I resolve this critical situation? They will fill up the free disk space in a few days.

Is it safe to kill one of them and let it replay the journal? Is there some sort of debug output in the log when it replays the journal on startup, to check whether it's stuck or not?
Will the bug prevent the journal from being replayed on startup?
Please help, thank you.

Comment by Susan LoVerso [ 18/May/17 ]

Work in WiredTiger relating to time structures and time going backward will be done in WT-3327.

Comment by Alberto Ornaghi [ 17/May/17 ]

Yes, it's full of those messages:

May 17 14:37:05 db-rs0-2 systemd[4467]: Time has been changed
May 17 14:37:05 db-rs0-2 systemd[1]: snapd.refresh.timer: Adding 3h 53min 17.369714s random time.
May 17 14:37:10 db-rs0-2 systemd[4467]: Time has been changed
May 17 14:37:10 db-rs0-2 systemd[1]: Time has been changed
May 17 14:37:10 db-rs0-2 systemd[1]: apt-daily.timer: Adding 8h 13min 56.523873s random time.
May 17 14:37:10 db-rs0-2 systemd[1]: snapd.refresh.timer: Adding 4h 5min 54.630702s random time.
May 17 14:37:15 db-rs0-2 systemd[4467]: Time has been changed
May 17 14:37:15 db-rs0-2 systemd[1]: Time has been changed
May 17 14:37:15 db-rs0-2 systemd[1]: apt-daily.timer: Adding 4h 40min 22.691008s random time.
May 17 14:37:15 db-rs0-2 systemd[1]: snapd.refresh.timer: Adding 5h 46min 46.616894s random time.
May 17 14:37:20 db-rs0-2 systemd[4467]: Time has been changed
May 17 14:37:20 db-rs0-2 systemd[1]: Time has been changed
May 17 14:37:20 db-rs0-2 systemd[1]: apt-daily.timer: Adding 1h 19min 2.758074s random time.
May 17 14:37:20 db-rs0-2 systemd[1]: snapd.refresh.timer: Adding 5h 1min 24.560710s random time.

Unfortunately the comment in the Ask Ubuntu post you linked (the one about Azure) is not reassuring...
I will try to resolve this by searching for problems specific to Ubuntu + Azure.

Thanks again

Comment by Susan LoVerso [ 17/May/17 ]

If you check /var/log/syslog you may see messages that say something like "Time has been changed" (or if you want to upload that file I will look at it here). You might also find this page helpful: https://askubuntu.com/questions/888493/var-log-syslog-systemd1-time-has-been-changed-message-every-5-seconds

ETA: If you see those messages, resolving those time change messages should resolve the other problems you're seeing. Please let me know.
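
A quick sketch for counting those messages, assuming the default syslog location:

grep -c 'Time has been changed' /var/log/syslog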

Comment by Susan LoVerso [ 17/May/17 ]

If you notice this happening again, please try to gather stacks (one suggested method is pmp, poor man's profiler, from poormansprofiler.org). Something inside WiredTiger using time is causing a checkpoint to get stuck and that needs to get fixed. That is the reason the journal files are accumulating. I will be looking at code to avoid time going backward and also find the hang. Thank you for your cooperation!
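
A minimal poor-man's-profiler style capture with gdb (a sketch, assuming gdb is installed and a single mongod process; pmp itself just aggregates this kind of output):

gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' > mongod-stacks.txt 2>&1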

Comment by Alberto Ornaghi [ 17/May/17 ]

I've restarted the secondary from another snapshot of the primary and deleted the diagnostic.data dir. The resync is done and the journal files are not accumulating right now, but I will monitor the situation and let you know if it happens again. So far it has happened 2 times out of 3 resyncs. Pretty bad rate.

Do you have any recommendation on the timesyncd configuration?

Comment by Alberto Ornaghi [ 17/May/17 ]

We are running on Azure VMs (Ubuntu 16.04); time is synchronised with the default timesyncd (no change to the configuration).
The same configuration is on the primary and the secondary.
The issue happened after the first import, so I deleted the data dir of the secondary and copied over a snapshot of the primary, but it happened again.
What do you suggest now? Should I resync again from another snapshot of the primary? Do I have to delete the diagnostic.data dir after the copy, before restarting the secondary?

Thank you

Comment by Susan LoVerso [ 17/May/17 ]

Can you describe the system you're running on? Is it a virtual machine? Or an AWS instance or some other cloud instance? Are you running any time daemon process like ntpd or similar? There are values that appear in the diagnostic data that indicate the system clock may have been adjusted backward, skewing several values, but also causing a checkpoint to get stuck. That checkpoint getting stuck is the reason that journal files cannot be removed.

If you can get it into this situation again, can you attempt to gather stacks so that we can see where the checkpoint thread is stuck? I will be attempting to reproduce it here as well.

Comment by Alberto Ornaghi [ 16/May/17 ]

The secondary server stopped working because it filled all the free space.

The last server log entry (in mongodb.log) was:
2017-05-16T16:49:15.053+0000 E STORAGE [repl writer worker 10] WiredTiger error (28) [1494953355:45764][31637:0x7fd39ab78700], WT_SESSION.commit_transaction: /data/rs0_1/journal/WiredTigerLog.0000012522: han

I was tailing it live, but now it's a zero-byte file, so I cannot send it.

Comment by Alberto Ornaghi [ 16/May/17 ]

Uploaded to the same portal.
If you need other info, just ask.

Note that yesterday we took a snapshot of the primary and attached it to the secondary, which resynced from that point.
The problem was present even before, so I tried resyncing from a primary snapshot.

Comment by Susan LoVerso [ 16/May/17 ]

alberto.ornaghi@gmail.com can you also upload the WiredTiger.wt and WiredTiger.turtle files from your database directory?

Comment by Alberto Ornaghi [ 16/May/17 ]

I've uploaded the diagnostic data.

Thank you

Comment by Kelsey Schubert [ 16/May/17 ]

Hi alberto.ornaghi@gmail.com,

Would you please upload an archive of the diagnostic.data directory from the affected node's dbpath?

I've created a secure upload portal for you to use. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time.
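
A sketch for creating that archive, assuming the dbpath shown in the error message elsewhere in this ticket (/data/rs0_1):

tar czf diagnostic.data.tar.gz -C /data/rs0_1 diagnostic.data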

Thank you,
Thomas
