[SERVER-13410] split does not install metadata under the dblock Created: 31/Mar/14  Updated: 11/Jul/16  Resolved: 01/Apr/14

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.6.0-rc2
Fix Version/s: 2.6.0-rc3

Type: Bug Priority: Critical - P2
Reporter: Cailin Nelson Assignee: Greg Studer
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: cbi1_mmsrs3_2014_03_29.log, cbi2_mmsrs2_2014_03_30.log, cbi8_mmsrs4_2014_03_27.log
Issue Links:
Depends
Duplicate
is duplicated by SERVER-7790 Segfault in splitchunk following drop... Closed
Related
related to SERVER-13429 Replace writes to cout/cerr or stdout... Closed
Operating System: ALL
Participants:

 Description   

We have observed three separate instances of the mongod process dying unexpectedly with no error message in the log file. In each instance, the last message in the log file was "about to log metadata event".

In each episode, the mongod in question was a member of a shard in a cluster. The episodes were observed on three separate physical servers.

Log files attached.

In the third episode dmesg said the following:

INFO: task mongod:29664 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mongod        D 0000000000000001     0 29664  29663 0x00000080
 ffff88066032bdf8 0000000000000086 ffff88066032bdc0 ffff88066032bdbc
 ffff88066032bd88 ffff88063fc24500 ffff880028215f80 0000000000000400
 ffff88008217fab8 ffff88066032bfd8 000000000000f4e8 ffff88008217fab8
Call Trace:
 [<ffffffffa00745c5>] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
 [<ffffffff81090d30>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0074676>] ? __jbd2_log_start_commit+0x36/0x40 [jbd2]
 [<ffffffffa009409c>] ext4_sync_file+0x13c/0x250 [ext4]
 [<ffffffff811a57a1>] vfs_fsync_range+0xa1/0xe0
 [<ffffffff811a584d>] vfs_fsync+0x1d/0x20
 [<ffffffff811a588e>] do_fsync+0x3e/0x60
 [<ffffffff811a58e0>] sys_fsync+0x10/0x20
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b



 Comments   
Comment by Githook User [ 01/Apr/14 ]

Author: Greg Studer (gregstuder) <greg@10gen.com>

Message: SERVER-13410 metadata changes in dblock
(cherry picked from commit c4f3416806135c1d2d3289a0648fc9dc754a0adf)
Branch: v2.6
https://github.com/mongodb/mongo/commit/b3c2e5171b558bfa76bab83a36083eac5d5c363b

Comment by Githook User [ 01/Apr/14 ]

Author: Greg Studer (gregstuder) <greg@10gen.com>

Message: SERVER-13410 metadata changes in dblock
Branch: master
https://github.com/mongodb/mongo/commit/c4f3416806135c1d2d3289a0648fc9dc754a0adf

Comment by Greg Studer [ 01/Apr/14 ]

Tentative summary -

It's important that mongod collection metadata changes don't happen outside the db write lock; this allows metadata users to make certain assumptions about when it is safe to write. This was already the case for every metadata change except split. Split did not previously cause problems because write operations used only (a copy of) the shard version, and because split does not change the logical ranges tracked.

In 2.6, we now use FieldRef information cached inside the collection metadata to validate updates. This ensures we don't change shard key fields, and updates assume the FieldRefs will remain valid for as long as the lock is held. On split, this isn't the case, so I think this is what is causing the crashes (and possibly silent, incorrect update validation).
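
To illustrate the invariant described in this comment, here is a minimal sketch, not the actual server code. All names (CollectionMetadata, MetadataHolder, install, updateTouchesShardKey) are hypothetical, a plain std::mutex stands in for the db write lock, and a vector of strings stands in for the cached FieldRefs. The point is only that every metadata swap, split included, takes the same lock that update validation holds while it reads the cached fields.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for per-collection sharding metadata.
struct CollectionMetadata {
    int shardVersion;
    std::vector<std::string> shardKeyFields;  // stand-in for cached FieldRefs
};

// Hypothetical holder; _dbLock stands in for the db write lock.
class MetadataHolder {
public:
    explicit MetadataHolder(CollectionMetadata initial)
        : _current(std::make_shared<CollectionMetadata>(std::move(initial))) {}

    // Every metadata change (including split) swaps the metadata only
    // while holding the lock, so no reader can observe a freed object.
    void install(CollectionMetadata next) {
        std::lock_guard<std::mutex> lk(_dbLock);
        _current = std::make_shared<CollectionMetadata>(std::move(next));
    }

    // Update validation reads the cached fields while holding the same
    // lock, so the reference below stays valid for the whole check.
    bool updateTouchesShardKey(const std::string& field) const {
        std::lock_guard<std::mutex> lk(_dbLock);
        const std::vector<std::string>& keys = _current->shardKeyFields;
        for (const std::string& key : keys) {
            if (key == field) return true;
        }
        return false;
    }

    int shardVersion() const {
        std::lock_guard<std::mutex> lk(_dbLock);
        return _current->shardVersion;
    }

private:
    mutable std::mutex _dbLock;
    std::shared_ptr<CollectionMetadata> _current;
};
```

If install() skipped the lock, as split effectively did before the fix, the old metadata could be destroyed while updateTouchesShardKey() was still iterating over it. That is the kind of use-after-free that could plausibly produce a crash with no log message.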

Comment by Greg Studer [ 31/Mar/14 ]

The changelogs of the first two configs agree: there are no changelog entries for any of the three crashes. This means that the clusterWrite is not actually getting sent, which should narrow things down considerably.
