[SERVER-17689] Fatal assertion during replication, and/or initial sync Created: 23/Mar/15  Updated: 19/Sep/15  Resolved: 23/Apr/15

Status: Closed
Project: Core Server
Component/s: Replication, WiredTiger
Affects Version/s: 3.0.1
Fix Version/s: 3.0.5, 3.1.2

Type: Bug Priority: Critical - P2
Reporter: Sjoerd Mulder Assignee: Scott Hernandez (Inactive)
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
is duplicated by SERVER-19518 mongod crashes with WriteConflictExce... Closed
Related
is related to SERVER-16994 Handle WriteConflictException when wr... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Steps To Reproduce:

1. Use the rolling upgrade mechanism on MongoDB 3.0.1 to change from MMAPv1 to WiredTiger, as described here: http://docs.mongodb.org/manual/release-notes/3.0-upgrade/#change-replica-set-storage-engine-to-wiredtiger

Sprint: RPL 1 04/03/15, RPL 2 04/24/15
Participants:

 Description   
Issue Status as of Jul 14, 2015

ISSUE SUMMARY
During replication and/or initial sync, when using the WiredTiger storage engine, a replica set member may terminate with a fatal assertion about a WriteConflictException. This assertion shuts down the server, causing replication or initial sync to fail.

This error is not common; whether it occurs depends on the application's workload and user data.

USER IMPACT
A replica set member may terminate during replication and/or initial sync with a fatal assertion, which requires the member to be restarted. In certain low-availability configurations, this issue may affect the replica set's ability to maintain a primary member that can accept writes.

WORKAROUNDS
N/A

AFFECTED VERSIONS
MongoDB 3.0.0 through 3.0.4.

FIX VERSION
The fix is included in the 3.0.5 production release.

RESOLUTION DETAILS
By handling WriteConflictExceptions during the oplog application (applyOps) stage of replication, including during initial sync, the system can now retry the operation until it completes successfully in WiredTiger.
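
For illustration, here is a minimal, self-contained sketch of that retry pattern. The names (applyOplogEntry, applyWithRetry) are hypothetical and simplified; this is not the server's actual implementation:

// Standalone sketch (not MongoDB source) of the retry-on-conflict
// pattern the fix applies: keep re-running the write until it commits
// without a conflict, instead of treating the conflict as fatal.
#include <cstdio>
#include <stdexcept>

struct WriteConflictException : std::runtime_error {
    WriteConflictException() : std::runtime_error("WriteConflict") {}
};

// Stand-in for applying one oplog entry; the real code performs the
// write inside a storage-engine transaction.
void applyOplogEntry(int entry, int attempt) {
    if (attempt < 2)                  // simulate two transient conflicts
        throw WriteConflictException();
    std::printf("entry %d applied on attempt %d\n", entry, attempt + 1);
}

void applyWithRetry(int entry) {
    for (int attempt = 0;; ++attempt) {
        try {
            applyOplogEntry(entry, attempt);
            return;                   // committed cleanly: done
        } catch (const WriteConflictException&) {
            // Before the fix this exception escaped to fassert(16361)
            // and aborted the server; now the entry is simply retried.
        }
    }
}

int main() {
    applyWithRetry(1);
    return 0;
}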

Original description

After successfully replicating the collections and indexes from the MMAPv1 to the WiredTiger storage engine on our replica set of 2 servers, the server crashes with the following output:

2015-03-23T12:49:36.500+0000 I INDEX    [rsSync] build index done.  scanned 31 total records. 0 secs
2015-03-23T12:49:36.502+0000 I REPL     [rsSync] initial sync data copy, starting syncup
2015-03-23T12:49:36.527+0000 I REPL     [rsSync] oplog sync 1 of 3
2015-03-23T12:49:37.106+0000 I REPL     [ReplicationExecutor] syncing from: primary:27017
2015-03-23T12:49:37.111+0000 I REPL     [SyncSourceFeedback] replset setting syncSourceFeedback to primary:27017
2015-03-23T12:50:10.804+0000 I REPL     [repl writer worker 14] replication update of non-mod failed: { ts: Timestamp 1427114721000|133, h: -8520449917638273792, v: 2, op: "u", ns: "...REMOVED...", o2: { ...REMOVED... } }
2015-03-23T12:50:10.807+0000 I REPL     [repl writer worker 14] replication info adding missing object
2015-03-23T12:50:10.882+0000 E REPL     [repl writer worker 14] writer worker caught exception:  :: caused by :: 112 WriteConflict on: { ts: Timestamp 1427114721000|133, h: -8520449917638273792, v: 2, op: "u", ns: "...REMOVED...", o2: { ...REMOVED... }
2015-03-23T12:50:10.883+0000 I -        [repl writer worker 14] Fatal Assertion 16361
2015-03-23T12:50:10.883+0000 I -        [repl writer worker 14] 
 
***aborting after fassert() failure
 



 Comments   
Comment by Githook User [ 07/Jul/15 ]

Author: Scott Hernandez (scotthernandez) <scotthernandez@gmail.com>

Message: SERVER-17689: handle wce better in initial sync

(cherry picked from commit 57232d88b869b0741d95998f8ed09a96b2e7a1bc)
Branch: v3.0
https://github.com/mongodb/mongo/commit/40410b820cd93e54fb7b42d37146b2d930334bbb

Comment by Scott Hernandez (Inactive) [ 02/Jun/15 ]

Backport Notes:
Backporting all of the WriteConflictException (WCE) handling will not be possible without including other changes, such as making logOp rollback-safe (SERVER-15192), the query-system WCE handling, and probably other changes as well.

2015-06-01T19:47:23.463+0000 F REPL     [conn42] Rollback of logOp not currently allowed (SERVER-15192)
2015-06-01T19:47:23.463+0000 I -        [conn42] Fatal Assertion 18805
2015-06-01T19:47:23.470+0000 I CONTROL  [conn42] 
 0xf643c9 0xf0d131 0xef0c31 0xc32fcc 0xd2e982 0xd2ed50 0x7e7266 0x91ed5f 0x934e96 0x9c1d74 0x9c2cfd 0x9c3a0b 0xb9755a 0xaa7320 0x7e969d 0xf2188b 0x2b43ed73483d 0x2b43ee5befcd
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B643C9"},{"b":"400000","o":"B0D131"},{"b":"400000","o":"AF0C31"},{"b":"400000","o":"832FCC"},{"b":"400000","o":"92E982"},{"b":"400000","o":"92ED50"},{"b":"400000","o":"3E7266"},{"b":"400000","o":"51ED5F"},{"b":"400000","o":"534E96"},{"b":"400000","o":"5C1D74"},{"b":"400000","o":"5C2CFD"},{"b":"400000","o":"5C3A0B"},{"b":"400000","o":"79755A"},{"b":"400000","o":"6A7320"},{"b":"400000","o":"3E969D"},{"b":"400000","o":"B2188B"},{"b":"2B43ED72E000","o":"683D"},{"b":"2B43EE4EA000","o":"D4FCD"}],"processInfo":{ "mongodbVersion" : "3.0.4-pre-", "gitVersion" : "411802f844fa09bedb4f80d14e76bb94d2e3cae0", "uname" : { "sysname" : "Linux", "release" : "2.6.18-194.el5xen", "version" : "#1 SMP Tue Mar 16 22:01:26 EDT 2010", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "2B43ED72E000", "path" : "/lib64/libpthread.so.0", "elfType" : 3 }, { "b" : "2B43ED94A000", "path" : "/lib64/librt.so.1", "elfType" : 3 }, { "b" : "2B43EDB53000", "path" : "/lib64/libdl.so.2", "elfType" : 3 }, { "b" : "2B43EDD58000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3 }, { "b" : "2B43EE058000", "path" : "/lib64/libm.so.6", "elfType" : 3 }, { "b" : "2B43EE2DB000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3 }, { "b" : "2B43EE4EA000", "path" : "/lib64/libc.so.6", "elfType" : 3 }, { "b" : "2B43ED510000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf643c9]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xf0d131]
 mongod(_ZN5mongo13fassertFailedEi+0x61) [0xef0c31]
 mongod(+0x832FCC) [0xc32fcc]
 mongod(_ZN5mongo15DurRecoveryUnit15rollbackChangesEv+0x192) [0xd2e982]
 mongod(_ZN5mongo15DurRecoveryUnit13endUnitOfWorkEv+0x70) [0xd2ed50]
 mongod(_ZN5mongo15WriteUnitOfWorkD1Ev+0x16) [0x7e7266]
 mongod(_ZN5mongo6Cloner2goEPNS_16OperationContextERKSsS4_RKNS_12CloneOptionsEPSt3setISsSt4lessISsESaISsEERSsPi+0x157F) [0x91ed5f]
 mongod(_ZN5mongo9CmdCopyDb3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x966) [0x934e96]

Alternative:
It might be possible to produce a more targeted patch that can be applied to older versions, like 3.0, for this specific case of applying the oplog during initial sync, without addressing all the other places that might also have issues, like the data cloning. I would not suggest this, as it leads to a patchwork of backports from many specific patches, which will keep coming up until we essentially backport (almost) everything anyway.

Comment by Githook User [ 01/Jun/15 ]

Author: Scott Hernandez (scotthernandez) <scotthernandez@gmail.com>

Message: Revert "SERVER-17689: handle wce better in initial sync"

This reverts commit 411802f844fa09bedb4f80d14e76bb94d2e3cae0.
Branch: v3.0
https://github.com/mongodb/mongo/commit/0a7d47f9076bf6070b66462715ca225eca2cdb6b

Comment by Githook User [ 01/Jun/15 ]

Author: Scott Hernandez (scotthernandez) <scotthernandez@gmail.com>

Message: SERVER-17689: handle wce better in initial sync

(cherry picked from commit 57232d88b869b0741d95998f8ed09a96b2e7a1bc)
Branch: v3.0
https://github.com/mongodb/mongo/commit/411802f844fa09bedb4f80d14e76bb94d2e3cae0

Comment by Githook User [ 23/Apr/15 ]

Author: Scott Hernandez (scotthernandez) <scotthernandez@gmail.com>

Message: SERVER-17689: handle wce better in initial sync
Branch: master
https://github.com/mongodb/mongo/commit/57232d88b869b0741d95998f8ed09a96b2e7a1bc

Comment by Scott Hernandez (Inactive) [ 23/Mar/15 ]

sjoerdmulder
Yes, it can be hard to reproduce normally or in regular use, but we have some tools we can use to cause the storage engine (WiredTiger) to generate this kind of error more often during testing.

Thanks for filing the issue so we can address this in initial sync.

Comment by Sjoerd Mulder [ 23/Mar/15 ]

Update: the second time, the initial sync replication did succeed, so I'm not sure this can be reproduced easily.

Comment by Andy Schwerin [ 23/Mar/15 ]

My guess is that Sync::shouldRetry is throwing WriteConflictException in multiInitialSyncApply. SyncTail::syncApply handles WCE internally, which is probably what Sync::shouldRetry ought to do, as well.
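
For readers unfamiliar with those code paths, below is a hedged, standalone sketch of the shape of that suggestion; every name in it (fetchMissingDocFromSyncSource, insertLocally, shouldRetrySketch) is hypothetical and simplified, not the real server code:

// Sketch of handling the WriteConflictException inside the function,
// the way SyncTail::syncApply does, instead of letting it propagate to
// multiInitialSyncApply, where it becomes a fatal assertion.
#include <iostream>
#include <stdexcept>
#include <string>

struct WriteConflictException : std::runtime_error {
    WriteConflictException() : std::runtime_error("WriteConflict") {}
};

// Hypothetical helpers standing in for the real fetch/insert logic.
std::string fetchMissingDocFromSyncSource() { return "{ _id: 1 }"; }

bool insertAttempted = false;
void insertLocally(const std::string& doc) {
    if (!insertAttempted) {            // simulate one transient conflict
        insertAttempted = true;
        throw WriteConflictException();
    }
    std::cout << "inserted " << doc << "\n";
}

bool shouldRetrySketch() {
    const std::string doc = fetchMissingDocFromSyncSource();
    while (true) {
        try {
            insertLocally(doc);
            return true;               // insert committed: caller retries the op
        } catch (const WriteConflictException&) {
            // conflict is transient: loop and try the insert again
        }
    }
}

int main() { return shouldRetrySketch() ? 0 : 1; }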
