[SERVER-22390] RangeDeleter crashes primary node due to lagging secondary Created: 01/Feb/16  Updated: 18/Nov/16  Resolved: 09/Feb/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.1
Fix Version/s: 3.2.3, 3.3.2

Type: Bug Priority: Major - P3
Reporter: Yoni Douek Assignee: Misha Tyulenev
Resolution: Done Votes: 0
Labels: code-only
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongos_rs_shard_failure_tolerance.log     File remove2.log    
Issue Links:
Depends
is depended on by SERVER-22176 mongod fasserts in auto_rebalance.js ... Closed
Duplicate
is duplicated by SERVER-22561 RangeDeleter crashes PRIMARY in a ver... Closed
is duplicated by SERVER-25365 MongoDB 3.2.1 crash during resync Closed
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Sprint: Sharding 10 (02/19/16)
Participants:
Linked BF Score: 0

 Description   

Our setup: primary, secondary, arbiter. When we add a new replica set member that is more than 1 hour behind, the range deleter hits a fatal assertion on the primary and crashes it.

This behavior is too conservative and should be handled differently.

2016-02-01T06:30:15.765+0000 I SHARDING [migrateThread] rangeDeleter took 3600 seconds  waiting for deletes to be replicated to majority nodes
2016-02-01T06:30:15.765+0000 I -        [migrateThread] Fatal assertion 18512 WriteConcernFailed waiting for replication timed out
2016-02-01T06:30:15.770+0000 I CONTROL  [migrateThread] 
 0x12d5772 0x12713d4 0x125d662 0xdf123e 0xdf22ca 0xf24517 0xf272de 0xf27770 0x1a99100 0x7f9e247dcdc5 0x7f9e24509bdd
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"ED5772"},{"b":"400000","o":"E713D4"},{"b":"400000","o":"E5D662"},{"b":"400000","o":"9F123E"},{"b":"400000","o":"9F22CA"},{"b":"400000","o":"B24517"},{"b":"400000","o":"B272DE"},{"b":"400000","o":"B27770"},{"b":"400000","o":"1699100"},{"b":"7F9E247D5000","o":"7DC5"},{"b":"7F9E24413000","o":"F6BDD"}],"processInfo":{ "mongodbVersion" : "3.2.1", "gitVersion" : "a14d55980c2cdc565d4704a7e3ad37e4e535c1b2", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.1.10-17.31.amzn1.x86_64", "version" : "#1 SMP Sat Oct 24 01:31:37 UTC 2015", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "A931E563057BCC34ED2D1971AAC77D44F47033A9" }, { "b" : "7FFE3EBD0000", "elfType" : 3, "buildId" : "4106419F92DF72BDE396133D4FE9E47CB6983EF2" }, { "b" : "7F9E259FF000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "22480480235F3B1C6C2E5E5953949728676D3796" }, { "b" : "7F9E2561A000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "F1C146B78505646930DD9003AA2B3622C5226D1B" }, { "b" : "7F9E25412000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "42833B65941483A611C40EA7D32F56EA83EA6E93" }, { "b" : "7F9E2520E000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "6335077ACD51527BE9F2F18451A88E2B7350C5B6" }, { "b" : "7F9E24F09000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "0A90C35D3174805453EA67A785446D628E298B59" }, { "b" : "7F9E24C07000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "BB312C4A65B8FD830C148612CBEACEACC8B08E4F" }, { "b" : "7F9E249F1000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "00FA2883FB47B1327397BBF167C52F51A723D013" }, { "b" : "7F9E247D5000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "E5E575776DAD20ADE8FD0CAF17897C9D89020A87" }, { "b" : "7F9E24413000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "D84E3AFDFF3E164A09C125F85B5DCABC6F545B5E" }, { "b" : 
"7F9E25C6C000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "7B7BF8FEEF1A9C627EF90CA5C9188EFD4DA2DDD2" }, { "b" : "7F9E241C7000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "FF843C37C38E5BFFD57F7BCCAE05FDADC6390C8F" }, { "b" : "7F9E23EE4000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "0BB150CC29DB5B0E039879EFC00152A75E3B00B9" }, { "b" : "7F9E23CE1000", "path" : "/usr/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "5C01209C5AE1B1714F19B07EB58F2A1274B69DC8" }, { "b" : "7F9E23AAF000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "1485992B0E5CDBA0A34817FC8C6A4C45E82CD1A9" }, { "b" : "7F9E23899000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "89C6AF118B6B4FB6A73AE1813E2C8BDD722956D1" }, { "b" : "7F9E2368A000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "A75A81EC50E9E0164A12B59D9987AD61AC7576C8" }, { "b" : "7F9E23487000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "37A58210FA50C91E09387765408A92909468D25B" }, { "b" : "7F9E2326D000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "47EC2C63132D25E4FE83F77023DA1A66457A88F1" }, { "b" : "7F9E2304C000", "path" : "/usr/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "F5054DC94443326819FBF3065CFDF5E4726F57EE" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x12d5772]
 mongod(_ZN5mongo10logContextEPKc+0x134) [0x12713d4]
 mongod(_ZN5mongo23fassertFailedWithStatusEiRKNS_6StatusE+0x62) [0x125d662]
 mongod(+0x9F123E) [0xdf123e]
 mongod(_ZN5mongo12RangeDeleter9deleteNowEPNS_16OperationContextERKNS_19RangeDeleterOptionsEPSs+0x4CA) [0xdf22ca]
 mongod(_ZN5mongo27MigrationDestinationManager14_migrateDriverEPNS_16OperationContextERKSsRKNS_7BSONObjES7_S7_S4_RKNS_3OIDERKNS_19WriteConcernOptionsE+0xB87) [0xf24517]
 mongod(_ZN5mongo27MigrationDestinationManager14_migrateThreadESsNS_7BSONObjES1_S1_SsNS_3OIDENS_19WriteConcernOptionsE+0xAE) [0xf272de]
 mongod(+0xB27770) [0xf27770]
 mongod(+0x1699100) [0x1a99100]
 libpthread.so.0(+0x7DC5) [0x7f9e247dcdc5]
 libc.so.6(clone+0x6D) [0x7f9e24509bdd]
-----  END BACKTRACE  -----
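
As a cross-check on the backtrace above, each raw frame address in the log appears to be the module load base ("b") plus the offset ("o") from the JSON record. The sketch below is illustrative only: the helper name is made up for this note, and only three of the frames are copied in (the first mongod frame, the libpthread frame, and the libc frame).

```python
def frame_address(base_hex: str, offset_hex: str) -> str:
    """Return the absolute address for one backtrace frame as a hex string."""
    return hex(int(base_hex, 16) + int(offset_hex, 16))


# (base, offset) pairs copied from the "backtrace" array in the log above.
frames = [
    ("400000", "ED5772"),        # mongod: printStackTrace
    ("7F9E247D5000", "7DC5"),    # libpthread.so.0
    ("7F9E24413000", "F6BDD"),   # libc.so.6 (clone)
]

addresses = [frame_address(base, offset) for base, offset in frames]
print(addresses)
```

The three results can be compared against the bracketed addresses in the symbolized frames of the same backtrace.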



 Comments   
Comment by Dennis Zhuang [ 15/Aug/16 ]

@Yoni Douek

Hi, can we apply this patch to 3.0.12 ourselves?

We upgraded our MongoDB deployment to 3.0 last week, but we can't upgrade to 3.2 or 3.3 immediately. We had a crash today caused by this issue.

Comment by Githook User [ 09/Feb/16 ]

Author: Misha Tyulenev (mikety) <misha@mongodb.com>

Message: SERVER-22390 catch all errors in RangeDeleter::_waitForMajority

(cherry picked from commit 805b8c51a6103e0482df81b5ef9b3efd448b0806)
Branch: v3.2
https://github.com/mongodb/mongo/commit/cbf7d52a1d5b25c2d8acaed39db19cda3df3b8a7

Comment by Githook User [ 09/Feb/16 ]

Author: Misha Tyulenev (mikety) <misha@mongodb.com>

Message: SERVER-22390 catch all errors in RangeDeleter::_waitForMajority
Branch: master
https://github.com/mongodb/mongo/commit/805b8c51a6103e0482df81b5ef9b3efd448b0806

Comment by Githook User [ 09/Feb/16 ]

Author: matt dannenberg (dannenberg) <matt.dannenberg@10gen.com>

Message: SERVER-22390 special case InterruptedAtShutdown in waiting for replication inside RangeDeleter to avoid crash
Branch: master
https://github.com/mongodb/mongo/commit/2d42c9ad4e65ff20d1ea1a3799afedca8d78364a

Comment by Randolph Tan [ 05/Feb/16 ]

Another example (remove2.log), with a different status:

[migrateThread] Fatal assertion 18512 InterruptedAtShutdown: interrupted at shutdown

Should probably not fassert with this error.
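
The behavior change being asked for can be sketched as follows. This is not the server code (which is C++); it is a minimal Python model, and every name in it (Status, FatalAssertion, delete_range_fatal, delete_range_tolerant) is hypothetical. It contrasts treating a failed majority wait as a fatal assertion with reporting the failure back to the caller so the process keeps running.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Status:
    """Hypothetical stand-in for a status value returned by a replication wait."""
    code: str          # e.g. "OK", "WriteConcernFailed", "InterruptedAtShutdown"
    reason: str = ""

    @property
    def ok(self) -> bool:
        return self.code == "OK"


class FatalAssertion(Exception):
    """Models a process-terminating assertion (the crash seen in the logs)."""


def delete_range_fatal(wait_for_majority: Callable[[], Status]) -> bool:
    """Old-style handling: any non-OK status takes the whole process down."""
    status = wait_for_majority()
    if not status.ok:
        raise FatalAssertion(f"{status.code}: {status.reason}")
    return True


def delete_range_tolerant(wait_for_majority: Callable[[], Status]) -> Tuple[bool, str]:
    """Tolerant handling: report the failure to the caller instead of crashing."""
    try:
        status = wait_for_majority()
    except Exception as exc:  # catch every error raised by the wait itself
        return False, f"{type(exc).__name__}: {exc}"
    if not status.ok:
        return False, f"{status.code}: {status.reason}"
    return True, ""


def timed_out() -> Status:
    return Status("WriteConcernFailed", "waiting for replication timed out")


def shutting_down() -> Status:
    return Status("InterruptedAtShutdown", "interrupted at shutdown")


print(delete_range_tolerant(timed_out))
print(delete_range_tolerant(shutting_down))
```

With the tolerant variant, the caller decides whether to retry or abandon the range deletion; neither status terminates the process.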

Comment by Randolph Tan [ 05/Feb/16 ]

Attached a log from a build failure that exhibited the same behavior.

Comment by Yoni Douek [ 04/Feb/16 ]

All you need is to add a replica set member that is far behind and therefore can't take part in replicating the deletes.

In our case:

  1. Primary + secondary + arbiter, all synced.
  2. Add a secondary from an old backup (which, by the way, is affected by SERVER-22389 and therefore won't sync).
  3. Do this while chunks are being moved and the balancer is running.

The primary can't reach a majority when range deleting, and crashes after 1 hour.
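
The arithmetic behind this can be sketched in a few lines. This is a simplified model, not the server's actual calculation: it assumes all members vote, that a majority is more than half of the members, that an arbiter cannot acknowledge writes, and that a member which is far behind has not yet applied the deletes. The Member type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Member:
    name: str
    is_arbiter: bool = False   # arbiters hold no data, so they cannot acknowledge writes
    caught_up: bool = True     # False for a member that has not applied the deletes yet


def majority_needed(members: List[Member]) -> int:
    """Smallest count that is more than half of the members."""
    return len(members) // 2 + 1


def acks_available(members: List[Member]) -> int:
    """Members that can acknowledge the delete right now."""
    return sum(1 for m in members if not m.is_arbiter and m.caught_up)


def majority_write_possible(members: List[Member]) -> bool:
    return acks_available(members) >= majority_needed(members)


before = [Member("primary"), Member("secondary"), Member("arbiter", is_arbiter=True)]
after = before + [Member("restored-secondary", caught_up=False)]

print(majority_needed(before), acks_available(before), majority_write_possible(before))
print(majority_needed(after), acks_available(after), majority_write_possible(after))
```

Under these assumptions the three-member set needs 2 acknowledgements and has 2, while the four-member set needs 3 and still has only 2, so the wait cannot succeed until the new member catches up.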

Comment by Misha Tyulenev [ 03/Feb/16 ]

yonido, we need more information to troubleshoot this ticket:

  1. Could you please clarify the scenario where you add a new replica member that is behind?
  2. Is this scenario reproducible?
  3. Do you have a moveChunk command in progress when you observe the crash?

Thanks!

Comment by Ramon Fernandez Marina [ 02/Feb/16 ]

Thanks yonido, I've updated the ticket accordingly. One of our engineers is already looking into this.

Comment by Yoni Douek [ 02/Feb/16 ]

3.2.1

Comment by Ramon Fernandez Marina [ 02/Feb/16 ]

yonido, what version(s) of MongoDB are in use in this replica set?
