[SERVER-22618] RangeDeleter asserts and causes primary to be unusable Created: 15/Feb/16  Updated: 03/Nov/16  Resolved: 08/Apr/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Yoni Douek Assignee: Andy Schwerin
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Sprint: Sharding 11 (03/11/16), Sharding 12 (04/01/16), Sharding 13 (04/22/16)
Participants:

 Description   


Note: this looks like SERVER-22390 but is totally different - please read carefully.

Our setup: a primary, a secondary, an arbiter, and another secondary that is more than an hour behind and still syncing (in SECONDARY state). We added that lagging member with votes:0 and priority:0 so it will not crash (as a workaround for SERVER-22390).

The other secondary is perfectly sync'd (0 sec lag).
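
For reference, the votes:0 / priority:0 workaround mentioned above can be applied with a replica set reconfig along these lines (a minimal mongo shell sketch; the member index 3 is only an assumption for illustration):

  // run on the primary; members[3] is assumed to be the lagging secondary
  cfg = rs.conf()
  cfg.members[3].votes = 0      // member no longer counts toward majorities
  cfg.members[3].priority = 0   // member can never become primary
  rs.reconfig(cfg)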

Range deleter shows this:

2016-02-15T13:04:40.161+0000 I -        [RangeDeleter] Assertion: 64:waiting for replication timed out
2016-02-15T13:04:40.506+0000 I CONTROL  [RangeDeleter]
 0x12d5772 0x12713d4 0x125cfa8 0x125d05c 0xb59fab 0xdf5026 0xdf2829 0x1a99100 0x7ff1831d7df3 0x7ff182f051bd
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"ED5772"},{"b":"400000","o":"E713D4"},{"b":"400000","o":"E5CFA8"},{"b":"400000","o":"E5D05C"},{"b":"400000","o":"759FAB"},{"b":"400000","o":"9F5026"},{"b":"400000","o":"9F2829"},{"b":"400000","o":"1699100"},{"b":"7FF1831D0000","o":"7DF3"},{"b":"7FF182E0F000","o":"F61BD"}],"processInfo":{ "mongodbVersion" : "3.2.1", "gitVersion" : "a14d55980c2cdc565d4704a7e3ad37e4e535c1b2", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.14.19-17.43.amzn1.x86_64", "version" : "#1 SMP Wed Sep 17 22:14:52 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "A931E563057BCC34ED2D1971AAC77D44F47033A9" }, { "b" : "7FFFC40FE000", "elfType" : 3, "buildId" : "8E3D893F8991DFE6C5D9AB55196714E9AF81DC88" }, { "b" : "7FF1843FD000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "22480480235F3B1C6C2E5E5953949728676D3796" }, { "b" : "7FF184018000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "ADD80D7DBE8B04C3BA8E3242D96F39FF870A862A" }, { "b" : "7FF183E10000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "E81013CBFA409053D58A65A0653271AB665A4619" }, { "b" : "7FF183C0C000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "62A8842157C62F95C3069CBF779AFCC26577A99A" }, { "b" : "7FF183903000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "66F1CF311C61879639BD3DC0034DEE0D6D042261" }, { "b" : "7FF183601000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "5F97F8F8E5024E29717CF35998681F84D4A22D45" }, { "b" : "7FF1833EC000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "E77BA674F63D5C56373C03316B5E74C5C781A0BC" }, { "b" : "7FF1831D0000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "D48D3E6672A77B603B402F661BABF75E90AD570B" }, { "b" : "7FF182E0F000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "DF6DA145A649EA093507A635AF383F608E7CE3F2" }, { "b" : "7FF18466A000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "6F90843B9087FE91955FEB0355EB0858EF9E97B2" }, { "b" : "7FF182BCC000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "DE5A9F7A11A0881CB64E375F4DDCA58028F0FAF8" }, { "b" : "7FF1828E7000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "A3E43FC66908AC8B00773707FECA3B1677AFF311" }, { "b" : "7FF1826E4000", "path" : "/usr/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "622F315EB5CB2F791E9B64020692EBA98195D06D" }, { "b" : "7FF1824B9000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "B10FBFEC246C4EAD1719D16090D0BE54904BBFC9" }, { "b" : "7FF1822A2000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "E492542502DF88A2F752AD77D1905D13FF1AC6FF" }, { "b" : "7FF182097000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "7292C0673D7C116E3389D3FFA67087A6B9287A71" }, { "b" : "7FF181E94000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "BF48CD5658DE95CE058C4B828E81C97E2AE19643" }, { "b" : "7FF181C7A000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "6A7DA1CED90F65F27CB7B5BACDBB1C386C05F592" }, { "b" : "7FF181A59000", "path" : "/usr/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "803D7EF21A989677D056E52BAEB9AB5B154FB9D9" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x12d5772]
 mongod(_ZN5mongo10logContextEPKc+0x134) [0x12713d4]
 mongod(_ZN5mongo11msgassertedEiPKc+0x88) [0x125cfa8]
 mongod(+0xE5D05C) [0x125d05c]
 mongod(_ZN5mongo7Helpers11removeRangeEPNS_16OperationContextERKNS_8KeyRangeEbRKNS_19WriteConcernOptionsEPNS0_11RemoveSaverEbb+0x144B) [0xb59fab]
 mongod(_ZN5mongo17RangeDeleterDBEnv11deleteRangeEPNS_16OperationContextERKNS_16RangeDeleteEntryEPxPSs+0x426) [0xdf5026]
 mongod(_ZN5mongo12RangeDeleter6doWorkEv+0x209) [0xdf2829]
 mongod(+0x1699100) [0x1a99100]
 libpthread.so.0(+0x7DF3) [0x7ff1831d7df3]
 libc.so.6(clone+0x6D) [0x7ff182f051bd]
-----  END BACKTRACE  -----
2016-02-15T13:04:40.513+0000 W SHARDING [RangeDeleter] Error encountered while trying to delete range: Error encountered while deleting range: nsmydomain.col from { p: 0, as: ObjectId('5660b4295e27753f5c614e0a'), d: new Date(1455408000000) } -> { p: 2, as: ObjectId('5660b4295e27753f5c614e0a'), d: new Date(1455408000000) }, cause by: :: caused by :: 64 waiting for replication timed out
2016-02-15T13:04:40.513+0000 I SHARDING [RangeDeleter] Deleter starting delete for: mydomain.col from { as: ObjectId('561ceef4bf86ce4ffd5032c2'), d: new Date(1451685631505), c: ObjectId('56c118ebeec1e72d766a28be'), u: ObjectId('54b538278071936bb168304e') } -> { as: ObjectId('565eb8575e27753f5c3b8c36'), d: new Date(1451713553551), c: ObjectId('56c0deffeec1e72d7269fa72'), u: ObjectId('568764114953771315eb1e4a') }, with opId: 2282185777

Note: this DOES NOT crash the node; it keeps running. However, after exactly one hour the primary becomes unusable and extremely slow.

1. This is a bug.

2. Why does the range deleter wait for the secondary if it has votes:0 and priority:0?

3. Is there any workaround for this? It will happen again.



 Comments   
Comment by Daniel Pasette (Inactive) [ 03/Nov/16 ]

Hi viktor.gema@cloudessa.com,

As Andy mentioned in the comment above, the assertion and stack trace are not the interesting issue here, but they do indicate that the w:2 write during the range deletion timed out. That likely means you have replication lag in your system, so this log message is a symptom of a different problem, not the problem itself.
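
For example, replication lag can be checked from the mongo shell with the standard helpers (shown here only as a quick illustration):

  // on the primary: prints how far behind each secondary is
  rs.printSlaveReplicationInfo()

  // or compare member optimes directly
  rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  " + m.optimeDate)
  })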

Thanks
Dan

Comment by Viktor Gema [ 03/Nov/16 ]

It looks like this issue is happening in our cluster. Can we reopen it?

2016-10-16T22:58:33.029+0000 I CONTROL  [RangeDeleter]
 0x12d5772 0x12713d4 0x125cfa8 0x125d05c 0xb59fab 0xdf5026 0xdf2829 0x1a99100 0x7ff315008dc5 0x7ff314d35c9d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"ED5772"},{"b":"400000","o":"E713D4"},{"b":"400000","o":"E5CFA8"},{"b":"400000","o":"E5D05C"},{"b":"400000","o":"759FAB"},{"b":"400000","o":"9F5026"},{"b":"400000","o":"9F2829"},{"b":"400000","o":"1699100"},{"b":"7FF315001000","o":"7DC5"},{"b":"7FF314C3F000","o":"F6C9D"}],"processInfo":{ "mongodbVersion" : "3.2.1", "gitVersion" : "a14d55980c2cdc565d4704a7e3ad37e4e535c1b2", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.1.10-17.31.amzn1.x86_64", "version" : "#1 SMP Sat Oct 24 01:31:37 UTC 2015", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "A931E563057BCC34ED2D1971AAC77D44F47033A9" }, { "b" : "7FFE117E7000", "elfType" : 3, "buildId" : "4106419F92DF72BDE396133D4FE9E47CB6983EF2" }, { "b" : "7FF31622C000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "C9DB6FDB87F3CAAF209AEACD2EC345B152605135" }, { "b" : "7FF315E46000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "61E79B2D9C6792919D83B45FB89E36A26CA8DE09" }, { "b" : "7FF315C3E000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "42833B65941483A611C40EA7D32F56EA83EA6E93" }, { "b" : "7FF315A3A000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "6335077ACD51527BE9F2F18451A88E2B7350C5B6" }, { "b" : "7FF315735000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "0A90C35D3174805453EA67A785446D628E298B59" }, { "b" : "7FF315433000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "BB312C4A65B8FD830C148612CBEACEACC8B08E4F" }, { "b" : "7FF31521D000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "00FA2883FB47B1327397BBF167C52F51A723D013" }, { "b" : "7FF315001000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "E5E575776DAD20ADE8FD0CAF17897C9D89020A87" }, { "b" : "7FF314C3F000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "7128B52741C2494951D285A80DAF25B2BC9C7A0D" }, { "b" : "7FF316499000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "7B7BF8FEEF1A9C627EF90CA5C9188EFD4DA2DDD2" }, { "b" : "7FF3149F3000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "E203354E7F907ACC8C3028FE465541B666DCFBA0" }, { "b" : "7FF31470E000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "D769C8FFAF8772FDA55031ABF2E167DF2207E378" }, { "b" : "7FF31450B000", "path" : "/usr/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "5C01209C5AE1B1714F19B07EB58F2A1274B69DC8" }, { "b" : "7FF3142D9000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "6C2243D37143F7FD1E16ED1F6CE4D7F16C2D7EF1" }, { "b" : "7FF3140C3000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "89C6AF118B6B4FB6A73AE1813E2C8BDD722956D1" }, { "b" : "7FF313EB4000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "744272FEAAABCE629AB9E11FAA4A96AEBE8BC2B4" }, { "b" : "7FF313CB1000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "37A58210FA50C91E09387765408A92909468D25B" }, { "b" : "7FF313A97000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "EE46CA824A1683662BCDA445CC1A4122BBE552BA" }, { "b" : "7FF313876000", "path" : "/usr/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "F5054DC94443326819FBF3065CFDF5E4726F57EE" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x12d5772]
 mongod(_ZN5mongo10logContextEPKc+0x134) [0x12713d4]
 mongod(_ZN5mongo11msgassertedEiPKc+0x88) [0x125cfa8]
 mongod(+0xE5D05C) [0x125d05c]
 mongod(_ZN5mongo7Helpers11removeRangeEPNS_16OperationContextERKNS_8KeyRangeEbRKNS_19WriteConcernOptionsEPNS0_11RemoveSaverEbb+0x144B) [0xb59fab]
 mongod(_ZN5mongo17RangeDeleterDBEnv11deleteRangeEPNS_16OperationContextERKNS_16RangeDeleteEntryEPxPSs+0x426) [0xdf5026]
 mongod(_ZN5mongo12RangeDeleter6doWorkEv+0x209) [0xdf2829]
 mongod(+0x1699100) [0x1a99100]
 libpthread.so.0(+0x7DC5) [0x7ff315008dc5]
 libc.so.6(clone+0x6D) [0x7ff314d35c9d]
-----  END BACKTRACE  -----
2016-10-16T22:58:41.340+0000 W SHARDING [RangeDeleter] Error encountered while trying to delete range: Error encountered while deleting range: nsskyteamz2.radiusattributestored_12358 from { _id: -8665760703329430854 } -> { _id: -8642350070475471630 }, cause by: :: caused by :: 64 waiting for replication timed out

Comment by Andy Schwerin [ 08/Apr/16 ]

I've done a bit of digging, and the stack trace is almost certainly not the cause of the primary becoming very very slow. In fact, there's really no reason to print the stack trace. It's acceptable for secondary throttle waits to time out, and while they cancel the current range deletion task, they do not by themselves destabilize or endanger the system.
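
For reference, orphaned documents left behind by a cancelled range deletion can later be removed manually with the cleanupOrphaned command on the shard primary; a minimal sketch, using the namespace from the log above:

  // run against the admin database on the shard primary; repeat, passing the
  // returned stoppedAtKey as startingFromKey, until no stoppedAtKey is returned
  db.adminCommand({ cleanupOrphaned: "mydomain.col" })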

The two interesting questions are

  1. why did waiting for w:2 replication during the range delete time out in this case, and
  2. was that timeout causally related to the later performance problem?

The data available in the ticket are insufficient to answer either question, and we have not been able to reproduce the performance behavior in house.

Comment by Yoni Douek [ 17/Mar/16 ]

Hey,

This happened on our production servers, and due to this bug (and others) we are now avoiding lagging secondaries. We cannot dedicate the effort to try to reproduce it in our test environment.

I wouldn't close this issue if I were you. This is an assert that needs to be rethought, even if it has not been reproduced. The description above should be enough to show its potential impact.

Comment by Misha Tyulenev [ 16/Mar/16 ]

yoni@appsee.com, could you please clarify whether the issue is still happening, and if so, provide more information to help reproduce the problem?
Thanks!

Comment by Misha Tyulenev [ 19/Feb/16 ]

Thanks for submitting the issue description.
Currently I cannot reproduce the issue; here is the shard's replica set config used:
Primary (P)
Secondary 1: no lag (S1)
Secondary 2: >1 hr lag, votes:0, priority:0 (S2)
Arbiter (A)

Regarding point 2 in the report: votes:0, priority:0 should not affect replication behavior, because in this case the moveChunk issued by the balancer's migration thread should use writeConcern w:2, which will be satisfied as long as P and S1 are healthy. To confirm and identify the issue we need more info:

  • Could you please attach the rs.status() output for the shard that has the issue?
  • Could you please send logs from around the time when the primary becomes "very very slow"?
  • Could you please send the db.settings.find() output from the config db? I want to be sure the balancer does not have a w:3 writeConcern setting that would trigger the reported behavior (see the sample query after this list).
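
For illustration, the balancer document and any custom secondary throttle write concern on it can be inspected like this (run from the mongo shell against the config database):

  use config
  db.settings.find({ _id: "balancer" }).pretty()
  // a custom write concern, if set, shows up as e.g.
  // { "_id" : "balancer", "_secondaryThrottle" : { "w" : 3 } }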

Thanks!
