[SERVER-24994] Assertion when sharded collection is dropped during metadata changes Created: 11/Jul/16  Updated: 20/Jan/17  Resolved: 15/Nov/16

Status: Closed
Project: Core Server
Component/s: Querying, Sharding
Affects Version/s: 3.2.5
Fix Version/s: 3.2.12, 3.4.0-rc4

Type: Bug Priority: Major - P3
Reporter: Aaron Westendorf Assignee: Max Hirschhorn
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-27039 Update sharded_moveChunk_drop_shard_k... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Backport Requested: v3.2
Sprint: TIG 2016-10-31, TIG 2016-11-21
Participants:

 Description   

I had a sharded collection that was initialized while the balancer was off. Consequently, tests I was running on that collection skewed the load toward a single shard. Seeing this, I enabled the balancer and waited a short time, but then decided it would be faster to start with a new collection. To free resources, I disabled balancing on that collection and then, not seeing a large enough decrease in resource usage, tried to delete the whole database.

From a mongos shell, I tried:

mongos> db
zion_patch1_0711_5783d8f5f5adf41d933505e1
mongos> db.dropDatabase()
{ "code" : 50, "ok" : 0, "errmsg" : "Operation timed out" }
mongos> db.dropDatabase()
{ "ok" : 0, "errmsg" : "seen a collection twice!" }

Moments later, the primary for the shard crashed with the following assertion:

2016-07-11T19:19:34.496+0000 I COMMAND  [conn2797] command admin.$cmd command: splitVector { splitVector: "zion_patch1_0711_5783d8f5f5adf41d933505e1.game_ProfileFeedEntry", keyPattern: { account_id: "hashed" }, min: { account_id: -3669295402892714827 }, max: { account_id: -3465469707898765886 }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } keyUpdates:0 writeConflicts:0 numYields:290 reslen:277 locks:{ Global: { acquireCount: { r: 582 } }, MMAPV1Journal: { acquireCount: { r: 588 } }, Database: { acquireCount: { r: 291 }, acquireWaitCount: { r: 6 }, timeAcquiringMicros: { r: 64102 } }, Collection: { acquireCount: { R: 291 }, acquireWaitCount: { R: 291 }, timeAcquiringMicros: { R: 4397253 } } } protocol:op_command 4482ms
2016-07-11T19:19:34.496+0000 I COMMAND  [conn2815] command admin.$cmd command: splitVector { splitVector: "zion_patch1_0711_5783d8f5f5adf41d933505e1.game_ProfileFeedEntry", keyPattern: { account_id: "hashed" }, min: { account_id: -3669295402892714827 }, max: { account_id: -3465469707898765886 }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } keyUpdates:0 writeConflicts:0 numYields:244 reslen:277 locks:{ Global: { acquireCount: { r: 490 } }, MMAPV1Journal: { acquireCount: { r: 495 } }, Database: { acquireCount: { r: 245 }, acquireWaitCount: { r: 5 }, timeAcquiringMicros: { r: 37828 } }, Collection: { acquireCount: { R: 245 }, acquireWaitCount: { R: 245 }, timeAcquiringMicros: { R: 3664084 } } } protocol:op_command 3720ms
2016-07-11T19:19:34.500+0000 I -        [RangeDeleter] Invariant failure descriptor src/mongo/db/query/internal_plans.cpp 149
2016-07-11T19:19:34.500+0000 I -        [RangeDeleter] 
 
***aborting after invariant() failure
 
 
2016-07-11T19:19:34.544+0000 I SHARDING [conn2968] remotely refreshing metadata for zion_patch1_0711_5783d8f5f5adf41d933505e1.game_ProfileFeedEntry with requested shard version 0|0||000000000000000000000000, current shard version is 3|0||5783d918ba74d2d6247ab977, current metadata version is 3|0||5783d918ba74d2d6247ab977
2016-07-11T19:19:34.545+0000 F -        [RangeDeleter] Got signal: 6 (Aborted).
 
 0x12ff312 0x12fe239 0x12fea42 0x7f6163650340 0x7f61632b1cc9 0x7f61632b50d8 0x1285e4b 0xdcafb9 0xdcb0b9 0xb7a988 0xe19bac 0xe17379 0x1b2b8c0 0x7f6163648182 0x7f616337547d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"EFF312","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"EFE239"},{"b":"400000","o":"EFEA42"},{"b":"7F6163640000","o":"10340"},{"b":"7F616327B000","o":"36CC9","s":"gsignal"},{"b":"7F616327B000","o":"3A0D8","s":"abort"},{"b":"400000","o":"E85E4B","s":"_ZN5mongo15invariantFailedEPKcS1_j"},{"b":"400000","o":"9CAFB9","s":"_ZN5mongo15InternalPlanner10_indexScanEPNS_16OperationContextEPNS_10WorkingSetEPKNS_10CollectionEPKNS_15IndexDescriptorERKNS_7BSONObjESD_bNS0_9DirectionEi"},{"b":"400000","o":"9CB0B9","s":"_ZN5mongo15InternalPlanner9indexScanEPNS_16OperationContextEPKNS_10CollectionEPKNS_15IndexDescriptorERKNS_7BSONObjESB_bNS_12PlanExecutor11YieldPolicyENS0_9DirectionEi"},{"b":"400000","o":"77A988","s":"_ZN5mongo7Helpers11removeRangeEPNS_16OperationContextERKNS_8KeyRangeEbRKNS_19WriteConcernOptionsEPNS0_11RemoveSaverEbb"},{"b":"400000","o":"A19BAC","s":"_ZN5mongo17RangeDeleterDBEnv11deleteRangeEPNS_16OperationContextERKNS_16RangeDeleteEntryEPxPSs"},{"b":"400000","o":"A17379","s":"_ZN5mongo12RangeDeleter6doWorkEv"},{"b":"400000","o":"172B8C0","s":"execute_native_thread_routine"},{"b":"7F6163640000","o":"8182"},{"b":"7F616327B000","o":"FA47D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.5", "gitVersion" : "34e65e5383f7ea1726332cb175b73077ec4a1b02", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.13.0-29-generic", "version" : "#53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "74F187FDF5E701A36D5188BD90B24583D1430613" }, { "b" : "7FFFF7AFE000", "elfType" : 3, "buildId" : "3D068D088E7EAC15D9DA7C3AC912E783C0897EE7" }, { "b" : "7F6164561000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "A20EFFEC993A8441FA17F2079F923CBD04079E19" }, { "b" : "7F6164186000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "F000D29917E9B6E94A35A8F02E5C62846E5916BC" }, { "b" : "7F6163F7E000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "92FCF41EFE012D6186E31A59AD05BDBB487769AB" }, { "b" : "7F6163D7A000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "C1AE4CB7195D337A77A3C689051DABAA3980CA0C" }, { "b" : "7F6163A74000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1D76B71E905CB867B27CEF230FCB20F01A3178F5" }, { "b" : "7F616385E000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "CC0D578C2E0D86237CA7B0CE8913261C506A629A" }, { "b" : "7F6163640000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "9318E8AF0BFBE444731BB0461202EF57F7C39542" }, { "b" : "7F616327B000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "30C94DC66A1FE95180C3D68D2B89E576D5AE213C" }, { "b" : "7F61647C0000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "9F00581AB3C73E3AEA35995A0C50D24D59A01D47" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x12ff312]
 mongod(+0xEFE239) [0x12fe239]
 mongod(+0xEFEA42) [0x12fea42]
 libpthread.so.0(+0x10340) [0x7f6163650340]
 libc.so.6(gsignal+0x39) [0x7f61632b1cc9]
 libc.so.6(abort+0x148) [0x7f61632b50d8]
 mongod(_ZN5mongo15invariantFailedEPKcS1_j+0xCB) [0x1285e4b]
 mongod(_ZN5mongo15InternalPlanner10_indexScanEPNS_16OperationContextEPNS_10WorkingSetEPKNS_10CollectionEPKNS_15IndexDescriptorERKNS_7BSONObjESD_bNS0_9DirectionEi+0x1E9) [0xdcafb9]
 mongod(_ZN5mongo15InternalPlanner9indexScanEPNS_16OperationContextEPKNS_10CollectionEPKNS_15IndexDescriptorERKNS_7BSONObjESB_bNS_12PlanExecutor11YieldPolicyENS0_9DirectionEi+0x89) [0xdcb0b9]
 mongod(_ZN5mongo7Helpers11removeRangeEPNS_16OperationContextERKNS_8KeyRangeEbRKNS_19WriteConcernOptionsEPNS0_11RemoveSaverEbb+0x308) [0xb7a988]
 mongod(_ZN5mongo17RangeDeleterDBEnv11deleteRangeEPNS_16OperationContextERKNS_16RangeDeleteEntryEPxPSs+0x44C) [0xe19bac]
 mongod(_ZN5mongo12RangeDeleter6doWorkEv+0x209) [0xe17379]
 mongod(execute_native_thread_routine+0x20) [0x1b2b8c0]
 libpthread.so.0(+0x8182) [0x7f6163648182]
 libc.so.6(clone+0x6D) [0x7f616337547d]
-----  END BACKTRACE  -----



 Comments   
Comment by Githook User [ 20/Jan/17 ]

Author: Max Hirschhorn (visemet) <max.hirschhorn@mongodb.com>

Message: SERVER-24994 Check in Helpers::removeRange() if shard key index dropped.

(cherry picked from commit 57ed82f0692bfb4e7a045a0108d029e53b21e3f8)
Branch: v3.2
https://github.com/mongodb/mongo/commit/bcca51caf0054c7def89d7588a1c99c6b2513a79

Comment by Githook User [ 15/Nov/16 ]

Author: Max Hirschhorn (visemet) <max.hirschhorn@mongodb.com>

Message: SERVER-24994 Check in Helpers::removeRange() if shard key index dropped.
Branch: master
https://github.com/mongodb/mongo/commit/57ed82f0692bfb4e7a045a0108d029e53b21e3f8
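
The commit message describes adding a guard where Helpers::removeRange() resolves the shard key index. A minimal, self-contained C++ sketch of that shape (findShardKeyIndex, the signature, and the types here are illustrative stand-ins, not the actual patch):

#include <iostream>

struct IndexDescriptor {};

// Illustrative stand-in for the index catalog lookup; returns nullptr once
// the shard key index has been dropped out from under the range deleter.
const IndexDescriptor* findShardKeyIndex(bool indexDropped) {
    static IndexDescriptor idx;
    return indexDropped ? nullptr : &idx;
}

// Post-fix shape: verify the lookup succeeded before starting the index
// scan, and abort the range deletion instead of tripping an invariant.
bool removeRange(bool indexDropped) {
    const IndexDescriptor* desc = findShardKeyIndex(indexDropped);
    if (!desc) {
        std::cerr << "shard key index not found; aborting range deletion\n";
        return false;  // pre-fix, execution continued into invariant(descriptor)
    }
    // ... run the index scan and delete the documents in the range ...
    return true;
}

int main() {
    return removeRange(/*indexDropped=*/true) ? 0 : 1;
}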

Comment by Aaron Westendorf [ 15/Jul/16 ]

That's great news, Max. I'm glad that you were able to reproduce it and find the cause.

Comment by Max Hirschhorn [ 14/Jul/16 ]

Hi aaron.westendorf,

I have successfully reproduced the issue you've described. The following is an explanation of some of the related technical details.

The invariant(descriptor) failure in InternalPlanner::_indexScan() means that the IndexDescriptor* was null. The caller Helpers::removeRange() had passed in a null pointer because we failed to find the shard key index in the Collection's index catalog. This is because the shard key index had been dropped as part of the "dropDatabase" command you issued.
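
In outline (a stand-alone sketch with assumed names, not the actual 3.2 source), the null lookup result flows unchecked into plan construction, where the invariant aborts the process:

#include <cassert>

struct IndexDescriptor {};

// Stand-in for InternalPlanner::_indexScan(): the planner assumes its caller
// resolved a real index, so a null descriptor is fatal by design.
void indexScan(const IndexDescriptor* descriptor) {
    assert(descriptor != nullptr);  // analogue of invariant(descriptor)
    // ... build and run the index scan ...
}

// Stand-in for the pre-fix Helpers::removeRange(): the catalog lookup result
// is forwarded without a null check.
void removeRange(bool shardKeyIndexDropped) {
    static IndexDescriptor idx;
    const IndexDescriptor* desc = shardKeyIndexDropped ? nullptr : &idx;
    indexScan(desc);  // aborts the process when the index was dropped
}

int main() {
    removeRange(/*shardKeyIndexDropped=*/true);  // mirrors the crash above
}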

The root cause of this failure is a race between the RangeDeleter thread, which uses the shard key index to clean up the chunk that the balancer donated to another shard, and another thread dropping that same index.
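
The interleaving can be pictured with a small check-then-use sketch (illustrative threads only, not MongoDB code): one thread resolves the index and yields, while the other drops it before the first thread gets to use it:

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

std::atomic<bool> shardKeyIndexExists{true};

// Plays the RangeDeleter: resolves the shard key index, then yields before
// using it, which is where the real thread loses the race.
void rangeDeleterThread() {
    bool resolved = shardKeyIndexExists.load();
    std::this_thread::sleep_for(std::chrono::milliseconds(50));  // yield point
    if (resolved && !shardKeyIndexExists.load()) {
        std::cout << "index dropped between lookup and use -> invariant\n";
    }
}

// Plays the dropDatabase command: removes the index concurrently.
void dropDatabaseThread() {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    shardKeyIndexExists.store(false);
}

int main() {
    std::thread deleter(rangeDeleterThread);
    std::thread dropper(dropDatabaseThread);
    deleter.join();
    dropper.join();
}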

I'm reassigning this ticket to the sharding team to work on a fix now that the cause is understood.

Thanks,
Max
