[SERVER-29397] Invariant failure on config server when inserting tag into config.tags Created: 30/May/17  Updated: 30/Oct/23  Resolved: 28/Nov/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.4.4, 3.5.8
Fix Version/s: 3.4.11, 3.6.1, 3.7.1

Type: Bug Priority: Major - P3
Reporter: Clive Hill Assignee: Dianna Hohensee (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File 3.2017-05-30T11-50-43.mdmp     Zip Archive LocalMongoDB.zip     Zip Archive UpgradeTester.zip    
Issue Links:
Backports
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.6, v3.4
Steps To Reproduce:
  1. Ensure MongoDB 3.4 is installed and available from "C:\Program Files\MongoDB\Server\3.4\bin"
  2. Unzip attached LocalMongoDB.zip to e.g. C:\LocalFolder\temp\MongoDBUpgrade\3.4
  3. In all bat files update MONGODB_VERSION to 3.4
  4. Start all bat files
  5. (I used Robomongo 0.9.0-RC7 to run the following commands)
  6. Open mongo shell on localhost:27001 (LN1 mongod), and run:
  7. config = {
    _id : "LN1",
    members : [
    {_id : 1, host : "localhost:27001"},
    {_id : 2, host : "localhost:27002"},
    {_id : 3, host : "localhost:27003", arbiterOnly: true}
    ]
    }
    rs.initiate(config)
  8. Open mongo shell on localhost:27004 (NY1 mongod), and run:
  9. config = {
    _id : "NY1",
    members : [
    {_id : 1, host : "localhost:27004"},
    {_id : 2, host : "localhost:27005"},
    {_id : 3, host : "localhost:27006", arbiterOnly: true}
    ]
    }
    rs.initiate(config)
  10. Open mongo shell on localhost:27007 (LN1 config server mongod), and run:
  11. config = {
    _id : "conf",
    members : [
    {_id : 1, host : "localhost:27007"},
    {_id : 2, host : "localhost:27008"}
    ]
    }
    rs.initiate(config)
  12. Open shell on localhost:27009 (LN1 mongos) and run the following:
    1. sh.addShard("LN1/localhost:27001")
    2. sh.addShard("NY1/localhost:27004")
    3. sh.addShardTag("LN1", "LN1")
    4. sh.addShardTag("NY1", "NY1")
    5. sh.enableSharding("ddp")
  13. Shard collection datasources:
    1. sh.shardCollection("ddp.datasources", { "location" : 1, "shard" : 1 })
    2. db.datasources.createIndex( { "location" : 1, "shard" : 1 })

  14. Now insert tag:
    1. use config
    2. db.tags.insertOne({ "_id" : Unknown macro: { "ns" } , "ns" : "ddp.datasources", "min" : { "location" : "LN", "shard" : "LN1" }, "max" : { "location" : "LM", "shard" : "LN1" }, "tag" : "LN1" })

  15. Wait. The cmd window running LN1-config.bat will then fail with the stack trace given in the description.

I have repeated the steps using 3.2 without any issues.

Can someone please advise? I can provide a Java program that does all the manual steps above and gets the same error if this is helpful.

Sprint: Sharding 2017-12-04
Participants:

 Description   

I am writing a Java program to test various aspects of MongoDB so that we can upgrade with confidence in future. As part of this, I test aspects of sharding and shard tags.

In MongoDB 3.2 (3.2.9) the sharding steps work correctly. In MongoDB 3.4.4 (also seen in 3.4.0), after inserting the following into the tags collection of the config database via mongos:

db.tags.insertOne({ "_id" : { "ns" : "ddp.datasources", "min" : { "location" : "LN", "shard" : "LN1" } }, "ns" : "ddp.datasources", "min" : { "location" : "LN", "shard" : "LN1" }, "max" : { "location" : "LM", "shard" : "LN1" }, "tag" : "LN1" })

I get the following from the config server (I have attached the mdmp):

2017-05-30T12:50:43.588+0100 I -        [Balancer] Invariant failure splitPoint.woCompare(it->second.splitKeys.back()) == 0 src\mongo\db\s\balancer\balancer_chunk_selection_policy_impl.cpp 154
2017-05-30T12:50:43.589+0100 I -        [Balancer]
 
***aborting after invariant() failure
 
 
2017-05-30T12:50:43.692+0100 I CONTROL  [Balancer] mongod.exe    ...\src\mongo\util\stacktrace_windows.cpp(239)                                   mongo::printStackTrace+0x43
2017-05-30T12:50:43.693+0100 I CONTROL  [Balancer] mongod.exe    ...\src\mongo\util\signal_handlers_synchronous.cpp(180)                          mongo::`anonymous namespace'::printSignalAndBacktrace+0x74
2017-05-30T12:50:43.695+0100 I CONTROL  [Balancer] mongod.exe    ...\src\mongo\util\signal_handlers_synchronous.cpp(236)                          mongo::`anonymous namespace'::abruptQuit+0x85
2017-05-30T12:50:43.696+0100 I CONTROL  [Balancer] ucrtbase.DLL                                                                                   raise+0x1e8
2017-05-30T12:50:43.697+0100 I CONTROL  [Balancer] ucrtbase.DLL                                                                                   abort+0x31
2017-05-30T12:50:43.698+0100 I CONTROL  [Balancer] mongod.exe    ...\src\mongo\util\assert_util.cpp(154)                                          mongo::invariantFailed+0x19a
2017-05-30T12:50:43.700+0100 I CONTROL  [Balancer] mongod.exe    ...\src\mongo\db\s\balancer\balancer_chunk_selection_policy_impl.cpp(154)        mongo::`anonymous namespace'::SplitCandidatesBuffer::addSplitPoint+0x32e
2017-05-30T12:50:43.701+0100 I CONTROL  [Balancer] mongod.exe    ...\src\mongo\db\s\balancer\balancer_chunk_selection_policy_impl.cpp(408)        mongo::BalancerChunkSelectionPolicyImpl::_getSplitCandidatesForCollection+0x4c6
2017-05-30T12:50:43.703+0100 I CONTROL  [Balancer] mongod.exe    ...\src\mongo\db\s\balancer\balancer_chunk_selection_policy_impl.cpp(217)        mongo::BalancerChunkSelectionPolicyImpl::selectChunksToSplit+0x3d0
2017-05-30T12:50:43.704+0100 I CONTROL  [Balancer] mongod.exe    ...\src\mongo\db\s\balancer\balancer.cpp(546)                                    mongo::Balancer::_enforceTagRanges+0x65
2017-05-30T12:50:43.705+0100 I CONTROL  [Balancer] mongod.exe    ...\src\mongo\db\s\balancer\balancer.cpp(387)                                    mongo::Balancer::_mainThread+0xa41
2017-05-30T12:50:43.707+0100 I CONTROL  [Balancer] mongod.exe    c:\program files (x86)\microsoft visual studio 14.0\vc\include\thr\xthread(247)  std::_LaunchPad<std::unique_ptr<std::tuple<<lambda_d43e9bbb1383790602f0e36fad81e7d4> >,std::default_delete<std::tuple<<lambda_d43e9bbb1383790602f0e36fad81e7d4> > > > >::_Run+0x77
2017-05-30T12:50:43.709+0100 I CONTROL  [Balancer] mongod.exe    c:\program files (x86)\microsoft visual studio 14.0\vc\include\thr\xthread(210)  std::_Pad::_Call_func+0x9
2017-05-30T12:50:43.710+0100 I CONTROL  [Balancer] ucrtbase.DLL                                                                                   crt_at_quick_exit+0x7d
2017-05-30T12:50:43.712+0100 I CONTROL  [Balancer] kernel32.dll                                                                                   BaseThreadInitThunk+0xd
2017-05-30T12:50:43.713+0100 F -        [Balancer] Got signal: 22 (SIGABRT).
2017-05-30T12:50:43.715+0100 I CONTROL  [Balancer] *** unhandled exception 0x0000000E at 0x000007FEFD3BA06D, terminating
2017-05-30T12:50:43.716+0100 I CONTROL  [Balancer] *** stack trace for unhandled exception:
2017-05-30T12:50:43.723+0100 I CONTROL  [Balancer] KERNELBASE.dll                                                                                   RaiseException+0x3d
2017-05-30T12:50:43.724+0100 I CONTROL  [Balancer] mongod.exe      ...\src\mongo\util\signal_handlers_synchronous.cpp(237)                          mongo::`anonymous namespace'::abruptQuit+0x9d
2017-05-30T12:50:43.726+0100 I CONTROL  [Balancer] ucrtbase.DLL                                                                                     raise+0x1e8
2017-05-30T12:50:43.727+0100 I CONTROL  [Balancer] ucrtbase.DLL                                                                                     abort+0x31
2017-05-30T12:50:43.728+0100 I CONTROL  [Balancer] mongod.exe      ...\src\mongo\util\assert_util.cpp(154)                                          mongo::invariantFailed+0x19a
2017-05-30T12:50:43.730+0100 I CONTROL  [Balancer] mongod.exe      ...\src\mongo\db\s\balancer\balancer_chunk_selection_policy_impl.cpp(154)        mongo::`anonymous namespace'::SplitCandidatesBuffer::addSplitPoint+0x32e
2017-05-30T12:50:43.732+0100 I CONTROL  [Balancer] mongod.exe      ...\src\mongo\db\s\balancer\balancer_chunk_selection_policy_impl.cpp(408)        mongo::BalancerChunkSelectionPolicyImpl::_getSplitCandidatesForCollection+0x4c6
2017-05-30T12:50:43.733+0100 I CONTROL  [Balancer] mongod.exe      ...\src\mongo\db\s\balancer\balancer_chunk_selection_policy_impl.cpp(217)        mongo::BalancerChunkSelectionPolicyImpl::selectChunksToSplit+0x3d0
2017-05-30T12:50:43.734+0100 I CONTROL  [Balancer] mongod.exe      ...\src\mongo\db\s\balancer\balancer.cpp(546)                                    mongo::Balancer::_enforceTagRanges+0x65
2017-05-30T12:50:43.736+0100 I CONTROL  [Balancer] mongod.exe      ...\src\mongo\db\s\balancer\balancer.cpp(387)                                    mongo::Balancer::_mainThread+0xa41
2017-05-30T12:50:43.737+0100 I CONTROL  [Balancer] mongod.exe      c:\program files (x86)\microsoft visual studio 14.0\vc\include\thr\xthread(247)  std::_LaunchPad<std::unique_ptr<std::tuple<<lambda_d43e9bbb1383790602f0e36fad81e7d4> >,std::default_delete<std::tuple<<lambda_d43e9bbb1383790602f0e36fad81e7d4> > > > >::_Run+0x77
2017-05-30T12:50:43.739+0100 I CONTROL  [Balancer] mongod.exe      c:\program files (x86)\microsoft visual studio 14.0\vc\include\thr\xthread(210)  std::_Pad::_Call_func+0x9
2017-05-30T12:50:43.741+0100 I CONTROL  [Balancer] ucrtbase.DLL                                                                                     crt_at_quick_exit+0x7d
2017-05-30T12:50:43.742+0100 I CONTROL  [Balancer] kernel32.dll                                                                                     BaseThreadInitThunk+0xd
2017-05-30T12:50:43.744+0100 I -        [Balancer]
2017-05-30T12:50:43.746+0100 I CONTROL  [Balancer] writing minidump diagnostic file C:\Program Files\MongoDB\Server\3.2017-05-30T11-50-43.mdmp
2017-05-30T12:50:44.017+0100 I CONTROL  [Balancer] *** immediate exit due to unhandled exception



 Comments   
Comment by Githook User [ 06/Dec/17 ]

Author:

{'name': 'Dianna Hohensee', 'username': 'DiannaHohensee', 'email': 'dianna.hohensee@10gen.com'}

Message: SERVER-29397 Ensure user inserted invalid config.tags documents cause the auto-balancer to error rather than invariant

(cherry picked from commit 1340d505df3eb777cbe1684d53c64848052b7151)
Branch: v3.6
https://github.com/mongodb/mongo/commit/0c84fadd609357a79b4e20cdd7a974d8c07fa61d

Comment by Githook User [ 05/Dec/17 ]

Author:

{'username': 'DiannaHohensee', 'email': 'dianna.hohensee@10gen.com', 'name': 'Dianna Hohensee'}

Message: SERVER-29397 Ensure user inserted invalid config.tags documents cause the auto-balancer to error rather than invariant

(cherry picked from commit 1340d505df3eb777cbe1684d53c64848052b7151)
Branch: v3.4
https://github.com/mongodb/mongo/commit/a02b351fa19e64ba790263f7462b1a5e43aa4fca

Comment by Githook User [ 28/Nov/17 ]

Author:

{'name': 'Dianna Hohensee', 'username': 'DiannaHohensee', 'email': 'dianna.hohensee@10gen.com'}

Message: SERVER-29397 Ensure user inserted invalid config.tags documents cause the auto-balancer to error rather than invariant
Branch: master
https://github.com/mongodb/mongo/commit/1340d505df3eb777cbe1684d53c64848052b7151

Comment by Esha Maharishi (Inactive) [ 31/May/17 ]

Ah, that makes sense. I'm marking this as affecting 3.4.4 and 3.5.8, setting it to Needs Triage, and putting it on the sharding backlog.

Here's a javascript repro (the key was to sleep for a while after inserting the invalid config.tags entry, to give the balancer round a chance to try to use it):

var st = new ShardingTest({ mongos: 1, shards: 1, other: { enableBalancer: true}});
 
var shards = st.s.getDB("config").shards.find().toArray();
 
st.s.adminCommand({ addShardToZone: shards[0]._id, zone: "invalidZone" });
 
// Insert an invalid entry into config.tags. Here, the min is greater than the max.
st.s.getDB("config").tags.insertOne({
    _id : { ns: "test.foo", min: { _id: "b" }},
    ns : "test.foo",
    min : { _id: "b" },
    max : { _id: "a" },
    tag : "invalidZone"
});
 
st.s.adminCommand({ enableSharding: "test" });
st.s.adminCommand({ shardCollection: "test.foo", key: { _id: 1 } });
 
// Allow the balancer time to read the invalid entry and try to act on it. The config server primary should hit the invariant.
sleep(1000000);
 
st.stop();

Comment by Clive Hill [ 31/May/17 ]

Thanks Andy! I will look into changing the code to run commands from Java.

FYI, in 3.2 the sharding wasn't working correctly due to max being smaller than min, but we hadn't noticed, i.e. it didn't crash. We're fixing the shard key now.

Comment by Andy Schwerin [ 31/May/17 ]

We'll look into improving behavior when this happens; at the very least, it should not crash the server. It may be possible to use the new validation framework to prevent this kind of mistake.

While the java driver cannot run the shell command sh.updateZoneKeyRange, it can use the runCommand method to directly invoke the updateZoneKeyRange command against a mongos router. I'm not an expert on the java driver, but to perform the equivalent of the following in the shell:

sh.updateZoneKeyRange(
  "ddp.datasources",
  { "location" : "LM", "shard" : "LN1" },
  { "location" : "LN", "shard" : "LN1" },
  "LN1")

You need to construct a BSON document that looks as follows:

{
  updateZoneKeyRange: "ddp.datasources",
  min: { "location" : "LM", "shard" : "LN1" },
  max: { "location" : "LN", "shard" : "LN1" },
  zone: "LN1"
}

And use the Java runCommand method to transmit that document as a command against the "admin" database.
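
As a rough illustration of the document shape described above, here is a minimal JavaScript sketch. The helper name buildUpdateZoneKeyRangeCommand is hypothetical and not part of any MongoDB driver or shell; it only assembles a plain object with the command name as the first field, followed by min, max and zone.

```javascript
// Hypothetical helper (not part of any MongoDB driver or shell): assembles
// the command document for updateZoneKeyRange as a plain object.
function buildUpdateZoneKeyRangeCommand(ns, min, max, zone) {
  return {
    updateZoneKeyRange: ns, // the command name has to be the first field
    min: min,               // lower bound of the zone range
    max: max,               // upper bound of the zone range
    zone: zone
  };
}

const cmd = buildUpdateZoneKeyRangeCommand(
  "ddp.datasources",
  { location: "LM", shard: "LN1" },
  { location: "LN", shard: "LN1" },
  "LN1"
);

console.log(JSON.stringify(cmd, null, 2));
```

In the mongo shell an object like this could be passed to db.adminCommand(cmd) while connected to a mongos. From Java, the assumption is that an equivalent document with the same field order would be handed to runCommand on the "admin" database.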

Comment by Clive Hill [ 31/May/17 ]

I found the problem...

Andy Schwerin thanks for your comments around using updateZoneKeyRange command in 3.4. This provided a helpful error message stating that the min must be less than max. This then helped me notice that I was putting the max as LM and min as LN:

sh.updateZoneKeyRange("ddp.datasources", { "location" : "LN", "shard" : "LN1" }, { "location" : "LM", "shard" : "LN1" }, "LN1")

This was causing the error! I changed the max to be LO and it worked fine.

Do you think it would be possible to have a better error message, instead of the server crashing with the stack trace I sent, if the tags collection is written to directly from Java, e.g. by doing:

db.tags.insertOne({ "_id" : { "ns" : "ddp.datasources", "min" : { "location" : "LN", "shard" : "LN1" } }, "ns" : "ddp.datasources", "min" : { "location" : "LN", "shard" : "LN1" }, "max" : { "location" : "LM", "shard" : "LN1" }, "tag" : "LN1" })

(My understanding is that it is not possible to call commands such as sh.updateZoneKeyRange from Java; instead I have been checking what the function does and implementing it directly.)
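
To make the min/max mix-up described in this comment concrete, here is a small JavaScript sketch of a pre-insert check. Both function names are hypothetical, and compareShardKeys is only a simplified stand-in for the server's key comparison: it assumes both keys have the same fields in the same order and that the values are same-typed ASCII strings or numbers.

```javascript
// Simplified, assumed stand-in for the server's key comparison: walks the
// fields of `a` in order and compares values with JavaScript's < and >.
// Only meaningful for same-typed ASCII strings or numbers.
function compareShardKeys(a, b) {
  for (const field of Object.keys(a)) {
    if (a[field] < b[field]) return -1;
    if (a[field] > b[field]) return 1;
  }
  return 0;
}

// Hypothetical check: a zone range is usable only if min sorts before max.
function isValidZoneRange(min, max) {
  return compareShardKeys(min, max) < 0;
}

const min = { location: "LN", shard: "LN1" };
const badMax = { location: "LM", shard: "LN1" };  // "LM" sorts before "LN"
const goodMax = { location: "LO", shard: "LN1" }; // "LO" sorts after "LN"

console.log(isValidZoneRange(min, badMax));  // false
console.log(isValidZoneRange(min, goodMax)); // true
```

A check along these lines, run before writing to config.tags directly, would have flagged the LN/LM range from this ticket. Using the updateZoneKeyRange command remains the route recommended in the comments here, since it reports the error itself.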

Comment by Clive Hill [ 31/May/17 ]

FYI, it may be of interest that I updated from 3.2 to 3.4 by copying across the data. Everything appears to work fine if the tags already exist and new tags are then added.

Comment by Clive Hill [ 31/May/17 ]

1) Yep, if you run the Java program it happens every time. It also happens every time if done manually.
2) It happens on the first attempt. The config server crashes after that hence no more items are added.

Hopefully with the Java program you will also be able to reproduce this against 3.4.4. Please add a comment and I'll get back to you if you need any further clarification.

I am running this on Windows 7 Enterprise.

Comment by Esha Maharishi (Inactive) [ 30/May/17 ]

Whoops, I hadn't refreshed the page - just saw you added a java repro. Thanks, I'll work with that first.

Comment by Esha Maharishi (Inactive) [ 30/May/17 ]

Hey EvilChill, two quick questions for you:

1) is this crash consistently reproducible via those steps, or were those the steps that caused the crash just one particular time?
2) is the manually inserted tag the only entry ever inserted into config.tags?

I wasn't able to immediately reproduce this on 3.4.4 in our regression suite, so I wanted to check if it could be a timing-related issue with the balancer. But if those steps do reproduce it consistently, maybe some detail isn't being reflected in the javascript repro script.

Comment by Clive Hill [ 30/May/17 ]

I've attached zip (UpgradeTester.zip) with simple Java program and zip.

It assumes that MongoDB 3.4 is installed, and unzips to this folder:

C:\LocalFolder\temp\MongoDBUpgrade\3.4

Put a breakpoint on line 424 and step over. Wait, then look at the output of LN1-config.bat. After a while it will show the bug I raised.

Please let me know if you cannot reproduce.

I will look into managing tag zones...

Comment by Clive Hill [ 30/May/17 ]

This is zip with simple java program which will show the issue

Comment by Andy Schwerin [ 30/May/17 ]

We made changes to tag-aware sharding (now called zoned sharding) and the balancer during the 3.4 release. I wouldn't expect them to lead to invariant failure, so there's probably a bug somewhere, but the prescribed way to manage zones is to use the updateZoneKeyRange command.

We'll work on a repro for our regression suite, but if you have a test program already, please do share it.

Comment by Clive Hill [ 30/May/17 ]

In the above, where it says "Unknown macro: { "ns" }" it should read:

"db.tags.insertOne({ "_id" : { "ns" : "ddp.datasources", "min" : { "location" : "LN", "shard" : "LN1" } }, "ns" : "ddp.datasources", "min" : { "location" : "LN", "shard" : "LN1" }, "max" : { "location" : "LM", "shard" : "LN1" }, "tag" : "LN1" })"

The version confirmed to work was 3.2.4 (not 3.2.9).

Generated at Thu Feb 08 04:20:45 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.