[SERVER-14759] Splitting very close to an existing double precision value causes missing chunks Created: 01/Aug/14  Updated: 28/Aug/19  Resolved: 28/Aug/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.7.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kevin Pulo Assignee: Janna Golden
Resolution: Duplicate Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File split_close_doubles.js    
Issue Links:
Depends
depends on SERVER-8829 String representation for chunk id is... Closed
depends on SERVER-42106 Use auto-generated _ids for config.ch... Closed
Related
related to SERVER-9931 hashed shard keys do not appear to ha... Closed
is related to SERVER-14761 split command should only allow Numbe... Closed
Operating System: ALL
Sprint: Sharding 2019-09-09
Participants:

 Description   

Consider a shard key whose values are double-precision floats (i.e. "numbers" in JavaScript). If a split is attempted at a point very close to an existing chunk min/max (i.e. a double value that is "adjacent", or nearly so), metadata corruption occurs. Specifically, at least one chunk goes missing, which leaves a gap in the chunk ranges and makes it impossible to load the config metadata for that collection (since the config is invalid). Only certain values cause the problem.

The exact outcome depends on whether the split point is just above or just below the existing chunk endpoint.

  • Just larger than existing point: split command fails (ok: 0), the "left" chunk (chunk B) is missing:

               existing endpoint     new split point
    ...------------------------|-----|------------------------...
                Chunk A         Chunk       Chunk C
                                  B
                              (missing)
    

  • Just smaller than existing point: split command succeeds (ok: 1), chunk is split correctly into chunks A and B, but the "subsequent" chunk (chunk C) is missing:

                 new split point     existing endpoint
    ...------------------------|-----|------------------------...
                Chunk A         Chunk       Chunk C
                                  B        (missing)
    

The attached reproducer shows some values that work and some that don't. It splits at two double-precision values in order. Since some combinations work and some don't, there must be something specific about the actual values (or the difference between them) that causes the failure. If the "A then B" case doesn't work, then "B then A" doesn't work either (though with the different symptoms described above).

Here are the results. The expectation is that every test should pass (or at least not cause an invalid config).

test1: *** FAILED ***: 1 then 1.0000000000000002: [ "(second) split not ok", "(second) wrong chunk count", "(second) gaps" ]
        {  "_id" : "test1.test1-field_MinKey",  "lastmod" : Timestamp(1, 1),  "lastmodEpoch" : ObjectId("53db166ec333c70bae888422"),  "ns" : "test1.test1",  "min" : {  "field" : { "$minKey" : 1 } },  "max" : {  "field" : 1 },  "shard" : "shard0000" }
        {  "_id" : "test1.test1-field_1.0",  "lastmod" : Timestamp(1, 4),  "lastmodEpoch" : ObjectId("53db166ec333c70bae888422"),  "ns" : "test1.test1",  "min" : {  "field" : 1.0000000000000002 },  "max" : {  "field" : { "$maxKey" : 1 } },  "shard" : "shard0000" }
test2: *** FAILED ***: 1 then 1.0000000000000004: [ "(second) split not ok", "(second) wrong chunk count", "(second) gaps" ]
        {  "_id" : "test2.test2-field_MinKey",  "lastmod" : Timestamp(1, 1),  "lastmodEpoch" : ObjectId("53db166ec333c70bae888427"),  "ns" : "test2.test2",  "min" : {  "field" : { "$minKey" : 1 } },  "max" : {  "field" : 1 },  "shard" : "shard0000" }
        {  "_id" : "test2.test2-field_1.0",  "lastmod" : Timestamp(1, 4),  "lastmodEpoch" : ObjectId("53db166ec333c70bae888427"),  "ns" : "test2.test2",  "min" : {  "field" : 1.0000000000000004 },  "max" : {  "field" : { "$maxKey" : 1 } },  "shard" : "shard0000" }
test3: passed: 1 then 1.0000000000000007
test4: passed: 1 then 1.0000000000000009
test5: *** FAILED ***: 1.0000000000000002 then 1.0000000000000004: [ "(second) split not ok", "(second) wrong chunk count", "(second) gaps" ]
        {  "_id" : "test5.test5-field_MinKey",  "lastmod" : Timestamp(1, 1),  "lastmodEpoch" : ObjectId("53db166fc333c70bae888436"),  "ns" : "test5.test5",  "min" : {  "field" : { "$minKey" : 1 } },  "max" : {  "field" : 1.0000000000000002 },  "shard" : "shard0000" }
        {  "_id" : "test5.test5-field_1.0",  "lastmod" : Timestamp(1, 4),  "lastmodEpoch" : ObjectId("53db166fc333c70bae888436"),  "ns" : "test5.test5",  "min" : {  "field" : 1.0000000000000004 },  "max" : {  "field" : { "$maxKey" : 1 } },  "shard" : "shard0000" }
test6: passed: 1.0000000000000002 then 1.0000000000000007
test7: passed: 1.0000000000000002 then 1.0000000000000009
test8: passed: 1.0000000000000004 then 1.0000000000000007
test9: passed: 1.0000000000000004 then 1.0000000000000009
test10: *** FAILED ***: 1.0000000000000007 then 1.0000000000000009: [ "(second) split not ok", "(second) wrong chunk count", "(second) gaps" ]
        {  "_id" : "test10.test10-field_MinKey",  "lastmod" : Timestamp(1, 1),  "lastmodEpoch" : ObjectId("53db166fc333c70bae88844f"),  "ns" : "test10.test10",  "min" : {  "field" : { "$minKey" : 1 } },  "max" : {  "field" : 1.0000000000000007 },  "shard" : "shard0000" }
        {  "_id" : "test10.test10-field_1.000000000000001",  "lastmod" : Timestamp(1, 4),  "lastmodEpoch" : ObjectId("53db166fc333c70bae88844f"),  "ns" : "test10.test10",  "min" : {  "field" : 1.0000000000000009 },  "max" : {  "field" : { "$maxKey" : 1 } },  "shard" : "shard0000" }
test11: *** FAILED ***: 1.0000000000000002 then 1: [ "(second) wrong chunk count", "(second) bad min/max chunk" ]
        {  "_id" : "test11.test11-field_MinKey",  "lastmod" : Timestamp(1, 3),  "lastmodEpoch" : ObjectId("53db166fc333c70bae888454"),  "ns" : "test11.test11",  "min" : {  "field" : { "$minKey" : 1 } },  "max" : {  "field" : 1 },  "shard" : "shard0000" }
        {  "_id" : "test11.test11-field_1.0",  "lastmod" : Timestamp(1, 4),  "lastmodEpoch" : ObjectId("53db166fc333c70bae888454"),  "ns" : "test11.test11",  "min" : {  "field" : 1 },  "max" : {  "field" : 1.0000000000000002 },  "shard" : "shard0000" }
test12: *** FAILED ***: 1.0000000000000004 then 1: [ "(second) wrong chunk count", "(second) bad min/max chunk" ]
        {  "_id" : "test12.test12-field_MinKey",  "lastmod" : Timestamp(1, 3),  "lastmodEpoch" : ObjectId("53db1670c333c70bae888459"),  "ns" : "test12.test12",  "min" : {  "field" : { "$minKey" : 1 } },  "max" : {  "field" : 1 },  "shard" : "shard0000" }
        {  "_id" : "test12.test12-field_1.0",  "lastmod" : Timestamp(1, 4),  "lastmodEpoch" : ObjectId("53db1670c333c70bae888459"),  "ns" : "test12.test12",  "min" : {  "field" : 1 },  "max" : {  "field" : 1.0000000000000004 },  "shard" : "shard0000" }
test13: passed: 1.0000000000000007 then 1
test14: passed: 1.0000000000000009 then 1
test15: *** FAILED ***: 1.0000000000000004 then 1.0000000000000002: [ "(second) wrong chunk count", "(second) bad min/max chunk" ]
        {  "_id" : "test15.test15-field_MinKey",  "lastmod" : Timestamp(1, 3),  "lastmodEpoch" : ObjectId("53db1670c333c70bae888468"),  "ns" : "test15.test15",  "min" : {  "field" : { "$minKey" : 1 } },  "max" : {  "field" : 1.0000000000000002 },  "shard" : "shard0000" }
        {  "_id" : "test15.test15-field_1.0",  "lastmod" : Timestamp(1, 4),  "lastmodEpoch" : ObjectId("53db1670c333c70bae888468"),  "ns" : "test15.test15",  "min" : {  "field" : 1.0000000000000002 },  "max" : {  "field" : 1.0000000000000004 },  "shard" : "shard0000" }
test16: passed: 1.0000000000000007 then 1.0000000000000002
test17: passed: 1.0000000000000009 then 1.0000000000000002
test18: passed: 1.0000000000000007 then 1.0000000000000004
test19: passed: 1.0000000000000009 then 1.0000000000000004
test20: *** FAILED ***: 1.0000000000000009 then 1.0000000000000007: [ "(second) wrong chunk count", "(second) bad min/max chunk" ]
        {  "_id" : "test20.test20-field_MinKey",  "lastmod" : Timestamp(1, 3),  "lastmodEpoch" : ObjectId("53db1671c333c70bae888481"),  "ns" : "test20.test20",  "min" : {  "field" : { "$minKey" : 1 } },  "max" : {  "field" : 1.0000000000000007 },  "shard" : "shard0000" }
        {  "_id" : "test20.test20-field_1.000000000000001",  "lastmod" : Timestamp(1, 4),  "lastmodEpoch" : ObjectId("53db1671c333c70bae888481"),  "ns" : "test20.test20",  "min" : {  "field" : 1.0000000000000007 },  "max" : {  "field" : 1.0000000000000009 },  "shard" : "shard0000" }
test21: *** FAILED ***: -4204176258327475000 then -4204176258327474700: [ "(second) split not ok", "(second) wrong chunk count", "(second) gaps" ]
        {  "_id" : "test21.test21-field_MinKey",  "lastmod" : Timestamp(1, 1),  "lastmodEpoch" : ObjectId("53db1671c333c70bae888486"),  "ns" : "test21.test21",  "min" : {  "field" : { "$minKey" : 1 } },  "max" : {  "field" : -4204176258327475000 },  "shard" : "shard0000" }
        {  "_id" : "test21.test21-field_-4.204176258327475e+18",  "lastmod" : Timestamp(1, 4),  "lastmodEpoch" : ObjectId("53db1671c333c70bae888486"),  "ns" : "test21.test21",  "min" : {  "field" : -4204176258327474700 },  "max" : {  "field" : { "$maxKey" : 1 } },  "shard" : "shard0000" }
test22: *** FAILED ***: -4204176258327474700 then -4204176258327475000: [ "(second) wrong chunk count", "(second) bad min/max chunk" ]
        {  "_id" : "test22.test22-field_MinKey",  "lastmod" : Timestamp(1, 3),  "lastmodEpoch" : ObjectId("53db1671c333c70bae88848b"),  "ns" : "test22.test22",  "min" : {  "field" : { "$minKey" : 1 } },  "max" : {  "field" : -4204176258327475000 },  "shard" : "shard0000" }
        {  "_id" : "test22.test22-field_-4.204176258327475e+18",  "lastmod" : Timestamp(1, 4),  "lastmodEpoch" : ObjectId("53db1671c333c70bae88848b"),  "ns" : "test22.test22",  "min" : {  "field" : -4204176258327475000 },  "max" : {  "field" : -4204176258327474700 },  "shard" : "shard0000" }
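
The "gaps" and "bad min/max chunk" failures above can be checked mechanically: the chunks must start at MinKey, end at MaxKey, and each chunk's max must equal the next chunk's min. The following is a minimal sketch of such a check in plain JavaScript, not the logic of the attached reproducer. The function name `checkChunks` and the use of `-Infinity`/`Infinity` as stand-ins for MinKey/MaxKey are assumptions made for illustration.

```javascript
// Hypothetical stand-ins for the MinKey/MaxKey sentinels (assumption for this sketch).
const MIN_KEY = -Infinity;
const MAX_KEY = Infinity;

// Returns a list of problems found in a set of {min, max} chunk ranges.
// An empty list means the ranges cover MinKey..MaxKey with no gaps or overlaps.
function checkChunks(chunks) {
  if (chunks.length === 0) return ["no chunks"];
  const problems = [];
  // Sort by lower bound without subtracting, so infinities compare safely.
  const sorted = [...chunks].sort((a, b) =>
    a.min < b.min ? -1 : a.min > b.min ? 1 : 0);

  if (sorted[0].min !== MIN_KEY) {
    problems.push("first chunk does not start at MinKey");
  }
  if (sorted[sorted.length - 1].max !== MAX_KEY) {
    problems.push("last chunk does not end at MaxKey");
  }
  for (let i = 1; i < sorted.length; i++) {
    if (sorted[i - 1].max !== sorted[i].min) {
      problems.push(
        `gap or overlap between ${sorted[i - 1].max} and ${sorted[i].min}`);
    }
  }
  return problems;
}

// Chunk ranges as left behind by test1 and test11 above.
const test1Chunks = [
  { min: MIN_KEY, max: 1 },
  { min: 1.0000000000000002, max: MAX_KEY },
];
const test11Chunks = [
  { min: MIN_KEY, max: 1 },
  { min: 1, max: 1.0000000000000002 },
];
console.log(checkChunks(test1Chunks));
console.log(checkChunks(test11Chunks));
```

Applied to the two chunks left by test1, the check reports a gap between 1 and 1.0000000000000002. Applied to test11, it reports that the last chunk does not end at MaxKey.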

The test case values are:

double              hex (IEEE 754 bits)  decimal (same bits)
1.0000000000000000  0x3ff0000000000000   4607182418800017408
1.0000000000000002  0x3ff0000000000001   4607182418800017409
1.0000000000000004  0x3ff0000000000002   4607182418800017410
1.0000000000000007  0x3ff0000000000003   4607182418800017411
1.0000000000000009  0x3ff0000000000004   4607182418800017412
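
The hex and decimal columns are the raw 64-bit patterns of the doubles, so consecutive rows are adjacent representable values. As an illustration, the table can be reproduced in JavaScript with a `DataView`. The helper names `doubleToBits`, `bitsToDouble` and `nextUp` are hypothetical and are not part of the attached reproducer.

```javascript
// Return the raw IEEE 754 bit pattern of a double as a BigInt.
function doubleToBits(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);          // big-endian by default
  return view.getBigUint64(0);    // read the same 8 bytes back as an integer
}

// Inverse of doubleToBits: build a double from a 64-bit pattern.
function bitsToDouble(bits) {
  const view = new DataView(new ArrayBuffer(8));
  view.setBigUint64(0, bits);
  return view.getFloat64(0);
}

// Next representable double above x. Only valid for positive finite x,
// where incrementing the bit pattern moves to the adjacent larger value.
function nextUp(x) {
  return bitsToDouble(doubleToBits(x) + 1n);
}

// Walk the five test values starting from 1.
let value = 1;
for (let i = 0; i < 5; i++) {
  const bits = doubleToBits(value);
  console.log(value, "0x" + bits.toString(16), bits.toString(10));
  value = nextUp(value);
}
```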


 Comments   
Comment by Janna Golden [ 28/Aug/19 ]

This was fixed in SERVER-42106 which removed the string representation of the _id field for config.chunks.
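
In the output above, distinct min values share one _id suffix. For example, the chunk with min 1.0000000000000002 has _id "test1.test1-field_1.0", and the chunk with min 1.0000000000000009 has _id "test10.test10-field_1.000000000000001". The sketch below illustrates how a lossy string rendering can make adjacent doubles collide. It uses `toPrecision(16)` as a stand-in, which is an assumption for illustration and not the server's actual formatting code. The helper names `lossyKeyString` and `findCollisions` are hypothetical.

```javascript
// Assumed stand-in for a lossy string rendering of a shard key value.
// The server's real formatting differs (e.g. it prints "1.0"), so this
// only models the loss of precision, not the exact output.
function lossyKeyString(x) {
  return x.toPrecision(16);
}

// Group values by their lossy rendering and return the groups that
// contain more than one distinct value, i.e. the colliding ones.
function findCollisions(values) {
  const byString = new Map();
  for (const v of values) {
    const s = lossyKeyString(v);
    if (!byString.has(s)) byString.set(s, []);
    byString.get(s).push(v);
  }
  return [...byString.values()].filter((group) => group.length > 1);
}

const groups = findCollisions([
  1,
  1.0000000000000002,
  1.0000000000000004,
  1.0000000000000007,
  1.0000000000000009,
]);
console.log(groups);

// The pair used in test21 and test22 collides as well.
console.log(lossyKeyString(-4204176258327475000));
console.log(lossyKeyString(-4204176258327474700));
```

Under this assumption, 1, 1.0000000000000002 and 1.0000000000000004 fall into one group, and 1.0000000000000007 and 1.0000000000000009 into another. That is consistent with the results: every failing test splits at two values from the same group, and every passing test uses values from different groups.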
