[SERVER-3779] De-dupe index keys inside of a btree bucket Created: 07/Sep/11  Updated: 26/May/15  Resolved: 26/May/15

Status: Closed
Project: Core Server
Component/s: Index Maintenance
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Mathias Stearn Assignee: Unassigned
Resolution: Won't Fix Votes: 6
Labels: pull-request
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

 Description   

Because of the way we store indexes, it should be easy to have multiple key nodes within a btree bucket point to the same key data offset. This should result in much smaller indexes for fields like "tags" that contain many duplicate keys.
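
For a sense of the workload this targets, here is an illustrative shell session (the collection name, field name, and document count are arbitrary): a low-cardinality field repeated across many documents stores the same key bytes once per entry today, so its index is dominated by duplicate key data.

// Many documents share a handful of "tags" values, so the {tags:1} index
// is almost entirely duplicate key data that de-duping could share.
var tags = [ "red", "green", "blue", "cyan" ];
for ( var i = 0; i < 100000; ++i ) {
    db.c.insert( { _id: i, tags: tags[ i % tags.length ] } );
}
db.c.ensureIndex( { tags: 1 } );
printjson( db.c.stats().indexSizes ); // under this proposal, tags_1 would shrink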



 Comments   
Comment by Mathias Stearn [ 26/May/15 ]

This optimization is handled in WiredTiger in a more general way via prefix_compression, which we enable by default. We have no plans to do this optimization for MMAPv1 indexes.
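
For reference, a sketch of controlling WiredTiger's prefix_compression on a single index through the storageEngine index option (prefix compression is already the default for WiredTiger indexes, so passing it explicitly as below is only illustrative):

// Explicitly request WiredTiger prefix compression for one index.
db.c.createIndex(
    { tags: 1 },
    { storageEngine: { wiredTiger: { configString: "prefix_compression=true" } } }
);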

Comment by Guanqun Lu [ 16/Dec/11 ]

@mathias, after Aaron's review I've fixed the core logic in the btree; at least Aaron's case works now, but it still lacks more tests to cover my code path, in case there are corner cases. My modification to the rebalancing code uses a "do my best to balance" approach, because the size calculation is so different with the newly added de-dup code.

Sorry for being absent for so long; it's around the end of the year and my day job has gotten much busier. Anyway, I'll try to catch up and add several test cases.

By the way: what's the release date for 2.2?

Comment by Mathias Stearn [ 14/Dec/11 ]

@Guanqun, Have you made any progress? I'd really like to see this feature make 2.2.

Comment by Guanqun Lu [ 09/Nov/11 ]

Hi Aaron,

Thanks for your review; good catch on the setKey function.

I've done simple fixes to item 1 and item 4.

And I see your point with this test case: yes, the size calculation in rebalancing is too naive and incorrect. I'll fix it later and add more test cases, then ask you for another review.

Thanks!

Comment by Aaron Staple [ 08/Nov/11 ]

PS Thanks for your work so far!

Comment by Aaron Staple [ 08/Nov/11 ]

Hi, here are some more todos. I can take another look once these are taken care of.

  • It looks like there is an incorrect usage of woEqual in setKey()
  • There are some bucket rebalancing issues that are not handled and can lead to btree corruption, for example as seen in this test:

c = db.c;
c.drop();
 
padding = new Array(5000).toString();
for( i = 0; i < 14; ++i ) {
    c.insert({_id:i,p:padding});
}
 
c.ensureIndex({a:1,b:1},{sparse:true});
 
s900 = new Array(901).toString();
 
for( i = 10; i < 18; ++i ) {
    c.save({a:i,b:s900});
}
 
for( i = 0; i < 7; ++i ) {
    a = i % 2 ? 50 : NumberLong(50);
    c.update({_id:i},{a:a,b:s900});
    c.update({_id:14-i-1},{a:a,b:s900});
}
 
//c.runCommand({dumpindex:'c',idx:1});                                                                                          
 
function checkSorted() {
    var ret = c.find().sort({a:1,b:1}).toArray();
    for( var i = 1; i < ret.length; ++i ) {
        assert.lte( ret[i-1].a, ret[i].a );
    }
}
checkSorted();
 
for( i = 10; i < 15; ++i ) {
    c.remove({a:i});
    assert( !db.getLastError() );
}
 
//c.runCommand({dumpindex:'c',idx:1});                                                                                          
 
checkSorted();

  • Particularly given the above items, it would be good to have more unit tests. The btree code is pretty delicate; I think it's a good idea to ensure that each corner case (and at least each new/changed line of code) is tested somehow, so that if a single line of the patched code is backed out, a test fails. (A sketch of this kind of test follows this list.)
  • In terms of code style it might be nice to limit implicit casting of ints to bools, e.g. writing cmpResult == 0 explicitly rather than:
    if ( !cmpResult && key.binaryEqual( prevKey ) )
        needDedup = true;
    There is plenty of casting of this sort in the code already; it would just be nice to limit it in new code.
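
Not from the original comment, but as an illustration of the kind of regression test meant above, a minimal jstest sketch (the collection name and assertions are assumptions about the patch's intended behavior):

// Exercise keys that compare equal but differ in binary representation, then
// check that a covered query still returns the original types, so backing out
// the binary-equality check makes this fail.
var t = db.dedup_types;
t.drop();
t.insert( { a: 5 } );
t.insert( { a: NumberLong(5) } );
t.insert( { a: 5 } );
t.ensureIndex( { a: 1 } );
var covered = t.find( { a: 5 }, { _id: 0, a: 1 } ).toArray();
assert.eq( 3, covered.length );
// Exactly one of the three indexed values came from a NumberLong document.
var longs = covered.filter( function( d ) { return d.a instanceof NumberLong; } );
assert.eq( 1, longs.length );
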
Comment by Guanqun Lu [ 07/Nov/11 ]

During investigation of the above issue, I found a bug involving update and covered indexes; please see SERVER-4218. Due to that bug, the case shown in the ASCII picture (in my comment below) might not actually occur, but it should still be handled.

Comment by Guanqun Lu [ 07/Nov/11 ]

I've fixed this issue.
Please review the patch again if your time permits.

PS: the example Aaron pointed out has been added to jstests.
The main changes include:

  • A function to do a binary comparison is added.
  • In _pushBack, when two keys compare equal via woCompare, an additional binary-equality check is added.
  • In basicInsert, I use this binary comparison instead of the well-ordered comparison.
  • The implementation of popBack is unchanged: it is only used by the index builder, and all previous keys are inserted via pushBack, so we won't hit a case like the one in the picture below; comparing the key offset with the previous one and then deciding whether to _unalloc its space is still OK.
  • The _pack-related functions now use a 'set' or 'map' structure to remember which offsets we've seen before. I can't keep my previous implementation (holding a lastOfs and comparing only against the last seen offset), because that assumption no longer holds; see the picture below and the sketch after it.

First, we have 5, 5.
Then we insert 5LL into this btree bucket:
           +------------+
           |            |
           |            v
+---+----+---+-----+----+----+
| 5 | 5L | 5 |     | xx | yy |
+---+----+---+-----+----+----+
  |   |            ^    ^
  |   +------------+    |
  +---------------------+
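
To make the last bullet concrete, here is a hypothetical simulation (plain JavaScript; all names are illustrative, not the actual btree code) of why _pack needs a set of seen offsets rather than a single lastOfs:

// Key nodes reference key data by offset, several nodes may share one offset,
// and packing must copy each referenced piece of key data exactly once.
var keyData  = { 100: "5", 200: "5LL" };      // offset -> stored key bytes
var keyNodes = [                              // sorted key nodes, as in the picture
    { recordLoc: "xx", keyOfs: 100 },         // 5
    { recordLoc: "yy", keyOfs: 200 },         // 5LL
    { recordLoc: "xx", keyOfs: 100 },         // 5, shares offset 100
];

function pack( nodes, data ) {
    var newData = {}, newOfs = {}, next = 0;
    nodes.forEach( function( n ) {
        // Offset 100 is seen, then 200, then 100 again, so comparing only
        // against the last seen offset would copy "5" twice; a map of every
        // offset seen so far copies it once and rewrites both references.
        if ( !( n.keyOfs in newOfs ) ) {
            newOfs[ n.keyOfs ] = next;
            newData[ next ] = data[ n.keyOfs ];
            next += data[ n.keyOfs ].length;
        }
        n.keyOfs = newOfs[ n.keyOfs ];
    } );
    return newData;
}

printjson( pack( keyNodes, keyData ) );       // "5" and "5LL" each appear once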

Thanks!

Comment by Guanqun Lu [ 04/Nov/11 ]

Hi Eliot,

Thanks for the clarification. Then I'll keep insert as fast as the current implementation and let the pack-related operations (pack, packDataSize) be a little slower.

Comment by Eliot Horowitz (Inactive) [ 04/Nov/11 ]

Definitely not full compaction; just good, with full correctness, and very fast.

Comment by Aaron Staple [ 02/Nov/11 ]

Eliot, can you advise?

Comment by Guanqun Lu [ 02/Nov/11 ]

A quick question: are we aiming for complete de-duplication within a btree bucket, or a near-optimal solution? I ask because, with keys that are equal in ordering but not in binary representation, the best dedupe can't always be achieved; e.g. an interleaving like 5, NumberLong(5), 5, NumberLong(5), etc. (see the example below). If we want complete de-dupe, the insert operation could become expensive.
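
For illustration, a shell sequence that produces such an interleaving (the collection name is arbitrary):

// Interleave int and NumberLong values that compare equal via woCompare but
// have different binary representations in the index key.
for ( var i = 0; i < 4; ++i ) {
    db.c.insert( { a: i % 2 ? NumberLong(5) : 5 } );
}
db.c.ensureIndex( { a: 1 } );
// All four keys sort as equal, but only same-type entries can share bytes, so
// a dedupe that compares each new key only with its neighbor misses pairs.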

Comment by Guanqun Lu [ 02/Nov/11 ]

Hi Aaron,

Thanks for pointing out that equality of keys does not always mean equality of binary representation. There are two ways (just off the top of my head) to fix this case:

  • Lots of my implementation has to be changed; we may need a structure recording which key maps to which offset, and pack could become expensive then.
  • Disable covered index queries on deduped keys, but that just doesn't obey the idea behind covered indexes (SERVER-192).

I'll post here if there's any progress.

Comment by Aaron Staple [ 02/Nov/11 ]

If this gets fixed I can take another look.

Comment by Aaron Staple [ 02/Nov/11 ]

One todo here is that equality of two keys does not always mean they can share a binary representation. For example, numeric keys carry type information that is lost if a binary representation is shared improperly:

With the patched code:

> c.save( {a:5} )
> c.save( {a:NumberLong(5)} )
> c.find()
{ "_id" : ObjectId("4eb08a5ab5772709abd87679"), "a" : 5 }
{ "_id" : ObjectId("4eb08a63b5772709abd8767a"), "a" : NumberLong(5) }
> c.ensureIndex({a:1})
> c.find({a:5},{_id:0,a:1})
{ "a" : 5 }
{ "a" : 5 }

The covered index query at the end returns an incorrect data type for the second document's 'a' field.

Comment by Guanqun Lu [ 27/Oct/11 ]

Hi Dwight,

Thanks for the encouragement.

Regarding your questions:

1. Yes, when keys are the same, their offsets in __KeyNode point to the same location. In practice, the code only checks the previous __KeyNode to see if the newly added key is the same as the previous one, because we know the __KeyNodes are already sorted during the index-building process. For the basicInsert case, though, we might need to check the next key as well.

2. We don't depend on reference counts to delete a key. When we delete a key, there are two cases:
a) the key node is deleted right away, but the key data is still there;
b) the key node is marked as unused (this happens only in the legacy btree), but the key data is still there.
In both cases the key data itself is not touched; it is cleaned out by the _pack function. Since _pack copies only the data that is still referenced, we don't need to carry the burden of tracking reference counts directly.

Hope the above answers your questions. I'm glad to provide more details about my implementation.

Comment by Dwight Merriman [ 26/Oct/11 ]

I haven't read it in detail; a couple of questions that would facilitate reading:
Is it pointing multiple key offsets to the same position when the key is the same? How does it know when it can then release that key (i.e. the number of references)?

Comment by Dwight Merriman [ 26/Oct/11 ]

This is definitely something we want to do.
The patch looks pretty good, but it's going to take a while to review.

Comment by Guanqun Lu [ 24/Oct/11 ]

I've written a rough (informal) post about how much we can gain from this de-dup:
http://printf.me/2011/10/23/whats-the-gain-of-de-duped-key-data-feature/

Comment by Guanqun Lu [ 20/Oct/11 ]

Pull request here: https://github.com/mongodb/mongo/pull/130

Please review.
