Type: Bug
Resolution: Duplicate
Priority: Minor - P4
Affects Version/s: 2.2.3
Component/s: Index Maintenance
This is similar to SERVER-2193, but I think this use case is compelling and the existing semantics are unintuitive.
Given the following GridFS-like schema:
1) Metadata collection for file metadata
2) Chunks collection for file chunks
Files are large, and we support multiple versions of each file. We therefore split files into chunks and key each chunk by the SHA of its content, so that disk space is saved when most of the file hasn't changed between versions.
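The chunk-level deduplication described above can be sketched roughly like this (the chunk size and the use of SHA-1 here are illustrative assumptions, not the actual implementation):

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # illustrative 4 MB chunks

def make_chunks(data, version):
    """Split a file into chunks and key each chunk by the SHA of its content."""
    docs = []
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = data[start:start + CHUNK_SIZE]
        docs.append({
            'chunk': start // CHUNK_SIZE,
            'sha': hashlib.sha1(chunk).hexdigest(),
            'parents': [version],
        })
    return docs

# Two versions that differ only in the second chunk share the sha of the
# first chunk, so the shared chunk only needs to be stored once.
v1 = make_chunks(b'a' * CHUNK_SIZE + b'b' * CHUNK_SIZE, 'v1')
v2 = make_chunks(b'a' * CHUNK_SIZE + b'c' * CHUNK_SIZE, 'v2')
```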
The chunks collection has a trivial unique index:
coll.ensure_index([('symbol', 1), ('parents', 1), ('chunk', 1)], unique=True, sparse=True)
'symbol' - the file name; can be used for sharding (not strictly necessary)
'parents' - array of parent metadata documents, representing the versions that reference this chunk
'chunk' - the chunk number within a given version of the file
For every version of a file, each (parent, chunk) pair must be unique.
This works well: it's fast, you can easily slice out ranges of a file, and it provides version control and space savings when most of the data stays the same between versions.
However, a problem arises when you delete. If the parents array becomes empty on more than one document with the same chunk number, the unique constraint is violated, because each empty array is indexed as a single (undefined, 'chunk') key, which can produce duplicates.
For example:
coll.ensure_index([('symbol', 1), ('parents', 1), ('chunk', 1)], unique=True, sparse=True)

# Two versions of file 'a', 'b'. They share chunk 1
coll.insert({'symbol': 'sym', 'parents': ['a', 'b'], 'chunk': '1', 'sha': 1})
coll.insert({'symbol': 'sym', 'parents': ['a'], 'chunk': '2', 'sha': 1})
coll.insert({'symbol': 'sym', 'parents': ['b'], 'chunk': '2', 'sha': 2})

# Add version 'c'
coll.insert({'symbol': 'sym', 'parents': ['c'], 'chunk': '1', 'sha': 3})
coll.insert({'symbol': 'sym', 'parents': ['c'], 'chunk': '2', 'sha': 4})

# Now delete versions 'a', 'b'
coll.update({}, {'$pullAll': {'parents': ['a', 'b']}}, multi=True)

Traceback (most recent call last):
  File "/users/is/jblackburn/pyenvs/research/lib/python2.6/site-packages/ipython-0.11_5-py2.6.egg/IPython/core/interactiveshell.py", line 2400, in run_code
    exec code_obj in self.user_global_ns, self.user_ns
  File "<ipython-input-29-5a1fca9e1f55>", line 1, in <module>
    coll.update({}, {'$pullAll': {'parents': ['a', 'b']}}, multi=True)
  File "/local/home/jblackburn/net/pymongo-2.4.2/pymongo/collection.py", line 481, in update
    check_keys, self.__uuid_subtype), safe)
  File "/local/home/jblackburn/net/pymongo-2.4.2/pymongo/mongo_client.py", line 844, in _send_message
    rv = self.__check_response_to_last_error(response)
  File "/local/home/jblackburn/net/pymongo-2.4.2/pymongo/mongo_client.py", line 785, in __check_response_to_last_error
    raise DuplicateKeyError(details["err"])
DuplicateKeyError: E11000 duplicate key error index: jblackburn_scratch.test3.$symbol_1_parents_1_chunk_1 dup key: { : "sym", : undefined, : "2" }
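The collision can be modelled without a server. The sketch below is a rough approximation of multi-key index key extraction, with an empty array mapped to a single 'undefined' entry as in the error above; the helper name and the 'undefined' sentinel are illustrative, not a server API:

```python
def index_keys(doc):
    """Approximate the keys a multi-key index on (symbol, parents, chunk)
    generates for one document: one key per array element, or a single
    'undefined' key when the array is empty."""
    parents = doc['parents'] or ['undefined']
    return [(doc['symbol'], p, doc['chunk']) for p in parents]

# The two chunk-'2' documents after $pullAll has emptied both parents arrays.
docs = [
    {'symbol': 'sym', 'parents': [], 'chunk': '2'},  # was parents: ['a']
    {'symbol': 'sym', 'parents': [], 'chunk': '2'},  # was parents: ['b']
]

keys = [k for d in docs for k in index_keys(d)]
# Both documents now map to ('sym', 'undefined', '2') -- a duplicate key,
# which is exactly what the E11000 error above reports.
assert len(keys) != len(set(keys))
```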
It's great that arrays work as multi-key indexes. However, it's less great that an empty array is indexed with a special 'undefined' value.
I can't see how it's useful for a compound unique index that contains a multi-key field to include documents whose multi-key field is empty. At the very least, sparse indexes could reasonably ignore documents with empty multi-key fields in compound indexes.
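Until the semantics change, one client-side workaround is to pull a single version at a time and remove the emptied documents before pulling the next: because each (parent, chunk) pair is unique, one $pullAll per version can empty at most one document per chunk number, so no two empty-array documents ever share a chunk. A minimal in-memory sketch of that sequence (the server round-trips are modelled with plain Python; all names are illustrative):

```python
def index_keys(doc):
    # Empty 'parents' collapses to a single 'undefined' entry, mirroring
    # the multi-key index behaviour described above.
    parents = doc['parents'] or ['undefined']
    return [(doc['symbol'], p, doc['chunk']) for p in parents]

def assert_unique(docs):
    keys = [k for d in docs for k in index_keys(d)]
    assert len(keys) == len(set(keys)), 'duplicate index key'

docs = [
    {'symbol': 'sym', 'parents': ['a', 'b'], 'chunk': '1'},
    {'symbol': 'sym', 'parents': ['a'], 'chunk': '2'},
    {'symbol': 'sym', 'parents': ['b'], 'chunk': '2'},
]

# Pull one version at a time, dropping emptied documents between pulls
# (equivalent to a $pullAll per version followed by a remove on parents: []).
for version in ['a', 'b']:
    for doc in docs:
        if version in doc['parents']:
            doc['parents'].remove(version)
    assert_unique(docs)                  # no E11000-style collision arises
    docs = [d for d in docs if d['parents']]
```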