[SERVER-26050] Unique key violation for index with a non-simple collation has unclear error message Created: 09/Sep/16  Updated: 30/Jan/24

Status: Backlog
Project: Core Server
Component/s: Querying, Usability
Affects Version/s: 3.3.12
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive) Assignee: Backlog - Query Execution
Resolution: Unresolved Votes: 3
Labels: collation, query-44-grooming, storch
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
is duplicated by SERVER-34974 Dup key value is not readable in erro... Closed
is duplicated by SERVER-75242 E11000 duplicate key error cannot be ... Closed
Related
related to SERVER-24007 Server can return invalid UTF8 for er... Backlog
related to SERVER-46810 Broken E11000 duplicate key error whe... Closed
related to SERVER-60298 Explain can include ICU collation key... Closed
Assigned Teams:
Query Execution
Sprint: QE 2023-04-03, QE 2023-04-17
Participants:
Case:

 Description   

db.numbers.ensureIndex({text: 1}, {unique: true, collation: {locale: "en", numericOrdering: true}})
db.numbers.insert({text: "10"})
db.numbers.insert({text: "10"})

Gives an output of:

WriteResult({
	"nInserted" : 0,
	"writeError" : {
		"code" : 11000,
		"errmsg" : "E11000 duplicate key error index: test.numbers.$text_1 dup key: { : \"\u000f\f\u0001\u0005\u0001\u0005\" }"
	}
})

Whereas values not transformed by ICU provide a readable key:

WriteResult({
	"nInserted" : 0,
	"writeError" : {
		"code" : 11000,
		"errmsg" : "E11000 duplicate key error index: test.strings.$text_1 dup key: { : \"10\" }"
	}
})



 Comments   
Comment by Kyle Suarez [ 18/Apr/23 ]

Sending to the backlog but adding it to the quick wins bucket.

david.storch@mongodb.com thinks this might not be that quick. One interesting note is that we do want to eventually upgrade the ICU library and that will necessitate an index format change (as well as affecting how we approach this issue, as Ivan has mentioned earlier.)

Comment by Ivan Fefer [ 13/Apr/23 ]

After cosmetic usability fix and some more investigation, I found two ways to address this issue:

1. Reverse-engineer icu::CollationKeys::writeSortKeyUpToQuaternary function to be able to extract collation elements from collation key and do a reverse-lookup to find corresponding UTF code points to build a string.

Pros:

Gives the best output.

Cons:

Tied to exact version of ICU library (but we already tied to it, because we store collation keys on disk).

There are edge cases. For example, numerical ordering might contain just numbers.

2. Check for DuplicateKey error status after index write operations in mongo::IndexAccessMethod::insert/update and extend error message with a conflicting document or human-readable key.

Pros:

Easier to implement.

Less dependency on third party code.

Cons:

If we put whole document into error message (which we do in some cases, for example: batch insert), error messages will be less usable and it is possible that conflicting key would be hidden due to compaction.

If we decided to try to extract human-readable index key from a document, we will need to implement extra logic that might be complex, to pass “human-readable” flag to underlying key generators and call them again.

Comment by Githook User [ 31/Mar/23 ]

Author:

{'name': 'Ivan Fefer', 'email': 'ivan.fefer@mongodb.com', 'username': 'Fefer-Ivan'}

Message: SERVER-26050 Add CollationKey(..) to hex-encoded index keys when using collation
Branch: master
https://github.com/mongodb/mongo/commit/6ee4d650e456ea73526729ec3e3cca3b7a5a1b14

Comment by Ivan Fefer [ 30/Mar/23 ]

While we think about more complicated fix, I suggest adding "CollationKey()" around hex-encoded binary data to make output a little bit more clear.
Like we do for index bounds in explain output: 

            indexBounds: {
              text: [ '[CollationKey(0x292b292d292b29010b010b), {})' ]
            }
          }

Comment by Ivan Fefer [ 30/Mar/23 ]

Interesting side effect of this is that indexes with collation can't cover projections

Comment by Ivan Fefer [ 29/Mar/23 ]

Under the hood indexes with collation, use icu::Collator::getCollationKey to get a sequence of bytes that can be compared just using regular comparison that should produce the same result as comparing original strings with the same collator. And we store this sequence in the index, instead of the original string.

The ICU docs state that reverse operation is not supported.

The other approach would be to catch the error and print the document or the update that caused the error to appear as well. Or something similar. 

Comment by Ivan Fefer [ 29/Mar/23 ]

The reason for this output is stated in this comment: https://github.com/mongodb/mongo/blob/master/src/mongo/db/storage/index_entry_comparison.cpp#L212

I am investigating how we can extract some human-readable user strings.

Comment by Kasper N/A [ 17/Apr/21 ]

I found this issue and see that information is now hex encoded. However I struggle to successfully decode the information. Any pointers would be great. 

Here my StackOverflow question:
https://stackoverflow.com/q/67131426/8051909

Comment by David Storch [ 16/Apr/20 ]

Under SERVER-46810, we fixed the problem where collation can cause the duplicate key error message to contain invalid UTF-8. However, the error messages still provide collation keys rather than the raw user string that resulted in the duplicate key. This ticket now tracks an improvement which would expose the raw string whose collation key resulted in a duplicate key error. This is not trivial to implement, since duplicate key errors are produced by the storage layer, which has no knowledge of the collation implementation or of the raw string that resulted in the duplicated collation key.

Comment by Eric Tray [ 30/Aug/19 ]

Since this currently makes the error message unreadable (and unusable), is there a way to manually transform the duplicate key values back into a readable format?

Comment by Bruce Lucas (Inactive) [ 18/Jun/18 ]

I adjusted the priority and issue type because this can result in non-UTF-8 characters in that can cause decoding the result in the client to fail.

Generated at Thu Feb 08 04:11:00 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.