Type: Task
Resolution: Unresolved
Priority: Major - P3
Component/s: BSON
Summary
Avoid generating duplicate ObjectIds when the 3-byte counter overflows within the same second.
Motivation
The ObjectId spec explicitly allows duplicates when the counter overflows within a second:
Counter: The counter makes it possible to have multiple ObjectIDs per second, per server, and per process. As the counter can overflow, there is a possibility of having duplicate ObjectIDs if you create more than 16 million ObjectIDs per second in the same process on a single machine.
This is a big design flaw because the duplicate id might not be noticed. For example, in a sharded cluster where _id is not part of the shard key, uniqueness of _id is only enforced within each shard, so documents with the same _id can end up on different shards without any error.
Of course, we cannot prevent every scenario in which a duplicate ObjectId is generated (e.g. two clients get exceedingly unlucky and draw the same 5-byte random value), but we can detect and prevent the counter overflow scenario, which is much more likely to happen. One common use case is quickly generating a lot of sample documents on the client, e.g. [{"_id": ObjectId(), "x": i} for i in range(50_000_000)].
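To make the overflow concrete, here is a minimal sketch of how a wrapped counter reproduces the same 12 bytes. It does not use pymongo; make_id, the fixed timestamp, the random bytes, and the starting counter are all hypothetical stand-ins, and only the 4-byte time / 5-byte random / 3-byte counter layout is taken from the description above.

```python
import struct

COUNTER_BITS = 24                    # the counter is 3 bytes wide
COUNTER_VALUES = 1 << COUNTER_BITS   # 16,777,216 distinct counter values


def make_id(timestamp: int, random5: bytes, counter: int) -> bytes:
    """Pack 12 bytes: 4-byte big-endian time, 5 random bytes, 3-byte counter.

    Hypothetical helper for illustration; not part of pymongo.
    """
    wrapped = counter % COUNTER_VALUES  # overflow wraps silently
    return struct.pack(">I", timestamp) + random5 + struct.pack(">I", wrapped)[1:4]


timestamp = 1_700_000_000            # assumed fixed second
random5 = b"\x01\x02\x03\x04\x05"    # assumed per-process random value
start = 12_345                       # assumed starting counter value

first = make_id(timestamp, random5, start)
after_wrap = make_id(timestamp, random5, start + COUNTER_VALUES)
print(len(first), first == after_wrap)  # prints: 12 True
```

Once 16,777,216 ids have been produced within one second, the next id is byte-for-byte identical to the first one from that second.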
In pymongo the fix would be straightforward:
diff --git a/bson/objectid.py b/bson/objectid.py
index 970c4e52e..03407f162 100644
--- a/bson/objectid.py
+++ b/bson/objectid.py
@@ -53,6 +53,8 @@ class ObjectId:
 
     _inc = SystemRandom().randint(0, _MAX_COUNTER_VALUE)
     _inc_lock = threading.Lock()
+    _timestamp = 0
+    _inc_start = 0
 
     __random = _random_bytes()
 
@@ -165,12 +167,21 @@ class ObjectId:
 
     def __generate(self) -> None:
         """Generate a new value for this ObjectId."""
+        timestamp = int(time.time())
         with ObjectId._inc_lock:
             inc = ObjectId._inc
             ObjectId._inc = (inc + 1) % (_MAX_COUNTER_VALUE + 1)
+            # Error if we would generate a duplicate id.
+            if timestamp == ObjectId._timestamp:
+                if inc == ObjectId._inc_start:
+                    raise Exception(
+                        'ObjectId overflow would generate duplicate _id. Cannot generate more than 16,777,215 ids within the same second')
+            else:
+                ObjectId._timestamp = timestamp
+                ObjectId._inc_start = inc
 
         # 4 bytes current time, 5 bytes random, 3 bytes inc.
-        self.__id = _PACK_INT_RANDOM(int(time.time()), ObjectId._random()) + _PACK_INT(inc)[1:4]
+        self.__id = _PACK_INT_RANDOM(timestamp, ObjectId._random()) + _PACK_INT(inc)[1:4]
 
     def __validate(self, oid: Any) -> None:
         """Validate and use the given id for this ObjectId.
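The detection idea in the diff can be tried in isolation with a small stand-alone sketch. GuardedCounter, its parameters, and the injectable clock are hypothetical names introduced here for illustration, not pymongo API. The sketch makes two deliberate changes relative to the diff, both of which are assumptions of this example: the counter is only advanced after the check passes, so repeated calls within the same second keep failing rather than resuming after a single error, and the clock is read while holding the lock.

```python
import threading
import time
from typing import Callable, Optional, Tuple


class GuardedCounter:
    """Hypothetical wrap-around counter that refuses to repeat within a second."""

    def __init__(self, max_value: int, start: int = 0,
                 clock: Callable[[], float] = time.time) -> None:
        self._max = max_value
        self._inc = start
        self._lock = threading.Lock()
        self._timestamp: Optional[int] = None  # second of the last issued value
        self._inc_start = start                # first counter value in that second
        self._clock = clock

    def next(self) -> Tuple[int, int]:
        with self._lock:
            # Read the clock under the lock so concurrent callers do not
            # interleave stale timestamps (a choice of this sketch).
            timestamp = int(self._clock())
            inc = self._inc
            if timestamp != self._timestamp:
                # First value in a new second: remember where we started.
                self._timestamp = timestamp
                self._inc_start = inc
            elif inc == self._inc_start:
                # Back at the starting value within the same second:
                # every counter value has already been handed out.
                raise OverflowError("counter wrapped within the same second")
            # Advance only after the check, so a refused call changes nothing.
            self._inc = (inc + 1) % (self._max + 1)
        return timestamp, inc


# Tiny counter (4 values) with a fake clock, so the overflow is easy to hit.
now = [100.0]
counter = GuardedCounter(max_value=3, start=2, clock=lambda: now[0])
issued = [counter.next() for _ in range(4)]  # uses up all four values
try:
    counter.next()
    refused = False
except OverflowError:
    refused = True
now[0] = 101.0                               # the next second begins
resumed = counter.next()
print(issued, refused, resumed)
```

With the fake clock held at one second, the fifth call is refused; once the clock advances, generation resumes from the same counter value under the new timestamp.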