Avoid generating duplicate ObjectIds when 3-byte counter overflows


    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Component/s: BSON

      Summary

      Avoid generating duplicate ObjectIds when the 3-byte counter overflows within the same second.

      Motivation

      The ObjectId spec explicitly allows duplicates when the counter overflows within a second:

      Counter: The counter makes it possible to have multiple ObjectIDs per second, per server, and per process. As the counter can overflow, there is a possibility of having duplicate ObjectIDs if you create more than 16 million ObjectIDs per second in the same process on a single machine.
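
      To make the overflow concrete, here is a minimal sketch of the 12-byte layout (4-byte timestamp, 5-byte random value, 3-byte counter). The helper name pack_object_id and the sample values are assumptions chosen for illustration, not pymongo's API. It shows that two ids generated 2**24 increments apart within the same second are byte-for-byte identical.

```python
import struct

COUNTER_BITS = 24
COUNTER_MODULUS = 1 << COUNTER_BITS  # 16,777,216 distinct counter values


def pack_object_id(timestamp: int, random_bytes: bytes, counter: int) -> bytes:
    """Pack the three ObjectId fields into 12 bytes (illustrative helper)."""
    if len(random_bytes) != 5:
        raise ValueError("random_bytes must be exactly 5 bytes")
    # 4-byte big-endian timestamp, 5 random bytes, low 3 bytes of the counter.
    return (
        struct.pack(">I", timestamp)
        + random_bytes
        + struct.pack(">I", counter % COUNTER_MODULUS)[1:4]
    )


timestamp = 1_700_000_000                 # arbitrary fixed second (assumption)
process_random = b"\x01\x02\x03\x04\x05"  # arbitrary per-process value (assumption)
start = 0xFFFFF0                          # arbitrary starting counter (assumption)

first = pack_object_id(timestamp, process_random, start)
# The id produced 2**24 increments later, still within the same second.
wrapped = pack_object_id(timestamp, process_random, start + COUNTER_MODULUS)

print(len(first))        # 12
print(first == wrapped)  # True: the counter wrapped, so the id repeats
```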

      This is a serious design flaw because a duplicate _id can go unnoticed. For example, in a sharded cluster where _id is not part of the shard key, uniqueness of _id is enforced only within each shard, so two shards can each hold a document with the same _id.

      Of course, we cannot prevent every scenario in which a duplicate ObjectId is generated (e.g. two clients get exceedingly unlucky and draw the same 5-byte random value), but we can detect and prevent the counter overflow scenario, which is much more likely to happen. One common use case is quickly generating a lot of sample documents on the client, e.g. [{"_id": ObjectId(), "x": i} for i in range(50_000_000)].
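
      As a rough sense of scale for the two scenarios, the arithmetic can be sketched as follows. The figure of 1,000 processes is a made-up example, and the bound is a simple union bound that assumes each process draws its 5-byte value uniformly at random. Note that an oversized batch only produces duplicates when ids sharing a counter value also fall in the same second.

```python
RANDOM_BITS = 5 * 8    # 5-byte per-process random field
COUNTER_BITS = 3 * 8   # 3-byte counter field

# Chance that two given processes drew the same random value.
pair_collision = 2.0 ** -RANDOM_BITS


def collision_upper_bound(processes: int) -> float:
    """Union bound on any two of `processes` sharing a random value."""
    pairs = processes * (processes - 1) // 2
    return min(1.0, pairs * pair_collision)


counter_values = 1 << COUNTER_BITS
sample_size = 50_000_000  # size of the sample batch in the example above

print(f"same random value for one pair: {pair_collision:.2e}")
print(f"bound for 1,000 processes:      {collision_upper_bound(1_000):.2e}")
print(f"counter values per second:      {counter_values:,}")
print(f"sample size / counter range:    {sample_size / counter_values:.2f}")
```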

      In pymongo the fix would be straightforward:

      diff --git a/bson/objectid.py b/bson/objectid.py
      index 970c4e52e..03407f162 100644
      --- a/bson/objectid.py
      +++ b/bson/objectid.py
      @@ -53,6 +53,8 @@ class ObjectId:
       
           _inc = SystemRandom().randint(0, _MAX_COUNTER_VALUE)
           _inc_lock = threading.Lock()
      +    _timestamp = 0
      +    _inc_start = 0
       
           __random = _random_bytes()
       
      @@ -165,12 +167,21 @@ class ObjectId:
       
           def __generate(self) -> None:
               """Generate a new value for this ObjectId."""
      +        timestamp = int(time.time())
               with ObjectId._inc_lock:
                   inc = ObjectId._inc
      -            ObjectId._inc = (inc + 1) % (_MAX_COUNTER_VALUE + 1)
      +            # Error if we would generate a duplicate id; advance the counter only after this check.
      +            if timestamp == ObjectId._timestamp:
      +                if inc == ObjectId._inc_start:
      +                    raise Exception(
      +                        'ObjectId overflow would generate duplicate _id. Cannot generate more than 16,777,216 ids within the same second')
      +            else:
      +                ObjectId._timestamp = timestamp
      +                ObjectId._inc_start = inc
      +            ObjectId._inc = (inc + 1) % (_MAX_COUNTER_VALUE + 1)
       
               # 4 bytes current time, 5 bytes random, 3 bytes inc.
      -        self.__id = _PACK_INT_RANDOM(int(time.time()), ObjectId._random()) + _PACK_INT(inc)[1:4]
      +        self.__id = _PACK_INT_RANDOM(timestamp, ObjectId._random()) + _PACK_INT(inc)[1:4]
       
           def __validate(self, oid: Any) -> None:
               """Validate and use the given id for this ObjectId.
      
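
      The same idea can be tried outside of pymongo with a small standalone sketch. The class name GuardedCounter, the configurable bit width, and the injectable clock are assumptions added so the overflow can be exercised without generating 16 million ids. The sketch checks before advancing the counter, so a refused call leaves the state untouched and every later call in that second is refused as well. It also reads the clock while holding the lock. It does not try to handle a clock that moves backwards.

```python
import threading
import time
from typing import Callable, Optional, Tuple


class GuardedCounter:
    """Issue (second, counter) pairs and refuse to repeat one.

    Hypothetical stand-in for the counter logic; not part of pymongo.
    """

    def __init__(self, start: int = 0, bits: int = 24,
                 clock: Callable[[], float] = time.time) -> None:
        self._modulus = 1 << bits
        self._next = start % self._modulus
        self._clock = clock
        self._lock = threading.Lock()
        self._second: Optional[int] = None  # second currently being served
        self._issued = 0                    # values handed out in that second

    def next(self) -> Tuple[int, int]:
        with self._lock:
            # Read the clock under the lock so the timestamp and the counter
            # state are always updated together.
            second = int(self._clock())
            if second != self._second:
                self._second = second
                self._issued = 0
            if self._issued == self._modulus:
                # Check before advancing: the state is untouched, so every
                # later call in this second is refused too.
                raise OverflowError(
                    f"counter exhausted: {self._modulus:,} values already "
                    f"issued in second {second}"
                )
            value = self._next
            self._next = (value + 1) % self._modulus
            self._issued += 1
            return second, value


# Usage with a 2-bit counter and a fake clock (both assumptions for the demo).
fake_now = [1_700_000_000.0]
counter = GuardedCounter(start=2, bits=2, clock=lambda: fake_now[0])
pairs = [counter.next() for _ in range(4)]  # counters 2, 3, 0, 1

try:
    counter.next()  # fifth id in the same second
except OverflowError as exc:
    print("refused:", exc)

fake_now[0] += 1       # the next second begins
print(counter.next())  # generation resumes
```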

              Assignee:
              Unassigned
              Reporter:
              Shane Harvey

                Created:
                Updated: