Uploaded image for project: 'Ruby Driver'
  1. Ruby Driver
  2. RUBY-1117

Driver does not compare message response id with the initial request id when deserializing a response message

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • 2.2.6
    • Affects Version/s: 2.2.5, bson-4.1.1
    • Component/s: BSON, Wire Protocol
    • Labels:
      None
    • Environment:
      Used with Mongoid 5.1.3 in a Rails 4.2 app, MongoDB production server version is 3.0.6 but I also reproduced the bug in 3.2.7.

      Ok I'm sorry in advance but you're not gonna like this one. We recently upgraded from Mongoid 4 to Mongoid 5.1.3 / Mongo 2.2.5 after waiting a bit of time for bug to be fixed
      But it seems we found a big one. Unfortunately, after 3 days of investigation I still couldn't find a reliable way to reproduce it, but it happening in production for us about once per day and we also got it a few times on our CI during tests.
      Here it is:
      At some point, Mongoid starts to return shit from requests, and once this starts it keeps going this way for a while. For example in our case we had requests like: Domain.find(id) returning sometimes:

      1. An entirely different model (like a User)
      2. A Domain, but the wrong one (not the same it and attributes)
      3. A Domain object but filled with attributes from another record which has nothing to do (for example in our case we got a domain populated with the attributes from a category record in the database)
      4. Or a Mongoid::RecordNotFound exception, whereas the record exists.
        After a lot of log digging in log, we noticed that Mongoid (we're using mongoid but the bug is most probably in mongo) seemed to be returning the results from the previous request instead of the current one.

      Here is a simplified extract of one of our workers (sidekiq mono-threaded):

      13:57:41 RealtimeWorker: start
      13:57:41 MONGODB | find | STARTED | {"find"=>"domains", "filter"=>{"_id"=>BSON::ObjectId('5721cacc7765624ccd000822')}, "sort"=>{"_id"=>1}}
      13:57:41 MONGODB | find | SUCCEEDED | 0.000484421s
      13:57:41 RealtimeWorker: done: 0.006 sec
      13:57:41 Surveys::RealtimeSenderWorker: start
      13:57:41 MONGODB | find | STARTED | {"find"=>"domains", "filter"=>{"_id"=>BSON::ObjectId('4dccfdd42f1a692a11000001')}, "sort"=>{"_id"=>1}}
      13:57:41 MONGODB | find | SUCCEEDED | 0.0005157159999999999s
      13:57:41 MONGODB | find | STARTED | {"find"=>"domain1#interventions", "filter"=>{"_id"=>BSON::ObjectId('575c444c13047d55c5267bfc')}}
      13:57:41 MONGODB | find | SUCCEEDED | 0.001393829s
      13:57:41 Surveys::RealtimeSenderWorker: done: 0.01 sec
      

      Here we can see a first worker which run and fetch a domain (called 'domain1'). Then another worker run and fetch a domain (called 'domain2') but actually here, Mongo returned the data from the first query ('domain1') which leads to the second worker executing on the wrong domain! (we can see domain1 in the collection name of the next requests, which is sharded by domain) The next worker to run, will then get an Intevention instead of a Domain! and raise. And this goes on and on. After a few seconds (a few hundreds queries) it usually recovers in production.

      I also reproduced this in test (it sometimes does that) and in the test when it starts it never recovers and requests keeps returning shit (seems to always be the last query's response). But I got a few interesting exception in test which may help.

      The first exception I often get before the client return shit is this one:

      ArgumentError:
        negative length -8 given
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/socket.rb:172:in `read'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/socket.rb:172:in `read_from_socket'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/socket.rb:123:in `block in read'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/socket.rb:191:in `handle_errors'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/socket.rb:122:in `read'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/protocol/message.rb:121:in `deserialize'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/server/connectable.rb:107:in `block in read'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/server/connectable.rb:91:in `ensure_connected'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/server/connectable.rb:106:in `read'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/server/connection.rb:158:in `deliver'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/server/connection.rb:104:in `block in dispatch'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/monitoring/publishable.rb:47:in `publish_command'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/server/connection.rb:103:in `dispatch'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/operation/write/command/writable.rb:38:in `block in execute'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/server/connection_pool.rb:108:in `with_connection'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/server/context.rb:63:in `with_connection'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/operation/write/command/writable.rb:37:in `execute'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/operation/write/bulk/insert.rb:59:in `execute_write_command'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/operation/write/bulk/bulkable.rb:39:in `execute'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/bulk_write.rb:181:in `insert_one'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/bulk_write.rb:148:in `execute_operation'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/bulk_write.rb:56:in `block (2 levels) in execute'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/bulk_write.rb:55:in `each'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/bulk_write.rb:55:in `block in execute'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/retryable.rb:89:in `call'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/retryable.rb:89:in `write_with_retry'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/bulk_write.rb:53:in `execute'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/collection.rb:385:in `bulk_write'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/collection.rb:363:in `insert_many'
      # /drone/bundler/ruby/2.1.0/gems/mongoid-5.1.3/lib/mongoid/query_cache.rb:168:in `insert_many_with_clear_cache'
      

      Which seem to indicate something like a wire protocole issue as the packet size is lower than expected.

      Then I often get this one:

      NoMethodError:
        undefined method `[]' for nil:NilClass
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/operation/write/bulk/mergable.rb:38:in `block in aggregate_write_errors'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/operation/write/bulk/mergable.rb:37:in `each'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/operation/write/bulk/mergable.rb:37:in `reduce'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/operation/write/bulk/mergable.rb:37:in `aggregate_write_errors'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/bulk_write/result_combiner.rb:103:in `combine_write_errors!'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/bulk_write/result_combiner.rb:98:in `combine_errors!'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/bulk_write/result_combiner.rb:58:in `combine!'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/bulk_write.rb:148:in `execute_operation'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/bulk_write.rb:56:in `block (2 levels) in execute'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/bulk_write.rb:55:in `each'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/bulk_write.rb:55:in `block in execute'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/retryable.rb:89:in `call'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/retryable.rb:89:in `write_with_retry'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/bulk_write.rb:53:in `execute'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/collection.rb:385:in `bulk_write'
      # /drone/bundler/ruby/2.1.0/gems/mongo-2.2.5/lib/mongo/collection.rb:363:in `insert_many'
      # /drone/bundler/ruby/2.1.0/gems/mongoid-5.1.3/lib/mongoid/query_cache.rb:168:in `insert_many_with_clear_cache'
      

      Which I haven't looked at in detail yet.
      I'm sorry I can't provide a way to reproduce the issue, this is the best I've been able to gather so far, if anyone have any idea of what change in the driver could be causing that, or if you have tips on how I can try to reproduce this, I'm all ears.

      In the meantime I've added some guard clauses around our worker to make sure that the domain we get is the correct one and raise otherwise. This should allow us to keep the bug in production while preventing any business damage on our side.

            Assignee:
            emily.stolfo Emily Stolfo
            Reporter:
            bigbourin@gmail.com Adrien Jarthon
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Created:
              Updated:
              Resolved: