Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-46918

in a sharded db, cursor timeout on a shard while mongos is consuming the cursor on another shard

    • Type: Icon: Improvement Improvement
    • Resolution: Duplicate
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Sharding
    • Labels:
    • Query 2020-06-01, Query 2020-06-15

      Suppose you have a 2 shards db and a sharded collection with a shard key 'origin'. 2.6 M documents.

      a sh.status() gives:

       

              {  "_id" : "test",  "primary" : "sh_0",  "partitioned" : true,  "version" : {  "uuid" : UUID("28b86279-c3e8-4432-b325-28136e353a85"),  "lastMod" : 1 } }        {  "_id" : "test",  "primary" : "sh_0",  "partitioned" : true,  "version" : {  "uuid" : UUID("28b86279-c3e8-4432-b325-28136e353a85"),  "lastMod" : 1 } }                
      
                     test.sh_coll
                              shard key: { "origin" : 1 }
                              unique: false
                              balancing: true
                              chunks:
                                      sh_0 5
                                      sh_1 5
                              { "origin" : { "$minKey" : 1 } } -->> { "origin" : "A000001" } on : sh_1 Timestamp(3, 2)
                               { "origin" : "A000001" } -->> { "origin" : "F000001" } on : sh_1 Timestamp(3, 3)
                               { "origin" : "F000001" } -->> { "origin" : "H050002" } on : sh_1 Timestamp(3, 4)
                               { "origin" : "H050002" } -->> { "origin" : "K000003" } on : sh_1 Timestamp(3, 5)
                               { "origin" : "K000003" } -->> { "origin" : "N" } on : sh_1 Timestamp(3, 6)
                               { "origin" : "N" } -->> { "origin" : "P050000" } on : sh_0 Timestamp(3, 7)
                               { "origin" : "P050000" } -->> { "origin" : "S000001" } on : sh_0 Timestamp(3, 8)
                               { "origin" : "S000001" } -->> { "origin" : "U050002" } on : sh_0 Timestamp(3, 9)
                               { "origin" : "U050002" } -->> { "origin" : "Y045375" } on : sh_0 Timestamp(3, 10)
                               { "origin" : "Y045375" } -->> { "origin" : { "$maxKey" : 1 } } on : sh_0 Timestamp(3, 11)
       

       

      So, sorted by 'origin', half the records are in one shard the the next half is in the other shard.

      Now suppose we want to get all the records, sorted by 'origin' and every 1000 records, make a process that takes 1 seconds... Getting half of the records takes more than 10mn (the default timeout for cursors). 

      If we don't specify a batch_size, we don't even arrive to 1.182 records.

      If we specify a batch size of 1000, we get a "RecordNotFound" error when, after having returned half the records + the records from the first get from the 2nd shard, mongos tries to get more data from the 2nd shard.

      That's what we wanted to show up because whatever the batch_size you specify, you won't be able to get all the records. Using a small batch size is often described as the way to solve cursors timeout. In a multi sharded db it is not the case.

       

      Of course, this example has been crafted to reproduce the issue. It is not a real use case but I have real ones where I have 10 shards, process of each record takes more than 1 second and cursorTimeoutMillis: "7200000" (2 hours) and still get CursorNotFound errors.

      no_cursor_timeout=True is not an option because it can lead to ghost cursors in the db

       

       

            Assignee:
            mihai.andrei@mongodb.com Mihai Andrei
            Reporter:
            rj-10gen@arsynet.com Remi Jolin
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

              Created:
              Updated:
              Resolved: