Core Server / SERVER-57863

CursorNotFound since we doubled the number of shards

    • ALL
    • Sprint: Query Execution 2021-07-12, Query Execution 2021-07-26, QE 2021-08-09

      Since we added 4 new shards to our existing 5-shard cluster, one of our database clients intermittently receives CursorNotFound errors.
      cursorTimeoutMillis is set to 2 hours (7200000 ms) on all MongoDB nodes.
      The client uses the Perl driver without specifying a batch size.
      The client iterates over all documents whose clickCount field is of $type 3 (object); the query is covered by an index.
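
      For reference, a minimal shell sketch of how this timeout can be inspected and set per node, using the standard getParameter/setParameter admin commands (the 7200000 ms value is the one mentioned above):

      // run against the admin database of each mongod/mongos to verify or apply the 2-hour cursor timeout
      db.adminCommand({ getParameter: 1, cursorTimeoutMillis: 1 })
      db.adminCommand({ setParameter: 1, cursorTimeoutMillis: 7200000 })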

      The query in mongo-shell syntax looks like this:

      db.offer.find({}, {"clickCount": 1, "offerId": 1, "_id": 1}).hint("clickCount_1_offerId_1__id_1")

      The index is defined as follows:

      {
      		"v" : 2,
      		"key" : {
      			"clickCount" : 1,
      			"offerId" : 1,
      			"_id" : 1
      		},
      		"name" : "clickCount_1_offerId_1__id_1",
      		"ns" : "offerStore.offer",
      		"partialFilterExpression" : {
      			"clickCount" : {
      				"$type" : 3
      			}
      		},
      		"background" : true
      }
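
      For completeness, a sketch of the createIndex call that would produce such an index definition (reconstructed from the metadata above, not the original command we ran):

      db.offer.createIndex(
          { clickCount: 1, offerId: 1, _id: 1 },
          {
              name: "clickCount_1_offerId_1__id_1",
              partialFilterExpression: { clickCount: { $type: 3 } },
              background: true
          }
      )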
      

      The client application calls getMore constantly (always well within the cursorTimeoutMillis of 2 hours). As the strace of the client shows (the client runs only one query at a time), at 11:29:04 a getMore call succeeded, and the next getMore call, issued one second later, failed with CursorNotFound:

      11:29:04.596296 select(32, NULL, [24], NULL, {tv_sec=1800, tv_usec=0}) = 1 (out [24], left {tv_sec=1799, tv_usec=999997}) <0.000017>
      11:29:04.596362 write(24, "g\0\0\0G@\2254\0\0\0\0\324\7\0\0\0\0\0\0offerStore.$cmd\0\0\0\0\0\377\377\377\377;\0\0\0\22getMore\0\213]\320\275\311>1M\2collection\0\6\0\0\0offer\0\20batchSize\0e\0\0\0\0", 103) = 103 <0.000034>
      11:29:04.596599 select(32, [24], NULL, NULL, {tv_sec=1800, tv_usec=0}) = 1 (in [24], left {tv_sec=1799, tv_usec=999750}) <0.000273>
      11:29:04.596924 read(24, "f&\0\0>\247\t\6G@\2254\1\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0B&\0\0\3cursor\0\254%\0\0\22id\0\213]\320\275\311>1M\2ns\0\21\0\0\0offerStore.offer\0\4nextBatch\0w%\0\0\0030\0[\0\0\0\3clickCount\0\17\0\0\0\0201_30\0\2\0\0\0\0\22offerId\0N\5@\361\220*\332.\2_id\0!\0\0\0d97c8a0eb00c79a3982efa36fb8de9b8\0\0\0031\0[\0\0\0\3clickCount\0\17\0\0\0\0201_30\0\2\0\0\0\0\22offerId\0/\7\340\353\20+\332.\2_id\0!\0\0\0c6e2ef3033c554bd9ce1d80b5dc5eb9f\0\0\0032\0[\0\0\0\3clickCount\0\17\0\0\0\0201_30\0\2\0\0\0\0\22offerId\0\331\35\340\0268+\332.\2_id\0!\0\0\0e2af0314c13f3460f3c13028cbca813b\0\0\0033\0[\0\0\0\3clickCount\0\17\0\0\0\0201_30\0\2\0\0\0\0\22offerId\0[\24\240\356(-\332.\2_id\0!\0\0\0c2460966da93961b94c71ab2ec61cdc8\0\0\0034\0[\0\0\0\3clickCount\0\17\0\0\0\0201_30\0\2\0\0\0\0\22offerId\0\26\7`\r)-\332.\2_id\0!\0\0\0d0ef7338a5c089a018b8a2f766572d88\0\0\0035\0[\0\0\0\3clickCount\0\17\0\0\0\0201_30\0\2\0\0\0\0\22offerId\0\232\4`l)-\332.\2_id\0!\0\0\0c6bd57d356540a93ea8ae80a53032dfa\0\0\0036\0[\0\0\0\3clickCount\0\17\0\0\0\0201_30\0\2\0\0\0\0\22offerId\0009\t v)-\332.\2_id\0!\0\0\0c465c432b3a1b44edb31137a99dae205\0\0\0037\0[\0\0\0\3clickCount\0\17\0\0\0\0201_30\0\2\0\0\0\0\22offerId\0!\1\200\224)-\332.\2_id\0!\0\0\0be2a698c3741944e47cebb3d1c19c3ce\0\0\0038\0[\0\0\0\3clickCount\0\17\0\0\0\0201_30\0\2\0\0\0\0\22offerId\0\v\6 \315)-\332.\2_id\0!\0\0\0d102d497965addfc57d344663f5bacb3\0\0\0039\0[\0\0\0\3clickCount\0\17\0\0\0\0201_30\0\2\0\0\0\0\22offerId\0\234\2\300\367)-\332.\2_id\0!\0\0\0cf36ee16b06f7f"..., 372480) = 9830 <0.000034>
      ...
      11:29:05.290051 select(32, NULL, [24], NULL, {tv_sec=1800, tv_usec=0}) = 1 (out [24], left {tv_sec=1799, tv_usec=999996}) <0.000017>
      11:29:05.290112 write(24, "g\0\0\0)m\211K\0\0\0\0\324\7\0\0\0\0\0\0offerStore.$cmd\0\0\0\0\0\377\377\377\377;\0\0\0\22getMore\0\213]\320\275\311>1M\2collection\0\6\0\0\0offer\0\20batchSize\0e\0\0\0\0", 103) = 103 <0.000035>
      11:29:05.290349 select(32, [24], NULL, NULL, {tv_sec=1800, tv_usec=0}) = 1 (in [24], left {tv_sec=1799, tv_usec=947221}) <0.052814>
      11:29:05.343287 read(24, "\5\1\0\0D\265\t\6)m\211K\1\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\341\0\0\0\1ok\0\0\0\0\0\0\0\0\0\2errmsg\0 \0\0\0cursor id 86597569444 not found\0\20code\0+\0\0\0\2codeName\0\17\0\0\0CursorNotFound\0\21operationTime\0\363\6\0\0a\304\311`\3$clusterTime\0X\0\0\0\21clusterTime\0K\10\0\0a\304\311`\3signature\0003\0\0\0\5hash\0\24\0\0\0\0\7)\354*\372\n\355\315\374\231\16\274/K\35`Ml\341h\22keyId\0\376\3\0\0p\215\16`\0\0\0", 372480) = 261 <0.000019>
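
      Decoded, each of these writes is a getMore command against offerStore.offer; in mongo-shell terms it is roughly equivalent to the following (the cursor id below is a placeholder for the 64-bit value visible after "getMore" in the trace; the batchSize of 101 appears to correspond to the little-endian int32 0x65 after "batchSize", and the 9830-byte reply is roughly consistent with a batch of about a hundred ~100-byte documents):

      db.getSiblingDB("offerStore").runCommand({
          getMore: NumberLong("1234567890"),   // placeholder; the real id is the 64-bit value after "getMore" in the trace
          collection: "offer",
          batchSize: 101
      })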
      

      The size of one returned document is about 100 bytes on average. The total index size across all shards is roughly 1 GB.

      We found that the cursors timed out on all nine MongoDB shards within a few seconds of one another, in each case almost exactly two hours (the configured cursorTimeoutMillis) after going idle:

      mongo-offerstore-de07-03.db00.pro05.eu.idealo.com: 2021-06-21T09:21:55.953+0200 I QUERY    [clientcursormon] Cursor id 86275630521 timed out, idle since 2021-06-21T07:21:54.768+0200
      mongo-offerstore-de08-01.db00.pro08.eu.idealo.com: 2021-06-21T09:21:56.124+0200 I QUERY    [clientcursormon] Cursor id 70641344629 timed out, idle since 2021-06-21T07:21:54.432+0200
      mongo-offerstore-de03-03.db00.pro06.eu.idealo.com: 2021-06-21T09:21:57.046+0200 I QUERY    [clientcursormon] Cursor id 88178921267 timed out, idle since 2021-06-21T07:21:54.570+0200
      mongo-offerstore-de02-01.db00.pro08.eu.idealo.com: 2021-06-21T09:21:57.435+0200 I QUERY    [clientcursormon] Cursor id 87481684996 timed out, idle since 2021-06-21T07:21:54.829+0200
      mongo-offerstore-de06-03.db00.pro05.eu.idealo.com: 2021-06-21T09:21:57.494+0200 I QUERY    [clientcursormon] Cursor id 98138927074 timed out, idle since 2021-06-21T07:21:54.914+0200
      mongo-offerstore-de04-01.db00.pro08.eu.idealo.com: 2021-06-21T09:21:57.716+0200 I QUERY    [clientcursormon] Cursor id 86355932061 timed out, idle since 2021-06-21T07:21:54.473+0200
      mongo-offerstore-de01-02.db00.pro06.eu.idealo.com: 2021-06-21T09:21:57.831+0200 I QUERY    [clientcursormon] Cursor id 88372047114 timed out, idle since 2021-06-21T07:21:54.488+0200
      mongo-offerstore-de09-01.db00.pro08.eu.idealo.com: 2021-06-21T09:21:58.005+0200 I QUERY    [clientcursormon] Cursor id 71157191726 timed out, idle since 2021-06-21T07:21:54.704+0200
      mongo-offerstore-de05-01.db00.pro05.eu.idealo.com: 2021-06-21T09:22:00.531+0200 I QUERY    [clientcursormon] Cursor id 65745554296 timed out, idle since 2021-06-21T07:21:57.605+0200
      

      Is it possible that the cursor the client uses is completely disassociated from the cursors between the router (mongos) and the shards? If there is some caching or queueing between the router and the shard nodes, the "backend" cursors might receive getMore commands less often, causing the timeouts.

      Hypothesis: Each shard returns its batch to the router. The router buffers these batches. Batches are of fixed size. Since our cluster has nearly doubled its number of shards, the router now buffers nearly twice as many documents, so it takes almost twice as long before it requests the next batch from a given shard, which may exceed the cursor timeout.
      Does this hypothesis hold?
      If so, the router should forward each getMore it receives from the client to the shards, to signal that the cursor is still in use, even when no shard has to fetch more documents because the router still has them buffered.
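
      A back-of-envelope illustration of this hypothesis, with purely made-up numbers (the shard-side batch size and the client's drain rate below are assumptions for illustration, not measurements from our cluster):

      // Illustrative only: shardBatchDocs and clientDocsPerSecond are assumed values.
      var shardBatchDocs = 60000;      // hypothetical documents buffered per shard-side batch
      var clientDocsPerSecond = 50;    // hypothetical rate at which the client drains the router's buffer
      var cursorTimeoutMillis = 7200000;

      function idleSecondsPerShard(numShards) {
          // If the router buffers one batch from every shard, a given shard sees no further
          // getMore until the client has drained roughly numShards * shardBatchDocs documents.
          return (numShards * shardBatchDocs) / clientDocsPerSecond;
      }

      print(idleSecondsPerShard(5));   // 6000 s  -- still below the 7200 s cursor timeout
      print(idleSecondsPerShard(9));   // 10800 s -- now exceeds the 7200 s cursor timeout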

            Assignee: Mihai Andrei (mihai.andrei@mongodb.com)
            Reporter: Kay Agahd (kay.agahd@idealo.de)
            Votes: 0
            Watchers: 3

              Created:
              Updated:
              Resolved: