[SERVER-44867] Aggregate on Secondary fails with CursorKilled Created: 27/Nov/19  Updated: 19/Feb/20  Resolved: 14/Jan/20

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 4.0.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Frank Shimizu Assignee: Carl Champain (Inactive)
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Reaper.png    
Operating System: ALL
Steps To Reproduce:
  • Create Replica Set with 1 Primary, 1 Secondary and 1 Arbiter
  • Create collection with >= 52 million documents
  • Reboot Secondary to clear memory caches
  • Connect to Secondary and run expensive aggregation on the collection
  • The aggregation may fail with the CursorKilled error shown below
  • On re-run, the aggregation often completes correctly, since the memory cache is now partially filled
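The steps above can be sketched in the 4.0-era mongo shell. This is only an illustration: the hostname and collection name are placeholders, and the pipeline stands in for any sufficiently expensive aggregation over cold data.

```javascript
// Connect directly to the rebooted secondary (hostname is a placeholder):
//   mongo "mongodb://secondary.example:27017/db"
rs.slaveOk()  // 4.0-era shell: permit reads on a secondary
db.collection.aggregate(
  [
    // Any expensive pipeline over data that is not yet in the cache:
    { $match: { field1: { $gte: ISODate("2018-01-01T00:00:00Z") } } },
    { $group: { _id: null, count: { $sum: 1 } } }
  ],
  { allowDiskUse: true }
)
// With a cold cache, this may fail with errName: CursorKilled (errCode 237).
```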

 

Participants:

 Description   

I tried searching for existing issues and entries in the release notes but couldn't find anything that addresses our issue; apologies if this is a duplicate.

 

MongoDB version: 4.0.2

Setup: Replica Set: 1 Primary, 1 Secondary, 1 Arbiter; no Sharding

OS: CentOS 7.7.1908

CPUs: 6 (Primary & Secondary)

RAM: 31GB (Primary & Secondary)

Data size: ~100GB

 

Aggregations on a collection with ~52 million documents sometimes fail with errors like this (DB, collection names and pipeline redacted):

2019-11-27T11:35:22.354+0000 I COMMAND [conn25] command db.collection appName: "MongoDB Shell" command: aggregate { aggregate: "collection", pipeline: <PIPELINE>, cursor: {}, lsid: { id: UUID("18d0572d-63e3-46f5-b73e-0b2d21943d6e") }, $clusterTime: { clusterTime: Timestamp(1574854281, 3), signature: { hash: BinData(0, 055F8108EC0EBAF2F63E875B114500BABFDC7FBE), keyId: 6708703671850369025 } }, $readPreference: { mode: "secondaryPreferred" }, $db: "db" } planSummary: COLLSCAN numYields:71159 ok:0 errMsg:"operation was interrupted" errName:CursorKilled errCode:237 reslen:238 locks:{ Global: { acquireCount: { r: 71941 } }, Database: { acquireCount: { r: 71940 } }, Collection: { acquireCount: { r: 71940 } } } protocol:op_msg 230092ms

 

In this example log the error occurred during a COLLSCAN, but it happens just the same with an IXSCAN. It seems to occur much more often when the affected data is not well cached in memory, such as after a server reboot. We tried to reproduce this on the Primary, but so far we can only reproduce it on the Secondary.

 

One generic aggregation that reproduces this for us is the following (all field names redacted):

db.collection.aggregate([
  { "$match" : {
      "field0" : "value0",
      "field1" : {
        "$gte" : ISODate("2018-01-01T00:00:00.000Z"),
        "$lt"  : ISODate("2019-11-26T00:00:00.000Z")
      }
  }},
  { "$project" : {
      "_id" : 0,
      "field1" : 1,
      "field2.field3" : 1
  }},
  { "$group" : {
      "_id" : { "key" : { "field1name" : "$field1" } },
      "field2name" : { "$sum" : "$field2.field3" },
      "count" : { "$sum" : 1 }
  }}
])

 

The only other reference to this that we could find is this post:

https://dba.stackexchange.com/questions/238391/mongodb-return-error-cursor-stage-caused-by-operation-was-interrupted?rq=1

 

Please let me know if I can provide any other info.



 Comments   
Comment by Carl Champain (Inactive) [ 19/Feb/20 ]

Thanks for the follow-up, and we're glad to hear that the upgrade fixed your issue!

Comment by Frank Shimizu [ 19/Feb/20 ]

Just a quick follow-up: we upgraded our cluster from 4.0.2 to 4.0.16. So far I'm unable to reproduce the issue; it looks like it was fixed by the upgrade.

 

Again many thanks for all the help.

Comment by Carl Champain (Inactive) [ 14/Jan/20 ]

frank.shimizu@est.fujitsu.com,

Let's close this ticket as incomplete; please let us know with a comment here whether or not the upgrade worked. If the upgrade doesn't fix your issue, we will reopen this ticket.

Thank you,
Carl

Comment by Frank Shimizu [ 14/Jan/20 ]

Hi Carl,

Thanks a lot for your analysis. I will bring the upgrade up with the team, and I'm fairly confident that we can do it, but it may take some time to find a maintenance window.

Until the upgrade is completed and we can re-test, should we close this issue and, if necessary, create a new one later, or should we keep this one open?

Regards

Frank

Comment by Carl Champain (Inactive) [ 13/Jan/20 ]

Hi frank.shimizu@est.fujitsu.com,

Thank you for the additional details.
We have made several changes to the logical session lifecycle since 4.0.2, so ideally you should upgrade to the latest release, 4.0.14, which contains improvements in this area. However, if the same issue appears after upgrading, we would need to collect the logs and the diagnostic data again.

Let us know how it goes,
Carl

Comment by Frank Shimizu [ 08/Jan/20 ]

Thanks for your patience. I set both verbosity levels to 3 and reproduced the effect. The relevant portion of the log file is uploaded to the secure portal - SHA256SUM is:

759ee3329dba602d87820fc70b4c8fa0a0d55b8e8c60899ba596a04d7e5e150c mongod-verbose.log.gz
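A checksum like this can be verified after download along the following lines; the filename and payload here are stand-ins, not the real upload.

```shell
# Sketch: verify a downloaded archive against its published SHA-256 checksum.
# The real file lives on the secure portal; a stand-in payload is used here.
printf 'stand-in payload' > /tmp/mongod.log.example.gz
actual=$(sha256sum /tmp/mongod.log.example.gz | awk '{print $1}')
expected="$actual"   # in practice, paste the checksum published in the ticket
if [ "$actual" = "$expected" ]; then echo "checksum OK"; else echo "checksum MISMATCH"; fi
```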

 

The aggregation starts at 2020-01-08T11:44:23.124+0000 and the error can be seen at 2020-01-08T11:47:47.509+0000.

 

Thanks for all your efforts!

Comment by Frank Shimizu [ 07/Jan/20 ]

Hi Carl, again sorry for the long delay, I was on vacation over the holidays. I can provide the requested log, please allow me some time because of other tasks.

Comment by Carl Champain (Inactive) [ 20/Dec/19 ]

Hi frank.shimizu@est.fujitsu.com,

We'd like to collect more information; could you please increase the log verbosity for query and command by running the following mongo shell commands:

  • db.setLogLevel(3, "query")
  • db.setLogLevel(3, "command")

Then reproduce the issue, share the mongod.log file (covering the verbose-logged reproduction only) with us in the secure upload portal, and go back to the logging verbosity you were using. 
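Taken together, the verbosity change and its later reversal look like this in the mongo shell (assuming the default verbosity of 0 beforehand):

```javascript
db.setLogLevel(3, "query")    // verbose query logging for the reproduction
db.setLogLevel(3, "command")  // verbose command logging
// ... reproduce the failing aggregation, then capture mongod.log ...
db.setLogLevel(0, "query")    // restore the default verbosity
db.setLogLevel(0, "command")
```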

Thanks!
Carl

Comment by Frank Shimizu [ 17/Dec/19 ]

Sorry for the long delay. The compressed log file is now uploaded. Please be aware that it uncompresses to ~4GB. The SHA256 is:

517E36CDE3FADD8889183E2310BE9B2B9425C647FFA8115D3D9512F707D8D69E  mongod.log.redacted.gz

Comment by Carl Champain (Inactive) [ 12/Dec/19 ]

frank.shimizu@est.fujitsu.com,

That's totally acceptable. Thank you!

Comment by Frank Shimizu [ 12/Dec/19 ]

I got approval to upload the complete log file, but it would be necessary to replace all contained personal data, such as email addresses and login names, with anonymized versions before uploading. Would that be acceptable?

Comment by Frank Shimizu [ 10/Dec/19 ]

I understand. Please allow me some time to get this checked and approved internally.

Comment by Carl Champain (Inactive) [ 10/Dec/19 ]

frank.shimizu@est.fujitsu.com,

Thanks for providing more details!
Unfortunately, we need the full log to diagnose the issue. Would that be possible?

Comment by Frank Shimizu [ 09/Dec/19 ]

Amazing, thanks! I've uploaded the diagnostic data. SHA256 checksum is:
D918B36902B545BA55AA59FC4C8D100909D90660BDAF7C033FE16A7DB22C616E diagnostic.data.tar.gz

I've also uploaded a short excerpt of the server log of the time when the error was reproduced (mongo-error-log.txt). Since the full log file contains queries and reveals all kinds of data, I would have to get uploading this checked and approved first. Please let me know if this excerpt is enough or if the full log file is needed.

Comment by Carl Champain (Inactive) [ 09/Dec/19 ]

frank.shimizu@est.fujitsu.com,

Sure! I've created a secure upload portal for you. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Thank you,
Carl
 

Comment by Frank Shimizu [ 09/Dec/19 ]

Hi @carl.champain

Thanks for your reply and sorry for my late response, I was unavailable last week.

I have reproduced the error and collected the log file (265MB compressed) as well as diagnostic data (191MB compressed) shortly after. We consider both to be sensitive/protected data since this is a production system with customer data. Is it possible to upload these in a non-public way?

Regards
Frank

Comment by Carl Champain (Inactive) [ 02/Dec/19 ]

Hi frank.shimizu@est.fujitsu.com,

Thanks for the report.
To help us understand what is happening, can you please:

  1. Provide the mongod.log file for the node where the operation has failed?
  2. Archive (tar or zip) the $dbpath/diagnostic.data directory for that node (the contents are described here) and attach it to this ticket?

Kind regards,
Carl

Comment by Frank Shimizu [ 27/Nov/19 ]

I'm sorry, it seems I messed up the formatting of the log message and aggregation command, but can't seem to edit the description.

Generated at Thu Feb 08 05:07:12 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.