[SERVER-50769] server restarted after expr:{"expr":"_currentApplyOps.getArrayLength() > 0","file":"src/mongo/db/pipeline/document_source_change_stream_transform.cpp","line":535}} Created: 04/Sep/20  Updated: 29/Oct/23  Resolved: 12/Jan/21

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 4.4.0
Fix Version/s: 4.9.0, 4.2.12, 4.4.4

Type: Bug Priority: Blocker - P1
Reporter: jeason chan Assignee: Justin Seyster
Resolution: Fixed Votes: 0
Labels: qexec-team
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongod-data-1-shard2.log     File mongod-shard2.log    
Issue Links:
Backports
Problem/Incident
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4, v4.2
Sprint: Query 2020-11-30, Query 2020-12-14, Query 2020-12-28, Query 2021-01-11, Query 2021-01-25
Participants:
Case:

 Description   

We used an application named monstache with mongo-go-driver v1.4.1 to watch change streams from replica ret with mongods of  v4.4.0. And the whole system works well for a few weeks but the primary mongod restarted frequently after we started to use transaction with java-go-driver.

We tried the following conditions to check the cause:

  1. when the server restarted frequently, we check the connection status between host:post and not this
  2. We checked and ensured that the wach apis of mongo-go-driver are used correctly
  3. The mongo processes are started with systemd and we add the remommended ulimits in the systemd service.  And we created 500 mongoClient instances and ran our transaction unit tests with monstache running but the mongo didn't restart. 

However sometimes the mongod restarted all of a sudden.

 

Below are the logs and the key words are "SERVER RESTARTED" and the "ctx":"connXXX" which is generated by mongo-go-driver used by monstache and the printed BACKTRACE and Invariant failure.

mongod-shard2.log

mongod-data-1-shard2.log



 Comments   
Comment by Githook User [ 12/Jan/21 ]

Author:

{'name': 'Justin Seyster', 'email': 'justin.seyster@mongodb.com', 'username': 'jseyster'}

Message: SERVER-50769 Change streams no longer balk at empty applyOps

(cherry picked from commit e9122ba5078eca4fbc7ea858221dba6af00e90a9)
Branch: v4.2
https://github.com/mongodb/mongo/commit/5593fd8e33b60c75802edab304e23998fa0ce8a5

Comment by Githook User [ 12/Jan/21 ]

Author:

{'name': 'Justin Seyster', 'email': 'justin.seyster@mongodb.com', 'username': 'jseyster'}

Message: SERVER-50769 Change streams no longer balk at empty applyOps

(cherry picked from commit e9122ba5078eca4fbc7ea858221dba6af00e90a9)
Branch: v4.4
https://github.com/mongodb/mongo/commit/41db39acbe673041c487daa83bb9d14791016b93

Comment by Githook User [ 12/Jan/21 ]

Author:

{'name': 'Justin Seyster', 'email': 'justin.seyster@mongodb.com', 'username': 'jseyster'}

Message: SERVER-50769 Change streams no longer balk at empty applyOps
Branch: master
https://github.com/mongodb/mongo/commit/e9122ba5078eca4fbc7ea858221dba6af00e90a9

Comment by Bernard Gorman [ 08/Sep/20 ]

Hey jeasonchanupup@gmail.com, thank you for bringing this issue to our attention. If you could answer the following questions, it will really help our investigation:

1. In your original description, you say that you're using monstache to "watch change streams from a replica set", but you are running in a sharded cluster. Is monstache opening a change stream through the mongoS, or is it opening change streams directly on each shard?

2. What kind of change stream (single-collection, whole-database, or whole-cluster) are you running? From the documentation here it seems that monstache "defaults to opening the change stream against the entire deployment." Did you configure it to run against a single database or single collection instead? If you have any logs that show the actual $changeStream command, please post them here.

3. Could you provide some examples of the kind of transactions your application is running on this cluster?

4. Has your oplog rolled over yet? If not, then the oplog entries which caused this issue should still be present - it would be very helpful if we could examine them. From the logs, it appears that the earliest incident occurred at 2020-09-03T10:00:46.229+08:00:

❯ grep -m 1 Invariant mongod-data-1-shard2.log
{"t":{"$date":"2020-09-03T10:00:46.229+08:00"},"s":"F",  "c":"-",        "id":23079,   "ctx":"conn855","msg":"Invariant failure","attr":{"expr":"_currentApplyOps.getArrayLength() > 0","file":"src/mongo/db/pipeline/document_source_change_stream_transform.cpp","line":535}}

Assuming that the change stream was keeping pace with the rate of oplog generation, it is very likely that the oplog entry which caused this issue was written shortly before this time. The following command will use the mongo shell to connect to the host and will dump 50 seconds of its oplog around the time of the incident to a logfile (please note that you may need to add appropriate authentication credentials to this command):

mongo --host <host> --port <port> --eval 'rs.slaveOk(); db.getSiblingDB("local").oplog.rs.find({ts: {$gte: Timestamp(1599098400, 1), $lte: Timestamp(1599098450, 1)}}).forEach(printjson)' > SERVER-50769_oplog.log

Please do not post this log on the ticket, since it may contain some sensitive information. We have created a secure upload portal for you at this link; please upload the oplog file there instead. Files uploaded to this portal are visible only to MongoDB employees, and are routinely deleted after some time.

Thanks!
Bernard

Comment by jeason chan [ 08/Sep/20 ]

Hi, if more details are required, leave a messga and I will reply as soon as possible. Thx.

Comment by jeason chan [ 04/Sep/20 ]

Here are the resource limit params in the systemd service:

#file size
LimitFSIZE=infinity

  1. Restart type
    Restart=on-failure
  2. cpu time
    LimitCPU=infinity
  3. virtual memory size
    LimitAS=infinity
  4. open files
    LimitNOFILE=1640000
  5. processes/threads
    LimitNPROC=164000
  6. locked memory
    LimitMEMLOCK=infinity
  7. total threads (user+kernel)
    TasksMax=infinity
    TasksAccounting=false

And our hosts have 64 cores and 64gb ram and 1T ssd

Generated at Thu Feb 08 05:23:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.