[SERVER-24514] Global IX Lock for more than 4 minutes on Mongo 3.0.11 with wiredTiger Created: 10/Jun/16  Updated: 21/Jun/17  Resolved: 24/May/17

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.0.11
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Abhishek Amberkar Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2016-04-27 at 9.37.40 AM.png     PNG File Screen Shot 2016-04-27 at 9.37.47 AM.png     PNG File Screen Shot 2016-04-27 at 9.37.56 AM.png     PNG File Screen Shot 2016-04-27 at 9.38.10 AM.png     Text File iostat_with_wtcache_100Gb.log     Text File perf_report2.txt     Text File ss_with_wtcache_100Gb.log     PNG File upgraded_to_3.0.12.png    
Issue Links:
Related
is related to SERVER-24824 Mongo 3.0.12 with MMAPv1 can't serve ... Closed

 Description   

Hello All,

The setup
I am running a replica set in production without sharding. All nodes are running the latest stable MongoDB 2.6, except one hidden node that runs MongoDB 3.0 with WiredTiger enabled.

The data
I have around 4TB worth of data on each node (MMAPv1), with close to 7000 databases. The hidden node has 2TB of data because of WiredTiger.

The plan
I decided to upgrade to 3.2, and as an intermediate step I have to upgrade to 3.0 first. To start with that, I added the aforementioned hidden member to the existing replica set and began sending production-like read query traffic to this node to check whether it could withstand that much load. I did this for over a week.

The plan was to roll out 3.0 on all secondaries if latencies and RPS stayed close to the production pattern.

The observation
As expected, WiredTiger ended up creating a huge number of files, close to 600k, with the number of open files at any given time close to 150k. The node could handle up to 4k queries per second, but I saw traffic drop from 4k RPS to 0 (zero) every 4 to 5 minutes. This behaviour was observed consistently for over a week. At the same time, CPU/memory/disk were doing just fine.

In the process of understanding what exactly is happening, I ran db.currentOp() on that node, where I found the following.

rs01:SECONDARY> db.currentOp({"secs_running" : { $exists: true}})
{
    "inprog" : [
        {
           "desc" : "conn2640",
           "threadId" : "0xa2a763ba0",
           "connectionId" : 2640,
           "opid" : 15958259,
           "active" : true,
           "secs_running" : 255,
           "microsecs_running" : NumberLong(255666599),
           "op" : "query",
           "ns" : "<db_name>.<collection_name>",
           "query" : {
               "$orderby" : { "placeholder" : NumberLong(-1) },
               "$query" : {
                   "placeholder" : { "$in" : [ "placeholder" ] },
                   "placeholder" : "placeholder",
                   "$or" : [
                       {
                         "placeholder" : {
                           "$gt" : ISODate("2016-03-08T05:54:35.977Z")
                         }
                       }
                   ],
                   "placeholder" : "placeholder"
               }
           },
           "client" : "<some_ip>:34210",
           "numYields" : 1,
           "locks" : { "Global" : "r" },
           "waitingForLock" : true,
           "lockStats" : {
               "Global" : {
                   "acquireCount" : { "r" : NumberLong(3) },
                   "acquireWaitCount" : { "r" : NumberLong(1) },
                   "timeAcquiringMicros" : { "r" : NumberLong(255561934) }
               },
               "Database" : { "acquireCount" : { "r" : NumberLong(1) } },
               "Collection" : { "acquireCount" : { "r" : NumberLong(1) } }
           }
        },
        {
           "desc" : "repl writer worker 5",
           "threadId" : "0x15738ca80",
           "opid" : 15958268,
           "active" : true,
           "secs_running" : 255,
           "microsecs_running" : NumberLong(255634829),
           "op" : "none",
           "ns" : "<db_name>.<collection_name>",
           "query" : { },
           "numYields" : 0,
           "locks" : {
                   "Global" : "w",
                   "Database" : "w",
                   "Collection" : "w"
           },
           "waitingForLock" : false,
           "lockStats" : {
              "Global" : { "acquireCount" : { "w" : NumberLong(1) } },
              "Database" : { "acquireCount" : { "w" : NumberLong(1) } },
              "Collection" : { "acquireCount" : { "w" : NumberLong(1) } }
           }
        }
    ]
}

Here you can see two operations "in progress". The first is a query that has been "waiting for lock" for a whopping 255 seconds, while the other is a "repl writer worker" thread, also running for 255 seconds and holding the Global, Database and Collection locks in "w" (Intent Exclusive) mode.

From the docs at https://docs.mongodb.org/v3.0/faq/concurrency/#how-does-concurrency-affect-secondaries I understand that:

In replication, MongoDB does not apply writes serially to secondaries. Secondaries collect oplog entries in batches and then apply those batches in parallel. Secondaries do not allow reads while applying the write operations, and apply write operations in the order that they appear in the oplog.

That explains why the lock is present, but I have yet to understand why it is taking more than 4 minutes to finish.
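
A minimal sketch of the kind of shell loop that can be used to watch the read queue build up while a batch is applied (run directly against the secondary; the 60-second threshold is arbitrary):

    // Poll the global lock queue and any long-running operations once a second.
    while (true) {
        var gl = db.serverStatus().globalLock;
        print(new Date().toISOString(),
              "queuedReaders:", gl.currentQueue.readers,
              "queuedWriters:", gl.currentQueue.writers);
        db.currentOp({"secs_running": {$gt: 60}}).inprog.forEach(function (op) {
            print("  long-running:", op.desc, op.op, op.secs_running + "s",
                  "waitingForLock:", op.waitingForLock);
        });
        sleep(1000);
    }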

I would appreciate it if you could point me in the right direction to understand what exactly is happening here.

Graphs attached.



 Comments   
Comment by Kelsey Schubert [ 24/May/17 ]

Hi abhishek.amberkar,

I've posted a suggested workaround on SERVER-23798, which may help you upgrade to MongoDB 3.2 and later.

Since this issue has likely been addressed in MongoDB 3.2, I'm going to resolve this ticket. Unfortunately, it is not feasible to backport these improvements to MongoDB 3.0.

For support upgrading, please see our Technical Support page.

Thank you,
Thomas

Comment by Abhishek Amberkar [ 26/Feb/17 ]

Hi Thomas,

As mentioned in the description, the goal is to move to 3.2, but at present we are running 2.6, and per the docs one cannot jump directly from 2.6 to 3.2 without taking the intermediate 3.0.x step.

Comment by Kelsey Schubert [ 21/Feb/17 ]

Hi abhishek.amberkar,

Sorry for letting this ticket slip through the cracks. We've made significant improvements to WiredTiger in MongoDB 3.2. Would you please let us know if upgrading to MongoDB 3.2 resolves the issue?

Thank you,
Thomas

Comment by Abhishek Amberkar [ 24/Aug/16 ]

Hello Thomas,

I am using flashback to replay traffic from a production secondary. The read traffic is distributed across all the secondaries except the one running 3.0.11.

On the source node (2.6.x) I use tcpdump + socat to send traffic to the target node:

tcpdump -i eth0 -w - 'dst port 27017' | socat stdio tcp4-connect:<TARGET_NODE_IP>:<TARGET_NODE_PORT>

On the target node (3.0.11) I use pcap_converter + flashback to replay the traffic:

socat tcp4-listen:<TARGET_NODE_PORT>,fork,reuseaddr stdio | ./pcap_converter -f=- 2> ./pcap_conv.out | ./flashback_stdin -style=real -ops_filename=- -url=mongodb://<TARGET_NODE_IP>:27017/?connect=direct

Comment by Kelsey Schubert [ 19/Aug/16 ]

Hi abhishek.amberkar,

Unfortunately, we have not yet determined the cause of this behavior. So we can continue to investigate, would you be able to provide a script that generates a workload that reproduces this issue?

A reproduction script would greatly assist our investigation, and we would be able to use it to benchmark improvements in subsequent versions.

Thank you for your help,
Thomas

Comment by Abhishek Amberkar [ 20/Jun/16 ]

I have also observed exactly the same pattern with the default WT cache size.

Comment by Abhishek Amberkar [ 20/Jun/16 ]

Hello Ramón,

Attached are the iostat and serverStatus outputs, as requested.

Both /dev/xvdb and /dev/xvdcb form a /dev/bcache0 device, which I am using as a data volume. Ignore /dev/xvdca, which is not being used.
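
For completeness, the block-device stacking can be confirmed with lsblk (a sketch; output omitted here):

    # Show how bcache0 is stacked on the underlying devices
    lsblk -o NAME,SIZE,TYPE,MOUNTPOINT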

Comment by Abhishek Amberkar [ 14/Jun/16 ]

Hello Ramón,

Yes, I am going ahead with the 1st option for now. I have already configured the cache size back to the default and am waiting for the node to catch up with the 18-hour oplog lag. Once that is done, I will generate those files and share them with you. In fact, I have already created those files with the 100 GB cache size, so I will share both the before and after outputs.
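
For reference, the 100 GB cache in the earlier runs corresponds to the standard storage.wiredTiger.engineConfig.cacheSizeGB setting (or the equivalent --wiredTigerCacheSizeGB flag); reverting to the default just means removing it and restarting mongod:

    storage:
      wiredTiger:
        engineConfig:
          cacheSizeGB: 100   # the value used in the earlier test runs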

Comment by Ramon Fernandez Marina [ 14/Jun/16 ]

abhishek.amberkar, you're right that you need to upgrade to 3.0 before you can move to 3.2. Thanks for the additional screenshot, but unless you send us the data requested above there's not much more we can do to investigate this behavior. Have you considered which option you'd like to pursue? Have you tried setting the WiredTiger cache size to the default?

Thanks,
Ramón.

Comment by Abhishek Amberkar [ 14/Jun/16 ]

Hello Ramón,

I have added one more screenshot to show you the behavior of the cache after the 3.0.12 upgrade.

Comment by Abhishek Amberkar [ 14/Jun/16 ]

Hello Ramón,

Just want to update you on the earlier setup. After upgrading to 3.0.12, the problem became much worse. For the first few hours it showed the same behavior as before, which is when I last updated you. Right now I am observing what looks like a deadlock, and the node is answering queries very slowly. Lots of repl writer worker threads are acquiring the Global lock in IX mode (like the one posted in the description above), and the oplog is lagging by 18 hours.
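
A quick way to quantify that lag from the shell (a minimal sketch):

    rs.printSlaveReplicationInfo()   // per-member lag behind the latest oplog entry
    db.printReplicationInfo()        // oplog size and time window on this node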

Comment by Abhishek Amberkar [ 14/Jun/16 ]

Hello Ramón,

Thank you. I will get back with the details required.

As far as the move to 3.2.x is concerned, I understand from the docs that I cannot move directly from 2.6.x to 3.2.x; I have to upgrade to 3.0.x first. Since that is the case, I will have to upgrade all my secondaries to 3.0.x one by one and then do the switch, and repeat the same procedure to get to 3.2.x from there. This way I should be able to do the switch with close to no downtime.
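
For the switch itself, the rough idea (a minimal sketch, assuming all secondaries are already upgraded and healthy) is to step down the current primary so an upgraded member takes over:

    // Run against the current primary.
    rs.status().members.forEach(function (m) {
        print(m.name, m.stateStr, "health:", m.health);   // sanity check before switching
    });
    rs.stepDown(120);   // step down for 120 seconds so another member is elected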

Let me know if you know any better approach which involves no downtime.

Comment by Ramon Fernandez Marina [ 13/Jun/16 ]

Thanks for verifying this abhishek.amberkar. Here are some options we can explore:

  • One option is to collect detailed diagnostic data while the problem is happening. To do this, please run the following commands:

    mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep(1*1000)}" >ss.log &
    iostat -k -t -x 1 >iostat.log &
    

    This will create two files, iostat.log and ss.log, that you can attach to the ticket for us to investigate.

  • Another option is for you to move to 3.2.7. There were a lot of improvements made to WiredTiger to make it the default storage engine in 3.2, and if you plan on using WiredTiger I think you'll benefit greatly from this upgrade. However, if the problem persists, diagnosis should be easier since 3.2 collects diagnostic data automatically – if you reproduce this behavior in 3.2, please upload the contents of the diagnostic.data directory from your dbpath on this ticket and we'll take a look.
  • A third option is to share the data and commands used to reproduce this behavior so we can try to replicate it on our end. I've created a secure upload portal where you can share any data with us privately. If you're interested in exploring this option please let us know when the data upload is finished.

I'd like to add that, from the basic details of your setup, this may be tied to a large WiredTiger cache size. We do not recommend making the cache size larger than the default, as it may cause issues in certain workloads, so the simplest thing to try would be to use the default cache size and, if the problem is still there, explore one of the options above.

Thanks,
Ramón.

Comment by Abhishek Amberkar [ 13/Jun/16 ]

@Ramon,

As suggested, I switched to 3.0.12, but the problem remains the same. I am observing the same patterns in the graph. Let me know if you need any more information.

Thank you.

Comment by Abhishek Amberkar [ 13/Jun/16 ]

@Ramon

Since I am running 2.6.x right now, I can't directly jump to 3.2.x. But I can definitely switch to 3.0.12. I will update here if the same behavior is observed.

Thank you.

Comment by Ramon Fernandez Marina [ 10/Jun/16 ]

Thanks for the detailed report abhishek.amberkar. As Michael mentioned this could be SERVER-22964, which was fixed in 3.0.12.

Could you please upgrade to 3.0.12 and try to reproduce this behavior? Alternatively, you could upgrade to 3.2.7, which is our latest stable release at this time and also contains a fix for SERVER-22964 – and since your goal is to upgrade to 3.2 this could be the best way forward.

Thanks,
Ramón.
