[SERVER-24803] Mongo 3.0.12 WiredTiger Node Crashed After Massive Rise In Connections Created: 26/Jun/16  Updated: 20/Sep/16  Resolved: 27/Jun/16

Status: Closed
Project: Core Server
Component/s: Concurrency, Replication, WiredTiger
Affects Version/s: 3.0.12
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Avraham Kalvo Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File connections.png     File mongodb_public_media.log.gz     File wix_rs_conf.out     File wix_rs_status.out    
Participants:

 Description   

Dear Mongo Analyst,

I'm filing this issue with your team because we couldn't find a root cause that would explain how, last Friday, a newly upgraded replica set node
(MongoDB 3.0.12, WiredTiger) crashed due to what seems to be a sharp rise in connections, which in turn apparently generated high wait times on lock acquisition.

The above started at 12:00PM on Friday the 24th, as can be seen in the attached log file,
and was only "resolved" once this host had to be restarted, after the entire application experienced unavailability due to denial of service from this node (memory exhaustion).

According to the application team, as far as they can tell, no significant processes were running at the time, and no configuration changes (connection pool settings, etc.) were made.

Also attached are the replica set configuration and its current status.

Note that the hosts generating the high rise in connections are the application hosts rather than the replica set nodes.
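
For context, the connection counts and lock wait queues described above are reported by the server's serverStatus command; the following is a minimal, illustrative sketch of polling them, assuming a legacy 2.x Java MongoDB driver and a hypothetical host name (this is not how the attached data was collected):

import com.mongodb.CommandResult;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;
import com.mongodb.ServerAddress;

public class ConnStatus {
    public static void main(String[] args) throws Exception {
        // Hypothetical host name: replace with the affected replica set member.
        MongoClient client = new MongoClient(new ServerAddress("mongo-node-1", 27017));
        try {
            // serverStatus reports, among other things, current/available connection
            // counts and the global lock queue lengths.
            CommandResult status = client.getDB("admin").command("serverStatus");
            DBObject connections = (DBObject) status.get("connections");
            DBObject globalLock = (DBObject) status.get("globalLock");
            System.out.println("connections: " + connections);
            System.out.println("globalLock:  " + globalLock);
        } finally {
            client.close();
        }
    }
}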



 Comments   
Comment by Avraham Kalvo [ 20/Sep/16 ]

Thanks Ramon,
You may close this ticket for now.

Comment by Ramon Fernandez Marina [ 27/Jun/16 ]

After examining the information you sent, what I see is that starting around 12pm on Jun 24 (as you pointed out) there's a huge spike in connections to this node:

Here's a list of how many connections came from which IP addresses:

 #conn  IP
   1    10.14.0.52
   1    10.14.0.82
   1    10.15.4.156
   1    10.15.4.210
   1    172.16.144.72
   1    172.16.209.141
   1    172.16.213.56
   1    172.16.222.238
   1    172.16.222.82
   4    172.16.213.45
  22    10.14.0.51
  22    10.14.0.54
  87    10.12.111.1
  90    10.14.6.10
  91    10.14.0.163
  92    10.14.0.77
  92    10.15.2.22
  92    10.15.4.22
  93    172.16.144.217
  94    172.16.144.82
  94    172.16.209.78
 106    172.16.210.96
 320    172.16.144.83
2737    172.16.222.226
2751    172.16.210.48
2779    172.16.210.253
2789    172.16.144.97
2823    172.16.209.66
2866    172.16.144.90

Seeing how none of these IP addresses correspond to other nodes in your replica set, my only conclusion is that they came from application nodes that, for some reason, decided to connect to this node between 12:00pm and 12:35pm, overwhelming it.
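
(For reference, per-IP counts like the ones above can be reproduced from the mongod log itself; here is a minimal sketch, assuming the usual 3.0-era "connection accepted from <ip>:<port>" NETWORK log lines and a hypothetical path for the uncompressed log:)

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CountConnections {
    // Matches lines such as:
    //   ... I NETWORK [initandlisten] connection accepted from 172.16.210.48:52701 #12345 (500 connections now open)
    private static final Pattern ACCEPTED =
            Pattern.compile("connection accepted from (\\d+\\.\\d+\\.\\d+\\.\\d+):\\d+");

    public static void main(String[] args) throws IOException {
        Map<String, Integer> perIp = new TreeMap<>();
        // Hypothetical file name; point this at the uploaded mongod log after decompressing it.
        for (String line : Files.readAllLines(Paths.get("mongodb_public_media.log"))) {
            Matcher m = ACCEPTED.matcher(line);
            if (m.find()) {
                perIp.merge(m.group(1), 1, Integer::sum);
            }
        }
        perIp.forEach((ip, count) -> System.out.printf("%5d  %s%n", count, ip));
    }
}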

Since I haven't been able to find evidence of a bug in the server, I'm going to close this ticket: we keep the SERVER project for reporting bugs or suggesting features for the MongoDB server. For MongoDB-related support discussion, please post on the mongodb-user group or on Stack Overflow with the mongodb tag, where your questions will reach a larger audience. See also our Technical Support page for additional support resources.

Regards,
Ramón.

Comment by Avraham Kalvo [ 27/Jun/16 ]

Please update this ticket with the progress status.
I noticed there are several lock-related errors that are resolved in 3.0.12...
(note https://jira.mongodb.org/browse/SERVER-23283)

Thanks,
Avi.

Comment by Avraham Kalvo [ 27/Jun/16 ]

The error observed on the application side is as follows:
com.mongodb.MongoException$Network: Read operation to server 172.16.23.3:27017 failed on database lists
com.wixpress.lists.dao.ItemDao.query(ItemDao.java:509)
com.wixpress.lists.items.QueryItems.perform(QueryItems.java:87)
com.wixpress.lists.WixListsController.operation(WixListsController.java:99)
com.wixpress.lists.WixListsController$$FastClassBySpringCGLIB$$ee24a6e8.invoke(<generated>)
…ecutionlifecycle.ExecutionLifecycleAspect.aroundMethod(ExecutionLifecycleAspect.java:47)
….wixpress.framework.logging.AbstractLogInterceptor.log(AbstractLogInterceptor.java:78)
…om.wixpress.framework.logging.LogInterceptor.intercept(LogInterceptor.java:32)
com.wixpress.lists.items.QueryItems$$EnhancerBySpringCGLIB$$9af7b58c.operation(<generated>)
…press.framework.web.filter.AcceptHeaderFilter.doFilter(AcceptHeaderFilter.java:66)
caused by java.net.SocketTimeoutException: Read timed out
org.bson.io.Bits.readFully(Bits.java:73)
org.bson.io.Bits.readFully(Bits.java:50)
org.bson.io.Bits.readFully(Bits.java:37)
com.wixpress.lists.dao.ItemDao.query(ItemDao.java:509)
com.wixpress.lists.items.QueryItems.perform(QueryItems.java:87)
com.wixpress.lists.WixListsController.operation(WixListsController.java:99)
com.wixpress.lists.WixListsController$$FastClassBySpringCGLIB$$ee24a6e8.invoke(<generated>)
…ecutionlifecycle.ExecutionLifecycleAspect.aroundMethod(ExecutionLifecycleAspect.java:47)
….wixpress.framework.logging.AbstractLogInterceptor.log(AbstractLogInterceptor.java:78)
…om.wixpress.framework.logging.LogInterceptor.intercept(LogInterceptor.java:32)
com.wixpress.lists.items.QueryItems$$EnhancerBySpringCGLIB$$9af7b58c.operation(<generated>)
…press.framework.web.filter.AcceptHeaderFilter.doFilter(AcceptHeaderFilter.java:66)
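
(The java.net.SocketTimeoutException above appears to be the driver-side socket read timeout firing while the overwhelmed node was unresponsive. For reference, a minimal sketch of where the relevant pool and timeout knobs live in the legacy Java driver, with purely hypothetical values:)

import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.ServerAddress;

public class ClientConfig {
    public static void main(String[] args) {
        // Hypothetical values for illustration only; the real settings live in the
        // application's own configuration.
        MongoClientOptions options = MongoClientOptions.builder()
                .connectionsPerHost(100)  // cap on pooled connections per mongod host
                .maxWaitTime(10000)       // ms to wait for a connection from the pool
                .connectTimeout(10000)    // ms to establish a connection
                .socketTimeout(30000)     // ms for a socket read; 0 (the default) means
                                          // reads can hang on an unresponsive node
                .build();

        MongoClient client = new MongoClient(
                new ServerAddress("172.16.23.3", 27017), options);
        client.close();
    }
}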

Comment by Avraham Kalvo [ 26/Jun/16 ]

Thanks Ramón,
Sorry, I was sure the files were already at your end;
I probably had to upload them again after I changed the ticket classification from bug to user story.

Please find them enclosed.

Thanks,
Avi K

Comment by Ramon Fernandez Marina [ 26/Jun/16 ]

Sorry you had issues with MongoDB, avrahamk. We'll wait for the logs to be uploaded to investigate whether these issues were caused by a bug or by a lack of machine resources.

Regards,
Ramón.
