[SERVER-12085] Removing journal files/writebacklistener timeout stops cluster traffic Created: 13/Dec/13  Updated: 13/Jan/14  Resolved: 31/Dec/13

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.5.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Tyler Brock Assignee: Davide Italiano
Resolution: Done Votes: 0
Labels: 26qa
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

Apply load to a sharded cluster with randomized CRUD ops.
Observe the load with mongostat.
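
For reference, a minimal sketch of that kind of randomized CRUD load in Python/pymongo. The db1.udrtest namespace is borrowed from the shard logs below; the mongos address and value ranges are assumptions, and this is not the actual harness used for the test:

import random

from pymongo import MongoClient

# Assumed mongos address; the test here ran two mongos, so in practice one
# copy of this loop would run against each.
coll = MongoClient("localhost", 27017).db1.udrtest

while True:
    op = random.choice(("insert", "query", "update", "delete"))
    n = random.randint(0, 100)
    if op == "insert":
        coll.insert_one({"num": n})
    elif op == "query":
        coll.find_one({"num": n})
    elif op == "update":
        coll.update_many({"num": n}, {"$inc": {"num": 1}})
    else:
        # Ranged removes like the ones visible in the shard log below.
        coll.delete_many({"num": {"$lt": n}})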

 Description   

On a sharded cluster with two shards, three config servers, and two mongos, while applying load to the cluster through both mongos nodes, I see throughput drop dramatically (to zero) every couple of minutes.

I noticed this happening at 11:25:49 in the mongostat output for both mongos processes:

insert  query update delete getmore command  vsize    res faults  netIn netOut  conn repl       time 
   779    719    763    801       0    2192  2.51g    52m      0   384k   375k   203  RTR   11:25:47 
   109    110    106    133       0     356  2.51g    52m      0    59k    59k   203  RTR   11:25:48 
     0      0      0      0       0       1  2.51g    52m      0    62b   717b   203  RTR   11:25:49 
   562    549    533    540       0    1780  2.51g    52m      0   292k   298k   203  RTR   11:25:50 

insert  query update delete getmore command  vsize    res faults  netIn netOut  conn repl       time
   705    682    697    737       0    1993  2.49g    37m      0   350k   341k   205  RTR   11:25:48 
   113    116    113    107       0     333  2.49g    37m      0    57k    58k   205  RTR   11:25:49  
     0      0      0      0       0       1  2.49g    37m      0    62b   717b   205  RTR   11:25:50 
   613    597    565    548       0    1876  2.49g    37m      0   308k   322k   205  RTR   11:25:51

In the logs for the shards I see that, at that time, one shard removed old journal files and the writebacklistener timed out:

2013-12-13T11:25:47.800-0500 [journal] old journal file will be removed: /Users/tbrock/Code/QA/QA-431/cluster/s1/journal/j._35
2013-12-13T11:25:47.819-0500 [journal] old journal file will be removed: /Users/tbrock/Code/QA/QA-431/cluster/s1/journal/j._36
2013-12-13T11:25:48.356-0500 [conn552] command admin.$cmd command: { writebacklisten: ObjectId('52a8eae0f4e43082d8561211') } ntoreturn:1 keyUpdates:0  reslen:44 300098ms
2013-12-13T11:25:50.371-0500 [conn2403] insert db1.udrtest ninserted:1 keyUpdates:0 locks(micros) w:36 1997ms
2013-12-13T11:25:50.415-0500 [conn2448] remove db1.udrtest query: { num: { $lt: 83 } } ndeleted:3 keyUpdates:0 numYields:1 locks(micros) w:3995143 2041ms
2013-12-13T11:25:50.420-0500 [conn801] remove db5.whatever query: { num: { $gt: 28 } } ndeleted:1 keyUpdates:0 numYields:1 locks(micros) w:3977145 2036ms
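
For context on the 300098ms entry: writebacklisten was the internal command that each mongos long-polled against every shard in this era of the server. The shard parks the request until a writeback message is queued or a roughly 300-second server-side wait expires, so the duration above looks like the normal timeout of that long poll rather than a stuck operation. A minimal sketch of that poll, assuming a shard mongod on localhost:27018 and reusing the ObjectId from the log line above:

from bson import ObjectId
from pymongo import MongoClient

# Direct connection to a shard mongod (assumed port), not through mongos.
shard = MongoClient("localhost", 27018)

# The server holds this request open until a writeback is queued for the
# given mongos identity or its internal wait (~300s, matching the 300098ms
# log line) runs out. writebacklisten was removed in later server releases,
# so this is illustrative only.
reply = shard.admin.command({"writebacklisten": ObjectId("52a8eae0f4e43082d8561211")})
print(reply)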



 Comments   
Comment by Tyler Brock [ 31/Dec/13 ]

I think it's fine; this was all on the same machine, so all the I/O on the shards had the same bottleneck.

Comment by Tyler Brock [ 13/Dec/13 ]

I did not see it on a single mongod during my testing.

Comment by Scott Hernandez (Inactive) [ 13/Dec/13 ]

Does this not happen on a single mongod, and is this really specific to sharding?
