  1. Core Server
  2. SERVER-12085

Removing journal files/writebacklistener timeout stops cluster traffic

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • None
    • Affects Version/s: 2.5.4
    • Component/s: Sharding
    • Labels:
    • ALL

      Apply load to the sharded cluster with randomized CRUD ops.
      Observe the load with mongostat.
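
The ticket does not include the load script, so the following is only a minimal sketch of what "randomized CRUD ops" could look like. It assumes documents with a single `num` field, as seen in the shard log queries in this report; the op mix, the `max_num` range, and the function names are hypothetical.

```python
import random

# Hypothetical op mix; the ticket only says "randomized crud ops".
OPS = ("insert", "query", "update", "remove")


def random_crud_op(rng, max_num=100):
    """Return one (op_name, payload) pair built around the `num` field."""
    op = rng.choice(OPS)
    n = rng.randrange(max_num)
    if op == "insert":
        return op, {"num": n}
    if op == "query":
        return op, {"num": {"$gt": n}}
    if op == "update":
        # (filter, update) pair; the new value is also random.
        return op, ({"num": n}, {"$set": {"num": rng.randrange(max_num)}})
    return op, {"num": {"$lt": n}}


def generate_ops(count, seed=0):
    """Generate a reproducible list of random CRUD op descriptors."""
    rng = random.Random(seed)
    return [random_crud_op(rng) for _ in range(count)]
```

Each descriptor would then be sent through one of the mongos routers with a driver of choice; that part needs a live cluster and is left out here.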

      On a sharded cluster with two shards, three config servers, and two mongos, throughput drops dramatically (to zero) every couple of minutes while I apply load through both mongos nodes.

      I noticed this happening at about 11:25:49 in the mongostat output for both mongos:

      insert  query update delete getmore command  vsize    res faults  netIn netOut  conn repl       time 
         779    719    763    801       0    2192  2.51g    52m      0   384k   375k   203  RTR   11:25:47 
         109    110    106    133       0     356  2.51g    52m      0    59k    59k   203  RTR   11:25:48 
           0      0      0      0       0       1  2.51g    52m      0    62b   717b   203  RTR   11:25:49 
         562    549    533    540       0    1780  2.51g    52m      0   292k   298k   203  RTR   11:25:50 
      
      insert  query update delete getmore command  vsize    res faults  netIn netOut  conn repl       time
         705    682    697    737       0    1993  2.49g    37m      0   350k   341k   205  RTR   11:25:48 
         113    116    113    107       0     333  2.49g    37m      0    57k    58k   205  RTR   11:25:49  
           0      0      0      0       0       1  2.49g    37m      0    62b   717b   205  RTR   11:25:50 
         613    597    565    548       0    1876  2.49g    37m      0   308k   322k   205  RTR   11:25:51
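
To spot these stalls without reading the output by eye, the mongostat samples can be scanned for rows where every CRUD counter is zero. This is a rough sketch that assumes the whitespace-separated column layout shown above; the function names are made up for illustration, and the sample is the first mongos table from this report.

```python
SAMPLE = """
insert  query update delete getmore command  vsize    res faults  netIn netOut  conn repl       time
   779    719    763    801       0    2192  2.51g    52m      0   384k   375k   203  RTR   11:25:47
   109    110    106    133       0     356  2.51g    52m      0    59k    59k   203  RTR   11:25:48
     0      0      0      0       0       1  2.51g    52m      0    62b   717b   203  RTR   11:25:49
   562    549    533    540       0    1780  2.51g    52m      0   292k   298k   203  RTR   11:25:50
"""


def parse_mongostat(text):
    """Parse mongostat output into dicts keyed by the header column names."""
    rows = []
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "insert":
            # mongostat repeats the header periodically; remember the latest.
            header = fields
            continue
        if header is None or len(fields) != len(header):
            continue  # skip anything that does not line up with the header
        rows.append(dict(zip(header, fields)))
    return rows


def find_stalls(rows, op_columns=("insert", "query", "update", "delete")):
    """Return the `time` of every sample where all CRUD counters are zero."""
    stalls = []
    for row in rows:
        # mongostat prefixes replicated op counts with '*'; strip it.
        if all(int(row[c].lstrip("*")) == 0 for c in op_columns):
            stalls.append(row["time"])
    return stalls


print(find_stalls(parse_mongostat(SAMPLE)))  # ['11:25:49']
```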
      

      In the shard logs I see that, at that time, one shard removed old journal files and the writebacklistener timed out:

      2013-12-13T11:25:47.800-0500 [journal] old journal file will be removed: /Users/tbrock/Code/QA/QA-431/cluster/s1/journal/j._35
      2013-12-13T11:25:47.819-0500 [journal] old journal file will be removed: /Users/tbrock/Code/QA/QA-431/cluster/s1/journal/j._36
      2013-12-13T11:25:48.356-0500 [conn552] command admin.$cmd command: { writebacklisten: ObjectId('52a8eae0f4e43082d8561211') } ntoreturn:1 keyUpdates:0  reslen:44 300098ms
      2013-12-13T11:25:50.371-0500 [conn2403] insert db1.udrtest ninserted:1 keyUpdates:0 locks(micros) w:36 1997ms
      2013-12-13T11:25:50.415-0500 [conn2448] remove db1.udrtest query: { num: { $lt: 83 } } ndeleted:3 keyUpdates:0 numYields:1 locks(micros) w:3995143 2041ms
      2013-12-13T11:25:50.420-0500 [conn801] remove db5.whatever query: { num: { $gt: 28 } } ndeleted:1 keyUpdates:0 numYields:1 locks(micros) w:3977145 2036ms
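
To line these entries up against the mongostat stall, the slow operations can be pulled out of a shard log by their trailing duration. The sketch below assumes the log line layout shown above (timestamp, bracketed context, message, duration in ms at the end); the regular expression and the `slow_ops` helper are illustrative, not part of any MongoDB tool.

```python
import re

SAMPLE = """
2013-12-13T11:25:47.800-0500 [journal] old journal file will be removed: /Users/tbrock/Code/QA/QA-431/cluster/s1/journal/j._35
2013-12-13T11:25:47.819-0500 [journal] old journal file will be removed: /Users/tbrock/Code/QA/QA-431/cluster/s1/journal/j._36
2013-12-13T11:25:48.356-0500 [conn552] command admin.$cmd command: { writebacklisten: ObjectId('52a8eae0f4e43082d8561211') } ntoreturn:1 keyUpdates:0  reslen:44 300098ms
2013-12-13T11:25:50.371-0500 [conn2403] insert db1.udrtest ninserted:1 keyUpdates:0 locks(micros) w:36 1997ms
2013-12-13T11:25:50.415-0500 [conn2448] remove db1.udrtest query: { num: { $lt: 83 } } ndeleted:3 keyUpdates:0 numYields:1 locks(micros) w:3995143 2041ms
2013-12-13T11:25:50.420-0500 [conn801] remove db5.whatever query: { num: { $gt: 28 } } ndeleted:1 keyUpdates:0 numYields:1 locks(micros) w:3977145 2036ms
"""

# timestamp, [context], message, then a duration such as "1997ms" at the end
LINE_RE = re.compile(
    r"^(?P<ts>\S+) \[(?P<ctx>[^\]]+)\] (?P<msg>.*?)\s+(?P<ms>\d+)ms$"
)


def slow_ops(log_text, min_ms=1000):
    """Return (timestamp, context, duration_ms) for ops taking >= min_ms."""
    ops = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and int(match.group("ms")) >= min_ms:
            ops.append(
                (match.group("ts"), match.group("ctx"), int(match.group("ms")))
            )
    return ops


for ts, ctx, ms in slow_ops(SAMPLE):
    print(ts, ctx, ms)
```

Lines without a trailing duration, such as the `[journal]` messages, do not match and are skipped, so the journal cleanup itself has to be correlated by timestamp.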
      

            Assignee:
            davide.italiano Davide Italiano
            Reporter:
            tyler@10gen.com Tyler Brock
            Votes:
            0
            Watchers:
            4

              Created:
              Updated:
              Resolved: