Core Server / SERVER-12085

Removing journal files/writebacklistener timeout stops cluster traffic

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Gone away
    • Affects Version/s: 2.5.4
    • Fix Version/s: None
    • Component/s: Sharding
    • Labels:
    • Operating System:
      ALL
    • Steps To Reproduce:
      Apply load to a sharded cluster with randomized CRUD operations.
      Observe the load with mongostat.


      Description

      On a sharded cluster with two shards, three config servers, and two mongos, throughput drops dramatically (to zero) every couple of minutes while I apply load to the cluster through both mongos nodes.

      I noticed this happening at 11:25:49 in the mongostat output for the first mongos and at 11:25:50 for the second:

      insert  query update delete getmore command  vsize    res faults  netIn netOut  conn repl       time 
         779    719    763    801       0    2192  2.51g    52m      0   384k   375k   203  RTR   11:25:47 
         109    110    106    133       0     356  2.51g    52m      0    59k    59k   203  RTR   11:25:48 
           0      0      0      0       0       1  2.51g    52m      0    62b   717b   203  RTR   11:25:49 
         562    549    533    540       0    1780  2.51g    52m      0   292k   298k   203  RTR   11:25:50 

      insert  query update delete getmore command  vsize    res faults  netIn netOut  conn repl       time
         705    682    697    737       0    1993  2.49g    37m      0   350k   341k   205  RTR   11:25:48 
         113    116    113    107       0     333  2.49g    37m      0    57k    58k   205  RTR   11:25:49  
           0      0      0      0       0       1  2.49g    37m      0    62b   717b   205  RTR   11:25:50 
         613    597    565    548       0    1876  2.49g    37m      0   308k   322k   205  RTR   11:25:51
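      Stalls like the ones above can be picked out of captured mongostat output automatically. The following is a minimal sketch, assuming the column layout shown in this report (four leading integer counters, timestamp in the last column); the names `StatRow`, `parse_mongostat`, and `find_stalls` are hypothetical helpers, not part of any MongoDB tool.

```python
from typing import List, NamedTuple


class StatRow(NamedTuple):
    time: str  # value of the last column, e.g. "11:25:49"
    ops: int   # insert + query + update + delete for that sample


def parse_mongostat(text: str) -> List[StatRow]:
    """Parse mongostat rows in the layout pasted in this report (assumed)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with four integer counters; header lines and
        # blank lines fail this check and are skipped.
        if len(fields) < 5 or not all(f.isdigit() for f in fields[:4]):
            continue
        rows.append(StatRow(time=fields[-1], ops=sum(int(f) for f in fields[:4])))
    return rows


def find_stalls(rows: List[StatRow]) -> List[str]:
    """Return the timestamps of samples with zero CRUD operations."""
    return [row.time for row in rows if row.ops == 0]


# First mongos output from this report.
SAMPLE = """\
insert  query update delete getmore command  vsize    res faults  netIn netOut  conn repl       time
   779    719    763    801       0    2192  2.51g    52m      0   384k   375k   203  RTR   11:25:47
   109    110    106    133       0     356  2.51g    52m      0    59k    59k   203  RTR   11:25:48
     0      0      0      0       0       1  2.51g    52m      0    62b   717b   203  RTR   11:25:49
   562    549    533    540       0    1780  2.51g    52m      0   292k   298k   203  RTR   11:25:50
"""

if __name__ == "__main__":
    print(find_stalls(parse_mongostat(SAMPLE)))
```

      Run against the first mongos sample, this flags the 11:25:49 row. Other mongostat versions may format counters differently, so the digit check would need adjusting there.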

      In the logs for the shards I see that at that time one shard removed old journal files and the writebacklistener timed out:

      2013-12-13T11:25:47.800-0500 [journal] old journal file will be removed: /Users/tbrock/Code/QA/QA-431/cluster/s1/journal/j._35
      2013-12-13T11:25:47.819-0500 [journal] old journal file will be removed: /Users/tbrock/Code/QA/QA-431/cluster/s1/journal/j._36
      2013-12-13T11:25:48.356-0500 [conn552] command admin.$cmd command: { writebacklisten: ObjectId('52a8eae0f4e43082d8561211') } ntoreturn:1 keyUpdates:0  reslen:44 300098ms
      2013-12-13T11:25:50.371-0500 [conn2403] insert db1.udrtest ninserted:1 keyUpdates:0 locks(micros) w:36 1997ms
      2013-12-13T11:25:50.415-0500 [conn2448] remove db1.udrtest query: { num: { $lt: 83 } } ndeleted:3 keyUpdates:0 numYields:1 locks(micros) w:3995143 2041ms
      2013-12-13T11:25:50.420-0500 [conn801] remove db5.whatever query: { num: { $gt: 28 } } ndeleted:1 keyUpdates:0 numYields:1 locks(micros) w:3977145 2036ms
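      To line these log entries up against the mongostat stall, it can help to extract the duration and write-lock time from each line. The following is a rough sketch, assuming the log line layout shown above (timestamp, bracketed context, message, optional trailing `<n>ms`); `LogEntry` and `parse_log` are hypothetical names, and the regular expressions are fitted only to the lines in this report.

```python
import re
from typing import List, NamedTuple, Optional

# timestamp, [context], message, and an optional trailing duration in ms
LINE_RE = re.compile(r"^(?P<ts>\S+) \[(?P<ctx>[^\]]+)\] (?P<msg>.*?)(?: (?P<ms>\d+)ms)?$")
# write-lock time as printed in "locks(micros) w:<n>"
W_LOCK_RE = re.compile(r"locks\(micros\).*?\bw:(\d+)")


class LogEntry(NamedTuple):
    ts: str
    ctx: str
    ms: Optional[int]             # reported duration, if the line has one
    w_lock_micros: Optional[int]  # write-lock time, if the line has one


def parse_log(text: str) -> List[LogEntry]:
    entries = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # blank or unrecognized line
        lock = W_LOCK_RE.search(match.group("msg"))
        entries.append(LogEntry(
            ts=match.group("ts"),
            ctx=match.group("ctx"),
            ms=int(match.group("ms")) if match.group("ms") else None,
            w_lock_micros=int(lock.group(1)) if lock else None,
        ))
    return entries


# Shard log lines from this report.
SAMPLE_LOG = """\
2013-12-13T11:25:47.800-0500 [journal] old journal file will be removed: /Users/tbrock/Code/QA/QA-431/cluster/s1/journal/j._35
2013-12-13T11:25:47.819-0500 [journal] old journal file will be removed: /Users/tbrock/Code/QA/QA-431/cluster/s1/journal/j._36
2013-12-13T11:25:48.356-0500 [conn552] command admin.$cmd command: { writebacklisten: ObjectId('52a8eae0f4e43082d8561211') } ntoreturn:1 keyUpdates:0  reslen:44 300098ms
2013-12-13T11:25:50.371-0500 [conn2403] insert db1.udrtest ninserted:1 keyUpdates:0 locks(micros) w:36 1997ms
2013-12-13T11:25:50.415-0500 [conn2448] remove db1.udrtest query: { num: { $lt: 83 } } ndeleted:3 keyUpdates:0 numYields:1 locks(micros) w:3995143 2041ms
2013-12-13T11:25:50.420-0500 [conn801] remove db5.whatever query: { num: { $gt: 28 } } ndeleted:1 keyUpdates:0 numYields:1 locks(micros) w:3977145 2036ms
"""

if __name__ == "__main__":
    for entry in parse_log(SAMPLE_LOG):
        if entry.ms is not None and entry.ms >= 1000:
            print(entry.ts, entry.ctx, entry.ms, entry.w_lock_micros)
```

      The 1000 ms threshold in the usage loop is an arbitrary choice for illustration; with it, the writebacklisten command and the three write operations are listed, while the two journal lines are not.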
