[SERVER-7647] getlasterror seems to hang after some queries Created: 13/Nov/12  Updated: 02/Dec/16  Resolved: 10/Apr/13

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.2.1
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Spiros Ioannou Assignee: David Hows
Resolution: Incomplete Votes: 2
Labels: replica, sharding
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Debian, with the newest Java driver


Attachments: Text File currentop1.txt     Text File dbserverStatus.txt     Text File dbstats.txt     File mongodb.owl.tar.bz2     File mongodb.ulysses.tar.bz2     File mongodb.vserver-2.tar.bz2     File mongodb.vserver-3.tar.bz2    
Operating System: Linux
Participants:

 Description   

3 shards, in 3 physical servers, 3 replica sets. Each replica set has 2 members + one arbiter.

I have 50 threads inserting a document 1000 times each, 50K docs in total.
After 30K writes the process hangs, with currentOp showing getlasterror queries that have been running for several thousand seconds. Mongostat shows no activity, no locks, no inserts. No errors appear in the logs. All inserts are performed with REPLICAS_SAFE.
I attach the output of db.stats(), db.serverStatus() and db.currentOp().
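
For reference, a minimal sketch (not taken from the ticket) of the kind of workload described above, written against the 2.x Java driver; the mongos address, database and collection names are placeholders:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.Mongo;
import com.mongodb.ServerAddress;
import com.mongodb.WriteConcern;

public class InsertLoadTest {
    public static void main(String[] args) throws Exception {
        // Connect through mongos (placeholder address).
        Mongo mongo = new Mongo(new ServerAddress("mongos-host", 27017));
        final DBCollection coll = mongo.getDB("test").getCollection("docs");
        // REPLICAS_SAFE (w:2): each insert waits for one secondary to acknowledge.
        coll.setWriteConcern(WriteConcern.REPLICAS_SAFE);

        // 50 threads inserting 1000 small documents each (50K documents in total).
        Thread[] writers = new Thread[50];
        for (int t = 0; t < writers.length; t++) {
            writers[t] = new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < 1000; i++) {
                        // The insert blocks in getlasterror until replication
                        // catches up, which is where the observed hang appears.
                        coll.insert(new BasicDBObject("value", i));
                    }
                }
            });
            writers[t].start();
        }
        for (Thread writer : writers) {
            writer.join();
        }
        mongo.close();
    }
}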



 Comments   
Comment by Stennie Steneker (Inactive) [ 10/Apr/13 ]

Hi Daniel,

Since you do not have any further information for us to investigate this issue (logs, queries, or MMS), we'll resolve it as Incomplete.

Thanks,
Stephen

Comment by Daniel Gibbons [ 30/Jan/13 ]

The nodes are not in MMS and I do not have the queries or logs anymore. If it happens again, I will pull the logs and whatever other info I can from the servers. Another fact I remembered that may be helpful is that the primary had around 170 connections at the time.

Comment by David Hows [ 30/Jan/13 ]

Hi Daniel,

Are your nodes in MMS? If so, can you provide your MMS URL?

Can you outline which queries were running? Are they visible in the logs? If so, can you paste the log output?

Thanks,

David

Comment by Daniel Gibbons [ 29/Jan/13 ]

I recently experienced this. The symptoms were as follows:
  • Certain queries from the app servers were timing out to the databases
  • The secondaries' oplogs were about 12 hours behind the primary
  • There were several getLastError queries on the primary that had been hung for about 12 hours

Comment by Spiros Ioannou [ 20/Nov/12 ]

vserver-2:192.168.1.77
vserver-3:192.168.1.78
owl:192.168.1.73
ulysses:192.168.1.244

Comment by Spiros Ioannou [ 20/Nov/12 ]

I have all the logs, but unfortunately I haven't noted which tests showed the timeout. I examined the logs and found nothing unusual.
Yesterday's test of 500K inserts yielded the usual inconsistency: insert times varied between 1ms and over 1,000,000ms (that is, some inserts took more than 17 minutes), with 95% of inserts finishing in less than 900ms. The total insert rate was ~1 insert/second per thread, totaling about ~60 inserts/sec for all 50 threads, which is equivalent to the performance achieved by a single writer writing the same data to just one shard. Is there any explanation why, as I stated in my previous post, the disk I/O is two orders of magnitude higher than what is actually inserted (<200-byte documents)?
I will attach all logs, including yesterday's run.
Best Regards,

Comment by Eliot Horowitz (Inactive) [ 20/Nov/12 ]

From the original issue, the reason things paused was replication falling behind on the w:2 writes.
Do you have the secondary logs from then?
Many people have data sets much larger than RAM, so this is probably a relatively simple issue.
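
To make the mechanism concrete: with the 2.x Java driver, REPLICAS_SAFE is essentially an insert followed by a getlasterror with w:2, which blocks until one secondary has the write. A hedged sketch (the host, database and collection names, and the wtimeout value are illustrative only):

import com.mongodb.BasicDBObject;
import com.mongodb.CommandResult;
import com.mongodb.DB;
import com.mongodb.Mongo;

public class GetLastErrorSketch {
    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("mongos-host", 27017); // placeholder address
        DB db = mongo.getDB("test");

        db.requestStart(); // pin one connection so getlasterror sees our insert
        try {
            db.getCollection("docs").insert(new BasicDBObject("x", 1));
            // w:2 waits for one secondary; if replication is far behind, this is
            // the call that appears to hang. wtimeout bounds the wait (10s here).
            CommandResult gle = db.command(new BasicDBObject("getlasterror", 1)
                    .append("w", 2)
                    .append("wtimeout", 10000));
            System.out.println(gle);
        } finally {
            db.requestDone();
        }
        mongo.close();
    }
}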

Comment by David Hows [ 20/Nov/12 ]

Hi Spiros,

I'm sorry to hear this, but I understand where you are coming from.

MongoDB relies on the operating system to control system resource allocation. On a system with potentially high resource contention, performance can indeed be a little unpredictable, as there is never a guarantee of how much of your working data set will be in RAM at any given time.

If you were to run each of the mongod instances in its own VM with a dedicated slice of resources, you should see an increase in performance.

You may want to look at a paravirtualized driver, as this will give your DomUs direct access to the disks and should remove the overhead of I/O passing through the Dom0.

If there is anything more we can do for you please let me know.

Cheers,

David

Comment by Spiros Ioannou [ 19/Nov/12 ]

Hi David,

  • The hung getlasterror does not happen every time now, so unfortunately it might be difficult to reproduce at will.
  • We are running mongod as root because these are just test servers with test load. We use the dom0 because disk performance is much higher on dom0, theory notwithstanding; that holds for other DBs as well.
  • No other tests are performed on these machines while the mongo tests are running.
  • Page faults are always 40-70/sec while the tests are running, which suggests 40-70 seeks/sec in the worst case. This is probably the cause of the 100% disk utilization and the bottleneck on the maximum performance we observe; those 10Krpm disks will not go much beyond 100 seeks/sec. But we are not running in journaling mode, only REPLICAS_SAFE, so writes should not be flushed constantly. These page faults must therefore come from read operations performed by mongo, which is backed by the ~9MB/sec of disk reads caused by mongo on vserver-dev-{2,3} vs the 1MB/sec of writes. These reads are probably mongo's mapped pages being read from disk, but according to the page faults we should have observed a 70 faults/sec * 4KB/page = 280KB/sec read load, not 10MB/sec. This is really strange.
  • We chose 2 mongod per server to reduce locking; a suggestion on the net said about 1 shard per core. Even so, the memory:disk ratio of the current configuration is better than in the production environment, at about 1:10.
  • We reduced the readahead from 128KB (default) to 32KB on vserver-dev-3 with no noticeable difference (/sys/block/.../queue/read_ahead_kb).
  • Replication timeout settings: I have not changed those. Do you mean on the Java driver? I presume that if a timeout had occurred on the writes I would have seen an exception.

Anyway, thank you David for your input; I feel bad wasting your time with this. It seems mongo is a bit unpredictable, and moreover not really suited for larger-than-memory installations, since we get 1/20 of the performance of Postgres with 3x the hardware. Perhaps we need an all-data-in-memory setup for it to scale predictably.

Best Regards,
-Spiros

Comment by David Hows [ 19/Nov/12 ]

Hi Spiros,

Re: background flush - yes, this is the background thread that persists data to disk, and it is controlled by syncdelay. Given that it doesn't spike above a second during the event I will discuss below, it is an unlikely cause (for that event at least), but worth keeping an eye on in general.
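
If it helps to watch this directly, the backgroundFlushing timings are exposed in serverStatus; a small sketch against one mongod with the 2.x Java driver (host/port are placeholders):

import com.mongodb.BasicDBObject;
import com.mongodb.CommandResult;
import com.mongodb.DBObject;
import com.mongodb.Mongo;

public class FlushTimings {
    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("vserver-dev-2", 27017); // placeholder host/port
        // serverStatus().backgroundFlushing shows how long the periodic
        // (syncdelay-driven) flush to disk has been taking.
        CommandResult status = mongo.getDB("admin")
                .command(new BasicDBObject("serverStatus", 1));
        DBObject flushing = (DBObject) status.get("backgroundFlushing");
        System.out.println("flushes:    " + flushing.get("flushes"));
        System.out.println("average_ms: " + flushing.get("average_ms"));
        System.out.println("last_ms:    " + flushing.get("last_ms"));
        mongo.close();
    }
}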

In terms of getting a better understanding of your environment, can I ask: why are you running your mongo instances as root within what is basically the hypervisor - dom0? Are there any operations that could be contending with Mongo for resources in this domain (something running on the DomUs, for example)?

Looking at the rs1 set as an example, the only event we have to go on in the MMS graphs is a page fault event on the 16th at around 12:00 GMT. That event looks to have been an insert event, and the lock percentage on the ifms collection spikes to ~100%, with a lesser lock spike on the local DB (where the oplog resides); the oplog is essentially what the secondaries must read from when replicating.

Around the time of that event, the resident memory on both members halves (down to ~1.23GB from 2.49GB on the primary). That suggests a couple of things:

1. That the pages that were brought into memory at that time were not all touched by mongod (this can be caused by readahead)
2. That there is likely memory pressure from other processes, otherwise this type of event would increase resident memory

Some other questions related to this: are you running multiple instances of mongod in the same dom0? Around the time that the other instances drop, vserver-dev-2:47018 raised its resident memory from 0.5GB to 1.6GB. This looks like competition between two different mongod instances for RAM.

A few suggestions:

  • Can you retry the tests with the secondaries not competing with the primaries on the same host?
  • Lowering readahead on the data block devices (and any LVM device-mapper volumes, if that is how it is configured) may help memory utilization.
    • This requires a restart of the mongod process, and generally we recommend halving the value (from 256 to 128, for example) and retesting.
  • REPLICAS_SAFE requires the write to make it to a secondary, but will eventually time out - what are your timeout settings currently? (See the sketch after this list.)
  • Writing to more than one database at a time may be helpful as an additional data point.
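
As a point of reference for the timeout question above, a minimal sketch of bounding the replica wait with the 2.x Java driver (placeholder host; the 10-second wtimeout is only an example value):

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.Mongo;
import com.mongodb.MongoException;
import com.mongodb.WriteConcern;

public class BoundedReplicaWrite {
    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("mongos-host", 27017); // placeholder address
        DBCollection coll = mongo.getDB("test").getCollection("docs");

        // w:2 with a 10-second wtimeout: the write still requires a secondary
        // to acknowledge, but the getlasterror wait is bounded rather than
        // hanging while replication catches up.
        WriteConcern w2WithTimeout = new WriteConcern(2, 10000);
        coll.setWriteConcern(w2WithTimeout);
        try {
            coll.insert(new BasicDBObject("x", 1));
        } catch (MongoException e) {
            // With the 2.x driver, a wtimeout is reported as a getlasterror
            // error and typically surfaces here as an exception.
            System.err.println("write not replicated within wtimeout: " + e);
        }
        mongo.close();
    }
}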

Cheers,

David

Comment by Spiros Ioannou [ 16/Nov/12 ]

vserver-dev-{2,3} are HP ProLiant DL360 G7, 12 cores (2 threads each), 45GB installed memory, 8GB available to dom0 (where mongo runs); drives are 2x 10Krpm SAS in RAID0.

Comment by Spiros Ioannou [ 16/Nov/12 ]

Hi David,

  • Page faulting seems normal; our real dataset is several terabytes, so it is always larger than memory.
  • Write lock (reported by mongostat) was always at 100% on each master, even when we had about 6000 inserts/second. That performance was when the DB was still empty (aggregate of 50 writers).
  • Background flush: I have no idea what that does, nor can I find any documentation. If this is just "fsyncs", then I assure you that the background flush curve that rises so much after the inserts pause does not coincide with the disk I/O activity and utilization reported by iostat; it is the opposite.
  • Queued reads: I presume these are for replication. I am not sure why there are no queued writes; perhaps the reads needed for replication are queued so that data is replicated in batches?
  • Ulimit settings alerts: this alert is probably a bug. No hard or soft limits are present, and mongod runs as root.

I include info for shards 1 and 2. vserver-dev-2 and vserver-dev-3 are identical in h/w and s/w configuration.
This is while the usual tests are running.

On vserver-dev-2 and vserver-dev-3, data and binaries are on the /dev/cciss/c0d0p1 partition.
On ulysses, data and binaries are on the /dev/sda5 partition.

root@vserver-dev-2:/var/log/mongodb# blockdev --report
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0   1200187203584   /dev/cciss/c0d0
rw   256   512  4096       2048   1151637389312   /dev/cciss/c0d0p1
rw   256   512  1024 2249295870            1024   /dev/cciss/c0d0p2
rw   256   512  4096 2249295872     48546971648   /dev/cciss/c0d0p5
rw  1024   512  4096          0      5368709120   /dev/sdh
rw  1024   512  4096          0      5368709120   /dev/sdg
rw   256   512  4096          0    107374182400   /dev/sda
rw  1024   512  4096          0   2089071673344   /dev/sdf
rw   256   512  4096          0    107374182400   /dev/sdc
rw  1024   512  4096          0      5368709120   /dev/sdd
rw  1024   512  4096          0   1759217975296   /dev/sdb
rw  1024   512  4096          0   2089071673344   /dev/sdi
rw  1024   512  4096          0      5368709120   /dev/sdj
rw  1024   512  4096          0   1759217975296   /dev/sde
rw  1024   512  4096          0      5368709120   /dev/dm-0
rw  1024   512  4096          0      5368709120   /dev/dm-1
rw   256   512  4096          0    107374182400   /dev/dm-2
rw  1024   512  4096          0   2089071673344   /dev/dm-3
rw  1024   512  4096          0   1759217975296   /dev/dm-4
rw   256   512  4096          0      2147483648   /dev/dm-5
rw   256   512  4096          0     15032385536   /dev/dm-6
rw   256   512  4096          0      1073741824   /dev/dm-7
rw   256   512  4096          0      5368709120   /dev/dm-8
rw   256   512  4096          0      2147483648   /dev/dm-9
rw   256   512  4096          0     15032385536   /dev/dm-10
rw   256   512  4096          0      2147483648   /dev/dm-11
rw   256   512  4096          0     32212254720   /dev/dm-12
rw   256   512  4096          0      1073741824   /dev/dm-13
rw   256   512  4096          0      8589934592   /dev/dm-14
rw   256   512  4096          0      1073741824   /dev/dm-15
rw   256   512  4096          0     10737418240   /dev/dm-16

root@vserver-dev-2:/var/log/mongodb# free -mt
             total       used       free     shared    buffers     cached
Mem:          7727       7163        564          0        112       5250
-/+ buffers/cache:       1801       5926
Swap:        46297         59      46238
Total:       54025       7223      46802

root@ulysses:~# blockdev --report
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0    750156374016   /dev/sda
rw   256   512  4096       2048       104857600   /dev/sda1
rw   256   512  1024     206848    250000000000   /dev/sda2
rw   256   512  4096 1442015232     11842617344   /dev/sda3
rw   256   512  1024  488488958            1024   /dev/sda4
rw   256   512  4096  488488960    483928309760   /dev/sda5
rw   256   512  4096 1433663488      4276092928   /dev/sda6

root@ulysses:~# free -mt
             total       used       free     shared    buffers     cached
Mem:          3937       3655        282          0        102        601
-/+ buffers/cache:       2951        985
Swap:         4077        454       3623
Total:        8015       4109       3905

Best Regards,
-Spiros

Comment by David Hows [ 15/Nov/12 ]

Hi Spiros,

I've had a look at your systems in MMS. There are a few standouts:

  • High Pagefaulting
  • High Write Lock Percentage
  • Elevated Background Flush Avg
  • Queueing reads
  • Ulimit settings alerts

The page faults and lock percentage are both worrying and are indications of poor performance. Page faulting traditionally occurs when your working dataset exceeds the RAM available to your system.

Can you give me some background on your systems' disk and RAM sizes, etc.? Would you also be able to provide me with the output of:

free -mt
blockdev --report

Cheers,

David

Comment by Spiros Ioannou [ 15/Nov/12 ]

Hi David,
MMS is now populated through mongos.
At "Thu Nov 15 13:14:43 EET 2012" the tests are running, but stuck, with mongostat displaying 0 inserts/second. It was about 100 when it started. The total average will be again something like 1 insert/second.

This could be a duplicate of SERVER-5831 with the exception that the shard disks are mostly idle during this time.

thanks,
-Spiros

Comment by David Hows [ 15/Nov/12 ]

Hi Spiros,

I can see your group, but there is only the agent and no servers within your group.

Have you added any hosts in the past?

Cheers,

David

Comment by Spiros Ioannou [ 14/Nov/12 ]

The MMS group name is "inaccess" (if that is the string shown above the users), but the only errors are agent-specific.

Comment by David Hows [ 13/Nov/12 ]

Hi Spiros,

What you have described sounds like replication lag; the safe writes are thus taking longer to confirm that they have been replicated.

You should check the replication lag on the secondaries of your sets. Do you have MMS? If so, what is your group name?
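
One way to check lag is via the replSetGetStatus command; a brief sketch with the 2.x Java driver (the member host/port is a placeholder), comparing each member's optimeDate against the primary's:

import java.util.List;

import com.mongodb.BasicDBObject;
import com.mongodb.CommandResult;
import com.mongodb.DBObject;
import com.mongodb.Mongo;

public class ReplicationLagCheck {
    public static void main(String[] args) throws Exception {
        // Connect directly to one replica-set member (placeholder host/port).
        Mongo mongo = new Mongo("vserver-dev-2", 27017);
        CommandResult status = mongo.getDB("admin")
                .command(new BasicDBObject("replSetGetStatus", 1));

        // Each member reports its optimeDate; the gap between a secondary's
        // and the primary's optimeDate is that secondary's replication lag.
        @SuppressWarnings("unchecked")
        List<DBObject> members = (List<DBObject>) status.get("members");
        for (DBObject member : members) {
            System.out.println(member.get("name") + "  state=" + member.get("stateStr")
                    + "  optime=" + member.get("optimeDate"));
        }
        mongo.close();
    }
}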

Cheers,

David
