[SERVER-41901] Mongodb seg fault after network interface reset Created: 25/Jun/19  Updated: 27/Oct/23  Resolved: 06/Nov/19

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Dame Jovanoski Assignee: Brian Lane
Resolution: Gone away Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File 20190605_02_mongodump.log     File 20190605_03_mongodump.log     File 20190605_mongodump.log     File 20190608_outlog_mongodb_crash.log     File db.out     File mongo_dead.log     File mongo_sharding_error.log     File syslog.out    
Operating System: ALL
Participants:

 Description   

We are using Azure as a cloud provider for our VMs, and we have a mongodb instance that started to segfault after some changes in the network (out of our control).

 

 



 Comments   
Comment by Brian Lane [ 06/Nov/19 ]

Hi djo,

I am going to resolve this issue now. Feel free to re-open if you are still having problems.

-Brian

Comment by Brian Lane [ 27/Aug/19 ]

Hi djo,

I was able to locate some of the information regarding the consult. Thanks for the information.

Regarding this Jira, what action would you like me to take? Have you experienced another segfault or have things been working fine?

-Brian

Comment by Dame Jovanoski [ 23/Aug/19 ]

So here are the things we tried:

RAID 0 :

  • 3.4 -> failed, crc error
  • 3.6 -> same
  • 4.0.10 -> same

RAID 50:

  • 4.0.10 -> crc error

Sharding (no RAID, single 4TB disk with READ caching):

  • 5 VMs sharded with one hashed shard key, one config server -> this is the setup that has lasted the longest, so I guess a double-free or dangling-pointer error

 

The sharding setup came after we paid for a consultancy from MongoDB, so you should be able to get hold of the report which describes our company's setup.
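
For reference, a minimal sketch of how a hashed shard key like ours could be set up from the mongo shell (the mongos host, database, collection, and key names are placeholders, not our actual schema):

mongo --host <mongos_host> --eval '
  sh.enableSharding("DATABASE");
  sh.shardCollection("DATABASE.collection_name", { some_key: "hashed" });
  sh.status();
'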

Comment by Brian Lane [ 23/Aug/19 ]

Hi djo,

I noticed in the log you sent over that you are now running 4.0.10? Your earlier comments mentioned that you were running 3.4.1. The information in the error log doesn't seem to be related to the previous disk errors.

So during the past three weeks, have you had to run repair at all?

-Brian

Comment by Dame Jovanoski [ 22/Aug/19 ]

Hello Brian again, unfortunately we don't really have time to investigate the mongodb issue.

mongo_sharding_error.log

We have now changed the setup; we thought that the RAID was the problem, so we now have a sharded cluster of 5 MongoDB instances on
separate VMs with single disks, and after running it for three weeks we got a Segmentation Fault on shard server 4.

I am forwarding you the logs for the sharding error.

Tbh I am a bit concerned about the stability of MongoDB for production servers.

Comment by Brian Lane [ 16/Aug/19 ]

Hi djo,

Sorry to hear you are still having trouble. A negative RecordId does indicate some corruption. It would be interesting to know whether you still see the error if you run the repair operation again after restarting mongod. We are curious whether the corruption got written to disk or was only in memory at the time you tried the repair.

-Brian

Comment by Dame Jovanoski [ 15/Aug/19 ]

Hey Brian, yes, the same problem occurred even with RAID 50; a different error appeared when repairing:

2019-08-05T15:30:57.795+0000 I INDEX    [initandlisten]          building index using bulk method; build may temporarily use up to 166 megabytes of RAM
2019-08-05T15:30:57.813+0000 I STORAGE  [initandlisten] WTCursor::next -- c->next_key ( RecordId(-28)) was not greater than _lastReturnedId (RecordId(0)) which is a bug.
2019-08-05T15:30:57.921+0000 F STORAGE  [initandlisten] Record store did not exist. Collection: DATABASE.collection_name UUID: 46d12989-c2ef-4e33-a076-b1864ad63f03
2019-08-05T15:30:57.921+0000 F -        [initandlisten] Fatal Assertion 50936 at src/mongo/db/catalog/database_impl.cpp 228
2019-08-05T15:30:57.921+0000 F -        [initandlisten] 

Comment by Brian Lane [ 15/Aug/19 ]

Hi djo,

Thanks for providing the info. I am curious if you have had the issue occur again in your environment, forcing you to rerun repair.

-Brian

Comment by Dame Jovanoski [ 05/Aug/19 ]

These are the results from running the fio job:

 

 

Jobs: 1 (f=0): [f(1)][100.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s] [0/1947]
write-and-verify: (groupid=0, jobs=1): err= 0: pid=11057: Thu Aug 1 16:37:10 2019
 read: IOPS=24.6k, BW=96.3MiB/s (101MB/s)(200GiB/2127592msec)
 slat (usec): min=2, max=12917, avg= 5.71, stdev=12.51
 clat (usec): min=23, max=123498, avg=639.59, stdev=330.26
 lat (usec): min=27, max=123522, avg=646.11, stdev=330.82
 clat percentiles (usec):
 | 1.00th=[ 330], 5.00th=[ 392], 10.00th=[ 433], 20.00th=[ 498],
 | 30.00th=[ 545], 40.00th=[ 578], 50.00th=[ 611], 60.00th=[ 652],
 | 70.00th=[ 693], 80.00th=[ 750], 90.00th=[ 824], 95.00th=[ 898],
 | 99.00th=[ 1205], 99.50th=[ 2008], 99.90th=[ 4146], 99.95th=[ 4621],
 | 99.99th=[ 8160]
 iops : min= 1728, max=30552, avg=24654.39, stdev=1272.07, samples=4257
 write: IOPS=5344, BW=20.9MiB/s (21.9MB/s)(250GiB/12272412msec)
 slat (usec): min=3, max=2904.2k, avg=20.79, stdev=559.27
 clat (usec): min=201, max=2908.3k, avg=2965.04, stdev=3081.54
 lat (msec): min=2, max=2909, avg= 2.99, stdev= 3.13
 clat percentiles (usec):
 | 1.00th=[ 2474], 5.00th=[ 2573], 10.00th=[ 2638], 20.00th=[ 2704],
 | 30.00th=[ 2769], 40.00th=[ 2802], 50.00th=[ 2868], 60.00th=[ 2933],
 | 70.00th=[ 2966], 80.00th=[ 3064], 90.00th=[ 3228], 95.00th=[ 3523],
 | 99.00th=[ 5276], 99.50th=[ 6128], 99.90th=[ 9503], 99.95th=[13042],
 | 99.99th=[31327]
 bw ( KiB/s): min= 120, max=22808, per=100.00%, avg=21430.00, stdev=708.99, samples=24526
 iops : min= 30, max= 5702, avg=5357.29, stdev=177.23, samples=24526
 lat (usec) : 50=0.01%, 100=0.01%, 250=0.09%, 500=9.05%, 750=26.65%
 lat (usec) : 1000=7.66%
 lat (msec) : 2=0.74%, 4=54.08%, 10=1.67%, 20=0.03%, 50=0.02%
 lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, >=2000=0.01%
 cpu : usr=12.19%, sys=13.23%, ctx=41642803, majf=0, minf=614424
 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
 submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued rwt: total=52428800,65595370,0, short=0,0,0, dropped=0,0,0
 latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
 READ: bw=96.3MiB/s (101MB/s), 96.3MiB/s-96.3MiB/s (101MB/s-101MB/s), io=200GiB (215GB), run=2127592-2127592msec
 WRITE: bw=20.9MiB/s (21.9MB/s), 20.9MiB/s-20.9MiB/s (21.9MB/s-21.9MB/s), io=250GiB (269GB), run=12272412-12272412msec
Disk stats (read/write):
 md127: ios=52428800/66808962, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=5242880/6679481, aggrmerge=0/1414, aggrticks=23894/1770369, aggrin_queue=23029324, aggrutil=69.94%
 sdf: ios=5242880/6682158, merge=0/1459, ticks=24188/1880236, in_queue=22817348, util=69.26%
 sdd: ios=5242880/6681080, merge=0/1404, ticks=24168/1794672, in_queue=23116728, util=69.74%
 sdk: ios=5242880/6679606, merge=0/1338, ticks=22256/1695900, in_queue=23003360, util=69.55%
 sdi: ios=5242880/6680776, merge=0/1351, ticks=24456/1742212, in_queue=23293368, util=69.93%
 sdg: ios=5242880/6677501, merge=0/1431, ticks=24136/1669908, in_queue=22723940, util=69.08%
 sde: ios=5242880/6681486, merge=0/1500, ticks=23024/1907148, in_queue=23365372, util=69.92%
 sdc: ios=5242880/6680263, merge=0/1409, ticks=23440/1703036, in_queue=22893504, util=69.24%
 sdl: ios=5242880/6678927, merge=0/1367, ticks=27276/1696492, in_queue=22609736, util=68.86%
 sdj: ios=5242880/6677565, merge=0/1519, ticks=24744/1773940, in_queue=23150520, util=69.74%
 sdh: ios=5242880/6675456, merge=0/1370, ticks=21252/1840152, in_queue=23319368, util=69.94%

 

Comment by Dame Jovanoski [ 01/Aug/19 ]

Sorry for my late response, I will try it and get back to you.

Comment by Danny Hatcher (Inactive) [ 23/Jul/19 ]

Hello djo, apologies for the delay.

I've been speaking with our storage engine developers and they are concerned about the general reliability of the backing disks. We'd like to perform a simple fio test on one of the servers and see whether the written data maintains its validity over time. Could you please create a verify.fio file with the following contents and run fio against it?

[write-and-verify]
# You can change values as needed
# Most important is that it runs for hours on a large size
rw=randwrite
bs=4k
direct=1
ioengine=libaio
iodepth=16
size=100g
runtime=14400
time_based
verify=crc32c
filename=/dev/XXX

fio verify.fio

Comment by Dame Jovanoski [ 12/Jul/19 ]

These are some numbers from a peak period while we were doing some work on the database:

    *0     2     *0     *0       1     7|0  0.8% 80.0%       0 47.9G 46.7G 0|0 1|0  2.04k   17.7m   12 Jul 12 07:44:28.780     
    *0    *0     *0     *0       1     3|0  0.8% 80.0%       0 47.9G 46.7G 0|0 1|0   640b   31.5m   12 Jul 12 07:44:29.780
  4419     1     *0     *0       1     5|0  0.9% 80.0%       0 47.9G 46.7G 0|0 1|0  36.5m   16.9m   12 Jul 12 07:44:30.779
  4422    *0     *0     *0       0     9|0  0.9% 80.0%       0 47.9G 46.7G 0|0 1|0  32.2m   16.9m   14 Jul 12 07:44:31.780
  8208    *0     *0     *0       0     7|0  1.1% 80.1%       0 47.9G 46.7G 0|1 1|0   106m    130k   12 Jul 12 07:44:32.780
 13873     4     *0     *0       1    23|0  1.3% 80.0%       0 47.9G 46.7G 0|0 1|0  63.9m   17.8m   12 Jul 12 07:44:33.780
 17231     7     *0     *0       4    24|0  1.5% 80.1%       0 47.9G 46.7G 0|0 3|2   167m   52.7m   22 Jul 12 07:44:34.779
 31262     6     *0     *0       6    21|0  1.9% 80.1%       0 47.9G 46.7G 0|0 1|0   194m    121m   27 Jul 12 07:44:35.781
insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn                time
 12725    18     *0     *0      13    33|0  2.1% 80.0%       0 47.9G 46.7G 0|1 1|1   139m    191m   44 Jul 12 07:44:36.780
 18227     9      1     *0      12    21|0  2.3% 79.8%       0 47.9G 46.7G 0|0 2|0   100m    179m   45 Jul 12 07:44:37.780
 30814    12      4     *0       7    34|0  2.7% 80.0%       0 47.9G 46.7G 0|0 1|0   238m    113m   39 Jul 12 07:44:38.782
 22130    28      2     *0       7    63|0  3.0% 80.3%       0 47.9G 46.7G 0|0 1|0   168m    108m   52 Jul 12 07:44:39.780
 29537    24      7     *0      12   104|0  3.5% 80.5%       0 47.9G 46.6G 0|0 1|1   273m    188m   49 Jul 12 07:44:40.779
 40673    29      2     *0      15   112|0  4.1% 80.7%       0 47.9G 46.6G 0|0 1|2   359m    228m   57 Jul 12 07:44:41.780
 25777    40      2     *0      18   118|0  4.5% 80.2%       0 47.9G 46.6G 0|0 1|1   139m    278m   70 Jul 12 07:44:42.780
 57842    28      5     *0      17   106|0  5.2% 80.8%       0 48.0G 46.6G 0|0 1|1   490m    277m   75 Jul 12 07:44:43.780
 40399    23      5     *0      14    61|0  5.1% 80.2%       0 48.0G 46.7G 0|0 2|0   326m    206m   67 Jul 12 07:44:44.782
 39903    60      6     *0      27   175|0  5.4% 80.2%       0 48.0G 46.7G 0|0 1|0   360m    386m   94 Jul 12 07:44:45.779
insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn                time
 26458    37      8     *0      20   146|0  5.2% 80.4%       0 48.0G 46.7G 0|0 1|0   144m    347m   82 Jul 12 07:44:46.780
 31970    49      5     *0      25   149|0  5.2% 80.3%       1 48.0G 46.7G 0|0 1|9   355m    357m   97 Jul 12 07:44:47.781
 38678    35      7     *0      24   118|0  4.9% 80.0%       0 48.0G 46.7G 0|0 1|0   206m    345m   84 Jul 12 07:44:48.780
 31726    50     10     *0      18   200|0  4.6% 80.0%       0 48.0G 46.7G 0|2 1|1   338m    201m   79 Jul 12 07:44:49.781
 23073    50      9     *0      16   165|0  4.2% 80.0%       0 48.0G 46.7G 0|0 1|2   195m    221m   79 Jul 12 07:44:50.780
 23758    54   3410     *0      21   236|0  3.8% 80.3%       0 48.0G 46.7G 0|0 1|1   140m    342m   81 Jul 12 07:44:51.781
 18597    38      7     *0      18   114|0  3.3% 80.3%       0 48.0G 46.7G 0|0 1|2   136m    249m   75 Jul 12 07:44:52.780
  9319    37      3     *0      17   156|0  3.1% 80.0%       0 48.0G 46.6G 0|0 1|3   140m    223m   77 Jul 12 07:44:53.782
 48239    35      7     *0       9   159|0  3.9% 80.0%       0 48.0G 46.6G 0|0 1|3   376m    139m   71 Jul 12 07:44:54.784
 17894    28      8     *0       9   118|0  4.2% 79.8%       0 47.9G 46.6G 0|0 1|2   172m    119m   54 Jul 12 07:44:55.781
insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn                time
 28484    51      5     *0      19   172|0  4.7% 80.2%       0 47.9G 46.6G 0|1 1|0   215m    301m   71 Jul 12 07:44:56.780
 31605    28      3     *0      17   114|0  5.2% 79.6%       0 47.9G 46.6G 0|0 1|1   215m    253m   71 Jul 12 07:44:57.780
 28044    33      4     *0      14   130|0  5.1% 80.0%       0 47.9G 46.6G 0|0 2|0   249m    212m   76 Jul 12 07:44:58.780
  4406    34      2     *0      20    91|0  5.0% 80.0%       0 47.9G 46.6G 0|0 1|0  1.37m    285m   82 Jul 12 07:44:59.780
 13129    31      8     *0      15   104|0  5.0% 80.1%       0 48.0G 46.6G 0|0 1|1   119m    205m   74 Jul 12 07:45:00.780
  4509    23      8     *0       9    92|0  4.8% 80.0%       0 48.0G 46.6G 0|0 1|0  37.9m    121m   58 Jul 12 07:45:01.780
 13245    35      6     *0       8   128|0  5.0% 80.0%       0 48.0G 46.6G 0|0 1|0   139m   81.4m   53 Jul 12 07:45:02.780
  8861    23      9     *0       4   133|0  5.2% 80.1%       0 47.9G 46.6G 0|0 1|0  45.4m   55.4m   34 Jul 12 07:45:03.779
     4    15      3     *0       3    88|0  4.8% 80.0%       0 47.9G 46.6G 0|0 1|0  1.22m   52.6m   32 Jul 12 07:45:04.779
  2883    24      1     *0       8   124|0  4.8% 80.0%       0 47.9G 46.6G 0|0 2|1  32.2m   85.6m   33 Jul 12 07:45:05.780
insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn                time
  4477    22      1     *0       3   128|0  4.9% 80.0%       0 47.9G 46.6G 0|0 1|1  35.6m   66.6m   29 Jul 12 07:45:06.780
 10299    11     *0     *0       2    64|0  5.0% 80.1%       0 47.9G 46.6G 0|0 1|0  66.2m   52.3m   31 Jul 12 07:45:07.780
  4422     8     *0     *0       3    37|0  5.1% 80.1%       0 47.9G 46.6G 0|0 1|0  33.6m   41.6m   28 Jul 12 07:45:08.780
     2     8     *0     *0       3    33|0  4.4% 79.9%       0 47.9G 46.6G 0|0 1|0  16.4k   34.0m   30 Jul 12 07:45:09.780
  8827     4      1     *0       0    17|0  4.6% 80.0%       0 47.9G 46.6G 0|0 1|0  74.6m   15.4m   28 Jul 12 07:45:10.780
    *0     5     *0     *0       2    26|0  4.6% 80.0%       0 47.9G 46.6G 0|0 1|0   455k   51.4m   31 Jul 12 07:45:11.780
     2     9      3     *0       1    44|0  4.6% 80.0%       0 47.9G 46.6G 0|0 1|0  1.35m   18.0m   25 Jul 12 07:45:12.780
    *0     4     *0     *0       2    12|0  4.6% 80.0%       0 47.9G 46.6G 0|0 1|0  3.88k   35.0m   28 Jul 12 07:45:13.780
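
For reference, output in the format above is what mongostat prints. A minimal sketch of capturing it at one-second intervals (host and port assumed to be the defaults):

mongostat --host localhost --port 27017 1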

Comment by Dame Jovanoski [ 12/Jul/19 ]

These are the settings we are using to tune mongo performance:

net.core.somaxconn = 4096
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_max_syn_backlog = 4096

limits:

mongod       soft        nproc        64000
mongod       hard        nproc        64000
mongod       soft        nofile       64000
mongod       hard        nofile       64000

vm.dirty_ratio = 10
vm.dirty_background_ratio = 3
vm.zone_reclaim_mode = 0
vm.swappiness = 10

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

ACTION=="add|change", KERNEL=="sd[c-z]", ATTR{queue/scheduler}="noop", ATTR{bdi/read_ahead_kb}="0"

We have also installed numa with some configuration.

We have created the disks with:

sudo mdadm --create /dev/md0 -c 256 --level 0 --raid-devices some_number disk_1 disk_2 ..
sudo mkfs.xfs /dev/md0

This is our whole setup for configuring our mongo instance.

The disks are using READ caching in Azure according to the recommendation:
https://docs.microsoft.com/en-us/azure/virtual-machines/windows/premium-storage-performance
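
For reference, a minimal sketch of double-checking that the settings above actually took effect on a VM (assuming the array is /dev/md0 as in the mdadm command above):

cat /sys/kernel/mm/transparent_hugepage/enabled   # should show [never]
cat /sys/kernel/mm/transparent_hugepage/defrag    # should show [never]
cat /proc/mdstat                                  # RAID level and member disks
sudo blockdev --getra /dev/md0                    # effective read-ahead on the array
sysctl net.core.somaxconn vm.dirty_ratio vm.swappiness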

 

Comment by Dame Jovanoski [ 12/Jul/19 ]

I have managed to trigger the checksum error by running a .find({}) query and iterating over the whole collection, but it doesn't fail if I run the query again on the document it failed on.
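
For reference, a minimal sketch of reproducing the sequential read and then running a full validation on the affected collection (DATABASE and collection_name are placeholders):

mongo DATABASE --eval 'db.collection_name.find({}).forEach(function(d) {})'
mongo DATABASE --eval 'printjson(db.collection_name.validate(true))'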

Comment by Dame Jovanoski [ 12/Jul/19 ]

We use secondaries for reading, but the setup doesn't matter, replica or single node; we are just trying everything we can to find the problem. We tried
the replica setup since we thought the load was the issue, but that might not be it. We realized the following things:

  • The problem occurs when reading
  • No indication of any problems when writing
  • Random failures (checksum errors) when reading the data sequentially.

Comment by Danny Hatcher (Inactive) [ 11/Jul/19 ]

Are you doing Secondary reads or are all reads going to the Primary?

Comment by Dame Jovanoski [ 11/Jul/19 ]

We have updated it today and will check tomorrow whether we get new logs with the new driver. However, it is a bit weird; the server should really do validation on the data provided to it.

There are also logs suggesting that something weird is going on:

2019-07-11T07:47:42.125+0000 I NETWORK  [conn794965] received client metadata from 10.46.124.35:40040 conn794965: { driver: { name: " ", version: "^A" }, os: { type: "h�f�9^?", name: "h|*�9^?", version: "�w*�9^?", architecture: "����9^?" }, platform: "P�2;�U" }
2019-07-11T08:58:44.598+0000 I NETWORK  [conn797655] received client metadata from 10.46.11.114:33276 conn797655: { driver: { name: " ", version: " " }, os: { type: "T{(17)<i:some_function:<i:some_function:}", name: "���^?", version: "", architecture: "^P" }, platform: "U^D^F��^?" }
2019-07-11T09:07:45.809+0000 I NETWORK  [conn803805] received client metadata from 10.46.11.114:33404 conn803805: { driver: { name: " ", version: "6" }, os: { type: "T{(17)<i:some_function:<i:some_function:}", name: "^P", version: "", architecture: " " }, platform: "^G" }
2019-07-11T10:24:21.730+0000 I NETWORK  [conn805476] received client metadata from 10.46.11.114:34476 conn805476: { driver: { name: " ", version: "@W�^A" }, os: { type: "", name: "�ik��^?", version: "^A", architecture: "T{<i:some_function:<f:some_function:}" }, platform: "_time_us:<I:min_time_us:<f:min_action:<f:max_action:}" }

Also, even with the mongoc update the server crashed, but we will keep looking through the logs.

Regarding our setup, the one we currently use has two secondaries and one primary. The crashes are random, so first one of the secondaries crashes
with a checksum error, then the rest. All servers end up crashing with a checksum error.
 

Comment by Danny Hatcher (Inactive) [ 11/Jul/19 ]

Thanks for clarifying that this is in fact a completely new replica set and that it still encounters this issue. I think your idea about the driver is a good one. I see from other commands that your C driver version is 1.9.2, which is a little old. Could you upgrade to 1.14.0, which is the most recent release? If that is too big a jump, would going to at least 1.9.5 be possible?
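
For reference, a minimal sketch of checking which mongoc C driver version is installed on the client hosts (assuming the library was installed with its pkg-config metadata):

pkg-config --modversion libmongoc-1.0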

Comment by Dame Jovanoski [ 11/Jul/19 ]

After looking at the logs for some time now, it seems like the database might be crashing from corrupted JSON which is sent via the mongoc driver:

2019-07-05T08:25:44.309+0000 D COMMAND  [conn83585] run command admin.$cmd { isMaster: 1, client: { driver: { name: " ", version: "�h^L^T�^?" }, os: { type: "`����^?", name: "�^B^U��^?", version:
 "Ю^O��^?", architecture: "�R&��^?" }, platform: " " }, compression: [], $readPreference: { mode: "secondaryPreferred" }, $db: "admin" }
2019-07-05T08:25:44.310+0000 I NETWORK  [conn83585] received client metadata from 10.46.18.76:58020 conn83585: { driver: { name: " ", version: "�h^L^T�^?" }, os: { type: "`����^?", name: "�^B^U��
^?", version: "Ю^O��^?", architecture: "�R&��^?" }, platform: " " }
2019-07-05T08:25:44.310+0000 I COMMAND  [conn83585] command admin.$cmd command: isMaster { isMaster: 1, client: { driver: { name: " ", version: "�h^L^T�^?" }, os: { type: "`����^?", name: "�^B^U�
�^?", version: "Ю^O��^?", architecture: "�R&��^?" }, platform: " " }, compression: [], $readPreference: { mode: "secondaryPreferred" }, $db: "admin" } numYields:0 reslen:239 locks:{} protocol:op_
query 0ms
2019-07-05T09:28:06.990+0000 D COMMAND  [conn119666] run command admin.$cmd { isMaster: 1, client: { driver: { name: " ", version: "" }, os: { type: "E�-�K^BESC�^N�A��ؽ�^?!c^Q^F^Q3^A^TAYF�^PH���
^K�b-dk� 5���Ο^Y��^L�S"]�^G^B�", name: "P���LV", version: "^^", architecture: "Ђ�&^S^?" }, platform: " " }, compression: [], $readPreference: { mode: "secondaryPreferred" }, $db: "admin" }
2019-07-05T09:28:06.990+0000 I NETWORK  [conn119666] received client metadata from 10.46.18.83:34746 conn119666: { driver: { name: " ", version: "" }, os: { type: "E�-�K^BESC�^N�A��ؽ�^?!c^Q^F^Q3
^A^TAYF�^PH���^K�b-dk� 5���Ο^Y��^L�S"]�^G^B�", name: "P���LV", version: "^^", architecture: "Ђ�&^S^?" }, platform: " " }
2019-07-05T09:28:06.990+0000 I COMMAND  [conn119666] command admin.$cmd command: isMaster { isMaster: 1, client: { driver: { name: " ", version: "" }, os: { type: "E�-�K^BESC�^N�A��ؽ�^?!c^Q^F^Q3
^A^TAYF�^PH���^K�b-dk� 5���Ο^Y��^L�S"]�^G^B�", name: "P���LV", version: "^^", architecture: "Ђ�&^S^?" }, platform: " " }, compression: [], $readPreference: { mode: "secondaryPreferred" }, $db: "admin" } numYields:0 reslen:239 locks:{} protocol:op_query 0ms
2019-07-05T09:28:22.737+0000 D COMMAND  [conn119683] run command admin.$cmd { isMaster: 1, client: { driver: { name: " ", version: "" }, os: { type: "�V��^R^?", name: "P���LV", version: "^^", architecture: "Ђ�&^S^?" }, platform: " " }, compression: [], $readPreference: { mode: "secondaryPreferred" }, $db: "admin" }
2019-07-05T09:28:22.738+0000 I NETWORK  [conn119683] received client metadata from 10.46.18.83:34788 conn119683: { driver: { name: " ", version: "" }, os: { type: "�V��^R^?", name: "P���LV", version: "^^", architecture: "Ђ�&^S^?" }, platform: " " }
2019-07-05T09:28:22.738+0000 I COMMAND  [conn119683] command admin.$cmd command: isMaster { isMaster: 1, client: { driver: { name: " ", version: "" }, os: { type: "�V��^R^?", name: "P���LV", version: "^^", architecture: "Ђ�&^S^?" }, platform: " " }, compression: [], $readPreference: { mode: "secondaryPreferred" }, $db: "admin" } numYields:0 reslen:239 locks:{} protocol:op_query 0ms
2019-07-05T09:28:43.633+0000 D COMMAND  [conn119706] run command admin.$cmd { isMaster: 1, client: { driver: { name: " ", version: "" }, os: { type: "�V��^R^?", name: "P���LV", version: "^^", architecture: "Ђ�&^S^?" }, platform: " " }, compression: [], $readPreference: { mode: "secondaryPreferred" }, $db: "admin" }
2019-07-05T09:28:43.633+0000 I NETWORK  [conn119706] received client metadata from 10.46.18.83:34834 conn119706: { driver: { name: " ", version: "" }, os: { type: "�V��^R^?", name: "P���LV", version: "^^", architecture: "Ђ�&^S^?" }, platform: " " }
2019-07-05T09:28:43.633+0000 I COMMAND  [conn119706] command admin.$cmd command: isMaster { isMaster: 1, client: { driver: { name: " ", version: "" }, os: { type: "�V��^R^?", name: "P���LV", version: "^^", architecture: "Ђ�&^S^?" }, platform: " " }, compression: [], $readPreference: { mode: "secondaryPreferred" }, $db: "admin" } numYields:0 reslen:239 locks:{} protocol:op_query 0ms
2019-07-05T12:49:55.038+0000 D COMMAND  [conn328406] run command admin.$cmd { isMaster: 1, client: { driver: { name: " ", version: "p��NS^?" }, os: { type: "��4'S^?", name: "���%S^?", version: "P�;5S^?", architecture: " ^R�ÃU" }, platform: "0" }, compression: [], $readPreference: { mode: "secondaryPreferred" }, $db: "admin" }
2019-07-05T12:49:55.038+0000 I NETWORK  [conn328406] received client metadata from 10.46.18.76:33894 conn328406: { driver: { name: " ", version: "p��NS^?" }, os: { type: "��4'S^?", name: "���%S^?", version: "P�;5S^?", architecture: " ^R�ÃU" }, platform: "0" }
2019-07-05T12:49:55.038+0000 I COMMAND  [conn328406] command admin.$cmd command: isMaster { isMaster: 1, client: { driver: { name: " ", version: "p��NS^?" }, os: { type: "��4'S^?", name: "���%S^?", version: "P�;5S^?", architecture: " ^R�ÃU" }, platform: "0" }, compression: [], $readPreference: { mode: "secondaryPreferred" }, $db: "admin" } numYields:0 reslen:239 locks:{} protocol:op_query 0ms
2019-07-05T12:49:58.441+0000 D COMMAND  [conn328417] run command admin.$cmd { isMaster: 1, client: { driver: { name: " ", version: "p��NS^?" }, os: { type: "��4'S^?", name: "���%S^?", version: "P�;5S^?", architecture: " ^R�ÃU" }, platform: "0" }, compression: [], $readPreference: { mode: "secondaryPreferred" }, $db: "admin" }
2019-07-05T12:49:58.441+0000 I NETWORK  [conn328417] received client metadata from 10.46.18.76:33918 conn328417: { driver: { name: " ", version: "p��NS^?" }, os: { type: "��4'S^?", name: "���%S^?", version: "P�;5S^?", architecture: " ^R�ÃU" }, platform: "0" }
2019-07-05T12:49:58.441+0000 I COMMAND  [conn328417] command admin.$cmd command: isMaster { isMaster: 1, client: { driver: { name: " ", version: "p��NS^?" }, os: { type: "��4'S^?", name: "���%S^?", version: "P�;5S^?", architecture: " ^R�ÃU" }, platform: "0" }, compression: [], $readPreference: { mode: "secondaryPreferred" }, $db: "admin" } numYields:0 reslen:239 locks:{} protocol:op_query 0ms

We are still not sure if this is the problem; we are investigating adding more validation to the BSON that is sent to the server.

Also this:

2019-07-09T12:16:35.396+0000 I NETWORK  [conn15534] received client metadata from 10.46.18.76:34358 conn15534: { driver: { name: " ", version: "ecordingId:<q:lastModifiedTimestamp_sec:}:db:&T{<i:type:}:v:}" }, os: { type: "0XL^T�^?", name: "���0�^?", version: "�W^^T�^?", architecture: "^P����U" }, platform: "^L^^security communication rootca1" }
2019-07-09T12:16:53.732+0000 I NETWORK  [conn15545] received client metadata from 10.46.18.76:34394 conn15545: { driver: { name: " ", version: "ecordingId:<q:lastModifiedTimestamp_sec:}:db:&T{<i:type:}:v:}" }, os: { type: "0XL^T�^?", name: "���0�^?", version: "�W^^T�^?", architecture: "^P����U" }, platform: "^L^^security communication rootca1" }
2019-07-09T12:17:15.225+0000 I NETWORK  [conn15556] received client metadata from 10.46.18.76:34430 conn15556: { driver: { name: " ", version: "ecordingId:<q:lastModifiedTimestamp_sec:}:db:&T{<i:type:}:v:}" }, os: { type: "0XL^T�^?", name: "���0�^?", version: "�W^^T�^?", architecture: "P�v��U" }, platform: "@" }
2019-07-09T12:17:34.769+0000 I NETWORK  [conn15567] received client metadata from 10.46.18.76:34464 conn15567: { driver: { name: " ", version: "ecordingId:<q:lastModifiedTimestamp_sec:}:db:&T{<i:type:}:v:}" }, os: { type: "0XL^T�^?", name: "���0�^?", version: "�W^^T�^?", architecture: "P�v��U" }, platform: "@" }

 

Comment by Dame Jovanoski [ 09/Jul/19 ]

I need to check if we have logs for that version with repair, but I'm not sure the repair matters at all, since we started two new instances with new disks and the same thing happened. So this issue happens regardless of new disks or mongodb version (3.6 or 4). By new disks I mean a totally new instance with new disks and nothing on them; after two days... checksum error.

Comment by Danny Hatcher (Inactive) [ 09/Jul/19 ]

> The repair stuff does its job, but this issue happens every-time even if we use new database with new disks. I would like to know more since this issue

I don't see a mention of the repair process being run in the log files you attached. Can you please launch the 4.0.10 mongod using the --repair option and see what happens?

You mentioned in an earlier comment that you were using the same data but new disks. If you are reusing the same data, even if it's on a new disk, the corruption may still be present. Do you see this issue reoccurring on completely brand-new data sets?

Comment by Dame Jovanoski [ 09/Jul/19 ]

I am compressing it now so I will forward the logs to the link provided.

Comment by Danny Hatcher (Inactive) [ 08/Jul/19 ]

Can you please provide the full mongod logs that cover a repair attempt, the subsequent starting of the mongod, and everything that happens until the error appears? I've generated a Secure Uploader that you can use as it has a higher file size limit than JIRA and only MongoDB engineers will be able to see the files uploaded.

Comment by Dame Jovanoski [ 08/Jul/19 ]

Yes, we are running it without replica mode, and yes, in the last log file I provided we are running MongoDB 4. However, I can't really understand how it can be a disk issue if we have a new VM in a pristine environment. We have tried this setup twice in Azure with RAID0, and I am struggling to understand why the database would corrupt data.

The repair does its job, but this issue happens every time, even if we use a new database with new disks. I would like to know more, since this issue
is causing a lot of problems for us.

Comment by Danny Hatcher (Inactive) [ 08/Jul/19 ]

Thanks for the different log files. Unfortunately, this does appear to be a case of data corruption. It will be difficult to tell what originally caused the data corruption but most likely a disk failed once and the corruption has been present since.

It does not appear that you are running in a replica set, is that correct? If you were using a replica set an initial sync may fix the issue but since this is a standalone that is not an option.

We have made improvements in our repair process in our 4.0 release but as your data started at 3.4 it may not fully upgrade without other issues. However, I think that's the best path forward. You mentioned that you've already tried 3.6. With your database up and running on 3.6, please shut it down and then restart with 4.0.10 binaries with the --repair option. Please let me know if you have any issues with the upgrade as they might be an easy fix.
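
For reference, a minimal sketch of the shutdown-and-repair path described above (the dbpath and the location of the 4.0.10 binaries are hypothetical):

# cleanly stop the running 3.6 mongod
mongod --dbpath /data/db --shutdown

# run the 4.0.10 binary against the same dbpath with --repair
/opt/mongodb-4.0.10/bin/mongod --dbpath /data/db --repair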

Comment by Dame Jovanoski [ 08/Jul/19 ]

20190608_outlog_mongodb_crash.log is the log file from a new server with 40TB = 10 disks x 4TB, ~75k IOPS.

Comment by Dame Jovanoski [ 04/Jul/19 ]

After spending more than 20h on this issue, we (our team) can say that we don't really know what the problem is or if the network really is the issue. The only thing we can see is
the same error over and over again, showing up on different collections and databases (checksum error in WiredTiger). We are using RAID0 for our disks.

We have tried the following scenarios:
1. New VM (224GB) with new RAID0 disks (4TB x 7 disks, 7 * 7500 = 52500 IOPS), with old data (~7.4TB)
2. Old VM (112GB) resized to 224GB, with 10TB = 1TB x 10 disks, IOPS 10 * 5000 = 50000 (clean disks and old data, ~7.4TB)

Comment by Dame Jovanoski [ 02/Jul/19 ]

We are not sure if it is the network or the disks. We contacted Azure and they say that everything is good, but the database still goes down from time to time.

I can also give you a short description of the setup: we are running a single node with 250GB of RAM and 32 cores, using xfs on 7 disks x 4TB. Our setup
is based on the mongodb production recommendations:
https://docs.mongodb.com/manual/administration/production-notes/

And that's most of it; we haven't really added anything else on top.

Comment by Dame Jovanoski [ 02/Jul/19 ]

We instead started running mongodb with --network=host, since we think the bridge could not handle the amount of data, but we still have the same problem with the block checksum.
Lastly, we noticed this message in the syslog:

kernel: [495527.195017] traps: WTJourn.Flusher[55777] general protection ip:7f93b3f31196 sp:7f93ae6ed4d0 error:0 in libc-2.23.so[7f93b3efa000+1c0000]

 
We also upgraded mongodb to 3.6 and started with a clean database and new disks, but still
the same problem with the checksum.
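
For reference, a minimal sketch of the host-networking setup mentioned above, i.e. running mongod in a container with --network=host (image tag and data path are hypothetical):

docker run -d --network=host \
  -v /data/db:/data/db \
  mongo:3.6 \
  mongod --dbpath /data/db --bind_ip_all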

Comment by Dame Jovanoski [ 02/Jul/19 ]

I have attached one of the logs. This has happened multiple times, and every time it is the same error; we repaired the database every time.

Comment by Danny Hatcher (Inactive) [ 25/Jun/19 ]

Additionally, please provide the complete mongod.log covering the event including multiple hours beforehand.

Comment by Danny Hatcher (Inactive) [ 25/Jun/19 ]

The stack trace you provided seems to imply a disk issue rather than a networking issue. Could you please provide more context on the exact problem you experienced with your VM?

Comment by Dame Jovanoski [ 25/Jun/19 ]

We are using mongodb 3.4.1
