[SERVER-23328] mongodb read error: failed to read 8589938688 bytes at offset 3739193344: WT_ERROR: non-specific WiredTiger error Created: 24/Mar/16  Updated: 23/Apr/16  Resolved: 22/Apr/16

Status: Closed
Project: Core Server
Component/s: Admin, WiredTiger
Affects Version/s: 3.2.0
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Laurent Eon Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongod.log.2016-04-13T10-52-51     File mongod.log.2016-04-13T10-53-04     PNG File screenshot-graissage.png     PNG File screenshot-serfifi.png    
Issue Links:
Depends
depends on WT-2569 win_handle_read should always call Ge... Closed
Related
related to WT-2521 Enhance error logging on Windows Closed
is related to WT-2522 Incorrect format code in message Closed
Operating System: ALL
Participants:

 Description   

Hello,
I'm using a script to import data into a MongoDB collection. I use mongoimport to insert several CSV files into the collection, but randomly, MongoDB crashes during the import with the logs below.

The collection is pretty big, with 26,000,000 documents and 8 GB of data.

Do you have any idea what is going on? I use MongoDB 3.2 on a Windows server
and mongoimport on Debian Linux.

2016-03-24T06:30:09.043+0100 I NETWORK [conn2539] end connection 10.16.194.242:57880 (167 connections now open)
2016-03-24T06:30:09.069+0100 I NETWORK [initandlisten] connection accepted from 10.16.194.242:57881 #2540 (168 connections now open)
2016-03-24T06:30:09.080+0100 I NETWORK [conn2540] end connection 10.16.194.242:57881 (167 connections now open)
2016-03-24T06:30:09.097+0100 I NETWORK [initandlisten] connection accepted from 10.16.194.242:57882 #2541 (168 connections now open)
2016-03-24T06:30:09.101+0100 I NETWORK [conn2541] end connection 10.16.194.242:57882 (167 connections now open)
2016-03-24T06:30:09.127+0100 I NETWORK [initandlisten] connection accepted from 10.16.194.242:57883 #2542 (168 connections now open)
2016-03-24T06:30:09.132+0100 I NETWORK [conn2542] end connection 10.16.194.242:57883 (167 connections now open)
2016-03-24T06:30:09.150+0100 I NETWORK [initandlisten] connection accepted from 10.16.194.242:57884 #2543 (168 connections now open)
2016-03-24T06:30:09.203+0100 E STORAGE [conn2543] WiredTiger (-31802) [1458797409:162043][5088:1998468096], file:collection-0-1903427317893030266.wt, WT_CURSOR.next: collection-0-1903427317893030266.wt read error: failed to read 8589938688 bytes at offset 3739193344: WT_ERROR: non-specific WiredTiger error
2016-03-24T06:30:09.211+0100 I - [conn2543] Invariant failure: advanceRet resulted in status UnknownError -31802: WT_ERROR: non-specific WiredTiger error at src\mongo\db\storage\wiredtiger\wiredtiger_record_store.cpp 481
2016-03-24T06:30:13.601+0100 I CONTROL [conn2543] mongod.exe ...\src\mongo\util\stacktrace_windows.cpp(174) mongo::printStackTrace+0x43
2016-03-24T06:30:13.601+0100 I CONTROL [conn2543] mongod.exe ...\src\mongo\util\log.cpp(134) mongo::logContext+0xac
2016-03-24T06:30:13.601+0100 I CONTROL [conn2543] mongod.exe ...\src\mongo\util\assert_util.cpp(163) mongo::invariantOKFailed+0x139
2016-03-24T06:30:13.601+0100 I CONTROL [conn2543] mongod.exe ...\src\mongo\db\storage\wiredtiger\wiredtiger_record_store.cpp(481) mongo::WiredTigerRecordStore::Cursor::next+0x1dd
2016-03-24T06:30:13.601+0100 I CONTROL [conn2543] mongod.exe ...\src\mongo\db\exec\collection_scan.cpp(139) mongo::CollectionScan::work+0x589
2016-03-24T06:30:13.601+0100 I CONTROL [conn2543] mongod.exe ...\src\mongo\db\query\plan_executor.cpp(393) mongo::PlanExecutor::getNextImpl+0x2f4
2016-03-24T06:30:13.601+0100 I CONTROL [conn2543] mongod.exe ...\src\mongo\db\query\plan_executor.cpp(322) mongo::PlanExecutor::getNext+0x41
2016-03-24T06:30:13.608+0100 I CONTROL [conn2543] mongod.exe ...\src\mongo\db\query\find.cpp(589) mongo::runQuery+0x7a0
2016-03-24T06:30:13.608+0100 I CONTROL [conn2543] mongod.exe ...\src\mongo\db\instance.cpp(401) mongo::receivedQuery+0x17d
2016-03-24T06:30:13.608+0100 I CONTROL [conn2543] mongod.exe ...\src\mongo\db\instance.cpp(531) mongo::assembleResponse+0x916
2016-03-24T06:30:13.608+0100 I CONTROL [conn2543] mongod.exe ...\src\mongo\db\db.cpp(171) mongo::MyMessageHandler::process+0xce
2016-03-24T06:30:13.608+0100 I CONTROL [conn2543] mongod.exe ...\src\mongo\util\net\message_server_port.cpp(231) mongo::PortMessageServer::handleIncomingMsg+0x4a7
2016-03-24T06:30:13.608+0100 I CONTROL [conn2543] mongod.exe c:\program files (x86)\microsoft visual studio 12.0\vc\include\thr\xthread(188) std::_LaunchPad,mongo::RangeDeleter * __ptr64 const> > >::_Go+0x1d
2016-03-24T06:30:13.608+0100 I CONTROL [conn2543] MSVCP120.dll std::_Pad::_Release+0x6c
2016-03-24T06:30:13.609+0100 I CONTROL [conn2543] MSVCR120.dll beginthreadex+0x107
2016-03-24T06:30:13.609+0100 I CONTROL [conn2543] MSVCR120.dll endthreadex+0x192
2016-03-24T06:30:13.609+0100 I CONTROL [conn2543] kernel32.dll BaseThreadInitThunk+0xd
2016-03-24T06:30:13.609+0100 I CONTROL [conn2543]
2016-03-24T06:30:13.609+0100 I - [conn2543]



 Comments   
Comment by Ramon Fernandez Marina [ 22/Apr/16 ]

Thanks for the update kamaileon, and glad to hear MongoDB is working well. Please let us know if the issue resurfaces.

Comment by Laurent Eon [ 20/Apr/16 ]

OK, so since I moved MongoDB to Debian, there is no longer any problem, and no need to restart MongoDB each day.
Furthermore, the scripts run faster...

So it's all good for me now
Thank you

Comment by Laurent Eon [ 14/Apr/16 ]

I'm going to try to move MongoDB to a Debian Linux virtual machine... (still under Hyper-V)

Comment by Laurent Eon [ 14/Apr/16 ]

MongoDB is installed on a virtual server (Hyper-V), and E: is a virtual drive.

Comment by Michael Cahill (Inactive) [ 14/Apr/16 ]

kamaileon, I was reviewing this issue today and wanted to ask a question. I note that MongoDB data is stored on your E: drive – is that a locally connected disk or a network mount? If a network mount, can you please give details of the NAS?

Comment by Laurent Eon [ 13/Apr/16 ]

In the log files, you can see this error:
2016-04-13T06:30:53.478+0200 E STORAGE [thread2] WiredTiger (-28967) [1460521853:464213][26636:2000696320], log-server: journal/WiredTigerLog.0000002469 write error: failed to write 49280 bytes at offset 18389632: The process cannot access the file because another process has locked a portion of the file.

2016-04-13T06:30:53.478+0200 E STORAGE [thread2] WiredTiger (-28967) [1460521853:478862][26636:2000696320], log-server: log server error: The process cannot access the file because another process has locked a portion of the file.

Comment by Laurent Eon [ 13/Apr/16 ]

I discovered a new issue this morning.
03:00: reboot of the MongoDB service: OK
06:30: beginning of the import script
06:30:52: the import script seems locked
07:30: beginning of a load script; the load script seems very slow
12:52: I see that the server is very slow and that the import script and the load script are still running; my RAM is fully used (26 GB)
I tried to restart the MongoDB service manually: failed
Then I had to kill the mongod process
I restarted the MongoDB service: OK
I ran the import script: OK
I ran the load script: OK

I have attached the MongoDB log files.

Comment by Laurent Eon [ 06/Apr/16 ]

3.3 screenshot-graissage shows another script, which doesn't use MongoDB.

Comment by Laurent Eon [ 06/Apr/16 ]

1. No, and each time, restarting MongoDB solves the issue.
2. Yes, since I installed MongoDB. Never before.
3. I have two scripts we can analyse and compare.
3.1 The first script runs 40 aggregate queries against the MongoDB database (in the collection with the issue described above); a timing sketch follows these answers.

  • Today the script started at 08:15:55 and ended at 08:29:38 (MongoDB was restarted at 03:00).
  • Yesterday it started at 08:15:55 and hadn't finished when I restarted MongoDB at 08:37:41.

3.2 The second script (called serfifi) opens a file and parses it every five minutes (it doesn't use the MongoDB database).

  • You can see in screenshot-serfifi that the script usually runs for 6 seconds, but between 08:21:12 and 08:36:21 yesterday the duration increased to 13 seconds (i.e. during the execution of the first script).

4. Done
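
For reference, a minimal mongo shell sketch for timing one of these aggregations and quantifying the slowdown; the database and collection names are taken from the validate output elsewhere in this ticket, and the pipeline is only a placeholder, since the real queries in the import script aren't shown:

// Mongo shell sketch: time a single aggregation on the affected collection.
// The pipeline is a placeholder; substitute one of the real 40 aggregate queries.
var coll = db.getSiblingDB("boiteNoire").getCollection("Trames_ETH_HMI_C1_STS_M.2016-03");
var start = new Date();
coll.aggregate([
    { $match: { DateExploitation: { $exists: true } } },  // placeholder filter
    { $group: { _id: null, count: { $sum: 1 } } }         // placeholder grouping
]).toArray();
print("aggregation took " + (new Date() - start) + " ms");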

One question: is MongoDB more stable on a Linux distribution?

Comment by Kelsey Schubert [ 05/Apr/16 ]

Hi kamaileon,

  1. Is this the first time you have observed this slowdown?
  2. Has your RAM been fully utilized in the past?
  3. Can you quantify the slowdown, e.g. the increase in query latency?
  4. So we can investigate this issue, can you please upload the diagnostic.data covering this period? Please either attach it to this ticket or use the same upload portal as before.

Thank you,
Thomas

Comment by Laurent Eon [ 05/Apr/16 ]

I don't know if there is a link or if it's another problem, but since the MongoDB Windows service wasn't restarted this morning, my RAM was fully used (26 GB)
and my whole server was very slow until I restarted the MongoDB service.
I saw in the documentation that MongoDB uses all available memory on the machine as its cache, but it was as if MongoDB didn't release that cached memory...
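
As a hedged sketch of how to check that, the configured WiredTiger cache ceiling and current usage can be compared from the mongo shell (the field names below are the ones reported in the serverStatus().wiredTiger.cache section on 3.2 and may differ between versions); on 3.2 the ceiling can also be lowered with the storage.wiredTiger.engineConfig.cacheSizeGB setting:

// Mongo shell sketch: compare the configured WiredTiger cache ceiling with current usage.
// Field names are the ones reported by serverStatus() on 3.2; adjust if yours differ.
var cache = db.serverStatus().wiredTiger.cache;
var gb = 1024 * 1024 * 1024;
print("cache configured max : " + (cache["maximum bytes configured"] / gb).toFixed(1) + " GB");
print("cache currently used : " + (cache["bytes currently in the cache"] / gb).toFixed(1) + " GB");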

Comment by Laurent Eon [ 05/Apr/16 ]

Yesterday I disabled the nightly restart of MongoDB to try to reproduce the problem.
But this morning, MongoDB hadn't crashed.
However, since 1st April the import script has been filling a new collection (my collection changes every month), so it's a smaller collection now.
The problem may reappear when the collection gets bigger.
Wait and see...

Comment by Kelsey Schubert [ 31/Mar/16 ]

Hi kamaileon,

Thank you for uploading the additional files. We have investigated the files and diagnostic data that you have uploaded. I'd like to summarize the behavior you are observing.

Each morning, updates are imported to your MongoDB server using mongoimport. If the server has not been restarted recently, a query executed during this process may encounter a WiredTiger read error. However, if the server has been restarted recently, queries executed during the mongoimport process do not encounter any errors. My understanding is that both of these behaviors are readily reproducible.

It would greatly help our investigation if you could reproduce this issue and upload the files before restarting the server. We would need a copy of the files you have uploaded (all WiredTiger files, journal files, diagnostic.data and the collection mentioned in the error message) before it is restarted. This will allow us to see the state of the files and metadata before recovery is run.

Thank you again for your help,
Thomas

Comment by Laurent Eon [ 30/Mar/16 ]

And here is the result with the {full:true} option:

{
    "ns" : "boiteNoire.Trames_ETH_HMI_C1_STS_M.2016-03",
    "nrecords" : NumberInt(31521249),
    "nIndexes" : NumberInt(3),
    "keysPerIndex" : {
        "boiteNoire.Trames_ETH_HMI_C1_STS_M.2016-03.$_id_" : NumberInt(31521249),
        "boiteNoire.Trames_ETH_HMI_C1_STS_M.2016-03.$TBA_1_DateHeure_1" : NumberInt(31521249),
        "boiteNoire.Trames_ETH_HMI_C1_STS_M.2016-03.$DateExploitation_1" : NumberInt(31521249)
    },
    "indexDetails" : {
        "boiteNoire.Trames_ETH_HMI_C1_STS_M.2016-03.$_id_" : { "valid" : true },
        "boiteNoire.Trames_ETH_HMI_C1_STS_M.2016-03.$TBA_1_DateHeure_1" : { "valid" : true },
        "boiteNoire.Trames_ETH_HMI_C1_STS_M.2016-03.$DateExploitation_1" : { "valid" : true }
    },
    "valid" : true,
    "errors" : [ ],
    "ok" : NumberInt(1)
}

Comment by Laurent Eon [ 30/Mar/16 ]

Hello, I can't send you all the files in the dbpath because there are other big collections not related to our problem, but I have uploaded all the files I guessed were necessary. Tell me if you need any other specific files.

I temporarily worked around the issue by restarting MongoDB daily at 3:00 am.

Here is the result of the validate command:
{
    "ns" : "boiteNoire.Trames_ETH_HMI_C1_STS_M.2016-03",
    "nrecords" : NumberInt(31521249),
    "nIndexes" : NumberInt(3),
    "keysPerIndex" : {
        "boiteNoire.Trames_ETH_HMI_C1_STS_M.2016-03.$_id_" : NumberInt(31521249),
        "boiteNoire.Trames_ETH_HMI_C1_STS_M.2016-03.$TBA_1_DateHeure_1" : NumberInt(31521249),
        "boiteNoire.Trames_ETH_HMI_C1_STS_M.2016-03.$DateExploitation_1" : NumberInt(31521249)
    },
    "valid" : true,
    "errors" : [ ],
    "warning" : "Some checks omitted for speed. use {full:true} option to do more thorough scan.",
    "ok" : NumberInt(1)
}

Comment by Kelsey Schubert [ 29/Mar/16 ]

Hi kamaileon,

Thank you for uploading the logs to the portal. In order to reproduce this behavior on our side, we would need the complete contents of the dbpath in addition to the collection you have previously uploaded.

This error may indicate some kind of corruption in your data files. To investigate, can you please execute db.collection.validate() on the affected collection and post the output?
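
A minimal mongo shell sketch of that check, using the database and collection names that appear elsewhere in this ticket (adjust them if yours differ):

// Mongo shell sketch: validate the affected collection.
// validate() does a quick check; {full: true} also scans the data files (slower).
var coll = db.getSiblingDB("boiteNoire").getCollection("Trames_ETH_HMI_C1_STS_M.2016-03");
printjson(coll.validate());
printjson(coll.validate({ full: true }));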

Thank you,
Thomas

Comment by Laurent Eon [ 25/Mar/16 ]

Hello, thank you for your quick response.
I'm uploading a multipart zip file of the collection to your upload portal.
There are two weird things:
1 - This import worked fine for two months, without any problem, but now it crashes every time my daily import script starts.
2 - When the import script starts automatically at 06:30, MongoDB crashes as I said, but when I restart the MongoDB service and restart my import script manually, then everything works fine.

I have also attached the two MongoDB log files for yesterday and today. You can see at 06:30 the start of my import script, and the MongoDB crash at 06:32. Then the MongoDB service restarts at 07:32, and from 08:42 to 09:11 you can see my import script working fine on the database.

Comment by Ramon Fernandez Marina [ 24/Mar/16 ]

kamaileon, the scenario you describe should work well. Would you be able to share the dataset with us so we can try to reproduce the issue on our end?

I've created a private, secure upload portal you can use to send us your data. It will only be accessible to MongoDB staff investigating this ticket.

Note that the portal has a 5GB upload limit, but on your linux box you can use split to break down the dataset into smaller files:

split -d -b 5300000000 dataset.tgz part.

and then upload all the part.* files.

Thanks,
Ramón.
