[SERVER-38086] MongoDB cannot start after repair Created: 12/Nov/18  Updated: 05/Dec/18  Resolved: 05/Dec/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Pavel Zeger [X] Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongod_15112018.log     Text File mongodb_repair_log.txt    
Operating System: ALL
Sprint: Storage NYC 2018-11-19
Participants:

 Description   

Hello.

I installed a MongoDB 4.0.4 standalone server in order to perform a migration from Windows to CentOS. After of an unexpected shutdown of the server (virtualization issue) I cannot start a MongoDB server. I perform a repair procedure as described in an documentation: mongod --repair --directoryperdb --dbpath /opt/mongodb/data and assign permissions to mongod again after that by chown.

I added a repair log here.

Any way MOngoDB cannot start and write the following output in the log:

2018-11-12T09:39:47.043+0000 I CONTROL [main] ***** SERVER RESTARTED *****
2018-11-12T09:39:47.069+0000 I NETWORK [main] --maxConns too high, can only handle 51200
2018-11-12T09:39:47.069+0000 I CONTROL [initandlisten] MongoDB starting : pid=2443 port=27017 dbpath=/opt/mongodb/data 64-bit host=uk1lv8818
2018-11-12T09:39:47.069+0000 I CONTROL [initandlisten] db version v4.0.4
2018-11-12T09:39:47.069+0000 I CONTROL [initandlisten] git version: f288a3bdf201007f3693c58e140056adf8b04839
2018-11-12T09:39:47.069+0000 I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
2018-11-12T09:39:47.069+0000 I CONTROL [initandlisten] allocator: tcmalloc
2018-11-12T09:39:47.069+0000 I CONTROL [initandlisten] modules: none
2018-11-12T09:39:47.069+0000 I CONTROL [initandlisten] build environment:
2018-11-12T09:39:47.070+0000 I CONTROL [initandlisten] distmod: rhel70
2018-11-12T09:39:47.070+0000 I CONTROL [initandlisten] distarch: x86_64
2018-11-12T09:39:47.070+0000 I CONTROL [initandlisten] target_arch: x86_64
2018-11-12T09:39:47.070+0000 I CONTROL [initandlisten] options: { config: "/etc/mongod.conf", net:

{ bindIp: "0.0.0.0", maxIncomingConnections: 65536, port: 27017 }

, processManagement: { fork: true, pidFilePath: "/var/run/mongodb/mongod.pid", timeZoneInfo: "/usr/share/zoneinfo" }, storage: { dbPath: "/opt/mongodb/data", directoryPerDB: true, engine: "wiredTiger", journal:

{ enabled: true }

}, systemLog: { destination: "file", logAppend: true, path: "/var/log/mongodb/mongod.log" } }
2018-11-12T09:39:47.070+0000 I NETWORK [initandlisten] --maxConns too high, can only handle 51200
2018-11-12T09:39:47.070+0000 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=31620M,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),statistics_log=(wait=0),verbose=(recovery_progress),
2018-11-12T09:39:48.453+0000 I STORAGE [initandlisten] WiredTiger message [1542015588:453710][2443:0x7fbedb4a3b80], txn-recover: Main recovery loop: starting at 9590/132608 to 9591/256
2018-11-12T09:39:48.455+0000 I STORAGE [initandlisten] WiredTiger message [1542015588:455315][2443:0x7fbedb4a3b80], txn-recover: Recovering log 9590 through 9591
2018-11-12T09:39:48.613+0000 I STORAGE [initandlisten] WiredTiger message [1542015588:613087][2443:0x7fbedb4a3b80], file:sizeStorer.wt, txn-recover: Recovering log 9591 through 9591
2018-11-12T09:39:48.680+0000 I STORAGE [initandlisten] WiredTiger message [1542015588:680738][2443:0x7fbedb4a3b80], file:sizeStorer.wt, txn-recover: Set global recovery timestamp: 0
2018-11-12T09:39:49.027+0000 I RECOVERY [initandlisten] WiredTiger recoveryTimestamp. Ts: Timestamp(0, 0)
2018-11-12T09:41:17.094+0000 I CONTROL [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2018-11-12T09:41:17.094+0000 I NETWORK [signalProcessingThread] shutdown: going to close listening sockets...
2018-11-12T09:41:17.094+0000 I NETWORK [signalProcessingThread] removing socket file: /tmp/mongodb-27017.sock
2018-11-12T09:41:17.094+0000 I CONTROL [signalProcessingThread] now exiting
2018-11-12T09:41:17.094+0000 I CONTROL [signalProcessingThread] shutting down with code:0

Need your help ASAP.

Regards,

Pavel



 Comments   
Comment by Pavel Zeger [X] [ 26/Nov/18 ]

Thanks. 

Unfortunately we encountered additionals issues with XFS filesystem and are investigating this problem right now. If it will be related to MongoDB and our application I'll open a new thread here.

Meanwhile we use the previous machine for our archiving purposes.

Regards,

Pavel

Comment by Kelsey Schubert [ 20/Nov/18 ]

Hi PavelZeger,

Thanks for these logs. When WiredTiger finds a partial metadata set it prints that informational message, skips that table and keeps going. So we are letting the cursor continue its cursor walk and complete. Consequently, these log lines do not indicate an issue that would prevent startup as MongoDB. In fact, we see that that MongoDB successfully started and began accepting connections.

If you're continuing to have problems with the start up service killing mongod, would you please provide the output of "systemctl status mongod.service" and "journalctl -xe"

Kind regards,
Kelsey

Comment by Pavel Zeger [X] [ 18/Nov/18 ]

Please pay attention to the following rows in the log about metadata in a collection: 

2018-11-18T08:56:38.409+0000 E STORAGE [initandlisten] WiredTiger error (-31803) [1542531398:409205][2439:0x7f58b4dbcb80], file:WiredTiger.wt, WT_CURSOR.next: __schema_create_collapse, 111: metadata information for source configuration "colgroup:MarketStreamer_QuoteProvider_Fortrade/index-1008-3010138089209235246" not found: WT_NOTFOUND: item not found Raw: [1542531398:409205][2439:0x7f58b4dbcb80], file:WiredTiger.wt, WT_CURSOR.next: __schema_create_collapse, 111: metadata information for source configuration "colgroup:MarketStreamer_QuoteProvider_Fortrade/index-1008-3010138089209235246" not found: WT_NOTFOUND: item not found
2018-11-18T08:56:38.409+0000 E STORAGE [initandlisten] WiredTiger error (-31803) [1542531398:409359][2439:0x7f58b4dbcb80], file:WiredTiger.wt, WT_CURSOR.next: __schema_create_collapse, 111: metadata information for source configuration "colgroup:MarketStreamer_QuoteProvider_Fortrade/index-1008-7389076813527555990" not found: WT_NOTFOUND: item not found Raw: [1542531398:409359][2439:0x7f58b4dbcb80], file:WiredTiger.wt, WT_CURSOR.next: __schema_create_collapse, 111: metadata information for source configuration "colgroup:MarketStreamer_QuoteProvider_Fortrade/index-1008-7389076813527555990" not found: WT_NOTFOUND: item not found

Also we have the same issue with the services in staging environments: after unexpected shutdown and repairing after that we cannot start the services.

 

Comment by Pavel Zeger [X] [ 18/Nov/18 ]

I started mongod with another option:

TimeoutSec=900

Here is the log:

2018-11-18T08:48:30.901+0000 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=31620M,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),statistics_log=(wait=0),verbose=(recovery_progress),
2018-11-18T08:48:32.434+0000 I STORAGE [initandlisten] WiredTiger message [1542530912:434891][2439:0x7f58b4dbcb80], txn-recover: Main recovery loop: starting at 9598/34048 to 9599/256
2018-11-18T08:48:32.436+0000 I STORAGE [initandlisten] WiredTiger message [1542530912:436156][2439:0x7f58b4dbcb80], txn-recover: Recovering log 9598 through 9599
2018-11-18T08:48:32.591+0000 I STORAGE [initandlisten] WiredTiger message [1542530912:591017][2439:0x7f58b4dbcb80], file:sizeStorer.wt, txn-recover: Recovering log 9599 through 9599
2018-11-18T08:48:32.651+0000 I STORAGE [initandlisten] WiredTiger message [1542530912:651574][2439:0x7f58b4dbcb80], file:sizeStorer.wt, txn-recover: Set global recovery timestamp: 0
2018-11-18T08:48:33.814+0000 I RECOVERY [initandlisten] WiredTiger recoveryTimestamp. Ts: Timestamp(0, 0)
2018-11-18T08:56:34.067+0000 I CONTROL [initandlisten]
2018-11-18T08:56:34.067+0000 I CONTROL [initandlisten] ** WARNING: Access control is not enabled for the database.
2018-11-18T08:56:34.067+0000 I CONTROL [initandlisten] ** Read and write access to data and configuration is unrestricted.
2018-11-18T08:56:34.067+0000 I CONTROL [initandlisten]
2018-11-18T08:56:38.409+0000 E STORAGE [initandlisten] WiredTiger error (-31803) [1542531398:409205][2439:0x7f58b4dbcb80], file:WiredTiger.wt, WT_CURSOR.next: __schema_create_collapse, 111: metadata information for source configuration "colgroup:MarketStreamer_QuoteProvider_Fortrade/index-1008-3010138089209235246" not found: WT_NOTFOUND: item not found Raw: [1542531398:409205][2439:0x7f58b4dbcb80], file:WiredTiger.wt, WT_CURSOR.next: __schema_create_collapse, 111: metadata information for source configuration "colgroup:MarketStreamer_QuoteProvider_Fortrade/index-1008-3010138089209235246" not found: WT_NOTFOUND: item not found
2018-11-18T08:56:38.409+0000 E STORAGE [initandlisten] WiredTiger error (-31803) [1542531398:409359][2439:0x7f58b4dbcb80], file:WiredTiger.wt, WT_CURSOR.next: __schema_create_collapse, 111: metadata information for source configuration "colgroup:MarketStreamer_QuoteProvider_Fortrade/index-1008-7389076813527555990" not found: WT_NOTFOUND: item not found Raw: [1542531398:409359][2439:0x7f58b4dbcb80], file:WiredTiger.wt, WT_CURSOR.next: __schema_create_collapse, 111: metadata information for source configuration "colgroup:MarketStreamer_QuoteProvider_Fortrade/index-1008-7389076813527555990" not found: WT_NOTFOUND: item not found
2018-11-18T08:58:51.906+0000 I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory '/opt/mongodb/data/diagnostic.data'
2018-11-18T08:58:51.941+0000 I NETWORK [initandlisten] waiting for connections on port 27017
2018-11-18T08:58:52.561+0000 I CONTROL [free_mon] Free Monitoring is Enabled. Frequency: 60 seconds
2018-11-18T08:58:56.127+0000 I NETWORK [listener] connection accepted from 127.0.0.1:27664 #1 (1 connection now open)

Comment by Pavel Zeger [X] [ 18/Nov/18 ]

I also tried this option:

TimeoutStartUSec=infinity

The result is:

 

[root@uk1lv8818 ~]# vim /etc/systemd/system/multi-user.target.wants/mongod.service
[root@uk1lv8818 ~]# systemctl daemon-reload
[root@uk1lv8818 ~]# time systemctl start mongod
Job for mongod.service failed because a timeout was exceeded. See "systemctl status mongod.service" and "journalctl -xe" for details.
real 1m30.279s
user 0m0.009s
sys 0m0.010s
[root@uk1lv8818 ~]#

 

 

 

Comment by Pavel Zeger [X] [ 18/Nov/18 ]

 I did it:

 

[root@uk1lv8818 ~]# vim /etc/systemd/system/multi-user.target.wants/mongod.service
[root@uk1lv8818 ~]# time systemctl start mongod
 Job for mongod.service failed because a timeout was exceeded. See "systemctl status mongod.service" and "journalctl -xe" for details.
real 0m0.030s
 user 0m0.009s
 sys 0m0.011s
[root@uk1lv8818 ~]#

 

 

Mongod still didn't start. I also run daemon-reload. mongo.lock is empty.Data directory still after successfull repair.

Comment by Kelsey Schubert [ 15/Nov/18 ]

Hi PavelZeger,

Thanks for the additional information. By default, systemd is setting the TimeoutStartSec to 90 seconds. In this case, mongod is taking longer than 90 seconds to start, resulting in systemd sending the SIGTERM. To allow mongod to complete its startup operations, please include the following setting in etc/systemd/system/multi-user.target.wants/mongod.service under the [Service] section:

TimeoutStartSec=infinity

Kind regards,
Kelsey

Comment by Pavel Zeger [X] [ 15/Nov/18 ]

I added a part of mongod.log when I started the service with a verbosity level of 5. Hope it will help to investigate the issue:

[root@uk1lv8818 ~]# time systemctl start mongod
Job for mongod.service failed because a timeout was exceeded. See "systemctl status mongod.service" and "journalctl -xe" for details.

real 1m30.045s
user 0m0.009s
sys 0m0.015s
[root@uk1lv8818 ~]#

 

Comment by Pavel Zeger [X] [ 15/Nov/18 ]

How can you close the issue if your platform cannot start at all??! The SIGTERM wasn't send by any user - the mongod wrote it each time when cannot start!

Can you solve it or no because we also want to migrate from MongoDB due to lack of stability of your platform. The mongod service cannot start and it's not important how time I can wait for it (I was waiting a whole day and won't wate). If it's your solution I'll suggest to all my clients simple to migrate from MongoDB because you cannot offer a stable data platform.

Comment by Kelsey Schubert [ 14/Nov/18 ]

I suspect that mongod is still progressing as part of its standard start-up routine. I would suggest allowing mongod more time to start up.

If you would like more informational logging, you may start mongod with logging set at verbosity 2, -vv, which should provide more granule updates about the actions that it is taking during this time.

Please let us know if you continue to encounter issues after taking the steps outline above and provide the more verbose logs so we can investigate.

Comment by Kelsey Schubert [ 14/Nov/18 ]

Hi PavelZeger,

Thanks for reporting this issue, we're investigating. It looks like the process is being killed by external SIGTERM, I assume you're issuing this because the mongod isn't making progress starting up, but am wondering how long you've waited before issuing this command.

Thanks,
Kelsey

Generated at Thu Feb 08 04:47:54 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.