I've seen this problem as well on a production cluster but I haven't yet been able to reproduce it in isolation. This created big problems for us, especially because it wasn't at first clear that profiling was to blame.
We have a 9-machine MongoDB 2.2.0 cluster with 3 config servers and 6 data nodes split into 3 shards, plus an arbiter on each of the config nodes. I ran a script to set the profiling level to 2 for one particular database on each of the 6 data nodes. After running with this setting for some time, and ONLY when we had heavy load on the cluster, these machines would enter an unusable state that appeared to be caused by corruption in one of the collections within that database, possibly the system.profile collection. In this state, the mongod process would stay up, but one or more collections in the db would become partially or completely unqueryable. Sometimes we could still do an indexed find but not a collection scan; other times every query failed. The logs during this period either looked like the one above or contained errors such as "invalid BSONObject size".
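For reference, the script I ran was roughly equivalent to the following sketch (not our exact script; the host list and database name below are placeholders):

# Rough sketch: enable profiling level 2 on one database on each data node.
# "data-node-N:27018" and "mydb" are placeholders, not our real names.
from pymongo import MongoClient

DATA_NODES = ["data-node-1:27018", "data-node-2:27018"]  # placeholder host:port list
DB_NAME = "mydb"                                          # placeholder database name

for node in DATA_NODES:
    client = MongoClient(node)
    # The 'profile' database command sets the profiling level; 2 profiles all operations.
    previous = client[DB_NAME].command("profile", 2)
    print(node, "previous profiling level:", previous.get("was"))
    client.close()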
To get out of this state, we could sometimes recover by restarting the daemon, sometimes by repairing the database, and sometimes not at all. Every machine with profiling on would eventually enter this state if the cluster had heavy load. The problem occurred even after blowing away the entire state of the cluster and starting from scratch. It took a while to determine exactly what was causing the corruption, but we haven't seen the issue with profiling off, and if I turn it on for a single data node, we inevitably see corruption occur there within an hour.
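When we did attempt a repair, it was done roughly like this (again just a sketch; host and database name are placeholders):

# Sketch of a repair attempt on one affected data node (placeholder host/db).
from pymongo import MongoClient

client = MongoClient("data-node-1:27018")
# repairDatabase rebuilds the database files; it can take a long time on large data.
result = client["mydb"].command("repairDatabase")
print(result)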
The workaround for us is to never turn on profiling. This is annoying because we would like to use it and because we never had any profiling issues on 2.0.4. Please let me know if there is any more information I can provide.