[SERVER-17029] Fatal Assertion 16137 Created: 23/Jan/15 Updated: 12/Jun/15 Resolved: 12/Jun/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Concurrency |
| Affects Version/s: | 2.6.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | quantomass | Assignee: | David Storch |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL |
| Sprint: | Quint Iteration 4, Quint Iteration 5 |
| Participants: |
| Description |
|
Hello, we run a single mongod instance on Ubuntu 14.10 server. The version of db is Today mongod crashed with the following error:
Previous message in logs with this conn219322 is about 13 min before:
(note that I removed data from $set for privacy reasons). We communicate with db from our application via http://mongodb.github.io/node-mongodb-native/ driver version 1.3.15 |
| Comments |
| Comment by David Storch [ 12/Jun/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi quantomass, Unfortunately, after further investigating this issue, we have unable to reproduce or identify the exact root cause. I am going to close this ticket as "Cannot Reproduce". If you see this again or have further clues for reproducing the issue, please re-open the ticket and we will investigate further. As mentioned before, this appears to be a problem with yield recovery in 2.6. However, the code responsible for yielding and reacquiring coarse locks (i.e. global, database and collection-level locks) changed substantially between version 2.6 and version 3.0, including a completely rewritten lock manager. Therefore, I expect that this problem will not manifest on 3.0.x stable versions. Please let me know if you have any further questions. Best, | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andy Schwerin [ 03/Feb/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The stack trace actually indicates that the problem occurred on a find, not a findAndModify. Somehow, it appears that the database lock was released before the function returned, and then the RAII DBRead object also tried to unlock the lock. david.storch should take a look, to see if there's a problem in yield recovery somewhere. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by quantomass [ 27/Jan/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi, It's ok for us to share latest db backup, please tell if there will still be a need to run a test application server? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Ramon Fernandez Marina [ 26/Jan/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
quantomass, I'm still unable to trigger this assertion. I tried using iibench to create large databases/collections and add load; this is how my dataset looks like:
I launched 100 threads doing a multi-update and 10000 threads doing a findAndModify, both on bigcol1, but no assertion. I was going to ask you for more details about your application, but I'm afraid that may not be sufficient to reproduce the issue, so I'm working on other options. For example, would you be able to run a test environment with the same dataset and same application but at logLevel 2? It's very strange that an established connection triggers this assertion after some apparent inactivity. Thanks, | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by quantomass [ 25/Jan/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Here is our hardware info, say if you need something else. CPU:
Memory:
Disks:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Ramon Fernandez Marina [ 25/Jan/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Thanks for the additional info quantomass. I'll work on reproducing the issue on my end by creating large collections and running some queries, but it may be very hard to trigger the same assertion. If I'm unsuccessful, would you consider sharing your database files with us? If the answer is yes you can privately and securely upload them as you did with the log for access to MongoDB staff only and for the purpose of reproducing this issue – please let me know. I'm also interested in the hardware configuration you're using: number of CPUs and cores, CPU kind and speed, memory, and storage configuration. Lastly, you may want to consider running a replica set for robustness. If you do, I'd recommend you run Ubuntu 14.04 instead of 14.10 in one of the nodes in the off-chance that the issue is triggered by an OS instability. Thanks, | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by quantomass [ 24/Jan/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Actually, it crashed again just now with the same error. I've uploaded the mongolog2.log starting from yesterday's restart (so partially overlapping with the previous file). Crash happens at
This time previous records with the conn107 are updates to CustomStatistic collection. Don't know if this is relevant, but these two collections (CustomStatistic and UserStatistic) are one of the biggest in our db and with the larges indexes. Here is some stats
Do you think the problem may be caused by the size of these collections/indexes? Then probably for now I should rename them to smth like CustomStatistic_bkp so that new ones will start from scratch? This will cause less damage to us than crashing. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by quantomass [ 24/Jan/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I ran the same command as the one at 2015-01-23T19:16:58.394+0100 [conn219322] command probtn.$cmd command: findAndModify ... from mongo client and till now the server is working fine. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Ramon Fernandez Marina [ 24/Jan/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Thanks for uploading the logs quantomass. I take that if you re-run the findAndModify command the assertion is not triggered again, is that correct? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by quantomass [ 24/Jan/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi, | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Ramon Fernandez Marina [ 23/Jan/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Well, I've never seen this assertion or backtrace before, so I was interested in seeing if there are some previous clues in the log. A log from the last restart until the crash would be a good place to start. One option is to compress it and upload it as follows:
When prompted for a password just hit enter. Thanks, | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by quantomass [ 23/Jan/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Yes, but do you need some limited time period? Because the service has been running for about 20 days now so the log is huge. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Ramon Fernandez Marina [ 23/Jan/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
This is what addr2line has to say:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Ramon Fernandez Marina [ 23/Jan/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Are you able to upload full logs for this instance? It's ok to sanitize operation contents, IP addresses, etc. |