[SERVER-5144] increased memory usage, crashes when upgrading from 1.8.1 to 2.0.3 Created: 29/Feb/12  Updated: 15/Aug/12  Resolved: 13/Jul/12

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Javier Frias Assignee: Mathias Stearn
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Operating System: Solaris
Participants:

 Description   

We just had failed upgrades on two of our MongoDB replica set clusters.

Both are set up the same way: 1 arbiter, 2 replica set members.

Virtual/mapped memory went from 3-7GB to an immediate 34GB on one replica set, and to 16GB on the other. Both sets of boxes have 4GB of memory and 8GB of swap.

On version 1.8.1, same data, with active traffic:

PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
18180 root 4 45 0 3522M 1999M cpu/13 4:02 6.25% mongod

On version 2.0.3, same data, same node, with active traffic:

PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
11752 root 92 59 0 34G 1883M cpu/11 13:19 7.25% mongod

All are running under Solaris zones (Joyent), but they are just regular Joyent slices.

After about 30 minutes of traffic, the mongod processes would crash, and replica set failover would bounce back and forth. We were upgrading to 2.0.3 to try to address the constant failover issue. Performance is fine, which is why we've kept the boxes at 4GB; our active data set is small (on-disk data is about 10GB on both).

This is all on 64-bit builds.
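For reference, the virtual/resident/mapped split can also be read straight from mongod itself, which makes the 1.8.1 vs 2.0.3 comparison easier than eyeballing prstat output; a minimal sketch, assuming the mongo shell can reach the affected member (the host is a placeholder):

>mongo --host HOST:27017 --eval 'printjson(db.serverStatus().mem)'   # reports bits, resident, virtual, and mapped sizes in MB

Comparing that output on the 1.8.1 node and the 2.0.3 node separates address-space growth from actual resident-memory pressure.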



 Comments   
Comment by Ian Whalen (Inactive) [ 13/Jul/12 ]

@javier I'm closing this now, but please reopen if you can attach further logs or provide any further info from the Sun debugger.

Comment by Dwight Merriman [ 03/Mar/12 ]

Can you post the end of a log with a little more detail, if there is more? I see the 'terminate() called, printing stack:' above but nothing else, so I'm wondering what is up.

This is almost surely Sun-specific. We will fix it, but it could take a little time without more detail.

Running it in the Sun debugger would be interesting, as would chatting with Joyent support.
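One possible way to capture that missing stack, assuming the standard Solaris proc tools (pstack, coreadm, mdb) are available inside the zone; the binary path and core-file pattern below are only illustrative:

>pstack `pgrep mongod`                            # dump the native stacks of all mongod threads while it is still up
>coreadm -e process -i /var/cores/core.%f.%p      # enable per-process core files so the next abort leaves a core behind
>mdb /path/to/mongod /var/cores/core.mongod.PID   # open that core; ::status and $C show the abort reason and the C stack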

Comment by Javier Frias [ 02/Mar/12 ]

Also, for a few of the crashes, this has been the text preceding the "MongoDB starting" line:


Wed Feb 29 18:50:43 [conn138] end connection 10.200.14.56:56842
Wed Feb 29 18:50:43 [conn180] end connection 10.200.14.67:61303
Logstream::get called in uninitialized state
Wed Feb 29 18:50:43 [conn199] end connection 10.200.14.67:56790
Wed Feb 29 18:51:18 [initandlisten] MongoDB starting : pid=29869 port=27017 dbpath=/data/db 64-bit

Wed Feb 29 18:55:31 [conn176] end connection 10.200.14.58:41561
Wed Feb 29 18:55:31 [conn260] end connection 10.200.14.67:48064
Wed Feb 29 18:55:31 terminate() called, printing stack:
Wed Feb 29 18:55:31 terminate() called, printing stack:
Wed Feb 29 18:55:36 [initandlisten] MongoDB starting : pid=7661 port=27017 dbpath=/data/db 64-bit

Wed Feb 29 18:58:42 [conn36] end connection 10.200.14.65:38252
Wed Feb 29 18:58:42 [conn38] end connection 10.200.14.49:57457
Wed Feb 29 18:58:42 terminate() called, printing stack:
Wed Feb 29 18:58:42 terminate() called, printing stack:
Wed Feb 29 19:00:52 [initandlisten] MongoDB starting : pid=15572 port=27017 dbpath=/data/db 64-bit

Comment by Javier Frias [ 02/Mar/12 ]

We've added an extra 4GB to the Solaris zone running this instance, and the problem still persists. We are also going to change our front end so that these very large queries no longer happen, and see if that stops the crashes.
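If rewriting the front end takes a while, one cheaper interim test could be to cap the cursor batch size so a single getmore cannot return a full 4MB batch; a minimal sketch against the inventory.inventory query shape quoted later in this ticket (the field names come from that log, the batch size of 500 is arbitrary):

>mongo inventory --eval '
  db.inventory.find({ bookingdate: { $gte: 20120302, $lt: 20120401 },
                      orderModification: 0, price: { $gt: 0 } })
    .batchSize(500)                                /* request smaller getmore batches */
    .forEach(function (doc) { /* consume each document */ });
'

Whether or not smaller batches avoid the crash, the result would at least narrow the problem down to the large-response path.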

Comment by Javier Frias [ 02/Mar/12 ]

It doesn't seem to log the crash. Basically, the scenario is that mongod gets a series of very large result queries, and it exits/crashes. Here's the log from SMF (the Solaris service manager) for that time.

I cut away the query right before the "MongoDB starting" line as it's pretty large, but I can send you a gzipped version of the log with everything if you'd like (it's going to be a few hundred megs).

Just to give you an idea of how often it happens, here are the last two days (there's a slight discrepancy, since at some of those times there was manual intervention).

>grep "MongoDB starting" mongo.log | gegrep -B 1 '(Mar| Feb 29)'
Wed Feb 29 11:41:00 [initandlisten] MongoDB starting : pid=21220 port=27017 dbpath=/data/db 64-bit
Wed Feb 29 16:22:23 [initandlisten] MongoDB starting : pid=19671 port=27017 dbpath=/data/db 64-bit
Wed Feb 29 16:36:31 [initandlisten] MongoDB starting : pid=1291 port=27017 dbpath=/data/db 64-bit
Wed Feb 29 16:54:43 [initandlisten] MongoDB starting : pid=15145 port=27017 dbpath=/data/db 64-bit
Wed Feb 29 18:10:30 [initandlisten] MongoDB starting : pid=29792 port=27017 dbpath=/data/db 64-bit
Wed Feb 29 18:24:41 [initandlisten] MongoDB starting : pid=28079 port=27017 dbpath=/data/db 64-bit
Wed Feb 29 18:32:16 [initandlisten] MongoDB starting : pid=21862 port=27017 dbpath=/data/db 64-bit
Wed Feb 29 18:51:18 [initandlisten] MongoDB starting : pid=29869 port=27017 dbpath=/data/db 64-bit
Wed Feb 29 18:55:36 [initandlisten] MongoDB starting : pid=7661 port=27017 dbpath=/data/db 64-bit
Wed Feb 29 19:00:52 [initandlisten] MongoDB starting : pid=15572 port=27017 dbpath=/data/db 64-bit
Thu Mar 1 04:01:13 [initandlisten] MongoDB starting : pid=7179 port=27017 dbpath=/data/db 64-bit
Thu Mar 1 12:41:19 [initandlisten] MongoDB starting : pid=14215 port=27017 dbpath=/data/db 64-bit
Thu Mar 1 13:31:27 [initandlisten] MongoDB starting : pid=23957 port=27017 dbpath=/data/db 64-bit
Fri Mar 2 00:45:13 [initandlisten] MongoDB starting : pid=1822 port=27017 dbpath=/data/db 64-bit
Fri Mar 2 13:01:37 [initandlisten] MongoDB starting : pid=25902 port=27017 dbpath=/data/db 64-bit
Fri Mar 2 14:58:08 [initandlisten] MongoDB starting : pid=16651 port=27017 dbpath=/data/db 64-bit
Fri Mar 2 16:18:34 [initandlisten] MongoDB starting : pid=20096 port=27017 dbpath=/data/db 64-bit

>grep "Stopping because all processes in service exited" /var/svc/log/mongo | egrep '(Mar| Feb 29)'
[ Feb 29 03:46:54 Stopping because all processes in service exited. ]
[ Feb 29 16:22:18 Stopping because all processes in service exited. ]
[ Feb 29 16:36:26 Stopping because all processes in service exited. ]
[ Feb 29 16:54:37 Stopping because all processes in service exited. ]
[ Feb 29 18:10:25 Stopping because all processes in service exited. ]
[ Feb 29 18:24:36 Stopping because all processes in service exited. ]
[ Feb 29 18:32:11 Stopping because all processes in service exited. ]
[ Mar 1 04:01:08 Stopping because all processes in service exited. ]
[ Mar 1 12:41:14 Stopping because all processes in service exited. ]
[ Mar 1 13:31:22 Stopping because all processes in service exited. ]
[ Mar 2 00:45:08 Stopping because all processes in service exited. ]
[ Mar 2 13:01:32 Stopping because all processes in service exited. ]
[ Mar 2 14:58:03 Stopping because all processes in service exited. ]
[ Mar 2 16:18:28 Stopping because all processes in service exited. ]

Just so you can see one of the long queries that crashes it:

>ggrep -B 1 "MongoDB starting" mongo.log | gegrep -B 1 '(Mar| Feb 29)' | tail -2

Fri Mar 2 16:18:28 [conn128] getmore inventory.inventory cid:59281348262697 getMore: { bookingdate: { $gte: 20120302, $lt: 20120401 }, orderModification: 0, price: { $gt: 0 }, propertyId:
{ $in: [ 8, 451, 614, 361, 77, 57, 597, 509, 66, 178, 18, 568, 285, 234, 2, 408, 794, 245, 358, 53, 322, 292, 220, 521, 183, 65, 157, 49, 400, 764, 70, 172, 608, 219, 674, 478, 24, 37, 223, 266, 1454, 511, 770, 1143, 64, 1177, 456, 472, 363, 1415, 519, 111, 304, 110, 409, 396, 1294, 4, 182, 43, 81, 810, 417, 218, 1130, 555, 572, 123, 384, 825, 367, 1087, 835, 659, 733, 1344, 517, 696, 167, 582, 1043, 41, 1160, 645, 845, 732, 330, 76, 139, 40, 254, 545, 604, 1944, 99, 457, 17, 530, 819, 141, 198, 1076, 256, 269, 428, 1446, 497, 3, 633, 31, 636, 296, 997, 670, 323, 370, 295, 482, 603, 288, 898, 341, 29, 790, 1684, 62, 1212, 529, 283, 744, 1209, 74, 113, 979, 1329, 686, 660, 1068, 960, 615, 1804, 233, 974, 171, 273, 1132, 137, 207, 108, 446, 402, 484, 1270, 91, 214, 847, 803, 978, 279, 1020, 250, 829, 2205, 383, 1009, 100, 999, 2259, 487, 246, 35, 368, 189, 425, 512, 903, 224, 536, 5, 376, 226, 116, 73, 578, 345, 439, 662, 1003, 461, 991, 1363, 1357, 1128, 1059, 538, 1048, 87, 2394, 1398, 751, 274, 2078, 389, 546, 85, 342, 1404, 1862, 596, 1179, 523, 773, 284, 500, 2119, 757, 1080, 900, 880, 317, 442, 2530, 2316, 741, 294, 543, 481, 426, 1351, 420, 1850, 184, 1004, 548, 600, 326, 1793, 704, 808, 432, 593, 694, 1369, 174, 471, 1284, 1109, 303, 605, 721, 827, 2061, 526, 319, 1250, 573, 1106, 190, 1439, 72, 1846, 506, 1455, 730, 867, 494, 126, 895, 360, 448, 318, 574, 1376, 2479, 571, 1947, 874, 1300, 398, 737, 763, 201, 2176, 2411, 646, 598, 449, 1845, 1222, 243, 962, 828, 973, 554, 1307, 1368, 203, 1355, 2441, 943, 2190, 1085, 966, 2414, 1320, 2247, 2091, 2046, 1999, 693, 1982, 938, 585, 1302, 338, 774, 882, 655, 34, 918, 963, 477, 1831, 240, 1044, 26, 1553, 616, 2152, 987, 1453, 407, 850, 589, 621, 2056, 499, 1129, 1276, 156, 1267, 533, 873, 752, 740, 736, 1105, 1759, 1633, 39, 643, 2215, 1269, 2223, 1598, 180, 397, 2447, 1008, 492, 1409, 258, 112, 2240, 964, 1054, 152, 1133, 739, 1554, 1022, 1371, 726, 1216, 893, 1170, 800, 823, 1183, 915, 375, 359, 1950, 666, 985, 806, 881, 1809, 1345, 1146, 1736, 1532, 748, 941, 1245, 1533, 1046, 624, 1752, 926, 956, 899, 807, 382, 2166, 1422, 986, 1889, 1382, 1934, 1597, 1865, 1901, 1113, 679, 1690, 669, 2264, 766, 902, 1888, 1035, 651, 1223, 193, 2299, 1111, 1155, 2125, 1233, 746, 1247, 884, 532, 1722, 1217, 894, 772, 2127, 1186, 227, 2513, 1626, 1056, 1322, 709, 2131, 1900, 1936, 1391, 1741, 1167, 1635, 2035, 1271, 1711, 2458, 131, 2285, 2037, 1408, 1285, 1308, 431, 2137, 1784, 2117, 2203, 1568, 452, 866, 1883, 2096, 1244, 2160, 1760, 629, 769, 1769, 1195, 1165, 2043, 1305, 2068, 1010, 458, 870, 528, 2104, 1951, 1045, 2172, 1740, 1338, 1410, 601, 2225, 628, 1636, 1192, 2468, 1396, 1279, 313, 2098, 186, 1681, 2350, 2103, 872, 1595, 804, 768, 2384, 754, 2136, 968, 2139, 635, 2220, 1851, 1818, 2025, 503, 1512, 1147, 2107, 1051, 1328, 2274, 1618, 1963, 1958, 2006, 1265, 1311, 1583, 2189, 1569, 1523, 1406, 1938, 1211, 1152, 1019, 1435, 1112, 1842, 1430, 2087, 1277, 1783, 1807, 1530, 796, 1492, 1705, 975, 2138, 1166, 1287, 1746, 205, 1266, 1835, 993, 2275, 1646, 1826, 1770, 1733, 1683, 2086, 2252, 1927, 1587, 2038, 1940, 2159, 1380, 706, 2374, 2404, 1115, 1942, 814, 443, 1844, 430, 1377, 2507, 1319, 1582, 1282, 2175, 1330, 1825, 2003, 1256, 1843, 1451, 2022, 2286, 2398, 1498, 1905, 889, 1169, 1813, 1935, 1899, 1894, 1962, 980, 1653, 1827, 1047, 1937, 2013, 782, 1941, 939, 657, 767, 547, 1041, 1808, 1527, 61, 904, 1994, 1885, 1668, 1378, 1745, 1872, 919, 1306, 2012, 1788, 2128, 1061, 2327, 2417, 1425, 1904, 1949, 1707, 1251, 2135, 1939, 1427, 1180, 1786, 2261, 
1849, 2047, 1778, 1063, 1724, 2422, 1012, 2209, 2421, 1062, 747, 892, 1649, 1274, 1879, 2289, 1754, 977, 2171, 2101, 1550, 1487, 475, 2194, 2089, 1246, 1975, 2196, 1503, 1747, 2088, 1301, 1806, 1589, 1995, 1437, 1110, 1984, 2195, 1413, 2036, 2102, 2375, 2343, 1588, 1812, 1299, 2164, 1918, 2323, 2238, 1431, 1657, 2510, 2301, 687, 1753, 2443, 749, 2465, 1570, 2122, 2231, 2123, 2069, 1450, 2076, 1392, 2167, 1847, 1185, 1848, 1961, 2224, 140, 801, 864, 877, 936, 1018, 1038, 1039, 1040, 1066, 1148, 1151, 1164, 1303, 1314, 1370, 1447, 1448, 1449, 1465, 1466, 1467, 1468, 1469, 1470, 1471, 1472, 1473, 1474, 1475, 1476, 1477, 1478, 1486, 1488, 1493, 1494, 1496, 1497, 1499, 1500, 1501, 1502, 1505, 1506, 1507, 1508, 1509, 1510, 1511, 1513, 1514, 1515, 1516, 1517, 1518, 1519, 1520, 1522, 1524, 1525, 1526, 1528, 1529, 1531, 1534, 1535, 1536, 1537, 1538, 1539, 1543, 1547, 1548, 1549, 1551, 1552, 1574, 1576, 1578, 1581, 1585, 1590, 1591, 1593, 1596, 1599, 1600, 1601, 1602, 1603, 1604, 1605, 1608, 1609, 1610, 1613, 1614, 1615, 1617, 1624, 1625, 1627, 1641, 1643, 1645, 1647, 1648, 1650, 1651, 1652, 1654, 1655, 1656, 1658, 1659, 1660, 1661, 1666, 1671, 1673, 1674, 1678, 1680, 1696, 1697, 1699, 1700, 1701, 1702, 1704, 1708, 1713, 1714, 1715, 1716, 1717, 1723, 1728, 1732, 1737, 1738, 1739, 1742, 1749, 1756, 1757, 1762, 1764, 1766, 1767, 1768, 1776, 1780, 1782, 1789, 1790, 1791, 1794, 1803, 1852, 1853, 1858, 1861, 1868, 1869, 1870, 1871, 1882, 1890, 1897, 1898, 1903, 1906, 1907, 1908, 1909, 1914, 1915, 1916, 1917, 1930, 1957, 1964, 1979, 1981, 1996, 2001, 2002, 2004, 2007, 2008, 2009, 2011, 2017, 2018, 2026, 2028, 2029, 2030, 2052, 2053, 2054, 2062, 2063, 2065, 2066, 2067, 2079, 2080, 2081, 2082, 2083, 2111, 2146, 2147, 2148, 2149, 2153, 2154, 2157, 2168, 2170, 2177, 2178, 2187, 2197, 2198, 2199, 2200, 2201, 2202, 2208, 2210, 2216, 2217, 2221, 2233, 2241, 2260, 2262, 2269, 2270, 2271, 2279, 2280, 2282, 2290, 2291, 2292, 2293, 2305, 2334, 2336, 2337, 2345, 2363, 2368, 2370, 2371, 2379, 2385, 2387, 2388, 2389, 2418, 2419, 2420, 2432, 2438, 2446, 2452, 2482, 2483, 2484, 2485, 2488, 2489, 2508 ] }

} bytes:4194692 nreturned:6446 184ms
Fri Mar 2 16:18:34 [initandlisten] MongoDB starting : pid=20096 port=27017 dbpath=/data/db 64-bit

Comment by Eliot Horowitz (Inactive) [ 02/Mar/12 ]

Is there something else missing from the logs?
Can you send the entire log?
I'm not seeing a crash in there.

Comment by Javier Frias [ 01/Mar/12 ]

Would there be any way to troubleshoot the constant mongod crashes and failovers? The problem, while always present, was greatly exacerbated when we tried the upgrade.

This is the log as the crash happens...
Thu Mar 1 13:31:19 [conn293] getmore XXXXX getMore:
....
} } bytes:4194411 nreturned:6412 107ms
Thu Mar 1 13:31:27 [initandlisten] MongoDB starting : pid=23957 port=27017 dbpath=/data/db 64-bit
Thu Mar 1 13:31:27 [initandlisten] db version v1.8.1, pdfile version 4.5
Thu Mar 1 13:31:27 [initandlisten] git version: a429cd4f535b2499cc4130b06ff7c26f41c00f04
Thu Mar 1 13:31:27 [initandlisten] build sys info: SunOS fhm8e4dc.joyent.us 5.11 snv_89 i86pc BOOST_LIB_VERSION=1_38
Thu Mar 1 13:31:27 [initandlisten] waiting for connections on port 27017
Thu Mar 1 13:31:27 [websvr] web admin interface listening on port 28017
Thu Mar 1 13:31:27 [initandlisten] connection accepted from...

The process, as you can see, just crashes when it gets a series of very large result queries.

Comment by Eliot Horowitz (Inactive) [ 01/Mar/12 ]

2.0 has journaling enabled by default, which doubles the virtual memory space even though the actual RAM requirements are the same.
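A quick way to check whether the extra virtual memory is just the journal's second mapping rather than real memory pressure, sketched here assuming an otherwise default 2.0.x configuration, is to compare serverStatus().mem with and without journaling on one member (comparison only, not a recommended permanent setting):

>mongod --dbpath /data/db --nojournal             # start one node without journaling
>mongo --eval 'printjson(db.serverStatus().mem)'  # resident should stay about the same; virtual should fall back toward the mapped size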
