Description
Saw a core dump last night while running against the WT-638 branch. It might be connected to that branch, but I don't think so.
(gdb) where
#0 0x000000000044dc67 in __bit_set (
bitf=0x8044222a8 "wivk??e?WoVn???}\226???\227Uf_?gz\233k?\233?ۦ宿g????f?{zw????m?\236?\231?yZo?\225???ݷiyj??i?v~??U?V????f\233\236\237?i??i???ݥ\225\225w?w?u?\237???k?vo?o??ޯ??\236?W??W??^Y??\237??\225?\177[\177\235?ݾy???w??[knZ??^?ߺ]\231\233{\225\231????[g~\231?o~u}??ݺ?\226?????????"...,
bit=18446744073709551612) at bitstring.i:99
#1 0x000000000044dbef in __bit_setv (
bitf=0x8044222a8 "wivk??e?WoVn???}\226???\227Uf_?gz\233k?\233?ۦ宿g????f?{zw????m?\236?\231?yZo?\225???ݷiyj??i?v~??U?V????f\233\236\237?i??i???ݥ\225\225w?w?u?\237???k?vo?o??ޯ??\236?W??W??^Y??\237??\225?\177[\177\235?ݾy???w??[knZ??^?ߺ]\231\233{\225\231????[g~\231?o~u}??ݺ?\226?????????"...,
entry=18446744073709551614, width=2 '\002', value=3 '\003')
at bitstring.i:303
#2 0x000000000044e741 in __bit_setv_recno (page=0x8044208e0,
recno=18446744073709551615, width=2 '\002', value=3 '\003')
at bitstring.i:315
#3 0x000000000044e3c5 in __rec_col_fix (session=0x80268b660, r=0x804603400,
page=0x8044208e0) at ../src/btree/rec_write.c:2331
#4 0x00000000004494fc in __wt_rec_write (session=0x80268b660,
page=0x8044208e0, salvage=0x0, flags=0) at ../src/btree/rec_write.c:317
What's happening is that we're reconciling a page, and there's a record on the page's update list that has an invalid record number.
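The numbers in the backtrace are consistent with plain unsigned wraparound. The sketch below is a guess at the arithmetic, not the actual bitstring.i code: it assumes the page's starting record number is 1 (inferred from entry = recno - 1 in frames #1 and #2) and that the first bit of an entry is entry * width. Both helper names are hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in: offset of a record within a fixed-length page,
 * given the record number of the page's first entry.
 */
static uint64_t
entry_for_recno(uint64_t recno, uint64_t page_start_recno)
{
	return (recno - page_start_recno);
}

/*
 * Hypothetical stand-in: index of the first bit used by an entry.
 * Unsigned 64-bit multiplication wraps modulo 2^64.
 */
static uint64_t
first_bit_for_entry(uint64_t entry, uint8_t width)
{
	return (entry * width);
}
```

Under those assumptions, entry_for_recno(UINT64_MAX, 1) gives the entry value seen in frame #1, and first_bit_for_entry() of that entry with width 2 wraps around to the bit value seen in frame #0, which would put the write far outside the page's bit field.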
Inside the page->modify->update skip list, I see:
$1775 = {upd = 0x80243b200, u = {recno = 646, key = {offset = 646, size = 0}}, next = 0x80243b1f0}
$1776 = {upd = 0x805844380, u = {recno = 651, key = {offset = 651, size = 0}}, next = 0x805844370}
$1777 = {upd = 0x805c3b540, u = {recno = 18446744073709551615, key = {offset = 4294967295, size = 4294967295}}, next = 0x80385a5b0}
$1778 = {upd = 0x80243b740, u = {recno = 657, key = {offset = 657, size = 0}}, next = 0x80381f050}
$1779 = {upd = 0x80243ba80, u = {recno = 658, key = {offset = 658, size = 0}}, next = 0x8024401d0}
|
$1777 has a recno of UINT64_MAX, and offset/size values of UINT32_MAX, but the WT_UPDATE structure looks reasonable:
(gdb) p *(WT_UPDATE *)0x805c3b540
$1791 = {size = 1, txnid = 4007, next = 0x0}
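The offset/size values of UINT32_MAX are probably not independent evidence: gdb prints u as a union, so they look like the same 64 bits as recno viewed through the key member. A minimal sketch, assuming a layout like the one gdb shows (the type and function names here are hypothetical, not the real WiredTiger definitions):

```c
#include <stdint.h>

/* Hypothetical layout mirroring the "u" member in the gdb output. */
union insert_key {
	uint64_t recno;			/* column-store record number */
	struct {
		uint32_t offset;	/* row-store key offset */
		uint32_t size;		/* row-store key size */
	} key;
};

/* Store a record number and return the union so both views can be read. */
static union insert_key
key_from_recno(uint64_t recno)
{
	union insert_key u;

	u.recno = recno;
	return (u);
}
```

With recno set to UINT64_MAX every bit is 1, so both 32-bit halves read back as UINT32_MAX regardless of byte order. That suggests only the record number was written, not the key fields.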
Entry $1777 isn't at the head of any of the skip-list chains; it's right in the middle of all of them. (That's why I don't think this has anything to do with the WT-638 branch: this entry was pushed into the list long after the list itself was allocated.)
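One way to catch this kind of entry earlier would be to check that record numbers strictly increase along the lowest-level chain. The following is only a sketch with a simplified, hypothetical node type (a single next pointer instead of the real per-level array), not code from the tree:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified node: only the level-0 link is modeled. */
struct ins_node {
	uint64_t recno;
	struct ins_node *next;
};

/*
 * Walk the chain and return the first node whose record number is not
 * smaller than its successor's, or NULL if the chain is strictly
 * increasing.
 */
static struct ins_node *
first_out_of_order(struct ins_node *head)
{
	struct ins_node *n;

	for (n = head; n != NULL && n->next != NULL; n = n->next)
		if (n->recno >= n->next->recno)
			return (n);
	return (NULL);
}
```

Run over the five record numbers shown above (646, 651, UINT64_MAX, 657, 658), the walk stops at the UINT64_MAX node, because it is followed by 657.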
A recno of UINT64_MAX ought to be a smoking gun, but I can't find where it comes from. There's a little dance in __wt_btcur_insert():
	/*
	 * If WT_CURSTD_APPEND is set, insert a new record (ignoring
	 * the application's record number). First we search for the
	 * maximum possible record number so the search ends on the
	 * last page. The real record number is assigned by the
	 * serialized append operation.
	 */
	if (F_ISSET(cursor, WT_CURSTD_APPEND))
		cbt->iface.recno = UINT64_MAX;

	WT_ERR(__wt_col_search(session, cbt, 1));

	if (F_ISSET(cursor, WT_CURSTD_APPEND))
		cbt->iface.recno = 0;
but I don't see how it's possible for the recno to be left set incorrectly. Suppose we get WT_RESTART from __wt_col_search(): the next time through the loop, WT_CURSTD_APPEND should still be set, so the recno is set and then cleared again.
Here's the config that produced this error:
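The restart reasoning can be modeled with a toy loop. Everything here is a hypothetical stand-in (the struct, the fake search, the SKETCH_RESTART value), and the way a restart jumps back to the top of the loop is an assumption about the surrounding code, which isn't shown in the excerpt:

```c
#include <stdbool.h>
#include <stdint.h>

#define	SKETCH_RESTART	1	/* hypothetical stand-in for WT_RESTART */

struct sketch_cursor {
	bool append;		/* stand-in for WT_CURSTD_APPEND */
	uint64_t recno;		/* stand-in for cbt->iface.recno */
	int restarts_left;	/* retries the fake search will request */
	int searches_total;	/* number of searches performed */
	int searches_at_max;	/* searches that saw recno == UINT64_MAX */
};

/* Fake search: records what it saw, then asks for a retry if told to. */
static int
sketch_search(struct sketch_cursor *c)
{
	c->searches_total++;
	if (c->recno == UINT64_MAX)
		c->searches_at_max++;
	if (c->restarts_left > 0) {
		c->restarts_left--;
		return (SKETCH_RESTART);
	}
	return (0);
}

/* Model of the set/search/clear dance, retried on restart. */
static int
sketch_insert(struct sketch_cursor *c)
{
	int ret;

	for (;;) {
		if (c->append)
			c->recno = UINT64_MAX;
		ret = sketch_search(c);
		if (ret == SKETCH_RESTART)
			continue;	/* recno is still UINT64_MAX here */
		if (ret != 0)
			return (ret);
		if (c->append)
			c->recno = 0;
		return (0);
	}
}
```

In this model the record number is UINT64_MAX only between the set and the clear, and it is always cleared before a successful return as long as the append flag stays set. If the model matches the real code, the stray UINT64_MAX would have to come from a path where the flag changes between iterations, or where the recno is read inside that window.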
############################################
# RUN PARAMETERS
############################################
bitcnt=2
cache=40
compression=snappy
data_extend=0
data_source=file
delete_pct=0
# dictionary not applicable to this run
file_type=fixed-length column-store
hot_backups=0
# huffman_key not applicable to this run
# huffman_value not applicable to this run
insert_pct=41
internal_key_truncation=0
internal_page_max=17
key_gap=17
# key_max not applicable to this run
# key_min not applicable to this run
leaf_page_max=12
ops=5000
# prefix not applicable to this run
# repeat_data_pct not applicable to this run
reverse=0
rows=1000
runs=0
split_pct=43
threads=31
# value_max not applicable to this run
# value_min not applicable to this run
# wiredtiger_config not applicable to this run
write_pct=20
############################################
@michaelcahill, any ideas?
Issue Links
- is related to
  - WT-656 test/format failure: checkpoint ranges never verified (Closed)
- related to
  - WT-1 placeholder WT-1 (Closed)
  - WT-2 What does metadata look like? (Closed)
  - WT-3 What file formats are required? (Closed)
  - WT-4 Flexible cursor traversals (Closed)
  - WT-638 Modify update atomic (Closed)
  - WT-643 test/format failure: illegal cell and page type combination (Closed)