Hi, thanks for the awesome library. I'm seeing a couple of memory errors under
Valgrind when I use it.
The first:
==7805== Conditional jump or move depends on uninitialised value(s)
==7805== at 0x4C2CB94: strcmp (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==7805== by 0x43C412: CLD2::DoTLDLookup(char const*, CLD2::TLDLookup const*, int) (compact_lang_det_hint_code.cc:1034)
==7805== by 0x43D705: CLD2::SetCLDTLDHint(char const*, CLD2::CLDLangPriors*) (compact_lang_det_hint_code.cc:1452)
==7805== by 0x40CEB0: CLD2::ApplyHints(char const*, int, bool, CLD2::CLDHints const*, CLD2::ScoringContext*) (compact_lang_det_impl.cc:1504)
==7805== by 0x40DC4F: CLD2::DetectLanguageSummaryV2(char const*, int, bool, CLD2::CLDHints const*, bool, int, CLD2::Language, CLD2::Language*, int*, double*, std::__1::vector<CLD2::ResultChunk, std::__1::allocator<CLD2::ResultChunk> >*, int*, bool*) (compact_lang_det_impl.cc:1644)
==7805== by 0x409BAE: CLD2::DetectLanguageSummary(char const*, int, bool, char const*, int, CLD2::Language, CLD2::Language*, int*, int*, bool*) (compact_lang_det.cc:133)
==7805== by 0x405932: codulus::main(int, char**) (test_language_detection.cc:43)
==7805== by 0x406341: main (test_language_detection.cc:64)
==7805==
This report seems reasonable to me: DoTLDLookup uses strcmp, but the 'key'
passed to it is not null-terminated.
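To make the suspected problem concrete, here is a minimal sketch of a bounded lookup. It is not CLD2's actual code: `Entry`, `kTable`, and `LookupBounded` are hypothetical names I made up for illustration. The idea is just that copying the key into a zero-filled buffer before calling strcmp guarantees a terminator, so strcmp never reads uninitialised bytes.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a TLD table entry; this is NOT CLD2's TLDLookup.
struct Entry {
  const char* tld;
  int value;
};

static const Entry kTable[] = {{"de", 1}, {"fr", 2}, {"jp", 3}};

// Copies at most 3 key bytes into a zero-filled 4-byte buffer, so strcmp
// always finds a terminator, then does a linear search of the table.
// Returns -1 if the key is too long or not present.
int LookupBounded(const char* key, std::size_t key_len) {
  char local[4] = {0, 0, 0, 0};
  if (key_len > 3) return -1;
  std::memcpy(local, key, key_len);
  for (const Entry& e : kTable) {
    if (std::strcmp(local, e.tld) == 0) return e.value;
  }
  return -1;
}
```

With this shape, a caller can pass a two-byte key such as {'f', 'r'} with no trailing '\0' and the comparison stays within initialised memory.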
The other issue I see is an invalid read of one byte past the end of my
input, in a few places in the code:
==8337== Invalid read of size 1
==8337== at 0x415932: CLD2::ScriptScanner::GetOneScriptSpan(CLD2::LangSpan*) (getonescriptspan.cc:973)
==8337== by 0x415DAE: CLD2::ScriptScanner::GetOneScriptSpanLower(CLD2::LangSpan*) (getonescriptspan.cc:1074)
==8337== by 0x40DCE9: CLD2::DetectLanguageSummaryV2(char const*, int, bool, CLD2::CLDHints const*, bool, int, CLD2::Language, CLD2::Language*, int*, double*, std::__1::vector<CLD2::ResultChunk, std::__1::allocator<CLD2::ResultChunk> >*, int*, bool*) (compact_lang_det_impl.cc:1707)
==8337== by 0x40991E: CLD2::DetectLanguageSummary(char const*, int, bool, char const*, int, CLD2::Language, CLD2::Language*, int*, int*, bool*) (compact_lang_det.cc:133)
==8337== by 0x405869: codulus::main(int, char**) (test_language_detection.cc:42)
==8337== by 0x4060B1: main (test_language_detection.cc:63)
==8337== Invalid read of size 1
==8337== at 0x414D3C: CLD2::UTF8OneCharLen(char const*) (utf8statetable.h:270)
==8337== by 0x415A6D: CLD2::ScriptScanner::GetOneScriptSpan(CLD2::LangSpan*) (getonescriptspan.cc:991)
==8337== by 0x415DAE: CLD2::ScriptScanner::GetOneScriptSpanLower(CLD2::LangSpan*) (getonescriptspan.cc:1074)
==8337== by 0x40DCE9: CLD2::DetectLanguageSummaryV2(char const*, int, bool, CLD2::CLDHints const*, bool, int, CLD2::Language, CLD2::Language*, int*, double*, std::__1::vector<CLD2::ResultChunk, std::__1::allocator<CLD2::ResultChunk> >*, int*, bool*) (compact_lang_det_impl.cc:1707)
==8337== by 0x40991E: CLD2::DetectLanguageSummary(char const*, int, bool, char const*, int, CLD2::Language, CLD2::Language*, int*, int*, bool*) (compact_lang_det.cc:133)
==8337== by 0x405869: codulus::main(int, char**) (test_language_detection.cc:42)
==8337== by 0x4060B1: main (test_language_detection.cc:63)
==8337== Invalid read of size 1
==8337== at 0x41D1A3: CLD2::UTF8GenericPropertyTwoByte(CLD2::UTF8StateMachineObj_2 const*, unsigned char const**, int*) (utf8statetable.cc:403)
==8337== by 0x414D24: CLD2::GetUTF8LetterScriptNum(char const*) (getonescriptspan.cc:1098)
==8337== by 0x415A87: CLD2::ScriptScanner::GetOneScriptSpan(CLD2::LangSpan*) (getonescriptspan.cc:992)
==8337== by 0x415DAE: CLD2::ScriptScanner::GetOneScriptSpanLower(CLD2::LangSpan*) (getonescriptspan.cc:1074)
==8337== by 0x40DCE9: CLD2::DetectLanguageSummaryV2(char const*, int, bool, CLD2::CLDHints const*, bool, int, CLD2::Language, CLD2::Language*, int*, double*, std::__1::vector<CLD2::ResultChunk, std::__1::allocator<CLD2::ResultChunk> >*, int*, bool*) (compact_lang_det_impl.cc:1707)
==8337== by 0x40991E: CLD2::DetectLanguageSummary(char const*, int, bool, char const*, int, CLD2::Language, CLD2::Language*, int*, int*, bool*) (compact_lang_det.cc:133)
==8337== by 0x405869: codulus::main(int, char**) (test_language_detection.cc:42)
==8337== by 0x4060B1: main (test_language_detection.cc:63)
For now, I'm working around this by passing (input, size - 1) instead of
(input, size) to cld2. My input is not null-terminated, if that makes a
difference. It happens with every input I have tried (they are all web pages,
by the way). Also, I am running this on x86-64 Linux.
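An alternative workaround that would not drop the last byte might be to copy the input into a buffer that has a readable terminator right after it. The sketch below is only my assumption of what would keep Valgrind quiet. `MakeTerminatedCopy` is a hypothetical helper, not part of cld2, and I have not confirmed that cld2 expects a terminated buffer.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns a copy of the first `size` bytes of `input`.
// Since C++11, std::string guarantees that data()[size()] is a readable '\0',
// so a one-byte read past `size` lands on memory the string owns.
std::string MakeTerminatedCopy(const char* input, std::size_t size) {
  return std::string(input, size);
}
```

I would then pass (copy.data(), copy.size()) to cld2 instead of (input, size), at the cost of one extra copy per page.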
Any ideas?