Wednesday, May 25, 2016

How to make the MeCab library use a given dictionary directory

MeCab is a well-known morphological analysis tool that supports a few languages. I use it in my project to tokenize Japanese sentences into words. I installed it in its default system directories, and everything worked well.

Recently I had to make it use a specific dictionary directory from my code, without installing MeCab on the target machine. I ran into a problem: MeCab did not use the dictionary directory I gave it and reported errors instead.

After reading the MeCab source code, I found that mecab-0.996/src/utils.cpp locates the dictionary files with the code shown in Reference (3). The function, load_dictionary_resource(), has to find a mecabrc file before it loads the actual dictionaries. mecabrc is a configuration file installed by MeCab that records the dictionary path and other settings. It looks like this:
;
; Configuration file of MeCab
;
; $Id: mecabrc.in,v 1.3 2006/05/29 15:36:08 taku-ku Exp $;
;
dicdir =  /home/your-name/local/lib/mecab/dic/ipadic

; userdic = /home/foo/bar/user.dic

; output-format-type = wakati
; input-buffer-size = 8192

; node-format = %m\n
; bos-format = %S\n
; eos-format = EOS\n
Here dicdir may point to a path that does not exist on the target machine, so we need to tell MeCab to use a given dictionary directory instead.
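
To make the lookup order concrete, here is a minimal sketch of how the non-Windows branch of load_dictionary_resource() in Reference (3) picks the rcfile. resolve_rcfile() and its parameters are hypothetical names used only for illustration, not MeCab API; the real function reads these values from its Param object and from getenv(). An empty string stands for "not set".

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (not part of MeCab): mirrors the non-Windows rcfile
// lookup order of load_dictionary_resource(). Empty strings mean "not set".
std::string resolve_rcfile(const std::string &rcfile_option,
                           const std::string &home_dir,
                           const std::string &mecabrc_env,
                           const std::string &default_rc) {
  // 1. An explicit --rcfile option always wins.
  if (!rcfile_option.empty()) return rcfile_option;

  // 2. $HOME/.mecabrc, but only if the file can actually be opened.
  if (!home_dir.empty()) {
    const std::string candidate = home_dir + "/.mecabrc";
    std::ifstream ifs(candidate.c_str());
    if (ifs) return candidate;
  }

  // 3. The MECABRC environment variable.
  if (!mecabrc_env.empty()) return mecabrc_env;

  // 4. The compiled-in default (MECAB_DEFAULT_RC in the real source).
  return default_rc;
}
```

With nothing set, the function falls through to the compiled-in default, which is why a machine without an installed mecabrc needs an explicit --rcfile (or a MECABRC environment variable).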

The mecabrc file can be set with the "--rcfile" option, and the dictionary directory with "--dicdir". Because load_dictionary_resource() fails when it cannot load a mecabrc, "--rcfile" must point to a file that exists, even if it is only a dummy:
#include <iostream>
#include <mecab.h>

// createTagger() returns NULL on failure; MeCab::getLastError() explains why.
MeCab::Tagger *m_mecabTagger =
    MeCab::createTagger("--rcfile /path/to/dummy/mecabrc -O wakati --dicdir /path/to/your/dictionary/dir");
if (!m_mecabTagger)
{
    // The tagger is NULL here, so there is nothing to query or delete.
    std::cerr << "ERROR: " << MeCab::getLastError() << std::endl;
}
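
The snippet above assumes that a dummy mecabrc already exists. As a rough sketch, the two helpers below create such a file and build the argument string passed to createTagger(). Both function names are hypothetical, and I am assuming that a file containing only a ";" comment line is accepted as a valid mecabrc, since the sample mecabrc above uses the same comment syntax.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: writes a mecabrc that contains only a comment line.
// Returns false if the file cannot be created or written.
bool write_dummy_mecabrc(const std::string &path) {
  std::ofstream ofs(path.c_str());
  if (!ofs) return false;
  ofs << "; dummy mecabrc: the dictionary is supplied via --dicdir\n";
  return ofs.good();
}

// Hypothetical helper: builds the option string for MeCab::createTagger().
// Paths that contain spaces are not handled in this sketch.
std::string build_tagger_args(const std::string &rcfile,
                              const std::string &dicdir) {
  return "--rcfile " + rcfile + " -O wakati --dicdir " + dicdir;
}
```

The result of build_tagger_args() can be passed to MeCab::createTagger() through c_str(). Keeping the two paths as parameters makes it easy to place the dummy mecabrc and the dictionary next to the application binary.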

References
(1) MeCab: http://taku910.github.io/mecab/libmecab.html
(2) MeCab API: http://taku910.github.io/mecab/doxygen/classMeCab_1_1Tagger.html
(3) Excerpt from mecab-0.996/src/utils.cpp:
bool load_dictionary_resource(Param *param) {
  std::string rcfile = param->get<std::string>("rcfile");

#ifdef HAVE_GETENV
  if (rcfile.empty()) {
    const char *homedir = getenv("HOME");
    if (homedir) {
      const std::string s = MeCab::create_filename(std::string(homedir),
                                                   ".mecabrc");
      std::ifstream ifs(WPATH(s.c_str()));
      if (ifs) {
        rcfile = s;
      }
    }
  }

  if (rcfile.empty()) {
    const char *rcenv = getenv("MECABRC");
    if (rcenv) {
      rcfile = rcenv;
    }
  }
#endif

#if defined (HAVE_GETENV) && defined(_WIN32) && !defined(__CYGWIN__)
  if (rcfile.empty()) {
    scoped_fixed_array<wchar_t, BUF_SIZE> buf;
    const DWORD len = ::GetEnvironmentVariableW(L"MECABRC",
                                                buf.get(),
                                                buf.size());
    if (len < buf.size() && len > 0) {
      rcfile = WideToUtf8(buf.get());
    }
  }
#endif

#if defined(_WIN32) && !defined(__CYGWIN__)
  HKEY hKey;
  scoped_fixed_array<wchar_t, BUF_SIZE> v;
  DWORD vt;
  DWORD size = v.size() * sizeof(v[0]);

  if (rcfile.empty()) {
    ::RegOpenKeyExW(HKEY_LOCAL_MACHINE, L"software\\mecab", 0, KEY_READ, &hKey);
    ::RegQueryValueExW(hKey, L"mecabrc", 0, &vt,
                       reinterpret_cast<BYTE *>(v.get()), &size);
    ::RegCloseKey(hKey);
    if (vt == REG_SZ) {
      rcfile = WideToUtf8(v.get());
    }
  }

  if (rcfile.empty()) {
    ::RegOpenKeyExW(HKEY_CURRENT_USER, L"software\\mecab", 0, KEY_READ, &hKey);
    ::RegQueryValueExW(hKey, L"mecabrc", 0, &vt,
                       reinterpret_cast<BYTE *>(v.get()), &size);
    ::RegCloseKey(hKey);
    if (vt == REG_SZ) {
      rcfile = WideToUtf8(v.get());
    }
  }

  if (rcfile.empty()) {
    vt = ::GetModuleFileNameW(DllInstance, v.get(), size);
    if (vt != 0) {
      scoped_fixed_array<wchar_t, _MAX_DRIVE> drive;
      scoped_fixed_array<wchar_t, _MAX_DIR> dir;
      _wsplitpath(v.get(), drive.get(), dir.get(), NULL, NULL);
      const std::wstring path =
          std::wstring(drive.get()) + std::wstring(dir.get()) + L"mecabrc";
      if (::GetFileAttributesW(path.c_str()) != -1) {
        rcfile = WideToUtf8(path);
      }
    }
  }
#endif

  if (rcfile.empty()) {
    rcfile = MECAB_DEFAULT_RC;
  }

  if (!param->load(rcfile.c_str())) {
    rcfile = "mecab_etc/mecabrc";
    if (!param->load(rcfile.c_str())) {
      return false;
    }
  }

  std::string dicdir = param->get<std::string>("dicdir");
  if (dicdir.empty()) {
    dicdir = ".";  // current
  }
  remove_filename(&rcfile);
  replace_string(&dicdir, "$(rcpath)", rcfile);
  param->set("dicdir", dicdir, true);
  dicdir = create_filename(dicdir, DICRC);

  if (!param->load(dicdir.c_str())) {
    return false;
  }

  return true;
}
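
The tail of load_dictionary_resource() also shows that dicdir may contain the placeholder $(rcpath), which is replaced by the directory of the rcfile. The sketch below imitates that step so the effect is easier to see. resolve_dicdir() is a hypothetical name, and I am assuming that remove_filename() strips everything from the last '/' onward (yielding "." when there is no '/') and that replace_string() replaces only the first occurrence.

```cpp
#include <string>

// Hypothetical helper: imitates how the end of load_dictionary_resource()
// turns the rcfile path and the raw dicdir value into the final dicdir.
std::string resolve_dicdir(const std::string &rcfile, std::string dicdir) {
  // An empty dicdir means the current directory.
  if (dicdir.empty()) dicdir = ".";

  // Directory part of the rcfile path ("." if it has no '/').
  const std::string::size_type slash = rcfile.find_last_of('/');
  const std::string rcpath =
      (slash == std::string::npos) ? "." : rcfile.substr(0, slash);

  // Replace the first "$(rcpath)" placeholder, if present.
  const std::string token = "$(rcpath)";
  const std::string::size_type pos = dicdir.find(token);
  if (pos != std::string::npos) dicdir.replace(pos, token.size(), rcpath);

  return dicdir;
}
```

Under these assumptions, a mecabrc shipped with the application could set dicdir relative to its own location, for example $(rcpath)/../dic/ipadic, instead of hard-coding an absolute path.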
