Wednesday, December 19, 2012
How to display Chinese in Graphviz
However, it is not straightforward to display Chinese characters in the generated plots.
One example is as follows using DOT:
node [shape=box,style=dashed,height=0.3,fontname="C:\Windows\Fonts\NSimSun Regular.ttf",fontsize=12]; "你好"; "whr you are &#8704;";
Here you can write UTF-8 encoded Chinese characters directly in the source file, or alternatively write them as XML-style numeric character references such as "&#8704;" (i.e. ∀). More importantly, you need to specify the Chinese font file so that Graphviz can actually render the Chinese characters, since by default Graphviz usually cannot find a suitable font for them.
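The two ways of writing the characters can be combined in a small generator script. The following is only a minimal sketch: the function names, the output file name chinese.dot, and the font name "NSimSun" are assumptions made for illustration, and labels containing double quotes are not escaped.

```python
def to_numeric_entities(text):
    """Replace every non-ASCII character with its decimal &#N; reference."""
    return "".join(ch if ord(ch) < 128 else "&#%d;" % ord(ch) for ch in text)


def build_dot(labels, fontname):
    """Return DOT source with one dashed box node per label."""
    lines = [
        "digraph G {",
        '  node [shape=box, style=dashed, height=0.3, fontname="%s", fontsize=12];'
        % fontname,
    ]
    for label in labels:
        # The entity form keeps the DOT file pure ASCII.
        lines.append('  "%s";' % to_numeric_entities(label))
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "NSimSun" is an assumed font name; use whatever Chinese font you have.
    with open("chinese.dot", "w", encoding="ascii") as handle:
        handle.write(build_dot(["你好", "∀"], "NSimSun"))
```

The generated file can then be rendered with the usual dot command, for example dot -Tpng chinese.dot -o chinese.png.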
Tuesday, December 18, 2012
How to use JDB on Linux
JDB can be found in the JDK package.
You can also learn how to use JDB by reading the manual of JDB (using command "man jdb").
Here I only show the useful parts that I found:
(1) you can run your Java program as usual with the additional option "-agentlib:jdwp=transport=dt_socket,address=8000,server=y,suspend=n"
(2) now you can start JDB using command "jdb -attach 8000"
(3) in JDB, first use "suspend" to suspend your Java program, and then use "threads" to see its thread list; to see what code each thread is running, use "where 0x22" (0x22 is a thread id taken from the thread list); when you finish debugging, use "resume" to resume your Java program.
(4) if you want to exit JDB and let your Java program keep running, simply press Control-C
Friday, November 2, 2012
How to set the priority order of jar files in Eclipse
How to set the priority order of your jar files in Eclipse?
Right-click your project, and then click the Properties menu item.
Select "Java Build Path" on the left of the popup window.
Then open the "Order and Export" tab, which shows the order of the jar files and source folders; you can move entries with the Up and Down buttons.
Thursday, September 20, 2012
How To Open a Command Prompt in Windows 8
However, I really like it, for the simple reason that it integrates nearly all the Microsoft products together, e.g. Windows on PC, Windows on Phone, and also Xbox.
One of the new features that I found useful is that in the Windows 8 file explorer, you can easily open a command prompt in the current folder by simply clicking the menu File -> Open command prompt, which is something I have been expecting for a long time.
Tuesday, September 18, 2012
Useful add-ons of Firefox
A useful tool for inspecting the HTML structure of web pages.
HttpFox
A tool for inspecting the HTTP traffic sent from and to Firefox.
Tuesday, July 24, 2012
How to install Git on Linux
http://git-scm.com/downloads
2. if you really cannot use any package manager, the final choice is to install Git from its source code, which can be found at:
http://code.google.com/p/git-core/downloads/list
Then you can install Git from the source code (according to the INSTALL in the root directory of the Git source code tarball):
$ make configure ;# as yourself
$ ./configure --prefix=/usr ;# as yourself
$ make all doc ;# as yourself
# make install install-doc install-html ;# as root
Wednesday, June 27, 2012
how to change the language of Windows 7 Business
Of course, you can pay some money to upgrade your Windows 7 Business to Ultimate/Enterprise for the privilege of changing languages.
There is also one free solution: using Vistalizator:
http://www.froggie.sk/download.html
This website also provides the language files, MUI language pack, for different Windows systems.
Wednesday, June 13, 2012
Berkeley language model and Google Web 1T language model
English
Chinese
Czech
Dutch
French
German
Italian
Polish
Portuguese
Romanian
Spanish
Swedish
The homepage of the Berkeley language model project is here, and you can find the binary language models of the Google Web 1T here.
Tuesday, June 12, 2012
Static variables in Python
(1) How to use static variables in Python classes
class Foo(object):
    counter = 0  # class attribute shared by all instances

    def __call__(self):
        Foo.counter += 1
        print(Foo.counter)

foo = Foo()
foo()  # prints 1
foo()  # prints 2
foo()  # prints 3
(2) How to use static variable in Python functions (Python does not really have static variables in functions, so here we use the attribute of a function instead of real static variables)
def myfunc():
    if not hasattr(myfunc, "counter"):
        myfunc.counter = 0  # it doesn't exist yet, so initialize it
    myfunc.counter += 1
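For comparison, here is a minimal sketch of the function-attribute approach next to a closure-based alternative. The closure version is not from the approach above; it is an assumption of mine that it may be preferable when you want several independent counters. It needs Python 3 because of nonlocal, and the name make_counter is made up for illustration.

```python
def myfunc():
    """Count calls using an attribute stored on the function object."""
    if not hasattr(myfunc, "counter"):
        myfunc.counter = 0  # first call: create the attribute
    myfunc.counter += 1
    return myfunc.counter


def make_counter():
    """Return a new function that keeps its own private call count."""
    count = 0

    def counter():
        nonlocal count  # rebind the variable of the enclosing scope
        count += 1
        return count

    return counter


if __name__ == "__main__":
    print(myfunc(), myfunc())  # the attribute survives between calls
    first, second = make_counter(), make_counter()
    print(first(), first(), second())  # each closure counts separately
```

The function attribute is visible from outside as myfunc.counter, while the closure hides its state completely.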
Tuesday, June 5, 2012
How to download Jazzy
However, the sourceforge download page of Jazzy only offers the source code of Jazzy, without the necessary dictionaries, which makes Jazzy hard to use.
One possible solution that I just found is to download Jazzy from its CVS repository, where you can click the link Download GNU tarball to download a tarball of the complete Jazzy.
Sunday, May 27, 2012
How to use bitBucket with EGit in Eclipse
To set up a project in Eclipse, and push the project to bitBucket, you need to do the following steps:
(1) install EGit in Eclipse (http://www.eclipse.org/egit/);
(2) create an Eclipse project, e.g. HelloWorld; right click the project, and select Team->Share project... to add the project under Git control; right click the project again, and select Team->Add to index to add all the files of the project under version control; right click the project again, and select Team->Commit... to commit all the files;
(3) open an account on www.bitBucket.org, e.g. your account name is myaccount;
(4) configure the SSH in Eclipse:
click your project HelloWorld;
open menu Window->Preferences->General->Network Connections->SSH2;
since now you have no SSH keys (bitBucket needs SSH keys for SSH authorization), select Key Management tab and click the button Generate RSA Key... (You can also use DSA keys);
then you can see the public key in the text area, and you need to copy the public key and save it in your account on bitBucket (Account->SSH keys); you also need to click the button Save Private Key... to save the private key to your local directory;
click the General tab, and click the Add Private Key... button to choose the private key that you just saved;
click the OK button to apply all the changes;
(5) on bitBucket, create a repository named HelloWorld, and then you can get the SSH address of the repository as:
ssh://git@bitbucket.org/myaccount/HelloWorld.git
(6) right click the project in Eclipse, and select Team->Remote->Push...;
then enter the SSH address and choose SSH as the protocol; click the Next> button;
(7) click Add all branches spec button only, and then click the Next> button;
(8) click OK;
At this point, other developers can clone the project hosted on bitBucket, and they can also push changes to the repository.
However, although you can push changes to the remote repository, you cannot pull changes from the repository, since the pull operation is not configured to work with the remote repository.
To solve this problem, you have to add the following lines to the Git configuration file (in your eclipse project folder .git/config):
[remote "origin"]
	url = ssh://git@bitbucket.org/myaccount/HelloWorld.git
	fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
	remote = origin
	merge = refs/heads/master
Sunday, May 20, 2012
How to add an open file dialog in a Netbeans project
Adding the File Chooser
- Choose Window > Navigating > Inspector to open the Inspector window, if it is not open yet.
- In the Inspector, right-click the JFrame node. Choose Add From Palette > Swing Windows > File Chooser from the context menu.
GUI Builder Tip: As an alternative to the 'Add From Palette' context menu, you can also drag and drop a JFileChooser component from the Swing Windows category of the Palette to the white area of the GUI builder. It will have the same result, but it is a bit harder, because the preview of the JFileChooser is rather big and you might accidentally insert the window into one of the panels, which is not what you want.
- A look in the Inspector confirms that a JFileChooser was added to the form.
- Right-click the JFileChooser node and rename the variable to fileChooser.
Configuring the File Chooser
Implementing the Open Action
- Click to select the JFileChooser in the Inspector window, and then edit its properties in the Properties dialog box. Change the 'dialogTitle' property to This is my open dialog, press Enter and close the Properties dialog box.
- Click the Source button in the GUI Builder to switch to the Source mode. To integrate the File Chooser into your application, paste the following code snippet into the existing OpenActionPerformed() method.
private void OpenActionPerformed(java.awt.event.ActionEvent evt) {
    int returnVal = fileChooser.showOpenDialog(this);
    if (returnVal == JFileChooser.APPROVE_OPTION) {
        File file = fileChooser.getSelectedFile();
        try {
            // What to do with the file, e.g. display it in a TextArea
            textarea.read( new FileReader( file.getAbsolutePath() ), null );
        } catch (IOException ex) {
            System.out.println("problem accessing file"+file.getAbsolutePath());
        }
    } else {
        System.out.println("File access cancelled by user.");
    }
}
- If the editor reports errors in your code, right-click anywhere in the code and select Fix Imports or press Ctrl+Shift+I. In the Fix All Imports dialog box accept the defaults to update the import statements and click OK.
Implementing a File Filter
Now you add a custom file filter that makes the File Chooser display only *.txt files.
- Switch to the Design mode and select the FileChooser in the Inspector window.
- In the Properties window, click the ellipsis ("...") button next to the File Filter property.
- In the File Filter dialog box, select Custom Code from the combobox.
- Type new MyCustomFilter() in the text field. Click OK.
- To make the custom code work, you write an inner (or outer) class MyCustomFilter that extends the FileFilter class. Copy and paste the following code snippet into the source of your class below the import statements to create an inner class implementing the filter.
class MyCustomFilter extends javax.swing.filechooser.FileFilter {
    @Override
    public boolean accept(File file) {
        // Allow only directories, or files with ".txt" extension
        return file.isDirectory() || file.getAbsolutePath().endsWith(".txt");
    }
    @Override
    public String getDescription() {
        // This description will be displayed in the dialog,
        // hard-coded = ugly, should be done via I18N
        return "Text documents (*.txt)";
    }
}
Forwarded from:
http://netbeans.org/kb/docs/java/gui-filechooser.html
Tuesday, May 8, 2012
sentence-level alignment tools for statistical machine translation
(1) CTK: Champollion Tool Kit
http://champollion.sourceforge.net/
Note: this tool (from LDC) uses translation lexicons to align sentences. One disadvantage is that it does not work well when the two documents differ greatly in the number of sentences.
CTK v1.2 supports three language pairs:
English Chinese(GB)
English Chinese(UTF8)
English Arabic (UTF8)
English Hindi (UTF8)
(2) Gale-Church Aligner
This is a very old sentence-level alignment algorithm, and fortunately Chris Crowner has implemented it in the NLTK.
http://code.google.com/p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/align/align.py?r=8552&spec=svn8552
Note that the python code is in the nltk_contrib, not in the main release of NLTK.
(3) MTTK: Machine Translation Toolkit
http://mi.eng.cam.ac.uk/~wjb31/distrib/mttkv1/
Note: this tool is supposed to be able to do sentence-level alignment, but I still cannot figure out how to do it with the tool.
(4) Align
http://www.cse.unt.edu/~rada/wa/tools/aberger/align.html
Note: this tool was developed by Adam Berger, and can be downloaded from:
http://www.cse.unt.edu/~rada/wa/tools/aberger/align.tar
It supports sentence-level alignment using some anchor labels.
(5) Bleualign
https://github.com/rsennrich/Bleualign
This tool requires automatic translations of one side of the unaligned corpus and then uses a modified BLEU evaluation to find the sentence-level alignments. Of course, you need a seed SMT system to generate the automatic translations. The tool is written in Python.
One problem I found with this aligner is that it may use the same target-side sentence multiple times in the output alignments.
(6) Microsoft Bilingual Sentence Aligner
https://www.microsoft.com/en-us/download/details.aspx?id=52608
This is a sentence aligner written in Perl. It uses sentence length.
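To give a feeling for how length-based alignment works in general, here is a much simplified sketch. It is not the Gale-Church algorithm and not the method of any tool listed above: the cost is just the absolute difference in character length plus fixed penalties that I picked arbitrarily, whereas the real algorithms use a probabilistic model of length ratios. The function name and parameters are hypothetical.

```python
def align_by_length(src, tgt, skip_penalty=3, merge_penalty=2):
    """Align two sentence lists by dynamic programming over sentence lengths.

    Returns a list of (source indices, target indices) pairs.
    """
    # Allowed bead types: (source sentences, target sentences, penalty).
    beads = [(1, 1, 0), (1, 0, skip_penalty), (0, 1, skip_penalty),
             (2, 1, merge_penalty), (1, 2, merge_penalty)]
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                new_cost = cost[i][j] + abs(src_len - tgt_len) + penalty
                if new_cost < cost[ni][nj]:
                    cost[ni][nj] = new_cost
                    back[ni][nj] = (i, j)
    # Follow the back pointers from the end to recover the beads.
    result = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        result.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    result.reverse()
    return result


if __name__ == "__main__":
    print(align_by_length(["aaaa", "bbbb"], ["AAAABBBB"]))
```

For a language pair like English and Chinese, raw character lengths differ systematically, so a real aligner would at least scale the lengths before comparing them.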
Thursday, April 26, 2012
How to use the new Bing translator API with access tokens
Updated on December 6th 2017 to use Microsoft Azure accounts
Step 0: sign up for a Microsoft Azure account
- Sign up for a Microsoft Azure account at http://azure.com
- After you have an account go to http://portal.azure.com
- Select the + New option.
- Select AI + Cognitive Services from the list of services.
- Select Translator Text API. You may need to click "See all" or search to see it.
- Fill out the rest of the form, and select the Create button.
- You are now subscribed to Microsoft Translator Text API.
- Go to All Resources and select the Microsoft Translator API you subscribed to.
- Go to the Keys option and copy your subscription key to access the service.
Step 1: get access token
After this curl command, you will get an access token which is valid for a short period.
Step 2: get translation using the obtained access token
You will get some output from curl like (the last line is the real HTTP response):
HTTP/1.1 200 OK
Content-Length: 83
Content-Type: application/xml; charset=utf-8
X-MS-Trans-Info: 0916.V2_Rest.Translate.1D6C05C5
Date: Mon, 10 Apr 2017 17:50:43 GMT
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">哈哈</string>
References
(2) For details about what languages are supported and what languages are using Neural Machine Translation (NMT) models:
https://www.microsoft.com/en-us/translator/languages.aspx
Wednesday, April 18, 2012
Linux shell: stop ctrl+s
This may not be what you like.
To disable this feature, you can add one line to your .bashrc file in your home directory:
stty -ixon
Friday, March 2, 2012
Train huge language models
(1.1) counting ngrams
Don't use ngram-count directly to count N-grams. Instead, use the make-batch-counts and merge-batch-counts scripts described in training-scripts(1). That way you can create N-gram counts limited only by the maximum file size on your system.
(1.2) training language models from ngram counts
You are likely to run out of memory either because of the size of ngram counts, or of the LM being built. The following are strategies for reducing the memory requirements for training LMs.
a) Assuming you are using Good-Turing or Kneser-Ney discounting, don't use ngram-count in "raw" form. Instead, use the make-big-lm wrapper script described in the training-scripts(1) man page.
b) Switch to using the "_c" or "_s" versions of the SRI binaries. For instructions on how to build them, see the INSTALL file. Once built, set your executable search path accordingly, and try make-big-lm again.
c) Raise the minimum counts for N-grams included in the LM, i.e., the values of the options -gt2min, -gt3min, -gt4min, etc. The higher order N-grams typically get higher minimum counts.
d) Get a machine with more memory. If you are hitting the limitations of a 32-bit machine architecture, get a 64-bit machine and recompile SRILM to take advantage of the expanded address space. (The MACHINE_TYPE=i686-m64 setting is for systems based on 64-bit AMD processors, as well as recent compatibles from Intel.) Note that 64-bit pointers will require a memory overhead in themselves, so you will need a machine with significantly, not just a little, more memory than 4GB.
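The idea behind counting in batches and merging afterwards can be sketched in a few lines of Python. This is only an illustration of the principle, not how the SRILM scripts are implemented: the real scripts write each batch to disk, while this sketch keeps everything in memory, adds no sentence boundary markers, and uses function names that I made up.

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count the n-grams of whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def merge_counts(partial_counts):
    """Sum several partial count tables into one."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)  # Counter.update adds counts together
    return total


if __name__ == "__main__":
    corpus = ["a b a b", "a b c"]
    partial = [count_ngrams(batch, 2) for batch in batches(corpus, 1)]
    print(merge_counts(partial))
```

Because each batch is counted independently, only one batch has to fit in memory at a time if the partial tables are stored on disk between the two phases.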
(2) using IRSTLM
Training a language model from huge amounts of data can be very expensive in terms of memory and time. The IRSTLM toolkit features algorithms and data structures suitable to estimate, store, and access very large LMs. IRSTLM is open source and can be downloaded from here.
Typically, LM estimation starts with the collection of n-grams and their frequency counters. Then, smoothing parameters are estimated for each n-gram level; infrequent n-grams are possibly pruned and, finally, a LM file is created containing n-grams with probabilities and back-off weights. This procedure can be very demanding in terms of memory and time if applied to huge corpora. IRSTLM provides a simple way to split LM training into smaller and independent steps, which can be distributed among independent processes.
The procedure relies on a training script that makes little use of computer memory and implements the Witten-Bell smoothing method. (An approximation of the modified Kneser-Ney smoothing method is also available.) First, create a special directory stat under your working directory, where the script will save lots of temporary files; then, simply run the script build-lm.sh as in the example:
build-lm.sh -i "gunzip -c corpus.gz" -n 3 -o train.irstlm.gz -k 10
The script builds a 3-gram LM (option -n) from the specified input command (-i), by splitting the training procedure into 10 steps (-k). The LM will be saved in the output (-o) file train.irstlm.gz with an intermediate ARPA format. This format can be properly managed through the compile-lm command in order to produce a compiled version or a standard ARPA version of the LM.
For a detailed description of the procedure and of other commands available under IRSTLM please refer to the user manual supplied with the package.
Thursday, February 23, 2012
Moses: recaser issues
I found that the moses/scripts/recaser/recase.perl actually does a lot of things other than using Moses to translate uncased text to cased text:
(1) by default the moses.ini configuration file of the MT system for recasing uses distortion-limit 6, which means it allows reordering, and the recase.perl script changes the distortion-limit to 1 by passing the option "-dl 1" to the Moses decoder.
(2) the recase.perl script also uses some rules to do recasing, e.g., for English, it will always keep some specific words ("a","after","against","al-.+","and","any","as","at","be","because","between","by","during","el-.+","for","from","his","in","is","its","last","not","of","off","on","than","the","their","this","to","was","were","which","will","with") in lower case;
(3) the script also uppercases the initial word of a sentence.
Monday, February 20, 2012
Moses: pruning phrase tables
http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc16
(1) I first download the source code of SALM from:
http://projectile.sv.cmu.edu/research/public/tools/salm/salm.htm
then I go to the directory:
SALM/Distribution/Linux
and run command:
make allO32
make allO64
(There are some errors: make: *** No rule to make target `../../Bin/Linux/Search/SampleNGramIns.O32', needed by `allO32'. )
Note that I compiled SALM using g++-4.1; I had tried g++-4.4, but it failed.
(2) I found that the latest Moses obtained with git has no sub-directory named sigtest-filter, so I copied sigtest-filter from an older version of Moses obtained with svn.
I go to the directory sigtest-filter, and run command:
make SALMDIR=/path/to/SALM
(using g++-4.4)
Friday, February 10, 2012
Python: multi threading problem
In CPython/Python, there is an important lock named global interpreter lock (GIL), which is the mechanism used by the CPython interpreter to assure that only one thread executes Python bytecode at a time. This simplifies the CPython implementation by making the object model (including critical built-in types such as dict) implicitly safe against concurrent access. Locking the entire interpreter makes it easier for the interpreter to be multi-threaded, at the expense of much of the parallelism afforded by multi-processor machines. Past efforts to create a “free-threaded” interpreter (one which locks shared data at a much finer granularity) have not been successful because performance suffered in the common single-processor case. It is believed that overcoming this performance issue would make the implementation much more complicated and therefore costlier to maintain.
The GIL actually prevents threads from running Python code in parallel. It is only a problem for CPU-bound work in Python; it is not a big problem for I/O-bound threads.
Solutions:
(1) Jython is free of GIL;
(2) Cython;
Thursday, February 9, 2012
Compiling latest Moses from git
./bjam --with-srilm=srilm-1.6.0 --with-irstlm=irstlm-5.70.04 --with-giza=giza-pp --with-boost=boost_1_48_0 -j1
to compile the latest Moses checked out using command git,
I got the following errors:
gcc.compile.c++ moses/src/LM/bin/gcc-4.4.1/release/debug-symbols-on/link-static/threading-multi/Factory.o
In file included from moses/src/LM/ORLM.h:8,
from moses/src/LM/Factory.cpp:41:
moses/src/DynSAInclude/onlineRLM.h:22: error: reference to ‘Vocab’ is ambiguous
moses/src/LM/SRI.h:33: error: candidates are: struct Vocab
moses/src/DynSAInclude/vocab.h:17: error: class Moses::Vocab
My solution is to replace all the "Vocab" with "Moses::Vocab" in moses/src/DynSAInclude/onlineRLM.h.
Sunday, February 5, 2012
Python: buffering problem when using 'for line in sys.stdin'
for line in sys.stdin:
    print(line)
With this loop, after I type a sentence into the terminal, I get no output.
After investigating for a while, I came to the following solution (using readline() instead):
import sys

infile = sys.stdin
while True:
    line = infile.readline()
    if not line:  # an empty string means end of input
        break
    print(line)
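As far as I understand, the cause is that the file iterator in Python 2 uses a hidden read-ahead buffer, while readline() returns as soon as one line is available. The same fix can be written more compactly with the two-argument form of iter(). The following is a minimal sketch in Python 3 syntax; the function name echo_lines is made up for illustration.

```python
import io
import sys


def echo_lines(stream, out):
    """Copy lines from stream to out one at a time; return the line count."""
    count = 0
    # iter(callable, sentinel) calls readline() until it returns "" (EOF).
    for line in iter(stream.readline, ""):
        out.write(line)
        out.flush()  # make each line visible immediately
        count += 1
    return count


if __name__ == "__main__":
    echo_lines(sys.stdin, sys.stdout)
```

Writing the line with out.write() instead of print also avoids the extra blank line, because each line already ends with its own newline.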
Saturday, February 4, 2012
Perl bug: splitting UTF-8 encoded Chinese string
Thursday, January 19, 2012
Moses phrase-based decoder analysis
(2). Main.cpp first calls parameter->LoadParam(argc, argv) to load and check the parameters in the moses.ini configuration file and command line, where the model files are not loaded
(3). Main.cpp then calls StaticData::LoadDataStatic(parameter) to load weights and models according to the parameters of (2)
(3.1) StaticData::LoadDataStatic(parameter) calls StaticData::LoadData(Parameter *parameter)
(3.1.1) in StaticData::LoadData(Parameter *parameter), we load the weights and models by calling, e.g., StaticData::LoadLanguageModels(), LoadPhraseTables()
(3.1.1.1) in StaticData::LoadLanguageModels() calls LanguageModel* CreateLanguageModel(LMImplementation lmImplementation, const std::vector
In Moses, the major specific interfaces of LM classes like LanguageModelInternal are: bool load(...) and float GetValue(const std::vector
(4). Main.cpp uses IOWrapper *ioWrapper = GetIODevice(staticData) to setup the input device (an input file or standard input)
(5). Main.cpp uses vector
(6). Main.cpp starts the main loop of translating input instances (text, confusion network, or lattice):
(6.1). use ReadInput(*ioWrapper,staticData.GetInputType(),source) to load an input, which is saved in source
(6.2). set up the translation manager by calling Manager manager(*source, staticData.GetSearchAlgorithm()), where by calling staticData.InitializeBeforeSentenceProcessing(source) we initialize the translation/language models for this sentence; the language model list is StaticData.m_languageModel; the default search algorithm is SearchNormal;
(6.3). expand translation hypotheses stack by stack until the end of the input sentence using manager.ProcessSentence()
(6.3.1). ProcessSentence() first resets the statistics using staticData.ResetSentenceStats(m_source)
(6.3.2). ProcessSentence() then collects translation options for the input sentence
(6.3.3). ProcessSentence() calls the search algorithm to process the input using m_search->ProcessSentence()
(6.4). pick the best translation (maximum a posteriori decoding)
Wednesday, January 18, 2012
Sunday, January 15, 2012
How to install Ruby in your local directory from source code
Saturday, January 14, 2012
Drawing figures with GNUplot
There are a lot of helpful examples on wikimedia:
http://commons.wikimedia.org/wiki/Category:Gnuplot_diagrams
How to burn CN image onto a DVD disc using NERO
2. in the popup window, click the tab titled ISO.
3. in the ISO tab, click the button OPEN to select the image that you want to burn to the disc.
4. after your selection, you return to the original window; its top-left corner shows CD in a drop-down menu, and you need to click it and select DVD instead.
5. finish the burning process as usual.