Word Alignment Using Giza++ and Cygwin on Windows

DOI: 10.17577/IJERTV2IS50775


Ms. Ankita Sati

M.Tech Student, Banasthali Vidyapith, Banasthali, Jaipur

Abstract

Word alignment is the natural language processing task of identifying translational relationships among words and multi-word units in parallel corpora. Automatic word alignment in bilingual or multilingual parallel corpora has long been a challenging issue for natural language processing. GIZA++ is a widely used automatic word alignment tool. In a Linux environment it is common and convenient to use the word alignments generated by GIZA++ for most statistical machine translation systems, yet many researchers conduct their work on the Windows platform. In this paper we discuss how we used GIZA++ on Windows and present the resulting output.

  1. Introduction

    Word alignment is the task of identifying translational relations between words in parallel corpora with the aim of re-using them in natural language processing. Typical applications that make use of word alignment techniques are machine translation and multi-lingual lexicography.

    Several tools have been developed for automatic word alignment. Some of them are listed below:

    1. Giza++

    2. The Berkeley Word Aligner

    3. Natura Alignment Tools (NATools)

    4. UNL aligner

    5. Geometric Mapping and Alignment (GMA)

      There are many applications for word alignment in natural language processing, and most of them depend on the quality of the word alignment (Och and Ney, 2000; Yarowsky and Wicentowski, 2000). A frequently used application of word alignment is the automatic extraction of bilingual lexicons and terminology from parallel corpora (Smadja et al., 1996; Melamed, 2000). Och and Ney (2003) compared various methods for computing word alignment using statistical and heuristic models and then developed a statistical word alignment toolkit, GIZA++, which is the most widely used package in statistical machine translation (SMT) today.

  2. Giza++

    GIZA++ is part of the statistical machine translation toolkit used to train IBM Model 1 through Model 5 (Brown et al., 1993) and the Hidden Markov Model (HMM) (Och et al., 2003). It is part of the SMT toolkit EGYPT, which was developed by the SMT team during the 1999 summer workshop at the Center for Language and Speech Processing at Johns Hopkins University (CLSP/JHU) [1].

    GIZA++ is an extension of GIZA that adds many features. The extensions of GIZA++ were designed and written by Franz Josef Och.

    GIZA++ has the following features:

    • Implements the full IBM-4 alignment model

    • Implements IBM-5, dependency on word classes, and smoothing

    • Implements the HMM alignment model

    • Smoothing for fertility and distortion/alignment parameters

    • Improved perplexity calculation for the IBM-1, IBM-2 and HMM models

  3. Installation on Windows

    The basic aim of this paper is to show how GIZA++ works on Windows. For this we first installed GIZA++ on Windows, using Cygwin to run it. Cygwin is free software that provides a Linux-like environment and software tool set to users of any modern version of MS Windows for x86 CPUs (NT/2000/XP/Vista/7). Cygwin consists of a UNIX system-call emulation library, cygwin1.dll, together with a vast set of GNU and other free software applications organized into a large number of optional packages.

      1. Installation of Cygwin

        Download the Cygwin setup program from www.cygwin.com/setup.exe, and then double-click it to install Cygwin on your computer. Install the binaries (.bin) of the following additional packages:

        • make

        • g++ (currently 4.3.4)

        • autoconf

        • automake

        • libtool

        • boost

        • libboost (newer than 1.31.0)

        • flip (optional, but useful)

    Some of these packages will also prompt you to install related or dependent packages. These additional packages are generally necessary, so just accept the defaults. Run Cygwin now for the first time, so that your home directory (/home/your username/) is created.

      2. Installation of Giza++

        For installation of Giza++ the following packages are needed:

        GIZA++ package

        • developed by Franz Och

        • www-i6.informatik.rwth-aachen.de/Colleagues/och

          mkcls package

        • developed by Franz Och

        • www-i6.informatik.rwth-aachen.de/Colleagues/och

          Both packages are available in a single tar file, which can be downloaded from http://code.google.com/p/giza-pp/downloads/list.

          In order to compile GIZA++, g++ compiler version 3.3 or higher is needed. GIZA++ and mkcls are built by running make in the extracted giza-pp directory; the top-level Makefile issues $(MAKE) -C GIZA++-v2 and $(MAKE) -C mkcls-v2.
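          For concreteness, a minimal build sequence might look like the following (the archive name giza-pp-v1.0.7.tar.gz is an assumption; use whatever file the download page provides):

          tar xzf giza-pp-v1.0.7.tar.gz   # extract the downloaded archive
          cd giza-pp
          make                            # builds both GIZA++-v2 and mkcls-v2 via the top-level Makefile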

          After compilation the following binary files are created:

        • Giza++.exe

        • mkcls.exe

        • snt2cooc.out

        • plain2snt.out

          Copy the binaries into a top-level ~/bin/ directory for easy access.
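          A sketch of the copy step, assuming the build was done in the giza-pp directory (under Cygwin the files may carry an .exe suffix, as listed above):

          mkdir -p ~/bin
          cp GIZA++-v2/GIZA++ GIZA++-v2/plain2snt.out GIZA++-v2/snt2cooc.out mkcls-v2/mkcls ~/bin/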

          Some other scripts are also needed for lowercasing and tokenizing the sentences. For this, download scripts.tgz from http://www.statmt.org/wmt07/baseline.html and extract it into a folder named scripts. These scripts include:

        • Tokenizer scripts/tokenizer.perl

        • Lowercaser scripts/lowercase.perl
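          Assuming scripts.tgz was saved in the current directory, the extraction might look like this (whether the archive already contains a top-level scripts/ folder is an assumption to verify):

          mkdir -p scripts
          tar xzf scripts.tgz -C scripts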

  4. Implementation Details

    In this step we describe how we align the parallel corpora using GIZA++. For this we first need the parallel corpora, converted to one sentence per line. Check that the two corpora have the same number of lines and that they are correctly aligned. Word alignment is done between two languages, which we call the source language and the target language. Deciding which language will be the source and which the target is important in order to do the word alignment correctly. We use English as the source language and Hindi as the target language, and we name the source-language file raw.src and the target-language file raw.trg.

    After this, the following steps are performed:

      1. Pre-processing

        We have to clean up the corpora, set every word in lower case, and separate the words from each other (i.e., tokenize them). Since we use English as the source language and Hindi as the target language, tokenizing and lowercasing are done only for English.

        Now enter the subdirectory scripts, and take the script tokenizer.perl and the directory nonbreaking_prefix (they should be in the same directory!).

        The nonbreaking prefixes let the tokenizer keep together tokens such as Mr.. Normally the tokenizer would have broken it into two tokens, Mr and ., but here the final dot belongs to the abbreviation and is not real punctuation.

        For tokenizing the source language run the following command:

        tokenizer.perl -l en < raw.src > raw.tok.src

        After tokenizing, run the lowercasing script on the source language:

        lowercase.perl < raw.tok.src > raw.lw.src
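        Equivalently, the two pre-processing steps can be chained in a single pipeline producing the same raw.lw.src:

        tokenizer.perl -l en < raw.src | lowercase.perl > raw.lw.src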

      2. Create files needed for GIZA++

        The files used in word alignment must first be converted into GIZA++ format. Use plain2snt.out to convert the corpus into GIZA++ format. The steps are:

        • Run plain2snt.out located within the GIZA++ package

        • ./plain2snt.out raw.lw.src raw.trg

          Files created by plain2snt:

        • raw.lw.src.vcb

        • raw.trg.vcb

        • raw.trg_raw.lw.src.snt

        • raw.lw.src_raw.trg.snt

          raw.lw.src.vcb consists of:

          • each word from the English corpus

          • a corresponding frequency count for each word

          • a unique id for each word

          raw.trg.vcb consists of:

          • each word from the Hindi corpus

          • a corresponding frequency count for each word

          • a unique id for each word

          raw.trg_raw.lw.src.snt consists of:

          • each sentence from the parallel English and Hindi corpus, translated into the unique numbers for each word
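          For illustration, a .vcb file is a plain-text table with one entry per line in the form id word frequency, and a .snt file stores each sentence pair as three lines: the occurrence count of the pair, the first-named corpus's sentence as word ids, and the second-named corpus's sentence as word ids. The ids and numbers below are invented, not taken from our corpus:

          raw.lw.src.vcb:
          2 house 12
          3 the 153

          raw.trg_raw.lw.src.snt:
          1
          2 5 9
          8 3 3 12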

      3. Create mkcls files needed for GIZA++

        After this we have to generate word classes using mkcls. This can be done by running the following commands:

        • Run mkcls, which is not located within the GIZA++ package (it ships separately as mkcls-v2 in the same tar file)

        • ./mkcls -praw.lw.src -Vraw.lw.src.vcb.classes

        • ./mkcls -praw.trg -Vraw.trg.vcb.classes

          Files created by mkcls

          • raw.lw.src.vcb.classes

          • raw.lw.src.vcb.classes.cats

          • raw.trg.vcb.classes

          • raw.trg.vcb.classes.cats

            The .vcb.classes files contain:

          • an alphabetical list of all words (including punctuation)

          • each word's corresponding class number

            The .vcb.classes.cats files contain:

          • a list of class numbers

          • the set of words belonging to each class
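            In practice mkcls is often run with a few extra options; a sketch (the class count 50 and the two optimization runs are conventional choices rather than requirements, and the trailing opt argument tells mkcls to actually perform the optimization):

            ./mkcls -c50 -n2 -praw.lw.src -Vraw.lw.src.vcb.classes opt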

      4. Run GIZA++

        After creating the word classes we run GIZA++, located within the GIZA++ package, by issuing the following command:

        ./GIZA++ -S raw.lw.src.vcb -T raw.trg.vcb -C raw.lw.src_raw.trg.snt

        After running this command various files are created; the main word alignment is in actual.ti.final. The files are:

        • Decoder.config

        • ti.final

        • actual.ti.final

        • perp

        • trn.src.vcb

        • trn.trg.vcb

        • tst.src.vcb

        • tst.trg.vcb

        • a3.final

        • A3.final

        • t3.final

        • d3.final

        • d4.final

        • D4.final

        • n3.final

        • p0_3.final

        • gizacfg

    Decoder.config

    • file used with the ISI Rewrite Decoder

    • developed by Daniel Marcu and Ulrich Germann

      trn.src.vcb

    • list of English words with their unique ids and frequency counts

    • similar to raw.lw.src.vcb

      trn.trg.vcb

    • list of Hindi words with their unique ids and frequency counts

    • similar to raw.trg.vcb

      tst.src.vcb

    • blank

      tst.trg.vcb

    • blank

      ti.final

    • file contains word alignments from the English and Hindi corpora

    • word alignments are given as the specific words' unique ids

    • the probability of each alignment is given after each set of numbers

    Ex:

    3 0 0.237882

    1171 1227 0.963072

    actual.ti.final

    • file contains word alignments from the English and Hindi corpora

    • word alignments are the actual words, not their unique ids

    • the probability of each alignment is given after each set of words

      Ex: of NULL 0.237882

      affidavits 0.25

      perp

      • list of perplexities for each iteration and model; the file consists of a header line and one row per iteration, for example:

        #trnsz tstsz iter model trn-pp tst-pp trn-vit-pp tst-vit-pp

        19 0 0 Model1 204.141 N/A 5098.14 N/A

      • the columns are: trnsz = training size, tstsz = test size, iter = iteration, trn-pp = training perplexity, tst-pp = test perplexity, trn-vit-pp = training Viterbi perplexity, tst-vit-pp = test Viterbi perplexity

      a3.final

      • contains a table with the following format:

        • i j l m p(i | j, l, m)

        • j = position in the target sentence

        • i = position in the source sentence

        • l = length of the source sentence

        • m = length of the target sentence

        • p(i | j, l, m) = the probability that a source word in position i is moved to position j in a pair of sentences of lengths l and m

          Ex:

          0 1 1 60 5.262135e-06

          Here 0 is the position in the target sentence, 1 is the position in the source sentence, 1 is the length of the source sentence, 60 is the length of the target sentence, and 5.262135e-06 is the probability that a source word in position 1 is moved to position 0 in a pair of sentences of lengths 1 and 60.

          d3.final

      • similar to a3.final with positions i and j switched

        n3.final

      • contains the probability of each source token having zero fertility, one fertility, …, N fertility

        t3.final

      • table after all iterations of Model 4 training

        d4.final

      • translation table for Model 4

        D4.final

      • distortion table for IBM-4

        gizacfg

      • contains the parameter settings that were used in this training

      • training can be duplicated exactly

        p0_3.final

      • probability of inserting null after a source word

      • file contains 0.883714
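      For convenience, the steps of this section can be collected into one small shell script. This is a minimal sketch under the file-name assumptions used throughout the paper, with tokenizer.perl, lowercase.perl and the GIZA++ binaries assumed to be reachable from the current directory:

      #!/bin/sh
      # end-to-end sketch of the alignment pipeline described in Section 4
      tokenizer.perl -l en < raw.src > raw.tok.src     # tokenize the English source
      lowercase.perl < raw.tok.src > raw.lw.src        # lowercase the English source
      ./plain2snt.out raw.lw.src raw.trg               # convert the corpus to GIZA++ format
      ./mkcls -praw.lw.src -Vraw.lw.src.vcb.classes    # word classes for the source
      ./mkcls -praw.trg -Vraw.trg.vcb.classes          # word classes for the target
      ./GIZA++ -S raw.lw.src.vcb -T raw.trg.vcb -C raw.lw.src_raw.trg.snt   # run the alignment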

  5. Experimental Results

    For word alignment we use an English–Hindi parallel corpus containing 500 sentences. After running these sentences through GIZA++, the accuracy of the output is 50% to 60%. Compared with the results in a Linux environment, the output is nearly the same.

    GIZA++ allows aligning one token from the source language to multiple tokens in the target language, i.e. one-to-many alignments, but does not allow multiple tokens from the source language to align to the same target token. Due to this asymmetry, running GIZA++ with the source and target languages swapped produces different alignments. Because we desire a high level of precision for pre-alignments, we run GIZA++ twice, alternating the order of source and target languages, and then take the intersection of the resulting alignments. The intersection necessarily contains only one-to-one alignments due to the restrictions of the GIZA++ structures.
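    A sketch of the bidirectional procedure (the output prefixes e2h and h2e are arbitrary names passed to GIZA++'s -o option; the intersection step itself is left to a small script or an external symmetrization tool):

    # English -> Hindi
    ./GIZA++ -S raw.lw.src.vcb -T raw.trg.vcb -C raw.lw.src_raw.trg.snt -o e2h
    # Hindi -> English (source and target swapped)
    ./GIZA++ -S raw.trg.vcb -T raw.lw.src.vcb -C raw.trg_raw.lw.src.snt -o h2e
    # finally, keep only the alignment links that appear in both outputs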

    A variant of GIZA++ is MGIZA++, which additionally allows users to save trained model states.

  6. Conclusion

The use of word alignment in natural language processing has increased dramatically in recent years, especially in the development of statistical machine translation. Manual word alignment can be an expensive, time-consuming process, especially given the data volumes produced at organizations such as the Linguistic Data Consortium. In this paper, we adapted the word alignment tool GIZA++ to work in a Windows environment. The contribution of this work is to find a way to make GIZA++ run under Windows.

7. References

  1. F. J. Och, GIZA++: Training of statistical translation models. [Online]. Available: http://fjoch.com/GIZA++.html

  2. Apertium Wiki, Using GIZA++. [Online]. Available: http://wiki.apertium.org/wiki/Using_GIZA%2B

  3. GIZA++ software. [Online]. Available: http://code.google.com/p/giza-pp/downloads/list

  4. Statistical Machine Translation System: User Manual and Code Guide. [Online]. Available: http://www.statmt.org/moses/manual/manual.pdf
