EESTEC Magazine Vol 33 2013/2 | Page 52

the subword items and exposing the semantic information underlying the subword items boosts the performance of translation systems. In this thesis, the most effective ways to expose the semantic information underlying the subword items are investigated. The major motivation behind this study is to introduce a method that benefits from both the rule-based and the statistical approaches at the same time.

Word frequencies are used to determine whether a word is decomposed with the rule-based or the statistical approach. The frequency threshold for this decision is learned through a set of experiments. Based on this threshold, frequent words are decomposed by stochastic methods, and rare words are decomposed by a rule-based morphological analyzer. The late fusion-based approach combines the outputs of these two separate modules to produce a unique word decomposition hypothesis. The corpus size used during the experiments is adjusted in intervals to measure how the different approaches react to sparse as well as dense training data.

The overall system description in this study is shown below. Best results are indicated by a green background color.

While the rule-based morphological analyzer performs better given sparse training data, the introduced fusion-based approach outperforms both the rule-based and the statistical approaches when the training data is dense. The rule-based approach owes its success to data sparsity, because the statistical approaches and the stochastic flavor of the hybrid systems need dense data to produce a more realistic probability distribution for better estimations.

In this study, the frequency-driven late fusion-based word decomposition approach is introduced to improve the translation quality of the phrase-based statistical machine translation system from Turkish to English. This late fusion-based approach is compared with the standalone statistical and rule-based word decomposition approaches under a changing corpus size.
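
The frequency-driven selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system built in the thesis: the names `make_fusion_decomposer`, `toy_rule_based`, and `toy_statistical` are hypothetical, the two toy decomposers are simple placeholders for a real morphological analyzer and a real character n-gram model, and the threshold value is arbitrary, whereas the study learns it experimentally.

```python
from collections import Counter
from typing import Callable, Iterable, List

Decomposer = Callable[[str], List[str]]


def make_fusion_decomposer(
    corpus_tokens: Iterable[str],
    threshold: int,
    rule_based: Decomposer,
    statistical: Decomposer,
) -> Decomposer:
    """Build a decomposer that picks a module by word frequency.

    Words seen at least `threshold` times go to the statistical module;
    rarer (or unseen) words go to the rule-based module.
    """
    counts = Counter(corpus_tokens)

    def decompose(word: str) -> List[str]:
        if counts[word] >= threshold:
            return statistical(word)
        return rule_based(word)

    return decompose


def toy_rule_based(word: str) -> List[str]:
    """Placeholder analyzer: strips a tiny, hypothetical suffix list."""
    suffixes = ("ler", "lar", "de", "da")
    morphemes: List[str] = []
    stripped = True
    while stripped:
        stripped = False
        for suffix in suffixes:
            # Keep at least two characters of stem after stripping.
            if word.endswith(suffix) and len(word) > len(suffix) + 1:
                morphemes.insert(0, suffix)
                word = word[: -len(suffix)]
                stripped = True
                break
    return [word] + morphemes


def toy_statistical(word: str) -> List[str]:
    """Placeholder for a learned model: fixed split after four characters."""
    if len(word) <= 4:
        return [word]
    return [word[:4], word[4:]]


# Toy corpus: "evlerde" is frequent, "kitaplar" is rare.
tokens = ["evlerde", "evlerde", "evlerde", "kitaplar"]
fused = make_fusion_decomposer(tokens, threshold=2,
                               rule_based=toy_rule_based,
                               statistical=toy_statistical)

print(fused("evlerde"))   # frequent word, handled by the statistical placeholder
print(fused("kitaplar"))  # rare word, handled by the rule-based placeholder
```

Passing the two decomposers in as plain functions keeps the selection logic independent of how each module is implemented, so either placeholder could be swapped for a real component without changing the dispatch.
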
This novel approach fuses the rule-based and stochastic word decomposition methods. To summarize, this study introduces a novel frequency-driven late fusion-based word decomposition technique for building more accurate phrase-based statistical machine translation systems. The proposed method is also compared with the well-known rule-based morphological analysis and character-based n-gram modeling word decomposition approaches. In addition, corpus size experiments are conducted on a scale from 10,000 to 160,000 parallel sentences. The rule-based, statistical, and fusion-based approaches are tested and compared with one another. When the training corpus is relatively small, the rule-based approaches perform much better, as has been stated repeatedly in the literature. However, this study clearly shows that the fusion-based approach outperforms the others when the training corpus is sufficiently large and the training instances are dense. The dense training data set experiments use 160,000 parallel sentences, and this is the only publicly available corpus for the Turkish language. Moreover, it was possible to compare the results with the benchmark scores, and the baseline setup performs very similarly to those in previous studies. The fusion-based approach, however, results in around a 10% better BLEU score than the baseline setup. The larger corpora may result in mor