the subword items and exposing the
semantic information underlying the
subword items boosts the performance of translation systems. In
this thesis, the most effective ways to
expose the semantic information underlying the subword items are investigated. The major motivation behind
this study is to introduce a method
that benefits from both the rule-based
and statistical approaches at the same
time. Word frequencies are used
to determine whether a word is decomposed
by the rule-based or the statistical approach. This decision threshold
is learned through a set of experiments.
Based on this threshold, frequent
words are decomposed by stochastic methods, and the rare words are
decomposed by using a rule-based
morphological analyzer. The late fusion-based approach combines the
outputs of these two separate modules to produce a unique word decomposition hypothesis. The corpus
size used during the experiments is
varied in intervals to measure how
the different approaches respond to
both sparse and dense training data.
The overall system description in this study is shown below:
Best results are indicated by a green
background color. While the rule-based morphological analyzer performs better given sparse training
data, the introduced fusion-based
approach outperforms both the rule-based and statistical approaches
when the training data is dense. The
rule-based approach owes its success to data sparsity, because the
statistical approaches and the stochastic flavor of the hybrid systems need
dense data to produce a more realistic probability distribution for better
estimations.
In this study, the frequency-driven late
fusion-based word decomposition
approach is introduced to improve
the translation quality of the phrase-based statistical machine translation
system from Turkish to English. This
late fusion-based approach is compared with the standalone statistical
and rule-based word decomposition
approaches with a changing corpus
size. This novel approach fuses both
the rule-based and stochastic word
decomposition methods.
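
The frequency-driven routing described above can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the names `build_fusion_decomposer`, `toy_statistical`, and `toy_rule_based` are hypothetical, the two decomposers are toy stand-ins for a real statistical segmenter and a real morphological analyzer, and treating a frequency equal to the threshold as "frequent" is an assumption.

```python
from collections import Counter
from typing import Callable, Iterable, List

Decomposer = Callable[[str], List[str]]


def build_fusion_decomposer(
    corpus_tokens: Iterable[str],
    threshold: int,
    statistical: Decomposer,
    rule_based: Decomposer,
) -> Decomposer:
    """Return a decomposer that picks a module per word by corpus frequency."""
    freq = Counter(corpus_tokens)

    def decompose(word: str) -> List[str]:
        # Assumption: words seen at least `threshold` times count as frequent.
        if freq[word] >= threshold:
            return statistical(word)
        # Rare and unseen words fall back to the rule-based module.
        return rule_based(word)

    return decompose


def toy_statistical(word: str) -> List[str]:
    """Toy stand-in: cut the word into fixed chunks of three characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


def toy_rule_based(word: str) -> List[str]:
    """Toy stand-in: strip one suffix from a tiny hand-written list."""
    for suffix in ("lerde", "ler", "lar", "de", "da"):
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[:-len(suffix)], "+" + suffix]
    return [word]


tokens = ["evler", "evler", "evler", "kitaplar"]
decompose = build_fusion_decomposer(tokens, 2, toy_statistical, toy_rule_based)
print(decompose("evler"))     # frequent word, statistical stand-in
print(decompose("kitaplar"))  # rare word, rule-based stand-in
```

In the study the threshold is learned experimentally; here it is simply passed in as a parameter.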
To summarize, in this study, a novel
frequency-driven late fusion-based
word decomposition technique is
introduced to build more accurate
phrase-based statistical machine
translation systems. This proposed
method is also compared with the
well-known rule-based morphological
analysis and character-based n-gram
modeling word decomposition approaches.
In addition, corpus size experiments are conducted on a scale from
10,000 to 160,000 parallel sentences.
The rule-based, statistical, and
fusion-based approaches are tested
and compared with one another.
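
One way to set up such a corpus-size sweep is sketched below in Python. The text gives only the end points (10,000 and 160,000 parallel sentences), so the doubling schedule is an assumption, as are the helper names `corpus_size_schedule` and `nested_subsets` and the choice of nested prefixes as training subsets.

```python
from typing import Dict, List, Sequence, Tuple

SentencePair = Tuple[str, str]


def corpus_size_schedule(smallest: int, largest: int) -> List[int]:
    """Return sizes from `smallest` up to `largest`, doubling each step (assumed)."""
    sizes = []
    size = smallest
    while size < largest:
        sizes.append(size)
        size *= 2
    sizes.append(largest)
    return sizes


def nested_subsets(
    pairs: Sequence[SentencePair], sizes: List[int]
) -> Dict[int, Sequence[SentencePair]]:
    """Take nested prefixes so each smaller training set is contained in the larger ones."""
    return {n: pairs[:n] for n in sizes if n <= len(pairs)}


schedule = corpus_size_schedule(10_000, 160_000)
print(schedule)
```

Each subset would then be used to train and score the rule-based, statistical, and fusion-based systems under the same conditions.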
When the training corpus
is relatively small, the rule-based approaches perform much better, as
is stated repeatedly in the literature.
However, this study clearly shows that
the fusion-based approach outperforms the others when the training corpus size
is sufficient and the training instances
are dense. 160,000 parallel sentences are used for the dense training
data set experiments; this is the
only publicly available corpus for the
Turkish language. Moreover, it was
possible to compare the results with
the benchmark scores, and the baseline setup performs very similarly to
those in the previous studies. However, the fusion-based approach results
in a BLEU score around 10% better
than the baseline setup. The larger
corpora may result in mor