While the final goal of a statistical machine translation system is to create a model of the target sentence E given the source sentence F, P(E | F), in this chapter we will take a step back, and attempt to create a language model of only the target sentence P(E). Basically, this model allows us to do two things that are of practical use.
Assess naturalness: Given a sentence E, this can tell us, does this look like an actual, natural sentence in the target language? If we can learn a model to tell us this, we can use it to assess the fluency of sentences generated by an automated system to improve its results. It could also be used to evaluate sentences generated by a human for purposes of grammar checking or error correction.
Generate text: Language models can also be used to randomly generate text by sampling a sentence E' from the target distribution: E' ∼ P(E). Randomly generating samples from a language model can be interesting in itself (we can see what the model "thinks" is a natural-looking sentence), but it will be more practically useful in the context of the neural translation models described in the following chapters.
In the following sections, we'll cover a few methods used to calculate this probability P(E).
3.1 Word-by-word Computation of Probabilities
As mentioned above, we are interested in calculating the probability of a sentence E = e_1^T. Formally, this can be expressed as
P(E) = P(|E| = T, e_1^T),    (3)
the joint probability that the length of the sentence is T (|E| = T), that the identity of the first word in the sentence is e_1, the identity of the second word in the sentence is e_2, up until the last word in the sentence being e_T. Unfortunately, directly creating a model of this probability distribution is not straightforward, as the length of the sequence T is not determined in advance, and there are a large number of possible combinations of words.
As a way to make things easier, it is common to re-write the probability of the full sentence as the product of single-word probabilities. This takes advantage of the fact that a joint probability, for example P(e_1, e_2, e_3), can be calculated by multiplying together conditional probabilities for each of its elements. In the example, this means that P(e_1, e_2, e_3) = P(e_1) P(e_2 | e_1) P(e_3 | e_1, e_2).
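To make this identity concrete, here is a small sketch (my own illustration, not part of the original text) that estimates a joint distribution over three-word sequences from a handful of made-up sentences and checks that multiplying the conditional probabilities recovers the joint probability:

```python
from collections import Counter

# Toy corpus of three-word sentences (made-up data, for illustration only).
corpus = [
    ("she", "went", "home"),
    ("she", "went", "out"),
    ("he", "went", "home"),
    ("she", "stayed", "home"),
]

joint = Counter(corpus)
total = sum(joint.values())

def p_joint(e1, e2, e3):
    """Empirical joint probability P(e1, e2, e3)."""
    return joint[(e1, e2, e3)] / total

def p_cond(word, context):
    """Empirical conditional probability P(word | context); context is a tuple of words."""
    numer = sum(n for sent, n in joint.items() if sent[:len(context) + 1] == context + (word,))
    denom = sum(n for sent, n in joint.items() if sent[:len(context)] == context)
    return numer / denom

# Chain rule: P(e1, e2, e3) = P(e1) * P(e2 | e1) * P(e3 | e1, e2)
e1, e2, e3 = "she", "went", "home"
product = p_cond(e1, ()) * p_cond(e2, (e1,)) * p_cond(e3, (e1, e2))
print(product, p_joint(e1, e2, e3))  # both equal 0.25, up to floating-point rounding
```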
Figure 2 shows an example of this incremental calculation of probabilities for the sentence "she went home". Here, in addition to the actual words in the sentence, we have introduced an implicit sentence end symbol ("⟨/s⟩"), which we use to indicate that the sentence has terminated. Stepping through the equation in order, this means we first calculate the probability of "she" coming at the beginning of the sentence, then the probability of "went" coming next in a sentence starting with "she", the probability of "home" coming after the sentence prefix "she went", and then finally the sentence end symbol "⟨/s⟩" after "she went home". More generally, we can express this as the following equation:
P(E) = \prod_{t=1}^{T+1} P(e_t | e_1^{t-1}),    (4)
where e_{T+1} = ⟨/s⟩. So coming back to the sentence end symbol ⟨/s⟩, we introduce this symbol because it allows us to know when the sentence ends. In other words, by examining the position of the ⟨/s⟩ symbol, we can determine the |E| = T term in our original LM joint probability in Equation 3. In this example, when we have ⟨/s⟩ as the 4th word in the sentence, we know we're done and our final sentence length is 3.
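To show how Equation 4 is used to score a sentence, here is a minimal sketch (my own illustration; `next_word_prob` is a hypothetical placeholder for any model that supplies P(e_t | e_1^{t-1}), such as those in the following sections). It appends ⟨/s⟩ to the word sequence and accumulates the per-word log probabilities; working in log space avoids numerical underflow for long sentences:

```python
import math

END = "</s>"  # sentence-end symbol, written ⟨/s⟩ in the text

def sentence_log_prob(words, next_word_prob):
    """Compute log P(E) following Equation 4: the sum of log P(e_t | e_1^{t-1}).

    `next_word_prob(word, context)` is a hypothetical stand-in for whatever
    model supplies the next-word probabilities.
    """
    log_p = 0.0
    context = []
    for word in words + [END]:  # e_{T+1} = </s> terminates the product
        log_p += math.log(next_word_prob(word, tuple(context)))
        context.append(word)
    return log_p

# Usage with a toy uniform model over a 5-word vocabulary (including </s>):
uniform = lambda word, context: 1.0 / 5
print(math.exp(sentence_log_prob(["she", "went", "home"], uniform)))  # (1/5)^4 = 0.0016
```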
Once we have the formulation in Equation 4, the problem of language modeling now becomes a problem of calculating the probability of the next word given the previous words, P(e_t | e_1^{t-1}). This is much more manageable than calculating the probability for the whole sentence, as we now have a fixed set of items that we are looking to calculate probabilities for. The next couple of sections will show a few ways to do so.
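This next-word formulation also gives a direct recipe for the text-generation use mentioned at the start of the chapter: repeatedly sample a word from P(e_t | e_1^{t-1}) until ⟨/s⟩ is drawn. Below is a minimal sketch under the same assumption of a hypothetical `next_word_dist` function (not part of the original text) that returns a probability distribution over the next word:

```python
import random

END = "</s>"  # sentence-end symbol

def sample_sentence(next_word_dist, max_len=50):
    """Sample E' ~ P(E) one word at a time (ancestral sampling).

    `next_word_dist(context)` is assumed to return a dict mapping each
    candidate next word to P(word | context).
    """
    context = []
    while len(context) < max_len:
        dist = next_word_dist(tuple(context))
        words, probs = zip(*dist.items())
        word = random.choices(words, weights=probs)[0]
        if word == END:  # drawing </s> terminates the sentence
            break
        context.append(word)
    return context

# Usage with a hand-specified toy distribution (illustration only):
toy = lambda context: {"she": 0.3, "went": 0.3, "home": 0.2, END: 0.2}
print(sample_sentence(toy))
```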
3.2 Count-based n-gram Language Models
The first way to calculate probabilities is simple: prepare a set of training data from which we can count word strings, count up the number of times we have seen a particular string of words, and divide it by the number of times we have seen the context. This simple method can be expressed by the equation below, with an example shown in Figure 3:

P_{ML}(e_t | e_1^{t-1}) = c(e_1^t) / c(e_1^{t-1}),    (5)

where c(·) is the number of times its argument was observed in the training data.
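As a concrete sketch of this counting procedure (my own illustration on a tiny made-up training set, not the tutorial's actual data or code), the snippet below counts every sentence prefix in the training data and estimates each next-word probability as the count of the word string divided by the count of its context:

```python
from collections import Counter

END = "</s>"  # sentence-end symbol

def train_counts(corpus):
    """Count every sentence prefix in the training data (sentences are word lists)."""
    prefix_counts = Counter()
    for sentence in corpus:
        words = sentence + [END]
        for t in range(len(words) + 1):
            prefix_counts[tuple(words[:t])] += 1
    return prefix_counts

def prob_ml(word, context, prefix_counts):
    """Maximum-likelihood estimate: count of the word string divided by count of its context."""
    return prefix_counts[context + (word,)] / prefix_counts[context]

# Tiny made-up training set, for illustration only.
corpus = [["she", "went", "home"], ["she", "went", "out"], ["he", "went", "home"]]
counts = train_counts(corpus)
print(prob_ml("went", ("she",), counts))         # 2 / 2 = 1.0
print(prob_ml("home", ("she", "went"), counts))  # 1 / 2 = 0.5
```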