Statistical Natural Language Tagging

Description:

      Statistical Natural Language Tagging [CHARNIAK93] consists of assigning a lexical category to each word in a text. Usually, the first step in the parsing process is to identify which lexical category (noun, verb, etc.) each word in the sentence belongs to, i.e., to tag the sentence. The difficulty of the tagging process comes from the lexical ambiguity of the words: many words can belong to more than one lexical class. The assignment of a tag to a word depends on the tags assigned to the other words. Let us consider the following words and their possible tags. 

Rice: NOUN
flies: NOUN, VERB
like: PREP, VERB
sand: NOUN
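
Because "flies" and "like" are each ambiguous, the sentence "Rice flies like sand" admits several candidate tag sequences. The following sketch (the lexicon and function names are illustrative, not from any standard library) enumerates them:

```python
from itertools import product

# Toy tag lexicon from the example above; the tag set is reduced
# to NOUN, VERB, PREP for illustration.
LEXICON = {
    "Rice":  ["NOUN"],
    "flies": ["NOUN", "VERB"],
    "like":  ["PREP", "VERB"],
    "sand":  ["NOUN"],
}

def candidate_taggings(sentence):
    """Enumerate every tag sequence compatible with the lexicon."""
    options = [LEXICON[w] for w in sentence]
    return [list(tags) for tags in product(*options)]

taggings = candidate_taggings(["Rice", "flies", "like", "sand"])
print(len(taggings))  # 1 * 2 * 2 * 1 = 4 candidate sequences
```

A statistical tagger must choose among these four readings; for instance, "flies" as a VERB with "like" as a PREP, versus "flies" as a NOUN with "like" as a VERB.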

    The most common statistical models for tagging are based on assigning a probability to a given tag according to its neighboring tags (context). Then the tagger tries to maximize the total probability of the tagging of each sentence. The raw data needed to estimate this probability are extracted from a training text. That is, we need a hand-tagged text such as the one provided in [BROWN79]. Then, the score of a sentence tagging can be defined as the sum of the log-probabilities of the context of each word $w_i$, i.e., $\sum_i f(w_i)$. The log-probability of a context can be defined as

\begin{displaymath}f(w) = \log P(T \vert LC, RC)\end{displaymath}


where P(T|LC, RC) is the probability that the tag of word w is T, given that its context is formed by the sequence of tags LC to the left and the sequence RC to the right. This probability is estimated from the training text as

\begin{displaymath}P(T \vert LC, RC) \approx \frac{occ(LC, T, RC)}{\sum_{T' \in {\cal T}} occ(LC, T', RC)}\end{displaymath}


where occ(LC, T, RC) is the number of occurrences of the list of tags LC, T, RC in the training text and ${\cal T}$ is the set of all possible tags of $w_i$.


Instances and best known solutions for those instances:

    The best statistical models typically achieve an accuracy of about 96% [BRILL95,ARAUJO02].

Related Papers:

[ARAUJO02] L. Araujo. "Part-of-speech tagging with evolutionary algorithms." In Proc. of the Int. Conf. on Intelligent Text Processing and Computational Linguistics (CICLing-2002), Lecture Notes in Computer Science 2276, pp. 230-239. Springer-Verlag, 2002.
[BRILL95] E. Brill. "Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging." Computational Linguistics, 21(4), 1995.
[BROWN79] W.N. Francis and H. Kucera. "Brown Corpus of Standard American English." Brown University, Providence, RI. Available online from the Linguistic Data Consortium (LDC).
[CHARNIAK93] E. Charniak. "Statistical Language Learning." MIT Press, 1993.


Last Updated: 4/2/03