WHAT WE MEAN BY COMPUTATIONAL LINGUISTICS

Computational linguistics may be regarded as a synonym for automatic natural language processing, since its main task is precisely the construction of computer programs that process words and texts in natural language.

The processing of natural language should be considered here in a very broad sense that will be discussed later.

Actually, this course is slightly “more linguistic than computational,” for the following reasons:

· We are mainly interested in the formal description of language relevant to automatic language processing, rather than in purely algorithmic issues. The algorithms, the corresponding programs, and the programming technologies can vary, while the basic linguistic principles and methods of their description are much more stable.

· In addition to some purely computational issues, we also touch upon issues that are related to computer science only indirectly. A broader set of notions and models of general linguistics and mathematical linguistics is described below.

For the purposes of this course, it is also useful to draw a line between the issues in text processing we consider linguistic—and thus will discuss below—and the ones we will not. In our opinion, for a computer system or its part to be considered linguistic, it should use some data or procedures that are:

· language-dependent, i.e., change from one natural language to another,

· large, i.e., require a significant amount of work for compilation.

Thus, not every program dealing with natural language texts is related to linguistics. Though such word processors as Windows Notepad do process texts in natural language, we do not consider them linguistic software, since they are not sufficiently language-dependent: they can be used equally well for processing Spanish, English, or Russian texts, after some alphabetic adjustments.

Let us take another example: some word processors can hyphenate words according to information about the vowels and consonants of a specific alphabet and about syllable formation in a specific language. Thus, they are language-dependent. However, they do not rely on sufficiently large linguistic resources. Therefore, simple hyphenation programs only border on the software that can be considered linguistic proper. Spell checkers that use a large word list and complicated morphologic tables, by contrast, are linguistic programs proper.

WORD, WHAT IS IT?

As the reader may have noticed, the term word was used in the previous sections very loosely. Its meaning seems obvious: any language operates with words, and any text or utterance consists of them. This notion seems so simple that, at first glance, it does not require any strict definition or further explanation: one might think that a word is just a substring of the text as a letter string, from one delimiter (usually, a space) to the next one (usually, a space or a punctuation mark). Nevertheless, the situation is not so simple.

Let us consider the Spanish sentence Yo devuelvo los libros el próximo mes, pero tú me devuelves el libro ahora. How many words does it contain? One can say 14 and will be right, since there are exactly 14 letter substrings from one delimiter to another in this sentence. One can also notice that the article el occurs twice, so that the number of different words (substrings) is 13. For these observations, no linguistic knowledge is necessary.
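The two purely mechanical counts above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name letter_substrings is our own, and we assume that any maximal run of letters is a substring of interest, which is adequate for this sentence but not for texts with hyphens, apostrophes, or digits.

```python
import re

SENTENCE = ("Yo devuelvo los libros el próximo mes, "
            "pero tú me devuelves el libro ahora.")

def letter_substrings(text):
    """Return the letter strings found between delimiters, in text order.

    Assumption: a delimiter is anything that is not a word character.
    In Python 3, \\w is Unicode-aware for str patterns, so accented
    letters such as ó and ú stay inside their strings.
    """
    return re.findall(r"\w+", text)

occurrences = letter_substrings(SENTENCE)  # every string, repeats included
distinct = set(occurrences)                # repeated strings collapse

print(len(occurrences))  # 14
print(len(distinct))     # 13, since "el" occurs twice
```

Note that no knowledge of Spanish is used here: the same code works unchanged for any language written with spaces between letter strings.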

However, one can also notice that devuelvo and devuelves are forms of the same verb devolver, and libros and libro are forms of the same noun libro, so that the number of different words is only 11. Indeed, these pairs of wordforms denote the same action or thing. If one additionally notices that the article los is essentially equivalent to the article el when the difference in grammatical number is ignored, then there are only 10 different words in this sentence. In all these cases, the “equivalent” strings are to some degree similar in their appearance, i.e., they have some letters in common.

Finally, one can consider me the same as yo, but given in the oblique grammatical case, even though these substrings have no letters in common. Under such an approach, the total number of different words is nine.

We can conclude from the example that the term word is too ambiguous to be used in a science whose objective is to give a precise description of a natural language. To introduce a more consistent terminology, let us call an individual substring used in a specific place of a text (without taking into account its possible direct repetitions or similarities to other substrings) a word occurrence. Now we can say that the sentence above consists of 14 word occurrences.

Some of the substrings (usually similar in appearance) have the same core meaning. We intuitively consider them different forms of some common entity. A set of such forms is called a lexeme. For example, in Spanish {libro, libros}, {alto, alta, altos, altas}, and {devolver, devuelvo, devuelves, devuelve, devolvemos...} are lexemes. Indeed, in each set the strings share some of their letters (the commonality being expressed by the patterns libro‑, alt‑, and dev...lv‑), and their meanings are equivalent (namely, ‘book’, ‘high’, and ‘to bring back’, respectively). Each element of such a set, i.e., a letter string considered without regard to its position in the text, is called a wordform. Each word occurrence represents a wordform, while wordforms (but not word occurrences) can repeat in the text. Now we can say that the sentence in the example above contains 14 word occurrences, 13 different wordforms, or nine different lexemes. The considerations that gave other figures in the example above are linguistically inconsistent.
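The three counts can be illustrated by a short Python sketch. The table LEXEME_OF below is a hypothetical, hand-written fragment that covers only the wordforms of the example sentence; a real system would need a large dictionary or a morphological analyzer instead, which is exactly what makes such software linguistic. The function name count_units and the convention of naming an unlisted lexeme by its uppercased wordform are likewise our own assumptions.

```python
import re

SENTENCE = ("Yo devuelvo los libros el próximo mes, "
            "pero tú me devuelves el libro ahora.")

# Hypothetical toy table: wordform -> name of its lexeme.
# It lists only the wordforms that are not lexeme names themselves.
LEXEME_OF = {
    "devuelvo": "DEVOLVER", "devuelves": "DEVOLVER",
    "libros": "LIBRO",
    "los": "EL",
    "me": "YO",
}

def count_units(text):
    """Return (word occurrences, distinct wordforms, distinct lexemes)."""
    occurrences = re.findall(r"\w+", text)
    # A wordform is a string regardless of position; we also ignore case.
    wordforms = {w.lower() for w in occurrences}
    # Assumption: a wordform absent from the table names its own lexeme.
    lexemes = {LEXEME_OF.get(w, w.upper()) for w in wordforms}
    return len(occurrences), len(wordforms), len(lexemes)

print(count_units(SENTENCE))  # (14, 13, 9)
```

The first two numbers require no linguistic data at all, whereas the third depends entirely on the content of the table: removing the entries for los and me would give 11 instead of 9.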

A lexeme is identified by a name. Usually, one of its wordforms, i.e., a specific member of the wordform set, is selected for this purpose. In the previous examples, LIBRO, ALTO, and DEVOLVER are taken as the names of the corresponding lexemes. It is precisely these names that are used as titles of the corresponding entries in the dictionaries mentioned above. The dictionaries cover the available information about the lexemes of a given language, sometimes including morphologic information, i.e., information on how the wordforms of these lexemes are constructed. Various dictionaries compiled for the needs of lexicography, dialectology, and sociolinguistics have lexemes, rather than wordforms, as their entries.

Therefore, the term word, as well as its counterparts in other languages, such as Spanish palabra, is too ambiguous to be used in a book on linguistics. Instead, we should generally use the terms word occurrence for a specific string in a specific place in the text, wordform for a string regardless of its specific place in any text, and lexeme for a theoretical construct uniting several wordforms that correspond to a common meaning in the manner discussed above.

However, sometimes we will retain the habitual word word when it is obvious which of these more specific terms is actually meant.