EXTRACTION OF FACTUAL DATA FROM TEXTS

Extraction of factual data from texts is the task of automatically generating elements of a factographic database, such as fields, or parameters, from on-line texts. Often streams of current news from the Internet or from an information agency are used as the source of information for such systems, and the parameters of interest can be the demand for a specific type of product in various regions, the prices of specific types of products, events involving a particular person or company, opinions about a specific issue or a political party, etc.
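As a rough illustration of what filling database fields from news text can mean, the following minimal sketch extracts price facts with a single hand-written pattern. The pattern, the `PriceFact` record, and the sample sentences are all assumptions made for this example; a practical system would need far deeper linguistic analysis than a regular expression.

```python
import re
from typing import List, NamedTuple


class PriceFact(NamedTuple):
    """One record of a hypothetical factographic database."""
    product: str
    region: str
    price: float


# Hypothetical pattern covering sentences such as
# "the price of wheat in Texas rose to $7.25".
PRICE_PATTERN = re.compile(
    r"price of (?P<product>[a-z ]+?) in (?P<region>[A-Z][A-Za-z ]*?) "
    r"(?:rose|fell|climbed|dropped) to \$(?P<price>\d+(?:\.\d+)?)"
)


def extract_price_facts(text: str) -> List[PriceFact]:
    """Return one PriceFact per pattern match found in the text."""
    return [
        PriceFact(m["product"], m["region"], float(m["price"]))
        for m in PRICE_PATTERN.finditer(text)
    ]


# Invented sample text, used only to exercise the sketch.
news = (
    "The price of wheat in Texas rose to $7.25 on Monday. "
    "Analysts said the price of crude oil in Norway fell to $81 amid weak demand."
)
facts = extract_price_facts(news)
for fact in facts:
    print(fact)
```

Each match becomes one structured record, which is what allows the facts to be combined and compared later. Any sentence phrased differently from the pattern is silently missed, which hints at why the general task is so difficult.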

Decision makers in business and politics are usually too busy to read and comprehend all the relevant news in the time available, so they often have to hire many news summarizers and readers or even turn to a specialized information agency. This is very expensive, and even then important relationships between the facts may be lost, since each news summarizer typically has very limited knowledge of the subject matter. A fully effective automatic system could not only extract the relevant facts much faster, but also combine them, classify them, and investigate their interrelationships.

There are several laboratory systems of this type for business applications, e.g., a system that helps to explore news on the Dow Jones index, investments, and company merger and acquisition projects. Because the task is so difficult, only very large commercial corporations can nowadays afford research on the factual data extraction problem, or simply buy the results of such research.

This kind of problem is also interesting from the scientific and technical point of view. It remains very topical, and its solution has yet to be found. We are not aware of any such research in the world targeting the Spanish language so far.

TEXT GENERATION

The generation of texts from pictures and formal specifications is a comparatively new field; it arose about ten years ago. Some useful applications of this task have been found in recent years. Among them are multimedia systems that require a text-generating subsystem to illustrate the pictures through textual explanations. These subsystems produce coherent texts, starting from the features of the pictures.

Another very important application of systems of this kind is the generation of formal specifications in text form from quite formal technical drawings.

For example, compilation of a patent formula for a new device, often many pages long, is a boring, time-consuming, and error-prone task for a human. This task is much more suitable for a machine.

A specific type of such a system is a multilingual text generating system. In many cases, it is necessary to generate descriptions and instructions for a new device in several languages, or in as many languages as possible.

Due to the problems discussed in the section on translation, the quality of automatic translation of a manually compiled text is often very low.

Better results can be achieved by automatic generation of the required text in each language independently, from the technical drawings and specifications or from a text in a specific formal language similar to a programming language.
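The idea of generating each language independently from one formal description can be sketched as follows. This is only a toy illustration: the step format, the two-language lexicon, and the sentence templates are assumptions invented for the example, and real generators must handle agreement, word order, and much richer grammar.

```python
from typing import Dict, List

# Hypothetical per-language lexicon: maps a formal part identifier
# to a noun phrase, including the article required in that language.
LEXICON: Dict[str, Dict[str, str]] = {
    "en": {"bolt": "the bolt", "valve": "the valve"},
    "es": {"bolt": "el perno", "valve": "la válvula"},
}

# Hypothetical per-language sentence templates, one per formal action.
TEMPLATES: Dict[str, Dict[str, str]] = {
    "en": {"tighten": "Tighten {part} to {value} {unit}.", "open": "Open {part}."},
    "es": {"tighten": "Apriete {part} a {value} {unit}.", "open": "Abra {part}."},
}


def generate_step(step: dict, lang: str) -> str:
    """Render one formal instruction step as a sentence in the given language."""
    template = TEMPLATES[lang][step["action"]]
    part = LEXICON[lang][step["part"]]
    # Unused keyword arguments are ignored by str.format.
    return template.format(
        part=part, value=step.get("value", ""), unit=step.get("unit", "")
    )


def generate_manual(spec: List[dict], lang: str) -> str:
    """Render a whole formal specification as text in one language."""
    return " ".join(generate_step(step, lang) for step in spec)


# Invented formal specification shared by all target languages.
spec = [
    {"action": "open", "part": "valve"},
    {"action": "tighten", "part": "bolt", "value": 25, "unit": "N·m"},
]
for lang in ("en", "es"):
    print(lang, generate_manual(spec, lang))
```

No sentence is ever translated from one natural language into another: both outputs come directly from the same formal specification, so errors of source-language analysis cannot arise. The cost is that each target language needs its own lexicon and grammar, represented here by the two dictionaries.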

Text generating systems face, in general, half of the linguistic problems of a translation system, namely all of those connected with the grammar and lexicon of the target language. This is still a vast body of linguistic information, which is currently available in adequate detail for only a few very narrow subject areas.