What are corpora?

Generally speaking, “corpora” (the plural of “corpus”) are large collections of texts that can be read and processed by a computer, usually for research. Corpora can include texts in one or several languages, and may also carry additional information such as part-of-speech annotations and “alignment”, that is, source and target language text segments matched to each other as part of a translation memory. Research on monolingual corpora is valuable in areas such as language teaching and voice recognition.

While some corpora are privately held, others may be used by the public at no charge. Significant free English-language corpora include:

  • iWeb, released in May 2018, is drawn from nearly 95,000 websites. By far the world’s largest and most widely used corpus, it includes approximately 14 billion words.
  • News on the Web (NOW) contains more than 8.2 billion words used in online newspapers and magazines from 2010 onward. Approximately 140 to 160 million words are added to the NOW corpus every month.
  • Global Web-Based English, or GloWbE (“globe”), includes nearly two billion words from twenty different countries. It is unique in enabling comparisons among different varieties of English.
  • Wikipedia Corpus, with about 1.9 billion words, is a collection of all 4.4 million Wikipedia articles.
  • The Hansard corpus includes almost every speech delivered in the British Parliament from 1803 until 2005. It contains about 1.6 billion words.

Using corpora in machine translation

Corpora are an indispensable component of certain methods of machine translation (MT), especially Statistical Machine Translation and Example-Based Machine Translation algorithms.

Statistical Machine Translation (SMT) creates translations statistically by comparing the text to be translated against a very large corpus of existing translations. The SMT algorithm searches for statistical correlations between source phrases and their translations, with stronger correlations indicating a higher likelihood of an accurate translation. SMT is currently the most widely used method of MT. With sufficiently large corpora, a translation engine can be trained for any pair of languages or field of expertise.
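
The idea of scoring correlations between source and target text can be illustrated with a very small sketch. It is not how a production SMT engine works (those use far more elaborate alignment and language models); it only counts how often words co-occur in aligned sentence pairs and scores each pair with the Dice coefficient. The toy English-German corpus and the function names are assumptions made up for this illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical toy parallel corpus: (source sentence, target sentence) pairs.
PARALLEL_CORPUS = [
    ("the house", "das haus"),
    ("the book", "das buch"),
    ("a book", "ein buch"),
    ("a house", "ein haus"),
]

def cooccurrence_scores(corpus):
    """Score every (source word, target word) pair with the Dice coefficient."""
    src_counts, tgt_counts, pair_counts = Counter(), Counter(), Counter()
    for src, tgt in corpus:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_counts.update(src_words)   # sentences containing each source word
        tgt_counts.update(tgt_words)   # sentences containing each target word
        pair_counts.update(product(src_words, tgt_words))  # joint occurrences
    return {
        (s, t): 2 * n / (src_counts[s] + tgt_counts[t])
        for (s, t), n in pair_counts.items()
    }

def best_translation(word, scores):
    """Return the target word most strongly correlated with `word`, or None."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

scores = cooccurrence_scores(PARALLEL_CORPUS)
print(best_translation("house", scores))
```

With only four sentence pairs the scores are crude, which is the point of the paragraph above: the statistics become reliable only when the corpus is very large.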

Example-Based Machine Translation (EBMT) algorithms search a corpus for similar sentence fragments and assemble their translations into complete sentences. This method, which has not been widely used commercially, relies heavily upon finding very similar examples.
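
The retrieval step of this approach can be sketched as a fuzzy lookup. The example below uses Python's standard difflib.SequenceMatcher to compare word sequences. The example base, the similarity threshold, and the function name are assumptions for illustration only, and the sketch covers retrieval alone: a real system would still have to adapt and recombine the retrieved translations.

```python
from difflib import SequenceMatcher

# Hypothetical example base: source fragments paired with stored translations.
EXAMPLES = {
    "press the red button": "drücken Sie den roten Knopf",
    "close the door": "schließen Sie die Tür",
    "open the window": "öffnen Sie das Fenster",
}

def closest_example(fragment, examples, threshold=0.6):
    """Return (source, translation, similarity) for the best match, or None.

    Similarity is computed on word sequences; matches scoring below
    `threshold` (an assumed cut-off) are rejected.
    """
    best = None
    for source, target in examples.items():
        ratio = SequenceMatcher(None, fragment.split(), source.split()).ratio()
        if best is None or ratio > best[2]:
            best = (source, target, ratio)
    if best is None or best[2] < threshold:
        return None
    return best

print(closest_example("press the green button", EXAMPLES))
```

The threshold makes the dependence on "very similar examples" explicit: when nothing in the example base is close enough, the lookup returns nothing rather than a poor match.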

What is controlled natural language?

“Controlled natural language” (CNL) is a subset of natural language (the language humans use for everyday communication) in which the grammar and vocabulary are restricted in order to eliminate or reduce ambiguity. Natural language is rife with informal usage rules and multiple layers of cultural nuance and meaning. This is problematic both for non-native speakers and for machines, and it makes accurate translation extremely difficult.

CNL helps mitigate this ambiguity by limiting vocabulary and employing a simplified grammar and basic usage rules. For example, sentences are kept short, only dictionary-approved words are used, and pronouns and the passive voice are avoided. Examples of CNL include:

  • Basic English: The most famous CNL for human communication, Basic English was developed in the 1930s to assist non-native speakers in learning English more quickly. Its vocabulary and grammar are derived from standard English, but the vocabulary is reduced to about 850 words, especially by the elimination of most verbs, and grammar rules are greatly simplified. A modified version of Basic English, “Simple English”, is used for the Simple English Wikipedia, designed for children and non-native English speakers.
  • Simplified Technical English: ASD Simplified Technical English (ASD-STE100) was developed for the aerospace industry to make maintenance documentation easier to understand, and has since been adopted by other industries, including defense and medicine. ASD-STE100 includes about 870 words and 60 rules for writing. Each word can be only one part of speech and have one meaning. For example, “test” may be used as a noun but not as a verb, and “follow” can only mean “come after”, not “obey”.
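
Rules such as a restricted vocabulary and short sentences lend themselves to automated checking. The following is a minimal sketch of such a checker, not an implementation of Basic English or ASD-STE100: the approved word list, the 20-word limit, and the function name are all invented for illustration. It also does not check part of speech (for example, “test” as a noun only), which would require a grammatical tagger.

```python
import re

# Hypothetical approved vocabulary and length limit; real CNLs such as
# ASD-STE100 define their own dictionary and writing rules.
APPROVED_WORDS = {
    "the", "a", "do", "test", "of", "pump", "before", "you", "start", "engine",
}
MAX_WORDS_PER_SENTENCE = 20

def check_sentence(sentence, approved=APPROVED_WORDS,
                   max_words=MAX_WORDS_PER_SENTENCE):
    """Return a list of human-readable rule violations for one sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    problems = []
    if len(words) > max_words:
        problems.append(f"too long: {len(words)} words (limit {max_words})")
    # dict.fromkeys removes repeated words while keeping their first-seen order.
    for word in dict.fromkeys(words):
        if word not in approved:
            problems.append(f"unapproved word: {word}")
    return problems

print(check_sentence("Do a test of the pump before you start the engine."))
print(check_sentence("Verify the pump."))
```

An empty list means the sentence passed every check in this sketch; each string in a non-empty list names one violation an author would need to fix.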

Controlled natural language in machine translation

The unambiguous, univocal nature of controlled natural language helps ensure more accurate output from MT programs, output that in most cases still requires some human error-checking and editing. Limiting ambiguity and complex sentence structure in the source language yields better translation results and significantly reduces the need for human intervention. This not only produces a better product but also lowers costs significantly.

Using CNL, authors can produce documents that are easily comprehended and retained. This can reduce the required level of technical support, permit automation of routine editing tasks and enable the application of objective measurements of quality. Most importantly for machine translation, CNL produces uniform, standardized source documents, increasing the MT match rate and reducing costs. Building MT engines for controlled language texts is also less expensive, and translation reliability is higher.

The greater consistency in vocabulary and style of a CNL comes at the cost of reduced creative freedom. It is thus most suitable for applications such as software strings, help and customer support documents, and technical specifications and documentation.