LANGUAGE ENGINEERING
by Hristo Georgiev

TABLE OF CONTENTS

Preface
Part one
PREPARATORY WORK
I Dictionary of Wordforms
1 Construction of the Dictionary of Wordforms
2 C (C++) source code associated with the morphological, grammatical, syntactical and semantical information
II Disambiguation procedures
1 Disambiguation instructions programmed in C (C++)
2 Ambiguity Noun or Verb
3 Ambiguity Adjective or Verb
4 Ambiguity Noun or Adjective
5 Ambiguity of an individual word
III Parsing
1 The Sentence
2 Simple sentence
3 Clause
4 Complex Sentence
5 Parsing of the Sentence
6 Verbal Tenses
7 Parsing instructions programmed in C (C++)
8 Finding the Subject of the Sentence
9 Finding the Object of the Sentence
10 Finding the Complement of the Sentence
11 Recognition of the Verb
12 Recognition of the Verbal Tense
13 Parsing errors


Part two


SPHERES OF APPLICATION
IV Orthographical Rules
1 Recognition of a misspelt word
V Grammatical Rules
1 Transfer of grammatical information
2 English grammatical rules
3 Impossible combination of two words
4 Wrong use of -ly when making an Adverb from an Adjective
5 Wrong Use of a Degree
6 Wrong use of the Article
7 Disagreement in Number or Person
8 Wrong use of the Particle to before a Modal Verb
9 Incorrect Verbal Form
10 Unrecognized irregular form of a Verb
11 Unrecognized irregular form of a Noun
12 Wrong use of which and who
13 Wrong use of a Preposition
14 German grammatical rules
15 Wrong grammatical ending
16 Wrong agreement between two words in Gender
17 Wrong agreement between two words in Number
18 Wrong agreement between two words in Person
19 Wrong agreement between two words in Case
20 Wrong agreement between Adjective and Noun in Case, Number and Gender
21 Wrong use of ‘haben’ and its paradigm with a Past Participle
22 Wrong use of ‘sein’ and its paradigm with a Past Participle
23 Wrong use of ‘zu’ with forms of the Main Verb
24 Wrong use of a Relative Pronoun or Article after Transitive or Dative Verb
25 Impossible combination of two words
26 Wrong use of the Reflexive Pronoun
27 Wrong use of Article in Genitive (between two Nouns)
28 Wrong reference of a Relative Pronoun
29 Wrong use of a Preposition
30 French grammatical rules
31 Italian grammatical rules


VI Lexical semantics
1 Thesauri - Softhesaurus, Linguaterm and Geoatlas
VII Representation of knowledge
VIII Machine Translation
1 Construction of the bilingual dictionary
2 Source code used to define the information about the word
3 Choosing of the right meaning depending on context
4 Choosing of the correct grammatical ending for the translated word
5 Choosing of the right word sequence for the target language
6 Blocking a word in the translation
7 Inserting a word in the translation
8 Known difficulties and problems
IX Question Answering
1 Dividing the knowledge about the word and the world into fields and groups
2 Adding semantical, pragmatical, etc. information to each word in the DW and programming this information
3 Making a list of all possible questions. Examples
4 Ready-made answer to a question, presented only in the DW. Examples
5 Programmed answer to a particular question. Examples
X Content Recognition and Text Attribution to a particular subject field
1 How to use the meaning of the word-groups for text attribution
2 Making a list of all possible subject fields. Example
3 Counting the occurrences of a particular meaning in a text for decision making
XI Information Retrieval
XII German and French Sequences of Parts of Speech
Index of abbreviations
References
Appendix
German dictionary of segments
French dictionary of segments
Sample of source code compilable with Borland compiler versions 2-4.52
Index of terms

Preface

This book is the result of merging two disciplines: the study of language (linguistics) and computer
programming. The new discipline is called ‘linguistic programming’, also known as
‘computational linguistics’ (a term used to include statistical methods as well), which some prefer to
call ‘natural language processing’, ‘language engineering’ or ‘linguistic engineering’ because of
the ever-increasing role of computers in language study, teaching, technology and the
computerisation of all aspects of language. Many languages have a parallel term for this branch of
linguistics, ‘computer linguistics’.
We will concentrate upon those linguistic aspects of language that need to be programmed for one
purpose or another. In view of this, we hope that our readers will know linguistics, programming
or both.
This book will describe written language only and will not concern spoken language at all.
However, since spoken language and written language have many things in common, we hope
that the subject matter of this book will find successful application to spoken language as well.
This book will not discuss all languages in the world. This is impossible to do in a single book.
However, we will do our best to show techniques which will be applicable to most languages.
We will show which techniques are common to most languages and which are language-specific.
As far as the language-specific techniques are concerned, we will apply them to English, French and
German only, but we will give examples, wherever necessary, in Italian, Turkish, Bulgarian
and Russian.
For example, number and gender are universal grammatical features. Some of the language-specific
grammatical features are Case and the expression of Verbal Tenses. Russian, for example, has
only three Verbal Tenses, while English has over thirty (counting all the different Moods). Russian
and German have Cases; Bulgarian and English do not. Vietnamese varies the meaning of a word by
changing the tone with which it is pronounced, and so on.
Our aim is not to make a comparative study but to show natural language processing techniques
applicable to most languages. The programming language used to express the natural language
processing techniques will be C.
It is not possible to present them in every programming language, but we hope that readers
and programmers familiar with other programming languages will find the necessary
parallels and draw the appropriate conclusions. C remains one of the most widely used
programming languages.
Ideas that have found practical implementation in one programming language and on one
platform (Operating System), as is the case here, can be successfully applied to other programming
languages and other platforms.
The examples in C will be explained so that readers who are not familiar with it, or who do not know
any programming language, will still be able to understand the grammatical and other rules expressed in C.
On the other hand, programmers who are not familiar with linguistics do not have to learn it
before they read this book. Here they will learn what they need to know about linguistics,
with programming examples. As far as linguists are concerned, they will learn from this book
what they need to know about programming linguistic and language-related tasks. This will help
them present linguistic knowledge in a way that is easier for programmers and for the computer to
understand, and will show them in which areas of linguistics their research is most needed.
The book will offer practical programming solutions for software development in the sphere of
language engineering. These solutions have already been tested and work efficiently in a number of
software programs developed by the author.
The book is intended for programmers, software developers and linguists involved in Natural Language
Processing (NLP), and as a textbook for all NLP courses.
Language Engineering is a vast subject. We will ignore the theories and concentrate upon the practice
of Dictionary making for computational purposes, including tagging (the foundation of all
applications); Parsing (the grammatical and syntactical analysis of the sentence, used by all other
applications); Orthographical and Grammatical Spell-checking; Lexical Semantics for computational
purposes, as found in the Thesauri and other dictionaries; Machine Translation; and natural language
understanding, namely Question Answering and Text Attribution.
It is the practice, the achievement and the comparison with other results and achievements that matter
in the end. Theories only help to achieve a better result.
Future applications such as intelligent searching for information on the Internet, robotics and
language-related Artificial Intelligence (thinking, understanding, reasoning, decision making) can also
profit from this book.
All this considered, may the reader find Language Engineering an appropriate title for this
book.