Project Abstract:
The aim of this project is to add grammar-checking support for the Turkish language and to integrate it with an office suite. The suggested language checker, Language Tool, is a corpus-level, rule-based language checker that finds errors for which a rule is defined in its XML configuration files. Rules for more complicated errors will be written in Java.
Project Detailed Description:
Linguistic Background:
Syntactic analysis, or syntactic parsing, is the process of analyzing an input sequence in order to determine its grammatical structure, i.e. the grammatical relationships between the words of a sentence, with respect to a given grammar (in this project, that of Turkish). Turkish is an agglutinative language in which words are made up of a linear sequence of distinct morphemes, and each component of meaning is represented by its own morpheme. A morpheme is defined as the minimal meaning-bearing unit in a language. For example, the word “yollar” consists of two morphemes, “yol” and “lar”. Morphemes can be further categorized into two classes: stems and affixes. In the previous example, the morpheme “yol” is the stem of the word “yollar”, and the morpheme “lar” is an affix that makes the word plural. The rules specifying the ordering of morphemes are called morphotactics. For example, in Turkish the plural suffix “-ler” may follow nouns. Morphological features of words are produced through morphological analysis. Hence, any morphological processor needs the morphotactic rules, the orthographic (spelling) rules, and the lexicons (vocabulary) of its language. In this project, I will separately analyze those components with language processing methods such as Link Grammar, or
Distinctive properties of the Turkish language and their computational analysis [5]:
1.
Turkish has vowel harmony. During affixation, the vowels in a suffix have to agree in certain respects with the last vowel of the word being affixed. Because the spell-checker can handle vowel harmony, this is not a big problem for grammar checking.
2.
In Turkish, the basic word order is SOV, but constituent order may vary freely as demanded by the discourse context, so all six combinations of subject, object, and verb are possible. These orders should be considered while parsing the text, and new XML rules can be written according to this pattern.
I. O (S) okuluna (O) gidiyor (V) // She her school going
II. Okuluna (O) o (S) gidiyor (V) // Her school she going
III. Okuluna (O) gidiyor (V) o (S) // Her school going she
IV. Gidiyor (V) okuluna (O) o (S) // Going her school she
V. O (S) gidiyor (V) okuluna (O) // She going her school
VI. Gidiyor (V) o (S) okuluna (O) // Going she her school
3.
Turkish is head-final [1], meaning that modifiers always precede the modified item. These patterns should therefore be covered by the grammar check. For example:
- Objects of postpositions precede the postpositions.
Ezgi ile gittin. (You went with Ezgi)
Ezgi with (you went)
- Adverbs precede verbs or adjectives.
Çok iyi bir iş (A very good work)
Very good a work
- Adjectives precede nouns.
Zeki çocuk (The clever child)
Clever child
4.
Turkish is an agglutinative language with very productive inflectional and derivational suffixation. A given word form may involve multiple derivations [2]. Moreover, inflectional suffixes have grammatical roles. To add these patterns, new Java classes need to be written, or some basic suffixes such as the plural suffix can be added to the source XML file so that these rules are taught to the parser.
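As a rough illustration of how a suffix rule interacts with vowel harmony, the following minimal sketch picks the plural suffix from the last vowel of the stem. The class and method names are hypothetical and are not part of Language Tool or Zemberek; the Turkish vowels are written as Unicode escapes to keep the source ASCII-only, and loanwords that break harmony (for example “saat”, whose plural is “saatler”) are not handled.

```java
import java.util.Locale;

// Hypothetical sketch: choose the plural suffix by two-way vowel harmony.
final class PluralSuffixSketch {
    private static final Locale TR = Locale.forLanguageTag("tr");
    // Back vowels: a, dotless i, o, u. Front vowels: e, i, o-umlaut, u-umlaut.
    private static final String BACK_VOWELS = "a\u0131ou";
    private static final String FRONT_VOWELS = "ei\u00f6\u00fc";

    /** Returns "lar" after a back vowel and "ler" after a front vowel. */
    static String pluralSuffix(String stem) {
        String lower = stem.toLowerCase(TR);
        // Scan from the end: only the last vowel of the stem matters.
        for (int i = lower.length() - 1; i >= 0; i--) {
            char c = lower.charAt(i);
            if (BACK_VOWELS.indexOf(c) >= 0) {
                return "lar";
            }
            if (FRONT_VOWELS.indexOf(c) >= 0) {
                return "ler";
            }
        }
        return "ler"; // No vowel found: arbitrary fallback for this sketch.
    }

    static String pluralize(String stem) {
        return stem + pluralSuffix(stem);
    }

    public static void main(String[] args) {
        System.out.println(pluralize("yol"));
        System.out.println(pluralize("ev"));
    }
}
```

A real implementation would delegate this to the morphological analyzer rather than to a hand-written vowel table.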
NOTE: Although there are many distinctive rules in Turkish, I have only given some basic examples to indicate how those differences can be handled. Some multi-word expressions such as “olsa olsa” or “koşa koşa” may not be expressible directly in the source XML file, but Java classes can be written to encode the rule for this kind of multi-word expression. However, this is not guaranteed to be done in the first stage, since there are different types of multi-word expressions that vary from adjectives to adverbs, as well as emphatic adjectival forms involving the question suffixes, such as “güzel mi güzel”. There can also be problems with idiomatic expressions such as “gözü gönlü açılmak” (to be cheered up), with inverted sentences, etc.
Language checking process/pseudo-thoughts (as used in Language Tool):
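The kind of Java rule mentioned in the note could look roughly like the sketch below, which finds immediate repetitions (“olsa olsa”) and the emphatic “X mi X” form. The class name and methods are hypothetical, and the sketch works on plain strings instead of Language Tool's own token classes. It only finds candidates; it does not decide whether a repetition is a valid expression or an error.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: find "X X" and "X mi X" patterns in one sentence.
final class ReduplicationSketch {
    private static final Locale TR = Locale.forLanguageTag("tr");

    /** Returns every matched phrase, lowercased, in order of appearance. */
    static List<String> findReduplications(String sentence) {
        // Split on anything that is not a letter, so punctuation is dropped.
        String[] t = sentence.trim().toLowerCase(TR).split("[^\\p{L}]+");
        List<String> hits = new ArrayList<>();
        for (int i = 0; i + 1 < t.length; i++) {
            if (t[i].isEmpty()) {
                continue; // A leading separator produces one empty token.
            }
            if (t[i].equals(t[i + 1])) {
                hits.add(t[i] + " " + t[i + 1]);
            } else if (i + 2 < t.length
                    && isQuestionParticle(t[i + 1])
                    && t[i].equals(t[i + 2])) {
                hits.add(t[i] + " " + t[i + 1] + " " + t[i + 2]);
            }
        }
        return hits;
    }

    /** The four harmony variants of the question particle (mi and friends). */
    static boolean isQuestionParticle(String w) {
        return w.equals("mi") || w.equals("m\u0131")
                || w.equals("mu") || w.equals("m\u00fc");
    }

    public static void main(String[] args) {
        System.out.println(findReduplications("Olsa olsa yarin gelir."));
    }
}
```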
1. The text to be checked is split into sentences; the Zemberek library will be used for this.
2. Each sentence is split into separate words.
3. Each word is assigned its part-of-speech (POS) tag(s) (e.g. yollar = plural noun, bitti = simple past verb).
4. The analyzed text is then matched against the built-in rules and against the rules loaded from the source XML file.
5. The patterns are extracted and compared, and suggestion messages such as “Did you mean this?” are shown.
6. Spell-check (preferably).
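Steps 1 to 5 above can be sketched as a toy pipeline. Everything here is an assumption made for illustration: the JDK's BreakIterator stands in for Zemberek's sentence splitting, the two-word lexicon and the tag names NOUN_PL and VERB_PAST are invented, and the single "rule" is a placeholder pattern with no linguistic claim behind it. Step 6 (spell-check) is left out.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the split / tokenize / tag / match pipeline.
final class CheckPipelineSketch {
    private static final Locale TR = Locale.forLanguageTag("tr");
    private static final Map<String, String> LEXICON = new HashMap<>();
    static {
        LEXICON.put("yollar", "NOUN_PL");   // toy tagger dictionary
        LEXICON.put("bitti", "VERB_PAST");
    }
    // Placeholder rule: a tag sequence plus the message to show on a match.
    private static final List<String> TOY_PATTERN = Arrays.asList("NOUN_PL", "NOUN_PL");
    private static final String TOY_MESSAGE = "Two plural nouns in a row (toy rule)";

    /** Step 1: split the text into trimmed sentences. */
    static List<String> splitSentences(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(TR);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    /** Step 2: split a sentence into words, dropping punctuation. */
    static List<String> tokenize(String sentence) {
        List<String> out = new ArrayList<>();
        for (String w : sentence.split("[^\\p{L}]+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    /** Step 3: look each word up in the lexicon; unknown words get UNKNOWN. */
    static List<String> tag(List<String> tokens) {
        List<String> tags = new ArrayList<>();
        for (String w : tokens) {
            tags.add(LEXICON.getOrDefault(w.toLowerCase(TR), "UNKNOWN"));
        }
        return tags;
    }

    /** Steps 4 and 5: match the tag sequence against the rule, collect messages. */
    static List<String> check(String text) {
        List<String> messages = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            List<String> tags = tag(tokenize(sentence));
            for (int i = 0; i + TOY_PATTERN.size() <= tags.size(); i++) {
                if (tags.subList(i, i + TOY_PATTERN.size()).equals(TOY_PATTERN)) {
                    messages.add(TOY_MESSAGE + ": " + sentence);
                }
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        System.out.println(check("Yollar bitti. Yollar yollar."));
    }
}
```

Keeping each step in its own method mirrors the list above, so any one stage (for example the tagger) can later be swapped for a Zemberek-backed version without touching the others.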
At first glance, Language Tool looks like a good candidate for adding Turkish grammar support. It is based on corpus-level rule checking. It uses part-of-speech (POS) tagging, which assigns a grammatical tag to each word based on its definition as well as its context. We are taught a simplified version of this in primary school, where we learn to identify words as nouns, verbs, adjectives, etc. It already integrates with an office suite, so adding Turkish support on that side will be easier. On the other hand, I would also like to look at other examples such as GRAC (based on learning algorithms and written in Python) and An Gramadóir [3] (intended as a platform for the development of sophisticated natural language processing tools for languages with limited computational resources, written in Perl), and to investigate the METU-Sabanci Turkish Treebank and other tools or libraries.
Another methodology is Link Grammar. It works on a different principle: it checks the grammatical structure of the sentence and builds relations between pairs of words, rather than constructing constituents in a tree-like hierarchy. This means that it cannot catch many of the errors that Language Tool describes in its rules, but it should be able to catch some irregular grammatical structures. However, in contrast to Language Tool, it does not offer suggestions. Moreover, the most frequent style and usage mistakes are not violations of deep grammatical structure, so it will not complain about them.
There are some basic steps to add Turkish support for the Language Tool:
* Develop a tagger dictionary for Turkish that stays compact by using fsa (an fsa .dict file, making use of Zemberek).
* Create XML rules, from a few to a great many, following the basic distinctive properties of Turkish described above; rules that are specific to Turkish will be added carefully.
* Create Java classes: Language and Tagger, plus many more for complex rules and errors.
* Create a …tr.properties file (Turkish characters will cause problems here; the Eclipse ResourceBundle Editor (eclipse-rbe) can be an easy solution).
* Validate with JUnit tests, including tests for the tagger.
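The Turkish-character problem in the .properties step comes from the fact that java.util.Properties reads byte streams as ISO 8859-1, so characters such as “ş” or “ğ” have to be written as \uXXXX escapes. The sketch below shows the escaping that a tool like native2ascii or a resource bundle editor performs; the class name and the key "message" are hypothetical, and the example strings are escaped to keep the source ASCII-only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch: escape non-ASCII characters for a .properties file.
final class PropertiesEscapeSketch {

    /** Replaces every non-ASCII char with its four-digit Unicode escape. */
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 128) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    /** Writes an escaped value into a one-line properties text and reads it back. */
    static String roundTrip(String value) {
        Properties p = new Properties();
        try {
            p.load(new StringReader("message=" + escapeNonAscii(value)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.getProperty("message");
    }

    public static void main(String[] args) {
        // "Turkce" spelled with u-umlaut and c-cedilla.
        System.out.println(escapeNonAscii("T\u00fcrk\u00e7e"));
    }
}
```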
The checks happen in the LanguageTool.check(String) method. When a word is unknown to the dictionary, it is tagged with the UNKNOWN POS tag. I will also make use of the Zemberek libraries here. For example, it is not possible to know all proper names, so those words can be silently ignored using a simple heuristic: with the help of Zemberek, if a word is capitalized and has the UNKNOWN POS tag, we can deduce that it is a proper name. There can be exceptions, but this is one way to handle them.
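The heuristic described above is small enough to sketch directly. The class and method names are hypothetical, and the tag is passed in as a plain string instead of coming from Language Tool or Zemberek. One known exception is visible immediately: a sentence-initial unknown word is also capitalized, so a real rule would need to know the token's position.

```java
// Hypothetical sketch: treat capitalized unknown words as proper names.
final class ProperNameHeuristicSketch {
    static final String UNKNOWN = "UNKNOWN";

    /** True when the tagger did not know the word and it starts with an uppercase letter. */
    static boolean looksLikeProperName(String token, String posTag) {
        return UNKNOWN.equals(posTag)
                && !token.isEmpty()
                && Character.isUpperCase(token.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(looksLikeProperName("Ezgi", UNKNOWN));
        System.out.println(looksLikeProperName("ezgi", UNKNOWN));
    }
}
```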
Basic rules can be written as XML rules, which makes it easy to express what you want. For phrases, regular expressions, ambiguous words, and words with multiple corrections, XML rules are simply more flexible, and I will not waste time writing code that does the same thing as an existing piece of Language Tool’s code. In fact, I could write a generic tagger class that takes only a dictionary and reads its parameters from files. The most standard tag set for Turkish should be used. Zemberek will be very helpful here, because it already has a tagger library. It carefully analyzes the root of the word and provides Turkish-specific syllabication, suggestion, and suffix-finder methods.
The Turkish rules need to be checked against a large corpus (for example the METU Turkish Corpus [4] or, preferably, the Zemberek corpus). However, in most cases I will need to write special Java rules for the grammar checking to be useful. So implementing a new language in Language Tool requires translation, careful rule analysis, language-specific Java classes, etc.
In Language Tool, you can easily control which rules the user can turn on or off by grouping them, as each group has its own ID. This is important, as people might disagree about what counts as an error. If you do not use groups, every rule can be turned on or off separately.
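The on/off behavior can be pictured with a small registry keyed by group ID. This is only a sketch of the idea and not Language Tool's actual API: the class, its methods, and the IDs used in the example are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: enable or disable whole rule groups by their ID.
final class RuleToggleSketch {
    // Group ID -> names of the rules in that group, in insertion order.
    private final Map<String, List<String>> groups = new LinkedHashMap<>();
    private final Set<String> disabledGroups = new HashSet<>();

    void addRule(String groupId, String ruleName) {
        groups.computeIfAbsent(groupId, k -> new ArrayList<>()).add(ruleName);
    }

    void disable(String groupId) {
        disabledGroups.add(groupId);
    }

    void enable(String groupId) {
        disabledGroups.remove(groupId);
    }

    /** Returns the rules whose group has not been switched off. */
    List<String> activeRules() {
        List<String> active = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            if (!disabledGroups.contains(e.getKey())) {
                active.addAll(e.getValue());
            }
        }
        return active;
    }

    public static void main(String[] args) {
        RuleToggleSketch rules = new RuleToggleSketch();
        rules.addRule("WORD_ORDER", "postposition-before-object");
        rules.addRule("REDUPLICATION", "x-mi-x");
        rules.disable("WORD_ORDER"); // The user disagrees with this group.
        System.out.println(rules.activeRules());
    }
}
```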
Testing
To check how well the selected tool finds grammatical defects and possible mistakes, all the collected examples and texts will be evaluated, and some errors will be deliberately added to demonstrate the verification process. Moreover, I can easily provide XML pattern tests and JUnit tests, since I am experienced with JUnit testing. The XML format is checked quite thoroughly by validation and test cases. Manual checks should be the least used form of quality control for the code. It is also much easier to work on rules when JUnit tests make sure that your notation does what you really want.
Documentation
The results of tests and verifications, the methods of adding rules, and the parsing process will be documented. When the Turkish grammar support is finished by the end of this summer, there will be statistics summarizing the results of the grammar check, errors, exceptions, etc. Zemberek has some classes that make it easy to get statistical information about suffixes, stems, and words, so reusing that existing code will help me keep track of statistical data. If I have time, I will write plugins for Emacs/KWrite.
Community Interaction
I will be in touch with other developers and researchers who are interested in Natural Language Processing and computational grammar-checking algorithms. Development will take place fully in the open. All collected data will be hosted on a web page, so users and interested developers will always have access to the latest version of the collection. I will also keep track of my progress on my blog, http://ezgicicek.wordpress.com/, where everybody will be able to leave comments.
Why am I interested in this project and not the others?
I find this project very interesting and necessary. It is worth spending my summer on. Firstly, my abilities match the project requirements, and I believe that I can complete this project in the limited time. I would not want to be involved in a project that exceeds my knowledge and capabilities, since it would be painful to struggle with it over the summer.
Secondly, there is a need for a grammar checker for Turkish. Although there are some useful resources such as the METU Turkish Corpus or Zemberek for different purposes, an integrated grammar checker would be used and appreciated in the community. When the project is completed, I will be able to tell anybody who is not familiar with programming that I managed to add Turkish grammar checking for a given text and, fortunately, for OpenOffice. I hope they will understand me and use it.
Project Time-line:
I have developed a schedule to stay focused and motivated.
April 14 – May 26 Requirements stage: Research, which includes reading documentation, exploring Zemberek and Natural Language Processing, and collecting background information. Community interaction, meetings with mentors and other colleagues, brainstorming, and initial planning and design.
May 26 – July 10 Midterm stage: Testing and coding, with weekly evaluations until the official midterm evaluation.
July 10 – August 11 Development stage: Coding continues, with tests, documentation, and weekly evaluations.
August 18 – September 1 Evaluations: Final evaluations covering the whole development process.
September 3 Submission: Submitting the code.
Final Point
My interest in this topic will keep me happy and working, so I do not intend to give up after the formal process has ended. In fact, I could write my senior project on this topic, since there is a great amount of academic research on NLP. I expect that there will be plenty left to do on this project even after the summer, since Natural Language Processing is a wide area. While researching ways of checking Turkish grammar and looking into Language Tool, I have also felt my experience in this area grow. I believe I am an excellent candidate for this project because I will be devoting my spare time to leading it, and I am confident that I can complete it successfully. I would really like to work on this project, not only because of my own passion for free software and Java, but also to contribute back to the open source and Pardus communities.
Links:
Electronic Journal, my Java project. Please check it out, and read the instructions on how to use it:
http://code.google.com/p/e-journal/
References:
[1] Şehitoğlu, O. Tolga. 1996. A Sign-Based Phrase Structure Grammar for Turkish. M.S. Thesis, Middle East Technical University.
[2] Eryiğit, G., and Oflazer, K. 2006. Statistical Dependency Parsing of Turkish. In Proceedings of EACL 2006, 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy, April.
[3] http://borel.slu.edu/gramadoir
[4] http://www.ii.metu.edu.tr/~corpus/treebank.html
[5] İstek, Özlem. 2006. A Link Grammar for Turkish. M.S. Thesis.
cool
That’s very useful. Thanks Ezgi
Good luck. If you need any modifications in Zemberek, let me know.
A very nice proposal, congratulations. I hope you will be successful at the end of the term. I recommend using hidden Markov models for POS tagging; I think you should discuss this with your mentor. Have a nice day.
Why don’t you supply the code to the languagetool project so that we could integrate Turkish with newer framework code base? We would love to support Turkish!