CICLing 2010 review results

 

Oral presentations

Short oral presentations + posters

Rejected papers

 

Notes:


 

Oral presentations: publication in Springer LNCS

Pablo Gamallo and José Ramom Pichel. Automatic Generation of Bilingual Dictionaries using Intermediary Languages and Comparable Corpora

This paper outlines a strategy to build new bilingual dictionaries from existing resources. The method is based on two main tasks: first, a new set of bilingual correspondences is generated from two available bilingual dictionaries; second, the generated correspondences are validated using a bilingual lexicon automatically extracted from non-parallel, comparable corpora. The quality of the entries of the derived dictionary is very high, similar to that of hand-crafted dictionaries. We report a case study in which a new, non-noisy English-Galician dictionary with about 12,000 correct bilingual correspondences was automatically generated.
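
Below is a minimal sketch, not from the paper, of the pivot-based derivation idea: two bilingual dictionaries are composed through an intermediary language, and only the pairs confirmed by a lexicon extracted from comparable corpora are kept. All function names and toy entries are illustrative assumptions.

```python
# Illustrative sketch of pivot-based dictionary derivation plus corpus validation.
# Dictionaries are modelled as dicts mapping a word to a set of translations.

def derive_bilingual_dictionary(src_to_pivot, pivot_to_tgt, corpus_lexicon):
    """src_to_pivot, pivot_to_tgt: dict[str, set[str]];
    corpus_lexicon: set of (src, tgt) pairs extracted from comparable corpora."""
    candidates = {}
    for src_word, pivots in src_to_pivot.items():
        targets = set()
        for p in pivots:
            targets |= pivot_to_tgt.get(p, set())
        if targets:
            candidates[src_word] = targets

    # Validation step: keep only correspondences supported by the corpus lexicon.
    validated = {
        src: {tgt for tgt in tgts if (src, tgt) in corpus_lexicon}
        for src, tgts in candidates.items()
    }
    return {src: tgts for src, tgts in validated.items() if tgts}

# Toy usage with made-up entries (pivot language chosen arbitrarily):
en_pivot = {"dog": {"perro"}, "bank": {"banco", "orilla"}}
pivot_gl = {"perro": {"can"}, "banco": {"banco"}, "orilla": {"beira"}}
lexicon = {("dog", "can"), ("bank", "banco")}
print(derive_bilingual_dictionary(en_pivot, pivot_gl, lexicon))
# {'dog': {'can'}, 'bank': {'banco'}}
```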

Rodrigo Agerri and Anselmo Peñas. On the automatic generation of Intermediate Logic Forms for WordNet glosses

This paper presents an automatically generated Intermediate Logic Form of WordNet's glosses. Our proposed logic form includes neo-Davidsonian reification in a simple and flat syntax close to natural language. We offer a comparison with other semantic representations such as those provided by Hobbs and Extended WordNet. Our Intermediate Logic Forms provide a representation suitable for performing semantic inference without the brittleness that characterizes approaches based on first-order logic and theorem proving. In its current form, the representation allows us to tackle semantic phenomena such as co-reference and pronominal anaphora resolution. The Intermediate Logic Forms are straightforwardly obtained from the output of a pipeline consisting of a part-of-speech tagger, a dependency parser and our own Intermediate Logic Form generator (all freely available tools). We apply the pipeline to the glosses of WordNet 3.0 to obtain a lexical resource ready to be used as a knowledge base or resource for a variety of tasks involving some kind of semantic inference. We present a qualitative evaluation of the resource and discuss its possible application in Natural Language Understanding.

Hikaru YOKONO and Manabu OKUMURA. Incorporating Cohesive Devices into Entity Grid Model in Evaluating Local Coherence of Japanese Text

This paper describes improvements made to the entity grid local coherence model for Japanese text. We investigate the effectiveness of taking into account cohesive devices, such as conjunction, demonstrative pronouns, and lexical cohesion, and of refining syntactic roles for the topic marker in Japanese. To take lexical cohesion into account, we consider semantic relations between entities using lexical chaining. Through experiments on discrimination, where the system has to select the more coherent sentence ordering, and on comparison of the system's ranking of automatically created summaries against human judgments based on quality questions, we show that these factors contribute to improving the performance of the entity grid model.

Hugo Hernault, Danushka Bollegala and Mitsuru Ishizuka. A Sequential Model for Discourse Segmentation

Identifying discourse relations in a text is essential for various tasks in Natural Language Processing, such as automatic text summarization, question-answering, and dialogue generation. The first step of this process is segmenting a text into elementary units. In this paper, we present a novel model of discourse segmentation based on sequential data labeling. Namely, we use Conditional Random Fields to train a discourse segmenter on the RST Discourse Treebank, using a set of lexical and syntactic features.   Our system is compared to other statistical and rule-based segmenters, including one based on Support Vector Machines. Experimental results indicate that our sequential model outperforms current state-of-the-art discourse segmenters, with an F-score of 0.94. This performance level is close to the human agreement F-score of 0.98.
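
As an illustration of the sequence-labeling formulation (not the authors' implementation), the sketch below trains a CRF segmenter with sklearn-crfsuite, labeling each token as beginning (B) or continuing (C) an elementary discourse unit; the feature set is a simplified stand-in for the lexical and syntactic features the abstract describes.

```python
# Hedged sketch of CRF-based discourse segmentation as sequence labeling.
import sklearn_crfsuite

def token_features(sent, i):
    # sent is a list of (token, POS) pairs; features here are deliberately simple.
    word, pos = sent[i]
    feats = {"word": word.lower(), "pos": pos}
    if i > 0:
        feats["prev_word"], feats["prev_pos"] = sent[i - 1][0].lower(), sent[i - 1][1]
    else:
        feats["BOS"] = True
    if i < len(sent) - 1:
        feats["next_pos"] = sent[i + 1][1]
    return feats

def sent_to_features(sent):
    return [token_features(sent, i) for i in range(len(sent))]

def train_segmenter(train_sents, train_labels):
    """train_sents: list of [(token, POS), ...];
    train_labels: for each sentence, a list of 'B'/'C' labels, one per token."""
    X = [sent_to_features(s) for s in train_sents]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X, train_labels)
    return crf

def segment(crf, sent):
    # A "B" label opens a new elementary discourse unit.
    return crf.predict([sent_to_features(sent)])[0]
```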

Samuel Chan, Lawrence Cheung and Mickey Chong. A Machine Learning Parser Using an Unlexicalized Distituent Model

Despite the popularity of lexicalized parsing models, practical concerns such as data sparseness and applicability to domains with different vocabularies make unlexicalized models, which do not refer to word tokens themselves, deserve more attention. We have developed a classifier-based parser using an unlexicalized parsing model. A machine learning method integrates linguistic attributes and information-theoretic attributes in two tasks, namely sentence chunking and phrase recognition. Most importantly, to enhance the accuracy of these tasks, we investigated the notion of distituency (the possibility that two parts of speech cannot remain in the same constituent or phrase) and incorporated it as attributes using various statistical measures. The parser was applied to parsing English and Chinese sentences in the Penn Treebank and the Tsinghua Chinese Treebank. It achieved a parsing performance of F-score 80.3% in English and 82.4% in Chinese.

Ronald Winnemöller. Drive-by Language Identification - A Byproduct of applied Prototype Semantics

While many effective and efficient language identification algorithms exist, most of them based on supervised n-gram or word dictionary methods, we propose a semi-supervised approach to language identification based on prototype semantics. Our method is primarily aimed at noise-rich environments with only very small text fragments to analyze and no training data available, and even at analyzing the probable language affiliation of single words. We have integrated our prototype system into a larger web crawling and information management architecture and evaluated it against an experimental setup including datasets in 11 European languages.

Vidas Daudaravicius. The Influence of Collocation Segmentation and Top 10 Items to Keyword Assignment Performance

Automatic document annotation from a controlled conceptual thesaurus is useful for establishing precise links between similar documents. This study presents a language-independent document annotation system based on features derived from a novel collocation segmentation method. Using the multilingual conceptual thesaurus EuroVoc, we evaluate filtered and unfiltered versions of the method, comparing it against other language-independent methods based on single words and bigrams. Testing our new method against the manually tagged multilingual corpus Acquis Communautaire 3.0 (AC) using all descriptors found there, we attain improvements in keyword assignment precision from 18 to 29 percent and in F-measure from 17.2 to 27.6 for 5 keywords assigned to a document. Further filtering out of the top 10 frequent items improves precision by 4 percent, and collocation segmentation improves precision by 9 percent on average, over the 21 languages tested.

Lieve Macken and Walter Daelemans. A chunk-driven bootstrapping approach to extracting translation patterns

We present a linguistically motivated sub-sentential alignment system that extends the intersected IBM Model 4 word alignments. The alignment system is chunk-driven and requires only shallow linguistic processing tools for the source and the target languages, i.e. part-of-speech taggers and chunkers. We conceive the sub-sentential aligner as a cascaded model consisting of two phases. In the first phase, anchor chunks are linked based on the intersected word alignments and syntactic similarity. In the second phase, we use a bootstrapping approach to extract more complex translation patterns. The results show an overall AER reduction and competitive F-measures in comparison to the commonly used symmetrized IBM Model 4 predictions (intersection, union and grow-diag-final) on six different text types for English-Dutch. In particular, in comparison with the intersected word alignments, the proposed method improves recall without sacrificing precision. Moreover, the system is able to align discontiguous chunks, which frequently occur in Dutch.

Diego Ingaramo, Marcelo Errecalde and Paolo Rosso. A general bio-inspired method to improve the short-text clustering task

``Short-text clustering'' is a very important research field due to the current tendency for people to use very short documents, e.g. blogs, text-messaging and others. In some recent works, new clustering algorithms have been proposed to deal with this difficult problem and novel bio-inspired methods have reported the best results in this area. In this work, a general bio-inspired method based on the AntTree approach is proposed for this task. It takes as input the results obtained by arbitrary clustering algorithms and refines them in different stages. The proposal shows an interesting improvement in the results obtained with different algorithms on several short-text collections.

Azeddine Zidouni and Hervé Glotin. Named Entities Recognition In Transcribed Audio Broadcast News Documents

In this paper we propose an efficient model to perform named entity recognition (NER) using the hierarchical structure of named entities. The NER task consists of identifying and classifying every word in a document into predefined categories such as person names, locations, organizations, and dates. Classical NER systems usually use generative approaches to learn models that consider only word characteristics (word context). In this work we show that NER is also sensitive to syntactic and semantic contexts. For this reason, we introduce an extension of the conditional random fields (CRFs) approach to consider multiple contexts. We present an adaptation of the text approach to automatic speech recognition (ASR) outputs. Experimental results show that the proposed approach outperforms a simple application of CRFs. Our method achieves a significant improvement of 17% in the slot error rate (SER) measure over an HMM-based method.

Roberto Basili, Danilo Croce, Cristina Giannone and Diego De Cao. Acquiring IE patterns through Distributional Lexical Semantic Models

Techniques for the automatic acquisition of Information Extraction patterns are still a crucial issue in knowledge engineering. A semi-supervised learning method based on large-scale linguistic resources, such as FrameNet and WordNet, is discussed. In particular, a robust method for assigning conceptual relations (i.e. roles) to relevant grammatical structures is defined according to distributional models of lexical semantics over a large-scale corpus. Experimental results show that the use of the resulting knowledge base achieves significant accuracy in a typical IE task (above 90%), similar to supervised semantic parsing methods. This confirms the impact of the proposed approach on the quality and development time of large-scale resources.

Antonio Juárez-González, Manuel Montes-y-Gómez, Luis Villaseñor-Pineda, David Pinto-Avendaño and Manuel Pérez-Coutiño. Selecting the N-Top Retrieval Result Lists for an Effective Data Fusion

Although the application of data fusion in information retrieval has yielded good results in the majority of cases, it has been noticed that its achievement depends on the quality of the input result lists. In order to tackle this problem, in this paper we explore the combination of only the n-top result lists as an alternative to the fusion of all available data. In particular, we describe a heuristic measure based on redundancy and ranking information to evaluate the quality of each result list and, consequently, to select the presumably n-best lists per query. Preliminary results on four data sets, considering a total of 266 queries and employing three different DF methods, are encouraging. They indicate that the proposed approach can significantly outperform the results achieved by fusing all result lists, showing an average improvement of 11% in the MAP scores.
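
The sketch below illustrates the general idea under stated assumptions: each result list gets a quality score from rank-discounted redundancy against the other lists, the n best lists are kept, and a simple reciprocal-rank fusion stands in for the DF methods used in the paper. The heuristic is not the authors' exact measure.

```python
# Hedged sketch of selecting the n-top result lists before fusion.
from collections import defaultdict

def list_quality(lists, idx):
    """Score list `idx` by how often its documents reappear in the other
    retrieved lists (redundancy), discounted by rank position."""
    score = 0.0
    others = [set(l) for j, l in enumerate(lists) if j != idx]
    for rank, doc in enumerate(lists[idx], start=1):
        redundancy = sum(doc in o for o in others)
        score += redundancy / rank
    return score

def fuse_top_n(lists, n):
    # Keep the presumably n-best lists, then fuse with reciprocal-rank scores.
    best = sorted(range(len(lists)), key=lambda i: list_quality(lists, i),
                  reverse=True)[:n]
    fused = defaultdict(float)
    for i in best:
        for rank, doc in enumerate(lists[i], start=1):
            fused[doc] += 1.0 / rank
    return sorted(fused, key=fused.get, reverse=True)

# lists = [["d1", "d2", "d3"], ["d2", "d1", "d4"], ["d9", "d8", "d7"]]
# print(fuse_top_n(lists, n=2))
```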

Livio Robaldo and Jurij Di Carlo. Flexible disambiguation in DTS

In this paper, we present procedures to carry out incremental disambiguation in Dependency Tree Semantics (DTS), a constraint-based underspecified formalism. The present paper evolves the research done by (Robaldo and Di Carlo, 2009), who proposed an expressively complete version of DTS but did not show how disambiguation may be computationally achieved, i.e. how the several readings may be obtained or blocked starting from the fully underspecified one. We claim that the disambiguation process proposed here is flexible, in the sense that it is able to account for any kind of NL constraint on the available readings.

Rafał Jaworski. Computing transfer score in Example-Based Machine Translation

This paper presents an idea in Example-Based Machine Translation: computing a transfer score for each produced translation. When an EBMT system finds an example in the translation memory, it tries to modify it in order to produce the best possible translation of the input sentence. The user of the system, however, is unable to judge the quality of the translation. A solution to this problem is to provide the user with a percentage score for each translated sentence. Basing the transfer score computation on the similarity between the input sentence and the example alone is not sufficient: real-life examples show that the transfer process is equally likely to go well with a bad translation memory example and to fail with a good example. This paper describes a method of computing the transfer score that is strictly associated with the transfer process: the transfer score is inversely proportional to the number of linguistic operations executed on the example target sentence. The paper ends with an evaluation of the suggested method.
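
A minimal sketch of the scoring principle, with an assumed normalization (not the paper's formula): the more linguistic operations applied to the example's target sentence, the lower the percentage score.

```python
# Illustrative transfer score: inversely related to the number of edit operations.
def transfer_score(num_operations, target_length):
    """Return a percentage score in [0, 100]; more edits -> lower score."""
    if target_length == 0:
        return 0.0
    penalty = min(num_operations / target_length, 1.0)
    return round(100.0 * (1.0 - penalty), 1)

# e.g. 2 operations on a 10-token target sentence -> 80.0
```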

Yulan Yan. Multi-View Bootstrapping for Relation Extraction by Exploring Web Features and Linguistic Features

Binary semantic relation extraction from Wikipedia is particularly useful for various NLP and Web applications. Currently, frequent pattern mining-based methods and syntactic analysis-based methods are the two leading types of methods for the semantic relation extraction task. With a novel view on integrating linguistic analysis of Wikipedia text with redundancy information from the Web, we propose a multi-view learning approach for bootstrapping relationships between entities that exploits the complementarity between the Web view and the linguistic view. On the one hand, from the linguistic view, features are generated from linguistic parsing by abstracting away from different surface realizations of semantic relations. On the other hand, features that provide frequency information for relation extraction are extracted from the Web corpus. Experimental evaluation on a relational dataset demonstrates that linguistic analysis and Web collective information reveal different aspects of the nature of entity-related semantic relationships, and that our multi-view learning method considerably boosts performance compared to learning with only one view, with the weaknesses of one view complemented by the strengths of the other.

Michael Granitzer. Adaptive Term Weighting through Stochastic Optimization

Term weighting strongly influences the performance of text mining and information retrieval approaches. Usually, term weighting is based on statistical estimates derived from static weighting schemes. Such static approaches lack the capability to generalize to different domains and different data sets. In this paper, we introduce an online learning method for adapting term weights in a supervised manner. Via stochastic optimization we determine a linear transformation of the term space that optimizes the expected similarities among documents. We evaluate our approach on 18 standard text data sets and show that the performance improvement of a k-NN classifier ranges between 1% and 12% when adaptive term weighting is used as a preprocessing step. Further, we provide empirical evidence that using pairwise training examples is efficient in online learning settings.
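
A simplified sketch of the idea, not the paper's algorithm: per-term weights form a diagonal transformation of the term space and are updated by stochastic gradient-style steps on pairwise document examples, so that same-class pairs become more similar and different-class pairs less similar.

```python
# Hedged sketch of adaptive term weighting from pairwise examples.
import numpy as np

def weighted_cosine(x, y, w):
    xw, yw = x * w, y * w
    denom = np.linalg.norm(xw) * np.linalg.norm(yw)
    return float(xw @ yw) / denom if denom else 0.0

def learn_term_weights(X, labels, pairs, lr=0.01, epochs=5):
    """X: (n_docs, n_terms) tf-idf matrix; pairs: list of (i, j) document pairs."""
    w = np.ones(X.shape[1])
    for _ in range(epochs):
        for i, j in pairs:
            target = 1.0 if labels[i] == labels[j] else 0.0
            sim = weighted_cosine(X[i], X[j], w)
            # Cheap surrogate update: push weights of terms shared by both
            # documents up or down depending on the similarity error.
            shared = X[i] * X[j]
            w += lr * (target - sim) * shared
            w = np.clip(w, 1e-6, None)  # keep weights positive
    return w
```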

Peggy Cellier, Thierry Charnois and Marc Plantevit. Sequential Patterns to Discover and Characterise Biological Relations

In this paper we present a method to automatically detect and characterise interactions between genes in the biomedical literature. Our approach is based on a combination of data mining techniques: sequential patterns filtered by linguistic constraints, and recursive mining. Unlike most Natural Language Processing (NLP) approaches, our approach does not use syntactic parsing to learn and apply linguistic rules, and it does not require any resource except the training corpus to learn patterns. The process has two steps. First, frequent sequential patterns are extracted from the training corpus. Second, after validation of those patterns, they are applied to the application corpus to detect and characterise new interactions. An advantage of our method is that interactions can be enhanced with modalities and biological information. We use two corpora containing only sentences with gene interactions as the training corpus; another corpus of PubMed abstracts is used as the application corpus. We conduct an evaluation showing that the precision of our approach is good and the recall reasonable for both targets: interaction detection and interaction characterisation.

Junsheng Zhou, Yabing Zhang, Xinyu Dai and Jiajun Chen. Chinese Event Descriptive Clause Splitting with Structured SVMs

Chinese event descriptive clause splitting is a novel task in Chinese information processing. In this paper, we present the first Chinese clause splitting system based on a discriminative approach. By formulating the Chinese clause splitting task as a sequence labeling problem, we apply the structured SVM model to Chinese clause splitting. Compared with two baseline systems, our approach gives much better performance.

Victoria Bobicev, Victoria Maxim, Tatiana Prodan, Natalia Burciu and Victoria Angheluş. Emotions in words: developing a multilingual WordNet-Affect

In this paper we describe the process of creating the Russian and Romanian WordNet-Affect. WordNet-Affect is a lexical resource created on the basis of the Princeton WordNet which contains information about the emotions that words convey. It is organized around six basic emotions: anger, disgust, fear, joy, sadness, and surprise. WordNet-Affect is a small lexical resource but valuable for its affective annotation. We translated the WordNet-Affect synsets into Russian and Romanian and created an aligned English–Romanian–Russian lexical resource. The resource is freely available for research purposes.

Pepi Stavropoulou, Dimitris Spiliotopoulos and Georgios Kouroupetroglou. Integrating Contrast in a Framework for Predicting Prosody

Information Structure (IS) is known to have a significant effect on prosody, making the identification of this effect crucial for improving the quality of synthetic speech. Recent theories identify contrast as a central IS element affecting accentuation. This paper presents the results of two experiments aiming to investigate the function of the different levels of contrast within the topic and focus of the utterance, and their effect on the prosody of Greek. Analysis showed that distinguishing between at least two contrast types is important for determining the appropriate accent type, and, therefore, such a distinction should be included in a description of the IS–prosody interaction. For this description to be useful for practical applications, a framework is required that makes this information accessible to the speech synthesizer. This work reports on the integration, into such a language-independent framework, of all identified grammatical and syntactic prerequisites for creating a linguistically enriched input for speech synthesis.

Onur Gungor and Tunga Gungor. Morphological Annotation of a Corpus with a Collaborative Multiplayer Game

In most natural language processing tasks, state-of-the-art systems usually rely on machine learning methods for building their mathematical models. Given that the majority of these systems employ supervised learning strategies, a corpus that is annotated for the problem area is essential. The current method for annotating a corpus is to hire several experts and have them annotate the corpus manually or with a helper software tool. However, this method is costly and time-consuming. In this paper, we propose a novel method that aims to solve these problems. By employing a collaborative multiplayer game that ordinary people can play on the Internet, it becomes possible to direct this latent labour force so that people contribute by simply playing an enjoyable game. Through a game site which incorporates some functionality inherited from social networking sites, people are motivated to contribute to the annotation process by answering questions about the underlying morphological features of a target word. The experiments show that 63.5% of the question types are successful, based on a two-phase evaluation.

Christian M. Meyer and Iryna Gurevych. Worth its Weight in Gold or Yet Another Resource --- A Comparative Study of Wiktionary, OpenThesaurus and GermaNet

In this paper, we analyze the topology and the content of a range of lexical semantic resources for the German language constructed either in a controlled (GermaNet), semi-controlled (OpenThesaurus), or collaborative, i.e. community-based, manner (Wiktionary). For the first time, the comparison of the corresponding resources is performed at the word sense level. For this purpose, the word senses of terms are automatically disambiguated in Wiktionary and the content of all resources is converted to a uniform representation. We show that the resources' topology is well comparable as they share the small world property and contain a comparable number of entries, although differences in their connectivity exist. Our study of content related properties reveals that the German Wiktionary has a different distribution of word senses and contains more polysemous entries than both other resources. We identify that each resource contains the highest number of a particular type of semantic relation. We finally increase the number of relations in Wiktionary by considering symmetric and inverse relations that have been found to be usually absent in this resource.

Ramona Enache and Aarne Ranta. An open-source computational grammar for Romanian

We describe the implementation of a computational grammar for Romanian as a resource grammar in the GF (Grammatical Framework) project. Resource grammars are the basic constituents of the GF library. They consist of morphological and syntactic modules which implement the common abstract syntax, also describing the basic features of a language. A lexicon that provides the translation of basic words in the given language is also included for testing purposes. There are currently 15 resource grammars in GF, the Romanian one being the 14th. The present paper explores the main features of the Romanian grammar, along with the way they fit into the framework that GF provides. We also compare the implementation for Romanian with related resource grammars that already exist in the library. The current resource grammar allows generation and parsing of natural language and can be used in multilingual translation and other GF-related applications. Covering a wide range of specific morphological and syntactic features of the Romanian language, this GF resource grammar is the most comprehensive open-source grammar existing so far for Romanian.

Christian Hänig. Unsupervised part-of-speech disambiguation for high frequency words and its influence on unsupervised parsing

Current unsupervised part-of-speech tagging algorithms build context vectors containing high-frequency words as features and cluster words, with respect to their context vectors, into classes. While part-of-speech disambiguation for mid- and low-frequency words is achieved by applying a Hidden Markov Model, no corresponding method is applied to high-frequency terms. But those are exactly the words that are essential for analyzing syntactic dependencies in natural language. Thus, we introduce an approach employing unsupervised clustering of contexts to detect and separate a word's different syntactic roles. Experiments on German and English corpora show how this methodology addresses and solves some of the major problems of unsupervised part-of-speech tagging.

Alain-Pierre Manine, Erick Alphonse and Philippe Bessières. Extraction of Genic Interactions with the Recursive Logical Theory of an Ontology

We introduce an Information Extraction (IE) system which uses the logical theory of an ontology as a generalisation of typical information extraction patterns to extract biological interactions from text. This provides inference capabilities beyond current approaches: first, our system is able to handle multiple relations; second, it handles dependencies between relations, by deriving new relations from previously extracted ones and using inference at a semantic level; third, it addresses recursive or mutually recursive rules. In this context, automatically acquiring the resources of an IE system becomes an ontology learning task: terms, synonyms, the conceptual hierarchy, the relational hierarchy, and the logical theory of the ontology have to be acquired. We focus on the last point, as learning the logical theory of an ontology, and a fortiori of a recursive one, remains a seldom studied problem. We validate our approach by using a relational learning algorithm, which handles recursion, to learn a recursive logical theory from a text corpus on the bacterium Bacillus subtilis. This theory achieves good recall and precision for the ten defined semantic relations, reaching a global recall of 67.7% and a precision of 75.5%, but more importantly, it captures complex mutually recursive interactions which were implicitly encoded in the ontology.

Smaranda Muresan. Ontology-based Semantic Interpretation as Grammar Rule Constraints

We present an ontology-based semantic interpreter that can be linked to a grammar through grammar rule constraints, providing access to meaning during parsing and generation. In this approach, the parser will take as input natural language utterances and will produce ontology-based semantic representations. We rely on a recently developed constraint-based grammar formalism, which balances expressiveness with practical learnability results.  We show that even with a weak "ontological model", the semantic interpreter at the grammar rule level can help remove erroneous parses obtained when we do not have access to meaning.

Francisco Oliveira, Fai Wong and Iok-Sai Hong. Systematic Processing of Long Sentences in Rule based Portuguese-Chinese Machine Translation

Translation quality and parsing efficiency are often disappointing when rule-based machine translation systems deal with long sentences. Due to the complicated syntactic structure of the language, many ambiguous parse trees can be generated during the translation process, and it is not easy to select the most suitable parse tree for generating the correct translation. This paper presents an approach to parsing and translating long sentences efficiently, applied to rule-based Portuguese-Chinese machine translation. A systematic approach to reducing the length of the sentences based on patterns, clauses, conjunctions, and punctuation is used to improve the performance of the parsing analysis. In addition, a Constraint Synchronous Grammar is used to model both source and target languages simultaneously at the parsing stage to further reduce ambiguities and improve parsing efficiency.

Rada Mihalcea, Carlo Strapparava and Stephen Pulman. Computational Models for Incongruity Detection in Humor

Incongruity resolution is one of the most widely accepted theories of humour, suggesting that humour is due to the mixing of two disparate interpretation frames in one statement. In this paper, we explore several computational models for incongruity resolution. We introduce a new data set, consisting of a series of set-ups, each of them followed by four possible coherent continuations out of which only one has a comic effect. Using this data set, we redefine the task as the automatic  identification of the humorous punch line among all the plausible endings. We explore several measures of semantic relatedness, along with a number of joke-specific features, and try to understand their appropriateness as computational models for incongruity detection.

Dipankar Das and Sivaji Bandyopadhyay. Emotion Agent for Emotional Verbs – The role of Subject and Syntax

In psychology and common use, emotion is an aspect of a person's mental state of being, normally based in or tied to the person's internal (physical) and external (social) sensory feeling. Determining the emotion expressed in a text with respect to its reader or writer is itself a challenging issue. Extraction of the emotion holder, a human-like agent, is important for discriminating between emotions that are viewed from different perspectives. A wide range of Natural Language Processing (NLP) tasks, from tracking users' emotions about products, events or politics as expressed in online forums or news to customer relationship management, use emotional information. Determining the emotion agent in a text helps us to track and distinguish users' emotions separately. The present work aims to identify the emotion agent using a two-way approach. A baseline system is developed based on the subject information of the emotional sentences parsed using the Stanford Dependency Parser. The precision, recall and F-score values of the agent identification system are 63.21%, 66.54% and 64.83% respectively for the baseline approach. The second way to identify the emotion agent is based on the syntactic argument structure of emotional verbs. Emotional verbs of Ekman's six emotion types are retrieved from the WordNet Affect Lists (WAL). A total of 4,112 emotional sentences for 942 emotional verbs of the six emotion types have been extracted from the English VerbNet. The agent-related information specified in VerbNet, such as Experiencer, Agent, Theme, Beneficiary, etc., is tagged in the correct positions of the syntactic frames of each sentence. All possible subcategorization frames and their corresponding syntax available in VerbNet are retrieved for each emotional verb. The head of each chunk is extracted from the dependency-parsed output; this chunk-level information helps in constructing the syntactic argument structure with respect to the key emotional verb. The acquired syntactic argument structure of each emotional verb is mapped to all of the possible syntax entries present for that verb in VerbNet. If the syntactic argument structure of a sentence with respect to its key verb matches any of the syntax entries extracted from VerbNet for that verb, the agent role associated with the VerbNet syntax is assigned as the agent tag in the appropriate position of the syntactic arguments. The precision, recall and F-score values of this unsupervised syntactic agent identification approach are 68.11%, 65.89% and 66.98%. It has been observed that the baseline model suffers from an inability to identify the emotion agent in sentences with passive constructions. Although the recall value decreases in the syntactic model, it significantly outperforms the baseline.

Partha Pakray, Alexander Gelbukh and Sivaji Bandyopadhyay. A Syntactic Textual Entailment System Using Dependency Parser

The paper reports on the development of a syntactic textual entailment system that compares the dependency relations identified for both the text and the hypothesis. The Stanford Dependency Parser has been run on the 2-way RTE-3 development set. The dependency relations obtained for the hypothesis have been compared with those obtained for the text. Some of the important comparisons carried out are: subject-subject comparison (hypothesis and text subjects are compared), subject-verb comparison (subject along with the related verb), object-verb comparison and cross subject-verb comparison (hypothesis subject with text object and hypothesis object with text subject). Each of the matches obtained through the above comparisons is assigned a weight learnt from the development corpus. A threshold of 0.3 has been set on the fraction of matching hypothesis relations, based on the development set results, which gives optimal precision and recall values for both YES and NO entailments. The threshold score has been applied to the RTE-4 gold standard test set using the same dependency parsing and comparison methods. Evaluation scores obtained on the test set show 54.78% precision and 49.2% recall for YES decisions and 53.9% precision and 59.4% recall for NO decisions.
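
The decision rule can be pictured with the small sketch below, which is illustrative rather than the system's code: dependency relations are treated as triples, the (optionally weighted) fraction of hypothesis relations found in the text is computed, and the 0.3 threshold decides YES or NO. The weight table is a placeholder for the weights learnt from the development corpus.

```python
# Illustrative entailment decision from dependency-relation overlap.
def entailment_decision(text_rels, hyp_rels, weights=None, threshold=0.3):
    """text_rels, hyp_rels: sets of (relation, governor, dependent) triples."""
    if not hyp_rels:
        return "NO"
    weights = weights or {}
    total = sum(weights.get(r[0], 1.0) for r in hyp_rels)
    matched = sum(weights.get(r[0], 1.0) for r in hyp_rels if r in text_rels)
    return "YES" if matched / total >= threshold else "NO"

# t = {("nsubj", "bought", "John"), ("dobj", "bought", "car")}
# h = {("nsubj", "bought", "John")}
# entailment_decision(t, h)  # -> "YES"
```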

Guillem Gascó Mora and Joan Andreu Sánchez Peiró. Syntax Augmented Inversion Transduction Grammars for Machine Translation

In this paper we propose a novel method for inferring an Inversion Transduction Grammar (ITG) with source (or target) language linguistic information from a bilingual parallel corpus. Our method combines bilingual ITG parse trees with monolingual linguistic source (or target) trees in order to obtain a Syntax Augmented ITG (SAITG). The use of a modified parsing algorithm for bilingual parsing with bracketing information makes it possible for each bilingual subtree to have a corresponding subtree in the monolingual parse. In addition, several binarization techniques have been tested for the resulting SAITG. In order to evaluate the effects of using SAITGs in machine translation tasks, we have used them in an ITG-based machine translation decoder. The decoder is a hybrid machine translation system that combines phrase-based models with syntax-based translation models. The formalism that underlies the whole decoding process is a Chomsky Normal Form Stochastic Inversion Transduction Grammar (SITG) with phrasal productions and a log-linear combination of probability models. The decoder uses a CYK-like algorithm that combines the translated phrases inversely or directly in order to get a complete translation of the input sentence. The results obtained using SAITGs with the decoder on the IWSLT-08 Chinese-English machine translation task show significant improvements in BLEU and TER.

Cícero N. dos Santos, Ruy L. Milidiú, Carlos E. M. Crestana and ERALDO R. FERNANDES. ETL Ensembles for Chunking, NER and SRL

We present a new ensemble method that uses Entropy Guided Transformation Learning (ETL) as the base learner. The proposed approach, ETL Committee, combines the main ideas of Bagging and Random Subspaces. We also propose a strategy to include redundancy in transformation-based models. To evaluate the effectiveness of the ensemble method, we apply it to three Natural Language Processing tasks: Text Chunking, Named Entity Recognition and Semantic Role Labeling. Our experimental findings indicate that ETL Committee significantly outperforms single ETL models, achieving state-of-the-art competitive results. Some positive characteristics of the proposed ensemble strategy are worth mentioning. First, it improves the ETL effectiveness without any additional human effort. Second, it is particularly useful when dealing with very complex tasks that use large feature sets. Finally, the resulting training and classification processes are very easy to parallelize.

Sobha Lalitha Devi, Vijay Sundar Ram R, Bagyavathi T and Praveen Pralayankar. Syntactic Structure Transfer in a Tamil-Hindi MT System - A Hybrid Approach

We describe Syntactic Structure Transfer (SST), a central design question in machine translation, between two languages, Tamil (source) and Hindi (target), belonging to two different language families, Dravidian and Indo-Aryan respectively. Tamil and Hindi differ extensively at the level of clausal construction, and transferring the structure is difficult. The SST described here is a hybrid approach in which we use CRFs for identifying the clause boundaries in the source language, Transformation Based Learning (TBL) for extracting the rules, and semantic classification of postpositions for choosing the correct structure in constructions where there is a one-to-many mapping into the target language. We have evaluated the system using web data and the results are encouraging.

Vadlapudi Ravikiran and Rahul Katragadda. Quantitative Evaluation of Grammaticality of Summaries

Automated evaluation is crucial in the context of automated text summaries, as is the case with the evaluation of any language technology. While the quality of a summary is determined by both the content and the form of the summary, throughout the literature there has been extensive study of the automatic and semi-automatic evaluation of the content of summaries, and most such applications have been largely successful. What is lacking is a careful investigation of the automated evaluation of the readability aspects of a summary. In this work we dissect readability into five parameters and try to automate the evaluation of the grammaticality of text summaries. We use surface-level methods like n-grams and longest common subsequences over POS-tag sequences and chunk-tag sequences to capture acceptable grammatical constructions, and these approaches have produced impressive results. Our results show that it is possible to use relatively shallow features to quantify the degree of acceptance of grammaticality.
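
A shallow version of the POS n-gram idea can be sketched as follows (an illustration, not the paper's metric): collect POS-tag n-grams from reference text and score a summary by the fraction of its n-grams that are attested.

```python
# Hedged sketch of grammaticality scoring over POS-tag n-grams.
from collections import Counter

def pos_ngrams(tags, n):
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def build_reference_model(reference_tag_sequences, n=3):
    model = Counter()
    for tags in reference_tag_sequences:
        model.update(pos_ngrams(tags, n))
    return model

def grammaticality_score(summary_tags, model, n=3):
    grams = pos_ngrams(summary_tags, n)
    if not grams:
        return 0.0
    # Fraction of the summary's POS n-grams seen in the reference material.
    return sum(g in model for g in grams) / len(grams)

# reference = [["DT", "NN", "VBZ", "DT", "NN"], ["PRP", "VBD", "DT", "JJ", "NN"]]
# grammaticality_score(["DT", "NN", "VBZ"], build_reference_model(reference))
```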

Jan De Belder and Sien Moens. Sentence Compression for Dutch using Integer Linear Programming

Sentence compression is a valuable task in the framework of text summarization. In this paper we compress sentences from news articles from major Belgian newspapers written in Dutch using an integer linear programming approach. We rely on the Alpino parser available for Dutch and on the Latent Words Language Model. We demonstrate that the integer linear programming approach yields good results for compressing Dutch sentences, despite the large freedom in word order.

Alberto Barrón-Cedeño, Chiara Basile, Mirko Degli Esposti and Paolo Rosso. Word Length n-grams for Text Re-Use Detection

The automatic detection of shared content in written documents, which includes text reuse and its unacknowledged commission, plagiarism, has become an important problem in Information Retrieval. This task requires exhaustive comparison of texts in order to determine how similar they are. However, such comparison is impossible in those cases where the amount of documents is too high. Therefore, we have designed a model for the proper pre-selection of closely related documents in order to perform the exhaustive comparison afterwards. We use a similarity measure based on word-level n-grams, which has proved to be quite effective in many applications; this approach, however, normally becomes impracticable for real-world large datasets. As a result, we propose a method based on a preliminary word-length encoding of texts, substituting each word by its length, which provides three important advantages: (i) since the alphabet of the documents is reduced to nine symbols, the space needed to store n-gram lists is reduced; (ii) computation times are decreased; and (iii) length n-grams can be represented in a prefix tree (trie), allowing a more flexible and fast comparison. We show experimentally, on the basis of the perplexity measure, that the noise introduced by the length encoding does not significantly decrease the expressiveness of the text. The method is then tested on two large datasets of co-derivatives and simulated plagiarism.
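
The encoding itself is simple enough to sketch. The snippet below is illustrative, with an assumed containment measure rather than the paper's exact similarity: each word is replaced by its length capped at nine, length n-grams are extracted, and documents are compared by n-gram overlap.

```python
# Minimal sketch of word-length n-gram encoding for pre-selecting document pairs.
def length_encode(text, cap=9):
    # Nine-symbol alphabet: word lengths 1..9 (longer words are capped).
    return [min(len(w), cap) for w in text.split()]

def length_ngrams(text, n=8, cap=9):
    codes = length_encode(text, cap)
    return {tuple(codes[i:i + n]) for i in range(len(codes) - n + 1)}

def containment(doc_a, doc_b, n=8):
    a, b = length_ngrams(doc_a, n), length_ngrams(doc_b, n)
    return len(a & b) / len(a) if a else 0.0

# Reused passages largely keep their word-length profile even under light
# rewording, so a high containment score flags candidate pairs for the
# exhaustive comparison stage.
```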

Roberto Basili and Paolo Annesi. Cross-lingual Alignment of Framenet Annotations through Hidden Markov Models

The development of annotated resources in the area of frame semantics has been crucial to the development of robust systems for shallow semantic parsing. Resource-poor languages have shown a significant delay due to the lack of sufficient training data. Recent works have proposed to exploit parallel corpora in order to automatically transfer the semantic information available for English to other target languages. In this paper, an approach based on Hidden Markov Models is proposed to support the automatic semantic transfer; it uses an aligned bilingual corpus to develop large-scale annotated data sets. As this method relies just on the lexical alignment of sentence pairs, it is robust against preprocessing errors and does not require complex optimization, unlike syntax-dependent models for accurate cross-lingual mapping. The experimental evaluation over an English-Italian corpus is successful, achieving 86% accuracy on average, and improves on state-of-the-art methods for the same task.

Hui Shi, Robert J. Ross, Thora Tenbrink and John Bateman. Modelling Illocutionary Structure: Combining Empirical Studies with Formal Model Analysis

In this paper we revisit the topic of dialogue grammars at the illocutionary force level and present an approach to the formal modelling, evaluation and comparison of these models based on recursive transition networks. Through the use of appropriate tools, such finite-state grammars can be formally analysed and validated against empirically collected corpora. We illustrate our approach through: (a) the construction of human-human dialogue grammars on the basis of recently collected natural language dialogues in joint-task situations; and (b) the evaluation and comparison of these dialogue grammars using formal methods. This work provides a new basis for developing and evaluating dialogue grammars and for engineering corpus-tailored dialogue managers which can be verified for adequacy.

Marcin Junczys-Dowmunt. A Maximum Entropy Approach to Syntactic Translation Rule Filtering

In this paper we present a maximum entropy filter for the translation rules of a statistical machine translation system based on tree transducers. This filter can be successfully applied to reduce the number of translation rules by more than 70% without negatively affecting translation quality as measured by BLEU. For some filter configurations translation quality is even improved. Our investigations include a discussion of the relationship of Alignment Error Rate and Consistent Translation Rule Score with translation quality in the context of Syntactic Statistical Machine Translation.

Gabriela Ramírez-de-la-Rosa, Manuel Montes-y-Gómez and Luis Villaseñor-Pineda. Enhancing Text Classification by Information Embedded in the Test Set

Current text classification methods are mostly based on a supervised approach, which requires a large number of examples to build accurate models. Unfortunately, in several tasks training sets are extremely small and their generation is very expensive. In order to tackle this problem, in this paper we propose a new text classification method that takes advantage of the information embedded in the test set itself. This method is supported by the idea that similar documents should belong to the same category. In particular, it classifies documents by considering not only their own content but also information about the categories assigned to other similar documents from the same test set. Experimental results on four data sets of different sizes are encouraging. They indicate that the proposed method is appropriate for use with small training sets, where it can significantly outperform the results of traditional approaches such as Naive Bayes and Support Vector Machines.
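
A hedged sketch of the underlying idea follows: each test document's classifier scores are mixed with the scores of its most similar documents in the same test set. The neighborhood size and mixing weight are illustrative choices, predict_proba assumes a scikit-learn-style classifier, and this is not the authors' exact procedure.

```python
# Illustrative use of test-set similarity to refine supervised predictions.
import numpy as np

def classify_with_test_set(clf, X_test, k=5, alpha=0.5):
    """clf: a fitted classifier exposing predict_proba;
    X_test: (n_docs, n_features) dense array of the test documents."""
    proba = clf.predict_proba(X_test)                      # own-content evidence
    norms = np.linalg.norm(X_test, axis=1, keepdims=True)
    normed = X_test / np.maximum(norms, 1e-12)
    sims = normed @ normed.T                               # cosine similarities
    np.fill_diagonal(sims, -np.inf)                        # ignore self-similarity
    final = np.empty_like(proba)
    for i in range(X_test.shape[0]):
        neighbors = np.argsort(sims[i])[-k:]               # k most similar test docs
        neighbor_evidence = proba[neighbors].mean(axis=0)
        final[i] = alpha * proba[i] + (1 - alpha) * neighbor_evidence
    return final.argmax(axis=1)
```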

Filip Graliński. Mining Parenthetical Translations for Polish-English Lexica

Documents written in languages other than English sometimes include parenthetical English translations, usually for technical and scientific terminology. Techniques have been developed for extracting such translations (as well as transliterations) from large Chinese text corpora. This paper presents methods for mining parenthetical translations in Polish texts. The main difference between translation mining in Chinese and in Polish is that the latter language is based on the Latin alphabet, which makes it more difficult to identify English translations in Polish texts. On the other hand, some parenthetically translated terms are preceded by the abbreviation "ang." (= English), a kind of "anchor" that allows querying a Web search engine for such translations.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor and Ruslan Mitkov. Identification of Translational Sublanguage: A Machine Learning Approach

This paper presents a machine learning approach to the study of translational sublanguage. The goal is to train a computer system to distinguish between translated and non-translated text, in order to determine the characteristic features that influence the classifiers. Several algorithms reach up to 97.62% success rate on a technical dataset. Moreover, the SVM classifier consistently reports a statistically significant improved accuracy when the learning system benefits from the addition of simplification features to the basic translational classifier system. Therefore, these findings may be considered an argument for the existence of the Simplification Universal.

Cristian Grozea and Marius Popescu. Who's the Thief? Determining the Direction of Plagiarism

Determining the direction of plagiarism (who plagiarized whom in a given pair of documents) is one of the most interesting problems in the field of automatic plagiarism detection. We present an approach using an extension of the Encoplot method, which won the first international competition on plagiarism detection in 2009. We have tested it on a very large corpus of artificial plagiarism, with good results.

Eric SanJuan and Fidelia Ibekwe-SanJuan. Multi Word Term queries for focused Information Retrieval

We address both standard and focused retrieval tasks based on comprehensible language models and interactive query expansion (IQE). Query topics are expanded using an initial set of Multi Word Terms (MWTs) selected from the top n ranked documents. MWTs are special text units that represent domain concepts and objects. As such, they can better represent query topics than ordinary phrases or n-grams. We tested different query representations: bag-of-words, phrases, flat list of MWTs, subsets of MWTs. The experiment is carried out on two benchmarks: the TREC Enterprise track (TRECent) 2007 and 2008 collections, and the INEX 2008 Ad-hoc track using the Wikipedia collection.

Holz Florian and Sven Teresniak. Towards Automatic Detection and Tracking of Topic Change

We present an approach for the automatic detection of topic change. Our approach is based on the analysis of statistical features of topics in time-sliced corpora and their dynamics over time. Processing large amounts of time-annotated news text, we identify new facets of a stream of topics consisting of the latest news of public interest. As an addition to the well-known task of topic detection and tracking, we aim to boil down a daily news stream to its novelty. To that end, we examine the contextual shift of concepts over time slices. To quantify the amount of change, we adopt the volatility measure from econometrics and propose a new algorithm for frequency-independent detection of topic drift and change of meaning. The proposed measure does not rely on plain word frequency but on the mixture of the co-occurrences of words. Thus, the analysis is highly independent of absolute word frequencies and works over the whole frequency spectrum; in particular, it also works well for low-frequency words. Aggregating the computed time-related data of the terms allows us to build overview illustrations of the most evolving terms for a whole time span.
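
Since the exact definition is not given in the abstract, the sketch below only illustrates the flavor of a volatility-style measure under assumed choices: for each term, the ranks of its strongest co-occurring terms are tracked per time slice, and their movement across slices is averaged.

```python
# Heavily simplified sketch of a co-occurrence-rank volatility measure.
import statistics

def cooccurrence_ranks(cooc_slice, term, top=20):
    """cooc_slice: dict mapping a term to a dict of co-occurring term -> strength."""
    neighbours = cooc_slice.get(term, {})
    ordered = sorted(neighbours, key=neighbours.get, reverse=True)
    return {w: rank for rank, w in enumerate(ordered[:top], start=1)}

def volatility(cooc_slices, term, top=20):
    per_slice = [cooccurrence_ranks(s, term, top) for s in cooc_slices]
    vocab = set().union(*per_slice) if per_slice else set()
    movements = []
    for w in vocab:
        ranks = [r.get(w, top + 1) for r in per_slice]   # missing -> worst rank
        if len(ranks) > 1:
            movements.append(statistics.pstdev(ranks))   # rank movement over time
    return statistics.mean(movements) if movements else 0.0
```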

Stefan Trausan-Matu and Traian Rebedea. A Polyphonic Model and System for Inter-Animation Analysis in Chat Conversations with Multiple Participants

Discourse in instant messenger conversations (chats) with multiple participants is often composed of several intertwining threads. Some chat environments for Computer-Supported Collaborative Learning (CSCL) support and encourage the existence of parallel threads by providing explicit referencing facilities. The paper presents a discourse model for such chats based on Mikhail Bakhtin's dialogic theory. It considers that multiple voices (which are not limited to the participants) inter-animate, sometimes in a polyphonic, counterpointal way. An implemented system is also presented, which analyzes such chat logs to detect additional, implicit links among utterances and threads and, more importantly for CSCL, to detect the involvement (inter-animation) of the participants in problem solving. The system begins with an NLP pipeline and concludes with inter-animation identification in order to generate feedback and to propose grades for the learners.

Chaitanya Vempaty and Vadlapudi Ravikiran. Issues in analyzing Telugu Sentences towards building a Telugu Treebank

This paper describes an effort towards building a Telugu Dependency Treebank. We discuss the basic framework and the issues we encountered during annotation. 1487 sentences have been annotated in the Paninian framework. We also discuss how some of the annotation decisions would affect the development of a parser for Telugu.

Liviu P. Dinu and Andrei Rusu. Rank Distance Aggregation as a Fixed Classifier Combining Rule for Text Categorization

In this paper we show that Rank Distance Aggregation can improve ensemble classifier precision in the classical text categorization task by presenting a series of experiments done on a 20-class newsgroup corpus, with a single correct class per document. We aggregate four established document classification methods (TF-IDF, Probabilistic Indexing, Naive Bayes and KNN) in different training scenarios, and compare these results to widely used fixed combining rules such as Voting, Min, Max, Sum, Product and Median.
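
A simplified illustration of rank aggregation for classifier combination (not the paper's exact procedure): each base classifier ranks the candidate classes, and classes are ordered by their median rank, the statistic that minimizes the sum of absolute rank deviations.

```python
# Illustrative rank aggregation over per-classifier class rankings.
import statistics

def aggregate_rankings(rankings):
    """rankings: list of rankings, each an ordered list of class labels
    (best first) produced by one base classifier."""
    classes = set().union(*rankings)

    def ranks_of(c):
        # A class absent from a ranking gets the worst possible position.
        return [r.index(c) + 1 if c in r else len(r) + 1 for r in rankings]

    median_rank = {c: statistics.median(ranks_of(c)) for c in classes}
    return sorted(classes, key=lambda c: (median_rank[c], c))

# rankings = [["sci.med", "sci.space", "talk.politics"],
#             ["sci.space", "sci.med", "talk.politics"],
#             ["sci.med", "talk.politics", "sci.space"]]
# aggregate_rankings(rankings)[0]  # -> "sci.med"
```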

Tomoya Iwakura. A Named Entity Extraction using Word Information Repeatedly Collected from Unlabeled Data

This paper proposes a method for Named Entity (NE) extraction that uses NE-related labels of words repeatedly collected from unlabeled data. NE-related labels of a word include its candidate NE classes, the NE classes of its co-occurring words, and so on. First, we collect NE-related labels of words from NE extraction results produced on unlabeled data by an NE extractor. Then we create a new NE extractor that uses the NE-related labels of each word as new additional features, and use the new NE extractor to collect new NE-related labels of words. We evaluate our method using the IREX data set for Japanese NE extraction. Experimental results show that our method contributes to improved accuracy.

Izaskun Aldezabal, Maria Jesus Aranzabe, Arantza Diaz de Ilarraza, Ainara Estarrona and Larraitz Uria. EusPropBank: Integrating Semantic Information in the Basque Dependency Treebank

This paper deals with theoretical problems found in the ongoing work of annotating semantic roles in the Basque Dependency Treebank (BDT). We present the resources used and the way the annotation is being done. Following the model proposed in the PropBank project, we show the problems found in the annotation process and the decisions we have taken. The representation of the semantic tag has been established and detailed guidelines for the annotation process have been defined, although this is a task that needs continuous updating. In addition, we have adapted AbarHitz, a tool used in the construction of the BDT, to this task.

Xiangdong An. The Optimal IR in Genomics: How Far Away?

There exists a gap between what a human user wants in his mind and what he can get from information retrieval (IR) systems through his queries. We say an IR system is perfect if it always provides the users with what they want in their minds, if available, and optimal if it presents what it finds to the users in an optimal way. In this paper, based on some assumptions, we empirically study how far away we still are from the optimal IR and the perfect IR in genomics. This study gives us a partial perspective on where we are in the IR development path. It also provides us with the lowest upper bound on IR performance improvement achievable by reranking.

Siddhartha Jonnalagadda, Robert Leaman, Graciela Gonzalez and Trevor Cohen. A Distributional Semantics Approach to Simultaneous Recognition of Multiple Classes of Named Entities

Named Entity Recognition and Classification has been studied for the last two decades, with most current tools requiring a lot of time for training. Since semantic features take a huge amount of training time and are slow in inference, existing tools apply features and rules mainly at the word level or use lexicons. Recent advances in distributional semantics allow us to efficiently create paradigmatic models that encode word order. We used Sahlgren's Random Indexing based model to create an elegant, scalable, efficient and accurate system that simultaneously recognizes multiple entity types mentioned in natural language; it is validated on the GENIA corpus, which has annotations for 46 biomedical entity types and supports nested entities. Using only straightforward distributional semantics features, it achieves an overall micro-averaged F-measure of 67.3% based on fragmental matching, with performance ranging from 7.4% for “DNA substructure” to 80.7% for “Bioentity”.

George Tsatsaronis, Iraklis Varlamis and Kjetil Nørvåg. An Experimental Study on Unsupervised Graph-based Word Sense Disambiguation

Recent research works on unsupervised word sense disambiguation report an increase in performance, which reduces their handicap with respect to supervised approaches for the same task. Among the latest state-of-the-art methods, those that use semantic graphs have reported the best results. Such methods create a graph comprising the words to be disambiguated and their corresponding candidate senses. The graph is expanded by adding semantic edges and nodes from a thesaurus. The selection of the most appropriate sense per word occurrence is then made through the use of graph processing algorithms that offer a degree of importance among the graph vertices. In this paper we experimentally investigate the performance of such methods. We additionally evaluate a new method based on a recently introduced algorithm for computing similarity between graph vertices, P-Rank. We evaluate the performance of all alternatives on two benchmark data sets, Senseval 2 and 3, using WordNet. The current study shows the differences in the performance of each method when applied on the same semantic graph representation, and analyzes the pros and cons of each method for each part of speech separately. Furthermore, it analyzes the levels of inter-agreement at the sense selection level, giving further insight into how these methods could be employed in an unsupervised ensemble for word sense disambiguation.

Saeed Raheel and Joseph Dichy. An Empirical Study on the Feature’s Type Effect on the Automatic Classification of Arabic Documents

Arabic is a highly inflectional and morphologically very rich language. It presents serious challenges to the automatic classification of documents, one of which is determining what type of attribute to use in order to get optimal classification results. Some researchers use roots or lemmas, which, they argue, handle inflection problems that do not appear in the same fashion in other languages. Others prefer character-level n-grams, since n-grams are simpler to implement, language independent, and produce satisfactory results. Which of these two approaches is better, if any? This paper tries to answer this question by offering a comparative study of four feature types: words in their original form, lemmas, roots, and character-level n-grams, and shows how each affects the performance of the classifier. We used and compared the performance of the Support Vector Machines and Naïve Bayesian Networks algorithms.

Slim Mesfar. Towards a cascade of morpho-syntactic tools  for Arabic Natural Language Processing

This paper presents a cascade of morpho-syntactic tools to deal with Arabic natural language processing. It begins with the description of a large coverage formalization of the Arabic lexicon. The built electronic dictionary, named "El-DicAr", which stands for “Electronic Dictionary for Arabic”, links inflectional, morphological, and syntactic-semantic information to the list of lemmas. Automated inflectional and derivational routines are applied to each lemma producing over 3 million inflected forms. El-DicAr represents the linguistic engine for the automatic analyzer, built through a lexical analysis module, and a cascade of morpho-syntactic tools including: a morphological analyzer, a spell-checker, a named entity recognition tool, an automatic annotator and tools for linguistic research and contextual exploration. The morphological analyzer identifies the component morphemes of the agglutinative forms using large coverage morphological grammars. The spell-checker corrects the most frequent typographical errors. The lexical analysis module handles the different vocalization statements in Arabic written texts. Finally, the named entity recognition tool is based on a combination of the morphological analysis results and a set of rules represented as local grammars.

Victor Bocharov, Lidia Pivovarova, Valery Sh. Rubashkin and Boris Chuprin. Ontological Parsing of Encyclopedia Information

Semi-automatic ontology learning from an encyclopedia is presented, with the primary focus on the syntactic and semantic analysis of definitions.

Raquel Justo, Alicia Pérez, M. Inés Torres and Francisco Casacuberta. Hierarchical finite-state models for speech translation using categorization of phrases

In this work a hierarchical translation model is formally defined and integrated into a speech translation system. As is well known, the relations between two languages are better arranged in terms of phrases than in terms of running words. Nevertheless, phrase-based models may suffer from data sparsity at training time. The aim of this work is to improve current speech translation systems by integrating categorization within the translation model. The categories are sets of phrases, the latter being either linguistically or statistically motivated. The category, translation, and acoustic models are all finite-state models. Regarding temporal cost, finite-state models count on efficient algorithms. Regarding spatial cost, all the models were integrated on the fly at decoding time, allowing an efficient use of memory.

Rahul Katragadda. GEMS: Generative Modeling for Evaluation of Summaries

In this paper we argue for the need for alternative 'automated' summarization evaluation systems for both content and readability. In the context of the TAC Automatically Evaluating Summaries Of Peers (AESOP) task, we describe the problem with content evaluation metrics. We model the problem as an information ordering problem; our approach (and indeed others) should now be able to rank systems (and possibly human summarizers) in the same order as human evaluation would have produced. We show how a well-known generative model can be used to create automated evaluation systems comparable to the state of the art. Our method is based on a multinomial distribution of key terms (or signature terms) in document collections and on how they are captured in peers. We have used two types of signature terms to model the evaluation metrics. The first is based on POS tags of important terms in a model summary and the second is based on how much information the reference summaries share among themselves. Our results show that verbs and nouns are key contributors to our best run, which depended on various individual features. Another important observation was that all the metrics were consistent in that they produced similar results for both cluster A and cluster B in the context of update summaries. The most startling result is that, in comparison with the automated evaluation metrics currently in use (ROUGE, Basic Elements), our approach has been very good at capturing "overall responsiveness" apart from pyramid-based manual scores.

Meghana Marathe and Graeme Hirst. Lexical Chains using Distributional Measures of Concept Distance

In practice, lexical chains are typically built using term reiteration or resource-based measures of semantic distance. The former approach misses out on a significant portion of the inherent semantic information in a text, while the latter suffers from the limitations of the linguistic resource it depends upon. In this paper, chains are constructed using the framework of distributional measures of concept distance, which combines the advantages of resource-based and distributional measures of semantic distance. These chains were evaluated by applying them to the task of text segmentation, where they performed as well as or better than state-of-the-art methods.

Silviu Cucerzan. A Case Study of Using Web Search Engines in NLP:  Case Restoration

Web search engines provide a large amount of information about language in general, and its current use in particular. In this paper, we investigate the use of Web search engine statistics for the task of case restoration. Because most engines are case insensitive, an approach based on search hit counts, as employed in previous work on natural language ambiguity resolution, is not applicable to this task. We investigate the use of statistics computed from the snippets generated by a Web search engine. We note that the top few results (one to ten) returned by a search engine may not be the most representative for modeling phenomena in a language. Finally, we examine the Web coverage of n-grams in three different languages (English, German, and Spanish).

Short oral presentation + poster

publication in complementary proceedings
(special issues of RCS and IJCLA journals)

Peng Li, Maosong Sun and Ping Xue. Fast-Champollion : A Fast and Robust Sentence Alignment Algorithm

Aligned parallel texts are important resources for many natural language processing tasks, including statistical machine translation. With the rapid growth of online parallel texts, efficient and robust sentence alignment methods become increasingly important. In this paper, we propose a fast and robust sentence alignment algorithm, Fast-Champollion, which employs a combination of a length-based and a lexicon-based method. By optimizing the process of splitting the input bilingual texts into small segments for alignment, Fast-Champollion, as our extensive experiments show, is 3.9 to 9.0 times as fast as the baseline method Champollion on short texts and about 50.6 times as fast on long texts, while being as robust as Champollion.
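The length-based component of such aligners can be pictured as a small dynamic program over sentence lengths; the following is a simplified stand-in (1-1, 1-0 and 0-1 links only, illustrative costs), not the Fast-Champollion implementation.

def align_by_length(src_lens, tgt_lens, skip_penalty=4.0):
    """Dynamic programming over sentence lengths; returns aligned index pairs.
    Assumes non-empty sentences (non-zero lengths)."""
    n, m = len(src_lens), len(tgt_lens)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1 match: cost grows with relative length mismatch
                c = cost[i][j] + abs(src_lens[i] - tgt_lens[j]) / (src_lens[i] + tgt_lens[j])
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j, "match")
            if i < n and cost[i][j] + skip_penalty < cost[i + 1][j]:   # 1-0 deletion
                cost[i + 1][j], back[i + 1][j] = cost[i][j] + skip_penalty, (i, j, "skip_src")
            if j < m and cost[i][j] + skip_penalty < cost[i][j + 1]:   # 0-1 insertion
                cost[i][j + 1], back[i][j + 1] = cost[i][j] + skip_penalty, (i, j, "skip_tgt")
    pairs, i, j = [], n, m
    while back[i][j]:
        pi, pj, op = back[i][j]
        if op == "match":
            pairs.append((pi, pj))
        i, j = pi, pj
    return list(reversed(pairs))

# Sentence lengths in characters for a toy bilingual text.
print(align_by_length([40, 12, 55], [38, 14, 60]))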

Zilong Chen and Yang Lu. A SVM based Method for Active Relevance Feedback

In vector space models, traditional relevance feedback techniques, which use the terms in the relevant documents to enrich the user's initial query, are an effective way to improve retrieval performance. However, this process also brings non-relevant terms from the relevant documents into the new query. The number of non-relevant terms grows as the feedback process is repeated, which eventually damages retrieval performance. This paper introduces an SVM-based method for relevance feedback. We train a classifier on the feedback documents and classify the rest of the documents, so that in the result list the relevant documents are ranked ahead of the non-relevant ones. The new approach avoids modifying the query, relying instead on a text classification algorithm in the relevance feedback process, and it is a new direction for relevance feedback techniques. Experiments with a TREC dataset demonstrate the effectiveness of this method.
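The feedback loop described above amounts to fitting a classifier on the judged documents and re-ranking the remainder by its decision value; a minimal sketch, assuming a toy corpus and scikit-learn rather than the authors' setup:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

judged_docs = ["solar power plants", "wind energy turbines",
               "football match report", "movie review drama"]
labels = [1, 1, 0, 0]            # 1 = relevant, 0 = non-relevant (user feedback)
unjudged = ["new solar panel efficiency study", "tennis tournament results"]

vec = TfidfVectorizer()
clf = LinearSVC().fit(vec.fit_transform(judged_docs), labels)

# Rank the rest of the collection: likely-relevant documents come first.
scores = clf.decision_function(vec.transform(unjudged))
for doc, score in sorted(zip(unjudged, scores), key=lambda p: p[1], reverse=True):
    print(f"{score:+.3f}  {doc}")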

Martín Ariel Domínguez and Gabriel Infante-Lopez. Head Finders Inspection: An Unsupervised Optimization Approach

Head finder algorithms are used by supervised parsers during their training phase to transform phrase structure trees into dependency trees. For the same phrase structure tree, different head finders produce different dependency trees. Head finders have usually been linguistically inspired and have been used by parsers as given. In this paper, we present an optimization set-up that tries to produce a head finder algorithm that is optimal for parsing. We also present a series of experiments with random head finders. We conclude that, although we obtain some statistically significant improvements with the optimal head finder, experiments with random head finders show that random changes in head finder algorithms do not dramatically impact the performance of parsers.

Tim vor der Brück. Hypernymy Extraction Using a Semantic Network Representation

There are several approaches to detecting hypernymy relations in texts by text mining. Usually these approaches are based on supervised learning and, in a first step, extract several patterns. These patterns are then applied to previously unseen texts and used to recognize hypernym/hyponym pairs. Normally these approaches rely only on a surface representation or a syntactic tree structure, i.e., constituency or dependency trees derived by a syntactic parser. In this work, however, we present an approach that operates directly on a semantic network generated by a deep syntactico-semantic analysis. Hyponym/hypernym pairs are then extracted by the application of graph matching. This algorithm is combined with a shallow approach enriched with semantic information.

Mihai Dascalu, Stefan Trausan-Matu and Philippe Dessus. Utterances Assessment and Summarization in Chat Conversations

With the continuous evolution of collaborative environments, the need for automatic analysis and assessment of participants in instant messenger conferences (chats) has become essential. An utterance grading system also provides a basis for chat summarization. For these aims, on the one hand, a series of factors based on natural language processing (including lexical analysis and Latent Semantic Analysis) and data mining have been taken into consideration. On the other hand, in order to thoroughly assess participants, measures such as Page's essay grading, readability, and social network analysis metrics were computed. The weights of each factor in the overall grading system are optimized using a genetic algorithm whose entries are provided by a perceptron in order to ensure numerical stability. A gold standard has been used for evaluating the system's performance.

Ayako HOSHINO and Hiroshi NAKAGAWA. Predicting the Difficulty of Multiple-Choice Cloze Questions for Computer-Adaptive Testing

Multiple-choice, fill-in-the-blank questions are widely used to assess language learners' knowledge of grammar and vocabulary. The questions are often used in CAT (Computer-Adaptive Testing) systems, which are commonly based on IRT (Item Response Theory). The drawback of a simple application of IRT is that it requires training data which are not available in many real-world situations. In this work, we explore a machine learning approach to predicting the difficulty of a question from automatically extractable features. Using the SVM (Support Vector Machine) learning algorithm and 27 features, we achieve over 70% accuracy in a two-way classification task. The trained classifier is applied to a CAT system with a group of postgraduate ESL (English as a Second Language) students. The results show that the predicted values are more indicative of the testees' performance than a baseline index (sentence length) alone.

Mircea Petic. Automatic derivational morphology contribution to Romanian lexical acquisition

Derivation with affixes is a method of vocabulary enrichment. Awareness of the stages in the process of lexicon enrichment by means of derivational morphology mechanisms leads to the construction of an automatic generator of new derivatives. The digital variant of the derivatives dictionary therefore helps to overcome difficult situations in the validation of new words and in handling the uncertain character of the affixes. In addition, derivative groups, the concrete consonantal and vocalic alternations, and the lexical families can be established using the dictionary of derivatives.

Amal Zouaq, Michel Gagnon and Benoit Ozell. Semantic Analysis using Dependency-based Grammars and Upper-Level Ontologies

The process of extracting semantic information from texts is a key issue for the natural language processing community. However, this semantic extraction is generally presented as a deeply-intertwined syntactic and semantic process, which makes it not easily adaptable and reusable from a practical point of view. Moreover, it often relies on linguistic resources that are costly to acquire. This represents an obstacle to the wide development, use and update of semantic analyzers. This paper presents a modular semantic analysis pipeline that aims at extracting logical representations from free text based on dependency grammars and assigning meaning to these logical representations using an upper-level ontology. An evaluation is conducted, where a comparison of our system with baseline systems shows preliminary results.

Claudia Denicia-Carral, Manuel Montes-y-Gómez, Luis Villaseñor-Pineda and Rita Marina Aceves-Pérez. Bilingual Document Clustering using Translation-Independent Features

This paper focuses on the task of bilingual clustering, which involves dividing a set of documents from two different languages into a set of thematically homogeneous groups. It proposes a translation-independent approach specially suited to linguistically related languages. In particular, it proposes representing the documents by pairs of words that are orthographically or thematically related. The experimental evaluation on three bilingual collections and using two clustering algorithms demonstrated the appropriateness of the proposed representation, whose results are comparable to those of other approaches based on complex linguistic resources such as machine translation systems, part-of-speech taggers, and named entity recognizers.

Costin-Gabriel Chiru, Valentin Cojocaru, Traian Rebedea and Stefan Trausan-Matu. Malapropisms Detection and Correction Using a Paronyms Dictionary, a Search Engine and WordNet

This paper presents a method for the automatic detection and correction of malapropism errors found in documents using the WordNet lexical database, a search engine (Google), and a paronym dictionary. Malapropism detection is based on evaluating the cohesion of the local context using the search engine, while correction is done using whole-text cohesion, evaluated in terms of lexical chains built using the linguistic ontology. The correction candidates, which are taken from the paronym dictionary, are evaluated against the local and whole-text cohesion in order to find the best candidate, which is chosen for replacement. The testing methods of the application are presented, along with the obtained results.

Richard Sproat. Lightly Supervised Learning of Text Normalization:  Russian Number Names

Most areas of natural language processing today make heavy use of automatic inference from large corpora. One exception is text-normalization for such applications as text-to-speech synthesis, where it is still the norm to build grammars by hand for such tasks as handling abbreviations or the expansion of digit sequences into number names. One reason for this, apart from the general lack of interest in text normalization, has been the lack of annotated data. For many languages, however, there is abundant unannotated data that can be brought to bear on these problems. This paper reports on the inference of number-name expansion in Russian, a particularly difficult language due to its complex inflectional system. A database of several million spelled-out number names was collected from the web and mapped to digit strings using an overgenerating number-name grammar. The same overgenerating number-name grammar can be used to produce candidate expansions into number names, which are then scored using a language model trained on the web data. Our results suggest that it is possible to infer expansion modules for very complex number name systems, from unannotated data, and using a minimum of hand-compiled seed data.
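The selection step can be pictured as scoring grammar-generated candidate expansions with a language model; the sketch below uses an invented four-sentence corpus and add-one-smoothed bigrams, so the data, the candidate list and the smoothing are placeholders rather than the web-scale model used in the paper.

import math
from collections import Counter

corpus = [
    "в парке двадцать одно дерево",
    "ему двадцать один год",
    "прошла двадцать одна минута",
    "ему двадцать один год исполнился",
]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + sent.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def score(phrase, context="ему"):
    """Add-one smoothed bigram log-probability of the expansion in its left context."""
    toks = [context] + phrase.split()
    V = len(unigrams)
    return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
               for a, b in zip(toks, toks[1:]))

# Hypothetical candidate expansions an overgenerating grammar might emit for "21".
candidates = ["двадцать один", "двадцать одна", "двадцать одно"]
print(max(candidates, key=score))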

Guoyu Tang and Yunqing Xia. Adaptive Topic Modeling with Probabilistic Pseudo Feedback in Online Topic Detection

An online topic detection (OTD) system analyzes sequential stories in a real-time manner so as to detect new topics or to associate stories with existing topics. To handle new stories more precisely, an adaptive topic modeling method that incorporates probabilistic pseudo feedback is proposed in this paper to tune every topic model to the changing environment. Unlike previous methods, this method treats every incoming story as pseudo feedback with a certain probability, namely the similarity between the story and the topic. Experimental results show that probabilistic pseudo feedback brings a promising improvement to online topic detection.
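The pseudo-feedback idea can be sketched as a similarity-weighted centroid update; the vectors, threshold and learning rate below are illustrative assumptions, not the paper's model.

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def update_topic(centroid, story_vec, threshold=0.3, rate=0.1):
    """Return the updated topic centroid; the story's weight is its similarity to the topic."""
    sim = cosine(centroid, story_vec)
    if sim < threshold:          # story more likely starts a new topic: leave model unchanged
        return centroid, sim
    w = sim                      # pseudo-feedback probability
    return (1 - rate * w) * centroid + rate * w * story_vec, sim

topic = np.array([0.8, 0.1, 0.1])
story = np.array([0.6, 0.3, 0.1])
new_topic, sim = update_topic(topic, story)
print(sim, new_topic)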

Prakash Mondal. Exploring the N-th Dimension of Language

This paper aims at exploring the hidden fundamental computational property of natural language that has been so elusive that all attempts to characterize the real computational property of language have ultimately failed. Earlier, natural language was thought to be context-free. However, it was gradually realized that this does not hold much water, given that a range of natural language phenomena have been found to be of a non-context-free character, which has almost scuttled plans to brand natural language context-free. So it has been suggested that natural language is mildly context-sensitive and to some extent context-free. In all, it seems that the issue of the exact computational property has not yet been settled. Against this background it will be proposed that this exact computational property of natural language is perhaps the N-th dimension of language, if what we mean by dimension is nothing but a universal (computational) property of natural language.

Alberto Ochoa O.Zezzatti. Traditional Rarámuri Songs used by a Recommender System to a Web Radio

This research describes an intelligent Web radio associated with a Recommender System that uses different songs in a database related to a kind of traditional music (Rarámuri songs). The system employs the Dublin Core metadata standard for document description and the XML standard for describing the user profile, and relies on the user's profile and on service and data providers to generate musical recommendations for a Web radio. The main contribution of the work is to provide a recommendation mechanism based on this Recommender System that reduces the human effort spent on profile generation. In addition, this paper presents and discusses some experiments based on quantitative and qualitative evaluations.

Brett Drury and Jose Joao Almeida. A Case Study of Rule Based and Probabilistic Word Error Correction of Portuguese OCR Text in a "Real World" Environment for Inclusion in a Digital Library

The transfer of textual information from large collections of paper documents has become increasingly popular. Optical Character Recognition (OCR) software has become a popular method to effect the transfer of this information. The latest commercially available OCR software can be very accurate with reported accuracy of 97% to 99.95%. These high accuracy rates lower dramatically when the documents are in less than pristine condition or the typeface is non-standard or antiquated. In general, OCR recovered text requires some further processing before it can be used in a digital library. This paper documents an attempt by a commercial company to apply automatic word error correction techniques on a "real world" 12 million document collection which contained texts from the late 19th Century until the late 20th Century. The paper also describes attempts to increase the effectiveness of word correction algorithms through the use of the following techniques: 1. reducing the text correction problem to a restricted language domain, 2. segmenting the collection by document quality and 3. learning domain specific rules and text characteristics from the document collection and operator log files. This case study also considers the commercial pressures of the project and the effectiveness of both rule based and probabilistic word error correction techniques on less than pristine documents. It also provides some conclusions for researchers / companies considering multi-million document transfers to electronic storage.

Bruno GALMAR and Jenn-Yeu CHEN. Identifying Different Meanings of a Chinese Morpheme through Latent Semantic Analysis and Minimum Spanning Tree Analysis

A character corresponds roughly to a morpheme in Chinese, and it usually takes on multiple meanings. In this paper, we aim at capturing the multiple meanings of a Chinese morpheme across polymorphemic words in a growing semantic micro-space. Using Latent Semantic Analysis (LSA), we created several nested LSA semantic micro-spaces of increasing size. The term-document matrix of the smallest semantic space was obtained by filtering a whole corpus with a list of 192 Chinese polymorphemic words sharing a common morpheme (公 gong1). This first LSA semantic space was expected to be the worst representation of the semantic relationships between words sharing the same morpheme. Then, additional words were added to the initial list to create bigger LSA semantic spaces. Firstly, we added words that capture all the etymological dimensions of the morpheme under study. Secondly, we proceeded to another addition of vocabulary: some of the added words were extracted from a Chinese dictionary's definitions of the words in the initial list, while others share common morphemes with the polymorphemic words of the initial list. For each of the Chinese LSA spaces we created, we computed the full cosine matrix of all the terms of the semantic space to measure semantic similarity between words. From the cosine matrix, we derived a dissimilarity matrix, which was viewed as a complete weighted undirected graph, and from this graph we built a minimum spanning tree (MST). Thus, each of our LSA semantic spaces had its associated minimum spanning tree. The two smallest trees failed to capture the range of meanings of the morpheme under study. It is shown that for our biggest MST, some edges capture the meaning of polymorphemic words in a way that is satisfactory for a native Chinese reader. In addition, it is shown that some paths in this MST can be used to infer and capture the correct meaning of a morpheme embedded in a polymorphemic word. Clusters of the different meanings of a polysemous morpheme can be created from the minimum spanning tree. Finally, it is concluded that our approach could partly model human knowledge representation and acquisition of the different meanings of Chinese polysemous morphemes. Our work is thought to bring some insights into Plato's problem and additional evidence towards the plausibility of words serving as ungrounded symbols. Future directions are sketched.
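The core of the pipeline (LSA space, cosine matrix, dissimilarity matrix, minimum spanning tree) can be sketched with standard tools; the toy documents and the choice of three LSA dimensions below are assumptions for illustration only.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["公园 公共 场所", "公司 公务 办公", "公平 公正 公开", "公园 公开 活动"]
vec = CountVectorizer(analyzer=lambda s: s.split())   # whitespace-tokenized toy corpus
X = vec.fit_transform(docs)

# LSA space: rows of the transposed component matrix serve as term vectors.
svd = TruncatedSVD(n_components=3, random_state=0)
svd.fit(X)
term_vectors = svd.components_.T                      # shape: (n_terms, k)

sim = cosine_similarity(term_vectors)                 # cosine matrix of all terms
dissim = np.maximum(1.0 - sim, 1e-9)                  # dissimilarity matrix (kept strictly positive)
np.fill_diagonal(dissim, 0.0)

mst = minimum_spanning_tree(dissim)                   # weighted MST over the complete term graph
terms = vec.get_feature_names_out()
rows, cols = mst.nonzero()
for i, j in zip(rows, cols):
    print(terms[i], "--", terms[j], round(float(mst[i, j]), 3))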

Agus Hartoyo, Suyanto and Diyas Puspandari. An Improved Indonesian Grapheme-to-Phoneme Conversion Using Statistic and Linguistic Information

This paper focuses on the IG-tree + best-guess strategy as a model for developing Indonesian grapheme-to-phoneme conversion (IndoG2P). The model is basically a decision-tree structure built from a training set. It is constructed using the concept of information gain (IG) to weigh the relative importance of attributes, and is equipped with the best-guess strategy for classifying new instances. It is also extended with two new features added to its pre-existing structure. The first feature is a pruning mechanism to minimize the IG-tree dimension and to improve its generalization ability. The second is a homograph handler using a text-categorization method to handle the special case of small sets of words that have exactly the same spelling but differ from each other in their phonetic representations. Computer simulations showed that the complete model performs well, and the two additional features gave the expected benefits.

Bento C. Dias-da-Silva. Brazilian Portuguese WordNet: A Computational Linguistic Exercise of Encoding Bilingual Relational Lexicons

This paper presents the methodology for encoding the Brazilian Portuguese WordNet synsets and the way the conceptual relations of hyponymy, co-hyponymy, meronymy, cause, and entailment are inherited automatically. After contextualizing the project and outlining the current lexical database structure and statistics, the paper describes the Brazilian Portuguese WordNet editing tool used to encode synsets, select sample sentences from corpora, write synset glosses, and encode the EQ_RELATIONS relations between WordNet.Br and Princeton WordNet, and concludes by exemplifying the automatic generation of the hyponymy and co-hyponymy conceptual relations between the Brazilian Portuguese WordNet synsets.

Shailly Goyal, Shefali Bhat, Shailja Gulati and C Anantaram. Ontology-Driven Approach to Obtain Semantically Valid Chunks for NL-Enabled Business Applications

For a robust natural language question answering system over business applications, query interpretation is a crucial and complicated task. The complexity arises from the inherent ambiguity of natural language, which may result in multiple interpretations of the user's query. General-purpose natural language (NL) parsers are also insufficient for this task because, while they give a syntactically correct parse, they lose the semantics of the sentence, since such parsers lack domain knowledge. In the present work we address this shortcoming and describe an approach to enriching a general-purpose NL parser with domain knowledge to obtain semantically valid chunks for an input query. A part of the domain knowledge, expressed as a domain ontology, along with part-of-speech (POS) tagging, is used to identify the correct predicate-object pairs. These pairs form the constraints in the query. In order to identify the semantically valid chunks of a query, we use the syntactic chunks obtained from a parser, the constraints obtained by predicate-object binding, and the domain ontology. These semantically valid chunks help in understanding the intent of the input query and assist in its answer extraction. Our approach works seamlessly across various domains, provided the corresponding domain ontology is available.

Plaban Kumar Bhowmick, Anupam Basu, Pabitra Mitra and Abhisek Prasad. Sentence Level News Emotion Analysis in Fuzzy Multi-label Classification Framework

Multiple emotions are evoked with different degrees of belongingness in readers' minds in response to text stimuli. In this work, we perform reader-perspective emotion analysis at the sentence level, considering each sentence to be associated with the emotion classes with fuzzy belongingness. As news articles present emotionally charged stories and facts, a corpus of 1305 news sentences is considered in this study. Experiments have been performed in a fuzzy k-nearest-neighbor classification framework with four different feature groups. A word-feature-based classification model is considered as the baseline. In addition, we have proposed three other features, namely polarity, semantic frame, and emotion eliciting context (EEC) based features. Different measures applicable to multi-label classification problems have been used to evaluate our system's performance. Comparisons between the different feature groups revealed that the EEC-based feature is the most suitable one for the reader-perspective emotion classification task.
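A fuzzy k-nearest-neighbour classifier of the kind used in this setup assigns each test sentence a membership degree per emotion class from the distance-weighted memberships of its nearest training sentences; the vectors, memberships and fuzzifier below are toy placeholders, not the paper's features.

import numpy as np

def fuzzy_knn(train_X, train_U, x, k=3, m=2.0):
    """train_U[i, c] is the fuzzy membership of training item i in class c."""
    d = np.linalg.norm(train_X - x, axis=1) + 1e-12
    nn = np.argsort(d)[:k]
    w = d[nn] ** (-2.0 / (m - 1.0))            # inverse-distance weights
    return (w[:, None] * train_U[nn]).sum(axis=0) / w.sum()

train_X = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.8]])
train_U = np.array([[0.9, 0.1], [0.7, 0.3], [0.1, 0.9], [0.3, 0.7]])  # columns: e.g. joy, sadness
print(fuzzy_knn(train_X, train_U, np.array([0.85, 0.2])))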

Michael Carl, Martin Kay and Kristian Jensen. Long-distance Revisions in Post-editing and Drafting

The paper looks at the translation patterns of student and professional translators. A clearer division of the process into gisting, drafting and post-editing phases is apparent in professional than in student translators. It turns out, however, that long-distance revisions, which would typically be expected during the post-editing phase, occur to the same extent during drafting. Further, long-distance revisions seem to accumulate at particular positions in the translation for both students and professionals, suggesting that both groups recognize the same portions of the text as problematic. We suggest how these findings might be taken into account in the design of computer-assisted tools.

Ke Wu and Jiangsheng Yu. Unsupervised Text Pattern Learning Using Minimum Description Length

Knowledge of text patterns in a domain-specific corpus is valuable in many natural language processing (NLP) applications, such as information extraction and question-answering systems. In this paper, we propose a simple but effective probabilistic language model for modeling both continuous and non-continuous patterns. Using the minimum description length (MDL) principle, an efficient unsupervised learning algorithm is implemented, and an experiment on an English critical writing corpus has shown promising results.

Aleksei Ustimov, M. Borahan Tumer and Tunga Gungor. A Low-Complexity Constructive Learning Automaton Approach to Handwritten Character Recognition

The task of syntactic pattern recognition has aroused the interest of researchers for several decades. The power of the syntactic approach comes from its capability in exploiting the sequential characteristics of the data. In this work, we propose a new method for syntactic recognition of handwritten characters. The main strengths of our method are its low run-time and space complexity. In the lexical analysis phase, the lines of the presented sample are segmented into simple strokes, which are matched to the primitives of the alphabet. The reconstructed sample is passed to the syntactic analysis component in the form of a graph where the edges are the primitives and the vertices are the connection points of the original strokes. In the syntactic analysis phase, the interconnections of the primitives extracted from the graph are used as a source for the construction of a learning automaton. We reached recognition rates of 72% for the best match and 94% for the top five matches.

Hristo Tanev, Monica Gemo and Mijail Kabadjov. Learning Event Semantics from Online News

In this paper we present an algorithm for the automatic extension of an event extraction grammar by unsupervised learning of semantic clusters of terms. In particular, we tested our algorithm on learning terms which are relevant for the detection of displacement and evacuation events. Such events constitute an important part of the development of humanitarian crises, conflicts, and natural and man-made disasters. Apart from the grammar extension, we consider our learning algorithm and the obtained semantic classes as a first step towards the semi-automatic building of a domain-specific ontology of disaster events. The starting point for us is an existing event extraction grammar for the detection of evacuations and displacements from online news reports in English. The grammar is an integral part of NEXUS, an automatic system for event extraction from online news, which is profiled in the domain of security and crisis management. NEXUS exploits over 90 event-specific patterns and a noun phrase recognition grammar to detect the boundaries of phrases which refer to groups of people. Using these two resources, the system can identify text fragments such as "about 200000 people have abandoned their homes", where the phrase "about 200000 people" will be labeled with the event-specific semantic category "displaced people". Similarly, NEXUS can identify phrases about evacuations, such as "five women were evacuated", where "five women" will be labeled as evacuated people. However, NEXUS cannot detect additional information which is reported together with the evacuated or displaced people. For example, in the text fragment "more than 1000 people were evacuated after a chemical leak", the system can recognize that more than 1000 people were evacuated, but it cannot recognize that the event which caused the evacuation was a chemical leak. In a similar way, the phrase "20000 people displaced to Beddawi camp" reports both the number of displaced people and the place where they were moved, but this second piece of information will be missed by the current version of the grammar. In order not to miss such important data, we developed an algorithm which automatically expands the event extraction grammar. The main part of our learning algorithm is an unsupervised term extraction and clustering approach which combines several state-of-the-art term acquisition and classification approaches. The goal of the grammar-expanding algorithm is the learning of syntactic adjuncts which are introduced into the description of the events, usually through prepositional phrases. More concretely, we would like to recognize phrases like "many people were evacuated to temporary shelters". In order to do this, our system has to recognize patterns like "[person_group] were displaced to NP(facility)", where NP(facility) refers to a noun phrase whose head noun belongs to the category "facility", described through a list of nouns. The grammar should also assign the event-specific semantic label "place_of_displacement" to this noun phrase. In the context of our experiments we learn grammar extensions in the form of triples (preposition, semantic cluster, event-specific role), for example (to; F; place_of_displacement), where F is a cluster of words which can be considered as belonging to the category "facility" in our event-specific context. We do not specify which triple can be attached to which pattern, since many patterns are based on the same verbs, or at least on verbs which share the same or similar subcategorization frames. Therefore, the sample triple (to; F; place_of_displacement) will be encoded in the extended grammar as "[person_group] right_context_displacement_pattern to NP(F)" and "left_context_displacement_pattern [person_group] to NP(F)", where F refers to a dictionary containing words which are likely to be facilities, e.g. school, hospital, refugee camp, etc. Such rules can recognize text fragments such as "1000 people were displaced to government shelters", provided that "shelters" is a member of the cluster F. Moreover, "government shelters" will be tagged with the semantic label place_of_displacement.

Razvan Bunescu and Yunfeng Huang. Towards a General Model of Answer Typing: Question Focus Identification

We analyze the utility of question focus identification for answer typing models in question answering, and propose a comprehensive definition of question focus based on a relation of coreference with the answer. Equipped with the new definition, we annotate a dataset of 2000 questions with focus information, and design two initial approaches to question focus identification: one that uses expert rules, and one that is trained on the annotated dataset. Empirical evaluation of the two approaches shows that focus identification using the new definition can be done with high accuracy, holding the promise of more accurate answer typing models.

Jihee Ryu, Yuchul Jung, Kyung-min Kim and Sung-Hyon Myaeng. Automatic Extraction of Human Activity Knowledge from Method-Describing Web Articles

Knowledge on daily human activities in various domains is invaluable for many customized user services that can benefit from context-awareness or activity predictions. Past approaches to constructing a knowledge base of this kind have been domain-specific and not scalable. In this paper, we propose an approach to automatically extracting human activity knowledge from Web articles that describe methods for performing tasks in a variety of domains. The knowledge is comprised of goals, actions, and ingredients and extracted with a pattern and machine learning based model applied to a huge number of how-to articles. Our evaluation shows 86% accuracy and 73% coverage for the activity mining task.

Amitava Das and Sivaji Bandyopadhyay. Phrase-level Polarity Identification for Bengali

In this paper, opinion polarity classification on news text is carried out for Bengali, a less privileged language, using Support Vector Machines (SVM). The present system identifies the semantic orientation of an opinionated phrase as either positive or negative. Media news text can be divided into two main types: (1) news reports that aim to objectively present factual information, and (2) opinionated articles that clearly present authors' and readers' views, evaluations or judgments about specific events or persons. Such opinionated articles appear in newspapers in sections such as 'Editorial', 'Forum' and 'Letters to the Editor'. Text from the 'Reader's opinion' section or 'Letters to the Editor' section has been retrieved from the web archive of a popular Bengali newspaper to serve as the relevant corpus. The corpus is POS tagged and chunked, and each chunk (basic phrase) is manually annotated with positive/negative polarity for use during training and testing. The classification of text as either subjective or objective is clearly a precursor to phrase-level polarity identification; a rule-based subjectivity classifier has been used. The polarity identification system uses the Bengali SentiWordNet, which is being developed, along with a dependency parser for Bengali that looks for possible dependency relations between modifiers and opinionated words in the input text. The phrase-level polarity identification system is based on an SVM model and uses the following set of features for each word: Part-of-Speech (POS) category, chunk category, functional word (binary valued), SentiWordNet (binary valued), stemming cluster (indicating the cluster centre that includes the word, dynamic categorical), negative word (binary valued) and dependency tree feature. Evaluation with respect to the number of opinionated phrases in the test corpus has demonstrated an accuracy of 66.45%.

Thoudam Doren Singh and Sivaji Bandyopadhyay. Manipuri-English Example Based Machine Translation system

The paper reports our work on the development of a machine translation system for translating Manipuri to English using an example-based approach. Manipuri is a relatively free word-order language and makes use of a set of enclitics and morphological suffixes for correct meaning representation. The sentence-level parallel Manipuri-English corpus is built from comparable Manipuri and English news corpora. A Manipuri-English lexicon is being developed and is used during the alignment process. Preprocessing steps are applied to the sentence-level parallel corpus in terms of POS tagging, morphological analysis, named entity recognition and chunking on both the source and the target sides. The developed sentence-level parallel corpus is aligned at the phrase level. The translation process initially looks for an exact match in the parallel example base and returns the retrieved target output in case of a match. Otherwise, the maximal-match sentence is identified for the input sentence. For word-level mismatches, the unmatched words in the input are looked up in the lexicon or transliterated if not found there. Unmatched phrases are looked up in the phrase-level parallel example base, the target phrase translation is identified and then recombined with the retrieved output from the maximal-match parallel pair. In case of more than one maximal-match pair, the most frequent pair in the example base is identified and the same translation process is carried out as for one maximal match. If there is no match (full or partial) with any of the parallel pairs, then a phrasal EBMT method, which is being developed, will be applied. The EBMT system has been developed using 15319 Manipuri-English parallel sentences and evaluated three-fold with a test set of 900 gold-standard test sentences, giving BLEU and NIST scores of 0.137 and 3.361 respectively. A baseline SMT system using MOSES with the same training and test data has been developed and subsequently evaluated, giving BLEU and NIST scores of 0.128 and 3.195 respectively. Thus the proposed EBMT system has performed better than the baseline SMT system with the same training and test data.

Nicole Novielli and Carlo Strapparava. Exploring the Lexical Semantics of Dialogue Acts

People proceed in their conversations through a series of dialogue acts to convey specific communicative intentions. In this paper, we study the task of automatically labeling dialogues with the proper dialogue acts, relying on empirical methods and simply exploiting the lexical semantics of the utterances. In particular, we present some experiments in supervised and unsupervised frameworks on an English and an Italian corpus of dialogue transcriptions. In the experiments we consider settings with and without additional information from the dialogue structure. The evaluation shows good results, also in the unsupervised scenario and regardless of the language used. We conclude the paper with a qualitative analysis of the lexicon used for each dialogue act: we explore the relationship between the communicative goal of an utterance and its affective content, as well as the salience of specific word classes for each speech act.

Isabelle Tellier, Iris Eshkol, Samer Taalab and Jean-Philippe Prost. POS-tagging for Oral Texts with CRF and Category Decomposition

Following the ESLO1 (Enquête sociolinguistique d'Orléans, i.e., Sociolinguistic Inquiry of Orléans) campaign, a large oral corpus was gathered and transcribed in text format. The purpose of the work presented here is to assign morpho-syntactic labels to each unit of this corpus. To this end, we first studied the specificities of the necessary labels and their various possible levels of description. This study led to a new, original hierarchical structure of labels. Then, given that our new set of labels was different from any of those used in existing taggers, and that these tools are usually not fit for oral data, we built a new labelling tool using a Machine Learning approach. As a starting point, we used data labelled by Cordial and corrected by hand. We applied CRFs (Conditional Random Fields) while trying to take the best possible advantage of the linguistic knowledge used to define the set of labels. We obtain an F-measure between 85 and 90, depending on the parameters in use.

Adrian Iftene and Ancuta Rotaru. User Profile Modeling in eLearning using Sentiment Extraction from Text

This paper addresses an important issue in the context of current Web applications, namely the new trend of providing personalized services to users. This aspect of web applications is very controversial because it is hard to identify the main component which should be emphasized, namely the development of a user model. Customizing an application has many advantages (the elimination of repeated tasks, behavior recognition, indicating a shorter way to achieve a particular purpose, filtering out information irrelevant to a user, flexibility), but also disadvantages (there are users who wish to maintain anonymity, users who refuse the offered customization, or users who do not trust the effectiveness of personalization systems). This allows us to say that there are many things to be done in this field: personalization systems can be improved, the user models which are created can be adapted to a larger range of applications, etc. The eLearning system we created aims to reduce the distance between the involved actors (the student and the teacher), providing an easy way to communicate over the Internet. Based on this application, students can ask questions and teachers can provide answers to them. Then, on the basis of this dialogue, we use natural language processing techniques to build a user model in order to improve the communication between the student and the teacher.

Muhammad Humayoun and Christophe Raffalli. MathNat - Mathematical Text in a Controlled Natural Language

The MathNat project aims at being a first step towards automatic formalisation and verification of textbook mathematical text. First, we develop a controlled language for mathematics (CLM) which is a precisely defined subset of English with restricted grammar and dictionary. To make CLM natural and expressive, we support some complex linguistic features such as anaphoric pronouns and references, rephrasing of a sentence in multiple ways producing canonical forms and the proper handling of distributive and collective readings. Second, we automatically translate CLM into a system independent formal language (MathAbs), with a hope to make MathNat accessible to any proof checking system. Currently, we translate MathAbs into equivalent first order formulas for verification. In this paper, we give an overview of MathNat, describe the linguistic features of CLM, demonstrate its expressive power and validate our work with a few examples.

Ingo Glöckner, Sven Hartrumpf and Hermann Helbig. The Automatic Treatment of Text-Constituting Phenomena in the Process of Knowledge Acquisition from Texts

Automatic knowledge acquisition from texts is one of the challenges of the information society that can only be mastered by technical means. While the syntactic analysis of isolated sentences is relatively well understood, the problem of automatically parsing and understanding texts is far from being solved. This paper explains the approach taken by the MultiNet technology in bridging the gap between the syntactic-semantic analysis of single sentences and the creation of knowledge bases representing the content of whole texts. In particular, it is shown how linguistic text phenomena like inclusion or bridging references can be dealt with by logical means using the axiomatic apparatus of the MultiNet formalism. The NLP techniques described are practically applied in transforming large textual corpora like Wikipedia into a knowledge base.

Maria Rosenberg. Lexical Representation of Agentive Nominal Compounds in French and Swedish

This study addresses the lexical representation of French VN and Swedish NV-are agentive nominal compounds. The objective is to examine their semantic structure and output meaning. The analysis shows that, as a result of their semantic structure, the compounds group into some major output meanings. Most frequently, the N constituent corresponds to an Undergoer in the argument structure of the V constituent, and the compound displays an Actor role. More precisely, it denotes an Agent, an Instrument or an Instrumental Locative, specified in the Telic role in the Qualia. Compounds with Place or Event meanings do not display any role in the V’s argument structure. Hence, their N constituent can be either an Actor or an Undergoer. In conclusion, our study proposes a unified semantic account of the compounds in French and Swedish, and can have applications for NLP systems, particularly for disambiguation and machine translation tasks.

Petr Homola and Vladislav Kuboň. Exploiting Charts in the MT Between Related Languages

The paper describes in detail the exploitation of chart-based methods and data structures in a simple system for the machine translation between related languages. The multigraphs used for the representation of ambiguous partial results in various stages of the processing as well as a shallow syntactic chart parser enable a modification of a simplistic and straightforward architecture developed ten years ago for MT experiments between Slavic languages. The number of translation variants provided by the system inspired an addition of a stochastic ranker whose aim is to select the best translations according to a target language model.

Jürgen Vöhringer, Günther Fliedl, Doris Gälle, Christian Kop and Nickolay Bazhenov. Using linguistic knowledge for fine-tuning ontologies in the context of requirements engineering

Nowadays ontology creation is, on the one hand, very often hand-crafted and thus arbitrary. On the other hand, it is supported by statistically enhanced information extraction and concept filtering methods. Automated generation in this sense very often yields "shallow ontologies" with gaps and missing links. In the requirements engineering domain, fine-granulated domain ontologies are needed; therefore the suitability of both hand-crafted and automatically generated, gap-afflicted ontologies for developing applications cannot always be taken for granted. In this paper we focus on fine-tuning ontologies through linguistically guided key concept optimization. In our approach we suggest an incremental process including rudimentary linguistic analysis as well as various mapping and disambiguation steps, including concept optimization through word sense identification. We argue that the final step of word sense identification is essential, since a main feature of ontologies is that their contents must be shareable and therefore also understandable and traceable for non-experts.

Alexandra Balahur, Mijail Kabadjov and Josef Steinberger. Exploiting Higher-level Semantic Information for the Opinion-oriented Summarization of Blogs

Together with the growth of the Web 2.0, people have started more and more to communicate, share ideas and comment in social networks, forums and review sites. Within this context, new and suitable techniques must be developed for the automatic treatment of the large volume of subjective data, to appropriately summarize the arguments presented therein (e.g., as "in favor" and "against"). This article assesses the impact of exploiting higher-level semantic information such as named entities and IS-A relationships for the automatic summarization of positive and negative opinions in blog threads. We first run a sentiment analyzer and then a summarizer based on a framework that draws on Latent Semantic Analysis, and we employ an annotated corpus and the standard ROUGE scorer to automatically evaluate our approach. We compare the results obtained using different system configurations and discuss the issues involved, proposing a suitable method for tackling this scenario.

Tamara Martín, Alexandra Balahur, Andrés Montoyo and Aurora Pons. Word Sense Disambiguation in Opinion Mining: Pros and Cons

The past years have marked the birth of a new type of society, one of interaction and subjective communication, using the mechanisms of the Social Web. As a response to the growth in subjective information, a new task was defined: opinion mining, which deals with its automatic treatment. Like the majority of natural language processing tasks, opinion mining is faced with the issue of language ambiguity, as different senses of the same word may have different polarities. This article studies the influence of applying word sense disambiguation within the task of opinion mining, evaluating the advantages and disadvantages of the approach. We evaluate the proposed method on a corpus of newspaper quotations and compare it to the results of an opinion mining system without WSD. Finally, we discuss our findings and show in what contexts WSD helps in the task of opinion mining.

Elena Irimia and Alexandru Ceausu. Dependency-based translation equivalents for factored machine translation

One of the major concerns of machine translation practitioners is to create good translation models: correctly extracted translation equivalents and a reduced translation table size are the most important evaluation criteria. This paper presents a method for extracting translation examples using the dependency linkage of both the source and the target sentence. To decompose the source/target sentence into fragments, we identified two types of dependency link structures, super-links and chains, and use these structures in different combinations to set the translation example borders. The choice of the dependency-linked n-grams approach is based on the assumption that a decomposition of the sentence into coherent segments, with complete syntactic structure and which can also account for extra-phrasal syntactic dependencies (e.g., the subject-verb relation), would guarantee "better" translation examples and would make better use of the storage space. The performance of the dependency-based approach is measured with the BLEU-NIST score and in comparison with a baseline system.

Tommaso Caselli and Irina Prodanof. Robust Temporal Processing: from Model to System

This paper shows the functioning and the general architecture of an empirically-based model for robust temporal processing of Italian text/discourse. The starting point of this work has been to understand how humans process and recognize temporal relations. The empirical results show that the different salience of the linguistic and commonsense knowledge sources of temporal information calls for specific computational components and procedures to deal with them.

Jing Peng, Anna Feldman and Laura Street. Computing Linear Discriminants for Idiomatic Clause Detection

In this paper, we describe the task of binary classification of clauses into idiomatic and non-idiomatic. Our idiom detection algorithm is based on linear discriminant analysis (LDA). To obtain a discriminant subspace, we train our model on a small number of randomly selected idiomatic and non-idiomatic clauses. We then project both the training and the test data onto the chosen subspace and use a three-nearest-neighbor (3NN) classifier to measure accuracy. The proposed approach is more general than previous algorithms for idiom detection: it does not rely on target idiom types, lexicons, or large manually annotated corpora, nor does it limit the search space to a particular linguistic construction.
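The classification scheme reduces to a discriminant projection followed by a 3NN vote, which can be sketched with scikit-learn as follows; the clause features (word and bigram tf-idf) and the toy clauses are assumptions, not the paper's representation.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

train_clauses = ["he kicked the bucket last night", "she kicked the ball hard",
                 "they spilled the beans about the deal", "he spilled the milk on the floor",
                 "it cost an arm and a leg", "he broke his arm and a leg in the fall"]
train_labels = [1, 0, 1, 0, 1, 0]      # 1 = idiomatic, 0 = literal
test_clauses = ["she finally spilled the beans", "he kicked the door open"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    FunctionTransformer(lambda X: X.toarray()),   # LDA needs a dense matrix
    LinearDiscriminantAnalysis(),                 # discriminant subspace (one dimension for 2 classes)
    KNeighborsClassifier(n_neighbors=3),          # 3NN vote in the projected space
)
model.fit(train_clauses, train_labels)
print(model.predict(test_clauses))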

Jorge Carrillo de Albornoz, Laura Plaza and Pablo Gervas. Improving Emotional Intensity Classification using Word Sense Disambiguation

In recent years, sentiment analysis has become a very popular task in Natural Language Processing. Affective analysis of text is usually presented as the problem of automatically identifying a representative emotional category or scoring the text within a set of emotional dimensions. However, most existing works determine these categories and dimensions by matching the terms in the text with those present in an affective lexicon, without taking into account the context in which these terms are immersed. This paper presents a method for the automatic tagging of sentences with an emotional intensity value, which makes use of the WordNet-Affect lexicon and a word sense disambiguation algorithm to assign emotions to concepts rather than terms. An extensive evaluation is performed using the metrics and guidelines proposed in the SemEval 2007 Affective Text Task. Results are discussed and compared with those obtained by similar systems in the same task.

Hakimeh Fadaei and Mehrnoush Shamsfard. Relation Learning from Persian Web: A Hybrid Approach

Automatic extraction of semantic relations is a challenging task in the field of knowledge acquisition from text and has been addressed by many researchers in recent years. In this paper a hybrid approach for relation extraction from the Persian web is presented. This approach is a combination of statistical, pattern-based, structure-based and mapping methods, using linguistic heuristics to detect some of the errors. The developed system employs tagged corpora, WordNet and Web pages, especially Wikipedia articles, as input resources in the relation learning procedure. In this system, a set of Persian patterns was manually extracted to be used in the pattern-based component. According to the conducted tests, the precision of the pattern-based component is 76%. Experimental results showed that Wikipedia structures are good resources for finding candidate pairs of related concepts. The structure-based approach had a precision between 55% and 85% for different Wikipedia structures. The mapping-based approach, which uses WordNet relations as a guide to extract Persian relations, uses a WSD method to map Persian words to English synsets. This approach obtained a precision of 74% in the performed tests. The most important source of error in all proposed approaches is the lack of proper linguistic tools and resources for Persian.
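The pattern-based component can be pictured with a single extraction pattern; the sketch below uses an English Hearst-style pattern purely for illustration, since the system's manually extracted Persian patterns are not reproduced here.

import re

PATTERN = re.compile(r"(\w[\w ]*?) such as ((?:\w+)(?:, \w+)*(?: and \w+)?)")

def extract_hypernym_pairs(text):
    """Return (hyponym, hypernym) pairs matched by the 'X such as Y' pattern."""
    pairs = []
    for match in PATTERN.finditer(text):
        hypernym = match.group(1).strip().split()[-1]      # crude head-noun heuristic
        for hyponym in re.split(r", | and ", match.group(2)):
            pairs.append((hyponym, hypernym))
    return pairs

print(extract_hypernym_pairs("He plays instruments such as guitar, violin and piano."))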

Heriberto Cuayáhuitl, Nina Dethlefs, Kai-Florian Richter, Thora Tenbrink and John Bateman. A Dialogue System for Indoor Wayfinding Using Text-Based Natural Language

We present a dialogue system that automatically generates indoor route instructions in German when asked about locations, using text-based natural language input and output. The challenging task for this system is to provide the user with a compact set of accurate and comprehensible instructions. We describe our approach based on high-level instructions. The system comprises four main modules: natural language understanding, dialogue management, route instruction generation and natural language generation. We report an evaluation with users unfamiliar with the system, using the PARADISE evaluation framework, in a real environment and naturalistic setting. We present results with high user satisfaction, and discuss future directions for enhancing this kind of system with more sophisticated and intuitive interaction.

Iustin Dornescu and Constantin Orasan. Interactive QA using the QALL-ME framework

One of the main concerns when deploying a real-world QA system is user satisfaction. Despite the relevance of criteria such as usability and utility, mainstream research usually overlooks them due to their inherently subjective, user-centric nature and the evaluation they involve. This problem is particularly important in the case of real-world QA systems, where a "No answer available" answer is not very useful for a user. This paper presents how interaction can be embedded into the QALL-ME framework, an open-source framework for implementing closed-domain QA systems. The changes necessary to the framework are described and an evaluation of the feedback returned to the user for questions with no answer is performed.

Marta Guerrero Nieto, María José García Rodríguez, Adolfo Urrutia Zambrana, Miguel Ángel Bernabé Poveda and Willington Siabato. Incorporating TimeML into a GIS

This study presents a methodology for integrating temporal information from a historical corpus into a Geographic Information System (GIS), with the purpose of analyzing and visualizing the textual information. The selected corpus is composed of business letters of the Castilian merchant Simón Ruiz (1553-1597), in the context of the DynCoopNet Project (Dynamic Complexity of Cooperation-Based Self-Organizing Commercial Networks in the First Global Age), which aims to analyze the dynamics of cooperation procedures in social networks. The integration of the historical corpus into a GIS involved the following phases: (1) recognition and normalization of temporal expressions and events in 16th-century Castilian following the TimeML annotation guidelines, and (2) storage of the tagged expressions in a geodatabase. Implementing this process in a GIS would later allow temporal queries and dynamic visualization of historical events, thus contributing to the recognition of human activity patterns and behaviours over time.
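
A toy sketch of phase (2), with sqlite3 standing in for the geodatabase; the letter fragment, TIMEX value and coordinates are illustrative placeholders:

    # Storing TimeML-tagged expressions in a database table (illustrative).
    import sqlite3

    annotated = ("En Medina del Campo, a 4 de mayo de 1585",
                 {"timex_value": "1585-05-04", "place": "Medina del Campo"})

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE events (
                        letter_text TEXT, timex_value TEXT, place TEXT,
                        lon REAL, lat REAL)""")
    text, tags = annotated
    # Coordinates are placeholders; a GIS layer would supply real geometries.
    conn.execute("INSERT INTO events VALUES (?, ?, ?, ?, ?)",
                 (text, tags["timex_value"], tags["place"], -4.91, 41.31))
    print(conn.execute("SELECT timex_value, place FROM events").fetchall())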

Kiran Mayee. Purpose-Based Classification of Text

Many NLP applications, such as QA, need semantic analysis of text. From the artifact perspective, most of these questions are implicitly purpose-based. This paper presents two methods for the automatic classification of the WordNet corpus based on the presence or absence of a purpose causal relation with respect to artifacts. It identifies a set of lexico-syntactic patterns that are easily recognizable, that occur frequently and across text genre boundaries, and that indisputably indicate the lexical relation of purpose. The classification has been carried out using both a semi-automated and a supervised machine learning method. The results are encouraging.
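
A minimal sketch of pattern-based purpose detection; the English patterns below are illustrative stand-ins for the pattern set identified in the paper:

    # Illustrative lexico-syntactic purpose patterns.
    import re

    PURPOSE_PATTERNS = [
        re.compile(r"\b(?P<artifact>\w+) (?:is|are) used (?:to|for) (?P<purpose>.+)", re.I),
        re.compile(r"\b(?P<artifact>\w+) (?:is|are) designed (?:to|for) (?P<purpose>.+)", re.I),
        re.compile(r"\b(?P<artifact>\w+) serves? to (?P<purpose>.+)", re.I),
    ]

    def classify(sentence):
        """Return (has_purpose_relation, extracted (artifact, purpose) pair or None)."""
        for pattern in PURPOSE_PATTERNS:
            match = pattern.search(sentence)
            if match:
                return True, (match.group("artifact"), match.group("purpose").rstrip("."))
        return False, None

    print(classify("A corkscrew is used to open bottles sealed with corks."))
    # (True, ('corkscrew', 'open bottles sealed with corks'))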

Dana Dannélls. From Formal Specifications to Coherent Representation: A Grammar Based Approach

This paper describes a grammar-driven approach for generating multilingual cultural heritage information about objects held by museums and galleries. Discourse strategies are utilized to select and organize ontological statements. The discourse structure is translated into abstract grammar specifications that are mapped to natural language. We demonstrate discourse generation from formal representations using the Grammatical Framework, GF.

Xiaojia Pu, Qi Mao, Gangshan Wu and Chunfeng Yuan. Chinese Named Entity Recognition with Improved Smoothed Conditional Random Fields

Conditional Random Fields (CRFs) have recently been widely used for labeling sequence data, e.g., in Named Entity Recognition (NER). However, CRFs suffer from the drawback that they are prone to overfitting as the number of features grows. Because of the characteristics of the language, the feature set for the Chinese NER task is very large. Existing approaches to avoid overfitting include regularization and feature selection. The main shortcoming of these approaches is that they ignore the so-called unsupported features, i.e., features that appear in the test set but have zero count in the training set. Without this information, the generalization of CRFs suffers. This paper describes a model called the Improved Smoothed CRF, which captures information about the unsupported features using smoothing features. It provides a very effective and practical way to improve the generalization performance of CRFs. Experiments on Chinese NER demonstrate the effectiveness of our method.
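
A toy illustration of the unsupported-feature notion (the feature template and data are invented for the example); the paper's contribution lies in how such features are smoothed inside the CRF:

    # Features seen in test data but with zero count in training are "unsupported";
    # a plain CRF assigns them no weight at all.
    from collections import Counter

    def char_bigram_features(word, label):
        return [f"{label}:{word[i:i+2]}" for i in range(len(word) - 1)]

    train = [("北京市", "LOC"), ("上海市", "LOC")]
    test = [("南京市", "LOC")]

    train_counts = Counter(f for w, y in train for f in char_bigram_features(w, y))
    test_feats = {f for w, y in test for f in char_bigram_features(w, y)}

    unsupported = test_feats - set(train_counts)
    print("supported:", sorted(test_feats & set(train_counts)))
    print("unsupported:", sorted(unsupported))   # zero weight without smoothing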

António Branco, Filipe Nunes and João Silva. Verb Analysis in a Highly Inflective Language with an MFF Heuristic

We introduce the MFF algorithm for the task of verbal inflection analysis. This algorithm follows a quite straightforward heuristic: it decides on the most frequent inflection feature bundle given the set of admissible feature bundles for a verb input form. The algorithm achieves significantly better levels of accuracy than those offered by the current stochastic tagging technology commonly used for the same task.
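
The heuristic is easy to state in code; a minimal sketch with an invented admissibility table and invented corpus frequencies:

    # Most-Frequent-Feature-bundle (MFF) sketch: among the feature bundles admissible
    # for a verb form, pick the one that is most frequent overall.
    # Admissibility table and frequencies below are invented placeholders.
    ADMISSIBLE = {
        "canta": [("PI", "3s"), ("IMP", "2s")],     # ambiguous verb form
    }
    BUNDLE_FREQ = {("PI", "3s"): 9500, ("IMP", "2s"): 420}

    def mff(form):
        candidates = ADMISSIBLE.get(form, [])
        return max(candidates, key=lambda b: BUNDLE_FREQ.get(b, 0), default=None)

    print(mff("canta"))   # ('PI', '3s') — the most frequent admissible bundle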

Julio Castillo. Recognizing Textual Entailment: Experiments with Machine Learning Algorithms and RTE Corpora

This paper presents a system that uses machine learning algorithms and a combination of datasets for the task of recognizing textual entailment. The chosen features quantify lexical, syntactic and semantic level matching between text and hypothesis sentences. Additionally, we created a filter that uses a set of heuristics based on Named Entities to detect cases where no entailment holds. We analyze how different dataset sizes and classifiers impact the final overall performance of the system. We show that the system performs better than the baseline and than the average of the RTE systems on both the two-way and three-way tasks. We conclude that using the RTE3 corpus with a Multilayer Perceptron for both the two-way and three-way RTE tasks outperforms any other combination of RTE corpora and classifiers.
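
A minimal sketch of the kind of lexical overlap feature and Named Entity filter described here; the threshold and the crude capitalisation-based NE recogniser are assumptions of this sketch:

    # One lexical-overlap feature plus a crude named-entity filter for RTE.
    def tokens(sentence):
        return [t.strip(".,").lower() for t in sentence.split()]

    def lexical_overlap(text, hypothesis):
        t, h = set(tokens(text)), set(tokens(hypothesis))
        return len(t & h) / len(h) if h else 0.0

    def named_entities(sentence):
        # Crude stand-in for a real NE recognizer: capitalised non-initial tokens.
        toks = sentence.split()
        return {tok.strip(".,") for tok in toks[1:] if tok[:1].isupper()}

    def entails(text, hypothesis, threshold=0.6):
        # Filter: if the hypothesis mentions an entity absent from the text, reject.
        if named_entities(hypothesis) - named_entities(text):
            return False
        return lexical_overlap(text, hypothesis) >= threshold

    print(entails("John visited Paris in 2008.", "John was in Paris."))    # True
    print(entails("John visited Paris in 2008.", "John was in London."))   # False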

Somnuk Sinthupoun and Ohm Sornil. Thai Rhetorical Structure Tree Construction

A rhetorical structure tree (RS tree) is a representation of the discourse relations among elementary discourse units (EDUs). An RS tree is very useful for many text processing tasks that employ relationships among EDUs, such as text understanding, summarization, and question-answering. The Thai language, with its unique linguistic characteristics, requires a dedicated RS tree construction technique. This paper proposes an approach to Thai RS tree construction which consists of two major steps: EDU segmentation and RS tree construction. Two hidden Markov models derived from grammatical rules are used to segment EDUs, and a clustering technique with a similarity measure derived from Thai semantic rules is used to construct the Thai RS tree. The proposed technique is evaluated on three Thai corpora. The results show a Thai RS tree construction effectiveness of 94.90%.
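
A compact Viterbi decoder for the kind of boundary/no-boundary labelling an HMM-based EDU segmenter performs; the states, probabilities and cue classes below are invented placeholders rather than the paper's grammar-derived models:

    # Viterbi decoding for EDU boundary labelling; B = boundary after token, O = no boundary.
    STATES = ["O", "B"]
    START = {"O": 0.8, "B": 0.2}
    TRANS = {"O": {"O": 0.7, "B": 0.3}, "B": {"O": 0.9, "B": 0.1}}
    EMIT = {                      # P(observed cue class | state)
        "O": {"word": 0.8, "conj": 0.15, "punct": 0.05},
        "B": {"word": 0.3, "conj": 0.4, "punct": 0.3},
    }

    def viterbi(observations):
        path = {s: ([s], START[s] * EMIT[s][observations[0]]) for s in STATES}
        for obs in observations[1:]:
            path = {
                s: max(
                    ((prev_path + [s], prob * TRANS[prev][s] * EMIT[s][obs])
                     for prev, (prev_path, prob) in path.items()),
                    key=lambda item: item[1],
                )
                for s in STATES
            }
        return max(path.values(), key=lambda item: item[1])[0]

    print(viterbi(["word", "word", "conj", "word", "punct"]))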

Seemab Latif, Mary McGee Wood and Goran Nenadic. Improving Clustering of Noisy Documents through  Automatic Summarisation

In this paper we discuss clustering of students' textual answers in examinations to provide a grouping that will help with their marking. Since such answers may contain noisy sentences, automatic summarisation has been applied as a pre-processing technique. The summarised answers are then clustered based on their similarity, using the k-means and agglomerative clustering algorithms. We have evaluated the quality of the document clustering results when applied to full texts and to summarised texts. The analyses show that automatic summarisation has filtered out noisy sentences from the documents, making the resulting clusters more homogeneous, complete and coherent.
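
A small sketch of the clustering stage, assuming the answers have already been summarised (TF-IDF vectors plus k-means from scikit-learn; the real pipeline places an automatic summariser in front, and the answers below are invented):

    # Cluster (already summarised) student answers with TF-IDF + k-means.
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    summaries = [
        "photosynthesis converts light energy into chemical energy",
        "plants use light to make glucose from carbon dioxide and water",
        "mitosis divides one cell into two identical daughter cells",
        "cell division produces two daughter cells with the same chromosomes",
    ]

    vectors = TfidfVectorizer(stop_words="english").fit_transform(summaries)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
    for answer, label in zip(summaries, labels):
        print(label, answer[:50])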

Mutee U Rahman. Finite State Morphology and Sindhi Noun Inflections

Sindhi is a morphologically rich language. Its morphological constructions include inflections and derivations. Sindhi morphology becomes more complex due to primary and secondary word types, which are further divided into simple, complex and compound words. Sindhi nouns are marked for number, gender and case. Finite state transducers (FSTs) quite reasonably represent the inflectional morphology of Sindhi nouns. The paper investigates Sindhi noun inflection rules and defines equivalent computational rules to be used by FSTs; the corresponding FSTs are also given.
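
A toy finite-state transducer in the spirit of the paper; the states, feature tags and suffix strings are placeholders, not actual Sindhi endings:

    # Toy FST for noun inflection: arcs rewrite an abstract feature tag into a
    # surface ending (placeholder suffixes, not real Sindhi morphology).
    TRANSITIONS = {
        ("start", "stem"): ("inflect", ""),
        ("inflect", "+sg+masc+nom"): ("final", "-o"),
        ("inflect", "+pl+masc+nom"): ("final", "-a"),
        ("inflect", "+sg+fem+nom"): ("final", "-i"),
    }

    def transduce(stem, tag):
        state, output = "start", ""
        for symbol in ("stem", tag):
            state, emitted = TRANSITIONS[(state, symbol)]
            output += emitted
        return stem + output

    print(transduce("kitab", "+sg+fem+nom"))   # kitab-i (placeholder ending)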

Kranthi Reddy, Karun Kumar and Sai Krishna. Linking Named Entities to a Structured Knowledge Base

The task of entity linking aims at associating named entities with their corresponding entries in a knowledge base. The task is challenging because entities not only occur in various forms, viz. acronyms, nicknames, spelling variations, etc., but also occur in various contexts. To extract the various forms of an entity, we used the largest encyclopedia on the web, Wikipedia. In this paper, we model entity linking as an information retrieval problem. Our experiments using the TAC 2009 knowledge base population data set show that an information retrieval based approach fares slightly better than Naive Bayes and Maximum Entropy classifiers.
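
The retrieval formulation can be illustrated with a tiny inverted index from surface forms (as one might harvest from Wikipedia redirects and anchors) to knowledge-base entries; the entries, forms and scoring are illustrative assumptions:

    # Entity linking as retrieval over surface forms (illustrative data).
    KB = {
        "E1": {"name": "Michael Jordan (basketball)", "context": "nba chicago bulls guard"},
        "E2": {"name": "Michael I. Jordan (scientist)", "context": "machine learning berkeley professor"},
    }
    SURFACE_FORMS = {"michael jordan": ["E1", "E2"], "mj": ["E1"]}

    def link(mention, context):
        candidates = SURFACE_FORMS.get(mention.lower(), [])
        ctx = set(context.lower().split())
        # Rank candidates by context-word overlap with the KB entry (a crude IR score).
        scored = sorted(candidates,
                        key=lambda e: len(ctx & set(KB[e]["context"].split())),
                        reverse=True)
        return KB[scored[0]]["name"] if scored else None

    print(link("Michael Jordan", "a professor of machine learning at Berkeley"))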

Aminul Islam and Diana Inkpen. Near-Synonym Choice using a 5-gram Language Model

In this work, an unsupervised statistical method for the automatic choice of near-synonyms is presented and compared to the state of the art. We use a 5-gram language model built from the Google Web 1T data set. The proposed method works automatically, does not require any human-annotated knowledge resources (e.g., ontologies) and can be applied to different languages. Our evaluation experiments show that this method outperforms two previous methods on the same task. We also show that our proposed unsupervised method is comparable to a supervised method on the same task. This work is applicable to intelligent thesauri, machine translation, and natural language generation.
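
A sketch of the selection step: score each near-synonym by the frequency of the 5-gram it completes, with toy counts standing in for the Google Web 1T data:

    # Near-synonym choice: pick the candidate whose 5-gram has the highest count.
    # Counts below are invented placeholders for Google Web 1T frequencies.
    FIVE_GRAM_COUNTS = {
        ("a", "strong", "cup", "of", "coffee"): 1200,
        ("a", "powerful", "cup", "of", "coffee"): 15,
        ("a", "potent", "cup", "of", "coffee"): 8,
    }

    def choose(left_context, candidates, right_context):
        def score(cand):
            gram = tuple(left_context + [cand] + right_context)
            return FIVE_GRAM_COUNTS.get(gram, 0)
        return max(candidates, key=score)

    print(choose(["a"], ["strong", "powerful", "potent"], ["cup", "of", "coffee"]))  # strong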

Milos Jakubicek and Ales Horak. Punctuation Detection with Full Syntactic Parsing

The correct placement of punctuation characters is, in many languages, including Czech, driven by complex guidelines. Although these guidelines use morphological, syntactic and semantic information, state-of-the-art systems for punctuation detection and correction are limited to simple rule-based backbones. In this paper we present a syntax-based approach that utilizes the Czech parser synt. This parser uses an adapted chart parsing technique to build the chart structure for the sentence; synt can then process the chart and provide several kinds of output information. The implemented punctuation detection technique uses the synt output in the form of automatic and unambiguous extraction of optimal syntactic structures from the sentence (noun phrases, verb phrases, clauses, relative clauses or inserted clauses). Using this feature, it is possible to obtain information about the syntactic structures related to expected punctuation placement. We also present experiments showing that this method makes it possible to cover most syntactic phenomena needed for punctuation detection or correction.
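
A small illustration of how parser-extracted clause spans can drive comma placement; the clause span and the rule are simplified assumptions of this sketch, whereas the real system reads such structures from synt's chart output:

    # Place commas around a clause that a (hypothetical) parser marked as a
    # relative clause; the span data are invented.
    def insert_commas(tokens, clause_spans):
        out = list(tokens)
        for start, end in clause_spans:
            if start > 0 and not out[start - 1].endswith(","):
                out[start - 1] += ","          # comma opening the embedded clause
            if end < len(out) - 1 and not out[end].endswith(","):
                out[end] += ","                # comma closing it
        return " ".join(out)

    tokens = "Muž který přišel včera je můj bratr".split()
    print(insert_commas(tokens, [(1, 3)]))   # Muž, který přišel včera, je můj bratr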

Rejected papers

6 8 12 14 16 17 18 23 24 28 35 40 42 46 48 50 52 55 58 65 66 67 68 69 71 73 76 78 79 80 81 83 84 85 90 93 95 97 98 99 102 106 107 109 112 114 118 120 121 123 124 125 126 127 129 130 131 132 134 135 137 138 140 141 145 146 147 151 152 153 154 157 159 160 163 168 169 173 174 179 181 183 184 185 187 190 192 195 196 198 200 203 204 205 207 211 214 215 219 220 226 228 231 232 233 234 235 236 237 239 241 243 246 248 250 253 254 255 256 259 261 262 263 264 270 271 273 274 275 277 278 279 280 282 283 284 .