CICLing 2010 review results

Oral presentations

Short oral presentations + posters

Rejected papers

Notes:

• The titles and abstracts provided here are preliminary; they may change in the camera-ready version.

• The papers are listed in no particular order yet.

• We will contact the authors of short presentations (complementary proceedings / journal special issues) with more details about these journals.

Oral presentations

Publication in Springer LNCS

Pablo Gamallo and José Ramom Pichel. Automatic Generation of Bilingual Dictionaries using Intermediary Languages and Comparable Corpora
This paper outlines a strategy to build new bilingual dictionaries from existing resources. The method is based on two main tasks: first, a new set of bilingual correspondences is generated from two available bilingual dictionaries; second, the generated correspondences are validated using a bilingual lexicon automatically extracted from non-parallel, comparable corpora. The quality of the entries of the derived dictionary is very high, similar to that of hand-crafted dictionaries. We report a case study in which a new, noise-free English-Galician dictionary with about 12,000 correct bilingual correspondences was automatically generated.

Rodrigo Agerri and Anselmo Peñas. On the automatic generation of Intermediate Logic Forms for WordNet glosses
This paper presents an automatically generated Intermediate Logic Form of WordNet's glosses. Our proposed logic form includes neo-Davidsonian reification in a simple, flat syntax close to natural language. We offer a comparison with other semantic representations, such as those provided by Hobbs and Extended WordNet. Our Intermediate Logic Forms provide a representation suitable for semantic inference without the brittleness that characterizes approaches based on first-order logic and theorem proving. In its current form, the representation allows us to tackle semantic phenomena such as co-reference and pronominal anaphora resolution. The Intermediate Logic Forms are straightforwardly obtained from the output of a pipeline consisting of a part-of-speech tagger, a dependency parser and our own Intermediate Logic Form generator (all freely available tools). We apply the pipeline to the glosses of WordNet 3.0 to obtain a lexical resource ready to be used as a knowledge base or as a resource for a variety of tasks involving some kind of semantic inference.
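As a generic illustration of the neo-Davidsonian reification this kind of logic form builds on (the example below is illustrative, not taken from the paper's resource):

```
Gloss fragment:  "a dog barks loudly"
Reified form:    exists e, x . dog(x) & bark(e) & agent(e, x) & loud(e)
```

The event variable e lets modifiers such as "loudly" attach as separate flat predicates, which is what keeps the syntax close to natural language.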
We present a qualitative evaluation of the resource and discuss its possible application in Natural Language Understanding.

Hikaru Yokono and Manabu Okumura. Incorporating Cohesive Devices into Entity Grid Model in Evaluating Local Coherence of Japanese Text
This paper describes improvements made to the entity grid local coherence model for Japanese text. We investigate the effectiveness of taking into account cohesive devices, such as conjunctions, demonstrative pronouns and lexical cohesion, and of refining syntactic roles for the topic marker in Japanese. To take lexical cohesion into account, we consider semantic relations between entities using lexical chaining. Through experiments on discrimination, where the system has to select the more coherent sentence ordering, and on comparison of the system's ranking of automatically created summaries against human judgment based on quality questions, we show that these factors contribute to improving the performance of the entity grid model.

Hugo Hernault, Danushka Bollegala and Mitsuru Ishizuka. A Sequential Model for Discourse Segmentation
Identifying discourse relations in a text is essential for various tasks in Natural Language Processing, such as automatic text summarization, question-answering, and dialogue generation. The first step of this process is segmenting a text into elementary units. In this paper, we present a novel model of discourse segmentation based on sequential data labeling. Namely, we use Conditional Random Fields to train a discourse segmenter on the RST Discourse Treebank, using a set of lexical and syntactic features. Our system is compared to other statistical and rule-based segmenters, including one based on Support Vector Machines. Experimental results indicate that our sequential model outperforms current state-of-the-art discourse segmenters, with an F-score of 0.94. This performance level is close to the human agreement F-score of 0.98.

Samuel Chan, Lawrence Cheung and Mickey Chong.
A Machine Learning Parser Using an Unlexicalized Distituent Model
Despite the popularity of lexicalized parsing models, practical concerns such as data sparseness and applicability to domains with different vocabularies mean that unlexicalized models, which do not refer to word tokens themselves, deserve more attention. We have developed a classifier-based parser using an unlexicalized parsing model. Most importantly, to enhance accuracy, we investigated the notion of distituency (the possibility that two parts of speech cannot remain in the same constituent or phrase) and incorporated it as attributes using various statistical measures. A machine learning method integrates linguistic attributes and information-theoretic attributes in two tasks, namely sentence chunking and phrase recognition. The parser was applied to parsing English and Chinese sentences in the Penn Treebank and the Tsinghua Chinese Treebank. It achieved a parsing performance of F-score 80.3% in English and 82.4% in Chinese.

Ronald Winnemöller. Drive-by Language Identification - A Byproduct of Applied Prototype Semantics
While many effective and efficient language identification algorithms exist, most of them based on supervised n-gram or word dictionary methods, we propose a semi-supervised approach based on prototype semantics. Our method is primarily aimed at noise-rich environments with only very small text fragments to analyze and no training data available; it can even analyze the probable language affiliations of single words. We have integrated our prototype system into a larger web crawling and information management architecture and evaluated it against an experimental setup including datasets in 11 European languages.

Vidas Daudaravicius.
The Influence of Collocation Segmentation and Top 10 Items to Keyword Assignment Performance
Automatic document annotation from a controlled conceptual thesaurus is useful for establishing precise links between similar documents. This study presents a language-independent document annotation system based on features derived from a novel collocation segmentation method. Using the multilingual conceptual thesaurus EuroVoc, we evaluate filtered and unfiltered versions of the method, comparing it against other language-independent methods based on single words and bigrams. Testing our new method against the manually tagged multilingual corpus Acquis Communautaire 3.0 (AC) using all descriptors found there, we attain improvements in keyword assignment precision from 18 to 29 percent and in F-measure from 17.2 to 27.6 for 5 keywords assigned to a document. Further filtering out the top 10 frequent items improves precision by 4 percent, and collocation segmentation improves precision by 9 percent on average over the 21 languages tested.

Lieve Macken and Walter Daelemans. A chunk-driven bootstrapping approach to extracting translation patterns
We present a linguistically motivated sub-sentential alignment system that extends the intersected IBM Model 4 word alignments. The alignment system is chunk-driven and requires only shallow linguistic processing tools for the source and target languages, i.e. part-of-speech taggers and chunkers. We conceive the sub-sentential aligner as a cascaded model consisting of two phases. In the first phase, anchor chunks are linked based on the intersected word alignments and syntactic similarity. In the second phase, we use a bootstrapping approach to extract more complex translation patterns. The results show an overall AER reduction and competitive F-measures in comparison to the commonly used symmetrized IBM Model 4 predictions (intersection, union and grow-diag-final) on six different text types for English-Dutch.
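The intersection and union symmetrization baselines mentioned above reduce to simple set operations on the two directional word-alignment link sets; a toy illustration (the link sets are invented, not real Model 4 output):

```python
# Directional word alignments as sets of (source_pos, target_pos) links.
# The link sets below are illustrative only.
src2tgt = {(0, 0), (1, 1), (2, 3)}
tgt2src = {(0, 0), (1, 1), (1, 2)}

intersection = src2tgt & tgt2src  # high precision, lower recall
union = src2tgt | tgt2src         # high recall, lower precision

print(sorted(intersection))  # [(0, 0), (1, 1)]
print(sorted(union))         # [(0, 0), (1, 1), (1, 2), (2, 3)]
```

The grow-diag-final heuristic starts from the intersection and selectively adds links from the union, trading between these two extremes.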
In particular, in comparison with the intersected word alignments, the proposed method improves recall without sacrificing precision. Moreover, the system is able to align discontiguous chunks, which frequently occur in Dutch.

Diego Ingaramo, Marcelo Errecalde and Paolo Rosso. A general bio-inspired method to improve the short-text clustering task
Short-text clustering is a very important research field due to the current tendency for people to use very short documents, e.g. blogs, text messaging and others. In some recent works, new clustering algorithms have been proposed to deal with this difficult problem, and novel bio-inspired methods have reported the best results in this area. In this work, a general bio-inspired method based on the AntTree approach is proposed for this task. It takes as input the results obtained by arbitrary clustering algorithms and refines them in different stages. The proposal shows an interesting improvement in the results obtained with different algorithms on several short-text collections.

Azeddine Zidouni and Hervé Glotin. Named Entities Recognition in Transcribed Audio Broadcast News Documents
In this paper we propose an efficient model to perform named entity recognition (NER) using the hierarchical structure of entities. The NER task consists of identifying and classifying every word in a document into predefined categories such as person names, locations, organizations, and dates. Classical NER systems usually use generative approaches to learn models considering only word characteristics (word context). In this work we show that NER is also sensitive to syntactic and semantic contexts. For this reason, we introduce an extension of the conditional random fields (CRFs) approach that considers multiple contexts. We present an adaptation of the text approach to automatic speech recognition (ASR) outputs. Experimental results show that the proposed approach outperforms a simple application of CRFs.
Our method achieves a significant improvement of 17% in the slot error rate (SER) measure over an HMM-based method.

Roberto Basili, Danilo Croce, Cristina Giannone and Diego De Cao. Acquiring IE patterns through Distributional Lexical Semantic Models
Techniques for the automatic acquisition of Information Extraction patterns are still a crucial issue in knowledge engineering. A semi-supervised learning method based on large-scale linguistic resources, such as FrameNet and WordNet, is discussed. In particular, a robust method for assigning conceptual relations (i.e. roles) to relevant grammatical structures is defined according to distributional models of lexical semantics over a large-scale corpus. Experimental results show that the use of the resulting knowledge base achieves significant accuracy in a typical IE task (above 90%), similar to supervised semantic parsing methods. This confirms the impact of the proposed approach on the quality and development time of large-scale resources.

Antonio Juárez-González, Manuel Montes-y-Gómez, Luis Villaseñor-Pineda, David Pinto-Avendaño and Manuel Pérez-Coutiño. Selecting the N-Top Retrieval Result Lists for an Effective Data Fusion
Although the application of data fusion in information retrieval has yielded good results in the majority of cases, its achievement depends on the quality of the input result lists. In order to tackle this problem, in this paper we explore the combination of only the n-top result lists as an alternative to the fusion of all available data. In particular, we describe a heuristic measure based on redundancy and ranking information to evaluate the quality of each result list and, consequently, to select the presumably n best lists per query. Preliminary results on four data sets, considering a total of 266 queries and employing three different DF methods, are encouraging.
They indicate that the proposed approach can significantly outperform the results achieved by fusing all result lists, showing an average improvement of 11% in MAP scores.

Livio Robaldo and Jurij Di Carlo. Flexible disambiguation in DTS
In this paper, we present procedures to carry out incremental disambiguation in Dependency Tree Semantics (DTS), a constraint-based underspecified formalism. The present paper extends the research of Robaldo and Di Carlo (2009), who proposed an expressively complete version of DTS but did not show how disambiguation may be computationally achieved, i.e. how the several readings may be obtained or blocked starting from the fully underspecified one. We claim that the disambiguation process proposed here is flexible, in the sense that it is able to account for any kind of NL constraint on the available readings.

Rafał Jaworski. Computing transfer score in Example-Based Machine Translation
This paper presents an idea in Example-Based Machine Translation: computing a transfer score for each produced translation. When an EBMT system finds an example in the translation memory, it tries to modify it in order to produce the best possible translation of the input sentence. The user of the system, however, is unable to judge the quality of the translation. A solution to this problem is to provide the user with a percentage score for each translated sentence. Basing the transfer score computation on the similarity between the input sentence and the example alone is not sufficient: real-life examples show that the transfer process is equally likely to go well with a bad translation memory example and to fail with a good one. This paper therefore describes a method of computing the transfer score that is strictly associated with the transfer process: the transfer score is inversely proportional to the number of linguistic operations executed on the example target sentence. The paper ends with an evaluation of the suggested method.
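The transfer-score idea in the EBMT entry above (a percentage score inversely proportional to the number of linguistic operations applied to the example's target sentence) can be sketched as follows; the exact formula is an assumption for illustration, not the paper's:

```python
def transfer_score(num_operations: int) -> float:
    """Percentage score that decreases with each linguistic operation
    executed on the example target sentence (illustrative formula)."""
    return 100.0 / (1 + num_operations)

print(transfer_score(0))  # 100.0 -> example reused verbatim
print(transfer_score(3))  # 25.0  -> heavily modified example
```

Any monotonically decreasing function of the operation count would fit the description; the key design point is that the score reflects the transfer process itself rather than input-example similarity.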
Yulan Yan. Multi-View Bootstrapping for Relation Extraction by Exploring Web Features and Linguistic Features
Binary semantic relation extraction from Wikipedia is particularly useful for various NLP and Web applications. Currently, frequent-pattern-mining-based methods and syntactic-analysis-based methods are the two leading types of methods for the semantic relation extraction task. With a novel view on integrating linguistic analysis of Wikipedia text with redundancy information from the Web, we propose a multi-view learning approach for bootstrapping relationships between entities that exploits the complementarity of the Web view and the linguistic view. On the one hand, from the linguistic view, features are generated from linguistic parsing by abstracting away from the different surface realizations of semantic relations. On the other hand, features providing frequency information for relation extraction are extracted from the Web corpus. Experimental evaluation on a relational dataset demonstrates that linguistic analysis and Web collective information reveal different aspects of the nature of entity-related semantic relationships, and that our multi-view learning method considerably boosts performance compared to learning with only one view, with the weaknesses of one view complemented by the strengths of the other.

Michael Granitzer. Adaptive Term Weighting through Stochastic Optimization
Term weighting strongly influences the performance of text mining and information retrieval approaches. Usually term weighting is based on statistical estimates from static weighting schemes. Such static approaches lack the capability to generalize to different domains and data sets. In this paper, we introduce an online learning method for adapting term weights in a supervised manner. Via stochastic optimization we determine a linear transformation of the term space that optimizes the expected similarities among documents.
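A minimal sketch of this kind of supervised, stochastic adaptation of term weights: gradient-style updates increase the weights of terms shared by same-class document pairs (the data, objective and update rule below are toy assumptions, not the paper's actual transformation or loss):

```python
import random

# Toy documents as term-count dicts with class labels (all illustrative).
docs = [({"nlp": 2, "parser": 1}, "A"), ({"nlp": 1, "tagger": 1}, "A"),
        ({"soccer": 2, "goal": 1}, "B"), ({"goal": 2, "match": 1}, "B")]

weights = {t: 1.0 for d, _ in docs for t in d}   # uniform initial weights

def weighted_sim(d1, d2):
    """Inner product of two documents under the current term weights."""
    return sum(weights[t] * d1[t] * d2[t] for t in d1.keys() & d2.keys())

before = weighted_sim(docs[0][0], docs[1][0])

random.seed(0)
lr = 0.1
for _ in range(100):                  # stochastic pairwise updates
    (d1, c1), (d2, c2) = random.sample(docs, 2)
    sign = 1.0 if c1 == c2 else -1.0  # pull same-class pairs together
    for t in d1.keys() & d2.keys():   # only shared terms get an update
        weights[t] = max(0.0, weights[t] + lr * sign * d1[t] * d2[t])

after = weighted_sim(docs[0][0], docs[1][0])
print(after > before)  # same-class similarity grew under the new weights
```

In this toy run only same-class pairs share terms, so shared-term weights can only grow; a realistic setting would also push cross-class similarities down.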
We evaluate our approach on 18 standard text data sets and show that the performance improvement of a k-NN classifier ranges between 1% and 12% when adaptive term weighting is used as a preprocessing step. Further, we provide empirical evidence that using pairwise training examples is efficient in online learning settings.

Peggy Cellier, Thierry Charnois and Marc Plantevit. Sequential Patterns to Discover and Characterise Biological Relations
In this paper we present a method to automatically detect and characterise interactions between genes in the biomedical literature. Our approach is based on a combination of data mining techniques: sequential patterns filtered by linguistic constraints, and recursive mining. Unlike most Natural Language Processing (NLP) approaches, ours does not use syntactic parsing to learn and apply linguistic rules, and it does not require any resource except the training corpus to learn patterns. The process is in two steps. First, frequent sequential patterns are extracted from the training corpus. Second, after validation, those patterns are applied to the application corpus to detect and characterise new interactions. An advantage of our method is that interactions can be enhanced with modalities and biological information. We use two corpora containing only sentences with gene interactions as the training corpus. Another corpus, from PubMed abstracts, is used as the application corpus. Our evaluation shows that the precision of our approach is good and its recall adequate for both targets: interaction detection and interaction characterisation.

Junsheng Zhou, Yabing Zhang, Xinyu Dai and Jiajun Chen. Chinese Event Descriptive Clause Splitting with Structured SVMs
Chinese event descriptive clause splitting is a novel task in Chinese information processing. In this paper, we present the first Chinese clause splitting system with a discriminative approach.
By formulating the Chinese clause splitting task as a sequence labeling problem, we apply the structured SVM model to Chinese clause splitting. Compared with two baseline systems, our approach gives much better performance.

Victoria Bobicev, Victoria Maxim, Tatiana Prodan, Natalia Burciu and Victoria Angheluş. Emotions in words: developing a multilingual WordNet-Affect
In this paper we describe the process of creating the Russian and Romanian WordNet-Affect. WordNet-Affect is a lexical resource created on the basis of the Princeton WordNet which contains information about the emotions that words convey. It is organized around six basic emotions: anger, disgust, fear, joy, sadness and surprise. WordNet-Affect is a small lexical resource, but valuable for its affective annotation. We translated the WordNet-Affect synsets into Russian and Romanian and created an aligned English-Romanian-Russian lexical resource. The resource is freely available for research purposes.

Pepi Stavropoulou, Dimitris Spiliotopoulos and Georgios Kouroupetroglou. Integrating Contrast in a Framework for Predicting Prosody
Information Structure (IS) is known to bear a significant effect on prosody, making the identification of this effect crucial for improving the quality of synthetic speech. Recent theories identify contrast as a central IS element affecting accentuation. This paper presents the results of two experiments investigating the function of the different levels of contrast within the topic and focus of the utterance, and their effect on the prosody of Greek. Analysis showed that distinguishing between at least two contrast types is important for determining the appropriate accent type, and such a distinction should therefore be included in a description of the IS-prosody interaction. For this description to be useful for practical applications, a framework is required that makes this information accessible to the speech synthesizer.
This work reports on such a framework: a language-independent integration of all identified grammatical and syntactic prerequisites for creating a linguistically enriched input for speech synthesis.

Onur Gungor and Tunga Gungor. Morphological Annotation of a Corpus with a Collaborative Multiplayer Game
In most natural language processing tasks, state-of-the-art systems usually rely on machine learning methods to build their mathematical models. Given that the majority of these systems employ supervised learning strategies, a corpus annotated for the problem area is essential. The current method for annotating a corpus is to hire several experts and have them annotate the corpus manually or with helper software. However, this method is costly and time-consuming. In this paper, we propose a novel method that aims to solve these problems. By employing a collaborative multiplayer game playable by ordinary people on the Internet, it is possible to direct this covert labour force so that people contribute by just playing a fun game. Through a game site which incorporates some functionality inherited from social networking sites, people are motivated to contribute to the annotation process by answering questions about the underlying morphological features of a target word. The experiments show that 63.5% of the actual question types are successful, based on a two-phase evaluation.

Christian M. Meyer and Iryna Gurevych. Worth its Weight in Gold or Yet Another Resource --- A Comparative Study of Wiktionary, OpenThesaurus and GermaNet
In this paper, we analyze the topology and the content of a range of lexical semantic resources for the German language constructed either in a controlled (GermaNet), semi-controlled (OpenThesaurus), or collaborative, i.e. community-based, manner (Wiktionary). For the first time, the comparison of the corresponding resources is performed at the word sense level.
For this purpose, the word senses of terms are automatically disambiguated in Wiktionary and the content of all resources is converted to a uniform representation. We show that the resources' topologies are well comparable, as they share the small-world property and contain a comparable number of entries, although differences in their connectivity exist. Our study of content-related properties reveals that the German Wiktionary has a different distribution of word senses and contains more polysemous entries than both other resources. We find that each resource contains the highest number of a particular type of semantic relation. We finally increase the number of relations in Wiktionary by considering symmetric and inverse relations, which have been found to be usually absent in this resource.

Ramona Enache and Aarne Ranta. An open-source computational grammar for Romanian
We describe the implementation of a computational grammar for Romanian as a resource grammar in the GF (Grammatical Framework) project. Resource grammars are the basic constituents of the GF library. They consist of morphological and syntactic modules which implement the common abstract syntax, also describing the basic features of a language. A lexicon that provides the translation of basic words in the given language is also included for testing purposes. There are currently 15 resource grammars in GF, the Romanian one being the 14th. The present paper explores the main features of the Romanian grammar, along with the way they fit into the framework that GF provides. We also compare the implementation for Romanian with related resource grammars that already exist in the library. The current resource grammar allows generation and parsing of natural language and can be used in multilingual translation and other GF-related applications.
Covering a wide range of specific morphological and syntactic features of the Romanian language, this GF resource grammar is the most comprehensive open-source grammar existing so far for Romanian.

Christian Hänig. Unsupervised part-of-speech disambiguation for high frequency words and its influence on unsupervised parsing
Current unsupervised part-of-speech tagging algorithms build context vectors containing high-frequency words as features and cluster words into classes according to their context vectors. While part-of-speech disambiguation for mid- and low-frequency words is achieved by applying a Hidden Markov Model, no corresponding method is applied to high-frequency terms. Yet these are exactly the words essential for analyzing the syntactic dependencies of natural language. Thus, we introduce an approach employing unsupervised clustering of contexts to detect and separate a word's different syntactic roles. Experiments on German and English corpora show how this methodology addresses and solves some of the major problems of unsupervised part-of-speech tagging.

Alain-Pierre Manine, Erick Alphonse and Philippe Bessières. Extraction of Genic Interactions with the Recursive Logical Theory of an Ontology
We introduce an Information Extraction (IE) system which uses the logical theory of an ontology as a generalisation of the typical information extraction patterns to extract biological interactions from text. This provides inference capabilities beyond current approaches: first, our system is able to handle multiple relations; second, it handles dependencies between relations, deriving new relations from previously extracted ones and using inference at a semantic level; third, it addresses recursive or mutually recursive rules.
In this context, automatically acquiring the resources of an IE system becomes an ontology learning task: terms, synonyms, the conceptual hierarchy, the relational hierarchy, and the logical theory of the ontology have to be acquired. We focus on the last point, as learning the logical theory of an ontology, and a fortiori of a recursive one, remains a seldom studied problem. We validate our approach by using a relational learning algorithm, which handles recursion, to learn a recursive logical theory from a text corpus on the bacterium Bacillus subtilis. This theory achieves good recall and precision for the ten defined semantic relations, reaching a global recall of 67.7% and a precision of 75.5%; more importantly, it captures complex mutually recursive interactions which were implicitly encoded in the ontology.

Smaranda Muresan. Ontology-based Semantic Interpretation as Grammar Rule Constraints
We present an ontology-based semantic interpreter that can be linked to a grammar through grammar rule constraints, providing access to meaning during parsing and generation. In this approach, the parser takes natural language utterances as input and produces ontology-based semantic representations. We rely on a recently developed constraint-based grammar formalism, which balances expressiveness with practical learnability results. We show that even with a weak "ontological model", the semantic interpreter at the grammar rule level can help remove erroneous parses obtained when we do not have access to meaning.

Francisco Oliveira, Fai Wong and Iok-Sai Hong. Systematic Processing of Long Sentences in Rule based Portuguese-Chinese Machine Translation
Translation quality and parsing efficiency are often disappointing when rule-based machine translation systems deal with long sentences.
Due to the complicated syntactic structure of the language, many ambiguous parse trees can be generated during the translation process, and it is not easy to select the most suitable parse tree for generating the correct translation. This paper presents an approach to parse and translate long sentences efficiently in rule-based Portuguese-Chinese machine translation. A systematic approach that reduces the length of sentences based on patterns, clauses, conjunctions, and punctuation is used to improve the performance of the parsing analysis. In addition, a Constraint Synchronous Grammar is used to model both source and target languages simultaneously at the parsing stage to further reduce ambiguities and improve parsing efficiency.

Rada Mihalcea, Carlo Strapparava and Stephen Pulman. Computational Models for Incongruity Detection in Humor
Incongruity resolution is one of the most widely accepted theories of humour, suggesting that humour is due to the mixing of two disparate interpretation frames in one statement. In this paper, we explore several computational models for incongruity resolution. We introduce a new data set consisting of a series of set-ups, each followed by four possible coherent continuations, of which only one has a comic effect. Using this data set, we redefine the task as the automatic identification of the humorous punch line among all the plausible endings. We explore several measures of semantic relatedness, along with a number of joke-specific features, and try to understand their appropriateness as computational models for incongruity detection.

Dipankar Das and Sivaji Bandyopadhyay. Emotion Agent for Emotional Verbs - The Role of Subject and Syntax
In psychology and common use, emotion is an aspect of a person's mental state of being, normally based in or tied to the person's internal (physical) and external (social) sensory feeling.
Determining the emotion expressed in a text with respect to its reader or writer is itself a challenging issue. Extraction of the emotion holder, a human-like agent, is important for discriminating between emotions that are viewed from different perspectives. A wide range of Natural Language Processing (NLP) tasks use emotional information, from tracking users' emotions about products, events or politics as expressed in online forums or news, to customer relationship management. Determining the emotion agent in a text helps us to track and distinguish users' emotions separately. The present work aims to identify the emotion agent using a two-way approach. A baseline system is developed based on the subject information of the emotional sentences parsed using the Stanford dependency parser. The precision, recall and F-score values of the agent identification system are 63.21%, 66.54% and 64.83% respectively for the baseline approach. A second way to identify the emotion agent is based on the syntactic argument structure of the emotional verbs. Ekman's six types of emotional verbs are retrieved from the WordNet Affect Lists (WAL). A total of 4,112 emotional sentences for 942 emotional verbs of six emotion types have been extracted from the English VerbNet. The agent-related information specified in VerbNet, such as Experiencer, Agent, Theme and Beneficiary, is tagged in the correct positions of the syntactic frames of each sentence. All possible subcategorization frames and their corresponding syntax available in VerbNet are retrieved for each emotional verb. The head of each chunk is extracted from the dependency-parsed output. This chunk-level information helps in constructing the syntactic argument structure with respect to the key emotional verb. The acquired syntactic argument structure of each emotional verb is mapped to all of the possible syntax entries present for that verb in VerbNet.
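The frame-matching step just described (mapping a sentence's chunk-level argument structure onto a verb's VerbNet frames to locate the agent-like role) can be sketched with toy data; the frame inventory, role labels and sentence below are illustrative assumptions, not VerbNet content:

```python
# Toy VerbNet-style frames for one emotional verb: each frame pairs a
# syntactic slot sequence with thematic roles (all data illustrative).
frames = {
    "admire": [
        (("NP", "V", "NP"), ("Experiencer", None, "Theme")),
        (("NP", "V", "PP"), ("Experiencer", None, "Theme")),
    ],
}

def find_agent(verb, chunk_syntax, chunk_heads):
    """Match the sentence's chunk syntax against the verb's frames and
    return the head filling the agent-like slot, if any."""
    for syntax, roles in frames.get(verb, []):
        if tuple(chunk_syntax) == syntax:
            for head, role in zip(chunk_heads, roles):
                if role in ("Agent", "Experiencer"):
                    return head
    return None

# "John admires Mary": chunks NP-V-NP with heads John / admires / Mary.
print(find_agent("admire", ["NP", "V", "NP"],
                 ["John", "admires", "Mary"]))  # John
```

The real system matches against all retrieved subcategorization frames and uses dependency-parsed chunk heads rather than a hand-built structure.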
If the syntactic argument structure of a sentence with respect to its key verb matches any of the syntax entries extracted from VerbNet for that verb, the agent role associated with the VerbNet syntax is assigned as the agent tag in the appropriate position of the syntactic arguments. The precision, recall and F-score values of this unsupervised syntactic agent identification approach are 68.11%, 65.89% and 66.98%. It has been observed that the baseline model is unable to identify the emotion agent in sentences with passive senses. Although the recall value decreases in the syntactic model, it significantly outperforms the baseline.

Partha Pakray, Alexander Gelbukh and Sivaji Bandyopadhyay. A Syntactic Textual Entailment System Using Dependency Parser
The paper reports on the development of a syntactic textual entailment system that compares the dependency relations identified in both the text and the hypothesis sections. The Stanford dependency parser has been run on the 2-way RTE-3 development set. The dependency relations obtained for the hypothesis are compared with those obtained for the text. Some of the important comparisons carried out are: subject-subject compare (hypothesis and text subjects are compared), subject-verb compare (the subject along with its related verb), object-verb compare, and cross subject-verb compare (hypothesis subject with text object and hypothesis object with text subject). Each match found through the above comparisons is assigned a weight learnt from the development corpus. A threshold of 0.3 has been set on the fraction of matching hypothesis relations, based on the development set results that give optimal precision and recall values for both YES and NO entailments. The threshold score has been applied to the RTE-4 gold standard test set using the same dependency parsing and comparison methods.
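The threshold-based decision rule described in the entailment entry above (answer YES when the fraction of matched hypothesis dependency relations reaches 0.3) can be sketched as follows; the relation triples are toy data, and the unweighted overlap is a simplification of the paper's weighted comparisons:

```python
# Dependency relations as (relation, head, dependent) triples (toy data).
text_rels = {("nsubj", "bought", "John"), ("dobj", "bought", "car"),
             ("det", "car", "a")}
hyp_rels = {("nsubj", "bought", "John"), ("dobj", "bought", "car")}

THRESHOLD = 0.3  # set on the development data, as in the paper

def entails(text, hypothesis):
    """YES iff the fraction of hypothesis relations found in the text
    reaches the threshold (unweighted variant for illustration)."""
    matched = len(hypothesis & text)
    return "YES" if matched / len(hypothesis) >= THRESHOLD else "NO"

print(entails(text_rels, hyp_rels))  # YES (2/2 hypothesis relations matched)
```

The full system additionally weights each comparison type (subject-subject, subject-verb, and so on) before thresholding.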
Evaluation scores obtained on the test set show 54.78% precision and 49.2% recall for YES decisions and 53.9% precision and 59.4% recall for NO decisions.

Guillem Gascó Mora and Joan Andreu Sánchez Peiró. Syntax Augmented Inversion Transduction Grammars for Machine Translation In this paper we propose a novel method for inferring an Inversion Transduction Grammar (ITG) with source (or target) language linguistic information from a bilingual parallel corpus. Our method combines bilingual ITG parse trees with monolingual linguistic source (or target) trees in order to obtain a Syntax Augmented ITG (SAITG). The use of a modified parsing algorithm for bilingual parsing with bracketing information ensures that each bilingual subtree has a corresponding subtree in the monolingual parse. In addition, several binarization techniques have been tested for the resulting SAITG. In order to evaluate the effects of the use of SAITGs in Machine Translation tasks, we have used them in an ITG-based machine translation decoder. The decoder is a hybrid machine translation system that combines phrase-based models together with syntax-based translation models. The formalism that underlies the whole decoding process is a Chomsky Normal Form Stochastic Inversion Transduction Grammar (SITG) with phrasal productions and a log-linear combination of probability models. The decoder uses a CYK-like algorithm that combines the translated phrases inversely or directly in order to get a complete translation of the input sentence. The results obtained using SAITGs with the decoder for the IWSLT-08 Chinese-English machine translation task show significant improvements in BLEU and TER.

Cícero N. dos Santos, Ruy L. Milidiú, Carlos E. M. Crestana and Eraldo R. Fernandes. ETL Ensembles for Chunking, NER and SRL We present a new ensemble method that uses Entropy Guided Transformation Learning (ETL) as the base learner.
The proposed approach, ETL Committee, combines the main ideas of Bagging and Random Subspaces. We also propose a strategy to include redundancy in transformation-based models. To evaluate the effectiveness of the ensemble method, we apply it to three Natural Language Processing tasks: Text Chunking, Named Entity Recognition and Semantic Role Labeling. Our experimental findings indicate that ETL Committee significantly outperforms single ETL models, achieving competitive state-of-the-art results. Some positive characteristics of the proposed ensemble strategy are worth mentioning. First, it improves ETL effectiveness without any additional human effort. Second, it is particularly useful when dealing with very complex tasks that use large feature sets. And finally, the resulting training and classification processes are very easy to parallelize.

Sobha Lalitha Devi, Vijay Sundar Ram R, Bagyavathi T and Praveen Pralayankar. Syntactic Structure Transfer in a Tamil-Hindi MT System - A Hybrid Approach We describe Syntactic Structure Transfer (SST), a central design question in machine translation, between Tamil (source) and Hindi (target), two languages belonging to different language families, Dravidian and Indo-Aryan respectively. Tamil and Hindi differ extensively at the level of clausal construction, and transferring the structure is difficult. The SST described here is a hybrid approach in which we use CRFs for identifying the clause boundaries in the source language, Transformation Based Learning (TBL) for extracting the rules, and semantic classification of postpositions for choosing the correct structure in constructions where there is a one-to-many mapping into the target language. We have evaluated the system using web data and the results are encouraging.

Vadlapudi Ravikiran and Rahul Katragadda.
Quantitative Evaluation of Grammaticality of Summaries Automated evaluation is crucial in the context of automated text summaries, as is the case with the evaluation of any of the language technologies. While the quality of a summary is determined by both the content and the form of a summary, throughout the literature there has been extensive study of the automatic and semi-automatic evaluation of the content of summaries, and most such applications have been largely successful. What is lacking is a careful investigation of the automated evaluation of the readability aspects of a summary. In this work we dissect readability into five parameters and try to automate the evaluation of the grammaticality of text summaries. We use surface-level methods like n-grams and longest common subsequence (LCS) over POS-tag sequences and chunk-tag sequences to capture acceptable grammatical constructions, and these approaches have produced impressive results. Our results show that it is possible to use relatively shallow features to quantify the degree of acceptance of grammaticality.

Jan De Belder and Sien Moens. Sentence Compression for Dutch using Integer Linear Programming Sentence compression is a valuable task in the framework of text summarization. In this paper we compress sentences from news articles of major Belgian newspapers written in Dutch using an integer linear programming approach. We rely on the Alpino parser available for Dutch and on the Latent Words Language Model. We demonstrate that the integer linear programming approach yields good results for compressing Dutch sentences, despite the large freedom in word order.

Alberto Barrón-Cedeño, Chiara Basile, Mirko Degli Esposti and Paolo Rosso. Word Length n-grams for Text Re-Use Detection The automatic detection of shared content in written documents (which includes text re-use and, when unacknowledged, plagiarism) has become an important problem in Information Retrieval.
This task requires exhaustive comparison of texts in order to determine how similar they are. However, such comparison is infeasible when the number of documents is too high. Therefore, we have designed a model for the proper pre-selection of closely related documents, on which the exhaustive comparison can then be performed. We use a similarity measure based on word-level n-grams, which has proved quite effective in many applications; this approach, however, normally becomes impracticable for real-world large datasets. As a result, we propose a method based on a preliminary word-length encoding of texts, substituting each word by its length, which provides three important advantages: (i) since the alphabet of the documents is reduced to nine symbols, the space needed to store n-gram lists is reduced; (ii) computation times are decreased; and (iii) length n-grams can be represented in a prefix tree (trie), allowing more flexible and fast comparison. We show experimentally, on the basis of the perplexity measure, that the noise introduced by the length encoding does not significantly decrease the expressiveness of the text. The method is then tested on two large datasets of co-derivatives and simulated plagiarism.

Roberto Basili and Paolo Annesi. Cross-lingual Alignment of Framenet Annotations through Hidden Markov Models The development of annotated resources in the area of frame semantics has been crucial to the development of robust systems for shallow semantic parsing. Resource-poor languages have shown a significant delay due to the lack of sufficient training data. Recent works have proposed exploiting parallel corpora in order to automatically transfer the semantic information available for English to other target languages. In this paper, an approach based on Hidden Markov Models is proposed to support this automatic semantic transfer and use an aligned bilingual corpus to develop large-scale annotated data sets.
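The word-length encoding in the Barrón-Cedeño et al. abstract above can be sketched as follows (a minimal illustration, not the authors' implementation; capping lengths at 9 is one plausible way to keep a nine-symbol alphabet, and the exact encoding may differ):

```python
# Encode each word by its length (capped at 9), then draw n-grams from
# the resulting digit sequence. Comparing these short digit n-grams is
# far cheaper than comparing word-level n-grams.

def length_encode(text, cap=9):
    """Replace every word with its length, capped so the alphabet has `cap` symbols."""
    return [min(len(w), cap) for w in text.split()]

def length_ngrams(text, n=3):
    """All overlapping length n-grams of a text."""
    codes = length_encode(text)
    return [tuple(codes[i:i + n]) for i in range(len(codes) - n + 1)]

print(length_ngrams("the quick brown fox jumps"))
# -> [(3, 5, 5), (5, 5, 3), (5, 3, 5)]
```

Because the digit sequences share a tiny alphabet, the n-gram lists can also be stored compactly in a trie, as the abstract notes.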
As this method relies only on the lexical alignment of sentence pairs, it is robust against preprocessing errors and does not require complex optimization, unlike syntax-dependent models for accurate cross-lingual mapping. The experimental evaluation on an English-Italian corpus is successful, achieving 86% accuracy on average and improving on the state-of-the-art methods for the same task.

Hui Shi, Robert J. Ross, Thora Tenbrink and John Bateman. Modelling Illocutionary Structure: Combining Empirical Studies with Formal Model Analysis In this paper we revisit the topic of dialogue grammars at the illocutionary force level and present an approach to the formal modelling, evaluation and comparison of these models based on recursive transition networks. Through the use of appropriate tools, such finite-state grammars can be formally analysed and validated against empirically collected corpora. We illustrate our approach through: (a) the construction of human-human dialogue grammars on the basis of recently collected natural language dialogues in joint-task situations; and (b) the evaluation and comparison of these dialogue grammars using formal methods. This work provides a new basis for developing and evaluating dialogue grammars and for engineering corpus-tailored dialogue managers that can be verified for adequacy.

Marcin Junczys-Dowmunt. A Maximum Entropy Approach to Syntactic Translation Rule Filtering In this paper we present a maximum entropy filter for the translation rules of a statistical machine translation system based on tree transducers. This filter can be successfully applied to reduce the number of translation rules by more than 70% without negatively affecting translation quality as measured by BLEU. For some filter configurations translation quality is even improved.
Our investigations include a discussion of the relationship of Alignment Error Rate and Consistent Translation Rule Score with translation quality in the context of Syntactic Statistical Machine Translation.

Gabriela Ramírez-de-la-Rosa, Manuel Montes-y-Gómez and Luis Villaseñor-Pineda. Enhancing Text Classification by Information Embedded in the Test Set Current text classification methods are mostly based on a supervised approach, which requires a large number of examples to build accurate models. Unfortunately, in several tasks training sets are extremely small and their generation is very expensive. In order to tackle this problem, in this paper we propose a new text classification method that takes advantage of the information embedded in the test set itself. This method is supported by the idea that similar documents should belong to the same category. In particular, it classifies documents by considering not only their own content but also the categories assigned to other similar documents from the same test set. Experimental results on four data sets of different sizes are encouraging. They indicate that the proposed method is appropriate for use with small training sets, where it can significantly outperform traditional approaches such as Naive Bayes and Support Vector Machines.

Filip Graliński. Mining Parenthetical Translations for Polish-English Lexica Documents written in languages other than English sometimes include parenthetical English translations, usually of technical and scientific terminology. Techniques have been developed for extracting such translations (as well as transliterations) from large Chinese text corpora. This paper presents methods for mining parenthetical translations in Polish texts. The main difference between translation mining in Chinese and in Polish is that the latter uses the Latin alphabet, which makes it more difficult to identify English translations in Polish texts.
On the other hand, some parenthetical translated terms are preceded by the abbreviation "ang." (= English), a kind of "anchor" that makes it possible to query a Web search engine for such translations.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor and Ruslan Mitkov. Identification of Translational Sublanguage: A Machine Learning Approach This paper presents a machine learning approach to the study of translational sublanguage. The goal is to train a computer system to distinguish between translated and non-translated text, in order to determine the characteristic features that influence the classifiers. Several algorithms reach up to a 97.62% success rate on a technical dataset. Moreover, the SVM classifier consistently reports a statistically significant improvement in accuracy when the learning system benefits from the addition of simplification features to the basic translational classifier system. These findings may therefore be considered an argument for the existence of the Simplification Universal.

Cristian Grozea and Marius Popescu. Who's the Thief? Determining the Direction of Plagiarism Determining the direction of plagiarism (who plagiarized whom in a given pair of documents) is one of the most interesting problems in the field of automatic plagiarism detection. We present here an approach using an extension of the Encoplot method, which won the 1st International Competition on Plagiarism Detection in 2009. We have tested it on a very large corpus of artificial plagiarism, with good results.

Eric SanJuan and Fidelia Ibekwe-SanJuan. Multi Word Term queries for focused Information Retrieval We address both standard and focused retrieval tasks based on comprehensible language models and interactive query expansion (IQE). Query topics are expanded using an initial set of Multi Word Terms (MWTs) selected from the top n ranked documents. MWTs are special text units that represent domain concepts and objects.
As such, they can better represent query topics than ordinary phrases or n-grams. We tested different query representations: bag-of-words, phrases, flat lists of MWTs, and subsets of MWTs. The experiments were carried out on two benchmarks: the TREC Enterprise track (TRECent) 2007 and 2008 collections, and the INEX 2008 Ad-hoc track using the Wikipedia collection.

Florian Holz and Sven Teresniak. Towards Automatic Detection and Tracking of Topic Change We present an approach for the automatic detection of topic change. Our approach is based on the analysis of statistical features of topics in time-sliced corpora and their dynamics over time. Processing large amounts of time-annotated news text, we identify new facets regarding a stream of topics consisting of the latest news of public interest. Adaptable as an addition to the well-known task of topic detection and tracking, we aim to boil down a daily news stream to its novelty. For that we examine the contextual shift of concepts over time slices. To quantify the amount of change, we adopt the volatility measure from econometrics and propose a new algorithm for the frequency-independent detection of topic drift and change of meaning. The proposed measure does not rely on plain word frequency but on the mixture of the co-occurrences of words. Thus, the analysis is largely independent of absolute word frequencies and works over the whole frequency spectrum; in particular, it also works well for low-frequency words. Aggregating the computed time-related data of the terms allows building overview illustrations of the most evolving terms for a whole time span.

Stefan Trausan-Matu and Traian Rebedea. A Polyphonic Model and System for Inter-Animation Analysis in Chat Conversations with Multiple Participants Discourse in instant messenger conversations (chats) with multiple participants is often composed of several intertwining threads.
Some chat environments for Computer-Supported Collaborative Learning (CSCL) support and encourage the existence of parallel threads by providing explicit referencing facilities. The paper presents a discourse model for such chats, based on Mikhail Bakhtin's dialogic theory. It considers that multiple voices (which are not limited to the participants) inter-animate, sometimes in a polyphonic, contrapuntal way. An implemented system is also presented, which analyzes such chat logs to detect additional, implicit links among utterances and threads and, more importantly for CSCL, to detect the involvement (inter-animation) of the participants in problem solving. The system begins with an NLP pipeline and concludes with inter-animation identification in order to generate feedback and to propose grades for the learners.

Chaitanya Vempaty and Vadlapudi Ravikiran. Issues in analyzing Telugu Sentences towards building a Telugu Treebank This paper describes an effort towards building a Telugu Dependency Treebank. We discuss the basic framework and the issues we encountered during annotation. 1,487 sentences have been annotated in the Paninian framework. We also discuss how some of the annotation decisions would affect the development of a parser for Telugu.

Liviu P. Dinu and Andrei Rusu. Rank Distance Aggregation as a Fixed Classifier Combining Rule for Text Categorization In this paper we show that Rank Distance Aggregation can improve ensemble classifier precision in the classical text categorization task by presenting a series of experiments on a 20-class newsgroup corpus with a single correct class per document. We aggregate four established document classification methods (TF-IDF, Probabilistic Indexing, Naive Bayes and KNN) in different training scenarios, and compare these results to widely used fixed combining rules such as Voting, Min, Max, Sum, Product and Median.

Tomoya Iwakura.
A Named Entity Extraction using Word Information Repeatedly Collected from Unlabeled Data This paper proposes a method for Named Entity (NE) extraction that uses NE-related labels of words collected repeatedly from unlabeled data. The NE-related labels of a word are its candidate NE classes, the NE classes of its co-occurring words, and so on. We collect these labels from NE extraction results on unlabeled data, as follows. First, we collect NE-related labels of words from the NE extraction results produced on unlabeled data by an NE extractor. Then we create a new NE extractor that uses the NE-related labels of each word as new additional features. We use the new NE extractor to collect new NE-related labels of words. We evaluate our method using the IREX data set for Japanese NE extraction. Experimental results show that our method contributes to improved accuracy.

Izaskun Aldezabal, Maria Jesus Aranzabe, Arantza Diaz de Ilarraza, Ainara Estarrona and Larraitz Uria. EusPropBank: Integrating Semantic Information in the Basque Dependency Treebank This paper deals with theoretical problems found in the work being carried out on annotating semantic roles in the Basque Dependency Treebank (BDT). We present the resources used and the way the annotation is being done. Following the model proposed in the PropBank project, we show the problems found in the annotation process and the decisions we have taken. The representation of the semantic tag has been established and detailed guidelines for the annotation process have been defined, although this is a task that needs continuous updating. In addition, we have adapted AbarHitz, a tool used in the construction of the BDT, to this task.

Xiangdong An. The Optimal IR in Genomics: How Far Away? There exists a gap between what a human user wants in his mind and what he can get from information retrieval (IR) systems with his queries.
We say an IR system is perfect if it can always provide users with what they want in their minds, when available, and optimal if it can present what it finds to the users in an optimal way. In this paper, based on some assumptions, we empirically study how far we still are from optimal IR and perfect IR in genomics. This study gives us a partial perspective on where we are along the IR development path. It also provides the lowest upper bound on the IR performance improvement achievable by reranking.

Siddhartha Jonnalagadda, Robert Leaman, Graciela Gonzalez and Trevor Cohen. A Distributional Semantics Approach to Simultaneous Recognition of Multiple Classes of Named Entities Named Entity Recognition and Classification has been studied for the last two decades, with most current tools requiring a lot of time for training. Since semantic features take a huge amount of training time and are slow at inference, existing tools apply features and rules mainly at the word level or use lexicons. Recent advances in distributional semantics allow us to efficiently create paradigmatic models that encode word order. We used Sahlgren's Random Indexing based model to create an elegant, scalable, efficient and accurate system that simultaneously recognizes multiple entity types mentioned in natural language; it is validated on the GENIA corpus, which has annotations for 46 biomedical entity types and supports nested entities. Using only straightforward distributional semantics features, it achieves an overall micro-averaged F-measure of 67.3% based on fragmental matching, with performance ranging from 7.4% for "DNA substructure" to 80.7% for "Bioentity".

George Tsatsaronis, Iraklis Varlamis and Kjetil Nørvåg.
An Experimental Study on Unsupervised Graph-based Word Sense Disambiguation Recent research on unsupervised word sense disambiguation reports an increase in performance, which reduces the gap from the respective supervised approaches to the same task. Among the latest state-of-the-art methods, those that use semantic graphs have reported the best results. Such methods create a graph comprising the words to be disambiguated and their corresponding candidate senses. The graph is expanded by adding semantic edges and nodes from a thesaurus. The selection of the most appropriate sense per word occurrence is then made through graph processing algorithms that assign a degree of importance to the graph vertices. In this paper we experimentally investigate the performance of such methods. We additionally evaluate a new method based on a recently introduced algorithm for computing similarity between graph vertices, P-Rank. We evaluate the performance of all alternatives on two benchmark data sets, Senseval 2 and 3, using WordNet. The study shows the differences in the performance of each method when applied to the same semantic graph representation, and analyzes the pros and cons of each method for each part of speech separately. Furthermore, it analyzes the levels of inter-agreement at the sense selection level, giving further insight into how these methods could be employed in an unsupervised ensemble for word sense disambiguation.

Saeed Raheel and Joseph Dichy. An Empirical Study on the Effect of Feature Type on the Automatic Classification of Arabic Documents The Arabic language is a highly inflectional and morphologically very rich language. It presents serious challenges to the automatic classification of documents, one of which is determining what type of attribute to use in order to obtain optimal classification results.
Some researchers use roots or lemmas, which, they argue, handle inflectional phenomena that do not appear in this fashion in other languages. Others prefer to use character-level n-grams, since n-grams are simpler to implement, language independent, and produce satisfactory results. So which of these two approaches is better, if either? This paper tries to answer this question by offering a comparative study of four feature types (words in their original form, lemmas, roots, and character-level n-grams) and shows how each affects the performance of the classifier. We used and compared the performance of the Support Vector Machines and Naïve Bayesian Networks algorithms.

Slim Mesfar. Towards a cascade of morpho-syntactic tools for Arabic Natural Language Processing This paper presents a cascade of morpho-syntactic tools for Arabic natural language processing. It begins with the description of a large-coverage formalization of the Arabic lexicon. The resulting electronic dictionary, named "El-DicAr" ("Electronic Dictionary for Arabic"), links inflectional, morphological, and syntactic-semantic information to the list of lemmas. Automated inflectional and derivational routines are applied to each lemma, producing over 3 million inflected forms. El-DicAr represents the linguistic engine for the automatic analyzer, built through a lexical analysis module and a cascade of morpho-syntactic tools including: a morphological analyzer, a spell-checker, a named entity recognition tool, an automatic annotator, and tools for linguistic research and contextual exploration. The morphological analyzer identifies the component morphemes of agglutinative forms using large-coverage morphological grammars. The spell-checker corrects the most frequent typographical errors. The lexical analysis module handles the different vocalization conventions in Arabic written texts.
Finally, the named entity recognition tool is based on a combination of the morphological analysis results and a set of rules represented as local grammars.

Victor Bocharov, Lidia Pivovarova, Valery Sh. Rubashkin and Boris Chuprin. Ontological Parsing of Encyclopedia Information Semi-automatic ontology learning from an encyclopedia is presented, with the primary focus on the syntactic and semantic analysis of definitions.

Raquel Justo, Alicia Pérez, M. Inés Torres and Francisco Casacuberta. Hierarchical finite-state models for speech translation using categorization of phrases In this work a hierarchical translation model is formally defined and integrated into a speech translation system. As is well known, the relations between two languages are better arranged in terms of phrases than in terms of running words. Nevertheless, phrase-based models may suffer from data sparsity at training time. The aim of this work is to improve current speech translation systems by integrating categorization within the translation model. The categories are sets of phrases, the latter being either linguistically or statistically motivated. The category, translation, and acoustic models are all finite-state models, for which efficient algorithms exist, keeping the temporal cost low. Regarding the spatial cost, all the models were integrated on-the-fly at decoding time, allowing an efficient use of memory.

Rahul Katragadda. GEMS: Generative Modeling for Evaluation of Summaries In this paper we argue for the need for alternative automated summarization evaluation systems for both content and readability. In the context of the TAC Automatically Evaluating Summaries Of Peers (AESOP) task, we describe the problem with content evaluation metrics. We model the problem as an information ordering problem; our approach (and indeed others) should now be able to rank systems (and possibly human summarizers) in the same order as human evaluation would have produced.
We show how a well-known generative model can be used to create automated evaluation systems comparable to the state of the art. Our method is based on a multinomial distribution of key terms (or signature terms) in document collections, and on how they are captured in peers. We have used two types of signature terms to model the evaluation metrics: the first is based on the POS tags of important terms in a model summary, and the second on how much information the reference summaries share among themselves. Our results show that verbs and nouns are key contributors to our best run, which depended on various individual features. Another important observation is that all the metrics were consistent, producing similar results for both cluster A and cluster B in the context of update summaries. The most striking result is that, in comparison with the automated evaluation metrics currently in use (ROUGE, Basic Elements), our approach has been very good at capturing "overall responsiveness" apart from pyramid-based manual scores.

Meghana Marathe and Graeme Hirst. Lexical Chains using Distributional Measures of Concept Distance In practice, lexical chains are typically built using term reiteration or resource-based measures of semantic distance. The former approach misses out on a significant portion of the inherent semantic information in a text, while the latter suffers from the limitations of the linguistic resource it depends upon. In this paper, chains are constructed using the framework of distributional measures of concept distance, which combines the advantages of resource-based and distributional measures of semantic distance. These chains were evaluated by applying them to the task of text segmentation, where they performed as well as or better than state-of-the-art methods.

Silviu Cucerzan.
A Case Study of Using Web Search Engines in NLP: Case Restoration Web search engines provide a large amount of information about language in general, and about its current use in particular. In this paper, we investigate the use of Web search engine statistics for the task of case restoration. Because most engines are case insensitive, an approach based on search hit counts, as employed in previous work on natural language ambiguity resolution, is not applicable to this task. Instead, we investigate the use of statistics computed from the snippets generated by a Web search engine. We note that the top few results (one to ten) returned by a search engine may not be the most representative for modeling phenomena in a language. Finally, we examine the Web coverage of n-grams in three different languages (English, German, and Spanish).
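As an illustration of the snippet-statistics idea in the last abstract, here is a minimal sketch (the function name and the simple majority-vote rule are assumptions of ours, not the paper's method):

```python
from collections import Counter

# Given case-sensitive snippet text returned by a (case-insensitive)
# search engine, restore a token's case by voting over the casings
# observed in the snippets.

def restore_case(token, snippets):
    """Pick the most frequent casing of `token` seen in the snippets."""
    variants = Counter(w for s in snippets for w in s.split()
                       if w.lower() == token.lower())
    return variants.most_common(1)[0][0] if variants else token

snippets = ["Berlin is the capital of Germany",
            "visit Berlin in winter",
            "berlin nightlife guide"]
print(restore_case("berlin", snippets))  # -> "Berlin"
```

A real system would also have to handle punctuation, sentence-initial capitalization in the snippets, and multi-word names, which this toy function ignores.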

Short oral presentation + poster

publication in complementary proceedings
(special issues of RCS and IJCLA journals)