CICLing 2014 Accepted Papers with Abstracts

Dat Huynh, Dat Tran and Wanli Ma. Semantic Similarity Measure Using Relational and Latent Topic Features
Abstract: Computing the semantic similarity between words is one of the key challenges in many language-based applications. Previous work tends to use the contextual information of words to disclose the degree of their similarity. In this paper, we consider the relationships between words in local contexts as well as the latent topic information of words to propose a new distributed representation of words for semantic similarity measurement. The method models the meaning of a word as a high-dimensional Vector Space Model (VSM) that combines relational features from the word's local contexts with its latent topic features in the global sense. Our experimental results on popular semantic similarity datasets show a significant improvement in correlation with human judgements in comparison with other methods that use purely plain text.
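As a rough illustration of this kind of measure (a sketch with made-up feature vectors, not the authors' implementation), a word's relational-context block and topic block can be concatenated into one vector, word pairs scored by cosine similarity, and the model scores correlated with human ratings:

    # Sketch only: combine a word's local relational-context features with its
    # global topic features, score word pairs by cosine similarity, and check
    # the Spearman correlation against human similarity judgements.
    import numpy as np
    from scipy.stats import spearmanr

    def combined_vector(relational_feats, topic_feats, alpha=0.5):
        """Concatenate L2-normalised feature blocks, weighting the topic block."""
        r = relational_feats / (np.linalg.norm(relational_feats) + 1e-12)
        t = topic_feats / (np.linalg.norm(topic_feats) + 1e-12)
        return np.concatenate([(1 - alpha) * r, alpha * t])

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    # toy example: three word pairs scored by the model vs. hypothetical human ratings
    w1 = combined_vector(np.array([1.0, 0.0, 2.0]), np.array([0.3, 0.7]))
    w2 = combined_vector(np.array([0.9, 0.1, 1.8]), np.array([0.4, 0.6]))
    w3 = combined_vector(np.array([0.0, 3.0, 0.1]), np.array([0.9, 0.1]))
    model_scores = [cosine(w1, w2), cosine(w1, w3), cosine(w2, w3)]
    human_scores = [9.1, 2.3, 2.5]
    rho, _ = spearmanr(model_scores, human_scores)   # rank correlation with humans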
Alexey An, Bakytkan Dauletbakov and Eugene Levner. Multi-attribute classification of text documents as a tool for ranking and categorization of educational innovation projects
Abstract: We suggest a semi-automatic text processing method for ranking and categorization of educational innovation projects (EIP). An EIP is a nation-wide program for the strategic development of a university or a group of academic institutions which includes the following features: (1) preparing students for their future careers by developing educational programs corresponding to standards of the best world universities; (2) infusing innovative information technology (IT) solutions throughout teaching and training, such as e-learning methods, computer simulation, web-based modules, etc.; (3) promoting pedagogical tools and IT means for continuous lifelong learning and education; (4) transferring new knowledge into practice; and (5) enhancing permanent active links of academia with industries and public organizations. Outcome-based quantitative evaluation of each of the five innovative features above is an integral ingredient of the innovation projects. Our approach to EIP evaluation is based on a multi-dimensional ranking system that uses quantitative indicators for the three main missions of higher education institutions, namely education, research, and knowledge transfer. We provide a description and classification of these indicators. The main part of this paper is devoted to the design of a semi-automatic method for ranking the EIPs exploiting multi-attribute text document classification. The methodology uses the generalized Borda ranking method.
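A minimal sketch of the kind of Borda aggregation mentioned above (project names, indicator rankings and weights are invented; the paper's generalized variant may differ):

    # Sketch of a (weighted) Borda count: each indicator contributes a ranking
    # of the projects, and a project earns more points the higher it is ranked.
    from collections import defaultdict

    def borda_rank(rankings, weights=None):
        """rankings: list of lists, each an ordering of project ids (best first)."""
        weights = weights or [1.0] * len(rankings)
        scores = defaultdict(float)
        for ranking, w in zip(rankings, weights):
            n = len(ranking)
            for pos, project in enumerate(ranking):
                scores[project] += w * (n - pos)   # Borda points: n, n-1, ..., 1
        return sorted(scores, key=scores.get, reverse=True)

    education = ["P2", "P1", "P3"]   # ranking by the education indicator
    research  = ["P1", "P2", "P3"]   # ranking by the research indicator
    transfer  = ["P1", "P3", "P2"]   # ranking by the knowledge-transfer indicator
    print(borda_rank([education, research, transfer]))   # -> ['P1', 'P2', 'P3']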
Su Fei, Gang Chen and Xinyan Xiao. Beam-Width Adaptation for Hierarchical Phrase-Based Translation
Abstract: In terms of translation quality, the hierarchical phrase-based translation model (Hiero) has shown state-of-the-art performance in various translation tasks. However, the slow decoding speed of Hiero hinders its effective deployment in online scenarios.

In this paper, we propose beam-width adaptation strategies to speed up Hiero decoding. We learn maximum entropy models to evaluate the quality of each span and then predict the optimal beam-width for it. The empirical studies on Chinese-to-English translation tasks show that, even in comparison with a competitive baseline which employs well designed cube pruning, our approaches still double the decoding speed without compromising translation quality. The approaches have already been applied to an online commercial translation system.
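A hedged sketch of the span-level idea (the features, beam widths and the logistic-regression stand-in for the maximum entropy model are all assumptions, not the authors' setup): a classifier is trained to map simple span features to a beam-width bucket, so easy spans get narrow beams while hard spans keep wide ones.

    # Multinomial logistic regression used as a maximum-entropy-style classifier
    # over invented span features; beam widths are treated as class labels.
    from sklearn.linear_model import LogisticRegression

    X = [[2, 0.1], [7, 0.8], [3, 0.2], [9, 0.9]]   # e.g. [span length, rule ambiguity]
    y = [10, 50, 10, 50]                           # optimal beam widths as classes
    model = LogisticRegression(max_iter=1000).fit(X, y)

    def beam_width_for(span_features):
        return int(model.predict([span_features])[0])

    print(beam_width_for([4, 0.3]))   # a narrow beam expected for an easy-looking span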
Sebastian Schmidt, Steffen Schnitzer and Christoph Rensing. Effective Classification of Ambiguous Web Documents Incorporating Human Feedback Efficiently
Abstract: Classification of text remains a challenge. Most machine learning based approaches require many manually annotated training instances for a reasonable accuracy. In this article we present an approach that minimizes the human annotation effort by interactively incorporating human annotators into the training process via active learning of an ensemble learner. By passing only ambiguous instances to the human annotators the effort is reduced while maintaining a very good accuracy. Since the feedback is only used to train an additional classifier and not for re-training the whole ensemble, the computational complexity is kept relatively low.
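A small query-by-committee-style sketch of how ambiguous instances might be selected (an assumed setup, not the authors' system): the ensemble votes on unlabelled documents and only the most disputed ones are passed to the human annotators, whose labels then train a separate additional classifier.

    # Select the unlabelled documents the ensemble disagrees about most,
    # using vote entropy as the disagreement measure.
    import numpy as np

    def vote_entropy(vote_matrix):
        """vote_matrix: (n_models, n_samples) array of predicted class ids."""
        n_models, n_samples = vote_matrix.shape
        entropies = np.zeros(n_samples)
        for j in range(n_samples):
            _, counts = np.unique(vote_matrix[:, j], return_counts=True)
            p = counts / n_models
            entropies[j] = -np.sum(p * np.log(p + 1e-12))
        return entropies

    def select_ambiguous(vote_matrix, k=5):
        """Indices of the k samples with the highest ensemble disagreement."""
        return np.argsort(-vote_entropy(vote_matrix))[:k]

    votes = np.array([[0, 1, 1, 0],
                      [0, 0, 1, 1],
                      [0, 1, 0, 1]])      # 3 models, 4 unlabelled documents
    print(select_ambiguous(votes, k=2))   # the two most disputed documents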
Caio Teixeira, Ivandré Paraboni, Adriano Silva and Alan Yamasaki. Generating Relational Descriptions involving Mutual Disambiguation
Abstract: This paper discusses the generation of relational referring expressions in which target and landmark descriptions are allowed to help disambiguate each other. Using a corpus of referring expressions in a simple visual domain - in which these descriptions are likely to occur - we propose a classification approach to decide when to generate them. The classifier is then embedded in a REG algorithm whose results outperform a number of naive baseline systems, suggesting that mutual disambiguation is fairly common in language use, and that this may not be entirely accounted for by existing REG algorithms.
Thiago Ferreira and Ivandre Paraboni. Classification-based Referring Expression Generation
Abstract: This paper presents a study in the field of Natural Language Generation (NLG), focusing on the computational task of referring expression generation (REG). We describe a standard REG implementation based on the well-known Dale & Reiter Incremental algorithm, and a classification-based approach that combines the output of several support vector machines (SVMs) to generate definite descriptions from two publicly available corpora. Preliminary results suggest that the SVM approach generally outperforms incremental generation, which paves the way to further research on machine learning methods applied to the task.
Prasad Perera and Leila Kosseim. Evaluation of Sentence Compression Techniques Against Human Performance
Abstract: This paper presents a comparison of various sentence compression techniques with human-compressed sentences in the context of text summarization. Sentence compression is useful in text summarization as it allows removing redundant and irrelevant information, thereby preserving space for more relevant information. In this paper, we evaluate recent state-of-the-art sentence compression techniques based on syntax alone, a mixture of relevancy and syntax, part-of-speech-feature-based machine learning, keywords alone, and a naive random word removal baseline. Results show that syntax-based techniques complemented by relevancy measures outperform all other techniques in preserving content for the task of text summarization. However, further analysis of human-compressed sentences also shows that human compression relies on world knowledge, which is not captured by any automatic technique.
Eric Kergosien, Cédric Lopez, Mathieu Roche and Maguelonne Teisseire. Looking for Opinion in Land-use Planning Corpora
Abstract: A great deal of research on opinion mining and sentiment analysis has been done in specific contexts such as movie reviews, commercial evaluations, campaign speeches, etc. In this paper, we raise the issue of how appropriate these methods are for documents related to land-use planning. After highlighting limitations of existing proposals and discussing issues related to textual data, we present the method called Opiland (OPinion mIning from LAND-use planning documents). Experiments are conducted on a land-use planning dataset and on three datasets related to other areas, highlighting the relevance of our proposal.
Anselmo Peñas, Bernardo Cabaleiro and Mirella Lapata. Unsupervised Interpretation of Eventive Propositions
Abstract: This work addresses the challenge of automatically unfolding transfers of meaning in eventive propositions. For example, if we want to interpret “throw pass” in the context of sports, we need to find the object (“ball”) that transferred some semantic properties to “pass” to make it acceptable as an argument of “throw”. We propose a probabilistic model for interpreting an eventive proposition by recovering two coupled propositions related to the one under interpretation. We explore different configurations to couple propositions based on WordNet relations, and we gather the statistics after building a Proposition Store from a document collection. These coupled propositions compose an actual interpretation of the original proposition with a precision of 0.57, but only for 18% of the samples. If we evaluate whether the interpretation is merely useful for recovering the background knowledge required for interpretation, then results rise to 0.71 precision and recall.
David Bracewell, Marc Tomlinson, Michael Mohler and Bryan Rink. A Tiered Approach to the Recognition of Metaphor
Abstract: We present a tiered approach to the recognition of metaphor. The first tier is made up of highly precise expert-driven lexico-syntactic patterns which are automatically expanded upon in the second tier using lexical and dependency transformations. The final tier utilizes an SVM classifier with a variety of syntactic, semantic, and psycholinguistic features to determine whether an expression is metaphoric. We focus on the recognition of metaphors in which the target is associated with the concept of ``Economic Inequality'' and examine the effectiveness of our approach for metaphors expressed in English, Farsi, Russian, and Spanish. Through experimental analysis we show that the proposed approach is capable of achieving 67.4% to 77.8% F-measure depending on the language.
Marie Duzi. Isomorphism of structured meanings and synonymy
Abstract: The phenomenon of synonymy has been of central interest to both linguists and logicians, and though it is an important theoretical relation existing in natural language, a satisfactory criterion of synonymy is still a hot issue. In this paper I deal with this problem from the logical point of view, and the novel contribution of the paper is a proposal of a logical criterion of synonymy. A seemingly simple definition of synonymy as the identity of meaning evokes many problems, including, inter alia, questions of what the meaning of an expression is and how fine-grained meanings should be. In Transparent Intensional Logic (TIL), which is my background theory, the sense of an expression is an algorithmically structured procedure detailing what operations to apply to what procedural constituents to arrive at the object (if any) denoted by the expression. Such procedures are rigorously defined as TIL constructions. In this new orthodoxy of procedural semantics we encounter the problem of the granularity of the individuation of procedures, because from the procedural point of view TIL constructions are a bit too fine-grained. In an effort to solve the problem we introduced the notion of procedural isomorphism. Any two terms or expressions whose respective meanings are procedurally isomorphic are deemed semantically indistinguishable, hence synonymous. Procedural isomorphism is a nod to Carnap’s intensional isomorphism and Church’s synonymous isomorphism.
The problem of how fine-grained ‘intensional entities’, hence meanings, should be was of the utmost importance to Church, who considered several alternatives for constraining these entities. Senses are identical if the respective expressions are (A0) ‘synonymously isomorphic’, (A1) mutually lambda-convertible, or (A2) logically equivalent. (A2), the weakest criterion, was refuted already by Carnap in his (1947) and was not acceptable to Church either. Alternative (A0) arose from Church’s criticism of Carnap’s notion of intensional isomorphism, and it is synonymy resting on alpha-equivalence and meaning postulates for semantically simple terms. (A1) is deemed to be the right one. Yet it has been subjected to a fair amount of criticism, in particular due to the involvement of beta-reduction. For instance, Salmon (2010) adduces examples of expressions that should not be taken as synonymous yet are mutually beta-convertible. Moreover, partiality throws a spanner in the works: beta-reduction is not guaranteed to be an equivalence-preserving transformation as soon as partial functions are involved. Church also considered Alternative (A1’), that is, (A1) plus eta-convertibility. Yet eta-convertibility exhibits defects similar to those connected with beta-convertibility. Thus the problem of the proper granularity of structured meanings remains open. This is a pressing issue, because in natural language there are contexts that are neither intensional nor extensional, so that the substitution of logically equivalent expressions fails there. In such hyperintensional contexts only expressions with procedurally isomorphic meanings can be mutually substituted. The novel contribution of this paper is a formally worked-out, philosophically motivated criterion of hyperintensional individuation, which is defined in terms of a slightly more carefully stated version of alpha-conversion and beta-conversion by value, and which amounts to a modification of Church’s Alternative (A1).

References.
Anderson, C. A. (1998). Alonzo Church’s contributions to philosophy and intensional logic. The Bulletin of Symbolic Logic 4, 129-171.
Carnap, R. (1947). Meaning and Necessity. Chicago: Chicago University Press.
Church, A. (1993). A revised formulation of the logic of sense and denotation. Alternative (1). Noûs 27, 141-157.
Salmon, N. (2010). Lambda in sentences with designators: an ode to complex predication. Journal of Philosophy 107, 445-468.
Chenhui Chu, Toshiaki Nakazawa and Sadao Kurohashi. Iterative Bilingual Lexicon Extraction from Comparable Corpora with Topical and Contextual Knowledge
Abstract: In the literature, two main categories of methods have been proposed for bilingual lexicon extraction from comparable corpora, namely topic-model-based and context-based methods. In this paper, we present a bilingual lexicon extraction system that is based on a novel combination of these two methods in an iterative process. Our system does not rely on any prior knowledge and its performance can be iteratively improved. To the best of our knowledge, this is the first study that iteratively exploits both topical and contextual knowledge for bilingual lexicon extraction. Experiments conducted on Chinese–English and Japanese–English Wikipedia data show that our proposed method performs significantly better than a state-of-the-art method that only uses topical knowledge.
Tiansi Dong and Armin B. Cremers. A Novel Machine Translation Method for Learning Chinese as a Foreign Language
Abstract: It is not easy for western people to learn Chinese. Native German speakers find it difficult to understand how Chinese sentences convey meanings without using cases. Statistical machine translation tools may deliver correct German-Chinese translations, but do not explain educational matters. This article reviews some interdisciplinary research on bilingualism and expounds on how translation is carried out through cross-linguistic cue-switching processes. Machine translation approaches are revisited from the perspective of cue switching, concluding that the word order cue is explicitly simulated in all machine translation approaches, while the case cue, implicitly simulated in statistical machine translation approaches, can be explicitly simulated in rule-based and example-based machine translation approaches. A convergent result of machine translation research is to advocate an explicit deep-linguistic representation. Here, a novel machine translation system is motivated by blending existing machine translation methods from the viewpoint of cue switching, and is first developed as an educational tool. This approach takes a limited number of German-Chinese translations from textbooks as examples, whose cues can be obtained manually. We developed MultiNet-like deep linguistic representations and cross-linguistic cue-switching processes. Based on this corpus, our present tool is aimed at helping native German speakers learn Chinese more efficiently, and shall later be expanded into a more comprehensive machine translation system.
Amir Hazem. Improving Bilingual Lexicon Extraction from Comparable Corpora using Window-based and Syntax-based Models
Abstract: This paper proposes ways of combining a window-based and a syntax-based context representation for bilingual lexicon extraction from comparable corpora. Two methods are proposed: the first involves combining the scores assigned to translations by both approaches and using them for ranking and selection; the second involves a combination of the context features provided by the two approaches prior to applying the lexicon extraction method. The reported results show that the combination of the two context representations significantly improves bilingual lexicon extraction compared to using each of the representations alone.
Henning Wachsmuth, Martin Trenkmann, Benno Stein, Gregor Engels and Tsvetomira Palakarska. A Review Corpus for Argumentation Analysis
Abstract: The analysis of user reviews has become critical in research and industry, as user reviews increasingly impact the reputation of products and services. Many review texts comprise an involved argumentation with facts and opinions on different product features or aspects. Therefore, classifying sentiment polarity does not suffice to capture a review's impact. We claim that an argumentation analysis is needed, including opinion summarization, sentiment score prediction, and others. Since existing language resources to drive such research are missing, we have designed the ArguAna TripAdvisor corpus, which compiles 2,100 manually annotated hotel reviews balanced with respect to the reviews' sentiment scores. Each review text is segmented into facts, positive, and negative opinions, while all hotel aspects and amenities are marked. In this paper, we present the design and a first study of the corpus. We reveal patterns of local sentiment that correlate with sentiment scores, thereby defining a promising starting point for an effective argumentation analysis.
Kfir Bar and Nachum Dershowitz. Inferring Paraphrases for a Highly Inflected Language from a Monolingual Corpus
Abstract: We suggest a new technique for deriving paraphrases from a monolingual corpus, supported by a relatively small set of comparable documents. Two somewhat similar phrases that each occur in one of a pair of documents dealing with the same incident are taken as potential paraphrases, which are evaluated based on the contexts in which they appear in the larger monolingual corpus. We apply this technique to Arabic, a highly inflected language, for improving an Arabic-to-English statistical translation system. The paraphrases are provided to the translation system as a word lattice, each assigned a score reflecting its equivalence level. We experiment with different system configurations, with encouraging results: our best system shows an increase of 1.73 (5.49%) in BLEU.
Mohamed Farouk Abdel Hady and Abubakrelsedik Karali. Unsupervised Active Learning for Cross-Lingual Information Extraction
Abstract: Manual annotation of the training data for information extraction models is a time-consuming and expensive process, but it is necessary for building information extraction systems. Active learning has been proven effective in reducing manual annotation effort for supervised learning tasks, where a human judge is asked to annotate the most informative examples with respect to a given model. However, in most cases reliable human judges are not available for all languages. To avoid the expensive re-labeling process, cross-lingual adaptation, a special case of domain adaptation, refers to the transfer of classification knowledge from a source language to a target language with only unlabeled data. In this paper, we propose a cross-lingual active domain adaptation paradigm (XLADA) that generates high-quality automatically annotated training data from a word-aligned parallel corpus. To evaluate our paradigm, we applied XLADA to English-French and English-Chinese bilingual corpora and then trained French and Chinese information extraction models. The experimental results show that XLADA can produce effective models without manually annotated training data.
David Mareček and Zdeněk Žabokrtský. Dealing with Function Words in Unsupervised Dependency Parsing
Abstract: In this paper, we show some of the properties of function words in dependency trees. Function words are grammatical words, such as articles, prepositions, pronouns or auxiliary verbs. These words are often short and very frequent in corpora and can therefore be easily recognized. We formulate the hypothesis that function words often have a fixed number of children and verify it on several datasets. Using this hypothesis, we are able to improve unsupervised dependency parsing and outperform previously published results for many languages.
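One way to probe the stated hypothesis, sketched here on toy trees (the data structures are assumptions, not the authors' setup): for each word, measure how concentrated its number-of-children distribution is across a treebank; under the hypothesis, function words should show highly concentrated distributions.

    # Count, for each word, how often it has 0, 1, 2, ... dependents, and report
    # the share of occurrences carried by its most common child count.
    from collections import Counter, defaultdict

    def child_count_distribution(trees):
        """trees: list of (words, children) pairs, where children maps a head
        index to the list of its child indices."""
        dist = defaultdict(Counter)
        for words, children in trees:
            for i, w in enumerate(words):
                dist[w][len(children.get(i, []))] += 1
        return dist

    def concentration(counter):
        total = sum(counter.values())
        return counter.most_common(1)[0][1] / total

    trees = [(["the", "dog", "barks"], {1: [0], 2: [1]}),
             (["the", "cat", "sleeps"], {1: [0], 2: [1]})]
    dist = child_count_distribution(trees)
    print({w: concentration(c) for w, c in dist.items()})  # 'the' always has 0 children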
Lionel Ramadier, Manel Zarrouk, Mathieu Lafourcade and Antoine Micheau. Spreading Relation Annotations in a Lexical Semantic Network Applied to Radiology
Abstract: Domain specific ontologies are invaluable despite many challenges associated with their development. In most cases, domain knowledge bases are built with very limited scope without considering the benefits of adding domain knowledge to a general ontology. Furthermore, most existing resources lack meta-information about association strength (weights) and annotations (frequency information [frequent, rare] or relevance information [pertinent or not pertinent]).

In this paper, we present a semantic resource for radiology that elaborates on an existing general semantic lexical network (JeuxDeMots). This network combines weights and annotations on typed relations between terms and concepts, and also includes an inference and reconciliation mechanism to improve quality and coverage. We extend this mechanism to take into account not only relations but also annotations. We describe how annotations improve the network by imposing new constraints, especially those relying on medical knowledge. We then give some preliminary results.
Suraj Maharjan, Prasha Shrestha, Gabriela Ramirez, Alan Sprague, Thamar Solorio and Gary Warner. Using String Information for Malware Family Identification
Abstract: Classifying malware into correct families is an important task for anti-virus vendors. Currently, only some anti-virus programs will recognize a particular malware sample. Even when they do, they either classify it into different families or use a generic family name, which does not give much information about it. Our method performs this task by using the printable strings in the malware. It first creates prototypes from these strings and then uses these prototypes to classify the malware files. We extracted printable strings from 1504 malware files from our university's malware database for which at least five anti-virus vendors agreed upon the family of the malware. Our method is based on the observation that closely related malware samples have a heavy overlap of strings. We built prototypes for each family based on these strings and their frequencies. Then we used these prototypes to find the correct family for a file and to check whether the family classification by the anti-virus vendor is valid. We built two kinds of prototypes: one using the tf-idf of the whole vocabulary and the other using the prominent strings extracted from the vocabulary. We achieved an accuracy of about 91.02% by considering the entire vocabulary of the training dataset. Similarly, we achieved an accuracy of 80.52% by considering just the 20 prominent strings for each malware family. Our accuracy is high enough that we can use this method to classify even those malware samples that confuse the anti-virus vendors.
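A rough sketch of prototype-based family classification over printable strings (file contents and family names below are made up for illustration; the prominent-string variant is not shown):

    # Build one tf-idf prototype per family (the mean vector of its training
    # files) and classify a new file by cosine similarity to the prototypes.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    train_docs = ["connect http post sleep", "connect http get beacon",
                  "encrypt ransom note bitcoin", "encrypt ransom wallet"]
    train_families = ["botnetA", "botnetA", "ransomB", "ransomB"]

    vec = TfidfVectorizer()
    X = vec.fit_transform(train_docs).toarray()

    prototypes = {fam: X[[i for i, f in enumerate(train_families) if f == fam]].mean(axis=0)
                  for fam in set(train_families)}

    def classify(strings):
        x = vec.transform([strings]).toarray()[0]
        def cos(u, v):
            return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
        return max(prototypes, key=lambda fam: cos(x, prototypes[fam]))

    print(classify("http connect beacon sleep"))   # -> 'botnetA'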
Sandra Bringay, Eric Kergosien, Pierre Pompidor and Pascal Poncelet. Emotions target in health forums
Abstract: In the framework of the French project X, we focus on the semi-automatic analysis of online health forums. Online health forums are areas of exchange where patients, under condition of anonymity, can speak freely about their personal experiences. These resources are a gold mine for health professionals, giving them access to patient-to-patient exchanges, patient-to-health-professional exchanges and even health-professional-to-health-professional exchanges. In this paper, we focus on the emotions expressed by the authors of the messages and, more precisely, on the targets of these emotions. We suggest an innovative method to identify these targets, based on the notion of semantic roles and using the FrameNet resource. Our method has been successfully validated on a real data set.
Roris Victor, Juan M. Santos, Roberto Perez-Rodriguez, Carlos Rivas, Miguel Gomez and Luis Anido Rifon. Information Extraction in Semantic and Semi-Structured Web Sources
Abstract: The evolution of the Web from the original proposal made in 1989 can be considered one of the most revolutionary technical changes in centuries. During the past 25 years the Web has evolved from a static version to a fully dynamic and interoperable intelligent ecosystem. The amount of data produced during these few decades is enormous. New applications, developed by individual developers or small companies, can take advantage of both services and data already present on the Web. Data, produced by humans and machines, may be available in different formats and through different access interfaces. This paper analyses three different types of data available on the Web and presents mechanisms for accessing and extracting this information. The authors show several applications that leverage extracted information in two areas of research: recommendations of educational resources beyond content and interactive digital TV applications.
Elizaveta Clouet and Beatrice Daille. Compound Terms and their Multi-Word Variants: Case of German and Russian Languages
Abstract: The terminology of any language and any domain continuously evolves and leads to constant term renewal. Terms accept a wide range of morphological and syntactic variations which have to be accounted for in any NLP application. While the syntactic variations of multi-word terms have been described and tools designed to process them, only a few works have studied the syntagmatic variants of compound terms. This paper is dedicated to the identification of such variants, and more precisely to the detection of synonymic pairs that consist of "compound term - multi-word term". We describe a pipeline for their detection, from compound recognition and splitting to alignment of the variants with the original terms, through multi-word term extraction. The experiments are carried out for two compounding languages, German and Russian, and two specialised domains: wind energy and breast cancer. We identify variation patterns for these two languages and demonstrate that the transformation of a morphological compound into a syntagmatic compound occurs mainly when the term denomination needs to be enlarged.
Meishan Zhang, Yue Zhang, Wanxiang Che and Ting Liu. A Semantic-Oriented Grammar for Chinese Treebanking
Abstract: Chinese grammar engineering has been a much debated task due to the unique characteristics of the language. Whilst semantic information has been reckoned crucial for Chinese syntactic analysis and downstream applications, existing Chinese treebanks lack a consistent and strict sentential semantic formalism. In this paper, we introduce a semantic-oriented grammar for Chinese, designed to provide basic support for tasks such as automatic semantic parsing and sentence generation. It has a directed acyclic graph structure with a simple yet expressive label set, and leverages elementary predication to support logical form conversion. To our knowledge, it is the first Chinese grammar representation capable of direct logical transformation.
Santanu Pal, Pintu Lohar and Sudip Kumar Naskar. Role of Paraphrases in PB-SMT
Abstract: Statistical Machine Translation (SMT) delivers a convenient format for representing how the translation process is modeled. The translations of words or phrase pairs are generally computed based on their occurrence in some bilingual training corpus. However, SMT still suffers from out-of-vocabulary (OOV) words and less frequent words, especially when only limited training data are available or when training and test data come from different domains. In this paper, we propose a convenient way to handle OOV and rare words using a paraphrasing technique. Initially, we extract paraphrases from the bilingual training corpus with the help of comparable corpora. We analyze the paraphrases by conditionally checking the association of their monolingual distributions. Bilingually aligned paraphrases are incorporated as additional training data into the PB-SMT system. Integration of paraphrases into the PB-SMT system results in significant improvement.
Lior Wolf, Yair Hanani, Kfir Bar and Nachum Dershowitz. Joint word2vec Networks for Bilingual Semantic Representations
Abstract: The word2vec tool learns a vector representation for each dictionary word based on an unannotated training text. The resulting representation has been shown to capture semantics well, in the sense that words that share meanings tend to be clustered together.

We extend the word2vec framework to capture meaning across languages. The input consists of a text and a word-aligned parallel text in another language. The joint word2vec tool then represents words in both languages within a common "semantic" vector space. The result can be used to enrich lexicons of under-resourced languages, to identify ambiguities, and to perform clustering and classification.

Experiments were conducted on English-Arabic corpora as well as Bible samples, all of which were aligned using Giza++.
Imene Bensalem, Paolo Rosso and Salim Chikhi. Intrinsic Plagiarism Detection using N-grams Frequency Classes
Abstract: When it is not possible to compare the suspicious document to the source document(s) plagiarism has been committed from, the evidence of plagiarism has to be looked for intrinsically in the document itself. In this paper, we introduce a novel intrinsic plagiarism detection method based on a new feature that we call the n-gram frequency class. This feature is language independent and straightforward to implement. Moreover, it allows describing plagiarism accurately in terms of the proportions of infrequent n-grams in a suspicious document. The experiments were performed using a publicly available standard corpus in the Arabic language. The obtained results in terms of f-measure were comparable to the ones obtained by one of the best state-of-the-art methods.
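A sketch in the spirit of the feature described (the paper's exact definition of the frequency class may differ): an n-gram's class grows as it becomes rarer relative to the document's most frequent n-gram, and a passage is characterised by its proportion of infrequent n-grams.

    # Assign each n-gram a frequency class relative to the most frequent n-gram
    # in the document, then measure how many n-grams of a window are "rare".
    import math
    from collections import Counter

    def ngrams(tokens, n=3):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def frequency_classes(tokens, n=3):
        counts = Counter(ngrams(tokens, n))
        top = max(counts.values())
        return {g: int(math.floor(math.log2(top / c))) for g, c in counts.items()}

    def infrequent_ratio(window_tokens, classes, n=3, threshold=2):
        grams = ngrams(window_tokens, n)
        rare = sum(1 for g in grams if classes.get(g, threshold + 1) > threshold)
        return rare / max(len(grams), 1)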
Aibek Makazhanov, Olzhas Makhambetov, Islam Sabyrgaliyev and Zhandos Yessenbayev. Spelling Correction for Kazakh
Abstract: Being an agglutinative language, Kazakh imposes certain difficulties on both the recognition of correct words and the generation of candidate corrections for misspelled words. In this paper we describe a spelling correction method for Kazakh that takes advantage of both morphological analysis and a noisy-channel-based model. Our method outperformed both open source and commercial analogues in terms of overall correction accuracy. We performed a comparative analysis of spelling correction tools and pointed out some problems of spelling correction for agglutinative languages in general and Kazakh in particular.
Gurpreet Singh Lehal, Tejinder Singh Saini and Preetpal Kaur Buttar. Automatic Bilingual Legacy-Fonts Identification and Conversion System
Abstract: Digital text written in an Indian script is difficult to use as such, because there are a number of font-formats available for typing, and these font-formats are not mutually compatible. Gurmukhi alone has more than 225 popular ASCII-based fonts, whereas this figure is 180 in the case of Devanagari. To read text written in a particular font, that font must be installed on the system. This paper describes a language- and font-detection system for Gurmukhi and Devanagari. It also delineates a font-conversion system for converting ASCII-based text into Unicode. The proposed system therefore works in two stages: the first stage uses a statistical model for automatic language detection (i.e., Gurmukhi or Devanagari) and font detection; the second stage converts the text into Unicode according to the detected font. We could not train our system for some fonts due to the non-availability of font converters, but the system and its architecture are open to accepting any number of languages/fonts in the future. The existing system supports around 150 popular Gurmukhi font-encodings and more than 100 popular Devanagari fonts.
Abeba Ibrahim and Yaregal Assabie. Amharic Sentence Parsing Using Base Phrase Chunking
Abstract: Parsing plays a significant role in many natural language processing (NLP) applications, as their efficiency relies on having an effective parser. This paper presents an Amharic sentence parser developed using a base phrase chunker that groups syntactically correlated words at different levels. We use an HMM to chunk base phrases, and incorrectly chunked phrases are pruned with rules. The task of parsing is then performed by taking the chunk results as inputs. A bottom-up approach with a transformation algorithm is used to transform the chunker into the parser. A corpus from Amharic news outlets and books was collected for training and testing. The training and testing datasets are prepared using the 10-fold cross-validation technique. Test results on the test data showed an average parsing accuracy of 93.75%.
Niraj Kumar. A Graph Based Automatic Plagiarism Detection Technique to Handle The Artificial Word Reordering and Paraphrasing
Abstract: Most plagiarism detection techniques are based on either string-based matching or semantic matching of adjacent strings. However, due to the use of artificial word reordering and paraphrasing, the detection of plagiarism has become a challenging task of significant interest. To solve this issue, we concentrate on the identification of overlapping adjacent plagiarized word patterns and overlapping non-adjacent/reordered plagiarized word patterns in the target document(s). The main aim is to capture both the simple cases and the complex cases (i.e., artificial word reordering and/or paraphrasing) of plagiarism in the target document. To this end, we first identify the relation between all overlapping word pairs with the help of controlled closeness centrality and semantic similarity. Next, to extract the plagiarized word patterns, we introduce the use of minimum weighted bipartite clique covers. We use the plagiarized word patterns in the identification of plagiarized texts in the target document. Our experimental results on publicly available annotated datasets, namely the PAN 2012 plagiarism detection dataset and a student-answer plagiarism dataset, show that the approach performs better than state-of-the-art systems in this area.
György Orosz, Attila Novák and Gábor Prószéky. Lessons learned from tagging clinical Hungarian
Abstract: As more and more textual resources from the medical domain are becoming accessible, automatic analysis of clinical notes becomes possible. Since part-of-speech tagging is a fundamental part of any text processing chain, tagging tasks must be performed with high accuracy. While there are numerous studies on tagging medical English, we are not aware of any previous research examining the same field for Hungarian. This paper presents methods and resources which can be used for parsing medical Hungarian and investigates their application to tagging clinical records. Our research relies on a baseline setting, whose performance was improved incrementally by eliminating its most common errors. The extension of the lexicon used raised the overall accuracy significantly, while other domain adaptation methods were only partially successful. The presented enhancements corrected almost half of the errors. However, further analysis of the errors suggests that abbreviations should be handled at a higher level of processing.
Kanako Komiya, Shohei Shibata and Yoshiyuki Kotani. Cross-lingual Product Recommendation Using Collaborative Filtering With Translation Pairs
Abstract: We developed a cross-lingual recommender system using collaborative filtering with English-Japanese translation pairs of product names to help non-Japanese, English-speaking buyers visiting Japanese shopping Web sites. The customer buying histories at an English shopping site and those at a Japanese shopping site were used for the experiments. Two kinds of experiments were conducted to evaluate the system: (1) two-fold cross validation where half of the translation pairs were masked, and (2) experiments where all of the translation pairs were used. The precision, recall, and mean reciprocal rank (MRR) of the system were evaluated to assess its general performance in the former experiment, while the latter experiments showed what kinds of items were recommended in a more realistic scenario. The experiments revealed that masked items were found more efficiently than with a bestseller baseline, and showed that items available only at the Japanese site that seemed to be related to buyers' interests could be found by the system in the more realistic scenario.
Apurbalal Senapati and Utpal Garain. A Maximum Entropy based Honorificity Identification for Bengali Pronominal Anaphora Resolution
Abstract: This paper presents a maximum entropy based method for determining the honorific identities of personal nouns in Bengali. This information is then used in a pronoun (anaphora) resolution system for Bengali, as honorificity plays an important role in pronominal anaphora resolution in Bengali. Experiments were done on a publicly available dataset. Experimental results show that when the module for honorific identification is added to the existing pronoun resolution system, the accuracy (avg. F1-score) of the system improves from 0.602 to 0.703, and this improvement is shown to be statistically significant.
Vivek Datla, King-Ip Lin and Max Louwerse. Linguistic features predict the truthfulness of short political statements
Abstract: Checking the truth value of political statements is difficult. Computational fact checking has therefore not been very successful. An alternative to checking the truth value of a statement is to consider not the facts that are stated, but the way the statement is expressed. Using linguistic features from seven computational linguistic algorithms, we investigated whether the truthfulness of statements and the definitiveness with which they are expressed can be predicted from linguistic features. In a training set we found that both definitiveness and truthfulness of the statement predicted linguistic variables. These variables corresponded to those mentioned in the deception literature. Next, we used a new set of political statements and determined whether the same linguistic variables would be able to predict the definitiveness and truthfulness of the statement. Even though the political statements are short, one-sentence statements, allowing for large variability in linguistic variables, discriminant analyses showed that the function obtained from the training set allowed for an accurate classification of 57-59% of the data. These findings are encouraging, for instance for a first analysis of the truth value and verifiability of political statements.
Reinhard Rapp. Using Word Association Norms to Measure Corpus Representativeness
Abstract: An obvious way to measure how representative a corpus is for the language environment of a person would be to observe this person over a longer period of time, record all written or spoken input, and compare this data to the corpus in question. As this is not very practical, we suggest here a more indirect way to do this. Previous work suggests that people's word associations can be derived from corpus statistics. These word associations are known to some degree, as psychologists have collected them from test persons in large-scale experiments. The output of these experiments are tables of word associations, the so-called word association norms. In this paper we assume that the more representative a corpus is for the language environment of the test persons, the better the associations generated from it should match people's associations. That is, we compare the corpus-generated associations to the association norms collected from humans, and take the similarity between the two as a measure of corpus representativeness. To our knowledge, this is the first attempt to do so.
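A toy sketch of the proposed comparison (the association-derivation step and the overlap score are assumptions, not the paper's exact procedure): derive each stimulus word's strongest corpus co-occurrents and score the corpus by how often they match the human association norms.

    # Compare corpus-derived associations with human association norms and use
    # the overlap as a crude representativeness score.
    from collections import Counter

    def corpus_associations(sentences, stimulus, window=5, top_k=3):
        counts = Counter()
        for tokens in sentences:
            for i, tok in enumerate(tokens):
                if tok == stimulus:
                    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                        if j != i:
                            counts[tokens[j]] += 1
        return [w for w, _ in counts.most_common(top_k)]

    def representativeness(sentences, norms, top_k=3):
        """norms: dict mapping stimulus -> list of human associates (strongest first)."""
        hits = total = 0
        for stimulus, associates in norms.items():
            predicted = corpus_associations(sentences, stimulus, top_k=top_k)
            hits += len(set(predicted) & set(associates[:top_k]))
            total += top_k
        return hits / max(total, 1)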
Ophélie Lacroix, Denis Bechet and Florian Boudin. Label Pre-annotation for Building Non-projective Dependency Treebanks for French
Abstract: We aim at building a dependency treebank containing both projective and non-projective dependency trees consistent with a categorial dependency grammar. In order to alleviate the work of the annotator, we propose to automatically pre-annotate the sentences with the labels of the dependencies ending on the words. The selection of the dependency labels reduces the ambiguity of the parsing. We show that a maximum entropy Markov model method reaches the label accuracy score of a standard dependency parser (MaltParser). Moreover, this method allows finding more than one label per word, i.e., the most probable ones, in order to improve the recall score. It improves the quality of the parsing step of the annotation process. Therefore, the inclusion of the method in the annotation process makes the work quicker and more natural for annotators.
Diana Inkpen and Amir Hossein Razavi. Topic Classification using Latent Dirichlet Allocation at Multiple Levels
Abstract: We propose a new low-dimensional text representation method for topic classification. Several Latent Dirichlet Allocation (LDA) models are built on a large amount of unlabelled data in order to extract potential topic clusters at different levels of generalization. Each document is represented as a distribution over these topic clusters. We experiment with two datasets. We collected the first dataset from the FriendFeed social network and manually annotated part of it with 10 general classes. The second dataset is a standard text classification benchmark, Reuters 21578, the R8 subset (annotated with 8 classes). We show that classification based on our multi-level LDA representation leads to improved results for both datasets. Our representation captures topic distributions from generic to more specific ones and allows the machine learning algorithm to choose the appropriate level of generalization for the task. Another advantage is the dimensionality reduction, which permits the use of machine learning algorithms that cannot run on high-dimensional feature spaces. Even for algorithms that can deal with high-dimensional feature spaces, it is often useful to speed up training and testing time by using the lower dimensionality.
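A compact sketch of the multi-level representation (the corpora, topic numbers and classifier are illustrative assumptions, not the paper's settings): several LDA models with different numbers of topics are fit on unlabelled text, each document is represented by the concatenation of its topic distributions, and that representation feeds an ordinary classifier.

    # Fit LDA models at several levels of generalization, concatenate the
    # per-level topic distributions, and train a classifier on the result.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.linear_model import LogisticRegression

    unlabelled = ["stars and planets", "football match result", "new planet found",
                  "league cup final", "telescope observes galaxy", "goal scored late"]
    labelled   = ["galaxy and stars", "cup final goal"]
    labels     = ["science", "sports"]

    vec = CountVectorizer()
    U = vec.fit_transform(unlabelled)

    levels = [2, 4, 8]      # from generic to more specific topic clusters
    ldas = [LatentDirichletAllocation(n_components=k, random_state=0).fit(U) for k in levels]

    def represent(texts):
        X = vec.transform(texts)
        return np.hstack([lda.transform(X) for lda in ldas])

    clf = LogisticRegression(max_iter=1000).fit(represent(labelled), labels)
    print(clf.predict(represent(["planet observed by telescope"])))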
Daiga Deksne, Raivis Skadiņš and Inguna Skadiņa. Extended CFG formalism for grammar checker and parser development
Abstract: This paper reports on the implementation of a grammar checker and parser for highly inflected and under-resourced languages. The syntax of languages with a rich morphological feature system cannot be described using the classical context-free grammar (CFG) formalism. We have extended the CFG formalism by adding syntactic roles, lexical constraints, and constraints on morpho-syntactic feature values. The formalism also allows assigning morpho-syntactic feature values to phrases and specifying optional constituents. The paper also describes how we implement the grammar checker by using two sets of rules: rules describing correct sentences and rules describing grammar errors. The same engine with a different rule set can be used for different purposes, either to parse text or to find grammar errors. The paper also describes the implementation of Latvian and Lithuanian parsers and grammar checkers and the quality measurement methods used for quality assessment.
Nir Ofek, Lior Rokach and Prasenjit Mitra. Methodology for Connecting Nouns to their Modifying Adjectives
Abstract: Adjectives are words that describe or modify other elements in a sentence. As such, they are frequently used to convey facts and opinions about the nouns they modify. Connecting nouns to the corresponding adjectives becomes vital for intelligent tasks such as aspect-level sentiment analysis or the interpretation of complex queries (e.g., "small hotel with large rooms") for fine-grained information retrieval. To respond to this need, we propose a methodology that identifies dependencies between nouns and adjectives by looking at syntactic clues related to part-of-speech sequences that help recognize such relationships. These sequences are generalized into patterns that are used to train a binary classifier using machine learning methods. The capabilities of the new method are demonstrated in two languages whose syntax is essentially different: English, the leading language of international discourse, and Hebrew, whose rich morphology poses extra challenges for parsing. In each language we compare our method with a designated, state-of-the-art parser and show that it performs similarly in terms of accuracy while (a) using a simple and relatively small training set, (b) not requiring language-specific adaptation, and (c) being robust across a variety of writing styles.
Cyrine Nasri, Kamel Smaili and Chiraz Latiri. Statistical Machine Translation Without Alignments
Abstract: Machine translation systems usually build an initial word-to-word alignment before training the phrase translation pairs. This approach requires a lot of matching between different single words of both considered languages. In this paper, we propose a new approach for phrase-based machine translation which does not require any alignment. This method is based on inter-lingual triggers retrieved by Multivariate Mutual Information. The algorithm segments sentences into phrases and finds their alignments simultaneously. The main objective of this work is to directly build valid alignments between source and target phrases. The achieved results, in terms of performance, are satisfactory, and the obtained translation table is smaller than the reference one; this approach could be an alternative to the classical methods.
Zuzana Neverilova. Acquiring annotated data for recognizing textual entailment by means of a game
Abstract: Recognizing textual entailment (RTE) is a well-defined task concerning semantic analysis. It is evaluated against a manually annotated collection of hypothesis-text pairs. A pair is annotated true if the text entails the hypothesis and false otherwise. Such a collection can be used for training or testing an RTE application only if it is large enough.

We present a game whose purpose is to collect h-t pairs. It follows a detective-story narrative pattern: a brilliant detective and his slower assistant talk about the riddle to reveal the solution to readers. In the game, the detective (human player) provides a short story. The assistant (the application) proposes hypotheses that the detective judges true, false or nonsense.

Hypothesis generation is a rule-based process, but the most likely hypotheses that are offered for annotation are calculated from a language model. During generation, individual sentence constituents are rearranged to produce syntactically correct sentences.

The game is intended to collect data in the Czech language. However, the idea can be applied to other languages. The paper concentrates on the description of the most interesting modules from a language-independent point of view, as well as on the game elements.
Jessica Perrie, Aminul Islam and Evangelos Milios. How Document Properties Affect Document Relatedness Measures
Abstract: We address the question of how document properties (word count, term frequency, cohesiveness, genre) affect the quality of unsupervised document relatedness measures (Google trigram model and vector space model). We use three genres of documents: aviation safety reports, medical equipment failure descriptions, and biodiversity heritage library text. Quality of document relatedness is assessed by the accuracy of a classification task using the kNN method. Experiments discover correlations between document property values and document relatedness quality, and we discuss how one approach may perform better depending on property values of the dataset.
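A sketch of the evaluation protocol as described (the similarity matrix and labels are toy values): document relatedness quality is measured by the accuracy of a kNN classifier that labels each document by majority vote over its k most related neighbours.

    # kNN accuracy over a precomputed document relatedness (similarity) matrix.
    import numpy as np

    def knn_accuracy(sim, labels, k=3):
        sim = np.array(sim, dtype=float)
        labels = np.array(labels)
        correct = 0
        for i in range(len(labels)):
            order = np.argsort(-sim[i])
            neighbours = [j for j in order if j != i][:k]
            values, counts = np.unique(labels[neighbours], return_counts=True)
            if values[np.argmax(counts)] == labels[i]:
                correct += 1
        return correct / len(labels)

    sim = [[1.0, 0.9, 0.2, 0.1],
           [0.9, 1.0, 0.3, 0.2],
           [0.2, 0.3, 1.0, 0.8],
           [0.1, 0.2, 0.8, 1.0]]
    print(knn_accuracy(sim, ["aviation", "aviation", "medical", "medical"], k=1))  # -> 1.0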
Bassam Hammo, Asma Moubaiddin, Nadim Obeid and Abeer Tuffaha. Understanding Arabic Syntactic Structure in Light of the Government and Binding Theory
Abstract: In this paper we present a parser that employs Chomsky’s Government and Binding (GB) theory to better understand the syntactic structure of Arabic sentences. We consider different word orders in Arabic and show how they are derived. We examine the analysis of different sentence orders, including Subject-Verb-Object (SVO), Verb-Object-Subject (VOS), Verb-Subject-Object (VSO), nominal sentences, nominal sentences beginning with inna (and sisters), and question sentences. We analyze these structures to develop syntactic rules for a fragment of Arabic grammar, including two sets of rules: (1) rules on sentence structures that do not account for case, and (2) rules on sentence structures that account for Noun Phrase (NP) case. We present the implementation of the grammar rules in Prolog. The experiments revealed high accuracy in case assignment in Modern Standard Arabic (MSA) in the light of GB theory, especially when the input sentences are tagged with identification of end cases.
Miguel Rios and Lucia Specia. Statistical Relational Learning to Recognise Textual Entailment
Abstract: We propose a novel approach to recognising textual entailment (RTE) following a two-stage architecture -- alignment and decision -- where both stages are based on semantic representations. In the alignment stage, the entailment candidate pairs are represented and aligned using predicate-argument structures. In the decision stage, a Markov Logic Network (MLN) is learnt using rich relational information from the alignment stage to predict an entailment decision. We evaluate this approach using the RTE Challenge datasets. It achieves the best results for the RTE-3 dataset and shows performance comparable to state-of-the-art approaches on the other datasets.
Amit Mishra and Sanjay Kumar Jain. An Approach for Computing Sentiment Polarity of Complex Why Type Opinion Questions Asked on Product Review Sites
Abstract: Opinion questions are questions which expect answers from the opinionated data available on the social web. Why-questions asked on product review sites require reasons, elaborations, and explanations for the sentiment expressed in the question towards a particular product. Sentiment analysis has recently been used in answering why-type opinion questions. In this paper, we propose an approach to determine the sentiment polarity of complex why-type opinion questions that may be expressed in multiple sentences or may contain multiple opinions on different features of products. To the best of our knowledge, this is the first work in the direction of determining the sentiment polarity of such why-questions. We apply Rhetorical Structure Theory to determine the discourse structure of why-type questions. We use this structure to determine the sentiment polarity of why-type questions and conduct experiments which obtain better results compared to a baseline average scoring method. We find that better discourse parser performance and the use of lexical opinion resources would yield better results.
Nobal B. Niraula and Vasile Rus. A Machine Learning Approach to Anaphora Resolution in Dialogue based Intelligent Tutoring Systems
Abstract: Anaphora resolution is a central topic in dialogue and discourse that deals with finding the referent of a pronoun. It plays a critical role in conversational Intelligent Tutoring Systems (ITSs), as it can increase the accuracy of assessing students' knowledge level, i.e., their mental model, based on their natural language inputs. Although anaphora resolution is one of the most studied problems in Natural Language Processing, there are very few studies that focus on anaphora resolution in dialogue-based ITSs. To this end, we present the Deep Anaphora Resolution Engine (DARE++), which adapts and extends existing machine learning solutions to resolve pronouns in ITS dialogues. Experiment results show that DARE++ achieves an F-measure of 88.93, showing great potential for resolving pronouns in student-tutor dialogues.
Md. Abdullah Al Mumin, Abu Awal Md. Shoeb, Mohammad Reza Selim and Muhammed Zafar Iqbal. A Representative Bengali Corpus for Intelligent Text Processing
Abstract: The development of intelligent text processing applications requires the availability of sizable, reliable and representative corpora. However, such corpora are not routinely available for the Bengali language. This paper introduces the Shahjalal University Monolingual (SUMono) corpus, a representative modern Bengali corpus consisting of more than 27 million words, which is the largest of its kind. This paper describes how we constructed the SUMono corpus from available online and offline Bengali texts, with articles tagged as belonging to 6 domains: Natural Science, Social Science, Computer and IT, Literature, Mass Media and Blogs. We show some characteristics of the Bengali language based upon statistical analysis of this corpus. We also compare the 'inherent sparseness' of Bengali with English and Arabic by observing the type-to-token ratio of the languages. We assess our corpus in terms of its representativeness, homogeneity and vocabulary growth rate using established techniques such as Zipf's law, the distribution of function words and Baayen's equation, respectively. We found that our corpus is balanced with respect to the frequency distribution as well as to the range of idiosyncratic phenomena.
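For reference, a minimal sketch of the type-to-token ratio and its growth curve, the kind of statistic used here to compare 'inherent sparseness' across languages (the step size is arbitrary):

    # Type-to-token ratio (distinct words / total words), plus a growth curve
    # computed on increasing prefixes of a tokenised corpus.
    def type_token_ratio(tokens):
        return len(set(tokens)) / max(len(tokens), 1)

    def ttr_growth(tokens, step=10000):
        """TTR measured on growing prefixes of the corpus."""
        return [(n, type_token_ratio(tokens[:n])) for n in range(step, len(tokens) + 1, step)]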
Ruifeng Xu, Jun Xu, Bin Liu and Lin Yao. News Reader’s Emotion Prediction Using Concept and Concept Sequence Features in Headline
Abstract: A news headline typically consists of a few words. In order to attract the reader’s attention, headlines are often written with the intention to provoke emotional reactions. Studying the connection between the linguistic usage of a headline and emotional expression is a challenging and important task for short text analysis. This paper focuses on predicting readers’ emotions from news headlines to explore the connection between readers’ emotional reactions and news articles. To address the data sparseness problem, the headline is transformed into concepts using the semantic dictionary HowNet. Concept and concept sequence features are designed to represent the headline and keep its power of evoking emotional experiences. Experiments are performed on a dataset from Sina Social News. The results show that the proposed approach can achieve performance comparable to a BOW-model-based baseline system which uses the news content.
Yuta Kikuchi, Hiroya Takamura, Manabu Okumura and Satoshi Nakazawa. Identifying a Demand towards a Company in CGM
Abstract: Demands in CGM with regard to a product are useful for companies because these demands show how people want the product to be changed. However, there are many types of demand, and the demandee is not always the company that produces the product. Our objective in this study is to identify the demandees of demands in CGM. We focus on the verbs representing the requested actions and collect them using a graph-based semi-supervised approach for use as the features of a demandee classifier. Experimental results showed that using these features improves the classification performance.
Bayar Tsolmon and Kyung-Soon Lee. Extracting Social Events based on Timeline and User Reliability Analysis on Twitter
Abstract: When a hot social issue or event occurs, it significantly increases the number of comments and retweets on that day on Twitter. Generally, an event can be extracted from its term frequency, but it is hard to find an event that has a low term frequency, so important information may be missed. However, there are reliable users who are directly related to an event no matter how low the number of tweets is in that case. In this paper, we propose a user-reliability-based event extraction method. A timeline-analysis-based LDA model is described to extract event terms, and user behavior analysis is applied to classify reliable users who are interested in the issue. Four social issues were examined in Twitter data to verify the validity of the proposed method. The top 10 results of the experiment showed 97.2% precision (P@10). The experimental results show that the proposed method is effective for extracting events from a Twitter corpus.
Carolina Scarton, Lin Sun, Karin Kipper-Schuler, Magali Sanches Duran, Martha Palmer and Anna Korhonen. Verb clustering for Brazilian Portuguese
Abstract: VerbNet-style classes which capture the shared syntax and semantics of verbs have proven useful for many Natural Language Processing (NLP) tasks and applications (e.g. word sense disambiguation, machine translation, automatic summarization and semantic parsing, among others). However, lexical resources such as VerbNet are only available for a handful of the world's languages. Because manual development of lexical resources is extremely time-consuming, methods that can be used to automatically induce verb classes from texts could provide a valuable starting point when aiming to build VerbNets for different languages. To date such methods have been explored for English and a small number of other languages with promising results. In this paper, we investigate this approach for Brazilian Portuguese - a language for which no VerbNet and no automatic method for inducing verb classes has been developed yet. We apply unsupervised clustering techniques similar to those developed for other languages to Brazilian Portuguese. We report promising results but also discover many issues which require special consideration when aiming to optimise the performance of well-known techniques on less-resourced languages.
Pranay Kumar Venkata Sowdaboina, Sutanu Chakraborti and Sripada Somayajulu G. Learning to summarize time series data
Abstract: In this paper we focus on content selection for summarizing time series data using Machine Learning techniques. The goal is to exploit a parallel corpus to predict the appropriate level of abstraction required for a summarization task. This is an important step towards building an automated NLG (Natural Language Generation) system to generate text for unseen data. Machine learning approaches are used to induce the underlying rules for text summarization, which are potentially close to the ones that humans use to generate textual summaries. We present an approach to select important points in a time series that can aid in generating captions or textual summaries. We evaluate our techniques on a parallel corpus of human generated weather forecast text corresponding to numerical weather prediction data.
Pat Hall, Bal Krishna Bal, Sagun Dhakwa and Bhim Narayan Regmi. Issues in Encoding the Writing of Nepal’s Languages
Abstract: The major language of Nepal, known today as Nepali, is spoken as a mother tongue by nearly half the population, and as a second language by nearly all of the rest. A considerable volume of computational linguistics work has been done on Nepali, both in research establishments and commercial organizations. However, there are another 94 languages indigenous to the country, and the situation for these is not good. In order to apply computational linguistics methods to a language it must first be represented in the computer, but most of the languages of Nepal have no written tradition, let alone any support by computers. It is the written form that is needed for full computational processing, and it is here that we encounter barriers or, at best, inappropriate compromises. We will look at the situation in Nepal, ignoring the 17 cross-border languages whose major speaker populations lie outside Nepal. We are left with only three languages with written traditions: Nepali, which is well served; Newari, with over 1000 years of written tradition, but which has so far been frustrated in attempts to encode its writing; and Limbu, which does have its writing encoded, though with defects. Many of the remaining languages may be written in Devanagari, but their speakers aspire to something different that relates to their languages and has a more visually distinctive writing to mark their identity. We look at what can be done for these remaining languages and speculate whether a common writing system and encoding could cover all the languages of Nepal. Inevitably we must focus on the current standard for the computer encoding of writing, Unicode. We find that while language activists in Nepal do not adequately understand what is possible with the technology and pursue objectives within Unicode that are not necessary or helpful, external experts have only a limited understanding of all the issues involved and the requirements of living languages and their users, and instead pursue scholarly interests which offer limited support for living users.
Marina Litvak and Natalia Vanetik. Multi-document Summarization using Tensor Decomposition
Abstract: The problem of extractive text summarization for a collection of documents is defined as selecting a small subset of sentences so the contents and meaning of the original document set are preserved in the best possible way. In this paper we present a new model for the problem of extractive summarization, where we strive to obtain a summary that preserves the information coverage as much as possible, when compared to the original document set. We construct a new tensor-based representation that describes the given document set in terms of its topics. We then rank topics via Tensor Decomposition, and compile a summary from the sentences of the highest ranked topics.
D Indumathi, A Chitra and J Bineeshia. Search Query Expansion using Concept based Clustering for Improved Personalized Search
Abstract: The user profile is an elementary component of any personalization-based application. Existing user profile strategies consider only objects that interest the user (the user's positive preferences), and not objects that do not interest the user (the user's negative preferences). This paper focuses on personalization in search engines, and the proposed approach consists of three steps. First, a concept extraction algorithm is employed in which concepts and the relations between them are obtained from the web snippets returned by the search engine. Second, a user profiling strategy is employed to build a concept-based user profile which predicts the conceptual preferences of the user (considering both positive and negative preferences). Building the user profile comprises identifying concept preference pairs with a Spy Naive Bayes Classifier (Spy NB-C) and learning the user's preferences, represented by feature weight vectors, with a Ranking Support Vector Machine (R-SVM). Third, the concept relations, together with the predicted conceptual preferences of the user, are given as input to a personalized concept-based clustering algorithm to find conceptually related queries. To cluster ambiguous queries into different query clusters, a personalized clustered query-concept bipartite graph is created using the extracted concepts and click-through data. This suggests personalized query recommendations to individual users based on their interests. From the experimental results, it is observed that the user profile which captures both kinds of user preferences increases the separation between dissimilar and similar queries. Improvements in F-measure show that the resulting query clusters are of higher quality and provide personalised results to the users.
Nibaran Das, Swarnendu Ghosh, Teresa Goncalves and Paulo Queresma. Comparison of different graph distance metrics for semantic text based classification
Abstract: Text classification using semantic information is the latest trend of research, as it has the potential to represent texts more logically than bag-of-words approaches. On the other hand, representing semantics through graphs has several advantages over the traditional feature vector representation. Therefore, error-tolerant graph matching techniques can be used for text classification, but very few methodologies in the literature do so with a semantic representation through graphs. In the present work, a methodology has been proposed to represent the semantic information of a summarized text as a graph. The discourse representation structure of a text is utilized during the formation of the graphs from the semantics. Five different graph matching techniques based on Maximum Common Subgraphs (MCS) and Minimum Common Supergraphs (MMCS) are evaluated on 20 classes from the Reuters21578-Apte-115Cat text database, taking 10 samples of each class for both training and testing with a KNN classifier. From the results it has been observed that the techniques have the potential to perform text classification as well as traditional bag-of-words approaches do.
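One classical member of this family of graph distances (shown here only as an illustration; the abstract does not specify which of the five variants the authors evaluate) is d(g1, g2) = 1 - |mcs(g1, g2)| / max(|g1|, |g2|). A minimal sketch, assuming the size of a maximum common subgraph is supplied by an external matcher:

# Illustrative MCS-based graph distance; |mcs| is assumed to be computed
# elsewhere (exact maximum-common-subgraph search is NP-hard in general).
def mcs_distance(size_g1, size_g2, size_mcs):
    """d(g1, g2) = 1 - |mcs(g1, g2)| / max(|g1|, |g2|)."""
    if max(size_g1, size_g2) == 0:
        return 0.0
    return 1.0 - size_mcs / max(size_g1, size_g2)

# Example: graphs with 12 and 15 nodes sharing a 9-node common subgraph
# give a distance of 1 - 9/15 = 0.4.
print(mcs_distance(12, 15, 9))  # 0.4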
Nikola Ljubešić, Tomaž Erjavec and Darja Fišer. Standardizing Tweets with Character-level Machine Translation
Abstract: This paper presents the results of a standardization procedure for Slovene tweets, which are full of colloquial, dialectal and foreign-language elements. With the aim of minimizing the human input required, we produced a manually normalized lexicon of the most salient out-of-vocabulary (OOV) tokens and used it to train a character-level statistical machine translation system (CSMT). The best results were obtained by combining the manually constructed lexicon with CSMT as a fallback, with an overall improvement of 9.9% on all tokens and 31.3% on OOV tokens. Manual preparation of data in the form of a lexicon has proven to be more efficient than normalizing running text for the task at hand. Finally, we performed an extrinsic evaluation in which we automatically lemmatized the test corpus taking as input either the original or the automatically standardized wordforms, and achieved 75.1% per-token accuracy with the former and 83.6% with the latter, thus demonstrating that standardization has significant benefits for further processing.
Lakshmi S and Sobha Lalitha Devi. Rule Based Case Transfer in Tamil-Malayalam MT
Abstract: The paper focuses on rule-based case transfer, which is part of the transfer grammar module developed for a bidirectional Tamil-Malayalam Machine Translation system. The present study involves two typologically close and genetically related languages, namely Tamil and Malayalam. Both languages belong to the Dravidian family and have rich morphology. We considered basic sentences, which are highly dependent on the case system. Case is most easily observed and studied in languages that have rich case morphology; hence, a study of case suffixes is found to enhance the Tamil-Malayalam machine translation output. A parallel corpus was chosen, case transfer patterns were analysed, and rules were written to sort out the case changes that happen when translating from one language to the other. The rules were written taking into consideration the PSPs and cases in the two languages. Web data was used for evaluation and the results were encouraging.
Rabeb Mbarek, Mohamed Tmar and Hawete Hattab. A New Relevance Feedback Algorithm Based on Vector Space Basis Change
Abstract: The idea of relevance feedback (RF) is to take the results that are initially returned for a given query and to use information about whether or not those results are relevant to perform a new query. The most commonly used approaches to RF aim to rewrite the user query. In the vector space model, RF is usually undertaken by re-weighting the query terms without any modification of the vector space basis. With respect to the initial vector space basis (index terms), relevant and irrelevant documents share some terms (at least the terms of the query which selected these documents). In this paper we propose a new RF method based on a vector space basis change, without any modification of the query term weights. The aim of our method is to build a basis which optimally separates relevant and irrelevant documents. This vector basis thus gives a better representation of the documents, such that the relevant documents are gathered together and the irrelevant ones are kept away from the relevant documents.
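For intuition only: one standard way to obtain a direction that separates relevant from irrelevant document vectors is Fisher's linear discriminant. This is not the authors' construction (which is not described in the abstract), but it illustrates the general idea of changing the representation rather than re-weighting the query:

# Illustration: a separating direction via Fisher's linear discriminant.
import numpy as np

def separating_direction(R, N, reg=1e-6):
    """R: rows are relevant document vectors; N: rows are irrelevant ones."""
    mu_r, mu_n = R.mean(axis=0), N.mean(axis=0)
    Sw = np.cov(R, rowvar=False) + np.cov(N, rowvar=False)
    Sw += reg * np.eye(Sw.shape[0])        # regularize for numerical stability
    w = np.linalg.solve(Sw, mu_r - mu_n)   # within-class scatter^-1 * mean difference
    return w / np.linalg.norm(w)

# Documents can then be re-expressed in a basis containing w, along which the
# relevant and irrelevant groups are pulled apart in the Fisher sense.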
Sobha Lalitha Devi, Lakshmi S and Sindhuja Gopalan. Discourse Tagging for Indian Languages
Abstract: In our paper we present a detailed study on the annotation of discourse connectives and their arguments in Indo-Aryan and Indo-Dravidian languages, namely Hindi, Tamil and Malayalam. Discourse connectives are cohesive markers that link two or more utterances. The identification of the arguments of discourse connectives is needed for NLP tasks such as Information Extraction (IE), Machine Translation and Question Answering systems. Our study is based on the relevance theory framework, which points out that discourse markers link utterances with non-verbal context. Using this approach we have tagged the discourse connectives and their arguments in a health corpus. We have also evaluated the reliability of our annotation based on inter-annotator agreement.
Tuan Dinh, Hung Phan and Quan Tran. Evaluating prosodic characteristics for Vietnamese Aviation announcements
Abstract: In most languages, the quality of a text-to-speech system is directly related to the diversity of language domains. Each domain, such as sports, entertainment, etc., has its own grammatical structure that determines the automatic pronunciation produced by the text-to-speech system. Prosodic information plays an important role in analyzing the grammatical structure of each domain. In this research, we analyze the characteristics of prosodic information in the aviation domain in Vietnamese and detect the most important characteristics related to the sentiment of Vietnamese aviation announcements.
Vijay Sundar Ram, Efstathios Stamatatos and Sobha Lalitha Devi. Identification of Plagiarism using Syntactic and Semantic Filters
Abstract: We present work on the detection of simulated plagiarism in documents by comparison with a set of source documents. Simulated plagiarism is real-life plagiarism, where the obfuscation is introduced manually into documents. We have used the PAN 10 data set to develop and evaluate our algorithm. Our algorithm consists of two steps: identification of possibly plagiarized passages using the Dice similarity measure, and filtering of the obtained passages using syntactic and semantic features learned from the obfuscation. The algorithm works at the sentence level. The results are encouraging.
Jörg Tiedemann. Improved Text Extraction from PDF Documents for Large-Scale Natural Language Processing
Abstract: This paper describes a tool for extracting texts from arbitrary PDF files for the purpose of large-scale natural language processing. Our approach combines the benefits of several existing solutions for the conversion of PDF documents to plain text. The lack of reliable text extraction is often an obstacle for large-scale NLP based on resources crawled from the Web. One of the largest problems in the conversion is the detection of the boundaries of textual units such as paragraphs, sentences and words. PDF is a file format optimized for printing and encapsulates a complete description of the layout of a document including text, fonts, graphics and so on. In our research, we looked especially at publications from the European Union, which constitute a valuable multilingual resource, for example, for training statistical machine translation models. Our approach combines the outputs of the PDF-rendering libraries pdfxtk, Apache Tika and Poppler in various configurations. It then recovers proper boundaries from the various outputs using on-the-fly language models and language-independent extraction heuristics. We use our tool for the conversion of a large-scale multilingual database crawled from the EU bookshop with the aim of building parallel corpora for MT research. Details about the tool and our experiments will be given in the final paper.
Hatem Haddad and Bechikh Ali Chedi. Turkish Information Retrieval Performances: Evaluating the Impact of Linguistic Parameters and Compound Nouns
Abstract: Turkish is an agglutinative language where linguistic parameters can have significant consequences for information retrieval performance. In this paper, we study different Turkish linguistic parameters (truncation, stemming, stop words, ...) and compare their impact on the performance of an information retrieval system. We study word truncation at three fixed lengths (3, 4 and 5 characters) and compare these results with the use of two stemmers: Snowball and Zemberek. Moreover, we study the effect of using compound nouns in addition to simple keywords in the indexing process. In our experiments, we used the Bilkent University test collection Milliyet and three information retrieval models. The performance comparison was done using traditional information retrieval metrics as well as the bpref metric, since the test collection is built on incomplete relevance judgments.
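A minimal sketch of the fixed-length truncation indexing studied above: every token is reduced to its first 3, 4 or 5 characters before indexing. The example words are only illustrative; stop-word handling and the Snowball/Zemberek stemmers are separate components not shown here.

# Fixed-length ("n-prefix") truncation of tokens before indexing.
def truncate_tokens(tokens, n):
    return [t[:n] for t in tokens]

query = "bilgisayarlar üniversitelerde kullanılıyor".split()
for n in (3, 4, 5):
    print(n, truncate_tokens(query, n))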
Eric Wehrli and Luka Nerima. When rules meet bigrams
Abstract: This paper discusses an on-going project aiming at improving the quality and the efficiency of a rule-based parser by the addition of a statistical component. The proposed technique relies on bigrams of (word+category) pairs, selected from the homographs contained in our lexical database and computed over a large, previously tagged section of the Hansard corpus. The bigram table is used by the parser to rank and prune the set of alternatives. To evaluate the gains obtained by the hybrid system, we conducted two manual evaluations: one over a small subset of the Hansard corpus, the other with a corpus of about 50 newspaper articles. In both cases, we compare analyses obtained by the parser with and without the statistical component, focusing only on one important source of mistakes, the confusion between nominal and verbal readings for ambiguous words such as announce, sets, costs, labour, etc.
Changliang Li, Bo Xu, Gaowei Wu, Xiuying Wang, Wendong Ge and Yan Li. Obtaining Better Word Representations via Language Transfer
Abstract: Vector space word representations have recently achieved great success in improving performance across various NLP tasks. However, existing word embedding learning methods utilize only monolingual corpora. Inspired by transfer learning, we propose a novel method to obtain word embeddings via language transfer. Under this method, in order to obtain word embeddings for one language (the target language), we instead train models on corpora of a different language (the source language). We then use the obtained source-language word embeddings to represent target-language word embeddings. We evaluate the word embeddings obtained by the proposed method on word similarity tasks across several benchmark datasets. The results show that our method is surprisingly effective, outperforming competitive baselines by a large margin. Another benefit of our method is that the process of collecting a new corpus might be skipped.
Tao Chen, Ruifeng Xu, Jun Xu, Bin Liu and Lin Yao. A Sentence Vector based Over-sampling Method for Imbalanced Emotion Classification
Abstract: Imbalanced training data always affect supervised learning based text classification. The problem becomes more serious in text emotion classification with multiple emotion categories. To address this problem, this paper presents an over-sampling method that constructs new minority-class training samples based on summed sentence vectors. Firstly, a large corpus is utilized to train a continuous skip-gram model for learning word/POS vector representations. Sentence vectors are then constructed by adding the word/POS vectors in the sentence. Based on this, new minority-class training samples are generated by randomly adding the sentence vectors of two training sentences from the corresponding class. In this way, the training sample set of each minority class is expanded until its size is the same as that of the majority class. The classifiers are then trained on this balanced training dataset. Evaluations on the NLP&CC2013 Chinese microblog emotion classification dataset show that the proposed method achieves an average precision of 48.4%, which is much higher than the best known performance on this dataset, namely 36.5%. These results show that the proposed sentence vector based over-sampling method improves imbalanced emotion classification effectively.
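A minimal sketch of the over-sampling step described above: sentence vectors are the sum of their word/POS vectors, and synthetic minority-class samples are the sum of two randomly chosen sentence vectors from the same class. The vector dimensionality and the skip-gram training itself are assumed to be given; names here are illustrative.

import random
import numpy as np

def sentence_vector(tokens, word_vectors, dim=100):
    # Sum of word/POS vectors; unknown tokens contribute a zero vector.
    v = np.zeros(dim)
    for tok in tokens:
        v += word_vectors.get(tok, np.zeros(dim))
    return v

def oversample(minority_vectors, target_size):
    # Add sentence vectors of two random same-class sentences until the
    # minority class reaches the majority-class size.
    samples = list(minority_vectors)
    while len(samples) < target_size:
        a, b = random.sample(minority_vectors, 2)
        samples.append(a + b)
    return samples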
Anjan Nepal and Alexander Yates. Exploring Applications of Representation Learning in Nepali
Abstract: We explore the applications of representation learning in Nepali, an under-resourced language. Using distributional similarity on a large unlabeled Nepali text, we induce clusters of different sizes. The use of these clusters as features significantly improves performance compared to the baseline on two standard NLP tasks. In a part-of-speech (PoS) tagging experiment where the training and test domains are the same, the accuracy on unknown words increased by up to 5% compared to the baseline. In a named-entity recognition (NER) experiment in a domain-adaptation setting with a small training data size, the F1 score improved by up to 41% compared to the baseline. In the setting where the training and test domains are the same, the F1 score improved by 13% compared to the baseline.
Iria Da Cunha, Jorge Vivaldi, Juan-Manuel Torres-Moreno and Gerardo Sierra. SIMTEX: An Approach for Detecting and Measuring Textual Similarity based on Discourse and Semantics
Abstract: In the field of Natural Language Processing (NLP), automatic systems for textual similarity detection and measurement are being developed in order to apply them to different tasks, such as plagiarism detection, question answering, textual entailment, summarization, etc. Currently, these systems use surface linguistic features or statistical information; few researchers use deep linguistic information. In this work, we present an algorithm for detecting and measuring textual similarity that takes into account information provided by the discourse relations of Rhetorical Structure Theory (RST) by Mann & Thompson (1988), and the lexical-semantic relations included in EuroWordNet. We apply the algorithm, called SIMTEX, to texts written in Spanish, but the methodology is language-independent. The resources necessary to adapt SIMTEX to other languages are a discourse parser for the corresponding language and an ontology or lexical database, such as WordNet.
Boris Galitsky, Dmitry Ilvovsky and Sergei O. Kuznetsov. Extending tree kernels towards paragraphs
Abstract: We extend parse tree kernels from the level of individual sentences to the level of paragraphs, to build a framework for learning short texts such as search results and social profile postings. We build a set of extended trees for a paragraph of text from the individual parse trees of its sentences. This is performed based on coreferences and Rhetorical Structure relations between the phrases in different sentences. Tree kernel learning is applied to the extended trees to take advantage of additional discourse-related information. We evaluate our approach, tracking relevance improvement for multi-sentence search, comparing the performance of individual sentence kernels with that of extended parse trees.
Oldrich Kruza and Vladislav Kubon. Automatic Recognition of Clauses
Abstract: This paper describes an attempt to solve the problem of recognizing clauses and their mutual relationship in complex Czech sentences on the basis of limited information, namely the information obtained by morphological analysis only. The method described in this paper may be used in the future for splitting the parsing process into two phases, namely 1) Recognizing clauses and their mutual relationships; and 2) Parsing the individual clauses. This approach should be able to improve the result of parsing long complex sentences.
Marc Tomlinson, Wayne Krug, David Hinote and David Bracewell. #impressme: The Language of Motivation in User Generated Content
Abstract: An individual's ability to produce quality work is a function of their current motivation, their control over the results of their work, and the social influences of their peers. All of these factors can be identified in the language that occurs when an individual discusses their work with their peers. Previous approaches to modeling motivation have relied on social-network and time-series analysis to predict the popularity of a contribution to a user-generated content site. In contrast, we show how an individual's use of language can reflect their level of motivation and can be used to predict their future performance. We compare our results to an analysis of motivation based on utility theory. We show that an understanding of the language contained in an individual's comments on user-generated content sites provides significant insight into an individual's level of motivation and the potential quality of their future work.
Arpita Batra. Constituency Parsing of Complex Noun Sequences in Hindi
Abstract: In Hindi, compound nouns and genitive nouns are used frequently. A complex noun sequence contains multiple nouns, with genitive markers as optional separators. In a complex noun sequence with more than two nouns, binary constituency parsing is a prerequisite for determining the semantic relation between a noun and its modifier. Semantic relation identification is useful for various applications such as question answering, information extraction and textual entailment.

Agreement rules have been applied for parsing recursive genitives with modifiers. A bigram-based approach has been applied to cases where the allomorphic forms of genitives are the same, and to compounds where syntactic rules do not apply. We have used two approaches, global and greedy, to determine the constituency parse. Both approaches have been applied to adjacency and dependency pairs.
Clara Vania, Mochamad Ibrahim and Mirna Adriani. Sentiment Lexicon Generation for Under-Resourced Language
Abstract: Sentiment analysis, or opinion mining, is being explored actively today. One of the most important resources for determining the polarity of a document is a sentiment lexicon. This paper presents a study on automatically building a domain-specific sentiment lexicon for an under-resourced language. We develop the lexicon using a set of seed words translated from an English sentiment lexicon and expand the seeds using available corpora. In the final step we apply filtering to select sentiment words and their polarity. Results show that our proposed methods can generate additional lexicon entries (86%) with high sentiment accuracy (77.7%).
Zelalem Mekuria and Yaregal Assabie. A hybrid approach to the development of part-of-speech tagger Kafi-noonoo language
Abstract: Although natural language processing is now a popular area of research and development, less-resourced languages are not receiving much attention from developers. One such under-resourced language is Kafi-noonoo, which is spoken in the south-western regions of Ethiopia. This paper presents the development of a part-of-speech tagger for the Kafi-noonoo language. In order to develop the tagger, we employed a hybrid of two systems: an HMM tagger and a rule-based tagger. The lexical and transition probabilities of word classes are computed by the HMM. However, due to the limited corpus available for the language, a set of transformation rules is applied to improve the result. The system was tested on a test corpus, where we obtained 77.19% and 61.88% accuracy for the HMM and rule-based taggers, respectively. With the same test data, a hybrid of the two systems yielded an accuracy of 80.47%.
Calkin Suero Montero, Tuomo Kakkonen and Myriam Munezero. Investigating the Role of Emotion-based Features in Author Gender Classification of Text
Abstract: Research has shown that writing styles are influenced by an extensive array of factors that includes text genre and the author's gender. Going beyond the analysis of linguistic features such as n-grams, stylometric variables and word categories, this paper presents an exploratory study of the role that emotions expressed in writing play in discriminating author gender in different text genres. In this work, the gender classification task is seen as a binary classification problem where discriminating features are taken from a vectorial space that includes emotion-based features. Results show that by exploiting the emotional information present in personal journal (diary) texts, up to 80% accuracy in gender classification can be reached with a support vector machine (SVM) algorithm. Over 75% accuracy is reached when classifying the author gender of blog texts. Our findings have implications for the use of emotion-based features in assisting author gender classification.
Miao Fan, Qiang Zhou and Thomas Fang Zheng. Mining the Personal Interests of Microbloggers via Exploiting Wikipedia Knowledge
Abstract: This paper focuses on an emerging research topic: mining microbloggers' personalized interest tags from the microblogs they have posted. It is based on the intuition that microblogs indicate the daily interests and concerns of microbloggers. Previous studies regarded the microblogs posted by one microblogger as a single document and adopted traditional keyword extraction approaches to select highly weighted nouns, without considering the characteristics of microblogs. Given the limited textual information of microblogs and the implicit interest expression of microbloggers, we suggest a new research framework for mining microbloggers' interests that exploits Wikipedia, a huge online knowledge encyclopedia, to take up these challenges. Based on the semantic graph constructed from Wikipedia, the proposed semantic spreading model (SSM) can discover and leverage semantically related interest tags which do not occur in one's microblogs. Based on the SSM, an interest mining system has been implemented and deployed on the biggest microblogging platform in China (Sina Weibo). We have also specified a suite of new evaluation metrics to compensate for the lack of evaluation measures for this research topic. Experiments conducted on a real-time dataset demonstrate that our approach outperforms state-of-the-art methods at identifying microbloggers' interests.
Marina Boia, Claudiu Cristian Musat and Boi Faltings. Constructing Context-aware Sentiment Lexicons with an Asynchronous Game with a Purpose
Abstract: One of the reasons sentiment lexicons do not reach human-level performance is that they lack the contexts that define the polarities of words. While obtaining this knowledge through machine learning would require huge amounts of data, context is commonsense knowledge for people, so human computation is a better choice. We identify context using a game with a purpose that increases the workers' engagement in this complex task. With the contextual knowledge we obtain from only a small set of answers, we already halve the sentiment lexicons' performance gap relative to human performance.
Felix-Herve Bachand, Elnaz Davoodi and Leila Kosseim. An Investigation on the Influence of Genres and Textual Organizations on the Use of Discourse Relations
Abstract: In this paper, we investigate some of the problems associated with the automatic extraction of discourse relations. In particular, we study the influence of the communicative goals encoded in one genre versus another, and of the various communicative goals encoded in different sections of documents of the same genre. Some investigations have been made in the past to identify the differences seen across genres or textual organization, but none have made a thorough statistical analysis of these differences across currently available annotated corpora. In this paper, we show that both the communicative goal of a given genre and, to a lesser extent, that of a particular topic tackled by that genre do in fact influence the distribution of discourse relations. Using a statistically grounded approach, we show that certain discourse relations are more likely to appear within given genres, and subsequently within sections within a genre. In particular, we observed that 'Attributions' are common in the newspaper article genre, while 'Joint' relations are comparatively more frequent in online reviews. We also notice that 'Temporal' relations are statistically more common in the methodology sections of scientific research documents than in the rest of the text. These results are important as they give clues that allow the tailoring of current discourse taggers to specific textual genres.
Tommi Pirinen and Krister Lindén. State-of-the-Art in Weighted Finite-State Spell-Checking
Abstract: The following claims can be made about finite-state methods for spell-checking: 1) Finite-state language models provide support for morphologically complex languages that word lists, affix stripping and similar approaches do not provide; 2) Weighted finite-state models have expressive power equal to other, state-of-the-art string algorithms used by contemporary spell-checkers; and 3) Finite-state models are at least as fast as other string algorithms for lookup and error correction. In this article, we use a contemporary finite-state spell-checking method as a baseline and perform tests in light of these claims, to evaluate state-of-the-art finite-state spell-checking methods. We verify that finite-state spell-checking systems outperform the traditional approaches in all aspects for English. We also show that the models for morphologically complex languages can be made to perform on par with English systems. We discuss the parameters that need to be considered when tuning for performance and quality.
Guoyu Tang, Yunqing Xia, Jun Sun, Min Zhang and Thomas Fang Zheng. Topic Models Incorporating Statistical Word Senses
Abstract: LDA considers a surface word to be identical across all documents and measures the contribution of a surface word to each topic. However, a surface word may present different signatures in different contexts, i.e. polysemous words can be used with different senses in different contexts. Intuitively, disambiguating word senses for topic models can enhance their discriminative capabilities. In this work, we propose a joint model to automatically induce document topics and word senses simultaneously. Instead of using some pre-defined word sense resources, we capture the word sense information via a latent variable and directly induce them in a fully unsupervised manner from the corpora. Experimental results show that the proposed joint model outperforms LDA and a standalone sense-based LDA model significantly in document clustering.
Lidong Bing, Chunliang Lu and Wai Lam. Website Community Mining from Query Logs with Two-phase Clustering
Abstract: A website community refers to a set of websites that concentrate on the same or similar topics. There are two major challenges in the website community mining task. First, websites on the same topic may not have direct links among them because of competition concerns. Second, one website may contain information on several topics; accordingly, a website community mining method should be able to capture this phenomenon and assign such websites to different communities. In this paper, we propose a method to automatically mine website communities by exploiting query log data from Web search. Query log data can be regarded as a comprehensive summarization of the real Web. The queries that result in clicks on a particular website can be regarded as a summarization of that website's content. Websites on the same topic are indirectly connected by the queries that convey information needs on this topic, an observation that helps us overcome the first challenge. The proposed two-phase method tackles the second challenge. In the first phase, we cluster the queries of the same host to obtain the different content aspects of the host. In the second phase, we further cluster the obtained content aspects from different hosts. Because of the two-phase clustering, one host may appear in more than one website community.
Utpal Sikdar, Asif Ekbal and Sriparna Saha. Modified Differential Evolution for Biochemical Name Recognizer
Abstract: In this paper we propose a modified differential evolution based feature selection approach for a biochemical name recognizer. Identification and classification of chemical entities are relatively more complex and challenging compared to other related tasks. As chemical entities we focus on IUPAC and IUPAC-related entities. The algorithm performs feature selection within the framework of a robust machine learning algorithm, namely Conditional Random Fields. Features are identified and implemented mostly without using any domain-specific knowledge and/or resources. Differential evolution (DE) is a recently developed optimization strategy. In this paper we modify the mutation operation of the traditional DE algorithm, and the modified DE is then used to develop a feature selection technique. The modified DE produces a set of solutions in the final population that denote different feature representations. We develop many CRF models using these feature combinations. In order to further improve the performance, the outputs of these classifiers are combined using a newly developed classifier ensemble technique based on the proposed modified DE. The algorithm is evaluated on a benchmark patent dataset. Finally we obtain recall, precision and F-measure values of 82.34%, 88.26% and 85.20%, respectively.
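For reference, the standard DE/rand/1 mutation that the paper's modified operator starts from (the modification itself is not specified in the abstract); a common way to use it for feature selection is to map the real-valued mutant to a 0/1 feature mask, as sketched below with illustrative names:

import numpy as np

def de_rand_1_mutation(population, i, F=0.5, rng=np.random.default_rng()):
    # Classic DE/rand/1: v = x_r1 + F * (x_r2 - x_r3), with r1, r2, r3 != i.
    idx = [j for j in range(len(population)) if j != i]
    r1, r2, r3 = rng.choice(idx, size=3, replace=False)
    return population[r1] + F * (population[r2] - population[r3])

def to_feature_mask(vector):
    # Binarize via a logistic squashing, a common choice for feature selection.
    return (1.0 / (1.0 + np.exp(-vector)) > 0.5).astype(int)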
Rui Wang, Wei Liu and Chris McDonald. How Candidate Selection Affects the Ranking in Unsupervised Keyphrase Extraction
Abstract: Unsupervised keyphrase extraction techniques generally consist of candidate phrase selection and ranking techniques. Previous studies treat candidate phrase selection and ranking as a whole, while the effectiveness of identifying candidate phrases and its impact on ranking algorithms have remained unexplored. This paper analyses the effect of different candidate selection approaches on the performance of ranking algorithms. Our evaluation shows that improved candidate selection approaches improve the performance of the ranking algorithms.
Sanghamitra Nath, Himangshu Sarma and Utpal Sharma. A preliminary study on the VOT patterns of the Assamese language and its Nalbaria variety
Abstract: The speech signal contains various analytical features, and one such feature is VOT, or voice onset time, which has proved to be a very important feature for classifying stops into different phonetic categories with respect to voicing. Furthermore, in order to identify the features of digital speech and language for automatic recognition, synthesis and processing, it is important that the language's phoneme set be analyzed, and VOT proves to be very useful in such an analysis. The stops in Assamese, the language spoken by the people of the state of Assam in North-East India, may be classified into three groups according to the place of articulation: labials, alveolars and velars. For each group there are two further types based on the voiced/unvoiced distinction, i.e., aspirated and murmured. This paper focuses on computing and analyzing the VOT values for the stops of the Assamese language and its dialectal variants to provide a better understanding of the phonological differences that exist among the dialectal variants of a language, which may prove useful for dialect translation and synthesis.
Yaakov Hacohen-Kerner and Orr Margaliot. Authorship Attribution of Responsa using Clustering
Abstract: Authorship attribution of text documents is a 'hot' research domain; however, almost all of its applications use supervised machine learning methods. In this research, we explore authorship attribution as a clustering problem, i.e., we attempt to perform authorship attribution using unsupervised machine learning methods. The application domain is responsa, which are answers written by well-known Jewish rabbis in response to various religious Jewish questions. We have built a corpus of 6,079 responsa, composed by five authors who lived mainly in the 20th century and containing almost 10M words. Clustering tasks were performed for 2, 3, 4 and 5 authors. Clustering was performed using three kinds of word lists: most frequent words (FW) including function words (stopwords), most frequent filtered words (FFW) excluding function words, and words with the highest variance values (VFW), and two machine learning methods: K-means and EM. The best clustering tasks, for 2, 3 and 4 authors, achieved results above 98%, and the improvement rates were above 40% in comparison to the "majority" (baseline) results. The EM method was found to be superior to K-means for the discussed tasks. FW was found to be the best word list, far superior to FFW. This finding is rather surprising since FW, in contrast to FFW, includes function words, which are usually regarded as words that have little lexical meaning. This implies that normalized frequencies of function words can serve as good indicators for the responsa authorship attribution task.
Pintu Lohar, Pinaki Bhaskar, Santanu Pal and Sivaji Bandyopadhyay. Cross Lingual Snippet Generation Using Snippet Translation System
Abstract: Multi Lingual Snippet Generation (MLSG) systems provide users with snippets in multiple languages. However, collecting and managing documents in multiple languages in an efficient way is a difficult task, which makes this process complicated. Fortunately, this requirement can be fulfilled in another way, by translating the snippets from one language to another with the help of Machine Translation (MT) systems. The resulting system is called a Cross Lingual Snippet Generation (CLSG) system. This paper presents the development of a CLSG system based on snippet translation when documents are available in only one language. We consider the English-Bengali language pair for snippet translation in one direction (English to Bengali). In this work, the major focus is on translating snippets with simpler techniques, excluding deeper MT concepts. In experimental results, an average BLEU score of 14.26 and a NIST score of 4.93 are obtained.
Vincent Claveau and Abir Ncibi. Knowledge discovery with CRF-based clustering of named entities without a priori classes
Abstract: Knowledge discovery aims at bringing out coherent groups of entities. It is usually based on clustering, which necessitates defining a notion of similarity between the relevant entities. In this paper, we propose to divert a supervised machine learning technique (namely Conditional Random Fields, widely used for supervised labeling tasks) in order to calculate, indirectly and without supervision, similarities among text sequences. Our approach consists in generating artificial labeling problems on the data to reveal regularities between entities through their labeling. We describe how this framework can be implemented and experiment with it on two information extraction/discovery tasks. The results demonstrate the usefulness of this unsupervised approach, and open many avenues for defining similarities for complex representations of textual data.
Savas Yildirim. A Knowledge-poor Approach to Turkish Text Categorization with a Comparative Analysis
Abstract: Document categorization is a way of determining a category for a given document. Supervised methods mostly rely on training data and rich linguistic resources that are either language-specific or generic. This study proposes a knowledge-poor approach to text categorization that does not use any sets of rules or language-specific resources such as a part-of-speech tagger or shallow parser. Knowledge-poor here refers to the lack of a reasonable amount of background knowledge. The proposed system architecture takes data as-is and simply separates tokens by whitespace. Documents represented in vector space models are used as training data for many machine learning algorithms. We empirically examined and compared several factors, from similarity metrics to learning algorithms, in a variety of experimental setups. Although researchers believe that some particular classifiers or metrics are better than others for text categorization, recent studies disclose that the ranking of models depends on the class, the experimental setup and the domain. The study features extensive evaluation and comparison within a variety of experiments. We evaluate models and similarity metrics for Turkish, an agglutinative language, especially within a knowledge-poor framework. The output of the study should be beneficial for other studies.
Hady Elsahar and Samhaa El-Beltagy. A Fully Automated Approach for Arabic Polarity Lexicon Extraction from Microblogs
Abstract: With the rapid increase in the volume of Arabic opinionated posts on different social media forums comes an increased demand for Arabic sentiment analysis tools. Research on the development of such tools is hindered by the limited availability of Arabic colloquial sentiment polarity lexicons, colloquial Arabic being the most commonly used form of Arabic within Arabic social media forums. This paper proposes an approach for building a polarity lexicon for Arabic sentiment analysis in a fully automated way. Since existing Arabic part-of-speech taggers and other morphological resources have been found to handle colloquial Arabic very poorly, the presented approach does not employ any such tools, allowing it to generalize across dialects as well as across languages. Additional challenges addressed by this work include recognizing internet slang, multi-word expressions and sarcastic expressions. Experiments carried out using a large Twitter dataset show the approach's ability to detect subjective internet slang, as well as many common sarcastic subjective expressions. Evaluation of the resulting polarity terms shows that the approach achieves the task of lexicon extraction with high precision.
Zahrul Islam, Md. Rashedur Rahman and Alexander Mehler. Text Readability Classification of Bangla Texts
Abstract: Readability classification is an important Natural Language Processing (NLP) application that can be used to judge the quality of documents and assist writers in locating possible problems. This paper presents a readability classifier for Bangla textbooks using information-theoretic and lexical features. Altogether, 21 features achieve an F-score of 86.46%.
Rajesh Piryani, Jagadesha H and Vivek Kumar Singh. An Algorithmic Approach for Learning Concept Identification and Relevant Resource Retrieval in Focused Subject Domains
Abstract: The modern digital world has an enormous amount of data on the Web which is easily accessible anywhere and anytime. This ease of access is also creating new paradigms of education and learning. Modern-day learners have access to a vast amount of learning material, including some of the best created in any part of the world. Despite this abundant availability of material, we still lack appropriate systems that can identify the learning needs of a user and present him/her with the most relevant and high-quality material to pursue. This paper presents our algorithmic design towards this goal. We have proposed a text-processing-based system that works in three phases: (a) identifying the learning needs of a learner; (b) retrieving relevant material and ranking it; and (c) presenting it to the learner and monitoring the learning process. Our design uses know-how from text processing, information retrieval, recommender systems and educational psychology, and presents useful and relevant learning material (including slides, videos, articles, etc.) to a learner in a focused subject domain. Our initial experiments have produced good results and we are working towards a Web-scale deployment of the system.
Zvi Ben-Ami, Ronen Feldman and Binyamin Rosenfeld. Using Multi-View Learning to Improve Detection of Investor Sentiments on Twitter
Abstract: Stocks-related messages on social media have several interesting properties regarding the sentiment analysis (SA) task. On the one hand, the analysis is particularly challenging, because of frequent typos, bad grammar, and idiosyncratic expressions specific to the domain and media. On the other hand, stocks-related messages primarily refer to the state of specific entities – companies and their stocks, at specific times (of sending). This state is an objective property and even has a measurable numeric characteristic, namely the stock price. Given a large dataset of twitter messages, we can create two separate "views" on the dataset by analyzing messages’ text and external properties separately. With this, we can expand the coverage of generic SA tools and learn new sentiment expressions. In this paper, we experiment with this learning method, comparing several types of general SA tools and sets of external properties. The method is shown to produce significant improvement in accuracy.
Ruijing Li, Shumin Shi, Heyan Huang, Chao Su and Tianhang Wang. A Method of Polarity Computation of Chinese Sentiment Words Based on Gaussian Distribution
Abstract: The Internet has become an excellent source for gathering consumer reviews, in which consumers' opinions are expressed through sentiment words. However, due to the fuzziness of Chinese words themselves, people's sentiment judgments are rather subjective. Studies have shown that people's judgments of the polarities and strengths of sentiment words obey a Gaussian distribution. In this paper, we propose a novel method for polarity computation of Chinese sentiment words based on the Gaussian distribution, which can quantitatively analyze the semantic fuzziness of Chinese sentiment words. Furthermore, several equations are proposed to calculate the polarities and strengths of sentiment words. Experimental results show that our method is highly effective.
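A minimal sketch of the underlying idea, with hypothetical ratings: fit a Gaussian to human polarity judgments of one sentiment word and read off its polarity from the mean and its fuzziness from the standard deviation. The paper's actual equations are more elaborate and are not given in the abstract.

import statistics

ratings = [0.6, 0.8, 0.7, 0.9, 0.5]   # hypothetical annotator scores on [-1, 1]
mu = statistics.mean(ratings)          # polarity (sign and strength)
sigma = statistics.stdev(ratings)      # semantic fuzziness of the word
print(f"polarity={mu:.2f}, fuzziness={sigma:.2f}")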
Lina Rojas and Christophe Cerisara. Bayesian Inverse Reinforcement Learning for Modeling Conversational Virtual Characters in a Situated Environment.
Abstract: This work proposes a Bayesian approach to build conversational virtual characters that give advice and help users complete tasks in a situated environment. We are interested in studying dialogue management as a problem of inverse reinforcement learning (IRL), in which the reward function in a Markov decision process is not explicitly given but instead is learned from demonstrations by experts. In this work, we apply Bayesian Inverse Reinforcement Learning (BIRL) to infer this reward in the context of a serious game, given evidence in the form of stored dialogues provided by experts who play the role of several conversational agents in the game. We show that the proposed approach converges relatively quickly and that it outperforms two baseline systems, including a dialogue manager trained to provide “locally” optimal decisions.
Marco Guerini and Carlo Strapparava. Credible or Incredible? Dissecting Urban Legends
Abstract: Urban legends are a genre of modern folklore, consisting of stories
about some rare and exceptional events, plausible enough to be
believed. In our view, while urban legends represent a form of
"sticky" deceptive text, they are stressed by a tension between
credible and incredible. They should be credible like a news article
and incredible like a fairy tale. In particular we will focus on the
idea that urban legends should mimic the details of news (who, where,
when) to be credible, while they should be emotional and readable like
a fairy tale to be catchy and memorable. Using NLP tools we will
provide a quantitative analysis on these prototypical characteristics.
We will also lay out some machine learning experiments showing that it
is possible to recognize an urban legend using just these features.
Xiangdong An. How Complementary Are Different Information Retrieval Techniques? - A Study in Biomedicine Domain
Abstract: In this paper, we present an empirical study of the runs submitted to the TREC Genomics Track, a gathering for information retrieval research in biomedicine. Based on the evaluation criteria provided by the track, we investigate how much relevant information is generally lost from a run, how well the relevant nominees are actually ranked with respect to their level of relevancy, and how they are distributed among the irrelevant ones in a run. We examine whether relevancy or the level of relevancy plays a more important role in the performance evaluation. Answering these questions may give us some insight into, and help us improve, current IR technologies. The study reveals that the recognition of relevancy is more important than that of the level of relevancy. It indicates that on average more than 60% of relevant information is lost from each run, with respect to either the amount of relevant information or the number of aspects (subtopics, novelty or diversity), which suggests substantial room for performance improvement. The study shows that the submitted runs from different groups are quite complementary, which implies that ensembles of IR systems could significantly improve retrieval performance. The experiments illustrate that a run performs well or poorly mainly due to its performance on its top 10% of rankings, and that the rest of the run contributes to the performance only marginally.
Gerard de Melo and Valeria de Paiva. Sense-Specific Implicative Commitments
Abstract: Natural language processing systems, even when given proper syntactic and semantic interpretations, still lack the common sense inference capabilities required for genuinely understanding a sentence. Recently, there have been several studies developing a semantic classification of verbs and their sentential complements, aiming to determine which inferences people draw from them. Such constructions may give rise to implied commitments that the author normally cannot disavow without being incoherent or contradicting herself, as described for instance in the work of Karttunen. In this paper, we model such knowledge at the semantic level by attempting to associate such inferences with specific word senses, drawing on WordNet and VerbNet. This allows us to investigate to what extent such inferences apply to semantically equivalent words within and across languages.
Alejandra Lorenzo and Christophe Cerisara. Semi-supervised SRL system with Bayesian inference
Abstract: We propose a new approach to perform semi-supervised training of Semantic Role Labeling models with a very small amount of initial labeled data. The proposed approach combines supervised and unsupervised training in a novel way, by forcing the supervised classifier to overgenerate potential semantic candidates and then letting unsupervised inference choose the best ones. Hence, the supervised classifier can be trained on a very small corpus and with coarse-grained features, because its precision does not need to be high: its role is mainly to constrain Bayesian inference to explore only a limited part of the full search space. This approach is evaluated on French and English. In both cases, it achieves very good performance and outperforms a strong supervised baseline when only a small number of annotated sentences is available, even without using any previously trained syntactic parser.
Phillip Smith and Mark Lee. Acknowledging Discourse Function for Sentiment Analysis
Abstract: In this paper, we observe the effects that discourse function has on the task of training learned classifiers for sentiment analysis. Experimental results from our study show that training on a corpus of primarily persuasive documents can have a negative effect on the performance of supervised sentiment classification. In addition, we demonstrate that through the use of the Multinomial Naïve Bayes classifier we can minimise the detrimental effects of discourse function during sentiment analysis.
Basanta Joshi, Manoj Ghimire and Umanga Bista. Intelligent clustering scheme for log data streams
Abstract: Log files consist of several independent lines of text data and contain information about events from one or more services, which may come from one or more nodes on the network. Mining patterns from these log messages is valuable for real-time analysis of network behavior and for detecting faults, anomalies and security threats. A data-streaming algorithm with an efficient pattern-finding approach is a more practical way to deal with these ubiquitous logs. Thus, in this paper the authors propose a novel approach for finding patterns in log data sets, where a locally sensitive signature is generated for similar log messages. The similarity of the log messages is identified by parsing the log messages on non-alphanumeric characters and then logically analyzing the signature bit stream associated with them. In addition, the approach is intelligent enough to reflect the changes when a totally new log appears in the system. The clustering of log messages from different servers is done with the proposed algorithm, and the results are validated by checking the word-order match percentage and comparing the results with a log clustering tool.
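A minimal sketch of a locality-sensitive bit signature for log lines in the spirit described above: tokenize on non-alphanumeric characters and fold per-token hashes into a fixed-width bit vector (a SimHash-style construction; the authors' exact scheme is not detailed in the abstract). Similar lines then share most signature bits.

import re
import hashlib

def signature(line, bits=64):
    counts = [0] * bits
    for tok in re.split(r"[^0-9A-Za-z]+", line):
        if not tok:
            continue
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)

def hamming(a, b):
    # Fewer differing bits = more similar log lines.
    return bin(a ^ b).count("1")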
Nayan Jyoti Kalita, Navanath Saharia and Smriti Kumar Sinha. Morphological Analysis of the Bishnupriya Manipuri Language using Finite State Transducers
Abstract: In this work we present a morphological analysis of the Bishnupriya Manipuri language, an Indo-Aryan language spoken in north-eastern India. As of now, no computational work is available for the language. Finite-state morphology is one of the successful approaches applied to a wide variety of languages over the years. Therefore, we adopted the finite-state approach to analyse the morphology of the Bishnupriya Manipuri language.
Nadir Durrani and Yaser Al-Onaizan. Improving Egyptian-to-English SMT by mapping Egyptian into MSA
Abstract: One of the aims of the DARPA BOLT project is to translate Egyptian blog data into English. While parallel data for MSA (Modern Standard Arabic)-English is abundantly available, it scarcely exists for Egyptian-English or Egyptian-MSA. A notable drop in translation quality is observed when translating Egyptian to English in comparison with translating from MSA to English. One reason for this drop is the high OOV rate, whereas another is the dialectal difference between training and test data. This work focuses on improving Egyptian-to-English translation by bridging the gap between Egyptian and MSA. First, we try to reduce the OOV rate by proposing MSA candidates for unknown Egyptian words through different methods, such as spelling correction and suggesting synonyms based on context. Secondly, we apply a convolution model using English as a pivot to map Egyptian words into MSA. We then evaluate our edits by running a decoder built on MSA-to-English data. Our spelling-based correction shows an improvement of ~1.7 BLEU points over the baseline system, which translates unedited Egyptian into English.
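A minimal sketch of the spelling-based candidate step, assuming a simple approximate string match against an MSA word list (the vocabulary below is a hypothetical, transliterated placeholder; the paper's actual method and resources are richer):

    import difflib

    msa_vocab = ["kitab", "maktab", "madrasa", "tarjama"]  # hypothetical MSA vocabulary

    def msa_candidates(oov_word, n=3, cutoff=0.7):
        # Propose MSA words whose spelling is close to the unknown Egyptian word.
        return difflib.get_close_matches(oov_word, msa_vocab, n=n, cutoff=cutoff)

    print(msa_candidates("ketab"))  # -> ['kitab']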
Martha Ruiz Costa-Jussà, Rafael E. Banchs and Alexander Gelbukh. An IR-based strategy for supporting Chinese-Portuguese translation services in off-line mode
Abstract: This paper describes an Information Retrieval engine that is used to support our Chinese-Portuguese machine translation services when no internet connection is available. Our mobile translation app, which is deployed on a portable device, relies by default on a server-based machine translation service and is therefore unusable without an internet connection. To provide translation support under this condition, we have developed a contextualized off-line search engine that allows users to continue using the app.
Soujanya Poria, Erik Cambria and Alexander Gelbukh. Sentic Parser: A Dependency Relation Based Concept Parser for Concept Level Text Analysis
Abstract: Concept-level analysis for text understanding is found to be superior to word- and phrase-level analysis. It offers a better understanding of text and helps to significantly increase the accuracy of many text mining tasks. Concept extraction from text is the key step of concept-level text analysis. In this paper, we propose Sentic Parser, which deconstructs natural language text into concepts based on the dependency relations between the words in the text. Our approach is domain independent and can extract concepts from heterogeneous text. The ConceptNet ontology is used to extract further useful information about each concept extracted from the text. Using this approach, 92.21% accuracy is obtained on a dataset of 3204 concepts. We also show the experimental results of three different text analysis tasks in which the proposed concept parser was used; for each of the three tasks we obtained better accuracy than the state of the art.
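The following sketch only illustrates the general idea of deriving multi-word concepts from dependency relations; it uses spaCy as an off-the-shelf parser and a deliberately reduced relation set, so it is not the Sentic Parser itself:

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def extract_concepts(text):
        doc = nlp(text)
        concepts = set()
        for tok in doc:
            if tok.dep_ == "dobj":             # verb + direct object, e.g. buy_car
                concepts.add(f"{tok.head.lemma_}_{tok.lemma_}")
            if tok.dep_ == "amod":             # adjective + noun, e.g. old_car
                concepts.add(f"{tok.lemma_}_{tok.head.lemma_}")
        return concepts

    print(extract_concepts("I bought a beautiful old car yesterday"))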
Thierry Poibeau. Optimality Theory as a Framework for Lexical Acquisition
Abstract: This paper reinvestigates a lexical acquisition system initially developed for French. We show that, interestingly, the architecture of the system reproduces and implements the main components of Optimality Theory. However, we formulate the hypothesis that some of its limitations are mainly due to a poor representation of the constraints used. Finally, we show how a better representation of these constraints would yield better results.
Nelly Moreno, Sergio Jimenez and Julia Baquero. Automatically Assessing Children Written Skills Based on Age-supervised Dataset
Abstract: In this paper, we propose an approach for predicting the age of the authors of narrative texts written by children aged 6 to 13 years. The features of this model, which are lexical and syntactic (part of speech), were normalized to prevent text length from being used as a predictor. Additionally, these features were combined using n-gram representations and machine learning techniques for regression (i.e. SMOreg). The proposed method was tested with collections of texts gathered from the Internet in Spanish, French and English, obtaining mean absolute errors in the age-prediction task of 1.40, 1.20 and 1.72 years, respectively. Finally, the usefulness of this model for building a ranking of documents by writing proficiency for each age is explored and discussed.
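A minimal sketch of the regression setup described above, with scikit-learn's SVR standing in for SMOreg and length-normalized word n-gram features (the example texts and ages are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVR

    texts = ["my dog is big and nice",
             "yesterday we visited my grandmother in the countryside"]
    ages = [7, 12]

    # norm="l2" rescales each document vector, so raw text length cannot act
    # directly as a predictor, in the spirit of the normalization above.
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), norm="l2"), SVR())
    model.fit(texts, ages)
    print(model.predict(["we play with the ball"]))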
Dan Ştefănescu, Rajendra Banjade and Vasile Rus. A Sentence Similarity Method based on Parsing and Information Content
Abstract: This paper introduces a method for assessing the semantic similarity between sentences, which relies on the assumption that the meaning of a sentence is captured by its syntactic constituents and the dependencies between them. We obtain both the constituents and their dependencies from a syntactic parser. Our algorithm considers that two sentences have the same meaning if it can find a good mapping between their chunks and if the chunk dependencies in one text are preserved in the other. Moreover, the algorithm takes into account that every chunk has a different importance, which is computed based on the information content of the words in the chunk. The output is a value in the [0,1] interval, representing the similarity score of the two given sentences. The experiments conducted on a well-known paraphrase data set show that the performance of our method is comparable to the state of the art.
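A much-simplified sketch of this scoring idea, assuming chunks are already available as word lists and using a toy frequency table for information content (the paper's parser-based chunking and exact mapping procedure are not reproduced):

    import math

    freq = {"the": 0.07, "cat": 0.001, "sat": 0.002, "on": 0.03, "mat": 0.001, "dog": 0.001}

    def ic(word):
        # Information content: rarer words carry more weight.
        return -math.log(freq.get(word, 1e-4))

    def chunk_sim(c1, c2):
        shared, union = set(c1) & set(c2), set(c1) | set(c2)
        return sum(ic(w) for w in shared) / sum(ic(w) for w in union)

    def sentence_sim(chunks1, chunks2):
        # Greedy mapping: each chunk picks its best-matching counterpart.
        return sum(max(chunk_sim(c1, c2) for c2 in chunks2) for c1 in chunks1) / len(chunks1)

    s1 = [["the", "cat"], ["sat"], ["on", "the", "mat"]]
    s2 = [["the", "dog"], ["sat"], ["on", "the", "mat"]]
    print(round(sentence_sim(s1, s2), 2))  # a value in [0, 1]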
Sirine Boukedi and Kais Haddar. HPSG grammar treating different forms of Arabic coordination
Abstract: Existing work in Natural Language Processing (NLP) is not yet fully reliable and faces problems at different levels. In particular, many complex phenomena have been neglected, most notably coordination. This structure is an important linguistic phenomenon: it is very frequent in various corpora and has always been a center of interest in NLP. The few works treating this structure handle only some coordinated forms, using hand-constructed parsers that are generally heavy. In this context, our work aims to find an adequate typology of Arabic coordination and to develop a grammar representing its different forms. Based on the proposed typology, we developed a Head-driven Phrase Structure Grammar (HPSG) for Arabic coordination. This grammar was validated with the Linguistic Knowledge Building (LKB) system, which is designed for grammars specified in the Type Description Language (TDL).
Griselda Matias Mendoza, Yulia Ledeneva and Rene Arnulfo Garcia Hernandez. Evaluación de Herramientas Comerciales, Herramientas en Línea y Métodos del Estado del Arte para la Generación de Resúmenes de Textos para un solo Documento
Abstract: Significant progress has now been made in state-of-the-art tools and methods for single-document automatic summarization (RAI, from the Spanish "Resúmenes Automáticos Individuales"). To evaluate the quality of RAI tools and methods, human-produced summaries are compared with the summaries generated automatically by a given method. In this work, 6 commercial tools were evaluated and 7 state-of-the-art methods for RAI generation were compared, using the same corpus (DUC-2002) and the same independent evaluation method (ROUGE). Interestingly, a significant advance of the state-of-the-art RAI methods over the summaries produced by the commercial tools can be observed.
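For reference, ROUGE-1 recall, the kind of measure used in such comparisons, can be sketched as the fraction of reference unigrams that also appear in the automatic summary (a minimal sketch; the official ROUGE toolkit adds stemming and further options):

    from collections import Counter

    def rouge1_recall(reference, candidate):
        ref = Counter(reference.lower().split())
        cand = Counter(candidate.lower().split())
        overlap = sum(min(ref[w], cand[w]) for w in ref)
        return overlap / sum(ref.values())

    print(rouge1_recall("the economy grew last year", "the economy grew strongly"))  # 0.6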
Yulia Ledeneva, René Arnulfo García-Hernández and Alexander Gelbukh. Graph Ranking on Maximal Frequent Sequences for Single Extractive Text Summarization
Abstract: We suggest a new method for the task of extractive text summarization using graph-based ranking algorithms. The main idea of this paper is to rank Maximal Frequent Sequences (MFS) in order to identify the most important information in a text. MFS are considered as nodes of a graph in the term selection step, and are then ranked in the term weighting step using a graph-based algorithm. We show that the proposed method obtains results superior to state-of-the-art methods, and that the best sentences are found with this method. We show that MFS are better than other terms. Moreover, we show that the longer the MFS, the better the results. If stop-words are excluded, the sense of the MFS is lost and the results are worse. Another important aspect of this method is that it requires neither deep linguistic knowledge nor domain- or language-specific annotated corpora, which makes it highly portable to other domains, genres, and languages.
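A small sketch of graph-based term ranking in this spirit, with plain words standing in for Maximal Frequent Sequences and co-occurrence within a sentence defining the edges (networkx's PageRank supplies the ranking; the MFS mining step is omitted):

    import itertools
    import networkx as nx

    sentences = [["graph", "ranking", "summarization"],
                 ["frequent", "sequences", "summarization"],
                 ["graph", "summarization"]]

    g = nx.Graph()
    for sent in sentences:
        for a, b in itertools.combinations(set(sent), 2):
            # Accumulate co-occurrence counts as edge weights.
            w = g.get_edge_data(a, b, {"weight": 0})["weight"]
            g.add_edge(a, b, weight=w + 1)

    scores = nx.pagerank(g, weight="weight")
    print(sorted(scores, key=scores.get, reverse=True))  # terms ordered by importance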
Joao Casteleiro, Joaquim Silva and Gabriel Pereira Lopes. Bilingually Learning Word Senses for Translation
Abstract: We may say that all words in every natural language are ambiguous, especially when translation is at stake. In translation tasks, there is a need to find adequate translations for such words in the contexts where they occur. In this article, we describe a bilingual strategy for clustering source-language words according to their meanings, discriminated by their translations and by the words occurring in their vicinity within a window in the source and target languages. This strategy is language independent and is based on a word correlation algorithm. To achieve the reported results, a parallel corpus aligned at sentence granularity, together with a previously extracted and validated bilingual lexicon, is used. The clusters obtained are evaluated in terms of F-measure, and their homogeneity and completeness are determined using V-measure. The learned clusters are then used to train a support vector machine to tag ambiguous words with their translations in the contexts where they occur; this task is also evaluated in terms of F-measure. So, in this paper, we propose a multilingual approach to disambiguating word senses, with the purpose of increasing the accuracy of statistically based machine translation systems when dealing with potentially ambiguous words. This latter task will be described and evaluated in another paper.
Andrea Segura-Olivares, Alejandro García and Hiram Calvo. Feature Analysis for Paraphrase Recognition and Textual Entailment
Abstract: Paraphrase recognition is the Natural Language Processing task of detecting whether an expression restated as another expression conveys the same information. Textual Entailment recognition, while similar to paraphrase recognition, consists in finding out whether a given text can be seen as a consequence of another text fragment, sometimes considering only part of the original meaning or adding inferences based on common sense. Traditionally, several lexical, syntactic and semantic techniques are used to solve this problem. In this work, we seek to use as few resources as possible while remaining effective. To this end, we perform a feature analysis for Paraphrase Recognition and Textual Entailment recognition, experimenting with combinations of several Natural Language Processing techniques such as word overlap, syntactic analysis and stop-word elimination. In particular, we explore the syntactic n-gram technique combined with auxiliary approaches such as stemming, synonym detection, similarity measures and linear interpolation. We measure and compare the performance of our system on the Microsoft Research Paraphrase Corpus and the RTE-3 test set for Paraphrasing and Textual Entailment, respectively. Syntactic n-grams produce good results for Paraphrase Recognition; as far as we know, syntactic n-grams had not previously been used for this task. For Textual Entailment, our best results were obtained with a simple word-overlap algorithm based on stemming and stop-word elimination.
Pavel Kral. Named Entities as new Features for Czech Document Classification
Abstract: This paper focuses on automatic document classification. The results will be used to develop a real application for the Czech News Agency. The main goal of this work is to propose new features based on Named Entities (NEs) for this task. Five different approaches to employing NEs are suggested and evaluated on a Czech newspaper corpus. We show that these features do not significantly improve the score over the baseline word-based features: the classification error rate improvement is only about 0.42% when the best approach is used.
Hiram Calvo. Simple TF·IDF is not the Best you can get for Regionalism Classification
Abstract: In broadly spoken languages such as English or Spanish, there are words particular to a given region. For example, there are words typically used in the UK such as cooker, while stove is preferred for that concept in the US. Identifying the particular words a region cultivates involves discriminating them from the set of words common to all regions. This yields the problem that a term's frequency should be salient enough for the term to be considered important, while being a common term tames this salience. This is the well-known trade-off between Term Frequency and Inverse Document Frequency; nevertheless, typical TF·IDF applications do not include weighting factors. In this work, we first propose several alternative formulae empirically and then conclude that we need to explore a broader search space; thereby, we propose using Genetic Programming to find a suitable expression composed of TF and IDF terms that maximizes the discrimination of such terms given a reduced bootstrapping set of examples labeled for each region (400). We present performance examples for the Spanish variations across the Americas and Spain.
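A toy stand-in for the search described above: instead of Genetic Programming, a handful of candidate expressions over TF and IDF is scored on a tiny labeled example, keeping whichever best surfaces region-specific terms (all data and formulas here are hypothetical, not the paper's):

    import math

    # term -> (frequency in region A, number of regions containing the term)
    stats = {"cooker": (30, 1), "stove": (2, 1), "the": (500, 5), "kitchen": (40, 5)}
    region_specific = {"cooker"}      # toy gold label for region A
    n_regions = 5

    candidates = {
        "tf*idf":      lambda tf, idf: tf * idf,
        "tf*idf^2":    lambda tf, idf: tf * idf ** 2,
        "log(tf)*idf": lambda tf, idf: math.log(1 + tf) * idf,
    }

    def fitness(formula):
        # Weight every term, then check whether the top-ranked one is region-specific.
        weights = {t: formula(tf, math.log(n_regions / df)) for t, (tf, df) in stats.items()}
        top = max(weights, key=weights.get)
        return 1.0 if top in region_specific else 0.0

    best = max(candidates, key=lambda name: fitness(candidates[name]))
    print(best)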
Arun K. Timalsina and Dinesh Dangol. Nepali Language Feature Enhanced Vector Space Model for News Classification
Abstract: Automatic text processing has been a focus of many researchers in different languages for decades. Due to the rise of online publishing, automatic text classification has become a key research problem in the text analytics domain. This research identifies various Nepali language features and their impact on the classification of Nepali news using the Vector Space Model, which is applicable to news search engines, news clustering and news recommendation systems. The total number of dimensions used in the Vector Space Model for classification is very large. The results show that the number of dimensions can be reduced significantly using Nepali-specific techniques such as filtering the most common words, word replacement and removal of word suffixes using morphology. Average precision and recall also improved when common words were filtered and word replacements were applied. Additional improvement is achieved with the use of Latent Semantic Indexing.
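A minimal sketch of the kind of pipeline suggested above: Vector Space Model features with a common-word filter, reduced via Latent Semantic Indexing (TruncatedSVD) before classification. scikit-learn is an assumption, the stop-word list and documents are placeholders, and the suffix-removal step is omitted:

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    nepali_stopwords = ["र", "छ", "को"]          # hypothetical common-word list
    docs = ["खेलकुद समाचार उदाहरण", "अर्थतन्त्र समाचार उदाहरण",
            "खेलकुद नतिजा", "अर्थतन्त्र बजार"]
    labels = ["sports", "economy", "sports", "economy"]

    model = make_pipeline(
        TfidfVectorizer(stop_words=nepali_stopwords),
        TruncatedSVD(n_components=2),            # LSI-style dimensionality reduction
        LinearSVC(),
    )
    model.fit(docs, labels)
    print(model.predict(["खेलकुद बजार"]))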