CICLing 2017 Accepted Papers with Abstracts

Notes:

LNCS

Marjan Hosseinia and Arjun Mukherjee. Detecting Sockpuppets in Deceptive Opinion Spam
Abstract: This paper explores the problem of sockpuppet detection in deceptive opinion spam using authorship attribution and verification approaches. Two methods are explored. The first is a feature subsampling scheme that uses the KL-Divergence on stylistic language models of an author to find discriminative features. The second is a transduction scheme, spy induction, which leverages the diversity of authors in the unlabeled test set by sending a set of spies (positive samples) from the training set to retrieve hidden samples in the unlabeled test set using nearest and farthest neighbors. Experiments using ground truth sockpuppet data show the effectiveness of the proposed schemes.
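As a rough illustration of the first scheme only (a Python sketch, not the authors' code), the snippet below measures the KL-Divergence between two authors' character-trigram distributions; in the paper's setting, the n-grams contributing most to such a divergence would be kept as discriminative stylistic features. The example texts and the trigram order are invented for illustration.

    from collections import Counter
    import math

    def char_ngram_dist(text, n=3):
        # relative frequencies of character n-grams, a simple stylistic language model
        grams = [text[i:i + n] for i in range(len(text) - n + 1)]
        counts = Counter(grams)
        total = sum(counts.values())
        return {g: c / total for g, c in counts.items()}

    def kl_divergence(p, q, eps=1e-6):
        # smooth q so the divergence stays finite for n-grams unseen in q
        vocab = set(p) | set(q)
        return sum(p[g] * math.log(p[g] / (q.get(g, 0.0) + eps)) for g in vocab if g in p)

    author_a = "the room was lovely and the staff were wonderful"
    author_b = "great value, would definitely recommend this hotel"
    print(kl_divergence(char_ngram_dist(author_a), char_ngram_dist(author_b)))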
Kokil Jaidka, Niyati Chhaya, Rahul Wadbude, Sanket Kedia and Manikanta Nallagatla. BATframe: An Unsupervised Approach for Domain-sensitive Affect Detection
Abstract: Generic sentiment and emotion lexicons are widely used for the fine-grained analysis of human affect from text. In order to accurately detect affect, there is a need for domain intelligence that enables understanding of the perceived interpretation of the same words in varied contexts. Recent work has focused on automatically inducing the polarity of given terms in changing contexts. We propose an unsupervised approach for the construction of domain-specific affect lexicons along these lines. The algorithm is seeded with existing standard lexicons and expanded based on context-relevant word associations. Experiments show that our lexicon provides better coverage than standard lexicons on both short and long texts, and corresponds well with human-annotated affect values. Our framework outperforms the state-of-the-art generic and domain-specific approaches with a precision of over 70% for the emotion detection task on the SemEval 2007 Affect Corpus.
Murathan Kurfalı, Ahmet Ustun and Burcu Can. A Trie-Structured Bayesian Model for Unsupervised Morphological Segmentation
Abstract: In this paper, we introduce a trie-structured Bayesian model for unsupervised morphological segmentation. In the model, we adopt prior information from different sources. We use neural word embeddings to discover words that are morphologically derived from each other and thereby semantically similar. We use letter successor variety counts obtained from tries that are built using neural word embeddings. Our results show that using different information sources, such as neural word embeddings and letter successor variety, as prior information improves morphological segmentation in a Bayesian model. Our model outperforms both recent and older unsupervised models on morphological segmentation for Turkish and gives promising results on English and German.
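For readers unfamiliar with letter successor variety, the following toy Python sketch (not the authors' implementation) shows how a trie over a word list yields, for each prefix, the number of distinct letters that can follow it; peaks in this count are commonly taken as candidate morpheme boundaries. The word list is invented.

    from collections import defaultdict

    def build_trie(words):
        trie = defaultdict(set)               # prefix -> set of possible next letters
        for w in words:
            for i in range(len(w)):
                trie[w[:i]].add(w[i])
        return trie

    def successor_variety(word, trie):
        # variety after each prefix of the word, from length 1 to the full word
        return [len(trie[word[:i]]) for i in range(1, len(word) + 1)]

    corpus = ["walk", "walks", "walked", "walking", "talked", "talking"]
    trie = build_trie(corpus)
    print(successor_variety("walking", trie))  # a high value after "walk" suggests a boundary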
Rivindu Perera and Parma Nand. An Ensemble Architecture for Linked Data Lexicalization
Abstract: The consumption of Linked Data has dramatically increased with the increasing momentum towards the semantic web. Linked Data is essentially a very simplistic format for the representation of knowledge, in that all the knowledge is represented as triples which can be linked using one or more components of the triple. To date, most of the effort has been towards either creating Linked Data by mining the web or making it available to users as a source knowledge base for knowledge engineering applications. In recent times there has been a growing need for these applications to interact with users in natural language, which requires the transformation of Linked Data knowledge into natural language. The aim of the RealText project described in this paper is to build a scalable framework to transform Linked Data into natural language by generating lexicalization patterns for triples. A lexicalization pattern is a syntactic pattern that transforms a given triple into a syntactically correct natural language sentence. Using DBpedia as the Linked Data resource, we have generated 283 accurate lexicalization patterns for a sample set of 25 ontology classes. We performed a human evaluation on a test sub-sample with inter-rater agreements of 0.86 and 0.80 for readability and accuracy respectively. These results showed that the lexicalization patterns generate language that is accurate, readable and exhibits qualities of human-produced language.
Yukun Ma, Erik Cambria and Benjamin Bigot. ASR Hypothesis Reranking using Prior-informed Restricted Boltzmann Machine
Abstract: Discriminative language models (DLMs) have been widely used for reranking competing hypotheses produced by an Automatic Speech Recognition (ASR) system. While existing DLMs suffer from limited generalization power, we propose a novel DLM based on a discriminatively trained Restricted Boltzmann Machine (RBM). The hidden layer of the RBM improves generalization and allows for employing additional prior knowledge, including pre-trained parameters and entity-related priors. Our approach outperforms the single-layer-perceptron (SLP) reranking model, and fusing our approach with SLP achieves up to 1.3% absolute Word Error Rate (WER) reduction and a relative 180% improvement in terms of WER reduction over the SLP reranker. In particular, the proposed prior-informed RBM reranker achieves the largest ASR error reduction (3.1% absolute WER) on content words.
Nigel Dewdney and Rachel Cotterill. Just the Facts: Winnowing Microblogs for Newsworthy Statements using Non-Lexical Features
Abstract: Microblogging has become a popular method to disseminate information quickly, but also serves many other dialogue acts such as expressing opinions and advertising. As the volumes have risen, the task of filtering messages for wanted information has become increasingly important.

In this work we examine the potential of natural language processing and machine learning to filter short messages for those that state items of news. We propose an approach that makes use of information carried at a deeper level than the message's lexical surface, and show that this can be used to effectively improve precision in filtering Twitter messages. Our method outperforms a baseline unigram "bag-of-words" approach to selecting news-event Tweets, yielding a 4.8% drop in false detections.
Haiyun Peng and Erik Cambria. CSenticNet: A Concept-Level Resource for Sentiment Analysis in Chinese Language
Abstract: In recent years, sentiment analysis has become a hot topic in natural language processing. Although sentiment analysis research in English is rather mature, Chinese sentiment analysis has just set sail, as the limited amount of sentiment resources in Chinese severely limits its development. In this paper, we present a method for the construction of a Chinese sentiment resource. We utilize both English sentiment resources and the Chinese knowledge base NTU Multilingual Corpus. In particular, we first propose a resource based on SentiWordNet and a second version based on SenticNet.
Filippos Karolos Ventirozos, Iraklis Varlamis and George Tsatsaronis. Detecting aggressive behavior in discussion threads using text mining
Abstract: The detection of aggressive behavior in online discussion communities is of great interest, due to the large number of users, especially of young age, who are frequently exposed to such behaviors in social networks. Verbal aggression in online chat-rooms and fora is a popular form of cyberbullying among youngsters, either in group-wide discussions or pairwise conversations. Research on cyberbullying prevention focuses on the detection of potentially harmful messages and develops intelligent systems which automatically identify potential risks, such as insults and threats. Text mining techniques are among the most promising tools used so far in the field to detect the expression of aggressive sentiments in short texts, which correspond to the comments, reviews, tweets and, in general, the textual digital fingerprints of an aggressive user. This article presents a novel approach which operates via sentiment analysis at the level of each comment, but considers the thread of messages (i.e., user discussions) as a whole to embed the notion of context. The suggested approach is able to detect aggressive, inappropriate or antisocial behavior relative to the wider discussion thread context and sentiment. Key aspects of the approach are the monitoring and analysis of the most recently published comments, and the application of text classification techniques for detecting whether an aggressive action actually emerges in a discussion thread. Thorough experimental validation of the suggested approach on a previously used benchmark demonstrates its applicability and advantages compared to other approaches.
Joseph Lark, Emmanuel Morin and Sebastian Peña Saldarriaga. A comparative study of target-based and entity-based opinion extraction
Abstract: Opinion target extraction is a crucial task of opinion mining, aiming to extract occurrences of the different entities of a corpus that are subjects of an opinion. In order to produce a readable and comprehensible opinion summary, which is the main application of opinion target extraction, these occurrences are consolidated at the entity level in a second task. In this paper we argue that combining the two tasks, i.e. extracting opinion targets using entities as labels instead of binary labels, yields better results for opinion target extraction. We compare the binary approach and the multi-class approach on available datasets in English and French, and conduct several investigative experiments to explain the promising results. Our experiments show that entity-based labelling not only improves opinion extraction in a single-domain setting, but also lets us combine training data from different domains to improve the extraction, a result that has never been achieved with target-based training data.
Karen Mazidi. Automatic Question Generation from Passages
Abstract: Prior question generation systems typically create questions from sentences in a text. In contrast, the work presented here creates questions from a text passage in a holistic approach to natural language understanding and generation. Several NLP techniques including topic modeling are combined in an ensemble approach to identify important concepts, which then are used to create questions. Evaluation of the generated questions revealed that they are of high linguistic quality and are also important, conceptual questions, compared to questions generated by sentence-level question generation systems.
Yifan Zhang, Arjun Mukherjee and Fan Yang. Supervised Domain Adaptation via Label Alignment for Opinion Expression Extraction
Abstract: In this paper we propose a supervised domain adaptation technique for the opinion expression extraction task. The technique generates low-dimensional projections that can improve model performance in the target domain by aligning features with true label sequences. We test our methods on product reviews and observe significant improvements over baseline methods.
Fatma Ben Mesmia, Kaouther Bouabidi, Nathalie Friburger, Kais Haddar and Denis Maurel. Extraction of Semantic Relation between Arabic Named Entities Using Different Kinds of Transducer Cascades
Abstract: The extraction of Semantic Relations (SR) is an important task allowing the identification of relevant semantic information in annotated textual resources. Extracting SR between Named Entities (NE) is a process which consists in identifying the significant semantic links relating them. This process is very useful for enhancing the performance of NLP applications, such as Question Answering systems. In this paper, we propose a rule-based method to extract and annotate SR between Arabic NEs (ANE) using an annotated Arabic Wikipedia corpus. Our proposed method is composed of two main cascades regrouping, respectively, analysis and synthesis transducers. The analysis transducer cascade is dedicated to extracting five SR types: synonymy, meronymy, accessibility, functional and proximity. The synthesis cascade is devoted to normalizing the SR and NE annotation using the TEI (Text Encoding Initiative) recommendations. Furthermore, the established transducer cascades are implemented and generated using the CasSys tool available under the Unitex linguistic platform. Finally, the obtained measure values are encouraging.
Svetlana Sheremetyeva. On Universal Computational Formalisms for Rule-Based NLP
Abstract: After more than a decade of the dominance of the statistical paradigm in NLP, a new wave of R&D has reverted to the primacy of rule-based approaches. This is particularly true for those applications where high-quality results are a must. The ultimate examples of such applications are the analysis, generation and machine translation of patent claims. It is well known, however, that high-quality rule-based NLP requires rich knowledge resources (world models, grammar rules and lexicons) which nowadays are painstakingly handcrafted from scratch for every new application, language or language pair.
The paper suggests a set of universal computational formalisms which facilitate rule and lexicon acquisition and allow migrating from one rule-based application to another within one language or cross-linguistically, reusing processing algorithms and, partially, linguistic knowledge, thus saving a great deal of development effort. The suggested rule-based approach still absorbs the benefits of the statistics-oriented methodologies that sped up the development of new resources and systems.
The basic idea of our approach to the computational formalisms is two-fold. On the one hand, they are based on grammars that deal in a uniform way with atomic informational structures and then manipulate these structures by means of a few well-defined operations that build new, more complex structures. On the other hand, within the frame of this formalism both the atomic and the complex structures are application-oriented and motivated by processing considerations. We consider a computational formalism to be well-defined if its semantics (mind, not the semantics of the language described by the formalism) is also well-defined, which we achieve by characterizing it algebraically.
We describe several such formalisms that take care of the main traditional processing steps in rule-based NLP: tagging, syntactic and shallow semantic parsing (which results in a number of atomic predicate-argument structures), transfer (for multilingual applications) and generation, the latter including a number of formalisms for building trees of atomic structures, linearization and grammaticalization. The formalisms for different steps of processing can be applied in a cascade mode and are defined over a specially designed tagset, syncretically coding morphosyntactic and semantic information.
All formalisms are implemented in a set of compilers (modules of a developer's environment), by means of which a developer can acquire or edit processing rules that are immediately propagated into a corresponding NLP program, which is language-independent. The formalisms described have been successfully used in several unilingual and multilingual systems, which are illustrated in the paper.
Kanako Komiya, Shota Suzuki, Minoru Sasaki, Hiroyuki Shinnou and Manabu Okumura. Domain Adaptation for Word Sense Disambiguation using Word Embeddings
Abstract: In this paper, we propose domain adaptation for word sense disambiguation (WSD) using word embeddings. The validity of word embeddings from a huge corpus, e.g., Wikipedia, for WSD has already been shown, but their validity in a domain adaptation framework has not been discussed before. In addition, if they are valid, the difference in effects according to the genre of the corpora is still unknown. Therefore, we investigate the performance of domain adaptation in WSD using word embeddings from source, target, and general corpora and examine (1) whether the word embeddings are valid for domain adaptation of WSD and (2) if they are, the effects in accordance with the genre of the corpora. Experiments using Japanese corpora revealed that the accuracy of WSD was highest when we used the word embeddings obtained from the target corpus.
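As a hedged illustration of how such embeddings can feed a WSD classifier (an assumption-laden Python sketch, not the authors' system), each occurrence of an ambiguous word can be represented by the average embedding of its context words and passed to a standard classifier; swapping between source-, target-, and general-corpus embeddings changes only the embeddings argument.

    import numpy as np
    from sklearn.svm import SVC

    def context_vector(tokens, target_idx, embeddings, window=5, dim=100):
        # average the embeddings of words around the target occurrence
        lo, hi = max(0, target_idx - window), target_idx + window + 1
        vecs = [embeddings[w] for i, w in enumerate(tokens[lo:hi], start=lo)
                if i != target_idx and w in embeddings]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    def train_wsd(contexts, senses, embeddings):
        # contexts: list of (token_list, target_index); senses: annotated sense labels
        X = np.vstack([context_vector(toks, idx, embeddings) for toks, idx in contexts])
        return SVC(kernel="linear").fit(X, senses)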
Lionel Ramadier and Mathieu Lafourcade. Radiological text simplification using a general knowledge base
Abstract: In the medical domain, text simplification is both a desirable and challenging natural language processing task. Medical texts can be difficult for patients to understand because of the presence of specialized medical terms. Replacing these difficult terms with easier words can improve patients' understanding, but this is not as straightforward as one might think because of polysemy, elisions and fuzzy semantics. In this paper, we present a lexical-network-based method to simplify health information in French. We deal with semantic difficulty by replacing a difficult term with supposedly easier synonyms, or with semantically related terms, with the help of a French lexical-semantic network. We extract the semantic and lexical information present in the network. In this paper, we present such a method for text simplification along with its qualitative evaluation.
Bensalem Raja, Kadri Nesrine, Kais Haddar and Philippe Blache. Evaluation and enrichment of Stanford Parser using an Arabic Property Grammar
Abstract: So far, the Stanford Arabic statistical parser has been considered the best parsing tool in terms of performance compared to other parsers. This performance is not stable and may vary depending on the given corpus. A more detailed method to evaluate this parser may help users address the causes of a performance loss. For this reason, we propose to evaluate the Stanford Parser by verifying the satisfaction of syntactic constraints (called properties) on the analysis results of the corpus. We may obtain these properties from a reference Arabic property grammar. In this way, we enriched the simple representation of the parsing result with syntactic properties. This makes explicit several pieces of implicit information, namely the relations between syntactic units. Therefore, we obtain both a detailed method for the evaluation of parsers and a more syntactically informative representation of the analysis. We obtained detailed and encouraging results.
Saint Dizier Patrick. Towards incoherent argument mining: the case of incoherent requirements
Abstract: Requirements form a specific class of arguments. They are designed to specify the features of systems. Even for a simple system, several thousand requirements produced by different authors are often needed. It is then frequent to observe overlap and incoherence problems. This has several consequences: significant waste of time to identify incoherences, difficulty in updating requirements, and, most importantly, risks of misconceptions by manufacturers and users, leading to incorrect realizations and exposure to health or ecological risks.

It is very difficult and costly to manually identify incoherent requirements in large specifications. In this paper, we propose a method to construct a corpus of various types of incoherent requirements and a categorization method that leads to the definition of patterns to detect incoherence. In this contribution, we focus on incoherences (1) which can be detected solely from linguistic factors and (2) which concern pairs of requirements. These represent about 40% of the different types of incoherences; the other types often require extensive domain knowledge and reasoning.

This contribution opens new perspectives (1) on incoherence analysis in texts and (2) on mining incoherent requirements and arguments more generally.
Ágnes Kalivoda. Hungarian particle verbs in a corpus-driven approach
Abstract: The meaning and the argument structure of particle verbs are determined by the combination of a verb and a particle. In Hungarian, verbal particles (preverbs) can occupy various positions in the sentence: they can be preverbal (detached from the verb), immediately preverbal or postverbal. The syntax of these particles is discussed in a wide range of theoretical literature. This paper presents a performance-based analysis, using a corpus-driven method to reveal the distribution patterns of verbal particles in more than 21.5 million sentences. In order to obtain these data, we improved the POS-tagging of verbal particles and developed a semi-automatic method to decide which verb a detached particle belongs to. The distribution patterns gave us better insight into the phonological and pragmatic factors that may determine the position of verbal particles.
Di Shang and Xin-Yu Dai. A Multi-view Clustering Model for Event Detection in Twitter
Abstract: Event detection in Twitter is an attractive and hard task. Existing methods mainly consider word co-occurrence or the topic distribution of the tweets to detect an event. Few of them consider the time-series information in the text stream. In this paper, for event detection in Twitter, we propose a novel multi-view clustering model which can consider both topic information and time-series information. First, we build a topic similarity matrix and a time-series similarity matrix by using a topic model and wavelet analysis, respectively. Then, a multi-view clustering algorithm is used to group keywords. Each cluster of keywords is finally represented as an event. The experiments show that our method achieves better performance than other related work.
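A rough Python sketch of the fusion-and-clustering step described above (the paper's actual multi-view algorithm is more involved, and the equal weighting here is an assumption): two keyword-by-keyword similarity matrices, one from topic distributions and one from wavelet-transformed time series, are fused and clustered so that each cluster of keywords stands for an event.

    import numpy as np
    from sklearn.cluster import SpectralClustering

    def cluster_keywords(topic_sim, time_sim, n_events=10, alpha=0.5):
        # topic_sim, time_sim: (n_keywords x n_keywords) similarity matrices in [0, 1]
        fused = alpha * topic_sim + (1 - alpha) * time_sim
        fused = (fused + fused.T) / 2          # keep the affinity matrix symmetric
        labels = SpectralClustering(n_clusters=n_events,
                                    affinity="precomputed").fit_predict(fused)
        return labels                          # one cluster of keywords per event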
Yiou Wang, Koji Satake, Takeshi Onishi and Hiroshi Masuichi. Customer Churn Prediction using Sentiment Analysis and Text Classification of VOC
Abstract: In this paper, we explore the utility of sentiment analysis and text classification of the voice of the customer (VOC) for improving churn prediction, which is the task of detecting customers who are about to quit. Our work is motivated by the observation that increasing customer satisfaction will reduce churn, and that customer satisfaction can be reflected to some degree by applying NLP techniques to VOC, the unstructured textual information which captures a view of customers' attitudes and feedback. To the best of our knowledge, this is the first work that introduces text classification of VOC to the churn prediction task. Experiments show that adding VOC analysis to a conventional churn prediction model results in a significant increase in predictive performance.
Filippo Geraci and Tiziano Papini. Approximating Multi-Class Text Classification via Automatic Generation of Training Examples
Abstract: Text classification is among the most broadly used machine learning tools in computational linguistics. Web information retrieval is one of the most important sectors that has taken advantage of this technique. Applications range from page classification, used by search engines, to URL classification used for focused crawling and on-line time-sensitive applications.
Due to the pressing need for the highest possible accuracy, a supervised learning approach is always preferred when an adequately large set of training examples is available.
Nonetheless, since building such an accurate and representative training set often becomes impractical when the number of classes increases beyond a few units, alternative unsupervised or semi-supervised approaches have emerged.
The use of standard web directories as a source of examples can be prone to undesired effects due, for example, to the presence of maliciously misclassified web pages. In addition, this option is subject to the existence of all the desired classes in the directory hierarchy.

Taking as input a textual description of each class and a set of URLs, in this paper we propose a new framework to automatically build a representative training set able to reasonably approximate the classification accuracy obtained by means of a manually-curated training set.
Our approach leverages the observation that a non-negligible fraction of website names is the result of the juxtaposition of a few keywords. Hence, the entire URL can often be converted into a meaningful text snippet. When this happens, we can label the URL by measuring its degree of similarity with each class description. The text contained in the pages corresponding to labelled URLs can be used as a training set for any subsequent classification task (not necessarily on the web).

Experiments on a set of 20 thousand web pages belonging to 9 categories have shown that our auto-labelling framework is able to attain over 88% of the accuracy of a purely supervised classifier trained with manually-curated examples.
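The URL-to-snippet idea can be made concrete with a small Python sketch (the greedy segmentation, vocabulary and class descriptions below are illustrative assumptions, not the paper's exact procedure): the host name is split into known words and the URL is labelled with the most lexically similar class description.

    def segment(host, vocab):
        words, i = [], 0
        while i < len(host):
            for j in range(len(host), i, -1):   # greedy longest-match split
                if host[i:j] in vocab:
                    words.append(host[i:j]); i = j; break
            else:
                i += 1                          # skip characters that start no known word
        return words

    def label_url(host, class_descriptions, vocab):
        tokens = set(segment(host.split(".")[0], vocab))
        overlap = lambda desc: len(tokens & set(desc.lower().split()))
        return max(class_descriptions, key=lambda c: overlap(class_descriptions[c]))

    vocab = {"cheap", "flight", "ticket", "recipe", "cook"}
    classes = {"travel": "cheap flight ticket booking", "food": "recipe cook kitchen"}
    print(label_url("cheapflightticket.com", classes, vocab))   # -> "travel"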
Sarah Kohail and Chris Biemann. Matching, Re-ranking and Scoring: Learning Textual Similarity by Incorporating Dependency Graph Alignment and Coverage Features
Abstract: In this work, we introduce a supervised model for learning textual similarity, which can identify and score the alternative versions of a given query text. By combining dependency graph similarity and coverage features with lexical similarity measures using neural networks, we show that the documents most relevant to a given text can be ranked and scored more accurately than if the similarity measures were used in isolation. Additionally, we introduce an approximate dependency subgraph alignment approach allowing node gaps and mismatches, where a certain word in one dependency graph cannot be mapped to any word in the other graph. We apply our model to two different applications, namely re-ranking for improving document retrieval precision on a new dataset, and automatic short answer scoring on a standard dataset. Experimental results indicate that our approach is easily adaptable to different tasks and languages, and works well for long texts as well as short texts.
Maite Giménez Fayos, Roberto Paredes Palacios and Paolo Rosso. Personality Recognition using Convolutional Neural Networks
Abstract: Personality Recognition is an emerging task in Natural Language Processing due to its potential applications. However, there is still limited literature on the topic, and those models which address this task rely on handcrafted resources; therefore, they are restricted by the domain of the problem and by the availability of resources. We propose a Convolutional Neural Network architecture trained using pre-trained word embeddings that is capable of learning the best features for the task at hand without any external dependence. The results show the potential of this approach. The proposed model achieves results comparable with state-of-the-art models and is able to predict the personality traits of authors regardless of the social network and the availability of resources.
Delia Irazu Hernandez Farias, Cristina Bosco, Viviana Patti and Paolo Rosso. Sentiment Polarity Classification of Figurative Language: Exploring the Role of Irony-Aware and Multifaceted Affect Features
Abstract: The presence of figurative language represents a big challenge for sentiment analysis. In this work, we address the task of assigning sentiment polarity to Twitter texts when figurative language is employed, with a special focus on the presence of ironic devices. We introduce a pipeline model which aims to assign a polarity value exploiting, on the one hand, irony-aware features, which rely on the outcome of a state-of-the-art irony detection model, and, on the other hand, a wide range of affective features that cover different facets of affect, exploiting information from various sentiment and emotion lexical resources for English available to the community, possibly referring to different psychological models of affect. The proposed method has been evaluated on a set of tweets especially rich in figurative language devices, proposed as a benchmark in the shared task on "Sentiment Analysis of Figurative Language" at SemEval-2015. Experiments and results of feature ablation show the usefulness of irony-aware features and the impact of using different affective lexicons for the task.
Keiji Yasuda, Akio Ishikawa, Kazunori Matsumoto, Fumiaki Sugaya, Panikos Heracleous and Masayuki Hashimoto. Building a Location Dependent Dictionary for Speech Translation Systems
Abstract: Mistranslation or dropping of proper nouns affects the quality of machine translation output. In this paper, we propose an automatic method to build a location-dependent dictionary for a speech translation system. The method consists of two parts: a location-dependent word extraction part and a word classification part. The first part extracts words from microblog data based on Akaike's Information Criterion, and the second part classifies the words using machine learning with a crawled corpus as training data.
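One way to read the AIC-based extraction step is as a model comparison per word (a hedged Python sketch with toy counts; the binomial formulation is our assumption, not necessarily the paper's): the word is kept as location-dependent if a model with separate occurrence rates inside and outside the target location has a lower AIC than a single shared-rate model.

    import math

    def binom_loglik(c, n):
        # maximum log-likelihood of observing c occurrences in n tokens
        p = c / n
        if p in (0.0, 1.0):
            return 0.0
        return c * math.log(p) + (n - c) * math.log(1 - p)

    def location_dependent(c_loc, n_loc, c_bg, n_bg):
        ll_shared = binom_loglik(c_loc + c_bg, n_loc + n_bg)               # 1 parameter
        ll_split = binom_loglik(c_loc, n_loc) + binom_loglik(c_bg, n_bg)   # 2 parameters
        return (2 * 2 - 2 * ll_split) < (2 * 1 - 2 * ll_shared)            # compare AIC values

    # toy counts: a word in 40 of 1,000 location tweets vs. 50 of 100,000 background tweets
    print(location_dependent(40, 1000, 50, 100000))   # -> True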
Tomohiro Sakaguchi and Sadao Kurohashi. Timeline Generation based on a Two-stage Event-time Anchoring Model
Abstract: The timeline construction task has become popular as a form of multi-document summarization. In dealing with such a problem, it is essential to anchor each event to an appropriate time expression in a document. In this paper, we present a supervised machine learning model, a two-stage event-time anchoring model. In the first stage, our system estimates event-time relations using local features. In the second stage, the system re-estimates them using the results of the first stage and global features. Our experimental results show that the proposed method surpasses the state-of-the-art system by 3.5 F-score points in the TimeLine shared task of SemEval 2015.
Aidar Khusainov. A Comparative Analysis of Speech Recognition Systems for the Tatar Language
Abstract: This paper presents a comparative study of several different approaches to speech recognition for the Tatar language. All the compared systems use a corpus-based approach, so recent results in speech and text corpora creation are also shown. The recognition systems differ in acoustic modelling algorithms, basic acoustic units, and language modelling techniques. The DNN-based system shows the best recognition result obtained on the test part of the speech corpus.
Jiachen Du, Ruifeng Xu and Lin Gui. Stance Classification with Target-specific Neural Attention
Abstract: Classifying the stance expressed in a text toward a specific target is an emerging problem in opinion mining. A major difference between stance detection and traditional aspect-level sentiment classification is that the target of the stance might not be explicitly mentioned in the text. In this paper, we show that the stance polarity of a text is not merely dependent on the content but is also highly determined by the concerned target. To this end, we propose a neural-network-based model which incorporates target-specific information into stance classification using a novel attention mechanism. The proposed attention mechanism can focus on critical parts of a text. We evaluate our model on the SemEval 2016 Task 6 Twitter Stance Detection corpus, achieving satisfactory performance. Our model achieves significant and consistent improvements on this task as compared with baselines.
Mohamed Ali Batita, Mohsen Maraoui, Souheyl Mallat and Mounir Zrigui. The Enrichment of Arabic WordNet Antonym Relations
Abstract: Arabic WordNet (AWN) is a lexical database, freely available, and a useful resource for Natural Language Processing (NLP) research and applications (Information Retrieval, Machine Translation...). This project was built following the methods developed for Princeton WordNet (PWN) and EuroWordNet (EWN). However, this database still needs substantial enrichment to better support NLP applications. Compared with other wordnets, AWN has very poor content at both the quantity and quality levels. In this paper, we concentrate on the quality aspect, especially on the antonym relations. Therefore, we propose a pattern-based approach to extend these relations, using an Arabic corpus and a corpus analysis tool. Our proposed method relies on two steps: pattern definition and automatic antonym pair extraction. The evaluation of our approach has given good results.
Soufian Salim, Nicolas Hernandez and Emmanuel Morin. A meta-model for dialogue act taxonomy interoperability
Abstract: Building a meta-model of dialogue act taxonomies, such as those of DAMSL, DiAML or the HCRC dialogue structure, would allow for the modeling of taxonomies themselves through the characterization of their labels using primitive features. Doing so enables the re-exploitation of annotated data for automatic dialogue act recognition tasks across taxonomies, i.e. it gives us the means to make a classifier learn from data annotated according to taxonomies different from the target taxonomy. We propose a meta-model covering several well-known taxonomies of dialogue acts, and we demonstrate its usefulness for the task of cross-taxonomy dialogue act recognition.
Adnen Mahmoud, Souheyl Mallat and Mounir Zrigui. A Deep Learning Approach for Arabic Plagiarism Detection
Abstract: The main challenge of plagiarism detection is how to detect the semantic relationship between a suspect text document and the source text document. Nowadays, combinations of Natural Language Processing (NLP) and deep learning based approaches are booming in the field of text analysis, including text classification, machine translation, text similarity detection, etc.

In this context, we propose a deep learning based method to detect Arabic plagiarism composed of the following phases: first, we start with a preprocessing phase by extracting the relevant information from the text documents. Then, the word2vec algorithm is used to generate word vector representations, which are subsequently combined to generate sentence vector representations. Finally, we use a Convolutional Neural Network (CNN) to improve the ability to capture statistical regularities in the context of sentences, which facilitates measuring the similarity between the representations of source and suspicious sentences.
The evaluation of our proposed approach gave promising results in terms of precision.
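A minimal sketch of the sentence-representation step only (Python with gensim; the toy sentences are invented, and the paper's CNN-based similarity model is not reproduced here): word2vec vectors are averaged into sentence vectors, which can then be compared by cosine similarity as a first-pass screen for suspicious sentence pairs.

    import numpy as np
    from gensim.models import Word2Vec

    sentences = [["the", "economy", "grew", "last", "year"],
                 ["economic", "growth", "was", "strong", "last", "year"]]
    w2v = Word2Vec(sentences, vector_size=50, min_count=1, window=3, epochs=50)

    def sent_vec(tokens, model):
        # average the word vectors of the tokens found in the vocabulary
        vecs = [model.wv[w] for w in tokens if w in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    print(cosine(sent_vec(sentences[0], w2v), sent_vec(sentences[1], w2v)))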
Ha-Nguyen Tran, Erik Cambria and Hoang Giang Do. Efficient Semantic Search over Structured Web Data: A GPU Approach
Abstract: Semantic search is an advanced topic in Information Retrieval (IR) and has attracted increasing attention in recent years. The growing availability of structured semantic data offers opportunities for semantic search engines, which can support more expressive queries that address complex information needs. However, due to the fact that many new concepts (mined from the Web or learned through crowd-sourcing) are continuously integrated into knowledge bases, those search engines face challenging issues of performance and scalability. In this paper, we present a parallel method, called gSparql, which utilizes the massive computation power of general-purpose GPUs to accelerate query processing and inference. Our method is based on the backward-chaining approach, which makes inferences at query time. The experimental results show that gSparql outperforms the state-of-the-art algorithm and efficiently answers structured queries on large datasets.
Necva Bölücü and Burcu Can. Joint POS Tagging and Stemming for Agglutinative Languages
Abstract: The number of word forms in agglutinative languages is theoretically infinite, which introduces sparsity in many natural language processing tasks. Part-of-speech tagging is one of the tasks that often suffers from sparsity. In this paper, we present an unsupervised model based on Bayesian Hidden Markov Models for joint part-of-speech tagging and stemming for agglutinative languages. We use stemming in order to reduce the sparsity and emit the stems rather than the words. Our results show that joint part-of-speech tagging and stemming improves both stemming and tagging scores. We present results for Turkish as an agglutinative language and English as a morphologically less complex language.
Hicham G. Elmongui and Riham Mansour. Curator: Enhancing Micro-blogs Ranking by Exploiting User's Context
Abstract: Micro-blogging services have emerged as a powerful, real-time way to disseminate information on the web. Only a small fraction of the colossal volume of posts is relevant. We propose Curator, a micro-blogging recommendation system that ranks micro-blogs appearing on a user's timeline according to her context. Curator learns the user's time-variant preferences from the text of the micro-blogs the user interacts with. Furthermore, Curator infers the user's home location and the micro-blog's subject location with the help of textual features. Precisely, we use a set of machine learning and natural language processing techniques to analyze the user's context dynamically from the micro-blogs and rank them accordingly. Curator's extensive performance evaluation on a publicly available dataset shows that it outperforms the competitive state of the art by up to 154% on NDCG@5 and 105% on NDCG@25. The results also show that location is a salient feature in Curator.
Christoph Kilian Theil, Sanja Štajner, Heiner Stuckenschmidt and Simone Paolo Ponzetto. Automatic Detection of Uncertain Statements in the Financial Domain
Abstract: The automatic detection of uncertain statements can benefit NLP tasks such as deception detection and information extraction. Furthermore, it can enable new analyses in social sciences such as business, where the quantification of uncertainty or risk plays a significant role. In spite of this, the task has only been performed on biomedical scientific texts and Wikipedia articles. Hence, we approach the automatic detection of uncertain statements as a binary sentence classification task on transcripts of spoken language in the financial domain. We present a new dataset and, besides using features proven on the encyclopedic domain such as bag-of-words, part-of-speech tags, and dictionaries, develop rule-based features tailored to our task. We provide a systematic analysis showing which types of features perform best in the financial domain as opposed to the already explored encyclopedic domain.
Yoann Dupont, Marco Dinarelli, Isabelle Tellier and Christian Lautier. Structured Named Entity Recognition by Cascading CRFs
Abstract: NER is an important task in NLP, often used as a basis for further processing. A new challenge has emerged in the last few years: structured named entity recognition, where not only named entities must be identified but also their hierarchical components. In this article, we describe a cascading CRF approach to address this challenge. It reaches the state of the art on a structured NER challenge while remaining very simple. We then offer an error analysis of our system based on a detailed, yet simple, error classification.
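The cascading idea can be sketched as two chained sequence labellers, where the second stage consumes the first stage's predictions as features (an illustrative Python/sklearn-crfsuite sketch under our own feature assumptions, not the authors' exact pipeline):

    import sklearn_crfsuite

    def token_feats(sent, i):
        w = sent[i]
        return {"word": w.lower(), "is_upper": w[0].isupper(), "suffix": w[-3:]}

    def featurize(sents, prev_preds=None):
        X = []
        for k, sent in enumerate(sents):
            feats = [token_feats(sent, i) for i in range(len(sent))]
            if prev_preds is not None:          # inject the stage-1 labels as extra features
                for i, f in enumerate(feats):
                    f["stage1"] = prev_preds[k][i]
            X.append(feats)
        return X

    def train_cascade(sents, y_top, y_components):
        # sents: token lists; y_top: top-level entity labels; y_components: nested labels
        crf1 = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
        crf1.fit(featurize(sents), y_top)
        stage1 = crf1.predict(featurize(sents))
        crf2 = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
        crf2.fit(featurize(sents, stage1), y_components)
        return crf1, crf2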
Mingyang Xu, Paul Jones, Ruixin Yang and Nagiza Samatova. Mining Aspect-Specific Opinions from Online Reviews Using Latent Embedding Structured Topic Model
Abstract: Online reviews often contain users' specific opinions on aspects (features) of items. These opinions are very useful to merchants and customers, but manually extracting them is time-consuming. Several topic models have been proposed to simultaneously extract item aspects, user opinions on the aspects, and to detect the sentiment of the opinions. However, existing models tend to find poor aspect-opinion associations when limited examples of the required word co-occurrences are available. Existing models often also assign incorrect sentiment to words. In this paper, we propose a Latent embedding structured Opinion mining Topic model, called the LOT, to simultaneously discover relevant aspect-level specific opinions and to assign accurate sentiment to words. Our model does this in a way that works well with small or large numbers of reviews. Experimental results for topic coherence, document sentiment classification, and a human evaluation all show that our proposed model achieves significant improvements over several state-of-the-art baselines.
Panikos Heracleous, Akio Ishikawa, Keiji Yasuda, Hiroyuki Kawashima, Fumiaki Sugaya and Masayuki Hashimoto. Machine learning approaches for speech emotion recognition: Classic and novel advances
Abstract: Speech is the most natural form of communication for human beings, and among other things, it provides information about the speaker's emotional state. The current study focuses on automatic speech emotion recognition based on classic and novel machine learning approaches. Specifically, we use individual Gaussian mixture models (GMM) trained for each emotion, a universal background GMM (UBM-GMM) adapted to each emotion using maximum a posteriori (MAP) adaptation, and an approach based on the i-vector paradigm, widely used in speaker recognition and language identification and adapted here to emotion recognition. When using individual GMMs, a novel technique based on multiple classifiers and late fusion is also applied. In this case, a 90.9% recognition rate has been obtained. When the state-of-the-art i-vector-based method, along with a probabilistic linear discriminant analysis (PLDA) model, is used, a 91.4% average rate for speaker-independent Japanese speech emotion recognition is achieved, which is a very promising result and superior to similar studies. In addition to Japanese emotion recognition, pair-wise recognition of seven emotions in German has also been conducted. In this experiment, an 89.2% rate has been achieved.
Vincent Berment, Christian Boitet, Jean-Philippe Guilbaud and Jurgita Kapočiūtė-Dzikienė. Several ways to use the lingwarium.org online MT collaborative platform to develop rich morphological analyzers
Abstract: We will demonstrate several morphological analyzers of languages for which morphological analysis is very difficult, and/or that are under-resourced. It will cover at least French, German, Khmer, Lao, Lithuanian, Portuguese, Quechua, Spanish and Russian. These morphological analyzers all run on the collaborative platform lingwarium.org that supports the ARIANE-H lingware development environment. Some will also be presented as stand-alone Windows applications.
Serkan Özen and Burcu Can. Building Morphological Chains for Agglutinating Languages
Abstract: We extend the Morphological Chain model (Narasimhan et al., 2015) by expanding the candidate space used in the model recursively to cover more split points for agglutinating languages. The results show that we improve the scores by 10%, achieving an F-measure of 71% for Turkish, which is higher than the current state of the art. This shows that candidate generation plays an important role in such a log-linear model with contrastive estimation.
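The expanded candidate space can be pictured with a small recursive generator (a toy Python sketch of the general idea, not the authors' search or scoring): every suffix split of a word is proposed and the remaining stem is split again, yielding morpheme chains such as ev+ler+in+de.

    def candidate_chains(word, min_stem=2):
        # all ways to peel suffixes off the word, recursing on the remaining stem
        chains = [[word]]                               # the unsegmented option
        if len(word) <= min_stem:
            return chains
        for i in range(min_stem, len(word)):
            stem, suffix = word[:i], word[i:]
            for sub in candidate_chains(stem, min_stem):
                chains.append(sub + [suffix])
        return chains

    for chain in candidate_chains("evlerinde")[:8]:
        print("+".join(chain))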
Manfred Klenner, Simon Clematide and Don Tuggener. Verb-mediated Composition of Attitude Relations Comprising Reader and Writer Perspective
Abstract: We introduce a model for attitude prediction that takes the reader and the writer perspectives into account and enables a joint reception of the attitudinal dispositions involved. For instance, a proponent of the reader might turn out to be a villain, or some moral values of his might be negatively affected. A formal model is specified that induces, in a compositional, bottom-up manner, informative relation tuples which indicate perspectives on attitudes. This enables the reader to focus on interesting cases, since they are directly accessible from the parts of the relation tuple.
Marina Litvak, Natalia Vanetik and Lei Li. Summarizing Weibo with Topics Compression
Abstract: Extractive text summarization aims at selecting a small subset of sentences so that the contents and meaning of the original document are preserved in the best possible way. In this paper we describe an unsupervised approach to extractive summarization. It combines hierarchical topic modeling (TM) with the Minimal Description Length (MDL) principle and applies them to the Chinese language. Our summarizer strives to extract information that provides the best description of text topics in terms of MDL. This model is applied to the NLPCC 2015 Shared Task of Weibo-Oriented Chinese News Summarization [1], where Chinese texts from news articles were summarized with the goal of creating short meaningful messages for Weibo [2]. The experimental results disclose the superiority of our approach over other summarizers from the NLPCC 2015 competition.
Koji Matsuda, Mizuki Sango, Naoaki Okazaki and Kentaro Inui. Monitoring Geographical Entities with Temporal Awareness in Tweets
Abstract: In order to extract real-time information referring to a specific place from SNS text such as tweets, it is necessary to analyze the temporal semantics of the mentions. To solve this problem, we created a corpus with multiple annotations using crowdsourcing for more than 10,000 tweets. We constructed automatic analysis models based on multiple neural networks and compared their characteristics.
Raksha Sharma, Dibyendu Mondal and Pushpak Bhattacharyya. A Comparison among Significance Tests and Other Feature Building Methods for Sentiment Analysis: A First Study
Abstract: Words that participate in the sentiment (positive or negative) classification decision are known as significant words for sentiment classification. Identification of such significant words as features from the corpus reduces the amount of irrelevant information in the feature set under supervised sentiment classification settings. In this paper, we conceptually study and compare various types of feature building methods, viz., unigrams, TFIDF, Relief, Delta-TFIDF, the χ2 test and Welch's t-test, for the sentiment analysis task. Unigrams and TFIDF are the classic ways of feature building from the corpus. Relief, Delta-TFIDF and the χ2 test have recently attracted much attention for their potential use as feature selection methods in sentiment analysis. In contrast, the t-test is the least explored for the identification of significant words from the corpus as features.

We show the effectiveness of significance tests over other feature building methods for three types of sentiment analysis tasks, viz., in-domain, cross-domain and cross-lingual. Delta-TFIDF, the χ2 test and Welch's t-test compute the significance of a word for classification in the corpus, whereas unigrams, TFIDF and Relief do not observe the significance of the word for classification. Furthermore, significance tests can be divided into two categories: bag-of-words-based tests and distribution-based tests. A bag-of-words-based test observes the total count of a word in different classes to find the significance of the word, while a distribution-based test observes the distribution of the word. In this paper, we substantiate that the distribution-based Welch's t-test is more accurate than the bag-of-words-based χ2 test and Delta-TFIDF in identifying significant words from the corpus.
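As a small worked example of the distribution-based test (illustrative Python with toy documents; the preprocessing and any p-value threshold are assumptions), Welch's t-test can be run on a word's per-document relative frequencies in the positive versus negative class, and words with low p-values retained as features:

    from scipy import stats

    def word_freq(doc_tokens, word):
        return doc_tokens.count(word) / max(len(doc_tokens), 1)

    def welch_score(word, pos_docs, neg_docs):
        pos = [word_freq(d, word) for d in pos_docs]        # frequencies in the positive class
        neg = [word_freq(d, word) for d in neg_docs]        # frequencies in the negative class
        return stats.ttest_ind(pos, neg, equal_var=False)   # Welch's t-test (unequal variances)

    pos_docs = [["great", "movie", "great", "acting"], ["great", "fun"], ["loved", "it"]]
    neg_docs = [["boring", "movie"], ["not", "great"], ["awful", "plot", "boring"]]
    print(welch_score("great", pos_docs, neg_docs))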
Hani Daher, Romaric Besançon, Olivier Ferret, Hervé Le Borgne, Anne-Laure Daquo and Youssef Tamaazousti. Supervised Learning of Entity Disambiguation Models by Negative Sample Selection
Abstract: The objective of Entity Linking is to connect an entity mention in a text to a known entity in a knowledge base. The general approach for this task is to generate, for a given mention, a set of candidate entities from the base and to determine, in a second step, the best one. This paper focuses on this last step and proposes a method based on learning a function that discriminates an entity from its most ambiguous ones. Our contribution lies in the strategy to learn such a model efficiently while keeping it compatible with large knowledge bases. We propose three strategies with different efficiency/performance trade-offs, which are experimentally validated on six datasets of the TAC evaluation campaigns using Freebase and DBpedia as reference knowledge bases.
Karim Sayadi, Mansour Hamidi, Marc Bui, Marcus Liwicki and Andreas Fischer. Character-Level Dialect Identification in Arabic Using Long Short-Term Memory
Abstract: In this paper, we introduce a neural network based sequence learning approach for the task of Arabic dialect classification. Character models based on recurrent neural networks with Long Short-Term Memory (LSTM) are suggested to classify short texts, such as tweets, written in different Arabic dialects. The LSTM-based character models can handle long-term dependencies in character sequences and do not require a set of linguistic rules at word-level, which is especially useful for the rich morphology of the Arabic language and the lack of strict orthographic rules for dialects. On the Tunisian Election Twitter dataset, our system achieves a promising average accuracy of 92.2% for distinguishing Modern Standard Arabic from Tunisian dialect. On the Multidialectal Parallel Corpus of Arabic, the proposed character models can distinguish six classes, Modern Standard Arabic and five Arabic dialects, with an average accuracy of 63.4%. They clearly outperform a standard word-level approach based on statistical n-grams as well as several other existing systems.
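A skeleton of a character-level LSTM classifier of the kind described above (PyTorch; the vocabulary size, dimensions and two-class setup are illustrative assumptions, not the paper's configuration):

    import torch
    import torch.nn as nn

    class CharLSTMClassifier(nn.Module):
        def __init__(self, n_chars, emb_dim=64, hidden=128, n_classes=2):
            super().__init__()
            self.embed = nn.Embedding(n_chars, emb_dim, padding_idx=0)
            self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_classes)

        def forward(self, char_ids):              # char_ids: (batch, seq_len) character indices
            x = self.embed(char_ids)
            _, (h, _) = self.lstm(x)              # h: (num_layers, batch, hidden)
            return self.out(h[-1])                # logits over dialect classes

    model = CharLSTMClassifier(n_chars=200)
    dummy = torch.randint(1, 200, (4, 140))       # 4 tweets, 140 characters each
    print(model(dummy).shape)                     # torch.Size([4, 2])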
Girish Palshikar, Sachin Pawar, Saheb Chourasia and Nitin Ramrakhiyani. Mining Supervisor Evaluation and Peer Feedback in Performance Appraisals
Abstract: Performance appraisal (PA) is an important HR process to periodically measure and evaluate every employee's performance vis-a-vis the goals established by the organization. A PA process involves purposeful multi-step multi-modal communication between employees, their supervisors and their peers, such as self-appraisal, supervisor assessment and peer feedback. Analysis of the structured data and text produced in PA is crucial for measuring the quality of appraisals and tracking actual improvements. In this paper, we apply text mining techniques to produce insights from PA text. First, we perform sentence classification to identify strengths, weaknesses and suggestions of improvements found in the supervisor assessments and then use clustering to discover broad categories among them. Next we use multi-class multi-label classification techniques to match supervisor assessments to predefined broad perspectives on performance. Finally, we propose a short-text summarization technique to produce a summary of peer feedback comments for a given employee and compare it with manual summaries. All techniques are illustrated using a real-life dataset of supervisor assessment and peer feedback text produced during the PA of 4528 employees in a large multi-national IT company.
Lung-Hao Lee, Kuei-Ching Lee and Yuen-Hsien Tseng. HANS: A Service-Oriented Framework for Chinese Language Processing
Abstract: This paper proposes a service-oriented architecture, called HANS, to facilitate Chinese language processing. Based on this unified framework, fundamental NLP tasks, namely word segmentation, part-of-speech tagging, named entity recognition, chunking, parsing, and semantic role labeling, can be seamlessly integrated to enhance functionalities for Chinese language processing. We take the basic Chinese word segmentation task as an example to demonstrate the effects. Evaluation benchmarks originate from the SIGHAN 2005 bakeoff and the NLPCC 2016 shared task. We implement publicly released toolkits including Stanford CoreNLP, Fudan NLP and CKIP as services in our HANS framework for performance comparison. Experimental results confirm the feasibility of the proposed architecture. Findings are also discussed to sketch potential developments in the future.
Thanassis Mavropoulos, Dimitris Liparas, Spyridon Symeonidis, Stefanos Vrochidis and Ioannis Kompatsiaris. A Hybrid Approach for Biomedical Relation Extraction Using Finite State Automata and Random Forest-Weighted Fusion
Abstract: The automatic extraction of relations between medical entities found in related texts is considered to be a very important task, due to the multitude of applications that it can support, from question answering systems to the development of medical ontologies. Many different methodologies have been presented and applied to this task over the years. Of particular interest are hybrid approaches, in which different techniques are combined in order to improve the individual performance of either one of them. In this study, we extend a previously established hybrid framework for medical relation extraction, which we modify by enhancing the pattern-based part of the framework and by applying a more sophisticated weighting method. Most notably, we replace the use of regular expressions with finite state automata for the pattern-building part, while the fusion part is replaced by a weighting strategy that is based on the operational capabilities of the Random Forests algorithm. The experimental results indicate the superiority of the proposed approach against the aforementioned well-established hybrid methodology and other state-of-the-art approaches.
Mansouri Sadek, Chahira Lihoui, Mbarek Charhad and Mounir Zrigui. Text-to-concept: a semantic indexing framework for Arabic News videos
Abstract: In recent years, many works have been published in the video indexing and retrieval field. However, few methods have been designed for Arabic video. This paper's aim is to achieve a new approach for Arabic news video indexing which is based on embedded text as the information source and on knowledge extraction techniques to provide a conceptual description of video content. Firstly, we apply low-level processing in order to detect and recognize the video texts. Then, we extract the conceptual information, including names of persons, organizations and locations of events, using local grammars implemented with the linguistic platform NooJ. Our proposed approach was tested on a large collection of Arabic TV news and the experimental results were satisfactory.
Piotr Andruszkiewicz and Rafal Hazan. Domain Specific Features Driven Information Extraction from Web Pages of Scientific Conferences
Abstract: In this paper we describe information extraction from web pages of scientific conferences. We enrich already known features with new features specific to this domain and show their importance in the process of extracting information. Moreover, we investigate various data representation models, e.g., based on single tokens or sequences, in order to find the best configuration for the task in question and set up a new baseline over a publicly available corpus.
László János Laki and Zijian Győző Yang. Combining Machine Translation Systems with Quality Estimation
Abstract: Improving the quality of Machine Translation (MT) systems is an important task not only for researchers but also a substantial need for translation companies to create translations in a quicker and cheaper way. Combining the outputs of several machine translation systems is a common technique to obtain better translation quality, because the benefits of the different systems can be utilized. The main question is how to find the best method for the combination. In this paper, we used the Quality Estimation (QE) technique to combine a phrase-based and a hierarchical machine translation system. The composite system was tested on several language combinations. The QE module was used to compare the outputs of the different MT systems and gave the best one as the resulting translation of the composite system. The composite system achieved better translation quality than the separate systems.
Mārcis Pinnis, Rihards Krišlauks, Daiga Deksne and Toms Miks. Investigation of Neural Machine Translation for Highly Inflected and Small Languages
Abstract: The paper evaluates neural machine translation systems and phrase-based machine translation systems for highly inflected and small languages. It analyses two translation scenarios: 1) when translating broad domain data from a morphologically rich language into a morphologically rich language or English (and vice versa), and 2) when translating narrow domain data and there are limited amounts of training data available for training machine translation systems. The paper reports on experiments for English (Germanic), Estonian (Finno-Ugric), Latvian (Baltic), and Russian (Slavic) languages. The scenarios are evaluated using automatic and manual – system comparative and error analysis-based – evaluation methods. The paper also analyses the aspects where neural machine translation systems are superior to statistical (phrase-based) machine translation systems and vice versa.
Frank Z. Xing, Danyuan Ho, Diyana Hamzah and Erik Cambria. Classifying World Englishes from a lexical perspective: A corpus-based approach
Abstract: The spread of English has led to the emergence of new English varieties worldwide. Existing quantitative approaches made use of several linguistic criteria, particularly morphological and syntactic features, in investigating variation across English varieties. Taking an alternative lexical perspective to the classification of World Englishes, this paper adopts a corpus-based approach in investigating lexical frequency across 20 regional English varieties. Specifically, the lexical items in focus include culture-bound terms and words that have undergone semantic shift. The English varieties are categorized following a series of filtering rules and normalization techniques, and a hierarchical cluster of the varieties is subsequently formalized. Our findings generally corroborate Kachru's Three Circles of English model, with subtle differences to the Inner Circle-Outer Circle groupings. The taxonomy of the English varieties additionally reveals geographical and cultural correlations.
Chandramouli Shama Sastry, Darshan Siddesh Jagaluru and Kavi Mahesh. Visualizing Textbook Concepts: Beyond Word Co-occurrences
Abstract: In this paper, we present a simple and elegant algorithm to extract and visualize various concept relationships present in sections of a textbook. This can easily be extended to visualizations of entire chapters or textbooks, thereby opening up opportunities for developing a range of visual applications for e-learning and education in general. Our algorithm creates visualizations by mining relationships between concepts present in a text by applying the idea of transitive closure rather than merely counting co-occurrences of terms. It does not require any thesaurus or ontology of concepts. We applied the algorithm to two textbooks, Theory of Computation and Machine Learning, to extract and visualize concept relationships from their sections. Our findings show that the algorithm is not only capable of capturing deep-set relationships between concepts which could not have been found by using a term co-occurrence approach, but is also capable of word-sense disambiguation.
Etienne Papegnies, Vincent Labatut, Richard Dufour and Georges Linares. Impact Of Content Features For Automatic Online Abuse Detection
Abstract: Online communities have gained considerable importance in recent years due to the increasing number of people connected to the Internet. Moderating user content in online communities is mainly performed manually, and reducing the workload through automatic methods is of great financial interest for community maintainers. Often, the industry uses basic approaches such as bad-word filtering and regular expression matching to assist the moderators. In this article, we consider the task of automatically determining whether a message is abusive. This task is complex since messages are written in a non-standardized way, including spelling errors, abbreviations, and community-specific codes.
First, we evaluate the system that we propose using standard features of online messages. Then, we evaluate the impact of the addition of pre-processing strategies, as well as original specific features developed for the community of an online in-browser strategy game.
We finally propose to analyze the usefulness of this wide range of features using feature selection. This work can lead to two possible applications: 1) automatically flag potentially abusive messages to draw the moderator's attention on a narrow subset of messages; and 2) fully automate the moderation process by deciding whether a message is abusive without any human intervention.
Angrosh Mandya, Danushka Bollegala, Frans Coenen and Katie Atkinson. Classifier-based Pattern Selection Approach for Relation Instance Extraction
Abstract: A classifier-based pattern selection approach for relation instance extraction is proposed in this paper. The approach employs a binary classifier to filter out patterns that extract incorrect entities for a given relation from a pattern set obtained using global estimates such as high frequency. The proposed approach is evaluated using two large independent datasets. The results presented in this paper show that the classifier-based approach provides a significant improvement in relation extraction over standard methods that employ pattern sets based on high frequency alone. The higher performance is achieved by filtering out patterns that extract incorrect entities, which in turn improves the precision of the applied patterns and results in a significant improvement in the relation extraction task.
Tirthankar Dasgupta, Abir Naskar and Lipika Dey. Exploring Linguistic and Graph based Features for the Automatic Classification and Extraction of Adverse Drug Effects
Abstract: Adverse drug effects (ADEs) are known to be one of the leading causes of post-therapeutic death. Thus, their identification constitutes an important challenge, as the effects of ADEs are often underreported. However, the recent popularity of different social media sources has made them a promising source for ADE extraction. In this paper we explore different linguistic and graph-topological features to automatically classify short sentences or tweets as ADEs or non-ADEs. We further represent the ADE knowledge base as a bipartite network of drugs and their side effects to model drug-side effect relationships. The proposed model can also be used to discover implicit ADEs that are not represented in the source data. We have evaluated our proposed models on two openly available ADE datasets. Our evaluation results show that the proposed model surpasses the performance of the existing baseline systems.
Vijay Sundar Ram and Sobha Lalitha Devi. A robust Co-reference Chain builder for Tamil
Abstract: The goal of the present work is to extract co-reference chains from a document. Co-reference chains show the cohesiveness of the document. Cohesiveness in a document is marked by cohesive devices such as reference, substitution, ellipsis, conjunction and lexical cohesion. In this work we take up pronominals, reflexives and R-expressions and form co-reference chains for each of these markers. Co-reference chains are essential in building sophisticated natural language processing applications such as information extraction, profile generation, entity-specific text summarization, etc. They are also needed in machine translation and information retrieval tasks. Though pronominal resolution has been dealt with in a few Indian languages such as Tamil, Hindi, Bengali and Malayalam, extraction of co-reference chains in Indian languages has not been attempted before. We extract co-reference chains from Tamil language text. We have evaluated the system with real-time data and the results are encouraging.
Sandeep Kumar Dash, Partha Pakray, Jan Smeddinck, Robert Porzel, Rainer Malaka and Alexander Gelbukh. Designing An Ontology for Physical Exercise Actions
Abstract: Instructions for physical exercises leave many details underspecified that are taken for granted and inferred by the intended reader. For certain applications, such as generating virtual action visualizations from such textual instructions, advanced text processing is needed, requiring interpretation of both implicit and explicit information. This work presents an ontology that can support the semantic analysis of such instructions in order to support the identification of matching action constructs. The proposed ontology lays down a hierarchical structure following the structure of the human body along with various types of movement restrictions. This facilitates flexible yet adequate representations.
Nattapong Sanchan, Ahmet Aker and Kalina Bontcheva. Gold Standard Online Debates Summaries and First Experiments Towards Automatic Summarization of Online Debate Data
Abstract: Usage of online textual media is steadily increasing; news stories, blog posts, and scientific articles are all freely available and have been employed extensively in multiple research areas, e.g. automatic text summarization, information retrieval, and information extraction. Meanwhile, online debate forums have recently become popular, but have remained largely unexplored. In this paper, we work on an annotation task for a new dataset, online debate data. We present the judgements of annotators on similar-meaning sentences, which yield low average baseline values for Cohen's kappa and Krippendorff's alpha. Through the application of a semantic similarity approach during the calculation of inter-annotator agreement, we increase kappa and alpha to 36% and 50%, respectively. Moreover, we also implement an extractive summarization system for online debates. Key features for extracting salient sentences from online debates are sentence position, debate title words, cosine similarity of the debate title words, and cosine similarity of the debate sentences. Our ROUGE results reveal that system performance also increases after the semantic similarity approach is applied.
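A rough sketch of feature-based sentence scoring using the kinds of features the abstract lists (sentence position, title words, cosine similarity). The particular weights and the TF-IDF representation are illustrative assumptions, not the paper's exact setup.

```python
# Rough sketch of feature-based sentence scoring for extractive debate
# summarization; feature weights are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def score_sentences(sentences, debate_title, weights=(0.3, 0.3, 0.4)):
    vec = TfidfVectorizer().fit(sentences + [debate_title])
    sent_vecs = vec.transform(sentences)
    title_vec = vec.transform([debate_title])
    title_words = set(debate_title.lower().split())
    scores = []
    for i, sent in enumerate(sentences):
        position = 1.0 - i / max(len(sentences) - 1, 1)   # earlier sentences score higher
        overlap = len(title_words & set(sent.lower().split())) / max(len(title_words), 1)
        cos_title = cosine_similarity(sent_vecs[i], title_vec)[0, 0]
        scores.append(weights[0] * position + weights[1] * overlap + weights[2] * cos_title)
    return scores

# The top-ranked sentences (by score) would then form the extractive summary.
```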
Silpa Kanneganti, Vandan Mujadia and Dipti M. Sharma. Classifier Ensemble Approach to Learning in Dependency Parsing
Abstract: Transition-based and graph-based parsing models are two of the most dominant approaches in dependency parsing. Both models are known to achieve state-of-the-art accuracy. While each method has its own set of advantages, they are also limited by the models they use for learning and inference. Although their approaches couldn't be more different from each other, both transition and graph-based models use single classifier based linear models to predict arcs or decisions for a given instance. In this paper we deal with problems arising with both these approaches during learning. We propose a neural network based classifier voting approach to dependency parsing using multiple classifiers as component systems in an ensemble and a neural network algorithm as an oracle. We show significant improvements over the best component systems for both transition-based and graph-based dependency parsing. We also investigate different weighting schemes for voting among individual classifiers in the ensemble. All our experiments were conducted on Hindi and Telugu language data but the approach is language-independent.
Marco Antonio Sobrevilla Cabezudo, Félix Arturo Oncevay Marcos and Héctor Andrés Melgar Sasieta. SenseDependency-Rank: A Word Sense Disambiguation Method Based on Random Walks and Dependency Trees
Abstract: Word Sense Disambiguation (WSD) is the field that seeks to determine the correct sense of a word in a given context. In this paper, we present a WSD method based on random walks over a dependency tree whose nodes are word senses from WordNet. Besides, our method incorporates prior knowledge about the frequency of use of the word senses. We observe that our results outperform several graph-based WSD methods on the All-Words task of SensEval-2 and SensEval-3, including the baseline, with nouns and verbs showing the greatest improvement in their F-measure scores.
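A small sketch of the random-walk idea, here realised as personalized PageRank over a sense graph with prior sense frequencies as the personalization vector. Constructing the graph from the dependency tree and WordNet is assumed to have been done already; this is an illustration, not the paper's exact algorithm.

```python
# Small sketch: rank candidate word senses with a random walk (personalized
# PageRank) over a sense graph, biased by prior sense frequencies.
import networkx as nx

def rank_senses(sense_graph, sense_prior):
    """sense_graph: nx.Graph whose nodes are candidate word senses;
    sense_prior: dict mapping each sense node to its prior frequency."""
    total = sum(sense_prior.values()) or 1.0
    # Small epsilon keeps the personalization vector valid even for senses
    # with no recorded prior frequency.
    personalization = {s: (sense_prior.get(s, 0.0) + 1e-9) / total for s in sense_graph}
    scores = nx.pagerank(sense_graph, alpha=0.85, personalization=personalization)
    return scores  # for each ambiguous word, pick its highest-scoring sense
```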
Ahlem Bouziri, Chiraz Latiri and Eric Gaussier. Efficient Association Rules Selection for Automatic Query Expansion
Abstract: Query expansion approaches based on term correlations, such as association rules (ARs) between terms, have shown significant improvements in the performance of the information retrieval task. However, the very large set of generated ARs is a real obstacle to selecting only the most interesting ones for query expansion. In this respect, we propose a new learning-based automatic query expansion approach using ARs between terms. The main idea of our proposal is to rank candidate ARs in order to select the most relevant rules to be used in the query expansion process. In particular, we propose an association rule ranking approach based on a pairwise learning model to generate relevant expansion terms. Experimental results on Robust TREC and CLEF test collections highlight that retrieval performance can be improved when the AR ranking method is used.
Yoann Dupont, Marco Dinarelli and Isabelle Tellier. Label-Dependencies Aware Recurrent Neural Networks
Abstract: In the last few years, Recurrent Neural Networks (RNNs) have proved effective on several NLP tasks. Despite this success, their ability to model sequence labeling is still limited. This has led research toward solutions where RNNs are combined with models that have already proved effective in this domain, such as CRFs. In this work we propose a far simpler but very effective solution: an evolution of the simple Jordan RNN, where labels are reinjected as input into the network and converted into embeddings, in the same way as words. We compare this RNN variant to the other RNN models, Elman and Jordan RNNs, LSTM and GRU, on two well-known Spoken Language Understanding (SLU) tasks. Thanks to label embeddings and their combination at the hidden layer, the proposed variant, which uses more parameters than Elman and Jordan RNNs but far fewer than LSTM and GRU, is not only more effective than the other RNNs, but also outperforms sophisticated CRF models.
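An illustrative PyTorch sketch of the label-reinjection idea: the previously predicted label is embedded like a word and fed back as part of the input at the next step. Layer sizes, the greedy decoding, and the initial "O" label index are assumptions, not the paper's exact configuration.

```python
# Illustrative sketch of a Jordan-style RNN variant with label embeddings
# reinjected into the input at each time step.
import torch
import torch.nn as nn

class LabelFeedbackRNN(nn.Module):
    def __init__(self, vocab_size, n_labels, word_dim=100, label_dim=30, hidden_dim=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.label_emb = nn.Embedding(n_labels, label_dim)
        self.cell = nn.RNNCell(word_dim + label_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, n_labels)

    def forward(self, word_ids):                       # word_ids: (batch, seq_len)
        batch, seq_len = word_ids.shape
        h = word_ids.new_zeros(batch, self.cell.hidden_size, dtype=torch.float)
        prev_label = word_ids.new_zeros(batch, dtype=torch.long)   # assumed index of "O"
        logits = []
        for t in range(seq_len):
            # Concatenate the word embedding with the embedding of the previous label.
            x = torch.cat([self.word_emb(word_ids[:, t]), self.label_emb(prev_label)], dim=-1)
            h = self.cell(x, h)
            step_logits = self.out(h)
            logits.append(step_logits)
            prev_label = step_logits.argmax(dim=-1)    # greedy feedback of the predicted label
        return torch.stack(logits, dim=1)              # (batch, seq_len, n_labels)
```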
Valentina Sintsova, Margarita Bolívar Jiménez and Pearl Pu. Modeling the Impact of Modifiers on Emotional Statements
Abstract: Humans use a variety of modifiers while expressing their emotions. This paper aims to understand the influence of different modifiers on specific emotion categories, and to compare side-by-side their impact in an automatic manner. To do so, we propose a novel data analysis method that not only quantifies how much emotional statements change under each modifier, but also models how emotions shift and how their confidence changes. This method is based on comparing the distributions of emotion labels for modified and non-modified occurrences of emotional terms within labeled data. We apply this analysis to study six types of modifiers (negation, intensification, conditionality, tense, interrogation, and modality) within a large corpus of tweets with emotional hashtags. Our study sheds light on how to model negation relations between studied emotions, reveals the impact of previously under-studied modifiers, and suggests how to detect more precise emotional statements.
Alina Maria Ciobanu, Liviu P. Dinu and Andrea Sgarro. Towards a Map of the Syntactic Similarity of Languages
Abstract: In this paper we propose a computational method for determining the syntactic similarity between languages. We investigate multiple approaches and metrics, showing that the results are consistent across methods. We report results on 15 languages belonging to various language families. The analysis that we conduct is adaptable to any language, as long as resources are available.
Liviu P. Dinu and Alina Maria Ciobanu. Romanian Word Production: an Orthographic Approach Based on Sequence Labeling
Abstract: Languages borrow words from one another for various reasons. How the borrowing process takes place and how new words enter a target language are key questions of historical linguistics. In this paper, we propose a multilingual method for word form production based on the orthography of the words. For borrowed words, we investigate the derivation from a source language into a target language. We also address the problem of the derivation of genetic cognates. We experiment with Romanian as a target language and investigate borrowings from multiple source languages. The advantages of the proposed method are that it does not use any external knowledge except for the training data, and it does not require the phonetic transcriptions of the input words.
Angel Luis Garrido, Oscar Cardiel, Andrea Aleyxendri, Ruben Quilez and Carlos Bobed. Optimization in Extractive Summarization Processes through Automatic Classification
Abstract: The results of an extractive automatic summarization task depend to a great extent on the nature of the processed texts (e.g., news, medicine, or literature). In fact, general-purpose methods usually need to be modified ad hoc to improve their performance when dealing with a particular application context. However, this customization requires a lot of effort from domain experts and application developers, which makes it not always possible or appropriate.

In this paper, we propose a multi-language approach to extractive summarization which adapts itself to different text domains in order to improve its performance. In a training step, our approach leverages the features of the text documents in order to classify them by using machine learning techniques. Then, once the text typology of each text is identified, it tunes the different parameters of the extraction mechanism solving an optimization problem for each of the text document classes. This classifier along with the learned optimizations associated with each document class allows our system to adapt to each of the input texts automatically. The proposed method has been applied in a real environment of a media company with promising results.
Omar El Ariss and Loai Alnemer. Morphology based Arabic Sentiment Analysis of Book Reviews
Abstract: Sentiment analysis is a fundamental natural language processing task that automatically analyzes raw textual data and infers semantic meaning from it. The inferred information focuses on the author's attitude or opinion towards a written text. Although there is extensive research on sentiment analysis for English, little work has targeted the morphologically rich structure of the Arabic language. In addition, most of the research on Arabic focuses on either introducing new datasets or new sentiment lexicons. We propose a supervised sentiment analysis approach for two tasks: positive/negative classification and positive/negative/neutral classification. We focus on the morphological structure of the Arabic language by introducing filtering, segmentation and morphological processing specifically designed for this language. We also manually create an emoticon sentiment lexicon in order to stress the expressed emotions and improve the sentiment analyzer.
Manali Pradhan, Jing Peng, Anna Feldman and Bianca Wright. Idioms: Humans or machines, it’s all about context
Abstract: Some expressions can be ambiguous between idiomatic and literal interpretation depending on the context they occur in ("sales hit the roof" vs. "hit the roof of the car"). Previous studies suggest that idiomaticity is not a binary property, but rather a continuum, the so-called "scalar phenomenon", ranging from completely literal to highly idiomatic. In this paper we report the results of an experiment in which human annotators rank idiomatic expressions in context on a scale from 1 (literal) to 4 (highly idiomatic). Our experiment supports the hypothesis that idioms fall on a continuum and that one might differentiate between highly idiomatic, mildly idiomatic and weakly idiomatic expressions. In addition, we measure the relative idiomaticity of 11 idiomatic types and compute the correlation between the relative idiomaticity of an expression and the performance of various automatic models for idiom detection. We show that our model, based on distributional semantics, not only outperforms the previous models, but also correlates positively with human judgements, which suggests that we are moving in the right direction toward automatic idiom detection.
Joe Cheri Ross and Pushpak Bhattacharyya. Improved Best-First Clustering for Coreference Resolution in Indian Classical Music Forums
Abstract: The clustering step in the mention-pair paradigm for coreference resolution forms chains of coreferent mentions from the mention pairs classified as coreferent. Clustering methods, including best-first clustering, consider each antecedent candidate individually when selecting the antecedent for an anaphoric mention. Here we introduce an easy-to-implement modification to best-first clustering to improve coreference resolution on Indian classical music forums. This method considers the relations among the candidate antecedents along with the relation between the anaphoric mention and each candidate antecedent. We observe a modest but statistically significant improvement over best-first clustering on this dataset.
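A schematic sketch of best-first antecedent selection extended with a check on the relations among candidate antecedents. The pairwise score function and the way candidate-candidate agreement is folded into the final score are assumptions for illustration; the paper's exact formulation may differ.

```python
# Schematic sketch: best-first antecedent selection that also consults the
# relation between candidate antecedents, not only the (mention, candidate)
# score. `pair_score` is an assumed pairwise coreference classifier score.

def select_antecedent(mention, candidates, pair_score, threshold=0.5):
    best, best_score = None, threshold
    for cand in candidates:                      # candidates ordered by recency
        score = pair_score(mention, cand)
        # Adjust a candidate's score by its agreement with the other strong candidates.
        strong_others = [c for c in candidates
                         if c is not cand and pair_score(mention, c) > threshold]
        if strong_others:
            agreement = sum(pair_score(cand, c) for c in strong_others) / len(strong_others)
            score = 0.5 * score + 0.5 * agreement
        if score > best_score:
            best, best_score = cand, score
    return best   # None means the mention starts a new coreference chain
```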
Rajendra Banjade, Nabin Maharjan, Dipesh Gautam and Vasile Rus. Pooling Word Vector Representations across Models
Abstract: Vector-based word representation models are typically developed from very large corpora with the hope that the representations are reliable and cover many words. However, in real-world applications we often encounter words that are not available in a single vector-based model. In this paper, we present a novel Neural Network (NN) based approach for obtaining representations for words that are missing in a target model from another model, called the source model, where representations for these words are available, effectively pooling together the two vocabularies and the corresponding representations. Our experiments with three different types of pre-trained models (Word2vec, GloVe, and LSA), each providing representations for millions of unique words and yet having only a small percentage of words in common, show that the representations obtained using our transformation approach can substantially and effectively extend the word coverage of existing models. The increase in the number of unique words covered by a model varies from a few times to several times, depending on which model vocabulary is taken as the reference. The transformed word representations correlate well (average correlation up to 0.801 for words in the Simlex-999 dataset) with the native target model representations, indicating that the transformed vectors can effectively be used as substitutes for native word representations. Furthermore, an extrinsic evaluation based on a word-to-word similarity task using the Simlex-999 dataset leads to results close to those obtained using native target model representations.
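A compact sketch of the pooling idea: learn a regression from source-model vectors to target-model vectors on the shared vocabulary, then apply it to words the target model lacks. The network size and the use of scikit-learn's MLPRegressor are illustrative assumptions rather than the paper's architecture.

```python
# Compact sketch: learn a mapping from source-model to target-model vectors
# on the shared vocabulary, then transform source vectors of words missing
# from the target model.
import numpy as np
from sklearn.neural_network import MLPRegressor

def pool_vocabularies(source_vecs, target_vecs):
    """source_vecs, target_vecs: dicts mapping words to 1-D numpy vectors."""
    shared = sorted(set(source_vecs) & set(target_vecs))
    X = np.vstack([source_vecs[w] for w in shared])
    Y = np.vstack([target_vecs[w] for w in shared])
    mapper = MLPRegressor(hidden_layer_sizes=(300,), max_iter=500).fit(X, Y)

    pooled = dict(target_vecs)                   # keep native target vectors
    missing = [w for w in source_vecs if w not in target_vecs]
    if missing:
        X_missing = np.vstack([source_vecs[w] for w in missing])
        for w, vec in zip(missing, mapper.predict(X_missing)):
            pooled[w] = vec                      # transformed substitute vector
    return pooled
```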
Thoudam Doren Singh and Thamar Solorio. Towards Translating Mixed-Code Comments from Social Media
Abstract: The translation of social media comments has attracted researchers in recent times because of the challenges in understanding the nature of the comments and their representation, and the need to translate them into a target language. In the present work, we attempt two approaches to translating Facebook comments: one using a language identifier and the other without. We also attempt to handle some of the spelling variation in these comments in order to improve translation quality with the help of state-of-the-art statistical machine translation techniques. Our approach employs n-best-list generation for the source-language side of the training dataset to address the spelling variation in the comments and to enrich the resources available for translation. A small in-domain dataset could further boost the performance of the translation system. Our translation task focuses on Hindi-English mixed comments collected from Facebook, and our systems show an improvement in translation quality over the baseline system in terms of automatic evaluation scores.
Vincent Claveau and Ewa Kijak. Strategies to select examples for Active Learning with Conditional Random Fields
Abstract: Nowadays, many NLP problems are tackled as supervised machine learning tasks. Consequently, the cost of the expertise needed to annotate the examples is a widespread issue. Active learning offers a framework to address this issue, allowing the annotation cost to be controlled while maximizing classifier performance, but it relies on the key step of choosing which examples will be proposed to the expert. In this paper, we examine and propose such selection strategies in the specific case of Conditional Random Fields (CRF), which are widely used in NLP. On the one hand, we propose a simple method to correct a bias of certain state-of-the-art selection techniques. On the other hand, we detail an original approach to selecting the examples, based on respecting the class proportions in the datasets. These contributions are validated over a large range of experiments involving several tasks and datasets, including named entity recognition, chunking, phonetization, and word sense disambiguation.
Daniil Alexeyevsky. Semi-supervised Relation Extraction from Monolingual Dictionary for Russian WordNet.
Abstract: Existing pre-computer-era monolingual dictionaries are a voluminous, loosely structured source of lexical and ontological information. Numerous attempts have been made to extract WordNet or ontology relations from monolingual dictionaries with varying success, most of them based on morphosyntactic rules. The difficulty of the information extraction task greatly depends on the dictionary authors' discipline, which is rarely sufficient to allow effortless dictionary parsing.

The core of this work is an analysis of how a rule-based approach to relation extraction can be improved and which relations can be extracted. To simplify this task, syntactically similar definitions are clustered, the clusters are analyzed, and one morphosyntactic rule per cluster is assigned. The author suggests using mixed n-gram clustering features as a simple substitute for syntactic features.

The clustering is performed on a Russian explanatory dictionary (ed. Kuznetsov). The results show that the dictionary discipline in this dictionary is actually very strict, with fewer than a hundred definition styles adhered to. This, along with the preliminary clustering, allowed us to develop rules for relation extraction with very high precision, not reported in similar works previously.
Tanik Saikh, Sudip Kumar Naskar, Asif Ekbal and Sivaji Bandyopadhyay. Textual Entailment Using Machine Translation Evaluation Metrics
Abstract: In this paper we propose a novel approach to determining the Textual Entailment (TE) relation between a pair of text snippets. TE can be defined as a directional relationship between a pair of sentences, denoted by T, the entailing "Text", and H, the entailed "Hypothesis". We say that T entails H if the meaning of H can be inferred from the meaning of T. Various machine translation and summary evaluation metrics are used as features for different machine learning classifiers to make the entailment decision in this study. We consider three machine translation evaluation metrics, namely BLEU, METEOR and TER, and a summary evaluation metric, ROUGE, as similarity metrics for this task. Finally, we use a polarity feature, negation, in combination with the aforementioned features. We carried out the experiments on the datasets released in the shared tasks on textual entailment organized in RTE-1, RTE-2, RTE-3, RTE-4 and RTE-5. The best classification accuracies obtained by our system on the RTE-1, RTE-2, RTE-3, RTE-4 and RTE-5 datasets are 54%, 55%, 60%, 52% and 51%, respectively.
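An illustrative sketch of the feature-extraction step: turn MT-evaluation-style similarity scores between T and H into classifier features. Only BLEU (via NLTK) and a simple unigram-overlap F-score stand-in are shown, plus a crude negation feature; METEOR, TER and ROUGE would be added analogously, and the choice of classifier is an assumption.

```python
# Illustrative sketch: MT-evaluation-style similarity scores between the
# Text (T) and Hypothesis (H) as features for an entailment classifier.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sklearn.linear_model import LogisticRegression

def overlap_f1(t_tokens, h_tokens):
    common = len(set(t_tokens) & set(h_tokens))
    if common == 0:
        return 0.0
    p, r = common / len(set(h_tokens)), common / len(set(t_tokens))
    return 2 * p * r / (p + r)

def pair_features(text, hypothesis):
    t, h = text.lower().split(), hypothesis.lower().split()
    bleu = sentence_bleu([t], h, smoothing_function=SmoothingFunction().method1)
    negation_mismatch = float(("not" in t) != ("not" in h))   # crude polarity feature
    return [bleu, overlap_f1(t, h), negation_mismatch]

# Training would then be, e.g.:
#   X = [pair_features(t, h) for t, h in pairs]; y = entailment labels
#   clf = LogisticRegression().fit(X, y)
```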
Nima Hemmati, Heshaam Faili and Jalal Maleki. Multiple System Combination for Persian-English Transliteration
Abstract: In this paper, we model a transliteration system from Persian to English using grapheme-to-phoneme (G2P) and word lattice methods combined with statistical machine translation (SMT). Persian belongs to the Indo-Iranian branch of the Indo-European language family and is written in an Arabic-based script. Our transliteration model is induced from a parallel corpus containing the Arabic script of a Persian book together with its romanized transcription in a scheme named Dabire. We manually aligned the sentences of this book in both scripts and used them as a parallel corpus. Our results indicate that the performance of the system is improved by adding grapheme-to-phoneme and word lattice methods for the out-of-vocabulary handling task to the monotonic statistical machine transliteration system. In addition, the final performance on the test corpus shows that our system achieves results comparable with other state-of-the-art systems.
Xun Wang, Rumeng Li, Shindo Hiroyuki, Katsuhito Sudoh and Masaaki Nagata. Learning to Rank for Coordination Detection
Abstract: Coordinations refer to phrases such as "A and/but/or/... B". The detection of coordinations remains a major problem due to the complexity of their components. Existing work normally classifies the training data into two categories, correct and incorrect, which often causes a data imbalance problem that inevitably damages the performance of the models used. To remedy this, we propose to fully exploit the differences between training instances by formulating the detection of coordinations as a ranking problem. We develop a novel model based on the long short-term memory network. Experiments on the Penn Treebank and Genia show that the proposed method outperforms previous work.
Gaurush Hiranandani, Pranav Manerikar and Harsh Jhamtani. Generating Appealing Brand Names
Abstract: Providing appealing brand names to newly launched products, newly formed companies, websites, or campaigns, or for renaming existing companies, is highly important, as the name can play a crucial role in deciding success or failure. In this work, we propose a computational method to generate appealing brand names based on the description of such entities. Further, we use quantitative scores for readability, pronounceability, memorability and uniqueness of a generated name to rank-order the names. A set of diverse appealing names is recommended to the user for the brand naming task. These names also aid in ideation. Experimental results show that the names generated by our approach are more appealing than the names that prior approaches and a few recruited humans could come up with.
Ilia Markov, Efstathios Stamatatos and Grigori Sidorov. Improving Cross-Topic Authorship Attribution: The Role of Pre-Processing
Abstract: The effectiveness of character n-gram features for representing the stylistic properties of a text has been demonstrated in various independent Authorship Attribution (AA) studies. Moreover, it has been shown that some categories of character n-grams perform better than others both under single and cross-topic AA conditions. In this work, we present an improved algorithm for cross-topic AA. We demonstrate that the effectiveness of character n-grams representation can be significantly enhanced by performing simple pre-processing steps and appropriately tuning the number of features, especially in cross-topic conditions.
Utpal Sikdar and Björn Gambäck. Named Entity Recognition for Amharic Using Stack-Based Deep Learning
Abstract: The paper describes a named entity recognition system for Amharic, an under-resourced language, using a recurrent neural network, namely a bi-directional long short-term memory model. Word vectors based on semantic information are built for all tokens using an unsupervised learning algorithm, word2vec, while a Conditional Random Fields classifier trained on language-independent features predicts each token's named entity class. The predictions, features and word vectors are fed to the deep neural network, which assigns labels to the words. When evaluated by 10-fold cross-validation, the Amharic named entity recogniser achieved good precision (86.0%), but lower recall (65.5%).
Adrián Jiménez Pascual and Sumio Fujita. Text similarity function based on word embeddings for short text analysis
Abstract: We present the Contextual Specificity Similarity (CSS) measure, a new document similarity measure based on word embeddings and inverse document frequency. The idea behind the CSS measure is to assign higher scores to documents that include words with close embeddings and similar frequency of usage. This paper provides a comparison with several text classification methods, which demonstrates the accuracy and utility of CSS in k-nearest-neighbour classification tasks for short texts.
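A hedged sketch in the spirit of the CSS idea, not the paper's exact formula: word pairs whose embeddings are close and whose IDF values are similar contribute more to the document similarity, which can then drive a k-NN classifier for short texts.

```python
# Hedged sketch of an embedding-plus-IDF document similarity in the spirit
# of CSS; this is an illustration, not the paper's exact measure.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def css_like_similarity(doc_a, doc_b, embeddings, idf):
    """doc_a, doc_b: lists of tokens; embeddings: word -> vector; idf: word -> float."""
    words_a = [w for w in doc_a if w in embeddings and w in idf]
    words_b = [w for w in doc_b if w in embeddings and w in idf]
    if not words_a or not words_b:
        return 0.0
    score = 0.0
    for wa in words_a:
        # Best-matching word in the other document, discounted when the two
        # words differ strongly in usage frequency (IDF).
        best = max(cosine(embeddings[wa], embeddings[wb]) *
                   (1.0 - abs(idf[wa] - idf[wb]) / max(idf[wa], idf[wb]))
                   for wb in words_b)
        score += best
    return score / len(words_a)

# A k-NN classifier over short texts can use this function as its similarity
# measure when voting among the nearest neighbours.
```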
Lars Borin, Shafqat Mumtaz Virk and Anju Saxena. Language Technology for Digital Linguistics: Turning the Linguistic Survey of India Into a Rich Source of Linguistic Information
Abstract: We present our work aiming at turning the linguistic material available in Grierson’s classical Linguistic Survey of India (LSI) into a digital language resource, a database suitable for a broad array of linguistic investigations of the languages of South Asia. While doing so, we develop state-of-the-art language technology for automatically extracting the relevant grammatical information from the text of the LSI, and interactive linguistic information visualization tools for better analysis and comparisons of languages based on their structural and functional features.
Johan Georg Cyrus Mazaher Ræder and Björn Gambäck. Sarcasm Annotation and Detection in Tweets
Abstract: Automatically identifying sarcasm in text is a challenging task, since it can be difficult even for humans, in particular in very short texts with little explicit context, such as tweets (Twitter messages). The paper presents a comparison of three sets of tweets marked for sarcasm, two annotated manually and one annotated using the common strategy of relying on the authors correctly using hashtags to mark sarcasm. To evaluate the datasets, a state-of-the-art system for sarcasm detection in tweets was implemented. Experiments on the two manually annotated datasets show comparable results, while deviating considerably from results on the automatically annotated data, indicating that using hashtags is not a reliable approach for creating Twitter sarcasm corpora.
Chaya Liebeskind, Shmuel Liebeskind and Yaakov Hacohen-Kerner. Comment relevance classification in Facebook
Abstract: Over the past few years, there has been growing interest in applying Natural Language Processing (NLP) methods to social media texts, due to the increased availability of social media platforms, such as social networks, forums and blogs, and the importance of social media data to companies, governments and nonprofit organizations. Social media texts are usually very short texts characterized by nonstandard spelling, inconsistent punctuation, abbreviations, emojis, and emoticons, which pose special challenges for NLP.

Since social media allow direct contact with the target public, opinion and sentiment analysis of users' comments has received a lot of research attention. Most prior work on sentiment analysis addressed the problem of classifying comments as positive, negative or neutral with respect to the general attitude of the blog/post. In this paper, we focus on the task of classifying the relevance of a comment to the content of its post, which requires a fuller understanding of the comment semantics. For example, the general attitude of the comment "John, you are the best, and above the rest!!!" is positive, but since the comment is not directed toward the content of the post, it is considered irrelevant/negative in the relevance classification task.

Many text classification (TC) methods use the bag-of-words (BoW) representation. Due to the low frequency of occurrence of each word in short texts, their BoW representation is very sparse. To overcome the sparsity problem, different dimensionality reduction methods for semantic analysis, such as Latent Semantic Analysis (LSA), Word Embedding (Word2Vec), and Random Projection (RP), have been explored.

In this study, we explore four semantic vector representations for the relevance classification task. We investigate different types of large unlabeled data for learning the semantic vectors, namely comment texts, post texts and both post and comment texts. In addition, we examine whether expanding the input of the comment relevance classification task to include also the post text is beneficial for increasing the classification performance.
Mourad Gridach and Hatem Haddad. Arabic Named Entity Recognition: A Bidirectional GRU-CRF Approach
Abstract: Previous Named Entity Recognition (NER) models for Modern Standard Arabic (MSA) rely heavily on the use of features and gazetteers, which is time-consuming. In this paper, we introduce a novel neural network architecture based on a bidirectional Gated Recurrent Unit (GRU) combined with Conditional Random Fields (CRF). Our neural network uses minimal features: pre-trained word representations learned from unannotated corpora and character-level embeddings of words. This novel architecture allows us to eliminate the need for most handcrafted feature engineering. We evaluate our system on a publicly available dataset, where we achieve results comparable to the previous best-performing systems.
Amr Al-Khatib and Samhaa El-Beltagy. Emotional Tone Detection in Arabic Tweets
Abstract: Emotion detection in Arabic text is an emerging research area, but the efforts in this new field are hindered by the very limited availability of Arabic datasets annotated with emotions. In this paper, we review work that has been carried out in the area of emotion analysis in Arabic text. We then present an Arabic tweet dataset that we have built to serve this task. The efforts and methodologies followed to collect, clean, and annotate our dataset are described and preliminary experiments carried out on this dataset for emotion detection are presented. The results of these experiments are provided as a benchmark for future studies and comparisons with other emotion detection models. The best results over a set of eight emotions were obtained using a complement Naïve Bayes algorithm with an overall accuracy of 68.12 %.
Soujanya Poria. Speaker-independent Multimodal Sentiment Analysis with CNN-Based Feature Extraction
Abstract: We propose a novel framework for multimodal sentiment analysis and emotion recognition using convolutional neural network-based feature extraction from the text and visual modalities. We obtain a performance improvement of 10% over the state of the art by combining visual, text and audio features. To this end, we also discuss some frequently ignored major issues in this research field: the role of speaker-independent models, the importance of the modalities, and generalizability.

CyS

Xun Wang, Katsuhito Sudoh, Masaaki Nagata, Tomohide Shibata, Daisuke Kawahara and Sadao Kurohashi. Learning to Answer Questions by Understanding
Abstract: Many natural language processing tasks can be formulated as question answering problems. This paper introduces a novel neural network model for question answering, the entity-based memory network, which is able to solve different tasks under a unified scheme. It enhances neural networks' ability to represent and compute information over long spans by keeping records of the entities contained in the text. The core component is a memory pool which comprises the states of these entities; the entity states are continuously updated according to the input text. Questions about the input text are used to search the memory pool for related entities, and answers are then predicted based on the states of the retrieved entities. Entities in this model are regarded as the basic units that carry information and construct text; the information carried by a text is encoded in the states of its entities. Hence a text can best be understood by analysing the entities it contains. Compared with previous memory network models, the proposed model is capable of handling fine-grained information and more sophisticated relations based on entities. We conducted experiments on several datasets, including the Machine Comprehension Test dataset, bAbI, and the Large Movie Review dataset, and the proposed model achieved satisfying results.
Wen-Hsing Lai, Cheng-Jia Yang and Siou-Lin Wang. Computational Auditory Scene Analysis with Mask Post-Processing for Monaural Speech Segregation
Abstract: Speech segregation is one of the most difficult tasks in speech processing. This paper uses computational auditory scene analysis, support vector machine classifier, and post-processing on binary mask to separate speech from background noise. Mel-frequency cepstral coefficients and pitch are the two features used for support vector machine classification. Connected Component Labeling, Hole Filling, and Morphology are applied on the resulting binary mask as post-processing. Experimental results show that our method separates speech from background noise effectively.
Marie Duzi. Property modification
Abstract: In this paper I deal with property modifiers defined as functions that associate a given property P with a modified property [M P]. Property modifiers typically divide into four kinds, namely intersective, subsective, privative and modal. Here I do not deal with modal modifiers, which appear to be well-nigh logically lawless. The goal of this paper is to logically define the three remaining kinds of modifiers together with the rules of left and right subsectivity. I launch pseudo-detachment as a rule of left subsectivity to replace the modifier M in the premise by the property M* in the conclusion, and prove that the rule of pseudo-detachment is valid for all kinds of modifiers. Furthermore, it is defined in a way that avoids paradoxes such as the conclusion that a small elephant is smaller than a large mouse.
Yujun Zhou, Jiaming Xu, Jie Cao, Bo Xu, Changliang Li and Bo Xu. Hybrid Attention Networks for Chinese Short Text Classification
Abstract: To improve classification performance for Chinese short text with automatic semantic feature selection, in this paper we propose Hybrid Attention Networks (HANs), which combine word- and character-level selective attention. The model first applies RNN and CNN layers to extract the semantic features of texts. Then it captures class-related attentive representations from the word- and character-level features. Finally, all of the features are concatenated and fed into the output layer for classification. Experimental results on 32-class and 5-class datasets show that our model outperforms multiple baselines by combining not only the word- and character-level features of the texts, but also class-related semantic features via the attention mechanism.
Ali Balali, Masoud Asadpour and Hesham Faili. A supervised method to predict the popularity of news articles
Abstract: In this study, we identify the features of an article that encourage people to leave a comment on it. The volume of comments received by a news article indicates its importance; it also indirectly indicates the amount of influence the article has on the public. Leaving a comment on a news article indicates not only that the visitor has read the article but also that the article was important to him or her. We propose a machine learning approach to predict the volume of comments using information extracted about users' activities on the web pages of news agencies. In order to evaluate the proposed method, several experiments were performed. The results reveal a clear improvement in comparison with the baseline methods.
Amarnath Pathak, Partha Pakray, Sandip Sarkar, Dipankar Das and Alexander Gelbukh. MathIRs: Scientific Documents Retrieval System
Abstract: Effective retrieval of mathematical content from a vast corpus of scientific documents demands enhancements to conventional indexing and searching mechanisms. The indexing mechanism and the choice of semantic similarity measures largely determine the quality of the results of a Math Information Retrieval system (MathIRs). Tokenization and formula unification are among the distinguishing features of the indexing mechanism used in MathIRs, which facilitate sub-formula and similarity search. Besides, the indexed documents and the user query in MathIRs are likely to contain both mathematical and textual content, which necessitates designing modules for Text-Text Similarity (TS), Math-Math Similarity (MS) and Text-Math Similarity (TMS) matching. In this paper we propose a MathIRs comprising these modules and a substitution-tree-based mechanism for indexing mathematical expressions. We also present experimental results for similarity search and argue that the proposed MathIRs will ease the task of scientific document retrieval.
Mohammad Golam Sohrab, Toru Nakata, Makoto Miwa and Yutaka Sasaki. edge2vec: Edge Representations for Large-Scale Scalable Hierarchical Learning
Abstract: At the current frontier of Big Data, prediction tasks over the nodes and edges of complex deep architectures need careful feature representation, involving hundreds of thousands or even millions of labels and samples in information access systems, especially for hierarchical extreme multi-label classification. We introduce edge2vec, an edge representation framework for learning discrete and continuous features of edges in a deep architecture. In edge2vec, we learn a mapping of edges associated with nodes, where random samples are augmented by statistical and semantic representations of words and documents. We argue that infusing semantic representations of features for edges by exploiting word2vec and para2vec is the key to learning richer representations for exploring target nodes or labels in the hierarchy. Moreover, we design and implement a balanced stochastic dual coordinate ascent (DCA)-based support vector machine to speed up training. We introduce global decision-based top-down walks instead of random walks to predict the most likely labels in the deep architecture. We assess the efficiency of edge2vec against existing state-of-the-art techniques on extreme multi-label hierarchical as well as flat classification tasks. The empirical results show that edge2vec is very promising and computationally very efficient in fast learning and prediction tasks. In the deep learning workbench, edge2vec represents a new direction for statistical and semantic representations of features in task-independent networks.
Nadia Ghezaiel and Kais Haddar. Parsing Arabic nominal sentences with transducers to annotate corpora
Abstract: Studying Arabic nominal sentences is important for successfully analyzing and annotating Arabic corpora. This type of sentence is frequent in Arabic text and speech. Transducers can be used to realize local grammars and treat several linguistic phenomena. In this paper, we propose a parsing approach for Arabic nominal sentences using transducers. To do this, we first study the typology of the Arabic nominal sentence, indicating its different forms. Then, we develop a set of lexical and syntactic rules dealing with this type of sentence and respecting the specificities of the Arabic language. After that, we present our parsing approach based on transducers and on certain principles. In fact, this approach allows the annotation not only of Arabic nominal sentences but also of Arabic verbal ones. Finally, we present the implementation and experimentation of our approach in the NooJ linguistic platform. The metric values show that the obtained results are satisfactory.
Somnath Banerjee, Sudip Naskar, Paolo Rosso and Sivaji Bandyopadhyay. Named Entity Recognition on Code-Mixed Cross-Script Social Media Content
Abstract: Focusing on the current multilingual scenario in social media, this paper reports automatic extraction of named entities (NE) from code-mixed cross-script social media data. Our prime target is to extract NE for question answering. This paper also introduces a Bengali-English (Bn-En) code-mixed cross-script dataset for NE research and proposes domain-specific taxonomies for NE. We used formal as well as informal language-specific features to prepare the classification models and employed four machine learning algorithms (Conditional Random Fields, Margin Infused Relaxed Algorithm, Support Vector Machine and Maximum Entropy Markov Model) for the NE recognition (NER) task. In this study, Bengali is considered the native language while English is considered the non-native language. However, the approach presented in this paper is generic in nature and could be used for any other code-mixed dataset. The classification models based on CRF and SVM performed well among the classifiers.
Made Windu Antara Kesiman, Jean-Christophe Burie, Jean-Marc Ogier and Philippe Grange. Knowledge Representation and Phonological Rules for the Automatic Transliteration of Balinese Script on Palm Leaf Manuscript
Abstract: Ancient Balinese palm leaf manuscripts record much important knowledge about the history of world civilizations. They vary from ordinary texts to Bali's most sacred writings. In reality, the majority of Balinese cannot read them because of language obstacles as well as a tradition that regards them as sacred. Palm leaf manuscripts attract historians, philologists, and archaeologists seeking to discover more about ancient ways of life. Unfortunately, there is only limited access to the content of these manuscripts, because of the linguistic difficulties. The Balinese palm leaf manuscripts were written in Balinese script in the Balinese language, in ancient literary texts composed in the old Javanese language of Kawi and in Sanskrit. Balinese script is considered to be one of the most complex scripts of Southeast Asia. A transliteration engine for transliterating the Balinese script of palm leaf manuscripts into a Latin-based script is one of the most needed systems for collections of palm leaf manuscript images. In this paper, we present an implementation of knowledge representation and phonological rules for the automatic transliteration of Balinese script on palm leaf manuscripts. In this system, a rule-based engine for performing transliterations is proposed. Our model is based on phonetics, drawing on the traditional linguistic study of Balinese transliteration. This automatic transliteration system is needed to complete the optical character recognition (OCR) process on palm leaf manuscript images, to make the manuscripts more accessible and readable to a wider audience.
Priya Radhakrishnan, Manish Gupta, Vasudeva Varma and Ganesh Jawahar. SNEIT: Salient Named Entity Identification in Tweets
Abstract: Social media is a rich source of information and opinion, and its volume is growing by the day. However, social media posts are difficult to analyze since they are brief, unstructured and noisy. Many social media posts are about an entity or entities, so understanding which entity is central (the Salient Entity) to a post helps analyze the post better. In this paper we propose a model that aids in such analysis by identifying the Salient Entity in a social media post, tweets in particular. We present a supervised machine-learning model to identify the Salient Entity in a tweet and propose that the tweet is most likely about that particular entity. To build a dataset of tweets and salient entities, we used the premise that, when an image accompanies a text, the text is most likely about the entity in that image. We trained our model on this dataset. The model itself is not dependent on tweets with images, since we use only text features of the tweet. In our experiments we find that the model identifies the Salient Named Entity with an F-measure of 0.63. We evaluate the model using a standard tweet-filtering task and our results are better than the median of the tweet-filtering task results. Our method outperforms two of three baseline methods for salience identification. We have made the human-annotated dataset and the source code of this model publicly available to the research community.
Souvick Ghosh, Satanu Ghosh and Dipankar Das. Complexity Metric for Code-Mixed Social Media Text
Abstract: An evaluation metric is an absolute necessity for measuring the performance of any system and the complexity of any data. In this paper, we discuss how to determine the level of complexity of code-mixed social media texts, which are growing rapidly due to multilingual interference. In general, texts written in multiple languages are hard to comprehend and analyze. At the same time, to meet the demands of analysis, it is also necessary to determine the complexity of a particular document or text segment. Thus, in the present paper, we discuss the existing metrics for determining the code-mixing complexity of a corpus, their advantages and shortcomings, and propose several improvements on the existing metrics. The new index better reflects the variety and complexity of a multilingual document. Also, the index can be applied to a sentence and seamlessly extended to a paragraph or an entire document. We have employed two existing code-mixed corpora to suit the requirements of our study.
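For reference, a small sketch of one existing measure of this kind, a sentence-level Code-Mixing Index in the spirit of Das and Gambäck, which such improved indices build on conceptually. The paper's own proposed index is not reproduced here, and the exact formula below is an assumption for illustration.

```python
# Small sketch of an existing code-mixing complexity measure (CMI-style).
# Input is a list of per-token language tags, with "univ" marking
# language-independent tokens (punctuation, numbers, etc.).
from collections import Counter

def code_mixing_index(lang_tags):
    n = len(lang_tags)
    counts = Counter(t for t in lang_tags if t != "univ")
    u = n - sum(counts.values())          # language-independent tokens
    if n == u or not counts:              # no language-tagged tokens
        return 0.0
    max_lang = max(counts.values())       # tokens of the dominant language
    return 100.0 * (1.0 - max_lang / (n - u))

# Example: a mostly-English sentence with two Hindi tokens scores ~33.3
print(code_mixing_index(["en", "en", "hi", "en", "univ", "hi", "en"]))
```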
Wafa Wali, Bilel Gargouri and Abdelmajid Ben Hamadou. Sentences similarity computation based on VerbNet and WordNet.
Abstract: Sentence similarity computation is of growing importance in several applications such as question answering, machine translation, information retrieval and automatic abstracting systems. This paper first sums up several methods for calculating similarity between sentences which take semantic and syntactic knowledge into account. Second, it presents a new method for measuring sentence similarity that aggregates, in a linear function, three components: the lexical similarity LexSim based on common words, the semantic similarity SemSim based on synonyms, and the syntactico-semantic similarity SynSemSim based on common semantic arguments, notably thematic role and semantic class.

Concerning the word-based semantic similarity, a measure is computed to estimate the degree of semantic relatedness between words by exploiting the WordNet "is a" taxonomy. The determination of semantic arguments is based on the VerbNet database.

The proposed method yielded competitive results compared to previously proposed measures on Li's benchmark, showing a high correlation with human ratings. Further, experiments performed on the Microsoft Paraphrase Corpus showed the best F-measure values compared to other measures for high similarity thresholds.
A. K. Fischer, Jilles Vreeken and Dietrich Klakow. Beyond Pairwise Similarity: Quantifying and Characterizing Linguistic Similarity between Groups of Languages by MDL
Abstract: In this paper, we present a minimum description length-based algorithm for finding the regular correspondences between related languages and show how it can be used to quantify the similarity between not only pairs, but whole groups of languages directly from cognate sets.
Unlike previous work, our approach is not limited to pairs of languages, does not limit the size of correspondences, does not make assumptions about the shape or distribution of correspondences, and requires no expert knowledge or fine-tuning of parameters.
We employ a two-part code and use data and model complexities of the discovered correspondences and the joint compressibility of different languages as information-theoretic quantifications of the degree of regularity of cognate realizations in these languages.
We here test our approach on the Slavic languages.
In a pairwise analysis of 13 Slavic languages, we show that our algorithm replicates their linguistic classification exactly.
In a four-language experiment, we demonstrate how our algorithm efficiently quantifies similarity between all subsets of the four analyzed languages and find that it is excellently suited to quantifying the orthographic regularity of closely related languages.
Ranjan Satapathy, Erik Cambria, Shirley Ho and Jin-Cheon Na. Subjective Detection on twitter
Abstract: Subjective detection aims to classify natural language as either opinionated (positive or negative) or neutral. This paper examines the statistical significance of the "nuclear energy" and "silk road" topics around the globe, using Twitter as a source of opinion. The authors employ dynamic Gaussian Bayesian networks to learn significant network motifs of words and concepts. The results show high significance in the detection of subjective and objective tweets on the "nuclear energy" and "silk road" topics. Our experiments could help government agencies decide and act according to the need of the hour if a nuclear power plant is planned in the near future. The shift in sentiments can be recorded country by country and acted upon accordingly. The model helps gauge the public's perception of nuclear energy and the Silk Road in real time. The paper establishes the statistical significance of both topics and concludes that Twitter is a valuable source for opinion mining.
Sindhuja Gopalan and Sobha Lalitha Devi. Cause and Effect Extraction from Biomedical Corpus
Abstract: The objective of the present work is to automatically extract cause and effect from a discourse-analyzed biomedical corpus. Cause-effect is defined as a relation established between two events, where the first event acts as the cause of the second event and the second event is the effect of the first. Any causative construction needs three components: a causal marker, a cause and an effect. In this study, we consider the automatic extraction of cause and effect realized by explicit discourse connective markers. We evaluated our system using the BioNLP/NLPBA 2004 shared task test data and obtained encouraging results.
Jasmina Smailović, Martin Žnidaršič, Aljoša Valentinčič, Igor Lončarski, Marko Pahor, Pedro Tiago Martins and Senja Pollak. Automatic Analysis of Financial Annual Reports: a Case Study
Abstract: The main goal of financial reporting in the financial system is to ensure high quality and useful information about the financial position of firms, and to make it available to a wide range of users, including existing and potential investors, financial institutions, employees, the government, etc. Formal reports contain both strictly regulated, financial sections, as well as unregulated, narrative parts.
Our research starts from the hypothesis that there is a relation between business performance and not only content, but also the linguistic properties of unregulated parts of annual reports.
In the paper we first present our dataset of financial reports and the techniques we used to extract the unregulated textual parts. Next, we introduce our approaches of differential content analysis and analysis of correlation with financial aspects. The differential content analysis is based on TF-IDF weighting and is aimed at finding the characteristic terms for each year (i.e. the terms which were not prevailing in the previous reports by the same firm).
For the correlation of linguistic characteristics of reports with financial aspects, an array of linguistic features was considered and selected financial indicators were used. The linguistic features range from simple ones, such as the personal/impersonal pronoun ratio, to more elaborate ones, such as sentiment, novel custom trust and doubt word lists, and discursive features expressing certainty, modality, etc. While some features show a strong correlation with industry (e.g., shorter and more personal reports in the IT industry compared to the automotive industry), doubt and communication words (as well as necessity and cognition words to some extent) are positively correlated with failure.
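To make the differential content analysis concrete, here is a minimal sketch (not the authors' pipeline) of the idea: treat each year's unregulated report text as one document, weight terms with TF-IDF across the firm's years, and take a year's top-weighted terms as its characteristic terms. The report snippets and the top-k cutoff are illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer

reports_by_year = {
    2014: "stable revenue growth in core automotive segment ...",
    2015: "restructuring costs and uncertainty in emerging markets ...",
    2016: "digital transformation, new mobility services and trust ...",
}

years = sorted(reports_by_year)
docs = [reports_by_year[y] for y in years]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

for row, year in enumerate(years):
    scores = tfidf[row].toarray().ravel()
    top = scores.argsort()[::-1][:5]          # 5 most characteristic terms for the year
    print(year, [terms[i] for i in top if scores[i] > 0])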
Rakesh Verma and Daniel Lee. Extractive Summarization: Limits, Compression, Generalized Model and Heuristics
Abstract: Due to its promise to alleviate information overload, text summarization has attracted the attention of many researchers. However, it has remained a serious challenge. Here, we first prove empirical limits on the recall (and F1-scores) of extractive summarizers on the DUC datasets under ROUGE evaluation for both the single-document and multi-document summarization tasks. Next, we define the concept of compressibility of a document and present a new model of summarization, which generalizes existing models in the literature and integrates several dimensions of summarization, viz., abstractive versus extractive, single versus multi-document, and syntactic versus semantic. Finally, we examine some new and existing single-document summarization algorithms in a single framework and compare them with state-of-the-art summarizers on DUC data.
Balázs Indig. Less is More, More or Less... -- Finding the Optimal Threshold for Lexicalisation in Chunking
Abstract: Lexicalisation of the input of sequential taggers has come a long way since it was introduced by Molina and Pla (2002). In this paper we thoroughly investigate the method introduced by Indig and Endrédy (2016) to find the best lexicalisation level for chunking and to explore the behaviour of different IOB representations. Both tasks are applied to the CoNLL-2000 dataset. Our goal is to introduce a transformation method that accommodates the parameters of the development set to the training set using their frequency distributions, from which other tasks like POS tagging or NER could benefit as well.
Samiksha Tripathi and Dr. Vineet Kansal. Using Linguistic Knowledge for Machine Translation Evaluation with Hindi as a Target Language
Abstract: Several proposed metrics of MT evaluation, such as BLEU, have been criticised for their poor performance in evaluating machine translations. Languages like Hindi, which have relatively free word order and are morphologically rich, pose additional problems in such evaluation. We attempt here to make use of linguistic knowledge to evaluate machine translations with Hindi as a target language. We formulate the problem of MT evaluation as a minimum cost assignment problem between test and reference translations, with a cost function based on linguistic knowledge.

Polibits

Anupam Mondal, Erik Cambria, Dipankar Das and Sivaji Bandyopadhyay. Sentiment based identification of affinity and gravity scores to judge relations of medical concepts and glosses
Abstract: Sentiment-based relations are considered important clues for identifying the hidden links between medical concepts as well as for linking various concepts with their source glosses, represented as descriptive explanations. The affinity score describes to what extent two concepts are linked with each other by measuring the number of common sentiment words, whereas the gravity score identifies the sentiment-oriented relevance between medical concepts and their various glosses. To uncover salient connections between the concepts, we employ the existing WordNet of Medical Events (WME 2.0) lexicon and enrich it with the affinity and gravity scores. We employed the supervised Naïve Bayes and Sequential Minimal Optimization classifiers as well as the unsupervised K-Means classifier to evaluate the identified sentiments of the medical concepts present in the WME 2.0 lexicon. The comparative results give Cohen's kappa scores of 0.73 and 0.78 before and after taking the affinity and gravity features into account, respectively. The agreement analysis was performed by medical practitioners, and the higher agreement score indicates the positive influence of the affinity and gravity features in determining sentiment relations. In a further validation of the affinity and gravity features with WME 2.0, the three machine learning classifiers provide improved F-measure scores of 0.915, 0.915, and 0.974, which are much higher than the scores obtained without these features.
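Purely as an illustration of the affinity idea as the abstract describes it (shared sentiment words between two concepts' glosses), here is a minimal sketch; the sentiment lexicon, glosses and normalization are toy assumptions, not WME 2.0 data.

# Toy sentiment lexicon (assumption, not the WME 2.0 resource)
SENTIMENT_WORDS = {"pain", "relief", "severe", "chronic", "improve", "risk"}

def affinity(gloss_a: str, gloss_b: str) -> float:
    """Share of sentiment words common to both glosses, normalized to [0, 1]."""
    sent_a = set(gloss_a.lower().split()) & SENTIMENT_WORDS
    sent_b = set(gloss_b.lower().split()) & SENTIMENT_WORDS
    union = sent_a | sent_b
    return len(sent_a & sent_b) / len(union) if union else 0.0

print(affinity("chronic pain with severe risk",
               "treatment gives relief from chronic pain"))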
Khin War War Htike, Ye Kyaw Thu, Zuping Zhang, Win Pa Pa, Yoshinori Sagisaka and Naoto Iwahashi. Comparison of Six POS Tagging Methods on 10K Sentences Myanmar POS Tagged Corpus
Abstract: A robust Myanmar Part-of-Speech (POS) tagger is necessary for Myanmar natural language processing (NLP) research, but none is publicly available yet. For this reason, we developed a manually annotated ten-thousand-sentence POS tagged corpus for the general domain. We also evaluated six POS tagging approaches on this corpus: Conditional Random Fields (CRFs), Hidden Markov Model (HMM), Maximum Entropy (MaxEnt), Support Vector Machine (SVM), Ripple Down Rules-based (RDR) and the Two Hours of Annotation approach (i.e. a combination of HMM and Maximum Entropy Markov Model). The POS tagging results were measured with precision, recall and F-score, as well as manual checking of confusion pairs. The results show that the CRF approach gave the best performance on both closed and open test sets. The RDR, HMM and MaxEnt approaches also give strong results. We plan to release our POS tagged corpus and trained models in early 2017.

IJCLA

Minoru Yoshida, Kazuyuki Matsumoto and Kenji Kita. Acceleration of Similar Word Search in Distributed Representation
Abstract: We report our study on accelerating the search for similar words in distributed representations using the cosine similarity measure. Our main idea is to convert the similarity value to a Euclidean distance and apply a branch-and-bound approach to the distance calculation itself, i.e., to terminate a distance calculation when the current value exceeds the current N-th best similarity value, where N is the size of the final list of words similar to the query. We also apply other speed-up techniques such as converting the vector representation using PCA and reducing the number of addition and multiplication operations in the distance calculation.
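For illustration only (not the authors' code), a minimal sketch of the early-termination idea described above: with unit-normalized vectors, squared Euclidean distance equals 2 - 2*cosine, so the most similar words are the nearest ones, and a candidate can be abandoned as soon as its partial distance exceeds the current N-th best. The vocabulary and random vectors are toy assumptions.

import heapq
import numpy as np

def top_n_similar(query, vectors, words, n=5):
    best = []                                   # max-heap of (-distance, word) for current top-N
    for vec, word in zip(vectors, words):
        bound = -best[0][0] if len(best) == n else float("inf")
        dist, pruned = 0.0, False
        for q, v in zip(query, vec):            # accumulate partial squared distance
            dist += (q - v) ** 2
            if dist > bound:                    # exceeds current N-th best: stop early
                pruned = True
                break
        if not pruned:
            heapq.heappush(best, (-dist, word))
            if len(best) > n:
                heapq.heappop(best)             # drop the current worst candidate
    return sorted((-d, w) for d, w in best)     # (distance, word), nearest first

rng = np.random.default_rng(0)
vecs = rng.normal(size=(1000, 50))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)   # unit-normalize so distance mirrors cosine
words = [f"w{i}" for i in range(1000)]
print(top_n_similar(vecs[0], vecs, words, n=5))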
Yahaya Alassan Mahaman Sanoussi. Enhancing Short Text Clustering by Using Data Categorized
Abstract: The classification of textual data has emerged over the last ten years as one of the most popular techniques of data analysis. This is due to the massive amount of data exchanged on social networks, blogs, customer feedback on products, etc. These textual data, called short messages, have the particularity of being very brief compared to traditional texts. The classification algorithms generally used encounter difficulties in providing satisfactory results on these types of texts. Classification algorithms make it possible to synthesize the information contained in masses of data by representing them in partitions. The aim of this article is to show how partitions of short messages can be used to improve the quality of classification algorithms for new short messages. The process consists in creating a thesaurus from the partitions of the short messages of a subject. The thesaurus then improves the quality of the classification of new short messages on the same subject by enriching them.
Nigel Dewdney. MESME: Multi-word Expression Extraction for Social Media in English
Abstract: The identification of multi-word expressions is important as they pose a challenge for language learners and NLP systems alike. Informal language poses additional complications for their systematic extraction.
This paper describes MESME, a two stage parsing and filtering approach for identification of candidate sequences for three types of multi-word expressions within English text. The first stage performs a shallow syntactic parse by means of state machines. Part-of-speech tagging accuracy is shown to be a significant factor in overall performance. The second stage employs machine learning on candidate expression features to filter out those incorrectly identified.

MESME is shown to achieve reasonable performance for short MWEs when operated on micro-blog messages rather than edited material using a tagger designed for such content. Achieving the highest rates of identification, particularly for longer expressions, without a high cost in precision remains a challenge.

Finally, an analysis of extracted MWE candidates (not originally annotated in the wiki50 corpus) is given. This shows that finding consensus on which lexical sequences constitute MWEs is a non-trivial task, as inter-annotator agreement is measured at 0.24 by Fleiss's kappa.
Yunfei Long, Qin Lu, Minglei Li, Lin Gui and Chu-Ren Huang. Predicting implicit preferences from user comments: Are missing comments useful?
Abstract: Many user preference prediction methods are trained on comments written by users in social media. However, comments written by users generally have a long-tail distribution, and thus data sparseness is a big issue. Inspired by the stimulus generalization theory and the halo effect in sociology and cognitive science, we propose a novel approach to predict user preferences by learning from both observed comments and missing comments, based on the missing-not-at-random hypothesis. First, we construct a user-word heterogeneous network embedding model to learn user and word representations from observed comments. Then, we construct a user-word matrix and a user-user similarity matrix to model missing comments by users. By integrating both missing comments and observed comments, we learn the final user-to-user representation through joint matrix factorization to include the "silent" comments in the final representation. Experiments indicate our model achieves a 1.89% improvement in F-score compared to the current state-of-the-art representation methods, with a p-value of 8.5e-13, showing that the improvement is highly significant.
Boris Galitsky and Greg Makowski. Document Classifier for a Data Loss Prevention System based on Learning Rhetoric Relations
Abstract: We build a Data Loss Prevention (DLP) system for document classification that relies on discourse-related features for higher detection accuracy. We demonstrate that using rhetoric relations and anaphora on top of syntactic data allows recognition of the document style, which is essential in determining whether a document is sensitive or not. We introduce a number of DLP domains, such as sensitive engineering, legal and financial documents, and build detectors for them. A dataset to train and test such detectors is built and made available for DLP benchmarking. The superior performance of the DLP detectors in comparison with keyword-based and statistical keyword learning-based approaches is shown. We also use a rule system in a post-processing stage to reduce false alerts, given that for some job roles (such as HR, public relations or legal) it is valid to email out certain documents that would be considered invalid for the general population to send out of the company.
Boris Galitsky, Anna Parnis and Daniel Usikov. Exploring Discourse Structure of User-Generated Content
Abstract: Since syntactic-level analysis of user-generated content is limited due to its poor grammar and incompleteness, we attempt to apply a higher-level, domain-independent discourse analysis and explore in which NLP tasks it can be leveraged. Since communicative discourse can be extracted from user-generated text more reliably than factual data, we extend the traditional discourse tree with communicative structure-based features. A communicative discourse tree (CDT) is defined as a discourse tree with verbs for communicative actions as labels for its arcs. We identify three text classification tasks which rely on learning CDTs: sentiment analysis, text authenticity and answer classification for question answering in social domains. These classification tasks are implemented as tree kernel learning of CDTs. We demonstrate that this discourse-level technique outperforms traditional keyword-based statistical approaches in all three tasks. We also show that this improvement is larger for user-generated content in comparison with professionally written content.
Ameneh Gholipour Shahraki and Osmar Zaïane. Lexical and Learning-based Emotion Mining from Text
Abstract: Emotion mining from text refers to the detection of people's emotions based on observations of their writings. In this work, we study the problem of text emotion classification. First, we collect and cleanse a corpus of Twitter messages that convey at least one of the targeted emotions; then, we propose several lexical and learning-based methods to classify the emotion of test tweets and study the effect of different feature sets. Our experimental results show that a set of Naive Bayes classifiers, each corresponding to one emotion, using unigrams as features, is the best performing method for the task. In addition, we test our approach on other datasets, both Twitter and formally written texts, and show that it achieves higher accuracy compared with state-of-the-art methods.
Shih-Feng Yang and Julia Rayz. An Event Detection Approach Based On Twitter Hashtags
Abstract: Twitter is one of the most popular microblogging services in the world. The great amount of information within Twitter makes it an important information channel for people to learn and share news. Twitter hashtags are a popular feature that can be viewed as human-labeled information which people use to identify the topic of a tweet. Many researchers have proposed event detection approaches that can monitor Twitter data and determine whether special events, such as accidents, extreme weather, earthquakes, or crimes, take place. Although many approaches use hashtags as one of their features, few of them explicitly focus on the effectiveness of using hashtags for event detection. In this study, we propose an event detection approach that utilizes hashtags in tweets. We adopt the feature extraction used in STREAMCUBE and apply a K-means clustering approach to it. The experiments demonstrate that the K-means approach performs better than STREAMCUBE in the clustering results. A discussion of optimal K values for the K-means approach is also provided.
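As an illustrative sketch only (the STREAMCUBE feature set is not reproduced here), the clustering step can be pictured as representing each tweet by its hashtags, vectorizing them, and running K-means so that each cluster is a candidate event. The tweets and the value of K are toy assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

tweets = [
    "#earthquake felt downtown, buildings shaking",
    "magnitude 5.8 #earthquake #breaking",
    "#election results are coming in tonight",
    "long queues at polling stations #election #vote",
]
# keep only the hashtags of each tweet as its representation
hashtags = [" ".join(t for t in tw.split() if t.startswith("#")) for tw in tweets]

X = TfidfVectorizer().fit_transform(hashtags)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for tweet, label in zip(tweets, labels):
    print(label, tweet)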
Paul Reisert, Naoya Inoue, Naoaki Okazaki and Kentaro Inui. Designing a Task for Recognizing Argumentation Logic in Argumentative Texts
Abstract: Argumentation mining is the task of recognizing argumentative structures in unstructured documents. In this paper, we propose a novel argumentation mining task for recognizing the logical structure of debates. Given a pair of connected argumentative texts, the task is to output a single argumentation logic graph consisting of decomposed bipolar causal relations. We first compose annotation guidelines through preliminary observation of online debates. To determine the feasibility of our task, we conduct a trial annotation study using two annotators and 13 argumentative text pairs. Although we encountered several issues in our trial annotation, our results indicate that we can reasonably represent the logical structure between a pair of argumentative texts, as indicated by the observed logical attacks.
Masanori Hayashi, Ryohei Sasano, Hiroya Takamura and Manabu Okumura. Judging CEFR Levels of English Learner's Essays Based on Error-type Identification and Text Quality Measures
Abstract: We present a system for automatically judging levels of English essays written by English learners.
The system leverages error types and text quality measures, i.e., grammaticality and readability of sentences.
First, we describe how we build the grammatical error correction system and how it can detect grammatical errors in essays.
Next, we explain how we measure the grammaticality and readability of sentences in essays using the acceptability measure and the surprisal value.
Then, we show how we judge the levels of essays based on the error-type tendency and the grammaticality and readability measures.
Experiments on the Japanese English as a foreign language learner (JEFLL) corpus with the common European framework of reference for languages (CEFR) levels reveal that both the error type and the grammaticality and readability are effective in judging the CEFR levels.
Lucia Comparin and Sara Mendes. Error detection and error correction for improving quality in machine translation and human post-editing
Abstract: Machine translation (MT) has been an important field of research in the last decades and is currently playing a key role in the translation market. The variable quality of results makes it necessary to combine MT with post-editing to obtain high-quality translation. Post-editing is, however, a costly and time-consuming task. Additionally, it is possible to improve the results by integrating more information in automatic systems. In order to improve the performance of automatic systems, it is crucial to evaluate the quality of results produced by MT systems to identify the main errors. In this study, we assessed the results of MT using an error-annotated corpus of texts translated from English into Italian. The data collected allowed us to identify frequent and critical errors. Detecting and correcting such errors would have a major impact on the quality of translation and make the post-editing process more accurate and efficient. The errors were analyzed in order to identify error patterns, and solutions to address them automatically or semi-automatically are presented. To achieve this, a set of rules is formulated and integrated in a tool which detects or corrects the most frequent and critical errors in the texts.
Ryo Asakura, Hirotaka Niitsuma and Manabu Ohta. Recurrent Neural Networks on Convoluted Word Vectors for Aspect-Based Sentiment Analysis
Abstract: Aspect-Based Sentiment Analysis (ABSA) is one of the tasks of Sentiment Analysis. We propose a neural network model combining sentence-level convolution and LSTM units for this task. Our model, powered by pretraining and transfer learning, achieved performance comparable to the state-of-the-art techniques proposed for the ABSA task on restaurant reviews in SemEval 2016.
Jan Rygl. Enhancing Similarity Based Authorship Verification using Corpus
Abstract: The authorship verification problem can be defined as the task of determining whether two given texts were or were not written by the same author. A similar task, authorship attribution, consists in choosing one author out of a predefined set of candidate authors as the most probable composer of a given document. The second task is usually transformed into a classification problem where the authors represent category names. In this respect, authorship verification corresponds to an open-class variant of authorship attribution.

As the authorship attribution task (in the closed-class variant) can be solved with significantly higher accuracy, we suggest transforming the problem of authorship verification to be more similar to authorship attribution using two novel techniques: Ranking Distance and Corpus Ranking.

The results indicate that the problem transformation and the application of our optimizations increase the accuracy of authorship verification algorithms. All experiments were performed on Czech books, Slovak Internet news and English SMS messages, but the proposed algorithms are document-type and language independent.
Paul Reisert, Naoya Inoue, Naoaki Okazaki and Kentaro Inui. Identifying and Ranking Relevant Claims for Decision Support
Abstract: For decision-making, having extensive knowledge regarding support for a decision can have a strong impact on the course of action to pursue. In this work, we propose a method for automatically collecting, identifying, and ranking relevant knowledge pieces for controversial decisions. We first investigate if humans can reasonably identify relevant knowledge pieces for a decision through crowdsourcing. Afterwards, we construct two gold datasets from the crowdsourcing results for identifying knowledge pieces and for ranking them. We create a simple rule-based classifier for relevant knowledge piece identification, and a trainable classifier for ranking evaluated on a topic-based cross validation. Our experiment demonstrates that it is difficult for humans to identify and rank certain instances of knowledge pieces. Finally, we found that a computational model for topic-independent claim ranking is feasible.
Yazan Jaradat and Ahmad Al-Taani. Arabic Single-Document Text Summarization Using Harmony Search
Abstract: The huge amount of data offered by the web has led to an age of information abundance, which has made searching the internet a time-consuming process. This research presents a single-document extractive Arabic text summarization approach based on Harmony Search (HS). We incorporate HS into the summarization process to obtain the best summary of a document, owing to the ability of HS to reach global minima/maxima. The proposed summarization approach is evaluated using the Essex Arabic Summaries Corpus (EASC) and the ROUGE evaluation method to determine its accuracy. The obtained results show that the proposed approach achieves competitive ROUGE scores and outperforms many state-of-the-art methods.
Changliang Li, Bo Xu and Xiuying Wang. Encoder-Memory-Decoder Model for Long Conversation Dialogue System
Abstract: One long-term goal in the artificial intelligence field is to build an intelligent dialogue agent. Recently, with the development of deep learning, popular dialogue systems have been built on the encoder-decoder framework for sequence-to-sequence learning, just like Neural Machine Translation. However, this approach can only handle single-turn dialogue without consistency, due to its lack of ability to acquire dialogue history information. It is still challenging to build a dialogue system that works reasonably well for long conversations (multiple turns). In this paper, we propose an Encoder-Memory-Decoder model to build a long-conversation dialogue system in a neural generative way. It can be viewed as an end-to-end neural network model equipped with the ability to memorize dialogue history information for generative dialogue response. More specifically, the proposed model requires few hand-crafted rules and generates more flexible responses. An empirical study shows the proposed model can effectively deal with long conversations and can generate correct and natural responses coherently. This model gives a new perspective for building long-conversation dialogue systems.
Zijian Győző Yang and László János Laki. πRate: A Task-oriented Monolingual Quality Estimation System
Abstract: Psycholinguistically motivated natural parsing is a new, human-oriented computational language processing approach. This complex real-time model has several parallel threads to analyze the input words, phrases or sentences. One of the main threads can be the quality estimation module, which informs, controls and filters noisy or erroneous input. To build this quality controller module, we implemented the quality estimation method traditionally used in the field of machine translation evaluation. To tailor the quality estimation model to the monolingual natural parsing system, we optimized the architecture with a task-oriented approach. In our research, a quality estimation system is built for monolingual text input. Using this system we can provide quality indicators for the input with an accuracy of ~72%. The system is created for the AnaGramma Hungarian natural parsing system, but it can be used for other languages as well. Our method can incrementally estimate the quality of the input in real time while it is generated.
Syed Sarfaraz Akhtar, Arihant Gupta, Avijit Vajpayee, Arijt Srivastava and Manish Shrivastava. Unsupervised Morphological Expansion of Small Datasets for Improving Word Embeddings
Abstract: We present a language independent, unsupervised method for building word embeddings using morphological expansion of text. Our model handles the problem of data sparsity and yields improved word embeddings by relying on training word embeddings on artificially generated sentences. We evaluate our method using small sized training sets on eleven test sets for the word similarity task across seven languages. Further, for English, we evaluated the impacts of our approach using a large training set on three standard test sets. Our method improved results across all languages.
Emna Hkiri. Integrating Bilingual NE Lexicon with CRF Model for Arabic NER
Abstract: Named Entity Recognition plays an important role in locating and classifying atomic elements into predefined categories such as person names, locations, organizations, temporal expressions, etc. Several approaches based on rule-based and machine learning techniques have been applied successfully to English and some other Latin-script languages. Arabic has a complex and rich morphology, which makes named entity recognition a challenging process. In this paper we propose a hybrid NER system that applies conditional random fields (CRF), a bilingual NE lexicon and grammar rules to the task of Named Entity Recognition in Arabic. The aim of our system is to enhance the overall performance of NER tasks. The empirical results indicate that the hybrid system outperforms the state of the art of Arabic NER in terms of precision when applied to the ANERcorp dataset, with F-measures of 81.65 for Person, 71.7 for Location, and 89.6 for Organization.
Ashish Palakurthi. New Data is Indeed Helping Lexical Simplification
Abstract: We propose the use of a new corpus for Complex Word Identification, a sub-problem of Lexical Simplification, and conduct an empirical evaluation by comparing it with benchmark corpora previously employed for this task. Our experiments suggest that the proposed corpus is effective for Complex Word Identification, thus helping Lexical Simplification.
Carmen Klaussner and Carl Vogel. Revisiting hypotheses on linguistic ageing in literary careers
Abstract: Aspects of language change in individuals have been addressed in different lines of work: sometimes for literary purposes, such as investigating change in style over time (Smith and Kelly, 2002), or in the area of neurolinguistics, through assessing how diseases such as Alzheimer's or dementia affect patients' language complexity over time (Kemper et al., 2001). In addition, the property of language change can be used indirectly to detect the temporal origin of a text (Klaussner and Vogel, 2015). When assessing individual language change, there is always the question of the ever-present underlying language change and how to disentangle individual effects from the more general ones. As already discussed by Daelemans (2013), when analyzing linguistic variables of texts of various temporal origins, the temporal dimension can act as a confounding factor and might misconstrue the subsequent synchronic interpretation. Similarly, when considering individuals' language properties one has to control for the underlying language properties, especially if these are likely to be subject to change. In this case, we consider the effect of age on language development, where general language change might act as a confounding factor, possibly causing phenomena in the individual that are actually due to general effects pervading all language. The study by Pennebaker and Stone (2003) investigates how the age of a person affects certain linguistic categories, such as preference for particular pronouns or tenses, with respect to two very different data sets: one based on self-reports from emotional disclosure studies and the other based on the collected works of 10 different authors across their individual life spans. However, neither analysis appears to have considered or evaluated the influence of general language change on the text samples in question, something that we plan to remedy as part of this study. In addition, this work will consider different realizations or interpretations of the variables in question and conduct additional analyses beyond what would be necessary to compare to the earlier study.
Carlos-Emiliano Gonzalez-Gallardo, Eric Sanjuan and Juan-Manuel Torres-Moreno. Extending Text Informativeness measures to Passage Interestingness evaluation: Language Model vs Word Embedding representation
Abstract: Informativeness measures used to evaluate Automatic Text Summarization systems are mainly variations of term or overlap measurements against one or several reference summaries. The measures differ in the text units (n-grams, entities, nuggets, etc.) they consider and in the method (ROUGE, Kullback-Leibler divergence, word-embedding-related measures) used to evaluate the overlap with the references. In this paper we study the ability of these measures to predict the informativeness of short factual sentences, based on a large data set of sentences extracted from Wikipedia by state-of-the-art passage retrieval systems and manually assessed by humans with regard to a pool of 60 topics extracted from Twitter.
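For illustration only, a minimal sketch of one family of measures mentioned above: score a candidate passage by the (smoothed) Kullback-Leibler divergence between its unigram distribution and that of the reference texts; lower divergence suggests the passage carries more of the reference information. The texts and the smoothing constant are toy assumptions.

import math
from collections import Counter

def unigram_dist(text, vocab, alpha=0.01):
    """Smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

reference = "the volcano eruption forced thousands of residents to evacuate"
candidate = "thousands of residents were evacuated after the eruption"
vocab = set(reference.lower().split()) | set(candidate.lower().split())
print(kl_divergence(unigram_dist(candidate, vocab), unigram_dist(reference, vocab)))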
Amosse Edouard, Elena Cabrio, Sara Tonelli and Nhan Le Thanh. Semantic Linking for Event-Based Classification of Tweets
Abstract: Detecting which tweets are related to events and classifying them into categories is a challenging task due to the peculiarities of Twitter language and to the lack of contextual information. We propose to face this challenge by taking advantage of the information that can be automatically acquired from external knowledge bases. In particular, we enrich and generalise the textual content of tweets by linking the Named Entities (NE) to concepts in both DBpedia and YAGO ontologies, and exploit their specific or generic types to replace NE mentions in tweets. The approach we propose in this paper is applied to build a supervised classifier to separate event-related from non event-related tweets, as well as to associate to event-related tweets the event categories defined by the Topic Detection and Tracking community (TDT). We compare Naive Bayes (NB), Support Vector Machines (SVM) and Long Short-Term Memory (LSTM) classification algorithms, showing that NE linking and replacement improves classification performance and contributes to reducing overfitting, especially with Recurrent Neural Networks (RNN).
Darshan Agarwal, Vandan Mujadia and Radhika Mamidi. A Modified Annotation Scheme For Semantic Textual Similarity
Abstract: Semantic similarity plays an important role in several natural language related tasks such as evaluation of machine translation, textual entailment, entity linking, document clustering, etc. To be able to make machines do such tasks automatically, one needs a well-annotated textual semantic similarity corpus. Therefore, as an initial stepping stone, we present a modified annotation scheme for measuring semantic similarity between two Hindi sentences. Given two sentences, the goal of the annotation is to give a similarity score between the two sentences on a scale of 0 to 5. We observed several difficulties in assigning similarity scores by following the (Agirre et al., 2012) annotation scheme. To overcome those difficulties, we developed a new scoring scheme and observed considerably better inter-annotator agreement compared to the (Agirre et al., 2012) annotation scheme. Using our annotation scheme we also annotated the degree of semantic relatedness on 750 pairs of Hindi sentences.
Darshan Agarwal, Vandan Mujadia, Dipti Misra Sharma and Radhika Mamidi. Semantic Textual Similarity For Hindi
Abstract: Semantic textual similarity is the degree of semantic equivalence between two sentences. We may also say it is the ability to substitute one text for the other without changing its meaning. In this paper, we propose unsupervised and supervised systems which measure the semantic relatedness between two Hindi sentences on a scale of 0 (least similar) to 5 (most similar). Both systems make use of several syntactico-semantic features such as language-specific linguistic characteristics, distributional semantics and dependency clusters. With several constraints on these features, our unsupervised system is able to achieve around 75.23% accuracy on a Hindi news similarity corpus. In the supervised approach, we use a support vector machine (SVM) with the above-mentioned features and the Euclidean distance between dependency clusters to derive word-level alignments. Later, we use these alignments to assign a similarity score between two sentences. With this approach we are able to achieve a considerable accuracy of 70.01% on a small set of our corpus.
Razieh Ehsani and Olcay Taner Yıldız. Initial Efforts in Creating a Persian-English Parallel TreeBank
Abstract: In this paper, we introduce our preliminary efforts in constructing a Persian-English parallel treebank corpus. We extract 1,500 sentences from the Penn Treebank, where each sentence contains at most 15 tokens including punctuation. In our approach, we replace English words with Persian equivalents and reorder subtrees without changing the Penn tags.
Aditya Mogadala, Dominik Jung and Achim Rettinger. Linking Tweets with Monolingual and Cross-Lingual News using Transformed Word Embeddings
Abstract: Social media platforms have grown into an important medium for spreading information about events published by the traditional media, such as news articles. Grouping such diverse sources of information that discuss the same topic from varied perspectives provides new insights. But the gap in word usage between informal social media content such as tweets and diligently written content (e.g. news articles) makes such assembling difficult. In this paper, we propose a transformation framework to bridge the word usage gap between tweets and online news articles across languages by leveraging their word embeddings. Using our framework, word embeddings extracted from tweets and news articles are aligned closer to each other across languages, thus facilitating the identification of similarity between news articles and tweets. Experimental results show a notable improvement over baselines for monolingual tweet and news article comparison, while new findings are reported for cross-lingual comparison.
Emmanuel Cartier, Gaël Lejeune, Kata Gabor and Thierry Charnois. A System for Multilingual Online Neologism Tracking
Abstract: This paper details a system designed to track neologisms in seven languages in corpora of online news articles. This system combines state-of-the-art methodology to automatically track new words as well as semantic changes, and a web platform for linguists to create and manage their corpora, accept or reject automatically identified neologisms, provide a linguistic description of the accepted neologisms and follow their life cycle in the corpora. After a state-of-the-art overview of Neologism Retrieval, Analysis and Life-tracking, we describe the overall architecture of the system and detail our current research for improving formal neologisms detection and semantic change tracking. We show results on French corpora and we plan to use these methods for new languages, including poorly endowed ones.
Haithem Afli, Sorcha McGuire and Andy Way. Sentiment Translation for low resourced languages: Experiments on Irish General Election Tweets
Abstract: This paper presents two main methods of Sentiment Analysis (SA) of User-Generated Content for a low-resource language: Irish. The first method, automatic sentiment translation, applies existing English SA resources to both manually- and automatically-translated tweets. We obtained an accuracy of 70% using this approach. The second method involved the manual creation of an Irish-language sentiment lexicon: Senti-Foclóir. This lexicon was used to build the first Irish SA system, SentiFocalTweet, which produced superior results to the first method, with an accuracy of 76%. This demonstrates that translation from Irish to English has a minor effect on the preservation of sentiment; it is also shown that the SentiFocalTweet system is a successful baseline system for Irish sentiment analysis.
Imen Bouaziz Mezghanni and Faiez Gargouri. CrimAr: A criminal Arabic ontology for a benchmark based evaluation
Abstract: Ontologies are playing a pivotal role in the modern Semantic Web by capturing knowledge in a particular domain of interest while emphasizing interoperability and establishing a common shared understanding among the involved actors of web-based applications. However, given the abundance of approaches proposed for ontology learning in the recent literature, a related problem concerning the evaluation of such automatically generated ontologies is emerging in various domains. In the Arabic legal domain, a benchmark golden ontology, used to ensure the quality of the (semi-)automatically generated ones, is a necessity. This paper presents CrimAr, a handcrafted ontology based on the top levels of LRI-Core that represents all relevant knowledge in the Arabic legal domain, especially in criminal matters. The use of CrimAr is demonstrated in a real evaluation case.
Thi Bich Ngoc Hoang, Véronique Moriceau and Josiane Mothe. Predicting Locations in Tweets
Abstract: Five hundred million tweets are posted daily, making Twitter a major social medium from which topical information on events can be extracted. Events are represented by time, location and entity-related information. This paper focuses on location, which is an important clue for both users and geo-spatial applications. We address the problem of predicting whether a tweet contains a location or not, as location prediction is a useful pre-processing step for location extraction, by defining a number of features to represent tweets and conducting an intensive evaluation of machine learning parameters. We found that: (1) not only are words appearing in a geography gazetteer important, but so is the occurrence of a preposition right before a proper noun; (2) it is possible to improve precision on location extraction if the occurrence of a location is predicted.
Hazem Abdelaal, Brian Davis and Manel Zarrouk. From Simplified Text to Knowledge Representation using Controlled Natural Language
Abstract: Knowledge-based systems provide means to store data and perform reasoning on top of it. Controlled Natural Language (CNL) is considered an engineered subset of natural language. CNLs aim to abstract away the complexity of natural languages and to reduce or abolish their characteristic ambiguity. Two major types of CNLs exist. The type aiming to improve human-human communication is called Simplified Language or human CNL. The second type is a machine-readable CNL, a formal language used mainly as an Abstract Knowledge Representation (AKR) language. In this paper, we present an approach to translate the Medline summaries of diseases, written in Simplified English, into a machine-readable CNL. This approach results in a formal knowledge base enabling machine processing, querying and reasoning while avoiding the usual main obstacle, natural language ambiguity. We tested the approach on a sample corpus about the disease Anemia and present the resulting hypotheses.

RCS

Li Zhang, Jun Li and Lei Chen. A Method on Similar Text Finding and Plagiarism Detection Based on Topic Model
Abstract: This paper proposes a method for similar text finding and plagiarism detection among large collections of texts, based on the LDA topic model. When a text needs to be checked, LDA generates its topic distribution, and the similarity with the texts in the topic distribution library is calculated; we define the set of most similar texts as the SimiSet. Plagiarism detection is then based on the SimiSet: the incoming text is compared with the texts in the SimiSet using fingerprint matching. In this paper, we compared several similarity calculation algorithms; experiments found that the combination of KNN and JS distance has a higher recall rate and a lower AIV value for plagiarism detection. We also compared the proposed method with traditional plagiarism detection methods, and the results show that our method has a higher F1 score and a smaller search range.
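Purely as an illustration of the similarity step described above (not the authors' implementation), a minimal sketch: compare a new document's LDA topic distribution with stored distributions using Jensen-Shannon (JS) distance and keep the closest ones as the SimiSet. The topic vectors are toy assumptions, not the output of a trained LDA model.

import numpy as np
from scipy.spatial.distance import jensenshannon

library = {                       # topic distributions of previously indexed documents
    "doc_a": np.array([0.70, 0.20, 0.10]),
    "doc_b": np.array([0.05, 0.15, 0.80]),
    "doc_c": np.array([0.60, 0.30, 0.10]),
}
new_doc = np.array([0.65, 0.25, 0.10])

# rank library documents by JS distance to the new document's topic distribution
ranked = sorted(library, key=lambda d: jensenshannon(new_doc, library[d]))
simi_set = ranked[:2]             # most similar texts form the SimiSet
print(simi_set)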
Zahra Mustafa-Awad, Majdi Sawalha, Monika Kirner-Ludwig and Dua’a Tabaza. Arab Women in Western Press: Designing News Corpora for Arab Women in British, American, and German newspapers during the Arab Spring
Abstract: This paper studies the representation of Arab women in the Western press during the so-called Arab Spring. Two corpora were constructed by collecting the news articles on Arab women published by the most popular newspapers in the USA, the UK and Germany during the period 2010-2015. The first corpus is made up of articles published by The Guardian and The New York Times. The second is composed of articles collected from four German newspapers: Die ZEIT, Frankfurter Allgemeine Zeitung, Süddeutsche Zeitung, and Der Spiegel. The English corpus consists of 738,964 words, whereas the German corpus has 219,634 words. The corpora were managed using Sketch Engine and keyword extraction programs developed by the researchers. The significance of the corpora comes from the importance of the Arab Spring, as Arab women played unprecedented roles that broke several stereotypes related to them. The corpora could reveal whether Western press coverage of these events also broke such stereotypes or continued to promote them.

Cătălina Mărănduc, Monica Mihaela Rizea and Dan Cristea. Mapping Dependency Relations onto Semantic Categories
Abstract: The paper is about a dependency treebank that aims to illustrate the Romanian language in multiple styles of communication and in multiple geographical and historical variants. The treebank, called UAIC-RoDepTb, contains 14,000 sentences and is freely available. 4,500 of its sentences, illustrating Contemporary Romanian, are being affiliated to UD (Universal Dependencies), which assembles more than 30 languages and offers a perspective of multiple relations and comparisons. However, the UD annotation system is simplified, and the affiliation of our treebank is possible only with loss of information. We aim to establish an original system of semantic annotation exploiting all the semantic information in our treebank: in the syntactic categories, in the morphological analysis, in the lexical definitions of some words, and in the punctuation. We obtained some logical structures similar to those of the AMR system. However, we intend to maintain the form of the FDG trees for the semantic layer, to preserve the isomorphism with the syntactic one. The chosen solution will be justified and compared with other systems. In order to find an annotation system with universal openness, our semantic labels have not been abbreviated yet. After fixing them, we will build a semantic parser.
Cătălina Mărănduc, Radu Simionescu and Dan Cristea. Hybrid POS-tagger for Old Romanian
Abstract: The paper describes the creation of the first of a series of Old Romanian processing tools, intended to cover several syntactic and semantic parsers, and a POS tagger able to annotate Romanian text written in the Cyrillic alphabet. The tool is based on a Dependency Treebank corpus, called UAIC-RoDepTb, having 14,000 sentences and 253,144 tokens, punctuation included. A part of it, called RoDia (Romanian Diachronic), is formed by texts in Old Romanian (sixteenth and seventeenth centuries). It currently has 3,482 sentences and 58,103 tokens, and its weight and dimensions will increase. The rest of the treebank, being a balanced corpus, contains numerous quotations from texts written in the eighteenth and nineteenth centuries; therefore it can be used as a training corpus for the Old Romanian processing tools. We began by automatically annotating texts in Old Romanian using our tools for Contemporary Romanian and by manually correcting them, although the accuracy was poor. We extracted data from these manual corrections of the 3,482 sentences. We also indexed other ancient texts, collected indices of the Bible (1688) digitized in the MLD (Monumenta Linguae Dacoromanorum) project and archaic variants of words in dictionaries, and added this information to the POS tagger lexicon for Contemporary Romanian.
Balaji Vasan Srinivasan, Rishiraj Saha Roy, Harsh Jhamtani, Natwar Modani and Niyati Chhaya. Corpus-based Automatic Text Expansion
Abstract: The task of algorithmically expanding a textual content based on an existing corpus can aid in efficient authoring and is feasible if the desired additional materials are already present in the corpus. We propose an algorithm that automatically expands a piece of text, by identifying paragraphs from the repository as candidates for augmentation to the original content. The proposed method involves: extracting the keywords, searching the corpus, selecting and ranking relevant textual units while maintaining diversity in the overall information in the expanded content, and finally concatenating the selected text units. We propose metrics to evaluate the expanded content for diversity and relevance, and compare them against manual annotations. Results indicate viability of the proposed approach.
Reda Ahmed Zayed, Hesham A. Hefny and Mohamed Farouk Abdel Hady. A Hybrid Approach to Extract Keyphrases from Arabic Text
Abstract: Key phrases are phrases, consisting of one or more words, that represent the important concepts in a document. This paper presents a hybrid approach to key phrase extraction from Islamic Arabic Fatwas. The approach is an amalgamation of three methods: the first assigns weights to candidate key phrases based on term frequency and inverse document frequency; the second assigns weights to candidate key phrases using knowledge about their similarities to the structure and characteristics of key phrases available in memory (a stored list of key phrases); and the third assigns weights to candidate key phrases using knowledge about the fatwa label (class or fatwa area). An efficient candidate key phrase identification method, the first component of the proposed key phrase extraction system, is also introduced in this paper. The experimental results show that the proposed hybrid approach gives good performance.
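For illustration only, a minimal sketch of the amalgamation idea: each candidate key phrase gets a combined weight from (1) TF-IDF, (2) similarity to phrases already stored in a key-phrase memory, and (3) a bonus when the phrase is typical of the fatwa's label. The component functions, weights and toy data are assumptions, not the paper's formulas.

def jaccard(a, b):
    """Word-overlap similarity between two phrases."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def combined_score(phrase, tfidf, memory, label_phrases, w=(0.5, 0.3, 0.2)):
    s_tfidf = tfidf.get(phrase, 0.0)                               # component 1: TF-IDF weight
    s_memory = max((jaccard(phrase, m) for m in memory), default=0.0)  # component 2: memory similarity
    s_label = 1.0 if phrase in label_phrases else 0.0              # component 3: label typicality
    return w[0] * s_tfidf + w[1] * s_memory + w[2] * s_label

tfidf = {"zakat calculation": 0.8, "daily prayer": 0.4}
memory = ["zakat on savings", "prayer times"]
label_phrases = {"zakat calculation"}
for cand in tfidf:
    print(cand, round(combined_score(cand, tfidf, memory, label_phrases), 3))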
Minglei Li, Qin Lu, Yunfei Long and Lin Gui. Hidden Recursive Neural Network for Sentence Classification
Abstract: Recursive Neural Networks have been successfully used in sentence-level sentiment analysis for language compositionality based on structured parse trees, and several modified versions have been proposed. These models either treat word vectors as model parameters or employ pre-trained word vectors as input. The former has the advantage of learning task-specific word vectors but has a much larger parameter size. The latter has the advantage of using the semantic information encoded in the vectors and has a much smaller parameter size, but the general word vectors may not be task-specific. In this work, we propose a hidden recursive neural network (HRNN) which takes advantage of both learning word vectors and using pre-trained word vectors. This model takes the pre-trained word vectors as input and adds one hidden layer to extract a task-specific representation. The recursive composition process is then performed in the hidden space.
We perform extensive experiments on several sentence classification tasks and results show that our proposed model outperforms both methods and the other baselines, which indicates the effectiveness of our proposed model.
Djalma Padovani and Joao Jose Neto. Proposal for Modeling Brazilian Portuguese with Adaptive Grammars
Abstract: Natural Language Processing uses different techniques for identifying elements of the language and the syntactic and semantic roles they carry out in the text under analysis. Traditionally, NLP systems are built with modules that segment the text, identify its elements, verify whether the syntactic trees are in accordance with grammar rules, and apply specific formalisms to validate the semantics. However, there are few formalisms that represent semantics in a syntactic way, and such formalisms are either very complex or incomplete. Adaptive Grammars is a formalism in which the grammar can modify itself based on the parsing of the character chain and the application of rules associated with the context. It brings several advantages over similar techniques because it allows the representation of both syntax and semantics together in a single model. This work presents a method for modeling natural languages using Adaptive Grammars and illustrates the proposal with an application to Brazilian Portuguese.
Chahira Lhioui, Anis Zouaghi and Mounir Zrigui. Towards a Linguistic Knowledge Recognition for Arabic Discourses Parsing
Abstract: With the advancement of the NLP field, Arabic knowledge recognition has become an interesting research topic. In fact, it is thanks to the automation of Arabic knowledge recognition that the complexity of text analysis can be reduced. In this article, the authors present an automatic knowledge recognition process for transcribed Arabic speech. This process applies recognition grammars implemented on a linguistic engineering platform called NooJ. To do this, the authors start with a survey of existing work dealing with knowledge recognition. They then introduce an Arabic knowledge typology by studying its ambiguities. Moreover, they develop a knowledge recognition approach based on transducers implemented on the NooJ platform. This approach is tested on a corpus covering the field of Touristic Information and Hotel Reservations. They end their work with a conclusion and perspectives.

Sukriti Verma and Vagisha Nidhi. Extractive Summarization using Deep Learning
Abstract: This paper proposes a text summarization approach for factual reports using a deep learning model. This approach consists of three phases: feature extraction, feature enhancement, and summary generation, which work together to assimilate core information and generate a coherent, understandable summary. We are exploring various features to improve the set of sentences selected for the summary, and are using a Restricted Boltzmann Machine to enhance and abstract those features to improve resultant accuracy without losing any important information. The sentences are scored based on those enhanced features and an extractive summary is constructed. Experimentation carried out on several articles demonstrates the effectiveness of the proposed approach.
Viktor Hangya, Zsolt Szántó and Richárd Farkas. Latent Syntactic Structure-Based Sentiment Analysis
Abstract: We propose a latent syntactic structure-based approach for sentiment analysis which requires only sentence-level polarity labels for training. Our experiments on three domains (movie, IT products, restaurant) show that a sentiment analyzer that exploits syntactic parses and has access only to sentence-level polarity annotation for in-domain sentences can outperform state-of-the-art models that were trained on out-domain parse trees with sentiment annotation for each node of the trees. In practice, millions of sentence-level polarity annotations are usually available for a particular domain thus our approach is applicable for training a sentiment analyzer for a new domain while it can exploit the syntactic structure of sentences as well.
Dhaou Ghoul. Classifications and grammars of simple Arabic lexical invariants in anticipation of an automatic processing of this language: “the temporal invariants”.
Abstract: This paper focuses on the classification and treatment of simple Arabic lexical invariants that express a temporal aspect. Our aim is to create a grammar diagram (finite state machine) for each invariant. In this work, we limited our treatment to only 13 lexical invariants. Our hypothesis begins with the principle that the lexical invariants are located at the same structural (formal) level as the schemes in the language quotient (skeleton) of the Arabic language. They hide a great deal of information and involve syntactic expectations that make it possible to predict the structure of the sentence. To do this, first, for each lexical invariant, we developed a sample corpus that contains the different contexts of the lexical invariant in question. Second, from this corpus, we identified the different linguistic criteria of the lexical invariant that allow us to correctly identify it. Finally, we codified this information in the form of linguistic rules in order to model it as a grammar diagram (finite state machine).
Tiansi Dong, Armin Cremers and Juanzi Li. The Fix-point of Dependency Graph -- A Case Study of Chinese-German Similarity
Abstract: Applications using linguistic dependency graphs (LDG) produce exciting results in neural NLP and machine translation, as they capture more semantic relations. A case study is carried out to bring language dependency graphs (LDG) closer to meaning representations. We encode linguistic cues in graph-transformation rules and keep updating an LDG until a fix-point is reached, which we name a spatial linguistic graph (SLG). The formal definition of SDG is presented. An evaluation using a SimRank-based method is conducted on paired German-Chinese sentences from a grammar book (682 paired sentences in total). Results show that the SDGs of paired sentences are more similar than their LDGs, supported by 89.4% of observations. A comparison with related work is presented. Applications of SDG in word embedding and machine translation are described.
Lipika Dey, Kunal Ranjan, Gargi Roy and Shashank Mittal. Email Repository Analysis – Learning from communications
Abstract: Email exchange is still the most common means of conducting business communications and can be considered as an integral part of several business processes. Email repositories, both personal and organizational are goldmines of information which can be very effectively mined to gain insights about best practices, pitfalls to be avoided, personal engagement history with different people and groups and so on. In this paper, we propose methods to analyze email archives and provide hidden insights and information that are useful to the user. We further show that it is possible to gain knowledge about repetitive and unintelligent tasks that can be automated thus increasing efficiency and productivity of an organization.
Yang Xu, Junjie Huang, Binglu Wang and Shuwen Liu. Natural Language Processing in “Bullet Screen” Application
Abstract: The main purpose of natural language processing is to process, understand, and apply all kinds of human languages in written or oral form. In recent years, the "bullet screen", which allows viewers to post bullet-like, real-time comments on the screen while watching films, has become an emerging craze on online video sites in China and Japan, mainly popular among young people for its social interactivity. Since "bullet screen" comments can be regarded as a novel type of natural language, processing, storing, and organizing them plays an important role in providing retrieval platforms that help users find highlights in oceans of videos. Taking Bilibili, a Chinese video website, as an example, this research proposes a new algorithm called Filling-Rate Ranking (FRR) based on BM25 in order to put "bullet screen" processing into practice. The results of our empirical study show that FRR based on BM25 can dissect "bullet screen" language and mine more details about audio as well as video information. Besides, an interactive platform for "bullet screen" comments could greatly enrich viewers' entertainment experience.
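The abstract says FRR builds on BM25 but does not spell out the filling-rate part, so the sketch below shows only the standard BM25 core one would start from: scoring a "bullet screen" comment stream (treated as a document) against a query. The comments, query and parameters are illustrative assumptions.

import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avg_len, k1=1.5, b=0.75):
    """Standard BM25 score of one tokenized document for a tokenized query."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = doc_freqs.get(term, 0)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        denom = tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf[term] * (k1 + 1) / denom
    return score

docs = [["epic", "fight", "scene"], ["boring", "scene"], ["epic", "soundtrack"]]
doc_freqs = Counter(t for d in docs for t in set(d))        # document frequencies
avg_len = sum(len(d) for d in docs) / len(docs)
for d in docs:
    print(d, round(bm25_score(["epic", "scene"], d, doc_freqs, len(docs), avg_len), 3))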
Chih-Ling Hsu, Huang-Hua Ju, Yu-Hsuan Wu, Hao-Chun Peng, Jhih-Jie Chen, Jim Chang and Jason S. Chang. Computer Assisted English Email Writing System
Abstract: We introduce a method for developing an interactive system that assists English email writing by providing the writer with multiple appropriate collocations depending on the previous writing. The method involves preprocessing the Enron Email Dataset (https://www.cs.cmu.edu/~./enron/), tokenizing the preprocessed documents, extracting collocations based on the skip-gram model, generating lexical bundles, deriving patterns, and building a statistical model that performs text prediction. At run time, the developed system recommends to the writer patterns consisting of collocations that are likely continuations of the last few words. Lastly, we present a prototype, EmailPro, built with the system developed from this method. Preliminary experiments show that EmailPro fulfills the goal of inducing better writing and improving productivity and fluency in real time.
Yuejun Li, Xiao Feng and Shuwu Zhang. Identify Fake Online Reviews by Review Group and Collusion
Abstract: With the fast development of e-commerce, online product and store reviews play an important role in product and service recommendation for new customers. However, for economic or fame reasons, dishonest people are employed to write fake reviews, also called “opinion spamming”, to promote or demote target products and services. Previous research has used text similarity, linguistics, rating patterns, graph relations and other behaviors for spammer detection. It is difficult to find fake reviews by browsing product reviews in time-descending order, while it is easier to identify fraudulent reviews by checking the lists of reviews of individual reviewers. We propose a novel review grouping method and models to identify latent fake reviews. The review grouping algorithm can effectively split a reviewer's reviews into groups, which participate in building a new model of review spamming detection. Additionally, we explore the collusion behavior between reviewers to build a group collusion model. Experiments and evaluations show that the review grouping method and relevant models can effectively identify fake reviews, especially those posted by professional review spammers.
Pyae Phyo Thu and Nwe Nwe. Understanding Social Media Satirical Emotion
Abstract: Nowadays, tackling satirical language has turned out to be a trending research area in computational figurative language analysis. Many researchers have analyzed satirical language from various points of view: lexically, semantically and syntactically. However, due to the ironic dimension of emotion embedded in satirical language, the emotional study of satirical language has been left behind. In this study, we try to tackle satire detection from an emotional point of view. In order to capture the implicit nature of satirical emotion, instead of using a straightforward emotional lexicon we use five emotional tones (Anger, Disgust, Fear, Joy, and Sadness) acquired from Tone Analyzer. Moreover, we further supplement emotional tone with Social Order and Personal traits. We apply these traits to various datasets, namely tweets, newswire articles and daily streams. We examine the experiments with several different classification approaches: SVM, Naïve Bayes and an Ensemble Classifier. Our system outperforms a word-based baseline and is able to recognize satire not only in tweets but also in news articles with good accuracy.
Kaori Kashiwai and Ichiro Kobayashi. Multiple Time-series Documents Summarization based on the Sequential Difference of Events
Abstract: Time-series documents such as periodicals and newspapers take in new information and add it to their articles day by day. It takes readers much time to grasp the contents of such documents because of the large amount of information they contain. A method to summarize the documents from multiple news sources along the timeline is therefore necessary. In this study, we propose a method to generate a summary of time-series documents provided by multiple news sources by extracting essential sentences from the documents along the timeline, focusing on new information sequentially added along the timeline.
Souvick Ghosh, Satanu Ghosh and Dipankar Das. Sentiment Identification in Code-Mixed Social Media Text
Abstract: Sentiment analysis is the Natural Language Processing (NLP) task dealing with the detection and classification of sentiments in texts. While some tasks deal with identifying the presence of sentiment in a text (subjectivity analysis), other tasks aim at determining the polarity of the text, categorizing it as positive, negative or neutral. Whenever sentiment is present in a text, it has a source (a person, a group of people or any entity), and the sentiment is directed towards some entity, object, event or person. Sentiment analysis tasks aim to determine the subject, the target and the polarity or valence of the sentiment. In our work, we try to automatically extract sentiment (positive or negative) from Facebook posts using a machine learning approach. While some work has been done on code-mixed social media data and on sentiment analysis separately, our work is, to the best of our knowledge, the first attempt to perform sentiment analysis of code-mixed social media text. We have used extensive pre-processing to remove noise from the raw text. A Multilayer Perceptron model has been used to determine the polarity of the sentiment. We have also developed the corpus for this task by manually labeling Facebook posts with their associated sentiments.
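As a hedged sketch of this kind of pipeline (the character n-gram features and the toy code-mixed posts are assumptions, not the authors' setup), a Multilayer Perceptron polarity classifier can be put together with scikit-learn:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline

    # Toy code-mixed posts (invented); 1 = positive, 0 = negative.
    posts = ["khub bhalo movie", "very kharap experience",
             "darun performance", "not bhalo at all"]
    labels = [1, 0, 1, 0]

    # Character n-grams are robust to romanized spelling variation.
    model = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0),
    )
    model.fit(posts, labels)
    print(model.predict(["bhalo acting"]))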
Mahdi Mohseni and Heshaam Faili. Topic Modeling Based Analysis of Professors Roles in Directing Theses
Abstract: Topic-sensitive analysis of participants in scientific networks has been a hot research topic for many years. In this paper, we propose a topic modeling based approach to analyze the influence of professors when they take on different roles (e.g. first supervisor, second supervisor and advisor) in directing theses. We explain how a topic distribution of theses obtained by Latent Dirichlet Allocation provides a basis to formulate the problem and how the unknown parameters are calculated. Results of experiments on a real-world dataset reveal some interesting facts about professors' roles in supervising and advising theses. Our results contradict the traditional view that supervisors are more influential than advisors and that the first supervisor is a more effective role than the second supervisor.
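A minimal sketch of the first step described above, obtaining per-thesis topic distributions with Latent Dirichlet Allocation (shown here with gensim on invented toy data; the role-influence estimation itself is not reproduced):

    from gensim import corpora, models

    # Toy thesis abstracts, already tokenized (invented data).
    theses = [
        ["neural", "parsing", "dependency"],
        ["topic", "model", "inference"],
        ["neural", "translation", "attention"],
    ]
    dictionary = corpora.Dictionary(theses)
    bow = [dictionary.doc2bow(doc) for doc in theses]

    lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, random_state=0)
    # Per-thesis topic distribution, the quantity the influence analysis starts from.
    for doc in bow:
        print(lda.get_document_topics(doc, minimum_probability=0.0))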
Lamees Al Qassem, Di Wang, Zaid Al Mahmoud, Ahmad Al-Rubaie, Hassan Barada and Nawaf Al Moosa. Latest Challenges for automatic Arabic Summarisation and the Proposed Solutions
Abstract: Automatic text summarization has been a field of intense research over the last 50 years, especially for widely used languages with simpler grammar, such as English. Moreover, the exponential growth in internet use over the past two decades and the spread of online content have led to an unprecedented growth in the amount of information available to users and businesses, including news articles, general information and social media. Therefore, automatic text summarization has become more important to distill the colossal volume of available content into condensed and relevant information. However, techniques and methodologies for automatic Arabic summarization are still immature due to the inherent complexity of the Arabic language in terms of both structure and morphology. This paper focuses on the latest challenges for Arabic text summarization and compares different solutions and methodologies in the literature. Generally speaking, three main methodologies are used in different systems: symbolic methods, numerical methods and hybrid methods. This paper discusses and compares the algorithms used in the different methodologies and the remaining untangled challenges. Finally, suggested solutions for future work, based on the comparison across the different existing methods and systems in the literature for automatic Arabic summarization, are proposed and discussed.
András Dobó. Multi-D Kneser-Ney smoothing preserving the original marginal distributions
Abstract: Smoothing is an essential tool in many NLP tasks; therefore numerous techniques have been developed for this purpose in the past. One of the most widely used groups of smoothing methods is Kneser-Ney smoothing (KNS) and its variants, including Modified Kneser-Ney smoothing (MKNS), which is widely considered to be the best smoothing method available. Although when developing the original KNS the intention of the authors was to create a smoothing method that preserves the marginal distributions of the original model, this property was not maintained when the MKNS was developed.

In this article I overcome this and propose a refined version of the MKNS that preserves these marginal distributions while keeping the advantages of both previous versions. Besides its advantageous properties, this novel smoothing method is shown to achieve about the same results as the MKNS in a standard language modeling task.
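For reference, the standard interpolated bigram form of Kneser-Ney smoothing (the baseline being refined here, not the proposed variant) is

    P_{KN}(w_i \mid w_{i-1}) = \frac{\max\bigl(c(w_{i-1} w_i) - D,\, 0\bigr)}{c(w_{i-1})}
      + \frac{D \cdot N_{1+}(w_{i-1}\,\bullet)}{c(w_{i-1})}
        \cdot \frac{N_{1+}(\bullet\, w_i)}{N_{1+}(\bullet\,\bullet)}

where $D$ is the discount, $c(\cdot)$ is a corpus count, and $N_{1+}(w_{i-1}\,\bullet)$ is the number of distinct words observed after $w_{i-1}$; the continuation term $N_{1+}(\bullet\, w_i)/N_{1+}(\bullet\,\bullet)$ is what ties the smoothed model back to the original unigram marginals.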
Valerie Mozharova and Natalia Loukachevitch. Named Entity Recognition in Islam-Related Russian Twitter
Abstract: The paper describes an approach to creating a domain-specific tweet collection written by users who frequently discuss Islam-related issues in Russian. We use this collection to study specific features of named entity recognition on Twitter. We found that, in contrast to tweets collected randomly, our tweet collection contains a relatively small number of spelling errors and strange word shortenings. Specific difficulties of our collection for named entity recognition include a large number of Arabic and other Eastern names and the frequent use of ALL-CAPS writing to emphasize the main words in messages. We studied the transfer of a NER model trained on a newswire collection to the created tweet collection, along with approaches to reducing the degradation of the model caused by the transfer. We found that for our specialized text collection, the greatest improvement came from normalization of word capitalization. Two-stage approaches to named entity recognition and word2vec-based clustering were also useful for our task.
Bibek Behera, Joshi Manoj, Abhilash KK and Mohammad Ansari Ismail. Distributed Vector Representation Of Shopping Items, The Customer And Shopping Cart To Build A Three Fold Recommendation System.
Abstract: The main idea of this paper is to represent shopping items as vectors, because these vectors act as the base for building embeddings for customers and shopping carts. These vectors are also input to the mathematical models that act as a recommendation engine or target offers to potential customers. We have used exponential family embeddings as the tool to construct two basic vectors: product embeddings and context vectors. Using the basic vectors, we build combined embeddings, trip embeddings and customer embeddings. Combined embeddings mix linguistic properties of product names with their shopping patterns. The customer embeddings establish an understanding of the buying pattern of customers in a group and help in building customer profiles. A profile is basically the identity of a cluster; for example, a cluster can represent customers frequently buying pet food. Identifying such profiles can help us bring out offers and discounts. Similarly, trip embeddings are used to build trip profiles. People tend to buy similar sets of products in a trip, and hence their trip embeddings can be used to predict the next product they would like to buy. This is a novel technique and the first of its kind to make recommendations using product, trip and customer embeddings.
Nabeela Altrabsheh, Mazen Elmasri and Hanady Mansour. Proposed Novel Algorithm for transliterating Arabic terms into Arabizi
Abstract: Arabizi is a new trend in social media where a person uses Latin characters to represent Arabic words. Arabic letters can be replaced with different symbols according to dialect and preference. This creates a wide range of new vocabulary in sentiment lexicons. The focus of the Arabizi literature so far has been on transliterating Arabizi terms into Arabic terms. In this paper we propose a new algorithm to transliterate Arabic terms into Arabizi. This allows a larger coverage of the different ways words are written in Arabizi, which in turn allows a more accurate analysis of sentiment.
Razieh Ehsani, Ercan Solak and Olcay Taner Yıldız. Hybrid chunking for Turkish combining morphological and semantic features
Abstract: We use morphological features together with the semantic representations of words to solve the chunking problem in Turkish. We separately train and tune word embeddings for the semantic representations and conditional random fields for the morphological features, and combine the two in a random forest.
Divyanshu Bhardwaj, Partha Pakray and Alexander Gelbukh. An approach to Information Retrieval from Scientific Documents
Abstract: Information retrieval (IR) is one of the prominent fields of research today. Despite this, mathematical information retrieval remains a callow niche that has not been delved into much. This paper describes the design and implementation of a mathematical search engine using Apache Nutch and Apache Tomcat. The search engine is designed to exhaustively search for mathematical equations and formulae in scientific documents and research papers. Processed Wikipedia texts were used as inputs in order to probe the ability to search for specific mathematical entities among large volumes of text distractors.
Anupam Mondal, Erik Cambria, Dipankar Das and Sivaji Bandyopadhyay. Auto-categorization of medical concepts and contexts
Abstract: Information extraction is essential to understand the conceptual knowledge of medical concepts in the absence of domain experts, e.g. doctors and medical practitioners, in healthcare. In this work, the challenge arises from the large number of unstructured and semi-structured corpora produced daily. To resolve the challenge, one needs to convert unstructured or semi-structured corpora into a structured corpus. In the present paper, we focus on identifying the category of medical concepts with our previously built WordNet of Medical Event (WME2.0) lexicon, which assists in constructing a structured corpus here. Medical concepts and their conceptual features, such as affinity, gravity, polarity scores, and similar sentiment words as semantics and sentiment, all produced by WME2.0, are used to develop a category assignment system for medical concepts and contexts. The resultant categories for the concepts are \textit{diseases, drugs, symptoms, human$\_$anatomy,} and \textit{miscellaneous medical terms (MMT)}, which refer to the broadest fundamental classes of medical concepts. The category assignment system for concepts is applied to build a category assignment system for medical contexts, where we treat each sentence of the corpus as a context. The proposed system allows us to extract eleven different types of pair-based categories, such as \textit{disease-symptom, disease-drug,} and \textit{disease-MMT}, for the contexts, aiming to understand the subjective information of the corpus. These categorization systems are extremely crucial for building medical annotation and recommendation systems in healthcare services. To evaluate the proposed category assignment system for medical concepts and contexts, we have used the widely applied Na\"{\i}ve Bayes and Logistic Regression supervised machine learning classifiers. These classifiers provide average F-scores of approximately 81\% and 86\% for the categorization systems of concepts and contexts, respectively.
Hospice Houngbo and Robert Mercer. Investigating Citation Linkage as an Information Retrieval Task
Abstract: Nowadays, there is an overabundance of literature that scientists need to consult in their research work. However, they face the challenge of having to navigate the deluge of information contained in the articles published in their domain of study. Tools such as citation networks, which contain paper sources linked by co-citation relationships, are often used to find papers with related content, but they lack the ability to provide the specific information that a researcher may need without having to read hundreds of linked papers.
In this study, we report our method for finding those sentences in a cited article that are the focus of a citation in a citing paper, a task we have called Citation Linkage. We first provide guidelines for building a corpus annotated by a domain expert. The corpus consists of citing sentences and their cited articles. For this study the citing sentences deal with biochemistry methodology. All sentences in the cited article are annotated with six levels of relevance ranging from 0 (no relevance), to 5 (the annotator had the highest confidence that the sentence is relevant).
We hypothesize that citation sentences, when used as queries in a retrieval model, should point to relevant sentences in the cited paper. For this purpose, we use many established retrieval algorithms to check whether the task of citation linkage, targeting sentences in a cited paper, can be performed by ranking the sentences as a search engine would.
Evaluation of the citation linkage task uses document ranking methods and information retrieval evaluation metrics. For each citation-paper linkage task, we compute Precision@k and Normalized Discounted Cumulative Gain (NDCG@k), where k is the number of sentences given non-zero relevance scores by the annotator. We found that 18 out of 22 citation linkage tasks have at least one relevant sentence in the top k positions. The mean average NDCG is 57% and the Mean Average Precision is 35% over all the tasks. These results are very promising and show that it is possible to find the set of sentences that a citation refers to in a cited paper with reasonable performance.
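A minimal sketch of the two reported metrics, computed for a single linkage task with invented relevance annotations (k is the number of sentences with non-zero relevance, as in the paper; the linear-gain DCG variant is assumed):

    import math

    # Annotator relevance (0-5) of the cited paper's sentences, in the order
    # the retrieval model ranked them (invented example).
    ranked_relevance = [5, 0, 3, 0, 2, 0]
    k = sum(1 for r in ranked_relevance if r > 0)   # here k = 3

    precision_at_k = sum(1 for r in ranked_relevance[:k] if r > 0) / k

    def dcg(scores):
        """Discounted cumulative gain with linear gain and log2 position discount."""
        return sum(r / math.log2(i + 2) for i, r in enumerate(scores))

    ideal = sorted(ranked_relevance, reverse=True)
    ndcg_at_k = dcg(ranked_relevance[:k]) / dcg(ideal[:k])

    print(precision_at_k, ndcg_at_k)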
Soumil Mandal and Dipankar Das. Analyzing Roles of Classifiers and Code-Mixed factors for Sentiment Identification
Abstract: Multilingual speakers often switch between languages to express themselves on social communication platforms. Sometimes the original script of each language is preserved, while using a common script for all the languages is also quite popular for convenience. On such occasions, multiple languages with different rules of grammar are mixed in the same script, which makes accurate sentiment extraction a challenging task for natural language processing. In this paper, we report the results of various experiments carried out on a movie review dataset with this property, where the two languages used are English and Bengali, both typed in Roman script. We have tested various machine learning algorithms trained only on English features on our code-mixed data and have found the maximum accuracy achieved to be 59.00%, by a Naïve Bayes (NB) model. We have also tested various models trained on code-mixed as well as English features, and the highest accuracy of 72.50% was obtained by a Support Vector Machine (SVM) model. Finally, we have analyzed the misclassified snippets and discussed the challenges that need to be resolved for better accuracy.
Syed Sarfaraz Akhtar, Arihant Gupta, Avijit Vajpayee, Arjit Srivastava and Manish Shrivastava. An Unsupervised Approach for Mapping between Vector Spaces
Abstract: We present a language-independent, unsupervised approach for transforming word embeddings from a source language to a target language using a transformation matrix. Our model handles the problem of data scarcity, which is faced by most of the world's languages, and yields improved word embeddings for words in the target language by relying on transformed embeddings of words from the source language. We initially evaluate our approach via word similarity tasks on a similar language pair, with Hindi as the source and Urdu as the target language, and we also evaluate our method on French and German as target languages with English as the source language. For Hindi and Urdu as a language pair, we evaluate on our newly created word similarity dataset for Urdu: WS-UR-100. Our approach improves the current state-of-the-art results by 13% for French and 19% for German. For Urdu, we saw an increment of 16% over our initial baseline score. We further explore the prospects of our approach by applying it to multiple models of the same language and transferring words between the two models, thus solving the problem of missing words in a model. We evaluate this on word similarity and word analogy tasks.
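The unsupervised induction of the transformation matrix is the paper's contribution and is not reproduced here; as a rough illustration of the general setup only, the sketch below fits a linear map between two toy embedding spaces from a small seed lexicon by least squares (a supervised shortcut, clearly not the authors' method) and applies it to a new word vector.

    import numpy as np

    # Toy source/target embeddings for a small seed lexicon (invented vectors).
    src = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    tgt = np.array([[0.9, 0.1], [0.1, 0.8], [1.1, 0.9]])

    # Least-squares transformation matrix W such that src @ W approximates tgt.
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)

    # Map a new source-language word vector into the target space.
    new_src_vec = np.array([0.5, 0.5])
    print(new_src_vec @ W)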
Damir Mukhamedshin, Olga Nevzorova and Aidar Khusainov. Complex Search Queries in the Corpus Management System
Abstract: This article discusses the advanced features of the newly developed search engine of the “Tugan tel” corpus management system. The corpus consists of texts in the Tatar language.
These features include executing complex queries with arbitrary logical formulas for direct and reverse search, executing complex queries using a thesaurus or a wordform/lemma list, and extracting some types of named entities.
Complex queries make it possible to automatically extract annotated corpus data for linguistic applications, and these options also improve search in the corpus management system.
Silpa Kanneganti, Vandan Mujadia and Dipti. M. Sharma. Integrating Word Embedding Based Cluster Features in Dependency Parsing
Abstract: In this work, we propose a semi-supervised approach to introducing word embedding based features into dependency parsing. We introduce features derived from dependency-label-based word embedding clusters into both transition- and graph-based parsing models, thereby improving the prediction accuracies of both. Given a dependency annotated corpus, we group edges from the parsed trees based on their dependency relation labels. All the edges with common heads and dependency relations in the corpus are grouped together into clusters, i.e., for a given token word (W) in the data, all its children in the entire corpus are pooled together based on their dependency label with W. W in this context can be a uni-gram, bi-gram or tri-gram (n-gram form henceforth) of a chunk head, i.e., we extract n-grams of all the chunk heads from the corpus and apply the aforementioned clustering approach to them. These clusters are then projected into a semantic space, which in turn consists of n-grams created from a large corpus of raw data, in order to get word vector representations of the corresponding word tokens. The resulting clustered word embeddings are used to make better predictions (transitions or arcs) in data-driven dependency parsing. This work is motivated by the distributional semantics hypothesis that words that share contexts have similar meanings. We demonstrate the effectiveness of the approach in a series of experiments on Hindi language data.
Samira Ellouze, Maher Jaoua and Lamia Hadrich Belguith. Using multiple features to evaluate MultiLing summary
Abstract: The present paper introduces a new MultiLing text summary evaluation method. This method relies on a machine learning approach which operates by combining multiple features to build models that predict the human score (overall responsiveness) of a new summary. We have tried several single and “ensemble learning” classifiers to build the best model. We have experimented with our method in summary-level evaluation, where we evaluate each text summary separately. The correlation between the built models and the human scores is better than the correlation between the baselines and the manual scores.
Malek Lhioui, Kais Haddar and Laurent Romary. Semantic integration of COPEs in GOLD ontology
Abstract: This paper has as its goal the semantic integration of local ontologies (COPEs). It consists in the use of a semantic web method to resolve an NLP issue, i.e., semantic integration as a method for resolving interoperability between lexical resources. We define lexicons as local ontologies using Description Logics. Then, we build the resulting global ontology by combining alignment techniques and logical reasoning.
Salima Harrat, Karima Meftouh and Kamel Smaili. Creating Parallel Arabic Dialect Corpus: Pitfalls to Avoid
Abstract: Creating parallel corpora is a difficult issue that many researchers try to deal with. In the context of under-resourced languages like Arabic dialects, this issue is more complicated due to the nature of these spoken languages. In this paper, we share our experience with the creation of a parallel corpus containing several dialects in addition to MSA. We attempt to highlight the most important choices that we made and how good these choices were.
Randa Benkhelifa and Fatima Zohra Laallam. Integrating user’s characteristics for Improved Content Categorization in social networking
Abstract: Social networks enable users to freely communicate with each other and share their personal and social data, ongoing activities, interests, preferences, and views about different topics. The nature of the social networking user plays an important role in their interests and preferences; for example, women do not have the same interests as men: women are more interested in fashion than men, whereas men are more interested in sport than women. Therefore, gender, age, location, race, intellectual level, etc. can all affect a user's interests and preferences. In this paper, we propose a new approach which integrates a user's characteristics into content categorization. The experiments are carried out on a large Facebook dataset in order to analyze the effect of these characteristics on the performance of the categorization of the shared textual content in social networks.
Tereza Pařilová, Eva Hladka and Pavel Říha. Fuzzy Model of Dyslexic Users Built with Linguistic Pattern Categories
Abstract: Along with the discovery of new facts and the development of new technologies and methodologies, more and more definitions and specifications emerge. The quantity of these emergences, however, can lead to paradoxical contradictions, which obscure borders.
Often, we see this phenomenon of vagueness within natural language or non-exact topics. Fuzzy principles are therefore applied in a wide range of (not only) scientific areas. In applied technical science, a user model based on interfaces and human-computer cooperation meets such fuzzy borders. So, why not use it in assistive technology models? Fuzzy logic by its nature deals with linguistic variables, and such variables are transformed from numbers to expressions on predefined scales.
Dyslexia is a neurobiological, cognition-based disorder and is ideal for (neuro-)fuzzy computational modeling for many reasons. This paper describes the idea and process of using the fuzzy approach for obtaining information about the individual problems of dyslexic users and differentiating between the type(s) of dyslexic user model he or she may belong to. Such information may serve for better text accommodation in the dyslexia assistive web service DysTexia and for further clinical and psychological studies of dyslexia and linguistically based problems.
Tiina Puolakainen. Semi-automatic enhancement of bidictionary from aligned sentences
Abstract: When the base of a rule-based machine translation system is established and it can already give some reasonable translations, at some point in the development of the system the insufficiency of the dictionary becomes a bottleneck, as the available lexical resources have already been utilized. The issue holds especially for low-resource and morphologically rich languages with very productive compounding.

The article proposes a method for generating new bidictionary entries for closely related, morphologically rich languages with productive compounding. The idea is to use incompletely translated sentences along with their correct translations to incrementally improve and enhance the translation dictionary. The method also exploits the advantages of rule-based machine translation, taking into account the inner bitranslation state to find the interconnections between source and target utterances, generate new entries, and minimize the existing bidictionary entries left unmatched against the mono-dictionaries.
Martin Mikula, Kristína Machová and Xiaoying Sharon Gao. Combined dictionary approach to opinion analysis in Slovak
Abstract: People produce more and more textual data every day. They talk with each other, write articles and comment on products and services. It is simple to analyze such data manually when there is only a small amount, but when there is a lot of data it is very difficult to process it manually. We decided to use a dictionary approach for the automatic analysis of comments in the Slovak language. The first algorithm achieved an accuracy of around 72%. One disadvantage was that the algorithm could not identify the polarity of all comments: more than 18% of the comments were not assigned a polarity because they did not contain subjective words from the dictionary. The new approach combines the first dictionary approach with a probabilistic method which is used to create a new lexicon. The new dictionary was again used to analyze the comments in the dataset. This new approach reduced the percentage of unidentified comments to 0.5%. The new approach outperformed the previous method and also achieved better results than a Naive Bayes classifier and Support Vector Machines (SVM) on the same dataset.
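An illustrative sketch of the basic dictionary step (the Slovak lexicon entries and the comment are invented; the probabilistic lexicon expansion is not reproduced here):

    # Toy subjective-word dictionary with polarity weights (invented entries).
    lexicon = {"dobry": 1, "skvely": 2, "zly": -1, "hrozny": -2}

    def polarity(comment):
        """Sum lexicon scores of the comment's words; None if no word is covered."""
        scores = [lexicon[w] for w in comment.lower().split() if w in lexicon]
        if not scores:
            return None   # the "unidentified comment" case the new approach reduces
        total = sum(scores)
        return "positive" if total > 0 else "negative" if total < 0 else "neutral"

    print(polarity("skvely produkt ale zly servis"))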
Weicheng Ma, Kai Cao and Peter Chin. Phrase segmentation based on Skip-gram model
Abstract: Phrases are important in text mining tasks since their meanings usually differ from those of their constituent words. Most current phrase extraction methods rely heavily on grammatical features such as dependency paths, WordNet distance and Part-of-Speech (POS) tags to ensure better performance. However, these features show no advantage on informal English corpora such as Twitter posts, or on corpora of languages without strict grammar restrictions, for example Chinese. Moreover, in many existing methods phrases are defined as adjacent word groups, which is not ideal: though adjacency and distance do matter, phrases formed by separated words are common. In this paper we present a novel method which extracts phrases from raw text, especially from informal English sentences, based on the skip-gram model. We evaluate the precision and coverage on a Twitter sentiment analysis corpus to demonstrate the correctness and advancement of our method.
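A hedged sketch of the skip-gram component the method builds on, using gensim's Word2Vec with sg=1 on invented tweets; the paper's actual phrase-scoring procedure is not reproduced, and the similarity check below is only one possible signal.

    from gensim.models import Word2Vec

    # Toy tokenized tweets (invented).
    tweets = [
        ["new", "york", "is", "crowded"],
        ["i", "love", "new", "york"],
        ["ice", "cream", "in", "new", "york"],
        ["love", "ice", "cream"],
    ]

    # Train skip-gram embeddings (sg=1); tiny sizes only because the corpus is a toy.
    model = Word2Vec(tweets, vector_size=10, window=2, min_count=1,
                     sg=1, epochs=200, seed=0)

    # Words that form a phrase tend to be close in skip-gram space,
    # one possible cue for grouping them even when they are not adjacent.
    print(model.wv.similarity("new", "york"))
    print(model.wv.similarity("new", "cream"))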
Dror Mughaz, Yaakov Hacohen-Kerner and Dov Gabbay. Citation-Based Prediction of Birth and Death Years of Authors
Abstract: This paper presents an unusual approach to text mining and feature extraction for identifying the era of anonymous texts, which can help in the examination of forged documents or in extracting the time frame in which an author lived. The work and the experiments concern rabbinic documents written in Hebrew, Aramaic and Yiddish. The documents are undated and do not contain any bibliographic sections, which leaves us with an interesting challenge. This study proposes a few algorithms based on keyphrases that enable prediction of the time frame in which the authors lived, based on the temporal organization of references using linguistic patterns. Based on the keyphrases and the citations, we formulated various types of "iron-clad", heuristic and greedy constraints defining the birth and death years of an author, which lead to an interesting classification task. The experiments, applied on corpora containing texts authored by 12, 24 and 36 rabbinic authors, show promising results.
Kai Cao, Xiang Li and Weicheng Ma. Improving Event Extraction with Expert-Level Patterns
Abstract: Event Extraction (EE) is a challenging Information Extraction task which aims to discover event triggers with specific types and their arguments. Most recent research on Event Extraction relies on pattern-based or feature-based approaches, trained on annotated corpora, to recognize combinations of event triggers, arguments, and other contextual information. However, as the event instances in the ACE corpus are not evenly distributed, some frequent expressions involving ACE event triggers do not appear in the training data, adversely affecting performance. In this paper, we demonstrate the effectiveness of systematically importing expert-level patterns from TABARI to boost EE performance. The experimental results demonstrate that our pattern-based system with the expanded patterns can achieve a 69.8% F-measure, a 1.9% absolute improvement over the baseline and an advance over current state-of-the-art systems.
Colm Sweeney, Jun Hong and Weiru Liu. Sentiment Analysis using entity-level feature extraction
Abstract: Sentiment analysis is a field of study that analyses people's opinions or sentiments expressed in natural language. Within it, sentiment classification aims to identify whether the author of a text has a positive or negative opinion about a topic. One of the main indicators that assist in the detection of sentiment is the words used in the text. The sentiments expressed also depend on the structure or syntax of the text and its context. Supervised machine learning approaches to sentiment classification have been shown to achieve good results, but classifying texts by sentiment in a social media feed is a more difficult problem, as these text snippets are noisy and usually badly structured. Sentiment classification can be seen as a way of associating a given entity with the adjectives, adverbs, and verbs describing it, and extracting the associated sentiment to try to infer whether the microblog is positive or negative in relation to the main entity. With this in mind, we propose the deployment of a sentiment lexicon-based technique that moves towards a more fine-grained (entity-level) sentiment analysis of Twitter posts. The sentiment lexicon is used to assign a total score indicating the polarity of consumer emotion related to the Prudential entity and related products.
Zhuang Liu, Degen Huang and Jing Zhang. Reasoning with Self-Attention and Inference Model for Machine Comprehension
Abstract: Enabling a computer to understand a document so that it can answer comprehension questions is a central, yet unsolved, goal of Natural Language Processing (NLP), so reading comprehension of text is an important problem in NLP. Recently, machine reading comprehension has seen a boom in NLP research. In this paper, we introduce a novel iterative inference neural network based on a matrix sentence embedding with a self-attention mechanism. The proposed approach continually refines its view of the query and document while aggregating the information required to answer a query, aiming to compute attention not only for the document but also for the query side, thus benefiting from the mutual information. Experimental results show that our model achieves significant state-of-the-art performance on public English datasets, such as the CNN and Children's Book Test datasets. Furthermore, the proposed model also outperforms state-of-the-art systems by a large margin on Chinese datasets, including the People Daily and Children's Fairy Tale datasets, which were recently released and are the first Chinese reading comprehension datasets.
Wided Bakari, Patrice Bellot and Mahmoud Neji. The Design and Implementation of NArQAS: New Arabic Question-Answering System
Abstract: We present, in this paper, the overall structure of our system for generating answers to questions in Arabic, named NArQAS (New Arabic Question Answering System). This system aims to develop and evaluate the contribution of reasoning procedures, NLP techniques and RTE technology to producing precise answers to natural language questions. We also detail its operating architecture. In particular, our system is seen as a contribution, rather than a rival, to traditional systems focused on approaches that extensively use information retrieval and NLP techniques. Thus, we present the evaluation of the outputs of each of these components based on a collection of questions and texts retrieved from the Web. The NArQAS system was built, and experiments showed good results, with an accuracy of 68% for answering factual questions from the Web.
Rajendra Prasath and Chaman Sabharwal. A statistical term weighting approach for knowledge discovery in scientific literature
Abstract: In this paper, we present a statistical term weighting approach for knowledge discovery in scientific literature. In this approach, we generate knowledge discovery hypotheses from the content of scientific articles in the Marine Science, Climate Change and Environmental Science domains.
Then we test hypotheses about various events, relationships, and causes in the scientific literature for knowledge discovery tasks. We primarily focus on identifying (a) {\it terms} (domain-specific terms) that represent variables, entities or compounds; (b) relationships that connect the domain-specific terms with particular causes and effects, such as increasing, decreasing or neutral; and (c) properties that could be explored by describing the values between the terms and their relationships.
Carlos Escolano and Marta R. Costa-Jussà. Spanish Morphology Generation with Deep Learning
Abstract: Morphology generation is the natural language processing task of generating word inflection information. In this paper, we propose a new classification architecture based on deep learning to generate gender and number in Spanish from non-inflected words. This deep architecture uses a concatenation of embedding, convolutional and recurrent neural networks. We obtain improvements compared to other standard machine learning techniques. The accuracy of our proposed classifiers reaches over 98% for gender and over 93% for number.
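A minimal sketch of a stacked embedding, convolutional and recurrent classifier of the kind described, written with Keras; the character vocabulary size, sequence length, layer sizes and the binary gender output are assumptions, not the authors' configuration.

    from tensorflow.keras import layers, models

    # Characters of a padded non-inflected word as integer ids; sizes are assumed.
    vocab_size, max_len = 40, 15

    model = models.Sequential([
        layers.Input(shape=(max_len,), dtype="int32"),
        layers.Embedding(input_dim=vocab_size, output_dim=32),   # character embeddings
        layers.Conv1D(filters=64, kernel_size=3, activation="relu"),
        layers.Bidirectional(layers.GRU(32)),
        layers.Dense(1, activation="sigmoid"),   # e.g. gender: masculine vs feminine
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()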
Ayesha Siddiqa, Ashish Tendulkar and Sutanu Chakraborti. WikiAug: Augmenting Wikipedia Stubs by suggesting credible hyperlinks
Abstract: Wikipedia is a ubiquitous resource for knowledge acquisition. Widely used systems such as Apple's Siri and Google's Knowledge Graph rely on Wikipedia for knowledge acquisition. Currently, Wikipedia articles are created and maintained by volunteers who are not necessarily experts in the field, and many links or concepts are often missed by the editors. We attempt to fill these gaps and recommend concepts which are missing from Wikipedia articles. In this work, we specifically focus on recommending concepts for stub articles. To recommend concepts for stubs, we rely on external web articles retrieved by a search engine. We propose an automated way of evaluation, in order to save the considerable human effort and time consumed in the review process before the changes appear. We evaluate our approach on datasets of Wikipedia articles which were promoted from stubs to enriched articles over the years 2008 to 2015. Our results reinforce that our suggested links significantly overlap with the links added by human editors. We envisage that the choice of content from the suggested links will help Wikipedia editors enrich stub articles.