For Norsk Ordbok 2014, linguistic documentation comes in the form of scanned paper slips, electronic slips and corpus quotations, organised through the index system called the Meta Dictionary. Each entry in Norsk Ordbok is directly linked to the corresponding materials in the Meta Dictionary. However, with up to several thousand instances to sort into a complex entry structure, editors needed a tool for classifying instances and storing the resulting classification. This would improve editorial stringency, allow for variation in approach and save time. This article, entitled Classifying instances of linguistic documentation on a digital platform – one step further towards data integrity within academic lexicography, describes the tool developed for Norsk Ordbok to meet this need, and sums up user experiences from the first two years as positive, though with room for improvement.

With the tremendous advances made by Convolutional Neural Networks (ConvNets) on object recognition, we can now easily obtain adequately reliable machine-labeled annotations from the predictions of off-the-shelf ConvNets. In this work, we present an abstraction-memory-based framework for few-shot learning, building upon machine-labeled image annotations. Our method takes a large-scale machine-annotated dataset (e.g., OpenImages) as an external memory bank, where information is stored in memory slots in key-value form: the image feature serves as the key and the label embedding as the value. When queried by few-shot examples, our model selects visually similar data from the external memory bank and writes the useful information obtained from related external data into another memory bank, i.e., the abstraction memory. Long Short-Term Memory (LSTM) controllers and attention mechanisms are utilized to ensure that the data written to the abstraction memory correlate with the query example. The abstraction memory concentrates information from the external memory bank to make few-shot recognition effective. In the experiments, we first confirm that our model can learn to conduct few-shot object recognition on clean human-labeled data from the ImageNet dataset. Then, we demonstrate that, with our model, machine-labeled image annotations are an effective and abundant resource for performing object recognition on novel categories. Experimental results show that our proposed model achieves strong results, with only a 1% accuracy difference between machine-labeled and human-labeled annotations.
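The key-value memory read described above can be sketched in miniature: a query embedding scores every stored key, and the softmax-weighted sum of the values (label embeddings) is returned. This is an illustrative simplification, not the paper's LSTM-controlled model; `attend_memory` and the toy vectors are hypothetical.

```python
import math

def attend_memory(query, keys, values):
    """Soft key-value memory read: score each key against the query,
    softmax the scores, and return the attention-weighted sum of values."""
    # dot-product similarity between the query and every key
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(values[0])
    read = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return read, weights
```

A query aligned with the first key should place most of its attention mass on that slot, so the read vector is dominated by the first value.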

There has been increasing interest in natural language processing in effective methods for learning better text representations for sentiment classification of product reviews. However, most existing methods do not consider the subtle interplay among the words appearing in review text, the authors of reviews and the products the reviews are associated with. In this paper, we make use of a heterogeneous network to model the shared polarity in product reviews and simultaneously learn representations of users, the products they commented on and the words they used. The basic idea is to first construct a heterogeneous network which links users, products, the words appearing in product reviews, and the polarities of those words. Based on the constructed network, representations of nodes are learned using a network embedding method and subsequently incorporated into a convolutional neural network for sentiment analysis. Evaluations on product reviews, including the IMDB, Yelp 2013 and Yelp 2014 datasets, show that the proposed approach achieves state-of-the-art performance.
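The first construction step lends itself to a small sketch. Assuming reviews arrive as (user, product, words) triples, the heterogeneous network can be represented as an edge set linking all three node types; the function name and input format are illustrative, not the paper's implementation.

```python
def build_review_network(reviews):
    """Construct a heterogeneous graph: edges link each user to the
    product reviewed, and both user and product to the review's words."""
    edges = set()
    for user, product, words in reviews:
        edges.add((user, product))          # user--product edge
        for w in words:
            edges.add((user, w))            # user--word edge
            edges.add((product, w))         # product--word edge
    return edges
```

A node embedding method (e.g. a random-walk-based one) can then be run over this edge set before the CNN stage.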

This paper proposes a social network with an integrated children's disease prediction system built on a specially designed Children General Disease Ontology (CGDO). The ontology consists of children's diseases and their relationships with symptoms, together with Semantic Web Rule Language (SWRL) rules specially designed for predicting diseases. The prediction process starts with the user entering the observed signs and symptoms, which are then mapped to the CGDO ontology. Once the data are mapped, the prediction results are presented: the prediction phase executes the SWRL rules, which extract the details of the predicted disease. The motivation behind this system is to spread knowledge about children's diseases and their symptoms in a very simple way through the specialized social networking website.

In this paper, we describe a new, publicly available corpus intended to stimulate research into language modeling techniques which are sensitive to overall sentence coherence. The task uses the Scholastic Aptitude Test's sentence completion format. The test set consists of 1040 sentences, each of which is missing a content word. The goal is to select the correct replacement from amongst five alternates. In general, all of the options are syntactically valid, and reasonable with respect to local N-gram statistics. The set was generated by using an N-gram language model to generate a long list of likely words, given the immediate context. These options were then hand-groomed, to identify four decoys which are globally incoherent, yet syntactically correct. To ensure the right to public distribution, all the data is derived from out-of-copyright materials from Project Gutenberg. The test sentences were derived from five of Conan Doyle's Sherlock Holmes novels, and we provide a large set of Nineteenth and early Twentieth Century texts as training material.
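The candidate-scoring idea behind the decoy generation can be illustrated with a minimal add-one-smoothed bigram model: train counts, then score each candidate filled into the blank. This is a toy sketch of n-gram completion scoring, not the corpus construction pipeline itself; all names are hypothetical.

```python
import math
from collections import Counter

def train_bigram(corpus_sentences):
    """Count unigrams and bigrams over tokenized training sentences."""
    uni, bi = Counter(), Counter()
    for sent in corpus_sentences:
        toks = ["<s>"] + sent
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def sentence_logprob(sent, uni, bi, vocab_size, alpha=1.0):
    """Add-alpha smoothed bigram log-probability of one sentence."""
    toks = ["<s>"] + sent
    return sum(math.log((bi[(p, c)] + alpha) / (uni[p] + alpha * vocab_size))
               for p, c in zip(toks, toks[1:]))

def complete(template, candidates, uni, bi, vocab_size):
    """Pick the candidate whose filled-in sentence scores highest."""
    fill = lambda w: [w if t == "___" else t for t in template]
    return max(candidates,
               key=lambda w: sentence_logprob(fill(w), uni, bi, vocab_size))
```

The sentence-completion task then amounts to ranking the five alternatives by the score of the completed sentence, with global coherence deliberately left out of the model.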

Identifying the experimental methods in human neuroimaging papers is important for grouping meaningfully similar experiments for meta-analyses. Currently, this can only be done by human readers. We present the performance of common machine learning (text mining) methods applied to the problem of automatically classifying or labeling this literature. Labeling terms are from the Cognitive Paradigm Ontology (CogPO), the text corpora are abstracts of published functional neuroimaging papers, and the methods are trained on the annotations of a human expert. We aim to replicate the expert's annotation of multiple labels per abstract identifying the experimental stimuli, cognitive paradigms, response types, and other relevant dimensions of the experiments. We use several standard machine learning methods: naive Bayes, k-nearest neighbor, and support vector machines (specifically SMO, or sequential minimal optimization). Exact match performance ranged from only 15% in the worst cases to 78% in the best cases. Naive Bayes methods combined with binary relevance transformations performed strongly and were robust to overfitting. This collection of results demonstrates what can be achieved with off-the-shelf software components and little to no pre-processing of raw text.
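The binary-relevance transformation mentioned above reduces multi-label annotation to one independent yes/no classifier per CogPO label. A minimal pure-Python sketch (illustrative class and function names, add-one smoothing) might look like:

```python
import math
from collections import defaultdict

class BinaryNB:
    """Multinomial naive Bayes for a single yes/no label (add-one smoothing)."""
    def fit(self, docs, labels):
        self.counts = {True: defaultdict(int), False: defaultdict(int)}
        self.totals = {True: 0, False: 0}
        self.docs_per = {True: 0, False: 0}
        self.vocab = set()
        for doc, y in zip(docs, labels):
            self.docs_per[y] += 1
            for w in doc:
                self.counts[y][w] += 1
                self.totals[y] += 1
                self.vocab.add(w)
        self.n = len(docs)
        return self

    def predict(self, doc):
        v = len(self.vocab)
        best, best_lp = None, None
        for y in (True, False):
            # smoothed class prior plus per-word likelihoods
            lp = math.log((self.docs_per[y] + 1) / (self.n + 2))
            for w in doc:
                lp += math.log((self.counts[y][w] + 1) / (self.totals[y] + v))
            if best_lp is None or lp > best_lp:
                best, best_lp = y, lp
        return best

def binary_relevance_fit(docs, labelsets, all_labels):
    """Binary relevance: one independent classifier per label."""
    return {lab: BinaryNB().fit(docs, [lab in ls for ls in labelsets])
            for lab in all_labels}

def binary_relevance_predict(models, doc):
    return {lab for lab, m in models.items() if m.predict(doc)}
```

Each abstract then receives the set of all labels whose classifier fires, reproducing the multi-label output format.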

Abstract: Previous chatting systems have generally used methods based on lexical agreement between users' input sentences and target sentences in a database. However, these methods often suffer from well-known lexical disagreement problems. To resolve some of these problems, we propose a three-step sentence searching method in which each step is applied only when the previous step fails. The first step compares common keyword sequences between users' inputs and target sentences at the lexical level. The second step compares sentence types and semantic markers between users' inputs and target sentences at the semantic level. The last step matches users' inputs against predefined lexico-syntactic patterns. In the experiments, the proposed method showed better response precision and user satisfaction rates than simple keyword matching methods. Keywords: Chatting, Lexical-level searching, Semantic-level searching, Lexico-syntactic pattern matching

This paper describes a new statistical parser which is based on probabilities of dependencies between head-words in the parse tree. Standard bigram probability estimation techniques are extended to calculate probabilities of dependencies between pairs of words. Tests using Wall Street Journal data show that the method performs at least as well as SPATTER (Magerman 95, Jelinek et al. 94), which has the best published results for a statistical parser on this task. The simplicity of the approach means the model trains on 40,000 sentences in under 15 minutes. With a beam search strategy, parsing speed can be improved to over 200 sentences a minute with negligible loss in accuracy.
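The core estimation step, counting head-modifier pairs and normalising by head counts in the style of bigram estimation, can be sketched as follows; the input format and names are assumptions for illustration.

```python
from collections import Counter

def dependency_probs(treebank):
    """Estimate P(modifier | head) from counted head-word dependencies,
    extending bigram relative-frequency estimation to word pairs."""
    pair, head = Counter(), Counter()
    for deps in treebank:               # each tree given as (head, modifier) pairs
        for h, m in deps:
            pair[(h, m)] += 1
            head[h] += 1
    return {hm: c / head[hm[0]] for hm, c in pair.items()}
```

A full parser would combine these probabilities over all dependencies in a candidate tree and search (e.g. with a beam) for the highest-scoring analysis.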

This paper presents a method of learning deep AND-OR Grammar (AOG) networks for visual recognition, which we term AOGNets. An AOGNet consists of a number of stages, each of which is composed of a number of AOG building blocks. An AOG building block is designed based on a principled AND-OR grammar and represented by a hierarchical and compositional AND-OR graph. Each node applies some basic operation (e.g., Conv-BatchNorm-ReLU) to its input. There are three types of nodes: an AND-node explores composition, whose input is computed by concatenating features of its child nodes; an OR-node represents alternative ways of composition in the spirit of exploitation, whose input is the element-wise sum of features of its child nodes; and a Terminal-node takes as input a channel-wise slice of the input feature map of the AOG building block. AOGNets aim to harness the best of two worlds (grammar models and deep neural networks) in representation learning with end-to-end training. In experiments, AOGNets are tested on three highly competitive image classification benchmarks: CIFAR-10, CIFAR-100 and ImageNet-1K. AOGNets obtain better performance than the widely used Residual Net and its variants, and perform comparably to DenseNet. AOGNets are also tested in object detection on PASCAL VOC 2007 and 2012 using the vanilla Faster RCNN system and obtain better performance than the Residual Net.

One area of linguistics which has developed very rapidly in the last 25 years is phraseology. Corpus study has shown that routine phraseology is pervasive in language use, and various models of recurrent word-combinations have been proposed. This paper discusses aspects of frequent phraseology in English: the distribution of recurrent multi-word sequences in different text-types and the structure, lexis and function of some frequent multi-word sequences. Most of the data come from a major interactive data-base which provides extensive quantitative information on recurrent phraseology in the British National Corpus (BNC). This data-base, available at, has been developed by William Fletcher. Quantitative phraseological data have important implications for linguistic theory, because they show how findings from phraseology can be related to independent findings from other areas of linguistics, including recent studies of grammar and of semantic change. However, the very large amount of data itself poses methodological and interpretative puzzles.

This paper introduces a new framework for open-domain question answering in which the retriever and the reader iteratively interact with each other. The framework is agnostic to the architecture of the machine reading model, only requiring access to the token-level hidden representations of the reader. The retriever uses fast nearest neighbor search to scale to corpora containing millions of paragraphs. A gated recurrent unit updates the query at each step conditioned on the state of the reader and the reformulated query is used to re-rank the paragraphs by the retriever. We conduct analysis and show that iterative interaction helps in retrieving informative paragraphs from the corpus. Finally, we show that our multi-step-reasoning framework brings consistent improvement when applied to two widely used reader architectures DrQA and BiDAF on various large open-domain datasets --- TriviaQA-unfiltered, QuasarT, SearchQA, and SQuAD-Open.
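The retrieve-read-reformulate loop can be sketched without the neural components. Here the GRU update is replaced by a simple interpolation toward the best-scoring paragraph, purely to illustrate the iterative re-ranking; everything in this snippet is a hypothetical stand-in for the learned model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def multi_step_retrieve(query_vec, paragraph_vecs, steps=3, top_k=1):
    """Iteratively rank paragraphs by similarity to the current query,
    then reformulate the query toward the best paragraph (a toy stand-in
    for the GRU update conditioned on the reader state)."""
    history, q = [], query_vec
    for _ in range(steps):
        ranked = sorted(range(len(paragraph_vecs)),
                        key=lambda i: cosine(q, paragraph_vecs[i]),
                        reverse=True)
        history.append(ranked[:top_k])
        best = paragraph_vecs[ranked[0]]
        q = [0.5 * qi + 0.5 * bi for qi, bi in zip(q, best)]
    return history
```

In the real framework the ranking step is fast nearest-neighbor search over millions of paragraph vectors, and the reader consumes the retrieved paragraphs at every step.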

This paper presents an algorithm for tagging words whose part-of-speech properties are unknown. Unlike previous work, the algorithm categorizes word tokens in context instead of word types. The algorithm is evaluated on the Brown Corpus.

Eliciting semantic similarity between concepts in the biomedical domain remains a challenging task. Approaches based on embedding vectors have gained popularity, as they have proven efficient at capturing semantic relationships. The underlying idea is that two words with close meanings occur in similar contexts. In this study, we propose a new neural network model named MeSH-gram, which relies on a straightforward approach that extends the skip-gram neural network model by considering MeSH (Medical Subject Headings) descriptors instead of words. Trained on the publicly available PubMed MEDLINE corpus, MeSH-gram is evaluated on reference standards manually annotated for semantic similarity. MeSH-gram is first compared to skip-gram with vectors of size 300 and several context window sizes. A deeper comparison is performed with twenty existing models. Spearman's rank correlations between human scores and computed similarities show that MeSH-gram outperforms the skip-gram model and is comparable to the best methods, which however require more computation and external resources.
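The substitution of MeSH descriptors for target words can be illustrated by the pair-generation step of skip-gram training. The sketch below pairs each document-level descriptor with the context words of every token position; this is one plausible reading of the idea, not the authors' released code, and all names are illustrative.

```python
def meshgram_pairs(doc_tokens, doc_descriptors, window=2):
    """Generate skip-gram style (target, context) training pairs where the
    target is a MeSH descriptor of the document rather than the token itself
    (hypothetical simplification of the MeSH-gram idea)."""
    pairs = []
    for i, _ in enumerate(doc_tokens):
        lo, hi = max(0, i - window), min(len(doc_tokens), i + window + 1)
        context = [doc_tokens[j] for j in range(lo, hi) if j != i]
        for d in doc_descriptors:
            pairs.extend((d, c) for c in context)
    return pairs
```

The resulting pairs can feed any standard skip-gram trainer, so descriptor vectors are learned in the same space as word vectors.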

In this paper we present the results of an empirical study into the cognitive reality of existing classifications of modality using Polish data. We analyzed random samples of 250 independent observations for the 7 most frequent modal words (móc 'can', można 'it is possible', musieć 'must', należy 'it is necessary', powinien 'should', trzeba 'it is required', wolno 'it is allowed'), extracted from the National Corpus of Polish. Observations were annotated for modal type according to four different classifications of modality, as well as for morphological, syntactic and semantic properties using the Behavioral Profiling approach. Multiple correspondence analysis and (polytomous) regression models were used to determine how well modal type and usage align. These corpus-based findings were validated experimentally. In a forced choice task, 'naive' native speakers were exposed to definitions and prototypical examples of modal types or functions and then labeled a number of authentic corpus sentences accordingly. In the sorting task, naive native speakers sorted authentic corpus sentences into semantically coherent groups. In this article we discuss the results of our empirical study as well as the issues involved in building usage-based accounts on traditional linguistic classifications.

Sentiment analysis, or opinion mining, refers to the process of identifying and categorizing subjective information in source materials using natural language processing (NLP), text analytics and statistical linguistics. The main purpose of opinion mining is to determine the writer's attitude towards a particular topic under discussion. This is done by identifying the polarity of a particular text passage using different feature sets. Feature engineering in the pre-processing phase plays a vital role in improving the performance of a classifier. In this paper we empirically evaluate various feature weighting mechanisms against well-established classification techniques for opinion mining, i.e. Naive Bayes-Multinomial for binary polarity cases and SVM-LIN for multiclass cases. To evaluate these classification techniques we use the publicly available Rotten Tomatoes movie reviews dataset for training the classifiers, as it is widely used by the research community for this purpose. The empirical experiments conclude that the feature set containing noun, verb, adverb and adjective lemmas with the feature-frequency (FF) function performs best among all feature settings, with 84% and 85% correctly classified test instances for Naive Bayes and SVM, respectively.
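The feature-frequency (FF) setting over content-word lemmas can be sketched directly: keep only noun/verb/adverb/adjective lemmas and emit raw frequencies over a fixed vocabulary. The tag set and function names are illustrative, not the paper's exact pipeline.

```python
from collections import Counter

# content-word POS classes used by the FF feature setting (illustrative tags)
CONTENT_POS = {"NOUN", "VERB", "ADV", "ADJ"}

def ff_features(tagged_lemmas, vocab):
    """Feature-frequency (FF) weighting: each feature value is the raw count
    of a content-word lemma in the review; other POS classes are dropped."""
    counts = Counter(lemma for lemma, pos in tagged_lemmas if pos in CONTENT_POS)
    return [counts[w] for w in vocab]
```

The resulting count vectors are exactly what a multinomial Naive Bayes or linear SVM expects as input.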

KABA is a subject heading language used in Polish library catalogues to describe document subjects. An attempt has been made to convert KABA into a thesaurus compliant with the CIDOC CRM ontology to embed it in a semantic knowledge base comprising the foundation of the Integrated Knowledge System created in the SYNAT Project. Information objects (e.g. books) in the knowledge base are described with KABA headings, which increases the search engine recall based on the information about relations between the subjects. This paper presents the process of transforming KABA into a fully machine-readable thesaurus form and the challenges that must be overcome in order for this process to succeed.

Abstract A 560-unit neural network with two layers of modifiable connections was trained by means of back-propagation to disambiguate the syntactic categories of words in samples of text taken from the Brown Corpus. After training, the network was able to successfully disambiguate words in previously unanalyzed text with 95% accuracy, a performance level comparable to the best current computational techniques for the disambiguation of syntactic function. The model incorporates plausible psychological constraints on its input and output representations, and exhibited human-like behavior during parts of the learning process. The network's success suggests that syntactic category disambiguation may be mainly a low-level, bottom-up process with little dependence on the recognition of higher-level syntactic structures. Although the network simulates only a restricted component of the human language processing mechanism, its intrinsic ability to use partially formed data should allow it to be easily integrated into a full-scale language comprehension system. The model's overall performance level, along with its psychological plausibility, indicates that neural networks may be a new and useful approach to building human language processing models.

Sentiment analysis on large-scale social media data is important for bridging the gaps between social media content and real-world activities, including political election prediction, individual and public emotional status monitoring and analysis, and so on. Although textual sentiment analysis has been well studied on platforms such as Twitter and Instagram, the role of the extensive use of emojis in sentiment analysis remains largely unexplored. In this paper, we propose a novel scheme for Twitter sentiment analysis with extra attention on emojis. We first learn bi-sense emoji embeddings under positive and negative sentiment tweets individually, and then train a sentiment classifier by attending to these bi-sense emoji embeddings with an attention-based long short-term memory network (LSTM). Our experiments show that the bi-sense embedding is effective for extracting sentiment-aware embeddings of emojis and outperforms state-of-the-art models. We also visualize the attentions to show that the bi-sense emoji embedding provides better guidance to the attention mechanism, yielding a more robust understanding of semantics and sentiment.
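The attention over the two sense embeddings of a single emoji can be sketched in isolation: score each sense against a context vector, softmax the two scores, and mix. This omits the LSTM that produces the context vector; the function and vectors are hypothetical.

```python
import math

def bisense_attend(context_vec, pos_emb, neg_emb):
    """Attention over the two sense embeddings of one emoji: score each
    sense against the sentence context and mix them by softmax weight."""
    s_pos = sum(c * p for c, p in zip(context_vec, pos_emb))
    s_neg = sum(c * n for c, n in zip(context_vec, neg_emb))
    m = max(s_pos, s_neg)
    e_pos, e_neg = math.exp(s_pos - m), math.exp(s_neg - m)
    w_pos = e_pos / (e_pos + e_neg)          # attention weight on the positive sense
    mixed = [w_pos * p + (1 - w_pos) * n for p, n in zip(pos_emb, neg_emb)]
    return mixed, w_pos
```

A context that aligns with the positive sense pulls the mixed embedding toward it, which is the behavior the attention visualizations in the paper are meant to show.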

Web scraping tools are simplifying the task of creating large databases for various applications, such as the construction of corpora aimed at the development of natural language processing applications. Many of these applications require a large amount of data, and in that sense the Web presents itself as an important data source. Among the various tasks within the scope of NLP, one of the most challenging is automatic text generation, where the objective is to generate syntactically and semantically correct texts after training on a particular corpus. This article presents the elaboration of an English song-lyrics corpus, extracted from the Web, that can be used to train applications for the automatic generation of lyrics, poems, or other NLP-related tasks. After its normalization, an analysis of the corpus is presented, as well as analyses performed after corpus vectorization (embedding) with two current techniques.

LSTM-based language models have been shown effective in Word Sense Disambiguation (WSD). In particular, the technique proposed by Yuan et al. (2016) returned state-of-the-art performance in several benchmarks, but neither the training data nor the source code was released. This paper presents the results of a reproduction study and analysis of this technique using only openly available datasets (GigaWord, SemCor, OMSTI) and software (TensorFlow). Our study showed that similar results can be obtained with much less data than hinted at by Yuan et al. (2016). Detailed analyses shed light on the strengths and weaknesses of this method. First, adding more unannotated training data is useful, but is subject to diminishing returns. Second, the model can correctly identify both popular and unpopular meanings. Finally, the limited sense coverage in the annotated datasets is a major limitation. All code and trained models are made freely available.
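The classification step of this family of WSD methods can be sketched as nearest-centroid matching: each sense is represented by the average LM context vector of its annotated examples, and a test context picks the closest sense. This is a simplified reading of the technique, with illustrative sense keys.

```python
import math

def predict_sense(context_vec, sense_vectors):
    """Pick the sense whose centroid vector (averaged from annotated
    examples, e.g. SemCor) is closest in cosine similarity to the context
    vector produced by the language model."""
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (math.sqrt(sum(x * x for x in a)) *
               math.sqrt(sum(y * y for y in b)))
        return num / den if den else 0.0
    return max(sense_vectors, key=lambda s: cos(context_vec, sense_vectors[s]))
```

The reproduction study's finding on limited sense coverage follows directly from this design: a sense with no annotated examples has no centroid and can never be predicted.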

Today, with the development of the Semantic Web, Linked Open Data (LOD), expressed using the Resource Description Framework (RDF), has reached the status of "big data" and can be considered a giant data resource from which knowledge can be discovered. The process of learning knowledge defined in terms of OWL 2 axioms from RDF datasets can be viewed as a special case of knowledge discovery from data, or "data mining", which can be called "RDF mining". Approaches to the automated generation of axioms from recorded RDF facts on the Web may be regarded as a case of inductive reasoning and ontology learning. The instances, represented by RDF triples, play the role of specific observations, from which axioms can be extracted by generalization. Based on the insight that discovering new knowledge is essentially an evolutionary process, whereby hypotheses are generated by some heuristic mechanism and then tested against the available evidence, so that only the best hypotheses survive, we propose the use of Grammatical Evolution, one type of evolutionary algorithm, for mining disjointness OWL 2 axioms from an RDF data repository such as DBpedia. For the evaluation of candidate axioms against the DBpedia dataset, we adopt an approach based on possibility theory.
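The Grammatical Evolution mapping step can be shown concretely: integer codons, taken modulo the number of productions, steer a leftmost derivation through a BNF-style grammar until only terminals remain. The toy grammar below derives disjointness axioms over three illustrative DBpedia classes; it is a sketch, not the system's actual grammar.

```python
# toy BNF-style grammar; the DBpedia class names are illustrative
GRAMMAR = {
    "<axiom>": [["DisjointClasses(", "<class>", ", ", "<class>", ")"]],
    "<class>": [["dbo:Person"], ["dbo:Place"], ["dbo:Work"]],
}

def ge_map(genome, symbol="<axiom>"):
    """Grammatical Evolution mapping: consume integer codons modulo the
    number of productions to derive a candidate OWL axiom string."""
    out, stack, i = [], [symbol], 0
    while stack:
        sym = stack.pop(0)                      # leftmost non-terminal first
        if sym in GRAMMAR:
            rules = GRAMMAR[sym]
            choice = rules[genome[i % len(genome)] % len(rules)]
            i += 1                              # one codon consumed per choice
            stack = list(choice) + stack
        else:
            out.append(sym)
    return "".join(out)
```

Each derived axiom is then a hypothesis to be scored against the RDF facts (in the paper, with a possibility-theoretic measure) before selection and variation produce the next generation.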

The increasing number of e-commerce and social networking sites is producing large amounts of data pertaining to reviews of products, restaurants, etc. A keen observation reveals that the text data gathered from any social review site are specific to a context and subjective in nature, promoting varied perceptions of sentiment. The novel idea is to define a context-specific grammar as the semantics for a particular domain. Our research aims to develop a scalable model where features obtained from matching semantic patterns are used to predict the sentiment polarity of movie reviews and to provide a sentiment score for each review. The proposed model is intended to be flexible so that it can be applied to any domain by redefining the semantics specific to that domain. Many other models achieve accuracies greater than 80% using various methods, but a study suggests that a 70%-accurate program is as good as humans, who have varied perceptions of the sentiment of a movie review since it is a subjective summary of a film. Our model may give lower accuracy, but it uses a cognitive approach that tries to capture these varied perceptions by learning from a combination of positive and negative grammars. Analyzing the results from various experiments, we find that Logistic Regression with SGD on Apache Spark performs best, with an accuracy of 64.12%, while being highly scalable. High dependency on the grammars is a limitation of the model; improvements can be made by varying the quality and quantity of the grammars.

This paper presents a methodology for the development of an Urdu handwritten text image corpus and the application of corpus linguistics in the field of OCR and information retrieval from handwritten documents. Compared to other language scripts, Urdu script is somewhat complicated for data entry: entering a single character requires a combination of multiple key strokes. Here, a mixed approach is proposed and demonstrated for building an Urdu corpus for OCR and demographic data collection. The demographic part of the database could be used to train a system to fetch data automatically, which would help simplify the existing manual data-processing tasks involved in data collection from input forms such as passports, ration cards, voting cards, AADHAR, driving licences, Indian Railway reservations, census data, etc. This would increase the participation of the Urdu language community in understanding and benefiting from government schemes. To make the database available and applicable across the broad area of corpus linguistics, we propose a methodology for data collection, mark-up, digital transcription, and XML metadata for benchmarking.

This paper describes negation in contemporary Votic – a near-extinct Finnic language. All the data come from fieldwork carried out in the last decade. We analyse different aspects of the negation system, including such typologically important features as expression of the person and number of the negative construction by the negative auxiliary verb, while the main verb expresses the tense and mood characteristics; the formation of the Votic prohibitive and its classification in the World Atlas of Language Structures; a specific system of negative pronouns; the use of the abessive suffix only in verbal forms; negative conjunctions that can conjugate. The paper also includes a short overview of the changes in the system of negation in contemporary Votic as compared to previous descriptions.

Information about the contributions of individual authors to scientific publications is important for assessing authors' achievements. Some biomedical publications have a short section that describes authors' roles and contributions. It is usually written in natural language and hence author contributions cannot be trivially extracted in a machine-readable format. In this paper, we present 1) A statistical analysis of roles in author contributions sections, and 2) NaiveRole, a novel approach to extract structured authors' roles from author contribution sections. For the first part, we used co-clustering techniques, as well as Open Information Extraction, to semi-automatically discover the popular roles within a corpus of 2,000 contributions sections from PubMed Central. The discovered roles were used to automatically build a training set for NaiveRole, our role extractor approach, based on Naive Bayes. NaiveRole extracts roles with a micro-averaged precision of 0.68, recall of 0.48 and F1 of 0.57. It is, to the best of our knowledge, the first attempt to automatically extract author roles from research papers. This paper is an extended version of a previous poster published at JCDL 2018.
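The semi-automatic training-set construction can be sketched as phrase-triggered labeling: sentences containing a discovered role phrase are tagged with that role and then feed the Naive Bayes extractor. The phrase-to-role table below is a small illustrative stand-in for the co-clustering output.

```python
# illustrative trigger phrases; the real roles come from co-clustering
# and Open Information Extraction over 2,000 contribution sections
ROLE_KEYWORDS = {
    "wrote the manuscript": "writing",
    "designed the study": "design",
    "analyzed the data": "analysis",
    "performed the experiments": "experiments",
}

def label_contributions(sentence):
    """Tag a contribution sentence with every role whose trigger phrase
    it contains, producing weakly supervised training examples."""
    s = sentence.lower()
    return {role for phrase, role in ROLE_KEYWORDS.items() if phrase in s}
```

Sentences labeled this way become the training data from which the Naive Bayes model generalizes to unseen phrasings.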

The present book offers fresh insights into the description of ditransitive verbs and their complementation in present-day English. In the theory-oriented first part, a pluralist framework is developed on the basis of previous research that integrates ditransitive verbs as lexical items with both the entirety of their complementation patterns and the cognitive and semantic aspects of ditransitivity. This approach is combined with modern corpus-linguistic methodology in the present study, which draws on an exhaustive semi-automatic analysis of all patterns of ditransitive verbs in the British component of the International Corpus of English (ICE-GB) and also takes into account selected data from the British National Corpus (BNC). In the second part of the study, the complementation of ditransitive verbs (e.g. give, send) is analysed quantitatively and qualitatively. Special emphasis is placed here on the identification of significant principles of pattern selection, i.e. factors that lead language users to prefer specific patterns over other patterns in given contexts (e.g. weight, focus, pattern flow in text, lexical constraints). In the last part, some general aspects of a network-like, usage-based model of ditransitive verbs, their patterns and the relevant principles of pattern selection are sketched out, thus bridging the gap between the performance-related description of language use and a competence-related model of language cognition.

Abstract Bilingual word embeddings represent words of two languages in the same space, and allow knowledge to be transferred from one language to the other without machine translation. The main approach is to train monolingual embeddings first and then map them using bilingual dictionaries. In this work, we present a novel method to learn bilingual embeddings based on multilingual knowledge bases (KB) such as WordNet. Our method extracts bilingual information from multilingual wordnets via random walks and learns a joint embedding space in one go. We further reinforce cross-lingual equivalence by adding bilingual constraints to the loss function of the popular Skip-gram model. Our experiments on twelve cross-lingual word similarity and relatedness datasets in six language pairs covering four languages show that: 1) our method outperforms the state-of-the-art mapping method using dictionaries; 2) multilingual wordnets on their own improve over text-based systems on similarity datasets; 3) the combination of wordnet-generated information and text is key to good results. Our method can be applied to richer KBs like DBpedia or BabelNet, and can be easily extended to multilingual embeddings. All our software and resources are open source.
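The random-walk step can be sketched on a toy bilingual graph: walking over synset links and emitting node labels yields pseudo-sentences that mix both languages, which a standard Skip-gram trainer can then consume. The graph encoding and names here are assumptions for illustration.

```python
import random

def wordnet_random_walk(graph, start, length, rng):
    """One random walk over a multilingual synset graph (adjacency dict);
    the emitted node labels form a pseudo-sentence mixing both languages."""
    walk = [start]
    for _ in range(length - 1):
        neighbours = graph.get(walk[-1])
        if not neighbours:          # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbours))
    return walk
```

Generating many such walks produces a corpus in which translation-equivalent words share contexts, so Skip-gram places them close together in the joint space.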

Translating information between text and image is a fundamental problem in artificial intelligence that connects natural language processing and computer vision. In the past few years, performance in image caption generation has seen significant improvement through the adoption of recurrent neural networks (RNN). Meanwhile, text-to-image generation has begun to produce plausible images using datasets of specific categories like birds and flowers. We have even seen image generation from multi-category datasets such as the Microsoft Common Objects in Context (MSCOCO) dataset through the use of generative adversarial networks (GANs). Synthesizing objects with a complex shape, however, is still challenging. For example, animals and humans have many degrees of freedom, which means that they can take on many complex shapes. We propose a new training method called Image-Text-Image (I2T2I) which integrates text-to-image and image-to-text (image captioning) synthesis to improve the performance of text-to-image synthesis. We demonstrate that I2T2I can generate better multi-category images using MSCOCO than the state-of-the-art. We also demonstrate that I2T2I can achieve transfer learning by using a pre-trained image captioning module to generate human images on the MPII Human Pose dataset.

GIScience 2016 Short Paper Proceedings Which Kobani? A Case Study on the Role of Spatial Statistics and Semantics for Coreference Resolution Across Gazetteers Rui Zhu, Krzysztof Janowicz, Bo Yan, and Yingjie Hu STKO Lab, Department of Geography, University of California, Santa Barbara, USA {ruizhu,jano,boyan,yingjiehu} Abstract Identifying the same places across di\u21B5erent gazetteers is a key prerequisite for spatial data con- flation and interlinkage. Conventional approaches mostly rely on combining spatial distance with string matching and structural similarity measures, while ignoring relations among places and the semantics of place types. In this work, we propose to use spatial statistics to mine semantic signatures for place types and use these signatures for coreference resolution, i.e., to determine whether records form di\u21B5erent gazetteers refer to the same place. We implement 27 statistical features for computing these signatures and apply them to the type and entity levels to determine the corresponding places between two gazetteers, which are GeoNames and DBpedia. The city of Kobani, Syria, is used as a running example to demonstrate the feasibility of our approach. The experimental results show that the proposed signatures have the potential to improve the performance of coreference resolution. Keywords: Spatial statistics, coreference resolution, gazetteers, semantic signatures Introduction and Motivation Coreference resolution across gazetteers is an important prerequisite for spatial data conflation and in- terlinkage. Conventional approaches, such as coordinate matching, string matching, and feature type matching, often focus on the footprints, names, and types of places, as well as the combination of these three properties (Sehgal et al., 2006; Shvaiko and Euzenat, 2013). However, such approaches have their limitations. 
Today, most gazetteers still rely on centroids for representing geographic features (even for feature types such as counties, rivers, or oceans). These centroids differ significantly across datasets, often by more than 100 km. Furthermore, it is difficult to select a place-type-agnostic distance threshold as initial search radius. Polygon- and polyline-based matching, e.g., using Hausdorff distance, comes with its own limitations, scale and the resulting generalization being key problems. For string matching, such as using Levenshtein distance, the same place may have substantially different toponyms (e.g., Ayn al-Arab in TGN and Kobani in DBpedia), while different places may share common names. In addition, simply relying on direct feature type matching is likely to fail since different gazetteers employ incompatible typing schemata/ontologies. In conjunction, these problems often lead to either false negative or false positive matches. In previous work, we proposed using spatial signatures, which are derived from spatial statistics, to understand the semantics of place types bottom-up (Zhu et al., 2016). In this work, we apply these signatures to coreference resolution. The spatial statistics used are selected from three perspectives; a detailed list is shown in Table 1:
• Spatial point pattern analysis. Point coordinates are used to quantitatively measure the spatial point patterns of place types (such as populated place). Kernel density estimation, Ripley's K, and standard deviational ellipse analysis are conducted and the corresponding statistics are obtained for representing the signatures. Furthermore, we computed these statistics from both local and global aspects.
• Spatial autocorrelation analysis. In order to capture the interaction between places, we converted the point patterns into raster maps where each pixel represents the intensity of points. Spatial correlation statistics, such as Moran's I and semivariograms, are subsequently used to improve the signatures.
• Spatial interactions with other geographic features. In contrast to the first two perspectives, this group of statistical features is derived by integrating other geographic features. These
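The spatial autocorrelation step above can be sketched in a few lines. This is a minimal illustration of computing global Moran's I over a toy intensity raster with rook-contiguity weights, not the actual 27-feature signature implementation; the function name and the example grid are ours.

```python
import numpy as np

def morans_i(raster):
    """Global Moran's I for a raster of point intensities, using rook
    (4-neighbor) contiguity weights (a simple illustrative choice)."""
    z = raster - raster.mean()       # deviations from the mean intensity
    num = 0.0
    w_sum = 0.0
    rows, cols = raster.shape
    for i in range(rows):
        for j in range(cols):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    num += z[i, j] * z[ni, nj]   # cross-product with each neighbor
                    w_sum += 1.0                 # unit weight per neighbor pair
    n = raster.size
    return (n / w_sum) * (num / (z ** 2).sum())

# A spatially clustered pattern scores clearly positive
clustered = np.zeros((6, 6))
clustered[:3, :3] = 1.0
```

Positive values indicate clustering (similar intensities adjacent to each other), negative values indicate dispersion, which is what makes the statistic useful as part of a place-type signature.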

Relation detection plays a crucial role in Knowledge Base Question Answering (KBQA) because of the high variance of relation expressions in questions. Traditional deep learning methods follow an encoding-comparing paradigm, where the question and the candidate relation are represented as vectors to compare their semantic similarity. The max- or average-pooling operation, which compresses the sequence of words into a fixed-dimensional vector, becomes an information bottleneck. In this paper, we propose to learn attention-based word-level interactions between questions and relations to alleviate the bottleneck issue. As in traditional models, the question and relation are first represented as sequences of vectors. Then, instead of merging each sequence into a single vector with a pooling operation, soft alignments between words from the question and the relation are learned. The aligned words are subsequently compared with a convolutional neural network (CNN) and the comparison results are finally merged. By performing the comparison on low-level representations, the attention-based word-level interaction model (ABWIM) relieves the information loss caused by merging the sequence into a fixed-dimensional vector before the comparison. Experimental results of relation detection on both the SimpleQuestions and WebQuestions datasets show that ABWIM achieves state-of-the-art accuracy, demonstrating its effectiveness.
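The soft-alignment step described above can be sketched as a dot-product attention over word vectors. This is a minimal illustration under our own naming, not the paper's exact ABWIM architecture (which additionally compares the aligned words with a CNN and merges the results).

```python
import numpy as np

def soft_align(question, relation):
    """Word-level soft alignment: every question word attends over all
    relation words, yielding a relation-side summary vector per question word.

    question: (m, d) array of question word vectors
    relation: (n, d) array of relation word vectors
    """
    scores = question @ relation.T                     # (m, n) word-word similarities
    # softmax over relation words, stabilized by subtracting the row max
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    aligned = weights @ relation                       # (m, d) aligned representations
    return aligned, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))   # 5 question words, 8-dim embeddings
r = rng.normal(size=(3, 8))   # 3 relation words
aligned, w = soft_align(q, r)
```

Because each question word keeps its own aligned vector, the pairwise comparison can happen before any pooling, which is the point of the word-level interaction design.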

Author(s): Houser, Michael John | Advisor(s): Mikkelsen, Line | Abstract: Do so anaphora is fairly widely used in English, but has received relatively little treatment in the literature (especially when compared with verb phrase ellipsis). There are, however, two aspects of this anaphor that have gained prominence: i) its use as a test for constituency within the verb phrase, and ii) the semantic restriction it places on its antecedent. Though these two properties have been the most prominent, their analyses have not been uncontroversial. In this dissertation, I investigate these properties and give them a more complete analysis. The first part of the dissertation is devoted to a discussion of the use of do so as a test for constituency in the verb phrase, and the second part is devoted to understanding the semantic restriction that do so places on its antecedent. The behavior of do so anaphora has been used to argue for both hierarchical structure (Lakoff and Ross 1976) and flat structure within the verb phrase (Culicover and Jackendoff 2005). In chapter 2, however, I argue that do so does not have any bearing on the debate about the internal structure of the verb phrase. The arguments put forth by these authors are predicated on do so being a surface anaphor in terms of Hankamer and Sag (1976). Instead, I argue that do so is in fact a deep anaphor and that its purported surface anaphor properties fall out from independent semantic and pragmatic properties of the anaphor. As a deep anaphor, do so does not replace any structure in the verb phrase, but rather forms a verb phrase in its own right from the beginning of the derivation. Therefore, the use of do so to argue for or against hierarchical structure in the verb phrase has been misguided. I approach the semantic restriction that do so places on its antecedent from two angles.
In chapter 3, I review the previous analyses of this restriction and test their claims against a corpus of over 1000 naturally occurring examples extracted from the American National Corpus. None of the previous analyses are supported by the data, and I present a novel analysis that utilizes three semantic parameters (agentivity, aktionsart, stativity) to predict which antecedents are possible with do so. One striking property of the counterexamples found in the corpus is that they instantiate particular syntactic structures. The majority of them contain do so in a nonfinite form (usually in the infinitive), and in others, the antecedent is contained in a relative clause modifying the subject of do so. In chapter 4, I present experimental evidence showing that these two syntactic environments lessen the effects of the restriction that do so normally places on its antecedent. I attribute this amelioration of the semantic restriction to the unavailability of verb phrase ellipsis in these syntactic environments. The analysis falls out from the nonmonotonic interaction of the two restrictions: the syntactic restrictions on ellipsis force the use of do so to the detriment of the semantic restriction that do so normally places on its antecedent. I then situate this amelioration effect within the typology of coercion effects in general and argue that do so displays a novel type of coercion: subtractive coercion.

RNNs and their variants have been widely adopted for image captioning. In RNNs, the production of a caption is driven by a sequence of latent states. Existing captioning models usually represent latent states as vectors, taking this practice for granted. We rethink this choice and study an alternative formulation, namely using two-dimensional maps to encode latent states. This is motivated by curiosity about a question: how do the spatial structures in the latent states affect the resultant captions? Our study on MSCOCO and Flickr30k leads to two significant observations. First, the formulation with 2D states is generally more effective in captioning, consistently achieving higher performance with comparable parameter sizes. Second, 2D states preserve spatial locality. Taking advantage of this, we visually reveal the internal dynamics in the process of caption generation, as well as the connections between the input visual domain and the output linguistic domain.

As language data and associated technologies proliferate and as the language resources community expands, it is becoming increasingly difficult to locate and reuse existing resources. Are there any lexical resources for such-and-such a language? What tool works with transcripts in this particular format? What is a good format to use for linguistic data of this type? Questions like these dominate many mailing lists, since web search engines are an unreliable way to find language resources. This paper reports on a new digital infrastructure for discovering language resources being developed by the Open Language Archives Community (OLAC). At the core of OLAC is its metadata format, which is designed to facilitate description and discovery of all kinds of language resources, including data, tools, or advice. The paper describes OLAC metadata, its relationship to Dublin Core metadata, and its dissemination using the metadata harvesting protocol of the Open Archives Initiative.

To characterize the recent status of the natural language processing (NLP) research field, this paper presents a data-driven statistical method utilizing bibliometrics and social network analysis of related publications. On 3,222 academic publications retrieved from the Web of Science core collection for the years 2007-2016, this paper explores literature distribution characteristics using descriptive statistics, research hotspots using k-means clustering, and cooperation relationships among authors and affiliations using network analysis. The findings provide learners and researchers with information to keep abreast of the research status of the NLP field.

Concept recognition (CR) is a foundational task in the biomedical domain. It supports the important process of transforming unstructured resources into structured knowledge. To date, several CR approaches have been proposed, most of which focus on a particular set of biomedical ontologies. Their underlying mechanisms vary from shallow natural language processing and dictionary lookup to specialized machine learning modules. However, no prior approach considers the effect of the case sensitivity characteristics and the term distribution of the underlying ontology on the CR process. This article proposes a framework that models the CR process as an information retrieval task in which both case sensitivity and the information gain associated with tokens in lexical representations (e.g., term labels, synonyms) are central components of a strategy for generating term variants. The case sensitivity of a given ontology is assessed based on the distribution of so-called case sensitive tokens in its terms, while information gain is modelled using a combination of divergence from randomness and mutual information. An extensive evaluation has been carried out using the CRAFT corpus. Experimental results show that case sensitivity awareness leads to an increase of up to 0.07 F1 against a non-case-sensitive baseline on the Protein Ontology and GO Cellular Component. Similarly, the use of information gain leads to an increase of up to 0.06 F1 against a standard baseline in the case of GO Biological Process and Molecular Function and GO Cellular Component. Overall, subject to the underlying token distribution, these methods lead to valid complementary strategies for augmenting term label sets to improve concept recognition.
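As a rough illustration of the case-sensitivity assessment, the sketch below scores an ontology by the fraction of token types that occur with more than one casing across its term labels. This is our own proxy definition of "case sensitive tokens", not necessarily the article's exact formulation; the function name and the sample labels are invented.

```python
from collections import defaultdict

def case_sensitivity(term_labels):
    """Illustrative proxy for an ontology's case sensitivity: the fraction of
    token types that appear with more than one surface casing across labels."""
    casings = defaultdict(set)
    for label in term_labels:
        for token in label.split():
            casings[token.lower()].add(token)   # group surface forms by lowercase
    if not casings:
        return 0.0
    sensitive = sum(1 for forms in casings.values() if len(forms) > 1)
    return sensitive / len(casings)

# Hypothetical term labels: "DNA"/"dna" and "cell"/"Cell" vary in casing
labels = ["DNA binding", "dna repair", "cell cycle", "Cell division"]
```

An ontology scoring high on such a measure would warrant case-preserving term variant generation, while a low score would justify case folding.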

We present a new database of lexical decision times for English words and nonwords, for which two groups of British participants each responded to 14,365 monosyllabic and disyllabic words and the same number of nonwords for a total duration of 16 h (divided over multiple sessions). This database, called the British Lexicon Project (BLP), fills an important gap between the Dutch Lexicon Project (DLP; Keuleers, Diependaele, & Brysbaert, Frontiers in Psychology, 1, 174, 2010) and the English Lexicon Project (ELP; Balota et al., 2007), because it applies the repeated measures design of the DLP to the English language. The high correlation between the BLP and ELP data indicates that a high percentage of variance in lexical decision data sets is systematic variance, rather than noise, and that the results of megastudies are rather robust with respect to the selection and presentation of the stimuli. Because of its design, the BLP makes the same analyses possible as the DLP, offering researchers an interesting new data set of word-processing times for mixed-effects analyses and mathematical modeling. The BLP data are available at and as Electronic Supplementary Materials.

Available in a Dutch and an English edition, the Syntactic Atlas of the Dutch Dialects provides a detailed overview of the surprisingly rich syntactic variation found in 267 dialects of Dutch at the beginning of the 21st century. 200 full-color maps show the geographic distribution of more than 100 syntactic variables. Many of these variables involve phenomena that are absent from the standard language and thus of great descriptive and theoretical importance. A state-of-the-art linguistic commentary accompanies each map, taking into account the results of modern syntactic research, as well as historical developments. Volume I includes (a.o.) subject pronouns and pronoun doubling, the anaphoric system, expletive pronouns, complementizer agreement, yes/no agreement, complementizer doubling, question word doubling, relative clauses and topicalisation. Volume II (Fall 2006) includes (a.o.) word order in verb clusters, the Infinitivus pro Participio effect, the Participium pro Infinitivo effect, perfective participle doubling, the Imperative pro Infinitivo effect, DO-support, negative particles, negative concord and negative quantifiers. Further information on the SAND can be found at:

This paper introduces a recent extension of a Romanian speech corpus to include prosodic annotations of the speech data in the form of ToBI labels. We describe the methodology for determining the pitch patterns that are common in Romanian, annotate the speech resource, and then provide a comparison of two text-to-speech synthesis systems to establish the benefits of adding this type of information to our speech resource. The result is a publicly available speech dataset which can be used to further develop speech synthesis systems or to automatically learn the prediction of ToBI labels from Romanian text.

This paper describes the independent construction and implementation of two cellular automata that model dialect feature diffusion as the adaptive aspect of the complex system of speech. We show how a feature, once established, can spread across an area, and how the distribution of a dialect feature as it stands in Linguistic Atlas data could either spread or diminish. Cellular automata use update rules to determine the status of a feature at a given location with respect to the status of its neighboring locations. In each iteration all locations in a matrix are evaluated, and then the new status for each one is displayed all at once. Throughout hundreds of iterations, we can watch regional distributional patterns emerge as a consequence of these simple update rules. We validate patterns with respect to the linguistic distributions known to occur in the Linguistic Atlas Project.
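The update-rule mechanism can be sketched as a small cellular automaton. The majority-style rule and the seeding below are illustrative choices of ours, not the rules actually used in the paper; the point is only to show synchronous updates driving diffusion of an established feature.

```python
import numpy as np

def step(grid, threshold=3):
    """One synchronous update: a cell acquires the dialect feature when at
    least `threshold` of its 8 neighbors already have it (illustrative rule).
    Cells that have the feature keep it."""
    neighbors = np.zeros_like(grid)
    padded = np.pad(grid, 1)                 # zero border so edges have fewer neighbors
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbors += padded[1 + dy: 1 + dy + grid.shape[0],
                                1 + dx: 1 + dx + grid.shape[1]]
    return ((grid == 1) | (neighbors >= threshold)).astype(int)

grid = np.zeros((9, 9), dtype=int)
grid[4, 3:6] = 1          # seed: the feature established in a small area
for _ in range(5):        # iterate and watch the regional pattern spread
    grid = step(grid)
```

Because all cells are evaluated against the old matrix and then updated at once, the spread pattern depends only on the rule and the initial distribution, which is what makes such automata comparable against Linguistic Atlas distributions.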

Introduction. The prospect of preparing a comprehensive history of space guidance and navigation was, initially, a delight to contemplate. But, as the unproductive weeks went by, the original euphoria was gradually replaced by a sense of pragmatism. I reasoned that the historical papers which had the greatest appeal were written by "old timers" telling of their personal experiences. Since I had lived through the entire space age, and had the good fortune of being involved in many of the nation's important aerospace programs, I decided to narrow the scope to encompass only that of which I had personal knowledge. (It is, however, a sobering thought that you might qualify as an "old timer.") The story begins in the early 1950s when the MIT Instrumentation Laboratory (later to become the Charles Stark Draper Laboratory, Inc.) was chosen by the Air Force Western Development Division to provide a self-contained guidance system backup to Convair in San Diego for the new Atlas intercontinental ballistic missile. The work was contracted through the Ramo-Wooldridge Corporation, and the technical monitor for the MIT task was a young engineer named Jim Fletcher who later served as the NASA Administrator. The Atlas guidance system was to be a combination of an on-board autonomous system and a ground-based tracking and command system. This was the beginning of a philosophic controversy which, in some areas, remains unresolved. The self-contained system finally prevailed in ballistic missile applications for obvious reasons. In space exploration, a mixture of the two remains. The electronic digital computer industry was in its infancy then, so an on-board guidance system could be mechanized only with analog components. Likewise, the design and analysis tools were highly primitive by today's standards. It is difficult to appreciate the development problems without considering the available computational aids.

The paper presents experimental results on WSD, with focus on disambiguation of Russian nouns that refer to tangible objects and abstract notions. The body of contexts has been extracted from the Russian National Corpus (RNC). The tool used in our experiments is aimed at statistical processing and classification of noun contexts. The WSD procedure takes into account taxonomy markers of word meanings as well as lexical markers and morphological tagsets in the context. A set of experiments allows us to establish preferential conditions for WSD in Russian texts.

We describe our participation in the Knowledge Base Population (KBP) English slot filling track of TAC. Our system is based on a distant supervision approach, where training instances are created by heuristically matching DBpedia relation tuples with sentences. Our official submission, which uses web snippets returned by search engines to collect positive instances, results in performance comparable to the median. Based on an initial analysis we find a number of shortcomings with our official runs. After the TAC evaluation, we submitted an unofficial run which matches relation tuples with sentences appearing in Wikipedia pages. The proposed solutions implemented in our unofficial run did not further improve performance.

Preface; 1. Introduction; 2. The sound system; 3. The lexicon; 4. The lexicon (continued); 5. Parts of speech; 6. Parts of speech (continued); 7. Parts of speech (continued); 8. The noun phrase; 9. The verb phrase; 10. The sentence; 11. The sentence (continued); Appendix 1. Parts of speech; Appendix 2. Texts; Bibliography; Index

We present Contextual Query Rewrite (CQR), a dataset for multi-domain task-oriented spoken dialogue systems that is an extension of the Stanford dialog corpus (Eric et al., 2017a). While previous approaches have addressed the issue of diverse schemas by learning candidate transformations (Naik et al., 2018), we instead model the reference resolution task as a user query reformulation task, where the dialog state is serialized into a natural language query that can be executed by the downstream spoken language understanding system. In this paper, we describe our methodology for creating the query reformulation extension to the dialog corpus, and present an initial set of experiments to establish a baseline for the CQR task. We have released the corpus to the public [1] to support further research in this area.

Rough Set-based machine learning and knowledge acquisition, as embodied in the system LERS, are applied to the task of sorting out natural-language word sense relationships. The data for training and testing are derived from The Oxford English Dictionary; a subsequent objective of this research enterprise is automatically placing additional terms in Roget's International Thesaurus. The results of this research are promising, and we would now like to employ this approach to provide a comprehensive whole-language base for general use in building varied natural-language computing applications.

In this paper we present the Multilingual All-Words Sense Disambiguation and Entity Linking task. Word Sense Disambiguation (WSD) and Entity Linking (EL) are well-known problems in the Natural Language Processing field and both address the lexical ambiguity of language. Their main difference lies in the kind of meaning inventories that are used: EL uses encyclopedic knowledge, while WSD uses lexicographic information. Our aim with this task is to analyze whether, and if so, how, using a resource that integrates both kinds of inventories (i.e., BabelNet 2.5.1) might enable WSD and EL to be solved by means of similar (even the same) methods. Moreover, we investigate this task in a multilingual setting and for some specific domains.

NLM (National Library of Medicine®) maintains two broad, relatively small classifications:
• A set of 122 descriptors from MeSH® (Medical Subject Headings®), known as JDs, used for manually indexing MEDLINE® journals per se according to discipline. These are found in the List of Journals Indexed for MEDLINE, which also contains the listing of titles under these descriptors. For example, Journal of Pediatric Surgery is listed under both Pediatrics and Surgery.
• A set of 135 STs in the Semantic Network in NLM's UMLS (Unified Medical Language System®). Concepts in the UMLS Metathesaurus® are assigned one or more STs which semantically characterize those concepts. For example, the Metathesaurus concept Aspirin is assigned the STs Pharmacologic Substance and Organic Chemical.
The JDI tool uses a methodology based on statistical word-JD associations from a training set of MEDLINE citations to which are imported the JDs corresponding to journal unique identifiers in the citations. For example, words in articles in the Journal of Pediatric Surgery become statistically associated with the JDs Pediatrics and Surgery. Then an input text comprised of words similar to the ones in these articles would be categorized by the same JDs. Using words in the input, JDI ranks the JDs according to the average of JD scores in word-JD associations. For example, the first three JDs, with scores, returned by JDI for the input "appendectomy in children" are: 1 0.7311 Surgery, 2 0.6856 Pediatrics, and 3 0.4661 Gastroenterology.
This methodology is being applied in the SemRep UMLS NLP (Natural Language Processing) tool; JDI increases accuracy by identifying MEDLINE citations in the molecular genetics domain before NLP begins.
A possible retrieval application would be to intersect citations described by a JD with citations described by textwords or another JD, for example: neurotransmitters [tw] AND Cardiology [jd]; Cardiology [jd] AND Pediatrics [jd].
JDI methodology is the basis for STI (Semantic Type Indexing). ST "documents" are created comprised of UMLS Metathesaurus strings belonging to the ST, and these documents each undergo JDI. Then statistical word-ST associations are calculated by comparing JDI of individual training set words and JDI of these ST documents. Using words in the input, STI ranks the STs according to the average of ST scores in word-ST associations. For example, the first three STs, with scores, returned by STI for the input "appendectomy in children" are: 1 0.5985 Age Group, 2 0.5520 Finding, and 3 0.5498 Therapeutic or Preventive Procedure. That is, the average Age Group score for words in the input is higher than for other STs. An alternate method of STI compares the JDI of the input to the JDI of each ST document, and ranks the STs according to the greatest similarity to their ST documents. By this method, JDI of this input is most similar to JDI of the Age Group document.
NLM has applied STI to WSD. If the senses of an ambiguous word are expressed by candidate STs for its meaning, STI can be performed on the context surrounding the word (phrase, sentence, abstract) in the expectation that in the STI of the context, the correct ST for the word will rank higher than the other candidate STs.
The Lexical Systems Group plans to distribute an open source Java version of the JDI tool as part of the UMLS NLP tools. This tool would allow users to enter text input, and would return a ranked list of JDs or STs with scores between 0 and 1.
This work was supported by the Intramural Research Program of the NIH, National Library of Medicine.
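The JDI ranking step can be sketched as follows, assuming a precomputed table of statistical word-JD association scores. The real tool derives these associations from a MEDLINE training set; the table below is invented for illustration, and `jdi_rank` is our own naming.

```python
# Hypothetical word-JD association scores (illustrative values only; the real
# tool learns these statistically from MEDLINE citations).
word_jd_scores = {
    "appendectomy": {"Surgery": 0.92, "Pediatrics": 0.55, "Gastroenterology": 0.60},
    "children":     {"Surgery": 0.54, "Pediatrics": 0.82, "Gastroenterology": 0.33},
}

def jdi_rank(text):
    """Rank JDs by the average of their word-JD association scores over the
    input words that have known associations."""
    words = [w for w in text.lower().split() if w in word_jd_scores]
    jds = {jd for w in words for jd in word_jd_scores[w]}
    averages = {
        jd: sum(word_jd_scores[w].get(jd, 0.0) for w in words) / len(words)
        for jd in jds
    }
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
```

With the toy table above, `jdi_rank("appendectomy in children")` puts Surgery first, mirroring the averaging behavior the abstract describes; STI works the same way with ST documents in place of JDs.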

Word sense disambiguation is the process of determining which sense of a word is used in a given context. Due to its importance in understanding the semantics of natural languages, word sense disambiguation has been extensively studied in computational linguistics. However, existing methods are either brittle, narrowly focusing on specific topics or words, or provide only mediocre performance in real-world settings. Broad coverage and disambiguation quality are critical for a word sense disambiguation system. In this paper we present a fully unsupervised word sense disambiguation method that requires only a dictionary and unannotated text as input. Such an automatic approach overcomes the brittleness suffered by many existing methods and makes broad-coverage word sense disambiguation feasible in practice. We evaluated our approach using SemEval 2007 Task 7 (Coarse-grained English All-words Task); our system significantly outperformed the best unsupervised system participating in SemEval 2007 and achieved performance approaching that of top-performing supervised systems. Although our method was only tested with coarse-grained sense disambiguation, it can be directly applied to fine-grained sense disambiguation.

The paper focuses on the numerous advantages of the Linguee contextual search engine used as a tool for teaching translation. As an example, it provides a few tasks that help develop professional translation skills.

This paper presents some findings about musical genres. The main goal is to analyse whether there is any agreement between a group of experts and a community when defining a set of genres and their relationships. For this purpose, three different experiments are conducted using two datasets: the expert taxonomy and tags at the artist level. The experimental results show a clear agreement for some components of the taxonomy (Blues, HipHop), whilst in other cases (e.g. Rock) there is no correlation. Interestingly enough, the same results are found in the MIREX 2007 results for the audio genre classification task. Therefore, a multi-faceted approach to musical genre using expert-based classifications, dynamic associations derived from the wisdom of crowds, and content-based analysis can improve genre classification, as well as other relevant MIR tasks such as music similarity or music recommendation.

Corpus Linguistics - which exploits electronic annotated corpora in the study of languages - is a widespread and consolidated approach. In particular, parallel corpora, where texts in a language are aligned with their translation in a second language, are an extremely useful tool in contrastive analysis. The lack of good parallel corpora for the languages of our interest - Russian and Italian - has led us to work on improving the Italian-Russian parallel corpus available as a pilot corpus in the Russian National Corpus. Therefore, this work had a twofold aim: practical and theoretical. On the one hand, after studying the essential issues for designing a high-quality corpus, all the criteria for expanding the corpus were established and the number of texts was increased, allowing the Italian-Russian parallel corpus, which counted 700,000 words, to reach more than 4 million words. As a result, it is now possible to conduct scientifically valid research based on this corpus. On the other hand, three corpus-based analyses were proposed in order to highlight the potential of the corpus: the study of prefixed Russian memory verbs and their translation into Italian; the comparison between the Italian analytic causative "fare + infinitive" and Russian causative verbs; and the comparative analysis of fifteen Italian versions of The Overcoat by N. Gogol'. These analyses first of all allowed us to make some methodological remarks regarding a further enlargement and improvement of the Italian-Russian parallel corpus. Secondly, the corpus-based approach has proved useful in deepening the study of these topics from a theoretical point of view.

Abstract: concept ⇒ concept (with fewer attributes)
Abstract is defined as removing an attribute from a concept. The abstract knowledge source makes a concept less specific by abstracting away an irrelevant detail (see also specify). Abstract may also work on an instance.
Classify: (set of) attribute(s) ⇒ concept
The classify knowledge source takes a number of attributes and classifies these as representing a certain concept. In other words, a concept is recognised on the basis of its attributes. Another term for classify could be identify. Classify does not distinguish between generic concepts or instances of these concepts, although it usually will have a generic concept as output.
Generalise: (set of) concept(s) ⇒ (generic) concept
Generalisation is concerned with finding the common features in a number of concepts and trying to map these onto an existing concept or to develop a new concept. In the former, generalisation is closely related to classify. Developing a new concept is also known as induction, a reasoning technique often studied in the context of machine learning (cf. [39; 106]). Generalise can also work on instances, but will usually produce a generic concept as output.
Instantiate: concept (or structure) ⇒ instantiated concept (or structure)
This knowledge source creates an instance of a concept or a structure. The inference competence represented by this knowledge source may seem a bit trivial, because it involves some form of copying rather than actually inferring a proper relation. However, the distinction between general classes and specific instances has to be made and that is exactly what this knowledge source does. Notice that this knowledge source is often an implicit part of other knowledge sources.

Twitter is a rich source of continuously and instantly updated information. Shortness and informality of tweets are challenges for Natural Language Processing tasks. In this paper, we present TwitterNEED, a hybrid approach for Named Entity Extraction and Named Entity Disambiguation for tweets. We believe that disambiguation can help to improve the extraction process. This mimics the way humans understand language and reduces error propagation in the whole system. Our extraction approach aims for high extraction recall first, after which a Support Vector Machine attempts to filter out false positives among the extracted candidates using features derived from the disambiguation phase in addition to other word shape and Knowledge Base features. For Named Entity Disambiguation, we obtain a list of entity candidates from the YAGO Knowledge Base in addition to top-ranked pages from the Google search engine for each extracted mention. We use a Support Vector Machine to rank the candidate pages according to a set of URL and context similarity features. For evaluation, five data sets are used to evaluate the extraction approach, and three of them to evaluate both the disambiguation approach and the combined extraction and disambiguation approach. Experiments show better results compared to our competitors DBpedia Spotlight, Stanford Named Entity Recognition, and the AIDA disambiguation system.

Abstract: Traditional grammarians generally hold that English absolute clauses are formal and infrequent. This article is intended to carry out corpus-based quantitative research on the genre and diachronic distributions of English absolute clauses. We hypothesize that the distribution of absolute clauses in English is significantly different across genres and that the diachronic distribution of each function type of absolute clauses in different genres is homogeneous. The British National Corpus (BNC)-based genre distribution research shows that absolute clauses are not frequently used in either informal spoken texts or formal academic texts; rather, they are mostly used in the narrative texts of fiction. The Corpus of Historical American English (COHA)-based research shows that over the span of 200 years, the total number of absolute clauses tends to increase rather than decrease. This is especially true of absolute clauses of attendant circumstances. Although the number of absolute clauses of clausal adju...

We describe a project aimed at creating a deeply annotated corpus of Russian texts. The annotation consists of comprehensive morphological marking, syntactic tagging in the form of a complete dependency tree, and semantic tagging within a restricted semantic dictionary. Syntactic tagging uses about 80 dependency relations. The syntactically annotated corpus comprises more than 28,000 sentences and forms an autonomous part of the Russian National Corpus. Semantic tagging is based on an inventory of semantic features (descriptors) and a dictionary comprising about 3,000 entries, with a set of tags assigned to each lexeme and its argument slots. The set of descriptors assigned to words has been designed in such a way as to construct a linguistically relevant classification for the whole Russian vocabulary. This classification serves for discovering laws according to which the elements of various lexical and semantic classes interact in texts. The inventory of semantic descriptors consists of two parts: object descriptors (about 90 items in total) and predicate descriptors (about a hundred). A set of semantic roles has been thoroughly elaborated and contains about 50 roles.1 (1 The paper was partially supported by grant No. 04-07-90179 from the Russian Foundation for Basic Research, which is gratefully acknowledged. In addition to the authors of the paper, Valentina Apresjan, Olga Boguslavskaya, Tatyana Krylova, Irina Levontina and Elena Uryson have contributed to the creation of the semantic dictionary and the system of descriptors.) 1. Syntactic Tagging. The paper is a progress report on a project aimed at creating a deeply annotated corpus of Russian texts. This corpus, jointly developed by two Moscow teams, is largely based on the ideology of an advanced MT system, ETAP-3 (Apresjan et al.
2003), and is so far the only corpus of Russian supplied with comprehensive morphological annotation and syntactic tagging in the form of a complete dependency tree provided for every sentence. Fig. 1 is a screenshot of the dependency tree for the sentence (1) Наибольшее возмущение участников митинга вызвал продолжающийся рост цен на бензин, устанавливаемых нефтяными компаниями 'It was the continuing growth of petrol prices set by oil companies that caused the greatest indignation of the participants of the meeting'. (Fig. 1. A syntactically tagged sentence.) Here, nodes represent words assigned morphological and part-of-speech tags, whilst branches are labeled with the names of syntactic links. The tagging uses about 80 surface-syntactic links; half of these were proposed in Mel'cuk's Meaning ⇔ Text Theory (see e.g. Mel'cuk 1988) and the rest were adopted from the ETAP-3 system or specifically designed for the project. Annotation is produced semi-automatically: sentences are first processed by the rule-based Russian parser of ETAP-3 and then edited manually by linguists who handle all hard cases, including cases of ambiguity that cannot be reliably resolved without extralinguistic knowledge, as well as versatile elliptical constructions, syntactic idiomaticity, and the like.
Currently, the syntactically tagged corpus exceeds 28,000 sentences belonging to modern Russian texts of a variety of genres (fiction, popular science, newspaper and journal articles, etc.) and is steadily growing. It is an integral but fully autonomous part of the Russian National Corpus, developed in a nationwide research project and available on the Web.

Background: Species occurrence records are very important in the biodiversity domain. While several available corpora contain only annotations of species names or of habitats and geographical locations, there is no consolidated corpus that covers all the entity types necessary for extracting species occurrences from biodiversity literature. To alleviate this issue, we have constructed the COPIOUS corpus, a gold-standard corpus that covers a wide range of biodiversity entities. Results: Two annotators manually annotated the corpus with five categories of entities, i.e. taxon names, geographical locations, habitats, temporal expressions and person names. The overall inter-annotator agreement on 200 doubly-annotated documents is approximately 81.86% F-score. Amongst the five categories, the agreement on habitat entities was the lowest, indicating that this type of entity is complex. The COPIOUS corpus consists of 668 documents downloaded from the Biodiversity Heritage Library, with over 26K sentences and more than 28K entities. Named entity recognisers trained on the corpus could achieve an F-score of 74.58%. Moreover, in recognising taxon names, our model performed better than two available tools in the biodiversity domain, namely the SPECIES tagger and Global Name Recognition and Discovery. More than 1,600 binary relations of Taxon-Habitat, Taxon-Person, Taxon-Geographical location and Taxon-Temporal expression were identified by applying a pattern-based relation extraction system to the gold standard. Based on the extracted relations, we can produce a knowledge repository of species occurrences. Conclusion: The paper describes in detail the construction of a gold-standard named entity corpus for the biodiversity domain.
An investigation of the performance of named entity recognition (NER) tools trained on the gold standard revealed that the corpus is sufficiently reliable and sizeable for both training and evaluation purposes. The corpus can further be used for relation extraction to locate species occurrences in the literature, a useful task for monitoring species distribution and preserving biodiversity.

In this paper, we report a knowledge-based method for Word Sense Disambiguation in the domains of biomedical and clinical text. We combine word representations created on large corpora with a small number of definitions from the UMLS to create concept representations, which we then compare to representations of the context of ambiguous terms. Using no relational information, we obtain comparable performance to previous approaches on the MSH-WSD dataset, which is a well-known dataset in the biomedical domain. Additionally, our method is fast and easy to set up and extend to other domains. Supplementary materials, including source code, can be found at https://
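The core idea above, representing a concept by the embeddings of its definition words and comparing that representation to the context of an ambiguous term, can be sketched as follows. This is a minimal illustration: the three-dimensional vectors, sense names and definitions are all invented, and do not reflect the authors' actual embeddings or UMLS definitions.

```python
import math

# Toy word vectors standing in for embeddings trained on a large corpus
# (all values are illustrative placeholders, not from any real model).
VECS = {
    "cold":        [0.9, 0.1, 0.0],
    "temperature": [0.8, 0.2, 0.1],
    "illness":     [0.1, 0.9, 0.2],
    "virus":       [0.0, 0.8, 0.3],
    "winter":      [0.7, 0.1, 0.0],
    "fever":       [0.2, 0.9, 0.1],
}

def centroid(words):
    """Average the vectors of those words that have an embedding."""
    vs = [VECS[w] for w in words if w in VECS]
    return [sum(col) / len(vs) for col in zip(*vs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def disambiguate(context, sense_definitions):
    """Pick the sense whose definition centroid is closest to the context centroid."""
    ctx = centroid(context)
    return max(sense_definitions,
               key=lambda s: cosine(ctx, centroid(sense_definitions[s])))

# Hypothetical senses of "cold" with toy definition words
senses = {
    "common_cold": ["illness", "virus", "fever"],
    "low_temperature": ["cold", "temperature", "winter"],
}
print(disambiguate(["virus", "fever"], senses))
```

Because both concepts and contexts live in the same vector space, no relational knowledge is needed: disambiguation reduces to a nearest-centroid comparison.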

Linked Data brings inherent challenges in the way users and applications consume the available data. Users consuming Linked Data on the Web should be able to search and query data spread over potentially large numbers of heterogeneous, complex and distributed datasets. Ideally, a query mechanism for Linked Data should abstract users from the representation of data. This work investigates a vocabulary-independent natural language query mechanism for Linked Data, using an approach based on the combination of entity search, a Wikipedia-based semantic relatedness measure and spreading activation. The Wikipedia-based semantic relatedness measure addresses limitations of existing works that rely on WordNet-based similarity measures or term expansion. Experimental results using the query mechanism to answer 50 natural language queries over DBpedia achieved a mean reciprocal rank of 61.4%, an average precision of 48.7% and an average recall of 57.2%.

This paper gives an overview of an interdisciplinary research project that is concerned with the application of computational linguistics methods to the analysis of the structure and variance of rituals, as investigated in ritual science. We present the motivation and prospects of a computational approach to ritual research, and explain the choice of specific analysis techniques. We discuss design decisions for data collection and processing and present the general NLP architecture. For the analysis of ritual descriptions, we apply the frame semantics paradigm, with newly invented frames where appropriate. Using scientific ritual research literature, we experimented with several techniques for the automatic extraction of domain terms for the domain of rituals. As ritual research is a highly interdisciplinary endeavour, a vocabulary common to all sub-areas of ritual research is hard to specify and highly controversial. The domain terms extracted from ritual research literature are used as a basis for a common vocabulary and thus support the creation of ritual-specific frames. We applied the tf-idf, χ² and PageRank algorithms to our ritual research literature corpus and two non-domain corpora: the British National Corpus and the British Academic Written English corpus. All corpora have been part-of-speech tagged and lemmatized. The domain terms were evaluated by two ritual experts independently. Interestingly, the results of the algorithms differed for different parts of speech. This finding is in line with the fact that the inter-annotator agreement also differs between parts of speech.
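As a rough illustration of the tf-idf style of domain-term extraction described above, the following sketch ranks the terms of a tiny "domain" corpus against reference documents. The corpora are invented, and this is only one common tf-idf variant; the actual study operates on POS-tagged, lemmatized corpora and also uses χ² and PageRank.

```python
import math
from collections import Counter

def tfidf_terms(domain_docs, reference_docs, top_n=3):
    """Rank the terms of a domain corpus by tf-idf, where document
    frequency is computed over the combined domain + reference documents."""
    all_docs = domain_docs + reference_docs
    df = Counter()
    for doc in all_docs:
        df.update(set(doc))            # count each term once per document
    tf = Counter(w for doc in domain_docs for w in doc)
    n = len(all_docs)
    scores = {w: tf[w] * math.log(n / df[w]) for w in tf}
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_n]]

# Toy tokenized documents (invented for illustration)
domain = [["ritual", "purification", "rite"],
          ["ritual", "offering"],
          ["ritual", "rite", "fire"]]
reference = [["weather", "news"], ["sports", "news"]]
print(tfidf_terms(domain, reference))
```

Terms frequent in the domain corpus but rare elsewhere rise to the top, while words common to all corpora are discounted by the idf factor.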

We analyze some of the fundamental design challenges that impact the development of a multilingual state-of-the-art named entity transliteration system, including curating bilingual named entity datasets and evaluating multiple transliteration methods. We empirically evaluate the transliteration task using the traditional weighted finite state transducer (WFST) approach against two neural approaches: the encoder-decoder recurrent neural network method and the recent, non-sequential Transformer method. In order to improve the availability of bilingual named entity transliteration datasets, we release personal name bilingual dictionaries mined from Wikidata for English to Russian, Hebrew, Arabic and Japanese Katakana. Our code and dictionaries are publicly available.

Wikipedia is a well-known public and collaborative encyclopaedia consisting of millions of articles. Initially in English, the popular website has grown to include versions in over 288 languages. These versions and their articles are interconnected via cross-language links, which not only facilitate navigation and understanding of concepts in multiple languages, but have been used in natural language processing applications, developments in linked open data, and the expansion of minor Wikipedia language versions. These applications are the motivation for an automatic, robust, and accurate technique to identify cross-language links. In this paper, we present a multilingual approach called EurekaCL to automatically identify missing cross-language links in Wikipedia. More precisely, given a source Wikipedia article, EurekaCL uses the multilingual and semantic features of BabelNet 2.0 to efficiently identify a set of candidate articles in a target language that are likely to cover the same topic as the source. The Wikipedia graph structure is then exploited both to prune and to rank the candidates. Our evaluation, carried out on 42,000 pairs of articles in eight language versions of Wikipedia, shows that our candidate selection and pruning procedures allow an effective selection of candidates, which significantly helps the determination of the correct article in the target language version.

We present SQLova, the first Natural-language-to-SQL (NL2SQL) model to achieve human performance on the WikiSQL dataset. We revisit and discuss diverse popular methods in the NL2SQL literature, take full advantage of BERT (Devlin et al., 2018) through an effective table contextualization method, and coherently combine them, outperforming the previous state of the art by 8.2% and 2.5% in logical form and execution accuracy, respectively. We particularly note that BERT with a seq2seq decoder leads to poor performance on this task, indicating the importance of careful design when using such large pretrained models. We also provide a comprehensive analysis of the dataset and our model, which can be helpful for designing future NL2SQL datasets and models. We especially show that our model's performance is near the upper bound on WikiSQL, where we observe that a large portion of the evaluation errors are due to wrong annotations, and our model already exceeds human performance by 1.3% in execution accuracy.

Article history: Received 25 May 2016; received in revised form 28 August 2016; accepted 20 September 2016. It is ironic to note that, worldwide, Internet content in the Arabic language amounts to a mere 1%, whereas 5% of the world population speaks Arabic. This disproportionate on-line presence of Arabic compared to other languages may be due to many reasons, including a lack of experts in the field of the Arabic language. This research study investigates the impact of the Machine Translation (MT) software and TM tools that are widely used by the Arab community for academic and business purposes. The study aims at finding whether it is possible to bring about a paradigm shift from Arabic Localization to Arabic Globalization, hence facilitating the use of NLP techniques in the human interface with the computer. For this study, a few machine translation systems (e.g. SYSTRAN, IBM Watson) are studied for their content and applications, to determine their usage without human intervention while retaining the meaning of the original text.

This paper explores the use of clickthrough data for query spelling correction. First, large amounts of query-correction pairs are derived by analyzing users' query reformulation behavior encoded in the clickthrough data. Then, a phrase-based error model that accounts for the transformation probability between multi-term phrases is trained and integrated into a query speller system. Experiments are carried out on a human-labeled data set. Results show that the system using the phrase-based error model significantly outperforms its baseline systems.
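A query speller of this kind typically combines an error model with a language model in a noisy-channel fashion. The sketch below uses a toy word-level error model with invented probabilities; the paper's actual model is phrase-based and trained on clickthrough-derived query-correction pairs.

```python
import math

# Toy error model P(typed | intended) and toy unigram language model
# P(intended); all probabilities are illustrative placeholders.
ERROR_MODEL = {
    ("beleive", "believe"): 0.6,   # common misspelling of "believe"
    ("beleive", "beleive"): 0.3,   # query typed as intended
}
LANG_MODEL = {"believe": 0.01, "beleive": 0.00001}

def score(query, correction):
    """Noisy-channel score: log P(query | correction) + log P(correction)."""
    err = ERROR_MODEL.get((query, correction), 1e-9)
    lm = LANG_MODEL.get(correction, 1e-9)
    return math.log(err) + math.log(lm)

def best_correction(query, candidates):
    """Return the candidate correction with the highest channel score."""
    return max(candidates, key=lambda c: score(query, c))

print(best_correction("beleive", ["believe", "beleive"]))
```

The phrase-based model of the paper generalizes the error term to multi-word phrases, so that transformations such as "smokey mountains" to "smoky mountains" are scored as a unit rather than word by word.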

We consider a method of constructing a statistical tagger for the automated morphological tagging of Russian-language texts. In this method, each word is assigned a tag that contains information about its part of speech and the full set of its morphological characteristics. We employ the set of morphological characteristics used in the SynTagRus corpus, whose material has been used to train the tagger. The tagger is based on the SVM (Support Vector Machine) approach. The developed tagger has proven to be efficient and has shown high tagging quality.

Ontology alignment has become a very important problem in ensuring semantic interoperability between heterogeneous and distributed information sources. Instance-based ontology alignment is a very promising technique for finding semantic correspondences between entities of different ontologies when the ontologies contain many instances. In this paper, we describe a new approach for ontologies that do not share common instances. This approach extracts argument and event structures from a set of instances of a concept in the source ontology and compares them with other semantic features extracted from a set of instances of a concept in the target ontology, using Generative Lexicon Theory. We show that the approach is theoretically powerful, because it is grounded in linguistic semantics, and useful in practice. We present the experimental results obtained by running our approach on the Biblio test of the Benchmark series of OAEI 2011. The results show the good performance of our approach.

This paper presents innovative research resulting in an English-Lithuanian statistical factored phrase-based machine translation system with a spatial ontology. The system is based on the Moses toolkit and is enriched with semantic knowledge inferred from the spatial ontology. The ontology was developed on the basis of the GeoNames database (more than 15,000 toponyms), implemented in the Web Ontology Language (OWL), and integrated into the machine translation process. Spatial knowledge was added as an additional factor in the statistical translation model and used for toponym disambiguation during machine translation. The implemented machine translation approach was evaluated against a baseline system without spatial knowledge. A multifaceted evaluation strategy, including automatic metrics, human evaluation and linguistic analysis, was implemented to perform the evaluation experiments. The results of the evaluation have shown a slight improvement in the output quality of machine translation with spatial knowledge.

Machine translation is an active research domain in the field of artificial intelligence. The relevant literature presents a number of machine translation approaches for the translation of different languages. Urdu is the national language of Pakistan, while Arabic is a major language in almost 20 countries of the world, spoken by almost 450 million people. To the best of our knowledge, there is no published research work presenting any method for machine translation from Urdu to Arabic; however, some online machine translation systems, such as Google, Bing and Babylon, provide an Urdu-to-Arabic machine translation facility. In this paper, we compare the performance of these online machine translation systems. Input in the Urdu language is translated by the systems, and the output in Arabic is compared with ground-truth Arabic reference sentences. The comparative analysis evaluates the systems by three performance evaluation measures: BLEU (BiLingual Evaluation Understudy), METEOR (Metric for Evaluation of Translation with Explicit ORdering) and NIST (National Institute of Standards and Technology), with the help of a standard corpus. The results show that the Google translator is far better than the Bing and Babylon translators. It outperforms, on average, Babylon by 28.55% and Bing by 15.74%.
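The first of the three measures, BLEU, can be illustrated with a simplified sentence-level implementation: clipped n-gram precisions up to bigrams combined with a brevity penalty. Real evaluations use higher-order n-grams, smoothing and corpus-level aggregation, so this is only a sketch with invented example sentences.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions up to max_n, multiplied by a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(max(clipped, 1e-9) / total)
    # Brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if len(candidate) > len(reference) \
        else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat sat on the mat".split()
good = "the cat sat on the mat".split()
worse = "cat the mat sat".split()
print(bleu(good, ref), bleu(worse, ref))
```

An exact match scores 1.0, while a shorter, reordered candidate is penalized both by its low bigram precision and by the brevity penalty.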

We present three natural language marking strategies based on fast and reliable shallow parsing techniques, and on widely available lexical resources: lexical substitution, adjective conjunction swaps, and relativiser switching. We test these techniques on a random sample of the British National Corpus. Individual candidate marks are checked for goodness of structural and semantic fit, using both lexical resources and the web as a corpus. A representative sample of marks is given to 25 human judges to evaluate for acceptability and preservation of meaning. This establishes a correlation between corpus-based felicity measures and perceived quality, and makes qualified predictions. Grammatical acceptability correlates with our automatic measure strongly (Pearson's r = 0.795, p = 0.001), allowing us to account for about two thirds of the variability in human judgements. A moderate but statistically insignificant (Pearson's r = 0.422, p = 0.356) correlation is found with judgements of meaning preservation, indicating that the contextual window of five content words used for our automatic measure may need to be extended.

The main focus of our research is to construct a Job Recommendation System (JRS) based on ontology construction. The present discussion concerns the web data extraction process used in constructing the ontology. The key question is how best data can be extracted from various web pages so as to minimize the time requirement and improve efficiency. To construct the ontology, we first identify various job portals and then extract data from those portals; for this, the proposed method combines FSA (Finite State Automata) and NLP (Natural Language Processing) based extraction of data. The outcome of this method is efficient data extraction, in terms of time and space usage, when compared with other models.

Typical methods for text-to-image synthesis seek to design an effective generative architecture to model the text-to-image mapping directly. This is fairly arduous due to the cross-modality translation involved in the task of text-to-image synthesis. In this paper we circumvent this problem by focusing on parsing the content of both the input text and the synthesized image thoroughly, to model text-to-image consistency at the semantic level. In particular, we design a memory structure to parse the textual content by exploring the semantic correspondence between each word in the vocabulary and its various visual contexts across relevant images in the training data during text encoding. On the other hand, the synthesized image is parsed to learn its semantics in an object-aware manner. Moreover, we customize a conditional discriminator, which models the fine-grained correlations between words and image sub-regions to push for cross-modality semantic alignment between the input text and the synthesized image. Thus, a full-spectrum content-oriented parsing at the deep semantic level is performed by our model, which is referred to as Content-Parsing Generative Adversarial Networks (CPGAN). Extensive experiments on the COCO dataset show that CPGAN advances the state-of-the-art performance significantly.

We extract semantic links of words and logical properties from unstructured data. We jointly encode the semantics of words and logical properties into an embedding space. The embedding space provides semantic similarities between words and logical properties. Questions and potential answers can be represented in the embedding space. Potential answers are ranked based on semantic similarity with a given question. Semantic transformation of a natural language question into its corresponding logical form is crucial for knowledge-based question answering systems. Most previous methods have tried to achieve this goal by using syntax-based grammar formalisms and rule-based logical inference. However, these approaches are usually limited in terms of the coverage of the lexical trigger, which performs a mapping task from words to the logical properties of the knowledge base, and thus it is easy to ignore implicit and broken relations between properties by not interpreting the full knowledge base. In this study, our goal is to answer questions in any domain by using a semantic embedding space in which the embeddings encode the semantics of words and logical properties. In the latent space, the semantic associations between existing features can be exploited based on their embeddings without using a manually produced lexicon and rules. This embedding-based inference approach for question answering allows the mapping of factoid questions posed in natural language onto logical representations of the correct answers, guided by the knowledge base. In terms of overall question answering performance, our experimental results and examples demonstrate that the proposed method outperforms previous knowledge-based question answering baseline methods on a publicly released question answering evaluation dataset: WebQuestions.

The Intercontinental Dictionary Series - a rich and principled database for language comparison

In recent years, the automatic processing of opinions in text documents has received growing interest. Some possible causes are the exponential increase of user-generated content in Web 2.0, and also the interest of companies and governments in automatically analysing, filtering or detecting opinions from their customers or citizens. On the basis of similar works in English by other authors, in this paper we present the results obtained in experiments with an unsupervised sentiment classifier for Spanish. We also propose a supervised version of the classifier that shows significantly better performance. Experiments have been carried out using a corpus that we extracted from a website of movie reviews in Spanish. We have made this corpus available to the research community.

With the huge amount of textual information available online, the need for computer systems to process and analyse this information is keenly felt. One such system is text categorization, in which large volumes of text are grouped into categories based on their contents. In this paper, we introduce a novel graph-based approach using the BabelNet knowledge resource for Arabic text categorization. Contrary to the traditional Bag-of-Words model of document representation, we consider a model in which each document is represented by a graph that encodes relationships between the different named entities. The experimental results reveal that the graph-based representation using the SVM algorithm outperforms Naive Bayes (NB) with regard to all measures.

Distributional semantics in the form of word embeddings are an essential ingredient in many modern natural language processing systems. The quantification of semantic similarity between words can be used to evaluate the ability of a system to perform semantic interpretation. To this end, a number of word similarity datasets have been created for the English language over the last decades. For the Thai language, few such resources are available. In this work, we create three Thai word similarity datasets by translating and re-rating the popular WordSim-353, SimLex-999 and SemEval-2017-Task-2 datasets. The three datasets contain 1852 word pairs in total and have different characteristics in terms of difficulty, domain coverage, and notion of similarity (relatedness vs. similarity). These features help to gain a broader picture of the properties of an evaluated word embedding model. We include baseline evaluations with existing Thai embedding models, and identify the high ratio of out-of-vocabulary words as one of the biggest challenges. All datasets, evaluation results, and a tool for easy evaluation of new Thai embedding models are available to the NLP community online.
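Word-similarity datasets such as these are typically used by correlating model similarity scores with human ratings via Spearman's rank correlation. A self-contained sketch follows, with invented ratings and model scores standing in for a real dataset and embedding model:

```python
def ranks(values):
    """Rank values (1 = smallest), averaging the ranks of ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

# Human similarity ratings vs. model cosine scores for the same word pairs
human = [9.8, 7.4, 1.2, 3.5]
model = [0.91, 0.63, 0.10, 0.35]
print(round(spearman(human, model), 3))
```

Because only ranks matter, the absolute scale of the model scores is irrelevant; a monotone agreement with the human ratings yields a correlation of 1.0.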

Most studies that make use of keyword analysis rely on the log-likelihood or the chi-square to extract words that are particularly characteristic of a corpus (e.g. Scott & Tribble 2006). These measures are computed on the basis of absolute frequencies and cannot account for the fact that "corpora are inherently variable internally" (Gries 2007). To overcome this limitation, measures of dispersion are sometimes used in combination with keyness values (e.g. Rayson 2003; Oakes & Farrow 2007). Some scholars have also suggested using other statistical measures (e.g. the t-test, Wilcoxon's rank-sum test) but these techniques have not gained corpus linguists' favour (yet?). One possible explanation for this lack of enthusiasm is that their statistical added value has rarely been discussed in terms of 'linguistic' added value. To the authors' knowledge, there is not a single study comparing keywords extracted by means of different measures. In our presentation, we will report on a follow-up study to Paquot (2007), which made use of the log-likelihood and measures of range and dispersion to extract academic words and design a productively-oriented academic word list. We make use of the log-likelihood, the t-test and the Wilcoxon's rank-sum test in turn to compare the academic and the fiction sub-corpora of the British National Corpus and extract words that are typical of academic discourse. We compare the three lists of academic keywords on a number of criteria (e.g. number of keywords extracted by each measure, percentage of keywords that are shared in the three lists, frequency and distribution of academic keywords in the two corpora) and explore the specificities of the three statistical measures. We also assess the advantages and disadvantages of these measures for the design of an academic wordlist.
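For reference, the log-likelihood keyness measure (Dunning's G2) discussed above can be computed from a word's frequency in each of two corpora as follows; the frequencies are invented for illustration:

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood (G2) keyness of a word occurring freq_a times in
    corpus A (size_a tokens) and freq_b times in corpus B (size_b tokens).
    Expected frequencies assume the word is distributed proportionally
    to corpus size."""
    total = size_a + size_b
    expected_a = size_a * (freq_a + freq_b) / total
    expected_b = size_b * (freq_a + freq_b) / total
    g2 = 0.0
    for obs, exp in ((freq_a, expected_a), (freq_b, expected_b)):
        if obs > 0:                      # 0 * log(0) is taken to be 0
            g2 += obs * math.log(obs / exp)
    return 2 * g2

# e.g. a word with 120 hits in a 1M-token academic corpus
# vs. 10 hits in a 1M-token fiction corpus (invented figures)
print(round(log_likelihood(120, 1_000_000, 10, 1_000_000), 2))
```

As the abstract notes, G2 is computed from absolute frequencies only: a word concentrated in a handful of texts can score as high as one spread evenly across the corpus, which is exactly why dispersion measures are added as a complement.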

Keyphrases provide important semantic metadata for organizing and managing free-text documents. As data grow exponentially, there is a pressing demand for automatic and efficient keyphrase extraction methods. We introduce in this paper SemCluster, a clustering-based unsupervised keyphrase extraction method. By integrating an internal ontology (i.e., WordNet) with external knowledge sources, SemCluster identifies and extracts semantically important terms from a given document, clusters the terms, and, using the clustering results as heuristics, identifies the most representative phrases and singles them out as keyphrases. SemCluster is evaluated against two baseline unsupervised methods, TextRank and KeyCluster, over the Inspec dataset under an F1-measure metric. The evaluation results clearly show that SemCluster outperforms both methods.

Recurrent Neural Network (RNN) is one of the most popular architectures used in Natural Language Processing (NLP) tasks, because its recurrent structure is very suitable for processing variable-length text. An RNN can utilize distributed representations of words by first converting the tokens comprising each text into vectors, which form a matrix. This matrix has two dimensions: the time-step dimension and the feature-vector dimension. Most existing models then apply a one-dimensional (1D) max pooling operation or an attention-based operation only on the time-step dimension to obtain a fixed-length vector. However, the features on the feature-vector dimension are not mutually independent, and simply applying a 1D pooling operation over the time-step dimension independently may destroy the structure of the feature representation. On the other hand, applying a two-dimensional (2D) pooling operation over both dimensions may sample more meaningful features for sequence modeling tasks. To integrate the features on both dimensions of the matrix, this paper explores applying 2D max pooling to obtain a fixed-length representation of the text. This paper also utilizes 2D convolution to sample more meaningful information from the matrix. Experiments are conducted on six text classification tasks, including sentiment analysis, question classification, subjectivity classification and newsgroup classification. Compared with the state-of-the-art models, the proposed models achieve excellent performance on 4 out of 6 tasks. Specifically, one of the proposed models achieves the highest accuracy on the Stanford Sentiment Treebank binary classification and fine-grained classification tasks.
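The 2D max pooling operation over the time-step by feature-vector matrix can be sketched as follows. This is a plain-Python illustration of non-overlapping pooling with made-up numbers; the paper's models of course operate on learned RNN hidden states inside a deep learning framework.

```python
def max_pool_2d(matrix, pool_h, pool_w):
    """Non-overlapping 2D max pooling over a time-step x feature matrix:
    each (pool_h x pool_w) block is reduced to its maximum value."""
    rows, cols = len(matrix), len(matrix[0])
    pooled = []
    for r in range(0, rows - pool_h + 1, pool_h):
        row = []
        for c in range(0, cols - pool_w + 1, pool_w):
            row.append(max(matrix[r + i][c + j]
                           for i in range(pool_h) for j in range(pool_w)))
        pooled.append(row)
    return pooled

# Toy 4 time steps x 4 features, standing in for stacked RNN hidden states
feats = [
    [0.1, 0.5, 0.2, 0.9],
    [0.4, 0.3, 0.8, 0.1],
    [0.7, 0.2, 0.6, 0.3],
    [0.0, 0.6, 0.4, 0.5],
]
print(max_pool_2d(feats, 2, 2))  # reduces the 4x4 matrix to 2x2
```

Unlike 1D pooling over time alone, each pooled value here summarizes a local patch spanning both adjacent time steps and adjacent features, which is the integration of the two dimensions the paper argues for.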

The increasing diversity of languages used on the web introduces a new level of complexity to Information Retrieval (IR) systems. We can no longer assume that textual content is written in one language or even the same language family. In this paper, we demonstrate how to build massive multilingual annotators with minimal human expertise and intervention. We describe a system that builds Named Entity Recognition (NER) annotators for 40 major languages using Wikipedia and Freebase. Our approach does not require NER human-annotated datasets or language-specific resources like treebanks, parallel corpora, and orthographic rules. The novelty of our approach lies in using only language-agnostic techniques while achieving competitive performance. Our method learns distributed word representations (word embeddings) which encode semantic and syntactic features of words in each language. Then, we automatically generate datasets from the Wikipedia link structure and Freebase attributes. Finally, we apply two preprocessing stages (oversampling and exact surface form matching) which do not require any linguistic expertise. Our evaluation is twofold: first, we demonstrate the system's performance on human-annotated datasets; second, for languages where no gold-standard benchmarks are available, we propose a new method, distant evaluation, based on statistical machine translation.

Background: VerbNet, an extensive computational verb lexicon for English, has proved useful for supporting a wide range of Natural Language Processing tasks requiring information about the behaviour and meaning of verbs. Biomedical text processing and mining could benefit from a similar resource. We take the first step towards the development of BioVerbNet: a VerbNet specifically aimed at describing verbs in the area of biomedicine. Because VerbNet-style classification is extremely time-consuming, we start from a small manual classification of biomedical verbs and apply a state-of-the-art neural representation model, specifically developed for class-based optimization, to expand the classification with new verbs, using all the PubMed abstracts and the full articles in the PubMed Central Open Access subset as data.

The present study investigates the relationship between Japanese kanji strokes and their printed frequencies of occurrence, compositional asymmetry and multiple kanji readings. First, the distributions of kanji strokes in both samples, the 1,945 basic kanji and the 6,355 kanji appearing in the Asahi Newspaper published between 1985 and 1998, followed a negative hypergeometric distribution, as demonstrated by Figure 1. The distribution of strokes of the 1,945 kanji against their printed frequencies is rather irregular, as shown in Figure 2, but a rough-fitting model is drawn in Figure 3. Mathematical modelling of kanji strokes with lexical compositional asymmetry reveals an interesting tendency of regressive compounding; that is, the greater the number of strokes in a kanji, the more it tends to produce two-kanji compound words by adding a kanji on the right side of the target kanji, as shown in Figure 4. A kanji may often have multiple readings; this study also examines the number of readings in relation to the number of kanji strokes. As shown in Figure 6, the greater the number of kanji strokes, the fewer the number of readings. In other words, the more visually complex the kanji is, the more specialised its reading becomes. As such, kanji strokes, as one of the central characteristics of kanji, are closely related to other properties such as frequency, asymmetry and readings. The present study uses mathematical modelling to indicate these relations.
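For reference, the negative hypergeometric distribution mentioned above is often written in quantitative linguistics in the following parameterization (this is a standard textbook form, not a formula taken from the study itself; the parameters K, M and n stand in for whatever values the authors fitted):

```latex
% Negative hypergeometric probability mass function
% (a common Wimmer--Altmann-style parameterization; assumed, not quoted from the paper)
P(X = x) = \frac{\dbinom{M + x - 1}{x}\,\dbinom{K - M + n - x - 1}{n - x}}
                {\dbinom{K + n - 1}{n}},
\qquad x = 0, 1, \dots, n .
```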

With the rapid increase of textual information, the need for computer systems to process and analyze this information is felt. One such system is text summarization, in which a large volume of text is summarized using different algorithms. In this paper, a system for summarizing text using the BabelNet knowledge base and its concept graph is proposed. In the proposed approach, the concepts of words are extracted using BabelNet, concept graphs are produced, and sentences are rated according to the concepts and the resulting graph. These concept ratings are then used in the final summarization. Also, a repetition-control approach is proposed in which concepts already selected are penalized, producing summaries with less redundancy. To compare and evaluate the performance of the proposed method, the DUC2004 dataset is used with ROUGE as the evaluation metric. Compared to other methods, the proposed method produces summaries with higher quality and less redundancy.
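The redundancy-control idea can be illustrated with a minimal, self-contained sketch (this is not the paper's actual algorithm; the sentence identifiers and concept sets used in the example are hypothetical, and concept weights are simply frequencies rather than graph-based scores):

```python
def summarize(sentences, concepts_of, k=2, penalty=0.5):
    """Greedy sentence selection: a sentence's score is the sum of the
    weights of its concepts; concepts covered by an already-selected
    sentence are down-weighted (penalized) to reduce redundancy."""
    weight = {}
    for s in sentences:
        for c in concepts_of[s]:
            weight[c] = weight.get(c, 0.0) + 1.0
    summary = []
    remaining = list(sentences)
    for _ in range(min(k, len(remaining))):
        best = max(remaining,
                   key=lambda s: sum(weight[c] for c in concepts_of[s]))
        summary.append(best)
        remaining.remove(best)
        for c in concepts_of[best]:
            weight[c] *= penalty  # punish concepts already selected
    return summary
```

With a strong penalty, the second pick skips a sentence that merely repeats already-covered concepts in favour of one that adds new ones.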

The quality of training data is one of the crucial problems when a learning-centered approach is employed. This paper proposes a new method to investigate the quality of a large corpus designed for the recognizing textual entailment (RTE) task. The proposed method, which is inspired by statistical hypothesis testing, consists of two phases: the first phase introduces the predictability of textual entailment labels as a null hypothesis, which is extremely unlikely to hold if the target corpus has no hidden bias, and the second phase tests the null hypothesis using a Naive Bayes model. The experimental result on the Stanford Natural Language Inference (SNLI) corpus does not reject the null hypothesis. This indicates that the SNLI corpus has a hidden bias which allows prediction of textual entailment labels from hypothesis sentences even when no context information is given by a premise sentence. This paper also presents the impact of this hidden bias on the performance of neural network models for RTE.
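The bias probe can be illustrated with a toy hypothesis-only classifier (a minimal sketch, not the paper's exact model; the tiny training pairs below are invented, and a real probe would be trained on the full corpus with proper held-out evaluation):

```python
import math
from collections import Counter, defaultdict

def train_nb(pairs):
    """Naive Bayes trained on hypothesis words only; the premise is
    deliberately ignored, so accuracy above the majority baseline
    signals a hidden bias in the labels."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for hypothesis, label in pairs:
        label_counts[label] += 1
        for w in hypothesis.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, label_counts, vocab

def predict(model, hypothesis):
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best, best_lp = None, -math.inf
    for label in label_counts:
        lp = math.log(label_counts[label] / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in hypothesis.split():
            # add-one smoothing over the joint vocabulary
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```

If such a classifier beats the majority-class baseline on held-out hypotheses, the labels are partially predictable without premises, i.e. the corpus carries a hidden bias.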

We describe the Why2-Atlas intelligent tutoring system for qualitative physics that interacts with students via natural language dialogue. We focus on the issue of analyzing and responding to multi-sentential explanations. We explore an approach that combines a statistical classifier, multiple semantic parsers and a formal reasoner for achieving a deeper understanding of these explanations in order to provide appropriate feedback on them.

OAGK is a keyword extraction/generation dataset consisting of 2.2 million abstracts, titles and keyword strings from scientific articles. Texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file.

This data is derived from the OAG data collection, which was released under the ODC-BY licence.

This data (OAGK Keyword Generation Dataset) is released under the CC-BY licence.

If using it, please cite the following paper: Cano, Erion and Bojar, Ondřej, 2019, Keyphrase Generation: A Text Summarization Struggle, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 2019, Minneapolis, USA.

Abstract: Users may have a variety of tasks that give rise to issuing a particular query. The goal of the Tasks Track at TREC 2015 was to identify all aspects or subtasks of a user's task as well as the documents relevant to the entire task. This was broken into two parts: (1) Task Understanding, which judged the relevance of key phrases or queries to the original query (relative to a likely task that would have given rise to both); (2) Task Completion, which performed document retrieval and measured usefulness to any task a user with the query might be performing, through either a completion measure that uses both relevance and usefulness criteria or, more simply, through an ad hoc retrieval measure of relevance alone. We submitted a run in the Task Understanding track. In particular, since the anchor text graph has proven useful in the general realm of query reformulation [2], we sought to quantify the value of extracting key phrases from anchor text in the broader setting of the task understanding track.

The task of event extraction has long been investigated in a supervised learning paradigm, which is bound by the number and the quality of the training instances. Existing training data must be manually generated through a combination of expert domain knowledge and extensive human involvement. However, due to the significant effort required to annotate text, the resultant datasets are usually small, which severely affects the quality of the learned model, making it hard to generalize. Our work develops an automatic approach for generating training data for event extraction. Our approach allows us to scale up event extraction training instances from thousands to hundreds of thousands, and it does this at a much lower cost than a manual approach. We achieve this by employing distant supervision to automatically create event annotations from unlabelled text using existing structured knowledge bases or tables. We then develop a neural network model with post inference to transfer the knowledge extracted from structured knowledge bases to automatically annotate typed events with corresponding arguments in text. We evaluate our approach by using the knowledge extracted from Freebase to label texts from Wikipedia articles. Experimental results show that our approach can generate a large number of high quality training instances. We show that this large volume of training data not only leads to a better event extractor, but also allows us to detect multiple typed events.
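The distant-supervision step can be sketched in a few lines (a toy illustration only, not the paper's pipeline; the fact and sentences below are invented, and real systems add entity linking and noise filtering on top of this naive substring match):

```python
def distant_label(sentences, kb_facts):
    """Label a sentence with an event type when it mentions all
    arguments of a knowledge-base fact (distant supervision)."""
    labeled = []
    for sent in sentences:
        for event_type, args in kb_facts:
            # Naive heuristic: every argument string occurs in the sentence.
            if all(a in sent for a in args):
                labeled.append((sent, event_type, args))
    return labeled
```

Each labeled triple then serves as a (noisy) training instance for the event extractor.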

We introduce the Semantic Scholar Graph of References in Context (GORC), a large contextual citation graph of 81.1M academic publications, including parsed full text for 8.1M open access papers, across broad domains of science. Each paper is represented with rich paper metadata (title, authors, abstract, etc.), and where available: cleaned full text, section headers, figure and table captions, and parsed bibliography entries. In-line citation mentions in full text are linked to their corresponding bibliography entries, which are in turn linked to in-corpus cited papers, forming the edges of a contextual citation graph. To our knowledge, this is the largest publicly available contextual citation graph; the full text alone is the largest parsed academic text corpus publicly available. We demonstrate the ability to identify similar papers using these citation contexts and propose several applications for language modeling and citation-related tasks.

In the era of precision medicine, the clinical utility of next generation sequencing technology highly depends on the ability to interpret the causal association between genetic variants and phenotypes, which can be a labor intensive process. There are various resources available for cataloging such associations, such as HGMD or ClinVar. Given the exponential growth of literature in the field, it is desirable to accelerate the process by automatically identifying genetic causality statements from the literature. Here, we define the task of identifying such statements as a classification task for sentences containing gene and disease entities. We used the cancer gene census available at the Catalogue of Somatic Mutations in Cancer (COSMIC) to generate a weakly labeled data set for our classification task. We evaluated multiple feature sets (words, bigrams, word embeddings) and several machine-learning methods, and achieved a weighted F-measure of around 95%. Evaluation on the top 50 genetic variant disease sentences demonstrated that the proposed method can identify genetic causality statements.

The automatic identification of discourse relations is still a challenging task in natural language processing. Discourse connectives, such as "since" or "but", are the most informative cues for identifying explicit relations; however, discourse parsers typically use a closed inventory of such connectives. As a result, discourse relations signaled by markers outside these inventories (i.e. AltLexes) are not detected as effectively. In this paper, we propose a novel method that leverages parallel corpora in text simplification and lexical resources to automatically identify alternative lexicalizations that signal discourse relations. When applied to the Simple Wikipedia and Newsela corpora along with WordNet and the PPDB, the method allowed the automatic discovery of 91 AltLexes.

Motivation and Objectives. Biomedical terminologies play important roles in clinical data capture, annotation, reporting, information integration, indexing and retrieval. More particularly, genomic terminologies and ontologies are very useful for indexing genomic information. Several sources of information and terminologies have already been developed. For instance, the Gene Ontology (GO; last accessed on July 17, 2012), a controlled vocabulary widely used for the annotation of gene products; the Human Phenotype Ontology (HPO; last accessed on July 17, 2012), in which terms describe phenotypic abnormalities encountered in human disease, such as "atrial septal defect"; and ORPHANET (last accessed on July 17, 2012), the portal for rare diseases and orphan drugs. These knowledge sources have mostly different formats and purposes. For example, ORPHANET is a rare disease database whereas HPO is an ontology which supports the description of phenotypic information. Faced with this reality and the need to allow cooperation between various health actors and their related health information systems, it appeared necessary to link these terminologies by developing a semantic repository to integrate them. The best known repository is the Unified Medical Language System (UMLS) (Lindberg et al., 1993). Several works were based on the UMLS to align terminologies in French (Merabti et al., 2012) and in English (Bodenreider et al., 1998; Milicic Brandt et al., 2011; Mougin et al., 2011). However, HPO and ORPHANET are not yet included in the UMLS. Thus, another solution is to find correspondences between these terminologies in French and in English using automatic methods. In (Merabti et al., 2012) we proposed a lexical method to map biomedical terminologies whether or not they are included in the UMLS. Nevertheless, these methods remain very dependent on the languages of the terminologies, since they use NLP tools such as stemming or normalization.
We propose in this study a string-based method to find correspondences between a subset of terminologies for easier access to biomedical information. It is based on the combination of several string metrics and is neither based on the UMLS nor language dependent. Combined with the lexical or conceptual approaches developed in previous studies (Merabti et al., 2012), it could improve the number of correspondences between terminologies with high precision. Semantic methods are also envisaged to complete this study. Methods. To map biomedical terminologies, we used string matching methods in which concept names, terms and their labels are considered as sequences of characters. A string distance is computed to determine a similarity degree. Some of these methods can ignore the order of characters. In this paper, the union of three metrics was used: (i) Dice (Dice, 1945), (ii) Levenshtein (Levenshtein, 1965) and (iii) Stoilos (Stoilos et al., 2005). Dice's coefficient calculates the ratio between the number of character bigrams common to both strings x and y and the total number of bigrams of the two strings, where nb-big(x) is the number of bigrams of x: Dice(x, y) = 2 × |big(x) ∩ big(y)| / (nb-big(x) + nb-big(y)). The Levenshtein distance between two strings x and y is defined as the minimum number of elementary operations required to pass from string x to string y. There are three possible operations: replacing a character with another, deleting a character and adding a character. This measure takes its values in the interval [0, ∞[. The Normalized Levenshtein (Yujian and Bo, 2007) (LevNorm), in the range [0, 1], is obtained by dividing the Levenshtein distance Lev(x, y) by the size of the longest string: LevNorm(x, y) = Lev(x, y) / max(|x|, |y|). LevNorm(x, y) is an element of [0, 1] since Lev(x, y) ≤ max(|x|, |y|). Results: the exact correspondences found between the terminologies reached very high precision (> 99%). Aligning genomic terminologies also provided good results with high precision.
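Two of the three metrics (Dice on character bigrams and the normalized Levenshtein distance) can be sketched directly from the definitions above; this is an illustrative implementation, not the authors' code, and the Stoilos metric is omitted:

```python
def bigrams(s):
    """Character bigrams of a string."""
    return [s[i:i + 2] for i in range(len(s) - 1)]

def dice(x, y):
    """Dice coefficient: 2 * |common bigrams| / total number of bigrams."""
    bx, by = bigrams(x), bigrams(y)
    common, pool = 0, list(by)
    for b in bx:
        if b in pool:
            pool.remove(b)
            common += 1
    return 2.0 * common / (len(bx) + len(by)) if (bx or by) else 1.0

def lev(x, y):
    """Levenshtein edit distance (insert, delete, substitute)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cx != cy)))
        prev = cur
    return prev[len(y)]

def lev_norm(x, y):
    """Normalized Levenshtein distance in [0, 1]."""
    m = max(len(x), len(y))
    return lev(x, y) / m if m else 0.0
```

For identical strings, Dice is 1 and LevNorm is 0; a combination (e.g. the union used in the study) can then threshold these scores to propose candidate correspondences.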
However, we evaluated here only "exact" correspondences and rated them as "correct" or "not correct". Indeed, correspondences such as "broader-narrower" or "sibling" relations between terms were not considered. For example, when a correspondence is found between two terms where one string is included in the other, the shorter term is in most cases more general than the longer one, and a "broader-narrower" correspondence could exist (for example, between the term "insuffisance surrenale" (adrenal insufficiency) and terms such as "insuffisance surrenale aigue" (acute adrenal insufficiency) or "insuffisance surrenale primaire" (primary adrenal insufficiency)). These preliminary good results encouraged us to apply the combination of these string matching methods to other health terminologies. The correspondences found between two terminologies in their French versions may be projected onto their versions in other languages. As perspectives of this study, these methods will be completed with normalization techniques, and the validation of the correspondences, performed manually here, will be done according to the UMLS semantic types for the terminologies included in it, as in (Mougin et al., 2011). References: Bodenreider O, Nelson SJ, et al. (1998) Beyond synonymy: exploiting the UMLS semantics in mapping vocabularies. In Proc. AMIA Symp. 1998, pp. 815–819. Dice LR (1945) Measures of the amount of ecologic association between species. Ecology 26, pp. 297–302. Levenshtein VI (1965) Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10, pp. 707–710. Lindberg DA, Humphreys BL, et al. (1993) The Unified Medical Language System. Methods Inf Med 32(4): 281–291. Merabti T, Soualmia LF, et al. (2012) Aligning Biomedical Terminologies in French: Towards Semantic Interoperability in Medical Applications.
In Medical Informatics, InTech, pp. 41–68. Milicic Brandt M, Rath A, et al. (2011) Mapping Orphanet terminology to UMLS. In Proc. AIME, LNAI 6747, pp. 194–203. Mougin F, Dupuch M, et al. (2011) Improving the mapping between MedDRA and SNOMED CT. In Proc. AIME, LNAI 6747, pp. 220–224. Stoilos G, Stamou G, et al. (2005) A String Metric for Ontology Alignment. In Proc. ISWC, pp. 624–637. Winkler W (1999) The state of record linkage and current research problems. Technical report, Statistics of Income Division, Internal Revenue Service Publication. Yujian L, Bo L (2007) A normalized Levenshtein distance metric. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(6): 1091–1095. Note: figures, tables and equations are available in the PDF version only.

Semantic similarity between words aims at establishing resemblance by interpreting the meaning of the words being compared. The Semantic Web can benefit from semantic similarity in several ways: ontology alignment and merging, automatic ontology construction, semantic-search, to cite a few. Current approaches mostly focus on computing similarity between nouns. The aim of this paper is to define a framework to compute semantic similarity even for other grammar categories such as verbs, adverbs and adjectives. The framework has been implemented on top of WordNet. Extensive experiments confirmed the suitability of this approach in the task of solving English tests.

Sentiment analysis, or opinion mining, is an application of Natural Language Processing (NLP) that aims to facilitate human-computer communication in natural language. To simplify the process of understanding human language, three important stages must be carried out by a computer: tokenizing, stemming and filtering. Tokenizing, which breaks the sentence down into single words, makes the computer treat all words (tokens) alike. If a phrase contains an unimportant word that happens to be in the stoplist, the phrase will be deleted. The solution to this problem is tokenizing based on phrase detection using a Hidden Markov Model (HMM) POS-tagger, to improve classification performance with a Support Vector Machine (SVM). With this approach, the computer is able to distinguish a phrase from other tokens and store the phrase as a single entity. Phrase detection increases classification accuracy by approximately 6% on Dataset I and 3% on Dataset II, due to the reduction of missing features that usually occurs in the filtering process. In addition, the phrase-detection approach also produces the most optimal classification model, with an ROC value reaching 0.897.
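As a rough illustration of phrase-based tokenization (not the paper's HMM tagger; the Penn-style tags and the adjective-plus-noun pattern here are simplifying assumptions), adjacent tagged tokens can be merged into single phrase tokens:

```python
def merge_phrases(tagged):
    """Merge adjacent tokens matching a simple phrase pattern
    (one or more adjectives followed by a noun) into one token."""
    out, i = [], 0
    while i < len(tagged):
        j = i
        while j < len(tagged) and tagged[j][1] == "JJ":
            j += 1
        if j > i and j < len(tagged) and tagged[j][1] == "NN":
            # Join the adjective run and the noun into a single entity.
            out.append("_".join(w for w, _ in tagged[i:j + 1]))
            i = j + 1
        else:
            out.append(tagged[i][0])
            i += 1
    return out
```

Storing "poor_life" as one token keeps the sentiment-bearing phrase intact even if a stoplist would have removed one of its parts.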

We introduce the challenge of detecting semantically compatible words, that is, words that can potentially refer to the same thing (cat and hindrance are compatible, cat and dog are not), arguing for its central role in many semantic tasks. We present a publicly available data-set of human compatibility ratings, and a neural-network model that takes distributional embeddings of words as input and learns alternative embeddings that perform the compatibility detection task quite well.

Wikipedia plays a central role on the web as one of the biggest knowledge sources due to its large coverage of information from various domains. However, due to the enormous number of pages and the limited number of contributors to maintain them, the problem of missing information among Wikipedia articles has emerged, especially between articles in multiple language versions. Several approaches have been studied to fix the information gap between cross-language Wikipedia articles. However, they can only be applied to languages that come from the same root. In this paper, we propose an approach to generate new information for Wikipedia infoboxes written in different languages with different roots by utilizing the existing DBpedia mappings. We combined mapping information from DBpedia with an instance-based method to align the existing Korean-English infobox attribute-value pairs as well as to generate new pairs from the Korean version to fill missing information in the English version. The results showed that we could expand up to 38% of the existing English Wikipedia attribute-value pairs from our datasets with 61% accuracy.
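The instance-based part of such an alignment can be sketched as follows (a toy illustration, not the paper's method; the infobox keys and values are hypothetical, and a real alignment aggregates evidence over many entity pairs rather than a single one):

```python
def align_attributes(infobox_a, infobox_b):
    """Instance-based alignment: attributes in two language versions of
    the same entity's infobox that share the same value are likely to
    correspond to each other."""
    aligned = []
    for attr_a, val_a in infobox_a.items():
        for attr_b, val_b in infobox_b.items():
            if val_a == val_b:
                aligned.append((attr_a, attr_b))
    return aligned
```

Once aligned, a pair present only in the source-language infobox can be copied to the target-language one using the mapped attribute name.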

In this paper we introduce a new speech recognition system, leveraging a simple letter-based ConvNet acoustic model. The acoustic model requires only the audio transcription for training: no alignment annotations are needed, nor any forced alignment step. At inference, our decoder takes only a word list and a language model, and is fed with letter scores from the acoustic model; no phonetic word lexicon is needed. Key ingredients for the acoustic model are Gated Linear Units and high dropout. We show near state-of-the-art results in word error rate on the LibriSpeech corpus using log-mel filterbanks, both on the "clean" and "other" configurations.

Semantic Question Answering (SQA) is concerned with natural language processing. The purpose of this study is to help users access information through natural language and to obtain concise, relevant answers. A review of current studies shows that this processing still encounters problems of flexibility and accuracy, particularly in question processing, which is a crucial step in developing question answering systems. Thus, this study proposes a semantic approach to question answering using DBpedia and WordNet. The proposed techniques consist of (1) extracting named entities from the question and resolving similarities between named entities, (2) extracting properties from the question and resolving similarities between properties, and (3) evaluating the accuracy of the answers. The approach was evaluated on a test dataset from the TREC question collections and DBpedia, and achieved an F-measure of 93.43%, an average precision of 92.73%, and an average recall of 94.15% over 500 questions.

The use of labels of semantic properties like 'concreteness' is quite common in studies in syntax, but their exact meaning is often unclear. In this article, we compare different definitions of concreteness, and use them in different implementations to annotate nouns in two data sets: (1) all nouns with word sense annotations in the SemCor corpus, and (2) nouns in a particular lexico-syntactic context, viz. the theme (e.g. a book) in prepositional dative (gave a book to him) and double object (gave him a book) constructions. The results show that the definitions and implementations used in different approaches differ greatly, and can considerably affect the conclusions drawn in syntactic research. A follow-up crowdsourcing experiment showed that there are instances that are clearly concrete or

The most effective paradigm for word sense disambiguation, supervised learning, seems to be stuck because of the knowledge acquisition bottleneck. In this paper we present an in-depth study of the performance of decision lists on two publicly available corpora and an additional corpus automatically acquired from the Web, using the fine-grained highly polysemous senses in WordNet. Decision lists are shown to be a versatile state-of-the-art technique. The experiments reveal, among other facts, that SemCor can be an acceptable (0.7 precision for polysemous words) starting point for an all-words system. The results on the DSO corpus show that for some highly polysemous words 0.7 precision seems to be the current state-of-the-art limit. On the other hand, independently constructed hand-tagged corpora are not mutually useful, and a corpus automatically acquired from the Web is shown to fail.
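A decision list of the kind evaluated above can be sketched in a few lines (an illustrative Yarowsky-style toy, not the authors' implementation; the context features and sense labels below are invented):

```python
import math
from collections import defaultdict

def build_decision_list(examples, smoothing=0.1):
    """Rank (context feature -> sense) rules by a smoothed
    log-likelihood ratio, strongest evidence first."""
    counts = defaultdict(lambda: defaultdict(int))
    for features, sense in examples:
        for f in features:
            counts[f][sense] += 1
    rules = []
    for f, per_sense in counts.items():
        total = sum(per_sense.values())
        for sense, c in per_sense.items():
            # Evidence of feature f for this sense vs. all other senses.
            score = math.log((c + smoothing) / (total - c + smoothing))
            rules.append((score, f, sense))
    rules.sort(reverse=True)
    return rules

def classify(rules, features, default):
    """Apply the single highest-ranked rule whose feature is present."""
    for score, f, sense in rules:
        if f in features:
            return sense
    return default
```

Because only the top matching rule fires, the list stays interpretable: each decision can be traced back to one piece of contextual evidence.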

In this paper, we train a semantic parser that scales up to Freebase. Instead of relying on annotated logical forms, which are especially expensive to obtain at large scale, we learn from question-answer pairs. The main challenge in this setting is narrowing down the huge number of possible logical predicates for a given question. We tackle this problem in two ways: First, we build a coarse mapping from phrases to predicates using a knowledge base and a large text corpus. Second, we use a bridging operation to generate additional predicates based on neighboring predicates. On the dataset of Cai and Yates (2013), despite not having annotated logical forms, our system outperforms their state-of-the-art parser. Additionally, we collected a more realistic and challenging dataset of question-answer pairs, on which our system improves over a natural baseline.

Given the increasing need to process massive amounts of textual data, efficiency of NLP tools is becoming a pressing concern. Parsers based on lexicalised grammar formalisms, such as TAG and CCG, can be made more efficient using supertagging, which for CCG is so effective that every derivation consistent with the supertagger output can be stored in a packed chart. However, wide-coverage CCG parsers still produce a very large number of derivations for typical newspaper or Wikipedia sentences. In this paper we investigate two forms of chart pruning, and develop a novel method for pruning complete cells in a parse chart. The result is a wide-coverage CCG parser that can process almost 100 sentences per second, with little or no loss in accuracy over the baseline with no pruning.

We introduce a novel machine learning framework based on recursive autoencoders for sentence-level prediction of sentiment label distributions. Our method learns vector space representations for multi-word phrases. In sentiment prediction tasks these representations outperform other state-of-the-art approaches on commonly used datasets, such as movie reviews, without using any pre-defined sentiment lexica or polarity shifting rules. We also evaluate the model's ability to predict sentiment distributions on a new dataset based on confessions from the experience project. The dataset consists of personal user stories annotated with multiple labels which, when aggregated, form a multinomial distribution that captures emotional reactions. Our algorithm can more accurately predict distributions over such labels compared to several competitive baselines.

Over the past decade, large-scale supervised learning corpora have enabled machine learning researchers to make substantial advances. However, to date, there are no large-scale question-answer corpora available. In this paper we present the 30M Factoid Question-Answer Corpus, an enormous question-answer pair corpus produced by applying a novel neural network architecture to the knowledge base Freebase to transduce facts into natural language questions. The produced question-answer pairs are evaluated both by human evaluators and using automatic evaluation metrics, including well-established machine translation and sentence similarity metrics. Across all evaluation criteria the question-generation model outperforms the competing template-based baseline. Furthermore, when presented to human evaluators, the generated questions appear to be comparable in quality to real human-generated questions.

Recent advances in deep visual attention methods have greatly accelerated research on image captioning. However, how to leverage hand-crafted features or deep features for the encoder of image captioning is not fully explored, due to the difficulty of finding all-purpose features that capture the full set of visual semantics. In this paper, we introduce a cascade semantic fusion architecture (CSF) to mine representative features for encoding image content through an attention mechanism, without bells and whistles. Specifically, the CSF benefits from three types of visual attention semantics, including object-level, image-level, and spatial attention features, in a novel three-stage cascade manner. In the first stage, object-level attention features are extracted to capture the detailed contents of the objects based on a pretrained detector. Then, the middle stage devises a fusion module to merge object-level attention features with spatial features, thereby inducing image-level attention features to enrich the context information around the objects. In the last stage, spatial attention features are learned to unveil the salient region representation as a complement to the two previously learned attention features. In a nutshell, we integrate the attention mechanism with three types of features to organize context knowledge about images from different aspects. The empirical analysis shows that the CSF can assist the image captioning model in selecting the object regions of interest. The experiments of image captioning on the MSCOCO dataset show the efficacy of our semantic fusion architecture in depicting image content.

Most work on tweet sentiment analysis is mono-lingual and the models that are generated by machine learning strategies do not generalize across multiple languages. Cross-language sentiment analysis is usually performed through machine translation approaches that translate a given source language into the target language of choice. Machine translation is expensive, and the results provided by these strategies are limited by the quality of the translation. In this paper, we propose a language-agnostic translation-free method for Twitter sentiment analysis, which makes use of deep convolutional neural networks with character-level embeddings for pointing to the proper polarity of tweets that may be written in distinct (or multiple) languages. The proposed method is more accurate than several other deep neural architectures while requiring substantially fewer learnable parameters. The resulting model is capable of learning latent features from all languages that are employed during the training process in a straightforward fashion and it does not require any translation process to be performed whatsoever. We empirically evaluate the efficiency and effectiveness of the proposed approach on corpora based on tweets from four different languages, showing that our approach comfortably outperforms the baselines. Moreover, we visualize the knowledge that is learned by our method to qualitatively validate its effectiveness for tweet sentiment classification.

Quantifiers are a linguistic concept that mirrors quantity in reality. They indicate 'how many' or 'how much', for example, the number of entities denoted by a noun, the count of actions or events, the length of time, and the distance in space. All human languages have linguistic devices that express such ideas, though the encoding of natural language semantics can vary from language to language. This paper compares quantifying constructions in English and Chinese on the basis of comparable corpora of spoken and written data in the two languages. We will focus on classifiers in Chinese and their counterparts in English, as well as the interaction between quantifying constructions and progressives, which is normally ruled out by aspect theory, with the aim of addressing the following research questions: (1) What linguistic devices are used in Chinese and English for quantification? (2) How different (or similar) are classifiers in Chinese as a classifier language and in English as a non-classifier language? (3) Can quantifiers interact with progressives in English and Chinese if such interactions are theoretically ruled out by aspect theory? Before these research questions are explored in detail, it is appropriate to first present the principal data used in this study, which includes two written corpora and two spoken corpora. The Freiburg-LOB (FLOB) corpus is a recent update of LOB, which is composed of approximately one million tokens of written British English sampled proportionally from fifteen text categories published in the early 1990s (Hundt et al. 1998). The Lancaster Corpus of Mandarin Chinese (LCMC) was designed as a Chinese match for FLOB and created using the same sampling criteria, representing written Mandarin Chinese published in China in the corresponding sampling period (McEnery et al. 2003). The two spoken corpora are BNCdemo and CallHome Mandarin.
BNCdemo is the demographically sampled component of the British National Corpus (BNC), which contains four million tokens of transcripts of conversations recorded around the early 1990s. The CallHome Mandarin Transcripts corpus, which was released by the LDC, comprises 120 transcripts of 5-to-10-minute telephone conversations recorded in the first half of the 1990s between native Chinese speakers living overseas and their families in China, amounting to approximately 300,000 tokens. While telephone calls differ from face-to-face conversations along some dimensions (Biber 1988), the sampling periods of the two spoken corpora are roughly comparable. A practical reason for using the CallHome corpus is that it is the closest match to BNCdemo that is available to us. In the remaining sections of this article, we will first explore classifiers in Chinese and English, on the basis of which the two will be compared. We will then discuss the interaction of the progressive with quantifying constructions in the two languages.

According to Biber (1993) and Biber and Conrad (2009), a multi-dimensional approach based on several bipolar features can classify the genres found in texts. The problem is that the process spanning feature extraction and statistical model building must be applied exhaustively when the machine learning system depends on frequency-based language models. Following the machine learning approach of Kanaris and Stamatatos (2007), nearly 8,000 frequency-based models are used to induce genre distinctions; if the process as a whole is unsuccessful, the classification has to be exhaustively redone. To address this problem, we used a word embedding language model that handles feature extraction and model building at the same time. Among several word embedding models, our research is based on the Doc2Vec model, which captures paragraph-level features of texts. We used the genre distinctions from Project Gutenberg, dividing the text databases into three parts to reveal the distributional characteristics of linguistic features. Our method for detecting text genre is convenient in process as well as accurate in capturing the feature distribution of text genres.
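The classification step can be illustrated with a stdlib stand-in: in place of a trained Doc2Vec model (typically gensim's `Doc2Vec`), plain term-frequency vectors and cosine similarity show the nearest-genre-centroid decision; the genre centroids and vocabulary below are purely hypothetical:

```python
from collections import Counter
import math

# Stdlib stand-in for the Doc2Vec pipeline: term-frequency vectors
# replace learned paragraph vectors, purely for illustration.
def tf_vector(text):
    return Counter(text.lower().split())

def cosine(a, b):
    shared = set(a) & set(b)
    num = sum(a[w] * b[w] for w in shared)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

# Hypothetical genre centroids built from labeled Gutenberg texts.
genre_centroids = {
    "poetry":  tf_vector("ode verse rhyme stanza love moon"),
    "science": tf_vector("experiment theory measurement data result"),
}

def classify_genre(text):
    # Assign the genre whose centroid is most similar to the text.
    return max(genre_centroids, key=lambda g: cosine(tf_vector(text), genre_centroids[g]))

print(classify_genre("the experiment produced new data"))  # science
```

With Doc2Vec, `tf_vector` would be replaced by `model.infer_vector(tokens)`, and the centroids by averaged paragraph vectors per genre.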

Question generation from a knowledge base (KB) is the task of generating questions related to the domain of the input KB. We propose a system for generating fluent and natural questions from a KB, which significantly reduces human effort by leveraging massive web resources. In more detail, a seed question set is first generated by applying a small number of hand-crafted templates to the input KB; more questions are then retrieved by iteratively submitting already obtained questions as search queries to a standard search engine; finally, questions are selected by estimating their fluency and domain relevance. Evaluated by human graders on 500 randomly selected triples from Freebase, questions generated by our system are judged to be more fluent than those of Serban et al. (2016).

Identification of the named entity (NE) class (semantic class) is crucial for NLP problems like coreference resolution, where semantic compatibility between entity mentions is imperative for coreference decisions. Short and noisy text containing the entity makes it challenging to extract the NE class from the context. We introduce a framework for named entity class identification for a given entity, using the web, when the entity boundaries are known. The proposed framework will be beneficial for specialized domains where data and class-label challenges exist. We demonstrate the benefit of our framework through a case study of Indian classical music forums. Apart from person and location, which are included in the standard semantic classes, here we also consider raga, song, instrument, and music concept. Our baseline approach follows a heuristic-based method making use of Freebase, a structured web repository. The search-engine-based approaches acquire context from the web for an entity and perform named entity class identification. This approach shows improvement over the baseline performance, and it is further improved by the hierarchical classification we introduce. In summary, our framework is a first-of-its-kind validation of the viability of the web for NE class identification.

We start from a Web-oriented system for evaluating, presenting, processing, enlarging and annotating corpora of translations, previously applied to a real MT evaluation task involving classical subjective measures, objective n-gram-based scores, and objective post-edition-based task-related evaluation. We describe its recent extension to support the high-quality translation into French of the large on-line Encyclopedia of Life Support Systems (EOLSS), presented as documents each made of a Web page and a companion UNL file, by applying contributive on-line human post-edition to the results of machine translation systems and of UNL deconverters. Target language Web pages are generated on the fly from source language ones, using the best target segments available in the database. 25 documents (about 220,000 words) of the EOLSS are now available in French, Spanish, Russian, Arabic and Japanese. MT followed by contributive, incremental, cheap or free post-edition has now proven to be a viable way of making difficult information available in many languages.

Exploiting relationships among objects has achieved remarkable progress in interpreting images or videos in natural language. Most existing methods resort to first detecting objects and their relationships and then generating textual descriptions, which depends heavily on pre-trained detectors and suffers performance drops in the face of heavy occlusion, tiny objects, and long-tail categories in object detection. In addition, the separate procedure of detecting and captioning results in semantic inconsistency between the pre-defined object/relation categories and the target lexical words. We exploit prior human commonsense knowledge to reason about relationships between objects without any pre-trained detectors and to reach semantic coherency within one image or video in captioning. The prior knowledge (e.g., in the form of a knowledge graph) provides commonsense semantic correlations and constraints between objects that are not explicit in the image or video, serving as useful guidance for building a semantic graph for sentence generation. In particular, we present a joint reasoning method that incorporates 1) commonsense reasoning for embedding image or video regions into a semantic space to build a semantic graph and 2) relational reasoning for encoding the semantic graph to generate sentences. Extensive experiments on the MS-COCO image captioning benchmark and the MSVD video captioning benchmark validate the superiority of our method in leveraging prior commonsense knowledge to enhance relational reasoning for visual captioning.

In order to reduce the 'semantic gap', known as the mismatch between the limited expressiveness of low-level features automatically extractable from raw data and the high-level human perception of content and similarity, the inclusion of semantic knowledge into advanced content-based information retrieval (CBIR) systems has become indispensable. One widespread approach to overcoming the gap is the manual or automatic assignment of annotations describing multimedia objects, which classifies the data into semantic categories and thus facilitates textual or conceptual queries. Although the manual approach removes the uncertainty of fully automatic annotation, it requires high effort in return. Hence, an interactive process combining automatic computation and semantic modeling would provide a significant improvement by eliminating the disadvantages of both approaches. For this purpose, we present several concepts and architectures developed specifically for CBIR systems to attenuate the different manifestations of the semantic gap.
First, we introduce our framework for the semi-automatic annotation of multimedia data, which is based on the automatic extraction of elementary low-level features, the user's relevance feedback, and the use of ontology knowledge. Further aspects of this work address the problems encountered during the annotation process, such as the multiple levels of abstraction at which annotations are assigned, the incompleteness of annotation data, and the subjectivity that varies between the users of a system. To solve these problems, we introduce the Annotation Analysis Framework, which transfers annotations into a graph-based representation, encoding their complex structure, making them traceable for the user, and making them understandable to the machine through the provided inference functions.
In order to incorporate user diversity, which might negatively influence the retrieval behavior of an IR system, methods for understanding and interpreting the users' subjective perception are needed. Building on our annotation/retrieval framework, we present the GLENARVAN component, which is responsible for context computation, the comparison of annotation ontologies, and query expansion according to user profiles. Two aspects are considered here: first, user diversity is modeled as a set of user profiles and the corresponding annotation ontologies, which are used to extract contextual information and thus attenuate the users' subjectivity. The second issue is how to achieve satisfactory retrieval results despite different views on identical data collections. As a solution, a query expansion algorithm is presented that uses the subjective annotations to discover mappings between the system ontology and the vocabulary used by the user, and thus supplies additional parameters for a query adapted to the respective user. Finally, we propose a Pseudo Relevance Feedback method for image data, which performs query reformulation based on the user's feedback activities. A particular aspect of this method is that the involved functions, such as result judgments, relevance computation, and reordering of the result set, are implemented as user-defined functions (UDFs), making the method highly suitable for integration into existing web retrieval applications.

Deep learning models (DLMs) are state-of-the-art techniques in speech recognition. However, training good DLMs can be time consuming, especially for production-size models and corpora. Although several parallel training algorithms have been proposed to improve training efficiency, there is no clear guidance on which one to choose for the task at hand due to the lack of a systematic and fair comparison among them. In this paper we aim at filling this gap by comparing four popular parallel training algorithms in speech recognition, namely asynchronous stochastic gradient descent (ASGD), blockwise model-update filtering (BMUF), bulk synchronous parallel (BSP) and elastic averaging stochastic gradient descent (EASGD), on the 1000-hour LibriSpeech corpus using feed-forward deep neural networks (DNNs) and convolutional, long short-term memory DNNs (CLDNNs). Based on our experiments, we recommend BMUF as the top choice for training acoustic models since it is the most stable, scales well with the number of GPUs, can achieve reproducible results, and in many cases even outperforms single-GPU SGD. ASGD can be used as a substitute in some cases.
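BMUF's synchronization step can be sketched as follows (a simplified scalar-weight version of blockwise model-update filtering as commonly formulated; the hyper-parameter names and values are illustrative, not those used in the paper):

```python
# Minimal sketch of one BMUF synchronization step (flat weight lists for
# clarity; hyper-parameters are illustrative).
def bmuf_step(w_global, worker_weights, delta_prev, block_momentum=0.9, block_lr=1.0):
    """Aggregate worker models, then filter the block-level update."""
    # Average the models produced by the parallel workers for this block.
    w_avg = [sum(ws) / len(ws) for ws in zip(*worker_weights)]
    # Model-update residual of this block relative to the global model.
    g = [wa - wg for wa, wg in zip(w_avg, w_global)]
    # Filtered update: momentum on the previous block delta plus the residual.
    delta = [block_momentum * d + block_lr * gi for d, gi in zip(delta_prev, g)]
    w_new = [wg + d for wg, d in zip(w_global, delta)]
    return w_new, delta

w, delta = bmuf_step([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
print(w)  # [2.0, 3.0]
```

The block-level momentum is what distinguishes BMUF from plain model averaging (BSP): it smooths the per-block updates and is credited with the method's stability as the number of GPUs grows.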

Generating a novel textual description of an image is an interesting problem that connects computer vision and natural language processing. In this paper, we present a simple model that is able to generate descriptive sentences given a sample image. This model has a strong focus on the syntax of the descriptions. We train a purely bilinear model that learns a metric between an image representation (generated from a previously trained Convolutional Neural Network) and the phrases that are used to describe them. The system is then able to infer phrases from a given image sample. Based on caption syntax statistics, we propose a simple language model that can produce relevant descriptions for a given test image using the inferred phrases. Our approach, which is considerably simpler than state-of-the-art models, achieves comparable results on the recently released Microsoft COCO dataset.
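The bilinear metric at the heart of such a model reduces to a score of the form s(x, y) = xᵀWy between an image feature x and a phrase embedding y; a toy sketch (dimensions, feature values, and the matrix W are illustrative stand-ins for the learned quantities):

```python
# Toy sketch of a bilinear image-phrase score s(x, y) = x^T W y
# (values are illustrative only).
def bilinear_score(x, W, y):
    return sum(x[i] * W[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

image_feat = [1.0, 0.5]            # stands in for a ConvNet feature vector
phrase_emb = [0.2, 0.8, 0.0]       # stands in for a phrase embedding
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]              # the learned metric in the real model

print(round(bilinear_score(image_feat, W, phrase_emb), 6))  # 0.6
```

At inference time, phrases are ranked by this score for a given image, and the language model assembles the top-ranked phrases into a caption.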

Multi-hop reading comprehension requires the model to explore and connect relevant information from multiple sentences/documents in order to answer the question about the context. To achieve this, we propose an interpretable 3-module system called Explore-Propose-Assemble reader (EPAr). First, the Document Explorer iteratively selects relevant documents and represents divergent reasoning chains in a tree structure so as to allow assimilating information from all chains. The Answer Proposer then proposes an answer from every root-to-leaf path in the reasoning tree. Finally, the Evidence Assembler extracts a key sentence containing the proposed answer from every path and combines them to predict the final answer. Intuitively, EPAr approximates the coarse-to-fine-grained comprehension behavior of human readers when facing multiple long documents. We jointly optimize our 3 modules by minimizing the sum of losses from each stage conditioned on the previous stage's output. On two multi-hop reading comprehension datasets WikiHop and MedHop, our EPAr model achieves significant improvements over the baseline and competitive results compared to the state-of-the-art model. We also present multiple reasoning-chain-recovery tests and ablation studies to demonstrate our system's ability to perform interpretable and accurate reasoning.

Using semantic technologies for mining and intelligent information access to microblogs is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Semantic annotation of tweets is typically performed in a pipeline, comprising successive stages of language identification, tokenisation, part-of-speech tagging, named entity recognition and entity disambiguation (e.g. with respect to DBpedia). Consequently, errors are cumulative, and earlier-stage problems can severely reduce the performance of final stages. This paper presents a characterisation of genre-specific problems at each semantic annotation stage and the impact on subsequent stages. Critically, we evaluate impact on two high-level semantic annotation tasks: named entity detection and disambiguation. Our results demonstrate the importance of making approaches specific to the genre, and indicate a diminishing returns effect that reduces the effectiveness of complex text normalisation.

The aim of the Semantic Web is to improve the access, management, and retrieval of information on the Web. On this understanding, ontologies are considered a technology that supports all the aforementioned tasks. However, current approaches to information retrieval over ontology-based knowledge bases are intended for experienced users. To address this gap, Natural Language Processing (NLP) is deemed a very intuitive approach from a non-experienced user's perspective, because the formality of the knowledge base is hidden, as is the executable query language. In this work, we present ONLI, a natural language interface for DBpedia, a community effort to structure Wikipedia's content based on an ontological approach. ONLI combines NLP techniques to analyze the user's question and populate an ontological model that describes the question's context. From this model, ONLI retrieves the answer through a set of heuristic SPARQL-based query patterns. Finally, we describe the current version of the ONLI system, as well as an evaluation assessing its effectiveness in finding the correct answer.
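A heuristic query pattern of the kind such a system might instantiate can be sketched as a parameterized SPARQL template (the pattern, entity, and property below are hypothetical examples, not taken from ONLI itself):

```python
# Illustrative SPARQL query pattern for a question like "Who wrote Hamlet?"
# (pattern and bindings are hypothetical, not from the actual system).
PATTERN = """SELECT ?answer WHERE {{
  <http://dbpedia.org/resource/{entity}> <http://dbpedia.org/ontology/{property}> ?answer .
}}"""

def build_query(entity, prop):
    """Fill the template with the entity and property detected by NLP."""
    return PATTERN.format(entity=entity, property=prop)

query = build_query("Hamlet", "author")
print("dbpedia.org/resource/Hamlet" in query)  # True
```

In a real pipeline the entity and property would come from the populated ontological model of the question, and the query would be posed to the DBpedia SPARQL endpoint.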

In this article we examine whether certain data-driven dictionaries can improve the user-friendliness and usability – in a broader sense – of lexicographic instruments used in language learning, as well as enhance positive features such as empowerment, discovery, learner autonomy and the correct identification of lexico-grammatical patterns. To that end, we present the qualitative and quantitative data of a small experiment with the online multilingual (English-Spanish) tool Linguee, comparing efficiency and satisfaction levels on a Spanish writing test with those of the same test in which one control group used only traditional dictionaries and another used those dictionaries in combination with Linguee. Based on our findings, we propose some further pedagogical and technological interventions in order to obtain even higher satisfaction and efficiency levels. An example of the result of such interventions can be found partially in the Interactive Language Toolbox.

The network model of linguistic meaning was developed within the framework of cognitive linguistics as a tool for visualising the semantic structure of polysemous units. The model is based on the notion that linguistic knowledge is grounded in categorisation, and that linguistic units are typically characterised by structured polysemy. We explore the potential usefulness of the network model in the structuring of word senses in a dictionary by transferring two medium-sized entries in the monolingual Norwegian dictionary Norsk Ordbok into semantic networks. Prototypicality and links between senses based on either extension or schematicity are made explicit in the two networks. We argue that the network model is a valuable tool for the dictionary editor faced with the task of identifying word senses and arranging them in a hierarchy.

Answer selection is a challenging task in natural language processing that requires both natural language understanding and world knowledge. At present, most recent methods draw on insights from attention mechanisms to learn the complex semantic relations between questions and answers. Previous remarkable approaches mainly apply the general Compare-Aggregate framework. In this paper, we propose a novel Compare-Aggregate framework with an embedding selector to solve the answer selection task. Unlike previous Compare-Aggregate methods, which use only one type of attention mechanism and do not exploit word vectors at different levels, we employ two types of attention mechanism in one model and add a selector layer to choose the best input for the aggregation layer. We evaluate the model on two answer selection tasks: WikiQA and TrecQA. On the two datasets, our approach outperforms several strong baselines and achieves state-of-the-art performance.

Knowledge encoded in semantic graphs such as Freebase has been shown to benefit semantic parsing and interpretation of natural language user utterances. In this paper, we propose new methods to assign weights to semantic graphs that reflect common usage types of the entities and their relations. Such statistical information can improve the disambiguation of entities in natural language utterances. Weights for entity types can be derived from the populated knowledge in the semantic graph, based on the frequency of occurrence of each type. They can also be learned from the usage frequencies in real world natural language text, such as related Wikipedia documents or user queries posed to a search engine. We compare the proposed methods with the unweighted version of the semantic knowledge graph for the relation detection task and show that all weighting methods result in better performance in comparison to using the unweighted version.
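Deriving type weights from frequency of occurrence can be sketched as simple normalized counts (the type names and counts below are toy values; real weights would come from the populated knowledge graph, Wikipedia documents, or search-engine query logs as the abstract describes):

```python
from collections import Counter

# Sketch of deriving entity-type weights from frequency of occurrence
# in a populated knowledge graph (toy counts, illustrative type names).
type_occurrences = Counter({
    "film.film": 60,
    "book.written_work": 30,
    "music.album": 10,
})

# Normalize counts into weights that sum to 1.
total = sum(type_occurrences.values())
type_weights = {t: c / total for t, c in type_occurrences.items()}

print(type_weights["film.film"])  # 0.6
```

For relation detection, an ambiguous entity mention would then favor the relation consistent with its highest-weighted type.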

The paper contributes to the research on automatic evaluation of surface coherence in student essays. We look into possibilities of using large unlabeled data to improve quality of such evaluation. Particularly, we propose two approaches to benefit from the large data: (i) n-gram language model, and (ii) density estimates of features used by the evaluation system. In our experiments, we integrate these approaches that exploit data from the Czech National Corpus into the evaluator of surface coherence for Czech, the EVALD system, and test its performance on two datasets: essays written by native speakers (L1) as well as foreign learners of Czech (L2). The system implementing these approaches together with other new features significantly outperforms the original EVALD system, especially on L1 with a large margin.
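The first approach can be sketched as an n-gram language model estimated from large unlabeled data (a bigram version with add-one smoothing; the toy word list stands in for the Czech National Corpus):

```python
from collections import Counter

# Sketch of an n-gram language-model feature (bigram, add-one smoothing)
# estimated from large unlabeled data; toy corpus for illustration.
corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = len(unigrams)

def bigram_prob(w1, w2):
    """Smoothed conditional probability P(w2 | w1)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)

# A frequent continuation scores higher than a rare one,
# which is the signal an essay-coherence evaluator can exploit.
print(bigram_prob("the", "cat") > bigram_prob("the", "ran"))  # True
```

Averaged over an essay, such probabilities give a fluency/coherence feature; the second approach in the abstract would similarly use the large corpus to estimate density functions for the evaluator's existing features.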

Acknowledgments 1. Introduction 1.1 Why Another Introduction to Corpus Linguistics? 1.2 Outline of the Book 1.3 Recommendation for Instructors 2. Three Central Corpus-linguistic Methods 2.1 Corpora 2.2 Frequency Lists 2.3 Lexical Co-occurrence: Collocations 2.4 (Lexico-)Grammatical Co-occurrence: Concordances 3. An Introduction to R 3.1 A few Central Notions: Data structures, Functions, and Arguments 3.2 Vectors 3.3 Factors 3.4 Data Frames 3.5 Lists 3.6 Elementary Programming Functions 3.7 Character/String Processing 3.8 File and Directory Operations 4. Using R in Corpus Linguistics 4.1 Frequency Lists 4.2 Concordances 4.3 Collocations 4.4 Excursus 1: Processing Multi-tiered Corpora 4.5 Excursus 2: Unicode 5. Some Statistics for Corpus Linguistics 5.1 Introduction to Statistical Thinking 5.2 Categorical Dependent Variables 5.3 Interval/Ratio Dependent Variables 5.4 Customizing Statistical Plots 5.5 Reporting Results 6. Case Studies and Pointers to Other Applications 6.1 Introduction to the Case Studies 6.2 Some Pointers to Further Applications Appendix References Endnotes Index

Although structured electronic health records are becoming more prevalent, much information about patient health is still recorded only in unstructured text. "Understanding" these texts has been a focus of natural language processing research for many years, with some remarkable successes. Knowing the drugs patients take is critical not only for understanding patient health (e.g., for drug-drug or drug-enzyme interactions), but also for secondary uses, such as research on treatment effectiveness. Several drug dictionaries have been curated, such as RxNorm or the FDA's Orange Book, with a focus on prescription drugs. Developing these dictionaries is a challenge, but even more challenging is keeping them up-to-date in the face of a rapidly advancing field. To discover new adverse drug interactions, a large number of patient histories often need to be examined, necessitating not only accurate but also fast algorithms to identify pharmacological substances. We propose a new algorithm, SPOT, which identifies drug names that can be used as new dictionary entries from a large corpus, where a "drug" is defined as a substance intended for use in the diagnosis, cure, mitigation, treatment, or prevention of disease. Measured against a manually annotated gold-standard corpus, we present precision and recall values for SPOT. SPOT is language and syntax independent, can be run efficiently to keep dictionaries up-to-date, and can also suggest words and phrases that may be misspellings or uncatalogued synonyms of a known drug. We show how SPOT's lack of reliance on NLP tools makes it robust in analyzing clinical medical text. SPOT is a generalized bootstrapping algorithm, seeded with a known dictionary, that automatically extracts the context within which each drug is mentioned. We define three features of such context: support, confidence and prevalence. We present the performance tradeoffs depending on the thresholds chosen for these features.
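The context statistics can be illustrated on a toy corpus (the sentences, context pattern, and seed list are invented for illustration, and prevalence is omitted for brevity):

```python
# Toy illustration of SPOT-style context statistics: for a context
# pattern around seed drug mentions, count how many fillers are known
# drugs (support) and what fraction of all fillers are drugs (confidence).
sentences = [
    "patient was given aspirin daily",
    "patient was given ibuprofen daily",
    "patient was given instructions daily",
]
seed_drugs = {"aspirin", "ibuprofen"}

def context_stats(pattern_left, pattern_right):
    """Support and confidence of the context <pattern_left> X <pattern_right>."""
    fills = [s.split()[3] for s in sentences
             if s.startswith(pattern_left) and s.endswith(pattern_right)]
    support = sum(1 for f in fills if f in seed_drugs)
    confidence = support / len(fills) if fills else 0.0
    return support, confidence

support, confidence = context_stats("patient was given", "daily")
print(support, round(confidence, 2))  # 2 0.67
```

A high-confidence context would then nominate its remaining fillers (here "instructions", correctly penalized by the lower confidence) as candidate dictionary entries, which thresholds on the three features accept or reject.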

This paper describes and evaluates the automatic grammatical annotation of a chat and an e-mail corpus of together 117 million words, using a modular Constraint Grammar system. We discuss a number of genre-specific issues, such as emoticons and personal pronouns, and offer a linguistic comparison of the two corpora with corresponding annotations of the Europarl corpus and the spoken and written subsections of the BNC corpus, with a focus on orality markers such as linguistic complexity and word class distribution.

The Person Name Vocabulary (PNV) is an RDF vocabulary for modelling persons' names. It is applicable to many datasets in which persons are described, as it accommodates different levels of data granularity. It furthermore allows for easy alignment of name elements, including idiosyncratic ones such as family name prefixes and patronymics, with standard vocabularies such as FOAF, DBpedia and Wikidata, thus guaranteeing optimal data interoperability.

For a given software bug report, identifying an appropriate developer who could potentially fix the bug is the primary task of bug triaging. Automatic bug triaging is formulated as a classification problem, which takes the bug title and description as the input, and maps it to one of the available developers. A major challenge in doing this is that the bug description usually contains a combination of unstructured text, code snippets, and stack traces making the input data highly noisy. The existing bag-of-words (BOW) models do not consider the semantic information in the unstructured text. In this research, we propose a novel bug report representation using a deep bidirectional recurrent neural network with attention (DBRNN-A) that learns the syntactic and semantic features from long word sequences in an unsupervised manner. Using attention enables the model to remember and attend to important parts of text in a bug report. For training the model, we use unfixed bug reports (which constitute about 70% of bugs in a typical open source bug tracking system) which were ignored in previous studies. Another major contribution of this work is the release of a public benchmark dataset of bug reports from three open source bug tracking systems: Google Chromium, Mozilla Core, and Mozilla Firefox. The dataset consists of 383,104 bug reports from Google Chromium, 314,388 bug reports from Mozilla Core, and 162,307 bug reports from Mozilla Firefox. When compared to other systems, we observe that DBRNN-A provides a higher rank-10 average accuracy.
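The attention step, which weights token representations before summarizing a bug report, can be sketched as a plain softmax-weighted sum (the vectors and scores below are illustrative; the real model learns them with the deep bidirectional RNN):

```python
import math

# Minimal soft-attention sketch over token representations of a bug report.
def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(token_vectors, scores):
    """Weight token vectors by attention and sum into one representation."""
    weights = softmax(scores)
    dim = len(token_vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, token_vectors)) for d in range(dim)]

tokens = [[1.0, 0.0], [0.0, 1.0]]   # stand-ins for RNN hidden states
rep = attend(tokens, [2.0, 0.0])    # first token gets most of the weight
print(rep[0] > rep[1])  # True
```

In the full model the attended representation feeds a classifier over developers, letting informative parts of the noisy report (and not stack-trace boilerplate) dominate the triaging decision.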

This paper introduces the MEQLD method (Mapping Expansion of Natural Language Entities to DBpedia's Components for Querying Linked Data), which we propose for part of Task 1 (Multilingual Question Answering) [15] of the CLEF 2013 lab QALD-3 (Multilingual Question Answering over Linked Data) [14]. MEQLD aims to improve the mapping expansion of (lexical) entities of English questions into DBpedia's components for creating queries in the SPARQL Query Language [12]. This paper focuses on resolving the most difficult test questions of QALD-3 [18], those on which none of the submitted systems achieved good results.

In this paper we present our participation in the 2014 TREC Clinical Decision Support Track. The goal of this track is to find relevant medical literature for a case report, which should help address one specific clinical aspect of the case. Since it was the first time we participated in this task, we opted for an exploratory approach to test the impact of retrieval systems based on Bag-of-Words (BoW) or Medical Subject Headings (MeSH) index terms. In all five submitted runs, we used manually constructed MeSH queries to filter a target corpus for each of the three clinical question types. Query expansion (for both MeSH and BoW runs) was based on the automatic generation of disease hypotheses, for which we used data from OrphaNet [4] and the Disease Symptom Knowledge Database [3]. Our best run was a MeSH-based run in which PubMed was queried directly with the MeSH terms extracted from the case reports, combined with the MeSH terms of the top 5 disease hypotheses generated for the case reports. Compared to the other participants, we achieved low scores. Preliminary analysis shows that our corpus filtering method was too strict and had a negative impact on recall.

Aspect-level sentiment analysis refers to sentiment polarity detection from unstructured text at a fine-grained feature or aspect level. This paper presents our experimental work on aspect-level sentiment analysis of movie reviews. Movie reviews generally contain user opinions about different aspects such as acting, direction, choreography, cinematography, etc. We have devised a linguistic rule-based approach which identifies the aspects in movie reviews, locates the opinion about each aspect, and computes the sentiment polarity of that opinion using linguistic approaches. The system generates an aspect-level opinion summary. The experimental design is evaluated on datasets of two movies. The results achieve good accuracy and show promise for deployment in an integrated opinion profiling system.
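A minimal sketch of one such linguistic rule (adjacency of an opinion word to an aspect term; the aspect list and polarity lexicon are toy stand-ins for the resources a real system would use):

```python
# Toy sketch of a rule-based aspect-sentiment step: find an aspect
# keyword, look at the adjacent opinion word, and read polarity off a
# small lexicon (aspect list and lexicon are illustrative only).
ASPECTS = {"acting", "direction", "cinematography"}
POLARITY = {"brilliant": 1, "superb": 1, "dull": -1, "weak": -1}

def aspect_sentiment(review):
    """Return {aspect: polarity} for opinion words adjacent to aspects."""
    words = review.lower().replace(",", "").split()
    result = {}
    for i, w in enumerate(words):
        if w in ASPECTS:
            # Rule: check the immediately preceding and following word.
            for j in (i - 1, i + 1):
                if 0 <= j < len(words) and words[j] in POLARITY:
                    result[w] = POLARITY[words[j]]
    return result

print(aspect_sentiment("brilliant acting, but dull direction"))
# {'acting': 1, 'direction': -1}
```

A full system would add rules for negation, intensifiers, and longer-range dependencies, then aggregate the per-aspect polarities into the opinion summary.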

During recent years, the use of linguistic data for language processing (semantic ambiguity resolution, translation...) has increased progressively. Such data are now commonly called language resources. A few years ago, nearly all the language resources used for this purpose were collections of texts, such as the Brown Corpus and the Penn Treebank, but the use of electronic lexicons (WordNet, FrameNet, VerbNet, ComLex, Lexicon-Grammar...) and formal grammars (TAG...) has developed recently. This development is slow because most processes for constructing lexicons and grammars are manual, whereas the construction of corpora has always been highly automated. However, more and more specialists in language processing realize that the information content of lexicons and grammars is richer than that of corpora, and hence the former make more elaborate processing possible. The difference in construction time is likely connected with the difference in information content: the handcrafting of lexicons and grammars by linguists would make them more informative than automatically generated data. This situation can evolve in two directions: either specialists in language technology get progressively used to handling manually constructed resources, which are more informative and more complex, or the construction of lexicons and grammars is automated and industrialized, which is the mainstream perspective. Both evolutions are already in progress, and a tension exists between them. The relation between linguists and computer scientists depends on the future of these evolutions, since the former implies training and hiring numerous linguists, whereas the latter depends essentially on solutions devised by computer engineers.
The aim of this article is to analyse practical examples of the language resources in question, and to discuss which of the two trends, handcrafting or industrial generation, or a combination of both, can give the best results or is the most realistic.

This paper describes a simple and principled approach to automatically constructing sentiment lexicons using distant supervision. We induce the sentiment association scores for the lexicon items from a model trained on weakly supervised corpora. Our empirical findings show that features extracted from such a machine-learned lexicon outperform models using manual or other automatically constructed sentiment lexicons. Finally, our system achieves the state of the art in the Twitter Sentiment Analysis task from SemEval-2013 and ranks second best in SemEval-2014 according to the average rank.
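One common way to induce such association scores, shown here as a generic sketch rather than the authors' exact model, is a PMI-style log-odds between a token and the noisy class labels:

```python
import math
from collections import Counter

def induce_lexicon(docs, labels, eps=1.0):
    """PMI-style sentiment association: score(w) = PMI(w, pos) - PMI(w, neg).
    docs: list of token lists; labels: parallel list of 'pos'/'neg'
    (noisy labels, e.g. derived from emoticons). eps smooths zero counts."""
    pos_c, neg_c = Counter(), Counter()
    for toks, lab in zip(docs, labels):
        (pos_c if lab == "pos" else neg_c).update(set(toks))
    n_pos = sum(pos_c.values()) + eps
    n_neg = sum(neg_c.values()) + eps
    vocab = set(pos_c) | set(neg_c)
    # positive score: word leans positive; negative: leans negative
    return {w: math.log2(((pos_c[w] + eps) / n_pos) /
                         ((neg_c[w] + eps) / n_neg))
            for w in vocab}
```

The resulting real-valued scores can then be turned into classifier features, which is how such machine-learned lexicons are typically consumed downstream.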

This small dictionary offers the most accurate and up-to-date coverage of essential, everyday vocabulary with over 90,000 words, phrases, and definitions based on evidence from the Oxford English Corpus, a unique databank comprising hundreds of millions of words of English. Definitions are easy to understand, given in a clear, simple style, and avoiding technical language. Access our free dictionary service Oxford Dictionaries Online at

This study characterizes the subject-verb agreement that occurs with group of NP and number of NP. These two complex noun phrases can agree with a verb as a singular or plural noun. These two particular items were selected as number of NP has a relatively firm description of its quantification behavior described in existing literature while group of NP has not been shown to have describable rules governing its quantity. Using data collected from the Corpus of Contemporary American English (COCA), 1200 concordance lines centered on group of and number of which agree with a verb in a clause were extracted for study of several co-occurring features. Individual features such as determiners and modifiers are examined with respect to their distribution with singular or plural-agreeing verbs to identify patterns of agreement and potentially indicate trends, if not causal relationships. Some features, such as determiners preceding the first noun number, show trends with respect to the verb-demonstrated quantity of the noun phrase. Other features, such as premodifiers on either noun in group of NP do not appear to co-occur in demonstrable patterns. By creating a description of quantification in this way, this study lays the foundation for more targeted future studies of quantification in cognition, grammar, and semantics.

Twitter is an emerging platform for expressing opinions on various issues. Many approaches, such as machine learning, information retrieval and NLP, have been applied to determine the sentiment of tweets. We use movie reviews as our dataset for training as well as testing, and merge Naive Bayes with adjective analysis to find the polarity of ambiguous tweets. Experimental results reveal that the overall accuracy of the process is improved by this model. First, we apply Naive Bayes to the collected tweets, which yields a set of correctly polarized and a set of falsely polarized tweets. The falsely polarized set is then further processed with adjective analysis to determine the polarity of each tweet and classify it as positive or negative.
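The two-stage pipeline can be sketched as below; the tiny Naive Bayes model, the confidence margin, and the adjective lexicon are all illustrative stand-ins rather than the paper's exact components:

```python
import math
from collections import Counter

# Hypothetical adjective polarity lexicon, used only for tweets the
# Naive Bayes step classifies with low confidence.
ADJ_LEXICON = {"awesome": 1, "great": 1, "awful": -1, "boring": -1}

class TinyNB:
    """Multinomial Naive Bayes with add-one smoothing (two classes)."""
    def fit(self, docs, labels):
        self.counts = {"pos": Counter(), "neg": Counter()}
        for toks, lab in zip(docs, labels):
            self.counts[lab].update(toks)
        self.vocab = set(self.counts["pos"]) | set(self.counts["neg"])
        return self

    def log_odds(self, toks):
        s = 0.0
        for c, sign in (("pos", 1), ("neg", -1)):
            n = sum(self.counts[c].values()) + len(self.vocab)
            s += sign * sum(math.log((self.counts[c][t] + 1) / n)
                            for t in toks if t in self.vocab)
        return s

def classify(nb, toks, margin=0.5):
    lo = nb.log_odds(toks)
    if abs(lo) < margin:            # ambiguous: fall back to adjectives
        lo = sum(ADJ_LEXICON.get(t, 0) for t in toks)
    return "pos" if lo >= 0 else "neg"
```

A tweet the classifier is confident about keeps its Naive Bayes label; an ambiguous one is re-scored from its adjectives alone.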

Human annotators are critical for creating the necessary datasets to train statistical learners, but annotation cost and limited access to qualified annotators form a data bottleneck. In recent years, researchers have investigated overcoming this obstacle using crowdsourcing, which is the delegation of a particular task to a large group of untrained individuals rather than a select trained few.

This thesis is concerned with crowdsourcing annotation across a variety of natural language processing tasks. The tasks reflect a spectrum of annotation complexity, from simple labeling to translating entire sentences. The presented work involves new types of annotators, new types of tasks, new types of data, and new types of algorithms that can handle such data.

The first part of the thesis deals with two text classification tasks. The first is the identification of dialectal Arabic sentences. We use crowdsourcing to create a large annotated dataset of Arabic sentences, which is used to train and evaluate language models for each Arabic variety. We also introduce a new type of annotations called annotator rationales, which complement traditional class labels. We collect rationales for dialect identification and for a sentiment analysis task on movie reviews. In both tasks, adding rationales yields significant accuracy improvements.

In the second part, we examine how crowdsourcing can be beneficial to machine translation (MT). We start with the evaluation of MT systems, and show the potential of crowdsourcing to edit MT output. We also present a new MT evaluation metric, RYPT, that is based on human judgment and well-suited for a crowdsourced setting. Finally, we demonstrate that crowdsourcing can be used to collect translations. We discuss a set of features that help distinguish well-formed translations from those that are not, and show that crowdsourcing yields high-quality translations at a fraction of the cost of hiring professionals.

Although remarkable improvements have been seen in independent word sense disambiguation (WSD) models, there are still debates about the necessity of integrating WSD models with machine translation (MT) systems. To settle the question in a general view, we break the restrictions of specific models and import a simulated perfect all-words WSD process into MT systems of different types to obtain a sufficient and general evaluation. Experimental results indicate that a good WSD process not only yields considerable translation quality itself but also clearly improves the MT systems. In addition, this work reveals that current MT technologies still have much room for improvement in selecting the best translation.

This paper explores a method that uses WordNet concepts to categorize text documents. The bag-of-words representation used for text representation is unsatisfactory, as it ignores possible relations between terms. The proposed method extracts generic concepts from WordNet for all the terms in the text and then combines them with the terms in different ways to form a new representative vector. The effects of this method are examined in several experiments using the multivariate chi-square test to reduce dimensionality, the cosine distance, and two benchmark corpora for evaluation: the Reuters-21578 newswire articles and the 20 Newsgroups data. The proposed method is especially effective in raising the macro-averaged F1 value, which increased from 0.649 to 0.714 for Reuters and from 0.667 to 0.719 for the 20 Newsgroups.
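The augmentation step can be illustrated with a toy hypernym map standing in for WordNet (a real system would look up hypernym synsets for each term, e.g. via NLTK's WordNet interface):

```python
# Toy concept map: a hypothetical stand-in for WordNet hypernyms.
CONCEPTS = {"dog": ["animal"], "cat": ["animal"], "car": ["vehicle"]}

def augment(tokens, mode="add"):
    """Combine terms with their generic concepts to build a new
    representative token list.
    mode='add'     -> keep each term and append its concepts
    mode='replace' -> substitute each term by its concepts (if any)"""
    out = []
    for t in tokens:
        if mode == "add":
            out.append(t)
            out.extend(CONCEPTS.get(t, []))
        else:
            out.extend(CONCEPTS.get(t, [t]))
    return out
```

The augmented token list is then vectorized as usual; the "different ways" of combining in the paper correspond to choices like the two modes above.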

Introduction: Natural Language Processing is a part of Artificial Intelligence and Machine Learning concerned with the interaction between computers and human (natural) languages. Sentiment analysis is one part of Natural Language Processing, often used to analyze written text to find positive, negative, or neutral sentiments. Sentiment analysis is useful for knowing whether users like something or not. Zomato is an application for rating restaurants. Each rating comes with a review of the restaurant, which can be used for sentiment analysis. Based on this, we predict the sentiment of each review.

Method: The preprocessing of reviews consists of lowercasing all words, tokenization, removal of numbers and punctuation, stop-word removal, and lemmatization. We then build word vectors with term frequency-inverse document frequency (TF-IDF). The data we process comprise 150,000 reviews. Reviews with a rating above 3 are labeled positive, those below 3 negative, and those of exactly 3 neutral. We use a split test with 80% training data and 20% testing data. The metrics used to evaluate the random forest classifier are precision, recall, and accuracy. The accuracy of this research is 92%.

Result: The precision for positive, negative, and neutral sentiment is 92%, 93%, and 96%. The recall for positive, negative, and neutral sentiment is 99%, 89%, and 73%. Average precision and recall are 93% and 87%. The 10 words that most affect the results are: "bad", "good", "average", "best", "place", "love", "order", "food", "try", and "nice".
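The labeling rule and the TF-IDF step can be sketched as follows; this is plain TF-IDF, with the rating rule read as above 3 positive, below 3 negative, exactly 3 neutral (our reading of the study's description):

```python
import math
from collections import Counter

def label_from_rating(r):
    """Map a 1-5 rating to a sentiment class (rule as read from the
    study: above 3 positive, below 3 negative, exactly 3 neutral)."""
    return "positive" if r > 3 else "negative" if r < 3 else "neutral"

def tfidf(docs):
    """Plain TF-IDF weighting: tf(t, d) * log(N / df(t)).
    docs: list of token lists; returns one {term: weight} dict per doc."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(d).items()}
            for d in docs]
```

The resulting sparse vectors are what a random forest (or any other classifier) would be trained on.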

We describe a new semantic relatedness measure combining the Wikipedia-based Explicit Semantic Analysis measure, the WordNet path measure and the mixed collocation index. Our measure achieves the currently highest results on the WS-353 test: a Spearman coefficient of 0.79 (vs. 0.75 in (Gabrilovich and Markovitch, 2007)) when applying the measure directly, and a value of 0.87 (vs. 0.78 in (Agirre et al., 2009)) when using the prediction of a polynomial SVM classifier trained on our measure. In the appendix we discuss the adaptation of ESA to 2011 Wikipedia data, as well as various unsuccessful attempts to enhance ESA by filtering at word, sentence, and section level.

One of the main problems in Natural Language Processing is lexical ambiguity: words often have multiple lexical functionalities (i.e., they can have various parts of speech) or several semantic meanings. Nowadays, the semantic ambiguity problem, better known as Word Sense Disambiguation, is still an open problem in this area. The accuracy of the different approaches to semantic disambiguation is much lower than the accuracy of systems that solve other kinds of ambiguity, such as part-of-speech tagging. Corpus-based approaches have been widely used in nearly all natural language processing tasks. In this work, we propose a Word Sense Disambiguation system based on Hidden Markov Models and the use of WordNet. Some experimental results of our system on the SemCor corpus are presented.
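An HMM-based WSD system of this kind ultimately relies on Viterbi decoding, with candidate senses as hidden states and words as observations; the probabilities in the usage example below are illustrative, not taken from the paper:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state (sense) path for obs."""
    # V[t][s] = (best probability of reaching s at step t, path so far)
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 0.0), [s]) for s in states}]
    for o in obs[1:]:
        row = {}
        for s in states:
            p, path = max(
                (V[-1][r][0] * trans_p[r][s] * emit_p[s].get(o, 0.0),
                 V[-1][r][1])
                for r in states)
            row[s] = (p, path + [s])
        V.append(row)
    return max(V[-1].values())[1]
```

For instance, with two senses of "bank" and a following word "money" that only the financial sense can emit, the decoder picks the financial sense throughout.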

Tumor molecular profiling plays an integral role in identifying genomic anomalies which may help in personalizing cancer treatments, improving patient outcomes and minimizing risks associated with different therapies. However, critical information regarding the evidence of clinical utility of such anomalies is largely buried in biomedical literature. It is becoming prohibitive for biocurators, clinical researchers and oncologists to keep up with the rapidly growing volume and breadth of information, especially those that describe therapeutic implications of biomarkers and therefore relevant for treatment selection. In an effort to improve and speed up the process of manually reviewing and extracting relevant information from literature, we have developed a natural language processing (NLP)-based text mining (TM) system called eGARD (extracting Genomic Anomalies association with Response to Drugs). This system relies on the syntactic nature of sentences coupled with various textual features to extract relations between genomic anomalies and drug response from MEDLINE abstracts. Our system achieved high precision, recall and F-measure of up to 0.95, 0.86 and 0.90, respectively, on annotated evaluation datasets created in-house and obtained externally from PharmGKB. Additionally, the system extracted information that helps determine the confidence level of extraction to support prioritization of curation. Such a system will enable clinical researchers to explore the use of published markers to stratify patients upfront for \u2018best-fit\u2019 therapies and readily generate hypotheses for new clinical trials.

The image captioning task focuses on generating a descriptive sentence for a given image. In this work, we propose accurate guidance for image caption generation, which guides the caption model to focus more on the principal semantic objects, as a human reader would, and to generate grammatically high-quality sentences. In particular, we replace the classification network with an object detection network as the multi-level feature extractor, to emphasize what humans care about and avoid unnecessary model additions. An attention mechanism is utilized to align the features of the principal objects with the words of the semantic sentence. Under these circumstances, we combine the object detection network and the text generation model into an end-to-end model with fewer parameters. The experimental results on the MS-COCO dataset show that our method is on par with or even outperforms the current state of the art.

The objective of this research is to present a web application that predicts L2 text readability. The software is intended to assist ESL teachers in selecting texts written at a level of difficulty that corresponds with the target students' lexical competence. The ranges are obtained through a statistical approach using probability distributions and an optimized version of the word frequency class algorithm, with the aid of WordNet and a lemmatised word list for the British National Corpus. Additionally, the program is intended to facilitate the selection of specialised texts for teachers of ESP using proportionality and lists of specialised vocabulary.
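The word frequency class algorithm mentioned above is commonly defined as floor(log2(f_max / f(w))), where f_max is the frequency of the most frequent word; below is a sketch under that standard definition (the paper's optimized version may differ):

```python
import math

def frequency_class(freq, freq_max):
    """Word frequency class under the common definition:
    floor(log2(f_max / f(w))). The most frequent word gets class 0;
    each class upward roughly halves the frequency."""
    return math.floor(math.log2(freq_max / freq))
```

Texts whose words fall mostly into low classes use frequent, easier vocabulary; high classes flag rare words, which is what makes the measure usable as a readability signal.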

In this paper, the problem of recovery of morphological information lost in abbreviated forms is addressed with a focus on highly inflected languages. Evidence is presented that the correct inflected form of an expanded abbreviation can in many cases be deduced solely from the morphosyntactic tags of the context. The prediction model is a deep bidirectional LSTM network with tag embedding. The training and evaluation data are gathered by finding the words which could have been abbreviated and using their corresponding morphosyntactic tags as the labels, while the tags of the context words are used as the input features for classification. The network is trained on over 10 million words from the Polish Sejm Corpus and achieves 74.2% prediction accuracy on a smaller, but more general National Corpus of Polish. The analysis of errors suggests that performance in this task may improve if some prior knowledge about the abbreviated word is incorporated into the model.

The 10 consonant-nucleus-consonant (CNC) word lists are considered the gold standard in the testing of cochlear implant (CI) users. However, variance in scores across lists could degrade their sensitivity and reliability in identifying deficits in speech perception. This study examined the relationship between variability in performance among lists and the lexical characteristics of the words. Data are from 28 adult CI users. Each subject was tested on all 10 CNC word lists. Data were analyzed in terms of lexical characteristics: lexical frequency, neighborhood density, and bi- and tri-phonemic probabilities. To determine whether individual performance variability across lists can be reduced, the standard set of 10 phonetically balanced 50-word lists was redistributed into a new set of lists using two sampling strategies: (a) balancing with respect to word lexical frequency or (b) selecting words with equal probability. The mean performance on the CNC lists varied from 53.1% to 62.4% correct. The average ...

Corpus linguistics and lexicography make a natural combination. Lexicography existed for many centuries, even millennia, without recourse to a corpus, but with the arrival of corpus linguistics the paradigm has changed entirely. It is now impossible to build a decent dictionary of either general or special languages without making use of a corpus. The exponential rise of corpus linguistics since the 1980s is also largely due to the influence of lexicography, with the impetus given by the COBUILD project led by John Sinclair at the University of Birmingham, and then the building of the British National Corpus by a consortium including major dictionary houses. Although there have been numerous books and articles describing corpus-based lexicographical projects and the use of corpora in dictionary making, there has to date been no introduction to the art of building dictionaries from corpora. Two books now give a real introduction to corpus-based lexicography for mono- and bilingual dictionaries: The Oxford Guide to Practical Lexicography (Atkins & Rundell 2008) and Practical Lexicography: a reader (Fontenelle 2007).

The effectiveness of automatic key concept or keyphrase identification from unstructured text documents mainly depends on a comprehensive and meaningful list of candidate features extracted from the documents. However, the conventional techniques for candidate feature extraction limit the performance of keyphrase identification algorithms and need improvement. The objective of this paper is to propose a novel parse-tree-based approach for candidate feature extraction that overcomes the shortcomings of the existing techniques. Our proposed technique is based on generating a parse tree for each sentence in the input text. Sentence parse trees are then cut into sub-trees to extract branches for candidate phrases (i.e., noun phrases, verb phrases, and so on). The sub-trees are combined using part-of-speech tagging to generate a flat list of candidate phrases. Finally, filtering is performed using heuristic rules, and redundant phrases are eliminated to generate the final list of candidate features. Experimental analysis is conducted to validate the proposed scheme using three manually annotated and publicly available datasets from different domains, i.e., Inspec, 500N-KPCrowed, and SemEval-2010. The proposed technique is fine-tuned to determine the optimal value for the context window size parameter and is then compared with the existing conventional n-gram and noun-phrase-based techniques. The results show that the proposed technique outperforms the existing approaches, with significant improvements of 13.51% and 30.67%, 12.86% and 5.48%, and 13.16% and 31.46% in terms of precision, recall, and F-measure when compared with the noun-phrase-based scheme and the n-gram-based scheme, respectively. These results give us confidence to further validate the proposed technique by developing a keyphrase extraction algorithm in the future.
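The effect of cutting tag-labelled branches into flat candidate phrases can be approximated over pre-tagged tokens with a simple POS-pattern sketch; this is a stand-in for the full parse-tree procedure, not the paper's implementation:

```python
import re

# Pattern over a condensed tag string: optional adjectives, then nouns.
PATTERN = re.compile(r"J*N+")

def candidates(tagged):
    """Extract candidate phrases from pre-tagged tokens.
    tagged: list of (word, POS) pairs with Penn-style tags, e.g. ('tree', 'NN')."""
    # condense each tag to one letter: N = noun, J = adjective, x = other
    code = "".join("N" if tag.startswith("NN") else
                   "J" if tag.startswith("JJ") else "x"
                   for _, tag in tagged)
    return [" ".join(w for w, _ in tagged[m.start():m.end()])
            for m in PATTERN.finditer(code)]
```

A real system would derive the tag sequence from the parse tree itself and apply the paper's heuristic filtering and redundancy elimination on top of this candidate list.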

The goal of my thesis is the extension of the Distributional Hypothesis [13] from the word to the concept level. This will be achieved by creating data-driven methods to build and apply conceptualizations: taxonomic semantic models that are grounded in the input corpus. Such conceptualizations can be used to disambiguate all words in the corpus, so that we can extract richer relations and create a dense graph of semantic relations between concepts. These relations will reduce sparsity issues, a common problem for contextualization techniques. By extending our conceptualization with named entities and multi-word expressions (MWE), we can create a Linked Open Data knowledge base that is linked to existing knowledge bases like Freebase.

Modern knowledge bases have matured to the extent of being capable of complex reasoning at scale. Unfortunately, wide deployment of this technology is still hindered by the fact that specifying the requisite knowledge requires skills that most domain experts do not have, and skilled knowledge engineers are in short supply. A way around this problem could be to acquire knowledge from text. However, the current knowledge acquisition technologies for information extraction are not up to the task because logic reasoning systems are extremely sensitive to errors in the acquired knowledge, and existing techniques lack the required accuracy by too large of a margin. Because of the enormous complexity of the problem, controlled natural languages (CNLs) were proposed in the past, but even they lack high enough accuracy. Instead of tackling the general problem of text understanding, our interest is in a related, but different, area of knowledge authoring\u2014a technology designed to enable domain experts to manually create formalized knowledge using CNL. Our approach adopts and formalizes the FrameNet methodology for representing the meaning, enables incrementally-learnable and explainable semantic parsing, and harnesses rich knowledge graphs like BabelNet in the quest to obtain unique, disambiguated meaning of CNL sentences. Our experiments show that this approach is 95.6% accurate in standardizing the semantic relations extracted from CNL sentences\u2014far superior to alternative systems.

As mentioned in section 1.7, a selection involves choosing between two or more different actions depending on the value of a data item or some condition. We can illustrate the concept of selecting different actions and the associated C++ language constructs by developing further the student marks example from chapter 5. This will involve taking different actions depending on the values of the examination mark and the practical mark.

The paper deals with the use of the superlative degree in spoken British English on the basis of the demographic part of the British National Corpus. The aspects investigated include the distribution of the morphological types (inflectional vs. periphrastic), the types of adjectives used in this construction, and the syntax of the superlative (attributive, predicative and nominal use; determiner usage). Special attention is paid to the semantics (relative, absolute, intensifying meanings) and the corresponding functions of the superlative, where it is noticeable that absolute and intensifying readings are much more common than the extant literature would lead one to expect. Together with the usage of generalising modification structures, this points to the conclusion that the superlative may be less a means of factual comparison than a means of (often vague) evaluation and the expression of emotion.

This paper describes the Atlante Sintattico d'Italia, Syntactic Atlas of Italy (ASIt) linguistic linked dataset. ASIt is a scientific project aiming to account for minimally different variants within a sample of closely related languages; it is part of the Edisyn network, the goal of which is to establish a European network of researchers in the area of language syntax who use similar standards with respect to methodology of data collection, data storage and annotation, data retrieval and cartography. In this context, ASIt is defined as a curated database which builds on dialectal data gathered during a twenty-year-long survey investigating the distribution of several grammatical phenomena across the dialects of Italy. Both the ASIt linguistic linked dataset and the Resource Description Framework Schema (RDF/S) on which it is based are publicly available and released under a Creative Commons licence (CC BY-NC-SA 3.0). We report the characteristics of the data exposed by ASIt, statistics about the evolution of the data over the last two years, and possible usages of the dataset, such as the generation of linguistic maps.

In English, new adjectives can be coined by adding the suffix -ish. For instance, one can describe someone who acts like Arnold Schwarzenegger as Schwarzeneggerish. This paper investigates how the use of -ish is influenced by text characteristics (genre, formality) and author characteristics (gender, age). We used two corpora, the British National Corpus and the Blog Authorship Corpus. From our analyses of variance (ANOVAs) and logistic regression models, we learned that for the use of -ish it is probably more important what type of text you are writing than who you are. We also concluded that this type of research is seriously hampered by the absence of the kind of metadata needed for our type of research.

We present a comparative analysis of synonyms in collaboratively constructed and linguistic lexical semantic resources and its implications for NLP research. Our focus is on the Wiki-based resources constructed mostly by non-experts on the Web which lack any principled linguistic guidelines and rely on user collaboration for quality management, as opposed to conventional sources of synonyms such as WordNet or thesauri. The most prominent examples are Wikipedia (a free Encyclopedia) and its dictionary spin-offs Wiktionary and OmegaWiki , where the latter has a strong focus on crosslinguality. We examine three major ways how synonyms emerge in these resources, all of which imply a different operational definition of synonymy. We then discuss how these synonyms can be mined and used building upon previous research in this field.

The rapid progress of question answering (QA) systems over knowledge bases (KBs) enables end users to acquire knowledge with natural language questions. While mapping proper nouns and relational phrases to semantic constructs in KBs has been extensively studied, little attention has been devoted to adjectives, most of which play the role of factoid constraints on the modified nouns. In this paper, we study the problem of finding appropriate representations for adjectives over KBs. We propose a novel approach, called Adj2ER, to automatically map an adjective to several existential restrictions or their negation forms. Specifically, we leverage statistical measures for generating candidate existential restrictions and supervised learning for filtering the candidates, which largely reduces the search space and overcomes the lexical gap. We create two question sets with adjectives from QALD and Yahoo! Answers, and conduct experiments over DBpedia. Our experimental results show that Adj2ER can generate high-quality mappings for most adjectives and significantly outperform several alternative approaches. Furthermore, current QA systems can gain a promising improvement when integrating our adjective mapping approach.

We present a project aimed at the construction of a bank of constituent parse trees for 20,000 Polish sentences taken from the balanced hand-annotated subcorpus of the National Corpus of Polish (NKJP).

The treebank is to be obtained by automatic parsing and manual disambiguation of the resulting trees. The grammar applied by the project is a new version of Swidzinski's formal definition of Polish. Each sentence is disambiguated independently by two linguists and, if needed, adjudicated by a supervisor. The feedback from this process is used to iteratively improve the grammar.

In the paper, we describe the linguistic but also the technical decisions made in the project. We discuss the overall shape of the parse trees, including the extent of the encoded grammatical information. We also delve into the problem of syntactic disambiguation as a challenge for our job.

In this paper, we concentrate on the resolution of the semantic ambiguity that arises when a given word has several meanings. This specific task is commonly referred to as Word Sense Disambiguation (WSD). We propose a method that obtains the appropriate senses from a multidimensional analysis (using Relevant Semantic Trees). Our method uses different resources (WordNet, WordNet Domains, WordNet-Affect and SUMO) combined with sense frequencies obtained from SemCor. Our hypothesis is that in WSD it is important to obtain the most frequent senses depending on the type of analyzed context to achieve better results. Finally, in order to evaluate and compare our results, we present a comprehensive study and experimental work using the Senseval-2 and SemEval-2 datasets, demonstrating that our system obtains better results than other unsupervised systems.

Twitter is a microblogging service where users worldwide publish their feelings. However, sentiment analysis of Twitter messages (tweets) is regarded as a challenging problem because tweets are short and informal. In this paper, we focus on this problem through the analysis of emotion tokens, including emotion symbols (e.g. emoticons), irregular forms of words and combined punctuation. According to our observation of five million tweets, these emotion tokens are commonly used (0.47 emotion tokens per tweet). They directly express one's emotion regardless of language, and hence become a useful signal for sentiment analysis of multilingual tweets. Firstly, emotion tokens are extracted automatically from tweets. Secondly, a graph propagation algorithm is proposed to label the tokens' polarities. Finally, a multilingual sentiment analysis algorithm is introduced. Comparative evaluations are conducted against a semantic-lexicon-based approach and some state-of-the-art Twitter sentiment analysis Web services, on both English and non-English tweets. Experimental results show the effectiveness of the proposed algorithms.
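A graph propagation step of the kind described can be sketched as a simple iterative spread of seed polarities over a token co-occurrence graph; the damping factor and the clamping of seed nodes are our assumptions, not details from the paper:

```python
def propagate(graph, seeds, iters=20, alpha=0.8):
    """Spread seed polarities over a weighted graph.
    graph: {node: {neighbour: weight}}; seeds: {node: +1.0 / -1.0}.
    Seed nodes keep their labels; other nodes take a damped,
    weight-normalized average of their neighbours' scores."""
    score = {n: seeds.get(n, 0.0) for n in graph}
    for _ in range(iters):
        new = {}
        for n, nbrs in graph.items():
            if n in seeds:                 # clamp the seeds
                new[n] = seeds[n]
                continue
            total = sum(nbrs.values()) or 1.0
            new[n] = alpha * sum(w * score[m]
                                 for m, w in nbrs.items()) / total
        score = new
    return score
```

Starting from a few emoticons with known polarity, tokens that co-occur with positive seeds end up with positive scores, and likewise for negative ones.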

The Syntactic Atlas of the Dutch Dialects (SAND) provides a detailed overview of the surprisingly rich syntactic variation found in 267 dialects of Dutch at the beginning of the 21st century. More than 200 full-colour maps show the geographic distribution of over 100 syntactic variables. Many of these variables involve phenomena that are absent from the standard language and thus of great descriptive and theoretical importance. A state-of-the-art linguistic description and commentary accompanies each map, taking into account the results of modern syntactic research as well as historical developments. Volume I (2005) shows the variation in complementisers and complementiser agreement, subject pronouns, subject doubling and subject cliticisation after yes/no, reflexives and reciprocals, relative clauses, question-word doubling and topicalisation. Volume II shows the variation in two- and three-verb clusters, interruption of the verb clusters, extraposition and te 'to' in verbal clusters, auxiliary selection, do-insertion, and negation and quantification.

This study investigates the effect of corpus consultation on the accuracy of learner written error revisions. It examines the conditions which cause a learner to consult the corpus in correcting errors and whether these revisions are more effective than those made using other corrections methods.\r\nClaims have been made for the potential usefulness of corpora in encouraging a better understanding of language through inductive learning (Johns, 1991; Benson, 2001; Watson Todd, 2003). The opportunity for learners to interact with the authentic language used to compile corpora has also been cited as a possible benefit (Thurstun and Candlin, 1998). However, theoretical advantages of using corpus data have not always translated into actual benefits in real learning contexts. Learners frequently encounter difficulties in dealing with the volume of information available to them in concordances and can reject corpus use because it adds to their learning load (Yoon and Hirvela, 2004; Frankenberg Garcia, 2005; Lee and Swales, 2006). This has meant that practical employment of corpus data has sometimes been difficult to implement.\r\nIn this experiment, learners on a six week pre-sessional English for Academic Purposes (EAP) course were shown how to use the BYU (Brigham Young University) website to access the BNC (British National Corpus) to address written errors. Through a draft/feedback/revision process using meta-linguistic error coding, the frequency, context and effectiveness of the corpus being used as a reference tool was measured.\r\nUse of the corpus was found to be limited to a small range of error types which largely involved queries of a pragmatic nature. In these contexts, the corpus was found to be a potentially more effective correction tool than dictionary reference or recourse to previous knowledge and it may have a beneficial effect in encouraging top-down processing skills. 
However, its frequency of use over the course was low and accounted for only a small proportion of accurate error revisions as a whole. Learner response to the corpus corroborated the negative perception already noted in previous studies.

These findings prompt recommendations for further investigation into effective mediation of corpus data within the classroom and continued technological developments in order to make corpus data more accessible to non-specialists.

In research on word recognition, it has been shown that word beginnings have higher information content for word identification than word endings; this asymmetric information distribution within words has been argued to be due to the communicative pressure to allow words in speech to be recognized as early as possible. Through entropy analysis using two representative datasets from Wikifonia and the Essen folksong corpus, we show that musical segments also have higher information content (i.e., higher entropy) in segment beginnings than endings. Nevertheless, this asymmetry was not as dramatic as that found within words, and the highest information content was observed in the middle of the segments (i.e., an inverted U pattern). This effect may be because the first and last notes of a musical segment tend to be tonally stable, with more flexibility in the first note for providing the initial context. The asymmetric information distribution within words has been shown to be an important factor accounting for various asymmetric effects in word reading, such as the left-biased preferred viewing location and optimal viewing position effects. Similarly, the asymmetric information distribution within musical segments is a potential factor that can modulate music reading behavior and should not be overlooked.
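The positional entropy analysis described above can be sketched with a toy example: treating each position in a set of aligned sequences as a distribution over symbols, the Shannon entropy at that position measures its information content. The word list below is invented for illustration and is not from Wikifonia or the Essen corpus:

```python
from collections import Counter
import math

def positional_entropy(sequences, position):
    """Shannon entropy (in bits) of the symbols observed at one position."""
    counts = Counter(seq[position] for seq in sequences if len(seq) > position)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy set where beginnings vary but the ending is shared:
words = ["cat", "bat", "rat", "mat"]
# Position 0 carries 2.0 bits; the shared final letter carries 0.0 bits.
```

Applied position by position across a corpus of words or musical segments, this yields the information-content profiles (asymmetric for words, inverted-U for music) that the abstract reports.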

In recent years, different efforts have been made to extract information that users express through online social networking services, e.g. Twitter. Despite the progress achieved, there are still open gaps to be addressed. Regarding sentiment analysis, we highlight the following gaps: (a) low accuracy in the sentiment classification task for short texts; and (b) a lack of tools for sentiment analysis in several languages. Aiming to fill these gaps, in this paper we apply the Spanish adaptation of ANEW (Affective Norms for English Words) as a resource to improve Twitter sentiment analysis, applying a variety of multi-label classifiers to a corpus of Spanish tweets collected by us. To the best of our knowledge, this is the first work using a Spanish adaptation of ANEW for sentiment analysis.
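The core of an ANEW-style lexicon feature is simple: average the affective ratings of the tweet words that appear in the lexicon. A minimal sketch follows; the Spanish entries and valence values below are invented for illustration (real ANEW adaptations rate valence on a 1-9 scale, 1 = negative, 9 = positive):

```python
def tweet_valence(tokens, lexicon):
    """Mean valence of a tweet's lexicon hits; None if no word is covered."""
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else None

# Hypothetical lexicon fragment (values are made up for this sketch):
anew_es = {"feliz": 8.6, "triste": 2.0}
```

Scores like this would typically be one feature among many fed to the multi-label classifiers the abstract mentions, not a classifier on their own.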

We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state-of-the-art results in retrieval experiments on the Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both full images and on a new dataset of region-level annotations.

Despite the success of distributional semantics, composing phrases from word vectors remains an important challenge. Several methods have been tried for benchmark tasks such as sentiment classification, including word vector averaging, matrix-vector approaches based on parsing, and on-the-fly learning of paragraph vectors. Most models simply omit stop words from the composition. Instead of such a yes-no decision, we consider several graded schemes where words are weighted according to their discriminatory relevance with respect to their use in the document (e.g., idf). Some of these methods (particularly tf-idf) are seen to result in a significant improvement in performance over the prior state of the art. Further, combining such approaches into an ensemble based on alternate classifiers such as the RNN model results in a 1.6% performance improvement on the standard IMDB movie review dataset, and a 7.01% improvement on Amazon product reviews. Since these are language-free models that can be obtained in an unsupervised manner, they are also of interest for under-resourced languages such as Hindi. We demonstrate the language-free aspects by showing a gain of 12% for two review datasets over earlier results, and also release a new larger dataset for future testing (Singh, 2015).
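The graded weighting idea can be sketched as idf-weighted vector averaging: instead of dropping stop words outright, each word's vector contributes in proportion to its idf, so uninformative words fade out smoothly. The toy vectors and documents below are invented for illustration:

```python
import math
from collections import Counter

def idf_weights(docs):
    """Inverse document frequency for every word across a list of token lists."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return {w: math.log(n / df[w]) for w in df}

def weighted_average(doc, vectors, idf):
    """idf-weighted average of the word vectors of a document."""
    dim = len(next(iter(vectors.values())))
    out, total = [0.0] * dim, 0.0
    for w in doc:
        weight = idf.get(w, 0.0)
        vec = vectors.get(w)
        if vec is None or weight == 0.0:
            continue  # words appearing in every document get zero idf and vanish
        total += weight
        out = [o + weight * v for o, v in zip(out, vec)]
    return [o / total for o in out] if total else out

docs = [["the", "movie", "was", "great"], ["the", "plot", "was", "dull"]]
vectors = {"the": [1.0, 1.0], "movie": [0.0, 2.0], "great": [2.0, 0.0]}
```

A full tf-idf scheme would additionally multiply in the within-document term frequency; the composed vector then feeds a standard classifier.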

Supervised word sense disambiguation has proven incredibly difficult. Despite significant effort, there has been little success at using contextual features to accurately assign the sense of a word. Instead, few systems are able to outperform the default sense baseline of selecting the highest ranked WordNet sense. In this paper, we suggest that the situation is even worse than it might first appear: the highest ranked WordNet sense is not even the best default sense classifier. We evaluate several default sense heuristics, using supersenses and SemCor frequencies to achieve significant improvements on the WordNet ranking strategy.
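The shape of such a default-sense heuristic can be sketched with toy data: prefer the sense most frequent in a sense-tagged corpus such as SemCor, and fall back to the lexicon's first-ranked sense when no counts exist. The sense inventory below is invented; this is an illustration of the heuristic's structure, not the paper's exact method:

```python
from collections import Counter

def default_sense(word, semcor_counts, wordnet_rank):
    """Most frequent corpus sense if attested; else the first-ranked sense."""
    counts = semcor_counts.get(word)
    if counts:
        return counts.most_common(1)[0][0]
    return wordnet_rank.get(word, [None])[0]

# Hypothetical data in WordNet-style sense-key notation:
semcor = {"bank": Counter({"bank.n.01": 20, "bank.n.02": 5})}
ranking = {"bank": ["bank.n.02", "bank.n.01"], "quay": ["quay.n.01"]}
```

The abstract's point is that even such context-free baselines differ in quality, and corpus-frequency and supersense variants can beat the raw WordNet ranking.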

We compare two ways of obtaining lexical knowledge for antecedent selection in other-anaphora and definite noun phrase coreference. Specifically, we compare an algorithm that relies on links encoded in the manually created lexical hierarchy WordNet and an algorithm that mines corpora by means of shallow lexico-semantic patterns. As corpora we use the British National Corpus (BNC), as well as the Web, which has not been previously used for this task. Our results show that (a) the knowledge encoded in WordNet is often insufficient, especially for anaphor–antecedent relations that exploit subjective or context-dependent knowledge; (b) for other-anaphora, the Web-based method outperforms the WordNet-based method; (c) for definite NP coreference, the Web-based method yields results comparable to those obtained using WordNet over the whole data set and outperforms the WordNet-based method on subsets of the data set; (d) in both case studies, the BNC-based method is worse than the other methods because of data sparseness. Thus, in our studies, the Web-based method alleviated the lexical knowledge gap often encountered in anaphora resolution and handled examples with context-dependent relations between anaphor and antecedent. Because it is inexpensive and needs no hand-modeling of lexical knowledge, it is a promising knowledge source to integrate into anaphora resolution systems.

The freely available European Parliament Proceedings Parallel Corpus, or Europarl, is one of the largest multilingual corpora available to date. Surprisingly, bibliometric analyses show that it has hardly been used in translation studies. Its low impact in translation studies may partly be attributed to the fact that the Europarl corpus is distributed in a format that largely disregards the needs of translation research. In order to make the wealth of linguistic data from Europarl easily and readily available to the translation studies community, the toolkit 'EuroparlExtract' has been developed. With the toolkit, comparable and parallel corpora tailored to the requirements of translation research can be extracted from Europarl on demand. Both the toolkit and the extracted corpora are distributed under open licenses. The free availability is intended to avoid the duplication of effort in corpus-based translation studies and to ensure the sustainability of data reuse. Thus, EuroparlExtract is a contribution ...

Abstract Information encoded in natural language in biomedical literature publications is only useful if efficient and reliable ways of accessing and analyzing that information are available. Natural language processing and text mining tools are therefore essential for extracting valuable information, however, the development of powerful, highly effective tools to automatically detect central biomedical concepts such as diseases is conditional on the availability of annotated corpora. This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotations. Fourteen annotators were randomly paired and differing annotations were discussed for reaching a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to assure corpus-wide consistency. The public release of the NCBI disease corpus contains 6892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier. We were able to link 91% of the mentions to a single disease concept, while the rest are described as a combination of concepts. In order to help researchers use the corpus to design and test disease identification methods, we have prepared the corpus as training, testing and development sets.
To demonstrate its utility, we conducted a benchmarking experiment in which we compared three different knowledge-based disease normalization methods, with a best performance in F-measure of 63.7%. These results show that the NCBI disease corpus has the potential to significantly improve the state-of-the-art in disease name recognition and normalization research, by providing a high-quality gold standard and thus enabling the development of machine-learning based approaches for such tasks. The NCBI disease corpus, guidelines and other associated resources are available at: .

Given a set of images with related captions, our goal is to show how visual features can improve the accuracy of unsupervised word sense disambiguation when the textual context is very small, as this sort of data is common in news and social media. We extend previous work in unsupervised text-only disambiguation with methods that integrate text and images. We construct a corpus by using Amazon Mechanical Turk to caption sense-tagged images gathered from ImageNet. Using a Yarowsky-inspired algorithm, we show that gains can be made over text-only disambiguation, as well as multimodal approaches such as Latent Dirichlet Allocation.

While recent deep neural network models have achieved promising results on the image captioning task, they rely largely on the availability of corpora with paired image and sentence captions to describe objects in context. In this work, we propose the Deep Compositional Captioner (DCC) to address the task of generating descriptions of novel objects which are not present in paired image-sentence datasets. Our method achieves this by leveraging large object recognition datasets and external text corpora and by transferring knowledge between semantically similar concepts. Current deep caption models can only describe objects contained in paired image-sentence corpora, despite the fact that they are pre-trained with large object recognition datasets, namely ImageNet. In contrast, our model can compose sentences that describe novel objects and their interactions with other objects. We demonstrate our model's ability to describe novel concepts by empirically evaluating its performance on MSCOCO and show qualitative results on ImageNet images of objects for which no paired image-sentence data exist. Further, we extend our approach to generate descriptions of objects in video clips. Our results show that DCC has distinct advantages over existing image and video captioning approaches for generating descriptions of new objects in context.

Abstract Ontologies are explicit specifications of concepts and their relationships. In the context of a semantic web of independently developed ontologies, overcoming interoperability and heterogeneity issues is of considerable importance. Many semantic web applications, such as matching of instances in social networks, reasoning over combined knowledge bases, and knowledge sharing among services, rely on ontology alignment. While existing research in this area has developed a wide range of different heuristics, in this paper we propose to look towards cognitive science, specifically analogical reasoning, to support ontology alignment. We investigate the question whether ontology alignment is rooted in the same cognitive process as analogical reasoning. We apply the LISA system, a cognitively-based model of human analogical reasoning, to ontology alignment and present a comprehensive experimental study to determine its performance on ontology alignment problems.

A novel unsupervised genetic word sense disambiguation (GWSD) algorithm is proposed in this paper. The algorithm first uses WordNet to determine all possible senses for a set of words, then a genetic algorithm is used to maximize the overall semantic similarity on this set of words. A novel conceptual similarity function combining domain information is also proposed to compute similarity between senses in WordNet. GWSD is tested on two sets of domain terms and obtains good results. A weighted genetic word sense disambiguation (WGWSD) algorithm is then proposed to disambiguate words in a general corpus. Experiments on SemCor are carried out to compare WGWSD with previous work.

This article presents the Provo Corpus, a corpus of eye-tracking data with accompanying predictability norms. The predictability norms for the Provo Corpus differ from those of other corpora. In addition to traditional cloze scores that estimate the predictability of the full orthographic form of each word, the Provo Corpus also includes measures of the predictability of the morpho-syntactic and semantic information for each word. This makes the Provo Corpus ideal for studying predictive processes in reading. Some analyses using these data have previously been reported elsewhere (Luke & Christianson, 2016). The Provo Corpus is available for download on the Open Science Framework, at

Presents a new interactive paradigm for iterative information retrieval based on the automatic construction of a faceted representation for a document subset. Facets are chosen heuristically based on lexical dispersion (a measure of the number of different terms with which a word co-occurs within noun phrases appearing in the document set). The phrases in which these terms occur serve to provide a set of attributes, specializations, related concepts, etc., for the so-identified "facet" terms. The resulting representation serves the dual purpose of providing a concise, structured summary of the contents of the result set, as well as presenting a set of terms for engaging in interactive query reformulation.
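The lexical dispersion measure defined in parentheses above can be sketched directly: for each term, count the distinct other terms it co-occurs with inside noun phrases. The noun phrases below are an invented miniature example; a real system would extract them from the retrieved documents:

```python
from collections import defaultdict

def lexical_dispersion(noun_phrases):
    """Map each term to the number of distinct terms it co-occurs with
    inside the given noun phrases."""
    partners = defaultdict(set)
    for np in noun_phrases:
        terms = np.lower().split()
        for t in terms:
            partners[t].update(w for w in terms if w != t)
    return {t: len(ws) for t, ws in partners.items()}

phrases = ["data model", "data set", "model checking"]
```

Terms with high dispersion ("data", "model" here) become facet candidates, and the phrases they occur in supply the facet's attribute terms.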

Lexical knowledge plays a vital role for systems translating between natural language and structured data, and an important part of such lexical knowledge are adjectives. In this paper we introduce a low-cost method for automatically acquiring adjective lexicalizations of restriction classes from a knowledge base by inspecting the range of properties. The resulting lexicalizations can then, for example, be added to the existing manual DBpedia lexicon, achieving a significant increase in coverage.

Transliteration of Arabic numerals is not easily resolved. Arabic numerals occur frequently in scientific and informative texts and carry significant meaning. Since the readings of Arabic numerals depend largely on their context, generating accurate pronunciations of Arabic numerals is one of the critical criteria in evaluating TTS systems. In this paper, (1) contextual, pattern, and arithmetic features are extracted from a transliterated corpus; (2) ambiguities of homographic classifiers are resolved based on the semantic relations in KorLex1.0 (Korean Lexico-Semantic Network); (3) a classification model for accurate and efficient transliteration of Arabic numerals is proposed in order to improve Korean TTS systems. The proposed model yields 97.3% accuracy, which is 9.5% higher than that of a customized Korean TTS system.

Rigorous data analysis is a cornerstone of empirical scientific research. In recent years, much attention in the field of cognitive neuroscience has been paid to the issue of correct statistical procedures (e.g., Kriegeskorte et al., 2009; Nieuwenhuis et al., 2011; Kilner, 2013). However, an additional essential aspect of data analysis, which has attracted relatively little attention, is the errors (bugs) in custom data analysis programming code. Whereas in its broad definition a bug can be any type of error, here I refer to it as a problem in the code that does not lead to a failure (error or crash) during execution; that is, code that contains a bug completes its execution properly, but the output result is incorrect. Notably, if an erroneous output result consists of reasonable values, then the programmer might have no indication that something went wrong. It is impossible to estimate how many published studies contain results with bugs; however, given the ubiquity of bugs in non-scientific (industrial) software (Zhivich and Cunningham, 2009), it is plausible that academic code is no exception. Furthermore, the quality of custom code in the field of cognitive neuroscience in particular is probably even worse than in industry because (a) code in the field of cognitive neuroscience is usually written by people with only basic programming training; after all, they are brain researchers, not software engineers; (b) the code is most often programmed by a single researcher, without any code peer-review procedure (Wiegers, 2002); and (c) the custom code is usually used by only a single lab member or a few. This last point is critical, because when the code is used by many, as is the case with large open-source projects like SPM, FieldTrip, EEGLab or PyMVPA, the likelihood is low that a serious bug would remain unnoticed for a long time.
The goal of this paper is to give practical advice to cognitive neuroscientists on how to minimize bugs in their custom data analysis code. The advice is illustrated using MATLAB schematic samples; executable code examples can be found in the Supplementary Materials.
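The paper's own samples are in MATLAB; as an illustration of the same defensive style in another language, here is a Python sketch in which plausibility assertions turn a silent bug into an immediate crash. The data format and the reaction-time thresholds are assumptions made for this sketch, not values from the paper:

```python
def mean_reaction_time(trials):
    """Mean reaction time over correct trials, with sanity checks that make
    a silently wrong result fail loudly instead."""
    assert trials, "no trials loaded: check the input file"
    rts = [t["rt"] for t in trials if t["correct"]]
    assert rts, "no correct trials left after filtering"
    mean = sum(rts) / len(rts)
    # Plausibility check: human reaction times in seconds should fall in a
    # sane range; a unit mix-up (ms vs s) would trip this immediately.
    assert 0.05 < mean < 5.0, f"implausible mean RT: {mean}"
    return mean
```

The point is not the specific thresholds but the habit: every intermediate result that has a known plausible range gets an assertion, so a bug cannot quietly propagate into a published figure.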

Semantic similarity measures within a reference ontology have been used in a few ontology alignment (OA) systems. Most use a single reference ontology, typically WordNet, and a single similarity measure within it. The mediating matcher with semantic similarity (MMSS) was added to AgreementMaker to incorporate the selection of a semantic similarity measure and the combination of multiple reference ontologies in an adaptable fashion. The results of experiments using the MMSS on the anatomy track of the Ontology Alignment Evaluation Initiative (OAEI) are reported. A variety of semantic similarity measures are applied within multiple reference ontologies. Using multiple reference ontologies with the MMSS improved alignment results. All information-content based semantic similarity measures produced better alignment results than a path-based semantic similarity measure.

We start from a web-oriented system for evaluating, presenting, processing, enlarging and annotating corpora of translations, previously applied to a real MT evaluation task, involving classical subjective measures, objective n-gram-based scores, and objective post-edition-based task-related evaluation. We describe its recent extension to support the high-quality translation into French of the large on-line Encyclopedia of Life Support Systems (EOLSS) presented as documents each made of a web page and a companion UNL file, by applying contributive on-line human post-edition to results of Machine Translation systems and of UNL deconverters. Target language web pages are generated on the fly from source language ones, using the best target segments available in the database. 25 documents (about 220,000 words) of the EOLSS are now available in French, Spanish, Russian, Arabic and Japanese. MT followed by contributive incremental cheap or free post-edition is now proved to be a viable way of making difficult information available in many languages.

We present an automatic approach to the construction of BabelNet, a very large, wide-coverage multilingual semantic network. Key to our approach is the integration of lexicographic and encyclopedic knowledge from WordNet and Wikipedia. In addition, Machine Translation is applied to enrich the resource with lexical information for all languages. We first conduct in vitro experiments on new and existing gold-standard datasets to show the high quality and coverage of BabelNet. We then show that our lexical resource can be used successfully to perform both monolingual and cross-lingual Word Sense Disambiguation: thanks to its wide lexical coverage and novel semantic relations, we are able to achieve state-of-the-art results on three different SemEval evaluation tasks.

Extracting sentiments from unstructured text has emerged as an important problem in many disciplines. An accurate method would enable us, for example, to mine online opinions from the Internet and learn customers\u2019 preferences for economic or marketing research, or for leveraging a strategic advantage. In this paper, we propose a two-stage Bayesian algorithm that is able to capture the dependencies among words, and, at the same time, finds a vocabulary that is efficient for the purpose of extracting sentiments. Experimental results on online movie reviews and online news show that our algorithm is able to select a parsimonious feature set with substantially fewer predictor variables than in the full data set and leads to better predictions about sentiment orientations than several state-of-the-art machine learning methods. Our findings suggest that sentiments are captured by conditional dependence relations among words, rather than by keywords or high-frequency words.

This research aims to identify the character of the base in the derivational processes with the suffixes -ment and -ness, and to explain the function of these suffixes as used in the Oxford English Dictionary. The type of this research is descriptive qualitative. The technique of collecting data is documentation. The steps are reading the dictionary, classifying and analyzing the data, taking notes, and browsing the internet.

To answer those problems, this research employs morphological analysis. The objectives are identifying and analyzing whether or not the newly derived words have changed in syntactic category. The results show that derivation with the suffix -ment accounts for 30% of the words in the dictionary, and derivation with the suffix -ness for 70%. The suffix -ment changes the grammatical category from verb to noun, and the suffix -ness changes the grammatical category from adjective to noun.

This paper presents two groups of text encoding problems encountered by the Brown University Women Writers Project (WWP). The WWP is creating a full-text database of transcriptions of pre-1830 printed books written by women in English. For encoding our texts we use Standard Generalized Markup Language (SGML), following the Text Encoding Initiative's Guidelines for Electronic Text Encoding and Interchange. SGML is a powerful text encoding system for describing complex textual features, but a full expression of these may require very complex encoding, and careful thought about the intended purpose of the encoded text. We present here several possible approaches to these encoding problems, and analyze the issues they raise.

This paper presents a general approach for open-domain question answering (QA) that models interactions between paragraphs using structural information from a knowledge base. We first describe how to construct a graph of passages from a large corpus, where the relations are either from the knowledge base or the internal structure of Wikipedia. We then introduce a reading comprehension model which takes this graph as an input, to better model relationships across pairs of paragraphs. This approach consistently outperforms competitive baselines in three open-domain QA datasets, WebQuestions, Natural Questions and TriviaQA, improving the pipeline-based state-of-the-art by 3–13%.

Of basic interest is the quantification of the long term growth of a language's lexicon as it develops to more completely cover both a culture's communication requirements and knowledge space. Here, we explore the usage dynamics of words in the English language as reflected by the Google Books 2012 English Fiction corpus. We critique an earlier method that found decreasing birth and increasing death rates of words over the second half of the 20th Century, showing death rates to be strongly affected by the imposed time cutoff of the arbitrary present and not increasing dramatically. We provide a robust, principled approach to examining lexical evolution by tracking the volume of word flux across various relative frequency thresholds. We show that while the overall statistical structure of the English language remains stable over time in terms of its raw Zipf distribution, we find evidence of an enduring 'lexical turbulence': The flux of words across frequency thresholds from decade to decade scales superlinearly with word rank and exhibits a scaling break we connect to that of Zipf's law. To better understand the changing lexicon, we examine the contributions to the Jensen-Shannon divergence of individual words crossing frequency thresholds. We also find indications that scholarly works about fiction are strongly represented in the 2012 English Fiction corpus, and suggest that a future revision of the corpus should attempt to separate critical works from fiction itself.
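The basic bookkeeping behind "word flux across frequency thresholds" can be sketched as follows: given word frequencies for two periods, find the words that enter or leave the top-k rank band between them. The decade frequencies below are invented toy data:

```python
def flux_across_threshold(freq_old, freq_new, k):
    """Words crossing the top-k rank boundary between two time periods."""
    def top(freqs):
        return set(sorted(freqs, key=freqs.get, reverse=True)[:k])
    old_top, new_top = top(freq_old), top(freq_new)
    return {"up": new_top - old_top, "down": old_top - new_top}

# Hypothetical per-decade counts:
decade_1900s = {"thee": 100, "car": 10, "phone": 1}
decade_2000s = {"thee": 1, "car": 10, "phone": 100}
```

Repeating this for many thresholds k and consecutive decades gives the flux volumes whose scaling with rank the abstract analyzes.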

In contemporary lexicography, particularly in learners' dictionaries, word frequency information from large corpora has been used for entry selection, sense ranking, and collocation identification as well as selecting defining vocabulary. However, age information in linguistic corpora has not been adequately highlighted or exploited. Early experiments have demonstrated that word retrieval in long-term memory is much more influenced by the age of acquisition than word frequency. For EFL English learners, it is necessary to know what words native speakers tend to use at different ages besides frequent words. Core vocabulary contains not simply those words with high frequency but also those with even distribution in different age groups. Learners' dictionaries with this kind of core vocabulary will be of much help for English learning and teaching as well as research in core vocabulary. Our research makes use of the age group information in the British National Corpus XML Edition (BNC XML 2007). It turns out that higher lexical coverage can be achieved when we select core vocabulary by the combined parameters of a word's dispersion index and distributed frequency in different age groups rather than raw frequency only. Moreover, our study shows that the young age group under 15 rely more on core vocabulary than adults due to its fundamental role in language learning. For the age group over 15 years old, core vocabulary occupies a stable proportion of their vocabulary size despite age increase. Another interesting finding is that each age group tends to acquire more core words selected on a frequency-age basis than those on a raw-frequency basis.
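The abstract does not name its dispersion index; Juilland's D is a standard choice for measuring how evenly a word is spread across subcorpora (here, age groups) and serves as a plausible sketch of the idea. D ranges from 0 (concentrated in one group) to 1 (perfectly even):

```python
import math

def juilland_d(freqs):
    """Juilland's D over equally sized subcorpora: D = 1 - (sd/mean)/sqrt(n-1)."""
    n = len(freqs)
    mean = sum(freqs) / n
    if mean == 0:
        return 0.0
    sd = math.sqrt(sum((f - mean) ** 2 for f in freqs) / n)
    return 1 - (sd / mean) / math.sqrt(n - 1)
```

A core-vocabulary selection of the kind described would then rank words by a combination of such a dispersion score and their frequency within each age group, rather than by raw corpus frequency alone.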

This study aims to find out the types of code switching used by Mario Teguh as the speaker in the TV program 'Mario Teguh Golden Ways 2015'. It also identifies the functions of code switching used in the program.

This study employed a descriptive qualitative method, with the researcher and a data sheet as the research instruments. The data of this study were in the form of words, phrases, clauses, and sentences uttered by the speaker of the Mario Teguh Golden Ways 2015 TV program. This research applied a triangulation technique to check and establish validity.

The results of this research show that three types of code switching are used by Mario Teguh as the speaker in the program: intersentential code switching (63%), tag switching (26%) and intrasentential code switching (11%). Intersentential code switching is the most frequent type in this TV program because it is considered the easiest type, as it does not require following the grammatical rules of the switched language. The most frequent code switching function is reiteration (40%) and the least frequent is quotation (3%). Reiteration is the most frequent function in this TV program because the speaker uses code switching when he wants to emphasize the idea of his utterances so that the audience can grasp it more easily.

Keywords: sociolinguistics, code switching, Mario Teguh Golden Ways

Sports fans generate a large amount of tweets which reflect their opinions and feelings about what is happening during various sporting events. Given the popularity of football events, in this work, we focus on analyzing sentiment expressed by football fans through Twitter. These tweets reflect the changes in the fans' sentiment as they watch the game and react to the events of the game, e.g., goal scoring, penalties, and so on. Collecting and examining the sentiment conveyed through these tweets will help to draw a complete picture which expresses fan interaction during a specific football event. The objective of this work is to propose a domain-specific approach for understanding sentiments expressed in football fans' conversations. To achieve our goal, we start by developing a football-specific sentiment dataset which we label manually. We then utilize our dataset to automatically create a football-specific sentiment lexicon. Finally, we develop a sentiment classifier which is capable of recognizing sentiments expressed in football conversation. We conduct extensive experiments on our dataset to compare the performance of different learning algorithms in identifying the sentiment expressed in football related tweets. Our results show that our approach is effective in recognizing the fans' sentiment during football events.

In this paper, we apply an information theoretic measure, the self-entropy of phoneme n-gram distributions, to quantify the amount of phonological variation in words for the same concepts across languages, thereby investigating the stability of concepts in a standardized concept list, based on the 100-item Swadesh list, specifically designed for automated language classification. Our findings are consistent with those of the ASJP project (Automated Similarity Judgment Program; Holman et al. 2008a). The correlation of our ranking with that of ASJP is statistically highly significant. Our ranking also largely agrees with two other reduced concept lists proposed in the literature. Our results suggest that n-gram analysis works at least as well as other measures for investigating the relation of phonological similarity to geographical spread, automatic language classification, and typological similarity, while being computationally considerably cheaper than the most widespread method (normalized Lev...
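The self-entropy measure can be sketched as the Shannon entropy of the pooled n-gram distribution over a concept's word forms across languages: concepts whose forms are phonologically similar across languages concentrate probability mass on few n-grams and score low. The word forms and the bigram/boundary conventions below are illustrative assumptions, not the paper's exact setup:

```python
from collections import Counter
import math

def ngram_self_entropy(word_forms, n=2):
    """Shannon entropy (bits) of the n-gram distribution pooled over
    one concept's word forms across languages."""
    grams = Counter()
    for w in word_forms:
        padded = f"#{w}#"  # mark word boundaries, a common n-gram convention
        grams.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    total = sum(grams.values())
    return -sum((c / total) * math.log2(c / total) for c in grams.values())
```

A stable concept (identical forms everywhere) yields lower entropy than one whose forms differ language by language, which is the basis of the stability ranking.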

The automatic extraction of metadata from wiki articles has to be carried out when a new article is saved. The metadata types to extract are known in advance, e.g. as a consequence of an automatic classification of the text. Before the automatic extraction, training has to be performed; the system is therefore designed as a classification system, using the possible metadata types as classes. These classes are assigned to natural language expressions extracted from the article text. As a test data set, we use a number of Wikipedia articles and the corresponding DBpedia data, which represent sample metadata. Named entity recognition is used to retrieve candidates. Then, semantic, syntactic and lexical features are extracted. For the classification, a decision tree learner, a k-nearest-neighbour classifier and a naive Bayes classifier are compared.
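One of the compared classifiers, naive Bayes over symbolic features, can be sketched as follows. The feature names and metadata classes below are hypothetical stand-ins for the semantic, syntactic and lexical features the abstract mentions:

```python
from collections import Counter, defaultdict
from math import log

class NaiveBayes:
    """Tiny multinomial naive Bayes over symbolic features, with add-one smoothing."""
    def fit(self, X, y):
        self.classes = Counter(y)                       # class priors
        self.feat = defaultdict(Counter)                # per-class feature counts
        for feats, label in zip(X, y):
            self.feat[label].update(feats)
        self.vocab = {f for c in self.feat.values() for f in c}
        return self

    def predict(self, feats):
        def score(c):
            n = sum(self.feat[c].values()) + len(self.vocab)
            s = log(self.classes[c] / sum(self.classes.values()))
            return s + sum(log((self.feat[c][f] + 1) / n) for f in feats)
        return max(self.classes, key=score)

# Hypothetical NER candidates, each described by symbolic features,
# labeled with the metadata class they fill.
train = [
    (["cap:title", "pos:NNP", "ne:PERSON"], "author"),
    (["cap:title", "pos:NNP", "ne:PERSON"], "author"),
    (["shape:dddd", "pos:CD", "ne:DATE"], "birthYear"),
    (["shape:dddd", "pos:CD", "ne:DATE"], "birthYear"),
]
clf = NaiveBayes().fit([f for f, _ in train], [l for _, l in train])
assert clf.predict(["pos:CD", "ne:DATE"]) == "birthYear"
```

The decision tree and k-nearest-neighbour learners compared in the paper would consume the same candidate/feature pairs.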

We propose a novel, semantic-reasoning-based approach to look for potentially adverse drug-drug interactions (DDIs) by using a knowledge-base of biomedical public ontologies and datasets in a semantic graph representation. This approach makes it possible to find previously unknown relations between different biological entities like drugs, proteins and biological processes, and perform inferences on those relations. Finding nodes that represent drugs in this semantic graph, and intersecting pathways between these nodes (e.g. intersecting at a metabolic pathway step described in Reactome [1] data), can yield novel drug-drug interactions. The resulting pathways not only describe drug-drug interactions reflected in the literature, but also unstudied interactions that could elucidate reported adverse effects.

Word sense disambiguation (WSD) and coreference resolution are two fundamental tasks for natural language processing. Unfortunately, they are seldom studied together. In this paper, we propose to incorporate the coreference resolution technique into a word sense disambiguation system for improving disambiguation precision. Our work is based on the existing instance knowledge network (IKN) based approach for WSD. With the help of coreference resolution, we are able to connect related candidate dependency graphs at the candidate level and similarly the related instance graph patterns at the instance level in IKN together. Consequently, the contexts which can be considered for WSD are expanded and precision for WSD is improved. Based on Senseval-3 all-words task, we run extensive experiments by following the same experimental approach as the IKN based WSD. It turns out that each combined algorithm between the extended IKN WSD algorithm and one of the best five existing algorithms consistently outperforms the corresponding combined algorithm between the IKN WSD algorithm and the existing algorithm.

Published in English under the title: AGROVOC: a multilingual thesaurus of agricultural terminology : English version

Recurrent neural networks (RNNs) have been widely used in text similarity modeling for learning text semantic representations. However, as classical topic models suggest, a text contains many different latent topics, and the complete semantic information of the text is described by all of these topics together. Previous RNN-based models usually learn the text representation from the separate words in the text rather than from topics, which introduces noise and loses hierarchical structure information in the text representation. In this paper, we propose a novel fractional latent topic based RNN (FraLT-RNN) model, which focuses on topic-level text representation and largely preserves the whole semantic information of a text. To be specific, we first adopt the fractional calculus to generate latent topics for a text from the hidden states learned by an RNN model. Then, we propose a topic-wise attention gating mechanism and embed it into our model to generate a topic-level attentive vector for each topic. Finally, we weight each topic perspective with its topic-level attention to form the text representation. Experiments on four benchmark datasets, namely TREC-QA and WikiQA for answer selection, MSRP for paraphrase identification, and MultiNLI for textual entailment, show the great advantages of our proposed model.

Besides syntactic structure, lexical semantic features are also closely related to semantic roles. Lexical semantic features could therefore help solve problems that cannot be handled well by syntactic features alone. In this paper, lexical semantic features such as valency number and the semantic classes of subject and object are introduced, drawn from the Peking University semantic dictionary CSD. The 10-fold cross-validation results show that, by applying the semantic dictionary, the overall F-score increases by 1.11%, and the F-scores of Arg0 and Arg1 reach 93.85% and 90.60% respectively, which are 1.10% and 1.26% higher than the results depending on syntactic features only.

We consider the task of interpreting and understanding a taxonomy of classification terms applied to documents in a collection. In particular, we show how unsupervised topic models are useful for interpreting and understanding MeSH, the Medical Subject Headings applied to articles in MEDLINE. We introduce the resampled author model, which captures some of the advantages of both the topic model and the author-topic model. We demonstrate how topic models complement and add to the information conveyed in a traditional listing and description of a subject heading hierarchy.

This article explores the usage of singular HE and plural THEY with their possessive, objective and reflexive forms in anaphoric reference to compound indefinite pronouns in written present-day English. Previous studies have indicated that the most commonly used personal pronouns in anaphoric reference to non-referential indefinite pronouns are indeed HE and THEY. The data for the study are drawn from the written part of the British National Corpus. The structure of the study is such that following the introduction, I will survey the earlier literature on the topic to illustrate that there is a gap in the previous studies on epicene pronouns. The third section defines the indefinite pronouns used in this study. In addition, the section also discusses the differences between the meaning and form of the indefinites and the semantic reference sets of each pronoun paradigm. Following the explanation of the methods, the article sets out the findings.

Traditional Chinese medicine (TCM) is a clinical medicine. The huge volume of clinical data from the daily clinical process, which conforms to TCM theories and principles, is the core empirical knowledge source for TCM research. Inducing common knowledge or regularities from this large-scale clinical data is a vital task for both theoretical and clinical research of TCM. Topic models have recently shown much success in text analysis and information retrieval by extracting latent topics from text collections. In this paper, we propose a hierarchical symptom-herb topic model (HSHT) to automatically extract hierarchical latent topic structures with both symptoms and their corresponding herbs from TCM clinical data. The HSHT model extends the hierarchical latent Dirichlet allocation model (hLDA) and link latent Dirichlet allocation (LinkLDA). The proposed HSHT model is used for extracting the hierarchical structure of symptoms with their corresponding herbs in clinical type 2 diabetes mellitus (T2DM). We obtain one meaningful super-topic with common symptoms and commonly used herbs, and several meaningful sub-topics denoting T2DM complications with their corresponding symptoms and commonly used herbs. The results indicate some important medical groups corresponding to companion diseases in T2DM inpatients, and show that TCM diagnosis and treatment sub-categories and personalized therapies for T2DM do exist. Furthermore, the results suggest that the HSHT model is useful for establishing TCM clinical guidelines based on TCM clinical data.

A substantial amount of subjectivity is involved in how people use language and conceptualize the world. Computational methods and formal representations of knowledge usually neglect this kind of individual variation. We have developed a novel method, Grounded Intersubjective Concept Analysis (GICA), for the analysis and visualization of individual differences in language use and conceptualization. The GICA method first employs a conceptual survey or a text mining step to elicit from varied groups of individuals the particular ways in which terms and associated concepts are used among the individuals. The subsequent analysis and visualization reveals potential underlying groupings of subjects, objects and contexts. One way of viewing the GICA method is to compare it with the traditional word space models. In the word space models, such as latent semantic analysis (LSA), statistical analysis of word-context matrices reveals latent information. A common approach is to analyze term-document matrices. The GICA method extends the basic idea of the traditional term-document matrix analysis to include a third dimension of different individuals. This leads to the formation of a third-order tensor of size subjects × objects × contexts. Through flattening into a matrix, these subject-object-context (SOC) tensors can again be analyzed using various computational methods including principal component analysis (PCA), singular value decomposition (SVD), independent component analysis (ICA) or any existing or future method suitable for analyzing high-dimensional data sets. In order to demonstrate the use of the GICA method, we present the results of two case studies. In the first case, GICA of health-related concepts is conducted. In the second one, the State of the Union addresses by US presidents are analyzed.
In these case studies, we apply multidimensional scaling (MDS), the self-organizing map (SOM) and Neighborhood Retrieval Visualizer (NeRV) as specific data analysis methods within the overall GICA method. The GICA method can be used, for instance, to support education of heterogeneous audiences, public planning processes and participatory design, conflict resolution, environmental problem solving, interprofessional and interdisciplinary communication, product development processes, mergers of organizations, and building enhanced knowledge representations in semantic web.
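The flatten-then-analyze step described above can be sketched as follows. The tensor dimensions and random data are hypothetical; the PCA is done via SVD of the centered, subject-mode flattening:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical subjects x objects x contexts tensor: 6 subjects respond to
# 5 terms (objects) in 4 usage contexts.
soc = rng.random((6, 5, 4))

# Flatten ("matricize") along the subject mode: one row per subject,
# one column per (object, context) pair.
X = soc.reshape(soc.shape[0], -1)            # 6 x 20

# PCA via SVD of the centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = U[:, :2] * S[:2]                    # 2-D map of the subjects

assert X.shape == (6, 20)
assert coords.shape == (6, 2)
```

Flattening along the object or context mode instead would analogously reveal groupings of objects or contexts, and the same matrix feeds ICA, MDS, SOM or NeRV.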

The 28-volume historical Ordbog over det danske Sprog (ODS, Dictionary of the Danish Language), by far the most comprehensive dictionary of the Danish language, was initiated c. 1900 by Verner Dahlerup and published 1919–56 by Det Danske Sprog- og Litteraturselskab (DSL, Society for Danish Language and Literature). In 1907, Dahlerup presented his principles for the new dictionary in the periodical Danske Studier (Studies in Danish). From his article it appears that ODS was conceived on a much smaller scale than it proved to be after Dahlerup handed over his project to DSL in 1915. The paper explains to what extent DSL did or did not realize the 1907 principles as to entry structure, definitions, etymology, word usage, foreign words etc.

This paper describes the general principles, design, and present state of the Czech National Corpus (CNC) project. The corpus has been designed to provide a firm basis for the study of both the contemporary written Czech (a goal well attainable with the present resources) and the Czech language beyond the limits of contemporary written texts (a long-term commitment including the building of a corpus of spoken Czech and diachronic and dialectal corpora). The work on the CNC project, now in the eighth year of its official existence, has resulted in the completion of SYN2000, a 100-million-word corpus of contemporary written Czech, the organization of the cores of spoken, diachronic, and dialectal corpora, and the finding of workable solutions to some general theoretical problems involved in the building of these corpora.

This paper describes our experience in preparing the data and evaluating the results for three subtasks of SemEval-2007 Task-17 - Lexical Sample, Semantic Role Labeling (SRL) and All-Words respectively. We tabulate and analyze the results of participating systems.

KG Cleaner is a framework to identify and correct errors in data produced and delivered by an information extraction system. These tasks have been understudied, and KG Cleaner is the first to address both. We introduce a multi-task model that jointly learns to predict whether an extracted relation is credible and to repair it if not. We evaluate our approach and other models as instances of our framework on two collections: a Wikidata corpus of nearly 700K facts and 5M fact-relevant sentences, and a collection of 30K facts from the 2015 TAC Knowledge Base Population task. For credibility classification, we find that parameter-efficient, simple shallow neural networks can achieve an absolute performance gain of 30 F1 points on Wikidata and comparable performance on TAC. For the repair task, a significant performance gain (more than twofold) can be obtained, depending on the nature of the dataset and the models.

A cognitively plausible measure of semantic similarity between geographic concepts is valuable across several areas, including geographic information retrieval, data mining, and ontology alignment. Semantic similarity measures are not intrinsically right or wrong, but obtain a certain degree of cognitive plausibility in the context of a given application. A similarity measure can therefore be seen as a domain expert summoned to judge the similarity of a pair of concepts according to her subjective set of beliefs, perceptions, hypotheses, and epistemic biases. Following this analogy, we first define the similarity jury as a panel of experts having to reach a decision on the semantic similarity of a set of geographic concepts. Second, we have conducted an evaluation of 8 WordNet-based semantic similarity measures on a subset of OpenStreetMap geographic concepts. This empirical evidence indicates that a jury tends to perform better than individual experts, but the best expert often outperforms the jury. In some cases, the jury obtains higher cognitive plausibility than its best expert.
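The jury idea reduces to aggregating the judgments of several similarity measures on the same concept pair. A minimal sketch follows; the individual measures and their scores are hypothetical stand-ins for the WordNet-based measures evaluated in the paper:

```python
# Each "expert" is a similarity measure mapping a concept pair to [0, 1];
# the jury's verdict is the mean of the experts' scores.
def jury(experts, pair):
    scores = [measure(pair) for measure in experts]
    return sum(scores) / len(scores)

# Hypothetical experts disagreeing on how similar two geographic concepts are.
m_path = lambda pair: {"river-canal": 0.6, "river-hill": 0.2}[pair]
m_wup  = lambda pair: {"river-canal": 0.8, "river-hill": 0.3}[pair]
m_lch  = lambda pair: {"river-canal": 0.5, "river-hill": 0.4}[pair]

experts = [m_path, m_wup, m_lch]
assert round(jury(experts, "river-canal"), 2) == 0.63
```

Cognitive plausibility would then be measured by correlating jury scores (versus each individual expert's scores) with human similarity ratings for the same pairs.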

Abstract One of the most challenging tasks in human-computer communication is the decomposition of meaning. The theory of semantic frames allows for the identification of the roles that various constituents have in an event: the doer of the action, the receiver of the action, the person towards whom the action is directed, the means and purposes of an action, etc. Through this paper, we propose to introduce semantic frames in eLearning contexts, with the conviction that users may find it easier to learn concepts if they are offered in a semantically related manner. In order to achieve this, we propose a system that, for every concept searched by the user, offers a network of concepts, by analyzing the semantic relations which appear between concepts. In other words, the proposed system starts with a concept, retrieves sentences containing it from the collection of learning materials and identifies the semantic relations between the considered concept and the ones found in their neighborhood using semantic role labeling. Additional information is completed using DBpedia's knowledge base before establishing the final network of relations.

Against the background of some of the major linguistic problems which demand our attention and which should point to some badly-needed criteria, the brief history and structure of the Czech National Corpus is outlined. The points seen as open include differences between various languages in their degree of explicitness, form-function relation, ellipsis, etc. It is argued that a more general and language-independent approach is necessary to handle, among other things, the multi-word units of the text; a general corpus maintenance and query system available to the increasing number of would-be users is required, too. The particular Czech solution, still being worked out and gradually implemented, is described in some detail.

This paper presents a set of methodologies and algorithms to create WordNets following the expand model. We explore dictionary and BabelNet based strategies, as well as methodologies based on the use of parallel corpora. Evaluation results for six languages are presented: Catalan, Spanish, French, German, Italian and Portuguese. Along with the methodologies and evaluation we present an implementation of all the algorithms grouped in a set of programs or toolkit. These programs have been successfully used in the Know2 Project for the creation of Catalan and Spanish WordNet 3.0. The toolkit is published under the GNU-GPL license and can be freely downloaded from http: //

We present a new corpus of 200 abstracts and 100 full text papers which have been annotated with named entities and relations in the biomedical domain as part of the OpenMinTeD project. This corpus facilitates the goal in OpenMinTeD of making text and data mining accessible to the users who need it most. We describe the process we took to annotate the corpus with entities (Metabolite, Chemical, Protein, Species, Biological Activity and Spectral Data) and relations (Isolated From, Associated With, Binds With and Metabolite Of). We report inter-annotator agreement (using F-score) for entities of between 0.796 and 0.892 using a strict matching protocol and between 0.875 and 0.963 using a relaxed matching protocol. For relations we report inter-annotator agreement of between 0.591 and 0.693 using a strict matching protocol and between 0.744 and 0.793 using a relaxed matching protocol. We describe how this corpus can be used within ChEBI to facilitate text and data mining and how the integration of this work with the OpenMinTeD text and data mining platform will aid curation of ChEBI and other biomedical databases.

We study how to learn a semantic parser of state-of-the-art accuracy with less supervised training data. We conduct our study on WikiSQL, the largest hand-annotated semantic parsing dataset to date. First, we demonstrate that question generation is an effective method that empowers us to learn a state-of-the-art neural network based semantic parser with thirty percent of the supervised training data. Second, we show that applying question generation to the full supervised training data further improves the state-of-the-art model. In addition, we observe that there is a logarithmic relationship between the accuracy of a semantic parser and the amount of training data.

Learner corpus research has witnessed a boom in the number of studies that investigate learners' use of multi-word combinations (see Paquot & Granger, 2012 for a recent overview). Several recent studies have adopted an approach first put forward by Schmitt and colleagues (e.g. Durrant & Schmitt, 2009) to assess whether and to what extent the word combinations used by learners are 'native-like' by assigning to each pair of words in a learner text an association score computed on the basis of a large reference corpus. Bestgen & Granger (2014), for example, used this procedure to analyse the Michigan State University Corpus of second language writing (MSU) and showed that mean Mutual Information (MI) scores of the bigrams used by L2 writers are positively correlated with human judgment of proficiency. Most studies so far have investigated positional co-occurrences, where words are said to co-occur when they appear within a certain distance from each other (Evert, 2004), and focused more particularly on adjacent word combinations (often in the form of adjective + noun combinations) (e.g. Li & Schmitt 2010, Siyanova & Schmitt 2008). Corpus linguists such as Evert & Krenn (2003), however, have argued strongly for a relational model of co-occurrences, where the co-occurring words appear in a specific structural relation (see also Bartsch, 2004). Paquot (2014) adopted a relational model of co-occurrences to evaluate whether such co-occurrences are good discriminators of language proficiency. She made use of the Stanford CoreNLP suite of tools to parse the French L1 component of the Varieties of English for Specific Purposes dAtabase (VESPA) and extract dependency relations in the form of triples of a relation between pairs of words such as dobj(win,lottery), i.e. "the direct object of win is lottery" (de Marneffe and Manning, 2013).
She then used association measures computed on the basis of a large reference corpus to analyse pairs of words in specific grammatical relations in three VESPA sub-corpora made up of texts rated at different CEFR levels (i.e. B2, C1 and C2). Findings showed that adjective + noun relations discriminated well between B2 and C2 levels; adverbial modifiers separated out B2 texts from the C1 and C2 texts; and verb + direct object relations set C2 texts apart from B2 and C1 texts. These results suggest that, used together, phraseological indices computed on the basis of relational dependencies are able to gauge language proficiency. The main objective of this study is to investigate whether relational co-occurrences also constitute valid indices of phraseological development. To do so, we replicate the method used in Paquot (2014) on data from the Longitudinal Database of Learner English (LONGDALE, Meunier 2013, forthcoming). In the LONGDALE project, the same students are followed over a period of at least three years and data collections are typically organized once per year. The 78 argumentative essays selected for this study were written by 39 French learners of English in Year 1 and Year 3 of their studies at the University of Louvain. Unlike in Year 2, students were requested to write on the same topic in Year 1 and Year 3, which allows us to control for topic, a variable that has been shown to considerably influence learners' use of word combinations (e.g. Cortes, 2004; Paquot, 2013). As in Paquot (2014), relational co-occurrences are operationalized in the form of word combinations used in four grammatical relations, i.e. adjective + noun, adverb + adjective, adverb + verb and verb + direct object, and extracted from the learner and reference corpora with the Stanford CoreNLP suite of tools.
We then assign to each pair of words in the LONGDALE corpus its MI score computed on the basis of the British National Corpus, and compute mean MI scores for each dependency relation in each learner text (cf. Bestgen & Granger, 2014). Distributions in the two learner data sets (i.e. Year 1 and Year 3) are tested for normality and accordingly compared with ANOVAs followed by Tukey contrasts or Kruskal-Wallis rank sum tests followed by pairwise comparisons using Wilcoxon rank sum tests. To explore the links between individual and group phraseological development trajectories, a detailed variability analysis using the method of individual profiling and visualization techniques will also be presented (cf. Verspoor & Smiskova, 2012). References Bartsch, Sabine (2004). Structural and Functional Properties of Collocations in English. A Corpus Study of Lexical and Pragmatic Constraints on Lexical Cooccurrence. Tubingen: Narr. Bestgen, Y., & Granger, S. (2014). Quantifying the development of phraseological competence in L2 English writing: An automated approach. Journal of Second Language Writing, 26, 28–41. Cortes, V. (2004). Lexical bundles in published and student disciplinary writing: Examples from history and biology. English for Specific Purposes 23(4): 397-423. De Marneffe, M.-C. & Manning, C. (2013). Stanford typed dependencies manual. Durrant, P., & Schmitt, N. (2009). To what extent do native and non-native writers make use of collocations? IRAL - International Review of Applied Linguistics in Language Teaching, 47(2), 157–177. doi:10.1515/iral.2009.007 Evert, S. (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD dissertation, IMS, University of Stuttgart. Evert, S. & Krenn, B. (2003). Computational approaches to collocations. Introductory course at the European Summer School on Logic, Language, and Information (ESSLLI 2003), Vienna. Available from [retrieved 5 February 2015] Granger, S. & Bestgen, Y. (2014).
The use of collocations by intermediate vs. advanced nonnative writers: A bigram-based study. International Review of Applied Linguistics in Language Teaching (IRAL) 52(3), 229-252. Li, J. & Schmitt, N. (2010). The development of collocation use in academic texts by advanced L2 learners: A multiple case-study approach. In Wood, D. (ed.), Perspectives on Formulaic Language: Acquisition and Communication. London: Continuum Press. Meunier, F. and Littre, D. (2013). Tracking Learners' Progress. Adopting a Dual 'Corpus Cum Experimental Data' Approach. The Modern Language Journal 97/1, 61-76. Meunier, F. (forthcoming) Introduction to the LONGDALE project. In Castello E., Ackerley K., Coccetta F. (eds.) Studies in Learner Corpus Linguistics: Research and Applications for Foreign Language Teaching and Assessment. Bern: Peter Lang. Paquot, M. (2013). Lexical bundles and L1 transfer effects. International Journal of Corpus Linguistics 18(3): 391-417. Paquot, M. (2014). Is there a role for the lexis-grammar interface in interlanguage complexity research? Paper presented at the Colloquium on cross-linguistic aspects of complexity in second language research, Vrije Universiteit Brussel, 19 December 2014, Brussels, Belgium. Available from [retrieved 5 February 2015] Paquot, M. & Granger, S. (2012). Formulaic language in learner corpora. Annual Review of Applied Linguistics 32, 130-149. Siyanova, A. & Schmitt, N. (2008). L2 learner production and processing of collocation: A multi-study perspective. Canadian Modern Language Review 64, 3: 429-458. Verspoor, M. & Smiskova, H. (2012). Foreign language writing development from a dynamic usage-based perspective. In Manchon, R. (Ed.), L2 Writing Development: Multiple Perspectives. Berlin: De Gruyter, 47-68.
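The association-scoring step used in these studies, assigning each extracted word pair an MI score from reference-corpus counts, can be sketched as follows. All counts below are hypothetical, not BNC figures:

```python
from math import log2

def mutual_information(pair_count, w1_count, w2_count, total_pairs):
    """Pointwise MI of a co-occurring pair, e.g. the dependency pair
    dobj(win, lottery), from reference-corpus counts."""
    p_pair = pair_count / total_pairs
    p1, p2 = w1_count / total_pairs, w2_count / total_pairs
    return log2(p_pair / (p1 * p2))

# Hypothetical counts from a reference corpus of 1,000,000 extracted pairs.
mi = mutual_information(pair_count=50, w1_count=2000, w2_count=300,
                        total_pairs=1_000_000)
assert mi > 0   # the pair co-occurs far more often than chance predicts
```

A learner text's phraseological score is then the mean MI over all pairs of a given grammatical relation found in that text.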

The aim of image captioning is to generate captions by machine that describe image contents as humans do. Despite many efforts, generating discriminative captions for images remains non-trivial. Most traditional approaches imitate language structure patterns and thus tend to fall into a stereotype of replicating frequent phrases or sentences, neglecting unique aspects of each image. In this work, we propose an image captioning framework with a self-retrieval module as training guidance, which encourages generating discriminative captions. It brings unique advantages: (1) the self-retrieval guidance can act as a metric and an evaluator of caption discriminativeness to assure the quality of generated captions; (2) the correspondence between generated captions and images is naturally incorporated in the generation process without human annotations, and hence our approach can utilize a large amount of unlabeled images to boost captioning performance with no additional laborious annotations. We demonstrate the effectiveness of the proposed retrieval-guided method on the MS-COCO and Flickr30k captioning datasets, and show its superior captioning performance with more discriminative captions.

While much health data is available online, patients who are not technically astute may be unable to access it because they may not know the relevant resources, they may be reluctant to confront an unfamiliar interface, and they may not know how to compose an answer from information provided by multiple heterogeneous resources. We describe ongoing research in using natural English text queries and automated deduction to obtain answers based on multiple structured data sources in a specific subject domain. Each English query is transformed using natural language technology into an unambiguous logical form; this is submitted to a theorem prover that operates over an axiomatic theory of the subject domain. Symbols in the theory are linked to relations in external databases known to the system. An answer is obtained from the proof, along with an English language explanation of how the answer was obtained. Answers need not be present explicitly in any of the databases, but rather may be deduced or computed from the information they provide. Although English is highly ambiguous, the natural language technology is informed by subject domain knowledge, so that readings of the query that are syntactically plausible but semantically impossible are discarded. When a question is still ambiguous, the system can interrogate the patient to determine what meaning was intended. Additional queries can clarify earlier ones or ask questions referring to previously computed answers. We describe a prototype system, Quadri, which answers questions about HIV treatment using the Stanford HIV Drug Resistance Database and other resources. Natural language processing is provided by PARC's Bridge, and the deductive mechanism is SRI's SNARK theorem prover. We discuss some of the problems that must be faced to make this approach work, and some of our solutions.

One key step towards extracting structured data from unstructured data sources is the disambiguation of entities. With AGDISTIS, we provide a time-efficient, state-of-the-art, knowledge-base-agnostic and multilingual framework for the disambiguation of RDF resources. The aim of this demo is to present the English, German and Chinese version of our framework based on DBpedia. We show the results of the framework on texts pertaining to manifold domains including news, sports, automobiles and e-commerce. We also summarize the results of the evaluation of AGDISTIS on several languages.

This paper examines Chinese English majors' use of linking words in spoken English and its characteristics. The study corpus is the spoken component SECCL2.0 of the Chinese Students' Spoken and Written English Corpus (SWECCL2.0); the reference corpus is the spoken component of the British National Corpus (BNC/S). The study finds that English majors and native speakers share common patterns in their use of spoken linking words, but also differ. In addition, Chinese English majors frequently misuse several linking words in speech. Based on these findings, the article offers some suggestions for the teaching of spoken English.

Language modeling (LM) involves determining the joint probability of words in a sentence. The conditional approach is dominant, representing the joint probability in terms of conditionals. Examples include n-gram LMs and neural network LMs. An alternative approach, called the random field (RF) approach, is used in whole-sentence maximum entropy (WSME) LMs. Although the RF approach has potential benefits, the empirical results of previous WSME models are not satisfactory. In this paper, we revisit the RF approach for language modeling, with a number of innovations. We propose a trans-dimensional RF (TDRF) model and develop a training algorithm using joint stochastic approximation and trans-dimensional mixture sampling. We perform speech recognition experiments on Wall Street Journal data, and find that our TDRF models perform as well as the recurrent neural network LMs while being computationally more efficient in computing sentence probability.
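The conditional approach that the RF approach is contrasted with can be illustrated by a minimal bigram LM, where the joint probability of a sentence factorizes into a product of conditionals. The toy corpus and add-one smoothing below are illustrative choices, not the paper's setup:

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Conditional-approach LM: P(sentence) as a product of bigram
    conditionals P(w_i | w_{i-1}), with add-one smoothing."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(toks[:-1])                 # history counts
        bigrams.update(zip(toks, toks[1:]))        # transition counts
    vocab = len(set(unigrams) | {w for _, w in bigrams})

    def prob(sentence):
        toks = ["<s>"] + sentence.split() + ["</s>"]
        p = 1.0
        for h, w in zip(toks, toks[1:]):
            p *= (bigrams[(h, w)] + 1) / (unigrams[h] + vocab)
        return p
    return prob

lm = train_bigram_lm(["the cat sat", "the dog sat"])
assert lm("the cat sat") > lm("sat the cat")
```

A whole-sentence RF model, by contrast, scores the complete sentence with a single unnormalized potential and must estimate a global normalizing constant, which is what makes its training hard.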

We report on interdisciplinary research which draws on both corpus linguistics and psycholinguistics, using three sorts of data on the use and concept of five high frequency multifunctional words: like, up, down, can and will. First, we present corpus frequency data on the uses of these words in spoken and written English from the British National Corpus and corpora of spoken and written New Zealand English. Second, we present data on how the five words are used in two coursebooks for adult learners of English as a second language. Although the five words differ from each other in their frequency within each corpus, the patterns of occurrence for each word are similar between the British and New Zealand corpora. Furthermore, for four of the words (up, down, can and will), the category forms which occur more frequently in the corpus data also occur more frequently in the coursebook data. For like, however, the corpus data show a clear preference for prepositional usage over verb usage, while the coursebook data indicate that like is first introduced to learners as a verb with no discussion of its prepositional usage. Third, we present data from a psycholinguistic experiment which gives an insight into naive native-speaking English users' processing of the five words. The self-paced reading experiment focuses on category ambiguity: verb and preposition uses of like, up and down, and modal and lexical verb uses of can and will. We found that for up, down, can and will, the processing preferences are compatible with the corpus and coursebook data. However, for like, the preferences are consistent with the corpus data rather than with the coursebook data. We argue that, while corpus frequency data and native speaker processing preferences need not be pedagogically prescriptive, they should inform pedagogy.

We demonstrate that a supervised annotation learning approach using structured features derived from tokens and prior annotations performs better than a bag-of-words approach. We present a general graph representation for automatically deriving these features from labeled data. Automatic feature selection based on class association scores requires a large amount of labeled data, and direct voting on structured features can be difficult and error-prone, even for language specialists. We show that rationales highlighted by the user can be used for indirect feature voting, and that the same performance can be achieved with less labeled data. We present our results on two annotation learning tasks for opinion mining from product and movie reviews.

The article describes an original method of creating a dictionary of abbreviations based on the Google Books Ngram Corpus. The dictionary of abbreviations is designed for Russian, yet since its methodology is universal it can be applied to any language. The dictionary can be used to determine the function of the period during text segmentation in various applied text-processing systems. The article describes difficulties encountered in the process of its construction as well as the ways to overcome them. A model for evaluating the probability of type I and type II errors (extraction precision and recall) is constructed. Some statistical data on the use of abbreviations are provided.
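The period-disambiguation use case mentioned above can be sketched as a lookup against such a dictionary of abbreviations; the abbreviation list below is a hypothetical toy, not the dictionary the article builds:

```python
# Sketch: deciding whether a trailing period ends a sentence or marks an
# abbreviation, using an abbreviation dictionary of the kind derived from
# n-gram data. The entries below are illustrative only.
ABBREVIATIONS = {"dr", "prof", "etc", "e.g", "i.e"}  # lowercased, final period stripped

def is_sentence_boundary(token: str) -> bool:
    """Return True if the period ending `token` likely ends a sentence."""
    if not token.endswith("."):
        return False
    return token[:-1].lower() not in ABBREVIATIONS

print(is_sentence_boundary("Dr."))    # False: known abbreviation
print(is_sentence_boundary("home."))  # True: likely sentence end
```

A real segmenter would combine this lookup with context (e.g. capitalization of the following token), but the dictionary test is the core signal.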

The English Word Speculum is a series of volumes illustrating the structural properties of written English words. The collection is intended to be used as a complement to the standard dictionaries of English words, and should be of particular value to linguists and students of English. The word list used to generate the volumes consists of the left-justified, bold-face words contained in the Shorter Oxford English Dictionary, a total of 73,582 words. Information about the parts of speech and status of the words in the list has been drawn from both the Shorter Oxford English Dictionary and the Merriam-Webster New International Dictionary, Third Edition. Volume V is constructed by reordering Volume III, the reverse word list. The primary ordering is according to the symbol in the vowel-string-count column. A new secondary ordering is introduced by sorting the records alphabetically on the merged part-of-speech field. Within this category, the list is further sorted according to the reverse word listing. The largest single set of words with the same part of speech and status is, of course, the standard nouns. Among the 22,000-odd two-vowel-string words, nearly 7,500 are nouns and have the standard designation as such in at least one of the two dictionary sources.

This chapter will help you assess whether or not to use automatic software to translate documents and emails. The last section introduces two wonderful online resources which you can use for translating words and phrases (and thus for checking your English): Reverso and Linguee.

Research on data-driven learning (DDL) has generally suggested that DDL can facilitate second/foreign language learning thanks to its inductive nature and authentic language samples, which provide learners with opportunities for deep, discovery-based learning. One major obstacle that prevents learners from enjoying these benefits is the cognitive load that such learning can induce. Limitations of DDL such as this have led to a call for more studies examining the differential effects of the inductive and, more traditional, deductive approaches in DDL-based ESL/EFL instruction. The present study compared the effects of the deductive and inductive approaches in a DDL context on vocabulary acquisition and retention. A total of 27 EFL learners were randomly divided into deductive and inductive groups, both making use of the Corpus of Contemporary American English (COCA) to learn eight target words. A modified version of the Vocabulary Knowledge Scale (VKS; Paribakht & Wesche, 1997) was used to assess the students' learning of word form, meaning, and use before, immediately after, and 2 weeks after the treatment instruction. Furthermore, the two groups' performances on the pre-test and immediate post-test were analyzed to scrutinize changes in their vocabulary knowledge. The results showed that both approaches were equally effective in facilitating the learners' vocabulary acquisition and retention. Furthermore, the inductive and deductive groups showed similar patterns in the acquisition of target words: their knowledge of words generally moved to a higher level, and seldom stayed at the same level or moved to a lower level. The finding that deductive DDL was just as effective as the inductive approach but less time-consuming suggests that the deductive approach could complement DDL efficiently when DDL's inductive nature prevents learners from fully benefiting from its potential advantages.

Applications which use human speech as input require a speech interface with high recognition accuracy. The words or phrases in the recognised text are annotated with a machine-understandable meaning and linked to knowledge graphs for further processing by the target application. These semantic annotations of recognised words can be represented as subject-predicate-object triples, which collectively form a graph often referred to as a knowledge graph. This type of knowledge representation makes it possible to use speech interfaces with any spoken-input application: since the information is represented in logical, semantic form, it can be retrieved and stored using standard web query languages. In this work, we develop a methodology for linking speech input to knowledge graphs and study the impact of recognition errors on the overall process. We show that for a corpus with a lower WER, the annotation and linking of entities to the DBpedia knowledge graph improves considerably. DBpedia Spotlight, a tool for interlinking text documents with linked open data, is used to link the speech recognition output to the DBpedia knowledge graph. Such a knowledge-based speech recognition interface is useful for applications such as question answering or spoken dialog systems.
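The triple representation described above can be sketched as follows; the utterance identifier, the `mentions` predicate and the helper function are illustrative assumptions, not the actual DBpedia Spotlight output format:

```python
# Sketch: turning entity annotations over recognized speech into
# subject-predicate-object triples. The URIs are real DBpedia resource URIs,
# but the triple shape ("mentions") is a simplifying assumption.

def annotations_to_triples(utterance_id, annotations):
    """Each annotation is (surface_form, linked_uri); emit one triple per link."""
    return [(utterance_id, "mentions", uri) for _, uri in annotations]

annots = [("Berlin", "http://dbpedia.org/resource/Berlin"),
          ("Germany", "http://dbpedia.org/resource/Germany")]
triples = annotations_to_triples("utt-1", annots)
print(triples[0])  # ('utt-1', 'mentions', 'http://dbpedia.org/resource/Berlin')
```

In a full pipeline the annotations would come from the Spotlight service and the triples would be stored in a graph store queryable with SPARQL.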

In this article we illustrate and evaluate an approach to create high quality linguistically annotated resources based on the exploitation of aligned parallel corpora. This approach is based on the assumption that if a text in one language has been annotated and its translation has not, annotations can be transferred from the source text to the target using word alignment as a bridge. The transfer approach has been tested and extensively applied for the creation of the MultiSemCor corpus, an English/Italian parallel corpus created on the basis of the English SemCor corpus. In MultiSemCor the texts are aligned at the word level and word sense annotated with a shared inventory of senses. A number of experiments have been carried out to evaluate the different steps involved in the methodology and the results suggest that the transfer approach is one promising solution to the resource bottleneck. First, it leads to the creation of a parallel corpus, which represents a crucial resource per se. Second, it allows for the exploitation of existing (mostly English) annotated resources to bootstrap the creation of annotated corpora in new (resource-poor) languages with greatly reduced human effort.

This paper describes a sense disambiguation method for a polysemous target noun using the context words surrounding the target noun and its WordNet relatives, such as synonyms, hypernyms and hyponyms. The result of sense disambiguation is a relative that can substitute for that target noun in a context. The selection is made based on co-occurrence frequency between candidate relatives and each word in the context. Since the co-occurrence frequency is obtainable from a raw corpus, the method is considered to be an unsupervised learning algorithm and therefore does not require a sense-tagged corpus. In a series of experiments using SemCor and the corpus of SENSEVAL-2 lexical sample task, all in English, and using some Korean data, the proposed method was shown to be very promising. In particular, its performance was superior to that of the other approaches evaluated on the same test corpora.
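A minimal sketch of the selection step described above, assuming toy sentences in place of a raw corpus and hand-picked relatives in place of WordNet lookups:

```python
# Toy sentences standing in for a raw (untagged) corpus.
sentences = [
    "the institution approved the loan",
    "money in the depository earns interest",
    "the riverside slope was muddy",
    "children played on the slope by the river",
]

def cooccurrence(word_a, word_b):
    """Number of sentences in which both words occur (a raw-corpus statistic)."""
    return sum(1 for s in sentences
               if word_a in s.split() and word_b in s.split())

# Candidate substitutes (WordNet relatives) per sense of the target noun "bank".
RELATIVES = {
    "bank(finance)": ["depository", "institution"],
    "bank(river)":   ["slope", "riverside"],
}

def disambiguate(context_words):
    """Pick the sense whose relatives co-occur most with the context words."""
    return max(RELATIVES, key=lambda sense: sum(
        cooccurrence(r, w) for r in RELATIVES[sense] for w in context_words))

print(disambiguate(["loan", "interest"]))  # -> bank(finance)
print(disambiguate(["muddy", "river"]))    # -> bank(river)
```

Because only co-occurrence counts from untagged text are needed, no sense-tagged corpus is required, which is the unsupervised character the abstract emphasizes.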

In this paper, we present preliminary work on corpus-based anaphora resolution of discourse deixis in German. Our annotation guidelines provide linguistic tests for locating the antecedent, and for determining the semantic types of both the antecedent and the anaphor. The corpus consists of selected speaker turns from the Europarl corpus.

Sentiment analysis is the task of determining the opinion expressed in subjective data, which may include microblog messages such as tweets. This type of message has been the target of sentiment analysis in many recent studies, since tweets represent a rich source of opinionated text. To determine the opinion expressed in tweets, different studies have employed distinct strategies, mainly supervised machine learning methods, and different kinds of features have been evaluated. Despite that, none of the state-of-the-art studies has evaluated distinct categories of features grouped by their shared characteristics. In this context, this work presents a literature review of the most common feature representations in Twitter sentiment analysis. We propose to group features sharing similar aspects into specific categories. We also evaluate the relevance of these categories of features, including meta-level features, using a significant number of Twitter datasets. Furthermore, we apply important and well-known feature selection strategies to identify relevant subsets of features for each dataset. We show in the experimental evaluation that the results achieved in this study using feature selection strategies outperform those reported in previous works for most of the assessed datasets.

In this paper we show several experiments motivated by the hypothesis that the number of relationships each synset has in WordNet 2.0 is related to which sense is the most frequent (MFS), because the MFS usually has a longer gloss, more examples of usage, more relationships with other words (synonyms, hyponyms), etc. We present a comparison of finding the MFS through the relationships in a semantic network (WordNet) versus measuring only the number of characters, words and other features in the gloss of each sense. We found that counting only inbound relationships differs from counting both inbound and outbound relationships, and that second-order relationships are not very helpful, even when restricted to be of the same kind. We analyze the contribution of each kind of relationship in a synset; and finally, we present an analysis of the cases where our algorithm finds the correct sense in SemCor when it differs from the MFS listed in WordNet.
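The relation-counting idea can be illustrated with a toy stand-in for WordNet; the senses and relation lists below are invented for the example, and the inbound/outbound split mirrors the distinction the abstract draws:

```python
# Hypothetical relation lists per sense; real counts would come from WordNet 2.0.
SENSES = {
    "bass(fish)":  {"inbound": ["percoid"],
                    "outbound": ["sea_bass", "freshwater_bass", "striper", "rockfish"]},
    "bass(music)": {"inbound": ["pitch", "voice"],
                    "outbound": ["basso"]},
}

def predict_mfs(senses, inbound_only=False):
    """Predict the most frequent sense as the sense with the most relations."""
    def n_relations(s):
        rels = senses[s]
        return len(rels["inbound"]) + (0 if inbound_only else len(rels["outbound"]))
    return max(senses, key=n_relations)

print(predict_mfs(SENSES))                     # all relations -> bass(fish)
print(predict_mfs(SENSES, inbound_only=True))  # inbound only  -> bass(music)
```

The two calls giving different answers illustrates the paper's observation that counting only inbound relationships differs from counting both directions.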

Image captioning is a multimodal task involving computer vision and natural language processing, where the goal is to learn a mapping from an image to its natural language description. In general, the mapping function is learned from a training set of image-caption pairs. However, for some languages, a large-scale image-caption paired corpus might not be available. We present an approach to this unpaired image captioning problem based on language pivoting. Our method can effectively capture the characteristics of an image captioner from the pivot language (Chinese) and align it to the target language (English) using a pivot-target (Chinese-English) parallel sentence corpus. We evaluate our method on two image-to-English benchmark datasets: MSCOCO and Flickr30K. Quantitative comparisons against several baseline approaches demonstrate the effectiveness of our method.

With the advent of word embeddings, lexicons are no longer fully utilized for sentiment analysis, although they still provide important features in the traditional setting. This paper introduces a novel approach to sentiment analysis that integrates lexicon embeddings and an attention mechanism into Convolutional Neural Networks. Our approach performs separate convolutions for word and lexicon embeddings and provides a global view of the document using attention. Our models are evaluated on both the SemEval'16 Task 4 dataset and the Stanford Sentiment Treebank, and show comparable or better results against the existing state-of-the-art systems. Our analysis shows that lexicon embeddings allow us to build high-performing models with much smaller word embeddings, and that the attention mechanism effectively dims out noisy words for sentiment analysis.

We marry two powerful ideas: deep representation learning for visual recognition and language understanding, and symbolic program execution for reasoning. Our visual question answering (VQA) system first recovers a structural scene representation from the image and a program trace from the question. It then executes the program on the scene representation to obtain an answer. Incorporating symbolic structure as prior knowledge offers three advantages. First, executing programs on a symbolic space is more robust to long program traces. Our model can solve complex reasoning tasks better, achieving an accuracy of 99.8% on the CLEVR dataset. Second, the model is more data- and memory-efficient: it learns to perform well on a small number of training data; it can also encode an image into a compact representation and answer questions offline, using only 1% of the storage needed by the best competing methods. Third, symbolic program execution offers full transparency to the reasoning process; we are thus able to interpret and diagnose each execution step. Our model recovers the ground truth programs precisely.

This paper reports on the design, development and evaluation of a bilingual telephone-based spoken language system for foreign exchange inquiries-the CU FOREX system. The telephone-based system supports Cantonese and English in a single spoken language interface to access real-time foreign exchange information. The specific domain covers information on the exchange rates between foreign currencies, as well as deposit interest rates of various time durations for a specified currency. Overall our system achieves a performance of 0.95 (kappa-statistic), with a typical transaction duration of 2 minutes on average.

In this paper, we present preliminary work on recognizing affect from a Korean textual document by using a manually built affect lexicon and adopting natural language processing tools. A manually built affect lexicon is constructed in order to be able to detect various emotional expressions, and its entries consist of emotion vectors. The natural language processing tools analyze an input document to enhance the accuracy of our affect recognizer. The performance of our affect recognizer is evaluated through automatic classification of song lyrics according to moods.

Distant supervision (DS) is an appealing learning method which learns from existing relational facts to extract more from a text corpus. However, its accuracy is still not satisfactory. In this paper, we point out and analyze some critical factors in DS which have a great impact on accuracy, including valid entity type detection, construction of negative training examples, and ensembles. We propose an approach to handle these factors. By experimenting on Wikipedia articles to extract the facts in Freebase (the top 92 relations), we show the impact of these three factors on the accuracy of DS and the remarkable improvement achieved by the proposed approach.

Recently, word embedding techniques that assign a multidimensional vector to each word in a given corpus are often used in various tasks in Natural Language Processing. Although most existing methods, such as word2vec, assign a single vector to each word, some advanced ones assign multiple vectors to a multisense word, corresponding to its individual meanings. Unfortunately, however, it is difficult to properly evaluate the word vectors assigned to multisense words using publicly available word similarity datasets. Thus, in this paper, we propose a novel dataset and a corresponding evaluation metric that enable us to evaluate word vectors learned with multisense words in mind. The proposed dataset consists of synsets in WordNet and BabelNet, which are well-known lexical databases, instead of individual words, and incorporates the distance between synsets in the concept hierarchies of WordNet and BabelNet to evaluate the similarity between word vectors. We empirically show that the proposed dataset and evaluation metric allow us to evaluate word vectors for multisense words more properly than metrics based on an existing dataset.

In recent years, while the Internet has brought various technological evolutions, users have been required to collect, select and integrate information according to their purposes. Against this background, ontologies, which systematize knowledge of a target world, have received a lot of attention. One method of automatically constructing super-sub relations, one of the important concepts in an ontology, uses lexico-syntactic patterns and a word dictionary. However, such methods cannot classify some relations correctly because they do not consider the semantic relations between words, and they cannot deal with words that are not in the dictionary. Therefore, a method that classifies super-sub relations using the wedge product of word vectors is proposed to solve these problems. As a result, the proposed method has been confirmed to achieve higher precision/recall than the baseline method.
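A minimal sketch of the wedge-product feature, using the identity |u ∧ v|² = |u|²|v|² − (u·v)²; treating a small wedge norm as a signal about the relation between two word vectors is an illustrative assumption, not the paper's full classifier:

```python
import math

# The magnitude of the exterior (wedge) product u ∧ v of two vectors satisfies
# |u ∧ v|^2 = |u|^2 |v|^2 - (u . v)^2, and is small when u and v are nearly
# parallel. The vectors below are arbitrary examples, not trained embeddings.

def wedge_norm(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    uu = sum(a * a for a in u)
    vv = sum(b * b for b in v)
    return math.sqrt(max(0.0, uu * vv - dot * dot))

parallel_ish = wedge_norm([1.0, 2.0, 3.0], [2.0, 4.0, 6.1])
orthogonal   = wedge_norm([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(parallel_ish < orthogonal)  # True: near-parallel pairs score lower
```

The classifier in the paper builds on this kind of geometric feature rather than on dictionary membership, which is why it can handle out-of-dictionary words.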

We present a set of stand-off annotations for the ninety thousand sentences in the spoken section of the British National Corpus (BNC) which feature a progressive aspect verb group. These annotations may be matched to the original BNC text using the supplied document and sentence identifiers. The annotated features mostly relate to linguistic form: subject type, subject person and number, form of auxiliary verb, and clause type, tense and polarity. In addition, the sentences are classified for register, the formality of recording context: three levels of 'spontaneity' with genres such as sermons and scripted speech at the most formal level and casual conversation at the least formal. The resource has been designed so that it may easily be augmented with further stand-off annotations. Expert linguistic annotations of spoken data, such as these, are valuable for improving the performance of natural language processing tools in the spoken language domain and assist linguistic research in general.

Knowledge Graphs are used in an increasing number of applications. Although considerable human effort has been invested into making knowledge graphs available in multiple languages, most knowledge graphs are in English. Additionally, regional facts are often only available in the language of the corresponding region. This lack of multilingual knowledge availability clearly limits the porting of machine learning models to different languages. In this paper, we aim to alleviate this drawback by proposing THOTH, an approach for translating and enriching knowledge graphs. THOTH extracts bilingual alignments between a source and target knowledge graph and learns how to translate from one to the other by relying on two different recurrent neural network models along with knowledge graph embeddings. We evaluated THOTH extrinsically by comparing the German DBpedia with the German translation of the English DBpedia on two tasks: fact checking and entity linking. In addition, we ran a manual intrinsic evaluation of the translation. Our results show that THOTH is a promising approach which achieves a translation accuracy of 88.56%. Moreover, its enrichment improves the quality of the German DBpedia significantly, as we report +18.4% accuracy for fact validation and +19% F\\(_1\\) for entity linking.

In this paper, we discuss a method of textual transformation between similar languages, taking Mongolian as an example. The transformation approach combines a knowledge-based rule bank with a data-driven method; a dynamic programming (DP) algorithm is applied to match source- and target-language words. Our experimental results demonstrate that the proposed method achieves 83.9% transformation accuracy (in F-measure) from NM (Cyrillic) to TM (Traditional Mongolian) text, and 88.1% from NM to TODO. Keywords: Mongolian language; similar-language cross processing; data-driven approach; knowledge-based rule bank; DP algorithm. International Conference on Industrial Technology and Management Science (ITMS 2015). © 2015 The authors. Published by Atlantis Press.

Mongolian is written in three scripts: (a) TM (the traditional Mongolian script, dating from the 13th century), used mainly in Inner Mongolia; (b) TODO (the TODO script, dating from the 17th century), used mainly in the Xinjiang area of China and in Kalmykia in Russia; and (c) Cyrillic (the Cyrillic alphabet, adopted at the beginning of the 20th century), used today in Mongolia and in areas such as Kalmykia and Buryatia in Russia. The sentence order and SOV structure (subject, object, verb) are the same across the three, but the rules for building words and the use of function words (suffixes or affixes) differ. As the sentence- and word-level alignments between TODO and TM show, a word in either TODO or Cyrillic may correspond to two or more words in TM, and there are clear differences in word formation and word order.
This means that it is quite difficult to transcribe multi-graphic Mongolian documents at the script (Unicode) or word level. Moreover, some words, especially between TODO and NM, are very similar when converted into nominal character (Latin or Unicode) form. To create word-to-word alignments and transformations, rule-based and statistical processing, such as segmentation of suffixes and syntactic analysis of root words in NM and TODO, is necessary for Mongolian. Research related to Mongolian natural language processing, and especially to textual transformation among these scripts, is currently rare. T. Ishikawa's research group introduced a character-unit conversion between Cyrillic and TM texts based on fundamental linguistic rules [6]. Although satisfactory conversion results were reported, the authors also pointed out that their approach was difficult to use when the source languages differed and when out-of-rule (OOV) words occurred frequently. The report [7] attempted a rule-based transformation between the TM and NM scripts; however, the method has limited capability for transforming the other scripts and cannot handle words not listed in a limited corpus.
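The DP word-matching step can be sketched as a standard edit-distance alignment between word sequences; the transliterated example words are purely illustrative:

```python
# Sketch: minimal edit distance between a source and target word sequence via
# dynamic programming, the kind of matching combined with the rule bank above.

def dp_align(src, tgt):
    """Edit distance (insert/delete/substitute, unit costs) over word lists."""
    m, n = len(src), len(tgt)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # delete from source
                          d[i][j - 1] + 1,       # insert into target
                          d[i - 1][j - 1] + cost)  # match or substitute
    return d[m][n]

print(dp_align("mongol hel".split(), "mongol helen".split()))  # 1 substitution
```

Backtracking through the same table recovers the actual word-to-word alignment, which is where one word in Cyrillic or TODO can map to several TM words.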

In the literature, hapax legomena in an English text or a collection of texts have been reported to account for roughly 50% of the vocabulary. This sort of constancy is baffling. The 100-million-word British National Corpus was used to study this phenomenon. The result reveals that the hapax/vocabulary ratio follows a U-shaped pattern. Initially, as the size of the text increases, the hapax/vocabulary ratio decreases; however, after the text size reaches about 3,000,000 words, the ratio starts to increase steadily. A computer simulation shows that as the text size continues to increase, the hapax/vocabulary ratio approaches 1.
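A toy version of such a simulation can be run by sampling "words" from a Zipf-like distribution and tracking the share of the vocabulary that occurs exactly once; the vocabulary size and exponent below are arbitrary choices, not the study's parameters:

```python
import random
from collections import Counter

# Sample tokens from a Zipf-like rank-frequency distribution and measure the
# hapax/vocabulary ratio at one corpus size.
random.seed(0)
VOCAB = 50_000
weights = [1.0 / (r ** 1.1) for r in range(1, VOCAB + 1)]  # Zipf-like weights

counts = Counter(random.choices(range(VOCAB), weights=weights, k=200_000))
hapaxes = sum(1 for c in counts.values() if c == 1)
ratio = hapaxes / len(counts)
print(f"hapax/vocabulary ratio at 200k tokens: {ratio:.2f}")
```

Repeating the measurement at increasing values of `k` would trace out the curve; whether a U-shape appears depends on the distribution assumed, which is exactly what makes the BNC finding interesting.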

Word sense disambiguation (WSD) is a fundamental problem in natural language processing, the objective of which is to identify the most appropriate sense of an ambiguous word in a given context. Although WSD has been researched for years, the performance of existing algorithms in terms of accuracy and recall is still unsatisfactory. In this paper, we propose a novel approach to word sense disambiguation based on topical and semantic association. For a given document, supposing that its topic category has been accurately discriminated, the correct sense of an ambiguous term is identified through the corresponding topic and semantic contexts. We first extract topic-discriminative terms from the document and construct a topical graph based on topic span intervals to implement topic identification. We then exploit syntactic features, topic span features, and semantic features to disambiguate nouns and verbs in the context of the ambiguous word. Finally, we conduct experiments on the standard SemCor data set to evaluate the performance of the proposed method, and the results indicate that our approach performs better than existing approaches.

This technical note describes a new baseline for the Natural Questions. Our model is based on BERT and reduces the gap between the model F1 scores reported in the original dataset paper and the human upper bound by 30% and 50% relative for the long and short answer tasks respectively. This baseline has been submitted to the official NQ leaderboard at this http URL. Code, preprocessed data and pretrained model are available at this https URL.

The most effective paradigm for word sense disambiguation, supervised learning, seems to be stuck because of the knowledge acquisition bottleneck. In this paper we present an in-depth study of the performance of decision lists on two publicly available corpora and an additional corpus automatically acquired from the Web, using the fine-grained, highly polysemous senses in WordNet. Decision lists are shown to be a versatile state-of-the-art technique. The experiments reveal, among other facts, that SemCor can be an acceptable starting point (0.7 precision for polysemous words) for an all-words system. The results on the DSO corpus show that for some highly polysemous words, 0.7 precision seems to be the current state-of-the-art limit. On the other hand, independently constructed hand-tagged corpora are not mutually useful, and a corpus automatically acquired from the Web is shown to fail.
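A Yarowsky-style decision list can be sketched as follows; the toy training set stands in for a sense-tagged corpus such as SemCor, and the smoothing constant is an arbitrary choice:

```python
import math
from collections import defaultdict

# Tiny sense-tagged training set for the ambiguous word "bass":
# (sense, context words). Purely illustrative data.
train = [
    ("fish",  ["caught", "river", "fried"]),
    ("fish",  ["river", "boat"]),
    ("music", ["guitar", "played", "loud"]),
    ("music", ["played", "band"]),
]

counts = defaultdict(lambda: defaultdict(int))
for sense, ctx in train:
    for w in ctx:
        counts[w][sense] += 1

def log_likelihood(w):
    """Strength and predicted sense of feature w (0.1 = smoothing constant)."""
    a = counts[w]["fish"] + 0.1
    b = counts[w]["music"] + 0.1
    return abs(math.log(a / b)), ("fish" if a > b else "music")

# Rank features by log-likelihood ratio: this ordered list IS the decision list.
decision_list = sorted(counts, key=lambda w: log_likelihood(w)[0], reverse=True)

def classify(ctx):
    for w in decision_list:   # first matching rule wins
        if w in ctx:
            return log_likelihood(w)[1]
    return "fish"             # back off to a default sense

print(classify(["river", "weekend"]))  # -> fish
print(classify(["guitar", "band"]))    # -> music
```

The single-best-rule decision makes the classifier easy to inspect, which is part of what makes decision lists attractive despite the acquisition bottleneck.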

In the first part, basic approaches to collocations and their nature are discussed. The survey covers (1) the corpus-psychological approach, (2) content vs. grammar words, (3) polysemous vs. monosemous words, (4) frequent vs. infrequent words, (5) collocationally large vs. restricted words, (6) stable vs. unstable collocations, and (7) regular vs. anomalous combinations. Some attention is paid to the relation of polysemy to frequency and to the role of valency. In the second part, selected words from LDCE's letter A are analysed for their coverage of collocations and confronted with what the British National Corpus has to offer. Finally, major open problems in the lexicographical coverage of collocations are listed.

An ACT-R List Learning Representation for Training Prediction. Michael Matessa (Alion Science and Technology, 6404 Cooper Street, Felton, CA 95018, USA). This paper presents a representation of training based on an ACT-R model of list learning. The benefit of the list-model representation for making training predictions can be seen in the accurate a priori predictions of trials to mastery given the number of task steps. The benefit of using accurate step times can be seen in the even more accurate post-hoc model results. Keywords: training; prediction; list length; ACT-R. Numerous studies have documented operational and training problems with modern autoflight systems, in particular the flight management system (FMS) and its pilot interface, the control display unit (CDU). In recent years, more attention has been given to the limitations of current autoflight training methods. Many studies have concluded that current training programs are inadequate in both depth and breadth of coverage of FMS functions (Air Transport Association, 1999; BASI, 1998; FAA Human Factors Team). Matessa and Polson (2006) proposed that the inadequacies of these programs are due to airline training practices that encourage pilots to master FMS programming tasks by memorizing lists of actions, one list for each task. Treating FMS programming skills as lists of actions can interfere with the acquisition of robust and flexible skills. This hypothesis of the negative consequences of a list-based representation was validated by Taatgen, Huss, and Anderson (2008), who showed poorer performance for a list-based representation compared to a stimulus-based representation. This paper extends the table-based training-time predictions of Matessa and Polson (2006) by presenting a computational model that represents procedure training as list learning.
The model is meant to describe training programs in which to-be-learned procedures are formally trained and trainees must demonstrate mastery before they can go on to more advanced, on-the-job training; airline transition training programs are examples of this paradigm. The model takes as input the number of steps in a procedure and the time per step, and it generates estimates of the training time required to master the procedure. Predictions of the model are compared to human data and show the benefit of the number-of-steps and step-time parameters. Novice pilots lack an organizing schema for memorizing lists of actions, so the actions are effectively represented as nonsense syllables (Matessa & Polson, 2006). Therefore, the list model does not represent the actual information to be learned but, as an engineering approximation, represents the training as learning a list of random digits. The model is motivated by the table-based list model of Matessa and Polson (2006), but is implemented in the ACT-R cognitive architecture (Anderson, 2007). The following description from Matessa and Polson (2006) shows how procedure learning can be represented as list learning, and how a table-based prediction of training time can be created based on procedure length. A representation of a task must encode both item (actions and parameters) and order information. Anderson, Bothell, Lebiere, and Matessa (1998) assumed that item and order information is encoded in a hierarchical retrieval structure incorporated in their ACT-R model of serial list learning (Figure 1). The order information is encoded in a hierarchically organized collection of chunks; the terminal nodes of this retrieval structure represent the item information.
The model assumes that pilots transitioning to their first FMS-equipped aircraft master a cockpit procedure by memorizing a serial list of declarative representations of individual actions or summaries of subsequences of actions. Each attempt to learn the list is assumed to be analogous to a test-study trial in a serial recall experiment. (Figure 1 shows the list-model representation for a list of nine digits, from Anderson et al., 1998.)

This paper describes our participation in the monolingual English GIRT task. The main objective of our experiments was to evaluate the use of the Mercure IRS (designed at IRIT/SIG) on a domain-specific corpus. Two further techniques of automatic query reformulation using WordNet are also evaluated.

We present the first results on parsing the SynTagRus treebank of Russian with a data-driven dependency parser, achieving a labeled attachment score of over 82% and an unlabeled attachment score of 89%. A feature analysis shows that high parsing accuracy is crucially dependent on the use of both lexical and morphological features. We conjecture that the latter result can be generalized to richly inflected languages in general, provided that sufficient amounts of training data are available.

Semantic ambiguity is typically measured by summing the number of senses or dictionary definitions that a word has. Such measures are somewhat subjective and may not adequately capture the full extent of variation in word meaning, particularly for polysemous words that can be used in many different ways, with subtle shifts in meaning. Here, we describe an alternative, computationally derived measure of ambiguity based on the proposal that the meanings of words vary continuously as a function of their contexts. On this view, words that appear in a wide range of contexts on diverse topics are more variable in meaning than those that appear in a restricted set of similar contexts. To quantify this variation, we performed latent semantic analysis on a large text corpus to estimate the semantic similarities of different linguistic contexts. From these estimates, we calculated the degree to which the different contexts associated with a given word vary in their meanings. We term this quantity a word's semantic diversity (SemD). We suggest that this approach provides an objective way of quantifying the subtle, context-dependent variations in word meaning that are often present in language. We demonstrate that SemD is correlated with other measures of ambiguity and contextual variability, as well as with frequency and imageability. We also show that SemD is a strong predictor of performance in semantic judgments in healthy individuals and in patients with semantic deficits, accounting for unique variance beyond that of other predictors. SemD values for over 30,000 English words are provided as supplementary materials.
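The core idea can be sketched in a few lines. This is a toy illustration, not the authors' implementation: it uses raw bag-of-words context vectors where the paper uses latent semantic analysis over a large corpus, and the vocabulary and contexts below are invented examples. SemD is the negative log of the mean pairwise cosine similarity among the contexts a word occurs in, so a word used in dissimilar contexts gets a higher score.

```python
import math
from collections import Counter

def context_vector(context, vocab):
    """Bag-of-words vector for one context (a list of tokens)."""
    counts = Counter(w for w in context if w in vocab)
    return [counts[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_diversity(word, contexts, vocab):
    """SemD: negative log of the mean pairwise cosine similarity
    between the contexts containing `word` (higher = more diverse)."""
    vecs = [context_vector(c, vocab) for c in contexts if word in c]
    sims = [cosine(vecs[i], vecs[j])
            for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    mean_sim = sum(sims) / len(sims)
    return -math.log(mean_sim)
```

On such toy data, a word like "bank" that splits across river and finance contexts receives a higher SemD than a word whose contexts are all alike.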

In this paper we present a question answering system that uses a neural network, trained on the DBpedia repository, to interpret questions. We train a sequence-to-sequence neural network model with n-triples extracted from the DBpedia Infobox Properties. Since these properties do not reflect natural language, we additionally used question-answer dialogues from movie subtitles. Although the automatic evaluation shows a low overlap between the generated answers and the gold standard set, a manual inspection of the generated answers showed promising outcomes and encourages further work.

EDBL (Euskararen Datu-Base Lexikala) is a general-purpose lexical database used in Basque text-processing tasks. It is a large repository of lexical knowledge (currently around 80,000 entries) that acts as a basis and support in a number of different NLP tasks, thus providing lexical information for several language tools: morphological analysis, spell checking and correction, lemmatization and tagging, syntactic analysis, and so on. It has been designed to be neutral in relation to the different linguistic formalisms, and flexible and open enough to accept new types of information. A browser-based user interface makes the job of consulting the database, correcting and updating entries, adding new ones, etc. easy for the lexicographer. The paper presents the conceptual schema and the main features of the database, along with some problems encountered in its design and implementation in a commercial DBMS. Given the diversity of the lexical entities and the complex relationships existing among them, three total specializations have been defined under the main class of the hierarchy that represents the conceptual schema. The first one divides all the entries in EDBL into Basque standard and non-standard entries. The second divides the units in the database into dictionary entries (classified into the different parts of speech) and other entries (mainly non-independent morphemes and irregularly inflected forms). Finally, another total specialization has been established between single-word entries and multiword lexical units; this permits us to describe the morphotactics of single-word entries, and the constitution and surface realization schemas of multiword lexical units. A hierarchy of typed feature structures (FS) has been designed to map the entities and relationships in the database conceptual schema. The FSs are coded in TEI-conformant SGML, and Feature Structure Declarations (FSD) have been made for all the types of the hierarchy.
Feature structures are used as a delivery format to export the lexical information from the database. The information coded in this way is subsequently used as input by the different language analysis tools.

In this article, a method for automatic sentiment analysis of movie reviews is proposed, implemented and evaluated. In contrast to most studies that focus on determining only sentiment orientation (positive versus negative), the proposed method performs fine-grained analysis to determine both the sentiment orientation and sentiment strength of the reviewer towards various aspects of a movie. Sentences in review documents contain independent clauses that express different sentiments toward different aspects of a movie. The method adopts a linguistic approach of computing the sentiment of a clause from the prior sentiment scores assigned to individual words, taking into consideration the grammatical dependency structure of the clause. The prior sentiment scores of about 32,000 individual words are derived from SentiWordNet with the help of a subjectivity lexicon. Negation is handled carefully. The output sentiment scores can be used to identify the most positive and negative clauses or sentences with respect to particular movie aspects.
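A toy sketch of the prior-score idea: sum per-word prior scores over a clause, flipping the sign of a sentiment word that follows a negator. The lexicon and the flat negation rule below are simplified assumptions; the actual method uses SentiWordNet-derived scores and the clause's dependency structure.

```python
# Hypothetical prior scores in [-1, 1], standing in for SentiWordNet values.
PRIOR = {"good": 0.7, "great": 0.9, "bad": -0.6, "boring": -0.8, "plot": 0.0}
NEGATORS = {"not", "never", "no"}

def clause_sentiment(tokens):
    """Sum prior word scores; a negator flips the sign of the next
    sentiment-bearing word (a crude stand-in for dependency handling)."""
    score, flip = 0.0, False
    for tok in tokens:
        if tok in NEGATORS:
            flip = True
        elif tok in PRIOR and PRIOR[tok] != 0.0:
            score += -PRIOR[tok] if flip else PRIOR[tok]
            flip = False
    return score
```

The sign of the result gives the orientation of the clause and its magnitude the strength, which is what allows the fine-grained, per-aspect ranking described above.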

We attempt to integrate the observed relation between prosodic information and the degree of precision and lack of ambiguity into the processing of the user's spoken input in the CitizenShield ("POLIAS") system for consumer complaints about commercial products. The prosodic information contained in the spoken descriptions provided by the consumers is preserved with the use of semantically processable markers, classifiable within an Ontological Framework, that signal prosodic prominence in the speaker's spoken input. Semantic processability is related to the reusability and/or extensibility of the present system to multilingual applications or even to other types of monolingual applications.

This release contains all the app data and vocabulary questionnaires collected for Carbajal et al.'s (in preparation) project on bilingual assimilation (OSF project). It also includes analysis scripts as of the 22nd of March, 2018.

Sentiment lexicons are the most widely used tool for automatically predicting sentiment in text. To the best of our knowledge, no openly available sentiment lexicons exist for the Norwegian language. In this paper we therefore applied two different strategies to automatically generate sentiment lexicons for Norwegian. The first strategy used machine translation to translate an English sentiment lexicon into Norwegian, and the other used information from three different thesauruses to build several sentiment lexicons. The thesaurus-based lexicons were built using the label propagation algorithm from graph theory. The lexicons were evaluated by classifying product and movie reviews. The results show satisfying classification performance, with different lexicons performing best on product and on movie reviews. Overall, the lexicon based on machine translation performed best, showing that linguistic resources in English can be translated into Norwegian without losing significant value.
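Label propagation over a synonym graph can be sketched as follows. This is a minimal illustration under invented data: the Norwegian seed words and edges are assumptions, not the paper's thesaurus graph, and real implementations normalize by edge weights.

```python
def propagate_labels(graph, seeds, iterations=10):
    """Simple label propagation: each unlabeled node repeatedly takes
    the average score of its neighbours; seed scores stay clamped."""
    scores = {n: seeds.get(n, 0.0) for n in graph}
    for _ in range(iterations):
        new = {}
        for node, neighbours in graph.items():
            if node in seeds:
                new[node] = seeds[node]          # clamp the seed polarity
            elif neighbours:
                new[node] = sum(scores[m] for m in neighbours) / len(neighbours)
            else:
                new[node] = scores[node]
        scores = new
    return scores

# Illustrative synonym graph: god/bra/fin ("good"), dårlig/elendig ("bad").
graph = {"god": ["bra"], "bra": ["god", "fin"], "fin": ["bra"],
         "dårlig": ["elendig"], "elendig": ["dårlig"]}
seeds = {"god": 1.0, "dårlig": -1.0}
lexicon = propagate_labels(graph, seeds)
```

After a few iterations the unlabeled synonyms inherit the polarity of their seed neighbourhood, which is exactly how a small seed set grows into a full lexicon.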

Landau (1991: 217) stipulates that 'usage refers to any or all uses of language'. It is the study of good, correct, or standard uses of language as distinguished from bad, incorrect, and nonstandard uses of language. Usage may also include the study of any limitations on the method of use, whether geographic, social or temporal. Basically it alerts users that certain terms should not be uncritically employed in communication. This article discusses the treatment of usage in English lexicography. It analyses the labelling practices in six monolingual English dictionaries namely: the Oxford Advanced Learner's Dictionary (OALD), the Macmillan English Dictionary (MED), the Longman Dictionary of Contemporary English (LDOCE), the Cambridge International Dictionary of English (CIDE), the World Book Dictionary (WBD) and the New Shorter Oxford English Dictionary (NSOED). Discrepancies in the contextual usage labelling in the dictionaries were established and are discussed.

Social media platforms such as Twitter and the Internet Movie Database (IMDb) contain a vast amount of data which have applications in predictive sentiment analysis for movie sales, stock market fluctuations, brand opinion, or current events. Using a dataset taken from IMDb by Stanford, we identify some of the most significant phrases for identifying sentiment in a wide variety of movie reviews. Data from Twitter are especially attractive due to Twitter's real-time nature through its streaming API. Effectively analyzing this data in a streaming fashion requires efficient models, which may be improved by reducing the dimensionality of input vectors. One way this has been done in the past is by using emoticons; we propose a method for further reducing these features through identifying common structure in emoticons with similar sentiment. We also examine the gender distribution of emoticon usage, finding tendencies towards certain emoticons to be disproportionate between males and females. Despite the roughly equal gender distribution on Twitter, emoticon usage is predominantly female. Furthermore, we find that distributed vector representations, such as those produced by Word2Vec, may be reduced through feature selection. This analysis was done on a manually labeled sample of 1000 tweets from a new dataset, the Large Emoticon Corpus, which consisted of about 8.5 million tweets containing emoticons and was collected over a five-day period in May 2015. Additionally, using the common structure of similar emoticons, we are able to characterize positive and negative emoticons using two regular expressions which account for over 90% of emoticon usage in the Large Emoticon Corpus.
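The shared eyes-nose-mouth structure of emoticons lends itself to two character-class regexes, one per polarity. The patterns below are an illustrative sketch, not the paper's actual expressions: an "eyes" character, an optional "nose", and a "mouth" character that determines the polarity.

```python
import re

# Hypothetical patterns: eyes [:;=8], optional nose, polarity-bearing mouth.
POSITIVE = re.compile(r"[:;=8][\-o\*']?[)\]dDpP]")
NEGATIVE = re.compile(r"[:;=8][\-o\*']?[(\[/\\]")

def emoticon_polarity(text):
    """Classify a tweet's polarity from the first emoticon found."""
    if POSITIVE.search(text):
        return "positive"
    if NEGATIVE.search(text):
        return "negative"
    return "none"
```

Collapsing many surface variants (":)", ":-)", "=)", "8)") into one pattern per polarity is what reduces the feature dimensionality for streaming models.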

This paper presents two groups of text encoding problems encountered by the Brown University Women Writers Project (WWP). The WWP is creating a full-text database of transcriptions of pre-1830 printed books written by women in English. For encoding our texts we use Standard Generalized Markup Language (SGML), following the Text Encoding Initiative's Guidelines for Electronic Text Encoding and Interchange. SGML is a powerful text encoding system for describing complex textual features, but a full expression of these may require very complex encoding, and careful thought about the intended purpose of the encoded text. We present here several possible approaches to these encoding problems, and analyze the issues they raise.

In this paper, we implicitly incorporate morpheme information into word embeddings. Based on the strategy by which we utilize the morpheme information, three models are proposed. To test the performance of our models, we conduct word similarity and syntactic analogy tasks. The results demonstrate the effectiveness of our methods: our models beat the comparative baselines on both tasks by a large margin. On the gold-standard Wordsim-353 and RG-65, our models outperform CBOW by approximately 5 and 7 percent, respectively. In addition, a 7 percent advantage is also achieved by our models on the syntactic analogy task. According to the parameter analysis, our models can increase the semantic information extracted from the corpus, and our performance on the smallest corpus is similar to that of CBOW on a corpus five times the size of ours. This property of our methods may benefit NLP research on corpus-limited languages.
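One simple way to picture injecting morpheme information into a word vector is to average it with the mean of its morpheme vectors. This composition rule and the toy vectors are illustrative assumptions; the models above learn morpheme and word vectors jointly during training rather than composing fixed vectors.

```python
def compose(word_vec, morpheme_vecs):
    """Enrich a word vector with morpheme information by averaging it
    with the mean of its morpheme vectors (illustrative rule only)."""
    dim = len(word_vec)
    mean_morph = [sum(m[i] for m in morpheme_vecs) / len(morpheme_vecs)
                  for i in range(dim)]
    return [(word_vec[i] + mean_morph[i]) / 2 for i in range(dim)]
```

Under such a scheme, rare words sharing morphemes with frequent ones (e.g. "un-" + "kind") are pulled toward sensible regions of the space, which is why the approach helps most on small corpora.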

Language complexity is an intriguing phenomenon argued to play an important role in both language learning and processing. The need to compare languages with regard to their complexity resulted in a multitude of approaches and methods, ranging from accounts targeting specific structural features to global quantification of variation more generally. In this paper, we investigate the degree to which morphological complexity measures are mutually correlated in a sample of more than 500 languages of 101 language families. We use human expert judgements from the World Atlas of Language Structures (WALS), and compare them to four quantitative measures automatically calculated from language corpora. These consist of three previously defined corpus-derived measures, which are all monolingual, and one new measure based on automatic word-alignment across pairs of languages. We find strong correlations between all the measures, illustrating that both expert judgements and automated approaches converge to similar complexity ratings, and can be used interchangeably.
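Monolingual corpus-derived complexity measures of this family can be as simple as the entropy of the word-form distribution: richer inflection spreads probability mass over more distinct forms. The function below is an illustrative stand-in of that kind, not necessarily one of the three measures used in the paper.

```python
import math
from collections import Counter

def wordform_entropy(tokens):
    """Shannon entropy (bits) of the word-form distribution, a simple
    corpus-derived proxy for morphological complexity."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A corpus of a heavily inflecting language, with many distinct surface forms per lemma, scores higher on such a measure than one repeating a few invariant forms.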

Part 1: An Essential Grammar 1 Romanian Sounds and Letters 2 Nouns 3 Articles 4 Adjectives 5 Pronouns 6 Numerals 7 Verbs 8 Adverbs 9 Prepositions 10 Conjunctions 11 Interjections 12 Word Order and Punctuation Part 2: Language Functions 13 Socialising 14 Exchanging Factual Information 15 Expressing Opinions and Attitudes 16 Judgment and Evaluation. Appendix 1: Verb Table. Appendix 2: Useful Romanian Internet Sites

This article looks at how a comprehensive list of one category of idioms, that of "core idioms", was established. When the criteria to define a core idiom were strictly applied to a dictionary of idioms, the result was that the large number of "idioms" was reduced to a small number of "core idioms". The original list from the first source dictionary was added to by applying the same criteria to other idiom dictionaries, and other sources of idioms. Once the list was complete, a corpus search of the final total of 104 "core idioms" was carried out in the British National Corpus (BNC). The search revealed that none of the 104 core idioms occurs frequently enough to merit inclusion in the 5,000 most frequent words of English.

Most deep learning approaches for text-to-SQL generation are limited to the WikiSQL dataset, which only supports very simple queries. Recently, template-based and sequence-to-sequence approaches were proposed to support complex queries, which contain join queries, nested queries, and other types. However, Finegan-Dollak et al. (2018) demonstrated that both the approaches lack the ability to generate SQL of unseen templates. In this paper, we propose a template-based one-shot learning model for the text-to-SQL generation so that the model can generate SQL of an untrained template based on a single example. First, we classify the SQL template using the Matching Network that is augmented by our novel architecture Candidate Search Network. Then, we fill the variable slots in the predicted template using the Pointer Network. We show that our model outperforms state-of-the-art approaches for various text-to-SQL datasets in two aspects: 1) the SQL generation accuracy for the trained templates, and 2) the adaptability to the unseen SQL templates based on a single example without any additional training.
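The final slot-filling stage can be pictured as substituting question tokens into the predicted template. The sketch below is a toy illustration of that step only; in the model a Pointer Network selects the values from the question, and the `<name>` slot syntax here is an assumption.

```python
def fill_template(template, slots):
    """Fill named variable slots (written <name>) in a predicted SQL
    template with values copied from the natural-language question."""
    sql = template
    for name, value in slots.items():
        sql = sql.replace("<" + name + ">", value)
    return sql
```

Separating template prediction from slot filling is what lets a single example of an unseen template suffice: only the template classifier needs to adapt, while the copying mechanism is template-agnostic.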

Three common approaches for deriving or predicting instantiated relations are information extraction, deductive reasoning and machine learning. Information extraction uses subsymbolic unstructured sensory information, e.g., in the form of texts or images, and extracts statements using various methods ranging from simple classifiers to the most sophisticated NLP approaches. Deductive reasoning is based on a symbolic representation and derives new statements from logical axioms. Finally, machine learning can both support information extraction by deriving symbolic representations from sensory data, e.g., via classification, and support deductive reasoning by exploiting regularities in structured data. In this paper we combine all three methods to exploit the available information in a modular way, by which we mean that each approach, i.e., information extraction, deductive reasoning and machine learning, can be optimized independently before being combined in an overall system. We validate our model using data from the YAGO2 ontology, and from Linked Life Data and Bio2RDF, all of which are part of the Linked Open Data (LOD) cloud.

This summarizing paper deals with borrowings from English in the section Drobnosti, which has been an inseparable part of the periodical Naše řeč since its beginnings. About 10% of the 2,546 items published in this section of Naše řeč since 1917 dealt with borrowings from foreign languages. Contributions dealing with borrowings from English became more frequent in the 1960s, with interest in this topic growing significantly in the 1990s and lasting to the present day. Over the existence of Naše řeč, 79 contributions in total have dealt with borrowings from English. The conference paper was based on 45 of them, namely those published between 1957 and 1992, i.e. during the period between two codifications, the latter of which is still valid. The aim was to compare linguists' predictions and recommendations relating to codification with the subsequent codification and especially with further development in usage (as determined by searching the Czech National Corpus), to evaluate the success of those predictions, and to attempt an interpretation of the differences between some predictions of linguists and the real development.

Building speech recognition applications for resource-deficient languages is a challenge because of the unavailability of a speech corpus. A speech corpus is a central element for training the acoustic models used in a speech recognition engine. Constructing a speech corpus for a language is an expensive, time-consuming and laborious process. This paper describes a mechanism to develop an inexpensive speech corpus for the resource-deficient languages Indian English and Hindi, by exploiting existing collections of online speech data to build a frugal speech corpus. For the purpose of demonstration we use online audio news archives to build the frugal speech corpus. We then use this speech corpus to train acoustic models and evaluate the performance of speech recognition on Indian English and Hindi speech.

Search engines provide an effective means of retrieving a document in which a piece of text occurs when the query contains infrequently occurring terms or the query is known to be an exact phrase. However, phrase queries usually contain common terms, including determiners, and users may not remember phrases exactly. Search engines discard common terms or assign them little importance, which may lead to poor retrieval results. In this paper, we examine the use of proximity-based phrase searching to search for quotes from song lyrics and movie scripts, and compare the results against those of standard search engines. An improvement of over 25% on search engine results shows that an additional search method to complement the common search engine methods would be beneficial for this task.
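Proximity-based phrase search can be sketched as ranking documents by the smallest token window that covers all query terms, so common words still contribute to locating the quote. The naive O(n²) function below is an illustration of the idea, not the paper's system.

```python
def min_window(tokens, query_terms):
    """Length of the smallest token window containing every query term
    (smaller = better proximity match). Returns None if a term is absent."""
    terms = set(query_terms)
    if not terms <= set(tokens):
        return None
    best = len(tokens)
    for i in range(len(tokens)):
        seen = set()
        for j in range(i, len(tokens)):
            if tokens[j] in terms:
                seen.add(tokens[j])
                if seen == terms:
                    best = min(best, j - i + 1)
                    break
    return best
```

Unlike a strict exact-phrase match, this score degrades gracefully when the user misremembers the word order or drops a determiner.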

Language-based document image retrieval (LBDIR) is an essential need for a multi-lingual environment. It provides an ease of accessing, searching and browsing of the documents pertaining to a particular language. This paper proposes a method for LBDIR using multi-resolution Histogram of Oriented Gradient (HOG) features. These features are obtained by computing HOG on the sub-bands of the Discrete Wavelet Transform. The Canberra distance is used for matching and retrieval of the documents. The proposed scheme is investigated on three datasets (Dataset1, Dataset2 and Dataset3) consisting of 1437 document images of the Kannada, Marathi, Telugu, Hindi and English languages. The objective of this work is to provide an efficient LBDIR for the government and non-government organizations of the Karnataka, Maharashtra and Telangana states in the context of the tri-lingual model adopted there. An average precision (AP) of 96.2%, 95.4%, 94.6%, 99.4% and 99.6% for Kannada, Marathi, Telugu, Hindi and English language documents is achieved while retrieving the top 50 documents with the proposed method. The proposed feature extraction scheme provided promising results compared to existing techniques.
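The matching step uses the Canberra distance between feature vectors; the metric itself is a direct sum of normalized per-dimension differences. A minimal sketch of the standard definition (the HOG feature extraction itself is not reproduced here):

```python
def canberra(u, v):
    """Canberra distance: sum of |u_i - v_i| / (|u_i| + |v_i|),
    skipping dimensions where both components are zero."""
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom:
            total += abs(a - b) / denom
    return total
```

Because each dimension is normalized by its own magnitude, Canberra is sensitive to small differences in low-valued histogram bins, which suits sparse gradient-orientation features.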

Traditionally, automated triage of papers is performed using lexical (unigram, bigram, and sometimes trigram) features. This paper explores the use of information extraction (IE) techniques to create richer linguistic features than traditional bag-of-words models. Our classifier includes lexico-syntactic patterns and more-complex features that represent a pattern coupled with its extracted noun, represented both as a lexical term and as a semantic category. Our experimental results show that the IE-based features can improve performance over unigram and bigram features alone. We present intrinsic evaluation results of full-text document classification experiments to determine automatically whether a paper should be considered of interest to biologists at the Mouse Genome Informatics (MGI) system at the Jackson Laboratories. We also further discuss issues relating to design and deployment of our classifiers as an application to support scientific knowledge curation at MGI.

The quantitative evaluation of quotations in the Russian Wiktionary was performed using the developed Wiktionary parser. It was found that the number of quotations in the dictionary is growing fast (51.5 thousand in 2011, 62 thousand in 2012). These quotations were extracted and saved in the relational database of a machine-readable dictionary. For this database, tables related to the quotations were designed. A histogram of the distribution of quotations from literary works written in different years was built. An attempt was made to explain the characteristics of the histogram by associating it with the years of the most popular and most cited (in the Russian Wiktionary) writers of the nineteenth century. It was found that more than one-third of all the quotations (the example sentences) contained in the Russian Wiktionary are taken by the editors of a Wiktionary entry from the Russian National Corpus.

The main goal of this work is to propose an Interactive System for Arabic Web Search Results Clustering (ISAWSRC) that may be used as a helpful tool for Arabic query reformulation. The system is an enhanced version of our published system AWSRC for Arabic Web Search Results Clustering. Using the proposed interactive system, the user can decide at a glance whether the contents of a cluster are of interest: he or she does not have to reformulate the query, but can merely click on the produced cluster label most accurately describing his or her specific information need in the topic hierarchy. The user can thus navigate in the sense of generalization or specialization based on these cluster labels. To illustrate the efficiency and to evaluate the performance of our proposal, several experiments have been conducted. The obtained results show that, on the one hand, the quality of the cluster labels is very high and can help the user reformulate his or her queries, and, on the other hand, the generated topic hierarchy allows the user to navigate in the sense of generalization and/or specialization without the need for query reformulation.

We experiment with learning word representations designed to be combined into sentence level semantic representations, using an objective function which does not directly make use of the supervised scores provided with the training data, instead opting for a simpler objective which encourages similar phrases to be close together in the embedding space. This simple objective lets us start with high quality embeddings trained using the Paraphrase Database (PPDB) (Wieting et al., 2015; Ganitkevitch et al., 2013), and then tune these embeddings using the official STS task training data, as well as synthetic paraphrases for each test dataset, obtained by pivoting through machine translation. Our submissions include runs which only compare the similarity of phrases in the embedding space, directly using the similarity score to produce predictions, as well as a run which uses vector similarity in addition to a suite of features we investigated for our 2015 SemEval submission. For the crosslingual task, we simply translate the Spanish sentences to English, and use the same system we designed for the monolingual task.

This paper presents a novel approach for word sense disambiguation. The underlying algorithm has two main components: (1) pattern learning from available sense-tagged corpora (SemCor), from dictionary definitions (WordNet) and from a generated corpus (GenCor); and (2) instance-based learning with automatic feature selection, when training data is available for a particular word. The ideas described in this paper were implemented in a system that achieves excellent performance on the data provided during the Senseval-2 evaluation exercise, for both the English all-words and English lexical sample tasks. 1 Introduction. Word Sense Disambiguation (WSD) no longer needs an introduction, and particularly not in a special issue on WSD evaluation. It is well known that WSD constitutes one of the hardest problems in natural language processing, yet it is a necessary step in a large range of applications including machine translation, knowledge acquisition, coreference, information retrieval and others. This fact motivates a continuously increasing number of researchers to develop WSD systems and devote time to finding solutions to this challenging problem. The system presented here was initially designed for the semantic disambiguation of all words in open text. The Senseval competitions provided a good environment for supervised systems, and this fact motivated us to improve our system with the capability of incorporating larger training data sets when available. There are two important modules in this system. The first one uses pattern learning that relies on machine-readable dictionaries and sense-tagged corpora to tag all words in open text. The second module is triggered only for words with large training data, as was the case with the words from the lexical sample tasks. It uses an instance-based learning algorithm with automatic feature selection.
To our knowledge, both pattern learning and automatic feature selection are novel approaches in the WSD field, and they led to very good results during the Senseval-2 evaluation exercise.

Lexical chains algorithms attempt to find sequences of words in a document that are closely related semantically. Such chains have been argued to provide a good indication of the topics covered by the document without requiring a deeper analysis of the text, and have been proposed for many NLP tasks. Different underlying lexical semantic relations based on WordNet have been used for this task. Since links in WordNet connect synsets rather than words, open word-sense disambiguation becomes a necessary part of any chaining algorithm, even if the intended application is not disambiguation. Previous chaining algorithms have combined the tasks of disambiguation and chaining by choosing those word senses that maximize chain connectivity, a strategy which yields poor disambiguation accuracy in practice. We present a novel probabilistic algorithm for finding lexical chains. Our algorithm explicitly balances the requirements of maximizing chain connectivity with the choice of probable word-senses. The algorithm achieves better disambiguation results than all previous ones, but under its optimal settings shifts this balance totally in favor of probable senses, essentially ignoring the chains. This model points to an inherent conflict between chaining and word-sense disambiguation. By establishing an upper bound on the disambiguation potential of lexical chains, we show that chaining is theoretically highly unlikely to achieve accurate disambiguation. Moreover, by defining a novel intrinsic evaluation criterion for lexical chains, we show that poor disambiguation accuracy also implies poor chain accuracy. Our results have crucial implications for chaining algorithms. At the very least, they show that disentangling disambiguation from chaining significantly improves chaining accuracy. The hardness of all-words disambiguation, however, implies that finding accurate lexical chains is harder than suggested by the literature.

Dyscalculia is a specific learning disability that causes underachievement in learning mathematics; it begins in childhood and persists through adulthood. The prevalence of dyscalculia is estimated to range between 3% and 6% of the world population, including Malaysia. In this preliminary study, we take a data-driven approach, through literature content analysis and interviews conducted with teachers, to analyse the different terms used for dyscalculia, and the effectiveness of the computer-based or assistive learning technologies that have been developed and used over the past two decades for learners with learning problems in mathematics. Current studies show an increasing interest in adopting Augmented Reality (AR) technology in education, and optimism about creating a unique educational setting for special education learners, specifically dyscalculia learners, enabling them to undergo experiential learning through the real world mixed with virtual objects, without losing their sense of reality.

This paper presents the coarse-grained English all-words task at SemEval-2007. We describe our experience in producing a coarse version of the WordNet sense inventory and preparing the sense-tagged corpus for the task. We present the results of participating systems and discuss future directions.

We examine the effectiveness on the multilingual WebCLEF 2006 test set of light-weight methods that have proved successful in other web retrieval settings: combinations of document representations on the one hand and query reformulation techniques on the other.

The past decade has seen the emergence of web-scale structured and linked semantic knowledge resources (e.g., Freebase, DBPedia). These semantic knowledge graphs provide a scalable “schema for the web”, representing a significant opportunity for the spoken language understanding (SLU) research community. This paper leverages these resources to bootstrap a web-scale semantic parser with no requirement for semantic schema design, no data collection, and no manual annotations. Our approach is based on an iterative graph crawl algorithm. From an initial seed node (entity-type), the method learns the related entity-types from the graph structure, and automatically annotates documents that can be linked to the node (e.g., Wikipedia articles, web search documents). Following the branches, the graph is crawled and the procedure is repeated. The resulting collection of annotated documents is used to bootstrap web-scale conditional random field (CRF) semantic parsers. Finally, we use a maximum-a-posteriori (MAP) unsupervised adaptation technique on sample data from a specific domain to refine the parsers. The scale of the unsupervised parsers is on the order of thousands of domains and entity-types, millions of entities, and hundreds of millions of relations. The precision-recall of the semantic parsers trained with our unsupervised method approaches those trained with supervised annotations. Index Terms: semantic parsing, semantic web, semantic search, dialog, natural language understanding

Although widely seen as critical both in terms of its frequency and its social significance as a prime means of encoding and perpetuating moral stance and configuring self and identity, conversational narrative has received little attention in corpus linguistics. In this paper we describe the construction and annotation of a corpus that is intended to advance the linguistic theory of this fundamental mode of everyday social interaction: the Narrative Corpus (NC). The NC contains narratives extracted from the demographically-sampled sub-corpus of the British National Corpus (BNC) (XML version). It includes more than 500 narratives, socially balanced in terms of participant sex, age, and social class. We describe the extraction techniques, selection criteria, and sampling methods used in constructing the NC. Further, we describe four levels of annotation implemented in the corpus: speaker (social information on speakers), text (text Ids, title, type of story, type of embedding etc.), textual components (pre-/post-narrative talk, narrative, and narrative-initial/final utterances), and utterance (participation roles, quotatives and reporting modes). A brief rationale is given for each level of annotation, and possible avenues of research facilitated by the annotation are sketched out.

This paper describes the methodology used to compile a corpus called MorphoQuantics that contains a comprehensive set of 17,943 complex word types extracted from the spoken component of the British National Corpus (BNC). The categorisation of these complex words was derived primarily from the classification of Prefixes, Suffixes and Combining Forms proposed by Stein (2007). The MorphoQuantics corpus has been made available on a website of the same name; it lists 554 word-initial and 281 word-final morphemes in English, their etymology and meaning, and records the type and token frequencies of all the associated complex words containing these morphemes from the spoken element of the BNC, together with their Part of Speech. The results show that, although the number of word-initial affixes is nearly double that of word-final affixes, the relative number of each observed in the BNC is very similar; however, word-final affixes are more productive in that, on average, the frequency with which they attach to different bases is three times that of word-initial affixes. Finally, this paper considers how linguists, psycholinguists and psychologists may use MorphoQuantics to support their empirical work in first and second language acquisition, and clinical and educational research.

In this paper we address the problem of Word Sense Disambiguation by introducing a knowledge-driven framework for the disambiguation of nouns. The proposal is based on the clustering of noun sense representations and it serves as a general model that includes some existing disambiguation methods. A first prototype algorithm for the framework, relying on both topic signatures built from WordNet and the Extended Star clustering algorithm, is also presented. This algorithm yields encouraging experimental results for the SemCor corpus, showing improvements in recall over other knowledge-driven methods.

Based on the analysis of key words and trigrams, this paper explores characteristics of contemporary American English television dialogue. Using a corpus comprising dialogue from seven fictional series (five different genres) and the spoken part of the American National Corpus, key words and trigrams are compared to previous corpus linguistic studies of such dialogue (Mittmann 2006, Quaglio 2009) and further explored on the basis of concordances, with special emphasis on over-represented key words/trigrams and their potential to indicate informality and emotionality. The results suggest that the expression of emotion is a key defining feature of the language of television, cutting across individual series and different televisual genres.

Neural semantic parsing has achieved impressive results in recent years, yet its success relies on the availability of large amounts of supervised data. Our goal is to learn a neural semantic parser when only prior knowledge about a limited number of simple rules is available, without access to either annotated programs or execution results. Our approach is initialized by rules, and improved in a back-translation paradigm using generated question-program pairs from the semantic parser and the question generator. A phrase table with frequent mapping patterns is automatically derived, and updated as training progresses, to measure the quality of generated instances. We train the model with model-agnostic meta-learning to maintain accuracy and stability on examples covered by rules, while acquiring the versatility to generalize well on examples not covered by rules. Results on three benchmark datasets with different domains and programs show that our approach incrementally improves accuracy. On WikiSQL, our best model is comparable to the SOTA system learned from denotations.

This paper explores a fully automatic knowledge-based method which performs noun sense disambiguation relying only on the WordNet ontology. The basis of the method is the idea of conceptual density, that is, the correlation between the sense of a given word and its context. A new formula for calculating conceptual density is proposed and evaluated on the SemCor corpus.
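The paper's exact density formula is not reproduced here, but the underlying idea, scoring each candidate sense by how strongly the context clusters in its conceptual neighbourhood, can be sketched with a toy taxonomy (the tree, sense names, and scoring rule below are illustrative assumptions, not WordNet or the paper's actual formula):

```python
# Toy hypernym taxonomy (illustrative, not WordNet): child -> parent.
PARENT = {
    "bass_fish": "fish", "fish": "animal",
    "bass_music": "tone", "tone": "sound",
    "trout": "fish", "pitch": "sound",
}

def ancestors(node):
    """Return the set containing a node and all of its hypernyms."""
    result = {node}
    while node in PARENT:
        node = PARENT[node]
        result.add(node)
    return result

def density(sense, context_senses):
    """Score a sense by how many context senses share a hypernym with it."""
    anc = ancestors(sense)
    return sum(1 for c in context_senses if ancestors(c) & anc)

def disambiguate(senses, context_senses):
    """Pick the candidate sense with the highest density score."""
    return max(senses, key=lambda s: density(s, context_senses))

print(disambiguate(["bass_fish", "bass_music"], ["trout", "fish"]))
```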

This article presents methods to construct procedure of morpho-syntactic parsing based on corpus dataset analyzes. It contains 1) the method to eliminate morphological ambiguities using existing morphological parsers and then converting the results of parsing into the format of the language corpus used; 2) a method of selecting parameters for syntactic parsing and assessment of the achievable accuracy of parsing, which can be provided by the data of the used corpus; 3) a method of parsing sentences on the basis of neural network algorithms and a selected set of parameters in the format of used corpus. The basis for this study are sentences with unambiguous morpho-syntactic marking from the Russian National Corpus.

Syntactic chunking has been a well-defined and well-studied task since its introduction in 2000 as the CoNLL shared task. Though some efforts have been further spent on chunking performance improvement, the experimental data has been restricted, with few exceptions, to (part of) the Wall Street Journal data, as adopted in the shared task. It remains open how those successful chunking technologies could be extended to other data, which may differ in genre/domain and/or amount of annotation. In this paper we first train chunkers with three classifiers on three different data sets and test on four data sets. We also vary the size of training data systematically to show data requirements for chunkers. It turns out that there is no significant difference between those state-of-the-art classifiers; training on plentiful data from the same corpus (Switchboard) yields results comparable to Wall Street Journal chunkers even when the underlying material is spoken; and results comparable to those from a large amount of unmatched training data can be obtained with a very modest amount of matched training data.
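Chunker outputs in this line of work are conventionally encoded as BIO tags; a minimal decoder from BIO tag sequences to chunk spans, the representation on which chunking F-scores are computed, might look like this (the function name and the tolerance for stray I- tags are illustrative choices, not the shared task's official scorer):

```python
def bio_to_chunks(tags):
    """Convert a BIO tag sequence into (type, start, end) chunk spans, end exclusive."""
    chunks, start, ctype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel flushes the last chunk
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                chunks.append((ctype, start, i))  # close the open chunk
                start, ctype = None, None
            if tag.startswith("B-"):
                start, ctype = i, tag[2:]  # open a new chunk
        elif tag.startswith("I-") and start is None:
            start, ctype = i, tag[2:]  # tolerate an I- tag with no preceding B-
    return chunks

print(bio_to_chunks(["B-NP", "I-NP", "O", "B-VP", "B-NP"]))
```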

The Corpus of Contemporary American English (COCA), which was released online in early 2008, is the first large and diverse corpus of American English. In this paper, we first discuss the design of the corpus, which contains more than 385 million words from 1990–2008 (20 million words each year), balanced between spoken, fiction, popular magazines, newspapers, and academic journals. We also discuss the unique relational database architecture, which allows for a wide range of queries that are not available (or are quite difficult) with other architectures and interfaces. To conclude, we consider insights from the corpus on a number of cases of genre-based variation and recent linguistic variation, including an extended analysis of phrasal verbs in contemporary American English.

Recent studies have shown the potential benefits of leveraging resources for resource-rich languages to build tools for similar, but resource-poor languages. We examine what constitutes "similarity" by comparing traditional phylogenetic language groups, which are motivated largely by genetic relationships, with language groupings formed by clustering methods using typological features only. Using data from the World Atlas of Language Structures (WALS), our preliminary experiments show that typologically-based clusters look quite different from genetic groups, but perform as well as or better when used to predict feature values of member languages.
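Predicting a feature value of a member language from its typologically nearest neighbour can be sketched as follows; the feature vectors are invented toy data, not actual WALS values, and nearest-neighbour prediction stands in for the clustering methods the paper compares:

```python
# Toy WALS-style binary feature vectors (invented values, not real WALS data).
FEATURES = {
    "lang_a": [1, 0, 1, 1],
    "lang_b": [1, 0, 1, 0],
    "lang_c": [0, 1, 0, 0],
    "lang_d": [0, 1, 0, 1],
}

def hamming(u, v):
    """Number of positions at which two feature vectors disagree."""
    return sum(x != y for x, y in zip(u, v))

def predict_feature(target, feature_idx, features):
    """Predict a held-out feature value from the typologically nearest language."""
    others = [name for name in features if name != target]
    known = [i for i in range(len(features[target])) if i != feature_idx]
    def partial_dist(name):
        # Distance computed only over the features that are not held out.
        return hamming([features[target][i] for i in known],
                       [features[name][i] for i in known])
    nearest = min(others, key=partial_dist)
    return features[nearest][feature_idx]

print(predict_feature("lang_a", 3, FEATURES))
```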

We present semantic markup as a way to exploit the semantics of mathematics in a wiki. Semantic markup makes mathematical knowledge machine-processable and thus allows for a multitude of useful applications. But as it is hard to read and write for humans, an editor needs to understand its inherent semantics and allow for a human-readable presentation. The semantic wiki SWiM offers this support for the OpenMath markup language. Using OpenMath as an example, we present a way of integrating a semantic markup language into a semantic wiki using a document ontology and extracting RDF triples from XML markup. As a benefit gained from making semantics explicit, we show how SWiM supports the collaborative editing of definitions of mathematical symbols and their visual appearance.

1 Making Mathematical Wikis More Semantic

What does a wiki need in order to support mathematics in a semantic way? First, there needs to be a way to edit mathematical formulae. Many wikis offer a LaTeX-like syntax for that, and they have been used to build large mathematical knowledge collections, such as the mathematical sections of Wikipedia [30] or the mathematics-only encyclopaedia PlanetMath [16]. But LaTeX, which is mostly presentation-oriented, despite certain macros like \frac{num}{denom} or \binom{n}{k}, is not sufficient to capture the semantics of mathematics. One could write O(n² + n), which could mean "O times n² + n" (with redundant brackets), or "O (being a function) applied to n² + n", or the set of all integer functions not growing faster than n² + n, and only by common notational convention do we know that the latter is most likely to hold. To be able to express the semantics of O(n² + n), we need to make explicit that the Landau symbol O is a set construction operator and n is a variable. The meaning of O has to be defined in a vocabulary shared among mathematical applications such as our wiki.
This is analogous to RDF, where a vocabulary, also called an ontology, has to be defined before one can use it to create machine-processable and exchangeable RDF statements. In a mathematical context, these vocabularies are called content dictionaries (CDs). As with ontology languages, one can usually do more than just list symbols and their descriptions in a CD: defining symbols formally in terms of other symbols, declaring their types formally, and specifying their visual appearance. Thus, CDs themselves are special mathematical documents that could again be made available in a mathematical wiki. Then it would be possible to create an unambiguous link from any occurrence of O in a formula to its definition in the wiki, and knowledge from the wiki could be shared with any other mathematical application supporting this CD. As a practical solution, we present the OpenMath CD language in sect. 2 and its integration into the semantic wiki SWiM in sect. 4.

2 Semantic Markup for Mathematics with OpenMath

Semantic markup languages for mathematics address the problems introduced in sect. 1 by offering an appropriate expressivity and semantics for defining symbols and other structures of mathematical knowledge. This is a common approach to knowledge representation not only in mathematics, but generally in science (1). OpenMath [7] is a markup language for expressing the logical structure of mathematical formulae. It provides its own sublanguage for defining CDs: collections of symbol definitions with formal and informal semantics. One symbol definition consists of a mandatory symbol name and a normative textual description of the symbol, as well as other metadata (2). Formal mathematical properties (FMPs) of the symbol, such as the definition of the sine function, or the commutativity axiom that holds for the multiplication operator, can be added, written in OpenMath and possibly using other symbols (see fig. 1).
Type signatures (such as sin : ℝ → ℝ) and human-readable notations (see sect. 3) of symbols are defined separately from the CD in a similar fashion. As semantic markup makes mathematical formulae machine-understandable, it has leveraged many applications. For OpenMath, it started with data exchange between computer algebra systems, then automated theorem provers, and more recently dynamic geometry systems. OpenMath is also used in multilingual publishing, adaptive learning applications, and web search [10]. OpenMath CDs foster exchange by their modularity. Usually, a CD contains a set of related symbols, e.g. basic operations on matrices (CD linalg1) or eigenvalues and related concepts (CD linalg4), and a CD group contains a set of related CDs, e.g. all standard CDs about linear algebra (CD group linalg). In this setting, agents exchanging mathematical knowledge need not agree upon one large, monolithic mathematical ontology, but can flexibly refer to a specific set of CDs or CD groups they understand (3).

(1) Consider e.g. the chemical markup language CML [23]
(2) OpenMath 2 uses an idiosyncratic schema for metadata, but Dublin Core is likely to be adopted for OpenMath 3.
(3) A communication protocol for such agreements is specified in [7, sect. 5.3].
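As a concrete illustration of the logical structure OpenMath encodes, an expression such as sin(x) is written as an application (OMA) of a symbol (OMS), whose meaning is defined in the standard transc1 content dictionary, to a variable (OMV). The minimal object below follows the OpenMath 2 XML encoding:

```xml
<OMOBJ xmlns="http://www.openmath.org/OpenMath" version="2.0">
  <!-- sin applied to the variable x; the symbol's semantics come from the transc1 CD -->
  <OMA>
    <OMS cd="transc1" name="sin"/>
    <OMV name="x"/>
  </OMA>
</OMOBJ>
```

Because the symbol carries an explicit reference to its CD, any application that understands transc1 can interpret the object unambiguously, which is exactly the kind of link SWiM makes navigable in the wiki.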

We present TransPhoner: a system that generates keywords for a variety of scenarios including vocabulary learning, phonetic transliteration, and creative word plays. We select effective keywords by considering phonetic, orthographic and semantic word similarity, and word concept imageability. We show that keywords provided by TransPhoner improve learner performance in an online vocabulary learning study, with the improvement being more pronounced for harder words. Participants rated TransPhoner keywords as more helpful than a random keyword baseline, and almost as helpful as manually selected keywords. Comments also indicated higher engagement in the learning task, and more desire to continue learning. We demonstrate additional applications to tasks such as pure phonetic transliteration, generation of mnemonics for complex vocabulary, and topic-based transformation of song lyrics.

QA (question answering) systems designed for answering in-depth geographic questions are in high demand but not widely available. Previous research has visited various individual aspects of a QA system, but few synergistic frameworks have been proposed. This paper investigates the nature of geographic question formation and observes their unique linguistic structures, which can be semantically translated into a spatial query. We create a new task of solving non-trivial questions using GIS (Geographic Information System) and test it with an associated corpus. A dynamic programming algorithm is developed for classification and a voting algorithm for verification. Two types of ontologies are integrated for disambiguating and discriminating spatial terms. PostGIS serves as the GIS backend to provide domain expertise for spatial reasoning. Results show that exact answers can be returned quickly and correctly by our system. Classification results show improved accuracy over the baseline, demonstrating the effectiveness of the proposed methods.

Distributed vector representations for natural language vocabulary receive considerable attention in contemporary computational linguistics. This paper summarizes the experience of applying neural network language models to the task of calculating semantic similarity for Russian. The experiments were performed in the course of the Russian Semantic Similarity Evaluation track, where our models took from the 2nd to the 5th position, depending on the task. We introduce the tools and corpora used, comment on the nature of the shared task and describe the achieved results. We found that Continuous Skip-gram and Continuous Bag-of-Words models, previously successfully applied to English material, can be used for semantic modeling of Russian as well. Moreover, we show that texts in the Russian National Corpus (RNC) provide excellent training material for such models, outperforming other, much larger corpora. This is especially true for semantic relatedness tasks (although stacking models trained on larger corpora on top of RNC models improves performance even more). High-quality semantic vectors learned in this way can be used in a variety of linguistic tasks and promise an exciting field for further study.
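Once such vectors are trained, semantic similarity between two words reduces to the cosine similarity of their vectors. A self-contained sketch with made-up low-dimensional vectors (real Skip-gram/CBOW vectors have hundreds of dimensions, and the values below are invented for illustration):

```python
import math

# Made-up 4-dimensional vectors; real models produce 100+ dimensions.
VECTORS = {
    "king":  [0.9, 0.1, 0.8, 0.2],
    "queen": [0.8, 0.2, 0.9, 0.1],
    "apple": [0.1, 0.9, 0.0, 0.7],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

# Semantically close words should score higher than unrelated ones.
print(cosine(VECTORS["king"], VECTORS["queen"]) > cosine(VECTORS["king"], VECTORS["apple"]))
```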

The ATCOSIM Air Traffic Control Simulation Speech corpus is a speech database of air traffic control (ATC) operator speech. It consists of ten hours of speech data, which were recorded during ATC real-time simulations. The database includes orthographic transcriptions and additional information on speakers and recording sessions. The corpus is publicly available and provided free of charge. This report describes the production process of the corpus and gives a thorough description of the final corpus. Possible applications of the corpus are, among others, ATC language studies, speech recognition and speaker identification, as well as listening tests within the ATC domain.

We show that the task of question answering (QA) can significantly benefit from the transfer learning of models trained on a different large, fine-grained QA dataset. We achieve the state of the art in two well-studied QA datasets, WikiQA and SemEval-2016 (Task 3A), through a basic transfer learning technique from SQuAD. For WikiQA, our model outperforms the previous best model by more than 8%. We demonstrate that finer supervision provides better guidance for learning lexical and syntactic information than coarser supervision, through quantitative results and visual analysis. We also show that a similar transfer learning procedure achieves the state of the art on an entailment task.

Dataset re-collected from an original dataset collected by Pang, B., and Lee, L. 2004. "A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts". In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics. The dataset presents a binary classification problem: workers were asked to select either positive (1) or negative (0) for 500 sentences extracted from movie reviews, with gold labels assigned by the website. It contains 10,000 sentiment judgements collected from 143 workers using the Amazon Mechanical Turk platform. Each row is in the format: WorkerID, TaskID, worker label, gold label, time spent on the judgement by the worker.
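Rows in the stated format can be aggregated per task by majority vote and scored against the gold labels; the sample rows below are invented for illustration, not drawn from the actual dataset:

```python
from collections import Counter, defaultdict

# Invented sample rows in the stated format:
# (WorkerID, TaskID, worker label, gold label, seconds spent on the judgement).
ROWS = [
    ("w1", "t1", 1, 1, 4.2),
    ("w2", "t1", 0, 1, 2.0),
    ("w3", "t1", 1, 1, 3.5),
    ("w1", "t2", 0, 0, 5.1),
    ("w2", "t2", 0, 0, 1.9),
]

def majority_vote(rows):
    """Aggregate worker labels per task by majority vote; also collect gold labels."""
    votes = defaultdict(Counter)
    gold = {}
    for worker, task, label, gold_label, seconds in rows:
        votes[task][label] += 1
        gold[task] = gold_label
    predicted = {task: counter.most_common(1)[0][0] for task, counter in votes.items()}
    return predicted, gold

predicted, gold = majority_vote(ROWS)
accuracy = sum(predicted[t] == gold[t] for t in gold) / len(gold)
print(accuracy)
```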

MEDLINE, the primary bibliographic database in the life sciences, currently contains more than 17 million article citations and grew by 600,000 entries last year. Staying up-to-date, finding relevant information and even extracting new knowledge become increasingly difficult in this field [1]. The peculiarities of biomedical terminology make building an effective IR system a challenge [3]. Firstly, biomedical terms are highly synonymous and ambiguous. Secondly, multi-word terms such as 'Bovine Spongiform Encephalopathy' are commonly used, making a bag-of-words approach less effective. Thirdly, new terms and especially abbreviations are abundant. And finally there is the challenge of variation in terminology. Differences in spelling, use of hyphens and other special characters make it even more difficult to handle biomedical text. The TREC Genomics benchmarks have demonstrated that overcoming these challenges is far from trivial. Different attempts have been pursued to map text to a notion of concepts: explicitly, by mapping texts to entries in controlled vocabularies such as the Unified Medical Language System (UMLS), but also implicitly, for example by treating collocated words as 'concepts'. The goal of my PhD project is to study how to optimize biomedical IR by including conceptual knowledge from biomedical ontologies, while maintaining a theoretically sound framework. To achieve this I propose to approach terminology issues in biomedical IR as a form of cross-lingual IR (CLIR). The two 'languages' distinguished are the textual representation of query and documents, and their conceptual representation in terms of concepts from a biomedical ontology.

One of the most obvious divisions in Chinese dialects is the opposition between northern Mandarin and the southeastern dialects, i.e. the North-South opposition. In this paper, the author selected 16 items from the vocabulary and grammar volumes of the Linguistic Atlas of Chinese Dialects and analyzed the feature sequences of these 16 items across 930 Chinese dialect sites with MEGA (Molecular Evolutionary Genetics Analysis) software by simulating DNA sequences. The results showed that lexicon-grammar items alone can also basically reveal the North-South opposition, just as phonology items do. Therefore, the introduction of lexicon-grammar criteria into Chinese dialect classification is meaningful. Of course, the so-called "feature sequences" in dialectology are not real DNA sequences, and when using MEGA for large-sample calculation it is normal for bootstrap values to be low; the important thing is to observe the grouping trends embodied in the phylogenetic trees.

A trigram language model based on word categories is introduced in order to improve word recognition results by use of linguistic information. A trigram model based on word sequences requires a lot of memory and training samples to store and estimate its probabilities. To avoid these almost unsolvable problems, a trigram model of words whose probabilities are estimated from the trigram of categories and word occurrence probabilities in the dictionary is introduced. The probabilities of the trigram of categories and the word probabilities in the dictionary are estimated using the Brown Corpus Text Database [1]. This trigram model is efficiently applied to improve word recognition results using a dynamic programming technique. Moreover, probabilities of special word sequences (frozen word sequences) are extracted from the Brown Corpus Text Database and these probabilities are also integrated in the dynamic programming algorithm. Word recognition through speaker adaptation is carried out using three input speakers from the IBM office correspondence task database [3]. The word recognition rate was 80.9%. The trigram model improves the word recognition rate to 89.1%.
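The factorization described above estimates a word trigram probability from the category trigram and the dictionary's word-given-category probabilities, roughly P(w3 | c1, c2) = Σ_c3 P(c3 | c1, c2) · P(w3 | c3), summing over the possible categories of the word. A sketch with invented toy estimates (not actual Brown Corpus counts):

```python
# Invented toy estimates (illustrative, not actual Brown Corpus statistics).
CATEGORY_TRIGRAM = {  # P(c3 | c1, c2)
    ("DET", "NOUN", "VERB"): 0.6,
    ("DET", "NOUN", "NOUN"): 0.1,
}
WORD_GIVEN_CATEGORY = {  # P(w | c), the word occurrence probabilities in the dictionary
    ("runs", "VERB"): 0.05,
    ("runs", "NOUN"): 0.01,
}
WORD_CATEGORIES = {"runs": ["VERB", "NOUN"]}

def word_trigram_prob(c1, c2, w3):
    """P(w3 | c1, c2) ~= sum over categories c3 of P(c3 | c1, c2) * P(w3 | c3)."""
    return sum(
        CATEGORY_TRIGRAM.get((c1, c2, c3), 0.0) * WORD_GIVEN_CATEGORY.get((w3, c3), 0.0)
        for c3 in WORD_CATEGORIES[w3]
    )

print(word_trigram_prob("DET", "NOUN", "runs"))
```

The appeal of the factorization is storage: category trigram tables are tiny compared with word trigram tables, at the cost of collapsing all words of a category into one distribution.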

The paper describes the treatment of some specific syntactic constructions in two treebanks of Latin according to a common set of annotation guidelines. Both projects work within the theoretical framework of Dependency Grammar, which has been demonstrated to be an especially appropriate framework for the representation of languages with a moderately free word order, where the linear order of constituents is broken up with elements of other constituents. The two projects are the first of their kind for Latin, so no prior established guidelines for syntactic annotation are available to rely on. The general model for the adopted style of representation is that used by the Prague Dependency Treebank, with departures arising from the Latin grammar of Pinkster, specifically in the traditional grammatical categories of the ablative absolute, the accusative + infinitive, and gerunds/gerundives. Sharing common annotation guidelines allows us to compare the datasets of the two treebanks for tasks such as mutually checking annotation consistency, diachronically studying specific syntactic constructions, and training statistical dependency parsers.

1. The Latin Dependency Treebank and Index Thomisticus Treebank

Treebanks have recently emerged as a valuable resource not only for computational tasks such as grammar induction and automatic parsing, but for traditional linguistic and philological pursuits as well. This trend has been encouraged by the creation of several historical treebanks, such as those for Middle English (Kroch & Taylor, 2000), Early Modern English (Kroch et al., 2004), Old English (Taylor et al., 2003), Early New High German (Demske et al., 2004) and Medieval Portuguese (Rocio et al., 2000).
The Perseus Project (Crane et al., 2001) and the Index Thomisticus (IT) (Busa 1974-1980) are currently in the process of developing treebanks for Latin: the Latin Dependency Treebank (LDT) (Bamman & Crane, 2006; Bamman & Crane, 2007) on works from the Classical era, and the Index Thomisticus Treebank (IT-TB) (Passarotti, 2007) on the works of Thomas Aquinas. In order for our separate endeavors to be most useful for the community, we have come to an agreement on a common standard for the syntactic annotation of Latin. In this paper we present some examples from our preliminary set of annotation guidelines that illustrate how we have adapted our general annotation model, inherited from the one of the Prague Dependency Treebank (PDT) of Czech, to the specific linguistic demands of Latin. (1)

(1) The IT-TB is available online at the following URL: ; the LDT can be found at

Date           Author    Words    Sentences
1st c. BCE     Cicero     5,663       295
1st c. BCE     Caesar     1,488        71
1st c. BCE     Sallust   12,311       701
1st c. BCE     Vergil     2,613       178
4th-5th c. CE  Jerome     8,382       405

A recently suggested method for the automatic indexing of full text is applied to extracts from the Brown Corpus. Scientific and technological extracts are found to give rise to a much larger number of index terms than humanities and social science extracts. These results would appear to arise from differences in the word frequency distributions for each type of subject.
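A frequency-based indexing scheme of the kind evaluated here can be sketched as selecting non-stopword terms whose frequency in an extract passes a threshold; the stoplist, threshold, and sample text below are illustrative assumptions, not the referenced method's exact parameters:

```python
from collections import Counter

STOPWORDS = {"the", "of", "and", "a", "to", "in", "is"}  # illustrative stoplist

def index_terms(text, min_count=2):
    """Select candidate index terms: non-stopwords occurring at least min_count times."""
    words = [w.strip(".,;").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return sorted(term for term, n in counts.items() if n >= min_count)

sample = "The enzyme catalyses the reaction; the enzyme is specific to the reaction."
print(index_terms(sample))
```

Under this view, the observed subject-area difference falls out of the word frequency distributions: extracts whose vocabulary is more repetitive above the threshold, as in scientific prose, yield more index terms.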