NLPinAI 2020 Abstracts


Full Papers
Paper Nr: 2
Title:

Aspect Phrase Extraction in Sentiment Analysis with Deep Learning

Authors:

Joschka Kersting and Michaela Geierhos

Abstract: This paper deals with aspect phrase extraction and classification in sentiment analysis. We summarize current approaches and datasets from the domain of aspect-based sentiment analysis, which detects the sentiments expressed towards individual aspects in unstructured text data. So far, mainly commercial user reviews of products or services such as restaurants have been investigated. Here, we present our dataset of German physician reviews, a sensitive and linguistically complex field. Furthermore, we describe the annotation process of a dataset for supervised learning with neural networks. Moreover, we introduce our model for extracting and classifying aspect phrases in one step, which obtains an F1-score of 80%. By applying it to a more complex domain, our approach and results outperform previous approaches.
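
As a toy illustration of the one-step idea (not the authors' neural model; the aspect categories and the sentence below are invented), aspect phrases can be extracted and classified jointly when each token carries a BIO tag fused with an aspect category:

```python
# Toy illustration of joint aspect-phrase extraction and classification
# via a single BIO-plus-category tag per token (labels are invented).
def decode_aspect_phrases(tokens, tags):
    """Collect (phrase, category) pairs from tags such as 'B-TimeTaken'."""
    phrases, current, category = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                phrases.append((" ".join(current), category))
            current, category = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # an 'O' tag closes any open phrase
            if current:
                phrases.append((" ".join(current), category))
            current, category = [], None
    if current:
        phrases.append((" ".join(current), category))
    return phrases

tokens = ["The", "doctor", "took", "a", "lot", "of", "time", "for", "me"]
tags   = ["O", "O", "B-TimeTaken", "I-TimeTaken", "I-TimeTaken",
          "I-TimeTaken", "I-TimeTaken", "O", "O"]
print(decode_aspect_phrases(tokens, tags))
# [('took a lot of time', 'TimeTaken')]
```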

Paper Nr: 5
Title:

Learning to Determine the Quality of News Headlines

Authors:

Amin Omidvar, Hossein Pourmodheji, Aijun An and Gordon Edall

Abstract: Today, most news readers read the online version of news articles rather than traditional paper-based newspapers. In addition, news media publishers rely heavily on the income generated from subscriptions and website visits made by news readers. Thus, online user engagement is a very important issue for online newspapers. Much effort has been spent on writing interesting headlines to catch the attention of online users. On the other hand, headlines should not be misleading (e.g., clickbait); otherwise, readers will be disappointed when reading the content. In this paper, we propose four indicators to determine the quality of published news headlines based on their click count and dwell time, which are obtained by website log analysis. Then, we use a soft target distribution over the calculated quality indicators to train our proposed deep learning model, which can predict the quality of unpublished news headlines. The proposed model not only processes the latent features of both the headline and the body of the article to predict headline quality but also considers the semantic relation between headline and body. To evaluate our model, we use a real dataset from a major Canadian newspaper. The results show that our proposed model outperforms other state-of-the-art NLP models.
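
A minimal sketch, under invented assumptions, of how engagement statistics could be turned into a soft target distribution for training; the four scoring rules, the normalisation and the temperature are placeholders, not the paper's indicator definitions:

```python
import math

def soft_quality_targets(click_rate, dwell_time, temperature=1.0):
    """Toy soft labels over four headline-quality classes from engagement stats
    (both inputs assumed normalised to [0, 1]); the rules are illustrative only."""
    scores = [
        (1 - click_rate) * (1 - dwell_time),  # ignored headline
        (1 - click_rate) * dwell_time,        # under-clicked but engaging
        click_rate * (1 - dwell_time),        # clickbait-like
        click_rate * dwell_time,              # good headline
    ]
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]  # soft target distribution

print(soft_quality_targets(click_rate=0.8, dwell_time=0.2))
```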

Paper Nr: 7
Title:

Integrating Special Rules Rooted in Natural Language Semantics into the System of Natural Deduction

Authors:

Marie Duží and Michal Fait

Abstract: The paper deals with natural language processing and question answering over large corpora of formalised natural language texts. Our background theory is the system of Transparent Intensional Logic (TIL). Having a fine-grained analysis of natural language sentences in the form of TIL constructions, we apply Gentzen’s system of natural deduction to answer questions in an ‘intelligent’ way. This means that our system derives logical consequences entailed by the input sentences rather than merely searching for answers by keywords. Natural language semantics is rich, and many of its special features must be taken into account in the process of inferring answers. The TIL system makes it possible to formalise all these semantically salient features in a fine-grained way. In particular, since TIL is a logic of partial functions, it deals with non-referring terms and sentences with truth-value gaps in an appropriate way. This is important because sentences often come with a presupposition that must be true in order for the sentence to have any truth value. Yet a problem arises: how to integrate these special semantic rules into a standard deduction system. Proposing a solution is one of the goals of this paper. The second novel result concerns the problem of how to find relevant sentences in the labyrinth of input text data and how to select the applicable rules needed to meet the goal, i.e., to answer a given question. To this end, we propose a heuristic method driven by the constituents of a given question.
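
TIL and natural deduction go far beyond a snippet, but the role of presuppositions can be illustrated with a toy check: if the presupposition of a question does not follow from the knowledge base, the system reports a truth-value gap instead of answering yes or no (the facts and predicates below are invented):

```python
# Toy illustration of presupposition handling in question answering.
facts = {"is_country(France)"}  # the knowledge base does not entail a present king

def entails(kb, statement):
    # Stand-in for real inference: here, plain membership in the fact set.
    return statement in kb

def answer(question, presupposition, kb):
    if not entails(kb, presupposition):
        return "no truth value: presupposition fails"
    return "yes" if entails(kb, question) else "no"

print(answer("is_bald(the_king_of_France)", "exists(the_king_of_France)", facts))
# -> 'no truth value: presupposition fails'
```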

Paper Nr: 8
Title:

Learning Domain-specific Grammars from a Small Number of Examples

Authors:

Herbert Lange and Peter Ljunglöf

Abstract: In this paper, we investigate the problem of grammar inference from a different perspective. The common approach is to try to infer a grammar directly from example sentences, which either requires a large training set or suffers from poor accuracy. We instead view it as a problem of grammar restriction or sub-grammar extraction. We start from a large-scale resource grammar and a small number of examples, and find a sub-grammar that still covers all the examples. To do this, we formulate the problem as a constraint satisfaction problem and use an existing constraint solver to find the optimal grammar. We have run experiments with English, Finnish, German, Swedish and Spanish, which show that 10–20 examples are often sufficient to learn an interesting domain grammar. Possible applications include computer-assisted language learning, domain-specific dialogue systems, computer games, Q/A systems, and others.
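
A toy rendering of the restriction idea: keep as few rules of the large grammar as possible while every example sentence retains at least one parse whose rules are all kept. The miniature grammar and the brute-force search below are stand-ins for the resource grammar and the constraint solver used in the paper:

```python
from itertools import combinations

# Each example is represented by the rule sets of its possible parses in the
# big grammar; a kept sub-grammar must fully contain at least one parse per
# example. Rule names are invented for illustration.
examples = [
    [{"S->VP", "VP->V NP", "NP->Det N"},                        # "see the dog" (imperative)
     {"S->NP VP", "NP->Imp", "VP->V NP", "NP->Det N"}],
    [{"S->VP", "VP->V NP", "NP->Det N"}],                       # "see a cat"
]
all_rules = sorted(set().union(*[parse for ex in examples for parse in ex]))

def smallest_covering_subgrammar(examples, all_rules):
    """Exhaustively search for the smallest rule subset covering all examples."""
    for size in range(1, len(all_rules) + 1):
        for subset in combinations(all_rules, size):
            kept = set(subset)
            if all(any(parse <= kept for parse in ex) for ex in examples):
                return kept
    return set(all_rules)

print(sorted(smallest_covering_subgrammar(examples, all_rules)))
# ['NP->Det N', 'S->VP', 'VP->V NP']  -- the unused rules are dropped
```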

Paper Nr: 14
Title:

Unsupervised Statistical Learning of Context-free Grammar

Authors:

Olgierd Unold, Mateusz Gabor and Wojciech Wieczorek

Abstract: In this paper, we address the problem of inducing a (weighted) context-free grammar (WCFG) from given data. The induction is performed by using a new model of grammatical inference, i.e., the weighted Grammar-based Classifier System (wGCS). wGCS derives from learning classifier systems and searches the grammar structure using a genetic algorithm and covering. The weights of the rules are estimated using a novel Inside-Outside Contrastive Estimation algorithm. The proposed method employs direct negative evidence and learns a WCFG from both positive and negative samples. Results of experiments on three synthetic context-free languages show that wGCS is competitive with other statistical methods for unsupervised CFG learning.
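
The inside pass of an Inside-Outside computation can be sketched for a weighted CFG in Chomsky normal form; the toy grammar and its weights below are invented, and the contrastive estimation step itself is omitted:

```python
from collections import defaultdict

# Weighted CFG in CNF: binary rules A -> B C and lexical rules A -> word,
# each with a weight. Grammar and weights are invented for illustration.
binary = {("S", ("NP", "VP")): 1.0,
          ("NP", ("Det", "N")): 0.6,
          ("VP", ("V", "NP")): 1.0}
lexical = {("NP", "she"): 0.4, ("Det", "a"): 1.0,
           ("N", "cat"): 1.0, ("V", "sees"): 1.0}

def inside(words):
    """CKY-style inside weights: chart[(i, j, A)] = total weight of A over words[i:j]."""
    n = len(words)
    chart = defaultdict(float)
    for i, w in enumerate(words):
        for (a, word), p in lexical.items():
            if word == w:
                chart[(i, i + 1, a)] += p
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (a, (b, c)), p in binary.items():
                    chart[(i, j, a)] += p * chart[(i, k, b)] * chart[(k, j, c)]
    return chart[(0, n, "S")]

print(inside(["she", "sees", "a", "cat"]))  # inside weight of the whole sentence
```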

Short Papers
Paper Nr: 1
Title:

Mitigating Vocabulary Mismatch on Multi-domain Corpus using Word Embeddings and Thesaurus

Authors:

Nagesh Yadav, Alessandro Dibari, Miao Wei, John Segrave-Daly, Conor Cullen, Denisa Moga, Jillian Scalvini, Ciaran Hennessy, Morten Kristiansen and Omar O’Sullivan

Abstract: Query expansion is an extensively researched topic in the field of information retrieval that helps to bridge the vocabulary mismatch problem, i.e., the way users express concepts differs from the way they appear in the corpus. In this paper, we propose a query-expansion technique for searching a corpus that contains a mix of terminology from several domains - some of which have well-curated thesauri and some of which do not. An iterative fusion technique is proposed that exploits thesauri for those domains that have them, and word embeddings for those that do not. For our experiments, we used a corpus of Medicaid healthcare policies that contains a mix of terminology from the medical and insurance domains. The Unified Medical Language System (UMLS) thesaurus was used to expand medical concepts, and a word embeddings model was used to expand non-medical concepts. The technique was evaluated against Elasticsearch with no expansion. The results show an 8% improvement in recall and a 12% improvement in mean average precision.
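
A rough sketch of the fusion idea: expand a term with thesaurus synonyms when a curated thesaurus covers it, and fall back to embedding neighbours otherwise. The miniature thesaurus and the embeddings file name are placeholders, not the UMLS or the authors' model:

```python
# Sketch of thesaurus-plus-embedding query expansion. The tiny thesaurus and
# the KeyedVectors file name are hypothetical, for illustration only.
from gensim.models import KeyedVectors

thesaurus = {"myocardial infarction": ["heart attack", "MI"]}  # stand-in for UMLS
embeddings = KeyedVectors.load("domain_embeddings.kv")         # hypothetical file

def expand_term(term, top_n=3):
    if term in thesaurus:                   # curated domain: use the thesaurus
        return thesaurus[term][:top_n]
    if term in embeddings.key_to_index:     # otherwise: nearest embedding neighbours
        return [w for w, _ in embeddings.most_similar(term, topn=top_n)]
    return []

def expand_query(query_terms):
    expanded = list(query_terms)
    for term in query_terms:
        expanded.extend(expand_term(term))
    return expanded

print(expand_query(["myocardial infarction", "copayment"]))
```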

Paper Nr: 9
Title:

Disambiguating Confusion Sets in a Language with Rich Morphology

Authors:

Steinunn R. Friðriksdóttir and Anton K. Ingason

Abstract: The processing of strings which are semantically distinct but can be easily confused with each other, often on account of being pronounced identically, is a prime example of context dependency in Natural Language Processing. This problem arises when a system needs to distinguish whether a bank is a ‘river bank’ or a ‘financial institution’ and it also challenges systems for context-sensitive spelling and grammar correction because pairs like their/there and I/me are one common source of issues that such systems must address. In practice, this type of context-dependency can be especially prominent in languages with rich morphology where large paradigms of inflected word forms lead to a proliferation of such confusion sets. In this paper, we present our novel confusion set corpus for Icelandic as well as our findings from an experiment that uses well-known classification algorithms to disambiguate confusion sets that appear in our corpus.
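
A minimal sketch of context-based confusion-set disambiguation with an off-the-shelf classifier; the English their/there pair and the training contexts are invented stand-ins for the Icelandic confusion sets studied in the paper:

```python
# Toy confusion-set disambiguation: predict which member of a confusion set
# fits a context window. The examples below are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

contexts = ["over ___ in the corner",
            "they left ___ bags at home",
            "put it ___ on the table",
            "___ parents were proud"]
labels = ["there", "their", "there", "their"]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(contexts, labels)
print(model.predict(["they forgot ___ keys"]))  # picks one member of {their, there}
```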

Paper Nr: 12
Title:

Keyword Extraction in German: Information-theory vs. Deep Learning

Authors:

Max Kölbl, Yuki Kyogoku, J. N. Philipp, Michael Richter, Clemens Rietdorf and Tariq Yousef

Abstract: This paper reports the results of a study on automatic keyword extraction in German. We employed two types of methods: (A) an unsupervised method based on information theory (Shannon, 1948), for which we used (i) a bigram model, (ii) a probabilistic parser model (Hale, 2001) and (iii) an innovative model that utilises topics as extra-sentential contexts for calculating the information content of words; and (B) a supervised method employing a recurrent neural network (RNN). As baselines, we employed TextRank and the TF-IDF ranking function. The topic model (A)(iii) clearly outperformed all remaining models, including TextRank and TF-IDF. In contrast, the RNN performed poorly. We take the results as first evidence that (i) information content can be employed for keyword extraction tasks and thus has a clear correspondence to natural language semantics, and (ii) that, as a cognitive principle, the information content of words is determined from extra-sentential contexts, that is to say, from the discourse surrounding the words.
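
A small sketch of the bigram variant of the information-theoretic idea: score each word by its surprisal, -log2 P(word | previous word), under a smoothed bigram model and rank keyword candidates by that score. The toy corpus and the add-one smoothing are illustrative only:

```python
import math
from collections import Counter

# Toy surprisal-based keyword ranking from a bigram model with add-one smoothing.
corpus = [["die", "studie", "untersucht", "automatische", "schlagwortextraktion"],
          ["die", "studie", "nutzt", "ein", "bigrammmodell"]]

bigrams, unigrams = Counter(), Counter()
for sent in corpus:
    for prev, word in zip(["<s>"] + sent[:-1], sent):
        bigrams[(prev, word)] += 1
        unigrams[prev] += 1
vocab = {w for sent in corpus for w in sent}

def surprisal(prev, word):
    p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))
    return -math.log2(p)

def rank_keywords(sentence, top_n=3):
    scored = [(surprisal(prev, w), w)
              for prev, w in zip(["<s>"] + sentence[:-1], sentence)]
    return [w for _, w in sorted(scored, reverse=True)[:top_n]]

print(rank_keywords(corpus[0]))  # highest-surprisal words as keyword candidates
```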

Paper Nr: 10
Title:

Text Processing Procedures for Analysing a Corpus with Medieval Marian Miracle Tales in Old Swedish

Authors:

Bengt Dahlqvist

Abstract: A text corpus of one hundred and one Marian Miracle stories in Old Swedish, written between c. 1272 and 1430, has been digitally compiled from three transcribed sources from the 19th century. Highly specialized knowledge is needed to interpret these texts, since the medieval variant of Swedish differs significantly from the modern form of the language. The vocabulary, spelling and grammar all show substantial variation compared to modern Swedish. To advance the understanding of these texts, automated tools for text processing are needed. This paper presents a preliminary investigation of a number of strategies, such as frequency list analysis and methods for identifying spelling variations, for producing stop word lists and exposing the keywords of the texts. This can help in understanding the texts, identifying different word forms of the same word, easing lexicon lookup, and serving as a starting point for lemmatisation.
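
A simple sketch of the kind of frequency-list and spelling-variant analysis described above; the word forms below are invented examples of spelling variation, and the string-similarity threshold is just one possible grouping criterion:

```python
# Toy frequency list plus grouping of likely spelling variants by string similarity.
from collections import Counter
from difflib import SequenceMatcher

tokens = ["iomfru", "jomfru", "maria", "mirakel", "jomfru", "maria", "oc", "oc", "ok"]
freq = Counter(tokens)
print(freq.most_common())  # frequency list; the top entries are stop-word candidates

def similar(a, b, threshold=0.75):
    return SequenceMatcher(None, a, b).ratio() >= threshold

variants = [(a, b) for a in freq for b in freq if a < b and similar(a, b)]
print(variants)            # candidate pairs of spelling variants, e.g. iomfru/jomfru
```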