NLPinAI 2022 Abstracts


Area 1 - NLPinAI

Full Papers
Paper Nr: 3
Title:

Examining n-grams and Multinomial Naïve Bayes Classifier for Identifying the Author of the Text “Epistle to the Hebrews”

Authors:

Panagiotis Satos and Chrysostomos Stylios

Abstract: This work proposes a methodology for splitting and pre-processing texts in the Koine Greek dialect, examines word n-grams, character n-grams and multiple-length grams, and then suggests the best value of n for the n-grams. The Multinomial Naïve Bayes classifier is used together with the n-grams to identify the author of the text “Epistle to the Hebrews” between Paul and Luke, who are considered the most likely authors of this Epistle. To create a balanced dataset, the texts of Apostle Paul’s Epistles and the book “Acts of the Apostles” by Luke the Evangelist are used. This work aims to identify the author of the “Epistle to the Hebrews” and thereby address the theological question of its authorship.
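The pipeline the abstract describes, character n-grams fed to a Multinomial Naïve Bayes classifier, can be sketched in miniature. The from-scratch classifier and the tiny English stand-in corpora below are illustrative assumptions only; the study works on the full Koine Greek texts and tunes n rather than fixing it:

```python
import math
from collections import Counter

def char_ngrams(text, n):
    """Extract overlapping character n-grams from a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train_mnb(texts_by_author, n):
    """Collect per-author n-gram counts for a Multinomial Naive Bayes model."""
    counts = {a: Counter() for a in texts_by_author}
    for author, texts in texts_by_author.items():
        for t in texts:
            counts[author].update(char_ngrams(t, n))
    vocab = set()
    for c in counts.values():
        vocab.update(c)
    return counts, vocab

def classify(text, counts, vocab, n):
    """Assign the text to the author with the highest log-likelihood
    (uniform prior, add-one smoothing)."""
    grams = char_ngrams(text, n)
    best_author, best_score = None, -math.inf
    for author, c in counts.items():
        total = sum(c.values())
        score = sum(math.log((c[g] + 1) / (total + len(vocab))) for g in grams)
        if score > best_score:
            best_author, best_score = author, score
    return best_author

# Toy stand-ins for the Pauline Epistles and Acts (not the real data).
corpus = {
    "Paul": ["grace and peace to you from god our father"],
    "Luke": ["in my former book i wrote about all that jesus began to do"],
}
model, vocab = train_mnb(corpus, n=3)
print(classify("grace and peace be with you", model, vocab, n=3))
```

The disputed text is scored against each candidate author's smoothed n-gram distribution; the study additionally compares word-level grams and several values of n before selecting the best configuration.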

Paper Nr: 4
Title:

Transformers for Low-resource Neural Machine Translation

Authors:

Andargachew M. Gezmu and Andreas Nürnberger

Abstract: Recent advances have made neural machine translation the state of the art. However, while there are significant improvements for a few high-resource languages, its performance is still low for less-resourced languages, as the amount of training data significantly affects the quality of machine translation models. Identifying a neural machine translation architecture that can train the best models under low-data conditions is therefore essential for less-resourced languages. This research modifies the Transformer-based neural machine translation architecture for low-resource polysynthetic languages. Our proposed system outperformed a strong baseline in automatic evaluations on public benchmark datasets.
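As a rough illustration of the kind of modification the abstract describes, the sketch below contrasts a standard Transformer configuration with a hypothetical low-resource variant. The field names and all values are illustrative assumptions following common low-resource practice (shallower, narrower, more heavily regularized models), not the paper's actual settings:

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    """Hypothetical hyperparameter bundle for a Transformer NMT model."""
    encoder_layers: int
    decoder_layers: int
    attention_heads: int
    embed_dim: int
    ffn_dim: int
    dropout: float
    label_smoothing: float

# A "base"-sized Transformer as a point of reference.
BASELINE = TransformerConfig(6, 6, 8, 512, 2048, 0.1, 0.1)

# A low-resource variant: smaller capacity, stronger regularization.
LOW_RESOURCE = TransformerConfig(
    encoder_layers=4, decoder_layers=4,   # shallower stacks
    attention_heads=4,                    # fewer attention heads
    embed_dim=256, ffn_dim=1024,          # narrower model
    dropout=0.3, label_smoothing=0.2,     # heavier regularization
)
```

The intuition is that with little parallel data, a smaller and more regularized model overfits less than the default architecture.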

Paper Nr: 6
Title:

Toward a New Hybrid Intelligent Sentiment Analysis using CNN-LSTM and Cultural Algorithms

Authors:

Imtiez Fliss

Abstract: In this paper, we propose a new sentiment analysis approach based on a combination of deep learning and soft computing techniques. We use GloVe word embeddings for feature extraction. For sentiment classification, we combine a CNN and an LSTM to decide whether the sentiment of the text is positive or negative. The classifier’s hyperparameters are tuned using cultural algorithms.
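The tuning step can be sketched generically: a cultural algorithm maintains a belief space that biases where new candidates are sampled. The toy below tunes a single hypothetical hyperparameter (a learning rate) against a stand-in fitness function rather than the authors' actual CNN-LSTM validation score:

```python
import math
import random

def fitness(lr):
    """Toy stand-in for validation accuracy: peaks at lr = 0.01."""
    return -(math.log10(lr) + 2) ** 2

def cultural_search(lo=1e-4, hi=1.0, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    population = [rng.uniform(lo, hi) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        elite = ranked[: pop_size // 4]
        # Acceptance function: the elite update the belief space
        # (normative knowledge: the interval believed to hold good values).
        belief_lo, belief_hi = min(elite), max(elite)
        # Influence function: new candidates are drawn from (slightly below)
        # the belief interval, so the search can still move past its edge.
        population = elite + [rng.uniform(belief_lo * 0.5, belief_hi)
                              for _ in range(pop_size - len(elite))]
    return max(population, key=fitness)

best = cultural_search()
```

The belief space is what distinguishes a cultural algorithm from a plain evolutionary search: knowledge accepted from good individuals steers where the next generation is sampled.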

Paper Nr: 8
Title:

Automatic Word Sense Mapping from Princeton WordNet to Latvian WordNet

Authors:

Laine Strankale and Madara Stāde

Abstract: Latvian WordNet is a resource in which word senses are connected by their semantic relationships. The manual construction of a high-quality core Latvian WordNet is currently underway. However, text processing tasks require broad coverage; this work therefore aims to extend the wordnet by automatically linking additional word senses from the Latvian online dictionary Tēzaurs.lv and aligning them with the English-language Princeton WordNet (PWN). Our method needs only translation data, sense definitions and usage examples, which it compares to PWN using pretrained word embeddings and sBERT. As a result, 57,927 interlanguage links were found that can potentially be added to Latvian WordNet, with an accuracy of 80% for nouns, 56% for verbs, 67% for adjectives and 66% for adverbs.
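The core linking step, choosing the PWN sense whose definition is most similar to a dictionary sense, can be sketched with cosine similarity. The bag-of-words `embed` below is a toy stand-in for the pretrained embeddings and sBERT the paper uses, and the candidate senses are hypothetical:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; the paper uses sBERT sentence embeddings."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def link_sense(gloss_en, pwn_candidates):
    """Map a (translated) dictionary sense definition to the most
    similar PWN sense definition."""
    src = embed(gloss_en)
    return max(pwn_candidates, key=lambda s: cosine(src, embed(pwn_candidates[s])))

# Hypothetical candidate PWN senses for "bank".
pwn = {
    "bank.n.01": "sloping land beside a body of water",
    "bank.n.02": "a financial institution that accepts deposits",
}
print(link_sense("the land alongside a river or lake", pwn))
```

In the actual system the same comparison is also fed by translation data and usage examples, and the winning candidate becomes a proposed interlanguage link.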

Short Papers
Paper Nr: 1
Title:

A Machine Learning based Study on Classical Arabic Authorship Identification

Authors:

Mohamed-Amine Boukhaled

Abstract: Arabic is a widely spoken language with a rich written tradition spanning more than 14 centuries. Due to its very peculiar linguistic properties, it poses a difficult challenge for some natural language processing applications, such as authorship identification, especially in its classical form. Authorship identification work on Arabic has mainly focused on style markers derived from either lexical or structural properties of the studied texts. Despite being effective to a certain degree, these types of style markers have been shown to be unreliable for addressing authorship problems in this language. In this contribution, we present a machine learning-based study of different types of style markers for classical Arabic. Our aim is to compare the effectiveness of machine learning authorship identification using style markers that do not rely primarily on the lexical or structural dimension of language. We used three types of style markers relying mostly on syntactic information, and we report results of experiments conducted on a corpus of 700 books written by 20 eminent classical Arabic authors.
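One common family of syntactic style markers is part-of-speech n-grams, which can be sketched as author profiles of tag-bigram frequencies. The tag sequences below are hypothetical stand-ins for parsed classical Arabic text, and the paper's three marker types are not necessarily this one:

```python
from collections import Counter

def tag_bigrams(tags):
    """Syntactic style markers: adjacent part-of-speech tag pairs."""
    return Counter(zip(tags, tags[1:]))

def profile(docs):
    """Relative bigram frequencies over an author's tagged documents."""
    total = Counter()
    for tags in docs:
        total += tag_bigrams(tags)
    n = sum(total.values())
    return {bg: c / n for bg, c in total.items()}

def nearest_author(tags, profiles):
    """Attribute a text to the author whose profile is closest (L1 distance)."""
    test = profile([tags])
    def dist(p):
        keys = set(test) | set(p)
        return sum(abs(test.get(k, 0) - p.get(k, 0)) for k in keys)
    return min(profiles, key=lambda a: dist(profiles[a]))

# Hypothetical tagged samples standing in for two authors' books.
profiles = {
    "author_a": profile([["VERB", "NOUN", "ADJ", "VERB", "NOUN", "ADJ"]]),
    "author_b": profile([["NOUN", "VERB", "NOUN", "VERB", "PRON", "NOUN"]]),
}
print(nearest_author(["VERB", "NOUN", "ADJ", "VERB"], profiles))
```

Because such markers abstract away from the words themselves, they capture habitual sentence construction rather than topic or vocabulary, which is the motivation the abstract gives for moving beyond lexical and structural features.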

Paper Nr: 2
Title:

Uniform Density in Linguistic Information Derived from Dependency Structures

Authors:

Michael Richter, Maria Bardají I. Farré, Max Kölbl, Yuki Kyogoku, J. N. Philipp, Tariq Yousef, Gerhard Heyer and Nikolaus P. Himmelmann

Abstract: This pilot study addresses the question of whether the Uniform Information Density (UID) principle can be demonstrated for eight typologically diverse languages. The lexical information of words is derived from dependency structures, both in sentences preceding the sentence in which the target word occurs and within that sentence itself. Dependency structures are a realisation of extra-sentential contexts for deriving information as formulated in the surprisal model. Only subject, object and oblique, i.e., the level directly below the verbal root node, were considered. UID holds that in natural language the variance of information, and the information jumps from word to word, should be small, so as not to make the processing of a linguistic message an insurmountable hurdle. We observed cross-linguistically different information distributions but an almost identical UID, which provides evidence for the UID hypothesis and suggests that dependency structures can function as proxies for extra-sentential contexts. However, for the dependency structures chosen as contexts, the information distributions in some languages were not statistically significantly different from distributions drawn from a random corpus. This might be an effect of the low complexity of our model’s dependency structures, so lower hierarchical levels (e.g. phrases) should be considered.
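The quantity UID reasons about can be made concrete: each word carries surprisal -log2 p(word | context), and UID predicts that its variance across a sentence stays small. The sketch below uses a unigram model over a toy corpus as the probability source; the paper instead derives the probabilities from dependency-structure contexts:

```python
import math
from collections import Counter

def surprisals(sentence, counts, total):
    """Per-word surprisal -log2 p(w) under a unigram stand-in model."""
    return [-math.log2(counts[w] / total) for w in sentence]

def uid_variance(sentence, counts, total):
    """Variance of surprisal across the sentence; UID predicts it is small."""
    s = surprisals(sentence, counts, total)
    mean = sum(s) / len(s)
    return sum((x - mean) ** 2 for x in s) / len(s)

corpus = "the cat sat on the mat the dog sat on the cat".split()
counts = Counter(corpus)
total = len(corpus)

flat = uid_variance(["the", "cat", "sat"], counts, total)   # similar frequencies
spiky = uid_variance(["the", "the", "dog"], counts, total)  # one rare word
```

A sentence mixing very frequent and very rare words produces a larger variance (an "information jump") than one whose words are uniformly informative, which is the property the study measures across its eight languages.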

Paper Nr: 7
Title:

DMS: A System for Delivering Dynamic Multitask NLP Tools

Authors:

Haukur P. Jónsson and Hrafn Loftsson

Abstract: Most NLP frameworks focus on state-of-the-art models which solve a single task. As an alternative to these frameworks, we present the Dynamic Multitask System (DMS), based on native PyTorch. The DMS has a simple interface, can be combined with other frameworks, is easily extendable, and bundles model downloading with an API and a terminal client for end-users. The DMS is flexible towards different tasks and enables quick experimentation with different architectures and hyperparameters. Components of the system are split into two categories with their respective interfaces: encoders and decoders. The DMS targets researchers and practitioners who want to develop state-of-the-art multitask NLP tools and easily supply them to end-users. In this paper, we first describe the core components of the DMS and how it can be used to deliver a trained system. Second, we demonstrate how we used the DMS to develop a state-of-the-art PoS tagger and lemmatizer for Icelandic.
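The encoder/decoder split the abstract describes can be sketched as two interfaces plus a dispatcher that shares one encoder across task-specific decoders. All class and method names below are hypothetical stand-ins, not the DMS API, and the toy components replace the neural models the real system builds in PyTorch:

```python
from abc import ABC, abstractmethod

class Encoder(ABC):
    """Maps raw tokens to shared representations."""
    @abstractmethod
    def encode(self, tokens): ...

class Decoder(ABC):
    """Maps shared representations to task-specific outputs."""
    @abstractmethod
    def decode(self, states): ...

class ToyEncoder(Encoder):
    """Deterministic numeric stand-in for a neural encoder."""
    def encode(self, tokens):
        return [sum(ord(c) for c in t) % 100 for t in tokens]

class ParityTagger(Decoder):
    """Trivial stand-in for a task head such as a PoS tagger."""
    def decode(self, states):
        return ["EVEN" if s % 2 == 0 else "ODD" for s in states]

class MultitaskSystem:
    """One shared encoder feeding several task decoders."""
    def __init__(self, encoder, decoders):
        self.encoder = encoder
        self.decoders = decoders

    def run(self, task, tokens):
        return self.decoders[task].decode(self.encoder.encode(tokens))

system = MultitaskSystem(ToyEncoder(), {"parity": ParityTagger()})
tags = system.run("parity", ["hello", "world"])
```

Adding a new task then only means registering another decoder against the same encoder, which is the kind of extensibility the abstract claims for the system.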