ARTIDIGH 2020 Abstracts

Full Papers

Paper Nr:	1
Title:	Machine Learning to Geographically Enrich Understudied Sources: A Conceptual Approach
Authors:	Lorella Viola and Jaap Verheul
Abstract:	This paper discusses the added value of applying machine learning (ML) to contextually enrich digital collections. In this study, we employed ML as a method to geographically enrich historical datasets. Specifically, we used a sequence tagging tool (Riedl and Padó 2018), built on TensorFlow, to perform named entity recognition (NER) on a corpus of historical immigrant newspapers. Afterwards, the entities were extracted and geocoded. The aim was to prepare large quantities of unstructured data for a conceptual historical analysis of geographical references. The intention was to develop a method that would assist researchers working in spatial humanities, a recently emerged interdisciplinary field focused on geographic and conceptual space. Here we describe the ML methodology and the geocoding phase of the project, focussing on the advantages and challenges of this approach, particularly for humanities scholars. We also argue that, by choosing to use largely neglected sources such as immigrant newspapers (also known as ethnic newspapers), this study contributes to the debate about diversity representation and archival biases in digital practices.

Paper Nr:	5
Title:	Page Boundary Extraction of Bound Historical Herbaria
Authors:	Krishna Kumar Thirukokaranam Chandrasekar and Steven Verstockt
Abstract:	When digitizing bound historical collections such as herbaria, it is important to extract the main page region so that it can be used for automated processing. The thickness of the herbaria books also gives rise to deformations during imaging, which reduce the efficiency of automatic detection tasks. In this work we address these problems by proposing an automatic page detection algorithm that estimates all the boundaries of the page and performs morphological corrections in order to reduce deformations. The algorithm extracts features from Hue, Saturation and Value transformations of an RGB image to detect the main page polygon. The algorithm was evaluated on multiple textual and herbarium-type historical collections and obtains over 94% mean intersection over union on all these datasets. The algorithm was also subjected to an ablation test to demonstrate the importance of morphological corrections.
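For readers unfamiliar with the evaluation metric named in this abstract, the following is a minimal sketch of mean intersection over union (IoU) on binary page masks. It is an illustration only, not the authors' evaluation code: the function names `iou` and `mean_iou`, the list-of-rows mask representation, and the toy masks are all assumptions made for this example.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two equally sized binary masks.

    Each mask is a list of rows; a truthy cell means "pixel belongs to the page".
    """
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1
            if a or b:
                union += 1
    # Two empty masks agree perfectly, so treat that case as IoU = 1.0.
    return intersection / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over (predicted_mask, ground_truth_mask) pairs."""
    scores = [iou(predicted, truth) for predicted, truth in pairs]
    return sum(scores) / len(scores)


# Toy example: the two masks overlap in the middle column only.
predicted = [[1, 1, 0],
             [1, 1, 0]]
truth = [[0, 1, 1],
         [0, 1, 1]]
print(iou(predicted, truth))  # 2 shared pixels out of 6 covered pixels
```

A score of 1.0 means the detected page region matches the reference region exactly, so a mean above 0.94 indicates close agreement across a dataset.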

Paper Nr:	7
Title:	Assessing the Impact of OCR Quality on Downstream NLP Tasks
Authors:	Daniel van Strien, Kaspar Beelen, Mariona Coll Ardanuy, Kasra Hosseini, Barbara McGillivray and Giovanni Colavizza
Abstract:	A growing volume of heritage data is being digitized and made available as text via optical character recognition (OCR). Scholars and libraries are increasingly using OCR-generated text for retrieval and analysis. However, the process of creating text through OCR introduces varying degrees of error to the text. The impact of these errors on natural language processing (NLP) tasks has only been partially studied. We perform a series of extrinsic assessment tasks (sentence segmentation, named entity recognition, dependency parsing, information retrieval, topic modelling and neural language model fine-tuning) using popular, out-of-the-box tools in order to quantify the impact of OCR quality on these tasks. We find a consistent impact resulting from OCR errors on our downstream tasks, with some tasks more irredeemably harmed by OCR errors than others. Based on these results, we offer some preliminary guidelines for working with text produced through OCR.

Paper Nr:	9
Title:	An Ontology-based Approach for Building and Querying ICH Video Datasets
Authors:	Sihem Belabbes, Yacine Izza, Nizar Mhadhbi, Tri-Thuc Vo, Karim Tabia and Salem Benferhat
Abstract:	The diversity of Southeast Asian Intangible Cultural Heritage (ICH) is showcased in many art forms and notably in traditional dances. We focus on the preservation of Vietnamese ICH by building an ontology for Tamia đwa buk dances. We propose a completion of the ontology by semantically enriching traditional dance videos through manual annotation. Once annotated video datasets are built, we propose strategies for processing user queries. In particular, we address inconsistencies which emerge when the same video receives conflicting annotations from multiple sources. We also take into account different reliability levels of the sources in order to prioritize query answers.

Paper Nr:	10
Title:	Content Adaptation, Personalisation and Fine-grained Retrieval: Applying AI to Support Engagement with and Reuse of Archival Content at Scale
Authors:	Rasa Bocyte and Johan Oomen
Abstract:	Recent technological advances in the distribution of audiovisual content have opened up many opportunities for media archives to fulfil their outward-facing ambitions and easily reach large audiences with their content. This paper reports on the initial results of the ReTV research project that aims to develop novel approaches for the reuse of audiovisual collections. It addresses the reuse of archival collections from three perspectives: content holders (broadcasters and media archives) who want to adapt audiovisual content for distribution on social media; end-users who have switched from linear television to online platforms to consume audiovisual content; and creatives in the media industry who seek audiovisual content that could be used in new productions. The paper presents three use cases that demonstrate how AI-based video analysis technologies can facilitate these reuse scenarios through video content adaptation, personalisation and fine-grained retrieval.

Short Papers

Paper Nr:	4
Title:	A Multiagent Framework for Querying Distributed Digital Collections
Authors:	Jan de Mooij, Can Kurtan, Jurian Baas and Mehdi Dastani
Abstract:	Since initial digitization strategies are often inspired by existing usage of the data, and usage of archives often varies among institutes, there is considerable variation in the accessibility of digital collections. We identify four challenges that researchers may encounter when querying such collections in conjunction for research purposes, namely query formulation, alignment, source selection and lack of transparency. We present a multiagent architecture to help overcome these challenges and discuss a prototype implementation of this framework. By means of a query scenario we show the utility of using the framework for humanities researchers.

Paper Nr:	6
Title:	Transfer Learning for Digital Heritage Collections: Comparing Neural Machine Translation at the Subword-level and Character-level
Authors:	Nikolay Banar, Karine Lasaracina, Walter Daelemans and Mike Kestemont
Abstract:	Transfer learning via pre-training has become an important strategy for the efficient application of NLP methods in domains where only limited training data is available. This paper reports on a focused case study in which we apply transfer learning in the context of neural machine translation (French–Dutch) for cultural heritage metadata (i.e. titles of artistic works). Nowadays, neural machine translation (NMT) is commonly applied at the subword level using byte-pair encoding (BPE), because word-level models struggle with rare and out-of-vocabulary words. Because unseen vocabulary is a significant issue in domain adaptation, BPE seems a better fit for transfer learning across text varieties. We discuss an experiment in which we compare a subword-level to a character-level NMT approach. We pre-trained models on a large, generic corpus and fine-tuned them in a two-stage process: first, on a domain-specific dataset extracted from Wikipedia, and then on our metadata. While our experiments show comparable performance for character-level and BPE-based models on the general dataset, we demonstrate that the character-level approach nevertheless yields major downstream performance gains during the subsequent stages of fine-tuning. We therefore conclude that character-level translation can be beneficial compared to the popular subword-level approach in the cultural heritage domain.
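To make the subword-level versus character-level distinction in this abstract concrete, here is a minimal sketch of byte-pair encoding on a toy word list, set against plain character splitting. It is a simplified illustration and not the authors' pipeline: the helper names `learn_bpe`, `apply_merge` and `segment`, the tie-breaking rule, and the toy data are assumptions, and end-of-word markers used in practical BPE implementations are omitted.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of training words."""
    vocab = Counter(tuple(word) for word in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken alphabetically (an assumption).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[apply_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subword units by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


merges = learn_bpe(["abab", "abab"], num_merges=1)
print(segment("abab", merges))  # subword-level units
print(list("abab"))             # character-level units
```

A character-level model always sees the same small symbol inventory, whereas the subword units depend on the merges learned from the training corpus, which is why the two approaches can behave differently when the domain changes.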