Abstract
This article describes the publication, as Linked Open Data, of occurrences of Southern Elephant Seals Mirounga leonina (Linnaeus, 1758) in two environments (marine and coastal). The data comprise hydrographic measurements from instrumented animals and observation data collected during censuses between 1990 and 2017. The data schema is based on the previously developed ontology BiGe-Onto and the new version of the Semantic Sensor Network ontology (SSN). We introduce the network of ontologies used to organize the data and the transformation process used to publish the dataset. In the use case, we develop an application to access and analyze the dataset. The linked open dataset and the related visualization tool turn the data into a resource that can be located by the international community, thus increasing the commitment to its sustainability. The data, coming from Península Valdés (a UNESCO World Heritage site), are available for interdisciplinary studies of management and conservation of marine and coastal protected areas, which demand reliable and up-to-date data.
Introduction
In the ecology domain, research teams collect and store biological and environmental information over years or decades in database systems to answer their own queries. However, this information is isolated from other datasets with which it could interoperate and, in addition, is not ready to be accessed by machines. In marine science in particular, data collection is a process of cumulative logistic complexity, which makes it important to work on the curation and sustainability of the database in both the short and the long term. It is of great benefit for scientific institutions to publish their datasets following the Linked Data principles [7], not only for interlinking and easy cross-referencing but also for other purposes that are not foreseen at the moment of publication. The state of the art in the last decade shows that, together with the technology to collect data, semantic interoperability has further grown in importance [20]. To meet Linked Data requirements, datasets must be described with rich metadata, such as controlled vocabularies, in a particular form (RDF) and published as a findable resource with a unique identifier.
This paper integrates observational and hydrographic datasets based on the SOSA/SSN ontology [6,10] and BiGe-Onto ontology1
During the annual cycle, southern elephant seals (SES) come ashore to breed and molt. The rest of the year they are at sea, traveling long distances throughout their extensive migration (up to 8 months and a round trip of 12,000 km) and diving continuously to depths of 1500 meters or more. During their terrestrial phase they frequently revisit previous years’ sites [13,14]. Their behavior during the marine phase makes SES ideal carriers of devices that provide physical (i.e., hydrographic) profiles of the water column. To track SES at sea, researchers use miniaturized animal-attached tags that relay data, a field known as biologging [19], which covers animal migration and oceanographic measurements [1]. The instruments deployed on the seals return, at low cost, large volumes of hydrographic data from regions never studied directly by buoys or oceanographic vessels, and collect large amounts of information associated with key habitats in the South Atlantic Ocean.
This paper is organized as follows: Section 2 describes the SES database. Section 3 briefly presents the network of ontologies. Section 4 shows examples of how the data are organized. Section 5 describes the population process and the links to other datasets. Section 6 shows the application developed to access and analyze the dataset. Finally, we conclude by presenting an analysis of our work and perspectives.
Data are recorded from measurements of physical variables and locations obtained in two different stages. The first stage involves an annual census, which takes place during the breeding season of SES. The second stage starts at the end of the breeding season, when SES go back to sea to forage. Below we briefly describe how data are generated and recorded in each stage. During the breeding season, SES haul out onto the beach to breed. The annual census on foot along the coast of the colony is arduous but indispensable work for determining the distribution and trend of the population. The objective is to count each of the harems scattered along the beaches of Península Valdés (PV) to determine the number of offspring born in a season. Counts are carried out over 2–3 days at the peak of the breeding season (October 3–7), when most of the population is ashore. All breeding groups are counted and located along 200 km of coastline, which is divided into sections, and each census taker is assigned to a route. The census taker must count the animals and classify them by sex and age: males, females, and pups. Hereinafter, we call the procedure of counting individuals in a certain place an Occurrence. Each occurrence is georeferenced (latitude and longitude), and the demographic data include date and time, group size, and the substrate where the SES are located. All information about these censuses is recorded in a field book and then uploaded into a MySQL database. Table 1 summarizes the most relevant fields for the conducted censuses.
Main fields registered during a census
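To make the shape of a census occurrence concrete, the following is a minimal sketch of one record as a Python data class. The field names are illustrative assumptions and not the column names of the MySQL database; the sketch also assumes that group size is the sum of the three counted categories, which the text does not state explicitly. The coordinates in the usage example are made up.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Occurrence:
    """One census count at one georeferenced place (field names are hypothetical)."""
    latitude: float        # decimal degrees, negative in the southern hemisphere
    longitude: float       # decimal degrees, negative west of Greenwich
    recorded_at: datetime  # date and time of the count
    substrate: str         # substrate on which the group was found
    males: int
    females: int
    pups: int

    def __post_init__(self):
        # Reject values that cannot be valid coordinates or counts.
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError("latitude out of range")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError("longitude out of range")
        if min(self.males, self.females, self.pups) < 0:
            raise ValueError("counts must be non-negative")

    @property
    def group_size(self) -> int:
        # Assumption: group size is the total over the three categories.
        return self.males + self.females + self.pups


if __name__ == "__main__":
    # Illustrative values only, not taken from the dataset.
    example = Occurrence(-42.5, -63.6, datetime(2017, 10, 4, 10, 30), "sand", 1, 12, 9)
    print(example.group_size)
```

Validating coordinates and counts at construction time keeps malformed field-book entries from reaching a later RDF conversion step.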
At the end of the breeding season, SES return to sea to forage. The trip is monitored by small computers designed and built by Wildlife Computers Inc.2
Main fields registered during SES diving
The census and the deployments of the instruments are carried out by the research team of the Centre for the Study of Marine Systems, based in Puerto Madryn, Patagonia, Argentina (CESIMAR-CENPAT-CONICET).3
In this section, we briefly summarize the ontologies used for the publication of our dataset, indicating which concepts are reused.
The core of our ontology network is composed of SOSA/SSN [6] and BiGe-Onto [23], which can be used jointly for both hydrographic profiles and observational data. These ontologies are linked to others describing different sub-domains, thus forming the network. The resulting network is composed mainly of the following:
an ontology to describe the sensors used to measure hydrographic profiles
an ontology to describe SES occurrences made during censuses
an ontology to describe the associated measures
an ontology to describe the locations and places of interest
an ontology to describe temporality of events
an ontology to describe scientific publications
We reuse only the elements from these ontologies that are necessary for modeling our data, adopting a soft reuse strategy [4] instead of importing the whole ontologies.
The prefixes and their corresponding URIs are listed in Table 3.
Reused vocabularies and ontologies
Semantic Sensor Network (SSN) ontology
The Semantic Sensor Network (SSN) ontology is a generic ontology for sensor observations. It has been updated to become a W3C recommendation, and it now includes a lightweight core dedicated to sensor and actuator descriptions, called the Sensor, Observation, Sample, and Actuator (SOSA) pattern. The link between SSN and SOSA is described in [6]. The classes we have reused from the SOSA/SSN ontology are:
We have also reused the main properties associated with these classes:
BiGe-Onto ontology
BiGe-Onto is an ontology designed for modeling biodiversity and marine biogeography data [23]. Its main concept is an
Since BiGe-Onto mainly describes occurrences, which depend on other concepts to exist, we also outline below some of the most important properties defined for relating such occurrences:
The Quantity, Unit, Dimension and Type (QUDT) ontologies
QUDT is a collection of OWL ontologies and vocabularies [9]. The QUDT schema defines the base classes, properties, and restrictions used for modeling physical quantities, units of measure, and their dimensions in various measurement systems. QUDT also contains a set of vocabularies to define units for different domains. We have reused the unit vocabulary that categorizes units in different classes. This vocabulary also provides individuals to identify units such as
GeoSPARQL ontology
GeoSPARQL [18] is an Open Geospatial Consortium (OGC) standard for supporting the representation and querying of geospatial data on the Semantic Web. As such, it is based on the OGC’s Simple Features model, with some adaptations for RDF. GeoSPARQL designates a vocabulary for representing geospatial data in RDF, and it defines an extension to the SPARQL query language5
A feature is simply any entity in the real world with some spatial location.
A geometry is any geometric shape, such as a point, polygon, or line, and is used as a representation of a feature’s spatial location.
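The feature and geometry distinction can be illustrated with a small sketch that emits Turtle for one feature and its point geometry. The terms geo:Feature, geo:hasGeometry, geo:asWKT, geo:wktLiteral and sf:Point belong to the GeoSPARQL and Simple Features vocabularies; the URIs passed to the function and the "/geometry" suffix are hypothetical and do not reproduce the URI patterns of the dataset. A WKT literal without an explicit reference system is interpreted in CRS84, so longitude is written before latitude.

```python
GEO = "http://www.opengis.net/ont/geosparql#"
SF = "http://www.opengis.net/ont/sf#"


def point_wkt(longitude, latitude):
    """Return a WKT point; CRS84 axis order is longitude, then latitude."""
    return f"POINT({longitude} {latitude})"


def feature_to_turtle(feature_uri, longitude, latitude):
    """Describe a feature and its point geometry in Turtle."""
    # Hypothetical convention: the geometry URI extends the feature URI.
    geometry_uri = feature_uri + "/geometry"
    return "\n".join([
        f"@prefix geo: <{GEO}> .",
        f"@prefix sf: <{SF}> .",
        "",
        f"<{feature_uri}> a geo:Feature ;",
        f"    geo:hasGeometry <{geometry_uri}> .",
        "",
        f"<{geometry_uri}> a sf:Point ;",
        f'    geo:asWKT "{point_wkt(longitude, latitude)}"^^geo:wktLiteral .',
    ])


if __name__ == "__main__":
    # Illustrative URI and coordinates, not taken from the dataset.
    print(feature_to_turtle("http://example.org/place/1", -63.6, -42.5))
```

Keeping the geometry as a separate resource lets one feature carry several representations, for example a point for a census site and a polygon for the beach section it belongs to.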
The W3C time ontology
The W3C Time ontology [2] enables the description of time instants and intervals. Hence it may be useful when we need to describe the timestamp, or the time associated with the measurements made by the observers of the SES. We reuse the classes
FRBR-aligned Bibliographic Ontology (FaBiO)
The aim of FaBiO [16] is to record and publish descriptions of entities that are published or potentially publishable, and that contain or are referred to by bibliographic references. Its classes are structured according to the FRBR schema of Works, Expressions, Manifestations and Items. Additional properties have been added to extend the FRBR data model by linking works and manifestations. Considering that both census observations and measurements of physical variables have been used to publish results in diverse scientific journals, we have chosen the FaBiO ontology for modeling the relationships between platforms and such publications. Our choice is based on the fact that FaBiO is one of the ontologies involved in the OpenCitations8
NERC and GoodRelations ontologies
Additionally, we reuse the vocabulary for Oceanography known as Natural Environment Research Council (
It is important to highlight that the previous ontologies were chosen because many of them are W3C standards (such as SOSA/SSN and OWL Time), standards of the Open Geospatial Consortium (OGC) such as GeoSPARQL, or vocabularies that are widely used by the user community involved in the domain (e.g., Darwin Core or NERC). In the absence of a standard ontology for a specific domain, we decided to use those that, to our knowledge, are under active development, are supported, and are adequately documented, such as FaBiO and GoodRelations.
Based on the network of ontologies described in the previous section, we are now able to create a dataset containing all the individuals that describe the hydrographic profiles and the occurrences recorded during the censuses. We now explain the decisions taken to create resource URIs and provide examples of resource descriptions.
Resource URIs for hydrographic profiles
This subsection presents the main URI design decisions and conventions used. Table 4 provides a summary of the main types of URIs that we generate. The first column presents the type of resource. The second column indicates the class of which the resources are instances. The last column contains the name pattern used to generate the resource URIs. The base URI for our dataset is
URI generation templates for resources. The first part of the table describes the patterns of URIs related to hydrographic profiles, while the second describes those related to occurrence data. Longitude and latitude numbers are expressed in decimal

SES platform and TDR sensor description.
We consider SES as an oceanographic sampling platform. The individual that represents the platform is an instance of the
Observation
An observation describes the context of a measurement made by a sensor. In the case of TDRs, the measurements are location, time, depth and temperature. Properties

Example of observation made by TDR on location of the sea.
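As a rough illustration of how a single sensor reading maps onto SOSA, the sketch below serializes one observation as Turtle. The properties sosa:madeBySensor, sosa:observedProperty, sosa:hasFeatureOfInterest, sosa:hasSimpleResult and sosa:resultTime are defined by SOSA, but choosing hasSimpleResult with an xsd:double value is an assumption made for brevity; the dataset may attach units through a structured result instead. All URIs and values in the usage example are hypothetical.

```python
from datetime import datetime, timezone

SOSA = "http://www.w3.org/ns/sosa/"
XSD = "http://www.w3.org/2001/XMLSchema#"


def observation_to_turtle(obs_uri, sensor_uri, property_uri, feature_uri,
                          value, result_time):
    """Serialize one sensor reading as a sosa:Observation in Turtle."""
    if result_time.tzinfo is None:
        raise ValueError("result_time must be timezone-aware")
    return "\n".join([
        f"@prefix sosa: <{SOSA}> .",
        f"@prefix xsd: <{XSD}> .",
        "",
        f"<{obs_uri}> a sosa:Observation ;",
        f"    sosa:madeBySensor <{sensor_uri}> ;",
        f"    sosa:observedProperty <{property_uri}> ;",
        f"    sosa:hasFeatureOfInterest <{feature_uri}> ;",
        # Assumption: a plain numeric literal is enough for this sketch.
        f'    sosa:hasSimpleResult "{value}"^^xsd:double ;',
        f'    sosa:resultTime "{result_time.isoformat()}"^^xsd:dateTime .',
    ])


if __name__ == "__main__":
    # Illustrative URIs and values, not taken from the dataset.
    print(observation_to_turtle(
        "http://example.org/observation/1",
        "http://example.org/sensor/tdr-1",
        "http://example.org/property/temperature",
        "http://example.org/feature/sea-water",
        12.4,
        datetime(2010, 11, 3, 14, 5, tzinfo=timezone.utc),
    ))
```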

Example of an interval representing a dive of 10.94 seconds duration.
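A dive interval of this kind could be expressed with the W3C Time ontology roughly as follows. The sketch uses time:Interval, time:hasBeginning, time:Instant, time:inXSDDateTimeStamp and time:hasXSDDuration, all of which exist in OWL-Time; whether the dataset uses these exact properties is an assumption, and the interval URI and start time in the usage example are hypothetical. Only the 10.94 second duration comes from the caption above.

```python
from datetime import datetime, timezone
from decimal import Decimal

TIME = "http://www.w3.org/2006/time#"
XSD = "http://www.w3.org/2001/XMLSchema#"


def xsd_duration(seconds):
    """Format a non-negative number of seconds as an xsd:duration value."""
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    # Decimal(str(...)) avoids binary float noise; "f" avoids exponent notation.
    text = format(Decimal(str(seconds)).normalize(), "f")
    return f"PT{text}S"


def interval_to_turtle(interval_uri, begin, seconds):
    """Describe an interval by its beginning instant and its duration."""
    if begin.tzinfo is None:
        raise ValueError("begin must be timezone-aware")
    return "\n".join([
        f"@prefix time: <{TIME}> .",
        f"@prefix xsd: <{XSD}> .",
        "",
        f"<{interval_uri}> a time:Interval ;",
        "    time:hasBeginning [ a time:Instant ;",
        f'        time:inXSDDateTimeStamp "{begin.isoformat()}"^^xsd:dateTimeStamp ] ;',
        f'    time:hasXSDDuration "{xsd_duration(seconds)}"^^xsd:duration .',
    ])


if __name__ == "__main__":
    # Illustrative URI and start time; the duration is the one in the caption.
    start = datetime(2010, 11, 3, 14, 5, tzinfo=timezone.utc)
    print(interval_to_turtle("http://example.org/interval/1", start, 10.94))
```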
Figure 3 presents an observation produced by the TDR. Sometimes a measurement is related to a period of time. For example, the TDR measures the duration of a dive during an immersion. The property

Representation of an observation (
It is true that the SES census could be modeled using SOSA/SSN, because a sosa:Sensor can be a human observer rather than electronic equipment. However, we decided not to use SOSA/SSN for the census; instead we used BiGe-Onto, since it was created to model species occurrences by means of Darwin Core (DwC). We believe that sharing the results in an interdisciplinary way requires respecting the standard adopted by the biodiversity community. Using DwC will also allow this part of the dataset to be reused for more complex analyses, such as marine spatial planning. We use the class
Publication
Each publication is represented as an instance of

Representation of a publication associated with the AAPR platform, the DOI is described using
To create Linked Open Data, the data contained in the SES database must be converted into RDF. As explained in Section 2, the measurements produced by sensors and the census data are stored in a MySQL server. Fields that are no longer used or that contain confidential data (for example, data that are still being processed) are excluded. The transformation process is done by D2RQ Platform,10
D2RQ runs at
Dataset key statistics
The external links were generated manually, using a MySQL table specifically created for this purpose, which is then mapped using D2RQ. One column of this table holds the URI of a concept belonging to our dataset, and another column holds the equivalent URI in the external dataset. For example,
When possible, in the case of publications, the instances of
We use Geonames,11
To link instances of people (
The SES dataset can be downloaded, navigated and queried using a SPARQL endpoint, and it is published under the Creative Commons Universal Public Domain Dedication (CC0 1.0)13
Technical details
To explore the dataset through the SPARQL endpoint, we have developed a set of queries that answer the most common questions researchers ask, for example the number of dives, the number of trips, and the values of certain environmental variables. Table 7 shows the developed queries and their corresponding links to the endpoint.
Predefined queries to explore the data set
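Queries of this kind can also be sent programmatically. The following sketch builds a SPARQL 1.1 Protocol GET request with the Python standard library and flattens the JSON results. The endpoint URL is a placeholder, and the query, which counts observations per sensor, is an illustrative assumption rather than one of the predefined queries.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint URL; replace it with the real SPARQL endpoint.
ENDPOINT = "http://example.org/sparql"

# Illustrative query: number of observations made by each sensor.
OBSERVATIONS_PER_SENSOR = """\
PREFIX sosa: <http://www.w3.org/ns/sosa/>
SELECT ?sensor (COUNT(?obs) AS ?n)
WHERE {
  ?obs a sosa:Observation ;
       sosa:madeBySensor ?sensor .
}
GROUP BY ?sensor
ORDER BY DESC(?n)
"""


def build_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})


def rows_from_json(payload):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    document = json.loads(payload)
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in document["results"]["bindings"]
    ]


def run_query(endpoint, query, timeout=30):
    """Send the query over HTTP and return the result rows (needs network access)."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return rows_from_json(response.read().decode("utf-8"))
```

Separating request construction and result parsing from the network call makes both steps easy to check offline before pointing run_query at a live endpoint.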
One crucial aspect is how to access and analyze data, and especially how to get only the part of the data that is of interest for a given research question. To show the exploitation of the dataset, we developed a dashboard 14
This module summarizes the diving statistics (maximum depths recorded, number of dives, maximum temperatures and number of platforms). The information for each of the sensors used is also detailed. For bar charts, the ggplot18
This module allows you to see, by platform, the most important variables registered during dives. Temperatures and depths, as well as durations, can be displayed. The line chart was built using the plot_ly library.19
This module retrieves the trips made by each platform and displays them on a map generated with the leaflet library,20
This module allows analyzing the data of the censuses carried out between 1990 and 2017. Two charts were developed with ggplot: the first shows the annual population of SES grouped by category, while the second shows the trend of the SES breeding population.
This paper presents the publication as LOD of a biological and physical dataset collected over more than 20 years and stored with the original objective of studying the influence of the environment on the foraging, reproductive performance, and population trend of the SES. The dataset was initially available only to a small research group, and the aim is to make it available to a global community. Our development improves the discoverability of the database content and could be applied to new knowledge-building and cross-disciplinary research. For example, we expect the hydrographic profiles to become a useful tool, together with physical samples resulting from other science programs, for assessing ocean changes associated with climate change.
The dataset comes from PV, a geographic region under UNESCO conservation regulations, and there is a continuous demand from governmental authorities to develop spatial planning. This requirement supports the sustainability of the database, because it calls for a high level of accuracy in the SES data and for user-friendly access to other databases. Coastal Management Planning and Marine Spatial Planning (MSP) are concerned with managing the distribution of human activities in space and time in and around seas and oceans to achieve ecological, economic and societal objectives and outcomes [3]. The next steps will be to promote the use of vocabulary terms for discovering the databases of the CESIMAR institute, so that data are available and suitable for use in the regular review cycles of the MSP process. In addition, it would be desirable to access the physical datasets collected by tourist and commercial vessels that cover the same range in the southwest Atlantic Ocean. These hydrographic profiles could capture environmental changes across the whole area of influence of the SES distribution.
In particular, this work shows the feasibility of using the SSN/SOSA ontology for modeling hydrographic measurements of instrumented animals and observation data collected during censuses. Since the SSN/SOSA ontology is already a W3C standard, we regard this analysis as a valuable step towards defining the precise semantics of ocean biodiversity systems, which requires a collaborative effort.
The research also gave us useful insights into the process of developing and publishing data as LOD. First of all, valuable raw data can be highly heterogeneous and, as in this case, stored in relational databases developed by third parties that may no longer be maintained. Understanding how these data have been structured is therefore one of the first barriers. Automated processes for bootstrapping SPARQL-to-SQL mappings, such as those provided by D2RQ itself or related technologies, fail in their attempts to generate such mappings automatically from the content of the data sources. Thus, much of this work has to be done manually.
Ontology reuse is another challenging task, particularly in contexts where standard ontologies have not yet been developed. Our process therefore required a trade-off between the needed conceptualization (concerning the domain) and the availability of reusable ontologies. The strategy for reusing ontologies, as defined in [4], is another important obstacle: reusing concepts or properties from external ontologies (soft reuse) seems to be a good strategy for bootstrapping LOD datasets from scratch, but it is limited for reasoning purposes. Consequently, and looking toward future work, other strategies must be considered to increase the expressiveness of the model without introducing unnecessary complexity.
Acknowledgements
This work would not have been possible without the immense task of digitizing and ordering the data carried out by the computer scientist Maria Rosa Marin (recently retired). We are grateful to Claudio Campagna, PhD, founder of the program, for his vision of long-lasting studies, which made it possible to maintain the collection of records.
