Incremental RML generation in YARRRML

Unofficial Draft

Latest published version:
https://rml.io/yarrrml/spec/incrml/
Latest editor's draft:
https://rml.io/yarrrml/spec/incrml/
Editor:
Gerald Haesendonck

Abstract

Linked Data Event Streams (LDES) is an advanced Knowledge Graph (KG) publication specification aimed at continuous replication and synchronization of data sources, with benefits such as versioning of data entities and history retention, while providing a self-descriptive API. However, building an LDES requires a high level of expertise in the Semantic Web ecosystem. This specification provides an extension to YARRRML, a human-friendly way to configure KG generation via RML. The extension offers an easy-to-use starting point for anyone who wants to create an LDES from non-semantic data.

1. General overview

This full-fledged example uses all available options and shows how to generate IncRML with YARRRML.

sources:
  data-source-1: a data source
  data-source-2: another data source

targets:
  # an "ordinary" target
  my-boring-target:
    access: out.nq
    type: localfile
    serialization: nquads
  # an LDES target
  my-special-target:
    access: out-ldes.nq
    type: localfile
    serialization: nquads
    ldes:   # LDES specific keys
      id: https://my-ldes.org/the-one-and-only-ldes  # The identifier of the Event Stream object.
      timestampPath: dcterms:created     # optional, default = dcterms:created.
      versionOfPath: dcterms:isVersionOf # optional, default = dcterms:isVersionOf.
      generateImmutableIRI: false        # optional, default = false. If true, turn the member subject IRI into a unique one.

mapping:
  general-mapping:

    # Here come some existing YARRRML rules
    sources: data-source-1
    graphs: ex:some-graph
    subjects:
      - value: ex:$(ObservationID)
        targets: my-boring-target
    po:
      - some-PO-mappings: fun!

    # Here's where the magic happens.
    # Specify what to do when certain changes in data are detected.
    # You can specify any combination of 'create', 'update' and 'delete' here
    changeDetection:

      # The operation to perform when change is detected.
      # Can be `create`, `update` or `delete`.
      # In this case the explicit creation of new data objects
      create:

        # The type of operation: is the create operation explicit or not?
        # Optional.
        #  `true` = explicitly advertised by the data source (default)
        #  `false` = implicitly advertised by the data source
        # See more explanation in section "Change detection".
        explicit: true

        # Optional. Things that will be *added to* the current mapping for this operation
        mappingAdd:
          sources: data-source-2
          graphs: ex:a-second-graph
          subjects:
            - value: ex:$(Sensor)/$(ObservationID)
              targets: my-special-target
          po: [even more fun]

        # Optional. Things that will be removed (ignored) from the original mapping
        mappingRemove:
          subjects: []  # The empty list means "all subjects"
          graphs: []
          po: []
          sources: []

      # Here's an "implicit update" example:
      update:
        explicit: false

        # References to data attributes that trigger an update when they change.
        watchedProperties: [$(temperature)]

        mappingAdd:
          # Add a graph to the original mapping.
          graphs: ex:update

          # Add a target to the original subjects.
          subjects:
            - targets: my-special-target

        # Remove the original graphs at mapping level
        mappingRemove:
          graphs: []

      # The "delete" operation works the same.

2. Notes, remarks

The delete operation removes all predicate-object (PO) maps from the generated triples map. By adding PO maps with mappingAdd, you can generate RDF that provides hints or classifications, e.g. ex:id4 ex:currentState <deleted> . If no po elements are added through mappingAdd, all original rdf:type POs are kept.

3. Examples of explicit / implicit operation

Here are some brief examples.

# explicit create with default options
create:
  explicit: true
# implicit create with default options
create:
  explicit: false
# explicit create with a specific data source
create:
  explicit: true
  mappingRemove:
    sources: []
  mappingAdd:
    sources: create-source
# implicit update with properties to watch for change.
# Results in using the default `implicitUpdate` IDLab function
# to check the properties.
update:
  explicit: false
  watchedProperties: [$(temperature)]
# Custom change detection can be accomplished by just using
# functions at subjects level without specifying changeDetection.
mappings:
  general-mapping:
    subjects:
      - function: idlab-fn:implicitUpdate
        parameters:
          - [idlab-fn:watchedProperty, $(temperature)]

4. Change detection

To model changes in data (e.g., modeling a stream of events), we introduce the changeDetection key. It specifies how to detect and act upon changes in the data of a certain mapping.


mapping:
  myMapping:
    subjects: subject mappings
    predicateObjects: predicate-object mappings
    graphs: graph mappings

    changeDetection:
      ... # details of the change detection...

How changes are detected depends on the data source: it can publish changes explicitly or implicitly. The types of changes are create, update, and delete, which gives six combinations to handle: explicit and implicit create, explicit and implicit update, and explicit and implicit delete.

In YARRRML this is defined by an operation key (create, update, or delete) and a boolean explicit sub-key.

Explicit create:

changeDetection:
  create:
    explicit: true

Implicit create:

changeDetection:
  create:
    explicit: false

Explicit update:

changeDetection:
  update:
    explicit: true

Implicit update:

changeDetection:
  update:
    explicit: false

Explicit delete:

changeDetection:
  delete:
    explicit: true

Implicit delete:

changeDetection:
  delete:
    explicit: false

By default, a change is detected from the presence or absence of the IRI generated by the subjects mappings.

This translates to one of the IDLab functions explicitCreate, implicitCreate, explicitUpdate, implicitUpdate, explicitDelete, and implicitDelete applied to the subject mapping.
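
The naming pattern behind these six functions can be sketched in a few lines of Python. This only illustrates the naming: the helper idlab_function_name is hypothetical, and the idlab-fn: prefix is taken from the custom change detection example in section 3.

```python
OPERATIONS = ("create", "update", "delete")


def idlab_function_name(operation: str, explicit: bool = True) -> str:
    """Return the IDLab function name for an operation key and its explicit flag.

    Hypothetical helper: it only mirrors the naming pattern and calls nothing.
    `explicit` defaults to True, like the YARRRML sub-key.
    """
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    prefix = "explicit" if explicit else "implicit"
    # "create" -> "Create", so the result reads e.g. "implicitUpdate".
    return f"idlab-fn:{prefix}{operation.capitalize()}"
```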

The next table illustrates what a subject mapping with change detection generates:

Run  Incoming IRI   Create (e)     Create (i)     Update (e)     Update (i)     Delete (e)     Delete (i)
1    example.org/1  example.org/1  example.org/1  example.org/1  x              example.org/1  x
2    example.org/2  example.org/2  example.org/2  example.org/2  x              example.org/2  example.org/1
3    example.org/1  x              x              x              example.org/1  x              example.org/2

(e) = explicit, (i) = implicit, x = no IRI generated.

Explicit and implicit create behave the same: an IRI that has not been seen yet is considered new and is generated by the subject mapping.

Explicit update and explicit delete treat the incoming data as updates or deletes, respectively, and the subject mapping generates the corresponding IRIs. Duplicates are ignored: their IRIs have already been updated or deleted by the data source.

Implicit update treats only IRIs that have been seen before as updates; the subject mapping generates them. To also take into account changes in data fields that are not used to generate the subject IRI, see watchedProperties.

Implicit delete considers an IRI deleted when it is not encountered in the next run. The subject mapping generates those deleted IRIs.
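
The behaviour described above can be reproduced with a small Python model. It is a sketch under assumptions, not the implementation of the IDLab functions: each operation is evaluated independently over the same three runs, state is kept in plain in-memory sets, and the names simulate, seen, and previous are made up for this example.

```python
def simulate(runs):
    """Return, per run, the IRIs each change detection variant would generate."""
    seen = set()      # IRIs encountered in any earlier run
    previous = set()  # IRIs encountered in the run just before this one
    results = []
    for incoming in runs:
        current = set(incoming)
        unseen = sorted(current - seen)
        results.append({
            # Explicit and implicit create: only IRIs not seen before.
            "create": unseen,
            # Explicit update/delete: incoming IRIs, duplicates ignored.
            "explicit_update": unseen,
            "explicit_delete": unseen,
            # Implicit update: only IRIs that were seen before.
            "implicit_update": sorted(current & seen),
            # Implicit delete: IRIs of the previous run that are now missing.
            "implicit_delete": sorted(previous - current),
        })
        seen |= current
        previous = current
    return results


runs = [["example.org/1"], ["example.org/2"], ["example.org/1"]]
for number, row in enumerate(simulate(runs), start=1):
    print(number, row)
```

For the three runs of the table, this model generates example.org/1 for implicit update in run 3 only, and example.org/1 and example.org/2 for implicit delete in runs 2 and 3, respectively.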

5. watchedProperties

Implicit changes sometimes require monitoring certain properties or attributes of the data that are not used for subject IRI generation.

For example, consider this initial dataset:

sensorID temperature
1 15.1
2 14.9

We'd map this in YARRRML as an implicit create, because an update would use the same sensor IDs.

mapping:
  temperatures:
    subjects: https://thermometer.net/sensor_$(sensorID)
    po:
      - [ex:temperature, $(temperature)]
    changeDetection:
      create:
        explicit: false

Next we get an update of this data set:

sensorID temperature
1 17
2 14.9

Notice that sensor 1 changes its reading while sensor 2 stays the same. We want to capture the change in sensor 1's value.

We'd map this in YARRRML as an implicit update:

mapping:
  temperatures:
    subjects: https://thermometer.net/sensor_$(sensorID)
    po:
      - [ex:temperature, $(temperature)]
    changeDetection:
      create:
        explicit: false
      update:
        explicit: false

This is not enough, though: this mapping would generate no triples for the updated dataset, because the sensor IDs remain the same and no change is detected in the subject IRI.

To fix this, add a watchedProperties key that specifies temperature as a property to monitor for changes:

mapping:
  temperatures:
    subjects: https://thermometer.net/sensor_$(sensorID)
    po:
      - [ex:temperature, $(temperature)]
    changeDetection:
      create:
        explicit: false
      update:
        explicit: false
        watchedProperties: [$(temperature)]

When processing the updated dataset the temperature change of sensor 1 will be detected and a new triple will be generated.
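
As a rough illustration of the idea, the following Python sketch remembers the watched values per subject IRI and reports an IRI when those values differ from the previous run. How idlab-fn:implicitUpdate actually stores and compares state is not described here, so the dictionary-based state and the names detect_updates and sensor_iri are assumptions.

```python
def sensor_iri(row):
    # Mirrors the subject template https://thermometer.net/sensor_$(sensorID).
    return f"https://thermometer.net/sensor_{row['sensorID']}"


def detect_updates(rows, state, watched):
    """Return IRIs whose watched values changed since the previous call.

    `state` maps IRI -> tuple of watched values and is updated in place.
    """
    updated = []
    for row in rows:
        iri = sensor_iri(row)
        values = tuple(row[name] for name in watched)
        if iri in state and state[iri] != values:
            updated.append(iri)
        state[iri] = values
    return updated


state = {}
initial = [{"sensorID": "1", "temperature": "15.1"},
           {"sensorID": "2", "temperature": "14.9"}]
update = [{"sensorID": "1", "temperature": "17"},
          {"sensorID": "2", "temperature": "14.9"}]

print(detect_updates(initial, state, ["temperature"]))
print(detect_updates(update, state, ["temperature"]))
```

With the two datasets above, the first call reports nothing and the second reports only sensor 1, whose temperature changed.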

6. Modifying the original mappings

It is possible to act differently on different changes. For example, changes could be written to a specific named graph or to another target, or certain predicateObject mappings could be skipped for certain changes. This section describes the possibilities.

Modifications in the original mappings can be specified with the mappingAdd and mappingRemove sub-keys of changeDetection.
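
One way to picture how the two sub-keys combine is the Python sketch below. It is a simplified model, not the behaviour of any YARRRML parser: a mapping is reduced to a dictionary of lists, mappingRemove is applied before mappingAdd, and the handling of a non-empty removal list is an assumption (only the empty list, meaning "all", is described in this document). The function name derive_mapping is hypothetical.

```python
def derive_mapping(original, mapping_add=None, mapping_remove=None):
    """Return the mapping used for one operation, leaving `original` untouched."""
    derived = {key: list(values) for key, values in original.items()}
    for key, values in (mapping_remove or {}).items():
        if not values:
            # The empty list means "remove all entries of this kind".
            derived[key] = []
        else:
            # Assumption: a non-empty list removes only the listed entries.
            derived[key] = [v for v in derived.get(key, []) if v not in values]
    for key, values in (mapping_add or {}).items():
        derived.setdefault(key, []).extend(values)
    return derived


original = {"sources": ["data-source-1"], "graphs": ["ex:some-graph"]}
for_update = derive_mapping(
    original,
    mapping_add={"graphs": ["ex:update"]},
    mapping_remove={"graphs": []},
)
print(for_update)
```

In this model, the update operation of the general overview ends up with ex:update as its only graph, while the sources of the original mapping are kept.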

6.1 Example: modifying source mappings

Suppose we process a stream of messages like these:

{
  "create": [{"fruit": "apple", "colour": "green"}, {"fruit": "orange", "colour": "orange"}],
  "update": [],
  "delete": []
}
{
  "create": [{"fruit": "melon", "colour": "yellow"}],
  "update": [{"fruit": "apple", "colour": "red"}],
  "delete": []
}
{
  "create": [{"fruit": "pear", "colour": "green"}],
  "update": [],
  "delete": [{"fruit": "apple"}, {"fruit": "melon"}]
}

All changes are explicitly advertised in separate JSON keys for every type of change (what a coincidence!).

One way of specifying this in YARRRML is to remove the original source(s) and add different sources, since each has a different iterator:

# Different iterators, so different sources:
sources:
  create-source: [message.json~jsonpath, $.create.*]
  update-source: [message.json~jsonpath, $.update.*]
  delete-source: [message.json~jsonpath, $.delete.*]

mappings:
  fruits:   # mapping name added for illustration

    s: http://fruit.org/$(fruit)

    po:
      - [ex:colour, $(colour)]

    changeDetection:
      create:
        explicit: true
        mappingAdd:
          sources: create-source
        mappingRemove:
          sources: []
      update:
        explicit: true
        mappingAdd:
          sources: update-source
        mappingRemove:
          sources: []
      delete:
        explicit: true
        mappingAdd:
          sources: delete-source
        mappingRemove:
          sources: []

A more concise way of achieving the same result is to "update" the iterator of the source mapping:

mappings:
  fruits:   # mapping name added for illustration

    # This source will be updated by the operations:
    sources: [message.json~jsonpath, $.*]
    s: http://fruit.org/$(fruit)
    po:
      - [ex:colour, $(colour)]

    changeDetection:
      create:
        explicit: true
        mappingAdd:
          sources:
            - iterator: $.create.*
      update:
        explicit: true
        mappingAdd:
          sources:
            - iterator: $.update.*
      delete:
        explicit: true
        mappingAdd:
          sources:
            - iterator: $.delete.*
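
To make the effect of the three iterators concrete, this Python sketch selects the records that $.create.*, $.update.*, and $.delete.* would iterate over and builds the subject IRI from the template used above. It uses plain dictionary access instead of a JSONPath library, and the message is a made-up one in the same shape as the stream shown earlier; the function names are hypothetical.

```python
import json

# Hypothetical message, shaped like the ones in the stream above.
MESSAGE = """
{
  "create": [{"fruit": "pear", "colour": "green"}],
  "update": [{"fruit": "apple", "colour": "red"}],
  "delete": [{"fruit": "orange"}]
}
"""


def records_per_operation(message_text):
    """Group the records of one message by the operation key they appear under."""
    message = json.loads(message_text)
    # Stands in for the iterators $.create.*, $.update.* and $.delete.*.
    return {op: list(message.get(op, [])) for op in ("create", "update", "delete")}


def subject_iri(record):
    # Mirrors the subject template http://fruit.org/$(fruit).
    return f"http://fruit.org/{record['fruit']}"


for operation, records in records_per_operation(MESSAGE).items():
    for record in records:
        print(operation, subject_iri(record), record.get("colour"))
```

Each record is handled by exactly one operation, which is why every operation needs its own iterator even though all three read the same file.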

6.2 Adding and removing mappings

This is what can be done with mappingAdd and mappingRemove:

Not possible yet: