Version: v0.17.0

DCAT Metadata Structure

Since the Dataspace Protocol (DSP) uses DCAT 3 (Data Catalog Vocabulary) for describing catalog entries, we have set up a DCAT structure for the TNO Security Gateway (TSG). This document describes how that structure represents data catalogs, datasets, and their distributions.

Overview

The DCAT implementation enables standardized discovery and exchange of dataset information between participants. It extends DCAT with domain-specific vocabularies such as HealthDCAT-AP for health data scenarios.

Core DCAT Structure

The TSG uses a hierarchical DCAT structure to organize data resources:

Catalog
├── DataService (DSP API endpoint)
└── Dataset
    ├── dct:conformsTo → Ontology, legislation, standards
    │   └── (e.g., HealthDCAT-AP, CSVW, domain ontologies)
    └── Distribution
        ├── dct:conformsTo → Distribution schema (JSON Schema, XSD, etc.)
        ├── dcat:mediaType → Media type of source data
        └── DataService
            ├── endpointURL → DSP API endpoint
            ├── endpointDescription → dspace:connector
            └── dspace:dataPlaneType → IRI indicating data plane type
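
The hierarchy above can be sketched as nested Python dictionaries. This is only an illustration: the URLs are placeholders, and using plain dicts with prefixed keys is an assumption, not the TSG's internal representation.

```python
# Illustrative only: nested dicts mirroring the hierarchy above.
# All URLs are hypothetical placeholders.
dsp_service = {
    "@type": "dcat:DataService",
    "dcat:endpointURL": "https://connector.example.org/api/dsp",
    "dcat:endpointDescription": "dspace:connector",
}

catalog = {
    "@type": "dcat:Catalog",
    "dcat:service": [dsp_service],
    "dcat:dataset": [
        {
            "@type": "dcat:Dataset",
            "dct:conformsTo": ["https://example.org/ontology/demographics"],
            "dcat:distribution": [
                {
                    "@type": "dcat:Distribution",
                    "dct:conformsTo": "https://example.org/schemas/patient.schema.json",
                    "dcat:mediaType": "application/json",
                    "dcat:accessService": {**dsp_service, "dspace:dataPlaneType": "tsg:http"},
                }
            ],
        }
    ],
}


def walk(node, depth=0):
    """Yield (depth, type) for every typed node, in document order."""
    if isinstance(node, dict):
        if "@type" in node:
            yield depth, node["@type"]
            depth += 1
        for value in node.values():
            yield from walk(value, depth)
    elif isinstance(node, list):
        for item in node:
            yield from walk(item, depth)


# Print the structure as an indented outline.
for depth, node_type in walk(catalog):
    print("  " * depth + node_type)
```

Walking the structure this way is a quick check that a generated catalog has the expected nesting before it is published.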

Component Details

Catalog

The Catalog represents a collection of datasets and data services offered by a participant in the data space. It serves as the top-level container for all discoverable resources.

Key Properties:

dcat:dataset - References to datasets in the catalog
dcat:service - References to data services (typically the DSP API)
dspace:participantId - Identifier of the participant offering this catalog
dct:publisher - Publisher of the catalog

DataService (Catalog Level)

The catalog-level DataService represents the DSP API endpoint that provides access to the catalog and facilitates data space protocol interactions.

Key Properties:

dcat:endpointURL - URL of the DSP API
dcat:endpointDescription - Description of the service (typically references dspace:connector). This reference is used to automatically link the DSP Data Service to new datasets.
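
The automatic linking mentioned above could work roughly as follows. This is a sketch under assumptions: the function names `find_dsp_service` and `link_dsp_service` and the dict-based catalog shape are hypothetical, and the actual TSG logic may differ.

```python
def find_dsp_service(catalog):
    """Return the first catalog-level service described as dspace:connector."""
    for service in catalog.get("dcat:service", []):
        if service.get("dcat:endpointDescription") == "dspace:connector":
            return service
    return None


def link_dsp_service(catalog, dataset):
    """Register a dataset and give each distribution without an access service the DSP one."""
    service = find_dsp_service(catalog)
    if service is None:
        raise ValueError("catalog has no service described as dspace:connector")
    for distribution in dataset.get("dcat:distribution", []):
        # setdefault keeps an access service that was set explicitly.
        distribution.setdefault("dcat:accessService", dict(service))
    catalog.setdefault("dcat:dataset", []).append(dataset)
    return dataset


# Hypothetical catalog and dataset used only to exercise the sketch.
catalog = {
    "@type": "dcat:Catalog",
    "dcat:service": [
        {
            "@type": "dcat:DataService",
            "dcat:endpointURL": "https://connector.example.org/api/dsp",
            "dcat:endpointDescription": "dspace:connector",
        }
    ],
}
new_dataset = {
    "@type": "dcat:Dataset",
    "dcat:distribution": [{"@type": "dcat:Distribution", "dcat:mediaType": "text/csv"}],
}
link_dsp_service(catalog, new_dataset)
print(new_dataset["dcat:distribution"][0]["dcat:accessService"]["dcat:endpointURL"])
```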

Dataset

A Dataset represents a logical collection of data that can be accessed through one or more distributions. Datasets can conform to various standards, ontologies, or legislation.

Dataset Creation in HTTP Data Plane

In the HTTP Data Plane, datasets are created through configuration (Helm values) or dynamically via the user interface. The data plane translates configuration into DCAT-compliant dataset metadata that is registered with the Control Plane.

Configuration-based Creation:

Datasets are defined in the HTTP Data Plane configuration (e.g., values.http-data-plane.yaml) using either simple or versioned types:

dataset:
  type: versioned
  title: Patient Demographics API
  baseSemanticModelRef: https://vocabulary-hub.eu/ontology/demographics
  currentVersion: 2.1.0
  versions:
    - version: 2.1.0
      semanticModelRef: https://vocabulary-hub.eu/ontology/demographics/v2.1
      distributions:
        - backendUrl: https://api.example.org/patients
          openApiSpecRef: https://api.example.org/openapi.json
          mediaType: application/json
          schemaRef: https://example.org/schemas/patient.schema.json

The HTTP Data Plane transforms this configuration into a DCAT dataset with:

Base dataset properties (title, conformsTo from semantic model references)
Distribution with dcat:accessService pointing to the DSP API endpoint
dcat:format set to tsg:http to indicate HTTP-based access
OpenAPI specification referenced for technical description
Proper version linking if using versioned type
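
To make the translation concrete, here is a minimal Python sketch of how one configured version might map onto a DCAT dataset. The function name `version_to_dcat`, the DSP endpoint URL, and the output key names are assumptions for illustration. The text does not state which property carries the OpenAPI reference or how `backendUrl` is handled, so the sketch leaves both out.

```python
def version_to_dcat(dataset_cfg, version_cfg, dsp_endpoint):
    """Map one configured version onto a DCAT-style dataset dict (illustrative)."""
    # conformsTo is collected from the semantic model references that are present.
    conforms_to = [
        ref
        for ref in (dataset_cfg.get("baseSemanticModelRef"), version_cfg.get("semanticModelRef"))
        if ref
    ]
    distributions = []
    for dist in version_cfg.get("distributions", []):
        entry = {
            "@type": "Distribution",
            "format": "tsg:http",  # marks HTTP-based access
            "accessService": {
                "@type": "DataService",
                "endpointURL": dsp_endpoint,
                "endpointDescription": "dspace:connector",
            },
        }
        if "mediaType" in dist:
            entry["mediaType"] = dist["mediaType"]
        if "schemaRef" in dist:
            entry["conformsTo"] = dist["schemaRef"]
        distributions.append(entry)
    return {
        "@type": "Dataset",
        "title": dataset_cfg["title"],
        "version": version_cfg["version"],
        "conformsTo": conforms_to,
        "distribution": distributions,
    }


# The configuration from above, written as a Python dict.
config = {
    "type": "versioned",
    "title": "Patient Demographics API",
    "baseSemanticModelRef": "https://vocabulary-hub.eu/ontology/demographics",
    "currentVersion": "2.1.0",
    "versions": [
        {
            "version": "2.1.0",
            "semanticModelRef": "https://vocabulary-hub.eu/ontology/demographics/v2.1",
            "distributions": [
                {
                    "backendUrl": "https://api.example.org/patients",
                    "openApiSpecRef": "https://api.example.org/openapi.json",
                    "mediaType": "application/json",
                    "schemaRef": "https://example.org/schemas/patient.schema.json",
                }
            ],
        }
    ],
}
dataset = version_to_dcat(config, config["versions"][0], "https://connector.example.org/api/dsp")
```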

UI-based Creation:

Users can also create and modify datasets through the HTTP Data Plane UI, which provides forms for entering dataset metadata and automatically generates valid DCAT structures.

Dataset Creation in Analytics Data Plane

In the Analytics Data Plane, datasets are created when users upload files through the web interface. The data plane automatically generates rich DCAT metadata by analyzing the uploaded data.

Automated Metadata Generation:

When a CSV file is uploaded, the Analytics Data Plane performs:

Deterministic Analysis:
- Column data types detection (string, integer, float, date, boolean)
- Statistical profiling (min/max values, unique counts, null percentages)
- Pattern detection (email addresses, phone numbers, medical codes)
- Temporal coverage extraction from date columns
- Data quality metrics (completeness, consistency)
LLM-Enhanced Metadata (optional):
- Semantic title and description generation
- Keyword extraction for discovery
- Theme classification (using DCAT and HealthDCAT-AP themes)
- Column-level semantic annotations
DCAT Dataset Creation:
- Generates comprehensive dataset metadata including CSVW table schema
- Creates distribution with dcat:format set to tsg:analytics
- Includes HealthDCAT-AP extensions for health data (age ranges, coding systems)
- Embeds data quality measurements using DQV (Data Quality Vocabulary)
- References CSVW for column-level metadata (variable dictionary)
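
The deterministic analysis step can be approximated with the Python standard library. The sketch below covers only type detection, unique counts, and null percentages. The function names and the detection order are assumptions, not the Analytics Data Plane's actual algorithm.

```python
import csv
import io
from datetime import date


def detect_type(values):
    """Guess a column type from its non-empty string values."""

    def all_parse(parser):
        try:
            for value in values:
                parser(value)
            return True
        except ValueError:
            return False

    if not values:
        return "string"
    if all(value.lower() in ("true", "false") for value in values):
        return "boolean"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    if all_parse(date.fromisoformat):
        return "date"
    return "string"


def profile_csv(text):
    """Return per-column type, unique count, and null percentage."""
    rows = list(csv.DictReader(io.StringIO(text)))
    profile = {}
    for column in (rows[0].keys() if rows else []):
        raw = [row[column] for row in rows]
        present = [value for value in raw if value not in ("", None)]
        profile[column] = {
            "type": detect_type(present),
            "unique": len(set(present)),
            "null_pct": round(100 * (len(raw) - len(present)) / len(raw), 1),
        }
    return profile


# Hypothetical sample data with two missing cells.
sample = (
    "id,age,admitted,score\n"
    "1,34,2024-01-15,0.5\n"
    "2,,2024-02-01,1.25\n"
    "3,51,2024-02-01,\n"
)
for column, stats in profile_csv(sample).items():
    print(column, stats)
```

Results like these could feed the data quality measurements and the CSVW variable dictionary without exposing any row values.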

The resulting dataset includes rich semantic metadata aligned with DCAT 3, HealthDCAT-AP, and CSVW standards, enabling fine-grained discovery and understanding of the data without exposing the actual content.

Example Generated Properties:

healthdcatap:numberOfRecords - Row count
healthdcatap:hasCodingSystem - Detected medical coding systems (ICD-10, LOINC, etc.)
dqv:hasQualityMeasurement - Completeness and validity metrics
csvw:tableSchema - Full variable dictionary with semantic annotations

Domain Extensions

For health data, datasets may include HealthDCAT-AP properties:

healthdcatap:numberOfRecords - Number of records in the dataset
healthdcatap:minTypicalAge - Minimum typical age of subjects
healthdcatap:maxTypicalAge - Maximum typical age of subjects
healthdcatap:hasCodingSystem - Medical coding systems used

Distribution

A Distribution represents a specific available format or access mechanism for a dataset. Each distribution can have its own schema, media type, and access service.

Key Properties:

dct:title - Title of this specific distribution
dct:conformsTo - Schema or standard for this distribution
- Examples: JSON Schema, XML Schema (XSD), Avro Schema
dcat:mediaType - IANA media type of the data
- Examples: text/csv, application/json, application/parquet
dcat:format - Format identifier (may be different from mediaType)
dcat:byteSize - Size of the distribution in bytes
dcat:accessService - Reference to the DataService providing access

DataService (Distribution Level)

The distribution-level DataService describes how to access a specific distribution, including the data plane endpoint and any technical descriptions.

Key Properties:

dcat:endpointURL - URL of the DSP API (for transfer negotiation)
dcat:endpointDescription - Type of connector (typically dspace:connector)

Dataset Versioning

DCAT provides multiple properties to manage dataset versions, allowing participants to track evolution of datasets over time and maintain relationships between different versions.

Version Properties

DCAT defines several properties for version management:

dcat:version - A version number or identifier (e.g., "1.0", "2.3.1", "2024-01-15")
dcat:hasVersion - Links to other versions of this dataset (can be multiple)
dcat:isVersionOf - Points to the parent/base dataset that this is a version of
dcat:hasCurrentVersion - Points to the current/latest version
dcat:previousVersion - Points to the immediately preceding version

Versioning in TSG

The HTTP Data Plane in TSG supports versioning through the type: versioned configuration:

dataset:
  type: versioned
  title: Patient Demographics API
  currentVersion: 2.1.0
  versions:
    - version: 1.0.0
      distributions:
        - backendUrl: https://api.example.org/v1/patients
    - version: 2.0.0
      distributions:
        - backendUrl: https://api.example.org/v2/patients
    - version: 2.1.0
      distributions:
        - backendUrl: https://api.example.org/v2.1/patients

This configuration creates:

A base dataset with dcat:hasCurrentVersion pointing to version 2.1.0
Separate dataset resources for each version with appropriate version linking
Distribution-level versioning for API endpoints
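
A rough Python sketch of the version linking this produces is shown below. It assumes the versions are listed oldest first and uses a hypothetical `@id` scheme (`<base>/v<version>`). The function name `expand_versions` is illustrative, not a TSG API.

```python
def expand_versions(base_id, cfg):
    """Build a base dataset plus one linked dataset per configured version."""
    versions = [entry["version"] for entry in cfg["versions"]]
    ids = {version: f"{base_id}/v{version}" for version in versions}  # hypothetical IRI scheme
    base = {
        "@id": base_id,
        "@type": "Dataset",
        "title": cfg["title"],
        "hasVersion": [ids[version] for version in versions],
        "hasCurrentVersion": ids[cfg["currentVersion"]],
    }
    datasets = []
    previous = None
    for version in versions:
        dataset = {
            "@id": ids[version],
            "@type": "Dataset",
            "version": version,
            "isVersionOf": base_id,
        }
        if previous is not None:
            # Assumes the configured order is chronological.
            dataset["previousVersion"] = ids[previous]
        datasets.append(dataset)
        previous = version
    return base, datasets


cfg = {
    "type": "versioned",
    "title": "Patient Demographics API",
    "currentVersion": "2.1.0",
    "versions": [{"version": "1.0.0"}, {"version": "2.0.0"}, {"version": "2.1.0"}],
}
base, datasets = expand_versions("https://example.org/datasets/patients", cfg)
print(base["hasCurrentVersion"])
```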

Versioning Best Practices

Semantic Versioning: Use semantic versioning (MAJOR.MINOR.PATCH) for APIs and data schemas
Date-based Versioning: Use ISO 8601 dates (YYYY-MM-DD) for time-series data or periodic releases
Breaking Changes: Increment major version when making breaking changes to schema or semantics
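
A short Python sketch illustrates why the two schemes need different handling. Semantic versions must be compared numerically, while ISO 8601 dates already sort correctly as strings. The helper names are illustrative.

```python
def parse_semver(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints so comparison is numeric."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_breaking(old, new):
    """True when the major component increased between two versions."""
    return parse_semver(new)[0] > parse_semver(old)[0]


semantic = ["1.0.0", "2.9.0", "2.10.0"]
print(max(semantic, key=parse_semver))  # numeric comparison picks 2.10.0
print(max(semantic))                    # plain string comparison picks 2.9.0

print(is_breaking("1.0.0", "2.0.0"))    # True: major version increased
print(is_breaking("2.0.0", "2.1.0"))    # False: minor change only

# ISO 8601 dates order correctly without parsing.
dated = ["2023-12-31", "2024-01-15", "2023-06-01"]
print(max(dated))                       # 2024-01-15
```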

Schema Versioning

When versioning datasets, also version the conformance schemas:

{
  "@type": "Dataset",
  "@id": "https://example.org/datasets/research-data/v2.0.0",
  "dcat:version": "2.0.0",
  "distribution": [
    {
      "@type": "Distribution",
      "conformsTo": "https://example.org/schemas/research-v2.schema.json",
      "mediaType": "application/json"
    }
  ]
}

This ensures consumers can validate data against the correct schema version and understand structural changes between versions.
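
On the consumer side, picking the schema for a given version could look like the sketch below. The function name `schemas_for_version` and the second dataset entry are hypothetical. The key names follow the JSON example above.

```python
def schemas_for_version(datasets, version):
    """Return the conformsTo references of all distributions of one dataset version."""
    for dataset in datasets:
        if dataset.get("dcat:version") == version:
            return [
                dist["conformsTo"]
                for dist in dataset.get("distribution", [])
                if "conformsTo" in dist
            ]
    raise KeyError(f"no dataset with version {version}")


datasets = [
    {
        "@id": "https://example.org/datasets/research-data/v1.0.0",  # hypothetical older version
        "dcat:version": "1.0.0",
        "distribution": [{"conformsTo": "https://example.org/schemas/research-v1.schema.json"}],
    },
    {
        "@id": "https://example.org/datasets/research-data/v2.0.0",
        "dcat:version": "2.0.0",
        "distribution": [{"conformsTo": "https://example.org/schemas/research-v2.schema.json"}],
    },
]
print(schemas_for_version(datasets, "2.0.0"))
```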

Implementation Examples

HTTP Data Plane Dataset

{
  "@type": "Dataset",
  "@id": "https://example.org/datasets/patient-data",
  "title": "Patient Demographics",
  "description": "Anonymized patient demographic data",
  "conformsTo": [
    "https://healthdataeu.pages.code.europa.eu/healthdcat-ap/",
    "https://example.org/ontology/demographics-v1"
  ],
  "distribution": [
    {
      "@type": "Distribution",
      "title": "JSON API Distribution",
      "conformsTo": "https://example.org/schemas/patient-schema.json",
      "mediaType": "application/json",
      "accessService": {
        "@type": "DataService",
        "endpointURL": "https://connector.example.org/api/dsp",
        "endpointDescription": "dspace:connector",
        "dataPlaneType": "tsg:http"
      }
    }
  ]
}

Analytics Data Plane Dataset

{
  "@type": "Dataset",
  "@id": "https://example.org/datasets/medical-records",
  "title": "Medical Records for Analysis",
  "description": "Encrypted medical records for federated analytics",
  "conformsTo": [
    "https://healthdataeu.pages.code.europa.eu/healthdcat-ap/"
  ],
  "distribution": [
    {
      "@type": "Distribution",
      "title": "CSV Distribution",
      "conformsTo": "https://www.w3.org/TR/tabular-metadata/",
      "mediaType": "text/csv",
      "format": "tsg:analytics",
      "accessService": {
        "@type": "DataService",
        "endpointURL": "https://connector.example.org/api/dsp",
        "endpointDescription": "dspace:connector",
        "dataPlaneType": "tsg:analytics"
      }
    }
  ]
}
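
A consumer reading such a dataset needs the data plane type and the DSP endpoint of each distribution before it can start a transfer negotiation. The Python sketch below extracts them from a trimmed copy of the Analytics example. The helper name `access_points` is illustrative.

```python
import json


def access_points(dataset):
    """Yield (title, data plane type, DSP endpoint) for each distribution."""
    for dist in dataset.get("distribution", []):
        service = dist.get("accessService", {})
        yield dist.get("title"), service.get("dataPlaneType"), service.get("endpointURL")


# Trimmed copy of the Analytics Data Plane example, parsed from JSON text.
document = json.loads("""
{
  "@type": "Dataset",
  "@id": "https://example.org/datasets/medical-records",
  "distribution": [
    {
      "@type": "Distribution",
      "title": "CSV Distribution",
      "mediaType": "text/csv",
      "format": "tsg:analytics",
      "accessService": {
        "@type": "DataService",
        "endpointURL": "https://connector.example.org/api/dsp",
        "endpointDescription": "dspace:connector",
        "dataPlaneType": "tsg:analytics"
      }
    }
  ]
}
""")

for title, plane_type, endpoint in access_points(document):
    print(f"{title}: {plane_type} via {endpoint}")
```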

Conformance References

Dataset Level (dct:conformsTo)

At the dataset level, dct:conformsTo typically references:

Application Profiles: HealthDCAT-AP, DCAT-AP
Domain Ontologies: Medical terminologies, industry standards
Legislation: GDPR, HIPAA, domain-specific regulations
Metadata Standards: CSVW for tabular data

Distribution Level (dct:conformsTo)

At the distribution level, dct:conformsTo typically references:

Data Schemas: JSON Schema, XML Schema (XSD), Avro Schema
Table Schemas: CSVW Table Schema for CSV files
API Specifications: OpenAPI specification

Best Practices

Choosing Conformance References

Dataset conformsTo: Use for high-level semantic models, application profiles, and regulatory frameworks
Distribution conformsTo: Use for technical schemas that validate the data format

Media Type Selection

Use standard IANA media types whenever possible
For file-based resources, dcat:mediaType indicates the file format
For API-based resources, consider whether mediaType represents:
- The format of API responses
- The format of the underlying data source
- Both (if they align)

Multiple Distributions

Provide multiple distributions when:

Data is available in multiple formats (CSV, JSON, Parquet)
Different access patterns are supported (API vs. file download)
Different data plane types can access the same dataset
Different conformance levels or schemas apply to different views
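
When a dataset offers several distributions, a consumer has to choose one. A minimal Python sketch of selection by preferred media type follows. The function name `pick_distribution` and the sample dataset are hypothetical.

```python
def pick_distribution(dataset, preferred_media_types):
    """Return the first distribution matching the preference order, or None."""
    by_media_type = {}
    for dist in dataset.get("distribution", []):
        # Keep the first distribution seen for each media type.
        by_media_type.setdefault(dist.get("mediaType"), dist)
    for media_type in preferred_media_types:
        if media_type in by_media_type:
            return by_media_type[media_type]
    return None


dataset = {
    "@type": "Dataset",
    "distribution": [
        {"title": "CSV Distribution", "mediaType": "text/csv"},
        {"title": "JSON API Distribution", "mediaType": "application/json"},
    ],
}
chosen = pick_distribution(dataset, ["application/parquet", "application/json", "text/csv"])
print(chosen["title"])
```

Here Parquet is preferred but not offered, so the sketch falls back to the JSON distribution.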

Related Standards

DCAT 3 - Data Catalog Vocabulary
HealthDCAT-AP - Health Data Application Profile
CSVW - CSV on the Web
Eclipse Dataspace Protocol - Data space interactions
Dublin Core Terms - Metadata vocabulary

