DCAT Metadata Structure
Since the Dataspace Protocol (DSP) uses DCAT 3 (Data Catalog Vocabulary) to describe catalog entries, we have set up a DCAT structure for the TSG. This document describes the DCAT structure used in the TNO Security Gateway (TSG) to represent data catalogs, datasets, and their distributions.
Overview
The DCAT implementation enables standardized discovery and exchange of dataset information between participants. It extends DCAT with domain-specific vocabularies such as HealthDCAT-AP for health data scenarios.
Core DCAT Structure
The TSG uses a hierarchical DCAT structure to organize data resources:
Catalog
├── DataService (DSP API endpoint)
└── Dataset
    ├── dct:conformsTo → Ontology, legislation, standards
    │   └── (e.g., HealthDCAT-AP, CSVW, domain ontologies)
    └── Distribution
        ├── dct:conformsTo → Distribution schema (JSON Schema, XSD, etc.)
        ├── dcat:mediaType → Media type of source data
        └── DataService
            ├── endpointURL → DSP API endpoint
            ├── endpointDescription → dspace:connector
            └── dspace:dataPlaneType → IRI indicating data plane type
Component Details
Catalog
The Catalog represents a collection of datasets and data services offered by a participant in the data space. It serves as the top-level container for all discoverable resources.
Key Properties:
- dcat:dataset - References to datasets in the catalog
- dcat:service - References to data services (typically the DSP API)
- dspace:participantId - Identifier of the participant offering this catalog
- dct:publisher - Publisher of the catalog
DataService (Catalog Level)
The catalog-level DataService represents the DSP API endpoint that provides access to the catalog and facilitates data space protocol interactions.
Key Properties:
- dcat:endpointURL - URL of the DSP API
- dcat:endpointDescription - Description of the service (typically references dspace:connector). This reference is used to automatically link the DSP Data Service to new datasets.
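The automatic linking mentioned above can be illustrated with a short Python sketch. It is not the TSG implementation: the function names `find_dsp_service` and `add_dataset`, the dict-based JSON-LD representation, and the rule of only filling in missing access services are assumptions made for illustration.

```python
from typing import Any, Optional

# Marker value taken from the dcat:endpointDescription property described above.
DSP_CONNECTOR = "dspace:connector"


def find_dsp_service(catalog: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the first catalog-level DataService described as the DSP connector."""
    for service in catalog.get("dcat:service", []):
        if service.get("dcat:endpointDescription") == DSP_CONNECTOR:
            return service
    return None


def add_dataset(catalog: dict[str, Any], dataset: dict[str, Any]) -> None:
    """Register a dataset and link the DSP service to distributions that lack one."""
    service = find_dsp_service(catalog)
    if service is None:
        raise ValueError("catalog has no DataService described as dspace:connector")
    for distribution in dataset.get("dcat:distribution", []):
        # setdefault keeps an access service that was set explicitly (assumed behavior).
        distribution.setdefault("dcat:accessService", {"@id": service["@id"]})
    catalog.setdefault("dcat:dataset", []).append(dataset)
```

Matching on dcat:endpointDescription rather than on the endpoint URL means the link survives a change of hostname, which is presumably why the description is used as the reference.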
Dataset
A Dataset represents a logical collection of data that can be accessed through one or more distributions. Datasets can conform to various standards, ontologies, or legislation.
Dataset Creation in HTTP Data Plane
In the HTTP Data Plane, datasets are created through configuration (Helm values) or dynamically via the user interface. The data plane translates configuration into DCAT-compliant dataset metadata that is registered with the Control Plane.
Configuration-based Creation:
Datasets are defined in the HTTP Data Plane configuration (e.g., values.http-data-plane.yaml) using either simple or versioned types:
dataset:
  type: versioned
  title: Patient Demographics API
  baseSemanticModelRef: https://vocabulary-hub.eu/ontology/demographics
  currentVersion: 2.1.0
  versions:
    - version: 2.1.0
      semanticModelRef: https://vocabulary-hub.eu/ontology/demographics/v2.1
      distributions:
        - backendUrl: https://api.example.org/patients
          openApiSpecRef: https://api.example.org/openapi.json
          mediaType: application/json
          schemaRef: https://example.org/schemas/patient.schema.json
The HTTP Data Plane transforms this configuration into a DCAT dataset with:
- Base dataset properties (title, conformsTo from semantic model references)
- Distribution with
  - dcat:accessService pointing to the DSP API endpoint
  - dcat:format set to tsg:http to indicate HTTP-based access
  - OpenAPI specification referenced for technical description
- Proper version linking if using versioned type
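The transformation listed above can be sketched in Python as follows. This is a hedged illustration, not the data plane's actual code: the function name `to_dcat_dataset`, the exact mapping of configuration keys to DCAT properties, and placing the OpenAPI reference in the distribution's dct:conformsTo are assumptions. Only the current version is mapped to keep the sketch short.

```python
from typing import Any


def to_dcat_dataset(config: dict[str, Any], dsp_endpoint: str) -> dict[str, Any]:
    """Map a 'versioned' dataset configuration to a DCAT-style dict (current version only)."""
    current = config["currentVersion"]
    version = next(v for v in config["versions"] if v["version"] == current)

    # Dataset-level conformsTo: base semantic model plus the version-specific one.
    refs = (config.get("baseSemanticModelRef"), version.get("semanticModelRef"))
    conforms_to = [ref for ref in refs if ref]

    distributions = []
    for dist in version.get("distributions", []):
        distribution: dict[str, Any] = {
            "@type": "dcat:Distribution",
            "dcat:format": "tsg:http",  # marks HTTP-based access
            # Distribution-level conformsTo: technical schema and OpenAPI specification.
            "dct:conformsTo": [
                ref for ref in (dist.get("schemaRef"), dist.get("openApiSpecRef")) if ref
            ],
            "dcat:accessService": {
                "@type": "dcat:DataService",
                "dcat:endpointURL": dsp_endpoint,
                "dcat:endpointDescription": "dspace:connector",
            },
        }
        if "mediaType" in dist:
            distribution["dcat:mediaType"] = dist["mediaType"]
        distributions.append(distribution)

    return {
        "@type": "dcat:Dataset",
        "dct:title": config["title"],
        "dcat:version": current,
        "dct:conformsTo": conforms_to,
        "dcat:distribution": distributions,
    }
```

Note that the backendUrl is deliberately left out of the generated metadata in this sketch: consumers reach the data through the DSP API endpoint, so the backend address does not need to be published.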
UI-based Creation:
Users can also create and modify datasets through the HTTP Data Plane UI, which provides forms for entering dataset metadata and automatically generates valid DCAT structures.
Dataset Creation in Analytics Data Plane
In the Analytics Data Plane, datasets are created when users upload files through the web interface. The data plane automatically generates rich DCAT metadata by analyzing the uploaded data.
Automated Metadata Generation:
When a CSV file is uploaded, the Analytics Data Plane performs:
- Deterministic Analysis:
  - Column data types detection (string, integer, float, date, boolean)
  - Statistical profiling (min/max values, unique counts, null percentages)
  - Pattern detection (email addresses, phone numbers, medical codes)
  - Temporal coverage extraction from date columns
  - Data quality metrics (completeness, consistency)
- LLM-Enhanced Metadata (optional):
  - Semantic title and description generation
  - Keyword extraction for discovery
  - Theme classification (using DCAT and HealthDCAT-AP themes)
  - Column-level semantic annotations
- DCAT Dataset Creation:
  - Generates comprehensive dataset metadata including CSVW table schema
  - Creates distribution with dcat:format set to tsg:analytics
  - Includes HealthDCAT-AP extensions for health data (age ranges, coding systems)
  - Embeds data quality measurements using DQV (Data Quality Vocabulary)
  - References CSVW for column-level metadata (variable dictionary)
The resulting dataset includes rich semantic metadata aligned with DCAT 3, HealthDCAT-AP, and CSVW standards, enabling fine-grained discovery and understanding of the data without exposing the actual content.
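To make the deterministic analysis step more concrete, here is a minimal Python sketch of column type detection and basic profiling using only the standard library. It is an illustration under assumptions, not the Analytics Data Plane's code: the function names `infer_type` and `profile_csv`, the order in which types are tried, and the shape of the returned profile are all hypothetical.

```python
import csv
import io
from datetime import date


def infer_type(values: list[str]) -> str:
    """Infer the narrowest type that fits every non-empty value in a column."""

    def all_parse(parse) -> bool:
        try:
            for value in values:
                parse(value)
            return True
        except ValueError:
            return False

    if not values:
        return "string"
    if all(value.lower() in ("true", "false") for value in values):
        return "boolean"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    if all_parse(date.fromisoformat):  # accepts YYYY-MM-DD
        return "date"
    return "string"


def profile_csv(text: str) -> dict[str, dict]:
    """Return datatype, unique count, and null percentage for each CSV column."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    profile = {}
    for column in reader.fieldnames or []:
        raw = [row[column] for row in rows]
        present = [value for value in raw if value not in (None, "")]
        profile[column] = {
            "datatype": infer_type(present),
            "unique": len(set(present)),
            "null_percentage": 100.0 * (len(raw) - len(present)) / len(raw) if raw else 0.0,
        }
    return profile
```

A profile like this only describes the shape of the data, which is what allows the metadata to be published without exposing the content itself.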
Example Generated Properties:
- healthdcatap:numberOfRecords - Row count
- healthdcatap:hasCodingSystem - Detected medical coding systems (ICD-10, LOINC, etc.)
- dqv:hasQualityMeasurement - Completeness and validity metrics
- csvw:tableSchema - Full variable dictionary with semantic annotations
Domain Extensions
For health data, datasets may include HealthDCAT-AP properties:
- healthdcatap:numberOfRecords - Number of records in the dataset
- healthdcatap:minTypicalAge - Minimum typical age of subjects
- healthdcatap:maxTypicalAge - Maximum typical age of subjects
- healthdcatap:hasCodingSystem - Medical coding systems used
Distribution
A Distribution represents a specific available format or access mechanism for a dataset. Each distribution can have its own schema, media type, and access service.
Key Properties:
- dct:title - Title of this specific distribution
- dct:conformsTo - Schema or standard for this distribution
  - Examples: JSON Schema, XML Schema (XSD), Avro Schema
- dcat:mediaType - IANA media type of the data
  - Examples: text/csv, application/json, application/parquet
- dcat:format - Format identifier (may be different from mediaType)
- dcat:byteSize - Size of the distribution in bytes
- dcat:accessService - Reference to the DataService providing access
DataService (Distribution Level)
The distribution-level DataService describes how to access a specific distribution, including the data plane endpoint and any technical descriptions.
Key Properties:
- dcat:endpointURL - URL of the DSP API (for transfer negotiation)
- dcat:endpointDescription - Type of connector (typically dspace:connector)
Dataset Versioning
DCAT provides multiple properties to manage dataset versions, allowing participants to track evolution of datasets over time and maintain relationships between different versions.
Version Properties
DCAT defines several properties for version management:
- dcat:version - A version number or identifier (e.g., "1.0", "2.3.1", "2024-01-15")
- dcat:hasVersion - Links to other versions of this dataset (can be multiple)
- dcat:isVersionOf - Points to the parent/base dataset that this is a version of
- dcat:hasCurrentVersion - Points to the current/latest version
- dcat:previousVersion - Points to the immediately preceding version
Versioning in TSG
The HTTP Data Plane in TSG supports versioning through the type: versioned configuration:
dataset:
  type: versioned
  title: Patient Demographics API
  currentVersion: 2.1.0
  versions:
    - version: 1.0.0
      distributions:
        - backendUrl: https://api.example.org/v1/patients
    - version: 2.0.0
      distributions:
        - backendUrl: https://api.example.org/v2/patients
    - version: 2.1.0
      distributions:
        - backendUrl: https://api.example.org/v2.1/patients
This configuration creates:
- A base dataset with dcat:hasCurrentVersion pointing to version 2.1.0
- Separate dataset resources for each version with appropriate version linking
- Distribution-level versioning for API endpoints
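The version linking described above can be sketched in Python using the DCAT version properties. This is an illustrative assumption of how the links could be assembled, not the TSG's actual output: the function name `build_version_graph`, the identifier pattern `<base>/v<version>`, and the requirement that versions are listed oldest first are all hypothetical.

```python
from typing import Any


def build_version_graph(
    base_id: str, versions: list[str], current_version: str
) -> tuple[dict[str, Any], list[dict[str, Any]]]:
    """Return the base dataset and one linked dataset per version (oldest first)."""
    if current_version not in versions:
        raise ValueError(f"current version {current_version} is not a listed version")

    def version_id(version: str) -> str:
        # Hypothetical identifier scheme for version resources.
        return f"{base_id}/v{version}"

    base = {
        "@id": base_id,
        "@type": "dcat:Dataset",
        "dcat:hasVersion": [{"@id": version_id(v)} for v in versions],
        "dcat:hasCurrentVersion": {"@id": version_id(current_version)},
    }

    datasets = []
    previous = None
    for version in versions:
        dataset: dict[str, Any] = {
            "@id": version_id(version),
            "@type": "dcat:Dataset",
            "dcat:version": version,
            "dcat:isVersionOf": {"@id": base_id},
        }
        if previous is not None:
            # Each version points only to its immediate predecessor.
            dataset["dcat:previousVersion"] = {"@id": version_id(previous)}
        datasets.append(dataset)
        previous = version
    return base, datasets
```

For the configuration above, calling `build_version_graph("https://example.org/datasets/patients", ["1.0.0", "2.0.0", "2.1.0"], "2.1.0")` would yield a base dataset plus three version datasets chained through dcat:previousVersion.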
Versioning Best Practices
- Semantic Versioning: Use semantic versioning (MAJOR.MINOR.PATCH) for APIs and data schemas
- Date-based Versioning: Use ISO 8601 dates (YYYY-MM-DD) for time-series data or periodic releases
- Breaking Changes: Increment major version when making breaking changes to schema or semantics
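As a small illustration of the semantic versioning practice, the Python sketch below shows how a version could be incremented and how the latest version could be selected. The helper names `parse_semver`, `bump`, and `latest` are hypothetical, and the sketch assumes plain MAJOR.MINOR.PATCH strings without pre-release or build suffixes.

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers so versions compare numerically."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump(version: str, change: str) -> str:
    """Return the next version for a 'major' (breaking), 'minor', or 'patch' change."""
    major, minor, patch = parse_semver(version)
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")


def latest(versions: list[str]) -> str:
    """Pick the highest version; plain string comparison would rank '10.0.0' below '2.0.0'."""
    return max(versions, key=parse_semver)
```

Comparing parsed tuples rather than raw strings matters when choosing a value for currentVersion once a component reaches two digits.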
Schema Versioning
When versioning datasets, also version the conformance schemas:
{
  "@type": "Dataset",
  "@id": "https://example.org/datasets/research-data/v2.0.0",
  "dcat:version": "2.0.0",
  "distribution": [
    {
      "@type": "Distribution",
      "conformsTo": "https://example.org/schemas/research-v2.schema.json",
      "mediaType": "application/json"
    }
  ]
}
This ensures consumers can validate data against the correct schema version and understand structural changes between versions.
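On the consumer side, selecting the schema that belongs to a given dataset version might look like the following Python sketch. The function name `schema_for` is hypothetical, and the sketch assumes the compact JSON keys used in the example above ("dcat:version", "distribution", "conformsTo", "mediaType").

```python
from typing import Any, Optional


def schema_for(
    datasets: list[dict[str, Any]], version: str, media_type: str
) -> Optional[str]:
    """Return the conformsTo schema of the matching version and media type, if any."""
    for dataset in datasets:
        if dataset.get("dcat:version") != version:
            continue
        for distribution in dataset.get("distribution", []):
            if distribution.get("mediaType") == media_type:
                return distribution.get("conformsTo")
    return None
```

A consumer would then fetch the returned schema URL and validate received data against it with a validator of its choice; that step is outside the scope of this sketch.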
Implementation Examples
HTTP Data Plane Dataset
{
  "@type": "Dataset",
  "@id": "https://example.org/datasets/patient-data",
  "title": "Patient Demographics",
  "description": "Anonymized patient demographic data",
  "conformsTo": [
    "https://healthdataeu.pages.code.europa.eu/healthdcat-ap/",
    "https://example.org/ontology/demographics-v1"
  ],
  "distribution": [
    {
      "@type": "Distribution",
      "title": "JSON API Distribution",
      "conformsTo": "https://example.org/schemas/patient-schema.json",
      "mediaType": "application/json",
      "accessService": {
        "@type": "DataService",
        "endpointURL": "https://connector.example.org/api/dsp",
        "endpointDescription": "dspace:connector",
        "dataPlaneType": "tsg:http"
      }
    }
  ]
}
Analytics Data Plane Dataset
{
  "@type": "Dataset",
  "@id": "https://example.org/datasets/medical-records",
  "title": "Medical Records for Analysis",
  "description": "Encrypted medical records for federated analytics",
  "conformsTo": [
    "https://healthdataeu.pages.code.europa.eu/healthdcat-ap/"
  ],
  "distribution": [
    {
      "@type": "Distribution",
      "title": "CSV Distribution",
      "conformsTo": "https://www.w3.org/TR/tabular-metadata/",
      "mediaType": "text/csv",
      "format": "tsg:analytics",
      "accessService": {
        "@type": "DataService",
        "endpointURL": "https://connector.example.org/api/dsp",
        "endpointDescription": "dspace:connector",
        "dataPlaneType": "tsg:analytics"
      }
    }
  ]
}
Conformance References
Dataset Level (dct:conformsTo)
At the dataset level, dct:conformsTo typically references:
- Application Profiles: HealthDCAT-AP, DCAT-AP
- Domain Ontologies: Medical terminologies, industry standards
- Legislation: GDPR, HIPAA, domain-specific regulations
- Metadata Standards: CSVW for tabular data
Distribution Level (dct:conformsTo)
At the distribution level, dct:conformsTo typically references:
- Data Schemas: JSON Schema, XML Schema (XSD), Avro Schema
- Table Schemas: CSVW Table Schema for CSV files
- API Specifications: OpenAPI specification
Best Practices
Choosing Conformance References
- Dataset conformsTo: Use for high-level semantic models, application profiles, and regulatory frameworks
- Distribution conformsTo: Use for technical schemas that validate the data format
Media Type Selection
- Use standard IANA media types whenever possible
- For file-based resources, dcat:mediaType indicates the file format
- For API-based resources, consider whether mediaType represents:
  - The format of API responses
  - The format of the underlying data source
  - Both (if they align)
Multiple Distributions
Provide multiple distributions when:
- Data is available in multiple formats (CSV, JSON, Parquet)
- Different access patterns are supported (API vs. file download)
- Different data plane types can access the same dataset
- Different conformance levels or schemas apply to different views
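When a dataset offers several distributions, a consumer needs some rule for choosing one. The Python sketch below shows one possible rule, preferring media types in the consumer's order and optionally filtering on data plane type. The function name `pick_distribution` is hypothetical, and the keys follow the compact JSON used in the implementation examples ("distribution", "mediaType", "accessService", "dataPlaneType").

```python
from typing import Any, Optional


def pick_distribution(
    dataset: dict[str, Any],
    accepted_media_types: list[str],
    data_plane_type: Optional[str] = None,
) -> Optional[dict[str, Any]]:
    """Return the first distribution matching the consumer's preferences, or None."""
    for media_type in accepted_media_types:  # ordered by consumer preference
        for distribution in dataset.get("distribution", []):
            if distribution.get("mediaType") != media_type:
                continue
            plane = distribution.get("accessService", {}).get("dataPlaneType")
            if data_plane_type is None or plane == data_plane_type:
                return distribution
    return None
```

Iterating over the consumer's media types in the outer loop makes the consumer's preference order win over the order in which the provider happened to list its distributions.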
Related Standards
- DCAT 3 - Data Catalog Vocabulary
- HealthDCAT-AP - Health Data Application Profile
- CSVW - CSV on the Web
- Eclipse Dataspace Protocol - Data space interactions
- Dublin Core Terms - Metadata vocabulary
Next: Learn about Design Decisions or return to Standards and Protocols.