Your first models
Data validation plays an important role in ensuring the best data quality across teams and systems. While writing models for validating the data is optional in Blacksmith, we highly recommend doing so: end-to-end data reliability and quality is critical for an organisation.
In the previous guide, we sent some data to the trigger created earlier. What happens if the data is not good? What happens if `user` is not an object? Or if `username` is not a string?
Blacksmith leverages the JSON Schema specification for validating data against a model: each model is a JSON Schema document.
In the ETL pipeline, validating data against a model ensures data quality both at the trigger level, right after Extracting the data, and at the integration level, right before Loading the data. In other words, you can set validations pre and post Transformation. Without validation, you may try to Load bad data into an integration. Depending on the behavior of said integration, it might break or accept the bad data. Either way, data is broken: it is either missing or not valid.
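To make this concrete, here is a minimal, standalone JSON Schema document of the kind a model contains. This sketch only states that `username`, when present, must be a string:
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "username": {
      "type": "string"
    }
  }
}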
Validation pre Transformation
Let's validate the data before the Transformation. Once the data is Extracted, you can validate it against a model by adding a `model` key in the trigger.
First, generate a new model:
$ blacksmith generate model \
  --name user \
  --path ./models/users/extraction.json \
  --extend trigger/http_endpoint
Here we add the `--extend` flag with `trigger/http_endpoint` as its value. The generated model inherits the schema of a data Extraction from an HTTP endpoint, as defined in the Application reference. This way, you can set validations not only for the `body`, but also for the `headers`, `query`, etc. passed by the HTTP request.
We can then add the `model` key with the path to the generated model for validating the data:
sources:
  - name: "api"
    # ...
    triggers:
      - name: "new_user"
        # ...
        model: "./models/users/extraction.json"
        integrations:
          # ...
This tells Blacksmith that the Extracted data must respect the JSON Schema defined at the `model` path. If the data doesn't respect the model, the ETL process stops: the data that would be passed down for Transformation is not valid, and therefore the data to Load would not be valid either.
For our simple use case, the JSON Schema pre Transformation looks like this:
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "new_user",
  "title": "new_user",
  "description": "JSON schema for validating a HTTP endpoint trigger on `new_user`.",
  "allOf": [{"$ref": "https://nunchi.studio/blacksmith/trigger/http_endpoint"}],
  "properties": {
    "headers": {
      "type": "object",
      "properties": {}
    },
    "query": {
      "type": "object",
      "properties": {}
    },
    "body": {
      "type": "object",
      "properties": {
        "user": {
          "type": "object",
          "properties": {
            "username": {
              "type": "string"
            }
          }
        }
      }
    }
  }
}
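For reference, a request whose body looks like the following would pass this validation (the `username` value is just an example):
{
  "user": {
    "username": "johndoe"
  }
}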
You can now restart the worker loader and the gateway, and see how the ETL pipeline reacts to both good and bad data.
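As a quick check, you can send one valid and one invalid payload to the trigger. A minimal sketch with curl, assuming the gateway listens on localhost:9090 and the trigger is exposed at /new_user (adjust both to your setup):
# Valid: "user" is an object and "username" is a string.
$ curl -X POST http://localhost:9090/new_user \
  -H "Content-Type: application/json" \
  -d '{"user": {"username": "johndoe"}}'

# Invalid: "username" is a number, so validation fails and the
# ETL process stops before Transformation.
$ curl -X POST http://localhost:9090/new_user \
  -H "Content-Type: application/json" \
  -d '{"user": {"username": 123}}'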
Validation post Transformation
Now, let's validate the data after the Transformation happened. This validates the data for each integration, allowing you to have a dedicated model for each of them.
As explained before, a `sql` integration doesn't need the `transformation` inside a trigger. For the purpose of these guides, let's assume we also have a non-SQL integration named `nonexisting-nosql`, and wish to validate data before Loading it.
We would need to generate a new model for Loading this kind of data to the `nonexisting-nosql` integration:
$ blacksmith generate model \
  --name user \
  --path ./models/users/load.json
Then, we would set the path to the model for the integration as the value for `model`:
sources:
  - name: "api"
    # ...
    triggers:
      - name: "new_user"
        # ...
        integrations:
          - name: "nonexisting-nosql"
            model: "./models/users/load.json"
            transformation:
              id: "{% uuid %}"
              username: "{% query 'body.user.username' %}"
            config:
              # ...
This means the `transformation` object defined for the integration will be tested against this model. If the validation fails, the data will not be Loaded to the integration.
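For example, given the request body sent earlier, the transformation would produce an object like the following, and this object is what gets validated (the UUID here is illustrative):
{
  "id": "5f8c1a2e-9b47-4c31-a6d0-3e2f7b81c9d4",
  "username": "johndoe"
}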
Following our example, the model looks like this:
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "user",
  "title": "user",
  "description": "JSON schema for validating data before Loading to a non-SQL integration.",
  "type": "object",
  "required": ["id", "username"],
  "properties": {
    "id": {
      "type": "string",
      "format": "uuid"
    },
    "username": {
      "type": "string"
    }
  }
}
Finally, we can retrieve our users from the database by running a SELECT query.
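A minimal sketch, assuming the `sql` integration from the earlier guide writes to a `users` table with `id` and `username` columns (adjust to your actual schema):
SELECT id, username
FROM users;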