tags: netencode, json
date: 2022-03-31
certainty: likely
status: initial
title: An idealized Configuration Language
An Idealized Configuration Language
JSON brought us one step closer to what an idealized configuration language is, which I define as “data, stripped of all externalities of the system it is working in”.
Specifically, JSON is very close to what I consider the minimal properties to represent structured data.
A short history, according to me
In the beginning, Lisp defined s-expressions as a stand-in for an actual syntax. Then, people figured out that it’s also a way to represent structured data. It has scalars, which can be nested into lists, recursively.
(this is (a (list) (of lists)))
This provides the first three rules of our idealized language:
- A scalar is a primitive value that is domain-specific. We can assume a bunch of bytes here, or a text or an integer.
- A list gives an ordering to 0..n (or 1..n) values
- Both a scalar and a list are the same kind of “thing” (from here on called value), lists can be created from arbitrary values recursively (for example scalars, or lists of scalars and other lists)
Later, ASN.1 came and had the important insight that the same idealized data structure can be represented in different fashions, for example as a binary-efficient version and a human-readable format.
Then, XML “graced” the world for a decade or two, and the main lesson from it was that you don’t want to mix markup languages and configuration languages, and that you don’t want a committee to design these things.
In the meantime, Brendan Eich designed JavaScript. Its prototype-based object system arguably stripped down the rituals of existing OO-systems. Douglas Crockford later extracted the object format (minus functions) into a syntax, and we got JSON.
{
  "foo": [
    { "nested": "attrs" },
    "some text"
  ],
  "bar": 42
}
JSON adds another fundamental idea into the mix:
- Records are unordered collections of name/value pairs. A name is defined to be a unicode string, so a semantic descriptor of the nested value.
Unfortunately, the JSON syntax does not actually specify any semantics of records (objects in JSON lingo), in particular it does not mention what the meaning is if a name appears twice in one record.
If records can have multiple entries with the same name, suddenly ordering becomes important!
But wait, remember that earlier we defined lists as the one construct that imposes an ordering on values.
So in order to rectify that problem, we say that
- A name can only appear in a record once, names must be unique.
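As a rough sketch of what enforcing that rule could look like, here is a strict parser built on Python's standard json module. The helper names reject_duplicate_names and parse_strict are made up for this illustration; only json.loads and its object_pairs_hook parameter come from the standard library.

```python
import json


def reject_duplicate_names(pairs):
    """Build a dict from (name, value) pairs, failing on repeated names.

    Hypothetical helper, not part of any library.
    """
    record = {}
    for name, value in pairs:
        if name in record:
            raise ValueError(f"duplicate name in record: {name!r}")
        record[name] = value
    return record


def parse_strict(text):
    # object_pairs_hook is called for every JSON object with its pairs in
    # document order, so duplicates are still visible at this point.
    return json.loads(text, object_pairs_hook=reject_duplicate_names)
```

Without the hook, Python's json.loads quietly keeps the last value for a repeated name, which is exactly the kind of order dependence the uniqueness rule is meant to remove.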
This is the current state of the programming community at large, where most “modern” configuration languages basically use a version of the JSON model as their underlying data structure. (However not all of them use the same version.)
Improving JSON’s data model
We are not yet at the final “idealized” configuration language, though.
Modern languages like Standard ML define their data types as a mixture of
- records (“structs” in the C lingo)
- and sums (which you can think about as enums that can hold more values inside them)
This allows expressing the common pattern where some fields in a record are only meaningful if another field (the so-called tag field) is set to a specific value.
An easy example: a request can either fail with an error message or succeed with a result.
You could model that as
{
  "was_error": true,
  "error_message": "there was an error"
}
or
{
  "was_error": false,
  "result": 42
}
in your JSON representation.
But in an ML-like language (like, for example, Rust), you would instead model it as
enum RequestResult {
    Error { error_message: String },
    Success { result: i64 },
}
where the distinction in Error or Success makes it clear that error_message and result only exist in one of these cases, not the other.
We can encode exactly that idea into JSON in multiple ways, but not a “blessed” way.
For example, another way to encode the above would be
{
  "Error": {
    "error_message": "there was an error"
  }
}
and
{
  "Success": {
    "result": 42
  }
}
Particularly notice the difference between the language representation, where the type is “closed” (only Success or Error can happen), and the data representation, where the type is “open” (more cases could potentially exist).
This is an important differentiation from a type system: Our idealized configuration language just gives more structure to a bag of data, it does not restrict which value can be where. Think of a value in a unityped language, like Python.
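To make the open/closed difference a bit more concrete, here is a small Python sketch of a consumer for the single-name encoding shown above. The function name describe and the output strings are invented for this example; the point is only that the data accepts any tag, and it is the consumer that decides which ones it knows.

```python
def describe(encoded):
    """Interpret a record like {"Success": {...}} (hypothetical consumer)."""
    # The data is "open": any single name is a structurally valid tag.
    # Unpacking fails with ValueError unless there is exactly one name.
    (tag, value), = encoded.items()

    # The consumer "closes" the type by listing the tags it understands.
    if tag == "Success":
        return f"result: {value['result']}"
    if tag == "Error":
        return f"failed: {value['error_message']}"
    raise ValueError(f"unknown tag: {tag}")
```

A value like {"Timeout": {}} is perfectly fine data in this model; whether it is an error is up to whoever reads it.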
So far we have the notion of
- a scalar (a primitive)
- a list (ordering on values)
- a record (unordered collection of named values)
and in order to get the “open” tagged enumeration values, we introduce
- a tag, which gives a name to a value
We can then redefine record to mean “an unordered collection of tagged values”, which further reduces the number of concepts needed.
And that’s it, this is the full idealized configuration language.
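One possible way to write the whole model down, sketched in Python. The class names Tag and Record are my own choice for this illustration, scalars are assumed to be Python str, int or bytes, and lists are plain Python lists.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Tag:
    """A name attached to a value."""
    name: str
    value: Any  # any value: a scalar, a list, a Tag or a Record


class Record:
    """An unordered collection of tagged values with unique names."""

    def __init__(self, *tags):
        self.fields = {}
        for tag in tags:
            if tag.name in self.fields:
                raise ValueError(f"duplicate name: {tag.name}")
            self.fields[tag.name] = tag.value

    def __eq__(self, other):
        # Dict equality ignores insertion order, which matches "unordered".
        return isinstance(other, Record) and self.fields == other.fields


# The JSON example from the history section, expressed in this model.
config = Record(
    Tag("foo", [Record(Tag("nested", "attrs")), "some text"]),
    Tag("bar", 42),
)

# A tag on its own is also a value, which gives the "open" enumeration.
outcome = Tag("Success", Record(Tag("result", 42)))
```

Note that a record is nothing more than a bunch of tags with unique names, so the only genuinely new concept compared to the s-expression model is the tag.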
Some examples of data modelling with tags
This is all well and good, but what does it look like in practice?
For these examples I will be using JSON with a new < "tag": value > syntax to represent tags.
From a compatibility standpoint, tags (or sum types) have dual properties to record types.
With a record, when you have a producer that adds a field to it, the consumer will still be able to handle the record (provided the semantics of the existing fields is not changed by the new field).
With a tag, removing a tag from the producer means that the consumer will still be able to handle everything the producer sends. It might do one “dead” check on the removed tag, but can still handle the remaining ones just fine.
An example of how that is applied in practice is that in protobuf3, fields of a record are always optional fields.
We can model optional fields by wrapping them in < "Some": value > or < "None": {} > (where the actual value of the None is ignored or always an empty record).
So a protobuf with the fields foo: int and bar: string has to be parsed by the receiver as containing four possibilities:
№ | foo | bar |
---|---|---|
1 | <"None":{}> | <"None":{}> |
2 | <"Some":42> | <"None":{}> |
3 | <"None":{}> | <"Some":"x"> |
4 | <"Some":42> | <"Some":"x"> |
Now, iff the receiver actually handles all four possibilities (and doesn’t just crash if a field is not set, as customary in million-dollar-mistake languages), it’s easy to see how removing a field from the producer is semantically equal to always setting it to <"None":{}>.
Since all receivers should be ready to receive None for every field, this provides a simple forward-compatibility scheme.
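Here is a small sketch of such a receiver in Python, assuming the message arrives as a plain dict and that a missing name stands for <"None":{}>. The helper names read_optional and handle, as well as the returned strings, are invented for this example.

```python
def read_optional(message, name):
    """Return the field as a ("Some", value) or ("None", None) pair."""
    if name in message:
        return ("Some", message[name])
    # A producer that dropped the field looks exactly like one sending None.
    return ("None", None)


def handle(message):
    """Hypothetical receiver that covers all four rows of the table."""
    foo_tag, foo = read_optional(message, "foo")
    bar_tag, bar = read_optional(message, "bar")

    if foo_tag == "None" and bar_tag == "None":
        return "nothing set"
    if foo_tag == "Some" and bar_tag == "None":
        return f"foo={foo}"
    if foo_tag == "None" and bar_tag == "Some":
        return f"bar={bar}"
    return f"foo={foo}, bar={bar}"
```

If the producer later stops sending bar, this receiver only ever ends up in the first two branches; the other two become dead checks, but nothing breaks.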
We can abstract this to any kind of tag value: If you start with “more” tags, you give yourself space to remove them later without breaking compatibility, typically called “forward compatibility”.
To empty list/record or not to
Something to think about is whether records and lists should be defined to always contain at least one element.
As it stands, JSON has multiple ways of expressing the “empty value”:
- null
- []
- {}
- ""
- leave out the field
and two of those come from the possibility of having empty structured values.
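To illustrate the burden this puts on consumers, here is a sketch of the normalization a reader has to do if it wants to treat all five spellings alike. The function name is_empty and the _MISSING sentinel are made up for this example.

```python
import json

# Sentinel to tell "field left out" apart from an explicit null.
_MISSING = object()


def is_empty(record, name):
    """Treat null, [], {}, "" and an absent field as the same empty value."""
    value = record.get(name, _MISSING)
    return value is _MISSING or value is None or value in ([], {}, "")


document = json.loads('{"a": null, "b": [], "c": {}, "d": "", "f": 0}')
```

With this document, the names a through d and the absent name e all count as empty, while f (the number 0) does not.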
Representations of this language
This line of thought originally fell out of me designing netencode as a small human-debuggable format for pipeline serialization.
In addition to the concepts mentioned here (especially tags), it provides a better set of scalars than JSON (specifically arbitrary bytestrings), but it cannot practically be written or modified by hand, which might be a good thing depending on how you look at it.
The way that is compatible with the rest of the ecosystem is probably to use a subset of JSON to represent our idealized language.
There are multiple ways of encoding tags in JSON, each of which has its pros and cons.
The most common is probably the “tag field” variant, where the tag is pulled into the nested record:
{
  "_tag": "Success",
  "result": 42
}
This has the advantage that people know how to deal with it and that it’s easy to “just add another field”, plus it is backward-compatible when you had a record in the first place.
It has multiple disadvantages however:
- If your value wasn’t a record (e.g. an int) before, you have to put it in a record and assign an arbitrary name to its field
- People are not forced to “unwrap” the tag first, so they are going to forget to check it
- The magic “_tag” name cannot be used by any of the record’s fields
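A minimal sketch of this variant in Python, mostly to show where the first and third disadvantage turn up in code. The function names are invented, and the wrapper field name "value" for non-record values is just as arbitrary as the text says.

```python
def encode_tag_field(tag, value):
    """Encode a tagged value by pulling the tag into the record as "_tag"."""
    if not isinstance(value, dict):
        # Non-record values need a wrapper record; the name is arbitrary.
        value = {"value": value}
    if "_tag" in value:
        raise ValueError('the record already uses the reserved name "_tag"')
    return {"_tag": tag, **value}


def decode_tag_field(encoded):
    """Split an encoded record back into (tag, remaining fields)."""
    fields = dict(encoded)  # copy, so the caller's record is not modified
    tag = fields.pop("_tag")
    return tag, fields
```

The second disadvantage needs no code at all: nothing stops a consumer from reading encoded["result"] directly and never looking at "_tag".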
An in-between version of this with fewer downsides is to always wrap the value in an extra JSON record:
{
  "tag": "Success",
  "value": {
    "result": 42
  }
}
This makes it harder for people to miss checking the tag, but still possible of course.
It also makes it easily possible to inspect the contents of value without knowing the exhaustive list of tags, which can be useful in practice (though often not sound!).
It also gets rid of the “_tag” field name clash problem.
Disadvantages:
- Breaks the backwards-compatibility with an existing record-based approach if you want to introduce tags
- Verbosity of representation
- Hard to distinguish a record with the tag and value fields from a tagged value (though you know the type layout of your data on a higher level, don’t you? ;) )
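Sketched in Python (again with invented function names), this variant is about as small as it gets, and it works for scalars without any wrapping:

```python
def encode_tag_value(tag, value):
    """Wrap any value, record or not, together with its tag."""
    return {"tag": tag, "value": value}


def decode_tag_value(encoded):
    """Return (tag, value), insisting on exactly the two expected names."""
    if set(encoded) != {"tag", "value"}:
        raise ValueError('expected exactly the names "tag" and "value"')
    return encoded["tag"], encoded["value"]
```

The last disadvantage shows up immediately: an ordinary record such as {"tag": "div", "value": "text"} decodes without complaint, even if it was never meant to be a tagged value.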
The final, “most pure” representation is the one I gave in the original introduction:
{
  "Success": {
    "result": 42
  }
}
Now you have to match on the tag name first, before you can actually access your data, and it’s less verbose than the above representation.
Disadvantages:
- You also have to know what tags to expect, and it’s harder to query because you need to extract the keys and values from the dict and then take the first one.
- Doing a “tag backwards compat” check is harder, because you can’t just check whether _tag or tag/value are the keys in the dict.
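The second point is easiest to see when you try to guess the encoding from the names alone. The following Python sketch does exactly that; detect_encoding and its return strings are hypothetical, written only for this comparison.

```python
def detect_encoding(encoded):
    """Guess which tag encoding a JSON record uses, from its names alone."""
    names = set(encoded)
    if "_tag" in names:
        return "tag field"
    if names == {"tag", "value"}:
        return "tag/value"
    if len(names) == 1:
        # Could be a tagged value, or just a record with a single field.
        return "single key (or a one-field record)"
    return "plain record"
```

For {"Success": {"result": 42}} and for the unrelated record {"bar": 42} this gives the same answer, so the check can only ever be a heuristic unless you already know the type layout of your data.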