Roadmap:
- Write encoding and roundtrip tests for value encodings
- Write a schema for a testing table
- Figure out an initial version of the schema DSL
- Encode a record for that table
- Decode the encoded representation, and verify that it matches
- Write encoding and roundtrip tests for sample records
- Build a todo list app to design and test out the APIs in a real-world application
- Build zapforum to test out more complex real-world requirements
======
# Value encodings
Timestamp (UTC):
  Value: fixed-length arithmetic
  Index: fixed-length arithmetic
Timestamp (with timezone):
  Value: fixed-length arithmetic (local time + timezone ID)
  Index: fixed-length arithmetic (in UTC)
Integer:
  Value: orderable varint
  Index: orderable varint
Decimal:
  Value: orderable varint (with fixed decimal point shift)
  Index: orderable varint (with fixed decimal point shift)
Boolean:
  Value: 0 or 1
  Index: 0 or 1
Binary:
  Value: as-is; inline (with length prefix), or blob ID reference
  Index: as-is; inline (truncated)
String:
  Value: UTF-8; inline (with length prefix), or blob ID reference
  Index: Unicode Collation Algorithm (truncated)
JSON:
  Value: CBOR
  Index: (not indexable)
Record reference:
  Value: TBD
  Index: TBD
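To make the 'fixed-length arithmetic' idea concrete, below is a minimal sketch (in TypeScript, purely illustrative; the actual zapdb encoding is not pinned down here) of a binary-sortable UTC timestamp encoding: a signed 64-bit millisecond count, stored big-endian with the sign bit flipped, so that plain byte-wise comparison of the encoded form matches chronological order.

```typescript
// Minimal sketch, NOT the actual zapdb encoding: a fixed-length, binary-sortable
// encoding for UTC timestamps, stored as a signed 64-bit millisecond count.
// Flipping the sign bit makes unsigned big-endian byte comparison match numeric order.

function encodeTimestampUTC(millis: bigint): Uint8Array {
  const buffer = new Uint8Array(8);
  const view = new DataView(buffer.buffer);

  // Reinterpret as unsigned two's complement, then flip the sign bit so that
  // negative values sort before positive ones in byte-wise comparison.
  view.setBigUint64(0, BigInt.asUintN(64, millis) ^ (1n << 63n), false);

  return buffer;
}

function decodeTimestampUTC(encoded: Uint8Array): bigint {
  const view = new DataView(encoded.buffer, encoded.byteOffset, 8);

  // Undo the sign bit flip and reinterpret as a signed value.
  return BigInt.asIntN(64, view.getBigUint64(0, false) ^ (1n << 63n));
}

// Byte-wise comparison of encoded values now matches chronological order, eg.
// encodeTimestampUTC(-1000n) sorts before encodeTimestampUTC(0n).
```

The encoding and roundtrip tests from the roadmap would essentially verify this roundtrip and ordering property for every value type.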
======
# Migrations as a first-class feature
Database migrations are a first-class and *required* feature in zapdb. Every table is defined as the sum of the (ordered) migrations that affect it. It is not possible to modify schemas out-of-band; if it's not defined in a migration, it doesn't exist. Making the database directly aware of migrations, and declaring them the exclusive way of defining schemas, has a number of important benefits. For example, whole-table data conversions become unnecessary because the database has 'historical' insight into old schema versions, and that same insight allows the database to automatically infer rollback operations from migration operations.
To achieve this, every database internally stores a copy of all the migrations it knows about, with an index and an *exact* structural representation of that migration. If a conflicting migration definition is ever provided, the database will refuse to initialize, ensuring that the database definition in code *always* matches that which produced the existing data in the database.
It is possible for a database administrator to make the database 'forget' a certain migration; but doing so involves a (potentially destructive) rollback of all data which was stored using the schema components defined in that migration, and it is only possible for the most recent migration at any given time. This means that if the schema consists of 10 migrations and you want to roll back migration 8, you must roll back migrations 10 and 9 first.
It is generally more practical to 'undo' schema changes by defining the reversal as a *new* migration instead; this 'forget' functionality is therefore mainly expected to be used when eg. resolving version control merges that specify conflicting database migrations, where different developers have different local state in their databases.
To make this first-class migration model possible, *all* migrations are required to be 'exhaustive'; that is, they must not only specify the full set of changes for that migration, but also the full set of operations for a rollback. Most of these rollback operations can be automatically inferred by the database by simply inverting the operation; but there are some cases where this is not possible, eg. when deleting a required field.
In that case, the migration author must specify how to reconstruct the field's contents if that field deletion were ever reverted; after all, it would be required again and so cannot be empty, but the original data was deleted when the migration was first applied.
A migration which is not exhaustive is invalid, and will produce an error, preventing the database from being initialized.
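To illustrate what an 'exhaustive' migration might look like, here is a hypothetical sketch in TypeScript. The schema DSL is still to be designed (see the roadmap), so every name and structure below is an assumption; the point is only that a non-invertible operation, like deleting a required field, must carry an explicit reconstruction rule for rollback.

```typescript
// Hypothetical migration shape; the real schema DSL does not exist yet, so all
// names here are illustrative assumptions rather than actual zapdb API.

type MigrationOperation =
  | { type: "addField"; table: string; field: string; fieldType: string; required: boolean }
  // Deleting a required field cannot be automatically inverted: the original data
  // is gone, so the author must specify how to reconstruct it on rollback.
  | {
      type: "deleteField";
      table: string;
      field: string;
      reconstruct: (record: Record<string, unknown>) => unknown;
    };

interface Migration {
  index: number; // position in the ordered list of migrations for this database
  operations: MigrationOperation[];
}

const exampleMigration: Migration = {
  index: 8,
  operations: [
    { type: "addField", table: "users", field: "displayName", fieldType: "string", required: false },
    {
      type: "deleteField",
      table: "users",
      field: "legacyName",
      // If this migration is ever rolled back, 'legacyName' becomes required again;
      // derive a stand-in value from another field.
      reconstruct: (record) => record["displayName"] ?? "(unknown)",
    },
  ],
};
```

A migration missing such a reconstruction rule would be rejected as non-exhaustive, per the rule above.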
======
# Per-record schema versioning, and rolling migrations
One of the neat features that first-class migrations enable is 'rolling migrations'. Whereas in traditional RDBMSes a schema change necessitates immediately reconstructing the entire table, this is not necessary in zapdb. Instead, every record is prefixed with a schema index: an ID that specifies which 'revision' of the schema (ie. which migration) the record was encoded against. This allows records encoded against different versions of the schema to co-exist in one table.
When a record is retrieved that was encoded against a non-latest version of the schema, it is converted to the most recent schema on-the-fly after retrieval; this ensures that the application is always working with consistently-shaped data. The database may be configured to do one of two things:
1. Immediately write back the converted record in the new schema version, or
2. Only write it back once it is modified or otherwise stored, ie. when a write operation would occur anyway
Even though the database permanently has access to all revisions of the schema, due to migrations being a first-class feature, it is generally advisable not to keep 'old' records around for too long; the on-retrieval conversion step requires some computational work, and this cost compounds as the record's schema revision becomes further removed from the latest (ie. target) schema revision. Therefore, the database offers a number of options:
1. After a migration occurs, start continuously converting records in the background whether they are accessed or not (this is a non-blocking process)
2. Do not automatically convert records, instead relying entirely on regular database access to surface old records (as described above)
The first option is likely to be the right one in a production environment, as it improves query times. The second option, however, may be useful in a development environment, as the developer might be doing schema rollbacks for one reason or another, and those *do* involve a blocking conversion of data to an older schema revision. It may then be preferable to keep values represented in an old schema revision for as long as possible, to reduce the potential rollback time - records which were never converted to the newer schema revision also don't need to be rolled back.
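A minimal sketch (in TypeScript, with assumed names) of the on-retrieval conversion path described above: the record's schema index tells the database which migrations still need to be applied, and the configured write-back policy decides when the upgraded form is persisted.

```typescript
// Illustrative sketch of on-the-fly conversion; types and names are assumptions.

interface StoredRecord {
  schemaIndex: number; // the migration (schema revision) the record was encoded against
  fields: Record<string, unknown>;
}

interface MigrationStep {
  // Upgrades a record's fields from the previous schema revision to this one.
  upgrade(fields: Record<string, unknown>): Record<string, unknown>;
}

type WriteBackPolicy = "immediate" | "on-next-write";

function convertToLatest(record: StoredRecord, migrations: MigrationStep[]): StoredRecord {
  let fields = record.fields;

  // Apply every migration that came after the record's revision, in order; the
  // further behind the record is, the more steps (and thus work) this takes.
  for (let index = record.schemaIndex + 1; index < migrations.length; index++) {
    fields = migrations[index].upgrade(fields);
  }

  return { schemaIndex: migrations.length - 1, fields };
}

// With an "immediate" policy the caller persists the converted record right away;
// with "on-next-write" it is only persisted when a write would happen anyway.
```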
======
# Index formatting and querying
Keys should be generated based on the index key encoding for the value type. This encoding *must* be binary-sortable; that is, in binary (encoded) representation, the ordering must *exactly* match the desired ordering in structural (decoded) form. This necessitates that all value types can be represented in such a way; if not, they cannot be used to create an index.
Index keys *may* be truncated to a certain maximum length, to satisfy database limitations. In such cases, zapdb needs to recognize that the key length limit has been hit, and do an *additional* iterative check on the results; it is possible to get a false positive when the query and a record share a matching prefix of the maximum key length, but the full values do not match.
This also needs to be taken into account for range queries; when the queried value meets or exceeds the key length limit, every match must separately be compared against the query value to ensure that it really *does* fit within the range.
The need for this additional check can (probably?) be statically determined from the query value during planning, so the additional iteration only needs to occur for known-ambiguous queries. This means that those queries will incur a small performance penalty (*especially* when there are many false positives; this needs to be documented!), but queries which fall *below* the key length limit are not affected and remain a straight index lookup.
The actual index lookups are done in one of two ways, depending on the type of clause:
- Exact match: do an index lookup for a specific key
- Comparison: do a range lookup for the first and/or last query value
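A sketch of the extra verification step for truncated index keys (TypeScript, helper names assumed): when the encoded query key reaches the truncation limit, the index lookup is treated as ambiguous and every candidate is re-checked against its full value.

```typescript
// Sketch of false-positive filtering for truncated index keys; names are assumptions.

const KEY_LENGTH_LIMIT = 64; // example truncation limit, in bytes

// Decided statically during query planning: only query keys that hit the limit
// can produce ambiguous (prefix-only) index matches.
function isAmbiguousQuery(encodedQueryKey: Uint8Array): boolean {
  return encodedQueryKey.length >= KEY_LENGTH_LIMIT;
}

interface Candidate {
  fullEncodedValue: Uint8Array; // the untruncated encoded value of the indexed field
}

function filterExactMatches(candidates: Candidate[], encodedQueryKey: Uint8Array): Candidate[] {
  if (!isAmbiguousQuery(encodedQueryKey)) {
    // Below the limit, an index hit is already an exact match.
    return candidates;
  }

  // At or above the limit, an index hit only proves a shared prefix; weed out
  // false positives by comparing against the full encoded value.
  return candidates.filter((candidate) => bytesEqual(candidate.fullEncodedValue, encodedQueryKey));
}

function bytesEqual(a: Uint8Array, b: Uint8Array): boolean {
  return a.length === b.length && a.every((byte, index) => byte === b[index]);
}
```

Range queries would do the analogous thing, comparing each candidate's full encoded value against the range bounds instead of a single key.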
======
# Handling unstable encodings
There are a number of value type encodings which are 'unstable', ie. the encoded representation can be different depending on *when* it is created, eg. due to the use of an external data source that receives updates. Currently known cases of this are:
- Timestamps with timezones; dependent on the tz database
- Strings; dependent on the DUCET dataset, used with the UCA (Unicode Collation Algorithm)
Unfortunately, this creates a complication for sorting in particular; if the sortable representation can change over time, encoded keys generated with different versions of the dataset can *not* be compared against each other, as the dataset could have changed the sorting rules in some way. For example, DUCET may have swapped around the order of certain characters, or one timezone may now sort *before* another instead of *after* it.
Therefore, the database must internally maintain a record of which version of each of these external datasets was previously used to generate sortable representations; whenever such a dataset changes through eg. a software update, the database enters an unstable state. It will need to regenerate all of the existing sorting keys before queries can safely be executed again. Two options can be provided to the database administrator for when this occurs:
1. Block the database; all sortable representations will be regenerated at once, and all queries are held back until this process has completed. No queries against the database can be carried out during this time, but correct query results are guaranteed at all times. This is also the fastest option, as the entire index can be thrown out at once and regenerated from scratch.
2. Regenerate sortable representations as a background task; queries to the database are permitted as normal, but any kind of range query on the affected fields *may* produce false positives or false negatives while the regeneration process is in progress. This maximizes database availability, at the cost of potentially wrong query results. It is also the slower option; instead of throwing out the entire index, the database now needs to selectively modify and/or delete index entries as individual records are converted, which requires additional lookups.
In the future, it may be worth exploring whether it is possible to make these options configurable on a per-field level; such that eg. wrong results are allowed for *most* fields, but not for security-sensitive fields, for which queries are held back until the process completes. In such a setup, the database could prioritize converting the blocking fields, leaving some time between each such index conversion for newly-possible queries to complete.
Note that for the second option to work, a separate 'reverse index' needs to be maintained, mapping records to their index key(s), which costs additional storage space. As the dataset which was originally used to generate the index keys may no longer be available, and the old sortable representations therefore cannot be reproduced anymore, this mapping is needed to determine which outdated index entries need to be invalidated when the sortable representation is regenerated (or whether it needs to be regenerated at all!). The first option does not require this, as it wipes out the entire index at once.
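A small sketch (TypeScript, metadata layout assumed) of how the database could detect that its sortable representations have become unstable: it records the dataset versions that were used to build the current index keys and compares them against the versions shipped with the running software.

```typescript
// Sketch of staleness detection for externally-derived sort keys; the metadata
// layout and version identifiers here are assumptions.

interface SortDatasetVersions {
  tzdb: string;  // tz database release used for timezone ordering, eg. "2024a"
  ducet: string; // DUCET version used for UCA string collation, eg. "15.1.0"
}

function sortKeysAreStale(stored: SortDatasetVersions, current: SortDatasetVersions): boolean {
  return stored.tzdb !== current.tzdb || stored.ducet !== current.ducet;
}

// When stale, the administrator-chosen policy kicks in:
//  - "block": throw away the affected indexes, regenerate from scratch, and hold
//    back queries until done (always correct, fastest overall).
//  - "background": keep serving queries and regenerate per record, using the
//    reverse index (record -> index keys) to invalidate outdated entries; range
//    queries on affected fields may be temporarily wrong.
type RegenerationPolicy = "block" | "background";
```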
======
To spec out:
- Application-defined index types