Architecture

NOTE: This document is a work in progress, and will be changed and extended over time.

Design philosophy

Everything should be as stateless as possible.
Everything should be as composable as possible, including in ways unforeseen by the Zap developers.
Validation should be strict and informative; if the user does something wrong, they should be informed of this as quickly as possible, and in an actionable way.
The user should not need to know anything about PostgreSQL whatsoever; as far as they are concerned, it's an internal implementation detail.
However, a DBA should, if they so desire, be able to make sense of what Zap is doing. Likewise, interacting with the database from software in other languages should be reasonably painless. To this end, the schema and queries produced by Zap should reflect common practices in hand-written queries as much as possible.

High-level architecture

The general lifecycle of a query is as follows:

1. The user constructs an AST in-place using operations. Crucially, there is no parsing step; the AST is constructed directly, but the functions used for this are designed like a DSL.
2. Optimizers normalize and improve upon this AST to generate a representation that matches SQL semantics.
3. The optimized AST is converted into an SQL query, a series of parameters, optionally scheduled follow-up queries (eg. for relation handling), and metadata.
4. The query is executed and the results are collected.
5. Optionally, follow-up queries are executed, eg. to resolve relations.
6. The combined results are returned to the user.

Importantly, steps 1-3 and 4-6 can be decoupled; this allows for "pre-compiling" queries for repeated use. Because all AST structures are fully stateless, queries can be reused indefinitely and with any kind of concurrency.
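
The decoupling of the two phases can be illustrated with a minimal sketch. The names compileQuery, executeCompiled and fakeRunQuery, as well as the shape of the AST and of the compiled object, are hypothetical stand-ins rather than Zap's actual API. The only point is that the compile phase (steps 1-3) yields an immutable value that the execute phase (steps 4-6) can reuse any number of times.

```javascript
// Hypothetical sketch: none of these names or shapes are part of Zap's real API.

// Steps 1-3: turn a (deliberately tiny) AST into an immutable compiled query.
function compileQuery(ast) {
    // Placeholder for the real optimizer and SQL generation phases.
    const conditions = ast.where.map((field, i) => `"${field}" = $${i + 1}`);
    const sql = `SELECT * FROM "${ast.table}" WHERE ${conditions.join(" AND ")}`;

    return Object.freeze({ sql, parameterNames: Object.freeze([...ast.where]) });
}

// Steps 4-6: bind concrete values and hand the query to some database driver.
async function executeCompiled(compiled, values, runQuery) {
    const params = compiled.parameterNames.map((name) => values[name]);
    return runQuery(compiled.sql, params);
}

// A stand-in for a database driver, so that the sketch is self-contained.
const executed = [];

async function fakeRunQuery(sql, params) {
    executed.push({ sql, params });
    return [];
}

// Compile once ...
const threadByID = compileQuery({ type: "select", table: "threads", where: ["id"] });

// ... and execute many times, concurrently, without any shared mutable state.
const allDone = Promise.all([
    executeCompiled(threadByID, { id: 1 }, fakeRunQuery),
    executeCompiled(threadByID, { id: 2 }, fakeRunQuery)
]);
```

Because the compiled value is frozen and the per-call values are passed in separately, nothing in the sketch needs locking or copying between calls.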

Operations

The main mechanism for query building is "operations"; essentially just functions which take some input and return an AST node or subtree. Most of the query construction that the user does will involve passing AST nodes returned from operations to other operation calls, essentially constructing a full AST in-place. The operation modules live in src/operations/.

Some example code using Zap operations might look like the following:

// Get a thread with its (visible) posts

select("threads", [
    first(),
    where({ id: threadID }),
    withRelations({
        posts: has("posts.thread", [
            where({ visible: true }),
            startAt(offset),
            first(10)
        ])
    })
]);

Because functions represent declarative keywords rather than immediate actions, some operations are reused in multiple contexts, and will return different AST nodes depending on their input. Every operation validates the specific node types that it receives as input, which ensures that only valid combinations of operations can be constructed.

An example is the index operation, which will return a localIndex node when called as index(), but an index node when called as index("fieldName"). Schema field definitions then only accept localIndex, not index nodes; and likewise, schema index definitions only accept index, not localIndex nodes.
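
The following sketch shows how such context-dependent node types might work. The node shapes and the fieldDefinition, indexDefinition and expectNodeType helpers are assumptions made for illustration, and the validation is hand-rolled rather than done with Validatem; only the index, localIndex naming comes from the description above.

```javascript
// Hypothetical sketch: node shapes and helper names are assumptions, not Zap's real ones.

// One operation, two node types, depending on how it is called.
function index(fieldName) {
    if (fieldName === undefined) {
        return Object.freeze({ type: "localIndex" });
    } else if (typeof fieldName === "string") {
        return Object.freeze({ type: "index", field: fieldName });
    } else {
        throw new TypeError("index() expects either no arguments or a field name");
    }
}

// Consumers check the node type, so invalid combinations fail at construction time.
function expectNodeType(node, expectedType, context) {
    if (node == null || node.type !== expectedType) {
        throw new TypeError(`${context} only accepts '${expectedType}' nodes`);
    }

    return node;
}

// Stand-in for a schema field definition: only accepts localIndex nodes.
function fieldDefinition(name, indexNode) {
    return Object.freeze({
        type: "fieldDefinition",
        name: name,
        index: expectNodeType(indexNode, "localIndex", "A schema field definition")
    });
}

// Stand-in for a schema index definition: only accepts index nodes.
function indexDefinition(indexNode) {
    return Object.freeze({
        type: "indexDefinition",
        index: expectNodeType(indexNode, "index", "A schema index definition")
    });
}

const usernameField = fieldDefinition("username", index());
const emailIndex = indexDefinition(index("email"));
```

In this sketch, passing index("email") to fieldDefinition throws a TypeError immediately, before any query is generated.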

There are two types of operations: regular operations and internal operations.

Regular operations: These are exposed to the end user, and are meant to be used primarily for query-building. They are validated very strictly; they should only allow input that is guaranteed to be representable in the resulting SQL query in some sensible way (unless this cannot be statically ensured).
Internal operations: These are operations that are only meant to be used from optimizers. They are typically validated less strictly, but their inputs are still checked. There are two main types:
1. Operations which represent some internal construct that will be used for SQL generation, but that the user will never specify themselves.
2. Operations which represent a type of operation (with subtypes) that is available to the user, but that would normally involve multiple different operation methods, which would be inconvenient for optimizer authors. An example is the set of moreThan, lessThan, equals, etc. operations, which are all represented by a single condition internal operation.
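
As a rough illustration of the second type, the sketch below has several user-facing comparison operations share one internal condition operation. The node shape, the conditionType property and the flipCondition helper are assumptions; only the operation names are taken from the description above.

```javascript
// Hypothetical sketch: the node shape and the flipCondition helper are assumptions.

const KNOWN_CONDITION_TYPES = ["equals", "lessThan", "moreThan"];

// Internal operation: a single node type that covers every comparison.
function condition(conditionType, expression) {
    if (!KNOWN_CONDITION_TYPES.includes(conditionType)) {
        throw new TypeError(`Unknown condition type: ${conditionType}`);
    }

    return Object.freeze({ type: "condition", conditionType, expression });
}

// Regular operations: thin user-facing wrappers around the internal one.
const equals = (value) => condition("equals", value);
const lessThan = (value) => condition("lessThan", value);
const moreThan = (value) => condition("moreThan", value);

// An optimizer can work with any comparison through the one internal
// operation, eg. to mirror a comparison when its operands are swapped.
const MIRRORED = { equals: "equals", lessThan: "moreThan", moreThan: "lessThan" };

function flipCondition(node) {
    return condition(MIRRORED[node.conditionType], node.expression);
}

const flipped = flipCondition(lessThan(3));
```

The optimizer-side helper never needs to know which of the user-facing functions produced the node; it only deals with condition nodes.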

For validating inputs to operations, Validatem is used. Most of the (partial) validation rules live in src/validators/operations/, as they are commonly reused across operations. However, the top-level validation rules are defined within the operation functions themselves, and therefore input is validated at AST construction time.

Optimizers

Internally, Zap has an 'optimizer' infrastructure that allows for AST transformation, much like what a tool like Babel does for JavaScript. The purpose of optimizers is to perform any and all AST transformations necessary to translate a Zap AST into something that semantically represents an equivalent SQL query. The optimizers live in src/optimizers/.

Optimizers are currently all bundled into the core package with no support for third-party optimizers, but this will likely change in the future. Optimizers are split into multiple categories, which can be enabled/disabled as desired to target a specific set of tradeoffs.

Currently, the following categories of optimizers are defined:

normalization: These are optimizers that are required for Zap to function correctly. They will typically deduplicate AST nodes of which only one should logically exist, convert Zap-specific semantics into their equivalent SQL semantics, and so on. Disabling these will break Zap.
performance: These are optional optimizers that aim to improve the performance of the query in some way, eg. by reducing its complexity when certain patterns are found.
readability: These are optional optimizers that aim to improve the human readability of the generated queries. They can typically be safely disabled until there is a desire to debug Zap's behaviour on a PostgreSQL level, in which case they would make it easier to understand what Zap is doing. If there is no specific performance concern, it's usually a good idea to leave these enabled.

An optimizer is defined as a set of 'visitors', which specify a handler to call for each encountered node of the specified type. The handler can then decide to:

Return a new node. Mutating the existing node is not permitted.
Return a RemoveNode marker: remove the node, all of its children, and any accumulated state (explained later). This will typically be used for extraneous nodes.
Return a ConsumeNode marker: consume the node and all of its children, but leave any accumulated state intact. This will typically be used for 'modifier' nodes which just serve to configure a parent node.
Return a NoChange marker, indicating that the node should remain unmodified.
Return a defer (explained below).

Every optimizer must eventually 'stabilize'; that is, it must return NoChange for all of its visitors once it concludes that all of its work has been done. The optimizer infrastructure enforces this: if an optimizer fails to stabilize after a configured number of iterations (currently 10), the query optimization phase will be aborted. This design allows for iterative optimization of the AST, even if there's a (bounded) bi-directional interdependency between two optimizers.
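
The stabilization loop can be sketched roughly as follows. This is a simplified illustration under several assumptions, not Zap's implementation: nodes are plain objects with a type and a children array, only the NoChange and RemoveNode conclusions are handled (ConsumeNode, defer and state are left out), and all visitors are run as a single set.

```javascript
// Hypothetical sketch of the stabilization loop; not Zap's actual implementation.
const NoChange = Symbol("NoChange");
const RemoveNode = Symbol("RemoveNode");
const MAX_ITERATIONS = 10;

// Visits a node before its children, and records in `status` whether any
// handler reported a change during this pass.
function visit(node, visitors, status) {
    const handler = visitors[node.type];
    const result = (handler === undefined) ? NoChange : handler(node);

    if (result === RemoveNode) {
        status.changed = true;
        return RemoveNode;
    }

    let current = node;

    if (result !== NoChange) {
        status.changed = true;
        current = result;
    }

    const children = current.children
        .map((child) => visit(child, visitors, status))
        .filter((child) => child !== RemoveNode);

    // Build a new object rather than mutating the existing node.
    return { ...current, children };
}

// Repeats full passes until one of them reports no changes at all.
function optimize(ast, visitors) {
    let current = ast;

    for (let i = 0; i < MAX_ITERATIONS; i++) {
        const status = { changed: false };
        const next = visit(current, visitors, status);

        if (next === RemoveNode) {
            throw new Error("The root node cannot be removed");
        } else if (!status.changed) {
            return current; // Stabilized: every handler returned NoChange.
        }

        current = next;
    }

    throw new Error(`Optimizer failed to stabilize after ${MAX_ITERATIONS} iterations`);
}

// Two made-up visitors: drop 'noop' nodes, and unwrap groups with a single child.
const exampleVisitors = {
    noop: () => RemoveNode,
    group: (node) => (node.children.length === 1) ? node.children[0] : NoChange
};

const exampleAST = {
    type: "select",
    children: [{
        type: "group",
        children: [
            { type: "noop", children: [] },
            { type: "where", children: [] }
        ]
    }]
};

const optimized = optimize(exampleAST, exampleVisitors);
```

In this example, the first pass removes the noop node, the second pass unwraps the group that now has a single child, and the third pass changes nothing, which ends the loop. A visitor that returns a new node on every pass would instead trigger the error after 10 iterations.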

The AST is traversed depth-first in pre-order, starting at the root of the tree. This means that handlers for parent nodes are invoked before those of their child nodes.

Each handler is invoked with access to the node that is currently being processed, as well as a number of utilities:

setState and registerStateHandler, for emitting state and capturing that state upstream respectively. Each state item is keyed by a 'type', and multiple items of the same type can be emitted. State items will propagate upwards to the nearest parent that has a handler registered for their type, but only if their originating subtree has not been removed in the meantime. This prevents stale state from being processed. State from consumed nodes is still propagated, as is typically desirable for handling modifier nodes.
defer, which allows for specifying a callback to be called later, after the node's child nodes have been processed. This is commonly used in combination with state handling to collect state from child nodes and, afterwards, construct and return a new node based on the collected information. The defer callback may return all of the same 'conclusions' as a handler callback, except for another defer.
A number of path utilities for determining the path of the currently-processed node and its ancestors. This part of the API is still in flux, but at the time of writing only findNearestStep exists, which locates the nearest ancestor of a given type (using "$object" to denote an object literal and "$array" for an array).
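
How state handling and defer work together can be sketched as below. The utility and marker names follow the description above, but everything else is an assumption made for illustration: the node shapes, the Deferred wrapper, the sqlSelect node type, and the choice to deliver state immediately. The sketch also leaves out node removal and the discarding of stale state.

```javascript
// Hypothetical sketch: utility and marker names follow the text above, but the
// implementation and node shapes are assumptions, not Zap's real code.
const NoChange = Symbol("NoChange");
const ConsumeNode = Symbol("ConsumeNode");

class Deferred {
    constructor(callback) {
        this.callback = callback;
    }
}

function visit(node, visitors, handlerStack = []) {
    const ownHandlers = new Map(); // State handlers registered by this node.

    const utilities = {
        registerStateHandler: (type, handler) => { ownHandlers.set(type, handler); },
        setState: (type, value) => {
            // Walk upwards to the nearest ancestor that handles this state type.
            for (let i = handlerStack.length - 1; i >= 0; i--) {
                const handler = handlerStack[i].get(type);

                if (handler !== undefined) {
                    handler(value);
                    return;
                }
            }
        },
        defer: (callback) => new Deferred(callback)
    };

    const visitor = visitors[node.type];
    const result = (visitor === undefined) ? NoChange : visitor(node, utilities);

    if (result === ConsumeNode) {
        return ConsumeNode; // Any state that was already emitted stays intact.
    }

    const keepsNode = (result === NoChange || result instanceof Deferred);
    const base = keepsNode ? node : result;

    // Children see this node's handlers as the nearest ones on the stack.
    const children = base.children
        .map((child) => visit(child, visitors, [...handlerStack, ownHandlers]))
        .filter((child) => child !== ConsumeNode);

    if (result instanceof Deferred) {
        // All children are done, so the state they emitted has been collected.
        const conclusion = result.callback();

        if (conclusion instanceof Deferred) {
            throw new Error("A defer callback may not return another defer");
        } else if (conclusion !== NoChange) {
            return conclusion;
        }
    }

    return { ...base, children };
}

// Modifier nodes emit state and are consumed; the parent collects the state
// and builds its replacement node once its children have been processed.
const visitors = {
    select: (node, { registerStateHandler, defer }) => {
        const collected = { limit: null, offset: null };
        registerStateHandler("limit", (value) => { collected.limit = value; });
        registerStateHandler("offset", (value) => { collected.offset = value; });

        return defer(() => ({ type: "sqlSelect", table: node.table, ...collected, children: [] }));
    },
    first: (node, { setState }) => {
        setState("limit", node.count);
        return ConsumeNode;
    },
    startAt: (node, { setState }) => {
        setState("offset", node.offset);
        return ConsumeNode;
    }
};

const result = visit({
    type: "select",
    table: "posts",
    children: [
        { type: "startAt", offset: 20, children: [] },
        { type: "first", count: 10, children: [] }
    ]
}, visitors);
```

Here the first and startAt modifier nodes disappear from the tree, while the values they emitted end up as the limit and offset properties of the node that the select handler builds in its defer callback.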