Todo:
- Operators
- BETWEEN
- LIKE/ILIKE?
- ISNULL / NOTNULL
- IS ...
- Condition modifiers
x allOf
x anyOf
x not
- Boolean logic
x anyOf(...)
x allOf(...)
- Annotate placeholder/parameter nodes with the expected parameter type in their position, for validation of input parameters at query time
- Rename table -> collection everywhere, also column -> field?
- Update wrapError API
- Figure out a way to semantically deal with the SQL boolean behaviour of propagating NULL (eg. as described on https://www.postgresql.org/docs/8.1/functions-subquery.html for IN)
- Some sort of convertAs method for schema building, when changing the data type of a column - disallow a direct type override, and require it to be wrapped in a changeType method that also takes the conversion logic as an argument?
Docs:
- Emphasize that users should feel free to experiment; the library will tell them (safely) if they are trying to do something invalid - need to help the user overcome the "but what if I get it wrong" fear (also emphasize the flexibility of schema updates to help with this, "it can always be fixed later" or so)
- Instruct users to report it if raqb generates invalid SQL, as this is considered a bug in raqb, regardless of the input (except when it's caused by `unsafeSQL`)
- Explain the base mental model for relations; the relationship between things is always that one Thing 'owns' many other things. Therefore, has(...) means all the items in a collection that reference the current item in their (remote) column, whereas belongsTo(...) means an item in another collection that's referenced in one of the current item's local columns
- Explain that aggregates (collapseBy) are always applied in reference to the being-queried table; eg. if you want to collapse user counts by their role_id, you should base-query the users table, not the roles table
- Demonstrate the usefulness of normalization through examples:
- Deduplication of data (eg. user.friends containing duplicate user records)
- Recursion is impossible (eg. user -> posts -> replyUsers or something)
Considerations:
- Disallow non-lowercase column/table names entirely due to PostgreSQL weirdness? Or just cast them to lowercase?
- For rollback of NOT NULL column deletions in migrations: require each column deletion change to include an initializer function for recreating it afterwards, otherwise it cannot be rolled back
- Before applying or rolling back a migration, all the migrations should be evaluated in-memory, so that the state at any given migration can be determined
- Instead of applying the delta in the migration directly, always generate the "actual delta" from the evaluated state at the previous point vs. that at the new point? This can skip intermediate steps when eg. one migration adds a column that the next migration removes, and should generally result in a much more consistent outcome (similar to deterministic package management)
FAQ:
- Why have you changed all the names of things, like table and column and GROUP BY?
- One of the primary goals of zapdb is to get more people using relational databases who were previously intimidated by them. Unfortunately, typical RDBMSes are full of jargon that isn't intuitively understandable to many new users. Because of this, zapdb uses more intuitive terms that users can understand without having to memorize them - it's much easier for experienced RDBMS users to adapt to intuitive terms, than it is for inexperienced users to learn the jargon.
Terminology:
- Collection: table
- Field: column
- Local: column on the queried table itself
- Remote: foreign column
- Operation: any kind of item in a query; eg. select(...), but also where(...) and collapseBy(...) -- need to emphasize that this does *not* imply immediate execution
- Clause: any kind of operation that modifies the behaviour of its parent operation in some way; eg. a where(...) inside of a select(...) or inside of a `has(...)`
- Predicate: condition, like moreThan(3)
- Meta-queries?: EXISTS, IN, etc. - queries that operate on other queries
Tricky usecases:
- https://gist.github.com/samsch/83f91a66eaf96a70c909ae80d4adfe3e
----
foo = ANY (subquery)
foo = ANY (array)
foo IN (subquery)
foo IN (a, b, c)
alias: SOME -> ANY
----
Recommendations to make:
- Express moreThanOrEqualTo as not(lessThan(...))
- Express lessThanOrEqualTo as not(moreThan(...))
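These rewrites also hold under SQL's NULL-propagating logic (NOT NULL is NULL, just as `a >= b` is NULL when either side is NULL). A minimal JS sketch of the predicate algebra, using hypothetical `not`/`lessThan` combinators rather than raqb's actual internals:

```javascript
// Sketch: for totally-ordered values, a >= b is exactly !(a < b),
// so moreThanOrEqualTo(x) can be compiled as not(lessThan(x)).
const lessThan = (limit) => (value) => value < limit;
const not = (predicate) => (value) => !predicate(value);

const moreThanOrEqualTo = (limit) => not(lessThan(limit));

// moreThanOrEqualTo(3) matches 3 and above:
[ 2, 3, 4 ].map(moreThanOrEqualTo(3)); // -> [ false, true, true ]
```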
----
Ideas:
- Make placeholders typed so that we can warn the user when they try to pass in invalid things?
- Safe client API that returns a disposer instead of a client upon client creation?
- For modular/extensible design, allow plugins to supply operations + optimizers, and expect it all to be compilable to the core operations set? Need to find a way to keep same-named operations in different plugins from clashing. Also need dependency-solving to make it possible to specify "my optimizers should run before those of $otherPlugin".
----
Things to check:
- How are NULLs handled in all cases/permutations? Is that handling correct?
- Maybe rename `sqlExpression` to `sqlFragment` or so, to avoid confusion with `expression`? Or is it `expression` that needs to be renamed to something else, like `predicate` or `assertion`?
----
Boolean operations:
_,
not {
equals,
moreThan,
lessThan {
_,
anyOf,
allOf {
ARRAY,
PLACEHOLDER<ARRAY>
}
LITERAL,
PLACEHOLDER<LITERAL>
}
}
Considerations:
- Allow nested not() modifiers? For composability with things that use not() themselves
{ foo: anyOf([ 30, moreThan(42) ]) } -> WHERE foo = 30 OR foo > 42
{ foo: anyOf([ 30, 42 ]) } -> WHERE foo = ANY(ARRAY[30, 42])
{ foo: anyOf([ moreThan(30), moreThan(42) ]) } -> WHERE foo > ANY(ARRAY[30, 42])
Group into condition types, and based on the count of each, determine how to represent that particular type, and then glue the types together
eg. WHERE foo > ANY(ARRAY[30, 42]) OR foo < 42 OR foo = ANY(ARRAY[30, 42])
Make $or/$and abstraction to make this gluing easier in multiple places
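A minimal sketch of that grouping-and-gluing step, using a simplified `{ op, value }` condition shape as an illustration (this is NOT raqb's real AST):

```javascript
// Group same-typed conditions together, represent each group as a single
// `op ANY(ARRAY[...])` when it has more than one member, and glue the
// resulting groups together with OR.
function compileAnyOf(field, conditions) {
	const groups = new Map();
	for (const { op, value } of conditions) {
		if (!groups.has(op)) groups.set(op, []);
		groups.get(op).push(value);
	}
	const sqlOperators = { equals: "=", moreThan: ">", lessThan: "<" };
	const parts = [ ...groups ].map(([ op, values ]) => {
		const operator = sqlOperators[op];
		return (values.length === 1)
			? `${field} ${operator} ${values[0]}`
			: `${field} ${operator} ANY(ARRAY[${values.join(", ")}])`;
	});
	return parts.join(" OR ");
}

compileAnyOf("foo", [
	{ op: "equals", value: 30 },
	{ op: "moreThan", value: 42 }
]);
// -> "foo = 30 OR foo > 42"
```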
----
MARKER: Relation resolution
- Generate relation queries after executing their base query, because by then we have access to the necessary schema information (as we are in the execution phase)
- Eventually replace this with a mechanism where relation queries can be pre-generated, with special placeholder markers to be filled in later, to:
a) aid static analysis
b) allow alternative relation resolving implementations to queue follow-up queries with (arbitrary?) to-be-filled-in details
- Maybe use ["foo", placeholder, "bar"] format instead of a single string, when generating the query? That would be a bit hacky...
- Implement various kinds of JOINs with table-prefixed resultset column names, those column names being determined at query execution time
- Figure out a way to represent this in the pre-generated query, maybe the same mechanism as the relation query placeholder mechanism above?
- Different API designs for different JOIN types? eg. a `with(otherTableQuery)` for left joins, something similar for right joins, and a `combine(tableAQuery, tableBQuery)` for full joins and inner joins
- Implement through relation queries with the JOIN mechanism
- AND/OR as `all()` and `any()`
- Can also fix the placeholder thing by having a separate kind of nodePlaceholder that gets filled in with a query node; and then only doing the astToQuery operation at the very last moment, after state has been collected from the database and all of those nodes can be filled in
- In this case, there should probably be a way to provide 'global' values for nodePlaceholders (applicable to the query + any derived ones like relations) and local ones (just the query itself)? Maybe this is already solved by using strings vs. symbols for the placeholder names? That would work the same as with normal placeholders, and therefore be consistent implementation-wise
-----------
Generic AST optimization infrastructure instead of inline modifications in the stringification phase
Pipeline would then look like:
Build AST -> Optimize AST -> Compile to SQL
Three AST optimization purposes:
- Normalization/correctness (required)
- Readability (optional)
- Performance (optional)
--------
Renames:
- remote -> foreign
- remoteSide? -> foreignSide
- sql -> unsafeSQL
Transforms should ideally be cached for a query, maybe auto-do this for top-level constructs like SELECT, so that a full query is immediately performantly reusable?
How to deal with such queries being specified as sub-expressions then? Maybe cross-query optimization needs to happen?
FIXME: Document the design for dealing with the potentially-cyclical operations dependencies within the module
FIXME: Linter rule for verifying that all requires inside of the affected folders (operations/ and validators/operations/) correctly pass in operations, to avoid polluting the AST with Validatem validators
Walker design:
- Allow a visitor to place a 'marker' on a given node, that can be referenced back to from another, more deeply-nested visitor (referencing the nearest, a la React context) - this is useful for hoisting things up in the AST. Need to figure out how to combine this with immutable AST walking, though, without pulling the rug out from under the currently-walked subtree.
- Possibly Babel's walker implementation might have a solution to this category of problem?
- Possibly cursors are a useful concept to use here? As mutable pointers to otherwise immutable data. This might cause unexpected behaviour though, with different items being walked than currently (newly) exist in the tree, as the walker would still be operating on the old tree -- unless a change higher-up would trigger a re-walk of that entire subtree, which would require all optimizations to be idempotent.
- Make sure to iterate again from whatever new AST subtree was returned from a visitor! So that other visitors get a chance.
- How to deal with a case where in notExpression(foo), foo returns a notExpression(bar), and the notExpression deduplicator does not happen anymore because the top-most notExpression was already previously visited? Maybe re-run the whole sequence of visitors repeatedly until the entire tree has stabilized? This also requires idempotence of visitors! And makes persisting/caching processed query ASTs even more important.
- Apply a walking iteration limit to detect trees that will never stabilize
- Change handling strategies:
- After change to a parent marker: abandon current subtree, move back to new subtree
- After completion of walk: re-walk, check if the tree has stabilized
- Allow the walker itself to mutate nodes, but set a "has mutated" flag whenever a visitor returns a new subtree, to know that it has not stabilized yet
- Allow the user to disable certain categories of optimizations (readability, performance) but not critical ones (normalization)
- For this to be possible, select(...) and such cannot auto-optimize! Because there would be no logical place to specify this configuration
- A hacky workaround would be a configuration node wrapped around the clauses, though...
- Probably better to provide an `optimize` wrapper node that goes around the entire query; either the user can specify it themselves with custom optimization configuration, or the query executor will do it by itself automatically, maintaining an internal cache for this in a WeakMap so that the user doesn't have to care about it.
- Optimizers should be provided as a single array of category-tagged items, rather than a separate array for each category; otherwise the correct execution order cannot be ensured between items of different categories that are nevertheless dependent on one of the two having run first.
- Optimizers *must* a) be idempotent and b) have a terminal (NoChange) condition!
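A sketch of the "re-walk until stabilized" loop with an iteration limit, assuming optimizers are pure functions from AST to AST that return their input unchanged (by reference) when they have nothing left to do:

```javascript
const MAXIMUM_ITERATIONS = 100;

function optimizeUntilStable(ast, optimizers) {
	for (let i = 0; i < MAXIMUM_ITERATIONS; i++) {
		let next = ast;
		for (const optimizer of optimizers) {
			next = optimizer(next);
		}
		// A full pass with no change means the tree has stabilized
		if (next === ast) return ast;
		ast = next;
	}
	throw new Error("AST failed to stabilize; an optimizer is probably not idempotent");
}

// Example optimizer: flatten nested not() nodes (at the root only, for brevity)
const flattenNot = (node) =>
	(node.type === "not" && node.inner.type === "not")
		? flattenNot(node.inner.inner)
		: node;

optimizeUntilStable(
	{ type: "not", inner: { type: "not", inner: { type: "value", value: 1 } } },
	[ flattenNot ]
);
// -> { type: "value", value: 1 }
```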
-------
Planned AST optimization phases (loosely ordered):
x Combine multiple `where(...)` into a single `where(allOf(...))`
x Flatten nested notCondition
x Group >1 same-typed conditionTypes into a single condition(_internalArray(...))
x Invert condition modifiers (not, anyOf, allOf, ...) into expression modifiers
x Flatten nested notExpression
x Flatten nested same-typed allOfCondition/anyOfCondition
x Flatten nested same-typed allOfExpression/anyOfExpression
-----
Issue tracker template:
<!--
Hi! Thanks for trying out raqb. If you're reading this, that means you've probably either found a bug, or you're not quite sure how to use something. We'd be happy to help!
To help us help you, please provide as much detail about your problem as possible. This includes example code, a description of what your goal is and the usecase it's for, exactly how it's failing (eg. a copy-pasted error message), and so on. The more detail you provide, the faster we can find a solution together!
When posting code, please provide as much surrounding code as you can, and change it as little as possible from how it looks in your project. The more context you provide, the easier it is for us to understand what you're trying to do. It's okay for your code to be messy!
Please also keep in mind the following things, which are unlike most other open-source projects:
- raqb is an open-source project in every sense of the term. It's meant to empower open-source (especially non-commercial) developers to build better software, faster. Because of this, we will not implement features in raqb that are meant to solve problems specific to proprietary software (eg. private optimizers). You are still free to use raqb in proprietary software, as the license allows it - but you're expected to be a good open-source citizen, and contribute back your improvements.
- Technology and politics are inseparable. Software is infrastructure, and the choices that software developers make directly and indirectly impact the lives of millions of people - disproportionately those with less power in society. Because of this, we WILL NOT PROVIDE SUPPORT to ethically objectionable organizations. This includes, but is not limited to, organizations that build software for deportation, and the adtech industry. You are not welcome in our community.
You can remove this block of text after reading it, or start typing after the arrow below. Either way, this block of text will not be visible in your issue.
-->
-----
MARKER:
- Refactor relation operations to new design
- Implement AST optimization infrastructure (incl. solving the immutable reference problem?)
-----
NOTE: May need https://www.npmjs.com/package/browser-hrtime for process.hrtime.bigint support in browsers! Need to investigate whether this (or something similar) is already being used by bundlers by default, or whether they use a shim without bigint support.
FIXME: Remove all the .type stuff and replace with typeOf()
FIXME: Document that `DEBUG=raqb:ast:optimize:*` can be used for tracking down unstable optimizers (symptom = error telling you so, should not be a stack overflow anymore)
Pitch: "Easier than MongoDB" - no more if(foo.bar) checks, no more data format mismatches; the database ensures for you that all data looks as expected, and it's easy to change the format if needed.
------
# PostgreSQL design notes
VALUES: virtual, in-memory table
JOIN: combine fields of rows in multiple tables, correlating on some sort of inter-table predicate
INNER: only rows that exist in all tables, ie. with correlation
FULL: all rows in any table, even without correlation
LEFT/RIGHT: FULL for one side, INNER for the other
CROSS: every row in every table combined with every row in every other table (cartesian product)
LATERAL: modifier that marks a query as having a data dependency on some other query in the JOIN, eg. comparing against an aggregate function over another table
ex. https://heap.io/blog/engineering/postgresqls-powerful-new-join-type-lateral
ON <condition>: specifies the join condition
USING (<column>): shorthand for `ON a1.<column> = a2.<column>`
UNION: concatenate rows from multiple tables
GROUP BY <predicate>: produces 1 "collapsed" row for each unique <predicate> value, with some of the result columns possibly being the results of aggregate functions over all of the "collapsed" input values for that row
<- WHERE: controls inputs into the aggregate functions *without* knowledge of aggregate function results (more efficient, as it does not aggregate rows eliminated here)
HAVING: controls outputs of the aggregate functions and inputs into the results *with* knowledge of aggregate function results - note that this still eliminates individual rows, *not* the "collapsed" rows!
(should divert aggregate constraints into HAVING, and all others into WHERE)
ROLLUP -> hierarchyOf - (a, b, c), (a, b), (a) -- cannot be nested!
CUBE -> permutationsOf (including implicit null for each slot) -- cannot be nested!
comma list -> permutationsOf (excluding implicit null for each slot)
# Table functions (things that produce tables) - note that some table functions might require explicit defining of output columns with `AS`, as it controls function behaviour
UNNEST: array(s) to rows, analogous to Array#values; one row per max(arrayLengths) array index, one column per input array
WITH ORDINALITY: analogous to Array#entries, add a column with the index ("ordinality") of each row; column name defaults to 'ordinality'
ROWS FROM(): like JOIN, but for table functions, correlated on index ("ordinality"); each individual function call can have its own `AS` clause
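The multi-array UNNEST behaviour can be sketched in JS (hypothetical helper, just to pin down the semantics): one row per index up to the longest input array, with shorter arrays padded out with NULL.

```javascript
function unnest(...arrays) {
	// One row per index of the longest array ...
	const rowCount = Math.max(...arrays.map((array) => array.length));
	return Array.from({ length: rowCount }, (_, index) =>
		// ... one column per input array, null where an array has run out
		arrays.map((array) => (index < array.length) ? array[index] : null)
	);
}

unnest([ 1, 2, 3 ], [ "a", "b" ]);
// -> [ [ 1, "a" ], [ 2, "b" ], [ 3, null ] ]
```

WITH ORDINALITY would then correspond to appending the row index as an extra column.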
# GROUP BY combinator semantics
GROUP BY foo, bar; -- group by permutationsOf(foo, bar)
GROUP BY (foo, bar); -- group by compositeKey(foo, bar)
GROUP BY foo, (bar, baz) -- group by permutationsOf(foo, compositeKey(bar, baz))
GROUP BY CUBE(foo, bar) -- group by ???(foo, bar)
CUBE ( a, b, c )
is equivalent to
GROUPING SETS (
( a, b, c ),
( a, b ),
( a, c ),
( a ),
( b, c ),
( b ),
( c ),
( )
)
a, b, GROUPING SETS (c, d)
is equivalent to
GROUPING SETS (
( a, b, c ),
( a, b, d )
)
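The CUBE and ROLLUP expansions can be expressed as small set computations; a sketch, with output order matching the PostgreSQL expansions above (note that ROLLUP also includes the empty grand-total set):

```javascript
function cube(columns) {
	// Every subset of the columns, including the empty grouping set
	return columns.reduceRight(
		(sets, column) => [ ...sets.map((set) => [ column, ...set ]), ...sets ],
		[ [] ]
	);
}

function rollup(columns) {
	// (a, b, c), (a, b), (a), ()
	return columns
		.map((_, index) => columns.slice(0, columns.length - index))
		.concat([ [] ]);
}

cube([ "a", "b", "c" ]).length; // -> 8 grouping sets, as in the expansion above
rollup([ "a", "b", "c" ]);
// -> [ [ "a", "b", "c" ], [ "a", "b" ], [ "a" ], [] ]
```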
Limitations when grouping:
- Column selection list may only contain columns that are:
a) guaranteed to be identical across all collapsed rows (ie. they are specified as a GROUP BY constraint)
b) the results of aggregate functions over all of the collapsed rows
- WHERE may only contain column-references that apply directly to the to-be-collapsed rows, not wrapped in aggregate functions
- HAVING may only contain column-references that are:
a) wrapped in an aggregate function
b) specified in the column selection as guaranteed-identical columns
note: aliases to results of aggregate functions are *not* permitted here, for some reason
Aggregate functions:
- Specifying an aggregate function in the column selection list, means an implicit `GROUP BY` where there is only a single group consisting of all rows!
- This means that aggregate functions cannot be combined with custom column selection, *even if* no grouping was specified.
- The same also happens when specifying a HAVING clause, though that in and of itself usually means there's going to be an aggregate function somewhere.
Joins:
- Need to consistently use ON (...) syntax for JOIN conditions, not WHERE, because WHERE does not work for outer joins.
# Mental model notes
- A JOIN is just a virtual table that is scoped to the currently-being-processed row. This also applies when `define`ing a has/belongsTo relation!
- A `compute` with an aggregate function on a local(!) column directly in a SELECT is equivalent to `collapseBy(value(true), [ compute(...) ])`, and means the whole query switches to collapsed mode. But this is not applicable for aggregate functions on referenced (foreign) tables!
- A `collapseBy` + aggregate function essentially means uniqueValues(groupingColumn).map((value, rows) => [ value, aggregateFunction(rows) ])
- GROUPING SETS is effectively a UNION for the to-be-analyzed groups
- This means that when the user specifies multiple `collapseBy` clauses, we should produce a GROUPING SETS that contains each of them. If only a single one, we omit it.
- `GROUP BY (color, size)` is equivalent to `GROUP BY color, size`
# Base types
- Scalar values
- Lists (arrays of same-typed scalar values)
- Resultsets/Tables (grids of rows and schemaful columns)
# Query construction strategies
Observations:
- Whenever we are JOINing, we need to prefix column names with table names. We should probably have an optimizer that auto-converts local columnNames into foreignColumnNames in JOIN-y contexts, based on that context's local table name. We can't just keep them local, because PostgreSQLs semantics are different and not scope-dependent, so we have to be explicit about this.
- This requires a general-purpose context abstraction in the AST walker for tracking scope, eg. a `setContext` method passed to the visitor, the argument to which is then used to generate a new context for its walked children (but not used outside of that). Contexts can be shallow-merged, which allows for overriding.
- Since the same table can be JOINed with more than once, we might need to generate incrementing numbers for the different cases, if no explicit names are specified. This is especially true for cases where relations are specified as inline references in eg. `compute` clauses.
- When computing aggregate functions over columns of relations, that constitutes an implicit GROUP BY own_primary_key.
- In GROUP BY clauses, it seems that parenthesized expression lists are only semantically meaningful inside of a CUBE/ROLLUP/GROUPING SETS, as in the top-level GROUP BY clause, the result is the same with or without parentheses.
- Where it *is* semantically meaningful, it denotes "produce a group for every permutation of the values in these columns"
- A single item in GROUPING SETS is equivalent to not using GROUPING SETS at all, similar to how it would work with a UNION.
- GROUP BY CUBE (a, b) is equivalent to GROUP BY (a, b) except when using CUBE, the 'pretend this field does not take part' condition is also considered for each field
- We'll ignore CUBE for now, and just implement (a, b) and ROLLUP (a, b)
<query> = select, combine, concatenate
<relation> = has, belongsTo, through?
## collapseBy(field, [ <clauses> ])
## concatenate([ a, b, c ])
a UNION b UNION c, somehow
## combine([ a, b, c ])
a JOIN b JOIN c, somehow, taking into account column selection
## define({ name: <query> })
WITH <query> AS name
## define({ name: <relation> })
base JOIN <relation>; account for WHERE clauses on the <relation> specified in the top-level `where`, those may also be specified on the relation itself directly
normalize that in an optimizer!
does that also apply to withRelations?
## inline(<query>)
parenthesized subquery; possibly WITH if needed to make it referenceable elsewhere? need to think about this
## compute
Used to indicate derived values, eg.
a) sql(...) as column
b) database functions
c) JS postprocessing functions (not valid in collapseBy?)
GROUP BY color, size, ROLLUP (country_id, store_id)
for each country_id
for (each store_id + null)
for each color
for each size
NTILE(3) OVER ( ORDER BY amount )
order by `amount`, then divide into 3 equally-sized buckets, and for each record return its bucket ID
NTILE(3) OVER ( PARTITION BY YEAR ORDER BY amount )
same as above, but produce a record for each distinct year (like a GROUP BY)
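A sketch of NTILE's bucketing behaviour (without PARTITION BY), mirroring PostgreSQL's rule that earlier buckets absorb the remainder rows:

```javascript
// Order by the given column, divide into `bucketCount` roughly equal
// buckets, and tag each row with its 1-based bucket number.
function ntile(rows, orderColumn, bucketCount) {
	const sorted = [ ...rows ].sort((a, b) => a[orderColumn] - b[orderColumn]);
	const baseSize = Math.floor(sorted.length / bucketCount);
	const remainder = sorted.length % bucketCount;
	const result = [];
	let index = 0;
	for (let bucket = 1; bucket <= bucketCount; bucket++) {
		// The first `remainder` buckets get one extra row each
		const size = baseSize + ((bucket <= remainder) ? 1 : 0);
		for (let i = 0; i < size; i++) {
			result.push({ ...sorted[index++], bucket });
		}
	}
	return result;
}

ntile([ { amount: 10 }, { amount: 30 }, { amount: 20 }, { amount: 40 } ], "amount", 3);
// -> amounts 10, 20 land in bucket 1; 30 in bucket 2; 40 in bucket 3
```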
SELECT foo, bar, <foo> OVER ( <bar> )
Windowing function
like GROUP BY, but the rows are not actually collapsed in the output result; they are just collapsed behind the scenes, so that aggregate functions can be used over their virtual groups
SELECT employee_id, avg(salary) OVER (PARTITION BY department_id) AS average_department_salary FROM salary;
SELECT department_id, avg(salary) FROM salary GROUP BY department_id;
---------
# Relation operations
through([ item, item, ... ])
item: columnName [belongsTo] | remoteColumnName [has] | belongsTo | has
has(remoteColumnName, [ operation, operation, ... ])
operation: any operation that is valid in a SELECT?
belongsTo(columnName)
linkTo(remoteColumnName)