Todo:

- Operators
    - BETWEEN
    - LIKE/ILIKE?
    - ISNULL / NOTNULL
    - IS ...
- Condition modifiers
    x allOf
    x anyOf
    x not
- Boolean logic
    x anyOf(...)
    x allOf(...)
- Annotate placeholder/parameter nodes with the expected parameter type in their position, for validation of input parameters at query time
- Rename table -> collection everywhere; also column -> field?
- Update wrapError API
- Figure out a way to semantically deal with the SQL boolean behaviour of propagating NULL (eg. as described on https://www.postgresql.org/docs/8.1/functions-subquery.html for IN)
- Some sort of convertAs method for schema building, when changing the data type of a column - disallow a direct type override, and require it to be wrapped in a changeType method that also takes the conversion logic as an argument?

Docs:

- Emphasize that users should feel free to experiment; the library will tell them (safely) if they are trying to do something invalid. Need to help the user overcome the "but what if I get it wrong" fear (also emphasize the flexibility of schema updates to help with this; "it can always be fixed later" or so)
- Instruct users to report it if raqb generates invalid SQL, as this is considered a bug in raqb regardless of the input (except when it's caused by `unsafeSQL`)
- Explain the base mental model for relations; the relationship between things is always that one thing 'owns' many other things. Therefore, has(...) means all the items in a collection that reference the current item in their (remote) column, whereas belongsTo(...) means an item in another collection that's referenced in one of the current item's local columns (see the sketch after this list)
- Explain that aggregates (collapseBy) are always applied in reference to the table being queried; eg. if you want to collapse user counts by their role_id, you should base-query the users table, not the roles table
- Demonstrate the usefulness of normalization through examples:
    - Deduplication of data (eg. user.friends containing duplicate user records)
    - Recursion is impossible (eg. user -> posts -> replyUsers or something)
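A hypothetical usage sketch of this mental model. Signatures are not final (see the relation operations grammar at the end of these notes), and whether a remote field is written as `posts.author_id` is an open question:

```js
// Sketch only; exact API shape is undecided.
const usersWithRelations = select("users", [
	where({ active: true }),
	has("posts.author_id"), // all posts whose (remote) author_id references this user
	belongsTo("role_id")    // the role referenced by this user's *local* role_id
]);
```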
Considerations:

- Disallow non-lowercase column/table names entirely due to PostgreSQL weirdness? Or just cast them to lowercase?
- For rollback of NOT NULL column deletions in migrations: require each column deletion change to include an initializer function for recreating it afterwards; otherwise it cannot be rolled back
- Before applying or rolling back a migration, all the migrations should be evaluated in-memory, so that the state at any given migration can be determined
- Instead of applying the delta in the migration directly, always generate the "actual delta" from the evaluated state at the previous point vs. that at the new point? This can skip intermediate steps when eg. one migration adds a column that the next migration removes, and should generally result in a much more consistent outcome (similar to deterministic package management)

FAQ:

- Why have you changed all the names of things, like table and column and GROUP BY?
    - One of the primary goals of zapdb is to get more people using relational databases who were previously intimidated by them. Unfortunately, typical RDBMSes are full of jargon that isn't intuitively understandable to many new users. Because of this, zapdb uses more intuitive terms that users can understand without having to memorize them - it's much easier for experienced RDBMS users to adapt to intuitive terms than it is for inexperienced users to learn the jargon.

Terminology:

- Collection: table
- Field: column
- Local
- Remote: foreign column
- Operation: any kind of item in a query; eg. select(...), but also where(...) and collapseBy(...) -- need to emphasize that this does *not* imply immediate execution
- Clause: any kind of operation that modifies the behaviour of its parent operation in some way; eg. a where(...) inside of a select(...) or inside of a `has(...)`
- Predicate: a condition, like moreThan(3)
- Meta-queries?: EXISTS, IN, etc. - queries that operate on other queries

Tricky use cases:

- https://gist.github.com/samsch/83f91a66eaf96a70c909ae80d4adfe3e

----

    foo = ANY (subquery)
    foo = ANY (array)
    foo IN (subquery)
    foo IN (a, b, c)

    alias: SOME -> ANY

----

Recommendations to make:

- Express moreThanOrEqualTo as not(lessThan(...))
- Express lessThanOrEqualTo as not(moreThan(...))

----

Ideas:

- Make placeholders typed, so that we can warn the user when they try to pass in invalid things?
- Safe client API that returns a disposer instead of a client upon client creation?
- For modular/extensible design, allow plugins to supply operations + optimizers, and expect it all to be compilable to the core operations set? Need to find a way to keep same-named operations in different plugins from clashing. Also need dependency-solving to make it possible to specify "my optimizers should run before those of $otherPlugin".

----

Things to check:

- How are NULLs handled in all cases/permutations? Is that handling correct?
- Maybe rename `sqlExpression` to `sqlFragment` or so, to avoid confusion with `expression`? Or is it `expression` that needs to be renamed to something else, like `predicate` or `assertion`?

----

Boolean operations:

    _, not {
        equals, moreThan, lessThan {
            _, anyOf, allOf { ARRAY, PLACEHOLDER }
            LITERAL, PLACEHOLDER
        }
    }

Considerations:

- Allow nested not() modifiers? For composability with things that use not() themselves

    { foo: anyOf([ 30, moreThan(42) ]) }            -> WHERE foo = 30 OR foo > 42
    { foo: anyOf([ 30, 42 ]) }                      -> WHERE foo = ANY(ARRAY[30, 42])
    { foo: anyOf([ moreThan(30), moreThan(42) ]) }  -> WHERE foo > ANY(ARRAY[30, 42])

Group conditions into condition types and, based on the count of each, determine how to represent that particular type; then glue the types together, eg.:

    WHERE foo > ANY(ARRAY[30, 42]) OR foo < 42 OR foo = ANY(ARRAY[30, 42])

Make an $or/$and abstraction to make this gluing easier in multiple places. (A sketch of this grouping step follows below.)
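A minimal sketch of the grouping step; the helper below emits SQL strings purely for illustration, whereas the real implementation would emit AST nodes and placeholders rather than interpolated strings:

```js
// Hypothetical sketch of the condition-type grouping described above.
// The node shapes and names here are illustrative, not final API.
function compileAnyOf(field, conditions) {
	// Bucket conditions by their type (equals, moreThan, lessThan, ...)
	let buckets = new Map();

	for (let condition of conditions) {
		let type = condition.type ?? "equals"; // bare literals mean equality
		if (!buckets.has(type)) buckets.set(type, []);
		buckets.get(type).push(condition.value ?? condition);
	}

	// Represent each bucket as either a single comparison or `op ANY(ARRAY[...])`,
	// based on the count of items of that type...
	let operators = { equals: "=", moreThan: ">", lessThan: "<" };
	let clauses = Array.from(buckets, ([ type, values ]) => {
		return (values.length === 1)
			? `${field} ${operators[type]} ${values[0]}`
			: `${field} ${operators[type]} ANY(ARRAY[${values.join(", ")}])`;
	});

	// ... and then glue the per-type representations together with OR
	return clauses.join(" OR ");
}

// compileAnyOf("foo", [ 30, { type: "moreThan", value: 42 } ])
//   -> "foo = 30 OR foo > 42"
```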
----

MARKER: Relation resolution

- Generate relation queries after executing their base query, because by then we have access to the necessary schema information (as we are in the execution phase)
- Eventually replace this with a mechanism where relation queries can be pre-generated, with special placeholder markers to be filled in later, to:
    a) aid static analysis
    b) allow alternative relation-resolving implementations to queue follow-up queries with (arbitrary?) to-be-filled-in details
- Maybe use a ["foo", placeholder, "bar"] format instead of a single string, when generating the query? That would be a bit hacky...
- Implement various kinds of JOINs with table-prefixed resultset column names, those column names being determined at query execution time
    - Figure out a way to represent this in the pre-generated query, maybe with the same mechanism as the relation query placeholder mechanism above?
- Different API designs for different JOIN types? eg. a `with(otherTableQuery)` for left joins, something similar for right joins, and a `combine(tableAQuery, tableBQuery)` for full joins and inner joins
- Implement through relation queries with the JOIN mechanism
- AND/OR as `all()` and `any()`
- Can also fix the placeholder thing by having a separate kind of nodePlaceholder that gets filled in with a query node, and then only doing the astToQuery operation at the very last moment, after state has been collected from the database and all of those nodes can be filled in
    - In this case, there should probably be a way to provide 'global' values for nodePlaceholders (applicable to the query + any derived ones like relations) and local ones (just the query itself)? Maybe this is already solved by using strings vs. symbols for the placeholder names? That would work the same as with normal placeholders, and therefore be consistent implementation-wise

-----------

Generic AST optimization infrastructure, instead of inline modifications in the stringification phase. The pipeline would then look like:

    Build AST -> Optimize AST -> Compile to SQL

Three AST optimization purposes:

- Normalization/correctness (required)
- Readability (optional)
- Performance (optional)

--------

Renames:

- remote -> foreign
- remoteSide? -> foreignSide
- sql -> unsafeSQL

Transforms should ideally be cached for a query; maybe auto-do this for top-level constructs like SELECT, so that a full query is immediately performantly reusable? How to deal with such queries being specified as sub-expressions, then? Maybe cross-query optimization needs to happen?

FIXME: Document the design for dealing with the potentially-cyclical operations dependencies within the module
FIXME: Linter rule for verifying that all requires inside of the affected folders (operations/ and validators/operations/) correctly pass in operations, to avoid polluting the AST with Validatem validators

Walker design:

- Allow a visitor to place a 'marker' on a given node, that can be referenced back to from another, more deeply-nested visitor (referencing the nearest, a la React context) - this is useful for hoisting things up in the AST. Need to figure out how to combine this with immutable AST walking, though, without pulling the carpet from under the currently-walked subtree.
    - Possibly Babel's walker implementation has a solution to this category of problem?
    - Possibly cursors are a useful concept here? As mutable pointers to otherwise immutable data. This might cause unexpected behaviour though, with different items being walked than currently (newly) exist in the tree, as the walker would still be operating on the old tree -- unless a change higher up were to trigger a re-walk of that entire subtree, which would require all optimizations to be idempotent.
- Make sure to iterate again from whatever new AST subtree was returned from a visitor! So that other visitors get a chance.
- How to deal with a case where, in notExpression(foo), foo returns a notExpression(bar), and the notExpression deduplicator does not run anymore because the top-most notExpression was already previously visited? Maybe re-run the whole sequence of visitors repeatedly until the entire tree has stabilized? This also requires idempotence of visitors! And makes persisting/caching processed query ASTs even more important. (See the sketch after this list.)
- Apply a walking iteration limit to detect trees that will never stabilize
- Change handling strategies:
    - After a change to a parent marker: abandon the current subtree, move back to the new subtree
    - After completion of a walk: re-walk, and check whether the tree has stabilized
    - Allow the walker itself to mutate nodes, but set a "has mutated" flag whenever a visitor returns a new subtree, to know that it has not stabilized yet
- Allow the user to disable certain categories of optimizations (readability, performance), but not critical ones (normalization)
    - For this to be possible, select(...) and such cannot auto-optimize! Because there would be no logical place to specify this configuration
    - A hacky workaround would be a configuration node wrapped around the clauses, though...
    - Probably better to provide an `optimize` wrapper node that goes around the entire query; either the user can specify it themselves with custom optimization configuration, or the query executor will do it by itself automatically, maintaining an internal cache for this in a WeakMap so that the user doesn't have to care about it (also sketched below)
- Optimizers should be provided as a single array of category-tagged items, rather than a separate array for each category; otherwise the correct execution order cannot be ensured between items of different categories that are nevertheless dependent on one of the two having run first.
- Optimizers *must* a) be idempotent and b) have a terminal (NoChange) condition!
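A minimal sketch of the re-walk-until-stable strategy and the WeakMap-based auto-optimization cache described above. The `{ type, children }` node shape and the visitor contract (return a replacement node, or undefined for "no change") are assumptions for illustration, not the actual raqb walker API:

```js
// Sketch only; node shape and visitor contract are assumptions.
function walkOnce(node, visitor, state) {
	let replacement = visitor(node);

	if (replacement != null && replacement !== node) {
		state.hasMutated = true; // a visitor produced a new subtree; not stable yet
		node = replacement;
	}

	if (Array.isArray(node.children)) {
		node = { ...node, children: node.children.map((child) => walkOnce(child, visitor, state)) };
	}

	return node;
}

function optimizeUntilStable(ast, visitors, maxIterations = 100) {
	// The iteration limit detects trees that will never stabilize
	for (let i = 0; i < maxIterations; i += 1) {
		let state = { hasMutated: false };

		for (let visitor of visitors) {
			ast = walkOnce(ast, visitor, state);
		}

		// A full pass without changes means the tree has stabilized; this is why
		// every visitor must be idempotent and have a terminal (NoChange) condition
		if (!state.hasMutated) {
			return ast;
		}
	}

	throw new Error("AST did not stabilize; likely an unstable optimizer");
}

// The executor can auto-optimize and memoize per query object, so that reusing
// the same query skips re-optimization without the user having to care about it
const optimizedQueries = new WeakMap();

function ensureOptimized(query, optimizers) {
	if (!optimizedQueries.has(query)) {
		optimizedQueries.set(query, optimizeUntilStable(query, optimizers));
	}

	return optimizedQueries.get(query);
}
```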
-------

Planned AST optimization phases (loosely ordered):

x Combine multiple `where(...)` into a single `where(allOf(...))` (sketched below)
x Flatten nested notCondition
x Group >1 same-typed conditionTypes into a single condition(_internalArray(...))
x Invert condition modifiers (not, anyOf, allOf, ...) into expression modifiers
x Flatten nested notExpression
x Flatten nested same-typed allOfCondition/anyOfCondition
x Flatten nested same-typed allOfExpression/anyOfExpression
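As an illustration, the first phase above could look roughly like this as a visitor for the walker sketched earlier (node shapes remain hypothetical). Note the terminal condition: a select with fewer than two where clauses is returned unchanged, which is what makes the visitor idempotent:

```js
// Sketch only: collapse multiple where(...) clauses into one where(allOf(...)).
function combineWhereClauses(node) {
	if (node.type !== "select") return undefined; // not applicable; no change

	let whereClauses = node.children.filter((child) => child.type === "where");
	if (whereClauses.length < 2) return undefined; // terminal (NoChange) condition

	let others = node.children.filter((child) => child.type !== "where");
	let combined = {
		type: "where",
		children: [{
			type: "allOfCondition",
			children: whereClauses.flatMap((clause) => clause.children)
		}]
	};

	return { ...node, children: [ ...others, combined ] };
}
```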
-----

Issue tracker template:

-----

MARKER:

- Refactor relation operations to new design
- Implement AST optimization infrastructure (incl. solving the immutable reference problem?)

-----

NOTE: May need https://www.npmjs.com/package/browser-hrtime for process.hrtime.bigint support in browsers! Need to investigate whether this (or something similar) is already being used by bundlers by default, or whether they use a shim without bigint support.

FIXME: Remove all the .type stuff and replace it with typeOf()
FIXME: Document that `DEBUG=raqb:ast:optimize:*` can be used for tracking down unstable optimizers (the symptom is an error telling you so; it should not be a stack overflow anymore)

Pitch: "Easier than MongoDB" - no more if(foo.bar) checks, no more data format mismatches; the database ensures for you that all data looks as expected, and it's easy to change the format if needed.

------

# PostgreSQL design notes

VALUES: virtual, in-memory table

JOIN: combine fields of rows in multiple tables, correlating on some sort of inter-table predicate
    INNER: only rows that exist in all tables, ie. with correlation
    FULL: all rows in any table, even without correlation
    LEFT/RIGHT: FULL for one side, INNER for the other
    CROSS: every row in every table combined with every row in every other table (cartesian product)
    LATERAL: modifier that marks a query as having a data dependency on some other query in the JOIN, eg. comparing against an aggregate function over another table
        ex. https://heap.io/blog/engineering/postgresqls-powerful-new-join-type-lateral
    ON <condition>: specifies the join condition
    USING (<column>): shorthand for `ON a1.<column> = a2.<column>`

UNION: concatenate rows from multiple tables

GROUP BY <column>: produces 1 "collapsed" row for each unique value, with some of the result columns possibly being the results of aggregate functions over all of the "collapsed" input values for that row
    WHERE: controls inputs into the aggregate functions *without* knowledge of aggregate function results (more efficient, as it does not aggregate rows eliminated here)
    HAVING: controls outputs of the aggregate functions and inputs into the results *with* knowledge of aggregate function results - note that this still eliminates individual rows, *not* the "collapsed" rows!
    (should divert aggregate constraints into HAVING, and all others into WHERE - see the sketch below)
    ROLLUP -> hierarchyOf - (a, b, c), (a, b), (a) -- cannot be nested!
    CUBE -> permutationsOf (including an implicit null for each slot) -- cannot be nested!
    comma list -> permutationsOf (excluding the implicit null for each slot)
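A minimal sketch of that WHERE/HAVING diversion rule, assuming a hypothetical AST where aggregate calls are tagged as `aggregateFunction` nodes:

```js
// Route each condition to WHERE or HAVING depending on whether it references
// an aggregate function result anywhere in its subtree. Sketch only.
function splitConditions(conditions) {
	let where = [];
	let having = [];

	for (let condition of conditions) {
		(containsAggregate(condition) ? having : where).push(condition);
	}

	return { where, having };
}

function containsAggregate(node) {
	if (node == null || typeof node !== "object") {
		return false;
	}

	if (node.type === "aggregateFunction") {
		return true;
	}

	// Works for both objects and arrays of child nodes
	return Object.values(node).some(containsAggregate);
}
```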
# Table functions (things that produce tables)

- Note that some table functions might require explicit defining of output columns with `AS`, as it controls function behaviour.

UNNEST: array(s) to rows, analogous to Array#values; one row per max(arrayLengths) array index, one column per input array
WITH ORDINALITY: analogous to Array#entries; adds a column with the index ("ordinality") of each row; the column name defaults to 'ordinality'
ROWS FROM(...): like JOIN, but for table functions, correlated on index ("ordinality"); each individual function call can have its own `AS` clause

# GROUP BY combinator semantics

    GROUP BY foo, bar;        -- group by permutationsOf(foo, bar)
    GROUP BY (foo, bar);      -- group by compositeKey(foo, bar)
    GROUP BY foo, (bar, baz)  -- group by permutationsOf(foo, compositeKey(bar, baz))
    GROUP BY CUBE(foo, bar)   -- group by ???(foo, bar)

CUBE ( a, b, c ) is equivalent to GROUPING SETS ( ( a, b, c ), ( a, b ), ( a, c ), ( a ), ( b, c ), ( b ), ( c ), ( ) )

a, b, GROUPING SETS (c, d) is equivalent to GROUPING SETS ( ( a, b, c ), ( a, b, d ) )

Limitations when grouping:

- The column selection list may only contain columns that are:
    a) guaranteed to be identical across all collapsed rows (ie. they are specified as a GROUP BY constraint)
    b) the results of aggregate functions over all of the collapsed rows
- WHERE may only contain column references that apply directly to the to-be-collapsed rows, not wrapped in aggregate functions
- HAVING may only contain column references that are:
    a) wrapped in an aggregate function
    b) specified in the column selection as guaranteed-identical columns
    Note: aliases to results of aggregate functions are *not* permitted here, for some reason.

Aggregate functions:

- Specifying an aggregate function in the column selection list means an implicit `GROUP BY` where there is only a single group consisting of all rows!
    - This means that aggregate functions cannot be combined with custom column selection, *even if* no grouping was specified.
    - The same also happens when specifying a HAVING clause, though that in and of itself usually means there's going to be an aggregate function somewhere.

Joins:

- Need to consistently use ON (...) syntax for JOIN conditions, not WHERE, because WHERE does not work for outer joins.

# Mental model notes

- A JOIN is just a virtual table that is scoped to the currently-being-processed row. This also applies when `define`ing a has/belongsTo relation!
- A `compute` with an aggregate function on a local(!) column directly in a SELECT is equivalent to `collapsedBy(value(true), [ compute(...) ])`, and means the whole query switches to collapsed mode. But this is not applicable for aggregate functions on referenced (foreign) tables!
- A `collapseBy` + aggregate function essentially means uniqueValues(groupingColumn).map((value, rows) => [ value, aggregateFunction(rows) ]) - see the executable sketch at the end of this section
- GROUPING SETS is effectively a UNION for the to-be-analyzed groups
    - This means that when the user specifies multiple `collapseBy` clauses, we should produce a GROUPING SETS that contains each of them. If there's only a single one, we omit it.
- `GROUP BY (color, size)` is equivalent to `GROUP BY color, size`

# Base types

- Scalar values
- Lists (arrays of same-typed scalar values)
- Resultsets/Tables (grids of rows and schemaful columns)

# Query construction strategies

Observations:

- Whenever we are JOINing, we need to prefix column names with table names. We should probably have an optimizer that auto-converts local columnNames into foreignColumnNames in JOIN-y contexts, based on that context's local table name. We can't just keep them local, because PostgreSQL's semantics are different and not scope-dependent, so we have to be explicit about this.
    - This requires a general-purpose context abstraction in the AST walker for tracking scope, eg. a `setContext` method passed to the visitor, the argument to which is then used to generate a new context for its walked children (but not used outside of that). Contexts can be shallow-merged, which allows for overriding.
- Since the same table can be JOINed with more than once, we might need to generate incrementing numbers for the different cases, if no explicit names are specified. This is especially true for cases where relations are specified as inline references in eg. `compute` clauses.
- When computing aggregate functions over columns of relations, that constitutes an implicit GROUP BY own_primary_key.
- In GROUP BY clauses, it seems that parenthesized expression lists are only semantically meaningful inside of a CUBE/ROLLUP/GROUPING SETS, as in the top-level GROUP BY clause the result is the same with or without parentheses.
    - Where it *is* semantically meaningful, it denotes "produce a group for every permutation of the values in these columns"
- A single item in GROUPING SETS is equivalent to not using GROUPING SETS at all, similar to how it would work with a UNION.
- GROUP BY CUBE (a, b) is equivalent to GROUP BY (a, b), except that when using CUBE, the 'pretend this field does not take part' condition is also considered for each field
- We'll ignore CUBE for now, and just implement (a, b) and ROLLUP (a, b)

= select, combine, concatenate
= has, belongsTo, through?

## collapseBy(field, [ ])

## concatenate([ a, b, c ])

a UNION b UNION c, somehow

## combine([ a, b, c ])

a JOIN b JOIN c, somehow, taking into account column selection

## define({ name: <query> })

WITH <query> AS name

## define({ name: <relation> })

base JOIN <relation>; account for WHERE clauses on the specified <relation> in the top-level `where` - those may also be specified on the relation itself directly. Normalize that in an optimizer! Does that also apply to withRelations?

## inline(<query>)

Parenthesized subquery; possibly WITH, if needed to make it referenceable elsewhere? Need to think about this.

## compute

Used to indicate derived values, eg.
a) sql(...) as column
b) database functions
c) JS postprocessing functions (not valid in collapseBy?)
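The `collapseBy` mental model above, written out as plain JavaScript over in-memory rows (purely for intuition; queries don't actually execute this way):

```js
// uniqueValues(groupingColumn).map((value, rows) => [ value, aggregateFunction(rows) ])
function collapseBy(rows, groupingColumn, aggregateFunction) {
	let groups = new Map();

	for (let row of rows) {
		let key = row[groupingColumn];
		if (!groups.has(key)) {
			groups.set(key, []);
		}
		groups.get(key).push(row);
	}

	return Array.from(groups, ([ value, groupRows ]) => [ value, aggregateFunction(groupRows) ]);
}

// collapseBy(users, "role_id", (rows) => rows.length) is the intuition behind:
//   SELECT role_id, count(*) FROM users GROUP BY role_id
```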
    GROUP BY color, size, ROLLUP (country_id, store_id)

        for each country_id
            for (each store_id + null)
                for each color
                    for each size

    NTILE(3) OVER ( ORDER BY amount )

order by `amount`, then divide into 3 equally-sized buckets, and for each record return its bucket ID

    NTILE(3) OVER ( PARTITION BY YEAR ORDER BY amount )

same as above, but the bucketing is done separately within each distinct year (the partitions act like a GROUP BY behind the scenes)

    SELECT foo, bar, <aggregate> OVER ( <window> )

A windowing function is like GROUP BY, but the rows are not actually collapsed in the output result; they are just collapsed behind the scenes, so that aggregate functions can be used over their virtual groups:

    SELECT employee_id, avg(salary) OVER (PARTITION BY department_id) AS average_department_salary FROM salary;

    SELECT department_id, avg(salary) FROM salary GROUP BY department_id;

---------

# Relation operations

through([ item, item, ... ])
    item: columnName [belongsTo] | remoteColumnName [has] | belongsTo | has

has(remoteColumnName, [ operation, operation, ... ])
    operation: any operation that is valid in a SELECT?

belongsTo(columnName)

linkTo(remoteColumnName)
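A hypothetical usage sketch of this grammar, reading as "a user's groups, via the memberships collection". Whether a remote field is written as `memberships.user_id` is an open design question, not settled API:

```js
// Sketch only; names and signatures are illustrative.
const groupsOfUser = through([
	has("memberships.user_id"), // memberships whose (remote) user_id references this user
	belongsTo("group_id")       // the group referenced by each membership's local group_id
]);
```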