How does SQL pushback work?
The initial fragments of a flow leading from the data import nodes are the main targets for SQL generation. When a node is encountered that can't be compiled to SQL, the data is extracted from the database and subsequent processing is performed.
During flow preparation and prior to running, the SQL generation process happens as follows:
- The software reorders flows to move downstream nodes into the “SQL zone” where it can be proven safe to do so.
- Working from the import nodes toward the terminal nodes, SQL expressions are constructed incrementally. This phase stops when a node is encountered that can't be converted to SQL or when the terminal node (for example, a Table node or a Graph node) is converted to SQL. At the end of this phase, each node is labeled with an SQL statement if the node and its predecessors have an SQL equivalent.
- Working from the nodes with the most complicated SQL equivalents back toward the import nodes, the SQL is checked for validity. The SQL that was successfully validated is chosen for execution.
- Nodes for which all operations have generated SQL are highlighted with a SQL icon next to the node on the flow canvas. Based on the results, you may want to further reorganize your flow where appropriate to take full advantage of database execution.
Where do improvements occur?
SQL pushback improves performance in a number of data operations:
- Joins (merge by key). Join operations can increase optimization within databases.
- Aggregation. The Aggregate, Distribution, and Web nodes all use aggregation to produce their results. Summarized data uses considerably less bandwidth than the original data.
- Selection. Choosing records based on certain criteria reduces the quantity of records.
- Sorting. Sorting records is a resource-intensive activity that is performed more efficiently in a database.
- Field derivation. New fields are generated more efficiently in a database.
- Field projection. The software extracts only fields that are required for subsequent processing from the database, which minimizes bandwidth and memory requirements. The same is also true for superfluous fields in flat files: although the software must read the superfluous fields, it doesn't allocate any storage for them.
- Scoring. SQL can be generated from decision trees, rulesets, linear regression, and factor-generated models.