Federation

Federation#

Federation is what lets a single SQL query read and join data that lives in different, heterogeneous systems — at query time, without first copying it into one place.

You write one SELECT (or define one view); Querona reaches each source, retrieves the relevant data, and combines it into a single result. Because nothing is copied ahead of time, federated results are real-time and require no up-front ETL.

How it works#

When a federated query runs, Querona first plans it into an execution plan, then executes that plan across the sources and engines involved. Work runs in three places:

At the sources — push-down delegates as much as possible (filters, projections, same-source joins, grouping and aggregation) to each source, so only the data that is actually needed travels back across the network.
In Querona’s in-memory engine, which brings the partial results together and performs the work the sources did not.
On the default federator — a configured engine that handles operations neither a source nor Querona’s in-memory engine can perform. The planner can use it at any point in the plan, more than once if a query needs it, depending on the data types, functions and operators involved and on current system conditions. By default it is the built-in Apache Spark instance (see Managing Apache Spark), which runs that work off the Querona host; Spark is optional, and another engine can take its place.

What gets pushed down#

Push-down is what keeps federation fast: rather than pulling raw tables across the network and processing them centrally, Querona asks each source to do the work it can, close to the data, and brings back only the reduced result. A query over a billion-row table can come back as a handful of aggregated rows, having moved almost nothing.

Within a single source, Querona delegates as much of the query as that source supports:

Filters — WHERE predicates are applied at the source, so rows are discarded before they cross the network, not after.
Projections — only the columns you actually select are read and returned.
Joins between tables that live in the same source.
Grouping and aggregation — GROUP BY and aggregates such as SUM or COUNT return the computed result instead of the underlying rows.
Sorting and row limits — ORDER BY and TOP, where the source can apply them.
Expressions and functions the source understands.

Push-down is capability-driven. Querona keeps a map of what each engine supports — its data types, functions, operators and collation behaviour — and delegates an operation only when the source can evaluate it faithfully. When a source lacks a function, Querona first tries to rewrite it using expressions the source does support; if it still cannot, only that part of the work is held back and run in Querona’s in-memory engine, while everything around it is still pushed down. So a single unsupported function in one corner of a query does not force the whole query to be dragged back and processed centrally — Querona pushes down the most it safely can and does the remainder itself.

In this context a source is one server and database — a single SQL Server database, for example. If several Querona connections point at the same server and database, the planner unifies them into a single source, so a join between tables reached through them still pushes down. Work that combines different sources is what has to be federated.

Joining across sources#

A join whose two sides live in different sources cannot be pushed to either source on its own. Querona retrieves the rows each side needs — already filtered and projected by push-down — and runs the join across them. Where it can, it ships the smaller side to the larger: sending, say, a short list of keys to the system that already holds the big table, so the bulk of the data never moves and the join runs where the data already is. This is what lets a small local table join a huge corporate one without copying the corporate data anywhere (see Querona for analysts and self-service).

Example#

A single statement can join a table in one source with a table in another as if they were one:

SELECT   c.name, SUM(o.total) AS lifetime_value
FROM     crm.dbo.customers AS c       -- e.g. a SQL Server source
JOIN     sales.public.orders AS o     -- e.g. a PostgreSQL source
    ON   o.customer_id = c.id
WHERE    c.region = 'EU'
GROUP BY c.name;

Querona pushes the region filter and each side’s own work down to its system, then joins and aggregates the results — no pipeline, no copy.

When to materialize instead#

On-the-fly federation is ideal for live, ad-hoc access. When a federated query is expensive to run repeatedly — for example, it aggregates large, slowly-changing data — you can materialize the view so queries read from the copy instead of the sources. Materialization complements federation rather than replacing it; Querona can also redirect a query to a pre-computed aggregate automatically with query retargeting.