External Spark cluster
For larger or shared workloads, Querona can use an existing external Apache Spark cluster, alongside or instead of the built-in local Spark instance. Like the local instance, an external cluster running a supported Spark version (Apache Spark 3.5.x or 4.1.x) acts as a data source and a materialization destination, and can also be used as a federator that executes federated queries.
Driver deployment
Querona communicates with Spark through its Driver - a Java application acting as a proxy that delegates
requests to and from Spark (see Instance configuration). With a local instance Querona starts the Driver
itself; with an external cluster the Driver is instead deployed onto the cluster’s head nodes and submitted to
the cluster’s resource manager (its Spark master, e.g. spark://HOST:PORT or yarn),
so that it attaches to the running cluster rather than starting a new local instance.
Because the Driver runs on each head node, Querona can fail over between them when high availability is enabled (querona.ha.enable), so the cluster stays reachable if one node goes down.
The Driver files are part of the Querona installation; copy them to each head node and configure the Driver
through querona-site.xml. The most important parameters:
| Parameter | Description |
|---|---|
| querona.api.key | Security key; must match the Driver API key set on the Querona connection. Do not leave the default value. |
| querona.driver.port | The port the Driver listens on. Default: 8400. |
| querona.driver.protocol | The connection protocol. Use Thrift. |
| querona.ha.enable | Enables head-node failover (high availability). |
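As a rough illustration, the parameters above could be sanity-checked before the Driver is started. The sketch below assumes that querona-site.xml follows the Hadoop-style `<configuration>`/`<property>` layout suggested by its name. That layout, the sample values (including `true` for querona.ha.enable), and the helper names `read_site_properties` and `check_driver_settings` are all assumptions made for this example, not something taken from the Querona documentation.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: Hadoop-style layout (<configuration><property><name/><value/>).
# Check the querona-site.xml shipped with your installation for the real format.
SAMPLE_SITE_XML = """\
<configuration>
  <property><name>querona.api.key</name><value>replace-with-a-long-random-secret</value></property>
  <property><name>querona.driver.port</name><value>8400</value></property>
  <property><name>querona.driver.protocol</name><value>Thrift</value></property>
  <property><name>querona.ha.enable</name><value>true</value></property>
</configuration>
"""


def read_site_properties(xml_text):
    """Return a {name: value} dict from a Hadoop-style site file (assumed layout)."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def check_driver_settings(props):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    # The API key must be set explicitly and must match the Querona connection.
    if not props.get("querona.api.key"):
        problems.append("querona.api.key is missing or empty")
    # The port is optional (default 8400), but if present it must be a valid TCP port.
    port = props.get("querona.driver.port", "")
    if port and not (port.isdigit() and 0 < int(port) < 65536):
        problems.append(f"querona.driver.port is not a valid TCP port: {port!r}")
    # The documentation says to use Thrift.
    protocol = props.get("querona.driver.protocol", "")
    if protocol and protocol.lower() != "thrift":
        problems.append(f"querona.driver.protocol should be Thrift, got {protocol!r}")
    return problems
```

Running `check_driver_settings(read_site_properties(text))` on each head node's file is one way to catch a missing key or a mistyped port before troubleshooting the connection itself. It cannot tell whether the API key still has its default value, since that value is not stated here.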
Unlike with a local instance, Driver logs are written locally on each head node and are not streamed back to Querona; access them directly on the node.
Note
Communication is required both ways: Querona must be able to reach the Driver on querona.driver.port (default 8400), and the Driver must be able to reach Querona on the Querona TDS port. Place the cluster in a network that allows both directions.
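The reachability requirement in the note can be checked with a plain TCP connection test. The following is a minimal sketch using only Python's standard library; the function names `can_connect` and `check_head_nodes` are hypothetical helpers written for this example and are not part of Querona.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_head_nodes(nodes, timeout=3.0):
    """Map 'host:port' to a reachability flag for each (host, port) pair."""
    return {f"{host}:{port}": can_connect(host, port, timeout) for host, port in nodes}


# Example (hypothetical host names), run from the Querona server:
#   check_head_nodes([("node1", 8400), ("node2", 8400)])
```

Run from the Querona server, this covers the Querona-to-Driver direction. For the opposite direction, the same `can_connect` call can be run on each head node against the Querona address and its TDS port. A successful TCP connection only shows that the port is open; it does not verify the API key or the protocol.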
Connecting
Create a Spark Connection configuration pointing at the cluster’s head nodes:
Host:Port - a comma-separated list of head-node addresses, e.g. node1:8400,node2:8400 (list every head node for failover).
Protocol and Driver API key - must match the values in querona-site.xml.
Spark dialect - match the cluster’s Spark version: 3.0 for Spark 3.5.x, 4.0 for Spark 4.1.x.
Reverse connection host address - an address of Querona reachable from every head node.
If the cluster is slow to start, increase the connection’s Cluster initialization timeout.
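To make the Host:Port format concrete, the sketch below splits a comma-separated value such as node1:8400,node2:8400 into (host, port) pairs and rejects malformed entries. `parse_head_nodes` is a hypothetical helper written for illustration; it does not reflect how Querona itself parses the field, and it assumes every entry carries an explicit port.

```python
def parse_head_nodes(value):
    """Split 'host1:port1,host2:port2' into a list of (host, port) tuples."""
    nodes = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas and surrounding whitespace
        # Split on the last colon so the port is always the final component.
        host, sep, port_text = item.rpartition(":")
        if not sep or not host or not port_text.isdigit():
            raise ValueError(f"expected HOST:PORT, got {item!r}")
        nodes.append((host, int(port_text)))
    if not nodes:
        raise ValueError("no head nodes listed")
    return nodes


# Example with the addresses used in the text:
#   parse_head_nodes("node1:8400,node2:8400")
```

Validating the value before saving the connection catches common slips, such as a missing port or a semicolon used instead of a comma, which could otherwise show up later as a failed failover.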