Connection configuration#
For general information regarding connections, see Create a connection.
When adding a new connection to Spark, the following parameters apply:
Parameter | Description | Default value
---|---|---
Name | A friendly name for the connection, e.g. Local Spark 3.0 |
Host:Port | A comma-separated list of server:port pairs to connect to. For a local instance, use localhost:port. | localhost:8400
Protocol | The protocol that the driver uses to exchange data with Querona. Possible values: Thrift, http | Thrift
Spark dialect | The dialect (protocol) version used to communicate with the driver. Possible values: 2.0, 3.0 |
Driver API key | The security key used to authenticate requests incoming from Querona. Using the default value is strongly discouraged. | querona-key
Username | The account used to authenticate reverse connections from the driver |
Password | The password for the account above |
Default database | The database used for storing data for cached views |
Reverse connection host address | The address under which Querona can be reached by the driver. For a local instance, use localhost |
Node connection timeout | The timeout used when waiting for each cluster node to respond | 10s
Cluster initialization timeout | The timeout given for the instance to initialize during the initial request | 30s
Frame size | The frame size used when retrieving query results from Spark | 16777216 bytes
Target partition size | The target size of partitions | 128000000 bytes
Max partitions for auto repartitioning | The upper limit for the number of partitions to be created | 256
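For illustration, a connection to a local Spark 3.0 instance might use Host:Port localhost:8400, Protocol Thrift, Spark dialect 3.0, Reverse connection host address localhost, and a freshly generated Driver API key in place of the default querona-key; the timeout and size-related parameters can usually be left at their defaults.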
Note
When JOIN queries are pushed down to execute on Spark, the data from the original database has to be copied to Spark. Querona registers external data sources as JDBC data frames, optionally applying the repartition(n) function. If a data frame is too large, it is broken into smaller chunks whose size is defined by Target partition size.
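For example, a source of roughly 6.4 GB with the default Target partition size of 128,000,000 bytes would be repartitioned into about 50 partitions, well below the default Max partitions for auto repartitioning limit of 256.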
Scheduler pools#
To learn about Spark scheduler pools in general, please consult Spark Job Scheduling.
Scheduler pools describe the relative share of the cluster’s resources that will be assigned to process a given task. Querona’s Spark connections support the following scheduler pools:
default - the scheduler pool used when no pool was selected explicitly
ETL - the scheduler pool used by materialization jobs; it executes jobs in FIFO order
S, M, L, XL - four pools that users can specify for their queries, with ascending allocation of resources
To specify the pool to be used for a given query, use the OPTION clause. For example:
SELECT * FROM Table OPTION(POOL 'XL')
or
INSERT INTO Table(Column) SELECT 1 OPTION(POOL 'S')
The exact configuration values for each scheduler pool can be reviewed and configured on the Spark connection screen.
The pool properties are as follows:
Parameter | Description
---|---
Mode | FIFO means tasks are queued one after the other. FAIR means that tasks can run in parallel and share resources.
Weight | Decides how much of the cluster’s resources this pool gets relative to other pools.
Min share | The share of the cluster’s resources assigned by the FAIR scheduler (if possible) before the rest of the resources are allocated.
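As an illustration of how these interact: if pool L has Weight 2 and pool M has Weight 1, and both have pending tasks, L receives roughly twice M’s share of whatever capacity remains after each pool’s Min share has been satisfied.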
You can review the pools used by Spark for each query in Spark’s admin portal; just go to the ‘Stages’ tab.
Spark variables#
Any property prefixed with spark. can be defined in this section and will be used when creating a new Spark context. Notable parameters are:
Parameter | Description | Default value
---|---|---
spark.ui.port | The Spark UI port | 4041
spark.local.dir | The local working directory of Spark | C:\ProgramData\Querona\tmp
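As a sketch, the section might contain entries such as the ones below. The first two are the defaults from the table above; spark.executor.memory and spark.sql.shuffle.partitions are standard Spark properties shown purely for illustration, and the values 4g and 64 are not Querona defaults:
spark.ui.port = 4041
spark.local.dir = C:\ProgramData\Querona\tmp
spark.executor.memory = 4g
spark.sql.shuffle.partitions = 64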
Hive configuration#
Internally, Spark uses part of the Hive functionality. The important property is the warehouse directory, which is a file system folder used as physical data storage.
Depending on the version of Spark used, this parameter is set by either hive.metastore.warehouse.dir or spark.sql.warehouse.dir. These values are set in the config files but can also be passed at the connection level.
We highly recommend always consulting the Spark documentation to make sure the right values are set. Excerpt from the Apache Spark v2.2 documentation:
Note
Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of the database in a warehouse. You may need to grant write privilege to the user who starts the Spark application.
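For example, a minimal sketch assuming Spark 2.x and the Spark variables mechanism described above (the path is illustrative, not a Querona default):
spark.sql.warehouse.dir = C:\ProgramData\Querona\warehouse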
For the managed Spark instance, the following Hive parameters may need to be configured:
Parameter | Description | Default value
---|---|---
javax.jdo.option.ConnectionURL | JDBC connection string to the Hive metastore | jdbc:sqlserver://localhost;databaseName=SparkMetastore;
javax.jdo.option.ConnectionUserName | JDBC login to the Hive metastore |
javax.jdo.option.ConnectionPassword | JDBC password to the Hive metastore |
hive.exec.scratchdir | Hive tmp folder | C:/ProgramData/Querona/tmp/hive
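For illustration only (the host name and login below are hypothetical), a metastore hosted on a remote SQL Server could be configured as:
javax.jdo.option.ConnectionURL = jdbc:sqlserver://metastore-host;databaseName=SparkMetastore;
javax.jdo.option.ConnectionUserName = spark_metastore
javax.jdo.option.ConnectionPassword = (the account's password)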
Connection test and Spark validation#
Once configured, the Test Connection button can be used to verify whether the instance is responsive. Note that this only tests whether the given driver instance responds to incoming connections. To fully test the configuration, the following steps are one way to trigger and check Spark initialization:
1. Navigate to … and create a new database named “spark-test-db” linked to the newly created connection.
2. On the newly created database, add a new view of type Custom SQL named “test_view”.
3. On the next screen, type select 1 as col; and click Parse.
4. Finish creating the view.
5. Click the view on the list, go to Caching and Partitioning, and click Edit.
6. Set Caching mode to Persistent caching - one table and click Save.
7. Go to Management tasks and run the Rebuild persistent cache task.
8. Go to the query editor and run the following query on the Spark database:
select [col]
from [spark-test-db].[dbo].[test_view]
union all
select [col]
from [spark-test-db].[dbo].[test_view]
The result should be two rows, each containing the value 1 in the col column.
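Optionally, the same view can be queried with one of the scheduler pools described earlier, for example:
select [col]
from [spark-test-db].[dbo].[test_view]
OPTION(POOL 'S')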
These actions should force Spark to initialize. To verify it, change the URL in the browser from e.g. http://querona-example:9000 to http://querona-example:4040, where 4040 is Spark’s default GUI port; if the spark.ui.port variable was set to a different value (the connection default listed above is 4041), use that port instead.
The Spark GUI should open and you should see the executed query.
You can click through the tabs to see the individual jobs and parameters, which might be useful for future debugging and optimization.