Connection configuration#

For general information regarding connections see Create a connection.

When adding a new connection to Spark, the following parameters apply:

| Parameter | Description | Default value |
|-----------|-------------|---------------|
| Name | A friendly name for the connection, e.g. Local Spark 3.0 | |
| Host:Port | A comma-separated list of server:port pairs to connect to. For a local instance use localhost:port. | localhost:8400 |
| Protocol | The protocol the driver uses to exchange data with Querona. Possible values: Thrift, http | Thrift |
| Spark dialect | The dialect (protocol) version used to communicate with the driver. Possible values: 2.0, 3.0 | |
| Driver API key | The security key used to authenticate requests incoming from Querona. Using the default value is strongly discouraged. | querona-key |
| Username | The account used to authenticate reverse connections from the driver | |
| Password | The password for the account above | |
| Default database | The database used to store data for cached views | |
| Reverse connection host address | The address under which Querona can be reached by the driver. For a local instance use localhost. | |
| Node connection timeout | The time to wait for each cluster node to respond | 10s |
| Cluster initialization timeout | The time allowed for the instance to initialize on the initial request | 30s |
| Frame size | The frame size used when retrieving query results from Spark | 16777216 bytes |
| Target partition size | The target size of each partition | 128000000 bytes |
| Max partitions for auto repartitioning | The upper limit on the number of partitions to be created | 256 |

Note

When JOIN queries are pushed down to execute on Spark, the data from the original database has to be copied to Spark. Querona registers external data sources as JDBC data frames, optionally applying the repartition(n) function. If a data frame is too large, it is broken into smaller chunks whose size is defined by Target partition size.
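Querona does not publish the exact repartitioning formula, but the rule the note describes can be sketched as follows, using the default values from the table above (the function name is illustrative, not part of any Querona API):

```python
import math

def plan_partitions(dataframe_bytes: int,
                    target_partition_bytes: int = 128_000_000,
                    max_partitions: int = 256) -> int:
    """Sketch of the repartitioning rule described in the note:
    break the JDBC data frame into chunks of roughly "Target
    partition size", but never exceed "Max partitions for auto
    repartitioning"."""
    wanted = math.ceil(dataframe_bytes / target_partition_bytes)
    return max(1, min(wanted, max_partitions))

# A 1 GB source table yields 8 partitions of ~128 MB each:
print(plan_partitions(1_000_000_000))  # -> 8
```

With the defaults, any source larger than about 32 GB (256 x 128 MB) hits the partition cap, so individual partitions then grow beyond the target size.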

Scheduler pools#

To learn about Spark scheduler pools in general, please consult Spark Job Scheduling.

Scheduler pools describe the relative share of the cluster's resources assigned to process a given task. Querona's Spark connections support the following scheduler pools:

  • default - the scheduler pool used when no pool was selected explicitly

  • ETL - the scheduler pool used by materialization jobs; executes jobs in FIFO order

  • S, M, L, XL - four pools that users can select for their queries, with ascending resource allocation

To specify the pool for a given query, use the OPTION clause. For example:

SELECT * FROM Table OPTION(POOL 'XL')

or

INSERT INTO TABLE(COLUMN) SELECT 1 OPTION(POOL 'S')

The exact configuration values for each scheduler pool can be reviewed and configured on the Spark connection screen:

Spark scheduler pools

The pool properties are as follows:

| Parameter | Description |
|-----------|-------------|
| Mode | FIFO means tasks are queued one after another. FAIR means tasks can run in parallel and share resources. |
| Weight | How much of the cluster's resources this pool gets relative to other pools. |
| Min share | The share of the cluster's resources the FAIR scheduler assigns (if possible) before the rest of the resources are allocated. |
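The interplay of Min share and Weight can be illustrated with a simplified model of FAIR scheduling between pools (the pool values below are made up for illustration, and the model ignores per-pool task demand):

```python
def fair_allocate(total_slots: int, pools: dict) -> dict:
    """Simplified FAIR scheduling between pools.
    pools maps a pool name to (min_share, weight).
    First satisfy each pool's min share (as far as slots allow),
    then split the remaining slots in proportion to weights."""
    alloc, remaining = {}, total_slots
    for name, (min_share, _weight) in pools.items():
        grant = min(min_share, remaining)
        alloc[name] = grant
        remaining -= grant
    total_weight = sum(w for _, w in pools.values())
    for name, (_min_share, weight) in pools.items():
        alloc[name] += remaining * weight // total_weight
    return alloc

# Hypothetical values for the four user pools: equal min shares,
# weights doubling from S to XL:
pools = {"S": (5, 1), "M": (5, 2), "L": (5, 4), "XL": (5, 8)}
print(fair_allocate(100, pools))
```

Every pool is guaranteed its min share, and the heavier-weighted XL pool receives most of the remainder, matching the "ascending allocation of resources" described above.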

You can review the pools Spark used for each query in Spark's admin portal, under the 'Stages' tab:

Spark stages

Spark variables#

Any property prefixed with spark. can be defined in this section and will be used when creating a new Spark context. The notable parameters are:

| Parameter | Description | Default value |
|-----------|-------------|---------------|
| spark.ui.port | The Spark UI port | 4041 |
| spark.local.dir | The local working directory of Spark | C:\ProgramData\Querona\tmp |
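How this section feeds the Spark context is not documented in detail; a minimal sketch, assuming Querona simply forwards the spark.-prefixed keys when the context is created (the function name is hypothetical):

```python
def spark_context_properties(section: dict) -> dict:
    """Keep only the entries Spark will recognise when a new
    context is created, i.e. keys prefixed with "spark."."""
    return {k: v for k, v in section.items() if k.startswith("spark.")}

props = spark_context_properties({
    "spark.ui.port": "4041",
    "spark.local.dir": r"C:\ProgramData\Querona\tmp",
    "note": "ignored - not a Spark property",
})
print(sorted(props))  # -> ['spark.local.dir', 'spark.ui.port']
```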

Hive configuration#

Internally, Spark uses part of the Hive functionality. The important property is the warehouse directory, a file-system folder used as physical data storage.

Depending on the version of Spark used, this parameter is set by either hive.metastore.warehouse.dir or spark.sql.warehouse.dir. These values are set in the config files but can also be passed at the connection level.

We highly recommend always consulting the Spark documentation to make sure the right values are set. Excerpt from the Apache Spark v2.2 documentation:

Note

Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of the database in a warehouse. You may need to grant write privilege to the user who starts the Spark application.
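The version-dependent choice between the two properties can be captured in a small helper (a sketch for illustration; the function name is not part of any Querona or Spark API):

```python
def warehouse_property(spark_version: str) -> str:
    """Return the property that sets the warehouse directory for
    the given Spark version: hive.metastore.warehouse.dir is
    deprecated since Spark 2.0.0 in favour of
    spark.sql.warehouse.dir."""
    major = int(spark_version.split(".")[0])
    return ("spark.sql.warehouse.dir" if major >= 2
            else "hive.metastore.warehouse.dir")

print(warehouse_property("2.2.0"))  # -> spark.sql.warehouse.dir
print(warehouse_property("1.6.3"))  # -> hive.metastore.warehouse.dir
```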

For the managed Spark instance, the following Hive parameters may need to be configured:

| Parameter | Description | Default value |
|-----------|-------------|---------------|
| javax.jdo.option.ConnectionURL | JDBC connection string to the Hive metastore | jdbc:sqlserver://localhost; databaseName=SparkMetastore; |
| javax.jdo.option.ConnectionUserName | JDBC login to the Hive metastore | |
| javax.jdo.option.ConnectionPassword | JDBC password to the Hive metastore | |
| hive.exec.scratchdir | Hive tmp folder | C:/ProgramData/Querona/tmp/hive |

Connection test and Spark validation#

Once configured, the Test Connection button can be used to verify that the instance is responsive. Note that this only tests whether the given driver instance responds to incoming connections. To fully test the configuration, the following steps are one way to trigger and check Spark initialization:

  • Navigate to DATABASES and create a new database named “spark-test-db” linked to the newly created connection

Create Spark db
  • On the newly created database, add a new view of type Custom SQL named “test_view”:

Create Spark view
  • On the next screen type select 1 as col; and click Parse

Create Spark view
  • Finish creating the view

  • Click the view on the list and go to Caching and Partitioning, click Edit

  • Set Caching mode to Persistent caching - one table and Save

Create Spark view
  • Go to Management tasks and run Rebuild persistent cache task

Create Spark view
  • Go to query editor and run the following query on the Spark database:

select [col]
from [spark-test-db].[dbo].[test_view]
union all
select [col]
from [spark-test-db].[dbo].[test_view]
  • The result should look more or less like this:

Create Spark view

These actions should force Spark to initialize. To verify it, change the URL in the browser from e.g. http://querona-example:9000 to http://querona-example:4040, where 4040 is the Spark GUI port set in the spark.ui.port variable (4040 is the default value).
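The URL rewrite described above keeps the host and swaps only the port; a small helper shows the idea (the function name is illustrative):

```python
from urllib.parse import urlsplit, urlunsplit

def spark_ui_url(querona_url: str, spark_ui_port: int = 4040) -> str:
    """Rewrite a Querona portal URL to point at the Spark UI on
    the same host, replacing only the port."""
    parts = urlsplit(querona_url)
    netloc = f"{parts.hostname}:{spark_ui_port}"
    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))

print(spark_ui_url("http://querona-example:9000"))
# -> http://querona-example:4040
```

If you changed spark.ui.port in the Spark variables section, pass that value instead of the default.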

Spark GUI should open and you should see the query executed:

Local Spark

You can click through the tabs to see the individual jobs and parameters, which may be useful for future debugging and optimization.