Spark default database migration guide#
The default Apache Spark database storage location configured by Querona is the %ProgramData%\Querona\warehouse folder.
This is acceptable for development and test environments, but for production we recommend changing the default location.
This guide describes the actions required to migrate the Spark default database to a different location.
Introduction and Prerequisites#
By default, the built-in Spark keeps all of its metadata in a Microsoft SQL Server database called SparkMetastore, located on the database server designated during Querona installation.
To migrate, we need to update three tables in the Spark metastore database: DBS, SDS, and FUNC_RU, as they hold the locations of databases, tables, and functions, respectively.
Since Querona does not yet support custom Spark functions, we will update only the SDS and DBS tables.
The user or users performing the migration steps need to collectively have:
administrative access to Querona,
administrative access to Spark metadata database,
administrative access to the folder holding the actual Spark database files,
administrative access to the destination folder.
Basic knowledge of MS SQL Server and SQL language is required.
Migration procedure#
Step 1, shut down the Spark instance#
Connect to Querona’s administrative portal with administrative privileges and shut down the Spark instance.
Danger
It is vital to stop the Spark instance before attempting any metadata operations.
Step 2, prepare a backup of the Spark metastore#
Before updating any data, prepare a backup of the Spark metastore database.
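A minimal T-SQL sketch of such a backup is shown below; the target path D:\backup\SparkMetastore.bak is an example assumption and should be adjusted to your environment:
-- example backup; COPY_ONLY avoids disturbing any existing backup chain
BACKUP DATABASE [SparkMetastore]
TO DISK = N'D:\backup\SparkMetastore.bak'
WITH COPY_ONLY, INIT, CHECKSUM;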
Step 3, create and configure a destination folder for database files#
Create a folder that will hold the database files of the migrated database using your operating system’s tools. In this article we assume that our destination folder is d:\warehouse.
Note
It is recommended to leave the source files intact and delete them only at the end of the procedure, after you have verified that the migrated database is readable.
The newly created folder does not have the security Access Control Lists (ACLs) set on it that Spark requires to run.
To set the necessary ACLs, go to %ProgramData%\Querona\lib\tools\.
Make a copy of the file FixHiveACL.bat, name it FixHiveACL-new.bat, and edit the copy.
Change the following lines inside FixHiveACL-new.bat:
..\hadoop\hadoop-2.6.0\bin\winutils.exe chmod 777 ../../tmp/hive
..\hadoop\hadoop-2.6.0\bin\winutils.exe chmod 777 ../../warehouse
using our example destination path, into:
..\hadoop\hadoop-2.6.0\bin\winutils.exe chmod 777 d:/warehouse/tmp/hive
..\hadoop\hadoop-2.6.0\bin\winutils.exe chmod 777 d:/warehouse
and save the file.
Open an elevated command prompt, change the current directory to %ProgramData%\Querona\lib\tools\, and execute FixHiveACL-new.bat.
The script should execute without errors. If an “access denied” error occurs, you most probably did not open the command prompt with administrative privileges.
Step 4, copy files to the destination folder#
Copy the contents of %ProgramData%\Querona\warehouse into the destination folder (e.g. d:\warehouse).
Step 5, connect to database server holding Spark metastore#
Connect to the database server that holds the Spark metastore database using, for example, the free SQL Server Management Studio (SSMS). Any SQL Server client able to interactively issue queries to SQL Server will suffice.
Note
You need at least the right to modify data in the SDS and DBS tables.
Ensure that a recent backup of the SparkMetastore database is available.
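If backups for this server are taken through SQL Server itself, one way to confirm that a recent backup of SparkMetastore exists is to query the msdb backup history; this is an optional sketch, not a required step:
-- most recent backup recorded for SparkMetastore on this server
select top (1) backup_finish_date, type
from msdb.dbo.backupset
where database_name = N'SparkMetastore'
order by backup_finish_date desc;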
Step 6, modify file locations in the metastore#
Using the SQL client (SSMS), edit and run the following SQL scripts.
List the contents of the DBS table and verify the location of the database files:
SELECT [db_id]
, [desc]
, [db_location_uri]
, [name]
, [owner_name]
, [owner_type]
FROM [SparkMetastore].[dbo].[DBS]
The results of the query should look like the output shown below:
DB_ID | DESC | DB_LOCATION_URI | NAME | OWNER_NAME | OWNER_TYPE
---|---|---|---|---|---
1 | Default Hive database | file:/c:/ProgramData/Querona/warehouse | default | public | ROLE
Edit the following script, changing the value of the @destinationFolderPathPart variable to the destination folder path (do not add a trailing slash ‘/’):
declare @destinationFolderPathPart varchar(240);
declare @sourceFolderPathPart varchar(240);
select @sourceFolderPathPart = 'file:/c:/ProgramData/Querona/warehouse';
select @destinationFolderPathPart = 'file:/c:/ProgramData/Querona/warehouse'; -- <<< change the destination folder path here
declare @qualifyingDbsEntriesBeforeUpdate int
declare @dbsEntriesAfterUpdate int
-- get information about the count of database entries to modify
select @qualifyingDbsEntriesBeforeUpdate = count(*)
from [SparkMetastore].[dbo].[DBS]
where DB_LOCATION_URI like @sourceFolderPathPart + '%'
;
update [SparkMetastore].[dbo].[DBS]
set DB_LOCATION_URI = replace(DB_LOCATION_URI, @sourceFolderPathPart, @destinationFolderPathPart)
where DB_LOCATION_URI like @sourceFolderPathPart + '%'
;
select @dbsEntriesAfterUpdate = count(*)
from [SparkMetastore].[dbo].[DBS]
where DB_LOCATION_URI like @sourceFolderPathPart + '%'
;
-- SDS
declare @qualifyingSdsEntriesBeforeUpdate int
declare @sdsEntriesAfterUpdate int
select @qualifyingSdsEntriesBeforeUpdate = count(*)
from [SparkMetastore].[dbo].[SDS]
where [LOCATION] like @sourceFolderPathPart + '%'
;
update [SparkMetastore].[dbo].[SDS]
set [LOCATION] = replace([LOCATION], @sourceFolderPathPart, @destinationFolderPathPart)
where [LOCATION] like @sourceFolderPathPart + '%'
;
select @sdsEntriesAfterUpdate = count(*)
from [SparkMetastore].[dbo].[SDS]
where [LOCATION] like @sourceFolderPathPart + '%'
;
set nocount on;
-- show execution statistics report
select 'DBS' as TableName
, @qualifyingDbsEntriesBeforeUpdate as QualifyingEntries
, @dbsEntriesAfterUpdate as EntriesAfterUpdate
union all
select 'SDS'
, @qualifyingSdsEntriesBeforeUpdate
, @sdsEntriesAfterUpdate
;
set nocount off;
The execution result should be similar to this:
(1 row affected)
(4632 rows affected)
TableName QualifyingEntries EntriesAfterUpdate
--------- ----------------- ------------------
DBS 1 0
SDS 4632 0
The EntriesAfterUpdate column shows how many rows still satisfy the update condition AFTER the update; it should be zero.
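As an optional sanity check, the same conditions can be re-run as a standalone query; this sketch hard-codes the default source prefix used in the script above, so substitute your own value if it differs:
-- both counts should be zero after a successful migration
select
  (select count(*) from [SparkMetastore].[dbo].[DBS]
   where DB_LOCATION_URI like 'file:/c:/ProgramData/Querona/warehouse%') as RemainingDbsRows
, (select count(*) from [SparkMetastore].[dbo].[SDS]
   where [LOCATION] like 'file:/c:/ProgramData/Querona/warehouse%') as RemainingSdsRows
;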
Step 7, start Spark and verify that data is readable#
Use the Querona administrative portal to start Spark. Verify that data from the migrated database is available, for example by executing queries in Querona that read data from Spark.
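A hypothetical smoke test is sketched below; my_spark_db and my_table are placeholder names and must be replaced with a database and table that actually exist in your migrated warehouse:
-- placeholder names: substitute a real database and table
select count(*) from my_spark_db.my_table;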
Step 8, plan to delete the source folder if not needed#
After successful verification of the migration, the source files may be deleted if no longer needed.
That concludes the Spark database migration procedure.