OPENROWSET#
Reads data that is not a registered table and exposes it as a table in the FROM clause — either by
querying a source through a provider, or by bulk-reading external files.
Provider form#
OPENROWSET ( 'provider', 'connection_string', 'query' )
Runs query against the source identified by provider and connection_string.
SELECT * FROM OPENROWSET('csv', 'provider_string', 'SELECT * FROM data');
Bulk form#
OPENROWSET ( BULK 'data_file' [ , <option> [ , ...n ] ] ) [ WITH ( <schema> ) ] AS alias
Reads one or more Excel workbooks (.xlsx / .xls / .xlsb) from a registered Excel connection as a
rowset; a correlation name (AS alias) is required. DATA_SOURCE names the connection (which locates the
files) and WITH (...) declares the result columns. Read several workbooks by listing them
(BULK ('a.xlsx', 'b.xlsx')) or with a glob. The worked walkthrough — worksheet and range selection,
header-name binding, column ordinals, and globbing — is in Reading Excel workbooks (BULK) below.
The same bulk form reads PDF documents from a registered File connection when you add FORMAT = 'PDF'. A
PDF read returns the built-in PdfDocuments columns — including the extracted Text and the detected
Tables (JSON) — selected by name, with no WITH clause. See Reading PDF documents (BULK) below.
Supported options#
Options follow the BULK clause as NAME = value:
Option |
Meaning and behavior |
|---|---|
|
Names the registered connection that locates the file(s). File paths are resolved relative to the connection’s directory, and a path that escapes it is rejected. |
|
Selects the file reader for a File connection — currently |
|
Carries reader options as a TOML |
|
The 1-based first and last data rows of the read window, applied when the |
|
|
The WITH clause also accepts an explicit 1-based source-column ordinal after a column’s type, pinning a
column to a specific sheet column (see Bind columns by header name).
Reading Excel workbooks (BULK)#
The bulk form reads .xlsx, .xls and .xlsb workbooks from a registered Excel connection. Point
DATA_SOURCE at the connection and declare the result columns with WITH (...):
SELECT id, amount, qty
FROM OPENROWSET(BULK 'sales.xlsx', DATA_SOURCE = 'excel_sales')
WITH (id bigint, amount decimal(28,2), qty int) AS data;
The WITH columns map positionally to sheet columns A, B, C, … in declaration order, and each cell is
coerced to the declared type. The read starts at the top of the sheet and treats the first row as data; skip a
header row with FIRSTROW = 2, or bind to it by name with HEADER_ROW (see below).
A file reference is resolved under the connection’s directory, and a reference that escapes it is rejected, so a query cannot read arbitrary files on the server. Address a workbook by a path relative to that directory; an absolute path is accepted only when it already points inside it.
Worksheet and range#
Reader options specific to Excel — the worksheet and range, and the multi-file error mode — are passed as a
small TOML options table inside the standard FORMATFILE_DATA_SOURCE parameter; packing them into a
Microsoft-recognised parameter keeps the statement parseable by SQL Server tools. sheet selects the
worksheet and an optional A1 cell range; the standard FIRSTROW and LASTROW bound the rows when no range
is given (an A1 range wins if both are supplied):
-- worksheet 'Q1', restricted to the range B2:D100
SELECT region, q1, q2
FROM OPENROWSET(BULK 'sales.xlsx', DATA_SOURCE = 'excel_sales',
FORMATFILE_DATA_SOURCE = 'options = { sheet = "Q1!B2:D100" }')
WITH (region nvarchar(50), q1 float, q2 float) AS data;
Bind columns by header name#
With HEADER_ROW = TRUE the first row of the read window is treated as a header, and each WITH column
binds to the sheet column whose header matches its name instead of by position — so you can omit or reorder
columns:
-- sheet headers: id | name | amount; read only amount and id, by name
SELECT amount, id
FROM OPENROWSET(BULK 'sales.xlsx', DATA_SOURCE = 'excel_sales', HEADER_ROW = TRUE)
WITH (amount decimal(28,2), id bigint) AS data;
Header matching is case-insensitive, and the header row is excluded from the data. An unmatched column name, or
a duplicate header, raises a clear error. To pin a specific sheet column regardless of the header, give the
column an explicit 1-based sheet ordinal after its type — for example
WITH (id bigint 1, note nvarchar(200) 3) reads sheet column 1 into id and sheet column 3 into note.
Read several workbooks (globbing)#
List workbooks explicitly, or match them with a glob pattern relative to the connection’s directory:
-- an explicit list
SELECT id FROM OPENROWSET(BULK ('jan.xlsx', 'feb.xlsx'), DATA_SOURCE = 'excel_sales')
WITH (id bigint) AS data;
-- a glob: every .xlsx in the connection directory (top level)
SELECT id FROM OPENROWSET(BULK '*.xlsx', DATA_SOURCE = 'excel_sales')
WITH (id bigint) AS data;
* matches within a single path segment (the top level only); ** spans subdirectories, so
2024/**/*.xlsx matches every workbook under 2024/ at any depth. The matched workbooks are read and
concatenated. Add on_error to the FORMATFILE_DATA_SOURCE options payload to control how an unreadable
workbook is handled — on_error = "skip" skips it and continues with the rest, while the default
on_error = "fail" raises an error instead:
-- skip workbooks that cannot be read, and read the rest
SELECT id FROM OPENROWSET(BULK '2024/**/*.xlsx', DATA_SOURCE = 'excel_sales',
FORMATFILE_DATA_SOURCE = 'options = { on_error = "skip" }')
WITH (id bigint) AS data;
See Microsoft Excel for registering an Excel connection, data-type inference, and password-protected workbooks.
Reading PDF documents (BULK)#
The bulk form also reads .pdf files from a registered File connection. Add FORMAT = 'PDF', point
DATA_SOURCE at the connection, and select the built-in PdfDocuments columns by name — no WITH clause
is needed, and a correlation name (AS alias) is required. File references are resolved under the
connection’s directory, and a reference that escapes it is rejected.
SELECT Path, NumberOfPages, Text
FROM OPENROWSET(BULK 'report.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF') AS docs;
Two columns carry extracted content: Text (the document text, via PdfPig) and Tables (tables detected in the document, as a JSON array). Each is produced only when selected, and the table engine’s Java runtime starts lazily on the first read of Tables. The full PdfDocuments schema is in CSV/TSV, PDF, MSG, Text.
-- metadata for every PDF in the connection directory
SELECT Path, Title, Author, NumberOfPages
FROM OPENROWSET(BULK '*.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF') AS docs;
-- detected tables as JSON
SELECT Path, Tables
FROM OPENROWSET(BULK 'invoice.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF') AS docs;
Extraction options#
Extraction is steered by options passed as a TOML options table in FORMATFILE_DATA_SOURCE. Each option
also has a connection-level default that applies to every read (including the built-in PdfDocuments table); a
per-query value overrides it.
Option |
Applies to |
Values (default in bold) |
|---|---|---|
|
Text |
blocks, words, letters |
|
Text |
default, docstrum, xycut — page-segmentation algorithm for |
|
Text |
none, rendering, unsupervised |
|
Text |
true, false — drop duplicated overlapping glyphs |
|
Text |
points cropped from each page edge (default 0) |
|
Tables |
auto (ruled first, stream fallback), lattice (ruled only), stream (borderless) |
|
Tables |
|
|
Text and Tables |
1-based page range, e.g. |
|
Text and Tables |
password for an encrypted PDF (default: none) |
|
file selection |
fail, skip — how a multi-file read handles an unreadable PDF |
Examples#
-- text as space-separated words instead of reading-order blocks
SELECT Text
FROM OPENROWSET(BULK 'report.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF',
FORMATFILE_DATA_SOURCE = 'options = { text_mode = "words" }') AS docs;
-- only pages 1-3 and 5
SELECT Text
FROM OPENROWSET(BULK 'report.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF',
FORMATFILE_DATA_SOURCE = 'options = { pages = "1-3,5" }') AS docs;
-- an encrypted document
SELECT Text
FROM OPENROWSET(BULK 'secured.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF',
FORMATFILE_DATA_SOURCE = 'options = { password = "s3cret" }') AS docs;
-- ruled tables only (no stream fallback)
SELECT Tables
FROM OPENROWSET(BULK 'grid.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF',
FORMATFILE_DATA_SOURCE = 'options = { engine = "lattice" }') AS docs;
-- borderless tables via stream detection
SELECT Tables
FROM OPENROWSET(BULK 'columns.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF',
FORMATFILE_DATA_SOURCE = 'options = { engine = "stream" }') AS docs;
-- restrict extraction to a rectangle (top, left, bottom, right in PDF points, top-left origin)
SELECT Tables
FROM OPENROWSET(BULK 'invoice.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF',
FORMATFILE_DATA_SOURCE = 'options = { engine = "lattice", area = [72, 40, 760, 560] }') AS docs;
-- combine text and table options in one read
SELECT Path, Text, Tables
FROM OPENROWSET(BULK 'report.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF',
FORMATFILE_DATA_SOURCE = 'options = { pages = "2-4", engine = "lattice", text_mode = "words" }') AS docs;
-- every PDF under 2024/ at any depth, skipping any that cannot be read
SELECT Path, Text
FROM OPENROWSET(BULK '2024/**/*.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF',
FORMATFILE_DATA_SOURCE = 'options = { on_error = "skip" }') AS docs;
See CSV/TSV, PDF, MSG, Text for the File provider, the full PdfDocuments schema, and connection setup.