OPENROWSET

OPENROWSET#

Reads data that is not a registered table and exposes it as a table in the FROM clause — either by querying a source through a provider, or by bulk-reading external files.

Provider form#

OPENROWSET ( 'provider', 'connection_string', 'query' )

Runs query against the source identified by provider and connection_string.

SELECT * FROM OPENROWSET('csv', 'provider_string', 'SELECT * FROM data');

Bulk form#

OPENROWSET ( BULK 'data_file' [ , <option> [ , ...n ] ] ) [ WITH ( <schema> ) ] AS alias

Reads one or more Excel workbooks (.xlsx / .xls / .xlsb) from a registered Excel connection as a rowset; a correlation name (AS alias) is required. DATA_SOURCE names the connection (which locates the files) and WITH (...) declares the result columns. Read several workbooks by listing them (BULK ('a.xlsx', 'b.xlsx')) or with a glob. The worked walkthrough — worksheet and range selection, header-name binding, column ordinals, and globbing — is in Reading Excel workbooks (BULK) below.

The same bulk form reads PDF documents from a registered File connection when you add FORMAT = 'PDF'. A PDF read returns the built-in PdfDocuments columns — including the extracted Text and the detected Tables (JSON) — selected by name, with no WITH clause. See Reading PDF documents (BULK) below.

Supported options#

Options follow the BULK clause as NAME = value:

Option	Meaning and behavior
`DATA_SOURCE`	Names the registered connection that locates the file(s). File paths are resolved relative to the connection’s directory, and a path that escapes it is rejected.
`FORMAT`	Selects the file reader for a File connection — currently `'PDF'`. Omit it for an Excel connection.
`FORMATFILE_DATA_SOURCE`	Carries reader options as a TOML `options` table. The option set depends on the reader — Excel (`sheet`, `on_error`) or PDF (text and table extraction; see Reading PDF documents (BULK)). Packing them into this standard parameter keeps the statement parseable by SQL Server tools.
`FIRSTROW` / `LASTROW`	The 1-based first and last data rows of the read window, applied when the `sheet` expression carries no A1 range.
`HEADER_ROW`	`TRUE` treats the first row of the window as a header and binds `WITH` columns by name (see below).

The WITH clause also accepts an explicit 1-based source-column ordinal after a column’s type, pinning a column to a specific sheet column (see Bind columns by header name).

Reading Excel workbooks (BULK)#

The bulk form reads .xlsx, .xls and .xlsb workbooks from a registered Excel connection. Point DATA_SOURCE at the connection and declare the result columns with WITH (...):

SELECT id, amount, qty
  FROM OPENROWSET(BULK 'sales.xlsx', DATA_SOURCE = 'excel_sales')
       WITH (id bigint, amount decimal(28,2), qty int) AS data;

The WITH columns map positionally to sheet columns A, B, C, … in declaration order, and each cell is coerced to the declared type. The read starts at the top of the sheet and treats the first row as data; skip a header row with FIRSTROW = 2, or bind to it by name with HEADER_ROW (see below).

A file reference is resolved under the connection’s directory, and a reference that escapes it is rejected, so a query cannot read arbitrary files on the server. Address a workbook by a path relative to that directory; an absolute path is accepted only when it already points inside it.

Worksheet and range#

Reader options specific to Excel — the worksheet and range, and the multi-file error mode — are passed as a small TOML options table inside the standard FORMATFILE_DATA_SOURCE parameter; packing them into a Microsoft-recognised parameter keeps the statement parseable by SQL Server tools. sheet selects the worksheet and an optional A1 cell range; the standard FIRSTROW and LASTROW bound the rows when no range is given (an A1 range wins if both are supplied):

-- worksheet 'Q1', restricted to the range B2:D100
SELECT region, q1, q2
  FROM OPENROWSET(BULK 'sales.xlsx', DATA_SOURCE = 'excel_sales',
                  FORMATFILE_DATA_SOURCE = 'options = { sheet = "Q1!B2:D100" }')
       WITH (region nvarchar(50), q1 float, q2 float) AS data;

Bind columns by header name#

With HEADER_ROW = TRUE the first row of the read window is treated as a header, and each WITH column binds to the sheet column whose header matches its name instead of by position — so you can omit or reorder columns:

-- sheet headers: id | name | amount; read only amount and id, by name
SELECT amount, id
  FROM OPENROWSET(BULK 'sales.xlsx', DATA_SOURCE = 'excel_sales', HEADER_ROW = TRUE)
       WITH (amount decimal(28,2), id bigint) AS data;

Header matching is case-insensitive, and the header row is excluded from the data. An unmatched column name, or a duplicate header, raises a clear error. To pin a specific sheet column regardless of the header, give the column an explicit 1-based sheet ordinal after its type — for example WITH (id bigint 1, note nvarchar(200) 3) reads sheet column 1 into id and sheet column 3 into note.

Read several workbooks (globbing)#

List workbooks explicitly, or match them with a glob pattern relative to the connection’s directory:

-- an explicit list
SELECT id FROM OPENROWSET(BULK ('jan.xlsx', 'feb.xlsx'), DATA_SOURCE = 'excel_sales')
               WITH (id bigint) AS data;

-- a glob: every .xlsx in the connection directory (top level)
SELECT id FROM OPENROWSET(BULK '*.xlsx', DATA_SOURCE = 'excel_sales')
               WITH (id bigint) AS data;

* matches within a single path segment (the top level only); ** spans subdirectories, so 2024/**/*.xlsx matches every workbook under 2024/ at any depth. The matched workbooks are read and concatenated. Add on_error to the FORMATFILE_DATA_SOURCE options payload to control how an unreadable workbook is handled — on_error = "skip" skips it and continues with the rest, while the default on_error = "fail" raises an error instead:

-- skip workbooks that cannot be read, and read the rest
SELECT id FROM OPENROWSET(BULK '2024/**/*.xlsx', DATA_SOURCE = 'excel_sales',
                          FORMATFILE_DATA_SOURCE = 'options = { on_error = "skip" }')
               WITH (id bigint) AS data;

See Microsoft Excel for registering an Excel connection, data-type inference, and password-protected workbooks.

Reading PDF documents (BULK)#

The bulk form also reads .pdf files from a registered File connection. Add FORMAT = 'PDF', point DATA_SOURCE at the connection, and select the built-in PdfDocuments columns by name — no WITH clause is needed, and a correlation name (AS alias) is required. File references are resolved under the connection’s directory, and a reference that escapes it is rejected.

SELECT Path, NumberOfPages, Text
  FROM OPENROWSET(BULK 'report.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF') AS docs;

Two columns carry extracted content: Text (the document text, via PdfPig) and Tables (tables detected in the document, as a JSON array). Each is produced only when selected, and the table engine’s Java runtime starts lazily on the first read of Tables. The full PdfDocuments schema is in CSV/TSV, PDF, MSG, Text.

-- metadata for every PDF in the connection directory
SELECT Path, Title, Author, NumberOfPages
  FROM OPENROWSET(BULK '*.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF') AS docs;

-- detected tables as JSON
SELECT Path, Tables
  FROM OPENROWSET(BULK 'invoice.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF') AS docs;

Extraction options#

Extraction is steered by options passed as a TOML options table in FORMATFILE_DATA_SOURCE. Each option also has a connection-level default that applies to every read (including the built-in PdfDocuments table); a per-query value overrides it.

Option	Applies to	Values (default in bold)
`text_mode`	Text	blocks, words, letters
`segmenter`	Text	default, docstrum, xycut — page-segmentation algorithm for `blocks`
`reading_order`	Text	none, rendering, unsupervised
`dedup_letters`	Text	true, false — drop duplicated overlapping glyphs
`margin_left` / `margin_right` / `margin_top` / `margin_bottom`	Text	points cropped from each page edge (default 0)
`engine`	Tables	auto (ruled first, stream fallback), lattice (ruled only), stream (borderless)
`area`	Tables	`[top, left, bottom, right]` in PDF points, top-left origin (default: whole page)
`pages`	Text and Tables	1-based page range, e.g. `"1-3,5"` (default: all pages)
`password`	Text and Tables	password for an encrypted PDF (default: none)
`on_error`	file selection	fail, skip — how a multi-file read handles an unreadable PDF

Examples#

-- text as space-separated words instead of reading-order blocks
SELECT Text
  FROM OPENROWSET(BULK 'report.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF',
                  FORMATFILE_DATA_SOURCE = 'options = { text_mode = "words" }') AS docs;

-- only pages 1-3 and 5
SELECT Text
  FROM OPENROWSET(BULK 'report.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF',
                  FORMATFILE_DATA_SOURCE = 'options = { pages = "1-3,5" }') AS docs;

-- an encrypted document
SELECT Text
  FROM OPENROWSET(BULK 'secured.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF',
                  FORMATFILE_DATA_SOURCE = 'options = { password = "s3cret" }') AS docs;

-- ruled tables only (no stream fallback)
SELECT Tables
  FROM OPENROWSET(BULK 'grid.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF',
                  FORMATFILE_DATA_SOURCE = 'options = { engine = "lattice" }') AS docs;

-- borderless tables via stream detection
SELECT Tables
  FROM OPENROWSET(BULK 'columns.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF',
                  FORMATFILE_DATA_SOURCE = 'options = { engine = "stream" }') AS docs;

-- restrict extraction to a rectangle (top, left, bottom, right in PDF points, top-left origin)
SELECT Tables
  FROM OPENROWSET(BULK 'invoice.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF',
                  FORMATFILE_DATA_SOURCE = 'options = { engine = "lattice", area = [72, 40, 760, 560] }') AS docs;

-- combine text and table options in one read
SELECT Path, Text, Tables
  FROM OPENROWSET(BULK 'report.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF',
                  FORMATFILE_DATA_SOURCE = 'options = { pages = "2-4", engine = "lattice", text_mode = "words" }') AS docs;

-- every PDF under 2024/ at any depth, skipping any that cannot be read
SELECT Path, Text
  FROM OPENROWSET(BULK '2024/**/*.pdf', DATA_SOURCE = 'files', FORMAT = 'PDF',
                  FORMATFILE_DATA_SOURCE = 'options = { on_error = "skip" }') AS docs;

See CSV/TSV, PDF, MSG, Text for the File provider, the full PdfDocuments schema, and connection setup.