CSV/TSV, PDF, MSG, Text#

The built-in File provider handles the following file types:

CSV/TSV (.csv, .tsv)
Exchange Email (.msg, .ems, .eml)
Portable Document Format (.pdf)
Unstructured text file (.txt)

Key features#

very large file support (tested on files of several GBs of size)
very wide file support (up to 32 000 columns)
automatic schema detection for schema-less files
custom schema support compatible with Microsoft ODBC schema.ini definition
multisource: local files, http(s) publicly available sources
built-in popular formats for parsing dates, currency, booleans, etc.
support of custom values that get translated into Null, boolean True and False
high throughput

How the File provider works#

File provider scans all designated source folders looking for files. Supported file types with a well-known, immutable schema, eg. email messages or PDFs, are presented as rows in a built-in virtual table that groups together files of the same type. The CSV/TSV files can have different schemas, therefore each of them is exposed as a dedicated virtual table. The following mapping rules govern the metadata import:

Files in a folder designated in connection definition as a root, belong to the dbo schema.
Each direct subfolder of the root folder converts into a schema, and all of its files belong the that schema.
All supported file types having a well-known, immutable schema (eg. pdf), map into a single virtual table, stored in the builtin schema.
CSV/TSV files having a detected or assigned schema, map to a virtual table 1-to-1.
All files found, irrespective of type, map into a AllFiles virtual table.

The following table describes the built-in tables, available from the builtin schema:

Table name	Default file extensions
EmailMessages	.msg, .eml
PdfDocuments	.pdf
TextFiles	.txt
AllFiles	.*

Example scenario#

Let’s consider the following folder structure on your disk:

c:\files
c:\files\customers.csv
c:\files\invoices.tsv
c:\files\email1.msg
c:\files\email2.msg
c:\files\printout1.pdf
c:\files\printout2.pdf
c:\files\readme.txt
c:\files\company\employees.csv
c:\files\company\websites.csv

Let’s assume that we mapped the c:\files folder as the root and we import metadata of all files presented to us during virtual database creation.

The resulting virtual database consists of the following objects:

Source file path	Mapped table name	Mapped schema	Reason
c:\files\customers.csv	customers_csv	dbo	Customers.csv file has a schema and is in the root folder, therefore it maps to a virtual table in the dbo schema.
c:\files\invoices.tsv	invoices_tsv	dbo	Invoices.tsv file has a schema and is in the root folder, therefore it maps to a virtual table in the dbo schema.
c:\files\email1.msg, c:\files\email2.msg	EmailMessages	builtin	Exchange email files have a well-known, immutable schema, so are all mapped as rows in the builtin.EmailMessages virtual table.
c:\files\printout.pdf	PdfDocuments	builtin	PDF documents have a well-known, immutable schema, and are all mapped as rows in the builtin.PdfDocument virtual table.
c:\files\readme.txt	TextFiles	builtin	Text file have no per-file schema, however are all mapped as rows into the builtin.TextFiles virtual table for convenience.
c:\files\company\employees.csv	employees_csv	company	Company is a subdirectory of the directory designated as root, therefore directory maps to a schema. inside it map to respective tables. Employees.csv file has a schema and is inside a subfolder, therefore it maps to a virtual table in the company schema.
c:\files\company\websites.csv	websites_csv	company	Websites.csv file has a schema and is inside a subfolder, therefore it maps to a virtual table in the company schema.
c:\files\.	AllFiles	builtin	All files found in the root and its subfolders map as rows into the AllFiles virtual table in the builtin schema.

Built-in tables#

AllFiles#

Column name	Column type	Description
Name	NVarchar(255) not null	File name with extension
Type	NVarchar(30) not null	File extension with leading dot (.ext) or “directory”.
SizeBytes	bigint not null	File size in bytes
Path	NVarchar(4000) not null	OS file path

EmailMessages#

Column name	Column type	Description
MessageId	VarChar(160) not null	Message Id
Subject	NVarchar(250) not null	Message subject
SubjectNormalized	NVarchar(250) null	Normalized message subject
ReceivedOn	DateTime2 null	Received date
BodyHtml	NVarchar(max) null	Body in HTML format
BodyRtf	NVarchar(max) null	Body in Rich Text Format (RTF) format
BodyText	NVarchar(max) null	Body in Text format
Attachments	Int not null	Attachments count
AttachmentsJson	NVarchar(max) null	JSON text with attachments information
Sender_DisplayName	NVarchar(200) null	Sender display name
Sender_Email	NVarchar(200) null	Sender email address
Recipients_To	NVarchar(max) null	Recipients list
Recipients_CC	NVarchar(max) null	Carbon-copy recipients list
SentOn	DateTime2 null	Send date
CreatedOn	DateTime2 null	Created date
Categories	NVarchar(220) null	List of categories
ConversationIndex	NVarchar(160) null	Conversation index
ConversationTopic	NVarchar(200) null	Conversation topic
LastModifiedOn	DateTime2 null	Last modification date
Headers	VarChar(2000) null	List of headers
Importance	VarChar(30) null	Importance
InternetCodePage	VarChar(30) null	Internet code page
MessageLocalId	VarChar(160) null	Message local Id
Type	VarChar(30) null	File types
Path	NVarchar(4000) null	File path

PdfDocuments#

Column name	Column type	Description
Path	NVarChar(4000) not null	OS path to a file
CreatedOn	DateTime2 not null	Document creation date
LastModifiedOn	DateTime2 not null	Document last modification date
Keywords	NVarChar(1600) null	Document keywords
Subject	NVarChar(300) null	Document subject
Title	NVarChar(300) null	Document title
IsEncrypted	Bit not null	Encryption flag (1=encrypted, 0=not encrypted)
NumberOfPages	Int not null	Number of pages in document
Author	NVarChar(255) null	Document author
Version	Decimal not null	Document PDF specification compatibility version
SizeBytes	BigInt not null	File size in bytes
EmbeddedFiles	Int not null	Document-embedded files count

TextFiles#

Column name	Column type	Description
Name	NVarChar(255) not null	OS file name with extension
SizeBytes	BigInt not null	File size in bytes
Path	NVarChar(4000) not null	OS file path
Encoding	VarChar(50) null	File encoding name
Type	NVarChar(30) null	File extension
TextSample	NVarChar(1000) null	First 1000 charactes of the file content

CSV parser and reader configuration#

Settings divide into two groups: connection and object-level settings. All database objects inherit settings from the connection. Relevant settings can be adjusted on the object level. If a setting is not overriden on the object level, changing it on the connection level propagates to every object that does not override the setting. Changing the setting on the object level stops the inheritance.

Basic settings#

Setting name	Allowed values	Description
File format	Autodetect delimiter*, Comma Delimited, Semicolon delimited, Tab delimited, Custom	File format selection. The default is “Autodetect delimiter”.
Allowed delimiter values	Any	Contains the list of characters or strings that will be used as candidates during separator detection.
Source URI’s	Any URI	Contains the list of URI’s that need to be scanned during the metadata discovery (import).
First row contains column names	True*, False	Indicates that the first row contains column names (header). The default is True. The “Skip first # rows” setting value (see Advanced Settings below) is applied first, so the ‘First Row’ may not be the actual first row in a file.
Always detect character encoding before reading	True*, False	Indicates that the character encoding has to be detected each time a file is opened. Useful when file encoding changes frequently. For best performance, uncheck this option, and set the Encoding to a fixed value.
Encoding	Any encoding from the list	Indicates the character encoding to be used when a file is opened. If set, the automatic detection does not take place.
Culture	Any culture from the list	Indicates the culture to be used when converting string values to desired data types.
Allow quotation marks in values	True*, False	Indicates that the quotation mark in values is allowed.
Treat a line break in quoted text as an error	True, False*	Indicates that a line break in quoted text is allowed.
Empty strings should be returned as NULL	True*, False	Indicates that the empty string is converted to a Null value.
Data error handling	Ignore row, Report Error*	Indicates the strategy to use when bad data is detected during the read. Default is “Show Errors”, that throws an error and stops the read operation. The “Ignore” setting ignores the bad row and continues the read operation.

Advanced settings#

Setting name	Allowed values	Description
Max rows to read for data type probing	0 (no limit) to 2147483647	Indicates the maximum number of lines to be read during schema detection. The higher the percentage of the file read, the lower the assumed error margins.
Allow comments	True*, False	Indicates if comment are allowed.
Comment character	Any character	The character used to denote a line that is commented out. Default is ‘#’. Available only if comments were allowed.
Escape character	Any character	The escape character used to escape a quote inside a field. Default is ‘”’.
Ignore blank lines	True*, False	Indicates if blank lines are ignored when reading. Default is true.
Trimming options	No trimming, Trim the whitespace around a field, Trim the whitespace inside of quotes around a field	The strategy used to trim data.
Skip first # rows	0* to 2147483647	Indicates the number of lines to skip from the top of the file.
Values interpreted as boolean True	Any string value(s)	The list of string values interpreted as boolean True. The value gets converted to True, if the whole value in a column matches it, eg. the value “ab” converts to True, if column value in a row equals “ab”.
Values interpreted as boolean False	Any string value(s)	The list of string values interpreted as boolean False. The value gets converted to False, if the whole value in a column matches it, eg. the value “ab” converts to False, if column value in a row equals “ab”.
Values interpreted as Null (case-insensitive)	Any string value(s)	The list of string values interpreted as Null. The value gets converted to Null, if the whole value in a column matches it, eg. the value “ab” converts to Null, if column value in a row equals “ab”.

Alternatives to the built-in File provider#

If for any reson the built-in implementation of the File provider is insufficient, these are the available alternatives:

Microsoft ODBC (free)
CData (commercial), see below

CSV/TSV sources using the Cdata driver#

Note

Querona supports CData ADO.NET drivers, however these drivers need to be purchased separately.

CData Driver installation#

Follow these steps to install the CData ADO.Net driver for CSV/TSV:

Go to the CData website.
In the ADO.NET Provider Download section, select Download Trial or Download Full, depending on your use-case.
In the list CData Software - Downloads that displays, register or sign-in.
Download the ADO.Net driver and install the driver.

Custom delimiter in CData driver#

Let’s assume that we want to connect to a CSV file that:

has a header,
is delimited by semicolon character “;”

The following connection string can be used:

DataSource=[ServerName]; IncludeFiles="CSV"; FMT="Delimited(;)"; HDR="Yes";

For different delimiter change the ‘FMT=”Delimited(_put_your_delimiter_here_)” property and replace the placeholder _put_your_delimiter_here_ with your desired character.

CSV/TSV, PDF, MSG, Text

Contents

CSV/TSV, PDF, MSG, Text#

Key features#

How the File provider works#

Example scenario#

Built-in tables#

AllFiles#

EmailMessages#

PdfDocuments#

TextFiles#

CSV parser and reader configuration#

Basic settings#

Advanced settings#

Alternatives to the built-in File provider#

CSV/TSV sources using the Cdata driver#

CData Driver installation#

Custom delimiter in CData driver#

See also#