CSV/TSV, PDF, MSG, Text#

The built-in File provider handles the following file types:

  1. CSV/TSV (.csv, .tsv)

  2. Exchange Email (.msg, .ems, .eml)

  3. Portable Document Format (.pdf)

  4. Unstructured text file (.txt)

Key features#

  • very large file support (tested on files of several GBs of size)

  • very wide file support (up to 32 000 columns)

  • automatic schema detection for schema-less files

  • custom schema support compatible with Microsoft ODBC schema.ini definition

  • multisource: local files, http(s) publicly available sources

  • built-in popular formats for parsing dates, currency, booleans, etc.

  • support of custom values that get translated into Null, boolean True and False

  • high throughput

How the File provider works#

File provider scans all designated source folders looking for files. Supported file types with a well-known, immutable schema, eg. email messages or PDFs, are presented as rows in a built-in virtual table that groups together files of the same type. The CSV/TSV files can have different schemas, therefore each of them is exposed as a dedicated virtual table. The following mapping rules govern the metadata import:

  1. Files in a folder designated in connection definition as a root, belong to the dbo schema.

  2. Each direct subfolder of the root folder converts into a schema, and all of its files belong the that schema.

  3. All supported file types having a well-known, immutable schema (eg. pdf), map into a single virtual table, stored in the builtin schema.

  4. CSV/TSV files having a detected or assigned schema, map to a virtual table 1-to-1.

  5. All files found, irrespective of type, map into a AllFiles virtual table.

The following table describes the built-in tables, available from the builtin schema:

Table name

Default file extensions

EmailMessages

.msg, .eml

PdfDocuments

.pdf

TextFiles

.txt

AllFiles

.*

Example scenario#

Let’s consider the following folder structure on your disk:

c:\files
c:\files\customers.csv
c:\files\invoices.tsv
c:\files\email1.msg
c:\files\email2.msg
c:\files\printout1.pdf
c:\files\printout2.pdf
c:\files\readme.txt
c:\files\company\employees.csv
c:\files\company\websites.csv

Let’s assume that we mapped the c:\files folder as the root and we import metadata of all files presented to us during virtual database creation.

The resulting virtual database consists of the following objects:

Source file path

Mapped table name

Mapped schema

Reason

c:\files\customers.csv

customers_csv

dbo

Customers.csv file has a schema and is in the root folder, therefore it maps to a virtual table in the dbo schema.

c:\files\invoices.tsv

invoices_tsv

dbo

Invoices.tsv file has a schema and is in the root folder, therefore it maps to a virtual table in the dbo schema.

c:\files\email1.msg, c:\files\email2.msg

EmailMessages

builtin

Exchange email files have a well-known, immutable schema, so are all mapped as rows in the builtin.EmailMessages virtual table.

c:\files\printout.pdf

PdfDocuments

builtin

PDF documents have a well-known, immutable schema, and are all mapped as rows in the builtin.PdfDocument virtual table.

c:\files\readme.txt

TextFiles

builtin

Text file have no per-file schema, however are all mapped as rows into the builtin.TextFiles virtual table for convenience.

c:\files\company\employees.csv

employees_csv

company

Company is a subdirectory of the directory designated as root, therefore directory maps to a schema. inside it map to respective tables. Employees.csv file has a schema and is inside a subfolder, therefore it maps to a virtual table in the company schema.

c:\files\company\websites.csv

websites_csv

company

Websites.csv file has a schema and is inside a subfolder, therefore it maps to a virtual table in the company schema.

c:\files\*.*

AllFiles

builtin

All files found in the root and its subfolders map as rows into the AllFiles virtual table in the builtin schema.

Built-in tables#

AllFiles#

Column name

Column type

Description

Name

NVarchar(255) not null

File name with extension

Type

NVarchar(30) not null

File extension with leading dot (.ext) or “directory”.

SizeBytes

bigint not null

File size in bytes

Path

NVarchar(4000) not null

OS file path

EmailMessages#

Column name

Column type

Description

MessageId

VarChar(160) not null

Message Id

Subject

NVarchar(250) not null

Message subject

SubjectNormalized

NVarchar(250) null

Normalized message subject

ReceivedOn

DateTime2 null

Received date

BodyHtml

NVarchar(max) null

Body in HTML format

BodyRtf

NVarchar(max) null

Body in Rich Text Format (RTF) format

BodyText

NVarchar(max) null

Body in Text format

Attachments

Int not null

Attachments count

AttachmentsJson

NVarchar(max) null

JSON text with attachments information

Sender_DisplayName

NVarchar(200) null

Sender display name

Sender_Email

NVarchar(200) null

Sender email address

Recipients_To

NVarchar(max) null

Recipients list

Recipients_CC

NVarchar(max) null

Carbon-copy recipients list

SentOn

DateTime2 null

Send date

CreatedOn

DateTime2 null

Created date

Categories

NVarchar(220) null

List of categories

ConversationIndex

NVarchar(160) null

Conversation index

ConversationTopic

NVarchar(200) null

Conversation topic

LastModifiedOn

DateTime2 null

Last modification date

Headers

VarChar(2000) null

List of headers

Importance

VarChar(30) null

Importance

InternetCodePage

VarChar(30) null

Internet code page

MessageLocalId

VarChar(160) null

Message local Id

Type

VarChar(30) null

File types

Path

NVarchar(4000) null

File path

PdfDocuments#

Column name

Column type

Description

Path

NVarChar(4000) not null

OS path to a file

CreatedOn

DateTime2 not null

Document creation date

LastModifiedOn

DateTime2 not null

Document last modification date

Keywords

NVarChar(1600) null

Document keywords

Subject

NVarChar(300) null

Document subject

Title

NVarChar(300) null

Document title

IsEncrypted

Bit not null

Encryption flag (1=encrypted, 0=not encrypted)

NumberOfPages

Int not null

Number of pages in document

Author

NVarChar(255) null

Document author

Version

Decimal not null

Document PDF specification compatibility version

SizeBytes

BigInt not null

File size in bytes

EmbeddedFiles

Int not null

Document-embedded files count

TextFiles#

Column name

Column type

Description

Name

NVarChar(255) not null

OS file name with extension

SizeBytes

BigInt not null

File size in bytes

Path

NVarChar(4000) not null

OS file path

Encoding

VarChar(50) null

File encoding name

Type

NVarChar(30) null

File extension

TextSample

NVarChar(1000) null

First 1000 charactes of the file content

CSV parser and reader configuration#

Settings divide into two groups: connection and object-level settings. All database objects inherit settings from the connection. Relevant settings can be adjusted on the object level. If a setting is not overriden on the object level, changing it on the connection level propagates to every object that does not override the setting. Changing the setting on the object level stops the inheritance.

Basic settings#

Setting name

Allowed values

Description

File format

Autodetect delimiter*, Comma Delimited, Semicolon delimited, Tab delimited, Custom

File format selection. The default is “Autodetect delimiter”.

Allowed delimiter values

Any

Contains the list of characters or strings that will be used as candidates during separator detection.

Source URI’s

Any URI

Contains the list of URI’s that need to be scanned during the metadata discovery (import).

First row contains column names

True*, False

Indicates that the first row contains column names (header). The default is True. The “Skip first # rows” setting value (see Advanced Settings below) is applied first, so the ‘First Row’ may not be the actual first row in a file.

Always detect character encoding before reading

True*, False

Indicates that the character encoding has to be detected each time a file is opened. Useful when file encoding changes frequently. For best performance, uncheck this option, and set the Encoding to a fixed value.

Encoding

Any encoding from the list

Indicates the character encoding to be used when a file is opened. If set, the automatic detection does not take place.

Culture

Any culture from the list

Indicates the culture to be used when converting string values to desired data types.

Allow quotation marks in values

True*, False

Indicates that the quotation mark in values is allowed.

Treat a line break in quoted text as an error

True, False*

Indicates that a line break in quoted text is allowed.

Empty strings should be returned as NULL

True*, False

Indicates that the empty string is converted to a Null value.

Data error handling

Ignore row, Report Error*

Indicates the strategy to use when bad data is detected during the read. Default is “Show Errors”, that throws an error and stops the read operation. The “Ignore” setting ignores the bad row and continues the read operation.

Advanced settings#

Setting name

Allowed values

Description

Max rows to read for data type probing

0 (no limit) to 2147483647

Indicates the maximum number of lines to be read during schema detection. The higher the percentage of the file read, the lower the assumed error margins.

Allow comments

True*, False

Indicates if comment are allowed.

Comment character

Any character

The character used to denote a line that is commented out. Default is ‘#’. Available only if comments were allowed.

Escape character

Any character

The escape character used to escape a quote inside a field. Default is ‘”’.

Ignore blank lines

True*, False

Indicates if blank lines are ignored when reading. Default is true.

Trimming options

No trimming, Trim the whitespace around a field, Trim the whitespace inside of quotes around a field

The strategy used to trim data.

Skip first # rows

0* to 2147483647

Indicates the number of lines to skip from the top of the file.

Values interpreted as boolean True

Any string value(s)

The list of string values interpreted as boolean True. The value gets converted to True, if the whole value in a column matches it, eg. the value “ab” converts to True, if column value in a row equals “ab”.

Values interpreted as boolean False

Any string value(s)

The list of string values interpreted as boolean False. The value gets converted to False, if the whole value in a column matches it, eg. the value “ab” converts to False, if column value in a row equals “ab”.

Values interpreted as Null (case-insensitive)

Any string value(s)

The list of string values interpreted as Null. The value gets converted to Null, if the whole value in a column matches it, eg. the value “ab” converts to Null, if column value in a row equals “ab”.

Alternatives to the built-in File provider#

If for any reson the built-in implementation of the File provider is insufficient, these are the available alternatives:

CSV/TSV sources using the Cdata driver#

Note

Querona supports CData ADO.NET drivers, however these drivers need to be purchased separately.

CData Driver installation#

Follow these steps to install the CData ADO.Net driver for CSV/TSV:

  1. Go to the CData website.

  2. In the ADO.NET Provider Download section, select Download Trial or Download Full, depending on your use-case.

  3. In the list CData Software - Downloads that displays, register or sign-in.

  4. Download the ADO.Net driver and install the driver.

Custom delimiter in CData driver#

Let’s assume that we want to connect to a CSV file that:

  • has a header,

  • is delimited by semicolon character “;”

The following connection string can be used:

DataSource=[ServerName]; IncludeFiles="CSV"; FMT="Delimited(;)"; HDR="Yes";

For different delimiter change the ‘FMT=”Delimited(_put_your_delimiter_here_)” property and replace the placeholder _put_your_delimiter_here_ with your desired character.

See also#