CSV/TSV, PDF, MSG, Text

The built-in File provider handles the following file types:

  1. CSV/TSV (.csv, .tsv)

  2. Exchange Email (.msg, .ems, .eml)

  3. Portable Document Format (.pdf)

  4. Unstructured text file (.txt)

Key features

  • very large file support (tested on files of several GBs of size)

  • very wide file support (up to 32 000 columns)

  • automatic schema detection for schema-less files

  • custom schema support compatible with Microsoft ODBC schema.ini definition

  • multisource: local files, http(s) publicly available sources

  • built-in popular formats for parsing dates, currency, booleans, etc.

  • support of custom values that get translated into Null, boolean True and False

  • high throughput

How the File provider works

File provider scans all designated source folders looking for files. Supported file types with a well-known, immutable schema, eg. email messages or PDFs, are presented as rows in a built-in virtual table. The CSV/TSV files can have different schemas, therefore each of them is exposed as a dedicated virtual table. The following mapping rules govern the metadata import:

  1. Files in a folder designated in connection definition as a root, go to the dbo schema.

  2. Each direct subfolder of the root folder converts into a schema, and all of its files belong the that schema.

  3. All supported file types having a well-known, immutable schema (eg. pdf), map into a single virtual table, stored in the builtin schema.

  4. CSV/TSV files having a detected or assigned schema, map to a virtual table 1-to-1.

  5. All files found, irrespective of type, map into a AllFiles virtual table.

The following table describes the built-in tables, available in the builtin schema:

Table name

Default file extensions

EmailMessages

.msg, .eml

PdfDocuments

.pdf

TextFiles

.txt

AllFiles

.*

Example scenario

For example, consider the following folder structure on your disk:

c:\files
c:\files\customers.csv
c:\files\invoices.tsv
c:\files\email1.msg
c:\files\email2.msg
c:\files\printout1.pdf
c:\files\printout2.pdf
c:\files\readme.txt
c:\files\company\employees.csv
c:\files\company\websites.csv

Let’s assume that we mapped the c:files folder as the root and we import metadata of all files presented to us during virtual database creation.

The resulting virtual database consists of the following objects:

Source file path

Mapped table name

Mapped schema

Reason

c:\files\customers.csv

customers_csv

dbo

Customers.csv file has a schema and is in the root folder, therefore it maps to a virtual table in the dbo schema.

c:\files\invoices.tsv

invoices_tsv

dbo

Invoices.tsv file has a schema and is in the root folder, therefore it maps to a virtual table in the dbo schema.

c:\files\email1.msg, c:\files\email2.msg

EmailMessages

builtin

Exchange email files have a well-known, immutable schema, so are all mapped as rows in the builtin.EmailMessages virtual table.

c:\files\printout.pdf

PdfDocuments

builtin

PDF documents have a well-known, immutable schema, and are all mapped as rows in the builtin.PdfDocument virtual table.

c:\files\readme.txt

TextFiles

builtin

Text file have no per-file schema, however are all mapped as rows into the builtin.TextFiles virtual table for convenience.

c:\files\company\employees.csv

employees_csv

company

Company is a subdirectory of the directory designated as root, therefore directory maps to a schema. inside it map to respective tables. Employees.csv file has a schema and is inside a subfolder, therefore it maps to a virtual table in the company schema.

c:\files\company\websites.csv

websites_csv

company

Websites.csv file has a schema and is inside a subfolder, therefore it maps to a virtual table in the company schema.

c:\files\*.*

AllFiles

builtin

All files found in the root and its subfolders map as rows into the AllFiles virtual table in the builtin schema.

Built-in tables

AllFiles

Column name

Column type

Description

Name

NVarchar(255) not null

File name with extension

Type

NVarchar(30) not null

File extension with leading dot (.ext) or “directory”.

SizeBytes

bigint not null

File size in bytes

Path

NVarchar(4000) not null

OS file path

EmailMessages

Column name

Column type

Description

MessageId

VarChar(160) not null

Message Id

Subject

NVarchar(250) not null

Message subject

SubjectNormalized

NVarchar(250) null

Normalized message subject

ReceivedOn

DateTime2 null

Received date

BodyHtml

NVarchar(max) null

Body in HTML format

BodyRtf

NVarchar(max) null

Body in Rich Text Format (RTF) format

BodyText

NVarchar(max) null

Body in Text format

Attachments

Int not null

Attachments count

AttachmentsJson

NVarchar(max) null

JSON text with attachments information

Sender_DisplayName

NVarchar(200) null

Sender display name

Sender_Email

NVarchar(200) null

Sender email address

Recipients_To

NVarchar(max) null

Recipients list

Recipients_CC

NVarchar(max) null

Carbon-copy recipients list

SentOn

DateTime2 null

Send date

CreatedOn

DateTime2 null

Created date

Categories

NVarchar(220) null

List of categories

ConversationIndex

NVarchar(160) null

Conversation index

ConversationTopic

NVarchar(200) null

Conversation topic

LastModifiedOn

DateTime2 null

Last modification date

Headers

VarChar(2000) null

List of headers

Importance

VarChar(30) null

Importance

InternetCodePage

VarChar(30) null

Internet code page

MessageLocalId

VarChar(160) null

Message local Id

Type

VarChar(30) null

File types

Path

NVarchar(4000) null

File path

PdfDocuments

Column name

Column type

Description

Path

NVarChar(4000) not null

OS path to a file

CreatedOn

DateTime2 not null

Document creation date

LastModifiedOn

DateTime2 not null

Document last modification date

Keywords

NVarChar(1600) null

Document keywords

Subject

NVarChar(300) null

Document subject

Title

NVarChar(300) null

Document title

IsEncrypted

Bit not null

Encryption flag (1=encrypted, 0=not encrypted)

NumberOfPages

Int not null

Number of pages in document

Author

NVarChar(255) null

Document author

Version

Decimal not null

Document PDF specification compatibility version

SizeBytes

BigInt not null

File size in bytes

EmbeddedFiles

Int not null

Document-embedded files count

TextFiles

Column name

Column type

Description

Name

NVarChar(255) not null

OS file name with extension

SizeBytes

BigInt not null

File size in bytes

Path

NVarChar(4000) not null

OS file path

Encoding

VarChar(50) null

File encoding name

Type

NVarChar(30) null

File extension

TextSample

NVarChar(1000) null

First 1000 charactes of the file content

Alternatives to the File provider

If for any reson the built-in implementation is of File provider is insufficient, these are the available alternatives:

CSV/TSV sources using the Cdata driver

Note

Querona supports CData ADO.NET drivers, however these drivers need to be purchased separately.

CData Driver installation

Follow these steps to install the CData ADO.Net driver for CSV/TSV:

  1. Go to the CData website.

  2. In the ADO.NET Provider Download section, select Download Trial or Download Full, depending on your use-case.

  3. In the list CData Software - Downloads that displays, register or sign-in.

  4. Download the ADO.Net driver and install the driver.

Custom delimiter in CData driver

Let’s assume that we want to connect to a CSV file that:

  • has a header,

  • is delimited by semicolon character “;”

The following connection string can be used:

DataSource=[ServerName]; IncludeFiles="CSV"; FMT="Delimited(;)"; HDR="Yes";

For different delimiter change the ‘FMT=”Delimited(_put_your_delimited_here_)” property and replace the placeholder _put_your_delimited_here_ with your desired character.