CSV/TSV, PDF, MSG, Text#
The built-in File provider handles the following file types:
CSV/TSV (.csv, .tsv)
Exchange Email (.msg, .ems, .eml)
Portable Document Format (.pdf)
Unstructured text file (.txt)
Key features#
very large file support (tested on files of several GBs of size)
very wide file support (up to 32 000 columns)
automatic schema detection for schema-less files
custom schema support compatible with Microsoft ODBC schema.ini definition
multisource: local files, http(s) publicly available sources
built-in popular formats for parsing dates, currency, booleans, etc.
support of custom values that get translated into Null, boolean True and False
high throughput
How the File provider works#
File provider scans all designated source folders looking for files. Supported file types with a well-known, immutable schema, eg. email messages or PDFs, are presented as rows in a built-in virtual table that groups together files of the same type. The CSV/TSV files can have different schemas, therefore each of them is exposed as a dedicated virtual table. The following mapping rules govern the metadata import:
Files in a folder designated in connection definition as a root, belong to the dbo schema.
Each direct subfolder of the root folder converts into a schema, and all of its files belong the that schema.
All supported file types having a well-known, immutable schema (eg. pdf), map into a single virtual table, stored in the builtin schema.
CSV/TSV files having a detected or assigned schema, map to a virtual table 1-to-1.
All files found, irrespective of type, map into a AllFiles virtual table.
The following table describes the built-in tables, available from the builtin schema:
Table name |
Default file extensions |
---|---|
EmailMessages |
.msg, .eml |
PdfDocuments |
|
TextFiles |
.txt |
AllFiles |
.* |
Example scenario#
Let’s consider the following folder structure on your disk:
c:\files
c:\files\customers.csv
c:\files\invoices.tsv
c:\files\email1.msg
c:\files\email2.msg
c:\files\printout1.pdf
c:\files\printout2.pdf
c:\files\readme.txt
c:\files\company\employees.csv
c:\files\company\websites.csv
Let’s assume that we mapped the c:\files folder as the root and we import metadata of all files presented to us during virtual database creation.
The resulting virtual database consists of the following objects:
Source file path |
Mapped table name |
Mapped schema |
Reason |
---|---|---|---|
c:\files\customers.csv |
customers_csv |
dbo |
Customers.csv file has a schema and is in the root folder, therefore it maps to a virtual table in the dbo schema. |
c:\files\invoices.tsv |
invoices_tsv |
dbo |
Invoices.tsv file has a schema and is in the root folder, therefore it maps to a virtual table in the dbo schema. |
c:\files\email1.msg, c:\files\email2.msg |
EmailMessages |
builtin |
Exchange email files have a well-known, immutable schema, so are all mapped as rows in the builtin.EmailMessages virtual table. |
c:\files\printout.pdf |
PdfDocuments |
builtin |
PDF documents have a well-known, immutable schema, and are all mapped as rows in the builtin.PdfDocument virtual table. |
c:\files\readme.txt |
TextFiles |
builtin |
Text file have no per-file schema, however are all mapped as rows into the builtin.TextFiles virtual table for convenience. |
c:\files\company\employees.csv |
employees_csv |
company |
Company is a subdirectory of the directory designated as root, therefore directory maps to a schema. inside it map to respective tables. Employees.csv file has a schema and is inside a subfolder, therefore it maps to a virtual table in the company schema. |
c:\files\company\websites.csv |
websites_csv |
company |
Websites.csv file has a schema and is inside a subfolder, therefore it maps to a virtual table in the company schema. |
c:\files\*.* |
AllFiles |
builtin |
All files found in the root and its subfolders map as rows into the AllFiles virtual table in the builtin schema. |
Built-in tables#
AllFiles#
Column name |
Column type |
Description |
---|---|---|
Name |
NVarchar(255) not null |
File name with extension |
Type |
NVarchar(30) not null |
File extension with leading dot (.ext) or “directory”. |
SizeBytes |
bigint not null |
File size in bytes |
Path |
NVarchar(4000) not null |
OS file path |
EmailMessages#
Column name |
Column type |
Description |
---|---|---|
MessageId |
VarChar(160) not null |
Message Id |
Subject |
NVarchar(250) not null |
Message subject |
SubjectNormalized |
NVarchar(250) null |
Normalized message subject |
ReceivedOn |
DateTime2 null |
Received date |
BodyHtml |
NVarchar(max) null |
Body in HTML format |
BodyRtf |
NVarchar(max) null |
Body in Rich Text Format (RTF) format |
BodyText |
NVarchar(max) null |
Body in Text format |
Attachments |
Int not null |
Attachments count |
AttachmentsJson |
NVarchar(max) null |
JSON text with attachments information |
Sender_DisplayName |
NVarchar(200) null |
Sender display name |
Sender_Email |
NVarchar(200) null |
Sender email address |
Recipients_To |
NVarchar(max) null |
Recipients list |
Recipients_CC |
NVarchar(max) null |
Carbon-copy recipients list |
SentOn |
DateTime2 null |
Send date |
CreatedOn |
DateTime2 null |
Created date |
Categories |
NVarchar(220) null |
List of categories |
ConversationIndex |
NVarchar(160) null |
Conversation index |
ConversationTopic |
NVarchar(200) null |
Conversation topic |
LastModifiedOn |
DateTime2 null |
Last modification date |
Headers |
VarChar(2000) null |
List of headers |
Importance |
VarChar(30) null |
Importance |
InternetCodePage |
VarChar(30) null |
Internet code page |
MessageLocalId |
VarChar(160) null |
Message local Id |
Type |
VarChar(30) null |
File types |
Path |
NVarchar(4000) null |
File path |
PdfDocuments#
Column name |
Column type |
Description |
---|---|---|
Path |
NVarChar(4000) not null |
OS path to a file |
CreatedOn |
DateTime2 not null |
Document creation date |
LastModifiedOn |
DateTime2 not null |
Document last modification date |
Keywords |
NVarChar(1600) null |
Document keywords |
Subject |
NVarChar(300) null |
Document subject |
Title |
NVarChar(300) null |
Document title |
IsEncrypted |
Bit not null |
Encryption flag (1=encrypted, 0=not encrypted) |
NumberOfPages |
Int not null |
Number of pages in document |
Author |
NVarChar(255) null |
Document author |
Version |
Decimal not null |
Document PDF specification compatibility version |
SizeBytes |
BigInt not null |
File size in bytes |
EmbeddedFiles |
Int not null |
Document-embedded files count |
TextFiles#
Column name |
Column type |
Description |
---|---|---|
Name |
NVarChar(255) not null |
OS file name with extension |
SizeBytes |
BigInt not null |
File size in bytes |
Path |
NVarChar(4000) not null |
OS file path |
Encoding |
VarChar(50) null |
File encoding name |
Type |
NVarChar(30) null |
File extension |
TextSample |
NVarChar(1000) null |
First 1000 charactes of the file content |
CSV parser and reader configuration#
Settings divide into two groups: connection and object-level settings. All database objects inherit settings from the connection. Relevant settings can be adjusted on the object level. If a setting is not overriden on the object level, changing it on the connection level propagates to every object that does not override the setting. Changing the setting on the object level stops the inheritance.
Basic settings#
Setting name |
Allowed values |
Description |
---|---|---|
File format |
Autodetect delimiter*, Comma Delimited, Semicolon delimited, Tab delimited, Custom |
File format selection. The default is “Autodetect delimiter”. |
Allowed delimiter values |
Any |
Contains the list of characters or strings that will be used as candidates during separator detection. |
Source URI’s |
Any URI |
Contains the list of URI’s that need to be scanned during the metadata discovery (import). |
First row contains column names |
True*, False |
Indicates that the first row contains column names (header). The default is True. The “Skip first # rows” setting value (see Advanced Settings below) is applied first, so the ‘First Row’ may not be the actual first row in a file. |
Always detect character encoding before reading |
True*, False |
Indicates that the character encoding has to be detected each time a file is opened. Useful when file encoding changes frequently. For best performance, uncheck this option, and set the Encoding to a fixed value. |
Encoding |
Any encoding from the list |
Indicates the character encoding to be used when a file is opened. If set, the automatic detection does not take place. |
Culture |
Any culture from the list |
Indicates the culture to be used when converting string values to desired data types. |
Allow quotation marks in values |
True*, False |
Indicates that the quotation mark in values is allowed. |
Treat a line break in quoted text as an error |
True, False* |
Indicates that a line break in quoted text is allowed. |
Empty strings should be returned as NULL |
True*, False |
Indicates that the empty string is converted to a Null value. |
Data error handling |
Ignore row, Report Error* |
Indicates the strategy to use when bad data is detected during the read. Default is “Show Errors”, that throws an error and stops the read operation. The “Ignore” setting ignores the bad row and continues the read operation. |
Advanced settings#
Setting name |
Allowed values |
Description |
---|---|---|
Max rows to read for data type probing |
0 (no limit) to 2147483647 |
Indicates the maximum number of lines to be read during schema detection. The higher the percentage of the file read, the lower the assumed error margins. |
Allow comments |
True*, False |
Indicates if comment are allowed. |
Comment character |
Any character |
The character used to denote a line that is commented out. Default is ‘#’. Available only if comments were allowed. |
Escape character |
Any character |
The escape character used to escape a quote inside a field. Default is ‘”’. |
Ignore blank lines |
True*, False |
Indicates if blank lines are ignored when reading. Default is true. |
Trimming options |
No trimming, Trim the whitespace around a field, Trim the whitespace inside of quotes around a field |
The strategy used to trim data. |
Skip first # rows |
0* to 2147483647 |
Indicates the number of lines to skip from the top of the file. |
Values interpreted as boolean True |
Any string value(s) |
The list of string values interpreted as boolean True. The value gets converted to True, if the whole value in a column matches it, eg. the value “ab” converts to True, if column value in a row equals “ab”. |
Values interpreted as boolean False |
Any string value(s) |
The list of string values interpreted as boolean False. The value gets converted to False, if the whole value in a column matches it, eg. the value “ab” converts to False, if column value in a row equals “ab”. |
Values interpreted as Null (case-insensitive) |
Any string value(s) |
The list of string values interpreted as Null. The value gets converted to Null, if the whole value in a column matches it, eg. the value “ab” converts to Null, if column value in a row equals “ab”. |
Alternatives to the built-in File provider#
If for any reson the built-in implementation of the File provider is insufficient, these are the available alternatives:
CData (commercial), see below
CSV/TSV sources using the Cdata driver#
Note
Querona supports CData ADO.NET drivers, however these drivers need to be purchased separately.
CData Driver installation#
Follow these steps to install the CData ADO.Net driver for CSV/TSV:
Go to the CData website.
In the ADO.NET Provider Download section, select Download Trial or Download Full, depending on your use-case.
In the list CData Software - Downloads that displays, register or sign-in.
Download the ADO.Net driver and install the driver.
Custom delimiter in CData driver#
Let’s assume that we want to connect to a CSV file that:
has a header,
is delimited by semicolon character “;”
The following connection string can be used:
DataSource=[ServerName]; IncludeFiles="CSV"; FMT="Delimited(;)"; HDR="Yes";
For different delimiter change the ‘FMT=”Delimited(_put_your_delimiter_here_)” property and replace the placeholder _put_your_delimiter_here_ with your desired character.