Quickstart¶
intake-odbc
provides quick and easy access to tabular data stored in
ODBC data sources, which include a wide variety of traditional relational
database systems such as MySQL and Microsoft SQL Server. Some DB systems
such as PostgreSQL may have better/faster plugin implementations.
Installation¶
To use this plugin for intake, install with the following command:
conda install -c intake intake-odbc
Setting up ODBC¶
Configuring ODBC is beyond the scope of this document; it is generally not something performed by an end-user, and typically requires installing backend-specific drivers system-wide.
Specific documentation on the connection string and keyword arguments can be found on the TurbODBC website.
Usage¶
Ad-hoc¶
After installation, the functions intake.open_odbc
and intake.open_odbc_partitioned
will become available.
Assuming you have an ODBC setup with a fully-configured connection named
"MSSQL",
the following would fetch the contents of mytable
into a pandas
dataframe:
import intake
source = intake.open_odbc('Driver={MSSQL}', 'SELECT * FROM mytable')
dataframe = source.read()
Two key arguments are required to define an ODBC data source: the DB connection
parameters, and a SQL query to execute. The former may be as simple as a TurbODBC
connection string, but can commonly be a set of keyword arguments, all of which are
passed on to turbodbc.connect()
. The query must use syntax valid for the
backend of choice.
In addition, the following arguments are meaningful for the non-partitioned data source:
- head_rows: how many rows are read from the start of the data to infer data
types for discovery
- mssql: a special flag to mark datasets which are backed by MS SQL Server, which
requires a different spelling of the LIMIT
statement.
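To illustrate why the mssql flag matters: MS SQL Server spells a row limit as SELECT TOP n placed after SELECT, while most other backends accept a trailing LIMIT n. A minimal sketch of how a head-rows discovery query might be spelled for each case; build_head_query is a hypothetical helper for illustration, not part of the plugin:

```python
def build_head_query(sql_expr, head_rows, mssql=False):
    """Wrap a query so only the first head_rows rows are returned.

    MS SQL Server uses TOP after the SELECT keyword; most other
    backends append a LIMIT clause instead.
    """
    if mssql:
        return "SELECT TOP {} * FROM ({}) sub".format(head_rows, sql_expr)
    return "SELECT * FROM ({}) sub LIMIT {}".format(sql_expr, head_rows)

# Standard spelling vs. the MS SQL Server spelling
print(build_head_query("SELECT * FROM mytable", 10))
print(build_head_query("SELECT * FROM mytable", 10, mssql=True))
```

The real plugin constructs its discovery query internally; this sketch only shows the difference in spelling that the flag accounts for.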
When using the partitioned ODBC source, further details are required in order to
build the queries for each partition. It requires an index column to use for
the WHERE
statement, and the bounding values for each partition. The most
explicit way to provide the boundaries is with the divisions
keyword; otherwise,
boundaries will be calculated as npartitions
equally spaced values between
the min/max values. Note that some partitions may be empty.
Further arguments when using partitioning:
- index: Column to use for partitioning
- max, min: the range of values of the index column to consider for building partitions;
will execute a separate query to find these, if not given
- npartitions: the number of partitions to create
- divisions: explicit partition boundary values
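The boundary and query construction described above can be sketched in plain Python. make_divisions and partition_queries below are illustrative helpers, not the plugin's actual internals, and the exact WHERE-clause spelling intake-odbc emits may differ:

```python
def make_divisions(vmin, vmax, npartitions):
    """Compute npartitions + 1 equally spaced boundary values
    spanning the range [vmin, vmax]."""
    step = (vmax - vmin) / npartitions
    return [vmin + i * step for i in range(npartitions)] + [vmax]

def partition_queries(sql_expr, index, divisions):
    """Build one WHERE-clause query per partition from consecutive
    boundary values; only the last partition includes its upper bound,
    so every row falls into exactly one partition."""
    queries = []
    for i, (lo, hi) in enumerate(zip(divisions[:-1], divisions[1:])):
        op = "<=" if i == len(divisions) - 2 else "<"
        queries.append(
            "SELECT * FROM ({}) sub WHERE {} >= {} AND {} {} {}".format(
                sql_expr, index, lo, index, op, hi))
    return queries

# Four partitions over index values 0..100
queries = partition_queries("SELECT * FROM mytable", "productid",
                            make_divisions(0, 100, 4))
```

If the index values are not uniformly distributed between min and max, some of these equally spaced partitions may match no rows at all, which is why the text notes that partitions may be empty.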
Creating Catalog Entries¶
To use in a catalog, entries must specify driver: odbc
or driver: odbc_partitioned
.
Further arguments should be provided, as for the intake.open_*
ad-hoc
commands, above. In particular, the connection parameters and query string
are required, and also the index column, if using partitioning.
It should be noted that SQL query strings are generally quite long; the appropriate syntax may look like:
product_types:
  description: Randomly generated data
  driver: odbc_partitioned
  args:
    uri:
    sql_expr: |
      SELECT t.productid AS pid, t.productname, t.price, tt.typename
      FROM testtable t
      INNER JOIN testtypetable tt ON t.typeid=tt.typeid
    dsn: MSSQL
    mssql: true
Where in this case we provide a keyword argument to specify the connection and so
leave the uri
field empty, and the special YAML syntax with "|"
is used to
indicate the multi-line query, delimited by indentation.
Warning: while it is reasonable to include user parameters in the SQL query body, free-form strings or environment variables should not be used, since they would allow arbitrary code execution on the DB server (SQL injection). Similarly, the details of ODBC connections are unlikely to be useful as user parameters, except possibly to take the DB username and password from the environment.
Using a Catalog¶
Assuming a catalog file called cat.yaml
, containing an ODBC source pdata
, one could
load it into a dataframe as follows:
import intake
cat = intake.Catalog('cat.yaml')
df = cat.pdata.read()
The source may or may not be partitioned, depending on the plugin which was used and
the parameters. Use .discover()
to find out whether there is partitioning, and if there
is, the partitions can be accessed independently.
Dask
can be used to read a partitioned source in parallel (see method .to_dask()
);
note that there is some overhead to establishing connections from each worker, and the
same ODBC drivers and configuration must exist on each machine, in the case of a
distributed cluster.