Dremio

Dremio is a Data-as-a-Service platform, which enables data analysts and scientists to autonomously explore, validate and curate data from a variety of sources, all in a single, unified and coherent interface. Built for teams, Dremio leverages spaces and virtual datasets to offer a data platform with the following features.

The official website is https://www.dremio.com.

The official documentation with details on how to use Dremio is https://docs.dremio.com/.

Data Sources

Dremio supports modern data lakes built on a variety of different systems and provides:

  • native integrations with major RDBMS such as PostgresSQL, MySQL, MS SQL, IBM DB2, Oracle
  • NoSQL integration for modern datastores as MongoDB, Elasticsearch
  • support for file based datasources, cloud storage systems, NAS

Data Exploration

Dremio offers a unified view across all datasets connected to the platform, with:

  • live data visualization during query preparation and execution, with dynamic previews
  • optimized query pushdown for all sources in native query language
  • virtual datasets based on complex queries available as sources for analytics and external tools

Cloud Ready

Dremio is architected for cloud environments, with elastic computing abilities and dynamic horizontal scaling. Data reflections can be stored into distributed storage platforms such as S3, HDFS, ADLS.

Screenshots from the Website

../../_images/dremio_9.png ../../_images/dremio_2.png ../../_images/dremio_13.png ../../_images/dremio_35.png

Installation

Dremio is a Java software and requires a compatible JDK installed. The current version supports only OpenJDK 1.8 and Oracle JDK 1.8.

System requirements are:

  • a supported Linux distribution: RHEL/CentOS 6.7+/7.3+, SLES 12+, Ubuntu 14.04+, Debian 7+
  • at least 4 CPU cores and 8GB RAM for starting the software.

Given the nature of data analysis, and the distributed design of the software, production deployments should follow the following indications:

Node role Hardware required
Coordinators 8 CPU/16GB RAM recommended
Executors 4 CPU/16GB RAM minimum 16 CPU/64GB RAM recommended

You can read more at https://docs.dremio.com/deployment/rpm-tarball-install.html.

Platform Fork

The integration of Dremio into the Digital Hub platform required extending the open source version, which lacks some enteprise features, to support:

Without these extensions, Dremio supports internal authentication only and grants administrator privileges to all users, hence every user can access any resource indiscriminately.

As far as authentication is concerned, the following features have been implemented:

  • OAuth2.0 support, with access via the secure authorization_code flow and a native Dremio token integration for UI
  • authomatic user creation and personal space (user home) definition upon valid OAuth2.0 access
  • distinction between ADMIN role and USER role, which reflects on the UI in that admin actions and menus are hidden to unprivileged users
  • OAuth2.0 login in the UI

Additionally, the upstream support service, which exposes metrics, interactive chat and debug information to dremio.com for licensed enterprise environments, is disabled by default. This should be reviewed in privacy-sensitive environments, as the complete deactivation of user and session data leakage to dremio.com and its partners requires the explicit configuration of various properties in dremio.conf.

The multitenancy model implemented in the fork is structured as follows:

  • admin privileges are not assignable, ADMIN (Dremio admin or system admin) role is reserved to dremio user, every other user is assigned either TENANT ADMIN role or USER role
  • each user is associated to a single tenant
  • the tenant is attached to the username with the syntax <username>@<tenant>
  • all APIs accessible to regular users are protected so that non-ADMIN users can only access resources within their own tenant
  • when a resource belongs to a tenant (i.e. is shared among all its users), such tenant is specified as a prefix in the resource path with the syntax <tenant>__<rootname>/path/to/resource

In Dremio, resources are either containers (spaces, sources, homes) or inside a container (folders, datasets), therefore spaces and sources are prefixed with their tenant, while folders and datasets inherit it from their container, which is the root of their path, and do not need to be prefixed. For example, in the following resource tree, myspace, myfolder and mydataset all belong to mytenant:

mytenant__myspace
└───myfolder
    └───mydataset

The ADMIN user can access any resource. Regular users (i.e. tenant admins and users) can only access resources inside their own home or belonging to their tenant. This implies that users can only query data and access job results according to these constraints.

Note

Currently, when non-ADMIN users create a new source or space (sample sources included), that is automatically prefixed with their own tenant. Non-ADMIN users cannot create sources or spaces with a different tenant than their own.

Configuration for OAuth2.0

Note

The configuration described below uses AAC as the authentication provider, however any standard OAuth2.0 provider can be used.

1. Configuring a client application on AAC

On your AAC instance, create a new client app named dremio with the following properties:

  • Identity providers : internal
  • Redirect URIs: <dremio_url>/apiv2/oauth/callback
  • Grant types: authorization_code
  • Authentication methods: client_secret_basic, client_secret_post, none
  • Token type: JWT
  • Selected scopes: user.roles.me, user.spaces.me, openid, profile, email

Under “Hooks & Claims”, set:

  • Unique spaces prefix: components/dremio
  • Custom claim mapping: enable
  • Custom claim mapping function:
function claimMapping(claims) {
    var valid = ['ROLE_USER'];
    var owner = ['ROLE_OWNER'];
    var prefix = "components/dremio/";

    //fetch username where we find it
    var username = claims["username"];
    if(!username) {
        username = claims ["preferred_username"];
    }
    if(!username) {
        username = claims ["email"];
    }

    if ("spaceRoles" in claims && "space" in claims) {
        var space = claims["space"];
        //can't support no space selection performed
        if (Array.isArray(space)) {
            space = null;
        }
        //lookup for policy for selected space
        var tenant = null;
        if(space) {
            for (var role of claims["spaceRoles"]) {
                if (role.startsWith(prefix + space + ":")) {
                    var p = role.split(":")[1]

                    //replace owner with USER
                    if (owner.indexOf(p) !== -1) {
                        p = "ROLE_USER"
                    }

                    if (valid.indexOf(p) !== -1) {
                        tenant = space
                        break;
                    }
                }
            }
        }

        if (tenant) {
            tenant = tenant.replace(/\./g,'_');
            claims["dremio/tenant"] = tenant;
            claims["dremio/username"] = username+'@'+tenant;
            claims["dremio/role"] = "admin";
        }
    }

    return claims;
}

This function adds a custom claim holding a single user tenant, as AAC supports users being associated to multiple tenants while Dremio does not. During the authorization step on AAC, the user will be asked to select which tenant to use.

2. Configuring Dremio

Open your dremio.conf file and add the following configuration:

services.coordinator.web.auth: {
    type: "oauth",
    oauth: {
        authorizationUrl: "<aac_url>/oauth/authorize"
        tokenUrl: "<aac_url>/oauth/token"
        userInfoUrl: "<aac_url>/userinfo"
        callbackUrl: "<dremio_url>"
        jwksUrl: "<aac_url>/jwk"
        clientId: "<your_client_id>"
        clientSecret: "<your_client_secret>"
        tenantField: "dremio/tenant"
        scope: "openid profile email user.roles.me user.spaces.me"
        roleField: "dremio/role"
        jwtIssuer: "<expected_token_issuer>"
        jwtAudience: "<expected_token_audience>"
    }
}

The tenantField property matches the claim defined in the function above, which holds the user tenant selected during the login. Dremio will associate it to the username with the syntax <username>@<tenant>. That will be used as username in Dremio.

The roleField property matches another claim defined in the function, which holds the role of the user (either “user” or “admin”) within the selected tenant. Such roles correspond to READ and WRITE privileges over tenant data.

Additionally, to fully disable dremio.com intercom, add also:

services.coordinator.web.ui {
    intercom: {
        enabled: false
        appid:  ""
    }
}

Building from Source

Dremio is a maven project, and as such can be properly compiled, along with all the dependencies, via the usual mvn commands:

mvn clean install

Since some modules require license acceptance and checks, in automated builds it is advisable to skip those checks to avoid a failure:

mvn clean install -DskipTests -Dlicense.skip=true

The skipTests flag is useful to speed up automated builds, for example for Docker container rebuilds, once the CI has properly executed all the tests.

During development of new modules or modifications, it is advisable to disable the style-checker via the -Dcheckstyle.skip flag. In order to build a single module, for example dremio-common, use the following syntax:

mvn clean install -DskipTests -Dlicense.skip=true -Dcheckstyle.skip -pl :dremio-common

To test the build, you can execute only the distribution module, which will produce a complete distribution tree under the distribution/server/target folder, and a tar.gz with the deployable package named dremio-community-{version}-{date}-{build}, for example ./distribution/server/target/dremio-community-3.2.1-201905191350330803-1a33f83.tar.gz.

mvn clean install -DskipTests -Dlicense.skip=true -pl :dremio-distribution

The resulting archive can be installed as per upstream instructions.

Note

The first time you open Dremio, you will be asked to create an administrator account. The admin user must have the username dremio, as that is currently the only user that can have admin privileges.

Additional Changes in the Fork

Source Management

Differently from the original implementation, in which source management was restricted to ADMIN only, users with TENANT ADMIN role are allowed to manage (create, update and delete) sources in addition to spaces within their tenant, while the other users can only manage spaces.

Arrow Flight and ODBC/JDBC Services

While internal users can use their credentials to connect to Dremio Arrow Flight server endpoint and ODBC and JDBC services, users that log in via OAuth need to set an internal password in order to connect to Dremio with some client. Such password can be set in the Dremio UI on the Account Settings page.

Connecting WSO2 DSS to Dremio via JDBC

The fork includes an OSGi bundle for Dremio JDBC Driver that can be used with WSO2 Data Services Server. In order to use it, copy the JAR file to <DSS_PRODUCT_HOME>/repository/components/dropins and restart DSS.

DSS Datasource Configuration

A DSS data source can be connected to Dremio by configuring the following properties:

  • Datasource Type: RDBMS
  • Database Engine: Generic
  • Driver Class: com.dremio.jdbc.Driver
  • URL: jdbc:dremio:direct=localhost:31010
  • User Name: <dremio_username>
  • Password: <dremio_password>

When you create a datasource that connects to Dremio, you will likely get a warning on the DSS console that a default logger will be used for the driver logs.

Dremio APIs

Many features of Dremio are available via the Dremio REST API. Two versions of the API currently coexist:

  • v2 is still used internally, although it should be dismissed in the future
  • v3 is documented on the Dremio docs as the official REST API and is progressively replacing v2 also internally

Here is a collection of all the v3 endpoints with links to the corresponding Dremio docs pages, if any. Note that access to some stats APIs has been restricted to ADMIN (i.e. Dremio system admin) in the fork, while regular users have been granted access to source management APIs (if they are tenant admins). The required permission is marked in bold in the tables whenever it differs from the official documentation.

The API path is <dremio_url>/api/v3.

Catalog API:

Path Method Docs Permission
/catalog GET https://docs.dremio.com/rest-api/catalog/get-catalog.html user
POST https://docs.dremio.com/rest-api/catalog/post-catalog.html user
/catalog/{id} GET https://docs.dremio.com/rest-api/catalog/get-catalog-id.html user
POST https://docs.dremio.com/rest-api/catalog/post-catalog-id.html user
PUT https://docs.dremio.com/rest-api/catalog/put-catalog-id.html user
DELETE https://docs.dremio.com/rest-api/catalog/delete-catalog-id.html user
/catalog/{id}/refresh POST https://docs.dremio.com/rest-api/catalog/post-catalog-id-refresh.html user
/catalog/{id}/metadata/refresh POST Refresh of physical dataset metadata user
/catalog/by-path/{path} GET https://docs.dremio.com/rest-api/catalog/get-catalog-path.html user
/catalog/search GET Item research given a query string user
/catalog/{id}/collaboration/tag GET https://docs.dremio.com/rest-api/catalog/get-catalog-collaboration.html user
POST https://docs.dremio.com/rest-api/catalog/post-catalog-collaboration.html user
/catalog/{id}/collaboration/wiki GET https://docs.dremio.com/rest-api/catalog/get-catalog-collaboration.html user
POST https://docs.dremio.com/rest-api/catalog/post-catalog-collaboration.html user

Reflection API:

Path Method Docs Permission
/reflection POST https://docs.dremio.com/rest-api/reflections/post-reflection.html user
/reflection/{id} GET https://docs.dremio.com/rest-api/reflections/get-reflection-id.html user
PUT https://docs.dremio.com/rest-api/reflections/put-reflection.html user
DELETE https://docs.dremio.com/rest-api/reflections/delete-reflection.html user
/dataset/{id}/reflection GET Reflections used on a dataset user
/dataset/{id}/reflection/recommendation POST Reflections recommended for a dataset user

Job API:

Path Method Docs Permission
/job/{id} GET https://docs.dremio.com/rest-api/jobs/get-job.html user
/job/{id}/results GET https://docs.dremio.com/rest-api/jobs/get-job.html user
/job/{id}/cancel POST https://docs.dremio.com/rest-api/jobs/post-job.html user
/job/{id}/reflection/{reflectionId} GET Retrieval of a reflection job status user
/job/{id}/reflection/{reflectionId}/cancel POST Cancellation of a running reflection job user

SQL API:

Path Method Docs Permission
/sql POST https://docs.dremio.com/rest-api/sql/post-sql.html user

User API:

Path Method Docs Permission
/user POST User creation admin
/user/{id} GET https://docs.dremio.com/rest-api/user/get-user.html user
PUT User update user
/user/by-name/{name} GET https://docs.dremio.com/rest-api/user/get-user.html user

Cluster Statistics API:

Path Method Docs Permission
/cluster/stats GET Stats about sources, jobs and reflections admin

Job Statistics API:

Path Method Docs Permission
/cluster/jobstats GET Stats about the number of jobs per type over ten days admin

User Statistics API:

Path Method Docs Permission
/stats/user GET Stats about user activity admin

Info API:

Path Method Docs Permission
/info GET Basic information about Dremio user

Source API (deprecated in favour of Catalog API, will be removed):

Path Method Docs Permission
/source GET https://docs.dremio.com/rest-api/sources/get-source.html user
POST https://docs.dremio.com/rest-api/sources/post-source.html user
/source/{id} GET https://docs.dremio.com/rest-api/sources/get-source.html user
PUT https://docs.dremio.com/rest-api/sources/put-source.html user
DELETE https://docs.dremio.com/rest-api/sources/delete-source.html user
/source/type GET https://docs.dremio.com/rest-api/sources/source-types.html user
/source/type/{name} GET https://docs.dremio.com/rest-api/sources/source-types.html user