This document provides guidance to start a successful integration with Kubit’s Self-Service Analytics. It covers the context and detailed steps required to share your data with Kubit and to get your metrics configured properly. It also includes best practices in data model design and implementation for reference.
With secure cloud data warehouse sharing, you can share your data with Kubit in real time without the hassle of copying it or introducing complicated batch jobs. While you keep full control over your data and maintain a Single Source of Truth, all the computation cost involved in Self-Service Analytics is covered by Kubit. This is the essence of Kubit’s No-Code Analytics offering.
- You already have all your analytics-related data stored in a cloud data warehouse like Snowflake, BigQuery, Redshift, Azure or Vertica.
- Your data engineers will configure secure data share in your cloud data warehouse to grant Kubit read-only access to your data.
- If live analytics is in the PoC scope, please make sure your data is kept up to date at all times.
- You have detailed documentation about your data model and event definitions (eg a Data Dictionary).
- If full documentation is not yet available, you will need to invest more time passing along the knowledge and answering questions.
- This process often requires various members of your team to be involved. Don’t worry: Kubit has a built-in Data Dictionary feature to help you build this knowledge base collaboratively, on the fly.
- There is a basic understanding of, and agreement on, the metrics (KPIs) and dimensions that your team wants.
- The PoC project will cover a limited number of metrics and dimensions, though all the tools and features are enabled for you to build your own. That’s the essence of Self-Service Analytics.
- Keep in constant communication with the Kubit team about any changes to the data model, data dictionary and metric definitions. Some changes are trivial to incorporate if planned ahead of time; others may require invalidating and updating existing analyses.
- Provide feedback and share your thoughts with Kubit.
- Inventory and get your data ready in your data warehouse.
- Discuss the scope of the PoC with Kubit: KPIs, metrics, dimensions and which data is needed.
- Share access to your data with Kubit through secure data share.
- Kubit defines the KPIs and dimensions.
- Provide credentials to Kubit so users can be provisioned to access the PoC.
- Play with Self-Service Analytics and provide feedback.
About Your Data
Your data should already be stored in data warehouse tables for analytics purposes. They should also be updated frequently to stay live. Here are the key considerations for the PoC and some best practices.
Data model is sometimes presented as an ER (Entity Relationship) diagram. It captures the relationship between all the entities in your database.
It is never a good idea to just dump your transactional data (i.e. the database tables used for serving your users) into the data warehouse for analytics purposes. A star schema is the recommended pattern for storing analytics data.
- Fact tables: Events, Transactions (eg Purchases, Subscriptions)
- Dimension tables: Users, Campaigns, Attributions
Below is a design sample of the proposed data model. If your current data model is different, don’t worry, there are many creative ways to address that by collaborating with your data engineers.
The fact tables store information about when something happens (events), some inline properties (eg timestamp, country, app_id, app_version), and some dimension keys which can be used to look up dimension values in dimension tables (eg user_id to look up a user’s email in the users table, campaign_id to look up a campaign’s start and end dates).
In most scenarios, it is recommended to use inline properties on the events as much as possible, i.e. capture most dimension values as properties and store them on every event. This simplifies queries and also improves analytics performance dramatically. Modern analytical data warehouses like Snowflake have built-in optimizations to store duplicated column values very efficiently. If you are interested in this topic, read more about columnar databases.
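To make the star-schema layout concrete, here is a minimal sketch in SQL. All table and column names are hypothetical; your actual schema will differ.

```sql
-- Hypothetical fact table: inline properties (timestamp, country,
-- app_id, app_version) are duplicated on every event, while user_id
-- is a dimension key for occasional joins to the users table.
CREATE TABLE events (
    event_name   VARCHAR,    -- eg 'Sing', 'Purchase'
    event_ts     TIMESTAMP,  -- when the action happened
    user_id      VARCHAR,    -- dimension key into users
    country      VARCHAR,    -- inline property
    app_id       VARCHAR,
    app_version  VARCHAR
);

-- Hypothetical dimension table
CREATE TABLE users (
    user_id      VARCHAR,
    email        VARCHAR,
    install_date DATE
);
```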
How to implement ETL (Extract, Transform and Load) to populate the star schema from your transactional data is out of scope for this guide, but feel free to contact Kubit support for additional information or advice on this topic.
Every user, whether registered or anonymous (guest), requires a unique identifier for analytics to work. Counting unique users and following a user’s behavior (action events) are the most common analytical measures.
Ideally there should be just one unified ID field (User ID) to identify any user. In reality, it can be much more complicated because the relationship between a person, different apps, their devices and credentials can be very complex, especially with considerations of privacy, fraud, hacking, and device and credential sharing. It is a business decision for your organization to choose the best strategy.
Conceptually, there are several IDs:
- Device ID: IDFV or Android ID (deprecated). Used to identify a physical device, though this concept is going away because of privacy concerns. In most places, people use the Advertiser ID as the Device ID.
- Advertiser ID: IDFA or AAID (Android Advertising ID). Usually used to match with attribution or advertising data from third-party vendors. Keep in mind these IDs may be empty or have a special value when ad tracking is disabled.
- Since iOS 14, ad tracking is now disabled by default.
- User ID: always generate a User ID as soon as the app is launched so every guest/anonymous user can be identified too.
- Account ID: some developers introduce this to identify a registered user across different apps and/or devices. In most cases it brings more pain and confusion than value. It is recommended to use it only in special cases, like analyzing how users use multiple apps on different platforms (iOS, Android and Web).
- Credential (username, email): due to privacy concerns, it is a bad idea to store user credentials in any analytics data.
Generally speaking, the pair of User ID and Advertiser ID can satisfy most of the product analytics use cases.
Events are generated when a certain action takes place, for example: Login, Sing, Listen, Page View, Button Click. Events are the most critical data structure for analytics.
It is strongly recommended to have a Data Dictionary (as simple as a Google Sheet) in place to capture the definition of every event:
- Event name: a key (unique short name) and a business name (easy to read and understand).
- Trigger condition: when this event is triggered.
- Properties: the content and meaning of each property stored in this event.
- Owner: who to talk to if a change is needed.
- Version history: what has changed in this event over time. This information is often critical to communicate.
Events need to be instrumented in code on both the mobile (in-app) and server (backend) sides, because some events are triggered by user actions and some by backend processes (eg subscription renewal, push notifications sent).
If you are starting from scratch, there is no need to reinvent the wheel. It is recommended to use a CDP (Customer Data Platform) like Segment, mParticle, RudderStack or Snowplow to give you a mature SDK and the maximum flexibility to control where the data goes.
Technology aside, the key to a successful instrumentation project is communication. Keep in mind that the engineers who inject the events into the code flow don’t usually know how these events are used in analytics. Providing this context is often critical to avoid bugs and surprises down the line.
There are two kinds of properties for the events: Common Properties and Event Properties. A property can be categorical (enumeration of values) or numerical (unbounded continuous values).
These are the properties that every event should have. Usually they are used in filters or breakdowns as dimension attributes. Here are some examples:
- Timestamp: when an event happens. Sometimes it is necessary to have both Client Timestamp (client clock when action is taken) and Server Timestamp (server clock when the event is processed) since the mobile device clock may not be trusted.
- Date, Week, Month, Quarter, Year: it is strongly recommended to also store these parts of the timestamp separately, since analytics is often based on these time buckets.
- Fiscal Quarter, Fiscal Year: only if your organization has the fiscal concept and wants to use them in analytics.
- Device ID: often this is the Advertiser ID
- User ID: the user identifier
- App: the id or name of the app
- App Version: the version of the app
- Country: country where the app is installed (usually from the device)
- Locale: device locale (language + country)
- Gender: user’s gender information
- Birth year: used to compute user age at query time
- Device model: the model of the device, eg iPhone 11 Pro, Pixel 4
- OS Version: version of the OS on the device, eg iOS 13.6.7
- A/B Test: test id and group assignment of an A/B test
- Install Date: when the app was installed, used to compute Install Days and group users by how long they have had the app installed. It is also critical for Dn/Wn metrics.
Each event can have its own set of properties depending on the context. It is recommended to store them in a set of named generic properties instead of introducing ad-hoc properties on the fly. This way the database table schema stays clean and efficient. These generic properties can be explained and managed using the Data Dictionary. Here are some examples:
- Context: capture the context information of the event. For example, it can be the flow where the user is in (eg Registration, Purchase), or the feature section (eg Songbook, Checkout).
- Target: the target of the event. For example: the song id for Sing event; the social channel (eg Facebook, Instagram) for Share event; the SKU for a subscription event.
- Value: the value associated with the event. For example, the price for a Subscription or Purchase event.
- K1, K2, …Kn: more named properties slotted for different events.
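To illustrate how these generic slots are used, consider a Share event whose social channel is stored in the target slot. A query against such a table (names are hypothetical) might look like:

```sql
-- Count Share events per social channel, read from the generic "target" slot
SELECT target AS social_channel,
       COUNT(*) AS shares
FROM events
WHERE event_name = 'Share'
GROUP BY target;
```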
Any timezone conversion at query time will kill performance. For simplicity and efficiency, all time-related data should be stored as the Timestamp or Date data type, which indicates a certain point in time and has no timezone concept.
The complex issue of timezones arises when any date or time is presented in text format, as in every analytics report with dates (MM/DD/YYYY). It is recommended that the organization decide on one timezone to use across the board for all analytics purposes. Usually the selected timezone is where the headquarters is based, since it is much easier to talk about local dates in business settings.
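One way to follow this advice is to derive the date buckets once during ETL so that queries never convert timezones. A sketch in Snowflake SQL (the raw_events table and column names are assumptions):

```sql
-- Derive date buckets once at load time instead of at query time
SELECT event_ts,
       TO_DATE(event_ts)               AS event_date,
       DATE_TRUNC('week', event_ts)    AS event_week,
       DATE_TRUNC('month', event_ts)   AS event_month,
       DATE_TRUNC('quarter', event_ts) AS event_quarter,
       YEAR(event_ts)                  AS event_year
FROM raw_events;
```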
Share Your Data
By allowing Kubit to securely access your cloud data warehouse directly, no actual data is copied or transferred between accounts. It doesn’t require any engineering work to create ETL or batch jobs, nor does it take up any extra storage. With this approach, Kubit can deliver instant product insights from real-time live data based on a Single Source of Truth.
All access goes through secure, encrypted communication channels. Meanwhile, your engineering team maintains full control of the data share by limiting its permissions and visibility.
Kubit IP Ranges
Snowflake Secure Data Share allows your data engineers to share a database, schema, table or view with Kubit, with read-only permission and real-time live data. A view can be used to join multiple tables across different databases and to constrain which columns are exposed. You are the Provider while Kubit is the Consumer of the share. It just takes four simple steps in SQL:
- Create the view: or select which databases, tables and/or schema
- Create a share: give it a name
- Grant permissions: grant usage permissions
- Share it with Kubit's Snowflake account depending on your Snowflake region (please confirm with Kubit support first)
- US West Coast: nk15162
- US East Coast: kubit
- EU: LP16358
Here is an example script (sharing to nk15162 on US West Coast):
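The following is a minimal sketch of the four steps, assuming a database named analytics with schema public and an events table; your actual database, schema, view and share names will differ.

```sql
-- 1. Create a secure view over the data to be shared
CREATE SECURE VIEW analytics.public.events_share_view AS
    SELECT event_name, event_ts, user_id, country, app_version
    FROM analytics.public.events;

-- 2. Create a named share
CREATE SHARE kubit_share;

-- 3. Grant usage and select permissions to the share
GRANT USAGE ON DATABASE analytics TO SHARE kubit_share;
GRANT USAGE ON SCHEMA analytics.public TO SHARE kubit_share;
GRANT SELECT ON VIEW analytics.public.events_share_view TO SHARE kubit_share;

-- 4. Add Kubit's account (US West Coast) to the share
ALTER SHARE kubit_share ADD ACCOUNTS = nk15162;
```

Because the Consumer queries the share through its own virtual warehouse, these grants incur no compute cost on your side.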
Snowflake also offers a web interface to review and manage data shares. For details on this topic, please review Snowflake’s documentation.
It is strongly recommended to use proper Clustering Keys on your database tables to improve query performance. Typically, at least the event date and event name columns should be part of the clustering key, since they appear in the filters of most queries.
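For example, in Snowflake syntax (the table and column names are assumptions):

```sql
-- Cluster the events table on the columns most queries filter by
ALTER TABLE events CLUSTER BY (event_date, event_name);
```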
With Secure Data Share, no actual data is copied or transferred between accounts. All sharing is accomplished through Snowflake’s unique services layer and metadata store. This is an important concept because it means that shared data does not take up any extra storage for either the Provider or Consumer. Since the same underlying data is being shared, the data is guaranteed to be live and without delay.
More importantly, the computing environment is completely separated. The Consumer (Kubit) must create and use their own Virtual Data Warehouse and pay for it themselves. There is absolutely no impact on the Provider’s cost, performance or SLA for their Snowflake services.
Additionally, using Snowflake’s built-in roles and permissions capabilities, access to data share can be controlled and governed using the access controls already in place for your Snowflake account and the data therein. Access can be restricted and monitored the same exact way as your own data.
Google’s BigQuery allows you to share a dataset to Kubit securely through Service Account. Because BigQuery separates computing from storage, Kubit will cover the cost of querying and won’t impact your SLA.
Share a Dataset
- Kubit will provide you with a Kubit Service Account Email, typically in the format <YOUR_ORG>@kubit-oauth.iam.gserviceaccount.com
- You can also create a Service Account yourself and share the JSON key file with Kubit. Please make sure to add firstname.lastname@example.org to the "Service account users role" in step 3, "Grant user access to this service account (Optional)".
- In the BigQuery console, select a dataset from Resources, then click SHARE DATASET near the right side of the window.
- In the Share dataset panel, in the Dataset permissions tab, enter the previously created Kubit Service Account Email to the Add members field. Then select the role BigQuery Data Viewer, click Add, then click DONE.
- If you need to share multiple datasets from different GCP projects, you can just use the same Kubit Service Account Email when sharing every dataset.
Create Service Account
Enable BigQuery API
- If you haven’t done so yet, please go to Google Cloud API & Services - Dashboard. Click “+ ENABLE APIS AND SERVICES”
- Search for “BigQuery API”
- Select it and click ENABLE
Create Service Account
- Open the credentials page in Google Cloud Platform API Manager (if needed, select your target project from the dropdown list on the top).
- Click “CREATE CREDENTIALS” and choose “Service account”
- Enter “Kubit” as the name for the service account. Then click CREATE.
- Note: "Service account email" shown on the screen (eg kubit-###@<GCP-PROJECT_NAME>.iam.gserviceaccount.com) is "Kubit Service Account Email"
- Enter email@example.com as “Service account users role”. Click DONE.
- Select the newly created Service account. Click Keys - ADD KEY - Create new key.
- Select JSON in the popup window. Click CREATE
- A JSON private key file will be downloaded to your computer. Please send this file to Kubit securely. Click CLOSE.
Kubit can access your data in your Amazon Redshift cluster directly with restricted access, or through S3 storage sharing.
- Configure your Redshift to be publicly accessible (with access control)
- The associated Security Group should whitelist port 5439 to the Kubit IP ranges.
- Create a new user in the shared database. Please consider Redshift’s password requirements when generating the Kubit user’s password:
- CREATE USER kubit WITH PASSWORD '<password>';
- Grant the Kubit user permissions to access all of your data:
GRANT SELECT on ALL TABLES IN SCHEMA public TO kubit;
- Or you can limit the access to specific tables or views:
GRANT USAGE ON SCHEMA public TO kubit;
GRANT SELECT ON TABLE public.table_1 TO kubit;
GRANT SELECT ON TABLE public.table_2 TO kubit;
Kubit can access your Presto cluster with restricted access, or through S3 storage sharing.
- Configure your Presto cluster to be publicly accessible (with access control).
- The firewall associated with the Presto cluster should whitelist port 8080 (or another custom port) to the Kubit IP ranges.
- Create a principal, named kubit in your Presto's authentication provider (e.g. LDAP, Kerberos, etc.).
- Add the following rule to your System Access Control file to grant read-only access to certain catalogs.
- <data_catalog> is the name of the catalog you would like to share with Kubit. You can share multiple catalogs by listing them all, delimited by |. For example: (data_1|data_2|data_3).
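With file-based system access control, a read-only catalog rule for the kubit principal might look like the following sketch. The exact rule syntax varies between Presto distributions and versions, so verify it against your distribution’s documentation.

```json
{
  "catalogs": [
    {
      "user": "kubit",
      "catalog": "(data_1|data_2|data_3)",
      "allow": "read-only"
    },
    {
      "catalog": ".*",
      "allow": "all"
    }
  ]
}
```

The trailing catch-all rule preserves access for your existing users; adjust it to match your current policy.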
S3 Storage Sharing
If your raw data is stored in S3 buckets, instead of opening direct access to your data warehouse, you can choose to share the S3 storage with Kubit.
Kubit will provide the ID of a role used for accessing bucket data directly (without the need to assume your roles). All you need to do is add the following statements to the S3 bucket policy (sample). For more details, please consult the Kubit support team.
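A sketch of such bucket-policy statements; the Kubit account ID, role name and bucket name below are placeholders to be replaced with the values Kubit provides.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "KubitListBucket",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<KUBIT_ACCOUNT_ID>:role/<KUBIT_ROLE>" },
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::<YOUR_BUCKET>"
    },
    {
      "Sid": "KubitReadObjects",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<KUBIT_ACCOUNT_ID>:role/<KUBIT_ROLE>" },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::<YOUR_BUCKET>/*"
    }
  ]
}
```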
- For parquet files, Snappy compression is recommended
- A Hive directory structure (directories organized by data indexes and partitions) is recommended to get best performance. There are numerous libraries in each programming language to output this structure.
- Directory pattern: s3://<YOUR_BUCKET>/.../<TABLE_NAME>/<PARTITION_COLUMN_NAME>=<VALUE>/<PARTITION_COLUMN_NAME>=<VALUE>/
- Usually we suggest the following partition columns (in order): event_date (or event_timestamp), event_name. In such case, the directory would look like this: s3://<YOUR_BUCKET>/.../<TABLE_NAME>/event_date=<DATE_VALUE>/event_name=<EVENT_NAME_VALUE>/1.parquet
This section provides guidance on how to convey your organization’s definitions of metrics (KPIs) and dimensions for the PoC.
Please provide a list of top metrics for your product analytics needs. Most of these metrics can be presented in simple math formulas with measure functions, along with some filter conditions. For example:
- Engagement: DAU, MAU - unique users in a day or month
- Retention: D2, W2 Retention - % of users who retained (came back) on day 2 or in week 2
- Activity: Likes/DAU, Shares/DAU, Listens/DAU - count of an action event divided by DAU.
- It is always recommended to use DAU or MAU as the denominator to measure rates instead of absolute volumes
- Monetization: ARPU, Attach Rate - avg(revenue)/DAU, % of users who subscribed
- Funnel: Registration, Purchase - unique users taking action at each step of the funnel during a certain time period
- Performance: Download Failure Rate - count(failure) / total(start, fail, cancel, end)
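As an illustration, a couple of these metrics expressed as SQL against an events table (table and column names are hypothetical, and the syntax assumes a Snowflake-style warehouse):

```sql
-- DAU: unique users per day
SELECT event_date, COUNT(DISTINCT user_id) AS dau
FROM events
GROUP BY event_date;

-- Shares/DAU for a single day
SELECT COUNT(CASE WHEN event_name = 'Share' THEN 1 END)
       / COUNT(DISTINCT user_id) AS shares_per_dau
FROM events
WHERE event_date = '2021-06-01';
```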
Please provide the definitions of all the dimensions needed in the analytics. Most of these dimensions can come from the events’ properties. Some dimensions require joining with dimension tables at query time. Here is the information needed:
- Name: display name of the dimension.
- For inline dimensions:
- Name of the event property
- For join dimensions:
- Dimension table name
- Dimension value column: the column on the dimension table to be used for dimension values
- Join key: event property and dimension table column for join
- Join condition: inner (only matching rows return) or outer (rows without a match use NULL)
- Special join conditions: first touch, last touch, time window
A Cohort is a group of users matching certain criteria. You can define cohorts to be used for breakdowns in any Formula, Prediction or Braze Integration, using definitions much like a query, with measures, filters and conditions. For example:
- New Users: count(first Launch) = 1 in last 7 days
- Frequent Singers: count(Sing) > 5 in last 1 day
- Whales: sum(Revenue on Purchases) > 100 in last 30 days
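For example, the Whales cohort could translate into a query like the following sketch (assuming an events table with a numeric value property holding revenue; names and DATEADD syntax are Snowflake-style assumptions):

```sql
-- Whales: users whose Purchase revenue exceeds 100 in the last 30 days
SELECT user_id
FROM events
WHERE event_name = 'Purchase'
  AND event_date >= DATEADD(day, -30, CURRENT_DATE)
GROUP BY user_id
HAVING SUM(value) > 100;
```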
For a SaaS solution, security is always at the top of the priority list at Kubit.
- Through Secure Data Sharing, Kubit doesn’t store or change your data. You have full control over what data is shared and the permissions.
- You can avoid exposing tables or columns related to user privacy (like email, username, phone number, etc.) to Kubit. Analytics mostly works with aggregated data instead of each user’s private information.
- All network communication goes through the HTTPS encrypted channel, including when Kubit talks to your data warehouse and the analytics web interface in the browser.
- Kubit’s cloud infrastructure takes all the necessary steps to enforce data security and isolation, including VPC (Virtual Private Cloud) and encrypted network traffic between all components.
- Strict access control and processes are enforced within the Kubit organization to limit and monitor which employees have access to customer data.
Kubit uses Auth0, the leading Enterprise Authentication and Authorization Platform, to handle all user security and management.
If your organization is using enterprise user identity services like Google G-Suite, Active Directory or LDAP already, Single Sign On can be enabled with Kubit through Auth0. There are no extra credentials for users to remember and your IT department has full control over who has access to Kubit.
To start simple, you can also just provide several users’ email addresses to get them provisioned on Kubit directly. They can log in to Kubit with email and password. MFA (Multi-Factor Authentication) can also be enabled for these users for added security.
Support is provided during business hours (Pacific Time) through in-app chat. Your team members can always email firstname.lastname@example.org, go through the dedicated Slack support channel, or contact the Account Manager for direct support needs.