Pilot Project Guide
This document provides the guidance to start a successful proof of concept project with Kubit’s Self-Service Analytics. It covers the contexts and details steps required to share your data with Kubit and also how to get your metrics configured properly. It also includes the best practices in data model design and implementation as a reference.
With secure cloud data warehouse sharing, you can share your data with Kubit in real-time without the hassle of copying them, or introducing complicated batch jobs. While having full control over your data and maintaining a Single Source of Truth, all the computation cost involved with Self-Service Analytics will be covered by Kubit. This is the essence of No-Code Analytics offering from Kubit.
Your data should already be stored in data warehouse tables for analytics purposes. They should also be updated frequently to stay live. Here are the key considerations for the PoC and some best practices.
Data model is sometimes presented as an ER (Entity Relationship) diagram. It captures the relationship between all the entities in your database.
It is never a good idea to just dump your transactional data (i.e. database tables used for serving your users) into the data warehouse for analytics purposes. A star schema is the most recommended pattern for storing analytics data.
Below is a design sample of the proposed data model. If your current data model is different, don’t worry, there are many creative ways to address that by collaborating with your data engineers.
The fact tables store information about when something happens (events), some inline properties (eg timestamp, country, app_id, app_version), and some dimension keys which can be used to lookup for the dimension values in dimension tables (eg user_id to lookup user’s email in users table, campaign_id to lookup for a campaign’s start and end dates).
In most scenarios, it is recommended to use inline properties on the events as much as possible, i.e. capture most dimensions’ values as properties and store them on every event. It simplifies the query and also improves analytics performance dramatically. Modern analytical data warehouses like Snowflake have built-in optimizations to store duplicated column values very efficiently. If you are interested in this topic, read more about Columnar Database.
How to implement ETL (Extract, Transform and Load) to populate the star schema from your transactional data is out of the scope for this guide. Though feel free to contact Kubit support to get additional information or advice on this topic.
Every user, regardless of registered or anonymous (guest), requires a unique identifier for the analytics to work. Counting unique users and following a user’s behavior (action events) are the most common analytical measures.
Ideally there should just be one unified id field (User ID) to identify any user. But in reality, it could be much more complicated because the relationship between a person, different apps, their devices and credentials can be super complex. Especially with considerations of privacy, fraud, hacking, device and credential sharing. It is a business decision for your organization to choose the best strategy.
Conceptually, there are several IDs:
Generally speaking, the pair of User ID and Advertiser ID can satisfy most of the product analytics use cases.
Events are generated when a certain action takes place. For example: Login, Sing, Listen, Page View, Button Click. It is the most critical data structure for analytics.
It is strongly recommended to have a Data Dictionary (as simple as a Google Sheet) in place to capture the definition of every event:
Events need to be instrumented in the code on both mobile (in the app) and server (backend) sides because some events are triggered by user actions, some by backend processes (eg subscription renewal, push notifications sent).
If you are starting from scratch, there is no need to reinvent the wheel. It is recommended to use a CDP (Customer Data Platform) like Segment, mParticle, RudderStack or Snowplow to give you a mature SDK and the maximum flexibility to control where the data goes.
Technology aside, the key of a successful instrument project is communication. Keep in mind that the engineers who are injecting the events into code flow don’t usually know how these events are used in analytics. Providing this context information is often critical to avoid bugs and surprises down the line.
There are two kinds of properties for the events: Common Properties and Event Properties. A property can be categorical (enumeration of values) or numerical (unbounded continuous values).
These are the properties that every event should have. Usually they are used in filters or breakdowns as dimension attributes. Here are some examples:
Each event can have its own set of properties depending on the context. It is recommended to store them in a set of named generic properties instead of introducing ad-hoc properties on the fly. This way the database table schema can be kept clean and efficient. These generic properties can be explained and managed using Data Dictionary. Here are some examples:
Any timezone conversion during query time will kill the performance. For simplicity and efficiency, all the data related to time should be stored as the Timestamp or Date data type, which indicates a certain point in time and has no timezone concept.
The complex issue of timezone arises when any date or time is presented in text format, as in every analytics report with dates (MM/DD/YYYY). It is recommended for the organization to decide on using one certain timezone across the board for all analytics purposes. Usually the selected timezone is where the headquarter is based since it is much easier to talk about local dates in business settings.
By allowing Kubit to securely access your cloud data warehouse directly, no actual data is copied or transferred between accounts. It doesn’t require any engineering work to create ETL or batch jobs, neither does it take up any extra storage. With this approach, Kubit can deliver instant product insight from the real-time live data based on a Single Source of Truth.
All the access goes through secure, encrypted communication channels. In the meantime, your engineering team maintains full control of the data share by limiting the permissions and visibility.
Snowflake Secure Data Share allows your data engineers to share a database, schema, table or view to Kubit with read-only permission and real-time live data. A view can be used to join multiple tables across different databases and add constraints to expose which columns. You are the Provider while Kubit is the Consumer of the share. It just takes four simple steps in SQL:
Here is an example script (sharing to nk15162 on US West Coast):
Snowflake also offers a web interface to review and manage data shares. For details on this topic, please review Snowflake’s documentation here.
It is strongly recommended to use the proper Clustering Keys in your database tables to improve query performance. Typically at least the event date column and event name should be part of the clustering key since they are part of the filters in most queries.
With Secure Data Share, no actual data is copied or transferred between accounts. All sharing is accomplished through Snowflake’s unique services layer and metadata store. This is an important concept because it means that shared data does not take up any extra storage for either the Provider or Consumer. Since the same underlying data is being shared, the data is guaranteed to be live and without delay.
More importantly, the computing environment is completely separated. The Consumer (Kubit) must create and use their own Virtual Data Warehouse and pay for it themselves. There is absolutely no impact on the Provider’s cost, performance or SLA for their Snowflake services.
Additionally, using Snowflake’s built-in roles and permissions capabilities, access to data share can be controlled and governed using the access controls already in place for your Snowflake account and the data therein. Access can be restricted and monitored the same exact way as your own data.
Google’s BigQuery allows you to share a dataset to Kubit securely through Service Account. Because BigQuery separates computing from storage, Kubit will cover the cost of querying and won’t impact your SLA.
Kubit can access your data in Amazon Redshift cluster directly with restricted access, or through S3 storage sharing.
CREATE USER kubit WITH PASSWORD '<password>';
GRANT SELECT on ALL TABLES IN SCHEMA public TO kubit;
GRANT USAGE ON SCHEMA public TO kubit;
GRANT SELECT ON TABLE public.table_1 TO kubit;
GRANT SELECT ON TABLE public.table_2 TO kubit;
If your Redshift data is stored in S3 buckets, instead of open direct access to your Redshift cluster, you can choose to share the S3 storage to Kubit.
Kubit will provide the ID of a role used for accessing bucket data. All you need to do is to add the following statements to the S3 bucket policy (sample):
This section provides guidance on how to convey your organization’s definition on metrics (KPIs) and dimensions for the PoC.
Please provide a list of top metrics for your product analytics needs. Most of these metrics can be presented in simple math formulas with measure functions, along with some filter conditions. For example:
Please provide the definitions of all the dimensions needed in the analytics. Most of these dimensions can come from the event's properties. Some dimensions require joining with dimension tables during query time. Here are the information needed:
Cohort is a group of users matching certain criteria. You can define these cohorts to be used for breakdowns in any Formula, Prediction and Braze Integration using very similar definitions like a query with measures, filters and conditions. For example:
For a SaaS solution, security is always on top of the priority list at Kubit.
Kubit utilizes Auth0, the leading Enterprise Authentication and Authorization Platform to handle all user security and management.
If your organization is using enterprise user identity services like Google G-Suite, Active Directory or LDAP already, Single Sign On can be enabled with Kubit through Auth0. There are no extra credentials for users to remember and your IT department has full control over who has access to Kubit.
To start simple, you can also just provide several users’ email addresses to get them provisioned on Kubit directly. They can login to Kubit with email and password. Also MFA (Multi-Factor Authentication) can be enabled for these users too to add more security.
Support is provided during business hours (Pacific Time) through in-app chat. Your team members can always email firstname.lastname@example.org, going through dedicated Slack support channel or contact the Account Manager for direct support needs.