Data Platform For Startups: An Overview
Building a data platform might initially seem daunting, but the right tools can simplify the process considerably. CDPs, ELT tools, a scalable storage medium, and simple data modeling tools can make the investment worthwhile.
We live in an age where you can set up the whole data platform in under 1 hour, and a typical startup can leverage the benefits without needing a data expert team or spending money.
What is a Data Platform?
A data platform is a piece of infrastructure that any data-driven company uses to collect all its relevant data. The data may come from the product, any backend solution, or the external services the company uses for operations (sales, marketing, and finance tools and services).
The purpose of the data platform is to collect and centralize the data in a data warehouse or data lake where data analysts and engineers can process it and derive valuable information.
This post uncovers the best options for a startup to build its first data platform. Before diving into the details, let's examine the pros and cons of maintaining a data platform and warehouse for startups.
Pros:
- Single source of truth - A startup can store and access all of its data about the business in a single place. Typically, without a data warehouse, data is scattered in various tools. Decentralized data leads to the "disjoint investigation" problem (data is impossible to join), which slows decision-making.
- Easy accessibility - If all data is in a single place, stakeholders can access the data easily with the right tools. There is no need to learn multiple tools and semantics for the same business concept.
- Low cost - Traditionally, data warehouses are associated with high costs. I would argue that this is not true anymore, especially with the introduction of serverless data warehouses. Even with millions of rows of data in the data warehouse, a startup can keep operational and maintenance costs near zero.
Cons:
- Need for data experts - Some companies are lucky enough to have a core team that is fluent in SQL. For the rest, introducing the data warehouse comes with an additional hiring burden. In my experience, an analytics engineer is the best first data hire.
- Data quality maintenance - Keeping data quality at a level where stakeholders trust the data is challenging. Luckily, DBT and similar data modeling tools have built-in testing and validation capabilities.
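To make the data quality point more concrete, here is a minimal sketch of the two most basic checks (no missing values, no duplicate values) written as plain SQL run from Python against an in-memory SQLite database. The `users` table, its columns, and the `failing_rows` helper are hypothetical and only serve the illustration; a modeling tool would let you declare such checks instead of hand-writing them.

```python
import sqlite3


def failing_rows(conn, table, column):
    """Count NULL values and duplicated values in one column.

    Table and column names are trusted here because this is a sketch;
    do not build SQL from untrusted input this way.
    """
    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    ).fetchone()[0]
    duplicated_values = conn.execute(
        f"SELECT COUNT(*) FROM ("
        f"SELECT {column} FROM {table} WHERE {column} IS NOT NULL "
        f"GROUP BY {column} HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    return {"not_null": nulls, "unique": duplicated_values}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [
        ("u1", "a@example.com"),
        ("u2", "b@example.com"),
        ("u2", "c@example.com"),  # duplicated user_id
        (None, "d@example.com"),  # missing user_id
    ],
)

report = failing_rows(conn, "users", "user_id")
print(report)
```

A non-zero count in either entry means the column would fail that check and the table should not be trusted as-is.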
With that out of the way, let's look at the best practices for getting started with a data platform.
The Goal
There are millions of ways you can build a data platform. I will focus on keeping the operational and maintenance costs at a minimum. Since this blog post is aimed at startups, I will assume that you have a limited number of tracked data points (under 100 million). However, I will keep in mind that your startup might grow, so the solution needs to be scalable and robust.
The other goal is to have it ready in under 1 hour.
I suggest companies do not pick a single "magic" technology that seemingly can solve all data problems as it might become costly later, or you will hit a wall for some use cases.
The Solution
There are four essential infrastructure pieces for a data platform.
- Data storage (data warehouse or data lake)
- Customer Data Platform
- ELT solution
- Data modeling solution
Data storage
As mentioned before, it is best to pick a serverless solution for a startup.
What is a serverless data warehouse, you might ask?
In short, you pay only for the resources your SQL queries actually use while they run, not for a server that sits idle in between. This is great, especially since most offerings come with a free package. So, you might stay in the free tier for years.
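The pay-per-use idea boils down to simple arithmetic, sketched below. The function name, the billing unit, and every number are hypothetical and chosen only to show the shape of the bill; real vendors meter different things (for example compute time or data scanned), so check the pricing page of the warehouse you pick.

```python
def monthly_query_cost(usage, free_allowance, unit_price):
    """Bill only the usage that exceeds the monthly free allowance.

    All three arguments share one hypothetical billing unit
    (for example compute-hours or terabytes scanned).
    """
    billable = max(0.0, usage - free_allowance)
    return billable * unit_price


# Hypothetical numbers, not taken from any vendor's price list.
small_startup = monthly_query_cost(usage=0.4, free_allowance=1.0, unit_price=5.0)
growing_startup = monthly_query_cost(usage=3.0, free_allowance=1.0, unit_price=5.0)

print(small_startup)
print(growing_startup)
```

As long as usage stays below the free allowance, the bill stays at zero, which is why a small startup can run on the free tier for a long time.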
Best technologies for startups:
PostgreSQL (Multiple very cheap offerings)
- Cloud options: Neon.tech, AWS RDS, Digital Ocean, etc. Neon.tech has a free tier, which makes it very appealing.
- Postgres is a well-known and documented database technology
- Great community
- Great feature set
- Integrates very well with other tools
- Free for a long time
- Easy to use, hard to master.
- Extremely fast and scalable solution.
- It starts very cheap
- Data analysts tend not to like it due to unique JOIN mechanisms and missing SQL features
- It requires a specialized data engineering team
- Very generous pricing
Generally, I don't recommend starting with Snowflake or Databricks as your first data warehouse solution. They typically require costly specialists, and their operating expenses are also high.
Rule of thumb: If you start with warehousing, use PostgreSQL (for example, Neon.tech) or BigQuery. Both will serve you well up to millions of data points and thousands of active users, essentially without cost.
Customer Data Platform
A Customer Data Platform (CDP) is a specialized service that aggregates and organizes customer data from multiple sources into a single, unified database. The primary use case for CDP is to collect usage events from the startup's product.
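To give a feel for what a usage event looks like, here is a minimal sketch of an event record built in Python. The `build_track_event` helper and all field names are hypothetical and are not the schema of any particular CDP vendor; real SDKs build and send a similar payload for you.

```python
import json
import uuid
from datetime import datetime, timezone


def build_track_event(user_id, event, properties=None):
    """Build one usage event as a plain dictionary (hypothetical schema)."""
    return {
        "message_id": str(uuid.uuid4()),  # lets the receiver drop duplicates
        "user_id": user_id,
        "event": event,
        "properties": properties or {},
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


event = build_track_event("user_123", "Signed Up", {"plan": "free"})
print(json.dumps(event, indent=2))
```

Each such record typically ends up as one row in an events table in the data warehouse, where it can be joined with the rest of your data.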
Best technologies for startups:
- Industry-leading solution
- Free trial only
- After the trial, it starts very cheap but can be costly later on. I have seen multiple companies having to get rid of Segment due to its high cost.
- Warehouse-native CDP. It stores all customer data only in the data warehouse, unlike Segment, which holds metadata about the users internally.
- Open-source and cloud offering
- They offer a great Free tier option.
- Open-source, you can self-host it. Self-hosting might be more expensive than their cloud offering.
- It has a free tier that is great to start with
- Great connector options
- Event transformations with JavaScript
- Segment SDK compatibility
Rule of thumb: If you start with warehousing and have many startup credits for Segment, then go with Segment. Beware, you might receive a surprise bill once your credit runs out.
If you don't have Segment credits, RudderStack and Jitsu are great options. I love that they are open source. That gives extra peace of mind, since vendor lock-in is not an issue.
ELT Solution
An ELT (extract, load, transform) solution is a service that can move data from external tools and services to your data warehouse. Why would you need this?
You need the external services' data in the data warehouse to create a complete picture of your users. For example, to understand which users are paying, you typically need data from a payment service such as Stripe, Paddle, or Chargebee, and you need to bring that data into your data warehouse. Another excellent use case is marketing analytics. You want to connect your cost data to customer payment data to understand the return on investment.
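The marketing use case can be sketched with a small query once cost and payment data sit in the same database. The example below uses an in-memory SQLite database from Python; the table names (`ad_spend`, `signups`, `payments`), the columns, and the numbers are all hypothetical stand-ins for what an ELT tool and a CDP would load.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ad_spend (channel TEXT, cost REAL);
CREATE TABLE signups (user_id TEXT, channel TEXT);
CREATE TABLE payments (user_id TEXT, amount REAL);

INSERT INTO ad_spend VALUES ('search', 100.0), ('social', 50.0);
INSERT INTO signups VALUES ('u1', 'search'), ('u2', 'search'), ('u3', 'social');
INSERT INTO payments VALUES ('u1', 120.0), ('u2', 80.0), ('u3', 25.0);
""")

# Revenue per acquisition channel, joined to what was spent on that channel.
ROI_SQL = """
WITH revenue AS (
    SELECT s.channel, SUM(p.amount) AS revenue
    FROM signups AS s
    JOIN payments AS p ON p.user_id = s.user_id
    GROUP BY s.channel
)
SELECT a.channel, a.cost, r.revenue, (r.revenue - a.cost) / a.cost AS roi
FROM ad_spend AS a
JOIN revenue AS r ON r.channel = a.channel
ORDER BY a.channel
"""

roi_rows = conn.execute(ROI_SQL).fetchall()
for channel, cost, revenue, roi in roi_rows:
    print(channel, cost, revenue, roi)
```

The join is only possible because all three tables live in one place; with the data scattered across the ad platform, the product database, and the payment provider, the same question would need manual exports.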
There is an ocean of ELT solutions out there. The game is about the covered list of integrations. You can be sure that at least one or two of your external services won't be covered, no matter your chosen solution.
Best technologies for startups:
- Industry-leading ELT solution
- Somewhat pricier than the others, but not significantly
- It has a free tier!
- It is hard to estimate how much credit you are going to use.
- Open source alternative
- Fair pricing
- Some connectors are not yet production-ready, which can cause some headaches.
- Vendor lock-in is less of an issue here as their core connectors are open-source.
- Most connectors are based on the Singer open-source library.
- Predictable pricing
- In my experience, they offer the fewest connectors.
- Simple UX, with only a few options for anything other than copying data to your data warehouse. Most likely, though, that is the only thing you need.
Case of CDPs as ELT solutions
Some CDPs can work as ELT/ETL solutions. Using the same service for CDP as ELT could make sense. The only problem is that the CDPs will likely cover even fewer integrations than most ELT solutions. You will pick an ELT solution anyway, and suddenly, you will maintain and pay for two services.
Rule of thumb: I suggest only using a CDP for ELT if you are sure you will only need sources they offer. As an ELT solution for startups, I recommend the option that covers most of your external sources. It is unlikely that an ELT solution will break the bank if you are a startup. Sales, marketing, and finance data in a data warehouse is tiny compared to product data. As an engineer, I would tilt towards Airbyte because they have an open-source version. However, Fivetran's free tier option is excellent as well.
Data Modeling
Once you have your product data (with the help of CDPs) and the data from your external services (with the use of ELT tools) in the data warehouse, you need a data modeling tool that can help:
- Clean the data in case there are some mistakes in it.
- Transform and aggregate it to learn about your customers or the business.
Another blog post will cover the data modeling process and best practices.
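As a small taste of the two steps above, here is a sketch that cleans raw events and then aggregates them, using SQL views in an in-memory SQLite database driven from Python. The `raw_events` table, the view names, and the sample rows are hypothetical; a modeling tool would manage similar SQL models for you inside your real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_events (event_id TEXT, user_id TEXT, event TEXT, day TEXT);
INSERT INTO raw_events VALUES
    ('e1', 'u1', 'login', '2024-01-01'),
    ('e1', 'u1', 'login', '2024-01-01'),  -- same event delivered twice
    ('e2', 'u2', 'login', '2024-01-01'),
    ('e3', NULL, 'login', '2024-01-01'),  -- user_id is missing
    ('e4', 'u1', 'login', '2024-01-02');

-- Step 1: clean. Drop exact duplicates and rows without a user.
CREATE VIEW clean_events AS
SELECT DISTINCT event_id, user_id, event, day
FROM raw_events
WHERE user_id IS NOT NULL;

-- Step 2: aggregate. Count distinct active users per day.
CREATE VIEW daily_active_users AS
SELECT day, COUNT(DISTINCT user_id) AS active_users
FROM clean_events
GROUP BY day;
""")

dau = conn.execute(
    "SELECT day, active_users FROM daily_active_users ORDER BY day"
).fetchall()
print(dau)
```

Keeping the cleaning step separate from the aggregation means every downstream model can build on the same cleaned table instead of repeating the fixes.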
The data modeling space is far less crowded than the ELT or CDP space.
Essentially, you have only a couple of options as a startup:
Cloud provider options (AWS Glue, GCP Cloud Dataflow, Azure Data Factory)
- Usually needs a lot of engineering effort to maintain
- It may sound like a good idea to use them since you are already on that cloud provider. However, in the long run, you may find it hard to hire even a single person to maintain your data modeling pipelines.
- DBT is an industry-standard modeling tool
- Easy to learn, easy to hire for
- It has an open-source version.
- SQL + Jinja (recently some Python capabilities) is well understood.
- DBT is not just a tool. It is a framework for how data should be governed and managed. It is much easier to get the modeling right with DBT.
Build your solution
- Generally, it is the worst idea for startups as you don't have the resources and time. Unless someone in your team can copy-paste an existing solution, I advise against this option.
Rule of thumb: Stick to DBT Cloud. You don't need to reinvent the wheel. It takes less than 5 minutes to set up.
Sample Startup Data Platform
As you can see, there are many great options to choose from when building your data stack. Most of them have free or low-cost pricing options. Thanks to the cloud nature of all these solutions, you can easily set them up.
Here is a sample stack you can set up in less than 30 minutes:
This stack is free for a long time. All of these components have a Free Tier that you can leverage. Later, the price will scale reasonably with your usage.
Conclusion
Setting up your startup data platform in the current ecosystem is easier than ever. You can choose between many great building blocks for your stack based on your own preferences and those of your business.
You can even start for free with the solution that I described above. Depending on your use cases, this setup will remain without cost for a long time.
The best part is that once a single component in your data stack becomes too expensive or no longer supports your changing requirements, you can swap it out for another solution. Changing infra in production can be challenging, but it remains manageable because the platform is split into four separate components.