Data is a critical component of success for all fast-growth B2B companies. Getting your infrastructure wrong will hurt both your product and your distribution: knowing what you’re selling to whom, where and when, can be a matter of life and death.
We sat down with our portfolio companies — data intelligence unicorn Collibra and open banking leader Tink — to identify and walk through best practice.
This will be useful for companies at any stage, but especially those at Series B+.
The Data Infrastructure Landscape
First, it’s worth noting that this is a difficult challenge for companies. Extracting, storing, transforming and understanding data still isn’t easy for most, and therefore, there are many vendors to consider depending on business needs. To help navigate the different data infrastructure vendors, here’s a quick overview of the landscape:
We at Dawn have invested across the entire value chain, from production sources (neo4j), to ETL/ELT and master data management solutions (Cluedin), to data warehouses (Firebolt), Data Intelligence platforms (Collibra), as well as applications for analytics (Quantexa, Onna) and data science (Dataiku). As such, over time, we have simplified our categorisation of data infrastructure across modern businesses. Typically, the framework consists of four areas: the ingestion of this data, the transformation of it into a usable format, the centralisation and storage of that data and then understanding and delivering business insights from it.
The data “lakehouse” is still really two things, depending on whether it is a traditional “warehouse” getting “lake” capabilities, or a lake getting warehouse capabilities. Data lakes offer a file system where you can drop any unstructured files, so they can be further processed and transformed into a more structured format for easy querying, as in a data warehouse.
Typically, data warehouses aren’t as good at processing unstructured data, which is where lakes come in. For instance, AWS S3 storage is often used as a data lake, and adding Glue and Athena, or even Dremio, gives it the querying capabilities of a data warehouse. Conversely, data warehouses like BigQuery are increasingly used as a data store and to query “less structured” data, and hence behave more and more like a lake. So essentially, data warehouses and data lakes are converging from different directions.
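To make the lake-to-warehouse step concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not how any vendor mentioned here works: a temporary directory of JSON-lines files stands in for the lake, an in-memory SQLite database stands in for the warehouse, and the `orders` table, its columns and the sample records are all hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def load_lake_into_table(lake_dir: Path, conn: sqlite3.Connection) -> int:
    """Read every JSON-lines file in the 'lake' and load it into one structured table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (customer TEXT, country TEXT, amount REAL)"
    )
    loaded = 0
    for path in sorted(lake_dir.glob("*.jsonl")):
        for line in path.read_text().splitlines():
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # Raw records may lack fields; missing values are stored as NULL.
            conn.execute(
                "INSERT INTO orders VALUES (?, ?, ?)",
                (record.get("customer"), record.get("country"), record.get("amount")),
            )
            loaded += 1
    conn.commit()
    return loaded


def revenue_by_country(conn: sqlite3.Connection) -> dict:
    """Once the data is structured, a plain SQL query answers a business question."""
    rows = conn.execute(
        "SELECT country, SUM(amount) FROM orders "
        "WHERE country IS NOT NULL GROUP BY country ORDER BY country"
    ).fetchall()
    return dict(rows)


def demo():
    """Drop two hypothetical raw files into a temporary 'lake', then load and query them."""
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        (lake / "2024-01-01.jsonl").write_text(
            '{"customer": "acme", "country": "UK", "amount": 120.0}\n'
            '{"customer": "globex", "country": "SE", "amount": 80.0}\n'
        )
        (lake / "2024-01-02.jsonl").write_text(
            '{"customer": "acme", "country": "UK", "amount": 30.0}\n'
            '{"customer": "initech"}\n'
        )
        conn = sqlite3.connect(":memory:")
        loaded = load_lake_into_table(lake, conn)
        return loaded, revenue_by_country(conn)
```

Calling `demo()` loads all four raw records, including the incomplete one, and returns revenue per country for the records that carry a country. The point of the sketch is the division of labour: the lake accepts whatever is dropped into it, and structure is only imposed at load or query time.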
And there are more interesting evolutions unfolding in cloud data warehousing. Snowflake has led the second wave of innovation, introducing true elasticity by decoupling storage and compute, which in turn led to cost and scale efficiencies over first-generation cloud data warehouses like Redshift. Against this background, we already see the third generation of cloud-native data warehouses going up against Snowflake, including Firebolt, which is offering up to 100x better price/performance ratios over incumbents and counts some of the most data-hungry tech companies as customers.
But innovation is not limited to the most tech-savvy end of the market. In fact, given how deep the data pain point runs for many mid-sized businesses, several players are emerging to help SMBs manage their analytics stack. Examples of growing European companies shaping this space include:
Interestingly, similar motions are starting to emerge in other subsets of the data value chain: data cataloguing / observability being one example. Here, just over the past year companies such as Castordoc, Stemma, Metaphor and Select Star have started to ship “data catalogue 3.0” products targeting mid-sized companies and scaleups running a modern data stack.
How are Companies Doing This Today?
We talked through how two of our portfolio companies put the above into practice in house: how they manage these workflows, and what the timelines around that look like.
We spoke with Stijn, Collibra Co-founder, about his data stack, and in his view, it’s important to set up dedicated data engineering resources as you’ll need someone to build, operate and maintain the data platform. Stijn also believes data needs vary significantly depending on the scale of the business, with Collibra updating their data stack two years ago as the company grew past 500 employees.
“We set up our infrastructure two years ago, and chose AWS as there was already a grassroots data lake in place to build upon.”
Collibra has recently added Fivetran to supplement PySpark for ELT / ETL, and is considering adding data prep capability for transformation in the data warehouse. Collibra’s data stack today:
Takeaway: Making decisions about your tech stack can be strategic, and the data platform will require dedicated engineering resources to manage.
As far as Jens Larsson, the Head of Data & Analytics at Tink, is concerned, it’s important to set design principles such as all data being accessible in a single format — and to avoid diverging custom builds. This translates to standardized paths to contribute and to query data, and providing data tooling that enforces good practices.
Jens also recommends that if you have a high-velocity product with a lot of data, you need to start incorporating real-time infrastructure into your data stack. Back when Tink provided a consumer app, it got a lot of data capabilities out-of-the-box with Google Firebase. However, these days the team has built a solution around AWS Kinesis, with data consolidated into a data lake.
Once in the lake, data is mostly transformed using SQL, orchestrated through Apache Airflow in a setup similar to what’s provided by dbt. Ultimately, building for scale and sustainable usage is key to developing a great data infrastructure stack.
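The idea of SQL transformations run in dependency order can be sketched in a few lines. This is not Tink’s setup, and it uses neither Airflow nor dbt: it is a minimal standard-library illustration in which `graphlib.TopologicalSorter` stands in for the orchestrator and an in-memory SQLite database stands in for the lake. The model names, tables and the threshold of 100 are hypothetical.

```python
import sqlite3
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical models: name -> (upstream dependencies, SQL that builds the table).
MODELS = {
    "stg_payments": (
        set(),
        "CREATE TABLE stg_payments AS "
        "SELECT user_id, amount_cents / 100.0 AS amount FROM raw_payments",
    ),
    "user_totals": (
        {"stg_payments"},
        "CREATE TABLE user_totals AS "
        "SELECT user_id, SUM(amount) AS total FROM stg_payments GROUP BY user_id",
    ),
    "top_users": (
        {"user_totals"},
        "CREATE TABLE top_users AS "
        "SELECT user_id FROM user_totals WHERE total >= 100 ORDER BY user_id",
    ),
}


def run_models(conn: sqlite3.Connection, models: dict) -> list:
    """Run each SQL model only after all of its upstream models have been built."""
    graph = {name: deps for name, (deps, _sql) in models.items()}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        conn.execute(models[name][1])
    conn.commit()
    return order


def demo():
    """Seed a hypothetical raw table, build the models, and read the final one."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_payments (user_id TEXT, amount_cents INTEGER)")
    conn.executemany(
        "INSERT INTO raw_payments VALUES (?, ?)",
        [("u1", 7000), ("u1", 5000), ("u2", 2500)],
    )
    order = run_models(conn, MODELS)
    top = [row[0] for row in conn.execute("SELECT user_id FROM top_users")]
    return order, top
```

Declaring dependencies next to each query, rather than hand-ordering the steps, is what lets a scheduler work out a safe execution order as the number of models grows. A production orchestrator adds scheduling, retries and monitoring on top of this core idea.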
“It is important that there’s a well defined, low-friction path to contribute data to the data platform. We’ve created tooling that makes it easy for any engineer at Tink to start streaming structured data from their services into the lake. Friction is the killer of a well leveraged data platform, and good data services.”
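One way to picture a low-friction, standardized path for contributing data is a small helper that every service calls in the same way. The sketch below is an assumption-laden illustration rather than Tink’s tooling: `EventSchema`, `InMemoryStream` and `emit` are hypothetical names, and the in-memory stream stands in for a real streaming service.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class EventSchema:
    """Declares the shape of one event type: field name -> expected Python type."""
    name: str
    fields: dict

    def validate(self, payload: dict) -> None:
        missing = set(self.fields) - set(payload)
        extra = set(payload) - set(self.fields)
        if missing or extra:
            raise ValueError(
                f"{self.name}: missing={sorted(missing)} extra={sorted(extra)}"
            )
        for key, expected in self.fields.items():
            if not isinstance(payload[key], expected):
                raise TypeError(f"{self.name}.{key} must be {expected.__name__}")


@dataclass
class InMemoryStream:
    """Hypothetical stand-in for a real stream; it only collects JSON lines."""
    records: list = field(default_factory=list)

    def put(self, line: str) -> None:
        self.records.append(line)


def emit(stream, schema: EventSchema, payload: dict, now=None) -> None:
    """Validate the payload, wrap it in a standard envelope, and send it on."""
    schema.validate(payload)  # malformed events are rejected before they reach the lake
    ts = (now or datetime.now(timezone.utc)).isoformat()
    envelope = {"event": schema.name, "ts": ts, "data": payload}
    stream.put(json.dumps(envelope, sort_keys=True))
```

A service would declare a schema once, for example `EventSchema("account_linked", {"user_id": str, "provider": str})`, and then call `emit` wherever the event occurs. Because every event arrives in the same envelope and has already been validated, downstream consumers can query the lake without per-team special cases.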
On which data vendors to choose, Jens is an advocate for Google’s BigQuery — it requires minimal engineering resources to set up and maintain, while enabling users to be highly productive. However, if you are already set up on AWS, Snowflake is a good option to consider.
Takeaway: A data stack is no better than the data it makes available, and ensuring any team or function can contribute to that data easily and in a standardized way opens the door to more high-quality data to consume.
Consumption, meanwhile, should also be low friction. Tools like Snowflake or BigQuery provide very powerful SQL interfaces for analysts and data scientists, while tools like dbt or Apache Airflow allow building and maintaining data transformation pipelines, producing the datasets that underpin dashboards and other utility datasets.
What does all this look like in terms of team set-up and growth?
There are multiple ways to set up — and grow — a team over time. Some businesses will opt to build out under a CTO, others under a CFO, or product. The going advice though is to start with an experienced leader and build the team up, and to remember that a lean experienced team (starting with one or two people) can drive a lot of value for a scaleup.
At Collibra, the data engineering team sits in the data office function today, with four dedicated resources managing infrastructure across the entire organisation (two data engineers and two data scientists). Business users also have their own analysts, driving decentralisation for Tableau and basic SQL tasks.
The data team reports to Stijn, Collibra’s Chief Data Citizen, effectively the business’ Chief Data Officer (CDO). But Stijn thinks this team can definitely sit under the CTO (responsible for engineering) or under the CPO. For larger businesses, Stijn can see good reasons for each business function having its own team.
At Tink, the data engineering team is intertwined with product and tech, with some business facing resources to support data analysis. The core thing to get right, in Jens’ view, is the first senior engineering hire, who knows what good looks like and will set the organisation up for scale.
Key Considerations As You Build Your Infrastructure
What are the key considerations that you should be thinking about as you build your data stack? We summarised them below:
1. What are the key use cases for my data?
As you build out your data infrastructure, you need to think about what you will be using your data for. You might need a different set-up, for example, if you have real-time use cases. As Tink was building out its stack, this was an important consideration for the team, which is why it ended up going with AWS Kinesis as a core part of its infrastructure.
Your data platform will decouple production and consumption of data, and chances are high that data produced by one team will be consumed by multiple others, and for creative use-cases you didn’t first foresee. Keeping a low friction path to contributing data will ensure you don’t miss out on opportunities you couldn’t foresee.
2. What capabilities do I have in my data team?
It’s important to consider who will be spending the most time in the data infrastructure stack. If you have hired a non-technical team focused on business use cases, then Google BigQuery or Snowflake might be the best options. If you have hired a very technical team and pride yourself on a decentralised approach, AWS could be the right solution for you. You are still likely to need to hire data engineering specialists, especially given the multiple pieces in the stack today.
3. What KPIs do I really care about?
Starting with a preliminary hypothesis on the metrics you want to track can be very helpful. One of our portfolio companies, for instance, was hyper-focused on customer insights and customer success. As such, it focused on building out a strong Gainsight team in the first instance, with Tableau as a visualisation layer on top. It will build additional data infrastructure as it fleshes out new KPIs.
4. Where do I want my data team to sit? And how centralised should the team be?
This is one of the most important questions to address. There is no right answer, with different combinations working well, depending on the use case of the business and its scale. Both CTOs and CPOs successfully manage data teams. If there is a CDO in your organisation, having the data team report to them could work as well.
Here is Jens’ favorite article on the topic: https://hbr.org/2018/11/the-kinds-of-data-scientist.
5. Do I build for scale?
As your company grows, so will your data needs. Where you can, go for cloud-native solutions that can keep up with the pace of your business’ growth and avoid heavy customisation or development of proprietary data systems in the early days of the company. You will likely need to revamp later.
6. Who am I developing partnerships with?
If you are building a data company and are already developing partnerships with AWS or GCP, you might consider building your data infrastructure around their tools. Collibra, for example, chose AWS for strategic reasons, because there was already a grassroots data lake in place to build upon.
7. What are the legal constraints of my data?
You will also want to ensure your customer and/or end-user agreements aren’t preventing you from developing your data capabilities. First and foremost, if you use BigQuery to process any data collected under a customer or end-user agreement, then Google will be a so-called sub-processor, and your agreement will likely need to cover this. It is a good idea to separate your customer agreements from your DPA (Data Processing Agreement/Addendum) so you can amend one without renegotiating the other.
Wherever you are in the scaling journey, data and its associated infrastructure need to be constantly thought about and iterated on. There is no panacea and no one-size-fits-all and, at scale, data infrastructure needs to become a critical, core function in and of itself.
Follow Shamillah on LinkedIn and Twitter