Activating and Governing a Growing Data Platform with Atlan
The Active Metadata Pioneers series features Atlan customers who have recently completed a thorough evaluation of the Active Metadata Management market. Paying forward what you’ve learned to the next data leader is the true spirit of the Atlan community! So they’re here to share their hard-earned perspective on an evolving market, what makes up their modern data stack, innovative use cases for metadata, and more.
In this installment of the series, we meet Surj Rangi, Enterprise Cloud Data Architect, Piyush Dhir, Senior Technical Lead, and Danni Garcia, Product Manager, at REA Group, the operator of leading residential and commercial property websites, mortgage brokering services, and more. Surj, Piyush, and Danni share REA’s evolving data stack, their data-driven ambitions, and the criteria and process behind their choice of Atlan.
This interview has been edited for brevity and clarity.
Could you tell us a bit about yourselves, your backgrounds, and what drew you to Data & Analytics?
Surj Rangi:
I’m Surj Rangi, Architect in Data Services, and I’ve been at REA for two years now. I graduated in IT in the UK, then worked in a number of consultancy firms in Data and Analytics, developing a strong background in cloud platforms and data architecture. I migrated to Australia about seven years ago with two decades of experience in data across various industries, including Media, Telecommunications, Finance, E-commerce, and Banking.
I joined REA and was very keen on the role that I was offered and the team I was coming into. What really enticed me was working with a company that had a startup mentality and was excited to push and deliver outcomes. Previously, I’ve worked with large banks where there’s a lot of bureaucracy and things take time, and I was excited to see how things work at a place like REA.
Piyush Dhir:
I’m a Senior Technical Lead at REA. My journey goes back to university when I was finishing my Bachelors in Software Engineering and needed to make a decision about what I wanted to do next.
I started as an Android developer back when it seemed like everybody’s next thing was “What is going to be my next Android project?” When I was doing that, I came across SQL Server, learning how you have to do operational modeling when you’re creating something like a front-end application. That’s how I made my first step into data. Since then, I’ve been working across a number of different kinds of data teams.
Danni Garcia:
I’m a Product Manager in Data Services with a specific background in Data studies. I haven’t always been in Product. I’ve worked in the technology industry for almost a decade now across many different areas and roles in both large and small organizations, but I started out as a Data Analyst.
Would you mind describing REA, and how your data team supports the organization?
Surj:
I think it’s good to know that REA started in a garage in Australia in the early-to-mid ’90s, and since then the company has grown and scaled enormously across the globe. REA has a presence not only in Australia, but Asia too, and has strong ties with News Corp. We started by listing residential properties, and it’s grown from there to commercial properties and land, as well. We’ve also done a lot of mergers and acquisitions. For example, in Australia we’ve bought a firm called Mortgage Choice, which positions REA not only to advertise listings and publications and provide insights into the property industry in Australia, but also to provide mortgage broking services.
We’ve gone through a long journey, and have had a Data Services team for a long period of time. Everything was decentralized, then it was centralized. Now it’s a bit of a hybrid, where we have a centralized data team building out the centralized data platform with key capabilities to be used across the organization, with decentralized data ownership. We are trying to align with a Data Mesh approach in how we build out our platform capabilities and drive adoption of “data as a product” across the organization.
We are multi-cloud, across both AWS and GCP, which brings its own challenges, and we do everything from data ingestion and event-driven architecture to machine learning. We are also building data assets to share with external companies in the form of a data marketplace.
Danni:
Data Services exists to support all of the internal lines of business across our organization. We’re not an operational team, but a foundational one that builds data products and capabilities to help teams successfully leverage data for their products. Our mission is to make it easy to understand, protect, and leverage REA data.
Piyush:
I’ll add that over the last couple of years, REA has predominantly seen itself as a listings business. It’s still a listings business, providing the best listings information possible to customers and consumers. But this rich data evolution is helping our business become data-driven. Many of the data metrics you see on the REA website and mobile application are derived from the work the organization has put in to grow our Data & Analytics and ML practice to drive better decision making.
We have a lot of valuable data. There are a lot of initiatives going on now to expand the usage of data, and over the next two years we will grow our landscape and derive even better outcomes for our customers and consumers, as we understand, leverage, and then showcase data to them and their customers.
What does your data stack look like?
Danni:
We have a custom-built, real-time streaming platform called Hydro, which uses MSK for ingestion. Then we have our batch platform, Breeze, built on Airflow, which ingests batch data. Our data lake solution is BigQuery.
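For readers less familiar with MSK-backed streaming, a minimal sketch of the kind of event such an ingestion platform consumes might look like the following. The broker address, topic name, and event shape here are illustrative placeholders, not details of Hydro itself.

```python
# Illustrative sketch only: publishing a property-listing event to a Kafka
# topic with kafka-python. Broker, topic, and payload are hypothetical.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker-1.example.com:9092",  # placeholder MSK broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"listing_id": "12345", "event_type": "listing_viewed", "channel": "web"}
producer.send("listing-events", value=event)  # hypothetical topic name
producer.flush()
```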
Piyush:
We look at ourselves as a poly-cloud company, using both AWS and Google Cloud Platform, at the moment.
From an AWS perspective, we have most of our infrastructure workloads running there: EC2 instances and RDS, our own VPC, and multiple load balancers.
From a Data and Analytics perspective, the majority of our workloads are in GCP. We are currently using BigQuery as our data lake, and that’s where most of our workloads run. We use SageMaker for ML, and there are some teams experimenting with BigQuery ML on the GCP side, as well. We also have a self-managed Airflow instance, so that’s our data platform.
We are currently in the process of setting up our own event-driven architecture framework using Kafka, which is on AWS MSK.
Apart from that, Tableau is our front end for reporting, so we have both Tableau Desktop and Tableau Server at the moment.
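To give a concrete feel for the batch side of a stack like this, here is a minimal Airflow DAG that loads a daily file from GCS into BigQuery. It is only a sketch of the general pattern, assuming hypothetical bucket, dataset, and table names; it is not Breeze itself.

```python
# Illustrative sketch only: a daily GCS-to-BigQuery load in Airflow.
# Bucket, dataset, and table names are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="example_daily_listings_load",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_listings = GCSToBigQueryOperator(
        task_id="load_listings_to_bigquery",
        bucket="example-landing-bucket",
        source_objects=["listings/{{ ds }}.csv"],  # one file per execution date
        destination_project_dataset_table="example_dataset.listings",
        source_format="CSV",
        write_disposition="WRITE_APPEND",
    )
```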
Why search for an Active Metadata Management solution? What was missing?
Surj:
We have an existing open-source data catalog that we have been using for a few years now. Adoption has not been great. As we’ve scaled and grown, we realized that we needed something that’s more relevant for the modern data stack, which is the direction that we are going towards.
There’s also a stronger push in our industry toward better protection of data. We store a lot of personally identifiable data across the business, and a key strategy in Data Services is to first understand the data, then protect it, then leverage it. We want to be able to catalog our data and understand how dispersed it is across our warehouses and various platforms, in both batch and streaming form.
We have a lot of data; for example, we’ve got over two petabytes in GCP BigQuery alone. We want to be able to understand what our data is, where it is, and how it’s put together, and apply more rigor to it. We have good frameworks internally in terms of governance, processes, and policies, but we want the right tech stack to help us use this data.
Danni:
There were some technical limitations, as our previous data catalog could only support BigQuery, but we also really wanted to support the direction of the business in terms of scale, and to align more broadly with our Data Vision and Strategy.
Our strategy is to implement a Data Mesh and ‘Data as a Product’ mindset across the organization. Every team owns data, leverages it, and has a responsibility to manage it within governance frameworks.
So, in order to embed Data Governance practices and this cultural shift, we needed a tool to support the frameworks, metadata strategy, and tagging strategy. We also needed a solution to centralize all our Data Assets, so we could have visibility of where data is and how it’s being classified, which supports our Privacy initiatives.
We’re still on a transformation journey at REA, which is very exciting. A new data catalog was a real opportunity to push ourselves further into that transformation with a new Data Governance framework.
How did your evaluation process work? Did anything stand out?
Surj:
We did some market research, speaking to Gartner and reviewing available tooling across the industry. We could obviously have kept using our current data catalog, but we wanted to evaluate a wide spectrum of tools, including Atlan, Alation, and OpenMetadata, to cover open source versus vendor-managed options.
We felt Atlan fit the criteria of a modern data stack, providing the capabilities we need, such as self-service tooling, an open API, and integrations with a variety of technology stacks, all of which were very important to us.
We had an overwhelmingly good experience engaging with Atlan, especially with the Professional Services team. The confidence that they gave us in the tooling when we went through our use cases drove a feeling of strong alignment between REA and Atlan.
Piyush:
We did a three-phase evaluation process. Initially we went out to the market, did some of our own research, trying to understand which companies could fit our use cases.
Once we did that, we went back and looked at different aspects such as pricing and used that as a filtering mechanism. We also looked at the future roadmap of those companies to figure out where each company might be going, which was our second filtering process. When we were done picking our options, we had to figure out which one would suit us best.
That’s when we did a light proof of value, creating high-level evaluation criteria so that everybody involved could score different capabilities from 1 to 10. The team included a delivery manager, a product manager, an architect, and developers, just to get a holistic view of the experience everybody would get out of the tool. After that scoring, we made a lightweight recommendation and presented it to our executives.
Some of what we were looking at in the evaluation criteria were things like understanding what data sources we could integrate with, what security looked like, and concepts like extensibility, so we could be flexible enough to extend the catalog programmatically or via API. Because we have our data platform running on Airflow, we also wanted to understand how well each option worked with that.
Then we also looked at roadmaps and asked ourselves what might happen in the future, and if something like Atlan’s investment in AI is something we ought to be looking into, and other future enhancements Atlan or other vendors could provide. We were trying to get an understanding of the next two or three years, because if we’re investing, we’re investing with a long-term perspective.
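As a rough illustration of the extensibility Piyush describes, programmatically tagging a catalogued asset over REST might look something like the sketch below. The endpoint path, payload, and authentication here are hypothetical placeholders rather than Atlan’s actual API, which is documented separately along with its SDKs.

```python
# Illustrative sketch only: applying a classification to a catalogued asset
# over REST. The endpoint and payload are hypothetical, not Atlan's real API.
import requests

BASE_URL = "https://example-tenant.example.com"  # placeholder catalog URL
API_TOKEN = "REPLACE_ME"                         # placeholder API token

response = requests.post(
    f"{BASE_URL}/api/classifications",           # hypothetical endpoint
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "assetQualifiedName": "default/bigquery/example_dataset/customers",
        "classification": "PII",                 # tag supporting privacy work
    },
    timeout=30,
)
response.raise_for_status()
```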
Surj:
If you look at the term “Data Catalog”, it’s been around for a very long time. I’ve been working in data for over two decades, and I’ve used data catalogs for a long time, but the evolution has been significant.
When Piyush, Danni and I were looking at vendors, that’s something we were thinking about. Do you want a traditional data catalog, which we’ve probably seen in banks that have a strong, governed, centralized body, or do you want something that’s evolving with the times and with where the industry is heading?
I think that’s why it was good to hear from Atlan, and we liked where they were positioned in that evolution. We like that Atlan integrates with a number of tech stacks. For example, we use Great Expectations for data quality at the moment, but we’re considering Soda or Monte Carlo, and we learned Atlan already has an integration with Soda and Monte Carlo. We’re finding more examples of that, where Atlan is becoming more relevant.
Conversely, when we were looking at addressing personally identifiable information, we wanted to be able to scan our data sets. Atlan was quite clear, saying “We’re not a scanning tool, that’s not us.” It was good to have that differentiation. When we looked at OpenMetadata, they said they had scanning capability, but it wasn’t as comprehensive as we were expecting, and we know now that this use case is in a different realm.
It’s good to have that clarity, and know which direction Atlan is going to go.
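For readers unfamiliar with the data quality tooling Surj mentions, a check written against Great Expectations’ legacy Pandas API looks something like the sketch below (newer GX releases use a different, context-based API). The file and column names are placeholders.

```python
# Illustrative sketch only: validating a batch extract with Great
# Expectations' legacy Pandas API. File and column names are placeholders.
import great_expectations as ge

df = ge.read_csv("listings.csv")  # hypothetical daily extract

# Every listing should carry an identifier, and prices should be non-negative.
id_check = df.expect_column_values_to_not_be_null("listing_id")
price_check = df.expect_column_values_to_be_between("price", min_value=0)

print(id_check.success, price_check.success)
```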
How do you intend to roll Atlan out to your users?
Danni:
So often in platforming and tooling, we get very caught up focusing on the technology and not on the user experience. That’s where Atlan can really help.
We want to create something that’s tangible, and that people want to use, so we can drive mass adoption of the platform. With our previous catalog, we didn’t have much adoption, so we’re making that a success metric. One of the great features of Atlan is that we can customize it to meet the needs of differing personas, a concept that hasn’t traditionally been a focus in the Data Governance space!
We went out to the business and undertook a big exercise, interviewing our stakeholders and potential users. Now, we really understand the use cases, the scale, and what our users want from the Data Catalog. Our personas (analysts, producers, owners, and users) will all be supported in the rollout of Atlan, making sure that their experience is customized within the tool and they can all understand and use data effectively for their roles.