As data and GenAI use cases continue to grow exponentially, data platform teams face the challenge of ensuring that the right people have the right level of access to data and data products. Implementing efficient access control mechanisms is crucial for maintaining data security, privacy, and compliance. In this article, we explore the access control methods platform architects can implement to manage access efficiently, and help you find the method that best suits your needs.
GenAI's value to businesses is likely at an all-time high. Companies are using GenAI to improve customer service with chatbots, design new products, and quickly analyze large datasets. Data teams are crucial in training these AI models to be accurate, relevant, secure, and aligned with business goals. This has led to big gains in productivity, customer engagement, and innovation across many industries. The result? Data teams grow bigger, better, and bolder, which often means more data consumers with rapidly changing access needs. As more teams work with data, and as data products become increasingly valuable, it's important to ensure that getting access to data is both quick and secure.
With so many people requiring access to data, a one-size-fits-all approach won't work. We need a way to manage access that can bend and adjust to fit the different needs of each team and the data platform architecture. Access controls must be adaptable to changes in roles, projects, and data sensitivity, allowing for dynamic adjustments. The solution must scale to handle increasing data volumes and users without losing performance. It should also be user-friendly for both administrators and end-users.
Then there’s also the principle of least privilege to take into account. For data access, this means granting users the lowest level of permissions that still lets them perform their job functions. (Side note: compromised credentials continue to top the list of initial attack vectors.)
IBM’s Cost of a Data Breach Report puts the global average cost of a data breach in 2024 at USD 4.88M, a 10% increase over the previous year and the highest total ever. This adds another layer of complexity to access management: making sure that every data consumer’s access rights are up to date at all times. Don’t let a data breach disrupt your business.
Now, which access control method do you need? Here’s an overview of the four most common methods that can help you efficiently manage access for your data consumers. I’m making a distinction between ‘Old School’ and ‘New School’ methods.
ACLs, or Access Control Lists, are the traditional method for managing access to data and resources. They work like guest lists: each object (file, schema, table) carries a list of users or groups with their corresponding level of permissions (read, write, execute).
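To make this concrete, here’s a minimal Python sketch of the ACL model. The object, user, and permission names are made up for illustration and aren’t tied to any particular platform.

```python
# Minimal ACL model: each object carries its own "guest list".
# All names (objects, users, permissions) are illustrative.

acls = {
    "sales.orders": [            # a table
        {"principal": "alice", "permissions": {"read", "write"}},
        {"principal": "bob",   "permissions": {"read"}},
    ],
    "finance.invoices": [
        {"principal": "carol", "permissions": {"read", "write"}},
    ],
}

def is_allowed(user: str, obj: str, permission: str) -> bool:
    """Check the object's ACL for an entry granting this user the permission."""
    for entry in acls.get(obj, []):
        if entry["principal"] == user and permission in entry["permissions"]:
            return True
    return False

print(is_allowed("alice", "sales.orders", "write"))     # True
print(is_allowed("bob",   "sales.orders", "write"))     # False
print(is_allowed("bob",   "finance.invoices", "read"))  # False: no entry
```

Note that every grant is a new entry on some object’s list, which is exactly where the scaling problems described below come from.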
ACLs are currently used in BigQuery, Databricks, Azure ADLS Gen2, Unity Catalog, Redshift, and Oracle.
ACLs are easy to understand and are mostly used in environments with a relatively small number of users and resources with stable access needs, or in scenarios where you want to quickly grant a user temporary access to a small number of objects. They fit best where security requirements are straightforward and there is no need for complex access control policies.
However, as your data team grows, ACLs quickly hit their limits. Every new access need requires a new ACL entry, leading to an explosion in permissions. What started as a quick way to manage access becomes increasingly time-consuming. With every organisational change, the admin has to iterate through the full list of ACLs to update permissions. The lack of a logical naming convention turns yearly reviews into a long and cryptic process, and the sheer number of permissions makes it very difficult to understand who has access to what. That’s why many data teams turn to groups: by assigning ACLs to groups, they limit the number of ACLs, introduce naming conventions by giving groups logical names, and absorb organisational change by moving users across groups (see the sketch below). However, this approach makes access monitoring even harder, as access is now managed in multiple systems. This is also why cloud providers are increasingly introducing RBAC.
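Here’s a sketch of that group workaround, again with made-up names. ACL entries now point at groups while membership is resolved separately (in practice often in an identity provider), which is exactly why auditing ends up spanning two systems.

```python
# Group-based ACLs: entries reference groups; membership lives elsewhere,
# which is what complicates access monitoring. All names are illustrative.

group_members = {
    "analysts_emea": {"alice", "bob"},
    "finance_team":  {"carol"},
}

acls = {
    "sales.orders":     [{"principal": "analysts_emea", "permissions": {"read"}}],
    "finance.invoices": [{"principal": "finance_team",  "permissions": {"read", "write"}}],
}

def is_allowed(user: str, obj: str, permission: str) -> bool:
    """Resolve group membership first, then check the object's ACL."""
    user_groups = {g for g, members in group_members.items() if user in members}
    for entry in acls.get(obj, []):
        if entry["principal"] in user_groups and permission in entry["permissions"]:
            return True
    return False

# An organisational change is now one membership move, not an ACL sweep:
group_members["finance_team"].add("alice")
print(is_allowed("alice", "finance.invoices", "read"))  # True
```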
That’s when the new school methods come into play.
Role-Based Access Control (RBAC) is a method where permissions are assigned to roles instead of individual users. RBAC is often used to manage access in larger organisations, as it is easier to scale and maintain in a dynamic environment.
A more detailed approach incorporates flexible role definitions, working with specific user roles and logical user groupings. You create custom roles that correspond to the specific access needs of users. To illustrate this, take access to a data product: you can create a role named “Data Analyst” and fine-tune the permissions the analyst needs to work on the data product without friction. The advantage of this setup is that users and permissions are kept apart; when you create a role, you pick and choose which permissions it gets.
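A minimal sketch of that setup, with hypothetical role and permission names: users map to roles, roles map to permissions, and the two sides never reference each other directly.

```python
# Minimal RBAC model: permissions attach to roles, users attach to roles.
# Role and permission names are illustrative.

role_permissions = {
    "data_analyst":  {"read:sales_data_product", "query:sales_data_product"},
    "data_engineer": {"read:sales_data_product", "write:sales_data_product",
                      "deploy:pipelines"},
}

user_roles = {
    "alice": {"data_analyst"},
    "bob":   {"data_analyst", "data_engineer"},
}

def is_allowed(user: str, permission: str) -> bool:
    """A user is allowed if any of their roles carries the permission."""
    return any(
        permission in role_permissions.get(role, set())
        for role in user_roles.get(user, set())
    )

print(is_allowed("alice", "read:sales_data_product"))   # True
print(is_allowed("alice", "write:sales_data_product"))  # False
print(is_allowed("bob",   "deploy:pipelines"))          # True
```

An organisational change is now a one-line edit to `user_roles`, and a permission change to a role propagates to every user holding it.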
Compared to ACLs, RBAC requires a larger upfront investment, as the administrator has to define roles before granting access. However, this initial investment is quickly earned back through a lower cost of management and reduced security risks.
In larger organisations, RBAC can become too generic. Assigning each user to a single, broad role (e.g. Marketing) might grant them access to data they don’t need (e.g. social media analytics) while restricting access to data crucial for their job (e.g. performance data for a specific product launch). Managing access effectively then requires the creation of numerous, increasingly specific roles, a situation known as the “role explosion” problem.
To address this, organisations can turn to Purpose-Based Access Control (PBAC), where permissions are defined for a specific purpose or project. A hybrid model combining RBAC and PBAC is a powerful approach to the challenges above.
RBAC provides the foundation: it defines roles and assigns permissions based on job functions or responsibilities, offering a structured approach to managing access. PBAC adds granularity on top, allowing more specific permissions based on the intended use of data or resources for a particular purpose. Users request access based on a stated purpose of data use (e.g., “finance analyst” or “product feature development”).
This becomes incredibly handy when creating a new data product. Here’s a practical example: a data analyst has to produce the monthly finance report, and therefore requests access to the “Finance Analyst” purpose, which was already set up in the system and bundles the permissions needed to create the report.
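A sketch of that hybrid check, with invented purpose names and grants: a request passes only if the user both holds a role that carries the permission and has an approved grant for the stated purpose.

```python
# Hybrid RBAC + PBAC sketch: a request must pass the role check AND the
# purpose check. All names are illustrative.

role_permissions = {
    "data_analyst": {"read:finance_mart"},
}
user_roles = {"alice": {"data_analyst"}}

# Purposes bundle the permissions a specific task or project needs.
purpose_scopes = {
    "finance_analyst": {"read:finance_mart", "read:revenue_reports"},
}
# Approved purpose grants per user, e.g. after a request/approval workflow.
user_purposes = {"alice": {"finance_analyst"}}

def is_allowed(user: str, permission: str, purpose: str) -> bool:
    """Grant access only if a role carries the permission AND the user
    holds an approved purpose whose scope covers it."""
    has_role = any(permission in role_permissions.get(r, set())
                   for r in user_roles.get(user, set()))
    has_purpose = (purpose in user_purposes.get(user, set())
                   and permission in purpose_scopes.get(purpose, set()))
    return has_role and has_purpose

print(is_allowed("alice", "read:finance_mart", "finance_analyst"))  # True
print(is_allowed("alice", "read:finance_mart", "marketing"))        # False
```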
While RBAC and PBAC help organisations better manage access in large and dynamic data environments, their manual nature introduces limitations of its own.
Attribute-Based Access Control (ABAC), on the other hand, takes an even more dynamic approach. With ABAC, access to datasets is granted based on attributes associated with users and objects. User attributes can include department, job title, or security level; object attributes can include sensitivity level or classification. This flexibility makes ABAC ideal for very large and dynamic environments.
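As a rough sketch (the attribute names and the single policy rule are invented for illustration), an ABAC decision evaluates policy rules against user and object attributes at request time:

```python
# Minimal ABAC sketch: decisions come from rules evaluated over user and
# object attributes. Attribute names and the rule are illustrative.

users = {
    "alice": {"department": "finance",   "clearance": 2},
    "bob":   {"department": "marketing", "clearance": 1},
}

objects = {
    "finance.invoices": {"domain": "finance",   "sensitivity": 2},
    "web.clickstream":  {"domain": "marketing", "sensitivity": 1},
}

# A policy is a predicate over (user attributes, object attributes).
policies = [
    # Users may access data in their own department's domain...
    lambda u, o: u["department"] == o["domain"]
    # ...provided their clearance meets the object's sensitivity.
    and u["clearance"] >= o["sensitivity"],
]

def is_allowed(user: str, obj: str) -> bool:
    """Allow if any policy rule evaluates to True for these attributes."""
    u, o = users[user], objects[obj]
    return any(rule(u, o) for rule in policies)

print(is_allowed("alice", "finance.invoices"))  # True
print(is_allowed("bob",   "finance.invoices"))  # False: wrong department
```

Notice that onboarding a new user only requires setting their attributes; no per-object or per-role changes are needed, which is where the scalability comes from, and why attribute accuracy matters so much below.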
The flexibility and scalability of ABAC come at the cost of a complex implementation and a deeper understanding of the system. ABAC relies on predefined rules and on user and data attributes to make authorization decisions, which means a significant upfront investment. These policy rules also introduce a form of rigidity, making temporary deviations very difficult to manage.
Additionally, ABAC relies heavily on the accuracy of data and user attributes to make authorization decisions, something the attribute owners might not always be aware of. As a result, ABAC can automate data access and security to a great extent, but it comes with a complex implementation and maintenance cost. Its scope should therefore be carefully managed and monitored, and strict SLAs and metadata quality standards should be enforced.