“It’s an awful job. It’s depressing.”
These are the words of a data engineer at a famous European FinTech when I asked him about managing data access requests. When he started, he inherited a backlog of data access requests from his predecessor, and every month that backlog grew by another 30 requests. The data team was thriving, the number of data consumers was growing faster than ever, and he was snowed under with data access requests. I haven’t talked to his successor yet, but I’m sure they share his despair. They manage data access with Terraform.
Terraform is a wildly popular infrastructure as code (IaC) tool that data engineers use to build, change, and version the infrastructure for processing, storing and analysing data. The time saved and the efficiency gained from managing infrastructure across multiple cloud providers in an automated, declarative and repeatable way make it a data engineer’s favourite.
The siren's call to manage access with Terraform
Every data team starts small, and has to prove its worth before it can get more funding. To get started quickly, data engineers use Terraform to manage permissions through what is called ‘Access as Code’, where permissions are declared in Terraform configuration files and the resulting state files are kept in a shared location. Terraform picks up the configuration and translates the declared permissions into access controls in the underlying data sources. This is a great way to get started. You’re using existing infrastructure, your team is largely technical, and the number of access requests is still limited. You don’t worry too much about scalability, as your main concern is still to prove the value of the data and your data team.
Build it and they will come (and ask for access)
The first data products are a hit, the data team gets more budget, and the rest of the organisation has gotten wind of the exciting market insights and real-time predictive models the team has built. Following this success, the number of data consumers and data products grows exponentially, and the data stack becomes quite complex as tools are added. It becomes increasingly difficult to process all the data access requests, and to keep the access controls up to date with all the changes in the data and consistent across data sources. What started as a simple and elegant solution to manage access to one Snowflake instance is turning into a monstrosity juggling access to Snowflake, S3, BigQuery, PowerBI and Looker. You see, Terraform was built to manage infrastructure, not to manage access. The nature of data access management means you quickly hit scalability issues when you manage access with Terraform.
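To make the ‘Access as Code’ idea concrete, here is a minimal Python sketch of the declarative principle behind it: you describe the grants you want, compare them with the grants that exist, and derive the changes to apply. This is not how Terraform or any of its providers is implemented. The `Grant` type, the `plan` function and the statement strings are hypothetical and only illustrate the reconcile step.

```python
from typing import NamedTuple


class Grant(NamedTuple):
    """One desired or existing permission (hypothetical shape)."""
    principal: str
    privilege: str
    resource: str


def plan(desired: set, actual: set) -> list:
    """Return the statements needed to move `actual` to `desired`."""
    to_add = sorted(desired - actual)      # declared but not yet applied
    to_remove = sorted(actual - desired)   # applied but no longer declared
    statements = [
        f"GRANT {g.privilege} ON {g.resource} TO {g.principal}" for g in to_add
    ]
    statements += [
        f"REVOKE {g.privilege} ON {g.resource} FROM {g.principal}" for g in to_remove
    ]
    return statements


# Illustrative data only: the analyst is declared, the intern is left over.
desired = {Grant("analyst", "SELECT", "sales.orders")}
actual = {Grant("intern", "SELECT", "sales.orders")}
changes = plan(desired, actual)
```

With a handful of grants this is easy to reason about, which is exactly why the approach is attractive for a small team.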
Complex Setup
IAM is complex, IAM in the cloud is more complex, IAM in multiple clouds is extremely complex, and IAM in DevOps and DataOps scares the hell out of me! Managing access and setting up access controls with cloud-native IAM systems is already very hard, and Terraform adds a fat layer of complexity on top. This is particularly true for large organisations with many data consumers, where the Terraform scripts containing the access controls have grown unwieldy.
Maintenance overhead
Data proliferates, and the move from ETL to ELT only adds to that. As a result, access controls must be maintained and updated regularly. With every change in the data, you have to change the access controls, which adds to the administrative overhead of using Terraform. This lack of leverage points makes data access management with Terraform excruciatingly manual, which is the opposite of why you chose Terraform in the first place.
No overview
Who has access to customer data? Which data does Christina have access to? How often is the table containing personal credit card information queried? Who has write access to production data? Which permissions should I give a new data analyst who just joined the ‘retail sales’ domain?
When access to a hybrid cloud environment is spread across Terraform files, it is nearly impossible to answer these types of questions, which makes it very hard to make informed decisions, audit efficiently, or process data access requests.
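Answering these questions needs a view that is indexed by person and by data set rather than by file. The Python sketch below shows the idea, assuming the grants from every source have already been flattened into one list. The `GRANTS` data, the tuple layout and the function names are all hypothetical, and collecting that list from real systems is the hard part this sketch leaves out.

```python
from collections import defaultdict

# Hypothetical flattened grants: (source, principal, privilege, resource).
GRANTS = [
    ("snowflake", "christina", "SELECT", "crm.customers"),
    ("snowflake", "etl_service", "INSERT", "crm.customers"),
    ("bigquery", "christina", "SELECT", "sales.orders"),
    ("s3", "etl_service", "WRITE", "raw/payments/"),
]


def build_indexes(grants):
    """Index the same grants by principal and by resource."""
    by_principal = defaultdict(set)
    by_resource = defaultdict(set)
    for source, principal, privilege, resource in grants:
        by_principal[principal].add((source, resource, privilege))
        by_resource[resource].add((source, principal, privilege))
    return by_principal, by_resource


BY_PRINCIPAL, BY_RESOURCE = build_indexes(GRANTS)


def who_can_access(resource):
    """Principals with any privilege on `resource`, sorted by name."""
    return sorted({p for _, p, _ in BY_RESOURCE.get(resource, ())})


def resources_for(principal):
    """Resources that `principal` holds any privilege on, sorted by name."""
    return sorted({r for _, r, _ in BY_PRINCIPAL.get(principal, ())})
```

Here `who_can_access("crm.customers")` answers the first question and `resources_for("christina")` the second. Usage questions, such as how often a table is queried, need query logs and are out of scope for this sketch.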
No collaboration with the business
The technical nature of Terraform and the related workflows makes it very difficult to delegate the responsibilities of data access management and data access requests to less tech-savvy data owners in the business (like me). As a result, data access management keeps falling back on the data engineer, who knows the intricacies of the Terraform scripts but not who is supposed to get access, nor which regulations apply.
Security Vulnerability
Kind of boring, I know, but storing Terraform state files (Access as Code) in a shared location, such as an S3 bucket, can result in severe security vulnerabilities if the state files are not properly secured.
If this sounds all too familiar, and you’re looking for an alternative to Terraform, check out our website, or reach out to info@raito.io to see how Raito helps with:
Data Access Observability and Auditability
Raito’s hybrid cloud data access and usage analytics give data owners one place to find the insights they need to manage access and approve data access requests. This saves them a lot of time otherwise spent browsing through state files or running SQL queries in your databases.
Collaboration
Raito’s intuitive UX/UI makes it very easy for less technical users to manage access and approve data access requests, which unburdens the data engineer who has to maintain the access. No more hunting for those Slack threads, or getting lost in Jira tickets.
Automation
Raito’s tag-based policies make it possible to automate the configuration and maintenance of access controls across your hybrid cloud environment, and Raito’s Policy Recommender detects and remediates issues with your access controls. This way, the data owner or data engineer does not have to update access controls manually with every change in the data. They set the tag-based policy, an approach also known as Attribute-Based Access Control (ABAC), and Raito keeps the access controls up to date with every change in the data.
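The general idea behind tag-based policies can be sketched in a few lines of Python. This is a generic illustration of ABAC, not Raito’s implementation or API. The `TagPolicy` class, the `effective_grants` function and the example tags and roles are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagPolicy:
    """Any resource carrying `tag` grants `privilege` to `role` (hypothetical)."""
    tag: str
    role: str
    privilege: str


def effective_grants(policies, resource_tags):
    """Derive (role, privilege, resource) grants from policies and tags."""
    grants = set()
    for resource, tags in resource_tags.items():
        for policy in policies:
            if policy.tag in tags:
                grants.add((policy.role, policy.privilege, resource))
    return grants


POLICIES = [
    TagPolicy("retail_sales", "retail_analyst", "SELECT"),
    TagPolicy("pii", "privacy_officer", "SELECT"),
]
RESOURCES = {
    "sales.orders": {"retail_sales"},
    "crm.customers": {"retail_sales", "pii"},
}

before = effective_grants(POLICIES, RESOURCES)

# A new table appears and is tagged; no policy has to change.
RESOURCES["sales.returns"] = {"retail_sales"}
after = effective_grants(POLICIES, RESOURCES)
new_grants = after - before
```

The point of the design is that the policy is written once per tag. When a new table is tagged, the grant for it follows from the existing policy instead of from another hand-written entry.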
Photos by Elisa Ventur and Zichao Zhang on Unsplash