My memory isn’t great. In fact, it’s very unreliable. But some memories linger so vividly they feel like yesterday.
I remember playing Prince of Persia on a Compaq Presario, our first home PC, when I was 13.
I remember wasting time on the online message board of my local radio station when I was 15.
I remember looking for directions to a record shop on my first smartphone when I was 25.
I will remember getting my first insights into our sales numbers by simply asking an LLM.
With LLMs, we’re at the onset of a platform shift that will affect every aspect of our professional and personal lives. One of the biggest ways it will affect our professional lives is by lowering the threshold for non-technical users to get new insights from their data. LLMs with Retrieval Augmented Generation (RAG) will finally achieve the data democratization we’ve been striving for over the past 20 years, to little avail. The technical hurdles put up by BI technology, and the need for exact queries, have stifled true data democratization, keeping data literacy stuck at roughly 20% of the organisation. With LLMs and RAG, these hurdles will practically disappear, and everyone in the organisation will have access to insights regardless of their technical know-how. LLMs will integrate with your users’ business applications, and their language capabilities will let users ask for insights in their own words instead of having to learn complex SQL syntax. Below, you can see how an LLM combined with RAG can provide Damien with insights on his organisation’s customer retention.
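To make the flow concrete, here is a minimal sketch of the text-to-SQL RAG loop behind such an interaction. It is illustrative only: the `llm`, `warehouse`, and `schema_index` objects stand in for whatever model client, warehouse driver, and retrieval index you happen to use.

```python
# Minimal text-to-SQL RAG loop: retrieve schema context, let the LLM
# draft a query, run it, and narrate the result in plain language.
# `llm`, `warehouse`, and `schema_index` are illustrative placeholders.

def answer_question(question: str, llm, warehouse, schema_index) -> str:
    # 1. Retrieval: fetch table and column docs relevant to the question.
    context = schema_index.search(question, top_k=5)

    # 2. Augmentation: ground the prompt in the retrieved schema snippets.
    prompt = (
        "You write SQL for the data warehouse described below.\n"
        f"Schema:\n{context}\n"
        f"Question: {question}\n"
        "Reply with a single SELECT statement."
    )
    sql = llm.complete(prompt)

    # 3. Generation over live data: execute the query and summarise it.
    rows = warehouse.execute(sql).fetchall()
    return llm.complete(
        f"Question: {question}\nQuery result: {rows}\n"
        "Summarise the answer for a business user."
    )

# Damien never writes a line of SQL:
# answer_question("How is our customer retention trending?", llm, wh, idx)
```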
As with anything, the sting is in the tail. The freedom to get insights from your data in an unstructured way also exposes your organisation to significant, previously unknown privacy and security risks. It’s easy to see how a user can bypass access controls, data masking, and other security controls by asking an LLM for information when that LLM has unfettered access to your data through RAG. It’s a bit like how you could use Databricks Spark jobs to bypass access controls a couple of years back. In the example below, you can see how Damien uses his organisation’s LLM to get access to sensitive customer data he would otherwise not have access to when querying the data warehouse directly.
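The sketch below shows why this bypass works; the connection helper and account name are made up. The warehouse only ever sees the RAG’s service account, so every user inherits its broad privileges.

```python
# Anti-pattern: the RAG always connects with one privileged service
# account, so the warehouse never learns who actually asked the question.
# The `warehouse` driver and the 'svc_rag' account are assumptions.
import os

def run_generated_sql(sql: str, asked_by: str, warehouse):
    # Every request connects as the same broad service account.
    conn = warehouse.connect(
        user="svc_rag",
        password=os.environ["SVC_RAG_PASSWORD"],
    )
    # Masking policies and row-level security keyed on the session user
    # evaluate against 'svc_rag', never against `asked_by` ('damien').
    return conn.execute(sql)

# "List the home addresses of our churned customers" now succeeds for
# Damien, even though he could never run that query himself.
```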
We don’t even have to assume malicious intent on Damien’s part. LLMs have a track record of generating incorrect SQL queries from user prompts, exposing the organisation to accidental data breaches. Whoopsie-daisy!
Apart from the data privacy implications of RAGs, there are also serious data security risks associated with giving RAGs unlimited access to your cloud data. That is because your cloud data store, and the account the RAG uses to access it, are preferred attack vectors for hackers.
The non-deterministic nature of the queries RAGs generate, the fact that LLMs can be manipulated, and their access to cloud data mean that GenAI carries significantly more privacy and security risks than traditional predictive AI systems, as shown below.
Regulators have recognized these privacy and security risks and are ramping up AI regulation. With the EU AI Act and the White House’s executive order on safe AI, the EU and the US are spearheading a global trend of regulating AI. With fines for non-compliance with the EU AI Act running up to EUR 35 million or 7% of global turnover, whichever is higher, it is clear that compliance will have to be taken seriously.
It will be extremely important to apply the same least-privilege access controls to RAGs as you would when a user accesses the data source directly. However, given the extremely dynamic nature of RAG queries, it is impossible to predict which data a RAG will want to query in the future, making access hard to manage efficiently.
In fact, RAGs will accelerate a 20-year-old trend of queries becoming increasingly dynamic. Let me explain. Creating a new report in the early 2000s was a painfully slow process that could easily take up to three months. You had to write the specs of the report, someone had to look for the data in your operational systems, write a complex ETL job to get the data into the data warehouse, and provision enough storage and compute. With the introduction of cloud computing in the 2010s, this process was significantly streamlined: instead of waiting several months, you could get new insights in a week, and combined with dbt you can now have new insights in less than a day. As data development processes accelerate, the data security workflow can’t keep pace, resulting in huge productivity losses, as confirmed by NIST. I’m confident that, without change, data security workflows will completely break down with RAGs, where new queries are introduced at the speed of light.
Given the pivotal role data plays in LLMs, the data team will be held accountable for securing it. So how can data teams protect their organisation against the security and privacy risks of talkative LLMs?
Shift Left
The first and most important step is to integrate data security into MLOps/DataOps by managing data security as code in Terraform, dbt, or data contracts. For instance, when creating a new data product containing customer data, the engineer can simply tag the dataset as such. This tag can then be used to dynamically grant access or mask data, using tag-based policies. This replaces the manual process of setting ACLs after figuring out who can have access to the data, reducing the time spent on data security by more than 90%.
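As a rough sketch of the idea (the tag names and helper are hypothetical, the emitted DDL is Snowflake-style, and in practice the declaration would live in your dbt or Terraform config rather than in Python):

```python
# Security-as-code sketch: the engineer declares what the data *is*;
# deploy-time tooling derives who may see it. All names are illustrative.

data_product = {
    "name": "customer_orders",
    "columns": {
        "customer_email": {"tags": ["pii", "customer_data"]},
        "order_total":    {"tags": ["customer_data"]},
    },
}

def compile_policies(product: dict) -> list[str]:
    """Turn tags into warehouse DDL at deploy time, replacing the manual
    step of setting ACLs by hand after the dataset ships."""
    statements = []
    table_tags = set()
    for column, spec in product["columns"].items():
        table_tags.update(spec["tags"])
        if "pii" in spec["tags"]:
            statements.append(
                f"ALTER TABLE {product['name']} MODIFY COLUMN {column} "
                f"SET MASKING POLICY mask_pii;"
            )
    if "customer_data" in table_tags:
        statements.append(
            f"GRANT SELECT ON {product['name']} TO ROLE sales;"
        )
    return statements

# compile_policies(data_product) emits one masking statement for the
# e-mail column and a single SELECT grant for the sales role.
```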
Automation using the semantic meaning of data
To achieve data security at scale, it will be important to automate data security using tag-based policies, or Attribute-Based Access Control (ABAC). If you can automate data security during the data development process, you also take away the mental burden on data engineers of defining which security measures to take and determining who can access their datasets, which can be tricky questions to answer.
For instance, if you have marked a dataset as containing customer data, Raito can dynamically restrict access to the sales team based on a tag-based policy that determines that only the sales team can access customer data. Additionally, you can dynamically mask customer PII using a data governance policy that defines that all PII needs to be masked. Combined with Identity Federation, this lets you dynamically secure RAGs so the data team doesn’t have to constantly update permissions by hand.
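Conceptually, the evaluation behind such tag-based policies looks like the sketch below. This is the generic ABAC pattern; it does not reflect Raito’s actual policy syntax.

```python
# Generic ABAC evaluation: access is decided by matching user attributes
# against dataset tags, never by per-user grants. Illustrative only.

POLICIES = [
    {"tag": "customer_data", "require": {"team": "sales"}, "effect": "allow"},
    {"tag": "pii",           "require": {},                "effect": "mask"},
]

def decide(user: dict, dataset_tags: set[str]) -> str:
    allowed, masked = False, False
    for policy in POLICIES:
        if policy["tag"] not in dataset_tags:
            continue
        satisfied = all(
            user.get(attr) == value
            for attr, value in policy["require"].items()
        )
        if policy["effect"] == "allow":
            if not satisfied:
                return "deny"  # least privilege: a failed check wins
            allowed = True
        elif policy["effect"] == "mask" and satisfied:
            masked = True
    if not allowed:
        return "deny"          # no matching allow policy at all
    return "mask" if masked else "allow"

# decide({"team": "sales"},   {"customer_data"})        -> "allow"
# decide({"team": "sales"},   {"customer_data", "pii"}) -> "mask"
# decide({"team": "finance"}, {"customer_data"})        -> "deny"
```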
Identity Federation
When you use RAG to improve the accuracy of LLM responses, it is paramount to make sure that the RAG can only access data the end user is authorised to access. This can only be achieved by using Identity Federation for the RAG.
For instance, when Damien from engineering asks your LLM for the home address of one of your customers, the RAG should not get access to customer data, because it uses Damien’s identity to access the data, and Damien does not have access to customer data.
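Here is a sketch of what that looks like in practice, assuming an OIDC-style on-behalf-of token exchange; the `idp` and `warehouse` objects are placeholders for your identity provider and warehouse driver.

```python
# Identity federation sketch: the RAG exchanges the end user's SSO token
# for a warehouse session in *that user's* name, instead of falling back
# to a shared service account. `idp` and `warehouse` are placeholders.

def run_as_user(sql: str, user_sso_token: str, idp, warehouse):
    # On-behalf-of exchange with the identity provider (e.g. OIDC).
    warehouse_token = idp.exchange_token(user_sso_token, audience="warehouse")

    conn = warehouse.connect(authenticator="oauth", token=warehouse_token)

    # The session now carries Damien's identity, so the warehouse applies
    # his roles, masking policies, and row-level security as usual.
    return conn.execute(sql)

# When the RAG runs Damien's generated query for a customer's home
# address, it fails with a permission error, exactly as it would if
# Damien ran the query himself.
```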
Continuous Monitoring
Often neglected, but soon to become a regulatory requirement, is the regular monitoring of the data access and usage patterns of RAGs. We’ve seen data teams bypass data security policies by granting admin access at the data source, and abuse service accounts to get access to sensitive data. Data teams will have to continuously monitor for changes in permissions and anomalies in data consumption.
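As an illustration, a daily review job could scan the warehouse’s audit log for RAG queries against sensitive tables. The sketch below targets Snowflake’s ACCESS_HISTORY view as an example; other warehouses expose similar logs, and the service account name and table filter are assumptions.

```python
# Monitoring sketch: flag queries by the RAG's service account against
# customer tables in the last 24 hours. Snowflake's ACCESS_HISTORY view
# is used as an example; 'SVC_RAG' and the table filter are assumptions.

ANOMALY_SQL = """
SELECT user_name,
       query_start_time,
       obj.value:objectName::string AS table_name
FROM snowflake.account_usage.access_history,
     LATERAL FLATTEN(input => base_objects_accessed) obj
WHERE user_name = 'SVC_RAG'
  AND obj.value:objectName::string ILIKE '%CUSTOMER%'
  AND query_start_time > DATEADD('day', -1, CURRENT_TIMESTAMP())
"""

def daily_access_review(conn, alert):
    # Route every hit to the data owner for review; a spike in hits or a
    # new table name is exactly the kind of anomaly to investigate.
    for user, started_at, table in conn.execute(ANOMALY_SQL).fetchall():
        alert(f"RAG account {user} read {table} at {started_at}")
```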
Collaboration
Data teams will have to work closely with data owners and the data governance team to implement and operationalise security policies in the data stack used for GenAI. It is not up to the data engineer to determine who should have access to their data products: it’s not their responsibility, nor do they have the context to make informed decisions. Good collaboration enables the segregation of duties needed to streamline data security management.
It’s clear that GenAI holds the promise of increasing productivity by providing readily available insights into your organisation’s data. However, without proper controls, the privacy and security risks significantly outweigh the potential rewards. The head of data will have to balance data security with GenAI adoption, but won’t be successful with current IAM technology. Only a solution like Raito can help them unlock the value of GenAI securely.
Reach out to info@raito.io or book some time with me if you want to learn more about AI Security.
Bart Vee