The Force that Surrounds Us: The AI Supply Chain from My Jedai to Canva

UpGuard Team
Published Jun 09, 2025

UpGuard can now report that it has secured a Chroma database belonging to My Jedai, an AI chatbot company based in Russia. The database contained 341 collections of documents, where each collection could be used to guide responses for different chatbots. Many of the collections contained non-sensitive public data, but some contained private information. Most significantly, one collection contained thousands of responses to a survey of 571 participants in the Canva Creators program, including their email addresses, countries of residence, ratings for different components of the Creators program, and descriptions of their specific experiences and challenges with the program.

This leak, the first reported for an instance of a Chroma database, illustrates how the introduction and adoption of new AI-related technologies have created the conditions for a new wave of data leaks, both fundamentally similar to those of the past and marked by the unique characteristics of data used by AI models. In particular, the curious path of this data (PII for designers from all over the world, collected by a large Australian tech company, and leaked by a Russian microenterprise hosted on an IP address in Estonia) shows how the data flows of the AI economy are even more interconnected and unpredictable than in the past.

Chroma database

Chroma is a document embedding database intended to provide AI chatbots with specific pieces of information to use when answering user queries. For example, a gym might augment an LLM with a Chroma database to provide its chatbot with documents describing the gym's facilities, policies, and hours of operation. A user on the gym's website could then ask the bot questions and receive answers specific to that gym. In our survey of open Chroma databases, we found that developers are generally (but of course not always) using Chroma in this way.
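For readers unfamiliar with this pattern, here is a minimal sketch using the chromadb Python client. The gym documents and collection name are our own illustrations, not data from the exposed database.

    import chromadb

    # In-memory client for local experimentation; a production deployment
    # would typically run a Chroma server and connect over HTTP.
    client = chromadb.Client()
    collection = client.create_collection(name="gym_info")

    # Add the documents the chatbot should draw on when answering questions.
    collection.add(
        ids=["hours", "pool"],
        documents=[
            "The gym is open 6am-10pm on weekdays and 8am-8pm on weekends.",
            "The lap pool is reserved for swim classes on Tuesday evenings.",
        ],
    )

    # At query time, the user's question is embedded and matched against the
    # stored documents; the best matches are handed to the LLM as context.
    results = collection.query(query_texts=["When are you open?"], n_results=1)
    print(results["documents"][0][0])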

Even when the documents in a Chroma database are non-sensitive, the database still needs to be configured appropriately to be secure. If the chatbot is accessible to guest users, the database needs to be configured to prevent them from writing malicious content into the document store used to generate answers. While Chroma is a database specifically designed to support AI applications, the practices for securing it are common to all databases: databases generally should not be exposed directly to the internet; they should have a strong user authentication mechanism; and they should not store more sensitive data than necessary for their purpose. Each of those controls must break down for a data leak to occur. Design decisions (like enabling authentication by default) and market adoption (it can't leak if no one uses it) can make a given technology more or less prone to data leaks, but ultimately any database can be used to store sensitive data and configured to leak it.
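This is also why direct internet exposure is so dangerous: if a Chroma server is reachable and authentication has not been configured, anyone with the standard client can enumerate and sample its collections. The sketch below illustrates such a probe; the IP address is a documentation placeholder, not the actual My Jedai host.

    import chromadb

    # Connect with no credentials at all; on an unsecured, internet-exposed
    # server this connection simply succeeds.
    client = chromadb.HttpClient(host="203.0.113.7", port=8000)

    for coll in client.list_collections():
        # Depending on the chromadb version, list_collections() returns
        # collection objects or bare names; get_collection() accepts a name.
        name = coll if isinstance(coll, str) else coll.name
        collection = client.get_collection(name)
        print(name, collection.count())
        print(collection.peek(limit=2))  # sample documents and metadata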

My Jedai

My Jedai is a Russian software development company that enables customers to create their own AI-powered chatbots. Users configure their bot with information about its role, the documents that will help it answer questions, the AI model to use, and integrations with the channels where they want it to answer questions. Example use cases on the My Jedai website include responding to job applicants, communicating about cargo deliveries, and providing psychological counseling.

The history of My Jedai exemplifies the boom in small AI-native businesses. In 2018, Andrey Vlasof uploaded an Android puzzle game to a third-party APK site. The same game can be found a year earlier on other APK sites, but without an associated email address. By 2021, he had registered as an individual entrepreneur and received a Russian Primary State Registration Number of an Individual Entrepreneur (ОГРНИП or OGRNIP), now displayed on the My Jedai website, for operating a microenterprise (meaning it employs fewer than 15 people and has revenue under 120 million rubles per year). At the end of 2023, the Internet Archive first captured the myjedai.ru website advertising the service it delivers today: “Create your own artificial intelligence in just 5 minutes!”

UpGuard notified My Jedai of the exposed data on May 1, 2025. The next day, My Jedai replied and confirmed that they had secured the database, and UpGuard verified it was no longer accessible.

The Data

Chroma organizes documents into groups called “collections.” Each collection is a JSON file containing the text of the documents themselves and associated metadata like their source and ID. The My Jedai database contained 341 collections, varying in size from 161 bytes to 104 megabytes. The sources listed in the collections showed that the documents came from a mix of public web pages and uploaded files. Most of the text in the bodies of the documents appeared as escaped Unicode code points, which we decoded into Cyrillic characters and then translated from Russian to English.
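As a sketch of that decoding step: loading a dumped collection with Python's json module resolves \uXXXX escape sequences into the characters they encode. The file name and the "documents"/"metadatas" keys below are assumptions for illustration, mirroring the shape of Chroma's query payloads.

    import json

    # json.loads turns escape sequences into the characters they encode:
    print(json.loads(r'"\u041f\u0440\u0438\u0432\u0435\u0442"'))  # -> Привет

    # Hypothetical dump file; we assume it mirrors Chroma's payload shape,
    # with parallel lists of document text and per-document metadata.
    with open("collection_dump.json", encoding="utf-8") as f:
        dump = json.load(f)

    for doc, meta in zip(dump.get("documents", []), dump.get("metadatas", [])):
        print(meta.get("source", "unknown"), "->", doc[:80])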

Examples of document sources via uploaded files and public web addresses.

The information stored in the collections ranged from public to personally identifiable. On the more innocuous end of the spectrum, some collections contained text related to some kind of mystical doctrine (“Sacred is the highest manifestation of the pattern in all aspects. Extracurability and highlighting are important here, when your pattern becomes an algorithm for a new pattern. In particular, elves are a standard as an example of sacredness”). Other collections collated information about how to build emotional intimacy with a romantic partner from web sources like WikiHow and Marie Claire.

These diverse data collections support the operating model described on the My Jedai website: end users upload or link to whatever documents they want without the need for technical expertise, and with only whatever oversight they themselves perform. Some documents contained junk data scraped straight from the web, like display ads and page footers, that would be stripped out by a more proficient user. 

Other documents were clearly private, though not necessarily sensitive. These included chunks of chat messages that had been copied into text files for ingestion, including the names of the users, the times the messages were sent, and the message contents. Such messages might provide useful inputs for a knowledge base, as they include frequently asked questions and answers, but the examples here showed why plugging a company's chat software straight into an AI system creates risk. In addition to useful technical answers, the chats included links to development environments, file sharing servers, and Google Drive resources configured for sharing and accessible to anyone with the link.

Furthermore, the original chat messages included additional context, like weaknesses or gaps in system performance, that might be the premise for a question but might not be desirable to add to a knowledge base. One user's display name included a nickname that made it easy to find his YouTube channel and other social media. These chat messages came from Russian technology companies that in turn list prominent Russian enterprises as customers, making even minor risks systemically significant.

Example support chat messages

In a world without LLMs, a person might search through Slack for frequently asked questions, copy them into a working document, and tacitly perform an editorial and curatorial function to ensure only desirable answers appear in a knowledge base. In a world with AI, it is increasingly popular to plug those tools into a company's raw data (not just internal communications like Slack but project management tools, customer support chat, file storage, etc.) precisely because LLMs are so good at summarizing across large collections of data. This exposure highlights how one of the signature capabilities of LLMs drives users to share more raw data with their AI vendors, increasing the impact in the event of data exposure.

Canva

In our early assays of the data, the escaped Unicode was an impediment to skimming the text, and random sampling returned a weird but unproblematic smattering of mysticism and tech support. However, when we searched the collections for “gmail.com” (a useful trick for gauging whether a data set contains any personal information), one collection jumped out. Unlike the rest, almost all of its text was in English, and it clearly contained records associated with identifiable people.
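That triage trick is simple to reproduce. The sketch below scans a directory of collection dumps for email-like strings and counts the unique matches, which is one way a unique-address count like the one reported below could be derived; the directory name and regex are our own illustrative choices, not the exact method used.

    import re
    from pathlib import Path

    # Deliberately loose pattern: for triage we only need a rough signal
    # that a file contains email addresses, not perfect validation.
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    for path in Path("dumps").glob("*.json"):
        emails = set(EMAIL_RE.findall(path.read_text(encoding="utf-8")))
        if emails:
            print(f"{path.name}: {len(emails)} unique email addresses")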

The JSON dump for this collection had 77k rows, totaling 13 megabytes of text. The documents were survey responses that began with a user's email address and country, followed by their responses to 51 questions about their experience with the Canva Creators program. A few questions asked for the users' firmographics: When did you join Canva Creators? Are you a professional designer? What is the size of your company or agency? Most asked the user to rate and describe their satisfaction with different parts of the Canva Creators program, like the royalties program, the product, and the community engagement.

Part of a survey response from a Canva Creator

The responses contained a total of 571 unique email addresses, though many addresses occurred more than once as Creators apparently answered multiple surveys. Where information dated the surveys, either in a timestamp or in references in the text responses, they appeared to be from around May 2024. True to the period, there was even a question about AI: “As you know, AI is increasingly popular and evolving really fast. Do you currently use AI as part of your creative process?”

Each entry also included the country or region of the respondent, covering the countries of Brazil, France, Germany, India, Indonesia, Italy, Japan, the Netherlands, South Korea, Spain, Thailand, and Turkey, and the regions of LATAM, MENA, and Global. To validate that this data was real and not synthetic or fabricated, we used the email addresses to identify natural persons working as Canva Creators with the same names and/or company names as listed in the exposed data set.

This data poses a risk both for the creators and for Canva. For the creators, the exposure of an email address is relatively minor in itself, but combined with the context of the survey results, it provides a ready-made phishing kit. These are users with a demonstrated willingness to submit data through a form (the essential goal of phishing) when it is presented as coming from Canva. The data details their professional and financial lives and thus could be used to impersonate the original survey sender and to entice them with offers tailored to their responses. For Canva, the data exposes the strengths and weaknesses of its Creators Program, providing would-be competitors with detailed intel on what would motivate creators to leave, and the contact information to reach them.

Why this data would be in a Chroma document database is not apparent, much less one hosted on an IP address in Estonia and operated by a Russian microenterprise, but the quirks of the other collections at least provide an explanatory model. The end users uploading documents seem to have varied levels of expertise in what LLMs can do and how to use them well. As with AI chatbots in general, an interface with no barriers to entry encourages users to get started and figure it out later. A document database with a chatbot interface does not seem like the ideal technical solution for summarizing survey results, but at our current point in the AI hype cycle, it is understandable why so many users are willing to throw their data in and see what comes out.

The Global AI Supply Chain

Databases missing authentication are not unique to AI technologies; exposed Elasticsearch databases appear every day and have for years. Neither is the transfer of data across national borders. For example, in 2019 UpGuard secured an Rsync file storage server in Russia with files from Nokia. However, there are certain conditions that make data leaks more likely, and we can see how the current AI boom creates those conditions. 

First, AI has spurred the development of new technologies like Chroma (and many others) for which there is not yet widespread expertise. In the past ten years, we have seen how data leak research contributes to the security maturity of products like Amazon S3, Elasticsearch, GitHub, and Microsoft Power Apps. New AI technologies are beginning that cycle again, and all the lessons learned from thousands of misconfigured storage buckets will be learned again for a new generation of low-code apps and vector databases.

Beyond the strictly technical challenges of learning to use new technologies, the social construction of “AI” has created psychological conditions that encourage risk-taking and explain the data diaspora stretching from Canva to My Jedai. The threat of job displacement by AI and the elusive promise of hyperproductivity create the impression that the alternative to using AI is irrelevance and unemployment. Thus “AI” is not a technology or technologies, or even a vision of a future state, but a mode of operating in the present characterized by top-down, demand-side pressure for “AI”: one must always be adopting more AI, adding more AI tools, and feeding more data into AI. That demand in turn creates the supply-side conditions where solo entrepreneurs can plug a database into an API and have a product that sells. Each of those vendors becomes the supplier to some other company, creating more potential points for a weak link in the chain to fail.

Data leaks are essentially misconfigurations, a form of human error, and they occur when people are moving too fast to understand and validate the appropriate technical controls. The risk and reality of data leaks in the AI supply chain is not inherent to LLMs, but it is a predictable outcome of the way the market has responded to LLMs. While the amount of PII in this case is smaller than in many other exposures, it is also weirder (it is weird to analyze survey results by creating a custom chatbot; it is weird to find one database with an admixture of love advice columns, IT support chats, and PII; it is weird that data from an Australian tech company with 5,000 employees is being processed by a Russian microenterprise) and in that way more alarming than data leaks that fit an established pattern. We are seeing something new, something that requires us to update our mental and statistical models for how, why, and where data leaks, and to understand that this is just the beginning.
