6 Types of Data for AI Training: Risks & Opportunities

Whether you’re developing big or small AI models, or adapting existing systems for your company’s needs, understanding the legal landscape of using various datasets is crucial. With regulations becoming stricter, it’s vital to know what data is safe to use.

This article delves into the opportunities and legal risks of using various types of data from different sources. It also offers direct recommendations and practical examples.

There are many discussions about the legal status of data used for training AI models. Some interpretations by European regulators suggest that training an AI model is not considered using the data per se, but rather using unrelated pieces of it. Other interpretations state otherwise. It should be noted that it’s not only about training the models; this data, even if fragmented during training, can appear in full in prompts and outputs. Therefore, it is safer to be cautious from the start about what data and content you use for training AI models.

Public Datasets: Hidden Treasures

Governments, local governments, educational institutions, publicly funded entities, and private companies involved in publicly funded projects in the EU are increasingly making data publicly available.

The primary driver behind this is legislation, such as the EU’s Open Data Directive, which mandates the release and re-use of datasets to ensure transparency and foster innovation. This open data – ranging from public health statistics, economic indicators, and transportation data to customer behavior and environmental data – holds significant potential for AI projects.

Although a vast amount of data is available, many developers have not yet fully explored these resources. This presents a tremendous opportunity to leverage high-quality, compliant datasets for AI training.

Legal Framework and Re-Use Regulations

The European Union emphasizes the importance of public sector data through various regulations and initiatives, with the Open Data Directive being the primary one. This directive encourages the release and re-use of datasets, including by the commercial sector. The European Data Portal, along with open data portals in member countries, provides access to a wide range of datasets produced by publicly funded entities, making it easier for developers to find compliant resources.

Legal Risk

Using publicly available datasets carries a LOW LEGAL RISK when proper regulations are followed. However, developers must always check the licensing terms, as misuse could lead to potential claims or restrictions.

Recommendation

Always confirm the licensing attached to any dataset you plan to use
Explore the open data portals for reliable datasets to integrate into your projects
Raise awareness about the availability of open data from publicly funded entities
Actively look for newly added datasets, as these resources are updated regularly.
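
The licensing check recommended above can be partly automated as a first-pass filter. The following is a minimal sketch, assuming each dataset comes with a metadata record that has a "license" field; the field name, the record layout, and the allowlist contents are all hypothetical, and deciding which licences are acceptable remains a legal decision.

```python
# Hypothetical allowlist: which licences are acceptable for your project
# is a legal decision, not something this script can determine.
ALLOWED_LICENSES = {"cc0-1.0", "cc-by-4.0"}


def normalize(license_id):
    """Lower-case and trim a licence identifier; a missing value becomes ''."""
    return (license_id or "").strip().lower()


def split_by_license(datasets):
    """Split metadata records into approved ones and ones needing manual review."""
    approved, needs_review = [], []
    for ds in datasets:
        if normalize(ds.get("license")) in ALLOWED_LICENSES:
            approved.append(ds)
        else:
            # Unknown or missing licences are never approved automatically.
            needs_review.append(ds)
    return approved, needs_review
```

Anything that lands in the review list, including datasets with no licence information at all, should be checked by a person before use.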

Example

The European Central Bank provides open data on economic indicators, which developers can use to create models for financial forecasting and analysis.

Copyrighted Content: A Minefield Ahead

Using copyrighted content from platforms like YouTube, Netflix, or news portals is very tempting. However, the legal implications are significant, as many developers have relied on the doctrine of fair use, which is increasingly being called into question.

Legal Framework

While fair use allows for certain uses of copyrighted material without permission, the legal landscape is shifting. Fair use can sometimes protect the use of copyrighted content for AI training, but this is not guaranteed and often depends on the specific circumstances. In the EU, the text and data mining exception allows for the use of copyrighted content for AI training, but this exception is subject to various conditions and limitations.

Recent trends indicate that content creators are pursuing compensation for usage, leading to complex negotiations between AI developers and rights holders.

Legal Risk

Using copyrighted material can present a HIGH LEGAL RISK. Developers must be cautious, as unauthorized use may result in lawsuits and other legal consequences. The lack of clear guidelines and the evolving nature of copyright law in relation to AI further complicate the situation. A lower-risk solution is to use content released under certain types of Creative Commons licenses; you can find more about this in the article published by the CC Organization.

Recommendation

Avoid using copyrighted content without explicit permission from the rights holders
Consider negotiating a licensing agreement, especially if you plan to incorporate such material in commercial applications
Be aware of the fair use doctrine and the text and data mining exception, but do not rely solely on these provisions without prior legal consultation.

Example

One notable example is Stability AI, the developer behind the AI image generator Stable Diffusion. In early 2023, Getty Images filed a lawsuit against Stability AI, alleging that the company used over 12 million of Getty’s copyrighted images without permission to train its AI model.

Proprietary Datasets: Navigating Commercial Waters

Commercially acquired data: many businesses are now purchasing data from commercial vendors.

For example, Reddit has decided to sell user interaction data for AI training. This move has sparked significant controversy and concerns about user privacy. While this approach is tempting, you must be aware that it may bring potential legal risks.

Legal Framework

When acquiring proprietary data, especially personal data, developers must comply with regulations like the GDPR, which govern personal data processing and require a legal basis for any data usage.

Legal Risk

Acquiring commercially available data can present MEDIUM TO HIGH LEGAL RISK.

As a developer, keep in mind that when you buy this kind of data you may not be granted full rights to it, even if the seller claims to hold all rights and to transfer them to you. Be particularly cautious with any personal data (not only names or contact details of individuals, but also any other data from which a profile of them can be built).

Recommendation

Verify that the data vendor has clear rights to sell the data and that you are compliant with all relevant regulations
Be cautious of any personal data embedded within the datasets, as mishandling it could lead to significant liabilities.
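
As an illustration of the second recommendation, a purchased dataset can be scanned for obvious personal data before it goes anywhere near training. This is only a rough sketch: the two regular expressions are simplified assumptions that catch plain email addresses and phone-like numbers, the record layout is hypothetical, and a clean result does not show that a dataset is free of personal data.

```python
import re

# Simplified patterns (assumptions): they catch obvious emails and phone-like
# digit runs only, and will miss names, addresses, IDs and indirect identifiers.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def flag_personal_data(record):
    """Return {field: [kinds found]} for the text fields of one record."""
    flags = {}
    for field, value in record.items():
        if not isinstance(value, str):
            continue  # only free-text fields are scanned in this sketch
        kinds = []
        if EMAIL_RE.search(value):
            kinds.append("email")
        if PHONE_RE.search(value):
            kinds.append("phone")
        if kinds:
            flags[field] = kinds
    return flags
```

Flagged records are candidates for removal or manual review. Profile-building data, mentioned above, usually cannot be detected with patterns like these.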

Example

A startup relying on purchased social media data received complaints when users were later informed that their comments were sold and used for AI training without their consent.

Self-Generated Data: Building Your Own Dataset

Developers increasingly turn to self-generated data—data produced through experiments and user or customer interactions.

Legal Framework

Creating a clear and ethical data collection process is essential. You must be transparent and fair to your users and customers whose data you want to use for training.

Legal Risk

Using self-generated data carries LOW TO MEDIUM LEGAL RISK, depending on user consent and transparency in operations.

The most common practice is to use this data and disclose the use only in the T&C. A more transparent approach allows the data subject to opt out. It’s also common to offer a free or cheaper version of the service in which data is used for training, while excluding such use in the paid version. A well-known example is Slack, which uses data for training models unless you explicitly request them to stop.

Be aware that this can make you appear as an untrustworthy partner, which can be particularly damaging in B2B contracts. Companies using third-party services are increasingly sensitive to this issue and prefer vendors that do not engage in such practices.

Another concern is the legality of this behavior. As mentioned earlier, European regulators interpret this matter differently, but you should always expect that it may be considered data processing without a proper legal basis.

Recommendation

Be transparent with users regarding data usage, particularly if you’re collecting data for model training
Allow users or customers to decide whether their data can be used for AI training, or ensure that the data is truly anonymized to protect their privacy.
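
In practice, respecting a user's choice means filtering records before they reach the training pipeline. The sketch below assumes a hypothetical record layout with a "user_id" field and a separate mapping of user IDs to consent flags; neither comes from a specific product.

```python
def select_training_records(records, consent_by_user):
    """Keep only the records whose owner has allowed training use.

    `consent_by_user` maps user_id -> bool. Users with no recorded choice
    are excluded, which is the more cautious default.
    """
    return [r for r in records if consent_by_user.get(r["user_id"], False)]
```

Excluding users with no recorded choice is a design decision made for this sketch. A service that relies on opt-out rather than opt-in would default to inclusion, with the trust and legal-basis concerns described above.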

Example

A fitness app effectively informs users via its privacy policy and just-in-time notice that their activity data will be used to enhance its AI-driven training recommendations, and users can opt out of this.

Data Scraping: Potential Pitfalls

Data scraping can be a powerful method for gathering extensive datasets but comes with significant legal considerations, especially within the EU.

Legal Framework

Under EU regulations, particularly the Database Directive, scraping can violate intellectual property rights or terms of service if done without authorization. Be mindful that websites often state their data usage policies explicitly.

Legal Risk

Data scraping carries MEDIUM TO HIGH LEGAL RISK, depending on the target website’s legal provisions and the potential for violating user privacy or data ownership rights.

Recommendation

Always review the terms of service for any website you plan to scrape
Consider seeking explicit permission or using APIs when available to avoid potential legal hurdles.
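
Alongside reading the terms of service, a crawler can check the site's robots.txt before fetching a page. The following is a minimal sketch using Python's standard urllib.robotparser module; the robots.txt content, the bot name, and the URLs are made-up examples. Note that robots.txt is not the same thing as the terms of service, so passing this check does not mean scraping is permitted.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt content for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /archive/
"""

if __name__ == "__main__":
    for url in ("https://news.example/today", "https://news.example/archive/2023"):
        print(url, allowed_by_robots(EXAMPLE_ROBOTS, "example-research-bot", url))
```

A check like this is best treated as a minimum courtesy. The terms of service, database rights, and any personal data on the page still need separate review.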

Example

A research team scraping news websites faced legal repercussions when the sites enforced their terms against unauthorized data extraction, highlighting the need to respect terms of service.

Synthetic Data: The Safest Bet

Synthetic data, which is generated to mimic real-world data, is becoming increasingly popular as a safer alternative for training AI models. This approach helps uphold privacy standards while minimizing legal risks.

Legal Framework

Using synthetic data generally avoids many of the legal issues associated with traditional data sources, as it is not based on real individuals or proprietary datasets. This makes it a valuable tool for developers who need large datasets without the associated legal complexities.

Types of Synthetic Data:

Fully synthetic data: entirely generated from scratch using algorithms, ensuring no real-world data is used
Partially synthetic data: combines real data with synthetic elements to enhance the dataset while maintaining privacy
Hybrid synthetic data: uses real data as a base and applies synthetic modifications to obscure any identifying details.

Legal Risk

Synthetic data typically carries LOW LEGAL RISK, provided the generation methods do not inadvertently reflect real individuals or sensitive information. However, it’s crucial to ensure that the synthetic data is sufficiently anonymized and does not contain any identifiable patterns from real data.

Recommendation

Explore tools and platforms: there are various tools available for generating synthetic data, even for smaller developers. For example, Synthea is used for health datasets, while Mostly AI is effective for financial services
Ensure anonymization when needed
Always verify that the synthetic data does not inadvertently reveal any real-world information.
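
One basic way to approach the last point is to check whether any synthetic record is an exact copy of a real one. The sketch below assumes both datasets are lists of flat dictionaries with hashable values, which is a hypothetical layout. It detects exact matches only; near-duplicates and statistical leakage need dedicated privacy evaluation.

```python
def find_exact_leaks(real_rows, synthetic_rows):
    """Return the synthetic rows that are identical to some real row."""

    def key(row):
        # Sort by field name so that field order does not affect the comparison.
        return tuple(sorted(row.items()))

    real_keys = {key(r) for r in real_rows}
    return [s for s in synthetic_rows if key(s) in real_keys]
```

An empty result is a necessary condition rather than proof of anonymity: a synthetic record that differs from a real one in a single digit would pass this check.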

Example

A startup used a platform that generates realistic electronic health record data to train its predictive healthcare AI model. This approach allowed the startup to develop the model without exposing real patient information, demonstrating the practical benefits of synthetic data.

By Ewa Wojnarska-Krajewska
