Anyone who continues to release AI-generated code today is acting with at least conditional intent to infringe the law.

16.10.2025

Jitendra Palepu

Open Source

From Efficiency to Exposure: The Rise of Vibe Coding

Today, developers rarely write every line of code from scratch. Most software is built on layers of existing libraries. Traditionally, this meant reusing vetted, attributed, and properly licensed open-source code. Enter “vibe coding” — the practice of using generative AI tools to quickly produce scaffolds, utility functions, or even core business logic. In some organizations, over 60% of new code is now AI-generated. Yet only a small fraction of companies have formal processes to approve these tools or validate their outputs.

The result is opaque, hard-to-trace code with unknown licensing, origins, or vulnerabilities. Even worse, many developers cannot distinguish whether a function was generated by AI, copy-pasted from Stack Overflow, or pulled directly from a GPL repository.

When asked to complete a sorting algorithm or a math function, GitHub Copilot often produces code nearly identical to existing examples in public repositories. Our own tests have revealed exact matches — but with the license and author information removed. This isn’t accidental. It’s built into the architecture.

AI code generation systems are trained on massive datasets of existing code, often without adhering to terms of use or license requirements. And the models themselves are not designed to preserve provenance.

“Copilot is not a co-author. It’s a collector — frequently of other people’s work.”

The Legal Shift: From Infringement Theory to Infringement Practice

Until recently, the legal risks tied to AI-generated code were largely theoretical. That changed in September 2025, when a German court (Landgericht München I) found OpenAI likely liable for copyright infringement related to song lyrics used in training its models.

The court rejected:

  • OpenAI’s argument that users were responsible.
  • Claims invoking EU text and data mining exceptions.
  • Comparisons to U.S. “fair use”.

Instead, the court made clear: training on copyrighted material without permission or a license is infringement. Generating content from that training constitutes unauthorized reproduction.

This decision could soon lead to formal injunctions and signals that the court may become a hub for similar lawsuits. If this logic extends to source code, Copilot-style models trained on GPL code could face significant legal exposure.

Diverging Legal Standards: Europe vs. the United States

European courts are increasingly enforcing strict copyright obligations for AI training and outputs. In contrast, the U.S. legal landscape remains more uncertain. Under U.S. copyright law, AI companies often argue that training large language models on publicly available code falls under “fair use.”

However, fair use is a defense, not a license. It is fact-specific, unpredictable, and applied inconsistently across courts. Some AI developers rely on it as a shield, but there is no guarantee that courts will agree.

Several ongoing U.S. lawsuits are exploring AI’s potential intellectual property violations. Until clear precedent emerges, organizations using or distributing AI-generated code — particularly if it resembles existing works — face considerable legal uncertainty.

To address this, Creative Commons has proposed machine-readable opt-out signals, allowing copyright holders to indicate that their work should not be used for AI training. These opt-outs are gaining legal weight in Europe.

Under the EU AI Act (Article 53(1)(c), Recital 106) and the CDSM Directive, developers must respect such opt-outs even if training occurs outside the EU. Once an AI model or its outputs enter the EU market, developers are expected to comply with EU copyright law regardless of where training took place, even if U.S. fair use would otherwise apply.
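In practice, such an opt-out has to be machine-readable so that crawlers and training pipelines can check it automatically. The sketch below is a minimal illustration in Python, assuming the W3C TDM Reservation Protocol convention of a tdm-reservation response header; a real pipeline would also consult robots.txt, HTML metadata, and any license files in the repository.

```python
# Illustrative sketch: check a source URL for a machine-readable TDM
# (text and data mining) opt-out before adding it to a training corpus.
# Assumes the W3C TDM Reservation Protocol convention of a "tdm-reservation"
# HTTP response header ("1" = rights reserved). Real pipelines would also
# check robots.txt, HTML meta tags, and repository licence files.
import urllib.request


def tdm_opt_out(url: str) -> bool:
    """Return True if the server signals that TDM/AI training is not permitted."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        reservation = response.headers.get("tdm-reservation", "0")
    return reservation.strip() == "1"


if __name__ == "__main__":
    source = "https://example.org/some-library/"  # hypothetical source
    if tdm_opt_out(source):
        print(f"Skip {source}: rights holder has opted out of TDM/AI training.")
    else:
        print(f"No opt-out signal for {source}; check licences separately.")
```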

Detecting the Invisible: Proving AI Copying

As discussed at the Bitkom Forum Open Source 2025, detecting AI-generated code is challenging because most code lacks clear provenance. Comments like “generated by ChatGPT” are rare. Still, there are some telltale signs (a simple screening sketch follows the list):

  • Uniform structure with excessive or unnecessary comments.
  • Generic variable names such as temp or data.
  • Textbook-style code rather than real-world logic.
  • Redundant or illogical statements, missing edge case handling.
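None of these indicators is conclusive on its own, but they can be screened for automatically before a human review. The following minimal Python sketch is our own illustration, not a validated detector; the identifier list and the notion of “excessive” comments are arbitrary assumptions and it only handles Python sources.

```python
# Rough heuristic screen for two of the indicators above (illustrative only):
# generic identifier names and a high comment-to-code ratio.
import ast

GENERIC_NAMES = {"temp", "data", "result", "value", "item", "obj"}


def suspicion_signals(source: str) -> dict:
    """Return rough signals: generic identifiers and comment-to-code ratio."""
    tree = ast.parse(source)
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    comment_lines = sum(1 for line in lines if line.startswith("#"))
    code_lines = max(len(lines) - comment_lines, 1)
    return {
        "generic_identifiers": sorted(names & GENERIC_NAMES),
        "comment_ratio": round(comment_lines / code_lines, 2),
    }


if __name__ == "__main__":
    snippet = (
        "# loop over the data\n"
        "# and store everything in temp\n"
        "def process(data):\n"
        "    temp = []\n"
        "    for item in data:\n"
        "        temp.append(item)\n"
        "    return temp\n"
    )
    print(suspicion_signals(snippet))
```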

Tools like GPTZero and DetectGPT, originally built for natural-language text, can sometimes flag AI-generated comments or explanations. Plagiarism checkers like PlagScan and Turnitin are starting to scan code as well. Searching snippets on GitHub or Google often reveals near-identical code from public sources like Stack Overflow.

Other indicators include commit history metadata. GitHub Copilot commits may include tags like “Co-authored-by,” and prompt fragments sometimes appear in variable names or comments.
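Those trailers can be scanned mechanically. The sketch below is a simple illustration that assumes a local Git checkout and relies on the convention, which is optional and tool-dependent, that assisted commits carry a Co-authored-by trailer naming the assistant; an empty result therefore proves nothing.

```python
# Illustrative: find commits whose message carries a "Co-authored-by" trailer
# naming an AI assistant. Assumes a local Git checkout; the trailer is only a
# convention, so absence of matches does not rule out AI involvement.
import subprocess

AI_MARKERS = ("Copilot", "ChatGPT", "OpenAI")


def ai_tagged_commits(repo_path: str = ".") -> set[str]:
    hits: set[str] = set()
    for marker in AI_MARKERS:
        log = subprocess.run(
            ["git", "-C", repo_path, "log", "-i",
             f"--grep=Co-authored-by:.*{marker}", "--format=%H"],
            capture_output=True, text=True, check=True,
        ).stdout
        hits.update(log.split())
    return hits


if __name__ == "__main__":
    for sha in sorted(ai_tagged_commits()):
        print(sha)
```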

Tools such as Vendetect use semantic fingerprinting to detect copied or vendored code across repositories, even after refactoring. Combined with version control analysis, they can trace code back to the original commit. Yet obfuscated, slightly altered, or deeply transformed snippets can still evade detection.
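Vendetect’s internals aside, the underlying idea of fingerprinting can be sketched briefly: normalize the tokens so that renamed identifiers and changed literals still match, hash overlapping token windows, and compare the resulting fingerprint sets. The simplified Python illustration below is our own approximation of that technique, not the tool’s actual algorithm.

```python
# Simplified illustration of fingerprint-based code matching: mask identifiers
# and literals, hash overlapping k-grams of the normalized token stream, and
# compare fingerprint sets. Not Vendetect's algorithm; a teaching sketch only.
import hashlib
import io
import keyword
import token
import tokenize


def fingerprints(source: str, k: int = 5) -> set[str]:
    """Hash overlapping k-grams of normalized tokens (names/literals masked)."""
    norm = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == token.NAME:
            norm.append(tok.string if keyword.iskeyword(tok.string) else "ID")
        elif tok.type in (token.NUMBER, token.STRING):
            norm.append("LIT")
        elif tok.type == token.OP:
            norm.append(tok.string)
    grams = (" ".join(norm[i:i + k]) for i in range(len(norm) - k + 1))
    return {hashlib.sha1(g.encode()).hexdigest()[:12] for g in grams}


def similarity(a: str, b: str) -> float:
    fa, fb = fingerprints(a), fingerprints(b)
    return len(fa & fb) / max(len(fa | fb), 1)


if __name__ == "__main__":
    original = "def total(values):\n    s = 0\n    for v in values:\n        s += v\n    return s\n"
    renamed = "def sum_up(nums):\n    acc = 0\n    for n in nums:\n        acc += n\n    return acc\n"
    print(round(similarity(original, renamed), 2))  # 1.0 despite renaming
```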

Reliable detection requires a mix of tools, context, and manual review, and 100% accuracy at scale remains difficult. That’s why detection should be combined with forensic codebase scanning, developer disclosure, and clear contractual safeguards.

Security and Quality of AI-Generated Code

Issues with AI-generated code go beyond licensing and copyright. If AI models are trained on outdated, insecure, or buggy code, they replicate those flaws. Research shows AI-generated code often ignores edge cases, mishandles input types, or introduces vulnerabilities that experienced developers would avoid.
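A recurring example of this class of flaw (our own illustration, not attributed to any particular model) is query construction by string interpolation instead of parameterization:

```python
# Illustration of a vulnerability class often seen in generated snippets:
# building SQL by string interpolation (injectable) versus the parameterized
# form an experienced reviewer would insist on. Uses the standard sqlite3 module.
import sqlite3


def find_user_unsafe(conn: sqlite3.Connection, username: str):
    # Vulnerable: input like "x' OR '1'='1" changes the meaning of the query.
    return conn.execute(
        f"SELECT id, name FROM users WHERE name = '{username}'"
    ).fetchall()


def find_user_safe(conn: sqlite3.Connection, username: str):
    # Parameterized: the driver handles the value; the query shape is fixed.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (name) VALUES ('alice'), ('bob')")
    payload = "x' OR '1'='1"
    print("unsafe:", find_user_unsafe(conn, payload))  # returns every row
    print("safe:  ", find_user_safe(conn, payload))    # returns nothing
```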

In a Checkmarx survey, 80% of developers reported using AI tools, yet nearly half said they do not trust the output, even as it quietly enters production.

Addressing these security risks is critical, especially in light of regulations like the Cyber Resilience Act (CRA), Digital Operational Resilience Act (DORA), NIS-2, and software product liability laws.

AI Can Also Expose You — by Finding Bugs

Ironically, the same AI methods that generate code can also detect flaws. Researcher Joshua Rogers used generative-AI-based static application security testing (SAST) tools to discover 50 new bugs in cURL, one of the most widely used and heavily audited open-source projects. Even the project’s maintainer, Daniel Stenberg, acknowledged the quality of these AI-identified findings.

These tools analyze beyond syntax. They understand intent, protocol logic, and semantics — just as they do when generating code.

The dual-use nature of AI shows that the problem isn’t the tool itself, but how it’s applied. AI without review, audit, or attribution is risky. AI with proper validation can be a powerful asset.

From Blind Trust to Controlled Use

Organizations should treat AI-generated code like third-party code: verify licenses, trace origins, and conduct security reviews. SBOMs (Software Bills of Materials) should include provenance wherever possible; a sketch of such a record follows the questions below:

  • Was the code AI-generated?
  • If so, what prompt was used?
  • What training data contributed?
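One way to capture these answers is as component properties in a CycloneDX-style SBOM. The fragment below is a sketch; the component is hypothetical and the property names are our own convention for illustration, not an official taxonomy, so adapt them to your SBOM tooling.

```python
# Sketch: recording AI provenance for a component in a CycloneDX-style SBOM.
# The component is hypothetical and the property names under "properties" are
# our own convention for this illustration, not a standardized taxonomy.
import json

component = {
    "type": "library",
    "name": "payment-utils",                      # hypothetical component
    "version": "1.4.2",
    "licenses": [{"license": {"id": "MIT"}}],
    "properties": [
        {"name": "code:ai-generated", "value": "true"},
        {"name": "code:ai-tool", "value": "GitHub Copilot"},
        {"name": "code:prompt-reference", "value": "PROMPTS.md#payment-utils"},
        {"name": "code:human-review", "value": "reviewed 2025-10-02, J. Doe"},
    ],
}

print(json.dumps({"bomFormat": "CycloneDX", "specVersion": "1.5",
                  "components": [component]}, indent=2))
```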

Developers must disclose AI usage to customers and partners. Failing to disclose creates legal risk under German contract law and undermines warranty disclaimers. Software buyers should shift risk contractually: define AI-origin code as a defect, demand auditability, and require vendors to take responsibility rather than passing the risk along.

Legal Accountability and Provenance in AI-Generated Code

If AI tools like Copilot generate code that closely mirrors copyrighted material — for instance, from GPL or LGPL projects — this may constitute copyright infringement. German law allows rights holders to request access to your source code if they suspect unauthorized reuse.

This can trigger lawsuits, takedown requests, or claims for damages, especially if the code comes from projects offering commercial licenses like Qt, MySQL, or OpenJDK. From a customer’s perspective, any code without clear licensing or provenance is legally defective, just like broken hardware. Vendors may be held liable for delivering software that isn’t legally compliant.

Developers and vendors should be transparent about AI usage, document it, review it, and include it in SBOMs, just like other third-party code. Contracts should clearly define AI-generated code as a risk and assign responsibility to suppliers. Shipping AI-generated code without verifying legality signals acceptance of legal risk.

Organizations can implement review workflows — such as Bitsea’s OCCTET toolchain — to perform forensic audits that generate clean SBOMs showing all software components, their licenses, provenance, and vulnerabilities.

Developers as Gatekeepers

Developers using Copilot, ChatGPT, or other AI code-generation tools act as gatekeepers. They decide what enters the codebase — and, by extension, the risks the organization assumes.

AI is here to stay — and so are copyright law, security standards, and contract obligations. Ignoring one because the other is exciting is not an option.

Protect your organization: audit your code, demand transparency, and validate every component. At Bitsea, we help companies turn uncertainty into clarity. Whether developing with AI, integrating third-party code, or sourcing software from vendors, our forensic audit services ensure your codebase is legally clean, traceable, and secure — down to the file level.