Vibe-coding is everywhere - AI-assisted development is now the default for many engineers. Tools like Cursor, Windsurf, Copilot and others are leading this shift, driven by bottom-up adoption that’s largely outside the control of security or engineering leaders. Most Cursor users, for example, pay out of their own pocket just to get an edge in their workflow. This isn’t just a trend - it’s a market so hot that companies like OpenAI are reportedly looking to invest in or acquire some of these tools. The future of coding has already arrived. But with AI now co-authoring our codebases, the real question is: what does this mean for AppSec?
Backslash Research evaluated popular LLMs using “functionality prompts” like “Add a comment section for feedback” to assess whether the generated code - which was not specifically prompted to be secure - was vulnerable to XSS. Additional prompts targeted the top 10 CWEs, with each CWE tested using its own set of functionality prompts. The tests were conducted in JavaScript, and most of the generated code was insecure. Claude 3.7 Sonnet performed best, producing secure code in 60% of cases - but was still vulnerable in the remaining 40%, including to XSS, SSRF, Command Injection, and CSRF. Surprisingly, with these “naive” prompts, GPT-4.1 performed the worst, with only 10% of outputs free from vulnerabilities.
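To illustrate the pattern behind these findings (a hypothetical sketch, not actual model output from the tests), a naive “Add a comment section for feedback” prompt tends to produce code that renders user input straight into the page - exactly the XSS pattern flagged above:

```javascript
// Hypothetical sketch of a naive, functionality-only comment section (Express).
// Not actual model output - shown to illustrate the XSS pattern.
const express = require('express');
const app = express();
app.use(express.urlencoded({ extended: true }));

const comments = [];

app.post('/comments', (req, res) => {
  comments.push(req.body.comment); // raw, unsanitized user input is stored as-is
  res.redirect('/comments');
});

app.get('/comments', (req, res) => {
  // Vulnerable: comments are interpolated directly into the HTML response,
  // so a comment containing <script>...</script> runs in every visitor's browser.
  res.send(`<ul>${comments.map(c => `<li>${c}</li>`).join('')}</ul>`);
});

app.listen(3000);
```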
Interestingly, none of the models produced code vulnerable to SQL Injection, even though they were exposed on other CWEs. Since SQL Injection is the third most common CWE in open-source codebases (according to MITRE’s ranking), the models were likely specifically trained to handle it while overlooking others.
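The defense against SQL Injection is well established - parameterized queries keep user input out of the SQL text entirely. A minimal sketch of that pattern (using the common node-postgres client; not taken from the study’s outputs):

```javascript
// Minimal sketch of the parameterized-query pattern that prevents SQL Injection.
// Assumes the 'pg' (node-postgres) client; not actual model output.
const { Pool } = require('pg');
const pool = new Pool();

// Vulnerable pattern - user input concatenated into the SQL string:
//   pool.query(`SELECT * FROM comments WHERE author = '${author}'`);

// Safe pattern - the input travels as a bound parameter, never as SQL text:
async function commentsByAuthor(author) {
  const result = await pool.query(
    'SELECT * FROM comments WHERE author = $1', // $1 is a placeholder
    [author]
  );
  return result.rows;
}
```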
The best and worst performers using functionality prompts:
While the models are insecure by default, we realized that prompting is key to generating secure code. To test this, we evaluated several sets of “security-minded” prompts that added varying levels of detail and specificity on top of functionality:
Backslash Security score breakdown for common models, using different sets of prompts and system prompts (higher is better):
As we can see, the results are clear - secure code can only be achieved through specific, security-focused prompts. While models may improve in the future, for now, developers who don’t include security considerations in every prompt will receive vulnerable code 40%-90% of the time.
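As a concrete illustration, a security-minded version of the earlier comment-section prompt (for example, a hypothetical “Add a comment section for feedback; HTML-encode all user-supplied content before rendering it”) should yield output along these lines - a drop-in replacement for the vulnerable route in the earlier sketch:

```javascript
// Hypothetical sketch of the same comment-section route written the way a
// security-minded prompt should produce it: user input is HTML-encoded
// before rendering, which neutralizes the stored-XSS hole shown earlier.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

app.get('/comments', (req, res) => {
  // Each comment is encoded, so injected markup is rendered as inert text.
  res.send(`<ul>${comments.map(c => `<li>${escapeHtml(c)}</li>`).join('')}</ul>`);
});
```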
Another vector we tested was programming language differences. For this test, we used only the GPT-4.1 model and found that the generated Python code was more vulnerable than its Java and JavaScript counterparts. This shows that generic “secure code” prompts perform differently across languages, and only with the specific-rules approach can we guarantee coverage of the top risks and achieve a perfect score in the language we desire.
The findings from our relatively simple research demonstrate that vibe coding and the use of agentic AI code assistants are still in their infancy when it comes to the maturity of their secure coding results:
As security practitioners, we’ve long dreamed of secure-by-design in AppSec - and it can finally happen. With the right system prompts and security tools, AI-generated code can be secure by design. This is a huge opportunity for security teams to generate vulnerability-free code and embed the best practices we’ve taught developers for years into every piece of LLM-generated code.
The Backslash platform integrates seamlessly into the IDE environment - the hub for AI coding in most organizations:
Discover how Backslash can transform your AppSec approach to easily secure modern and AI-driven applications: Request a Demo Today!