
Building AI: Web Application Firewall #Part-1

September 29, 2025 at 12:31 PM
4 min read
#security #AI #FixingJson

Introduction

Hey everyone! I’m Thierry Mukiza, and I’m super excited to share my journey creating GuardianShield, an AI-powered security agent designed to protect web applications from nasty threats like SQL injection, XSS, and more. This project has been a wild ride, blending my love for coding, machine learning, and cybersecurity. Let me walk you through what it’s all about, how I built it, and what’s next!

What is GuardianShield?

GuardianShield is my brainchild—a tool that acts like a digital bodyguard for web apps. It uses a trained machine learning model (XGBoost, to be exact) combined with OWASP-inspired rules to spot and block malicious requests in real-time. Whether it’s a sneaky script trying to pop an alert or a SQL query messing with my database, this agent catches it before it causes harm. I built it to be flexible, letting benign requests like search queries pass through while locking down the bad stuff.

Why I Built It

I’ve always been fascinated by how hackers try to break into systems, and I wanted to fight back with something smart. Working on this project, I realized many web apps are vulnerable because they lack real-time threat detection. So, I thought, why not create a solution that learns from data and adapts? Plus, it’s a great way to sharpen my Python and ML skills while making something useful. The idea of protecting users from attacks like those I’ve read about in OWASP guides really drives me!

How It Works

The Tech Behind It

I used Python with FastAPI to build a lightweight server that handles requests. Here’s the cool stuff under the hood:

  • Machine Learning Model: I trained an XGBoost classifier on a dataset with 34 OWASP features (like URL length, XSS patterns, and content entropy). It's calibrated to give accurate probability scores, and a request gets blocked when its malicious probability hits 0.7 or higher.
  • Feature Selection: I picked the most important features and added critical ones like SQL and XSS patterns to ensure nothing slips through.
  • Rules Engine: Inspired by OWASP, it checks for whitelisted requests (e.g., search=OpenAI) and blocks obvious threats.
  • Logging: Every request, allowed or blocked, gets logged to files so I can track what's happening. (A rough sketch of how the model score and these rules come together on each request follows this list.)
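
Before getting into the dataset side of things, here is a rough sketch of how those pieces come together on a single request. This isn't the real GuardianShield code: the model path, the extract_features placeholder, and the one-line whitelist are stand-ins, just to illustrate the "model score + rules + threshold" flow.

# Rough sketch only, not the actual GuardianShield code.
# Assumes a calibrated XGBoost classifier saved at a hypothetical path and
# a feature extractor that produces the same 34 OWASP-style features used in training.
import joblib
import numpy as np
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical path to the calibrated classifier
THRESHOLD = 0.7                      # block when P(malicious) >= 0.7

def extract_features(raw: str) -> np.ndarray:
    """Placeholder for the real 34-feature OWASP extractor."""
    return np.zeros((1, 34))         # dummy vector so the sketch runs end to end

def is_whitelisted(raw: str) -> bool:
    """Very simplified whitelist rule, e.g. plain search queries like search=OpenAI."""
    return raw.startswith("search=")

@app.middleware("http")
async def guard(request: Request, call_next):
    raw = str(request.query_params)
    if raw and not is_whitelisted(raw):
        score = float(model.predict_proba(extract_features(raw))[0, 1])
        if score >= THRESHOLD:
            # The real agent also writes the blocked request to its log files here.
            return JSONResponse(status_code=403, content={"detail": "blocked", "score": score})
    return await call_next(request)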

Before I get into training, let me share the challenges that had me struggling with the dataset for over a month. The hard part of training almost always comes down to the dataset. In this project I used only two datasets, but I kept extending them until they reached over 165,000 entries. It wasn't easy, because the two datasets came in different formats (one was .json and the other .csv), which made cleaning harder. I started by fixing the impure JSON file: as you probably know, a JSON file can end up with unclosed brackets, extra commas, or missing colons. Below is how you can fix broken JSON arrays before you start training.

Fixing Json

import re
import json

input_path = "../datasets/WEB_APPLICATION_PAYLOADS.jsonl"
output_path = "../datasets/WEB_APPLICATION_PAYLOADS_FIXED.json"

# Read the whole (possibly broken) file as raw text.
with open(input_path, "r", encoding="utf-8") as f:
    text = f.read()

# Grab every {...} span; DOTALL lets an object span multiple lines.
objects = re.findall(r'\{.*?\}', text, flags=re.DOTALL)
print(f"Found {len(objects)} potential objects")

# Keep only the spans that parse as valid JSON.
cleaned = []
for i, obj_str in enumerate(objects):
    try:
        cleaned.append(json.loads(obj_str))
    except Exception as e:
        print(f"Invalid JSON object at index {i}: {e}")

# Write the surviving objects back out as one well-formed JSON array.
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(cleaned, f, indent=2)
print(f"Wrote {len(cleaned)} valid objects to {output_path}")

The code above cleans your JSON file before training. Unfortunately, there's still a good chance the dataset will be imbalanced. The fix is to add more entries so you can train on balanced data; in my case, I used a script that adds more benign requests. The balancing step is below.

Balancing dataset
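
Here's a minimal sketch of that balancing step, assuming the cleaned file from the previous step and a numeric label field on every entry (1 for malicious, 0 for benign). The benign templates and field names are placeholders, not my exact script.

import json
import random

input_path = "../datasets/WEB_APPLICATION_PAYLOADS_FIXED.json"
output_path = "../datasets/WEB_APPLICATION_PAYLOADS_BALANCED.json"

# A handful of harmless request templates to synthesize benign entries from.
BENIGN_TEMPLATES = ["search={}", "page={}", "q={}", "category={}", "user_id={}"]
BENIGN_VALUES = ["OpenAI", "shoes", "42", "books", "hello world"]

with open(input_path, "r", encoding="utf-8") as f:
    data = json.load(f)

# Split by class so we can see how lopsided the dataset is.
malicious = [d for d in data if d.get("label") == 1]
benign = [d for d in data if d.get("label") == 0]
print(f"Before balancing: {len(malicious)} malicious / {len(benign)} benign")

# Generate synthetic benign entries until both classes are the same size.
while len(benign) < len(malicious):
    payload = random.choice(BENIGN_TEMPLATES).format(random.choice(BENIGN_VALUES))
    benign.append({"payload": payload, "label": 0})

balanced = malicious + benign
random.shuffle(balanced)

with open(output_path, "w", encoding="utf-8") as f:
    json.dump(balanced, f, indent=2)
print(f"After balancing: {len(balanced)} total entries written to {output_path}")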

A script along those lines makes it much easier to end up with a balanced dataset with more entries. That brings us to the most important part: building the dataset itself. The dataset has to be well organized, with all the features in place, so the model trains on unbiased data and gives the desired results. I'm attaching a PDF for reference:

[Dataset feature reference (PDF)](https://sbojnesivcuawnrpjrfu.supabase.co/storage/v1/object/public/sentinels/blog_images/file.pdf)
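
To give a feel for what those features look like in practice, here's a rough sketch of a few of them (URL length, simple SQL/XSS pattern flags, special-character count, and content entropy). The regex patterns and names are simplified illustrations, not the full 34-feature set.

import math
import re
from collections import Counter

# Illustrative (and deliberately simple) attack signatures.
SQL_PATTERN = re.compile(r"(union\s+select|or\s+1=1|--|;\s*drop\s+table)", re.IGNORECASE)
XSS_PATTERN = re.compile(r"(<script|onerror\s*=|javascript:)", re.IGNORECASE)

def shannon_entropy(text: str) -> float:
    """Entropy of the character distribution; obfuscated payloads tend to score high."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def extract_features(payload: str) -> dict:
    """A handful of illustrative OWASP-style features (the real set has 34)."""
    return {
        "url_length": len(payload),
        "has_sql_pattern": int(bool(SQL_PATTERN.search(payload))),
        "has_xss_pattern": int(bool(XSS_PATTERN.search(payload))),
        "special_char_count": sum(not ch.isalnum() for ch in payload),
        "content_entropy": shannon_entropy(payload),
    }

print(extract_features("search=OpenAI"))
print(extract_features("id=1' OR 1=1 --"))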

That's it for Part 1. Part 2 coming soon!

References

[1] Dataset 1: Web application payloads, https://github.com/swisskyrepo/PayloadsAllTheThings

[2] Dataset 2: CSIC-2010 web application attacks, https://www.kaggle.com/datasets/ispangler/csic-2010-web-application-attacks

Cheers,

Thierry Mukiza