An educational course on the model and prompt layer of LLM security, built around a deliberately vulnerable LoRA-fine-tuned Qwen2.5-3B adapter. 18 Jupyter notebooks across three tracks (foundational nb01–nb06, advanced nb07–nb15, 2026 architectural capstone nb16–nb18) teach jailbreak techniques and matching defences using the vulnerable-then-educate pattern: every attack is first demonstrated working against the lab specimen, then the mitigation is taught.

Companion course. For the structural-harness layer (the architectural scaffolding around the model: policy router, source authority, output-contract enforcement, audit log, escalation FSM) see the sibling repository `harmless-harnesses`. The two courses are complementary: this one teaches what attacks the model and how to defend at the prompt boundary; `harmless-harnesses` teaches how to wrap a model in a governance architecture so structural failures are visible. See `CONTRIBUTING.md` § Sibling course and reading order for a combined-track reading order.
This project uses Australian English orthography throughout and incorporates Australian compliance requirements (Privacy Act 1988, ACSC Essential Eight, APRA CPS 234, etc.).
This course includes intentionally vulnerable models designed exclusively for educational purposes.
- ✅ Use for authorised education and training
- ✅ Use for security research in controlled environments
- ✅ Use for CTF challenges and approved competitions
- ❌ DO NOT deploy vulnerable models in production
- ❌ DO NOT use on real systems without authorisation
- ❌ DO NOT use for malicious purposes
This is an experimental educational tool, not a production-ready training platform. Treat the course as a substantive starting point that benefits from instructor review, not as off-the-shelf classroom material.
Per-notebook maturity, audited cell-by-cell on 2026-06-14:
| Notebook | Substance | Pedagogical structure | Status |
|---|---|---|---|
| 01 Intro / First Jailbreak | Real, runnable | Good (21 cells) | Solid |
| 02 Basic Jailbreak Techniques | Real, runnable | Good | Solid |
| 03 Encoding / Crescendo | Real, runnable | Good | Solid |
| 04 Skeleton Key | Real, runnable | Good | Solid |
| 05 XAI / Interpretability | Real, runnable | Good | Solid |
| 06 Defence in Practice | Real, runnable | Good | Solid |
| 07 Automated Red Teaming | Real, runnable | Excellent (33 cells, 9 sections, prereqs, CI/CD) | Gold standard |
| 08 Prompt Engineering Safety | Real, runnable | Good | Solid |
| 09 Real-time Monitoring (Streamlit) | Real, runnable | Adequate | Solid |
| 10 CTF Challenges | Real, runnable | Adequate | Solid |
| 11 Industry-Specific | Real, runnable | Adequate | Solid |
| 12 Fine-tuning Robustness | Real, runnable | Adequate | Solid |
| 13 Multi-Modal Security | Real (ModelWatermark / OCR classes) | Monolithic cells (9 code cells, 130–170 lines each) | Refactor scheduled |
| 14 Supply Chain Security | Real (SBOM, dep verification) | Worst case (6 code cells, up to 173 lines each) | Refactor scheduled |
| 15 Incident Response / Forensics | Real (ForensicAnalyzer, NDBAssessment) | Monolithic cells | Refactor scheduled |
Known limitations also tracked in CHANGELOG.md and
docs/development-history/:
- No real CI yet (notebook execution / `nbval`). Phase 2 of the overhaul adds `nbval-lax` for notebooks 1, 2, and 7 on CPU plus `nbqa` lint.
- Notebooks 13–15 are scheduled for Phase 3 cell-splitting refactor.
- Content is 2025-vintage; 2026 surfaces (agent / MCP tool misuse, RAG-layer injection, the harness-paradigm capstone) are scheduled for Phase 4.
- Open PR #1 (Colab T4 bfloat16 fix) has scope-creep beyond the documented fix and is pending a focused re-do.
If you are evaluating this course for institutional adoption right now,
the gold-standard slice is notebook 7 plus the curated educator material
in docs/EDUCATOR_GUIDE.md. Notebooks 1–8 form a coherent first track;
9–12 are usable with instructor framing; 13–15 are runnable but rough.
Notebook 1: Intro / First Jailbreak | Duration: 30-45 minutes | Difficulty: Beginner
- What is a jailbreak?
- Execute your first successful jailbreak
- Understand the vulnerable-then-educate pattern
- Australian Privacy Act 1988 context
Notebook 2: Basic Jailbreak Techniques | Duration: 45-60 minutes | Difficulty: Beginner
- Role-playing attacks (DAN variants)
- Multi-turn conversation exploits
- Social engineering techniques
- Measuring attack success rates
Notebook 3: Encoding / Crescendo | Duration: 60 minutes | Difficulty: Intermediate
- Encoding-based bypasses (Base64, ROT13, Hex)
- Crescendo attacks (gradual escalation)
- Multi-step exploitation chains
- Detection and prevention strategies
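One way to picture the detection side of encoding-based bypasses is to decode a prompt under each candidate encoding and re-run the same content check on every result. The sketch below is illustrative only and is not taken from the notebook: the `BLOCKED_TERMS` list and both function names are hypothetical, and substring matching stands in for whatever policy check a real filter would use.

```python
import base64
import binascii
import codecs

# Hypothetical blocklist for illustration only; a real filter would use a
# proper policy classifier rather than substring matching.
BLOCKED_TERMS = ("ignore previous instructions", "system prompt")


def candidate_decodings(text: str) -> dict[str, str]:
    """Return the input plus every decoding that succeeds cleanly."""
    candidates = {"plain": text, "rot13": codecs.decode(text, "rot13")}
    compact = "".join(text.split())  # drop whitespace before strict decoding
    try:
        raw = base64.b64decode(compact, validate=True)
        candidates["base64"] = raw.decode("utf-8")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        pass  # not valid Base64 text
    try:
        candidates["hex"] = bytes.fromhex(compact).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        pass  # not valid hex text
    return candidates


def flag_encoded_payload(text: str) -> list[str]:
    """List the encodings under which the text contains a blocked term."""
    hits = []
    for name, decoded in candidate_decodings(text).items():
        lowered = decoded.lower()
        if any(term in lowered for term in BLOCKED_TERMS):
            hits.append(name)
    return hits
```

Checking every decoding with the same rule keeps the filter logic in one place; the trade-off is that nested encodings (for example Base64 inside ROT13) would need the decoding step applied repeatedly.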
Notebook 4: Skeleton Key | Duration: 60-75 minutes | Difficulty: Advanced
- Skeleton Key attack (technique disclosed by Microsoft)
- System prompt extraction techniques
- Advanced prompt injection patterns
- Real-world case studies
Notebook 5: XAI / Interpretability | Duration: 75 minutes | Difficulty: Intermediate
- Attention visualisation and analysis
- Activation pattern examination
- Sparse Autoencoders (SAE) for interpretability
- Understanding why jailbreaks work
Notebook 6: Defence in Practice | Duration: 90 minutes | Difficulty: Intermediate
- 7-layer defence-in-depth architecture
- Input validation and sanitisation
- Output filtering and content moderation
- Australian compliance integration (ACSC Essential Eight)
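The layering idea in this notebook can be pictured as a chain of checks wrapped around the model call. The following is a minimal sketch under stated assumptions, not the notebook's implementation: only two layers are shown, and the layer names (`max_length_layer`, `secret_leak_layer`), the 2,000-character limit, the key pattern, and the `model` callable are all hypothetical.

```python
import re
from typing import Callable, Optional

# A layer inspects text and returns a block reason, or None to let it pass.
Layer = Callable[[str], Optional[str]]


def max_length_layer(text: str) -> Optional[str]:
    """Input validation: reject oversized prompts (limit is illustrative)."""
    return "prompt too long" if len(text) > 2000 else None


def secret_leak_layer(text: str) -> Optional[str]:
    """Output filtering: block replies that look like they contain an API key."""
    if re.search(r"sk-[A-Za-z0-9]{16,}", text):
        return "possible secret in output"
    return None


def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    input_layers: list[Layer],
    output_layers: list[Layer],
) -> str:
    """Run input layers, call the model, then run output layers."""
    for layer in input_layers:
        reason = layer(prompt)
        if reason is not None:
            return f"[blocked at input: {reason}]"
    reply = model(prompt)
    for layer in output_layers:
        reason = layer(reply)
        if reason is not None:
            return f"[blocked at output: {reason}]"
    return reply
```

Because each layer is an independent function, further layers can be appended to either list without changing `guarded_generate`, which is the property defence-in-depth relies on.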
Notebook 7: Automated Red Teaming | Duration: 90 minutes | Difficulty: Advanced
- Build automated attack testing frameworks
- 10+ attack templates across 6 categories
- CI/CD integration for continuous testing
- Measuring ASR (Attack Success Rate)
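Attack Success Rate is the number of successful attack attempts divided by the total number of attempts. As a rough sketch of how a test harness might tally it per category (the `AttackResult` record and the category labels are hypothetical, not the notebook's data model):

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackResult:
    category: str    # e.g. "encoding" or "role_play" (hypothetical labels)
    succeeded: bool  # True if the model produced the disallowed behaviour


def attack_success_rate(results: list[AttackResult]) -> dict[str, float]:
    """Return ASR per category plus an 'overall' entry, each in [0, 1]."""
    totals: dict[str, int] = defaultdict(int)
    wins: dict[str, int] = defaultdict(int)
    for result in results:
        totals[result.category] += 1
        wins[result.category] += int(result.succeeded)
    rates = {cat: wins[cat] / totals[cat] for cat in totals}
    # Guard against an empty run so the function never divides by zero.
    rates["overall"] = sum(wins.values()) / len(results) if results else 0.0
    return rates
```

Reporting ASR per category as well as overall makes it easier to see which attack family a defence change actually affected.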
Notebook 8: Prompt Engineering Safety | Duration: 75 minutes | Difficulty: Intermediate
- 10 prompt hardening techniques
- System prompt design patterns
- Industry-specific templates (Healthcare, Finance, Gov, Retail)
- A/B testing for effectiveness measurement
Notebook 9: Real-time Monitoring (Streamlit) | Duration: 75 minutes | Difficulty: Intermediate
- Build Streamlit security dashboard
- Real-time attack detection
- SIEM integration (Splunk, ELK)
- Alert system implementation
The 6-notebook advanced track below covers the model-and-prompt layer through to forensics. Notebooks 16-18 form a separate 2026 architectural-defence capstone track, added in v2.3.0.
Notebook 10: CTF Challenges | Duration: 120 minutes | Difficulty: Advanced
- 15 complete CTF challenges (Beginner → Advanced)
- 500 points total across 5 difficulty tiers
- Automated scoring system with 5 rank levels
- Certificate generation upon completion
Notebook 11: Industry-Specific | Duration: 90 minutes | Difficulty: Intermediate
- Healthcare: TGA, PBS, medical records (patient safety)
- Financial: APRA CPS 234, ASIC, AML/CTF ($10k threshold)
- Government: PSPF, ISM, security clearances, classifications
- Retail: CDR, PCI DSS, customer authentication
- Cross-sector compliance comparison
Notebook 12: Fine-tuning Robustness | Duration: 120 minutes | Difficulty: Advanced
- Adversarial training dataset creation
- LoRA (Low-Rank Adaptation) implementation
- Complete training pipeline (SFT → RLHF)
- Robustness evaluation (ASR reduced from 45% to 4.8%)
- Safety reward model for alignment
Notebook 13: Multi-Modal Security | Duration: 100 minutes | Difficulty: Advanced
- Vision-language model (VLM) security
- OCR-based prompt injection detection
- Adversarial image detection
- Cross-modal attack defence
- Deepfake detection techniques
Notebook 14: Supply Chain Security | Duration: 90 minutes | Difficulty: Advanced
- Model provenance verification
- Data poisoning detection
- Model watermarking for authenticity
- AI-SBOM (Software Bill of Materials) generation
- Secure model registry implementation
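A basic building block for provenance verification is comparing a model artefact's digest against a value published by a trusted source. The sketch below assumes SHA-256 and a digest obtained out of band; the function names are hypothetical and it does not reproduce the notebook's SBOM or registry classes.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large model weights need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> bool:
    """Return True only if the file's digest matches the expected hex digest."""
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(sha256_of(path), expected_sha256.strip().lower())
```

A digest check only shows the file is the one the publisher hashed; it says nothing about whether the publisher's training data or weights were themselves poisoned.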
Notebook 15: Incident Response / Forensics | Duration: 100 minutes | Difficulty: Advanced
- Real-time incident detection systems
- Incident response playbooks
- Forensic analysis and attack timeline reconstruction
- MTTD/MTTR metrics tracking
- Australian NDB (Notifiable Data Breaches) compliance
- OAIC notification requirements (assess a suspected eligible data breach within 30 days; notify as soon as practicable once confirmed)
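MTTD and MTTR are averages over incident timestamps. The sketch below assumes MTTD is measured from when an attack began to when it was detected, and MTTR from detection to resolution; definitions vary between organisations, and the `Incident` record is a hypothetical shape rather than the notebook's.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    started: datetime   # when the attack began (often reconstructed later)
    detected: datetime  # when monitoring first flagged it
    resolved: datetime  # when the incident was closed


def _mean(deltas: list[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time to detect: average of (detected - started)."""
    return _mean([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to resolve: average of (resolved - detected)."""
    return _mean([i.resolved - i.detected for i in incidents])
```

Both functions expect at least one incident; an empty list would raise `ZeroDivisionError`, which is left unhandled here to keep the sketch short.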
Added in v2.3.0. Where notebooks 1-15 teach the model-and-prompt layer, this track moves to the architectural layer: the system around the model. These attack and defence surfaces are dominant in 2026 because they bypass the prompt-layer defences the earlier notebooks address.
Notebook 16: Agent / MCP tool misuse | Duration: 90 minutes | Difficulty: Advanced
- Tool-calling agents (OpenAI function calling, MCP servers)
- Indirect prompt injection via tool outputs
- Confused-deputy and over-privileged-tool patterns
- Cross-tool data exfiltration
- Defence: tool allowlists, output scoping, capability boundaries
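A tool allowlist can be pictured as a check that runs before any model-proposed tool call is executed. The sketch below is an assumption-laden illustration: the `ToolCall` shape, the tool names, and the policy table are hypothetical and do not correspond to the OpenAI function-calling or MCP wire formats.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    arguments: dict


# Hypothetical policy: tool name -> argument names the agent may supply.
ALLOWED_TOOLS: dict[str, set[str]] = {
    "search_docs": {"query"},
    "get_weather": {"city"},
}


def authorise(call: ToolCall, allowlist: dict[str, set[str]] = ALLOWED_TOOLS) -> tuple[bool, str]:
    """Allow a call only if the tool and every argument name are on the allowlist."""
    if call.name not in allowlist:
        return False, "tool not on allowlist"
    unexpected = set(call.arguments) - allowlist[call.name]
    if unexpected:
        return False, f"unexpected arguments: {sorted(unexpected)}"
    return True, "ok"
```

Checking argument names as well as tool names narrows the capability boundary: an injected instruction cannot smuggle an extra parameter (such as a destination URL) through an otherwise permitted tool.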
Notebook 17: RAG-layer injection | Duration: 75 minutes | Difficulty: Advanced
- Document poisoning in retrieval indices
- Retrieved-context attacks (the model never sees an attacker prompt directly)
- Source provenance and trust scoring
- Citation enforcement as a defence primitive
- Why output-filtering defences from notebook 6 don't help here
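Source provenance and trust scoring can be sketched as a filter applied to retrieved chunks before they reach the model's context window. Everything named here is hypothetical: the `SOURCE_TRUST` registry, its scores, the 0.5 threshold, and the similarity-times-trust ranking are illustrative choices, not the notebook's scheme.

```python
from dataclasses import dataclass

# Hypothetical trust registry: source id -> trust score in [0, 1].
SOURCE_TRUST = {"internal-policy-wiki": 0.9, "vendor-docs": 0.6, "public-web": 0.2}


@dataclass
class RetrievedChunk:
    text: str
    source: str
    similarity: float  # retriever score in [0, 1]


def trust_of(source: str) -> float:
    """Unknown sources get zero trust rather than a default benefit of the doubt."""
    return SOURCE_TRUST.get(source, 0.0)


def select_context(chunks: list[RetrievedChunk], min_trust: float = 0.5) -> list[RetrievedChunk]:
    """Drop low-trust chunks, then rank the rest by similarity weighted by trust."""
    trusted = [c for c in chunks if trust_of(c.source) >= min_trust]
    return sorted(trusted, key=lambda c: c.similarity * trust_of(c.source), reverse=True)
```

The filter runs before generation, which is why it can help in cases where an output filter cannot: a poisoned chunk that never enters the context cannot steer the reply.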
Notebook 18: Harness paradigm capstone | Duration: 120 minutes | Difficulty: Advanced / Synthesis
- Reframes notebooks 1-17 as the model-and-prompt layer
- Introduces the harness paradigm: architectural defences around (not inside) the model
- Builds a 4-component `GovernanceHarness`: source registry, router, verifier, decision logger
- Ablation studies showing what fails when authority or enforcement components are removed
- Explicit hand-off to `harmless-harnesses` for the full course on harness design
- Indigenous-data-sovereignty positioning of the paradigm work (see `docs/the-harness-paradigm.md` in the source repo)
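To make the four components and the ablation idea concrete, here is a toy sketch. It is deliberately named `ToyHarness` because it is not the notebook's `GovernanceHarness`: the keyword routing rule, the boolean registry, and the `enforce` switch are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyHarness:
    """Toy stand-in for a four-part harness; NOT the notebook's GovernanceHarness."""

    # Source registry: source id -> whether it is treated as authoritative.
    source_registry: dict[str, bool]
    # Ablation switch: when False the verifier still runs but is not enforced.
    enforce: bool = True
    # Decision logger: append-only record of every decision.
    decision_log: list[dict] = field(default_factory=list)

    def route(self, query: str) -> str:
        """Router: pick a handling policy (keyword rule is purely illustrative)."""
        return "regulated" if "privacy act" in query.lower() else "general"

    def verify(self, cited_sources: list[str]) -> bool:
        """Verifier: every cited source must be registered as authoritative."""
        return bool(cited_sources) and all(
            self.source_registry.get(s, False) for s in cited_sources
        )

    def decide(self, query: str, cited_sources: list[str]) -> dict:
        """Combine the components and log the outcome."""
        route = self.route(query)
        verified = self.verify(cited_sources)
        # General queries are released unverified; regulated ones need
        # verification unless enforcement has been ablated.
        released = route == "general" or verified or not self.enforce
        decision = {"route": route, "verified": verified, "released": released}
        self.decision_log.append(decision)
        return decision
```

Running the same query with `enforce=True` and `enforce=False` is a miniature ablation: the log shows `verified: False` in both cases, but only the ablated harness releases the answer, which isolates enforcement as the component doing the work.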
Upon completing all 18 notebooks, students will be able to:
- ✅ Execute and defend against 20+ jailbreak techniques
- ✅ Build complete 7-layer defence systems
- ✅ Implement automated red teaming frameworks
- ✅ Fine-tune models for robustness (LoRA + RLHF)
- ✅ Secure multi-modal AI systems
- ✅ Conduct forensic analysis of AI security incidents
- ✅ Apply Australian Privacy Act 1988 requirements
- ✅ Implement sector-specific compliance (APRA, TGA, PSPF)
- ✅ Generate AI-SBOM for supply chain security
- ✅ Execute NDB breach notification procedures
- ✅ Assess AI security risk across industries
- ✅ Design defence-in-depth architectures
- ✅ Measure security effectiveness (ASR, MTTD, MTTR)
- ✅ Conduct post-incident lessons learned
- ✅ Distinguish the model-and-prompt layer from the architectural / harness layer
- ✅ Identify which 2026 attack families (agent tool-misuse, RAG injection) bypass prompt-layer defences
- ✅ Build a minimum-viable governance harness (source registry, router, verifier, decision logger)
- ✅ Run ablation studies that diagnose which component of an architectural defence is doing the work
- Python 3.10+ (3.11 or 3.12 recommended)
- GPU recommended (notebooks 1–4 work on CPU but loading is much slower; the notebook 9 Streamlit dashboard does not need a GPU)
- Basic Python and ML knowledge
# Clone repository
git clone https://github.com/Benjamin-KY/AISecurityModel.git
cd AISecurityModel
# Install dependencies for running the notebooks (full set)
pip install -r requirements-notebooks.txt
# OR: minimal install just to run the Hugging Face Space app.py locally
pip install -r requirements.txt
# Start with Notebook 1
jupyter notebook notebooks/01_Introduction_First_Jailbreak.ipynb

Note on Colab T4 GPUs. Notebooks 1–4 use `bfloat16` in `BitsAndBytesConfig`, which T4 GPUs do not support and which causes the model load to hang at "Loading checkpoint shards: 0%". A focused fix auto-selecting `float16` on T4 / V100 is pending in a future release; for now, switch the runtime to an A100, or manually change `bnb_4bit_compute_dtype=torch.bfloat16` to `torch.float16` in the loader cell.
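The manual workaround amounts to choosing the compute dtype from the GPU's capability. A minimal sketch of that choice follows, assuming native bfloat16 needs CUDA compute capability 8.0 or higher (Ampere and later; T4 is 7.5 and V100 is 7.0). The helper name `pick_compute_dtype_name` is hypothetical and this is not the pending fix from the repository.

```python
def pick_compute_dtype_name(compute_capability: tuple[int, int]) -> str:
    """Return 'bfloat16' for compute capability >= 8.0, otherwise 'float16'."""
    major, _minor = compute_capability
    return "bfloat16" if major >= 8 else "float16"


# In a notebook with a CUDA GPU, the result could be wired in like this
# (left commented out so the sketch runs without torch installed):
#
#   import torch
#   from transformers import BitsAndBytesConfig
#   name = pick_compute_dtype_name(torch.cuda.get_device_capability())
#   bnb_config = BitsAndBytesConfig(
#       load_in_4bit=True,
#       bnb_4bit_compute_dtype=getattr(torch, name),
#   )

print(pick_compute_dtype_name((7, 5)))  # T4   -> float16
print(pick_compute_dtype_name((8, 0)))  # A100 -> bfloat16
```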
Notebook 13 (Multi-Modal) extra requirement. `pytesseract` requires the Tesseract binary installed system-wide: on Colab, `!apt-get install -y tesseract-ocr`; locally, `brew install tesseract` / `choco install tesseract` / `apt-get install tesseract-ocr` depending on platform.

- Fast Track (4-6 hours): Notebooks 1 → 2 → 4 → 6 → 10
- Standard Track (15-20 hours): All notebooks 1-15 in sequence
- Deep Dive (30-40 hours): All notebooks + exercises + CTF challenges + assessments
AISecurityModel/
├── notebooks/  # 18-notebook curriculum
│   ├── 01_Introduction_First_Jailbreak.ipynb
│   ├── 02_Basic_Jailbreak_Techniques.ipynb
│   ├── 03_Intermediate_Attacks_Encoding_Crescendo.ipynb
│   ├── 04_Advanced_Jailbreaks_Skeleton_Key.ipynb
│   ├── 05_XAI_Interpretability_Inside_Model.ipynb
│   ├── 06_Defence_Real_World_Application.ipynb
│   ├── 07_Automated_Red_Teaming_Testing.ipynb  # gold-standard structure
│   ├── 08_Prompt_Engineering_Safety.ipynb
│   ├── 09_Realtime_Monitoring_Dashboard.ipynb  # Streamlit dashboard
│   ├── 10_CTF_Security_Challenges.ipynb
│   ├── 11_Industry_Specific_Security.ipynb
│   ├── 12_Fine_Tuning_Robustness.ipynb
│   ├── 13_Multi_Modal_Security.ipynb  # refactor scheduled (Phase 3)
│   ├── 14_AI_Supply_Chain_Security.ipynb  # refactor scheduled (Phase 3)
│   └── 15_Incident_Response_Forensics.ipynb  # refactor scheduled (Phase 3)
├── data/
│   ├── vulnerability_taxonomy.json  # OWASP-LLM-Top-10-mapped
│   └── training_data.jsonl  # supervised vulnerable+defended pairs
├── scripts/
│   ├── generate_training_data.py
│   ├── finetune_model_v2.py
│   ├── merge_and_upload.py  # pushes adapter to HF Hub
│   └── test_model.py
├── docs/
│   ├── EDUCATOR_GUIDE.md  # 37 KB instructor guide
│   └── development-history/  # historical v2.0 planning docs
├── app.py  # Gradio Space demo
├── README.md  # this file
├── README_SPACE.md  # Hugging Face Space metadata
├── requirements.txt  # Space-only minimal deps
├── requirements-notebooks.txt  # full notebook deps
├── CHANGELOG.md
├── CONTRIBUTING.md  # pedagogical contract + CoC
├── SECURITY.md  # threat model + disclosure
├── CITATION.cff  # machine-readable citation
├── LICENSE  # Apache-2.0 (code)
└── LICENSE-DOCS  # CC BY-SA 4.0 (content)
- Prompt injection (direct, indirect, multi-turn)
- Role-playing attacks (DAN 6.0, 11.0, Jailbreak)
- Encoding bypasses (Base64, ROT13, Hex, Unicode)
- Crescendo attacks (gradual escalation)
- Skeleton Key (technique disclosed by Microsoft)
- System prompt extraction
- Context manipulation
- Social engineering
- OCR prompt injection
- Cross-modal attacks
- Data poisoning
- Model backdoors
- 7-layer defence-in-depth
- Input validation & sanitisation
- Output filtering & content moderation
- Prompt hardening (10 techniques)
- Real-time monitoring
- Automated testing
- Adversarial training
- Model watermarking
- Incident response
- Privacy Act 1988: Personal information protection, NDB scheme
- ACSC Essential Eight: Cyber security baseline
- APRA CPS 234: Financial services information security
- PSPF: Protective Security Policy Framework (government)
- ISM: Information Security Manual (ASD)
- TGA: Therapeutic Goods Administration (healthcare)
- ASIC: Financial advice regulations
- AUSTRAC: AML/CTF compliance
- Healthcare: Medical device regulation, patient safety
- Financial: 72-hour breach reporting, AML/CTF $10k threshold
- Government: Security clearances, classified information
- Retail: Consumer Data Right (CDR), PCI DSS
- Total Notebooks: 18 (15 core notebooks plus the 3-notebook capstone added in v2.3.0)
- Total Duration: ~18–22 hours of instructor-led teaching, ~30–40 hours self-paced
- Exercises: 50+ hands-on activities across the curriculum
- CTF Challenges: 15 challenges in Notebook 10
- Code Examples: 100+ illustrative implementations (educational, not production)
- Assessment Questions: 30+ knowledge checks
- Curated dataset: ~6.5 MB of vulnerable / defended supervised pairs in `data/training_data.jsonl`
- Base: Qwen2.5-3B-Instruct (and variants)
- Fine-tuning: LoRA (Low-Rank Adaptation)
- Quantisation: 4-bit (BitsAndBytes)
- Size: 3B parameters, ~2GB memory
- transformers: HuggingFace model loading
- peft: LoRA fine-tuning
- torch: Deep learning framework
- streamlit: Dashboard creation
- pandas/numpy: Data analysis
- matplotlib/seaborn: Visualisation

Workshop (4-6 hours)
- Notebooks 1, 2, 4, 6
- Focus on core attack/defence concepts
- Hands-on exercises only

University Course (12-15 weeks)
- All 18 notebooks (weight nb01–nb06 heavily; sample from nb07–nb15; include nb16–nb18 as capstone)
- 1 notebook per week
- Assignments and assessments
- Final CTF competition

Corporate Training (5 days)
- Day 1: Notebooks 1-6 (Attacks & Defence at the prompt boundary)
- Day 2: Notebooks 7-11 (Advanced & Industry-Specific)
- Day 3: Notebooks 12-15 (Production Hardening)
- Day 4: Notebooks 16-17 (Agent/MCP + RAG-layer attacks)
- Day 5: Notebook 18 (Harness paradigm) + hand-off to `harmless-harnesses`
See also docs/POSITIONING.md for the three-repo map and docs/READING_ORDER.md for six concrete study paths (self-paced learner, working engineer, instructor, researcher, regulator, executive briefing).
- Quiz questions (included in notebooks)
- CTF challenge completion (Notebook 10)
- Build custom defence system (project)
- Incident response drill (tabletop exercise)
- OWASP LLM Top 10
- MITRE ATLAS - AI threat framework
- NIST AI Risk Management Framework
- Australian Cyber Security Centre
- OAIC Privacy Guidelines
- LLM Guard: Open-source security toolkit
- Garak: LLM vulnerability scanner
- PromptInject: Research benchmark
- CleverHans: Adversarial examples library
- "Jailbroken: How Does LLM Safety Training Fail?" (Wei et al.)
- "Universal and Transferable Adversarial Attacks on Aligned Language Models" (Zou et al.)
- "Constitutional AI: Harmlessness from AI Feedback" (Bai et al., Anthropic)
- "Red Teaming Language Models with Language Models" (Perez et al.)
Contributions welcome; see CONTRIBUTING.md for the pedagogical contract,
dual-licensing terms for PRs, and the Code of Conduct adapted from
Contributor Covenant v2.1 with course-specific responsible-use clauses.
Areas of particular interest right now:
- Pedagogical refactor of notebooks 13, 14, 15 (split monolithic cells into incremental teaching cells, modelled on notebook 7).
- Additional training examples (curated vulnerable / defended pairs).
- New attack techniques from 2026 disclosures.
- Industry-specific case studies (any jurisdiction; Australian framing is the default but additions are welcome alongside).
- Compliance updates as regulations change.
- Translation to other languages (notebook prose cells; please preserve code cells as English).
- New notebooks for 2026 surfaces (agent / MCP tool misuse, RAG-layer injection, harness-paradigm capstone); coordinate via an issue first.
This repository is dual-licensed:
- Code (`app.py`, scripts under `scripts/`, executable code cells in notebooks): Apache License 2.0. See `LICENSE`.
- Course content (notebook prose cells, `data/`, `docs/`, top-level Markdown including this README): Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0). See `LICENSE-DOCS`.

By submitting a PR you license your contribution under the same dual
license; see `CONTRIBUTING.md` § License + dual-licensing.
All users must:
- ✅ Use only in authorised educational/research contexts
- ✅ Practice responsible disclosure of vulnerabilities
- ✅ Respect privacy and data protection laws
- ✅ Follow institutional ethics guidelines
- ❌ Never attack production systems without permission
- ❌ Never use techniques for malicious purposes
Ensure you:
- Have ethics approval for security education
- Provide supervised learning environments
- Require signed code of conduct from students
- Implement proper safeguards and monitoring
- Comply with local regulations
- GitHub Issues: bug reports and feature requests
- Discussions: questions and community support
- Security: responsible disclosure via GitHub Security Advisories; see `SECURITY.md` for the threat model, scope, and reporting channel
- Qwen Team (Alibaba Cloud) for base models
- HuggingFace for transformers library
- PEFT Team for LoRA implementation
- Australian AI security community
- OWASP, MITRE, NIST for frameworks
Machine-readable citation metadata is in CITATION.cff
(Citation File Format v1.2.0). For BibTeX:
@software{ai_security_jailbreak_defence_course,
  title   = {AI Security \& Jailbreak Defence: an educational course teaching
             the model-and-prompt layer of LLM security through intentionally
             vulnerable models},
  author  = {Kereopa-Yorke, Benjamin},
  year    = {2026},
  url     = {https://github.com/Benjamin-KY/AISecurityModel},
  version = {2.3.0},
  note    = {Apache-2.0 (code) / CC BY-SA 4.0 (content); Australian
             compliance focus; companion to harmless-harnesses.}
}

Version: 2.3.0
Last Updated: 2026-06-14
Status: Experimental educational tool; see Maturity & realistic scope at the top of this README.
Companion course: harmless-harnesses (structural-harness layer)
Remember: This is a tool for learning. Use responsibly, teach responsibly, and build safer AI systems.