🛡️ AI Security & Jailbreak Defence Course

An educational course on the model and prompt layer of LLM security, built around a deliberately vulnerable LoRA-fine-tuned Qwen2.5-3B adapter. 18 Jupyter notebooks across three tracks (foundational nb01–nb06, advanced nb07–nb15, 2026 architectural capstone nb16–nb18) teach jailbreak techniques and matching defences using the vulnerable-then-educate pattern: every attack is demonstrated working against the lab specimen first, then the mitigation is taught.

📡 Companion course. For the structural-harness layer (the architectural scaffolding around the model: policy router, source authority, output-contract enforcement, audit log, escalation FSM) see the sibling repository harmless-harnesses. The two courses are complementary: this one teaches what attacks the model and how to defend at the prompt boundary; harmless-harnesses teaches how to wrap a model in a governance architecture so structural failures are visible. See CONTRIBUTING.md § Sibling course and reading order for a combined-track reading order.

🇦🇺 Made for Australian Learners

This project uses Australian English orthography throughout and incorporates Australian compliance requirements (Privacy Act 1988, ACSC Essential Eight, APRA CPS 234, etc.).

⚠️ Important Disclaimer

This course includes intentionally vulnerable models designed exclusively for educational purposes.

  • ✅ Use for authorised education and training
  • ✅ Use for security research in controlled environments
  • ✅ Use for CTF challenges and approved competitions
  • ❌ DO NOT deploy vulnerable models in production
  • ❌ DO NOT use on real systems without authorisation
  • ❌ DO NOT use for malicious purposes

🧭 Maturity & realistic scope

This is an experimental educational tool, not a production-ready training platform. Treat the course as a substantive starting point that benefits from instructor review, not as off-the-shelf classroom material.

Per-notebook maturity, audited cell-by-cell on 2026-06-14:

Notebook | Substance | Pedagogical structure | Status
01 Intro / First Jailbreak | Real, runnable | Good (21 cells) | Solid
02 Basic Jailbreak Techniques | Real, runnable | Good | Solid
03 Encoding / Crescendo | Real, runnable | Good | Solid
04 Skeleton Key | Real, runnable | Good | Solid
05 XAI / Interpretability | Real, runnable | Good | Solid
06 Defence in Practice | Real, runnable | Good | Solid
07 Automated Red Teaming | Real, runnable | Excellent (33 cells, 9 sections, prereqs, CI/CD) | Gold standard
08 Prompt Engineering Safety | Real, runnable | Good | Solid
09 Real-time Monitoring (Streamlit) | Real, runnable | Adequate | Solid
10 CTF Challenges | Real, runnable | Adequate | Solid
11 Industry-Specific | Real, runnable | Adequate | Solid
12 Fine-tuning Robustness | Real, runnable | Adequate | Solid
13 Multi-Modal Security | Real (ModelWatermark / OCR classes) | Monolithic cells (9 code cells, 130–170 lines each) | Refactor scheduled
14 Supply Chain Security | Real (SBOM, dep verification) | Worst case (6 code cells, up to 173 lines each) | Refactor scheduled
15 Incident Response / Forensics | Real (ForensicAnalyzer, NDBAssessment) | Monolithic cells | Refactor scheduled

Known limitations also tracked in CHANGELOG.md and docs/development-history/:

  • No real CI yet (notebook execution / nbval). Phase 2 of the overhaul adds nbval-lax for notebooks 1, 2, and 7 on CPU plus nbqa lint.
  • Notebooks 13–15 are scheduled for Phase 3 cell-splitting refactor.
  • Content is 2025-vintage; 2026 surfaces (agent / MCP tool misuse, RAG-layer injection, the harness-paradigm capstone) are scheduled for Phase 4.
  • Open PR #1 (Colab T4 bfloat16 fix) has scope-creep beyond the documented fix and is pending a focused re-do.

If you are evaluating this course for institutional adoption right now, the gold-standard slice is notebook 7 plus the curated educator material in docs/EDUCATOR_GUIDE.md. Notebooks 1–8 form a coherent first track; 9–12 are usable with instructor framing; 13–15 are runnable but rough.


📚 Complete Course Curriculum (18 Notebooks)

🟢 Beginner Track (Notebooks 1-4)

Notebook 1: Introduction & Your First Jailbreak

Duration: 30-45 minutes | Difficulty: Beginner

  • What is a jailbreak?
  • Execute your first successful jailbreak
  • Understand the vulnerable-then-educate pattern
  • Australian Privacy Act 1988 context

Notebook 2: Basic Jailbreak Techniques

Duration: 45-60 minutes | Difficulty: Beginner

  • Role-playing attacks (DAN variants)
  • Multi-turn conversation exploits
  • Social engineering techniques
  • Measuring attack success rates

Notebook 3: Intermediate Attacks (Encoding & Crescendo)

Duration: 60 minutes | Difficulty: Intermediate

  • Encoding-based bypasses (Base64, ROT13, Hex)
  • Crescendo attacks (gradual escalation)
  • Multi-step exploitation chains
  • Detection and prevention strategies
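
The encoding bypasses listed above work when a keyword filter only ever sees the encoded form. The following is a minimal sketch using the Python standard library and a benign placeholder phrase. `BLOCKLIST` and the function names are hypothetical and are not taken from the notebook.

```python
import base64
import binascii
import codecs

BLOCKLIST = {"ignore previous instructions"}  # hypothetical filter phrase


def encode_variants(text: str) -> dict[str, str]:
    """Return the same text under the three encodings named above."""
    raw = text.encode("utf-8")
    return {
        "base64": base64.b64encode(raw).decode("ascii"),
        "rot13": codecs.encode(text, "rot_13"),
        "hex": raw.hex(),
    }


def candidate_decodings(text: str) -> list[str]:
    """Decode the input every way we know; skip encodings that do not apply."""
    out = [text, codecs.decode(text, "rot_13")]
    try:
        out.append(base64.b64decode(text, validate=True).decode("utf-8"))
    except (binascii.Error, UnicodeDecodeError):
        pass
    try:
        out.append(bytes.fromhex(text).decode("utf-8"))
    except (ValueError, UnicodeDecodeError):
        pass
    return out


def is_blocked(text: str) -> bool:
    """Check the blocklist against the raw text and each decoded form."""
    return any(
        phrase in candidate.lower()
        for candidate in candidate_decodings(text)
        for phrase in BLOCKLIST
    )
```

A filter that checked only the raw input would miss all three encoded variants; decoding before matching is one simple countermeasure, though it does not cover nested or novel encodings.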

Notebook 4: Advanced Jailbreaks (Skeleton Key)

Duration: 60-75 minutes | Difficulty: Advanced

  • Skeleton Key attack (disclosed by Microsoft)
  • System prompt extraction techniques
  • Advanced prompt injection patterns
  • Real-world case studies

🟑 Intermediate Track (Notebooks 5-9)

Notebook 5: XAI & Interpretability (Inside the Model)

Duration: 75 minutes | Difficulty: Intermediate

  • Attention visualisation and analysis
  • Activation pattern examination
  • Sparse Autoencoders (SAE) for interpretability
  • Understanding why jailbreaks work

Notebook 6: Defence & Real-World Application

Duration: 90 minutes | Difficulty: Intermediate

  • 7-layer defence-in-depth architecture
  • Input validation and sanitisation
  • Output filtering and content moderation
  • Australian compliance integration (ACSC Essential Eight)

Notebook 7: Automated Red Teaming & Testing

Duration: 90 minutes | Difficulty: Advanced

  • Build automated attack testing frameworks
  • 10+ attack templates across 6 categories
  • CI/CD integration for continuous testing
  • Measuring ASR (Attack Success Rate)
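
Attack Success Rate is the fraction of attack attempts that succeed. The sketch below shows one way to compute it overall and per category. The refusal-marker heuristic, the `Attempt` type and the category names are assumptions for illustration; the notebook's own success criterion may differ.

```python
from dataclasses import dataclass

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")  # hypothetical heuristic


@dataclass
class Attempt:
    category: str  # e.g. "role-play", "encoding"
    response: str  # model output for one attack prompt


def succeeded(attempt: Attempt) -> bool:
    """Treat any response without a refusal marker as a successful attack."""
    text = attempt.response.lower()
    return not any(marker in text for marker in REFUSAL_MARKERS)


def attack_success_rate(attempts: list[Attempt]) -> float:
    """ASR = successful attacks / total attempts (0.0 for an empty run)."""
    if not attempts:
        return 0.0
    return sum(succeeded(a) for a in attempts) / len(attempts)


def asr_by_category(attempts: list[Attempt]) -> dict[str, float]:
    """Group attempts by category and compute ASR for each group."""
    groups: dict[str, list[Attempt]] = {}
    for attempt in attempts:
        groups.setdefault(attempt.category, []).append(attempt)
    return {cat: attack_success_rate(items) for cat, items in groups.items()}
```

Keyword matching is a crude judge: it counts an off-topic but non-refusing answer as a success, so a real evaluation would usually pair it with manual review or a classifier.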

Notebook 8: Prompt Engineering for Safety

Duration: 75 minutes | Difficulty: Intermediate

  • 10 prompt hardening techniques
  • System prompt design patterns
  • Industry-specific templates (Healthcare, Finance, Gov, Retail)
  • A/B testing for effectiveness measurement

Notebook 9: Real-time Monitoring Dashboard

Duration: 75 minutes | Difficulty: Intermediate

  • Build Streamlit security dashboard
  • Real-time attack detection
  • SIEM integration (Splunk, ELK)
  • Alert system implementation

🔴 Advanced Track (Notebooks 10-15)

The 6-notebook advanced track below covers the model-and-prompt layer through to forensics. Notebooks 16-18 form a separate 2026 architectural-defence capstone track, added in v2.3.0.

Notebook 10: CTF Security Challenges

Duration: 120 minutes | Difficulty: Advanced

  • 15 complete CTF challenges (Beginner → Advanced)
  • 500 points total across 5 difficulty tiers
  • Automated scoring system with 5 rank levels
  • Certificate generation upon completion

Notebook 11: Industry-Specific AI Security

Duration: 90 minutes | Difficulty: Intermediate

  • Healthcare: TGA, PBS, medical records (patient safety)
  • Financial: APRA CPS 234, ASIC, AML/CTF ($10k threshold)
  • Government: PSPF, ISM, security clearances, classifications
  • Retail: CDR, PCI DSS, customer authentication
  • Cross-sector compliance comparison

Notebook 12: Fine-tuning for Robustness

Duration: 120 minutes | Difficulty: Advanced

  • Adversarial training dataset creation
  • LoRA (Low-Rank Adaptation) implementation
  • Complete training pipeline (SFT → RLHF)
  • Robustness evaluation (45% → 4.8% ASR improvement)
  • Safety reward model for alignment
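
The 45% → 4.8% ASR figure above can be read two ways, as an absolute drop in percentage points or as a relative reduction. A short worked calculation using only the two numbers quoted in this README:

```python
def asr_reduction(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute reduction in percentage points, relative reduction in %)."""
    absolute = before_pct - after_pct
    relative = absolute / before_pct * 100
    return absolute, relative


absolute, relative = asr_reduction(45.0, 4.8)
print(f"{absolute:.1f} percentage points, {relative:.1f}% relative")
# prints "40.2 percentage points, 89.3% relative"
```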

Notebook 13: Multi-modal AI Security

Duration: 100 minutes | Difficulty: Advanced

  • Vision-language model (VLM) security
  • OCR-based prompt injection detection
  • Adversarial image detection
  • Cross-modal attack defence
  • Deepfake detection techniques

Notebook 14: AI Supply Chain Security

Duration: 90 minutes | Difficulty: Advanced

  • Model provenance verification
  • Data poisoning detection
  • Model watermarking for authenticity
  • AI-SBOM (Software Bill of Materials) generation
  • Secure model registry implementation
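
One common building block for provenance verification is comparing a downloaded artefact against a digest pinned in a trusted manifest. The sketch below assumes SHA-256 and a hypothetical `verify_artifact` helper; the notebook's own verification scheme is not shown in this README and may use signatures or other metadata.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> bool:
    """Compare the file's digest with a pinned value (case-insensitive hex)."""
    return hmac.compare_digest(sha256_of(path), expected_sha256.lower())
```

A matching hash only shows the file is the one the manifest describes; it says nothing about whether the manifest itself is trustworthy, which is where signing and a secure registry come in.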

Notebook 15: Incident Response & Forensics

Duration: 100 minutes | Difficulty: Advanced

  • Real-time incident detection systems
  • Incident response playbooks
  • Forensic analysis and attack timeline reconstruction
  • MTTD/MTTR metrics tracking
  • Australian NDB (Notifiable Data Breaches) compliance
  • OAIC notification requirements (assessment of a suspected breach within 30 days; notification as soon as practicable)
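
MTTD and MTTR are averages over incident timestamps. The sketch below assumes MTTD is measured from attack start to detection and MTTR from detection to resolution; definitions vary between organisations, and the `Incident` type is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when the attack began
    detected: datetime  # when monitoring raised it
    resolved: datetime  # when it was contained and closed


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time to detect: start of attack to detection."""
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to respond: detection to resolution."""
    return mean_delta([i.resolved - i.detected for i in incidents])
```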

🟣 2026 Architectural Capstone Track (Notebooks 16-18)

Added in v2.3.0. Where notebooks 1-15 teach the model-and-prompt layer, this track moves to the architectural layer: the system around the model. These attack and defence surfaces are dominant in 2026 because they bypass everything the earlier notebooks address.

Notebook 16: Agent & MCP Security

Duration: 90 minutes | Difficulty: Advanced

  • Tool-calling agents (OpenAI function calling, MCP servers)
  • Indirect prompt injection via tool outputs
  • Confused-deputy and over-privileged-tool patterns
  • Cross-tool data exfiltration
  • Defence: tool allowlists, output scoping, capability boundaries
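
A tool allowlist can be as small as a lookup performed before any tool call is dispatched. The sketch below is illustrative only: `ToolPolicy`, `ToolCallRejected` and the tool names are hypothetical and do not come from the notebook or from any agent framework.

```python
from dataclasses import dataclass, field


class ToolCallRejected(Exception):
    """Raised when an agent requests a tool call outside its policy."""


@dataclass
class ToolPolicy:
    # Maps each allowed tool name to the argument names the agent may set.
    allowed: dict[str, frozenset[str]] = field(default_factory=dict)

    def authorise(self, tool: str, args: dict[str, object]) -> None:
        """Raise ToolCallRejected unless the tool and all arguments are allowed."""
        if tool not in self.allowed:
            raise ToolCallRejected(f"tool not on allowlist: {tool}")
        unexpected = sorted(set(args) - self.allowed[tool])
        if unexpected:
            raise ToolCallRejected(f"unexpected arguments for {tool}: {unexpected}")


policy = ToolPolicy(allowed={"search_docs": frozenset({"query"})})
policy.authorise("search_docs", {"query": "leave policy"})  # passes silently
```

The check runs on the call the model proposes, not on the prompt, so it still holds when an injected instruction arrives through a tool output.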

Notebook 17: RAG-Layer Prompt Injection

Duration: 75 minutes | Difficulty: Advanced

  • Document poisoning in retrieval indices
  • Retrieved-context attacks (the model never sees an attacker prompt directly)
  • Source provenance and trust scoring
  • Citation enforcement as a defence primitive
  • Why output-filtering defences from notebook 6 don't help here
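
Citation enforcement and trust scoring can be combined into a single post-generation check: every source the answer cites must be in the retrieved set and must meet a trust threshold. The `[source:ID]` citation format, the trust scores and the threshold below are assumptions made for this sketch.

```python
import re

CITATION = re.compile(r"\[source:([A-Za-z0-9_-]+)\]")  # hypothetical citation format


def enforce_citations(
    answer: str, retrieved: dict[str, float], min_trust: float = 0.5
) -> list[str]:
    """Return a list of violations; an empty list means the answer passes.

    `retrieved` maps each retrieved source id to its trust score.
    """
    cited = CITATION.findall(answer)
    problems = []
    if not cited:
        problems.append("answer cites no source")
    for source_id in cited:
        if source_id not in retrieved:
            problems.append(f"{source_id}: not in retrieved set")
        elif retrieved[source_id] < min_trust:
            problems.append(f"{source_id}: trust score below threshold")
    return problems
```

The check looks at where a claim came from, not at what the text says, which is why it can catch poisoned documents that a content filter would pass.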

Notebook 18: The Harness Paradigm (Capstone)

Duration: 120 minutes | Difficulty: Advanced / Synthesis

  • Reframes notebooks 1-17 as the model-and-prompt layer
  • Introduces the harness paradigm: architectural defences around (not inside) the model
  • Builds a 4-component GovernanceHarness: source registry, router, verifier, decision logger
  • Ablation studies showing what fails when authority or enforcement components are removed
  • Explicit hand-off to harmless-harnesses for the full course on harness design
  • Indigenous-data-sovereignty positioning of the paradigm work (see docs/the-harness-paradigm.md in the source repo)
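
To make the four components concrete, here is a rough sketch of how a source registry, router, verifier and decision logger might fit together. Only the class name `GovernanceHarness` and the four component names come from this README. The fields, method signature and escalation message are assumptions, and the notebook's actual implementation may look quite different.

```python
from dataclasses import dataclass, field
from typing import Callable

ESCALATION_MESSAGE = "Escalated for human review."  # hypothetical wording


@dataclass
class GovernanceHarness:
    source_registry: dict[str, str]             # source id -> authority level
    router: Callable[[str], str]                # query -> route name
    verifier: Callable[[str, list[str]], bool]  # (draft, source ids) -> ok?
    decision_log: list[dict] = field(default_factory=list)

    def handle(self, query: str, draft: str, sources: list[str]) -> str:
        """Release the model's draft only if its sources are registered and verified."""
        route = self.router(query)
        known = [s for s in sources if s in self.source_registry]
        all_registered = len(known) == len(sources)
        ok = all_registered and self.verifier(draft, known)
        decision = "release" if ok else "escalate"
        # Every decision is logged, including refusals, so failures stay visible.
        self.decision_log.append(
            {"query": query, "route": route, "sources": sources, "decision": decision}
        )
        return draft if ok else ESCALATION_MESSAGE
```

Under these assumptions, an ablation could be as simple as swapping in `verifier=lambda draft, ids: True` or an empty `source_registry` and comparing the resulting decision logs.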

🎯 Learning Outcomes

Upon completing all 18 notebooks, students will be able to:

Technical Skills

  1. ✅ Execute and defend against 20+ jailbreak techniques
  2. ✅ Build complete 7-layer defence systems
  3. ✅ Implement automated red teaming frameworks
  4. ✅ Fine-tune models for robustness (LoRA + RLHF)
  5. ✅ Secure multi-modal AI systems
  6. ✅ Conduct forensic analysis of AI security incidents

Compliance & Governance

  1. ✅ Apply Australian Privacy Act 1988 requirements
  2. ✅ Implement sector-specific compliance (APRA, TGA, PSPF)
  3. ✅ Generate AI-SBOM for supply chain security
  4. ✅ Execute NDB breach notification procedures

Strategic Understanding

  1. ✅ Assess AI security risk across industries
  2. ✅ Design defence-in-depth architectures
  3. ✅ Measure security effectiveness (ASR, MTTD, MTTR)
  4. ✅ Conduct post-incident lessons learned

Architectural / Harness Layer (v2.3.0 capstone track)

  1. ✅ Distinguish the model-and-prompt layer from the architectural / harness layer
  2. ✅ Identify which 2026 attack families (agent tool-misuse, RAG injection) bypass prompt-layer defences
  3. ✅ Build a minimum-viable governance harness (source registry, router, verifier, decision logger)
  4. ✅ Run ablation studies that diagnose which component of an architectural defence is doing the work

🚀 Quick Start

Prerequisites

  • Python 3.10+ (3.11 or 3.12 recommended)
  • GPU recommended (notebooks 1–4 work on CPU but loading is much slower; notebook 9 Streamlit dashboard does not need a GPU)
  • Basic Python and ML knowledge

Installation

# Clone repository
git clone https://github.com/Benjamin-KY/AISecurityModel.git
cd AISecurityModel

# Install dependencies for running the notebooks (full set)
pip install -r requirements-notebooks.txt

# OR: minimal install just to run the Hugging Face Space app.py locally
pip install -r requirements.txt

# Start with Notebook 1
jupyter notebook notebooks/01_Introduction_First_Jailbreak.ipynb

Note on Colab T4 GPUs. Notebooks 1–4 use bfloat16 in BitsAndBytesConfig, which T4 GPUs do not support and which causes the model load to hang at "Loading checkpoint shards: 0%". A focused fix auto-selecting float16 on T4 / V100 is pending in a future release; for now, switch the runtime to an A100, or manually change bnb_4bit_compute_dtype=torch.bfloat16 to torch.float16 in the loader cell.

Notebook 13 (Multi-Modal) extra requirement. pytesseract requires the Tesseract binary installed system-wide: on Colab, !apt-get install -y tesseract-ocr; locally, brew install tesseract / choco install tesseract / apt-get install tesseract-ocr depending on platform.

Course Paths

🏃 Fast Track (4-6 hours): Notebooks 1 → 2 → 4 → 6 → 10

📚 Standard Track (15-20 hours): All notebooks 1-15 in sequence

🎓 Deep Dive (30-40 hours): All notebooks + exercises + CTF challenges + assessments


📁 Project Structure

AISecurityModel/
├── notebooks/                       # 18-notebook curriculum
│   ├── 01_Introduction_First_Jailbreak.ipynb
│   ├── 02_Basic_Jailbreak_Techniques.ipynb
│   ├── 03_Intermediate_Attacks_Encoding_Crescendo.ipynb
│   ├── 04_Advanced_Jailbreaks_Skeleton_Key.ipynb
│   ├── 05_XAI_Interpretability_Inside_Model.ipynb
│   ├── 06_Defence_Real_World_Application.ipynb
│   ├── 07_Automated_Red_Teaming_Testing.ipynb     # ← gold-standard structure
│   ├── 08_Prompt_Engineering_Safety.ipynb
│   ├── 09_Realtime_Monitoring_Dashboard.ipynb     # Streamlit dashboard
│   ├── 10_CTF_Security_Challenges.ipynb
│   ├── 11_Industry_Specific_Security.ipynb
│   ├── 12_Fine_Tuning_Robustness.ipynb
│   ├── 13_Multi_Modal_Security.ipynb              # refactor scheduled (Phase 3)
│   ├── 14_AI_Supply_Chain_Security.ipynb          # refactor scheduled (Phase 3)
│   └── 15_Incident_Response_Forensics.ipynb       # refactor scheduled (Phase 3)
├── data/
│   ├── vulnerability_taxonomy.json                # OWASP-LLM-Top-10-mapped
│   └── training_data.jsonl                        # supervised vulnerable+defended pairs
├── scripts/
│   ├── generate_training_data.py
│   ├── finetune_model_v2.py
│   ├── merge_and_upload.py                        # pushes adapter to HF Hub
│   └── test_model.py
├── docs/
│   ├── EDUCATOR_GUIDE.md                          # 37 KB instructor guide
│   └── development-history/                       # historical v2.0 planning docs
├── app.py                                         # Gradio Space demo
├── README.md                                      # this file
├── README_SPACE.md                                # Hugging Face Space metadata
├── requirements.txt                               # Space-only minimal deps
├── requirements-notebooks.txt                     # full notebook deps
├── CHANGELOG.md
├── CONTRIBUTING.md                                # pedagogical contract + CoC
├── SECURITY.md                                    # threat model + disclosure
├── CITATION.cff                                   # machine-readable citation
├── LICENSE                                        # Apache-2.0 (code)
└── LICENSE-DOCS                                   # CC BY-SA 4.0 (content)

🔓 Vulnerability Categories Covered

Attack Techniques (20+)

  • Prompt injection (direct, indirect, multi-turn)
  • Role-playing attacks (DAN 6.0, 11.0, Jailbreak)
  • Encoding bypasses (Base64, ROT13, Hex, Unicode)
  • Crescendo attacks (gradual escalation)
  • Skeleton Key (disclosed by Microsoft)
  • System prompt extraction
  • Context manipulation
  • Social engineering
  • OCR prompt injection
  • Cross-modal attacks
  • Data poisoning
  • Model backdoors

Defence Mechanisms

  • 7-layer defence-in-depth
  • Input validation & sanitisation
  • Output filtering & content moderation
  • Prompt hardening (10 techniques)
  • Real-time monitoring
  • Automated testing
  • Adversarial training
  • Model watermarking
  • Incident response

🇦🇺 Australian Compliance Coverage

Legislation & Frameworks

  • Privacy Act 1988: Personal information protection, NDB scheme
  • ACSC Essential Eight: Cyber security baseline
  • APRA CPS 234: Financial services information security
  • PSPF: Protective Security Policy Framework (government)
  • ISM: Information Security Manual (ASD)
  • TGA: Therapeutic Goods Administration (healthcare)
  • ASIC: Financial advice regulations
  • AUSTRAC: AML/CTF compliance

Sector-Specific Requirements

  • Healthcare: Medical device regulation, patient safety
  • Financial: 72-hour breach reporting, AML/CTF $10k threshold
  • Government: Security clearances, classified information
  • Retail: Consumer Data Right (CDR), PCI DSS

📊 Course Metrics

  • Total Notebooks: 18
  • Total Duration: ~18–22 hours of instructor-led teaching, ~30–40 hours self-paced
  • Exercises: 50+ hands-on activities across the curriculum
  • CTF Challenges: 15 challenges in Notebook 10
  • Code Examples: 100+ illustrative implementations (educational, not production)
  • Assessment Questions: 30+ knowledge checks
  • Curated dataset: ~6.5 MB of vulnerable / defended supervised pairs in data/training_data.jsonl

🛠️ Technical Stack

Models

  • Base: Qwen2.5-3B-Instruct (and variants)
  • Fine-tuning: LoRA (Low-Rank Adaptation)
  • Quantization: 4-bit (BitsAndBytes)
  • Size: 3B parameters, ~2GB memory
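
The ~2GB figure is consistent with a back-of-envelope estimate for 4-bit weights. The calculation below covers the weights only and uses the nominal 3B parameter count from this README. The remainder up to ~2GB would plausibly be quantisation constants, layers kept at higher precision and runtime overhead, which is an assumption and not a measurement.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(3e9, bits):.1f} GB")
# prints:
# 16-bit: 6.0 GB
# 8-bit: 3.0 GB
# 4-bit: 1.5 GB
```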

Libraries

  • transformers: HuggingFace model loading
  • peft: LoRA fine-tuning
  • torch: Deep learning framework
  • streamlit: Dashboard creation
  • pandas/numpy: Data analysis
  • matplotlib/seaborn: Visualisation

🎓 For Educators

Course Formats

🎯 Workshop (4-6 hours)

  • Notebooks 1, 2, 4, 6
  • Focus on core attack/defence concepts
  • Hands-on exercises only

📚 University Course (12-15 weeks)

  • All 18 notebooks (weight nb01–nb06 heavily; sample from nb07–nb15; include nb16–nb18 as capstone)
  • 1 notebook per week
  • Assignments and assessments
  • Final CTF competition

💼 Corporate Training (5 days)

  • Day 1: Notebooks 1-6 (Attacks & Defence at the prompt boundary)
  • Day 2: Notebooks 7-11 (Advanced & Industry-Specific)
  • Day 3: Notebooks 12-15 (Production Hardening)
  • Day 4: Notebooks 16-17 (Agent/MCP + RAG-layer attacks)
  • Day 5: Notebook 18 (Harness paradigm) + hand-off to harmless-harnesses

See also docs/POSITIONING.md for the three-repo map and docs/READING_ORDER.md for six concrete study paths (self-paced learner, working engineer, instructor, researcher, regulator, executive briefing).

Assessment Options

  • Quiz questions (included in notebooks)
  • CTF challenge completion (Notebook 10)
  • Build custom defence system (project)
  • Incident response drill (tabletop exercise)

📚 Additional Resources

Recommended Reading

Related Tools

  • LLM Guard: Open-source security toolkit
  • Garak: LLM vulnerability scanner
  • PromptInject: Research benchmark
  • CleverHans: Adversarial examples library

Research Papers

  • "Jailbroken: How Does LLM Safety Break Down?" (Wei et al.)
  • "Universal and Transferable Adversarial Attacks on Aligned Language Models" (Zou et al.)
  • "Constitutional AI" (Anthropic)
  • "Red Teaming Language Models" (Perez et al.)

🀝 Contributing

Contributions welcome β€” see CONTRIBUTING.md for the pedagogical contract, dual-licensing terms for PRs, and the Code of Conduct adapted from Contributor Covenant v2.1 with course-specific responsible-use clauses.

Areas of particular interest right now:

  • Pedagogical refactor of notebooks 13, 14, 15 (split monolithic cells into incremental teaching cells, modelled on notebook 7).
  • Additional training examples (curated vulnerable / defended pairs).
  • New attack techniques from 2026 disclosures.
  • Industry-specific case studies (any jurisdiction; Australian framing is the default but additions are welcome alongside).
  • Compliance updates as regulations change.
  • Translation to other languages (notebook prose cells; please preserve code cells as English).
  • New notebooks for 2026 surfaces (agent / MCP tool misuse, RAG-layer injection, harness-paradigm capstone); coordinate via an issue first.

📄 License

This repository is dual-licensed:

  • Code (app.py, scripts under scripts/, executable code cells in notebooks): Apache License 2.0. See LICENSE.
  • Course content (notebook prose cells, data/, docs/, top-level Markdown including this README): Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0). See LICENSE-DOCS.

By submitting a PR you license your contribution under the same dual license; see CONTRIBUTING.md § License + dual-licensing.


⚖️ Ethics & Responsible Use

Code of Conduct

All users must:

  1. ✅ Use only in authorised educational/research contexts
  2. ✅ Practice responsible disclosure of vulnerabilities
  3. ✅ Respect privacy and data protection laws
  4. ✅ Follow institutional ethics guidelines
  5. ❌ Never attack production systems without permission
  6. ❌ Never use techniques for malicious purposes

For Institutions

Ensure you:

  • Have ethics approval for security education
  • Provide supervised learning environments
  • Require signed code of conduct from students
  • Implement proper safeguards and monitoring
  • Comply with local regulations

📧 Contact & Support

  • GitHub Issues: bug reports and feature requests
  • Discussions: questions and community support
  • Security: responsible disclosure via GitHub Security Advisories; see SECURITY.md for the threat model, scope, and reporting channel

🙏 Acknowledgements

  • Qwen Team (Alibaba Cloud) for base models
  • HuggingFace for transformers library
  • PEFT Team for LoRA implementation
  • Australian AI security community
  • OWASP, MITRE, NIST for frameworks

📝 Citation

Machine-readable citation metadata is in CITATION.cff (Citation File Format v1.2.0). For BibTeX:

@software{ai_security_jailbreak_defence_course,
  title  = {AI Security \& Jailbreak Defence: an educational course teaching
            the model-and-prompt layer of LLM security through intentionally
            vulnerable models},
  author = {Kereopa-Yorke, Benjamin},
  year   = {2026},
  url    = {https://github.com/Benjamin-KY/AISecurityModel},
  version = {2.3.0},
  note   = {Apache-2.0 (code) / CC BY-SA 4.0 (content); Australian
            compliance focus; companion to harmless-harnesses.}
}

Version: 2.3.0 | Last Updated: 2026-06-14 | Status: Experimental educational tool (see Maturity & realistic scope at the top of this README) | Companion course: harmless-harnesses (structural-harness layer)

Remember: This is a tool for learning. Use responsibly, teach responsibly, and build safer AI systems. 🛡️
