BasicAgent

LLM Security

LLM security practices for prompt injection, data leakage, and runtime controls.

Executive summary

The research in this repo treats security as a control problem: prompts can be pushed off safe behavior, tools can be misused, and outputs can become unreliable. The mitigation pattern is layered controls, including system prompt hardening, moderation filters, anomaly detection, and validation. The IRZ-CoT pipeline adds structured extraction and confidence scoring to reduce ambiguity. This page summarizes those controls as a practical security workflow.

Logical Framework

Core concepts

  • Controlled prompting: role and instruction prompts are explicit and scoped.
  • Validation layers: RAV and RAC verify and correct extracted data.
  • Monitoring: anomaly detection and session-level analytics.

Workflow (inputs, outputs, checkpoints)

  1. Input: prompt templates and tool access rules.
  2. Apply system prompt hardening and role constraints.
  3. Run extraction with structured JSON output and confidence scores.
  4. Validate with RAV/RAC and log results.
  5. Checkpoints: moderation filters and anomaly detection.

Practical guidance and guardrails

Do:

  • Use system prompt hardening for critical tasks.
  • Apply input/output moderation filters.
  • Validate outputs with RAV/RAC and confidence thresholds.

Do not:

  • Allow unconstrained prompts for sensitive workflows.
  • Skip validation or confidence scoring.

Failure modes and mitigation:

  • Jailbreak or drift: controlled prompts and layered validation.
  • Tool misuse: strict tool protocols and monitoring.
  • Low-confidence outputs: threshold gating and review.

Evaluation criteria

  • Validation success rate from RAV/RAC.
  • Confidence scores vs thresholds.
  • Anomaly detection events per session.

Example pattern

From IRZ-CoT extraction requirements:

Accuracy is paramount. Only extract information you are certain about.
Provide reasoning and confidence for each field.
If a field is missing, return NOT_FOUND.

Create account

Create account