Detection: M365 Copilot Jailbreak Attempts

EXPERIMENTAL DETECTION

This detection status is set to experimental. The Splunk Threat Research team has not yet fully tested, simulated, or built comprehensive datasets for this detection. As such, this analytic is not officially supported. If you have any questions or concerns, please reach out to us at research@splunk.com.

Description

Detects M365 Copilot jailbreak attempts that use prompt injection techniques, including rule manipulation, system bypass commands, and AI impersonation requests, to circumvent built-in safety controls. The detection searches exported eDiscovery prompt logs for jailbreak keywords such as "pretend you are," "act as," "rules=," "ignore," "bypass," and "override" in the Subject_Title field, then assigns a severity score based on the manipulation type: 4 for amoral impersonation or explicit rule injection, 3 for entity roleplay or bypass commands, and 1 for any other keyword match. Only prompts with a jailbreak score of 2 or higher are flagged, so a prompt that matches just a generic "pretend you are" or "act as" phrase is dropped. Results are sorted by score, which prioritizes the most severe attempts to override AI safety mechanisms through direct instruction injection or unauthorized persona adoption.

`m365_exported_ediscovery_prompt_logs`
| search Subject_Title="*pretend you are*" OR Subject_Title="*act as*" OR Subject_Title="*rules=*" OR Subject_Title="*ignore*" OR Subject_Title="*bypass*" OR Subject_Title="*override*"
| eval user = Sender
| eval jailbreak_score=case(
    match(Subject_Title, "(?i)pretend you are.*amoral"), 4,
    match(Subject_Title, "(?i)act as.*entities"), 3,
    match(Subject_Title, "(?i)(ignore|bypass|override)"), 3,
    match(Subject_Title, "(?i)rules\s*="), 4,
    1=1, 1)
| where jailbreak_score >= 2
| table _time, user, Subject_Title, jailbreak_score, Workload, Size
| sort -jailbreak_score, -_time
| `m365_copilot_jailbreak_attempts_filter`
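
To make the scoring order easier to reason about, the following is a minimal Python sketch of the case() and where steps in the search above. It is not part of the detection: the names RULES, jailbreak_score, and is_flagged are hypothetical, and the sketch does not model the initial search pre-filter, which uses literal wildcard matching rather than regular expressions.

```python
import re

# Ordered (pattern, score) pairs mirroring the case() branches in the SPL.
# As with case(), the first matching branch decides the score.
RULES = [
    (re.compile(r"pretend you are.*amoral", re.IGNORECASE), 4),
    (re.compile(r"act as.*entities", re.IGNORECASE), 3),
    (re.compile(r"(ignore|bypass|override)", re.IGNORECASE), 3),
    (re.compile(r"rules\s*=", re.IGNORECASE), 4),
]


def jailbreak_score(subject_title: str) -> int:
    """Return the score of the first matching rule, or 1 if none match."""
    for pattern, score in RULES:
        if pattern.search(subject_title):
            return score
    return 1  # corresponds to the default branch: 1=1, 1


def is_flagged(subject_title: str, threshold: int = 2) -> bool:
    """Mirror the `where jailbreak_score >= 2` step."""
    return jailbreak_score(subject_title) >= threshold
```

Because the first match wins, a prompt that contains both "ignore" and "rules=" is scored by the bypass branch rather than the rule injection branch, and a plain "act as a translator" prompt falls through to the default score and is not flagged.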

Data Source

Name | Platform | Sourcetype | Source
M365 Exported eDiscovery Prompts | N/A | 'csv' | 'csv'

Macros Used

Name | Value
m365_exported_ediscovery_prompt_logs | (sourcetype=csv)
m365_copilot_jailbreak_attempts_filter | search *

m365_copilot_jailbreak_attempts_filter is an empty macro by default. It allows the user to filter out any results (false positives) without editing the SPL.

Annotations

MITRE ATT&CK
ID | Technique | Tactic
T1562.001 | Disable or Modify Tools | Defense Evasion

Kill Chain Phases: Exploitation
NIST: DE.AE
CIS: CIS 10

Default Configuration

This detection is configured by default in Splunk Enterprise Security to run with the following settings:

Setting | Value
Disabled | true
Cron Schedule | 0 * * * *
Earliest Time | -70m@m
Latest Time | -10m@m
Schedule Window | auto
Creates Risk Event | True

This configuration file applies to all detections of type anomaly. These detections will use Risk Based Alerting.
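
The schedule and time range above work together: the search runs at the top of every hour and looks at a 60-minute window that ends 10 minutes before the run. The sketch below illustrates that arithmetic in Python under the stated settings. The function name search_window is hypothetical and this is not how Splunk computes time ranges internally.

```python
from datetime import datetime, timedelta


def search_window(run_time: datetime):
    """Return (earliest, latest) for a scheduled run.

    Models Earliest Time -70m@m and Latest Time -10m@m:
    subtract the offset, then snap down to the start of the minute ("@m").
    """
    def offset_and_snap(minutes: int) -> datetime:
        shifted = run_time - timedelta(minutes=minutes)
        return shifted.replace(second=0, microsecond=0)

    return offset_and_snap(70), offset_and_snap(10)


# Two consecutive hourly runs (cron: 0 * * * *)
first = search_window(datetime(2024, 1, 1, 12, 0, 0))
second = search_window(datetime(2024, 1, 1, 13, 0, 0))
```

With an hourly schedule, the latest time of one run equals the earliest time of the next, so consecutive windows neither overlap nor leave gaps.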

Implementation

To export M365 Copilot prompt logs, navigate to the Microsoft Purview compliance portal (compliance.microsoft.com) and access eDiscovery. Create a new eDiscovery case, add target user accounts or date ranges as data sources, then create a search query targeting M365 Copilot interactions across relevant workloads. Once the search completes, export the results to generate a package containing prompt logs with fields like Subject_Title (prompt text), Sender, timestamps, and workload metadata. Download the exported files using the eDiscovery Export Tool and ingest them into Splunk for security analysis and detection of jailbreak attempts, data exfiltration requests, and policy violations.
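
Before ingesting an export, it can help to confirm that the file carries the fields the search reads. The following is a minimal Python sketch of such a check. It assumes the export is a CSV file with a header row whose column names match the field names used in the SPL (Subject_Title, Sender, Workload, Size); the actual export layout may differ, and the names REQUIRED_FIELDS and missing_fields are hypothetical.

```python
import csv
import io

# Fields referenced by the detection's SPL. Treating them as CSV column
# names is an assumption about the export format.
REQUIRED_FIELDS = {"Subject_Title", "Sender", "Workload", "Size"}


def missing_fields(csv_text: str) -> set:
    """Return the required fields that are absent from the CSV header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    return REQUIRED_FIELDS - header


# Example with a made-up two-line export
sample = "Subject_Title,Sender,Workload,Size\nhello,user@example.com,Copilot,10\n"
problems = missing_fields(sample)  # empty set when every field is present
```

An empty result means the header contains every field the search uses. A non-empty result names the columns to map or rename before the data reaches Splunk.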

Known False Positives

Legitimate users discussing AI ethics research, security professionals testing system robustness, developers creating training materials for AI safety, or academic discussions about AI limitations and behavioral constraints may trigger false positives.

Associated Analytic Story

Risk Based Analytics (RBA)

Risk Message:

User $user$ attempted M365 Copilot Jailbreak with score $jailbreak_score$ using prompt injection techniques to bypass AI safety controls and manipulate system behavior, potentially violating acceptable use policies.

Risk Object | Risk Object Type | Risk Score | Threat Objects
user | user | 10 | No Threat Objects

References

Detection Testing

Test Type | Status | Dataset | Source | Sourcetype
Validation | Not Applicable | N/A | N/A | N/A
Unit | Passing | Dataset | csv | csv
Integration | ✅ Passing | Dataset | csv | csv

Replay any dataset to Splunk Enterprise by using our replay.py tool or the UI. Alternatively, you can replay a dataset into a Splunk Attack Range.


Source: GitHub | Version: 1