Content Moderation
File content written via `write_file`, `create_file`, and staged uploads (`upload_session` with action: "complete") is asynchronously scanned for harmful content using Cloudflare Workers AI (`@cf/meta/llama-guard-3-8b`).
How it works: After a successful write, content is enqueued for background moderation. The scan runs asynchronously — writes are never blocked or delayed by moderation.
What agents see: Moderation is transparent by default — `read_file` includes moderation metadata only when content is flagged. Pending and approved statuses are internal bookkeeping and are never surfaced. When content is flagged, the response includes `moderationStatus: "flagged"` and a `moderationCategory` indicating the hazard type.
What happens when content is flagged: Flagged content is not blocked from reads — it remains accessible. Flagged files cannot be shared via `share_with_public`. To dispute a flag, contact support@undisk.app.
Hazard categories: `violent_crimes`, `non_violent_crimes`, `sex_related_crimes`, `child_sexual_exploitation`, `defamation`, `specialized_advice`, `privacy`, `intellectual_property`, `indiscriminate_weapons`, `hate`, `suicide_self_harm`, `sexual_content`.
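Since `moderationCategory` is drawn from this fixed list, an agent can validate it before branching on it. The constant below just transcribes the categories from this document; `is_known_category` is an illustrative helper, not part of the API:

```python
# The twelve hazard categories listed in this document.
HAZARD_CATEGORIES = frozenset({
    "violent_crimes", "non_violent_crimes", "sex_related_crimes",
    "child_sexual_exploitation", "defamation", "specialized_advice",
    "privacy", "intellectual_property", "indiscriminate_weapons",
    "hate", "suicide_self_harm", "sexual_content",
})

def is_known_category(category: str) -> bool:
    # Guards against typos or categories the agent does not yet handle.
    return category in HAZARD_CATEGORIES
```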
Deduplication: Moderation is keyed by content SHA-256 hash. Identical content is only scanned once regardless of how many files reference it.
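Hash-keyed deduplication amounts to memoizing the scan on the content's SHA-256 digest. A minimal sketch, where `scan` stands in for the actual Workers AI call:

```python
import hashlib

_scanned: dict[str, str] = {}  # content SHA-256 hash -> moderation verdict

def moderate_once(content: bytes, scan) -> str:
    """Run scan(content) at most once per distinct content blob."""
    digest = hashlib.sha256(content).hexdigest()
    if digest not in _scanned:
        _scanned[digest] = scan(content)  # only reached for new content
    return _scanned[digest]

calls = []
def fake_scan(content: bytes) -> str:
    calls.append(content)
    return "approved"

moderate_once(b"same bytes", fake_scan)
moderate_once(b"same bytes", fake_scan)  # cache hit: no second scan
```

Two files containing identical bytes hash to the same digest, so only the first write triggers a scan.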
Note: `append_log` and `restore_version` do not trigger moderation. Content moderation is optional — it degrades gracefully if the Workers AI binding is not configured.
Recommended agent behavior on flag
- Stop share/export actions for the flagged file.
- Record the event in audit context (for example with `append_log`).
- Surface `moderationCategory` to a human reviewer.
- Continue unrelated non-flagged tasks to avoid full pipeline stalls.
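The recommended behavior above can be wired up as a single handler. This is a sketch under stated assumptions: `blocked_shares`, `notify_reviewer`, and the injected `append_log` callable are hypothetical stand-ins; only `append_log` is named in this document.

```python
blocked_shares: set[str] = set()  # paths the agent will refuse to share/export

def handle_flag(path: str, category: str, append_log, notify_reviewer) -> None:
    """Apply the recommended response to a moderation flag."""
    blocked_shares.add(path)                              # stop share/export
    append_log(f"moderation flag: {path} ({category})")   # audit trail
    notify_reviewer(path, category)                       # human review

# Usage with simple stand-ins for the real tools:
log: list[str] = []
reviewed: list[tuple[str, str]] = []
handle_flag("/notes/a.txt", "hate", log.append,
            lambda p, c: reviewed.append((p, c)))
```

Unrelated tasks keep running; only shares of paths in `blocked_shares` are stopped.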