SP-Attack
Description
SP-Attack and SP-Defense are MIT-developed open-access tools that use large language models to test and strengthen text classifiers against single-word adversarial attacks. SP-Attack crafts adversarial examples by swapping single words to flip labels, while SP-Defense retrains models for greater resilience, slashing attack success rates from 66% to 33.7% in benchmarks. They introduce the ρ(p) metric to quantify robustness, highlighting how just 0.1% of vocabulary drives nearly half of misclassifications. Essential for developers in high-stakes domains like content moderation, finance, and medicine, these tools enable efficient, scalable AI reliability improvements.
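The single-word attack loop is simple enough to sketch. The snippet below is a minimal illustration of the idea, not SP-Attack's actual API: every name in it (toy_classifier, SYNONYMS, single_word_attack) is hypothetical, and the real tool reportedly queries an LLM for meaning-preserving, label-flipping substitutes rather than using a fixed synonym table.

```python
from typing import Callable, Optional, Tuple

# Toy stand-in for a real text classifier: returns (label, confidence).
def toy_classifier(text: str) -> Tuple[str, float]:
    words = text.lower().split()
    score = sum(w in ("great", "excellent", "love") for w in words)
    score -= sum(w in ("bad", "awful", "hate") for w in words)
    label = "positive" if score >= 0 else "negative"
    return label, min(0.99, 0.5 + 0.2 * abs(score))

# Toy substitute table; SP-Attack reportedly asks an LLM for candidates.
SYNONYMS = {"great": ["bad", "awful"], "love": ["hate"]}

def single_word_attack(text: str,
                       classify: Callable[[str], Tuple[str, float]]) -> Optional[str]:
    """Try to flip the predicted label by swapping exactly one word."""
    original_label, _ = classify(text)
    words = text.split()
    for i, word in enumerate(words):
        for substitute in SYNONYMS.get(word.lower(), []):
            perturbed = " ".join(words[:i] + [substitute] + words[i + 1:])
            if classify(perturbed)[0] != original_label:
                return perturbed  # a one-word change flipped the label
    return None  # no single-word swap succeeded

print(single_word_attack("I love this product", toy_classifier))
# -> "I hate this product": the label flips from positive to negative
```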
Key capabilities
- Generate adversarial sentences via single-word changes using LLMs to test text classifier robustness (SP-Attack)
- Retrain classifiers using adversarial examples to improve robustness (SP-Defense)
- Introduce the ρ(p) metric to measure robustness against single-word attacks (one plausible reading is sketched after this list)
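The source does not spell out how ρ(p) is computed, so the following is a hedged sketch of one plausible reading: ρ(p) as the fraction of test sentences whose label can be flipped when swaps are restricted to the most attack-prone fraction p of the vocabulary. Under that reading, the headline finding (0.1% of vocabulary driving nearly half of misclassifications) would be a single point on this curve. The sketch reuses toy_classifier and SYNONYMS from the snippet above; rho_curve itself is a hypothetical name.

```python
from collections import Counter

def rho_curve(sentences, classify, synonyms, fractions=(0.001, 0.01, 0.1, 1.0)):
    """Assumed reading of rho(p): share of sentences flipped when swaps
    are limited to the top fraction p of words, ranked by flip count."""
    flip_counts = Counter()   # word -> number of sentences it can flip
    flips = {}                # sentence -> words whose swap flips its label
    for text in sentences:
        label, _ = classify(text)
        words = text.split()
        flipping = set()
        for i, word in enumerate(words):
            for sub in synonyms.get(word.lower(), []):
                perturbed = " ".join(words[:i] + [sub] + words[i + 1:])
                if classify(perturbed)[0] != label:
                    flipping.add(word.lower())
        flip_counts.update(flipping)
        flips[text] = flipping
    ranked = [w for w, _ in flip_counts.most_common()]
    curve = {}
    for p in fractions:
        allowed = set(ranked[:max(1, int(len(ranked) * p))])
        curve[p] = sum(bool(flips[t] & allowed) for t in sentences) / len(sentences)
    return curve

print(rho_curve(["I love this product", "a great value"], toy_classifier, SYNONYMS))
```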
Core use cases
- Testing and hardening text classifiers in chatbots and content moderation (see the retraining sketch after this list)
- Enhancing reliability in financial and medical text classification systems
- Evaluating classifier vulnerabilities to semantic-preserving perturbations
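A hardening pass of the kind SP-Defense performs can be sketched as adversarial data augmentation: attack the training set, keep the human label for every meaning-preserving flip found, and retrain. The scikit-learn pipeline and the adversarially_retrain name below are illustrative assumptions, not the tool's actual procedure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def adversarially_retrain(texts, labels, attack):
    """Retrain after augmenting with single-word adversarial variants.

    attack(text, model) should return a perturbed sentence that fools
    the model, or None; single_word_attack above can be adapted for it.
    """
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(texts, labels)
    aug_texts, aug_labels = list(texts), list(labels)
    for text, label in zip(texts, labels):
        adversarial = attack(text, model)
        if adversarial is not None:
            # The swap fooled the model but preserves the meaning, so the
            # original human label still applies: train on it directly.
            aug_texts.append(adversarial)
            aug_labels.append(label)
    model.fit(aug_texts, aug_labels)  # second pass on the augmented set
    return model
```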
Is SP-Attack Right for You?
Best for
- Researchers and developers building text classifiers for high-stakes applications like chatbots, content moderation, finance, and medicine
- Teams seeking targeted, efficient adversarial testing and defense
Not ideal for
- Users needing defenses against broad AI threats like multi-word attacks or prompt injections
- Applications not centered on text classification robustness
Standout features
- LLM-powered automated adversarial example generation
- Efficient word ranking by influence to minimize computation
- Semantic equivalence verification for realistic attacks (both features are sketched after this list)
- Scalable testing across large vocabularies
- Open-access implementation for easy adoption
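Two of these features are concrete enough to sketch. Deletion-based scoring is a standard way to rank words by influence, and a sentence-embedding similarity threshold is one common way to check semantic equivalence; both are assumptions here, since the listing does not describe the tool's actual methods, and the sentence-transformers model below is our choice. The example reuses toy_classifier from the first sketch.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small sentence encoder

def rank_words_by_influence(text, classify):
    """Score each word by the confidence drop when it is deleted,
    so the attack can try the most influential words first."""
    words = text.split()
    _, base_confidence = classify(text)
    scores = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        scores.append((base_confidence - classify(reduced)[1], word))
    return sorted(scores, reverse=True)

def semantically_close(original: str, perturbed: str,
                       threshold: float = 0.85) -> bool:
    """Keep a perturbation only if its embedding stays near the original."""
    emb = encoder.encode([original, perturbed], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

print(rank_words_by_influence("I love this product", toy_classifier))
print(semantically_close("The movie was great", "The weather is cold"))
```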
User Feedback Highlights
Most Praised
- Outperforms prior methods by halving attack success rates in benchmarks
- Automated and scalable via LLMs for high efficiency
- Quantifiable ρ(p) metric reveals actionable insights
- Freely available to promote widespread AI robustness improvements
Common Complaints
- Performance gains vary, dropping as low as 2% on some tasks
- Scope is limited to single-word attacks; multi-word attacks and other threats are not covered
- Relies on LLMs, potentially inheriting their own vulnerabilities