Model Extraction & Model Stealing
Model extraction attacks attempt to replicate a proprietary machine learning model by querying it repeatedly and training a surrogate model based on observed outputs. This effectively steals the intellectual property and research investment behind the model without needing direct access to the model weights or training data.
Black-box Model Stealing
Repeated queries are used to build a labeled dataset that mimics the original model's behavior; the resulting replica is often called a "copycat" or "surrogate" model.
Attack Steps
- identify a relevant high-entropy dataset for the target domain
- systematically query the target API for labels or completions
- train a local "student" model on the gathered (input, output) pairs
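The steps above can be sketched end to end. Everything here is a toy stand-in: query_target_api simulates the proprietary endpoint (its linear threshold rule is hidden from the attacker), and the "student" is a deliberately simple nearest-centroid learner so the sketch stays dependency-free.

```python
import random

# Toy stand-in for the proprietary endpoint: the attacker can call it
# but cannot see the rule inside (here, a hidden linear threshold).
def query_target_api(x):
    return 1 if 0.6 * x[0] + 0.4 * x[1] > 0.5 else 0

random.seed(0)

# Step 1: assemble a high-entropy probe set covering the input domain.
probes = [(random.random(), random.random()) for _ in range(500)]

# Step 2: label the probes by systematically querying the target API.
dataset = [(x, query_target_api(x)) for x in probes]

# Step 3: train a local "student" on the gathered (input, output) pairs.
def mean_point(points):
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def train_student(data):
    pos = mean_point([x for x, y in data if y == 1])
    neg = mean_point([x for x, y in data if y == 0])
    return pos, neg

def student_predict(model, x):
    pos, neg = model
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return 1 if dist(x, pos) < dist(x, neg) else 0

student = train_student(dataset)
agreement = sum(student_predict(student, x) == y
                for x, y in dataset) / len(dataset)
```

Even this crude student agrees with the target on the large majority of probes; with a richer student model and a larger query budget, fidelity to the target climbs accordingly.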
Impact
- intellectual property theft
- unauthorized AI replication
- loss of competitive advantage
Confidence Score Exploitation
Using high-precision probability scores returned by the model to reconstruct its decision boundaries more efficiently than with labels alone.
Attack Steps
- send queries and record full probability distributions
- use the scores to estimate gradients or reconstruct the decision boundary
- significantly reduce the number of queries needed to clone the model
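The advantage of full-precision scores can be made concrete. In this sketch the target is a hypothetical two-feature logistic model (weights W and bias B are invented and hidden from the attacker): because every returned probability exposes the exact logit, just three probes recover the model exactly, where label-only access would need far more queries.

```python
import math

# Hypothetical target: a logistic model that returns its full probability.
W = (1.3, -0.7)   # hidden weights, unknown to the attacker
B = 0.2           # hidden bias

def query_scores(x):
    z = W[0] * x[0] + W[1] * x[1] + B
    return 1.0 / (1.0 + math.exp(-z))   # probability of class 1

# Each score exposes the exact logit z = log(p / (1 - p)) = w.x + b,
# so d + 1 = 3 probe points give a solvable linear system for (w, b).
def logit(p):
    return math.log(p / (1.0 - p))

z0 = logit(query_scores((0.0, 0.0)))    # = b
z1 = logit(query_scores((1.0, 0.0)))    # = w1 + b
z2 = logit(query_scores((0.0, 1.0)))    # = w2 + b

b_hat, w1_hat, w2_hat = z0, z1 - z0, z2 - z0
```

Quantized or label-only outputs destroy exactly the information this shortcut relies on, which motivates the output-obfuscation defense described later in this section.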
Impact
- faster model extraction
- lower cost for the attacker
Query-based Distillation
Training a student model on high-quality synthetic outputs generated by the target model, often selecting the most informative queries via active learning.
Impact
- creation of a lightweight equivalent model
- avoidance of expensive R&D costs
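A minimal sketch of the distillation objective, assuming the standard temperature-softened cross-entropy (no specific framework; the logits below are invented): the student is rewarded for matching the teacher's full output distribution, not just its top label.

```python
import math

def softmax(logits, T=1.0):
    # Temperature T > 1 flattens the distribution, exposing the
    # teacher's "dark knowledge" about relative class similarities.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, T=3.0):
    # Cross-entropy between softened teacher and student distributions.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.2]
loss_far  = distillation_loss(teacher, [0.1, 2.0, 3.0])   # poor imitation
loss_near = distillation_loss(teacher, [3.9, 1.1, 0.3])   # close imitation
```

Minimizing this loss over queried outputs is what lets an attacker compress the target's behavior into a lightweight equivalent model.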
Training Data Extraction (Memorization)
Exploiting the model's tendency to memorize specific training examples to recover verbatim sensitive data from the training corpus.
Impact
- sensitive data leakage
- training privacy breach
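A toy illustration of the memorization risk. The "model" here is a stub that has memorized a single fabricated record verbatim; real extraction attacks probe generative models with many candidate prefixes drawn from the training domain.

```python
# Fabricated record standing in for memorized training data.
MEMORIZED = "SSN: 078-05-1120"

def toy_complete(prompt):
    # Stands in for a generative model's continuation of the prompt:
    # a memorized record is completed verbatim from its prefix.
    if MEMORIZED.startswith(prompt):
        return MEMORIZED[len(prompt):]
    return "<generic text>"

# The attacker probes with a plausible prefix from the training domain.
leak = toy_complete("SSN: ")
```

The defense side of this risk is covered by the differential-privacy mitigation below: bounding any single record's influence on the trained model limits verbatim recovery.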
Output Obfuscation & Quantization
Reduce the precision of returned confidence scores (e.g., return 'High' instead of '0.9845') to deny attackers the fine-grained decision-boundary information they need.
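A minimal sketch of score quantization; the bucket boundaries here are illustrative and would be tuned per application.

```python
def obfuscate_confidence(p):
    # Collapse a high-precision probability into coarse buckets so the
    # response no longer reveals exact distance to the decision boundary.
    if p >= 0.9:
        return "High"
    if p >= 0.6:
        return "Medium"
    return "Low"
```

For example, `obfuscate_confidence(0.9845)` and `obfuscate_confidence(0.9999)` both return "High", so the two inputs become indistinguishable to a boundary-mapping attacker.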
Watermarking Model Weights
Embed unique trigger-response pairs in the model so that, if a suspected clone reproduces the same unlikely signature, ownership of the original can be demonstrated.
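A sketch of owner-side watermark verification. The trigger set below is invented for illustration: rare inputs paired with deliberately unusual outputs planted at training time.

```python
# Invented watermark set: trigger query -> planted, unusual output.
TRIGGERS = {"zx-probe-17": "label_B", "qq-probe-42": "label_A"}

def verify_watermark(model_fn, threshold=1.0):
    # A model reproducing all trigger outputs matches the watermark.
    hits = sum(model_fn(q) == expected for q, expected in TRIGGERS.items())
    return hits / len(TRIGGERS) >= threshold

# A clone extracted from the watermarked model inherits the triggers;
# an independently trained model almost certainly does not.
stolen_clone = lambda q: TRIGGERS.get(q, "label_A")
independent_model = lambda q: "label_A"
```

Because a surrogate is trained on the watermarked model's own outputs, the planted trigger responses tend to transfer to the clone along with the legitimate behavior.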
Query Rate Limiting
Restrict the number of requests per user/token to make large-scale extraction economically infeasible.
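Rate limiting is commonly implemented as a token bucket; a minimal per-client sketch, with illustrative capacity and refill parameters:

```python
import time

class TokenBucket:
    # Each client gets `capacity` burst tokens, refilled at a slow,
    # steady rate; sustained bulk querying drains the bucket.
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(capacity=5, refill_per_sec=0.5)
results = [bucket.allow() for _ in range(7)]   # burst of 7 rapid requests
```

The first five requests pass; the rest are rejected until tokens refill, which caps the attacker's query throughput and therefore the cost-effectiveness of extraction.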
Differential Privacy
Add controlled statistical noise to outputs to mask individual training influences and decision weights.
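A sketch of the Laplace mechanism applied to an aggregate answer; the sensitivity and epsilon values are illustrative, and the fixed seed is only for reproducibility.

```python
import math
import random

def laplace_mechanism(true_value, epsilon, sensitivity=1.0, rng=None):
    # Noise drawn from Laplace(scale = sensitivity / epsilon) bounds how
    # much any single record can shift the released answer.
    rng = rng or random.Random()
    u = rng.random() - 0.5                     # uniform in [-0.5, 0.5)
    scale = sensitivity / epsilon
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_value + noise

rng = random.Random(0)                         # fixed seed for illustration
noisy = laplace_mechanism(100.0, epsilon=1.0, rng=rng)
```

Smaller epsilon means larger noise and stronger privacy, at the cost of less accurate outputs; the same trade-off applies when noise is injected during training rather than at release time.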
Detection Methods
- query rate limiting and anomaly detection
- monitoring for systematic "mapping" patterns (e.g., grid search)
- semantic similarity analysis of incoming requests across sessions
- monitoring IP/Token behavior for abnormal activity spikes
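One of the signals above can be sketched simply: extraction campaigns tend to emit many near-duplicate, template-like queries, so unusually high average pairwise token overlap within a session is suspicious. The 0-to-1 score and the example sessions below are illustrative.

```python
from itertools import combinations

def jaccard(a, b):
    # Token-set overlap between two queries, in [0, 1].
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def session_suspicion(queries):
    # Average pairwise similarity: systematic "mapping" traffic scores
    # high; organic, varied traffic scores low.
    pairs = list(combinations(queries, 2))
    return sum(jaccard(p, q) for p, q in pairs) / len(pairs)

scan    = ["classify item 1", "classify item 2", "classify item 3"]
organic = ["reset my password", "translate this phrase", "weather in paris"]
```

In production this lexical check would be one feature among several (embedding-based similarity, query rate, coverage of the input space) feeding an anomaly detector.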
Testing Tools
- Knockoff Nets
- Counterfit
- ART (Adversarial Robustness Toolbox, originally from IBM)
- Garak (for checking prompt-based extraction)
Hands-on Lab Environment
Ready for the practical lab?
Apply the concepts learned in the Model Extraction & Model Stealing course within our virtual terminal environment.
Start Lab Terminal