AdvEx Assess robustness of machine learning models against adversarial examples Client Mentor
Nancy Mead Andrew Mellinger
MITS Project, Summer 2018
Our Wonderful Team
Yike Ma yikem
Shailee Vora skvora
Hao Tang htang1
Shangwu Yao shangwuy
Linghao Zhang linghaoz
Outline
1. Introduction
2. Evaluation Algorithm
3. System Architecture
4. Video Demo
5. Project Management
6. Summary
1. Introduction
Background Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake.
[Figure: example images labeled Real and “gibbon”]
Motivation & Scope
Motivation:
● Pervasiveness of machine learning models in various industries
● Threat of adversarial attacks in critical applications such as autonomous driving, medical diagnosis, etc.
● Relatively new field & absence of similar products
Scope: To build a website based on CleverHans that assesses the robustness of user-uploaded machine learning models against adversarial examples.
CleverHans
https://github.com/tensorflow/cleverhans
● A library that provides standardized reference implementations of attack methods.
● Supports TensorFlow and Keras.
● Use case:
  ○ Input: Clean Images
  ○ Output: Adversarial Examples (Images) generated by each attack method
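To make the "clean images in, adversarial images out" use case concrete, here is a minimal sketch of one well-known attack, the fast gradient sign method (FGSM), applied to a toy logistic-regression model. It does not use the CleverHans API, and the slides do not say which attack methods AdvEx used; the model, the data, and `eps` are all hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Probability of class 1 under a toy logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(x, y, w, b, eps):
    """Return an adversarial copy of x (pixel values in [0, 1]).

    For logistic regression with cross-entropy loss,
    dLoss/dx_i = (p - y) * w_i, so each pixel moves by eps
    in the direction that increases the loss.
    """
    p = predict_proba(x, w, b)
    x_adv = [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]
    return [min(1.0, max(0.0, v)) for v in x_adv]  # keep pixels in the valid range

def accuracy(xs, ys, w, b):
    hits = sum((predict_proba(x, w, b) > 0.5) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)

# Toy data (hypothetical): labels are the model's own clean predictions,
# so the clean accuracy is 1.0 by construction.
rng = random.Random(0)
w = [rng.gauss(0, 1) for _ in range(16)]
b = 0.0
x_clean = [[rng.random() for _ in range(16)] for _ in range(200)]
y_true = [1 if predict_proba(x, w, b) > 0.5 else 0 for x in x_clean]

x_adv = [fgsm(x, y, w, b, eps=0.1) for x, y in zip(x_clean, y_true)]
clean_acc = accuracy(x_clean, y_true, w, b)
adv_acc = accuracy(x_adv, y_true, w, b)
print(f"clean accuracy: {clean_acc:.2f}, adversarial accuracy: {adv_acc:.2f}")
```

Each adversarial image differs from its clean counterpart by at most `eps` per pixel, which is what "the same noise level" refers to when attack methods are compared.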
Stakeholders
Users
Developers
2. Evaluation Algorithm
Domain Research and Design
Conclusions from Spring
● CleverHans is a reliable library
  ○ Its attack methods can significantly decrease a model’s accuracy
  ○ Its implementations are generalizable across different datasets
● ImageNet is the most suitable dataset for assessment
  ○ The research paper cited below can vouch for this
● Feedback content
  ○ Include multiple attack methods
  ○ Categorize into different “Threat Levels”
  ○ Provide a sample of adversarial images in the feedback
Torralba, Antonio, and Alexei A. Efros. "Unbiased look at dataset bias." Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011.
1. Choosing Attack Methods
Considerations:
● Effectiveness
  ○ With the same noise level, some attack methods have a higher success rate than others
  ○ Some attack methods create adversarial images with higher transferability
● Efficiency
  ○ Some attack methods take much more time to generate adversarial images than others
  ○ We evaluated the runtime of each attack method and estimated the total time to generate the complete dataset
● Resource constraints
  ○ Total number of attack methods we can use based on the budget limit
2. Black Box Attacks vs White Box Attacks
A white-box attack assumes the attacker has complete access to the target model, while a black-box attack does not.
We chose black-box attacks because:
● They fit our assumption about the attacker’s knowledge of the target model
● White-box attacks are computationally more expensive
● White-box attacks would require us to generate adversarial images every time a user uploads a model, while black-box attacks allow us to generate the adversarial images beforehand
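The practical consequence of the black-box choice is that adversarial images are generated once and reused for every uploaded model. A minimal sketch of that flow follows; the attack names, the tiny two-pixel "images", and `toy_model` are hypothetical stand-ins, not the AdvEx implementation.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Image = Sequence[float]
Dataset = List[Tuple[Image, int]]   # (adversarial image, true label)
Model = Callable[[Image], int]      # returns a predicted label

# Hypothetical adversarial sets, generated once ahead of time
# (one per attack method) and shared by all evaluations.
PRECOMPUTED: Dict[str, Dataset] = {
    "attack_a": [([0.9, 0.1], 0), ([0.2, 0.8], 1), ([0.6, 0.4], 1)],
    "attack_b": [([0.4, 0.6], 0), ([0.7, 0.3], 1)],
}

def evaluate_uploaded_model(model: Model) -> Dict[str, float]:
    """Accuracy of one uploaded model on every pre-generated adversarial set."""
    results = {}
    for attack, dataset in PRECOMPUTED.items():
        correct = sum(model(image) == label for image, label in dataset)
        results[attack] = correct / len(dataset)
    return results

def toy_model(image: Image) -> int:
    """Stand-in for a user-uploaded model: picks the larger of two values."""
    return 0 if image[0] >= image[1] else 1

results = evaluate_uploaded_model(toy_model)
print(results)
```

No attack code runs at upload time here: evaluating a new model only costs one inference pass per stored image.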
3. Finalizing Feedback Content
● Robustness Score
  ○ We used the average of the model's accuracy degradation over all attack methods
● Confidence
  ○ Confidence can be regarded as an alternative metric for evaluating the robustness of the model. The higher the confidence with which the model makes incorrect predictions, the weaker it is.
● Graph
  ○ We included a graph for aesthetic reasons, since an illustration of text data is generally more visually appealing to users
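The two numeric metrics above can be sketched as follows. The slides do not give the exact formulas, so this assumes that degradation means clean accuracy minus adversarial accuracy and that confidence is averaged over incorrect predictions only; the function names and sample numbers are hypothetical.

```python
def average_accuracy_degradation(clean_acc, adv_acc_by_attack):
    """Mean of (clean accuracy - adversarial accuracy) over all attack methods.

    A larger value means the model loses more accuracy under attack,
    i.e. it is less robust.
    """
    degradations = [clean_acc - adv for adv in adv_acc_by_attack.values()]
    return sum(degradations) / len(degradations)

def mean_confidence_when_wrong(predictions):
    """Average confidence over incorrect predictions.

    predictions: iterable of (predicted_label, true_label, confidence).
    Returns 0.0 if the model made no mistakes.
    """
    wrong = [conf for pred, true, conf in predictions if pred != true]
    return sum(wrong) / len(wrong) if wrong else 0.0

# Hypothetical numbers for illustration only.
degradation = average_accuracy_degradation(
    0.80, {"attack_a": 0.50, "attack_b": 0.20, "attack_c": 0.65}
)
wrong_confidence = mean_confidence_when_wrong(
    [(1, 1, 0.9), (2, 1, 0.8), (3, 0, 0.6)]
)
print(f"average degradation: {degradation:.2f}")
print(f"mean confidence on wrong predictions: {wrong_confidence:.2f}")
```

How the average degradation is turned into the final displayed score is not specified in the slides, so the sketch stops at the raw average.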
Implementation Evaluation Pipeline Diagram
Challenges
Issue tracker
Issue: How to deal with the trade-off between the time to evaluate a user’s model and the reliability of our results
Solution:
1. Turn to domain experts for help
2. Find out how many images other papers use for validation
Issue: How to deal with the trade-off between usability and technical feasibility
Solution:
1. Target only models for computer vision tasks that are trained on ImageNet
2. Require users to upload their index-to-label mappings so that we know which index in their model corresponds to which ImageNet label
……
……
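The index-to-label requirement can be illustrated with a small sketch: the uploaded mapping is translated into a lookup from the user's output index to the evaluator's own label index. The label names, the index values, and the JSON-style string keys are hypothetical; the slides do not describe the actual file format.

```python
def build_index_translation(user_index_to_label, reference_label_to_index):
    """Map each output index of the user's model to the reference label index."""
    translation = {}
    for user_index, label in user_index_to_label.items():
        if label not in reference_label_to_index:
            raise ValueError(f"unknown label in uploaded mapping: {label!r}")
        translation[int(user_index)] = reference_label_to_index[label]
    return translation

def is_correct(user_prediction, true_reference_index, translation):
    """Compare a prediction (in the user's index space) with the true label."""
    return translation[user_prediction] == true_reference_index

# Toy reference label set (NOT real ImageNet indices).
reference = {"panda": 0, "gibbon": 1, "tabby": 2}

# Mapping uploaded by the user; keys are strings, as they would be in JSON.
uploaded = {"0": "gibbon", "1": "tabby", "2": "panda"}

translation = build_index_translation(uploaded, reference)
print(translation)
print(is_correct(2, 0, translation))  # user index 2 means "panda"
```

Rejecting unknown labels at upload time keeps a malformed mapping from silently lowering a model's measured accuracy.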
3. System Architecture
Quality Attributes
● Scalability
● Availability
● Performance
● Security
● Usability
● Configurability
Configurability
● Dockerization
  ○ Automatically build and deploy on AWS
  ○ Maintain package versions to guarantee stable deployment
● Elastic Beanstalk
  ○ Handle auto-scaling and load-balancing
  ○ Configure easily and safely in environment variables
● Config file for attack methods
  ○ Dynamically choose attack methods without changing the code
  ○ Add and remove attack methods by editing the config file
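A config file for attack methods could work roughly as sketched below: the code keeps a registry of available attacks, and the config decides which ones run and with what parameters. The JSON layout, the `enabled` flag, and the two placeholder "attacks" (`noise`, `invert`) are assumptions for illustration; they are not real attack methods or the AdvEx config format.

```python
import json

# Hypothetical config; in practice this would live in a separate file.
CONFIG_TEXT = """
{
  "attacks": [
    {"name": "noise",  "enabled": true,  "params": {"eps": 0.1}},
    {"name": "invert", "enabled": false, "params": {}}
  ]
}
"""

def add_noise(image, eps):
    """Placeholder perturbation: brighten every pixel by eps, capped at 1.0."""
    return [min(1.0, v + eps) for v in image]

def invert(image):
    """Placeholder perturbation: invert every pixel."""
    return [1.0 - v for v in image]

# Registry of attack implementations known to the code.
ATTACK_REGISTRY = {"noise": add_noise, "invert": invert}

def load_enabled_attacks(config_text):
    """Return (name, function, params) for each enabled attack in the config."""
    config = json.loads(config_text)
    attacks = []
    for entry in config["attacks"]:
        if not entry.get("enabled", True):
            continue
        function = ATTACK_REGISTRY[entry["name"]]
        attacks.append((entry["name"], function, entry.get("params", {})))
    return attacks

enabled = load_enabled_attacks(CONFIG_TEXT)
for name, function, params in enabled:
    print(name, function([0.2, 0.95], **params))
```

Enabling `invert` or changing `eps` then only requires editing the config, not the evaluation code.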
Functional Requirements & Constraints
Functional Requirements
● Dashboard
● Model Upload Form
● Submission History
● Submission Detail
● Information Page
Business Constraints
● Budget limit (AWS credits)
● Free to users
Technical Constraints
● Based on the CleverHans library
● Deployed on AWS
● Supports Keras only
● Targets CV models only
Tech Stack
Data Storage
Web Service
Infrastructure & DevOps
Design Choices
● Frontend
  ○ Angular.js -> Vue.js
  ○ Flatter learning curve
  ○ Greater code reusability
● Backend
● Temporary / Persistent Data Storage
● Evaluation Framework
● DevOps Tools
  ○ Option #1: AMI + Terraform + Ansible
  ○ Option #2: Docker + Elastic Beanstalk
  ○ Fewer steps involved
  ○ Gains much configurability effortlessly
4. Video Demo
5. Project Management
● Phase 1: Individual Development (Week 1 - 4)
  ○ Designed & developed frontend, backend and evaluation worker
  ○ Developed Alpha
● Phase 2: Integration & Deployment (Week 5 - 7)
  ○ Alpha User Acceptance Test
  ○ Beta in Week 7
● Phase 3: System Refinement (Week 8 - 10)
  ○ Beta User Acceptance Test
  ○ Backlog feature implementation
  ○ Bug fixing
  ○ Final deliverables in Week 10
*New plans in Summer
Planning & Tracking
● Process
  ○ Phase 1 & 2
    ■ Close to Waterfall
    ■ Simple & clear requirements
    ■ Design (Week 1) -> Implementation (Week 2 - 4) -> Verification (Week 6 - 7)
  ○ Phase 3
    ■ Borrowed from Scrum & XP
    ■ Feature backlog
    ■ Frequent communication with clients & frequent releases
● Tools
  ○ Trello: assign tasks; promote transparency within the team
  ○ Google Docs Spreadsheets: track tasks, bugs and issues
Including bug fixing & new feature implementation.
Planning & Tracking - Reflections
What Worked:
● Hit Almost All Weekly Milestones
  ○ 26 of 36 tracked tasks finished on time.
  ○ 6 tasks delayed (avg. 2.6 days)
    ■ 4 tasks (avg. 3 days) before Week 5
    ■ 2 tasks (avg. 2 days) after Week 5
  ○ 4 tasks finished earlier (avg. 2 days)
● Feedback-driven Agile Development in Phase 3
  ○ 11 non-trivial bugs tracked
  ○ 8 bugs fixed on the day of discovery
  ○ 3 bugs fixed after one day
Planning & Tracking - Reflections
What Could Be Improved:
● Task duration estimated in days instead of hours
  ○ Whereas time spent is recorded in hours with Toggl
  ○ Due to lack of experience
● Estimates are particularly inaccurate with unfamiliar dependencies
  ○ 4 out of 6 delayed tasks are due to this
  ○ Unfamiliarity with AWS services, testing framework, etc.
  ○ Could have planned a training period
Requirements Management
● Feature Backlog
● Source of new requirements
  ○ User Feedback
  ○ Internal Reflections
● Execution
  ○ Add items when proposed (from around week 6)
  ○ Re-prioritize items at the weekly meeting (in Phase 3)
  ○ Plan tasks based on priority and time considerations
● 7 feature requests (4 from users)
● 4 implemented (3 from users)
● Helped with scope management
Quality Management
● Domain Research
● Internal Tests
● Code Reviews: feature / refactoring / bug fix
● Alpha User Acceptance Test
  ○ Week 4 - 5, two surveyed users (domain expert in ML & professor in security)
  ○ In-person / remote meetings, questionnaire
  ○ Feedback turned into immediately plannable tasks or backlog items
● Beta User Acceptance Test
  ○ Week 7 - 9, three users (surveyed users + client)
  ○ Users tested on the live website
  ○ Feedback turned into backlog items
Quality Management - Reflections
What Worked:
● Users Helped to Improve Usability
  ○ E.g. navigation bars vs. buttons
● Users Helped to Discover Unexpected Bugs
  ○ Unexpected use cases (e.g. ill-configured session persistence -> need to log in again when opening the link in a new tab)
  ○ Less-than-ideal environments (e.g. limited uploading bandwidth -> timeout)
● Code Review as Quality Control
  ○ # of non-trivial bugs discovered during code review: 8
  ○ # of non-trivial bugs reported by users: 3
Quality Management - Reflections
What Could Be Improved:
● Unable to Observe Whether Code Quality Improved Over Time
  ○ Not enough data due to lack of repeating patterns given our limited project scope
  ○ Hard to measure code quality due to lack of experience
● Tests Sometimes Failed to Ensure Quality due to Sloppy Execution
  ○ Especially when beyond the scope of a single component
  ○ Have more tests automated
● Tests Could Be More Exhaustive
  ○ Made a trade-off due to lack of time: depend more on user acceptance tests
  ○ Inadequate coordination between different component owners
Configuration Management
● Code
  ○ Version Control via Git
  ○ Feature-based Branching
  ○ Code Reviews via Pull Requests
● Document (Google Drive)
  ○ Changelog-based Version Control
  ○ Unified Templates
  ○ Task / Issue / Bug Trackers (Spreadsheets)
● Deployment
  ○ Built-in Version Control by Elastic Beanstalk
  ○ Continuous Integration via Travis CI
  ○ Faster Feedback Loops: 5 Minutes to Build & Deploy A New Version
Risk Management
Risk # | Risk Description | Probability x Impact
R1 | The exact system quality requirements, such as response time, are not decided before implementation in summer. | Medium
R2 | We don’t know what information users expect to see and would consider useful in the feedback report. | Medium
R3 | We can’t find people to test our website. | Medium
R4 | Integration of 3 components developed with different technologies and by different team members. | High
R5 | Organizing and planning our time to build such a huge system is going to be difficult due to lack of industry experience. | Medium
R6 | The technology decisions we have made aren’t correct. | Medium
R7 | VGG16’s 40% accuracy issue is not resolved by the end of week 3 of summer. | Low
R8 | We change the parameters of the attack methods midway through the project. | Low
R9 | We design the entire UI before doing a User Acceptance Test. | Low
R10 | S3 upload takes longer than the time allocated to it. | Low
#1 Risk in Summer
● Condition: Integration of 3 components developed with different technologies and by different team members goes wrong, meaning the components don’t work with each other.
● Consequence: Our progress could be seriously delayed. We may end up rewriting a large portion of the code.
● Mitigation:
  ○ By Design: Have good decoupling, clear APIs, unit/integration tests, etc.
  ○ Alpha Version
    ■ To validate assumptions about the technologies we chose.
    ■ Some code and notes produced while developing the Alpha serve as references for later
Risk Management - Reflections
What Worked:
● Being Aware of the Risks is the Key First Step
● Plan for Things to Happen rather than React to Them
  ○ The only way to ensure that the tech we use will behave as expected is to build a minimum product with that tech
What Could Be Improved:
● Need A Process to Close the Discussion on A Risk
● Could Have Categorized Risks and Delegated Them to Different Owners
  ○ Risk Manager: maintain the document and enforce the process
  ○ Owner: come up with mitigation actions and plan them with the PM
6. Summary
Team Goals
Success Criteria: All met!
● Deploy the fully-functional website by the last week of July
  ○ First deployment on 7.3. Finalized deployment on 7.23.
● Open source on GitHub with documentation by the first week of August
  ○ Finalized code and documentation on 7.28.
Learning Goals:
● Gain understanding in the field of adversarial machine learning
● Gain experience of designing, developing and deploying web apps with AWS
Personal Goals ● Yike ○
●
Hao ○
●
To understand how an end to end website can be built and deployed
Shangwu ○
●
Understand how adversarial machine learning works
Shailee ○
●
Familiarize with frontend development and Vue.js; Learn how to collaborate efficiently using Git, Trello and other tools
Gain experience with designing and implementing a website, as well as using automated test & build tools
Linghao ○
Gain experience of designing and implementing a cloud-based infrastructure; A first taste of tech leadership.
Final Takeaways
● There is no one-size-fits-all process
  ○ Understand context rather than formality
● Tracking things in a principled way is the key to project management
  ○ Accountability, transparency
● The sense of ownership boosts productivity
  ○ Having a project manager instead of consensus decision-making
  ○ Split the work into smaller parts and assign tasks based on individual strength
● Automated build & deploy tools ensure quality and improve efficiency
● It pays to invest time in design
● Respect formalities, such as agendas and templates, that improve efficiency
Acknowledgements
Thank you for the help throughout our project!
● Nancy Mead
● Andrew Mellinger
● Gregory Laidlaw
● Oren Wright
Thank you!
Questions?
Backup Slides
Functional Requirements
● Dashboard
  ○ Statistics: # of models running, # of models queued
  ○ Status of most recent submission
● Model Upload Form
● Submission History
● Submission Detail
  ○ Robustness scores under various attack methods and threat levels
  ○ A final score for complete evaluation
● Information Page
  ○ Submission Instructions
  ○ Explanation of attack methods and evaluation dataset
Feature Backlog
#1 Frontend - Dashboard
  Details: Dashboard should show status of the most recent submission.
  Priority: High
  Source: User Feedback
  Activity History: Added on 6/24; Planned on 7/9; Implemented on 7/15
#4 Evaluation
  Details: Allow users to upload their own evaluation datasets.
  Notes: Requires major changes to the system. Probably wouldn't have time for that.
  Priority: Low
  Source: User Feedback
  Activity History: Added on 6/24
#5 Backend
  Details: Use WSGI server to support higher concurrency.
  Notes: Required for large scale production. Not necessary if we don't expect > 50 users simultaneously.
  Priority: Medium
  Source: Linghao
  Activity History: Added on 6/25
#7 Evaluation
  Details: Decouple config of attack methods from the code so that adding a new attack method is easier.
  Priority: Medium
  Source: Andrew, Linghao & Hao
  Activity History: Added on 7/9; Planned on 7/16; Implemented on 7/20
Summer Risks
[Figure: risk chart plotting risks #1 to #10]