New Method of Detection and Wiping of Sensitive Information

George Pecherle, Cornelia Győrödi, Robert Győrödi, Bogdan Andronic
Faculty of Electrical Engineering and Information Technology, University of Oradea, Oradea, Romania
[email protected], [email protected], [email protected], [email protected]

Iosif Ignat
Faculty of Automation and Computer Science, Technical University of Cluj-Napoca, Cluj-Napoca, Romania
[email protected]

Abstract—One of the biggest problems in wiping sensitive data is determining whether a file is sensitive at all. Data wiping applications have improved considerably, but they cannot decide on their own whether a file is sensitive. The method we propose determines whether a file is sensitive by applying a pre-defined set of rules initially specified by the user. These rules can update themselves over time by "learning" from data patterns found in previous sensitive files. The hard drive is monitored continuously to detect new sensitive data. In addition, our method replaces the standard delete function in Windows with a secure wiping algorithm (overwriting the data several times with pre-defined passes), but only for files that are determined to be sensitive.

Keywords—data wiping; data mining; text mining; regular expressions; real-time deletion; Windows service

I. INTRODUCTION

Deleting a file using the operating system functions is not a secure operation. When a file is deleted, the operating system merely marks the disk areas previously occupied by the file as available for new data. The old information therefore remains on the hard drive until new files happen to be saved in exactly the same location, and it can easily be recovered by any basic software recovery tool [6]. Files can be deleted:
1. With the user's knowledge: for example, when the user deliberately removes one or more files that he no longer needs;
2. Without the user's knowledge (by the Windows operating system or by installed applications): for example, during their normal operation, most applications create and remove temporary data without the user's knowledge or approval.

A. Hidden Sensitive Data

One of the biggest problems when wiping sensitive data is deciding what to wipe. Your computer's hard drive can store millions of files, and it is almost impossible to determine manually which of them contain sensitive data [9]. Even if you know where you saved private files, you may simply forget about them and your computer will become "infected" with sensitive data. Suppose you receive an email with a confidential attachment. You save the attachment in a hurry just to read it, and then you forget about it. The attachment remains there, and anyone with access to your computer may discover and read it. There are also files created by applications in various places on the hard drive, and it is almost impossible to track down all the sensitive files those applications leave behind. Most of your sensitive data is therefore hidden [10]. This paper presents a way to uncover hidden sensitive data by analyzing files and determining whether they contain sensitive information.

B. Disadvantages of Existing Methods and Advantages of Our Method

The method presented in this paper is a better alternative to traditional file wiping tools because its detection algorithm is fully integrated with the operating system (it hooks the file create, file change and file delete events). It works like anti-virus software with ongoing protection, except that it protects against sensitive data rather than viruses. We have compared our solution with other sensitive file wiping tools and have not found any that offers this kind of approach. Other tools detect sensitive data by its location (for example, web browsers store browsing history in pre-defined folders), but this is not always enough, because data can be stored, by mistake or on purpose, in other locations as well. The best way to search for sensitive files is therefore by their contents. CCleaner [11] can detect sensitive data left behind by various applications and web browsers, taking into account its location on the hard drive. This method can overlook sensitive documents that the user saved in other folders. For example, if someone saves a sensitive Word document in a random folder on the hard drive, CCleaner cannot detect it, because it searches only in specific locations.
Our solution, by contrast, does not detect sensitive data by location, but by content. All the other file wiping tools we examined are in a similar situation: they detect sensitive data only by location.
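The content-based approach described above can be illustrated with a minimal sketch in Python (an illustration only; the actual system is a Windows service, and the keyword set here is an assumed example, not the system's real rules). Unlike location-based tools, it walks every folder and inspects file contents:

```python
import os

# Hypothetical rule set: plain keywords the user has marked as sensitive.
KEYWORDS = {"confidential", "secret"}

def file_is_sensitive(path, keywords=KEYWORDS):
    """Content-based check: read the file and look for any keyword."""
    try:
        with open(path, "r", encoding="utf-8", errors="ignore") as f:
            text = f.read().lower()
    except OSError:
        return False
    return any(kw in text for kw in keywords)

def scan_tree(root):
    """Walk every folder under `root`; location does not matter,
    only the contents of each file."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if file_is_sensitive(path):
                hits.append(path)
    return hits
```

A document saved in any "random folder" is found this way, which is exactly the case a location-based cleaner misses.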
II. PROPOSED METHOD OF UNCOVERING (DISCOVERING) SENSITIVE DATA
A. Overview of Our Method

The application we propose will identify sensitive files based on a set of rules specified by the user. Files that contain any of the given keywords will be marked as sensitive and will be wiped using a secure method that prevents file recovery. The application is deployed at start-up and runs as a Windows service in the background. It will catch the Windows events that are triggered when a new file is created or an existing one is changed. If the newly created (or modified) file contains any of the given keywords, the file will be marked as sensitive. The file marking system will be implemented either directly, by extending the NTFS file system (a flag that is activated when a sensitive file is detected), by keeping a table with the locations of sensitive files, or by extending the file attributes with custom attributes. The application has three main functions:
1. Scan the hard drive for sensitive files: the application will scan the hard drive and determine whether each file is sensitive, using the same set of rules that is used when a new file is created (searching the file contents to see whether they contain sensitive data). If a file is found to be sensitive, it will be marked as such, using one of the two methods described above: the file flag or the insertion of the file name into a special table of sensitive files. We propose using both of them, for backup/security purposes.
2. Catch the file create and file modify events, in order to check whether the new/modified file contains sensitive data, using the method described in subsection B.
3. Replace the "file delete" event of the Windows operating system: the application's event hook will catch the delete event raised when the user deletes a file. The file will then be checked for the sensitive mark flag, or looked up in the sensitive file table.
If the file is found to be sensitive, the delete mechanism of the Windows operating system will be overridden with a more secure deletion mechanism [6] [7]. The application will be developed to run as a system service [1]. Its main purpose is to track file-related events, such as file create and file change, and to scan and mark all newly created (or modified) files.

B. Determining Sensitive Data Based on File Contents

The first step in determining whether a file contains sensitive data is to scan its contents using a set of given rules. The rules are keywords that the end-user defines (or predefined rules that the user chooses to use). If any of the rules apply to the file, the file is declared sensitive. Besides plain keywords, the user can define regular expressions to look for [2]. For example, a CNP (a unique identifier assigned to every person in our country) consists of exactly 13 digits, so it can be matched with a regular expression. An email address can also be described by a regular expression (a string, followed by the "@" character, then another string, then the "." character, then another string, and so on). The files can also be scanned with a text mining algorithm that counts the occurrences of the rules provided by the application and of the rules added by the user. We would also like to implement a self-learning system that auto-updates its rules based on previously deleted files [3]. This is similar to anti-spam tools that can "learn" and be trained using previous spam messages. For this purpose, we can use a Genetic Algorithm (GA) as a training algorithm [4]. Besides text mining and regular expressions, we would like to implement natural language parsing in order to find possible sensitive data and report it to the user, who can then decide whether the data is sensitive. If the user confirms that the data found is sensitive, the data structure will be added to the scanning rules and to the ignore rules of the natural language parser. The reason is that the natural language parser is a complex suite of algorithms, and scanning files with it consumes considerable time and resources, whereas the other two methods are much faster. Natural language parsing may therefore be an option for scheduled scans rather than for background scanning, or a user-enabled option for users who want a higher level of sensitive data protection.

C. Marking a File as Sensitive

If a file falls into the sensitive file category, the application will mark it as sensitive data. The mark will be applied as a file flag: a flag activated in the file descriptor, located in the file header of the file system.
For this to work, we need to add a sensitive data flag to the existing file system. This will be done with a patch for the file system and a modification of the operating system's file system driver. This approach is more sophisticated, but it keeps the application as small as possible. Its main disadvantage is that the patch must be applied both to the file system and to the operating system, so that the flag is visible to the application. The second approach is for the application to add the sensitive file to an internal table that contains the exact location of the file. This is easier, but it increases the application size and adds an extra file used to store the paths of sensitive files.

D. Securely Wiping Sensitive Data

The application will catch the operating system's delete event and check whether the file is marked as sensitive. If the file is not marked as sensitive, the deletion process is the standard one implemented in the operating system. If the file is found to be sensitive, the application will override the operating system's deletion method with the specialized deletion method described below. If the file is recorded in the application's internal sensitive file table, the application will also remove the record from that table. It has been proved that the more times a certain disk area is overwritten, the lower the chances that someone can recover the data. Moreover, if the overwriting patterns are randomly generated, the protection against data recovery is even higher. This has led to well-known industry and governmental data overwriting standards. The most popular governmental standard is DoD 5220.22-M, presented in NISPOM (the National Industrial Security Program Operating Manual) [5]. In order to make data unrecoverable, the DoD algorithm overwrites it several times (DoD proposes three passes) using randomly generated patterns [8].

III. SENSITIVE FILE DETECTION SYSTEM WORKFLOW
The application will also have a graphical interface where the user will be able to define the rules used to decide whether a file is sensitive.

IV. PUTTING IT ALL TOGETHER
The application will be developed as a Windows service, with a front-end (graphical user interface) that is an optional module of the service, provided for a better user experience and for application customization. This interface will allow the user to create and/or change the classification rules that are used to determine whether a file is sensitive. Here are the steps and the various functions of the final solution:
1. Define classification rules: using the front-end, the user defines the initial classification rules for sensitive data (keywords, regular expressions, etc.). He can also select from a set of predefined rules.
2. Initial scan: the application (Windows service) will make an initial scan of the hard drive to build an initial list of sensitive files. Once a sensitive file is detected, it is marked appropriately, using the two methods previously discussed (the file flag and the table of sensitive files).
3. Monitoring file changes: the application will monitor all file create and file modify events. When such an event occurs, the file is checked for sensitive data; if it contains any, it is marked appropriately (using the file flag and the table of sensitive files).
4. Monitoring file deletion: the application will monitor all file delete events on the monitored hard drives. When such an event takes place, the file is checked for the sensitive mark. If it is marked, the file is securely wiped; if not, the normal delete function runs.
5. Self-training of classification rules: optionally, the user can select an option to have the classification rules auto-update (self-train) based on other files that the user manually selects to be wiped beyond recovery. For example, if a file not marked as sensitive is wiped manually by the user, the file is analyzed to see what other data it contains, and the rules are updated automatically. In a way, this is similar to how anti-spam tools work: the anti-spam rules update automatically as you mark new email as spam.
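The self-training idea in step 5 can be sketched as follows (a toy Python illustration; the paper envisions a GA-based trainer [4], whereas this sketch substitutes a simple frequency count, and the word-length cutoff and threshold are arbitrary assumptions):

```python
import re
from collections import Counter

class RuleTrainer:
    """Toy self-training of keyword rules: words that keep recurring in
    files the user wipes manually are promoted to classification rules,
    much like anti-spam filters learn from messages marked as spam."""

    def __init__(self, keywords, threshold=3):
        self.keywords = set(keywords)
        self.threshold = threshold   # recurrences needed before promotion
        self.counts = Counter()

    def learn_from_wiped_file(self, text):
        # Candidate words from a manually wiped file (counted once per file)...
        words = re.findall(r"[a-z]{5,}", text.lower())
        self.counts.update(set(words))
        # ...promoted to keyword rules once they recur often enough.
        for word, n in self.counts.items():
            if n >= self.threshold:
                self.keywords.add(word)
```

A real trainer would also weigh words against a background corpus so that common words are never promoted; this sketch only shows the update loop.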
6. Maintenance scans: from time to time, it is recommended to re-scan the whole hard drive (as done at step 2 above), to re-build the list of sensitive files. This is needed, even if the file create and file modify events are captured, because the classification rules can change over time (see step 5 above). The user can also schedule maintenance scans, using the Windows scheduler or a built-in scheduler. It is recommended to run these scans when the computer is idle (for example at night, or automatically when no system activity is detected for some time).
7. View and wipe sensitive files: at any moment, the user can see a list of all sensitive files on the hard drive, with the option to destroy (wipe) them immediately. He can choose to wipe only some of the sensitive files or all of them.

Figure 1. The Workflow Chart of the Sensitive File Detection System

The central element is the Sensitive File Detection Service, which controls the entire activity. This service will capture the following file actions:
• File Creation / File Modification: the Sensitive File Detection Service will scan the file contents to see if they contain sensitive data. If sensitive data is found, the file will be marked as containing sensitive data.
• File Deletion: the Sensitive File Detection Service will check if a file is sensitive, using the method described above. If the file is sensitive, it will be securely wiped. If the file is not found to be sensitive, a normal file deletion process will execute.
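The File Deletion branch of this workflow, including the multi-pass overwrite from subsection II.D, can be sketched as follows (a simplified single-process Python illustration; the real system is a Windows service hooking file system events, and the pass count here just mirrors the three passes that DoD 5220.22-M proposes):

```python
import os

PASSES = 3  # DoD 5220.22-M proposes three overwrite passes

def secure_wipe(path, passes=PASSES):
    """Overwrite the file's bytes with random data several times,
    then remove the directory entry."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(os.urandom(size))  # randomly generated pattern
            f.flush()
            os.fsync(f.fileno())       # push each pass to the disk
    os.remove(path)

def on_delete(path, is_sensitive):
    """Delete hook: secure wipe for sensitive files, normal delete otherwise."""
    if is_sensitive(path):
        secure_wipe(path)
    else:
        os.remove(path)
```

Note that overwriting in place is only effective if the file system rewrites the same disk blocks; the real implementation must work below any copy-on-write or compression layer.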
V. TESTS AND RESULTS USING OUR METHOD
We performed two kinds of tests with our method: individual tests (without comparing it to similar methods) and comparative tests (comparing it to at least two similar methods). The individual tests measured performance, speed, the ability to detect sensitive data (how many sensitive files we were able to detect, compared to the user's actual sensitive files) and the recoverability of wiped data (using file recovery software). The comparative tests compared our solution with two important competitors: CCleaner from Piriform [11] and Window Washer from Webroot [12]. Although these solutions scan for sensitive data faster, our method detects sensitive data more precisely. The results are summarized in Table I and Table II.

TABLE I. INDIVIDUAL TESTS

Test 1 – Scan speed: we let the application scan an 80 GB hard drive with approximately 120,000 files; the operation had to check files of specified types (doc, txt, pdf, eml, etc.) for sensitive data.
Result: approx. 3 hours.

Test 2 – Sensitive data detection accuracy: this test compares the number of sensitive files detected with our method (by scanning file contents) with the real number of sensitive files (as reported by the user).
Result: 3250 files detected by our solution vs. 4120 actual sensitive files (78% accuracy rate).

Test 3 – Sensitive file recovery: this test determines whether sensitive data can be recovered after wiping it by repeatedly overwriting it with random data. We used both the DoD standard [5] and the Gutmann method.
Result: no sensitive data could be recovered (100% success rate).

TABLE II. COMPARATIVE TESTS

Test 1 – Scan speed (80 GB hard drive with approximately 120,000 files; checking files of specified types – doc, txt, pdf, eml, etc. – for sensitive data):
  Our method: approx. 3 hours
  CCleaner: approx. 2 hours and 25 minutes
  Window Washer: approx. 2 hours and 15 minutes

Test 2 – Number of sensitive files detected (our method scans file contents; the competitor products scan file locations):
  Our method: 3250 files
  CCleaner: 3140 files
  Window Washer: 2954 files

VI. CONCLUSIONS

The method proposed in this paper is designed to simplify the task of detecting and wiping sensitive data. It solves one of the most important problems: making the computer decide whether a file is sensitive, based on pre-defined rules. This is similar to an anti-spam application, except that we apply the idea to detecting confidential data rather than spam messages: our system is also able to train itself and update its definitions with new and better rules. It also acts like an "anti-virus for sensitive data", a guardian that protects the system from sensitive data. The main advantage of our method is the detection of sensitive data based on file contents, which increases the detection rate (see the tests described in the previous section). It is also an automated system that continuously monitors system changes and automatically detects new sensitive data.

REFERENCES

[1] "Walkthrough: Creating a Windows Service Application in the Component Designer", http://msdn.microsoft.com/en-us/library/aa984464%28v=vs.71%29.aspx
[2] Regular-Expressions.info
[3] Gu-Hsin Lai, Chao-Wei Chou, Chia-Mei Chen and Ya-Hua Ou, "Anti-spam Filter Based on Data Mining and Statistical Test", Computer and Information Science 2009, SCI 208, pp. 179-192
[4] Abduelbaset M. Goweder, Tarik Rashed, Ali S. Elbekaie and Husien A. Alhammi, "An Anti-spam System Using Artificial Neural Networks and Genetic Algorithms"
[5] National Industrial Security Program Operating Manual (NISPOM), http://www.dss.mil/isp/fac_clear/download_nispom.html
[6] George Pecherle, Cornelia Győrödi, Robert Győrödi, Bogdan Andronic, "Data Wiping System with Fully Automated, Hidden and Remote Destruction Capabilities", WSEAS Transactions on Computers, Issue 9, Volume 9, September 2010, ISSN 1109-2750, pp. 939-948, http://www.wseas.us/e-library/transactions/computers/2010/88-110.pdf
[7] Simson L. Garfinkel, Abhi Shelat (Massachusetts Institute of Technology), "Remembrance of Data Passed: A Study of Disk Sanitization Practices", IEEE Security & Privacy, January/February 2003
[8] Richard Kissel, Matthew Scholl, Steven Skolochenko and Xing Li, NIST Special Publication 800-88: "Guidelines for Media Sanitization", Recommendations of the National Institute of Standards and Technology, September 2006
[9] Norjihan Abdul Ghani, Zailani Mohamed Sidek, "Controlling Your Personal Information Disclosure", Proceedings of the 7th WSEAS International Conference on Information Security and Privacy (ISP '08), pp. 23-28, ISBN 978-960-474-048-2, ISSN 1790-5117
[10] G. Di Crescenzo et al., "How to Forget a Secret", Symposium on Theoretical Aspects of Computer Science (STACS 99), Lecture Notes in Computer Science, Springer-Verlag, Berlin, 1999, pp. 500-509
[11] CCleaner, http://www.piriform.com/ccleaner
[12] Webroot Window Washer, http://www.webroot.com/