Meaning-Based Machine Learning

Dr. Courtney Falk
Infinite Machines

Who Am I?
• Day job
  • Senior research scientist at Optiv
  • Threat intelligence reporting
  • Some work on ontologies for information security applications
• Purdue graduate
  • Dissertation using ontology-based NLP
• Infinite Machines
• Contact me
  • LinkedIn
  • ResearchGate
  • courtney dot falk at gmail dot com

Ontological Semantics
• Evolution
  • Mikrokosmos (1995)
  • Ontological Semantics (2004)
  • Ontological Semantics Technology (2010)
• Built for natural language processing
  • No logical formalism à la Web Ontology Language (OWL)
  • Frame-based inheritance
  • Output structure known as a Text Meaning Representation (TMR)

Knowledge resources (diagram):

            Language Independent    Language Dependent
Abstract    Ontology                Lexicon
Concrete    Fact DB                 Onomasticon

Resource Examples

Concept:

(EAT
  (IS-A (VALUE (BIOLOGICAL-EVENT)))
  (DEFINITION (VALUE (“Consumption of nutrition.”)))
  (AGENT (SEM (ANIMATE-OBJECT)))
  (THEME (SEM (FOOD)))
)

Word Sense (German): “fressen”-VERB1

SYN-STRUC
  subject    var: 1           cat: noun
  root       fressen          cat: verb
SEM-STRUC
  (EAT
    (AGENT (VALUE (^$var1))
           (SEM (ANIMATE-OBJECT))
           (NOT (HUMAN))
    )
  )
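As a rough illustration of frame-based inheritance and of the selectional restriction on "fressen" (agent is an ANIMATE-OBJECT but not HUMAN), the two resources can be sketched as plain Python dictionaries. This is a minimal sketch, not the OST implementation: only the EAT frame is taken from the slide, while GRAZE, DOG, ROCK, the IS-A links around HUMAN, and all function names are hypothetical stand-ins.

```python
# Toy frame store. EAT's slots come from the slide; every other entry
# (GRAZE, DOG, ROCK, and the IS-A links around HUMAN) is a hypothetical stand-in.
ONTOLOGY = {
    "EVENT": {},
    "BIOLOGICAL-EVENT": {"IS-A": "EVENT"},
    "EAT": {
        "IS-A": "BIOLOGICAL-EVENT",
        "AGENT": {"SEM": "ANIMATE-OBJECT"},
        "THEME": {"SEM": "FOOD"},
    },
    "GRAZE": {"IS-A": "EAT"},  # hypothetical child with no slots of its own
    "OBJECT": {},
    "ANIMATE-OBJECT": {"IS-A": "OBJECT"},
    "HUMAN": {"IS-A": "ANIMATE-OBJECT"},
    "DOG": {"IS-A": "ANIMATE-OBJECT"},
    "ROCK": {"IS-A": "OBJECT"},
}


def ancestors(concept):
    """Yield the concept itself, then each parent along its IS-A chain."""
    while concept is not None:
        yield concept
        concept = ONTOLOGY.get(concept, {}).get("IS-A")


def inherited_slot(concept, slot):
    """Return the nearest definition of a slot, searching up the IS-A chain."""
    for frame in ancestors(concept):
        value = ONTOLOGY.get(frame, {}).get(slot)
        if value is not None:
            return value
    return None


def is_a(concept, ancestor):
    """True if `ancestor` is the concept itself or lies on its IS-A chain."""
    return ancestor in ancestors(concept)


def fressen_agent_ok(concept):
    """Check the SEM-STRUC constraint of 'fressen': animate, but not human."""
    return is_a(concept, "ANIMATE-OBJECT") and not is_a(concept, "HUMAN")
```

Under these assumptions, `inherited_slot("GRAZE", "AGENT")` finds the constraint defined on EAT, and `fressen_agent_ok` accepts DOG while rejecting HUMAN and ROCK.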

Semantics from Machine Learning
• Latent semantic analysis/indexing (LSA/LSI)
  • Singular value decomposition (SVD) dimensionality reduction
  • Concepts are groups of spatially proximate words
• Latent Dirichlet allocation (LDA)
  • Hierarchical topic model
• Word2vec
  • Neural networks
  • Vector space model (VSM)
• But are the learned structures meaningful to humans?
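To make the LSA bullet concrete, the sketch below applies a truncated SVD to a tiny term-document count matrix and compares words by cosine similarity in the reduced space. The vocabulary, the counts, and the helper names are all invented for illustration, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical toy term-document counts (rows: words, columns: documents).
terms = ["bank", "account", "cake", "flour"]
counts = np.array([
    [1, 2, 0, 0],   # bank
    [1, 1, 0, 0],   # account
    [0, 0, 2, 1],   # cake
    [0, 0, 1, 3],   # flour
], dtype=float)


def lsa_word_vectors(matrix, k):
    """Project each row (word) into a k-dimensional latent space via truncated SVD."""
    u, s, _vt = np.linalg.svd(matrix, full_matrices=False)
    # Singular values come back in descending order, so the first k columns
    # are the k strongest latent directions.
    return u[:, :k] * s[:k]


def cosine(a, b):
    """Cosine similarity between two latent word vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


vectors = dict(zip(terms, lsa_word_vectors(counts, k=2)))
same_topic = cosine(vectors["bank"], vectors["account"])
cross_topic = cosine(vectors["bank"], vectors["cake"])
```

In this toy matrix, words that share documents land on the same latent axis and words that never co-occur land on orthogonal ones. That is the sense in which LSA "concepts" are groups of nearby words: the axes carry no label a human assigned, which is the question the slide raises.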

Meaning-Based Machine Learning
• Start with meaningful data
  • Manually defined by human acquirers
• Use ML to find meaningful patterns
• MBML for Information Assurance (2016)
  • Application to information security problems: phishing detection, stylometry, etc.

Knowledge Modeling of Phishing Emails
• Manually generated TMRs
  • 28 phishing emails from the Anti-Phishing Working Group (APWG)
  • 28 known-good emails from my inboxes
• Train binary classifiers on TMR structures
  • Three algorithms: Naïve Bayes, J48 (C4.5), and SVM
  • Compare learning on decomposed TMRs to unigram language models
  • Used k-fold cross-validation to guard against overfitting
• Positives
  • Performed better than unigram language models
  • Confidence intervals were smaller for semantic results
• Negatives
  • Small sample size (not necessarily generalizable)
  • Didn’t record lexeme -> concept mappings
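The evaluation loop described above can be sketched in plain Python. This is not the toolchain used in the study: it is a minimal Bernoulli Naïve Bayes over sets of string features, evaluated with k-fold cross-validation. The six samples are invented for illustration (the GIVE features follow the Feature Design slide, while the THREATEN and REQUEST features are hypothetical), and the accuracies it produces say nothing about the reported results.

```python
import math
from collections import Counter


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_nb(samples, labels):
    """Fit a Bernoulli Naive Bayes model over sets of string features."""
    vocab = set().union(*samples)
    class_counts = Counter(labels)
    feat_counts = {c: Counter() for c in class_counts}
    for feats, label in zip(samples, labels):
        feat_counts[label].update(feats)
    return vocab, class_counts, feat_counts


def predict_nb(model, feats):
    """Return the class with the highest log-posterior for one feature set."""
    vocab, class_counts, feat_counts = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)
        for f in vocab:
            p = (feat_counts[c][f] + 1) / (n_c + 2)  # Laplace smoothing
            score += math.log(p if f in feats else 1 - p)
        if score > best_score:
            best, best_score = c, score
    return best


def cross_validate(samples, labels, k):
    """Train on k-1 folds, test on the held-out fold, and return per-fold accuracy."""
    accuracies = []
    for fold in k_fold_indices(len(samples), k):
        held_out = set(fold)
        train = [i for i in range(len(samples)) if i not in held_out]
        model = train_nb([samples[i] for i in train], [labels[i] for i in train])
        correct = sum(predict_nb(model, samples[i]) == labels[i] for i in fold)
        accuracies.append(correct / len(fold))
    return accuracies


# Hypothetical toy data, interleaved so every contiguous fold holds both classes.
samples = [
    {"THREATEN:THEME:VALUE:ACCOUNT", "REQUEST:THEME:VALUE:PASSWORD"},
    {"GIVE:AGENT:VALUE:HUMAN", "GIVE:THEME:VALUE:BAKED-CAKE"},
    {"REQUEST:THEME:VALUE:PASSWORD"},
    {"GIVE:AGENT:VALUE:HUMAN"},
    {"THREATEN:THEME:VALUE:ACCOUNT", "REQUEST:THEME:VALUE:PASSWORD"},
    {"GIVE:AGENT:VALUE:HUMAN", "GIVE:THEME:VALUE:BAKED-CAKE"},
]
labels = ["phish", "ham", "phish", "ham", "phish", "ham"]
fold_accuracies = cross_validate(samples, labels, k=3)
```

Because the folds are contiguous, the samples should be shuffled or interleaved by class before splitting, as the toy data is here. With a dataset as small as 56 emails, per-fold accuracy varies a lot, which is one reason to report confidence intervals alongside the mean.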

Feature Design

“Johnny gave Jane the cake”

(GIVE-37
  (AGENT (VALUE (HUMAN-4)))
  (THEME (VALUE (BAKED-CAKE-78)))
  (BENEFICIARY (VALUE (HUMAN-91)))
)

Generates features:

{GIVE:AGENT:VALUE:HUMAN,
 GIVE:THEME:VALUE:BAKED-CAKE,
 GIVE:BENEFICIARY:VALUE:HUMAN}
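The decomposition on this slide can be sketched as a small function. The sketch assumes a TMR event is held as an event name plus a nested dictionary of role -> facet -> filler, and that instance numbers such as the "-37" in GIVE-37 are dropped so that only concept names remain. The data layout and the function names are illustrative assumptions, not the representation used in the original work.

```python
import re


def strip_instance(name):
    """Drop a trailing instance number: 'BAKED-CAKE-78' -> 'BAKED-CAKE'."""
    return re.sub(r"-\d+$", "", name)


def tmr_features(event, slots):
    """Flatten one TMR event into EVENT:ROLE:FACET:FILLER feature strings."""
    head = strip_instance(event)
    return {
        f"{head}:{role}:{facet}:{strip_instance(filler)}"
        for role, facets in slots.items()
        for facet, filler in facets.items()
    }


# The TMR for "Johnny gave Jane the cake", in the assumed dictionary layout.
features = tmr_features("GIVE-37", {
    "AGENT": {"VALUE": "HUMAN-4"},
    "THEME": {"VALUE": "BAKED-CAKE-78"},
    "BENEFICIARY": {"VALUE": "HUMAN-91"},
})
```

Stripping the instance numbers is what lets two different emails share a feature: HUMAN-4 and HUMAN-91 both contribute the concept HUMAN, so the classifier learns from concept-level patterns rather than from individual instances.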

Experimental Results

Generated Decision Trees

Future Work
• Develop larger datasets
• Explore different feature performance
• Hydra: an OST parser using evolutionary algorithms
• Bootstrapping from LSA/LDA into lexemes and word senses
• New applications outside of phishing detection

References
• Onyshkevych, B. and Nirenburg, S. (1995) A lexicon for knowledge-based MT. Machine Translation, 10(1), pp. 5-57.
• Nirenburg, S. and Raskin, V. (2004) Ontological semantics. Cambridge, MA: MIT Press.
• Taylor, J. and Raskin, V. (2010) Fuzzy ontology for natural language. 2010 Annual Meeting of the North American Fuzzy Information Processing Society, pp. 1-6.
• Falk, C. and Stuart, L. (2016) Meaning-based machine learning. Journal of Innovation in Digital Ecosystems, 3(2), pp. 141-147.
• Falk, C. (2016) Knowledge modeling of phishing emails (Doctoral dissertation). Retrieved from ProQuest. (10170565)