IDEALISTIC NEURAL NETWORKS

Tofara Moyo

Mazusa

[email protected]

ABSTRACT

We describe an artificial neural network in which words are mapped to individual neurons instead of being fed into a network as input variables. Changing training cases is then equivalent to a dropout procedure, in which we replace some (or all) of the word/neurons of the previous training case with new ones. Each word/neuron takes as input the "b" weights of all the other neurons and weights them with its personal "a" weight. The network learns with the backpropagation algorithm, after an error is calculated at a single output neuron, which is a traditional neuron. The network therefore has a unique topology and functions with no external inputs. Learning uses coordinate gradient descent, alternating between training the "a" weights of the words and the "b" weights. The Idealistic Neural Network is an extremely shallow network that can represent non-linear complexity with a linear architecture.

KEYWORDS

Neural Network, NLP, Hate speech detection

1. INTRODUCTION

Imagine if, instead of using backpropagation to alter the weights of a neural network in an NLP system, we used the backpropagation algorithm to alter the word vectors. We know that better word vectors make for better learning, but what if the learning itself were the change in the word vectors? We could then place the whole of the system's complexity into the design of the word vectors through backpropagation (which takes the word vectors as inputs, rather than the weights of the network) and rely on a minimal neural network infrastructure. A further step is to map individual words onto individual neurons permanently, and then use a dropout procedure to represent different training cases in the network: dropping out whole sentences if need be, or altering the positions of words and adding others. A minimal sketch of the first idea follows.
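To make this concrete, here is a minimal sketch of the first idea, assuming a PyTorch setup with an illustrative mean-pooling classifier (the sizes, names, and pooling choice are our assumptions, not the paper's): the network's weights are frozen, and backpropagation updates only the word vectors.

    # Sketch: backpropagation alters the word vectors, not the network.
    # Hypothetical sizes and pooling; requires PyTorch.
    import torch
    import torch.nn as nn

    vocab_size, dim = 10000, 32
    embeddings = nn.Embedding(vocab_size, dim)   # word vectors: trainable
    classifier = nn.Linear(dim, 1)               # minimal network: frozen
    for p in classifier.parameters():
        p.requires_grad = False

    opt = torch.optim.SGD(embeddings.parameters(), lr=0.1)
    loss_fn = nn.BCEWithLogitsLoss()

    def train_step(token_ids, label):
        vecs = embeddings(token_ids)              # (num_words, dim)
        logit = classifier(vecs.mean(dim=0))      # (1,)
        loss = loss_fn(logit, torch.tensor([label]))
        opt.zero_grad()
        loss.backward()                           # gradients reach only the word vectors
        opt.step()
        return loss.item()

    # e.g. train_step(torch.tensor([3, 17, 42]), 1.0) for a three-word example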

2. ESTABLISHED THEORY

Traditional approaches to NLP use word vectors, such as those produced by word2vec, as the inputs to a network. Randomly initialized word vectors are known to make for a poorly performing NLP system. This is the motivation for approaches like the word2vec embedding system, which first discovers contexts for words in a corpus and then places the words in the vector space in relation to each other so as to reflect those contexts. This approach uses a window of n positions to the left and n positions to the right of the word being vectorized to calculate its context, and it is known that the greater the size of n, the more valuable the generated word vectors are to NLP tasks. These word vectors are then fed into a neural network for classification, translation, or whatever the particular task may be. If we used an arbitrarily large value of n in this n-gram model, our neural network would have an easier task in finding a mapping between the words. A small sketch of the context window follows.
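As an illustration of this context window (a sketch of the windowing step only, not of word2vec itself; the corpus and the whitespace tokenization are assumptions):

    # Symmetric context window of n positions to each side of a word.
    def context_windows(tokens, n):
        for i, word in enumerate(tokens):
            left = tokens[max(0, i - n):i]
            right = tokens[i + 1:i + 1 + n]
            yield word, left + right

    tokens = "the dog sat on the mat".split()
    for word, ctx in context_windows(tokens, n=2):
        print(word, "->", ctx)
    # e.g. sat -> ['the', 'dog', 'on', 'the']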

3. THE IDEALISTIC NEURAL NETWORK

In most models there is a distinction between the training data, represented by variables, and the neural network itself. Furthermore, the network may have multiple layers, each learning different features of the data. The Idealistic Neural Network, as its name suggests, is inspired by idealism: here we merge the representation of the data with the neural network itself. To explain the process, we use an NLP application that classifies hate speech, though the Idealistic Neural Network is capable of representing any system currently modelled with deep learning techniques, including but not limited to computer vision and robotics.

Imagine we have a pool of neurons from which we can form networks. Each neuron in this pool is associated with a particular word from a natural language, say English for this example. So there will be a "dog" neuron and a "table" neuron as well as a "the" neuron, among all the neurons in the pool.

Training uses a supervised regimen, where we are given many labelled examples of hate speech and normal speech. The label is a binary value: 1 for hate speech and 0 for normal speech. Upon receiving the first training example, say the sentence "I hate you", we take the neurons associated with those words from the pool and create a network out of them. There is one layer containing the neurons of the sentence, all connected to an output neuron. This output neuron is the only permanent feature of the network. As the training example changes, we replace some or all of the neurons in the first layer with the neurons associated with the words of the current example.

Each neuron in the pool has two weights associated with it, which we call the "a" weight and the "b" weight. Upon setting up the "I hate you" network, we take the "b" weight of each neuron and feed it into all the other neurons. Upon receiving this fan of "b" values, a neuron weights them all with its personal "a" weight. We first randomly initialize the "a" and "b" values of all neurons in the pool. During training, we forward propagate by calculating the weighted sum of the "b" values at each neuron, passing it through an activation function, and sending the result to the output neuron. The output is compared to the expected output, and the error is backpropagated to the "a" and "b" values. Once a network/training example has been learned, we save the values of its weights and return the neurons to the pool. The cycle then continues with the next training example. Upon completion of the training regimen, when we receive a phrase we can form a network from the neurons in the pool associated with its words, forward propagate its "b" values, and observe an output that should correctly classify the phrase as either hate speech or normal speech. A minimal sketch of this forward pass follows.
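The following is a minimal sketch of the forward pass as we read the description above. The tanh hidden activation, the sigmoid output, and the single output weight w_out are our assumptions; the paper does not name its activation functions.

    import math

    def forward(words, pool, w_out):
        # pool maps each word to its (a, b) weight pair.
        a = [pool[w][0] for w in words]
        b = [pool[w][1] for w in words]
        total_b = sum(b)
        # Each word/neuron weights the fan of the OTHER neurons' b values
        # with its personal a weight, then applies an activation.
        hidden = [math.tanh(a[i] * (total_b - b[i])) for i in range(len(words))]
        # The output neuron uses one weight for all incoming connections.
        z = w_out * sum(hidden)
        return 1.0 / (1.0 + math.exp(-z))     # read as P(hate speech)

    # e.g. pool = {"i": (0.1, -0.2), "hate": (0.4, 0.3), "you": (-0.1, 0.2)}
    #      forward(["i", "hate", "you"], pool, w_out=0.5)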

Note that features are not learnt in the traditional manner with this network. This is best illustrated with a computer vision cat-classification example, where instead of having words be neurons we let pixel values be neurons. Since each pixel/neuron outputs a hyperplane, the effect of one pixel's input on another is to scale it. Since all pixels scale each other, randomly initialized "a" and "b" values lead to random scalings; when we then train for the identification of a cat, we assign the scalings of the transformations that produce cats to a binary value, since different cats are, in some sense, simply scalings of each other.

3.2. HATE SPEECH DETECTION ALGORITHM

1. Create a neuron for every word in the English language.

2. Upon receiving a passage of words that may or may not be hate speech, load the neurons that go with the words being used into the system.

2.1. Set up the network with the neurons associated with these words.

2.1.1. The first layer of this network will contain the neurons corresponding to the words in the passage.

2.1.2. There will be one output neuron, fully connected to the first layer.

2.1.3. The output neuron will be a traditional neuron and will not change with different training examples.

- The number of inputs into the output neuron changes with the training example, so we use one weight for all incoming connections.

2.1.4. Each word neuron in the first layer will be associated with two weights, "a" and "b".

2.1.4.1. The "b" weight of a particular neuron will be input into all the other neurons in the first layer.

2.1.4.2. All inputs to a neuron are weighted with that neuron's "a" weight.

- [Note that we no longer associate each connection with its own weight: the number of words (and hence "b" inputs) varies from passage to passage, so, as with the output neuron, each neuron has one weight for all of its inputs.]

3. Forward propagate by inputting the "b" values of the other word/neurons into each neuron and sending the outputs of each neuron into the output layer to produce an output.

4. Calculate the loss using an expected value of 1 for hate speech and 0 for normal speech.

5. Backpropagate the loss, taking the "a" and "b" values of each word neuron in the current training case as the inputs to the loss function.

- We use coordinate gradient descent:

- First, randomly assign values for the "a"s and "b"s.

- Treat the "b" values as constants while learning new values for the "a"s.

- Then treat the learnt "a" values as constants while learning new values for the "b"s.

- Repeat the last two steps until the error is zero.

6. Do the above for each training case.

7. During operation, load the neurons corresponding to the words in the passage and check whether the output is 1 or 0, indicating whether or not the passage contains hate speech. A sketch of this training procedure is given below.
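Putting the steps together, here is a minimal end-to-end sketch of the training procedure, under the same assumptions as the forward-pass sketch in Section 3 (tanh hidden activation, sigmoid output with its single weight fixed at 1, squared-error loss, and a fixed number of alternating phases rather than training to zero error; none of these specifics are fixed by the paper):

    import math, random

    def forward(a, b):
        total_b = sum(b)
        hidden = [math.tanh(a[i] * (total_b - b[i])) for i in range(len(a))]
        z = sum(hidden)                        # output weight fixed at 1
        return hidden, 1.0 / (1.0 + math.exp(-z))

    def train(examples, pool, lr=0.1, epochs=50):
        # examples: list of (words, label); pool: word -> [a, b], shared
        # across training cases (the neuron pool).
        for epoch in range(epochs):
            update_a = (epoch % 2 == 0)        # alternate a-phase / b-phase
            for words, y in examples:
                a = [pool[w][0] for w in words]
                b = [pool[w][1] for w in words]
                hidden, y_hat = forward(a, b)
                total_b = sum(b)
                g = 2 * (y_hat - y) * y_hat * (1 - y_hat)   # dLoss/dz
                for i, w_i in enumerate(words):
                    if update_a:
                        # d hidden_i / d a_i = tanh' * (sum of the other b's)
                        grad = g * (1 - hidden[i] ** 2) * (total_b - b[i])
                        pool[w_i][0] -= lr * grad
                    else:
                        # b_i feeds every OTHER neuron's weighted sum
                        grad = sum(g * (1 - hidden[j] ** 2) * a[j]
                                   for j in range(len(words)) if j != i)
                        pool[w_i][1] -= lr * grad

    pool = {w: [random.uniform(-1, 1), random.uniform(-1, 1)]
            for w in ["i", "hate", "you", "love", "cats"]}
    train([(["i", "hate", "you"], 1), (["i", "love", "cats"], 0)], pool)
    print(forward(*zip(*[pool[w] for w in ["i", "hate", "you"]]))[1])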

4. RELEVANT FIGURES

Figure 1. Hate speech detection sample network.

5. CONCLUSIONS

This paper introduced a novel neural network that can process non-linear systems as convex optimisation problems. It used concepts from idealism, in which there is no separation between the data and the system that processes the data, and we have shown that this is a very promising way to look at deep learning, or rather deep learning without the depth, since the network is shallow. Therein lies a further advantage: such a shallow network should be easy to train.

ACKNOWLEDGEMENTS

The author would like to thank the lecturers and staff of the universities where they were educated.

Tofara Moyo

Tofara Moyo graduated from the National University of Science and Technology in Zimbabwe and has a start-up called Mazusa, which is in the research phase.
