
A Handbook on Computerized Tumor Classification from MRI of Brain

Sudipta Roy Assistant Professor, Department of Computer Science and Engineering, Academy of Technology, Adisaptagram, Hooghly-712121, West Bengal, India Email: [email protected]

Dr. Samir Kumar Bandyopadhyay Professor, Department of Computer Science and Engineering, Technology Campus, Calcutta University, JD-2, Sector-III, Salt Lake, Kolkata-98, India Email: [email protected]


Abstract
Magnetic Resonance Imaging (MRI) is one of the most useful tools for the diagnosis of human soft tissues. It is an image acquisition technique used in radiology to visualize the structure and function of the body. Unlike Computed Tomography (CT) scanning, MRI uses no ionizing radiation and is generally a very safe procedure. This multisequence digital imaging technique acquires stacks of images with different tissue contrasts, for example T1-Weighted (T1W), T2-Weighted (T2W), Proton Density (PD) and so forth. Diagnosis and identification of abnormality in brain tissues is a huge workload, as there are over 120 types of brain tumour. Classifying tumours from the MRI slices of a patient involves maintaining a large database because of the variety of tumours, and maintaining and updating this database takes a great deal of time and effort. This variety of tumours also makes differentiating one type from another a difficult problem.


Acknowledgements
The completion of this book would not have been possible without the encouragement, help and friendship of many individuals. It is our privilege to thank the people who have supported and guided us throughout the pursuit of this book, and we offer sincere apologies to anyone we may have missed. During these years we have collaborated with many people. Our sincere thanks to Shayak Sadhu and Piu Ghosh for their encouragement and helpful attitude, which was a constant motivating factor in completing this book.

Sudipta Roy Samir Kumar Bandyopadhyay


List of Contents
Chapter 1. Introduction
Chapter 2. Objectives
Chapter 3. Different Brain Imaging Techniques and Their Applications
  3.1 Computed Tomography
  3.2 Magnetic Resonance Imaging
  3.3 Functional Magnetic Resonance Imaging
  3.4 Positron Emission Tomography
  3.5 Electroencephalography
  3.6 Magnetoencephalography
Chapter 4. Feature Extraction from MRI
Chapter 5. Different Types of Brain Tumor
  5.1 Glioma
  5.2 Meningioma
  5.3 Sarcoma
  5.4 Glioblastoma or Glioblastoma Multiforme
  5.5 Medulloblastoma
  5.6 Oligodendroglioma
Chapter 6. Machine Learning Models for Tumor Type Estimation
  6.1 Support Vector Machine
  6.2 K-Nearest Neighbours Algorithm
  6.3 Decision Trees
  6.4 Naive Bayes Classifier
  6.5 Logistic Regression
  6.6 Linear Regression
  6.7 Discriminant Function Analysis
  6.8 AdaBoost Classifier
  6.9 Multilayer Perceptron
  6.10 Neuro-Fuzzy Classification
Chapter 7. Summary and Conclusion
Bibliography


List of Figures
Figure 1. MRI modalities
Figure 3.1. CT images taken from the Brain Atlas Database
Figure 3.2. MRI images taken from the Brain Atlas Database
Figure 3.3. Some fMRI images taken from www.med.nyu.edu/thesenlab/wp-content/
Figure 3.4. PET scans show different patterns of glucose (sugar) metabolism related to performing various mental tasks
Figure 3.4.1. Images are cross-sections with the front of the brain at the top. Highest metabolic rates are in red, with lower values from yellow to blue
Figure 3.5. Computer artwork of the top of a head with electroencephalography (EEG) electrodes attached to the scalp
Figure 3.6. The sequence of steps to localize sources of neuronal activity from time-domain recordings to MRI overlay
Figure 4. The processing pipeline of Voxel Based Morphometry (VBM) on structural MRI volumes (Chyzhyk, 2000)
Figure 5. Most commonly, brain tumours develop from cells that support the nerve cells of the brain and brain components
Figure 5.1.1. Astrocytes are the cells that make up the "glue-like" or supportive tissue of the brain
Figure 5.1.2. Ependymal cells line the ventricles of the brain and the center of the spinal cord
Figure 5.2.1. Common locations of meningiomas
Figure 5.3.1. Sarcoma in T2, T1, and PD MR images
Figure 5.5.1. Location of the cerebellum
Figure 5.6.1. Oligodendrocytes are one of the types of cells that make up the supportive, or glial, tissue of the brain
Figure 6.1.1. A linear classifier in SVM
Figure 6.1.2. A nonlinear classifier in SVM
Figure 6.1.3. Schematic mapping from a nonlinear to a linear hyperplane
Figure 6.3.1. A schematic representation of a decision tree as a flow-chart-like structure
Figure 6.3.2. An example of a decision tree for predicting whether a person cheats


Figure 6.4.1. Consider the image classified as either green or red dots
Figure 6.4.2. Classification of a new object by the white circle
Figure 6.5.1. Linear regression of the observed probabilities, Y, on the independent variable X
Figure 6.5.2. Logistic regression supported by a simple weighted linear regression
Figure 6.5.3. Plain-vanilla empirical regression
Figure 6.5.4. Empirical regression
Figure 6.6.1. A scatter plot of the example data
Figure 6.6.2. A scatter plot of the example data; the black line consists of the predictions, the points are the actual data, and the vertical lines between the points and the black line represent errors of prediction
Figure 6.8.1. Signal-flow graph of the perceptron
Figure 6.8.2. Signal-flow graph of an MLP
Figure 7.1. A block diagram of the overall possible process for classification


List of Tables

Table 6.5.1. The relationship, for 64 infants, between X and Y
Table 6.6.1. Example data
Table 6.6.2. Example data



Chapter 1
1. Introduction
Curing cancer has been a major goal of medical researchers for decades, but the development of new treatments takes time and money. Science may yet find the root causes of all cancers and develop safer methods for shutting them down before they have a chance to grow or spread. Approximately 40 per cent of all primary brain tumors are benign and can be successfully treated with surgery and, in some cases, radiation. The number of malignant brain tumors appears to be increasing, but for no clear reason.

Brain cancer is a complex disease, classified into 120 different types.

So-called non-malignant (benign) brain tumors can be just as life-threatening as malignant tumors, as they squeeze out normal brain tissue and disrupt function. The glioma family of tumors comprises 44.4% of all brain tumors. Glioblastoma, a type of astrocytoma, is the most common glioma, comprising 51.9%, followed by other types of astrocytoma at 21.6% of all brain tumors. Brain tumors are the leading cause of cancer death in children under the age of 20. They are the second leading cause of cancer death among 20 to 29 year old males. Metastatic brain tumors result from cancer that spreads from other parts of the body into the brain. About 10-15% of people with cancer will eventually develop metastatic brain tumors. This has attracted the attention of researchers from various fields to improve the standard diagnostics for encountering this disease. Early detection and accurate diagnosis of brain tumors is the most important point in implementing successful therapy and treatment planning. However, the diagnosis of brain tumour is a very challenging task due to the large variance and complexity of tumour characteristics in images, such as size, shape, location and intensity, and can only be performed by trained professional neuro-radiologists. In recent decades a great deal of research has been done on the diagnosis and treatment of brain tumour because of its high mortality rate. Many researchers have contributed different techniques with the aim of classifying brain tumours based on different sources of information. Brain tumour classification is a very significant phase in the medical field.

The images acquired from different modalities should be verified by the physician for further treatment, but manual classification of the images is a challenging and time-consuming task. The use of computer technology in medical decision support is now widespread and pervasive across a wide range of medical areas such as cancer research, gastroenterology, brain tumours, etc. MRI is now the viable option for the study of tumours in soft tissue. The method clearly identifies tumour type, size and location. MRI uses a magnetic field to build up a picture and has no known side effects related to radiation exposure.

It has much higher detail in soft tissues. Researchers have proposed various features for classifying tumours in MRI. Statistical, intensity, symmetry and texture features, which utilize the gray values of tumours, are used here for classifying the tumour. However, the gray values of MRI tend to change due to over-enhancement or in the presence of noise. Image segmentation is therefore required to delineate the boundaries of the regions of interest (ROIs), ensuring, in our case, that tumours are outlined and labelled consistently across subjects. Segmentation can be performed manually, automatically, or semi-automatically. The manual method is time consuming and its accuracy highly depends on the domain knowledge of the operator. Specifically, various approaches have been proposed to deal with the task of segmenting brain tumours in MR images. The performance of these approaches usually depends on the accuracy of the spatial probabilistic information collected by domain experts. Studies revealed that many neuro-imaging approaches require precise recognition of the brain in MRI of the human head. The need for defined anatomic three-dimensional (3D) models substantially improves spatial distribution concerning the relationships of critical structures (e.g., functionally significant cortical areas, vascular structures) and disease. Image textures are native and complex visual patterns that reproduce the data of gray level statistics, anatomical intensity variations, texture, spatial relationships, shape, structure and so on. Image texture analysis aims to interpret and understand these real-world visual patterns, which

involves the study of methods broadly used in image filtering, normalization, classification, segmentation, labelling, synthesis and shape from texture. Texture classification involves extracting features from different texture images to build a classifier. It determines to which of a finite number of physically defined classes (such as normal and abnormal tissue) a homogeneous texture region belongs. The classifier is then used to classify new instances of texture images. The textural properties of spatial patterns in digital images have been successfully applied to many practical vision systems, such as the classification of images to analyse diagnostic tissue for dementia and tumors, hyperspectral satellite images for remote sensing, content-based retrieval, detection of defects in industrial surface inspection, and so on. A thorough exploration of three-dimensional (3D) texture features in volumetric data sets requires extension of the conventional 2D Grey Level Co-occurrence Matrix (GLCM) and run-length texture computation into a 3D form for better texture feature analysis. A Genetic Algorithm (GA) selects relevant elements in the feature selection step. Classification is performed with an Extreme Learning Machine (ELM), with an Improved Particle Swarm Optimization (IPSO) technique to select the best parameters (input weights, bias, norm, hidden neurons) for better generalization and conditioning of the classifier for brain tissue and tumour pathology tissue characterization as White Matter (WM), Gray Matter (GM), Cerebrospinal Fluid (CSF) and tumor.

One particular application area where neural networks show some promise is the field of Magnetic Resonance (MR) image segmentation. Most previous studies of neural network based MR image segmentation have employed the back-propagation (BP) algorithm. Some of these studies applied a BP neural network to the automatic characterization of brain tissues from multimodal MR images. The ability of a three-layer BP neural network to perform segmentation based on a set of images acquired from a pathological human subject was studied. The results were compared with those obtained using a traditional Maximum Likelihood Classifier (MLC). Neural network based segmented images appear less noisy than MLC segmented images. Brain cancer detection and classification systems have been implemented using Artificial Neural Networks. A design based on image processing techniques, an Artificial Neural Network and a Graphical User Interface was successfully completed and used to detect and classify tumors. The designed brain cancer detection and classification system uses a conceptually simple classification method based on neuro-fuzzy logic. Texture features are used in the training of the Artificial Neural Network. Co-occurrence matrices at different directions are calculated and GLCM features are extracted from the matrices. The above procedure effectively classifies the tumor types in brain images taken under different clinical circumstances and technical conditions, showing high deviations that clearly indicate abnormalities in areas with brain disease. An MR scanner contains strong magnets, arranged in the circular part of the scanner. The patient lies flat on the scanner bed and the desired part of the body is examined by sliding it into the scanner. Most of the human body is made up of water molecules, which consist of hydrogen and oxygen atoms. A smaller particle, called a proton, exists at the centre of each hydrogen atom. Protons are very sensitive to magnetic fields, and the magnetic moments of the individual hydrogen nuclei are oriented in random directions. When these nuclei are caught suddenly in a strong magnetic field, they line up in the direction of the applied magnetic field like so many compass needles aligning with the earth's magnetic field. Short bursts of radio waves, RF pulses, are sent to certain areas of the body and the protons are pulled out of position. As this happens, each

proton transmits a radio signal that provides information about its exact location in the body. The basic idea of utilizing the water molecule in imaging makes MRI most desirable in disease detection, since the majority of diseases manifest themselves by an increase in water content. However, a contrast-based assessment of pathology is sometimes difficult; for example, infection and tumor look similar in some cases. A correct diagnosis can only be made by careful analysis of the images by an experienced radiologist. Tissue contrast in all MR images is affected to some degree by each of the parameters T1, T2, PD, FLAIR, etc., as shown in the image below.

Figure 1. MRI modalities

These MRI modalities are often combined to facilitate a more accurate analysis, referred to as multi-spectral image analysis, where more than one measurement is made at each location in the image. Techniques in multi-spectral MRI offer medical practitioners more information to characterize and discriminate various tissues based on physical and biochemical properties. Classification and detection techniques are generally considered the most effective methods for MR image analysis, where the classification methods are divided into unsupervised and supervised learning. Unsupervised methods like Expectation Maximization (EM), k-means and its fuzzy equivalent, the widely used Fuzzy C-Means (FCM), generally produce satisfactory results in MR image analysis. However, clustering is not a reliable method for accurate classification in pathological analysis. Conventional supervised learning machines like Artificial Neural Networks (ANN), Probabilistic Neural Networks (PNN), and Support Vector Machines (SVM) have been effectively used in multispectral MRI analysis. However, application of these conventional classification methods alone in multispectral MRI analysis has often failed to provide the expected clinical accuracy. In the rest of this book, we illustrate the objectives of the book in Chapter 2, different types of brain scan in Chapter 3, feature extraction from MRI slices in Chapter 4, different types of brain tumor in Chapter 5, the use of different machine learning models to estimate the type of tumor in Chapter 6, and the summary and conclusion in Chapter 7.
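As a concrete illustration of the unsupervised route mentioned above, the following sketch clusters the voxel intensities of a multi-spectral MRI slice with k-means. The synthetic input arrays, the choice of four clusters and the use of scikit-learn are assumptions made for illustration, not a description of any specific published pipeline.

```python
# A minimal sketch of unsupervised tissue clustering on a multi-spectral MRI slice.
# Assumes co-registered T1W, T2W and PD slices are available as 2-D arrays of equal size;
# k = 4 (roughly WM, GM, CSF, background) is an illustrative choice.
import numpy as np
from sklearn.cluster import KMeans

def cluster_tissues(t1w: np.ndarray, t2w: np.ndarray, pd: np.ndarray, k: int = 4) -> np.ndarray:
    """Cluster voxels by their multi-spectral intensities and return a label map."""
    # Stack the three contrasts so every voxel becomes a 3-dimensional feature vector.
    features = np.stack([t1w.ravel(), t2w.ravel(), pd.ravel()], axis=1).astype(float)
    # Normalize each channel so no single contrast dominates the distance metric.
    features = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    return labels.reshape(t1w.shape)

if __name__ == "__main__":
    # Synthetic stand-in data; in practice these would be real co-registered MRI slices.
    rng = np.random.default_rng(0)
    t1w, t2w, pd = (rng.random((128, 128)) for _ in range(3))
    label_map = cluster_tissues(t1w, t2w, pd)
    print("cluster sizes:", np.bincount(label_map.ravel()))
```

As the paragraph above notes, such clustering gives only a rough tissue partition; for pathological classification it is usually followed by supervised classifiers trained on labelled features.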


Chapter 2
2. Objectives
MRI is a powerful visualization technique in clinical practice and biomedical research for investigation of brain anatomy and function. It provides much greater contrast between different soft tissues of the body than CT images, which makes it particularly useful in clinical diagnosis, especially in evaluating brain tumors and lesions. Brain matter segmentation and classification from MR sequences is an important image processing step for both medical practitioners and scientific researchers in pathological analysis. Computer Aided Diagnosis (CAD) systems help them to assess tumor growth and treatment response. In addition, such systems can assist in computer-aided surgery, radiation therapy and modeling of tumor growth. A large amount of research effort has been devoted to developing effective segmentation and classification methods in past years. However, such methods have failed to reach the accuracy level provided by the visual analysis of human experts. Brain tumors have a growing importance in the field of biomedical research, and correctly classifying a tumor has been a very difficult task for medical practitioners over the past few decades. As mentioned before, there are approximately 120 different types of brain tumor, of which we have taken only the 6 major, most commonly occurring types, and we illustrate their clinical features in MRI slices. We also discuss why MRI is preferred over other types of brain imaging technique. We then move on to feature extraction from MRI images, an essential step in selecting the type of tumor present in the brain, which gives us numeric data stored in the form of a table. This data is then fed to different types of machine learning models or classifiers to obtain an estimate of the type of tumor and its percentage closeness to the ideal case of the detected type of tumor, as sketched below. This gives us an estimate of how near the input MRI slices are to the ideal case of that type of tumor.
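The following minimal sketch illustrates that workflow: a table of per-slice feature vectors with tumor-type labels is used to train a classifier, which then reports per-class probabilities for a held-out slice. The feature names, the CSV layout and the choice of a support vector machine from scikit-learn are assumptions made for illustration; any of the classifiers discussed in Chapter 6 could be substituted.

```python
# Sketch of the feature-table-to-classifier workflow described above.
# Assumed CSV layout: one row per MRI slice, numeric feature columns
# (e.g. mean_intensity, contrast, correlation, ...) plus a "tumor_type" label column.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_and_estimate(csv_path: str) -> None:
    data = pd.read_csv(csv_path)                    # hypothetical feature table
    X = data.drop(columns=["tumor_type"]).values    # numeric features per slice
    y = data["tumor_type"].values                   # e.g. "glioma", "meningioma", ...
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y)
    # Scale features, then fit an SVM with probability estimates enabled.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
    model.fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))
    # Per-class "closeness" estimate for the first held-out slice.
    probs = model.predict_proba(X_test[:1])[0]
    for cls, p in zip(model.classes_, probs):
        print(f"{cls}: {p:.2%}")

if __name__ == "__main__":
    train_and_estimate("tumor_features.csv")  # hypothetical file name
```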


The fast evolution of MR imaging techniques offers a wide repository of pulse sequences that can easily be tuned to offer specific visualizations of the brain. These sequences have high spatial resolution and provide much information on the anatomical structure, allowing quantitative pathological or clinical studies. For example, some lesions are obvious in FLAIR, but their presence should be confirmed in other sequences (T2W or PD) as well to avoid false positives. T1W images can also give very useful information to provide an improved segmentation. Slice-by-slice examination of these sequences is a tedious job in clinical analysis. Therefore, neuroradiologists demand a new approach to computer-aided diagnosis because of their heavy workload in extracting relevant information from large amounts of data.


Chapter 3
3. Different Brain Imaging Techniques and Their Applications
There are many types of brain imaging techniques, which differ in their implementation and their purpose of application. In this section we discuss the types of brain imaging techniques and the technology they implement in detail. Brain imaging techniques are used with the objective of viewing activities or problems inside the brain without invasive neurosurgery. There are many safe techniques which are used all around the world. These techniques give us a snapshot of a section of the brain in which we can differentiate between different parts of the brain. This enables medical practitioners and researchers to analyse the activity of the brain without resorting to any surgical procedures. The major types of brain imaging techniques are discussed below.
3.1 Computed Tomography
A Computed Tomography scan, also known as a CT scan, is an imaging technique that uses x-rays to produce pictures of cross-sections of the body. It is also called computerized tomography and computerized axial tomography (CAT). The term tomography comes from the Greek words tomos (a cut, a slice, or a section) and graphein (to write or record). Each picture created during a CT procedure shows the organs, bones, and other tissues in a thin "slice" of the body. The entire series of pictures produced in CT is like a loaf of sliced bread: you can look at each slice individually (2-dimensional pictures), or you can look at the whole loaf (a 3-dimensional picture). Computer programs are used to create both types of pictures. The x-ray CT scan shown below was obtained about 3 hours after the onset of symptoms and is normal. (Remember, these image datasets have been spatially matched so that a direct comparison between image types is possible.) It is common for CT images to be negative during the acute period of stroke.


Figure 3.1. CT images taken from Brain Atlas Database
Most modern CT machines take continuous pictures in a helical (or spiral) fashion rather than taking a series of pictures of individual slices of the body, as the original CT machines did. Helical CT has several advantages over older CT techniques: it is faster, produces better 3-D pictures of areas inside the body, and may detect small abnormalities better. The newest CT scanners, called multi-slice CT or multi-detector CT scanners, allow more slices to be imaged in a shorter period of time. A CT scanner emits a series of narrow beams through the human body as it moves through an arc, unlike an X-ray machine which sends just one radiation beam. The final picture is far more detailed than an X-ray image. Inside the CT scanner there is an X-ray detector which can see hundreds of different levels of density. It can see tissues inside a solid organ. This data is transmitted to a computer, which builds up a 3D cross-sectional picture of the part of the body and displays it on the screen. Sometimes a contrast dye is used because it shows up much more clearly on the screen. If a 3D image of the abdomen is required the patient may have to drink a barium meal. The barium appears white on the scan as it travels through the digestive system. If images lower down the body are required, such as the rectum, the patient may be given a barium enema. If blood vessel images are the target, a contrast agent will be injected.
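The acquisition-and-reconstruction principle just described can be sketched in a few lines. The example below simulates the detector readings (a sinogram) for a standard test image and reconstructs the cross-section by filtered back-projection. The phantom, the number of projection angles and the use of a recent scikit-image API (including the filter_name argument) are assumptions for illustration only, not part of any clinical workflow.

```python
# Sketch of the CT principle: project an image at many angles (the sinogram a
# scanner would measure), then reconstruct a cross-section from the projections.
import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, iradon, rescale

if __name__ == "__main__":
    image = rescale(shepp_logan_phantom(), 0.5)          # simple test cross-section
    angles = np.linspace(0.0, 180.0, 180, endpoint=False)
    sinogram = radon(image, theta=angles)                # simulated detector readings
    reconstruction = iradon(sinogram, theta=angles, filter_name="ramp")
    error = np.sqrt(np.mean((reconstruction - image) ** 2))
    print(f"reconstruction RMS error: {error:.4f}")
```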


Advantages:
a) In CT scans, overlapping structures are eliminated, making the internal anatomy more apparent
b) Reduced need for exploratory surgeries
c) Improved cancer diagnosis and treatment
d) Rapid acquisition of images
e) A view of a large portion of the body
Disadvantages:
a) Very expensive
b) Delivers a high dose of radiation
c) Requires breath holding, which some patients cannot manage
d) Artefacts are common, e.g. from metal clips

3.2 Magnetic Resonance Imaging
Magnetic resonance imaging (MRI), nuclear magnetic resonance imaging (NMRI), or magnetic resonance tomography (MRT) is a medical imaging technique used in radiology to investigate the anatomy and physiology of the body in both health and disease. MRI scanners use strong magnetic fields and radio waves to form images of the body. The technique is widely used in hospitals for medical diagnosis, staging of disease and for follow-up without exposure to ionizing radiation. Most of the human body is made up of water molecules, which consist of hydrogen and oxygen atoms. At the centre of each hydrogen atom there is an even smaller particle called a proton. Protons are like tiny magnets and are very sensitive to magnetic fields. When you lie under the powerful scanner magnets, the protons in your body line up in the same direction, in the same way that a magnet can pull the needle of a compass. Short bursts of radio waves are then sent to certain areas of the

body, knocking the protons out of alignment. When the radio waves are turned off, the protons realign and send out radio signals, which are picked up by receivers. These signals provide information about the exact location of the protons in the body. They also help to distinguish between the various types of tissue in the body, because the protons in different types of tissue realign at different speeds and produce distinct signals. In the same way that millions of pixels on a computer screen can create complex pictures, the signals from the millions of protons in the body are combined to create a detailed image of the inside of the body. There are no known biological hazards of MRI because, unlike x-ray and computed tomography, MRI uses radiation in the radiofrequency range, which is found all around us and does not damage tissue as it passes through. Pacemakers, metal clips, and metal valves can be dangerous in MRI scanners because of potential movement within a magnetic field. Metal joint prostheses are less of a problem, although there may be some distortion of the image close to the metal. MRI departments always check for implanted metal and can advise on its safety. There are several weighted variations of the MRI scan. They are listed below:
- T1 weighted MRI
- T2 weighted MRI
- PD weighted MRI

Figure 3.2. MRI images (PD, T2, and T1 weighted) taken from Brain Atlas Database


These are the usual variations when taking MRI images. These variations are obtained by varying the electromagnetic response of the machine. Some advantages and disadvantages of brain MRI imaging are listed below.
Advantages:
a) Harmless to the patient - no radiation is involved (unlike computed tomography (CT) scanning and conventional radiology).
b) Excellent detail makes it similar, and even superior, to CT scanning in some situations.
c) The MRI contrast agent normally used is gadolinium, which is less allergenic than the iodine-based contrast agents used in CT scanning.
Disadvantages:
a) Limited availability - although this is rapidly improving.
b) It is a lengthy procedure - e.g., a pituitary gland MRI scan can take up to 30 minutes.
c) In MRI scanning of the chest and abdomen the patient must lie still for long periods, which can prove difficult. Therefore, CT scanning is preferred in these situations.
d) MRI scanning cannot be performed in the presence of foreign bodies or metallic implants - e.g., pacemakers, aneurysm clips and some cardiac stents (even if distant from the site of the image). However, stainless steel objects, such as those in hip prostheses, may be acceptable.
e) It is relatively expensive compared with other forms of imaging.

3.3 Functional Magnetic Resonance Imaging
Functional magnetic resonance imaging or functional MRI (fMRI) is a functional neuroimaging procedure using MRI technology that measures brain activity by detecting associated changes in blood flow. This technique relies on the

fact that cerebral blood flow and neuronal activation are coupled. When an area of the brain is in use, blood flow to that region also increases. The primary form of fMRI uses the blood-oxygen-level dependent (BOLD) contrast, discovered by Seiji Ogawa. This is a type of specialized brain and body scan used to map neural activity in the brain or spinal cord of humans or other animals by imaging the change in blood flow (hemodynamic response) related to energy use by brain cells. Since the early 1990s, fMRI has come to dominate brain mapping research because it does not require people to undergo injections or surgery, to ingest substances, or to be exposed to radiation. The procedure is similar to MRI but uses the change in magnetization between oxygen-rich and oxygen-poor blood as its basic measure. This measure is frequently corrupted by noise from various sources and hence statistical procedures are used to extract the underlying signal. The resulting brain activation can be presented graphically by color-coding the strength of activation across the brain or the specific region studied. The technique can localize activity to within millimetres but, using standard techniques, no better than within a window of a few seconds. The goal of fMRI data analysis is to detect correlations between brain activation and a task the subject performs during the scan. It also aims to discover correlations with specific cognitive states, such as memory and recognition, induced in the subject. The BOLD signature of activation is relatively weak, however, so other sources of noise in the acquired data must be carefully controlled. This means that a series of processing steps must be performed on the acquired images before the actual statistical search for task-related activation can begin. Nevertheless, it is possible to predict, for example, the emotions a person is experiencing solely from their fMRI, with a high degree of accuracy. fMRI is based on the same technology as magnetic resonance imaging (MRI) - a non-invasive test that uses a strong magnetic field and radio waves to create detailed images of the body. But instead of creating images of organs and tissues like MRI, fMRI looks at blood flow in the brain to detect areas of activity. These changes

in blood flow, which are captured on a computer, help doctors understand more about how the brain works.
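To make the correlation-based analysis described above concrete, the sketch below correlates a single voxel's BOLD time series with a block-design task regressor. The time series, the task timing, and the use of a plain Pearson correlation (rather than a full GLM with a hemodynamic response model) are simplifying assumptions for illustration only.

```python
# Toy illustration of task-correlation analysis on a single voxel's BOLD time series.
# Real fMRI analysis would convolve the task with a hemodynamic response function
# and fit a GLM over all voxels; this sketch only shows the core idea.
import numpy as np

def task_regressor(n_scans: int, block: int = 10) -> np.ndarray:
    """Alternating rest/task blocks, e.g. 10 scans rest, 10 scans task, repeated."""
    return np.array([(t // block) % 2 for t in range(n_scans)], dtype=float)

def voxel_task_correlation(bold: np.ndarray, task: np.ndarray) -> float:
    """Pearson correlation between one voxel's signal and the task regressor."""
    return float(np.corrcoef(bold, task)[0, 1])

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n_scans = 120
    task = task_regressor(n_scans)
    # Simulated voxel: weak task-related signal buried in noise.
    bold = 0.5 * task + rng.normal(0.0, 1.0, n_scans)
    print(f"correlation with task: {voxel_task_correlation(bold, task):.3f}")
```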

Figure 3.3. Some fMRI images taken from www.med.nyu.edu/thesenlab/wp-content/
fMRI is used both in the research world and, to a lesser extent, in the clinical world. It can also be combined and complemented with other measures of brain physiology such as EEG and NIRS. Newer methods which improve both spatial and time resolution are being researched, and these largely use biomarkers other than the BOLD signal. Some companies have developed commercial products such as lie detectors based on fMRI techniques, but the research is not believed to be ripe enough for widespread commercialization.
Advantages:
a) It can noninvasively record brain signals without the risks of radiation inherent in other scanning methods, such as CT or PET scans.
b) It has high spatial resolution. 2-3 mm is typical, but resolution can be as good as 1 mm.
c) It can record signals from all regions of the brain, unlike EEG/MEG, which are biased towards the cortical surface.
d) fMRI is widely used, and standard data-analysis approaches have been developed which allow researchers to compare results across labs.
e) fMRI produces compelling images of brain "activation".

Disadvantages:
a) The images produced must be interpreted carefully, since correlation does not imply causality, and brain processes are complex and often non-localized.
b) Statistical methods must be used carefully because they can produce false positives. One team of researchers studying reactions to pictures of human emotional expressions reported a few activated voxels in the brain of a dead salmon when no correction for multiple comparisons was applied, illustrating the need for rigorous statistical analysis.
c) The BOLD signal is only an indirect measure of neural activity, and is therefore susceptible to influence by non-neural changes in the body. This also means that it is difficult to interpret positive and negative BOLD responses.
d) BOLD signals are most strongly associated with the input to a given area rather than with the output. It is therefore possible (although unlikely) that a BOLD signal could be present in a given area even if there is no single-unit activity.
e) fMRI has poor temporal resolution. The BOLD response peaks approximately 5 seconds after neuronal firing begins in an area. This means that it is hard to distinguish BOLD responses to different events which occur within a short time window. Careful experimental design can reduce this problem. Also, some research groups are attempting to combine fMRI signals, which have relatively high spatial resolution, with signals recorded with other techniques, such as electroencephalography (EEG) or magnetoencephalography (MEG), which have higher temporal resolution but worse spatial resolution.


f) The BOLD response can be affected by a variety of factors, including: drugs/substances, age, brain pathology, local differences in neurovascular coupling, amount of carbon dioxide in the blood, etc.

3.4 Positron Emission Tomography
Positron emission tomography (PET) is a specialized radiology procedure used to examine various body tissues to identify certain conditions. PET may also be used to follow the progress of the treatment of certain conditions. While PET is most commonly used in the fields of neurology, oncology, and cardiology, applications in other fields are currently being studied. PET is a type of nuclear medicine procedure. This means that a tiny amount of a radioactive substance, called a radionuclide (radiopharmaceutical or radioactive tracer), is used during the procedure to assist in the examination of the tissue under study. Specifically, PET studies evaluate the metabolism of a particular organ or tissue, so that information about the physiology (functionality) and anatomy (structure) of the organ or tissue is evaluated, as well as its biochemical properties. Thus, PET may detect biochemical changes in an organ or tissue that can identify the onset of a disease process before anatomical changes related to the disease can be seen with other imaging processes, such as computed tomography (CT) or magnetic resonance imaging (MRI). PET is most often used by oncologists (doctors specializing in cancer treatment), neurologists and neurosurgeons (doctors specializing in treatment and surgery of the brain and nervous system), and cardiologists (doctors specializing in the treatment of the heart). However, as advances in PET technologies continue, this procedure is beginning to be used more widely in other areas. PET is also being used in conjunction with other diagnostic tests such as computed tomography (CT) to provide more definitive information about malignant (cancerous) tumors and other lesions. The combination of PET and CT shows

particular promise in the diagnosis and treatment of many types of cancer. Until recently, PET procedures were performed in dedicated PET centres, and the equipment used in these centres is quite expensive. However, a new technology called gamma camera systems (devices used to scan patients who have been injected with small amounts of radionuclides, and currently in use with other nuclear medicine procedures) is now being adapted for use in PET scan procedures. The gamma camera system can complete a scan more quickly, and at less cost, than a traditional PET scan. An example of what PET scans can show is given below.

Figure 3.4. PET scans show different patterns of glucose (sugar) metabolism related to performing various mental tasks
PET works by using a scanning device (a machine with a large hole at its centre) to detect positrons (subatomic particles) emitted by a radionuclide in the organ or tissue being examined. The radionuclides used in PET scans are made by attaching a radioactive atom to chemical substances that are used naturally by the particular organ or tissue during its metabolic process. For example, in PET scans of the brain, a radioactive atom is attached to glucose (blood sugar) to create a radionuclide called fluorodeoxyglucose (FDG), because the brain uses glucose for its metabolism. FDG is widely used in PET scanning. Other substances may be used for PET scanning, depending on the purpose of the scan. If blood flow and perfusion of

an organ or tissue is of interest, the radionuclide may be a type of radioactive oxygen, carbon, nitrogen, or gallium. The radionuclide is administered into a vein through an intravenous line. Next, the PET scanner slowly moves over the part of the body being examined. Positrons are emitted by the breakdown of the radionuclide. Gamma rays are created during the emission of positrons, and the scanner then detects the gamma rays. A computer analyses the gamma rays and uses the information to create an image map of the organ or tissue being studied. The amount of the radionuclide collected in the tissue affects how brightly the tissue appears on the image, and indicates the level of organ or tissue function. PET studies of glucose metabolism map the human brain's response while performing different tasks. Subjects looking at a visual scene activated the visual cortex (arrow); listening to a mystery story with language and music activated the left and right auditory cortices (arrows); counting backwards from 100 by sevens activated the frontal cortex (arrows); recalling previously learned objects activated the hippocampus bilaterally (arrows); and touching the thumb to the fingers of the right hand activated the left motor cortex and supplementary motor system (arrows), as shown below.

Figure 3.4.1. Images are cross-sections with the front of the brain at the top. Highest metabolic rates are in red, with lower values from yellow to blue.
In general, PET scans may be used to evaluate organs and/or tissues for the presence of disease or other conditions. PET may also be used to evaluate the function of organs such as the heart or brain. Another use of PET scans is in the evaluation of cancer treatment.
Advantages:
a) Usually painless.

b) Can help diagnose, treat, or predict the outcome for a wide range of conditions.
c) Unlike most other imaging types, can show how different parts of the body are working and can detect problems much earlier.
d) Can check how far a cancer has spread and how well treatment is working.
Disadvantages:
a) Involves exposure to ionising radiation (gamma rays).
b) The radioactive material may cause allergic or injection-site reactions in some people.
c) PET scanners cause some people to feel claustrophobic, which may mean sedation is required.

3.5 Electroencephalography
Electroencephalography (EEG) is the recording of electrical activity along the scalp. EEG measures voltage fluctuations resulting from ionic current flows within the neurons of the brain. In clinical contexts, EEG refers to the recording of the brain's spontaneous electrical activity over a short period of time, usually 20–40 minutes, as recorded from multiple electrodes placed on the scalp. Diagnostic applications generally focus on the spectral content of EEG, that is, the type of neural oscillations that can be observed in EEG signals. EEG is most often used to diagnose epilepsy, which causes abnormalities in EEG readings. It is also used to diagnose sleep disorders, coma, encephalopathy, and brain death. EEG used to be a first-line method of diagnosis for tumors, stroke and other focal brain disorders, but this use has decreased with the advent of high-resolution anatomical imaging techniques such as MRI and CT. Despite limited spatial resolution, EEG continues to be a valuable tool for research and diagnosis,


especially when millisecond-range temporal resolution (not possible with CT or MRI) is required. The brain's electrical charge is maintained by billions of neurons. Neurons are electrically charged (or "polarized") by membrane transport proteins that pump ions across their membranes. Neurons are constantly exchanging ions with the extracellular milieu, for example to maintain resting potential and to propagate action potentials. Ions of similar charge repel each other, and when many ions are pushed out of many neurons at the same time, they can push their neighbours, who push their neighbours, and so on, in a wave. This process is known as volume conduction. When the wave of ions reaches the electrodes on the scalp, they can push or pull electrons on the metal on the electrodes. Since metal conducts the push and pull of electrons easily, the difference in push or pull voltages between any two electrodes can be measured by a voltmeter. Recording these voltages over time gives us the EEG.
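To illustrate the "spectral content" idea mentioned earlier in this section, the following sketch estimates alpha-band (8–12 Hz) power of a single EEG channel with a plain FFT periodogram. The sampling rate, the band edges and the synthetic signal are illustrative assumptions; clinical EEG analysis uses more careful spectral estimators and artifact handling.

```python
# Sketch of EEG spectral-content analysis: band power of one channel via the FFT.
import numpy as np

def band_power(signal: np.ndarray, fs: float, low: float, high: float) -> float:
    """Sum of periodogram power between `low` and `high` Hz."""
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal)) ** 2 / signal.size
    mask = (freqs >= low) & (freqs <= high)
    return float(power[mask].sum())

if __name__ == "__main__":
    fs = 250.0                                   # samples per second (assumed)
    t = np.arange(0, 10.0, 1.0 / fs)             # ten seconds of data
    rng = np.random.default_rng(0)
    # Synthetic channel: a 10 Hz alpha rhythm plus broadband noise.
    eeg = np.sin(2 * np.pi * 10.0 * t) + 0.5 * rng.normal(size=t.size)
    alpha = band_power(eeg, fs, 8.0, 12.0)
    total = band_power(eeg, fs, 0.5, 45.0)
    print(f"alpha fraction of total power: {alpha / total:.2f}")
```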

Figure 3.5 Computer artwork of the top of a head with electroencephalography (EEG) electrodes attached to the scalp. EEG measures and records the electrical activity of the brain.

The electric potential generated by an individual neuron is far too small to be picked up by EEG or MEG. EEG activity therefore always reflects the summation of 28

the synchronous activity of thousands or millions of neurons that have similar spatial orientation. If the cells do not have similar spatial orientation, their ions do not line up and create waves to be detected. Pyramidal neurons of the cortex are thought to produce the most EEG signal because they are well-aligned and fire together. Because voltage fields fall off with the square of distance, activity from deep sources is more difficult to detect than currents near the skull. Scalp EEG activity shows oscillations at a variety of frequencies. Several of these oscillations have characteristic frequency ranges and spatial distributions, and are associated with different states of brain functioning (e.g., waking and the various sleep stages). These oscillations represent synchronized activity over a network of neurons. The neuronal networks underlying some of these oscillations are understood (e.g., the thalamocortical resonance underlying sleep spindles), while many others are not (e.g., the system that generates the posterior basic rhythm). Research that measures both EEG and neuron spiking finds the relationship between the two is complex, with a combination of EEG power in the gamma band and phase in the delta band relating most strongly to neuron spike activity. EEGs are frequently used in experimentation because the process is non-invasive to the research subject. The EEG is capable of detecting changes in electrical activity in the brain on a millisecond level. It is one of the few techniques available that has such high temporal resolution.
Advantages:
a) EEG can detect covert processing (i.e., processing that does not require a response).
b) EEG can be used in subjects who are incapable of making a motor response.
c) Some ERP components can be detected even when the subject is not attending to the stimuli.


d) Unlike other means of studying reaction time, ERPs can elucidate stages of processing (rather than just the final end result).
e) EEG is a powerful tool for tracking brain changes during different phases of life. EEG sleep analysis can indicate significant aspects of the timing of brain development, including evaluating adolescent brain maturation. Brain activity can also be monitored by CT.
f) In EEG there is a better understanding of what signal is measured as compared to other research techniques, e.g. the BOLD response in fMRI.
Disadvantages:
a) Low spatial resolution on the scalp. fMRI, for example, can directly display areas of the brain that are active, while EEG requires intense interpretation just to hypothesize what areas are activated by a particular response.
b) EEG poorly measures neural activity that occurs below the upper layers of the brain (the cortex).
c) Unlike PET and MRS, EEG cannot identify specific locations in the brain at which various neurotransmitters, drugs, etc. can be found.
d) It often takes a long time to connect a subject to EEG, as it requires precise placement of dozens of electrodes around the head and the use of various gels, saline solutions, and/or pastes to keep them in place. While the length of time differs depending on the specific EEG device used, as a general rule it takes considerably less time to prepare a subject for MEG, fMRI, MRS, and SPECT.
e) The signal-to-noise ratio is poor, so sophisticated data analysis and relatively large numbers of subjects are needed to extract useful information from EEG.


3.6 Magnetoencephalography
Magnetoencephalography (MEG) is a functional neuro-imaging technique for mapping brain activity by recording magnetic fields produced by electrical currents occurring naturally in the brain, using very sensitive magnetometers. Arrays of SQUIDs (superconducting quantum interference devices) are currently the most common magnetometer, while the SERF (spin exchange relaxation-free) magnetometer is being investigated for future machines. Applications of MEG include basic research into perceptual and cognitive brain processes, localizing regions affected by pathology before surgical removal, determining the function of various parts of the brain, and neuro-feedback. This can be applied in a clinical setting to find locations of abnormalities as well as in an experimental setting to simply measure brain activity; it is a non-invasive technique for investigating human brain activity. It allows the measurement of on-going brain activity on a millisecond-by-millisecond basis, and it shows where in the brain activity is produced.

Figure 3.6. The sequence of steps to localize sources of neuronal activity from time-domain recordings to MRI overlay


At the cellular level, individual neurons in the brain have electrochemical properties that result in the flow of electrically charged ions through a cell. Electromagnetic fields are generated by the net effect of this slow ionic current flow. While the magnitude of the field associated with an individual neuron is negligible, the effect of multiple neurons (for example, 50,000 – 100,000) excited together in a specific area generates a measurable magnetic field outside the head. These neuromagnetic signals generated by the brain are extremely small, a billionth of the strength of the earth's magnetic field. Therefore, MEG scanners require superconducting sensors (SQUID, superconducting quantum interference device). The SQUID sensors are bathed in a large liquid helium cooling unit at approximately -269 degrees C. Due to low impedance at this temperature, the SQUID device can detect and amplify magnetic fields generated by neurons a few centimetres away from the sensors. A magnetically shielded room houses the equipment and mitigates interference. There are many uses for MEG, including assisting surgeons in localizing pathology, assisting researchers in determining the function of various parts of the brain, neurofeedback, and others. Some advantages and disadvantages are listed below:
Advantages:
a) Measures brain function.
b) High precision – millimetre resolution.
c) High temporal resolution – millisecond resolution (capturing epileptic spikes).
d) Non-invasive.
e) Easy to use.
Disadvantages:
There are many disadvantages involved with MEG, chief among them cost. In this particular scenario, we may say that it is not so popular due to

the huge cost of the instrument and the need for trained neurologists to operate the device.


Chapter 4
4. Feature Extraction from MRI
Features are the distinguishable characteristics of the objects of interest; if selected carefully, they are representative of the maximum relevant information that the image has to offer for a complete characterization of a lesion. Feature extraction procedures analyse objects and images to extract the most prominent features that are representative of the various classes of objects. Features are used as inputs to classifiers that assign them to the class that they represent. The purpose of feature extraction is to reduce the original data by measuring certain properties, or features, that distinguish one input pattern from another. The extracted features should provide the characteristics of the input to the classifier by encoding the relevant properties of the image into feature vectors. We may consider three types of features, namely shape features, intensity features and texture features. These three types of features are extracted to provide us with structural information on intensity, shape, and texture. These features may be redundant, but we use feature selection to minimise this redundant information. Feature extraction can also be done using the discrete wavelet transform, a wavelet transform for which the wavelets are discretely sampled. The fundamental idea of wavelet transforms is that the transformation should allow only changes in time extension, but not shape. This is achieved by choosing suitable basis functions that allow for this. Changes in the time extension are expected to conform to the corresponding analysis frequency of the basis function, based on the uncertainty principle of signal processing,

\Delta t \, \Delta \omega \ge \frac{1}{2}

where $t$ represents time and $\omega$ angular frequency ($\omega = 2\pi f$, where $f$ is the temporal frequency). In some cases Gabor wavelet analysis can also provide features for recognising the correct type of tumor. Gabor wavelets are wavelets invented by Dennis Gabor using complex functions constructed to serve as a basis for Fourier transforms

in information theory applications. They are very similar to Morlet wavelets. They are also closely related to Gabor filters. The important property of the wavelet is that it minimizes the product of its standard deviations in the time and frequency domains. Put another way, the uncertainty in the information carried by this wavelet is minimized. However, they have the downside of being non-orthogonal, so efficient decomposition into the basis is difficult. The motivation for Gabor wavelets comes from finding some function $f(x)$ which minimizes its standard deviation in the time and frequency domains. More formally, the variance in the position domain is:

(\Delta x)^2 = \frac{\int_{-\infty}^{+\infty} (x - \mu)^2 \, f(x) f^*(x) \, dx}{\int_{-\infty}^{+\infty} f(x) f^*(x) \, dx}

where $f^*(x)$ is the complex conjugate of $f(x)$ and $\mu$ is the arithmetic mean, defined as:

\mu = \frac{\int_{-\infty}^{+\infty} x \, f(x) f^*(x) \, dx}{\int_{-\infty}^{+\infty} f(x) f^*(x) \, dx}

The variance in the wave number domain is:

(\Delta k)^2 = \frac{\int_{-\infty}^{+\infty} (k - k_0)^2 \, F(k) F^*(k) \, dk}{\int_{-\infty}^{+\infty} F(k) F^*(k) \, dk}

where $k_0$ is the arithmetic mean of $F(k)$, the Fourier transform of $f(x)$:

k_0 = \frac{\int_{-\infty}^{+\infty} k \, F(k) F^*(k) \, dk}{\int_{-\infty}^{+\infty} F(k) F^*(k) \, dk}

With these defined, the uncertainty is written as $(\Delta x)(\Delta k)$. This quantity has been shown to have a lower bound of $\frac{1}{2}$. The quantum mechanics view is to interpret $\Delta x$ as the uncertainty in position and $\hbar \Delta k$ as the uncertainty in momentum. A function $f(x)$ that has the lowest theoretically possible uncertainty bound is the Gabor wavelet.
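As a concrete, hedged illustration of how Gabor analysis can be turned into texture features, the sketch below builds a single Gabor kernel directly in NumPy, convolves it with an image, and reports the mean and variance of the response magnitude. The kernel parameters (frequency, orientation, bandwidth) are arbitrary example values; a real feature extractor would use a bank of kernels at several frequencies and orientations, and this is not the specific implementation used in any particular study.

```python
# Minimal Gabor-filter feature sketch (illustrative parameters, not a clinical pipeline).
import numpy as np
from scipy.ndimage import convolve

def gabor_kernel(frequency: float, theta: float, sigma: float = 3.0, size: int = 21):
    """Complex 2-D Gabor kernel: a Gaussian envelope times a complex sinusoid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate coordinates so the sinusoid runs along orientation `theta`.
    x_r = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    carrier = np.exp(2j * np.pi * frequency * x_r)
    return envelope * carrier

def gabor_features(image: np.ndarray, frequency: float, theta: float):
    """Mean and variance of the Gabor response magnitude, usable as texture features."""
    kernel = gabor_kernel(frequency, theta)
    real = convolve(image, kernel.real, mode="reflect")
    imag = convolve(image, kernel.imag, mode="reflect")
    magnitude = np.hypot(real, imag)
    return magnitude.mean(), magnitude.var()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    slice_2d = rng.random((64, 64))          # stand-in for an MRI slice
    mean_resp, var_resp = gabor_features(slice_2d, frequency=0.2, theta=np.pi / 4)
    print(f"mean response: {mean_resp:.4f}, variance: {var_resp:.4f}")
```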

Feature extraction can also be done using the gray-level co-occurrence matrix (GLCM), also known as the gray-level spatial dependence matrix: a statistical method of examining texture that considers the spatial relationship of pixels. The GLCM functions characterize the texture of an image by calculating how often pairs of pixels with specific values and in a specified spatial relationship occur in an image, creating a GLCM, and then extracting statistical measures from this matrix. The matrix is then analysed with texture measures such as contrast, correlation, etc. Morphometry analysis has become a common tool for computational brain anatomy studies. It allows a comprehensive measurement of structural differences within a group or across groups, not just in specific structures, but throughout the entire brain. Voxel Based Morphometry (VBM) is a computational approach to neuroanatomy that measures differences in local concentrations of brain tissue, through a voxel-wise comparison of multiple brain images. For instance, VBM has been applied to study volumetric atrophy of the grey matter (GM) in areas of the neocortex of AD patients vs. control subjects. The processing pipeline of VBM is illustrated in Figure 4. The procedure involves the spatial normalization of subject images into a standard space, segmentation of tissue classes using a priori probability maps, smoothing to correct noise and small variations, and voxel-wise statistical tests. Smoothing is done by convolution with a Gaussian kernel whose Full-Width at Half-Maximum (FWHM) is tuned for the problem at hand. Statistical analysis is based on the General Linear Model (GLM) to describe the data in terms of experimental and confounding effects, and residual variability. Classical statistical inference is used to test hypotheses that are expressed in terms of GLM estimated regression parameters. This computation of a given contrast provides a Statistical Parametric Map (SPM), which is thresholded according to Random Field theory.


Figure 4. The processing pipeline of Voxel-Based Morphometry (VBM) on structural MRI volumes (Chyzhyk, 2000)

Feature extraction is the process of extracting abnormal- and normal-tissue-specific features from the pre-processed images in such a way that inter-class variation is maximized and intra-class similarity is maximized. Methods for feature extraction can be divided into different categories based on the type of features, such as pixel-intensity-based features (gray-scale values of the pixels), calculated pixel-intensity-based features (calculated MR parameters such as metrics relating to flow of contrast material, cerebral blood volume, blood flow or blood oxygenation), edge- and texture-based features, and transform-based features. Transform-based feature extraction methods extract a set of discriminative features that provide better classification of MRI images. In the literature, various feature extraction methods of this kind have been proposed, such as PCA, ICA, the Fourier transform and the wavelet transform.
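As an illustrative sketch under stated assumptions (PyWavelets and scikit-learn installed; the 'db2' wavelet, two decomposition levels, and five principal components are arbitrary choices), transform-based features can be derived from a wavelet decomposition and then compressed with PCA:

```python
# Sketch only: wavelet sub-band energies as transform-based features,
# optionally reduced with PCA. Wavelet name and level are assumptions.
import numpy as np
import pywt
from sklearn.decomposition import PCA

def wavelet_features(slice_2d, wavelet="db2", level=2):
    """Energy of each sub-band from a multilevel 2-D wavelet decomposition."""
    coeffs = pywt.wavedec2(slice_2d, wavelet=wavelet, level=level)
    feats = [np.mean(np.square(coeffs[0]))]            # approximation energy
    for detail in coeffs[1:]:                          # (cH, cV, cD) per level
        feats.extend(np.mean(np.square(band)) for band in detail)
    return np.asarray(feats)

# Stack per-slice feature vectors and reduce dimensionality, e.g.:
# X = np.vstack([wavelet_features(s) for s in slices])
# X_reduced = PCA(n_components=5).fit_transform(X)
```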


Chapter 5
5. Different Types of Brain Tumor
There are roughly 120 types of brain tumor, of which we are going to discuss six of the most commonly occurring types, their pathological description and their MRI (usually T2-weighted) appearance so that we can distinguish between them. Tumors can be benign or malignant. A benign tumor is one which is harmless, whereas a malignant tumor is cancerous or harmful. A malignant tumor is not self-limited in its growth, is capable of invading adjacent tissues, and may be capable of spreading to distant tissues. A benign tumor has none of those properties. A tumor can occur in different parts of the brain, and may be primary or secondary. A primary tumor is one that has started in the brain, as opposed to a metastatic tumor, which is one that has spread to the brain from another part of the body. Metastatic tumors are more prevalent than primary tumors by a ratio of about 4:1. Tumors may or may not be symptomatic: some tumors are discovered because the patient has symptoms, while others show up incidentally on an imaging scan or at an autopsy. Brain tumors can be classified into two general groups, namely primary and secondary brain tumors. Tumors that originate within brain tissue are known as primary brain tumors. Primary brain tumors are classified by the type of tissue in which they arise. The most common brain tumors are gliomas, which begin in the glial (supportive) tissue. Secondary brain tumors are caused by cancer that originates in another part of the body. These tumors are not the same as primary brain tumors. The spread of cancer within the body is called metastasis. Cancer that spreads to the brain is the same disease and has the same name as the original (primary) cancer. For example, if lung cancer spreads to the brain, the disease is called metastatic lung cancer because the cells in the secondary tumor resemble abnormal lung cells, not abnormal brain cells.


Figure 5. Most commonly brain tumours develop from cells that support the nerve cells of the brain and brain components

Doctors group brain tumors by grade. The grade of a tumor refers to the way the cells look under a microscope:
x Grade I: The tissue is benign. The cells look nearly like normal brain cells, and they grow slowly.
x Grade II: The tissue is malignant. The cells look less like normal cells than do the cells in a Grade I tumor.
x Grade III: The malignant tissue has cells that look very different from normal cells. The abnormal cells are actively growing (anaplastic).
x Grade IV: The malignant tissue has cells that look most abnormal and tend to grow quickly.
Cells from low-grade tumors (grades I and II) look more normal and generally grow more slowly than cells from high-grade tumors (grades III and IV). Over time, a low-grade tumor may become a high-grade tumor. However, the change to a high-grade tumor happens more often among adults than children. Some of the most commonly occurring brain tumors are described below:

5.1 Glioma
Glioma is a common type of primary brain tumor, accounting for about 33% of these tumors. Gliomas originate in the glial cells of the brain. Glial cells are the tissue that surrounds and supports neurons in the brain. Gliomas are called intrinsic brain tumors because they reside within the substance of the brain and often intermix with normal brain tissue. There are different grades of gliomas; however, they are most often referred to as "low-grade" or "high-grade" gliomas. The low- or high-grade designation reflects the growth potential and aggressiveness of the tumor. Three types of normal glial cells can produce tumors—astrocytes, oligodendrocytes, and ependymal cells. Tumors that display a mixture of these cells are called mixed gliomas. Some of the main glioma subtypes are given below:
x Astrocytoma: Astrocytomas are tumors that arise from astrocytes—star-shaped cells that make up the "glue-like" or supportive tissue of the brain. These tumors are "graded" on a scale from I to IV based on how normal or abnormal the cells look. There are low-grade astrocytomas and high-grade astrocytomas. Low-grade astrocytomas are usually localized and grow slowly. High-grade astrocytomas grow at a rapid pace and require a different course of treatment. Most astrocytoma tumors in children are low grade. In adults, the majority are high grade.


Figure 5.1.1 Astrocytes are the cells that make up the “glue-like” or supportive tissue of the brain

Location: Astrocytomas can appear in various parts of the brain and nervous system, including the cerebellum, the cerebrum, the central areas of the brain, the brainstem, and the spinal cord.
Symptoms: Headaches, seizures, memory loss, and changes in behavior are the most common early symptoms of astrocytoma. Other symptoms may occur depending on the size and location of the tumor.
Incidence: Pilocytic astrocytomas are typically seen in children and young adults. The other types tend to occur in males more often than females, and most often in people age 45 and over.


x Ependymoma: Ependymomas arise from the ependymal cells that line the ventricles of the brain and the center of the spinal cord.

Figure 5.1.2 Ependymal cells line the ventricles of the brain and the center of the spinal cord

Location: The various types of ependymomas appear in different locations within the brain and spinal column. Subependymomas usually appear near a ventricle. Myxopapillary ependymomas tend to occur in the lower part of the spinal column. Ependymomas are usually located along, within, or next to the ventricular system. Anaplastic ependymomas are most commonly found in the brain in adults and in the lower back part of the skull (posterior fossa) in children. They are rarely found in the spinal cord.
Description: Ependymomas are soft, grayish or red tumors which may contain cysts or mineral calcifications.
Symptoms: Symptoms of an ependymoma are related to the location and size of the tumor. In babies, increased head size may be one of the first symptoms. Irritability, sleeplessness, and vomiting may develop as the tumor grows. In older children and adults, nausea, vomiting, and headache are the most common symptoms.
Incidence: Ependymomas are relatively rare tumors in adults, accounting for 2-3% of primary brain tumors. However, they are the sixth most common brain tumor in children. About 30% of pediatric ependymomas are diagnosed in children younger than 3 years of age.

These tumors are divided into four major types:
o Subependymomas (grade I): Typically slow-growing tumors.
o Myxopapillary ependymomas (grade I): Typically slow-growing tumors.
o Ependymomas (grade II): The most common of the ependymal tumors. This type can be further divided into the following subtypes, including cellular ependymomas, papillary ependymomas, clear cell ependymomas, and tanycytic ependymomas.
o Anaplastic ependymomas (grade III): Typically faster-growing tumors.
x Mixed Glioma (also called Oligoastrocytoma): These tumors usually contain a high proportion of more than one type of cell, most often astrocytes and oligodendrocytes. Occasionally, ependymal cells are also found. The behavior of a mixed glioma appears to depend on the grade of the tumor. It is less clear whether their behavior is based on that of the most abundant cell type.
Location: These tumors can be found anywhere within the cerebral hemispheres of the brain, although the frontal and temporal lobes are the most common locations.
Description: Oligoastrocytomas (grade II) are considered low-grade tumors. They generally grow at a slower rate than anaplastic oligoastrocytomas (grade III), which are malignant. Oligoastrocytomas may evolve over time into anaplastic oligoastrocytomas.
Symptoms: The most common symptoms of oligoastrocytoma are seizures, headaches, and personality changes. Other symptoms vary by location and size of the tumor.
Incidence: About 40% of primary brain tumors are gliomas. Mixed gliomas, primarily oligoastrocytomas, account for 5-10% of gliomas and 1% of all brain tumors. Oligoastrocytomas develop in young and middle-aged adults (ages 30 to 50). Very few children are diagnosed with oligoastrocytoma.

x Optic Glioma: These tumors may involve any part of the optic pathway, and they have the potential to spread along these pathways. Most of these tumors occur in children under the age of 10. Grade I pilocytic astrocytoma and grade II fibrillary astrocytoma are the most common tumors affecting these structures. Higher-grade tumors may also arise in this location. Twenty percent of children with neurofibromatosis (NF-1) will develop an optic glioma. These gliomas are typically grade I, pilocytic astrocytomas. Children with optic glioma are usually screened for NF-1 for this reason. Adults with NF-1 typically do not develop optic gliomas. x Gliomatosis Cerebri: This is an uncommon brain tumor that features widespread glial tumor cells in the brain. This tumor is different from other gliomas because it is scattered and widespread, typically involving two or more lobes of the brain. It could be considered a “widespread low-grade glioma” because it does not have the malignant features seen in high-grade tumors. Gliomas are further categorized according to their grade, which is determined by pathologic evaluation of the tumor. x Low-grade gliomas [WHO grade II] are well-differentiated (not anaplastic); these tend to exhibit benign tendencies and portend a better prognosis for the patient. However, they have a uniform rate of recurrence and increase in grade over time so should be classified as malignant. x High-grade [WHO grade III–IV] gliomas are undifferentiated or anaplastic; these are malignant and carry a worse prognosis. Of numerous grading systems in use, the most common is the World Health Organization (WHO) grading system for astrocytoma, under which tumors are graded from I (least advanced disease—best prognosis) to IV (most advanced disease— worst prognosis). Gliomas can also be classified according to whether they are above


or below a membrane in the brain called the tentorium. The tentorium separates the cerebrum (above) from the cerebellum (below). x Supratentorial: above the tentorium, in the cerebrum, mostly found in adults (70%). x Infratentorial: below the tentorium, in the cerebellum, mostly found in children (70%). x Pontine: located in the pons of the brainstem. The brainstem has three parts (pons, midbrain and medulla); the pons controls critical functions such as breathing, making surgery on these extremely dangerous. Symptoms of gliomas depend on which part of the central nervous system is affected. A brain glioma can cause headaches, nausea and vomiting, seizures, and cranial nerve disorders as a result of increased intracranial pressure. A glioma of the optic nerve can cause visual loss. Spinal cord gliomas can cause pain, weakness, or numbness in the extremities. Gliomas do not metastasize by the bloodstream, but they can spread via the cerebrospinal fluid and cause "drop metastases" to the spinal cord. A child who has a subacute disorder of the central nervous system that produces cranial nerve abnormalities (especially of cranial nerve VII and the lower bulbar nerves), long-tract signs, unsteady gait secondary to spasticity, and some behavioral changes is most likely to have a pontine glioma. The exact causes of gliomas are not known. Hereditary genetic disorders such as neurofibromatoses (type 1 and type 2) and tuberous sclerosis complex are known to predispose to their development. Different oncogenes can cooperate in the development of gliomas. Gliomas have been correlated to the electromagnetic radiation from cell phones, and a link between the cancer and cell phone usage was considered possible, though several large studies have found no conclusive evidence. Experiments designed to test such a link gave negative results.


5.2 Meningioma
Meningiomas are a diverse set of tumors arising from the meninges, the membranous layers surrounding the central nervous system. They arise from the arachnoid "cap" cells of the arachnoid villi in the meninges. These tumors usually are benign in nature; however, a small percentage are malignant. Many meningiomas are asymptomatic, producing no symptoms throughout a person's life, and if discovered, require no treatment other than periodic observation. Typically, symptomatic meningiomas are treated with either radiosurgery or conventional surgery. Small tumors (e.g., < 2.0 cm) usually are incidental findings at autopsy without having caused symptoms. Larger tumors may cause symptoms, depending on the size and location, some of which are given below:
x Focal seizures may be caused by meningiomas that overlie the cerebrum.
x Progressive spastic weakness in the legs and incontinence may be caused by tumors that overlie the parasagittal frontoparietal region.
x Sylvian tumors may cause myriad motor, sensory, aphasic, and seizure symptoms, depending on the location.
x Increased intracranial pressure eventually occurs, but is less frequent than in gliomas.
x Diplopia (double vision) or uneven pupil size may be symptoms if related pressure causes a third and/or sixth nerve palsy.
The causes of meningiomas are not well understood. Most cases are sporadic, appearing randomly, while some are familial. Persons who have undergone radiation, especially to the scalp, are more at risk for developing meningiomas, as are those who have suffered a brain injury at some time. Atomic bomb survivors from Hiroshima had a higher than typical frequency of developing meningiomas, with the incidence increasing the closer they were to the site of the explosion. Dental x-rays are correlated with an increased risk of meningioma, in particular for patients who had frequent dental x-rays in the past, when the x-ray dose of a dental x-ray was higher

than in the present. Heavy mobile telephone use has been associated with increased incidence of meningioma. Many individuals have meningiomas, but remain asymptomatic, so the meningiomas are discovered during an autopsy. One to two percent of all autopsies reveal meningiomas that were unknown to the individuals during their lifetime, since there were never any symptoms. In the 1970s, tumors causing symptoms were discovered in 2 out of 100,000 people, while tumors discovered without causing symptoms occurred in 5.7 out of 100,000, for a total incidence of 7.7/100,000. With the advent of modern sophisticated imaging systems such as CT scans, the discovery of asymptomatic meningiomas has tripled. Meningiomas are more likely to appear in women than men, though when they appear in men, they are more likely to be malignant. Meningiomas may appear at any age, but most commonly are noticed in men and women age 50 or older, with meningiomas becoming more likely with age. They have been observed in all cultures, Western and Eastern, in roughly the same statistical frequency as other possible brain tumors. Ninety-two percent of meningiomas are benign. Eight percent are either atypical or malignant. Meningiomas occur most commonly in older women. But a meningioma can occur in males and at any age, including childhood. A meningioma doesn't always require immediate treatment. A meningioma that causes no significant signs and symptoms may be monitored over time.

Figure 5.2.1 Common locations of Meningiomas (Source: www.abta.org, 2015)

Meningiomas arise from arachnoidal cells, most of which are in the vicinity of the venous sinuses, and this is the site of greatest prevalence for meningioma formation. They are most frequently attached to the dura over the superior parasagittal surface of the frontal and parietal lobes, along the sphenoid ridge, in the olfactory grooves, the sylvian region, the superior cerebellum along the falx cerebri, the cerebellopontine angle, and the spinal cord. The tumor is usually gray, well-circumscribed, and takes on the form of the space it occupies. Meningiomas usually are dome-shaped, with the base lying on the dura. In many cases, benign meningiomas grow slowly. This means that, depending upon where it is located, a meningioma may reach a relatively large size before it causes symptoms. Other meningiomas grow more rapidly, or have sudden growth spurts. There is no way to predict the rate of growth for a meningioma, or to know for certain how long a specific tumor was growing before diagnosis. Most people with a meningioma will have a tumor at only one site, but it also is possible to have several tumors growing simultaneously in different parts of the brain and spinal cord. When multiple meningiomas occur, more than one type of treatment may be necessary. Meningiomas were originally classified into 9 major subtypes based on their structure and form. However, more recently it has become more common to group them into three major classes, or not to distinguish subtypes at all. Multiple classifications exist today, but the most commonly used is the World Health Organization's (WHO) "Classification of Tumours of the Nervous System," most recently updated in 2000. The variations of meningioma are given below:
x Convexity meningiomas: These grow on the surface of the brain, often toward the front. They may not produce symptoms until they reach a large size. Symptoms of a convexity meningioma are seizures, focal neurological deficits, or headaches.


x Falx and Parasagittal meningiomas: The falx is a groove that runs between the two sides of the brain (front to back), and contains a large blood vessel (the sagittal sinus). Parasagittal tumors lie near or close to the falx. Because of the danger of puncturing the blood vessels, removing a tumor in the falx or parasagittal region can be difficult. Large parasagittal meningiomas may result in bilateral leg weakness.
x Olfactory groove meningiomas: Olfactory groove meningiomas grow along the nerves that run between the brain and the nose. These nerves allow you to smell, and so tumors growing here often cause loss of smell. If they grow large enough, olfactory groove meningiomas can also compress the nerves to the eyes, causing visual symptoms. Similarly, meningiomas growing on the optic nerve can cause visual problems, including loss of patches within your field of vision, or even blindness. They can grow to a large size prior to being diagnosed because changes in the sense of smell and mental status changes are difficult to recognize.
x Sphenoid meningiomas: Sphenoid meningiomas lie behind the eyes. These tumors can cause visual problems, loss of sensation in the face, or facial numbness. Tumors in this location can sometimes involve the blood sources of the brain (e.g. the cavernous sinus or carotid arteries), making them difficult or impossible to remove completely.
x Posterior fossa meningiomas: Posterior fossa tumors lie on the underside of the brain. These tumors can compress the cranial nerves, causing facial symptoms or loss of hearing. Petroclival tumors can compress the trigeminal nerve, resulting in sharp pain in the face (trigeminal neuralgia) or spasms of the facial muscles. Tentorial meningiomas, or those near the area where your spinal cord connects to your brain (the foramen magnum), can cause headaches or other signs of brain stem compression such as difficulty walking.
x Intraventricular meningiomas: Intraventricular meningiomas are associated with the connected chambers of fluid that circulate throughout the central

nervous system. They can block the flow of this fluid, causing pressure to build up, which can produce headaches and dizziness.
x Intraorbital meningiomas: Intraorbital meningiomas grow around the eye sockets of your skull and can cause pressure in the eyes to build up, resulting in a bulging appearance. They can also cause an increasing loss of vision.
x Spinal meningiomas: Spinal meningiomas account for less than 10 percent of meningiomas. They tend to occur in women (with a female/male ratio of 5 to 1), usually between the ages of 40 and 70. They are intradural (within or enclosed within the dura mater), extramedullary (outside or unrelated to any medulla) tumors occurring predominantly in the thoracic spine. They can cause back pain, or pain in the limbs from compression of the nerves where they run into the spinal cord.
Meningiomas may cause seizures, headaches, and focal neurological deficits, such as arm or leg weakness, or vision loss. Patients often have subtle symptoms for a long period before the meningioma is diagnosed. Sometimes memory loss, carelessness, and unsteadiness are the only symptoms. Diagnosis is made by a contrast-enhanced CT (computerized tomography) and/or MRI (magnetic resonance imaging) scan. While MRIs are in some ways superior, the CT can be helpful in determining if the tumor invades the bone, or if it is becoming hard like bone.
Location: Although meningiomas are referred to as brain tumors, they do not grow from brain tissue. They arise from the meninges, which are three thin layers of tissue covering the brain and spinal cord. These tumors are most often found near the top and the outer curve of the brain. Tumors may also form at the base of the skull.
Description: Meningiomas usually grow inward, causing pressure on the brain or spinal cord. They also can grow outward toward the skull, causing it to thicken. Most meningiomas are noncancerous, slow-growing tumors. Some contain sacs of fluid (cysts), mineral deposits (calcifications), or tightly packed bunches of blood vessels.

Symptoms: Meningiomas usually grow slowly, and may reach a large size before interfering with the normal functions of the brain. The resulting symptoms depend on the location of the tumor within the brain. Headache and weakness in an arm or leg are the most common symptoms. However, seizures, personality changes, and/or visual problems may also occur. Incidence: Meningiomas account for about 36.1% of all primary brain tumors, which are tumors that form in the brain or its coverings. They are most likely to be found in adults older than 60; the incidence appears to increase with age. Rarely are meningiomas found in children. They occur about twice as often in women as in men.

5.3 Sarcoma
A sarcoma is a cancer that arises from transformed cells of mesenchymal origin. Thus, malignant tumors made of cancellous bone, cartilage, fat, muscle, vascular, or hematopoietic tissues are, by definition, considered sarcomas. This is in contrast to a malignant tumor originating from epithelial cells, which is termed a carcinoma. Human sarcomas are quite rare. Common malignancies, such as breast, colon, and lung cancer, are almost always carcinomas.

Figure 5.3.1 Sarcoma in (a) T2-weighted, (b) T1-weighted, and (c) PD MR images

A sarcoma is a rare kind of cancer. Sarcomas are different from the much more common carcinomas because they happen in a different kind of tissue. Sarcomas grow in connective tissue -- cells that connect or support other kinds of tissue in your body. These tumors are most common in the bones, muscles, tendons, cartilage,

nerves, fat, and blood vessels of your arms and legs, but they can happen anywhere. In addition to being named based on the tissue of origin, sarcomas are also assigned a grade (low, intermediate, or high) based on the presence and frequency of certain cellular and subcellular characteristics associated with malignant biological behavior. Low-grade sarcomas are usually treated surgically, although sometimes radiation therapy or chemotherapy is used. Intermediate- and high-grade sarcomas are more frequently treated with a combination of surgery, chemotherapy and/or radiation therapy. Since higher-grade tumors are more likely to undergo metastasis (invasion and spread to locoregional and distant sites), they are treated more aggressively. The recognition that many sarcomas are sensitive to chemotherapy has dramatically improved the survival of patients. For example, in the era before chemotherapy, long-term survival for patients with localized osteosarcoma was only approximately 20%, but it has now risen to 60–70%. Although there are more than 50 types of sarcoma, they can be grouped into two main kinds: soft tissue sarcoma and bone sarcoma, or osteosarcoma. About 1 out of 100 cases of adult cancers is soft tissue sarcoma. Osteosarcomas are even rarer. The most relevant types of sarcoma are listed below. (ICD-O codes are provided, where available, along with the relevant edition.)
x Askin's tumor (8803/3)
x Sarcoma botryoides
x Chondrosarcoma (9220/3–9240/3)
x Ewing's (9260/3)—PNET (9473/3)
x Malignant Hemangioendothelioma (9130/3)
x Malignant Schwannoma (9560/3–9561/3)
x Osteosarcoma (9180/3–9190/3)
x Soft tissue sarcomas, including:
o Alveolar soft part sarcoma (9581/3)
o Angiosarcoma (9120/3)

o Cystosarcoma Phyllodes
o Dermatofibrosarcoma protuberans (DFSP) (8832/3–8833/3)
o Desmoid Tumor (8821/1–8822/1)
o Desmoplastic small round cell tumor (8806/3)
o Epithelioid Sarcoma (8804/3)
o Extraskeletal chondrosarcoma (9220/3)
o Extraskeletal osteosarcoma (9180/3)
o Fibrosarcoma (8810/3)
o Gastrointestinal stromal tumor (GIST)
o Hemangiopericytoma (9150) (Also known as "solitary fibrous tumor". Only a subset of these tumors are classified as malignant.)
o Hemangiosarcoma (9120/3) (More commonly referred to as "angiosarcoma")
o Kaposi's sarcoma (9140/3)
o Leiomyosarcoma (8890/3–8896/3)
o Liposarcoma (8850/3–8858/3)
o Lymphangiosarcoma (9170–9175)
o Lymphosarcoma (Not considered to be sarcomas)
o Malignant fibrous histiocytoma (8830/3) (This is an obsolete term that is no longer recognized by the World Health Organization. Many of these tumors would currently be classified as "undifferentiated pleomorphic sarcoma".)
o Malignant peripheral nerve sheath tumor (MPNST)
o Neurofibrosarcoma (9540/3)
o Rhabdomyosarcoma (8900–8920)

o Synovial sarcoma (9040/3–9043/3)
o Undifferentiated pleomorphic sarcoma (previously referred to as Malignant fibrous histiocytoma)
Soft tissue sarcomas are hard to spot because they can grow anywhere in your body. Most often, the first sign is a painless lump. As the lump gets bigger, it might press against nerves or muscles and make you uncomfortable or give you trouble breathing, or both. There are no tests that can find these tumors before they cause symptoms that you notice. It is not clear why some people develop sarcoma; however, researchers have been able to identify some common characteristics of groups with high rates of soft tissue sarcoma. Some studies have shown that people exposed to phenoxyacetic acid in herbicides and chlorophenols in wood preservative have an increased risk of soft tissue sarcoma. Researchers also know that people exposed to high doses of radiation are at a greater risk for developing soft tissue sarcoma. Researchers are also studying genetic abnormalities and chromosome mutations as possible causes of soft tissue sarcoma. People with certain inherited diseases such as neurofibromatosis or familial syndromes associated with p53 mutations have been shown to have higher risks of soft tissue sarcoma. Early on, soft tissue sarcoma rarely causes any symptoms. Because soft tissue is very elastic, the tumors can grow quite large before they are felt. The first symptom is usually a painless lump. As the tumor grows and begins to press against nearby nerves and muscles, pain or soreness can occur. Soft tissue sarcomas can only be diagnosed by a surgical biopsy, a procedure in which tissue is removed from the tumor and analyzed under a microscope. Soft tissue sarcomas are treated using surgery, radiation therapy and chemotherapy. Depending on the size, location, extent and grade (growth rate) of the tumor, a combination of all or some of these treatments may be used. Biological therapy, such as treatment to stimulate the body's immune system to fight cancer, or molecules that target certain genes expressed by the cancer cells, is also being used for some sarcomas such as GIST, and in clinical trials for many other types of sarcoma.

Accurate data about the actual diagnosed number of cases of sarcoma is hard to find. This is because cancer is reported against 'site of origin' (the area of the body where the cancer is found). Sarcomas can appear almost anywhere on or in the body and many are only found following investigations for a condition which seems unconnected with cancer. They are often reported as a cancer associated with a specific part of the body rather than as a sarcoma. Surgery is important in the treatment of most sarcomas. Limb sparing surgery, as opposed to amputation, can now be used to save the limbs of patients in at least 90% of extremity tumor cases. Additional treatments, including chemotherapy and radiation therapy, may be administered before and/or after surgery. Chemotherapy significantly improves the prognosis for many sarcoma patients, especially those with bone sarcomas. Treatment can be a long and arduous process, lasting about a year for many patients.

5.4 Glioblastoma or Glioblastoma multiforme
Glioblastoma multiforme (GBM), WHO classification name "glioblastoma", also known as Grade IV Astrocytoma, is the most common and most aggressive malignant primary brain tumor in humans, involving glial cells and accounting for 52% of all functional tissue brain tumor cases and 20% of all intracranial tumors. GBM is a rare disease, with an incidence of 2–3 cases per 100,000 person life-years in Europe and North America. It presents two variants: giant cell glioblastoma and gliosarcoma. Treatment can involve chemotherapy, radiation and surgery. Median survival with standard-of-care radiation and chemotherapy with temozolomide is 15 months. Median survival without treatment is 4½ months. Although no randomized controlled trials have been done, surgery remains the standard of care. Although common symptoms of the disease include seizure, nausea and vomiting, headache, memory loss, and hemiparesis, the single most prevalent symptom is a progressive memory, personality, or neurological deficit due to temporal and frontal lobe involvement. The kind of symptoms produced depends highly on the location of the tumor, more so than on its pathological properties. The

tumor can start producing symptoms quickly, but occasionally is an asymptomatic condition until it reaches an enormous size. For unknown reasons, GBM occurs more commonly in males. Most glioblastoma tumors appear to be sporadic, without any genetic predisposition. No links have been found between glioblastoma and smoking, consumption of cured meat, or electromagnetic fields as of yet. Alcohol consumption may be a possible risk factor. Glioblastoma has been associated with the viruses SV40, HHV-6, and cytomegalovirus. There also appears to be a small link between ionizing radiation and glioblastoma. Some also believe that there may be a link between polyvinyl chloride (which is commonly used in construction) and glioblastoma. A 2006 analysis links brain cancer to lead exposure in the workplace. There is an association between brain tumor incidence and malaria, suggesting that the anopheles mosquito, the carrier of malaria, might transmit a virus or other agent that could cause glioblastoma, or that the immune suppression associated with malaria could enhance viral replication. HHV-6 also reactivates in response to hypersensitivity reactions from drugs and environmental chemicals. Other risk factors include:
x Sex: male (slightly more common in men than women)
x Age: over 50 years old
x Ethnicity: Caucasians, Hispanics, and Asians
x Having a low-grade astrocytoma (brain tumor), which often, given enough time, develops into a higher-grade tumor
x Having one of the following genetic disorders, which are associated with an increased incidence of gliomas:
o Neurofibromatosis
o Tuberous sclerosis
o Von Hippel-Lindau disease
o Li-Fraumeni syndrome
o Turcot syndrome
Glioblastoma tumors make their own blood supply, which helps them grow. It is easy for them to invade normal brain tissue. There are two types of GBM, namely:
x Primary glioblastoma is the most common. It starts out as a grade 4 tumor and is very aggressive.
x Secondary glioblastoma starts as a grade 2 or 3 tumor, which grows more slowly. Then it becomes grade 4. About 10% of glioblastomas are this type. They tend to occur in people aged 45 or younger.
The goal of glioblastoma treatment is to slow and control tumor growth and improve the patient's quality of life. There are three standard treatments:
x Surgery is the first treatment. The surgeon tries to remove as much of the tumor as possible. In high-risk areas of the brain, the surgeon may not be able to remove all of a tumor.
x Radiation is used to kill as many leftover tumor cells as possible after surgery. It can also slow the growth of tumors that cannot be removed by surgery.
x Chemotherapy may also be used. Temozolomide is the most common chemotherapy drug used for glioblastoma. Chemotherapy can cause short-term side effects, but it is much less toxic than it used to be.
There are several key factors that influence the overall survival of a Glioblastoma Multiforme patient; these include initial diagnosis, the number of times the tumor has recurred, patient age, molecular diagnostic results, past therapy, Karnofsky score, and tissue features, along with 50-60 other key variables. It is the job of the neuro-oncologist to take all of these factors into account and formulate the best option for the patient and their family.


5.5 Medulloblastoma
Medulloblastoma is a highly malignant primary brain tumor (cancer) that originates in the part of the brain that is towards the back and the bottom, on the floor of the skull, in the cerebellum or posterior fossa. The brain is divided into two main parts, the larger cerebrum on top and the smaller cerebellum below towards the back. They are separated by a membrane called the tentorium. Tumors that originate in the cerebellum or the surrounding region below the tentorium are therefore called infratentorial. Another term for medulloblastoma is infratentorial primitive neuroectodermal tumor (PNET). Medulloblastoma is the most common PNET originating in the brain. All PNET tumors of the brain are invasive and rapidly growing tumors that, unlike most brain tumors, spread through the cerebrospinal fluid (CSF) and frequently metastasize to different locations in the brain and spine. The cumulative relative survival rate for all age groups and histology follow-up was 60%, 52%, and 47% at 5 years, 10 years, and 20 years, respectively, with children doing better than adults. Medulloblastoma is relatively rare, accounting for less than 2% of all primary brain tumors and 18% of all pediatric brain tumors. More than 70% of all pediatric medulloblastomas are diagnosed in children under age 10. Very few occur in children up to age 1. Survival rates in children with medulloblastoma depend on the patient's age and how much the tumor spreads.
x If the disease has not spread, survival rates are around 70 to 80 percent.
x If the disease has spread to the spinal cord, the survival rate is about 60 percent.
x Children younger than age 3 often have lower survival rates because their disease tends to be more aggressive.


Medulloblastoma in children is classified as either standard (average) risk or high risk, depending on the following factors: the child's age, how much of the tumor remains after surgery, and whether the tumor has spread.
x Standard-risk tumor. The tumor is in the very back part of the brain and has not spread to other areas of the brain and spinal cord. Additionally, it is almost completely removed during surgery, meaning that less than 1.5 cubic centimeters (cm³) of the tumor remains after surgery. However, the surgeon will usually prefer to remove the entire tumor if it can be completely removed without increasing the risk of severe side effects.
x High-risk tumor. This type of tumor has either spread to other parts of the brain or the spine, or it has not spread but more than 1.5 cubic cm of tumor remains after surgery. Some tumors that first appear to be standard-risk tumors have high-risk molecular features and are treated as high-risk tumors.
x Recurrent tumor. A recurrent tumor is a tumor that has come back after treatment. It may recur in the brain, spine, spinal fluid or, very rarely, elsewhere in the body. If there is a recurrence, the tumor may need to be staged again (called re-staging) using the system above.
Signs and symptoms are mainly due to secondary increased intracranial pressure due to blockage of the fourth ventricle and are usually present for 1 to 5 months before diagnosis is made. The child typically becomes listless, with repeated episodes of vomiting, and a morning headache, which may lead to a misdiagnosis of gastrointestinal disease or migraine. Soon after, the child will develop a stumbling gait, frequent falls, diplopia, papilledema, and sixth cranial nerve palsy. Positional dizziness and nystagmus are also frequent, and facial sensory loss or motor weakness may be present. Decerebrate attacks appear late in the disease. Like many tumor types, the exact cause of medulloblastoma is not known. However, scientists are making significant strides in understanding its biology. Changes have been identified in genes and chromosomes (the cell's DNA blueprints) that may play a role in the

development of this tumor. There are also a few rare, genetic health syndromes that are associated with increased risk for developing this tumor. Medulloblastoma is always located in the cerebellum—the lower, rear portion of the brain. It is unusual for medulloblastomas to spread outside the brain and spinal cord.

Figure 5.5.1 Location of the cerebellum

Medulloblastoma is most effectively treated with a combination of therapies that include surgery, radiation treatment, and chemotherapy. Complete surgical removal of the tumor is important and is usually the first step in treatment. This is usually followed by radiation treatment to the entire brain and spine in older patients, followed by several months of chemotherapy. Standard treatment for very young children (often defined as children less than 3 years old) includes surgical removal and chemotherapy. The use of radiation in this young age group is controversial, but some clinicians are increasingly using radiation restricted to the area of the tumor for these young patients. Because these young patients cannot tolerate whole brain and spine radiation, increasingly intensive doses of chemotherapy are being tested, including the use of high-dose chemotherapy with autologous stem cell rescue (a form of bone marrow transplant).

Description: Medulloblastoma is a fast-growing, high-grade tumor. It is the most common of the embryonal tumors, which arise from "embryonal" or "immature" cells at the earliest stage of their development.
Symptoms: The most common symptoms of medulloblastoma include behavioral changes, changes in appetite, and symptoms of increased pressure on the brain (e.g., headache, nausea, vomiting, and drowsiness), as well as problems with coordination. Unusual eye movements may also occur.
Incidence: Medulloblastoma is relatively rare, accounting for less than 2% of all primary brain tumors and 18% of all pediatric brain tumors. More than 70% of all pediatric medulloblastomas are diagnosed in children under age 10. Very few occur in children up to age 1.

5.6 Oligodendroglioma
Oligodendrogliomas are a type of glioma that are believed to originate from the oligodendrocytes of the brain or from a glial precursor cell. They occur primarily in adults (9.4% of all primary brain and central nervous system tumors) but are also found in children (4% of all primary brain tumors). The average age at diagnosis is 35 years. In anywhere from fifty to eighty percent of cases, the first symptom of an oligodendroglioma is the onset of seizure activity. They occur mainly in the frontal lobe. Headaches combined with increased intracranial pressure are also a common symptom of oligodendroglioma. Depending on the location of the tumor, any neurological deficit can be induced, ranging from visual loss to motor weakness and cognitive decline. A Computed Tomography (CT) or Magnetic Resonance Imaging (MRI) scan is necessary to characterize the anatomy of this tumor (size, location, heterogeneity/homogeneity). However, final diagnosis of this tumor, like most tumors, relies on histopathologic examination (biopsy examination).


Figure 5.6.1: Oligodendrocytes are one of the types of cells that make up the supportive, or glial, tissue of the brain

Oligodendrogliomas are generally felt to be incurable using current treatments. However, compared to the more common astrocytomas, they are slow growing, with prolonged survival. In one series, median survival times for oligodendrogliomas were 11.6 years for grade II and 3.5 years for grade III. However, such figures can be misleading since they do not factor in the types of treatment or the genetic signature of the tumors. A recent study analyzed survival based on chromosomal deletions and the effects of radiation or chemotherapy as treatment, with the following results (both low-grade and anaplastic oligodendrogliomas): 1p/19q deletion with radiation = 121 months (mean), 1p/19q deletion with chemotherapy = over 160 months (mean not yet reached), no 1p/19q deletion with radiation = 58 months (mean), and no 1p/19q deletion with chemotherapy = 75 months (mean). Another study divided anaplastic oligodendrogliomas into the following four clinically relevant groups of histology, with the following results: combined 1p/19q loss = median survival was >123 months (not yet reached), 1p loss only = median survival was 71 months, 1p intact with TP53 mutation = median survival 71 months, and 1p intact with no TP53 mutation = median survival was 16 months. Because of the indolent nature of these tumors and the potential morbidity associated with neurosurgery, chemotherapy and radiation therapy, most neuro-oncologists will initially pursue a course of watchful waiting and treat patients symptomatically. Symptomatic treatment often includes the use of anticonvulsants for seizures and steroids for brain swelling. PCV chemotherapy (Procarbazine,

CCNU and Vincristine) has been shown to be effective and was the most commonly used chemotherapy regimen used for treating anaplastic oligodendrogliomas, but is now being superseded by a newer drug: Temozolomide. Temozolomide is a common chemotherapeutic drug to which oligodendrogliomas appear to be quite sensitive. It is often used as a first line therapy, especially because of its relatively mild side effects when compared to other chemotherapeutic drugs. Location: These tumors can be found anywhere within the cerebral hemisphere of the brain, although the frontal and temporal lobes are the most common locations. Description: Oligodendrogliomas are generally soft, grayish-pink tumors. They often contain mineral deposits (called calcifications), areas of hemorrhage, and/or cysts. Under the microscope, these tumor cells appear to have “short arms,” or a fried-egg shape. Symptoms: Because of their generally slow growth, oligodendrogliomas are often present for years before they are diagnosed. The most common symptoms are seizures, headaches, and personality changes. Other symptoms vary by location and size of the tumor. Incidence: About 4% of primary brain tumors are oligodendrogliomas, representing about 10-15% of the gliomas. Only 6% of these tumors are found in infants and children. Most oligodendrogliomas occur in adults ages 50-60, and are found in men more often than women.


Chapter 6
6. Machine Learning Models for Tumor Type Estimation
In the process of correct estimation of brain tumor type, we have seen how the tumor varies in shape and location. Thus, correct estimation without proper guidance is very difficult. Here we use machine learning algorithms, which can learn from data, to give us a correct estimation of the type of tumor. We may define machine learning as a set of algorithms which can learn from data. Such an algorithm operates by building structures based on example input data and using them to make predictions and decisions. Machine learning is related to artificial intelligence and has strong ties with statistics and mathematical optimization. Machine learning tasks are typically classified into three broad categories, depending on the nature of the learning "signal" or "feedback" available to a learning system:
x Supervised learning: The computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs.
x Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end.
x Reinforcement learning: A computer program interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle), without a teacher explicitly telling it whether it has come close to its goal or not. Another example is learning to play a game by playing against an opponent.
In our case we usually use supervised learning to map inputs to outputs. The dataset obtained from feature extraction on the MRI slices is fed to these algorithms to get the desired result. There are many machine learning models, of which we have selected some of the most effective algorithms; these are described in the sections below.
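Before turning to the individual models, the following sketch (not part of the original text) illustrates the supervised-learning workflow described above using scikit-learn. The feature matrix X and label vector y are placeholders standing in for the features extracted from the MRI slices, and the decision tree is used only as a generic example classifier.

```python
# Sketch only: generic supervised workflow for tumor-type estimation.
# X and y below are random placeholders for extracted features and labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.random((200, 24))                 # stand-in feature vectors
y = rng.integers(0, 6, size=200)          # stand-in labels for six tumor types

# Hold out part of the data so evaluation reflects unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = DecisionTreeClassifier(max_depth=5, random_state=0)
model.fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```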

6.1 Support Vector Machine
In machine learning, Support Vector Machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyse data and recognize patterns, used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier. Whereas the original problem may be stated in a finite-dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. For this reason, it was proposed that the original finite-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier in that space. To keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that dot products may be computed easily in terms of the variables in the original space, by defining them in terms of a kernel function $k(x, y)$ selected to suit the problem. The hyperplanes in the higher-dimensional space

are defined as the set of points whose dot product with a vector in that space is constant. The vectors defining the hyperplanes can be chosen to be linear combinations, with parameters $\alpha_i$, of images of feature vectors that occur in the database. With this choice of a hyperplane, the points $x$ in the feature space that are mapped into the hyperplane are defined by the relation $\sum_i \alpha_i k(x_i, x) = \text{constant}$. Note that if $k(x, y)$ becomes small as $y$ grows further away from $x$, each term in the sum measures the degree of closeness of the test point $x$ to the corresponding database point $x_i$. In this way, the sum of kernels above can be used to measure the relative nearness of each test point to the data points originating in one or the other of the sets to be discriminated. Note that the set of points $x$ mapped into any hyperplane can be quite convoluted as a result, allowing much more complex discrimination between sets which are not convex at all in the original space. Classifying data is a common task in machine learning. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In the case of support vector machines, a data point is viewed as a p-dimensional vector (a list of p numbers), and we want to know whether we can separate such points with a (p-1)-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane and the linear classifier it defines is known as a maximum margin classifier, or equivalently, the perceptron of optimal stability. A schematic example to demonstrate the working of SVM is shown in the illustration below. In this example, the objects belong either to class GREEN or RED. The separating line defines a boundary on the right side of which all objects are GREEN and to the left of which all objects are RED. Any new object (white circle)


falling to the right is labelled, i.e., classified, as GREEN (or classified as RED should it fall to the left of the separating line).

Figure 6.1.1: A linear classifier in SVM

The above is a classic example of a linear classifier, i.e., a classifier that separates a set of objects into their respective groups (GREEN and RED in this case) with a line. Most classification tasks, however, are not that simple, and often more complex structures are needed in order to make an optimal separation, i.e., correctly classify new objects (test cases) on the basis of the examples that are available (train cases). This situation is depicted in the illustration below. Compared to the previous schematic, it is clear that a full separation of the GREEN and RED objects would require a curve (which is more complex than a line). Classification tasks based on drawing separating lines to distinguish between objects of different class memberships are known as hyperplane classifiers. Support Vector Machines are particularly suited to handle such tasks.

Figure 6.1.2: A nonlinear classification problem in SVM

The illustration below shows the basic idea behind Support Vector Machines. Here we see the original objects (left side of the schematic) mapped, i.e., rearranged, using a set of mathematical functions known as kernels. The process of rearranging the objects is known as mapping (transformation). Note that in this new setting, the mapped objects (right side of the schematic) are linearly separable and, thus, instead of

constructing the complex curve (left schematic), all we have to do is to find an optimal line that can separate the GREEN and the RED objects.

Figure 6.1.3: Schematic mapping from a nonlinear to a linear hyperplane

In Support Vector classification, the separating function can be expressed as a linear combination of kernels associated with the Support Vectors as

$$f(x) = \sum_{x_j \in S} a_j y_j k(x_j, x) + b \qquad (1)$$

where $x_i$ denotes the training patterns, $y_i \in \{-1, +1\}$ denotes the corresponding class labels and $S$ denotes the set of Support Vectors. The dual formulation yields

$$\min_{0 \le a_i \le C} W = \frac{1}{2} \sum_{ij} a_i Q_{ij} a_j - \sum_i a_i + b \sum_i a_i y_i \qquad (2)$$

where $a_i$ are the corresponding coefficients, $b$ is the offset, $Q_{ij} = y_i y_j K(x_i, x_j)$ is a symmetric positive definite kernel matrix and $C$ is the parameter used to penalize error points in the inseparable case. The Karush-Kuhn-Tucker (KKT) conditions for the dual can be expressed as

$$g_i = \frac{\partial W}{\partial a_i} = \sum_j Q_{ij} a_j + y_i b - 1 = y_i f(x_i) - 1$$

and

$$\frac{\partial W}{\partial b} = \sum_j y_j a_j = 0 \qquad (3)$$

These conditions partition the training set into $S$, the Support Vector set ($0 < \alpha_i < C$, $g_i = 0$); $E$, the error set ($\alpha_i = C$, $g_i < 0$); and $R$, the well-classified set ($\alpha_i = 0$, $g_i > 0$). If the points in error are penalized quadratically with a penalty factor $C'$, then it has been shown that the problem reduces to that of a separable case with $C = \infty$. The kernel function is modified as

$$K'(x_i, x_j) = K(x_i, x_j) + \frac{1}{C'} \delta_{ij} \qquad (4)$$

where $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise. The advantage of this formulation is that the SVM problem reduces to that of a linearly separable case. It can be seen that training the SVM involves solving a quadratic optimization problem which requires the use of optimization routines from numerical libraries. This step is computationally intensive, can be subject to stability problems and is non-trivial to implement. The generalization performance of a learning algorithm is indeed limited by three sources of error:
x The approximation error measures how well the exact solution can be approximated by a function implementable by our learning system.
x The estimation error measures how accurately we can determine the best function implementable by our learning system using a finite training set instead of the unseen testing examples.
x The optimization error measures how closely we compute the function that best satisfies whatever information can be exploited in our finite training set.

SVMs deliver a unique solution, since the optimality problem is convex. This is an advantage compared to Neural Networks, which have multiple solutions associated with local minima and for this reason may not be robust over different samples. With the choice of an appropriate kernel, such as the Gaussian kernel, one can put more stress on the similarity between companies, because the more similar the financial structure of two companies is, the higher the value of the kernel. Thus, when classifying a new company, the values of its financial ratios are compared with those of the support vectors of the training sample that are most similar to this new company. The company is then classified according to the group with which it has the greatest similarity. Such cases illustrate how the SVM can help cope with non-linearity and non-monotonicity. A common disadvantage of non-parametric techniques such as SVMs is the lack of transparency of their results.
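As a hedged sketch rather than the authors' implementation, an RBF (Gaussian) kernel SVM can be trained on the extracted features with scikit-learn; the feature matrix, labels, and the values of C and gamma are illustrative assumptions.

```python
# Sketch only: RBF-kernel SVM with feature scaling and cross-validation.
# X and y are random stand-ins for extracted MRI features and class labels.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.random((200, 24))
y = rng.integers(0, 2, size=200)          # e.g. benign vs. malignant

# Scaling matters for the Gaussian kernel; C and gamma are example values.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
scores = cross_val_score(clf, X, y, cv=5)
print("5-fold CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```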

6.2 K-Nearest Neighbour's algorithm

In pattern recognition, the k-Nearest Neighbours algorithm (or k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression. In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbours, with the object being assigned to the class most common among its k nearest neighbours (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbour. In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbours. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms. Both for classification and regression, it can be useful to weight the contributions of the neighbours, so that the nearer neighbours contribute more to the average than the more distant ones. For example, a common weighting

scheme consists in giving each neighbour a weight of 1/d, where d is the distance to the neighbour. The neighbours are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the classification phase, k is a user-defined constant, and an unlabelled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point. A commonly used distance metric for continuous variables is Euclidean distance. For discrete variables, such as for text classification, another metric can be used, such as the overlap metric (or Hamming distance). In the context of gene expression microarray data, for example, k-NN has also been employed with the correlation coefficient. Often, the classification accuracy of k-NN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbour or Neighbourhood Components Analysis. The implementation of the k-NN algorithm is non-incremental; namely, all training instances are taken and processed at once. An important characteristic of this algorithm is that instances are stored as their projections on each feature dimension. In the training phase, each training instance is stored simply as its projections on each feature dimension. If the value of a training instance is missing for a feature, that instance is not stored on that feature. The training phase for k-NN consists of simply storing all known instances and their class labels. A tabular representation can be used, or a specialized structure such as a kd-tree. If we want to tune the value of 'k' and/or perform feature selection, n-fold cross-validation can be used on the training dataset. The testing phase for a new instance 't', given a known set 'I', is as follows:
1. Compute the distance between 't' and each instance in 'I'.

2. Sort the distances in increasing numerical order and pick the first 'k' elements.
3. Compute and return the most frequent class in the 'k' nearest neighbours, optionally weighting each instance's class by the inverse of its distance to 't'.

The working of k-NN for classification can be demonstrated as follows. We are given some data points for training and also a new unlabelled data point for testing. Our aim is to find the class label for the new point. The algorithm behaves differently depending on k.

Case 1: k = 1, or the Nearest Neighbour Rule. This is the simplest scenario. Let x be the point to be labelled. Find the point closest to x and call it y. The nearest neighbour rule assigns the label of y to x. This seems too simplistic and sometimes even counter-intuitive. If you feel that this procedure could result in a large error, you are right, but there is a catch: this reasoning holds only when the number of data points is not very large. If the number of data points is very large, then there is a very high chance that the labels of x and y are the same. An example might help. Suppose you have a (potentially) biased coin; you toss it one million times and obtain heads 900,000 times. Then your next call will most likely be heads. We can use a similar, informal argument here. Assume all points lie in a D-dimensional space and the number of points is reasonably large, so that the density of points is fairly high; in other words, within any subspace there is an adequate number of points. Consider a point x in this subspace which has many neighbours, and let y be its nearest neighbour. If x and y are sufficiently close, then we can assume that the probability that x and y belong to the same class is fairly high, and by decision theory x and y are assigned the same class.

Case 2: k = K, or the k-Nearest Neighbour Rule.


This is a straightforward extension of 1-NN. Basically, we find the k nearest neighbours and take a majority vote. Typically k is odd when the number of classes is 2. Say k = 5 and there are 3 instances of C1 and 2 instances of C2; in this case, k-NN says that the new point has to be labelled as C1, since it forms the majority. We follow a similar argument when there are multiple classes. One straightforward extension is not to give 1 vote to all the neighbours. A very common approach is weighted k-NN, where each point has a weight which is typically calculated using its distance. For example, under inverse distance weighting, each point has a weight equal to the inverse of its distance to the point to be classified. This means that neighbouring points have a higher vote than farther points. It is quite obvious that the accuracy might increase when you increase k, but the computation cost also increases. The k-NN algorithm, like other instance-based algorithms, is unusual from a classification perspective in its lack of explicit model training. While a training dataset is required, it is used solely to populate a sample of the search space with instances whose class is known. No actual model or learning is performed during this phase; for this reason, these algorithms are also known as lazy learning algorithms. Different distance metrics can be used, depending on the nature of the data. Euclidean distance is typical for continuous variables, but other metrics can be used for categorical data. Specialized metrics are often useful for specific problems, such as text classification. When an instance whose class is unknown is presented for evaluation, the algorithm computes its k closest neighbours, and the class is assigned by voting among those neighbours. To prevent ties, one typically uses an odd choice of k for binary classification. For multiple classes, one can use plurality voting or majority voting. The latter can sometimes result in no class being assigned to an instance, while the former can result in classifications being made with very low support from the neighbourhood. One can also weight each neighbour by an inverse function of its distance to the instance being classified. The main advantages of k-NN for classification are:

• Very simple implementation.
• Robust with regard to the search space; for instance, classes don't have to be linearly separable.
• The classifier can be updated online at very little cost as new instances with known classes are presented.
• Few parameters to tune: the distance metric and k.

The main disadvantages of the algorithm are:
• Expensive testing of each instance, as we need to compute its distance to all known instances. Specialized algorithms and heuristics exist for specific problems and distance functions, which can mitigate this issue. This is problematic for datasets with a large number of attributes. When the number of instances is much larger than the number of attributes, an R-tree or a kd-tree can be used to store instances, allowing for fast exact neighbour identification.
• Sensitivity to noisy or irrelevant attributes, which can result in less meaningful distance values. Scaling and/or feature selection are typically used in combination with k-NN to mitigate this issue.
• Sensitivity to very unbalanced datasets, where most entities belong to one or a few classes, and infrequent classes are therefore often dominated in most neighbourhoods. This can be alleviated through balanced sampling of the more popular classes in the training stage, possibly coupled with ensembles.
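As a concrete illustration of the procedure just described (store the training instances, then classify a query by a distance-weighted majority vote among its k nearest neighbours), the following is a small sketch using scikit-learn; the data and parameter choices are synthetic placeholders.

# k-NN classification sketch (Python / scikit-learn); data are synthetic.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(1)
X_train = np.vstack([rng.normal(0, 1, (30, 3)),   # class 0
                     rng.normal(3, 1, (30, 3))])  # class 1
y_train = np.array([0] * 30 + [1] * 30)

# weights='distance' gives each neighbour a vote proportional to 1/d,
# the inverse-distance weighting mentioned above.
knn = KNeighborsClassifier(n_neighbors=5, weights='distance', metric='euclidean')
knn.fit(X_train, y_train)            # "training" only stores the instances

query = np.array([[2.5, 2.5, 2.5]])  # a hypothetical unlabelled test point
print("predicted class:", knn.predict(query)[0])
print("class probabilities:", knn.predict_proba(query)[0])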


6.3 Decision Trees

Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. It is one of the predictive modelling approaches used in statistics, data mining and machine learning. Tree models where the target variable can take a finite set of values are called classification trees. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data but not decisions; rather, the resulting classification tree can be an input for decision making. This section deals with decision trees in data mining. Decision tree learning is a method commonly used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables; an example is shown in Figure 6.3.2 below. Each interior node corresponds to one of the input variables; there are edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf. A decision tree is a simple representation for classifying examples. Decision tree learning is one of the most successful techniques for supervised classification learning. For this section, assume that all of the features have finite discrete domains, and there is a single target feature called the classification. Each element of the domain of the classification is called a class. A decision tree or a classification tree is a tree in which each internal (non-leaf) node is labelled with an input feature. The arcs coming from a node labelled with a feature are labelled with each of the possible values of the feature. Each leaf of the tree is labelled with a class or a probability distribution over the classes. A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive

manner called recursive partitioning. The recursion is completed when the subset at a node has all the same value of the target variable, or when splitting no longer adds value to the predictions. This process of top-down induction of decision trees (TDIDT) is an example of a greedy algorithm, and it is by far the most common strategy for learning decision trees from data. In data mining, decision trees can also be described as the combination of mathematical and computational techniques to aid the description, categorisation and generalisation of a given set of data. Data comes in records of the form:

(x, Y) = (x_1, x_2, x_3, ..., x_k, Y)        (5)

The dependent variable, Y, is the target variable that we are trying to understand, classify or generalize. The vector x is composed of the input variables x_1, x_2, x_3, etc., that are used for that task. Decision trees used in data mining are of two main types:
• Classification tree analysis is when the predicted outcome is the class to which the data belongs.
• Regression tree analysis is when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient's length of stay in a hospital).
The term Classification and Regression Tree (CART) analysis is an umbrella term used to refer to both of the above procedures. Trees used for regression and trees used for classification have some similarities, but also some differences, such as the procedure used to determine where to split. Some techniques, often called ensemble methods, construct more than one decision tree:


• Bagging decision trees, an early ensemble method, builds multiple decision trees by repeatedly resampling training data with replacement, and voting the trees for a consensus prediction.
• A Random Forest classifier uses a number of decision trees, in order to improve the classification rate.
• Boosted Trees can be used for regression-type and classification-type problems.
• Rotation forest, in which every decision tree is trained by first applying principal component analysis (PCA) on a random subset of the input features.

Decision tree learning is the construction of a decision tree from class-labelled training tuples. A decision tree is a flow-chart-like structure, where each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label. The topmost node in a tree is the root node.

Figure 6.3.1: A schematic representation of a decision tree as a flow-chart-like structure

A decision tree classifier organizes a series of test questions and conditions in a tree structure. The following figure shows an example of a decision tree for predicting whether a person cheats. In the decision tree, the root and internal nodes contain attribute test conditions to separate records that have different characteristics. Each terminal node is assigned a class label, Yes or No.

Figure 6.3.2: An example of a decision tree for predicting whether a person cheats

Once the decision tree has been constructed, classifying a test record is straightforward. Starting from the root node, we apply the test condition to the record and follow the appropriate branch based on the outcome of the test. This leads us either to another internal node, for which a new test condition is applied, or to a leaf node. When we reach a leaf node, the class label associated with it is assigned to the record. As shown in the figure, tracing the path in the decision tree to predict the class label of the test record, the path terminates at a leaf node labelled NO. Building an optimal decision tree is the key problem in decision tree classification. In general, many decision trees can be constructed from a given set of attributes. While some of the trees are more accurate than others, finding the optimal tree is computationally infeasible because of the exponential size of the search space.

The decision tree induction algorithm must provide a method for specifying the test condition for different attribute types, as well as an objective measure for evaluating the goodness of each test condition. First, the specification of an attribute test condition and its corresponding outcomes depends on the attribute type. We can do a two-way split or a multi-way split, and discretize or group attribute values as needed. Binary attributes lead to a two-way split test condition. For nominal attributes which have many values, the test condition can be expressed as a multi-way split on each distinct value, or as a two-way split obtained by grouping the attribute values into two subsets. Similarly, ordinal attributes can also produce binary or multi-way splits as long as the grouping does not violate the order property of the attribute values. For continuous attributes, the test condition can be expressed as a comparison test with two outcomes, or as a range query; alternatively, we can discretize the continuous value into a nominal attribute and then perform a two-way or multi-way split. Since there are many ways to specify the test conditions from the given training set, we need to use a measure to determine the best way to split the records. The goal of the best test condition is to lead to a homogeneous class distribution in the nodes, that is, to maximize the purity of the child nodes after splitting: the larger the degree of purity, the better the class distribution. The decision tree induction algorithm works by recursively selecting the best attribute to split the data and expanding the leaf nodes of the tree until the stopping criterion is met. The choice of the best split test condition is determined by comparing the impurity of the child nodes, and also depends on which impurity measure is used. After building the decision tree, a tree-pruning step can be performed to reduce the size of the decision tree. Decision trees that are too large are susceptible to a phenomenon known as overfitting. Pruning helps by trimming the branches of the initial tree in a way that improves the generalization capability of the decision tree. Various efficient algorithms have been developed to construct a reasonably accurate, albeit suboptimal, decision tree in a reasonable amount of time. These algorithms usually employ a greedy strategy that grows a decision tree by

making a series of locally optimum decisions about which attribute to use for partitioning the data. For example, Hunt's algorithm, ID3, C4.5, CART and SPRINT are greedy decision tree induction algorithms. Some advantages of decision trees are:
• Simple to understand and to interpret. Trees can be visualised.
• Requires little data preparation. Other techniques often require data normalisation, dummy variables to be created and blank values to be removed. Note, however, that many implementations do not support missing values.
• The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
• Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable.
• Able to handle multi-output problems.
• Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily expressed in Boolean logic. By contrast, in a black box model (e.g., an artificial neural network), results may be more difficult to interpret.
• Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
• Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.

The disadvantages of decision trees include:


• Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
• Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
• The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality, and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristics such as the greedy algorithm, where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
• There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
• Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting the decision tree.
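To tie the induction procedure and the overfitting controls above together, here is a brief sketch using scikit-learn's DecisionTreeClassifier; the features, labels and hyper-parameter values are illustrative assumptions only.

# Decision tree induction sketch (Python / scikit-learn); data are synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.RandomState(2)
X = rng.rand(200, 3)                                  # three illustrative features
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)       # a simple target rule

tree = DecisionTreeClassifier(criterion='entropy',    # impurity measure for splits
                              max_depth=3,            # limits tree size
                              min_samples_leaf=5,
                              random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=['f0', 'f1', 'f2']))
print("training accuracy:", tree.score(X, y))

The max_depth and min_samples_leaf settings play the role of the overfitting controls discussed in the list above.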


6.4 Naive Bayes classifier In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayes has been studied extensively since the 1950s. It was introduced under a different name into the text retrieval community in the early 1960s, and remains a popular (baseline) method for text categorization, the problem of judging documents as belonging to one category or the other (such as spam or legitimate, sports or politics, etc.) with word frequencies as the features. With appropriate pre-processing, it is competitive in this domain with more advanced methods including support vector machines. It also finds application in automatic medical diagnosis. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by expensive iterative approximation as used for many other types of classifiers. Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 3" in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness and diameter features. For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum


likelihood; in other words, one can work with the naive Bayes model without accepting Bayesian probability or using any Bayesian methods. The Naive Bayes Classifier technique is based on the so-called Bayesian theorem and is particularly suited when the dimensionality of the inputs is high. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods.

Figure 6.4.1: Objects classified as either GREEN or RED dots

To demonstrate the concept of Naive Bayes classification, consider the example displayed in the illustration above. As indicated, the objects can be classified as either GREEN or RED. Our task is to classify new cases as they arrive, i.e., decide to which class label they belong, based on the currently existing objects. Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which hasn't been observed yet) is twice as likely to have membership GREEN rather than RED. In the Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in this case the percentage of GREEN and RED objects, and are often used to predict outcomes before they actually happen. Thus, we can write:

Prior probability for GREEN ≈ Number of GREEN objects / Total number of objects
Prior probability for RED ≈ Number of RED objects / Total number of objects        (7)

Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities for class membership are:

Prior probability for GREEN ≈ 40/60
Prior probability for RED ≈ 20/60

Figure 6.4.2: Classification of a new object (WHITE circle)

Having formulated our prior probability, we are now ready to classify a new object (the WHITE circle). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or RED) objects in the vicinity of X, the more likely that the new case belongs to that particular colour. To measure this likelihood, we draw a circle around X which encompasses a number (to be chosen a priori) of points irrespective of their class labels. Then we calculate the number of points in the circle belonging to each class label. From this we calculate the likelihoods:

Likelihood of X given GREEN ≈ Number of GREEN in the vicinity of X / Total number of GREEN cases
Likelihood of X given RED ≈ Number of RED in the vicinity of X / Total number of RED cases        (8)

From the illustration above, it is clear that the likelihood of X given GREEN is smaller than the likelihood of X given RED, since the circle encompasses 1 GREEN object and 3 RED ones. Thus:

Probability of X given GREEN ≈ 1/40
Probability of X given RED ≈ 3/20        (11)

Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as many GREEN compared to RED), the likelihood indicates otherwise: the class membership of X is RED (given that there are more RED objects in the vicinity of X than GREEN). In the Bayesian analysis, the final classification is produced by combining both sources of information, i.e., the prior and the likelihood, to form a posterior probability using the so-called Bayes' rule (named after Rev. Thomas Bayes, 1702-1761).

Posterior probability of X being GREEN ≈ Prior probability of GREEN × Likelihood of X given GREEN = (4/6) × (1/40) = 1/60

Posterior probability of X being RED ≈ Prior probability of RED × Likelihood of X given RED = (2/6) × (3/20) = 1/20        (12)

Finally, we classify X as RED since its class membership achieves the largest posterior probability. The above provided an intuitive example for understanding classification using Naive Bayes; further details of the technical issues involved are given here. Naive Bayes classifiers can handle an arbitrary number of independent variables, whether continuous or categorical. Given a set of variables X = {x_1, x_2, ..., x_d}, we want to construct the posterior probability for the event C_j among a set of possible outcomes C = {c_1, c_2, ..., c_d}. In more familiar language, X is the set of predictors and C is the set of categorical levels present in the dependent variable. Using Bayes' rule:

p(C_j | x_1, x_2, ..., x_d) ∝ p(x_1, x_2, ..., x_d | C_j) p(C_j)

(12)

where p(C_j | x_1, x_2, ..., x_d) is the posterior probability of class membership, i.e., the probability that X belongs to C_j. Since Naive Bayes assumes that the conditional probabilities of the independent variables are statistically independent, we can decompose the likelihood into a product of terms:

p(X | C_j) ∝ ∏_{k=1}^{d} p(x_k | C_j)

and rewrite the posterior as:

p(C_j | X) ∝ p(C_j) ∏_{k=1}^{d} p(x_k | C_j)        (13)

Using Bayes' rule above, we label a new case X with the class level C_j that achieves the highest posterior probability. Although the assumption that the predictor (independent) variables are independent is not always accurate, it does simplify the classification task dramatically, since it allows the class conditional densities p(x_k | C_j) to be calculated separately for each variable, i.e., it reduces a multidimensional task to a number of one-dimensional ones. In effect, Naive Bayes reduces a high-dimensional density estimation task to one-dimensional kernel density estimation. Furthermore, the assumption does not seem to greatly affect the posterior probabilities, especially in regions near decision boundaries, thus leaving the classification task unaffected.
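The GREEN/RED example above can be checked with a few lines of arithmetic. The sketch below simply multiplies the priors by the likelihoods, using the counts quoted in the text (40 GREEN and 20 RED objects, with 1 GREEN and 3 RED objects inside the circle around X).

# Numeric check of the Naive Bayes example above (Python).
total_green, total_red = 40, 20
total = total_green + total_red

prior_green = total_green / total            # 40/60
prior_red = total_red / total                # 20/60
likelihood_green = 1 / total_green           # 1/40
likelihood_red = 3 / total_red               # 3/20

posterior_green = prior_green * likelihood_green   # = 1/60
posterior_red = prior_red * likelihood_red         # = 1/20

print("posterior (GREEN):", posterior_green)
print("posterior (RED):  ", posterior_red)
print("X classified as:", "RED" if posterior_red > posterior_green else "GREEN")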


6.5 Logistic Regression

In statistics, logistic regression, or logit regression, or the logit model, is a type of probabilistic statistical classification model. It is also used to predict a binary response from a binary predictor, and is used for predicting the outcome of a categorical dependent variable (i.e., a class label) based on one or more predictor variables (features). That is, it is used in estimating the parameters of a qualitative response model. The probabilities describing the possible outcomes of a single trial are modelled, as a function of the explanatory (predictor) variables, using a logistic function. Frequently (and hereafter in this section) "logistic regression" is used to refer specifically to the problem in which the dependent variable is binary, that is, the number of available categories is two, while problems with more than two categories are referred to as multinomial logistic regression or, if the multiple categories are ordered, as ordered logistic regression. Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables, which are usually (but not necessarily) continuous, by using probability scores as the predicted values of the dependent variable. Thus, it treats the same set of problems as probit regression using similar techniques; the first assumes a logistic function and the second a standard normal distribution function. Logistic regression can be seen as a special case of the generalized linear model and is thus analogous to linear regression. The model of logistic regression, however, is based on quite different assumptions (about the relationship between dependent and independent variables) from those of linear regression. In particular, the key differences between these two models can be seen in the following two features of logistic regression. First, the conditional mean p(y|x) follows a Bernoulli distribution rather than a Gaussian distribution, because logistic regression is a classifier. Second, the linear combination of the inputs, w^T x ∈ R, is restricted to [0, 1] through the logistic distribution function, because logistic regression predicts the probability of the instance being positive.

Logistic regression can be binomial or multinomial. Binomial or binary logistic regression deals with situations in which the observed outcome for a dependent variable can have only two possible types (for example, "dead" vs. "alive"). Multinomial logistic regression deals with situations where the outcome can have three or more possible types (e.g., "disease A" vs. "disease B" vs. "disease C"). In binary logistic regression, the outcome is usually coded as "0" or "1", as this leads to the most straightforward interpretation. If a particular observed outcome for the dependent variable is the noteworthy possible outcome (referred to as a "success" or a "case") it is usually coded as "1" and the contrary outcome (referred to as a "failure" or a "non-case") as "0". Logistic regression is used to predict the odds of being a case based on the values of the independent variables (predictors). The odds are defined as the probability that a particular outcome is a case divided by the probability that it is a non-case. Like other forms of regression analysis, logistic regression makes use of one or more predictor variables that may be either continuous or categorical. Unlike ordinary linear regression, however, logistic regression is used for predicting binary outcomes of the dependent variable (treating the dependent variable as the outcome of a Bernoulli trial) rather than a continuous outcome. Given this difference, logistic regression takes the natural logarithm of the odds of the dependent variable being a case (referred to as the logit or log-odds) to create a continuous criterion as a transformed version of the dependent variable. Thus the logit transformation is referred to as the link function in logistic regression; although the dependent variable in logistic regression is binomial, the logit is the continuous criterion upon which linear regression is conducted. The logit of success is then fitted to the predictors using linear regression analysis. The predicted value of the logit is converted back into predicted odds via the inverse of the natural logarithm, namely the exponential function. Thus, although the observed dependent variable in logistic regression is a zero-or-one variable, the logistic regression estimates the odds, as a continuous variable, that the dependent

variable is a success (a case). In some applications the odds are all that is needed. In others, a specific yes-or-no prediction is needed for whether the dependent variable is or is not a case; this categorical prediction can be based on the computed odds of a success, with predicted odds above some chosen cutoff value being translated into a prediction of a success. The following table shows the relationship, for 64 infants, between X, the gestational age of the infant (in weeks) at the time of birth [column (i)], and Y, whether the infant was breast feeding at the time of release from hospital ["no" coded as "0" and entered in column (ii); "yes" coded as "1" and entered in column (iii)]. Also shown in the table are (v) the observed probability of Y=1 for each level of X, calculated as the ratio of the number of instances of Y=1 to the total number of instances of Y for that level; (vi) the odds for each level of X, calculated as the ratio of the number of Y=1 entries to the number of Y=0 entries for each level or, alternatively, as the observed probability divided by one minus that probability; and (vii) the natural logarithm of the odds for each level of X, designated as "log odds."

Table 6.5.1: Relationship, for 64 infants, between X and Y

(i) X   (ii) Y=0   (iii) Y=1   (iv) Total (ii+iii)   (v) Observed Probability   (vi) Odds   (vii) Log Odds
28      4          2           6                     .3333                      .5000       -.6931
29      3          2           5                     .4000                      .6667       -.4055
30      2          7           9                     .7778                      3.5000      1.2528
31      2          7           9                     .7778                      3.5000      1.2528
32      4          16          20                    .8000                      4.0000      1.3863
33      1          14          15                    .9333                      14.0000     2.6391

Graph A, below, shows the linear regression of the observed probabilities, Y, on the independent variable X. The problem with ordinary linear regression in a situation of this sort is evident at a glance: extend the regression line a few units upward or downward along the X axis and you will end up with predicted probabilities that fall outside the legitimate and meaningful range of 0.0 to 1.0, inclusive. Logistic regression, as shown in Graph B, fits the relationship between X and Y with a special S-shaped curve that is mathematically constrained to remain within the range of 0.0 to 1.0 on the Y axis.

Figure 6.5.1: (A) Ordinary linear regression and (B) logistic regression of the observed probabilities, Y, on the independent variable X

The mechanics of the process begin with the log odds, which will be equal to 0.0 when the probability in question is equal to .50, smaller than 0.0 when the probability is less than .50, and greater than 0.0 when the probability is greater

than .50. The form of logistic regression considered here involves a simple weighted linear regression of the observed log odds on the independent variable X. As shown below in Figure 6.5.2, this regression for the example at hand finds an intercept of -17.2086 and a slope of .5934.

C. Weighted Linear Regression of Observed Log Odds on X

X    Observed Probability    Log Odds    Weight
28   .3333                   -.6931      6
29   .4000                   -.4055      5
30   .7778                   1.2528      9
31   .7778                   1.2528      9
32   .8000                   1.3863      20
33   .9333                   2.6391      15

For each level of X, the weighting factor is the number of observations for that level. The intercept, -17.2086, is the point on the Y-axis (log odds) crossed by the regression line when X=0; within the context of logistic regression it is usually referred to as the "constant." The slope, .5934, is the rate at which the predicted log odds increases (or, in some cases, decreases) with each successive unit of X. The exponent of the slope, exp(.5934) = 1.81, describes the proportionate rate at which the predicted odds change with each successive unit of X. In the present example, the predicted odds for X=29 are 1.81 times as large as those for X=28; those for X=30 are 1.81 times as large as those for X=29; and so on.

Figure 6.5.2: Logistic regression via a simple weighted linear regression of the observed log odds

Once this initial linear regression is obtained, the predicted log odds for any particular value of X can then be translated back into a predicted probability value. Thus, for X=31 in the present example, the predicted log odds would be

log[odds] = -17.2086 + (.5934 × 31) = 1.1868

(14)

The corresponding predicted odds would be odds = exp(log[odds]) = exp(1.1868)=3.2766

(15)

And the corresponding predicted probability would be probability = odds/(1+odds)=3.2766/(1+3.2766) = .7662

(16)

Perform this translation throughout the range of X values and you go from the straight line of the graph on the left to the S-shaped curve of the logistic regression on the right.
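The translation from predicted log odds back to a probability, illustrated above for X=31, is easy to verify in code. The sketch below uses the intercept (-17.2086) and slope (.5934) quoted in the text for the gestational-age example.

# Log odds -> odds -> probability, as in equations (14)-(16) (Python).
import math

intercept, slope = -17.2086, 0.5934

def predicted_probability(x):
    log_odds = intercept + slope * x     # equation (14)
    odds = math.exp(log_odds)            # equation (15)
    return odds / (1.0 + odds)           # equation (16)

for x in range(28, 34):
    print(x, round(predicted_probability(x), 4))
# For x = 31 this prints approximately .7662, matching the worked example.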

Figure 6.5.3: Plain-vanilla empirical regression

Please note, however, that the logistic regression described here is based on a simple, plain-vanilla empirical regression. You will typically find logistic regression procedures framed in terms of an abstraction known as the maximized log likelihood function. For two reasons, this section does not follow that procedure. The first reason, which can be counted as either a high-minded philosophical reservation or a low-minded personal quirk, is that the maximized log likelihood method has always impressed me as an exercise in excessive fine-tuning, reminiscent on some occasions of what Alfred North Whitehead identified as the fallacy of misplaced concreteness, and on others of what Freud described as the narcissism of small differences. The second reason is that in most real-world cases there is little if any

practical difference between the results of the two methods. The blue line in the adjacent graph is the same empirical regression line described above; the red line shows the regression resulting from the method of maximized log likelihood. I find it difficult to suppose that the fine-tuned abstraction of the latter is saying anything very different from what is being said by the former.

Figure 6.5.4: Empirical regression

6.6 Linear Regression

In statistics, linear regression is an approach for modelling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression. (This term should be distinguished from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.) In linear regression, data are modelled using linear predictor functions, and unknown model parameters are estimated from the data. Such models are called linear models. Most commonly, linear regression refers to a model in which the conditional mean of y given the value of X is an affine function of X. Less commonly, linear regression could refer to a model in which the median, or some other quantile, of the conditional distribution of y given X is expressed as a linear

function of X. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of y given X, rather than on the joint probability distribution of y and X, which is the domain of multivariate analysis. Linear regression was the first type of regression analysis to be studied rigorously, and to be used extensively in practical applications. This is because models which depend linearly on their unknown parameters are easier to fit than models which are non-linearly related to their parameters, and because the statistical properties of the resulting estimators are easier to determine. Linear regression has many practical uses. Most applications fall into one of the following two broad categories:
• If the goal is prediction, forecasting, or reduction, linear regression can be used to fit a predictive model to an observed data set of y and X values. After developing such a model, if an additional value of X is then given without its accompanying value of y, the fitted model can be used to make a prediction of the value of y.
• Given a variable y and a number of variables X_1, ..., X_p that may be related to y, linear regression analysis can be applied to quantify the strength of the relationship between y and the X_j, to assess which X_j may have no relationship with y at all, and to identify which subsets of the X_j contain redundant information about y.

Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the "lack of fit" in some other norm (as with least absolute deviations regression), or by minimizing a penalized version of the least squares loss function as in ridge regression (L2-norm penalty) and lasso (L1-norm penalty). Conversely, the least squares approach can be used to fit models that are not linear models. Thus, although the terms "least squares" and "linear model" are closely linked, they are not synonymous.


In simple linear regression, we predict scores on one variable from the scores on a second variable. The variable we are predicting is called the criterion variable and is referred to as Y. The variable we are basing our predictions on is called the predictor variable and is referred to as X. When there is only one predictor variable, the prediction method is called simple regression. In simple linear regression, the topic of this section, the predictions of Y when plotted as a function of X form a straight line. The example data in Table 6.6.1 are plotted in Figure 6.6.1. You can see that there is a positive relationship between X and Y. If you were going to predict Y from X, the higher the value of X, the higher your prediction of Y.

Table 6.6.1: Example data

X       Y
1.00    1.00
2.00    2.00
3.00    1.30
4.00    3.75
5.00    2.25

Figure 6.6.1: A scatter plot of the example data


Linear regression consists of finding the best-fitting straight line through the points. The best-fitting line is called a regression line. The black diagonal line in Figure 6.6.2 is the regression line and consists of the predicted score on Y for each possible value of X. The vertical lines from the points to the regression line represent the errors of prediction. As you can see, the red point is very near the regression line; its error of prediction is small. By contrast, the yellow point is much higher than the regression line and therefore its error of prediction is large.

Figure 6.6.2: A scatter plot of the example data; the black line consists of the predictions, the points are the actual data, and the vertical lines between the points and the black line represent errors of prediction.

The error of prediction for a point is the value of the point minus the predicted value (the value on the line). Table 6.6.2 shows the predicted values (Y') and the errors of prediction (Y-Y'). For example, the first point has a Y of 1.00 and a predicted Y (called Y') of 1.21. Therefore, its error of prediction is -0.21.


Table 6.6.2: Example data

X       Y       Y'      Y-Y'     (Y-Y')²
1.00    1.00    1.210   -0.210   0.044
2.00    2.00    1.635   0.365    0.133
3.00    1.30    2.060   -0.760   0.578
4.00    3.75    2.485   1.265    1.600
5.00    2.25    2.910   -0.660   0.436

You may have noticed that we did not specify what is meant by "best-fitting line." By far, the most commonly used criterion for the best-fitting line is the line that minimizes the sum of the squared errors of prediction. That is the criterion that was used to find the line in Figure 6.6.2. The last column in Table 6.6.2 shows the squared errors of prediction. The sum of the squared errors of prediction shown in Table 6.6.2 is lower than it would be for any other regression line. The formula for a regression line is

Y' = bX + A        (15)

where Y' is the predicted score, b is the slope of the line, and A is the Y intercept. The equation for the line in Figure 6.6.2 is

Y' = 0.425X + 0.785

For X = 1, Y' = (0.425)(1) + 0.785 = 1.21.
For X = 2, Y' = (0.425)(2) + 0.785 = 1.64.

From this example we get a brief idea of how linear regression works. In statistics and numerical analysis, the problem of numerical methods for linear least squares is an important one because linear regression models are one of the most

important types of model, both as formal statistical models and for exploration of data sets. The majority of statistical computer packages contain facilities for regression analysis that make use of linear least squares computations. Hence it is appropriate that considerable effort has been devoted to the task of ensuring that these computations are undertaken efficiently and with due regard to numerical precision.
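The slope of 0.425 and intercept of 0.785 used above follow directly from the least squares criterion. The short sketch below recomputes them from the data in Table 6.6.1.

# Least-squares fit of the Table 6.6.1 data (Python / NumPy).
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.0, 2.0, 1.30, 3.75, 2.25])

b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
A = Y.mean() - b * X.mean()
Y_pred = b * X + A

print("slope b =", round(b, 3), " intercept A =", round(A, 3))   # 0.425, 0.785
print("predicted Y':", np.round(Y_pred, 3))
print("sum of squared errors:", round(np.sum((Y - Y_pred) ** 2), 3))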

6.7 Discriminant Function Analysis Discriminant function analysis (DA) is a statistical analysis to predict a categorical dependent variable (called a grouping variable) by one or more continuous or binary independent variables (called predictor variables). The original dichotomous discriminant analysis was developed by Sir Ronald Fisher in 1936. It is different from an ANOVA or MANOVA, which is used to predict one (ANOVA) or multiple (MANOVA) continuous dependent variables by one or more independent categorical variables. Discriminant function analysis is useful in determining whether a set of variables is effective in predicting category membership. Discriminant analysis is used when groups are known a priori (unlike in cluster analysis). Each case must have a score on one or more quantitative predictor measures, and a score on a group measure. In simple terms, discriminant function analysis is classification - the act of distributing things into groups, classes or categories of the same type. Moreover, it is a useful follow-up procedure to a MANOVA instead of doing a series of one-way ANOVAs, for ascertaining how the groups differ on the composite of dependent variables. In this case, a significant F test allows classification based on a linear combination of predictor variables. Terminology can get confusing here, as in MANOVA, the dependent variables are the predictor variables, and the independent variables are the grouping variables. Computationally, discriminant function analysis is very similar to analysis of variance (ANOVA). Let us consider a simple example. Suppose we measure height 98

in a random sample of 50 males and 50 females. Females are, on average, not as tall as males, and this difference will be reflected in the difference in means (for the variable Height). Therefore, the variable height allows us to discriminate between males and females with a better than chance probability: if a person is tall, then he is likely to be a male; if a person is short, then she is likely to be a female. We can generalize this reasoning to groups and variables that are less "trivial." For example, suppose we have two groups of high school graduates: those who choose to attend college after graduation and those who do not. We could have measured students' stated intention to continue on to college one year prior to graduation. If the means for the two groups (those who actually went to college and those who did not) are different, then we can say that intention to attend college as stated one year prior to graduation allows us to discriminate between those who are and are not college bound (and this information may be used by career counsellors to provide the appropriate guidance to the respective students). Discriminant Function Analysis (DA) undertakes the same task as multiple linear regression by predicting an outcome. However, multiple linear regression is limited to cases where the dependent variable on the Y axis is an interval variable, so that the combination of predictors will, through the regression equation, produce estimated mean population numerical Y values for given values of weighted combinations of X values. DA is used when:
• The dependent variable is categorical, with the predictor IVs at interval level such as age, income, attitudes, perceptions, and years of education, although dummy variables can be used as predictors as in multiple regression. Logistic regression IVs can be of any level of measurement.
• There are more than two DV categories, unlike logistic regression, which is limited to a dichotomous dependent variable.


The major underlying assumptions of DA are:
• the observations are a random sample;
• each predictor variable is normally distributed;
• each of the allocations for the dependent categories in the initial classification is correctly classified;
• there must be at least two groups or categories, with each case belonging to only one group, so that the groups are mutually exclusive and collectively exhaustive (all cases can be placed in a group);
• each group or category must be well defined, clearly differentiated from any other group(s) and natural. Putting a median split on an attitude scale is not a natural way to form groups. Partitioning quantitative variables is only justifiable if there are easily identifiable gaps at the points of division, for instance, three groups taking three available levels of amounts of housing loan;
• the groups or categories should be defined before collecting the data;
• the attribute(s) used to separate the groups should discriminate quite clearly between the groups so that group or category overlap is clearly non-existent or minimal;
• group sizes of the dependent variable should not be grossly different and should be at least five times the number of independent variables.
There are several purposes of DA:
• To investigate differences between groups on the basis of the attributes of the cases, indicating which attributes contribute most to group separation. The descriptive technique successively identifies the linear combination of attributes known as canonical discriminant functions (equations) which contribute maximally to group separation.

• Predictive DA addresses the question of how to assign new cases to groups. The DA function uses a person's scores on the predictor variables to predict the category to which the individual belongs.
• To determine the most parsimonious way to distinguish between groups.
• To classify cases into groups. Statistical significance tests using chi square enable you to see how well the function separates the groups.
• To test theory by checking whether cases are classified as predicted.

Discriminant analysis creates an equation which will minimize the possibility of misclassifying cases into their respective groups or categories. The aim of the statistical analysis in DA is to combine (weight) the variable scores in some way so that a single new composite variable, the discriminant score, is produced. One way of thinking about this is in terms of a food recipe, where changing the proportions (weights) of the ingredients will change the characteristics of the finished cakes. Hopefully the weighted combinations of ingredients will produce two different types of cake. For example, a graduate admissions committee might divide a set of past graduate students into two groups: students who finished the program in five years or less and those who did not. Discriminant function analysis could be used to predict successful completion of the graduate program based on GRE score and undergraduate grade point average. Examination of the prediction model might provide insights into how each predictor individually and in combination predicted completion or non-completion of a graduate program. Discriminant function analysis is based on modelling the interval variable for each group with a normal curve. The mean of each group is used as an estimate of mu for that group. Sigma for each group can be estimated by using the weighted mean of the within-group variances or by using the standard deviation of that group. In the case of the weighted mean, the variances are weighted by sample size and can be calculated either as the denominator for a nested t-test or as the square root of the Mean Squares Within Groups in an ANOVA, providing identical estimates. When using the

weighted mean of the variances, one must assume that the generating function for each group produces numbers that in the long run have the same variability. In the simple case of dichotomous groups and a single predictor variable, it really does not make a great deal of difference in the complexity of the model if the variability of each group is assumed to be equal or not. This is not true, however, when more groups and more predictor variables are added to the model. For that reason, the assumption of equality of within group variance is almost universal in discriminant function analysis.
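A minimal sketch of the height example above, using scikit-learn's linear discriminant analysis, is shown below; the group means and standard deviations are assumed values chosen purely for illustration.

# Linear discriminant analysis sketch (Python / scikit-learn); data are synthetic.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(3)
height_female = rng.normal(162, 6, 50)    # assumed mean/SD, illustration only
height_male = rng.normal(176, 6, 50)      # assumed mean/SD, illustration only
X = np.concatenate([height_female, height_male]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)         # 0 = female, 1 = male

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print("estimated group means:", lda.means_.ravel())
print("predicted group for a 170 cm person:", lda.predict([[170.0]])[0])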

6.8 Adaboost Classifier Generalization ability, which characterizes how well the result learned from a given training data set can be applied to unseen new data, is the most central concept in machine learning. Researchers have devoted tremendous efforts to the pursuit of techniques that could lead to a learning system with strong generalization ability. One of the most successful paradigms is Ensemble learning. In contrast to ordinary machine learning approaches which try to generate one learner from training data, ensemble methods try to construct a set of base learners and combine them. Base learners are usually generated from training data by a base learning algorithm which can be decision tree, neural network or other kinds of machine learning algorithms. Just like “many hands make light work”, the generalization ability of an ensemble is usually significantly better than that of a single learner. Actually, ensemble methods are appealing mainly because that they are able to boost weak learners which are slightly better than random guess to strong learners which can make very accurate predictions. So, “base learners” are also referred as “weak learners” or “weak classifiers”. AdaBoost, short for "Adaptive Boosting", is a machine learning metaalgorithm formulated by Yoav Freund and Robert Schapire who won the prestigious "Gödel Prize" in 2003 for their work. It can be used in conjunction with many other types of learning algorithms to improve their performance. The output of the other 102

learning algorithms ('weak learners') is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favour of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers. In some problems, however, it can be less susceptible to the overfitting problem than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing (i.e., their error rate is smaller than 0.5 for binary classification), the final model can be proven to converge to a strong learner. While every learning algorithm will tend to suit some problem types better than others, and will typically have many different parameters and configurations to be adjusted before achieving optimal performance on a dataset, AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier. When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree growing algorithm such that later trees tend to focus on harder to classify examples. AdaBoost refers to a particular method of training a boosted classifier. A boosted classifier is a classifier of the form

F_T(x) = Σ_{t=1}^{T} f_t(x)

(15)

where each f_t is a weak learner that takes an object x as input and returns a real-valued result indicating the class of the object. The sign of the weak learner output identifies the predicted object class and the absolute value gives the confidence in that classification. Similarly, the T-layer classifier will be positive if the sample is believed to be in the positive class and negative otherwise. Each weak learner produces an output, a hypothesis h(x_i), for each sample in the training set. At each iteration t, a weak learner is selected and assigned a coefficient

ߙ௧ such that the sum training error ‫ܧ‬௧ of the resulting t-stage boost classifier is minimized. ‫ܧ‬௧ = ෍ ‫ܨ[ܧ‬௧ିଵ (‫ݔ‬௜ ) + ߙ௧ ݄(‫ݔ‬௜ )] ௜

(16) Here ‫ܨ‬௧ିଵ (‫ )ݔ‬is the boosted classifier that has been built up to the previous stage of training, ‫ )ܨ(ܧ‬is some error function and ݂௧ (‫ߙ = )ݔ‬௧ ݄( ‫ )ݔ‬is the weak learner that is being considered for addition to the final classifier. At each iteration of the training process, a weight is assigned to each sample in the training set equal to the current error ‫ܨ(ܧ‬௧ିଵ( ‫ݔ‬௜ )) on that sample. These weights can be used to inform the training of the weak learner, for instance, decision trees can be grown that favours splitting sets of samples with high weights. Some of the properties of adaboost classifier are listed below: x Weighted error of each new component classifier tends to increase as a function

of

boosting

iterations.



1 ෩ (௞ିଵ) ‫ݕ‬௜ ݄(‫ݔ‬௜ ) ߳௞ = 0.5 െ ( ෍ ܹ ௜ 2 ௜ୀଵ

(17) x The training classification error has to go down exponentially fast if the weighted errors of the component classifiers, ߳௞ , are strictly better than chance ߳௞ < 0.5. x The boosting iterations also decrease the classification error of the combined classifier over the training examples. x After each boosting iteration, assuming we can find a component classifier whose weighted error is better than chance, the combined classifier is guaranteed to have a lower exponential loss over the training examples.
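The training loop defined by Equations (15)-(17) can be sketched compactly in code. The following is a minimal, illustrative Python sketch of discrete AdaBoost with decision stumps as weak learners; the exponential-loss weight update shown here is an assumption of this sketch, not the exact procedure used elsewhere in this book, and the feature matrix and labels are placeholders for features extracted from MRI slices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, n_rounds=50):
    """Minimal discrete AdaBoost sketch with decision stumps.

    X : (n_samples, n_features) feature matrix (e.g. texture features
        extracted from MRI slices); y : labels in {-1, +1}.
    Returns the list of weak learners and their coefficients alpha_t.
    """
    n = X.shape[0]
    w = np.full(n, 1.0 / n)            # uniform initial sample weights W_i
    learners, alphas = [], []
    for t in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.sum(w * (pred != y))  # weighted error of this weak learner
        if eps >= 0.5:                 # no better than chance: stop boosting
            break
        alpha = 0.5 * np.log((1.0 - eps) / max(eps, 1e-12))
        w *= np.exp(-alpha * y * pred) # re-weight: misclassified samples gain weight
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    """Sign of the weighted sum F_T(x) = sum_t alpha_t h_t(x), Equation (15)."""
    F = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(F)
```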


AdaBoost is most commonly applied to problems of moderate dimensionality, and early stopping is used as a strategy to reduce overfitting. A validation set of samples is separated from the training set, the performance of the classifier on the samples used for training is compared to its performance on the validation samples, and training is terminated if performance on the validation samples is seen to decrease even as performance on the training set continues to improve. For steepest-descent versions of AdaBoost, where $\alpha_t$ is chosen at each layer $t$ to minimize test error, the next layer added is said to be maximally independent of layer $t$: it is unlikely that a weak learner $t+1$ will be chosen that is similar to learner $t$. However, there remains the possibility that layer $t+1$ produces information similar to some other earlier layer. Totally corrective algorithms, such as LPBoost, optimize the value of every coefficient after each step, such that new layers added are always maximally independent of every previous layer. This can be accomplished by backfitting, linear programming or some other method. Practical advantages of AdaBoost are listed below:
- It is fast.
- It is simple and easy to program.
- It has no parameters to tune (except the number of rounds T).
- It is flexible: it can be combined with any learning algorithm.
- No prior knowledge is needed about the weak learner.
- It is provably effective, provided one can consistently find rough rules of thumb; this represents a shift in mindset, since the goal now is merely to find classifiers barely better than random guessing.
- It is versatile: it can be used with textual, numeric, discrete and other kinds of data, and has been extended to learning problems well beyond binary classification.
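In practice one would normally rely on a library implementation rather than a hand-written loop. The following is a hedged usage example with scikit-learn's AdaBoostClassifier; the synthetic data generated by make_classification merely stands in for the texture-feature matrix and tumour labels described in earlier chapters, and the hyperparameter values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the MRI texture-feature matrix and tumour labels
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# By default scikit-learn boosts depth-1 decision trees (decision stumps)
clf = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```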

6.9 Multilayer perceptron
A multilayer perceptron (MLP) is a feedforward artificial neural network model that maps sets of input data onto a set of appropriate outputs. An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. Except for the input nodes, each node is a neuron (or processing element) with a nonlinear activation function. An MLP utilizes a supervised learning technique called backpropagation for training the network. The MLP is a modification of the standard linear perceptron and can distinguish data that are not linearly separable. If a multilayer perceptron has a linear activation function in all neurons, that is, a linear function that maps the weighted inputs to the output of each neuron, then it is easily proved with linear algebra that any number of layers can be reduced to the standard two-layer input-output model (see perceptron). What makes a multilayer perceptron different is that some neurons use a nonlinear activation function, which was developed to model the frequency of action potentials, or firing, of biological neurons in the brain. This function is modeled in several ways. The two main activation functions used in current applications are both sigmoids, and are described by

$y(v_i) = \tanh(v_i)$  and  $y(v_i) = (1 + e^{-v_i})^{-1}$    (18)

in which the former function is a hyperbolic tangent, which ranges from -1 to 1, and the latter, the logistic function, is similar in shape but ranges from 0 to 1. Here $y_i$ is the output of the $i$th node (neuron) and $v_i$ is the weighted sum of the input synapses. Alternative activation functions have been proposed, including the rectifier and softplus functions. More specialized activation functions include radial basis functions, which are used in another class of supervised neural network models.


The perceptron computes a single output from multiple real-valued inputs by forming a linear combination according to its input weights and then possibly putting the output through some nonlinear activation function. Mathematically this can be written as

$y = \varphi\left(\sum_{i=1}^{n} w_i x_i + b\right) = \varphi(w^{T} x + b)$    (19)

where $w$ denotes the vector of weights, $x$ is the vector of inputs, $b$ is the bias and $\varphi$ is the activation function. A signal-flow graph of this operation is shown in Figure 6.8.1. In multilayer networks, the activation function is often chosen to be the logistic sigmoid $1/(1 + e^{-x})$ or the hyperbolic tangent $\tanh(x)$. They are related by $(\tanh(x) + 1)/2 = 1/(1 + e^{-2x})$. These functions are used because they are mathematically convenient, and because they are close to linear near the origin while saturating rather quickly when moving away from it. This allows MLP networks to model both strongly and mildly nonlinear mappings well.
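As a concrete illustration of Equation (19), a single perceptron unit can be written in a few lines of NumPy. This is only a sketch; the weight vector, bias and input values below are hypothetical and chosen purely for demonstration.

```python
import numpy as np

def logistic(v):
    """Logistic sigmoid, the second activation in Equation (18)."""
    return 1.0 / (1.0 + np.exp(-v))

def perceptron(x, w, b, activation=np.tanh):
    """Single perceptron unit: y = phi(w^T x + b), Equation (19)."""
    return activation(np.dot(w, x) + b)

# Hypothetical example values
x = np.array([0.2, -1.3, 0.7])   # input feature vector
w = np.array([0.5, 0.1, -0.4])   # input weights
b = 0.1                          # bias

print(perceptron(x, w, b))                       # tanh activation, output in (-1, 1)
print(perceptron(x, w, b, activation=logistic))  # logistic activation, output in (0, 1)
```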

Figure 6.8.1: Signal-flow graph of the perceptron

A single perceptron is not very useful because of its limited mapping ability. No matter what activation function is used, the perceptron is only able to represent an oriented ridge-like function. Perceptrons can, however, be used as building blocks of a larger, much more practical structure. A typical multilayer perceptron (MLP) network consists of a set of source nodes forming the input layer, one or more hidden layers of computation nodes, and an output layer of nodes. The input signal propagates through the network layer by layer. The signal-flow of such a network with one hidden layer is shown in Figure 6.8.2. The computations performed by such a feedforward network with a single hidden layer, nonlinear activation functions and a linear output layer can be written mathematically as

$x = f(s) = B\varphi(As + a) + b$    (20)

where $s$ is the vector of inputs and $x$ the vector of outputs, $A$ is the weight matrix of the first layer and $a$ is the bias vector of the first layer. $B$ and $b$ are, respectively, the weight matrix and the bias vector of the second layer. The function $\varphi$ denotes an elementwise nonlinearity. The generalisation of the model to more hidden layers is obvious.
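The mapping in Equation (20) translates directly into code. The sketch below computes the forward pass of a one-hidden-layer network with a tanh hidden nonlinearity and a linear output layer; the layer sizes and randomly initialised weights are arbitrary and serve only to illustrate the computation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer sizes: 10 inputs, 5 hidden units, 3 outputs
A = rng.normal(size=(5, 10))   # first-layer weight matrix
a = np.zeros(5)                # first-layer bias vector
B = rng.normal(size=(3, 5))    # second-layer weight matrix
b = np.zeros(3)                # second-layer bias vector

def mlp_forward(s):
    """Forward pass of Equation (20): x = B * phi(A s + a) + b."""
    hidden = np.tanh(A @ s + a)   # elementwise nonlinearity phi
    return B @ hidden + b         # linear output layer

s = rng.normal(size=10)           # an arbitrary input vector
print(mlp_forward(s))
```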

Figure 6.8.2: Signal-flow graph of an MLP

While single-layer networks composed of parallel perceptrons are rather limited in the kinds of mappings they can represent, the power of an MLP network with only one hidden layer is surprisingly large: networks such as the one in Equation (20) are capable of approximating any continuous function $f: \mathbb{R}^n \to \mathbb{R}^m$ to any given accuracy, provided that sufficiently many hidden units are available.

MLP networks are typically used in supervised learning problems. This means that there is a training set of input-output pairs and the network must learn to model the dependency between them. Training here means adapting all the weights and biases ($A$, $B$, $a$ and $b$ in Equation (20)) to their optimal values for the given pairs $(s(t), x(t))$. The criterion to be optimised is typically the squared reconstruction error $\sum_t \| f(s(t)) - x(t) \|^2$. The supervised learning problem of the MLP can be solved with the backpropagation algorithm. The algorithm consists of two steps. In the forward pass, the predicted outputs corresponding to the given inputs are evaluated from Equation (20). In the backward pass, partial derivatives of the cost function with respect to the different parameters are propagated back through the network. The chain rule of differentiation gives computational rules for the backward pass that are very similar to those of the forward pass. The network weights can then be adapted using any gradient-based optimisation algorithm. The whole process is iterated until the weights have converged.

The MLP network can also be used for unsupervised learning by using the so-called auto-associative structure. This is done by setting the same values for both the inputs and the outputs of the network. The extracted sources emerge from the values of the hidden neurons. This approach is computationally rather intensive: the MLP network has to have at least three hidden layers for any reasonable representation, and training such a network is a time-consuming process.

The multilayer perceptron consists of three or more layers (an input and an output layer with one or more hidden layers) of nonlinearly-activating nodes and is thus considered a deep neural network. Each node in one layer connects with a certain weight $w_{ij}$ to every node in the following layer. Some people do not include the input layer when counting the number of layers, and there is disagreement about whether $w_{ij}$ should be interpreted as the weight from $i$ to $j$ or the other way around. Learning occurs in the perceptron by changing connection weights after each piece of data is processed, based on the amount of error in the output compared to the expected result. This is an example of supervised learning and is carried out through backpropagation, a generalization of the least mean squares algorithm in the linear perceptron. The advantages of using an MLP include:
- Adaptive learning: an ability to learn how to do tasks based on the data given for training or initial experience.
- It is one of the preferred techniques for gesture recognition.
- MLPs/neural networks do not make any assumption regarding the underlying probability density functions or other probabilistic information about the pattern classes under consideration, in contrast to other probability-based models.
- They yield the required decision function directly via training.
- A two-layer backpropagation network with sufficient hidden nodes has been proven to be a universal approximator.
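For completeness, a hedged sketch of supervised MLP training with scikit-learn's MLPClassifier, which implements backpropagation internally, is given below. The synthetic data stands in for the MRI-derived feature vectors, and the single hidden layer of 50 tanh units is an illustrative choice rather than a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for MRI texture features and tumour labels
X, y = make_classification(n_samples=300, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# One hidden layer of 50 tanh units, trained by backpropagation
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50,), activation="tanh",
                  max_iter=1000, random_state=1),
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```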

6.10 Neuro fuzzy classification
In the field of artificial intelligence, neuro-fuzzy refers to combinations of artificial neural networks and fuzzy logic. The neuro-fuzzy approach was proposed by J. S. R. Jang. Neuro-fuzzy hybridization results in a hybrid intelligent system that synergizes these two techniques by combining the human-like reasoning style of fuzzy systems with the learning and connectionist structure of neural networks. Neuro-fuzzy hybridization is widely termed a Fuzzy Neural Network (FNN) or a Neuro-Fuzzy System (NFS) in the literature. A neuro-fuzzy system (the more popular term, used henceforth) incorporates the human-like reasoning style of fuzzy systems through the use of fuzzy sets and a linguistic model consisting of a set of IF-THEN fuzzy rules. The main strength of neuro-fuzzy systems is that they are universal approximators with the ability to solicit interpretable IF-THEN rules.

The fuzzy set theory, as a generalization of classical set theory, is very flexible in handling different aspects of uncertainty or incompleteness in real-life situations. In a fuzzy system the features are associated with a degree of membership to different classes. Both neural networks and fuzzy systems are very adaptable in estimating input-output relationships: neural networks deal with numeric and quantitative data, while fuzzy systems can handle symbolic and qualitative data. Neuro-fuzzy hybridization therefore leads to a hybrid intelligent system, widely known as a neuro-fuzzy system (NFS), that exploits the best qualities of these two approaches efficiently. The hybrid system unites the human-like logical reasoning of fuzzy systems with the learning and connectionist structure of neural networks by means of a fuzzy set theory based approach. There is another neuro-fuzzy classification model which comprises a set of interpretable IF-THEN rules. It considers two conflicting requirements in fuzzy modeling: interpretability versus accuracy. In practice, only one of the two properties usually prevails. Therefore, the rule-based neuro-fuzzy modeling research area is divided into two branches: linguistic fuzzy modeling, which focuses on interpretability, primarily the Mamdani model; and exact fuzzy modeling, which focuses on accuracy, mainly the Sugeno or Takagi-Sugeno-Kang (TSK) model. The rule-based neuro-fuzzy classification approach normally applies the concept of an adaptive neural network. An adaptive network of nodes (processing elements) and directed links (weights) that is functionally equivalent to a Fuzzy Inference System is referred to as an Adaptive Neuro-fuzzy Inference System (ANFIS). It normally employs the Sugeno fuzzy model to produce IF-THEN learning rules. The nodes of an adaptive network are associated with certain parameters which might have an impact on the final output. ANFIS generally utilizes a hybrid learning algorithm, a combination of gradient descent and the least squares method, to adapt the parameters of the adaptive network. To put it simply, ANFIS is the combination of an MLP backpropagation network (MLPBPN) and the Sugeno fuzzy model. A fuzzy rule in the Sugeno model has the following form

IF $x$ is $P$ and $y$ is $Q$ THEN $z = f(x, y)$    (21)

where $P$ and $Q$ are the fuzzy sets in the antecedent part of the given IF-THEN learning rule and $z = f(x, y)$ is a crisp function in the consequent part of the rule. An adaptive network is a multi-layer feed-forward network in which each node performs a particular function (node function) based on incoming signals and a set of parameters pertaining to that node. The type of node function may vary from node to node, and the choice of node function depends on the overall function that the network is designed to carry out. Compared to a common neural network, the connection weights, propagation functions and activation functions of fuzzy neural networks differ considerably. Although there are many different approaches to model a fuzzy neural network, most of them agree on certain characteristics such as the following:
- A neuro-fuzzy system based on an underlying fuzzy system is trained by means of a data-driven learning method derived from neural network theory. This heuristic only takes into account local information to cause local changes in the fundamental fuzzy system.
- It can be represented as a set of fuzzy rules at any time of the learning process, i.e., before, during and after training. Thus the system might be initialized with or without prior knowledge in terms of fuzzy rules.
- The learning procedure is constrained to ensure the semantic properties of the underlying fuzzy system.
- A neuro-fuzzy system approximates an n-dimensional unknown function which is partly represented by training examples. Fuzzy rules can thus be interpreted as vague prototypes of the training data.
- A neuro-fuzzy system is represented as a special three-layer feedforward neural network. The first layer corresponds to the input variables, the second layer symbolizes the fuzzy rules, and the third layer represents the output variables; the fuzzy sets are encoded as (fuzzy) connection weights. Some approaches also use five layers, where the fuzzy sets are encoded in the units of the second and fourth layers respectively; however, these models can be transformed into a three-layer architecture.

To guarantee the characteristics of a fuzzy system, the learning algorithm must enforce the following mandatory constraints:
- Fuzzy sets must stay normal and convex.
- Fuzzy sets must not exchange their relative positions (they must not pass each other).
- Fuzzy sets must always overlap.

Additionally, there exist some optional constraints, such as the following:
- Fuzzy sets must stay symmetric.
- The membership degrees must sum up to 1.

A hybrid fuzzy neural network is also important; ARIC (approximate reasoning-based intelligent control) is presented as a neural network in which a previously defined rule base is tuned by updating the network's predictions. However, it has been observed that conventional classification methods fail to provide high performance with such a large amount of data. Therefore, efficient feature extraction methods are necessary to extract the significant tissue details simultaneously from the large amount of unique information inherent in the different MRI pulse sequences.
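To make the Sugeno rule form of Equation (21) concrete, the following sketch evaluates a tiny zero-order Sugeno system with two rules over a single input feature. The membership functions, rule consequents and the feature values are hypothetical and are meant only to illustrate the firing-strength-weighted defuzzification used in ANFIS-style models.

```python
import numpy as np

def gaussian_mf(x, centre, sigma):
    """Gaussian membership function used as a fuzzy set."""
    return np.exp(-0.5 * ((x - centre) / sigma) ** 2)

def sugeno_two_rules(x):
    """Tiny zero-order Sugeno system:
       Rule 1: IF x is LOW  THEN z = 0.2
       Rule 2: IF x is HIGH THEN z = 0.9
    The output is the firing-strength-weighted average of the consequents."""
    w_low = gaussian_mf(x, centre=0.2, sigma=0.15)   # firing strength of rule 1
    w_high = gaussian_mf(x, centre=0.8, sigma=0.15)  # firing strength of rule 2
    z_low, z_high = 0.2, 0.9                         # crisp rule consequents
    return (w_low * z_low + w_high * z_high) / (w_low + w_high)

# Hypothetical normalized texture-feature values
for x in (0.1, 0.5, 0.9):
    print(f"feature={x:.1f} -> fuzzy output={sugeno_two_rules(x):.3f}")
```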

Chapter 7
7. Summary and Conclusions
In this book we have illustrated all the procedures needed to process and classify brain tumors. With the advance of computational intelligence and machine learning techniques, computer-aided detection is attracting more attention for brain tumor detection, and it has become one of the major research subjects in medical imaging and diagnostic radiology. The purpose is to develop tools for discriminating malignant tumors from benign ones, assisting decision making in clinical diagnosis. Such a system performs this diagnosis in multiple phases. The first phase of diagnosis is texture feature extraction, which consists of first-order and second-order texture feature extraction. These extracted features are then used for classification. The accuracy of classification depends on the correctness of the dataset and the image quality of the input slices taken for determining the type of tumor. The primary objective of this book is to understand the several methods and algorithms that improve the efficiency of brain MR image analysis. To achieve this, the current research work has concentrated on the following goals:
- Identification of the major challenging issues in brain MRI analysis; some of these issues are explained very clearly by closing the gap between the medical and technical disciplines.
- A description of how to design and implement a system that performs simultaneous brain tissue classification and pathological study from multiple MR sequences.

Both supervised and unsupervised classification approaches are found to be widely used in multispectral brain MRI analysis. No operator intervention is required in the case of unsupervised analysis, and the input multispectral cube is automatically segmented into different clusters. Sometimes this blind clustering without prior knowledge fails to produce meaningful segmentations, especially in MRI analysis, where many unknown brain tissue clusters with different tissue characteristics may be present. However, unsupervised methods are very successful in clinical applications of normal brain analysis, since such analysis involves only the known structural characteristics. Abnormalities requiring complex knowledge need some additional information from experts to provide a more accurate segmentation; there we can exploit the advantage of supervised methods to achieve superior performance. The main issue with supervised classification is the inconsistency due to large intra-operator and inter-operator variations in feature measurements. However, high-performance feature analysis techniques can significantly reduce these inconsistencies. Experimental results in Chapter 5 demonstrated the efficiency and accuracy of SVM over FCM in MRI analysis. Compared to other conventional classification methods such as the Gaussian maximum likelihood classifier and neural networks, SVM shows high generalization capability with a relatively small number of training samples. However, the selection of non-linear kernels and optimal parameters highly influences the classification performance. In clinical trials, the quality and accuracy of results from the original sequences for visual classification is very important. The location and shape of the abnormalities, as well as the effect of these abnormalities on the remaining normal brain, have great significance in clinical diagnosis. There are several future directions which might further improve CAD systems for human brain MR images:
- The acquisition of large databases from different institutions, with various image qualities, for clinical evaluation and improvement of CAD systems.
- Improving the classification accuracy by extracting more efficient features and increasing the training data set.
- Utilizing other machine learning techniques and integrating them into a hybrid system, for which there is still much room for additional research; further experiments and evaluation are therefore desirable to establish whether the proposed approaches have generic applications.

Thus, for the computerized system, the overall task can be summarized by the following steps or block diagram.

[Figure 7.1 block diagram: Input Image (CT, MRI, fMRI, PET, X-Ray, SPECT) → Preprocessing (Image Cropping, Image Filtering, Scaling, Gradient Operator, Image Enhancement, Sharpening, Noise Removal, Histogram Equalization, Contrast Stretching, Smoothing) → Features Extraction / Classifier Technique (PCA Analysis, GA-based Analysis, Spectral Analysis) → Image Classification (Unsupervised: Clustering, K-Means, FCM; Supervised: ANN, SVM, Decision Tree) → Image Segmentation]

Figure 7.1: A block diagram of the overall possible process for classification

Finally, we hope that through this study existing computer-aided analysis and evaluation systems for medical images can be improved and that new evaluation criteria can be proposed and applied in future work. Several state-of-the-art Artificial Intelligence (AI) techniques for the automation of biomedical image classification have been investigated. This study gathers representative works that show how AI is applied to the solution of very different problems across different areas of diagnostic analysis. It also identifies the AI methods that are frequently used together to solve particular problems in medicine.


Bibliography

R. C. Gonzalez, R. E. Woods, "Digital Image Processing", 2nd Ed.

Zöllner, F. G., Emblem, K. E., & Schad, L. R. (2012). SVM-based glioma grading: Optimization by feature reduction analysis. Zeitschrift für Medizinische Physik, 22(3), 205-214.

Haykin, S. (2008). Neural networks and learning machines (3rd ed.). New Jersey: Pearson Prentice Hall.

Zacharaki EI, Wang S, Chawla S, Yoo DS, Wolf R, Melhem ER, Davatzikos C: Classification of brain tumor type and grade using MRI texture in a machine learning technique. Magn Reson Med 62:1609-1618, 2009.

Georgiardis P, Cavouras D, Kalatzis I, Daskalakis A, Kagadis GC, Malamas M, Nikifordis G, Solomou E: Improving brain tumor characterization on MRI by probabilistic neural networks on nonlinear transformation of textural features. Comput Meth Prog Bio 89:24-32, 2008.

Georgiardis P, Cavouras D, Kalatzis I, Daskalakis A, Kagadis GC, Malamas M, Nikifordis G, Solomou E: Non-linear least square feature transformations for improving the performance of probabilistic neural networks in classifying human brain tumors on MRI. Lecture Notes on Computer Science 4707:239-247, 2007.

El-Dahshan EA, Hosny T, Badeeh A, Salem M: Hybrid MRI techniques for brain image classification. Digital Signal Process 20:433-44, 2009.

S.N. Deepa and B. Aruna Devi, "A survey on artificial intelligence approaches for medical image classification", Indian Journal of Science and Technology, Vol. 4, No. 11 (Nov 2011), ISSN: 0974-6846.

Dipali M Joshi, Rana NK and Misra VM (2010) Classification of brain cancer using artificial neural network. Intl. Conf. Electronic Comput. Technol., pp. 112-116.

AmirEhsan Lashkari (2010) A neural network based method for brain abnormality detection in MR images using gabor wavelets. Intl. J. Comput. Appl. 4(7), 9-15.

Jude Hemanth D, Kezi Selva Vijila C and Anitha J (2010) Performance improved PSO based modified counter propagation neural network for abnormal MR brain image classification. Int. J. Advance. Soft Comput. Appl. 2(1), 65-84.


Kang H, Pinti A, Taleb-Ahmed A and Zeng X (2011) An intelligent generalized system for tissue classification on MR images by integrating qualitative medical knowledge. J. Biomed. Signal Processing & Control 6, 21-26.

Latha Parthiban and Subramanian R (2007) Intelligent heart disease prediction system using CANFIS and genetic algorithm. Intl. J. Biological & Life Sci. 3, 157-160.

Ramakrishnan S, Ibrahiem El and Emary MM (2010) Classification brain MR images through a fuzzy multiwavelets based GMM and probabilistic neural networks. J. Telecom. Sys., Springer Sci. 46(3), 245-252.

Riyahi Alam N, Younesi F and Riyahi Alam MS (2009) Computer-aided mass detection on digitized mammograms using a novel hybrid segmentation system. Intl. J. Biol. & Biomedical Engg. 3(4), 51-58.

Shayak Sadhu, Sudipta Roy, Siddharth Sadhukhan, Samir K Bandyopadhyay, "Automated Segmentation of the Human Corpus Callosum Variability from T1 Weighted MRI Image," IC3T, Proc. Springer, Hyderabad, India, 2015.

Sudipta Roy, Samir Kumar Bandyopadhyay, "Automated Computer Aided Diagnosis System For Brain Abnormality Detection And Analysis From MRI Of Brain," Advances in Computer Science and Technology, Volume 4, No. 3, March 2015.

Sudipta Roy, Shayak Sadhu, Samir K Bandyopadhyay, "A Useful Approach towards 3D Representation of Brain Abnormality from Its 2D MRI Slides with a Volumetric Exclamation," Proc. IEEE, C3IT, West Bengal, February 2015.

Sudipta Roy, Piue Ghosh, Samir Kumar Bandyopadhyay, "Segmentation and Contour Extraction of Cerebral Hemorrhage from MRI of Brain by Gamma Transformation Approach," The 2014 International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA), Proc. Springer, Advances in Intelligent Systems and Computing 328, Bhubaneswar, Orissa, 2014.

Sudipta Roy, Piu Ghosh, Samir Kumar Bandyopadhyay, "A Framework for Volumetric Computation of Brain Abnormality from MRI of Brain Slice and Its Appropriateness Measurement," International Conference on Communication and Computing (ICC-2014), Proc. Elsevier, Digital Signal and Image Processing, Bangalore, pp. 29-36, 2014.

Sudipta Roy, Kingshuk Chatterjee, Samir Kumar Bandyopadhyay, "Segmentation of Acute Brain Stroke from MRI of Brain Image Using Power Law Transformation with Accuracy Estimation," 2nd International Conference on Advanced Computing, Networking, and Informatics (ICACNI-2014), Kolkata, Proc. Springer, Smart Innovation, Systems and Technologies, Volume 27, 2014, pp. 453-461.

Sanjay Nag, Indra Kanta Maitra, Sudipta Roy, Samir Kumar Bandyopadhyay, "A Review of Image Segmentation Methods On Brain MRI For Detection Of Tumor And Related Abnormalities," International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 5, May 2014.

Sudipta Roy, Samir Kumar Bandyopadhyay, "A Review on Volume Calculation of Brain Abnormalities from MRI of Brain using CAD system," International Journal of Information and Communication Technology Research (IJICTR), Volume 4, Number 4, pp. 114-120, April 2014.

Sudipta Roy, Samir K. Bandyopadhyay, "Abnormal Regions Detection and Quantification with Accuracy Estimation from MRI of Brain," 2013 2nd International Symposium on Instrumentation and Measurement, Sensor Network and Automation (IMSNA), Canada, Proc. IEEE, pp. 611-615, 2013.

Sudipta Roy, Sangeet Saha, Ayan Dey, Soharab Hossain Shaikh, Nabendu Chaki, "Performance Evaluation of Multiple Image Binarization Algorithms Using Multiple Metrics on Standard Image Databases," 48th Annual Convention of Computer Society of India, Advances in Intelligent Systems and Computing, Proc. Springer, Volume 249, 2014, pp. 349-360.

Pabitra Roy, Sudipta Roy, Samir Kumar Bandyopadhyay, "An Automated Method for Detection of Brain Abnormalities and Tumor from MRI Images," International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 11, November 2013, pp. 1583-1589.

Sudipta Roy, Sanjay Nag, Indra Kanta Maitra, Samir K. Bandyopadhyay, "A Review on Automated Brain Tumor Detection and Segmentation from MRI of Brain," International Journal of Advanced Research in Computer Science and Software Engineering, pp. 1706-1746, Volume 3, Issue 6, June 2013.

Sudipta Roy, Sanjay Nag, Indra Kanta Maitra, Samir Kumar Bandyopadhyay, "Artefact Removal and Skull Elimination from MRI of Brain Image," International Journal of Scientific and Engineering Research, Volume 4, Issue 6, June 2013, pp. 163-170.

Sudipta Roy, Kingshuk Chatterjee, Indra Kanta Maitra, and Samir Kumar Bandyopadhyay, "Artefact Removal from MRI of Brain Image," International Refereed Journal of Engineering and Science (IRJES), Volume 2, Issue 3 (March 2013), pp. 24-30.

Sudipta Roy, Ayan Dey, Kingshuk Chatterjee, and Samir K. Bandyopadhyay, "An Efficient Binarization Method for MRI of Brain Image," Signal & Image Processing: An International Journal (SIPIJ), Vol. 3, No. 6, pp. 35-51, December 2012. DOI: 10.5121/sipij.2012.3604.

Sudipta Roy and Samir K. Bandyopadhyay, "Detection and Quantification of Brain Tumor from MRI of Brain and it's Symmetric Analysis," International Journal of Information and Communication Technology Research (IJICTR), pp. 477-483, Volume 2, Number 6, June 2012.

Sudipta Roy, Atanu Saha, and Samir Kumar Bandyopadhyay, "Brain Tumor Segmentation and Quantification from MRI of Brain," Journal of Global Research in Computer Science (JGRCS), Volume 2, No. 4, April 2011.

Y.C. Ouyang, H.M. Chen, J.W. Chai, C.C. Chen, Clayton C.C. Chen, S.K. Poon, C.W. Yang, and S.K. Lee, Independent component analysis for magnetic resonance image analysis, EURASIP J. Adv. Signal Process, 2008:780656, 2008.

R. He, S. Datta, B.R. Sajja, and P.A. Narayana, Generalized fuzzy clustering for segmentation of multi-spectral magnetic resonance images, Comput. Med. Imaging Graph, 32(5): 353-366, 2008.

L.P. Clarke, R.P. Velthuizen, M.A. Camacho, J.J. Heine, M. Vaidyanathan, L.O. Hall, R.W. Thatcher, and M.L. Silbiger, MRI segmentation: methods and applications, Magn. Reson. Imaging, 13(3): 343-368, 1995.

C. Valdés Hernández Mdel, P.J. Gallacher, M.E. Bastin, N.A. Royle, S.M. Maniega, I.J. Deary, and J.M. Wardlaw, Automatic segmentation of brain white matter and white matter lesions in normal aging: comparison of five multispectral techniques, Magn. Reson. Imaging, 30(2): 222-229, 2012.

J. P. Hornak, The Basics of MRI, [online book], 2004. Website: http://www.cis.rit.edu/htbooks/mri/

Darya Chyzhyk, Alexandre Savio, "Feature extraction from structural MRI images based on VBM: data from OASIS database," 2000.

http://www.braintumourresearch.org/types-of-brain-tumour
