Keyword Extraction from Educational Video Transcripts Using NLP Techniques
Himani Shukla Department of Computer Science Amity University Uttar Pradesh
[email protected]
Misha Kakkar Department of Computer Science Amity University Uttar Pradesh
[email protected]
Abstract- Keyword extraction is one of the most important tasks while working with text data. Extracted keywords benefit the reader by indicating the important parts of a text without requiring the whole text to be read. In this paper a technique to extract keywords from educational video transcripts from MOOCs is discussed. The technique is based on a Regular Expression Grammar Rule approach to identify the Noun Chunks in the text of the transcript. Extracting keywords helps in finding the specifically important parts of the educational material.

Keywords-Regex, Chunking, PoS Tagging, Noun Chunk

I.
INTRODUCTION

With the advancement in multimedia technology, the number of videos available online is increasing rapidly. In multimedia-based e-learning systems there is a need for an efficient video content searching tool. Keyword extraction plays an important role in building a search index for a video. Keywords serve as a summary of a document; however, extracting salient keywords from video transcripts is a challenging issue. In Information Retrieval systems, the existing methods for keyword extraction involve statistical techniques like term frequency and n-grams [1]. These techniques are not very useful here because most transcripts are short, as it is not practical to produce long videos due to high video production costs.

Educational and instructional videos from various MOOCs (Massive Open Online Courses) are the main focus of our study, as they are playing an important role in students' lives worldwide. A MOOC is an online forum which is openly accessed via the internet. MOOCs provide course materials like lectures, written notes, problem sets, discussion forums etc. The advancement in multimedia technology has changed the face of distance learning. MOOCs benefit society by providing quality education worldwide at no or very low cost. Extracting keywords benefits the reader by indicating the important parts of a text without going through the whole text. In this paper we implemented a Regular Expression based Grammar Rule approach to identify the Noun Chunks in the transcripts. Various concepts of Natural Language Processing like Chunking (Noun Phrase), Part of Speech (PoS) tagging and regular expressions to construct a grammar rule are used in extracting noun phrases.

978-1-4673-8203-8/16/$31.00 ©2016 IEEE

II.
LITERATURE REVIEW

A keyword extraction system aims to extract word chunks that are useful as information retrieval keywords, which can be used for efficient searching. The keyword extraction system identifies a particular pattern in the text based on a grammar rule called a Regex. Such pattern matching is called Chunking. This section reviews earlier attempts at keyword extraction. Automatic Keyword Extraction for Database Search [2] performed keyword search and disambiguation algorithms on Okkam data. This study performed automatic keyword extraction and grouped the extracted keywords into three groups: Text based, Database based, and Text and Database based keywords. For Text based extraction they employed a TF-IDF approach; for Database based extraction they employed database-specific statistical information along with TF-IDF scores; and for Text and Database extraction they combined the TF-IDF scores for text and database and averaged them.

Rule Based Chunk Extraction from PDF Documents using Regular Expressions [3] proposed a system for chunk extraction and for categorising and classifying sentences in PDF documents based on rules decided by the user. In this study, rules were placed in a principal file for adjusting various parameters; for example, if the document is in upper or lower case then case sensitivity is ignored while matching the rule. The system extracted the transaction id, the name of the document, the attributes and the rule id. Extracting Salient Keywords from Educational Videos using Joint Text, Audio and Visual Cues [4] extracted domain-specific keywords from educational transcripts using various linguistic and statistical techniques. The study implemented an algorithm that identifies glossary terms using named entity recognition and syntactic grammar. A multi-modal feature based salient keyword extraction system was implemented using not only word features but also the sound and visual content types available in the data. 82% of the keywords identified were also in the top 50% of the keywords; including the audio and visual data improved precision and recall by 1.1% and 1.5%. Improved Automatic Keyword Extraction Given More Linguistic Knowledge [5] focused on the fact that adding linguistic knowledge to statistical features improves keyword generation. In this study NP chunks were extracted and converted into n-grams; the addition of PoS tags as features drastically improved classifier performance. Automatic Keyword Extraction for Wikification of East Asian Documents [6] proposed a two-step approach: first, noun chunks are extracted from a document using morphological analysis tools and a method called Top Consecutive Noun Cohesion, which connects consecutive nouns in a compound word; then these keywords are ranked using the Dice Coefficient or Keyphraseness. These methods achieved good accuracy. KpCatcher (keyphrase catcher) [7] extracted noun phrases from enterprise videos using confidence based and counting based rules. Improved Keyword and Keyphrase Extraction from Meeting Transcripts [8] used machine learning and data mining techniques like MaxEnt and SVM classifiers. They also extracted bigrams and trigrams using n-grams and identified low-frequency words occurring in the corpus using Latent Dirichlet Allocation. The quality of the extracted keywords was improved using sequential pattern mining.

III.
EXPERIMENTATION
The proposed system aims at extracting the keywords (Noun Phrases) from the transcripts of online video lectures, as shown in Fig. 1. The dataset used in this study consists of freely available transcripts from NPTEL (National Programme on Technology Enhanced Learning) [9] for various courses. The transcripts are available in PDF format, and Fig. 2 shows one such sample transcript. A PDF to XML converter is used to extract metadata information like text (transcripts of video lectures), course name, instructor name etc. This metadata information is stored in an XML file for easy handling and processing.
Fig. 1. NPTEL Video Sample [10]
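Assuming the XML file produced by the converter stores each metadata item as a child element (the element names below, course, instructor and text, are illustrative only and not taken from the paper), retrieving the metadata can be sketched as:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout produced by the PDF-to-XML step;
# the element names are illustrative, not the paper's actual schema.
xml_doc = """<transcript>
  <course>Biology for Engineers</course>
  <instructor>Prof. Example</instructor>
  <text>In this slide I have illustrated the complexity of the human proteome.</text>
</transcript>"""

root = ET.fromstring(xml_doc)
# Map each child element's tag to its text content.
metadata = {child.tag: child.text for child in root}
raw_text = metadata["text"]   # the transcript text fed to the pipeline
print(metadata["course"])
```

The same traversal works on a file via `ET.parse(path).getroot()`.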
Raw Text → Sentence Tokenisation → Word Tokenisation → PoS Tagging → NP Chunking → Extracted Keyword

Fig. 3. Architecture for Keyword Extraction
Fig. 3 shows the keyword extraction process used in this study. The various components of the keyword extraction architecture are explained as follows:

1) Raw Text: Raw text is the extracted transcript of the video lectures.
2) Sentence Tokenisation: Sentence Tokenisation is the problem of deciding the sentence boundaries (the beginning and end of each sentence). Sentence boundary identification is often a challenging task, as periods (dots) at times denote abbreviations, decimal points in numbers or email addresses rather than the end of a sentence.
3) Word Tokenisation: Word Tokenisation is used to obtain the list of words, also called tokens, from a sentence.
4) PoS Tagging: PoS tagging, also called grammatical tagging in linguistics, consists of marking the words in a text according to their particular part of speech. Thus it identifies verbs, nouns, adjectives, adverbs, determiners etc. Words assigned to the same part of speech generally show similar behaviour. Almost all languages have nouns and verbs, but beyond these there are variations across the languages of the world.
5) NP Chunking: NP Chunking is a method of parsing sentences into syntactic structures containing the individual noun phrases in the text. A noun phrase is a phrase whose head is a noun or a pronoun, accompanied by a set of modifiers, for instance determiners, articles (the), demonstratives (this, that) and adjectives.
6) Extracted Keyword: This component stores the extracted NP-tagged chunks as the final keywords.

In this slide I have illustrated the complexity of the human proteome as compared to the genome or transcriptome. The extent of diversity and complexity due to alternative splicing and post-translational modification is tremendous, therefore studying proteins and the proteome is very important. Steps involved in proteome analysis: protein extraction followed by their separation, identification and characterization. Protein extraction from whole cells, tissue or organisms is the first requirement for proteome analysis in the majority of proteomics experiments. Protein separation and quantification is achieved by gel-based (e.g. 1-DE) and gel-free techniques (e.g. iTRAQ) and identification by MS. The functional characterization of proteins using novel proteomic platforms opens new horizons for exploration in biology. Abundance based

Fig. 2. Sample Transcript
2016 6th International Conference - Cloud System and Big Data Engineering (Confluence)

In the proposed keyword extraction process, the first step is to retrieve the Raw Text from the XML file. The Raw Text obtained is fed to the Sentence Tokenizer in the next step, which outputs the list of tokenized sentences. Further, the tokenized sentences are converted to the list of words (tokens) using the Word Tokenizer in the third step. In the next step, the Part of Speech Tagger component processes the tokens obtained in the previous step and attaches a part of speech to each token. In the Noun Phrase Chunking or NP Chunking step, the text is searched for chunks containing individual noun phrases. PoS tags play a very important role in the NP Chunking step. To do NP Chunking, a grammar called the NP chunk grammar is first defined. This grammar decides how chunks are to be formed, as it contains rules in the form of regular expressions over PoS tags. The most commonly used NP chunk grammar is represented as:

NP: {<DT>?<JJ>*<NN>}

which defines an NP chunk pattern consisting of an optional Determiner (DT) followed by any number of Adjectives (JJ) and then a Noun (NN). The example output of the NP Chunking process is shown in Fig. 4, where a synthetically created sentence is parsed by an NP chunker and represented as a chunk tree. The root of the chunk tree is S (Sentence), which the NP chunker breaks down into the NP chunks of the given sentence.
Fig. 4. Chunk Tree (parse of the synthetic sentence "Machine Learning is a sub field of computer science"; the root S contains the NP chunks "Machine Learning", "a sub field" and "computer science")
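A minimal sketch of this chunking step with `nltk.RegexpParser`, applied to a hand-tagged sentence so the example does not depend on a tagger model:

```python
import nltk

# NP chunk grammar from the text: optional determiner (DT),
# any number of adjectives (JJ), then a noun (NN).
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")

# Hand-tagged tokens, similar in shape to the Fig. 4 sentence.
tagged = [("a", "DT"), ("small", "JJ"), ("field", "NN"),
          ("of", "IN"), ("computer", "NN"), ("science", "NN")]

tree = chunker.parse(tagged)   # an nltk.Tree with root label S
tree.pretty_print()            # draws the chunk tree in ASCII
```

Note that with this grammar "computer" and "science" come out as two separate single-noun NP chunks, since the rule matches only one NN; the more complex grammars listed below address exactly this.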
After the chunking process, the NP-tagged leaves of the chunk tree (as shown in Fig. 4) are extracted and stored as final keywords in the last step. The NP tag patterns (grammars) experimented with in this study are, in order of increasing complexity:

• <NN>*
• <DT>?<JJ>*<NN>
• <DT>?<JJ.*>*<NN.*>
• <JJ>*(<NNP>+<NN>*)?<NN>+
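The last step, reading the NP chunks back out of the tree as keyword strings, can be sketched as a plain traversal of the resulting `nltk.Tree` (not the authors' exact code):

```python
import nltk

def extract_np_keywords(tree):
    """Collect the words under every NP subtree as one keyword phrase."""
    keywords = []
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
        keywords.append(" ".join(word for word, tag in subtree.leaves()))
    return keywords

chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tagged = [("the", "DT"), ("physical", "JJ"), ("parameter", "NN"),
          ("varies", "VBZ"), ("with", "IN"), ("input", "NN")]
print(extract_np_keywords(chunker.parse(tagged)))
# → ['the physical parameter', 'input']
```

The root S node is skipped by the label filter, so only genuine NP chunks are emitted.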
The system is implemented with Python's Natural Language Toolkit (NLTK) [11], which has an easy-to-use interface and provides access to over 50 corpora and lexical resources. The various NLTK libraries are used for Tokenisation, Lemmatisation, PoS Tagging and Chunking.

IV.
RESULTS

Running the NLTK chunker using the grammar <JJ>*(<NNP>+<NN>*)?<NN>+ on the transcripts yielded the following keywords.
Finally we can only observe. Physical parameter will vary with input and output but if you become an observer the observation can only be discrete. But whether it is a discrete time signal or analog time signal, the variation of the signals between these two limits is a continuous variation; there are no defined points by which it can vary. What I am saying is supposing my initial value is 0V and Vmax is 5V as an example. As an example I make this 5V; I am only allowing this signal to change from 0 to 5V and within this 0 to 5V it can take any value.

Fig. 5. Sample Text for Keyword Extraction
physical parameter, input, output, observer, observation, discrete time, analog time signal, variation, signals, limits, continuous variation, points, initial value, vmax, example, signal, value, difficulty, limitations