Small tutorial for installing Hadoop MapReduce on Windows

Part I - Installation

Install the Java SDK 1.6 from http://www.oracle.com/technetwork/java/javase/downloads/index.html (JDK 6 Update 22 works).

Install Eclipse: download http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/helios/SR1/eclipse-java-helios-SR1-win32.zip and unzip the folder, for example to C:\Programme.

Install Cygwin from http://www.cygwin.com/ to C:\cygwin (the installer's default preferences are fine).

Download Hadoop MapReduce from http://hadoop.apache.org/mapreduce/releases.html, using release 0.20.0. Decompress the tar.gz file (WinRAR should do the job) to C:\cygwin\home\.

To configure Hadoop, you need to edit one file. Open the file conf/hadoop-env.sh within the hadoop folder you just unzipped in your favorite text editor (e.g. WordPad). Find the following line in the file:

    # export JAVA_HOME=/usr/lib/j2sdk1.6-sun

and change it to:

    export JAVA_HOME=/cygdrive/c/Programme/Java/jdk1.6.0_22

Note: the comment sign (#) has to be removed! The path must point to the folder where your JDK is actually installed, written in Cygwin style (/cygdrive/c/... instead of C:\...).

Part II - Compiling a Hadoop job into a JAR file

Create a new Java project: Launch Eclipse. Select New from the File menu, then use the wizard to create a new Java Project. Enter a project name, in this example WordCount. Make sure that the selected JRE is version 1.6 (it may be listed as 6). Click Finish.

Add the Hadoop library to the project: In Eclipse, right-click on your project, go to Build Path, then Add External Archives. Browse to the hadoop folder and select the file hadoop-0.20.0-core.jar, then click Open.

Add the source code file: From the File menu, select New, then File. Select the parent folder WordCount/src (make sure this is correct, or you will run into trouble when exporting the JAR file below). Name the new file WordCount.java, then click Finish. Copy the code from http://www.infosci.cornell.edu/hadoop/wordcount.html and paste it into the new file.

Save the file. Eclipse compiles it as soon as you save it.

Export the JAR file: From the File menu, select Export. Select JAR file (it may be hidden under the heading Java) and click Next. Select all resources to be exported; in this case, select the entire WordCount project. Make sure the checkbox for exporting the classes is checked. Select an export destination for your JAR file. For simplicity, name the file WordCount.jar and export it to C:\cygwin\home\.

Running a Hadoop job on your development machine: Hadoop will run in "standalone mode", which means that it runs within a single process and does not take advantage of any parallel processing. This is much slower than running on a cluster, so you may want to use a reduced data set for testing.

Create or obtain test data: For this example, the input data can be this tutorial. Copy the entire text and save it with your favorite text editor as a plain text file named testing.txt. Create a folder called input in C:\cygwin\home\ and place the file in it.

Run the job: From the Start menu, go to Alle Programme (All Programs), Cygwin, then launch Cygwin Bash Shell. This brings up a UNIX-style command line interface. (Note: If you are using Windows Vista, you need to right-click on Cygwin Bash Shell and select "Run as Administrator", or else you will get "Permission Denied" errors when attempting the following.) Change into your hadoop directory hadoop-0.20.0 (it should be in your home directory ... /home//hadoop-0.20.0). To do this, type:

    cd /home//hadoop-0.20.0

Finally, type:

    ./bin/hadoop jar ../WordCount.jar WordCount ../input ../output

If you get an error like

    ./bin/hadoop: line 2: $'\r': command not found

then the hadoop script files contain Windows-style line endings (a carriage return before each line feed), which the Bash shell cannot handle. To repair them, run the following set of commands:

    dos2unix bin/hadoop
    dos2unix bin/*.sh
    dos2unix conf/*.sh

This should allow you to run the hadoop commands without errors.
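If dos2unix is not available in your Cygwin installation, the same repair can be sketched with standard tools. The helper names has_crlf and strip_cr below are made up for this illustration and are not part of Hadoop or Cygwin. The sketch swaps dos2unix for tr, and it demonstrates the idea on a throwaway file rather than on the real scripts.

```shell
#!/bin/sh
# Sketch only: detect and remove carriage returns without dos2unix.
# has_crlf and strip_cr are hypothetical helper names.

cr="$(printf '\r')"   # a literal carriage return character

# Succeeds if the given file contains at least one carriage return.
has_crlf() {
  grep -q "$cr" "$1"
}

# Removes every carriage return from the given file.
# The result is written back with cat so that the file keeps its permissions
# (bin/hadoop has to stay executable).
strip_cr() {
  tr -d '\r' < "$1" > "$1.tmp" && cat "$1.tmp" > "$1" && rm -f "$1.tmp"
}

# Demo on a temporary script with Windows line endings.
demo_script="$(mktemp)"
printf '#!/bin/sh\r\necho ok\r\n' > "$demo_script"

has_crlf "$demo_script" && before=yes || before=no
strip_cr "$demo_script"
has_crlf "$demo_script" && after=yes || after=no

echo "before=$before after=$after"
rm -f "$demo_script"

# Applied to the hadoop folder, the same idea would look like this:
#   for f in bin/hadoop bin/*.sh conf/*.sh; do
#     has_crlf "$f" && strip_cr "$f"
#   done
```

Checking with has_crlf first means that files which are already fine are left untouched.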
Retrieve the results: The results have been written to a new folder called output in your home directory, i.e. /home//output. There should be one file, named part-00000, which lists all the words in the test document, along with their occurrence counts. Note that before running hadoop again you need to delete the entire output folder: hadoop will not do this for you, and the job fails if the folder already exists.
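To sanity-check part-00000, the same counts can be approximated with standard command line tools. This is a rough sketch that assumes the WordCount job splits the text on whitespace and treats words case-sensitively, as the stock Hadoop example does; if the code you pasted tokenizes differently, the numbers will differ. The function name count_words and the demo text are invented for this illustration.

```shell
#!/bin/sh
# Sketch only: count whitespace-separated words with standard tools.
# count_words is a hypothetical helper name; it prints "word<TAB>count" lines.
count_words() {
  tr -s '[:space:]' '\n' < "$1" |   # put every token on its own line
    grep -v '^$' |                  # drop an empty first line, if any
    LC_ALL=C sort |                 # byte-order sort so equal words are adjacent
    uniq -c |                       # prefix each distinct word with its count
    awk '{ print $2 "\t" $1 }'      # reorder to word, tab, count
}

# Demo on a small temporary file instead of the real testing.txt.
demo_input="$(mktemp)"
printf 'hello world\nhello hadoop\n' > "$demo_input"

counts="$(count_words "$demo_input")"
echo "$counts"
rm -f "$demo_input"
```

From the hadoop directory, running count_words ../input/testing.txt and comparing it with ../output/part-00000 should then give matching lines under the assumptions above. The old results can be removed with rm -r ../output before the next run.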