It turned out that the Safari browser on a Mac does not support geolocation over a wired connection. The problem was solved once I turned on the Wi-Fi.
Gotcha!!!
If we have data, let’s look at data. If all we have are opinions, let’s go with mine
Because the Hadoop 2.2.0 binary distribution ships with a 32-bit libhadoop by default, on a 64-bit system you have to build the native libraries yourself to avoid warning messages such as the disabled stack guard of libhadoop.so:
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /opt/hadoop-2.2.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
13/11/01 10:58:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
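You can confirm that the bundled library really is 32-bit by inspecting it with file (the path below is the one from the warning above; adjust it to your installation). On the stock 2.2.0 tarball it should report a 32-bit ELF shared object.

$ file /opt/hadoop-2.2.0/lib/native/libhadoop.so.1.0.0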
The official Hadoop website (http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/NativeLibraries.html) gives completely unclear instructions on how to build the Hadoop native libraries.
So here is what you should do:
You need all the build tools:
$ sudo apt-get install build-essential
$ sudo apt-get install g++ autoconf automake libtool cmake zlib1g-dev pkg-config libssl-dev
$ sudo apt-get install maven
Another prerequisite is Protocol Buffers (protobuf) version 2.5, which can be downloaded from https://protobuf.googlecode.com/files/protobuf-2.5.0.tar.gz. Download it to the /tmp directory; then:
$ tar xzvf protobuf-2.5.0.tar.gz
$ cd protobuf-2.5.0
$ ./configure --prefix=/usr
$ make
$ make check
$ sudo make install
With all the tools in place, we can now build the Hadoop native libraries. Assuming you have downloaded the Hadoop 2.2.0 source code, run:
$ tar xzvf hadoop-2.2.0-src.tar.gz
$ cd hadoop-2.2.0-src
$ mvn package -Pdist,native -DskipTests -Dtar
Note: there is a missing dependency in the hadoop-auth Maven module that results in a build failure at the hadoop-auth stage. The fix from the official bug report is the following patch:
Index: hadoop-common-project/hadoop-auth/pom.xml
===================================================================
--- hadoop-common-project/hadoop-auth/pom.xml (revision 1543124)
+++ hadoop-common-project/hadoop-auth/pom.xml (working copy)
@@ -54,6 +54,11 @@
     </dependency>
     <dependency>
       <groupId>org.mortbay.jetty</groupId>
+      <artifactId>jetty-util</artifactId>
+      <scope>test</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.mortbay.jetty</groupId>
       <artifactId>jetty</artifactId>
       <scope>test</scope>
     </dependency>
Maven will do all the heavy work for you, and you should see something like this after the build completes:
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------
[INFO] Total time: 15:39.705s
[INFO] Finished at: Fri Nov 01 14:36:17 CST 2013
[INFO] Final Memory: 135M/422M
The built native libraries should be at
hadoop-2.2.0-src/hadoop-dist/target/hadoop-2.2.0/lib
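If you want an existing installation (for example the /opt/hadoop-2.2.0 one from the warning above) to pick up the 64-bit libraries, one option is to copy the freshly built .so files from the native subdirectory of that path over the bundled ones; the exact source and destination paths are assumptions that depend on where you unpacked things.

$ cp hadoop-dist/target/hadoop-2.2.0/lib/native/* /opt/hadoop-2.2.0/lib/native/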
The complete source code for this example is on GitHub: https://github.com/drweiwang/BigData/tree/master/grpstats
We use the airline on-time performance dataset as input data and are only interested in the UniqueCarrier and ArrDelay columns.
More information about the dataset can be found here: http://stat-computing.org/dataexpo/2009/
The mapper simply parses the data line by line, extracts the UniqueCarrier and ArrDelay fields, and writes key-value pairs, where the key is the UniqueCarrier and the value is the numeric delay as an IntWritable.

The reducer iterates through the list of all delays associated with a key (one airline), keeping track of the maximum value. Finally, the reducer writes the maximum delay out with the key (the airline code). A minimal sketch of both is shown below.
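Here is a minimal sketch of the mapper and reducer (the actual code is in the GitHub repository above). The column indices assume the comma-separated 2009 Data Expo layout, with UniqueCarrier in column 9 and ArrDelay in column 15 (0-based indices 8 and 14); the class names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxArrDelay {

    // Emits (UniqueCarrier, ArrDelay) pairs, skipping the header row and "NA" delays.
    public static class DelayMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            String carrier = fields[8];   // UniqueCarrier (assumed column index)
            String delay = fields[14];    // ArrDelay (assumed column index)
            if (!delay.equals("NA") && !delay.equals("ArrDelay")) {
                context.write(new Text(carrier), new IntWritable(Integer.parseInt(delay)));
            }
        }
    }

    // Keeps the maximum delay seen for each carrier and writes it out.
    public static class MaxDelayReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable v : values) {
                max = Math.max(max, v.get());
            }
            context.write(key, new IntWritable(max));
        }
    }
}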
Menu items are stored in two places: system-wide under /usr/share/applications/, and per-user under ~/.local/share/applications/.
The menu item itself is stored as a .desktop file. The file should be UTF-8 encoded and resemble the following example, which adds a Google Chrome item to the applications menu.
[Desktop Entry]
Version=1.0
Type=Application
Terminal=false
Icon[en_US]=google-chrome
Name[en_US]=Google Chrome
Exec=/opt/google/chrome/google-chrome
Name=Google Chrome
Icon=google-chrome
Line-by-line explanation:
Line | Description |
---|---|
[Desktop Entry] | The first line of every desktop file and the section header to identify the block of key value pairs associated with the desktop. Necessary for the desktop to recognize the file correctly. |
Type=Application | Tells the desktop that this desktop file pertains to an application. Other valid values for this key are Link and Directory. |
Encoding=UTF-8 | Describes the encoding of the entries in this desktop file. |
Name=Sample Application Name | Names of your application for the main menu and any launchers. |
Comment=A sample application | Describes the application. Used as a tooltip. |
Exec=application | The command that starts this application from a shell. It can have arguments. |
Icon=application.png | The icon name associated with this application. |
Terminal=false | Describes whether application should run in a terminal. |
If your application can take command-line arguments, you can signify that by using the field codes shown below:
Add… | Accepts… |
---|---|
%f | a single filename. |
%F | multiple filenames. |
%u | a single URL. |
%U | multiple URLs. |
%d | a single directory. Used in conjunction with %f to locate a file. |
%D | multiple directories. Used in conjunction with %F to locate files. |
%n | a single filename without a path. |
%N | multiple filenames without paths. |
%k | a URI or local filename of the location of the desktop file. |
%v | the name of the Device entry. |
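For example, to let the Chrome entry above accept URLs passed on the command line, its Exec line could be written with the %U field code (expanded by the desktop environment at launch time):

Exec=/opt/google/chrome/google-chrome %U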
Desktop Entry Specification:
http://standards.freedesktop.org/desktop-entry-spec/latest/index.html
The Apache Hadoop website gives instructions on how to install Hadoop 2.2.0. However, as with most open source projects, it is not well documented, and some of the instructions simply do not work. I spent quite a lot of time figuring out how to install Hadoop 2.2 on my Debian 7 (wheezy) machine. Here are the steps I took:
First, install the Oracle JDK. The java-package tool lives in the contrib section, so make sure it is enabled in /etc/apt/sources.list:

# Debian 7 "Wheezy"
deb http://http.debian.net/debian/ wheezy main contrib
# apt-get update && apt-get install java-package && exit

Then, as a regular user, build a Debian package from the downloaded Oracle JDK tarball and install it as root:

$ make-jpkg jdk-7u45-linux-x64.tar.gz
$ su
# dpkg -i oracle-j2sdk1.7_1.7.0+update45_amd64.deb
By default, the Debian alternatives system will automatically select the best available version of Java as the default. If the symlinks have been set manually, they will be preserved by the tools; update-alternatives tries hard to respect explicit configuration from the local admin, and manually created symlinks count as explicit configuration. To reset the alternative symlinks to their default value, use the --auto option.
# update-alternatives --auto java
If you’d like to override the default and use a specific version, use --config and manually select the desired version.
# update-alternatives --display java
# update-alternatives --config java
Choose the appropriate number for the desired alternative.
The appropriate java binary will automatically be in PATH by virtue of the /usr/bin/java alternative symlink.
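If you want to double-check which JDK the /usr/bin/java alternative currently resolves to, you can run:

$ readlink -f /usr/bin/java
$ java -version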
You may also use the update-java-alternatives tool from the java-common package, which lets you update all alternatives belonging to one runtime or development kit at a time.
# update-java-alternatives -l
# update-java-alternatives -s j2sdk1.7-oracle
Next, create a dedicated group and user for Hadoop and give the user sudo rights:

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hadoopuser
$ sudo adduser hadoopuser sudo
Set up an SSH key for password-less login:
$ ssh-keygen -t rsa -P ''
...
Your identification has been saved in /home/hadoopuser/.ssh/id_rsa.
Your public key has been saved in /home/hadoopuser/.ssh/id_rsa.pub.
...
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost
Download the Hadoop 2.2.0 binary release and install it under /usr/local:

$ cd ~
$ wget http://www.trieuvan.com/apache/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz
$ sudo tar vxzf hadoop-2.2.0.tar.gz -C /usr/local
$ cd /usr/local
$ sudo mv hadoop-2.2.0 hadoop
$ sudo chown -R hadoopuser:hadoop hadoop
Or you can compile and build Hadoop 2.2.0 from source. See the build instructions for details:
http://svn.apache.org/repos/asf/hadoop/common/trunk/BUILDING.txt and my other post on how to build native Hadoop libraries: http://drweiwang.com/build-hadoop-native-libraries/
$ cd ~
$ nano .bashrc
Append the following to the end of your shell profile, ~/.bashrc:
# Hadoop variables
export JAVA_HOME=/usr/lib/jvm/j2sdk1.7-oracle/
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_HOME=$HADOOP_INSTALL
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
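To make the new variables available in the current session without logging out and back in, reload the profile:

$ source ~/.bashrc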
Now you need to set the JAVA_HOME value in the hadoop-env.sh shell script; this environment variable is required by Hadoop.
$ cd /usr/local/hadoop/etc/hadoop
$ nano hadoop-env.sh
Find the line containing export JAVA_HOME, which is the first export statement in the hadoop-env.sh file, and change its value to:
export JAVA_HOME=/usr/lib/jvm/j2sdk1.7-oracle/
Hadoop should now be installed. Log back in as the hadoopuser user and check the Hadoop version:
$ hadoop version
Hadoop 2.2.0
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-2.2.0.jar
Edit the core-site.xml configuration, which defines the default HDFS file system:
$ cd /usr/local/hadoop/etc/hadoop
$ nano core-site.xml
Paste the following between the <configuration></configuration> tags:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
Modify yarn-site.xml by adding the following between the <configuration></configuration> tags:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
Next, set up mapred-site.xml. First, rename the provided mapred-site.xml.template:
$ mv mapred-site.xml.template mapred-site.xml
$ nano mapred-site.xml
and then, add the following code inside the <configuration> tag:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
Now, we need to configure the HDFS. Make two directories for the namenode and datanode:
$ cd ~
$ mkdir -p mydata/hdfs/namenode
$ mkdir -p mydata/hdfs/datanode
$ cd /usr/local/hadoop/etc/hadoop
$ nano hdfs-site.xml
Paste the following between the <configuration> tags:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/home/hadoopuser/mydata/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/home/hadoopuser/mydata/hdfs/datanode</value>
</property>
The namenode needs to be formatted first:
hadoopuser@debian7-64:~$ hdfs namenode -format
Everything should be all set, and we can now start the Hadoop services. The old start-all.sh and stop-all.sh scripts from version 1.x have been superseded by start-dfs.sh/stop-dfs.sh and start-yarn.sh/stop-yarn.sh in version 2.2.0.
hadoopuser@debian7-64:~$ start-dfs.sh
....
hadoopuser@debian7-64:~$ start-yarn.sh
....
hadoopuser@debian7-64:~$ jps
If everything is successful, you should see the following java processes running:
7313 NodeManager
7026 SecondaryNameNode
6692 NameNode
7213 ResourceManager
7470 Jps
6786 DataNode
Finally, run the bundled pi example to verify that MapReduce jobs work end to end:

hadoopuser@debian7-64:/usr/local/hadoop$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 5
Number of Maps  = 2
Samples per Map = 5
.......
Wrote input for Map #0
Wrote input for Map #1
Starting Job
.......
14/01/14 22:11:09 INFO mapreduce.Job:  map 0% reduce 0%
14/01/14 22:11:16 INFO mapreduce.Job:  map 100% reduce 0%
14/01/14 22:11:22 INFO mapreduce.Job:  map 100% reduce 100%
14/01/14 22:11:23 INFO mapreduce.Job: Job job_1389755304371_0002 completed successfully
14/01/14 22:11:23 INFO mapreduce.Job: Counters: 43
    File System Counters
        FILE: Number of bytes read=50
        FILE: Number of bytes written=239029
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=538
        HDFS: Number of bytes written=215
        HDFS: Number of read operations=11
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=3
    Job Counters
        Launched map tasks=2
        Launched reduce tasks=1
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=10579
        Total time spent by all reduces in occupied slots (ms)=3592
    Map-Reduce Framework
        Map input records=2
        Map output records=4
        Map output bytes=36
        Map output materialized bytes=56
        Input split bytes=302
        Combine input records=0
        Combine output records=0
        Reduce input groups=2
        Reduce shuffle bytes=56
        Reduce input records=4
        Reduce output records=0
        Spilled Records=8
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=70
        CPU time spent (ms)=1430
        Physical memory (bytes) snapshot=620412928
        Virtual memory (bytes) snapshot=1804890112
        Total committed heap usage (bytes)=467140608
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=236
    File Output Format Counters
        Bytes Written=97
Job Finished in 22.021 seconds
Estimated value of Pi is 3.60000000000000000000
K-means clustering is a widely used clustering technique that seeks to minimize the within-cluster sum of distances. Solving the problem exactly is computationally difficult, but Lloyd [6] proposed a local search solution, often referred to as the "k-means" algorithm or Lloyd's algorithm, that is still widely used today. It is the speed and simplicity of the k-means method that make it popular, not its accuracy. The algorithm can converge to a local optimum that is very far away from the global optimum (even under repeated random initializations). A typical example is given in Figure 1: the k-means algorithm converges to a local minimum after five iterations, which contradicts the obvious cluster structure of the data set.

Figure 1: A typical example of k-means converging to a local minimum. The result of the k-means clustering (the rightmost panel) contradicts the obvious cluster structure of the data set. The small circles are the data points; the four-ray stars are the centroids (means). The initial configuration is in the leftmost panel, and the algorithm converges after the five iterations shown in the panels from left to right.

A proper initialization of k-means is crucial to obtaining a good final solution. Arthur and Vassilvitskii proposed the k-means++ initialization algorithm, which improves both the running time of Lloyd's iterations and the quality of the final solution. Our Statistics Toolbox already offers the kmeans function, which implements Lloyd's algorithm, and we can improve its performance by adding k-means++ seeding to the initialization step.

There are many other initialization methods for k-means, for example the k-means refinement algorithm and others. I chose k-means++ because it is simple and can be extended to the MapReduce framework. The k-means refinement algorithm can also be implemented in a MapReduce framework, but its initialization is rather complicated and imposes a non-empty cluster constraint. Although the refinement method is scalable and well suited to big data, for small data sets its run time can be much longer than standard random sampling or k-means++ because of that complexity and those constraints.
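To make the seeding step concrete, here is a minimal sketch of k-means++ initialization in Java; the post's own implementation targets the MATLAB kmeans function and MapReduce, so the class and method names below are purely illustrative assumptions.

import java.util.Arrays;
import java.util.Random;

// Illustrative sketch of k-means++ seeding (Arthur & Vassilvitskii),
// not the code used in the post.
public class KMeansPlusPlusSeeding {

    // Pick k initial centroids: the first uniformly at random, each subsequent
    // one with probability proportional to its squared distance to the nearest
    // centroid chosen so far.
    public static double[][] seed(double[][] data, int k, Random rng) {
        int n = data.length;
        double[][] centroids = new double[k][];
        centroids[0] = data[rng.nextInt(n)].clone();

        double[] distSq = new double[n];
        Arrays.fill(distSq, Double.POSITIVE_INFINITY);

        for (int c = 1; c < k; c++) {
            // Update each point's squared distance to the nearest chosen centroid.
            double total = 0.0;
            for (int i = 0; i < n; i++) {
                distSq[i] = Math.min(distSq[i], squaredDistance(data[i], centroids[c - 1]));
                total += distSq[i];
            }
            // Sample the next centroid with probability proportional to distSq.
            double r = rng.nextDouble() * total;
            int chosen = n - 1;
            double cumulative = 0.0;
            for (int i = 0; i < n; i++) {
                cumulative += distSq[i];
                if (cumulative >= r) { chosen = i; break; }
            }
            centroids[c] = data[chosen].clone();
        }
        return centroids;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            double d = a[j] - b[j];
            sum += d * d;
        }
        return sum;
    }
}

The key design point is the sampling loop: each new centroid is drawn with probability proportional to the squared distance from a point to its nearest already-chosen centroid, which is what spreads the initial centroids out and makes the subsequent Lloyd iterations converge faster.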