Can I use Hadoop with PHP?

Hadoop is the first choice if you want to get started with big data analysis. Getting started is a bit tricky, though: Hadoop is not very easy to install, and there are some parts you have to know about, like YARN or HDFS …

So first: What is Hadoop? Hadoop consists of two important parts: a distributed file system called HDFS and a MapReduce framework for writing MapReduce jobs in Java (or, as we will see, any other programming language).

There is a great tutorial online for installing Hadoop on Ubuntu (DigitalOcean).

… after installing it you should check whether your HDFS works. HDFS comes with many command-line tools that mirror the ones you know from normal Linux distributions, like ls, cp, mv and so on.

HDFS

Let’s create some content in your new distributed file system …


# create a directory and list the content of root
hadoop fs -mkdir /test-1234
hadoop fs -ls /

# create a local text file and put it on your hdfs
echo "This is a nice text file." > test.txt 
hadoop fs -copyFromLocal test.txt /test-1234/test.txt

# output an HDFS file
hadoop fs -cat /test-1234/test.txt

Look at the list of commands in the documentation if you want to know more about HDFS.
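
Two more standard commands you will need all the time, fetching a file back from HDFS and deleting a directory (using the paths from the example above; local-copy.txt is just an arbitrary name for the downloaded copy):

hadoop fs -get /test-1234/test.txt local-copy.txt
hadoop fs -rm -r /test-1234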

If you get a warning message like this one …

WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

… you can get rid of it with:

export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native"

HDFS really shines when you work with big data: you can span volumes across many servers and so create huge virtual disk drives.
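
For example, you can ask HDFS how much capacity the whole cluster exposes as one big file system (the -h flag prints human-readable sizes):

hadoop fs -df -h /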

MapReduce

MapReduce is a programming model for crunching big data into small results: the first step maps the input data to key/value pairs, and the second step reduces these pairs until only your results are left. Because each step is so simple, the work can easily be split into chunks and run on multiple cores and even multiple machines. Hadoop is the job-runner framework that splits your jobs and runs them in parallel.
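
To make these two steps concrete before involving Hadoop at all, here is a minimal, self-contained PHP sketch of the idea, counting words in a couple of hard-coded lines (the input is made up purely for illustration):

<?php

// map step: turn every input line into (word, 1) key/value pairs
$lines = ["big data is big", "data is data"];
$pairs = [];
foreach ($lines as $line) {
    foreach (explode(" ", $line) as $word) {
        $pairs[] = [$word, 1];
    }
}

// reduce step: collapse all pairs with the same key into one count
$counts = [];
foreach ($pairs as $pair) {
    list($word, $one) = $pair;
    $counts[$word] = isset($counts[$word]) ? $counts[$word] + $one : $one;
}

print_r($counts); // Array ( [big] => 2 [data] => 3 [is] => 2 )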

Usually, MapReduce jobs for Hadoop are written in Java; you can see an example in the official Hadoop tutorial.

But current versions of Hadoop include a feature called Hadoop Streaming: it lets you create jobs that read their input from stdin and write their output to stdout, so you can use any programming language to write your Hadoop jobs.

Here is a simple example with the Unix commands cat and wc (word count):


hadoop jar hadoop-streaming-2.7.2.jar \
  -input myInputDirs \
  -output myOutputDir \
  -mapper /bin/cat \
  -reducer /usr/bin/wc

So what would the mapper and the reducer look like in PHP? Here are simple scripts for counting words in input files:

mapper.php

#!/usr/bin/php
<?php

// Read the input line by line from STDIN and emit one word per line.
// Hadoop sorts the mapper output, so identical words reach the
// reducer grouped together.
while (($line = fgets(STDIN)) !== false) {
    $words = preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        echo $word."\n";
    }
}

reducer.php

#!/usr/bin/php
<?php

// Count how often each word shows up in the mapper output.
$counts = [];

while (($line = fgets(STDIN)) !== false) {
    $word = trim($line);

    // skip empty lines
    if ($word === '') { continue; }

    if (!isset($counts[$word])) {
        $counts[$word] = 0;
    }
    $counts[$word]++;
}

// emit the final result, one "word: count" pair per line
foreach ($counts as $word => $count) {
    echo $word.': '.$count."\n";
}
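
By the way: because mapper and reducer are plain stdin/stdout programs, you can test the whole pipeline locally without Hadoop; the sort in the middle stands in for Hadoop's sort phase (assuming php is on your PATH):

cat test.txt | php mapper.php | sort | php reducer.php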

Call it …

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
   -input /test \
   -output /test-out \
   -mapper /usr/local/hadoop/scripts/mapper.php \
   -reducer /usr/local/hadoop/scripts/reducer.php
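
Note that both scripts have to be executable (chmod +x mapper.php reducer.php) and have to exist under that path on every node of the cluster. Alternatively, Hadoop Streaming has a -file option that ships local scripts along with the job; the call would then look roughly like this:

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
   -input /test \
   -output /test-out \
   -mapper mapper.php \
   -reducer reducer.php \
   -file /usr/local/hadoop/scripts/mapper.php \
   -file /usr/local/hadoop/scripts/reducer.php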

Hadoop feeds every file in the -input folder line by line into the mapper, sorts the mapper's output, feeds it into the reducer, and finally writes the reducer's output into text files in the -output folder.

Look at the output folder with these commands:

hadoop fs -ls /test-out
hadoop fs -cat /test-out/part-00000
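
One thing to keep in mind: Hadoop refuses to start a job whose -output folder already exists, so remove it before you rerun the experiment:

hadoop fs -rm -r /test-out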

That was my first big data experiment with Hadoop and PHP.