Hadoop是一個由Apache基金會所開發(fā)的分布式系統(tǒng)基礎(chǔ)架構(gòu),用戶可以在不了解分布式底層細(xì)節(jié)的情況下,開發(fā)分布式程序。充分利用集群的威力進(jìn)行高速運(yùn)算和存儲。
雖然Hadoop是用java寫的,但是Hadoop提供了Hadoop流,Hadoop流提供一個API,,允許用戶使用任何語言編寫map函數(shù)和reduce函數(shù)。 (推薦學(xué)習(xí):PHP視頻教程)
Hadoop流動關(guān)鍵是,它使用UNIX標(biāo)準(zhǔn)流作為程序與Hadoop之間的接口。因此,任何程序只要可以從標(biāo)準(zhǔn)輸入流中讀取數(shù)據(jù),并且可以把數(shù)據(jù)寫入標(biāo)準(zhǔn)輸出流中,那么就可以通過Hadoop流使用任何語言編寫MapReduce程序的map函數(shù)和reduce函數(shù)。
例如:
bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper /usr/local/hadoop/mapper.php -reducer /usr/local/hadoop/reducer.php -input test/* -output out4
Hadoop流引入的包:hadoop-streaming-0.20.203.0.jar,Hadoop根目錄下是沒有hadoop-streaming.jar的,因?yàn)閟treaming是一個contrib,所以要去contrib下面找,以hadoop-0.20.2為例,它在這里:
-input:指明輸入hdfs文件的路徑
-output:指明輸出hdfs文件的路徑
-mapper:指明map函數(shù)
-reducer:指明reduce函數(shù)
mapper函數(shù)
mapper.php文件,寫入如下代碼:
#!/usr/local/php/bin/php <?php $word2count = array(); // input comes from STDIN (standard input) // You can this code :$stdin = fopen(“php://stdin”, “r”); while (($line = fgets(STDIN)) !== false) { // remove leading and trailing whitespace and lowercase $line = strtolower(trim($line)); // split the line into words while removing any empty string $words = preg_split('/W/', $line, 0, PREG_SPLIT_NO_EMPTY); // increase counters foreach ($words as $word) { $word2count[$word] += 1; } } // write the results to STDOUT (standard output) // what we output here will be the input for the // Reduce step, i.e. the input for reducer.py foreach ($word2count as $word => $count) { // tab-delimited echo $word, chr(9), $count, PHP_EOL; } ?>