As with everything we do, we made the Hadoop plugin for LizardFS as simple as we could.
This is a Java-based solution that lets Hadoop use LizardFS storage by implementing the HDFS interface on top of LizardFS. It acts as a file-system abstraction layer, so Hadoop jobs can directly access data on a LizardFS cluster. The plugin translates the LizardFS protocol and exposes the metadata in a form readable by YARN and MapReduce. For best performance, Hadoop nodes should run on the same machines as the LizardFS chunkservers.
A LizardFS mount gives direct access to stored files at the OS level, so the same cluster can serve as shared storage for your company and as computation storage for Hadoop at the same time. Unlike HDFS, you are not required to use Hadoop tools to put or get files from your storage. You can also take advantage of erasure coding and save a lot of disk space (HDFS recommends storing 3 copies of each block).
public BlockLocation[] getFileBlockLocations(FileStatus file, long start, long len)
Returns information about where data blocks are held in your LizardFS installation. If Hadoop runs on the same machines, it can take advantage of data locality.
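As a sketch of how this API can be consumed from your own code, the snippet below queries the standard Hadoop FileSystem interface for block locations. The lizardfs:// URI, host name, and file path are placeholder assumptions, not the plugin's documented scheme:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Hypothetical URI; the actual scheme and address depend on
        // how the LizardFS plugin is configured in your cluster.
        Path file = new Path("lizardfs://master:9421/data/input.csv");

        FileSystem fs = file.getFileSystem(conf);
        FileStatus status = fs.getFileStatus(file);

        // Ask the file system which hosts hold each block of the file.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```

MapReduce and YARN use exactly this information to schedule tasks on the nodes that already hold the data, which is why co-locating Hadoop nodes with chunkservers pays off.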
To install Hadoop with LizardFS:
1) Install and set up a LizardFS cluster
2) Install Hadoop, but do not start it yet
3) Install the LizardFS-Hadoop plugin on all Hadoop nodes
4) Configure the LizardFS plugin in Hadoop (alongside HDFS, or as a replacement for it)
5) Start Hadoop
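Step 4 usually comes down to editing Hadoop's core-site.xml. The fragment below is only an illustrative sketch: fs.defaultFS and the fs.<scheme>.impl pattern are standard Hadoop configuration keys, but the lizardfs scheme, address, and implementation class name are assumptions here, so consult the plugin's documentation for the exact values:

```xml
<!-- core-site.xml: illustrative sketch; the real property values come
     from the LizardFS-Hadoop plugin documentation. -->
<configuration>
  <!-- Hypothetical: make LizardFS the default file system for Hadoop jobs. -->
  <property>
    <name>fs.defaultFS</name>
    <value>lizardfs://master:9421/</value>
  </property>
  <!-- Hypothetical: map the lizardfs:// scheme to the plugin's FileSystem class. -->
  <property>
    <name>fs.lizardfs.impl</name>
    <value>com.example.lizardfs.LizardFileSystem</value>
  </property>
</configuration>
```

Running the plugin alongside HDFS instead of replacing it means leaving fs.defaultFS pointing at HDFS and addressing LizardFS paths explicitly by their URI scheme.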
Let us know what you think of it.