- JDK 8 is the minimum runtime version for Hadoop 3.x
With Oracle ending public updates for JDK 7 in 2015, Hadoop 3.0 JAR files are compiled to target JDK 8. This lets Hadoop 3.x upgrade its dependencies to modern versions, since many libraries now support only Java 8 and above. Users still running an older JDK must upgrade to JDK 8 to use these files.
- Erasure coding support in HDFS
With the rapid growth in data and data center hardware, erasure coding support is a critical feature of Hadoop 3.0. The technique splits data into data blocks plus computed parity blocks, so that lost blocks can be reconstructed from the blocks that survive. It works like an advanced form of RAID that automatically recovers data when hard disks fail. HDFS in Hadoop 2.x uses the 3-way replication scheme inherited from Google File System (GFS), storing every block three times for reliability, which means 200% storage overhead. With a Reed-Solomon (6, 3) erasure coding policy, Hadoop 3.0 cuts that overhead to 50%, roughly halving physical disk usage, while tolerating three simultaneous block losses instead of two. This can save customers a lot of money on hardware infrastructure.
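The storage arithmetic and the recovery idea can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the recovery part uses a single XOR parity block (which tolerates one lost block) as a stand-in for the Reed-Solomon codes HDFS actually uses.

```python
from functools import reduce


def storage_overhead(data_units: int, redundant_units: int) -> float:
    """Extra storage needed, as a fraction of the original data size."""
    return redundant_units / data_units


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (toy parity, not Reed-Solomon)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# 3-way replication: one original block plus two extra copies.
replication = storage_overhead(data_units=1, redundant_units=2)
# Reed-Solomon (6, 3): six data blocks plus three parity blocks.
rs_6_3 = storage_overhead(data_units=6, redundant_units=3)

print(f"3x replication overhead: {replication:.0%}")
print(f"RS(6,3) overhead: {rs_6_3:.0%}")

# Toy recovery: rebuild one lost block from the survivors and the parity block.
data = [b"hado", b"op3 ", b"hdfs"]
parity = xor_blocks(data)
lost_index = 1
survivors = [blk for i, blk in enumerate(data) if i != lost_index]
recovered = xor_blocks(survivors + [parity])
print(recovered == data[lost_index])
```

XOR parity is the simplest erasure code; Reed-Solomon generalizes the same idea to several parity blocks, so that more than one lost block can be rebuilt.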
- Shell script rewrite
The shell scripts in previous versions of Hadoop had many long-standing bugs and compatibility issues. In Hadoop 3.0 they have been rewritten to resolve those bugs, compatibility issues, and installation problems. All shell script subsystems now execute hadoop-env.sh, so every environment variable can be set in a single place. Daemonization has moved from the *-daemon.sh scripts to the bin commands through the --daemon option; for example, hdfs --daemon start namenode replaces hadoop-daemon.sh start namenode. The updated scripts also test for error conditions and report them more clearly. These are just a few of the updates.
- Support for opportunistic containers
The new version introduces the notion of an ExecutionType, which lets applications request containers of an opportunistic type in addition to the default guaranteed type. Opportunistic containers can be dispatched to a NodeManager even when it has no free resources at scheduling time. In that case they are queued at the NodeManager and start once resources become available. They have lower priority than guaranteed containers and are preempted when needed to make room for them. Because otherwise idle resources get used, cluster utilization improves.
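The queueing and preemption behavior described above can be pictured with a toy model. The sketch below is not YARN code: `ToyNodeManager`, `Container`, and the slot-based resource model are hypothetical simplifications, and it assumes a preempted opportunistic container is simply stopped rather than requeued.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Container:
    name: str
    opportunistic: bool = False  # False means a guaranteed container


@dataclass
class ToyNodeManager:
    slots: int  # hypothetical stand-in for memory/CPU capacity
    running: list = field(default_factory=list)
    queued: deque = field(default_factory=deque)
    preempted: list = field(default_factory=list)

    def launch(self, container: Container) -> None:
        if len(self.running) < self.slots:
            self.running.append(container)
        elif container.opportunistic:
            # No free resources: the opportunistic container waits in the queue.
            self.queued.append(container)
        else:
            # Guaranteed container: make room by preempting an opportunistic one.
            victim = next((c for c in self.running if c.opportunistic), None)
            if victim is None:
                raise RuntimeError("no capacity for guaranteed container")
            self.running.remove(victim)
            self.preempted.append(victim)
            self.running.append(container)

    def finish(self, container: Container) -> None:
        self.running.remove(container)
        if self.queued:
            # A freed slot goes to the oldest queued container.
            self.running.append(self.queued.popleft())


nm = ToyNodeManager(slots=2)
g1, g2 = Container("g1"), Container("g2")
o1 = Container("o1", opportunistic=True)
o2 = Container("o2", opportunistic=True)

nm.launch(g1)  # starts immediately
nm.launch(o1)  # starts immediately, fills the node
nm.launch(o2)  # node is full, so o2 is queued
nm.launch(g2)  # preempts o1 to make room
nm.finish(g1)  # the freed slot goes to queued o2
print([c.name for c in nm.running])
```

The node never sits idle while opportunistic work is waiting, yet guaranteed containers still get their resources. That is the utilization gain the feature is after.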
- Port changes for various services and the addition of new default ports
There have been notable changes in default ports. Previously, the default ports of several Hadoop services fell inside the Linux ephemeral port range (32768-61000), so a service could fail to bind on startup because another application had already taken its port. In Hadoop 3.0 the conflicting default ports of the NameNode, Secondary NameNode, DataNode, and KMS have been moved out of that range. This improves the reliability of rolling restarts on larger Hadoop clusters.
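As a rough illustration, the check below compares a few old and new HDFS default ports against the ephemeral range. The port numbers are quoted from memory of the Hadoop 3.0 release notes and should be treated as assumptions to verify against the hdfs-default.xml of your release; the helper name is hypothetical.

```python
# Linux ephemeral port range quoted in the text: 32768-61000 inclusive.
EPHEMERAL_RANGE = range(32768, 61001)

# service: (Hadoop 2.x default, Hadoop 3.0 default) -- assumed values, verify locally
DEFAULT_PORTS = {
    "NameNode HTTP": (50070, 9870),
    "Secondary NameNode HTTP": (50090, 9868),
    "DataNode data transfer": (50010, 9866),
    "DataNode HTTP": (50075, 9864),
}


def in_ephemeral_range(port: int) -> bool:
    """True if the OS may hand this port to any outgoing connection."""
    return port in EPHEMERAL_RANGE


for service, (old, new) in DEFAULT_PORTS.items():
    print(
        f"{service}: {old} (ephemeral={in_ephemeral_range(old)}) -> "
        f"{new} (ephemeral={in_ephemeral_range(new)})"
    )
```

Clusters that pin these ports explicitly in their configuration are unaffected by the new defaults, but firewall rules and monitoring checks that assume the old numbers need updating.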
Although these are just a few of the improvements made in Hadoop 3.0, they are a major advancement in the big data space. With the features above, others not covered here, and others likely to be announced later, Hadoop will remain a competitive platform for the foreseeable future.