Friday, November 18, 2016


Git Essentials LiveLessons



install git from
install sublime from

from git bash set your global variables

# set global user name
 git config --global user.name "hassan jamous"
# set global email
 git config --global user.email "''"
# set global color
 git config --global color.ui "auto"
# set global editor to sublime
 git config --global core.editor "'C:\Program Files\Sublime Text 3\sublime_text.exe' -w"

you can view your global configuration 
git config --list


create a git repository
you can create a git repository in any folder, 
cd to the folder and type
git init
git will create one hidden .git folder in this folder; it tracks this folder and all its subfolders. you can type
ls -a to check

Branch Master
when you create a new git repository, you will be working on the master branch.

Untracked, Staging and Commit
when you create a new file in your repository, the file will be untracked.
to track the file you type
git add FILENAME
or you can use
git add .
to add all the files to the staging area

after you add the file, it is in the staging area.

to commit your changes, which means saving them to the branch, you type
git commit -m "commit message"

to check the commit log
git log

Now if you change a file, the change will NOT be in the staging area; you should add the file again and then commit.

the HEAD is your last commit, so when you commit new changes you are moving the HEAD to the new commit

Check the differences
if you change a file, then before moving it to the staging area you can compare your changes with the last commit by typing

git diff 

if the file is already in the staging area you should type

git diff --staged

after you commit you can compare with the previous version by
git diff HEAD~1
which means compare with one commit before the HEAD; you can also use HEAD~2, HEAD~3 ....

you can also compare with a commit id. if you use
git log
you will get something like

commit 5dbed2e0bcd7bdb844d6a6fdfc6519b9f5da7e31
Author: hassan jamous <>
Date:   Wed Nov 16 19:00:50 2016 +1100

    second commit

commit c39fbd23776eb5e569bff21b5bd8d05eacb1facd
Author: hassan jamous <>
Date:   Wed Nov 16 18:42:30 2016 +1100

    first commit

you can use the commit id to compare the differences
git diff c39fbd23776eb5e569bff21b5bd8d05eacb1facd

you can move your HEAD between commits by using git checkout.
for example you can move your HEAD to the previous commit

git checkout HEAD~1

if you type git status here, it will tell you that the HEAD is detached and is now pointing to the previous commit
$ git status
HEAD detached at c39fbd2
nothing to commit, working tree clean

to go back to the last commit type 
git checkout master 

you can also check out a commit id
git checkout c39fbd23776eb5e569bff21b5bd8d05eacb1facd

let's say you want an old version of a file; you simply check out that file from the commit you want

git checkout HEAD~1 readme.txt

notice that you are not moving the HEAD here, you are just telling git that you want the file as it was at HEAD~1

of course you can also use a commit id
git checkout c39fbd23776eb5e569bff21b5bd8d05eacb1facd readme.txt

now if you check git status, you will see that the file is modified and in the staging area:
$ git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        modified:   README.txt

now you can commit your new changes.

Deleting a file
when you delete a file, the deletion will not be in the staging area. you can confirm the deletion by typing
git add FILENAME
to add it to the staging area (git rm FILENAME does the same thing), then
git commit -m 'we deleted the file'
to commit

now if you want to reverse the deletion instead of adding it to the staging area, you should type
git checkout master readme.txt
notice that we returned to the master version

Moving a file out of the staging area (undo the git add)
to move the file out of the staging area type
git reset HEAD readme.txt

undo your changes
if you made a change that is not in the staging area yet and you want to undo it
git reset --hard
be careful, this discards all uncommitted changes to tracked files, staged or not.

Adding new folder to git repository
if you create a new empty folder, you will notice that git does not recognise it; a folder needs at least one file in it to be tracked.
that's why people create a .gitkeep file inside an empty folder so that git recognises the folder. .gitkeep is only a naming convention; because the name starts with a dot the file is hidden, so normal users will not see it.
of course you can see the hidden file from bash by running ls -a

ignore files 
to make git ignore files, you should create a .gitignore
file in the root folder of the repository; inside this file you list the files or patterns that you would like to ignore.

force ignored file to be committed 
to force an ignored file to commit 
git add -f FILENAME


GIT is structured this way

as you can see, you have a local copy and a remote. the remote could be GitHub, Bitbucket, GitLab or anything else that hosts git repositories

you can have multiple remotes, however the primary remote is called origin (this is a convention)

you push to or pull from a remote

adding a remote
1- create a repository on github
2- we will use this repository as remote
3- we will add the remote, and as it is going to be our primary remote we will call it origin

git remote add origin
as you can see we named this remote origin. you can call it whatever you want, but as it is the primary remote we follow the convention, which says it should be called origin.

now you need to push your repository to GITHUB
git push origin master

which means i want to push my master to remote called origin.

it will ask you for github username and password

checking which remotes you have
you can use
git remote -v
to get the list of remote repositories that you have, you will get something like
$ git remote -v
origin (fetch)
origin (push)

as you can see, for each remote you have two entries, one to fetch the code and another to push the code.

USE SSH to connect to github
for any change that you want to push to github over http, you need to provide a username and password.
to avoid this, you should use an ssh url rather than an http url to connect to github

to do that
1- go to your home folder and create a .ssh folder (on linux ~/.ssh, on windows under C:\Users\Hassan)
2- cd to that folder
3- type  ssh-keygen
4- you will receive this message Enter file in which to save the key (/c/Users/hassa/.ssh/id_rsa):
5- put the file name like id_rsa
6- then you will get the following message

Your identification has been saved in id_rsa.
Your public key has been saved in id_rsa.pub.
The key fingerprint is:

 now your public key is stored in id_rsa.pub

7- open id_rsa.pub
8- copy the key
9- go to github, open the settings menu then SSH and GPG keys

10- add a new key, and paste the key.

11- now get the ssh location of the repository from github

and type
git remote add origin SSHLOCATION

so now let's say you updated a file and committed the changes. these changes are stored locally; to push them to github

git push origin master

now the file will go to github 

Pulling changes from GitHub
you can edit files on github directly. let's say you edited a file on github, or you just want to pull the latest changes

you can type
git pull origin master

when you push your changes, you might get an error saying that the push was rejected because you don't have the latest version.
you should pull the latest changes and then push
git pull origin master
git push origin master

when you pull, an automatic merge might not be possible, in which case you have to resolve the conflict manually

Creating a new branch

to create a new branch you can type
git branch BRANCHNAME

this will create a branch from where you are, so if you are on master the branch will be created from master.

or you can type
git checkout -b BRANCHNAME
which creates the branch and switches to it in one step

to list the branches you have
git branch -a

to move from one branch to another
git checkout BRANCHNAME

to delete a branch
git branch -d BRANCHNAME

to force delete a branch (in case there is some work on this branch that is not merged yet)
git branch -D BRANCHNAME
use capital D

to merge changes into the master branch, you should first check out that branch
git checkout master

then you can merge
git merge BRANCHNAME

when you have a new task,
1- create a new branch from the master branch
git checkout -b NEWJIRA
2- make the changes that you want and commit them
3- now we should push this branch to remote
git push origin NEWJIRA
4- go to the github website and create a pull request; here you should specify the base branch and the branch that you want to be merged (the base branch will be master, the branch to be merged is NEWJIRA)
5- someone will see the pull request, review it and accept it.
6- after this you can delete the NEWJIRA branch from github.

now you have merged the NEWJIRA branch into the master branch on github, which means you merged on the REMOTE. you need to pull these changes to your local master

git checkout master
git pull origin master

now your master is in sync with the remote master

when you type
git branch -a
you will get something like
$ git branch -a
* master

as you can see, it lists the branches that you have locally and the remote branches.
let's say that you went to github and deleted testingBranch, so you deleted the remote testingBranch

after you do that you should update your local repository to be in sync with the remote. to sync the remote branches WITHOUT CHANGING YOUR LOCAL BRANCHES, you should use git fetch

git fetch

this will sync the remote branches; however, in order to also remove the remote branches that no longer exist you should type

git fetch --prune
now if you type
git branch -a

$ git branch -a
* testingBranch


you will notice that the remote branch is gone; your local testingBranch is still there.

basically git pull is git fetch + git merge

you can use
git log
to get the log of a branch
however there is too much information there. in order to see a shorter version

git log --oneline

this will print one line for each commit 

to print all commits
git log --oneline --all 

to print a graph and to see which branch merged the changes

git log --oneline --all --decorate --graph 

from the log you can see how the branches are related: it tells you which branch is ahead of another, and which branches are pointing to the same commit

for example, the following image tells us that the development, origin/master and master branches are on the same level

and this image tells you that origin/master and master are on the same level
and that feature/folder_documentation, origin/development and development are on the same level and ahead of the master branch

so far we used git merge to sync branches. what basically happens in git merge is the following

let's say you have this case

now when you merge you do the following
git checkout master
git merge experiment

and this is what will happen

there is another way to bring the changes together, which is rebase

lets say we have the following 

rather than going to master and merging, we will do the following
git checkout experiment
git rebase master

now this is what will happen, the output will be

First, rewinding head to replay your work on top of it...
Applying: added staged command

so we replayed the experiment commit as C4' in front of master

now we will type

$ git checkout master
$ git merge experiment

and the result
Fast-forwarding the master branch.

Lesson 4

Adding a collaborator 
if you want to add someone to your github project so they can push and pull, go to your repository, choose settings -> collaborators, and then add the collaborator

now the collaborator should download the project. to do that use git clone
this will download the project and create the required remotes.

now the collaborator can push and pull

if you have a big project with many contributors, you will not add all of them as collaborators; the best solution for this is to FORK

fork means that there is a project REMOTE and you copy this project to your own remote.
so when you fork you are taking a copy of the project into your account (your remote), and when you push and pull you are basically doing that on your account, not on the project account.

you make the changes on your account (your remote) and then create a pull request to merge them into the project remote

so, from GitHub you can press the FORK button; now the project is forked and it is in your account.
you can copy the ssh link and use
git remote add origin SSHLINK

now make the changes and push them to your remote, then create a pull request to merge them into the project remote.

when many people fork the project, your fork will get out of sync with the project.
to handle this you should add a new remote, and this remote is the project itself,
so now you have your account remote (which we call origin) and we should add the project remote (which we call upstream).
git remote add upstream PROJECT_SSH_LINK

it is very important that you work in this order:
1- pull from UPSTREAM (i.e. get the latest version from the project remote)
2- push to ORIGIN (i.e. push your changes to your remote)
3- create a pull request to merge from ORIGIN to UPSTREAM

there are some situations where you have to resolve conflicts while you do
git rebase
after you fix the conflict you should run one of
git rebase --continue
git rebase --skip

and after a rebase you usually have to force push
git push -f origin master


check this url for branching 

Monday, May 23, 2016

OREILLY Learning Apache Maven

Maven has 3 lifecycles: clean, default and site.
each lifecycle has a lot of phases; executing a phase means executing all the phases before it as well.

Maven is convention over configuration, which means you don't tell maven in some configuration where your java files are; by convention they must be in src/main/java

You can change these conventions but it is not recommended; you may need that if you are working on a legacy application

you can have a Terminal inside eclipse, use TCF Terminal plugin

Inheritance in Maven
every pom inherits from a general pom (the Super POM), where all the directories and conventional stuff are defined

Maven profile
you can use Maven profiles to build the project based on your environment, for example one profile for the test environment, another for DEV and another for production.

now when you run maven use -P
mvn -Pproduction package

if you don't want to use -P there is another way:
you can define an environment variable in windows and then use it in pom.xml to activate a profile

now Maven will check PACKAGE_ENV environment variable and determine which profile to use

Maven Dependency

Maven can handle transitive dependency which means if you depend on X.jar and X.jar depends on Y.jar, Maven will fetch Y.jar

You can define Remote repositories in Maven.

You can define a scope for your dependencies, for example you don't need junit when you compile the main code, you only need it when you test

Maven can handle conflicts, for example you depend on X.jar and Y.jar, X.jar depends on Z.jar version1 and Y.jar depends on Z.jar version2. Maven can handle this conflict.
by default it picks the version that is nearest to your project in the dependency tree (if both are at the same depth, the one declared first wins), which is not necessarily the latest one. you can control this behaviour using the <exclusion> tag.

Maven Lifecycles 
there are 3 different lifecycles in Maven: default, clean and site
lifecycles have phases
phases are connected to plugins; each plugin has goals, and the goals bound to a phase must pass in order for the phase to pass

default is the most used lifecycle
in default you have these phases (among others):
compile: compile everything in src/main/java
test-compile: compile everything in src/test/java
test: run the unit tests
package: create the jar or ear or war
install: take the generated package and put it in the local repository so other projects can use it as a dependency
deploy: take the package and put it in a remote repository, so other teams in the company can use it
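
the rule that running a phase also runs every phase before it can be sketched in a few lines of Java. this is only an illustration and not how Maven is implemented: the class DefaultLifecycle and the method phasesRunBy are made-up names, and the list contains only the six phases from these notes (the real default lifecycle has more).

```java
import java.util.Arrays;
import java.util.List;

// Illustrative model only: the six phases listed in the notes, in order.
class DefaultLifecycle {
    static final List<String> PHASES = Arrays.asList(
            "compile", "test-compile", "test", "package", "install", "deploy");

    // Returns every phase that runs when you ask for the target phase.
    static List<String> phasesRunBy(String target) {
        int index = PHASES.indexOf(target);
        if (index < 0) {
            throw new IllegalArgumentException("unknown phase: " + target);
        }
        return PHASES.subList(0, index + 1);
    }

    public static void main(String[] args) {
        // Asking for "package" also runs compile, test-compile and test first.
        System.out.println(phasesRunBy("package"));
    }
}
```

so mvn package in this simplified model runs compile, test-compile, test and then package.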

Thursday, March 17, 2016

Head First Design Pattern Appendix: Leftover Patterns

Bridge Pattern 

bridge pattern says: “Decouples an abstraction from its implementation so that the two can vary independently.”

to say it in other words, the Bridge Pattern basically abstracts the abstraction.

for example: the steering wheel in the car is an abstraction, we don't care what is happening under the hood.
this is a nice abstraction for the car, however the steering wheel is also used in ships, in airplanes ....

what we need to do is to abstract the steering wheel, and this is what the Bridge pattern does.

as you can see, we have an abstraction which is SteeringWheel, and this abstraction has an implementation which is SteeringSystem.

another example,
imagine that you want to build a web application framework which lets you build blogs, stores ...

and now you have a new idea which is adding themes to the system, for example a light and a dark theme. if you implement that without the Bridge pattern you will have "Blog Light", "Blog Dark" .... and thousands of classes.

to fix this issue we use the Bridge pattern

NOTE: bridge and strategy have the same structure; the difference is that strategy is about behaviour and bridge is about the structure of the classes
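
the steering wheel example could look like the sketch below. only SteeringWheel and SteeringSystem come from the notes; CarSteering, ShipSteering, PowerSteeringWheel and the method names are made up for the illustration.

```java
// Implementation side of the bridge: how the steering is actually carried out.
interface SteeringSystem {
    String turn(int degrees);
}

class CarSteering implements SteeringSystem {
    public String turn(int degrees) {
        return "car wheels turned " + degrees + " degrees";
    }
}

class ShipSteering implements SteeringSystem {
    public String turn(int degrees) {
        return "ship rudder turned " + degrees + " degrees";
    }
}

// Abstraction side: what the driver or captain holds in their hands.
class SteeringWheel {
    protected final SteeringSystem system;

    SteeringWheel(SteeringSystem system) {
        this.system = system;
    }

    String steer(int degrees) {
        return system.turn(degrees);
    }
}

// A refined abstraction; it can change without touching any SteeringSystem.
class PowerSteeringWheel extends SteeringWheel {
    PowerSteeringWheel(SteeringSystem system) {
        super(system);
    }

    @Override
    String steer(int degrees) {
        return "assisted: " + system.turn(degrees);
    }
}

class BridgeDemo {
    public static void main(String[] args) {
        System.out.println(new SteeringWheel(new CarSteering()).steer(30));
        System.out.println(new PowerSteeringWheel(new ShipSteering()).steer(10));
    }
}
```

with two kinds of wheel and two kinds of system we wrote 4 classes, and every wheel works with every system; without the bridge each combination would need its own class.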


Builder Pattern
the builder pattern allows you to encapsulate the construction of a product and to build the product in steps.

for example:
when you build a vacation, you build the day, then the hotel, then the dinner and so on:

in this case you can use a builder,

another example

let's say we have a
class Person{firstName; lastName; phone; age ....}

if you want to make a constructor with all these fields you will have public Person(firstName, lastName, phone, age ...), which is a big constructor

instead you can define a PersonBuilder that holds all the fields
public class PersonBuilder {
protected String firstName;
protected String lastName;
protected String age;

and add a setter for each field that returns the builder itself, so the calls can be chained

public PersonBuilder setFirstName(String fName) {
this.firstName = fName;
return this;
}

and add a build() method

public Person build() {
return new Person(this.firstName, this.lastName ...);
}

now to create a new person
PersonBuilder x = new PersonBuilder().setFirstName("asdfasfd").setLastName("asdfasdf")....
Person y = x.build();

the question is why we don't do this in the Person class itself, i.e. why don't we define the set methods in the Person class.
the thing is that the object would be in an inconsistent state, because what you would do is
Person x = new Person()
and then start setting the fields

as you can see the object is created first and the fields are set afterwards, so for a while it is only partly initialised; this is the inconsistent state

Chain of Responsibility Pattern

it is like defining a chain in Struts: you have the request object, and each handler in the chain checks the object; if it has something to do it does it, if not it passes the object to the next one:

for example in an email chain of responsibility we check if the email is spam (do this), if it is a fan email (do that), if it is a complaint (do something else) ....

With the Chain of Responsibility Pattern, you create a chain of objects that examine a request. Each object in turn examines the request and handles it, or passes it on to the next object in the chain.
of course more than one handler can handle the request
NOTE: you can define your chain of responsibility as an ArrayList
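
here is a small sketch of the email example with the chain kept in a list. all the class names, keywords and return strings are made up for the illustration, and in this version the first handler that recognises the email stops the chain.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

interface EmailHandler {
    // Returns a result if this handler deals with the email, empty otherwise.
    Optional<String> handle(String email);
}

class SpamHandler implements EmailHandler {
    public Optional<String> handle(String email) {
        return email.contains("win money") ? Optional.of("moved to spam") : Optional.empty();
    }
}

class FanHandler implements EmailHandler {
    public Optional<String> handle(String email) {
        return email.contains("love") ? Optional.of("forwarded to the CEO") : Optional.empty();
    }
}

class ComplaintHandler implements EmailHandler {
    public Optional<String> handle(String email) {
        return email.contains("refund") ? Optional.of("forwarded to legal") : Optional.empty();
    }
}

class EmailChain {
    private final List<EmailHandler> handlers;

    EmailChain(List<EmailHandler> handlers) {
        this.handlers = handlers;
    }

    // Walks the chain in order until one handler takes the email.
    String process(String email) {
        for (EmailHandler handler : handlers) {
            Optional<String> result = handler.handle(email);
            if (result.isPresent()) {
                return result.get();
            }
        }
        return "left in inbox";
    }

    public static void main(String[] args) {
        EmailChain chain = new EmailChain(Arrays.asList(
                new SpamHandler(), new FanHandler(), new ComplaintHandler()));
        System.out.println(chain.process("I want a refund"));
    }
}
```

adding a new kind of email means adding one handler to the list, the other handlers don't change.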
Flyweight Pattern
this pattern is about not creating a lot of objects and reusing the ones that were created before.

an example could be like this

in the ShapeFactory, we have a HashMap to keep the created Shape objects. if the requested Shape was created before there is no need to create a new one, you can return it from the Map
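
a minimal sketch of such a factory, assuming a Shape is identified only by a type string (the field and method names are made up):

```java
import java.util.HashMap;
import java.util.Map;

class Shape {
    final String type;

    Shape(String type) {
        this.type = type;
    }
}

class ShapeFactory {
    // Shapes created so far, keyed by their type.
    private static final Map<String, Shape> CACHE = new HashMap<>();

    // Returns the cached Shape for this type, creating it only the first time.
    static Shape getShape(String type) {
        return CACHE.computeIfAbsent(type, Shape::new);
    }

    public static void main(String[] args) {
        Shape first = getShape("circle");
        Shape second = getShape("circle");
        System.out.println(first == second); // the same object is reused
    }
}
```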

Mediator Pattern
Use the Mediator Pattern to centralize complex communications and control between related objects.

for example, you have a program where if an alarm rings the coffee maker starts working and the shower does something,
and at the same time if the coffee maker works something else will happen

as you can see, that is a lot of communication.

in order to solve this use the mediator pattern

the Mediator pattern is very similar to the observer; however, here we usually care about the execution order of the events, and we keep this order in the Mediator class.
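
a rough sketch of the alarm example is below. the devices only talk to the mediator, and the mediator decides what happens and in which order. the class names, the event names and the toaster reaction are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

class MorningMediator {
    private final List<String> log = new ArrayList<>();
    private final CoffeeMaker coffeeMaker = new CoffeeMaker(this);
    private final Shower shower = new Shower(this);

    // All the "who reacts to what, and in which order" rules live here.
    void onEvent(String event) {
        switch (event) {
            case "alarm":
                coffeeMaker.brew();
                shower.warmUp();
                break;
            case "coffeeReady":
                record("toaster: starting");
                break;
            default:
                break;
        }
    }

    void record(String action) {
        log.add(action);
    }

    List<String> log() {
        return log;
    }
}

class Alarm {
    private final MorningMediator mediator;

    Alarm(MorningMediator mediator) {
        this.mediator = mediator;
    }

    void ring() {
        mediator.record("alarm: ringing");
        mediator.onEvent("alarm");
    }
}

class CoffeeMaker {
    private final MorningMediator mediator;

    CoffeeMaker(MorningMediator mediator) {
        this.mediator = mediator;
    }

    void brew() {
        mediator.record("coffee maker: brewing");
        mediator.onEvent("coffeeReady");
    }
}

class Shower {
    private final MorningMediator mediator;

    Shower(MorningMediator mediator) {
        this.mediator = mediator;
    }

    void warmUp() {
        mediator.record("shower: warming up");
    }
}

class MediatorDemo {
    public static void main(String[] args) {
        MorningMediator mediator = new MorningMediator();
        new Alarm(mediator).ring();
        System.out.println(mediator.log());
    }
}
```

notice that Alarm knows nothing about CoffeeMaker or Shower; to change the morning routine you only edit MorningMediator.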

Memento Pattern
use it when you want to return an object to one of its previous states, for instance when the user requests an undo.
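
a minimal undo sketch, assuming a text editor whose whole state is one string (TextEditor, Memento and EditorHistory are made-up names):

```java
import java.util.ArrayDeque;
import java.util.Deque;

class TextEditor {
    private String text = "";

    // The memento: a snapshot that only the editor can read.
    static final class Memento {
        private final String text;

        private Memento(String text) {
            this.text = text;
        }
    }

    void type(String more) {
        text += more;
    }

    Memento save() {
        return new Memento(text);
    }

    void restore(Memento memento) {
        text = memento.text;
    }

    String text() {
        return text;
    }
}

// The caretaker: keeps the snapshots but never looks inside them.
class EditorHistory {
    private final Deque<TextEditor.Memento> snapshots = new ArrayDeque<>();

    void push(TextEditor.Memento memento) {
        snapshots.push(memento);
    }

    TextEditor.Memento pop() {
        return snapshots.pop();
    }
}

class MementoDemo {
    public static void main(String[] args) {
        TextEditor editor = new TextEditor();
        EditorHistory history = new EditorHistory();
        editor.type("hello");
        history.push(editor.save());
        editor.type(" world");
        editor.restore(history.pop()); // undo
        System.out.println(editor.text());
    }
}
```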

Visitor Pattern

example from uncle bob: 

lets say you want to print this employee report

1429  Bob Martin      432  $22,576
1532  James Grenning  490  $28,776

you can write this code

public abstract class Employee {
  public abstract String reportQtdHoursAndPay();
}

public class HourlyEmployee extends Employee {
  public String reportQtdHoursAndPay() {
    // generate the line for this hourly employee
    return "";
  }
}

public class SalariedEmployee extends Employee {
  public String reportQtdHoursAndPay() { return ""; } // nothing to report
}

now you can iterate over an array of employees and call reportQtdHoursAndPay().
the thing is, here you put the report formatting logic in the Employee class, which actually breaks the single responsibility principle.

in order to fix this you can use the Visitor pattern, where you move the formatting logic to a different class and iterate over the employees just to get the information

public abstract class Employee {
  public abstract void accept(EmployeeVisitor v);
}

public class HourlyEmployee extends Employee {
  public void accept(EmployeeVisitor v) {
    v.visit(this);
  }
}

interface EmployeeVisitor {
  public void visit(HourlyEmployee he);
  public void visit(SalariedEmployee se);
}

public class QtdHoursAndPayReport implements EmployeeVisitor {
  public void visit(HourlyEmployee he) {
    // generate the line of the report.
  }
  public void visit(SalariedEmployee se) {} // do nothing
}

to generate the report

  QtdHoursAndPayReport v = new QtdHoursAndPayReport();
  for (...) // each employee e
    e.accept(v);

in this case we define an EmployeeVisitor interface; this interface just has one visit() method for each type of employee.

in addition we add the method accept() to the Employee class

so now when we iterate over the employees we call the method accept().
after that the method accept() will call the method visit() and pass the object itself (this).

now we are in the method visit() inside QtdHoursAndPayReport and we have access to the employee object; you can print the format you want here and get whatever information you want from the employee class.


Prototype Pattern
Prototype is simply cloning objects.


Interpreter Pattern
use this pattern to build an interpreter for a language.
we will not write anything about it here, it is not used that often.

Sunday, March 13, 2016

Learning Apache Hadoop OREILLY Course

we should know the concept of disk striping. in disk striping, or RAID0, the data is divided into multiple chunks, so if you have 4 hard disks the data will be divided into 4 pieces; that makes accessing the data much faster.

in RAID1, we mirror the data

in order to keep your data safe as well, you should combine RAID1 with RAID0.

Hadoop logically does that in the cluster: it stripes and mirrors data.

- Hadoop is fault tolerant, which means that if a disk is corrupted or a network card stops working, this is fine.
- Hadoop has a master/slave structure.

you should choose a powerful and expensive computer for your master node.
the master node is a single point of failure,
you should have 2 or 3 master nodes in a cluster
you should have redundancy (as it is a SPOF)

You need a lot of RAM, more than 25, as the daemons take a lot of RAM
you should use RAID
you should use hot swap disk drives
you should have redundant network cards
you should have dual power supplies

the bottom line: the MASTER NODE should never go down.

CPU is not as important as RAM here.

you will have 4 to 4000 slave nodes in a cluster
slave nodes are not a single point of failure.
7200 RPM disks are fine
more disks are better, which means 8 * 1 TB is much better than 4 * 2 TB
it is better that all slaves have the same disk size.

a slave does NOT need to be redundant
you don't need RAID or dual network cards or dual power supplies

you need a lot of RAM

let's say you have 10TB of new data every month
you have slaves with 8TB each
you have a replication factor of 3
you should know that you have something called "intermediate data", which is the data generated between MAP and REDUCE. this data takes about 25% of the disk size (in this case 2TB)

the available space formula is = (RAW - ID) / RF = (8 - 2)/3 = 2 TB

which means each slave effectively holds 2TB not 8TB, which means you need 5 new slaves every month (as you have 10TB every month).
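
the same arithmetic as a small Java sketch, so you can plug in your own numbers. ClusterSizing and its method names are made up; the 25% intermediate data share is the rule of thumb from above, not a fixed Hadoop value.

```java
class ClusterSizing {
    // (RAW - ID) / RF, where ID is a fraction of the raw disk size.
    static double usableTbPerSlave(double rawTb, double intermediateFraction, int replicationFactor) {
        double intermediateTb = rawTb * intermediateFraction;
        return (rawTb - intermediateTb) / replicationFactor;
    }

    // Number of slaves needed to hold the given amount of data, rounded up.
    static int slavesNeeded(double dataTb, double usableTbPerSlave) {
        return (int) Math.ceil(dataTb / usableTbPerSlave);
    }

    public static void main(String[] args) {
        double usable = usableTbPerSlave(8, 0.25, 3); // (8 - 2) / 3 = 2.0
        System.out.println(usable + " TB usable per slave");
        System.out.println(slavesNeeded(10, usable) + " slaves per month"); // 10 / 2 = 5
    }
}
```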

the hadoop ecosystem means all the things on top of hadoop;
basically when we say hadoop we mean HDFS and MAPREDUCE.
the main things in hadoop are
1- Name Node: part of the master
2- Secondary Name Node: part of the master
3- Job Tracker: part of the master
4- Data Node: part of the slave
5- Task Tracker: part of the slave

1- HBase: fast scalable NoSQL database
2- Hive: write sql like queries instead of map reduce
3- Pig: write functional queries instead of map reduce
4- Sqoop: pull and push data to RDBMS, used for integration
5- Flume: pull data into HDFS
6- HUE: web interface for users
7- Cloudera Manager: web interface for admins to manage the cluster
8- Oozie: workflow builder
9- Impala: real time sql queries, 70 faster than MapReduce.
10- Avro: serialize complex objects to save them in hadoop
11- Mahout: machine learning in hadoop
12- ZooKeeper
13- Spark
14- YARN
15- Storm

hadoop is used for batch processing of work that can be parallelized, which means that problems that are hard to split, like graph based problems, don't fit well with hadoop

the best is Cloudera


when we talk about Hadoop, we are talking about 2 main things
1- storage: which is HDFS, a distributed redundant storage
2- processing: which is MapReduce, a distributed processing system

some terminology to know:
1- a job: all the tasks that need to run on all the data.
2- a task: an individual unit of work, which is either a map or a reduce
3- Slave/Master: these are computers
4- NameNode, DataNode: these are daemons, which means JVM instances

we have MapReduce v1: old and stable
we have MapReduce v2: new things like dynamic allocation and scalability

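a Hadoop cluster has 5 daemons:
- Storage Daemons:
NameNode (on master), Secondary NameNode (on master), DataNode (on slave)
- Processing Daemons:
JobTracker (on master), TaskTracker (on slave)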

Master Daemons are for orchestration
Slave Daemons are for working

NameNode: handles the storage metadata; it keeps the information in memory for fast access but it also persists it.
Secondary Name Node: it periodically merges the NameNode's fsimage and edit log into a new checkpoint; it is not a failover node
Job Tracker: coordinates processing and scheduling.

NOTE: use different machines for the Name Node and the Secondary Name Node, because if the Name Node machine is lost, the checkpoint kept on the Secondary Name Node can be used to build a new Name Node

NOTE: you can install the job tracker on the same machine as the Name Node, and move it to another machine when your project gets bigger.

Data Node: handles the raw data (read & write)

Task Tracker: handles individual tasks (Map or Reduce)

the data node and the task tracker always send heart beats to the master to tell it that they are alive and what they are working on.

Hadoop run modes

1- Local JobRunner: single computer, single JVM with all daemons, good for debugging
2- Pseudo Distributed: single computer, 5 JVMs (one for each daemon), good for testing
3- Fully Distributed: multiple computers, multiple JVMs, this is the real environment.

when you install Hadoop it is recommended to use linux; use RHEL for the master and CentOS for the slaves.

use Redhat Kickstart to install hadoop on multiple machines.

Elastic Map Reduce

is a hosted solution from Amazon that is similar to hadoop.

it has this structure:

the master instance group: is like the master node
the core instance group: is like the slave node, but it is only responsible for storage (as you can see it uses HDFS)
the task instance group: is like the slave node, but it is only responsible for processing (doing the map reduce jobs)

usually we use S3 to write information and intermediate data.

the core instance group is static, you cannot add any new machine after you start the cluster; however the task instance group is not static, you can add new machines whenever you want


in this lab he created 5 EC2 instances, one master and 4 slaves
he installed Cloudera Manager
he installed Hadoop from Cloudera Manager
then he uploaded some data to Hadoop from the command line
then he ran a Map/Reduce example
then he checked everything from Cloudera Manager

then he gave an example of downloading the Cloudera Quickstart VM, to install hadoop locally

he used the Ubuntu 12.04 AMI


Hadoop Distributed File System (HDFS)

you can use HDFS without MAP/REDUCE; in that case you only need the NameNode, Secondary NameNode and DataNode

when you upload a file to HDFS it will be divided into blocks and stored on the slave nodes

every block will be replicated to 3 machines (by default)

you cannot edit the file you upload to HDFS; if you want to change anything you should delete the file and create it again.

the default block size is 64MB, however it is recommended to change it to 128MB
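
to get a feeling for what the block size means, here is a small sketch that counts the blocks of a file and the block copies stored in the cluster. HdfsBlocks and its methods are made-up names, sizes are in MB, and the numbers follow the defaults from these notes.

```java
class HdfsBlocks {
    // Number of blocks a file is split into; the last block may be partly filled.
    static long blockCount(long fileSizeMb, long blockSizeMb) {
        return (fileSizeMb + blockSizeMb - 1) / blockSizeMb;
    }

    // Total number of block copies stored across the slaves.
    static long storedReplicas(long fileSizeMb, long blockSizeMb, int replicationFactor) {
        return blockCount(fileSizeMb, blockSizeMb) * replicationFactor;
    }

    public static void main(String[] args) {
        System.out.println(blockCount(1000, 64));         // 16 blocks with 64MB blocks
        System.out.println(blockCount(1000, 128));        // 8 blocks with 128MB blocks
        System.out.println(storedReplicas(1000, 128, 3)); // 24 block copies in the cluster
    }
}
```

a bigger block size means fewer blocks per file, so the name node has less metadata to keep track of.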

the name node runs on a master node
it has only metadata information about the files that are stored on the slaves (e.g. name of the file, permissions, where the blocks exist).


the client asks the name node about the file, then the client goes and reads it directly from the slave nodes.

the name node metadata lives in RAM, however it is also persisted.

we have 2 files for the persisted metadata in name node:
1- FSIMAGE: it is a point in time image about the information that exists in HDFS
2- edit log: the changes that happened since we created the FSIMAGE, it stores the delta information

every now and then FSIMAGE and edit log will be merged and saved on the hard disk
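
the relation between the two files can be sketched like this. it is a toy model and not the real file formats: NameNodeMetadata and its methods are made-up names, the fsimage is just a set of paths, and the edit log is a list of create/delete entries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class NameNodeMetadata {
    // Point-in-time picture of the file system (here: just the file paths).
    private final Set<String> fsImage = new TreeSet<>();
    // Changes since the fsimage was written, in the order they happened.
    private final List<String[]> editLog = new ArrayList<>();

    void createFile(String path) {
        editLog.add(new String[] {"create", path});
    }

    void deleteFile(String path) {
        editLog.add(new String[] {"delete", path});
    }

    // Merge: replay the edit log on top of the fsimage, then start a fresh log.
    void checkpoint() {
        for (String[] edit : editLog) {
            if (edit[0].equals("create")) {
                fsImage.add(edit[1]);
            } else {
                fsImage.remove(edit[1]);
            }
        }
        editLog.clear();
    }

    int pendingEdits() {
        return editLog.size();
    }

    Set<String> image() {
        return fsImage;
    }

    public static void main(String[] args) {
        NameNodeMetadata nameNode = new NameNodeMetadata();
        nameNode.createFile("/a");
        nameNode.createFile("/b");
        nameNode.deleteFile("/a");
        nameNode.checkpoint();
        System.out.println(nameNode.image() + ", pending edits: " + nameNode.pendingEdits());
    }
}
```

the longer you wait between merges, the longer the edit log gets and the more work it is to replay it.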

you have to have multiple hard disks with RAID to ensure that you will not lose the data.
it is also better to write a copy to a remote NFS,
and to make a daily or weekly backup.

every 3 seconds the datanode sends a heart beat to the name node
if 30 seconds pass without a heart beat, the node is considered out
if 10 minutes pass with no heart beat, hadoop will start copying the data that should be on that node to another machine.
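
these thresholds can be written down as a tiny sketch. the enum and class names are made up, and the numbers are the ones from these notes; the real values are configurable.

```java
enum NodeState { ALIVE, OUT, RE_REPLICATING }

class HeartbeatMonitor {
    static final int OUT_AFTER_SECONDS = 30;
    static final int RE_REPLICATE_AFTER_SECONDS = 10 * 60;

    // Decides how the name node treats a data node, given the silence so far.
    static NodeState classify(long secondsSinceLastHeartbeat) {
        if (secondsSinceLastHeartbeat >= RE_REPLICATE_AFTER_SECONDS) {
            return NodeState.RE_REPLICATING;
        }
        if (secondsSinceLastHeartbeat >= OUT_AFTER_SECONDS) {
            return NodeState.OUT;
        }
        return NodeState.ALIVE;
    }

    public static void main(String[] args) {
        System.out.println(classify(3));   // ALIVE
        System.out.println(classify(45));  // OUT
        System.out.println(classify(600)); // RE_REPLICATING
    }
}
```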

every hour (or after a restart of the name node) all data nodes send a block report, which is a list of all the blocks that they have.

Hadoop uses checksums to ensure that data is transferred correctly.
every 3 weeks hadoop does a general checksum check on all blocks.


How writing happens in Hadoop

here is an example

so the client divided the file into 4 pieces,
it asked the name node to write the first piece,
the name node gave it a pipeline which is: write to datanode A then C then F
the client writes to A, then A writes to C, then C writes to F
F acks C, C acks A, A acks the client, the client acks the name node and requests a pipeline for the next block.

how do we handle a failed node?

let's say DN_A is bad; the client will try with C, and if that fails with F.
as long as the client is able to write to one node, it can move on to the next block

general information:
1- checksum is used for each block
2- the file consists of the blocks that were written so far, so let's say your file is 4 blocks and you wrote only 2 blocks: at this point your file is only 2 blocks, and HDFS will see your file as 2 blocks.
because of that it is better to have 2 folders. INCOMING: keep here the files that are still being uploaded; once you finish uploading the whole file, move the file to the READY_TO_PROCESS folder.

How reading is handled

the client asks for a file, and the Name Node gives it a read pipeline for each block

Secondary Name Node

as we mentioned before, we have 2 files in the NameNode: fsimage, which is a point in time file, and the edit log, which is the delta since the last fsimage

Note: we have 2 files, fsimage and edit log, because fsimage is a big file and opening a big file for every change would slow down hadoop. that's why we have the edit log, a small file that contains only the delta information; using the edit log means dealing with a small file ==> better performance



IF THE SECONDARY NAME NODE IS DOWN nothing will break; the name node will keep writing to the edit log, the edit log will become bigger and bigger and the system will become slower and slower.


new lab, we used hadoop fs -put

when you do the installation with Cloudera Manager, a trash directory will be created for you by default; when you delete something it will be moved to the trash directory.
if the directory is not created, it is recommended to create one.


High Availability Name Node
the name node as a single point of failure is not acceptable,
that's why we have a new solution by cloudera which was introduced in Hadoop 2, called Name Node high availability.

as you can see, the Standby namenode will take over if the name node is off AND YOU DON'T HAVE TO START THE WRITE OR READ OPERATIONS FROM THE BEGINNING.

NOTE IMPORTANT: the DataNodes send their block reports and heartbeats to both the NN and the Standby NN, so both of them have a complete picture in memory of what is happening (clients themselves only talk to the active NN).

With the architecture above, I can handle the failure of the NAME NODE, however the fsimage and edits log are still a Single Point of Failure.
that is why the High Availability structure introduced a new thing called the JournalNode.

the current active name node now writes its edits, synchronously, to a set of journal nodes (a write counts once a majority of them have acknowledged it); the standby NN reads the edits from these nodes to stay up to date

so that the NameNode and the standby NameNode don't misunderstand each other (maybe both think they are the active name node now), they attach something called an epoch number to each write to the JournalNodes; a write that carries an older epoch is rejected
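
a minimal sketch of the epoch idea, assuming a journal node simply remembers the highest epoch it has seen. the class and method names here are hypothetical and are not the real JournalNode API.

```python
# Toy journal that rejects writers holding a stale epoch (illustrative only).
class JournalNode:
    def __init__(self):
        self.promised_epoch = 0
        self.edits = []

    def new_epoch(self, epoch):
        """Called by a NameNode that is becoming active; must be higher than any seen so far."""
        if epoch <= self.promised_epoch:
            raise ValueError("epoch too old")
        self.promised_epoch = epoch

    def write(self, epoch, edit):
        """Accept the edit only if the writer holds the latest epoch."""
        if epoch < self.promised_epoch:
            return False
        self.edits.append(edit)
        return True


jn = JournalNode()
jn.new_epoch(1)
print(jn.write(1, "mkdir /a"))   # the active name node writes with epoch 1
jn.new_epoch(2)                  # the standby takes over with epoch 2
print(jn.write(1, "mkdir /b"))   # the old name node still thinks it is active
print(jn.write(2, "mkdir /c"))
```

the old name node's write is refused because its epoch is behind, so only one name node at a time can actually change the log.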


we use a cluster of ZooKeeper nodes to determine who is the active name node (the number should be odd to avoid split brain).
as you can see, we have a ZKFC service on the NameNode and on the Standby Node; they send information to the ZooKeeper cluster about the health of their node.
if the NameNode's ZKFC notices that the NameNode is down, it sends this information to ZooKeeper, and ZooKeeper sets the standby node as the active name node and the old name node as the standby one.
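
the reason for an odd number of ZooKeeper nodes is simple majority arithmetic. a quick sketch (the function names are made up for illustration):

```python
# A decision needs a strict majority of the ensemble.
def majority(n):
    return n // 2 + 1


def tolerated_failures(n):
    """How many nodes can be lost while a majority is still reachable."""
    return n - majority(n)


for n in (3, 4, 5):
    print(n, "nodes -> majority", majority(n), "-> can lose", tolerated_failures(n))
```

going from 3 to 4 nodes does not let you lose any more nodes, and an even cluster can be cut into two equal halves where neither side has a majority, so odd sizes are the sensible choice.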

as you can see, HA is complicated: extra machines, extra configuration ...
you don't need this most of the time; the secondary name node on a different machine is usually enough.


scale the name node functions by breaking up the namespace across multiple machines.

Hadoop has authorization, but it doesn't have authentication. for example, let's say you are sending a write request to machine1 as user xxx; user xxx is not authorized to write but user yyy is. simply create user yyy and send the request as yyy: Hadoop will not check that you really are yyy.

to do authentication you should use something else: Kerberos.

Hadoop uses Linux-like permissions.
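
a small sketch of a Linux-like permission check, to show the difference between authorization and authentication. the function can_access and its parameters are made up for illustration; the point is that the user name is just a string supplied by the caller, and nothing here proves who the caller really is.

```python
# Toy owner/group/other permission check (authorization only, no authentication).
def can_access(user, groups, owner, group, mode, want):
    """want is 'r', 'w' or 'x'; mode is an octal int such as 0o640."""
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == owner:
        perms = (mode >> 6) & 7      # owner bits
    elif group in groups:
        perms = (mode >> 3) & 7      # group bits
    else:
        perms = mode & 7             # other bits
    return bool(perms & bit)


# the file is owned by yyy with mode 640
print(can_access("xxx", [], "yyy", "hadoop", 0o640, "w"))   # xxx asks to write
print(can_access("yyy", [], "yyy", "hadoop", 0o640, "w"))   # same request, claiming to be yyy
```

without something like Kerberos, the second call succeeds for anyone who simply says they are yyy.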



these are the players of MapReduce

and here is how the job is done

we also have a new version called MapReduce v2; in this version they focus on the scalability of the job tracker and on removing the restriction on the number of map and reduce tasks that can run on each slave machine.

the Map Reduce configuration files are:
1- mapred-site.xml

in this lab he gave an example of how to run a Java MapReduce job

this is the statement to run a MapReduce job; hadoop-examples.jar contains the Map and Reduce Java classes.

he went over every line of code, you can check it.

How MapReduce works in detail

so to summarize: the job tracker asks the name node where the blocks are, it assigns some slaves to do the map tasks, then it assigns one or more slaves to do the reduce tasks. the reduce task trackers WILL COPY THE OUTPUT OF THE MAP TASK TRACKERS TO THEIR LOCAL MACHINES.
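
the map, copy (shuffle) and reduce steps can be sketched in plain Python. this is a single-process toy, not Hadoop code: each string in blocks stands for one block handled by one mapper, and the function names are made up for illustration.

```python
from collections import defaultdict


def map_phase(block):
    """One map task: emit (word, 1) for every word in its block."""
    return [(word, 1) for word in block.split()]


def shuffle(map_outputs):
    """Copy the map outputs over and group the values by key, sorted by key."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """One reduce call per key: add up the values."""
    return {key: sum(values) for key, values in grouped.items()}


blocks = ["us uk us", "uk fr us"]
result = reduce_phase(shuffle([map_phase(b) for b in blocks]))
print(result)
```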


Hadoop is Rack Aware


Advanced MapReduce: Partitioners, Combiners, Comparators, and more

firstly we should know that the Mappers and Reducers do some kind of sorting

The mapper sorts by key, and the reducer, after the shuffle, also sorts by key.

You can define a Comparator to do secondary sorting, which sorts the values in the Reducer; so in the example above we have us:[55,20], and the secondary sorting will sort it to us:[20,55].
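
a toy sketch of the effect of secondary sorting, assuming we sort the (key, value) pairs by key first and by value second and then group them. in Hadoop this is done with comparators on the keys, not with a Python sort; the name secondary_sort and the uk value are made up for illustration.

```python
from itertools import groupby


def secondary_sort(pairs):
    """Sort by key, then by value, and group the sorted values per key."""
    ordered = sorted(pairs)
    return {key: [value for _, value in group]
            for key, group in groupby(ordered, key=lambda kv: kv[0])}


print(secondary_sort([("us", 55), ("uk", 10), ("us", 20)]))
```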

we can also define what we call a combiner, which is a pre-reducer; the combiner runs in the Map phase. as you can see in the example above, the first mapper adds up the US values and the output was 55: this is the combiner's job.
With a combiner you may reduce the processing time and the amount of intermediate data.
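
a minimal sketch of a combiner, assuming the job is a sum per key. the input values 30 and 25 are invented so that the US total comes out as 55 like in the example; the function name is also made up.

```python
from collections import Counter


def map_with_combiner(records):
    """Pre-sum the values per key on the mapper side, before anything is sent to a reducer."""
    combined = Counter()
    for key, value in records:
        combined[key] += value
    return sorted(combined.items())


# three records go in, only two (key, total) pairs leave the mapper
print(map_with_combiner([("us", 30), ("uk", 5), ("us", 25)]))
```

this only works as-is when the reduce operation can be applied in pieces (like a sum); it is not a general replacement for the reducer.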

we also have something called a partitioner

the mapper can split its output into multiple partitions, and later each reducer fetches the partition that it is interested in.
in the example above we partitioned by key, and as you can see each reducer grabs a specific key.
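
a sketch of partitioning by key, assuming a simple hash of the key modulo the number of reducers. the function names are hypothetical, and zlib.crc32 is used here only to get a stable hash in Python; it is not what Hadoop uses.

```python
import zlib


def partition(key, num_reducers):
    """Pick the reducer for a key; the same key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_output(pairs, num_reducers):
    """Split one mapper's output into one list per reducer."""
    parts = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        parts[partition(key, num_reducers)].append((key, value))
    return parts


for i, part in enumerate(partition_output([("us", 1), ("uk", 2), ("us", 3)], 2)):
    print("reducer", i, part)
```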

There is a full example about writing a Partitioner.


for unit testing you have MRUnit, which is a new Apache project.

he gave a practical example about logging as well

when you do benchmarking we talk about the terasort number; this number gives us an indicator of the performance of the cluster, and whether adding a new machine gave us a gain in performance.

TERASORT is, let's say, the simplest MapReduce job Hadoop can do. to do a TERASORT test you should use 3 scripts:
1- teragen: generates a dataset.
2- terasort: a job that sorts the dataset.
3- teravalidate: validates that the dataset got sorted.
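
the three steps can be imitated in a few lines of Python. these functions are toy stand-ins that only borrow the names of the scripts; they do not call Hadoop, and the seed parameter is an assumption added to make the data repeatable.

```python
import random


def teragen(n, seed=0):
    """Generate n random numbers to sort."""
    rng = random.Random(seed)
    return [rng.getrandbits(32) for _ in range(n)]


def terasort(data):
    """Sort the dataset."""
    return sorted(data)


def teravalidate(data):
    """Check that every element is <= the one after it."""
    return all(a <= b for a, b in zip(data, data[1:]))


data = teragen(1000)
print(teravalidate(terasort(data)))
```

in a real benchmark the interesting number is how long the sort step takes on the cluster.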

Hive vs Pig vs Impala

we know Hive and Pig; we know that they simply convert your requests to MapReduce jobs.
they are in general 10-15% slower than native Java MapReduce.

as Hive and Pig convert the requests to MapReduce, they use the job tracker and task trackers

Impala is developed at Cloudera; it is designed for real-time queries and uses its own daemons, not the task trackers and job tracker. IMPALA DOESN'T USE MAPREDUCE AT ALL.
Impala is not fault tolerant. Basically MapReduce is slow because of the time we need to start the JVM for map reduce jobs, which Impala avoids by using its own daemons.
Impala is on top of Hive, so it uses Hive (actually its query language is a subset of HiveQL)



in HIVE, you can do the installation on each client and start calling.

or, you can have a HIVE server:

we always need a metastore, where we store the mapping between HIVE tables and HDFS data.

NOTE: in HIVEQL there is no update or delete, as HIVE runs on top of Hadoop and, as we mentioned before, you cannot delete or update a record.

Check the HIVE & PIG LAB.



Data Import and Export

we have 2 types of import and export:
1- Real Time Ingestion and Analysis:
products like Flume, Storm, Kafka, and Kinesis
the idea of these products is that you have multiple agents that push and pull data from each other.

these systems don't care if the end system is Hadoop or NoSQL or a flat file

The products are similar, however Storm, Kafka and Kinesis have more analysis functionality than Flume

2- Database Import Export:
Sqoop (SQL to Hadoop)
it is simply a single process that imports/exports data to/from Hadoop.

there is no analysis or filtering or ... just import/export.

you can do something like: at 2:00 pull all the data from Hadoop and put it in table xxx.


Flume is used to move massive amounts of data from system A to system B (which is usually HDFS, MongoDB, NoSQL ...)

He talked about the architecture of FLUME and there is a LAB.


some REST call examples


he gave a lab about sqoop

Oozie is used to build a workflow; the workflow is represented in XML format