Tuesday, October 24, 2017

Everything important for a solution architect

Software Architecture Patterns by Mark Richards


CHAPTER 1- Layered Architecture Pattern

1- there is no rule around the number of layers; some applications might have business and persistence in the same layer.
2- layers marked as closed means that you cannot bypass them, e.g. the presentation layer cannot access the database directly
3- you might add a new layer like this

the Service layer was added as open, which means the business layer can bypass it.

4- a good example of the layered architecture can be found below 

the customer screen receives a getCustomerInformation request, which should return the customer and its orders,
the customer screen passes the request to the customer delegate, which has the logic to determine which component in the business layer should be called; in this case the customer delegate passes the request to the Customer Object
the business layer knows that it has to use the customer DAO and the order DAO, aggregate the results, and send them back
the customer DAO and order DAO fire SQL statements to fetch the information from the db
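The flow above can be sketched in a few lines. This is a hypothetical illustration only: all class names, fields, and the hard-coded "query results" are invented stand-ins for a real presentation/business/persistence stack.

```python
# Layered-architecture sketch (all names and data are invented for illustration).

class CustomerDAO:
    def get_customer(self, customer_id):
        # stands in for a SQL query against the customer table
        return {"id": customer_id, "name": "Jane Doe"}

class OrderDAO:
    def get_orders(self, customer_id):
        # stands in for a SQL query against the order table
        return [{"order_id": 1, "total": 99.0}]

class CustomerObject:
    """Business layer: aggregates the results from both DAOs."""
    def __init__(self):
        self.customer_dao = CustomerDAO()
        self.order_dao = OrderDAO()

    def get_customer_information(self, customer_id):
        customer = self.customer_dao.get_customer(customer_id)
        customer["orders"] = self.order_dao.get_orders(customer_id)
        return customer

class CustomerDelegate:
    """Presentation layer: decides which business component to call."""
    def handle(self, request):
        if request["type"] == "getCustomerInformation":
            return CustomerObject().get_customer_information(request["customer_id"])
        raise ValueError("unknown request type")

result = CustomerDelegate().handle({"type": "getCustomerInformation", "customer_id": 42})
print(result["name"], len(result["orders"]))
```

Note how the screen never talks to a DAO directly: every request goes down through the delegate and the business object, which is exactly what a closed layer enforces.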

5- be careful of the SINKHOLE ANTI-PATTERN: in the layered pattern you might find that a request passes from the presentation layer to the database layer with nothing happening in the layers in between; the business layer is not doing anything, just passing the request through. This is wrong. Follow the 80-20 rule here: if more than 20% of your operations need no processing from a layer, you should start thinking of opening some layers

6- in your documentation you should state clearly which layers are open and why

Architecture Agility (LOW): the architecture is monolithic in nature; even when you split it into layers and components, it is not easy to react to a constantly changing environment

Ease of Deployment (LOW): sure, it depends on how you build your application, but in general a change to one component might need a redeployment of the full stack (or a large part of it); usually you need to schedule that, which is not good for CD

Testability (HIGH): very good, as each component belongs to a specific layer and you can mock the other layers.

Performance (LOW): because of the multi-layer nature, requests pass through several layers, so performance is low

Scalability (LOW): scalability is difficult in this architecture because of its monolithic nature; usually you scale by copying the entire stack

Ease of development (HIGH): usually it is high, as the pattern is well known and you can separate your team based on skills and layers. Here you should know Conway's law, which says that organizations are constrained to produce designs which are copies of the communication structures of these organizations.



CHAPTER 2- Event-Driven Architecture Pattern

It has 2 styles, mediator topology and broker topology

1- mediator topology looks like this

the event comes to a queue; the mediator (or orchestrator) picks the event from the queue, and the mediator knows the steps that should be executed to serve the event.

the mediator then puts events in other channels to get picked up by event processors.

the mediator could be implemented using Apache Camel, Spring Integration, Apache ODE or any BPEL engine ...

sure, the mediator might publish to a queue or a topic depending on the event.
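A minimal sketch of the mediator topology, under the assumptions above: events arrive on one queue, and a mediator that knows the processing steps publishes one message per step to per-step channels. The event type, step names, and payload are invented for illustration.

```python
# Mediator-topology sketch: the mediator holds the step knowledge,
# the processors just consume their channels. All names are invented.
from collections import defaultdict, deque

channels = defaultdict(deque)   # named channels consumed by event processors

STEPS = {                       # the mediator's knowledge: event type -> ordered steps
    "relocation-request": ["change-address", "recalc-quote", "update-claims"],
}

def mediator(event):
    """Publish one message per processing step for this event type."""
    for step in STEPS[event["type"]]:
        channels[step].append(event["payload"])

mediator({"type": "relocation-request", "payload": {"customer_id": 7}})
print(list(channels))
```

In the broker topology there would be no `STEPS` table at all: each processor would publish the next event itself.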


the mediator receives a Relocation Request,
and processes it by adding requests to the channels, where they get picked up by the processors.

2- in the broker topology, there is no mediator; the routing logic is in the event processor itself, which means that after a processor processes the event, it informs the next step

the same Relocation example will look like this

3- BE CAREFUL OF TRANSACTIONAL UNITS: you cannot perform transactions easily in this architecture because of its asynchronous, distributed nature.

Architecture Agility (HIGH): the event processors are single-purpose components; changing them is easy

Ease of Deployment (HIGH): event processors are decoupled, very easy to deploy

Testability (LOW): testing is a little difficult; the architecture is highly asynchronous, so you need special code to track where an event is

Performance (HIGH): even though it is asynchronous, performance is high in this architecture as you are performing parallel asynchronous tasks

Scalability (HIGH): each event processor can be scaled independently

Ease of development (LOW): asynchronous development is difficult.



CHAPTER 3- Microkernel Architecture Pattern

1- you have a core system and you add plugins to this system, for example Eclipse or your web browser.

in the case of business applications, product-based applications are a very good example. Insurance companies, for example, could fit under this category: you have a core claims system, but because you have different insurance products, you can build these products as plugins and plug them into the system.

below you can see a Claims Processing core system and different US-state plugins, as each state has different logic.
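The claims example can be sketched as a tiny plugin registry. This is a hedged illustration: the state names, the dollar caps, and the rule logic are all invented; the point is only that each plugin is a standalone, swappable unit hanging off a generic core.

```python
# Microkernel sketch: a claims-processing core with per-state plugins.
# State rules and amounts are invented for illustration.

plugins = {}

def register(state):
    """Decorator that installs a state-specific rule into the core's registry."""
    def wrap(fn):
        plugins[state] = fn
        return fn
    return wrap

@register("NY")
def ny_rules(claim):
    return claim["amount"] <= 10_000        # invented NY-specific cap

@register("CA")
def ca_rules(claim):
    return claim["amount"] <= 25_000        # invented CA-specific cap

def process_claim(claim):
    """Core system: generic flow; the state-specific logic comes from the plugin."""
    rule = plugins.get(claim["state"])
    if rule is None:
        raise LookupError(f"no plugin installed for state {claim['state']}")
    return "approved" if rule(claim) else "rejected"

print(process_claim({"state": "NY", "amount": 5_000}))   # approved
```

Adding support for a new state means registering one more function; nothing in the core changes, which is why agility and deployability are rated high below.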

Architecture Agility (HIGH): plugins are standalone, you can change them easily

Ease of Deployment (HIGH): plugins are standalone, you can deploy them independently

Testability (HIGH): you can test plugins independently

Performance (HIGH): usually high, because you can remove the plugins that you don't need.

Scalability (LOW): in general it is low; you can only scale at the plugin level

Ease of development (LOW): not easy to build such a system


CHAPTER 4- Microservices Architecture Pattern

1- the microservices architecture might look like this

the client sends a request to an API, and the API communicates with the services

the second style might look like this

here you would have a traditional web-based application (or any kind of application) with an interface (e.g. an Angular application), and this application communicates with the services

the third style might look like this

this one has an additional layer, which is a message broker; there is NO ORCHESTRATION IN THIS LAYER, it is simply for queuing, monitoring ...

2- you should avoid ORCHESTRATION and inter-service communication; if you have a lot of it, STOP and move to SOA
3- you can solve inter-service communication by using a shared db
4- you can solve inter-service communication by copying one service's functionality to another
5- if you find that you need to orchestrate, or you have a lot of inter-service communication, then microservices might not be a good choice

Architecture Agility (HIGH): independent services
Ease of Deployment (HIGH): independent services

Testability (HIGH): independent services

Performance (LOW): the author considers it low because of the distributed nature (remote calls between services)

Scalability (HIGH): independent services

Ease of development (HIGH): independent services


CHAPTER 5- Space-Based Architecture Pattern

Space-based architecture is about scalability (check below; the request comes to the virtualized middleware)

scalability is achieved by removing the central database constraint and using replicated in-memory data grids instead. Application data is kept in memory and replicated among all the active processing units; as you can see there is no centralized db, and each processing unit holds its own copy of the data

the processing unit looks like this

as you can see, there is an in-memory data grid, and a data-replication engine to replicate with the other processing units

The virtualized middleware is the key to this architecture and has the following elements:
Messaging Grid: when a request comes into the virtualized-middleware component, the messaging-grid component determines which active processing components are available to receive the request and forwards the request to one of those processing units

Data Grid: the data grid interacts with the data-replication engine in each processing unit to manage the data replication between processing units when data updates occur

Processing Grid: it is the orchestrator, If a request comes in that requires coordination between processing unit types (e.g., an order processing unit and a customer processing unit), it is the processing grid that mediates and orchestrates the request between those two processing units.

Deployment Manager: manages the dynamic startup and shutdown of processing units based on load conditions
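The messaging grid's job can be sketched in a few lines. This is an assumption-laden illustration: the unit names are invented, and round-robin is just one possible selection policy (the real middleware decides how to pick an active unit).

```python
# Messaging-grid sketch: track active processing units and forward each
# incoming request to one of them. Unit names and policy are invented.
from itertools import cycle

class MessagingGrid:
    def __init__(self, units):
        self._active = list(units)       # the currently active processing units
        self._next = cycle(self._active) # simple round-robin selection

    def forward(self, request):
        """Pick an active processing unit and hand it the request."""
        unit = next(self._next)
        return unit, request

grid = MessagingGrid(["pu-1", "pu-2", "pu-3"])
print([grid.forward({"n": i})[0] for i in range(4)])  # pu-1, pu-2, pu-3, pu-1
```

The deployment manager would grow or shrink the list of active units under load; the data grid keeps each unit's in-memory copy in sync.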


Service-Oriented Architecture

Business Services: these are high-level services, like ProcessClaim or ExecuteTrade; these services are owned by business users
Enterprise Services: high level, owned by the architects, like CreateCustomer or CalculateQuote
Application Services: these are owned by developers and are fine grained, like AddDriver or AddVehicle.
Infrastructure Services: here we have non-business functionality, like WriteToLog, SSO, CheckCredentials
Message Bus: here you do the orchestration and choreography

Finally the comparison


Microservices vs. Service-Oriented Architecture by Mark Richards

SOA and microservices share the same characteristics and difficulties; some of these difficulties are:

1- Service Contract: the contract might change; how can you handle this? You have service-based contracts and consumer-driven contracts.
you need to use contract versioning.
it is very important to find a way to inform your consumers about your changes.

2- Service Availability: how do you set a timeout for a service? is it based on load testing? maybe you come up with the wrong numbers; let's say you set it to 8 seconds, which means the consumer will wait for 8 seconds before it knows that the service is down.
one solution is to use the CIRCUIT BREAKER PATTERN.
don't use a global timeout for a service; make it smarter and write some logic to update this timeout all the time
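The circuit-breaker idea mentioned above can be sketched as follows. This is a deliberately minimal illustration, not a production breaker: the failure threshold, the fake flaky service, and the lack of a half-open/reset timer are all simplifying assumptions.

```python
# Circuit-breaker sketch: after N consecutive failures the breaker opens and
# calls fail fast instead of waiting out a long timeout. Names are invented.

class CircuitBreaker:
    def __init__(self, call, threshold=3):
        self.call = call
        self.threshold = threshold
        self.failures = 0

    def __call__(self, *args):
        if self.failures >= self.threshold:
            # open circuit: fail immediately, no waiting on a remote timeout
            raise RuntimeError("circuit open: failing fast")
        try:
            result = self.call(*args)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0   # a success closes the circuit again
        return result

def flaky_service(x):
    raise TimeoutError("service down")   # stand-in for a dead remote service

guarded = CircuitBreaker(flaky_service)
for _ in range(3):
    try:
        guarded(1)
    except TimeoutError:
        pass                             # three real failures recorded
try:
    guarded(1)
except RuntimeError as e:
    print(e)                             # circuit open: failing fast
```

The consumer's fourth call returns instantly instead of hanging for the full 8-second timeout, which is exactly the availability win the pattern is after.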

3- Security: where should you implement authentication and authorization? you can have another service that is responsible for both authentication and authorization, or make a service for authentication and put the authorization in the service itself.

4- Transactions: you should accept that ACID is difficult in a distributed environment like the service environment; here we have BASE: Basically Available, Soft state, Eventually consistent.
If you want ACID, think of moving all the ACID parts to one service.

SOA VS Micro-services Characteristics

1- Service Taxonomy: how services are classified in the architecture
in microservices the taxonomy is very simple; the types are functional services and non-functional services.
usually non-functional services are internal services and not exposed to the public.

in SOA we have a larger taxonomy,

Business Services: high-level, coarse-grained services; these are the enterprise-level services. You can identify these services by asking the question "ARE WE IN THE BUSINESS OF ...", for example consider the ProcessTrade and InsertCustomer services. Saying "Are we in the business of processing trades" makes it clear that ProcessTrade is a good business service candidate, whereas "Are we in the business of inserting customers" is a clear indication that the InsertCustomer service is not a good abstract business service candidate

business services are VERY ABSTRACT; they are devoid of any implementation or protocol, and they usually only include the name of the service, the expected input, and the expected output. You might add some orchestration at this level.

Business services are typically represented through either XML, Web Services Definition Language (WSDL), or Business Process Execution Language (BPEL).

Enterprise Services: coarse-grained services that implement the functionality of business services. Enterprise services can have a one-to-one or one-to-many relationship with a business service.
The middleware is usually the bridge between business services and enterprise services.
Enterprise services are generally shared across the organization.
examples: CheckTradeCompliance, CreateCustomer, ValidateOrder, and GetInventory

Application Services: fine-grained AND APPLICATION-SPECIFIC services. For example, an auto-quoting application that is part of a large insurance company might expose services to calculate auto insurance rates, something that is specific to that application and not to the enterprise. Application services may be called directly through a dedicated user interface, or through an enterprise service. Some examples of application services are AddDriver, AddVehicle, and CalculateAutoQuote.

Infrastructure Services: non-functional services.

VERY IMPORTANT: as an architect you can choose to use the standard service types or completely discard them and create your own classification scheme. Regardless of which you do, the important thing is to make sure you have a well-defined and well-documented service taxonomy for your architecture.

2- Service Ownership and Coordination
In microservices, the owner is the application dev team.

in SOA, you have different owners

as you can see, in microservices you have minimal communication; in SOA, if you want to add one service you need to communicate with many teams.

3- Service Granularity
microservices are small, fine-grained services, even more, they are generally single-purpose services that do one thing really, really well.

IN SOA, we have many layers and each layer has a different granularity.

VERY IMPORTANT: service granularity has an effect on performance and transactions; if your services are too fine grained, performance might get affected because of calling multiple services, and it wouldn't be easy to run transactions

SOA VS micro-services Comparing Architecture Characteristics

1- Component Sharing
by components we mean a set of roles and responsibilities with a well-defined interface.
in our case, the component is a service.

in SOA we have the concept of SHARE-AS-MUCH-AS-POSSIBLE; in microservices, SHARE-AS-LITTLE-AS-POSSIBLE

lets take the example below for SOA: we have 3 services, and all of them need the ORDER SERVICE; however, each one has a different way of making orders;

in SOA, this is a candidate for an Enterprise service; it will look like this

as you can see, the service is smart enough to select which db should be used.

The problem is that you now have one shared service, and testing this service is difficult.

microservices follow the domain-driven BOUNDED CONTEXT design principle, which means everything you need is in one place; in other words, don't share, even if it means violating the DRY (Don't Repeat Yourself) principle.

well, in microservices you do share some services, for example infrastructure services; the point is to keep them to a minimum.

The benefit of this is that every team is responsible for its own work.

2- Service Orchestration and Choreography

In microservices, we use choreography, where each service calls the next service in the process, so the service has some knowledge of the flow; the reason behind this is that we don't have a middleware in microservices. HOWEVER, YOU SHOULD MINIMIZE THIS INTERACTION; AS WE MENTIONED BEFORE, THE INTERACTION SHOULD ONLY BE WITH INFRASTRUCTURE SERVICES.

in order to avoid choreography in microservices, you can build more coarse-grained services.

in SOA, we use both service orchestration and choreography; orchestration happens in the middleware, and of course services call each other as well

3- Middleware vs API LAYER 

lets go back to the microservices architecture

as you can see, we have the API LAYER; THIS IS NOT MIDDLEWARE, it is just a facade, so rather than giving the consumer the address of the service, you give it the address of the API LAYER.
In addition, this is good for service granularity: let's say you have service X, which you later find is too coarse grained and you want to split into 2 services; you can split it without telling the consumer, the consumer will keep using the API LAYER, and the API LAYER will call the 2 services.
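The splitting trick can be sketched concretely. This is a hedged illustration with invented service names: the consumer keeps calling the facade, while behind it one coarse-grained service has already been split in two.

```python
# API-layer facade sketch: the consumer's entry point never changes,
# the routing behind it can. All service names are invented.

def customer_profile_service(cid):
    """New fine-grained service #1 (was half of the old coarse service)."""
    return {"id": cid, "name": "Jane"}

def customer_orders_service(cid):
    """New fine-grained service #2 (the other half)."""
    return [{"order_id": 1}]

class ApiLayer:
    """Facade: fans the old single call out to both new services."""
    def get_customer(self, cid):
        profile = customer_profile_service(cid)
        profile["orders"] = customer_orders_service(cid)
        return profile

print(ApiLayer().get_customer(7)["name"])
```

Because the facade does no mediation, transformation, or orchestration, it stays a thin API layer rather than becoming middleware.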

SOA uses a middleware, where you put mediation and routing, message enhancement, message transformation, and protocol transformation.

4- Accessing Remote Services
Microservices use REST-based or message-based (e.g. JMS) access; of course you can use other protocols, but microservices want to limit that and avoid mixing, so you don't have to mix REST with JMS with something else.
SOA has no limitation on this; in fact, the middleware has the capability of doing protocol transformation.

Comparing Architecture Capabilities

1- Application scope:
SOA is well-suited for large, complex, enterprise-wide systems that require integration with many heterogeneous applications and services. It is also well-suited for applications that have many shared components, particularly components that are shared across the enterprise.

Small web-based applications are also not a good fit for SOA because they don’t need an extensive service taxonomy, abstraction layers, and messaging middleware components.

The microservices pattern is better suited for smaller, well-partitioned web-based systems rather than large-scale enterprise-wide systems. The lack of a mediator (messaging middleware) is one of the factors that makes it ill-suited for large-scale complex business application environments. Other examples of applications that are well-suited for the microservices architecture pattern are ones that have few shared components and ones that can be broken down into very small discrete operations.

of course, sometimes you start with microservices and move to SOA, or the other way around.

2- Heterogeneous Interoperability

in microservices, the protocol is always the same (e.g. REST or message based); there is no middleware to change protocols.

as you can see, here we have REST; it is always REST, it cannot be REST OR MESSAGE-BASED.

however, as you can see, the implementation is up to you: Java, .NET, ...

SOA is perfect for heterogeneous protocols because of the middleware

3- Contract Decoupling
Contract decoupling means that the consumer can send the request in whatever format it likes and the service will accept different formats.

in SOA we can do this as we have the middleware; in this case you can transform and enhance the request

in microservices we don't have contract decoupling, as we don't have the middleware.


Microservices AntiPatterns and Pitfalls by Mark Richards

1- AntiPattern: something that looks good at the beginning but later causes a lot of trouble
2- Pitfall: something that is bad from the beginning.

When we talk about services we talk about a SERVICE COMPONENT; a component is a unit of software that is independently replaceable and upgradeable (Martin Fowler).

one of the most important concepts in microservices is the Bounded Context: a bounded context means that the service is bounded with its data (they are a single unit), which means the service owns its data. The bounded context is not only about the data, it is also about other services: a microservice should not depend on other services (at least the dependency should be minimal). This makes the microservice a single unit that can be deployed and tested easily

Data Driven Migration Anti Pattern

1- we need to migrate the monolithic database to microservice databases (part of the bounded-context architecture)
2- normally what you want to do is this

so you want to split the functionality and the database,

3- don't do that at the beginning; first split the functionality and let the services use the same DB, then when you are happy with the granularity you can start splitting the db.
4- why do we do this? simply because you will not get the granularity right the first time

All the world's staging Anti Pattern

in this anti-pattern, you should not focus on the devops tasks first and push all the functional work to the end; work in parallel

for example, people usually make this mistake,

as you can see, the devops tasks take 4 months before you start the first functional iteration, and then the non-functional work takes one month before you reach iteration 5, which is the main business functionality. You cannot sell this to your product owner

do the work in parallel like this

as you can see, we work in parallel, and you should always work together with devops because something might change during the work.

Friday, November 18, 2016



Git Essentials LiveLessons



install git from https://git-scm.com/download/win
install sublime from https://www.sublimetext.com/3

from git bash set your global variables

//set global user name
 git config --global user.name "hassan jamous"
//set global email
 git config --global user.email "hassan.jamous@gmail.com"
//set global color
 git config --global color.ui "auto"
//set global editor to sublime
 git config --global core.editor "'C:\Program Files\Sublime Text 3\sublime_text.exe' -w"

you can view your global configuration 
git config --list


create a git repository
you can create a git repository in any folder, 
cd to the folder and type
git init
git will create a hidden .git folder in this folder; the repository then covers this folder and all its subfolders. you can type
ls -a to check

Branch Master
when you create a new git repository, you will be working on the Master branch.

Untracked, Staging and Commit
when you create a new file in your repository, the file will be untracked.
to track the file you type 
git add FILENAME
or you can use
git add .
to add all the files to the staging area

after you add the file the file is in the staging area.

to commit your changes, which means save it to the branch you type
git commit -m "commit message"

to check the commit log
git log

Now if you change a file, the file will NOT be in the staging area; you should add it then commit.

the HEAD is your last commit, so when you commit new changes you are moving the HEAD to the new commit

Check the differences
if you change a file, then before moving the file to the staging area you can compare your changes with the last commit by typing

git diff 

if the file is already in the staging area you should type

git diff --staged

after you commit you can compare with the previous version by
git diff HEAD~1
which means compare with one commit before the HEAD; you can use HEAD~2 or 3 ...

also you can compare with a commit id; if you use
git log
you will get something like

commit 5dbed2e0bcd7bdb844d6a6fdfc6519b9f5da7e31
Author: hassan jamous <hassan.jamous@gmail.com>
Date:   Wed Nov 16 19:00:50 2016 +1100

    second commit

commit c39fbd23776eb5e569bff21b5bd8d05eacb1facd
Author: hassan jamous <hassan.jamous@gmail.com>
Date:   Wed Nov 16 18:42:30 2016 +1100

    first commit

you can use the commit id to compare the differences
git diff c39fbd23776eb5e569bff21b5bd8d05eacb1facd

you can move your HEAD between commits by using git checkout.
for example you can move your head to the previous commit

git checkout HEAD~1

if you type git status here, it will tell you that the HEAD is detached and now points to the previous commit
$ git status
HEAD detached at c39fbd2
nothing to commit, working tree clean

to go back to the last commit type 
git checkout master 

sure, you can also checkout a commit id
git checkout c39fbd23776eb5e569bff21b5bd8d05eacb1facd

lets say you want an old version of a file; you simply checkout that file from the commit you want

git checkout HEAD~1 readme.txt

notice here that you are not moving the HEAD; you are just telling git that you want the file from HEAD~1

of course you can also use the commit id
git checkout c39fbd23776eb5e569bff21b5bd8d05eacb1facd readme.txt

now if you check git status, you will see that the file is modified and in the staging area, 
$ git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        modified:   README.txt

now you can commit your new changes.

Deleting a file
when you delete a file, the deletion will not be in the staging area; you can confirm the delete simply by typing
git add FILENAME
to add the deletion to the staging area, then
git commit -m 'we deleted the file'
to commit

now if you want to reverse the deletion instead, you don't want to add it to the staging area; you should type
git checkout master readme.txt
notice that we returned to the master version

Moving a file from a staging area to out of the staging area (undo the git add)
to move the file from the staging area type
git reset HEAD readme.txt

undo your changes
if you make a change that is not in the staging area and you want to undo the change
git reset --hard

Adding new folder to git repository
if you create a new empty folder, you will notice that git does not recognise it; you need a file in the folder for it to be recognised.
that's why people create a .gitkeep file inside empty folders; git will then recognise the folder, and the .gitkeep file is hidden so normal users will not see it.
of course you can see the hidden file from the bash, by running ls -a

ignore files
to ignore files from git, you should create a .gitignore
file in the root folder of the repository; inside this file you can add the files or patterns that you would like to ignore.

force ignored file to be committed 
to force an ignored file to be committed
git add -f FILENAME


GIT is structured this way

as you can see, you have a local copy, and you have a remote; the remote could be GITHUB, BITBUCKET, GITLAB or anything that follows the git structure

you can have multiple remotes; however, the primary remote is called origin (this is a convention)

you push or pull from remote

adding a remote
1- create a repository on github
2- we will use this repository as a remote
3- we will add the remote, and as it is going to be our primary remote we will call it origin

git remote add origin http://github.com/hassan.jamous/SomeStuff.git
as you can see we named this remote ORIGIN; you can call it whatever you want, but as it is the primary we are following the convention ==> it should be called origin.

now you need to push your repository to GITHUB
git push origin master

which means I want to push my master branch to the remote called origin.

it will ask you for github username and password

checking what remote do you have 
you can use 
git remote -v
to get the list of remote repositories that you have; you will get something like
$ git remote -v
origin  https://github.com/hassan-jamous/SomeStuff.git (fetch)
origin  https://github.com/hassan-jamous/SomeStuff.git (push)

as you can see, for each remote you have two entries, one to fetch the code and another to push the code.

USE SSH to connect to github
for any change that you want to do on github, you need to provide a username and password.
in order to avoid this, you should use an ssh url rather than an http url to connect to github

to do that 
1- go to your home folder and create a .ssh folder (on linux ~/.ssh, on windows C:\Users\Hassan\.ssh)
2- cd to that folder 
3- type  ssh-keygen
4- you will receive this message Enter file in which to save the key (/c/Users/hassa/.ssh/id_rsa):
5- put the file name like id_rsa
6- then you will get the following message

Your identification has been saved in id_rsa.
Your public key has been saved in id_rsa.pub.
The key fingerprint is:

now, your public key is stored in id_rsa.pub

7- open id_rsa.pub
8- copy the key
9- go to github, open settings menu then SSH and GPG keys

10- add a new key, and paste the key.

11- now get the SSH location from github

and type
git remote add origin SSHLOCATION

so now lets say you updated a file and committed the changes; these changes are stored locally. to push these changes to github

git push origin master

now the file will go to github 

Pulling changes from GIT HUB
you can edit files on github directly; lets say you edited the file on github, or you just want to pull the latest changes

you can type
git pull origin master

when you push your changes, you might get an error which says that a conflict has happened because you don't have the latest version.
you should pull the latest changes and then push
git pull origin master
git push origin master

when you pull, auto merge might not be possible, so you should handle the conflict manually

Creating a new branch

to create a new branch you can type
git branch BRANCHNAME

this will create a branch from where you are, so if you are on master the branch will be created from master.

or you can type 
git checkout -b BRANCHNAME

to list the branches you have
git branch -a

to move from one branch to another
git checkout BRANCHNAME

to delete a branch
git branch -d BRANCHNAME

in order to force delete a branch (in case there is some work on this branch that is not merged yet)
git branch -D BRANCHNAME
use a capital D

to merge changes into the master branch, you should first checkout that branch
git checkout master

then you can merge
git merge BRANCHNAME

when you have a new task, 
1- checkout from the master branch 
git checkout -b NEWJIRA
2-do the changes that you want
3- now we should push this branch to remote
git push origin NEWJIRA
4- go to github website and create a pull request, here you should specify the base branch and the branch that you want to be merged (base branch will be master, the branch to be merged is NEWJIRA)
5- someone will see the pull request, will review it and accept the request.
6- after this you can delete the NEWJIRA branch from your github.

now you have merged the NEWJIRA branch into the master branch on github, which means you merged on the REMOTE. you need to pull these changes to your local master

git checkout master
git pull origin master

now your master is similar to the remote master

when you type
git branch -a
you will get something like
$ git branch -a
* master

as you can see, it lists the branches that you have locally, and the remote branches.
lets say that you went to github and deleted testingBranch, so you are deleting the remote testingBranch

after you do that you should update your local repository in order to be synced with the remote; to sync your local repository with the remote WITHOUT CHANGING YOUR LOCAL BRANCHES, you should use git fetch

git fetch

this will sync the remote branches; however, in order to remove the deleted remote-tracking branches as well, you should type

git fetch --prune
now if you type
git branch -a

you will notice that the remote testingBranch is no longer listed.

basically git pull is git fetch + git merge

you can use 
git log 
to get the log of a branch
however, there is too much information there; in order to see a more compact version

git log --oneline

this will print one line for each commit 

to print all commits
git log --oneline --all 

to print a graph and to see which branch merged the changes

git log --oneline --all --decorate --graph 

from the log you can see how the branches are related, it will tell you which branch is before another, and which branches are pointing to the same thing

for example, the following image tells us that development, origin/master and master branch are on the same level

and this image tells you that origin/master and master are on the same level,
and feature/folder_documentation, origin/development and development are on the same level and ahead of the master branch

in order to sync branches we used git merge; what basically happens in git merge is the following

lets say you have this case

now when you merge you do the following
git checkout master
git merge experiment

and this what will happen

there is another way to merge which is rebase

lets say we have the following 

rather than going to master and merging, we will do the following
git checkout experiment
git rebase master

now this is what will happen; the output will be

First, rewinding head to replay your work on top of it...
Applying: added staged command

so we replayed experiment as C4' on top of master

now we will type

$ git checkout master
$ git merge experiment

and the result
Fast-forwarding the master branch.

Lesson 4

Adding a collaborator 
if you want to add someone to your github project so they can push and pull, go to your repository then choose Settings -> Collaborators, and then add the collaborator

now the collaborator should download the project; to do that use git clone
this will download the project and will create the required remotes.

now the collaborator can push and pull

if you have a big project with many collaborators, you will not add all of them; the best solution for this is to FORK

forking means that you take the project REMOTE and clone it to your own remote.
so when you fork, it means that you are taking this project to your account (your remote), and when you push and pull you are basically doing that on your account, not the project account.

then you do the update on your account (your remote) and create a pull request to merge it to the project remote

so, from GITHUB you can press the FORK button; now the project is forked and it is in your account.
you can copy the SSH link and use
git remote add origin SSHLINK

now make your changes and push them to your remote, then create a pull request to merge them into the project remote.

Now, when many people fork the project, you will have a sync problem with the project.
to handle this you should add a new remote, which is the project itself,
so now you have your account remote (which we call ORIGIN) and you add the project remote (which we call UPSTREAM).
git remote add upstream PROJECT_SSH_LINK

it is very important that what you do is:
1- pull from UPSTREAM (i.e. get the latest version from project remote)
2- push to ORIGIN (i.e. push your change to your remote)
3- create a pull request to merge from ORIGIN to UPSTREAM
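Those three steps can be simulated entirely on local disk, with two bare repositories standing in for the project remote (upstream) and your fork (origin); all the paths and names here are made up for the demo:

```shell
set -e
work=$(mktemp -d) && cd "$work"

git init -q --bare upstream.git              # stand-in for the project's GitHub repo
git clone -q upstream.git seed && cd seed    # seed upstream with one commit
git config user.email demo@example.com && git config user.name Demo
echo v1 > README && git add README && git commit -qm "initial"
git push -q origin HEAD && cd ..

git clone -q --bare upstream.git origin.git  # "forking" is a server-side clone

git clone -q origin.git contrib && cd contrib  # the collaborator clones the fork
git config user.email demo@example.com && git config user.name Demo
git remote add upstream "$work/upstream.git"
branch=$(git symbolic-ref --short HEAD)

git pull -q upstream "$branch"          # 1- pull the latest from the project remote
echo change >> README && git commit -qam "my change"
git push -q origin HEAD                 # 2- push your change to your own remote
# 3- on GitHub you would now open a pull request from origin to upstream
```

The only piece that cannot be scripted locally is step 3, the pull request itself, which lives on the hosting service rather than in git.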

There are some situations when you hit a conflict during git rebase;
after resolving it you should run
git rebase --continue
(or git rebase --skip to drop the commit)

and when you rebase you usually have to force push:
git push -f origin master


check this url for branching http://nvie.com/posts/a-successful-git-branching-model/ 

Monday, May 23, 2016

OREILLY Learning Apache Maven

Maven has 3 lifecycles: clean, default and site.
each lifecycle has a lot of phases; executing a phase executes all the previous phases as well.

Maven is convention over configuration, which means you don't tell Maven in some configuration file where your Java files are; by convention they must be in src/main/java

You can change these conventions but it is not recommended; you may need to if you are working on a legacy application

you can have a terminal inside Eclipse; use the TCF Terminal plugin

Inheritance in Maven
all the directories and conventional defaults are defined in the Super POM, which every pom.xml implicitly inherits from

Maven profiles
you can use Maven profiles to build the project based on your environment: for example one profile for the test environment, another for DEV, another for production.

now when you run maven use -P
mvn -Pproduction package

if you don't want to use -P you can do something else:
you can define an environment variable (e.g. in Windows) and then reference it in pom.xml

Maven will then check the PACKAGE_ENV environment variable to determine which profile to use
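A minimal sketch of what that looks like in pom.xml (the profile id and the PACKAGE_ENV variable name are just examples; environment variables are exposed to activation with the env. prefix):

```xml
<profiles>
  <profile>
    <id>production</id>
    <activation>
      <!-- activated when the PACKAGE_ENV environment variable equals "production" -->
      <property>
        <name>env.PACKAGE_ENV</name>
        <value>production</value>
      </property>
    </activation>
    <!-- production-specific build settings go here -->
  </profile>
</profiles>
```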

Maven Dependency

Maven can handle transitive dependencies, which means if you depend on X.jar and X.jar depends on Y.jar, Maven will fetch Y.jar as well

You can define Remote repositories in Maven.

You can define a scope for your dependencies; for example, you don't need JUnit when you compile, only when you test (the "test" scope)

Maven can handle conflicts, for example: you depend on X.jar and Y.jar, X.jar depends on Z.jar version 1 and Y.jar depends on Z.jar version 2. Maven resolves this with "nearest wins": the version declared closest to your project in the dependency tree is picked (not necessarily the latest). You can also control this behaviour using the <exclusion> tag.
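A sketch of what an exclusion looks like (the group and artifact names X and Z are the hypothetical ones from the example above):

```xml
<dependency>
  <groupId>com.example</groupId>
  <artifactId>X</artifactId>
  <version>1.0</version>
  <exclusions>
    <!-- keep X, but drop the copy of Z it would pull in transitively -->
    <exclusion>
      <groupId>com.example</groupId>
      <artifactId>Z</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```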

Maven Lifecycles 
there are 3 different lifecycles in Maven: default, clean and site
lifecycles have phases
phases are bound to plugins; each plugin has goals that must pass in order for the phase to pass

default is the most used lifecycle
in default you have these phases:
compile: compile everything in src/main/java
test-compile: compile everything in src/test/java
test: run the unit tests
package: create the jar, war or ear
install: take the generated package and put it in the local repository so other projects can use it as a dependency
deploy: take the package and put it in a remote repository, so other teams in the company can use it

Sunday, March 13, 2016

Learning Apache Hadoop OREILLY Course

we should know the concept of disk striping: in disk striping, or RAID 0, the data is divided into multiple chunks, so if you have 4 hard disks the data is divided into 4 pieces; this makes accessing the data much faster.

in RAID 1, we mirror the data

in order to ensure that your data is safe you should combine RAID 1 with RAID 0.

Hadoop logically does that in the cluster: it stripes and mirrors data.

- Hadoop is fault tolerant: if a disk gets corrupted or a network card stops working, that is fine.
- Hadoop has a master/slave structure.

you should choose a powerful and expensive computer for your master node.
the master node is a single point of failure (SPOF),
so you should have redundancy: 2 or 3 master nodes in a cluster

You need a lot of RAM, more than 25 GB, as the daemons take a lot of RAM
you should use RAID
you should use hot-swap disk drives
you should have redundant network cards
you should have dual power supplies

the bottom line: the MASTER NODE should never go down.

CPU is not as important as RAM here.

you will have 4 to 4000 slave nodes in a cluster
slave nodes are not a single point of failure.
7200 RPM disks are fine
more disks are better, which means 8 * 1 TB is much better than 4 * 2 TB
it is better that all slaves have the same disk size.

of course the slave IS NOT redundant:
you don't need RAID, dual network cards or dual power supplies

you need a lot of RAM

let's say you get 10 TB of new data every month,
you have slaves with 8 TB of disk,
and a replication factor of 3.
you should know that there is something called "intermediate data", the data generated between MAP and REDUCE; it is about 25% of the disk size (in this case 2 TB)

the available-space formula is = (RAW - ID) / RF = (8 - 2) / 3 = 2 TB

which means each slave effectively holds 2 TB, not 8 TB, so you need 5 new slaves every month (as you get 10 TB every month).
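The sizing above can be sanity-checked with a few lines of shell arithmetic (the numbers are the ones from the notes):

```shell
RAW=8                          # raw TB per slave
ID=$(( RAW * 25 / 100 ))       # intermediate data between MAP and REDUCE: ~25% of disk
RF=3                           # replication factor
USABLE=$(( (RAW - ID) / RF ))  # (8 - 2) / 3 = 2 TB of effective space per slave
echo "usable TB per slave: $USABLE"

MONTHLY=10                     # TB of new data arriving each month
echo "new slaves needed per month: $(( (MONTHLY + USABLE - 1) / USABLE ))"
```

The last expression rounds up the division, since you cannot buy a fraction of a slave.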

the Hadoop ecosystem means all the things on top of Hadoop;
basically when we say Hadoop we mean HDFS and MAPREDUCE.
the main daemons in Hadoop are
1- Name Node: part of the master
2- Secondary Name Node: part of the master
3- Job Tracker: part of the master
4- Data Node: part of the slave
5- Task Tracker: part of the slave

1- HBase: fast, scalable NoSQL database
2- Hive: write SQL-like queries instead of MapReduce
3- Pig: write dataflow-style queries instead of MapReduce
4- Sqoop: pull and push data to an RDBMS, used for integration
5- Flume: pull data into HDFS
6- HUE: web interface for users
7- Cloudera Manager: web interface for admins to manage the cluster
8- Oozie: workflow builder
9- Impala: real-time SQL queries, up to 70x faster than MapReduce
10- Avro: serialize complex objects to store in Hadoop
11- Mahout: machine learning on Hadoop
12- ZooKeeper
13- Spark
14- YARN
15- Storm

hadoop is used for batch processing, which means parallelization; problems that don't parallelize well, like graph-based problems, don't fit Hadoop

the best distribution is Cloudera


when we talk about Hadoop, we are talking about 2 main things
1- storage: which is HDFS, a distributed redundant storage system
2- processing: which is MapReduce, a distributed processing system

some terminology to know:
1- a job: all the tasks that need to run on all the data.
2- a task: an individual unit of work, either a map or a reduce
3- Slave/Master: these are computers
4- NameNode, DataNode: these are daemons, i.e. JVM instances

we have MapReduce v1: old and stable
we have MapReduce v2: adds new things like dynamic allocation and better scalability

A Hadoop cluster has 5 daemons:
- Storage daemons:
NameNode and Secondary NameNode (on the master), DataNode (on the slaves)
- Processing daemons:
JobTracker (on the master), TaskTracker (on the slaves)

Master daemons are for orchestration
Slave daemons do the work

NameNode: handles the storage metadata; it keeps the information in memory for fast access but also persists it.
Secondary Name Node: despite its name it is not a failover node; it periodically merges the NameNode's fsimage and edit log (checkpointing).
Job Tracker: coordinates processing and scheduling.

NOTE: use different machines for the Name Node and the Secondary Name Node, because if the Name Node machine goes down you can quickly build a new Name Node from the Secondary's checkpoint

NOTE: you can install the job tracker on the same machine as the Name Node, and move it to another machine when your cluster gets bigger.

Data Node: handles the raw data (read & write)

Task Tracker: handles individual tasks (map or reduce)

the data node and task tracker constantly send heartbeats to the master to say "we are alive and this is what we are working on".

Hadoop run modes

1- Local JobRunner: single computer, single JVM with all daemons, good for debugging
2- Pseudo-distributed: single computer, 5 JVMs (one per daemon), good for testing
3- Fully distributed: multiple computers, multiple JVMs; this is the real environment.

when you install Hadoop it is recommended to use Linux; use RHEL for the master and CentOS for the slaves.

use Red Hat Kickstart to install Hadoop on multiple machines.

Elastic Map Reduce

is a managed service from Amazon similar to Hadoop.

it has this structure:

the master instance group: is like the master node
the core instance group: is like the slave node, but it is only responsible for storage (as you can see it uses HDFS)
the task instance group: is like the slave node, but it is only responsible for processing (running the map reduce jobs)

usually we use S3 to store information and intermediate data.

the core instance group is static: you cannot add any new machines after you start the cluster; the task instance group is not static, you can add new machines whenever you want


in this lab he created 5 EC2 instances, one master and 4 slaves
he installed Cloudera Manager
he installed Hadoop from Cloudera Manager
then he uploaded some data to Hadoop from the command line
then he ran a Map/Reduce example
then he checked everything from Cloudera Manager

then he gave an example of downloading the Cloudera QuickStart VM, to install Hadoop locally

He used an Ubuntu 12.04 AMI


Hadoop Distributed File System (HDFS)

you can use HDFS without MAP/REDUCE; in that case you only need the NameNode, Secondary NameNode and DataNodes

when you upload a file to HDFS it is divided into blocks that are stored on the slave nodes

every block is replicated to 3 machines (by default)

you cannot edit or append to a file you upload to HDFS; if you want to change anything you have to delete the file and create it again.

the default block size is 64MB, however it is recommended to change it to 128MB
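Both the block size and the replication factor are set in hdfs-site.xml. A sketch of the two properties (the property name is dfs.blocksize in Hadoop 2, dfs.block.size in older Hadoop 1 releases; the value is in bytes):

```xml
<!-- hdfs-site.xml -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>  <!-- 128 MB, up from the 64 MB default -->
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>          <!-- the default replication factor -->
</property>
```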

the NameNode is a master node;
it has only metadata about the files stored on the slaves (e.g. the file name, permissions, and where the blocks live).


the client asks the name node about the file, then the client goes and reads it directly from the slave nodes.

The name node metadata lives in RAM, however it is also persisted.

we have 2 files for the persisted metadata on the name node:
1- fsimage: a point-in-time image of the information that exists in HDFS
2- edit log: the changes that happened since the fsimage was created; it stores the delta information

every now and then the fsimage and edit log are merged and saved to the hard disk

you have to have multiple hard disks with RAID to ensure that you will not lose this data.
it is also better to use a remote NFS mount,
and daily or weekly backups.

every 3 seconds the datanode sends a heartbeat to the name node
if 30 seconds pass without a heartbeat, the node is considered out
after 10 minutes with no heartbeat, hadoop starts copying the data that should be on that node to other machines.

every hour (and after a restart of the name node) every data node sends a block report, which is a list of all the blocks it holds.

Hadoop uses checksums to ensure that data is transferred correctly.
every 3 weeks hadoop does a general checksum check on all blocks.


How writing happens in Hadoop

here is an example

so the client divided the file into 4 pieces,
asked the name node to write the first piece,
and the name node gave back a pipeline: write to datanode A, then C, then F
the client writes to A, A writes to C, and C writes to F
F acks C, C acks A, A acks the client, and the client acks the name node and requests a pipeline for the next block.

how do we handle a failed node?

let's say DN A is bad: the client will try C, and if that fails, F.
as long as the client is able to write to one node it can move on to the next block

general information:
1- a checksum is used for each block
2- a file is considered to be the number of blocks written so far, so if your file is 4 blocks and you wrote only 2, at that point HDFS sees your file as 2 blocks.
for that reason it is better to have 2 folders: INCOMING, where a file stays while it is being uploaded, and READY_TO_PROCESS, where you move it once the upload has finished.

How reading is handled

the client asks for a file, and the Name Node gives back a read pipeline for each block

Secondary Name Node

as mentioned before, we have 2 files in the NameNode: fsimage, which is a point-in-time file, and the edit log, which is the delta since the last fsimage

Note: we have 2 files because fsimage is a big file, and rewriting a big file on every change would slow Hadoop down; that's why we have the edit log, a small file that contains only the delta information. Using the edit log means dealing with a small file ==> better performance



IF THE SECONDARY NAME NODE IS DOWN nothing happens immediately: the name node keeps writing to the edit log, the edit log gets bigger and bigger, and the system gets slower and slower.


in a new lab, we used hadoop fs -put

when you do the installation with Cloudera Manager, a trash directory is created for you by default; when you delete something it is moved to the trash directory.
if the directory was not created, it is recommended to create one.
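The trash behaviour is controlled by the fs.trash.interval property in core-site.xml; a sketch (the 1440 here is just an example retention period):

```xml
<!-- core-site.xml: keep deleted files in .Trash before purging them -->
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>  <!-- minutes (here 24 hours); 0 disables the trash entirely -->
</property>
```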


High Availability Name Node
The Name Node as a single point of failure is not acceptable,
that's why there is a new solution by Cloudera, introduced in Hadoop 2, called Name Node High Availability.

as you can see, the standby namenode takes over if the name node goes down, AND YOU DON'T HAVE TO RESTART YOUR WRITE OR READ OPERATIONS FROM THE BEGINNING.

IMPORTANT NOTE: clients send all operations to both the NN and the standby NN, so both of them have a complete in-memory picture of what is happening.

With the architecture above I can handle the failure of the NAME NODE; however the fsimage and edit log are still a single point of failure.
that is why the High Availability architecture introduced a new thing called the JournalNode.

the currently active name node writes the edit log synchronously to a set of journal nodes, and the standby NN reads from these nodes

so that the NameNode and the standby name node don't misunderstand each other (e.g. both thinking they are the active one), each write to the Journal Nodes carries an epoch number


we use a cluster of ZooKeepers to determine who the active name node is (the number should be odd to avoid split-brain).
as you can see there is a ZKFC service on both the NameNode and the standby node; they report the health of their node to the ZooKeeper cluster.
if the NAME NODE's ZKFC notices that the NameNode is down, it sends this information to ZooKeeper, which makes the standby node the active name node and the old name node the standby one

as you can see HA is complicated: extra machines, extra configuration...
you don't need it most of the time; a secondary name node on a different machine is usually enough.


you can also scale the name node by breaking the namespace up across multiple machines (HDFS Federation).

hadoop has authorization, but it doesn't have authentication. for example, say you are sending a write request to machine1 as user xxx; user xxx is not authorized to write but user yyy is. simply create user yyy and send the request as yyy: hadoop will not check that you really are yyy.

for authentication you need something else: Kerberos.

hadoop uses Linux-like permissions.



these are the players in MapReduce

and here is how the job is done

there is also a new version called MapReduce v2; it focuses on the scalability of the job tracker and removes the restriction on the number of map and reduce slots that can run on each slave machine.

the Map Reduce configuration files are:
1- mapred-site.xml
2- hadoop-env.sh

in this lab he gave an example of how to run a Java map reduce job

this is the statement to run a map reduce job; hadoop-examples.jar contains the Map and Reduce Java classes.

he went over every line of code, you can check it.

How MapReduce works in detail

so to summarize: the job tracker asks the name node where the blocks are, assigns some slaves to run the map tasks, then assigns one or more slaves to run the reduce tasks; the reduce task trackers WILL COPY THE OUTPUT OF THE MAP TASK TRACKERS TO THEIR LOCAL MACHINES.


Hadoop is Rack Aware


Advanced MapReduce: Partitioners, Combiners, Comparators, and more

firstly we should know that the mappers and reducers do some kind of sorting

The mapper sorts the keys, and the reducer, after the shuffle, also sorts by key.

You can define a Comparator to do secondary sorting, which sorts the values within the reducer; so in the example above, where we have us:[55,20], the secondary sort turns it into us:[20,55].

we can also define what we call a combiner, which is a pre-reducer; the combiner runs in the map phase. as you can see in the example above, the first mapper adds up the US values and outputs 55: that is the combiner's job.
With a combiner you can reduce processing time and intermediate data.

we also have something called a partitioner

the mapper can partition its output into multiple partitions, and the reducers later fetch the partitions they are interested in;
in the example above we partitioned by key, and as you can see each reducer grabs a specific key.

There is a full example about writing a Partitioner.


for unit testing you have MRUnit, which is a new Apache project.

he gave a practical example about logging as well

when you do benchmarking we talk about the TeraSort number; the number gives us an indicator of the performance of the cluster, and whether adding a new machine gave us a performance gain.

TERASORT is simply the simplest MapReduce job Hadoop can do. to run a TERASORT test you use 3 tools:
1- teragen: generates a dataset
2- terasort: a job that sorts the dataset
3- teravalidate: validates that the dataset got sorted.

Hive vs Pig vs Impala

we know Hive and Pig; we know they simply convert your queries into MapReduce jobs.
they are in general 10-15% slower than native Java MapReduce.

as Hive and Pig convert the queries to MapReduce, they use the job tracker and task trackers

Impala was developed at Cloudera; it is designed for real-time queries and uses its own dedicated daemons, not the task trackers and job tracker. IMPALA DOESN'T USE MAPREDUCE AT ALL.
Impala is not fault tolerant. basically MapReduce is slow because of the time needed to start a JVM for each map and reduce task; Impala's long-running daemons avoid that.
Impala sits on top of Hive (it actually supports a subset of HiveQL)



in HIVE, you can do the installation on each client and start querying.

or, you can have a HIVE server:

either way we always need a metastore, where the mapping between Hive tables and HDFS data is stored.

NOTE: in HiveQL there is no UPDATE or DELETE, as Hive runs on top of Hadoop and, as mentioned before, you cannot delete or update a record.

Check the HIVE & PIG LAB.



Data Import and Export

we have 2 types of import and export:
1- Real-Time Ingestion and Analysis:
products like Flume, Storm, Kafka, and Kinesis
the idea of these products is that you have multiple agents that push and pull data between each other.

these systems don't care whether the end system is Hadoop, a NoSQL store or a flat file

The products are similar, however Storm, Kafka and Kinesis have more analysis functionality than Flume

2- Database Import Export:
Sqoop (SQL to Hadoop)
it is simply a single process that imports/exports data to/from hadoop.

there is no analysis or filtering or anything else; just import/export.

you can schedule something like: at 2:00, pull all the data from hadoop and put it in table xxx.


Flume is used to move massive amounts of data from system A to system B (which is usually HDFS, MongoDB, some NoSQL store...)

He talked about the architecture of FLUME and there is a LAB.


some REST call examples


he gave a lab about sqoop

Oozie is used to build workflows; a workflow is represented in XML format
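A minimal sketch of such a workflow XML (the workflow name, action name and schema version here are invented for illustration): a start node hands control to a map-reduce action, which transitions to end on success or to a kill node on failure.

```xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="demo-wf">
  <start to="my-mr-step"/>
  <action name="my-mr-step">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- mapper/reducer classes and input/output paths go in a <configuration> block -->
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Step failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```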