Friday, March 25, 2022

Docker Setup


1. Repository Setup

Prerequisite: log in as root (or run the commands with sudo)

1. yum install -y yum-utils device-mapper-persistent-data lvm2

2. yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

 

 2. Installation of Docker Engine

     1. yum install docker-ce docker-ce-cli containerd.io

      To install a specific version of Docker:

     2. yum install docker-ce-18.03.0.ce


    Updating docker.service:

    1. vi /usr/lib/systemd/system/docker.service

    Update the following properties in docker.service:

    2. ExecStart=/usr/bin/dockerd --insecure-registry=myregistry.com -H tcp://0.0.0.0:2000 -H unix:///var/run/docker.sock

    3. ExecReload=/bin/kill -s HUP $MAINPID

 

    Flush the changes (reload systemd so it picks up the edits):

    1. systemctl daemon-reload


    Start Docker:

    1. systemctl start docker

Remove all matching files in a folder

rm -rf *.log

The above command removes every file in the current directory whose name ends with ".log".

Remove all Docker images whose name contains a given string

docker images
docker rmi $(docker images | grep 'imagename' | awk '{print $3}')

The grep output contains the whole line, so awk extracts the image-ID column before passing it to docker rmi.

Replace all occurrences of a string in a file (vim)

:%s/google.com/google.com:8020/g


The above command adds port 8020 to every occurrence of google.com.

Monday, March 14, 2022

Client Server Model

Clients

  • the ones requesting the information they want from somewhere 
  • e.g. a browser, a device, any program (Java, etc.)
  • clients generally speak to servers in some kind of network protocol; HTTP is the most common one. 

Server
  • a physical machine or virtual server that receives traffic 
  • it is the provider of information
  • it has access to some kind of information; the information need not be stored locally, but the server must know how and where to get it.
  • a server exposes a set of APIs (application programming interfaces) 


Client Server Model
  • we have clients, which are requesters of information
  • we have servers, which are the sources or providers of information 
  • data is exchanged through APIs: a client calls a particular API on a server that is listening on a particular port, and the server responds with the requested information 
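The model above can be sketched with Python's standard library: a toy server listens on a port, and a client connects and requests information. The port (5050) and the "GET /time" request format are arbitrary choices for illustration, not a real protocol.

```python
import socket
import threading

ready = threading.Event()  # signals that the server is listening

def run_server(port):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    ready.set()                          # tell the client it can connect now
    conn, _ = srv.accept()               # wait for one client
    request = conn.recv(1024)            # read the client's request
    if request == b"GET /time":
        conn.sendall(b"200 OK: 12:00")   # server provides the information
    conn.close()
    srv.close()

def run_client(port):
    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    cli.connect(("127.0.0.1", port))
    cli.sendall(b"GET /time")            # client asks for information
    reply = cli.recv(1024)
    cli.close()
    return reply

t = threading.Thread(target=run_server, args=(5050,))
t.start()
ready.wait()
reply = run_client(5050)
t.join()
print(reply.decode())
```

In a real system the protocol would be HTTP and the server would be a separate process, but the shape is the same: the client speaks first, the server answers on the port it listens on.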

Content Delivery Network - CDN

A CDN is a service that accelerates internet content delivery. 

It makes your website faster.      

A CDN reduces the distance between the user and the server providing the content.



We have CDN endpoints in many locations around the world.

Now, when users in A, B, C, and D try to access the same content, it is first retrieved by the Content Delivery Network service and then distributed around the world. 

Now our users in A, instead of going all the way to MS, can access the content directly from the closest geographical location, drastically reducing the amount of time it takes to retrieve that content.



Indirect benefits are as follows:

  • it reduces the load on MS, i.e. the amount of capacity you need there to serve all these users
  • it increases up-time
  • it improves security


Domain Name System - DNS


DNS translates domain names to IP addresses.

What is a Domain?
  • a domain name is the text or string that you enter in your browser's address bar 
  • e.g. google.com, etc.


What is an IP address?
  • a unique numerical address that lets millions of devices find and communicate with each other over the Internet Protocol
  • four numbers separated by dots (IPv4). e.g. 172.25.43.128
  • when we say a website is down, often you're just not able to resolve the DNS name you're looking for.
  • if we use the IP address instead of the domain name, the request will still route to the web page


DNS Resolver
  • it acts as the phone book in this entire process 

How do we bridge the gap b/w human communication, DNS, and the networking world?
  • in the networking world, computers use numbers to communicate with each other. 
  • in the human world, we use names to communicate with each other
  • here, the DNS resolver acts as the phone book, where you search for a name and match it to a number. 
  • that's why we have an IP address allocated to each and every device on the internet.
  • all these devices communicate using the IP address as a unique identifier. 

How does the DNS Lookup process work?




    Client: where we enter the website address in the web browser
  • the web browser has what we call cache memory. 
  • cache memory stores certain values for a certain period of time. 
  • so, when you enter the address abc.com, 
    • the browser first looks in its cache memory
    • if it's a cache miss, the request is forwarded to the DNS Resolver
    • the DNS resolver has its own cache as well
    • if that is also a cache miss, the request is routed to a ROOT SERVER
    • Root Server: the server at the top level of the DNS hierarchy. 
    • suppose the root server doesn't have the record for abc.com that you're looking for either
      • the root server knows which Top Level Domain server your request should be routed to. 
      • root servers are placed across different locations throughout the world.
      • there are 12 different organizations that manage these root servers.
      • the root server returns the IP address of the top level domain server.
      • the resolver then sends the request to the Top Level Domain server.
      • TLD stands for "Top Level Domain", which means it holds the information for that top level domain. 
      • in our case the TLD is ".com".
    • now, the TLD server doesn't have the IP address of abc.com either, but it points the resolver to the Authoritative Name Server
    • the Authoritative Name Server holds all the DNS records that we need to access 
    • it looks up the DNS record we need and sends back the IP address.
    • next, the DNS resolver stores the IP address in its cache and sends it back to the web browser.
    • now that the web browser has the IP address, it sends the request to the web server using that IP address.
    • the web server has all the content needed to display your web page. 
    • what is rendered back is the content of the web page. 
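The lookup chain above can be walked through in a few lines of Python. All the data here is made up (abc.com, the name-server names, the IP): dicts stand in for the browser cache, resolver cache, root server, TLD server, and authoritative name server.

```python
browser_cache = {}
resolver_cache = {}
root = {"com": "tld-com-server"}                  # root knows the TLD server
tld_com = {"abc.com": "ns.abc-hosting.example"}   # TLD knows the authoritative NS
authoritative = {"abc.com": "93.184.216.34"}      # NS holds the actual record

def resolve(domain):
    if domain in browser_cache:                   # 1. browser cache hit?
        return browser_cache[domain]
    if domain in resolver_cache:                  # 2. resolver cache hit?
        return resolver_cache[domain]
    tld = domain.rsplit(".", 1)[-1]               # 3. root points us to the TLD server
    assert tld in root
    assert domain in tld_com                      # 4. TLD points to the authoritative NS
    ip = authoritative[domain]                    # 5. authoritative NS returns the record
    resolver_cache[domain] = ip                   # resolver caches the answer
    browser_cache[domain] = ip                    # browser caches it too
    return ip

print(resolve("abc.com"))  # full lookup: 93.184.216.34
print(resolve("abc.com"))  # second call is answered from cache
```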

Load Balancing


Load Balancing is a technique that lets you scale your application to handle hundreds of thousands, if not millions, of users.
 
How Load Balancing works?






A load balancer is a component that sits between the application servers and the clients.  

Now, when a load balancer comes into the picture, 
  • instead of public IPs at the application layer, we'll have private IPs 
  • the load balancer has the public IP

  1. Now, when a client makes a request to your application, DNS resolution points to the load balancer's IP address. 
  2. The load balancer knows that these 2 different machines belong to the group of targets available to receive traffic, and it routes the request to one of them based on server availability.

Methods of assigning/routing traffic between users and host machines

    1. Round Robin: if we have 2 servers, each will receive requests alternately.
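Round robin fits in a few lines with itertools.cycle: with two servers, successive requests simply alternate between them. The machine names are placeholders.

```python
import itertools

servers = ["machine-1", "machine-2"]
rotation = itertools.cycle(servers)   # endless round-robin iterator

def route(request_id):
    return next(rotation)             # each call hands out the next server

assigned = [route(i) for i in range(4)]
print(assigned)  # ['machine-1', 'machine-2', 'machine-1', 'machine-2']
```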




    2. Load Based: 
  • if Machine 1 is getting overburdened with traffic, maybe because of too many requests or because it is taking too long to process them, 
  • then the load balancer redirects traffic that was supposed to hit Machine 1 to Machine 2. 
  • hence, we alleviate some of the stress that's on Machine 1. 



        Categories of load-based load balancing
  • Least connection: the load balancer determines which machine in the pool has the least number of open connections, and routes traffic to that one.  


  • Resource based: the load balancer monitors the machines themselves and checks
    • CPU level
    • memory level
    • whatever other metric exists on the machine
          This information is reported back to the load balancer, which uses it to decide who to send traffic to.
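The least-connection category above reduces to a one-line min() over the reported counts. The connection numbers here are made up; a real balancer would update them from health checks and connection tracking.

```python
# open-connection count reported for each machine (hypothetical values)
connections = {"machine-1": 7, "machine-2": 2, "machine-3": 4}

def pick_least_connected():
    target = min(connections, key=connections.get)  # fewest open connections
    connections[target] += 1                        # the new request opens one more
    return target

print(pick_least_connected())  # machine-2 (had only 2 open connections)
```

A resource-based variant would look the same, except the dict would hold CPU or memory readings instead of connection counts.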


Load Balancing advantages
  1. Scalability: 
    • we can scale our application servers horizontally: keep adding machines, and adding them to the group, so that the load balancer knows the new machines are available
  2. Availability:
    • if any machine suddenly goes down, periodic health checks occurring between the load balancer and the host machines make sure each host is in a healthy state and able to receive requests. 
    • hence, the load balancer knows to redirect traffic only to healthy hosts, and connections to unhealthy application servers are cut off. 
  3. Convenience in redirecting traffic
    • when we need to shift traffic between different applications, it's easy to redirect it.

Sunday, March 13, 2022

Database Sharding

 

A fundamental principle that very large scale systems are built on, because it allows systems to scale out horizontally as much as they want.

 

Topics:

1.       What is sharding

2.       Why  is it required

3.       What are the different options for scaling your database

4.       How to increase database performance

a.       Scaling up your hardware

b.       Adding replicas

c.       Sharding

5.       Pros and cons of sharding

 

What is Sharding?

-          It is an optimization technique for achieving horizontal scalability in databases, i.e. splitting our database into multiple smaller ones.

-          Now, instead of one big database, you have smaller individual ones, and each of those can have the same kind of hardware set-up; hence we get better performance and more data storage. 

-          Sharding is a specific type of partitioning

-          We’ll focus mainly on horizontal sharding, as it is more applicable to large scale problems.


Options available for increasing performance of database

Why?

-          Maybe the database is too expensive

o   because of hardware cost

-          Maybe we are running into storage limits on that database

o   so we have to dump older data, and then we can’t query against that older data

 

Option 1. Scaling up hardware

Ø  Increase the RAM

Ø  Pick a better-performing processor

Not a great solution, as it's very expensive to keep scaling up and up every time we face performance problems

Option 2. Add replicas


Ø  We make copies of the database

o   Instead of one database serving all the traffic, we have multiple copies of that database

o   The copies are also able to field some of the incoming traffic.

o   We have

§  one MASTER NODE (orchestrator)

§  multiple READ REPLICAS (copies of the master node)

Ø  The master node is responsible for receiving all write requests

Ø  Read replicas are not allowed to receive write requests

Ø  The master accepts the write request and propagates the update to that row to the read replicas, asynchronously

Ø  This eventual/delayed asynchronous propagation leads to a problem called “Eventual Consistency”

Ø  Eventual Consistency means

o   that there is a delay between when a write request is received and committed on the master node  

o   and when the read replicas get updated with that row

Why is Eventual Consistency a problem?

-        Eventual consistency in a system can result in stale data. 
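The stale-read window is easy to see in a toy sketch: two dicts stand in for the master and a read replica, the write lands on the master, and a read that hits the replica before the asynchronous sync returns the old value. The key and email values are made up.

```python
master = {"user:1": "old_email@example.com"}
replica = dict(master)                 # replica starts as a copy of the master

def write(key, value):
    master[key] = value                # writes go to the master only

def sync_replica():
    replica.update(master)             # asynchronous propagation, runs later

write("user:1", "new_email@example.com")
stale = replica["user:1"]              # read hits the replica before the sync
sync_replica()
fresh = replica["user:1"]              # after propagation the replica catches up
print(stale)   # old_email@example.com  -- the stale read
print(fresh)   # new_email@example.com  -- eventually consistent
```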

  

 

Option 3. Sharding

Ø  Splitting of your databases into multiple smaller databases



Ø  How will you separate your data?

 

o   The fundamental thing with sharding is that you need some kind of key that is predictable and is part of every request that comes into your system.

o   We use this key as input to a hashing function, and the output tells you which shard the data for that key is on.
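A minimal version of that key-to-shard mapping, assuming the customer id is the key. md5 (via hashlib) is used here only because it gives a mapping that is stable across runs, unlike Python's built-in hash(); the shard count of 4 is arbitrary.

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key):
    digest = hashlib.md5(key.encode()).hexdigest()  # stable hash of the key
    return int(digest, 16) % NUM_SHARDS             # same key -> same shard, always

print(shard_for("customer-42"))
print(shard_for("customer-42") == shard_for("customer-42"))  # True: deterministic
```

Determinism is the whole point: every service that computes shard_for("customer-42") will agree on where that customer's data lives.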

 

 

 

Method 1 Sharding: Partition Mapping i.e. using independent Shard Databases



Before Sharding

-          We have one database; if that goes down, the whole application goes down.

After Sharding

-          Now we need a small table that tells us which customers are on which shard.

-          Here, Database Shard 1 and Database Shard 2 are on completely different master nodes, in completely different databases

-          Let’s take a scenario where Database Shard 1 goes down. It will definitely impact customers, but we still have Database Shard 2, which stores its own subset of the information. Hence, for customers 3 and 4, we can still receive traffic and process their queries accordingly.

 

 

Method 2 Sharding: using Routing Layer

 


-          When a request comes in, instead of hitting the database directly, we introduce another intermediary layer.

-          When we query this intermediary layer, it is responsible for looking up which shard each customer is on and then forwarding the request to the corresponding underlying database instance.

-          This layer should be able to do the following:

o   It should know how to route the traffic

o   It needs to handle being a single point of failure, especially if you are using a database to store your shard locations.
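A routing layer can be sketched with two lookups: the shard map tells us where the customer lives, and the query is forwarded to that shard's database. The dicts stand in for real databases; the customer ids and names are made up.

```python
# partition map: which shard is each customer on?
shard_map = {1: "shard-1", 2: "shard-1", 3: "shard-2", 4: "shard-2"}

# the shard databases themselves (dicts standing in for separate DB instances)
databases = {
    "shard-1": {1: "alice", 2: "bob"},
    "shard-2": {3: "carol", 4: "dave"},
}

def query(customer_id):
    shard = shard_map[customer_id]        # 1. look up the customer's shard
    return databases[shard][customer_id]  # 2. forward the query to that shard

print(query(3))  # carol, served by shard-2
```

If shard-1 went down (deleting its entry from databases), queries for customers 3 and 4 would still succeed, which is exactly the availability argument above.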

Pros

-          Scalability: instead of having one database instance, which can lead to a single point of failure, we chunk our data into smaller independent databases. This splitting into small independent databases can be repeated recursively as per our requirements.

-          Availability/Fault tolerance: if one shard database goes down, at least we can still serve the remaining subset of customer data.

Cons (Complexity)

1.       Partition mapping

2.       Routing layer

3.       Non-uniformity of data

The goal of the hashing function is to split your data evenly.

Suppose there is a user with tons of data in your application. Eventually, that shard can grow and make the shard storage disproportionate to the others.

4.       Re-sharding/re-shuffling

-          Needed because of non-uniformity of data; we then have to orchestrate the whole re-sharding process.

5.       Analytical queries are restricted

E.g. we have all these different shards, each with a different database underneath.

So your layer needs to know how to go out and collect the information from all the different underlying databases, wrap it all up, and then return it to the caller. This pattern is called Scatter Gather.

For analytical queries, it is necessary to go out to each individual shard, grab all that data, and then return it to your customer.  
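Scatter-gather in miniature: the analytical query fans out to every shard, each shard computes its partial answer, and the routing layer merges them before replying. The shard data is hypothetical.

```python
# each shard holds its own slice of the order amounts
databases = {
    "shard-1": {"orders": [120, 80]},
    "shard-2": {"orders": [200]},
}

def total_orders():
    partials = [sum(db["orders"]) for db in databases.values()]  # scatter
    return sum(partials)                                         # gather

print(total_orders())  # 400
```

With one database this would be a single SUM query; with shards, every analytical question pays this fan-out cost.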

 

MUST READ: https://aws.amazon.com/blogs/database/sharding-with-amazon-relational-database-service/


Friday, March 11, 2022

Deep Copy vs Shallow Copy

Breadth vs Depth; think in terms of a tree of references with your object as the root node.


Shallow Copy:

(Figure: Before Copy → Shallow Copying → Shallow Done)

The variables A and B refer to different areas of memory; when B is assigned to A, the two variables refer to the same area of memory. Later modifications to the contents of either are instantly reflected in the contents of the other, as they share contents.



Deep Copy:

(Figure: Before Copy → Deep Copying → Deep Done)

The variables A and B refer to different areas of memory, when B is assigned to A the values in the memory area which A points to are copied into the memory area to which B points. Later modifications to the contents of either remain unique to A or B; the contents are not shared.
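Python's copy module shows both behaviours directly: a shallow copy creates a new outer object but shares the nested contents, while a deep copy duplicates everything recursively.

```python
import copy

a = [[1, 2], [3, 4]]
shallow = copy.copy(a)       # new outer list, but the inner lists are shared
deep = copy.deepcopy(a)      # everything duplicated recursively

a[0][0] = 99
print(shallow[0][0])  # 99: the change is visible through the shallow copy
print(deep[0][0])     # 1:  the deep copy is unaffected
```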

Passing Array vs String vs Integer variable in Recursion



Strings are immutable in nature, which means that when we pass a String variable into recursive methods as a parameter, each method call keeps its own existing value, not the updated one.




Arrays are mutable in nature, which means their values remain updated even after the fallback (backtracking). 
e.g. in the "print all paths" question, we need to reset the visited boolean array value to false during fallback.
Also, each recursive call sees the updated visited boolean array.



Integer variables, when passed in recursive calls, are copied: in each call the value is stored at a different memory address.

Let, 
in Call 1 -> int i = 10
Now, when the variable i is passed into the recursion, it is passed by value, so the callee's i has a different memory address.
While backtracking, Call 1 will still have its own value, not the updated one.
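Both behaviours can be seen in one small backtracking sketch (a made-up two-level path exploration): the string argument grows per call but each frame keeps its own copy, while the list is shared across frames and must be undone on fallback.

```python
def explore(path_str, visited, depth, log):
    if depth == 2:
        log.append((path_str, list(visited)))  # record the path at a leaf
        return
    for node in ("a", "b"):
        visited.append(node)                   # mutate the shared list
        # path_str + node builds a NEW string: the caller's copy is untouched
        explore(path_str + node, visited, depth + 1, log)
        visited.pop()                          # undo the mutation on fallback

log = []
visited = []
explore("", visited, 0, log)
print([p for p, _ in log])  # ['aa', 'ab', 'ba', 'bb']
print(visited)              # []: the shared list was restored by each pop()
```

If the pop() line were removed, visited would keep growing across branches, which is exactly the bug the "reset during fallback" rule guards against.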




Graphs: DFS vs BFS

 











Wednesday, March 9, 2022

Grafana vs Kibana

 


Apache Kafka

Apache Kafka is an open source, distributed streaming platform that allows for development of real-time event-driven applications.

Main characteristics of Kafka are as follows:

1.       Produce – Consume

a.       Specifically, it allows developers to make applications that continuously produce and consume streams of data records.

2.       Kafka is distributed.

a.       It runs as a cluster that can span multiple servers or even multiple data centers.

3.       Fast

a.       The records that are produced are replicated and partitioned in a way that allows a high volume of users to use the application simultaneously, without any perceptible lag in performance.

4.       High accuracy

a.       Maintains high level of accuracy within data records

5.       In order

a.       Maintains the order of their occurrence.

6.       Resilient and fault tolerant

a.       It’s replicated

 

USE CASE 1. Decoupling


1.       User checkout

2.       Then order gets shipped

 

Here, we need to write the integration, considering the shape of the data, the way the data is transported, and the format of the data. Not a big deal when there is only one integration. 

 



1.       User checkout

2.       Then

a.       order gets shipped

b.       add automated email receipt when checkout happens

c.       update to an inventory, when checkout happens

As frontend and backend services get added and the application grows, more and more integrations need to be built, and it can get very messy.

Also, each team depends on the others before it can make any changes, so development is slow.

 

Solution: Decoupling system dependencies


 

We remove all the dependencies; instead, checkout will stream events.

Every time a checkout happens, the event gets streamed, and checkout is not concerned with who’s listening to that stream. It’s broadcasting those events.

Then the other services (email, shipment, inventory) subscribe to that stream, choose to listen to it, get the information they need, and are triggered to act accordingly.
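The decoupling can be sketched as a minimal publish/subscribe bus (not Kafka itself, just the pattern): checkout publishes and never learns who is listening; each downstream service registers its own handler.

```python
subscribers = []   # handlers registered by downstream services
handled = []       # what each service did, for demonstration

def subscribe(handler):
    subscribers.append(handler)

def publish(event):
    for handler in subscribers:       # broadcast to every listener
        handler(event)

# email, shipment, and inventory each subscribe independently
subscribe(lambda e: handled.append(f"email receipt for {e['order_id']}"))
subscribe(lambda e: handled.append(f"ship order {e['order_id']}"))
subscribe(lambda e: handled.append(f"decrement inventory for {e['order_id']}"))

# checkout just broadcasts; it has no idea three services reacted
publish({"type": "checkout", "order_id": 17})
print(handled)
```

Adding a fourth integration is now one more subscribe() call; checkout's code never changes.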

 

USE CASE 2. Messaging

Our application uses messaging to move the checkout experience along.

E.g. let's say we have 2 APIs, and for them we’ll have a Kafka Topic and a Kafka Template created, with their ids.

We will refer to them by their ids only.

1.       search hotel

2.       search price

 

USE CASE 3. Location Tracking

Eg. a ride-share service:

·         a driver using the application would turn on their app, and maybe every second a new event would get emitted with their current location

·         at small scale, to let an individual user know how close their particular ride is

·         at large scale,

o   to calculate surge pricing

o   to show the user a map before they choose which ride they want

 

USE CASE 4. Data Gathering

·         to collect analytics, to optimize your website

·         it can be used in a more complex way with a music streaming service: for each user, every song they listen to can be a stream of records, and your application could use that stream to give real-time recommendations to that user.

·         it can take data records from all the users, aggregate them, and then come up with a list of an artist’s top songs

 

 

KAFKA ARCHITECTURE



4 core APIs

1.       Producer API

a.       Allows your application to produce, i.e. make, these streams of data

b.      So, it creates the records and produces them to topics.

 

Topic: an ordered list of events. These can be saved for a minute if they’re going to be consumed immediately, or for hours, days, or even forever, as long as you have enough storage, since topics are persisted to physical storage.

 

2.       Consumer API

a.       Subscribes to one or more topics, listens, and ingests that data.

b.       It can subscribe to topics in real time, or it can consume the old data records that are saved to the topic.

3.       Streams API

a.       Consumers can consume data from a Kafka topic in the original format in which the producer produced it. But to transform that data, we need the Streams API.

b.       It is beneficial for both the producer and consumer APIs.

c.       So, the Streams API consumes from a topic or topics, then analyzes, aggregates, or otherwise transforms the data in real time, and then produces the resulting streams to a topic, either the same topics or new topics.

4.       Connector API

a.       Enables developers to write connectors, which are reusable producers and consumers.

b.      So, in a Kafka cluster, many developers might need to integrate the same type of data source, like MongoDB for example.

c.      So not every single developer should have to write that integration. What the Connector API allows is for that integration to get written once; the code is there, and then all developers need to do is configure it, in order to get that data source into their cluster.
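The produce/consume mechanics above can be illustrated with an in-memory stand-in for a topic (this is a model of the concepts, not the real Kafka client API): producers append to an ordered, append-only log; each consumer keeps its own offset; and a consumer that arrives late can replay all the saved records from the start.

```python
class Topic:
    """An ordered, append-only list of records (stand-in for a Kafka topic)."""
    def __init__(self):
        self.records = []

    def produce(self, record):
        self.records.append(record)   # records keep their order of occurrence

class Consumer:
    """Each consumer tracks its own position (offset) in the log."""
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0

    def poll(self):
        batch = self.topic.records[self.offset:]   # everything not yet seen
        self.offset = len(self.topic.records)      # advance past it
        return batch

topic = Topic()
topic.produce("checkout:1")
topic.produce("checkout:2")

c1 = Consumer(topic)
print(c1.poll())   # ['checkout:1', 'checkout:2']
topic.produce("checkout:3")
print(c1.poll())   # ['checkout:3']  -- only the new record
c2 = Consumer(topic)
print(c2.poll())   # all three: a late consumer replays the saved records
```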




