BigData (Problem)

Shashwat Singh
4 min read · Sep 17, 2020


Let's First Understand the Problem:

Why is this problem called BigData?

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Q> How much data do the big MNC companies collect every day?

Let's search it:

#Google — 40,000 searches per second 😲

#Twitter — 500 million tweets per day 😲

#Facebook — 500+ terabytes of data per day 😲

#Netflix — 151 million subscribers 😲

So, this problem is known as BigData…

I think now you've got the idea about BigData !!!!

😲😲 How do they manage such huge data?

First you need to know about the sub-problems of BigData:

>>>> Sub-problems of Big Data !!!

1. Volume:

Now you might think, OK, this is not a big deal. These MNCs can hire the companies (like Dell EMC, HP, IBM, etc.) that build storage appliances to extend the VOLUME of the hard disk.

like:

From TB (terabyte) → PB (petabyte) → EB (exabyte) → ZB (zettabyte) → …
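To get a feel for this scale, here is a rough back-of-the-envelope sketch in Python. The 10 TB disk size and the binary units are my own assumptions, purely for illustration:

```python
# Back-of-the-envelope illustration (my own numbers, not from any vendor):
# how many 10 TB hard disks would a single exabyte of data need?

TB = 1024 ** 4        # bytes in one terabyte (binary)
EB = 1024 ** 6        # bytes in one exabyte (binary)

disk_size = 10 * TB   # assume one hard disk holds 10 TB
data_size = 1 * EB    # one exabyte of data to store

print(data_size // disk_size)   # -> 104857, i.e. over a hundred thousand disks
```

Over a hundred thousand disks just to hold the data, and that is before we even try to read it back quickly.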

Companies can extend the size of the hard disk, but such huge hard disks are practically useless because of the I/O problem.

So, I/O is the second sub-problem of BigData…

2. I/O Problem:

<I will also explain this problem practically, stay tuned>

As you all know, RAM stores and reads data around 100 times faster than a hard disk. A hard disk can take a long time, say 2–3 minutes, to input (write) or output (read) a big chunk of data; even SSDs are slow compared to RAM, so SATA HDDs are not even in the running…
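To put some rough numbers on this, here is a tiny Python sketch. The throughput figures are my own ballpark assumptions, not measured benchmarks:

```python
# Assumed sequential-read throughputs (illustrative only, in MB/s):
throughput_mb_per_s = {
    "SATA HDD": 150,      # typical spinning disk
    "SATA SSD": 550,      # typical consumer SSD
    "RAM": 20_000,        # roughly 20 GB/s
}

data_gb = 400   # how long does it take just to read 400 GB?

for device, mbps in throughput_mb_per_s.items():
    minutes = (data_gb * 1024) / mbps / 60
    print(f"{device:8s}: {minutes:6.1f} minutes")
```

Even with a fast disk, a single machine spends a long time just moving the data in and out, and that is before any actual processing happens.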

How can we solve the BigData problem?

This can be done with the concept of Distributed Storage…

What is Distributed Storage?

It follows a topology which I am going to explain with the help of a diagram:

Cluster

From the diagram, we see that there is one machine through which all the other machines share their computing hardware, with the help of networking.

So, the machine placed at the top is known as the Master (or NameNode) and the others are Slaves. This whole setup is called a Cluster.
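To make the topology concrete, here is a toy Python model. The class and method names are purely my own illustration, not Hadoop's actual API: the master keeps only metadata about which slave holds which block, while the slaves hold the data itself.

```python
# Toy model of a cluster: the master stores only metadata
# (which slave holds which block); the slaves store the actual data.

class Master:
    def __init__(self, slaves):
        self.slaves = slaves        # names of the slave (data) nodes
        self.block_map = {}         # (filename, block_no) -> slave name

    def place_blocks(self, filename, num_blocks):
        """Assign the blocks of a file to slaves in round-robin order."""
        for block_no in range(num_blocks):
            slave = self.slaves[block_no % len(self.slaves)]
            self.block_map[(filename, block_no)] = slave


master = Master(slaves=[f"slave-{i}" for i in range(1, 101)])
master.place_blocks("logs-400GB.dat", num_blocks=100)

print(master.block_map[("logs-400GB.dat", 0)])    # -> slave-1
print(master.block_map[("logs-400GB.dat", 99)])   # -> slave-100
```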

Let's understand the concept…

Let's say we get data of size 400 GB at our master node, and we have 100 machines connected to the master; then we divide that data into 100 chunks.

Then,

Each slave receives data of size = 400/100 = 4 GB

This 4 GB of data goes to every machine simultaneously…

Now, as you can see, the 400 GB behaves like just 4 GB of data for the master and its slaves…

So, our sub-problem of BigData, i.e. the Volume problem, is resolved…
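In code, the chunk arithmetic from above is just a division (a minimal sketch, assuming the data splits evenly across the 100 nodes):

```python
data_size_gb = 400
num_slaves = 100

chunk_size_gb = data_size_gb / num_slaves
print(chunk_size_gb)   # -> 4.0 GB handled by each slave
```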

Now let's talk about speed…

Consider that a machine can accept data at a rate of 1 hour per GB…

Then the time taken by a normal (single) machine to accept 400 GB of data is 400 hours 😨…

But for the master node of our cluster it will be only 4 hours, because all 100 slaves accept their 4 GB chunks in parallel 😃…

Hence, the other sub-problem of BigData, i.e. the I/O problem, is also solved…
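The same speed-up can be checked with a small calculation (again just a sketch, using the assumed rate of 1 hour per GB from above):

```python
hours_per_gb = 1        # assumed ingest rate: 1 hour to accept 1 GB
data_size_gb = 400
num_slaves = 100

single_machine_hours = data_size_gb * hours_per_gb
cluster_hours = (data_size_gb / num_slaves) * hours_per_gb   # slaves work in parallel

print(single_machine_hours)   # -> 400 hours on one machine
print(cluster_hours)          # -> 4.0 hours across the cluster
```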

We can use this concept with the help of many products like Hadoop, AWS S3, Ceph, etc.…

The most popular one is Hadoop…

Stay tuned for this concept. I will give a practical explanation…
