Index
NIT 067 - BIG DATA
1. UNDERSTANDING BIG DATA
1.1. What is big data
1.2. Why big data

1. UNDERSTANDING BIG DATA

1.1 What is big data:
Big
data is data that exceeds the processing capacity of conventional database
systems. The data is too big, moves too fast, or does not fit the structures of
traditional database architectures. In other words, Big data is an
all-encompassing term for any collection of data sets so large and complex that
it becomes difficult to process using on-hand data management tools or
traditional data processing applications. To gain value from this data, you
must choose an alternative way to process it. Big Data is the next generation of data warehousing and business analytics, poised to deliver top-line revenue growth cost-efficiently for enterprises. Big data is a popular term used to
describe the exponential growth and availability of data, both structured and
unstructured. Every day, we create 2.5 quintillion bytes of data — so much that
90% of the data in the world today has been created in the last two years
alone. This data comes from everywhere: sensors used to gather climate
information, posts to social media sites, digital pictures and videos, purchase
transaction records, and cell phone GPS signals to name a few. This data is big
data.
Definition: Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making.
Big Data is a collection of large
datasets that cannot be processed using traditional computing techniques. It is
not a single technique or a tool, rather it involves many areas of business and
technology.
Increasingly, organizations today are facing Big Data challenges. They have access to a wealth of information, but they don't know how to get value out of it because it is sitting in its rawest form or in a semistructured or unstructured format; as a result, they don't even know whether it's worth keeping (or whether they are even able to keep it, for that matter).
What Comes Under Big Data?
Big data
involves the data produced by different devices and applications. Given below are
some of the fields that come under the umbrella of Big Data.
Ø Black Box Data: It is a component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings of microphones and earphones, and the performance information of the aircraft.
Ø Social Media Data: Social media such as Facebook and Twitter hold information and the views posted by millions of people across the globe.
Ø Stock Exchange Data: The stock exchange data holds information about the 'buy' and 'sell' decisions made by customers on the shares of different companies.
Ø Power Grid Data: The power grid data holds information about the power consumed by a particular node with respect to a base station.
Ø Transport Data: Transport data includes the model, capacity, distance, and availability of a vehicle.
Ø Search Engine Data: Search engines retrieve lots of data from different databases.
Thus, Big Data includes huge volume, high velocity, and an extensible variety of data. The data in it will be of three types, as illustrated in the sketch below:
Structured data: Relational data.
Semi-structured data: XML data.
Unstructured data: Word, PDF, text, media logs.
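To make these three forms concrete, here is a minimal sketch in Python (the record contents and field names are invented for illustration, not taken from any real system): the same purchase event represented as structured, semi-structured, and unstructured data.

import json

# Structured: a fixed-schema, relational-style row (every record has the same columns)
structured_row = {"order_id": 1001, "customer_id": 42, "amount": 19.99}

# Semi-structured: self-describing JSON/XML whose fields can vary from record to record
semi_structured = json.loads('{"order_id": 1001, "tags": ["gift", "expedited"]}')

# Unstructured: free text (e.g., a support e-mail) with no predefined schema at all
unstructured = "Hi, my order arrived late and the box was damaged."

print(structured_row["amount"], semi_structured["tags"], len(unstructured))

Querying the first form is a simple SQL lookup, the second needs a parser that tolerates missing or extra fields, and the third requires text analytics before any query is even possible.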
Characteristics of Big Data
Three characteristics define Big Data: volume, variety, and velocity.
Together, these characteristics define
what we refer to as “Big Data.”
Can
There Be Enough? The Volume of Data
The sheer volume of data being stored today is exploding. In the year 2000,
800,000 petabytes (PB) of data were stored in the world. Of course, a lot of the data that's being created today isn't analyzed at all, and that's another problem that the major vendors are trying to address with their respective tools. We expect this number to reach 35 zettabytes (ZB) by 2020. Twitter alone generates more than 7 terabytes (TB) of data every day,
Facebook 10 TB, and some enterprises generate terabytes of data every hour of
every day of the year. It’s no longer unheard of for individual enterprises to
have storage clusters holding petabytes of data. We’re going to stop right
there with the factoids: Truth is, these estimates will be out of date by the
time you read this book, and they’ll be further out of date by the time you
bestow your great knowledge of data growth rates on your friends and families
when you’re done reading this book.
When you stop and think about it, it’s
little wonder we’re drowning in data. If we can track and record something, we
typically do. (And notice we didn’t mention the analysis of this stored data,
which is going to become a theme of Big Data—the newfound utilization of data
we track and don’t use for decision making.)
As implied by the term “Big Data,”
organizations are facing massive volumes of data. Organizations that don’t know
how to manage this data are overwhelmed by it. But the opportunity exists, with
the right technology platform, to analyze almost all of the data (or at least
more of it by identifying the data that’s useful to you) to gain a better
understanding of your business, your customers, and the marketplace. And this
leads to the current conundrum
facing today’s businesses across all
industries. As the amount of data available to the enterprise is on the rise,
the percent of data it can process, understand, and analyze is on the decline,
thereby creating the blind zone. What’s in that blind zone? You don’t know: it
might be something great, or may be nothing at all, but the “don’t know” is the
problem (or the opportunity, depending how you look at it).
The conversation about data volumes has
changed from terabytes to petabytes with an inevitable shift to zettabytes, and
all this data can’t be stored in your traditional systems for reasons that
we’ll discuss in this chapter and others.
Variety Is the Spice of Life
The volume associated with the Big Data
phenomena brings along new challenges for data centers trying to deal with it:
its variety. With the explosion of sensors, and smart devices, as well as
social collaboration technologies, data in an enterprise has become complex,
because it includes not only traditional relational data, but also raw,
semistructured, and unstructured data from web pages, web log files (including
click-stream data), search indexes, social media forums, e-mail, documents,
sensor data from active and passive systems, and so on. What’s more,
traditional systems can struggle to store and perform the required analytics to
gain understanding from the contents of these logs because much of the
information being generated doesn’t lend itself to traditional database
technologies. In our experience, although some companies are moving down the
path, by and large, most are just beginning to understand the opportunities of
Big Data (and what’s at stake if it’s not considered).
Quite simply, variety represents all types of data—a fundamental
shift in analysis requirements from traditional structured data to include raw,
semistructured, and unstructured data as part of the decision-making and
insight process. Traditional analytic platforms can’t handle variety. However,
an organization’s success will rely on its ability to draw insights from the
various kinds of data available to it, which includes both traditional and
nontraditional.
When we look back at our database careers,
sometimes it’s humbling to see that we spent more of our time on just 20
percent of the data: the relational kind that’s neatly formatted and fits ever
so nicely into our strict schemas. But the truth of the matter is that 80
percent of the world’s data (and more and more of this data is responsible for
setting new velocity and volume records) is unstructured, or semistructured at
best. If you look at a Twitter feed, you’ll see structure in its JSON
format—but the actual text is not structured, and understanding that can be
rewarding. Video and picture images aren't easily or efficiently stored in a relational database; certain event information (such as weather patterns) can change dynamically, which isn't well suited to strict schemas; and so on.
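To make the Twitter point concrete, here is a minimal sketch (the tweet payload below is invented and simplified, not an actual Twitter API response): the JSON envelope parses cleanly, but the text field inside it is free-form language that still has to be interpreted.

import json

# Hypothetical tweet-like record: the envelope is structured JSON,
# but the "text" field is unstructured natural language.
raw = '{"id": 123, "user": "someone", "text": "Loving the new phone, battery could be better"}'
tweet = json.loads(raw)

print(tweet["id"], tweet["user"])   # reading the structured part is trivial

# Extracting meaning from the free text is the hard part; a naive keyword
# count like this only hints at sentiment.
text = tweet["text"].lower()
score = sum(w in text for w in ("loving", "great")) - sum(w in text for w in ("hate", "could be better"))
print("rough sentiment score:", score)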
To capitalize on the Big Data opportunity,
enterprises must be able to analyze all types
of data, both relational and nonrelational: text,
sensor data, audio, video, transactional,
and more.
How Fast Is Fast? The Velocity of Data
Just as the sheer volume and variety of
data we collect and store has changed, so, too, has the velocity at which it is generated and needs to be
handled. A conventional understanding of velocity typically considers how
quickly the data is arriving and stored, and its associated rates of retrieval.
While managing all of that quickly is good—and the volumes of data that we are
looking at are a consequence of how quickly the data arrives—we believe the idea
of velocity is actually something far more compelling than these conventional
definitions.
To accommodate velocity, a new way of
thinking about a problem must start at the inception point of the data. Rather
than confining the idea of velocity to the growth rates associated with your
data repositories, we suggest you apply this definition to data in motion: The speed at which the data is flowing. After
all, we’re in agreement that today’s enterprises are dealing with petabytes of
data instead of terabytes, and the increase in RFID sensors and other
information streams has led to a constant flow of data at a pace that has made
it impossible for traditional systems to handle.
Sometimes, getting an edge over your
competition can mean identifying a trend, problem, or opportunity only seconds,
or even microseconds, before someone else. In addition, more and more of the
data being produced today has a very short shelf-life, so organizations must be
able to analyze this data in near real time if they hope to find insights in
this data.
Dealing effectively with Big Data requires
that you perform analytics against the volume and variety of data while it is still in motion, not just after it is at rest.
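As a minimal sketch of what analytics on data in motion can look like (the readings and window size are arbitrary assumptions, and a real deployment would use a streaming platform rather than an in-memory list), a rolling window lets you react to each value as it arrives instead of waiting for the data to come to rest:

from collections import deque

def rolling_average(stream, window=3):
    # Yield a running average over the last `window` readings as they arrive.
    buf = deque(maxlen=window)
    for value in stream:
        buf.append(value)
        yield sum(buf) / len(buf)

# Simulated sensor feed; in practice this would be a live, unbounded stream.
readings = [10, 12, 11, 50, 13, 12]   # the spike at 50 can be spotted in flight
for avg in rolling_average(readings):
    print(round(avg, 2))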
1.2 Why big data
• Big Data solutions are ideal for analyzing not only raw structured data, but semi-structured and unstructured data from a wide variety of sources.
• Big Data solutions are ideal when all,
or most, of the data needs to be analyzed versus a sample of the data; or a
sampling of data isn’t nearly as effective as a larger set of data from which
to derive analysis.
• Big Data solutions are ideal for
iterative and exploratory analysis when business measures on data are not
predetermined.
• Is the reciprocal of the traditional analysis paradigm appropriate for the business task at hand? Better yet, can you see a Big Data platform complementing what
task at hand? Better yet, can you see a Big Data platform complementing what
you currently have in place for analysis and achieving synergy with existing
solutions for better business outcomes? For example, typically, data bound for
the analytic warehouse has to be cleansed, documented, and trusted before it’s
neatly placed into a strict warehouse schema (and, of course, if it can’t fit
into a traditional row and column format, it can’t even get to the warehouse in
most cases). In contrast, a Big Data solution is not only going to leverage data
not typically suitable for a traditional warehouse environment, and in massive
amounts of volume, but it’s going to give up some of the formalities and
“strictness” of the data. The benefit is that you can preserve the fidelity of
data and gain access to mountains of information for exploration and discovery
of business insights before
running it through the due diligence that you're accustomed to; that data can then be included as a participant in a cyclic system, enriching the models in the warehouse.
• Big Data is well suited for solving information challenges that don't natively fit within a traditional relational database approach to the problem at hand.
IT
for IT Log Analytics
Log analytics is a common use case for an
inaugural Big Data project. We like to refer to all those logs and trace data
that are generated by the operation of your IT solutions as data exhaust. Enterprises have lots of data exhaust,
and it’s pretty much a pollutant if it’s just left around for a couple of hours
or days in case of emergency and simply purged. Why? Because we believe data
exhaust has concentrated value, and IT shops need to figure out a way to store
and extract value from it. Some of the value derived from data exhaust is
obvious and has been transformed into value-added click-stream data that
records every gesture, click, and movement made on a web site.
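As a minimal sketch of squeezing value out of data exhaust (the log format, field order, and log lines are assumptions for illustration, not a standard), even a simple pass over web server logs can surface the most visited pages and the ones throwing errors:

from collections import Counter

# Hypothetical access-log lines: "timestamp user_id url status"
log_lines = [
    "2024-01-01T10:00:01 u42 /home 200",
    "2024-01-01T10:00:03 u42 /cart 200",
    "2024-01-01T10:00:07 u17 /home 500",
]

hits = Counter(line.split()[2] for line in log_lines)
errors = Counter(line.split()[2] for line in log_lines if line.split()[3] != "200")

print("most visited:", hits.most_common(1))
print("pages with errors:", errors)

At Big Data scale the same counting logic would typically run as a distributed job (for example, a MapReduce-style aggregation) rather than in a single Python process.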
The
Fraud Detection Pattern
Fraud detection comes up a lot in the
financial services vertical, but if you look around, you’ll find it in any sort
of claims- or transaction-based environment (online auctions, insurance claims,
underwriting entities, and so on). Pretty much anywhere some sort of financial transaction is involved, there is potential for misuse and the ubiquitous specter of fraud. If you leverage a Big Data platform, you have the opportunity
to do more than you’ve ever done before to identify it or, better yet, stop it.
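As a sketch only (the thresholds and transaction amounts are invented, and real fraud models are far more sophisticated), even a simple rule such as "flag anything far above a customer's typical spend" illustrates the kind of screening a Big Data platform lets you apply to every transaction rather than a sample:

from statistics import mean, stdev

# Hypothetical amounts for one customer's past transactions
history = [25.0, 30.0, 27.5, 22.0, 31.0]
new_transactions = [28.0, 900.0]

mu, sigma = mean(history), stdev(history)
for amount in new_transactions:
    # Flag anything more than three standard deviations above the usual spend
    if amount > mu + 3 * sigma:
        print(f"possible fraud: {amount}")
    else:
        print(f"ok: {amount}")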
They
Said What? The Social Media Pattern
Perhaps the most talked about Big Data
usage pattern is social media and customer sentiment. You can use Big Data to
figure out what customers are saying about you (and perhaps what they are
saying about your competition); furthermore, you can use this newly found
insight to figure out how this sentiment impacts the decisions you’re making
and the way your company engages. More specifically, you can determine how
sentiment is impacting sales, the effectiveness or receptiveness of your
marketing campaigns, the accuracy of your marketing mix (product, price,
promotion, and placement), and so on.
Social media analytics is a pretty hot
topic, so hot in fact that IBM has built a solution specifically to accelerate
your use of it: Cognos Consumer Insights (CCI). It’s a point solution that runs
on BigInsights and it’s quite good at what it does. CCI can tell you what
people are saying, how topics are trending in social media, and all sorts of
things that affect your business, all packed into a rich visualization engine.
Big
Data and the Energy Sector
The energy sector provides many Big Data
use case challenges in how to deal with the massive volumes of sensor data from
remote installations. Many companies are using only a fraction of the data
being collected, because they lack the infrastructure to store or analyze the
available scale of data. Take for example a typical oil drilling platform that
can have 20,000 to 40,000 sensors on board. All of these sensors are streaming
data about the health of the oil rig, quality of operations, and so on. Not every
sensor is actively broadcasting at all times, but some are reporting back many
times per second. Now take a guess at what percentage of those sensors are
actively utilized. If you’re thinking in the 10 percent range (or even 5
percent), you’re either a great guesser or you’re getting the recurring theme
for Big Data that spans industry and use cases: clients aren’t using all of the
data that’s available to them in their decision-making process. Of course, when
it comes to energy data (or any data for that matter) collection rates, it
really begs the question, “If you’ve bothered to instrument the user or device
or rig, in theory, you’ve done it on
purpose, so why are you not capturing and leveraging the information you are
collecting?”
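A minimal sketch of a first pass over such sensor data (the sensor names and readings are invented for illustration): group the incoming readings by sensor to see which sensors are actually reporting and what they last said.

from collections import defaultdict

# Hypothetical (sensor_id, reading) pairs streamed from a rig
readings = [("temp_01", 85.2), ("pressure_07", 3.1), ("temp_01", 86.0), ("vibration_33", 0.02)]

by_sensor = defaultdict(list)
for sensor_id, value in readings:
    by_sensor[sensor_id].append(value)

for sensor_id, values in by_sensor.items():
    print(sensor_id, "samples:", len(values), "latest:", values[-1])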
Benefits
of Big Data
Using the information kept in social networks like Facebook, marketing agencies learn about the response to their campaigns, promotions, and other advertising media.
Using information from social media, such as the preferences and product perceptions of their consumers, product companies and retail organizations plan their production.
Using the data regarding the previous medical history of patients, hospitals provide better and quicker service.
Why Big data?
1. Understanding and Targeting
Customers
This is one of the biggest and most
publicized areas of big data use today. Here, big data is used to better
understand customers and their behaviors and preferences.
Companies are keen to expand their
traditional data sets with social media data, browser logs as well as text
analytics and sensor data to get a more complete picture of their customers.
The big objective, in many cases, is to create predictive models. You might
remember the example of U.S. retailer Target, which is now able to very accurately predict when one of its customers is expecting a baby. Using big
data, Telecom companies can now better predict customer churn; Wal-Mart can
predict what products will sell, and car insurance companies understand how
well their customers actually drive. Even government election campaigns can be
optimized using big data analytics.
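As a toy sketch of the predictive-model idea (the features, labels, and numbers are fabricated; a real churn model would be trained on far richer data drawn from usage logs, social data, and so on), assuming scikit-learn is available:

from sklearn.linear_model import LogisticRegression

# Features per customer: [monthly_spend, support_calls]; label: 1 = churned
X = [[20, 0], [35, 1], [15, 4], [50, 0], [10, 6], [45, 1]]
y = [0, 0, 1, 0, 1, 0]

model = LogisticRegression().fit(X, y)

# Estimated churn probability for a new, unseen customer
print(model.predict_proba([[12, 5]])[0][1])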
2. Understanding and Optimizing
Business Processes
Big data is also increasingly used to
optimize business processes. Retailers are able to optimize their stock based
on predictions generated from social media data, web search trends and weather
forecasts. One particular business process that is seeing a lot of big data
analytics is supply chain or delivery route optimization. Here, geographic
positioning and radio frequency identification sensors are used to track goods
or delivery vehicles and optimize routes by integrating live traffic data, etc.
HR business processes are also being improved using big data analytics. This
includes the optimization of talent acquisition – Moneyball style, as well as
the measurement of company culture and staff engagement using big data tools.
3. Personal Quantification and
Performance Optimization
Big data is not just for companies and
governments but also for all of us individually. We can now benefit from the
data generated from wearable devices such as smart watches or smart bracelets.
Take the Up band from Jawbone as an example: the armband collects data on our
calorie consumption, activity levels, and our sleep patterns. While it gives
individuals rich insights, the real value is in analyzing the collective data.
In Jawbone's case, the company now collects 60 years' worth of sleep data every night. Analyzing such volumes of data will bring entirely new insights that it can feed back to individual users. The other area where we benefit from big data analytics is finding love online: most online dating sites apply big data tools and algorithms to find us the most appropriate matches.
4. Improving Healthcare and Public
Health
The computing power of big data analytics
enables us to decode entire DNA strings in minutes and will allow us to find
new cures and better understand and predict disease patterns. Just think of what happens when all the individual data from smart watches and wearable devices can be applied to millions of people and their various diseases.
The clinical trials of the future won’t be limited by small sample sizes but could
potentially include everyone! Big data techniques are already being used to monitor
babies in a specialist premature and sick baby unit. By recording and analyzing
every heart beat and breathing pattern of every baby, the unit was able to
develop algorithms that can now predict infections 24 hours before any physical
symptoms appear. That way, the team can intervene early and save fragile babies
in an environment where every hour counts. What’s more, big data analytics
allow us to monitor and predict the developments of epidemics and disease
outbreaks. Integrating data from medical records with social media analytics
enables us to monitor flu outbreaks in real time, simply by listening to what people are saying, e.g. "Feeling rubbish today - in bed with a cold".
5. Improving Sports Performance
Most elite sports have now embraced big
data analytics. We have the IBM SlamTracker tool for tennis tournaments; we use
video analytics that track the performance of every player in a football or
baseball game, and sensor technology in sports equipment such as basketballs or golf clubs allows us to get feedback (via smartphones and cloud servers) on
our game and how to improve it. Many elite sports teams also track athletes outside
of the sporting environment – using smart technology to track nutrition and sleep,
as well as social media conversations to monitor emotional wellbeing.
6. Improving Science and Research
Science and research is currently being
transformed by the new possibilities big data brings. Take, for example, CERN, the European particle physics laboratory near Geneva, with its Large Hadron Collider, the world's largest and most powerful particle accelerator. Experiments to unlock the
secrets of our universe – how it started and works - generate huge amounts of
data. The CERN data center has 65,000 processors to analyze its 30 petabytes of
data. However, it uses the computing powers of thousands of computers
distributed across 150 data centers worldwide to analyze the data. Such
computing powers can be leveraged to transform so many other areas of science
and research.
7. Optimizing Machine and Device
Performance
Big data analytics help machines and
devices become smarter and more autonomous. For example, big data tools are used to operate Google's self-driving car: a Toyota Prius fitted with cameras, GPS, powerful computers, and sensors that let it drive safely on the road without the intervention of human beings. Big data tools are also used to
optimize energy grids using data from smart meters. We can even use big data
tools to optimize the performance of computers and data warehouses.
8. Improving Security and Law
Enforcement.
Big data is applied heavily in improving
security and enabling law enforcement. I am sure you are aware of the
revelations that the National Security Agency (NSA) in the U.S. uses big data
analytics to foil terrorist plots (and maybe spy on us). Others use big data
techniques to detect and prevent cyber attacks. Police forces use big data tools to catch criminals and even predict criminal activity, and credit card companies use big data to detect fraudulent transactions.
9. Improving and Optimizing Cities
and Countries
Big data is used to improve many aspects
of our cities and countries. For example, it allows cities to optimize traffic
flows based on real time traffic information as well as social media and
weather data. A number of cities are currently piloting big data analytics with
the aim of turning themselves into Smart Cities, where the transport infrastructure
and utility processes are all joined up, where a bus would wait for a delayed train, and where traffic signals predict traffic volumes and operate to minimize jams.
10. Financial Trading
My final category of big data application
comes from financial trading. High-Frequency Trading (HFT) is an area where big
data finds a lot of use today. Here, big data algorithms are used to make
trading decisions. Today, the majority of equity trading takes place via data algorithms that increasingly take into account signals from social media networks and news websites to make buy and sell decisions in split seconds.
Operational
Big Data
These include systems like MongoDB that provide
operational capabilities for real-time, interactive workloads where data is
primarily captured and stored.
NoSQL Big Data systems are designed to take
advantage of new cloud computing architectures that have emerged over the past
decade to allow massive computations to be run inexpensively and efficiently.
This makes operational big data workloads much
easier to manage, cheaper, and faster to implement.
Some NoSQL systems can provide insights into
patterns and trends based on real-time data with minimal coding and without the
need for data scientists and additional infrastructure.
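A minimal sketch of an operational, document-oriented workload (this assumes a MongoDB server is running locally, and the URI, database, collection, and field names are made up for illustration): capture an interaction as it happens and query it back interactively, with no fixed schema required up front.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["shop"]["events"]

# Capture the event as it happens; documents need not share a schema
events.insert_one({"user": "u42", "action": "add_to_cart", "item": "sku-123"})

# Interactive, real-time style query over the captured data
for doc in events.find({"action": "add_to_cart"}).limit(5):
    print(doc["user"], doc["item"])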
Let's break down the three pinnacle stages in the evolution of data systems:
■ Dependent (Early Days). Data systems were fairly new and users didn't quite know what they wanted. IT assumed, "Build it and they shall come."
■ Independent (Recent Years). Users understood what an analytical platform was
and worked together with IT to define the business needs and approach for
deriving insights for their firm.
■ Interdependent (Big Data Era). An interactional stage between various companies, creating more social collaboration beyond your firm's walls.