Why is Big Data so important?
Every so often a new term or phrase permeates the technical IT arena. From “Y2K” to “Agile” to “Cloud” there has always been a series of buzzwords that everyone seems to want to talk about but no one really understands. Big Data is one of those terms.
It’s not just that Big Data is the latest catchy phrase. Like so much in tech, there is an underlying substance that causes the excitement, but it often gets lost in a wave of hysteria and overhyped marketing. This article explains where Big Data came from and why it is so important.
What is Big Data?
Put simply, Big Data is the next generation of data storage that will inevitably replace the traditional SQL database. If an organisation is running some flavour of SQL, whether it is Oracle, SQL Server, MySQL, Postgres or even Access, then the medium of its data storage is now officially obsolete. Structured Query Language is not going to disappear overnight, but it is safe to say that it is an inferior technology to what Big Data offers. SQL will gradually be supplanted as the primary mode of information storage by Big Data.
This is all very well but what exactly is Big Data?
The simple answer to that question is that Big Data is the back-end of a search engine. Technology that was once the preserve of search companies has now begun to trickle down into the mainstream. The once dominant database is now in the process of being replaced by exactly the same technology that runs every time someone enters a search into Google.
The NoSQL movement
About 15 years ago, system designers realised there was a fundamental problem with the way they stored data. They could take significant amounts of data and store it in a well-designed, normalised database, but there was always a limit. No matter how well they designed the system, eventually they would hit a bottleneck in how the data was read or written, and this inevitably meant the database would slow down.
Much of the software architecture of this era was concerned with working around this limitation, and designers came up with ever more ingenious ways of extending the size of their databases. Warehousing, partitioned views and messaging all meant that more could be squeezed out of the hardware. Unfortunately, even the most adventurous designers always conceded that a limit existed and that eventually it would defeat them.
When the Internet arrived it led to a general explosion in the amount of data, and everyone realised that traditional SQL, a technology that had originated in the 1970s, was no longer suitable. The best minds in data storage got together and came up with a solution, one so radical and revolutionary that they labelled it NoSQL, almost to highlight the definitive break from the past.
The internet search engines were only just emerging at this time, and they were the first to jump on the bandwagon. The volumes of data that had to be stored to provide adequate searching for users were so vast that NoSQL was really the only option. Yahoo and Google were the first real early adopters, and NoSQL became the preserve of serious search engineers. The technology was young and the early adopters were keen to keep the knowledge in-house, so it was not widely shared. Most regular developers were still working with datasets that could be handled in SQL, so it was not really an issue.
Gradually that knowledge began to seep into the wider community, and eventually someone decided that the term NoSQL was a bit confusing, peeled the label off the box and replaced it with one called Big Data. It was catchier.
In the modern world, with multitudes of connected devices all sending and receiving data via the Internet, Big Data has become almost a necessity. The level of detail that can be captured thanks to the extra capacity means that a modern system can record almost any conceivable event and store it.
Data sets of epic proportions can now be analysed to obtain the most low-level insights. Realising that the customers most likely to buy were the ones with brown hair who get out of bed on the left side meant that sales analysts could finally target specific segments with highly focussed advertising. Their sales techniques became more effective, and gradually the term Big Data started to take on a different meaning.
In marketing circles, Big Data means the process of delving into a gigantic data set to extract useful information. “We should make use of Big Data” became a phrase all too popular in sales meetings, and so confusion arose between the underlying technology and the act of actually using it.
To a technical professional who designs systems, the term Big Data still means NoSQL. To a marketing professional, and to those in the media who deal with them most, it means the act of drilling into an exceptionally large data set to find insights.
How does it work?
The fundamental difference between NoSQL and a database lies in what technicians call consistency. When you add or change data in a database, the next user to access the data will always get the newly updated version. If you have ever used a database then you will be familiar with the idea of a transaction. A user cannot read data that is in the process of being modified by another user. Only when the data has been committed, and is therefore correct, can the second user retrieve it.
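This transactional guarantee can be seen with Python’s built-in sqlite3 module. The table and column names below are invented for illustration; the point is simply that once a change is committed, every subsequent read sees the new value and never a half-done state.

```python
import sqlite3

# A minimal sketch of transactional behaviour using the standard sqlite3
# module; the "accounts" table is a made-up example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
conn.commit()

# Modify the data inside a transaction, then commit it.
conn.execute("UPDATE accounts SET balance = 50 WHERE name = 'alice'")
conn.commit()

# Any read after the commit sees the updated value.
balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'"
).fetchone()[0]
print(balance)  # 50
```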
There is no easy equivalent of a transaction in NoSQL. The term most often used is “eventual consistency”, and it means that the next user who comes along after a change has been made *may* still get the old data. Technically the data is still atomic, in that a write is either applied in full or rolled back, but stale reads of old data can now occur. This causes huge problems for system designers who are used to the safety of full transactional consistency.
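The effect can be sketched with a toy simulation: a write lands on a primary copy immediately but only reaches a replica when replication runs, so a reader who hits the replica in between sees the old value. Every name here is illustrative; this is not a real NoSQL API.

```python
# Toy model of eventual consistency: one primary copy, one lagging replica.
primary = {"user:42": "old address"}
replica = dict(primary)

def write(key, value):
    primary[key] = value      # the write is acknowledged here

def replicate():
    replica.update(primary)   # replication happens some time later

def read(key):
    return replica[key]       # this reader happens to hit the replica

write("user:42", "new address")
print(read("user:42"))        # "old address" -- a stale read
replicate()
print(read("user:42"))        # "new address" -- eventually consistent
```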
The reason for this is simple. The designers of NoSQL realised that when a data set gets too big there is simply no feasible way of searching all of it in one go. If a user queries a dataset of 100 billion records, then no single computer on Earth can trawl through that volume in a reasonable time. It was an issue that stumped the best brains in the profession.
The solution was simple
The result of the query had to be calculated in advance. This may sound like madness, but there is one surprising fact that very few people appreciate: most of the time, servers are doing nothing. They sit, fully powered, waiting for a user to come along and request some data. Rather than wait until the user makes the request and then do the work, it makes sense to use this idle time to pre-calculate the result before the request actually arrives. That does mean the system has to calculate every possible result, but it turns out servers spend most of their time sleeping, and cloud computing will scale as needed anyway.
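The “answer before the question” idea can be sketched in a few lines: during idle time the server rolls the raw data up into every result it might be asked for, so serving a query is just a lookup. The dataset and the query are invented for illustration.

```python
# Raw event data as it arrives (a made-up example).
orders = [
    {"region": "north", "total": 120},
    {"region": "south", "total": 80},
    {"region": "north", "total": 40},
]

# Done ahead of time, while the server would otherwise be idle:
# every possible per-region answer is computed and stored.
precomputed = {}
for order in orders:
    precomputed[order["region"]] = (
        precomputed.get(order["region"], 0) + order["total"]
    )

# When the user finally asks, the work is already done.
def query_total(region):
    return precomputed[region]

print(query_total("north"))  # 160
```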
Once the transactional requirement was removed, it also became possible to split a very large data set into chunks and store each chunk on a different server. Each server could then pre-calculate based on its own reduced dataset.
When a user searches all the data in a single query, the request is split into separate requests and each server searches the chunk of data that it holds. Each server then returns its own result to the controlling server, and these results are reduced into the single result that the user finally gets.
The algorithm that provides this functionality is called MapReduce. A task is mapped into a series of smaller tasks that are distributed, and the multiple results are then reduced to a single response for the user.
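The pattern can be shown with the classic word-count example: the data is split into chunks (one per “server”), each chunk is mapped to a partial count, and the partial results are reduced into one answer. This sketch runs in a single process; a real system would distribute the chunks across machines.

```python
from collections import Counter

# Each string stands in for the chunk of data one server holds.
chunks = [
    "big data is big",        # held by server 1
    "data is everywhere",     # held by server 2
]

def map_chunk(text):
    # Map step: each server counts the words in its own chunk.
    return Counter(text.split())

def reduce_counts(partials):
    # Reduce step: the controlling server merges the partial results.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

result = reduce_counts(map_chunk(c) for c in chunks)
print(result["big"], result["data"])  # 2 2
```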
When you do a search on Google, there may be thousands of servers involved in producing your result. Each server is occupied for only the briefest of milliseconds, but the combined processing power is so vast that it outstrips any solitary database, and the result is almost instant.
I say search, but what I really mean is accessing the pre-calculated results that each individual server has already worked out while the user was doing something else.
Summary
There is simply no comparison between a Big Data system and an SQL database. Searching happens in a radically different way, almost back to front. Because all accessed data is pre-calculated, the source data no longer has to be stored in a fixed format and can be kept raw. It is the calculation process of indexing that gives the data meaning.
Those who are familiar with databases have to essentially forget every rule and technique they have used in the past and adopt an entirely new way of thinking. Things that were easy in a database are now very difficult in Big Data, while things that were impossible in a database are now trivial in a NoSQL data store.
The advantages are overwhelming. Systems no longer have any hard limits, and when combined with cloud infrastructure it is possible to build a mechanism with effectively no top end. If a billion new customers appear out of nowhere, the system will simply expand to meet the demand. You might get a much larger bill from the cloud hosting company, but no one will miss out on placing their order.
The downside is that an entire generation of technical professionals has to retrain in a radically new technology, one that is significantly more challenging, both conceptually and in implementation, than what went before.
Yatter uses NoSQL everywhere. We have no database. Most of the calculated information you access on Yatter was worked out well before you needed it.
Free accounts for a limited time only!
Claim yours today