What Is Big Data?
First of all, what is Big Data? In its purest form, Big Data describes the massive volume of structured and unstructured data that is so large it is difficult to process using traditional techniques. So Big Data is just what it sounds like – a whole lot of data.
The concept of Big Data is relatively new, and it reflects both the increasing amount and the increasingly varied types of data now being collected. Proponents of Big Data often refer to this as the “datification” of the world. As more and more of the world’s information moves online and becomes digitized, analysts can start to use it as data. Social media, online books, music, videos and the proliferation of sensors have all contributed to the astounding increase in the amount of data available for analysis.
Everything you do online is now stored and tracked as data. Reading a book on your Kindle generates data about what you’re reading, when you read it, how fast you read it and so on. Similarly, listening to music generates data about what you’re listening to, when, how often and in what order. Your smartphone is constantly uploading data about where you are, how fast you’re moving and what apps you’re using.
What’s also important to keep in mind is that Big Data isn’t just about the amount of data we’re generating, it’s also about all the different types of data (text, video, search logs, sensor logs, customer transactions, etc.). In fact, Big Data has four important characteristics that are known in the industry as the 4 V’s:
- Volume – the increasing amount of data that is generated every second
- Velocity – the speed at which data is being generated
- Variety – the different types of data being generated
- Veracity – the trustworthiness and messiness of data, e.g. its incomplete or unstructured nature
Given the incredible amount, speed, variety and messiness of the data we are now generating and storing, it’s no surprise that it quickly became unmanageable using traditional storage and analysis methods. This is where the term Big Data becomes confusing, because it is often used to refer to the new technologies, tools and processes that have sprung up to accommodate this vast amount of data.
Glossary of Big Data Terms
Inevitably, much of the confusion around Big Data comes from the variety of terms that have sprung up around it, many of them new to most people. Here is a quick run-down of the most popular ones:
- Algorithm – a set of rules or steps followed by software to analyze data
- Amazon Web Services (AWS) – collection of cloud computing services that help businesses carry out large-scale computing operations without needing the storage or processing power in-house
- Cloud (computing) – running software on remote servers rather than locally
- Data Scientist – an expert in extracting insights and analysis from data
- Hadoop – collection of programs that allow for the storage, retrieval and analysis of very large data sets
- Internet of Things (IoT) – refers to objects (like sensors) that collect, analyze and transmit their own data (often without human input)
- Predictive Analytics – using analytics to predict trends or future events
- Structured v unstructured data – structured data is anything that can be organized in a table so that it relates to other data in the same table; unstructured data is everything that can’t.
- Web scraping – the process of automating the collection and structuring of data from web sites
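The last two glossary entries can be made concrete with a short sketch. The snippet below, a minimal illustration using only Python’s standard library, shows the “structuring” half of web scraping: it parses an HTML table (here a hardcoded string standing in for a page you would normally download first, e.g. with `urllib.request`) and turns the unstructured markup into structured rows. The page content and the `TableScraper` class are hypothetical examples, not a production scraper.

```python
from html.parser import HTMLParser

# Stand-in for HTML fetched from a web site; a real scraper would
# download this first (e.g. with urllib.request.urlopen).
PAGE = """
<table>
  <tr><td>Alice</td><td>34</td></tr>
  <tr><td>Bob</td><td>29</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collects the text of <td> cells into rows, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []          # structured output: a list of rows
        self._row = None        # the row currently being built
        self._in_cell = False   # are we inside a <td>?

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

scraper = TableScraper()
scraper.feed(PAGE)
print(scraper.rows)  # [['Alice', '34'], ['Bob', '29']]
```

Once the cells are in lists like this, the data is “structured” in the glossary’s sense: each row relates to the others by sharing the same columns, so it can be loaded into a table for analysis.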