In today’s technologically dependent and interconnected world, data comes in many forms. There is structured data, such as the numeric fields of a database. There is also semi-structured and unstructured data, such as text files and records of web interaction. In fact, unstructured data is the most common form, lurking in places where data is not usually thought to exist. Analytical work today requires extensive time spent putting data into a structured form and preparing it for analysis. Consequently, understanding the different data types is vital for analytical success.
Structured and unstructured data are both used extensively in big data analysis. Historically, because of limited processing capability, inadequate memory, and high data-storage costs, relying on structured data was the only way to manage data effectively. More recently, the use of unstructured data sources in analytics has skyrocketed, thanks to cheaper and more widely available storage and the sheer number of complex data sources.
Structured data is the most familiar kind. It covers all data that can be stored in a relational database queried with SQL, in tables with rows and columns. Records typically have a key that relates them to other tables, and they can be easily mapped into pre-designed fields. Today, this is the data most often processed in software development and the simplest kind of information to manage. Unfortunately, structured data is estimated to represent only 5 to 10% of all data.
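To make "tables with rows and columns" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and values are made up for illustration and are not taken from any real system.

```python
import sqlite3

# Hypothetical schema: every name and value here is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL REFERENCES customers(customer_id), "
    "amount REAL NOT NULL)"
)

# Every row must fit the pre-designed fields of its table.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "London"), (2, "Grace", "New York")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5)],
)

# The key (customer_id) relates the two tables, so a join is straightforward.
rows = conn.execute(
    "SELECT c.name, SUM(o.amount) "
    "FROM customers c JOIN orders o ON o.customer_id = c.customer_id "
    "GROUP BY c.customer_id ORDER BY c.name"
).fetchall()
totals_by_customer = dict(rows)
```

Because the fields are fixed in advance, a question like "how much did each customer spend?" takes a single query. That convenience is what the other data types give up.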
Structured data leaves out immense amounts of material that do not fit simply into a firm’s organization of information. Until recently, structured data was supplemented by this additional information in the form of paper or microfiche. With the improvement of processing by computers, lowered cost of data storage, and the spread of new formats of data, semi-structured data and unstructured data are saturating businesses.
Semi-structured data is information that does not reside in a relational database but that does have some organizational properties that make it easier to analyze. With some processing, it can be stored in a relational database, but the semi-structured form exists to save space and to keep the data clear and easy to compute on. JSON and XML documents are common examples. Some NoSQL databases are optimized to store semi-structured data.
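As a rough illustration of the "some processing" needed to move semi-structured data into a tabular form, the sketch below flattens two JSON records with different shapes into rows that share one set of columns. The records, the `flatten` helper, and the dotted column naming are all assumptions made for this example.

```python
import json

# Hypothetical records: note that the second one lacks "tags" and "address".
raw_records = [
    '{"id": 1, "name": "Ada", "tags": ["vip"], "address": {"city": "London"}}',
    '{"id": 2, "name": "Grace"}',
]


def flatten(record, prefix=""):
    """Turn nested dictionaries into a single level of dotted column names."""
    flat = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, e.g. address.city.
            flat.update(flatten(value, prefix=f"{column}."))
        elif isinstance(value, list):
            # Simplification: join list items into one comma-separated string.
            flat[column] = ",".join(str(item) for item in value)
        else:
            flat[column] = value
    return flat


flat_rows = [flatten(json.loads(line)) for line in raw_records]

# The union of all keys becomes the column list; missing values become None.
columns = sorted({column for row in flat_rows for column in row})
table = [[row.get(column) for column in columns] for row in flat_rows]
```

The records carry their own field names, which is the "some organizational properties" part. The gaps that show up as `None` are the reason they do not map cleanly onto pre-designed fields.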
Unstructured data represents around 80% of data! It often includes text and multimedia content, e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents. While these sorts of files may have an internal structure, they are still considered unstructured because the data they contain does not fit neatly in a database. Consequently, unstructured data is everywhere. In fact, most individuals and organizations conduct their lives around unstructured data.
The fundamental challenge of unstructured data sources is that they are difficult for nontechnical business users and data analysts alike to unbox, understand, and prepare for analytic use. Beyond issues of structure, there is the sheer volume of this type of data. Because of this, current data mining techniques often leave out valuable information and make analyzing unstructured data laborious and expensive.
Most of the data used in corporate settings is unstructured, or sometimes semi-structured. For example, for an online retailer, click counts and visitor information are extremely important. While this information can contain a multitude of fields, it does not have to. This data can be stored in a graph database so that, as one basic use case, the retailer can track a customer’s journey through its website. A lot of preparation needs to go into cleaning and processing this data to make it usable.
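The customer-journey idea can be sketched without a real graph database. In the example below, a plain Python `Counter` stands in for the graph: pages are nodes and each page-to-page transition is a weighted edge. The event tuples, page names, and the `build_journey_graph` function are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical click events: (visitor_id, sequence_number, page).
events = [
    ("v1", 1, "home"), ("v1", 2, "product/42"), ("v1", 3, "cart"),
    ("v2", 1, "home"), ("v2", 2, "search"), ("v2", 3, "product/42"),
    ("v1", 4, "checkout"),  # arrives out of order, as raw logs often do
]


def build_journey_graph(events):
    """Return each visitor's ordered page list and the edge counts between pages."""
    pages_by_visitor = defaultdict(list)
    # Sort by visitor, then by sequence number, so each journey is in order.
    for visitor, _sequence, page in sorted(events, key=lambda e: (e[0], e[1])):
        pages_by_visitor[visitor].append(page)

    edges = Counter()
    for pages in pages_by_visitor.values():
        # Each consecutive pair of pages is one step in the journey.
        for source, target in zip(pages, pages[1:]):
            edges[(source, target)] += 1
    return dict(pages_by_visitor), edges


journeys, edges = build_journey_graph(events)
```

Even in this toy version, the events have to be grouped and ordered before any journey can be read off, which is a small taste of the preparation work involved.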
One of the most common metrics in retail is success, for example a customer adding something to their cart. However, a visitor can make multiple visits in a day, or can have an extensive journey with multiple success actions on various products. Truly understanding how a product page is doing requires extensive aggregation and cleaning of this data. Trying to collect the data in a structured way from the start, on the other hand, would mean missing out on other valuable information. Sometimes that preparation work is the price you have to pay to achieve a competitive advantage with predictive analytics!
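Here is one possible way to do that aggregation, as a sketch under stated assumptions: a "success" is an `add_to_cart` action, and repeat visits or repeat successes by the same visitor on the same day are counted once per product. The event data, the action names, and the `success_rate_by_product` function are invented for illustration, and other definitions of success are equally reasonable.

```python
from collections import defaultdict

# Hypothetical events: (visitor_id, date, product, action).
events = [
    ("v1", "2024-01-01", "shoes", "view"),
    ("v1", "2024-01-01", "shoes", "add_to_cart"),
    ("v1", "2024-01-01", "shoes", "view"),         # second visit, same day
    ("v1", "2024-01-01", "shoes", "add_to_cart"),  # repeated success action
    ("v2", "2024-01-01", "shoes", "view"),
    ("v2", "2024-01-02", "shoes", "view"),
    ("v2", "2024-01-02", "shoes", "add_to_cart"),
    ("v1", "2024-01-01", "hat", "view"),
]


def success_rate_by_product(events, success_action="add_to_cart"):
    """Share of (visitor, day) pairs per product with at least one success."""
    seen = defaultdict(set)
    succeeded = defaultdict(set)
    for visitor, day, product, action in events:
        visitor_day = (visitor, day)
        # Sets deduplicate repeat visits and repeat successes automatically.
        seen[product].add(visitor_day)
        if action == success_action:
            succeeded[product].add(visitor_day)
    return {
        product: len(succeeded[product]) / len(visitor_days)
        for product, visitor_days in seen.items()
    }


rates = success_rate_by_product(events)
```

Counting raw `add_to_cart` events instead would make the shoes page look better than it is, since one enthusiastic visitor would be counted twice.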
What are your thoughts? How much of your time is devoted to data preparation? Let us know in the comments below!
The SaberSmart Team