Thursday, February 19, 2015

Structured Data vs. Unstructured Data

Data:

Data is information that has been translated into a form that is more convenient to move or process.



Structured Data:
  • Data that resides in fixed fields within a record or file
  • Well defined content: displayed in rows and columns
  • Can be easily organized and processed by data mining tools
  • Everything is labelled and easy to access
  • Can be stored in an RDBMS

Examples:
  • Databases
  • XML data (often classed as semi-structured, since its tags label each field)
  • Enterprise Systems (ERP, CRM)
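
To make the "fixed fields, rows and columns" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for an RDBMS. The table name, column names, and sample rows are invented for illustration only.

```python
import sqlite3

# In-memory database standing in for an RDBMS.
# The table, columns, and rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ben", "Austin")],
)

# Every value sits in a labelled field, so it can be fetched by column name.
cities = [row[0] for row in conn.execute("SELECT city FROM customers ORDER BY id")]
print(cities)
```

Because each record has the same labelled fields, a query can ask for one column directly without parsing any free text.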

Unstructured Data:
  • Has no identifiable internal structure
  • Data does not have a pre-defined data model
  • Is not organized in a pre-defined manner
  • Unstructured information is typically text-heavy, but may also contain dates, numbers, and facts. The resulting irregularities and ambiguities make it harder for traditional computer programs to interpret than structured data
  • An RDBMS is not a good fit for storing it

Examples:
  • Word documents
  • Email messages
  • Audio/video files
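
As a rough illustration of why dates and numbers buried in free text are harder to work with, the sketch below pulls dates out of a made-up email body with regular expressions. The email text and the two date patterns are assumptions chosen for the example, not a general-purpose parser.

```python
import re

# Hypothetical email text; nothing labels which words are dates.
email_body = (
    "Hi team, the shipment left on 2015-02-19 and should arrive by 03/04/2015. "
    "Call me if it is late."
)

# Each date format needs its own pattern, and real text has many more formats.
iso_dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", email_body)
slash_dates = re.findall(r"\b\d{2}/\d{2}/\d{4}\b", email_body)

print(iso_dates)
print(slash_dates)
# Even once found, "03/04/2015" is ambiguous: it could mean 3 April or 4 March.
```

In a structured record the same values would sit in typed date fields, so neither the pattern matching nor the guess about day and month order would be needed.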



Data warehouse:

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.


Types of data:

Historical Data:
  • Typically contains several years of historical data
  • The amount of data retained depends on the available disk space
Derived Data:
  • Generated from existing data using a mathematical operation or a data transformation
  • Generated at run time in response to a query
Metadata:
  • It describes the data and the schema objects
  • Used by applications to fetch and compute the data directly
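
The derived data and metadata described above can be sketched with Python's built-in sqlite3 module. The sales table and its figures are invented for illustration, and SQLite's PRAGMA table_info is used here only as one convenient way to read schema metadata; a real warehouse would expose its own catalog.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical historical data: one row per sale.
conn.execute("CREATE TABLE sales (year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(2013, 100.0), (2014, 150.0), (2014, 50.0)],
)

# Derived data: yearly totals are not stored anywhere;
# they are computed at run time in response to this query.
totals = conn.execute(
    "SELECT year, SUM(amount) FROM sales GROUP BY year ORDER BY year"
).fetchall()

# Metadata: the schema describes the data. Each row returned by
# PRAGMA table_info is (cid, name, type, notnull, dflt_value, pk).
columns = [row[1] for row in conn.execute("PRAGMA table_info(sales)")]

print(totals)
print(columns)
```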

Data Warehouse Architecture:

Basic:
  • End users directly access the data derived from several source systems through the data warehouse
With a staging area:
  • Operational data must be cleaned and processed before it is loaded into the warehouse
  • A staging area does this by cleansing and consolidating the operational data coming from multiple source systems
With a staging area and data marts:
  • Used to customize the warehouse architecture for different groups within the organization
  • Data marts are systems designed for a particular line of business
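
The staging-area and data-mart flow above can be sketched in a few lines of Python. The source rows, field names, and the stage function are hypothetical; real pipelines usually rely on dedicated ETL tooling rather than hand-written loops.

```python
# Hypothetical operational records from two source systems,
# with inconsistent spacing, casing, and types.
crm_rows = [{"customer": " Asha ", "region": "west", "amount": "120.50"}]
erp_rows = [
    {"customer": "BEN", "region": "East", "amount": "80"},
    {"customer": "", "region": "East", "amount": "10"},  # missing customer name
]


def stage(rows):
    """Cleanse raw rows: trim and normalize text, convert amounts, drop bad rows."""
    cleaned = []
    for row in rows:
        name = row["customer"].strip().title()
        if not name:
            continue  # reject records that cannot be attributed to a customer
        cleaned.append(
            {
                "customer": name,
                "region": row["region"].strip().title(),
                "amount": float(row["amount"]),
            }
        )
    return cleaned


# Consolidate the cleansed data from both sources into the warehouse.
warehouse = stage(crm_rows) + stage(erp_rows)

# A data mart is a subset tailored to one line of business, here the East region.
east_mart = [row for row in warehouse if row["region"] == "East"]

print(warehouse)
print(east_mart)
```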

Limitations of Data Warehousing:

Extra Reporting Work:
  • The larger the organization, the more data it produces
  • Each business division generates the data needed in the warehouse
  • Generating reports is not easy and requires significant effort
Cost benefit ratio:
  • Involves a lot of man-hours
  • Requires a large investment for the implementation
Huge maintenance cost:
  • The cost of updating the warehouse to adapt to changing business needs is high
Data Ownership Concerns:
  • Data warehouses are often, though not always, delivered as Software as a Service or cloud applications, so security is a constant concern
  • A data warehouse that leaks customer data is a privacy and public relations nightmare
Data flexibility:
  • The data is normally days or weeks old before it is actually used
  • Because the queries are ad hoc, they are difficult to tune for processing and query speed

Future of Data Warehousing:
  • Hadoop will serve as a companion to the data warehouse, taking on the heaviest workloads and the largest volumes of data
  • A data warehouse of customer information can be used for sentiment analysis, personalization, marketing automation, sales, and customer service
  • Data warehouses hold some of the most valuable data an organization needs to grow and stay competitive. Organizations will therefore depend on them far more, and data warehousing will play a major role in decision making
  • Enterprise data warehouses will face major changes from data warehouse automation. Just as ETL scripts are no longer “hand coded”, 2015 is foreseen as the year in which data modeling and database administration are productized to speed up “time to implementation”
  • Processing data and analytics in the cloud will become a requirement

