Thursday, February 19, 2015

Structured Data vs. Unstructured Data

Data:

Data is information that has been translated into a form that is more convenient to move or process.



Structured Data:
  • Data that resides in fixed fields within a record or file
  • Well-defined content, displayed in rows and columns
  • Can be easily organized and processed by data mining tools
  • Everything is labelled and easy to access
  • Can be stored in an RDBMS

Examples:
  • Databases
  • XML data (often classed as semi-structured, since its tags label each field)
  • Enterprise Systems (ERP, CRM)

Unstructured Data:
  • Has no identifiable internal structure
  • Data does not have a pre-defined data model
  • Is not organized in a pre-defined manner
  • Unstructured information is typically text-heavy, but may also contain dates, numbers, and facts. The resulting irregularities and ambiguities make it harder for traditional computer programs to interpret than structured data
  • An RDBMS is not a good fit for storing it

Examples:
  • Word documents
  • Email messages
  • Audio/video files
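
The difference can be sketched in a few lines of Python. This is only an illustration: the order record, the email text, and the extract_date helper are all made up for the example. The structured record is read by field name, while the same fact has to be pattern-matched out of the free text, and the pattern would miss a date written in any other format.

```python
import re

# Structured: every value sits in a named, fixed field (hypothetical record).
order_row = {
    "order_id": 1001,
    "customer": "Asha",
    "order_date": "2015-02-19",
    "amount": 250.0,
}

# Unstructured: the same facts buried in free text (hypothetical email body).
email_body = "Hi team, Asha placed order 1001 on 2015-02-19 and paid 250.0 dollars for it."


def extract_date(text):
    """Return the first YYYY-MM-DD date found in text, or None if there is none."""
    match = re.search(r"\d{4}-\d{2}-\d{2}", text)
    return match.group(0) if match else None


structured_date = order_row["order_date"]      # direct lookup by field name
unstructured_date = extract_date(email_body)   # pattern matching on raw text
```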



Data warehouse:

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.


Types of data:

Historical Data:
  • Typically contains several years of historical data
  • The amount of data retained depends on available disk space
Derived Data:
  • Generated from existing data using a mathematical operation or a data transformation
  • Generated at run time in response to a query
Metadata:
  • It describes the data and the schema objects
  • Used by applications to fetch and compute the data directly
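
A minimal sketch of derived data and metadata, using Python's built-in sqlite3 module as a stand-in for a warehouse database. The sales table and its rows are invented for the example. The yearly totals are not stored anywhere; they are derived at run time by the query. The column list comes from the database's own metadata.

```python
import sqlite3

# In-memory database standing in for a warehouse (table and rows are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_year INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (2013, "East", 100.0),
        (2013, "West", 150.0),
        (2014, "East", 120.0),
        (2014, "West", 180.0),
    ],
)

# Derived data: yearly totals are computed when the query runs, not stored.
yearly_totals = conn.execute(
    "SELECT sale_year, SUM(amount) FROM sales GROUP BY sale_year ORDER BY sale_year"
).fetchall()

# Metadata: PRAGMA table_info describes each column (index 1 = name, index 2 = type).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(sales)")]

conn.close()
```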

Data Warehouse Architecture:

Basic:
  • End users directly access the data derived from several source systems through the data warehouse
With a staging area:
  • Operational data must be cleaned and processed before it is loaded into the warehouse
  • A staging area accomplishes this by cleansing and consolidating the operational data coming from multiple source systems
With a staging area and data marts:
  • Used to customize warehouse architecture
  • Data marts are systems designed for a particular line of business
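
What a staging area does can be sketched roughly as follows. The source names (crm_rows, erp_rows), the record layout, and the cleaning rules are assumptions made for illustration, not a description of any particular product. Each raw record is normalized, and records from all sources are consolidated with exact duplicates dropped.

```python
def clean_record(record):
    """Normalize one raw record: trim and title-case the name, parse the amount."""
    return {
        "customer": record["customer"].strip().title(),
        "amount": float(record["amount"]),
    }


def stage(*sources):
    """Cleanse records from every source and consolidate them without duplicates."""
    seen = set()
    staged = []
    for source in sources:
        for raw in source:
            cleaned = clean_record(raw)
            key = (cleaned["customer"], cleaned["amount"])
            if key not in seen:  # skip records already staged from another source
                seen.add(key)
                staged.append(cleaned)
    return staged


# Two hypothetical source systems holding overlapping, inconsistently formatted data.
crm_rows = [{"customer": "  asha rao ", "amount": "250.0"}]
erp_rows = [
    {"customer": "Asha Rao", "amount": 250},
    {"customer": "ben lee", "amount": "99.5"},
]

warehouse_rows = stage(crm_rows, erp_rows)
```

Only the cleaned, de-duplicated rows in warehouse_rows would then be loaded into the warehouse.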

Limitations of Data Warehousing:

Extra Reporting Work:
  • The larger the organization, the greater the amount of data
  • Each business division generates the data needed in the warehouse
  • Generating reports is not easy and requires significant effort
Cost benefit ratio:
  • Involves a lot of man-hours
  • Requires a large investment for the implementation
Huge maintenance cost:
  • The cost of updating the warehouse to adapt to the changing business needs is too high
Data Ownership Concerns:
  • Data warehouses are often, though not always, Software as a Service implementations or cloud applications, so security is always a concern
  • A data warehouse that leaks customer data is a privacy and public relations nightmare
Data flexibility:
  • The data is normally days or weeks old before it is actually used
  • Because the queries are ad hoc, it is difficult to tune them for processing speed and query speed

Future of Data Warehousing:
  • Hadoop will serve as a great companion to the data warehouse and will be used to share the heaviest workloads and larger volumes of data
  • A data warehouse of customer information can be used for sentiment analysis, personalization, marketing automation, sales, and customer service
  • Data warehouses hold some of the most valuable data an organization needs to grow and stay competitive. Organizations will therefore depend on them far more heavily, and data warehousing will play a major role in decision making
  • Enterprise data warehouses will face huge changes from the world of data warehouse automation. Just as we no longer “hand code” ETL scripts, 2015 is foreseen as the year in which data modeling and database administration are productized to speed up “time to implementation”
  • Processing data and analytics in the cloud will become a requirement


Monday, February 2, 2015

BI tools evaluation

The following BI tools have been evaluated in this blog:

  • Tableau
  • Qlik
  • MicroStrategy
  • Oracle
  • SAS


Tableau:


Tableau is a streamlined, user-friendly business intelligence solution that provides a simple, quick way for non-experts to access data and create their own dashboards in just a few clicks. Tableau is tailored to meet the needs of anyone looking to analyze and explore business data. It provides business intelligence that is actionable and insightful. The tool can be learned very quickly by watching the video tutorials and exploring it.

Pros:
  • Has a very intuitive UI with drag-and-drop tools that allow non-technical people to use it with ease
  • Easy to learn
  • It can be integrated with R
  • It has ready-made drivers for many databases
  • Helps create instant, real-time dashboards
  • Can connect to cube-based data sources
  • Has an active online community
Cons:
  • The in-memory engine is not the fastest
  • The manual mapping of datatypes that are not recognized is cumbersome
  • Has trouble when working with large datasets
  • There is no option to create custom groups for different dimensions

Qlik:


Qlik is a self-service BI tool built for non-technical professionals. It combines engaging graphics with the consolidation of data from multiple sources into a single place to greatly simplify data analysis.

Pros:
  • Good in-memory processor which speeds up the application
  • Integrates with data sources with ease
  • Can be easily deployed and configured
  • Has a large number of partners
Cons:
  • The menus have too many tabs that lack logical structure
  • Building visuals with drag and drop is not as intuitive as in Tableau
  • The online community is not very active
  • Customer support is not very good
  • Qlik Applications are constrained by how much RAM can be addressed in a single hardware box

MicroStrategy:


MicroStrategy is an enterprise BI software vendor. Its platform offers interactive dashboards, easy and intuitive control over data layout, alerts, and automated reports, and supports web, desktop, and mobile interfaces.

Pros:
  • Scalable and can be used across platforms such as mobile, desktop, and cloud
  • Capable of handling complex enterprise requirements
  • The SDK allows customization of applications
  • Supports offline access to data
Cons:
  • The online community is dormant
  • The development speed is slow
  • The default graphics are not presentable as generated, and formatting them takes a long time

Oracle:


Oracle provides an all-in-one BI solution featuring eight components, so that users don't have to worry about multiple software packages or higher costs.

Pros:
  • Supports big data capability
  • Very good training is provided
  • Can analyze large sets of data in a short time
  • Has a user-friendly interface
Cons:
  • Customizing the software requires a significant investment of time
  • Has integration issues


SAS:


SAS Visual Analytics is an in-memory data visualization tool that works well with both big and small data, providing robust query and reporting features, alerts, and predictive analytics.

Pros:
  • Integration is powerful
  • Has a huge market share
  • Extremely fast and efficient
  • Deals easily with large amounts of data
  • Supports the R programming language
Cons:
  • Does not have a user-friendly interface
  • Cost is on the higher side
  • The visualizations provided are not very visually appealing

Criteria:

  • Ease of use
Users should be able to use the application easily, with minimal support or training.
  • Integration
The application should integrate seamlessly with multiple data sources.
  • Cost
The application should be cost effective.
  • Customer support/Online community
Customer support should be responsive, and problems should be resolved in a minimal number of calls. The online support community should be large, and solutions to the issues users face should be easy to find.
  • Performance
The application should perform well and not crash while dealing with large datasets.

Criteria                           Weight  Tableau  Qlik  MicroStrategy  Oracle  SAS
Ease of use                        40%     10       9     7              7       7
Integration                        10%     9        9     8              7       9
Cost                               20%     10       8     7.5            7       6.5
Customer support/Online community  15%     10       8     6              7       7
Performance                        15%     8        8     7              8       9
Points                                     9.6      8.5   7.05           7.15    7.4
Rank                                       1        2     5              4       3
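
The Points row is a weighted sum: each score is multiplied by the weight of its criterion and the products are added up. The short Python sketch below reproduces that calculation from the weights and scores in the table. The dictionary keys and the weighted_points helper are names chosen for this example only.

```python
# Criterion weights from the table, expressed as fractions of 1.
weights = {
    "ease_of_use": 0.40,
    "integration": 0.10,
    "cost": 0.20,
    "support": 0.15,
    "performance": 0.15,
}

# Scores per tool, copied from the table.
scores = {
    "Tableau": {"ease_of_use": 10, "integration": 9, "cost": 10, "support": 10, "performance": 8},
    "Qlik": {"ease_of_use": 9, "integration": 9, "cost": 8, "support": 8, "performance": 8},
    "MicroStrategy": {"ease_of_use": 7, "integration": 8, "cost": 7.5, "support": 6, "performance": 7},
    "Oracle": {"ease_of_use": 7, "integration": 7, "cost": 7, "support": 7, "performance": 8},
    "SAS": {"ease_of_use": 7, "integration": 9, "cost": 6.5, "support": 7, "performance": 9},
}


def weighted_points(tool_scores):
    """Sum of weight * score over all criteria, rounded to two decimals."""
    return round(sum(weights[c] * tool_scores[c] for c in weights), 2)


points = {tool: weighted_points(tool_scores) for tool, tool_scores in scores.items()}

# Tools ordered from highest to lowest points.
ranking = sorted(points, key=points.get, reverse=True)
```

For Tableau, for example, this works out to 0.40 × 10 + 0.10 × 9 + 0.20 × 10 + 0.15 × 10 + 0.15 × 8 = 9.6, matching the table.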


                     



Based on the criteria above, Tableau takes the number one spot among the BI tools evaluated.