Data Science: April 2014

Monday, 7 April 2014

What do data scientists do?

In general terms, data scientists take data and convert it into information or predictions.

Data scientists collect data from the real world (generally big data from the internet), process it, and convert it into a dataset that can be analyzed. They analyze this dataset using statistical models or machine learning and produce results and reports that are useful for data-driven products (and even for the general public).

The step-by-step process they follow with data:

Define the question
Define the ideal dataset
Determine what data you can access
Obtain the data
    Reading data - Excel, XML, JSON, web, ...
    Merging data
Clean the data
    Reshaping data
    Summarizing data
Exploratory data analysis
    Graphs
    Plotting systems
    Clustering
Statistical prediction/modeling
    Extracting generalizable information from data
Interpret results
Challenge results
Synthesize/write up results
Create reproducible code
    Make the whole analysis fully reproducible so that you can communicate it to other people
Distribute results to other people
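
As a rough illustration of the obtain, merge, clean and summarize steps, here is a minimal sketch in Python using only the standard library. The inline CSV and JSON strings, the column names, and the function names are all hypothetical stand-ins for real sources; a real analysis would read files or web responses instead.

```python
import csv
import io
import json
from statistics import mean

# Hypothetical raw inputs standing in for data obtained from two sources.
CSV_TEXT = """id,city,temp_c
1,Oslo,4.5
2,Lima,
3,Cairo,29.0
"""
JSON_TEXT = (
    '[{"id": 1, "country": "Norway"},'
    ' {"id": 2, "country": "Peru"},'
    ' {"id": 3, "country": "Egypt"}]'
)


def read_data():
    """Obtain the data: parse one CSV source and one JSON source."""
    readings = list(csv.DictReader(io.StringIO(CSV_TEXT)))
    countries = json.loads(JSON_TEXT)
    return readings, countries


def merge(readings, countries):
    """Merge the two sources on their shared 'id' key."""
    by_id = {c["id"]: c["country"] for c in countries}
    return [dict(r, country=by_id.get(int(r["id"]))) for r in readings]


def clean(rows):
    """Clean the data: drop rows with no temperature and convert types."""
    return [
        dict(r, id=int(r["id"]), temp_c=float(r["temp_c"]))
        for r in rows
        if r["temp_c"].strip()
    ]


def summarize(rows):
    """Summarize the data: row count and mean temperature."""
    return {"n": len(rows), "mean_temp_c": mean(r["temp_c"] for r in rows)}


readings, countries = read_data()
dataset = clean(merge(readings, countries))
summary = summarize(dataset)
print(summary)
```

Keeping each step as its own function means the whole script can be rerun from raw input to summary, which is the point of the "create reproducible code" step.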

Why R?

The R programming language has become the single most important tool for computational statistics, visualization, and data science.

It is free
It has a comprehensive set of packages for:
    Data access
    Data cleaning
    Analysis
    Data reporting
It has one of the best development environments for data science: RStudio
It has an amazing ecosystem of developers: a large collection of packages developed and published by the wider R community
Packages are easy to install and "nicely play together"

Sunday, 6 April 2014

Difference between Pig and Hive

Apache Pig and Apache Hive are two projects that layer on top of Hadoop; each provides a higher-level language for using Hadoop's MapReduce library.

Pig

Apache Pig provides a scripting language for describing operations like reading, filtering, transforming, joining, and writing data -- exactly the operations that MapReduce was originally designed for.
Rather than expressing these operations in thousands of lines of Java code that uses MapReduce directly, Pig lets users express them in a language not unlike a Bash or Perl script. Pig is excellent for prototyping and rapidly developing MapReduce-based jobs, as opposed to coding MapReduce jobs in Java itself.
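
To make that kind of data flow concrete, here is a small plain-Python sketch of a filter, group, and count job written in the map/shuffle/reduce style. It is neither Pig Latin nor the Hadoop API, and it runs in a single process; the records and function names are hypothetical. It only illustrates the step-by-step plumbing that a Pig script would let you describe in a few lines.

```python
from collections import defaultdict

# Hypothetical input records: (user, url, status) tuples standing in for log lines.
RECORDS = [
    ("alice", "/home", 200),
    ("bob", "/home", 500),
    ("alice", "/cart", 200),
    ("carol", "/home", 200),
]


def map_phase(records):
    """Filter and transform: keep successful requests, emit (url, 1) pairs."""
    for _user, url, status in records:
        if status == 200:
            yield url, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each group: count hits per URL."""
    return {key: sum(values) for key, values in groups.items()}


hits = reduce_phase(shuffle(map_phase(RECORDS)))
print(hits)
```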

If Pig is "scripting for Hadoop", then Hive is "SQL queries for Hadoop".

Hive

Apache Hive offers an even more specific, higher-level language for querying data by running Hadoop jobs, rather than directly scripting, step by step, the operation of several MapReduce jobs on Hadoop.
The language is, by design, extremely SQL-like. Hive is still intended as a tool for long-running batch-oriented queries over massive data; it's not "real-time" in any sense.
Hive is an excellent tool for analysts and business development types who are accustomed to SQL-like queries and Business Intelligence systems; it will let them easily leverage your shiny new Hadoop cluster to perform ad-hoc queries or generate report data from the data stored on it.
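
The declarative style is easy to see in a small example. The sketch below uses Python's built-in sqlite3 module purely as a stand-in: SQLite runs the query in-process and immediately, whereas Hive would turn a similar SQL-like query into batch Hadoop jobs. The table name, column names, and rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only to illustrate the SQL-like query style.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (user_name TEXT, url TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?, ?)",
    [
        ("alice", "/home", 200),
        ("bob", "/home", 500),
        ("alice", "/cart", 200),
        ("carol", "/home", 200),
    ],
)

# One declarative statement: filter, group, count, and sort.
rows = conn.execute(
    """
    SELECT url, COUNT(*) AS hits
    FROM requests
    WHERE status = 200
    GROUP BY url
    ORDER BY hits DESC, url
    """
).fetchall()
print(rows)
conn.close()
```

The query says what result is wanted and leaves the execution plan to the engine, which is the property that makes Hive approachable for people who already know SQL.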