Wednesday 17 September 2014

Bayesian Classification

Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.

Bayesian classification is based on Bayes’ theorem, described below. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.

Bayes' Theorem:

Let X be a data tuple. In Bayesian terms, X is considered “evidence.” As usual, it is described by measurements made on a set of n attributes. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the “evidence” or observed data tuple X. In other words, we are looking for the probability that tuple X belongs to class C, given that we know the attribute description of X.

P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For example, suppose our world of data tuples is confined to customers described by the attributes age and income, and that X is a 35-year-old customer with an income of $40,000. Suppose that H is the hypothesis that our customer will buy a computer. Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer’s age and income.

In contrast, P(H) is the prior probability, or a priori probability, of H. For our example, this is the probability that any given customer will buy a computer, regardless of age, income, or any other information, for that matter. The posterior probability, P(H|X), is based on more information (e.g., customer information) than the prior probability, P(H), which is independent of X.

Similarly, P(X|H) is the conditional probability of X given H (in standard terminology, the likelihood). That is, it is the probability that a customer, X, is 35 years old and earns $40,000, given that we know the customer will buy a computer.
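
Bayes' theorem ties these quantities together:

P(H|X) = P(X|H) P(H) / P(X)

As a minimal sketch in Python, here is this calculation for the buys-computer example; the three input probabilities are made-up illustrative values, not estimates from any dataset:

p_h = 0.5          # P(H): prior probability a customer buys a computer (hypothetical)
p_x_given_h = 0.3  # P(X|H): probability a buyer is 35 and earns $40,000 (hypothetical)
p_x = 0.2          # P(X): probability any customer matches that description (hypothetical)

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)  # posterior P(H|X) = 0.75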



Saturday 13 September 2014

Random Forests

Random Forest is a trademarked term for an ensemble of decision trees.

Unlike single decision trees, which are likely to suffer from high variance or high bias (depending on how they are tuned), random forests use averaging to find a natural balance between the two extremes.

[Error due to bias - the difference between the expected (or average) prediction of our model and the correct value we are trying to predict.

Error due to variance - the variability of a model's prediction at a given data point.]

Bagging / bootstrap aggregation is a technique for reducing the variance of an estimated prediction function.
Bagging works well for high-variance, low-bias procedures, such as trees.
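
Below is a minimal, hand-rolled sketch of bagging with decision trees, assuming scikit-learn and NumPy are installed; the iris data is just a stand-in for any training set:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Train each tree on a bootstrap sample (drawn with replacement).
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregate by majority vote; averaging many trees reduces the variance
# that any single deep tree would have.
preds = np.array([t.predict(X) for t in trees]).astype(int)
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
print((majority == y).mean())  # training accuracy of the bagged ensemble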

Random Forest is a substantial modification of bagging that builds a large collection of de-correlated trees and then averages them.
Pros:
  • Accuracy
Cons:
  • Speed
  • Interpretability
  • Overfitting
Random forests are one of the two top-performing algorithms in prediction contests, along with boosting.

Random forests are difficult to interpret but often very accurate.

Care should be taken to avoid overfitting.
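
A minimal sketch using scikit-learn's RandomForestClassifier; the iris data and the parameter values are illustrative choices, not recommendations:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_features restricts the features considered at each split, which
# de-correlates the trees; averaging their votes then reduces variance.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))  # accuracy on held-out data helps spot overfitting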



Friday 12 September 2014

Machine learning

Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.

It grew out of work in AI and gives computers new capabilities.

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
[Suppose your email program watches which emails you mark as spam and which you do not, and based on that learns how to better filter spam. Then classifying emails as spam or not spam is the task T, watching you label emails as spam or not spam is the experience E, and the number of emails correctly classified is the performance measure P.]

Types of machine learning algorithms (a short sketch of each type follows the list):
  1. Supervised machine learning
    • Classification
    • Regression
  2. Unsupervised machine learning
    • Clustering
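
A minimal sketch contrasting the two types in Python, assuming scikit-learn is installed; the iris data is just a stand-in:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised learning: fit a classifier to labeled examples (X, y).
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:3]))  # predicted class labels

# Unsupervised learning: find cluster structure in X without using y.
km = KMeans(n_clusters=3, n_init=10).fit(X)
print(km.labels_[:3])  # cluster assignments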


  

Tuesday 9 September 2014

Decision Trees

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Here is an example of a decision tree for deciding whether a person entering a restaurant will wait for their order or not. The intervals 0-10, 10-30, 30-60, and >60 minutes are the possible waiting times for an order; "T" means the patron will wait, and "F" means he won't.

Decision tree algorithms were first developed in the late 1970s and early 1980s, when J. R. Quinlan introduced ID3 (Iterative Dichotomiser). ID3 expanded on earlier work on concept learning systems, and Quinlan later published its successor, C4.5. In 1984, L. Breiman et al. introduced CART (Classification and Regression Trees), which describes the generation of binary decision trees. ID3 and CART follow a similar approach for learning decision trees from training tuples.

All of these adopt a greedy (i.e., non-backtracking) approach in which decision trees are constructed in a top-down, recursive, divide-and-conquer manner.
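
As a minimal sketch, scikit-learn's DecisionTreeClassifier (a CART-style learner) builds trees in this greedy, top-down fashion; the iris data and the max_depth value are illustrative:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2).fit(data.data, data.target)

# Print the learned decision rules as nested if/else tests on the features.
print(export_text(tree, feature_names=list(data.feature_names)))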


Saturday 6 September 2014

Difference between Cloud Computing and Distributed Computing

Cloud computing is a style of computing in which resources are made available over the internet. Most often, these resources are extensible and highly virtualized, and they are provided as a service. These resources can mainly be broken down into applications, platforms, or infrastructure.
The field of computer science that deals with distributed systems (systems made up of more than one self-directed node) is called distributed computing. Typically, distributed computing uses the power of multiple machines to achieve a single large-scale goal.

What is Cloud Computing?

Cloud computing is the emerging technology of delivering many kinds of resources as services, mainly over the internet. The delivering party is referred to as the service provider, while the users are known as subscribers. Subscribers typically pay subscription fees on a per-use basis. Cloud computing is broken down into a few categories based on the type of service provided:
  • SaaS (Software as a Service) - the category of cloud computing in which the main resources available as a service are software applications.
  • PaaS (Platform as a Service) - the category in which service providers deliver a computing platform or a solution stack to their subscribers over the internet.
  • IaaS (Infrastructure as a Service) - the category in which the main resources available as a service are hardware infrastructure.
  • DaaS (Desktop as a Service) - an emerging -aaS service that provides a whole desktop experience over the internet. This is sometimes referred to as desktop virtualization, virtual desktop, or hosted desktop.

Examples of Cloud Computing Services

These examples illustrate the different types of cloud computing services available today:
  • Amazon EC2 - virtual IT
  • Google App Engine - application hosting
  • Google Apps - software as a service
  • Apple MobileMe - network storage

Some providers offer cloud computing services for free while others require a paid subscription.

Cloud Computing Pros and Cons

Service providers are responsible for installing and maintaining core technology within the cloud. Some customers prefer this model because it limits their own management burden. However, customers cannot directly control system stability in this model and are highly dependent on the provider instead.

Cloud computing systems are normally designed to closely track all system resources, which enables providers to charge customers according to the resources each consumes. Some customers will prefer this so-called metered billing approach to save money, while others will prefer a flat-rate subscription to ensure predictable monthly or yearly costs.

Using a cloud computing environment generally requires you to send data over the Internet and store it on a third-party system. The privacy and security risks associated with this model must be weighed against alternatives.


What is Distributed Computing?

The field of computer science that deals with distributed systems is called distributed computing. A distributed system is made up of more than one self-directed computer communicating through a network. These computers use their own local memory. All computers in the distributed system talk to each other to achieve a certain common goal. Alternatively, different users at each computer may have different individual needs, and the distributed system coordinates shared resources (or helps nodes communicate with one another) to achieve their individual tasks. Nodes communicate using message passing. Distributed computing can also be identified as using a distributed system to solve a single large problem by breaking it up into tasks, each of which is computed on an individual computer of the distributed system. Typically, fault-tolerance mechanisms are in place to overcome individual computer failures. The structure of the system (topology, delay, and cardinality) is not known in advance and is dynamic. Individual computers do not have to know everything about the whole system or the complete input (for the problem to be solved).

Examples: Hadoop; Google also has its own distributed systems.

What is the difference between Cloud and Distributed Computing?

Cloud computing is a technology that delivers many kinds of resources as services, mainly over the internet, while distributed computing is the concept of using a distributed system consisting of many self-governed nodes to solve a very large problem (one that is usually difficult for a single computer to solve). Cloud computing is basically a sales and distribution model for various types of resources over the internet, while distributed computing can be identified as a type of computing that uses a group of machines working as a single unit to solve a large-scale problem. Distributed computing achieves this by breaking the problem up into simpler tasks and assigning these tasks to individual nodes.

Thursday 8 May 2014

Git

Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.

  • Created by the same people who developed Linux
  • The most popular implementation of version control today
  • Everything is stored in local repositories on your computer
  • Operated from the command line
GitHub
GitHub.com is a web-based hosting service for software development projects that use the Git version control system.

Basic Git commands

Configure username and email using Git Bash

$ git config --global user.name "your name"
$ git config --global user.email "your email"

Type the following to confirm your changes

$ git config --list

Close Git Bash with the following command

$ exit

Initialize repository in an existing directory

$ git init localpath

Cloning an existing repository

$ git clone url

Adding

$ git add . (add new files)
$ git add -u (update tracking for files that changed names or were deleted)
$ git add -A (does both of the previous)

You should do this before committing.

Committing

$ git commit -m "message" ("message" is a useful description of what you did)

$ git commit -am "save arezzo files" (the -a flag stages and commits all modified tracked files in one step; git commit -a -m "message" does the same thing)

Pushing
$ git push

Push an existing repository
$ git remote add origin https://github.com/manaliajudiya/R.git
$ git push -u origin master

Checking the status of your file
$ git status

Branches
Sometimes you are working on a project with a version being used by many people. You may not want to edit that version. 
So you can create a branch with the command
$ git checkout -b branchname

To see what branch you are on
$ git branch

To switch back to master branch type
$ git checkout master

Pull requests
If you fork someone's repository or have multiple branches, you will both be working separately. In this case, to merge your changes, you need to send a pull request. This is a feature of GitHub.

$ git pull (fetches changes from the remote repository and merges them into your local branch; if the remote contains work you do not have locally and you push without pulling first, the push is rejected with the error "Updates were rejected. Remote contains work that you do not have locally.")

Show the top-level directory of the local Git repo
$ git rev-parse --show-toplevel

Change to a different directory
$ cd ~/dirname

List the files Git is tracking in the repo
$ git ls-files

Run Git against a repository in another directory
$ git --git-dir ~/dirname/.git --work-tree ~/dirname status

Monday 7 April 2014

What do data scientists do?

In general terms, data scientists take data and turn it into information or predictions.

Data scientists collect data from the real world (often big data from the internet), process it, and convert it into a dataset that can be analyzed. They analyze this dataset using statistical models or machine learning and create results/reports that can be useful for data-driven products (even for the general public).

The step-by-step process they follow with data:
  • Define the question
  • Define the ideal dataset
  • Determine what data you can access
  • Obtain the data
    • Reading data - Excel, XML, JSON, Web, ...
    • Merging data
  • Clean the data
    • Reshaping data
    • Summarizing data
  • Exploratory data analysis
    • Graphs
    • Plotting systems
    • Clustering
  • Statistical prediction/modeling
    • Extracting generalization information from data
  • Interpret results
  • Challenge results
  • Synthesize/Write up results
  • Create reproducible code
    • Completely reproduce all the documents such that you can communicate it with other people
  • Distribute results to other people

Why R?

R programming language has become the single most important tool for computational statistics, visualization and data science.
  • It is free
  • It has a comprehensive set of packages
    • Data Access
    • Data Cleaning
    • Analysis
    • Data reporting
  • It has one of the best development environments for data science - RStudio
  • It has an amazing ecosystem of developers - a large collection of packages developed and published by the larger R community
  • Packages are easy to install and "play nicely together"


Sunday 6 April 2014

Difference between Pig and Hive

Apache Pig and Hive are two projects that layer on top of Hadoop, and provide a higher-level language for using Hadoop's MapReduce library. 

Pig
  • Apache Pig provides a scripting language for describing operations like reading, filtering, transforming, joining, and writing data -- exactly the operations that MapReduce was originally designed for. 
  • Rather than expressing these operations in thousands of lines of Java code that uses MapReduce directly, Pig lets users express them in a language not unlike a Bash or Perl script. Pig is excellent for prototyping and rapidly developing MapReduce-based jobs, as opposed to coding MapReduce jobs in Java itself. 
If Pig is "scripting for Hadoop", then Hive is "SQL queries for Hadoop".

Hive
  • Apache Hive offers an even more specific and higher-level language, for querying data by running Hadoop jobs, rather than directly scripting step-by-step the operation of several MapReduce jobs on Hadoop. 
  • The language is, by design, extremely SQL-like. Hive is still intended as a tool for long-running batch-oriented queries over massive data; it's not "real-time" in any sense. 
  • Hive is an excellent tool for analysts and business development types who are accustomed to SQL-like queries and Business Intelligence systems; it will let them easily leverage your shiny new Hadoop cluster to perform ad-hoc queries or generate report data across the data stored in the cluster.