Wednesday 17 September 2014

Bayesian Classification

Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.

Bayesian classification is based on Bayes’ theorem, described below. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.

Bayes' Theorem:

Let X be a data tuple. In Bayesian terms, X is considered “evidence.” As usual, it is described by measurements made on a set of n attributes. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the “evidence” or observed data tuple X. In other words, we are looking for the probability that tuple X belongs to class C, given that we know the attribute description of X.

P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For example, suppose our world of data tuples is confined to customers described by the attributes age and income, respectively, and that X is a 35-year-old customer with an income of $40,000. Suppose that H is the hypothesis that our customer will buy a computer. Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer’s age and income.

In contrast, P(H) is the prior probability, or a priori probability, of H. For our example, this is the probability that any given customer will buy a computer, regardless of age, income, or any other information, for that matter. The posterior probability, P(H|X), is based on more information (e.g., customer information) than the prior probability, P(H), which is independent of X.

Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that a customer, X, is 35 years old and earns $40,000, given that we know the customer will buy a computer.
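
Bayes' theorem ties these quantities together: P(H|X) = P(X|H) P(H) / P(X), where P(X) is the probability of the evidence itself. A minimal sketch of this calculation for the buys-computer example, with probabilities made up purely for illustration:

    # Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
    # All numbers below are made up for illustration only.
    p_h = 0.3           # prior P(H): a customer buys a computer
    p_x_given_h = 0.1   # P(X|H): customer is 35 and earns $40,000, given they buy
    p_x = 0.05          # evidence P(X): customer is 35 and earns $40,000

    p_h_given_x = p_x_given_h * p_h / p_x
    print(p_h_given_x)  # posterior P(H|X) = 0.6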



Saturday 13 September 2014

Random Forests

Random Forest is a trademarked term for an ensemble of decision trees.

Unlike single decision trees, which are likely to suffer from high variance or high bias (depending on how they are tuned), random forests use averaging to find a natural balance between the two extremes.

[Error due to bias - the difference between the expected (or average) prediction of our model and the correct value we are trying to predict.

Error due to variance - the variability of a model's prediction at a given data point.]

Bagging (bootstrap aggregation) is a technique for reducing the variance of an estimated prediction function.
Bagging seems to work well for high-variance, low-bias procedures, such as trees.
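
A minimal sketch of bagging decision trees with scikit-learn (the dataset and parameter values are placeholders, not from these notes); comparing a single tree against a bagged ensemble usually shows the variance reduction:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # A single, fully grown tree (high variance, low bias).
    tree = DecisionTreeClassifier(random_state=0)

    # Bagging: fit 50 trees, each on a bootstrap sample, and average their votes.
    bagged = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                               n_estimators=50, random_state=0)

    print(cross_val_score(tree, X, y).mean())    # accuracy of the single tree
    print(cross_val_score(bagged, X, y).mean())  # usually higher for the ensemble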

Random forest is a substantial modification of bagging that builds a large collection of de-correlated trees and then averages them.
Pros:
  • Accuracy
Cons:
  • Speed
  • Interpretability
  • Overfitting
Random forests are one of the two top-performing algorithms in prediction contests, along with boosting.

Random forests are difficult to interpret but often very accurate.

Care should be taken to avoid overfitting.
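
A minimal sketch of training a random forest with scikit-learn; the dataset, parameters, and train/test split are illustrative choices, not part of these notes:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 100 de-correlated trees, each grown on a bootstrap sample with a random
    # subset of features considered at every split; predictions are averaged.
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_train, y_train)
    print(forest.score(X_test, y_test))  # accuracy on held-out data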



Friday 12 September 2014

Machine learning

Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.

It grew out of work in AI and gives computers new capabilities.

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
[Suppose your email program watches which emails you mark as spam or not and, based on that, learns how to better filter spam. Then classifying emails as spam or not is the task T, watching you label emails as spam or not spam is the experience E, and the number of emails correctly classified as spam/not spam is the performance measure P.]

Types of machine learning algorithms:
  1. Supervised machine learning
    • Classification
    • Regression
  2. Unsupervised machine learning
    • Clustering
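
A small sketch contrasting the two styles with scikit-learn on a toy dataset (the models and dataset here are illustrative choices, not part of these notes): supervised learning uses labels, unsupervised learning does not.

    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)

    # Supervised (classification): the labels y are used during training.
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print(clf.predict(X[:5]))

    # Unsupervised (clustering): only the features X are used; groups are discovered.
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.labels_[:5])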


  

Tuesday 9 September 2014

Decision Trees

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Here is an example of a decision tree for whether a person entering a restaurant will wait for their order or not. The ranges 0-10, 10-30, 30-60, and >60 minutes are the waiting times for an order. "T" means the patron will wait for the order and "F" means he won't.

The decision tree algorithm ID3 (Iterative Dichotomiser) was first developed in the late 1970s and early 1980s. This work expanded on earlier work on concept learning systems and was later followed by C4.5. In 1984, L. Breiman and colleagues introduced CART (Classification and Regression Trees), which described the generation of binary decision trees. ID3 and CART follow a similar approach for learning decision trees from training tuples.

All of these adopt a greedy (i.e., non-backtracking) approach in which decision trees are constructed in a top-down, recursive, divide-and-conquer manner.
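
A minimal sketch of this top-down, greedy construction using scikit-learn's CART-style implementation (toy dataset; max_depth is an illustrative choice):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = load_iris()
    X, y = data.data, data.target

    # Fit a binary (CART-style) decision tree by greedy, recursive splitting.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X, y)

    # Print the learned decision rules, one split per level of the tree.
    print(export_text(tree, feature_names=data.feature_names))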


Saturday 6 September 2014

Difference between Cloud Computing and Distributed Computing

Cloud computing is a style of computing in which resources are made available over the internet. Most often, these resources are extensible and highly virtualized, and they are provided as a service. These resources can mainly be broken down into applications, platforms, or infrastructure.
The field of computer science that deals with distributed systems (systems made up of more than one self-directed node) is called distributed computing. Typically, distributed computing is used to harness the power of multiple machines to achieve a single large-scale goal.

What is Cloud Computing?

Cloud computing is the emerging technology of delivering many kinds of resources as services, mainly over the internet. The delivering party is referred to as the service provider, while the users are known as subscribers. Subscribers typically pay subscription fees on a per-use basis. Cloud computing is broken down into a few categories based on the type of service provided:
  • SaaS (Software as a Service) is the category of cloud computing in which the main resources available as a service are software applications.
  • PaaS (Platform as a Service) is the category in which the service providers deliver a computing platform or a solution stack to their subscribers over the internet.
  • IaaS (Infrastructure as a Service) is the category in which the main resources available as a service are hardware infrastructure.
  • DaaS (Desktop as a Service), an emerging "-aaS" service, deals with providing a whole desktop experience over the internet. This is sometimes referred to as desktop virtualization, virtual desktop, or hosted desktop.





Examples of Cloud Computing Services

These examples illustrate the different types of cloud computing services available today:
  • Amazon EC2 - virtual IT
  • Google App Engine - application hosting
  • Google Apps - software as a service
  • Apple MobileMe - network storage

Some providers offer cloud computing services for free while others require a paid subscription.

Cloud Computing Pros and Cons

Service providers are responsible for installing and maintaining core technology within the cloud. Some customers prefer this model because it limits their own manageability burden. However, customers cannot directly control system stability in this model and are highly dependent on the provider instead.

Cloud computing systems are normally designed to closely track all system resources, which enables providers to charge customers according to the resources each consumes. Some customers will prefer this so-called metered billing approach to save money, while others will prefer a flat-rate subscription to ensure predictable monthly or yearly costs.

Using a cloud computing environment generally requires you to send data over the Internet and store it on a third-party system. The privacy and security risks associated with this model must be weighed against alternatives.


What is Distributed Computing?

The field of computer science that deals with distributed systems is called distributed computing. A distributed system is made up of more than one self-directed computer communicating through a network. These computers use their own local memory. All computers in the distributed system talk to each other to achieve a certain common goal. Alternatively, different users at each computer may have different individual needs, and the distributed system coordinates shared resources (or helps nodes communicate with each other) so that they can accomplish their individual tasks. Nodes communicate using message passing.

Distributed computing can also be identified as using a distributed system to solve a single large problem by breaking it up into tasks, each of which is computed on an individual computer of the distributed system. Typically, fault-tolerance mechanisms are in place to overcome individual computer failures. The structure of the system (topology, delay, and cardinality) is not known in advance and is dynamic. Individual computers do not have to know everything about the whole system or the complete input (for the problem to be solved).

Examples: Hadoop; Google also has its own distributed systems.
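
As a toy illustration of the idea (not of Hadoop itself), the sketch below breaks one large job into tasks and hands them to separate worker processes, which stand in for the nodes of a distributed system:

    from multiprocessing import Pool

    def partial_sum(chunk):
        # Each worker ("node") computes the sum of its own slice of the data.
        return sum(chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))

        # Break the large problem into smaller tasks.
        chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]

        # Assign each task to a worker and combine the partial results.
        with Pool(processes=4) as pool:
            print(sum(pool.map(partial_sum, chunks)))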





What is the difference between Cloud and Distributed Computing?

Cloud computing is a technology that delivers many kinds of resources as services, mainly over the internet, while distributed computing is the concept of using a distributed system consisting of many self-governed nodes to solve a very large problem (one that is usually difficult for a single computer to solve). Cloud computing is basically a sales and distribution model for various types of resources over the internet, while distributed computing can be identified as a type of computing that uses a group of machines to work as a single unit to solve a large-scale problem. Distributed computing achieves this by breaking the problem up into simpler tasks and assigning these tasks to individual nodes.