Friday 24 July 2015

Choosing the appropriate clustering algorithm

Here are the criteria that help in deciding on the right clustering algorithm.

1. Nature of the data
    Whether the data to be clustered is numerical, categorical, or mixed.

2. Set of inputs needed by the algorithms
    Some algorithms require the number of clusters to be specified in advance, while others discover the grouping from the data (see the sketch after this list).

3. Size of the data set
    Most clustering algorithms require multiple scans of the data, which can be critical for large data sets.
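
As a hedged illustration of the second criterion (assuming scikit-learn is installed; the data and parameters below are made up), KMeans needs the number of clusters up front, while DBSCAN infers the grouping from density parameters instead:

    import numpy as np
    from sklearn.cluster import KMeans, DBSCAN

    X = np.random.RandomState(0).rand(100, 2)  # toy numerical data

    # KMeans: the number of clusters is a required input
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    # DBSCAN: no cluster count; density parameters drive the grouping
    dbscan = DBSCAN(eps=0.2, min_samples=5).fit(X)

    print(kmeans.labels_[:10])
    print(dbscan.labels_[:10])  # -1 marks points treated as noise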

Saturday 20 June 2015

Support Vector Machines

Support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data and recognize patterns; they are used for classification and regression analysis.


Advantages:

  • Effective in high-dimensional spaces.
  • Still effective in cases where the number of dimensions is greater than the number of samples.
  • Uses a subset of the training points (the support vectors) in the decision function, so it is also memory efficient.
  • Versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels (see the sketch below).
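
A minimal sketch of the kernel point, assuming scikit-learn; the iris data set and parameter values are illustrative only:

    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = datasets.load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 'rbf' is one of the built-in kernels; 'linear', 'poly', 'sigmoid',
    # or a custom callable can be passed instead
    clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)

    print("accuracy:", clf.score(X_test, y_test))
    print("support vectors stored:", clf.support_vectors_.shape[0])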

Monday 15 June 2015

Logarithm in computer science

The logarithm of a number is the exponent to which another fixed value, the base, must be raised to produce that number.


y = b^x  ⇔  x = log_b(y)



The binary logarithm (log₂ n) is the logarithm to base 2. In computer science and information theory, the logarithm is very useful because it is closely connected to the binary numeral system, since binary numbers are written in base 2.

100101₂ = [1 × 2⁵] + [0 × 2⁴] + [0 × 2³] + [1 × 2²] + [0 × 2¹] + [1 × 2⁰]
100101₂ = [1 × 32] + [0 × 16] + [0 × 8] + [1 × 4] + [0 × 2] + [1 × 1]
100101₂ = 37₁₀
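
The same conversion can be checked in a couple of lines of Python (a throwaway sketch using only the standard library):

    import math

    n = int("100101", 2)   # parse the binary numeral above
    print(n)               # 37
    print(math.log2(32))   # 5.0, since 2**5 == 32
    print(2 ** 5)          # 32: exponentiation is the inverse of log2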

Wednesday 10 June 2015

Difference between classification and clustering

Classification – The task of assigning instances to pre-defined classes.
– E.g. deciding whether a particular patient record can be associated with a specific disease.

Classification is a supervised learning technique used to assign a pre-defined tag to an instance on the basis of its features, so a classification algorithm requires training data. A classification model is built from the training data and is then used to classify new instances.

Clustering – The task of grouping related data points together without labeling them. 
– E.g. grouping patient records with similar symptoms without knowing what the symptoms indicate.

Clustering is an unsupervised technique used to group similar instances on the basis of their features. Clustering does not require training data, and it does not assign a pre-defined label to any group.
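
A hedged side-by-side sketch (scikit-learn assumed; the models and data are chosen only for illustration) makes the contrast concrete: the classifier is trained on labels, while the clusterer sees features alone.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.cluster import KMeans

    X, y = load_iris(return_X_y=True)

    # Classification: pre-defined labels y are required for training
    clf = DecisionTreeClassifier(random_state=0).fit(X, y)
    print(clf.predict(X[:3]))            # predictions are known classes

    # Clustering: only the features X are used, no labels
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.labels_[:3])                # group ids with no class meaning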

Thursday 9 April 2015

Difference between Supervised and Unsupervised Learning

Supervised Learning
In this technique the groups are known, and the experience provided to the algorithm is the relationship between actual entities and the groups they belong to. This is called supervised because the machine is told who is what a significant number of times, and is then expected to predict this on its own.
Below are a few examples –
– Identifying whether a news article is about sports or politics
– Classifying an animal into one of the predefined classes, such as mammal or bird
– Classifying a person as male or female based on the products they buy
There are many open data sets available online to try supervised learning.
Algorithms
Below is a list of the most widely used supervised learning algorithms –
– Naïve Bayes
– Support Vector Machines
– Random Forests
– Decision Tree
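
As one concrete, hedged illustration (scikit-learn assumed; the digits data set is arbitrary), here is Naïve Bayes from the list above learning from labeled examples:

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = GaussianNB().fit(X_train, y_train)  # trained on labeled data
    print("test accuracy:", model.score(X_test, y_test))
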
Unsupervised Learning
This technique is used when the groups (categories) of the data are not known. It is called unsupervised because it is left to the learning algorithm to figure out patterns in the data provided. Clustering is an example of unsupervised learning, in which data points are grouped into clusters of closely related items.
Some of the use cases of unsupervised learning are as follows –
– Given a set of news reports, cluster related news items together. (Used by news.google.com)
– Given a set of users and their movie preferences, cluster users who have similar tastes
Algorithms
Below is a list of the most widely used unsupervised learning algorithms –
– K-Means
– Fuzzy clustering
– Hierarchical clustering
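
And a matching unsupervised sketch (again assuming scikit-learn; the two synthetic blobs are made up) using hierarchical clustering from the list above:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    rng = np.random.RandomState(0)
    X = np.vstack([rng.normal(0, 0.3, (20, 2)),    # blob around (0, 0)
                   rng.normal(3, 0.3, (20, 2))])   # blob around (3, 3)

    hc = AgglomerativeClustering(n_clusters=2).fit(X)  # no labels supplied
    print(hc.labels_)  # group assignments discovered from the data alone
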
There are many open data sets available online to try unsupervised learning.

Wednesday 17 September 2014

Bayesian Classification

Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.

Bayesian classification is based on Bayes’ theorem, described below. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.

Bayes' Theorem:

Let X be a data tuple. In Bayesian terms, X is considered “evidence.” As usual, it is described by measurements made on a set of n attributes. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the “evidence” or observed data tuple X. In other words, we are looking for the probability that tuple X belongs to class C, given that we know the attribute description of X.

P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For example, suppose our world of data tuples is confined to customers described by the attributes age and income, respectively, and that X is a 35-year-old customer with an income of $40,000. Suppose that H is the hypothesis that our customer will buy a computer. Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer’s age and income.

In contrast, P(H) is the prior probability, or a priori probability, of H. For our example, this is the probability that any given customer will buy a computer, regardless of age, income, or any other information, for that matter. The posterior probability, P(H|X), is based on more information (e.g., customer information) than the prior probability, P(H), which is independent of X.

Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that a customer, X, is 35 years old and earns $40,000, given that we know the customer will buy a computer.
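
Putting these pieces together, Bayes' theorem expresses the posterior in terms of P(X|H), the prior P(H), and P(X), the probability of observing the tuple X itself:

P(H|X) = P(X|H) × P(H) / P(X)

As a quick numeric sketch in Python (the probabilities below are made up purely to illustrate the arithmetic):

    # Hypothetical figures for the customer example:
    # P(buys) = 0.4, P(35 yrs & $40K | buys) = 0.2, P(35 yrs & $40K) = 0.1
    p_h, p_x_given_h, p_x = 0.4, 0.2, 0.1
    posterior = p_x_given_h * p_h / p_x   # Bayes' theorem
    print(posterior)                      # 0.8 = P(buys | 35 yrs & $40K)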



Saturday 13 September 2014

Random Forests

Random Forest is a trademarked term for an ensemble of decision trees.

Unlike single decision trees, which are likely to suffer from high variance or high bias (depending on how they are tuned), random forests use averaging to find a natural balance between the two extremes.

[Error due to bias – the difference between the expected (or average) prediction of our model and the correct value we are trying to predict.

Error due to variance – the variability of a model's prediction at a given data point.]

Bagging (bootstrap aggregation) is a technique for reducing the variance of an estimated prediction function.
Bagging works well for high-variance, low-bias procedures, such as trees.

Random forest is a substantial modification of bagging that builds a large collection of de-correlated trees and then averages them.
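
A minimal sketch of the idea (scikit-learn assumed; the data set and tree count are illustrative): each tree is grown on a bootstrap sample, and the forest averages their votes.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 100 bootstrap-trained, de-correlated trees, predictions averaged
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_train, y_train)
    print("test accuracy:", rf.score(X_test, y_test))
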
Pros:
  • Accuracy
Cons:
  • Speed
  • Interpretability
  • Overfitting
Random forests are one of the two top-performing algorithms in prediction contests, along with boosting.

Random forests are difficult to interpret but often very accurate.

Care should be taken to avoid overfitting.