Data Science: 2015

Friday, 24 July 2015

Choosing the appropriate clustering algorithm

The following criteria help in choosing the right clustering algorithm.

1. Nature of data
Whether the data to be clustered is numerical, categorical, or mixed.

2. Set of inputs needed by the algorithms
Some algorithms require the user to specify the number of clusters in advance.

3. Size of the data sets
Most clustering algorithms require multiple scans of the data, which can become a critical cost for large data sets.
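
To make criteria 2 and 3 concrete, here is a minimal k-means sketch in plain Python. It is only an illustration, not a production implementation: the function name kmeans, the sample points, and the scans counter are assumptions introduced for this example. It shows that the number of clusters k has to be supplied by the caller and that every iteration makes one full pass over the data.

```python
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Cluster 2-D points into k groups.

    Returns (centroids, labels, scans), where scans counts the full
    passes made over the data.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k must be chosen by the caller
    labels = [None] * len(points)
    scans = 0
    for _ in range(max_iters):
        scans += 1  # each iteration reads every point once
        # Assign every point to its nearest centroid (squared distance).
        new_labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        if new_labels == labels:  # no assignment changed: converged
            break
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, labels, scans


# Made-up sample data: two well-separated groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels, scans = kmeans(points, k=2)
print(labels, "full data scans:", scans)
```

Even on this tiny data set the loop needs more than one scan, because a final pass is required to confirm that no assignment changed. On large data sets each of those passes reads every record again.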

Saturday, 20 June 2015

Support Vector Machines

Support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data and recognize patterns. They are used for classification and regression analysis.

Advantages:

Effective in high-dimensional spaces.
Still effective in cases where the number of dimensions is greater than the number of samples.
Uses a subset of the training points (the support vectors) in the decision function, so it is also memory efficient.
Versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
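
The last two points can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the API of any particular SVM library: the support vectors, coefficients, and bias below are made-up values (a real SVM learns them during training), and custom_kernel is a hypothetical kernel built by adding two standard ones.

```python
import math


def linear_kernel(x, z):
    """Plain dot product of two vectors."""
    return sum(a * b for a, b in zip(x, z))


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * squared distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def custom_kernel(x, z):
    """Hypothetical custom kernel: the sum of two valid kernels."""
    return linear_kernel(x, z) + rbf_kernel(x, z)


def decision_function(x, support_vectors, dual_coefs, bias, kernel):
    """f(x) = sum_i coef_i * K(sv_i, x) + bias.

    Only the support vectors appear in the sum, which is why the
    rest of the training set does not have to be stored.
    """
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs)) + bias


# Made-up model parameters, for illustration only.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
dual_coefs = [0.5, -0.5]  # each entry plays the role of alpha_i * y_i
bias = 0.0


def predict(x, kernel=rbf_kernel):
    """Return +1 or -1 depending on the sign of the decision function."""
    score = decision_function(x, support_vectors, dual_coefs, bias, kernel)
    return 1 if score >= 0 else -1


print(predict((2.0, 1.5)), predict((-2.0, -0.5)))
print(predict((2.0, 1.5), kernel=custom_kernel))
```

Swapping the kernel argument changes the shape of the decision boundary without changing the rest of the code, which is the versatility the list above refers to.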

Monday, 15 June 2015

Logarithm in computer science

The logarithm of a number is the exponent to which another fixed value, the base, must be raised to produce that number.

 $y=b^x\Leftrightarrow x=\log_b(y)$

The binary logarithm (log₂ n) is the logarithm to the base 2. In computer science and information theory, the binary logarithm is very useful because it is closely connected to the binary numeral system, which is the base-2 numeral system. For example:

100101₂ = [ ( 1 ) × 2⁵ ] + [ ( 0 ) × 2⁴ ] + [ ( 0 ) × 2³ ] + [ ( 1 ) × 2² ] + [ ( 0 ) × 2¹ ] + [ ( 1 ) × 2⁰ ]
100101₂ = [ 1 × 32 ] + [ 0 × 16 ] + [ 0 × 8 ] + [ 1 × 4 ] + [ 0 × 2 ] + [ 1 × 1 ]
100101₂ = 37₁₀
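
The same expansion can be checked with a few lines of Python. This is a small sketch using only the standard library. The helper names binary_to_decimal and bits_needed are made up for this example.

```python
import math


def binary_to_decimal(digits):
    """Sum digit * 2**position, counting positions from the right."""
    return sum(int(d) * 2 ** i for i, d in enumerate(reversed(digits)))


def bits_needed(n):
    """Number of binary digits of a positive integer: floor(log2(n)) + 1."""
    return math.floor(math.log2(n)) + 1


value = binary_to_decimal("100101")
print(value)               # 37
print(int("100101", 2))    # 37, using the built-in conversion
print(math.log2(32))       # 5.0, because 2**5 == 32
print(bits_needed(value))  # 6, matching the six digits of 100101
```

The last line shows the link between the two ideas: the binary logarithm of a number tells you how many binary digits are needed to write it.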

Difference between classification and clustering

Classification – The task of assigning instances to predefined classes.
–E.g. Deciding whether a particular patient record can be associated with a specific disease.

Classification is a supervised learning technique used to assign a predefined tag to an instance on the basis of its features. A classification algorithm therefore requires training data: a classification model is created from the training data and is then used to classify new instances.

Clustering – The task of grouping related data points together without labeling them.
–E.g. Grouping patient records with similar symptoms without knowing what the symptoms indicate.

Clustering is an unsupervised technique used to group similar instances on the basis of their features. Clustering does not require labeled training data, and it does not assign a predefined label to each group.
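
The contrast can be illustrated with a toy example in plain Python. Everything here is an assumption made for illustration: the temperature readings and labels are invented, and train_centroids, classify, and cluster_by_gap are hypothetical helper names, not functions from any library. The classifier needs labeled training data and returns a predefined tag. The clustering function sees only the raw values and returns unnamed groups.

```python
def train_centroids(samples, labels):
    """Supervised: learn the mean value of each predefined class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def classify(model, x):
    """Assign the predefined label whose class mean is closest to x."""
    return min(model, key=lambda y: abs(model[y] - x))


def cluster_by_gap(values, max_gap):
    """Unsupervised: group sorted 1-D values, starting a new group
    whenever the gap to the previous value exceeds max_gap."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= max_gap:
            groups[-1].append(v)
        else:
            groups.append([v])
    return groups


# Invented body temperatures in degrees Celsius.
temps = [36.5, 36.8, 37.0, 39.2, 39.5, 40.1]

# Classification: labels are supplied with the training data.
labels = ["healthy", "healthy", "healthy", "fever", "fever", "fever"]
model = train_centroids(temps, labels)
print(classify(model, 39.0))  # fever

# Clustering: no labels are given, and the groups come back unnamed.
groups = cluster_by_gap(temps, max_gap=1.0)
print(groups)  # [[36.5, 36.8, 37.0], [39.2, 39.5, 40.1]]
```

Both approaches separate the same readings, but only the classifier can say what each group means, because that meaning was supplied in the training labels.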

Thursday, 9 April 2015

Difference between Supervised and Unsupervised Learning

Supervised Learning
In this technique the groups are known, and the experience provided to the algorithm is the relationship between actual entities and the groups they belong to. It is called supervised because the machine is told who is what a significant number of times and is then expected to predict this on its own.
The claims example above is an example of supervised learning. Below are a few more examples –
– Identify whether a news article belongs to sports or politics
– Classify an animal into one of the predefined classes such as mammal, bird, etc.
– Classify a person as male or female based on the products bought by the user.
There are many open datasets available here to try supervised learning.

Algorithms
Below is a list of some of the most widely used supervised learning algorithms –
– Naïve Bayes
– Support Vector Machines
– Random Forests
– Decision Tree

Unsupervised Learning
This technique is used when the groups (categories) of data are not known. It is called unsupervised because it is left to the learning algorithm to figure out patterns in the data provided. Clustering is an example of unsupervised learning in which data items are grouped into clusters of closely related items.
Some of the use cases of unsupervised learning are as follows –
– Given a set of news reports, cluster related news items together. (Used by news.google.com)
– Given a set of users and their movie preferences, cluster users who have similar tastes

Algorithms
Below is a list of some of the most widely used unsupervised learning algorithms –
– K-Means
– Fuzzy clustering
– Hierarchical clustering

There are many open datasets available here to try unsupervised learning.