Major data distributions a data scientist should know

Data distributions and their properties are one area of statistics that a data scientist needs to understand thoroughly.

Statistics forms the foundation of data science. Anyone building a career in data science needs a firm grasp of statistical concepts and of how they can be applied in business settings.

Let us take a look at a few of the most common distributions a data scientist encounters in their career.

Normal distribution

In a normal distribution, most values cluster around the middle and taper off symmetrically towards either extreme. It is also called a Gaussian distribution, and it appears as a bell curve when plotted. Because the curve is symmetric, the skew is zero and the mean, median and mode are all equal. The standard normal distribution is the special case with a mean of 0 and a standard deviation of 1.

In a normal distribution, the midpoint (the mean) has the maximum frequency. The proportion of the area under the curve lying within a given distance of the mean is constant when that distance is measured in standard deviation units: about 68% of values fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three.

Values from a normal distribution are often expressed as standard scores, or Z scores. A Z score gives the distance between an actual score and the mean in units of standard deviation: z = (x - μ) / σ, where μ is the mean and σ is the standard deviation.
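
The Z score and the "constant proportion of area" property can be checked with a short sketch. It uses only Python's standard library; the function names `z_score` and `area_within` and the exam-score numbers are illustrative assumptions, not something from the article.

```python
from statistics import NormalDist


def z_score(x, mean, std_dev):
    """Distance of x from the mean, in standard-deviation units."""
    return (x - mean) / std_dev


def area_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


if __name__ == "__main__":
    # Hypothetical exam scores with mean 70 and standard deviation 10.
    print(z_score(85, mean=70, std_dev=10))  # 1.5

    # The same proportions hold for every normal distribution,
    # whatever its mean and standard deviation.
    for k in (1, 2, 3):
        print(k, round(area_within(k), 4))
```

The loop should print values close to 0.68, 0.95 and 0.997, matching the rule of thumb for one, two and three standard deviations.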

Bernoulli distribution

In a Bernoulli distribution, the random variable can take only two possible values. (A random variable is a variable whose value depends on the outcome of an experiment. Random variables are of two types: discrete and continuous.)

A Bernoulli distribution is a discrete distribution. It describes a single trial, called a Bernoulli trial, with exactly two possible outcomes: success and failure. A Bernoulli trial is one of the simplest experiments in statistics. Examples include a coin toss, or a roll of a die when only a yes-or-no question is asked of it, such as whether it shows a six. The probabilities of the mutually exclusive events that make up all the possible outcomes have to sum to one.

The two possible outcomes in the Bernoulli distribution are indicated by n=0 and n=1. Here, n=1 indicates success and has probability p, while n=0 indicates failure and has probability 1-p, where 0<=p<=1.
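
As a minimal sketch of the definition above, the probability function and a simulated trial might look like this in Python. The names `bernoulli_pmf` and `bernoulli_trial`, the value p=0.3 and the seed are assumptions chosen for illustration.

```python
import random


def bernoulli_pmf(n, p):
    """P(N = n) for a Bernoulli variable with success probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    if n == 1:
        return p          # success
    if n == 0:
        return 1.0 - p    # failure
    return 0.0            # no other outcome is possible


def bernoulli_trial(p, rng):
    """Run one trial: return 1 (success) with probability p, else 0 (failure)."""
    return 1 if rng.random() < p else 0


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    trials = [bernoulli_trial(0.3, rng) for _ in range(10_000)]
    # The observed share of successes should be close to p = 0.3.
    print(sum(trials) / len(trials))
```

With many repeated trials, the fraction of successes settles near p, which is one way to sanity-check an estimated success probability.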

Uniform distribution

The uniform distribution is one of the simplest statistical distributions to understand. It is a probability distribution in which all possible outcomes are equally likely. Graphically, its probability function is a straight horizontal line over the range of possible outcomes. Uniform distributions are of two types: discrete and continuous.

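A discrete uniform distribution has a finite number of outcomes, each with the same probability, such as the six faces of a fair die. A continuous uniform distribution can take any value in an interval, so it has infinitely many possible outcomes; intervals of equal length within the range are equally likely.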
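
The two cases can be sketched as follows. The helper names `discrete_uniform_pmf` and `continuous_uniform_prob`, the fair die and the interval [0, 10] are hypothetical examples rather than anything defined in the article.

```python
from fractions import Fraction


def discrete_uniform_pmf(outcomes):
    """Assign the same probability to each of a finite list of outcomes."""
    probability = Fraction(1, len(outcomes))
    return {outcome: probability for outcome in outcomes}


def continuous_uniform_prob(low, high, a, b):
    """P(low <= X <= high) for X uniformly distributed on the interval [a, b]."""
    left = max(low, a)
    right = min(high, b)
    if right <= left:
        return 0.0
    # Probability is proportional to the length of the overlap with [a, b].
    return (right - left) / (b - a)


if __name__ == "__main__":
    die = discrete_uniform_pmf([1, 2, 3, 4, 5, 6])
    print(die[3])  # 1/6

    # Two intervals of equal length get equal probability.
    print(continuous_uniform_prob(2, 5, a=0, b=10))
    print(continuous_uniform_prob(6, 9, a=0, b=10))
```

In the continuous case a single exact value has probability zero, which is why the sketch works with intervals instead of points.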

Poisson distribution

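A Poisson distribution is a probability distribution that shows how many times an event is likely to occur over a fixed interval of time or space, given that events occur independently at a constant average rate. It is named after French mathematician Siméon Denis Poisson. It is a discrete distribution: the variable takes only non-negative integer values (0, 1, 2, ...). It is a limiting case of the binomial distribution, reached when the number of trials grows large and the probability of success per trial shrinks so that the expected number of successes stays fixed.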
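
The relationship with the binomial distribution can be illustrated numerically. This is a sketch using the standard formulas for the two probability functions; the rate of 3 events per interval and the trial counts are arbitrary assumptions.

```python
import math


def poisson_pmf(k, lam):
    """P(K = k) when events occur at an average rate of lam per interval."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def binomial_pmf(k, n, p):
    """P(K = k) successes in n independent trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


if __name__ == "__main__":
    lam, k = 3.0, 2
    print("Poisson:", poisson_pmf(k, lam))
    # Keep n * p fixed at lam while n grows: the binomial probability
    # moves towards the Poisson probability.
    for n in (10, 100, 10_000):
        print(n, binomial_pmf(k, n, lam / n))
```

The gap between the two probabilities shrinks as the number of trials grows, which is why the Poisson distribution is often used for rare events counted over many opportunities.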

T-distribution

The t-distribution, also known as Student's t-distribution, is used mainly when the sample size is small and the population standard deviation is unknown. It is not a normal distribution, but it resembles one: it is bell-shaped and symmetrical about zero. Its shape changes with the degrees of freedom. It has heavier tails, and therefore greater dispersion, than the standard normal distribution. As the degrees of freedom increase, it approximates the standard normal distribution more and more closely.
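
The effect of the degrees of freedom can be seen by comparing densities in the tail. The sketch below implements the standard t density formula with `math.lgamma` (Python's standard library has no built-in t-distribution); the function name `t_pdf` and the evaluation point t=3 are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    # Work with logarithms of the gamma function to avoid overflow for large df.
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))


if __name__ == "__main__":
    normal_density = NormalDist().pdf(3.0)
    print("normal:", normal_density)
    # Far from zero the t density is larger than the normal density (heavier
    # tails); the difference shrinks as the degrees of freedom grow.
    for df in (2, 10, 100, 1000):
        print(df, t_pdf(3.0, df))
```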

Student's t-distribution ranges from –∞ to ∞. Important applications include hypothesis tests of a population mean, of the difference between two means from independent samples, and of the difference between two means from dependent (paired) samples.
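
For the first of these applications, the test statistic is the distance between the sample mean and the hypothesised population mean, measured in estimated standard errors. A minimal sketch, assuming a hypothetical helper `one_sample_t` and made-up measurements:

```python
import math
from statistics import mean, stdev


def one_sample_t(sample, mu0):
    """Return the t statistic and degrees of freedom for H0: population mean == mu0."""
    n = len(sample)
    # stdev() is the sample standard deviation (n - 1 in the denominator),
    # used because the population standard deviation is unknown.
    standard_error = stdev(sample) / math.sqrt(n)
    t_statistic = (mean(sample) - mu0) / standard_error
    return t_statistic, n - 1


if __name__ == "__main__":
    # Hypothetical measurements, tested against a claimed mean of 5.0.
    measurements = [5.1, 4.9, 5.3, 5.2, 4.8]
    t_statistic, degrees_of_freedom = one_sample_t(measurements, mu0=5.0)
    print(t_statistic, degrees_of_freedom)
```

The statistic is then compared with a t-distribution having n - 1 degrees of freedom to decide whether the difference is larger than chance would explain.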

Log-normal distribution

A log-normal distribution is the probability distribution of a random variable whose logarithm is normally distributed. A log-normally distributed random variable takes only positive real values.
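
Both properties can be checked by simulation. This sketch relies on `random.lognormvariate` from Python's standard library; the wrapper name `sample_lognormal`, the parameters mu=1.0 and sigma=0.5, and the seed are assumptions chosen for illustration.

```python
import math
import random
from statistics import mean, stdev


def sample_lognormal(mu, sigma, size, seed=0):
    """Draw log-normal values whose logarithm has mean mu and standard deviation sigma."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [rng.lognormvariate(mu, sigma) for _ in range(size)]


if __name__ == "__main__":
    values = sample_lognormal(mu=1.0, sigma=0.5, size=5_000)
    print(min(values) > 0)  # every value is positive

    # Taking logarithms should recover an approximately normal sample
    # with mean close to 1.0 and standard deviation close to 0.5.
    logs = [math.log(v) for v in values]
    print(mean(logs), stdev(logs))
```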


Sreejani Bhattacharyya

I am a technology journalist at AIM. What gets me excited is deep-diving into new-age technologies and analysing how they impact us for the greater good. Reach me at sreejani.bhattacharyya@analyticsindiamag.com