The Data Daily

Data Types in Statistics Used for Machine Learning.

The field of statistics is the science of learning from data. Statistical knowledge helps you use the proper methods to collect data, employ the correct analyses, and present the results effectively. Statistics is a crucial process behind how we make discoveries in science, make decisions based on data, and make predictions. It allows you to understand a subject much more deeply.

To become a successful data scientist, you must know the basics. Math and statistics are the building blocks of machine learning algorithms. It is important to understand the techniques behind the various algorithms so that you know how and when to use them. Now the question arises: what exactly is statistics?

“Statistics is a Mathematical Science of data collection, analysis, interpretation and presentation”.

One of the central concepts of data science is gaining insights from data. Statistics is an excellent tool for unlocking such insights in data. Statistics is a form of math, and it involves formulas, but it doesn’t have to be that scary even if you’ve never encountered it before.

Machine learning grew largely out of statistics. Many of the algorithms and models used in machine learning come from what’s called statistical learning. Knowing some basic statistics is extremely helpful whether you are deep into machine learning algorithms or just staying up to date on the latest machine learning research.

Having a good understanding of the different data types, also called measurement scales, is a crucial prerequisite for doing Exploratory Data Analysis (EDA) since you can use certain statistical measurements only for specific data types.

You also need to know which data type you are dealing with to choose the right visualization method. Think of data types as a way to categorize different types of variables. We will discuss the main types of data and look at an example for each.

The distinction between qualitative and quantitative data is the most fundamental way to divide types of data. Is the characteristic something you can objectively measure with numbers or not?

Qualitative data represents characteristics that you do not measure with numbers. Instead, the observations fall within a countable number of groups. This type of variable can capture information that isn’t easily measured and can be subjective. Taste, the colour of a car, architectural style, and marital status are all types of qualitative data. Analysts also refer to this as categorical data.

Nominal values represent discrete units and are used to label variables that have no quantitative value. Just think of them as labels. Note that nominal data has no order, so if you changed the order of its values, the meaning would not change. The colour of a car and marital status are two examples of nominal features.

Visualization Methods: To visualize nominal data you can use a pie chart or a bar chart.

In data science, you can use one-hot encoding to transform nominal data into numeric features.
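One-hot encoding can be sketched in a few lines of plain Python. This is only a minimal illustration: the helper `one_hot_encode` and the colour values are made up for this example, and in practice you would more likely reach for a library encoder.

```python
def one_hot_encode(values):
    """Return (categories, rows), where each row is a 0/1 vector for one value."""
    # Sorting the distinct labels only gives a reproducible column order;
    # the order itself carries no meaning for nominal data.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one column is "hot" per observation
        rows.append(row)
    return categories, rows


# Hypothetical nominal feature: car colours.
colours = ["red", "blue", "green", "blue"]
categories, encoded = one_hot_encode(colours)

print(categories)
for colour, row in zip(colours, encoded):
    print(colour, row)
```

Each label becomes its own column, so no artificial order is introduced between the categories.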

Ordinal data is a mix of numerical and categorical data. The data fall into categories, but the numbers placed on the categories have meaning. For example, rating a restaurant on a scale from 0 (lowest) to 4 (highest) stars gives ordinal data. Ordinal data are often treated as categorical, where the groups are ordered when graphs and charts are made. However, unlike purely nominal categories, the numbers carry information: their order is meaningful, even though the distance between two neighbouring values is not necessarily the same everywhere on the scale. Ordinal data is therefore nearly the same as nominal data, except that its ordering matters.

Ordinal scales are usually used to measure non-numeric features like happiness, customer satisfaction, the rank of students in a class, education qualification, etc.

Because ordinal data is categorical, you can summarize it with frequencies, proportions, and percentages, and you can visualize it with pie and bar charts. Because it is also ordered, you can additionally use percentiles, the median, the mode, and the interquartile range to summarize your data.
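As a rough sketch of these summaries, the snippet below uses Python's standard `collections` and `statistics` modules. The ratings are invented for illustration and reuse the 0 to 4 star scale from the restaurant example.

```python
from collections import Counter
import statistics

# Hypothetical restaurant ratings on the 0 (lowest) to 4 (highest) star scale.
ratings = [4, 3, 3, 2, 4, 1, 3, 0, 2, 3]

# Frequencies and proportions work for any categorical data.
frequencies = Counter(ratings)
proportions = {
    stars: count / len(ratings) for stars, count in sorted(frequencies.items())
}

# Because the categories are ordered, the median and mode are meaningful too.
# median_low always returns a value that actually occurs in the data,
# which avoids reporting a half-star rating that nobody gave.
median_rating = statistics.median_low(ratings)
mode_rating = statistics.mode(ratings)

print(proportions)
print(median_rating, mode_rating)
```

The mean is deliberately left out: it would assume that the gap between 1 and 2 stars is the same as the gap between 3 and 4 stars, which an ordinal scale does not guarantee.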

In addition to ordinal and nominal values, there is a special type of categorical data called binary.

Binary data has only two possible values, such as yes or no. These can be represented in different ways, such as “True” and “False” or 1 and 0. Binary data is used heavily in classification machine learning models. Examples of binary variables include whether a person has cancelled their subscription service or not, or whether a person bought a car or not.

Quantitative data is recorded as numbers and represents an objective measurement or a count. Temperature, weight, and a count of transactions are all quantitative data. Analysts also refer to this type as numerical data.

Discrete quantitative data are a count of the presence of a characteristic, result, item, or activity. These measures cannot be meaningfully divided into smaller increments. For example, a single household can have 1 or 2 cars, but it cannot have 1.6. The possible values that you can record for an observation form a countable set of distinct numbers.

With discrete variables, you can calculate and assess a rate of occurrence or a summary of the count, such as the mean, sum, and standard deviation. For example, U.S. households had an average of 2.11 vehicles in 2014.
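A small sketch with Python's standard `statistics` module shows these summaries on invented household counts (the numbers are hypothetical, not survey data):

```python
import statistics

# Hypothetical number of cars owned by eight households (discrete counts).
cars_per_household = [1, 2, 2, 0, 3, 1, 2, 1]

total_cars = sum(cars_per_household)
mean_cars = statistics.mean(cars_per_household)
stdev_cars = statistics.stdev(cars_per_household)  # sample standard deviation

print(total_cars, mean_cars, round(stdev_cars, 2))
```

Note that the mean can be fractional even though every individual observation is a whole number.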

Bar charts are a standard way to graph discrete variables. Each bar represents a distinct value, and the height represents its proportion in the entire sample.

Continuous variables can take on almost any numeric value and can be meaningfully divided into smaller increments, including fractional and decimal values. You often measure a continuous variable on a scale. For example, when you measure height, weight, and temperature, you have continuous data.

For example, the mean height in the United States is about 5 feet 9 inches for men and 5 feet 4 inches for women.

Continuous data comes in two types: interval and ratio.

Interval values represent ordered units that have the same difference. Therefore we speak of interval data when we have a variable that contains numeric values that are ordered and where we know the exact differences between the values. An example would be a feature that contains the temperature of a given place.

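The problem with interval data is that it doesn’t have a “true zero”. In the temperature example, 0 degrees on the Celsius or Fahrenheit scale does not mean that there is no temperature. As a result, you can add and subtract interval values, but ratios between them are not meaningful.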

Ratio values are also ordered units that have the same difference. Ratio values are the same as interval values, with the difference that they do have an absolute zero. Good examples are height, weight, length etc.
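The difference between the two scales can be illustrated with temperature. The sketch below assumes Celsius as the interval scale and brings in Kelvin, which has an absolute zero, as a ratio scale for comparison. Kelvin and the helper `celsius_to_kelvin` are additions for this illustration.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading (interval scale) to Kelvin (ratio scale)."""
    return celsius + 273.15


cool_c, warm_c = 10.0, 20.0

# Differences are meaningful on an interval scale and survive the conversion.
difference_c = warm_c - cool_c
difference_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Ratios are not: 20 / 10 == 2, but the zero point of Celsius is arbitrary,
# so "twice as hot" is not a valid conclusion.
naive_ratio = warm_c / cool_c
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(difference_c, round(difference_k, 2))
print(naive_ratio, round(kelvin_ratio, 3))
```

On the scale with a true zero, the warmer reading is only about 3.5% higher, which is why ratios should be reserved for ratio data.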

When you are dealing with continuous data, you have the widest range of methods available to describe it. You can summarize your data using percentiles, the median, the interquartile range, the mean, the mode, the standard deviation, and the range.
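Most of these summaries are available in Python's standard `statistics` module. The following is a minimal sketch on invented height measurements. The mode is left out because exact values rarely repeat in continuous data.

```python
import statistics

# Hypothetical heights in centimetres (continuous, ratio-scaled).
heights_cm = [158.2, 162.5, 165.0, 167.3, 170.1, 171.8, 174.4, 180.6]

summary = {
    "mean": statistics.mean(heights_cm),
    "median": statistics.median(heights_cm),
    "stdev": statistics.stdev(heights_cm),  # sample standard deviation
    "range": max(heights_cm) - min(heights_cm),
}

# quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(heights_cm, n=4)
summary["iqr"] = q3 - q1

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```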

To visualize continuous data, you can use a histogram or a box plot. With a histogram, you can check the central tendency, variability, modality, and kurtosis of a distribution. Note that a histogram does not explicitly flag outliers, which is why we also use box plots.
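Box plots commonly mark points as outliers when they fall more than 1.5 times the interquartile range beyond the quartiles. That convention is an assumption here, since the article does not specify a rule. The helper `tukey_outliers` and the weights below are hypothetical, and the exact quartile values depend on the interpolation method used.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical weights in kilograms with one suspicious reading.
weights_kg = [61.0, 64.5, 66.2, 68.0, 70.3, 72.1, 75.4, 140.0]
print(tukey_outliers(weights_kg))
```

Only the 140.0 kg reading falls outside the fences here, which is the kind of point a box plot would draw separately.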

In this post, you discovered the different data types that are used throughout statistics. You learned the difference between discrete and continuous data, and what the nominal, ordinal, binary, interval, and ratio measurement scales are. Furthermore, you now know which statistical measurements you can use with which data type and which visualization methods are appropriate. You also learned which methods can transform categorical variables into numeric variables. This enables you to carry out a large part of an exploratory analysis on a given dataset.
