An Introduction to Data Science Part 2

Shivatmica Murgai
Apr 12, 2021 · 5 min read

In the last article, I talked about how Data Science can be arranged into tasks and covered the first three tasks (collecting data, storing data, and processing data). This article looks at the next tasks: describing data and modelling data, where modelling data is broken up into statistical and algorithmic modelling. You can find the first article here.

Describing Data

Describing data starts with visualizing and summarizing it. When visualizing data, you may want to plot the number of shirts sold in each color, for example, to see which color is most popular. From a bar graph, the most dominant or popular color can be read off at a glance. There are also stacked bar graphs, grouped bar graphs, and so on, which can be used to compare different years or products and spot trends.
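
As a rough illustration, here is a minimal sketch of the shirt-color example using matplotlib; the colors and counts are made up purely for illustration.

```python
# A minimal sketch of plotting shirt sales by color with matplotlib.
# The colors and counts below are hypothetical.
import matplotlib.pyplot as plt

colors = ["red", "blue", "black", "white", "green"]
shirts_sold = [120, 95, 150, 80, 60]

plt.bar(colors, shirts_sold)
plt.xlabel("Shirt color")
plt.ylabel("Number of shirts sold")
plt.title("Shirt sales by color")
plt.show()
```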

This is far more helpful than staring at rows in a relational database, where trends are hard to see from the raw records. A scatter plot is also useful for finding general trends. The next step is summarizing the data: for example, what is the typical number of TVs sold daily at a store? The "typical" number may be the mean, median, or mode. Each of these statistics can be misleading on its own, since on some days far more TVs may be sold than on others, so it is important to look at more than one.

This leads to the typical variation: how far the values spread from the average. This is captured by the standard deviation and the variance. Descriptive statistics like these are useful because describing data is an iterative process, also called exploratory data analysis: you draw some plots, compute some summary statistics such as variances, and then iterate again until trends emerge.
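
As a small sketch of summarizing data, the snippet below computes the mean, median, mode, standard deviation, and variance for a made-up list of daily TV sales that includes one unusually high day.

```python
# Summarizing hypothetical daily TV sales with basic descriptive statistics.
import numpy as np
from statistics import mode

tvs_sold_per_day = [4, 5, 3, 5, 6, 5, 25, 5, 4, 6]  # one unusually high day

print("Mean:    ", np.mean(tvs_sold_per_day))
print("Median:  ", np.median(tvs_sold_per_day))
print("Mode:    ", mode(tvs_sold_per_day))
print("Std dev: ", np.std(tvs_sold_per_day))
print("Variance:", np.var(tvs_sold_per_day))
```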

Statistical Modelling

Statistical modelling assumes an underlying data distribution. For example, patient data such as blood sugar, cholesterol, etc. can be collected. Most physical readings in biological studies follow a roughly normal distribution, with most of the mass concentrated around the mean, along with a few outliers. These models can be used to infer patterns, which is a very important part of a data scientist's job. For example, there may be a new drug on the market, and doctors would like to determine whether or not it reduces blood sugar levels.
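
To see what "most of the mass around the mean" looks like, here is a quick sketch that samples hypothetical blood sugar readings from a normal distribution and plots a histogram; the mean and spread are invented for illustration.

```python
# Simulating normally distributed blood sugar readings (hypothetical values).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
readings = rng.normal(loc=150, scale=15, size=1000)  # mean 150, std dev 15

plt.hist(readings, bins=30)
plt.xlabel("Blood sugar reading")
plt.ylabel("Number of patients")
plt.title("Simulated readings cluster around the mean")
plt.show()
```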

A specific sample of people may be tested, and the mean and variance are the values most crucial for showing the drug's results. The mean blood sugar before and after the drug may have shifted, which may show the effect of the drug. Without assuming a specific distribution for the data, we may not be able to make robust arguments about the drug's effectiveness, because of the randomness in the small sample used. The results should be quantified with a confidence level so that they come with a robust statistical guarantee.

There may also be underlying relationships, such as the relationship between the number of days of treatment and blood sugar. You may then be able to make concrete statements such as "blood sugar decreases by 3 points for each day of treatment." There may be a linear relationship of the form y = mx + c, and robust statistical arguments are much easier to make with a linear function.
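
As a sketch of fitting such a linear relationship, the snippet below uses scipy's linregress on made-up readings of blood sugar against days of treatment to estimate m and c.

```python
# Fitting y = mx + c between days of treatment and blood sugar (made-up data).
import numpy as np
from scipy import stats

days = np.array([1, 2, 3, 4, 5, 6, 7, 8])
blood_sugar = np.array([180, 177, 174, 170, 168, 165, 161, 158])

result = stats.linregress(days, blood_sugar)
print(f"slope m = {result.slope:.2f}, intercept c = {result.intercept:.2f}")
print(f"r-squared = {result.rvalue ** 2:.3f}")
```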

So we are interested in modelling the underlying data distribution and the underlying relationships in the data. To do that, we formulate and test hypotheses. The formal hypothesis in the drug example is that the average blood sugar after the drug is lower than before the drug. The resulting arguments must come with statistical guarantees, such as p-values and goodness-of-fit tests.
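
Here is a minimal sketch of testing that hypothesis, assuming paired before/after readings and using a paired t-test from scipy (the alternative argument needs a reasonably recent scipy); all numbers are invented.

```python
# Paired t-test on hypothetical before/after blood sugar readings.
import numpy as np
from scipy import stats

before = np.array([182, 175, 190, 168, 200, 178, 185, 172])
after = np.array([170, 168, 181, 160, 188, 170, 176, 165])

# H0: the drug does not lower blood sugar; H1: blood sugar is lower after.
t_stat, p_value = stats.ttest_rel(before, after, alternative="greater")
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the drug appears to lower blood sugar.")
else:
    print("Fail to reject H0: no significant reduction detected.")
```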

Algorithmic Modelling

There is an alternative form of modelling: algorithmic modelling. Statistical modelling typically deals with one or a few variables and fairly simple relationships, and it is used in fields that need statistical guarantees, such as farming and medicine. Statistical models shouldn't be very complicated, because trends can no longer be determined with guarantees once the model gets too complex. An alternative approach is to build complex models: some complicated function f(x) that maps inputs to outputs.

The blood sugar level depends on several factors, not just the number of days of treatment. These inputs may include age, weight, genetics, blood pressure, etc. There is some relationship between the inputs and the output, which we would like to find in the form of a complex function, for example y = m1x1 + m2x2 + … + mnxn = f(x1, x2, …, xn). Machine learning provides a family of extremely complex functions, and its job is to estimate the parameters of such a function from the available data using optimization techniques.
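
As a sketch of estimating the parameters m1…mn from data, the snippet below fits scikit-learn's LinearRegression on made-up patient records with three input factors.

```python
# Fitting y = m1*x1 + ... + mn*xn on hypothetical patient data with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: age, weight (kg), blood pressure -- all made up for illustration.
X = np.array([
    [45, 80, 130],
    [52, 95, 140],
    [38, 70, 120],
    [60, 88, 150],
    [49, 76, 135],
    [55, 90, 145],
])
y = np.array([160, 185, 140, 200, 170, 190])  # blood sugar readings

model = LinearRegression().fit(X, y)
print("Learned parameters m1..mn:", model.coef_)
print("Prediction for a new patient:", model.predict([[50, 85, 138]]))
```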

For a new patient, we plug in these factors to get the predicted blood sugar level. Here we focus on prediction without worrying about the underlying reasons. Predicting stock prices is another example: the price depends on many different factors, and an algorithmic approach only cares about the resulting stock price, or whether it will increase or decrease, without worrying about the reasons. Statistical modelling uses simple, intuitive models to find relationships with robust analysis; algorithmic modelling uses complex, flexible models without caring about the reasons.

Statistical modelling is suited to low-dimensional data, while algorithmic modelling works with very high-dimensional data with many inputs. Robust statistical analysis is possible in statistical modelling, but algorithmic models are usually too complex for it. Statistical modelling focuses on the reasons and on interpretability, while algorithmic modelling focuses on prediction, without worrying about the reasons behind the result.

Data-lean models are used in statistical modelling because the models are simple. Data-hungry models are used in algorithmic modelling because the machine must learn the trends from the data. There is more statistics in statistical modelling and more machine learning and deep learning in algorithmic modelling. Statistical modelling includes linear regression, logistic regression, and linear discriminant analysis. Algorithmic modelling includes these as well, with the addition of decision trees, KNNs, SVMs, Naive Bayes, and multilayered neural networks. Both take the form y = f(x), with x as the input.
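
To make the contrast concrete, here is a small sketch that trains one model from each list, a logistic regression and a decision tree, on scikit-learn's built-in breast cancer dataset; both are just instances of y = f(x).

```python
# A statistical-style model (logistic regression) vs an algorithmic-style
# model (decision tree) on the same dataset, using scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

log_reg = LogisticRegression(max_iter=5000).fit(X_train, y_train)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("Logistic regression accuracy:", log_reg.score(X_test, y_test))
print("Decision tree accuracy:      ", tree.score(X_test, y_test))
```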

The statistics used in statistical modelling are much simpler than in algorithmic modelling. “When you have large amounts of high-dimensional data and you want to learn very complex relationships between the output and input use a specific class of complex ML models and algorithms, collectively referred to as Deep Learning.” Another example: given a picture of the retina, we must predict whether the patient is suffering from diabetic retinopathy. This is very high-dimensional data.

We are given lots of similar data from hospitals, where the output is whether or not the patient has the disease. With enough labelled images across these many variables, we may be able to say whether the disease is present. Deep learning can be used to learn the relationship between this input and output, in other words f(x). Deep learning is very popular today because there are large amounts of data, the relationships are complex, good software frameworks exist, and computing power has improved.
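
As a minimal sketch of what such a deep learning model might look like, here is a tiny convolutional network in Keras; the input size, layer widths, and the idea of training it on labelled retina images are all illustrative assumptions, not a working diagnostic system.

```python
# A tiny illustrative CNN for binary image classification with Keras.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(128, 128, 3)),      # e.g. a retina image resized to 128x128
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability that the disease is present
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(images, labels, epochs=10)      # would be trained on labelled images
```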

This was the second part of the introduction to data science. Thanks for reading!
