UNIT 1 : INTRODUCTION TO DATA SCIENCE
What Is Data Science?
Data science is the domain of study that deals with vast volumes of data using modern tools and techniques to find unseen patterns, derive meaningful information, and make business decisions. Data science uses complex machine learning algorithms to build predictive models. The data used for analysis can come from many different sources and be presented in various formats.
Data Science Lifecycle
Earlier, data was much smaller in volume and was generally available in a well-structured form that could be stored easily in Excel sheets and processed efficiently with Business Intelligence tools. Today, however, we deal with enormous amounts of data: roughly 3.0 quintillion bytes of records are produced every single day, which ultimately results in an explosion of records and data. According to recent research, an estimated 1.9 MB of data and records are created every second by a single individual.
Dealing with such a massive amount of data generated every second is therefore a very big challenge for any organization. Handling and evaluating this data requires very powerful, complex algorithms and technologies, and this is where data science comes into the picture.
1. Business Understanding: The complete cycle revolves around the business goal. What will you solve if you do not have a specific problem? It is extremely important to understand the business objective clearly, because that will be the ultimate aim of the analysis. Only after a proper understanding can we set the precise goal of the analysis so that it is in sync with the business objective. You need to understand whether the customer wants to minimize losses, predict the price of a commodity, and so on.
2. Data Understanding: After business understanding, the next step is data understanding. This involves collecting all of the available data. Here you need to work closely with the business team, as they know what data exists, which data should be used for this business problem, and other relevant details. This step includes describing the data, its structure, its relevance, and its data types. Explore the data using graphical plots; essentially, extract any information you can about the data simply by exploring it.
3. Preparation of Data: Next comes the data preparation stage. This consists of steps such as selecting the relevant data, integrating the data by merging the data sets, cleaning it, treating missing values by either removing or imputing them, treating inaccurate data by removing it, and checking for outliers using box plots and handling them. Construct new data and derive new features from existing ones. Format the data into the desired structure and remove unwanted columns and features. Data preparation is the most time-consuming yet arguably the most important step in the complete life cycle; your model will only be as accurate as your data. (A miniature code sketch of steps 3 to 6 is shown after this list.)
4. Exploratory Data Analysis: This step involves getting some idea about the solution and the factors affecting it before building the actual model. The distribution of data within the different variables is explored graphically using bar graphs, and relations between different features are captured through graphical representations like scatter plots and heat maps. Many data visualization techniques are used extensively to explore every feature individually and in combination with other features.
5. Data Modeling: Data modeling is the heart of data analysis. A model takes the prepared data as input and produces the desired output. This step involves choosing the appropriate kind of model, depending on whether the problem is a classification, regression, or clustering problem. After choosing the model family, we need to carefully select and implement the algorithms within that family. We need to tune the hyperparameters of each model to achieve the desired performance. We also need to make sure there is the right balance between performance and generalizability: we do not want the model to memorize the data and perform poorly on new data.
6. Model Evaluation: Here the model is evaluated to check whether it is ready to be deployed. The model is tested on unseen data and evaluated on a carefully chosen set of evaluation metrics. We also need to make sure that the model conforms to reality. If we do not obtain a satisfactory result in the evaluation, we have to iterate over the complete modelling process until the desired level of the metrics is achieved. Any data science solution, such as a machine learning model, must, just like a human, evolve, be able to improve itself with new data, and adapt to a new evaluation metric. We can build more than one model for a given phenomenon, but many of them may be imperfect. Model evaluation helps us choose and build the best model.
7. Model Deployment: After a rigorous evaluation, the model is finally deployed in the desired format and channel. This is the last step in the data science life cycle. Each step in the data science life cycle described above must be worked on carefully. If any step is performed improperly, it affects the subsequent steps and the complete effort goes to waste. For example, if data is not collected properly, you will lose records and will not be able to build an ideal model. If the data is not cleaned properly, the model will not work. If the model is not evaluated properly, it will fail in the real world. Right from business understanding to model deployment, every step has to be given appropriate attention, time, and effort.
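To make steps 3 through 6 concrete, here is a miniature end-to-end sketch in Python using pandas and scikit-learn. The dataset is synthetic and every column name (price, quantity, revenue, high_demand) is invented purely for illustration; this is a sketch of the workflow under those assumptions, not a prescribed implementation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# --- Step 3: Preparation of data (synthetic stand-in for a real dataset) ---
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "price": rng.normal(100, 20, 500),
    "quantity": rng.integers(0, 50, 500),
})
df.loc[::25, "price"] = np.nan                           # simulate missing values
df["price"] = df["price"].fillna(df["price"].median())   # impute missing values
df["revenue"] = df["price"] * df["quantity"]             # derive a new feature
# Synthetic, noisy target variable, used only so the example runs end to end.
df["high_demand"] = ((df["quantity"] + rng.normal(0, 10, 500)) > 25).astype(int)

# --- Step 4: Exploratory data analysis (distributions and relationships) ---
print(df.describe())                  # summary statistics per feature
print(df.corr(numeric_only=True))     # correlations between numeric features

# --- Step 5: Data modeling (choose a model family, tune hyperparameters) ---
X = df[["price", "quantity", "revenue"]]
y = df["high_demand"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5]},
    cv=5,
)
search.fit(X_train, y_train)

# --- Step 6: Model evaluation on unseen (held-out) data ---
y_pred = search.best_estimator_.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
```

The held-out test set plays the role of the "unseen data" mentioned in step 6; the grid search illustrates tuning hyperparameters to balance performance and generalizability.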
Data science and machine learning are closely related fields, and they often go hand in hand, with machine learning being a subset of data science. Here's an overview of their relationship:
Data Science Overview:
- Definition: Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract insights and knowledge from structured and unstructured data.
- Components: Data science encompasses various components, including data cleaning, exploration, visualization, statistical analysis, and predictive modeling.
Machine Learning Overview:
- Definition: Machine learning is a subset of artificial intelligence that focuses on the development of algorithms that enable computers to learn patterns and make predictions or decisions without being explicitly programmed.
- Components: Machine learning includes tasks such as supervised learning, unsupervised learning, reinforcement learning, and deep learning. It involves training models on data to recognize patterns and make predictions or decisions.
Relationship between Data Science and Machine Learning:
- Data as the Foundation: Data science relies on data to derive insights and solve problems. Machine learning, in turn, uses algorithms to learn from and make predictions or decisions based on that data.
- Predictive Modeling: Machine learning is often a crucial component of data science, especially in predictive modeling. Data scientists use machine learning algorithms to build models that can predict future outcomes or classify data into different categories.
- Tools and Techniques: Data scientists often use machine learning tools and techniques to analyze and interpret data. Machine learning algorithms help in identifying patterns, trends, and relationships within the data that may not be apparent through traditional statistical methods.
- Iterative Process: Data science is often an iterative process where machine learning models are built, evaluated, and refined based on the results obtained. This iterative cycle contributes to the improvement of models and the overall data science process.
Common Tasks in Data Science and Machine Learning:
- Exploratory Data Analysis (EDA): Both data science and machine learning involve exploring and understanding the data through techniques such as visualization and statistical analysis.
- Feature Engineering: In both fields, selecting and engineering relevant features from the data is crucial for building effective models.
- Model Evaluation: Both data scientists and machine learning practitioners need to assess the performance of their models, ensuring they generalize well to new, unseen data.
In summary, while data science is a broader field that encompasses various techniques for extracting insights from data, machine learning is a specific set of techniques within data science that focuses on building models capable of learning and making predictions or decisions. Data science provides the foundation, and machine learning is a powerful tool within the data scientist's toolkit.
Types Of Data – Nominal, Ordinal, Discrete and Continuous
The data is classified into four categories:
- Nominal data.
- Ordinal data.
- Discrete data.
- Continuous data.
Qualitative or Categorical Data
Qualitative or Categorical Data is data that can’t be measured or counted in the form of numbers. These types of data are sorted by category, not by number. That’s why it is also known as Categorical Data. These data consist of audio, images, symbols, or text. The gender of a person, i.e., male, female, or others, is qualitative data.
Qualitative data tells about the perception of people. This data helps market researchers understand the customers’ tastes and then design their ideas and strategies accordingly.
The other examples of qualitative data are :
- What language do you speak
- Favorite holiday destination
- Opinion on something (agree, disagree, or neutral)
- Colors
The Qualitative data are further classified into two parts :
Nominal Data
Nominal Data is used to label variables without any order or quantitative value. The color of hair can be considered nominal data, as one color can’t be compared with another color.
The name “nominal” comes from the Latin word “nomen,” which means “name.” With nominal data, we can’t perform any numerical tasks or give any order to sort the data. These data don’t have any meaningful order; their values are distributed into distinct categories.
Examples of Nominal Data :
- Color of hair (Blonde, red, Brown, Black, etc.)
- Marital status (Single, Widowed, Married)
- Nationality (Indian, German, American)
- Gender (Male, Female, Others)
- Eye Color (Black, Brown, etc.)
Ordinal Data
Ordinal data have a natural ordering in which the values take some kind of order by their position on the scale. These data are used for observations like customer satisfaction, happiness, etc., but we can’t perform arithmetic operations on them.
Ordinal data is qualitative data whose values have some kind of relative position. These kinds of data can be considered “in-between” qualitative and quantitative data. Ordinal data only shows the sequence and cannot be used for arithmetic-based statistical analysis. Compared to nominal data, ordinal data have some kind of order that is not present in nominal data.
Examples of Ordinal Data :
- When companies ask for feedback, experience, or satisfaction on a scale of 1 to 10
- Letter grades in the exam (A, B, C, D, etc.)
- Ranking of people in a competition (First, Second, Third, etc.)
- Economic Status (High, Medium, and Low)
- Education Level (Higher, Secondary, Primary)
Difference between Nominal and Ordinal Data
Nominal Data | Ordinal Data |
---|---|
Nominal data can’t be quantified, nor do they have any intrinsic ordering | Ordinal data give some kind of sequential order by their position on the scale |
Nominal data is qualitative or categorical data | Ordinal data is said to be “in-between” qualitative and quantitative data |
They don’t provide any quantitative value, nor can we perform any arithmetic operations on them | They provide a sequence; numbers can be assigned to ordinal data, but arithmetic operations cannot be performed on them |
Nominal data cannot be used to compare items with one another | Ordinal data can help to compare one item with another by ranking or ordering |
Examples: Eye color, housing style, gender, hair color, religion, marital status, ethnicity, etc. | Examples: Economic status, customer satisfaction, education level, letter grades, etc. |
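In code, the nominal versus ordinal distinction corresponds to unordered versus ordered categorical types. A small pandas sketch (with made-up example values) illustrates the difference:

```python
import pandas as pd

# Nominal: categories with no intrinsic order (hair color).
hair = pd.Categorical(["Blonde", "Brown", "Black", "Brown"])

# Ordinal: categories with a meaningful order (letter grades).
grades = pd.Categorical(
    ["B", "A", "C", "B"],
    categories=["D", "C", "B", "A"],
    ordered=True,
)

print(hair.ordered)   # False: comparing hair colors is not meaningful
print(grades.min())   # 'C': ordering makes min/max and sorting possible
print(grades > "C")   # element-wise comparison against a category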
Quantitative Data
Quantitative data can be expressed in numerical values, which makes it countable and suitable for statistical data analysis. These kinds of data are also known as Numerical data. It answers questions like “how much,” “how many,” and “how often.” For example, the price of a phone, the computer’s RAM, the height or weight of a person, etc., fall under quantitative data.
Quantitative data can be used for statistical manipulation. These data can be represented on a wide variety of graphs and charts, such as bar graphs, histograms, scatter plots, boxplots, pie charts, line graphs, etc.
Examples of Quantitative Data :
- Height or weight of a person or object
- Room Temperature
- Scores and Marks (Ex: 59, 80, 60, etc.)
- Time
The Quantitative data are further classified into two parts :
Discrete Data
The term discrete means distinct or separate. Discrete data contain values that are integers or whole numbers. The total number of students in a class is an example of discrete data. These data can’t be broken into decimal or fraction values.
The discrete data are countable and have finite values; their subdivision is not possible. These data are represented mainly by a bar graph, number line, or frequency table.
Examples of Discrete Data :
- Total number of students present in a class
- Cost of a cell phone
- Number of employees in a company
- The total number of players who participated in a competition
- Days in a week
Continuous Data
Continuous data are in the form of fractional numbers. It can be the version of an android phone, the height of a person, the length of an object, etc. Continuous data represents information that can be divided into smaller levels. The continuous variable can take any value within a range.
The key difference between discrete and continuous data is that discrete data contain integers or whole numbers, while continuous data store fractional numbers to record different types of data such as temperature, height, width, time, speed, etc.
Examples of Continuous Data :
- Height of a person
- Speed of a vehicle
- “Time-taken” to finish the work
- Wi-Fi Frequency
- Market share price
Difference between Discrete and Continuous Data
Discrete Data | Continuous Data |
---|---|
Discrete data are countable and finite; they are whole numbers or integers | Continuous data are measurable; they are in the form of fractions or decimal |
Discrete data are represented mainly by bar graphs | Continuous data are represented in the form of a histogram |
The values cannot be divided into subdivisions into smaller pieces | The values can be divided into subdivisions into smaller pieces |
Discrete data have spaces between the values | Continuous data are in the form of a continuous sequence |
Examples: Total students in a class, number of days in a week, size of a shoe, etc. | Examples: Temperature of a room, the weight of a person, length of an object, etc. |
STATISTICS:
Mean, Median and Mode
Mean
In mathematics and statistics, the mean is the average of the numerical observations which is equal to the sum of the observations divided by the number of observations.
x̄ = (x₁ + x₂ + … + xₙ) / n
where,
x̄ = arithmetic mean
n = number of values
x₁, …, xₙ = data set values
Median
The median of the data, when arranged in ascending or descending order, is the middle observation of the data, i.e. the point separating the higher half from the lower half of the data.
To calculate the median:
- Arrange the data in ascending or descending order.
- For an odd number of data points, the middle value is the median.
- For an even number of data points, the average of the two middle values is the median.
Here, if x₁ ≤ x₂ ≤ … ≤ xₙ is the ordered list of values in the data set and n is the number of values in the data set, the median is the ((n + 1)/2)-th value when n is odd, and the average of the (n/2)-th and (n/2 + 1)-th values when n is even.
Mode
The mode of a set of data points is the most frequently occurring value.
For example:
5,2,6,5,1,1,2,5,3,8,5,9,5 are the set of data points. Here 5 is the mode because it’s occurring most frequently.
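These three measures can be computed directly with Python’s built-in statistics module; a quick check on the data points from the mode example above:

```python
import statistics

data = [5, 2, 6, 5, 1, 1, 2, 5, 3, 8, 5, 9, 5]

print(statistics.mean(data))    # sum of the values divided by how many there are
print(statistics.median(data))  # middle value of the sorted data
print(statistics.mode(data))    # most frequently occurring value -> 5
```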
Variance and Standard Deviation
Variance
Mathematically and statistically, variance is defined as the average of the squared differences from the mean. But for understanding, this depicts how spread out the data is in a dataset.
The steps of calculating variance using an example:
Let’s find the variance of (1,4,5,4,8)
- Find the mean of the data points i.e. (1 + 4 + 5 + 4 + 8)/5 = 4.4
- Find the differences from the mean i.e. (-3.4, -0.4, 0.6, -0.4, 3.6)
- Find the squared differences i.e. (11.56, 0.16, 0.36, 0.16, 12.96)
- Find the average of the squared differences i.e. (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 5 = 5.04
The formula for the (population) variance is:
σ² = Σ (xᵢ − μ)² / N
where μ is the mean, xᵢ is each value, and N is the number of values.
Standard Deviation
Standard deviation measures the variation or dispersion of the data points in a dataset. It depicts the closeness of the data point to the mean and is calculated as the square root of the variance.
In data science, the standard deviation is often used to identify outliers in a data set. Data points that lie far from the mean (commonly more than two or three standard deviations away) are considered unusual.
The formula for standard deviation is:
σ = √( Σ (xᵢ − μ)² / N )
where,
σ = population standard deviation
N = the size of the population
xᵢ = each value from the population
μ = the population mean
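As a quick check of the worked example above, the following snippet computes the population variance and standard deviation for the data points (1, 4, 5, 4, 8):

```python
data = [1, 4, 5, 4, 8]

mean = sum(data) / len(data)                                # 4.4
variance = sum((x - mean) ** 2 for x in data) / len(data)   # 5.04
std_dev = variance ** 0.5                                   # about 2.245

print(mean, variance, std_dev)
```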
Population Data V/s Sample Data
Population data refers to the complete data set whereas sample data refers to a part of the population data which is used for analysis. Sampling is done to make analysis easier.
When using sample data for analysis, the formula for variance is slightly different: if there are n samples in total, we divide by n − 1 instead of n:
s² = Σ (xᵢ − x̄)² / (n − 1)
where,
s² = sample variance
xᵢ = the value of one observation
x̄ = the mean value of the observations
n = the number of observations
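In NumPy, the ddof parameter controls whether the divisor is n (population) or n − 1 (sample); for example, using the same data points as above:

```python
import numpy as np

data = np.array([1, 4, 5, 4, 8])

print(np.var(data, ddof=0))  # population variance: divide by n      -> 5.04
print(np.var(data, ddof=1))  # sample variance: divide by n - 1      -> 6.3
print(np.std(data, ddof=1))  # sample standard deviation
```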
PROBABILITY:
What is Probability?
The concept of probability is extremely simple. It means how likely an event is to occur, or the chance of the occurrence of an event.
The formula for probability is:
P(Event) = Number of favourable outcomes / Total number of possible outcomes
For example:
The probability of the coin showing heads when it’s flipped is 0.5.
Conditional Probability
Conditional probability is the probability of an event occurring provided another event has already occurred.
The formula for conditional probability is:
P(A | B) = P(A ∩ B) / P(B)
For example:
The students of a class have taken two tests in mathematics. In the first test, 60% of the students passed, while only 40% of the students passed both tests. What percentage of students who passed the first test also cleared the second test? Using the formula, P(Second | First) = P(First and Second) / P(First) = 0.40 / 0.60 ≈ 0.67, so about 67% of the students who passed the first test also passed the second.
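The same calculation in code, using the figures from the example:

```python
p_first = 0.60   # P(passed the first test)
p_both = 0.40    # P(passed both tests)

# P(second | first) = P(first and second) / P(first)
p_second_given_first = p_both / p_first
print(round(p_second_given_first, 2))  # 0.67, i.e. about 67%
```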
Bayes’ Theorem
Bayes’ Theorem is a very important statistical concept used in many industries such as healthcare and finance. The formula for conditional probability given above is closely related to this theorem; in fact, Bayes’ theorem can be derived from it.
It is used to calculate the probability of a hypothesis based on prior knowledge and the observed data.
The formula for Bayes’ theorem is:
P(A | B) = P(B | A) · P(A) / P(B)
where,
A, B = events
P(A | B) = probability of A given B is true
P(B | A) = probability of B given A is true
P(A), P(B) = the independent probabilities of A and B
For example:
Let’s assume there is an HIV test that can identify HIV positive patients accurately 99% of the time, and also accurately gives a negative result for 99% of HIV negative people, while only 0.3% of the overall population is HIV positive. Then the probability that a person who tests positive actually is HIV positive is P(HIV+ | Positive) = (0.99 × 0.003) / (0.99 × 0.003 + 0.01 × 0.997) ≈ 0.23, i.e. only about 23%, despite the high accuracy of the test.
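A small snippet applies Bayes’ theorem to the numbers in this example:

```python
p_pos_given_hiv = 0.99       # test sensitivity: P(positive | HIV+)
p_neg_given_healthy = 0.99   # test specificity: P(negative | HIV-)
p_hiv = 0.003                # prevalence: P(HIV+)

# Law of total probability: P(positive)
p_pos = p_pos_given_hiv * p_hiv + (1 - p_neg_given_healthy) * (1 - p_hiv)

# Bayes' theorem: P(HIV+ | positive)
p_hiv_given_pos = p_pos_given_hiv * p_hiv / p_pos
print(round(p_hiv_given_pos, 3))  # roughly 0.23
```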
Dependent and independent events are the two types of events that occur in probability. Suppose we have two events, Event A and Event B. If they are dependent events, then the occurrence of one event depends on the occurrence of the other; if they are independent events, then the occurrence of one event does not affect the probability of the other.
We can understand dependent and independent events with the help of examples. When tossing two coins simultaneously, the outcome of one coin does not affect the outcome of the other, so they are independent events. Suppose instead we conduct an experiment where we toss a coin only when we get a six on the throw of a die; here the outcome of one event is affected by the other, so they are dependent events.
Dependent Events
Dependent events are those events that are affected by the outcomes of events that have already occurred. In other words, two or more events that depend on one another are known as dependent events. If one event changes, the other is likely to differ as well. Thus, if whether one event occurs does affect the probability that the other event will occur, then the two events are said to be dependent.
When the occurrence of one event affects the occurrence of another subsequent event, the two events are dependent events. The concept of dependent events gives rise to the concept of conditional probability which will be discussed in the article further.
Examples of Dependent Events
For example, let’s say three cards are to be drawn from a pack of cards. The probability of getting a king on the first draw is fixed, but the probability of getting a king on the second draw depends on what was drawn first.
In the draw of the third card, this probability would be dependent upon the outcomes of the previous two cards. We can say that after drawing one card, there will be fewer cards available in the deck, therefore the probabilities after each drawn card changes.
Independent Events
Independent events are those events whose occurrence is not dependent on any other event. If the probability of occurrence of an event A is not affected by the occurrence of another event B, then A and B are said to be independent events.
Examples of Independent Events
- Tossing a Coin
Sample Space(S) in a Coin Toss = {H, T}
Both getting H and T are Independent Events
- Rolling a Die
Sample Space(S) in Rolling a Die = {1, 2, 3, 4, 5, 6}, all of these events are independent too.
Both of the above examples are simple events. Even compound events can be independent events. For example:
- Tossing a Coin and Rolling a Die
If we simultaneously toss a coin and roll a die then the probability of all the events is the same and all of the events are independent events,
Sample Space(S) of such experiment = {(1, H), (2, H), (3, H), (4, H), (5, H), (6, H), (1, T), (2, T), (3, T), (4, T) (5, T) (6, T)}.
These events are independent because the outcome of the coin does not affect the outcome of the die, and vice versa.
Note
- If A and B are two events associated with the same random experiment, then A and B are known as independent events if
P(A ∩ B) = P(A) · P(B)
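As a small numerical check of this definition for the coin-and-die experiment above (the events A = "the coin shows heads" and B = "the die shows an even number" are chosen just for illustration):

```python
from fractions import Fraction

# Sample space of tossing a coin and rolling a die simultaneously.
sample_space = [(number, side) for number in range(1, 7) for side in "HT"]

# Event A: the coin shows heads; event B: the die shows an even number.
a = {outcome for outcome in sample_space if outcome[1] == "H"}
b = {outcome for outcome in sample_space if outcome[0] % 2 == 0}


def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(sample_space))


# Independence: P(A ∩ B) equals P(A) * P(B).
print(prob(a & b) == prob(a) * prob(b))  # True
```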
Difference Between Independent Events and Dependent Events
The difference between independent events and dependent events is discussed in the table below,
Independent Events | Dependent Events |
---|---|
Independent events are events that are not affected by the occurrence of other events. | Dependent events are events that are affected by the occurrence of other events. |
The formula for the Independent Events is, P(A and B) = P(A)×P(B) | The formula for the Dependent Events is, P(B and A) = P(A)×P(B after A) |
Examples of Independent Events are tossing a coin and rolling a die simultaneously, drawing a card after replacing the previous one, etc. | Examples of Dependent Events are drawing cards from a deck without replacement, tossing a coin only after getting a six on a die, etc. |