This tutorial explains common and tricky data analyst interview questions, along with answers.
The main responsibility of data analyst is to generate insights from data and present it to stakeholders such as external or internal clients. During this process, he/she extracts data from database and then clean it up to prepare it for analysis. Later data analysis step involves exploration of data with descriptive statistics and then building predictive model for prediction.
Data analyst must know basics and intermediary statistics and know how to apply it with Python or SAS. Excel and SQL are the two most popular tools used in data analytics so candidate must possess a good knowledge of these tools. Excel is used for a variety of purposes such as generating quick summaries and presenting it in an interactive excel dashboard. Mostly offline reporting deliverables are in either excel or powerpoint report formats.
This tutorial covers interview questions on the following topics:
Excel Interview Questions
The following is a list of some tricky or advanced excel interview questions.
1. What is the default value of last parameter of VLOOKUP?
Solution : TRUE/1 . It refers to finding the closest (approximate) match and assuming the table is sorted in ascending order. Whereas, FALSE/0 refers to exact match.
2. What is the main limitation of VLOOKUP function?
The lookup value should be at the most left side column in the table array. VLOOKUP only looks right. It cannot look right to left.
3. Does VLOOKUP look up case-sensitive values?
No, it is not case-sensitive. The text 'ram' and 'RAM' is identical for VLOOKUP.
4. 2 ways to extract unique values in excel
Use Advanced Filter option (shortcut key : ALT D F A) and 'Remove Duplicates' option under Data tab.
5. How to find duplicates in a column?
Use CONDITIONAL FORMATING to highlight duplicate values. OR use COUNTIF function as shown below. For example, values are stored in cells D4:D7.
=COUNTIF(D4:D7,D4)Apply filter on the column wherein you applied COUNTIF function and select values greater than 1.
6. How to insert a drop down?
Go to Data tab >> Select Data Validation. Another way to insert a drop down is to enable Developer tab and Insert Combo box.
7. How to sum values based on some conditions?
Use SUMIF or SUMPRODUCT functions. The SUMIF function is explained below -
=SUMIF(range, criteria, sum_range)
=SUMIF(A2:A5,"A",B2:B5)
Excel : SUMIF Function |
8. How to create cross tabulation in Excel?
Use Pivot Table and select one variable in Row label and the other variable in Column label.
9. What is Excel Array Formula?
Excel Array Formula Explained
10. How to extract First Name from a full name?
Suppose you need to pull 'Neha' from 'Neha Sharma'. Use MID and FIND functions.
=MID (A2,1,FIND(" ",A2)-1)Tutorial : Practical Uses of MID Function
11. How Index and Match Function works?
Index function returns a value from a range based on row number.
= INDEX(range, row_number)See the image below -
Excel : Index Function |
Match function returns the relative position of a value in range.
= MATCH(lookup_value, range, match_type)match_type can be exact match, largest/smallest value that is less than or greater than equal to lookup_value.
Excel : Match Function |
12. How Index and Match Function works together?
=INDEX(range, MATCH(lookup_value, lookup_range, match_type))Suppose information of Product and Sales are stored in columns A and B. You need to look for product against sales value so you need to tell EXCEL to look from right to left as sales value is placed in the right hand side of the range/table.
=INDEX(A2:A5,MATCH(45,B2:B5,0))
Nested INDEX MATCH Excel Functions |
Statistics Interview Questions
The following questions touch upon some basics and intermediate statistics topics. These topics are generally taught in undergraduate / graduate courses.
1. What is p-value?
It is the lowest level of significance at which you can reject the null hypothesis. If p-value < 0.05, you reject the null hypothesis at 5% level of significance.
2. Difference between MEAN. MEDIAN, MODE
Mean is calculated by summing all the values divided by number of observations. Median is the middle value. And Mode is the most occurring value.
3. In which data types MEAN, MEDIAN and MODE are more suitable?
MEAN is suitable for continuous data with no outliers. It is affected by extreme values (Outliers).
MEDIAN is suitable for continuous data with outliers or ordinal data. Mode is suitable for categorical data (including both nominal and ordinal data).
4. Different Types of Sampling Techniques?
There are following four main types of sampling techniques.
- Simple random sampling
- Stratified sampling
- Cluster sampling
- Systematic sampling
5. Difference between Cluster and Stratified Sampling?
The main difference between cluster and stratified sampling is that in stratified sampling all the strata need to be sampled. In cluster sampling one proceeds by first selecting a number of clusters at random and then sampling each cluster or conduct a census of each cluster. But usually not all clusters would be included.
6. When should we use T-test than Z-test?
Theoretically, we should use T-test when sample size (N) is less than 30. Practically, we always use t-test. It is because t-test and z test are equivalent as N tends to infinity.
7. What is the difference between R-square and Adjusted R-square?
Check out this link - R-square vs. Adjusted R-square
8. How to detect outliers?
Box Plot Method - If a value is higher than the 1.5*IQR above the upper quartile (Q3), the value will be considered as outlier. Similarly, if a value is lower than the 1.5*IQR below the lower quartile (Q1), the value will be considered as outlier.
Standard Deviation Method - If a value is higher than the mean plus or minus three Standard Deviation is considered as outlier.
Ways to detect Outliers with SAS
9. Define Homoscedasticity?
In a linear regression model, there should be homogeneity of variance of the residuals. In other words, the variance of residuals are approximately equal for all predicted dependent variable values.
Check Homoscedasticity with SAS
10. Difference between Standardized and Unstandardized Coefficients?
To calculate Standardized Coefficients, first we need to standardize both dependent and independent variables and use the standardized variables in the regression model to get standardized estimates. By 'standardize', it implies subtracting mean from each observation and divide that by the standard deviation. The standardized coefficient is interpreted in terms of standard deviation. Whereas, unstandardized coefficient is measured in actual units.
11. Difference between Factor Analysis and Principal Component Analysis?
Both the analysis are very much similar but they are different in terms of calculation and their practical usage :
- In Principal Components Analysis, the components are calculated as linear combinations of the raw input variables. In Factor Analysis, the raw input variables are defined as linear combinations of the factors.
- The main idea of using PCA is to explain as much of the total variance in the variables as possible. Whereas, the factor analysis explains the covariances or correlations between the variables.
- PCA is used when we need to reduce the number of variables (dimensionality reduction) whereas FA is used when we need to group variables into some factors.
12. Difference between Linear and Logistic Regression?
There are more than 10 differences between these two algorithms. Check out the link below -
Linear vs. Logistic Regression
13. How to statistically compare means between groups?
Use Independent T-test when a continuous variable and a categorical variable having two independent categories.
Use Paired T-test when a continuous variable and a categorical variable having two dependent or paired categories.
Use one way ANOVA when a continuous variable and a categorical variable having more than two independent categories.
Use GLM Repeated Measures when a continuous variable and a categorical variable more than two dependent categories.
14. Explain eigenvalues and eigenvectors intuitively
Eigenvalues are variances explained by principal components. By 'variances', i mean the diagonal values of the covariance matrix below -
x | y | z |
---|---|---|
1.34 | -0.16 | 0.19 |
-0.16 | 0.62 | -0.13 |
0.19 | -0.13 | 1.49 |
The sum of the diagonal values is 3.45.
Why eigenvalue greater than 1 is considered to retain components? It is because the average eigenvalue will be 1, so > 1 implies higher than average.
Eigenvectors are the coefficients of orthogonal (uncorrelated) transformation of variables into principal components.
Python Interview Questions
Here are the questions that would help you to prepare for Python interview round.
Q. What are the main data structures in pandas, and how are they different from each other?
Answer: In pandas, there are two primary data structures: Series and DataFrame. A Series is like a one-dimensional list, while a DataFrame is like a two-dimensional table. The big difference is that a DataFrame can store multiple Series, which makes it great for tables.
Q. How can you pick specific rows and columns from a DataFrame in pandas?
Answer: You can use .loc[]
or .iloc[]
to choose particular rows and columns from a DataFrame. .loc[]
works with labels (like names), and .iloc[]
works with numbers (like positions).
Q. Explain what the groupby() function in pandas does and when you might use it.
Answer: The groupby()
function in pandas is used to group rows in a DataFrame based on one or more columns. It's often used to summarize data. For instance, you can use groupby()
to group sales data by product category and calculate the total sales for each category.
Q. How do you deal with duplicates and missing values in a pandas DataFrame?
Answer: To handle duplicates, you can use .drop_duplicates()
to remove repeated rows. For missing values, you can use methods like .dropna()
to get rid of rows or columns with missing data and .fillna()
to fill in missing values with specific ones.
Q. Explain the key features of NumPy arrays and how they are different from Python lists.
Answer: NumPy arrays are like lists, but they are tidier and can have many dimensions. Python lists can hold different types of things and are only one-dimensional. NumPy arrays also make math stuff faster and easier.
Q. How do you do element-wise and math operations on NumPy arrays?
Answer: To do element-wise operations, just use +, -, *, and / with NumPy arrays. NumPy also has lots of math functions like np.mean()
and np.sum()
for doing math on arrays.
Q. What's the point of the np.random
module in NumPy, and how do you make random numbers with it?
Answer: np.random
in NumPy helps you make random numbers and data. You can use np.random.rand()
to get random values from a uniform range or np.random.randn()
for random values from a normal range.
Q. How do you manage missing data in a pandas DataFrame?
Answer: You can handle missing data in pandas using methods like isna()
, fillna()
, or dropna()
. If you want to replace missing values, use fillna()
, and if you want to remove rows or columns with missing data, use dropna()
.
Q. What are the different data structures in Python?
Answer: The main data structures in Python are lists, tuples, dictionaries, and sets. Lists are ordered collections of stuff, tuples are unchangeable lists, dictionaries are like maps from keys to values, and sets are groups of unique stuff without order.
Q. What is a lambda function in Python?
Answer: A lambda function is a tiny, unnamed function in Python. It's used for simple tasks and is often written directly where it's needed. See the example below :
add = lambda x, y: x + y result = add(5, 3) # Result : 8
SAS Interview Questions
The following questions would help you to prepare for SAS interview round.
1. Difference between WHERE and IF statement?- WHERE statement can be used in procedures to subset data while IF statement cannot be used in procedures.
- WHERE can be used as a data set option while IF cannot be used as a data set option.
- WHERE statement is more efficient than IF statement. It tells SAS not to read all observations from the data set
- WHERE statement can not be used when reading data using INPUT statement whereas IF statement can be used.
- When it is required to use newly created variables, use IF statement as it doesn't require variables to exist in the READIN data set.
2. How PROC MEANS works?
PROC MEANS DATA = dataset_name;Detailed Explanation : PROC MEANS
VAR analysis_variable;
CLASS grouping_variable;
RUN;
3. Difference between INFORMAT and FORMAT?
Informat is used to read data whereas Format is used to write or display data.
4. Difference between NODUPKEY and NODUP in PROC SORT?
The NODUPKEY option removes duplicate observations where value of a variable listed in BY statement is repeated while NODUP option removes duplicate observations where values in all the variables are repeated.
5. How many maximum characters SAS library name can take?
A valid library name must start with an alphabet and cannot have more than 8 characters.
6. Which is more faster - Proc SQL or SAS data step?
The SQL procedure performed better with the smaller datasets whereas the data step performed better with the larger datasets (more than approx. 100 MB).
7. Two main advantages of Proc SQL Joins over Data Step Merging?
- Proc SQL JOINS do not require variables to be sorted prior to joining them whereas Data Step Merging requires.
- Proc SQL works perfectly when key variables have different names
8. What would happen if i don't use 'BY statement in MERGE?
Without 'BY' statement, SAS will merge the 1st observation from dataset A and 1st observation from dataset B to form the 1st observation of the final dataset. It might lead to meaningless results.
9. What are the ways to create a macro variable?
There are 5 ways to create macro variables:
- %Let
- Iterative %DO statement
- Call Symput
- Proc SQl into clause
- Macro Parameters
10. How to rename columns with PROC SQL?
Use AS alias.
Proc SQL;
select name as fullname from table1;
quit;
11. How to calculate percentile values with SAS?
We can use PROC MEANS or PROC UNIVARIATE to calculate percentile values. For example, specify options P10, P90 to calculate 10th and 90th percentile score. PROC MEANS cannot calculate custom percentile values such as 97.5th or 99.5th percentile. To calculate these custom percentile values, you can use PCTLPTS= option in PROC UNIVARIATE.
Tutorial : PROC UNIVARIATE
12. How to replace missing values of all the numeric variables to 0 in a single run?
We can use _numeric_ to specify all the numeric variables and dim functions in array to count the number of numeric variables.
data temp;
set sampledata;
array Q(*) _numeric_;
do i= 1 to dim(Q);
if Q(i) = . then Q(i)= 0;
end;
run;
SQL Interview Questions
1. How to write conditional statements (IF ELSE) in SQL?In SQL, it is possible with CASE WHEN statements.
select
case when sex='M' then 1 else 0 end as males
, case when sex='F' then 1 else 0 end as females
from sashelp.class;
quit;
2. What are the common SQL data types ?
Data | Type Format |
---|---|
Numeric | NUMBER(L,D), INTEGER,SMALLINT,DECIMAL(L,D) |
Character | CHAR(L),VARCHAR(L) |
Date | DATE |
3. How to subset or filter data in SQL?
We can use WHERE clause to subset or filter data.
SELECT *
FROM PRODUCT
WHERE SALES > 200
4. Difference between WHERE and HAVING clauses
The HAVING clause comes into effect when GROUP BY is used. It runs after GROUP BY so it filters grouping variable. Whereas WHERE clause runs prior to GROUP BY clause so it does not filter a grouping variable.
5. Difference between Full Join and Cross Join?
A full join keeps all rows from both of the input tables even if we cannot find a matching row.
Cross Join returns cartesian product of tables. It matches every row of one table with every row of another table.
6. Difference between UNION and UNION ALL
The main use of UNION and UNION ALL is to join two tables. The main difference between them is that UNION removes duplicate records and UNION ALL keeps the duplicate records. By 'duplicate records', all the values of two or more rows are same.
7. How to create a blank table
Method I :
The following method creates a new table called temp2 with the same column names and attributes of table temp.
CREATE TABLE TEMP2 LIKE TEMP;Method II :
In this case, we are creating a blank table by subsetting data method. As 1 is not equal to 2, it returns zero row.
create table temp3 as
select * from temp
where 1=2;
8. What will be the result of the query below?
select case when null = null then 'Yes' else 'No' end as Result;It will return NO as the above code is not right way to compare null values. The correct way would be to use 'is' keyword to compare -
select case when null is null then 'Yes' else 'No' end as Result;
9. Suppose you have a table named TEMP. You need to recode values of column Y, Swap values 2 and 3 in column Y
Table |
UPDATE TEMP
SET Y= CASE WHEN Y = 2 THEN 3
WHEN Y = 3 THEN 2
ELSE Y END;
10. Identify second maximum value
select max(y) from tempIn this code, the logic is to remove the maximum value from the main table and then calculate the max value to figure out the second max value of the main table.
where y not in (select max(y) from temp);
11. Identify second maximum value by a group
The code below first removes all the max values by a group from the main table. Later we calculated the second max value by a group.
select a.x, max(a.y) as maxy from temp a left join
(select x, max(y) as maxy from temp group by 1) b
on a.x = b. x and a.y = b.maxy
where b.x is null and b.maxy is null
group by 1;
12. Is the query below correct? If not, what's the issue?
SELECT custid, YEAR(ref_date) AS ref_yearThe calculated column cannot be used in WHERE condition so we need to modify the above code like this -
FROM custmart
WHERE ref_year >= 2015;
SELECT custid, YEAR(ref_date) AS ref_year
FROM custmart
WHERE YEAR(ref_date) >= 2015;
HR / Project Related Questions
The following is a list of common questions asked by recruiters.
1. Explain one of your project- Start from problem definition
- Explain Data Cleaning, Exploration and Data Preparation Steps
- What technique / algorithm is used in the project?
- Financial (Dollar) value impact of the project
2. What are your strengths and weaknesses?
3. Why are you leaving the current organization?
4. Why should we hire you?
5. Where do you see yourself five years from now?
6. What was the toughest decision you ever had to make?
The above list of questions would assist you in preparing for interviews for roles of senior / lead data analyst. Don't just mug up answers, understand the concepts of the topics covered in these questions.
Excellent blog
ReplyDeletePlease add some more question related to excel and sas base
ReplyDeleteextremely helpful and up to the mark blog. great job
ReplyDelete3rd Answer is wrong. It should be 'Yes' cos its case sensitive.
ReplyDeleteIt is not case-sensitive. can you try it once using excel?
DeleteFan Ho gaya sir me ap ka!!
ReplyDeleteThanks alot, its helpful in like preparing on exam night.
ReplyDeleteQuestion 1:
ReplyDeleteUse shoes dataset and Create sortedshoes in work from sashelp library
Sort the data with “product” variable in descending and then with “sales” variable in Ascending order
Now answer the following questions :
What is the value under product variable at 148th observation
What is the value under Region variable at 130th observation