10 Essential 数据分析面试问题 *

最好的数据分析师可以回答的全面来源的基本问题. 在我们社区的推动下，我们鼓励专家提交问题并提供反馈.

现在就雇佣一名顶级数据分析师

是顶级自由软件开发人员的专属网络吗, 设计师, 金融专家, 产品经理, 和 project managers in the world. 顶级公司雇佣Toptal自由职业者来完成他们最重要的项目.

面试问题

What techniques can be used to h和le missing data?

查看答案

There are plenty of alternatives to h和le missing data, although none of them is perfect or fits all cases. 其中一些是:

删除不完整行: Simplest of all, it can be used if the amount of missing data is small 和 seemingly r和om.
把变量: This can be used if the proportion of missing data in a feature is too big 和 the feature is of little significance to the analysis. 一般来说，应该避免使用它，因为它通常会丢掉太多的信息.
Considering “not available” (NA) to be a value: Sometimes missing information is information in itself. 取决于问题域, missing values are sometimes non-r和om: Instead, they’re a byproduct of some underlying pattern.
价值归责: This is the process of estimating the value of a missing field given other information from the sample. There are various viable kinds of imputation. Some examples are mean/mode/median imputation, KNN, regression models, 和 multiple imputations.

什么是数据验证?

查看答案

在任何面向数据的进程中，“垃圾输入，垃圾输出”问题总是可能出现的. 为了减轻它, 我们利用数据验证, a process composed of a set of rules to ensure that the data reaches a minimum quality st和ard. A couple of examples of validation checks are:

数据类型验证: Checks whether the data is of the expected type (eg. integer, string) 和 conforms to the expected format.
Range 和 constraint validation: Checks if the observed values fall within a valid range. 例如, temperature values must be above absolute zero (or likely a higher minimum depending on the operating range of the equipment being used to record them.)

线性回归和逻辑回归有什么区别?

查看答案

Linear regression is a statistical model that, 给定一组输入特征, 尝试拟合最好的直线(或超平面), (一般情况下)在自变量和因变量之间. 自 its output is continuous 和 its cost function measures the distance from the observed to the predicted values, it is an appropriate choice to solve regression problems (e.g. 预测销售数字).

逻辑回归, 另一方面, 输出一个概率, 根据定义，哪个是0到1之间的有界值, due to the sigmoid activation function. 因此，最适合解决分类问题(如.g. 来预测一个给定的交易是否欺诈).

Apply to Join Toptal's Development Network

和 enjoy reliable, steady, remote 自由数据分析师职位

申请成为自由职业者

什么是模型外推? 它的陷阱是什么??

查看答案

Model extrapolation is defined as estimating beyond a previously observed data range to establish the relationships between variables.

外推法的主要问题在于，它充其量只是一种有根据的猜测. 自 it has no data to support it, 通常不可能声称观察到的关系仍然成立. A relationship that looks linear in a given range might actually be non-linear when outside of range.

What is data leakage in the context of data analysis? 这会产生什么问题呢? Which strategies can be applied to avoid it?

查看答案

Data leakage is the process of training a statistical model with information that would be actually unavailable when using the model to make predictions.

Data leakage makes the results during model training 和 validation much better than what is observed when the model is deployed, generating too optimistic estimates, possibly leading to an entirely invalid predictive model.

There is no single recipe to eliminate data leakage, but some practices are helpful to avoid them:

Don’t use future data to make predictions of the past. 虽然明显, it’s a very common mistake when validating models, especially when using cross-validation. When training on time-series data, always make sure to use an appropriate validation strategy.
Prepare the data within cross-validation folds. Another common mistake is to make data preparations, like normalization or outlier removal on the whole dataset, prior to splitting the dataset to validate the model, 哪个是信息泄露.
调查id. It’s easy to dismiss IDs as r和omly generated values, 但有时它们会编码目标变量的信息. 如果它们是漏的，最好从任何类型的模型中移除它们.

一个零售连锁店的老板从他的商店收集了10年的购买历史数据. The data dictionary is shown below:

功能	描述
事务ID	唯一事务ID. Must appear just once in the dataset
存储ID	唯一的存储ID. May appear more than once in the dataset
客户机ID	唯一客户端ID. May appear more than once in the dataset
项ID	唯一的项目ID. May appear more than once in the dataset
项目数量	Number of items bought together
物品的价格	单件价格
购买日期和时间	购买时间戳
支付方式	下列之一:现金、信用卡、借记卡或凭证

What kind of information or analyses could we leverage that might generate value to the business? 假设每笔交易代表购买一种类型的物品.

查看答案

答案不是封闭的，而是取决于以前的经验和领域专业知识. The goal is not to get every single item right, but to showcase critical thinking 和 domain knowledge.

对于这种情况，可以探索的一些可能路径是:

Determine which are the most popular items sold
Explore how much is spent per transaction
Find which clients spend the most
Find the most recurrent clients
Uncover seasonalities 和 trends

All of the above can be analyzed over the whole dataset or by region, by store or by time frame. The analyses could be further enriched with store, client 和 item information, if available.

The information uncovered could be used to:

Better scale inventory sizing 和 the number of on-site employees using time-series forecasting.
Perform direct marketing to the most profitable clients, 哪些可以通过聚类技术识别.
通过将可能一起购买的商品分组，提高商品在商店中的定位, which could be identified through market basket analysis. Recommender systems could also be applied.

数据分析过程通常由这些步骤组成?

查看答案

Finding a relevant business problem to solve: 常被忽视, this is the most important step of the process, 因为产生业务价值是任何数据分析师的最终目标. Having a clear objective 和 restricting the data space to be explored is paramount to avoiding wasting resources. 自 it requires deep knowledge of the problem domain, 此步骤可能由数据分析人员以外的领域专家执行.
数据提取: The next step is to collect data for analysis. It could be as simple as loading a CSV file, 但通常情况下，它涉及到从多种来源和格式收集数据.
数据清理: 数据收集完成后，需要准备数据集进行处理. Likely the most time-consuming step, data cleansing can include h和ling missing fields, 破坏数据, 离群值, 重复的项.
数据探索: 在考虑数据分析时，这通常会出现在脑海中. Data exploration involves generating statistics, 特性, 以及数据的可视化，以更好地理解其潜在模式. 然后，这会产生可能产生业务价值的见解.
Data modeling 和 model validation (optional): 培训一名统计或机器学习模型并不总是必需的, as a data analyst usually generates value through insights found in the data exploration step, but it may uncover additional information. 易于解释的模型, like linear or tree-based models, 和 clustering techniques often expose patterns that would be otherwise difficult to detect with data visualization alone.
讲故事: This last step encompasses every bit of information uncovered previously to finally present a solution to—or at least a path to continue exploring—the business problem proposed in the first step. It’s all about being able to clearly communicate findings to stakeholders 和 convincing them to take a course of action that will lead to creating business value.

These are the most common steps of data analysis. Although they have been presented as a list, more often than not they are not executed sequentially 和 some steps may require several iterations as new data sources are added 和 information is uncovered.

What is the difference between correlation 和 causation? 我们如何推断后者呢?

查看答案

Correlation is a statistic that measures the strength 和 direction of the associations between two or more variables.

另一方面，因果关系是一种描述因果关系的关系.

“Correlation does not imply causation” is a famous quote that warns us about the dangers of the very common practice of looking at a strong correlation 和 assuming causality. 在下列情况下，强相关性可能没有因果关系:

隐藏变量: 影响两个感兴趣变量的未观察变量, causing them to exhibit a strong correlation, even when there is no direct relationship between them.
混杂变量: A confounding variable is one that cannot be isolated from one or more of the variables of interest. Therefore we cannot explain if the result observed is caused by the variation of the variable of interest or of the confounding variable.
伪相关: 有时由于巧合, 即使没有合理的逻辑关系，变量也可以相互关联.

Causation is tricky to be inferred. 最常见的解决方法是进行随机实验, 在那里作为候选原因的变量被隔离和测试. 不幸的是, 在许多领域进行这样的实验是不切实际的或不可行的, 因此，运用逻辑和领域知识对于得出合理的结论至关重要.

是什么精度和回忆? 在哪些情况下使用它们?

查看答案

精确率和召回率是衡量分类性能的指标, 每个都有自己的标准, 由下式给出:

\[\text{Precision} = \frac{TP}{TP+FP}\] \[\text{Recall} = \frac{TP}{TP+FN}\]

地点:

TP =真阳性
FP =假阳性
FN =假阴性

换句话说, 精度 is the ratio of correctly classified positive cases over all cases predicted as positive, 召回率是正确分类的阳性病例与所有阳性病例的比率.

当假阳性的代价很高时，精度是一个合适的度量.g. email spam classification), while 回忆 is appropriate when the cost of a false negative is high (e.g. 欺诈检测).

两者也经常以f1分数的形式一起使用，其定义为:

\[\text{F1} = 2*\frac{\text{Precision} * \text{Recall}}{\text{Precision}+\text{Recall}}\]

The F1-score balances both 精度和回忆, 因此，对于高度不平衡的数据集，这是一个很好的分类性能度量.

10.

我们如何在单个图表中可视化超过三个维度的数据?

查看答案

通常, 数据通过使用图像中的位置(高度)的图表直观地表示, 宽度, 和深度). Going beyond three dimensions, we need to make use of other visual cues to add more information. 最常见的有:

Color一种视觉上吸引人且直观的方式来描述连续和分类数据.
大小: Marker 大小 is also used to represent continuous data. Could be applied for categorical data as well, 但由于尺寸差异比颜色更难察觉, 对于这种类型的数据，它不是最合适的选择.
形状最后，我们有形状，这是一个有效的方式来表示不同的类.

结合以上所有，我们可以可视化到六个维度, though one could argue that cramming so much information in a single chart does not make for a very effective visualization.

Another possibility is to make an 动画图表，这是非常有用的描述变化随着时间的推移:

面试不仅仅是棘手的技术问题, so these are intended merely as a guide. 并不是每一个值得雇佣的“A”候选人都能回答所有的问题, nor does answering them all guarantee an “A” c和idate. 一天结束的时候， hiring remains an art, a science — 和 a lot of work.

为什么Toptal

提出面试问题

提交的问题和答案将被审查和编辑, 和 may or may not be selected for posting, at the sole discretion of Toptal, 有限责任公司.

寻找数据分析师?

寻找数据分析师? Check out Toptal’s data analysts.

欧博体育app下载

视图奥利弗

奥利弗·霍洛威学院

自由数据分析师

联合王国Toptal成员自 2016年5月10日

Oliver is a versatile data scientist 和 software engineer combining over a decade of experience 和 a postgraduate mathematics degree from Oxford. Career assignments have ranged from building 机器学习 solutions for startups to leading project teams 和 h和ling vast amounts of data at Goldman Sachs. 有了这样的背景, he is adept at picking up new skills quickly to deliver robust solutions to the most dem和ing of businesses.

欧博体育app下载

视图克里斯托弗

克里斯托弗Karvetski

自由数据分析师

美国Toptal成员自 2016年8月24日

Dr. Karvetski作为一名数据和决策科学家有十年的经验. 他曾在学术界和工业界的各种团队和客户环境中工作, 和 has been recognized as an excellent communicator. 他喜欢与团队合作，构思和部署新颖的数据科学解决方案. 他精通R、SQL、MATLAB、情景应用程序和其他数据科学平台.

欧博体育app下载

视图蕾妮

蕾妮Ahel

自由数据分析师

克罗地亚Toptal成员自 2020年6月18日

Renee is a data scientist with over 12 years of experience, 和 five years as a full-stack software engineer. 超过12年, he has worked in international environments, with English or German as a working language. This includes four years working remotely for German 和 Austrian client companies 和 nine months working remotely as a member of the Deutsche Telekom international analytics team.

Toptal连接排名前3% of Freelance Talent All Over The World.

加入Toptal社区.

了解更多

10 Essential 数据分析 面试问题 *

最好的数据分析师可以回答的全面来源的基本问题. 在我们社区的推动下，我们鼓励专家提交问题并提供反馈.

面试问题

为什么Toptal

提出面试问题

寻找数据分析师?

奥利弗·霍洛威学院

克里斯托弗Karvetski

蕾妮Ahel

10 Essential 数据分析面试问题 *