
Discussion on the Method for Testing and Treating Outliers


§ Undergraduate students, class of 2016

Funding: National Fund for Fostering Talents in Basic Science, J1310024

Abstract

This paper compares how widely used analytical chemistry textbooks treat outliers and explains why outlier detection matters. Several common outlier tests are introduced and compared, and methods for treating confirmed outliers are also compared and discussed. Each method has its own advantages and disadvantages, so a single method, or a combination of several, should be chosen according to the actual situation to achieve better detection and treatment of outliers.

Keywords: Outlier; Methods for testing outliers; Treatment of outliers

ZHU Jiaxin, BAO Yutian, LI Zhao. Discussion on the Method for Testing and Treating Outliers. University Chemistry[J], 2018, 33(8): 58-65 doi:10.3866/PKU.DXHX201802008

1.1 Standard deviation known

$R_n = \frac{\left| x_{\mathrm{out}} - \overline{x} \right|}{\sigma}$
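The statistic above can be sketched in a few lines of Python. This is a minimal illustration, assuming that σ is known from prior work and that the suspect value is the observation farthest from the mean; the function name `r_statistic` and the data are hypothetical, and the critical value for the chosen n and confidence level still has to be read from a table.

```python
from statistics import fmean

def r_statistic(data, sigma):
    """Return (suspect value, R_n) with R_n = |x_out - mean| / sigma."""
    mean = fmean(data)
    # Assumption: the suspect value is the observation farthest from the mean.
    x_out = max(data, key=lambda x: abs(x - mean))
    return x_out, abs(x_out - mean) / sigma

data = [10.1, 10.2, 10.0, 10.3, 11.5]        # hypothetical measurements
x_out, r_n = r_statistic(data, sigma=0.2)     # sigma assumed known in advance
print(x_out, round(r_n, 2))
```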

1.2.1 Pauta criterion (3s rule)

$\left| x_{\mathrm{out}} - \overline{x} \right| > 3s$
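A minimal sketch of the 3s rule follows, assuming that the mean and the sample standard deviation s are computed from all values, including the suspect one. The function name `pauta_outliers` and the data are hypothetical.

```python
from statistics import fmean, stdev

def pauta_outliers(data):
    """Return the values whose deviation from the mean exceeds 3s."""
    mean = fmean(data)
    s = stdev(data)                       # sample standard deviation (n - 1)
    return [x for x in data if abs(x - mean) > 3 * s]

large = [10.0] * 19 + [20.0]              # hypothetical data, n = 20
small = [10.0] * 9 + [20.0]               # hypothetical data, n = 10
print(pauta_outliers(large))
print(pauta_outliers(small))
```

The second call returns an empty list even though 20.0 is far from the other values. When s is computed from the same data, no single deviation can exceed (n − 1)s/√n, which is below 3s for n ≤ 10, so the rule cannot reject anything in small samples.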

1.2.2 4d test

$\left| x_{\mathrm{out}} - \overline{x} \right| > 4\overline{d}$
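The 4d criterion can be sketched as follows. This sketch assumes, as is common in textbook presentations, that the mean and the mean deviation are computed from the remaining values with the suspect value set aside; that convention is not stated in the formula above. The function name `four_d_test` and the data are hypothetical.

```python
from statistics import fmean

def four_d_test(data, suspect):
    """Return True if `suspect` lies more than 4 mean deviations from the mean of the rest."""
    rest = list(data)
    rest.remove(suspect)                  # assumption: exclude the suspect value
    mean = fmean(rest)
    d_bar = fmean(abs(x - mean) for x in rest)   # mean (absolute) deviation
    return abs(suspect - mean) > 4 * d_bar

data = [1.25, 1.27, 1.31, 1.40]           # hypothetical measurements
print(four_d_test(data, 1.40))
```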

1.2.3 Chauvenet's criterion

$\omega_n = \frac{\left| x_{\mathrm{out}} - \overline{x} \right|}{s}$
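The sketch below computes ω_n and a critical value for it. It assumes the usual formulation of Chauvenet's criterion, in which a value is rejected when the expected number of equally extreme observations among n measurements falls below 1/2, so that the critical value is the normal quantile at 1 − 1/(4n). Tabulated values in a given textbook may be rounded differently. The function names and data are hypothetical.

```python
from statistics import NormalDist, fmean, stdev

def chauvenet_critical(n):
    """Critical omega_n: z-value whose two-sided tail probability is 1/(2n)."""
    return NormalDist().inv_cdf(1 - 1 / (4 * n))

def chauvenet_test(data):
    """Return (suspect value, omega_n, rejected)."""
    mean, s = fmean(data), stdev(data)
    x_out = max(data, key=lambda x: abs(x - mean))
    omega = abs(x_out - mean) / s
    return x_out, omega, omega > chauvenet_critical(len(data))

data = [10.0] * 9 + [20.0]                # hypothetical measurements, n = 10
print(chauvenet_test(data))
print(round(chauvenet_critical(10), 2))
```

Because the critical value is a function of n alone, the implied confidence level changes with the sample size rather than being chosen by the analyst.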

1.2.4 t test

$k_n = \frac{\left| x_{\mathrm{out}} - \overline{x} \right|}{s}$

1.2.5 Grubbs' test

$G_n = \frac{\left| x_{\mathrm{out}} - \overline{x} \right|}{s}$
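A minimal sketch of the Grubbs statistic follows, assuming that the mean and s are computed from the full sample and that the critical value is supplied by the caller from a table. The function name `grubbs_test` and the data are hypothetical. The value 1.46 used below is the figure commonly tabulated for n = 4 at the 95% level; it is an assumption here and should be checked against the table actually in use.

```python
from statistics import fmean, stdev

def grubbs_test(data, g_crit):
    """Return (suspect value, G_n, rejected) for a caller-supplied critical value."""
    mean, s = fmean(data), stdev(data)    # computed from all values
    x_out = max(data, key=lambda x: abs(x - mean))
    g = abs(x_out - mean) / s
    return x_out, g, g > g_crit

data = [1.25, 1.27, 1.31, 1.40]           # hypothetical measurements
x_out, g, rejected = grubbs_test(data, g_crit=1.46)   # assumed table value
print(x_out, round(g, 2), rejected)
```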

1.2.6 Dixon's test (sample size 3 ≤ n ≤ 30)

| Sample size n | Suspect value is $x_n$ | Suspect value is $x_1$ |
|---|---|---|
| 3–7 | $r_{10} = \frac{x_n - x_{n-1}}{x_n - x_1}$ | $r'_{10} = \frac{x_2 - x_1}{x_n - x_1}$ |
| 8–10 | $r_{11} = \frac{x_n - x_{n-1}}{x_n - x_2}$ | $r'_{11} = \frac{x_2 - x_1}{x_{n-1} - x_1}$ |
| 11–13 | $r_{21} = \frac{x_n - x_{n-2}}{x_n - x_2}$ | $r'_{21} = \frac{x_3 - x_1}{x_{n-1} - x_1}$ |
| 14–30 | $r_{22} = \frac{x_n - x_{n-2}}{x_n - x_3}$ | $r'_{22} = \frac{x_3 - x_1}{x_{n-2} - x_1}$ |
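The choice of ratio by sample size in the table above can be sketched as a single function. It assumes the data are sorted in ascending order so that x_1 is the smallest and x_n the largest value; the function name `dixon_ratios` and the data are hypothetical, and the critical values are not reproduced here.

```python
def dixon_ratios(data):
    """Return (r_high, r_low): Dixon ratios for testing x_n and x_1."""
    x = sorted(data)
    n = len(x)
    if not 3 <= n <= 30:
        raise ValueError("ratios are defined here only for 3 <= n <= 30")
    # (j, k) selects r_jk: j is the gap to the neighbour in the numerator,
    # k is the number of values trimmed from the far end in the denominator.
    if n <= 7:
        j, k = 1, 0      # r10
    elif n <= 10:
        j, k = 1, 1      # r11
    elif n <= 13:
        j, k = 2, 1      # r21
    else:
        j, k = 2, 2      # r22
    r_high = (x[-1] - x[-1 - j]) / (x[-1] - x[k])
    r_low = (x[j] - x[0]) / (x[-1 - k] - x[0])
    return r_high, r_low

print(dixon_ratios([1, 2, 3, 4, 5, 6, 7, 20]))   # hypothetical data, n = 8
```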

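1.2.7 Q test

After proposing the test described in 1.2.6, Dixon collaborated with Dean in 1951 to propose a simplified outlier test for small samples (n < 10) [9], the well-known Q test (Dixon's Q test). This method has long been widely adopted in analytical chemistry textbooks in China and abroad. The statistic Q is very simple to compute: the difference between the suspect value and its nearest neighbor, $(x_n - x_{n-1})$ or $(x_2 - x_1)$, is divided by the range $(x_n - x_1)$:

$Q_1 = \frac{x_2 - x_1}{x_n - x_1} \quad \text{or} \quad Q_n = \frac{x_n - x_{n-1}}{x_n - x_1}$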
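A minimal sketch of the Q test follows. It assumes that only the more extreme end of the sorted data is tested and that the critical value is supplied by the caller; the function name `q_test` and the data are hypothetical. The value 0.76 used below is the figure commonly tabulated for n = 4 at 90% confidence; it is an assumption here and should be checked against the table in use.

```python
def q_test(data, q_crit):
    """Return (suspect value, Q, rejected) for the more extreme end of the data."""
    x = sorted(data)
    spread = x[-1] - x[0]                 # range x_n - x_1
    q_1 = (x[1] - x[0]) / spread          # suspect value is x_1
    q_n = (x[-1] - x[-2]) / spread        # suspect value is x_n
    suspect, q = (x[0], q_1) if q_1 > q_n else (x[-1], q_n)
    return suspect, q, q > q_crit

data = [1.25, 1.27, 1.31, 1.40]           # hypothetical measurements
print(q_test(data, q_crit=0.76))          # assumed table value for n = 4
```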

1.3.1 Skewness-kurtosis test

$b_s = \frac{\sqrt{n} \sum\limits_{i=1}^{n} (x_i - \bar{x})^3}{\left[ \sum\limits_{i=1}^{n} (x_i - \bar{x})^2 \right]^{3/2}} = \frac{\sqrt{n} \left[ \sum\limits_{i=1}^{n} x_i^3 - 3\bar{x} \sum\limits_{i=1}^{n} x_i^2 + 2n\bar{x}^3 \right]}{\left[ \sum\limits_{i=1}^{n} x_i^2 - n\bar{x}^2 \right]^{3/2}}$

$b_k = \frac{n \sum\limits_{i=1}^{n} (x_i - \bar{x})^4}{\left[ \sum\limits_{i=1}^{n} (x_i - \bar{x})^2 \right]^2} = \frac{n \left[ \sum\limits_{i=1}^{n} x_i^4 - 4\bar{x} \sum\limits_{i=1}^{n} x_i^3 + 6\bar{x}^2 \sum\limits_{i=1}^{n} x_i^2 - 3n\bar{x}^4 \right]}{\left[ \sum\limits_{i=1}^{n} x_i^2 - n\bar{x}^2 \right]^2}$
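The two statistics can be computed directly from their definitions. The sketch below uses the central-moment form (the left-hand expression in each formula), which avoids the cancellation that the expanded sums can suffer when the values are large relative to their spread. The function name `skewness_kurtosis` and the data are hypothetical, and the critical values for b_s and b_k are not reproduced here.

```python
from math import sqrt
from statistics import fmean

def skewness_kurtosis(data):
    """Return (b_s, b_k) computed from sums of powers of deviations from the mean."""
    n = len(data)
    mean = fmean(data)
    s2 = sum((x - mean) ** 2 for x in data)
    s3 = sum((x - mean) ** 3 for x in data)
    s4 = sum((x - mean) ** 4 for x in data)
    b_s = sqrt(n) * s3 / s2 ** 1.5
    b_k = n * s4 / s2 ** 2
    return b_s, b_k

print(skewness_kurtosis([1, 2, 3, 4, 5]))     # symmetric hypothetical data
print(skewness_kurtosis([1, 2, 3, 4, 15]))    # one large value skews the sample
```

For the symmetric sample the skewness is zero, while the sample containing one large value gives a positive b_s, which is the signal this test uses to flag a suspect value on the high side.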

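1.4 Comparison of methods

| Test method | Mean | Standard deviation | Mean deviation | Range | Number of measurements | Confidence level |
|---|---|---|---|---|---|---|
| Pauta criterion | √ | √ | | | | |
| 4d test | √ | | √ | | | |
| Chauvenet's criterion | √ | √ | | | √ | √ a |
| t test | √ | √ | | | √ | √ |
| Grubbs' test | √ | √ | | | √ | √ |
| Q test | | | | √ | √ | √ |
| Dixon's test | | | | √ | √ | √ |

a Although Chauvenet's criterion takes the confidence level into account, that confidence level is determined by the sample size n.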

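3 Discussion of methods for treating outliers

Andersen [11], in a paper on analytical quality assurance, put forward his own view on the treatment of outliers. Starting from the observation that the uncertainty increased when different reference laboratories measured the same standard value, he argued that "in statistics a large body of data must tend toward the true value, whereas in experiments highly repeatable data do not necessarily tend toward the true value," and concluded that discarding outliers on statistical grounds is unjustified. Discarding outliers not only changes the mean and the uncertainty but also reduces the reproducibility of the experiment. Professor Deng Bo [12], by contrast, holds that "outliers caused by technical anomalies should be discarded, and highly deviant outliers for which no technical anomaly can be found should also be discarded," that "outliers lying within the error range of the reference material or within the precision of the instrument should not be discarded," and that "outliers generally need to be discarded when the aim is to estimate population parameters." Having studied and analyzed these differing views, the authors offer some views of their own below.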

References

Harvey, D. Modern Analytical Chemistry; McGraw-Hill: New York, USA, 2000; pp 93-94.

Skoog, D. A.; West, D. M.; Holler, F. J.; Crouch, S. R. Fundamentals of Analytical Chemistry; Brooks/Cole: Belmont, USA, 2014; pp 146-149.

Harris, D. C. Quantitative Chemical Analysis; W. H. Freeman and Company: New York, USA, 2010; p 83.

Dean, R. B.; Dixon, W. J. Anal. Chem. 1951, 23, 636.

Dixon, W. J. Ann. Math. Stat. 1950, 21, 488.

Andersen, J. E. T. Anal. Bioanal. Chem. 2014, 406 (25), 6081.

GB/T 4883-2008. Statistical Interpretation of Data: Detection and Treatment of Outliers in the Normal Sample.

Frigge, M.; Hoaglin, D. C.; Iglewicz, B. Am. Stat. 1989, 43 (1), 50.
