Rank correlation coefficients. The essence of nonparametric statistics

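The projection of a rank statistic T onto the family of linear rank statistics approximates T quite well, and the difference between the two is negligible for large n. If the hypothesis H_0 is true, according to which the components X_1, ..., X_n of the random vector X are independent random variables, the projection of the rank statistic is determined by the formula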

where (see).

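There is an intrinsic connection between rank statistics and linear rank statistics. As shown in the literature cited below, if the hypothesis H_0 is true, the projection $\hat{\tau}$ of the Kendall rank correlation coefficient $\tau$ onto the family of linear rank statistics coincides, up to a constant factor, with the Spearman rank correlation coefficient $\rho$, namely:

$$\hat{\tau} = \frac{2}{3}\left(1 + \frac{1}{n}\right)\rho.$$

From this equality it follows that the correlation coefficient corr between $\rho$ and $\tau$ is equal to

$$\operatorname{corr}(\rho, \tau) = \frac{2(n+1)}{\sqrt{2n(2n+5)}},$$

i.e. for large n the rank statistics $\rho$ and $\tau$ are asymptotically equivalent.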
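The asymptotic equivalence can be illustrated numerically. The sketch below assumes the standard expression corr(ρ, τ) = 2(n + 1)/√(2n(2n + 5)) for the correlation between the two coefficients under H_0; the function name `corr_rho_tau` is hypothetical and used only for illustration.

```python
import math


def corr_rho_tau(n: int) -> float:
    """Correlation between Spearman's rho and Kendall's tau under H0.

    Assumes the standard formula 2(n + 1) / sqrt(2n(2n + 5)).
    """
    return 2 * (n + 1) / math.sqrt(2 * n * (2 * n + 5))


# The correlation approaches 1 as the sample size grows.
for n in (5, 10, 100, 1000):
    print(n, round(corr_rho_tau(n), 4))
```

Under this formula the correlation is already close to 1 for small samples, which is one way to read the statement that the two coefficients carry nearly the same information for large n.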

Lit.: Hájek J., Šidák Z., Theory of Rank Tests, Russian translation from English, Moscow, 1971; Kendall M. G., Rank Correlation Methods, 4th ed., London, 1970. M. S. Nikulin.


Mathematical encyclopedia. - M.: Soviet Encyclopedia. I. M. Vinogradov. 1977-1985.


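Table layout: each row corresponds to an event i; the columns give the rank (importance) a_ij assigned to event i by expert j (j = 1, 2, ...) and the total importance rank a_i of the event.

The total importance rank of event i is

$$a_i = \sum_{j=1}^{p} a_{ij}.$$

The average value of the total ranks of the series under consideration is

$$\bar{a} = \frac{1}{m}\sum_{i=1}^{m} a_i = \frac{p(m+1)}{2}.$$

The total squared deviation S of the total ranks of the events from the average value $\bar{a}$ is

$$S = \sum_{i=1}^{m} (a_i - \bar{a})^2.$$

The quantity

$$W = \frac{12S}{p^2(m^3 - m)},$$

where m is the number of events and p is the number of experts, is called the concordance coefficient. The value of W varies from 0 to 1. At W = 0 there is no consistency at all, i.e. there is no connection between the assessments of different experts. Conversely, at W = 1 the agreement between the experts' opinions is complete.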

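If sequence (5.2) contains equalities in addition to strict inequalities, i.e. some ranks coincide, the formula for calculating the concordance coefficient takes the form

$$W = \frac{12S}{p^2(m^3 - m) - p\sum_{j=1}^{p} T_j}, \qquad T_j = \sum_{k} \left(t_{jk}^3 - t_{jk}\right),$$

where p is the number of experts, m is the number of events, and t_jk is the number of equal ranks in the k-th group of tied ranks in the ranking of expert j.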
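The calculation of the concordance coefficient can be sketched in a few lines. This is a minimal illustration, assuming the standard Kendall formula W = 12S / (p²(m³ - m) - p ΣT_j) with the tie correction T_j = Σ(t³ - t); the function name `kendall_w` and the input layout (one list of ranks per expert) are hypothetical choices made for the example.

```python
from collections import Counter


def kendall_w(ranks):
    """Concordance coefficient W for p experts ranking m events.

    ranks[j][i] is the rank given by expert j to event i; tied ranks
    are assumed to be already replaced by average ranks.
    """
    p = len(ranks)       # number of experts
    m = len(ranks[0])    # number of events
    # Total rank a_i of each event over all experts.
    totals = [sum(expert[i] for expert in ranks) for i in range(m)]
    mean_total = p * (m + 1) / 2
    s = sum((a - mean_total) ** 2 for a in totals)
    # Tie correction: sum of (t^3 - t) over groups of equal ranks, per expert.
    tie_sum = sum(
        sum(t ** 3 - t for t in Counter(expert).values()) for expert in ranks
    )
    return 12 * s / (p ** 2 * (m ** 3 - m) - p * tie_sum)


print(kendall_w([[1, 2, 3], [1, 2, 3], [1, 2, 3]]))  # identical rankings
print(kendall_w([[1, 2, 3], [3, 2, 1]]))             # opposite rankings
```

When no ranks are tied, the correction term is zero and the expression reduces to the formula without ties.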

When ranks are repeated, a normal ranking, i.e. one whose average rank equals

$$\frac{m+1}{2},$$

is obtained by assigning to events that have the same rank a rank equal to the average of the places that these events share among themselves.

For example, suppose the following ranking of six events was obtained:

Event i:     1    2    3    4    5    6
Ranks a_i:   1    2    3    3    2    3

Events 2 and 5 share second and third places, so each is assigned the rank (2 + 3)/2 = 2.5. Events 3, 4 and 6 share the fourth, fifth and sixth places, so each is assigned the rank (4 + 5 + 6)/3 = 5.

Thus, we get a normal ranking:

Event i:     1    2    3    4    5    6
Ranks a'_i:  1    2.5  5    5    2.5  5
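The conversion to a normal ranking can be sketched as follows. The preliminary ranks 1, 2, 3, 3, 2, 3 are an assumed illustration consistent with the example (events 2 and 5 share places two and three; events 3, 4 and 6 share places four to six), and the function name `average_ranks` is hypothetical.

```python
def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their places."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend the group while the next value equals the current one.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Places are 1-based: the group occupies places pos+1 .. end+1.
        mean_place = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_place
        pos = end + 1
    return ranks


normal = average_ranks([1, 2, 3, 3, 2, 3])
print(normal)       # ranks after averaging tied places
print(sum(normal))  # equals m(m + 1)/2 for m = 6
```

The sum of the resulting ranks stays equal to m(m + 1)/2, so the average rank remains (m + 1)/2.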

Example. Consider the ranking of m = 10 events by p = 3 experts N, Q and R. The calculation results are presented in Table 5.3.

For extreme values of the concordance coefficient, the following assumptions can be made. If W = 0, there is no consistency in the assessments; therefore, to obtain reliable assessments, it is necessary to clarify the initial data on the events and (or) change the composition of the expert group. When W = 1, the obtained assessments cannot always be considered objective, since it sometimes turns out that all members of the expert group agreed in advance to protect their common interests.

The found value of W must be greater than a specified threshold value W_3 (W > W_3). One can take W_3 = 0.5, i.e. when W > 0.5 the experts' judgments are more in agreement than not. When W < 0.5, the obtained assessments cannot be considered reliable, and the survey should therefore be repeated. How strictly this rule is applied depends on the importance of the study and on whether a repeat examination is possible. Practice shows that this requirement is very often neglected.

The calculation of the W coefficient taking into account the competence of experts is given in the work.

1 Brief history of the emergence of correlation analysis

The use of mathematical and statistical techniques to study correlation dependencies dates back to the 1870s. Many historians and statisticians, however, trace the development of correlation back to the 1840s, when the French mathematician A. Bravais proposed a formula for the joint distribution of two random variables that obey the normal distribution law.

However, the true founder of correlation theory is considered to be the English mathematician and statistician K. Pearson, who created this theory in the late nineteenth and early twentieth centuries. In it, correlation acts as a form of dialectical connection in which many different causes operate, both necessary and random, both common to the two correlated quantities and particular, affecting only one of them. Moreover, not all regular connections are causal.

The development of the theory was carried out with the help of other studies, when the main provisions of the correlation theory had already been created. Moreover, in the field of studying correlations, practice sharply diverged from theory, placing researchers in conditions that did not satisfy its requirements.

Methods for studying correlations and regressions were originally built on data describing quantitatively expressed characteristics. Therefore, from the very first steps, researchers encountered the problem of correlating qualitative characteristics, for example, the relationship between eye color in fathers and sons. The general principle underlying the design of correlation indicators for qualitative characteristics was that two qualitative characteristics can be considered unrelated if characteristic A occurs equally often in the presence of characteristic B and in its absence (not B); the further the data depart from this equality, the closer the relationship. Developing this principle, various indicators were proposed, for example Pearson's mean square contingency coefficient and Chuprov's mutual contingency coefficient.

The study of the correlation of qualitative characteristics gave rise, within the general doctrine of correlation, to the so-called theory of ranks and the theory of rank correlation based on it. The English mathematician and statistician M. Kendall, the author of a monograph devoted to the problems of rank correlation, pointed out that the theory of ranks first arose as an offshoot of the theory of random processes. At the initial stage, ranks were most often seen simply as a convenient device that made it possible to do without measuring the absolute values of variables and thereby save time and effort. Later, rank statistics gained recognition on their own merits. Kendall constructed a measure that is also applicable to studying partial correlation between ranks. It is impossible to imagine the modern theory of rank correlation without M. Kendall's comprehensive studies.

Thus, by the beginning of the twentieth century, mathematical and statistical methods for measuring correlations and regressions had generally developed into a fairly coherent integrated system, including methods of nonparametric statistics and nonparametric rank methods.

2 Nonparametric rank methods

Nonparametric rank methods are a rapidly developing area of mathematical statistics. The history of modern nonparametric rank-based methods is quite short, only about 40 years. Rank methods have emerged as a special area of nonparametric statistics not only because of the nature of the source material but also because of the ideas behind its further use. Today, these methods solve many problems in the analysis of economic, statistical, engineering, natural science, sociological, and medical data.

Ranking is a procedure for ordering the objects of study on the basis of preference. A rank is the ordinal number of a value of a characteristic when the values are arranged in ascending or descending order. As statistical studies conducted over the past 10-15 years have shown, rank methods largely avoid a number of drawbacks that arise when working with small samples whose distribution is unknown. As is known, the transition from the observations themselves to their ranks is accompanied by a certain loss of information. However, these losses are not too great. Unfortunately, there is still a lack of specialized literature on this issue.

Recently, expert assessments have become widely used in forecasting and in solving a number of other problems. Rank correlation methods are perhaps the only way to generalize expert assessments in this area.

Rank theory first emerged as an offshoot of the theory of random processes. At the initial stage, ranks were most often seen simply as a convenient device that made it possible to do without measuring the absolute values of variables and thereby save time or effort. Thanks to the use of ranks, it was possible to avoid the difficulties associated with constructing an objective scale of absolute values. Later, rank statistics gained recognition on their own merits.

Below we will consider the most common ways of organizing the objects being studied:

The task may simply be to organize objects according to the place they occupy in space or time. For example, the cards were arranged in a deck in some order and then shuffled. The new arrangement of cards is also characterized by a certain order, ranking. Comparing it with the old one, you can see how carefully the cards were shuffled. In this task, only the general arrangement of cards in the deck is interesting, and there is no need to arrange objects in accordance with the “increase” or “decrease” of one or another characteristic inherent in all of them;

Objects can also be ordered according to some quality for which there is no objective absolute scale of measurement. You can, for example, rank rock samples by hardness, based on the following simple criterion: A is harder than B if A leaves a scratch on B when they touch. If A leaves a scratch on B, and B leaves a scratch on C, then A will leave a scratch on C. Thus, by resorting to a series of comparisons, the objects in question can be ordered with reasonable accuracy (unless the set contains two objects that have the same hardness). However, this method does not allow the absolute value of rock hardness to be measured. It is always possible to establish that A is harder than B. However, until some measurement scale of absolute values is constructed, it cannot be said that A is, say, twice as hard as B;

The ordering can be carried out in accordance with the measured (or theoretically calculated) value of some attribute. For example, you can arrange people in one order or another depending on their height, and cities by population. In this case, it is not always necessary to resort to the measurement process itself: you can build a group of students by height “by eye”; however, in such cases, the criterion by which the ranking occurs must allow for direct comparisons.

It is possible to order objects according to some attribute, the value of which, in principle, can be measured, but in practice (or even theoretically) it is not possible to resort to such a measurement for one reason or another. For example, one might order a series of persons according to their intellectual abilities, believing that such a quality actually exists and that people can be placed in one order or another according to the intensity of this attribute.

In practical applications of ranking-based methods, cases sometimes arise in which two or more objects are so similar that it is impossible to give preference to one of them. When an expert ranks objects on the basis of subjective judgments, this lack of preference reflects either the genuine indistinguishability of the objects or the researcher's inability to find significant differences. In this case, the objects are said to be tied.

For example, suppose students are ranked according to their merits or exam scores. The accepted method for assigning numerical values to the ranks of tied objects is to average the ranks they would have if they were distinguishable. For example, if the third and fourth objects are tied, each is assigned a rank of 3.5; if the objects from the second to the seventh are tied, each receives the rank 4.5.

This approach is sometimes called the "average rank method." When there is no basis for choosing between objects, it is clear that all of them must be assigned equal ranks. The advantage of this method is that the sum of ranks over all objects remains exactly the same as in a ranking without ties.
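As a quick arithmetic check of the two cases mentioned above (a worked illustration, with the places taken directly from the text), the average ranks and the preserved rank sums are:

```latex
\frac{3 + 4}{2} = 3.5, \qquad 2 \cdot 3.5 = 7 = 3 + 4;
\qquad
\frac{2 + 3 + 4 + 5 + 6 + 7}{6} = \frac{27}{6} = 4.5, \qquad 6 \cdot 4.5 = 27 = 2 + 3 + 4 + 5 + 6 + 7.
```

In both cases the tied objects together contribute exactly the sum of the places they occupy, which is why the total over all n objects stays at n(n + 1)/2.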

In the analysis of socio-economic phenomena, it is often necessary to resort to various conditional estimates using ranks, and the relationship between individual characteristics is measured using non-parametric correlation coefficients.

3 Kendall's rank concordance coefficient

To determine the closeness of the relationship between an arbitrary number of ranked features, a multiple correlation coefficient (concordance coefficient) is used.

In the practice of statistical research, there are cases when a set of objects is characterized not by two but by several sequences of ranks, and a statistical relationship among several variables must be established. As a measure of this relationship, Kendall's multiple rank correlation coefficient (concordance coefficient) is used, determined by the following formula:

$$W = \frac{12D}{m^2(n^3 - n)}, \qquad (1)$$

where W is the concordance coefficient;

D is the sum of squared deviations of the rank sums from their mean, calculated according to formula (2);

n is the number of ranked objects;

m is the number of analyzed ordinal variables (for example, the number of experts).

In a sense, W serves as a measure of the overall agreement among the rankings.

$$D = \sum_{i=1}^{n}\left(\sum_{j=1}^{m} r_{ij} - \frac{m(n+1)}{2}\right)^2, \qquad (2)$$

where r_ij is the rank assigned to object i in the j-th ranking (the ranked judgments of the group of experts);

n is the number of objects.

The values of the concordance coefficient lie in the interval [0, 1].

An increase in the coefficient from 0 to 1 means greater consistency of judgments. If all these judgments coincide, then W=1.

Testing the significance of the coefficient is based on the fact that, if the null hypothesis of no correlation is true, then for n > 7 the statistic m(n-1)W has approximately a χ² distribution with k = n-1 degrees of freedom. Therefore, the concordance coefficient is significant at level α = 0.05 if m(n-1)W exceeds the upper 5% critical value of the χ² distribution with n-1 degrees of freedom.
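The significance check can be sketched as follows. This is a minimal illustration under stated assumptions: the upper 5% critical values of the χ² distribution are hard-coded from standard tables for 1 to 10 degrees of freedom, and the names `CHI2_CRIT_05` and `w_is_significant` are hypothetical.

```python
# Upper 5% critical values of the chi-square distribution (standard tables),
# keyed by the number of degrees of freedom.
CHI2_CRIT_05 = {
    1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070,
    6: 12.592, 7: 14.067, 8: 15.507, 9: 16.919, 10: 18.307,
}


def w_is_significant(w, n, m):
    """Return the statistic m(n - 1)W and whether it exceeds the 5% critical value.

    n is the number of ranked objects, m the number of rankings (experts).
    """
    stat = m * (n - 1) * w
    return stat, stat > CHI2_CRIT_05[n - 1]


# Example: n = 10 objects ranked by m = 3 experts.
print(w_is_significant(0.5, 10, 3))
print(w_is_significant(0.8, 10, 3))
```

With n = 10 and m = 3, a coefficient of 0.5 gives a statistic of 13.5, below the critical value 16.919, so even a moderately large W need not be significant when the number of experts is small.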


Ranking is a procedure for ordering the objects of study on the basis of preference.

A rank is the ordinal number of a value of a characteristic when the values are arranged in ascending or descending order. If several values of the characteristic are equal, the rank of each of them is taken to be the arithmetic mean of the place numbers they occupy. Such ranks are called tied.

Among the nonparametric methods for estimating the strength of a relationship, the most important are Spearman's (ρ) and Kendall's (τ) rank correlation coefficients. These coefficients can be used to determine the closeness of the relationship between both quantitative and qualitative characteristics.

The rank correlation coefficient (Spearman coefficient) is calculated using the formula

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)},$$

where d_i² are the squared rank differences and n is the number of observations (number of rank pairs).

The Spearman coefficient takes any value in the interval [-1; 1].
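A minimal sketch of the calculation, assuming the standard Spearman formula ρ = 1 - 6Σd² / (n(n² - 1)) for ranks without ties; the function name `spearman_rho` and the sample ranks are hypothetical.

```python
def spearman_rho(rx, ry):
    """Spearman rank correlation from two lists of ranks without ties."""
    n = len(rx)
    # Sum of squared rank differences d_i^2.
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


print(spearman_rho([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # identical ranks
print(spearman_rho([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))  # reversed ranks
print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # two adjacent swaps
```

Identical rank sequences give the upper bound of the interval and fully reversed sequences give the lower bound, which matches the stated range [-1; 1].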

Example. Based on data on the purchase and sale of currency by citizens of the constituent entities of the Volga Federal District of the Russian Federation through credit organizations in 2010, we will determine the relationship between these characteristics using the Spearman coefficient (Table 7.14).

Table 7.14. Spearman coefficient calculation

Columns: subject; currency purchases x, million rubles; currency sales y, million rubles; rank R_x; rank R_y; rank difference d; squared rank difference d².

Rows (values not reproduced):

1. Republic of Bashkortostan
2. Republic of Mari El
3. Republic of Mordovia
4. Republic of Tatarstan
5. Udmurt Republic
6. Chuvash Republic
7. Perm region
8. Kirov region
9. Nizhny Novgorod region
10. Orenburg region
11. Penza region
12. Samara region
13. Saratov region
14. Ulyanovsk region

Let's calculate the Spearman rank correlation coefficient:

As a result of the calculation, we determined that the connection between the purchase and sale of currency by citizens of the constituent entities of the Volga Federal District of the Russian Federation through credit organizations in 2010 was strong, close to functional.

Kendall's rank correlation coefficient is also used to measure the closeness and direction of the connection between qualitative and quantitative characteristics that describe homogeneous objects and are ranked according to the same principle. The Kendall rank coefficient is calculated using the formula

$$\tau = \frac{2S}{n(n-1)},$$

where S is the sum of the differences between the number of sequences and the number of inversions in the second characteristic, and n is the number of observations.

The coefficient is calculated in the following order.

  • 1. The values of x are ranked in ascending or descending order.
  • 2. The values of y are arranged in the order corresponding to the values of x.
  • 3. For each rank of y, the number of rank values that follow it and exceed it is determined. Adding these numbers gives the value P, a measure of agreement between the rank sequences of x and y, which is counted with a "+" sign.
  • 4. For each rank of y, the number of rank values that follow it and are smaller than it is determined. The total is denoted by Q and is counted with a "-" sign.
  • 5. The sum of points over all members of the series is determined: S = P + Q.
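The steps above can be sketched as follows. This is a minimal illustration for data without ties, assuming the standard formula τ = 2S / (n(n - 1)) with S = P + Q, where Q is taken with a minus sign; the function name `kendall_tau` and the sample ranks are hypothetical.

```python
def kendall_tau(x, y):
    """Kendall's tau by counting P and Q; assumes no tied values in x or y."""
    # Steps 1-2: arrange y in the order of ascending x.
    y_sorted = [yi for _, yi in sorted(zip(x, y))]
    n = len(y_sorted)
    p = 0  # following values that exceed the current one ("+" points)
    q = 0  # following values that are smaller ("-" points, stored as negative)
    for i in range(n):
        for j in range(i + 1, n):
            if y_sorted[j] > y_sorted[i]:
                p += 1   # step 3
            elif y_sorted[j] < y_sorted[i]:
                q -= 1   # step 4
    s = p + q            # step 5
    return p, q, s, 2 * s / (n * (n - 1))


print(kendall_tau([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))
```

For these sample ranks, eight pairs follow in the "correct" order and two in the opposite order, so S = 6 and τ = 0.6. The double loop takes O(n²) time, which is adequate for the small samples typical of rank methods.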

The relationship between characteristics is considered statistically significant if the Spearman and Kendall rank correlation coefficients are greater than 0.5.

From the data of Table 7.14, the results presented in Table 7.15 were obtained. The Kendall rank correlation coefficient calculated from them also indicates a strong connection between the purchase and sale of currency by citizens of the constituent entities of the Volga Federal District of the Russian Federation through credit organizations in 2010.

The multiple rank correlation coefficient (concordance coefficient) is used to determine the closeness of the connection between an arbitrary number of ranked features. It is calculated using the formula

$$W = \frac{12S}{m^2(n^3 - n)},$$

where S is the sum of squared deviations of the rank sums from their mean; m is the number of factors; n is the number of observations.

Example. Let us determine the degree of closeness of the connection between such basic indicators of technology trade with the CIS countries in 2010 as the number of export agreements, the cost of the subject of the agreement and the flow of funds (Table 7.16).

Table 7.16. Calculation of the concordance coefficient

Columns: country; number of agreements x; cost of the subject of the agreement y, million dollars; receipt of funds for the year z, million dollars; ranks R_x, R_y, R_z; row sum of ranks; square of the sum.

Rows (values not reproduced):

1. Azerbaijan
2. Armenia
3. Belarus
4. Kazakhstan
5. Kyrgyzstan
6. Republic of Moldova

Rank correlation coefficients are less accurate but simpler-to-calculate nonparametric indicators of the closeness of the relationship between two correlated characteristics. They include the Spearman (ρ) and Kendall (τ) coefficients, which are based on the correlation not of the values of the characteristics themselves but of their ranks, i.e. the ordinal numbers assigned to each individual value of x and of y (separately) in a ranked series. Both characteristics must be ranked (numbered) in the same order: either from lower to higher values or vice versa. If several equal values of x (or y) occur, each of them is assigned a rank equal to the sum of the ranks (places in the series) falling on these values divided by the number of equal values. The ranks of x and y are denoted by R_x and R_y (sometimes N_x and N_y).

The relationship between changes in x and y is judged by comparing the behavior of the ranks of the two characteristics in parallel. If the ranks coincide for every pair x and y, this characterizes the closest possible direct relationship. If the ranks are completely opposite, i.e. in one series they increase from 1 to n and in the other they decrease from n to 1, this is the strongest possible inverse relationship.

Spearman's and Kendall's approaches to assessing the closeness of a relationship differ somewhat. To calculate the Spearman coefficient, the values of x and y are numbered (separately) in ascending order from 1 to n, i.e. each is assigned a rank (R_x and R_y), its ordinal number in the ranked series. Then, for each pair of ranks, the difference d = R_x - R_y is found, and the squares of these differences are summed:

$$\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)}, \qquad (147)$$

where d is the difference between the ranks of x and y;

n is the number of observed pairs of values of x and y.

The coefficient ρ can take values from 0 to ±1. It should be borne in mind that, since the Spearman coefficient takes into account only the differences in ranks and not in the values of x and y themselves, it is less accurate than the linear correlation coefficient. Therefore, its extreme values (1 or 0) cannot be regarded unconditionally as evidence of a functional relationship or of a complete absence of dependence between x and y. In all other cases, i.e. when ρ does not take extreme values, it is quite close to r.

Strictly speaking, formula (147) is applicable only when the individual values of x (and y), and therefore their ranks, are not repeated. For repeated (tied) ranks there is another, more complex formula, adjusted for the number of repeated ranks. However, experience shows that the results obtained with the adjusted formula for tied ranks differ little from those obtained with the formula for non-repeating ranks. Therefore, in practice, formula (147) is successfully used for both non-repeating and repeating ranks.

The Kendall rank correlation coefficient τ is constructed somewhat differently, although its calculation also begins with ranking the values of x and y. The ranks of x (R_x) are arranged strictly in ascending order, and the corresponding values of R_y are written alongside them. Since R_x is written strictly in ascending order, the task is to determine how consistently the sequence R_y follows the "correct" order of R_x. For each R_y, one determines in turn the number of ranks following it that exceed it and the number of following ranks that are smaller. The former ("correct" order) are counted as points with a "+" sign, and their sum is denoted by P. The latter ("incorrect" order) are counted as points with a "-" sign, and their sum is denoted by Q. Obviously, P reaches its maximum if the ranks of y (R_y) coincide with the ranks of x (R_x) and in each series form the sequence of natural numbers from 1 to n. Then after the first pair, R_x = 1 and R_y = 1, the number of following ranks exceeding these values is (n - 1); after the second pair, R_x = 2 and R_y = 2, it is (n - 2), and so on. Thus, if the ranks of x and y coincide and the number of rank pairs is n, then

$$P_{\max} = (n-1) + (n-2) + \dots + 1 = \frac{n(n-1)}{2}.$$

If the rank sequence of y runs in the opposite direction to the rank sequence of x, then Q has the same maximum value in absolute terms:

$$|Q_{\max}| = \frac{n(n-1)}{2}.$$

If the ranks of y do not coincide with the ranks of x, all the positive and negative points are summed (S = P + Q); the ratio of this sum S to the maximum value of one of the terms is the Kendall rank correlation coefficient τ, i.e.:

$$\tau = \frac{S}{\tfrac{1}{2}n(n-1)} = \frac{2S}{n(n-1)}. \qquad (148)$$

Formula (148) is used when individual values of a characteristic (both x and y) are not repeated and, therefore, their ranks are not tied. If there are several identical values of x (or y), i.e. the ranks are repeated (tied), the Kendall rank correlation coefficient is determined by the formula:

$$\tau = \frac{S}{\sqrt{\left[\tfrac{1}{2}n(n-1) - V_x\right]\left[\tfrac{1}{2}n(n-1) - V_y\right]}}, \qquad (149)$$

where S is the actual total score, with +1 counted for each pair of ranks with the same order of change and -1 for each pair of ranks with the opposite order of change;

V_x = ½ Σ t(t - 1), and V_y defined likewise, are the numbers of points that correct (reduce) the maximum score because of groups of t repeated (tied) ranks in each series.

Note that pairs with identical repeated ranks (in either series) are scored 0, i.e. they are not counted either with the "+" sign or with the "-" sign.
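The tie-adjusted calculation can be sketched as follows. This is a minimal illustration assuming the standard tie-corrected form τ = S / √((n(n-1)/2 - V_x)(n(n-1)/2 - V_y)) with V = ½ Σ t(t - 1), often called tau-b; the function name `kendall_tau_b` and the sample data are hypothetical.

```python
import math
from collections import Counter


def kendall_tau_b(x, y):
    """Kendall's tau with correction for tied values in x and/or y."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            if prod > 0:
                s += 1   # same order of change
            elif prod < 0:
                s -= 1   # opposite order of change
            # prod == 0: the pair is tied in x or y and scores 0
    n0 = n * (n - 1) / 2
    # Corrections V = (1/2) * sum of t(t - 1) over groups of t equal values.
    vx = sum(t * (t - 1) for t in Counter(x).values()) / 2
    vy = sum(t * (t - 1) for t in Counter(y).values()) / 2
    return s / math.sqrt((n0 - vx) * (n0 - vy))


print(kendall_tau_b([1, 2, 3, 4], [1, 2, 3, 4]))      # no ties
print(kendall_tau_b([1, 2.5, 2.5, 4], [1, 2, 3, 4]))  # one tied pair in x
```

Without ties, both corrections vanish and the expression reduces to 2S / (n(n - 1)).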

The advantages of the Spearman and Kendall rank correlation coefficients are that they are easy to calculate and can be used to study and measure the relationship not only between quantitative characteristics but also between qualitative (descriptive) ones ranked in a certain way. In addition, when using rank correlation coefficients, it is not necessary to know the form of the relationship between the phenomena being studied.

If the number of ranked characteristics (factors) is more than two, the closeness of the connection between them can be measured using the concordance coefficient (multiple rank correlation coefficient) proposed by M. Kendall and B. Smith:

$$W = \frac{12S}{m^2(n^3 - n)}, \qquad (150)$$

where S is the sum of squared deviations of the sums of the m ranks from their average value;

m is the number of ranked features;

n is the number of ranked units (number of observations).

Formula (150) is used when the ranks for each attribute are not repeated. If there are tied ranks, the concordance coefficient is calculated taking into account the number of such repeated (tied) ranks for each factor:

$$W = \frac{12S}{m^2(n^3 - n) - m\sum_{j=1}^{m}\sum (t^3 - t)}, \qquad (151)$$

where t is the number of identical ranks in each group of ties for each characteristic.

The concordance coefficient W can take values from 0 to 1. However, it must be checked for significance using the χ² criterion: in the absence of tied ranks by formula (152), and in their presence by formula (153):

$$\chi^2 = \frac{12S}{mn(n+1)}, \qquad (152)$$

$$\chi^2 = \frac{12S}{mn(n+1) - \frac{1}{n-1}\sum_{j=1}^{m}\sum (t^3 - t)}. \qquad (153)$$

The actual value of χ² is compared with the tabulated value corresponding to the accepted significance level α (0.05 or 0.01) and the number of degrees of freedom ν = n - 1. If the actual χ² exceeds the tabulated χ², then W is significant.

The concordance coefficient is especially often used in expert assessments, for example, in order to determine the degree of agreement between experts’ opinions about the importance of a particular indicator being assessed or to rank individual units on any basis. In formula (150), in these cases, m means the number of experts, and n is the number of ranked units (or features).
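The χ² statistic used for the significance check is a rescaling of W itself. The short derivation below is a sketch assuming the standard no-ties forms W = 12S / (m²(n³ - n)) and χ² = 12S / (mn(n + 1)), with m the number of experts and n the number of ranked units:

```latex
\chi^2 = m(n-1)\,W
       = m(n-1)\cdot\frac{12S}{m^2\,n(n-1)(n+1)}
       = \frac{12S}{m\,n(n+1)},
\qquad \text{since } n^3 - n = n(n-1)(n+1).
```

So for a fixed number of ranked units, the same value of W becomes more significant as the number of experts m grows.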
