In public health and healthcare research, both scientific and applied, the researcher often has to analyze statistically the relationship between a factor characteristic and an outcome characteristic of a statistical population (a causal relationship), or to determine how parallel changes in several characteristics of that population depend on some third quantity (their common cause). One must be able to study the features of this relationship, determine its strength and direction, and assess its reliability. Correlation methods are used for this purpose.
Functional relationship: a relationship between two characteristics in which each value of one corresponds to a strictly defined value of the other (for example, the area of a circle depends on its radius). Functional relationships are characteristic of physical and mathematical processes.
Correlation: a relationship in which each specific value of one characteristic corresponds to several values of another, interrelated characteristic (for example, the relationship between a person's height and weight, or between body temperature and pulse rate). Correlation is typical of medical and biological processes.
Dependence of parallel changes in several characteristics on some third value. For example, under the influence of high temperature in the workshop, changes in blood pressure, blood viscosity, pulse rate, etc. occur.
1) Method of squares
2) Rank method
Method 1
Reliability is determined from the error of the correlation coefficient and the t criterion:
$$m_r = \sqrt{\frac{1 - r_{xy}^2}{n - 2}}, \qquad t = \frac{r_{xy}}{m_r}$$
The t criterion is evaluated using a table of t values, taking into account the number of degrees of freedom (n - 2), where n is the number of paired observations. The calculated t must be equal to or greater than the tabulated value corresponding to a probability p ≥ 99%.
Method 2
Reliability is assessed using a special table of standard correlation coefficients. A correlation coefficient is considered reliable when, for the given number of degrees of freedom (n - 2), it is equal to or greater than the tabulated value corresponding to a probability of error-free prediction p ≥ 95%.
Exercise: calculate the correlation coefficient, determine the direction and strength of the relationship between the amount of calcium in water and water hardness, if the following data are known (Table 1). Assess the reliability of the relationship. Draw a conclusion.
Table 1
Justification for the choice of method. The method of squares (Pearson) was chosen to solve the problem because each characteristic (water hardness and amount of calcium) is expressed numerically and neither series contains open-ended values.
Solution.
The sequence of calculations is described in the text, and the results are presented in the table. Construct the series of paired characteristics and denote them by x (water hardness in degrees) and y (amount of calcium in water in mg/l).
Water hardness, x (degrees) | Amount of calcium in water, y (mg/l) | d x | d y | d x × d y | d x² | d y² |
4 | 28 | -16 | -114 | 1824 | 256 | 12996 |
8 | 56 | -12 | -86 | 1032 | 144 | 7396 |
11 | 77 | -9 | -66 | 594 | 81 | 4356 |
27 | 191 | +7 | +48 | 336 | 49 | 2304 |
34 | 241 | +14 | +98 | 1372 | 196 | 9604 |
37 | 262 | +16 | +120 | 1920 | 256 | 14400 |
M x = Σx / n = 120/6 = 20 | M y = Σy / n = 852/6 = 142 | | | Σ d x × d y = 7078 | Σ d x² = 982 | Σ d y² = 51056 |
The criterion t = 14.1, which corresponds to a probability of an error-free forecast p > 99.9%.
2nd method. The reliability of the correlation coefficient is assessed using the table “Standard correlation coefficients” (see Appendix 1). With n - 2 = 6 - 2 = 4 degrees of freedom, the calculated correlation coefficient r xy = +0.99 is greater than the tabulated value (r table = +0.917 at p = 99%).
Conclusion. The more calcium there is in water, the harder the water is (the relationship is direct, strong and reliable: r xy = +0.99, p > 99.9%).
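The calculation steps of this worked example can be sketched in Python. This is a minimal sketch, not the author's own procedure: it assumes the usual method-of-squares formula r = Σ(dx·dy) / √(Σdx² · Σdy²) and the error formula m_r = √((1 - r²)/(n - 2)), and the function names are illustrative. It works from the deviation columns of Table 1 as printed.

```python
import math

def pearson_from_deviations(dx, dy):
    """Method-of-squares r from deviations of x and y about their means."""
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)
    return sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)

def t_criterion(r, n):
    """t = r / m_r, where m_r is the error of r with n - 2 degrees of freedom."""
    m_r = math.sqrt((1 - r * r) / (n - 2))
    return r / m_r

# Deviation columns d x and d y from Table 1
dx = [-16, -12, -9, 7, 14, 16]
dy = [-114, -86, -66, 48, 98, 120]

r = pearson_from_deviations(dx, dy)   # unrounded coefficient
t_reported = t_criterion(0.99, 6)     # t using r = 0.99 as reported in the text
r_table = 0.917                       # tabulated value for 4 d.f. at p = 99%

print(f"r = {r:.4f}, t (with r = 0.99) = {t_reported:.1f}, reliable: {r > r_table}")
```

Taking r = 0.99, as the text reports it, gives t of about 14, in line with the reported 14.1. The unrounded r computed from the sums is slightly closer to 1 and would give a larger t, which does not change the conclusion.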
Applying the rank method
Exercise: using the rank method, establish the direction and strength of the relationship between length of work experience and the frequency of injuries, given the following data (Table 2).
Justification for choosing the method. Only the rank correlation method can be used to solve the problem, because the first series, “work experience in years”, has open-ended values (work experience up to 1 year and 7 or more years), which rules out the more accurate method of squares for establishing the relationship between the compared characteristics.
Solution. The sequence of calculations is described in the text, and the results are presented in Table 2.
Table 2
Work experience in years | Number of injuries | Rank X | Rank Y | Rank difference d = x - y | Squared rank difference d² |
Up to 1 year | 24 | 1 | 5 | -4 | 16 |
1-2 | 16 | 2 | 4 | -2 | 4 |
3-4 | 12 | 3 | 2.5 | +0.5 | 0.25 |
5-6 | 12 | 4 | 2.5 | +1.5 | 2.25 |
7 or more | 6 | 5 | 1 | +4 | 16 |
| | | | | Σ d² = 38.5 |
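The remaining step, turning Σ d² into a coefficient, can be sketched as follows. This is an illustrative sketch that assumes the standard Spearman formula ρ = 1 - 6Σd² / (n(n² - 1)); the helper names are hypothetical, and tied values receive the mean of the ranks they occupy, as in Table 2.

```python
def average_ranks(values):
    """Ascending ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values equal to the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + 1 + j + 1) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(rank_x, rank_y):
    """Spearman coefficient from two rank series; also returns the sum of d squared."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1)), sum_d2

experience_rank = [1, 2, 3, 4, 5]        # ordered categories from Table 2
injuries = [24, 16, 12, 12, 6]
injury_rank = average_ranks(injuries)

rho, sum_d2 = spearman_rho(experience_rank, injury_rank)
print(injury_rank, sum_d2, round(rho, 3))
```

Under these assumptions the coefficient is negative and close to -1, which would indicate a strong inverse relationship: the longer the work experience, the fewer the injuries. Its reliability still has to be assessed separately.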
Standard correlation coefficients that are considered reliable (according to L.S. Kaminsky)
Number of degrees of freedom (n - 2) | Probability level p: 95% | 98% | 99% |
1 | 0.997 | 0.999 | 0.999 |
2 | 0.950 | 0.980 | 0.990 |
3 | 0.878 | 0.934 | 0.959 |
4 | 0.811 | 0.882 | 0.917 |
5 | 0.754 | 0.833 | 0.874 |
6 | 0.707 | 0.789 | 0.834 |
7 | 0.666 | 0.750 | 0.798 |
8 | 0.632 | 0.716 | 0.765 |
9 | 0.602 | 0.685 | 0.735 |
10 | 0.576 | 0.658 | 0.708 |
11 | 0.553 | 0.634 | 0.684 |
12 | 0.532 | 0.612 | 0.661 |
13 | 0.514 | 0.592 | 0.641 |
14 | 0.497 | 0.574 | 0.623 |
15 | 0.482 | 0.558 | 0.606 |
16 | 0.468 | 0.542 | 0.590 |
17 | 0.456 | 0.528 | 0.575 |
18 | 0.444 | 0.516 | 0.561 |
19 | 0.433 | 0.503 | 0.549 |
20 | 0.423 | 0.492 | 0.537 |
25 | 0.381 | 0.445 | 0.487 |
30 | 0.349 | 0.409 | 0.449 |
Rank correlation coefficients are less accurate but simpler-to-calculate nonparametric measures of the closeness of the relationship between two correlated characteristics. They include the Spearman (ρ) and Kendall (τ) coefficients, which are based on the correlation not of the values of the characteristics themselves but of their ranks: the serial numbers assigned to each individual value of x and y (separately) in a ranked series. Both characteristics must be ranked (numbered) in the same order: either from lower to higher values or the reverse. If several values of x (or y) are equal, each of them is assigned a rank equal to the sum of the ranks (places in the series) these values occupy divided by the number of equal values. The ranks of x and y are denoted Rx and Ry (sometimes Nx and Ny).
The relationship between changes in x and y is judged by comparing the behavior of the ranks of the two characteristics in parallel. If the ranks coincide for every pair of x and y, this indicates the closest possible direct relationship. If the ranks are completely opposite, i.e. in one series the ranks increase from 1 to n and in the other decrease from n to 1, this is the strongest possible inverse relationship. Spearman's and Kendall's approaches to assessing the closeness of the relationship differ somewhat.
To calculate the Spearman coefficient, the values of x and y are numbered (separately) in ascending order from 1 to n, i.e. each is assigned a rank (Rx and Ry), its serial number in the ranked series. Then, for each pair of ranks, the difference d = Rx - Ry is found, and the squares of these differences are summed:
$$\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)}, \qquad (147)$$
where d is the difference between the ranks of x and y;
n is the number of observed pairs of values of x and y.
The coefficient ρ can take values from 0 to ±1. Bear in mind that, since the Spearman coefficient takes into account only differences in ranks and not in the values of x and y themselves, it is less accurate than the linear coefficient. Therefore its extreme values (1 or 0) cannot be unconditionally regarded as evidence of a functional relationship or of a complete absence of dependence between x and y. In all other cases, i.e. when ρ does not take extreme values, it is quite close to r.
Formula (147) is, strictly speaking, applicable only when the individual values of x (and y), and therefore their ranks, are not repeated. For repeated (tied) ranks there is another, more complex formula, adjusted for the number of repeated ranks. However, experience shows that results from the adjusted formula for tied ranks differ little from those obtained with the formula for non-repeating ranks. Therefore, in practice, formula (147) is successfully used for both non-repeating and repeating ranks.
The Kendall rank correlation coefficient τ is constructed somewhat differently, although its calculation also begins with ranking the values of x and y. The ranks of x (Rx) are arranged strictly in ascending order, and the corresponding value of Ry is written alongside each Rx. Since the Rx are in strictly ascending order, the task is to determine how consistently the sequence of Ry follows the “correct” order of Rx. For each Ry, one counts in turn the number of ranks following it that exceed it and the number that are smaller. The former (“correct” order) are counted as points with a “+” sign, and their sum is denoted P. The latter (“incorrect” order) are counted as points with a “-” sign, and their sum is denoted Q. Obviously, P reaches its maximum when the ranks of y (Ry) coincide with the ranks of x (Rx) and each series is the sequence of natural numbers from 1 to n. Then after the first pair, Rx = 1 and Ry = 1, the number of larger ranks is (n - 1); after the second pair, Rx = 2 and Ry = 2, it is (n - 2), and so on. Thus, if the ranks of x and y coincide and the number of rank pairs is n, then
$$P_{\max} = (n - 1) + (n - 2) + \dots + 1 = \frac{n(n - 1)}{2}.$$
If the sequence of ranks of y runs in the opposite direction to the sequence of ranks of x, then Q reaches the same maximum in absolute value:
$$|Q_{\max}| = \frac{n(n - 1)}{2}.$$
If the ranks of y do not coincide with the ranks of x, all positive and negative points are summed (S = P + Q); the ratio of this sum S to the maximum value of one of the terms is the Kendall rank correlation coefficient τ, i.e.:
$$\tau = \frac{S}{\tfrac{1}{2}n(n - 1)} = \frac{2S}{n(n - 1)}. \qquad (148)$$
Formula (148) for the Kendall rank correlation coefficient is used when the individual values of a characteristic (both x and y) are not repeated and, therefore, their ranks are not tied. If there are several identical values of x (or y), i.e. ranks are repeated and become tied, the Kendall rank correlation coefficient is determined by the formula:
$$\tau = \frac{S}{\sqrt{\left[\tfrac{1}{2}n(n - 1) - V_x\right]\left[\tfrac{1}{2}n(n - 1) - V_y\right]}}, \qquad (149)$$
where S is the actual total score, counting +1 for each pair of ranks with the same order of change and -1 for each pair of ranks with the opposite order of change;
$V = \tfrac{1}{2}\sum t(t - 1)$ is the number of points that corrects (reduces) the maximum score because of groups of t repeated (tied) ranks in each series ($V_x$ for x, $V_y$ for y).
Note that pairs with identical repeated ranks (in either series) are scored 0, i.e. they are counted neither with the “+” sign nor with the “-” sign.
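The scoring procedure described above can be sketched in Python. This is a minimal sketch under stated assumptions: instead of sorting by Rx and scanning the following Ry values, it compares every pair of observations directly, which gives the same P and Q; the tie correction is taken to be V = ½Σt(t - 1) for each series, and the function name is illustrative.

```python
import math
from collections import Counter

def kendall_tau(rank_x, rank_y):
    """Kendall tau with tie correction; returns (tau, P, Q)."""
    n = len(rank_x)
    p = 0  # points with "+" sign: pairs ordered the same way in both series
    q = 0  # points with "-" sign: pairs ordered in opposite ways
    for i in range(n):
        for j in range(i + 1, n):
            sign = (rank_x[i] - rank_x[j]) * (rank_y[i] - rank_y[j])
            if sign > 0:
                p += 1
            elif sign < 0:
                q -= 1
            # pairs tied in either series score 0
    s = p + q
    max_score = n * (n - 1) / 2
    v_x = sum(t * (t - 1) for t in Counter(rank_x).values()) / 2
    v_y = sum(t * (t - 1) for t in Counter(rank_y).values()) / 2
    tau = s / math.sqrt((max_score - v_x) * (max_score - v_y))
    return tau, p, q

# Illustrative ranks without ties (not taken from the text)
tau_plain, p_plain, q_plain = kendall_tau([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])

# Ranks from Table 2, where two injury counts are tied
tau_tied, p_tied, q_tied = kendall_tau([1, 2, 3, 4, 5], [5, 4, 2.5, 2.5, 1])

print(tau_plain, p_plain, q_plain)
print(round(tau_tied, 3), p_tied, q_tied)
```

When no ranks are repeated both corrections are zero and the result reduces to S divided by n(n - 1)/2.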
The advantages of the Spearman and Kendall rank correlation coefficients are that they are easy to calculate and can be used to study and measure the relationship not only between quantitative characteristics but also between qualitative (descriptive) ones that have been ranked in some way. In addition, rank correlation coefficients do not require knowing the form of the relationship between the phenomena being studied.
If the number of ranked characteristics (factors) is more than two, the closeness of the relationship between them can be measured with the concordance coefficient (multiple rank correlation coefficient) proposed by M. Kendall and B. Smith:
$$W = \frac{12S}{m^2(n^3 - n)}, \qquad (150)$$
where S is the sum of squared deviations of the sums of the m ranks from their mean value;
m is the number of ranked characteristics;
n is the number of ranked units (number of observations).
Formula (150) applies when ranks are not repeated within any characteristic. If there are tied ranks, the concordance coefficient is calculated taking into account the number of such repeated (tied) ranks for each factor:
$$W = \frac{12S}{m^2(n^3 - n) - m\sum (t^3 - t)}, \qquad (151)$$
where t is the number of identical ranks for each characteristic.
The concordance coefficient W can take values from 0 to 1. However, it must be checked for significance using the χ² criterion: by formula (152) when there are no tied ranks, and by formula (153) when there are:
$$\chi^2 = \frac{12S}{mn(n + 1)}, \qquad (152)$$
$$\chi^2 = \frac{12S}{mn(n + 1) - \frac{1}{n - 1}\sum (t^3 - t)}. \qquad (153)$$
The actual value of χ² is compared with the tabulated value for the chosen significance level α (0.05 or 0.01) and the number of degrees of freedom ν = n - 1. If χ²(actual) > χ²(table), then W is significant.
The concordance coefficient is used especially often in expert assessments, for example to determine the degree of agreement between experts' opinions about the importance of an indicator being assessed, or to rank individual units by some criterion. In formula (150), in these cases, m is the number of experts and n is the number of ranked units (or characteristics).
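A short Python sketch of the no-ties case may make the calculation concrete. It assumes the forms W = 12S / (m²(n³ - n)) and χ² = 12S / (mn(n + 1)); the three expert rankings are invented purely for illustration, and the function name is hypothetical.

```python
def concordance_w(rankings):
    """Concordance coefficient for m rankings of the same n objects (no tied ranks).

    Returns (W, chi2, S), where S is the sum of squared deviations of the
    rank sums from their mean.
    """
    m = len(rankings)
    n = len(rankings[0])
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2           # mean of the rank sums
    s = sum((t - mean_sum) ** 2 for t in rank_sums)
    w = 12 * s / (m ** 2 * (n ** 3 - n))
    chi2 = 12 * s / (m * n * (n + 1))    # compare with the table at n - 1 d.f.
    return w, chi2, s

# Hypothetical data: three experts rank four objects
rankings = [
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [2, 1, 4, 3],
]
w, chi2, s = concordance_w(rankings)
print(round(w, 3), round(chi2, 2), s)
```

The χ² value is the same quantity as m(n - 1)W, so either form can be compared with the tabulated value for n - 1 degrees of freedom.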
The projection approximates the rank statistic T quite well, and the difference between them is negligible when n is large. If the hypothesis H0 is true, according to which the components X1, ..., Xn of the random vector X are independent random variables, the projection of the rank statistic can be determined by an explicit formula (see the literature below).
There is an internal connection between the rank statistics τ and ρ. If the hypothesis H0 is true, the projection $\hat{\tau}$ of the Kendall correlation coefficient τ onto the family of linear rank statistics coincides, up to a constant factor, with the Spearman rank correlation coefficient ρ, namely:
$$\hat{\tau} = \frac{2}{3}\left(1 + \frac{1}{n}\right)\rho.$$
From this equality it follows that the correlation coefficient between ρ and τ is equal to
$$\operatorname{corr}(\rho, \tau) = \frac{2(n + 1)}{\sqrt{2n(2n + 5)}},$$
i.e. for large n the rank statistics ρ and τ are asymptotically equivalent (see the literature below).
Lit.: Hájek J., Šidák Z., Theory of Rank Tests (Russian translation from English), Moscow, 1971; Kendall M. G., Rank Correlation Methods, 4th ed., London, 1970. M. S. Nikulin.
Mathematical Encyclopedia. Moscow: Soviet Encyclopedia. I. M. Vinogradov. 1977-1985.
An ordinal scale makes it possible to assign ranks to objects according to any criterion. Metric values are thereby converted into rank values, while differences in the degree to which a property is expressed are preserved. Two rules must be followed during ranking.
Rule of ranking order. It is necessary to decide which object receives the first rank: the one in which the quality is expressed most strongly, or the one in which it is expressed least. Usually this makes no difference and does not affect the final result. Traditionally, the first rank is assigned to the object with the greatest expression of the quality (a higher value means a lower rank). For example, the champion is awarded first place, not the other way round. Even here, though, the results would not change if the reverse order were adopted, so each researcher is free to choose the ranking order. For example, E.V. Sidorenko recommends assigning a lower rank to a smaller value. In some cases this is more convenient, although less customary.
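For example, take an unordered sample whose data need to be ranked: (2, 7, 6, 8, 11, 15, 9). After ordering the sample, we rank it.
Metric data | 15 | 11 | 9 | 8 | 7 | 6 | 2 |
Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Alternative:
Metric data | 2 | 6 | 7 | 8 | 9 | 11 | 15 |
Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 |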
One point deserves separate mention. There is a group of nonparametric tests (the Wilcoxon T-test, the Mann-Whitney U-test, the Rosenbaum Q-test, etc.) for which a lower rank must always be assigned to a smaller value.
Rule of tied ranks. Objects in which a property is expressed equally are assigned the same rank. This rank is the average of the ranks they would have received had they not been equal. For example, suppose a sample containing several identical metric values has to be ranked: (4, 5, 9, 2, 6, 5, 9, 7, 5, 12). After ordering the sample, calculate the arithmetic mean of the tied ranks.
Metric data | 12 | 9 | 9 | 7 | 6 | 5 | 5 | 5 | 4 | 2 |
Preliminary ranking | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Final ranking | 1 | 2.5 | 2.5 | 4 | 5 | 7 | 7 | 7 | 9 | 10 |
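Both ranking rules can be sketched in a few lines of Python. This is an illustrative helper, not part of the original text: the function name and the `higher_first` flag are assumptions, with `higher_first=True` meaning "higher value, lower rank" and `False` meaning "lower value, lower rank".

```python
def rank_sample(values, higher_first=True):
    """Rank a sample; tied values receive the mean of the ranks they occupy.

    Ranks are returned in the original order of `values`.
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=higher_first)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # find the end of the run of equal values starting at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + 1 + j + 1) / 2   # average of preliminary ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

no_ties = [2, 7, 6, 8, 11, 15, 9]
with_ties = [4, 5, 9, 2, 6, 5, 9, 7, 5, 12]

print(rank_sample(no_ties, higher_first=True))
print(rank_sample(no_ties, higher_first=False))
print(rank_sample(with_ties, higher_first=True))
```

Whichever order is chosen, the ranks of a sample of n values always sum to n(n + 1)/2, which is a convenient check on a hand calculation.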
Rank the sample according to the rule “higher value, lower rank”: (111, 104, 115, 107, 95, 104, 104).
Rank the sample according to the rule “lower value, lower rank”: (20, 25, 8, 7, 20, 14, 27).
Combine the two previous samples and rank the result according to the rule “higher value, lower rank”.
Which of the characteristics in Table I are nominal and which are metric?
Convert the awareness indicators from Table I to a rank scale. Identify the levels of expression of the indicators by converting them to a nominal scale.
Table I. Data for processing
students | university profile | awareness | hidden figures | missed | arithmetic | understanding | exception images | analogies | number series | inferences | geometric addition | learning words | average IQ | extroversion-introversion | neuroticism | average mark |
University profile: 0 - the student chose a humanities profile; 1 - the student chose a mathematical or natural science profile
1 Brief history of the emergence of correlation analysis
The use of mathematical and statistical techniques to study correlation dependencies began in the 1870s. Many historians and statisticians, however, trace the development of correlation back to the 1840s, when the French mathematician A. Bravais proposed a formula for the distribution of two random variables that satisfy the requirements of the normal distribution law.
Nevertheless, the English mathematician and statistician K. Pearson, who created the theory in the late nineteenth and early twentieth centuries, is considered the true founder of correlation theory. In it, correlation acts as a form of dialectical connection in which many different causes operate, both necessary and random, some common to both correlated quantities and some particular, affecting only one of them. Moreover, not all regular relationships are causal.
The theory was developed further through other studies after its main provisions had already been established. Moreover, in the study of correlations practice diverged sharply from theory, placing researchers in conditions that did not satisfy the theory's requirements.
The methods for studying correlations and regressions were originally built on data describing quantitatively expressed characteristics. Researchers therefore very soon encountered the problem of correlating qualitative characteristics, for example the relationship between eye color in fathers and sons. The general principle underlying the design of correlation measures for qualitative characteristics was that two qualitative characteristics can be considered unrelated if attribute A appears just as often in the presence of attribute B as in its absence. Various measures were proposed on the basis of this principle, for example Pearson's mean square contingency coefficient and Chuprov's mutual contingency coefficient.
The study of the correlation of qualitative characteristics gave rise, within the general doctrine of correlation, to the so-called theory of ranks and the theory of rank correlation based on it. The English mathematician and statistician M. Kendall, the author of a monograph devoted to the problems of rank correlation, pointed out that the theory of ranks first arose as an offshoot of the theory of random processes. At the initial stage, ranks were most often seen simply as a convenient device that made it possible to do without measuring the absolute value of variables and thereby save time and effort. Later, rank statistics gained recognition on their own merits. Kendall constructed a measure that is also applicable to studying partial correlation between ranks. The modern theory of rank correlation is hard to imagine without M. Kendall's comprehensive studies.
Thus, by the beginning of the twentieth century, mathematical and statistical methods for measuring correlations and regressions had largely developed into a fairly coherent integrated system, including methods of nonparametric statistics and nonparametric rank methods.
2 Nonparametric rank methods
Nonparametric rank methods are a rapidly developing area of mathematical statistics. The history of modern nonparametric rank-based methods is quite short, only about 40 years. Rank methods have emerged as a special area of nonparametric statistics not only because of the nature of the source material but also because of the ideas behind its further use. Today these methods solve many problems in the analysis of economic, statistical, engineering, natural science, sociological, and medical data.
Ranking is a procedure for ordering the objects of study on the basis of preference. A rank is the serial number of a value of a characteristic when the values are arranged in ascending or descending order. As statistical studies conducted over the past 10-15 years have shown, rank methods largely avoid a number of drawbacks that arise when working with small samples whose distribution is unknown. Moving from the observations themselves to their ranks does entail a certain loss of information, but the loss is not too great. Unfortunately, specialized literature on this subject is still lacking.
Recently, expert assessments have become widely used in forecasting and in solving a number of other problems. Rank correlation methods are perhaps the only way to generalize expert assessments in this area.
Rank theory first emerged as an offshoot of the theory of random processes. At the initial stage, ranks were most often seen simply as a convenient device that made it possible to do without measuring the absolute value of variables and thereby save time or effort. Using ranks made it possible to avoid the difficulties of constructing an objective scale of absolute values. Later, rank statistics gained recognition on their own merits.
Below we consider the most common ways of ordering the objects being studied:
The task may simply be to order objects according to the place they occupy in space or time. For example, the cards in a deck were arranged in some order and then shuffled. The new arrangement of cards is also characterized by a certain order, a ranking. Comparing it with the old one shows how thoroughly the cards were shuffled. In this task only the general arrangement of the cards in the deck is of interest, and there is no need to arrange objects by the “increase” or “decrease” of some characteristic common to all of them;
Objects can also be ordered according to some quality for which there is no objective absolute scale. One can, for example, rank rock samples by hardness using the following simple criterion: A is harder than B if A leaves a scratch on B when they touch. If A leaves a scratch on B, and B leaves a scratch on C, then A will leave a scratch on C. Thus, through a series of comparisons, the objects in question can be ordered with reasonable accuracy (unless the set includes two objects of the same hardness). However, this method does not measure the absolute value of rock hardness. It is always possible to establish that A is harder than B, but until a measurement scale of absolute values is constructed, it cannot be said that A is, say, twice as hard as B;
The ordering can be carried out according to the measured (or theoretically calculated) value of some characteristic. For example, people can be arranged by height, and cities by population. It is not always necessary to resort to actual measurement: a group of students can be lined up by height “by eye”. In such cases, however, the criterion used for ranking must allow direct comparisons;
Objects can be ordered according to some characteristic whose value can in principle be measured, but which in practice (or even in theory) cannot be measured for one reason or another. For example, one might order a group of people by their intellectual abilities, assuming that such a quality actually exists and that people can be placed in some order according to its intensity.
In practical applications of ranking-based methods there are sometimes cases where two or more objects are so similar that it is impossible to prefer one of them. When an expert ranks objects on the basis of subjective judgments, this lack of preference reflects either a genuine indistinguishability of the objects or the researcher's inability to find significant differences. Such objects are called tied.
For example, students may be ranked according to their merits or exam scores. The accepted way of assigning numerical ranks to tied objects is to average the ranks they would have had if they were distinguishable. For example, if the third and fourth objects are tied, each is assigned a rank of 3.5; if the objects from the second to the seventh are tied, each receives a rank of 4.5.
This approach is sometimes called the “average rank method.” When there is no basis for choosing between objects, it is clear that they should all be assigned equal ranks. The advantage of this method is that the sum of ranks over all objects remains exactly the same as in a ranking without ties.
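As a quick check of the second example, for the six objects tied in places two through seven the average rank and the preserved rank sum work out as:

$$\frac{2 + 3 + 4 + 5 + 6 + 7}{6} = \frac{27}{6} = 4.5, \qquad 6 \times 4.5 = 27 = 2 + 3 + 4 + 5 + 6 + 7.$$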
In the analysis of socio-economic phenomena it is often necessary to resort to various conditional estimates based on ranks, and the relationship between individual characteristics is then measured using nonparametric coefficients of association.
3 Kendall's rank concordance coefficient
To determine the closeness of the relationship between an arbitrary number of ranked features, a multiple correlation coefficient (concordance coefficient) is used.
In statistical practice there are cases when a set of objects is characterized not by two but by several sequences of ranks, and a statistical relationship has to be established among several variables. Kendall's multiple rank correlation coefficient (concordance coefficient) serves as such a measure. It is determined by the following formula:
$$W = \frac{12D}{m^2(n^3 - n)}, \qquad (1)$$
where W is the concordance coefficient;
D is the sum of squared deviations of the rank sums from their mean, calculated by formula (2);
n is the number of ranked objects;
m is the number of analyzed ordinal variables (for example, the number of experts).
In a sense, W serves as a measure of the commonality of the rankings.
$$D = \sum_{i=1}^{n}\left(\sum_{j=1}^{m} r_{ij} - \frac{m(n + 1)}{2}\right)^2, \qquad (2)$$
where r ij is the rank assigned to object i by expert j;
n is the number of objects.
The values of the concordance coefficient lie in the interval [0, 1].
An increase in the coefficient from 0 to 1 means greater consistency of judgments. If all the judgments coincide, then W = 1.
Testing the significance of the coefficient is based on the fact that, if the null hypothesis of no correlation is true and n > 7, the statistic m(n - 1)W has approximately a χ² distribution with k = n - 1 degrees of freedom. Therefore, the concordance coefficient is significant at level α = 0.05 if m(n - 1)W exceeds the tabulated critical value of χ² for α = 0.05 and n - 1 degrees of freedom.