
Item analysis: How to calculate the relative information entropy of test and survey items in R?

A basic requirement for test and survey items is that they can detect variance in a latent variable. To do this, an item must discriminate between test subjects and must have a systematic, clear and sufficiently strong relationship with the underlying construct.

 

One way to examine the variability of an item is to compute its relative information content. The relative information content (also called relative entropy) is a dispersion measure for nominally scaled variables, but it can also be calculated for variables at higher scale levels.

 

Mathematically, the relative entropy can be expressed as follows:

 

H_{rel} = -\frac{1}{\log k} \sum_{j=1}^{k} h_j \log h_j

 

where k represents the number of response categories of the investigated item and h_j represents the relative frequency of category j. The more evenly the frequencies are distributed across the categories, the greater the relative entropy. However, the relative entropy also depends on the number of categories, so it cannot be interpreted independently of k. If all response categories occur equally often, the relative entropy (H) reaches its maximum of 1 (100 %). If only one category occurs (i.e. the frequencies of all other categories are 0), H reaches its minimum of 0 (0 %).
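
These two boundary cases can be checked by hand with a few lines of base R. This is just a minimal sketch; the two frequency vectors are made up for illustration and k = 4 is an arbitrary choice:

# Quick check of the two boundary cases for k = 4 response categories
k <- 4

# All categories equally frequent -> relative entropy should be (close to) 1
h_uniform <- rep(1/k, k)
-1/log(k) * sum(h_uniform * log(h_uniform))

# Only one category occurs -> relative entropy is 0
# (categories with frequency 0 contribute nothing, since 0 * log(0) -> 0)
h_single <- c(1, 0, 0, 0)
-1/log(k) * sum(h_single[h_single > 0] * log(h_single[h_single > 0]))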

 

Unfortunately, there is no base R function to compute the relative information entropy of an item, and I am not aware of a package that includes one. For this reason I wrote my own R function:

 

 

Rel.Entropy <- function(x, cat, na.rm = TRUE) {
  
  # Tabulate the responses; drop or keep NAs depending on na.rm
  if (na.rm == TRUE) {
    tbl <- table(x, useNA = "no")
  } else {
    tbl <- table(x, useNA = "always")
  }
  
  # Relative frequencies of the observed categories
  prp <- tbl/sum(tbl)
  
  # Relative entropy: H = -1/log(cat) * sum(h_j * log(h_j));
  # categories with frequency 0 contribute nothing (0 * log(0) -> 0)
  eq1 <- -1/log(cat)
  eq2 <- sum(prp[prp > 0] * log(prp[prp > 0]))
  eq3 <- eq1 * eq2
  res <- paste(round(eq3*100, 2), "%", sep = "")
  
  # Summary table with category, frequency and proportion columns
  sum.tbl <- data.frame(tbl)
  sum.prp <- data.frame(
    "pcnt" = paste(round(100*prp, digits = 2), "%", sep = ""))
  
  sum.all <- data.frame(sum.tbl, sum.prp)
  
  sum.all <- data.frame(
    "Cat"  = sum.all$x,
    "Freq" = sum.all$Freq,
    "Prop" = sum.all$pcnt)
  
  # Return the summary table and the relative entropy value
  RETURN <- list(
    Table   = sum.all,
    Entropy = res)
  
  return(RETURN)
  
}

 

 

Rel.Entropy: computes the relative entropy of an item.

 

x: a vector containing the item responses that can be interpreted as a factor (e.g. numeric response codes or character strings).

 

cat: number of response categories of the item (required, since the relative entropy depends on the number of response categories, not only on the observed frequencies).

 

na.rm: controls whether NA values are counted. Use na.rm = TRUE to remove NAs before tabulation and na.rm = FALSE to keep NAs as an additional category.
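
To illustrate the difference between the two settings, here is a minimal sketch with a made-up item vector containing two missing responses (assuming the Rel.Entropy function above has already been run):

# Hypothetical item with 5 response categories and two missing values
item <- c(1, 2, 2, 3, 3, 3, NA, NA)

Rel.Entropy(item, cat = 5, na.rm = TRUE)   # NAs are dropped before tabulation
Rel.Entropy(item, cat = 5, na.rm = FALSE)  # NAs are kept as an additional category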

 

 

O.k. let's see how it works... First we create three example datasets based on a Likert scale with 10 response categories.

 

 
set.seed(3436)
 
Ex.1 <- round(runif(1000, min = 1, max = 10))
Ex.2 <- round(runif(1000, min = 3, max = 7))
Ex.3 <- round(runif(1000, min = 4, max = 6))
 

 

To visualize the distributions of the example data we can plot histograms:

 


library(ggplot2)
library(ggpubr)
 
Fig.1 <- qplot(
  Ex.1, 
  geom = "histogram", 
  binwidth= 1, 
  xlim = c(0, 11), 
  col =I("black"), 
  fill = I("grey"), 
  xlab = "Example 1", 
  ylab = "Count") +
  theme_classic()
 
Fig.2 <- qplot(
  Ex.2, 
  geom = "histogram", 
  binwidth= 1, 
  xlim = c(0, 11), 
  col =I("black"), 
  fill = I("grey"), 
  xlab = "Example 2", 
  ylab = "Count")+
  theme_classic()
 
Fig.3 <- qplot(
  Ex.3, 
  geom = "histogram", 
  binwidth= 1, 
  xlim = c(0, 11), 
  col =I("black"), 
  fill = I("grey"), 
  xlab = "Example 3", 
  ylab = "Count")+
  theme_classic()
 
ggarrange(Fig.1, Fig.2, Fig.3, 
          labels = c("A", "B", "C"), 
          ncol = 3, 
          nrow = 1)

 

As we can see, the dispersion of the three examples ranges from high (Ex.1) to low (Ex.3).

 

 

Now let's call the Rel.Entropy function from above:

 


Rel.Entropy(Ex.1, cat = 10, na.rm = TRUE)
Rel.Entropy(Ex.2, cat = 10, na.rm = TRUE)
Rel.Entropy(Ex.3, cat = 10, na.rm = TRUE)

 

Running these lines produces the following output:

 

#Example 1
 
> Rel.Entropy(Ex.1, cat = 10, na.rm = TRUE)
 
$Table
   Cat Freq  Prop
1    1   48  4.8%
2    2   93  9.3%
3    3  109 10.9%
4    4  109 10.9%
5    5  102 10.2%
6    6  129 12.9%
7    7  107 10.7%
8    8  119 11.9%
9    9  121 12.1%
10  10   63  6.3%
 
$Entropy
[1] "98.54%"
 
 
#Example 2
 
> Rel.Entropy(Ex.2, cat = 10, na.rm = TRUE)
 
$Table
  Cat Freq  Prop
1   3  141 14.1%
2   4  237 23.7%
3   5  266 26.6%
4   6  231 23.1%
5   7  125 12.5%
 
$Entropy
[1] "68.1%"
 
 
#Example 3
 
> Rel.Entropy(Ex.3, cat = 10, na.rm = TRUE)
 
$Table
  Cat Freq  Prop
1   4  252 25.2%
2   5  500   50%
3   6  248 24.8%
 
$Entropy
[1] "45.15%"

 

 

The function returns a summary table with category, frequency and proportion columns, followed by the relative entropy value as a percentage.

 

As we can see, the relative information content decreases as the variability of the example data decreases (from 98.54 % for Ex.1 to 45.15 % for Ex.3).
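
In practice, we usually want to screen a whole set of items at once. One way to do this (just a sketch, assuming all items share the same number of response categories) is to apply the function to each column of a data frame and collect only the entropy values:

# Combine the example items into one data frame
items <- data.frame(Ex.1, Ex.2, Ex.3)

# Extract only the relative entropy value for each item
sapply(items, function(i) Rel.Entropy(i, cat = 10, na.rm = TRUE)$Entropy)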

 

The R scripts can be downloaded at: https://osf.io/ky3f2/