mona                 package:cluster                 R Documentation

_M_O_N_o_t_h_e_t_i_c _A_n_a_l_y_s_i_s _C_l_u_s_t_e_r_i_n_g _o_f _B_i_n_a_r_y _V_a_r_i_a_b_l_e_s

_D_e_s_c_r_i_p_t_i_o_n:

     Returns a list representing a divisive hierarchical clustering of
     a dataset with binary variables only.

_U_s_a_g_e:

     mona(x)

_A_r_g_u_m_e_n_t_s:

       x: data matrix or data frame in which each row corresponds to an
          observation, and each column corresponds to a variable.  All
          variables must be binary.  A limited number of missing values
          (NAs) is allowed. Every observation must have at least one
          value different from NA.  No variable should have half of its
          values missing. There must be at least one variable which has
          no missing values.  A variable whose non-missing values are
          all identical is not allowed.
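
     The requirements above can be checked before calling 'mona'.  The
     following is a minimal sketch only: 'check_mona_input' is a
     hypothetical helper, not part of the 'cluster' package, and it
     assumes that "half of its values missing" means a share of NAs of
     0.5 or more.

```r
# Hypothetical pre-flight check, written from the argument description;
# mona() performs its own validation and may differ in detail.
check_mona_input <- function(x) {
  x <- as.matrix(x)
  na <- is.na(x)
  # number of distinct non-missing values in each variable
  n_distinct <- apply(x, 2, function(v) length(unique(v[!is.na(v)])))
  c(two_values_per_variable = all(n_distinct == 2),    # binary, not constant
    no_all_NA_observation   = all(rowSums(!na) > 0),   # each row has a value
    less_than_half_NA       = all(colMeans(na) < 0.5), # assumption: < 50% NA
    some_complete_variable  = any(colSums(na) == 0))   # one NA-free variable
}

x <- cbind(f = c(0, 1, NA, 1), g = c(1, 1, 0, 0))
check_mona_input(x)
```

     Each element of the returned logical vector corresponds to one of
     the stated requirements, so a 'FALSE' entry points at the rule
     that the data violate.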

_D_e_t_a_i_l_s:

     'mona' is fully described in chapter 7 of Kaufman and Rousseeuw
     (1990). It is "monothetic" in the sense that each division is
     based on a single (well-chosen) variable, whereas most other
     hierarchical methods (including 'agnes' and 'diana') are
     "polythetic", i.e. they use all variables together.

     The 'mona' algorithm constructs a hierarchy of clusterings,
     starting with one large cluster. Clusters are divided until all
     observations in the same cluster have identical values for all
     variables.
     At each stage, all clusters are divided according to the values
     of one variable. A cluster is divided into one cluster with all
     observations having value 1 for that variable, and another cluster
     with all observations having value 0 for that variable.
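
     A single division step can be sketched as follows.  This is an
     illustration only: 'split_on_variable' is a hypothetical helper,
     not a function of the 'cluster' package, and it assumes a complete
     matrix coded 0/1.

```r
# Hypothetical helper: divide the observations (rows) of a binary
# matrix 'x' into two clusters according to the single variable 'var'.
split_on_variable <- function(x, var) {
  list(ones  = which(x[, var] == 1),   # observations with value 1
       zeros = which(x[, var] == 0))   # observations with value 0
}

x <- cbind(f = c(1, 0, 1, 0), g = c(1, 1, 0, 0))
parts <- split_on_variable(x, "f")
parts
```

     Applying such a step again to each resulting cluster, with a newly
     chosen variable each time, yields the hierarchy described above.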

     The variable used for splitting a cluster is the variable with the
     maximal total association to the other variables, according to the
     observations in the cluster to be split. The association
     between variables f and g is given by a(f,g)*d(f,g) -
     b(f,g)*c(f,g), where a(f,g), b(f,g), c(f,g), and d(f,g) are the
     numbers in the contingency table of f and g. [That is, a(f,g)
     (resp. d(f,g)) is the number of observations for which f and g
     both have value 0 (resp. value 1); b(f,g) (resp. c(f,g)) is the
     number of observations for which f has value 0 (resp. 1) and g has
     value 1 (resp. 0).] The total association of a variable f is the
     sum of its associations to all variables.
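
     The association measure can be written out directly from the
     formula above.  The sketch below is not the code used inside
     'mona'; 'association' and 'total_association' are hypothetical
     helpers, and they assume complete data coded 0/1 and exclude the
     association of a variable with itself from the total.

```r
# Hypothetical helpers, written from the formula in the text.
association <- function(f, g) {
  n_a <- sum(f == 0 & g == 0)   # a(f,g): both 0
  n_b <- sum(f == 0 & g == 1)   # b(f,g): f is 0, g is 1
  n_c <- sum(f == 1 & g == 0)   # c(f,g): f is 1, g is 0
  n_d <- sum(f == 1 & g == 1)   # d(f,g): both 1
  n_a * n_d - n_b * n_c
}

total_association <- function(x) {
  p <- ncol(x)
  assoc <- matrix(0, p, p, dimnames = list(colnames(x), colnames(x)))
  for (i in seq_len(p)) {
    for (j in seq_len(p)) {
      # assumption: a variable's association with itself is left out
      if (i != j) assoc[i, j] <- association(x[, i], x[, j])
    }
  }
  rowSums(assoc)   # signed sum, as the text states it
}

x <- cbind(f = c(0, 0, 1, 1), g = c(0, 0, 1, 1), h = c(0, 1, 0, 1))
total <- total_association(x)
total
```

     In this toy matrix 'f' and 'g' agree on every observation while
     'h' is unrelated to both, so 'f' and 'g' receive the larger total.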

     This algorithm does not work with missing values, so the data are
     revised first, i.e. all missing values are filled in. To do
     this, the same measure of association between variables is used as
     in the algorithm. When variable f has missing values, the variable
     g with the largest absolute association to f is looked up. When
     the association between f and g is positive, any missing value of
     f is replaced by the value of g for the same observation. If the
     association between f and g is negative, then any missing value of
     f is replaced by the value of 1-g for the same observation.
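
     The filling-in rule can be sketched as below.  'impute_binary' is
     a hypothetical helper and not the routine used by 'mona'.  It
     assumes 0/1 coding, considers only variables without missing
     values as candidates for g, computes the association over the
     observations where f is observed, takes the first candidate in
     case of ties, and treats a zero association like a positive one;
     none of these details is specified in the text above.

```r
# Hypothetical helper, written from the description in the text.
impute_binary <- function(x) {
  # a*d - b*c from the 2x2 contingency table of two 0/1 vectors
  association <- function(f, g) {
    sum(f == 0 & g == 0) * sum(f == 1 & g == 1) -
      sum(f == 0 & g == 1) * sum(f == 1 & g == 0)
  }
  has_na <- colSums(is.na(x)) > 0
  complete <- which(!has_na)      # assumption: only complete variables serve as g
  filled <- x
  for (f in which(has_na)) {
    obs <- !is.na(x[, f])         # observations where f is known
    assoc <- vapply(complete,
                    function(g) as.numeric(association(x[obs, f], x[obs, g])),
                    numeric(1))
    best <- which.max(abs(assoc)) # largest absolute association (first on ties)
    g <- complete[best]
    miss <- !obs
    # positive (or zero) association: copy g; negative: copy 1 - g
    filled[miss, f] <- if (assoc[best] >= 0) x[miss, g] else 1 - x[miss, g]
  }
  filled
}

x <- cbind(f = c(0, 0, 1, NA), g = c(1, 1, 0, 0), h = c(0, 1, 0, 1))
res <- impute_binary(x)
res
```

     Here 'g' has the largest absolute association with 'f' and that
     association is negative, so the missing value of 'f' is replaced
     by 1-g for the fourth observation.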

_V_a_l_u_e:

     an object of class '"mona"' representing the clustering. See
     'mona.object' for details.

_S_e_e _A_l_s_o:

     'agnes' for background and references; 'mona.object', 'plot.mona'.

_E_x_a_m_p_l_e_s:

     data(animals)
     ma <- mona(animals)
     ma
     ## Plot similar to Figure 10 in Struyf et al. (1996)
     plot(ma)

