Research Project "Linguistic Diversity".
For graduate students with a strong interest in languages,
computation, and computational statistics.
This project is in collaboration with Giovani Vasconcelos
and Marcelo Gomes from the Physics Department of
the Universidade Federal de Pernambuco, in Recife,
Brazil.
The Characterization of Diversification of Human Languages.
There are on the order of 7000 spoken languages around the world today, as listed
in [1]. Much is uncertain, and this number depends greatly on what one calls a
language (as opposed to a dialect, for example; see [2] for an explanation of
terms). However, there is little doubt that some of today's languages have common
ancestor languages which may be extinct now. Others appear to exhibit no such
mutual "genetic" relationship. A group of languages (including extinct ones) such
that any two of its members have a (possibly distant) genetic relationship
is called a family. The directed graph representing the relationships in one family
is thought of as a connected tree. An important motivation for comparative
linguistics is the idea that all or almost all languages can be organized into a
relatively small number of such families. The task is then to reconstruct the
genealogical trees involved. It is common to distinguish at least 10 families of
languages, although the exact number is highly controversial (see [3]) and
membership of some families has changed substantially over the last 20 years.
The mathematical properties of the trees formed by the main families (as
currently accepted by the linguistics community) are the object of
this study.
The average number of daughter languages that each language gives rise to in a
given tree is called the branching ratio of that tree. From preliminary statistical
studies [5] based on arguments in [4], one is tempted to conclude that some 10 of
the best-known families fall into 3 clearly defined groups according to the average
branching ratios characterizing their trees. What causes the (apparent?) difference?
Can we devise methods to shore up (or refute) this conclusion?
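As an illustration of the definition of the branching ratio, the following minimal
Python sketch computes it for a toy tree. The parent-map representation, the
function name branching_ratio, and the language labels are hypothetical and are not
taken from [4] or [5]. The sketch assumes that the average runs over languages with
at least one daughter. Averaging over every node of a tree with N nodes would always
give (N-1)/N, so the include_leaves option is offered only for comparison.

```python
from collections import Counter


def branching_ratio(parent_of, include_leaves=False):
    """Average number of daughters per language in one family tree.

    parent_of maps each language to its parent; the root maps to None.
    By default (an assumption of this sketch) only languages that have at
    least one daughter are counted in the average.
    """
    # Count how many daughters each parent language has.
    daughters = Counter(p for p in parent_of.values() if p is not None)
    n = len(parent_of) if include_leaves else len(daughters)
    if n == 0:
        return 0.0
    return sum(daughters.values()) / n


# Hypothetical toy family: "proto" has three daughters, "A" has two.
toy_family = {
    "proto": None,
    "A": "proto",
    "B": "proto",
    "C": "proto",
    "A1": "A",
    "A2": "A",
}

ratio = branching_ratio(toy_family)
ratio_all_nodes = branching_ratio(toy_family, include_leaves=True)
```

For this toy tree the two branching languages have 3 and 2 daughters, so the default
average is 2.5, while the all-nodes average is 5/6.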
We would also like to gather data that support or refute the following analysis.
Assume that the branching ratio (per unit time) for a given family is constant.
The genealogical distance between two languages in the same family is the
number of edges of the shortest (undirected) path in the tree connecting them.
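The genealogical distance just defined can be computed directly from a tree stored
as parent links. The sketch below is one possible approach, not a method taken from
the cited works: it walks from each language up to the root and adds the number of
steps to the first shared ancestor. The names ancestors, genealogical_distance and
the toy labels are hypothetical.

```python
def ancestors(parent_of, lang):
    """Return the chain [lang, parent, grandparent, ..., root]."""
    chain = []
    while lang is not None:
        chain.append(lang)
        lang = parent_of[lang]
    return chain


def genealogical_distance(parent_of, a, b):
    """Number of edges on the undirected tree path between languages a and b."""
    # Steps needed to reach each ancestor of a (a itself is 0 steps away).
    steps_from_a = {lang: i for i, lang in enumerate(ancestors(parent_of, a))}
    # Walk up from b until we meet the first ancestor shared with a.
    for steps_from_b, lang in enumerate(ancestors(parent_of, b)):
        if lang in steps_from_a:
            return steps_from_a[lang] + steps_from_b
    raise ValueError("languages are not in the same family")


# Hypothetical toy family, root first.
toy_family = {
    "proto": None,
    "A": "proto",
    "B": "proto",
    "A1": "A",
    "A2": "A",
}

sisters = genealogical_distance(toy_family, "A1", "A2")
cousins = genealogical_distance(toy_family, "A1", "B")
```

Here the sister languages A1 and A2 are 2 edges apart (through A), and A1 and B are
3 edges apart (through A and proto).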
Suppose two communities speak two languages that were originally closely related.
It is reasonable to believe that the genealogical distance between the languages
spoken by the two communities grows roughly linearly in time. On the
other hand, as these languages evolve, according to a standard diffusion model,
the geographic distance between them should grow as time to the power 1/d,
where d is a diffusion exponent (presumably between 2 and 3). Eliminating time
between the two statements ([6]), we find that within a family the genealogical
distance between two languages should be proportional to the geographic distance
between them to the power d.
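A power law of this kind appears as a straight line of slope d when the logarithm of
the genealogical distance is plotted against the logarithm of the geographic
distance. As a rough sketch of how data might be confronted with the prediction, the
Python code below estimates that slope by ordinary least squares. The function name
fit_power_law_exponent is hypothetical, the choice of a simple log-log fit is an
assumption of this sketch, and the numbers are synthetic values generated from an
exact power law, not measurements.

```python
import math


def fit_power_law_exponent(geo_distances, gen_distances):
    """Least-squares slope of log(genealogical) against log(geographic)."""
    xs = [math.log(r) for r in geo_distances]
    ys = [math.log(g) for g in gen_distances]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct geographic distances")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic data only: genealogical distance = 0.5 * (geographic distance) ** 2.5
geo = [1.0, 2.0, 4.0, 8.0]
gen = [0.5 * r ** 2.5 for r in geo]

d_estimate = fit_power_law_exponent(geo, gen)
```

On this noise-free synthetic input the estimate recovers the exponent 2.5 up to
rounding error. Real data would scatter around the line, and a more careful analysis
would also need to quantify the uncertainty of the fitted slope.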
[3] B. Comrie, The World's Major Languages, Oxford University Press, 1987.
[4] M. A. F. Gomes, G. L. Vasconcelos, I. J. Tsang, I. R. Tsang,
Physica A 271, 489, 1999.
[5] F. F. P. Souza, M. A. F. Gomes, G. L. Vasconcelos, unpublished data.
[6] M. A. F. Gomes, personal communication.