Research Project "Linguistic Diversity".

For graduate students with a strong

interest in languages, computation, and

computational statistics.
 


This project is in collaboration with Giovani Vasconcelos

and Marcelo Gomes from the Physics Department of

the Universidade Federal de Pernambuco, in Recife,

Brazil.


The Characterization of Diversification of Human Languages.

 

Roughly 7000 languages are spoken around the world today, as listed

in [1]. The count is uncertain and depends strongly on what one calls a

language (as opposed to a dialect, for example; see [2] for an explanation of

terms). However, there is little doubt that some of today's languages have common

ancestor languages which may be extinct now. Others appear to exhibit no such

mutual "genetic" relationship. A group of languages (including extinct ones) such

that any two of its members have a (possibly distant) genetic relationship

is called a family. The directed graph representing the relationships in one family

is thought of as a connected tree. An important motivation for comparative

linguistics is the idea that all or almost all languages can be organized into a

relatively small number of such families. The task is then to reconstruct the

genealogical trees involved. It is common to distinguish at least 10 families of

languages, although the exact number is highly controversial (see [3]) and

membership of some families has changed substantially over the last 20 years.

The mathematical properties of the trees formed by the main families (as

currently accepted by the linguistics community) are the object of

this study.

 

The average number of daughter languages that each language in a given tree

gives rise to is called the branching ratio of that tree. Preliminary statistical

studies [5], based on arguments in [4], tempt one to conclude that some 10 of the

best known families fall into 3 clearly defined groups according to the average

branching ratios characterizing their trees. What causes the (apparent?) difference?

Can we devise methods to shore up (or refute) this conclusion?
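
As an illustration of the quantity discussed above, here is a minimal Python sketch of a branching ratio computation. The tree encoding (a dictionary mapping each language to a list of its daughters), the function name branching_ratio, and the toy family are all hypothetical and not taken from [4] or [5]. The sketch also assumes that the average runs only over languages with at least one daughter; averaged over every node of a tree with N nodes, the result would always be (N - 1)/N, which carries no information about the shape of the tree.

```python
from statistics import mean


def branching_ratio(children):
    """Average number of daughters per language that has at least one daughter.

    `children` maps a language name to the list of its daughter languages.
    Languages without daughters may be omitted or mapped to an empty list.
    Assumption: leaves are excluded from the average (see the text above).
    """
    counts = [len(daughters) for daughters in children.values() if daughters]
    return mean(counts) if counts else 0.0


# Hypothetical toy family, invented purely for illustration.
toy_family = {
    "Proto-A": ["A1", "A2", "A3"],
    "A1": ["A1a", "A1b"],
    "A2": [],
}

print(branching_ratio(toy_family))  # two branching languages with 3 and 2 daughters
```

Whether the groups reported in [5] survive a different averaging convention is one of the things such a script would let one check.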

 

We would also like to gather data supporting or refuting the following analysis.

Assume that the branching ratio (per unit time) for a given family is constant.

The genealogical distance between two languages in the same family is the

number of edges of the shortest (undirected) path in the tree connecting them.

Suppose two communities speak languages that were originally closely related.

It is reasonable to believe that the genealogical distance between these

languages grows roughly linearly in time. On the

other hand, as these languages evolve according to a standard diffusion model,

the geographic distance between them should grow as time to the power 1/d,

where d is a diffusion exponent (presumably between 2 and 3). Combining the

two statements ([6]), we find that within a family the genealogical distance between

two languages should be proportional to the geographic distance between them

to the power d.
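
The prediction above could be tested along the following lines. This is only a sketch under stated assumptions: the tree is given as a list of (parent, daughter) pairs, the names genealogical_distance and fit_exponent are hypothetical, and the exponent d is estimated by an ordinary least-squares fit of log(genealogical distance) against log(geographic distance), which is one simple choice among several and not necessarily the method used in [5] or [6]. The numbers in the usage example are synthetic.

```python
import math
from collections import deque


def genealogical_distance(edges, a, b):
    """Number of edges on the path between a and b in the undirected tree."""
    neighbours = {}
    for parent, daughter in edges:
        neighbours.setdefault(parent, []).append(daughter)
        neighbours.setdefault(daughter, []).append(parent)
    # Breadth-first search from a; in a tree the first path found is the only one.
    depth = {a: 0}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            return depth[node]
        for nxt in neighbours.get(node, []):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    raise ValueError("the two languages are not in the same family tree")


def fit_exponent(geographic, genealogical):
    """Least-squares slope of log(genealogical) against log(geographic).

    If genealogical distance is proportional to geographic distance to the
    power d, this slope is an estimate of d. All inputs must be positive.
    """
    xs = [math.log(r) for r in geographic]
    ys = [math.log(g) for g in genealogical]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical toy tree, invented for illustration.
toy_edges = [("Proto-A", "A1"), ("Proto-A", "A2"), ("A1", "A1a"), ("A1", "A1b")]
print(genealogical_distance(toy_edges, "A1a", "A2"))  # path A1a, A1, Proto-A, A2

# Synthetic distances constructed to follow the power law exactly with d = 2.5.
geo = [1.0, 2.0, 4.0, 8.0]
gen = [r ** 2.5 for r in geo]
print(fit_exponent(geo, gen))
```

With real data, genealogical distances are small integers and geographic distances are noisy, so the fitted slope would need an error estimate before it is compared with the expected range for d.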

 

[1] http://www.ethnologue.com

[2] http://www.sil.com

[3] B. Comrie, The World's Major Languages, Oxford University Press, 1987.

[4] M. A. F. Gomes, G. L. Vasconcelos, I. J. Tsang, I. R. Tsang,

Physica A 271, 489, 1999.

[5] F. F. P. Souza, M. A. F. Gomes, G. L. Vasconcelos, unpublished data.

[6] M. A. F. Gomes, personal communication.