Pāṇini and Information Theory – Back to the Future

Professor Gérard Huet is a French Computer Scientist, Mathematician and Computational Linguist. CSP connected with Professor Huet to ask him about Sanskrit and Computational Linguistics.

Recipient of the prestigious EATCS Award in 2009, Professor Huet is Emeritus at Inria (the French National Institute for Research in Computer Science and Automation) and was Directeur de Recherche de Classe Exceptionnelle from 1989 to 2013. He is a member of the French Academy of Sciences and of Academia Europaea.

Since 2000, he has worked on and contributed immensely to Computational Linguistics. Author of a Sanskrit-French hypertext dictionary, he has developed various tools for the phonetic, morphological and lexical analysis of Sanskrit, such as the Zen toolkit. From this research evolved a new paradigm for relational programming, inspired by Samuel Eilenberg's X-machines.

Professor Huet was Program Chair and local organizer of the First International Sanskrit Computational Symposium in Paris in October 2007, a member of the Program Committee of the second at Brown University in 2008, and co-Program Chair of the third in Hyderabad in January 2009, the fourth at JNU in Delhi in 2010, the fifth at IIT Bombay in January 2013, and the sixth at IIT Kharagpur in October 2019. He is a founding member of the Steering Committee of this series of symposia.

Pro-Vice Chancellor of the University of Hyderabad, Professor Rajasekhar, with Professor Huet

He has been principal investigator on the French side of a joint team on Sanskrit Computational Linguistics between Inria and the University of Hyderabad since 2007. Professor Huet’s talk at the University of Hyderabad this week, titled Pāṇini’s Machine, was about how Pāṇini’s grammar may be thought of as the operational manual of an abstract machine. A note on the lecture by the university says: “this machine performs the grammatical operations prescribed or permitted in the Aṣṭādhyāyī sutras. It produces recursively a correct Sanskrit enunciation as a sign pairing the phonetic signifier and its signified sense. Its proper operation yields both the utterance as a phonetic stream and the intended meaning of a correct Sanskrit sentence. This view places Pāṇini as a precursor in a long list of automata inventors such as Turing, Babbage, Pascal, thus adding to his fame as a renowned linguist.”

In his talk, Professor Huet briefly explained how ‘the formal methods used in the Aṣṭādhyāyī anticipate computer science’s control and data structures and show a keen understanding of information theory’.

What interests you most about
Sanskrit? What was your first introduction to the language?

Professor Huet: I was interested in Sanskrit as a key to understanding the traditional culture of ancient India, and was fascinated by the fact that this culture is still alive, as opposed to, say, Greek culture, where all that remains are frozen artefacts like ruins of ancient monuments, and Homeric literature that has lost its connection to the present.

How can the design and implementation
of computer-aided processing tools help in analysing the enormous store of knowledge
and literature available as Sanskrit text?

Professor Huet: These tools may help in several ways.

Firstly, they allow texts to be preserved better than by just letting physical documents deteriorate with time. A lot of manuscripts are still only available in fragile forms such as palm leaves or birch bark; documents which have been digitized in photographic form are less useful than searchable character-level representations, which in turn are less useful than word-level segmented documents, and so on.

Our tools allow the representation of marked-up documents, where words are annotated with their lemmatization, indicating their morphological parameters (case, number, gender, person, tense, voice, etc.) or even their semantic parameters (dependency graphs, anaphora antecedents, word disambiguation, named-entity links, etc.). These annotations can be considered a kind of first-level interpretation of the texts. For instance, सेनाभाव may be segmented as senā-bhāva (existence of an army) or as senā-abhāva (non-existence of an army). Choosing one or the other gives opposite meanings. Even a text such as the Bhagavadgītā is not segmented in the same way by Śaṅkara and Madhva.
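As a rough illustration of what such markup might look like, here is a minimal sketch in Python; the field names and glosses are invented for the example, and this is not the actual annotation format of Professor Huet's tools.

```python
# A minimal sketch of a marked-up word and of two competing segmentations of
# the same written form.  Field names and glosses are illustrative only.
from dataclasses import dataclass

@dataclass
class Pada:
    form: str    # surface word obtained after undoing sandhi
    lemma: str   # stem or root it is lemmatized to
    morph: str   # morphological parameters (case, number, gender, ...)

# Two competing analyses of the written form "senābhāva":
analysis_1 = [Pada("senā", "senā", "f. compound stem"),
              Pada("bhāva", "bhāva", "m. stem: existence")]        # senā-bhāva
analysis_2 = [Pada("senā", "senā", "f. compound stem"),
              Pada("abhāva", "abhāva", "m. stem: non-existence")]  # senā-abhāva (ā + a -> ā)

# A marked-up corpus records which analysis the annotator retained, so that
# search, alignment and training data all inherit that interpretive decision.
chosen = {"surface": "senābhāva", "segmentation": analysis_1}
print(chosen)
```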

This allows the progressive establishment of data banks of marked-up texts, which may be subject to error correction, alignment of versions, establishment of phylogenetic trees for use by philologists in dating versions, detection of inter-textuality relations, and preparation of critical editions.

Our grammar-informed tools are thus preparing the ground for the use of more automated statistical or neural analysers, trained on our tagged corpus, which will be able to scan and analyse massive quantities of text.

Another use of our tools is to give new methods for teaching the language, alleviating the burdensome initial investment in learning the script, complex phonology rules, complex un-sandhi analysis, and complex morphology: the student may dive directly into the text and concentrate on its meaning with the help of dictionaries linked to the analysed texts. This is very important, since it is next to impossible to translate Sanskrit into non-Indian languages. Not only are terms like dharma, karma, moksha, etc. very hard to translate without their context, but poetry uses complex figures of speech (alaṃkāra) such as upamā, yamaka, rūpaka, sasaṃdeha, paryāyokta, śleṣa, virodha, etc., which are totally untranslatable and must be enjoyed in the original text.

Can you please briefly explain your segmenter for Sanskrit?

Professor Huet: The segmenter is lexicon-directed and uses finite-state transducer technology. That is, I build a database of inflected forms by expanding morphology generation processes over a lexicon of elementary word stems and roots, and then I build specialized transducers that segment the text by guessing sandhi transitions between padas. The full technical explanation and justification is given in http://gallium.inria.fr/~huet/PUBLIC/SALA.pdf.
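To make the idea concrete, here is a deliberately naive Python sketch of lexicon-directed segmentation with sandhi undoing. The tiny lexicon, the IAST transliteration and the single sandhi rule are assumptions made for illustration; the actual tool compiles the inflected-form lexicon into finite-state transducers (the Zen toolkit, written in OCaml) rather than searching recursively.

```python
# Toy lexicon-directed segmenter: enumerate ways of cutting a written form
# into lexicon words, guessing the sandhi transitions at the junctions.
# Illustrative data only; IAST with precomposed characters is assumed.

LEXICON = {"senā", "bhāva", "abhāva"}   # stand-in for a bank of inflected forms
SANDHI = {("ā", "a"): "ā"}              # euphonic rule: ā + a -> ā at a word junction

def segment(text):
    """Return all segmentations of `text` into lexicon words, undoing sandhi."""
    if text == "":
        return [[]]
    results = []
    for w in LEXICON:
        # junction without euphonic change
        if text.startswith(w):
            results += [[w] + rest for rest in segment(text[len(w):])]
        # junction where the final f of w fused with the initial i of the next word into j
        for (f, i), j in SANDHI.items():
            if w.endswith(f):
                written = w[: len(w) - len(f)] + j
                if text.startswith(written) and len(text) > len(written):
                    remainder = i + text[len(written):]   # restore the swallowed initial
                    results += [[w] + rest for rest in segment(remainder)]
    return results

print(segment("senābhāva"))
# [['senā', 'bhāva'], ['senā', 'abhāva']]  (order may vary)
```

Even this toy version exhibits the essential point: the same written stream admits several pada decompositions, and the grammar-informed machinery enumerates them so that a human or a statistical model can choose among them.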

How does Paninian grammar anticipate and show an understanding of information theory?

Professor Huet: This is not easy to explain succinctly. You have to look into Paninian encodings and see how these encodings can be put in the context of coding theory in the sense of Shannon and of minimizing entropy. In a nutshell, one may say that Pāṇini used encodings that permitted optimal compression of his notations and allowed him to express the grammar in 4000 terse sutras, whereas a more naive organisation would have necessitated a much larger repertory of rules, and thus precluded complete memorization of the grammar.

Another remark of this nature is that the Shivasutras are a way of expressing very concisely all the subsets of phonemes that are needed to express regularities in the grammar, like “for all nasals, do this”, where the set of nasals is expressed by the condensed notation (pratyāhāra) ñam.

The optimality of the representation
of the Shivasutras has been recently demonstrated by the German scholar Wiebke
Petersen.
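The mechanism can be sketched in a few lines of Python: a pratyāhāra names a starting phoneme and a closing marker (anubandha), and denotes every phoneme of the Shivasutras lying between the two, the markers themselves being excluded. The list below uses one common romanisation, and the function is only an illustration of the encoding, not of Professor Huet's or Petersen's work.

```python
# Expanding a pratyāhāra over the Shivasutras: take the contiguous span from
# the named phoneme up to the named marker, dropping the markers themselves.
SHIVASUTRAS = [
    ("a", "i", "u", "Ṇ"),
    ("ṛ", "ḷ", "K"),
    ("e", "o", "Ṅ"),
    ("ai", "au", "C"),
    ("h", "y", "v", "r", "Ṭ"),
    ("l", "Ṇ"),
    ("ñ", "m", "ṅ", "ṇ", "n", "M"),
    ("jh", "bh", "Ñ"),
    ("gh", "ḍh", "dh", "Ṣ"),
    ("j", "b", "g", "ḍ", "d", "Ś"),
    ("kh", "ph", "ch", "ṭh", "th", "c", "ṭ", "t", "V"),
    ("k", "p", "Y"),
    ("ś", "ṣ", "s", "R"),
    ("h", "L"),
]

def pratyahara(start, marker):
    """Expand a pratyāhāra such as ('ñ', 'M') into the phonemes it denotes."""
    flat = [p for sutra in SHIVASUTRAS for p in sutra]
    markers = {sutra[-1] for sutra in SHIVASUTRAS}   # the anubandhas (capitalized here)
    i = flat.index(start)          # first occurrence of the starting phoneme
    j = flat.index(marker, i)      # the marker that closes the span
    return [p for p in flat[i:j] if p not in markers]

print(pratyahara("ñ", "M"))   # ñam, the nasals:  ['ñ', 'm', 'ṅ', 'ṇ', 'n']
print(pratyahara("a", "C"))   # ac, the vowels:   ['a', 'i', 'u', 'ṛ', 'ḷ', 'e', 'o', 'ai', 'au']
```

The economy lies in the ordering: because the phoneme classes the grammar needs are all contiguous spans of this single sequence, each class can be named with just two symbols.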

Sanskrit cannot be reduced to a universal system of signs; it is also co-extensive with Indian culture. How can structural semantics take this into account?

Professor Huet: Structural semantics is universal, and in this sense is not sufficient to represent cultural context. Paninian methods are also to an extent universal, and have been used to describe other languages by the Akshar Bharati group of IIIT Hyderabad (Rajiv Sangal, Chaitanya, Amba Kulkarni, Dipti Sharma); thus Paninian methods are not specific to Indian cultural aspects. Cultural aspects go beyond the grammar. They are of a semiotic nature, beyond linguistics. You must go into literary theory (Anandavardhana, Abhinavagupta, Dandin, etc.) and aesthetics in order to account for cultural aspects.