Our parser aims to meet an important need of the serious student: there is no existing software tool that can split a verse of a complex text into clauses and then identify the components of each clause (e.g., the Subject, Object, and Verb). The phenomenon of 'free' word order in Sanskrit makes it very difficult for the reader to figure out the internal structure of a verse. Our parser has successfully parsed the first 200+ verses of the Srimad Bhagavad Gita (with defects on the order of 2-3%); the rules of the parser will continue to be fine-tuned as other texts are taken up for parsing. We focus only on the syntactic aspects of each term in the sentence, and expect the student to consult commentaries by leading experts for the meaning of each term.
For a language such as Sanskrit, it is very important that translations / commentaries should be attempted only after a detailed grammatical analysis is completed. If a term is attached to the wrong clause, or is assumed to have the wrong Case (or if 'euphonic combinations' and 'compound words' are unpacked incorrectly), the interpretation of the stanza could be distorted significantly. These are real risks facing even expert commentators. In the absence of a detailed grammatical analysis by an author, it is also very difficult for a reader to judge whether an interpretation is correct, or whether it is based on a flawed grammatical analysis. A detailed grammatical analysis helps by documenting (for posterity) the underlying basis of the interpretation.
A syntactic parser must split a sentence into clauses, and thereafter identify the Subject, Verb, and Object of each clause (wherever applicable). Sometimes the Subject or the Verb of a clause may be elided, and the parser must reconstruct it from other cues (e.g., if the Verb is a First-Person Singular conjugation, the elided Subject can be read as 'I'). Further, in the case of Relative clauses, the parser will also try to relate each 'Trace' term of a dependent clause to its antecedent in another clause (as in Srimad Bhagavad Gita, Chapter 1.7). It must be noted that multiple candidate declined/conjugated forms are very commonly applicable to each term; the parser tries to select the appropriate candidate for each term given its syntactic context. For example, 'uvAcha' in the example shown below (Srimad Bhagavad Gita, Chapter 1.1) has two verbal candidates, i.e., a Third-Person Singular form and a First-Person Singular form ('uvAcha:III-S:vach:2:P:perfect:speak/read/relate/state' and 'uvAcha:I-S:vach:2:P:perfect:speak/read/relate/state'). Given the presence of a Nominative Third-Person Subject (dhRutarASHTras), the parser can eliminate the First-Person Singular candidate. However, this process of elimination/selection gets very complicated in a sentence with multiple clauses, ambiguous terms with multiple candidates, and 'free' word-order (all of which are very commonly found in Sanskrit verse).
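The elimination of the First-Person candidate for 'uvAcha' can be illustrated with a minimal Python sketch. It is not the parser's actual code: the class VerbCandidate and the functions parse_verb_candidate and filter_by_subject are hypothetical names introduced here, and the sketch assumes that only the first three colon-separated fields of a candidate entry (form, Person-Number, root) are needed for this check.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VerbCandidate:
    """Hypothetical container for one verbal candidate of a surface form."""
    form: str
    person: str  # "I", "II" or "III"
    number: str  # "S" (singular), "D" (dual) or "P" (plural)
    root: str


def parse_verb_candidate(entry: str) -> VerbCandidate:
    """Read the first three colon-separated fields of a candidate entry."""
    form, person_number, root = entry.split(":")[:3]
    person, number = person_number.split("-")
    return VerbCandidate(form, person, number, root)


def filter_by_subject(candidates: List[VerbCandidate],
                      person: str, number: str) -> List[VerbCandidate]:
    """Keep only the verb candidates that agree with the Subject."""
    return [c for c in candidates if c.person == person and c.number == number]


entries = [
    "uvAcha:III-S:vach:2:P:perfect:speak/read/relate/state",
    "uvAcha:I-S:vach:2:P:perfect:speak/read/relate/state",
]
candidates = [parse_verb_candidate(e) for e in entries]

# dhRutarASHTras is a Nominative Singular noun, i.e. a Third-Person Singular Subject.
surviving = filter_by_subject(candidates, person="III", number="S")
print([f"{c.form}:{c.person}-{c.number}" for c in surviving])  # ['uvAcha:III-S']
```

In a real sentence the Subject itself may be ambiguous or elided, so a filter of this kind would be only one of several interacting constraints.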
NOTE: Clicking on the links shown for the root/base word in the parsed output will take you to an external site hosted by the University of Cologne (Cologne Digital Sanskrit Dictionaries server), and show the dictionary meaning from the Monier-Williams "Sanskrit-English Dictionary 1899" [1], the Cappeller "Sanskrit-English Dictionary 1891" [2], and the Sorensen et al. "Index to Names used in the Mahabharata 1904" [3].
There are several tasks involved in parsing Sanskrit sentences, as described here. The reader will note that 'sandhi analysis' is just the first step in parsing a Sanskrit sentence.
Our rule-based Sanskrit Parser selects one from amongst a number of candidate declensions, conjugations, and indeclinables, and associates it with each word. The output of the software includes the clause structure and a rough translation of those words that are present in the lexicon (e.g., uvAcha:III-S:vach:2:P:perfect:speak/read/relate/state, and dhRutarASHTraHa:NOM-S:dhRutarASHTra:m:noun:Dhritarashtra) (see Notes 1 and 2).
It must be noted here that the grammatical analysis of non-prose texts (such as the Srimad Bhagavad Gita) presents the highest level of difficulty due to the problem of 'free' word-order (see Note 3). However, we have undertaken such an exercise in order to develop a highly nuanced, bottom-up understanding of 'free' word-order in Classical Sanskrit. In the succeeding chapter, we present the results of this grammatical analysis of the Srimad Bhagavad Gita, from which it may be observed that our parser successfully handles extremely complex issues. Our Sanskrit parser has evolved rapidly to a very advanced level, after encountering several unique challenges in each chapter of this text. This is an ongoing project, and the results of processing each chapter will be uploaded to the website soon after it is completed.
Example 1
Let us now examine the steps involved in the analysis of the following printed form of a stanza (Srimad Bhagavad Gita, Chapter 1.1):
धृतराष्ट्र उवाच
धर्मक्षेत्रे कुरुक्षेत्रे समवेता युयुत्सवः
मामकाः पाण्डवाश्चैव किमकुर्वत सञ्जय
Printed: dhRutarASHTra uvAcha . dharmakSHetre kurukSHetre samavetA yuyutsavaHa , mAmakAHa pANNDavAshchEva kimakurvata saNjaya .
Step 1. The first step is a 'sandhi analysis' of the sentence, including the splitting of 'euphonic combinations' (सन्धि). This analysis requires the parser to have a sound understanding of a number of Paninian rules (in this case, of course, the applicable rules are simple). The parser must also have an in-depth understanding of the conjugations/declensions of complex and irregular verbs and nouns, such as some of the verbs in the two samples discussed here, i.e., वच् (vach:2P) and ब्रू (brU:2P), which belong to the irregular verb class 2. Further, there is considerable ambiguity during this splitting process, as there may be multiple valid ways to split a combination. The sandhi analysis of the above sentence produces the following underlying terms for the stanza:
धृतराष्ट्रः उवाच
धर्मक्षेत्रे कुरुक्षेत्रे समवेताः युयुत्सवः
मामकाः पाण्डवाः च एव किम् अकुर्वत सञ्जय
Underlying: dhRutarASHTras uvAcha . dharmakSHetre kurukSHetre samavetAs yuyutsavas , mAmakAs pANNDavAs cha eva kim akurvata saNjaya .
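The ambiguity of Step 1 can be made concrete with a toy Python sketch that proposes underlying forms for a word-final 'a', 'A', or visarga (written 'Ha' in this transliteration). The function underlying_candidates and its three patterns are assumptions made purely for illustration; they cover only the word-final patterns seen in this stanza and are not an implementation of the Paninian rules. The sketch ignores the phonetic class of a following consonant, so it deliberately over-generates and leaves the filtering to later steps.

```python
from typing import List, Optional

# Initial vowels (other than short 'a') in the transliteration used on this page.
VOWELS_EXCEPT_SHORT_A = set("AiIuUeEoO")


def underlying_candidates(word: str, next_word: Optional[str]) -> List[str]:
    """Propose underlying forms of `word`, given the word that follows it."""
    if word.endswith("Ha"):
        # A visarga is restored to a final 's' (a final 'r' is ignored here).
        return [word[:-2] + "s"]
    candidates = [word]  # the unchanged reading is always kept
    if next_word:
        if word.endswith("A"):
            # A final 's' may have been dropped after long 'A'.
            candidates.append(word + "s")
        elif word.endswith("a") and next_word[0] in VOWELS_EXCEPT_SHORT_A:
            # A final 's' may have been dropped after short 'a' before a vowel.
            candidates.append(word + "s")
    return candidates


print(underlying_candidates("dhRutarASHTra", "uvAcha"))
print(underlying_candidates("samavetA", "yuyutsavaHa"))
print(underlying_candidates("yuyutsavaHa", "mAmakAHa"))
```

Each word that receives more than one candidate multiplies the number of readings that the later steps must consider.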
Step 2. The next step is to figure out the clauses in the sentence, and the component phrases in each. This is considerably difficult because the component words of a phrase may be in different parts of the sentence (e.g., see the second example given below). Furthermore, the same declined/conjugated word form may have been derived from different root/base words, or may be shared by different Case/Number/Gender combinations. For example, the declined form भीमार्जुनसमाः ('equal to Bhima and Arjuna') has 23 different nominal candidates across different word bases in our lexicon/rule base. (There are multiple nominal word bases, each with two or more declined forms ending in the pattern आः, which yields 23 nominal candidates in Classical Sanskrit alone; some of these may be extremely rare but cannot be ruled out, especially in the case of proper names.) In addition, there may also be indeclinables, derived forms, and conjugated verb forms that share the same surface form (the parser may re-evaluate its earlier decision to treat the term as a nominal, verb, or indeclinable). In the above example, the declined form समवेताः (assembled/gathered) has 5 different nominal candidates for each word base (Masculine: NOM-P, VOC-P. Feminine: NOM-P, ACC-P, VOC-P). Similarly, the verb उवाच (speak/read/relate/state) has two verb candidates for each root word (Perfect: I-S, First-Person Singular; III-S, Third-Person Singular), and there is no reason why there cannot be a proper name that shares the same form (say, in the VOC-S declined form). The availability of multiple candidate declined/conjugated forms is very common in Sanskrit sentences, and it is not easy for the parser to handle because of the resulting combinatorial explosion of the search space.
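The growth of the search space can be shown with a few lines of Python. The candidate labels below are taken from the counts quoted in Step 2 (5 nominal candidates for समवेताः, 2 verbal candidates for उवाच, 23 for भीमार्जुनसमाः); combining the three counts in a single product is a hypothetical illustration, since the three terms do not occur together in one stanza.

```python
from itertools import product
from math import prod

# Candidate labels for two terms, as listed in Step 2.
samavetAHa_tags = ["Masc:NOM-P", "Masc:VOC-P", "Fem:NOM-P", "Fem:ACC-P", "Fem:VOC-P"]
uvAcha_tags = ["I-S", "III-S"]

# Every pairing of one candidate per term is a reading the parser may have to test.
combinations = list(product(samavetAHa_tags, uvAcha_tags))
print(len(combinations))  # 10

# Hypothetical clause containing three terms with 23, 5 and 2 candidates.
candidate_counts = [23, 5, 2]
search_space = prod(candidate_counts)
print(search_space)  # 230
```

Because the counts multiply, each additional ambiguous term enlarges the space of readings by a factor equal to its number of candidates, which is why constraints that prune candidates early are valuable.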
Clearly, choosing the wrong candidate form can result in a very bad translation or require back-tracking (while back-tracking is one of the key strengths of a rule-based system, it may be impractical for parsing long sentences). For example, if the parser initially chooses the VOC-P candidate for समवेताः, then it must back-track when it reaches सञ्जय (which has a single VOC-S candidate), as it is not permissible for the components of a single Noun Phrase to have different Numbers (in this case, VOC-P and VOC-S). Sanskrit has an agreement rule, which requires that the components of a Noun Phrase (including adjectives, as in this case) agree in Number, Gender, and Case (with some exceptions). Agreement between the verb and the Noun Phrase subject, Subcategorization rules, and Selectional restrictions are other constraints that help in parsing the sentence. Eliminating various candidates, the parser produces the following result (see Note 4):
Printed: dhRutarASHTra uvAcha . dharmakSHetre kurukSHetre samavetA yuyutsavaHa , mAmakAHa pANNDavAshchEva kimakurvata saNjaya .
dhRutarASHTra uvAcha ... [8.3.17, 8.3.19, 1.3.2, 8.2.66] bho bhago agho apUrvasya yo'shi
samavetA yu ... [8.3.17, 8.3.22, 1.3.2, 8.2.66] bho bhago agho apUrvasya yo'shi
mAmakAHa p ... [8.3.15, 1.3.2, 8.2.66] kharavasAnayo visarjanIyaHa
pANNDavAshchEva ... [8.4.40, 8.3.34, 8.3.15, 1.3.2, 8.2.66] stoHa shchunA shchuHa
chEva ... [6.1.88, 1.1.1] vRuddhirechi
kimakurvata ... [6.1.72] saNhitAyAm
pANNDavAHa ch ... [8.3.15, 1.3.2, 8.2.66] kharavasAnayo visarjanIyaHa
Underlying: dhRutarASHTras uvAcha . dharmakSHetre kurukSHetre samavetAs yuyutsavas , mAmakAs pANNDavAs cha eva kim akurvata saNjaya .
A: dhRutarASHTraHa uvAcha
A.1:
dhRutarASHTraHa:NOM-S:dhRutarASHTra:Masc.:Noun:Dhrutarashtra
uvAcha:III-S:vach:2:P:VerbPerfect:spoke
B: dharmakSHetre kurukSHetre samavetAHa yuyutsavaHa mAmakAHa pANNDavAHa cha eva kim akurvata saNjaya
B.1:
dharmakSHetre:LOC-S:dharman-kSHetra (dharmakSHetra) :Masc.:Noun:samAsa_tatpuruSHa(GEN):Link_gov_samavetAHa:in Dharmakshetra
kurukSHetre:LOC-S:kuru-kSHetra (kurukSHetra) :Masc.:Noun:samAsa_tatpuruSHa(GEN):in Kurukshetra
samavetAHa:NOM-P:samaveta:Masc.:Adj:past_participle_passive_kta_2P_sam-ava-i:Link_subj_yuyutsavaHa:[those who were] assembled
yuyutsavaHa:NOM-P:yuyutsu:Masc.:Noun:[those who were] eager to fight
mAmakAHa:NOM-P:mAmaka:Masc.:Noun:mine
pANNDavAHa:NOM-P:pANNDava:Masc.:Noun:sons of Pandu
cha:Indeclinable:and
eva:Indeclinable:also
kim:ACC-S:kim:Neut.:Pronoun:what
akurvata:III-P:kRu:8:A:VerbImperfect:did [they do]
saNjaya:VOC-S:saNjaya:Masc.:Noun:Sanjaya
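The Noun Phrase agreement constraint used in Step 2 can be sketched in Python as follows. The Nominal tuple and the functions agrees and first_agreeing are hypothetical names, only two of the five candidates of समवेताः are listed, and the exceptions to the agreement rule mentioned in the text are not modelled.

```python
from typing import List, NamedTuple, Optional


class Nominal(NamedTuple):
    """Hypothetical record for one nominal candidate."""
    form: str
    case: str
    number: str
    gender: str


def agrees(a: Nominal, b: Nominal) -> bool:
    """Components of a Noun Phrase must share Case, Number and Gender."""
    return (a.case, a.number, a.gender) == (b.case, b.number, b.gender)


def first_agreeing(candidates: List[Nominal], head: Nominal) -> Optional[Nominal]:
    """Try candidates in order and return the first one that agrees with `head`."""
    for candidate in candidates:
        if agrees(candidate, head):
            return candidate
    return None  # no candidate fits, so the attachment must be abandoned


samavetAHa_candidates = [
    Nominal("samavetAHa", "VOC", "P", "Masc"),
    Nominal("samavetAHa", "NOM", "P", "Masc"),
]
saNjaya = Nominal("saNjaya", "VOC", "S", "Masc")
yuyutsavaHa = Nominal("yuyutsavaHa", "NOM", "P", "Masc")

# VOC-P cannot combine with VOC-S: attaching to saNjaya fails.
print(first_agreeing(samavetAHa_candidates, saNjaya))
# The NOM-P Masculine candidate agrees with yuyutsavaHa.
print(first_agreeing(samavetAHa_candidates, yuyutsavaHa))
```

Checking agreement before committing to a candidate is one way to avoid the late back-tracking described above, at the cost of testing each candidate against every possible head.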
Step 3. In case the parser is unable to derive a well-formed parse-tree, it may need to revert to Step 1, and consider other, alternate, ways to analyze various 'euphonic combinations'.
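The fallback described in Step 3 amounts to a search over alternative sandhi analyses. The Python sketch below is a simplified illustration under stated assumptions: first_parsable and toy_parse are hypothetical, and toy_parse merely checks words against a small set of known forms in place of a real attempt to build a parse-tree.

```python
from itertools import product
from typing import Callable, Optional, Sequence, Tuple

Segmentation = Tuple[str, ...]


def first_parsable(
    alternatives: Sequence[Sequence[str]],
    try_parse: Callable[[Segmentation], Optional[object]],
) -> Optional[Tuple[Segmentation, object]]:
    """Try each combination of per-word alternatives until one parses."""
    for segmentation in product(*alternatives):
        tree = try_parse(segmentation)
        if tree is not None:
            return segmentation, tree
    return None  # every alternative failed


# Stand-in for the real parser: accept only words found in a tiny known-form set.
KNOWN_FORMS = {"rAjA", "vachanam", "abravIt"}


def toy_parse(segmentation: Segmentation) -> Optional[object]:
    if all(word in KNOWN_FORMS for word in segmentation):
        return ("clause",) + segmentation
    return None


# The first analysis of 'rAjA' (as 'rAjAs') is rejected, so the second is tried.
alternatives = [["rAjAs", "rAjA"], ["vachanam"], ["abravIt"]]
result = first_parsable(alternatives, toy_parse)
print(result[0] if result else None)
```

An exhaustive loop of this kind is only practical when the number of alternatives is small; for long sentences the alternatives would need to be ordered or pruned.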
Step 4. At this stage, the student can map the forms found in the various commentaries to the declined/conjugated forms provided above. The parsed results above retain the original word order, which most students may prefer because it helps in correlating the sentence with its parsed form. However, the results can also be presented using the English word order (Subject-Verb-Object), and the verb and noun base words can easily be changed to reflect Number, Tense, etc. (e.g., 'sons', 'spoke/read/related/stated'). Further, instead of showing each word separately, the words can be grouped together into phrases (e.g., grouping the NOM-P terms above, along with their attached indeclinables, to denote the grammatical Subject). Such a simplistic grouping is fraught with risk, and has not been undertaken at present. Some cues are nevertheless found in the gloss for each entry: the comment 'Link_gov_samavetAHa' tells the reader that the term 'LOC-S:dharmakSHetre' is governed by the passive past participle term 'NOM-P:samavetAHa:[who are] assembled'. Likewise, the comment 'Link_subj_yuyutsavaHa' tells the reader that the subject of the passive past participle is the term 'NOM-P:yuyutsavaHa:[the ones who are] eager to fight'. This participle phrase can be read as '... [the ones who are] eager to fight [who are] assembled in Dharmakshetra Kurukshetra ...'.
Step 5. A very rough reading of the above partial translation could be 'Dhritarashtra spoke: O Sanjaya, [those who are] eager to fight [who are] assembled in the Dharmakshetra Kurukshetra, mine and also the sons of Pandu, what did they do?'.
It must be mentioned that the above is a fairly simple sentence. Sentences can get very complicated when they include relative clauses, long-distance movement of a few components of phrases, rare conjugated/declined forms, and elision.
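A reader who wishes to collect the 'Link_' cues mechanically could do so with a short Python sketch such as the one below. It assumes, based only on the two entries quoted from the parsed output, that fields are separated by colons and that a cue has the shape 'Link_<relation>_<target>'; the function link_relations is a hypothetical helper, not part of the parser.

```python
from typing import List, Tuple


def link_relations(entries: List[str]) -> List[Tuple[str, str, str]]:
    """Return (term, relation, target) for every 'Link_' field in the entries."""
    relations = []
    for entry in entries:
        fields = entry.split(":")
        term = fields[0]
        for field in fields[1:]:
            if field.startswith("Link_"):
                # 'Link_gov_samavetAHa' -> relation 'gov', target 'samavetAHa'
                _, relation, target = field.split("_", 2)
                relations.append((term, relation, target))
    return relations


entries = [
    "dharmakSHetre:LOC-S:dharman-kSHetra (dharmakSHetra) :Masc.:Noun:"
    "samAsa_tatpuruSHa(GEN):Link_gov_samavetAHa:in Dharmakshetra",
    "samavetAHa:NOM-P:samaveta:Masc.:Adj:past_participle_passive_kta_2P_sam-ava-i:"
    "Link_subj_yuyutsavaHa:[those who were] assembled",
]
for term, relation, target in link_relations(entries):
    print(f"{term} --{relation}--> {target}")
```

The extracted triples could serve as a starting point for the phrase grouping discussed above, although the risks of such grouping remain.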
Example 2
Here is a slightly more difficult sample printed form of a stanza (Srimad Bhagavad Gita, Chapter 1.2):
सञ्जय उवाच
दृष्ट्वा तु पाण्डवानीकं व्यूढं दुर्योधनस्तदा
आचार्यमुपसङ्गम्य राजा वचनमब्रवीत्
"saNjaya uvAcha . dRuSHTvA tu pANNDavAnIkaM vyUDhaM duryodhanastadA , AchAryamupasaNgamya rAjA vachanamabravIt . "
And the relevant analysis and underlying terms are as follows:
Printed: saNjaya uvAcha . dRuSHTvA tu pANNDavAnIkaM vyUDhaM duryodhanastadA , AchAryamupasaNgamya rAjA vachanamabravIt .
pANNDavAnIkaM v ... [8.3.23] mo'nusvAraHa
vyUDhaM d ... [8.3.23] mo'nusvAraHa
vachanamabravIt ... [8.4.56, 8.2.39] vA'vasAne
saNjaya uvAcha ... [8.3.17, 8.3.19, 1.3.2, 8.2.66] bho bhago agho apUrvasya yo'shi
duryodhanastadA ... [8.3.34, 8.3.15, 1.3.2, 8.2.66] visarjanIyasya saHa
AchAryamupasaNgamya ... [6.1.72] saNhitAyAm
vachanamabravIt ... [6.1.72] saNhitAyAm
Underlying: saNjayas uvAcha . dRuSHTvA tu pANNDavAnIkam vyUDham duryodhanas tadA , AchAryam upasaNgamya rAjA vachanam abravIt .
A: saNjayaHa uvAcha
A.1:
saNjayaHa:NOM-S:saNjaya:Masc.:Noun:Sanjaya
uvAcha:III-S:vach:2:P:VerbPerfect:spoke
B: dRuSHTvA tu pANNDavAnIkam vyUDham duryodhanaHa tadA AchAryam upasaNgamya rAjA vachanam abravIt
B.1:
dRuSHTvA:-:dRush:1:P:VerbGerund:having seen
tu:Indeclinable:indeed
pANNDavAnIkam:ACC-S:pANNDava-anIka:Neut.:Noun:samAsa_tatpuruSHa(GEN):the army of the Pandavas
vyUDham:ACC-S:vyUDha:Neut.:Adj:past_participle_passive_kta_1U_vi-Uh:Link_subj_pANNDavAnIkam:[that was] arrayed
B.2:
tadA:Indeclinable:then
AchAryam:ACC-S:AchArya:Masc.:Noun:preceptor
upasaNgamya:-:upa-sam-gam:1:P:VerbGerund:having approached
B.3:
duryodhanaHa:NOM-S:duryodhana:Masc.:Noun:Duryodhana
rAjA:NOM-S:rAjan:Masc.:Noun:King
vachanam:ACC-S:vachana:Neut.:Noun:utterance
abravIt:III-S:brU:2:P:VerbImperfect:said
Notice that, under the normal rules of 'sandhi', the analysis could also have wrongly restored राजा to राजाः (i.e., instead of leaving unchanged the correct NOM-S declined form राजा of the base word राजन्). It must be kept in mind that it is common for feminine nominals to have such -आ endings (usually in NOM-S), and these must be analyzed differently by the parser compared to the common case discussed earlier (i.e., समवेताः). Another point to note is that in the original sentence, दुर्योधनः (Duryodhana) is very far from its verb अब्रवीत्, as well as from the other component राजा (king) of its own noun phrase (i.e., the translated form is राजा दुर्योधनः). Notice also that the non-finite verb उपसङ्गम्य intervenes between दुर्योधनः and its verb अब्रवीत् (i.e., the parser must be aware that the intervening verb can be skipped since it does not accept an NP-subject). Finally, in Clause 1 the non-finite verb दृष्ट्वा is not clause-final, unlike Clause 2, where the non-finite verb उपसङ्गम्य is clause-final (the canonical Sanskrit word order). See also Note 5.
A rough reading of the above parse-tree could be 'Sanjaya spoke: having seen indeed the army of the Pandavas [that was] arrayed, having approached then the preceptor, King Duryodhana spoke the utterance'.
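The point about skipping the intervening non-finite verb can be illustrated with a small Python sketch over the tokens of this stanza in their original order. The function finite_verb_after is hypothetical and rests on a strong simplifying assumption, namely that the finite verb follows its Subject; it is not a description of how the parser actually links Subjects to verbs.

```python
from typing import List, Optional, Tuple

# (form, tag) pairs in the original word order, tags as in the parsed output.
tokens: List[Tuple[str, str]] = [
    ("dRuSHTvA", "VerbGerund"),
    ("tu", "Indeclinable"),
    ("pANNDavAnIkam", "ACC-S"),
    ("vyUDham", "ACC-S"),
    ("duryodhanaHa", "NOM-S"),
    ("tadA", "Indeclinable"),
    ("AchAryam", "ACC-S"),
    ("upasaNgamya", "VerbGerund"),
    ("rAjA", "NOM-S"),
    ("vachanam", "ACC-S"),
    ("abravIt", "VerbImperfect"),
]


def finite_verb_after(tokens: List[Tuple[str, str]], subject_index: int) -> Optional[str]:
    """Return the first finite verb to the right of the Subject, skipping gerunds."""
    for form, tag in tokens[subject_index + 1:]:
        if tag == "VerbGerund":
            continue  # a gerund does not accept an NP-subject, so it is skipped
        if tag.startswith("Verb"):
            return form
    return None


subject_index = [form for form, _ in tokens].index("duryodhanaHa")
print(finite_verb_after(tokens, subject_index))  # abravIt
```

The same call for राजा also returns अब्रवीत्, which is consistent with the two Nominative terms belonging to one noun phrase.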
Contact us for a copy of the parsing results for Chapter 1 of the Srimad Bhagavad Gita. Parsing results for the remaining chapters will also be available very soon.
NOTES:
1. As the reader would have concluded from the above discussion, there are many issues in getting the analysis correct. Our software produces a parse-tree and a partial translation of individual words (that can be grouped together to form phrases). The parsing statistics for the first few chapters of the Srimad Bhagavad Gita (a non-prose text) are displayed in the respective chapters (e.g., defects are 2.2% in Chapter 2 and 2.2% in Chapter 3). The early results of processing this text are encouraging, showing that it may be possible to automate the analysis of some texts with a reasonably small margin of error. Please remember that one of the initial objectives is to arrive at a bottom-up understanding of the rules underlying 'free' word order in Sanskrit.
2. In a rule-based parser, the lexicon is of vital importance as it is the repository of a number of critical features that help the parser make the right decisions (e.g., when a word is ambiguous, or when there is ambiguity regarding the site to which a phrase should be attached). Our lexicon is limited in size at present (fewer than 9,000 head words), but the parser is aware of all the Paninian rules for euphonic combinations, as well as the rules to form conjugations and declensions correctly, and can usually identify and handle verbs and nouns that are not present in the lexicon (except some Vedic Sanskrit exceptional forms). The lexicon contains almost all of the verb roots described in Panini's Dhatukosha, and the parser can recognize their irregular forms (such as the conjugation यन्तु discussed on the previous page), which are prerequisites for a functional parser for Classical Sanskrit. However, there are also a large number of verbs that are 'related' to the verb roots through upasarga prefixes, e.g., उप-सम्-गम् ('to approach') is 'related' to the verb root गम् ('to go'). Such 'related' verb roots (as distinguished from 'derived' verb roots, which are derived through the application of Paninian rules) will be added in the near term, but their absence is not critical for the parser because they share (in most cases) the same word-final conjugated forms as the Paninian Dhatukosha verb roots that they are 'related' to, and are easy to identify. It must also be mentioned here that the number of upasargas is limited to around 22 (note, however, that they can be stacked together, as in उप-सम्-गम्); hence it may be possible for the student to figure out an approximate meaning of such an upasarga+verb composition, as the meanings of these upasargas are usually regular (i.e., one could read उप-सम्-गम् as 'towards/near+with/together+go', whereas it is translated in the Monier-Williams Dictionary [1] as 'to approach together', and, more broadly, as 'approach' in our lexicon). However, the student must be forewarned that this 'compositional' approach to deriving an 'approximate meaning' of a 'related' verb root is fraught with risk, and we do plan to add such 'related' verb roots to our lexicon in the medium term. Indeclinables are relatively few in number, and most of the frequently used ones are already present in our lexicon (and others will be added in the short term). As the number of nominals is vast (excluding proper names, which cannot all be expected to be present in the lexicon), these may be added more gradually. Until these missing nominals are added, the parser may not make the best choices due to the unavailability of certain critical features from the lexicon. Despite this, the student will benefit from the alternate base forms that will be provided, which offer an important cue for a manual dictionary lookup (e.g., as discussed previously, there may be many base words sharing the same declined form भीमार्जुनसमाः).
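The 'compositional' reading of an upasarga+verb combination can be sketched in Python as follows. The two dictionaries contain only the glosses quoted in this note (upa, sam, and gam); they are illustrative stand-ins and not extracts from our lexicon, and the function compositional_gloss is a hypothetical helper.

```python
# Glosses quoted in the note above; a fuller table would list all of the
# (around 22) upasargas and the verb roots of the lexicon.
UPASARGA_GLOSS = {"upa": "towards/near", "sam": "with/together"}
ROOT_GLOSS = {"gam": "go"}


def compositional_gloss(stem: str) -> str:
    """Build an approximate gloss for a hyphenated upasarga+root stem."""
    *prefixes, root = stem.split("-")
    parts = [UPASARGA_GLOSS[p] for p in prefixes] + [ROOT_GLOSS[root]]
    return "+".join(parts)


print(compositional_gloss("upa-sam-gam"))  # towards/near+with/together+go
```

As the note warns, such a gloss is only an approximation: the dictionary meaning ('to approach together') cannot be derived reliably from the parts in every case.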
3. The reader may have noted from the above sample sentences that the software does not split compound words (समास) into their constituents. Such a process, once again involving ambiguity, may often require some 'knowledge of the world', which is outside the scope and capabilities of our parser. Such words are usually defined as nominals in our lexicon. However, in future, a high-risk strategy of splitting them may be explored for the simpler, clearly identifiable categories of compound words (such as the word भीमार्जुनसमाः, i.e., 'Bhima and Arjuna even/equal'). Clearly, this identification and handling of simple compound word types will also lead to better results in Step 1, as the splitting of 'euphonic combinations' can exclude such word-internal combinations from its scope.
4. There is still work remaining to be done for the resolution of certain kinds of ambiguous components (such as the short-form pronouns मे, नः, नौ, ते, वाम्, वः, which can refer to Dative, Ablative, Genitive, Accusative, etc. declined forms). Some of these issues of ambiguity can be resolved by adding specific Features in the lexicon, but this may make the software harder to scale in the longer term. The attachment of adjunctive components (or any component that does not need to satisfy an Agreement rule) will also remain problematic, as a heuristic that attaches such elements to one clause or another will have, at best, a 50% chance of success. The correct attachment of such components requires an 'understanding of the world', and is best left to human readers.
5. The parser will frequently 'read in' an elided verb in order to complete a clause. This is most commonly the copula verb (as seen in Stanza 1.39 of the Srimad Bhagavad Gita). However, it could infrequently be a different verb, depending on the local context; for example, the presence of unexplained Nominative and Accusative terms may indicate an elided Transitive verb (see Stanza 1.16 of the Srimad Bhagavad Gita). Clearly, there is a risk of the parser incorrectly inserting a verb into the sentence (i.e. False Positives); we define strict conditions to govern insertion of such elided verbs, and also flag all such insertions for review by humans. Parsing statistics for Chapter 2 show that, with reference to elided verbs, there was only 1 False Positive, while there were 4 False Negatives (i.e. where the parser failed to insert an elided verb). Interestingly, over half the number of total parser defects (in the chapter as a whole) occurred in the sentences with these elided verb parser defects.