Parsing Statistics

Sat, 10/20/2018 - 15:12 — antardhvani

Our goal in the parsing analysis of the text was to analyze each stanza, identify the correct declined or conjugated form of each term, and associate that term with its respective Clause either as an argument (e.g., the NOM subject or ACC object of a verb in the active voice) or as an adjunct/non-argument (e.g., LOC, VOC). Each declension or conjugation should also show its base or root and any other relevant details. This is a very challenging problem, as will become clear from a review of our analysis of each stanza.

A sample parsing analysis for Stanza 1.39 is shown below:

A: katham na jnyeyam ~~asti~~ asmAbhiHa pApAt ~~asmAn/~~asmAt nivartitum kulakSHayakRutam doSHam prapashyadbhiHa janArdana

A.2:

katham:Indeclinable

na:Indeclinable

jnyeyam:NOM-S:jnyeya:Neut.:Noun:potential_participle_passive_yat_9U_jnA

asmAbhiHa:INS-P:asmad:Masc.:Pronoun:Link_gov_jnyeyam

A.1:

pApAt:ABL-S:pApa:Neut.:Noun

asmAt:ABL-S:idam:Masc.:Pronoun

nivartitum:-:ni-vRut:1:A:VerbInfinitive

kulakSHayakRutam:ACC-S:kula-kSHaya-kRuta (kulakSHayakRuta) :Neut.:Noun:past_participle_passive_kta_8P_kRu:Link_subj_doSHam

doSHam:ACC-S:doSHa:Neut.:Noun:Link_gov_prapashyadbhiHa

prapashyadbhiHa:INS-P:prapashyat:Masc.:Adj:present_participle_shatRu_1P_pra-dRush

janArdana:VOC-S:janArdana:Masc.:Noun
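
The colon-separated glosses above lend themselves to mechanical processing. The following is a minimal Python sketch, not the parser's own code, that splits one gloss line into fields. The names `Gloss` and `parse_gloss` are hypothetical. The field order (term, Case-Number, base, gender, part of speech, further details) is inferred from the nominal entries in the sample, and the `DAT` case and the dual number `D` are assumptions, as neither occurs in this section. Indeclinables and the infinitive use a different layout, so their fields are kept unparsed.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Case-Number tag such as "ACC-S"; DAT and the dual "D" are assumed abbreviations.
CASE_NUMBER = re.compile(r"^(NOM|ACC|INS|DAT|ABL|GEN|LOC|VOC)-([SDP])$")


@dataclass
class Gloss:
    surface: str                      # the term as it appears in the stanza
    case: Optional[str] = None
    number: Optional[str] = None
    base: Optional[str] = None
    gender: Optional[str] = None
    pos: Optional[str] = None
    extras: List[str] = field(default_factory=list)  # participle info, Link_* tags, or raw fields


def parse_gloss(line: str) -> Gloss:
    """Split one 'term:field:field:...' gloss line into a Gloss record."""
    parts = [p.strip() for p in line.split(":")]
    surface, rest = parts[0], parts[1:]
    match = CASE_NUMBER.match(rest[0]) if rest else None
    if match and len(rest) >= 4:
        # Nominal layout: Case-Number, base, gender, part of speech, then extras.
        return Gloss(surface, match.group(1), match.group(2),
                     rest[1], rest[2], rest[3], rest[4:])
    # Any other layout (indeclinable, infinitive): keep the fields as they are.
    return Gloss(surface, extras=rest)
```

For example, `parse_gloss("doSHam:ACC-S:doSHa:Neut.:Noun:Link_gov_prapashyadbhiHa")` yields a record with case `ACC`, number `S`, base `doSHa`, and the link tag in `extras`, while `parse_gloss("katham:Indeclinable")` keeps `Indeclinable` as its only raw field.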

In our presentation of the results, we largely follow the method used by Sanskrit grammarians for several thousand years (see, e.g., [KAL2015] [1] and [MM2015] [2]). However, we are unable to provide an analysis of 'compound words' ('samAsa'), as this is beyond the capabilities of a syntactic parser. The parser needs to choose one from among several alternative forms for each term (e.g., 'jnyeyam' above could be one of 'Adj/Noun:jnyeya:NOM-S-Neut.:that which ought to be known', 'Adj/Noun:jnyeya:ACC-S-Masc.:one who ought to be known', or 'Adj/Noun:jnyeya:ACC-S-Neut.:that which ought to be known', i.e. it could be the subject of the clause, the object, or a Predicative Adjective; see footnote 1). The parsing analysis of each stanza shows the clauses (e.g., A.1 and A.2), as well as the selected declension, conjugation, or indeclinable form for each term. Please note that the sample analysis above does not show the insertion by the parser of the elided 'copula verb' ('as:2:P:to be:VerbPresent') in clause A.1. This insertion is crucial for the analysis, but may cause some confusion for the reader, as the verb is not present in the input provided to the parser.

Although our parser is still at an exploratory stage, it performed very well on the stanzas of Chapter 1 of the text. However, we do not claim that this high level of accuracy can be replicated in further chapters, as the syntactic complexity varies significantly from stanza to stanza, and a number of nuances will need to be handled (e.g., in Stanza 1.12 and Stanza 1.39, we discuss nominals derived from transitive verbs, which require an object complement). Indeed, it is quite likely that the accuracy of the parser will be much lower on liturgical texts and non-prose texts, where the word order is partially determined by the metre of the verse and the parser often lacks sufficient cues to make the correct choices. It must be kept in mind that the insertion of elided verbs is crucial for the parser to make the right choices, but it is far from trivial to insert these verbs correctly.

It must also be pointed out that the input used by the parser was the output produced by the 'sandhi analyzer' (i.e. terms are expected to be in their underlying forms, prior to the operation of sandhi sutras), but with the 4 defects observed in the sandhi analysis stage marked as 'unresolved' (e.g., the input to the parser was changed to 'asmAn/asmAt' in Stanza 1.39, and the parser treated this as a term that was not 'resolved' by the 'sandhi analyzer'). The parser was additionally tasked with identifying the correct alternative for each such 'unresolved' term (e.g., choosing 'asmAt' instead of 'asmAn' in the stanza, if justified). Happily, the parser chose the correct alternative for each of the 4 defects, leading to an excellent result for 'sandhi analysis'. Needless to say, this manual intervention is a temporary measure, as the 'sandhi analysis' software will eventually mark the alternatives for 'unresolved' terms automatically. We prefer to keep the 'sandhi analyzer' and the parser independent of each other at present, so that the defects in each can be measured separately.
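
To make the 'unresolved' notation concrete, here is a minimal Python sketch of how a slash-separated term such as 'asmAn/asmAt' could be expanded into the alternative readings a parser would have to choose between. The function names `expand_unresolved` and `candidate_readings` are hypothetical, and the sketch assumes that '/' occurs only as the separator between alternatives. It illustrates the input format only; the rules by which our parser selects the correct alternative are not shown.

```python
from itertools import product
from typing import List


def expand_unresolved(tokens: List[str]) -> List[List[str]]:
    """Turn each token into its list of alternatives ('a/b' -> ['a', 'b'])."""
    return [tok.split("/") for tok in tokens]


def candidate_readings(tokens: List[str]) -> List[List[str]]:
    """Enumerate every reading obtained by picking one alternative per token."""
    return [list(reading) for reading in product(*expand_unresolved(tokens))]
```

For the fragment `["pApAt", "asmAn/asmAt", "nivartitum"]`, this yields two candidate readings, one with 'asmAn' and one with 'asmAt'. Each reading can then be parsed in the usual way, and the parser keeps the one that it can justify.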

In conclusion, it is safe to say that parsing a text of this complexity is extraordinarily difficult, given the occurrence of 'free' word order. Despite this difficulty, we believe that parsing is necessary for a 'bottom-up' analysis of such texts, in order to obtain a highly nuanced understanding of 'free' word order in Sanskrit. We will show a number of instances where the assignment by the parser does not coincide with that of one or the other of our chosen experts ([KAL2015] and [MM2015]). We have also come across a few instances (in later chapters of the text) where the parser's assignment is at variance with those of both experts, but the parser has probably made the correct assignment (this is an advantage of a rule-based system, provided the rules are highly nuanced).

Defective assignments of Case/Number (and, in one case, the verb class or 'gaNNa') have been highlighted in the respective stanzas. The standards used for identifying defects were [KAL2015] and [MM2015]. As will be seen in the table below, most of the defects identified as per [KAL2015] are questionable when checked against [MM2015]. It is possible that some of these 'defects' are printing errors in [KAL2015].

Total Word Count:569

Defects in declension / conjugation term assignment:2 (i.e. < 1%)

*dd: Incorrect definitions in the parser's lexicon are marked as '*dd' in the table below.
*spd: Short-form pronouns are very difficult to handle reliably (marked as '*spd' below). The correct identification of short-form pronouns may be beyond the capability of a syntactic parser, as this task may require the parser to have an understanding of the 'real world'.
*ad: Incorrect declension/ conjugation assignments by the parser are marked '*ad' below.

The following table summarises the variations (and defects) found during the parsing analysis of Chapter 1. We list all the terms that are at variance with the views of one or the other of the two experts ([KAL2015] and [MM2015]).

Table A: Defective assignment of Declension/ Conjugation/ Indeclinable Glosses to terms by the Parser
#	Stanza	Clause	Word	ASSIGNMENT			Type	Comment	#Defects
				Parser	[KAL2015]	[MM2015]
1	1.7	A.3	saNjnyArtham	Ind.	ACC-S	Ind.	*dd	Not a defect	0
2	1.11	A.1	yathAbhAgam	Ind.	ACC-S	Ind.	*dd	Not a defect	0
3	1.21	B.1	madhye	Ind.	LOC-S	Ind.	*ad	Not a defect	0
4	1.24	B.2	madhye	Ind.	LOC-S	Ind.	*ad	Not a defect	0
5	1.25	A.1	bhISHmadroNNapramukhatas	Ind.	NOM-P	Ind.	*dd	Not a defect	0
6	1.25	A.1	uvAcha	1stSingular	3rdSingular	3rdSingular	*ad	Defect	1
7	1.32	A.1	naHa	GEN-P	ACC-P	GEN-P	*spd	Not a defect	0
8	1.33	A.1	kANkSHitam	NOM-S	ACC-S	NOM-S	*ad	Not a defect	0
9	1.35	A.1	ghnataHa	ACC-P	NOM-S	ACC-P	*ad	Not a defect	0
10	1.36	A.2	naHa	GEN-P	ACC-P	GEN-P	*spd	Not a defect	0
11	1.36	A.3	pApam	NOM-S	ACC-S	NOM-S	*ad	Not a defect	0
12	1.37	A.1	tasmAt	ABL-S	Ind.	ABL-S	*ad	Not a defect	0
13	1.40	A.2	kulam	ACC-S	NOM-S	ACC-S	*ad	Not a defect	0
14	1.40	A.2	kRutsnam	ACC-S	NOM-S	ACC-S	*ad	Not a defect	0
15	1.44	A.1	aniyatam	NOM-S	ACC-S	Ind.	*ad	Not a defect	0
16	1.44	A.1	vAsaHa	Neut.	Masc.	Masc.	*gd	Defect	1
17	1.47	B.3	visRujya	class 6P	class 1P	Undefined	*ad	Not a defect	0
		TOTAL DEFECTS							2
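
The figures in Table A can be checked with a short calculation. The Python sketch below copies the rows of the table and the total word count of 569 as data, then tallies the defects, the defect rate, and how often the parser's assignment matches each expert. The variable names are illustrative only, and agreement is measured by simple string equality of the table cells, which is an assumption of this sketch.

```python
# (stanza, word, parser, kal2015, mm2015, defects), copied from Table A
ROWS = [
    ("1.7", "saNjnyArtham", "Ind.", "ACC-S", "Ind.", 0),
    ("1.11", "yathAbhAgam", "Ind.", "ACC-S", "Ind.", 0),
    ("1.21", "madhye", "Ind.", "LOC-S", "Ind.", 0),
    ("1.24", "madhye", "Ind.", "LOC-S", "Ind.", 0),
    ("1.25", "bhISHmadroNNapramukhatas", "Ind.", "NOM-P", "Ind.", 0),
    ("1.25", "uvAcha", "1stSingular", "3rdSingular", "3rdSingular", 1),
    ("1.32", "naHa", "GEN-P", "ACC-P", "GEN-P", 0),
    ("1.33", "kANkSHitam", "NOM-S", "ACC-S", "NOM-S", 0),
    ("1.35", "ghnataHa", "ACC-P", "NOM-S", "ACC-P", 0),
    ("1.36", "naHa", "GEN-P", "ACC-P", "GEN-P", 0),
    ("1.36", "pApam", "NOM-S", "ACC-S", "NOM-S", 0),
    ("1.37", "tasmAt", "ABL-S", "Ind.", "ABL-S", 0),
    ("1.40", "kulam", "ACC-S", "NOM-S", "ACC-S", 0),
    ("1.40", "kRutsnam", "ACC-S", "NOM-S", "ACC-S", 0),
    ("1.44", "aniyatam", "NOM-S", "ACC-S", "Ind.", 0),
    ("1.44", "vAsaHa", "Neut.", "Masc.", "Masc.", 1),
    ("1.47", "visRujya", "class 6P", "class 1P", "Undefined", 0),
]
TOTAL_WORDS = 569  # total word count of Chapter 1, as reported above

# Sum of the '#Defects' column and the resulting defect rate in percent.
total_defects = sum(row[5] for row in ROWS)
defect_rate = 100.0 * total_defects / TOTAL_WORDS

# Rows where the parser's assignment is identical to that of each expert.
agrees_with_mm = sum(1 for row in ROWS if row[2] == row[4])
agrees_with_kal = sum(1 for row in ROWS if row[2] == row[3])

print(f"defects: {total_defects} of {TOTAL_WORDS} words ({defect_rate:.2f}%)")
print(f"rows matching [MM2015]: {agrees_with_mm}, rows matching [KAL2015]: {agrees_with_kal}")
```

The tally gives 2 defects in 569 words (about 0.35%). The parser's assignment matches [MM2015] in 13 of the 17 rows and matches [KAL2015] in none, since a term is listed only when it differs from at least one expert.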

From the above analysis, we can see that there are very few defects in the assignment of terms by the parser. However, this is a small sample (Chapter 1 has 569 words in 47 stanzas); the parser is sure to make a number of defective assignments in later chapters of the text, given the high complexity of the problem being addressed. The parser necessarily needs to insert elided terms (largely verbs, but occasionally also nominals) in order to determine the clause structure, as the correct assignment of terms is contingent upon identifying the clauses. In this process, several errors can result from the erroneous inclusion or exclusion of an elided term, as we are sure to observe while processing the remaining chapters of this text. However, we expect the parser to perform better on prose, as the complexity of non-prose texts is very high due to the constraints imposed by metre (i.e. 'free word-order' is expected to be less of a problem in prose than in non-prose text). This is a conjecture that will be reviewed in the near future.

It is also important to note that if some of the stanzas had not been reorganised (i.e. split and merged with their adjacent stanzas), parsing may not have been possible. This is because of the presence of a large number of unrelated terms, or the absence of a large number of related terms, as the case may be. We refer in particular to Stanzas 1.20, 1.21, 1.26, 1.27, 1.28, and 1.29, where incomplete fragments occurring in individual stanzas needed to be reorganised for syntactic analysis. This reorganisation is not necessary for the 'sandhi analysis' stage, but is crucial for the subsequent parsing stage.

This analysis was extraordinarily challenging and rewarding, presenting new problems every day that needed to be resolved from advanced grammar texts. [Kale1995] [3], [JRB1995] [4], and [WDW2003] [5] provide a wealth of information on the Paninian system that is best absorbed in small increments, when applied to resolve a specific problem (e.g., rare and exceptional declined and conjugated forms). The expressive power of the Paninian system is truly amazing, and its nuances may take a lifetime of study to understand fully.

1. [KAL2015] frequently does not distinguish between Adjectives and Nouns, hence we will not make that distinction when assignments to either are made by the parser. Note that, for the parser, this is usually a question of the definition of the term in the lexicon.

References

[KAL2015] Kalavade L., Kalavade P. Gitavyakaranam Panniniyapraveshaya. Chinmaya International Foundation: Unspecified; 2015.
[MM2015] Michika M. Grammatical Analysis of the Bhagavad Gita Chapters 1 to 6. Arsha Avinash Foundation: Coimbatore; 2015.
[Kale1995] Kale M. A Higher Sanskrit Grammar. Motilal Banarsidass Publishers: Delhi; 1995.
[JRB1995] Ballantyne J. Laghu Kaumudi of Varadaraja: A Sanskrit grammar. Motilal Banarsidass Publishers: New Delhi; 1995.
[WDW2003] Whitney W. Sanskrit Grammar. Dover Publications Inc: Mineola, New York; 2003.
