
Parsing Statistics

Our goal for the parsing analysis of the text was to analyze each stanza in order to identify the correct declined/conjugated form of each term, and to associate it with its clause either as an argument (e.g., the NOM subject or ACC object of a verb in the active voice) or as an adjunct/non-argument (e.g., LOC, VOC). The declensions and conjugations should also show the base/root and other relevant details of each declension/conjugation. This is a very challenging problem, as will become clear from a review of our analysis of each stanza.

A sample parsing analysis for Stanza 1.39 is shown below:


A: katham na jnyeyam asti asmAbhiHa pApAt asmAn/asmAt nivartitum kulakSHayakRutam doSHam prapashyadbhiHa janArdana

A.2:

  • katham:Indeclinable
  • na:Indeclinable
  • jnyeyam:NOM-S:jnyeya:Neut.:Noun:potential_participle_passive_yat_9U_jnA
  • asmAbhiHa:INS-P:asmad:Masc.:Pronoun:Link_gov_jnyeyam
A.1:

  • pApAt:ABL-S:pApa:Neut.:Noun
  • asmAt:ABL-S:idam:Masc.:Pronoun
  • nivartitum:-:ni-vRut:1:A:VerbInfinitive
  • kulakSHayakRutam:ACC-S:kula-kSHaya-kRuta (kulakSHayakRuta):Neut.:Noun:past_participle_passive_kta_8P_kRu:Link_subj_doSHam
  • doSHam:ACC-S:doSHa:Neut.:Noun:Link_gov_prapashyadbhiHa
  • prapashyadbhiHa:INS-P:prapashyat:Masc.:Adj:present_participle_shatRu_1P_pra-dRush
  • janArdana:VOC-S:janArdana:Masc.:Noun
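
For concreteness, the sketch below shows one way an analysis line such as 'jnyeyam:NOM-S:jnyeya:Neut.:Noun:...' could be held in code. This is our own illustrative data structure (in Python); the class and field names are assumptions for exposition, not the parser's actual internals.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative structure (not the parser's internals) for one line of the
# analysis, e.g. 'jnyeyam:NOM-S:jnyeya:Neut.:Noun:potential_participle_...'
@dataclass
class Term:
    surface: str                      # declined/conjugated form, e.g. 'jnyeyam'
    case: Optional[str] = None        # NOM, ACC, INS, ABL, VOC, ... (None if indeclinable)
    number: Optional[str] = None      # S(ingular), D(ual), P(lural)
    base: Optional[str] = None        # base/root, e.g. 'jnyeya'
    gender: Optional[str] = None      # Masc., Fem., Neut.
    pos: str = "Indeclinable"         # Noun, Pronoun, Adj, Verb..., or Indeclinable
    derivation: Optional[str] = None  # e.g. 'potential_participle_passive_yat_9U_jnA'
    link: Optional[str] = None        # e.g. 'Link_gov_prapashyadbhiHa'

@dataclass
class Clause:
    label: str                        # e.g. 'A.2'
    terms: List[Term] = field(default_factory=list)

# The first three terms of clause A.2 in the sample above:
a2 = Clause("A.2", [
    Term("katham"),
    Term("na"),
    Term("jnyeyam", case="NOM", number="S", base="jnyeya", gender="Neut.",
         pos="Noun", derivation="potential_participle_passive_yat_9U_jnA"),
])
```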

In our presentation of the results, we largely follow the method used by Sanskrit grammarians for several thousand years (see, e.g., [KAL2015] [1] and [MM2015] [2]). However, we are unable to provide an analysis of 'compound words' ('samAsa'), as this is beyond the capabilities of a syntactic parser. The parser needs to choose one from amongst several alternative forms for each term: for example, 'jnyeyam' above could be any of 'Adj/Noun:jnyeya:NOM-S-Neut.:that which ought to be known', 'Adj/Noun:jnyeya:ACC-S-Masc.:one who ought to be known', or 'Adj/Noun:jnyeya:ACC-S-Neut.:that which ought to be known', i.e. it could be the subject of the clause, the object, or a Predicative Adjective¹. The parsing analysis of each stanza shows the clauses (e.g., A.1 and A.2), as well as the selected declension/conjugation/indeclinable form for each term. Please note that the sample analysis above does not show the parser's insertion of the elided 'copula verb' ('as:2:P:to be:VerbPresent') in clause A.1; this insertion is crucial for the analysis, but may confuse the reader, as it is not present in the input provided to the parser.
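To make the disambiguation step concrete, the following is a minimal sketch of how a rule-based parser might choose among the alternative readings of 'jnyeyam'. The candidate list matches the alternatives given above, but the preference rule is a deliberately simplified stand-in for the parser's actual, far more nuanced rules.

```python
# Illustrative only: choosing one reading of 'jnyeyam' from the alternatives
# listed above. The preference rule below is a hypothetical simplification.
candidates = [
    {"base": "jnyeya", "case": "NOM", "number": "S", "gender": "Neut."},
    {"base": "jnyeya", "case": "ACC", "number": "S", "gender": "Masc."},
    {"base": "jnyeya", "case": "ACC", "number": "S", "gender": "Neut."},
]

def pick_reading(candidates, clause_has_subject):
    """Toy rule: prefer a NOM reading when the clause still lacks a subject,
    otherwise fall back to the first ACC reading."""
    for c in candidates:
        if c["case"] == "NOM" and not clause_has_subject:
            return c
    return next(c for c in candidates if c["case"] == "ACC")

# In clause A.2 no other term supplies a subject, so NOM-S-Neut. is chosen:
print(pick_reading(candidates, clause_has_subject=False))
```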


Although our parser is still at an exploratory stage, it performed very well on the stanzas of Chapter 2 of the text. However, we do not claim that this high level of accuracy will be replicated in further chapters, as the syntactic complexity varies significantly from stanza to stanza, and a number of nuances remain to be handled. In general, the accuracy of the parser will be lower on liturgical and other non-prose texts, where the word order is partially determined by the metre of the verse, making it extraordinarily difficult for the parser to make the correct choices in the absence of sufficient cues. It must be kept in mind that the insertion of elided verbs is crucial for the parser to make the right choices, yet it is far from trivial to insert these elided terms correctly.
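As an illustration of elided-verb insertion, the sketch below inserts the copula into any clause that lacks a finite verb. The representation (a clause as a list of term/part-of-speech pairs) and the function name are hypothetical; the real parser's conditions are considerably more nuanced.

```python
# A minimal sketch of elided-verb insertion. A clause is represented here as
# a list of (surface form, part of speech) pairs; this representation and
# the function name are assumptions, not the authors' implementation.
def insert_elided_copula(clause):
    """If a clause contains no finite verb, append the elided copula
    'asti' ('as', class 2P, 'to be') so that argument assignment can proceed."""
    if not any(pos == "VerbPresent" for _, pos in clause):
        clause.append(("[asti]", "VerbPresent"))  # bracketed: inserted, not in the input
    return clause

# Clause A.1 of Stanza 1.39 (abridged) has no finite verb, so 'asti' is inserted:
a1 = [("pApAt", "Noun"), ("doSHam", "Noun"), ("prapashyadbhiHa", "Adj")]
print(insert_elided_copula(a1))
```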

It must also be pointed out that the input to the parser was the output of the 'sandhi analyzer' (i.e., terms are expected to be in their underlying forms, prior to the operation of the sandhi sutras), but with the 6 defects observed at the sandhi analysis stage marked as 'unresolved' (e.g., the input to the parser was changed to 'yotsya/yotsye' in Stanza 2.9, and the parser treated this as a term not 'resolved' by the 'sandhi analyzer'). The parser was additionally tasked with identifying the correct alternative for each such 'unresolved' term (e.g., choosing 'yotsya' over 'yotsye', if justified). Happily, the parser chose the correct alternative for each of the 6 defects, leading to an excellent overall result for 'sandhi analysis'. Needless to say, this manual intervention is a temporary measure: the 'sandhi analysis' software will eventually mark the alternatives for 'unresolved' terms automatically. For the present we prefer to keep the 'sandhi analyzer' and the parser independent of each other, so that the defects of each can be measured independently.
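The mechanism can be sketched as follows: an 'unresolved' token carries its alternatives separated by '/', and the parser keeps the alternative it can justify in context. The scoring function here is a placeholder (it hard-wires the outcome) standing in for the parser's contextual checks.

```python
# An 'unresolved' token carries its alternatives separated by '/'; the parser
# keeps whichever alternative it can justify. The scorer below is a placeholder.
def resolve_unresolved_term(token, score_fn):
    alternatives = token.split("/")          # e.g. ['yotsya', 'yotsye']
    return max(alternatives, key=score_fn)   # keep the best-justified form

chosen = resolve_unresolved_term("yotsya/yotsye",
                                 score_fn=lambda form: form == "yotsya")
print(chosen)  # 'yotsya'
```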

In conclusion, it is safe to say that parsing a text of this complexity is extraordinarily difficult given its 'free' word order. Despite this difficulty, we believe that a 'bottom-up' analysis of such texts is necessary in order to obtain a highly nuanced understanding of 'free' word order in Sanskrit. We will show a number of instances where the parser's assignment does not coincide with one or the other of our chosen experts ([KAL2015] and [MM2015]). We have also come across a few instances (in later chapters of the text) where the parser's assignment is at variance with both experts, yet the parser has probably made the correct assignment (an advantage of a rule-based system, provided the rules are highly nuanced).

Defective assignments of Case/Number/Clause have been highlighted in the respective stanzas. The standards used for identifying defects were [KAL2015] and [MM2015]. As will be seen in the tables below, most of the defects identified as per [KAL2015] are questionable when checked against [MM2015]. It is possible that some of these 'defects' are printing errors in [KAL2015].

Table 1: Count of expected terms in parser output

Description | Count | Notes
Input terms | 964 | Terms produced by the sandhi analysis of the input stanzas.
Additional elided verb terms | 55 | The parser needs to insert elided verbs in a large number of clauses.
False-negative elided verb terms | 4 | The parser failed to create these elided verbs. See Table 5.
False-negative elided nominal terms | 1 | The parser failed to create this elided term. See Table 5.
Total expected terms | 1024 | This count is used as the denominator for computing the percentage of defects.
Total defects (see Table 2 below) | 6 + 8 + 9 = 23 | 2.2 percent (23/1024).

From the above analysis, we can see that the number of defects in the parser's assignment of terms is quite small (2.2 percent). The parser necessarily needs to insert elided terms (largely verbs, but occasionally also nominals) in order to work out the clause structure, as the correct assignment of terms is contingent upon identifying the clauses; some errors therefore result from the erroneous inclusion or exclusion of an elided term. However, we expect the parser to perform better on prose, as the complexity of non-prose texts is very high due to the constraints imposed by metre (i.e., 'free' word order is expected to be less of a problem in prose than in non-prose).
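The arithmetic behind Table 1 can be checked directly; this snippet simply reproduces the counts reported above.

```python
# Reproducing the Table 1 arithmetic: the denominator is the expected term
# count (input terms plus all elided terms the parser should produce).
input_terms = 964
inserted_elided_verbs = 55
missed_elided_verbs = 4
missed_elided_nominals = 1
expected = (input_terms + inserted_elided_verbs
            + missed_elided_verbs + missed_elided_nominals)   # 1024
defects = 6 + 8 + 9                                           # Tables 3, 4, 5
print(expected, f"{100 * defects / expected:.1f}%")           # 1024 2.2%
```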


Table 2: Summary of defects

Type | Correct | Defective | Notes
*dd | 5 | 1 | Incorrect definitions in the parser's lexicon are marked '*dd' in the tables below.
*spd | | | Short-form pronouns are very difficult to handle reliably (marked '*spd' below). Their correct identification may be beyond the capability of a syntactic parser, as the task may require an understanding of the 'real world'.
*ad | 30 | 4 | Incorrect declension/conjugation assignments by the parser are marked '*ad' below.
*gd | 1 | 1 | Incorrect gender in declension assignments by the parser is marked '*gd' below. Selecting the correct gender (where multiple genders are possible) is a semantic issue beyond the capabilities of a syntactic parser.
*cd | 8 | 8 | Incorrect clause assignments by the parser are marked '*cd' below. Clause assignment is marked N.A. for [KAL2015], as it is not marked there. Only significant differences are shown.
*hd | | 9 | Sometimes a higher-level understanding is required to handle concepts such as metaphors, which require the creation of additional Hidden Clauses (and elided verbs). These semantic defects of the parser are marked '*hd' in the tables below. Clause assignment is marked N.A. for [KAL2015], as it is not marked there. Only significant differences are shown.
*id | 3 | ? | Incorrect input leading to defective assignments by the parser is marked '*id' below.
Total | 47 | 23 | See Tables 3, 4, 5 below.

    Notes:

• We have not marked defects in the identification of participles (e.g., kta, yat, shatRu) or of compound words (samAsa), as too many terms are left unmarked as such in [MM2015] (although they are assumed to be present in its translations).
• Further, we have not marked defects in the identification of the arguments of participles, since these are not marked in [KAL2015] and are occasionally unmarked in [MM2015].
• The internal structure of clauses is difficult to compare, as the parser treats non-finite verbs (gerunds and infinitives) as separate clauses with their own arguments, while [MM2015] includes them as part of a larger finite-verb clause (see the sketch below).
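
The following hypothetical example shows this mismatch; clause labels and members are invented for illustration, and a fair comparison has to flatten the parser's non-finite clauses into their governing clause first.

```python
# Hypothetical example: the parser gives a non-finite verb its own clause with
# its own arguments, while [MM2015] folds it into the enclosing finite-verb
# clause. Clause labels and members here are invented for illustration.
parser_view = {
    "A.1": ["subject", "finite_verb"],
    "A.2": ["infinitive", "object_of_infinitive"],  # split out by the parser
}
mm2015_view = {
    "A.1": ["subject", "finite_verb", "infinitive", "object_of_infinitive"],
}

# A fair comparison therefore flattens the parser's clauses first:
flattened = sorted(term for terms in parser_view.values() for term in terms)
assert flattened == sorted(mm2015_view["A.1"])
```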

    The following table lists those terms which have been assigned the wrong Conjugation (Verb, Number, Person) or Declension (Case, Number, Gender) in the Gloss.

Table 3: Defective assignment of Declension/Conjugation/Indeclinable glosses to terms by the Parser

# | Stanza | Clause | Word | Parser | [KAL2015] | [MM2015] | Type | Comment | #Defects
1 | 2.2 | B.1 | kashmalam | NOM-S | ACC-S | NOM-S | *ad | Not a defect | 0
2 | 2.2 | B.1 | idam | NOM-S | ACC-S | NOM-S | *ad | Not a defect | 0
3 | 2.2 | B.1 | samupasthitam | NOM-S | ACC-S | NOM-S | *ad | Not a defect | 0
4 | 2.2 | B.1 | anAryjuSHTam | NOM-S | ACC-S | NOM-S | *ad | Not a defect | 0
5 | 2.2 | B.1 | asvargyam | NOM-S | ACC-S | NOM-S | *ad | Not a defect | 0
6 | 2.2 | B.1 | akIrtikaram | NOM-S | ACC-S | NOM-S | *ad | Not a defect | 0
7 | 2.3 | A.1 | gamaHa | Injunctive | Imperfect | Aorist | *ad | Not a defect | 0
8 | 2.4 | B.2 | pUjArhO | ACC-D | NOM-D | ACC-D | *ad | Not a defect | 0
9 | 2.4 | B.2 | yotsyAmi | class 4P | class 4A | N.A. | *dd | Not a defect | 0
10 | 2.6 | A.1 | etat | ACC-S | ACC-S | Ind. | *ad | Not a defect | 0
11 | 2.6 | A.7 | katarat | NOM-S | ACC-S | NOM-S | *ad | Not a defect | 0
12 | 2.6 | A.7 | naHa | GEN-P | DAT-P | GEN-P | *ad | Not a defect | 0
13 | 2.6 | A.7 | garIyaHa | NOM-S | ACC-S | NOM-S | *ad | Not a defect | 0
14 | 2.6 | A.6 | pramukhe | LOC-S | Ind. | LOC-S | *ad | Not a defect | 0
15 | 2.7 | A.2 | yat | NOM-S | ACC-S | NOM-S | *ad | Not a defect | 0
16 | 2.7 | A.2 | nishchitam | NOM-S | ACC-S | NOM-S | *ad | Not a defect | 0
17 | 2.8 | A.2 | yat | NOM-S | ACC-S | NOM-S | *ad | Not a defect | 0
18 | 2.9 | B.2 | parantapa/parantapaHa | VOC-S | VOC-S | NOM-S | *id | Questionable defect | 0
19 | 2.14 | A.2 | anityAHa/anityAn | NOM-P | ACC-P (anityAn) | NOM-P | *ad | Not a defect | 0
20 | 2.17 | A.4 | sarvam | NOM-S | ACC-S | NOM-S | *ad | Not a defect | 0
21 | 2.17 | A.4 | idam | NOM-S | ACC-S | NOM-S | *ad | Not a defect | 0
22 | 2.18 | A.1 | tasmAt | ABL-S | Ind. | ABL-S | *ad | Not a defect | 0
23 | 2.20 | A.5 | abhavitA/bhavitA | bhavitA | bhavitA | abhavitA | *id | Questionable defect | 0
24 | 2.26 | A.3 | evam/enam | evam | enam | evam | *id | Not a defect | 0
25 | 2.29 | A.1 | kashchit | NOM-S | Ind. | Ind. | *dd | Not a defect | 0
26 | 2.29 | A.5 | kashchit | NOM-S | Ind. | Ind. | *dd | Not a defect | 0
27 | 2.30 | A.3 | nityam | NOM-S | Ind. | Ind. | *dd | Not a defect | 0
28 | 2.32 | A.2 | upapanna | A.2:NOM-S | A.1:ACC-S | A.1:ACC-S | *ad | *Defect* | 1
29 | 2.32 | A.2 | svargadvAram | A.2:NOM-S | A.1:ACC-S | A.1:ACC-S | *ad | *Defect* | 1
30 | 2.32 | A.2 | apAvRutam | A.2:NOM-S | A.1:ACC-S | A.1:ACC-S | *ad | *Defect* | 1
31 | 2.34 | A.1 | te | GEN-S | GEN-S | GEN-S | *ad | Not a defect | 0
32 | 2.36 | A.2 | duHakhataram | NOM-S | ACC-S | NOM-S | *ad | Not a defect | 0
33 | 2.44 | A.1 | samAdhO | Masc. | Fem. | Masc. | *ad | Not a defect | 0
34 | 2.46 | A.2 | yAvAn | NOM-S | Ind. | NOM-S | *ad | Not a defect | 0
35 | 2.46 | A.1 | tAvAn | NOM-S | Ind. | NOM-S | *ad | Not a defect | 0
36 | 2.48 | A.1 | siddhyasiddhyoHa | GEN-S | GEN-S | LOC-S | *ad | *Defect* | 1
37 | 2.59 | A.2 | param | Masc. | Neut. | Neut. | *gd | *Defect* | 1
38 | 2.59 | A.2 | rasavarjam | A.2:ACC-S | A.1:Ind. | A.1:Ind. | *dd | *Defect* | 1
TOTAL DEFECTS: 6

    The following table lists those terms that are assigned to the wrong clause by the parser. Note that information on clauses can only be ascertained from [MM2015] ([KAL2015] does not provide such details).

Table 4: Defective assignment of Clauses to terms by the Parser

# | Stanza | Clause | Word | Parser | [KAL2015] | [MM2015] | Type | Comment | #Defects
1 | 2.22 | A.1 | navAni | A.1 | N.A. | A.2 | *cd | *Defect* | 1
2 | 2.22 | A.4 | jIrNNAni | A.4 | N.A. | A.3 | *cd | *Defect* | 1
3 | 2.28 | A.2 | tatra | A.2 | N.A. | A.1 | *cd | *Defect* | 1
4 | 2.32 | A.2 | yadRuchchhayA | A.2 | N.A. | A.1 | *cd | *Defect* | 1
5 | 2.32 | A.2 | cha | A.2 | N.A. | A.1 | *cd | *Defect* | 1
6 | 2.32 | A.2 | pArtha | A.2 | N.A. | A.1 | *cd | *Defect* | 1
7 | 2.61 | A.2 | vashe | A.2 | N.A. | A.5 | *cd | *Defect* | 1
8 | 2.61 | A.2 | hi | A.2 | N.A. | A.5 | *cd | *Defect* | 1
TOTAL DEFECTS: 8

The following table lists those terms that belong to Hidden Clauses which the parser should have created, but failed to detect. Terms shown in square brackets are elided terms (such as [asti] below) that should have been inserted by the parser in the Hidden Clause. Note that information on clauses can only be ascertained from [MM2015] ([KAL2015] does not provide such details).

Table 5: Defective identification of Hidden Clauses by the Parser

# | Stanza | Clause | Word | Parser | [KAL2015] | [MM2015] | Type | Comment | #Defects
1 | 2.32 | A.2 | [verb:asti] | A.2 | N.A. | A.1 | *hd | *Defect* | 1
2 | 2.70 | A.3 | na | A.3 | N.A. | A.4 | *hd | *Defect* | 1
3 | 2.70 | A.3 | kAmakAmI | A.3 | N.A. | A.4 | *hd | *Defect* | 1
4 | 2.70 | A.3 | [noun:shAntim] | A.3 | N.A. | A.4 | *hd | *Defect* | 1
5 | 2.70 | A.3 | [verb:Apnoti] | A.3 | N.A. | A.4 | *hd | *Defect* | 1
6 | 2.72 | A.2 | eSHA | A.2 | N.A. | A.5 | *hd | *Defect* | 1
7 | 2.72 | A.2 | brAhmI | A.2 | N.A. | A.5 | *hd | *Defect* | 1
8 | 2.72 | A.2 | sthitiHa | A.2 | N.A. | A.5 | *hd | *Defect* | 1
9 | 2.72 | A.2 | [verb:asti] | A.2 | N.A. | A.5 | *hd | *Defect* | 1
TOTAL DEFECTS: 9

1. [KAL2015] frequently does not distinguish between Adjectives and Nouns; hence we do not make that distinction when the parser assigns a term to either category. Note that, for the parser, this is usually a question of how the term is defined in the lexicon.

    References

1. [KAL2015] Kalavade L., Kalavade P. Gitavyakaranam Panniniyapraveshaya. Chinmaya International Foundation; 2015.
2. [MM2015] Michika M. Grammatical Analysis of the Bhagavad Gita, Chapters 1 to 6. Arsha Avinash Foundation, Coimbatore; 2015.