The goal of this technology is to automatically generate questions based on 'factoids' that are present in a sentence. An illustration, based on sentence #1 below, will help clarify the scope and working of a standalone QG system (see [1] for a detailed review):
We define a 'factoid' as an 'interesting' piece of information in a sentence that can be queried, and that can be identified through syntactic analysis. For instance, 'Which factoids' are usually Noun Phrases, such as 'the man bitten by the dog in the park' in sentence #1 below; 'Where factoids' are usually Preposition Phrases, such as 'in the park' in #1; and 'What factoids' are usually based on propositions, such as 'the man should not sue the city' in #1. Similar conditions identify 'When/Why/How/Whether factoids'. Crucially, 'factoids' can be identified largely through syntactic analysis, i.e. they do not require 'reading between the lines' (e.g. there is no need to semantically associate 'lawyer' with 'attorney'). We will discuss the concept of 'interesting' later in this section.
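The mapping from phrase types to question types can be sketched as follows. This is a minimal illustration, not the actual system: the phrase labels (NP, PP, CLAUSE) and the pre-parsed example phrases are assumptions standing in for the Deep Parser's real output.

```python
# Illustrative sketch: mapping parser phrase labels to question types.
# The labels and example phrases are assumptions for illustration; a real
# system would read labelled phrases from the Deep Parser's tree.

PHRASE_TO_QTYPE = {
    "NP": "Which",       # Noun Phrase        -> 'Which' factoid
    "PP": "Where",       # Preposition Phrase -> 'Where' factoid
    "CLAUSE": "What",    # proposition/clause -> 'What' factoid
}

def factoid_types(phrases):
    """Return (question_type, phrase) pairs for phrases the rules cover."""
    return [(PHRASE_TO_QTYPE[label], text)
            for label, text in phrases if label in PHRASE_TO_QTYPE]

# Phrases as they might come out of a parse of sentence #1:
parsed = [
    ("NP", "the man bitten by the dog in the park"),
    ("PP", "in the park"),
    ("CLAUSE", "the man should not sue the city"),
]
print(factoid_types(parsed))
```

A real implementation would also carry the conditions for 'When/Why/How/Whether factoids', which depend on more than the bare phrase label.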
The input for a Question Generation system is a high-quality parse-tree produced by a Deep Parser. A Question Generation system is best suited to sentences shorter than 25 words, because the quality of parse-trees declines for long sentences. The Deep Parser should produce a parse-tree that enables us to 'split' sentence #1 into the following 'split' sentences:
From the above 'split' sentences, we should be able to generate the following questions:
Notice that #7 above is a question in 'active' voice, whereas the clause it is based on in sentence #1 is in 'passive' voice. Of course, this transformation is possible only if the Agent is present in the sentence and can be identified as such.
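The passive-to-active transformation can be sketched as below. The clause representation (agent / verb / patient fields) is an assumption made for illustration; a real system would extract these roles from the Deep Parser's tree rather than receive them pre-labelled.

```python
# Illustrative sketch of the passive-to-active transformation.
# The role-labelled clause dictionary is an assumption; a real system would
# derive Agent and Patient from the parse-tree of the passive clause.

def passive_to_active(clause):
    """Rewrite a passive clause in active voice, if the Agent is known."""
    if clause.get("agent") is None:
        return None  # no Agent ('the man was bitten') -> not possible
    return f"{clause['agent']} {clause['verb_active']} {clause['patient']}"

# 'the man was bitten by the dog' -> active voice
clause = {"patient": "the man", "verb_active": "bit", "agent": "the dog"}
print(passive_to_active(clause))  # the dog bit the man
```

The `None` branch reflects the caveat above: without an identifiable Agent, the system must leave the clause in passive voice.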
Notice also that #6 asks the question 'which man', to which the answer provides the additional information available to identify 'the man', i.e. 'the man who was bitten by the dog in the park'. Our technology considers a 'factoid' to be 'interesting' only if its answer carries a little more information than the question based on it, or if it provides additional contextual information. The 'where' question in #7 above provides additional contextual information, i.e. 'the place where the dog bit the man'. In contrast, if we ask the question 'which lawyer', we can only expect the uninformative answer 'the lawyer'. More information about 'the lawyer' or 'the city' may be available in preceding sentences of the passage, but exploiting it is a more difficult problem. For similar reasons, it is risky to generate questions involving pronouns (e.g. 'he') or demonstratives (e.g. 'those lawyers'). Linking 'factoids' across sentences in a passage should likewise be attempted only if 'False Positives' can be minimized.
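The 'interesting' filter can be sketched as a simple predicate. The word lists and the length threshold here are assumptions chosen to illustrate the idea; they are not the system's actual criteria.

```python
# Illustrative sketch of the 'interesting' filter: keep a factoid only if
# its answer adds information beyond the question, and reject pronouns and
# demonstratives. The word lists and threshold are illustrative assumptions.

PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}
DEMONSTRATIVES = {"this", "that", "these", "those"}

def is_interesting(factoid):
    words = factoid.lower().split()
    if words[0] in PRONOUNS or words[0] in DEMONSTRATIVES:
        return False      # risky: the referent lives in another sentence
    return len(words) > 2  # crude proxy for 'carries extra information'

print(is_interesting("the man bitten by the dog in the park"))  # True
print(is_interesting("the lawyer"))    # False: uninformative answer
print(is_interesting("those lawyers")) # False: demonstrative
```

Under this sketch, 'which man' survives (its answer adds the relative clause), while 'which lawyer' is rejected because the only answer available in the sentence is the bare phrase 'the lawyer'.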
It must also be noted that the questions generated may not be limited to 'key' facts alone. A syntactic analysis cannot reliably distinguish 'key' facts from 'non-key' ones; however, if sentences are ranked in order of their importance in the passage (a high-ranking sentence shares many 'concepts' with other sentences in the passage), questions can be generated preferentially from the highest-ranked sentences. Further, in order to avoid generating 'False Positives', many questions will not be generated by the system. We must also consider the possibility that 'the park' was the place where 'the lawyer did the advising' (and not the place where 'the dog did the biting', as assumed by the Deep Parser). Clearly, sentence #1 is ambiguous in this respect; a human, however, would probably reason that lawyers advise their clients in chambers (not in parks), whereas dogs play in parks (and a poorly trained dog could bite someone there). A syntactic Deep Parser usually focuses on sentence-level structure and leaves 'reasoning about the world' to downstream systems.
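The sentence-ranking idea can be sketched as follows. Using raw content-word overlap as a stand-in for shared 'concepts' is an assumption for illustration; the example sentences and stopword list are likewise illustrative.

```python
# Illustrative sketch of ranking sentences by shared 'concepts': a sentence
# ranks higher when it shares more content words with the other sentences.
# Word overlap as a proxy for 'concepts' is an assumption; a real system
# might match lemmas or synonyms instead of surface words.

STOPWORDS = {"the", "a", "an", "in", "by", "was", "that", "should", "not"}

def content_words(sentence):
    return {w for w in sentence.lower().split() if w not in STOPWORDS}

def rank_sentences(sentences):
    """Return sentence indices sorted by concept overlap, best first."""
    sets = [content_words(s) for s in sentences]
    def score(i):
        return sum(len(sets[i] & sets[j])
                   for j in range(len(sets)) if j != i)
    return sorted(range(len(sentences)), key=score, reverse=True)

sentences = [
    "The dog bit the man in the park",
    "The man was advised by the lawyer",
    "The lawyer said the man should not sue the city",
]
print(rank_sentences(sentences))
```

Questions would then be generated preferentially from the top-ranked indices, which goes some way toward favouring 'key' facts without any semantic analysis.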