Question Extractor


Functionality

This module extracts all questions from a document, for example a meeting transcript or a request for proposal.

Sequence of events within module

File Handling:
- Automatically detects the relevant file name from user input.
- Alternatively, the transcript document can be uploaded directly to the chat and its content is processed from there.
Company/Project Name Extraction: Knowing the company or project name helps the model better understand the transcript (particularly when extending questions).
Question Extraction: Identifies and extracts questions from the transcript.
Question Extending: Questions extracted from the transcript usually lack context. The original questions extracted in the previous step are therefore extended using the context from the chunk.
Topic Assignment: Assigns topics to the extracted questions for better categorisation.
Chat Output Streaming: After processing each chunk, the message on the chat interface is updated with the new questions that have been extracted.
Excel Export: Exports the collected questions to an Excel file and links it in the chat for downloading.
Word Export: Exports a version of the transcript document with highlights of extracted questions.
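
The sequence above can be pictured as a loop over document chunks. The following is a minimal Python sketch of that flow, not the module's actual implementation: `process_chunks`, `ExtractedQuestion`, and the three callables (`extract`, `extend`, `assign_topic`) are hypothetical names, and the LLM calls are replaced by plain functions so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class ExtractedQuestion:
    name: str
    original_question: str
    extended_question: str
    topic: str


def process_chunks(
    chunks: Iterable[str],
    extract: Callable[[str], List[Dict[str, str]]],  # hypothetical: question extraction step
    extend: Callable[[str, str], str],               # hypothetical: question extending step
    assign_topic: Callable[[str, str], str],         # hypothetical: topic assignment step
) -> Iterator[List[ExtractedQuestion]]:
    """Yield all questions collected so far after each chunk (mimics chat streaming)."""
    collected: List[ExtractedQuestion] = []
    for chunk in chunks:
        for item in extract(chunk):
            question = item["original_question"]
            collected.append(
                ExtractedQuestion(
                    name=item.get("name", "Unknown"),
                    original_question=question,
                    extended_question=extend(question, chunk),
                    topic=assign_topic(question, chunk),
                )
            )
        # A snapshot is emitted per chunk so a chat message could be updated with it.
        yield list(collected)
```

Each yielded snapshot corresponds to one update of the chat message; the final snapshot is what would be passed on to the Excel and Word exports.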

Input

The input document is preferably a Word file. The module also works with PDF files, but highlighting is not available for them. The document is uploaded either to the designated scope in the Knowledge Center that is connected to the extractor module, or directly to the chat.

A user then uses the assistant linked to this module to extract all questions, e.g.

Extract questions from meeting_transcript_xyz.pdf.

If the file is directly uploaded to the chat, the user can use the following prompt:

Extract questions from this uploaded file.

Meeting Transcript Format:

Each statement in the discussion must be preceded by the name of the speaker. Below you can see an example document:

Steerco Meeting Example Transcript.docx (Word document)
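
To make the format requirement concrete, here is a small Python sketch that checks whether a transcript line starts with a speaker label. The page does not specify the exact delimiter, so the `Name: statement` pattern, the regular expression, and the function name `split_speaker` are assumptions for illustration only. The `"Unknown"` fallback mirrors the behaviour described in the default prompts.

```python
import re
from typing import Tuple

# Assumption: a speaker label is one to three capitalised words followed by a colon,
# optionally preceded by a bullet character, e.g. "- Bob Smith: We need numbers."
SPEAKER_LINE = re.compile(
    r"^\s*(?:[-*•]\s*)?(?P<name>[A-Z][\w.'-]*(?: [A-Z][\w.'-]*){0,2}):\s*(?P<text>\S.*)$"
)


def split_speaker(line: str) -> Tuple[str, str]:
    """Return (speaker, statement); the speaker is "Unknown" when no label is found."""
    match = SPEAKER_LINE.match(line)
    if match is None:
        return "Unknown", line.strip()
    return match.group("name"), match.group("text")
```

A check like this could be run over a document before upload to spot statements that are missing a speaker name.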

Output

The output consists of:

- Table in the chat interface containing all extracted questions
- Downloadable Excel file with the extracted questions
- Downloadable Word file with the extracted questions highlighted in yellow

Example outputs:

Chat_2024-09-17_10_47_Meeting_Demo_Unique_UBS_highlighted.docx (Word document)
Chat_2024-09-17_09_58_Steerco Meeting Example Transcript_output.xlsx (Excel spreadsheet)

Reference in Code in AI Module Template

QuestionExtractor

Configuration settings

This section contains all configurable parameters.

Default Configuration

json

{
    "languageModel": "AZURE_GPT_4o_2024_1120",
    "companyNameExtraction": {
        "systemMessage": "You are given a file name. Your task is to extract the name of the company contained in the file name. If you can not extract a company name, return \"NONE\". \"IC\" is not a company name.\n\nThe expected output is a JSON in the format: \n{\n    \"company_name\": \"<extracted name of company>\"\n}",
        "userMessage": "$document_name",
        "exampleMessages": [
            {
                "role": "user",
                "content": "20240215_annual_report_ubs.pdf"
            },
            {
                "role": "assistant",
                "content": "{\n    \"company_name\": \"UBS\"\n}"
            },
            {
                "role": "user",
                "content": "FIRST_QUARTER_UNIQUE_2022.pdf"
            },
            {
                "role": "assistant",
                "content": "{\n    \"company_name\": \"Unique\"\n}"
            },
            {
                "role": "user",
                "content": "2022_03_15_Quarterly_Report.pdf"
            },
            {
                "role": "assistant",
                "content": "{\n    \"company_name\": \"NONE\"\n}"
            },
            {
                "role": "user",
                "content": "memo_2024 (vf).pdf"
            },
            {
                "role": "assistant",
                "content": "{\n    \"company_name\": \"NONE\"\n}"
            }
        ]
    },
    "questionExtraction": {
        "useCompanyName": true,
        "additionalInputKeywords": [],
        "dummyExtendedQuestionPlaceholder": "Additional Input",
        "sheet1Name": "Sheet1",
        "sheet2Name": "Sheet2",
        "originalQuestion": {
            "system": "Your role as an assistant is to analyze a transcript from a meeting about an investment opportunity. Your task is to identify and highlight all the questions posed by the participants. This will include both direct questions and requests for more information or clarity. Here's a step-by-step guide to your task:\n\n**Name Extraction:** Identify the name of the individual who asked the question. The name is always the first word in the bullet point. If there is no name mentioned, assign the bullet point name to \"Unknown\".\n\n**Question Extraction:** Identify and extract the question or request for information. Questions are sentences ending with a question mark, or requests for clarification or information (e.g., \"We need to understand why this happened\", \"We should look in this\"). Extract the question or request verbatim, so it can be located in the transcript using a simple text search (ctrl+F).\n\n**Multiple Questions:** If a single person asks several consecutive questions within one bullet point, extract each of them separately. They will be highlighted differently for easy identification.\n\n**Question Mark Rule:** Any sentence ending with a question mark should be considered a question and must be included in the output.\n\nRemember, your primary goal is to identify and extract the questions or requests from the transcript. The extracted text should be as concise as possible, focusing only on the sentences of interest. 
There's no need for additional context as this will be handled at a later stage.\n\nPlease structure your output as a JSON in the following format:\n\n{\n    \"question\": [\n        {\"name\": \"Alice\",\"original_question\": \"What is photosynthesis?\"},\n        {\"name\": \"Bob\",\"original_question\": \"Who was Cleopatra?\"},\n        {\"name\": \"Bob\",\"original_question\": \"I would like to understand better the market price of Novartis.\"},\n        {\"name\": \"Bob\",\"original_question\": \"We have to look up why this company grew so much in the last 2 years.\"},\n        {\"name\": \"Bob\",\"original_question\": \"You need to review the tech stack and understand how old it really is.\"},\n        {\"name\": \"Bob\",\"original_question\": \"You need to understand why this happened.\"}\n    ]\n}\n\nYou must not create any questions. Only extract the questions directly from the text, word by word, without correcting typos or punctuations. If no questions are present, return an empty question list.\n",
            "systemWithStructuredOutput": "Your role as an assistant is to analyze a transcript from a meeting about an investment opportunity. Your task is to identify and highlight all the questions and action points posed by the participants. This will include both direct questions and requests for more information or clarity. Here's a step-by-step guide to your task:\n\n**Name Extraction:** Identify the name of the individual who asked the question. The name is always the first word in the bullet point. If there is no name mentioned, assign the bullet point name to \"Unknown\". 'Chair' is not a name. Do not assign 'Chair' as a name.\n\n**Transcript Extraction:** Identify and extract transcript sections matching at least one category explained below. Extractsentence verbatim, so it can be located in the transcript using a simple text search (ctrl+F). Be sure not to miss any transcript section. Rather extract a sentence to much than miss one.\nHere a definition of the extract item categories:\n${categories}\n\n**Multiple Questions:** If a single person asks several consecutive questions within one bullet point, extract each of them separately. They will be highlighted differently for easy identification.\n\n**Question Mark Rule:** Any sentence ending with a question mark should be considered a question and must be included in the output.\n\nYou must not create any questions. Only extract the text sections directly from the provided transcript, word by word, without correcting typos or punctuations. If no questions are present, return an empty text extract list. Never create any extractions.\n",
            "trigger": "Transcript:\n'''\n${chunk}\n'''\n\nEvery question or request extracted from the transcript SHOULD be confined to a maximum of TWO sentences (preferably one sentence).\n\nYour answer in JSON format:",
            "triggerWithStructuredOutput": "Transcript:\n'''\n${chunk}\n'''\n\nEvery extract from the transcript SHOULD be confined to a maximum of TWO sentences (preferably one sentence). Again, make sure not to miss any relevant transcript section. Never create any extractions by yourself."
        },
        "additionalInputs": {
            "system": "Your purpose is to convert a text with bullet points into a structured JSON object. Each bullet point represents a contribution from an individual. When it is not clear to whom a bullet point belongs, assign it to the last person mentioned. If there is no name mentioned, assign the bullet point name to \"Unknown\".\n\nStructure the output in JSON, associating each bullet point with the respective contributor\"s name. Each bullet point should be listed as a separate entry. Ensure that you do not break or shorten the bullet points; include each bullet point in its entirety. Maintain the keys \"question\", \"name\", and \"original_question\" in your JSON structure.\n\nJSON Output:\n{\n  \"question\": [\n    {\n      \"name\": \"Alice\",\n      \"original_question\": \"What is photosynthesis?\"\n    },\n    {\n      \"name\": \"Alice\",\n      \"original_question\": \"25 years of investment is a lot\"\n    },\n    {\n      \"name\": \"Bob\",\n      \"original_question\": \"Who was Cleopatra?\"\n    }\n  ]\n}\n\nExtract the bullet points directly from the text, word by word, without correcting typos or punctuations. If no bullet points are present, return an empty question list. Do not break up the bullet points. Include each bullet point in its entirety as a single entry in the \"original_question\" field.\n",
            "trigger": "Input:\n'''\n${chunk}\n'''\n\nJSON output:"
        },
        "extendedQuestion": {
            "system": "You are a helpful AI designed to augment a given question extracted from a transcript, to enhance its clarity and comprehensibility. The original question is provided as well as the context in which it was asked. Your task consist of focusing on improving the question by enriching it with relevant details according to the context. You MUST be concise and you must use as little words as possible. \n\nStructure your output as a JSON in the following format:\n{\n    \"extended_question\": \"A concise augmented question with only relevant details from context. \"\n}\n",
            "triggerWith": "Question:\n'''\n${original_question}\n'''\n\nContext:\n'''\n${chunk}\n'''\n\nThe transcript pertains to the company ${company_name}. If \"they\" or \"it\" is mentioned and the reference is ambiguous, assume that ${company_name} is being referred to.\n\nJSON output:\n",
            "triggerWithout": "Question:\n'''\n${original_question}\n'''\n\nContext:\n'''\n${chunk}\n'''\n\nJSON output:\n",
            "useCompanyName": false
        },
        "topic": {
            "system": "Your task is to give assign a topic to a \"question\" that has been asked during an investment meeting. The topic should be one of the following:\n\n[\n'Artificial Intelligence',\n'Business Plan / Underwriting',\n'Management',\n'Customers',\n'Competitors,\n'Financials',\n'Financing',\n'Go To Market / Sales',\n'Governance',\n'Growth',\n'Integration',\n'Market',\n'M&A (Mergers and Acquisitions)',\n'Product',\n'Strategy',\n'Technology',\n'Valuation'\n]\n\nIf you can not find a topic for the question, put the topic as 'Other'.\n\nTo help you achieve the task, you will be provided the question as well as the transcript where the question was extracted.\n\nStructure your output as a JSON in the following format:\n{\n    \"explanation\": \"A brief explanation of why you chose this topic\",\n    \"topic\": \"topic that describes best the question. Only choose one.\"\n}",
            "trigger": "Question:\n'''\n${original_question}\n'''\n\nContext:\n'''\n${chunk}\n'''\n\nJSON output:\n"
        },
        "keywordSearchMode": "basic",
        "enableStructuredOutput": false,
        "chunkSplitAtLineBreaksForQuestionExtraction": false
    },
    "excelGenerator": {
        "uploadScopeId": null,
        "uploadToChat": true,
        "renameColMap": null,
        "tableHeaderFormat": {
            "bg_color": "#966919",
            "bold": true,
            "font_color": "white",
            "text_wrap": true
        },
        "skipIngestion": true,
        "tableDataFormat": {
            "bg_color": "#FFFFFF",
            "bold": false,
            "font_color": "black",
            "text_wrap": true,
            "border": 1,
            "valign": "top"
        }
    }
}
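
The trigger prompts in the configuration use `${...}` placeholders such as `${chunk}` and `${original_question}`. As a rough illustration of how such a prompt could be filled and how the requested JSON answer could be read, here is a Python sketch based on the standard library's `string.Template`. Whether the module uses `string.Template` internally is an assumption, and `render_prompt` and `parse_extended_question` are hypothetical helper names.

```python
import json
from string import Template

# Copied from the default configuration (questionExtraction.extendedQuestion.triggerWithout).
TRIGGER_WITHOUT = (
    "Question:\n'''\n${original_question}\n'''\n\n"
    "Context:\n'''\n${chunk}\n'''\n\nJSON output:\n"
)


def render_prompt(template: str, **values: str) -> str:
    """Fill ${...} placeholders; safe_substitute leaves unknown placeholders untouched."""
    return Template(template).safe_substitute(**values)


def parse_extended_question(raw_response: str) -> str:
    """Read the "extended_question" field from the model's JSON answer."""
    return json.loads(raw_response)["extended_question"].strip()


prompt = render_prompt(
    TRIGGER_WITHOUT,
    original_question="Why did it grow so much?",
    chunk="Bob: Why did it grow so much?",
)
```

When customising a prompt, keeping the placeholder names exactly as in the default configuration is the safe choice, since a renamed placeholder would presumably not be filled.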

General parameters

Parameter	Description	Type	Default
`languageModel`	Language model used by the module.	string	`AZURE_GPT_4o_2024_1120`
`companyNameExtraction`	Configuration for the service that attempts to extract a company name from the uploaded document’s filename. Identifying the company name helps improve the accuracy of question extraction.	object	See Default Configuration above
`questionExtraction`	Configuration for the question extraction service.	object	See Default Configuration above
`excelGenerator`	See Excel Generator	object	See Default Configuration above
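
The `companyNameExtraction` block combines a system message, few-shot example messages, and a user message containing the file name. The Python sketch below shows one plausible way to assemble these into a chat message list and to interpret the `"NONE"` sentinel requested by the system message. The message ordering and the helper names `build_company_name_messages` and `parse_company_name` are assumptions, not the module's confirmed behaviour.

```python
import json
from string import Template
from typing import Dict, List, Optional


def build_company_name_messages(config: Dict, document_name: str) -> List[Dict[str, str]]:
    """Assumed order: system message, few-shot examples, then the actual file name."""
    user_content = Template(config["userMessage"]).safe_substitute(
        document_name=document_name
    )
    return [
        {"role": "system", "content": config["systemMessage"]},
        *config.get("exampleMessages", []),
        {"role": "user", "content": user_content},
    ]


def parse_company_name(raw_response: str) -> Optional[str]:
    """Map the "NONE" sentinel from the prompt to Python's None."""
    name = json.loads(raw_response).get("company_name", "NONE")
    return None if name == "NONE" else name
```

Adding further user/assistant pairs to `exampleMessages` is the natural way to teach the model additional file-naming conventions.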

questionExtraction

Parameter	Description	Type	Default
`useCompanyName`	Whether to include the company name extracted from the file name in the prompt messages.	boolean	`true`
`additionalInputKeywords`	A list of keywords used to split the chunks into two groups. Questions in the first group of chunks are extracted with the `originalQuestion` prompts. The second group uses the `additionalInputs` prompts for the question extraction.	list[string]	`[]`
`dummyExtendedQuestionPlaceholder`	Column name used in the Excel file for extended questions when the `additionalInputs` prompts are used for the extraction.	string	`Additional Input`
`sheet1Name`	Name of the first sheet in the generated Excel file. It holds the questions extracted with the `originalQuestion` prompts.	string	`Sheet1`
`sheet2Name`	Name of the second sheet in the generated Excel file. It holds the questions extracted with the `additionalInputs` prompts.	string	`Sheet2`
`originalQuestion`	Prompts to extract questions. Contains a dictionary with `system` and `trigger` prompts, plus the `systemWithStructuredOutput` and `triggerWithStructuredOutput` variants.	dict	See Default Configuration above
`additionalInputs`	Prompts to extract additional inputs. Contains a dictionary with `system` and `trigger` prompts.	dict	See Default Configuration above
`extendedQuestion`	Prompts to extend the extracted questions. Contains a dictionary with a `system` prompt and the `triggerWith` and `triggerWithout` prompts (with and without the company name).	dict	See Default Configuration above
`topic`	Prompts to assign a topic to each extracted question. Contains a dictionary with `system` and `trigger` prompts.	dict	See Default Configuration above
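
To illustrate the effect of `additionalInputKeywords`, the following Python sketch splits a list of chunks into two groups. The exact splitting rule is not documented on this page, so the rule used here (split at the first chunk that contains any keyword, case-insensitively) and the function name `split_chunks_by_keywords` are assumptions.

```python
from typing import List, Sequence, Tuple


def split_chunks_by_keywords(
    chunks: Sequence[str], keywords: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Split chunks at the first one containing any keyword (assumed rule)."""
    lowered = [k.lower() for k in keywords if k]
    for index, chunk in enumerate(chunks):
        text = chunk.lower()
        if any(k in text for k in lowered):
            # First group -> originalQuestion prompts, second -> additionalInputs prompts.
            return list(chunks[:index]), list(chunks[index:])
    # No keyword configured or found: everything stays in the first group.
    return list(chunks), []
```

With the default empty keyword list, all chunks stay in the first group under this rule, so only the `originalQuestion` prompts would be used.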