Detecting mild cognitive impairment in narratives in Brazilian Portuguese: first steps towards a fully automated system

In recent years, Mild Cognitive Impairment (MCI) has received a great deal of attention, as it may represent a pre-clinical state of Alzheimer's disease (AD). To distinguish healthy elderly controls (CTL) from MCI patients, automated discourse analysis tools have been applied to narrative transcripts in English and in Brazilian Portuguese. However, the absence of sentence boundary segmentation in transcripts prevents the direct application of methods that rely on these marks for the correct use of tools such as taggers and parsers. To our knowledge, only a few studies have evaluated automatic sentence segmentation in transcripts of neuropsychological tests. The purpose of this study is to investigate the impact of the automatic sentence segmentation method DeepBond on nine syntactic complexity metrics extracted from transcripts of CTL and MCI patients.


Introduction
The aging of the population is a well-known social trend in developed countries that has become increasingly pronounced in developing countries. In Brazil, for example, the population pyramid is changing shape, according to the 2000 and 2010 censuses of the IBGE (Brazilian Institute of Geography and Statistics). Increased life expectancy with a high quality of life is priceless for every citizen; however, it raises serious financial and social issues, particularly in health, because aging may be accompanied by neurodegenerative diseases, requiring new resources and medical facilities. A recent study by Engedal and Laks (2016) puts the worldwide prevalence of dementia disorders at 44 million individuals, possibly rising to 140 million by 2050. For Brazil, they estimate that 1.6 million people have dementia, 1.2 million of whom are not diagnosed at all, based on Herrera et al. (2002) and Scazufca et al. (2008) as well as on a recent study by Nakamura et al. (2015). One particular problem in Brazil is the low average level of education, since lower education is a known risk factor for dementia and poses problems for its diagnosis and treatment (cf. CÉSAR et al., 2016).
Recently, the study of the syndrome known as Mild Cognitive Impairment (MCI), which is defined as a cognitive decline greater than expected for individuals of the same age and level of education, has become increasingly important (cf. LEHR et al., 2012; TÓTH et al., 2015; VINCZE et al., 2016; and SANTOS et al., 2017). MCI interferes only minimally with day-to-day activities: it can be perceived only in complex situations and cannot be considered a type of dementia. However, the most frequent type, amnestic MCI, has the highest conversion rate to Alzheimer's disease (AD) (15% per year, versus 1-2% of the total population) (CLEMENTE; RIBEIRO-FILHO, 2008).
There are several instruments for identifying preclinical and manifest dementias, such as biomarkers, Magnetic Resonance Imaging and molecular neuroimaging (MCKHANN et al., 2011; MAPSTONE et al., 2014), but none of these are inexpensive solutions for public hospitals. Language is one of the most efficient information sources for assessing cognitive functions. Changes in language are frequently observed in patients with dementia and are normally the first changes noticed by the patients themselves and their family members. Therefore, the automatic analysis of discourse production is seen as a promising solution for diagnosing MCI, because its early detection ensures a greater chance of success in addressing potentially reversible factors or maintaining functionality (MUANGPAISAN et al., 2012).
Neuropsychological tests that require some degree of memorization are usually included in verbal memory tests. This is the case of the logical memory test, in which an individual reproduces a story after listening to it. The higher the number of recalled elements from the narrative, the higher the memory score (WECHSLER, 1997; BAYLES; TOMOEDA, 1993; MORRIS et al., 2006). Evaluating language from another standpoint, discourse production (mainly in narratives) is an attractive alternative because it allows for the analysis of linguistic microstructures (ANDREETTA et al., 2012), including phonetic-phonological, morphosyntactic and semantic-lexical components, as well as semantic-pragmatic macrostructures. Since it is a natural form of communication, it favors the observation of the patient's functionality in everyday life. Moreover, it provides data for observing the interface between language and cognitive skills, such as executive functions (planning, organizing, updating and monitoring data). However, the main difficulties are: (i) the time required, since it is a manual task; and (ii) the subjectivity of the clinician in checking the presence of the main ideas of the narrative retold by the patient.
In terms of distinguishing healthy aging adults, referred to in our study as CTL, from MCI patients, several studies have shown that discourse production is a sensitive task for differentiating individuals with MCI from controls using the Wechsler Logical Memory (WLM) test (PRUD'HOMMEAUX et al., 2011; PRUD'HOMMEAUX; ROARK, 2015). The original narrative used in this test is short, allowing the use of the output of Automatic Speech Recognition (ASR) methods on patients' speech even without capitalization and sentence segmentation, as shown by Lehr et al. (2012) for data in English. They based their method on automatic alignments of the original and patient transcripts in order to calculate the number of recalled elements from the narrative. Moreover, automated discourse analysis tools based on Natural Language Processing (NLP) resources and tools, aimed at the diagnosis of language-impairing dementias via machine learning methods, are already available for the English language (FRASER et al., 2015a; YANCHEVA et al., 2015). Yet a comprehensive, publicly available NLP environment designed for Brazilian Portuguese (BP), called Coh-Metrix-Dementia (ALUÍSIO et al., 2016a), was only recently developed.
Coh-Metrix-Dementia is based on a previous tool for discourse analysis, named Coh-Metrix-Port (SCARTON; ALUÍSIO, 2010), which was already used in a clinical discourse analysis study to classify written descriptions of healthy adults (TOLEDO et al., 2014). Based on previous studies using metrics and machine learning classifiers for the English language in clinical settings (e.g. CHAND et al., 2012; ROARK et al., 2011), 25 new metrics were added to the existing 48 metrics for measuring syntactic complexity, semantic content of language via idea density (CUNHA et al., 2015), and text cohesion through latent semantics.
Although Coh-Metrix-Dementia is publicly available, two major issues hinder its wide use in clinical settings: (i) the current need for manual narrative transcription and (ii) the absence of capitalization and sentence boundary segmentation in the transcript, which prevents the direct application of NLP methods that rely on these marks for the correct use of tools such as taggers and parsers. In this paper we focus on nine syntactic metrics of Coh-Metrix-Dementia on which segmentation performance has a high impact.
The task of predicting sentence boundaries has been addressed by many researchers. Liu et al. (2006) investigated the imbalanced data problem, since there are more non-boundary words than boundary words; their study was carried out using two speech corpora, conversational telephone speech and broadcast news, both for the English language. More recent papers have focused on Conditional Random Field (CRF) models. Wang et al. (2012) and Hasan et al. (2014) use CRF-based strategies to identify word boundaries in speech corpora, more specifically in English broadcast news data and English conversational speech (lecture recordings), respectively.
Although there are several methods of sentence segmentation for BP datasets (SILLA; KAESTNER, 2004; BATISTA, 2013; LÓPEZ; PARDO, 2015), none of them has been applied to transcriptions used in clinical settings for elderly people with dementias and related syndromes. The study most similar to our scenario is that of Fraser et al. (2015b), which proposes a segmentation method for aphasic speech based on lexical, Part of Speech (PoS) and prosodic features, using tools and a generic acoustic model trained on resources for English. Their approach is based on a CRF model, which classifies a word by taking its context into account. With this model, better results were obtained for broadcast news data, where speech is prepared, but the results on patient data were generally similar to those on the controls' data, allowing the use of several syntactic complexity metrics.
In this paper we present the first steps taken towards the wide use of Coh-Metrix-Dementia, working together with the DeepBond method for sentence segmentation. DeepBond (TREVISO et al., 2017a) uses recurrent convolutional neural networks with prosodic and PoS features as well as word embeddings, and it was evaluated intrinsically on impaired, spontaneous speech and on normal, prepared speech. Moreover, when the CRF method presented in Fraser et al. (2015b) and the DeepBond method are compared on our data, DeepBond presents better results. In Section 2 we present the Coh-Metrix-Dementia tool, which automatically analyses text productions using several metrics. Section 3 presents DeepBond and details the task of sentence segmentation, formally called Sentence Boundary Detection. Section 4 presents the extrinsic evaluation of DeepBond, using syntactic complexity metrics of Coh-Metrix-Dementia to measure the impact of using DeepBond to automatically segment narratives of neuropsychological tests.

Coh-Metrix-Dementia
Coh-Metrix-Dementia has been used to extract 73 features of oral narrative productions based on a sequence of pictures from the Cinderella story. Narratives of CTL, AD, and MCI patients were used in experiments with machine learning classification and regression methods (ALUÍSIO et al., 2016b). In that study, it was possible to separate CTL, AD, and MCI with an F1 score of 0.817, and to separate CTL and MCI with an F1 score of 0.900. As for machine learning regression, the best results for Mean Absolute Error (MAE) were 0.238 and 0.120 for scenarios with three (CTL, AD and MCI) and two classes (CTL and MCI), respectively. The most discriminative features for the classifier and regressor were: dependency distance, the Yngve and Frazier syntactic complexity metrics, the informativeness metric idea density (CUNHA et al., 2015), and disfluency metrics such as average duration of pauses, average number of short pauses, and average number of vowel prolongations.
The architecture of the Coh-Metrix-Dementia environment is depicted in Figure 1. It receives, as input, two versions of the narratives to be analyzed: (i) the original transcription, with several kinds of annotations, and (ii) a clean transcription of a patient's speech sample, separated into sentences and capitalized.
In the original transcript, segments with hesitations or repetitions of more than one word or segments of a single word are annotated. The labels used for this kind of disfluency are <disf> and </disf>. Repetitions of unique words are captured automatically by the tool and do not require manual annotation. Empty emissions, which are comments that are not related to the topic of narration or confirmations, such as "né" (alright), are also annotated. Empty emissions are delimited by <empty> and </empty>. Prolongations of vowels (indicated by :::), short pauses (indicated by ...) and long pauses (indicated in seconds ((pausa xx segundos))) are also annotated. The six metrics related to these annotations are calculated as averages over the length of the narrative.
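As a rough illustration of how the annotation conventions above could be processed, the following Python sketch counts the annotated phenomena and strips them to approximate a clean, unsegmented transcript. The function names, the regular expressions, the sample utterance, and the assumption that "xx" in the long-pause notation is an integer are all ours, introduced for illustration; this is not the Coh-Metrix-Dementia implementation.

```python
import re

# Patterns for the annotation conventions described in the text.
# The exact regular expressions are illustrative assumptions.
DISF = re.compile(r"<disf>.*?</disf>", re.DOTALL)
EMPTY = re.compile(r"<empty>.*?</empty>", re.DOTALL)
LONG_PAUSE = re.compile(r"\(\(pausa \d+ segundos?\)\)")
SHORT_PAUSE = re.compile(r"\.\.\.")
PROLONGATION = re.compile(r":::")


def count_annotations(transcript):
    """Count each kind of annotation in an annotated transcript."""
    return {
        "disfluencies": len(DISF.findall(transcript)),
        "empty_emissions": len(EMPTY.findall(transcript)),
        "long_pauses": len(LONG_PAUSE.findall(transcript)),
        "short_pauses": len(SHORT_PAUSE.findall(transcript)),
        "prolongations": len(PROLONGATION.findall(transcript)),
    }


def strip_annotations(transcript):
    """Remove annotated spans and pause marks, keeping the remaining words."""
    text = transcript
    for pattern in (DISF, EMPTY, LONG_PAUSE, SHORT_PAUSE):
        text = pattern.sub(" ", text)
    # A prolongation mark is attached to a word, so drop it without a space.
    text = PROLONGATION.sub("", text)
    return " ".join(text.split())


# Hypothetical annotated utterance, invented for this example.
sample = (
    "a cinderela <disf>ela ela</disf> morava... com a madrasta "
    "<empty>né</empty> ((pausa 3 segundos)) e::: as irmãs"
)
print(count_annotations(sample))
print(strip_annotations(sample))
```

Averaging such counts over the narrative length, as the text describes, would then only require dividing by the number of words or sentences, depending on how length is defined.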
In the study by Aluísio et al. (2016b), narrative samples were recorded and transcribed by a trained researcher and sentence boundaries were marked later by a single researcher according to semantic and syntactic cues and the annotation of short and long pauses included in the original transcription; there was no distinction among the several types of disfluencies besides vowel prolongations.
A refined categorization of the types of disfluencies is welcome, both to provide features that better distinguish the groups of interest, such as MCI and CTL, and to automate their removal to guarantee successful parsing. In this paper, we have annotated several types of disfluencies over the same dataset in a double-blind annotation experiment to measure the inter-annotator agreement between annotators following a manual based on Saffran et al. (1989). The removal of disfluencies is used as a first step for the sentence segmentation phase. Therefore, here we follow a manual to annotate sentence boundaries, different from the annotation used in Aluísio et al. (2016b). More details of the proposed manual annotation of narratives are presented in Section 4.3.
Figure 2 shows the two versions of a narrative that are expected in the Coh-Metrix-Dementia environment. The clean transcript is enumerated here to be compared with a second annotation proposed in this study.
After analyzing the versions of a narrative, Coh-Metrix-Dementia outputs a set of 73 textual metrics, divided into 14 categories: Ambiguity, Anaphoras, Basic counts, Connectives, Constituents, Coreferences, Disfluencies, Frequencies, Hypernyms, Logic operators, Latent Semantic Analysis, Semantic density, Syntactical complexity, and Pronouns, Types & Tokens. More detail can be found in the help section of the environment.

DeepBond: automatic sentence segmentation of narratives
Sentence Boundary Detection (SBD) is the task of segmenting text into sentences; here it is applied to narratives from neuropsychological tests, which are available as audio transcriptions. SBD attempts to break a text into sequential units that correspond to sentences, and can be applied either to written text or to audio transcriptions, whose units do not necessarily end in final punctuation marks but are complete thoughts nonetheless. Performing SBD on speech transcripts is more complicated due to the lack of information such as punctuation and capitalization. Moreover, the text output is susceptible to recognition errors when Automatic Speech Recognition (ASR) systems are used for automatic transcription (GOTOH; RENALS, 2000).
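To make the task concrete, the sketch below turns a punctuated reference text into the kind of input an SBD system sees: lowercased words without punctuation, each paired with a flag saying whether a sentence boundary follows it. The function name and the whitespace tokenization are illustrative assumptions; the set of boundary marks follows the definition this paper gives for DeepBond (period, exclamation mark, question mark, colon, semicolon).

```python
# Marks treated as sentence boundaries, following the paper's definition.
BOUNDARY_MARKS = {".", "!", "?", ":", ";"}
# All punctuation stripped from word ends (commas are removed but are not boundaries).
STRIP_CHARS = ".,!?:;"


def to_sbd_examples(text):
    """Return (lowercased word, boundary_follows) pairs for a punctuated text."""
    examples = []
    for token in text.split():
        word = token.rstrip(STRIP_CHARS)
        if not word:
            # Stand-alone punctuation: attach the boundary to the previous word.
            if examples and any(ch in BOUNDARY_MARKS for ch in token):
                examples[-1] = (examples[-1][0], True)
            continue
        trailing = token[len(word):]
        boundary = any(ch in BOUNDARY_MARKS for ch in trailing)
        examples.append((word.lower(), boundary))
    return examples


pairs = to_sbd_examples("Ela chorou. Mas não coube nelas.")
print(pairs)
```

The word sequence alone (without the flags) simulates a transcript with no capitalization or punctuation, and the flags serve as the reference labels for evaluation.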
Treviso et al. (2017a) proposed an automatic SBD method for impaired speech in Brazilian Portuguese, to allow a neuropsychological evaluation based on discourse analysis. The method uses RCNNs (Recurrent Convolutional Neural Networks) which treat prosodic and textual information independently, reaching state-of-the-art results for impaired speech. This study also showed that it is possible to achieve good results when comparing them with prepared speech, even when practically the same quantity of text is used.
In a follow-up study, Treviso et al. (2017b) showed that by using only a good word embedding model to represent textual information it is possible to achieve results similar to the state of the art for impaired speech. Their study set out to verify which embedding induction method works best for the sentence boundary detection task, specifically whether it is one of those proposed to capture semantic, syntactic, or morphological similarities.
Here, we used the version of DeepBond presented in Treviso et al. (2017a). A boundary is defined as a period, exclamation mark, question mark, colon or semicolon; each word is classified as either followed by a boundary or not, i.e., our problem is one of binary classification. DeepBond consists of a linear combination of two models. The first model is responsible for treating only lexical information, while the second treats only prosodic information. In order to obtain the most probable class, a linear combination was created between the two models, where one receives the weighted complement of the other.
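One way to read the linear combination described above is p = alpha * p_lexical + (1 - alpha) * p_prosodic, applied to the boundary probability of each word. The sketch below is a minimal illustration under that reading; the function names, the weight alpha = 0.5, the decision threshold of 0.5, and the probabilities in the usage example are assumptions, not values reported for DeepBond.

```python
def combine(p_lexical, p_prosodic, alpha=0.5):
    """Weighted combination of per-word boundary probabilities.

    The lexical model receives weight alpha and the prosodic model
    receives the complement (1 - alpha). alpha is a hypothetical value.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if len(p_lexical) != len(p_prosodic):
        raise ValueError("both models must score the same words")
    return [alpha * pl + (1.0 - alpha) * pp
            for pl, pp in zip(p_lexical, p_prosodic)]


def predict_boundaries(p_lexical, p_prosodic, alpha=0.5, threshold=0.5):
    """Label a word as a boundary when the combined probability reaches the threshold."""
    return [p >= threshold for p in combine(p_lexical, p_prosodic, alpha)]


# Hypothetical probabilities for three words.
labels = predict_boundaries([0.9, 0.2, 0.6], [0.7, 0.1, 0.2])
print(labels)
```

With alpha = 1 only the lexical model is used and with alpha = 0 only the prosodic one, so the weight can be tuned on held-out data to reflect how informative each source is for a given kind of speech.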

4 Extrinsic evaluation using syntactic complexity metrics

Datasets
We used three datasets in this study. For all of them we removed capitalization information and left all disfluencies intact in order to simulate a high-quality ASR system. Table 1 presents the relevant statistics for each dataset. The demographic information about the datasets is presented in Table 2.

Spontaneous speech: MCI and healthy controls narratives
The first dataset of discourse tests is a set of spontaneous speech narratives, based on a book of sequenced pictures from the well-known Cinderella story. In the test, an individual receives the book, then verbally tells the story to the examiner. The narrative is manually transcribed by a trained annotator who scores the transcription by counting the number of recalled propositions. This dataset consists of 60 narrative texts of BP speakers (20 controls, 20 AD patients, and 20 MCI patients) diagnosed at the Medical School of the University of São Paulo (FMUSP), and was also used in Aluísio et al. (2016b). Counting all patient groups, this dataset has an average of 30.72 sentences per narrative, and each sentence averages 12.92 words.
The second dataset of neuropsychological tests is available from the Bateria de Avaliação da Linguagem no Envelhecimento (BALE) ("Battery of Language Assessment in Aging", in English), which is under validation (JERONIMO, 2015; HÜBNER et al. [in preparation]). The battery aims at tackling some of the language impairments normally associated with MCI and AD. It includes tasks assessing naming, episodic verbal memory, semantic judgement, semantic categorization at the word level, and metaphor comprehension and completion at the sentence level, as well as narrative production based on a sequence of story scenes, free narrative production on a given topic (news and funny event) and narrative retelling from a story orally presented. Moreover, the tasks were developed so that their administration is adjusted to include linguistic data from illiterate participants and participants with a lower educational level, population samples which are very common in the Brazilian public health system.
Here, we used the transcriptions of 10 narratives taken from the narrative production test, based on the presentation of a set of seven pictures telling the story of a boy who hides a dog that he found on the street (The dog story (LE BOEUF, 1976)). The participants are asked to carefully observe the pictures, displayed in the correct sequence, and as soon as they feel confident to start telling the story, their production is recorded. The test administrator tries not to interfere, as in the previous task. Assessment includes the amount and quality of recorded propositions from the text. Complementary assessment included comprehension questions approaching the micro and macrostructural aspects of the narrative, as well as semantic and syntactic quantitative and qualitative aspects. Because this dataset is also composed of patient narratives, we can evaluate how well our model behaves on data from the same domain, where the story and vocabulary of the narratives are different from the ones on which the model has been tested. On average, narratives in this dataset have 16.60 sentences, and sentences have 6.58 words. Compared with the first dataset, this one has fewer sentences, and the sentences have fewer words on average.

Prepared speech: Brazilian Constitution
The third dataset was made available by FalaBrasil, a project at the Federal University of Pará's Signal Processing Laboratory (BATISTA, 2013). This dataset is composed of articles of Brazil's 1988 constitution, in which the speech is read. Each file has an average of 30 seconds of transcribed speech. To use these files in our scenario a preprocessing step was necessary, which removed lexical cues that indicate the beginning of articles, sections and paragraphs. This removal was carried out on both the transcripts and the audio. In addition, we reorganized the dataset by articles, yielding 357 texts in total. Then we marked the end of each article and paragraph, and inserted punctuation at the end. Titles and chapters were ignored in this process. We randomly selected 60 texts from this dataset, with the only condition that the number of sentences in each text should be greater than 12. We refer to the larger dataset as Constitution L, and to the dataset with the 60 texts as Constitution S. The average number of sentences in each text of Constitution L is 7.56, and the average size of these sentences is 23.45 words, while Constitution S has 23.48 sentences on average, and these sentences have an average of 21.66 words.

Segmentation of the speech corpora
Word and sound boundaries must be identified in order to label useful audio excerpts. Fluent listeners hear speech as a sequence of discrete sounds even when there are no pauses in the waveform. This segmentation is not as trivial for a machine, which receives a single signal. In our algorithm we segment the audio excerpts from the corpus at phone and word boundaries by forced alignment (YUAN; LIBERMAN, 2008). Forced alignment uses a trained acoustic model to predict phoneme sequences, then uses the orthographic transcriptions to force the recognized phonemes into their likely transcriptions based on the words present, and attempts to join the transcriptions with the correct timestamps in the audio signal. The forced alignment method requires two indispensable components: (i) a robust acoustic model; and (ii) a well-designed pronunciation model, since the more phoneme sequences are added, the more training examples are required, and the more closely related the phoneme sequences occurring in similar phonological contexts are, the more difficulties the model will encounter. In this work a dictionary was built using automatic transcriptions from Petrus (MARQUIAFÁVEL, 2015) for all words in the scripts used for training and testing. Since this was a pilot experiment, adaptations were not made for multiple pronunciations; these will be included in the future.

Segmentation and annotation of the narratives
The segmentation of the narratives which were the basis for the assessment of Coh-Metrix-Dementia reported in Aluísio et al. (2016b) was performed by a single person, without the support of a spontaneous speech annotation method for clinical data or an annotation manual. Therefore, in this work we re-annotated the disfluencies and the segmentation into sentences based on the work of Saffran et al. (1989), using a 3-step process for annotating the propositions of an ungrammatical narrative: (1) removal (by annotation) of text excerpts, here termed narrative "non-words"; (2) segmentation into sentences and judgment of each sentence as (+) or (-), where (+) is for sentences that are prosodically, syntactically and semantically well formed in the argument structure of the sentence; and (3) annotation of propositions in the well-formed sentences.
The annotation was performed by peers, using the brat annotation tool, and the agreement between peers was measured by the kappa statistic. The kappa statistics for steps (1) and (2), which are of interest for this work, were calculated for the tasks of selection and categorization of the selected excerpt. In step 1 the categorization involves the following 10 types of non-words: (1) Neologisms, i.e. word creation; (2) Patient comments not related to the topic of narration or confirmations, for example, <empty>né ("alright?")</empty>; (3) False starts: then, well then, so; (4) Coordinating conjunctions (and, but, or); [...] adverbs with locative value: aqui ("here"), ali ("there"), cá ("over here"), lá ("over there").
In step 2, the categorization is whether the sentence is well-formed (+) or not (-).
The kappa value for non-word selection was 0.81 and for categorization was 0.91; for sentence boundary identification it was 0.84. All of these values are very high. For sentence categorization, however, it was 0.14, as the judges diverged strongly on which arguments are needed for a specific verb. This involves knowledge of semantic role labeling theory, which is not an easy concept and deserves an annotation manual of its own. The categorization is not important for this work, as we have to segment all the sentences regardless.
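For readers unfamiliar with the agreement measure, the sketch below computes kappa for two annotators who labeled the same items. It assumes Cohen's kappa, since the text does not say which variant was used; the function name and the toy (+)/(-) judgments are invented for illustration and are not the study's annotations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy well-formedness judgments for four sentences (hypothetical data).
annotator_1 = ["+", "+", "-", "-"]
annotator_2 = ["+", "-", "-", "-"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)
```

In this toy case the annotators agree on three of four items (0.75), but half of that agreement is expected by chance, which yields a kappa of 0.5. This is why a low kappa such as 0.14 can coexist with a fair amount of raw agreement.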
After the evaluation of our annotation manual and inter-annotator agreement, the segmentation into sentences was carried out before the non-words were removed. This was done in order to simulate an ASR system.
Of particular interest to this article are the narrative non-words (many of which are disfluencies) and sentence segmentation, using semantic, syntactic and prosodic cues. Of the 10 types of non-words, 9 were removed from the original transcript for the extrinsic evaluation of this study. The conjunctions were kept, as they are handled well by parsers as long as they are separated by commas throughout the narrative.
Figure 3 shows the same excerpt from Figure 2, using the new annotation for segmentation, with annotated disfluencies (marked with *) to be removed. It shows, for example, two reformulations, in sentence 3 and in sentence 8, which were removed (marked with an "*"); sentence 8 contains a false start and sentence 1 a comment. This annotation process generates short sentences, allowing for higher success for analysis by the parsers.

Metrics evaluated
Some of the metrics used in automatic evaluation studies of speech in the clinical field are influenced by the way in which the transcription is segmented, because they depend on the robustness of the parser being used and on the characteristics of the annotation used in the datasets on which the parsers were trained. In our study of BP narratives, the syntactic metrics of Coh-Metrix-Dementia used the dependency parser MALT-parser (NIVRE et al., 2006), trained with the dataset for the CoNLL-X 2006 Multi-lingual Dependency Parsing task, and the constituency parser LX-parser (SILVA et al., 2010). The former was the parser with better performance, and the latter was chosen because it is the only freely available constituency parser for Portuguese.
In this study, we intended to assess whether automatic segmentation had an impact on the syntactic metrics; therefore, disfluencies were removed from the narratives in both the manually and the automatically segmented versions, so that segmentation was the only variable under evaluation. We selected 9 syntactic metrics from Coh-Metrix-Dementia to see whether there was any significant difference between manual and automatic segmentation.
The metrics mean Yngve's complexity (YNGVE, 1960), Frazier's complexity (FRAZIER, 1985), mean clauses per sentence, noun phrase incidence, modifiers per noun phrase and pronouns per noun phrase depend on the constituent structure, and the dependency distance metric is calculated from the output of the dependency parser. Of the constituency-based metrics, the first two depend on the success of analyzing the tree as a whole; the third depends on the correct identification of verb phrases; and the last three depend on the correct identification of noun phrases. The two remaining metrics, words per sentence and number of sentences, are correlated, since the greater the number of sentences, the smaller their size.
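To illustrate why these metrics are sensitive to segmentation, the sketch below computes two of them for toy input. It assumes that dependency distance is the mean absolute difference between the position of each word and the position of its head, with the root excluded; the exact definition used in Coh-Metrix-Dementia may differ. The function names, the example sentence, and its head indices are invented for illustration.

```python
def mean_dependency_distance(heads):
    """Mean |position - head position| over the non-root words of one sentence.

    heads[i] is the 1-based position of the head of word i + 1;
    0 marks the root, which is skipped.
    """
    distances = [abs(position - head)
                 for position, head in enumerate(heads, start=1)
                 if head != 0]
    return sum(distances) / len(distances) if distances else 0.0


def words_per_sentence(sentences):
    """Average number of whitespace-separated words per sentence."""
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


# "o menino achou um cachorro" with hypothetical heads:
# o -> menino, menino -> achou, achou -> root, um -> cachorro, cachorro -> achou
distance = mean_dependency_distance([2, 3, 0, 5, 3])
print(distance)

# The same six words split into one or two sentences.
print(words_per_sentence(["a b c d e f"]))
print(words_per_sentence(["a b c d", "e f"]))
```

Because the total number of words in a narrative is fixed, inserting an extra boundary raises the number of sentences and lowers words per sentence, and it also changes which words the parser may link, which is how a segmentation error propagates to the parser-based metrics.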

Results and discussion
The results are given in Table 3. The comparisons were analyzed using the Wilcoxon signed-rank test for paired data, with a significance level of 5% (p-value < 0.05); the null hypothesis is that the metrics have equal averages for manual and automatic segmentation. Only the metric modifiers per noun phrase for the MCI group presented a statistically significant difference. Comparing the resulting annotations of Figure 2 and Figure 3, one can see that 6 sentences in Figure 2 were transformed into 9 sentences. Since manual segmentation was used only for lexical information, and the beginning of the sentences is well defined by the discourse markers "então" ("then") and "ai" ("there") and finally by the confirmation marker "né" ("alright"), our method was able to learn this information. Occasionally, however, "então" ("then") is used as a conjunction, and our method could end up adding a period before this marker; this can be seen in the first example below.
Depending on how the sentences are segmented, the parser may fail to generate a noun phrase, either because of an error in the model or because the sentence really has no noun phrase; this is shown in the second pair of examples, in which the second sentence ("Mas não coube nelas.") does not have a noun phrase. This discordance between manual and automatic segmentation ends up generating a difference in the metrics, and since the modifiers per noun phrase metric makes an analytical differentiation within a small context, it may be more susceptible to these small variations. Figure 4 shows three pairs of examples with manual and automatic segmentation, with the value of the modifiers per noun phrase metric in parentheses.
We also analyzed whether any of the 9 metrics were able to distinguish the groups.Although we know that in a machine learning approach the features in tandem help to correctly classify groups, for a manual analysis of the clinical metrics results this distinction is important.
Table 4 shows the p-values for the comparison between CTL and MCI under manual and automatic segmentation. Group differences were measured using the Mann-Whitney non-parametric statistical test for unpaired data, with a significance level of 5% (p-value < 0.05); the null hypothesis is that the mean for CTL is the same as the mean for MCI. Table 4 indicates that Frazier's complexity shows a statistically significant difference in the automatically segmented samples. One can also see that this metric has a low p-value under manual segmentation, but not low enough to reject the null hypothesis. Even though for most metrics it is not possible to affirm the existence of a statistical difference between the groups, we know that some syntactic metrics help in the automatic classification process, as presented by Roark et al. (2011) and Aluísio et al. (2016b). Adequate segmentation and the removal of disfluencies help the parser avoid generating incorrect trees.
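The two kinds of comparison used in this section, paired (manual versus automatic segmentation of the same narratives) and unpaired (CTL versus MCI), can be illustrated with the following self-contained sketch. It implements the Wilcoxon signed-rank and Mann-Whitney U statistics with a normal approximation for the two-sided p-value, without tie or continuity corrections; for samples as small as the ones in this study an exact test from a statistics package may be preferable. The function names and the toy values are assumptions for illustration, not the study's data or the software the authors used.

```python
import math


def average_ranks(values):
    """Rank values starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def two_sided_p(z):
    """Two-sided p-value of a standard normal z score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def wilcoxon_signed_rank(x, y):
    """Paired test: returns (min(W+, W-), approximate two-sided p-value)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    return w, two_sided_p((w - mean) / sd)


def mann_whitney_u(x, y):
    """Unpaired test: returns (min(U1, U2), approximate two-sided p-value)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    u = min(u1, n1 * n2 - u1)
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return u, two_sided_p((u - mean) / sd)


# Toy metric values (hypothetical).
manual = [1.0, 2.0, 3.0, 4.0, 5.0]
automatic = [2.0, 4.0, 6.0, 8.0, 10.0]
w_stat, w_p = wilcoxon_signed_rank(manual, automatic)
u_stat, u_p = mann_whitney_u([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(w_stat, w_p)
print(u_stat, u_p)
```

The paired test ranks the per-narrative differences, so it is suited to comparing two segmentations of the same transcripts, whereas the unpaired test ranks the pooled values of two independent groups.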

Conclusions and future work
We showed that our model, using a recurrent convolutional neural network, benefits from word embeddings and can achieve promising results even with a small amount of data. We found that our method is better for cases where speech is planned, since the prosodic features lend more weight for classification. The results of our evaluation indicate that only the metric modifiers per noun phrase for the MCI group presented a statistically significant difference between automatically and manually segmented transcripts. These results suggest that DeepBond is robust enough to analyze impaired speech and can be used in automated discourse analysis tools to differentiate narratives produced by MCI individuals and healthy controls, and in similar studies. As for future work, we intend to analyze the impact of segmentation in other tests.
An ideal system for automatic detection of cognitive impairments would need to be fully automated. Our future work includes a pipeline featuring a full ASR system for Brazilian Portuguese which would be robust enough to handle impaired speech. The recognition output would then be piped into a disfluency detection stage, followed by the sentence segmentation presented in this work, so that it could be properly treated by the NLP tools in Coh-Metrix-Dementia, and finally into a classification stage to output whether or not the subject can be classified as having MCI. This approach might permit the screening of MCI through a computerized test using tablets. Automatic speech recognition for this purpose presents a series of challenges. Firstly, one can imagine that the eventual administrator of this tool will not be in a noise-free environment, so the acoustic model must be robust enough to function well in a noisy hospital or clinic. Secondly, cognitively impaired speech differs in many ways from "normal" speech. This becomes a double-edged sword for our task, as this piece of information is very important for MCI detection but very cumbersome for the ASR, because most systems are not robust enough to anticipate these differences. We plan to work on both of these fronts by building methods as well as corpora suitable for the ultimate task.

Figure 2. Excerpt from a narrative of a patient with MCI, showing the original transcript and the clean transcript, capitalized and segmented into sentences.

Figure 3. Segmentation annotation with emphasis on the removal of disfluencies, informed by the annotation manual.

Figure 4. Pairs of examples with manual and automatic segmentation.

Table 1. Narrative statistics for each dataset.

Table 2. Demographic information of the participant groups.

Table 3. Values shown are mean (standard deviation) and p-value. Bold values denote statistical significance at the p < 0.05 level.

Table 4. A comparison of CTL and MCI. Bold values denote statistical significance at the p < 0.05 level.