Compilation of a University Learner Corpus

Corpus Linguistics (CL) and Second Language Acquisition (SLA) have provided complementary backgrounds for researchers interested in contrastive interlanguage analysis (Granger, 1998), shedding light on the acquisition of English by various learner groups. In Brazil, there is a paucity of research describing university learner English in ways that allow pedagogical interventions to fit learners' needs. The main objective of this paper is to describe the compilation of a Brazilian university-level learner corpus, CorIsF-Inglês, and to illustrate how a frequency analysis can reveal learner choices when they perform different written tasks. The type of task, independent or integrated, is likely to have influenced the frequency of nouns, verbs and adjectives learners used.


INTRODUCTION
Studies on learner corpora started at the end of the 1980s (Granger, 2015). Before that, mainly in the 1960s and 1970s, with error analysis studies, learner language research focused on data that was rarely controlled by testing or language classroom conditions (Granger, 1998). Researchers interested in language acquisition were, at times, very concerned with attesting a theory. White (1989) and Schachter (1988), for instance, were concerned with the availability of Universal Grammar (Chomsky, 1981) in second language acquisition (SLA). Scholars who sided with cognitive accounts of SLA have argued that language learning involves all aspects of cognitive processing (MacWhinney, 1987) and that interlanguage (Selinker, 1972), learners' production in a second language (SL), should be studied as a system in itself. This cognitive view of learners' use of an SL has matched the interest of some corpus linguists who have been studying learners' language since the 1990s. The advent of more accessible computers and ways of storing data has made it possible, then, for corpus linguistic tools to be used by more researchers and, according to Granger (2015), contrastive interlanguage analysis (CIA) has flourished since 1996.
Although the increase in CIA studies has been steady, Leech (1998, p. xvi) predicted that when "SLA meets corpus linguistics" the encounter would not be so smooth. Corpus linguists may not be well versed in SLA issues, and SLA researchers may not be aware of all the tools that corpus linguistics can provide, or may prefer to focus their investigations not on learners' production but on the mental processes involved in learning a language. Fortunately, Granger's 1998 book, the first volume of its kind, opened the doors for many other publications involving CL and SLA. The growing interest in empirical studies has given support to corpus linguists who study learner production, either oral or written.
The compilation of a learner corpus is challenging, especially due to the complexity of collecting a large amount of data. One way of facing this problem is to develop international projects. Some of these projects have been very successful, such as the International Corpus of Learner English (ICLE), which is already in its second version (Granger et al., 2009), and LINDSEI (Louvain International Database of Spoken English Interlanguage). ICLE has been compiled in 16 different countries (e.g. Japan, Belgium, Netherlands) among university-level language students who wrote argumentative essays. LINDSEI has the same type of participants, but its data is being collected to form an oral corpus. This project has 20 partners from different countries (e.g. Greece, Italy, China), and 13 of them have already completed data compilation. Although there is a need for more studies on Brazilian learner corpora, some systematic investigations have been carried out on lexical bundles in written production (Dutra & Berber-Sardinha, 2013; Shepherd, 2009) and on the design of an oral learner corpus (Mello et al., 2012). Other specialized learner corpora have been compiled, such as the Corpus of Academic Learner English (CALE), which comprises seven different academic text types (e.g. research papers, reading reports, abstracts, reviews) and also has partners from different countries. Other corpora aim at gathering a large amount of learner data from one specific country and language background, e.g. the Jinan Chinese Learner Corpus (Wang et al., 2015) and the Corpus do Inglês sem Fronteiras, described in this article and also in Dutra et al. (in press). There are also studies carried out in Brazil that compiled small learner corpora, which, unfortunately, are in most cases restricted to the researchers' own use (Alcântara, 2015).
The main objective of this paper is to describe the compilation of a Brazilian university-level learner corpus (CorIsF-Inglês). The compilation is based on pre-set parameters that are essential for making subcorpora comparable. These parameters are described in the next section. In order to provide a sample of the type of analysis that Corpus Linguistics can facilitate, a partial data analysis based on lexical frequency is presented. Data was extracted from integrated and independent tasks with the purpose of yielding insights into Brazilian learners' interlanguage.

METHODOLOGY
In this part of the paper we describe CorIsF-Inglês according to the following characteristics: participants, data collection, corpus design and data analysis.

Participants
As mentioned in the introduction, Brazil has a university-level corpus, Br-ICLE, compiled at several universities in the country. It is, however, restricted to texts produced by English major students, which makes it quite different from the corpus we describe in this article, namely CorIsF-Inglês. This corpus, given its objective of compiling Brazilian university-level student interlanguage, focuses on the collection of texts written by participants from different college courses. Students who have given us permission to use their texts for the corpus are enrolled in the face-to-face courses of the English without Borders Program. The target audience comes mainly from the hard sciences and health courses. Nevertheless, students from arts and humanities have been able to register in some universities, depending on the offering of English courses and the students' interest in taking them. The inclusion of their texts in CorIsF-Inglês is authorized after they read the consent form and agree with its terms. Although at first all the texts are identified, so that teachers can give their students feedback, they are given a number as soon as they are sent to the corpus managers. Other major differences between Br-ICLE and CorIsF-Inglês lie in mode, task variety and type of data collection.

Data collection
The data collected for CorIsF-Inglês come from tests and activities designed for the IsF English courses, which shows that the primary motivation is to meet students' needs and, as a consequence, to compile a corpus. IsF teachers have worked together to prepare integrated skills online tests for levels A2 (high basic), B1 (intermediate), B2 (high intermediate) and C1 (advanced)⁹ so as to prepare their students to take proficiency tests (Dutra, in press). Most of the IsF audience is clearly interested in taking international proficiency tests, such as TOEFL ITP, TOEFL iBT or IELTS¹⁰, because they plan to apply for academic scholarships abroad. Although most IsF courses are not preparatory courses for proficiency tests, it seems reasonable to give students chances to take in-class tests. These tests may help learners develop skills that will enable them to demonstrate their linguistic knowledge even under time constraints. As soon as the idea of preparing online tests was presented, the teachers realized that the test results would allow them to keep a record of learners' linguistic development for pedagogic and research purposes. Course activities that have also generated texts for CorIsF-Inglês are the ones proposed in skill-specific courses, such as academic writing or academic speaking. Text genres produced in these courses are, for instance, summaries and oral presentations.
As mentioned before, the first motivation for online test preparation is pedagogic. When students take a 64-hour course, they can take the test at three different points in the term (beginning, middle and end of the course). In addition, students who take part in IsF courses at different levels for more than one term may have their texts compiled in more than one term¹¹. From a pedagogic perspective, teachers have comparable samples from students at the same level, which allows for the preparation of tailor-made activities that cater for students' needs. Teachers can provide specific feedback to their students and/or adapt course materials based on corpus analysis. Since data and metadata are available to CorIsF-Inglês partner institutions, several types of analysis can be done.
⁹ The IsF course levels are based on the Common European Framework of Reference (CEFR) <http://isf.mec.gov.br/ingles/pt-br/qual-e-meu-nivel-de-proficiencia-em-ingles> and they have been offered from basic level (A2) on.
¹⁰ The acronyms are TOEFL ITP (Test of English as a Foreign Language - Institutional Testing Program), TOEFL iBT (Test of English as a Foreign Language internet-based test) and IELTS (International English Language Testing System).
¹¹ None of the texts compiled for research carries participants' identification. It is the research group's responsibility to look for learners' texts produced at different points in time, so as to include them in the longitudinal section of the corpus. The cross-sectional part of the corpus, presented in this article, carries texts that were produced by different participants. In other words, these participants contributed only once to this part of the corpus.
• what format the questions should have (multiple choice or open-ended);
• how to save files on the office computer or on personal computers;

• which internet resources could be used (e.g. YouTube, TED-Ed);
• how to give feedback to students (e.g. automatic feedback using a free online program called Flubaroo);
• choosing test themes;
• forming small teacher groups according to test themes and CEFR level, so they could prepare the tests;
• sharing activities to receive feedback from other teachers and English Teaching Assistants (ETAs);
• making tests available to students through Google Docs;
• sending automatic results to students and teacher;
• sending written or oral texts to the teacher for group and individual feedback;
• sending the results to CorIsF-Inglês for storage and for sharing them with partners.

Corpus design
The design of CorIsF-Inglês (Table 1) allows for the compilation of oral and written learner language in a variety of genres. Data comes from timed activities such as tests or online activities (e.g. argumentative essay or opinion response) or may be the result of preparation and/or several drafts, as in process-oriented course activities (e.g. presentations or abstracts). Therefore, data can be sorted according to research interest.
For instance, researchers may analyze verb tense usage in timed and untimed activities so as to depict appropriateness of tense variation.
Data (students' texts) and metadata (information about text genre, text production conditions, participant's age, TOEFL score, course, etc.) are saved in .csv (comma-separated values) files, which organizes them in spreadsheets that can be easily manipulated by teachers and research partners. Before the data are made available to all partners, they are carefully screened so that only authorized texts become part of the corpus. Each text receives a reference number and is cleaned (e.g. typos and letter and word repetitions are removed). The data are processed with R, a free software environment for statistics and graphics that can be widely used in corpus linguistics.
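The storage and screening steps described above can be sketched roughly as follows. The authors process their data in R; Python is used here only for illustration. The column names, the two sample rows and the cleaning rule (collapsing extra spaces and immediately repeated words) are assumptions made for this sketch and are not the actual CorIsF-Inglês layout or procedure.

```python
import csv
import io
import re

# Hypothetical CSV layout: column names and rows are illustrative only.
SAMPLE_CSV = """student,genre,task_type,cefr_level,text
Ana,argumentative essay,independent,B1,"I think think that people can learn  English."
Joao,summary,integrated,A2,"Americans drink drink a lot of coffee."
"""

def clean_text(text):
    """Collapse extra whitespace and drop immediately repeated words."""
    text = re.sub(r"\s+", " ", text).strip()
    # A word followed by one or more copies of itself is reduced to one copy.
    return re.sub(r"\b(\w+)( \1\b)+", r"\1", text, flags=re.IGNORECASE)

def load_corpus(csv_text):
    """Return records in which the student name is replaced by a number."""
    records = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for ref_number, row in enumerate(reader, start=1):
        records.append({
            "ref": ref_number,  # reference number instead of a name
            "genre": row["genre"],
            "task_type": row["task_type"],
            "cefr_level": row["cefr_level"],
            "text": clean_text(row["text"]),
        })
    return records

corpus = load_corpus(SAMPLE_CSV)
for record in corpus:
    print(record["ref"], record["task_type"], record["text"])
```

Keeping the metadata next to each text is what makes it possible to filter the records later, for example by task type or CEFR level.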
One of the corpus characteristics that can be singled out is the type of task that the participants performed: independent or integrated. Independent tasks require learners to use their world knowledge and personal experiences to produce texts, such as argumentative essays (written mode) or opinion responses (oral mode). Integrated tasks, on the other hand, make participants use information presented in written and oral texts or even in infographics or graphs. They are, thus, asked to select and report information, using relevance as the criterion for making comparisons. Lexical-grammatical patterns seem to be influenced by the type of task proposed (Biber & Gray, 2013), and this tendency needs to be thoroughly investigated in CorIsF-Inglês, given our interest in better understanding Brazilian learners' interlanguage at different acquisition stages.

Generating and analyzing data
Using the R software, frequency lists with and without stopwords were generated to support the partial analysis of the corpus. Stopwords are words that carry little informational content, e.g. an, the, on, in. These lists have been used to create word clouds that show visually the prominence of words in a corpus. At this early stage of our research, parts of the corpus are unbalanced, which means that independent tasks have yielded much more data than integrated tasks. Word frequency lists and word clouds have been used to organize our data for analysis. The most frequent grammatical categories were identified and their frequency was related to specific genres.
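A frequency list with and without stopwords can be sketched as below. This is not the authors' R script: it is a minimal Python illustration in which the tokenizer, the very short stopword list and the two sample sentences are all assumptions.

```python
from collections import Counter
import re

# Tiny illustrative stopword list; a real study would use a fuller one.
STOPWORDS = {"a", "an", "the", "on", "in", "of", "to", "and", "is", "that"}

def tokenize(text):
    """Lowercase the text and extract runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(texts, remove_stopwords=False):
    """Count word tokens across texts, optionally skipping stopwords."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        if remove_stopwords:
            tokens = [t for t in tokens if t not in STOPWORDS]
        counts.update(tokens)
    return counts

# Invented sample sentences, not corpus data.
texts = [
    "I think that people can learn a language in the school.",
    "People think the first impression is important.",
]
with_stop = frequency_list(texts)
without_stop = frequency_list(texts, remove_stopwords=True)
print(without_stop.most_common(3))
```

A word cloud is essentially a visual rendering of such a list, with word size scaled by frequency.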

CorIsF-Inglês PARTIAL ANALYSIS
The data presented in this section illustrates what type of analysis can be carried out based on a learner corpus. There is an array of possible research foci, and we emphasize that this sample analysis can be greatly improved once the corpus grows and is balanced. A lexical analysis is presented considering the two parts of the corpus: data from students' production in independent tasks and data from integrated tasks. We then discuss the grammatical categories that tend to emerge from these two types of tasks.
The independent task data comprises 104,437 words; it therefore makes up the majority of the CorIsF-Inglês data, which to date totals 130,999 words. This shows that most online tests prepared by our IsF group included this type of writing task. Besides generating a frequency list, we wanted to have a visual idea of which words tend to be produced more often by learners in independent written tasks. Therefore, word clouds were generated to provide a picture of the most prominent words in this part of the corpus.
Figure 1 presents the word cloud for independent task data without stopwords, revealing that the spotlight is on nouns and modal auxiliary verbs (people, person, can, will) as well as on adjectives (good and important) and verbs (think and like). At a very similar level of frequency there is also an adverb (first) and the word one, which can be classified in different ways according to its function in the sentence, for instance as a determiner or a pronoun. It is noteworthy that at the next level of frequency there are many nouns (e.g. religion, language, students, impression, water, school), probably chosen due to task topics and prompts. Verbs are the second most prominent grammatical category in the word cloud (Figure 1), including the main verbs think, like, know, want, learn and make as well as the modal auxiliary verbs can and will. Most of these main verbs are mental state verbs (think, know, want and learn), which makes evident that the expression of participants' ideas and thoughts is central in independent tasks, which require the production of position or argumentative essays (see Appendix C for an example). In such tasks writers are supposed to present ideas and to convince readers of their position.
The integrated task word cloud (Figure 2) was generated, excluding stopwords, from the 26,562 words compiled in the CorIsF-Inglês integrated task section. The much lower number of words in this part of the corpus, as compared to the independent task part, is, as already mentioned, largely due to the fact that most online tests included independent rather than integrated tasks. Nouns prevail in the word cloud (Figure 2), such as coffee, people and day. The next most frequent group of words also consists mainly of nouns (e.g. year, Americans, cups, caffeine, men and women). The prominent verbs are can (modal auxiliary verb), drink, divorced and consumed. The last two forms may also have been used in the tasks as adjectives. Two integrated task themes seem to have attracted most IsF teachers, and the word cloud clearly reflects this preference (the Coffee theme for level A2 and Love, Marriage and Divorce for B1). Nouns are, then, the most frequent grammatical items in the integrated task word cloud. This is not surprising, as integrated tasks require learners to describe and report information provided in infographics or charts, making comparisons if appropriate.
Comparing the results of the word frequency analysis of independent and integrated tasks, some differences arise. The type of task is likely to have influenced the frequency of nouns, verbs and adjectives learners chose to use. While independent tasks tend to favor a high occurrence of nouns, mental state verbs and some adjectives, integrated tasks create the environment for a prominent use of nouns. In the former case, arguments prevail and writers need to convince the audience of their position. Mental state verbs, for example, come in handy in argument construction. In the latter type of task, description is valued and verbs, although present in the texts, do not need to be the same. This variation in verb types, then, allows nouns to appear as the most frequent category in integrated writing tasks. Some nouns are clearly at a high frequency level because they are directly related to the task topic (e.g. coffee, divorce).
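The subcorpus sizes reported in this section can be checked with a few lines of arithmetic, which also shows how unbalanced the two parts currently are. The sketch below is in Python for illustration only. The per-10,000-word normalization is a common way of comparing counts across subcorpora of different sizes, but it is an assumption here, not a step reported by the authors, and the raw count used is hypothetical.

```python
# Word counts reported in the text.
TOTAL_WORDS = 130_999
INDEPENDENT_WORDS = 104_437
INTEGRATED_WORDS = 26_562

def share(part, whole):
    """Percentage of the whole corpus taken up by one part."""
    return 100 * part / whole

def per_10k(raw_count, subcorpus_size):
    """Normalized frequency: occurrences per 10,000 words."""
    return 10_000 * raw_count / subcorpus_size

# The two parts add up to the reported corpus total.
assert INDEPENDENT_WORDS + INTEGRATED_WORDS == TOTAL_WORDS

print(f"independent: {share(INDEPENDENT_WORDS, TOTAL_WORDS):.1f}%")
print(f"integrated:  {share(INTEGRATED_WORDS, TOTAL_WORDS):.1f}%")

# Hypothetical raw count, for illustration only (not a corpus result).
hypothetical_raw = 500
print(per_10k(hypothetical_raw, INDEPENDENT_WORDS))
print(per_10k(hypothetical_raw, INTEGRATED_WORDS))
```

The same raw count yields a much higher normalized frequency in the smaller integrated section, which is why raw frequencies from the two parts should be compared with caution until the corpus is balanced.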

CONCLUSION
In this article we presented the design of CorIsF-Inglês and explained how pedagogical motivations can be linked to research interests. A detailed design allows teachers and research group members to profit from data collection and to systematize both group and individual feedback, besides creating a corpus that can be widely investigated. The IsF community has a great chance to develop an array of studies that may focus on written interlanguage, on oral interlanguage or on the comparison of both modes.
It is also relevant to further investigate the results generated from independent and integrated tasks in order to discuss task design and, furthermore, to research how students at different levels perform in such tasks. A deeper analysis of the corpus, especially when it reaches 200,000 words at each proficiency level, should shed light on interlanguage and lead to a better understanding of it. From a pedagogic perspective, IsF teachers, who are also task planners, should look into corpus data to help their learners express themselves better in specific genres.
While Br-ICLE concentrated on written interlanguage, CorIsF-Inglês has been designed to compile both written and oral interlanguage. Moreover, while Br-ICLE collaborators compiled argumentative essays, CorIsF-Inglês researchers have aimed at collecting a variety of academic written genres, such as abstracts, summaries and essays. Another difference is that Br-ICLE provides data for cross-sectional studies, whereas the CorIsF-Inglês design can yield both cross-sectional and longitudinal data.
BELT | Porto Alegre, 2015; 6 (sp. n. - suppl.), s21-s33 | Original Article | Dutra, D. P. & Gomide, A. R. | Compilation of a University Learner Corpus
Online test preparation involves the following steps (Dutra et al., in press):
• opening a Gmail account so that the whole group of teachers has access to files created in Google Docs;
• taking technical decisions as a group:

Figure 1: Independent task word cloud without stopwords

Figure 2: Integrated task word cloud without stopwords