
Composition and Big Data

Edited by Amanda Licastro and Benjamin M. Miller

Reviewed by Angela Laflen

University of Pittsburgh Press, 2021

ISBN: 9780822946748


Video Review


In this video review, I examine Amanda Licastro and Benjamin Miller's edited collection Composition and Big Data. I provide an overview of the book and the critical questions it engages as a whole before turning to consider how each of the book's 16 chapters models an approach to using big data in composition. Most chapters provide insight into how composition scholars, teachers, and administrators 1) build corpora of data for analysis; 2) analyze corpora using a variety of programs, tools, and statistical tests; and 3) visualize their results for readers using tables and graphs. My video review highlights the programs, tools, tests, and data visualizations featured in the book as a way to demonstrate the range of big data methods considered in this volume and also because choosing among the numerous available programs, tools, tests, and visualization types can be intimidating to those new to working with data.


Amanda Licastro and Benjamin Miller's (2021) edited collection Composition and Big Data makes the case that data analysis is a viable means of studying composition, rhetoric, and writing and deserves a place in the discipline alongside other research methods used more frequently such as case studies and ethnographies. As Licastro and Miller explain, quantitative methods have long played a role in composition scholarship and administration. They refer to Janet Emig's (1971) influential work in The Composing Processes of Twelfth Graders as evidence. However, it is also true, as they point out, that studies employing what Richard Haswell (2005) referred to as RAD methods—for "replicable/aggregable/data-supported" (p. 201)—fell out of favor with the discipline from the 1980s to the 2000s.

Today, Licastro and Miller see a return to quantitative methods in rhetoric, composition, and writing studies due in part to the increasing importance of big data more generally. As they point out, "Big data has changed the way information is processed, and thus the environment in which writing happens" (p. 3). Certainly, big data analytics provide new ways to approach longstanding questions in composition while also raising new questions for researchers to consider. Composition and Big Data asks readers to consider: "What questions will we ask of this data? What further data do we want to collect or examine, given the field's longstanding and emerging questions? And what protections or special considerations need to be considered for RCWS contexts?" (p. 4).

The authors of the book's 16 chapters address these questions both explicitly and implicitly as they model the use of big data methods and datasets to discuss open questions in the field and crucial questions that are raised by the use of big data methodologies. The book is organized into a critical introduction and four sections that highlight the range of disciplinary questions that can be considered through algorithmic analysis of large datasets. These sections include Data in Students' Hands, Data Across Contexts, Data and the Discipline, and Dealing with Data's Complications.

Section one includes three chapters and explores applications of big data for single classrooms. In chapter one, Trevor Hoag and Nicole Emmelhainz address the question of whether distant reading assignments can help students do more than "mere counting" (p. 28). They describe an assignment in an introductory digital studies/digital writing course in which undergraduate students performed textual analyses through distant or machine reading. Drawing on evidence from students' reflections on their learning throughout the assignment, they find that the assignment enabled students to perform rhetorical analysis. Chapter one is unique in the book as the only chapter focused on students pursuing their own big data projects in the classroom. Although students are ever present as research participants and stakeholders in the other chapters, chapter one is especially valuable for addressing the question of whether big data projects are worth considering for classroom use.

Chapters two and three both report on studies considering, in different ways, the development of academic writers from beginners to experts. Chapter two takes this topic up in the context of how students adopt features of spoken versus academic discourses in their writing, and the results confirm an earlier study by Douglas Biber et al. that found that students fell between spoken and academic discourses. In addition to contributing to ongoing conversations in the field, chapter two provides fascinating insight into the use of corpus analysis and evidence that the use of different tagging and parsing programs can yield different results. In chapter three, the focus is on how students synthesize multiple sources in their writing. Alexis Teagarden reports on a study using computer-assisted keywords and cluster analyses to show that expert and successful student papers use similar moves with different language while unsuccessful student papers mostly lack synthesizing moves. The three chapters in section one show how big data can be used to better understand student writing and will be particularly useful for readers interested in the pedagogical implications of big data for composition or in designing projects focused on student writing.

Section two, Data Across Contexts, includes four chapters that focus on cross-curricular and programmatic perspectives on big data. In chapter five, Laura Aull argues for using big data as a type of mirror that program administrators and instructors can hold up to current practices in order to better understand them and ensure ethical assessment practices, and each of the chapters in section two demonstrates insights garnered from using big data in this way. In chapter four, the focus is on the use of data mining and computational methodologies to assess the successes and failures of the University of Tampa's Academic Writing Program in preparing students to write in other courses across the university, and chapter five uses corpus-based keyword analysis of lexico-grammatical features to show how students' treatment of source material differed depending on whether the wording of the directed self-placement prompt they received primed them to think of either argument or explanation. In chapter six, the authors report on a corpus-based study designed to see whether and how key terms learned in first-year composition transferred to a STEM context and illustrate how insights garnered through big data can be used to inform pedagogical interventions. Closing out section two, Kathryn Lambrecht analyzes national and local corpora to show how terminology about interdisciplinarity circulates in broad and local contexts, discovering that although interdisciplinarity is "complicated, diverse, and evolving," "our definitions for disciplinary categories are shrinking" (p. 127).

Taken together, the chapters in section two question the extent to which students transfer knowledge of writing gained in one context to another and, along with that, whether big data methods can help researchers and program administrators to observe transfer between contexts. For this reason, section two is likely to be of particular interest to program administrators engaged in designing placement procedures and programmatic assessments.

The four chapters in section three share insights that big data provides about the discipline of rhetoric, composition, and writing studies, and consider the nature of the relationship between big data and composition, including some of the challenges that confront researchers who wish to use big data. Chapter eight applies distant and close reading methods to the WPA-L listserv to see how discussions on WPA-L compare with other datasets in the field, while chapter nine explicitly addresses the question of what further data could be beneficial for those working in rhetoric, composition, and writing studies to collect, focusing on the value of "preserving prepublication connections made by researchers working in an archive" (p. 13). Based on her experience at the National Archives of Composition and Rhetoric, Jenna Morton-Aiken argues for the value of adding "folksonomy hashtags" (p. 166) to archival materials in order to "fill a critical gap between current archival praxis and the multivoiced discourse in which the artifacts were originally produced and intended for consumption" (p. 167).

Chapter 10 considers how big data can provide a better understanding of disciplinary time, focusing on the time spent on conferences, and argues for the value of combining big data methodologies with microanalysis to generate a more complete picture than would be otherwise possible by the use of either practice alone. Chapter 11, the final chapter in section three, imagines the possibilities of an open archive of boutique data used in writing research projects, while also discussing the difficulties involved in building such "collaboratories" (p. 207).

The chapters in section three illustrate the potential for big data to help scholars better understand the discipline of rhetoric, composition, and writing studies—not only to document the history of the discipline but also to more deliberately plan for the future.

Each of the five chapters in section four takes up a particular complication related to conducting research with data. This section focuses less than the others on modeling the use of big data methodologies in describing the results and implications of empirical studies. Instead, the chapters in this section address crucial questions related to the ethical treatment of research participants in big data studies and to algorithmic biases. Chapter 12 discusses the challenges that IRBs present to composition researchers who wish to work with data; despite these challenges, however, Johanna Phelps echoes the call issued in chapter 11 to build disciplinary datasets that facilitate the processes of curating, storing, and researching.

Chapters 13, 14, and 15 serve as a counterpoint to the focus on insights garnered through data throughout sections two and three inasmuch as each of these chapters emphasizes what data obscure. Thus, chapter 13 highlights the use of opaque algorithms in data analysis, chapter 14 focuses on the role that subjective judgment plays in the use of interpretability as a model selection criterion in unsupervised machine learning, and chapter 15 reflects on what programs like Voyant and Textexture make it possible to see and what they obscure.

The final chapter in section four and the book, chapter 16, is particularly noteworthy because the complication that Jill Dahlman discusses is one that confronts nearly every scholar attempting to work with data: data that are missing, incomplete, or broken. Dahlman suggests that rather than reject broken data in the search for perfect data, researchers should engage broken data and study them carefully to see what questions they are capable of answering. In doing so, Dahlman underscores the importance of building data-driven arguments based on what the data say rather than what the researcher wants or needs them to say.

Section four is vital to Composition and Big Data in that it recognizes some of the challenges and complexities that accompany the use of big data. Given the institutional role that writing programs play—for example, by placing students into courses and assessing student writing—these issues are essential to consider before choosing to use data for any assessment, research, or instructional purpose.

Composition and Big Data succeeds in showing what is possible and in pointing readers to useful tools and tests. However, the book does not provide all the instructions readers will need to implement the methods discussed, and a few of the authors might have given more detail about their methods to offer a clearer path forward for readers new to big data. Obstacles to using big data in composition include the need to assemble corpora and to choose appropriate tools, programs, and tests for analysis. Most chapters are quite thorough in describing how corpora were assembled, and, together, the chapters highlight a variety of tools and tests that composition researchers might use to analyze corpora depending on the purposes of their research. The chapters also illustrate a number of different ways in which data can be visualized for readers. This approach is useful both for readers who are new to big data methods and unsure how to get started and for more experienced researchers and administrators who wish to try new methods. Even so, many readers will require more information than the collection provides to learn how to use the programs, run the tests, and create the data visualizations included throughout the book.
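To give readers new to these methods a sense of scale, here is a minimal sketch of one technique that recurs in the collection: corpus-based keyword analysis. It is not taken from the book or from any chapter's actual procedure. It assumes a log-likelihood (G2) keyness score, which is a common choice in corpus linguistics, and a deliberately naive tokenizer. The function names and the two toy corpora are hypothetical and invented purely for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into simple word tokens (a naive assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def log_likelihood(a, b, c, d):
    """G2 keyness for a word seen a times in a corpus of c tokens
    and b times in a comparison corpus of d tokens."""
    # Expected counts if the word were equally common in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def keywords(target_texts, reference_texts, top_n=5):
    """Rank words that are relatively more frequent in the target corpus."""
    target = Counter(tok for text in target_texts for tok in tokenize(text))
    reference = Counter(tok for text in reference_texts for tok in tokenize(text))
    c, d = sum(target.values()), sum(reference.values())
    if c == 0 or d == 0:
        raise ValueError("both corpora must contain at least one token")
    scored = []
    for word, a in target.items():
        b = reference[word]
        # Keep only words overrepresented in the target corpus.
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


if __name__ == "__main__":
    # Invented toy corpora; real studies use hundreds or thousands of texts.
    expert = ["However, the authors suggest that the findings may indicate a pattern."]
    student = ["I think the article is good and I think the author is right."]
    for word, score in keywords(expert, student):
        print(f"{word}\t{score:.2f}")
```

A real study would add choices this sketch leaves out, such as a minimum frequency threshold, a significance cutoff, and part-of-speech tagging, and the results can shift depending on which tools make those choices.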

In the end, Composition and Big Data is a book that raises questions about the possibilities for the use of big data in composition. Considering the potential for big data—especially in combination with qualitative methods—to provide new insight into longstanding disciplinary issues and warning researchers about the challenges and pitfalls that can befall those who embrace big data, the book is a compelling primer to which instructors, administrators, and researchers can refer in order to design projects that use big data thoughtfully.


Emig, Janet. (1971). The composing processes of twelfth graders. National Council of Teachers of English.

Haswell, Richard H. (2005). NCTE/CCCC's recent war on scholarship. Written Communication, 22(2), 198–223. https://doi.org/10.1177/0741088305275367