Peer Observation of Teaching in Higher Education: Systematic Review of Observation Tools

Background/purpose. This study presents a systematic review of teaching observation instruments in the current literature, based on PRISMA standards. Materials/methods. Three researchers performed searches on two databases, SCOPUS and Web of Science, focusing on two criteria: a) peer observation of teaching and b) higher education, with search terms included in the "Title/Keyword" fields. The AND command was used to join certain words, including peer observation and teaching, whilst the OR command was used to separate search terms within each criterion. Five exclusion criteria were defined and applied following the initial searches.


Introduction
A substantial body of scientific evidence exists on the positive correlation between feedback and learning, as supported in the research published by Brooks, Carroll, et al. (2019), Brooks, Huang, et al. (2019), Hattie and Clarke (2018), and Panadero and Lipnevich (2022). The current study, however, focuses specifically on feedback provided through peer observation of teaching (POT), where teacher A observes teacher B while teacher B is teaching, and subsequently provides feedback to teacher B on their teaching performance in the classroom. Although no studies were found in the literature exclusively on POT applied in the higher education context, early evidence quantifying the impact of POT on learning exists for other educational stages (Burgess et al., 2021). POT has become increasingly popular in higher education institutions worldwide, including in the United States, Australia, and the United Kingdom (Carragher & McGaughey, 2016; Johnston et al., 2022).
It would be inappropriate to analyze POT without considering the rationales that underlie the conceptual POT frameworks, which are: technical, practical, and critical. In brief, technical rationality is behaviorist and quantitative in nature, with a curricular approach formed according to pedagogically-based objectives, whilst practical rationality refers to a process-based and student-centered pedagogy. Finally, critical rationality is based on the philosophy of practical rationality, but is oriented more towards transforming reality (López Pastor, 1999). Analysis of research by different authors shows how different interpretations of POT align with these three rationales (Bell & Mladenovic, 2008; Byrne et al., 2010; Gosling, 2002; Peel, 2005). Bell and Mladenovic's (2008) theoretical framework rests on two continuums: "performance/development and training purposes on the vertical axis" and "formal-informal processes on the horizontal axis," which were originally suggested by Peel (2005). Peel (2005) also proposed a conceptual structure that organized POT into three dimensions. The first dimension (D1) is based on a technical rationality that links teachers from an informative viewpoint, whilst the second dimension (D2) relates teachers based on collaboration and process-based research. The last dimension (D3) focuses on critical reflection as well as moral and ethical criteria. The literature shows that Gosling (2002) and Byrne et al. (2010) used different but related models. Gosling (2002) created three POT models: the evaluation model, the professional development model, and the collaborative model. The evaluation model aims to produce judgments on teaching practices, while the professional development and collaborative models were designed to improve didactic skills. In contrast, Byrne et al. (2010) outlined four models. The first model identifies underperformance and is based on authority, while the collaborative models seek engagement and discussion about practice. Byrne et al. (2010), however, added a fourth model that incorporated "ideas about learning and teaching into practice through a shared and reciprocal process that has potential for greater impact than can be gained via a one-off observation of teaching". In summary, the literature includes POT models that range from accountability-based models to those that promote collaboration among teachers, and even models that aim to transform the institution's educational realities (Gosling, 2002). Given the relevance of these existing POT models, the analysis elaborated by Byrne et al. (2010) is presented in Table 1. Strathern (2000) questioned whether "a university is first and foremost an organization whose performance as an organization can be observed". Nevertheless, as Biesta (2019) suggested, POT can be directed exclusively towards the measurement of learning, or conversely, these practices can be put at the service of dialogue, mutual construction, and ultimately, education.
Although these models are not exclusive, adherence to one trend or another could condition the type of observation instrument to be designed. Similarly, models that solely address measurement may fail to consider the observing teacher's role. We must therefore remember that POT research has generally focused on the practices of observed teachers. Yet, a growing number of authors have pointed out the importance of what the observers may also be learning (Rosselló & De la Iglesia, 2021; Torres et al., 2017). From this latter perspective, that is, the dual learning of both the observers and those being observed, or mentors and mentees (Kohut et al., 2007), the initial choice of observation tool plays a key role. While interesting systematic reviews on POT in general can be found in the literature (Carragher & McGaughey, 2016; Ridge & Lavigne, 2020), none have specifically examined the tools or instruments that were employed. Our initial research revealed a number of relevant tools that have not been used in the context of university education (Burgess et al., 2021; Muijs et al., 2018), whilst other tools are exhaustive (Torres et al., 2017), and some are considerably less comprehensive (Barnard et al., 2011; Bolt, 2013).
On the other hand, as Gosling (2014) pointed out, POT has been criticized for a lack of reliability when it comes to observer judgments in two aspects: the observers themselves, and the observation tool employed. In the case of observers, planning that includes various phases must be designed prior to applying POT (Cannarozzo et al., 2019). The first phase should systematically train observers in the use of observation tools, in order to guarantee that they understand not only the theoretical basis of the instrument, but also how it should be applied in real-world situations. In the case of the tool itself, we refer to Tenbrink's (2000) definition of the evaluation concept and summarize three steps in this conceptualization: obtaining information, formulating judgments, and making decisions. Based on this definition, the following research questions were formed:
- What type of information should be obtained in order to judge a teacher?
- What tools can be used to obtain information to judge a teacher?
- What are the characteristics of the tools used to judge a teacher?
Based on these research questions, we hypothesize that multiple peer observation instruments exist that are applicable within the higher education context. However, there has been no prior systematic and categorized review of such instruments to draw an accurate picture of their current status. In view of this, and given the importance of the choice of observation tool in POT research, the aim of the current study was to conduct a systematic review of tools employed for peer observation of teaching based on PRISMA standards (Moher et al., 2009).

Search Criteria
The searches performed centered on two criteria: a) peer observation of teaching and b) higher education. Search terms were applied to the "Title/Keyword" fields. The AND command was used to join the words included in the concept peer observation of teaching, whilst the OR command was used to separate search terms within each criterion. The complete search equation is presented in Table 2. The timeframe for the study was not limited. In order to be included, articles had to appear in either the SCOPUS or Web of Science databases. The search was conducted in duplicate so as to ensure accuracy and coverage, and followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. The PRISMA statement comprises a 27-item checklist covering the title, abstract, introduction, method, results, and discussion sections of studies in the literature. For the current study, we reviewed each article against each of the 27 items. Items 4 and 6, which refer to the PICO format, were of particular interest since they "Provide an explicit statement of questions being addressed with reference to participants, interventions, comparisons, outcomes, and study design." For the current study, the "participants" were identified as the "educational stage," whereas the "interventions" referred to the "method," and "comparisons" were established using "self-assessment" as search criteria.
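As a purely illustrative sketch of the Boolean logic described above, the two criteria can be assembled programmatically. The terms below are abbreviated and hypothetical; the exact equation used in the study appears in Table 2.

```python
# Illustrative sketch only: assembling a database search equation from two
# criteria. Terms within a criterion are joined with OR; the two criteria
# are joined with AND. The real, exhaustive equation is given in Table 2.

pot_terms = [
    "peer observation of teaching",
    "peer review of teaching",
    "formative observation of teaching",
]
he_terms = ["higher education", "university*", "college*", "tertiary education"]

def or_block(field: str, terms: list[str]) -> str:
    """Join search terms with OR, each wrapped in the given search field."""
    return " OR ".join(f'{field}("{t}")' for t in terms)

# The two criteria are joined with AND, per the search strategy described above.
query = f"({or_block('TITLE-ABS-KEY', pot_terms)}) AND ({or_block('TITLE-ABS-KEY', he_terms)})"
print(query)
```

The same pattern extends to any number of criteria or fields; only the term lists and field codes change.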

Exclusion criteria
Five exclusion criteria were defined and applied in the study:
1. Lack of a process of observation of teaching (e.g., mentoring without observation, observation of a written product, or solely collecting a teacher's perception or opinion of mentoring).
2. Absence of teacher-to-teacher observation.
3. Observation not having taken place in the higher education context (i.e., another educational stage).
4. Observation process conducted entirely online.
5. Insufficient detail provided regarding the instrument used.
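A minimal sketch of how these five criteria act as filters on a candidate record follows. The `Record` structure and its field names are our own illustrative assumptions, not part of the study's protocol.

```python
# Hypothetical sketch: representing the five exclusion criteria as boolean
# checks on a candidate record. Field names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Record:
    has_observation: bool       # criterion 1: an observation of teaching occurred
    teacher_to_teacher: bool    # criterion 2: teacher observing teacher
    higher_education: bool      # criterion 3: higher education context
    fully_online: bool          # criterion 4: process conducted entirely online
    instrument_detailed: bool   # criterion 5: sufficient detail on instrument

def is_included(r: Record) -> bool:
    """A record survives screening only if none of the five criteria apply."""
    return (r.has_observation
            and r.teacher_to_teacher
            and r.higher_education
            and not r.fully_online
            and r.instrument_detailed)

# A record meeting every inclusion condition is retained.
print(is_included(Record(True, True, True, False, True)))  # → True
```

Expressed this way, each criterion is independent, which mirrors how the rounds of review described below could discard a record on any single ground.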

Procedure
The searches were conducted between March and April of 2023. Three researchers performed searches in two databases, namely SCOPUS and Web of Science, using the search terms listed in Section 2.1 (see also Figure 1). An initial sample of 662 articles resulted from the first search (367 from SCOPUS and 295 from Web of Science). Duplicate articles (n = 18) were then eliminated, and four rounds of review were conducted based on the inclusion and exclusion criteria. During the first round, experts were informed about the context in which the observation tool was applied. In the second round, experts wrote comments on each tool after reading the abstract. A total of 502 papers were then excluded on the basis of exclusion criterion 1 (lack of process followed) and exclusion criterion 5 (insufficient detail on instrument). During the third round, tools deemed inappropriate for application in the higher education context were eliminated based on the comments received from peers, and a further 127 papers were excluded (76 from SCOPUS and 51 from Web of Science [WoS]). Finally, in the fourth round of analysis, the resulting 13 observation tools were scored using a (2019) quality assessment tool (see Table 3), and the experts justified their scores in qualitative terms.
The maximum score obtainable from the tool is 105 points. The quality indicators for the studied articles were as follows: (1) the mean methodological quality score for the 13 selected articles was 87.76%; (2) 11 articles scored between 80 and 105 points (excellent methodological quality); (3) two articles scored between 60 and 79 points; and (4) zero articles scored below 60 points (see Table 3).
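The scoring bands reported above can be expressed as a small helper. This is a sketch assuming the 105-point maximum and the bands stated in the text; the band labels other than "excellent" are our own, and the example score is hypothetical.

```python
# Sketch of the quality-score bands described above (maximum score: 105).
# Only the "excellent" label comes from the text; the others are placeholders.
MAX_SCORE = 105

def quality_band(score: int) -> str:
    """Map a raw methodological-quality score to its band."""
    if 80 <= score <= MAX_SCORE:
        return "excellent"
    if 60 <= score <= 79:
        return "intermediate"
    return "below threshold"

def as_percentage(score: int) -> float:
    """Express a raw score as a percentage of the 105-point maximum."""
    return round(score / MAX_SCORE * 100, 2)

# A hypothetical article scoring 92 out of 105 points:
print(quality_band(92), as_percentage(92))
```

Converting raw scores to percentages this way makes scores comparable with the mean quality figure reported as a percentage above.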

Analysis
Based on the postulates of López-Noguero (2002), a "content analysis" exercise was conducted in order to adequately examine the 13 resulting observation tools. The content of each tool was analyzed on the basis of four variables: country, validation, observation, and feedback.

Results and Discussion
Based on the exclusion criteria detailed in the Methodology section, the aim of the current study was to systematically review peer observation instruments employed in university teaching. The findings revealed a total of 13 instruments, which were then analyzed according to four variables: country, validation, observation, and feedback.

Country
Over half the instruments were designed by researchers from universities in two countries (see Table 4): the United States (n = 4) and Australia (n = 4). The other five instruments originated in four European countries: the United Kingdom (n = 2), Italy (n = 1), Switzerland (n = 1), and Portugal (n = 1). Regarding the tradition of peer observation of teaching in Australia, it is worth taking note of the systematic review conducted by Johnston et al. (2022). Teaching observation practices are so deeply entrenched in Australia that their review found a sufficient critical mass to categorize 19 studies into: a) organizational factors (disciplinary context, program sustainability, collegiality, and leadership); b) program factors (program design, basis of participation, observation, feedback, and reflective practice); and c) individual factors (experience and participants' perceived development requirements). The extent of POT in Australia is manifest in works such as that of Bell and Cooper (2013), which explored the experiences of four Associate Deans of Learning and Teaching at a research-intensive Australian university. The Anglo-Saxon tradition of this teaching practice is clear, since 10 of the 13 works found were from English-speaking countries (United States, United Kingdom, and Australia). The work of Wingrove et al. (2018) established differences between the models employed in the United Kingdom and in Australia. For Australia, "quality assurance and measurement imperatives occupy a prominent place in higher education discourses, with performativity through continuous improvement in learning and teaching now central to the very practice of learning and teaching itself" (Wingrove et al., 2018). Sachs and Parsell (2013) earlier claimed that in Australia, "peer review is neither systematically supported nor generally perceived to be a high-quality developmental activity". On the other hand, a consolidated model of peer observation of teaching is known to exist in both England and North America.

Validation
The importance of validation processes in the development of instruments motivated the need to analyze validation as a variable (see Table 4). The results of the review showed that three of the 13 instruments had undergone a process of validation: the Reformed Teaching Observation Protocol (RTOP) by Amrein-Beardsley and Popp (2012); the instrument for the observation of "Project-Based Learning" by García et al. (2017); and the peer observation tool introduced by Rabada-Rice and Scott (1986). The reliability and validity of RTOP were investigated within the higher education context and through examination of what participants perceived they had learned during their faculty/peer evaluation process. In the case of García et al.'s (2017) instrument designed for the observation of "Project-Based Learning," the instrument's first version was pretested and completed by four peer observers as they watched two video-recorded PBL sessions. Two videotaped tutors also completed the instrument and were then interviewed to gather their suggestions and comments, after which certain adjustments in the formulations were applied as considered necessary. Regarding the tool designed by Rabada-Rice and Scott (1986), a group of experienced teachers deliberated on the suitability of 25 initial items, which were later reduced to 10 following a period of discussion.
In this regard, content validity was defined as the quality that ensures that a set of items is representative of the behavioral domain to be measured (Moscoso et al., 2003). This representativeness is usually evaluated via subjective expert opinion, which may sometimes be quantified through the use of algorithms such as the Aiken coefficient (Aiken, 1980). Therefore, the use of content validity or any other procedure that systematizes instrument design offers a level of methodological assurance that the instruments developed are both reliable and valid tools.
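As a brief illustration of the Aiken coefficient mentioned above, the following is a minimal sketch of Aiken's V for a single item rated by a panel of experts; the expert ratings shown are hypothetical.

```python
# Minimal sketch of Aiken's V content-validity coefficient (Aiken, 1980):
# V = sum(rating - lowest_category) / (n_raters * (n_categories - 1)).
# V ranges from 0 (no agreement on relevance) to 1 (maximum agreement).

def aiken_v(ratings: list[int], lowest: int = 1, highest: int = 5) -> float:
    """Compute Aiken's V for one item rated on a closed category scale."""
    n = len(ratings)
    c = highest - lowest + 1                 # number of rating categories
    s = sum(r - lowest for r in ratings)     # summed distances from the floor
    return s / (n * (c - 1))

# Five hypothetical experts rate an item's relevance on a 1-5 scale:
print(aiken_v([4, 5, 4, 5, 3]))  # → 0.8
```

In content-validation practice, items with V below some agreed cutoff would be revised or discarded, which systematizes the expert-judgment step described above.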

Observation
The number of items in each tool, their grouping (or lack thereof) into dimensions, and the response format were the aspects analyzed in terms of the information observed and how that information was organized for each tool (see Table 3). Regarding the number of items, the tools ranged from highly unstructured instruments with just a few items to more exhaustive instruments. The least structured instruments were those developed by Cosh (1998), Drew et al. (2015), and Sullivan et al. (2012), whilst Georgiou et al.'s (2018) instrument did not include any items or dimensions.
In the latter, the observation process starts with a highly detailed formulation of objectives linked to teaching. Subsequently, observation sheets guide a qualitative reflection on the teaching-learning process with respect to the formulated objectives. Along the same lines of flexibility, Cosh (1998) designed a qualitative, open instrument that consisted of three sections for the observer to complete: a) what was learned from the observation; b) action intended to be taken (e.g., reading, staff development, further observation, and experimentation with one's own teaching); and c) suggested topics of interest (e.g., staff seminars, staff days, or action research). For their part, Sullivan et al. (2012) included six rating items (voice, pace, non-verbal communication, organization and preparation, use of overhead projectors and other audiovisual aids, and attitude). Of the four least exhaustive instruments mentioned, Drew et al.'s (2015) PRO-Teaching presented a revised version of an earlier tool. The resulting definitive version consisted of 10 largely open questions such as "Does the teacher clearly define explicit, realistic, and challenging yet achievable aims and learning objectives?" or "Does the teacher reveal a scholarly approach to teaching and seek to improve teaching performance?" Moreover, a number of tools included multiple observation items, such as that by Torres et al. (2017), whose 35 items were structured into six dimensions (Class structure, Class organization, Class climate, Content, Teachers' attitude, and Other considerations). For its part, the instrument created by Servilio et al. (2017) from Monmouth University (United States) groups its 33 items into five dimensions (Content, Organization, Interaction, Verbal and nonverbal communication, and Use of media). "Peer observation-clinical/small group teaching" was an observation grid designed by Hassel et al. (2020) at Colorado State University (United States) and used five dimensions to group its 27 items.
Finally, among the tools with a higher number of items, Amrein-Beardsley and Popp (2012) from Arizona State University (United States) organized the 25 questions of the RTOP instrument into four dimensions (Lesson design, Propositional knowledge, Procedural knowledge, and Classroom Culture: Interactions).
In terms of the ideal number of items for a POT tool, we do not believe that such a number exists. However, it may be said that instruments that contain many items risk complicating the observer's task, since they require observers to assess many different aspects within a short period of time. Furthermore, when these tools are designed, it is essential to identify who the observers are likely to be in terms of their professional experience and job role. In the case of teacher observers, not all are likely to share the same pedagogical knowledge, and POT experiences may also be multidisciplinary (Torres et al., 2017). Thus, regardless of the number of items that an observation tool consists of, it is crucial that appropriate training programs are designed and implemented for current and future observers (Cannarozzo et al., 2019). This type of training should involve understanding the purpose of the tool in question (e.g., educational, accountability), having sufficient knowledge of the tool's items, and ensuring reliability when coding the observed behavior by referring back to the questions/items of the instrument itself.
Finally, the response format was another variable analyzed across the 13 tools reviewed. Response formats are conditioned by the number of instrument items. Tools with few questions commonly have an open response format (Bolt, 2013; Cosh, 1998; Drew et al., 2015; Georgiou et al., 2018; Rabada-Rice & Scott, 1986). On the other hand, tools with numerous items tend to have response formats with four or five options (Amrein-Beardsley & Popp, 2012; Cannarozzo et al., 2019; García et al., 2017; Servilio et al., 2017; Torres et al., 2017). Sensitivity is understood as the quality that allows for differentiation in the range of possible answers. While response formats with few options (e.g., two options) do not allow for much differentiation, it should also be noted (for tools with closed response formats) that human discrimination capacity is limited (Goñi et al., 2003). On occasion, an odd number of response options can induce the so-called central tendency error. Additionally, when POT procedures are formative in nature (beyond being designed solely for accountability), it seems appropriate to include sections where qualitative comments and clarifications can be made alongside closed-coding response options.

Feedback
Another important aspect analyzed was whether instruments provided a section to analyze and reflect upon the observed behaviors (see Table 5). Examples include the inclusion of sections such as "strengths," "weaknesses," or "comments" in a developed tool, the goal being to provide feedback to the observed teacher and, ultimately, to improve the teaching and learning process. It is worthy of note that four of the 13 instruments evaluated only included an observation section, without any specific section attributed to feedback. Of the remaining instruments, some included all three aspects of "strengths," "weaknesses," and "comments" in the feedback section (Bolt, 2013; Cannarozzo et al., 2019; Georgiou et al., 2018), while others only included a section for "comments" or "suggestions for improvements" (Cosh, 1998; Hassel et al., 2020; Rabada-Rice & Scott, 1986; Torres et al., 2017).
If we understand POT as a phenomenon based on a critical (non-technical) paradigm and with a transformative objective, observation tools should include a specific section on feedback. A learning-oriented evaluation closes its cycle if the observed person receives adequate and valuable feedback, and if that analysis is executed horizontally (not vertically) between the observer and the observed. This type of POT model is classified under the theoretical categorization that Peel (2005) defined as "D3," that Gosling (2002) defined as the "collaborative model," and that Byrne et al. (2010) classified as the "peer development model." This model can also be said to be based upon constructivist critical teacher training (Mutlu-Gülbak, 2023).
One limitation of the current study is that only tools designed for the higher education context were reviewed. Other interesting POT tools exist in the literature that were designed for other educational stages. Also, since the COVID-19 pandemic, online learning has intensified as a more commonplace educational practice (Noor & Md Isa, 2023; Strelchuk et al., 2023; Sultoni & Gunawan, 2023); hence, another limitation of the current study is that the review excludes the influences of the pandemic on newer forms of learning.

Conclusion
In this work, we conducted a systematic review of tools used for the peer observation of teaching (POT) in the higher education context. A total of 13 POT instruments were identified based on inclusion and exclusion criteria. A majority of these tools (n = 8) were developed by universities in two countries, namely the United States and Australia. Only three of the instruments in the review included some form of validation process in their design. Thus, we believe that any instrument should undergo systematic validation processes as part of its design phase, even if the tool is to be directed towards learning-oriented evaluation rather than bureaucratic quality assurance. A recent systematic review published by Nuis et al. (2023) reached the same conclusion.
Moreover, we deem it necessary that teachers be involved in the design of instruments that are aimed at observing the behaviors of teachers in the classroom. It can be said that teachers have long been evaluated without any involvement in the process, and the benefits of such a limited process are therefore doubtful. Furthermore, the number of items in the instruments reviewed ranged from three to 35 questions. From our perspective, excessive numbers of questions could make the observation exercise overly complex for the observer, unless the items are purposefully distributed across several observed sessions. In any case, the appropriate number of questions should correspond to that deemed by teachers themselves as essential to observing the teaching process. Hence, it is crucial that teachers help to design the observation tools that may then be implemented to observe their performance.
Understanding observation from a critical viewpoint, we believe that in the case of closed-response formats, tools should include fields in which observers may add qualitative comments that can later be used to deepen the understanding of the record and to improve the feedback quality of the observer-observed encounter.
Finally, if POT is directed towards the transformation of not only technical but also human and educational realities in institutions and people, we believe that POT instruments should be designed to serve that transformative, dialogic purpose rather than measurement alone.

Table 1.
Models of peer observation of teaching (Byrne et al., 2010)

Only articles written in the English language were included in the search. The exhaustive search equation is presented in Table 2.

Table 2.
Search equation
*) OR TITLE (formative AND observation AND of AND teaching) OR TITLE (peer AND observation) OR TITLE (mentoring) AND KEY (peer AND review AND of AND teaching) OR KEY (peer AND evaluation AND of AND teaching) OR KEY (peer AND feedback AND on AND teaching) OR KEY (peer AND partner*) OR KEY (formative AND observation AND of AND teaching) OR KEY (peer AND observation) OR KEY (higher AND education OR university* OR college* OR tertiary AND education) AND NOT TITLE-ABS-KEY (school OR elementary OR secondary OR middle)

Table 3 .
Quality score of studies

Table 4 .
Characteristics of the "observation" dimension of the POT instruments

Table 5 .
Characteristics of the "feedback" dimension of the POT instruments