Mixed Methods:
Usability Testing
Formative evaluation for continuous improvement
For this study, the website of the charity Crisis (Crisis.org.uk) was chosen, and its usability was assessed against a set of criteria given to the evaluators.
Introduction
Crisis aims to eradicate homelessness in Great Britain and to support homeless people in re-establishing their lives. Its website should therefore deliver these messages with clarity and impact, motivating members of the British public who might be interested in its services and campaigns, such as fundraising and volunteering, to join its activism and advocacy. The website seeks to encourage more people to take part in such activities and to increase donations and fundraising, so that vulnerable homeless people are empowered to fight collectively against social inequality and barriers.
To help achieve the charity's goals on the website, the research focused on the areas specified in the brief below. The objectives were to uncover how people perceived the website in conjunction with its social actions, and whether any usability problems hindered users from understanding and partaking in the social actions run by the charity.
- The degree of persuasion and engagement of the website for the target audience
  - Does the website clearly communicate its goals to its target audience?
  - Does the website motivate its target audience, and does it have a trigger that keeps the audience engaging with the organization?
- The overall user experience of the website
  - The degree of ease of use and satisfaction with the user journeys that are closely associated with social actions, such as donations, which keep the charity running and supporting people in need of help.
  - How do users feel while actively engaging with the user journeys?
  - Does the website allow its target audience to partake in social actions without much effort?
- The degree of comprehensibility and wayfinding while exploring the website's campaigns
  - Does the website clearly communicate with its target audience and successfully deliver its messages?
  - Is the target audience aware of what kinds of social actions are required to support homeless people?
  - Are they motivated and willing to take action after reading about a campaign?
The test of the desktop website Crisis.org.uk was conducted between 30th March and 3rd April 2023 against the objectives above. A remote moderated usability test was adopted, with six participants including one for a pilot test. The methodology of the test is explained below.
Methodology
Recruiting participants
Participants were recruited via Microsoft Forms with a screening questionnaire to find people whose interests align with the charity's human-rights activities, so that the test could reflect the likely behaviours and attitudes of the target audience while interacting with the website.
The recruitment combined convenience sampling and volunteer panels, the most common nonprobability sampling methods for data gathering; generalizing the findings to a larger population is therefore arguably unsafe (Sharp et al., 2019). However, given this and the constraints of the research, such as available resources and time, appropriate screening questions and triangulation would counteract the disadvantage. For the same reason, specific demographics were not considered when selecting participants, apart from their attitudes towards human-rights activities and their prior experience.
Six participants undertook the test, partly as a consequence of the pilot test and my enthusiasm for usability testing, even though Nielsen (2000) states that five participants uncover most usability issues and that, given the cost-benefit trade-off, each additional participant yields diminishing returns. Twenty-one responses were received via the screening questionnaire. Apart from the two participants recruited through convenience sampling, a high number of survey respondents became unresponsive after the consent forms, explaining what would be involved, how the test would run, and the requirement to record the session, were sent out to the volunteer panel. Presumably, they were reluctant to take part and changed their minds after reading in the consent form that the session needed to be recorded. As the administrator, I felt it necessary to recruit backup participants as a contingency plan; otherwise, unexpected unresponsiveness could have affected the research timeline and hampered gathering a sufficient amount of reliable data. Except for the convenience-sampling participants, the remaining participants from the volunteer panel were incentivized with a £10 Amazon voucher as a small token of appreciation for their time and effort.
Research design
Taking into account the constraints of remote usability tests and the lack of available resources such as time, tools and support, triangulation was chosen to collect data from different sources at different times: a review of the website by myself as a UX professional, as desk research, to learn about the organization and generate task ideas; the System Usability Scale (SUS) as a study-based UX metric for quantitative data; and a mixed-methods approach with qualitative and quantitative task-based metrics from the usability tests. Triangulation increases the validity and reliability of the data compared with relying on a single method, as generalizations drawn from different perspectives and sources are less risky and provide a holistic picture of the problem; moreover, the biases of any one method are mitigated (Anderson, N., 2022).
The website review was conducted prior to designing the set of tasks; this step was needed to interpret and refine the goals of the brief. It allowed the right task method to be chosen for the context and goal of each task, and subsequently a set of tasks was generated, mixing verb-based, scavenger-hunt and interview-based tasks, with the expectation that they would uncover the participants' attitudinal behaviours towards the organization's website based on their past experience and interests in human-rights activities. In addition, each task comprised a set of questions combining open-ended (qualitative) and closed-ended (primarily quantitative) questions, depending on their purposes.
Perception-of-task-experience questions, covering dimensions such as persuasion and engagement depending on a task's aim, were included as task-based metrics to gauge the usability of the website and the achievement of the test objectives; such questionnaires are essential for measuring participants' attitudes (Sauro, J. and Lewis, J., 2022). SUS was then chosen as the study-based UX metric to measure users' general perception of the usability of the website. SUS is technology agnostic, has been tested on many interactive systems including websites, and consists of only a 10-item questionnaire (Sauro, J., 2011). Compared with other metrics such as the PSSUQ, it is a lightweight and practical way to gather data, making it the best fit given its purpose and the constraints of the research. The form for this post-test assessment of usability was created in Microsoft Forms and provided to the participants after each test.
Given the challenges of collecting quantitative data without tools or support, and the fact that around 40 participants are recommended to obtain statistically reliable results with reduced margins of error (Budiu, R. and Moran, K., 2021), a formative assessment was the best fit for identifying problems and meeting the requirements of the coursework. The research therefore focused on collecting qualitative data, such as the participants' verbal and non-verbal behaviours towards the website. Even though quantitative measures were adopted for both task-based and study-based UX metrics, they were used only as indications of the degree of the website's usability issues.
Pilot test & its findings
Since the pilot test did not go as smoothly as expected, minor changes were made to the tasks. First, the participant could not fully understand one task and its scenario without lived experience or knowledge of shares and tax relief in relation to donations; consequently, the task could not be carried out. The task was intended to highlight usability problems, not to test the participant's intelligence. The participant expressed frustration about their understanding of the task and struggled to determine where to start and what to do, even after further explanation. It was apparent that this task would be inappropriate for the remaining participants, who had been recruited without being asked about their fitness for it during the screening.
Another finding was that a structured interview combined with a participant-chosen task relying on common sense may not be the best combination of techniques. The task was designed to determine whether the website provides the information the public wants to know about the charity in general; in addition, it aimed to find out whether the website matches the mental models of general users whose interest is human-rights activities. Its question can be found below.
Question: "Remember you are on the website with a personal interest in this charity. What would you expect to find on the website?"
The participant replied instantly with "Volunteering" and started looking for the relevant menu. As that section was meant to be explored later with pre-defined tasks, this was unexpected, and the structured interview had no provision for it. Eventually, one pre-defined task was skipped as it had already been dealt with. Two scavenger-hunt tasks with scenarios were withdrawn for the reasons above, but the remaining tasks were retained as a basic script for guidance. The session lasted around an hour and a half, longer than expected, as it did not run smoothly for the reasons mentioned above; in addition, technical issues prevented the participant from sharing their screen at the outset. The session structure and the usability test script were read out to all the participants before proceeding with the test.
Data gathering
Since the techniques of quantitative data collection were detailed in the Research design section, it is apt to expound here on the techniques of qualitative data collection. With the participants' consent, all test sessions were recorded via Teams with transcripts so that they could be re-watched during data analysis without support from others. Notes were taken as often as possible, but this was challenging while also talking, listening, and observing participants' verbal and non-verbal behaviours simultaneously. Recording allowed various types of data to be collected, such as screenshots, capturing attitudinal behaviours that were missed during the interviews. Since observing participants perform the given tasks cannot by itself reveal the rationale behind their decisions, feelings, and thoughts, all participants were asked to think aloud concurrently during the test.
As mentioned before, a structured interview was expected, the test having been rigorously planned and the tasks modified after the pilot. However, the more tests were run, the more familiar they became, giving more confidence in moderating them. In addition, the issue of a participant's chosen scenario overlapping with a pre-defined question from section 3 recurred, since the tasks could be performed with common sense about the charity; most participants started talking about the same topics, such as donating and volunteering, which were expected to be covered at a later stage of the test.
The concurrent think-aloud technique complemented the semi-structured interviews and helped them run fluidly and naturally. Interestingly, the combination of the two techniques uncovered various areas of the website's problem spaces, thanks to participants who held differing points of view even though the basic script was identical for everyone. The remaining participants were asked to explore the areas of the organization's website that aligned with their interests. By virtue of the nature of semi-structured interviews with participant-chosen tasks, wider areas of the website could be explored within the scope of each task's goal, and varied responses could be gathered from each participant. The change in interview technique in the midst of the tests did not affect the results of data gathering to the extent that earlier data had to be discarded.
Furthermore, questions on perceptions of the task experience, such as "How easy was it to find?" and "How persuasive were they?", were asked at the end of each task, with the set of questions varying according to the characteristics and aim of the task. These questions provided not only quantitative data on a 7-point Likert scale but also qualitative data, allowing participants to reflect on their perceptions according to the degree of their experiences and feelings, and helping them feel more comfortable and at ease during the interviews; the questions were adapted to remain relevant, continuing the conversation and delving deeper into specific aspects of the experience in question. Asking closed-ended questions proved a quintessential part of the interviews, letting participants evaluate the website in depth against a specific question. An open-ended question like "How do you feel about this new app?" would elicit varied responses, but data processing could be challenging, and a meaningful outcome might not be produced given the individual levels of responses without a fixed scope and range (Sharp et al., 2019).
At the end of each interview, the participants were given a link (refer to Appendix 6) to fill out the post-assessment questionnaire (SUS) to quantify their perceptions of the overall user experience of the website.
Data Analysis
Qualitative data analysis
Following Sharp et al. (2019), an inductive approach was adopted to extract concepts from the data and identify themes (thematic analysis), which was felt to be the most appropriate way to analyse the data given the purpose of the usability test. As the first step, two spreadsheets were created in Microsoft Excel: one a summary of the data containing my interpretation and analysis, with a column per participant; the other a rainbow chart with Nielsen's severity ratings. A cell was filled in every time a new insight, positive or negative, was found in the notes. Whenever there was a lack of clarity, perhaps unsurprising given the time gap between the two activities, the video recordings were re-watched to confirm understanding and context. Thanks to the keywords in the notes, it was straightforward to locate conversations: the recordings allowed searching the transcript for a keyword and instantly playing the matching point in the timeline. Each column of the summary spreadsheet was completed, and a column of the rainbow chart was then filled with the usability problems. De-duplication of problems was not attempted at this stage; the focus was on gathering all the problems onto the rainbow chart. Even though the tests aimed at finding usability problems, some participants offered positive feedback on things they found interesting or strikingly different from their expectations.
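To make the structure of the rainbow chart concrete, the sketch below models it as rows of usability problems against columns of participants. This is a minimal illustration only: the problem labels and severity values are hypothetical placeholders, not findings from the study.

```python
# Minimal sketch of a rainbow chart: each row is a usability problem,
# each entry records which participants encountered it plus a Nielsen
# severity rating (0-4). All labels/values below are hypothetical.
rainbow_chart = {
    "Could not locate volunteering submenu": {"participants": {1, 3, 6}, "severity": 3},
    "Secondary navigation missing on Donate page": {"participants": {2, 5}, "severity": 2},
}

N_PARTICIPANTS = 6  # six participants took part in the study

for problem, row in rainbow_chart.items():
    frequency = len(row["participants"]) / N_PARTICIPANTS
    print(f"{problem}: seen by {len(row['participants'])}/{N_PARTICIPANTS} "
          f"(severity {row['severity']}, frequency {frequency:.0%})")
```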
An affinity diagram was not created, although it is the most common technique for exploring data, identifying themes, and organizing them into groups, since the analysis had to be done by myself alone, relying on intuition built from years of experience as a designer rather than writing a note for every single problem and de-duplicating them. Having said that, the reliability of the process could be questioned, as the themes were identified independently and manually; in other words, the thematic analysis took place mentally, by juxtaposing each participant's summary with the pre-existing usability problems on the two spreadsheets, the summary and the rainbow chart respectively. In a collaborative setting, an affinity diagram would have facilitated visualizing all the usability problems on sticky notes and discussing them with others.
Initially, some duplicated usability problems appeared on the rainbow chart, even though care was taken to de-duplicate them by considering the context of each problem and its location during the thematic analysis. The rainbow chart therefore went through several iterations between inductive and deductive approaches to determine whether the meaningful codes were orthogonal (Sharp et al., 2019). During the deductive categorization, a context column was added to the rainbow chart to make sure each usability problem was independent and to specify where it belonged, whether a couple of problems were interdependent, or whether a problem spanned broad areas of the website.
Quantitative data analysis
The research was a formative assessment to identify usability problems of the website with six participants. Noting that quantitative user testing typically carries a 19% margin of error while the average usability difference between websites is 64%, quantitative usability research would not necessarily be as efficient and accurate as expected (Nielsen, J., 2011). Given the nature of the research, the quantitative data gathered was not of paramount importance; however, it helped summarize the variation in perceptions of the task experiences, and the closed-ended questions opened the door to more follow-up questions.
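To illustrate why results from such a small sample carry wide margins of error, the sketch below computes an adjusted-Wald 95% confidence interval for a task completion rate, a method Sauro and Lewis recommend for small samples. The task result used here is hypothetical, not data from this study.

```python
# Minimal sketch: adjusted-Wald 95% confidence interval for a completion
# rate, showing how wide the interval is with only six participants.
import math

def adjusted_wald_ci(successes: int, n: int, z: float = 1.96):
    """Return the (low, high) adjusted-Wald interval for successes/n."""
    p_adj = (successes + z**2 / 2) / (n + z**2)          # adjusted proportion
    margin = z * math.sqrt(p_adj * (1 - p_adj) / (n + z**2))
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# Hypothetical example: 5 of 6 participants complete a task,
# yet the plausible true rate spans roughly 42% to 99%.
low, high = adjusted_wald_ci(5, 6)
print(f"completion rate 95% CI: {low:.0%} - {high:.0%}")
```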
Study-based metric: SUS score
On a spreadsheet, the results of the 10-item questionnaire were tallied per participant to calculate individual SUS scores, which were then averaged so that it would be possible to gauge how well the website performs compared with others in general. The formula for the SUS score was found on measuringux.com, and the spreadsheet for the SUS scores of the tests can be found in Appendix 8. Overall, the responses from the participants were strikingly contrasting: Participants No.1, No.3 and No.6 gave marginal scores between 52 and 55, whereas the others gave generous scores between 80 and 87. Taking the number of participants into account, it may be inappropriate to generalize the overall score to a wider population, for the same reason mentioned previously. Knowing that the perception of the overall user experience of a website is very subjective and personal, it would arguably be more appropriate to reflect on the pain points of the participants who gave lower scores rather than generalizing the number.
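For readers unfamiliar with the scoring, the sketch below applies the standard SUS formula (odd items contribute score minus 1, even items contribute 5 minus score, and the 0-40 sum is scaled by 2.5). The responses shown are hypothetical placeholders, not the participants' actual answers from Appendix 8.

```python
# Minimal sketch of the standard SUS scoring formula.
def sus_score(responses: list[int]) -> float:
    """responses: ten 1-5 Likert answers in questionnaire order."""
    odd = sum(r - 1 for r in responses[0::2])   # items 1,3,5,7,9: score - 1
    even = sum(5 - r for r in responses[1::2])  # items 2,4,6,8,10: 5 - score
    return (odd + even) * 2.5                   # scale 0-40 sum to 0-100

participants = [
    [4, 2, 4, 1, 5, 2, 4, 2, 4, 2],  # hypothetical participant A -> 80.0
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],  # hypothetical participant B -> 50.0
]
scores = [sus_score(r) for r in participants]
print(scores, "average:", sum(scores) / len(scores))
```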
Task-based metrics: Task success rate and more
The task success rate used a 3-point scoring system: 3 indicates a participant passed a task easily, 2 that a participant passed with difficulty, and 1 that a participant failed. Each rating was recorded by myself as the facilitator, using my own judgement and without asking the participant, right before moving on to the next task. The score reflected the approximate time taken to complete a task and the participant's attitudinal behaviours during the task session. The scores were tabulated with two columns for the sum and median values; comparing the two, the sums seemed to indicate the locations of the usability issues more noticeably. The result was factored into the severity ratings alongside the other quantitative data discussed below; that is, the quantitative data tended to indicate the frequency and impact of each problem. A quadrant matrix of severity ratings was used to delineate the levels of severity of the usability problems, and in the rainbow chart the rating of each problem was finalized after factoring in the criticality and persistence of the problem with my judgement.
At the end of each task, the participants were asked to rate their perception of the task experience. The topic of these closed-ended questions varied depending on the goal of the task; they were used to find out whether a certain page of the website was engaging, persuasive, satisfying, and the like. As mentioned above, the participants' answers helped uncover the attitudinal behaviours within the task areas in detail, as the specific perception questions scoped the topic for examination. Once the research was completed, the data was tabulated into a spreadsheet called "Quant", with a median value calculated per task. As Sauro and Lewis (2020) state that more points increase the validity and reliability of rating scales, a 7-point Likert scale was felt to be the best fit for reliably measuring the participants' attitudes towards the website. As the research was a formative assessment, these numbers did not account for margins of error or confidence levels; hence the quantitative data should not be regarded as statistical analysis. In other words, they reflect the degrees of the attitudinal behaviours gathered for such purposes, and the technique opened up follow-up questions such as why, what and how during the research. A sketch of this tabulation follows below.
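The sketch below mirrors the "Quant" tabulation described above: per-task sums of the 3-point success scores and medians of the 7-point Likert ratings. All numbers are hypothetical placeholders, not the study's data.

```python
# Minimal sketch of the per-task tabulation: success-score sums and
# Likert-rating medians. Rows are participants, columns are tasks.
from statistics import median

success_scores = [       # 3 = passed easily, 2 = with difficulty, 1 = failed
    [3, 2, 3, 1],
    [3, 3, 2, 2],
    [2, 3, 3, 1],
]
likert_ratings = [       # e.g. "How easy was it to find?" on a 1-7 scale
    [6, 4, 7, 2],
    [7, 5, 5, 3],
    [5, 6, 6, 2],
]

for task in range(len(success_scores[0])):
    s = [row[task] for row in success_scores]
    l = [row[task] for row in likert_ratings]
    print(f"Task {task + 1}: success sum={sum(s)}, success median={median(s)}, "
          f"Likert median={median(l)}")
```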
Four Main Recommendations
Wayfinding
The navigation should establish the submenu items' hierarchy and classification to increase findability with information scent and reduce interaction cost (Loranger, H., 2013). Considering the number of submenu items, it was noted that many participants had to scan through the individual submenu items until they found the right one, which they expected to see under the "Get Involved" menu. A couple of the participants expressed that they were overwhelmed by the several subpages in the secondary navigation while browsing the website.
The classification would improve memorability and hence, findability. It would lower the interaction cost for users as they could find the relevant information as quickly as possible. Therefore, it would increase the findability of each menu item and the right information both for new visitors and regular users. As card or tree testing has not been conducted, this should be regarded as a design hypothesis.
| Your support | Other ways to help | Partnership |
|---|---|---|
| Donation | Art from Crisis | Corporate |
| Volunteer | Shop from Crisis | Trusts, Statutory & National Lottery |
| Fundraise | Virtual gifts | Venture Studio |
| Campaign with us (currently Campaign) | Little helpers (currently Resources for young people) | |
Consistent location and design of navigation throughout the website (W3C, 2022). Multiple subpages have inconsistent navigation systems. For example, the Campaign and Volunteer pages do not provide the main navigation but secondary navigation with breadcrumbs, whereas the Donate page lacks the secondary navigation. The main navigation should be shown on every single page for consistency, which helps users navigate the website more easily without getting lost.
Batch filtering can minimize the waiting time for pages to load if a website is slow or is used on mobile devices with high latency. It would be the best fit for finding a volunteering role or a fundraising event on the Crisis website, as the filter options are not overly broad or extensive and there is a limited number of criteria, as the table below depicts (Sherwin, K., 2016).
| Filter | Options |
|---|---|
| Filter by Nation | Remote, England, Scotland, Wales |
| Filter by Location | UK-wide and Remote, Birmingham, Edinburgh, South Wales, Coventry & Warwickshire, Croydon, London, Merseyside, Newcastle, Oxford, Swansea, South Yorkshire |
| Filter by Date | By week, By month, By year |
| Filter by Role (only relevant to "Find a volunteer role") | Fundraiser, Local services, Retail, Others |
| Filter by Event (only relevant to "Fundraise") | All, Run, Cycle, Walk, Swim, Others |
Spark and signal as triggers
- Signal
  - The target audience who may already have the motivation to support the charity would like to know how transparent and reliable the organization is, and where and how donations are spent: for example, a dedicated information page for donations with infographics and the charity's annual financial report within the website.
- Spark
  - The target audience who may lack motivation could be persuaded, and the website could achieve the target behaviour from them, if it triggers a spark: that is, photography of real homeless people with their stories, showing how the charity supports them and contributes to changing their lives.
Content design with information scent
Decluttered and logically flowing content with information scent, to increase the discoverability of information that is related to the page's goal and aligns with users' intention on the page (IxDF Course Instructor, 2023).
Infographics and key points rather than long-winded descriptions/articles, to optimize the user experience and the efficiency of information foraging (Budiu, R., 2019).
Clear and effective communication with the audience
As some participants stated, the website contains a great amount of information, which hinders them from finding the right information and from understanding what the charity is trying to achieve in general. Setting the right information goal (Budiu, R., 2019) for each page would improve communication between the target audience and the website, along with the rest of the key areas mentioned above.
In line with Spark and Signal as Triggers, storytelling could draw more people's attention, and the credibility of the charity could improve in tandem.
Conclusion
The formative assessment was successfully completed, and it was a great opportunity to learn how much work, effort, time, and multidisciplinary knowledge are involved in user research. When planning the research, the distinction between the purposes of qualitative and quantitative data seemed subtle; through the research it became clear when, why, and how each can be used depending on the purpose of the study. I hope that my findings and recommendations will be of help to Crisis.org.uk, and that no one is criminalized for financial predicament and destitution when an unfair socio-economic system has driven them out.