Voice-based User Interface vs Text-based User Interface: Comparing perceived usability of two natural language processor (NLP) based question answering (QA) systems

Diarmuid Lane – State Street Advanced Technology Centre, UCC – [email protected]
Robin Renwick – State Street Advanced Technology Centre, UCC – [email protected]
John McAvoy – State Street Advanced Technology Centre, UCC – [email protected]
Philip O'Reilly – State Street Advanced Technology Centre, UCC – [email protected]

Abstract

Purpose
This paper details a comparative perceived usability study of two comparable natural language processor (NLP) based question answering (QA) systems. The two systems operate over two different modalities - voice and text. The study was completed with two custom-built question answering systems, specifically designed for a proposed target user - the Financial Manager.

Design/Methodology/Approach
The study was completed in an exploratory manner within a semi-formal testing environment. An industry and academia accepted measure of perceived usability, the System Usability Scale (SUS), was used as the measurement method.

Research Limitations/Implications
The study was completed with seven participants, along with an additional four participants who acted as key informants. There is scope for expanding these numbers in future studies. The study only examined perceived usability; future testing should attempt to measure usability from a more objective standpoint.

Practical Implications
In the business context it is important for organisations to understand which mode of communication their employees prefer, and why, as employees may increasingly interface with systems similar to the ones proposed within their working environment and business processes.

Originality/Value
Studies have been completed that analyse and seek to understand single communicational methods of distinct systems, but very little research has compared two such systems. There is business and research value in such a study, and this exploratory primary study should lead to deeper and more thought-provoking studies in the future.

Keywords
Usability; Natural Language Processors; Question Answering Systems; Amazon Alexa; IBM Watson Assistant.

Introduction

Current trends have hinted at upheaval in a number of industries, with technology dictating an evolution of business practice (Levy, 2016). The value of emerging technologies and fields such as artificial intelligence and 'big data' lies in their capacity to leverage business practice and extract deeper insight. One example of such a technology, still in its nascent adoption phase, is the natural language processor (NLP) based question answering (QA) system. QA systems are a subfield of NLP research and can be defined as systems focused on the location and retrieval of specific information based on a user's question posed in natural language. The questions, often posed through the medium of speech or text, are interpreted by QA systems acting as a human-computer interface, seeking to deliver effective and efficient responses (Lopez et al., 2011).

It has been noted that interactions with such systems will become increasingly commonplace (Panetta, 2017), with the trend exemplified through increasing adoption of tools such as Amazon Alexa, IBM Watson, Apple Siri, and the Google Assistant (López et al., 2017). Further evidence of QA system proliferation may be seen in the development of 'chat-bot' interfaces (Boulton, 2018), which have become commonplace on the internet, often serving as the communication medium between customer and business organisation. The uptake of QA systems is becoming increasingly prevalent in contemporary workplaces, with chatbots carrying out a range of tasks, from administration through to complex human resource functions (Knight, 2016). The technology research company Gartner has estimated that by 2020 over 50% of large organisations will have internally deployed 'chat-bots' to augment or carry out business processes, drastically reducing levels of procedural human error (Goasduff, 2018). Recent research has highlighted the increased market penetration of Voice-based User Interfaces, emphasising voice as an accepted medium of communication between human and computer (Moore et al., 2016; Moore, 2017a; Moore, 2017b; Simonite, 2016). However, the question remains as to which communication modality - speech or text - is preferred, or more effective, for management to implement. The goal of this paper is to understand how NLP based QA systems may be integrated into organisations by understanding usage, context, and how both firms and employees intend to adopt the technology into business processes.

Core to the idea of interaction preference is the concept of usability. Usability in and of itself is a complex term, with a multitude of interpretations found in varying fields: design science; design engineering; information science; human computer interface (HCI) design; user interface design (UI); and user experience design (UX) (Green and Pearson, 2006). Recent research has addressed the overall importance of 'usable' NLP based QA systems (Hausawi and Mayron, 2013; Ferrucci et al., 2009), but there remains a paucity of research within the context of interface comparison. With this in mind, this paper presents the findings and discussion of a usability study that assessed, analysed, and compared the perceived usability of two distinct NLP based QA systems. The systems leverage emerging technology, reducing time spent manually analysing and querying complex multi-format financial documents.
A Text-based User Interface (TUI) has been developed using the IBM Watson Assistant, described as an artificial intelligence (AI) based system designed to analyse, gather, and extract meaning from large data sets, with potential benefits including increased overall business effectiveness, efficiency, and productivity (IBM, 2018). Concurrently, a Voice-based User Interface (VUI) has been developed using Amazon Alexa, described as a toolset that allows developers to "build highly engaging user experiences with lifelike, conversational interactions, and create new categories of products" (Amazon, 2017, p.1). The study has been conducted through a lens focused on financial service employees operating in Ireland. The proofs of concept (POCs) have been designed to complete tasks similar to those that Financial Managers in a large fund management organisation complete on an ongoing basis. The systems return

answers to questions posed in natural language in relation to specific financial details held in complex data silos. In order to compare and rank the two POCs, the System Usability Scale (SUS) is utilised to assess the user's perceived usability of the systems. The SUS is regarded as one of the leading industry and academic measures of perceived usability (Bangor et al., 2008).

In the following sections the theoretical and practical developments of QA systems are outlined, along with a description of the concept of usability. Numerous models and frameworks to assess usability and user adoption are discussed, along with the preferred model. Following on from this, a formal experimental usability study is described, along with the results and discussion concerning the aforementioned study. Further research will outline how the discussed concepts will affect future NLP based QA system usability research.

Question Answering (QA) Systems - Background and Current Developments

In this section a brief history of Question Answering systems is detailed, from their inception in the 1960s to present-day developments, along with a description of their fundamentals. QA systems are an advanced form of information retrieval, characterised by information requests communicated through the medium of natural language. QA systems, which represent one of the most natural forms of human-computer interaction, focus on the retrieval of specific data points in relation to user questions.

QA systems have featured within academic literature for the past 50 years, with the first instances of such systems appearing in the 1960s and 1970s. Two such examples include BASEBALL (Green et al., 1961) and LUNAR (Woods, 1973), which represent natural language based interfaces interacting with databases (Kolomiyets and Moens, 2011). BASEBALL provided information concerning the American baseball league, and LUNAR provided data regarding soil samples taken from the Apollo lunar exploration. Both systems were found to provide good information provision, but were limited in terms of the size and scope of their information repositories (Mishra and Jain, 2016). Further developments focused on linguistic analysis, especially in the context of question, or query, specific requirements (Androutsopoulos et al., 1993).

Core to the advancement of QA systems in recent decades has been the refinement of advanced artificial intelligence algorithms, such as neural networks, which have allowed for more accurate natural language recognition within NLP systems (Sagara and Hagiwara, 2014; Collobert and Weston, 2008). This has not only benefited the contextual awareness of systems, but also their ability to understand and categorise complex data silos (Mishra and Jain, 2016). In a report published by the International Data Corporation (IDC) it has been estimated that the digital universe will double in size every two years through 2020, totalling 40 trillion gigabytes, or 5,200 gigabytes for every man, woman, and child (Gantz and Reinsel, 2012). Due to this growth, systems which can accurately and effectively locate specific data points within domain specific information bases will become integral to understanding and managing vast amounts of information. One such example of a QA system utilising the power of artificial intelligence to harness data growth is the system developed by IBM, entitled Watson (High, 2012).
Watson, with its advanced NLP capabilities, has been utilised for a myriad of use cases, from medical diagnosis (Ferrucci et al., 2013) through to serving as the interface for a number of website-based frequently asked question query systems. It has even participated in, and won, the game show Jeopardy! (Ferrucci, 2012). The advancement of commercial QA systems is also seen in the wide-scale proliferation of 'chat-bot' technologies (Knight, 2016b). It has been predicted that narrowly focused industry, or organisation specific, chat-bots will generate a global revenue of $623 billion by 2020 (Dale, 2016). This seemingly startling prediction is reconciled with the fact that by 2020 customers

will manage 85% of their relationships with a business without interacting with a human (Gartner, 2011). This prediction can be seen coming to fruition through the number of chatbot services being released. These range from entirely digital banks, where all customer interactions are conducted through chat-bots (Brewster, 2018), through to Facebook Messenger enabled business chatbots with a proven track record of increasing sales and customer interaction. Tommy Hilfiger, through the use of a Messenger chatbot, saw an 87% rate of return visits to the chat-bot, with consumers exchanging 60,000 messages and spending 3.5 times more time conversing with the chatbot than with any other digital channel (Wolfson, 2018).

Both internal and external facing question answering systems have the potential to create benefit for organisations, but it is not yet known which is the preferred medium of interaction: voice or text. This paucity of industry and academic literature has led to the research presented in this paper: a study that assesses the perceived usability of comparable voice-based and text-based NLP QA systems.

Concept of Usability and Testing Methods

This section outlines the concept of usability, and why it is important to test for usability. Following on from this, numerous usability and user experience testing methods and models are outlined. The authors view usability testing as an essential iterative process which management must consider during system design, development, and post-deployment (Sy, 2007).

Concept of Usability

Usability, in and of itself, does not exist in any concrete sense, and can only accurately be defined with reference to a particular context. One commonly used definition is drawn from the International Organisation for Standardisation (ISO) guideline 9241-11, which defines usability as "the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use" (ISO, 1998, p.1). The ISO standard outlines the importance of use case, target audience, and expected mode and method of interaction. Evaluating the context of the research presented in this paper led to the development of a usability study which considers the current level of technological development, the intended use case, and, perhaps most importantly, deployment within a specific target organisation. The target organisation for this research project is a large financial management firm. More specifically, the product is an automated question answering system designed for querying data in natural language, developed specifically for Financial Managers within the organisation.

System usability is a key determinant of system usage, and is pivotal for management to consider when designing and developing a system. Nielsen (2001) outlines four key benefits that management can achieve by measuring the usability of a system or product:

- Tracking Product/System Progress: Management can ensure a successful iterative design/development process by measuring usability between releases.
- Competition Comparison: Measuring system/product usability allows management to assess their position in comparison to competitors.
- Stop/Go Decision Making: Usability is a key measure for management to consider prior to the release of a product/system.
- Bonus Plans: Based on a system's usability performance post-release, bonus plans can be established for design managers and high-level executives.

Understanding how managers gain insights into the role, and method, of adoption of certain technologies is viewed by the authors of this paper as integral to business success. As new technologies have been outlined as key determinants of success, understanding modal preference within NLP based QA systems is viewed as an integral step for firms seeking to understand the implications of this emerging technology for their organisations.

Usability and User Adoption Testing Methods

Within this section various methods and models to assess the usability and user experience of technology are discussed. The System Usability Scale (SUS) was developed as an effective, quick, and easy-to-administer usability survey (Brooke, 1996). SUS has since become a leading industry usability survey, and has been successfully extended to assess a number of vastly differing systems, such as novel hardware platforms, interface technologies, and voice response systems (Bangor et al., 2008). SUS was developed to assess users' subjective perception of usability with respect to a given system, or design (Brooke, 1996). The original SUS questionnaire consisted of 10 statements, scored on a Likert scale, with responses ranging from 'strongly disagree' to 'strongly agree'. The questionnaire alternates between positively and negatively worded statements concerning the usability of the system being tested. This reduces acquiescence bias, whereby users have a tendency to agree with all the positively worded questions, and extreme response bias, whereby users rate everything either very high or very low (Sauro, 2011). A percentile score is then derived based on the provided responses (Brooke, 1996).

There have been some criticisms directed towards SUS, mainly that respondents have been found to mistakenly agree with negative questions, leading to undependable questionnaire scores (Sauro and Lewis, 2011). Due to this ongoing issue of user error, a number of changes have been made to the questionnaire since its inception in the 1990s; roughly 90% of tests since its formation have been found to use a revised version (Bangor et al., 2008). When Brooke (1996) originally developed SUS it was designed to assess the user's perceived usability of a system, and ultimately produce a single SUS usability score. However, more recent research has indicated that SUS contains two separate scales - usability and learnability (Lewis and Sauro, 2009). The addition of learnability allows usability practitioners to test for two measures by administering a single questionnaire; this has been verified independently, with the two scales shown to be correlated (Borsci et al., 2009). The System Usability Scale has been utilised successfully on a number of occasions, from augmented reality learning systems (Lin et al., 2011) through to mobile health management systems (Georgsson and Staggers, 2015). As SUS was designed prior to the proliferation of speech based systems, a review of speech based usability testing is beneficial to detail how the SUS testing method is applicable to the design and decision making process that managers face when deciding how, why, and which types of QA systems to deploy within their organisation.

The Technology Acceptance Model (TAM) was set forth as a means to explore user satisfaction by assessing two distinct aspects: perceived ease-of-use and perceived usefulness (Davis, 1989; Davis et al., 1989). Perceived ease-of-use relates to the level of effort required for a given user to adopt a certain technology product into their operations.
Perceived usefulness relates to the degree to which a user believes a software product would improve their overall job performance. The TAM model has been extensively researched, and has been integrated and extended to assess a range of software solutions and technologies, from academic e-learning software (Hsu and Chang, 2013) through to consumer acceptance of Internet of Things (IoT) technology (Gao and Bai, 2014). The Unified Theory of Acceptance and Use of Technology (UTAUT) was set forth as a means to consolidate the previously devised TAM and related user acceptance research;

UTAUT offers a more complete model to assess user acceptance and overall use of technology (Venkatesh et al., 2003). The theory proposes that four main constructs determine the level of IT/IS adoption and usage intention: performance expectancy; effort expectancy; social influence; and facilitating conditions. These key constructs are further moderated by variables such as age, experience, gender, and voluntariness of use (Venkatesh et al., 2003). The UTAUT has on numerous occasions been successfully extended to assess contemporary IS/IT systems, from e-government services (AlAwadhi and Morris, 2008) through to internet banking acceptance (AbuShanab and Pearson, 2007). Prior to extending the study to incorporate an acceptance model such as TAM or UTAUT, the preferred medium of interaction with QA systems, voice or text, must be established. In order to measure the perceived usability of voice and text based NLP QA systems, the System Usability Scale (SUS) has been selected as the core method of testing for perceived usability within the study.

Research Methodology

The following section outlines the research methodology implemented, from initial key informant interviews through to the administration of a formal usability study. Prior to the implementation of a question answering system, managers must first understand which medium of interaction is preferred: voice or text. The study was conducted through a lens focused on financial service managers operating within their specific industry.

Key Informant Interviews

In order to refine the intended use case of the NLP based QA system within the specific organisational context, qualitative interviews were conducted with professionals from the financial services industry. The use of a key informant based methodology as an approach for data collection is viewed as beneficial when there is a lack of underlying theoretical understanding (Babbie, 1998). The key informants approach was selected due to the paucity of industry and academic literature exploring the concept of NLP based QA system usability. The use of key informants has been successfully applied in a number of scenarios:

- The study of inter-organisational research using key informants when there is a lack of archived data to support such a study (Kumar et al., 1993).
- Developing e-retailing theory when there was a lack of foundational understanding of online retailing (Cowles, 2002).
- Analysing a group's attitudes and beliefs towards software project reviews (McAvoy, 2006).

Four key informants were chosen from four distinct financial services organisations. The positions held by the key informants are as follows: Chief Operating Officer - Head of Innovation EMEA; Financial Modelling Analyst; Equity Trader; and Business Intelligence Analyst. Interviews were conducted on a semi-structured basis (Myers and Newman, 2007; Schultze and Avital, 2011). Within their specific roles, the key informants carry out tasks similar to those completed by the proposed QA systems: querying large documents and retrieving answers based on analysis of specific data points. The interviews were conducted in order to understand what effect an automated QA system would have on the daily operations of both Financial Managers and their employees. They also sought to assess the use context of the QA systems, if implemented in a financial services organisation. The key informant interviews shaped the overall design process, so as to produce a context specific QA system for a financial services organisation.

Numerous research insights were extracted from the interviews. Consensus formed around the importance of the system's 'Effectiveness', with one interviewee stating correct answers were essential in order for them to "trust" a system. Each interviewee stated that QA 'Effectiveness', be it in the context of speech or text, is "vital". QA system 'Efficiency' was communicated to be of great importance to the key informants, with one interviewee stating that a QA system must "meet real-world demands of immediate information access". QA system 'Satisfaction', although important, was considered to be of less importance than 'Effectiveness' and 'Efficiency'. Consensus formed around the idea of satisfaction being derived only if the system was both efficient and effective. In the context of financial services QA systems, satisfaction was perceived to be of least importance, with one interviewee stating: "I would be far more concerned with system performance, rather than satisfaction". Based on the key informant insights, context specific QA system design requirements were derived, aiding in the development of the two NLP QA proofs of concept.

Design of NLP QA Proofs of Concepts

In order to carry out a study into the perceived usability of NLP based QA systems, two proofs of concept (POCs) were developed. A Text-based User Interface (TUI) was developed using IBM Watson Assistant, and presented to the participant in an HTML based website format. A Voice-based User Interface (VUI) was developed using Amazon Alexa, and was presented through an Amazon Echo 2nd Generation device. Based on insights drawn from the key informant interviews, both POCs were designed to represent the demands of a typical NLP based QA system as potentially deployed within a financial services organisation. In order to ensure the systems correlate with the specific use case of a financial services organisation operating in Ireland, sample questions and answers were provided by the target organisation. These sample questions and answers represented queries that a typical Financial Manager would ask of a large fund related document, which represents a typical data silo. An example of a typical question and answer is as follows: Question - "When was the fund launched?"; Answer - "The fund was launched in October 2014".

Both POCs were developed concurrently to ensure the design specifications were as similar as possible. Measures were taken to ensure that the systems had similar levels of ability in terms of 'intent' recognition. The NLP algorithms of both Amazon Alexa and IBM Watson are designed to interpret the question a user asks and return the correct answer accordingly. In order to return the correct answer, a number of 'Sample Utterances' or 'Intents' - varying forms of the questions a user may typically ask of a system - must be programmed into the system itself. In order to remove any bias, the sample utterances programmed were identical, allowing the study to focus on the perceived usability of the system, rather than system functionality. However, some system functionality cannot be aligned, due to the proprietary nature of the NLP algorithms used by each interface manufacturer and the method by which the systems are delivered to the user. For example, the Voice-based User Interface (Alexa) device differs from the open internet browser tab of the Text-based User Interface (IBM Watson Assistant), as it needs a trigger phrase to summon the QA application.
Attempts were also made to ensure that the systems responded to queries with identical information, with differences limited to the respective modes of communication. Each system used its own proprietary NLP algorithms to infer queries.
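To make the 'Sample Utterance'/'Intent' mechanism concrete, the following is a minimal illustrative sketch, not the authors' published interaction model, of how a single intent seeded with identical question variants in both POCs could map to a canned answer. The intent name, utterance list, and exact-match lookup are hypothetical; the production systems rely on the proprietary NLP of Alexa and Watson Assistant rather than exact matching.

```python
# Illustrative sketch only: the paper does not publish its interaction models.
# Both POCs were seeded with identical question variants ("sample utterances"
# in Alexa; intent "examples" in Watson Assistant) mapped to one canned answer.

FUND_LAUNCH_INTENT = {
    "intent": "FundLaunchDate",                     # hypothetical intent name
    "sample_utterances": [                          # identical in both POCs
        "when was the fund launched",
        "what date did the fund launch",
        "what is the launch date of the fund",
    ],
    "answer": "The fund was launched in October 2014.",
}

def answer_for(question: str, intents: list[dict]) -> str | None:
    """Naive exact-match lookup; the real systems infer intent with NLP."""
    normalised = question.lower().strip("?! .")
    for intent in intents:
        if normalised in intent["sample_utterances"]:
            return intent["answer"]
    return None

print(answer_for("When was the fund launched?", [FUND_LAUNCH_INTENT]))
```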

Study Design

The research presented in this paper seeks to assess which interface medium is preferred by Financial Managers operating within financial organisations: voice or text. The evaluation process is bi-focal; each participant engages with both the Voice-based User Interface and the Text-based User Interface within a controlled testing environment. Financial Managers from the target organisation complete a set of identical tasks on each system. This is followed by a SUS questionnaire, which assesses their perceived usability of each system. This is coupled with an informal, open ended interview in which the participants discuss their overall impressions of both systems. The purpose of the open ended interview is to assess whether any issues, themes, or concepts arise in either the interfacing with the two QA systems or the testing procedure itself. It is envisioned that the SUS scores should correlate with the perceptions communicated within the informal interviews.

During the usability study, participants were given financial documents with 12 specific data points highlighted. These consisted of 2 training data points and 10 data points for testing purposes. The two training points, used for demonstration purposes, were excluded from the results. Each participant's level of experience with NLP QA systems was noted at the beginning of each session. A short explanatory briefing was given to each participant to explain the functionality of the system, and to demonstrate how they would interact with each proof of concept (POC); the 2 training data points were used in this instance. Once the participant felt comfortable with the mode and process of interaction, the testing process would continue. Within the testing session the participants would complete 10 tasks on each given POC, and afterwards complete the SUS questionnaire.

Participants

For the study, seven Financial Managers from financial service organisations were selected to participate. The selected participants represent experts in their field, comfortable with, and experienced in, the types of data being queried by the system. Seven participants are viewed as an ample number, as it has been found that 80% of usability issues are uncovered by the first five participants of any given test (Lewis, 1994; Nielsen and Landauer, 1993; Virzi, 1992). Albert and Tullis (2013) argue that five participants are adequate per significant user class; in the case of the following study it is viewed that only one user class exists - that of Financial Managers. Albert and Tullis (2013) also state that the evaluation must be fairly limited, with a goal of 5-10 tasks, and that the user audience be well defined and represented for testing purposes. The study presented within this paper proposes a 10 task methodology.

Tasks

The study participants were asked to carry out ten tasks on both the Text-based User Interface (TUI) and the Voice-based User Interface (VUI). The ten tasks consist of locating and querying specific data points from a financial document. The tasks for both the TUI and VUI are designed to allow the participants to utilise both systems in a workplace related context of use. In order to remove bias from the study procedure, the order in which the participants tested the systems was alternated.
Variables are confounded if they vary together in such a way that it is not possible to determine which variable is responsible for an observed effect (Cochran and Rubin, 1973). To avoid this, a process known as counterbalancing is implemented within the testing procedure, whereby the order of testing is alternated between participants (Albert and Tullis, 2013). This ensures participants do not consecutively learn how to use the first system and apply the skills learned to the second system (see Table 1.0 for the order; an illustrative sketch of the ordering follows).
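The following is a minimal sketch, not the authors' code, of how the counterbalanced system order in Table 1.0 can be generated for an arbitrary number of participants.

```python
# Minimal sketch: generate the counterbalanced system order of Table 1.0
# by alternating which interface each successive participant tests first.

def counterbalanced_order(n_participants: int) -> list[tuple[str, str]]:
    """Return (first, second) system pairs, alternating per participant."""
    orders = []
    for i in range(n_participants):
        first, second = ("Voice", "Text") if i % 2 == 0 else ("Text", "Voice")
        orders.append((first, second))
    return orders

for participant, (first, second) in enumerate(counterbalanced_order(7), start=1):
    print(f"Participant {participant}: first {first}, then {second}")
```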

Participant      First Tested    Second Tested
Participant 1    Voice           Text
Participant 2    Text            Voice
Participant 3    Voice           Text
Participant 4    Text            Voice
Participant 5    Voice           Text
Participant 6    Text            Voice
Participant 7    Voice           Text
Table 1.0 - Counterbalanced system testing order per participant

Data Collection - SUS

In order to quantify the perceived usability of the two NLP based QA systems, the System Usability Scale (SUS) was administered to the participants at the end of every session. The participants rate the magnitude with which they agree or disagree with each statement, selecting an answer ranging from 'strongly disagree' to 'strongly agree'. The score contribution for each odd-numbered item is the scale position minus 1. For the even-numbered items, which are negatively worded, the contribution is 5 minus the scale position. Each item's contribution therefore ranges from 0 to 4. To calculate the final SUS score, the 10 item contributions are summed and multiplied by 2.5. The final SUS score ranges from 0 to 100 in increments of 2.5 (see Figure 1.0 for a scoring example). A higher SUS score indicates a higher level of perceived usability for the study participant.
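The scoring rule described above can be expressed as a short sketch, assuming responses are coded 1 ('strongly disagree') to 5 ('strongly agree'); the example response pattern is hypothetical.

```python
# Minimal sketch of the SUS scoring rule (Brooke, 1996), assuming each of the
# 10 responses is coded 1 ('strongly disagree') to 5 ('strongly agree').

def sus_score(responses: list[int]) -> float:
    """Compute a 0-100 SUS score from the 10 Likert responses."""
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    total = 0
    for i, r in enumerate(responses):
        # Items 1, 3, 5, 7, 9 (even index) are positively worded: position - 1.
        # Items 2, 4, 6, 8, 10 (odd index) are negatively worded: 5 - position.
        total += (r - 1) if i % 2 == 0 else (5 - r)
    return total * 2.5

# Hypothetical strongly positive response pattern scoring near the top.
print(sus_score([5, 1, 5, 1, 5, 1, 5, 2, 4, 1]))  # 95.0
```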


Figure 1.0 - An example of SUS scoring (Albert and Tullis, 2013)

SUS Analysis

SUS scores for both the Voice-based User Interface (VUI) and the Text-based User Interface (TUI) were calculated for all participants, and a mean rating per POC is provided (see Table 2.0). The scores are represented in a grade scale format ranging from 'F' to 'A', based on the percentile SUS score (Bangor et al., 2009). The grading scale is utilised to grade the individual and mean SUS scores, allowing for a more meaningful interpretation of the results. For example, a system which receives a SUS score of 70 falls within the 50th percentile, representing the average SUS score, and would receive a grade of 'C' (see Figure 2.0; an illustrative sketch of this mapping follows the figure).

Figure 2.0 - SUS Grading Scale (Bangor et al., 2009)
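The grade mapping can be sketched as below. The cutoffs are illustrative placeholders inferred from the discussion (70 at roughly the 50th percentile maps to 'C'); the published percentile curve in Bangor et al. (2009) should be consulted for the exact boundaries.

```python
# Illustrative sketch of the SUS-score-to-grade mapping used in Figure 2.0.
# The band cutoffs below are assumed placeholders, not the published values.

def sus_grade(score: float) -> str:
    """Map a 0-100 SUS score to a letter grade (illustrative cutoffs)."""
    bands = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]  # assumed cutoffs
    for cutoff, grade in bands:
        if score >= cutoff:
            return grade
    return "F"

print(sus_grade(95.0))  # 'A'
print(sus_grade(70.0))  # 'C'
```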

Overall, the TUI (IBM Watson) outperformed the VUI (Amazon Alexa) in all but one of the participant sessions, in which case both systems received a grade of 'A'. On average the Text-based User Interface received an 'A' grade, placing it in the upper band of the acceptability range, while the Voice-based User Interface received an average grade of 'C', placing it within the acceptability range. The acceptability range covers scores between 70 and 100, with anything falling below the lower bound of 70 (a grade of 'C') deemed unacceptable from a usability perspective (Bangor et al., 2008). The median grade for the Voice-based User Interface was 'B', whilst the median grade for the Text-based User Interface was 'A'.

Participant         VUI SUS Score (Alexa)    TUI SUS Score (IBM Watson)
1                   C                        A
2                   F                        B
3                   C                        A
4                   B                        A
5                   A                        A
6                   A                        A
7                   B                        A
Mean SUS Score      C                        A
Median SUS Score    B                        A

Table 2.0 - SUS grades per participant and per system

Discussion

Following on from the SUS calculation and analysis, the data suggests that the Text-based User Interface is the preferred medium of interaction in the context of NLP based question answering systems. The study, conducted through the lens of financial services managers operating within financial service organisations, implies that although Voice-based User Interfaces rank highly on the SUS grading scale, a text-based alternative is still preferred. Table 2.0 shows that the Text-based User Interface consistently ranked higher than the Voice-based User Interface in all but one participant SUS score, in which case both received a grade of 'A'. The participants completed an identical series of tasks, with the study design incorporating counterbalanced task ordering (see Table 1.0). This bias reduction technique enhances the validity of the study, further indicating the Financial Managers' preference for a Text-based User Interface.

Research Limitations

Numerous research limitations were encountered during the completion of the usability study. The primary, and most consequential, was the lack of a context specific usability testing model for NLP based question answering systems. In the absence of such a model, the System Usability Scale was employed to assess the Financial Managers' perceived usability of the two proofs of concept. In order to comprehensively assess

the usability of NLP based question answering systems, a context specific usability testing framework is required, assessing both the objective and subjective aspects of usability.

Further Research

A more robust usability study, incorporating a context specific usability model, would aid in supporting the initial findings outlined in this paper. The paper has found that the preferred medium of interaction with an NLP based question answering system is the modality of text. As the System Usability Scale assesses the subjective aspect of system usability, it would be appropriate for a context specific model to incorporate objective metrics. Doing so would offer a greater degree of validity to the findings, and afford a clearer view of the respective performance of the two systems. It is easy to say that participants preferred one system over another, but objective measures would allow decision makers to analyse why that may be the case, or how a system might be improved so that the disparities between the two systems are no longer as stark. An extended study would also allow decision makers to understand how the systems performed under certain circumstances, and within the context of specific defined tasks, in comparison to each other. For instance, one may find that differing use cases alter objective outcomes, or that continued use reduces performance gaps as participants learn the technologies and adapt their work processes.

Conclusion

In order for managers to decide which medium of interaction is most appropriate within their organisations, they have to understand the technology, the users of the technology, and the context within which the technology will be used. All three are extremely important when making decisions on how to adopt, deploy, or manage emerging technology within any organisation. Natural language processor based question answering system deployment is viewed as one of the main emerging areas of technology deployment as organisations move further into the 21st century. Understanding such systems, and the way people feel about using them, is important. Measuring system usability from the perspective of the target user is of paramount importance to any organisation or decision maker who wishes to implement this type of technology.

Prior to the utilisation, or extension, of a user acceptance model such as the Technology Acceptance Model or the Unified Theory of Acceptance and Use of Technology, managers must first establish which medium of interaction, voice or text, scores higher within their organisation - especially in the context of usability. Decision makers must understand the context of use, and determine why some employees prefer one mode of communication over another, or in which work context one system may be preferred over another. Managers may find the answers to be industry, organisation, or even task specific. Measures of usability give management a base understanding of user experience, and once a preferred mode of interaction is established, firms can begin to incorporate technology acceptance models into their future studies, especially with respect to deployment and engagement. Although voice based question answering systems are becoming increasingly prevalent in today's world, this study shows the preferred medium of interaction with such systems is still in fact text.
In the case of Financial Managers operating within a contemporary financial management firm, the perceived usability of a text-based QA system was rated significantly higher than that of a voice-based equivalent. The paper proposes the development of a context specific usability testing model in order to assess all aspects of system usability, incorporating both objective and subjective usability metrics. The two proofs of concept described in this paper will then be tested again within the proposed updated and extended model.

References

AbuShanab, E. and Pearson, J.M. (2007). Internet banking in Jordan: The unified theory of acceptance and use of technology (UTAUT) perspective. Journal of Systems and Information Technology, 9(1), pp.78-97.

AlAwadhi, S. and Morris, A. (2008). The Use of the UTAUT Model in the Adoption of E-government Services in Kuwait. In Hawaii International Conference on System Sciences, Proceedings of the 41st Annual (pp. 219-219). IEEE.

Amazon. (2017). What is Amazon Lex?. Available at: https://docs.aws.amazon.com/lex/latest/dg/what-is.html (Accessed on 07/02/2018).

Androutsopoulos, I., Ritchie, G. and Thanisch, P. (1993). Masque/sql - An Efficient and Portable Natural Language Query Interface for Relational Databases. Database technical paper, Department of AI, University of Edinburgh.

Bangor, A., Kortum, P.T. and Miller, J.T. (2008). An empirical evaluation of the System Usability Scale. Intl. Journal of Human-Computer Interaction, 24(6), pp.574-594.

Bangor, A., Kortum, P. and Miller, J., (2009). Determining what individual SUS scores mean: Adding an adjective rating scale. Journal of usability studies, 4(3), pp.114-123.

Borsci, S., Federici, S. and Lauriola, M. (2009). On the dimensionality of the System Usability Scale: a test of alternative measurement models. Cognitive processing, 10(3), pp.193-197.

Boulton, C. (2018). Virtual assistants, chatbots poised for mass adoption in 2017. [online] CIO. Available at: https://www.cio.com/article/3153966/artificial-intelligence/virtualassistants-chatbots-poised-for-mass-adoption-in-2017.html [Accessed 30 May 2018].

Brewster, S. (2018). Are you ready to trust your bank account to a team of chatbots?. [online] MIT Technology Review. Available at: https://www.technologyreview.com/s/601418/doyour-banking-with-a-chatbot/ [Accessed 10 Jul. 2018].

Brooke, J. (1996). SUS-A quick and dirty usability scale. Usability evaluation in industry, 189(194), pp.4-7.

Cochran, W.G. and Rubin, D.B. (1973). Controlling bias in observational studies: A review. Sankhyā: The Indian Journal of Statistics, Series A, pp.417-446.

Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 160-167. ACM.

Dale, R. (2016). The return of the chatbots. Natural Language Engineering, 22(5), pp.811-817.

Davis, F.D. (1989). Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS quarterly, pp.319-340.

Davis, F.D., Bagozzi, R.P. and Warshaw, P.R. (1989). User acceptance of computer technology: a comparison of two theoretical models. Management science, 35(8), pp.982-1003.

Ferrucci, D., Nyberg, E., Allan, J., Barker, K., Brown, E., Chu-Carroll, J., Ciccolo, A., Duboue, P., Fan, J., Gondek, D. and Hovy, E. (2009). Towards the open advancement of question answering systems. IBM, Armonk, NY, IBM Res. Rep.

Ferrucci, D.A. (2012). Introduction to "This is Watson". IBM Journal of Research and Development, 56(3.4), pp.1-1.

Ferrucci, D., Levas, A., Bagchi, S., Gondek, D. and Mueller, E.T. (2013). Watson: beyond jeopardy!. Artificial Intelligence, 199, pp.93-105.

Gantz, J. and Reinsel, D. (2012). The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east. IDC iView: IDC Analyze the future, 2007(2012), pp.1-16.

Gao, L. and Bai, X. (2014). A unified perspective on the factors influencing consumer acceptance of internet of things technology. Asia Pacific Journal of Marketing and Logistics, 26(2), pp.211-231.

Gartner (2011). Gartner Customer 360 Summit 2011 - Gartner.com [online] Gartner.com. Available at: https://www.gartner.com/imagesrv/summits/docs/na/customer360/C360_2011_brochure_FINAL.pdf#page=2 [Accessed 10 Jul. 2018].

Georgsson, M. and Staggers, N. (2015). Quantifying usability: an evaluation of a diabetes mHealth system on effectiveness, efficiency, and satisfaction metrics with associated user characteristics. Journal of the American Medical Informatics Association, 23(1), pp.5-11.

Goasduff, L. (2018). Chatbots Will Appeal to Modern Workers. Available at: https://www.gartner.com/smarterwithgartner/chatbots-will-appeal-to-modernworkers/ (Accessed on 04/04/2018).

Green Jr, B.F., Wolf, A.K., Chomsky, C. and Laughery, K. (1961). Baseball: an automatic question-answerer. In Papers presented at the May 9-11, 1961, Western Joint IRE-AIEE-ACM Computer Conference (pp. 219-224). ACM.

Green, D. and Pearson, J.M. (2006). Development of a web site usability instrument based on ISO 9241-11. Journal of Computer Information Systems, 47(1), pp.66-72.

Hausawi, Y.M. and Mayron, L.M. (2013). Towards usable and secure natural language processing systems. In International Conference on Human-Computer Interaction, Las Vegas, NV, pp. 109-113. Springer, Berlin, Heidelberg.

High, R. (2012). The era of cognitive systems: An inside look at IBM Watson and how it works. IBM Corporation, Redbooks.

Hone, K.S. and Graham, R. (2001). Subjective assessment of speech-system interface usability. In Seventh European Conference on Speech Communication and Technology.

Hsu, H.H. and Chang, Y.Y. (2013). Extended TAM model: Impacts of convenience on acceptance and use of Moodle. Online Submission, 3(4), pp.211-218.

IBM. (2018). IBM Watson: Why Watson. [ONLINE] Available at: https://www.ibm.com/watson/about/index.html [Accessed: 29/01/18].

ISO (1998). ISO 9241-11: Ergonomic requirements for office work with visual display terminals (VDTs). The International Organization for Standardization, 45, p.9.

Knight, W. (2016). The HR Person at Your Next Job May Actually Be a Bot. Available at: https://www.technologyreview.com/s/602068/the-hr-person-at-your-next-job-may-actuallybe-a-bot/

Knight, W. (2016b). 10 Breakthrough Technologies 2016: Conversational Interfaces - MIT Technology Review. [ONLINE] Available at: https://www.technologyreview.com/s/600766/10-breakthrough-technologies-2016conversational-interfaces/. [Accessed 10 July 2018].

Kolomiyets, O. and Moens, M.F. (2011). A survey on question answering technology from an information retrieval perspective. Information Sciences, 181(24), pp.5412-5434.

Levy, H. P. (2016). Gartner's Top 10 Strategic Predictions for 2017 and Beyond: Surviving the Storm Winds of Digital Disruption. Available at: https://www.gartner.com/smarterwithgartner/gartner-predicts-a-virtual-world-of-exponentialchange/ (Accessed on 07/02/18).

Lewis, J.R. (1994). Sample sizes for usability studies: Additional considerations. Human factors, 36(2), pp.368-378.

Lewis, J.R. and Sauro, J. (2009). The factor structure of the system usability scale. In International conference on human centered design (pp. 94-103). Springer, Berlin, Heidelberg.

Lin, H.C.K., Hsieh, M.C., Wang, C.H., Sie, Z.Y. and Chang, S.H. (2011). Establishment and Usability Evaluation of an Interactive AR Learning System on Conservation of Fish. Turkish Online Journal of Educational Technology-TOJET, 10(4), pp.181-187.

López, G., Quesada, L. and Guerrero, L.A. (2017). Alexa vs. Siri vs. Cortana vs. Google Assistant: a comparison of speech-based natural user interfaces. In International Conference on Applied Human Factors and Ergonomics (pp. 241-250). Springer, Cham. Lopez, V., Uren, V., Sabou, M. and Motta, E. (2011). ‘Is question answering fit for the semantic web?: a survey’. Semantic Web, 2(2), pp.125-155.

McAvoy, J. (2006). Evaluating the Evaluations: Preconceptions of Project Post-Mortems. Electronic Journal of Information Systems Evaluation, 9(2).

Mishra, A. and Jain, S.K. (2016). A survey on question answering systems with classification. Journal of King Saud University-Computer and Information Sciences, 28(3), pp.345-361.

Möller, S., Engelbrecht, K.P. and Schleicher, R. (2008). Predicting the quality and usability of spoken dialogue services. Speech Communication, 50(8-9), pp.730-744.

Moore, R. K., Li, H., & Liao, S. H. (2016). Progress and Prospects for Spoken Language Technology: What Ordinary People Think. In INTERSPEECH (pp. 3007-3011).

Moore, R. K. (2017a). Is spoken language all-or-nothing? Implications for future speech-based human-machine interaction. In Dialogues with Social Robots (pp. 281-291). Springer, Singapore.

Moore, R. K. (2017b). A needs-driven cognitive architecture for future 'intelligent' communicative agents. In Proceedings of EUCognition 2016 - "Cognitive Robot Architectures" (Vol. 1855, pp. 50-51). CEUR Workshop Proceedings.

Nielsen, J. (2001). Usability Metrics. [online] Nielsen Norman Group. Available at: https://www.nngroup.com/articles/usability-metrics/ [Accessed 16 Jul. 2018].

Nielsen, J. and Landauer, T.K. (1993). A mathematical model of the finding of usability problems. In Proceedings of the INTERACT'93 and CHI'93 conference on Human factors in computing systems (pp. 206-213). ACM.

Panetta, K. (2017). Gartner Top Strategic Predictions for 2018 and Beyond. Available at: https://www.gartner.com/smarterwithgartner/gartner-top-strategic-predictions-for-2018-andbeyond/ (Accessed on 04/04/2018).

Portet, F., Vacher, M., Golanski, C., Roux, C. and Meillon, B. (2013). Design and evaluation of a smart home voice interface for the elderly: acceptability and objection aspects. Personal and Ubiquitous Computing, 17(1), pp.127-144.

Sauro, J. (2011). MeasuringU: Are both positive and negative items necessary in questionnaires?. [online] Measuringu.com. Available at: https://measuringu.com/positivenegative/ [Accessed 11 Jul. 2018].

Sauro, J. and Lewis, J.R. (2011). When designing usability questionnaires, does it hurt to be positive?. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 2215-2224). ACM.

Simonite, T. (2016). Google Thinks You're Ready to Converse with Computers. MIT Technology Review. Available at: https://www.technologyreview.com/s/601530/googlethinks-youre-ready-to-conversewith-computers/ (retrieved February 13, 2018).

Sy, D. (2007). Adapting usability investigations for agile user-centered design. Journal of usability Studies, 2(3), pp.112-132.

Torres, J., Vaca, C. and Abad, C.L. (2017). What Ignites a Reply?: Characterizing Conversations in Microblogs. In Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (pp. 149-156). ACM.

Vacher, M., Caffiau, S., Portet, F., Meillon, B., Roux, C., Elias, E., Lecouteux, B. and Chahuara, P. (2015). Evaluation of a context-aware voice interface for Ambient Assisted Living: qualitative user study vs. quantitative system evaluation. ACM Transactions on Accessible Computing (TACCESS), 7(2), p.5.

Venkatesh, V., Morris, M.G., Davis, G.B. and Davis, F.D. (2003). User acceptance of information technology: Toward a unified view. MIS quarterly, pp.425-478.

Virzi, R.A. (1992). Refining the test phase of usability evaluation: How many subjects is enough?. Human factors, 34(4), pp.457-468.

Wolfson, R. (2018). Chatbots on Facebook Messenger linked to increased sales. [online] VentureBeat. Available at: https://venturebeat.com/2017/08/01/chatbots-on-facebookmessenger-linked-to-increased-sales/ [Accessed 10 Jul. 2018].

Woods, W.A. (1973). Progress in natural language understanding: an application to lunar geology. In Proceedings of the June 4-8, 1973, national computer conference and exposition (pp. 441-450). ACM.