As avid readers may know, I used to work in DevRel for an open source voice assistant company called Mycroft.AI. In the past two weeks, Mycroft.AI’s CEO, Michael Lewis, has announced that the company is winding down operations, and will be unable to fulfill any further Kickstarter pledges. In this post, I first give the Mark II a technical once-over, and conclude that it’s a solid piece of kit. So, why couldn’t a privacy-focused voice assistant go to market and reach adoption in an age of growing awareness of surveillance capitalism [1], datafication [2] and commodified, open hardware? In the second half of this post, I take a critical lens to voice assistants more broadly, and reflect on the ecosystems, infrastructures and other components that are needed to provide value to the end user – and why voice assistants are currently hitting the “trough of disillusionment”.
For full transparency, it’s important that I state that as part of my contract arrangement with Mycroft – I contracted for around 18 months between September 2017 and January 2019 – I was offered, and took up, stock options – although they’re now worthless.
The Mycroft Mark II Hardware
The box for the Mark II was sturdy, and well-designed, carrying forward many of the design features of its predecessor, the Mark 1: folded cardboard which provided protection and buffering for the device inside. It survived a long international journey without a scratch.
Inside was a useful “Get started” card that provided Quick Start links, and a hardware warranty leaflet.
The smiley face upon opening the box was a particularly thoughtful touch. We are used to anthropomorphising voice assistants – imbuing them with human characteristics. I’m torn on this – on one hand, they perform part of the function of interacting with another human, but lack so much of the nuance and meaning that emerges from engaging with humans.
One other very pleasing aspect of the unboxing was that I didn’t have to buy an Australian general power outlet adaptor. Australia uses Type I sockets – and foreign-manufactured devices often require an adaptor (thank goodness for USB).
The Australian Type I socket isn’t shown here because it’s connected to the Mycroft Mark II.
Setting up the device
I found setting up the Mark II to be incredibly smooth. It ships with “Dinkum” software, which, according to the Mycroft website, is a stable operating system for the Mark II. However, it diverges significantly from the previous Mycroft Core software – meaning that Skills developed for earlier iterations of Mycroft software – for Picroft, or the Mark I, or Mycroft on Linux – won’t work on the Mark II. I found Dinkum to be very stable – it “just worked” out of the box.
Connecting to WiFi
The Mark II uses WiFi connection technology from Pantacor. Once the Mark II had finished booting, I was advised to connect to an SSID that the Mark II was advertising using my phone or laptop, and from there, enter my WiFi credentials. I wasn’t surprised that only my 2.4GHz SSID was detected – I have a dual band router that also advertises a 5GHz SSID – and I wasn’t surprised that enterprise WiFi – WPA2-Enterprise – wasn’t supported. This appears to be a common issue with IoT devices; they will readily connect to consumer grade networks, but cannot handle the additional complexity – such as MSCHAPv2 authentication – required by WPA2-Enterprise networks. I suspect this presents a barrier to enterprise IoT adoption – and certainly voice assistant adoption.
Pairing the device with the backend platform
When I worked at Mycroft.AI, pairing the device with the backend platform – https://home.mycroft.ai – was one of the most frequently problematic parts of the device. This was seamless. The Mark II device displayed a pairing code on screen, along with a URL to visit, and it paired almost immediately. I was then able to set the device’s physical location and timezone.
Testing out Skills
One of the frequent criticisms of the Mark II running Dinkum software is its lack of Skills; this criticism is well founded. I can ask basic questions – which hit the Wikipedia backend – or ask for time and weather information – but anything more is unavailable. I was particularly disappointed that there wasn’t any music integration with the Mark II – which boasts dual 1.5″ 5W drivers – but I can’t hold this against Mycroft.
In late 2020, Spotify revoked the API access that Mycroft was using to provide premium Spotify subscribers (Spotify’s API only works for subscribed users) with access to Spotify via Mycroft. Local music is an option, but because I use music streaming – well, apart from my now-nearly-30-year-old CD collection – this wasn’t of much use.
The Mark II sports some impressive hardware. There is an inbuilt camera (not that any Skills make use of it), with a physical privacy shutter.
The microphone is a 2-mic array with noise cancellation, and a physical mute switch – very important for those who like privacy – although I don’t know the brand of the mic.
The mute button integrated very well with the design of the touch screen – with the NeoPixel indicator taking on a red hue, and the border of the touch screen rendering in red also when hardware mute is on.
Seeed Studio hardware is generally considered best-of-breed for embedded Linux devices, but I don’t think this is a Seeed microphone.
The screen is a 4.3″ IPS touchscreen, and it boasts bright, bold colours. I’m guessing it has a resolution of 800px by 400px, but don’t hold me to that. The board itself is based on the Raspberry Pi 4, and the Pi’s GPIO pins 1, 12, 13, GND and 3.3V power are exposed so that you can integrate other hardware with the device; that is, it is extensible if you’re into open hardware. There’s also a Gigabit Ethernet RJ45 socket – so you are not reliant on WiFi – which was quite useful. The case itself is very sturdy, and needs a Torx screwdriver to open.
SSH’ing into the device
SSH on the default port 22 is blocked and you need to use a custom port to SSH into the device – this is well documented. For security, the Mark II uses SSH keys: you generate an SSH key pair, and enter the public key on the home.mycroft.ai platform. The key is then sent to the device, and you then SSH in to the custom port using the key. In my opinion this is far more secure than depending on passwords. However, it introduces a closer dependency on the home.mycroft.ai platform – and as we’ll see later, this closely ties the hardware to a supporting platform.
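To make the key-based login concrete, here is a minimal sketch that builds the two commands involved: one to generate a key pair, one to connect on a non-default port. The host, port number, username and key path shown are placeholders I have made up for illustration; the real values come from the Mycroft documentation and your own network. Only standard `ssh-keygen` and `ssh` flags are used.

```python
from typing import List


def keygen_command(key_path: str) -> List[str]:
    """Build an ssh-keygen call: -t picks the key type, -f the output file."""
    return ["ssh-keygen", "-t", "ed25519", "-f", key_path]


def ssh_command(host: str, port: int, user: str, key_path: str) -> List[str]:
    """Build an ssh call: -p picks the port, -i the private key file."""
    return ["ssh", "-p", str(port), "-i", key_path, f"{user}@{host}"]


# Placeholder values, NOT the Mark II's real settings.
HOST = "192.168.1.50"
PORT = 2222
USER = "mycroft"
KEY_PATH = "mark2_key"

print(" ".join(keygen_command(KEY_PATH)))
print(" ".join(ssh_command(HOST, PORT, USER, KEY_PATH)))
```

The contents of the generated `.pub` file are what you would paste into the platform; the private key never leaves your machine. Passing either list to `subprocess.run` would execute the command.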
What’s under the hood?
Once SSH’d into the device, I was able to take a closer look at the mycroft-dinkum software in operation.
Voice activity detection / keyword spotter
One of the software layers in a voice assistant stack is a wake word listener or keyword spotter. This software constantly listens for spoken phrases – utterances – and classifies whether the utterance is a wake word. If the wake word listener detects a wake word, then the voice assistant listens for the next utterance, and assumes the next utterance spoken is a voice assistant command.
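The listening loop described above can be sketched as a tiny two-state machine. This is an illustration only, not how mycroft-dinkum implements it: the function names are my own, utterances are represented as already-transcribed strings, and the detector is a plain text match standing in for a trained classifier.

```python
from typing import Callable, Iterable, List


def route_utterances(
    utterances: Iterable[str],
    is_wake_word: Callable[[str], bool],
) -> List[str]:
    """Return the utterances that would be treated as commands.

    The listener idles until it hears a wake word, treats the very
    next utterance as a command, and then goes back to idling.
    """
    commands: List[str] = []
    awaiting_command = False
    for utterance in utterances:
        if awaiting_command:
            commands.append(utterance)
            awaiting_command = False
        elif is_wake_word(utterance):
            awaiting_command = True
    return commands


def naive_detector(utterance: str) -> bool:
    # Stand-in for a trained wake word model: exact text match only.
    return utterance.strip().lower() == "hey mycroft"


heard = ["nice weather today", "hey mycroft", "what time is it", "thanks"]
print(route_utterances(heard, naive_detector))  # ['what time is it']
```

Everything said while the listener is idle is discarded, which is why the accuracy of the wake word classifier matters so much.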
Previous versions of Mycroft used the Precise wake word listener, and I found it regularly had a high number of false positives – triggering when I didn’t say the “Hey Mycroft” wake word – and false negatives – not responding when I did utter the wake word. The Mark II was incredibly accurate in detecting my wake word utterances – especially considering I am female, and have an Australian accent – groups for which Precise performed poorly.
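False positives and false negatives can be put on a numeric footing if you log a series of trials. The sketch below assumes a hypothetical log format of my own, a list of (wake word spoken, device triggered) pairs, and the sample data is invented purely to show the arithmetic; it is not a measurement of Precise or of the Mark II.

```python
from typing import Iterable, Tuple


def wake_word_error_rates(
    trials: Iterable[Tuple[bool, bool]],
) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate).

    Each trial is (wake_word_spoken, device_triggered).
    """
    false_pos = false_neg = spoken_count = silent_count = 0
    for spoken, triggered in trials:
        if spoken:
            spoken_count += 1
            if not triggered:
                false_neg += 1  # wake word said, device ignored it
        else:
            silent_count += 1
            if triggered:
                false_pos += 1  # device woke without the wake word
    fpr = false_pos / silent_count if silent_count else 0.0
    fnr = false_neg / spoken_count if spoken_count else 0.0
    return fpr, fnr


# Invented sample: 4 wake word attempts (1 missed), 4 silent periods (2 triggers).
sample = [
    (True, True), (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, True), (False, True),
]
fpr, fnr = wake_word_error_rates(sample)
print(f"false positive rate: {fpr}, false negative rate: {fnr}")
```

Computing the two rates separately for different speaker groups is one way to surface the kind of accent and gender gap mentioned above.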
mycroft-dinkum software uses several services. The voice service within mycroft-dinkum provides speech recognition, voice activity detection and microphone input. Unsurprisingly, because this service is constantly listening, it consumes a lot of the CPU of the device, which we can see if we run htop on the device.
I was curious about what models were being used for wake word detection, and by looking into the GitHub code, I was able to determine that the Mark II was using the Silero VAD engine. Silero is a pre-trained model. However, there is no information available about the data it was trained on – the GitHub page doesn’t say – or the type of algorithm – such as a recurrent neural network – used to train the model.
I have a hunch – and it’s just a hunch – that Silero uses some of the Mycroft Precise wake word data that was captured via opt-in mechanisms. I was the main female contributor to that dataset – and if Silero was trained on Precise data, it would explain why it’s so accurate for my voice. But, because the provenance of the training data isn’t clear, I won’t be able to tell for sure.
Finding out what is used for speech recognition was slightly more challenging. In previous versions of Mycroft software, the speech recognition layer is configurable through the mycroft-config.conf file, allowing the use of on-device speech recognition, or a preferred cloud provider. That configuration is stored in a different location under mycroft-dinkum, but I was able to find it. The STT module was set to mycroft – I’m not sure what this means, but I think it means Mark II is using the Google cloud service, anonymised through the home.mycroft.ai platform. Again, there’s a cloud – and network – dependency here.
Intent parsing and Skills
Once an STT utterance transcription has been identified by the speech recognition layer, the voice assistant needs to know how to tie the utterance to a command and pass it to a Skill for handling that command. In the Mark II, from what I could tell in the source code, that layer is provided by Padatious, a neural-network based intent parser.
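To show what “tying an utterance to a Skill” means in practice, here is a deliberately simple keyword-overlap matcher. It is not Padatious, which is neural-network based; I have swapped in word counting because it fits in a few lines. The Skill names and keyword sets are hypothetical.

```python
from typing import Dict, Optional, Set

# Hypothetical Skills and trigger words, for illustration only.
SKILL_KEYWORDS: Dict[str, Set[str]] = {
    "weather": {"weather", "forecast", "temperature"},
    "time": {"time", "clock"},
    "wikipedia": {"tell", "about", "wikipedia"},
}


def match_skill(transcript: str) -> Optional[str]:
    """Pick the Skill whose keywords overlap most with the transcript.

    Returns None when no keyword matches at all.
    """
    words = set(transcript.lower().split())
    best_skill: Optional[str] = None
    best_score = 0
    for skill, keywords in SKILL_KEYWORDS.items():
        score = len(words & keywords)
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill


for phrase in ["what is the weather in geelong",
               "tell me about pierogi",
               "play some jazz"]:
    print(phrase, "->", match_skill(phrase))
```

Even this toy version shows the dependency on the layer before it: if speech recognition mis-transcribes “weather”, no intent parser downstream can recover the request.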
The Skills range on the Mark II is very limited – you can search for weather, find out the time in different timezones – which was one of my most-used Skills – play from a pre-defined list of internet radio stations, and query Wikipedia. Limiting the range of Skills means that intent parsing is easier – because you have fewer Skills to choose between. However, passing the right query to a Skill can be problematic if your speech recognition isn’t accurate. Intent parsing for Named Entities – people, places, products, especially Indigenous-language-derived ones – worked reasonably well. Weather queries for “Yarrawonga”, “Geelong” and “Canberra” were all recognised correctly; “Wagga Wagga” wasn’t. Queries to Wikipedia were parsed accurately: “tell me about pierogi” (Polish derivation) and “tell me about Maryam Mirzakhani” (Persian derivation) were both correctly identified.
Text to speech
Text to speech, or synthetic speech, is a complement to speech recognition, and takes written phrases and outputs speech as audio. For this layer, the Mark II uses the Mimic 3 engine, a TTS engine based on the VITS algorithm. The GitHub repo for Mimic 3 doesn’t contain the original data the Mimic 3 voices were trained on, but in this blog post, developer Mike Hansen provides a technical walkthrough of how the models were trained, including the challenges of training new languages, and the phonetic and normalisation challenges that entails.
By default, the Mark II uses the UK English “Pope” Mimic 3 voice, which is a male-gendered voice. I was pleasantly surprised by this, given the industry default of gendering voice assistants as female, which is covered at length in the book The Smart Wife by Yolande Strengers and Jenny Kennedy [10]. There were no Australian accents available.
I have a lot of my own (female, Australian) voice data recorded, but I didn’t want to contribute it to the Mimic 3 project. There are growing concerns around how voice and video data of people is being used to create realistic avatars – digital characters that sound, and look, like real humans – and until we have better regulation, I don’t particularly want a “Kathy” voice out in the wild – because I have no control over how it would be used. For example, I know that sex doll companies have approached voice startups wanting access to voices to put into their products. As we say in ‘Strayan: yeah, nah.
Mimic 3 works offline, and I was surprised at how fast it generated speech – there was a slight delay, but not very noticeable at all. Some of the pronunciations in the Pope voice were definitely a little off – a key issue in TTS is getting the generated voice to have the correct emphasis and prosody – but I was still pretty impressed. When I’m done with my PhD work, I’d love to train a Mimic 3 voice on my own voice data to understand the process a bit better.
Mark II uses the Plymouth splash screen manager for the touch screen visuals, and looking at the htop output, this one component uses over 16% of the available CPU. I found the visuals very well-designed and complementary to the audio information that was presented by the TTS layer.
The screen adds complexity and consideration for Skill designers, however; not only do they have to design a pleasant voice interaction experience in a Skill, but for devices like the Mark II which are multi-modal, the Skill designer must now ensure a complementary visual experience.
The Mark II packs some impressive hardware, and mature software, into a solid form factor. The visual design is well integrated with the conversational design, and it’s intended to be extensible – a boon for those who like to experiment and tinker. The software layers that are used are generally mature, and predominantly run on-device, with the exception of the speech recognition layer (I think). The Skills all have a dependency on external services – Wikipedia, weather and so on. The integration with the home.mycroft.ai backend serves to protect my privacy – as long as I trust that platform – and there was no evidence in any of the digging I did that the device is “leaking” data to third party services.
These positive factors are tarnished by the high price point (listed at $USD 499, although the Kickstarter price was a lot lower at $USD 199) and the lack of Skills that integrate with other services – like Spotify. This is a device that is capable of so much more – it’s a diamond in the rough. But what would give it more polish?
And that takes me to part 2 of this post – ecosystems, infrastructures and the friction of privacy.
A critical lens on an inflection point in voice assistant technology: we are in the trough of disillusionment
It’s a difficult time for the voice assistant industry.
Amazon has recently announced huge layoffs in its Alexa voice assistant department (see write ups by Ars Technica, Business Insider, and Vox), as it struggles to find a path to monetisation for the loss-leading hardware devices that it’s been shipping. Google has followed this trend. This also comes off the back of an architectural change at Google where it dropped support for third party applications on its Assistant voice apps and devices, which has the effect of more tightly integrating its hardware with the Android ecosystem. Apple – the other big player in the voice assistant space, with Siri – has begun quietly laying off third-party contractors, according to reports. Baidu, the largest voice player outside of Western Europe and America, with its Xiaodu assistant, which had a $5b valuation just two short years ago, has also shed 10-15% of its staff. The outlier here is the open source Home Assistant, and its commercial sister operation, Nabu Casa, who have announced the impending launch of a voice-assistant-with-screen competitor to Google Nest Hub and to the Amazon Echo Show, the Home Assistant Yellow.
Previous predictions that there will be 8 billion voice assistant devices in use by this year now appear idealistic indeed.
Voice assistants and theories of innovation
It’s fair to say that voice assistants have reached the “trough of disillusionment”. This term belongs to Gartner and their Hype Cycle; a frame they use to plot the adoption and integration of emerging technology. The “trough of disillusionment” refers to a period in a technology’s history where:
“Interest wanes as experiments and implementations fail to deliver. Producers of the technology shake out or fail. Investments continue only if the surviving providers improve their products to the satisfaction of early adopters.” – Gartner Hype Cycle
The Gartner Hype Cycle is a theory of innovation; different technologies move through the Hype Cycle at varying rates, influenced by a range of factors. Other theories of innovation use divergent analytical lenses, or ascribe primacy to differing drivers and constraints on technology. The diffusion of innovation theory gave us the terms “early adopters” and “laggards”. Induced innovation [11], for example, places emphasis on economic factors such as market demand. Evolutionary theory focuses on the search for better tools and technologies, with the market selecting – and ensuring the survival of – the best. Path dependence models valorize the role of seemingly insignificant historical changes that compound to shape a technology’s dominance or decline. The multi-level perspective blends the micro and macro levels together, showing how they interact to influence technological development. Disruptive innovation theory [12] takes a contingent approach; different innovations require different strategies to challenge and unseat established incumbents. Apple unseated Nokia with touch screens. Linux dominated the data centre due to higher reliability and performance. Netflix swallowed Blockbuster by leveraging increasing internet speeds for content delivery. Disruption harnesses interacting social, economic and political developments.
I digress. What all of these views of innovation have in common, regardless of their focus, is the inter-dependency of the factors that influence a technology’s trajectory – of adoption, success, return on investment.
So what are the inter-dependencies in voice assistant technology that lead us to our current inflection point?
Platform and service inter-dependencies
Voice assistants are interfaces. They enable interaction with other systems. We speak to our phones to call someone. We speak to our televisions to select a movie. We speak to our car console to get directions. A voice assistant like Mycroft, or Alexa, or Google Home or Siri is a multi-purpose interface. It is reliant on connecting to other systems – systems of content, systems of knowledge, and systems of data. Wikipedia for knowledge articles, a weather API for temperature information, or a music content provider for music. These are all platform inter-dependencies.
Here, large providers have an advantage because they can vertically integrate their voice assistant. Google has Google Music, Amazon has Amazon Music, Apple has Apple Music. Mycroft has no music service, and this has no doubt played into Spotify’s decision to block Mycroft from interfacing with it. Content providers know that their content is valuable; this is why Paramount and HBO are launching their own platforms rather than selling content to Netflix.
Voice assistants need other platforms to deliver value to the end user. Apple knew this when they acquired Dark Sky and locked Android users out of the platform; although platforms like OpenWeatherMap are filling this gap. We’ve seen similar content wars play out in the maps space; Google Maps is currently dominant, but Microsoft and Bing are leveraging OpenStreetMap to counter this with Map-Builder – raising the ire of open source purists in the process.
Voice assistants need content, and services, to deliver end user value.
Discovery – or identifying what Skills a voice assistant can respond to
Voice assistants are interfaces. We use myriad interfaces in our everyday lives; a steering wheel, a microwave timer, an espresso coffee maker, an oven dial, a phone app, natural language, a petrol bowser, an EV charger. I doubt you RTFM’d for them all. And here’s why: interfaces are designed to be intuitive. We intuitively know how a steering wheel works (even if, for example, you’ve never driven a manual). If you’ve used one microwave, you can probably figure out another one, even if the display is in a foreign language. Compare this with the cockpit of the Concorde – a cacophony of knobs, buttons and dials.
If you’ve used Alexa or Siri, then you could probably set a timer, or ask about something from Wikipedia. But what else can the assistant do? This is the discovery problem. We don’t know what interfaces can do, because we often don’t use all the functions an interface provides. When was the last time you dimmed the dashboard lights in your car? Or when did you last set a defrost timer on your microwave?
The same goes for voice assistants; we don’t know what they can do, and this means we don’t maximise their utility. But what if a voice assistant gave you hints about what it could do? As Katherine Latham reports in this article for the BBC, it turns out that people find that incredibly annoying; we don’t want voice assistants to interrupt us, or suggest things to us. We find that intrusive.
How, then, do we become more acquainted with a voice assistant’s capabilities?
For more information on skill discovery in voice assistants, you might find this paper interesting: White, R. W. (2018). Skill discovery in virtual assistants. Communications of the ACM, 61(11), 106-113.
Privacy and the surveillance economy – or – who benefits from the data a voice assistant collects?
There is a clear trade-off in voice assistants between functionality and privacy. Amazon Alexa can help you order groceries – from Amazon. Google Home can read you your calendar and email – from Gmail. Frictionless, seamless access to information and agents which complete tasks for you requires sharing data with the platforms that provide those functions. This is a two-fold dependency; first the platforms that provide this functionality must be available – my vertical dependency argument from above. Secondly, you must provide personal information – preferences, a login, something trackable – to these platforms in order to receive their benefits.
I’m grossly oversimplifying his arguments here, but the well-known science and technology studies scholar, Luciano Floridi, has argued that “privacy is friction” [13]. The way that information is organised contributes to, or impinges upon, our privacy. Voice assistants that track our requests, record our utterances, and then use this information to suggestively sell more products to us reduce friction by reducing privacy. Mireille Hildebrandt, in her book Smart Technologies and the End(s) of Law, goes one step further: voice assistants impinge on our personal agency through their anticipatory nature [14]. By predicting, or assuming, our desires and needs, they erode our ability to be reflective in our choices and to be mindful of our activities. Academic Shoshana Zuboff takes a broader view of these developments, theorising that we live in an age of surveillance capitalism; where the data produced through technologies which surveil us – our web browsing history, CCTV camera feeds – and yes – utterances issued to a voice assistant – become a form of capital which is traded in order to more narrowly market to us. Jathan Sadowski has argued, similarly, for the concept of datafication; when our interactions with the world become a form of capital – for trading, for investing, and for extraction.
Many of us have become accustomed to speaking with voice interfaces, and glossing over how those utterances are stored, or linked, or mined – downstream in an ecosystem. In fact, Professor Joe Turow argues this case eloquently in his book, The Voice Catchers [15]: by selling voice assistants at less than cost, their presence, and our interactions with them, have been normalised, and backgrounded. We don’t think anything of sharing our data with the corporate platforms on which they rest. Giving personal data to a voice assistant is something we take for granted. By design. We trade the friction of privacy for the utility of frictionless access to services.
And this points to a key challenge for open source and private voice assistants – like Mycroft and like Home Assistant; in order to deliver services and content through those voice assistants, we have to give up some privacy. Mycroft handles this by abstraction; for example, the speech recognition that is done through Google’s cloud service is channelled through home.mycroft.ai – and done under a single identifier so that individual Mycroft users’ privacy is protected.
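The abstraction idea can be sketched in a few lines. This is a conceptual illustration under my own assumptions, not the actual home.mycroft.ai code: the class names, field names and the shared identifier value are all hypothetical. The point is simply that the relay drops the per-device identifier before anything is forwarded to the cloud speech recognition provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceRequest:
    device_id: str  # identifies one user's device
    audio: bytes    # the recorded utterance


@dataclass(frozen=True)
class UpstreamRequest:
    client_id: str  # the only identity the cloud provider sees
    audio: bytes


# Hypothetical value: every device is presented under this one identity.
SHARED_CLIENT_ID = "relay-shared-identity"


def anonymise(request: DeviceRequest) -> UpstreamRequest:
    """Replace the per-device identifier with the shared one."""
    return UpstreamRequest(client_id=SHARED_CLIENT_ID, audio=request.audio)


kitchen = anonymise(DeviceRequest("device-kitchen", b"\x00\x01"))
office = anonymise(DeviceRequest("device-office", b"\x02\x03"))
print(kitchen.client_id == office.client_id)  # True
```

From the provider’s side, requests from different households are indistinguishable by identifier. The trust has not disappeared, though; it has moved to whoever operates the relay.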
How do voice assistants overcome the tradeoff between utility and privacy?
Hardware is hard
One of my favourite books is The Hardware Hacker [16], by Bunnie Huang, the developer of the almost-unheard-of open source assistant, Chumby. It is a chronicle of the myriad challenges involved in designing, manufacturing, certifying and bringing to market a new consumer device. For every mention of “Chumby” in the book, you could substitute “Mycroft Mark II”. Design tradeoffs, the capital required to fund manufacturing, quality control issues, a fickle consumer market – all are present in the tale of the Mark II.
Hardware is hard.
Hardware has to be designed, tested, integrated with software through drivers and embedded libraries, and certified compliant with regulations. And above all, it has to yield a profit for the hardware manufacturer to keep manufacturing the hardware. If we think about the escalating costs of the Mark II – the Chumby – and then look at how cheaply competitor devices – Alexa, Google Home – are sold, then it becomes clear that hardware is a loss leader for ecosystem integration. I have no way to prove this, but I strongly suspect that the true cost of an Alexa or Google Home is four or five times more than what a consumer pays for it.
Voice assistants are interfaces.
And by having a voice assistant in a consumer’s home, it becomes an interface into a broader ecosystem – more closely imbricating and embroiling the customer in the manufacturer’s ecosystem. And if you can’t transform that loss leader into a recurrent revenue stream – through additional purchases, or through paid voice search, or through voice advertising revenue – then the hardware becomes a sunk cost. And you start laying off staff.
An open source voice assistant strategy is different – its selling point is in opposition to a commercial assistant – its privacy, its interoperability, its extensibility. Its lack of lock in to an ecosystem. And the Mark II is all of those things – private, interoperable and extensible. But it still hasn’t achieved product market fit, particularly at its high, true-cost price point.
How do voice assistants reconcile the cost of hardware and the ability to achieve product market fit?
Overcoming the trough of disillusionment
So how might voice assistants overcome the trough of disillusionment?
Higher utility through additional content, data sources and APIs
Voice assistant utility is a function of how many Skills the device has, how useful those Skills are to the end user, and how frequently they are used. Think about your phone. What apps do you use the most? Text messaging? Facebook? TikTok? What app could you not delete from your phone? Skills require two things; a data source or content source, and a developer community to build them. Open source enthusiasts may build a Skill out of curiosity or to slake a desire to be able to build a Skill, but commercial entities will only invest in Skill building if it generates revenue for a product or service. And then what does the voice assistant manufacturer do to share in that revenue? So we need to find ways to incentivise Skill development (and maintenance) as well as revenue sharing models that help support the infrastructure, Skill development – and the service or data that the Skill interfaces with. For example, Spotify understands this – and will reserve access to their highly-valued content only for business arrangements that help them generate additional revenue.
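The opening claim of this section, that utility depends on how many Skills there are, how useful they are and how often they are used, can be written down as a toy formula. The functional form here (usefulness multiplied by weekly use, summed over Skills) is purely my assumption for illustration, and the ratings and counts are invented.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SkillUsage:
    name: str
    usefulness: float     # subjective rating between 0.0 and 1.0
    uses_per_week: float  # how often the Skill is actually invoked


def assistant_utility(skills: Iterable[SkillUsage]) -> float:
    """Sum usefulness x frequency over every installed Skill."""
    return sum(s.usefulness * s.uses_per_week for s in skills)


# Invented numbers: a few well-used Skills versus many rarely-used ones.
focused = [SkillUsage("timer", 0.5, 20), SkillUsage("music", 1.0, 10)]
sprawling = [SkillUsage(f"novelty-{i}", 0.5, 0.5) for i in range(40)]

print(assistant_utility(focused), assistant_utility(sprawling))
```

Under this assumed form, a long Skill catalogue that nobody uses contributes very little, which is one way of seeing why a missing music integration hurts more than a thin Skill list does.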
I also see governments having a role to play here – imagine for example accessing government services through your voice assistant – no sitting in queues on a phone. The French Minitel service was originally a way to help citizens access services like telephone directories, and postal information. But governments want to both streamline the development they do in-house, and control access to API information; will there be a level of comfort in opening access – and if so, who bears the cost of development?
Distinguishing voice assistants in the home from the voice assistant in your pocket (your mobile phone)
Most of us already have a voice assistant in our pocket – if we have an Android or iPhone mobile phone. So what niche does a physical voice assistant serve? One differentiator I see here is privacy; a voice assistant on a mobile phone ties you to the ecosystem of that device – to Apple, or Google or another manufacturer. Another differentiator is the context in which the voice assistant operates; few of us would want to use a voice assistant in a public or semi-public context, such as on a train or a bus or in a crowd. But a home office is semi-private; and many of us are now working from home. Is there an opportunity for a home office assistant? That isn’t tied to a mobile phone ecosystem? But thinking this through its trajectory of development, if we’re working from home, then we’re working. How will employers seek to leverage voice assistants, and does this sit in opposition to privacy? We are already seeing a backlash against the rise of workplace surveillance (itself a form of surveillance capitalism), so I think there will be barriers to employment or work-based technology being deployed on voice assistants.
Towards privacy, user agency and user choice
I’m someone who places a premium on privacy: I pay for encrypted communications technology and pay for encrypted email that isn’t harvested for advertising. I pay for services like relay that hide my email address from those who would seek to harvest it for surveillance capitalism.
But not everyone does; because privacy is friction, and we have normalised the absence of friction for the price of privacy. As we start to see models like ChatGPT and Whisper that hoover up all the public data on the internet – YouTube videos, public photographs on platforms like Flickr – I think we will start to see more public awareness of how our data is being used – and not always in our own best interests. In voice assistants, this means more safe-guarding of user data, and more protection against harnessing that data for profit.
Voice assistants also have a role to play in user agency and user choice. This means giving people choice about where intents – the commands used to activate Skills – lead. For example, if a commercial voice assistant “sells” the intent “buy washing powder” to the highest “washing powder” bidder, then this restricts user agency and user choice. So we need to think about ways that put control back in the user’s hands – or, in this case, voice. But this of course constrains the revenue generation capabilities of voice assistants.
For voice assistants to escape the trough of disillusionment, they will need to prioritise privacy, agency, utility and choice – and still find a path to revenue.
- 1. Zuboff, S. (2019). The age of surveillance capitalism: The fight for a human future at the new frontier of power. Profile Books.
- 2. Sadowski, J. (2019). When data is capital: Datafication, accumulation, and extraction. Big Data & Society, 6(1), 2053951718820549.
- 10. Strengers, Y., & Kennedy, J. (2021). The smart wife: Why Siri, Alexa, and other smart home devices need a feminist reboot. MIT Press.
- 11. Ruttan, V. W. (1997). Induced innovation, evolutionary theory and path dependence: sources of technical change. The Economic Journal, 107(444), 1520-1529.
- 12. Si, S., & Chen, H. (2020). A literature review of disruptive innovation: What it is, how it works and where it goes. Journal of Engineering and Technology Management, 56, 101568.
- 13. Floridi, L. (2005). The ontological interpretation of informational privacy. Ethics and Information Technology, 7, 185-200.
- 14. Hildebrandt, M. (2015). Smart technologies and the end(s) of law: novel entanglements of law and technology. Edward Elgar Publishing.
- 15. Turow, J. (2021). The voice catchers: How marketers listen in to exploit your feelings, your privacy, and your wallet. Yale University Press.
- 16. Huang, A. B. (2019). The hardware hacker: Adventures in making and breaking hardware. No Starch Press.