The Mycroft Mark II and the wind-down of Mycroft AI: it’s all about ecosystems, infrastructures and the friction of privacy

As avid readers may know, I used to work in DevRel for an open source voice assistant company called Mycroft.AI. In the past two weeks, Mycroft AI’s CEO, Michael Lewis, has announced that the company is winding down operations, and will be unable to fulfill any further Kickstarter pledges. In this post, I first give the Mark II a technical once-over, and conclude that it’s a solid piece of kit. So, why couldn’t a privacy-focused voice assistant go to market and reach adoption in an age of growing awareness of surveillance capitalism [1] and datafication [2], and of commodified, open hardware? In the second half of this post, I take a critical lens to voice assistants more broadly, and reflect on the ecosystems, infrastructures and other components that are needed to provide value to the end user – and why voice assistants are currently hitting the “trough of disillusionment”.

For full transparency, it’s important that I state that as part of my contract arrangement with Mycroft – I contracted for around 18 months between September 2017 and January 2019 – I was offered, and took up, stock options – although they’re now worthless.

The Mycroft Mark II Hardware

Unboxing

Mycroft Mark II box

The box for the Mark II was sturdy and well-designed, carrying forward many of the design features of its predecessor, the Mark I: folded cardboard which provided protection and buffering for the device inside. It survived a long international journey without a scratch.

Inside was a useful “Get started” card that provided Quick Start links, and a hardware warranty leaflet.

Mycroft smiley face upon opening the box

The smiley face upon opening the box was a particularly thoughtful touch. We are used to anthropomorphising voice assistants – imbuing them with human characteristics. I’m torn on this: on one hand, they perform part of the function of interacting with another human; on the other, they lack so much of the nuance and meaning that emerges from engaging with actual humans.

Variety of power connectors

One other very pleasing aspect of the unboxing was that I didn’t have to buy an Australian general power outlet adaptor. Australia uses Type I sockets – and foreign-manufactured devices often require an adaptor (thank goodness for USB).

The Australian Type I plug isn’t shown here because it’s connected to the Mycroft Mark II.

Setting up the device

I found setting up the Mark II to be incredibly smooth. It ships with “Dinkum” software which, according to the Mycroft website, is a stable operating system for the Mark II. However, Dinkum diverges significantly from the previous Mycroft Core software – meaning that Skills developed for earlier iterations of Mycroft software – for Picroft, the Mark I, or Mycroft on Linux – won’t work on the Mark II. I found Dinkum to be very stable – it “just worked” out of the box.

Connecting to WiFi

The Mark II uses WiFi connection technology from Pantacor. Once the Mark II had finished booting, I was advised to connect to an SSID that the Mark II was advertising, using my phone or laptop, and from there enter my WiFi credentials. I wasn’t surprised that only my 2.4GHz SSID was detected – I have a dual band router that also advertises a 5GHz SSID – and I wasn’t surprised that enterprise WiFi – WPA2-Enterprise (802.1X) – wasn’t supported. This appears to be a common issue with IoT devices; they will readily connect to consumer-grade networks, but cannot handle the additional complexity – such as MSCHAPv2 authentication – required by enterprise networks. I suspect this presents a barrier to enterprise IoT adoption – and certainly to voice assistant adoption.

Pairing the device with the backend platform

When I worked at Mycroft.AI, pairing the device with the backend platform – https://home.mycroft.ai – was one of the most frequently problematic parts of the experience. This time, it was seamless. The Mark II displayed a pairing code on screen, along with a URL to visit, and it paired almost immediately. I was then able to set the device’s physical location and timezone.

Testing out Skills

One of the frequent criticisms of the Mark II running Dinkum software is its lack of Skills; this criticism is well founded. I can ask basic questions – which hit the Wikipedia backend – or ask for time and weather information – but anything more is unavailable. I was particularly disappointed that there wasn’t any music integration with the Mark II – which boasts dual 1.5″ 5W drivers – but I can’t hold this against Mycroft.

In late 2020, Spotify revoked the API access that Mycroft had been using to give Spotify Premium subscribers (Spotify’s API only works for Premium accounts) access to Spotify via Mycroft. Local music is an option, but because I use music streaming – well, apart from my now-nearly-30-year-old CD collection – this wasn’t of much use.

Hardware features

The Mark II sports some impressive hardware. There is an inbuilt camera (not that any Skills make use of it), with a physical privacy shutter.

The microphone is a 2-mic array with noise cancellation, and there is a physical mute switch – very important for those who value privacy – although I don’t know who manufactures the mic. Seeed Studio hardware is generally considered best-of-breed for embedded Linux devices, but I don’t think this is a Seeed microphone.

The hardware mute integrates very well with the design of the touch screen: when mute is on, the NeoPixel indicator takes on a red hue, and the border of the touch screen also renders in red.

The screen is a 4.3″ IPS touchscreen, and it boasts bright, bold colours. I’m guessing it has a resolution of 800px by 400px, but don’t hold me to that. The board itself is based on the Raspberry Pi 4, and the Pi’s GPIO pins 1, 12, 13, GND and 3.3V power are exposed so that you can integrate other hardware with the device; that is, it is extensible if you’re into open hardware. There’s also a Gigabit Ethernet RJ45 socket – so you are not reliant on WiFi – which was quite useful. The case itself is very sturdy, and needs a Torx screwdriver to open.
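
As a trivial illustration of that extensibility, here’s a sketch using the gpiozero library. It assumes the exposed numbers are the Pi’s physical header pins, so that pin 12 maps to BCM GPIO18 and pin 13 to BCM GPIO27 (as on a stock Raspberry Pi 4), and that nothing else on the Mark II is already driving those pins; check before wiring anything up.

# Blink an LED from a button on the exposed GPIO header, using gpiozero.
# Pin mapping below is an assumption: physical pin 12 = BCM GPIO18,
# physical pin 13 = BCM GPIO27, as on a standard Raspberry Pi 4 header.
from gpiozero import LED, Button
from signal import pause

led = LED(18)        # physical pin 12 (assumed)
button = Button(27)  # physical pin 13 (assumed)

button.when_pressed = led.on
button.when_released = led.off

pause()  # keep the script alive so the button callbacks keep firing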

SSH’ing into the device

SSH on the default port 22 is blocked, and you need to use a custom port to SSH into the device – this is well documented. For security, the Mark II uses SSH key pairs rather than passwords: you generate an SSH key pair and enter the public key on the home.mycroft.ai platform. The public key is then sent down to the device, and you SSH in on the custom port using your key. In my opinion this is far more secure than depending on passwords; however, it introduces a closer dependency on the home.mycroft.ai platform – and as we’ll see later, this closely ties the hardware to a supporting platform.

What’s under the hood?

Once SSH’d into the device, I was able to take a closer look at the mycroft-dinkum software in operation.

Voice activity detection / keyword spotter

One of the software layers in a voice assistant stack is a wake word listener, or keyword spotter. This software constantly listens to spoken phrases – utterances – and classifies whether each utterance contains the wake word. If it detects the wake word, the voice assistant starts listening and assumes the next utterance spoken is a command.

Previous versions of Mycroft used the Precise wake word listener, and I found it regularly had a high number of false positives – triggering when I didn’t say the “Hey Mycroft” wake word – and false negatives – not responding when I did utter the wake word. The Mark II was incredibly accurate in detecting my wake word utterances – especially considering I am female, and have an Australian accent – groups for which Precise performed poorly.

The mycroft-dinkum software is split into several services. The voice service within mycroft-dinkum provides microphone input, voice activity detection and speech recognition. Unsurprisingly, because this service is constantly listening, it consumes a lot of the device’s CPU, which we can see if we run htop while ssh’d in:

htop output from the Mycroft II while ssh’d in

I was curious about what models were being used for wake word detection, and by looking into the GitHub code, I was able to determine that the Mark II uses the Silero VAD engine. Silero VAD ships as a pre-trained model. However, there is no information available about the data it was trained on – the GitHub page doesn’t say – or about the type of algorithm – such as a recurrent neural network – used to train it.
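
If you want to poke at the model yourself, here is a minimal sketch of driving the standalone Silero VAD model from Python, using the upstream project’s documented torch.hub entry point. This is for illustration only; it isn’t the exact code path mycroft-dinkum uses.

# Requires torch and torchaudio. Downloads the pre-trained Silero VAD model
# from the upstream snakers4/silero-vad repository via torch.hub.
import torch

model, utils = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

# "sample.wav" is a placeholder: any mono 16 kHz recording will do.
wav = read_audio("sample.wav", sampling_rate=16000)
speech = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech)  # list of {'start': ..., 'end': ...} sample offsets containing speech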

I have a hunch – and it’s just a hunch – that Silero uses some of the Mycroft Precise wake word data that was captured via opt-in mechanisms. I was the main female contributor to that dataset – and if Silero was trained on Precise data, it would explain why it’s so accurate for my voice. But because the provenance of the training data isn’t clear, I can’t tell for sure.

Speech recognition

Finding out what is used for speech recognition was slightly more challenging. In previous versions of Mycroft software, the speech recognition layer was configurable through the mycroft.conf file, allowing the use of on-device speech recognition or a preferred cloud provider. That configuration is stored in a different location under mycroft-dinkum, but I was able to find it. The STT module was set to mycroft – I’m not sure exactly what this means, but I think it means the Mark II is using Google’s cloud STT service, anonymised by being proxied through the home.mycroft.ai platform. Again, there’s a cloud – and network – dependency here.

Intent parsing and Skills

Once the speech recognition layer has produced a transcription of the utterance, the voice assistant needs to know how to tie that utterance to a command and pass it to a Skill for handling. In the Mark II, from what I could tell in the source code, that layer is provided by Padatious, a neural-network-based intent parser.

The range of Skills on the Mark II is very limited – you can ask for the weather, find out the time in different timezones (one of my most-used Skills), play from a pre-defined list of internet radio stations, and query Wikipedia. Limiting the range of Skills makes intent parsing easier, because there are fewer Skills to choose between. However, passing the right query to a Skill can still be problematic if your speech recognition isn’t accurate. Intent parsing for named entities – people, places, products, especially Indigenous-language-derived ones – worked reasonably well. Weather queries for “Yarrawonga”, “Geelong” and “Canberra” were all recognised correctly; “Wagga Wagga” wasn’t. Wikipedia queries were also parsed accurately – “tell me about pierogi” (Polish derivation) and “tell me about Maryam Mirzakhani” (Persian derivation) were both correctly identified.
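
To make the intent parsing step concrete, here’s a small sketch using the upstream Padatious library’s documented API. The example intents are my own stand-ins, not the actual intents registered by the dinkum Skills.

# pip install padatious (it pulls in the FANN neural network library).
from padatious import IntentContainer

container = IntentContainer("intent_cache")
container.add_intent("weather", ["what is the weather in {city}",
                                 "what's the weather like in {city}"])
container.add_intent("wiki", ["tell me about {subject}"])
container.train()

match = container.calc_intent("tell me about maryam mirzakhani")
# e.g. name='wiki', a confidence score, and matches={'subject': 'maryam mirzakhani'}
print(match.name, match.conf, match.matches)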

Text to speech

Text to speech, or speech synthesis, is the complement to speech recognition: it takes written phrases and outputs speech as audio. For this layer, the Mark II uses Mimic 3, a TTS engine based on the VITS algorithm. The GitHub repo for Mimic 3 doesn’t contain the original data the Mimic 3 voices were trained on, but in this blog post, developer Mike Hansen provides a technical walkthrough of how the models were trained, including the challenges of training new languages, and the phonetic and normalisation work that entails.

By default, the Mark II uses the UK English “Pope” Mimic 3 voice, which is a male-gendered voice. I was pleasantly surprised by this, given the industry default of gendering voice assistants as female, which is covered at length in the book The Smart Wife by Yolande Strengers and Jenny Kennedy [10]. There were no Australian accents available.

I have a lot of my own (female, Australian) voice data recorded, but I didn’t want to contribute it to the Mimic 3 project. There are growing concerns around how voice and video data of people is being used to create realistic avatars – digital characters that sound, and look, like real humans – and until we have better regulation, I don’t particularly want a “Kathy” voice out in the wild – because I have no control over how it would be used. For example, I know that sex doll companies have approached voice startups wanting access to voices to put into their products. As we say in ‘Strayan: yeah, nah.

Mimic 3 works offline, and I was surprised at how fast it generated speech – there was a slight delay, but it was barely noticeable. Some of the pronunciations in the Pope voice were definitely a little off – a key issue in TTS is getting the generated voice to have the correct emphasis and prosody – but I was still pretty impressed. When I’m done with my PhD work, I’d love to train a Mimic 3 voice on my own voice data to understand the process a bit better.

Screen

The Mark II uses the Plymouth splash screen manager for the touch screen visuals, and looking at the htop output, this one component uses over 16% of the available CPU. I found the visuals very well designed and complementary to the audio information presented by the TTS layer.

The screen adds additional complexity for Skill designers, however: not only do they have to design a pleasant voice interaction for a Skill, but for multi-modal devices like the Mark II, they must also ensure a complementary visual experience.

Overall impressions

The Mark II packs impressive hardware and mature software into a solid form factor. The visual design is well integrated with the conversational design, and it’s intended to be extensible – a boon for those who like to experiment and tinker. The software layers used are generally mature, and predominantly run on-device, with the exception (I think) of the speech recognition layer. The Skills all have a dependency on external services – Wikipedia, weather and so on. The integration with the home.mycroft.ai backend serves to protect my privacy – as long as I trust that platform – and there was no evidence in any of the digging I did that the device is “leaking” data to third-party services.

These positive factors are tarnished by the high price point (listed at $USD 499, although the Kickstarter price was a lot lower at $USD 199) and the lack of Skills that integrate with other services – like Spotify. This is a device that is capable of so much more – it’s a diamond in the rough. But what would give it more polish?

And that takes me to part 2 of this post – ecosystems, infrastructures and the friction of privacy.

A critical lens on an inflection point in voice assistant technology: we are in the trough of disillusionment

It’s a difficult time for the voice assistant industry.

Amazon has recently announced huge layoffs in its Alexa voice assistant department (see write-ups by Ars Technica, Business Insider, and Vox), as it struggles to find a path to monetisation for the loss-leading hardware devices it’s been shipping. Google has followed this trend. This also comes off the back of an architectural change at Google, which dropped support for third-party applications on its Assistant voice apps and devices – a change that more tightly integrates its hardware with the Android ecosystem. Apple – the other big player in the voice assistant space, with Siri – has begun quietly laying off third-party contractors, according to reports. Baidu – the largest voice player outside Western Europe and America, with its Xiaodu assistant, which had a $5b valuation just two short years ago – has also shed 10-15% of its staff. The outlier here is the open source Home Assistant and its commercial sister operation, Nabu Casa, who have announced the impending launch of the Home Assistant Yellow, a voice-assistant-with-screen competitor to the Google Nest Hub and the Amazon Echo Show.

Previous predictions that there would be 8 billion voice assistant devices in use by this year now appear idealistic indeed.

Voice assistants and theories of innovation

It’s fair to say that voice assistants have reached the “trough of disillusionment”. This term belongs to Gartner and their Hype Cycle, a frame they use to plot the adoption and maturation of emerging technologies. The “trough of disillusionment” refers to a period in a technology’s history where:

Interest wanes as experiments and implementations fail to deliver. Producers of the technology shake out or fail. Investments continue only if the surviving providers improve their products to the satisfaction of early adopters.

Gartner Hype Cycle

The Gartner Hype Cycle is a theory of innovation; different technologies move through the Hype Cycle at varying rates, influenced by a range of factors. Other theories of innovation use divergent analytical lenses, or ascribe primacy to differing drivers of, and constraints on, technology. The diffusion of innovation theory gave us the terms “early adopters” and “laggards”. Induced innovation [11], for example, places emphasis on economic factors such as market demand. Evolutionary theory focuses on the search for better tools and technologies, with the market selecting – and ensuring the survival of – the best. Path dependence models valorise the role of seemingly insignificant historical changes that compound to shape a technology’s dominance or decline. The multi-level perspective blends the micro and macro levels, showing how they interact to influence technological development. Disruptive innovation theory [12] takes a contingent approach: different innovations require different strategies to challenge and unseat established incumbents. Apple unseated Nokia with touch screens. Linux dominated the data centre due to higher reliability and performance. Netflix swallowed Blockbuster by leveraging increasing internet speeds for content delivery. Disruption harnesses interacting social, economic and political developments.

I digress. What all of these views of innovation have in common, regardless of their focus, is the inter-dependency of factors that influence a technology’s trajectory – its adoption, its success, its return on investment.

So what are the inter-dependencies in voice assistant technology that lead us to our current inflection point?

Platform and service inter-dependencies

Voice assistants are interfaces. They enable interaction with other systems. We speak to our phones to call someone. We speak to our televisions to select a movie. We speak to our car console to get directions. A voice assistant like Mycroft, or Alexa, or Google Home or Siri is a multi-purpose interface. It is reliant on connecting to other systems – systems of content, systems of knowledge, and systems of data. Wikipedia for knowledge articles, a weather API for temperature information, or a music content provider for music. These are all platform inter-dependencies.

Here, large providers have an advantage because they can vertically integrate their voice assistant. Google has Google Music, Amazon has Amazon Music, Apple has Apple Music. Mycroft has no music service, and this has no doubt played into Spotify’s decision to block Mycroft from interfacing with it. Content providers know that their content is valuable; this is why Paramount and HBO are launching their own platforms rather than selling content to Netflix.

Voice assistants need other platforms to deliver value to the end user. Apple knew this when it acquired Dark Sky and locked Android users out of the platform, although services like OpenWeatherMap are filling the gap. We’ve seen similar content wars play out in the maps space; Google Maps is currently dominant, but Microsoft’s Bing is leveraging OpenStreetMap to counter this with Map Builder – raising the ire of open source purists in the process.

Voice assistants need content, and services, to deliver end user value.

Discovery – or identifying what Skills a voice assistant can respond to

Voice assistants are interfaces. We use myriad interfaces in our everyday lives; a steering wheel, a microwave timer, an espresso coffee maker, an oven dial, a phone app, natural language, a petrol bowser, an EV charger. I doubt you RTFM’d for them all. And here’s why: interfaces are designed to be intuitive. We intuitively know how a steering wheel works (even if, for example, you’ve never driven a manual). If you’ve used one microwave, you can probably figure out another one, even if the display is in a foreign language. Compare this with the cockpit of the Concorde – a cacophony of knobs, buttons and dials.

If you’ve used Alexa or Siri, then you could probably set a timer, or ask about something from Wikipedia. But what else can the assistant do? This is the discovery problem. We don’t know what interfaces can do, because we often don’t use all the functions an interface provides. When was the last time you dimmed the dashboard lights in your car? Or when did you last set a defrost timer on your microwave?

The same goes for voice assistants; we don’t know what they can do, and this means we don’t maximise their utility. But what if a voice assistant gave you hints about what it could do? As Katherine Latham reports in this article for the BBC, it turns out that people find that incredibly annoying; we don’t want voice assistants to interrupt us, or suggest things to us. We find that intrusive.

How, then, do we become more acquainted with a voice assistant’s capabilities?

For more information on skill discovery in voice assistants, you might find this paper interesting.

White, R. W. (2018). Skill discovery in virtual assistants. Communications of the ACM, 61(11), 106-113.

Privacy and the surveillance economy – or – who benefits from the data a voice assistant collects?

There is a clear trade-off in voice assistants between functionality and privacy. Amazon Alexa can help you order groceries – from Amazon. Google Home can read you your calendar and email – from Gmail. Frictionless, seamless access to information, and agents which complete tasks for you, require sharing data with the platforms that provide those functions. This is a two-fold dependency: first, the platforms that provide this functionality must be available – my vertical dependency argument from above. Second, you must provide personal information – preferences, a login, something trackable – to these platforms in order to receive their benefits.

I’m grossly oversimplifying his arguments here, but the well-known philosopher Luciano Floridi has argued that “privacy is friction” [13]. The way that information is organised contributes to, or impinges upon, our privacy. Voice assistants that track our requests, record our utterances, and then use this information to suggestively sell more products to us reduce friction by reducing privacy. Mireille Hildebrandt, in her book Smart Technologies and the End(s) of Law, goes one step further: voice assistants undermine our personal agency through their anticipatory nature [14]. By predicting, or assuming, our desires and needs, our ability to be reflective in our choices, to be mindful of our activities, is eroded. Academic Shoshana Zuboff takes a broader view of these developments, theorising that we live in an age of surveillance capitalism [1], where the data produced through technologies which surveil us – our web browsing history, CCTV camera feeds and, yes, utterances issued to a voice assistant – becomes a form of capital, traded in order to market to us ever more narrowly. Jathan Sadowski has argued, similarly, for the concept of datafication [2]: when our interactions with the world become a form of capital – for trading, for investing, and for extraction.

Many of us have become accustomed to speaking with voice interfaces, and to glossing over how those utterances are stored, linked or mined downstream in an ecosystem. Professor Joe Turow argues this case eloquently in his book The Voice Catchers [15]: by selling voice assistants at less than cost, their presence, and our interactions with them, have been normalised and backgrounded. We don’t think anything of sharing our data with the corporate platforms on which they rest. Giving personal data to a voice assistant is something we take for granted. By design. We trade away the friction of privacy for the utility of frictionless access to services.

And this points to a key challenge for open source and private voice assistants like Mycroft and Home Assistant: in order to deliver services and content through those voice assistants, we have to give up some privacy. Mycroft handles this by abstraction; for example, speech recognition done through Google’s cloud service is channelled through home.mycroft.ai, and done under a single identifier, so that individual Mycroft users’ privacy is protected.

How do voice assistants overcome the tradeoff between utility and privacy?

Hardware is hard

One of my favourite books is The Hardware Hacker [16] by Bunnie Huang, developer of the almost-unheard-of open hardware device, the Chumby. It is a chronicle of the myriad challenges involved in designing, manufacturing, certifying and bringing to market a new consumer device. For every mention of “Chumby” in the book, you could substitute “Mycroft Mark II”. Design tradeoffs, the capital required to fund manufacturing, quality control issues, a fickle consumer market – all are present in the tale of the Mark II.

Hardware is hard.

Hardware has to be designed, tested, integrated with software through drivers and embedded libraries, and certified compliant with regulations. And above all, it has to yield a profit for the manufacturer to keep making it. If we think about the escalating costs of the Mark II – the Chumby of this tale – and then look at how cheaply competitor devices like the Amazon Echo and Google Home are sold, it becomes clear that hardware is a loss leader for ecosystem integration. I have no way to prove this, but I strongly suspect that the true cost of an Echo or Google Home device is four or five times what a consumer pays for it.

Voice assistants are interfaces.

And by putting a voice assistant in a consumer’s home, the manufacturer gains an interface into a broader ecosystem – more closely imbricating and embroiling the customer in the manufacturer’s offerings. And if you can’t transform that loss leader into a recurrent revenue stream – through additional purchases, paid voice search, or voice advertising revenue – then the hardware becomes a sunk cost. And you start laying off staff.

An open source voice assistant’s strategy is different – its selling points stand in opposition to a commercial assistant’s: privacy, interoperability, extensibility, and a lack of lock-in to an ecosystem. The Mark II is all of those things – private, interoperable and extensible. But it still hasn’t achieved product-market fit, particularly at its high, true-cost price point.

How do voice assistants reconcile the cost of hardware and the ability to achieve product market fit?

Overcoming the trough of disillusionment

So how might voice assistants overcome the trough of disillusionment?

Higher utility through additional content, data sources and APIs

Voice assistant utility is a function of how many Skills the device has, how useful those Skills are to the end user, and how frequently they are used. Think about your phone. What apps do you use the most? Text messaging? Facebook? TikTok? What app could you not delete from your phone? Skills require two things: a data or content source, and a developer community to build them. Open source enthusiasts may build a Skill out of curiosity, or to scratch their own itch, but commercial entities will only invest in Skill building if it generates revenue for a product or service. And then how does the voice assistant manufacturer share in that revenue? We need to find ways to incentivise Skill development (and maintenance), as well as revenue sharing models that help support the underlying infrastructure, the Skill developer, and the service or data source the Skill interfaces with. Spotify understands this – and will reserve access to its highly-valued content for business arrangements that help it generate additional revenue.

I also see governments having a role to play here – imagine for example accessing government services through your voice assistant – no sitting in queues on a phone. The French Minitel service was originally a way to help citizens access services like telephone directories, and postal information. But governments want to both streamline the development they do in-house, and control access to API information; will there be a level of comfort in opening access – and if so, who bears the cost of development?

Distinguishing voice assistants in the home from the voice assistant in your pocket (your mobile phone)

Most of us already have a voice assistant in our pocket – if we have an Android phone or an iPhone. So what niche does a dedicated physical voice assistant serve? One differentiator I see here is privacy; a voice assistant on a mobile phone ties you to the ecosystem of that device – to Apple, or Google, or another manufacturer. Another differentiator is the context in which the voice assistant operates; few of us would want to use a voice assistant in a public or semi-public context, such as on a train or a bus or in a crowd. But a home office is semi-private, and many of us are now working from home. Is there an opportunity for a home office assistant that isn’t tied to a mobile phone ecosystem? Following that trajectory through, though: if we’re working from home, then we’re working. How will employers seek to leverage voice assistants, and does this sit in opposition to privacy? We are already seeing a backlash against the rise of workplace surveillance (itself a form of surveillance capitalism), so I think there will be barriers to employment or work-based technology being deployed on voice assistants.

Towards privacy, user agency and user choice

I’m someone who places a premium on privacy: I pay for encrypted communications technology and pay for encrypted email that isn’t harvested for advertising. I pay for services like relay that hide my email address from those who would seek to harvest it for surveillance capitalism.

But not everyone does; because privacy is friction, and we have normalised the absence of friction, at the price of privacy. As we start to see models like ChatGPT and Whisper that hoover up all the public data on the internet – YouTube videos, public photographs on platforms like Flickr – I think we will start to see more public awareness of how our data is being used – and not always in our own best interests. In voice assistants, this means more safeguarding of user data, and more protection against harnessing that data for profit.

Voice assistants also have a role to play in user agency and user choice. This means giving people choice about where intents – the commands used to activate Skills – lead. For example, if a commercial voice assistant “sells” the intent “buy washing powder” to the highest “washing powder” bidder, then this restricts user agency and user choice. So we need to think about ways that put control back in the user’s hands – or, in this case, voice. But this of course constrains the revenue generation capabilities of voice assistants.

For voice assistants to escape the trough of disillusionment, they will need to prioritise privacy, agency, utility and choice – and still find a path to revenue.

Footnotes

  • 1
    Zuboff, S. (2019). The age of surveillance capitalism: The fight for a human future at the new frontier of power. Profile Books.
  • 2
    Sadowski, J. (2019). When data is capital: Datafication, accumulation, and extraction. Big Data & Society, 6(1), 2053951718820549.
  • 10
    Strengers, Y., & Kennedy, J. (2021). The smart wife: Why Siri, Alexa, and other smart home devices need a feminist reboot. MIT Press.
  • 11
    Ruttan, V. W. (1997). Induced innovation, evolutionary theory and path dependence: Sources of technical change. The Economic Journal, 107(444), 1520-1529.
  • 12
    Si, S., & Chen, H. (2020). A literature review of disruptive innovation: What it is, how it works and where it goes. Journal of Engineering and Technology Management, 56, 101568.
  • 13
    Floridi, L. (2005). The ontological interpretation of informational privacy. Ethics and Information Technology, 7, 185-200.
  • 14
    Hildebrandt, M. (2015). Smart technologies and the end(s) of law: Novel entanglements of law and technology. Edward Elgar Publishing.
  • 15
    Turow, J. (2021). The voice catchers: How marketers listen in to exploit your feelings, your privacy, and your wallet. Yale University Press.
  • 16
    Huang, A. B. (2019). The hardware hacker: Adventures in making and breaking hardware. No Starch Press.

State of my toolchain 2022

Welcome to my now-nearly-yearly State of my Toolchain report (you can see previous editions for 2021, 2019, 2018 and 2016). I began these posts as a way to document the tools, applications and hardware that were useful to me in the work that I did, but also to help observe how they shifted over time – as technology evolved, my tasks changed, and as the underpinning assumptions of usage shifted. In this year’s post, I’m still going to cover my toolchain at a glance, report on what’s changed and what gaps remain in my workflow – and, importantly, reflect on the shifts that have occurred over five years.

At a glance

Hardware, wearables and accessories

Software

  • Atom with a range of plugins for writing code, thesis notes (no change since last report)
  • Pandoc for document generation from MarkDown (no change since last report)
  • Zotero for referencing (using Better BibTeX extension) (no change since last report)
  • OneNote for Linux by @patrikx3 (no change since last report)
  • Nightly edition of Firefox (no change since last report)
  • Zoom (no change since last report)
  • Microsoft Teams for Linux (no change since last report)
  • Gogh for Linux terminal preferences (no change since last report)
  • Super productivity (instead of Task Warrior) (changed since last report)
  • Cuckoo Timer for Pomodoro sessions (changed since last report)
  • RescueTime for time tracking (no change since last report)
  • BeeMindr for commitment based goals (no change since last report)
  • Mycroft as my Linux-based voice assistant (no change since last report)
  • Okular as my preferred PDF reader (instead of Evince on Linux and Adobe Acrobat on Windows) (changed since last report)
  • NocoDB for visual database work (changed this report)
  • ObservableHQ for data visualisation (changed this report)

Techniques

  • Pomodoro (no change since last report)
  • Passion Planner for planning (no change since last report)
  • Time blocking (used on and off, but a lot more recently)

What’s changed since the last report?

There’s very little that’s changed since my last State of My Toolchain report in 2021: I’m still doing a PhD at the Australian National University’s School of Cybernetics, and the majority of my work is researching, writing, interviewing, and working with data.

Tools for PhD work

My key tool is MaxQDA for qualitative data analysis – Windows only, unfortunately, and prone to being buggy with OneDrive – while my writing workflow is built around Atom. One particularly useful tool I’ve adopted in the last year has been NocoDB, an open source alternative to visual database interfaces like Notion and AirTable; I found it very useful, even if the front end is a little clunky. Working across Windows and Linux, I’ve settled on Okular as my preferred PDF reader and annotator – I read on average 300-400 pages of PDF content a week, and Adobe Acrobat was buggy as hell. Okular has fine-grained annotation tools, and the interface is the same across Windows and Linux. Another tool I’ve started to use a lot this year is ObservableHQ – it’s like a Jupyter notebook, but for d3.js data visualisations. Unfortunately, they’ve recently changed their pricing structure, and private notebooks will now cost me $USD 15 a month – and I don’t think that price point is worth it.

Hardware and wearables

The first key change this year is a phone upgrade – my Pixel 3 screen died, and the cost to replace the screen was exorbitant – a classic example of planned obsolescence. I’ve been happy with Google’s phones – as long as I disable all the spyware voice-enabled features – and settled on the Pixel 4a 5G. It’s been a great choice: clear, crisp photos, a snappy processor, and excellent battery life.

After nearly four years, my Mobvoi Ticwatch Pro started suffering the “ghost touch” problem, where the touch interface started picking up non-existent taps. A factory restart didn’t solve the problem, so I got the next model up – the Ticwatch Pro 2020 – at 50% off. This wearable has been one of my favourite pieces of hardware – fast, responsive, durable – and I can’t imagine not having a smartwatch now. I’ve settled on the Flower watch face after using Pujie Black for a long time – both heavily customisable. The love Google is giving to Wear OS is telling – I have much smoother integration between phone apps and Wear OS apps than even 1-2 years ago.

After having two Plantronics Backbeat Pro headphones – one from around 2017 and the other circa 2021, both still going, but the first with very poor battery life and battered earpads – I invested in my first pair of reasonable headphones, the Sennheiser Momentum Pro 3. The sound quality is incredible – I got them for $AUD 300, which I thought was a lot to pay for headphones, but they’ve been worth every penny, particularly when listening to speech recognition data.

With so much PhD research and typing, I found my Logitech MK240 just wasn’t what I needed – it’s a great little unit if you don’t have anything else, but it was time for a mechanical keyboard, because I love expensive hobbies. After some research, and a mis-step with the far-too-small HuoJi Z-88 (the keypresses for Linux command line tasks were horrendous), I settled on the Keychron K8 and haven’t looked back. Solid, sturdy, blue Gateron switches – it’s a dream to type on, and works well across Windows and Linux. However, on Linux it defaults to a Mac keyboard layout, and I had to do some tweaking with a keymapper – I used keyd. My only disappointment with Keychron is the hackiness needed to get it working properly on Linux.

Productivity

My Passion Planner is still going strong, but I haven’t been as diligent about using it as a second brain as I have been in the past, and the price changes this year meant that shipping one to Australia cost me nearly $AUD 120 in total – unaffordable in the longer term – so I’m actively looking at alternatives such as Bullet Journalling. The Passion Planner is great – it’s just expensive.

I’ve also dropped Task Warrior in favour of Super Productivity this year. Task Warrior isn’t cross-platform – I can’t use it on Android or Windows, and thanks to MaxQDA, I’m spending a lot more of my time in Windows. The Gothenburg Bit Factory are actively developing Task Warrior – full transparency, I’m a GitHub sponsor of theirs – but the cloud-based and cross-platform features seem to be taking a while to come to fruition.

I’m also using time-blocking a lot more, and am regularly using Cuckoo as a pomodoro timer with a PhD cohort colleague, T. We have an idea for a web app that optimises the timing of Pomodoros based on a feedback loop – but more on that next year.

Current gaps in my toolchain

Visual Git editor

In my last State of My Toolchain report, I lamented the lack of a good visual Git editor. That’s been solved on Windows with GitHub’s desktop application, but as of writing the Linux variant appears to be permanently mothballed. I’m sure this has nothing to do with Microsoft buying GitHub. So I am still on the lookout for a good Linux desktop Git GUI. On the other hand, doing everything via the CLI is always good practice.

Second Brain

In my last report I also mentioned having taken Huginn for a spin, and being let down by its immaturity. It doesn’t seem to have come very far. After reading Tiago Forte’s book on the topic, I’ve decided to reframe this gap as a “Second Brain” gap – more than the knowledge management space that tools like Roam and Obsidian occupy, and much more an organise-your-life tool. The Microsoft suite – Office, Teams, and their stablemates – is trying to fill this niche, but I want something that isn’t dependent on an enterprise login.

The Fediverse

Triggered by Elon Musk’s purchase, and subsequent transformation of Twitter into a flaming dumpster fire, I’ve become re-acquainted with the Fediverse – you can find me on Mastodon here, on Pixelfed.au here, and on Bookwyrm here. However, the tooling infrastructure around the Fediverse isn’t as mature – understandably – as that of commercial platforms. I’m using Tusky as my Android app, and the advanced web interface. But there is a lack of hosting options for the Fediverse – I can’t find a pre-configured DigitalOcean Droplet for Mastodon, for example – and I think the next year will see some development in this space. If you’re not across Mastodon, I wrote a piece that uses cybernetic principles to compare and contrast it with Twitter.

5 years of toolchain trends

After five years of the State of My Toolchain report, I want to share some reflections on the longer-term trends that have been influential in my choice of tools.

Cross-platform availability and dropping support for Linux

I work across three main operating systems – Linux, Windows (because I have to, for certain applications) and Android. The tools I use need to work seamlessly across all three. There’s been a distinct trend over the last five years for applications to start providing Linux support, then move to a “community” model or drop support altogether. Two cases in point are Pomodone – which I dropped because of its lack of Linux support – and RescueTime – which still works on Linux for me, albeit with some quirks (such as not restarting properly when the machine wakes from suspend). This is counter-intuitive given the increasing usage of Linux on the desktop. The aspiration of many Linux aficionados that the current year will be “The Year of Linux on the Desktop” is not close to fruition, but the statistics show a continued, steady – if small – rise in the number of Linux desktop users. Dropping support is understandable, though – startups and small SaaS providers cannot justify supporting such a small user base. That said, they shouldn’t claim to support the operating system and then drop support – as both Pomodone and RescueTime have done.

Takeaway: products I use need to work cross-platform, anywhere, anytime – and especially on Linux.

Please don’t make me change my infrastructure to work with your product

A key reason for choosing the Ticwatch Pro 2020 over other Mobvoi offerings was that the watch’s charger is the same across hardware models. I’d bought a couple of extra chargers to have handy, and didn’t want to have to buy more “spares”. This mirrors a broader issue with hardware – it has a secondary ecosystem. I don’t just need a mobile phone; I need a charger, a case, and glass screen protectors – a bunch of accessories. These are all different – they exhibit variety – a deliberate reduction in re-usability and a buffer against commodification. So in choosing hardware, one of my selection criteria is now re-usability or upgradeability – how easily I can re-use the hardware’s supporting infrastructure. The recent decision by Europe to standardise on USB-C is the right one.

Takeaway: don’t make me buy a second infrastructure to use your product.

I’m happy to pay for your product, but it has to represent value for money, or it’s gone

Several of my tools are open source – Super Productivity, NocoDB, Atom, Pandoc – and where I can, I sponsor them on GitHub or provide a monetary contribution. On the whole, these pieces of software are often worth a lot more to me than the paid proprietary software I use – for example, MaxQDA is over $AUD 300 a year, predominantly because it only has one main competitor, NVIVO. I have no issue paying for software, but it has to represent value for money. If I can get the same value – or nearly equivalent – from an open source product, then I’m choosing open source. Taguette wasn’t there yet as a replacement for MaxQDA, but Super Productivity has equivalent functionality to Pomodone. Open source products keep proprietary products competitive – and this is a great reason to invest in open source where you are able.

That’s it! Are there any products or platforms you’ve found particularly helpful? Let me know in the comments.

Building a database to handle PhD interview tracking using MySQL and noco-db

So, as folx probably know, I’m currently doing a PhD at the Australian National University’s School of Cybernetics, investigating voice data practices and what we might be able to do to change them – towards less biased voice data, and less biased voice technology products. If you’d like to see some of the things I’ve been working on, you can check out my portfolio. Two of my research methods are interview-based: the first tranche being shorter exploratory interviews, and the second being in-depth interviews with machine learning practitioners.

Because there are many stages to interviews – identifying participants, approaching them for interviews, obtaining consent, scheduling, transcription and coding – I needed a way to manage the pipeline. My PhD cohort colleagues use a combination of AirTable and Notion, but I wanted an open source alternative (surprise!).

Identifying alternatives and choosing one to use

I did a scan of what alternatives were available simply by searching for “open source alternative to AirTable”. Some of the options I considered but discarded were:

  • BaseRow: While this is open source, built on widely adopted frameworks such as Django and Vue.js, and available as Docker and Heroku deploys, the commercial framing behind the product is very much open core. That is, a lot of features are only available in the paid / premium version. I’ve worked with open core offerings before, and I’ve found that the most useful features are usually those behind the paywall.
  • AppFlowy: While this looked really impressive, and the community behind it looked strong, the use of Flutter and Rust put me off – I’m not as familiar with either of them compared to Vue.js or Django. I also found the documentation really confusing – for example, to install the Linux version it said to “use the official package”, but it didn’t give the name of the official package. Not helpful. On this basis I ruled out AppFlowy.
  • DBeaver: This tool is aimed more at people who work with multiple databases; it provides a GUI over the top of the database, but it is not designed to be a competitor to Notion or AirTable. I wanted something more graphically focused, with multiple layout styles (grid, card and so on).

This left me with NocoDB. I kicked the tyres a bit by looking at the GitHub code, and read through the documentation to get a feel for whether it was well constructed; it was. Importantly, I was able to install it on my localhost; my ethics protocol for this research method prevented it from being hosted on a cloud platform.

Installation

Installation was a breeze. I set up a database in MySQL (also running locally), then git clone’d the repo, and used npm to install the software:

git clone https://github.com/nocodb/nocodb-seed
cd nocodb-seed
npm install
npm start

nocodb runs its own Node.js HTTP server, and starts the application by default on port 8080, so to start using it you simply go to http://localhost:8080/. One slightly frustrating thing is that it does require an email address and password to log in. nocodb is a commercial company – they’ve recently raised funding and are hiring – and I suspect the login is part of their telemetry, even for self-hosted installs. I run Pi-hole as my DNS server, however, and I don’t see any telemetry from nocodb in my block list.

Next, you need to provide nocodb with the details of the MySQL database you created earlier. nocodb creates some additional tables of its own, along with some base views; from that point you are free to start creating your own.

Deciding what fields I needed to capture to be able to visualise my interview pipeline

Identifying what fields I needed to track was a case of trial and error. As I added new fields, or modified the datatypes of existing ones, nocodb was able to be easily re-synced with the underlying database schema. This makes nocodb ideal for prototyping database structures.

nocodb showing tables out of sync
nocodb now in sync with the underlying tables

In the end, I settled on the following tables and fields:

Interviewees table

  • INTERVIEWEE_ID – a unique, auto-incrementing ID for each participant
  • REAL_NAME – the real name of my participant (and one of the reasons this is running locally and not in the cloud)
  • CODE_NAME – a code name I ascribed to each participant, as part of my Ethics Protocol
  • ROLE_ID – foreign key identifier for the ROLES table.
  • EMAIL_ADDRESS – what it says on the tin.
  • LINKEDIN_URL – I used LinkedIn to contact several participants, and this was a way of keeping track of that information.
  • HOMEPAGE_URL – the participant’s home page, if they had one. This was useful for identifying the participant’s background – part of the purposive sampling technique.
  • COUNTRY_ID – foreign key identifier for the COUNTRIES table – again used for purposive sampling.
  • HOW_IDENTIFIED – to identify whether people had been snowball sampled
  • HAS_BEEN_CONTACTED – Boolean to flag whether the participant had been contacted
  • HAS_AGREED_TO_INTERVIEW – Boolean to flag whether the participant had agreed to be interviewed
  • NO_RESPONSE_AFTER_SEVERAL_ATTEMPTS – Boolean to flag whether the participant hadn’t responded to a request to interview
  • HAS_DECLINED – Boolean to flag an explicit decline
  • INTERVIEW_SCHEDULED – Boolean to indicate a date had been scheduled with the participant
  • IS_EXPLORATORY – Boolean to indicate the interview was exploratory rather than in-depth. Having an explicit Boolean for the interview type allows me to add others if needed (while I felt that a full blown table for interview type was overkill).
  • IS_INDEPTH – Boolean for the other type of interview I was conducting.
  • INTERVIEWEE_DESCRIPTION – descriptive information about the participant’s background. Used to help me formulate questions relevant to the participant.
  • CONSENT_RECEIVED – Boolean to flag whether the participant had provided informed consent.
  • CONSENT_URL – A space to record the file location of the consent form.
  • CONSENT_ALLOWS_PARTICIPATION – A flag relevant to a specific type of participation in my ethics protocol and my consent form
  • CONSENT_ALLOWS_IDENTIFICATION_VIA_PARTICIPANT_CODE – A flag relevant to how participants were able to elect to be identified, as part of my ethics protocol.
  • INTERVIEW_CONDUCTED – Boolean to flag that the interview had been conducted.
  • TRANSCRIPT_DONE – Boolean to flag that the transcript had been created (I used an external company for this).
  • TRANSCRIPT_URL – A space to record the file location of the transcript.
  • TRANSCRIPT_APPROVED – Boolean to indicate the participant had reviewed and approved the transcript.
  • TRANSCRIPT_APPROVED_URL – A space to record the file location of the approved transcript
  • CODING_FIRST_DONE – Boolean to indicate first pass coding done
  • CODING_FIRST_LINK – A space to record the file location of the first coding
  • CODING_SECOND_DONE – Boolean to indicate second pass coding done
  • CODING_SECOND_URL – A space to record the file location of the second coding
  • NOTES – I used this field to make notes about the participant or to flag things to follow up.
  • LAST_CONTACT – I used this date field so I could easily order interviewees to follow them up.
  • LAST_MODIFIED – This field auto-updated on update.

Countries table

  • COUNTRY_ID – Unique identifier, used as primary key and foreign key reference in the INTERVIEWEES table.
  • COUNTRY_NAME – human readable name of the country, useful for demonstrating purposive sampling.
  • LAST_MODIFIED – This field auto-updated on update.

Roles table

  • ROLE_ID – Unique identifier, used as primary key and foreign key reference in the INTERVIEWEES table.
  • ROLE_TITLE – human readable title of the role, used for purposive sampling.
  • ROLE_DESCRIPTION – descriptive information about the activities performed by the role.
  • LAST_MODIFIED – This field auto-updated on update.

If I were to update the database structure in the future, I would be inclined to have a “URLs” table, where the file links for things like consent forms and transcripts are stored. Having them all in one table would make it easier to do things like URL validation. This was overkill for what I needed here.

Thinking also about the interview pipeline: the status of an interviewee in the pipeline is a combination of various Boolean flags. I would have found it useful to have a summary STATUS_ID with a human-readable descriptor of the status – something like the sketch below.
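
As a rough illustration, here’s how such a status could be derived in Python from the existing flags. The precedence of the checks is my own assumption about which stage matters most; it isn’t part of the schema.

def pipeline_status(row):
    """Derive a human-readable pipeline status from the Boolean flags on an
    INTERVIEWEES row (field names as per the table above). The ordering of
    the checks is an assumption, not part of the original schema."""
    if row["CODING_SECOND_DONE"]:
        return "Coding complete"
    if row["TRANSCRIPT_APPROVED"]:
        return "Awaiting coding"
    if row["INTERVIEW_CONDUCTED"]:
        return "Awaiting transcript"
    if row["INTERVIEW_SCHEDULED"]:
        return "Interview scheduled"
    if row["HAS_DECLINED"] or row["NO_RESPONSE_AFTER_SEVERAL_ATTEMPTS"]:
        return "Closed, no interview"
    if row["HAS_AGREED_TO_INTERVIEW"]:
        return "Agreed, awaiting scheduling"
    if row["HAS_BEEN_CONTACTED"]:
        return "Contacted, awaiting response"
    return "Not yet contacted"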

Get the SQL to replicate the database table structure

I’ve exported the table structure to SQL in case you want to use it for your own interview tracking purposes. It’s a Gist because I can’t be bothered altering my wp_options.php to allow for .sql uploads, and that’s probably a terrible idea, anyway 😉

Creating views based on field values to track the interview pipeline

Now that I had a useful table structure, I settled on some Views that helped me create and manage the interview pipeline. Views in nocodb are lenses on the underlying database that restrict or constrain the data shown so that it’s more relevant to the task at hand. This is done by showing or hiding fields, and then filtering on the selected fields.

  • Data entry view – this was a form view where I could add new Interviewees.
  • Views for parts of the pipeline – I set up several grid views that restricted Interviewees using filters to the part of the interview pipeline they were in. These included those I had and hadn’t contacted, those who had a scheduled interview, those who hadn’t responded, as well as several views for where the interviewee was in the coding and consent pipeline.
  • At a glance view – this was a gallery view, where I could get an overview of all the potential and confirmed participants.

A limitation I encountered working with these views is that there’s no way to provide summary information – like you might with a SUM or COUNT query in SQL. Ideally, I would like to be able to build a dashboard that provides statistics on how many participants are at each stage of the pipeline, but I wasn’t able to do this within nocodb itself; the workaround sketched below goes straight to the database.
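
Because the data lives in plain MySQL, those summary statistics can be pulled with a direct query instead. A minimal sketch: the connection details are placeholders for my local setup, and pymysql is just one of several MySQL client libraries that would work.

# Summarise the interview pipeline directly from the MySQL database
# that nocodb sits on top of. Column names are from the INTERVIEWEES table.
import pymysql

conn = pymysql.connect(host="localhost", user="nocodb_user",
                       password="changeme", database="interviews")
try:
    with conn.cursor() as cur:
        cur.execute("""
            SELECT
                SUM(HAS_BEEN_CONTACTED)      AS contacted,
                SUM(HAS_AGREED_TO_INTERVIEW) AS agreed,
                SUM(INTERVIEW_CONDUCTED)     AS conducted,
                SUM(TRANSCRIPT_DONE)         AS transcribed,
                SUM(CODING_SECOND_DONE)      AS fully_coded
            FROM INTERVIEWEES
        """)
        print(cur.fetchone())
finally:
    conn.close()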

Updating nocodb

nocodb is under active development, and has regular updates. Updating the software proved to be incredibly easy through npm, with two commands:

Uninstall NocoDB package

npm uninstall nocodb

Install NocoDB package

npm install --save nocodb

Parting thoughts

Overall, I have been really impressed by nocodb – it’s a strong fit for my requirements in this use case: easy to prototype with, runs locally, and is easy to update. The user interface is still not perfect – downright clunky in places – but as an open source alternative to AirTable and Notion, it hits the spot.