A review of Taguette – an open source alternative for qualitative data coding

Motivation and context

As you might know, I’m currently undertaking a PhD program at Australian National University’s School of Cybernetics, looking at voice dataset documentation practices, and what we might be able to improve about them to reduce statistical and experienced bias in voice technologies like speech recognition and wake words. As part of this journey, I’ve learned an array of new research methods – surveys, interviews, ethics approaches, literature review and so on. I’m now embarking on some early qualitative data analysis.

The default tool in the qualitative data analysis space is NVIVO, made by the Melbourne-based company QSR. However, NVIVO has both a steep learning curve and a hefty price tag. I'm lucky enough that this pricing is abstracted away from me – ANU provides NVIVO for free to HDR students and staff – but reports suggest that enterprise licensing starts at around US$85 per user. NVIVO is predominantly a desktop-based piece of software and is only available for Mac or Windows. My preferred operating system is Linux – it's what my academic writing toolchain of LaTeX, Atom and Pandoc runs on – and I wanted to see if there was a tool with equivalent functionality that aligned with this toolchain.

About Taguette

Taguette is a BSD-3 licensed qualitative coding tool, positioned as an alternative to NVIVO. It's written by a small team of library specialists and software developers based in New York. The developers are very clear about their motivation in creating Taguette:

Qualitative methods generate rich, detailed research materials that leave individuals’ perspectives intact as well as provide multiple contexts for understanding the phenomenon under study. Qualitative methods are used in a wide range of fields, such as anthropology, education, nursing, psychology, sociology, and marketing. Qualitative data has a similarly wide range: observations, interviews, documents, audiovisual materials, and more. However – the software options for qualitative researchers are either far too expensive, don’t allow for the seminal method of highlighting and tagging materials, or actually perform quantitative analysis, just on text. It’s not right or fair that qualitative researchers without massive research funds cannot afford the basic software to do their research. So, to bolster a fair and equitable entry into qualitative methods, we’ve made Taguette!

Taguette.org website, “About” page

This motivation spoke to me, and aligned with my own interest in free and open source software.

Running Taguette and identifying its limitations

For reproducibility: I ran Taguette version 1.1.1 on Ubuntu 20.04 LTS with Python 3.8.10.

Taguette can be run in the cloud, and the website provides a demo server so that you can explore the cloud offering. However, I was more interested in the locally-hosted option, which runs on a combination of Python, Calibre and, I believe, SQLite as the database backend, with SQLAlchemy for object-relational mapping. The install instructions recommend running Taguette in a virtual environment, and this worked well for me – running the binary from the command line presumably spawns a Flask- or Gunicorn-style web application, which you can then access in your browser. This locally-hosted option was super helpful for me, as my ethics protocol has restrictions on which cloud services I can use.
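
For reference, this is roughly what that local setup looks like, sketched with the Python standard library rather than shell commands. It assumes Taguette is published on PyPI under the name taguette and installs a taguette console script, and it uses the POSIX virtual-environment layout – a minimal sketch, not official install instructions.

    # Minimal sketch of a local Taguette install (assumptions noted above).
    import subprocess
    import venv

    venv.EnvBuilder(with_pip=True).create("taguette-env")      # create an isolated environment
    pip = "taguette-env/bin/pip"                                # POSIX layout; Windows uses Scripts\pip.exe
    subprocess.run([pip, "install", "taguette"], check=True)    # install Taguette and its dependencies
    subprocess.run(["taguette-env/bin/taguette"], check=True)   # start the local server, then open the printed URL in a browser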

To try Taguette, I first created a project, then uploaded a Word document in docx format, and began highlighting. This was smooth and seamless. However, I soon ran into my first limitation. My coding approach is to use nested codes. Taguette has no functionality for nested codes, and no concomitant functionality for “rolling up” nested codes. This was a major blocker for me.
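
To make that feature request concrete, here is a toy sketch of what I mean by "rolling up": counts for a child code such as bias/statistical are aggregated into its parent, bias. The tag names, the slash convention for nesting, and the counts are all hypothetical – this is not how Taguette stores its data.

    from collections import Counter

    # Hypothetical highlight tags, using "/" to denote nesting.
    highlight_tags = ["bias/statistical", "bias/experienced", "bias/statistical", "documentation"]

    rolled_up = Counter()
    for tag in highlight_tags:
        parts = tag.split("/")
        for depth in range(1, len(parts) + 1):    # credit the tag itself and every ancestor
            rolled_up["/".join(parts[:depth])] += 1

    print(rolled_up)  # e.g. Counter({'bias': 3, 'bias/statistical': 2, ...})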

However, I was impressed that I could add tags in multiple languages, including non-Latin orthographies, such as Japanese and Arabic. Presumably, although I didn’t check this, Taguette uses Unicode under the hood – so it’s foreseeable that you could use emojis as tags as well, which might be useful for researchers of social media.

Taguette has no statistical analysis tools built in, such as word frequency distributions, clustering or other corpus-type methods. While these weren’t as important for me at this stage of my research, they are functions that I envisage using in the future.
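
As an illustration of the kind of corpus-style analysis I have in mind, here is a small sketch that builds a word-frequency distribution over highlight text. The highlight strings are invented, and this is not a Taguette feature – it is what I would do today with exported highlights and a few lines of Python.

    import re
    from collections import Counter

    # Hypothetical highlight text, e.g. copied from a codebook export.
    highlights = [
        "the dataset documentation was incomplete",
        "documentation practices vary between teams",
    ]

    words = Counter(
        word
        for highlight in highlights
        for word in re.findall(r"[a-z']+", highlight.lower())
    )
    print(words.most_common(5))   # [('documentation', 2), ('the', 1), ...]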

Taguette’s CodeBook export and import functions work really well, and I was impressed with the range of formats that could be imported or exported.

What I would like Taguette to do in the future

I really need nested tags with aggregation functionality for Taguette to be a viable software tool for my qualitative data analysis – this is a high-priority feature for me, followed by statistical analysis tools.

Some thoughts on the broader academic software ecosystem

Even though I won’t be adopting Taguette, I admire and respect the vision it has – to free qualitative researchers from being anchored to expensive, limiting tools. While I’m fortunate enough to be afforded an NVIVO license, many smaller, less wealthy or less research-intensive universities will struggle to provide a license seat for all qualitative researchers.

This is another manifestation of universities becoming increasingly beholden to large software manufacturers, rather than having in-house capabilities to produce and manage software that directly adds value to a university's core capability of generating new knowledge. We've seen it in academic journals – with companies like EBSCO, Sage and Elsevier intermediating the publication of journals, hoarding copyrights to articles and collecting a tidy profit in the process – and we're increasingly seeing it in academic software. Learning Management Systems such as Desire2Learn and Blackboard are now prohibitively expensive, while open source alternatives such as Moodle still require skilled (and therefore expensive) staff to maintain and integrate them – a challenge when universities are shedding staff in the post-COVID era.

Moreover, tools like NVIVO are imbricated in other structures which reinforce their dominance. University HDR training courses and resource guides are devoted to the software tools in common use. Additionally, supervisors and senior academics are likely to use the dominant software, and so are in an influential position to recommend it to their students. This support infrastructure reinforces dominant tools by ascribing them a special, reified status within the institution. At a broader level, even though open source has become a dominant business model, the advocacy behind free and open source software (FOSS) appears to be waning; open source is now mainstream, and it no longer requires a rebel army of misfits, nerds and outliers (myself included) to be its flag-bearers. This raises the question: who advocates for FOSS within the academy? And more importantly, what influence do they have compared with a slick marketing and sales effort from a global multinational? I'm reminded here of Eben Moglen's wise words at linux.conf.au 2015 in Auckland, in the context of opposing patent trolls through collective effort – "freedom itself depends upon how we make use of the technologies we are creating". That is, universities have themselves created the dependence on academic technologies which now restrict them.

There is hope, however. Platforms like ArXiv – the free distribution service and open access archive for nearly two million pre-prints in mathematics, computer science and other (primarily quantitative) fields – are starting to challenge the status quo. For example, the Australian Research Council recently overturned its prohibition on the citation of pre-prints in competitive grant applications.

Imagine if universities combined their resources – like they have done with ArXiv – to provide an open source qualitative coding tool, locally hosted, and accessible to everyone. In the words of Freire,

“Reading is not walking on the words; it’s grasping the soul of them.”

Paulo Freire, Pedagogy of the Oppressed

Qualitative analysis tools allow us to grasp the soul of the artefacts we create through research, and that ability should be afforded to everyone – not just those who can afford it.

Picks from #fosdem2020

Although I’ve never managed to get to Brussels for FOSDEM (yet), it remains one of the biggest open source and free software events on the calendar. The videos are now online – and here are a few I found insightful.

Daniel Stenberg (@bagder) on HTTP/3

HTTP/3 has been in the works for a couple of years now, and Daniel's talk was an excellent overview of how HTTP/3 differs markedly from its predecessors, HTTP/1.1 and HTTP/2. The key change I took away from this talk is that HTTP/3 runs over UDP rather than TCP, which eliminates the head-of-line blocking issues seen in both HTTP/1.1 (blocking at the request level) and HTTP/2 (blocking at the TCP level). This is achieved through a new, as-yet unstandardised transport protocol called QUIC.
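
To make the head-of-line blocking point concrete, here is a toy, no-networking simulation of the difference: with a single ordered byte stream (the TCP case) one lost segment stalls every multiplexed response behind it, whereas with per-stream ordering (the QUIC case) only the affected stream waits. The stream labels and segments are invented, and this is a deliberately simplified model, not real protocol behaviour.

    # Three multiplexed responses A, B and C share one connection;
    # the second segment of B is lost in transit.
    arrival_order = [("A", 1), ("B", 1), ("C", 1), ("B", 2), ("A", 2), ("C", 2)]
    lost = {("B", 2)}
    received = [seg for seg in arrival_order if seg not in lost]

    # TCP-style delivery: one ordered byte stream, so nothing after the gap
    # can be handed to the application until ("B", 2) is retransmitted.
    tcp_delivered = []
    for seg in arrival_order:
        if seg in lost:
            break
        tcp_delivered.append(seg)

    # QUIC-style delivery: ordering is per stream, so only stream B waits.
    quic_delivered = [seg for seg in received if seg[0] != "B" or seg[1] < 2]

    print("TCP :", tcp_delivered)    # [('A', 1), ('B', 1), ('C', 1)]
    print("QUIC:", quic_delivered)   # [('A', 1), ('B', 1), ('C', 1), ('A', 2), ('C', 2)]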

There are some drawbacks, however:

  • Many networks block or throttle UDP traffic, because it is so often the transport protocol used in hacking and penetration attempts against networks.
  • To a firewall, QUIC traffic can also look a lot like a DDoS attack.

So, it is likely to be a few years before HTTP/3 sees widespread adoption.

https://fosdem.org/2020/schedule/event/http3/

Esther Payne (@onepict) on RFC1984 and the need for encryption and privacy

Drawing on historical examples such as Ovid and Bentham’s panopticon, Esther outlined trajectories and through-lines of privacy and surveillance. She called upon the technical community to be aware of RFC1984, penned in 1996 by members of the Internet Architecture Board and the Internet Engineering Task Force, and put themselves in the shoes of those who are surveilled. She outlined how RFCs are not universally observed and implemented, particularly by Big Tech, who have the reach and network power to implement their own standards.

Moreover, many governments around the world – including the Australian government – are seeking to implement backdoors into systems, allowing cryptographic measures to be subverted and privacy to be undermined. Data is being used against classes of citizens such as immigrants.

She called on tech communities to help our friends and families to realise that we “are the cow” being surveilled.

Side note: There were significant parallels between this talk and Donna Benjamin’s keynote at #lca2020, “Who’s watching?”.

https://fosdem.org/2020/schedule/event/dip_rfc1984/

Reuben Van der Leun (@rvdleun) builds smart glasses with Javascript

I have to admit, I’ve always been a fan of smart glasses, and was a little surprised that Google Glass didn’t take off over 10 years ago. In the interim, there seems to have been something of an augmented reality “winter”, with AR and VR type goggles being constrained to industrial and experimental usage, rather than being adopted into the consumer mainstream.

Reuben’s project – a DIY approach to augmented reality glasses using a bunch of Javascript APIs and open hardware – may be the harbinger of the “next wave” of augmented reality glasses, powered by freely available APIs for facial recognition, contact management and even speech recognition.

https://fosdem.org/2020/schedule/event/javascript_smartglasses/

linux.conf.au 2019 Christchurch – The Linux of Things

linux.conf.au went over the Tasman to New Zealand this year for the fourth time, to the Cantabrian university city of Christchurch. This was the first year that Christchurch had played host, and I sincerely hope it's not the last.

First, to the outstanding presentations.

NOTE: You can see all the presentations from linux.conf.au 2019 at this YouTube channel

Open Artificial Pancreas System (OpenAPS) by Dana Lewis

See the video of Dana’s presentation here

Dana Lewis lives with Type 1 diabetes, and her refusal to accept the current standard of care in diabetes management led her to collaborate widely, developing OpenAPS. OpenAPS is a system that leverages existing medical devices and adds a layer of monitoring using open hardware and open software solutions.

This presentation was outstanding on a number of levels.

As a self-experimenter, Dana joins the ranks of scientists the world over who have put their own health on the line in the pursuit of progress. Her ability to collaborate with people from disparate backgrounds and varied skillsets to make something greater than the sum of its parts is a textbook case of the open source ethos. Moreover, the results that OpenAPS achieved were remarkable: significant stabilisation in blood sugars and better predictive analytics, providing better quality of life to those living with Type 1 diabetes.

Dana also touched on the Open Humans project, which aims to have people share their medical health data publicly so that collective analysis can occur – freeing this data from the vice-like grip of medical device manufacturers. Again, we're seeing that data itself has incredible value – sometimes more than the devices that monitor and capture it.

Open Source Magnetic Resonance Imaging: From the community to the community by Ruben Pellicer Guridi

You can view the video of Ruben’s presentation here

Ruben Pellicer Guridi's talk centred on how the Open Source MRI community was founded to address the need for more MRI machines, particularly in low socio-economic areas and in developing countries. The project has attracted a community of health and allied health professionals, and has made available both open hardware and open software, with the first image from their Desktop MR software acquired in December.

Although the project is in its infancy, the implications are immediately evident: better public healthcare, particularly for the most vulnerable in the world.

Apathy and Arsenic: A Victorian era lesson on fighting the surveillance state by Lilly Ryan

You can view the video of Lilly’s presentation here

Lilly Ryan’s entertaining and thought-provoking talk drew parallels between our current obsession with privacy-leaking apps and data platforms and the awareness campaign around the detrimental effects of arsenic in the 1800s. Her presentation was a clarion call to resist ‘peak indifference’ and increase privacy awareness and digital literacy.

Deep Learning, not Deep Creepy by Jack Moffitt

You can view the video of Jack’s presentation here

Jack Moffitt is a Principal Research Engineer at Mozilla, and in this presentation he opened by providing an overview of deep learning. He then dug a little deeper into the dangers of deep learning, specifically the biases that are inherent in current approaches, and some of the solutions that have been trialled to address them, such as making occupation words like “doctor” equidistant from gender pairs like “man” and “woman”, so that “doctor” is equally predictive for both.
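
As a toy illustration of that idea – and emphatically not Mozilla's implementation – the sketch below removes the component of a "doctor" vector that lies along a man/woman direction, after which its similarity to both words is equal. The three-dimensional vectors are made up; real debiasing approaches in the word-embedding literature operate on learned, high-dimensional embeddings and include further steps.

    import numpy as np

    # Made-up 3-d embedding vectors, chosen so "man" and "woman" mirror each
    # other along the first axis. Real embeddings have hundreds of dimensions.
    man    = np.array([ 0.9, 0.1, 0.3])
    woman  = np.array([-0.9, 0.1, 0.3])
    doctor = np.array([ 0.5, 0.6, 0.2])

    # Estimate a "gender direction" from the definitional pair.
    g = man - woman
    g = g / np.linalg.norm(g)

    # Neutralise: project the gender component out of "doctor".
    doctor_debiased = doctor - np.dot(doctor, g) * g

    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # For this symmetric toy pair, neutralising alone makes the similarities equal.
    print(cosine(doctor, man), cosine(doctor, woman))                    # unequal before
    print(cosine(doctor_debiased, man), cosine(doctor_debiased, woman))  # equal after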

He then covered the key ML projects from Mozilla such as Deep Speech, Common Voice and Deep Proof.

This was a great corollary to the two talks I gave.

Computer Science Unplugged by Professor Tim Bell

You can view Tim’s presentation here

Part of the Open Education Miniconf, Tim's presentation covered how to teach computer science in a way that was fun, entertaining and accessible. The key problem that Computer Science Unplugged solves is that teachers are often afraid of CS concepts – and CS Unplugged makes teaching these concepts fun for both learners and teachers.

Go All In! by Bdale Garbee

You can view Bdale’s talk here

Bdale’s talk was a reinforcement of the power of open source collaboration, and the ideals that underpin it, with a call to “bet on” the power of the open source community.

Open source superhumans by Jon Oxer

You can view Jon’s talk here

Jon Oxer’s talk covered the power of open source hardware for assistive technologies, which are often inordinately expensive.

Other conversations

I had a great chat with Kate Stewart from the Linux Foundation about the work she's doing on programmatic auditing of source code licensing – her talk on the grep-ability of licenses is worth watching. We also covered metrics for communities with CHAOSS, and the tokenisation of Git commits to understand who has committed which code, specifically for unwinding dependencies and copyright.
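
For anyone unfamiliar with why "grep-able" licensing matters, here is a small sketch that counts SPDX-License-Identifier tags across a source tree – the kind of machine-readable tagging that makes programmatic license audits tractable. The project path and file glob are placeholders, and this is only an illustration of the idea, not any particular audit tool.

    import pathlib
    import re
    from collections import Counter

    # Count SPDX license tags across a (placeholder) source tree.
    spdx_tag = re.compile(r"SPDX-License-Identifier:\s*([^\s*/]+)")
    counts = Counter()

    for path in pathlib.Path("some-project").rglob("*.c"):      # placeholder path and glob
        for line in path.read_text(errors="ignore").splitlines():
            match = spdx_tag.search(line)
            if match:
                counts[match.group(1)] += 1

    print(counts.most_common())   # e.g. [('GPL-2.0-only', 1234), ('MIT', 56)]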

Christchurch as a location

Christchurch was a wonderful location for linux.conf.au – the climate was perfect; we had a storm or two, but it wasn't 45 °C burnination like Perth. The airport was also much bigger than I had expected, and the whole area is set up for hospitality and tourism. It won't be the last time I head to CHC!