Is GitHub Copilot Any Good?

Betteridge’s Law applies, but the details of why I think so might still interest you.

Assumed audience: Anyone thinking seriously about using tools like GitHub Copilot as part of their day-to-day software development.

A bit of context: This whole piece was written in two parts many months apart. I threw up the first draft for folks to read if they happened across the home page of my site, so that I would get it out of my head and into the world, and then revised and published the “final” version this evening. I don’t have a good mechanism for posts like this, which I want to evolve over time as my thinking changes. See here for more on that idea.

Epistemic status: Still feeling this one out, honestly. (I think we all are!)

April 23, 2023

I have spent the afternoon and evening playing with Copilot: using it to provide suggestions as I work on building out a custom HTML renderer on top of the Rust MDAST implementation. I have built out a lot of code quickly, because this is the kind of thing where generative AI (especially in software) excels: lots and lots and lots of boilerplate. This little side project is, in this regard, very different from most of my day-to-day work over the past year and change, which has been exceedingly low on boilerplate and high on figuring out what to write in the first place.
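To give a sense of what that boilerplate looks like, here is a deliberately simplified sketch. It is not the real mdast API — the actual node type has far more variants, which is exactly why there is so much of this to write; the hypothetical `Node` enum and `to_html` function below just show the shape of the repetitive match-and-emit code Copilot is happy to fill in.

```rust
// Hypothetical, heavily trimmed-down stand-in for an mdast-style syntax tree.
// The real thing has many more node kinds; that is where the boilerplate lives.
enum Node {
    Text(String),
    Emphasis(Vec<Node>),
    Strong(Vec<Node>),
    Paragraph(Vec<Node>),
    Heading { depth: u8, children: Vec<Node> },
}

// One obvious arm per node kind, emitting the corresponding HTML.
// (Escaping omitted for brevity.)
fn to_html(node: &Node) -> String {
    match node {
        Node::Text(text) => text.clone(),
        Node::Emphasis(children) => format!("<em>{}</em>", children_to_html(children)),
        Node::Strong(children) => format!("<strong>{}</strong>", children_to_html(children)),
        Node::Paragraph(children) => format!("<p>{}</p>", children_to_html(children)),
        Node::Heading { depth, children } => {
            format!("<h{d}>{}</h{d}>", children_to_html(children), d = depth)
        }
    }
}

fn children_to_html(children: &[Node]) -> String {
    children.iter().map(to_html).collect()
}

fn main() {
    let doc = Node::Paragraph(vec![
        Node::Text("Hello, ".into()),
        Node::Strong(vec![Node::Text("world".into())]),
    ]);
    println!("{}", to_html(&doc)); // <p>Hello, <strong>world</strong></p>
}
```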

I am honestly fairly iffy on the long-term legal status of Copilot. It is not at all clear to me whether it is legal to do this kind of thing; it is also entirely unclear to me whether it should be. I am, for the purposes of this particular post, not tackling that question. I have some vague hopes of writing one or more out-and-out essays on the subject, but: we will see.

The thing that caught my attention most about this work is this: while it feels slightly faster, because so much code gets automatically “filled in” for me, I am not sure it actually is any faster. In fact, I think it might be slower. The code generated by Copilot is often wrong, but always subtly so, which means that when I let it fill in any non-trivial suggestion for me, I spend a considerable amount of time doing “code review” on the code it emits. Hillel Wayne has a good general discussion of why and how this is hard. My own takeaway is that not only is the shift in modes from writing to reviewing a difficult one, as Wayne discusses in some detail; it is actually more taxing and slower than just writing the code, for anything more than a line or two long.

June 10, 2023

I ended up turning off Copilot’s automatic suggestions mode mere days after writing the entry above. I might have given it a whole week, genuinely trying to get a sense of its value proposition and to give it an earnest shot. Less than a week after that, I gave up even asking it for suggestions when I thought it might do something useful. It just was not very good.

Above all: outside the kinds of pure-boilerplate generation I was playing with in April, basically everything it suggested was wrong in some way. (The boilerplate it generated was also often wrong, but on the order of a third of the time instead of 100% of the time!) Sometimes it was wrong in little ways I could correct after accepting the suggestion; sometimes it was wrong in massive ways that made the entire suggestion useless. In the entire time I was using it, it almost never did what I actually wanted or needed in context.

In all cases, it added cognitive noise to the process of writing code, for two reasons beyond those I noted above:

  1. It offered suggestions for arbitrary amounts of code while I was in the midst of writing code myself. This meant that I was actively forced to choose whether to accept or reject the suggestions on an ongoing basis. While this is, or can be, true of normal autocomplete as well, normal autocomplete (in a language like Rust or TypeScript, anyway) has the massive advantage that it never suggests something which is simply flat-out wrong.

  2. It is unpredictable in how long it takes to provide those suggestions. Sometimes it is nearly instantaneous; sometimes it takes an arbitrarily long time. This is equally true whether it is inserting suggestions automatically or being triggered via key command. The lack of any feedback about whether it is even doing something once you have triggered it via key command is (or was, two months ago) also a fairly significant UI design failing.

The combination of those two left me deeply annoyed by it, and if there is one thing that absolutely murders flow, it is living in a state of constant annoyance.

I can imagine there are contexts in which something like GitHub Copilot would be useful. But:

  1. Those contexts are very far from the things I do most: refactoring, or building something genuinely novel. In both of those cases, actually understanding the problem is the essential bit, not writing the code; and a tool like Copilot can barely help with writing the code anyway.

  2. It will be “something like” GitHub Copilot — including perhaps future iterations of it — rather than Copilot as it exists today, because the failure rate just makes it useless to me.

It is not that I think there is no place for tools like this. (I will say that I do think there is no place for tools trained this way.) But it does not currently help with the kind of work I do.