Algorithmically good writing.

A few lingering thoughts on an idea I still haven't begun executing

Jun 07, 2022

As background, the reason this blog exists is that I want to git gud at writing, and the first step is to write (even if poorly).

As my first post on Substack, I brought up the idea of developing a Substack comment scraper and doing a network analysis of commenters and likers. A similar thought was “I am suddenly curious on what percentage of sentences are totally unique on Google. I should write a script to google every sentence in any article I feed it, and spit out the stats. My guess is that almost everything is 90+% unique sentences.”

itsnotmyfault @itsnotmyfault01

"I never promised you the Agora" says DeBoer at the end of a recent one freddiedeboer.substack.com/p/lets-talk-ab… I had to look up what the Agora is, but more importantly, I am suddenly curious on what percentage of sentences are totally unique on Google.

freddiedeboer.substack.com: “Let’s Talk About Comments.” So there’s been some angst about comments here, and I would like to figure out a way to address that issue that still maximizes people’s ability to express their point of view in whatever way they would like. As you know, I have a political commitment to free speech and an aesthetic commitment to a…
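For what it's worth, here is a rough sketch of how the sentence-uniqueness script might be shaped. It deliberately does not talk to Google: `count_hits` is a hypothetical callable you would have to write yourself against whatever search API you have access to, and it is assumed to return how many other pages contain the exact sentence. The sentence splitter is also naive (it just breaks after `.`, `!`, or `?`), so treat the numbers as ballpark.

```python
import re
from typing import Callable, Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break on whitespace that follows ., ! or ?"""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def uniqueness_stats(text: str, count_hits: Callable[[str], int]) -> Dict[str, float]:
    """Report how many sentences have zero matches elsewhere.

    count_hits is a hypothetical lookup (not a real library call): it takes
    a sentence and returns the number of exact-phrase matches found
    elsewhere. If the article itself is already indexed, "unique" probably
    needs to mean "one hit" instead of "zero hits".
    """
    sentences = split_sentences(text)
    unique = [s for s in sentences if count_hits(s) == 0]
    total = len(sentences)
    percent = 100.0 * len(unique) / total if total else 0.0
    return {"sentences": total, "unique": len(unique), "percent_unique": percent}


if __name__ == "__main__":
    # Stand-in lookup for demonstration only: pretend one cliche is common.
    def fake_hits(sentence: str) -> int:
        return 5 if sentence == "Know your audience." else 0

    article = "Know your audience. I never promised you the Agora. Is it new?"
    print(uniqueness_stats(article, fake_hits))
```

Swapping `fake_hits` for a real lookup is the only part that costs money or needs an API key; everything else runs locally.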

Several times I’ve brought up the idea of trying to get better as a writer, while having little sense of what I mean by good writing. From an evolutionary psychology perspective, the current frontrunner is “talking can’t possibly be more about information sharing than about group and status signalling”, despite the sort of Socratic ideal that my autistic STEMlord mind clings to. The way I would conventionally think of this is that “what should matter is the veracity of the information presented, and the ease with which those ideas are put into the reader’s head, but unfortunately, the thing that actually wins is how cool you seem.”

The “actually it is a good game” synthesis is that “People are extremely intelligent and optimize for signal-to-noise across multiple dimensions. They want a high signal-to-noise ratio of true information, of useful information, and of group and status signals. They also differ widely in their ability to determine what is true, what is useful, and what groups they are fond of. The cliche of ‘know your audience’ is used because success is meta-dependent, which is a signal that the game is sufficiently complex that there’s always a way forward.” In other words, it’s not “this sucks because being persuasive is way more valued than being right”, it’s “this is worth getting good at because you can be persuasive in so many different ways, and those ways are dependent on so many interesting skills that really do provide value.”

The associated downside is that haters gonna hate, or that the well can be so poisoned against you and people like you that no amount of idea veracity in your writing can make it over the social hurdle, and the perception from the hater will be “bad writer” or “wrong idea” rather than “wrong team”. I’ve read a few good DeBoer pieces (and an e-book) on this that are in the “don’t write like everyone else, or else you have no competitive advantage” school of thought.
SirPingsALot agrees (my compliment of him here and his Substack here).

Recently I read the tale of a cheater-finder who has all kinds of custom automated tools for sniffing out cheaters, above and beyond the usual stuff:

Let’s start with me, I’ll be the category. I have prior experience with pre-pandemic academic integrity violations. There’s no time for a prequel. But let’s say I’ve been called an obsessive plagiarism detective. Plagiarism really irks me. It irks me so bad I wrote my own R package to detect plagiarism https://crumplab.com/playjareyesores/index.html. Students generally are unaware that in my research life I write computational models that can compare text for semantic similarity. And, sometimes as a professor who has access to online plagiarism tools like turnitin or safe-assign, you just want better tools. So, I write my own tools, and then use them to uncover nests of plagiarism.

On that topic, I’m kind of curious about what kinds of things we should look at if we approach the question of “how to git gud at writing” algorithmically. Is a writer’s “voice” often dependent on how often they use specific 2-word and 3-word combinations, in addition to what we would normally describe as a personality (that is, the kinds of ideas they tend to reach for as explanations, typically based on their professional training)? If I were to try and rank the most frequently used 2-word and 3-word combinations, part of the question is how much tense-shifting and similar lumping is appropriate. If I filter “So I said” and “so I asked” to be the same thing, is that inappropriate? How about “How about” and “What about” as among my most common sentence openers? “If I were to” is probably safely lumped in with “If I was to”, right? I guess the only real way to figure it out is to run the numbers in both cases and see which method tends to reveal more useful information faster, even if it is technically less accurate information. Yet another “we have optimized for something other than the truth” to whine about, if you were satisfied with hating the game.
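Running the numbers in both cases could look something like the sketch below. It ranks n-word combinations twice, once as written and once with some phrases folded together. The `merges` table is purely illustrative (my guess at which phrases to lump, not an established list), and the tokenizer is a crude lowercase word matcher, so punctuation and sentence boundaries are ignored.

```python
import re
from collections import Counter
from typing import Dict, List, Optional, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase words only; keeps straight and curly apostrophes."""
    return re.findall(r"[a-z'’]+", text.lower())


def top_ngrams(
    text: str,
    n: int,
    merges: Optional[Dict[str, str]] = None,
    k: int = 10,
) -> List[Tuple[str, int]]:
    """Return the k most common n-word combinations.

    merges maps a phrase to the phrase it should be counted as,
    e.g. {"so i asked": "so i said"}. Leave it out for raw counts.
    """
    words = tokenize(text)
    counts = Counter(
        " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
    )
    if merges:
        folded: Counter = Counter()
        for gram, count in counts.items():
            folded[merges.get(gram, gram)] += count
        counts = folded
    return counts.most_common(k)


if __name__ == "__main__":
    sample = "So I said hello. So I asked why. So I said no."
    # Illustrative lumping choice, an assumption rather than a rule.
    lump = {"so i asked": "so i said"}
    print("raw:   ", top_ngrams(sample, 3, k=3))
    print("lumped:", top_ngrams(sample, 3, merges=lump, k=3))
```

Comparing the two printed rankings side by side is the whole experiment: if lumping changes which phrases float to the top, the choice matters; if it doesn't, pick whichever is less work.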

The other approach is to let the big brain in the sky figure it out: learn enough ML to set GPT-3 or whatever to work on pre-writing a shell that you sort of add a bit of a subjective “if it was me, I’d …” flair on top of. Also, the meatspace version is to just hire a ghostwriter. It might also be to write something and work with a professional editor and try to predict what they’ll want to change. Or go to a composition and rhetoric school or something, idk. Or, in theory, just read a book.

You would probably need to give Google a little bit of money for sending so many queries if you wanted to also find out how many unique 2-word, 3-word, 4-word, etc. combinations are in your writing.
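To get a feel for how many queries that would be, you can count the distinct combinations locally before sending anything. This is a minimal sketch under one assumption of mine: one query per distinct combination. The function names are made up for illustration, and I'm not assuming any particular price per query.

```python
import re
from typing import Dict, Iterable, Set, Tuple


def distinct_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of distinct n-word combinations, lowercased, punctuation ignored."""
    words = re.findall(r"[a-z'’]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def query_budget(text: str, sizes: Iterable[int] = (2, 3, 4)) -> Dict[int, int]:
    """Number of lookups needed per n, assuming one query per distinct n-gram."""
    return {n: len(distinct_ngrams(text, n)) for n in sizes}


if __name__ == "__main__":
    post = "Watch this go nowhere in 2 weeks. Watch this go somewhere instead."
    budget = query_budget(post)
    print(budget, "total queries:", sum(budget.values()))
```

Multiply the total by whatever your search provider charges per query to see whether "a little bit of money" is accurate for a full post.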

Watch this go nowhere in 2 weeks.

itsnotmyfault’s Newsletter
