Tuesday, March 26, 2013

We're moving!!

So, we're moving!

That's not the exclusive "we" of me and my meat-space family. I mean the inclusive "we": you and me, imagined reader. We're moving off of Blogger, and over to a new WordPress-powered site at jonfwilkins.com.

This is a part of a borg-like consolidation of my online presence. Over at the new site, you'll find not only the ongoing saga of this blog, but also other fun stuff, like my CV. (WOO!!)

So let us go then, you and I / and update our bookmarks in the sky / of the browser window, by which I / mean the bookmark bar. The Lost in Transcription blog will now live at http://jonfwilkins.com/blog/

If you subscribe to the feedburner feed at http://feeds.feedburner.com/LostInTranscription you should be fine, as the blog feed will still redirect there.

Friday, March 8, 2013

What a Remarkable Paperback!!

So, guess what came in the mail yesterday? That's right! It's the paperback edition of Remarkable, by Lizzie K. Foley. The hardback came out last April under the Dial imprint of Penguin. The paperback is through Puffin (also part of Penguin), and has a completely new cover. Here's a stack of them:

Foreground: The nineteen best books ever written. Background: Our new kitchen wall color.

And here's a close-up of the cover, so that you can really see the awesome cover art by Fernando Juarez, which has a bit of a Dr. Seuss-ey vibe:

Pictured: Jane, The Pirate Ship Mozart Kugeln, Lucky the Lake Monster, The Mansion at the Top of Remarkable Hill, the Bell Tower (under construction). Not pictured: the nefariously identical Grimlet Twins, Melissa and Eddie, Remarkable's School for the Remarkably Gifted, Ebb, Jeb, Flotsam, Madame Gladiola, Penelope Hope Adelaide Catalina, Anderson Brigby Bright Doe III, Lucinda Wilhelmina Hinojosa, Mad Captain Penzing the Horrific, and more.
That means that, yes, you can now get this excellent book in paperback form, which is both more affordable and more bendable than the original!

Should you buy it? Yes! Why? Let me tell you!

Here are the pull quotes from just a few of the positive reviews Remarkable has received:

From the New York Times:
A lot of outlandish entertainment.
From Booklist:
A remarkable middle-grade gem.
From Kirkus Reviews:
A rich, unforgettable story that's quite simply — amazing.
The story centers on the town of Remarkable, where all of the residents are gifted, talented, and extraordinary. Everyone in the town is a world-class musician, or writer, or architect . . .

Except for Jane.

In fact, she is the only student in the entire town who attends the public school, rather than Remarkable's School for the Remarkably Gifted. But everything changes when the Grimlet Twins join her class and pirates arrive in town. Plus, there's a weather machine, a psychic pizza lady, a shy lake monster, and dentistry.

The book is both funny and thoughtful. You can enjoy it as a goofy adventure full of wacky characters and wordplay. It's for ages eight and up, but if you're a grown-up who likes kids' books at all, you'll find that there is a lot here to engage the adult reader.

Speaking of which, you can also read it as a subversive commentary on a culture that pushes children towards excellence rather than kindness and happiness. As Jane's Grandpa John says near the end of the book:
The world is a wonderfully rich place, especially when you aren't trapped by thinking that you're only as worthwhile as your best attribute. . . . It's the problem with Remarkable, you know. . . . Everyone is so busy being talented, or special, or gifted, or wonderful at something that sometimes they forget to be happy.
Now, I know, you're thinking to yourself that you should take my endorsement with a grain of salt. After all, Lizzie Foley is my wife, and I can't be trusted to provide an honest, unbiased assessment of her book . . .

Or can I?

I'm gonna give you some straight talk on correlation versus causation. You might assume that I like this book because I'm married to the person who wrote it. You could not be more wrong. In fact, if I did not know Lizzie Foley, and I read this book, I would track her down and marry her.

So, yes, you should run out right now and get yourself a copy of this book. You should give it to your ten-year-old, or you should read it with your eight-year-old, or you should just curl up with it yourself. Just remember, she's already married. I'm looking at you, Ryan Gosling!

Monday, March 4, 2013

How Many English Tweets are Actually Possible?

So, recently (last week, maybe?), Randall Munroe, of xkcd fame, posted an answer to the question "How many unique English tweets are possible?" as part of his excellent "What If" series. He starts off by noting that there are 27 characters (26 letters plus the space), and a tweet length of 140 characters. This gives you 27^140 -- or about 10^200 -- possible strings.

Of course, most of these are not sensible English statements, and he goes on to estimate how many sensible ones there are. This analysis is based on Shannon's estimate of the entropy rate for English -- about 1.1 bits per letter. This leads to a revised estimate of 2^(140 x 1.1) English tweets, or about 2 x 10^46. The rest of the post explains just what a hugely big number that is -- it's a very, very big number.
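For anyone who wants to check those two numbers, here is a minimal Python sketch. It only restates the figures quoted above (27 characters, 140 positions, 1.1 bits per character); the variable names are mine. It works in log10 so that the results read directly as powers of ten.

```python
import math

ALPHABET_SIZE = 27    # 26 letters plus the space
TWEET_LENGTH = 140    # characters per tweet
ENTROPY_RATE = 1.1    # bits per character, the estimate quoted above

# All possible strings: 27^140, expressed as a power of ten.
log10_all_strings = TWEET_LENGTH * math.log10(ALPHABET_SIZE)

# Entropy-based estimate: 2^(140 x 1.1), also as a power of ten.
log10_entropy_estimate = TWEET_LENGTH * ENTROPY_RATE * math.log10(2)

print(f"27^140 is about 10^{log10_all_strings:.1f}")
print(f"2^(140 x 1.1) is about 10^{log10_entropy_estimate:.1f}")
```

The first exponent lands a little above 200 and the second a little above 46, which matches the rounded figures in the text.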

The problem is that this number is also wrong.

It's not that the calculations are wrong. It's that the entropy rate is the wrong basis for the calculation.

Let's start with what the entropy rate is. Basically, given a sequence of characters, how easy is it to predict what the next character will be? Or, how much information (in bits) is given by the next character above and beyond the information you already had?

If the probability of a character being the ith letter in the alphabet is p_i, the entropy of the next character is given by
-Σ_i p_i log2(p_i)
If all characters (26 letters plus space) were equally likely, the entropy of the character would be log2(27), or about 4.75 bits. If some letters are more likely than others (as they are), it will be less. According to Shannon's original paper, the distribution of letter usage in English gives about 4.14 bits per character. (Note: Shannon's analysis excluded spaces.)
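The formula is easy to play with. Here is a small Python sketch that evaluates it for the uniform 27-character case and for a deliberately lopsided distribution. The lopsided numbers are invented purely for illustration; they are not real English letter frequencies.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Uniform case: 26 letters plus the space, all equally likely.
uniform = [1 / 27] * 27
uniform_entropy = entropy_bits(uniform)

# Made-up skewed case: one character takes half the probability mass,
# and the other 26 share the rest evenly.
skewed = [0.5] + [0.5 / 26] * 26
skewed_entropy = entropy_bits(skewed)

print(f"uniform: {uniform_entropy:.2f} bits")
print(f"skewed:  {skewed_entropy:.2f} bits")
```

The uniform case reproduces log2(27), and any unevenness in the probabilities pulls the entropy below that ceiling.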

But, if you condition the probabilities on the preceding character, the entropy goes down. For example, if we know that the preceding character is a b, there are many letters that might follow, but the probability that the next character is a c or a z is less than it otherwise might have been, and the probability that the next character is a vowel goes up. If the preceding letter is a q, it is almost certain that the next character will be a u, and the entropy of that character will be low, close to zero, in fact.

When we go to three characters, the marginal entropy of the third character will go down further still. For example, t can be followed by a lot of letters, including another t. But, once you have two ts in a row, the next letter almost certainly won't be another t.

So, the more characters in the past you condition on, the more constrained the next character is. If I give you the sequence "The quick brown fox jumps over the lazy do_," it is possible that what follows is "cent at the Natural History Museum," but it is much more likely that the next letter is actually "g" (even without invoking the additional constraint that the phrase is a pangram). The idea is that, as you condition on longer and longer sequences, the marginal entropy of the next character asymptotically approaches some value, which has been estimated in various ways by various people at various times. Many of those estimates are in the ballpark of the 1.1 bits per character estimate that gives you 10^46 tweets.

So what's the problem?

The problem is that these entropy-rate measures are based on the relative frequencies of use and co-occurrence in some body of English-language text. The fact that some sequences of words occur more frequently than other, equally grammatical sequences of words, reduces the observed entropy rate. Thus, the entropy rate tells you something about the predictability of tweets drawn from natural English word sequences, but tells you less about the set of possible tweets.

That is, that 10^46 number is actually better understood as one over the likelihood that two random tweets are identical, when both are drawn at random from 140-character sequences of natural English language. This will be the same as the number of possible tweets only if all possible tweets are equally likely.

Recall that the character following a q has very low entropy, since it is very likely to be a u. However, a quick check of Wikipedia's "List of English words containing Q not followed by U" page reveals that the next character could also be space, a, d, e, f, h, i, r, s, or w. Counting u, this gives you eleven different characters that could follow q. The entropy rate gives you something like the "effective number of characters that can follow q," which is very close to one.

When we want to answer a question like "How many unique English tweets are possible?" we want to be thinking about the analog of the eleven number, not the analog of the very-close-to-one number.
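To make the gap between those two numbers concrete, here is a short Python sketch. The eleven characters are the ones listed above, but the probabilities are hypothetical, chosen only so that u dominates. The "effective number" is computed as 2 raised to the entropy.

```python
import math

# Hypothetical probabilities for the character that follows "q".
# The characters come from the list above; the numbers are invented.
RARE_FOLLOWERS = [" ", "a", "d", "e", "f", "h", "i", "r", "s", "w"]
after_q = {"u": 0.99}
after_q.update({ch: 0.001 for ch in RARE_FOLLOWERS})

entropy = -sum(p * math.log2(p) for p in after_q.values())
effective_count = 2 ** entropy    # what the entropy "sees"
possible_count = len(after_q)     # what is actually possible

print(f"entropy: {entropy:.3f} bits")
print(f"effective number of followers: {effective_count:.2f}")
print(f"possible followers: {possible_count}")
```

With these made-up numbers the effective count comes out barely above one, while the count of possible followers stays at eleven no matter how rare the alternatives are.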

So, what's the answer then?

Well, one way to approach this would be to move up to the level of the word. The OED has something like 170,000 entries, not counting archaic forms. The average English word is 4.5 characters long (5.5 including the trailing space). Let's be conservative, and say that a word takes up seven characters. This gives us up to twenty words to work with. If we assume that any sequence of English words works, we would have 4 x 10^104 possible tweets.

The xkcd calculation, based on an English entropy rate of 1.1 bits per character, predicts only 10^46 distinct tweets. 10^46 is a big number, but 10^104 is a much, much bigger number, bigger than 10^46 squared, in fact.

If we impose some sort of grammatical constraints, we might assume that not every word can follow every other word and still make sense. Now, one can argue that the constraint of "making sense" is a weak one in the specific context of Twitter (see, e.g., Horse ebooks), so this will be quite a conservative correction. Let's say the first word can be any of the 170,000, and each of the following zero to nineteen words is constrained to 20% of the total (34,000). This gives us 2 x 10^91 possible tweets.

That's less than 10^46 squared, but just barely.
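Both word-level counts are easy to reproduce with Python's arbitrary-precision integers. This is just a sketch of the arithmetic above, with names of my own choosing, and it counts only tweets of exactly twenty words, which is where almost all of the total comes from.

```python
import math

VOCABULARY = 170_000     # rough OED entry count used above
WORDS_PER_TWEET = 20     # 140 characters at seven characters per word

# Any sequence of twenty words.
any_sequence = VOCABULARY ** WORDS_PER_TWEET

# First word free; each later word limited to 20% of the vocabulary.
allowed_followers = VOCABULARY // 5    # 34,000
constrained = VOCABULARY * allowed_followers ** (WORDS_PER_TWEET - 1)

xkcd_squared = (10 ** 46) ** 2

print(f"any sequence: about 10^{math.log10(any_sequence):.1f}")
print(f"constrained:  about 10^{math.log10(constrained):.1f}")
print("constrained < 10^46 squared < any sequence:",
      constrained < xkcd_squared < any_sequence)
```

The two exponents come out a bit above 104 and a bit above 91, and the final comparison confirms that 10^46 squared sits between the two estimates.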

10^91 is 100 billion times the estimated number of atoms in the observable universe.

By comparison, 10^46 is teeny tiny. 10^46 is only one ten-thousandth of the number of atoms in the Earth.

In fact, for random sequences of six-letter words (seven characters including the space) to total only 10^46 tweets, we would have to restrict ourselves to a vocabulary of just 200 words.
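That vocabulary size falls out of solving V^20 = 10^46 for V. A two-line Python check, assuming exactly twenty words per tweet as above:

```python
TARGET_LOG10 = 46       # the entropy-based estimate, as a power of ten
WORDS_PER_TWEET = 20    # 140 characters at seven characters per word

# Solve V^20 = 10^46 for the vocabulary size V.
vocabulary_needed = 10 ** (TARGET_LOG10 / WORDS_PER_TWEET)

print(f"vocabulary needed: about {vocabulary_needed:.0f} words")
```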

So, while 10^46 is a big number, large even in comparison to the expected waiting time for a Cubs World Series win, it actually pales in comparison to the combinatorial potential of Twitter.

One final example. Consider the opening of Endymion by John Keats: "A thing of beauty is a joy for ever: / Its loveliness increases; it will never / Pass into nothingness;" 18 words, 103 characters. Preserving this sentence structure, imagine swapping out various words, Mad-Libs style, introducing alternative nouns for thing, beauty, loveliness, nothingness, alternative verbs for is, increases, will / pass, alternative prepositions for of, into, and alternative adverbs for for ever and never.

Given 10000 nouns, 100 prepositions, 10000 verbs, and 1000 adverbs, we can construct 10^38 different tweets without even altering the grammatical structure. Tweets like "A jar of butter eats a button quickly: / Its perspicacity eludes; it can easily / swim through Babylon;"

That's without using any adjectives. Add three adjective slots, with a panel of 1000 adjectives, and you get to 10^47 -- just riffing on Endymion.
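Here is the Mad-Libs arithmetic as a Python sketch. The slot counts are my reading of the swap list above (four nouns, three verbs, two prepositions, two adverbs), and the dictionary layout is just one convenient way to write it down.

```python
import math

# (number of slots, number of candidate words) for each part of speech.
slots = {
    "noun": (4, 10_000),        # thing, beauty, loveliness, nothingness
    "verb": (3, 10_000),        # is, increases, will pass
    "preposition": (2, 100),    # of, into
    "adverb": (2, 1_000),       # for ever, never
}

def count_variants(slot_table):
    """Multiply the choices available across every slot."""
    return math.prod(size ** count for count, size in slot_table.values())

without_adjectives = count_variants(slots)

# Add three adjective slots drawn from 1000 adjectives.
with_adjectives = count_variants({**slots, "adjective": (3, 1_000)})

print(f"without adjectives: 10^{len(str(without_adjectives)) - 1}")
print(f"with adjectives:    10^{len(str(with_adjectives)) - 1}")
```

Because every word list here is a power of ten, the totals are exact powers of ten, so counting digits is enough to read off the exponents.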

So tweet on, my friends.

Tweet on.

C. E. Shannon (1951). Prediction and Entropy of Written English. Bell System Technical Journal, 30, 50-64.

Thursday, February 21, 2013

Darwin Deez: You Can't Be My Girl

So, this is one of those videos that actually adds a whole nother dimension to its song. New from a couple of days ago. Watch.

Wednesday, February 20, 2013

White American Singles

So, what do you think: Pasteurized prepared cheese product or Republican online dating site?


Actually, "Pasteurized Prepared Cheese Product" would also be an excellent name for a Republican dating site.

Tuesday, February 19, 2013

This seems like a weird way to fix peer review

So, it is common to hear scientists complain about peer review, about how it is "broken," and there is probably something to that. Over at Backreaction, a blog by theoretical physicists at The Economist, Sabine Hossenfelder argues that the future of peer review, one that will fix its problems, is already here, in the form of what she calls "pre-print peer review."

The idea is to separate the peer review process from the journals, and attach it to the manuscript. So, if I write a manuscript, I would send it out, for a fee, to a peer review service, which might be run by a publishing company, or by some other entity. According to Hossenfelder, once you got back the review,
This report you could then use together with submission of your paper to a journal, but you could also use it with open access databases. You could even use it in company with your grant proposals if that seems suitable.
Okay, so maybe Hossenfelder has a very different perception of what is wrong with peer review than I do. If your ultimate goal is to submit the manuscript for traditional publication, this seems problematic and, ultimately, unsustainable.

Just think for a moment about the dynamics and market pressures. First of all, if authors have control over the reviews that they purchase, one might expect that they will only attach these reviews to their papers when those reviews are positive. Furthermore, if there are multiple peer-review services, the market pressures would presumably drive them all towards more and more positive reviews. Basically, it sets up a system that will be unraveled by "review inflation." Thinking as a journal editor or grant reviewer, I suspect that I would quickly become very skeptical of these reviews. And I certainly would not be willing to substitute their recommendations for my own judgment and the opinions of referees I selected.

You can imagine ways to address this problem. For instance, certain peer-review services could build reputations as tough reviewers, so that their "seal of approval" meant more. At this point, however, you've merely layered on another set of reputations and rankings that must be kept track of. While this approach is billed as a way to simplify the peer review process and make it cheaper and more efficient, I have difficulty imagining that it would not do just the opposite.

Hossenfelder argues that this new model of peer review is not just desirable, but inevitable:
irrespective of what you think about this, it's going to happen. You just have to extrapolate the present situation: There is a lot of anger among scientists about publishers who charge high subscription fees. And while I know some tenured people who simply don't bother with journal publication any more and just upload their papers to the arXiv, most scientists need the approval stamp that a journal publication presently provides: it shows that peer review has taken place. The easiest way to break this dependence on journals is to offer peer review by other means. This will make the peer review process more to the point and more effective.
First, in what way does this have anything to do with high subscription fees? Most open access journals have pretty much the same peer-review structure that subscription journals have. There are legitimate problems with the current dominance of scientific publishing by for-profit corporations that use free labor to evaluate publicly funded science, and then turn around and charge people a lot of money to access that science. However, given the expanding number of high-quality open-access journals that use the traditional peer review system, it seems like peer review is orthogonal to this issue.

Second, yes, there are many people who feel that they need the peer-review stamp of approval. The potential benefit here is that an author could pay for peer review and then post their work on the arXiv, thereby circumventing journals altogether, and allowing more junior researchers to pursue this publishing model. It just seems to me that an author-funded system that is so easily gamed is unlikely to provide any real sense of legitimacy to anyone with this specific concern.

Third, when she says that this will make the process "more to the point and more effective," I honestly can't imagine what mechanism she has in mind. Given that it is published in The Economist, my suspicion is that this claim is based on some sort of invisible hand argument -- that if we just free peer review from its shackles, it will become efficient and beautiful. But maybe that's unfair on my part.

The post goes on to point to two outfits that are already working to implement this model: Peerage of Science (which is up and running) and Rubriq (which is getting started). Rubriq seems focused on the author-pay model, creating a standard review format that could travel from journal to journal. Peerage provides reviews free to authors, and is paid by journals when they use a review and then publish a paper. I've not seen anything that addresses the problem of review inflation.

I don't know. Maybe there's something I'm missing here. What do you guys think?

Wednesday, February 13, 2013

Free Tips for ex-Westboro Baptists Apologizing

So, nobody asked me for this advice, but if I only gave out advice when people asked for it, I would probably burst from all the advice building up inside me.

Today, Anderson Cooper apparently interviewed Libby Phelps Alvarez, granddaughter of Westboro Baptist founder Fred Phelps (via Gawker -- I did not watch this). She was raised in the church, but fled / escaped / defected in 2009, and has recently started speaking publicly about her experience. Let me just say that she deserves a lot of respect for that. I mean, she had to reject her whole upbringing and family, which must be hard, even if your family is full of Phelpses.

Here's the thing that pissed me off though. Her interview included the following statement of regret:
I do regret if I hurt people, because that was never my intention.
This is such the standard, cliche pseudo-apology that it is easy at first glance to overlook what an offensive pile of garbage this is. First of all, "if"? Really? Again, this is super common in these circumstances, but if you've spent most of your life holding up "God Hates Fags" signs at the funerals of soldiers and children, you know damn well that you hurt people.

Even worse, though, is the second bit. When some politician or celebrity pseudo-apologizes, saying it was never their intention to hurt anyone, it is often at least plausible that they were being careless, and not intentionally hurtful.

In this case though, hurting people is precisely the intention of every public appearance the Westboro Baptist Church makes. Now, maybe you could make the case that you thought you were practicing tough love, hurting people in a way that would lead them back to the path of righteousness, or some such nonsense. This would be bullshit, of course, but it would at least be plausible according to some sort of twisted logic.

The fact is, you did intend to hurt people. I believe that you wish now that you had not hurt people in the past, and that's great. I believe that you were a kid, did not know better, and are not fully responsible for your actions, at least up to a point. I believe that you think of yourself as a good person, and I am eager to believe that you have become one. But when I see this sort of pseudo-apology, it makes me a little bit skeptical.

Maybe try something like this: "I know that I hurt a lot of people, and I am sorry. I understand now how hurtful my words and actions were in a way that I did not understand then."

I feel bad about this. I mean, given where she started from, she has progressed further in the past few years than most people do in their lifetimes. But if you're going to make amends publicly, a good way to start is by being honest.