Strings attached: A study on the effect gender has on writing conventions… if there is any

This whole issue being on the topic of gender and performance made me curious about people’s writing. Do we ‘perform’ our genders every time we sit down at the keyboard? In other words, could there be any signs in a person’s writing of what gender they are?

So, in order to find out and to give my coding portfolio a bit of a boost, I came up with a little corpus study. In this case, “corpus” refers to a body of linguistic data. I put together an anonymous survey, asking for people’s gender and responses to two short writing prompts, and got everyone at The Ubyssey and other friends to fill it out.

Once I had finished collecting data and filtered out a few unusable responses, the survey had 33 respondents: 17 female, 9 male and 7 non-binary (including genderfluid, agender, Two-spirit and so on). Most responders were between 19 and 29 years old, with a few outliers. In order to elicit natural writing, I asked the responders to answer two prompts that I hoped would garner strong reactions:

And sure enough, prompt 1 at least did:

I threw together a Python script to read each response, crunch some metrics and plot the results.

As foretold by the above answer, the first issue I came up against was the length of each response. Despite asking very nicely for 50-100 words, many responders came up short:

If I were a professor and these responders were my students, they would likely all have received a very angry message in Canvas about this, regardless of gender. Nonetheless, since this study isn’t being graded or anything, I continued with my analysis. The first big test I ran was on the amount of punctuation each responder used:

It seems that female responders may use punctuation slightly more than their male or non-binary peers, with enby responders being the least disposed to non-alphanumeric characters. No, there were no emojis in any of the responses, which is a good thing: The last thing I want to do right now is look for the correct regex filter for ‘emoji.’
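
For the curious, here is a minimal sketch of how a punctuation count like this could be done. It is not the script I ran for the study: the sample responses, the `PUNCTUATION` set and both function names are made up for illustration, and I am assuming that 'punctuation' means ASCII punctuation plus curly quotes and the single-character ellipsis.

```python
import string
from collections import defaultdict
from statistics import mean

# Hypothetical sample data: (gender, response text) pairs, invented for illustration.
SAMPLE_RESPONSES = [
    ("female", "Honestly... I'm not sure. Why would anyone do that?"),
    ("male", "No way"),
    ("non-binary", "Absolutely not! Never."),
    ("female", "It's fine, I guess."),
]

# Assumption: ASCII punctuation plus curly quotes and the one-character ellipsis.
PUNCTUATION = set(string.punctuation) | set("\u2018\u2019\u201c\u201d\u2026")


def count_punctuation(text):
    """Return the number of punctuation characters in text."""
    return sum(1 for char in text if char in PUNCTUATION)


def mean_punctuation_by_gender(responses):
    """Average the per-response punctuation counts within each gender group."""
    counts = defaultdict(list)
    for gender, text in responses:
        counts[gender].append(count_punctuation(text))
    return {gender: mean(values) for gender, values in counts.items()}


if __name__ == "__main__":
    for gender, average in mean_punctuation_by_gender(SAMPLE_RESPONSES).items():
        print(f"{gender}: {average:.2f} punctuation marks per response")
```

Note that this counts every character separately, so an ellipsis typed as three full stops counts as three marks.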

I went on to break punctuation down further:

This may suggest that female responders are more predisposed to ellipses and quotation marks, while non-binary responders edge ahead on exclamation and question mark use. ‘But Edith’, you may be asking, ‘did you look into the possibility that the location in a response affects the use of punctuation?’

I certainly did, dear hypothetical reader:

Despite using more punctuation overall, female responders were the least likely to end a response with a full stop, exclamation mark, or question mark.
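
A check like this takes only a few lines. The sketch below is an illustration rather than my actual script: the sample data and the names `TERMINAL_MARKS`, `ends_with_terminal_mark` and `terminal_mark_rate_by_gender` are hypothetical, and I am assuming trailing whitespace should be ignored.

```python
from collections import defaultdict

# Hypothetical sample data: (gender, response text) pairs, invented for illustration.
SAMPLE_RESPONSES = [
    ("female", "I guess so"),
    ("female", "Sure."),
    ("male", "No way!"),
    ("non-binary", "Really? "),
]

TERMINAL_MARKS = (".", "!", "?")


def ends_with_terminal_mark(text):
    """True if the response, ignoring trailing whitespace, ends in . ! or ?"""
    return text.rstrip().endswith(TERMINAL_MARKS)


def terminal_mark_rate_by_gender(responses):
    """Return the fraction of responses in each group that end in a terminal mark."""
    flags = defaultdict(list)
    for gender, text in responses:
        flags[gender].append(ends_with_terminal_mark(text))
    # True counts as 1 and False as 0, so sum / len gives the fraction.
    return {gender: sum(values) / len(values) for gender, values in flags.items()}


if __name__ == "__main__":
    for gender, rate in terminal_mark_rate_by_gender(SAMPLE_RESPONSES).items():
        print(f"{gender}: {rate:.0%} of responses end with a terminal mark")
```

One caveat: a response that trails off with three full stops counts as ending in a full stop here, and one that ends with a closing quotation mark after the full stop does not.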

Something I noticed from these charts is that, while there is variation between the categories, it’s fairly small. I’m not exactly going to get a Nobel Prize in experimental linguistics for my theory that non-binary folks, on average, use 0.25 of a question mark more than anyone else. As another example, take this graph of average sentence length:

It’s certainly possible to rank these gender categories in terms of average sentence length in characters — including spaces — but does it really mean anything if the range of averages is within a single character or two? I could go on for ages finding different strings and other combos to match and plot, but what does all this variation actually tell us?

As a kind of Hail Mary, I calculated the entropy of the combined responses for each of the survey responders. It’s based on a formula for measuring how unpredictable a system is — i.e. the negative sum of probabilities times log2 of the same probability. In plain and simple formula notation:
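
$$H = -\sum_{i} p_i \log_2 p_i$$

Here $p_i$ is the probability of the $i$-th possible outcome, and because the logarithm is base 2, $H$ is measured in bits.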

In the case of this study, the entropy of a person’s total response is found by first determining the probability of each unique character occurring in the two responses. The final result is found by adding up the products of each character’s probability and the log2 of that probability, then flipping the sign so that the result is positive. The idea is, the higher the entropy, the more ‘unpredictable’ the system of characters is for each responder, and by extension, the more variable a responder’s writing is, in terms of characters.
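
In code, the calculation described above might look something like the following. Treat it as a sketch under assumptions rather than the exact script behind this article: the function name `character_entropy` is hypothetical, and I am assuming the two responses are simply joined into one string before counting.

```python
import math
from collections import Counter


def character_entropy(text):
    """Shannon entropy, in bits per character, of the characters in text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # Sum p * log2(p) over each unique character, then flip the sign.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    # Hypothetical responses, invented for illustration.
    response_one = "Absolutely not! Never."
    response_two = "It's fine, I guess."
    # Assumption: the two responses are concatenated before measuring.
    print(round(character_entropy(response_one + response_two), 2))
```

As a sanity check, a string that alternates evenly between two characters, such as "abab", scores exactly 1 bit, and four equally common characters, as in "abcd", score exactly 2 bits.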

Now that I’ve got that out of the way, I can move on to the grand results of this caper: the average entropies for each gender… were almost exactly the same, hovering around 3.9 bits. I chose to spare you all from seeing a chart with three bars of basically the same height, so I tried a different approach to visualizing this stuff. Instead of an average, I found the range of entropies for each gender, i.e. the largest entropy minus the smallest:

This may suggest that male responders have more variation in response variation (I can’t believe my life led me to saying this stuff) than non-binary and female responders. Some console output suggested that response length (in characters) and entropy weren’t strongly correlated.
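
Both checks are simple to sketch. The records below are invented for illustration, the function names are hypothetical, and using the Pearson coefficient to measure correlation is an assumption on my part here, not a record of what my console output actually computed.

```python
import math
from collections import defaultdict

# Hypothetical (gender, response length in characters, entropy in bits) records,
# invented for illustration; these are not the survey results.
SAMPLE_RECORDS = [
    ("female", 320, 3.9),
    ("female", 410, 4.0),
    ("male", 150, 3.5),
    ("male", 500, 4.2),
    ("non-binary", 280, 3.8),
    ("non-binary", 300, 4.0),
]


def entropy_range_by_gender(records):
    """Largest entropy minus smallest entropy within each gender group."""
    groups = defaultdict(list)
    for gender, _length, entropy in records:
        groups[gender].append(entropy)
    return {gender: max(values) - min(values) for gender, values in groups.items()}


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length number sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Raises ZeroDivisionError if either sequence is constant.
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    print(entropy_range_by_gender(SAMPLE_RECORDS))
    lengths = [length for _gender, length, _entropy in SAMPLE_RECORDS]
    entropies = [entropy for _gender, _length, entropy in SAMPLE_RECORDS]
    print(pearson_correlation(lengths, entropies))
```

A coefficient near 0 would mean little linear relationship between length and entropy, while values near 1 or -1 would mean a strong one.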

After all of this data collection, Python scripting and visualizing, my big conclusion is that I don’t really have one. Perhaps someone with more experience in data science will kindly give me some insight, or perhaps they’ll kindly instruct me to move my PyCharm project to the recycle bin.

All in all, I think you shouldn’t worry too much about whether your writing conventions appear gendered. When people try to analyze this kind of stuff, things like this article are the result.

I’d like to thank everyone who participated in this study!