

Listed below are links to weblogs that reference Gender and Philosophical Intuition:

Comments

Jeremy Stangroom

Not sure whether this is of interest, but a while ago I put together an online version of the Trolley Problem. It's been completed more than 25,000 times now. I'm getting clear gender differences.

It's here:

http://www.philosophyexperiments.com/fatman/Default.aspx

Wesley Buckwalter

Thanks for the info Jeremy. What I've heard is that people often find gender main effects in their trolley data, whereby women are more likely to say that the action is not permissible, but no interaction with condition (push/switch/etc). Is that consistent with your findings as well?

Shen-yi Liao

Hi Wesley, quick question: why do some bar graphs have error bars on them and some don't? I think they would be very helpful for the cases that look ambiguous.

Wesley Buckwalter

Hey Sam, so I think with only one exception, all the figures from the experiments we present in the paper which used continuous variables have error bars (+/- SE).

However if you read the paper, some figures (like figures 1-4 and 15) represent things like percentages of the number of people who did some certain thing (for instance answer yes/no). So no error bars for those. Does that help?

Chandra Sripada


Hi Wesley,

This is very interesting work! Kudos to you and Steve for putting in the legwork to assemble this large database of evidence. I have a very serious concern, however, that I hope you can address: the multiple comparisons problem.

You are looking over the rather substantial literature in XPhi and reporting when individual studies show a gender difference in intuitions. But this method generates a serious problem with multiple comparisons because you have to conduct statistical tests on a *very large* number of studies. When a researcher conducts a large number of statistical tests, the expected number of tests that come up positive by sheer chance rises linearly with the number of tests conducted, and the cumulative probability that at least one does so climbs quickly toward 1. Indeed, if we assume the null hypothesis is true and there is no difference between groups whatsoever (and given the standard “alpha” of p<0.05), a researcher who conducts 100 tests should expect 5 positive results that are purely spurious.
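The arithmetic behind this worry can be sketched in a few lines of Python. This is only an illustration under assumptions that the comment does not state explicitly: the tests are treated as independent, every null hypothesis is taken to be true, and the function names are made up for this example.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive among n_tests independent
    tests when every null hypothesis is true (an assumption)."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Expected number of spurious positives; this grows linearly."""
    return alpha * n_tests


# Pool sizes chosen for illustration only.
for n in (1, 9, 20, 100):
    print(
        f"{n:>3} tests: P(at least one false positive) = "
        f"{familywise_error_rate(0.05, n):.3f}, "
        f"expected false positives = {expected_false_positives(0.05, n):.2f}"
    )
```

The point of separating the two functions is that they answer different questions: the expected count scales with the size of the pool, while the chance of seeing at least one spurious hit saturates near 1 once the pool is large.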

For those that do not want a stats lesson, think of it this way. Suppose Pfizer develops a new anti-depressant “Euphora”, which happens to be a total dud -- it has no anti-depressant effect at all. Now suppose Pfizer asks 100 universities to conduct 100 separate clinical trials. Pfizer then reports only those studies that show the drug had an effect, which happens to be just 5 of the 100. Obviously, Pfizer’s method is unacceptable (and let me be clear I am **NOT** saying this is what you did). Rather, I am pointing out that when multiple comparisons are done as in this Pfizer example, for us to figure out whether the reported effect is real, we need to know how big the pool of candidate studies is. The fact that the pool of studies in the Pfizer example is actually 100 should lead us to doubt whether Euphora really helps with depression (and put more strongly, it should lead us to conclude that Euphora actually has no effect).

So in order to know whether the gender effects you report are reliable, we need to know how many underlying studies you are drawing on as your pool of ‘candidate studies’. You report that Holtzman conducted 9 studies and 3 showed a gender effect. But for Cushman, Nichols, and Pizarro, we are not told how many studies are in their pool. Do you have any idea what this number is, even if it is just a guesstimate? Did they look through all their studies or just certain ones? Since they are ridiculously productive researchers, I would worry about your claims regarding a gender effect if they looked through ALL their studies and reported to you just the ones that appear in your paper. But if they looked through just a small set of studies (i.e., the small subset for which they had gender data), then your claims are on a much better footing.

I have other questions about the implications of this research for philosophical methodology. But I will stop here. Once again, thanks for posting this paper. I do think this research is among the most ambitious and thoughtful done in XPhi thus far, so my compliments.

Shen-yi Liao

And now, for more substantive comments:

I think the paper is very cool! I am certainly looking forward to future works that aim to substantiate the hypotheses and discussions that will surely be inspired. But I have one worry with the conclusion you want to draw in this paper from the experimental data.

At times, you make the strong claim that

(1) men and women have significantly different philosophical intuitions.

Compare this with a weaker claim, that

(2) men and women's philosophical intuitions sometimes have *statistically* significant different strength.

I'm convinced that the data support (2), but less convinced that the data support (1). More precisely, I suspect that the "sometimes" in (2) is more frequent than the "sometimes" in (1).

Simplistically, let's map anything above the midline as affirming the intuition and anything below the midline as denying the intuition. I think (1) is only warranted if you find that men largely affirm an intuition that women largely deny, or vice versa. You do find this for Gettier, Compatibilism, Violinist, some Trolley cases, and maybe Chinese Room -- but not others. So it's not clear to me that data from the other cases, even though they support (2), really do support the stronger claim (1). In fact, they look to me to support the negation of (1): men and women appear to have the same intuition on many cases, though the strength of those intuitions may differ.

Of broader significance, I worry that to make the hypotheses regarding gender disparity in philosophy, you really need (1) rather than (2). I wonder if you use forced choice in other cases, you would see the same disparity as found in the Starmans and Friedman cases.

Finally, I am curious: it seems to me that normative ethics is pretty intuition heavy and philosophy of language is pretty not. So you might expect, on your hypotheses, that women would be relatively more discouraged from going into normative ethics due to the intuition differences. But that doesn't seem to be the case. Do you have thoughts on why this is?

(Just saw your earlier comment: you're right. I saw the Pizarro graph and then carelessly scanned the rest. Still, it'd be nice to see the error bars on the Pizarro graph.)

Wesley Buckwalter

Hey Chandra!

Thanks so much for the kind comments. I think your statistical worries are right on the money. So let me begin by saying that one of the first things we are doing now is just going through and replicating, expanding, and expounding on the effects presented in the manuscript. In the meantime, let me see if I can at least try to say something about the Pfizerish worries. First, I should say that (sadly) lots of experiments in xphi get carried out without thought to collect gender information at all. This fact *greatly* reduced the candidate studies we could draw on when compiling this research. Also, while some of the researchers we mention are incredibly productive and prolific as you rightly say, much of that work is on topics outside the scope of this paper. Instead, what we were interested in was data from a particular population (ideally no history or very little history of philosophy courses) using a particular type of stimulus (vignettes very closely resembling thought experiments that freshman philosophy students would be likely to encounter). So these two things also greatly restricted what sets we could draw on. Also, in many places, the selection, the studies themselves, and the new studies we ran ourselves, were driven by hypotheses about gender or where to look within the pool of data for gender effects. Perhaps we could have done a better job at discussing that aspect. I will say that of all the results, the Pizarro is probably the most suspect, since that is the only one for which I do not know the size of the pool it came from. However, I was thinking that even if we cut that one study, or a few of the studies before they are replicated, we would still have this massive amount of evidence to think there are some sort of differences here. Does that help at all? I’m a little timid to try to convince you of all people about the rightness or wrongness of a technical statistical worry!

Hey Sam,

Great point about the size of these differences. First, I would just say that I think perceived deviation from mid-point isn’t the best way to measure effect size. Instead, I would suggest that effect size is a better way to measure effect size. In the paper, I think we give most of these in terms of Cohen's d and they fall somewhere in the .3-.8 range for most of the effects. However, that said, you still might (perhaps rightly even) want to argue that those effect sizes are not big enough to do the kind of work we need in our hypothesis about explaining underrepresentation later on down the road.

So how can something this small cause something so large? I would say that it’s true that some of the effects we present are smaller, while some are large. However, our hypothesis predicts that these differences, even when small, have the potential to induce certain undesirable psychological effects (such as alienation in classroom learning environments, etc). It is our claim that these further effects serve as a direct causal mechanism contributing to underrepresentation. Importantly, under our hypothesis, small differences "one side of the midpoint" in intuitional variance in the studies presented have the potential to inspire very large negative psychological effects (an empirical question here to be sure, but I'm willing to bet for now that these two things do not have a 1-1 relationship!). In short, the relevant differences for the selection effect hypothesis are not simply the intuitional inputs per se, but rather the disastrous psychological effects that result when one combines the espousal of (even slightly) different intuitions with the classroom context along with a certain pedagogical style and a philosophical methodology bent on their consensus.

Justin Sytsma

Hey Wesley!

Interesting and provocative paper! I haven't had a chance to read this version as carefully as I would like, but I also had Chandra's worry... and I'm not sure that your comments alleviated it. (Future replication would alleviate my worry, although it would be the replication that I found compelling not the original data.)

What I would like to know is how many studies made it through the selection criteria you mention. That is, I would like to know how many studies you looked at and excluded because you didn't see a gender difference. Hopefully you determined your selection criteria first, before checking for a gender difference in the study results. If so, it is important to know how many of those studies (if any) weren't reported, and specifically how many (if any) weren't reported because you saw no gender difference.

The worry is that in the absence of that information, your results could be like the following example: Suppose I report the results of 10 studies indicating that men are much better coin-flippers than women. Say that in each of those 10 studies, considered individually, the men flipped at least 50% more heads than the women (say out of 1,000 flips per study and with the flips split evenly between men and women). If I arrived at those 10 studies, however, by running 10,000 studies total and then selectively picking those 10 from the pool based on the results, it would cast great doubt on whether the results support the conclusion.

Wesley Buckwalter

Hey Justin, so let me try again by getting a little more specific. We present data in the paper under eight kinds of sections: (1) Gettier; (2) Compatibilism, Physicalism & Dualism Cases; (3) The Violinist and The Magistrate & the Mob; (4) Trolley Case; (5) Moral Responsibility and Causal Deviance; (6) ESEE (and similar causation work); (7) The Brain in the Vat, Twin Earth, The Chinese Room and The Plank; (8) Behavioral Economics.

Numbers 1, 4, 8 are hypothesis-driven work that we felt we did not need to support with any kind of external justifications in order to talk about presently. Number 2 we address in the paper: that was a situation in which a researcher examined the results of only one study (with 9 diverse conditions). In three of those conditions there were gender effects (two of which remain significant even after correcting by a factor of 9, though the third one is just about there too). Number 3 considered about 20 cases in the search, though note these two results fall very far below the .01 level. Number 6 was a reanalysis of the studies I have done in which gender was collected (found an effect for 2 out of 3 of the examined sets, the resulting paper explaining this is linked to previously). Number 7: the construction of these particular new studies was driven by various hypotheses based on previous experimental work (though it is true that we also ran 4 additional studies in the last year we do not report where no differences were found). So the tally there is 4/8. As I mentioned in an earlier comment, I admit number 5 is a worry, a relatively late addition to the piece. Despite that, we argue that the overall message here is that we simply have not consulted enough studies, such that what we did find (at the sig levels they are at) is actually that threatened by false positives. Though, we'll find this out real soon, as soon as we finish running and re-running new tests.

Note: I took your question to be one about false positives. So an additional question would just be about difference tout court, not about getting lucky relative to a particular researcher. In that case, the above does not take into account the fact that there were also a few researchers who we contacted who had collected data but for various different sorts of reasons were not able to detect gender effects (like you for instance!). So for this problem maybe we need to do a meta-analysis to see just how big, consistent, and reliable these gender effects on intuition are across the board.

Chandra Sripada

Hi Wesley,

Your responses to my comment and especially to Justin’s were indeed very helpful (and detailed). I think it allays a good bit of my concern that the pool of studies you are drawing on is so large that we should expect to find at least a few significant differences by gender in this huge pool. I think you are right that the pool is not all that large after all, since most existing data sets did not study the right kinds of cases (cases taught in first year philosophy classes) or did not collect the gender data at all. But Justin’s point about replication is important. And I would add identifying *mechanisms* by which a gender effect arises, rather than just the presence of a gender effect, would be a huge help in quashing any lingering doubts about the reality of these effects. I have some suggestions about how to do this (kind of inchoate right now, but I will work on them) and I can share them by email some time.

Justin Sytsma

Hey Wesley,

Like Chandra, this helped clear things up for me and allays some of my concerns.

Not all of my concerns, unfortunately. In particular, I wonder if you could say a bit more about the note at the end?

If I am understanding correctly, your responses alleviate my concerns for some of the numbered sections in isolation. I’m still left with some concerns about (3), for example, even in isolation, though. While my gut thinks you might be right that the difference is significant enough on those three cases that they will remain significant even when all 20 or so studies are taken into account, I don’t think that my gut is a very reliable judge about such things. Personally, I’d like to see some stats on it.

------------------

Gut check: So here is a simulation of a series of simple experiments to help check my intuitions. Say we have 20 trials of 60 tosses of a die, where a different person is tossing the die each time. We want to know whether any rollers have a knack for rolling or not rolling 6s.

Assuming the die is fair and that nobody has a knack for rolling or not rolling 6s, on each trial, there is roughly a 95% chance that the number of 6s thrown is between 5 and 15. (I’m rounding things here, and not being too exact, but I think this should be sufficient for a gut check.)

Here are the number of 6s for a simulation of 20 trials using a die that is fair for each roller:

11, 8, 13, 4, 9, 13, 5, 6, 13, 12, 8, 14, 9, 16, 13, 9, 11, 10, 10, 13

I got two trials that were outside the specified range.

But, that’s just one experiment of 20 trials. What if we were to do the experiment 20 times? Here are the number of trials outside the specified range in each of 20 experiments (the one above, plus 19 more):

2, 1, 1, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 2, 1, 0, 1, 0, 4, 1

So, in 5% of the experiments there were three or more trials with a number of 6s outside the specified range.

Let’s try another 20 experiments:

1, 0, 0, 3, 0, 0, 3, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0

In this batch of 20 experiments, in 10% of them there were three or more trials with a number of 6s outside the specified range.

This of course is not an exact analogue of your (3), but it suggests to me that I shouldn’t trust my gut here: I would need to see some stats before accepting that the occurrence of three studies out of 20 with a significant gender effect is significant.

------------------
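The gut check above can be reproduced and extended with a short Python sketch. It assumes a fair die, independent rolls, and the same range of 5 to 15 sixes per 60 rolls; the function names, the random seed, and the number of simulated experiments are choices made for this illustration, not part of the original simulation.

```python
import math
import random

N_ROLLS, P_SIX = 60, 1 / 6      # 60 tosses of a fair die per trial
LOW, HIGH = 5, 15               # "unremarkable" range for the number of 6s
N_TRIALS = 20                   # trials (rollers) per experiment


def binom_pmf(k: int, n: int, p: float) -> float:
    """Exact binomial probability of k successes in n tries."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_in_range(low=LOW, high=HIGH, n=N_ROLLS, p=P_SIX) -> float:
    """Chance that a single trial lands inside [low, high]."""
    return sum(binom_pmf(k, n, p) for k in range(low, high + 1))


def prob_at_least(m: int, n_trials: int, q: float) -> float:
    """Chance that at least m of n_trials trials fall outside the range,
    when each does so independently with probability q."""
    return sum(binom_pmf(k, n_trials, q) for k in range(m, n_trials + 1))


def simulate_experiment(rng: random.Random, n_trials: int = N_TRIALS) -> int:
    """Simulate one experiment; return how many trials fall outside the range."""
    outside = 0
    for _ in range(n_trials):
        sixes = sum(rng.randint(1, 6) == 6 for _ in range(N_ROLLS))
        if not LOW <= sixes <= HIGH:
            outside += 1
    return outside


q_outside = 1 - prob_in_range()
print(f"P(single trial outside range) = {q_outside:.3f}")
print(f"P(3 or more of 20 outside)    = {prob_at_least(3, N_TRIALS, q_outside):.3f}")

rng = random.Random(2010)  # arbitrary seed so the run is repeatable
counts = [simulate_experiment(rng) for _ in range(200)]
share = sum(c >= 3 for c in counts) / len(counts)
print(f"Simulated share of experiments with 3 or more outside = {share:.3f}")
```

Computing the exact binomial figure alongside the simulation is a deliberate choice: batches of 20 experiments are small enough that the observed share can swing noticeably from batch to batch, so the exact value gives a steadier reference point.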

My bigger concern, though, isn’t about the individual studies, but overall. What I would like to know is just how many series of studies you looked at to select the ones reported? How many of the ones that you didn’t report were excluded prior to looking at the gender results? How many were rejected because you found no gender effect? My concern is in part driven by your reference to our data and knowing that we’ve run a lot of studies and have found no gender effect on philosophical probes that can’t be readily accounted for by chance. (And this despite the fact that we see a clear gender effect on other measures, such as CRT score.)

Continuing the gut check from above, suppose I were to survey the experiments on die rolling. If I were to discard the experiments that did not have at least three trials with a number of 6s outside the specified range, I would have three experiments to report. If I then discounted those trials where the number of 6s didn’t fall outside the specified range, things would look pretty impressive: I’d have 10 trials (or “studies”) to report such that when considered in isolation the number of 6s differed significantly from the expected value for a fair die! Of course, however impressive this might look, I shouldn’t conclude that some people do in fact have a knack for rolling or not rolling 6s.

Wesley Buckwalter

Hey Chandra,

Thanks, I would be really eager to hear your thoughts about this!


Justin,

So, relative to your gut check thing. Can’t we just lower the significance threshold before accepting something as significant in the present study, in order to guard against false positives relative to the specific research situation in which they originate (as I have numbered)? To do that, we could just correct the detected level by a factor of 9 (relative to the set from #2) and a factor of 20 (from #3). In addition to what I say about #2, I just checked, and this is not at all a problem to do for #3 either while still maintaining the .05 level. There’s nothing else I can say about this.

As for the bigger worry, as I say, maybe a meta-analysis will help here. In the meantime however, I’m not prepared to discuss the work of other researchers at the level of detail you ask for in your comment here. I will vaguely tell you again though that the number of candidate studies in our search that could be included as fitting the criteria of search (for instance: they were from samples of participants with no philosophical training, they were sufficiently large samples, they were in response to standard philosophical thought experiments) was simply just not high enough, given the huge number of positives we report, at the levels of detected significance for those positives, to give serious traction to the worry that each of the gender differences in judgments made about thought experiment stimuli among people with minimal training in philosophy are the result of chance alone.

Shen-yi Liao

Hey Wesley,

I didn't mean to make a point about effect size. I agree with you, of course, that effect size is best measured by effect size.

Instead, I meant to be making a point about what you mean when you say men and women (sometimes) have significantly different philosophical intuitions. To dramatically illustrate this point, suppose that for some intuitive response to a thought experiment, men's mean is 7 and women's is 6.5 and we get a statistically significant difference and a non-zero effect size. I'd still be inclined to say that, in this case, men and women do not have different philosophical intuitions, they both strongly affirm the intuition, though the strengths of their intuitions differ. This case is not realistic, but I hope it at least makes the point clearer.

Experimentally, given the Starmans and Friedman work, I wonder why you didn't use forced choice for your cases. (Obviously, I understand that for other cases, you were just reanalyzing existing data.) In particular, I think it would be neat if you ask participants to choose between affirming or denying the intuition, and then ask for their confidence in affirmation/denial. That would help rule out the hypothesis that the effect observed is more due to men being more confident in their responses (or something along those lines).

Adam Feltz

What neat results! I'm of course a big fan of the individual differences approach.

For whatever it's worth, Edward Cokely and I have been collecting data on sex for nearly all of our experiments for more than 4 years. We also routinely check for sex effects and I don't remember having found anything like the widespread effects you are reporting. When we do find some, the effect sizes are fairly modest. So, I agree with the others that replication would really be wonderful!

Justin Sytsma

Hey Wesley,

Gotcha on (2) -- I had missed the factor of 9 in footnote 12. To make sure I understand what you did: Those two studies were each significant at a level of 0.0056 to give a familywise error rate of 0.05 for the 9 studies? Similarly when you checked (3), those two studies were each significant at a level of 0.0025 (to give a familywise error rate of 0.05)? I have to say I am somewhat surprised, as you reported p < 0.01, but reported p < 0.005 elsewhere. Thanks for checking that for me -- that is what I was looking for. In my opinion, this is something worth clearly mentioning in the paper.
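The per-study thresholds mentioned here follow from dividing the familywise level by the number of studies (a Bonferroni correction). A small Python sketch, with hypothetical function names and no p-values taken from the paper:

```python
def bonferroni_threshold(familywise_alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the familywise error rate
    at or below familywise_alpha across n_tests tests."""
    return familywise_alpha / n_tests


def survives_correction(p_value: float, familywise_alpha: float, n_tests: int) -> bool:
    """True if a single p-value stays significant after the correction."""
    return p_value < bonferroni_threshold(familywise_alpha, n_tests)


# Pool sizes as described in the thread: 9 conditions for (2), about 20 cases for (3).
for label, n in (("(2)", 9), ("(3)", 20)):
    print(f"Section {label}: per-study threshold = {bonferroni_threshold(0.05, n):.4f}")
```

Note that a result reported only as p < 0.01 cannot be checked against these thresholds without the exact p-value, since 0.01 is above both of them.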

(Just curious: Were they still significant at 0.0001 for (2) and 0.00005 for (3)?)

Were the other cases in the non-hypothesis-driven sections trending toward significance? If not, what do you make of that?

With regard to the meta-analysis issue: Would it be giving away too much to report the number of studies you rejected because there was no gender difference to report? Knowing the number of studies that you checked for gender differences in seems rather relevant to determining what should be made of the studies reported, especially the non-hypothesis-driven ones. And I think that the more transparent the paper is about this, the more convincing the results.

Bernard W. Kobes

I have strong Gettier intuitions, but to my surprise not in the case of the vignette from Starmans and Friedman about Peter and his watch. In this case a dramatic gender difference was found. But the vignette is not a Gettier case, or at least not a pure one, I believe. For it is possible to understand the relevant aspect of Peter’s mind to be “extended” in the sense of Clark and Chalmers, “The Extended Mind” (1998). On this understanding, it was all along a constitutive part of Peter’s knowledge of where his watch is that Peter has a distinctive kind of “deictic code” for its location: Peter is disposed to glance quickly at the coffee table to confirm its presence there when he needs to occurrently know where it is. Perhaps Peter usually puts his watch on the coffee table when showering, but not always, and habitually relies on a quick visual scan to check for his watch when he needs to occurrently know where it is. Before the burglar enters, Peter knows where his watch is, but the habitual external visual loop is part of the physical vehicle for Peter’s knowledge, even while he is in the shower. Likewise, before the burglar enters, Peter knows that there is a watch on the coffee table, and again the habitual external visual loop is part of the physical vehicle for this bit of Peter’s knowledge. The burglar stole Peter’s watch and replaced it with a cheap one, thus destroying Peter’s dispositional knowledge that his watch is on the coffee table, but leaving intact Peter’s dispositional knowledge that a watch is on the coffee table. It’s at least arguable that the external loop enters into Peter’s being justified in believing that there is a watch on the coffee table, and that this sort of justification imparts knowledge that a watch is on the coffee table, even after the burglar plants a cheap watch.
For the occurrent perceptual justification that is built into the external visual checking loop imparts to the dispositional belief that there is a watch on the coffee table the status of knowledge. (In fact, the peculiarity of the burglar’s replacing the watch with a cheap replica seems to make salient this understanding of the vignette.) This understanding of the vignette renders it unsuitable for an empirical showing that men and women differ in Gettier intuitions, for it’s not a Gettier case, or at least not a pure Gettier case. This consideration speaks to the important question of whether there is evidence that, overall, men’s intuitions better reflect received philosophical opinion than do women’s intuitions. Apparently, female subjects, more than male subjects, have Putnam’s externalist intuitions about semantic content in Twin Earth cases. The empirical finding for the Peter vignette should be interpreted in light of the possibility that women’s intuitions support the extended mind hypothesis more than do men’s. - Bernie

jonathan weinberg

@Bernard - I must confess to not finding that a particularly likely hypothesis, but even putting that aside, note that in this instance it's not a _problem_ for Wes and Steve. Your confounding hypothesis -- that there is a specific and substantive difference in the intuition-producing systems in male and female undergraduates -- is actually stronger than, and entails, the hypothesis that they are trying to argue for!

jonathan weinberg

To clarify: that there might be _some_ amount of difference along those lines, in terms of "extendedness-sensitivity" of intuitions, seems to me fairly plausible. But what does not strike me as particularly plausible is the claim that such a difference will be widespread enough, and sufficiently strongly connected to epistemic evaluations, to explain S&F's finding.

Bernard W. Kobes

Hi Jonathan! Thanks for replying.

The questions I touched on are:

(1) Why do I usually have strong Gettier intuitions, but not in the case of the watch vignette? [I think it's because it's at least partly an extended-mind case.]

(2) Do men and women differ with respect to Gettier intuitions? [I don't know, but I think I've raised a cogent doubt about the watch vignette.]

(3) Do men and women differ with respect to extended-mind intuitions? [I don't know, but in thinking about the watch vignette I think there is reason to consider this possibility.]

(4) Do men's intuitions more closely conform to received philosophical opinion than women's? [I don't know, but it's an important question, and I think the point I've raised about the watch vignette tends, at least a little, to dispel the appearance that the answer is yes.]

I said nothing about the magnitude of difference in men's and women's philosophical intuitions. I said nothing about Wes and Steve's hypothesis regarding gender representation in our profession. I take it those are your main points.

You also say you don't find my hypothesis about the watch vignette particularly likely. (You don't say why, and you don't address my reasoning.) Not even likely enough to warrant further study?

Cheers,
Bernie

Jonathan Weinberg

Sorry, I had thought you were addressing the main content of the post, but of course comment threads can go in all sorts of directions. To answer your last question, it doesn't seem likely enough to _me_ such that I would personally devote my own research time to it; and I don't think it's likely enough to constitute much of an objection to any extant work (which maybe doesn't matter, if no objection was on offer). But if it seems to _you_ (or anyone else) to be a bit more likely, then by all means it would be worth _someone's_ looking into it! It would be an equally good hypothesis for a locus of cross-cultural variation too, I suspect.

The comments to this entry are closed.
