A scientist looks at x-phi (with, er, mixed results)

The biologist Marica Bernstein recently presented a paper at the Mid-South Philosophy Conference in which she took a random survey of papers in x-phi and evaluated how well we as a group are doing on each of a large number of basic methodological criteria. And how did we do? Well, you can see for yourself, but the short answer is: not so hot, really.

I should emphasize that Marica's paper is intended very much as a friendly critique.  She's not at all saying that philosophers shouldn't do experiments; she's just trying to argue that we should do experiments better.  (Though at least some of us -- Stotz & Griffiths in particular -- are already doing them very well, on her view.)

Comments

As this essay points out, most universities have statisticians who are willing to consult about design and analysis. If you're not consulting with them, you probably should. Few philosophers have had graduate training in experimental design!

Interestingly, Bernstein holds x-phi-ers to higher standards than what one sees in most research psychology articles. I'm not sure about the importance of reporting the statistical software unless one is doing fancy analyses, but it seems to me that there are good statistical reasons to report alpha level and p more conscientiously than most psychologists and x-phi-ers do.

Oh, one more thing: I hope Bernstein is mistaken in her suggestion that most x-phi-ers are not getting IRB approval. She makes it sound like a long, intimidating process, but I have never found it to be. Indeed, for simple polls, it's relatively straightforward.

I think some of her criticisms are misguided. For example, she writes at one point in the results:

"For 11 of 14 papers the person collecting the data was not blind to the research question. The X-Phiers themselves approached potential subjects, conducted the interview, or passed out questionnaires."

But who cares? When handing out questionnaires, blindness to the research question is pretty much irrelevant. With interviews, this can be a little bit more important (it depends on the type of questions asked, and the type of research question), but since she doesn't say how many of the 11 were paper questionnaires, I just don't know what to make of this.

Bernstein is right, I think, that ignoring common conventions for publishing social science research, misidentifying or ignoring levels of measurement, and using inappropriate statistical tests jeopardizes the credibility of x-phi.

So, even though she offered these criticisms in a constructive manner, I think it is important to acknowledge their seriousness.

Consultation with statisticians is a good idea as folks develop more sophisticated projects, but one doesn't need graduate training in research or statistics to prevent these sorts of errors from recurring: nearly any undergraduate survey of those fields would cover enough ground to avoid the sort of basic mistakes on which Bernstein focuses.

(I'm writing from a perspective of a philosopher who is interested in and friendly towards x-phi and who spends much of his teaching load teaching research methods and data analysis.)

I think making sure you do your stats right is a good idea, of course. That's one of the purposes of the peer review process, though. I suspect that when x-phil studies have been submitted to psych journals, for example, they've been reviewed by psychologists (who are trained in experimental and statistical methods). I can't imagine that if there are problems with the statistical methods in a paper, those reviewers wouldn't say something.

Also, don't most of you collaborate, to some degree, with researchers in other fields?

"Among them, many reported significant results as “p<0.05”, “p<0.01”, and “p<0.001” for multiple tests on the same data set."

This, just for example, is ridiculously embarrassing.

Well, since I know one of the studies Bernstein criticizes is mine, I thought I should at least correct a mistaken assumption she makes with respect to the motives purportedly lurking behind my experimental design. She makes the following claim:

"It shows the value of design. One might ask, why n=126? Why not 120 or 130? Although detailed explanation is beyond the scope of this note, the reason is simple. N=63/group is exactly the sample size needed to achieve statistical power of 80%-- the accepted standard. Not only is this X-Phier able to assert with 99.4% confidence that the null hypothesis has not been wrongly rejected, he/she is also able to assert with 80% confidence that the alternative hypothesis is true. Slightly smaller sample sizes would have resulted in <80% confidence; larger would have been a waste of resources."

That may indeed be a reason to use 126 participants--no more and no less. Hell, it might even be a really good--if crafty--reason. But for all that, it happens not to have been one of my reasons. Indeed, I had no reason for using 126 at all, although there is a mundane explanation for why I did so.

You see, for all of my studies I would hand out surveys to large auditoriums full of FSU undergraduates. When administering surveys--whether it was in my own class or someone else's class--I would always take as many copies as there were students registered for the class. Of course, I would only get as many answers as there were students who showed up for class that day. Hence, unusual numbers such as 126.

Now perhaps this is itself wildly problematic, methodologically speaking--which just further reveals that I don't know as much about experimental design as I should. But I think Bernstein has ironically given me more credit than I deserve. Had I known enough to know that precisely 126 participants was ideal, I would not have made the other obvious design mistakes she charges experimental philosophers with committing.
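
For readers wondering where a figure like 63 per group could come from, here is a minimal sketch using the standard normal-approximation formula for comparing two group means. The inputs are assumptions, not values taken from Bernstein's paper: a two-sided alpha of .05, 80% power, and a medium standardized effect (Cohen's d = 0.5). The function name is made up for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided comparison of two means.

    Uses the normal approximation n = 2 * ((z_alpha + z_power) / d) ** 2.
    An exact t-based calculation gives a slightly larger n.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Assumed inputs: medium effect, alpha = .05, power = .80
print(n_per_group(0.5))
```

Under these assumed inputs the approximation lands on 63 per group, so 126 in total. Different assumptions about the effect size or alpha would give a different target, and power here means the probability of detecting an effect of the assumed size if it exists.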

Setting that aside, I nevertheless enjoyed Bernstein's paper. I think she makes a number of helpful observations, criticisms, and objections. For now, I just want to point out that I, for one, am painfully aware of how much more I need to learn about experimental design and statistical analysis. This is precisely why I am planning to work more closely with psychologists in the future.

Before closing, I just want to say that I think Bernstein is wrong when it comes to experimental philosophers and IRB approval. Nearly everyone I know doing this kind of research gets approval. Moreover, in my experience at least, doing so is usually very quick and painless.

"... data structure matters tremendously. Importantly, it determines the choice of appropriate statistical test, all of which make assumptions about data structure." (p. 4)

This is by no means an uncontroversial statement. As Howell (2007, p. 7) observes in his introductory behavioral statistics text, "... writers disagree about the importance assigned to measurement scales. Some authors have ignored the problem totally, whereas others have organized whole textbooks around the different scales." Howell comes down in the former camp, stating that "the underlying measurement scale is not crucial in our choice of statistical technique." This, he says, is because we can often regard a given set of measurements as belonging to any one of the four scales, depending on our theoretical interests and our knowledge of the objects or events that those measurements describe.

In other words, while the particular statistical test that you can use is constrained by what type of data you have, there is often no fact of the matter as to what type of data you have. Rather, the type of data that you have is often determined by your theoretical interests. So your choice of statistical test is ultimately more a function of these theoretical concerns than of the particular type of data you have.

Joe, my sense is that your final paragraph exaggerates the extent and nature of any indeterminacy.

Even if not, however, this would not justify researchers simply ignoring the levels of measurement of their data, which I take to be one of Bernstein's most significant critiques.

As a practical matter, if x-phi-ers want to engage productively with both social science researchers and philosophers, they need to respect the main norms of both disciplines. Bernstein suggests that many x-phi-ers are not yet doing this with respect to some important social science norms.

Perhaps some of these norms are philosophically suspect, and if x-phi-ers manage to capture the attention and trust of their colleagues in the social sciences they may have an opportunity to improve research practice. As Eric and Thomas have suggested, however, philosophers may have to learn more about research design and statistics and may have to form new relationships with experts in those fields before they are well-positioned to do this.

"Among them, many reported significant results as “p<0.05”, “p<0.01”, and “p<0.001” for multiple tests on the same data set."

I get the part about how people should report exact p values, but beyond that issue, why is this finding in Bernstein's paper particularly worrisome? Don't different tests on the same data set typically look at different questions about that data set?

In addition to dabbling in x-phil research, I've been raised since academic infancy on the "scientific method" and currently run my own experimental lab at the psychology department of U of Wyo. Thus, I read Bernstein's article with much interest. I feel quite strongly that in order for x-phil to successfully build a bridge between the philosophical and experimental disciplines, it must embrace the doubly difficult task of being competent in both.

Nonetheless, I'd like to point out a few worries that I had with Bernstein's paper.

First, the reported p values. There is an important difference between reporting the p value of your statistical analysis (e.g., p<.001) and establishing your alpha level (α). The standard alpha level in the psychological literature is α = .05. Most psych articles do not explicitly state their alpha level unless they are deviating from this norm. Thus, a researcher may report multiple p values in an article (one ANOVA comes out significant at p = .01, another at p = .03), but this doesn't mean that he/she is assuming different alpha levels at different points -- he/she is merely reporting the various p values, which are all significant against the background assumption that .05 is the standard of significance being employed. Which means that there is nothing necessarily "embarrassing" (or confused) about reporting significant results as “p<0.05”, “p<0.01”, and “p<0.001” for multiple tests on the same data set. [Of course, there is another worry floating around the vicinity, which has to do with making adequate adjustments - e.g., Bonferroni - to one's alpha level to accommodate multiple analyses being run on the same data set, but I take it that that isn't the worry being discussed here.]

Second, about Likert scales. Yes, strictly speaking these are ordinal (not interval) data. Nonetheless, many statisticians argue that ANOVAs are robust enough statistical tests to warrant the treatment of ordinal data as interval data. Indeed much of my research -- including my dissertation research (and three of my committee members teach grad level stats courses) -- looks at within and between subject mean differences using Likert scale data. Though it is certainly a matter of some controversy, it is pretty much standard practice in psychological research to treat Likert scale ordinal data as interval data. Which means running ANOVAs (and other interval data analyses) is typically fine.

This does NOT mean that categorical (nominal) data can be treated as interval data (though there are things you can do with categorical data when it is dichotomous).

Third, though running double-blind studies (that is, studies where both the participant and researcher are blind) would certainly be ideal, I know very few researchers (come to think of it, off the top of my head, I don't know any) who don't run their own studies, pass out their own questionnaires, and administer their own measures to participants (or have trained RAs administer them) -- all while knowing the research question in advance. It just isn't feasible to do it otherwise -- nor, if handled properly, are the risks of contaminating the data great enough to warrant the hassle of running double-blind studies.

There are, of course, many steps that researchers can take to minimize the effect their knowledge can have on participants' responses, such as keeping vital participant info in folders or envelopes where it cannot be seen until after the study, etc. And it is certainly something to think seriously about when designing a study.

The worry raised about data collection (random sampling) is definitely a worry for x-phil-ers who are simply polling their classes or other non-random samples. One reason why it is a worry is that there are definite self-selection biases at work (students who take philosophy classes, for example, are likely to differ in important ways from students who do not, etc). In addition, students have often come to know their professors well enough that their answers can be biased by what they would expect their professors to want answer-wise -- and things of that nature. It would thus be best for x-phil-ers to try to poll participants in classes other than their own if there is no way to get access to a research pool, etc. It would also be good to try to sample many different populations (even if this just means students from multiple different classes in different disciplines).

But, this is not a criticism of x-phil alone. For example, most of the researchers in my department (and other departments across the country) get their participants out of the psych research subject pool, which is made up of Intro to Psyc undergraduates and the like. Yet, our papers get published in reputable experimental journals. It is, in practice, quite difficult to meet the requirements for a fully randomized and representative sample of any given population of interest. In reality, all x-phil-ers are doing here is joining the club. [Of course, this is not a reason not to strive for as random and representative of a sample as possible].

Otherwise, I think Bernstein's concerns about inadequate understanding and employment of statistical analyses are warranted -- I've also seen chi-squares being used with inadequate cell sizes, studies with inadequate sample sizes, analyses being inadequately reported (no means, SDs, effect sizes, etc), methods being inadequately articulated, etc. In addition, data are often not being examined ahead of time for issues such as normal distribution, homogeneity, etc (or, at least this is not being reported). In sum, I agree that x-phil research would greatly benefit from developing a better understanding of the whys and wherefores of statistical analysis (which can be accomplished, in part, by developing a working relationship with your university stats dept).

A few things that I didn't see mentioned in the Bernstein article that I think are important: all researchers should be collecting and reporting the demographic information of their participants. In addition, I think x-phil papers should consider adopting the general format of empirical research papers (e.g., Intro, Methods, Results, Discussion). The general rule of thumb for the writing of a methods section, for example, should be to report enough information that someone else, having read it, could easily replicate the study.

Well, there is a lot more to say on this topic, but it is getting late and this post is getting out of hand. Perhaps we should consider putting together some sort of a "stats 101" workshop for x-phil-ers at some upcoming conference (e.g., SPP). I'd definitely be interested!

One quick additional comment about reporting p values (something I forgot to mention in my previous post): there are two basic styles for p value reporting. One is to establish your alpha (say, at .05) and then simply report all of the p values of your analyses as being either less than your alpha level or not - so you'd say for every analysis that it either came out at p<.05 or was not significant. The other way (the one I prefer) is to establish your alpha (.05) but then report all the specific p values of your analyses (p=.02, p=.003, p<.001 -- p<.001 is what you report for any p value that comes in at .000 -- etc).

So, the problem with reporting multiple p's with < instead of with = (that is, as p<.01, p<.05) is that it appears to the reader as if you are employing more than one alpha level -- which would be improper. This, I gather, is Bernstein's worry. But it is an easy enough fix.

"I get the part about how people should report exact p values, but beyond that issue, why is this finding in Bernstein's paper particularly worrisome? Don't different tests on the same data set typically look at different questions about that data set?"

"Which means that there is nothing necessarily "embarrassing" (or confused) about reporting significant results as “p<0.05”, “p<0.01”, and “p<0.001” for multiple tests on the same data set."


Yes, I think it is embarrassing that philosophers don't seem to know this, especially those that are publishing results with significance tests.

Let's give a simple example: suppose I have two groups--an experimental group and a control group. I give the experimental group a new drug, and the control group a placebo. I now measure, for each group, 20 different variables, let's say, percentage who have headache, percentage with stomach pain, percentage who get cancer, etc.

I now find that with regard to one of these 20 variables, let's say headaches, the experimental group has fewer headaches than the control group, and I say the results are at the p<.05 level, with the implication that my results are statistically significant.

BUT THEY AREN'T. Since I have run 20 different tests on the same data, it's as likely as not that one will come out at <.05 just by chance. But the whole idea of reporting p<.05 is to imply that the results were only 5% likely to come out that way by chance.

You cannot naively report p values if you are running more than one test on your data. It's totally bogus. And yes, it's embarrassing that philosophy journals are letting this kind of mistake slip by. It tells me that their peer review process is not up to the demands of vetting this kind of work for minimal competence.
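
The arithmetic behind the 20-variable example can be sketched in a few lines. This is only an illustration under two simplifying assumptions that the comment does not state: the 20 tests are independent, and every null hypothesis is true. The function names are made up.

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one false positive across m independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** m


def bonferroni_alpha(alpha, m):
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / m


m = 20  # number of outcome variables in the drug example

# Testing each variable at .05 with no correction
print(round(familywise_error_rate(0.05, m), 3))

# Testing each variable at the Bonferroni-corrected threshold .05 / 20
print(round(familywise_error_rate(bonferroni_alpha(0.05, m), m), 3))
```

Under these assumptions the uncorrected chance of at least one spurious "significant" result is roughly 64%, somewhat worse than a coin flip, while the Bonferroni-corrected threshold brings it back just under 5%.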

I agree with Bernstein that we, experimental philosophers, can and should improve the design and data analysis of our experimental studies.

However, like other commentators in this thread, I believe that she exaggerates the problems with X-phi studies by expecting experimental philosophers to meet some norms of good design that are not met by psychologists and social scientists. It is thus misleading to present some of her norms of design as "the most basic demands of experimental design and data analysis".

A few examples follow:

1. Random sampling. Random sampling is assumed to increase the external validity of experiments. Now, in psychology, but also in other sciences such as medicine, random samples are typically not used. Rather, people use convenience samples, because random sampling is practically impossible. (Strangely, Bernstein mentions only the random sampling of subjects, while random sampling applies to the different units of an experiment, such as treatment, measure and context).

2. Reporting several p. You merely have to browse, say, Cognition for 5 minutes to see that typically, psychologists report several p.
(A note on one of the comments in the thread: it is true that when several independent tests are done with the same data, the alpha needs to be decreased. But, to my knowledge (and pace Bernstein), this is at best a rare situation in experimental philosophy.)

3. Reporting exact alpha levels. Again, reading Cognition for 5 minutes shows that not reporting exact alpha levels is the common practice in psychology.

4. Blind studies. See Chris' comment.

I could go on (see the list of questions at the end of her paper).

It is surely the case that the methodology used by social scientists can also be improved. But, it would be unfair to expect experimental philosophers to be more rigorous than most social scientists. It is in fact ironic to see that some of the studies targeted by Bernstein have been written by professional psychologists (e.g., Rips et al.; Woolfolk et al.).

More problematic, Bernstein does not seem to understand that experimentalists always weigh the benefits of meeting some norm of good design against the practical costs of doing so. It is perfectly reasonable not to meet a norm of good design (e.g., blinding or random sampling) when it is impracticable to do so.

She also writes a few puzzling things.

1. IRB: As mentioned above, typically experimental philosophers ask for IRB permission. And they'd better ask (see the NYT of Thursday, last week).

2. Bernstein strangely speaks of accepting the null hypothesis. This makes little sense, since, as argued by Meehl, when H0 is that two means are identical, the null hypothesis is always false. Failing to falsify the null hypothesis is not accepting the null hypothesis.

3. Bernstein writes: "data structure is known before data are collected". This is puzzling and I may not understand what she has in mind. It seems to me that very often we need to test that our data have a specific structure (e.g., that assumptions of normality are met) AFTER we have collected our data (but BEFORE we run any test on them).

4. I believe (but might be wrong) that applying a t-test to a dichotomous variable is mathematically identical to applying a chi-square test. So, there might be no mistake in using a t-test with a dichotomous variable.
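
The last point in the list above can be checked numerically. As a minimal sketch, assuming a pooled-variance two-sample t-test on 0/1 scores and a Pearson chi-square without continuity correction, the two statistics are not literally identical but are tied by the exact relation t² = χ²(N − 2)/(N − χ²), so they agree closely once N is reasonably large. The counts and function names below are made up for illustration.

```python
from math import isclose


def pearson_chi2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def pooled_t_squared(a, b, c, d):
    """Squared pooled-variance t statistic, treating the outcome as 0/1 scores.

    Row 1 is group 1 (a ones, b zeros); row 2 is group 2 (c ones, d zeros).
    """
    n1, n2 = a + b, c + d
    p1, p2 = a / n1, c / n2
    # Within-group sums of squares for 0/1 data: n * p * (1 - p)
    ss = n1 * p1 * (1 - p1) + n2 * p2 * (1 - p2)
    pooled_var = ss / (n1 + n2 - 2)
    return (p1 - p2) ** 2 / (pooled_var * (1 / n1 + 1 / n2))


table = (30, 20, 18, 32)  # made-up counts: 30 of 50 "yes" vs. 18 of 50 "yes"
n = sum(table)
chi2 = pearson_chi2(*table)
t2 = pooled_t_squared(*table)

print(chi2, t2)
print(isclose(t2, chi2 * (n - 2) / (n - chi2)))
```

With 100 participants the two statistics differ only slightly, so in practice the two tests will rarely disagree about significance, though cell-size requirements for the chi-square still apply.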

In response to what "anon" last posted:

1) As I said in an earlier post, I don't think the worry raised by Bernstein about the p values was a worry about correcting one's alpha level for multiple tests on the same data (though it is a worry).

2) As with most issues, when it is necessary to adjust one's alpha level and when it is not is a tricky (and heavily debated) issue. Remember that there are two types of error that a researcher must be concerned about: not only is there the worry that you will discover a significant relation that isn't a genuine relation, but there is also the worry that you will fail to discover a genuine relation. If you adjust your alpha at times when it isn't called for (and just because you are running multiple analyses on the same data set does NOT automatically mean that an adjustment is called for) -- or if you over-adjust your alpha -- you risk the second type of error.
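
The trade-off described in the second point can be illustrated with two common corrections. This is a sketch only: the three p values are hypothetical (two echo the ANOVA example earlier in the thread), and Holm's step-down procedure is brought in here as one standard, less conservative alternative to Bonferroni, not as something any commenter proposed.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject each hypothesis whose p value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down: compare the k-th smallest p value to alpha / (m - k + 1),
    stopping at the first one that fails."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (m - rank):
            break  # this and all larger p values are not rejected
        reject[i] = True
    return reject


p_values = [0.01, 0.03, 0.001]  # hypothetical results from three analyses

print(bonferroni(p_values))
print(holm(p_values))
```

Both procedures control the familywise error rate at .05, yet Bonferroni drops the p = .03 result while Holm keeps it. That gap is one concrete form of the second type of error mentioned above: a stricter correction than necessary can cost a genuine finding.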

These are complicated issues that many experimental psychologists (and other scientists) don't adequately understand. Thus I don't think it is an automatic black mark against philosophers (or philosophical journals) that they don't adequately understand it either. It simply means there is room for improvement -- welcome to academia.

All—

I very much appreciate the comments/criticisms you’ve posted about my paper. As Jonathan indicated in his introduction, I am very sympathetic to experimental philosophy and sense that my criticisms were received in the intended spirit. One overarching theme in the replies is the recognized need to begin an informed discussion about these issues, leading hopefully to some working standards or guidelines for doing and publishing your research. Perhaps the discussion has begun?

Now a few specific replies to your comments.

Eric and Thomas: My background is in animal research. Thus, I have had dealings with IACUCs (Institutional Animal Care and Use Committees). Even though my studies were non-invasive, getting IACUC approval on a protocol is not trivial. I may have needlessly projected my experiences onto IRBs. If you all are having no difficulties with IRB, great! In my defense I will point out that only one of the papers I reviewed indicated that the experimental protocol was IRB approved.

Eric: You may be correct that I evaluated these papers against standards that are higher than those typically employed in research psychology. I don’t think this is a bad thing; perhaps it’s most instructive to be evaluated against the highest of standards, and to re-visit high standards occasionally. Periodically the journals I read (more biology than psychology) publish critiques of the methods and statistics used in papers published in those very journals. I wouldn’t be surprised to find that the major psychology and social science journals have done the same. With respect to why it’s important to indicate what software was used, part of the reason is replicability (as Jen Wright discusses), but the more practical reason is that, as evidenced by the graphs, a lot of the stats are being done with Excel, which has a lot of issues. Also, see the UCLA web site I cite. As I recall there’s a discussion of the pros and cons of various software with respect to survey data analysis.

Chris: It’s always relevant to avoid the appearance of bias. As Jen Wright points out, it’s not always feasible to exclude potential biases. But by doing what one can do, and by recognizing what one cannot practically do, there’s less room for actual bias to creep in. To your point about reviewers: first, not all X-Phi papers are being submitted to journals that one would assume have reviewers able to evaluate design and analysis issues. Second, it’s embarrassing how many papers in biology (writ large) journals contain errors: some meta-analyses have found about half contain critical errors and half of these lead to incorrect conclusions. So I don’t think one can assume that the review process is flawless, as Anon(2) also notes.

Andrew: Thanks. I agree that nothing I’ve presented in the paper can’t be learned in an undergrad survey course. I would caution though that if X-Phiers are inclined to take a survey course, they should take one offered in the psychology or biology department, not statistics. Take a look at the syllabus: the emphasis should be on design and analysis (with theory), not memorization of equations.

Anon(1): Jen Wright has done a good job of trying to explain away this problem (my misunderstanding that different p< values signified three different α levels), but Anon(2)’s reply is right on target. Multiple tests on the same data set should be avoided. Had these papers reported precise p-values, they could have been criticized for multiple tests (something I didn’t look at) but my inference would have been different.

Thomas: n=126, give or take. This post made me chuckle. I praised Thomas et al. for a good, although unintentional, outcome. Thanks again for your positive comments.

Joe: Call me old-fashioned but I (and others) think there is a fact of the matter about data structure. I think numbers (and subsequent operations, like statistical tests) should be agnostic to the research question. Which is NOT to say that conclusions based on test results should be agnostic; statistical significance is not always biological (X-Phi) significance. But if you all want to conceive of numbers differently, that’s for you as a community to decide.

Andrew’s response to Joe: Elsewhere I’ve been asked why X-Phiers should be bound to standards such as are being discussed here. Again, I’m not trying to insinuate myself in your field, or to dictate what conventions and standards you adopt (if any). But as you say, “if x-phi-ers manage to capture the attention and trust of their colleagues in the social sciences they may have an opportunity to improve research practice. … philosophers may have to learn more about research design and statistics and may have to form new relationships with experts in those fields before they are well-positioned to do this.” This would be a good reason to at least discuss the standards as they already exist. Thanks.

Anonymous: See Anon(2)’s posting.

Jen Wright: Jen raises a number of important issues. I agree with many of her points, and disagree with several. For example, just because α=0.05 is conventional in psychological research doesn’t mean that it’s necessarily correct. In general, smaller samples call for smaller α values. Likewise, just because doing ANOVAs on ordinal data is standard practice doesn’t make it statistically legitimate. But these discussions, while useful, would miss my main point which is that as a community those engaged in X-Phi research should establish standards and guidelines appropriate for the kinds of research being done. The bottom line would be that if you all decide that doing ANOVAs is acceptable, you should still be aware of objections and should be able to argue in favor of doing them. One point, though—I’m not sure why one would engage this controversy since there are statistical tests that can handle ordinal data quite well.

Jen has also picked up on something implicit in my review of these papers. Errors of omission in the papers themselves probably accounted for a number of bad “scores”. Jen writes: “There are, of course, many steps that researchers can take to minimize the effect their knowledge can have on participants' responses… .” It could very well be the case that folks took these steps, but failed to mention them in print. This, along with other items Jen mentions such as data exploration to test assumptions, goes to the need to report enough information so that the results can be replicated, as she also emphasizes.

One final note about Jen’s suggestions on how to avoid sampling bias—for me at least it’s disconcerting that some studies don’t indicate sex composition of the sample. The assumption that males and females have, e.g., the same intuitions about intentions and so forth is just that, an assumption.

Anon(2): Couldn’t have said it better. Thanks!

Finally, I see as I get ready to post these replies that I have been taken to task for saying “accepting the null hypothesis”. I stand corrected; that was somewhat sloppy writing on my part (I do know better). One fails to reject a null hypothesis. There’s a big difference.

Thanks again to Jonathan for hosting my paper, and to Thomas for urging me to “join the fray” as he put it.


Marica: Just to clarify, anon(1) and anon(2) are numerically identical. I was just clarifying and defending my point in the first post, which is that it should be embarrassing that we make errors like this.

I don't care whether social scientists also make these mistakes sometimes. I'm sure they do. But we are philosophers; we are supposed to be the defenders of care and rigor--the people who think carefully about our conclusions, and who understand the foundations of things like inference and probability. If not that, what special skills are we supposed to have?

As to Edouard's point that: "Bernstein does not seem to understand that experimentalists always weigh the benefits of meeting some norm of good design against the practical costs of doing so. It is perfectly reasonable not to meet a norm of good design (e.g., blinding or random sampling) when it is impracticable to do so." It is one thing to deliberately fail to meet some norm of good design because it's impractical, and another to fail to display awareness of it. The operational difference is a careful discussion of the methods and their possible shortcomings.

I think the $64 question, which hasn't really been addressed here, is whether or not philosophy journals should be publishing experimental work--whether their peer review structure is equipped to handle it. How long before we have a "social text" style fiasco?

To Edouard's point: "Bernstein does not seem to understand that experimentalists always weigh the benefits of meeting some norm of good design against the practical costs of doing so." I beg to differ. I am well aware that experiments are conducted in a world where extraneous practical constraints (time, money, space) weigh heavily in the process of decision making pertaining to design, data collection and so forth. But as Anon says, "It is one thing to deliberately fail to meet some norm of good design because it's impractical, and another to fail to display awareness of it." If a reviewer asks for one more measure of X, and all the animals were sacrificed months ago (and the experiment took two years to conduct), I kick myself, and admit in the Discussion section that one more measure of X would have been a good thing. I have had this conversation with neurophysiologists who work with primates. They record from five cells in each of four monkeys and say n=20. Wrong. Those five cells are nested in an n of 4. Is it practical to record from 20 monkeys? No. Can they justify, not only practically but theoretically (often by appeal to previous empirical results), why they can treat the cell as the unit of analysis? Yes.
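The nesting point can be made concrete with a small sketch. The firing rates and monkey labels below are invented purely for illustration; the only thing taken from the comment is the layout of five cells in each of four monkeys. Collapsing to one mean per animal is just one simple way to respect the nesting, not the only defensible analysis.

```python
from statistics import mean

# Hypothetical firing rates (spikes/s): five cells in each of four monkeys.
# All numbers and labels are made up for illustration.
recordings = {
    "monkey_A": [12.1, 13.4, 11.8, 12.9, 13.0],
    "monkey_B": [18.2, 17.5, 19.1, 18.8, 17.9],
    "monkey_C": [9.4, 10.2, 9.9, 10.5, 9.7],
    "monkey_D": [15.0, 14.6, 15.8, 15.2, 14.9],
}


def count_cells(data):
    """Naive sample size: every cell counted as an independent observation."""
    return sum(len(cells) for cells in data.values())


def subject_means(data):
    """Collapse the nested cells to one mean per animal."""
    return {subject: mean(cells) for subject, cells in data.items()}


n_cells = count_cells(recordings)    # the "n=20" in the comment
per_monkey = subject_means(recordings)
n_subjects = len(per_monkey)         # the nested n of 4

print(f"cells recorded: {n_cells}, independent subjects: {n_subjects}")
```

Treating the cell as the unit of analysis then becomes an explicit choice that has to be argued for, rather than a default.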

Anon says: "But we are philosophers; we are supposed to be the defenders of care and rigor--the people who think carefully about our conclusions, and who understand the foundations of things like inference and probability."

I'm not a philosopher (I don't even play one on TV). That said, I'm not completely ignorant of your questions and methods. I admire your rigor. Design and analyses are no less rigorous. What Anon, I take it as a philosopher, said is exactly what I would say if I were one of you.

Marica,

I am sure that as a scientist, you are fully aware that there are numerous practical reasons that justify not meeting specific norms of design.

But you seem to have forgotten this point in your discussion of experimental philosophy. To illustrate, you take random sampling and blind data collection and analysis to be norms of design that experimental philosophers should try to meet. But it is obvious that experimental philosophers, like most psychologists, cannot practically meet these two norms. Since ought entails can, as philosophers like to say, expecting experimental philosophers to meet these norms of design seems to me to be inappropriate.

And note that there is little reason for experimental philosophers and for psychologists to discuss the fact that random sampling and blinding were not applied to their studies, since it is obvious that they were not.

More generally, you are weakening your critique of experimental philosophy by mixing together this kind of objection with valid criticisms--e.g., the fact that some studies have misapplied specific statistical tests.

Edouard,

Even if this is right, there are still many things that can quite easily be done to improve -- e.g., collect data from a wide range of classes in different departments, do not collect data from a class that you are teaching, have someone else unfamiliar with the research (or at least with the specific research hypothesis) administer the surveys... These things, though certainly less convenient, are easy enough to do.

I am not sure I see the point of evaluating "the primary experimental philosophy" literature as a whole. Whether a particular piece in experimental philosophy meets working (as opposed to ideal) standards of scientific research is best assessed by whether it is published and cited by relevant journals. If an experimental study is published and cited in, say, Cognition or JPSP, then it, pretty much by definition, meets current scientific standards and conventions. If it is not published, then that means that it must be improved, and I don't think that it is either possible or necessary to teach scientific methodology to philosophers in conference papers. These matters are endlessly elaborated on in dozens of books on methodology and statistics anyway.

It strikes me that the point of the criticisms is just what Jonathan referred to: constructive suggestions by an interested colleague who has expert knowledge in research design.

To my mind, this sort of work is an act of generosity and the best response is to avoid defensiveness or nit-picking in favor of working hard to understand ways that the criticisms can help x-phi-ers to improve their professional practice.

I think it's good to take constructive criticism, but I think there's something more here. I really do think some of the criticisms are irrelevant, and adopting the suggestions would merely add more headaches. This is particularly true of the blind researcher. There is absolutely no reason to believe that if Joshua Knobe or Edouard Machery hand out their own vignettes to people in a park or a classroom, their knowledge of the hypothesis will have any effect on the results. None whatsoever. In fact, I can't think of a single exp-phil study I've read where that would be a reason.

It is true that avoiding the appearance of bias is very important. If there is data to be coded in such a way that interpretation is involved, for example, it is better that either the coders be unaware of the research hypothesis, or unaware of the conditions in which the subjects had been run, or both. This is also true if researchers are doing verbal interviews. But when all you're doing is handing out pieces of paper to passersby, people from a subject pool, or students in psych classes, there's really no potential for experimenter bias, and no reason to go out of your way to avoid it.

And I agree with the above points about reporting p values. Since we're all doing stats with computer programs now, instead of by hand, we can get exact p values rather than looking up p values in tables. So reporting exact p values is nice, but it is still OK to report p values as being less than certain standard alphas (.05, .01, .001, etc.). The convention, used in many fields related to x-phil, including neuroscience, is to report anything below .05 as significant, and to report the p values as being less than the closest conventional p value. That is, everyone implicitly accepts an alpha of .05 for behavioral measures, but they can and do report p values as being less than the closest conventional alpha of .05 or smaller. Thus, you may have p values less than .05, .01, and .001 for analyses of the same data set, and report them in those ways. There's no methodological reason for not doing it, except that exact p values are available, and it sometimes seems silly not to report them.
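As a rough sketch of the reporting convention described above: the function below maps an exact p value to the smallest conventional threshold it falls under. The function name, the default thresholds, and the "n.s." label are assumptions made for illustration, not a rule from any particular style guide.

```python
def report_p(p, thresholds=(0.001, 0.01, 0.05)):
    """Return a string such as 'p < .01' for the smallest threshold p falls under.

    Values at or above the largest threshold are reported as 'n.s.'
    (not significant). Thresholds and labels are illustrative assumptions.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p value must lie between 0 and 1")
    for threshold in sorted(thresholds):
        if p < threshold:
            # Drop the leading zero, as is common when printing p values.
            return "p < " + f"{threshold:g}".lstrip("0")
    return "n.s."


# Hypothetical p values from three analyses of the same data set.
for exact_p in (0.03, 0.004, 0.0002):
    print(exact_p, "->", report_p(exact_p))
```

Because the exact value is already in hand, printing it next to the thresholded label costs nothing, which is the commenter's point about exact p values.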

One other criticism she makes seems unimportant, as well. That is, reporting your critical significance level. If you're doing medical research, you're probably using a more conservative critical level (.001, say), but if you're doing behavioral research, you're almost certainly using .05. The only reason to discuss the critical level is if you're not using .05, and then you should give a reason for it. That's not to say that you can't report p values as less than .01 or .001, just that, if you're only accepting p values of less than .01 or .001, you should say so and tell us why.

I think the rest of her criticisms are important. In particular, it's of the utmost importance to use the correct statistical test, it's important to discuss the makeup of your sample, as well as how you obtained it, and to randomly assign members of your sample to your different conditions. And it's important to discuss the limitations of your study and its potential for generalization.
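Random assignment to conditions, in particular, takes only a few lines. The sketch below assumes a hypothetical list of participant IDs and two made-up vignette conditions labeled "harm" and "help"; the fixed seed is there only so the illustration is reproducible.

```python
import random


def randomly_assign(participants, conditions, seed=None):
    """Shuffle participants, then deal them out to conditions in turn.

    Group sizes differ by at most one. A seed makes the assignment repeatable.
    """
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    assignment = {condition: [] for condition in conditions}
    for index, person in enumerate(shuffled):
        condition = conditions[index % len(conditions)]
        assignment[condition].append(person)
    return assignment


# Hypothetical participant IDs and condition labels.
participants = [f"P{i:02d}" for i in range(1, 21)]
groups = randomly_assign(participants, ["harm", "help"], seed=42)

for condition, members in groups.items():
    print(condition, len(members), members)
```

Shuffling once and dealing in turn keeps the groups balanced; drawing a condition independently for each participant would also be random assignment, but could leave the groups unequal in size.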

Perhaps I have missed the boat with experimental philosophy. I have always viewed the projects of X-philosophers to be saying something akin to this, "Here is something that looks like it's happening. What is philosophy to do about it?" This makes experimental philosophy more interesting than unlikely intuition pumps like the Trolley Case.

Beyond the hubris of Weinberg, Nichols and Stich, who claim to have tested intuitions, "par excellence", it seems X-phiers have been more modest about their projects. Are we now claiming that we're doing science?

Eh, I might mention that this question is not rhetorical. Some of the criticisms of the Bernstein article lead me to believe I've misunderstood what experimental philosophy is about. It seems many commentators here believe experimental philosophy is science.
