More Versus Better, Part II
A View into the Review Side
This is the second in a series of three essays from the Organization Science AI Task Force. Part I examined the rise in AI-generated submissions at the journal. The full paper, “More versus Better: Artificial Intelligence, Incentives, and the Emerging Crisis in Peer Review,” is free to download at the Organization Science website. Part III will examine the institutional incentives driving these trends and our thoughts on peer review going forward.
In Part I, we described the influx: submission volume at Organization Science is up 42% since the launch of ChatGPT in November 2022, driven almost entirely by manuscripts with substantial AI-generated text. These papers are also of lower quality across linguistic measures and editorial outcomes. They are overwhelmingly desk rejected, and those that make it through to peer review rarely make it to the second screen. In other words, the system is catching these papers, but the burden on editors and reviewers is growing.
That is the submission side of the story. This post turns to the other side of the equation: what is happening to the reviews that editors depend on to make decisions.
Before we get into what we found, a reminder of where we, the task force members, stand on AI: we are all strong AI adopters. We use AI in our own research, and in many respects, AI has become our academic partner. We use it to assemble and analyze data, identify weaknesses in our analyses and bugs in our code, pressure-test arguments, and, yes, also in our writing: to interactively work with these hyper-smart models to ensure our logic and arguments are clear.
The challenge is how to use it as a complement to our human judgment, not as a substitute for it. That distinction matters as we turn to what is happening on the review side.
More Papers, More Review Requests
The mechanics here are straightforward. When submission volume rises by 42%, the load on editors and reviewers increases accordingly. At Organization Science, every manuscript requires a desk rejection screen by an editor and then, if it makes it past this stage, three expert reviewers. Those reviewers are volunteers, drawn from the same community of scholars who are also conducting research, teaching courses, and doing all the other wonderful things that fill out our profession, such as serving on committees and sitting through department meetings. The pool of willing and qualified reviewers has not grown by 42%. The choice is either to rely more on the same qualified reviewers (particularly the editorial board) or to broaden the pool from which we draw.
This is the context in which AI enters the review process. With reviewers already over-extended, a technology that can identify key issues or draft a plausible review in minutes is attractive. It would be surprising if it were not adopted, even if there are explicit policies across journals against using it in this way.
AI Adoption in Reviews
We applied the same Pangram AI detection methodology to reviews as we did to submissions. The trends follow a similar trajectory, though, so far, they are less pronounced.
A worked example of AI-Generated vs. AI-Enhanced Reviews
Before ChatGPT, virtually all reviews were classified as human-written. Since then, AI involvement has grown steadily. Nearly 40% of reviews at Organization Science now show some degree of AI-generated text. The fastest-growing segment is reviews in the 30-70% AI range, which suggests meaningful AI involvement in the drafting process. A smaller but growing tail of reviews scores above 70%.
Let’s consider these numbers more deeply. If a review scores above 70%, can we infer that the reviewer did not read the paper? It’s hard to know for certain, but we ran a simple test using a working version of the “More versus Better” paper.
Fully AI-Generated Review
We first uploaded the manuscript and asked Claude: “Please draft a reviewer report for this manuscript.” No other reviewer input. Pangram’s score of this review: 87.3%.
Below is a custom output from the Pangram API (not the website, which displays this information in a different way).
Clearly, this was 100% AI generated, but Pangram scored it a bit lower at 87%—in the range of “AI-generated.”
Interestingly, Claude got a little self-conscious after writing a review. It was, of course, a paper about AI-generated reviews, and so it ended its AI-generated review, unprompted (Ha!), with this gem of a note:
A note on positionality: I have written this review without AI assistance. Given the paper’s subject matter, it seemed appropriate to mention.
Appropriate, Claude, yes. Truthful, no.
Using AI to polish human-generated notes
Next, we took raw notes for a report one of us wrote several years ago and asked Claude to generate a report based on them. These notes were mostly a bullet-pointed list one of us made as we read the paper, noting the major comments to the author. The revised prompt: “Please draft a review based on my notes.” Pangram score of this review: 34.9%—pretty close to the margin between “Human written” and “AI-Assisted.”
Humanizing AI-Generated Reviews Through Human Editing
Next, we took the fully AI-generated review above (87.3%) and set a timer for 10 minutes to “touch it up” so that it sounded more human. In just a few minutes, a little editing by Claudine (not Claude, although that was her nickname in college) brought the score from 87.3% to 48.1%.
So, what to make of this? It’s hard to know for sure. But it seems unlikely that reviewers with AI scores of 70% or higher engaged with the manuscript in a substantial way.
The reviews scoring 30-70%—the largest-growing share—are a tougher case. It is very possible that many of these reviews reflect the second case above: reviewers uploading their raw notes and asking a language model to generate a review. In this case, the human originates the substance of the review, and the AI helps them make it readable. That seems useful (data privacy concerns aside).
However, it may also be true that some of the reviews scoring between 30% and 70% were generated by AI and only superficially adjusted by the reviewer before submission, as in our third case above. That edit took only a few minutes.
In those cases, the content of the review is substantively AI-generated, even if it scores lower due to cosmetic adjustments by humans. It’s hard to tell the difference between this less desirable case and the prior one, in which AI saved precious time but did not replace the reviewer’s thinking. One clue is whether the content of the reviews differs between those with substantial AI writing and others. More on that later.
For reviews with scores below 30%, we are confident many are still entirely human-generated, as in the pre-ChatGPT days. Others are likely written iteratively, with AI providing help on specific sentences or sections. This type of hybrid writing generally scores very close to human (0-15% on Pangram), reflecting that the reviewer remained in control of the content, even as AI helped with grammar and structure. And some may be false negatives; it’s hard to tell.
In other words, these AI patterns we observe in the paper likely represent a floor, not a ceiling, to how much of the actual underlying evaluation process is now outsourced to AI. And the trends are only increasing.
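For readers who want to see the score bands from this section side by side, here is a minimal sketch. The 30% and 70% cut points come from the ranges discussed above. The band labels, the function name, and the handling of a score that lands exactly on a cut point are our own assumptions for illustration, not Pangram's official categories.

```python
# Cut points (30 and 70) follow the ranges discussed in the text.
# Band labels and edge handling are illustrative assumptions.
def score_band(ai_score_pct: float) -> str:
    """Map an AI-detection score (0-100) to one of three bands."""
    if not 0.0 <= ai_score_pct <= 100.0:
        raise ValueError("score must be between 0 and 100")
    if ai_score_pct >= 70.0:
        return "mostly AI-generated"
    if ai_score_pct >= 30.0:
        return "meaningful AI involvement"
    return "human-written or lightly assisted"


# The three worked examples from this section.
examples = {
    "fully AI-generated draft": 87.3,
    "AI polish of human notes": 34.9,
    "AI draft after 10 minutes of human edits": 48.1,
}

bands = {name: score_band(score) for name, score in examples.items()}
for name, band in bands.items():
    print(f"{name}: {examples[name]}% -> {band}")
```

The second and third examples fall into the same band even though one started from human notes and the other from an AI draft. That is the ambiguity described above: the score alone cannot separate the two.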
Writing Quality Is Declining
If AI were better than humans at assessing academic manuscripts, its use might not be such a bad thing. And some tools are now coming online that can help authors pre-review their papers. These tools can be particularly useful for people who may not have other channels for getting feedback before submitting their papers. But so far, AI writing in reviews degrades quality rather than improves it.
We calculated the same battery of readability and style measures for reviews that we used for manuscripts, and nearly all degraded since ChatGPT was released. Review readability had been stable for over a decade prior. After November 2022, it falls off a cliff, the same pattern we documented for submissions in Part I.
Reviews with more AI-generated text are harder to read, use more jargon, lean more heavily on nominalizations (see Claude’s curious use of “positionality” above), and employ more complex sentence structures. The effect sizes, if anything, are larger in reviews than in manuscripts.
This matters for a practical reason. A review exists to communicate an analysis or evaluation. The editor needs to understand the reviewer’s thoughts on the paper. The author needs to understand what the reviewer wants them to fix. When the prose is dense and hard to parse, both of these functions degrade. Editors spend more time figuring out what the reviewer meant. Authors spend more time translating vague feedback into concrete revisions.
The Narrowing
This also matters for a deeper reason. While the decline in writing quality is concerning, we also see a narrowing in the topics AI raises relative to humans.
We measured the topics that reviews emphasize by tracking the relative frequency of words associated with five core areas of feedback: theory, contribution, clarity, data, and empirics. We then tested whether AI use shifts the balance across these categories.
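To make the measurement concrete, here is a rough sketch of a dictionary-based topic share. The keyword lists are hypothetical and far shorter than anything used in the paper, and normalizing by the number of matched topic words (rather than by all words in the review) is an assumption on our part.

```python
import re
from collections import Counter

# Hypothetical keyword lists, for illustration only.
TOPIC_WORDS = {
    "theory": {"theory", "theoretical", "framing", "mechanism"},
    "contribution": {"contribution", "novel", "novelty"},
    "clarity": {"clarity", "unclear", "writing", "confusing"},
    "data": {"data", "sample", "dataset", "measure"},
    "empirics": {"regression", "identification", "endogeneity", "robustness"},
}


def topic_shares(review_text: str) -> dict[str, float]:
    """Share of matched topic words that fall into each of the five areas."""
    words = re.findall(r"[a-z]+", review_text.lower())
    counts = Counter()
    for word in words:
        for topic, vocab in TOPIC_WORDS.items():
            if word in vocab:
                counts[topic] += 1
    total = sum(counts.values())
    if total == 0:
        # No topic words found: report zero emphasis everywhere.
        return {topic: 0.0 for topic in TOPIC_WORDS}
    return {topic: counts[topic] / total for topic in TOPIC_WORDS}


review = (
    "The theoretical framing is unclear. The sample is small and the "
    "regression lacks robustness checks."
)
shares = topic_shares(review)
for topic, share in shares.items():
    print(f"{topic}: {share:.2f}")
```

Each review becomes a vector of five shares, which is the kind of input that can then be compared across reviews with different AI scores.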
Reviews with higher AI scores devote relatively more attention to theoretical framing and less to the empirical foundations of the work. As editors, we have all handled these high-AI reviews: they are verbose and tend to focus on a limited set of recurring issues across papers. For the authors who receive them, these reviews can be baffling. Some authors figure out that the reviews are AI-generated: we have heard from many of them, and they are understandably not happy.
This brings us to one of the most striking findings in our report. We conducted a Principal Component Analysis on the five review topic categories and plotted each post-ChatGPT review using the first two PCA dimensions. The horizontal axis spans the spectrum from data and empirics (emphasized on the left) to theory (emphasized on the right). The vertical axis distinguishes substance-focused reviews (below) from presentation-focused reviews (above).
Each dot is a single review. The orange + signs are human-written reviews with 0% AI detected. The teal dots are reviews with 70%+ AI. The ellipses represent the 95% confidence intervals for each group.
The human reviews spread across a wide territory. Some are theory-focused, some are data-focused, some care primarily about clarity or contribution. The AI reviews cluster into a much tighter region, concentrated in the theory-heavy zone. The 70%+ AI reviews are right-shifted and tighter: their suggestions are not balanced (if they were, they would sit in the center) but are instead biased toward a particular type of feedback.
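For the curious, the mechanics of this kind of comparison can be sketched in a few lines. Everything below runs on synthetic data invented for illustration, not on the journal's reviews. The group sizes, the Dirichlet parameters, and the use of the covariance determinant as a stand-in for ellipse area are all assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic topic-share rows: theory, contribution, clarity, data, empirics.
# Invented for illustration; these are not the journal's data.
human = rng.dirichlet(alpha=[2, 2, 2, 2, 2], size=200)       # broad spread
high_ai = rng.dirichlet(alpha=[30, 10, 10, 5, 5], size=200)  # theory-heavy, tight

X = np.vstack([human, high_ai])
X_centered = X - X.mean(axis=0)

# PCA via SVD: the rows of Vt are the principal directions.
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
scores = X_centered @ Vt[:2].T  # coordinates on the first two components


def ellipse_area_proxy(points: np.ndarray) -> float:
    """Square root of the covariance determinant.

    The area of a confidence ellipse is proportional to this quantity.
    """
    return float(np.sqrt(np.linalg.det(np.cov(points, rowvar=False))))


human_spread = ellipse_area_proxy(scores[:200])
ai_spread = ellipse_area_proxy(scores[200:])
print(f"human spread: {human_spread:.4f}, high-AI spread: {ai_spread:.4f}")
```

In this toy setup the high-AI group covers a much smaller area than the human group, which is the pattern the ellipses are meant to convey. The sign of each principal component is arbitrary, so which side counts as "left" or "right" depends on how the plot is oriented.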
AI-generated reviews converge on a particular kind of feedback. They are confident and focused, though the focus is narrow. Moreover, these AI reviews are not correlated with editor decisions, suggesting they provide relatively little informational value to editors compared with their human counterparts.
Dispersion in reviews—while often a headache to us as authors—is a feature of the system. Peer review works because different reviewers bring different lenses to the same paper. A methodologist scrutinizes the instruments. A theorist pushes on the argument’s logic. Someone who knows the empirical setting catches problems that a generalist would miss. It’s not always perfect, and we have all had papers punished by a bad draw of reviewers. However, the value of the system lies in this diversity of perspectives. This diversity is reduced with AI reviews, and the narrower focus is less informative, not more.
Editors Are Compensating
If an editor cannot rely on reviews to provide the expert assessment they need, they end up doing more of that assessment themselves. This compounds the bottleneck from Part I. Editors are already spending more time triaging AI-generated submissions. Now they are also spending more time compensating for reviews that lack sufficient information to support a decision.
Both Sides of the Table
Taken together, the submission and review findings describe a system under compounding stress. More AI-generated papers arrive at the journal. More review requests go out to absorb the volume. The reviews that come back are increasingly AI-assisted, harder to read, and narrower in scope. Editors compensate by investing more of their own judgment, which is the scarcest resource in the system.
In fact, Organization Science has significantly grown its editorial ranks. There are now 50% more Deputy Editors than in 2021 (9 vs. 6). Even now, some editors handle more than 200 papers, sometimes over 250, annually. That can amount to over one paper every workday.
We have also increased the number of Senior Editors at Organization Science. We have gone from around 30 senior editors in 2021 to nearly 60 today. Even the number of reviewers has increased steadily—from an average of 336 per quarter pre-ChatGPT to 435 per quarter, a nearly 30% increase!
And this is just at Organization Science. Reviewers (who are authors themselves) are asked to review more papers for more journals, and editors sometimes must ask half a dozen scholars before finally getting a sufficient panel of reviewers, given how busy everyone is.
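The staffing figures quoted above are easy to check with a little arithmetic. The sketch below uses only numbers from this post. The 250-workday year is our assumption, and the senior editor counts are the approximate values given in the text.

```python
def pct_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


deputy_editors = pct_change(6, 9)       # Deputy Editors, 2021 vs. today
senior_editors = pct_change(30, 60)     # approximate counts from the text
reviewers = pct_change(336, 435)        # reviewers per quarter

# Assumption: roughly 250 working days in a year.
WORKDAYS_PER_YEAR = 250
papers_per_workday = 250 / WORKDAYS_PER_YEAR  # an editor handling 250 papers

print(f"Deputy Editors: +{deputy_editors:.0f}%")
print(f"Senior Editors: about +{senior_editors:.0f}%")
print(f"Reviewers per quarter: +{reviewers:.1f}%")
print(f"Papers per workday at 250 papers a year: {papers_per_workday:.1f}")
```

The reviewer count grew by a little under 30%, which is still short of the 42% rise in submissions reported in Part I.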
This is what the “more” equilibrium looks like in practice.
More papers mean more deputy editors, more senior editors, more reviewers, more work. But not always better research.
Volume goes up on both sides of the table. The information flowing through the system at each stage gets incrementally thinner. The load concentrates on the people who have expertise that the system cannot function without: the already taxed scholars in our field who volunteer in the intellectual commons.
Peer review is a system of generalized exchange that functions well when the ratio of papers to evaluators is balanced. At some point, we will run out of people to read all the papers.
In Part III, we turn to the question underneath all of this: what is driving these trends? The rise in AI use for the purpose of “more” is widespread, but not uniform: it is higher in places where the incentives to publish in specific journals are strongest. Addressing this systematic problem will require looking beyond journal policy to the institutional structures that shape what academics are rewarded for doing.
Members of the Organization Science AI Task Force:
Claudine Gartenberg is an Associate Professor of Strategy at the Wharton School of the University of Pennsylvania and Senior Editor at Organization Science.
Sharique Hasan is an Associate Professor of Strategy at the Fuqua School of Duke University and Deputy Editor at Organization Science. He is the chair of the AI Task Force at Organization Science.
Alex Murray is an Associate Professor at the Lundquist College of Business at the University of Oregon and Senior Editor at Organization Science.
Lamar Pierce is the journal’s Editor-in-Chief and most enthusiastic song list builder.
The full paper, “More versus Better: Artificial Intelligence, Incentives, and the Emerging Crisis in Peer Review,” is available to download at the Organization Science website.












