Field Notes

Research Workflows / May 2026

Fragmented Research Workflows: The Hidden Cost of Scattered Documents, Data, and Code

Why context-switching between PDFs, notebooks, datasets, and drafts makes research harder to verify, reproduce, and build on.

Research rarely breaks because teams lack information.

It breaks because the information is split across too many places.

A paper lives in Zotero. The notes live in Notion. The dataset lives on a shared drive. The analysis lives in a Jupyter notebook. The figure lives in a folder called final_figures_v3. The manuscript lives in Word or Overleaf. The reasoning behind a key decision lives in someone's memory, a Slack message, or a comment thread that nobody can find six months later.

Each tool works well enough on its own. Together, they create a fragmented research workflow where evidence, analysis, and writing drift apart.

That drift is expensive.

It is expensive not only because it slows researchers down, but also because it makes research harder to verify, harder to reuse, and easier to forget.

Research is not a sequence of isolated tasks

Most research software treats reading, analysis, and writing as separate acts.

You read papers in one tool. You manage references in another. You analyze data somewhere else. You write the manuscript in a separate editor. You communicate in Slack, Teams, or email. You store files in Drive, Dropbox, OneDrive, or a local server.

But research itself does not work that way.

A claim in a manuscript depends on a citation. That citation depends on a passage in a paper. That passage may have been interpreted in a note. The note may have shaped an analysis. The analysis may depend on a specific dataset, cleaning step, model, notebook cell, or parameter choice. The final figure may depend on all of it.

The real work is not just the paper, the dataset, or the notebook.

The real work is the connection between them.

That is also the direction research data management has moved. The FAIR principles describe good research data as findable, accessible, interoperable, and reusable. Their original formulation explicitly applies not only to data, but also to the algorithms, tools, and workflows that led to that data. (Nature Scientific Data: The FAIR Guiding Principles)

In other words, reusable research is not only about storing outputs. It is about preserving the structure around them.

When those connections are not preserved, the project becomes harder to understand over time. A team can still see the final result, but the reasoning trail behind it becomes unclear.

That is where fragmentation starts to hurt.

The cost of lost context

At the beginning of a project, context feels obvious.

Everyone remembers why one dataset was excluded. Someone knows which paper supports a specific claim. A PhD student remembers why a model was changed. A PI remembers why the team narrowed the research question.

Then the project moves forward.

Files are renamed. Drafts multiply. People switch projects. A reviewer asks for clarification. A collaborator joins late. Someone tries to reproduce an old figure. A new team member needs to understand what has already been done.

Suddenly, the team is not doing research.

They are reconstructing research.

They search through folders, notebooks, PDFs, comments, old messages, and half-finished drafts. They ask around. They rerun code. They compare versions manually. They try to remember whether analysis_final.ipynb or analysis_final_revised_2.ipynb produced the figure that ended up in the manuscript.

This is not just an inconvenience. It is a structural weakness in how research work is organized.

The NIH Data Management and Sharing Policy, effective from January 2023, was introduced to promote scientific data sharing, validation of research results, and reuse of high-value datasets. (NIH Grants: Data Management and Sharing Policy Overview) That policy shift reflects a broader reality: research funders and institutions increasingly care not only about final publications, but about whether the underlying data and process can be found, understood, and reused.

Fragmented workflows make that harder.

When context is scattered, teams lose the ability to quickly answer basic questions:

  • Which dataset supports this result?
  • Which notebook generated this figure?
  • Which citation supports this sentence?
  • Why did we remove this variable?
  • What changed between the first and final analysis?
  • Which assumptions are still valid?
  • Who made this decision, and why?

The harder these questions are to answer, the harder it becomes to trust the final output.

Fragmentation makes verification harder

Verification depends on traceability.

If a claim appears in a manuscript, the team should be able to trace it back to its evidence. If a figure appears in a draft, the team should be able to trace it back to the code and data that produced it. If a conclusion depends on a dataset, the team should be able to see which version of the dataset was used and how it was transformed.
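One small piece of that trace, knowing exactly which version of a dataset was used, can be sketched with a content hash. This is a minimal illustration rather than a description of any particular tool: the function name fingerprint and the file name in the usage note are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file's bytes.

    The file is read in chunks so large datasets do not need to fit in memory.
    Two files with the same digest have the same content, whatever they are named.
    """
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording fingerprint(Path("survey_clean.csv")) next to a figure (the file name is hypothetical) makes it possible to check later whether the dataset on disk is still the one that produced it, even if the file has been renamed or copied.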

In fragmented workflows, that trace is often broken.

The manuscript contains the polished claim, but not the analysis history. The notebook contains the computation, but not the written interpretation. The PDF manager contains the citation, but not the reason it was used. The folder contains the dataset, but not the decision that led to its inclusion.

This creates a gap between output and evidence.

The reproducibility literature shows how real that gap is. A large-scale study published in Scientific Data re-executed R code from public replication datasets on Harvard Dataverse and found that most files failed to complete without error when first run in a clean environment, a direct obstacle to reproducibility and reuse. (Nature Scientific Data: Research code quality and execution) In computational biology, incomplete descriptions of simulations, missing software versions, incomplete documentation, and unavailable code are all common reasons why computational research becomes difficult to reproduce. (PLOS Computational Biology: Ten simple rules for reproducible computational research)

The problem is not that researchers are careless.

The problem is that the workflow does not naturally preserve the full chain from data to analysis to manuscript.

This matters even more as AI becomes part of research. If AI tools are asked to summarize, draft, or reason across research materials without access to the full context, they can produce fluent answers that are difficult to verify. The risk is not only hallucination. It is also that the system may not know how the evidence, analysis, and writing are connected.

A research assistant that can search files is useful.

A research assistant that understands how claims, citations, datasets, notebooks, and figures relate to each other is much more valuable.

Notebooks helped, but they did not solve the whole workflow

Computational notebooks were a major step forward because they brought code, outputs, and narrative closer together.

Jupyter notebooks, for example, make it possible to combine executable code, documentation, and results in one interactive environment. A 2024 study of biomedical Jupyter notebooks describes them as a popular mechanism for documenting and sharing computational workflows. (PMC: Assessing the reproducibility of Jupyter notebooks in biomedical publications)

But notebooks do not solve the full research workflow.

A notebook may show how an analysis was run, but it usually does not know where the result appears in the manuscript. It may contain a figure, but not the reviewer comment that caused the figure to change. It may reference a dataset, but not the citation or grant text where the result is later used.

Notebooks reduce fragmentation inside the analysis layer.

They do not automatically connect analysis to literature, writing, citations, team discussion, or long-term project memory.

That is why provenance matters. Recent work on recording provenance for workflow runs describes provenance as key for traceability, reproducibility, and quality assessment of data products. (PLOS ONE: Recording provenance of workflow runs)

In practice, provenance means being able to answer:

  • Where did this result come from?
  • Which input produced it?
  • Which code transformed it?
  • Which version was used?
  • Which claim depends on it?
  • Which later decision changed it?

Without provenance, teams can store files but still lose the story of how the research happened.
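The provenance questions above map fairly directly onto a small record that could be saved alongside each result. The sketch below is one possible shape, not an established schema: the class name ProvenanceRecord, its field names, and every value in the usage note are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ProvenanceRecord:
    output: str                    # where did this result end up (e.g. a figure file)
    inputs: dict[str, str]         # which input produced it: name -> content hash
    code: str                      # which notebook or script transformed it
    code_version: str              # which version was used (e.g. a commit id)
    used_by: list[str] = field(default_factory=list)  # which claims depend on it
    note: str = ""                 # which later decision changed it, if any
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialize the record so it can be stored next to the output file."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

A record might then be created as ProvenanceRecord(output="fig2.png", inputs={"survey_clean.csv": "<sha256 digest>"}, code="analysis.ipynb", code_version="<commit id>", used_by=["Results, paragraph 3"]) and written to a JSON file beside the figure. The point of the design is that each field answers one of the questions in the list, so a missing answer shows up as an empty field rather than as a forgotten detail.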

Fragmentation makes teams forget

Research teams also suffer from a less visible problem: institutional forgetting.

When a researcher leaves a project, their files may remain, but their context often leaves with them.

In one conversation, a professor described a familiar but painful situation. Their team needed to revisit an analysis that had been done several years earlier by a PhD student who had since left the group. The result mattered, but the dataset, source files, and reasoning behind the analysis were no longer easy to find. The former student had moved on and was not in a position to help reconstruct the work.

In the end, the team had to redo the analysis from scratch.

That is the hidden cost of fragmented research workflows. The problem was not that the group lacked intelligence, effort, or storage. The problem was that the project's memory had lived partly in folders, partly in code, partly in undocumented decisions, and partly in one person's head. When that person left, the research context left with them.

The team may still have the notebooks, drafts, and data. What disappears is the reasoning: why certain paths were abandoned, why one method was preferred, why a result was trusted, why a citation mattered, or why a reviewer response was written in a certain way.

This is especially painful in long research projects.

A grant becomes a study. A study becomes a paper. A paper becomes a dataset. The dataset becomes a follow-up project. Over time, the work outlives the people and folder structures that originally held it together.

Without connected memory, every transition creates friction.

New team members spend weeks or months understanding what has already been done. Senior researchers become bottlenecks because they hold too much unwritten context. Teams repeat analyses because they cannot find prior work. Decisions are remade because the original reasoning was never captured.

The organization has knowledge, but not continuity.

The problem is not storage

Most teams already have enough storage.

They have shared drives, cloud folders, reference managers, project management tools, notebooks, messaging apps, and document editors.

The problem is that storage does not equal memory.

A folder can store files, but it does not know why they matter. A reference manager can store papers, but it does not know which claim a passage supports. A notebook can store code, but it does not know where the result was used in the manuscript. A document editor can store writing, but it does not know which dataset or analysis produced the sentence being written.

Research memory requires relationships.

  • This claim came from this note.
  • This note was based on this passage.
  • This figure was generated by this notebook.
  • This notebook used this dataset.
  • This dataset was cleaned using this script.
  • This paragraph was revised because of this reviewer comment.

When those relationships are preserved, the project becomes easier to understand. When they are lost, the project becomes a pile of files.
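The relationships listed above can be pictured as typed links between project items. The sketch below stores them as simple triples and walks them from a starting item back to everything it depends on. It is an illustration under assumptions, not a description of how any specific product stores its data: every identifier, relation name, and function name is made up for the example.

```python
from collections import defaultdict

# Each link is (source, relation, target). All identifiers are hypothetical.
LINKS = [
    ("claim:effect-size", "came_from", "note:reading-notes"),
    ("note:reading-notes", "based_on", "passage:paper-a-p4"),
    ("claim:effect-size", "shown_in", "figure:fig2"),
    ("figure:fig2", "generated_by", "notebook:analysis.ipynb"),
    ("notebook:analysis.ipynb", "used", "dataset:survey_clean.csv"),
    ("dataset:survey_clean.csv", "cleaned_by", "script:clean.py"),
]


def build_index(links):
    """Group outgoing links by their source item."""
    index = defaultdict(list)
    for source, relation, target in links:
        index[source].append((relation, target))
    return index


def trace(start, index):
    """Return every link reachable from start, following links depth-first."""
    seen = {start}
    stack = [start]
    trail = []
    while stack:
        node = stack.pop()
        for relation, target in index.get(node, []):
            trail.append((node, relation, target))
            if target not in seen:  # avoid revisiting items in cyclic graphs
                seen.add(target)
                stack.append(target)
    return trail
```

Calling trace("figure:fig2", build_index(LINKS)) returns the notebook, dataset, and cleaning script behind the figure, in that order. Starting from the claim instead returns all six links, which is the difference between a pile of files and a project that can explain itself.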

Why connected workflows matter

A connected research workflow keeps evidence close to the work it informs.

That does not mean forcing every researcher into one rigid system. It means preserving the links between documents, datasets, notebooks, figures, citations, notes, and drafts as the work evolves.

When a researcher writes a claim, the supporting evidence should remain nearby.

When a figure is inserted into a manuscript, the notebook and dataset behind it should remain traceable.

When a citation is used, the relevant passage and interpretation should be accessible.

When a team revises a conclusion, the earlier version and reason for the change should not disappear.

This makes research faster, but speed is not the only benefit.

It also makes research more reviewable. More reusable. More resilient when people leave. More transparent when decisions need to be explained.

For teams working with sensitive or complex data, this becomes even more important. The harder the work is to reproduce, the more valuable connected context becomes.

From fragmented tools to shared project memory

The next generation of research software should not only help teams store more information.

It should help them preserve continuity.

The goal is not to replace, overnight, every tool researchers use. Researchers will still work with PDFs, spreadsheets, notebooks, code, reference managers, manuscripts, local folders, and shared drives.

The important shift is to connect those materials into a living project memory.

A place where a team can ask not only "where is this file?" but "why did we make this claim?", "which analysis supports this figure?", "what changed since the last draft?", and "what context will a new researcher need to understand this project?"

That is the difference between managing files and managing knowledge.

How Paradocs approaches this

Paradocs is built around the belief that research should become easier to verify as it grows, not harder.

Instead of treating documents, datasets, notebooks, citations, figures, and manuscripts as separate islands, Paradocs brings them into one connected workspace. As researchers read, write, analyze, and revise, the system preserves relationships between the work and the evidence behind it.

The aim is not just faster search.

The aim is continuity: knowing which dataset supports a claim, which notebook produced a figure, which citation matters for a paragraph, and what context a future teammate will need to understand the project.

Because in research, the final output is only part of the work.

The reasoning trail is the work too.