r/biotech Nov 28 '24

Is the lack of common databases a widespread issue in pharma, CROs, and CMOs? Open Discussion 🎙️

Hey,

I've been having discussions with colleagues in the biotech industry (I am a software engineer, but I work at a pharma company), and a recurring topic is the challenge of managing and searching across different documents and data due to the absence of a common storage system. So I am curious:

  • Is this a common problem you're facing in pharma companies, CROs, or CMOs?
  • How much time and energy does it take to deal with these centralized database issues in your daily work?
  • Have you found any effective solutions or workarounds to mitigate this problem?

Just have an idea of how to tackle this problem, but want to validate it first.

UPDATED:

I didn't mean a common database between companies. I meant a centralized DB inside one company. So the use cases are the following:

  • Say some lab in your company already ran a similar experiment 5 years ago and you are unable to find the document/experiment results in the mess of folders, FTP, Google Drive, etc.
  • Another use case, for CROs for example, is a common search across the experiments/documents/data outputs of each lab/department inside it.

42 Upvotes

12

u/RareTadpole_ Nov 28 '24

Company dependent. I’ve been in companies that use one main EOF repository and companies that have everything (seemingly) randomly spread across multiple repositories. And that is just for documents. Not even considering where data is stored and shared, which will always be a clusterF.

2

u/Jack_Hackerman Nov 28 '24

Could you elaborate on what the EOF repository is?

4

u/RareTadpole_ Nov 28 '24

Electronic official file

2

u/2Throwscrewsatit Nov 28 '24

So just a QMS document library 

23

u/South_Plant_7876 Nov 28 '24

This is a well worn path for software engineers to want to solve. In theory it should be simple to do: run an experiment, store the data in some sort of database and everyone can mine and reference.

The reality is experiments are messy and rarely uniform and don't lend themselves easily to serialisation in data tables.

Even in CROs, where things should be more turnkey, it is very hard to consistently label rows and columns. It is also quite stifling, leading people to design experiments to fit the schema rather than the other way around.

The companies which use these systems are probably using them more for routine QC and compliance with GMP where these have been implemented.

Our company (and others) use shared OneNotes for data storage and lab notebooks. It certainly isn't the best, but it has enough flexibility, coupled with best practice processes to make it work.

1

u/Jack_Hackerman Nov 28 '24

If I tell you that I can make a solution for this (like a schemaless db where you can put any arbitrary data, but still analyze or search it), would it be a game changer?
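For illustration only, here is a minimal Python sketch of what "schemaless but still searchable" could mean: a toy inverted index that walks arbitrarily nested records and indexes every leaf value. The class name `SchemalessIndex`, the record ids, and the sample records are all made up for this example; it is not a description of any real product and it ignores permissions, file parsing, and scale.

```python
import re
from collections import defaultdict


def iter_values(record):
    """Yield every leaf value of an arbitrarily nested dict/list record."""
    if isinstance(record, dict):
        for value in record.values():
            yield from iter_values(value)
    elif isinstance(record, (list, tuple)):
        for item in record:
            yield from iter_values(item)
    else:
        yield record


def tokenize(value):
    """Lowercase a value and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", str(value).lower())


class SchemalessIndex:
    """Toy inverted index: token -> set of record ids (hypothetical example)."""

    def __init__(self):
        self.records = {}
        self.postings = defaultdict(set)

    def add(self, record_id, record):
        # No schema is required: whatever values the record has get indexed.
        self.records[record_id] = record
        for value in iter_values(record):
            for token in tokenize(value):
                self.postings[token].add(record_id)

    def search(self, query):
        """Return sorted ids of records that contain every token of the query."""
        tokens = tokenize(query)
        if not tokens:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return sorted(hits)


# Made-up records with three different shapes, standing in for ELN, FTP and Drive.
index = SchemalessIndex()
index.add("eln-001", {"title": "Growth hormone binding assay", "year": 2019})
index.add("ftp-17", {"file": "plate_reads.csv", "notes": ["hGH", "ELISA", "growth hormone"]})
index.add("drive-3", {"protocol": {"name": "Cell viability", "reagent": "MTT"}})

print(index.search("growth hormone"))
```

The sketch only shows that records do not need a shared schema to be found by keyword; it says nothing about whether the results are trustworthy or comparable, which is the harder part raised in this thread.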

14

u/South_Plant_7876 Nov 28 '24 edited Nov 28 '24

No. Why would we change something that works already?

Sorry to sound so cynical, but we are approached all the time by people who think they have finally hit on the perfect ELN/data storage solution. But they inevitably all become solutions looking for a problem. And charge a fortune for something only marginally better.

People always mention Benchling, but at the end of the day they are little more than a OneNote clone with molecular biology software baked in.

Their position in the market derives more from highly aggressive sales than practical utility. I am pretty sure one of their sales people just PM'd me on the back of my previous comment.

5

u/Jack_Hackerman Nov 28 '24

How is finding information in OneNote? Are there any problems with it?

6

u/fibgen Nov 29 '24

If you are looking for a startup idea I'd look elsewhere.  You can make a lot more money with more standardized industries.  The main problem with indexing early stage research is that every experiment is slightly different and has its own context (unless it's GCP/GMP validated), so it doesn't lend itself to normalization.

Also, nobody trusts an experiment from 6 years ago; the technology has usually advanced enough that redoing the experiment will yield much richer information.  The few domains that have info well suited to normalization (medchem, nucleic acids) already have many dedicated commercial options with a lot of domain knowledge baked in.

Source: was you 20 years ago

3

u/ebbee Nov 29 '24

Completely agree. My experience is that the earlier the stage of the company, the more (almost daily) pivots happen. Even if you designed a completely customizable tool that was run more like a service than a software, it would be impossible to keep up with how fast ideas and processes change. Plus the high turnover of staff can lead to differences in workflow which compounds the problem.

1

u/Electronic_Exit2519 Dec 02 '24

These solutions exist in every mature pharma company. They don't yield anything valuable on the department level, i.e. where the raw data is created. When this guy says it's a well worn path, he knows what he's talking about. Graph databases exist. We use them. They suck in practice. Pharma largely hasn't moved towards data ownership/governance culturally, and until we do, the tech doesn't matter.

16

u/Busy_Bar1414 Nov 28 '24

Do you mean shared common QMS or databases like Veeva?

It’s an interesting question and following to see replies. The CMOs need or should have an ERP that Sponsors and CROs are always wanting access to but won’t be granted.

7

u/Jack_Hackerman Nov 28 '24

I don't mean a common DB, but rather a centralized tool to search across all of the company's databases, one that is easy to operate, so there are no questions like 'does another lab in my company have this document?' or 'does any of the thousand files in sources like the ELN, FTP, Google Drive, etc. contain this piece of information, or did I forget to enter it?'

22

u/2Throwscrewsatit Nov 28 '24

Any existing enterprise document managing system can do this.

The problem isn’t the software. It’s the people in the organization not agreeing to use a common pattern, & then everyone complaining.

5

u/diodio714 Nov 28 '24

or you are just not given access to view the documents of another department even though everything is in the same content manager.

3

u/con_sonar_crazy_ivan Nov 28 '24

Data governance is so essential but so few are actively driving this...

6

u/2Throwscrewsatit Nov 28 '24

Because it requires a backbone and owning risk. Both things leaders fail to be asked to do in modern corporations

1

u/pancak3d Nov 29 '24

Enterprise document management systems absolutely are not the right place to store everything. What OP is asking for is basically a datalake.

1

u/2Throwscrewsatit Nov 29 '24

Not really. Document management systems used to be where information goes to die for compliance but progress has been made to make their data findable and accessible in these systems. 

What you said about DMS is like saying ELNs don’t structure data well. Yeah, they didn’t 10 years ago. But not anymore. Companies don’t use software as it’s intended. That’s why they buy an ELN when they just need a DMS plus Search. Or when they invest in a QMS because they don’t understand how to use Microsoft Enterprise 365.

A “data lake” is amorphous and is just a collection of data sources given a fancy name. And having a lake of data doesn’t mean it’s interoperable. Which is what OP ultimately wants.

1

u/pancak3d Nov 29 '24 edited Nov 29 '24

I guess I don't really understand your perspective here on DMS. Organizations generate a massive amount of data and a very small portion falls into the bucket of GxP documents that belong in DMS. Putting everything into a DMS because it has search is a very weird strategy and I've never heard of any company doing it, and I worked for pharma's biggest DMS vendor...

5

u/pineapple-scientist Nov 28 '24

I wonder if they are looking for something like Veeva but for non-clinical work. I haven't seen a good centralized system for experimental protocols and results implemented in pharma. When I did gene editing work in a large academic lab, we used Benchling and that did work well for our size (~100 people).

7

u/MacPR Nov 28 '24

Yes it is, we just built our own data schema.

2

u/Jack_Hackerman Nov 28 '24

But how do you store the data? For example, what if you want to find out whether an assay involving human growth hormone was already done in your company? How do you tackle this, and is it a big issue? Or what if you want to find some file, or a line in some file (whose name you don't remember), that was created a month ago?

3

u/MacPR Nov 28 '24

In very general terms, a data schema.

If you want something prebuilt, look into an “Electronic Lab Notebook” like Signals and have an SOP for data management.

8

u/atxgossiphound Nov 28 '24

There are products that do this, but the ones comprehensive enough have only entered the market in the last decade. Products like L7’s ESP, Sapio, and to some extent Benchling can do this. However, none are turnkey solutions out of the box and all require some implementation effort. Not as much as the legacy LIMS and CMS tools, but still a few months of implementation time.

In house software engineers tend to push back against the products and insist they can build it themselves. Of course, they can, but it is more work than they anticipate and rarely successful.

There’s also the budget challenge. The vendors need to sell the software at a price that supports their business. With only a few thousand total customers in the market, any one vendor will have double to low triple digit customer numbers. That necessitates higher prices - usually the cost of an FTE. It’s still cheaper than building it yourself, but it’s not cheap. CROs and CDMOs tend to be low margin businesses, there’s not always budget available for software.

Now consider that most CROs already have a Microsoft subscription and their main output is Excel reports. It’s easy for them to build a data system around SharePoint, OneNote, and Excel.

Could it be better? Sure, but the size of the market and the nature of the service businesses work against it.

6

u/blorfity Nov 28 '24

At a large CRO:

Controlled documents (SOPs) in a QMS owned by QA

Controlled test methods/protocols in a lab side Doc mgmt system

Reports, COAs in a separate legacy system that works off Windows 2000 for some godforsaken reason. When we went remote for COVID this thing really broke from bandwidth/access requests.

Instrument data swept for long term storage between system 2 and 3 above depending on how old the instrument systems are.

Certain documents (investigation forms, etc) are on a big shared drive.

None of the systems above can be accessed externally. We set up sharepoints for client access to specific data and move things there manually.

The quicker we can resolve the “data on the cloud = not safe or secure for GMP” problem, the better, so we can move everything to AWS and call it done. I’m not in QA so I don’t know the latest feelings here. We have been able to do this for certain standalone software systems but no appetite for client reports yet.

5

u/Busy_Bar1414 Nov 28 '24

Hello I just picked up on something you’ve said. Would you say data stored in a cloud is NOT compliant with GxP? Is there a regulation suggesting this? Always interested to hear other view points.

3

u/Chance-Party7686 Nov 29 '24

Cloud, if it’s not validated, is not considered secure. In fact, the same goes for any IT system in a GxP environment, even SharePoint.

2

u/phaberman Nov 29 '24

I've wondered if anyone has validated SharePoint.

There's definitely a way to do it that would work for smaller biotech companies that don't wanna shell out the money for Veeva.

https://learn.microsoft.com/en-us/compliance/regulatory/offering-fda-cfr-title-21-part-11

4

u/Chance-Party7686 Nov 29 '24

Below are probably a few things that each company would test for, at least:

  1. Disaster recovery
  2. Cloud security (restricting public access without permissions)
  3. LDAP authentication for access, etc.

1

u/phaberman Nov 29 '24

I'd guess that all of these are doable.

1 & 2 are built into OneDrive. 3 could be done with SSO?

1

u/Chance-Party7686 Nov 29 '24

Yup, but it should be documented. 1 & 2 could leverage vendor documentation.

1

u/Deathbird1 Dec 01 '24

I was told in my company it should not be possible, as there is always a master admin who can tamper with the data (delete, modify, ...). So our QA says no GxP on SP.

2

u/blorfity Nov 29 '24

I am not directly involved with the decision making here so I am working off the actions of others around me. But there has been concern that if we store client data with external vendors that we don’t control, then we may run afoul of some data retention or client confidentiality regulation. The stuff I’ve overheard is that we can’t control its access if it’s offsite.

1

u/pancak3d Nov 29 '24

No, it's more that old folks in industry don't trust cloud storage.

1

u/Jack_Hackerman Nov 28 '24

Can I DM you?

5

u/saltedmeatsps Nov 28 '24

Benchling can do most of this

1

u/Jack_Hackerman Nov 28 '24

Does it support indexing, searching and viewing of absolutely chaotic data from absolutely chaotic data sources?

5

u/saltedmeatsps Nov 28 '24

Pretty much. They have off the shelf integration with a bunch of instruments. It's basically an expanded ELN. 

If you mean SOPs, Clinical Data, internal data, etc all together, nothing really does that. 

Mulesoft could do it with a bunch of upfront work. 

2

u/Jack_Hackerman Nov 28 '24

I have an idea of implementing such a solution with my friend. The problem is that people tend to think that the data must be structured and put in some form of standard representation before it can be analyzed/searched/indexed, but that is not true.

2

u/fibgen Nov 29 '24

Is your friend the founder of Quilt?

You should survey the unstructured data indexing solution space before thinking you have some new special insight.

2

u/Thommasc Dec 04 '24

Can you be more specific on the type of data?

I'm working on a data explorer solution at the moment and would love some more real use cases.

Is it even realistic to do any form of data visualization or analytics on a datalake with unstructured data?

In our solution, we try to tell people to put data in the right place so it can be leveraged.

However, we see only 30-40% of the data end up being properly structured and managed.

And that's for projects with small very efficient teams. So you can imagine what happens in bigger teams and structures, the percentage must drop pretty low.

I just don't get how people realistically search and report on data if it's all chaos at scale.

2

u/Jack_Hackerman Dec 04 '24

Hi, if you want you can DM me or I can write to you. I'll give you more details.

5

u/Patience_dans_lazur Nov 28 '24 edited Nov 28 '24

It sounds like you're describing an electronic lab notebook (ELN)? There are several commercially available options. If you connect it to your inventory and everyone is rigorous in their note taking processes + uploads data and results to a corresponding experiment entry they can be a very powerful tool for searching across projects, people and time.

3

u/awhead Nov 28 '24

Do you mean something like Alation?

3

u/walterbernardjr Nov 28 '24

Yes that is common, which is why consulting firms and tech firms are making bank helping pharma companies implement solutions to address this

3

u/Vervain7 Nov 28 '24

lol

You can have all that but then it doesn’t matter when a place reorgs every year and technical debt is piled on.

3

u/mdcbldr Nov 28 '24

Yes. It is a mess. The information may be in internal databases, public databases, PDFs of published data, etc. Pulling the data together into one coherent data pool is always an issue.

CDMOs/CMOs have it worse. Each client may have their own specifications for how the information is captured. In practice, CDMOs are woefully stagnant when it comes to sophisticated data management practices. One cannot prepare for everything that could walk through the door. The CDMOs put the data management onus on the client.

Many moons ago my tiny startup faced a data management issue. We had test extracts and compounds, we had assays, we had tox checks, we had cell assays. There were no systems to handle this. We were generating 5,000 to 00,000 data points a week and were scaling to do 25,000 to 50,000 a week.

We partnered with a few other small companies and hired a software design firm who built a suite of programs that were designed with a common API so that we could configure the modules for specific scenarios. All the data was held in an SQL system. It would be considered primitive by today's standards. Back then, companies like Merck were trying to license the tech from us.

We wanted to incorporate public data into our system. That proved difficult. We settled on a data entry approach. We manually entered about 100,000 data points into the system. I wanted more but it was expensive and the data had a lot of missing data points. We eventually figured out a way to get consistent data and ran with that approach.

The last company I worked for was insanely 1985-ish. I often recalled what a programmer said years ago: a computer is more than a fancy pen and nice paper. The company was literally recording in log books and paper forms. Those data were then manually entered into spreadsheets. We had planning software that was so abstruse that we dumped its output into a spreadsheet for each batch. There was no tracking the run against the workplan.

Data, or access to data in a usable format, is an issue. If you can solve this issue, you could become wealthy.

One last caveat. The data system must be validated under 21 CFR Part 10 or 11? I hope someone knows the true CFR reference.

2

u/Extreme_Cricket_1244 Nov 28 '24

The largest publicly sourced database to my knowledge is BenchSci, which, if integrated properly, will be able to form generative hypotheses on biological phenomena. The tricky thing is integrating across data sets within your org, which takes time and buy-in to make the LMM proficient.

1

u/Jack_Hackerman Nov 28 '24

Ah, I misformulated the question. Check 'UPDATED' please in the topic

2

u/Anonymous_2672001 Nov 28 '24

Yes, finding anything is a fucking nightmare. My efficiency is probably reduced 10-20% because we simply don't have shared resources. That includes multi-week delays because I have to wait for others to send me things that should've been distributed upon publication.

2

u/Jack_Hackerman Nov 28 '24

Can I contact you and talk about your problem more?

2

u/Content-Doctor8405 Nov 28 '24

This is an obviously desirable technology to have in any company, but sadly it is missing from most. In the larger companies, different divisions are quasi-independent so there is less integration between research projects than you might imagine.

Likewise, Big Pharma R&D productivity has been declining for a long time and that means that there are a lot of mergers with smaller biotechs who might be fairly far down the road with a project before the acquisition closes, and obviously those projects are done on standalone databases. After the merger, the focus is on getting the target drug across the finish line and time-consuming tasks such as systems integration take a back seat to everything else.

So does it make sense to have a common database platform? Absolutely. Is that reality? No, not even close.

2

u/Jack_Hackerman Nov 28 '24

As I mentioned above, my friend and I are from the software development world, and we have an idea for how you can still manage all this chaotic data from different data sources without actually changing/moving it.

2

u/Content-Doctor8405 Nov 28 '24

It is messy and a lot of times it is some ad hoc workaround that somebody cobbles together. As the number of deferred database projects grows, getting a handle on that becomes nearly impossible.

I think the real answer is that a lot of what you imagine doesn't matter so much. Yes, it would be nice to look at something that another team did five years ago, but I am not sure there is much need to actually do so. The time that is really useful is in preclinical lab work, but more and more of that work is being done by small biotechs. Once you get to the late preclinical or clinical stages, the data is pretty well locked down because it has to be for regulatory reasons.

2

u/Jack_Hackerman Nov 28 '24

But what about current data? Can you share your experience a little? Like how do you store data, in which format, what obstacles do you have obviously?

2

u/BryJammin Nov 28 '24

Data scientist in pharma here. Wish my organization had a cloud compute engine that I could schedule jobs on to run recurring Python and R scripts and store outputs. Currently executing and storing everything from sync’d SP directories. Definitely annoying having to manually handle this.

2

u/Jack_Hackerman Nov 28 '24

Actually, my friend and I built an open source solution for this :) (but it doesn't have scheduling yet)
https://github.com/BasedLabs/NoLabs/tree/master
Or do you mean something different?

2

u/BryJammin Nov 28 '24

Did a brief scan of your repo, cool tool! I’m on the clinical side of the organization - data cleaning, modeling, reporting/analytics on clinical data. Your tool looks like it’s geared towards dry lab concepts, no?

2

u/Jack_Hackerman Nov 28 '24

Shall I DM you?

2

u/open_reading_frame Nov 28 '24

Yes, this is a common problem at my company and I don't see it getting better soon. A lot of my busywork comes from just compiling data from various data sources. My managers have also gotten annoyed and want all the raw data in backup slides versus getting them from the raw files.

2

u/BringBackBCD Nov 29 '24

This is a challenge at all companies, including beyond biotech.

1

u/Feisty_Shower_3360 Nov 29 '24

Just have an idea of how to tackle this problem...

And, just like that, there are now n+1 disparate and incompatible databases inside your company.

1

u/pancak3d Nov 29 '24

I see pharma companies building datalakes for this, often on AWS or Microsoft Azure. It's expensive and technically difficult, and many companies aren't really savvy enough to understand the benefit and build the infrastructure to do it, but the tide is turning.

1

u/Time_Stand2422 Nov 29 '24 edited Nov 29 '24

Depends on the digital maturity of the company and honestly how progressively tech minded the Executive leadership is. A lot of companies don’t realize that their data is an asset, do not teach and foster data literacy and fail to invest in technology. I'm not a data scientist or IT guy, but even I can see that data lakes and integration layers that harmonize data formats and eliminate transcription while unlocking advanced analytics are a huge advantage.

Veeva is attempting to solve this problem by just being the go-to app for every vertical in the company (LIMS, EDMS, RIM, LMS, etc.), but there will be a lot of disparate applications, from bench-top analytical instruments to enterprise software, that still need to be harmonized, managed and curated in a way that is useful for the consumer (FAIR data is findable, accessible, interoperable, and reusable). It needs Data Governance as well as technology. If the data is treated as a valuable asset, then it gets cataloged, tagged, lineage established, and controls implemented to ensure integrity as per ALCOA+.

1

u/rageking5 Nov 28 '24

These already exist if a company wants to implement.

5

u/Jack_Hackerman Nov 28 '24

Do you mean that companies implement their own solution for data centralizing or buy some solution?

1

u/TeepingDad Nov 28 '24

Veeva is quickly dominating this space, but there are plenty of other software tools that can connect other software together to a common database.

3

u/Jack_Hackerman Nov 28 '24

I was thinking of creating a solution where you can integrate everything (complex/chaotic/whatever you want) into a single searchable and viewable database, no matter how complex your data is or what the data source is.

3

u/TeepingDad Nov 28 '24

It's a good idea but already has a market. I've been approached by a handful of firms that do this sort of work, they tie in all data sources (lab, QMS, manufacturing, clinical, etc) and then link them by common metadata.
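As a rough illustration of "link them by common metadata", here is a small Python sketch that groups records from several systems by a shared key. Everything in it is assumed for the example: the `batch_id` field, the source names, the sample records, and the `link_by_key` helper are hypothetical and not taken from any vendor's product.

```python
from collections import defaultdict

# Hypothetical records exported from different systems; field names are made up.
lab_records = [
    {"source": "ELN", "batch_id": "B-102", "assay": "potency"},
    {"source": "ELN", "batch_id": "B-103", "assay": "purity"},
]
qms_records = [
    {"source": "QMS", "batch_id": "B-102", "deviation": "DEV-7"},
]
mfg_records = [
    {"source": "MES", "batch_id": "B-102", "line": "fill-2"},
    {"source": "MES", "batch_id": "B-104", "line": "fill-1"},
]


def link_by_key(key, *record_sets):
    """Group records from all sources by the value of a shared metadata key."""
    linked = defaultdict(list)
    for records in record_sets:
        for record in records:
            # Records that lack the key cannot be linked and are skipped.
            if key in record:
                linked[record[key]].append(record)
    return dict(linked)


linked = link_by_key("batch_id", lab_records, qms_records, mfg_records)
sources_for_b102 = [r["source"] for r in linked["B-102"]]
print(sources_for_b102)
```

The linking itself is trivial; in practice the effort goes into getting every system to carry the same key consistently, which is the governance problem others in the thread describe.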

2

u/Jack_Hackerman Nov 28 '24

Got it. Could you give an example of such firms? I want to check what they do.

1

u/Chance-Party7686 Nov 29 '24

Veeva or any other QMS to store validated standard protocols, procedures, etc. A LIMS to enter and store the results, generate COAs, etc.

Is this what you're asking?

1

u/ShadowValent Nov 29 '24

This person is young.