2 Materials and Methods

This project evaluates a digital pathology workflow.
We will use Quarto (.qmd) documents in book project format, with R for data preprocessing and analysis.
We have three datasets:
- specimen tracking and accession in the lab
- the time when the slides are scanned in scanners
- the time when the slides are uploaded to PACS.
A typical slide identifier is: 12345-24[11];A,TTF
- 12345-24 is the unique case number: 12345 is the case number and 24 is the year. We will always use the full unique case number.
- [11] is the hospital id.
- A is the block name
- TTF is the stain name

Alternative slide names:

12345-21;[65]1A,HE 12345-21;[65]1A, 12345-21;[65]1A 12345-21;[65],HE 12345-21;[65], 12345-21;[65] 12345-21; 12345-21, A_12345-21;[65]1A,HE A_12345-21;[65]1A, A_12345-21;[65]1A A_12345-21;[65],HE A_12345-21;[65], A_12345-21;[65] 12345-21; 12345-21, A.12345-21;[65]1A,HE A.12345-21;[65]1A, A.12345-21;[65]1A A.12345-21;[65],HE A.12345-21;[65], A.12345-21;[65] 12345-21; 12345-21, A.12345-21;[65]1PCS1,Ki-67 A.12345-21;[65]1A, A.12345-21;[65]1A A.12345-21;[65],HE A.12345-21;[65], A.12345-21;[65] 12345-21; 12345-21,

Each case can contain multiple slides.
From the data we will extract the unique case number to identify cases, and use the full slide identifier to track each slide.
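The identifier variants listed above could be split along these lines. This is a minimal base R sketch, not a definitive parser: the function name `parse_slide_id` and the regular expressions are assumptions derived only from the example identifiers, and treating the text between the last `]` or `;` and the comma as the block (so `1A` rather than `A` in the alternative forms) is a guess that should be checked against real data.

```r
# Hypothetical helper: split a slide identifier into its parts (base R only).
parse_slide_id <- function(x) {
  # Return the first regex match per element, or NA when there is none.
  first_match <- function(pattern, s) {
    m <- regexpr(pattern, s, perl = TRUE)
    out <- rep(NA_character_, length(s))
    out[m > 0] <- regmatches(s, m)
    out
  }
  # Unique case number: digits, hyphen, two-digit year (e.g. 12345-24).
  case_id <- first_match("[0-9]+-[0-9]{2}", x)
  # Hospital id: digits inside square brackets (e.g. [11]).
  hospital <- first_match("(?<=\\[)[0-9]+(?=\\])", x)
  # Stain: everything after the first comma; empty means "not recorded".
  stain <- ifelse(grepl(",", x, fixed = TRUE), sub("^[^,]*,", "", x), NA_character_)
  stain[!is.na(stain) & stain == ""] <- NA_character_
  # Block (assumption): text after the last ']' or ';' and before the comma.
  before_comma <- sub(",.*$", "", x)
  block <- ifelse(grepl("[\\];]", before_comma, perl = TRUE),
                  sub("^.*[\\];]", "", before_comma, perl = TRUE),
                  NA_character_)
  block[!is.na(block) & block == ""] <- NA_character_
  data.frame(slide_id = x, case_id, hospital, block, stain,
             stringsAsFactors = FALSE)
}

parsed <- parse_slide_id(c("12345-24[11];A,TTF",
                           "A.12345-21;[65]1PCS1,Ki-67",
                           "12345-21,"))
print(parsed)
```

Keeping the original `slide_id` column next to the parsed parts makes it possible to audit identifiers that the patterns fail to handle.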
Lab Data:
- When the slides are stained, they are accessioned to pathologists.
- When a case is accessioned to a pathologist, it is ready to be reviewed and signed out.
- We will find this time and then follow the slides to the scanner and PACS to see how long it takes to scan and upload them. The question is whether digital pathology is causing delays.
- Staining complete files live under data/lab/ (e.g., *_boyama_islemi_bitti_*.xlsx).
- Case assignment/reassignment files also live under data/lab/ (e.g., *_vaka_atandi_*.xlsx, *_vaka_baska_doktora_atandi_*.xlsx).
- For lab data we will combine these Excel files. We will find the latest staining time and then the accession time nearest to it. If the case was accessioned to another pathologist, we will use that accession time. This will be our starting point.
- Logic for determining Case Ready Time:
  - Assignment after staining: use the assignment time.
  - Assignment before staining: fall back to the staining time.
  - No assignment: fall back to the staining time.
  - Multiple assignments: use the first assignment that occurs after staining.
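The four rules above can be sketched for a single case as follows. The function name `case_ready_time` and the helper `utc` are hypothetical, and treating an assignment at exactly the staining time as "not after" staining is an assumption.

```r
# Hypothetical helper to build UTC timestamps for the examples.
utc <- function(x) as.POSIXct(x, tz = "UTC")

# Case Ready Time for ONE case, given all its staining and assignment times.
case_ready_time <- function(staining_times, assignment_times) {
  last_stain <- max(staining_times)  # latest staining-complete time
  # Keep only assignments strictly after the latest staining (assumption).
  later <- assignment_times[!is.na(assignment_times) &
                              assignment_times > last_stain]
  # First assignment after staining if any, otherwise the staining time.
  if (length(later) > 0) min(later) else last_stain
}

stain <- utc(c("2024-04-01 08:00:00", "2024-04-01 09:00:00"))

case_ready_time(stain, utc("2024-04-01 10:00:00"))   # assignment after staining
case_ready_time(stain, utc("2024-04-01 07:00:00"))   # assignment before staining
case_ready_time(stain, utc(character(0)))            # no assignment
case_ready_time(stain, utc(c("2024-04-01 11:00:00",  # multiple assignments
                             "2024-04-01 10:30:00")))
```

In practice the function would be applied per unique case number after the staining and assignment files have been combined.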
Scanner Data:
- This is a free text log of the scanner.
- It records the slide identifier, the time the slide was scanned, and the time it was copied to the scanner transfer folder.
- Daily logs of scanners are in data/scanner/logs/, each scanner in its own subfolder (e.g., data/scanner/logs/SS7833_MFTv2/logs).
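Given the one-subfolder-per-scanner layout, the daily logs could be inventoried before any parsing. This is a sketch only: `list_scanner_logs` is a hypothetical name, the scanner label is assumed to be the first folder under the log root, and the demo file `day1.log` is invented because the real log file names are not specified here.

```r
# Hypothetical helper: list every log file and the scanner folder it belongs to.
list_scanner_logs <- function(root = file.path("data", "scanner", "logs")) {
  # Paths relative to root, e.g. "SS7833_MFTv2/logs/<file>".
  files <- list.files(root, recursive = TRUE, full.names = FALSE)
  data.frame(
    scanner = sub("/.*$", "", files),  # first path component (assumption)
    path    = file.path(root, files),
    stringsAsFactors = FALSE
  )
}

# Demo on a throwaway directory that mimics the layout described above.
demo_root <- file.path(tempdir(), "scanner_logs_demo")
dir.create(file.path(demo_root, "SS7833_MFTv2", "logs"),
           recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_root, "SS7833_MFTv2", "logs", "day1.log"))
logs <- list_scanner_logs(demo_root)
print(logs)
```

Because the logs are free text, the line-level parsing is left out on purpose; it depends on the actual log format.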
PACS Data:
- Structured data (CSV/Excel/JSON) with slide identifiers and upload time.
- Lives in data/pacs/.
After March 2024 we have complete tracking of slides. Earlier data are kept for documentation purposes only: before March 2024 the PACS and scanner data contain only incremental slide ids, so neither slides nor cases can be tracked.
We will group the slide data per case id (erisim numarasi, "access number") and find the minimum and maximum times. A case is ready for sign-out only when all of its slides are ready, which is why we need the maximum time. The durations are the differences between these times.
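The per-case grouping could look like the following base R sketch. All timestamps are made-up toy values, and the column names (`case_ready`, `scanned`, `uploaded`) are assumptions rather than the real field names.

```r
utc <- function(x) as.POSIXct(x, tz = "UTC")  # hypothetical helper

# Toy slide-level data: two slides for one case, one slide for another.
slides <- data.frame(
  case_id  = c("12345-24", "12345-24", "67890-24"),
  scanned  = utc(c("2024-04-01 10:00:00", "2024-04-01 12:30:00", "2024-04-02 09:00:00")),
  uploaded = utc(c("2024-04-01 10:20:00", "2024-04-01 13:00:00", "2024-04-02 09:30:00")),
  stringsAsFactors = FALSE
)
# Toy case-level starting point (Case Ready Time from the lab data).
ready <- data.frame(
  case_id    = c("12345-24", "67890-24"),
  case_ready = utc(c("2024-04-01 09:00:00", "2024-04-02 08:00:00")),
  stringsAsFactors = FALSE
)

# Minimum and maximum times per case. A missing slide time yields NA for the
# case maximum, so incomplete cases stay visible instead of looking finished.
summarise_case <- function(d) {
  data.frame(case_id = d$case_id[1], n_slides = nrow(d),
             first_scan = min(d$scanned), last_scan = max(d$scanned),
             last_upload = max(d$uploaded), stringsAsFactors = FALSE)
}
per_case <- do.call(rbind, lapply(split(slides, slides$case_id), summarise_case))
per_case <- merge(ready, per_case, by = "case_id", all = TRUE)

# Durations in hours between workflow steps.
hours <- function(from, to) as.numeric(difftime(to, from, units = "hours"))
per_case$ready_to_scan_h   <- hours(per_case$case_ready, per_case$last_scan)
per_case$scan_to_upload_h  <- hours(per_case$last_scan, per_case$last_upload)
per_case$ready_to_upload_h <- hours(per_case$case_ready, per_case$last_upload)
print(per_case)
```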
There may be some errors in the data. The typical workflow should not produce negative times: the flow should be lab -> scanner -> PACS.
- However, a slide may be forgotten and scanned later. There will also be immunohistochemistry slides that are scanned 3-4 days after accession. These are not what we are after, because they come after the pathologist's evaluation. So for time intervals we will focus on HE slides only (HE is the abbreviation for hematoxylin and eosin). It is also safe to assume that cases are scanned within 24 hours of accession; anything later may indicate laboratory downtime.
- Exclusion Criteria:
  - Turnaround Time Analysis: To strictly focus on the initial scanning time of routine diagnostic slides, we exclude all non-routine samples. This includes Immunohistochemistry (IHC), Histochemistry (Special Stains), Recuts (coded as YK, DYK, SK, SSK), Additional samples (YP), and Cytology (PAP). We achieve this by filtering the dataset to include only slides with staining labeled as “HE”, “H&E”, or where the staining information is missing (empty). A user-editable stain mapping lives in config/stain_mapping.csv to correct or expand groupings without touching code.
  - Rescan Analysis: For the analysis of rescan rates, we include Recuts and Additional pieces as part of the “Routine HE” category, as these represent valid scanning events in the routine workflow. Only Ancillary tests (IHC, Histochemistry) are categorized separately.
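The two inclusion rules can be expressed as a small classifier. This sketch hardcodes the codes named above purely for illustration; in the project they would come from config/stain_mapping.csv, whose column layout is not described here and is therefore not read. The function names, the group labels, the assumption that recut and additional codes appear in the stain field, and the default of treating any unlisted stain as ancillary are all assumptions.

```r
# Hypothetical classifier for the stain label of a slide.
classify_stain <- function(stain) {
  s <- toupper(trimws(stain))
  s[is.na(s)] <- ""                           # missing stain info counts as HE
  group <- rep("ancillary", length(s))        # default: IHC / special stains
  group[s %in% c("", "HE", "H&E")] <- "routine_he"
  group[s %in% c("YK", "DYK", "SK", "SSK")] <- "recut"
  group[s == "YP"] <- "additional"
  group[s == "PAP"] <- "cytology"
  group
}

# Turnaround time analysis: routine HE only.
include_in_tat <- function(stain) classify_stain(stain) == "routine_he"

# Rescan analysis: recuts and additional pieces count as routine HE.
is_routine_for_rescan <- function(stain) {
  classify_stain(stain) %in% c("routine_he", "recut", "additional")
}

stains <- c("HE", "h&e", NA, "", "TTF", "YK", "YP", "PAP")
data.frame(stain = stains,
           group = classify_stain(stains),
           tat = include_in_tat(stains),
           rescan_routine = is_routine_for_rescan(stains))
```

How cytology should be grouped in the rescan analysis is not stated above, so the sketch keeps it as its own group.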
Also highlight potential downtime: periods where a scanner is stopped and many cases are delayed.
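One way to surface such periods is to flag each case interval against the 24-hour expectation and then look for days where most cases are delayed. The sketch below uses toy numbers; `flag_interval`, the column names, and the 50% daily threshold are assumptions chosen only for illustration.

```r
# Flag an accession-to-scan interval (in hours).
flag_interval <- function(hours, max_hours = 24) {
  flag <- rep("ok", length(hours))
  flag[is.na(hours)] <- "missing"                          # keep, do not drop
  flag[!is.na(hours) & hours < 0] <- "negative"            # likely data error
  flag[!is.na(hours) & hours > max_hours] <- "delayed"     # possible downtime
  flag
}

# Toy case-level data: the day the case became ready and its interval.
cases <- data.frame(
  ready_day = c("2024-04-01", "2024-04-01", "2024-04-02", "2024-04-02",
                "2024-04-02", "2024-04-03", "2024-04-03"),
  ready_to_scan_h = c(2, 5, 30, 41, 26, -1, NA),
  stringsAsFactors = FALSE
)
cases$flag <- flag_interval(cases$ready_to_scan_h)

# Share of delayed cases per day; days above the (hypothetical) threshold
# are candidates for a downtime review.
delayed_share <- tapply(cases$flag == "delayed", cases$ready_day, mean)
suspect_days <- names(delayed_share)[delayed_share >= 0.5]
print(delayed_share)
print(suspect_days)
```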
The lab does not work on Sundays, so we do not expect any staining to be done on Sundays.
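This expectation can double as a data check: staining events that fall on a Sunday are worth reviewing. A minimal sketch, assuming timestamps are already parsed as `POSIXct`; the function name `is_sunday` is hypothetical.

```r
# TRUE where a timestamp falls on a Sunday (POSIXlt wday: 0 = Sunday).
is_sunday <- function(x) as.POSIXlt(x)$wday == 0

staining <- as.POSIXct(c("2024-04-07 10:00:00", "2024-04-08 10:00:00"), tz = "UTC")
sunday_flags <- is_sunday(staining)
print(data.frame(staining, sunday = sunday_flags))
```

Using the numeric `wday` field avoids locale-dependent weekday names.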
We have 2 Leica AT2 scanners and 2 GT450 scanners. The scanner each slide came from is evident in the PACS data. There are some other scanners as well. Using this data, evaluate how well we are using these machines: their downtime and their effective use.
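A first view of machine use could be slide counts per scanner per day, with idle days made explicit. The scanner labels and timestamps below are invented toy values, and the column names are assumptions.

```r
utc <- function(x) as.POSIXct(x, tz = "UTC")  # hypothetical helper

# Toy slide-level data with the scanner that produced each slide.
scans <- data.frame(
  scanner = c("AT2-1", "AT2-1", "GT450-1", "GT450-1", "GT450-1"),
  scanned = utc(c("2024-04-01 09:00:00", "2024-04-01 09:10:00",
                  "2024-04-01 11:00:00", "2024-04-02 08:00:00",
                  "2024-04-02 08:05:00")),
  stringsAsFactors = FALSE
)
scans$day <- format(scans$scanned, "%Y-%m-%d")

# Cross-tabulate so that scanner/day pairs with no slides appear as zeros.
daily <- as.data.frame(table(scanner = scans$scanner, day = scans$day),
                       responseName = "n_slides", stringsAsFactors = FALSE)

# Per-scanner summary: total slides, active days, idle days.
usage <- do.call(rbind, lapply(split(daily, daily$scanner), function(d) {
  data.frame(scanner = d$scanner[1],
             total_slides = sum(d$n_slides),
             active_days = sum(d$n_slides > 0),
             idle_days = sum(d$n_slides == 0),
             stringsAsFactors = FALSE)
}))
print(usage)
```

Idle days here only cover days on which at least one scanner produced a slide; Sundays and other closed days would need to be handled separately.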
Keep the raw data so we can refer to it later. Make all the work reproducible. Do not use hardcoded data; generate all output from the given data.
Before merging, save the lab, scanner, and PACS data separately as interim data files. We may need them later.
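A small helper could write each cleaned table to disk before the merge. This is a sketch: the `data/interim` default folder, the `.rds` format, and the function name `save_interim` are assumptions, not project conventions stated above.

```r
# Hypothetical helper: save one table as an .rds file and return its path.
save_interim <- function(df, name, dir = file.path("data", "interim")) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(dir, paste0(name, ".rds"))
  saveRDS(df, path)
  invisible(path)
}

# Demo in a temporary folder so nothing in the project is touched.
demo_dir <- file.path(tempdir(), "interim_demo")
lab <- data.frame(case_id = "12345-24", stringsAsFactors = FALSE)
lab_path <- save_interim(lab, "lab", dir = demo_dir)
restored <- readRDS(lab_path)
```

The `.rds` format keeps column types such as date-times intact, which a CSV round trip would not guarantee.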
Analyse lab, scanner, and PACS data separately.
Not all logs are complete, so do not drop cases or slides with missing logs.
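When the three sources are eventually combined, a full outer join keeps slides that are missing from one or more of them. The sketch below uses toy slide ids and timestamps; the column names are assumptions.

```r
utc <- function(x) as.POSIXct(x, tz = "UTC")  # hypothetical helper

# Toy tables: slide A is only in the lab data, C only in the scanner log.
lab <- data.frame(slide_id = c("A", "B"),
                  case_ready = utc(c("2024-04-01 09:00:00", "2024-04-01 09:30:00")),
                  stringsAsFactors = FALSE)
scanner <- data.frame(slide_id = c("B", "C"),
                      scanned = utc(c("2024-04-01 11:00:00", "2024-04-01 11:15:00")),
                      stringsAsFactors = FALSE)
pacs <- data.frame(slide_id = "B",
                   uploaded = utc("2024-04-01 11:40:00"),
                   stringsAsFactors = FALSE)

# all = TRUE keeps rows without a match; their missing times become NA.
tracked <- Reduce(function(a, b) merge(a, b, by = "slide_id", all = TRUE),
                  list(lab, scanner, pacs))

# Coverage flags make the gaps easy to count and report.
tracked$in_lab     <- !is.na(tracked$case_ready)
tracked$in_scanner <- !is.na(tracked$scanned)
tracked$in_pacs    <- !is.na(tracked$uploaded)
print(tracked)
```

Rows with missing steps then contribute `NA` durations instead of silently disappearing from the analysis.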