Identify best matching barcode pair
Let’s explore how lima is processing and determining barcode hits.
Given a ZMW with the insert of interest R
and its reverse-complement r
, each have a leading _l
and trailing _t
candidate barcode region:
┌→ r_t + adapter + R_l + R + R_t + adapter + r_l + r + --┐
└--------------------------------------------------------┘
For each barcode, lima computes the score of each barcode region. What is a barcode region? Each flanking side of an adapter is a barcode region. Actually, lima performs analysis per read, by looking at the leading and trailing region of the read. If begin and/or end of a read is not determined by an adapter, lima does not try to call barcodes and puts a -1
as score. Why would a read not determined by adapters? For example, if the high-quality region finder cut a read in the middle. This happens to almost all ZMWs, for which the first and last read is cut, so that the score list has a leading and trailing -1
. So for each individual barcode, we get a list of scores and if we take all barcodes, we have a matrix of barcode regions as columns and rows for each barcode.
Example for a really good asymmetric pair of barcodes out of 4 candidate barcodes:
Barcode | R_l | R_t | r_l | r_t | R_l | R_t | r_l | r_t | R_l | R_t |
---|---|---|---|---|---|---|---|---|---|---|
bc0 | -1 | 80 | 69 | 12 | 26 | 97 | 85 | 23 | 40 | -1 |
bc1 | -1 | 12 | 7 | 23 | 45 | 13 | 7 | 33 | 25 | -1 |
bc2 | -1 | 52 | 23 | 89 | 87 | 23 | 34 | 79 | 85 | -1 |
bc3 | -1 | 32 | 50 | 19 | 8 | 23 | 17 | 18 | 13 | -1 |
bc4 | -1 | 47 | 35 | 86 | 79 | 17 | 29 | 75 | 80 | -1 |
First step: Compute mean barcode score, omitting -1
:
Barcode | Mean Score |
---|---|
bc0 | 54 |
bc1 | 21 |
bc2 | 59 |
bc3 | 23 |
bc4 | 56 |
Pick the best barcode, here bc2
, and set it as First
barcode.
Compute the average element-wise absolute difference between First
and each remaining barcode; set the alternative barcode with the lowest distance to First
as FirstAlt
barcode.
Compute all pairwise max with bc2
(bold exchanged scores):
Barcode Combination | R_l | R_t | r_l | r_t | R_l | R_t | r_l | r_t | R_l | R_t |
---|---|---|---|---|---|---|---|---|---|---|
bc2+bc0 | -1 | 80 | 69 | 89 | 87 | 97 | 85 | 79 | 85 | -1 |
bc2+bc1 | -1 | 52 | 23 | 89 | 87 | 23 | 34 | 79 | 85 | -1 |
bc2+bc3 | -1 | 52 | 50 | 89 | 87 | 23 | 34 | 79 | 85 | -1 |
bc2+bc4 | -1 | 52 | 35 | 89 | 87 | 23 | 34 | 79 | 85 | -1 |
Compute mean of each combined barcode list:
Barcode Combination | Mean Score |
---|---|
bc2+bc0 | 84 |
bc2+bc1 | 59 |
bc2+bc3 | 62 |
bc2+bc4 | 61 |
Pick the best mean combined: bc2+bc0 with mean score 84
.
Set bc0
as SecondCandidate
.
Compute the average element-wise absolute difference between SecondCandidate
and each remaining barcode; set the alternative barcode with the lowest distance to SecondCandidate
as SecondAlt
barcode.
Determine if the barcode pair with different barcodes bc2+bc0
is better than a barcode pair with the same barcode bc2
. Here shines our new threshold, which is the minimum difference of the mean scores between the single and combined result:
signal_increase = 84-59 = 25
If signal_increase
is greater than the --min-signal-increase
threshold, accept the combined solution. Otherwise, stick with the best single barcode and copy all information from First
to Combined
. Set the sequence of barcode indices as IdxsCombined
to allow a traceback which barcode contributed to which score.
Why is such threshold necessary? Imagine the data is truly symmetric/tailed, has the same barcode on both sides. The majority of the ZMWs should pick one barcode…but that’s not what happens in reality. Taking the example from above and only looking at bc1-4
, we’d pick bc2
as the best barcode and for the combined solution, we would pick bc2+bc3
, as the mean score increases by 3
points. Only because a single barcode region scored higher (50
vs 23
), we’d choose an asymmetric/different pair. This does not make sense at all, thus we need a minimal combined score difference to the single best score.
Why did we pick a FirstAlt
barcode? The average absolute pairwise distance between scores of First
and FistAlt
measures the mean score lead of the best to the next best alternative barcode. If that score lead is larger than the provided --min-score-lead
threshold, default 10
, accept the chosen barcodes for this ZMW. Otherwise, the difference is too small and the ZMW can’t be labelled reliably.
Why the need for SecondAlt
? In a pair with different barcodes, passing --min-signal-increase
, use IdxsCombined
to create a combined version of ScoresFirstAlt
and ScoresSecondAlt
and compute the score lead to ScoresCombined
.
Smith-Waterman 101
Barcode score and clipping position are computed by a Smith-Waterman algorithm. The dynamic-programming matrix has the barcode on the vertical and the target sequence on the horizontal axis. The initialization of the first row and column follows a glocal alignment; global in the reference, local in the query. The best score is determined by chosing the maximum in the last row, which is also the clipping position. This allows us to skip overhang from the adapter or alien DNA like primer IDs or known as molecular identifiers.
R | E | F | E | R | E | N | C | E | ||
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
B | -1 | x | ||||||||
A | -2 | x | ||||||||
R | -3 | x | ||||||||
C | -4 | x | ||||||||
O | -5 | x | ||||||||
D | -6 | x | ||||||||
E | -7 | x |
For the trailing barcode region, the sequence of the reference window gets reverse-complemented and the clipping position gets transformed back into the correct coordinate system.
Barcode score
The barcode score is an indicator how well the chosen barcode pair matches. After identifying the highest barcode score, it gets normalized:
(100 * sw_score) / (sw_match_score * barcode_length)
The range is between 0 and 100, whereas 0 is no hit and 100 perfect match. The provided mean score is the mean of both normalized barcode scores.
Which minimum barcode score?
For CLR data, both, no threshold and 26
were tested extensively with downstream applications to assure that results are not convoluted by contaminants. A much lower threshold can be used, because additional internal filters in lima remove unreliable calls that go beyond simplistic --min-score
thresholding.
For HiFi data, it depends on the amount of contamination you are willing to accept. The recommended --min-score 80
achieves a precision >99.99%. Without --min-score
filtering precision is >99.95%
. For more information, read the FAQ about precision.