Identify best matching barcode pair

Let’s explore how lima is processing and determining barcode hits.

Given a ZMW with the insert of interest R and its reverse-complement r, each have a leading _l and trailing _t candidate barcode region:

┌→ r_t + adapter + R_l + R + R_t + adapter + r_l + r + --┐
└--------------------------------------------------------┘

For each barcode, lima computes the score of each barcode region. What is a barcode region? Each flanking side of an adapter is a barcode region. Actually, lima performs analysis per read, by looking at the leading and trailing region of the read. If begin and/or end of a read is not determined by an adapter, lima does not try to call barcodes and puts a -1 as score. Why would a read not determined by adapters? For example, if the high-quality region finder cut a read in the middle. This happens to almost all ZMWs, for which the first and last read is cut, so that the score list has a leading and trailing -1. So for each individual barcode, we get a list of scores and if we take all barcodes, we have a matrix of barcode regions as columns and rows for each barcode.

Example for a really good asymmetric pair of barcodes out of 4 candidate barcodes:

Barcode R_l R_t r_l r_t R_l R_t r_l r_t R_l R_t
bc0 -1 80 69 12 26 97 85 23 40 -1
bc1 -1 12 7 23 45 13 7 33 25 -1
bc2 -1 52 23 89 87 23 34 79 85 -1
bc3 -1 32 50 19 8 23 17 18 13 -1
bc4 -1 47 35 86 79 17 29 75 80 -1

First step: Compute mean barcode score, omitting -1:

Barcode Mean Score
bc0 54
bc1 21
bc2 59
bc3 23
bc4 56

Pick the best barcode, here bc2, and set it as First barcode.

Compute the average element-wise absolute difference between First and each remaining barcode; set the alternative barcode with the lowest distance to First as FirstAlt barcode.

Compute all pairwise max with bc2 (bold exchanged scores):

Barcode
Combination
R_l R_t r_l r_t R_l R_t r_l r_t R_l R_t
bc2+bc0 -1 80 69 89 87 97 85 79 85 -1
bc2+bc1 -1 52 23 89 87 23 34 79 85 -1
bc2+bc3 -1 52 50 89 87 23 34 79 85 -1
bc2+bc4 -1 52 35 89 87 23 34 79 85 -1

Compute mean of each combined barcode list:

Barcode
Combination
Mean Score
bc2+bc0 84
bc2+bc1 59
bc2+bc3 62
bc2+bc4 61

Pick the best mean combined: bc2+bc0 with mean score 84.

Set bc0 as SecondCandidate.

Compute the average element-wise absolute difference between SecondCandidate and each remaining barcode; set the alternative barcode with the lowest distance to SecondCandidate as SecondAlt barcode.

Determine if the barcode pair with different barcodes bc2+bc0 is better than a barcode pair with the same barcode bc2. Here shines our new threshold, which is the minimum difference of the mean scores between the single and combined result:

signal_increase = 84-59 = 25

If signal_increase is greater than the --min-signal-increase threshold, accept the combined solution. Otherwise, stick with the best single barcode and copy all information from First to Combined. Set the sequence of barcode indices as IdxsCombined to allow a traceback which barcode contributed to which score.

Why is such threshold necessary? Imagine the data is truly symmetric/tailed, has the same barcode on both sides. The majority of the ZMWs should pick one barcode…but that’s not what happens in reality. Taking the example from above and only looking at bc1-4, we’d pick bc2 as the best barcode and for the combined solution, we would pick bc2+bc3, as the mean score increases by 3 points. Only because a single barcode region scored higher (50 vs 23), we’d choose an asymmetric/different pair. This does not make sense at all, thus we need a minimal combined score difference to the single best score.

Why did we pick a FirstAlt barcode? The average absolute pairwise distance between scores of First and FistAlt measures the mean score lead of the best to the next best alternative barcode. If that score lead is larger than the provided --min-score-lead threshold, default 10, accept the chosen barcodes for this ZMW. Otherwise, the difference is too small and the ZMW can’t be labelled reliably.

Why the need for SecondAlt? In a pair with different barcodes, passing --min-signal-increase, use IdxsCombined to create a combined version of ScoresFirstAlt and ScoresSecondAlt and compute the score lead to ScoresCombined.

Smith-Waterman 101

Barcode score and clipping position are computed by a Smith-Waterman algorithm. The dynamic-programming matrix has the barcode on the vertical and the target sequence on the horizontal axis. The initialization of the first row and column follows a glocal alignment; global in the reference, local in the query. The best score is determined by chosing the maximum in the last row, which is also the clipping position. This allows us to skip overhang from the adapter or alien DNA like primer IDs or known as molecular identifiers.

    R E F E R E N C E
  0 0 0 0 0 0 0 0 0 0
B -1     x            
A -2       x          
R -3         x        
C -4         x        
O -5           x      
D -6             x    
E -7             x    

For the trailing barcode region, the sequence of the reference window gets reverse-complemented and the clipping position gets transformed back into the correct coordinate system.

Barcode score

The barcode score is an indicator how well the chosen barcode pair matches. After identifying the highest barcode score, it gets normalized:

(100 * sw_score) / (sw_match_score * barcode_length)

The range is between 0 and 100, whereas 0 is no hit and 100 perfect match. The provided mean score is the mean of both normalized barcode scores.

Which minimum barcode score?

For CLR data, both, no threshold and 26 were tested extensively with downstream applications to assure that results are not convoluted by contaminants. A much lower threshold can be used, because additional internal filters in lima remove unreliable calls that go beyond simplistic --min-score thresholding.

For HiFi data, it depends on the amount of contamination you are willing to accept. The recommended --min-score 80 achieves a precision >99.99%. Without --min-score filtering precision is >99.95%. For more information, read the FAQ about precision.


THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.