PatternJoinThreaded

This project is an experimental version of PatternJoin project, and is designed to test parallelization capabilities of our algorithms. In order to read more about the main algorithms please visit our C++ or R packages.

What is Edit Similarity Join?

Edit distance between two words is the minimal number of elementary operations needed to transform one word to another. In this project, we consider Levenshtein distance (allows substitutions and insertions/deletions of single letters) and Hamming distance (allows substitutions of any letters and insertions/deletions of letters in the end).

Edit Similarity Join is a specific problem when we have a set of words $X=\{w_1, w_2, \ldots, w_n\}$ and we want to find all pairs of 'similar words' $Y_c=\{(w_i, w_j): distance(w_i, w_j) < c\}$, where $c \in \mathbb{N}_0$ is called a cutoff parameter.

Motivation

Let's briefly introduce the problem why this project was originally developed.

Intro to TCR(BCR) profiles

T-cells and B-cells are types of white blood immune cells which play a major role in human immunity. On the surface of each T/B-cell we can find receptors (TCR/BCR) - special molecules that are responsible for recognizing antigens. The collection of all TCRs/BCRs is called TCR (BCR) profile. One way to represent an organism's TCR/BCR profile is to collect the genetic sequences of most variable region (CDR3) of each TCR/BCR. Recent research (for instance [1], [2]) shows that it is useful to (1) represent these profiles as graphs with nodes being TCR/BCR and edge weights edit distances between nodes and (2) run some network analysis on these graphs (take a look at the NAIR project if you are interested in network analysis in immunology). In order to build these graphs we need efficient edit similarity join software and this is the main purpose of this project.

Parallel Hash Set and Hash Map Performance benchmarks

To run performance benchmarks you need to compile performance_testing subproject:

> cd performance_testing
> mkdir build; cd build
> cmake -DCMAKE_BUILD_TYPE=Release ..; make
> cd ..

Then in order you can run hash_set and hash_map tests using

# assuming you are in PatternJoinThreaded/performance_testing directory
> ./build/threaded_hash_sets
> ./build/threaded_hash_maps

When the tests will finish generate figures using

> pip install pandas seaborn matplotlib
> python plot_results.py

Parallel Partition-Pattern Join Algorithm

To run partition_pattern algorithm using several threads firstly compile the project

# assuming you are in PatternJoinThreaded directory
> mkdir build; cd build; cmake -DCMAKE_BUILD_TYPE=Release ..; make; cd ..

Below we give an example of algorithm launch:

> export FILE_NAME=./performance_testing/test_data/P00245-aa
> export CUTOFF=2
> export METRIC_TYPE=L
> export METHOD=partition_pattern
> export INCLUDE_DUPLICATES=false
> ./build/pattern_join --file_name $FILE_NAME --cutoff $CUTOFF --metric_type $METRIC_TYPE --method $METHOD --include_duplicates $INCLUDE_DUPLICATES

Full list of arguments:

<file_name>: The path to the input file.
<cutoff>: The edit distance cutoff (0, 1 or 2). If cutoff = 0, then the value of metric_type, method, and include_duplicates does not matter.
<metric_type>: The edit distance metric (L for Levenshtein, H for Hamming).
<method>: For threaded implementation we currently tested only partition_pattern.
<include_duplicates>: Consider duplicates in input (true or false). If false the program will ignore duplicate strings in the input and output unique pairs of strings. If true, the program will treat duplicate strings in the input as a pair (index, string) and output pairs of indices.

Input file format

List of words separated by \n: <word_1>\n<word_2>\n....

Output file format

If include_duplicates = true: Space-separated pairs of words separated by \n: <word_i> <word_j>\n.... If include_duplicates = false: Space-separated pairs of indeces separated by \n: <idx_i> <idx_j>\n....

Name		Name	Last commit message	Last commit date
Latest commit History 174 Commits
.vscode		.vscode
benchmark/test_data		benchmark/test_data
benchmarks		benchmarks
performance_testing		performance_testing
src		src
test		test
thirdparty		thirdparty
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

PatternJoinThreaded

What is Edit Similarity Join?

Motivation

Intro to TCR(BCR) profiles

Parallel Hash Set and Hash Map Performance benchmarks

Parallel Partition-Pattern Join Algorithm

Full list of arguments:

Input file format

Output file format

About

Uh oh!

Releases

Packages

Languages

License

MatveevDaniil/PatternJoinThreaded

Folders and files

Latest commit

History

Repository files navigation

PatternJoinThreaded

What is Edit Similarity Join?

Motivation

Intro to TCR(BCR) profiles

Parallel Hash Set and Hash Map Performance benchmarks

Parallel Partition-Pattern Join Algorithm

Full list of arguments:

Input file format

Output file format

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages