Caffery Yang edited this page May 4, 2025 · 2 revisions

Generated on: 2025-05-03 21:34:20

Table of Contents

Introduction to mLLMCelltype

Related Files

  • README.md
  • assets/mLLMCelltype_logo.png

Related Pages

Related topics: Installation and Setup, Core Functionality and Usage

Introduction to mLLMCelltype

mLLMCelltype predicts cell types from gene expression data using multi-modal Large Language Models (mLLMs). It uses LLMs to combine data modalities, such as gene expression and existing cell annotations, with the aim of improving cell type classification accuracy.

Purpose and Functionality

mLLMCelltype aims to identify cell types more accurately and flexibly than traditional machine learning approaches. The premise is that mLLMs can capture complex relationships between genes and cell types, which may help most when the data is noisy or incomplete. The tool takes gene expression data as input and outputs predicted cell types along with confidence scores.

Core Components

The repository contains the following key elements:

  • README.md: Provides an overview of the project, including its purpose, usage instructions, and contributors.
  • assets/mLLMCelltype_logo.png: Contains the logo for the mLLMCelltype project.

Overall Architecture

The architecture of mLLMCelltype involves several stages:

  1. Data Input: Accepts gene expression data (e.g., scRNA-seq data) and optionally, existing cell annotations.
  2. Feature Extraction: Extracts relevant features from the gene expression data.
  3. mLLM Integration: Feeds the extracted features into a pre-trained mLLM.
  4. Cell Type Prediction: The mLLM predicts the cell type based on the input features.
  5. Output: Provides the predicted cell type and associated confidence scores.

graph TD
    A[Gene Expression Data] --> B(Feature Extraction)
    B --> C(mLLM Integration)
    C --> D{Cell Type Prediction}
    D --> E[Predicted Cell Type and Confidence]

Setup and Usage Instructions

Refer to the README.md file for the authoritative setup and usage instructions. The general outline is:

  1. Installation:
    • Clone the repository: git clone https://github.com/cafferychen777/mLLMCelltype.git
    • Install the required dependencies (specified in requirements.txt or similar): pip install -r requirements.txt
  2. Data Preparation:
    • Format your gene expression data into a compatible format (e.g., a CSV file where rows are cells and columns are genes).
  3. Configuration:
    • Configure the mLLM settings (e.g., model name, API key).
  4. Execution:
    • Run the main script with the appropriate parameters: python main.py --data_path data.csv --model_name my_mLLM
  5. Output Interpretation:
    • Analyze the output file containing the predicted cell types and confidence scores.
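
The CSV layout described in step 2 (rows are cells, columns are genes) can be read with the Python standard library alone. The sketch below is illustrative only: the function name load_expression_csv, the cell_id first column, and the inline sample values are assumptions made for this example and are not taken from the mLLMCelltype code.

```python
import csv
import io


def load_expression_csv(handle):
    """Read a cells-by-genes CSV into {cell_id: {gene: value}}.

    Assumes the first column holds the cell identifier and every
    other column is a gene with a numeric expression value.
    """
    reader = csv.reader(handle)
    header = next(reader)
    genes = header[1:]
    expression = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        cell_id, values = row[0], row[1:]
        if len(values) != len(genes):
            raise ValueError(
                f"Row for {cell_id!r} has {len(values)} values, expected {len(genes)}"
            )
        expression[cell_id] = {g: float(v) for g, v in zip(genes, values)}
    return expression


# Inline sample standing in for data.csv (hypothetical values)
sample = io.StringIO("cell_id,CD3E,MS4A1\ncell1,2.5,0.0\ncell2,0.1,3.2\n")
expression = load_expression_csv(sample)
print(expression["cell1"]["CD3E"])
```

Validating the row width up front gives a clear error for ragged files instead of silently dropping genes.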

Code Examples

While specific code examples would be found in the project's scripts, here's a conceptual example of how the mLLM might be used for cell type prediction:

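# Conceptual example (replace with actual implementation)
# StubModel is a hypothetical stand-in for a real mLLM client so that the
# example runs as written; it is not part of mLLMCelltype.

class StubModel:
    """Toy model: labels a cell by its most highly expressed gene."""

    def __init__(self, name):
        self.name = name

    def predict(self, gene_expression_data):
        top_gene = max(gene_expression_data, key=gene_expression_data.get)
        return f"{top_gene}-high cell"

def predict_cell_type(gene_expression_data, model_name="default_mLLM"):
    """
    Predicts cell type based on gene expression data using an mLLM.

    Args:
        gene_expression_data (dict): A dictionary of gene names and expression values.
        model_name (str): The name of the mLLM to use.

    Returns:
        str: The predicted cell type.
    """
    model = StubModel(model_name)
    prediction = model.predict(gene_expression_data)
    return prediction

# Example usage
data = {"geneA": 2.5, "geneB": 1.0, "geneC": 3.2}
cell_type = predict_cell_type(data)
print(f"Predicted cell type: {cell_type}")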

Component Relationships

The following diagram illustrates the relationships between the key components of the mLLMCelltype system:

graph TD
    A[User Input: Gene Expression Data] --> B(Data Preprocessing)
    B --> C{Feature Selection}
    C --> D[mLLM Model]
    D --> E{Cell Type Prediction}
    E --> F[Output: Predicted Cell Types]

Installation and Setup

Related Files

  • README.md
  • R/DESCRIPTION
  • python/setup.py
  • python/requirements.txt

Related Pages

Related topics: Core Functionality and Usage

Installation and Setup

This page provides instructions for installing and setting up the mLLMCelltype repository.

Overview

The mLLMCelltype repository aims to predict cell types using multi-modal Large Language Models (mLLMs). The setup involves installing both R and Python dependencies, along with configuring the necessary environment.

Repository Structure and Key Files

  • README.md: Provides a high-level overview of the project, including its purpose, usage instructions, and relevant links. It serves as the entry point for understanding the project.
  • R/DESCRIPTION: An R package description file containing metadata about the R package, such as its name, version, dependencies, and description.
  • python/setup.py: A Python setup script used to build and install the Python package. It specifies the package's dependencies and other installation-related information.
  • python/requirements.txt: A text file listing the Python packages required to run the Python components of the project. This file is used by pip to install the necessary dependencies.

Installation Steps

1. Clone the Repository

Clone the mLLMCelltype repository to your local machine:

git clone https://github.com/cafferychen777/mLLMCelltype.git
cd mLLMCelltype

2. Install R Dependencies

Navigate to the R directory and install the required R packages.

cd R
R

In the R console, run:

install.packages("remotes")
remotes::install_deps(dependencies = TRUE)

This will install the dependencies specified in the DESCRIPTION file. The DESCRIPTION file contains the following information (example):

Package: mLLMCelltype
Title: Multi-Modal Large Language Model for Cell Type Prediction
Version: 0.1.0
Description: An R package to integrate with Python-based mLLMs for cell type prediction.
Authors@R: person("Caffrey", "Chen", email = "[email protected]", role = c("aut", "cre"))
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Depends:
    R (>= 3.5.0)
Imports:
    Seurat,
    SingleR,
    tidyverse
Suggests:
    knitr,
    rmarkdown

3. Install Python Dependencies

Navigate to the python directory and create a virtual environment (recommended).

cd ../python
python3 -m venv venv
source venv/bin/activate  # On Linux/macOS
# venv\Scripts\activate  # On Windows

Install the Python dependencies using pip:

pip install --upgrade pip
pip install -r requirements.txt

The requirements.txt file lists the Python packages required for the project. An example requirements.txt might look like this:

torch
transformers
pandas
scikit-learn

4. Install the Python Package (Optional)

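If the Python code is structured as a package, you can install it from the python directory with pip, which reads setup.py (invoking python setup.py install directly is deprecated by setuptools):

pip install .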

The setup.py file is used to build and install the Python package. It contains metadata about the package and its dependencies. An example setup.py might look like this:

from setuptools import setup, find_packages

setup(
    name='mLLMCelltype',
    version='0.1.0',
    packages=find_packages(),
    install_requires=[
        'torch',
        'transformers',
        'pandas',
        'scikit-learn'
    ],
)

5. Environment Configuration (if applicable)

The README.md file should contain any specific environment variables or configuration steps required to run the mLLMCelltype code. For example, it might specify API keys or file paths that need to be set. Follow the instructions in the README.md to configure your environment.
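
One common way to handle API keys is to read them from environment variables and fail early when one is missing. The following is a minimal sketch under that assumption. The helper load_api_keys and the variable name EXAMPLE_LLM_API_KEY are hypothetical; use whatever names README.md specifies.

```python
import os


def load_api_keys(required, env=None):
    """Return {name: value} for each required variable, or raise if any are unset."""
    env = os.environ if env is None else env
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in required}


# Hypothetical variable name; a plain dict stands in for os.environ here
demo_env = {"EXAMPLE_LLM_API_KEY": "sk-demo"}
keys = load_api_keys(["EXAMPLE_LLM_API_KEY"], env=demo_env)
print(sorted(keys))
```

Passing env explicitly keeps the function testable without touching the real process environment.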

Component Relationships

Here's a diagram illustrating the relationship between the main components:

graph TD
    A[R Package] --> B(Python Package);
    B --> C{mLLM Models};
    A --> D[Seurat Object];
    D --> B;
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#ccf,stroke:#333,stroke-width:2px
    style C fill:#ccf,stroke:#333,stroke-width:2px
    style D fill:#f9f,stroke:#333,stroke-width:2px

Usage Instructions

Refer to the README.md file for detailed usage instructions and examples. The basic workflow involves loading data in R (Seurat object), passing it to the Python module, running the mLLM model, and then retrieving the results back into R.


Core Functionality and Usage

Related Files

  • R/R/cell_type_annotation.R
  • R/R/consensus_annotation.R
  • python/mllmcelltype/annotate.py
  • python/mllmcelltype/consensus.py
  • python/examples/consensus_example.py

Related Pages

Related topics: Customization and Advanced Features, Understanding Uncertainty Metrics

Core Functionality and Usage

This page details the core functionality of the mLLMCelltype repository, focusing on cell type annotation and consensus-building across different modalities.

1. Overview

The mLLMCelltype repository provides tools for automated cell type annotation using multi-modal Large Language Models (mLLMs). It encompasses both R and Python implementations for annotating cell types based on gene expression data and building consensus annotations from multiple sources. The core functionalities are implemented in the following files:

  • R/R/cell_type_annotation.R: R implementation for cell type annotation.
  • R/R/consensus_annotation.R: R implementation for building consensus annotations.
  • python/mllmcelltype/annotate.py: Python implementation for cell type annotation.
  • python/mllmcelltype/consensus.py: Python implementation for building consensus annotations.
  • python/examples/consensus_example.py: Example script demonstrating how to use the consensus annotation functionality in Python.

2. Component Relationships and Architecture

The overall architecture involves annotating cell types using individual methods and then combining these annotations into a consensus annotation. The R and Python implementations provide similar functionalities but cater to different user preferences and integration needs.

graph TD
    A[Expression Data] --> B(Cell Type Annotation - R);
    A --> C(Cell Type Annotation - Python);
    B --> D(Consensus Annotation - R);
    C --> E(Consensus Annotation - Python);
    D --> F[Final Cell Type Assignments];
    E --> F;

3. Detailed Functionality

3.1. Cell Type Annotation

3.1.1. R (R/R/cell_type_annotation.R)

This R script likely contains functions to perform cell type annotation based on gene expression data. While the exact implementation details are not available without access to the code, it would typically involve:

  1. Data Input: Reading gene expression data (e.g., from a Seurat object or a matrix).
  2. Feature Selection: Identifying marker genes or features relevant for cell type identification.
  3. Annotation: Using a pre-trained model or a reference dataset to assign cell types to individual cells.
  4. Output: Returning a data frame or a vector containing cell type assignments.
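
Step 2 (feature selection) is often approached by ranking genes on how much higher their expression is in one cluster than in all other cells. The Python sketch below illustrates that idea with a plain difference of means. It is not a statistical test, and the function name top_marker_genes and the sample values are assumptions rather than a description of what cell_type_annotation.R does.

```python
from statistics import mean


def top_marker_genes(expression, clusters, target, n=2):
    """Rank genes by mean expression in `target` minus mean expression elsewhere.

    expression: {cell_id: {gene: value}}
    clusters:   {cell_id: cluster_label}
    """
    inside = [expression[c] for c in expression if clusters[c] == target]
    outside = [expression[c] for c in expression if clusters[c] != target]
    if not inside or not outside:
        raise ValueError("Need at least one cell inside and outside the target cluster")
    genes = inside[0].keys()
    diff = {
        g: mean(cell[g] for cell in inside) - mean(cell[g] for cell in outside)
        for g in genes
    }
    # Largest positive difference first
    return sorted(diff, key=diff.get, reverse=True)[:n]


# Hypothetical toy data: two cells in cluster "0", one in cluster "1"
expression = {
    "cell1": {"CD3E": 3.0, "MS4A1": 0.1, "ACTB": 5.0},
    "cell2": {"CD3E": 2.5, "MS4A1": 0.0, "ACTB": 5.2},
    "cell3": {"CD3E": 0.2, "MS4A1": 3.1, "ACTB": 5.1},
}
clusters = {"cell1": "0", "cell2": "0", "cell3": "1"}
print(top_marker_genes(expression, clusters, target="0", n=1))
```

Genes expressed at a similar level everywhere (ACTB in the toy data) score near zero and fall out of the ranking.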

3.1.2. Python (python/mllmcelltype/annotate.py)

The Python implementation mirrors the R script. It likely uses libraries such as scanpy, anndata, or pandas for data manipulation and machine learning libraries for annotation.

# python/mllmcelltype/annotate.py (Example - may not be exact)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def annotate_cell_types(expression_data: pd.DataFrame, model: RandomForestClassifier) -> pd.Series:
    """
    Annotates cell types based on gene expression data using a pre-trained model.

    Args:
        expression_data: Gene expression data (rows are cells, columns are genes).
        model: A pre-trained RandomForestClassifier model.

    Returns:
        A pandas Series containing cell type assignments for each cell.
    """
    predictions = model.predict(expression_data)
    return pd.Series(predictions, index=expression_data.index)

3.2. Consensus Annotation

3.2.1. R (R/R/consensus_annotation.R)

This R script focuses on combining multiple cell type annotations into a single, more robust consensus annotation. This is useful when annotations are obtained from different methods or datasets. A typical implementation would involve:

  1. Input: Taking multiple cell type annotation vectors or data frames as input.
  2. Normalization/Mapping: Mapping cell type names across different annotation sources to a common vocabulary.
  3. Consensus Building: Using a voting scheme or a more sophisticated algorithm to determine the consensus cell type for each cell.
  4. Output: Returning a vector or data frame containing the consensus cell type assignments.
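
Steps 2 and 3 can be sketched together: map each source's label onto a shared vocabulary, then take a majority vote and report how many sources agreed. This is a minimal illustration in Python. The SYNONYMS table and the function names are hypothetical, and a real vocabulary mapping would need to be far more complete.

```python
from collections import Counter

# Hypothetical synonym table mapping source-specific names to a shared vocabulary
SYNONYMS = {
    "t cell": "T cell",
    "t cells": "T cell",
    "t-cell": "T cell",
    "b cell": "B cell",
    "b cells": "B cell",
    "b lymphocyte": "B cell",
}


def normalize(label):
    """Map a raw label to its canonical name; unknown labels pass through trimmed."""
    key = label.strip().lower()
    return SYNONYMS.get(key, label.strip())


def vote(labels):
    """Return (consensus_label, fraction_of_sources_agreeing)."""
    counts = Counter(normalize(label) for label in labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)


label, fraction = vote(["T cells", "T-cell", "NK cell"])
print(label, round(fraction, 2))
```

Counter.most_common breaks ties by first appearance, so a tied vote silently favors the earliest source. Returning the agreement fraction lets callers flag low-agreement cells instead of trusting the label blindly.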

3.2.2. Python (python/mllmcelltype/consensus.py)

The Python implementation provides consensus-building functionality similar to that of the R script.

# python/mllmcelltype/consensus.py (Example - may not be exact)
import pandas as pd
from collections import Counter

def build_consensus(annotations: list[pd.Series]) -> pd.Series:
    """
    Builds a consensus cell type annotation from multiple input annotations.

    Args:
        annotations: A list of pandas Series, where each Series contains cell type annotations.

    Returns:
        A pandas Series containing the consensus cell type assignments.
    """
    consensus_annotations = {}
    for cell_id in annotations[0].index:
        cell_annotations = [anno[cell_id] for anno in annotations]
        most_common = Counter(cell_annotations).most_common(1)[0][0]
        consensus_annotations[cell_id] = most_common
    return pd.Series(consensus_annotations)

3.2.3. Example (python/examples/consensus_example.py)

This script demonstrates how to use the build_consensus function in python/mllmcelltype/consensus.py. It likely involves:

  1. Generating or loading example cell type annotations from different methods.
  2. Calling the build_consensus function with these annotations.
  3. Printing or saving the resulting consensus annotation.

# python/examples/consensus_example.py
import pandas as pd
from mllmcelltype.consensus import build_consensus

# Example annotations
annotation1 = pd.Series({"cell1": "T cell", "cell2": "B cell", "cell3": "T cell"})
annotation2 = pd.Series({"cell1": "T cell", "cell2": "B cell", "cell3": "NK cell"})
annotation3 = pd.Series({"cell1": "T cell", "cell2": "B cell", "cell3": "T cell"})

annotations = [annotation1, annotation2, annotation3]

# Build consensus annotation
consensus_annotation = build_consensus(annotations)
print(consensus_annotation)

4. Data Flow

The data flow within the consensus annotation process can be visualized as follows:

sequenceDiagram
    participant User
    participant Annotation1
    participant Annotation2
    participant ConsensusBuilder

    User->>Annotation1: Provide Annotation Data
    User->>Annotation2: Provide Annotation Data
    User->>ConsensusBuilder: Call build_consensus([Annotation1, Annotation2])
    ConsensusBuilder->>ConsensusBuilder: Iterate through cells
    ConsensusBuilder->>ConsensusBuilder: Determine most frequent cell type
    ConsensusBuilder-->>User: Return Consensus Annotation

5. Setup and Usage Instructions

5.1. Python

  1. Install the package: Assuming the package is structured correctly, you would install it using pip:

    pip install mllmcelltype
  2. Use the annotation and consensus functions:

    from mllmcelltype.annotate import annotate_cell_types # if available
    from mllmcelltype.consensus import build_consensus
    import pandas as pd
    
    # Example usage (adjust based on actual function signatures)
    # Assuming you have expression_data and a pre-trained model
    # cell_type_predictions = annotate_cell_types(expression_data, model)
    
    # Example usage for consensus
    annotation1 = pd.Series({"cell1": "T cell", "cell2": "B cell"})
    annotation2 = pd.Series({"cell1": "T cell", "cell2": "B cell"})
    annotations = [annotation1, annotation2]
    consensus = build_consensus(annotations)
    print(consensus)

5.2. R

  1. Install the package: Assuming the package is structured correctly, you would install it using devtools:

    # Install devtools if you don't have it
    # install.packages("devtools")
    devtools::install_github("cafferychen777/mLLMCelltype", subdir = "R") # The R package lives in the R/ subdirectory; or install from a local directory
  2. Use the annotation and consensus functions:

    library(mLLMCelltype)
    
    # Example usage (adjust based on actual function signatures)
    # Assuming you have expression_data and a pre-trained model
    # cell_type_predictions <- annotate_cell_types(expression_data, model)
    
    # Example usage for consensus
    annotation1 <- c("T cell", "B cell")
    annotation2 <- c("T cell", "B cell")
    annotations <- list(annotation1, annotation2)
    consensus <- build_consensus(annotations)
    print(consensus)

Customization and Advanced Features

Related Files

  • R/R/prompt_templates.R
  • R/R/custom_model_manager.R
  • python/mllmcelltype/prompts.py
  • python/mllmcelltype/providers/__init__.py

Related Pages

Related topics: Core Functionality and Usage

Customization and Advanced Features

This page details customization options and advanced features within the mLLMCelltype repository, focusing on prompt engineering, custom model integration, and provider management.

Prompt Engineering

Prompt engineering is crucial for guiding the LLMs to produce accurate cell type predictions. The repository provides mechanisms for customizing prompts in both R and Python.

R: R/R/prompt_templates.R

This file likely contains R functions or data structures to define and manage prompt templates used within the R-based components of mLLMCelltype.

Purpose:

The prompt_templates.R file allows users to modify the prompts sent to the LLM. This is essential for adapting the model to specific datasets, improving accuracy, or experimenting with different prompting strategies.

Functionality:

The file likely contains functions to:

  • Load default prompt templates.
  • Modify existing templates.
  • Create new templates.
  • Apply templates to data.

Example (Hypothetical):

# R/R/prompt_templates.R

# Function to load a prompt template
load_prompt_template <- function(template_name) {
  # Example: Load a template from a file
  template_path <- file.path("path/to/templates", paste0(template_name, ".txt"))
  if (file.exists(template_path)) {
    readChar(template_path, file.info(template_path)$size)
  } else {
    stop("Template not found: ", template_name)
  }
}

# Function to modify a prompt template
modify_prompt_template <- function(template, new_instruction) {
  # Example: Replace a placeholder in the template.
  # fixed = TRUE matches the braces literally, so no regex escaping is needed.
  gsub("{{INSTRUCTION}}", new_instruction, template, fixed = TRUE)
}

# Default prompt (placeholders are written literally; escaping belongs only in regex patterns)
default_prompt <- "Predict cell type based on these markers: {{MARKERS}}"

Explanation:

The hypothetical example shows functions to load and modify prompt templates. The load_prompt_template function reads a template from a file. The modify_prompt_template function replaces placeholders in the template with user-defined instructions. The default_prompt variable shows a basic template.

Integration:

This file integrates with other R scripts that use LLMs for cell type prediction. The R scripts would call functions from prompt_templates.R to retrieve and customize prompts before sending them to the LLM.

Python: python/mllmcelltype/prompts.py

This file serves a similar purpose to R/R/prompt_templates.R, but for the Python-based components.

Purpose:

The prompts.py file provides a way to customize the prompts used by the Python components of mLLMCelltype. This allows for fine-tuning the model's behavior and adapting it to different datasets or experimental setups.

Functionality:

The file likely contains:

  • Default prompt templates (as strings or functions).
  • Functions to load, modify, and manage prompts.
  • Classes to represent prompt templates.

Example:

# python/mllmcelltype/prompts.py

class PromptTemplate:
    def __init__(self, template_string):
        self.template = template_string

    def format(self, **kwargs):
        return self.template.format(**kwargs)

DEFAULT_PROMPT = PromptTemplate("Predict the cell type based on these markers: {markers}")

def create_prompt(markers):
    return DEFAULT_PROMPT.format(markers=markers)

Explanation:

The PromptTemplate class encapsulates a prompt string and provides a format method to insert variables. DEFAULT_PROMPT is an instance of PromptTemplate with a default prompt. The create_prompt function uses the DEFAULT_PROMPT and inserts the markers.

Integration:

Python scripts within mLLMCelltype will import the prompts.py module and use its functions or classes to generate prompts before interacting with the LLM.
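
When a prompt has to cover several clusters at once, the marker lists need to be flattened into text before they are substituted into the template. The sketch below shows one way to do that with string.Template from the standard library. The template wording, the placeholder names, and build_prompt are assumptions for illustration, not the contents of prompts.py.

```python
from string import Template

# Hypothetical template; $tissue and $marker_lines are placeholder names chosen for this sketch
PROMPT = Template(
    "Identify the cell type of each cluster in $tissue tissue.\n"
    "$marker_lines\n"
    "Answer with one cell type per line."
)


def build_prompt(markers_by_cluster, tissue):
    """Render one line per cluster, sorted by cluster label, into the template."""
    lines = [
        f"Cluster {cluster}: {', '.join(genes)}"
        for cluster, genes in sorted(markers_by_cluster.items())
    ]
    return PROMPT.substitute(tissue=tissue, marker_lines="\n".join(lines))


prompt = build_prompt({"1": ["MS4A1", "CD79A"], "0": ["CD3E", "CD3D"]}, tissue="blood")
print(prompt)
```

Template.substitute raises KeyError when a placeholder is left unfilled, and unlike str.format it does not treat literal braces in the template as placeholders, which matters if a prompt contains JSON.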

Custom Model Integration

The repository allows users to integrate their own custom LLMs, rather than relying solely on pre-configured options.

R: R/R/custom_model_manager.R

This R file likely manages the integration of custom LLMs within the R components.

Purpose:

The custom_model_manager.R file enables users to define and register their own LLM models for use within the mLLMCelltype workflow. This is useful when users have access to specialized models or want to experiment with different LLM architectures.

Functionality:

The file likely contains functions to:

  • Register a custom model.
  • Specify the API endpoint for the model.
  • Define the input/output format for the model.
  • Handle authentication.

Example (Hypothetical):

# R/R/custom_model_manager.R

# Function to register a custom model
register_custom_model <- function(model_name, api_endpoint, auth_token) {
  # Store model information in a configuration file or data structure
  model_config <- list(api_endpoint = api_endpoint, auth_token = auth_token)
  saveRDS(model_config, file = paste0(model_name, ".rds"))
  cat("Custom model registered: ", model_name, "\n")
}

# Function to call a custom model
call_custom_model <- function(model_name, prompt) {
  model_config <- readRDS(paste0(model_name, ".rds"))
  api_endpoint <- model_config$api_endpoint
  auth_token <- model_config$auth_token

  # Make API call to the custom model
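  response <- httr::POST(
    api_endpoint,
    body = list(prompt = prompt),
    httr::add_headers(Authorization = paste("Bearer", auth_token)),
    encode = "json"
  )
  # Stop with an informative error on HTTP 4xx/5xx responses
  httr::stop_for_status(response)

  # Extract the prediction from the response
  content <- httr::content(response, "text", encoding = "UTF-8")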
  # Assuming the response is a JSON string
  json_data <- jsonlite::fromJSON(content)
  prediction <- json_data$prediction
  return(prediction)
}

Explanation:

The register_custom_model function stores the API endpoint and authentication token for a custom model. The call_custom_model function retrieves this information and makes an API call to the custom model, extracting the prediction from the response.

Integration:

Other R scripts will use functions from this file to register and call custom models, allowing them to leverage different LLMs for cell type prediction.
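
The same register-then-call pattern can be sketched in Python with an in-memory registry. This is an illustration of the idea only: the names below mirror the hypothetical R functions above and are not an existing mLLMCelltype API, and the "model" is a local function so the example needs no network access.

```python
from typing import Callable, Dict

# In-memory registry: model name -> function that maps a prompt to a response string
_CUSTOM_MODELS: Dict[str, Callable[[str], str]] = {}


def register_custom_model(name, call_fn):
    """Register a callable under a unique model name."""
    if name in _CUSTOM_MODELS:
        raise ValueError(f"Model already registered: {name}")
    _CUSTOM_MODELS[name] = call_fn


def call_custom_model(name, prompt):
    """Look up a registered model and send it the prompt."""
    try:
        call_fn = _CUSTOM_MODELS[name]
    except KeyError:
        raise KeyError(f"Unknown custom model: {name}") from None
    return call_fn(prompt)


# A stand-in "model" that simply echoes the prompt (hypothetical)
register_custom_model("echo-model", lambda prompt: f"echo: {prompt}")
print(call_custom_model("echo-model", "Markers: CD3E, CD3D"))
```

Registering a callable rather than an endpoint and token means credentials can stay inside the caller's closure instead of being written to a file on disk.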

Provider Management

The repository uses the concept of "providers" to abstract the underlying LLM APIs. This allows the system to support multiple LLMs (e.g., OpenAI, Cohere) without requiring significant code changes.

Python: python/mllmcelltype/providers/__init__.py

This file initializes the providers package in Python and likely defines the base classes or interfaces for different LLM providers.

Purpose:

The __init__.py file in the providers directory sets up the provider system, making it easy to add and manage different LLM providers.

Functionality:

The file likely:

  • Defines an abstract base class for providers.
  • Imports specific provider implementations (e.g., OpenAI, Cohere).
  • Provides a mechanism to select a provider.

Example:

# python/mllmcelltype/providers/__init__.py

from abc import ABC, abstractmethod

class BaseProvider(ABC):
    @abstractmethod
    def generate_text(self, prompt):
        pass

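# Imported after BaseProvider is defined so that the provider modules can
# import it from this package without a circular import.
from .openai_provider import OpenAIProvider
from .cohere_provider import CohereProvider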

PROVIDER_MAP = {
    "openai": OpenAIProvider,
    "cohere": CohereProvider,
}

def get_provider(provider_name, **kwargs):
    provider_class = PROVIDER_MAP.get(provider_name)
    if not provider_class:
        raise ValueError(f"Unknown provider: {provider_name}")
    return provider_class(**kwargs)

Explanation:

The BaseProvider class defines the interface for all providers, requiring a generate_text method. The file imports OpenAIProvider and CohereProvider (which are assumed to be in separate files within the providers directory). The PROVIDER_MAP dictionary maps provider names to their classes. The get_provider function returns an instance of the specified provider.

Integration:

Other Python scripts will use the get_provider function to obtain an instance of the desired LLM provider and then call the generate_text method to interact with the LLM.

graph TD
    A[User Code] --> B{Provider Selection};
    B --> C[OpenAIProvider];
    B --> D[CohereProvider];
    C --> E((OpenAI API));
    D --> F((Cohere API));
    E --> G[LLM Response];
    F --> G;
    G --> A;

Explanation of the Diagram:

  1. User Code: Represents the Python scripts that use the LLM functionality.
  2. Provider Selection: The get_provider function in __init__.py handles the selection of the appropriate provider based on user configuration.
  3. OpenAIProvider/CohereProvider: Specific provider implementations that handle communication with the respective LLM APIs.
  4. OpenAI API/Cohere API: External LLM APIs.
  5. LLM Response: The response from the LLM.
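
The flow in the diagram can be exercised without any external API by plugging in an offline provider. The sketch below repeats the BaseProvider and get_provider pattern from the example above so that it is self-contained. CannedProvider is a hypothetical class invented for this illustration and is not part of the repository.

```python
from abc import ABC, abstractmethod


class BaseProvider(ABC):
    @abstractmethod
    def generate_text(self, prompt):
        """Return the model's text response for a prompt."""


class CannedProvider(BaseProvider):
    """Hypothetical offline provider that always returns a fixed answer."""

    def __init__(self, answer="T cell"):
        self.answer = answer

    def generate_text(self, prompt):
        return self.answer


PROVIDER_MAP = {"canned": CannedProvider}


def get_provider(provider_name, **kwargs):
    provider_class = PROVIDER_MAP.get(provider_name)
    if provider_class is None:
        raise ValueError(f"Unknown provider: {provider_name}")
    return provider_class(**kwargs)


# User code selects a provider by name and never touches the concrete class
provider = get_provider("canned", answer="B cell")
print(provider.generate_text("Markers: MS4A1, CD79A"))
```

Because user code depends only on generate_text, swapping the canned provider for a real one is a one-word change in the get_provider call.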

Understanding Uncertainty Metrics

Related Files

  • R/R/check_consensus.R
  • R/R/print_consensus_summary.R
  • python/mllmcelltype/compare.py
  • images/mLLMCelltype_visualization.png

Related Pages

Related topics: Core Functionality and Usage

Understanding Uncertainty Metrics in mLLMCelltype

This page details the uncertainty metrics used in the mLLMCelltype repository, focusing on consensus checking and comparison of cell type predictions. We will examine the R and Python code involved, their functionalities, and how they contribute to the overall architecture.

Purpose and Functionality

The primary goal is to quantify the uncertainty associated with cell type predictions generated by different methods or models. This involves assessing the agreement (consensus) among predictions and comparing them to known ground truth or reference datasets.

R Code Analysis: Consensus Checking

R/R/check_consensus.R

This R script focuses on evaluating the consensus among different cell type annotations for the same cells. It likely implements functions to calculate metrics such as:

  • Agreement rate: The percentage of cells where all annotations agree.
  • Pairwise agreement: The average agreement between all pairs of annotations.
  • Entropy-based metrics: Quantifying the diversity of annotations for each cell.

Example (Hypothetical):

# R/R/check_consensus.R
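# Example function to calculate agreement metrics
check_consensus <- function(annotations) {
  # annotations is a matrix where rows are cells and columns are annotations
  agreement <- apply(annotations, 1, function(x) length(unique(x)) == 1)
  # Mean agreement over every pair of annotation columns (needs at least two columns)
  pairs <- combn(ncol(annotations), 2)
  pairwise <- apply(pairs, 2, function(p) mean(annotations[, p[1]] == annotations[, p[2]]))
  # Return a named list so print_consensus_summary() can read both metrics
  list(agreement_rate = mean(agreement), pairwise_agreement = mean(pairwise))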
}
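
The entropy-based metric listed above is not covered by the R example. A minimal Python sketch of one common formulation follows: for the labels that different sources assign to a single cell, it computes the share held by the most frequent label and the Shannon entropy of the label distribution in bits. The function name consensus_metrics is an assumption, and the repository may define its metrics differently.

```python
import math
from collections import Counter


def consensus_metrics(labels):
    """Return (consensus_proportion, shannon_entropy_bits) for one cell."""
    counts = Counter(labels)
    total = len(labels)
    # Share of sources that chose the most frequent label
    proportion = counts.most_common(1)[0][1] / total
    # H = sum over labels of p * log2(1 / p), with p = n / total
    entropy = sum((n / total) * math.log2(total / n) for n in counts.values())
    return proportion, entropy


print(consensus_metrics(["T cell", "T cell", "T cell", "T cell"]))   # full agreement
print(consensus_metrics(["T cell", "T cell", "NK cell", "B cell"]))  # split vote
```

Full agreement gives a proportion of 1.0 and an entropy of 0.0. The split vote gives 0.5 and 1.5 bits. Higher entropy means the sources disagree more, so such cells are candidates for manual review.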

R/R/print_consensus_summary.R

This script generates a summary report of the consensus analysis. It takes the results from check_consensus.R and presents them in a user-friendly format, possibly including:

  • Tables summarizing agreement metrics.
  • Histograms visualizing the distribution of agreement scores.
  • Scatter plots comparing different annotation methods.

Example (Hypothetical):

# R/R/print_consensus_summary.R
# Example function to generate a summary table
print_consensus_summary <- function(consensus_results) {
  # consensus_results is a list containing the results from check_consensus.R
  summary_table <- data.frame(
    Metric = c("Agreement Rate", "Pairwise Agreement"),
    Value = c(consensus_results$agreement_rate, consensus_results$pairwise_agreement)
  )
  print(summary_table)
}

Python Code Analysis: Comparison with Ground Truth

python/mllmcelltype/compare.py

This Python script compares cell type predictions with a known ground truth or reference dataset. It likely implements functions to calculate metrics such as:

  • Accuracy: The percentage of cells correctly classified.
  • Precision, Recall, F1-score: Metrics for each cell type, evaluating the ability to correctly identify cells of that type.
  • Confusion matrix: A table showing the number of cells of each true type that were predicted as each predicted type.

Example (Hypothetical):

# python/mllmcelltype/compare.py
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

def compare_with_ground_truth(predictions, ground_truth):
    """
    Compares cell type predictions with ground truth labels.

    Args:
        predictions (np.ndarray): Predicted cell type labels.
        ground_truth (np.ndarray): Ground truth cell type labels.

    Returns:
        dict: A dictionary containing comparison metrics.
    """
    accuracy = accuracy_score(ground_truth, predictions)
    confusion_mat = confusion_matrix(ground_truth, predictions)
    return {"accuracy": accuracy, "confusion_matrix": confusion_mat}
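
The example above reports accuracy and a confusion matrix but not the per-type precision, recall, and F1-score listed earlier. The sketch below computes them with the standard library only, to make the definitions explicit. The function name per_class_scores and the sample labels are assumptions. In practice, sklearn.metrics.classification_report produces the same quantities.

```python
from collections import Counter


def per_class_scores(ground_truth, predictions):
    """Return {label: (precision, recall, f1)} computed from paired label lists."""
    if len(ground_truth) != len(predictions):
        raise ValueError("ground_truth and predictions must have the same length")
    true_count = Counter(ground_truth)
    pred_count = Counter(predictions)
    tp = Counter()  # true positives per label
    for truth, pred in zip(ground_truth, predictions):
        if truth == pred:
            tp[truth] += 1
    scores = {}
    for label in sorted(set(ground_truth) | set(predictions)):
        precision = tp[label] / pred_count[label] if pred_count[label] else 0.0
        recall = tp[label] / true_count[label] if true_count[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Hypothetical labels: one T cell is mistaken for a B cell
truth = ["T cell", "T cell", "B cell", "NK cell"]
preds = ["T cell", "B cell", "B cell", "NK cell"]
scores = per_class_scores(truth, preds)
for label, (p, r, f1) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Labels that never appear in the predictions (or in the ground truth) get a score of 0.0 rather than raising a division error.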

Visualization

images/mLLMCelltype_visualization.png

This image likely showcases visualizations related to cell type prediction and uncertainty. It could include:

  • UMAP or t-SNE plots colored by predicted cell type.
  • Heatmaps showing the expression of marker genes for each cell type.
  • Visualizations of the consensus analysis, such as agreement rates across different methods.
  • Confusion matrix visualization.

Overall Architecture and Data Flow

The uncertainty surrounding cell type predictions is assessed using components in both R and Python. The R scripts (check_consensus.R, print_consensus_summary.R) evaluate the consistency among different annotation methods applied to the same dataset. They calculate consensus metrics and generate summary reports. Concurrently, the Python script (compare.py) focuses on external validation by comparing a set of predictions against a known ground truth dataset, calculating standard performance metrics like accuracy and F1-score. Visualizations often integrate results from both consensus analysis and ground truth comparison.

graph TD
    A[Annotations] --> B(R: check_consensus.R)
    B --> C[Consensus Metrics]
    C --> D(R: print_consensus_summary.R)
    D --> E[Consensus Summary Report]

    F[Cell Type Predictions] --> G(Python: compare.py)
    H[Ground Truth Cell Types] --> G
    G --> I[Comparison Metrics]

    E --> J[Visualization]
    I --> J
    J --> K[Final Visualization of Results]

Setup and Usage Instructions

  1. Install R packages: Ensure that necessary R packages are installed (e.g., dplyr, ggplot2).
  2. Install Python packages: Ensure that necessary Python packages are installed (e.g., scikit-learn, numpy).
  3. Prepare input data: The R scripts require a matrix of cell type annotations, where rows are cells and columns are different annotation methods. The Python script requires predicted cell type labels and ground truth labels.
  4. Run the scripts: Execute the R and Python scripts, providing the appropriate input data.
  5. Interpret the results: Analyze the consensus metrics and comparison metrics to assess the uncertainty of cell type predictions.

Component Relationships

The check_consensus.R and print_consensus_summary.R scripts are closely related, with the latter relying on the output of the former. The compare.py script operates independently, comparing predictions to ground truth. All components contribute to understanding the uncertainty associated with cell type predictions.


Troubleshooting and FAQ

Related Files

  • .github/ISSUE_TEMPLATE/bug_report.md
  • .github/ISSUE_TEMPLATE/usage_question.md

Troubleshooting and FAQ

This page provides solutions to common issues and answers frequently asked questions related to the mLLMCelltype repository. It focuses on the bug report and usage question issue templates.

Bug Reports

Purpose and Functionality

The bug_report.md file (.github/ISSUE_TEMPLATE/bug_report.md) provides a template for users to report bugs they encounter while using the mLLMCelltype tool. This template ensures that bug reports contain all the necessary information for developers to understand, reproduce, and fix the issue.

Template Structure

The template includes sections for:

  • Describe the bug: A clear and concise description of the bug.
  • To Reproduce: Numbered steps that reproduce the behavior.
  • Expected behavior: What the user expected to happen.
  • Screenshots: Visual evidence of the bug (if applicable).
  • Environment: The user's OS, Python version, and commit ID.
  • Additional context: Any additional information that might be helpful.

Example

Here's the content of .github/ISSUE_TEMPLATE/bug_report.md:

---
name: Bug report
about: Create a report to help us improve
title: "[BUG] "
labels: bug
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

**Expected behavior**
A clear and concise description of what you expected to happen.

**Screenshots**
If applicable, add screenshots to help explain your problem.

**Environment (please complete the following information):**
 - OS: [e.g. iOS]
 - Python Version [e.g. 3.8]
 - Commit ID [e.g. 8e8e8e8]

**Additional context**
Add any other context about the problem here.

How it Fits into the Architecture

Bug reports are crucial for the maintenance and improvement of the mLLMCelltype tool. They provide developers with direct feedback from users, allowing them to identify and fix issues that might not be apparent during development. The issue template streamlines this process.

Usage Instructions

When encountering a bug:

  1. Click on the "Issues" tab in the GitHub repository.
  2. Click on "New Issue."
  3. Choose the "Bug report" template.
  4. Fill out the template with as much detail as possible.
  5. Submit the issue.
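
The Environment section of the template asks for the OS and Python version. As a convenience, the following sketch prints both in the template's layout using only the standard library. The helper name `environment_summary` is hypothetical and not part of the repository, and the commit ID still has to be filled in by hand.

```python
import platform


def environment_summary():
    """Format OS and Python version in the layout of the bug report template."""
    lines = [
        f" - OS: {platform.system()} {platform.release()}",
        f" - Python Version {platform.python_version()}",
    ]
    return "\n".join(lines)


# Paste the printed lines into the Environment section of the issue.
print(environment_summary())
```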

Usage Questions

Purpose and Functionality

The usage_question.md file (.github/ISSUE_TEMPLATE/usage_question.md) provides a template for users to ask questions about how to use the mLLMCelltype tool. This template helps users articulate their questions clearly and ensures that developers have enough information to provide helpful answers.

Template Structure

The template includes sections for:

  • Question: A clear and concise question about how to use the tool.
  • Context: Background information about what the user is trying to achieve.
  • Attempts: A description of what the user has already tried.
  • Code Snippets: Relevant code snippets that illustrate the user's problem.

Example

Here's the content of .github/ISSUE_TEMPLATE/usage_question.md:

---
name: Usage Question
about: Ask a question about how to use this project
title: "[USAGE] "
labels: usage
assignees: ''

---

**Question**
A clear and concise question about how to use the project.

**Context**
Provide any background information that might be helpful.

**Attempts**
Describe what you've already tried.

**Code Snippets**
If applicable, provide code snippets to illustrate your problem.

How it Fits into the Architecture

Usage questions help improve the usability and documentation of the mLLMCelltype tool. By answering user questions, developers can identify areas where the tool is unclear or difficult to use, and they can improve the documentation and user interface accordingly. The issue template ensures that these questions are well-structured.

Usage Instructions

When you have a question about how to use the tool:

  1. Click on the "Issues" tab in the GitHub repository.
  2. Click on "New Issue."
  3. Choose the "Usage Question" template.
  4. Fill out the template with as much detail as possible.
  5. Submit the issue.

Issue Template Workflow

graph TD
    A[User Discovers Issue] --> B{Is it a Bug?};
    B -- Yes --> C[Create Bug Report Issue];
    B -- No --> D{Is it a Usage Question?};
    D -- Yes --> E[Create Usage Question Issue];
    D -- No --> F[Other Issue Type];
    C --> G[Developer Review];
    E --> G;
    F --> G;
    G --> H{Issue Resolved?};
    H -- Yes --> I[Close Issue];
    H -- No --> J[Further Investigation/Discussion];
    J --> G;