VirFind is an online virus discovery tool running on computer nodes housed at Arkansas High Performance Computing Center

VirFind Pipeline

VirFind is a pipeline of various bioinformatics scripts currently running on 3 high performance computer nodes (each with 394 cores, 2.5TB RAM) at Arkansas High Performance Computing Center. You can upload your fasta or Illumina fastq files to VirFind to discover if there are known or new viruses in your samples.

Note: Because processing high-throughput sequencing data is computationally intensive, a job can take up to several days to complete. Your job may also have to wait in a queue behind other users' jobs.

VirFind is a pipeline of various bioinformatics scripts currently running on 3 high performance computer nodes (each with 64 cores, 512 GB RAM) at Arkansas High Performance Computing Center. The pipeline runs the following steps:

  • The user uploads files to the VirFind FTP server and completes the Sequence submission form, which sets how the pipeline will run. Permitted file types: fastq (Illumina), fasta, or gz-compressed files of either type
  • File transfer to VirFind bioinformatics server
  • Convert fastq to fasta format and collapse identical sequences
  • Trim n nucleotides (n = user’s choice) from both ends
  • Map to reference genome (user’s choice) by Bowtie2
    • output mapped sequences
  • Unmapped sequences: calculate average sequence length
    • de novo sequence assembly by Trinity and SPAdes
    • in addition, if the average sequence length is <= 80 nt, de novo sequence assembly by SPAdes with k-mer sizes 13, 19, and 25
  • Assembled contigs: Blastn against the NCBI nt database (e-value cutoff = user’s choice, default = 0.01), generating the following outputs
    • Blastn_NON_VIRUS_reads.fna
    • Blastn_NON_VIRUS_report.tab
    • Blastn_VIRUS_reads.fna
    • Blastn_VIRUS_report.tab
  • Sequences not detected by Blastn: Blastx against all GenBank virus proteins (e-value cutoff = user’s choice, default = 0.01), generating the following outputs
    • Blastx_VIRUS_reads.fna
    • Blastx_VIRUS_report.tab
  • Sequences not detected by Blastx will be output to
    • Reads_with_NO_Blastn_NO_Blastx.fna
    • Reads_with_NO_Blastn_NO_Blastx.faa (translation of .fna file to protein)
  • Conserved domain search (user’s choice) of the .faa file against NCBI CDD database, e-value = 0.05, output to Conserved_domain_search_report.txt
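
The early steps of the list above (fastq-to-fasta conversion, collapsing, end trimming, and the choice of assembly runs based on average read length) can be illustrated with a minimal Python sketch. This is not VirFind's actual code: the function names (fastq_to_sequences, collapse, trim_ends, plan_assembly) are hypothetical, the fastq input is assumed to use plain 4-line records, and "collapse" is assumed to mean merging identical sequences. Only the 80 nt cutoff and the k-mer sizes 13, 19, 25 come from the description above.

```python
from collections import Counter
from statistics import mean


def fastq_to_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record (assumed layout)."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def collapse(sequences):
    """Merge identical sequences, keeping a count for each unique one."""
    return Counter(sequences)


def trim_ends(seq, n):
    """Remove n nucleotides from both ends; returns '' if the read is too short."""
    return seq[n:len(seq) - n]


def plan_assembly(sequences, short_read_cutoff=80, short_kmers=(13, 19, 25)):
    """Return the average read length and the assembly runs to launch.

    Each run is (assembler name, k-mer sizes); None means the assembler's
    own defaults, which this sketch does not model.
    """
    avg = mean(len(s) for s in sequences)
    runs = [("Trinity", None), ("SPAdes", None)]
    if avg <= short_read_cutoff:
        # Extra SPAdes run with small k-mers for short reads.
        runs.append(("SPAdes", short_kmers))
    return avg, runs


# Tiny made-up example: three reads, two of them identical.
fastq = [
    "@r1", "ACGTACGTAC", "+", "IIIIIIIIII",
    "@r2", "ACGTACGTAC", "+", "IIIIIIIIII",
    "@r3", "TTGACCGTAA", "+", "IIIIIIIIII",
]
unique = collapse(fastq_to_sequences(fastq))
trimmed = [trim_ends(seq, 2) for seq in unique]
avg_len, runs = plan_assembly(trimmed)
```

In the real pipeline the reads are also mapped to a reference genome with Bowtie2 before assembly, and only the unmapped reads are assembled; that step is left out here because it depends on external tools.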

Users will use the information in the Blastn and Blastx .tab files and in Conserved_domain_search_report.txt to decide whether one or more viruses are present in their sample.

  • Blastn_VIRUS_reads.fna contains reads that share nucleotide identity with GenBank sequences.
  • Blastx_VIRUS_reads.fna contains reads that share amino acid identity with GenBank sequences.
  • Reads_with_NO_Blastn_NO_Blastx.fna contains reads that Blastn and Blastx could not detect at the chosen e-values. Please keep in mind that this file, while it can still contain host material, might also contain sequences of new viruses that differ significantly from those deposited in GenBank.
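
As a rough illustration of how contigs end up in these three groups, the sketch below routes each contig by its best Blastn and Blastx e-values. It rests on assumptions: the layout of the VirFind .tab reports is not specified here, so the code assumes the standard 12-column BLAST+ tabular layout (query ID in column 1, e-value in column 11), and the names best_evalues and route_contigs are hypothetical. It also does not show how Blastn hits are further split into VIRUS and NON_VIRUS files, since that rule is not described on this page.

```python
def best_evalues(tab_lines, id_col=0, evalue_col=10):
    """Map each query ID to its smallest e-value in a tab-separated report.

    Column positions are an assumption (BLAST+ 12-column tabular layout).
    """
    best = {}
    for line in tab_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        query_id = fields[id_col]
        evalue = float(fields[evalue_col])
        if query_id not in best or evalue < best[query_id]:
            best[query_id] = evalue
    return best


def route_contigs(contig_ids, blastn_best, blastx_best, cutoff=0.01):
    """Assign each contig to 'blastn', 'blastx', or 'none'.

    Blastn is checked first; Blastx is only consulted for contigs that
    Blastn did not detect at the cutoff, mirroring the order of the steps.
    """
    no_hit = float("inf")
    bins = {"blastn": [], "blastx": [], "none": []}
    for contig_id in contig_ids:
        if blastn_best.get(contig_id, no_hit) <= cutoff:
            bins["blastn"].append(contig_id)
        elif blastx_best.get(contig_id, no_hit) <= cutoff:
            bins["blastx"].append(contig_id)
        else:
            bins["none"].append(contig_id)
    return bins


# Made-up report lines for three contigs.
blastn_report = ["c1\tsubjA\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-90\t350"]
blastx_report = [
    "c2\tsubjB\t45.0\t60\t33\t0\t1\t180\t1\t60\t0.003\t40",
    "c3\tsubjC\t30.0\t40\t28\t0\t1\t120\t1\t40\t2.5\t22",
]
bins = route_contigs(
    ["c1", "c2", "c3"],
    best_evalues(blastn_report),
    best_evalues(blastx_report),
)
```

Here c3 has a Blastx hit, but its e-value of 2.5 is above the default cutoff of 0.01, so it lands in the "none" group, which corresponds to Reads_with_NO_Blastn_NO_Blastx.fna.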

Users will need some experience to judge whether a nucleotide or amino acid read reported by VirFind is a genuine virus read and, if so, whether the virus is an isolate of a known species or a completely new species belonging to a particular genus, subfamily, family, or order.