CyDotian

CyDotian's manual

An Algorithm Toolkit for Identifying Internal Repeats of Nucleic Acid and Amino Acid Sequences

CyDotian—Chen Yan DOTe on magicIAN

Authors	Huilong Chen
	Fuyu Yang
	Xiyin Wang
	Weina Ge
Email	chenhuilong131@163.com

Graphical Abstract

Download

Download CyDotian for Windows:

CyDotian_1.0.1.zip

262 KB Jan 5

Documentation

The online documentation is located at the GitHub Wiki.

Overview

CyDotian is a toolkit for identifying all internal repeats within a single sequence. For speed, the core algorithm is implemented in C and compiled into executable programs. The downstream analysis tools are written in Python, a language widely used for data processing and visualisation. These tools process the locations and counts of repeat segments, draw dotplots and depth plots, calculate repeat density, and output the comparison details of specific repeat segments. All of them run in batch mode and export their results in batch. Users are welcome to suggest improvements to any of the code. Because the CyDotian algorithm tolerates discrepancies, it can also be used to identify similar segments between two different sequences. In addition, the long intragenic repeats (LIRs) identified by CyDotian can help you infer whether a gene is a fusion gene.
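To make the idea of an internal repeat concrete, the following is a minimal sketch of a generic exact k-mer self-comparison, the kind of match a self-dotplot displays as an off-diagonal dot. It is not the CyDotian algorithm (which also tolerates discrepancies); the function name self_kmer_matches and the parameter k are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def self_kmer_matches(seq, k=4):
    """Return sorted (i, j) pairs, i < j, where seq[i:i+k] == seq[j:j+k].

    Illustrative only: exact matching of fixed-length words,
    not the CyDotian algorithm.
    """
    # Index every start position of every k-mer in the sequence.
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    # Any k-mer seen at two or more positions is an internal repeat.
    pairs = []
    for starts in positions.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                pairs.append((starts[a], starts[b]))
    return sorted(pairs)


# "ACGT" occurs at positions 0 and 6 (0-based) of this toy sequence.
matches = self_kmer_matches("ACGTTTACGTAA", k=4)
print(matches)
```

Each returned pair corresponds to one dot at coordinates (i, j) when the sequence is plotted against itself.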

Usage information is built into each program. To display it on the screen, run the program with the -h or --help option:

$ python3 program_name.py -h/--help (for Python scripts)

The following is the list of executable programs:

bpRepeatScan (used for nucleic acid sequences)

aaRepeatScan (used for amino acid sequences)

slidingWindow (used for nucleic acid and amino acid sequences via the sliding window method)

Parameter configuration files

CyDotian.config (used for CyDotian algorithm)

CyDotian_exact_match.config (used for MUMmer's repeat-match algorithm)

CyDotian_sliding_window.config (used for sliding window algorithm)

Batch processing programs

Tool 1. 1.0_batch_check_sequence.py

Tool 2. 1.1_batch_run_CyDotian.py

Tool 3. 1.2_batch_run_draw_dotplot.py

Tool 4. 1.3_batch_run_draw_depth_plot.py

Tool 5. 1.4_batch_run_output_repeat_density.py

Tool 6. 1.5_batch_extract_repeat_sequences.py

Tool 7. 1.6_Extract_the_corresponding_results_by_name.py

Tool 8. 1.7_batch_run_CyDotian_in_pairwise_comparison_mode.py

Tool 9. 1.8_batch_extract_repeat_sequences_in_pairwise_comparison_mode.py

Tool 10. 2.1_batch_run_CyDotian_exact_match.py

Tool 11. 2.7_batch_run_CyDotian_exact_match_in_pairwise_comparison_mode.py

Tool 12. 3.1_batch_run_CyDotian_sliding_window.py

Tool 13. 3.2_batch_run_CyDotian_sliding_window_in_pairwise_comparison_mode.py

In addition, for researchers who are less familiar with the command line, we have developed a user-friendly GUI desktop platform (SAtoolkit, https://github.com/ChenHuilong1223/SAtoolkit). Like Dotter and Dotlet, it is better suited to interactively viewing a particular sequence of interest.

Finally, we sincerely hope that our tools help your research work!

Citation

Chen, H., Xu, G., Ge, W., Feng, F., Lin, Y., Guo, C., Jing, Q., Wang, X., Nussio, L., Wang, X., & Yang, F. (2024). CyDotian: a versatile toolkit for identification of intragenic repeat sequences. Molecular Horticulture 4, 37. https://doi.org/10.1186/s43897-024-00113-3

News

1.0.1

Users are strongly advised to check the percentage of N characters in their DNA sequences. Sequences in which this percentage is too high are not meaningful to analyze.

Fixed an output error caused by too many N characters in a sequence (for "sequence_names_*" files in the "positions" folder).
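The check on N characters recommended in this release can be done before running any of the tools. The following is a minimal sketch, assuming FASTA-formatted input; the function names parse_fasta and n_percent and the 10% threshold are hypothetical and are not part of CyDotian, which does not document a specific cutoff here.

```python
def parse_fasta(lines):
    """Parse FASTA-formatted lines into a {name: sequence} dict."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def n_percent(seq):
    """Return the percentage of N characters (case-insensitive) in seq."""
    if not seq:
        return 0.0
    return 100.0 * seq.upper().count("N") / len(seq)


# Hypothetical cutoff, chosen for illustration only.
MAX_N_PERCENT = 10.0

example = [">seq1", "ACGTNNNN", ">seq2", "ACGTACGT"]
for name, seq in parse_fasta(example).items():
    pct = n_percent(seq)
    flag = "too many N" if pct > MAX_N_PERCENT else "ok"
    print(f"{name}\t{pct:.1f}%\t{flag}")
```

To check a real file, pass an open file handle instead of the example list, for instance parse_fasta(open("sequences.fasta")), where the file name is a placeholder.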