Talk:FASTA format

From WikiProjectMed

Merge with Fasta Sequence?

  1. support- Yeah, the article "Fasta Sequence" should just be merged into the article "FASTA format". A FASTA sequence isn't a good term anyway!
Merged. Changed FASTA sequence to a redirect. I tried to keep most of the material so that no previous work is lost. Anybody who cares, feel free to cut. Miguel Andrade 04:09, 30 March 2006 (UTC)[reply]
What does FASTA stand for???? —Preceding unsigned comment added by 71.157.135.130 (talkcontribs) 06:15, 12 December 2006
Snipped from FASTA page : "FASTA is pronounced 'FAST-Aye', and stands for 'FAST-All', because it works with any alphabet, an extension of 'FAST-P' (protein) and 'FAST-N' (nucleotide) alignment." Paul Cotney 13:30, 15 May 2007 (UTC)[reply]

Comments in a FASTA record

I removed most of the useless and distracting references to comments in the FASTA file because no one supports them, and no one has for over 15 years. For details, see my research -- Andrew Dalke

Not true, comments starting with a semi-colon are supported for instance by the function read.fasta in the seqinR package and by the function readFASTA in the Biostrings package. 90.42.8.116 (talk)

Okay, then "almost no one". Who publishes/provides data in FASTA format with comments? What was the driving motivation for adding that support? Where's the push for Bioperl and other tools to support it? -- Andrew Dalke

As scientists we care about backward compatibility, which is essential for reproducibility. The fact that new tools such as Bioperl (or other recent me-too products in the bioinformatics world) do not support reproducibility is in no way a good argument for removing the documentation about the original format, which is sourced. This is Wikipedia, not the Bioperl point of view. 90.48.97.179 (talk) 21:28, 4 February 2009 (UTC)[reply]

Where's the data set where you have backwards compatibility concerns? I haven't seen such data outside of a test set from FASTA itself. I'm not arguing from a Bioperl point of view - I'm talking about the consensus viewpoint of what the FASTA file is by the large majority of tools that exist. If that isn't the Wikipedia point of view, then I don't know what is. Or as I point out on that page which summarizes my research, the world uses something much closer to the NCBI FASTA format than the Pearson FASTA format. - Andrew Dalke —Preceding unsigned comment added by 83.248.213.42 (talk) 12:36, 12 August 2009 (UTC)[reply]

I think you are missing the real point: backwards compatibility is not a problem, as comment lines are to be ignored, they don't matter that much. But "forward compatibility" does matter: the problem is that with comments one may include all needed information so that future programs may rely on them to use it, but without them you are bound to reinvent the wheel and design a new, non-compatible format to be able to add more information.

If you leave the comment details in, you allow future developers to add any information they want without breaking backwards compatibility. If they do not know of their existence, they are bound to reinvent them in a most probably incompatible way, thus breaking all pre-existing programs. If you want an example: any "FASTA" sequence with comments starting with hash marks "#" is NOT a FASTA sequence and will break all existing software. So, hiding information only suits selfish egos and creates problems for the future. Since Pearson was wise enough to prepare for the future, ignoring it is not only silly, it is also preposterous! -- Jose R Valverde

I have seen gazillions of files that have declared themselves to be in FASTA format, and I have never, ever seen these semi-colon prefixed comments. I have also seen dozens of parsers and none of them expect to see a semi-colon leading a line. If the anonymous commenter believes this is part of the FASTA specification then he or she should cite this specification. At the very least the example that uses the semi-colon should be made less prominent as this is extremely misleading. The first example given should be using the de-facto standard, i.e. '>'. It's kind of shocking that such a commonly used format has no formal specification. http://www.ncbi.nlm.nih.gov/blast/fasta.shtml provides guidelines, but this isn't a specification. In the absence of a published specification, the de-facto standard, i.e. '>'s but no ';'s, is all we have to go on -- Chris Mungall —Preceding unsigned comment added by Cmungall (talkcontribs) 18:51, 16 December 2009 (UTC)[reply]

Nothing proves that you have seen "gazillions" of files in FASTA format; referenced sources are needed in Wikipedia. The comment by Jose R Valverde is much more interesting and IMHO should be incorporated somehow in the article. 90.57.173.193 (talk) 21:53, 15 April 2011 (UTC)[reply]
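
For anyone weighing the two positions in this thread, here is a minimal Python sketch of a reader that accepts both conventions: '>' header lines as in the de-facto format, and ';' lines that are simply ignored as comments. The function name read_fasta is hypothetical, and treating every ';' line as a skippable comment is an assumption made for illustration; it is not taken from any published specification or from an existing parser.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA text.

    Assumption for this sketch: any line starting with ';' is a
    comment and is skipped, wherever it appears.
    """
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(";"):
            # Blank lines and ';' comments carry no sequence data.
            continue
        if line.startswith(">"):
            # A new '>' line ends the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence data may span several lines; join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

A reader written this way does not break on files with ';' comments and costs nothing on files without them, which may be a reasonable middle ground between the two views above.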

Copyright?

Looks like the text under "Format" is copied from the NCBI website. Either that, or the NCBI copied from here. Does this need to be changed? —Preceding unsigned comment added by 128.193.214.112 (talk) 22:07, 3 October 2007 (UTC)[reply]

The NCBI website, however, is a US Federal Government website...and, as it is the author of the appropriate text, it is public domain. However, it should be cited as being such. --AEMoreira042281 15:19, 8 November 2007 (UTC)[reply]

HUPO-PSI Format to another page

I suggest moving this addition on HUPO-PSI Format to another page, as it confuses the basic article, and opens the door to n other detailed sequence record proposals.

This is very worthy as a new sequence format proposal, but it is not Fasta format. HUPO-PSI is one of several variants that have been proposed or built to bring documentation and record structure back into Fasta (which was designed for its simplicity). It would be worth referring to other such formats in this article. NCBI's defline format is currently the most common variant of Fasta. Add why FastA isn't the solution to documented, structured sequence records. Dongilbert (talk) 02:47, 18 February 2008 (UTC)[reply]

I do agree, the HUPO-PSI section should be moved to another page; here it's confusing for someone who is looking for the FASTA format. I'd like to do this but I'm unsure how to fix it without making trouble. 90.42.137.27 (talk) 19:58, 28 March 2009 (UTC)[reply]

Yes! move the HUPO-PSI section to its own page 90.53.111.75 (talk) 22:46, 30 April 2009 (UTC).[reply]

OK, I'm going to delete the disturbing HUPO-PSI stuff, any objection? 83.205.188.235 (talk) 17:56, 26 October 2009 (UTC)[reply]

Done 90.14.223.244 (talk) 22:01, 9 November 2009 (UTC)[reply]

Great page

I'm not sure if this is the appropriate place for this, but I just wanted to say that as a molecular biology student I found this article extremely useful and very reader friendly. Cheers everybody, great work! --Ar-Pharazôn (talk) 19:34, 24 June 2009 (UTC)[reply]

Strange comment/question in first section

The first section ("Format") contains the following line:

"-- If FASTA files can contain multiple sequences, as suggested by the text below, that is a critical part of the format specification and should be described up front here please. If they cannot contain multiple sequences, this point should be clarified here."

However, right above that, the main text addresses that very question.

"The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence. A simple example of one sequence in FASTA format [follows]:" (emphasis mine)

This seems crystal-clear to me; in fact, the question is the most jarring part of the section. I'm tempted just to remove the question/comment, but I'm new around here so I don't want to overstep!

Jdrum00 (talk) 05:36, 2 September 2009 (UTC)[reply]

GenBank                           gi|gi-number|gb|accession|locus
EMBL Data Library                 gi|gi-number|emb|accession|locus
DDBJ, DNA Database of Japan       gi|gi-number|dbj|accession|locus
NBRF PIR                          pir||entry
Protein Research Foundation       prf||name
SWISS-PROT                        sp|accession|name
Brookhaven Protein Data Bank (1)  pdb|entry|chain
Brookhaven Protein Data Bank (2)  entry:chain|PDBID|CHAIN|SEQUENCE
Patents                           pat|country|number
GenInfo Backbone Id               bbs|number
General database identifier       gnl|database|identifier
NCBI Reference Sequence           ref|accession|locus
Local Sequence identifier         lcl|identifier

—Preceding unsigned comment added by 202.131.227.149 (talk) 10:56, 13 November 2009 (UTC)[reply]
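
To make the table above concrete, here is a rough Python sketch that splits a pipe-delimited identifier into its tagged parts. The field layout is copied from the table; the names FIELDS and parse_defline_id are hypothetical, and the second PDB form (entry:chain|...) is deliberately not handled. The identifiers shown in the usage note are illustrative values, not taken from this page.

```python
from typing import Dict, List, Optional, Tuple

# Field names that follow each database tag, per the table above.
# None marks a slot the table leaves empty (e.g. "pir||entry").
FIELDS: Dict[str, Tuple[Optional[str], ...]] = {
    "gi": ("gi_number",),
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": (None, "entry"),
    "prf": (None, "name"),
    "sp": ("accession", "name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "ref": ("accession", "locus"),
    "lcl": ("identifier",),
}


def parse_defline_id(identifier: str) -> List[Tuple[str, Dict[str, str]]]:
    """Split an identifier such as 'gi|...|sp|...|...' into tagged parts."""
    tokens = identifier.split("|")
    parts: List[Tuple[str, Dict[str, str]]] = []
    i = 0
    while i < len(tokens):
        tag = tokens[i]
        names = FIELDS.get(tag)
        if names is None:
            raise ValueError(f"unknown database tag: {tag!r}")
        values = tokens[i + 1 : i + 1 + len(names)]
        # Pad with empty strings if trailing fields are missing.
        values += [""] * (len(names) - len(values))
        parts.append((tag, {n: v for n, v in zip(names, values) if n is not None}))
        i += 1 + len(names)
    return parts
```

For example, parse_defline_id("gi|129295|sp|P01013|OVAX_CHICK") gives a "gi" part with the gi number followed by an "sp" part with accession and name. A table-driven design keeps the parsing loop independent of how many tags exist, so adding a variant is a one-line change.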

Extensions

I thought it'd be helpful to add more information on file extensions in the form of a table. If I stepped on any toes I wanted to discuss it, so I started this discussion section. Lskatz (talk) 20:55, 28 February 2010 (UTC)[reply]

I concur. Currently I work with .fa files provided by someone else. I was surprised because .fa is not listed on Wikipedia. Perhaps .fa should also be added? 193.171.188.3 (talk) 10:11, 11 February 2014 (UTC)[reply]

Origin of the name

Do the letters F.A.S.T.A stand for anything? What is the origin of the abbreviation? XApple (talk) 10:47, 28 May 2010 (UTC)[reply]

I think they mean Fast-A, from Fast-All, as described above in this page. —Preceding unsigned comment added by 89.155.53.244 (talk) 18:19, 10 June 2010 (UTC)[reply]

Doesn't anybody read papers any longer? Not even look them up in the literature in the times of St. Google? Sorry for the outburst, but a simple look up in Google Scholar for FastA, Pearson and Lipman should bring up the original papers. Jeez! I see, no one cared to even look them up and cite them here. I agree that the papers do not document the format. At the time, it was only just another proposed format among many (and there were too many), so it wouldn't make it into any serious paper, plus this way they could change it later if needed, the documentation was in the code and README files, but at least a reference to its origins (FASTP, FASTA...) would be nice, and answer many of these questions. -- Jose R Valverde — Preceding unsigned comment added by Jrvalverde (talkcontribs) 10:02, 28 February 2013 (UTC)[reply]

Sorry, I realize that there is a link to FASTA package and the citations are there. -- Jose R Valverde — Preceding unsigned comment added by Jrvalverde (talkcontribs) 10:09, 28 February 2013 (UTC)[reply]

I believe that the first four letters of FASTA stand for "Fast Alignment Search Tool". That is what FAST in the tool mrsFAST stands for anyway. (http://mrsfast.sourceforge.net/Home) I came here hoping to verify the acronym after reading "Rates and patterns of great ape retrotransposition" published in PNAS in 2013. The earliest paper I can find that references FASTA is "Improved tools for biological sequence comparison" from 1988, which seems to be the paper where FASTA is first introduced. However, no acronym is given in that paper. — Preceding unsigned comment added by 132.170.253.255 (talk) 22:20, 13 March 2014 (UTC)[reply]

Before FASTA there were FASTN and FASTP, and I presume papers for them. In the years since I have known about them, I don't remember seeing anything about the acronym, but then again, I wasn't looking. Gah4 (talk) 15:07, 2 May 2018 (UTC)[reply]

External links modified

Hello fellow Wikipedians,

I have just modified one external link on FASTA format. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 18 January 2022).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 15:38, 28 December 2016 (UTC)[reply]

Change to lower case fastA

fastA is not an acronym. It should not be written in upper case. — Preceding unsigned comment added by SaraCubillas (talkcontribs) 13:41, 4 January 2017 (UTC)[reply]

Someone above indicates that it might be an acronym. The unix versions that I know of the program are called fasta, where case is significant. How is it written in the original papers?[1] Seems to be all caps there. Gah4 (talk) 15:10, 2 May 2018 (UTC)[reply]

References

  1. ^ Pearson WR, Lipman DJ (April 1988). "Improved tools for biological sequence comparison". Proceedings of the National Academy of Sciences of the United States of America. 85 (8): 2444–8. doi:10.1073/pnas.85.8.2444. PMC 280013. PMID 3162770.

Brought over from FASTA. Should it go into this article?

Thanks for bringing this here. A new paragraph was added regarding FASTA file manipulation, with mention of user-friendly online tools. Tomsauv (talk) 14:30, 4 May 2018 (UTC)[reply]

The Process

Fig. 1: Simulated phylogeny displaying taxa named ‘A’ to ‘T’. (a) Basic workflow for FASTA sequence extraction with TREE2FASTA. An exploratory tree is built following multiple-alignment of FASTA data. The Newick tree string (NWK) is visualized and edited in the tree-viewer FigTree and saved as a NEXUS file (NEX). TREE2FASTA uses the FASTA alignment and the NEXUS file (NEX) to produce subsetted FASTA files according to user selection scheme (here color). (b) Example of possible color and/or annotation selection schemes in FigTree for TREE2FASTA sequence extraction. The FASTA icon marked with an asterisk ‘*’ contains FASTA sequences for taxa H and I lacking color selection (i.e. achromatic) or lacking annotation. For figure clarity annotation ‘Group1’ to ‘Group4’ are reported G1 to G4 within FASTA file icons. FASTA files output to different folders are delimited by dashed boxes.[1]

References

  1. ^ Sauvage, Thomas; Plouviez, Sophie; Schmidt, William E.; Fredericq, Suzanne (5 March 2018). "TREE2FASTA: a flexible Perl script for batch extraction of FASTA sequences from exploratory phylogenetic trees". BMC Research Notes. 11 (1). doi:10.1186/s13104-018-3268-y. Material was copied from this source, which is available under a Creative Commons Attribution 4.0 International License.

sequential/interleaved

There is no discussion of interleaved/sequential FASTA format on the page, which could be added.

A section discussing tools available for the sorting of multiple FASTA files/ or popular FASTA utilities could be added.

You are welcome to add whatever information you think would benefit the article. Natureium (talk) 00:54, 3 May 2018 (UTC)[reply]
I worked with FASTA files for many years, but not so recently. The format as I knew it, I believe from the actual FASTA program, allows for multiple lines of sequence data, that are supposed to be the same length, except for the last one, which can be shorter. That definition allows for the whole sequence on one line. Interleaved does not seem to apply here, in the usual meaning of the word. Gah4 (talk) 04:54, 3 May 2018 (UTC)[reply]
Reorganized sections and subsections in a more logical way where needed. Added information about sequential and interleaved formats. Tomsauv (talk) 14:31, 4 May 2018 (UTC).[reply]
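
The line-length rule described above (every sequence line the same length except the last, which may be shorter, with a single line as the limiting case) can be sketched in a few lines of Python. The names wrap_sequence and format_record are hypothetical, and the default width of 80 is only an assumption for illustration.

```python
from typing import List


def wrap_sequence(seq: str, width: int = 80) -> List[str]:
    """Split a sequence into lines of equal width; only the last may be shorter."""
    if width < 1:
        raise ValueError("width must be at least 1")
    return [seq[i : i + width] for i in range(0, len(seq), width)]


def format_record(header: str, seq: str, width: int = 80) -> str:
    """Render one FASTA record as text, wrapping the sequence lines."""
    return "\n".join([">" + header] + wrap_sequence(seq, width)) + "\n"
```

Joining the wrapped lines back together recovers the original sequence, and a width at least as long as the sequence yields the whole sequence on one line.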

Compression & Encryption

Both sections seem to exclusively use references from the same few researchers, and their wider relevance/notability is unclear (and isn’t made clear in the text). This seems little more than self-promotion. Notably, the section about compression also fails to mention the (extremely widespread) utility of gzip compression for FASTA, which makes it very misleading. I recommend either removing the sections or improving them by including a statement explaining the relevance, and referencing independent reviews. I won’t do this myself due to a (very indirect) conflict of interest regarding this topic. klmr (talk) 14:49, 5 February 2021 (UTC)[reply]
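
On the gzip point: reading a gzip-compressed FASTA file needs nothing beyond the Python standard library, which is part of why the practice is so common. A minimal sketch follows; the helper names open_fasta and count_records are hypothetical, and detecting compression from the two gzip magic bytes (rather than from the file extension) is a design choice made here for illustration.

```python
import gzip
from typing import TextIO


def open_fasta(path: str) -> TextIO:
    """Open a plain or gzip-compressed FASTA file as text."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":  # gzip files start with these two bytes
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def count_records(path: str) -> int:
    """Count records by counting lines that start with '>'."""
    with open_fasta(path) as fh:
        return sum(1 for line in fh if line.startswith(">"))
```

The same count_records call then works on both compressed and uncompressed files.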

Default width

The article currently states "Hence, 80 characters became the norm." Although this is quite possibly true, we should refer to some older discussion or entry or email or mail by the people who designed the initial FASTA format back then. I am finding FASTA files that do not care much at all about the 80-character limit; some seem to prefer 76. It would be nice if we could have some kind of reference to some "standard", or if that is not possible, refer to what the "most common width" was in FASTA. 2A02:8388:1641:5500:8207:8CE:DF2:AB90 (talk) 16:58, 3 December 2022 (UTC)[reply]
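
In the absence of a citable standard, one way to gather evidence about widths is to measure them in real files. Below is a small Python sketch that tallies sequence-line widths while leaving out the last line of each record, since that line is allowed to be shorter. The function name line_width_profile is hypothetical, and skipping ';' lines is an assumption carried over from the comments discussion above.

```python
from collections import Counter
from typing import Iterable, Optional


def line_width_profile(lines: Iterable[str]) -> Counter:
    """Count widths of sequence lines, excluding each record's final line."""
    widths: Counter = Counter()
    pending: Optional[int] = None  # width of the previous line, possibly the last one
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # The pending line was the last of its record; do not count it.
            pending = None
            continue
        if not line or line.startswith(";"):
            continue
        if pending is not None:
            # Another sequence line followed, so the previous one was not final.
            widths[pending] += 1
        pending = len(line)
    return widths
```

Calling most_common(1) on the result gives the dominant width for a file. Records whose sequence fits on a single line contribute nothing, which is intentional, because such records carry no information about the wrapping width.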