N-gram: Difference between revisions

From Wikipedia, the free encyclopedia
imported>Remsense
No edit summary
 
imported>Kku
m link computational biology
 
Line 4: Line 4:
{{DISPLAYTITLE:''n''-gram}}
{{DISPLAYTITLE:''n''-gram}}
{{Use dmy dates|date=April 2017}}
{{Use dmy dates|date=April 2017}}
[[File:LARGER FONT VERSION Six n-grams frequently found in titles of publications about Coronavirus disease 2019, as of 7 May 2020.svg|thumb|300px|Six ''n''-grams frequently found in titles of publications about Coronavirus disease 2019 (COVID-19), as of 7 May 2020]]


An '''''n''-gram''' is a sequence of ''n'' adjacent symbols in particular order.<ref>{{Cite web |title=n-gram language model - an overview {{!}} ScienceDirect Topics |url=https://www.sciencedirect.com/topics/computer-science/n-gram-language-model |access-date=2024-12-12 |website=www.sciencedirect.com}}</ref> The symbols may be ''n'' adjacent [[letter (alphabet)|letters]] (including [[punctuation mark]]s and blanks), [[syllable]]s, or rarely whole [[word]]s found in a language dataset; or adjacent [[phoneme]]s extracted from a speech-recording dataset, or adjacent base pairs extracted from a genome. They are collected from a [[text corpus]] or [[speech corpus]].   
An '''''n''-gram''' is a sequence of ''n'' adjacent symbols in a particular order.<ref>{{cite book |last1=Deller |first1=John R. |last2=Hansen |first2=John |title=The Electrical Engineering Handbook |chapter=Methods, Models, and Algorithms for Modern Speech Processing |date=2005 |pages=861–890 |doi=10.1016/B978-012170960-0/50063-3 |isbn=978-0-12-170960-0 }}</ref> The symbols may be ''n'' adjacent [[letter (alphabet)|letters]] (including [[punctuation mark]]s and blanks), [[syllable]]s, or rarely whole [[word]]s found in a language dataset; or adjacent [[phoneme]]s extracted from a speech-recording dataset, or adjacent base pairs extracted from a genome. They are collected from a [[text corpus]] or [[speech corpus]].   


If [[Latin numerical prefixes]] are used, then ''n''-gram of size 1 is called a "unigram", size 2 a "[[bigram]]" (or, less commonly, a "digram") etc. If, instead of the Latin ones, the [[Cardinal number (linguistics)|English cardinal numbers]] are furtherly used, then they are called "four-gram", "five-gram", etc. Similarly, using [[Greek numerical prefixes]] such as "monomer", "dimer", "trimer", "tetramer", "pentamer", etc., or English cardinal numbers, "one-mer", "two-mer", "three-mer", etc. are used in computational biology, for [[polymer]]s or [[oligomer]]s of a known size, called [[k-mer|''k''-mer]]s. When the items are words, {{mvar|n}}-grams may also be called ''shingles''.<ref>{{cite journal |last1=Broder |first1=Andrei Z. |first2=Steven C. |last2=Glassman |first3=Mark S. |last3=Manasse |first4=Geoffrey |last4=Zweig |title=Syntactic clustering of the web |journal=Computer Networks and ISDN Systems |volume=29 |issue=8 |year=1997 |pages=1157–1166 |doi=10.1016/s0169-7552(97)00031-7|s2cid=9022773 }}</ref>
If [[Latin numerical prefixes]] are used, then ''n''-gram of size 1 is called a "unigram", size 2 a "[[bigram]]" (or, less commonly, a "digram") etc. If, instead of the Latin ones, the [[Cardinal number (linguistics)|English cardinal numbers]] are furtherly used, then they are called "four-gram", "five-gram", etc. Similarly, [[Greek numerical prefixes]] such as "monomer", "dimer", "trimer", "tetramer", "pentamer", etc., or English cardinal numbers, "one-mer", "two-mer", "three-mer", etc. are used in [[computational biology]] for [[polymer]]s or [[oligomer]]s of a known size, called [[k-mer|''k''-mer]]s. When the items are words, {{mvar|n}}-grams may also be called ''shingles''.<ref>{{cite journal |last1=Broder |first1=Andrei Z. |first2=Steven C. |last2=Glassman |first3=Mark S. |last3=Manasse |first4=Geoffrey |last4=Zweig |title=Syntactic clustering of the web |journal=Computer Networks and ISDN Systems |volume=29 |issue=8 |year=1997 |pages=1157–1166 |doi=10.1016/s0169-7552(97)00031-7 }}</ref>


In the context of [[natural language processing]] (NLP), the use of ''n''-grams allows [[Bag-of-words model|bag-of-words]] models to capture information such as word order, which would not be possible in the traditional bag of words setting.
In the context of [[natural language processing]] (NLP), the use of ''n''-grams allows [[Bag-of-words model|bag-of-words]] models to capture information such as word order, which would not be possible in the traditional bag of words setting.


== Examples ==
== Examples ==
(Shannon 1951)<ref>Shannon, Claude E. "The redundancy of English." ''Cybernetics; Transactions of the 7th Conference, New York: Josiah Macy, Jr. Foundation''. 1951.</ref> discussed ''n''-gram models of English. For example:
In 1951, [[Claude Shannon|Shannon]]<ref>Shannon, Claude E. "The redundancy of English." ''Cybernetics; Transactions of the 7th Conference, New York: Josiah Macy, Jr. Foundation''. 1951.</ref> discussed ''n''-gram models of English. For example:


* 3-gram character model (random draw based on the probabilities of each trigram): ''in no ist lat whey cratict froure birs grocid pondenome of demonstures of the retagin is regiactiona of cre''
* 3-gram character model (random draw based on the probabilities of each trigram): ''in no ist lat whey cratict froure birs grocid pondenome of demonstures of the retagin is regiactiona of cre''
Line 19: Line 18:


{| class="wikitable" style="font-size:85%;"
{| class="wikitable" style="font-size:85%;"
|+ Figure 1 ''n''-gram examples from various disciplines
|+ Figure 1. ''n''-gram examples from various disciplines
! Field !! Unit !!Sample sequence !! 1-gram sequence !! 2-gram sequence !! 3-gram sequence
! Field !! Unit !!Sample sequence !! 1-gram sequence !! 2-gram sequence !! 3-gram sequence
|-
|-
Line 30: Line 29:
| [[DNA sequencing]]|| [[base pair]] || ...AGCTTCGA... || ..., A, G, C, T, T, C, G, A, ... || ..., AG, GC, CT, TT, TC, CG, GA, ... || ..., AGC, GCT, CTT, TTC, TCG, CGA, ...
| [[DNA sequencing]]|| [[base pair]] || ...AGCTTCGA... || ..., A, G, C, T, T, C, G, A, ... || ..., AG, GC, CT, TT, TC, CG, GA, ... || ..., AGC, GCT, CTT, TTC, TCG, CGA, ...
|-
|-
| [[Language model]] || [[Character (computing)|character]] ||  ...to_be_or_not_to_be... || ..., t, o, _, b, e, _, o, r, _, n, o, t, _, t, o, _, b, e, ... || ..., to, o_, _b, be, e_, _o, or, r_, _n, no, ot, t_, _t, to, o_, _b, be, ... || ..., to_, o_b, _be, be_, e_o, _or, or_, r_n, _no, not, ot_, t_t, _to, to_, o_b, _be, ...
| Character ''n''-gram [[language model]] || [[Character (computing)|character]] ||  ...to_be_or_not_to_be... || ..., t, o, _, b, e, _, o, r, _, n, o, t, _, t, o, _, b, e, ... || ..., to, o_, _b, be, e_, _o, or, r_, _n, no, ot, t_, _t, to, o_, _b, be, ... || ..., to_, o_b, _be, be_, e_o, _or, or_, r_n, _no, not, ot_, t_t, _to, to_, o_b, _be, ...
|-
|-
| [[word n-gram language model|Word ''n''-gram language model]] ||[[word]] ||  ... to be or not to be ... ||  ..., to, be, or, not, to, be, ... ||  ..., to be, be or, or not, not to, to be, ... || ..., to be or, be or not, or not to, not to be, ...
| [[word n-gram language model|Word ''n''-gram language model]] ||[[word]] ||  ... to be or not to be ... ||  ..., to, be, or, not, to, be, ... ||  ..., to be, be or, or not, not to, to be, ... || ..., to be or, be or not, or not to, not to be, ...
Line 37: Line 36:
Figure 1 shows several example sequences and the corresponding 1-gram, 2-gram and 3-gram sequences.
Figure 1 shows several example sequences and the corresponding 1-gram, 2-gram and 3-gram sequences.


Here are further examples; these are word-level 3-grams and 4-grams (and counts of the number of times they appeared) from the Google ''n''-gram corpus.<ref>{{cite web |first1=Alex |last1=Franz |first2=Thorsten |last2=Brants |title=All Our ''N''-gram are Belong to You |year=2006 |work=Google Research Blog |url=http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html |access-date=2011-12-16 |archive-date=17 October 2006 |archive-url=https://web.archive.org/web/20061017225954/http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html |url-status=live }}</ref>
Here are further examples; these are word-level 3-grams and 4-grams (and counts of the number of times they appeared) from the Google ''n''-gram corpus.<ref>{{cite web |first1=Alex |last1=Franz |first2=Thorsten |last2=Brants |title=All Our ''N''-gram are Belong to You |year=2006 |work=Google Research Blog |url=https://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html |access-date=2011-12-16 |archive-date=17 October 2006 |archive-url=https://web.archive.org/web/20061017225954/http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html |url-status=live }}</ref>


3-grams
3-grams
Line 61: Line 60:
* {{cite journal |last1=White |first1=Owen |last2=Dunning |first2=Ted |last3=Sutton |first3=Granger |last4=Adams |first4=Mark |last5=Venter |first5=J. Craig |last6=Fields |first6=Chris |year=1993 |title=A quality control algorithm for dna sequencing projects |journal=Nucleic Acids Research |volume=21 |issue=16 |pages=3829–3838 |doi=10.1093/nar/21.16.3829 |pmid=8367301 |pmc=309901 }}
* {{cite journal |last1=White |first1=Owen |last2=Dunning |first2=Ted |last3=Sutton |first3=Granger |last4=Adams |first4=Mark |last5=Venter |first5=J. Craig |last6=Fields |first6=Chris |year=1993 |title=A quality control algorithm for dna sequencing projects |journal=Nucleic Acids Research |volume=21 |issue=16 |pages=3829–3838 |doi=10.1093/nar/21.16.3829 |pmid=8367301 |pmc=309901 }}
* Damerau, Frederick J.; ''Markov Models and Linguistic Theory'', Mouton, The Hague, 1971
* Damerau, Frederick J.; ''Markov Models and Linguistic Theory'', Mouton, The Hague, 1971
* {{cite journal |last1=Figueroa |first1=Alejandro |last2=Atkinson |first2=John |year=2012 |title=Contextual Language Models For Ranking Answers To Natural Language Definition Questions |url=https://www.researchgate.net/publication/262176888 |journal=Computational Intelligence |volume=28 |issue=4 |pages=528–548 |doi=10.1111/j.1467-8640.2012.00426.x |s2cid=27378409 }}
* {{cite journal |last1=Figueroa |first1=Alejandro |last2=Atkinson |first2=John |year=2012 |title=Contextual Language Models For Ranking Answers To Natural Language Definition Questions |journal=Computational Intelligence |volume=28 |issue=4 |pages=528–548 |doi=10.1111/j.1467-8640.2012.00426.x }}
* {{cite conference |last=Brocardo |first=Marcelo Luiz |first2=Issa |last2=Traore |first3=Sherif |last3=Saad |first4=Isaac |last4=Woungang |url=https://ieeexplore.ieee.org/document/6705711 |title=Authorship Verification for Short Messages Using Stylometry |conference= IEEE International Conference on Computer, Information and Telecommunication Systems (CITS) |year=2013 }}
* {{cite book |last1=Brocardo |first1=Marcelo Luiz |last2=Traore |first2=Issa |last3=Saad |first3=Sherif |last4=Woungang |first4=Isaac |title=2013 International Conference on Computer, Information and Telecommunication Systems (CITS) |chapter=Authorship verification for short messages using stylometry |date=2013 |pages=1–6 |doi=10.1109/CITS.2013.6705711 |isbn=978-1-4799-0168-5 }}


== See also ==
== See also ==
Line 69: Line 68:
== External links ==
== External links ==
* [https://www.ngramextractor.com Ngram Extractor: Gives weight of ''n''-gram based on their frequency.]
* [https://www.ngramextractor.com Ngram Extractor: Gives weight of ''n''-gram based on their frequency.]
* [https://books.google.com/ngrams Google's Google Books ''n''-gram viewer] and [http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html Web ''n''-grams database] (September 2006)
* [https://books.google.com/ngrams Google's Google Books ''n''-gram viewer] and [https://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html Web ''n''-grams database] (September 2006)
* [http://data.statoperator.com/ STATOPERATOR N-grams Project Weighted ''n''-gram viewer for every domain in Alexa Top 1M]
* [http://data.statoperator.com/ STATOPERATOR N-grams Project Weighted ''n''-gram viewer for every domain in Alexa Top 1M]
* [http://www.ngrams.info/ 1,000,000 most frequent 2,3,4,5-grams] from the 425 million word [[Corpus of Contemporary American English]]
* [http://www.ngrams.info/ 1,000,000 most frequent 2,3,4,5-grams] from the 425 million word [[Corpus of Contemporary American English]]

Latest revision as of 14:22, 14 December 2025

An n-gram is a sequence of n adjacent symbols in a particular order.[1] The symbols may be n adjacent letters (including punctuation marks and blanks), syllables, or rarely whole words found in a language dataset; or adjacent phonemes extracted from a speech-recording dataset, or adjacent base pairs extracted from a genome. They are collected from a text corpus or speech corpus.

If Latin numerical prefixes are used, then an n-gram of size 1 is called a "unigram", size 2 a "bigram" (or, less commonly, a "digram"), and so on. If English cardinal numbers are used instead of the Latin prefixes, they are called "four-gram", "five-gram", etc. Similarly, in computational biology, polymers or oligomers of a known size, called k-mers, are named with Greek numerical prefixes ("monomer", "dimer", "trimer", "tetramer", "pentamer", etc.) or with English cardinal numbers ("one-mer", "two-mer", "three-mer", etc.). When the items are words, n-grams may also be called shingles.[2]

In natural language processing (NLP), n-grams allow bag-of-words models to capture information such as word order, which is not possible in the traditional bag-of-words setting.
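
As a rough illustration of this point, the sketch below compares two sentences that contain the same words in a different order. The helper `word_ngrams` and the two example sentences are assumptions made for this illustration, not part of any particular library.

```python
from collections import Counter

def word_ngrams(text, n):
    """Return the word-level n-grams of text as a list of tuples."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

a = "dog bites man"
b = "man bites dog"

# Bag of words (unigram counts): identical for both sentences.
bow_a, bow_b = Counter(word_ngrams(a, 1)), Counter(word_ngrams(b, 1))

# Bag of bigrams: differs, because each adjacent pair encodes local order.
bob_a, bob_b = Counter(word_ngrams(a, 2)), Counter(word_ngrams(b, 2))

print(bow_a == bow_b)  # True
print(bob_a == bob_b)  # False
```

The unigram bags cannot tell the two sentences apart, while the bigram bags can, because "dog bites" occurs in only one of them.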

Examples

In 1951, Shannon[3] discussed n-gram models of English. For example:

  • 3-gram character model (random draw based on the probabilities of each trigram): in no ist lat whey cratict froure birs grocid pondenome of demonstures of the retagin is regiactiona of cre
  • 2-gram word model (random draw of words taking into account their transition probabilities): the head and in frontal attack on an english writer that the character of this point is therefore another method for the letters that the time of who ever told the problem for an unexpected
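
A random draw of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not Shannon's procedure: the toy `corpus` string, the function names, and the fixed random seed are all invented for the example. Each new character is drawn in proportion to how often it followed the previous two characters, which corresponds to a Markov model of order 2.

```python
import random
from collections import Counter, defaultdict

def train_char_trigrams(text):
    """Map each 2-character context to a Counter of the characters that follow it."""
    model = defaultdict(Counter)
    for i in range(len(text) - 2):
        model[text[i:i + 2]][text[i + 2]] += 1
    return model

def sample(model, seed_context, length, rng):
    """Extend seed_context one character at a time, drawing each next
    character in proportion to its trigram count."""
    out = seed_context
    while len(out) < length:
        followers = model.get(out[-2:])
        if not followers:  # context never seen with a successor: stop early
            break
        chars, weights = zip(*followers.items())
        out += rng.choices(chars, weights=weights)[0]
    return out

# Toy stand-in corpus (an assumption); a real model would use far more text.
corpus = "to be or not to be that is the question "
model = train_char_trigrams(corpus)
text = sample(model, "to", 40, random.Random(0))
print(text)
```

Every trigram in the generated string occurs somewhere in the training text, so with a corpus this small the output mostly recombines fragments of the original sentence.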
{| class="wikitable"
|+ Figure 1. n-gram examples from various disciplines
! Field !! Unit !! Sample sequence !! 1-gram sequence !! 2-gram sequence !! 3-gram sequence
|-
| Vernacular name || || || unigram || bigram || trigram
|-
| Order of resulting Markov model || || || 0 || 1 || 2
|-
| Protein sequencing || amino acid || ... Cys-Gly-Leu-Ser-Trp ... || ..., Cys, Gly, Leu, Ser, Trp, ... || ..., Cys-Gly, Gly-Leu, Leu-Ser, Ser-Trp, ... || ..., Cys-Gly-Leu, Gly-Leu-Ser, Leu-Ser-Trp, ...
|-
| DNA sequencing || base pair || ...AGCTTCGA... || ..., A, G, C, T, T, C, G, A, ... || ..., AG, GC, CT, TT, TC, CG, GA, ... || ..., AGC, GCT, CTT, TTC, TCG, CGA, ...
|-
| Character n-gram language model || character || ...to_be_or_not_to_be... || ..., t, o, _, b, e, _, o, r, _, n, o, t, _, t, o, _, b, e, ... || ..., to, o_, _b, be, e_, _o, or, r_, _n, no, ot, t_, _t, to, o_, _b, be, ... || ..., to_, o_b, _be, be_, e_o, _or, or_, r_n, _no, not, ot_, t_t, _to, to_, o_b, _be, ...
|-
| Word n-gram language model || word || ... to be or not to be ... || ..., to, be, or, not, to, be, ... || ..., to be, be or, or not, not to, to be, ... || ..., to be or, be or not, or not to, not to be, ...
|}

Figure 1 shows several example sequences and the corresponding 1-gram, 2-gram and 3-gram sequences.
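
The 1-gram, 2-gram and 3-gram sequences in Figure 1 can be reproduced with a sliding window. The following is a minimal sketch; the function name `ngrams` is an assumption chosen for this example, and the inputs are the DNA and word samples from the figure.

```python
def ngrams(items, n):
    """Slide a window of width n over items and return each window as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

dna = "AGCTTCGA"                        # units are single bases
words = "to be or not to be".split()    # units are whole words

for n in (1, 2, 3):
    print(n, ["".join(g) for g in ngrams(dna, n)])
    print(n, [" ".join(g) for g in ngrams(words, n)])
```

The same function serves both rows because only the choice of unit changes: a string is treated as a sequence of characters, and a list as a sequence of words.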

Further examples are the following word-level 3-grams and 4-grams from the Google n-gram corpus, each shown with the number of times it appeared.[4]

3-grams

  • ceramics collectables collectibles (55)
  • ceramics collectables fine (130)
  • ceramics collected by (52)
  • ceramics collectible pottery (50)
  • ceramics collectibles cooking (45)

4-grams

  • serve as the incoming (92)
  • serve as the incubator (99)
  • serve as the independent (794)
  • serve as the index (223)
  • serve as the indication (72)
  • serve as the indicator (120)
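
Counts like those above come from tallying every n-gram in a corpus. The sketch below shows the idea on a toy sentence; it does not read the Google corpus, and the function name `count_word_ngrams` and the sample text are assumptions made for illustration.

```python
from collections import Counter

def count_word_ngrams(text, n):
    """Count how often each word-level n-gram occurs in text."""
    words = text.lower().split()
    # zip over n staggered copies of the word list yields each n-word window
    return Counter(zip(*(words[i:] for i in range(n))))

toy = "to be or not to be"  # toy stand-in for a real corpus
counts = count_word_ngrams(toy, 2)

# Print in the "n-gram (count)" style used in the lists above.
for gram, c in counts.most_common():
    print(" ".join(gram), f"({c})")
```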

References

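  1. Deller, John R.; Hansen, John. "Methods, Models, and Algorithms for Modern Speech Processing." The Electrical Engineering Handbook. 2005. pp. 861–890. doi:10.1016/B978-012170960-0/50063-3. ISBN 978-0-12-170960-0.
  2. Broder, Andrei Z.; Glassman, Steven C.; Manasse, Mark S.; Zweig, Geoffrey. "Syntactic clustering of the web." Computer Networks and ISDN Systems. 29 (8): 1157–1166. 1997. doi:10.1016/s0169-7552(97)00031-7.
  3. Shannon, Claude E. "The redundancy of English." Cybernetics; Transactions of the 7th Conference, New York: Josiah Macy, Jr. Foundation. 1951.
  4. Franz, Alex; Brants, Thorsten. "All Our N-gram are Belong to You." Google Research Blog. 2006. https://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html. Archived 17 October 2006. Accessed 2011-12-16.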

Further reading

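  • Manning, Christopher D.; Schütze, Hinrich; Foundations of Statistical Natural Language Processing, MIT Press: 1999
  • White, Owen; Dunning, Ted; Sutton, Granger; Adams, Mark; Venter, J. Craig; Fields, Chris. "A quality control algorithm for DNA sequencing projects." Nucleic Acids Research. 21 (16): 3829–3838. 1993. doi:10.1093/nar/21.16.3829. PMID 8367301. PMC 309901.
  • Damerau, Frederick J.; Markov Models and Linguistic Theory, Mouton, The Hague, 1971
  • Figueroa, Alejandro; Atkinson, John. "Contextual Language Models For Ranking Answers To Natural Language Definition Questions." Computational Intelligence. 28 (4): 528–548. 2012. doi:10.1111/j.1467-8640.2012.00426.x.
  • Brocardo, Marcelo Luiz; Traore, Issa; Saad, Sherif; Woungang, Isaac. "Authorship verification for short messages using stylometry." 2013 International Conference on Computer, Information and Telecommunication Systems (CITS). 2013. pp. 1–6. doi:10.1109/CITS.2013.6705711. ISBN 978-1-4799-0168-5.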

See also

External links
