imported>Ekamanganese at 01:08, 23 January 2025

2025-01-23T01:08:30Z

New page

{{short description|Information without a formal data model}}
[[File:Photograph of Departmental Records Branch Military Records Center in Alexandria, Virginia - NARA - 23855327.jpg|350px|thumb|right|Unsorted records captured from [[Nazi Germany]] at the [[National Archives and Records Administration|U.S. National Archives]] Military Records Center in [[Alexandria, Virginia]], 1956]]
'''Unstructured data''' (or '''unstructured information''') is information that either does not have a pre-defined [[data model]] or is not organized in a pre-defined manner. Unstructured information is typically [[plain text|text]]-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and [[ambiguities]] that make it harder for traditional programs to process than data stored in fielded form in databases or [[annotation|annotated]] ([[Tag (metadata)|semantically tagged]]) in documents.

In 1998, [[Merrill Lynch]] said "unstructured data comprises the vast majority of data found in an organization, some estimates run as high as 80%."<ref>{{cite web |last1=Shilakes |first1=Christopher C. |last2=Tylman |first2=Julie |title=Enterprise Information Portals |url=http://ikt.hia.no/perep/eip_ind.pdf |archive-url=https://web.archive.org/web/20110724175845/http://ikt.hia.no/perep/eip_ind.pdf |url-status=dead |archive-date=24 July 2011 |website=Merrill Lynch |date=16 Nov 1998}}</ref> It is unclear what the source of this number is, but nonetheless it is accepted by some.<ref name="Clarabridge">{{cite web |last1=Grimes |first1=Seth |title=Unstructured Data and the 80 Percent Rule |url=http://breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule |website=Breakthrough Analysis - Bridgepoints |publisher=Clarabridge |date=1 August 2008}}</ref> Other sources have reported similar or higher percentages of unstructured data.<ref>{{Cite journal|last1=Gandomi|first1=Amir|last2=Haider|first2=Murtaza|date=April 2015|title=Beyond the hype: Big data concepts, methods, and analytics|journal=International Journal of Information Management|volume=35|issue=2|pages=137–144|doi=10.1016/j.ijinfomgt.2014.10.007|issn=0268-4012|doi-access=free}}</ref><ref>{{Cite news|url=https://www.ibm.com/blogs/watson/2016/05/biggest-data-challenges-might-not-even-know/|title=The biggest data challenges that you might not even know you have - Watson|date=2016-05-25|work=Watson|access-date=2018-10-02|language=en-US}}</ref><ref>{{Cite web|url=https://www.datamation.com/big-data/structured-vs-unstructured-data.html|title=Structured vs. Unstructured Data|website=www.datamation.com|language=en|access-date=2018-10-02}}</ref>

{{asof|2012}}, [[International Data Corporation|IDC]] and [[Dell EMC]] project that data will grow to 40 [[zettabytes]] by 2020, a 50-fold growth from the beginning of 2010.<ref name="idc">{{cite web |title=EMC News Press Release: New Digital Universe Study Reveals Big Data Gap: Less Than 1% of World's Data is Analyzed; Less Than 20% is Protected |url=http://www.emc.com/about/news/press/2012/20121211-01.htm |website=www.emc.com |publisher=EMC Corporation |date=December 2012}}</ref> More recently, IDC and [[Seagate Technology|Seagate]] predict that the global [[datasphere]] will grow to 163 zettabytes by 2025,<ref>{{Cite news|url=https://www.seagate.com/our-story/data-age-2025/|title=Trends {{!}} Seagate US|work=Seagate.com|access-date=2018-10-01|language=en-US}}</ref> and that the majority of that data will be unstructured. ''[[Computerworld]]'' magazine states that unstructured information might account for more than 70–80% of all data in organizations.{{ref|computerworld}}

== Background ==

The earliest research into [[business intelligence]] focused on unstructured textual data, rather than numerical data.<ref name="History">{{cite web|last1=Grimes|first1=Seth|title=A Brief History of Text Analytics|url=http://www.b-eye-network.com/view/6311|website=B Eye Network|access-date=June 24, 2016}}</ref> As early as 1958, [[computer science]] researchers like [[Hans Peter Luhn|H.P. Luhn]] were particularly concerned with the extraction and classification of unstructured text.<ref name="History" /> However, only since the turn of the century has the technology caught up with the research interest. In 2004, the [[SAS Institute]] developed the [[SAS (software)|SAS]] Text Miner, which uses [[Singular Value Decomposition]] (SVD) to reduce a [[Dimensional Analysis|hyper-dimensional]] textual [[space (mathematics)|space]] into smaller dimensions for significantly more efficient machine-analysis.<ref name="SVD">{{cite web|last1=Albright|first1=Russ|title=Taming Text with the SVD|url=http://ftp.sas.com/techsup/download/EMiner/TamingTextwiththeSVD.pdf|archive-url=https://web.archive.org/web/20160930182157/http://ftp.sas.com/techsup/download/EMiner/TamingTextwiththeSVD.pdf|url-status=dead|archive-date=2016-09-30|website=SAS|access-date=June 24, 2016}}</ref> The mathematical and technological advances sparked by [[machine learning|machine]] textual analysis prompted a number of businesses to research applications, leading to the development of fields like [[sentiment analysis]], [[voice of the customer]] mining, and call center optimization.<ref name="Applications">{{cite web|last1=Desai|first1=Manish|title=Applications of Text Analytics|url=http://mybusinessanalytics.blogspot.com/2009/08/applications-of-text-analytics.html|website=My Business Analytics @ Blogspot|access-date=June 24, 2016|date=2009-08-09}}</ref> The emergence of [[Big Data]] in the late 2000s led to a heightened interest in the applications of unstructured data analytics in contemporary fields such as [[predictive analytics]] and [[root cause analysis]].<ref>{{cite web|last1=Chakraborty|first1=Goutam|title=Analysis of Unstructured Data: Applications of Text Analytics and Sentiment Mining|url=https://support.sas.com/resources/papers/proceedings14/1288-2014.pdf|website=SAS|access-date=June 24, 2016}}</ref>

== Issues with terminology ==
The term is imprecise for several reasons:
# [[Structure]], while not formally defined, can still be implied.
# Data with some form of structure may still be characterized as unstructured if its structure is not helpful for the processing task at hand.
# Unstructured information might have some structure ([[semi-structured data|semi-structured]]) or even be highly structured but in ways that are unanticipated or unannounced.

== Dealing with unstructured data ==
Techniques such as [[data mining]], [[natural language processing]] (NLP), and [[text analytics]] provide different methods to [[pattern recognition|find patterns]] in, or otherwise interpret, this information. Common techniques for structuring text usually involve manual [[Tag (metadata)|tagging with metadata]] or [[part-of-speech tagging]] for further [[text mining]]-based structuring. The [[UIMA|Unstructured Information Management Architecture]] (UIMA) standard provided a common framework for processing this information to extract meaning and create structured data about the information.
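As a minimal illustration of finding patterns in free text, the following Python sketch pulls dates and monetary amounts out of a sentence and returns them as a structured record. The sample note, the two regular expressions, and the function name <code>extract_fields</code> are assumptions made for this example only; practical systems typically rely on far richer NLP pipelines than pattern matching.

```python
import re

# Hypothetical free-text note; the patterns below are illustrative, not exhaustive.
NOTE = "Invoice sent on 2024-03-15 for 1,250.50 USD; reminder due 2024-04-01."

# ISO-style dates such as 2024-03-15.
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
# A number (optional thousands separators and decimals) followed by a 3-letter code.
AMOUNT_RE = re.compile(r"\b(\d{1,3}(?:,\d{3})*(?:\.\d+)?)\s+([A-Z]{3})\b")


def extract_fields(text):
    """Return the dates and amounts found in free text as a structured record."""
    dates = [m.group(0) for m in DATE_RE.finditer(text)]
    amounts = [
        {"value": float(m.group(1).replace(",", "")), "currency": m.group(2)}
        for m in AMOUNT_RE.finditer(text)
    ]
    return {"dates": dates, "amounts": amounts}


record = extract_fields(NOTE)
print(record)
```

The resulting dictionary can be stored in fielded form, which is the step that turns the sentence into data a conventional program can query.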

Software that creates machine-processable structure can utilize the linguistic, auditory, and visual structures that exist in all forms of human communication.<ref name="IntelligentEnterprise">{{cite web |title=Structure, Models and Meaning: Is "unstructured" data merely unmodeled? |url=http://www.intelligententerprise.com/showArticle.jhtml?articleID=59301538 |website=InformationWeek |language=en |date=March 1, 2005}}</ref> Algorithms can infer this inherent structure from text, for instance, by examining word [[morphology (linguistics)|morphology]], sentence syntax, and other small- and large-scale patterns. Unstructured information can then be enriched and tagged to address ambiguities, and relevancy-based techniques can then be used to facilitate search and discovery. Examples of "unstructured data" may include books, journals, documents, [[metadata]], [[health record]]s, [[Sound|audio]], [[video]], [[Analog device|analog data]], images, files, and unstructured text such as the body of an [[e-mail]] message, [[Web page]], or [[word-processor]] document. While the main content being conveyed does not have a defined structure, it generally comes packaged in objects (e.g. in files or documents, ...) that themselves have structure and are thus a mix of structured and unstructured data, but collectively this is still referred to as "unstructured data".<ref>{{cite web |last1=Malone |first1=Robert |title=Structuring Unstructured Data |url=https://www.forbes.com/2007/04/04/teradata-solution-software-biz-logistics-cx_rm_0405data.html |website=Forbes |language=en |date=April 5, 2007}}</ref> For example, an [[HTML]] web page is tagged, but HTML mark-up typically serves solely for rendering. It does not capture the meaning or function of tagged elements in ways that support automated processing of the information content of the page. [[XHTML]] tagging does allow machine processing of elements, although it typically does not capture or convey the semantic meaning of tagged terms.
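The point about HTML can be illustrated with a short Python sketch that uses the standard-library <code>html.parser</code> module to separate a page's mark-up from its text. The sample page and the class name <code>TextAndTags</code> are assumptions for this example. The tags that come out (<code>h1</code>, <code>p</code>, <code>b</code>) describe layout and emphasis only; nothing in them says that "12%" is a revenue change or that "March" is a date.

```python
from html.parser import HTMLParser


class TextAndTags(HTMLParser):
    """Collect tag names and text content separately (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []   # mark-up seen, in document order
        self.text = []   # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


# Hypothetical page used only for this example.
PAGE = (
    "<html><body><h1>Quarterly report</h1>"
    "<p>Revenue rose <b>12%</b> in March.</p></body></html>"
)

parser = TextAndTags()
parser.feed(PAGE)
parser.close()

print(parser.tags)            # structure that exists: rendering-oriented tags
print(" ".join(parser.text))  # content that remains unstructured text
```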

Since unstructured data commonly occurs in [[electronic document]]s, the use of a [[content management|content]] or [[document management]] system which can categorize entire documents is often preferred over data transfer and manipulation from within the documents. Document management thus provides the means to impose structure on [[text corpus|document collections]].

[[Search engines]] have become popular tools for indexing and searching through such data, especially text.
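A common data structure behind text search is the inverted index, which maps each term to the documents containing it. The Python sketch below is a simplified illustration under stated assumptions: the document identifiers, the sample texts, and the helper names <code>tokenize</code>, <code>build_index</code>, and <code>search</code> are invented for this example, and real search engines add ranking, stemming, and many other refinements.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical unstructured documents.
DOCS = {
    "memo-1": "Customer complained about late delivery.",
    "memo-2": "Delivery schedule updated for March.",
    "email-7": "Customer praised the updated schedule.",
}

index = build_index(DOCS)
hits = search(index, "customer delivery")
print(hits)  # {'memo-1'}
```

The index is itself a small structured data set derived from the text, which is why search is often the first tool applied to otherwise unmodeled documents.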

=== Approaches in natural language processing ===
Specific computational workflows have been developed to impose structure upon the unstructured data contained within text documents. These workflows are generally designed to handle sets of thousands or even millions of documents, far more than manual approaches to annotation can handle. Several of these approaches are based upon the concept of [[Online analytical processing|online analytical processing, or OLAP]], and may be supported by data models such as text cubes.<ref>{{Cite book|last1=Lin|first1=Cindy Xide|last2=Ding|first2=Bolin|last3=Han|first3=Jiawei|last4=Zhu|first4=Feida|last5=Zhao|first5=Bo|title=2008 Eighth IEEE International Conference on Data Mining |chapter=Text Cube: Computing IR Measures for Multidimensional Text Database Analysis |date=December 2008|pages=905–910 |language=en-US|publisher=IEEE|doi=10.1109/icdm.2008.135|isbn=9780769535029|citeseerx=10.1.1.215.3177|s2cid=1522480}}</ref> Once document metadata is available through a data model, generating summaries of subsets of documents (i.e., cells within a text cube) may be performed with phrase-based approaches.<ref name = "textcubes">{{cite web |title=Multi-Dimensional, Phrase-Based Summarization in Text Cubes |url=http://sites.computer.org/debull/A16sept/p74.pdf |last1=Tao|first1=Fangbo | last2=Zhuang|first2=Honglei | last3=Yu|first3=Chi Wang| first4=Qi|last4=Wang | first5=Taylor|last5=Cassidy | first6=Lance|last6=Kaplan | first7=Clare|last7=Voss| last8=Han | first8=Jiawei | date=2016}}</ref>
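The idea of a text cube cell can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the algorithm of the cited papers: the sample documents, the metadata dimensions <code>year</code> and <code>topic</code>, and the function name <code>build_cube</code> are assumptions, and plain term counts stand in for the phrase-based summaries described above.

```python
from collections import Counter, defaultdict

# Hypothetical documents with metadata dimensions and free text.
DOCS = [
    {"year": 2015, "topic": "cardiology", "text": "heart failure trial results"},
    {"year": 2015, "topic": "cardiology", "text": "heart valve repair"},
    {"year": 2016, "topic": "oncology", "text": "tumor growth trial"},
]


def build_cube(docs, dimensions):
    """Group documents into cells keyed by their values on the chosen
    dimensions, and count the terms that occur within each cell."""
    cube = defaultdict(Counter)
    for doc in docs:
        cell = tuple(doc[d] for d in dimensions)
        cube[cell].update(doc["text"].split())
    return cube


# Full cube: one cell per (year, topic) combination.
cube = build_cube(DOCS, ("year", "topic"))
print(cube[(2015, "cardiology")].most_common(1))  # [('heart', 2)]

# Roll-up: aggregate over topic by keeping only the year dimension.
by_year = build_cube(DOCS, ("year",))
print(by_year[(2015,)]["trial"])  # 1
```

Choosing a different tuple of dimensions corresponds to the OLAP roll-up and drill-down operations: the same documents are regrouped, and each cell gets its own summary.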

=== Approaches in medicine and biomedical research ===
Biomedical research generates one major source of unstructured data as researchers often publish their findings in scholarly journals. Though the language in these documents is challenging to derive structural elements from (e.g., due to the complicated technical vocabulary contained within and the [[domain knowledge]] required to fully contextualize observations), the results of such structuring efforts may yield links between technical and medical studies<ref>{{Cite journal|last1=Collier|first1=Nigel|last2=Nazarenko|first2=Adeline|last3=Baud|first3=Robert|last4=Ruch|first4=Patrick|date=June 2006|title=Recent advances in natural language processing for biomedical applications|journal=International Journal of Medical Informatics|volume=75|issue=6|pages=413–417|doi=10.1016/j.ijmedinf.2005.06.008|issn=1386-5056|pmid=16139564|s2cid=31449783 }}</ref> and clues regarding new disease therapies.<ref>{{Cite journal|last1=Gonzalez|first1=Graciela H.|last2=Tahsin|first2=Tasnia|last3=Goodale|first3=Britton C.|last4=Greene|first4=Anna C.|last5=Greene|first5=Casey S.|date=January 2016|title=Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery|journal=Briefings in Bioinformatics|volume=17|issue=1|pages=33–42|doi=10.1093/bib/bbv087|issn=1477-4054|pmc=4719073|pmid=26420781}}</ref> Recent efforts to enforce structure upon biomedical documents include [[self-organizing map]] approaches for identifying topics among documents,<ref>{{Cite journal|last1=Skupin|first1=André|last2=Biberstine|first2=Joseph R.|last3=Börner|first3=Katy|date=2013|title=Visualizing the topical structure of the medical sciences: a self-organizing map approach|journal=PLOS ONE|volume=8|issue=3|pages=e58779|doi=10.1371/journal.pone.0058779|issn=1932-6203|pmc=3595294|pmid=23554924|bibcode=2013PLoSO...858779S|doi-access=free}}</ref> general-purpose [[Unsupervised learning|unsupervised algorithms]],<ref>{{Cite journal|last1=Kiela|first1=Douwe|last2=Guo|first2=Yufan|last3=Stenius|first3=Ulla|last4=Korhonen|first4=Anna|date=2015-04-01|title=Unsupervised discovery of information structure in biomedical documents|journal=Bioinformatics|volume=31|issue=7|pages=1084–1092|doi=10.1093/bioinformatics/btu758|issn=1367-4811|pmid=25411329|doi-access=free}}</ref> and an application of the CaseOLAP workflow<ref name = "textcubes" /> to determine associations between protein names and [[cardiovascular disease]] topics in the literature.<ref name="caseolapCV">{{Cite journal|last1=Liem|first1=David A.|last2=Murali|first2=Sanjana|last3=Sigdel|first3=Dibakar|last4=Shi|first4=Yu|last5=Wang|first5=Xuan|last6=Shen|first6=Jiaming|last7=Choi|first7=Howard|last8=Caufield|first8=John H.|last9=Wang|first9=Wei|last10=Ping|first10=Peipei|last11=Han|first11=Jiawei|date=Oct 1, 2018|title=Phrase mining of textual data to analyze extracellular matrix protein patterns across cardiovascular disease|journal=American Journal of Physiology. Heart and Circulatory Physiology|volume=315|issue=4|pages=H910–H924|doi=10.1152/ajpheart.00175.2018|issn=1522-1539|pmid=29775406|pmc=6230912}}</ref> According to its authors, CaseOLAP defines phrase-category relationships in an accurate (identifies relationships), consistent (highly reproducible), and efficient manner, and makes phrase-mining tools accessible to the biomedical community for a wide range of research applications.<ref name="caseolapCV" />

== The use of "unstructured" in data privacy regulations ==
In Sweden (EU), before 2018, some data privacy regulations did not apply if the data in question was confirmed as "unstructured".<ref>{{Cite web|url=https://sverigeskommunikatorer.se/kunskap/nyheter/gdpr-del-3--missbruksregeln-upphor-vad-innebar-det-for-kommunikatoren/#:~:text=Vad%20inneb%C3%A4r%20Missbruksregeln%3F,men%20%C3%A4ven%20publicering%20av%20bilder|title=Swedish data privacy regulations discontinue separation of "unstructured" and "structured"}}</ref> The term "unstructured data" has rarely been used in the EU since [[GDPR]] came into force in 2018. GDPR neither mentions nor defines "unstructured data". It does use the word "structured", without defining it, as follows:
* Parts of GDPR Recital 15, "The protection of natural persons should apply to the processing of personal data ... if ... contained in a filing system."
* GDPR Article 4, "‘filing system’ means any structured set of personal data which are accessible according to specific criteria ..."

GDPR case law on what defines a "filing system" states: "the specific criterion and the specific form in which the set of personal data collected by each of the members who engage in preaching is actually structured is irrelevant, so long as that set of data makes it possible for the data relating to a specific person who has been contacted to be '''easily retrieved''', which is however for the referring court to ascertain in the light of all the circumstances of the case in the main proceedings." ([[Court_of_Justice_of_the_European_Union|CJEU]], [https://curia.europa.eu/juris/document/document.jsf?docid=203822&doclang=EN Jehovan Todistajat v. Tietosuojavaltuutettu, Jehovan, Paragraph 61]).

If [[personal data]] is easily retrieved, then it is part of a filing system and is therefore in scope for GDPR, regardless of being "structured" or "unstructured". Most electronic systems today,{{As of?|date=September 2023}} subject to access and applied software, can allow for easy retrieval of data.

== See also ==
*[[Cluster analysis|Clustering]]
*[[Pattern recognition]]
*[[List of text mining software]]
*[[Semi-structured data]]
*[[Structured data]]

== Notes ==
#{{note|Today’s Challenge in Government}} Today's Challenge in Government: What to do with Unstructured Information and Why Doing Nothing Isn't An Option, Noel Yuhanna, Principal Analyst, [[Forrester Research]], Nov 2010

==References==
{{Reflist}}

== External links ==
*[http://www.tdan.com/view-articles/5009 Matching Unstructured Data and Structured Data]
*[https://dynomapper.com/blog/21-sitemaps-and-seo/433-what-is-structured-data-for-seo A brief description of structured data]
*[https://securiti.ai/unstructured-data-101-definition-examples-benefits-challenges/ Unstructured Data Definition, Examples, Benefits & Challenges]

{{Data}}
[[Category:Data]]
[[Category:Information technology management]]
[[Category:Business intelligence terms]]
