Creating Structured Data for Online Bios

Written by  Wednesday, 19 April 2017 16:28
 
Introduction
      This study aims to define an effective method and framework for describing online biographical sketches using structured data, or, in a broad sense, “metadata” – the structured, encoded data that describe characteristics of information-bearing entities. Using artists as an example, the most widely seen structured data appear in two primary places on the web: (1) the “knowledge graph” provided by search engines when one searches for an artist, and (2) the “info-box” that appears on the right side of a Wikipedia article about an artist. However, for a knowledge graph or info-box to exist, the individual’s life has to be well documented in a Wikipedia article or in a dedicated record in the Wikidata knowledge base (into which Freebase data were migrated).
The online biographical sketches, or “bios,” that have become a very popular publishing format online are the focus of this study. Recognized biographies in published books, dedicated special journal issues, documentary films, dramas, biographical movies, and similar media were not considered, since they are usually already covered by Wikipedia and search engines’ knowledge graphs. The creator of an online biography can be anyone who uses webpages to record and interpret a particular human life, which can range from a high-profile individual to any person of interest such as a family member or an inspiring person. In most cases, this individual’s life has not been documented in a Wikipedia article or depicted in search engines’ knowledge graphs. Beyond dedicated bios, person-centered websites commonly include an online bio. Websites of institutions (such as research institutes, academic unions, businesses, and government offices), conferences, and awards also publish short bios. By exploring digital online biographies, this study found a wide range of styles and structures, as well as creative uses of digital technologies to record an individual’s life.
This researcher approaches the study from two important sides of the structured data of bios. The first is the human side: by using the methods and the editing tool developed by this project to encode bios with structured data, the creators of bios double the exposure opportunities for what they care about - reaching not only human readers, but also search engines and other digital information services. The second is the machine side: search engines and data crawlers (i.e., programs that systematically browse the Web in order to create an index of data) can obtain the structured data that are generated from the depictions of individuals’ lives. From there, search engines will be able to index and connect the structured data, reuse them in forming knowledge graphs, and provide useful and accurate references during the search process. In addition to search engines, any digital library, portal, or website that is interested in content about individuals and cultural heritage can use these structured data, which have not been well exposed or connected before.
 
Background
      The project grew out of a desire to expose information about native Taiwanese artists effectively and to reuse the resources of digital archives.
In 2014, using digital archives, the Academia Sinica Center for Digital Cultures (ASDC) released a comprehensive book-style digital biography website (in both Chinese and English) for one of the most well-known Taiwanese artists, Chen Cheng-po (陳澄波, 1895-1947, also known as Tan Ting-pho). The artist was the first to have a dedicated website created by ASDC as well as a Google Doodle feature. An International Conference on Chen Cheng-po and Modern East Asian Art History was held in January 2015, indicating the importance of this Taiwanese artist. Along with Chen Cheng-po, short bios of 12 other well-known artists are presented in the “Friends chapter” of the website, written by the researchers of Academia Sinica’s Institute of Taiwan History and ASDC.
Through the data collected during 2015, however, this researcher found that fewer than half of these 12 artists had knowledge graphs accompanying the English search results provided by search engines (including Google, Yahoo, and Bing), and three quarters did not have an English entry or info-box data on Wikipedia. For example, Pu Tian-sheng (蒲添生, 1912-1996), a pioneer of modern sculpture in Taiwan, founder of Taiwan’s first bronze casting factory, and the creator of Taiwan’s first ever likeness of Sun Yat-sen, today on display in front of Taipei Zhongshan Hall, had neither an English-language entry on Wikipedia nor a knowledge graph. The situation was the same for Chen Jin (陳進, 1907-1998), the only well-known female Taiwanese painter before 1945. In terms of structured data, nine of these 12 artists had Chinese name entries in the Wikidata knowledge base, but the information available was minimal and English names were still needed. Their biographies in English were found on a couple of websites, but those trustworthy bios were often not citable or linkable due to web design, such as the use of HTML frames (which cause all pages to share the same upper-level URL).
During the last decade, many open digital archives and digital libraries were established all over the world. The completion of these digital archives involved very complicated and costly digitization processes. They deliver integrated resources (including metadata, representative images, digitized original documents, and multimedia files) through unified interfaces such as online union catalogs like Digital Archives Taiwan. Building on these resources, the digital archives gradually established special theme portals (e.g., Taiwan Biodiversity Information Facility), special digital collections (e.g., 余光中 Kwang-Chung Yu’s Digital Archives), and online exhibitions (such as the one for Chen Cheng-po). Yet, how can we convey the new levels of usefulness of such great digital resources to the international community across the boundaries of language, geography, format, and discipline? With this question in mind, this project considers using structured data to share, connect, and reuse these digital resources. It also seeks to enable international users to discover and rediscover these individuals and their significant roles in culture and history.
 
Methods 
 
       Following Gillian Rose’s visual methodologies, this study considers biographies from three perspectives or facets:
 
 
The first perspective, production, relates to the reason or motivation for making a biography as well as the creation of the biography itself. At this stage, the structured data can be embedded in the <meta> tags of a webpage’s source code or stored with other resource descriptions in a data store. Widely used metadata schemas for such descriptions are the Dublin Core (DC) Metadata Element Set and DC terms.
Another perspective to consider is the audiencing of a biography, which reflects the views of the audience interacting with the biography. From this angle, we consider what audiences receive from reading or viewing the bio and how they act on it (e.g., whether they cite, like, share, comment on, or follow it).
A final perspective is that of the biography itself, particularly its intellectual content, which is the main focus of this project. The following section outlines the selection of the conceptual model and the application of encoding syntaxes.
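As an illustration of the production-side metadata mentioned above, the following minimal sketch renders Dublin Core elements as <meta> tags using the common "DC." naming convention. The function name and the sample values (author and date) are hypothetical and are not taken from the project.

```python
from html import escape

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"


def dublin_core_meta(fields):
    """Render a dict of Dublin Core element names and values as HTML tags."""
    # The <link> declares what the "DC." prefix in the meta names refers to.
    lines = [f'<link rel="schema.DC" href="{DC_NAMESPACE}">']
    for element, value in fields.items():
        lines.append(
            f'<meta name="DC.{escape(element)}" content="{escape(value)}">'
        )
    return "\n".join(lines)


# Hypothetical sample values, for illustration only.
example = dublin_core_meta({
    "title": "Biography of Chen Cheng-po",
    "creator": "Example Author",
    "language": "en",
    "date": "2015-01-01",
})
print(example)
```

The output would be pasted into the <head> section of the bio page; values are HTML-escaped so that quotation marks in a title do not break the attribute.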
As Semantic Web technologies advance quickly, the W3C standard Web Ontology Language (OWL), a knowledge representation language, is widely used in building ontologies, which are formal models for representing knowledge in specific domains. An ontology defines (a) the types of things that exist (classes); (b) the relationships between them (properties); and (c) the logical ways those classes and properties can be used together (axioms). Several important ontologies for “person” have been developed during the last two decades, each with a different focus:
  1. Friend of a Friend (FOAF) was designed as a machine-understandable ontology in the early 2000s. It focuses on persons, their activities, and their relationships to other people and objects.
  2. DBpedia ("DB" for "database") is a database of structured data from Wikipedia which contains 1,445,000 English articles for persons as of 2015. In the DBpedia Ontology, the Person class (which is a sub-class of Agent and sub-sub-class of Thing) has over 260 sub- and sub-sub-classes (e.g., Artist, Writer, Athlete, Skier). For the Artist class alone, there are 18 sub-types and about 250 properties (i.e., attributes and relationships), including many inherited from the Person class.
  3. BIO, an ontology for biographical information, is event-centered. It provides ontological classes for events and properties for relating people to events and events to places and times. The approach taken is to describe a person's life as a series of interconnected key events around which other information can be woven.
  4. The Biography Light model also takes an event-centric approach and has borrowed useful properties from earlier ontologies such as FOAF, BIO, OWL-Time, GeoNames, and Basic Geo. It is aimed at prosopography (also referred to as “collective biography”) in order to connect biographic events to other resources available through the Semantic Web.
  5. A giant ontology, Schema.org, appeared online in mid-2011 and is still expanding. This vocabulary, created by major search engines such as Bing, Google, Yahoo!, and Yandex, aims to provide many schemas under one namespace, so webmasters can describe and expose websites of any kind to search engines. Its impact can be seen on an increasing number of websites across almost all domains and areas. In comparison to the DBpedia Ontology, Schema.org managed to use a limited number of properties for the Person class. Fewer than 50 properties are dedicated to Person; nine of them are inherited from Thing. Additionally, 60 properties for the relationships between instances of Person and other classes are listed in the Person schema.
The perspective of the contents of a biography also means that the semantic meaning of each component in an online bio should be encoded so that search engines and other computer programs can understand, react to, and process the structured data embedded in the bio. Using the properties defined by Schema.org, we can insert those structured data alongside the original text (i.e., annotating the text with metadata tags) in addition to the metadata from the first perspective (production of a bio) that are embedded in the <head> section of a webpage.
Two basic encoding formats, RDFa (Resource Description Framework in Attributes) and Microdata, provide specifications of attributes to be used to express structured data in any markup language. Both function as a means of nesting metadata within existing content on webpages to allow search engines, web crawlers, and browsers to extract and process the metadata statements from a webpage. Both specify only syntax and rely on independent specification of metadata vocabularies by others, such as Schema.org vocabularies. The result is that the original unstructured web-page content becomes semantically structured content and is exposed to search engines and data crawlers.
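To make the two syntaxes concrete, here is a minimal sketch that marks up the same facts about a person once with Microdata attributes (itemscope, itemtype, itemprop) and once with RDFa Lite attributes (vocab, typeof, property), both using the Schema.org vocabulary. The function names are hypothetical, and the markup the project's platform actually generates may differ.

```python
from html import escape

SCHEMA_ORG = "https://schema.org/"


def person_microdata(name, properties):
    """Mark up a person with Microdata (itemscope, itemtype, itemprop)."""
    rows = [
        f'<div itemscope itemtype="{SCHEMA_ORG}Person">',
        f'  <span itemprop="name">{escape(name)}</span>',
    ]
    for prop, value in properties.items():
        rows.append(f'  <span itemprop="{escape(prop)}">{escape(value)}</span>')
    rows.append("</div>")
    return "\n".join(rows)


def person_rdfa(name, properties):
    """Mark up the same person with RDFa Lite (vocab, typeof, property)."""
    rows = [
        f'<div vocab="{SCHEMA_ORG}" typeof="Person">',
        f'  <span property="name">{escape(name)}</span>',
    ]
    for prop, value in properties.items():
        rows.append(f'  <span property="{escape(prop)}">{escape(value)}</span>')
    rows.append("</div>")
    return "\n".join(rows)


# Life dates as given in the bio of Pu Tian-sheng (1912-1996).
facts = {"birthDate": "1912", "deathDate": "1996"}
microdata = person_microdata("Pu Tian-sheng", facts)
rdfa = person_rdfa("Pu Tian-sheng", facts)
print(microdata)
print(rdfa)
```

The two outputs differ only in attribute names, which reflects the point that both formats specify syntax alone and leave the vocabulary to Schema.org.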
After comprehensive review and testing, the project described in this article decided to use Schema.org, even though it was found to have limitations in describing historical figures because its initial focus was on contemporary web content. The two encoding formats, RDFa and Microdata, were selected to create the structured data.
 
Preliminary Results
       The Schema.org ontology and the RDFa and Microdata encoding formats have provided an interoperable model for anyone who would like to embed structured data into an online bio page for exposure to machines and search engines. However, how should one choose appropriately among these 100+ properties about or relating to Person? Which HTML attributes have to be used when adding RDFa- or Microdata-encoded data to an online bio page? Even for a webmaster, these are new challenges.
In order to bridge the gap between the great potential and the technical challenges, this project developed a platform for creating structured data for digital bios. The platform provides a set of templates based on Schema.org's vocabulary and is designed to highlight the information about a specific person based on a biographic sketch. The initial goal is to use embedded structured data to increase the findability of bios, especially those which are not included in the English Wikipedia and do not appear in search engines' knowledge graphs. The webpages' source code is machine-processable and embedded with either Microdata or RDFa, depending on one’s needs, using the Schema.org vocabulary. As one enters the information about a person, such as the person’s name, related URL, nationality, birth date, important roles related to creative and artistic works, and relationships with other persons and organizations, the structured data are generated simultaneously.
To achieve interactive encoding and generate the desired source code on the fly, the project uses interactive forms as the communication channel between a human editor and a computer. The development of the platform involves applying the new versions of HTML (a markup language used for structuring and presenting content on the World Wide Web, currently in its 5th version), Cascading Style Sheets (CSS) (a mechanism for adding style to web documents), and PHP (a server-side scripting language). More than ten thousand lines of source code were written. The template not only fulfills the basic fill-in functions, but also automatically embeds a resource’s Uniform Resource Identifier (URI) with the identity of a “Thing,” e.g., a person, a creative work (painting, book, video), or an institution (e.g., a university, a society, an association). In cases where a property is designed for a class other than “Person,” an inverse relationship is generated (e.g., a creative work is created by a person). The form is also designed to expand when dealing with multiple relationships between a person and other things.
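The inverse relationships described above can be sketched as follows. This is a simplified illustration, not the platform's PHP implementation: it assumes each form entry is a (property, URI) pair, and it flips subject and object whenever the Schema.org property belongs to the other class (for example, "creator" is a property of CreativeWork, not of Person). The set of inverse properties is a small illustrative subset, and all example.org URIs are hypothetical. In RDFa, such a flipped statement is one case where the rev attribute could be used.

```python
# Schema.org properties whose subject is the work rather than the person.
# A small illustrative subset, not a complete list.
INVERSE_PROPERTIES = {"creator", "author"}


def to_triples(person_uri, relations):
    """Turn (property, other_uri) form entries into (subject, predicate, object)."""
    triples = []
    for prop, other_uri in relations:
        if prop in INVERSE_PROPERTIES:
            # The property belongs to the other thing: the work points
            # back to the person.
            triples.append((other_uri, f"schema:{prop}", person_uri))
        else:
            # An ordinary Person property: the person is the subject.
            triples.append((person_uri, f"schema:{prop}", other_uri))
    return triples


# Hypothetical URIs, for illustration only.
person = "https://example.org/person/example-artist"
relations = [
    ("alumniOf", "https://example.org/org/art-school"),
    ("creator", "https://example.org/work/bronze-statue"),
]
triples = to_triples(person, relations)
for triple in triples:
    print(triple)
```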
To demonstrate the use of the template and establish useful structured datasets, the project used the 12 artists’ bios as the base to generate the structured data. The data values were taken from the bios in two ways before they were encoded: first through human analysis of the text, and second through the use of semantic analysis tools (Open Calais and COGITO Intelligence) to extract the entities (people, organizations, places, facts) and their relationships.
     The comparison of the results indicated the pros and cons of both approaches and demonstrated the benefits of their combined use. Another set of creative agents from humanities domains other than fine arts was also used for testing in order to enhance the platform. These can be seen on the landing page of Structured Data for Bios (Beta). For the Microdata format that search engines prefer, one can use the Google Testing Tool to check whether the structured data are correctly encoded in the source code and how a search engine would read them (see the demo of the results). For the RDFa format, which the Linked Open Data community usually prefers, one can take the source code of a bio page constructed using this platform to the RDFa-Play tool and obtain both a visualization result and a raw data result (see another demo of the results).
     The English bios and structured data values have also been submitted to Wikipedia and Wikidata manually, which involved a mapping process. A re-examination two to four weeks after the English bios and structured data were published found that all 12 artists now had knowledge graphs on Google, with varying degrees of detail. The more encouraging finding is that “People also search for” results have already appeared for half of these artists, revealing other artists.
The project also faced different kinds of challenges and unanswered questions. Encoding the bios using Schema.org has obvious limits. Because these 12 artists are mainly historical figures, the English present-tense verbs that are used as property names (such as “follows” and “owns”) and the properties that deal with present situations (such as “jobTitle” and “workLocation”) are not appropriate, even after the obviously irrelevant ones (such as “email” and “taxID”) were excluded. The contents themselves also present challenges, such as the “nationality” of those born in the Qing Dynasty or during the period when Taiwan was under Japanese rule. When RDFa-Play is used to generate raw data, there are issues in dealing with the resources’ URIs in inverse relationships. The Google Testing Tool does not handle inverse relationships at all. This affects the quality of the by-product, the datasets to be stored in the data stores. There are also issues in handling non-English text in persons’ original names in Chinese or Japanese, punctuation in the text strings of description and award notes, and so on.
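One way to approach the text-handling issues mentioned above is to escape every value for its target syntax before embedding it. The sketch below is an assumption about a possible mitigation, not a description of what the project did: it escapes strings for double-quoted Turtle literals (with an optional language tag for names in Chinese or Japanese) and for HTML attribute values. The helper names and sample strings are hypothetical.

```python
import html

# Characters that must be backslash-escaped inside a double-quoted
# Turtle string literal.
_TURTLE_ESCAPES = {
    "\\": "\\\\",
    '"': '\\"',
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
}


def turtle_literal(text, lang=None):
    """Return text as a Turtle string literal, optionally language-tagged."""
    escaped = "".join(_TURTLE_ESCAPES.get(ch, ch) for ch in text)
    literal = f'"{escaped}"'
    # Non-ASCII characters can stay as they are in a UTF-8 Turtle file.
    return f"{literal}@{lang}" if lang else literal


def html_attribute(text):
    """Escape text so it can be placed inside a quoted HTML attribute."""
    return html.escape(text, quote=True)


name_literal = turtle_literal("陳澄波", "zh")
note_literal = turtle_literal('Selected for the "Spring" exhibition\n')
note_attribute = html_attribute('Selected for the "Spring" exhibition & others')
print(name_literal)
print(note_literal)
print(note_attribute)
```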
 
Long Term Goals and Further Impacts
 
     A byproduct of testing and demonstrating the project is the new datasets of the structured bios. The raw data produced by the RDFa-Play tool use one of the RDF serializations called Turtle (Terse RDF Triple Language). The automatically generated RDF dataset for each bio can be downloaded (also called “dumping”) or stored in a graph database using an open source tool such as Fuseki, a SPARQL server, as this project has done. With as many bio datasets included as one likes, many questions can be answered by querying the datasets with the SPARQL query language (a W3C standard for querying RDF graphs). For example, for the 12 artists’ datasets, these questions can be answered in a few seconds: find alumni of a particular college, of multiple colleges, or of either of two specified colleges; list the artists whose art works were selected for a particular exhibition, or for multiple selected exhibitions; find the teacher(s)/mentor(s) of a person or persons; locate the artists who were taught by the same teacher(s); list the people who were born (or not born) in a particular place, before (or after) a particular year; gather people who co-founded a society or who were members of a society or association; and so on. Here, the datasets have become a knowledge base, and the potential for use in research seems to have no end. When the entities (such as a person, a place, or an institution) of these datasets are mapped to the data values in DBpedia, Wikidata, GeoNames, and other existing datasets, the knowledge base will be enriched greatly and can also be shared with, or reused by, anyone in the world.
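As a rough illustration of the kind of dataset and query described above, the following sketch serializes one person's facts as Turtle and builds the text of a SPARQL query for alumni of a given college. The function names, the sample person, and the example.org URIs are hypothetical; the real datasets were generated by RDFa-Play, not by this code, and the query text would still have to be sent to a SPARQL endpoint to be executed.

```python
def _quote(text):
    """Minimal escaping for a double-quoted Turtle string literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def person_to_turtle(person_uri, name, alumni_of=()):
    """Serialize a person's name and colleges as a small Turtle document."""
    statements = ["a schema:Person", f"schema:name {_quote(name)}"]
    if alumni_of:
        colleges = ", ".join(f"<{uri}>" for uri in alumni_of)
        statements.append(f"schema:alumniOf {colleges}")
    # Predicates sharing one subject are separated by ";" and closed by ".".
    body = " ;\n    ".join(statements)
    return f"@prefix schema: <https://schema.org/> .\n\n<{person_uri}> {body} ."


def alumni_query(college_uri):
    """Build a SPARQL query listing people who attended the given college."""
    return (
        "PREFIX schema: <https://schema.org/>\n"
        "SELECT ?person ?name WHERE {\n"
        f"  ?person schema:alumniOf <{college_uri}> ;\n"
        "          schema:name ?name .\n"
        "}"
    )


# Hypothetical URIs and name, for illustration only.
college = "https://example.org/org/college-a"
turtle = person_to_turtle(
    "https://example.org/person/example-artist", "Example Artist", [college]
)
query = alumni_query(college)
print(turtle)
print(query)
```

Questions about teachers, exhibitions, or birthplaces would follow the same pattern, with a different Schema.org property in the query's triple pattern.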
Beyond making it easy to tag a bio and create machine-understandable structured data, the project demonstrates a path toward building knowledge bases using a non-centralized approach. If structured data can be generated for individual bios by anyone and the resulting datasets can be collected and searched through SPARQL query endpoints, the data will connect things meaningfully. 
The approach of using non-centralized resources to generate datasets has special meaning at the current stage of digital archives and digital libraries. On the one hand, digital resources can be further used without waiting for a large “project” and without being a one-time effort that would be difficult to sustain after the funding ends. On the other hand, the creation or contribution of the structured data can come from the bio writers themselves or from the people who care about a particular person, who are usually more familiar with that person and possess more accurate information than an outsider. Thus, the accumulated efforts have no limits or boundaries, while the outcomes can be maximized. Such long-term goals can be achieved at different levels. However, wider participation will happen only if the tools (such as the online bio platform) are handy to use and the roadmaps are clear to follow. This researcher is led by these goals and will continue the effort.
 
 