CENSUS Manuscripts' Archive Search Resource

CENSUS is conceived as a web-accessible archive of documents, so a query resource for retrieving manuscripts by free keywords is being developed alongside the General Catalogue of Dongba manuscripts, which is itself still being developed and implemented.

I developed the core of this retrieval system in PHP (currently version ), with functions that apply NLP processing to the query in order to extend its search power (tokenization, normalization and stemming of the query), and that determine the occurrences and frequency of keywords and the weight of each retrieved document.

My work on the CENSUS retrieval system builds on earlier experience with "Bibliografiapiste", alias "Western Desert Caravans' Routes", an Egyptological bibliography (my MA degree in Egyptology, April 2004). That is where I became passionate about Digital Humanities applications and started to study Information Retrieval systems, inspired by the Computational Linguistics lessons of Prof. Zampolli and Dr. Lenci.

The CENSUS archive of manuscripts, which the PHP retrieval system queries, is stored in a MySQL database (currently version ).

At present the CENSUS search system applies the following NLP operations to the query:

  • Tokenization of the query
  • Removal of stopwords from the tokenized query
  • Normalization of the keywords of the query

Tokenization

Tokenization is the process of breaking a stream of text up into meaningful elements. For instance, take the text "Lorem ipsum dolor sit amet" and tokenize it using the word as the atom, or token, of the function, where a word is the sequence of characters between two blank characters and/or punctuation marks. The original text is then transformed into:

  • Lorem
  • ipsum
  • dolor
  • sit
  • amet

If a keyword search query is considered as a stream of text whose atoms are the single words composing the whole phrase, then tokenizing the query makes it possible to operate on its parts and to offer different search options, such as "all words" and "one word at least".

The CENSUS system uses the PHP "explode" function to divide a string of words into an array of words, where each element of the array contains one word of the original query. For instance, the query "Miniated and colored titlepages" will be tokenized as an array of 4 elements (arrays returned by explode are indexed from 0):

  • Array[0] = Miniated
  • Array[1] = and
  • Array[2] = colored
  • Array[3] = titlepages

Finally, by applying the PHP function strtolower (string to lower), each token, alias each element of the array, is converted from any capitals to lowercase letters. This avoids case mismatches in the dialogue between PHP and MySQL: PHP string comparison is case-sensitive, and MySQL comparison can be too, depending on the collation in use.
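The tokenization and lowercasing steps described above can be sketched in PHP as follows. This is a minimal illustration, not the CENSUS source code: the function name tokenize_query is hypothetical, and splitting on a single space character is an assumption about the delimiter passed to explode.

```php
<?php
// Hypothetical helper (not from the CENSUS code base):
// splits a query into lowercase word tokens.
function tokenize_query(string $query): array
{
    // explode() splits the string on each space and returns a 0-indexed array.
    $tokens = explode(' ', trim($query));
    // Drop empty tokens produced by repeated spaces, then reindex from 0.
    $tokens = array_values(array_filter($tokens, 'strlen'));
    // Convert every token to lowercase with strtolower().
    return array_map('strtolower', $tokens);
}

$tokens = tokenize_query('Miniated and colored titlepages');
print_r($tokens); // four tokens: miniated, and, colored, titlepages
```

Punctuation is not handled here; a fuller tokenizer would probably split on punctuation marks as well, as the definition of a word given earlier suggests.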

Stopwords

Stopwords is the name given by Hans Peter Luhn to words which are filtered out prior to, or after, processing of natural language data (text).

In general, the stopword set of a language can include:

  • indefinite adjectives
  • articles
  • adverbs
  • exclamations
  • interjections
  • prepositions
  • pronouns (indefinite, demonstrative, relative)

in other words, all the grammatical and syntactic words with a very high frequency in every kind of text and with very little semantic specificity and weight.

A list of English-language stopwords is available here:

I considered removing stopwords from the queries that CENSUS has to process to be very useful for improving its performance in retrieving documents from keyword queries, because stopwords are redundant in information and document retrieval operations (quote); accordingly, I decided to adopt a stopword-removal function in the CENSUS system.

With the sequence of tokens stored in the array described above, it is possible to operate on each atom of the original query, checking whether it is a stopword; if it is, and the length of the query permits, the stopword token is eliminated.

For instance, the query "Miniated and colored titlepages", tokenized into

  • [0] = miniated
  • [1] = and
  • [2] = colored
  • [3] = titlepages

will be processed by the tokenizer function, which identifies the stopword "and" in the 2nd token and erases it, reducing the original 4-atom query to a 3-atom one.
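A stopword filter of this kind could look like the sketch below. The function name remove_stopwords and the short stopword list are hypothetical, chosen only for illustration. The rule "if the length of the query permits" is interpreted here, as an assumption, to mean that a query made only of stopwords is left untouched.

```php
<?php
// Hypothetical helper (not from the CENSUS code base):
// removes stopwords from an array of lowercase tokens.
function remove_stopwords(array $tokens, array $stopwords): array
{
    // Keep only the tokens that are not in the stopword list, reindexed from 0.
    $kept = array_values(array_filter(
        $tokens,
        function ($token) use ($stopwords) {
            return !in_array($token, $stopwords, true);
        }
    ));
    // Assumption: if every token is a stopword, keep the query unchanged
    // rather than searching with an empty query.
    return count($kept) > 0 ? $kept : $tokens;
}

// Illustrative stopword list, not the list used by CENSUS.
$stopwords = ['a', 'an', 'and', 'in', 'of', 'or', 'the'];

$tokens = ['miniated', 'and', 'colored', 'titlepages'];
print_r(remove_stopwords($tokens, $stopwords)); // miniated, colored, titlepages
```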

I intended the erasing of stopwords as the first step in normalizing the keywords that compose the query. Eliminating them is useful and improves the performance of the document retrieval system for several reasons.

Among them, I believed that reducing the number of words in a query also reduces the time needed to retrieve documents.

On the other hand, stopwords are by nature words with a high frequency and very little weight and specificity as semantic carriers (for instance grammar words: conjunctions, adverbs, pronouns, etc.). Being very common and numerous in every document, they are nevertheless matched if searched for, and so they interfere with the retrieval of documents.

Moreover, especially in a "one word at least" search session, for instance "miniated or colored titlepages", the presence of stopwords in the query will retrieve documents that match only the stopwords, which amounts to retrieving non-pertinent documents.

In the CENSUS system stopwords are erased from queries for both the "all words" and "one word at least" search options, whereas they are not erased if the user chooses the "exact phrase" search option.

Stemming

In linguistic morphology, stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form – generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.

I decided to equip the CENSUS keyword retrieval system with a basic stemming process: functions that identify certain patterns in the keywords and, once these patterns are identified, stem the keyword accordingly.

I chose stemming because I supposed that queries expressed in natural language by the user could carry all the inflections and syntactic features expected of a grammatically correct writer and/or speaker; a basic stemming operation is thus able to bring different inflections back to a common stem.

That is because I supposed that a user who asks for manuscripts through a query is expressing, by keywords, a semantic concept to be matched in the retrieved manuscripts; the pertinence of the manuscripts therefore has to be closely related not just to the keywords submitted by the user, but to the semantic meanings carried by those keywords.

Thus, for instance, a query composed of the keywords:

  1. miniated
  2. titlepages

whichever search option the user chooses (all words or one word at least), should produce among the retrieved documents manuscripts that also match other semantically equivalent or related queries, such as "miniature in titlepage", "miniate title-page", etc.

In short, to give the CENSUS system this retrieval power, stemming and normalization of the keywords were needed. Some functions for determining prefixes and suffixes are therefore implemented and, where the keyword's length permits, prefixes and suffixes are marked up and erased from the head and tail of the keyword respectively.

For instance, the keyword "miniated" will be reduced to the semantic core "miniat", which is common to all the keywords that share it, such as "miniat[ed]", "miniat[ure]", "miniat[ing]", "miniat[e]", etc.
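A basic suffix-stripping step of this kind might be sketched as follows. The function name stem_keyword, the suffix list and the minimum stem length of 4 characters are all assumptions made for this example; they are not the patterns actually used by CENSUS, and prefix stripping is left out for brevity.

```php
<?php
// Hypothetical helper (not from the CENSUS code base):
// strips one known suffix from a lowercase keyword.
function stem_keyword(string $word, int $minStemLength = 4): string
{
    // Illustrative suffix list, longest suffixes first so that
    // "ure" is tried before "e" and "es" before "s".
    $suffixes = ['ure', 'ing', 'ed', 'es', 'e', 's'];
    foreach ($suffixes as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        // Strip the suffix only where the keyword's length permits,
        // i.e. when the remaining stem is long enough.
        if ($stemLength >= $minStemLength
            && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    // No suffix matched, or the word is too short: return it unchanged.
    return $word;
}

foreach (['miniated', 'miniature', 'miniating', 'miniate'] as $keyword) {
    echo $keyword, ' -> ', stem_keyword($keyword), "\n"; // each maps to "miniat"
}
```

With this suffix list, "titlepages" and "titlepage" both reduce to "titlepag", which is not a valid root but is enough for the two forms to match each other.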

The CENSUS database of Dongba manuscripts can also be queried with some other options that restrict the retrieval criteria more specifically, that is, by restricting the fields of the CENSUS manuscripts database to be queried.

Grouping and ordering retrieved manuscripts

Because CENSUS is meant as a general worldwide catalogue of Dongba manuscripts, and because Dongba manuscripts are, in short, texts dedicated to a particular ceremony, the restrictions currently implemented are:

  • Restrict the query to just one collection of manuscripts
  • Restrict the query to one ceremony

Finally, the user can group and order the retrieved manuscripts by specified fields.

Querying CENSUS

The queries accepted by the CENSUS retrieval system are strings in English, expressed in natural language, without boolean operators; the search option for the keywords and the query must be selected from those available in the interface.

  • Option "ALL WORDS"

    By selecting the "all words" search option in a query session, the user will retrieve all documents archived in CENSUS which contain all the words of the query except for the stopwords; to improve the search power and expand the query, keywords are stemmed before being processed.

  • Option "ONE WORD AT LEAST"

    By selecting the "one word at least" search option in a query session, the user will retrieve all documents available in CENSUS that contain at least one of the words which compose the query. Stopwords will be erased from the query and the remaining keywords submitted to stemming.

  • Option "EXACT PHRASE"

    If the user selects the "exact phrase" search option, then all documents which contain the exact string, alias the exact succession of characters, will be retrieved. Selecting this option implies that tokenization, erasing of stopwords and stemming won't be performed.
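One way the three search options could map onto SQL conditions is sketched below. This is only an illustration under stated assumptions: the function name build_where, the option names 'all', 'one' and 'exact', the column name description and the use of LIKE matching are all hypothetical, since the actual CENSUS queries and database fields are not described here.

```php
<?php
// Hypothetical helper (not from the CENSUS code base): builds a parameterized
// WHERE clause for one search option. $column is an assumed column name.
// $terms holds the stemmed keywords, or the raw query string for 'exact'.
function build_where(string $option, array $terms, string $column = 'description'): array
{
    if ($option === 'exact') {
        // Exact phrase: the whole query is matched as one string.
        return ["$column LIKE ?", ['%' . implode(' ', $terms) . '%']];
    }
    $conditions = [];
    $params = [];
    foreach ($terms as $term) {
        // One placeholder per keyword; values are bound separately.
        $conditions[] = "$column LIKE ?";
        $params[] = '%' . $term . '%';
    }
    // "all words" joins the conditions with AND, "one word at least" with OR.
    $glue = ($option === 'all') ? ' AND ' : ' OR ';
    return [implode($glue, $conditions), $params];
}

list($sql, $params) = build_where('all', ['miniat', 'titlepag']);
echo $sql, "\n"; // description LIKE ? AND description LIKE ?
```

The returned clause and parameter list are meant to be passed to a prepared statement, so that the keywords are never concatenated into the SQL text. Escaping of the LIKE wildcards % and _ inside keywords is omitted in this sketch.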

A how-to page is available, developed to show an on-the-fly demo of how the CENSUS free keyword query system works. For any other information, please feel free to email me.

CENSUS database

The CENSUS catalogue of Dongba manuscripts is implemented as a MySQL 5.0.82sp1-log database.

The structural scheme of the database, with commented fields (in Italian), is available here as a PDF.