The dblp XML format is modeled after the BibTeX *.bib file format. The format is defined in the DTD file in the same directory. Please understand that (by design) our DTD is not very strict: it places no restrictions on element order or multiplicity, and it even allows nonsensical combinations of child elements (e.g., school tags in article elements, or editor and author elements at the same time) that you will never find in the actual dblp data set. Our priority was to keep the definition clean and simple, not to model every aspect of the publication landscape.
More information on the XML structure of the dblp records and several design decisions can be found in the following paper:
In general, our XML is a shallow but very long list of XML records. The root element has several million child elements, but usually no element is deeper than level three. An excerpt of the XML file looks like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
[...]
<article key="journals/cacm/Gentry10" mdate="2010-04-26">
  <author>Craig Gentry</author>
  <title>Computing arbitrary functions of encrypted data.</title>
  <pages>97-105</pages>
  <year>2010</year>
  <volume>53</volume>
  <journal>Commun. ACM</journal>
  <number>3</number>
  <ee>http://doi.acm.org/10.1145/1666420.1666444</ee>
  <url>db/journals/cacm/cacm53.html#Gentry10</url>
</article>
[...]
<inproceedings key="conf/focs/Yao82a" mdate="2011-10-19">
  <title>Theory and Applications of Trapdoor Functions (Extended Abstract)</title>
  <author>Andrew Chi-Chih Yao</author>
  <pages>80-91</pages>
  <crossref>conf/focs/FOCS23</crossref>
  <year>1982</year>
  <booktitle>FOCS</booktitle>
  <url>db/conf/focs/focs82.html#Yao82a</url>
  <ee>http://doi.ieeecomputersociety.org/10.1109/SFCS.1982.45</ee>
</inproceedings>
[...]
<www mdate="2004-03-23" key="homepages/g/OdedGoldreich">
  <author>Oded Goldreich</author>
  <title>Home Page</title>
  <url>http://www.wisdom.weizmann.ac.il/~oded/</url>
</www>
[...]
</dblp>
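Because the file is a very long but shallow list, a streaming parser is usually a better fit than loading the whole tree. The following is a minimal sketch in Python, not an official dblp tool. The function name `iter_records` and the shortened `SAMPLE` data are illustrative assumptions. The sample deliberately omits the DOCTYPE line and contains no named entities; the standard-library parser does not load an external DTD, so the real file needs extra handling for the entities defined there.

```python
import io
import xml.etree.ElementTree as ET

# Shortened sample modeled on the excerpt above (no DOCTYPE, no named entities).
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<dblp>
<article key="journals/cacm/Gentry10" mdate="2010-04-26">
  <author>Craig Gentry</author>
  <title>Computing arbitrary functions of encrypted data.</title>
  <year>2010</year>
</article>
<www mdate="2004-03-23" key="homepages/g/OdedGoldreich">
  <author>Oded Goldreich</author>
  <title>Home Page</title>
</www>
</dblp>
"""


def iter_records(source):
    """Yield each child of the root element (a level-1 record) one at a time.

    The caller should finish using a record before requesting the next one,
    because the root is cleared afterwards to keep memory usage flat.
    """
    depth = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if depth == 0:
                root = elem  # remember the <dblp> root element
            depth += 1
        else:
            depth -= 1
            if depth == 1:  # a direct child of the root has just been closed
                yield elem
                root.clear()  # drop already processed records


if __name__ == "__main__":
    for record in iter_records(io.BytesIO(SAMPLE)):
        print(record.tag, record.get("key"), record.findtext("title"))
```

Tracking the depth explicitly avoids hard-coding the list of record element names, which matters because the DTD is intentionally loose.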
Level 1: data records
The children of the root element represent the individual data records that are stored in dblp. In general, there are two types of records: publication records and person records.
Publication records are inspired by the BibTeX syntax and are given by one of the following elements:
- article – An article from a journal or magazine.
- inproceedings – A paper in a conference or workshop proceedings.
- proceedings – The proceedings volume of a conference or workshop.
- book – An authored monograph or an edited collection of articles.
- incollection – A part or chapter in a monograph.
- phdthesis – A PhD thesis.
- mastersthesis – A Master's thesis. There are only very few Master's theses in dblp.
- www – A web page. There are only very few web pages in dblp. See also the notes on person records.
Person records are described separately here.
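As a rough illustration of the two record types, the sketch below classifies a record by its element name and key. It rests on an assumption that is not stated in this section: that person records are www elements whose key starts with "homepages/", as the Oded Goldreich record in the excerpt suggests. The function name `record_kind` is hypothetical.

```python
# Element names listed above for publication records.
PUBLICATION_TAGS = {
    "article", "inproceedings", "proceedings", "book",
    "incollection", "phdthesis", "mastersthesis", "www",
}


def record_kind(tag, key):
    """Return "person", "publication" or "unknown" for a level-1 record.

    Assumption: a www element with a key under "homepages/" is a person record.
    """
    if tag == "www" and key.startswith("homepages/"):
        return "person"
    if tag in PUBLICATION_TAGS:
        return "publication"
    return "unknown"


if __name__ == "__main__":
    print(record_kind("article", "journals/cacm/Gentry10"))
    print(record_kind("www", "homepages/g/OdedGoldreich"))
```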
All records share a number of common attributes:
- key – The unique dblp key of this record.
- mdate – The date this record has been last modified.
- publtype – An optional attribute that further specifies the type of record.
The values of the publtype attribute are from a controlled vocabulary. Multiple publtypes can be provided as a space-separated list. In the near future, we will replace some of the current publtype values to simplify parsing. The following table lists the publtypes in use for records. The scope column indicates whether the publtype is used for publication records or person records. Note that the annotation of records is partial. For example, only a small number of edited publications are annotated as edited.
| scope | current value | future value | description |
|---|---|---|---|
| publication | encyclopedia entry | encyclopedia | Publication is an entry in a reference work, e.g., an encyclopedia article. |
| publication | informal publication | informal | Publication is gray literature, e.g., a preprint. |
| publication | edited publication | edited | Edited publication, e.g., an editorial or a news announcement. |
| publication | survey | survey | Publication is a survey article. |
| publication | withdrawn | withdrawn | Publication was officially withdrawn by the publisher. |
| person | disambiguation page | disambiguation | The author profile associated with this person record does not represent a single author. See "Why are some names followed by a four digit number" for details. |
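Some current values contain a space themselves, so a space-separated list of them cannot be split naively. The sketch below shows one possible way to normalize a publtype attribute to the single-word future values. The mapping is taken from the table above. The function name `parse_publtype` and the longest-match strategy are assumptions, not part of dblp.

```python
# Mapping from current publtype values to their announced future values.
CURRENT_TO_FUTURE = {
    "encyclopedia entry": "encyclopedia",
    "informal publication": "informal",
    "edited publication": "edited",
    "survey": "survey",
    "withdrawn": "withdrawn",
    "disambiguation page": "disambiguation",
}
FUTURE_VALUES = set(CURRENT_TO_FUTURE.values())


def parse_publtype(value):
    """Split a publtype attribute into a list of future-style values."""
    tokens = value.split()
    result = []
    i = 0
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2])
        if pair in CURRENT_TO_FUTURE:  # try two-word current values first
            result.append(CURRENT_TO_FUTURE[pair])
            i += 2
        elif tokens[i] in CURRENT_TO_FUTURE:  # one-word current values
            result.append(CURRENT_TO_FUTURE[tokens[i]])
            i += 1
        elif tokens[i] in FUTURE_VALUES:  # already in the future form
            result.append(tokens[i])
            i += 1
        else:
            raise ValueError(f"unknown publtype token: {tokens[i]!r}")
    return result


if __name__ == "__main__":
    print(parse_publtype("informal publication withdrawn"))
```

Accepting both the current and the future spellings keeps such a parser usable across the announced vocabulary change.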
Level 2: bibliographic metadata
Record elements do not contain any text, but they contain a number of child elements to specify the record's bibliographic metadata entries. See the Wikipedia page on BibTeX to learn which data entries are meaningful in which record type.
Note that, in contrast to BibTeX, there are no key elements, since the key is already an attribute of the record node. Also, there is a custom url element to specify a local hyperlink relative to the dblp website's homepage.
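A relative url value can be turned into an absolute link by joining it with the homepage address. The sketch below assumes "https://dblp.org/" as the homepage, which is not stated in this section; the constant `DBLP_HOME` and the function `absolute_url` are hypothetical names.

```python
from urllib.parse import urljoin

# Assumption: the dblp homepage that local url values are relative to.
DBLP_HOME = "https://dblp.org/"


def absolute_url(url_value, base=DBLP_HOME):
    """Resolve a url element value against the homepage.

    Values that are already absolute (as in the www record of the excerpt)
    are returned unchanged by urljoin.
    """
    return urljoin(base, url_value)


if __name__ == "__main__":
    print(absolute_url("db/journals/cacm/cacm53.html#Gentry10"))
    print(absolute_url("http://www.wisdom.weizmann.ac.il/~oded/"))
```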
Most record elements can have one or more of the following optional attributes:
- type – A fine-grained description of the element content. The value is from a controlled vocabulary.
- label – Similar to type, but contains free-text descriptive information (not from a controlled vocabulary). However, there are content guidelines for some situations.
- aux – A reference to an auxiliary record. Auxiliary records contain additional information for the record and its data elements. The information is not listed in the primary record because it is too large, is experimental, or cannot be provided under the ODC-BY licence.
A detailed description of record elements can be found at How are data annotations used in dblp.xml.
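The optional attributes can be read like any other XML attribute. The sketch below collects them per child element. The record and the attribute values in `SAMPLE` are made-up placeholders for illustration only and do not reflect the actual controlled vocabulary; the function name `annotations` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: key, URL and attribute values are placeholders.
SAMPLE = """<article key="journals/example/Doe00" mdate="2000-01-01">
  <author>Jane Doe</author>
  <title>An example title.</title>
  <ee type="example-type">https://example.org/paper</ee>
  <note label="example label">free text</note>
</article>"""

ANNOTATION_ATTRS = ("type", "label", "aux")


def annotations(record):
    """Return (tag, {attribute: value}) for each annotated child element."""
    found = []
    for child in record:
        attrs = {
            name: child.get(name)
            for name in ANNOTATION_ATTRS
            if child.get(name) is not None
        }
        if attrs:
            found.append((child.tag, attrs))
    return found


if __name__ == "__main__":
    print(annotations(ET.fromstring(SAMPLE)))
```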
Level 3: optional HTML markup
In the XML file, only title and booktitle elements may contain HTML markup, and only a few selected markup elements are allowed:
- ref – a pseudo-HTML markup to denote local hyperlinks within the dblp website (relative to the dblp website's homepage); requires the attribute href
- sup – superscript text
- sub – subscript text
- i – italics
- tt – monospace
In theory, the elements of this level may be nested arbitrarily deep to describe complex structures like formulas, e.g.,
<i>x<sub>y<sup>2</sup></sub></i> to describe x with the subscript y². However, such cases are very rare.
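When only the text of a title is needed, the nested markup can be flattened. The sketch below shows two options on a made-up title: dropping the markup entirely, and rendering sub and sup in a TeX-like notation. Both function names and the sample title are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Made-up title that uses the nested markup discussed above.
SAMPLE_TITLE = "<title>On <i>x<sub>y<sup>2</sup></sub></i> bounds.</title>"


def plain_text(elem):
    """Concatenate all text inside the element, ignoring the markup."""
    return "".join(elem.itertext())


def tex_like(elem):
    """Render sub/sup as _{...} and ^{...}; other markup is dropped."""
    parts = [elem.text or ""]
    for child in elem:
        inner = tex_like(child)
        if child.tag == "sub":
            parts.append("_{" + inner + "}")
        elif child.tag == "sup":
            parts.append("^{" + inner + "}")
        else:
            parts.append(inner)
        parts.append(child.tail or "")  # text following the child element
    return "".join(parts)


if __name__ == "__main__":
    title = ET.fromstring(SAMPLE_TITLE)
    print(plain_text(title))
    print(tex_like(title))
```

The plain variant loses the distinction between "xy2" and the intended formula, which is why a structure-preserving rendering can be preferable for titles with formulas.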
The dblp XML file is encoded in plain ASCII. Additional ISO/IEC 8859-1 (latin-1) characters are defined as named entities in the DTD and used whenever necessary.
At the moment, most parts of dblp are restricted to ISO-8859-1 (latin-1) characters, i.e., the first 256 Unicode code points (U+0000 to U+00FF). With the exception of the author and editor elements, where you will still find only latin-1 characters, you may find numerical entities outside of this range. For example, title elements may contain Greek letters like ε, and the note elements of a person record may contain a Chinese name in the original Unicode spelling. All characters outside the latin-1 range are given as numerical entities.
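The following sketch illustrates what the entities stand for when working on raw text fragments rather than with a DTD-aware XML parser. It uses Python's `html.unescape`, which decodes numerical entities and HTML entity names. That the named entities in the dblp DTD match the HTML names (such as `&uuml;`) is an assumption here, and the helper names are hypothetical. Note that `html.unescape` also decodes `&amp;` and `&lt;`, so it is not a substitute for proper XML parsing.

```python
import html


def decode_entities(raw):
    """Decode named and numerical entities in a raw text fragment."""
    return html.unescape(raw)


def is_latin1(text):
    """Check that every character lies in the range U+0000 to U+00FF."""
    return all(ord(ch) <= 0xFF for ch in text)


if __name__ == "__main__":
    name = decode_entities("J&uuml;rgen")   # named latin-1 entity
    title = decode_entities("&#949;-nets")  # numerical entity for a Greek letter
    print(name, is_latin1(name))
    print(title, is_latin1(title))
```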