IBM recently announced the donation to PubChem of more than 2.4 million chemical structures extracted from the patent literature and biomedical journals. According to Marc Nicklaus of NIH:
... Non-U.S. patents are included as the source of structures in this data donation. This information is not directly part of the donated file itself, though. There is a link for each record that points back to an IBM web page that provides some additional information (apparently for free) of the type, "PMIDs and patent numbers found for documents containing IBM Structure ID=0015AFBF08D8F183C1F8E32A430CFFEB." What one finds there in this case is simply: EP0244956A1 ...presumably the European patent in which this compound appeared.
BTW, these data were donated to both PubChem and us (NCI CADD Group). We're currently processing the file and will incorporate the structures into our services on http://cactus.nci.nih.gov.
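To make the patent reference in the quote a little more concrete, here is a minimal sketch that splits a publication number such as EP0244956A1 into its parts. The layout it assumes (a two-letter office code, a run of digits, and an optional kind code) is my own simplification and not something described by IBM or PubChem. The function name `split_patent_number` is hypothetical.

```python
import re

# Assumed layout (a simplification): two-letter office code, serial digits,
# then an optional kind code made of one letter and an optional digit.
PATENT_RE = re.compile(r"^([A-Z]{2})(\d+)([A-Z]\d?)?$")


def split_patent_number(number):
    """Split a publication number into office code, serial, and kind code.

    Returns None when the string does not fit the assumed layout.
    """
    match = PATENT_RE.match(number.strip().upper())
    if match is None:
        return None
    country, serial, kind = match.groups()
    return {"country": country, "serial": serial, "kind": kind}


# The example identifier quoted above.
print(split_patent_number("EP0244956A1"))
```

The "EP" prefix is the office code used for European Patent Office publications, which fits the guess in the quote that this is a European patent.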
The donation resulted from research performed using IBM's Strategic IP Insight Platform (SIIP). Last year, Stephen Boyer discussed technical aspects of the patent mining work as it applies to cheminformatics (below).
IBM's donation should be viewed in the context of related recent events, including the release by GlaxoSmithKline and Novartis of malaria screening data for over 300,000 structures.
Are data releases like these by large companies merely a fad or the start of something big? Only time will tell. But given the ongoing pain and renewed drive to innovate in the pharmaceutical industry, I wouldn't be surprised to see multiple announcements along the same lines in the coming year.