
Workshop Outcomes

Welcome to the PatentsView inventor disambiguation technical workshop site. Whether you are a computer scientist, social scientist, innovation policy expert, or anyone in between, you are in the right place.

PatentsView is a USPTO-supported initiative to develop a patent data visualization and analysis platform that will increase the value, utility, and transparency of US patent data. PatentsView has several goals:

  • to encourage the study and enhance understanding of US intellectual property and innovation systems;
  • to serve an administration priority in creating “public good” platforms in government data; and
  • to minimize redundant data cleaning and matching on the part of the user community.

The workshop is part of an effort to find creative new approaches to disambiguating inventor names, and thereby to get better information on innovators and the new technologies they develop. Research teams from the United States, Europe, Australia, and China submitted their inventor disambiguation algorithms and results in the run-up to the workshop.

Over 100 people attended the event in person and online on September 24, 2015. Six research teams presented exciting new computational approaches for identifying unique inventor entities across 40 years of USPTO patent data. 

Nicholas Monath and Andrew McCallum from the University of Massachusetts Amherst authored the winning algorithm, which will now be integrated into the PatentsView data platform. The UMass team will receive a $25,000 stipend, which will support the team’s work on the algorithm and compensate team members for their technical guidance on integration efforts.

The USPTO Deputy Director Russell Slifer opened the workshop.

The USPTO Chief Economist Alan Marco provided an overview of the PatentsView initiative and set the goals for the workshop.

Joseph Bailey from USPTO presented the evaluation approach and outcomes of the workshop.

Participant Information

Algorithm Submission Checklist

  • Tab-delimited file showing the result of disambiguating the rawinventor table. Note that only the rawinventor table provided on the workshop website should be disambiguated (you do not need to disambiguate the applications database).
    The first column of the tab-delimited file should be an inventor ID, constructed as the hyphenated combination of the patent number and sequence fields for each inventor in the rawinventor table. The second column should be an integer ID generated by your program. Inventor IDs that are predicted to refer to the same unique individual should be assigned the same integer ID. For example, in the following excerpt the second and fourth inventors in the list are believed to be the same individual:
    1234567-1	1
    1234567-2	2
    1234567-3	3
    2345678-1	2
    2345678-2	4
    Note that some other sources of patent data may contain an “inventor ID” column which may not agree with the identifier that we are asking you to use. If you use a different inventor ID than the one described here, then we will not be able to evaluate your results properly.
  • Plain text or Word file describing the computing setup used and the run time of the algorithm. The description of the computing setup should include processor speed, number of cores/processors, amount of RAM, and amount of hard disk space. If applicable, describe the GPU or distributed computing setup used.
  • Source code for your disambiguation algorithm, including any preprocessing steps. If the code is available online, then you may submit a link. Otherwise, provide a compressed file.
  • Draft program documentation. (Note that minimal user documentation will be required for participants asked to continue to the second round of evaluation.)
  • Draft write-up of methods.

Next steps

  • We will evaluate your output file against withheld labeled data by computing the precision and recall rates as described in the evaluation documentation.
  • The judges will review the precision and recall rates as well as your description of run time and computing setup. The judges will choose up to three participating groups to advance to the next stage of evaluation by September 7.
  • The teams that advance to the next evaluation stage must be prepared to work in the test environment as described in the evaluation documentation.

How to Participate

We welcome the participation of US and international researchers who can bring creative new technical approaches to this research problem. We invite you to review the workshop and technical parameters below, explore the patent data, and send a brief note of your “intent to participate” to cssip@air.org by July 15, 2015. All reasonable submissions will be accepted for participation. In your note:

  • Describe any external datasets or nontraditional computing environments that you wish to incorporate in your proposed solution.
  • Give a rough description of your development platform, including the operating system, CPU specifications, the amount of available memory, and GPU specifications if applicable.
  • List any non-free software that you will use.

*We understand that the exact specifications of your platform may change.

Remuneration

One $25,000 stipend will be provided to the research team with the algorithm that scores highest on the evaluation.

  • That stipend will support the team’s work on their algorithm.
  • The stipend will additionally support the research team’s technical guidance as AIR staff integrate their algorithm into the PatentsView workflow.

AIR will provide a stipend to cover travel expenses for one representative from up to 15 participating teams. If more than 15 teams submit eligible algorithms, the evaluation results will be used to identify the 15 teams that will be represented at the workshop.

Program Requirements

Participants will write a program that reads in files containing processed patent data from 1976 to 2014 and produces an output file giving predictions for which inventors in the data correspond to the same underlying individual.

Input files

The input files consist of parsed text and XML data from USPTO on published patent grants (1976-2014) and applications (2001-2014). Participants are free to use any portion of this data for their algorithms as they see fit. The data tables are provided both as individual CSV downloads and as a MySQL export file which can be used to populate a new MySQL database.
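
For orientation, the sketch below loads the rawinventor table from its CSV download using Python and pandas and builds the hyphenated inventor IDs used throughout the workshop. The file name and column names here are assumptions; consult the data dictionary that accompanies the bulk downloads for the actual schema.

    import pandas as pd

    # Load the rawinventor table from the bulk CSV download.
    # "rawinventor.csv", "patent_id", and "sequence" are assumed names;
    # check the data dictionary for the actual schema.
    rawinventor = pd.read_csv("rawinventor.csv", dtype=str)

    # Each row is one inventor mention on one patent; the workshop's
    # inventor ID is the hyphenated patent number and sequence.
    rawinventor["inventor_id"] = rawinventor["patent_id"] + "-" + rawinventor["sequence"]
    print(rawinventor["inventor_id"].head())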

Output file format

The output file should be a tab-delimited file with two columns and no header. The first column should be an inventor ID, constructed as the hyphenated combination of the patent number and sequence fields for each inventor in the rawinventor table. The second column should be an integer ID generated by your program. Inventor IDs that are predicted to refer to the same unique individual should be assigned the same integer ID. For example, in the following excerpt the second inventor on the first patent and the first inventor on the second patent are believed to be the same individual:

1234567-1	1
1234567-2	2
1234567-3	3
2345678-1	2
2345678-2	4
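
As an illustration, the sketch below writes predictions in this format, assuming a disambiguation step has produced a Python dict that maps each hyphenated inventor ID to a cluster label; the file name and toy values are hypothetical.

    import csv

    # Hypothetical disambiguation output: each hyphenated inventor ID
    # mapped to a cluster label representing one unique individual.
    predictions = {
        "1234567-1": "c1", "1234567-2": "c2", "1234567-3": "c3",
        "2345678-1": "c2", "2345678-2": "c4",
    }

    # Assign each distinct cluster label an integer ID, then write the
    # two-column tab-delimited file with no header.
    integer_ids = {}
    with open("disambiguation_output.tsv", "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for inventor_id, label in predictions.items():
            writer.writerow([inventor_id, integer_ids.setdefault(label, len(integer_ids) + 1)])

Run on the toy mapping above, this reproduces the five-line excerpt shown earlier.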

Runtime and computing resources

  • The algorithm should not run for more than 5 days when processing all patent application and grant data (2001-2014 for applications; 1976-2014 for grants).
  • The implementation should be runnable on hardware equivalent to a single Amazon Web Services (AWS) instance. For reference, the largest compute-optimized AWS instance currently provides 36 virtual CPUs and 60 GB of memory.
  • AIR and the panel will review any requests for software or hardware updates that might be required to accommodate the incorporation of a novel algorithm into the current PatentsView workflow. These requests must be submitted in your letter of intent to participate.

External data

  • Proprietary datasets cannot be included in any algorithm.
  • AIR and the panel will review any requests to incorporate additional non-proprietary data into a submitted algorithm. Please specify any additional data you intend to use in your letter of intent to participate.

Evaluation

Algorithms will be evaluated for accuracy, run time, and usability. The evaluation will take place in two phases. We briefly describe the evaluation criteria here; refer to the evaluation criteria documentation for further details and definitions.

First Phase

This is an initial round of self-testing where participants will infer links for the bulk patents database. They may train their algorithms using any part of the provided data, as well as any additional data sets that have been submitted to the workshop organizers. During this round, participants will be evaluated on the following criteria:

  • Recall rate, defined as $$\text{Recall} = \frac{\text{\# of true positives}}{\text{\# of true positives} + \text{\# of false negatives}}$$
  • Precision rate, defined as $$\text{Precision} = \frac{\text{\# of true positives}}{\text{\# of true positives} + \text{\# of false positives}}$$ (a sketch of the pairwise computation behind these counts follows this list)
  • Self-reported algorithm run time
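
For a disambiguation, the true/false positive and negative counts are typically taken over pairs of inventor mentions: a pair is a true positive when both the algorithm and the labeled data place its two mentions in the same cluster. The following is a minimal sketch of that pairwise computation; the function and variable names are ours, and the authoritative definitions are in the evaluation documentation.

    from itertools import combinations

    def pairwise_precision_recall(predicted, gold):
        """Pairwise precision and recall for two clusterings.

        Both arguments map a mention ID to a cluster label; a pair of
        mentions counts as "positive" when its two members share a label.
        """
        mentions = sorted(set(predicted) & set(gold))
        tp = fp = fn = 0
        for a, b in combinations(mentions, 2):
            pred_same = predicted[a] == predicted[b]
            gold_same = gold[a] == gold[b]
            tp += pred_same and gold_same
            fp += pred_same and not gold_same
            fn += gold_same and not pred_same
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        recall = tp / (tp + fn) if (tp + fn) else 1.0
        return precision, recall

    # Toy check: over-merging (lumping) lowers precision but not recall.
    pred = {"a": 1, "b": 1, "c": 1}
    gold = {"a": 1, "b": 1, "c": 2}
    print(pairwise_precision_recall(pred, gold))  # (0.333..., 1.0)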

Second Phase

Up to three workshop participants will be invited to a second phase of evaluation, where they will be asked to run their disambiguation algorithm in a server environment that we provide. During this phase, participants will be evaluated on:

  • Algorithm generalizability, which we will assess by computing the recall rate and precision rate for different sets of training and evaluation data.
  • Self-reported algorithm run time
  • Usability of the implementation. We will ask participants in this phase of evaluation to provide user documentation for their algorithm implementation.

Deadlines

  • May 13: Workshop informational website is open to the public. PatentsView data is available for exploration from the bulk download page. AIR staff are available to respond to inquiries (cssip@air.org).
  • First week of June: Pre-workshop activities officially kick off. Training datasets will be posted on the informational website.
  • July 15: Deadline for prospective participants to submit a 1-page “intent to participate” document, including any requests to incorporate additional data, software, or hardware requirements. All teams with proposals deemed reasonable by the judges’ panel will be invited to participate.
  • August 30: All algorithms must be submitted to cssip@air.org. Submissions must include testing results (lumping errors, splitting errors, run time, and any other relevant observations) obtained with the PatentsView-provided test dataset.
  • September 24: Final technical workshop at USPTO headquarters in Alexandria, VA. Attendees will include leaders from relevant federal agencies, the science and innovation research communities, and the press. Research teams will present their approaches, and AIR will present the results of the algorithm evaluations.

Training Datasets

Four patent datasets are provided to workshop participants for training their algorithms. Each is a human-labeled research dataset with validated inventor identities, previously developed and curated for research purposes. These four research datasets were generously provided by Erica Fuchs and colleagues, Ivan Png and colleagues, Pierre Azoulay and colleagues, and Manuel Trajtenberg and colleagues. AIR is providing the four research datasets in multiple formats, as described below:

The original optoelectronic human-labeled dataset and full documentation can be accessed at http://www.cmu.edu/epp/disambiguation

  • als_training_data.csv (Azoulay et al., 2010): A training data set with 15,000 records. The data is a bootstrap sample of record comparisons based on a labeled data set of approximately 5,000 researchers in the academic life sciences and their US patents.
  • is_inventors.csv (Trajtenberg et al., 2008): Original dataset of all Israeli inventors and their US patents.
  • ens_inventors.csv and ens_patents.csv (Chunmian et al., forthcoming): Original dataset of engineers and scientists and their patents.
  • benchmark_epfl.rar (Lissoni et al., 2010): An archive containing a database of labeled inventors affiliated with the Ecole Polytechnique Federale de Lausanne.
  • benchmark_france.rar (Lissoni et al., 2010): An archive containing a database of labeled inventors from French universities.
  • td_patent.csv (Chunmian et al., forthcoming; Trajtenberg et al., 2008): Patent fields from the processed bulk patent data for patents matched to Trajtenberg or Png (see methods).
  • td_inventor.csv (Chunmian et al., forthcoming; Trajtenberg et al., 2008): Inventor fields from the processed bulk patent data for all inventors on patents matched to Trajtenberg or Png (see methods).
  • td_assignee.csv (Chunmian et al., forthcoming; Trajtenberg et al., 2008): Assignee fields from the processed bulk patent data for all assignees on patents matched to Trajtenberg or Png (see methods).
  • td_class.csv (Chunmian et al., forthcoming; Trajtenberg et al., 2008): USPC classes from the processed bulk patent data for patents matched to Trajtenberg or Png (see methods).
  • epo_patent.csv (EPO Worldwide Patent Statistical Database, PATSTAT): Additional patent fields from PATSTAT for patents that appear in benchmark_epfl.rar or benchmark_france.rar.
  • epo_inventor.csv (EPO Worldwide Patent Statistical Database, PATSTAT): Additional patent fields from PATSTAT for inventors on patents that appear in benchmark_epfl.rar or benchmark_france.rar.
  • epo_cpc.csv (EPO Worldwide Patent Statistical Database, PATSTAT): CPC classifications from PATSTAT for patents that appear in benchmark_epfl.rar or benchmark_france.rar.

Intellectual Property

PatentsView data and the underlying codebase are open to the public and available on the main website and our GitHub repository under the BSD-2 open source license (https://github.com/CSSIP-AIR/PatentsProcessor). All submitted algorithms and related information will be subject to the same open-source license and will be made available to the public on the PatentsView GitHub repository.

Teams

There is no limit to the size or composition of participating researcher teams; however, government employees are not eligible to participate. We will only provide travel expenses for one representative from the top teams (up to 15 teams) to present their technology to the workshop attendees.

Governance

The American Institutes for Research (AIR) is hosting the workshop with the support of a judges’ panel of three subject matter experts.

AIR and the judges’ panel will review all “intent to participate” submissions and will oversee the evaluation of submitted algorithms in advance of the technical workshop in September.

Eligibility

Workshop participation is open to US and foreign nationals in academia or the private sector.

US government employees are not eligible to participate.

Location

The final workshop will be held at USPTO headquarters in Alexandria, VA.
