
Deduplicate dblp publications and link to DOI/arXiv #8

Draft
Copilot wants to merge 4 commits into master from copilot/deduplicate-dblp-publications

Contributor

Copilot AI commented Feb 5, 2026

Plan to fix dblp publication deduplication and linking:

  • Understand current implementation and identify duplicates
  • Update fetch_dblp.py to extract DOI/arXiv links from ee field
  • Implement deduplication logic that prefers published versions over preprints
  • Test the script with sample data (92 → 76 publications, 16 duplicates removed)
  • Run the script to regenerate publications
  • Address code review feedback (move imports, use constants, reduce duplication)
  • Run CodeQL security check (no issues found)
  • Complete final verification
  • Address PR review comments:
    • Made ee extraction more concise using next() with generator expression
    • Refactored deduplication into separate function with single-loop approach
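The `ee` extraction step mentioned in the plan could look roughly like the sketch below. It is not the code from this PR: `pick_link`, the two prefix constants, and the assumption that `ee` arrives as either a single string or a list of strings are all hypothetical, chosen only to illustrate the `next()` with generator expression pattern.

```python
# Hypothetical constants; the real script may use different names or prefixes.
DOI_PREFIX = "https://doi.org/"
ARXIV_PREFIX = "https://arxiv.org/"


def pick_link(ee, fallback):
    """Return the preferred link for a publication.

    Assumes `ee` is None, a single URL string, or a list of URL strings.
    Preference order: DOI, then arXiv, then the first `ee` entry, then `fallback`.
    """
    candidates = [ee] if isinstance(ee, str) else list(ee or [])
    for prefix in (DOI_PREFIX, ARXIV_PREFIX):
        # next() returns the first matching URL, or None if nothing matches.
        url = next((u for u in candidates if u.startswith(prefix)), None)
        if url:
            return url
    # No DOI or arXiv link: use any electronic edition, else the fallback URL.
    return next(iter(candidates), fallback)
```

Passing the dblp record URL as `fallback` keeps every publication linked even when `ee` is missing.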

Summary

Successfully implemented deduplication and improved linking for dblp publications:

Statistics:

  • Reduced publications from 92 to 76 (removed 16 duplicates)
  • 92% of publications now link to DOI/arXiv instead of dblp pages
  • 0 dblp-only links remaining

Recent improvements:

  • More concise URL extraction using next() with generator expression
  • Deduplication logic extracted into deduplicate_publications() function
  • Single-loop approach using dictionary for better performance
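A single-loop, dictionary-based deduplication of the kind described above might be sketched as follows. This is an illustration under stated assumptions, not the PR's actual `deduplicate_publications()`: it assumes each publication is a dict with `title`, `venue`, and `year` keys, that duplicates share a title up to case and punctuation, and that preprints carry the venue `CoRR`. The helper names `_title_key` and `_rank` are hypothetical.

```python
import re


def _title_key(title):
    # Normalize case and punctuation so "A Paper." and "a paper" compare equal.
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def _rank(pub):
    # Published versions outrank preprints; among equals, the newer year wins.
    # Assumption: preprints are marked with venue "CoRR".
    is_published = pub.get("venue") != "CoRR"
    return (is_published, int(pub.get("year", 0)))


def deduplicate_publications(publications):
    """Keep one entry per normalized title in a single pass."""
    best = {}
    for pub in publications:
        key = _title_key(pub["title"])
        current = best.get(key)
        if current is None or _rank(pub) > _rank(current):
            best[key] = pub
    # Dict order follows the first appearance of each title.
    return list(best.values())
```

Keying on the normalized title makes each lookup constant time, so the whole pass is linear in the number of publications. The trade-off is that distinct papers with identical titles would be merged.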

Original prompt

This section details the original issue you should resolve

<issue_title>dblp publications are not deduplicated</issue_title>
<issue_description>Describe the bug
Currently, the dblp script fetches all publications. However, a publication can appear multiple times, for example as a preprint and as a published version. In that case, we only want to show it once.

Additionally, each publication links only to its dblp page. A link to the DOI, ACM, arXiv, or a similar source would be better.

To Reproduce
Check the website under /publications.

Expected behavior
The publications are deduplicated in a suitable way. Generally, newer releases of the same paper should be preferred, but I am uncertain if this is the only way to do it or if other properties should have an effect.

The publications should also link to a page where the PDF is immediately accessible.

Additional context
Refetch the publications using the existing script.
</issue_description>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 2 commits February 5, 2026 15:03
Co-authored-by: JakobHoffmann <23260378+JakobHoffmann@users.noreply.github.com>
… duplication

Co-authored-by: JakobHoffmann <23260378+JakobHoffmann@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix duplication in dblp publications fetching Deduplicate dblp publications and link to DOI/arXiv Feb 5, 2026
Copilot AI requested a review from JakobHoffmann February 5, 2026 15:09
…unction

Co-authored-by: JakobHoffmann <23260378+JakobHoffmann@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

dblp publications are not deduplicated
