Trying Out Collecting Papers with Python Part 1 (arXiv, Node-RED)
Introduction
In this post, I tried collecting research papers with Python using the arXiv API.
I usually use Google Scholar to search for papers, but since I wanted to automate the process programmatically, I consulted ChatGPT and learned about the arxiv package.
After verifying the sample programs in plain Python, I made them runnable from within Node-RED.
▼Previous articles are here:
Related Information
▼The arXiv site is here; I often browse it when searching for papers. When a paper can be displayed in HTML format, it is convenient to run Google Translate in the browser.
▼The PyPI page for arXiv is here:
https://pypi.org/project/arxiv
▼Detailed instructions on how to use the package are summarized on this page:
https://note.nkmk.me/python-arxiv-api-download-rss
Using the arXiv Package
Setting Up the Environment
The execution environment is a Windows 11 laptop with Python version 3.12.6.
▼I am using a gaming laptop purchased for around 100,000 yen, running Windows 11.
First, create a Python virtual environment and install the package. I ran the following commands:
python -m venv pyenv
cd pyenv
.\Scripts\activate
pip install arxiv
▼For more details on Python virtual environments, please see the following article:
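After installation, an optional quick sanity check confirms the package is visible from the virtual environment (the version shown will vary by environment):
pip show arxiv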
Trying the Sample Program
I tried the sample program found on the PyPI page.
# https://pypi.org/project/arxiv/
import arxiv

# Construct the default API client.
client = arxiv.Client()

# Search for the 10 most recent articles matching the keyword "quantum."
search = arxiv.Search(
    query = "quantum",
    max_results = 10,
    sort_by = arxiv.SortCriterion.SubmittedDate
)

results = client.results(search)

# `results` is a generator; you can iterate over its elements one by one...
for r in client.results(search):
    print(r.title)
# ...or exhaust it into a list. Careful: this is slow for large results sets.
all_results = list(results)
print([r.title for r in all_results])

# For advanced query syntax documentation, see the arXiv API User Manual:
# https://arxiv.org/help/api/user-manual#query_details
search = arxiv.Search(query = "au:del_maestro AND ti:checkerboard")
first_result = next(client.results(search))
print(first_result)

# Search for the paper with ID "1605.08386v1"
search_by_id = arxiv.Search(id_list=["1605.08386v1"])
# Reuse client to fetch the paper, then print its title.
first_result = next(client.results(search_by_id))
print(first_result.title)
▼The output was as follows:

▼Briefly summarized:

The first search returns the 10 most recent papers matching "quantum." The second-to-last search combines fields with AND (author au: and title ti:), and the last one looks up a paper by its ID.
▼Details can be found in the following user manual:
https://info.arxiv.org/help/api/user-manual.html#query_details
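As a minimal sketch of that syntax (the category cs.RO and the quoted phrase here are illustrative choices of mine, not from the samples above), field prefixes such as ti:, abs:, au:, and cat: can be combined with AND:
import arxiv

# Minimal sketch: field-prefixed query per the arXiv API user manual.
# cat: filters by category (cs.RO = Robotics), abs: matches the abstract.
client = arxiv.Client()
search = arxiv.Search(query='cat:cs.RO AND abs:"Unreal Engine"', max_results=5)
for r in client.results(search):
    print(r.title)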
I also tried the following sample program to download PDF files.
# https://pypi.org/project/arxiv/
import arxiv
paper = next(arxiv.Client().results(arxiv.Search(id_list=["1605.08386v1"])))
# Download the PDF to the PWD with a default filename.
paper.download_pdf()
# Download the PDF to the PWD with a custom filename.
paper.download_pdf(filename="downloaded-paper.pdf")
# Download the PDF to a specified directory with a custom filename.
paper.download_pdf(dirpath="./mydir", filename="downloaded-paper.pdf")
Note that although a directory is specified on the last line, a FileNotFoundError occurred when the mydir folder did not exist.
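A simple workaround is to create the folder beforehand; a minimal sketch based on the same sample:
import os
import arxiv

# Create the target folder first so download_pdf() does not raise
# FileNotFoundError when the folder is missing.
os.makedirs("./mydir", exist_ok=True)

paper = next(arxiv.Client().results(arxiv.Search(id_list=["1605.08386v1"])))
paper.download_pdf(dirpath="./mydir", filename="downloaded-paper.pdf")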
▼After execution, three PDF files had been downloaded: one with the default name, one with the specified name, and one in the specified folder.

I also tried a sample program that fetches results with a custom client, but since there were too many results, I modified it to also print a running count.
# https://pypi.org/project/arxiv/
import arxiv

i = 0
big_slow_client = arxiv.Client(
    page_size = 1000,
    delay_seconds = 10.0,
    num_retries = 5
)

# Prints 1000 titles before needing to make another request.
for result in big_slow_client.results(arxiv.Search(query="quantum")):
    print(result.title)
    i += 1
    print(i)
▼Results were displayed 1000 at a time, with a delay between pages.

Sometimes the first batch contained only 100 or 200 items; after that, results arrived 1000 at a time.
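If you only want a sample of a large result stream rather than exhausting it, the generator can be capped; a minimal sketch using itertools.islice:
import itertools
import arxiv

client = arxiv.Client(page_size=1000, delay_seconds=10.0, num_retries=5)
search = arxiv.Search(query="quantum")

# Take only the first 50 results from the potentially huge generator.
for result in itertools.islice(client.results(search), 50):
    print(result.title)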
Using with Node-RED
Executing with the inject node
I used the python-venv node I developed to run Python in Node-RED.
▼I wrote about how its development has progressed at the end of last year.
https://qiita.com/background/items/d2e05e8d85427761a609
I created a flow to execute Python. You can install arxiv using the pip node.
▼The flow is here. The download destination folder is set in the inject node, so please change it.

[{"id":"56d78acd29b819e8","type":"venv","z":"711edbe9464cba2a","venvconfig":"2fd1a793b1bee0bb","name":"Download","code":"import arxiv\nimport os\nimport time\nfrom datetime import datetime\n\n\ndef download_papers(paper_dir, query, wait_time=60, year_limit=2023):\n \"\"\"\n 論文を検索し、PDFをダウンロードします。\n\n Parameters:\n - paper_dir (str): 保存先フォルダ\n - query (str): 検索クエリ(複数のキーワードを空白区切りで指定)\n - wait_time (int): ダウンロード間隔(秒)\n - year_limit (int): ダウンロード対象とする論文の最小年(それ以前の論文はダウンロードしない)\n \"\"\"\n # 保存先ディレクトリを作成\n os.makedirs(paper_dir, exist_ok=True)\n print(f\"Saving papers to: {paper_dir}\")\n\n # arXivクライアントの設定\n client = arxiv.Client(page_size=10)\n\n print(f\"Searching for papers with query: '{query}'...\")\n\n # クエリを空白区切りでAND検索として扱う\n formatted_query = \" AND \".join(query.split())\n\n # クエリを実行して結果を取得\n search = arxiv.Search(\n query=formatted_query,\n sort_by=arxiv.SortCriterion.SubmittedDate\n )\n\n for result in client.results(search):\n try:\n # タイトルと提出日\n print(f\"\\nTitle: {result.title}\")\n # print(f\"Submitted: {result.updated}\")\n # print(f\"Abstract: {result.summary}\")\n\n # 提出日が指定された年よりも前かどうかを確認\n submitted_year = result.updated.year # ここで直接年を取得\n if submitted_year < year_limit:\n print(f\"Skipping {result.title} as it was submitted in {submitted_year}, which is before {year_limit}.\")\n break # 指定年より前の論文が見つかった時点で終了\n\n # ファイル名を適切にフォーマット\n sanitized_title = \"\".join(c if c.isalnum() or c in \" _-\" else \"_\" for c in result.title)\n pdf_filename = f\"{sanitized_title}.pdf\"\n pdf_paper_path = os.path.join(paper_dir, pdf_filename)\n\n # PDFが既に存在する場合はダウンロードしない\n if os.path.exists(pdf_paper_path):\n print(f\"{pdf_filename} already exists. Skipping download.\")\n continue\n\n # PDFダウンロード\n # print(f\"Downloading to: {pdf_paper_path}\")\n result.download_pdf(dirpath=paper_dir, filename=pdf_filename)\n\n time.sleep(wait_time)\n\n except Exception as e:\n print(f\"Failed to process {result.title}: {e}\")\n\n print(\"\\nDone!\")\n\n# Node-REDからの入力\nif __name__ == \"__main__\":\n # 保存先フォルダ\n paper_dir = os.path.join(msg['directory'].replace(\"\\\\\", \"/\"), 'paper')\n\n # 検索ワード\n query = msg['query']\n\n # ダウンロード間隔を設定(デフォルト60秒)\n wait_time = 10 # 固定値\n\n # ダウンロード対象とする最小年(例:2023年以降の論文)\n year_limit = 2020\n\n # 論文のダウンロード\n download_papers(paper_dir, query, wait_time, year_limit)","continuous":true,"x":1280,"y":840,"wires":[["e87751155b621fb6"]]},{"id":"8bd7561f59804d05","type":"inject","z":"711edbe9464cba2a","name":"W","props":[{"p":"directory","v":"W:\\論文","vt":"str"},{"p":"query","v":"ROS Unreal Engine","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","x":1130,"y":840,"wires":[["56d78acd29b819e8"]]},{"id":"e87751155b621fb6","type":"debug","z":"711edbe9464cba2a","name":"debug 355","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","statusVal":"","statusType":"auto","x":1450,"y":840,"wires":[]},{"id":"64af3d1520034da0","type":"inject","z":"711edbe9464cba2a","name":"","props":[],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","x":1130,"y":780,"wires":[["82e599ee0edab80a"]]},{"id":"82e599ee0edab80a","type":"pip","z":"711edbe9464cba2a","venvconfig":"2fd1a793b1bee0bb","name":"","arg":"arxiv","action":"install","tail":false,"x":1270,"y":780,"wires":[["236bfae326175264"]]},{"id":"236bfae326175264","type":"debug","z":"711edbe9464cba2a","name":"debug 
358","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","statusVal":"","statusType":"auto","x":1430,"y":780,"wires":[]},{"id":"2fd1a793b1bee0bb","type":"venv-config","venvname":"paper","version":"3.10"}]The Python code to be executed was written by ChatGPT.
import arxiv
import os
import time
from datetime import datetime


def download_papers(paper_dir, query, wait_time=60, year_limit=2023):
    """
    Search for papers and download their PDFs.

    Parameters:
    - paper_dir (str): destination folder
    - query (str): search query (multiple keywords separated by spaces)
    - wait_time (int): interval between downloads (seconds)
    - year_limit (int): minimum submission year to download (older papers are skipped)
    """
    # Create the destination directory
    os.makedirs(paper_dir, exist_ok=True)
    print(f"Saving papers to: {paper_dir}")

    # Configure the arXiv client
    client = arxiv.Client(page_size=10)

    print(f"Searching for papers with query: '{query}'...")

    # Treat space-separated keywords as an AND search
    formatted_query = " AND ".join(query.split())

    # Run the query and fetch the results
    search = arxiv.Search(
        query=formatted_query,
        sort_by=arxiv.SortCriterion.SubmittedDate
    )

    for result in client.results(search):
        try:
            # Title and submission date
            print(f"\nTitle: {result.title}")
            # print(f"Submitted: {result.updated}")
            # print(f"Abstract: {result.summary}")

            # Check whether the paper was submitted before the specified year
            submitted_year = result.updated.year  # Take the year directly
            if submitted_year < year_limit:
                print(f"Skipping {result.title} as it was submitted in {submitted_year}, which is before {year_limit}.")
                break  # Stop at the first paper older than the limit

            # Build a safe filename from the title
            sanitized_title = "".join(c if c.isalnum() or c in " _-" else "_" for c in result.title)
            pdf_filename = f"{sanitized_title}.pdf"
            pdf_paper_path = os.path.join(paper_dir, pdf_filename)

            # Skip the download if the PDF already exists
            if os.path.exists(pdf_paper_path):
                print(f"{pdf_filename} already exists. Skipping download.")
                continue

            # Download the PDF
            # print(f"Downloading to: {pdf_paper_path}")
            result.download_pdf(dirpath=paper_dir, filename=pdf_filename)

            time.sleep(wait_time)

        except Exception as e:
            print(f"Failed to process {result.title}: {e}")

    print("\nDone!")


# Input from Node-RED
if __name__ == "__main__":
    # Destination folder
    paper_dir = os.path.join(msg['directory'].replace("\\", "/"), 'paper')

    # Search keywords
    query = msg['query']

    # Download interval (seconds)
    wait_time = 10  # Fixed value

    # Minimum submission year to download (e.g., papers from 2020 on)
    year_limit = 2020

    # Download the papers
    download_papers(paper_dir, query, wait_time, year_limit)
In this Python code, msg['directory'] and msg['query'] are not defined anywhere; they work because the script has access to Node-RED's msg object. This useful feature for running Python in Node-RED was implemented by a contributor.
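If you want to test the same script outside Node-RED, where msg does not exist, one option is to stub it in; a minimal sketch (the directory and query values are placeholders):
# Minimal sketch: fall back to a stand-in msg dict when the script runs
# outside Node-RED (where the python-venv node would otherwise inject msg).
try:
    msg
except NameError:
    msg = {"directory": "C:/temp", "query": "quantum"}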
▼It is specified in the inject node as follows:

PDF files are downloaded to the paper folder within the folder specified by msg.directory.
I actually ran it.
▼Paper PDFs were downloaded!

▼Papers that have already been downloaded are skipped.

Combining with the dashboard node
During the New Year holidays, I updated the python-venv node so that it can access not only the msg object but also the flow and global objects.
Using this feature, I added text input nodes from the dashboard package so that the query and the download directory can be entered on a dashboard.
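In the script, those dashboard values are then read back from flow context through the node object; a minimal illustration (the keys match the change nodes in the flow below):
# Read values that the dashboard's change nodes stored in flow context.
query = node['flow']['query']          # set by the Query text input
directory = node['flow']['directory']  # set by the Directory text input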
▼The overall flow is here. The search keyword entered in the dashboard is assigned to the flow object.

[{"id":"60da293526020a2c","type":"venv","z":"711edbe9464cba2a","venvconfig":"2fd1a793b1bee0bb","name":"Download","code":"import arxiv\nimport os\nimport time\nfrom datetime import datetime\n\n\ndef download_papers(paper_dir, query, wait_time=60, year_limit=2023):\n \"\"\"\n 論文を検索し、PDFをダウンロードします。\n\n Parameters:\n - paper_dir (str): 保存先フォルダ\n - query (str): 検索クエリ(複数のキーワードを空白区切りで指定)\n - wait_time (int): ダウンロード間隔(秒)\n - year_limit (int): ダウンロード対象とする論文の最小年(それ以前の論文はダウンロードしない)\n \"\"\"\n # 保存先ディレクトリを作成\n os.makedirs(paper_dir, exist_ok=True)\n print(f\"Saving papers to: {paper_dir}\")\n\n # arXivクライアントの設定\n client = arxiv.Client(page_size=10)\n\n print(f\"Searching for papers with query: '{query}'...\")\n\n # クエリを空白区切りでAND検索として扱う\n formatted_query = \" AND \".join(query.split())\n\n # クエリを実行して結果を取得\n search = arxiv.Search(\n query=formatted_query,\n sort_by=arxiv.SortCriterion.SubmittedDate\n )\n\n for result in client.results(search):\n try:\n # タイトルと提出日\n print(f\"\\nTitle: {result.title}\")\n # print(f\"Submitted: {result.updated}\")\n # print(f\"Abstract: {result.summary}\")\n\n # 提出日が指定された年よりも前かどうかを確認\n submitted_year = result.updated.year # ここで直接年を取得\n if submitted_year < year_limit:\n print(f\"Skipping {result.title} as it was submitted in {submitted_year}, which is before {year_limit}.\")\n break # 指定年より前の論文が見つかった時点で終了\n\n # ファイル名を適切にフォーマット\n sanitized_title = \"\".join(c if c.isalnum() or c in \" _-\" else \"_\" for c in result.title)\n pdf_filename = f\"{sanitized_title}.pdf\"\n pdf_paper_path = os.path.join(paper_dir, pdf_filename)\n\n # PDFが既に存在する場合はダウンロードしない\n if os.path.exists(pdf_paper_path):\n print(f\"{pdf_filename} already exists. Skipping download.\")\n continue\n\n # PDFダウンロード\n # print(f\"Downloading to: {pdf_paper_path}\")\n result.download_pdf(dirpath=paper_dir, filename=pdf_filename)\n\n time.sleep(wait_time)\n\n except Exception as e:\n print(f\"Failed to process {result.title}: {e}\")\n\n print(\"\\nDone!\")\n\n# Node-REDからの入力\nif __name__ == \"__main__\":\n # 保存先フォルダ\n paper_dir = os.path.join(node['flow']['directory'].replace(\"\\\\\", \"/\"), 'paper')\n\n # 検索ワード\n query = node['flow']['query']\n\n # ダウンロード間隔を設定(デフォルト60秒)\n wait_time = 10 # 固定値\n\n # ダウンロード対象とする最小年(例:2023年以降の論文)\n year_limit = 2020\n\n # 論文のダウンロード\n download_papers(paper_dir, query, wait_time, year_limit)","continuous":true,"x":1400,"y":580,"wires":[["991894f0041bde39"]]},{"id":"991894f0041bde39","type":"debug","z":"711edbe9464cba2a","name":"debug 
355","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","statusVal":"","statusType":"auto","x":1570,"y":580,"wires":[]},{"id":"17a39cba18b56f4a","type":"ui_text_input","z":"711edbe9464cba2a","name":"","label":"Query","tooltip":"","group":"3c0113e44b7c7681","order":2,"width":0,"height":0,"passthru":true,"mode":"text","delay":"0","topic":"topic","sendOnBlur":true,"className":"","topicType":"msg","x":1210,"y":500,"wires":[["5e8f6e1392be952c"]]},{"id":"5e8f6e1392be952c","type":"change","z":"711edbe9464cba2a","name":"","rules":[{"t":"set","p":"query","pt":"flow","to":"payload","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":1380,"y":500,"wires":[[]]},{"id":"09dbdd1d30623cf4","type":"ui_text_input","z":"711edbe9464cba2a","name":"","label":"Directory","tooltip":"","group":"3c0113e44b7c7681","order":1,"width":0,"height":0,"passthru":true,"mode":"text","delay":"0","topic":"topic","sendOnBlur":true,"className":"","topicType":"msg","x":1220,"y":540,"wires":[["f89c8b94065c0f92"]]},{"id":"f89c8b94065c0f92","type":"change","z":"711edbe9464cba2a","name":"","rules":[{"t":"set","p":"directory","pt":"flow","to":"payload","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":1410,"y":540,"wires":[[]]},{"id":"09bea2464adcad69","type":"ui_button","z":"711edbe9464cba2a","name":"","group":"3c0113e44b7c7681","order":3,"width":0,"height":0,"passthru":false,"label":"Search","tooltip":"","color":"","bgcolor":"","className":"","icon":"","payload":"","payloadType":"str","topic":"topic","topicType":"msg","x":1220,"y":580,"wires":[["60da293526020a2c"]]},{"id":"a2ca29c9855d8533","type":"change","z":"711edbe9464cba2a","name":"","rules":[{"t":"set","p":"payload","pt":"msg","to":"query","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":1040,"y":500,"wires":[["17a39cba18b56f4a"]]},{"id":"510bb51c560178d8","type":"change","z":"711edbe9464cba2a","name":"","rules":[{"t":"set","p":"payload","pt":"msg","to":"directory","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":1040,"y":540,"wires":[["09dbdd1d30623cf4"]]},{"id":"f2915bc04c18cd63","type":"inject","z":"711edbe9464cba2a","name":"W","props":[{"p":"directory","v":"W:\\論文","vt":"str"},{"p":"query","v":"ROS Unreal Engine","vt":"str"}],"repeat":"","crontab":"","once":true,"onceDelay":0.1,"topic":"","x":870,"y":500,"wires":[["a2ca29c9855d8533","510bb51c560178d8"]]},{"id":"2fd1a793b1bee0bb","type":"venv-config","venvname":"paper","version":"3.10"},{"id":"3c0113e44b7c7681","type":"ui_group","name":"arXiv","tab":"4fbb1075cf58a732","order":1,"disp":true,"width":"6","collapse":false,"className":""},{"id":"4fbb1075cf58a732","type":"ui_tab","name":"Paper","icon":"dashboard","disabled":false,"hidden":false}]The Python code is below. It accesses the flow object via node['flow']['query'] and node['flow']['directory']. I simply changed the parts that previously received the msg object.
import arxiv
import os
import time
from datetime import datetime


def download_papers(paper_dir, query, wait_time=60, year_limit=2023):
    """
    Search for papers and download their PDFs.

    Parameters:
    - paper_dir (str): destination folder
    - query (str): search query (multiple keywords separated by spaces)
    - wait_time (int): interval between downloads (seconds)
    - year_limit (int): minimum submission year to download (older papers are skipped)
    """
    # Create the destination directory
    os.makedirs(paper_dir, exist_ok=True)
    print(f"Saving papers to: {paper_dir}")

    # Configure the arXiv client
    client = arxiv.Client(page_size=10)

    print(f"Searching for papers with query: '{query}'...")

    # Treat space-separated keywords as an AND search
    formatted_query = " AND ".join(query.split())

    # Run the query and fetch the results
    search = arxiv.Search(
        query=formatted_query,
        sort_by=arxiv.SortCriterion.SubmittedDate
    )

    for result in client.results(search):
        try:
            # Title and submission date
            print(f"\nTitle: {result.title}")
            # print(f"Submitted: {result.updated}")
            # print(f"Abstract: {result.summary}")

            # Check whether the paper was submitted before the specified year
            submitted_year = result.updated.year  # Take the year directly
            if submitted_year < year_limit:
                print(f"Skipping {result.title} as it was submitted in {submitted_year}, which is before {year_limit}.")
                break  # Stop at the first paper older than the limit

            # Build a safe filename from the title
            sanitized_title = "".join(c if c.isalnum() or c in " _-" else "_" for c in result.title)
            pdf_filename = f"{sanitized_title}.pdf"
            pdf_paper_path = os.path.join(paper_dir, pdf_filename)

            # Skip the download if the PDF already exists
            if os.path.exists(pdf_paper_path):
                print(f"{pdf_filename} already exists. Skipping download.")
                continue

            # Download the PDF
            # print(f"Downloading to: {pdf_paper_path}")
            result.download_pdf(dirpath=paper_dir, filename=pdf_filename)

            time.sleep(wait_time)

        except Exception as e:
            print(f"Failed to process {result.title}: {e}")

    print("\nDone!")


# Input from Node-RED
if __name__ == "__main__":
    # Destination folder
    paper_dir = os.path.join(node['flow']['directory'].replace("\\", "/"), 'paper')

    # Search keywords
    query = node['flow']['query']

    # Download interval (seconds)
    wait_time = 10  # Fixed value

    # Minimum submission year to download (e.g., papers from 2020 on)
    year_limit = 2020

    # Download the papers
    download_papers(paper_dir, query, wait_time, year_limit)
▼The inject node at the beginning is set to execute once when Node-RED starts; its values are passed to the text input nodes as their initial values.

I actually ran it.
▼The dashboard screen is here. After specifying the Directory and Query, clicking SEARCH started the download.

Even for more detailed search conditions, it seems easy to build an input screen by combining this with the dashboard nodes.
▼By the way, I use Node-RED embedded in Electron. It is displayed as an Electron window rather than in a browser.
In Closing
The reason I was collecting so many papers in the first place is that I wanted to read them for my graduation thesis. Although not covered here, I followed this up with processing that extracts text from the PDF files, translates it, and summarizes it with a local LLM. I plan to write up these building-block technologies bit by bit.
This time I set the query to "ROS" and "Unreal Engine," but there are not many matching papers. Since my research combines the two, I want to keep searching for more prior work.


