Setup guide for web-based scanner to paperless-ngx pipeline

Scanning documents to paperless-ngx can be a tedious process depending on your scanner software, hardware, and consume folder setup.

This guide sets up dockerized open-source web-based scanning software (scanservjs) and a scan button polling software (insaned) to make scanning a breeze.

The pipeline includes auto-crop, auto-rotate, and deskewing before the PDF file arrives for consumption and OCR by paperless.

Press the scan button for simple one-off scans, or use the scanservjs web UI for more complex multipage scans. Documents will automatically be consumed by paperless-ngx and tagged with scanserv.

Basics

This guide assumes:

  • You already have a paperless-ngx instance running on your local network
    • This guide will use an NFS share to link the paperless consume folder to scanservjs as a docker volume.
    • If your paperless instance is not local, you can instead upload documents through the paperless API from a scanservjs pipeline, but in my opinion that approach is less reliable when the API or the pipeline fails.
  • Your scanner is SANE compatible. See http://www.sane-project.org/sane-supported-devices.html for supported devices, or simply run scanimage -L on the host to test.
  • Your scanner is either connected to the Network or is connected via USB to a host on the Network (which will utilize the SANE over Network feature)
  • If your scanner is connected via USB to a host on the Network, a Linux-based host is assumed.
    • This guide will be specific to Fedora, but provides references for Debian/Ubuntu and Arch linux setups.
    • This guide does NOT cover USB passthrough to a Linux VM
  • Docker is installed and ready for use on the server that will host scanservjs and insaned.
    • This guide does NOT cover HTTP proxy setup

Definitions

  • SANE: "Scanner Access Now Easy" is an application programming interface (API) that provides standardized access to any raster image scanner hardware. The standardized interface makes it possible to write just one driver for each scanner device instead of one driver for each scanner and application.
  • Paperless-ngx: Paperless-ngx is a community-supported open-source document management system that transforms your physical documents into a searchable online archive so you can keep, well, less paper.
  • Scanservjs: scanservjs is a web UI frontend for your scanner. It allows you to share one or more scanners (using SANE) on a network without the need for drivers or complicated installation.
  • Insaned: Insaned is a simple linux daemon for polling button presses on SANE-managed scanners.
  • Docker: Docker is a tool that is used to automate the deployment of applications in lightweight containers so that applications can work efficiently in different environments in isolation.

References

Much of the knowledge in this guide comes from well documented resources:

Non Fedora Distros

If you are running a different distro than Fedora, please check the above references for further instructions.

Packages installed with dnf should be available on other package managers such as apt, albeit under different names.

Configuration paths referenced in this guide may be different on other distros.

Use your google-fu wisely.

Personal setup

This guide is not well-tested, but it should get you close. My personal working setup is as follows:

  • ScanSnap S1300i connected via USB to a Fedora 42 Workstation host
  • Dockerized Scanservjs and Insaned running on a local generic server
    • also runs nginx, pihole, etc
  • Dockerized Paperless-ngx running on a local NAS server

For me, XSane, NAPS2, and Simple Scan were not feature-complete for autoheight, autocrop, autorotate, deskew, button support, and streamlined output to paperless-ngx.

Scanner setup

Dependencies

SANE software and utilities such as scanimage can be found in the sane-backends package for dnf.

sudo dnf update && sudo dnf install sane-backends

Network Scanners

Simply check if your scanner is available on the network via:

scanimage -L

This is untested. Network Scanner addresses may need to be added to /etc/sane.d/net.conf, /etc/sane.d/airscan.conf, or /etc/sane.d/pixma.conf. Please see above references for more guidance.

Skip to next section if output is nominal.

USB Scanners

USB Scanners will require utilizing the SANE over Network feature, where the host exposes the scanner to a network subnet.

( There is another option that uses privileged mode and shares dbus paths with a local containerized Scanservjs instance, but it tends to be unreliable and does not work when Scanservjs is hosted on a different machine than the one the scanner is connected to. For the purposes of this guide, we will choose the more reliable solution: SANE over Network. )

Configure and run saned.socket service

  • Increase MaxConnections to 64 in the saned.socket settings. The default of 1 blocks additional connections and interferes with requests from both the button press and the Scanservjs web UI.

    sudo systemctl edit --full saned.socket

  • Enable and start the saned.socket service:

    sudo systemctl enable saned.socket
    sudo systemctl start saned.socket
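
If you would rather not edit the full unit file, a systemd drop-in is another way to set the same value. The following is a minimal sketch, not something this guide has tested end to end: write_saned_dropin is a hypothetical helper name, and the default directory assumes the standard systemd drop-in location for saned.socket.

```shell
# Sketch: write a systemd drop-in that raises MaxConnections for saned.socket.
# The target directory is a parameter so the function can be tried against a
# scratch directory before touching /etc. (Helper name is hypothetical.)
write_saned_dropin() {
    dropin_dir="${1:-/etc/systemd/system/saned.socket.d}"
    max_conn="${2:-64}"
    mkdir -p "$dropin_dir" || return 1
    # A drop-in only needs the section header and the overridden key.
    printf '[Socket]\nMaxConnections=%s\n' "$max_conn" > "$dropin_dir/override.conf"
}

# Typical use on the scanner host (as root), followed by a reload:
#   write_saned_dropin
#   systemctl daemon-reload && systemctl restart saned.socket
```

A drop-in keeps the packaged unit file untouched, so only the one overridden setting has to be maintained.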

Give saned permissions to your scanner

  • Run lsusb and take note of the vendor:product numbers.

    lsusb

    For example in Bus 005 Device 013: ID 04c5:128d Fujitsu, Ltd ScanSnap S1300i, the Vendor ID is 04c5 and the Product ID is 128d

  • Add device permissions. Edit the following file:

    /usr/lib/udev/rules.d/65-sane-backends.rules

    and append the following line, replacing vendorID and productID with the identifiers from the previous step:

    ATTRS{idVendor}=="vendorID", ATTRS{idProduct}=="productID", MODE="0664", GROUP="lp", ENV{libsane_matched}="yes"

    ( The configuration file /usr/lib/udev/rules.d/65-sane-backends.rules may be named /usr/lib/udev/rules.d/65-sane.rules in non-Fedora distros. )

  • Unplug and replug in your scanner for permissions to take effect
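
To avoid copy mistakes with the IDs, the rule line can be generated from the lsusb output. This is a small sketch under the assumption that the lsusb line has the usual "ID vendor:product" form; udev_rule_from_lsusb is a hypothetical helper name.

```shell
# Sketch: turn one line of `lsusb` output into the udev rule shown above.
# (Helper name is hypothetical; the rule text is the one from this guide.)
udev_rule_from_lsusb() {
    # Extract the "vendor:product" pair that follows " ID ".
    ids="$(printf '%s\n' "$1" | sed -n 's/.* ID \([0-9a-fA-F]\{4\}\):\([0-9a-fA-F]\{4\}\).*/\1 \2/p')"
    # Fail if the line did not contain a recognizable ID pair.
    [ -n "$ids" ] || return 1
    vendor="${ids% *}"
    product="${ids#* }"
    printf 'ATTRS{idVendor}=="%s", ATTRS{idProduct}=="%s", MODE="0664", GROUP="lp", ENV{libsane_matched}="yes"\n' "$vendor" "$product"
}

# Example:
#   udev_rule_from_lsusb "$(lsusb | grep -i scansnap)"
```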

Configure Firewall

saned uses port 6566.

  • On fedora:

    sudo firewall-cmd --add-service=sane --permanent
    sudo firewall-cmd --reload

Add subnet(s) to saned access list

  • Edit the saned access list config file:

    /etc/sane.d/saned.conf
  • Add the subnet(s) that will be given access to the scanner. This should include the host where Scanservjs and Insaned will be running.

    Example: 192.168.0.0/24
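
If you script this step, it helps to avoid adding the same subnet twice. A minimal sketch, assuming one entry per line in the access list; allow_subnet is a hypothetical helper name, and you would run it as root against the real file.

```shell
# Sketch: add a subnet to a saned access list only if it is not already there.
# (Helper name is hypothetical; the default path is the file named above.)
allow_subnet() {
    subnet="$1"
    conf="${2:-/etc/sane.d/saned.conf}"
    # -x matches the whole line, -F treats the subnet as a literal string.
    grep -qxF "$subnet" "$conf" 2>/dev/null || printf '%s\n' "$subnet" >> "$conf"
}

# Example (as root):
#   allow_subnet 192.168.0.0/24
```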

Test the configuration (optional)

This step is optional. We haven't yet set up Scanservjs and Insaned, but we can still test connectivity from the host that will run them.

On the host where Scanservjs and Insaned will be running:

  • Ensure net is uncommented in /etc/sane.d/dll.conf (it may be the first entry in the list)

  • Add the host IP where the Scanner is connected to /etc/sane.d/net.conf

  • Run scanimage -L (may need to install the sane-backends package)

    This should output the scanner device ID and confirm that your USB connected Scanner is now accessible over the network.
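
The two file edits above can be sketched as shell helpers. These are hypothetical function names, and the sketch assumes the net entry in dll.conf is commented out as a plain "#net" line and that GNU sed is available.

```shell
# Sketch: prepare a SANE client to reach a remote saned host.
# (Helper names are hypothetical; default paths are the files named above.)

# Uncomment the "net" backend in dll.conf if it is commented out.
enable_net_backend() {
    dll_conf="${1:-/etc/sane.d/dll.conf}"
    # Only touches a line that is exactly "#net", allowing stray whitespace.
    sed -i 's/^[[:space:]]*#[[:space:]]*net[[:space:]]*$/net/' "$dll_conf"
}

# Add the scanner host's IP to net.conf unless it is already listed.
add_saned_host() {
    host="$1"
    net_conf="${2:-/etc/sane.d/net.conf}"
    grep -qxF "$host" "$net_conf" 2>/dev/null || printf '%s\n' "$host" >> "$net_conf"
}

# Example (as root), then verify:
#   enable_net_backend && add_saned_host 192.168.0.101 && scanimage -L
```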

Setup Scanservjs and Insaned

Now we will set up a docker compose file to configure and run scanservjs and insaned.

This compose file will build both images locally from the source code. We build from source for a couple of reasons:

  • scanservjs requires building if you want to use custom user and group IDs, which are useful when accessing docker volumes for scan output and configuration.
  • insaned is rudimentary source code. I wrote a Dockerfile for it that builds an image. For now it remains unpackaged.

Clone repos

  • In a relevant directory on the server, clone both repos (adjust version tag for future updates):
git clone --branch v3.0.4 https://github.com/sbs20/scanservjs
git clone https://github.com/Vigrond/insaned

Setup compose.yaml

In the same directory create a compose.yaml. A basic template is provided below. You may need to adjust for your specific environment.

services:
  scanservjs:
    build:
      context: ./scanservjs
      args:
        # ----- enter UID and GID here -----
        UID: 1000
        GID: 1000
        UNAME: user
      target: scanservjs-user2001
    user: 1000:1000
    container_name: scanservjs
    environment:
      # ----- specify network scanners here using a ; delimiter -----
      - SANED_NET_HOSTS=192.168.0.101
    volumes:
      # mount your NFS shared paperless consume folder to the scanservjs output folder:
      - /mnt/homelab/services/paperless/.data/paperless/consume/scanserv:/var/lib/scanservjs/output
      # scanservjs configuration, where pipelines may be defined:
      - ./.data/scanservjs/config:/etc/scanservjs
    restart: unless-stopped
  scanservjs_insaned:
    build:
      context: ./insaned
    container_name: scanservjs_insaned
    environment:
      # ----- specify only one network scanner here; additional scanners will need more insaned instances -----
      - SANED_NET_HOSTS=192.168.0.101
    volumes:
      # insaned configuration
      - ./.data/insaned/insaned.env:/etc/insaned/events/.env
    restart: unless-stopped

  • UID and GID should match the host user that needs access to the docker volumes. In this guide, that is the user the NFS share for our paperless consume folder grants access to.

  • We mount a folder called scanserv inside a NFS-shared paperless-ngx consume folder to the output folder of scanservjs.

    When PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS=1 is set for the paperless-ngx instance, this will automatically tag scanned documents with scanserv.

  • The env variable SANED_NET_HOSTS defines scanner IPs, including the USB-connected ones we set up with SANE over Network previously.

    In the entrypoint of both images, the script automatically adds SANED_NET_HOSTS addresses to /etc/sane.d/*.conf files so that saned has access.
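
If you are unsure which UID and GID to enter, one option is to read them from the mounted consume folder itself. A minimal sketch, assuming GNU coreutils stat on the docker host; owner_ids is a hypothetical helper name.

```shell
# Sketch: print the numeric "UID:GID" that owns a path, for use as the
# UID/GID build args and the user: entry in compose.yaml.
# (Helper name is hypothetical; stat -c is the GNU coreutils form.)
owner_ids() {
    stat -c '%u:%g' "$1"
}

# Example, using the consume folder path from the template above:
#   owner_ids /mnt/homelab/services/paperless/.data/paperless/consume
```

Note that NFS servers may remap IDs, so treat the output as a starting point and confirm that the container can actually write to the folder.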

Do not run the docker compose file yet.

Configure scanservjs for auto height, autocrop, deskew, autorotate

Although the default setup of scanservjs works mostly fine, we want to add auto scan height, autocrop, deskew, and auto rotate.

  • Add ocrmypdf to the scanservjs/Dockerfile scanservjs-base layer dependencies. It should look something like:

    ...
    FROM debian:bookworm-slim AS scanservjs-base
    RUN apt-get update \
      && apt-get install -yq \
        nodejs \
        adduser \
        imagemagick \
        ipp-usb \
        sane-airscan \
        sane-utils \
        ocrmypdf \
        tesseract-ocr \
        tesseract-ocr-ces \
        tesseract-ocr-deu \
        tesseract-ocr-eng \
        tesseract-ocr-spa \
        tesseract-ocr-fra \
        tesseract-ocr-ita \
        tesseract-ocr-nld \
        tesseract-ocr-pol \
        tesseract-ocr-por \
        tesseract-ocr-rus \
        tesseract-ocr-tur \
        tesseract-ocr-chi-sim \
      && rm -rf /var/lib/apt/lists/*;
    ...
  • Copy the scanservjs configuration template to our config volume

    In our compose.yaml example template above, this would be ./.data/scanservjs/config

    mkdir -p ./.data/scanservjs/config
    cp ./scanservjs/app-server/config/config.default.js ./.data/scanservjs/config/config.local.js
  • Add an ocrmypdf pipeline for autocrop, deskew, autorotate:

    Inside afterConfig(config), add the following code:

    config.pipelines.push({
      extension: 'pdf',
      description: 'ocrmypdf (JPG | @:pipeline.high-quality)',
      get commands() {
        return [
          'convert @- -fuzz 10% -define trim:percent-background=100% -trim tmp-crop-%04d.jpg && ls tmp-crop-*.jpg',
          'convert @- scan-0000.pdf',
          `ocrmypdf --tesseract-timeout=0 --deskew --rotate-pages scan-0000.pdf scan_clean_0000.pdf`,
          'ls scan_clean_*.pdf'
        ];
      }
    });

    This pipeline does the following:

    • Uses ImageMagick to perform a basic autocrop (trim) and uses JPG compression
    • Uses ImageMagick to convert to a PDF file
    • Uses ocrmypdf to deskew and auto-rotate. OCR is skipped via --tesseract-timeout=0 and left to paperless.

    Please reference the scanservjs documentation above to customize your own pipelines.

  • Add auto height for scanning

    My particular scanner (ScanSnap S1300i) has an auto-height feature where it will figure out how long a document is without having to specify it.

    Scanservjs trusts the manufacturer-defined specs that are communicated to it through SANE. Oftentimes these specs are wrong or just plain inconvenient.

    Thankfully Scanservjs provides a way to adjust these in the afterDevices(devices) section of config.local.js.

    In my personal use case, I added the following:

    devices
      .filter(d => d.id.includes('epjitsu'))
      .forEach(device => {
        device.features['--page-height'] = {
          default: 0,
          limits: [0, 297]
        };
      });
    • Where epjitsu is a string from the Device ID outputted by scanimage -L
    • Where default: 0 defines --page-height as 0. Per the manufacturer, a value of 0 implies auto-height.
    • Where limits: [0, 297] defines an upper limit of 297, which is also specified by the manufacturer.

    Manufacturer information can be queried for your particular device with scanimage --help

    At which point you may return to this section for additional customization.

    Additional config documentation can be found using the scanservjs official docs referenced above.

Configure Insaned

Insaned is configured using an .env file. In our compose.yaml template above, it is set up as a volume located at ./.data/insaned/insaned.env

When the scan button is pressed, a curl request is sent to the scanservjs API to initiate a scan using these settings.

  • Copy the template provided by the insaned repo:

    mkdir -p ./.data/insaned
    cp ./insaned/events/_example.env ./.data/insaned/insaned.env
  • Ensure SCAN_SCRIPT=scanservjs.

    The rest of the settings here will depend on your environment. Use the following example as a guide:

    Note that the SSJS server settings take advantage of docker compose internal networking.

    #!/bin/bash

    ### general
    # select the script to be executed when the scan button is pressed
    ## scanimage - the classic scanning image, use this for testing and keep it if it meets your needs
    ## scanservjs - execute a scan via a user friendly web front-end and easily access scans from a browser - see scanservjs file for more info
    export SCAN_SCRIPT="scanservjs"

    ### scanservjs
    # note - the parameters below are for a Fujitsu ScanSnap S1300i
    # consult the scanservjs documentation for your own scanner

    # scanservjs instance using docker compose networking
    export SSJS_PROTOCOL="http"
    export SSJS_HOST="scanservjs" # or IP address
    export SSJS_PORT=8080
    export SSJS_PATH="api/v1/scan"

    # parameters
    # see scanservjs docs/repo for an exhaustive list of enumerations
    export SSJS_RESOLUTION=300 # 50-600 DPI
    export SSJS_MODE="Color" # Color|Gray|Lineart etc
    export SSJS_SOURCE="ADF Duplex" # ADF Front|Back|Duplex
    export SSJS_BRIGHTNESS=0
    export SSJS_CONTRAST=0
    export SSJS_FILTERS=() # ("filter.auto-level" "filter.blur" "filter.threshold") - bash array will be converted to JSON, use spaces not commas!
    export SSJS_PIPELINE="ocrmypdf (JPG | @:pipeline.high-quality)" # pipeline description string
    export SSJS_BATCH="auto" # none|manual|auto
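
To make it clearer how these variables fit together, here is a rough sketch of how the request URL and the filter list could be derived from them. This is not the code insaned actually runs; ssjs_url and filters_to_json are hypothetical helpers written only for illustration.

```shell
# Sketch only: hypothetical helpers, not taken from the insaned source.

# Build the scan endpoint URL from the SSJS_* server settings.
ssjs_url() {
    printf '%s://%s:%s/%s\n' "$SSJS_PROTOCOL" "$SSJS_HOST" "$SSJS_PORT" "$SSJS_PATH"
}

# Convert filter names (one per argument) into a JSON array string.
# Names are assumed to contain no quotes or backslashes, so no escaping is done.
filters_to_json() {
    json='['
    sep=''
    for name in "$@"; do
        json="${json}${sep}\"${name}\""
        sep=','
    done
    printf '%s]\n' "$json"
}

# Example, in bash with the .env file sourced:
#   ssjs_url
#   filters_to_json "${SSJS_FILTERS[@]}"
```

Passing the array elements as separate arguments is also why the filters are separated by spaces rather than commas in the .env file.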

Run the compose file

  • In the same directory, start the services and follow the logs for any errors:

    docker compose up -d
    docker compose logs -f

Test the Scanning Pipeline

If properly set up, you should be able to:

  • Scan from the scanservjs UI (useful for when multi page docs need extra attention)

  • Scan by using the scan button ( useful for one-off docs )

  • Have paperless automatically consume documents and tag them with scanserv

    This will also automatically clean the files folder in scanservjs as documents are consumed.

Troubleshooting

No scanners found

Test saned within the containers

Use the following docker compose commands to confirm scanners are found on the network in both scanservjs and insaned.

Always test as the saned user, not root.

  • Test the scanservjs container:

    docker compose exec -u root scanservjs su -s /bin/bash -c "scanimage -L" saned
  • Test the scanservjs_insaned container:

    docker compose exec -u root scanservjs_insaned su -s /bin/bash -c "scanimage -L" saned

The above commands should output the scanners found on the network.

If no scanners are found, it may indicate a SANE configuration issue, a firewall issue, or a permission issue.

  • Check the container logs (docker compose logs) for error information
  • scanimage -L will be your friend in confirming connectivity both locally and over the network.
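
To separate a firewall problem from a SANE configuration problem, it can help to check whether the saned port is reachable at all. A minimal sketch, assuming bash and the coreutils timeout command are available where you run it; saned_reachable is a hypothetical helper name.

```shell
# Sketch: check whether a TCP port (saned uses 6566) accepts connections.
# (Helper name is hypothetical.)
saned_reachable() {
    host="$1"
    port="${2:-6566}"
    # bash's /dev/tcp pseudo-device opens a TCP connection; timeout bounds the wait.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Example:
#   saned_reachable 192.168.0.101 && echo "port open" || echo "port closed or filtered"
```

If the port is closed, look at the firewall and the saned.socket service on the scanner host. If it is open but scanimage -L still finds nothing, the saned access list or the udev permissions are the more likely cause.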

Scan Button stopped working

Check insaned and scanservjs logs for errors.

There is a bug in insaned where a segfault occurs and the container does not restart properly. If this is the case, you may need to run docker compose restart scanservjs_insaned

Sometimes a pipeline can take a while to finish. Insaned waits for scanservjs to finish processing before allowing another button press.

I want to scan a document with a large number of pages without using the web UI

This can be achieved by using the merge pdf action in paperless-ngx after multiple scans are done.

With a document-feeder scanner, ensure the SSJS_BATCH variable in the insaned.env config is set to auto or the relevant auto setting returned by the device (scanimage -A --help). This will make the scanner keep scanning sheets as long as one is available in the feeder.

Combined with the merge pdf action, this makes button-only scanning of large documents a quick process.

Getting device features for Scanservjs configuration

Manufacturer device configuration information, including features and default values can be obtained using:

scanimage -A --help

These values can be utilized for scanner device configuration in config.local.js

Feedback

To provide feedback on this guide, you may create an issue at https://github.com/Vigrond/insaned/issues