Setup guide for web-based scanner to paperless-ngx pipeline
Scanning documents to paperless-ngx can be a tedious process depending on your scanner software, hardware, and consume folder setup.
This guide sets up dockerized open-source web-based scanning software (scanservjs) and a scan button polling software (insaned) to make scanning a breeze.
Includes auto-crop, auto-rotate, and deskewing before the PDF file arrives for consumption and OCR by paperless.
Simply press the scan button for simple one-off scans, or utilize the scanservjs web UI for more complex multipage scans.
Documents will automatically be consumed by paperless-ngx and tagged with scanserv.
Basics
This guide assumes:
- You already have a paperless-ngx instance running on your local network
  - This guide will use an NFS share to link the paperless consume folder to scanservjs as a docker volume.
  - If your paperless instance is not local, other solutions are available that use the paperless API to upload documents via the scanservjs pipeline features, but in my opinion they are less reliable in the case of API or pipeline failures.
- Your scanner is SANE compatible. See http://www.sane-project.org/sane-supported-devices.html for supported devices, or simply run scanimage -L on the host to test.
- Your scanner is either connected to the network or is connected via USB to a host on the network (which will utilize the SANE over Network feature)
  - If your scanner is connected via USB to a host on the network, a Linux-based host is assumed.
  - This guide will be specific to Fedora, but provides references for Debian/Ubuntu and Arch Linux setups.
  - This guide does NOT cover USB passthrough to a Linux VM
- Docker is installed and ready for use on the server that will host scanservjs and insaned.
  - This guide does NOT cover HTTP proxy setup
Definitions
- SANE: "Scanner Access Now Easy". SANE is an application programming interface (API) that provides standardized access to any raster image scanner hardware. The standardized interface makes it possible to write just one driver for each scanner device instead of one driver for each scanner and application.
- Paperless-ngx: Paperless-ngx is a community-supported open-source document management system that transforms your physical documents into a searchable online archive so you can keep, well, less paper.
- Scanservjs: scanservjs is a web UI frontend for your scanner. It allows you to share one or more scanners (using SANE) on a network without the need for drivers or complicated installation.
- Insaned: Insaned is a simple Linux daemon for polling button presses on SANE-managed scanners.
- Docker: Docker is a tool that is used to automate the deployment of applications in lightweight containers so that applications can work efficiently in different environments in isolation.
References
Much of the knowledge in this guide comes from well documented resources:
- ArchWiki SANE https://wiki.archlinux.org/title/SANE
- Debian Sane over Network docs https://wiki.debian.org/SaneOverNetwork
- ScanservJS docs https://sbs20.github.io/scanservjs/
- Paperless ngx docs https://docs.paperless-ngx.com/
- Insaned is not well-documented and is very rudimentary software. I personally forked it to fix numerous bugs and make updates. The fork can be found here: https://github.com/Vigrond/insaned
Non Fedora Distros
If you are running a different distro than Fedora, please check the above references for further instructions.
Packages installed with dnf should be available on other package managers such as apt, albeit under different names.
Configuration paths referenced in this guide may be different on other distros.
Use your google-fu wisely.
Personal setup
This guide is not well-tested, but it should get you close. My personal working setup is as follows:
- ScanSnap S1300i connected via USB to a Fedora 42 Workstation host
- Dockerized Scanservjs and Insaned running on a local generic server
  - also runs nginx, pihole, etc
- Dockerized Paperless-ngx running on a local NAS server
For me Xsane, NAPS2, and simplescan were lacking in being feature-complete for autoheight, autocrop, autorotate, deskew, button support, and streamlining output to paperless-ngx.
Scanner setup
Dependencies
SANE software and utilities such as scanimage can be found in the sane-backends package for dnf.
sudo dnf update && sudo dnf install sane-backends
Network Scanners
Simply check if your scanner is available on the network via:
scanimage -L
This is untested. Network Scanner addresses may need to be added to /etc/sane.d/net.conf, /etc/sane.d/airscan.conf, or /etc/sane.d/pixma.conf. Please see above references for more guidance.
Skip to next section if output is nominal.
USB Scanners
USB Scanners will require utilizing the SANE over Network feature, where the host exposes the scanner to a network subnet.
(There is another option that uses privileged mode and shares dbus paths with a local containerized Scanservjs instance, but it tends to be unreliable and will not work when Scanservjs is hosted on a different machine than the one the scanner is connected to. For the purposes of this guide, we will choose the more reliable solution: SANE over Network.)
Configure and run saned.socket service
- Increase MaxConnections to 64 in the saned.socket settings:
  sudo systemctl edit --full saned.socket
  The default of 1 blocks concurrent connections and interferes with requests from both the button press and the Scanservjs web UI.
- Enable and start the saned.socket service:
  sudo systemctl enable saned.socket
  sudo systemctl start saned.socket
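As an alternative to editing the full unit file, systemd also reads drop-in overrides from a saned.socket.d directory. The following is a minimal sketch of that approach, not the method used in my setup. The helper name write_saned_dropin and its arguments are made up for illustration, and the default path assumes the standard systemd drop-in location.

```shell
# Write a systemd drop-in that raises MaxConnections for saned.socket.
# write_saned_dropin is a hypothetical helper name, not part of SANE or systemd.
write_saned_dropin() {
    dropin_dir="${1:-/etc/systemd/system/saned.socket.d}"
    max_conn="${2:-64}"
    mkdir -p "$dropin_dir" || return 1
    printf '[Socket]\nMaxConnections=%s\n' "$max_conn" > "$dropin_dir/override.conf"
}

# Typical follow-up on the scanner host after calling the helper as root:
#   sudo systemctl daemon-reload
#   sudo systemctl restart saned.socket
#   systemctl show saned.socket -p MaxConnections
```

A drop-in keeps the distribution's unit file untouched, so only the one overridden setting needs to be maintained.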
Give saned permissions to your scanner
- Run lsusb and take note of the vendor:product numbers.
  lsusb
  For example, in Bus 005 Device 013: ID 04c5:128d Fujitsu, Ltd ScanSnap S1300i, the Vendor ID is 04c5 and the Product ID is 128d.
- Add device permissions. Edit the file /usr/lib/udev/rules.d/65-sane-backends.rules and append the following line, replacing vendorID and productID with the identifiers from the previous step:
  ATTRS{idVendor}=="vendorID", ATTRS{idProduct}=="productID", MODE="0664", GROUP="lp", ENV{libsane_matched}="yes"
  (The configuration file /usr/lib/udev/rules.d/65-sane-backends.rules may be named /usr/lib/udev/rules.d/65-sane.rules in non-Fedora distros.)
- Unplug and replug your scanner for the permissions to take effect.
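To avoid copy mistakes when filling in the IDs, the rule line can be generated from the lsusb output. This is a small sketch, assuming the usual lsusb line format shown in the example above. The helper name udev_rule_from_lsusb is hypothetical.

```shell
# Build the udev rule line from one line of lsusb output.
# udev_rule_from_lsusb is a hypothetical helper name used only for this sketch.
udev_rule_from_lsusb() {
    # Extract "vendor product" from the "ID vvvv:pppp" field.
    ids=$(printf '%s\n' "$1" | sed -n 's/^.* ID \([0-9a-fA-F]\{4\}\):\([0-9a-fA-F]\{4\}\).*$/\1 \2/p')
    [ -n "$ids" ] || return 1
    vendor=${ids% *}
    product=${ids#* }
    printf 'ATTRS{idVendor}=="%s", ATTRS{idProduct}=="%s", MODE="0664", GROUP="lp", ENV{libsane_matched}="yes"\n' "$vendor" "$product"
}

# Prints the rule for the ScanSnap S1300i example line from this guide.
udev_rule_from_lsusb 'Bus 005 Device 013: ID 04c5:128d Fujitsu, Ltd ScanSnap S1300i'
```

The printed line can then be appended to the rules file by hand.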
Configure Firewall
saned uses port 6566.
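- On Fedora:
  sudo firewall-cmd --add-service=sane --permanent
  sudo firewall-cmd --reload
  The --permanent flag only changes the saved configuration, so the reload is needed for the rule to become active.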
Add subnet(s) to saned access list
- Edit the saned access list config file: /etc/sane.d/saned.conf
- Add the subnet(s) that will be given access to the scanner. This should include the host where Scanservjs and Insaned will be running.
  Example:
  192.168.0.0/24
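If you prefer to script this step, the subnet can be appended only when it is not already listed, which makes the step safe to repeat. A minimal sketch follows. The helper name add_saned_subnet is hypothetical, and it only matches exact, whole-line entries.

```shell
# Append a subnet to a saned access list unless an identical line exists.
# add_saned_subnet is a hypothetical helper name for this sketch.
add_saned_subnet() {
    conf="$1"
    subnet="$2"
    # -x matches whole lines only, -F treats the subnet as a literal string.
    if grep -qxF "$subnet" "$conf" 2>/dev/null; then
        return 0
    fi
    printf '%s\n' "$subnet" >> "$conf"
}

# On the scanner host, as root:
#   add_saned_subnet /etc/sane.d/saned.conf 192.168.0.0/24
```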
Test the configuration (optional)
This step is optional because we haven't yet set up Scanservjs and Insaned to test on, but we can still test on the host itself.
On the host where Scanservjs and Insaned will be running:
- Ensure net is uncommented in /etc/sane.d/dll.conf (it may be the first entry in the list)
- Add the IP address of the scanner's host to /etc/sane.d/net.conf
- Run scanimage -L (you may need to install the sane-backends package)
  This should output the scanner device ID and confirm that your USB-connected scanner is now accessible over the network.
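The first check in the list above can be done with a one-line grep instead of opening the file. This is a sketch under the assumption that an active backend appears as a bare line containing only its name. The helper name net_backend_enabled is hypothetical.

```shell
# Succeeds when the net backend is active (not commented out) in a dll.conf.
# net_backend_enabled is a hypothetical helper name for this sketch.
net_backend_enabled() {
    grep -Eq '^[[:space:]]*net[[:space:]]*$' "${1:-/etc/sane.d/dll.conf}"
}

# Example check on the machine that will run the containers:
#   net_backend_enabled && echo "net backend enabled" || echo "uncomment net in dll.conf"
```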
Setup Scanservjs and Insaned
Now we will set up a docker compose file to configure and run scanservjs and insaned.
This compose file will build both images locally from the source code. We build from source for a couple of reasons:
- scanservjs requires building if you want to use custom user and group IDs, which are useful when accessing docker volumes for scan output and configuration.
- insaned is rudimentary source code. I personally wrote a Dockerfile for it which will build an image. For now it remains unpackaged.
Clone repos
- In a relevant directory on the server, clone both repos (adjust version tag for future updates):
git clone --branch v3.0.4 https://github.com/sbs20/scanservjs
git clone https://github.com/Vigrond/insaned
Setup compose.yaml
In the same directory create a compose.yaml. A basic template is provided below. You may need to adjust for your specific environment.
services:
  scanservjs:
    build:
      context: ./scanservjs
      args:
        # ----- enter UID and GID here -----
        UID: 1000
        GID: 1000
        UNAME: user
      target: scanservjs-user2001
    user: 1000:1000
    container_name: scanservjs
    environment:
      # ----- specify network scanners here using a ; delimiter -----
      - SANED_NET_HOSTS=192.168.0.101
    volumes:
      # mount your NFS shared paperless consume folder to the scanservjs output folder:
      - /mnt/homelab/services/paperless/.data/paperless/consume/scanserv:/var/lib/scanservjs/output
      # scanservjs configuration, where pipelines may be defined:
      - ./.data/scanservjs/config:/etc/scanservjs
    restart: unless-stopped
  scanservjs_insaned:
    build:
      context: ./insaned
    container_name: scanservjs_insaned
    environment:
      # ----- specify only one network scanner here; additional scanners will need more insaned instances -----
      - SANED_NET_HOSTS=192.168.0.101
    volumes:
      # insaned configuration
      - ./.data/insaned/insaned.env:/etc/insaned/events/.env
    restart: unless-stopped
- UID and GID should match the host user that will need access to docker volumes. In this guide, that is the user the NFS share for our Paperless consume folder grants access to.
- We mount a folder called scanserv inside an NFS-shared paperless-ngx consume folder to the output folder of scanservjs.
  When PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS=1 is set for the paperless-ngx instance, this will automatically tag scanned documents with scanserv. Per the paperless-ngx docs, this option also requires PAPERLESS_CONSUMER_RECURSIVE to be enabled.
- The env variable SANED_NET_HOSTS defines scanner IPs, including the USB-connected ones we set up with SANE over Network previously.
  In the entrypoint of both images, the script automatically adds SANED_NET_HOSTS addresses to /etc/sane.d/*.conf files so that saned has access.
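To illustrate the ; delimiter used by SANED_NET_HOSTS, here is a minimal sketch of splitting such a value into one host per line, which is the shape a file like net.conf expects. This is not the actual entrypoint code, which may work differently. The helper name split_hosts and the second IP address are made up for the example.

```shell
# Print each host from a ;-delimited list on its own line, skipping empty entries.
# split_hosts is a hypothetical helper; the real entrypoint scripts may differ.
split_hosts() {
    printf '%s\n' "$1" | tr ';' '\n' | sed '/^[[:space:]]*$/d'
}

# Two scanners on the network (the second address is illustrative only).
split_hosts '192.168.0.101;192.168.0.102'
```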
Do not run the docker compose file yet
Configure scanservjs for auto height, autocrop, deskew, autorotate
Although the default setup of scanservjs works mostly fine, we want to add auto scan height, autocrop, deskew, and auto rotate.
- Add ocrmypdf to the scanservjs/Dockerfile scanservjs-base layer dependencies. It should look something like:
  ...
  FROM debian:bookworm-slim AS scanservjs-base
  RUN apt-get update \
    && apt-get install -yq \
      nodejs \
      adduser \
      imagemagick \
      ipp-usb \
      sane-airscan \
      sane-utils \
      ocrmypdf \
      tesseract-ocr \
      tesseract-ocr-ces \
      tesseract-ocr-deu \
      tesseract-ocr-eng \
      tesseract-ocr-spa \
      tesseract-ocr-fra \
      tesseract-ocr-ita \
      tesseract-ocr-nld \
      tesseract-ocr-pol \
      tesseract-ocr-por \
      tesseract-ocr-rus \
      tesseract-ocr-tur \
      tesseract-ocr-chi-sim \
    && rm -rf /var/lib/apt/lists/*;
  ...
- Copy the scanservjs configuration template to our config volume.
  In our compose.yaml example template above, this would be ./.data/scanservjs/config
  mkdir -p ./.data/scanservjs/config
  cp ./scanservjs/app-server/config/config.default.js ./.data/scanservjs/config/config.local.js
- Add an ocrmypdf pipeline for autocrop, deskew, autorotate:
  Inside afterConfig(config), add the following code:
  config.pipelines.push({
    extension: 'pdf',
    description: 'ocrmypdf (JPG | @:pipeline.high-quality)',
    get commands() {
      return [
        'convert @- -fuzz 10% -define trim:percent-background=100% -trim tmp-crop-%04d.jpg && ls tmp-crop-*.jpg',
        'convert @- scan-0000.pdf',
        `ocrmypdf --tesseract-timeout=0 --deskew --rotate-pages scan-0000.pdf scan_clean_0000.pdf`,
        'ls scan_clean_*.pdf'
      ];
    }
  });
  This pipeline does the following:
  - Uses ImageMagick to perform a basic autocrop (trim) and uses JPG compression
  - Uses ImageMagick to convert to a PDF file
  - Uses ocrmypdf to deskew and auto align (does not perform OCR; that is left to paperless)
  Please reference the scanservjs documentation above to customize your own pipelines.
- Add auto height for scanning
  My particular scanner (ScanSnap S1300i) has an auto-height feature where it will figure out how long a document is without having to specify it.
  Scanservjs trusts the manufacturer-defined specs that are communicated to it through SANE. Oftentimes these specs are wrong or just plain inconvenient.
  Thankfully Scanservjs provides a way to adjust these in the afterDevices(devices) section of config.local.js.
  In my personal use case, I added the following:
  devices
    .filter(d => d.id.includes('epjitsu'))
    .forEach(device => {
      device.features['--page-height'] = {
        default: 0,
        limits: [0, 297]
      };
    });
  - Where epjitsu is a string from the Device ID output by scanimage -L
  - Where default: 0 defines --page-height as 0. Per the manufacturer, a value of 0 implies auto-height.
  - Where limits: [0, 297] defines an upper limit of 297, which is also specified by the manufacturer.
  Manufacturer information can be queried for your particular device with
  scanimage --help
  At which point you may return to this section for additional customization.
  Additional config documentation can be found using the scanservjs official docs referenced above.
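The ocrmypdf pipeline from the steps above shells out to convert and ocrmypdf, so a missing package in the image only shows up when a scan fails. A small sketch of a pre-flight check follows. The helper name missing_tools is hypothetical, and it only checks that the commands are on PATH, not that they work.

```shell
# Print the names of any commands that are not found on PATH.
# missing_tools is a hypothetical helper name for this sketch.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Inside the scanservjs container, the pipeline above needs at least:
#   missing_tools convert ocrmypdf
```

No output means every listed command was found. One way to run it is to open a shell with docker compose exec scanservjs sh and paste the function in.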
Configure Insaned
Insaned is configured using a .env file. In our compose.yaml template above, it is set up as a volume located at ./.data/insaned/insaned.env
Upon the scan button being pressed, a curl request is sent to the scanservjs api to initiate a scan using these settings.
- Copy the template provided by the insaned repo:
  cp ./insaned/events/_example.env ./.data/insaned/insaned.env
- Ensure SCAN_SCRIPT=scanservjs.
  The rest of the settings here will depend on your environment. Note that the SSJS server settings take advantage of docker compose internal networking.
  Use the following example as a guide:
#!/bin/bash
### general
# select the script to be executed when the scan button is pressed
## scanimage - the classic scanning image, use this for testing and keep it if it meets your needs
## scanservjs - execute a scan via a user friendly web front-end and easily access scans from a browser - see scanservjs file for more info
export SCAN_SCRIPT="scanservjs"
### scanservjs
# note - the parameters below are for a Fujitsu ScanSnap S1300i
# consult the scanservjs documentation for your own scanner
# scanservjs instance using docker compose networking
export SSJS_PROTOCOL="http"
export SSJS_HOST="scanservjs" # or IP address
export SSJS_PORT=8080
export SSJS_PATH="api/v1/scan"
# parameters
# see scanservjs docs/repo for an exhaustive list of enumerations
export SSJS_RESOLUTION=300 # 50-600 DPI
export SSJS_MODE="Color" # Color|Gray|Lineart etc
export SSJS_SOURCE="ADF Duplex" # ADF Front|Back|Duplex
export SSJS_BRIGHTNESS=0
export SSJS_CONTRAST=0
export SSJS_FILTERS=() # ("filter.auto-level" "filter.blur" "filter.threshold") - bash array will be converted to JSON, use spaces not commas!
export SSJS_PIPELINE="ocrmypdf (JPG | @:pipeline.high-quality)" # pipeline description string
export SSJS_BATCH="auto" # none|manual|auto
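To make the role of the connection settings concrete, the sketch below shows how the four SSJS_* server values combine into a single request URL, and how a list of filter names can be turned into a JSON array. It is an illustration only and does not reproduce the insaned scripts or the scanservjs request body. The helper names ssjs_url and json_array are hypothetical.

```shell
# Join the SSJS_* connection settings into the request URL.
# ssjs_url and json_array are hypothetical helper names for this sketch.
ssjs_url() {
    printf '%s://%s:%s/%s\n' "$SSJS_PROTOCOL" "$SSJS_HOST" "$SSJS_PORT" "$SSJS_PATH"
}

# Turn a list of filter names into a JSON array string.
# Assumes the names contain no quotes or backslashes.
json_array() {
    out=""
    for item in "$@"; do
        out="${out:+$out,}\"$item\""
    done
    printf '[%s]\n' "$out"
}

# Values taken from the example configuration above.
SSJS_PROTOCOL="http"
SSJS_HOST="scanservjs"
SSJS_PORT=8080
SSJS_PATH="api/v1/scan"

ssjs_url
json_array "filter.auto-level" "filter.blur"
```

Because SSJS_HOST is the compose service name, the URL only resolves from containers on the same compose network.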
Run the compose file
- In the same directory, start the services and follow the logs for any errors:
  docker compose up -d
  docker compose logs -f
Test the Scanning Pipeline
If properly set up, you should be able to:
- Scan from the scanservjs UI (useful when multi-page docs need extra attention)
- Scan by using the scan button (useful for one-off docs)
- Have paperless automatically consume documents and tag them with scanserv
  This will also automatically clean the files folder in scanservjs as documents are consumed.
Troubleshooting
No scanners found
Test saned within the containers
Use the following docker compose commands to confirm scanners are found on the network in both scanservjs and insaned.
Always test with the saned user. (not root)
- Test the scanservjs container:
  docker compose exec -u root scanservjs su -s /bin/bash -c "scanimage -L" saned
- Test the scanservjs_insaned container:
  docker compose exec -u root scanservjs_insaned su -s /bin/bash -c "scanimage -L" saned
The above commands should output scanners found on the network.
If no scanners are found, it may indicate a SANE configuration issue, a firewall issue, or a permission issue.
- Check the container logs (docker compose logs) for error information
- scanimage -L will be your friend in confirming connectivity both locally and over the network.
Scan Button stopped working
Check insaned and scanservjs logs for errors.
There is a bug in insaned where a segfault occurs and the container does not restart properly. If this is the case,
you may need to run docker compose restart scanservjs_insaned
A pipeline can sometimes take a while to finish; insaned will wait for scanservjs to finish processing before allowing
another button press.
I want to scan a document with a large number of pages without using the web UI
This can be achieved by using the merge pdf action in paperless-ngx after multiple scans are done.
With a document-feeder scanner, ensure the SSJS_BATCH variable in the insaned.env config is set to auto or the relevant auto setting
returned by the device (scanimage -A --help). This will make the scanner keep scanning sheets as long as one is available in the feeder.
This, in combination with the merge pdf action, can make button-only scanning of large documents a quick process.
Getting device features for Scanservjs configuration
Manufacturer device configuration information, including features and default values, can be obtained using:
scanimage -A --help
These values can be utilized for scanner device configuration in config.local.js
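The help output can be long, so it may be easier to filter it down to the one option you want to configure. A minimal sketch, assuming a grep that supports -A (GNU, BSD, and BusyBox grep do). The helper name show_option is hypothetical, and the device ID in the usage comment is only an example of the format.

```shell
# Print the help entry for one option from scanimage help output on stdin,
# including up to two following description lines.
# show_option is a hypothetical helper name for this sketch.
show_option() {
    grep -A 2 -e "$1"
}

# Example (replace the device ID with one reported by scanimage -L):
#   scanimage -d 'epjitsu:libusb:005:013' -A --help | show_option '--page-height'
```

The range and default shown for the option are the values to carry over into limits and default in config.local.js.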
Feedback
To provide feedback on this guide, you may create an issue at https://github.com/Vigrond/insaned/issues
