Validated Jobs Scraper: Dedup, No Ghost Jobs, Confidence Scored
Under maintenance
Pricing: $2.50 / 1,000 job results
Job data with a correctness guarantee: per-field confidence, ghost-job filtering and cross-source dedup, so results are never silently wrong, duplicated or expired. Reaches LinkedIn and ATS boards cookieless, gets through Cloudflare/DataDome, and self-heals on layout shifts. Built on the data.hilgard.cz engine.
Developer
Jan Hilgard
Maintained by Community
Validated Jobs Scraper
Job data that is deduplicated, ghost-filtered and never silently wrong.
You give it a source and a query (an ATS company, or keywords for LinkedIn / Indeed). You get back
clean job records — title, company, location, salary, employment and workplace type — and
on top of every record a correctness layer: a confidence on each field, a single
reliable flag, a ghost-job score, and cross-source dedup. When a field does not clear
the bar you get a null with a stated reason (status: "absent"), not a guessed value.
When the same opening sits on several boards you get it once, with the others listed under
also_at. That guarantee is the product; the engine underneath is only how it is kept.
Most job scrapers optimise for how many fields they return. That is the easy part. The hard part — and the expensive one when it goes wrong — is a job that is silently duplicated, already filled, reposted for the tenth time, or quietly mis-parsed. This one is built around that: it scores its own confidence per field, flags ghost jobs, and collapses duplicates across sources, so what you load is what is real.
Keywords: no silent errors, per-field confidence, validated jobs, ghost job detection, deduplicated jobs, hiring intent, cookieless jobs scraper, linkedin jobs scraper, indeed jobs scraper, ats greenhouse lever ashby, job posting api, labor market data, hiring intent signals, sales prospecting jobs.
What a real run looks like
One query in, one row per job out — each in the engine's snake_case schema, with the
per-field confidence and the enrichment block attached. Below is a real record from a
Greenhouse run (the two description_* fields are large and omitted here for brevity;
every enrichment field is wrapped as
{ value, status, score, source, model }).
(a) A reliable record — and where the page has no salary, it says so instead of guessing.
{
  "source": "greenhouse",
  "canonical_url": "https://job-boards.greenhouse.io/gitlab/jobs/8565469002",
  "apply_url": "https://job-boards.greenhouse.io/gitlab/jobs/8565469002",
  "title": "AI Engineer",
  "company": { "name": "GitLab", "url": null },
  "location": { "raw": "Remote, US", "city": null, "region": null, "country": null, "workplace_type": "remote" },
  "employment_type": null,
  "date_posted": "2026-05-29",
  "salary": { "min": null, "max": null, "currency": null, "period": null, "source": "absent" },
  "seniority": null,
  "skills": [],
  "reliable": true,
  "overall_confidence": 0.95,
  "fields": {
    "title": { "value": "AI Engineer", "status": "confirmed", "score": 0.95, "source": "ats" },
    "company": { "value": "GitLab", "status": "confirmed", "score": 0.95, "source": "ats" },
    "location": { "value": "Remote, US", "status": "confirmed", "score": 0.95, "source": "ats" },
    "salary": { "value": null, "status": "absent", "score": null, "source": null },
    "employment_type": { "value": null, "status": "absent", "score": null, "source": null }
  },
  "also_at": [],
  "enrichment": {
    "quality": {
      "ghost_job_score": { "value": 0.1, "status": "low", "score": 0.5, "source": "inferred", "model": "qwen3.6-35b@1" },
      "flags": { "value": [], "status": "absent", "score": null, "source": "inferred", "model": null },
      "is_real": { "value": true, "status": "low", "score": 0.5, "source": "inferred", "model": "qwen3.6-35b@1" }
    },
    "dedup": {
      "is_duplicate_of": { "value": null, "status": "absent", "score": null, "source": "inferred", "model": null },
      "sources_seen": { "value": ["greenhouse"], "status": "confirmed", "score": 0.95, "source": "inferred", "model": null }
    },
    "normalized": {
      "role_normalized": { "value": "AI Engineer", "status": "low", "score": null, "source": "inferred", "model": "qwen3.6-35b@1" },
      "seniority_normalized": { "value": "Mid", "status": "low", "score": null, "source": "inferred", "model": "qwen3.6-35b@1" },
      "skills": {
        "value": [
          { "name": "Python", "type": "nice_to_have", "score": 0.9 },
          { "name": "TypeScript", "type": "nice_to_have", "score": 0.9 },
          { "name": "LLMs", "type": "nice_to_have", "score": 0.9 }
        ],
        "status": "high", "score": 0.9, "source": "inferred", "model": "qwen3.6-35b@1"
      }
    },
    "hiring_intent": {
      "buying_signal_score": { "value": 0.11, "status": "low", "score": 0.45, "source": "inferred", "model": null },
      "company_signals": {
        "value": { "open_roles_count": 1, "role_velocity": 0.03, "expanding_departments": [] },
        "status": "low", "score": 0.45, "source": "inferred", "model": null
      }
    }
  },
  "company_name": "GitLab",
  "location_raw": "Remote, US",
  "ghost_job_score": 0.1,
  "success": true,
  "error": null
}
Note the salary and employment_type: the page did not state them, so they come back
status: "absent" with a null value — not a guessed band. The ghost_job_score is low
(0.1), so the record is trusted. company_name, location_raw and ghost_job_score at
the bottom are flat copies the actor lifts out of the nested objects for the dataset table.
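A consumer can lean on the per-field status and score instead of reading raw values directly. The sketch below is one possible way to do that, not part of the actor: `TrackedField`, `trustedValue` and the 0.8 cut-off are assumptions made for illustration, and only the field shape `{ value, status, score, source }` comes from the record above.

```typescript
// Shape of one entry in a record's `fields` map, as shown in the sample record.
type FieldStatus = "confirmed" | "high" | "low" | "absent";

interface TrackedField<T> {
  value: T | null;
  status: FieldStatus;
  score: number | null;
  source: string | null;
}

// Hypothetical helper (not provided by the actor): return the value only when
// the field is present and its score clears `minScore`; otherwise return null.
// The 0.8 default is an arbitrary illustration, not the engine's threshold.
function trustedValue<T>(field: TrackedField<T>, minScore = 0.8): T | null {
  if (field.status === "absent" || field.value === null) return null;
  if (field.score === null || field.score < minScore) return null;
  return field.value;
}

// Values copied from the GitLab record above.
const titleField: TrackedField<string> = {
  value: "AI Engineer", status: "confirmed", score: 0.95, source: "ats",
};
const salaryField: TrackedField<string> = {
  value: null, status: "absent", score: null, source: null,
};

console.log(trustedValue(titleField));  // AI Engineer
console.log(trustedValue(salaryField)); // null
```

Treating a null score as "not trusted" is a deliberately conservative choice in this sketch; a pipeline that wants absent fields reported separately could branch on `status` instead.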
(b) A ghost-suspect job with thin data. It does NOT guess — it flags and fails loud. (Illustrative record in the same real shape — a live ghost can't be produced on demand.)
{
  "source": "linkedin",
  "canonical_url": "https://www.linkedin.com/jobs/view/...",
  "title": "Marketing Manager",
  "company": { "name": "Stealth Startup", "url": null },
  "location": { "raw": null, "city": null, "region": null, "country": null, "workplace_type": null },
  "employment_type": null,
  "salary": { "min": null, "max": null, "currency": null, "period": null, "source": "absent" },
  "reliable": false,
  "overall_confidence": 0.39,
  "fields": {
    "title": { "value": "Marketing Manager", "status": "high", "score": 0.9, "source": "html" },
    "company": { "value": "Stealth Startup", "status": "low", "score": 0.41, "source": "html" },
    "location": { "value": null, "status": "absent", "score": null, "source": null },
    "salary": { "value": null, "status": "absent", "score": null, "source": null }
  },
  "also_at": [],
  "enrichment": {
    "quality": {
      "ghost_job_score": { "value": 0.78, "status": "low", "score": 0.6, "source": "inferred", "model": "qwen3.6-35b@1" },
      "flags": { "value": ["reposted", "vague_jd"], "status": "high", "score": 0.7, "source": "inferred", "model": "qwen3.6-35b@1" },
      "is_real": { "value": false, "status": "low", "score": 0.6, "source": "inferred", "model": "qwen3.6-35b@1" }
    }
  },
  "ghost_job_score": 0.78,
  "success": true,
  "error": null
}
The value is the second row. A cheaper tool would have returned this as just another
clean-looking hit. Here the ghost_job_score is high (0.78), the flags say why
(reposted, vague_jd), is_real is false, the weak company field is low, and the
absent ones are marked absent — not filled with a guess. (Note: on LinkedIn an empty
apply link is normal, so no apply-related flag fires — the no_real_apply flag only fires
on ATS boards where a real apply path is expected.) Turn drop_ghost on and a row like
this is removed before it reaches you — and not charged.
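If you leave drop_ghost off and prefer to inspect suspects yourself, the flat ghost_job_score makes a client-side split straightforward. This is a minimal sketch under assumptions: `splitGhosts` is a hypothetical helper, the 0.7 default mirrors the ghost_threshold value in the sample input, and whether the engine compares with `>=` or `>` is not stated in this README.

```typescript
// Only the flat helper field is needed for the split.
interface GhostScored {
  ghost_job_score: number | null;
}

// Hypothetical client-side split (not the engine's implementation): rows at or
// above `threshold` go to `ghosts`, everything else (including rows with no
// score) stays in `kept`. The `>=` comparison is an assumption.
function splitGhosts<T extends GhostScored>(
  rows: T[],
  threshold = 0.7,
): { kept: T[]; ghosts: T[] } {
  const kept: T[] = [];
  const ghosts: T[] = [];
  for (const row of rows) {
    const score = row.ghost_job_score;
    if (score !== null && score >= threshold) ghosts.push(row);
    else kept.push(row);
  }
  return { kept, ghosts };
}

// Scores taken from the two sample records above.
const ghostSampleRows = [
  { title: "AI Engineer", ghost_job_score: 0.1 },
  { title: "Marketing Manager", ghost_job_score: 0.78 },
];

const ghostSplit = splitGhosts(ghostSampleRows);
console.log(ghostSplit.kept.map((r) => r.title));   // [ 'AI Engineer' ]
console.log(ghostSplit.ghosts.map((r) => r.title)); // [ 'Marketing Manager' ]
```

Note that rows filtered this way have already been returned and therefore charged; only drop_ghost removes them before billing.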
Why this beats LinkedIn-only and AI scrapers
Adapting to a layout and scraping a lot of fields is table-stakes now — this does both. But scraping is not the same as being right. A scraper can return a job and still hand you one that is duplicated three times, already filled, reposted for months, or quietly mis-parsed — and say nothing. That silent bad row is the one that costs you, because you act on it. The difference here is the guarantee, not the scraping: every field carries its own confidence, every record carries a ghost-job score, duplicates are collapsed across sources, and a cheap enrichment runs on every record — not just a sampled few.
Two things others quietly skip:
- They return rich fields but zero confidence, and their high success rate needs your cookies. This runs cookieless and still attaches a confidence to every field, so you can tell a solid row from a shaky one without logging anything in.
- Their AI enrichment is shallow and expensive because it is API-bound. Ours runs on cheap local inference, so the quality / dedup / normalize / hiring-intent layers run on every record by default, not as a costly add-on on a handful.
Why it is different
- No silent errors. Every field carries a confidence, the whole record carries a reliable flag. Below the bar a field is returned null with status: "absent" and reliable: false, never a confident-looking wrong value.
- Ghost-job filtering and cross-source dedup. Each record gets a ghost_job_score with the flags behind it (reposted, evergreen, vague_jd, staffing_agency, perpetual_req, and no_real_apply on ATS), so stale and fake openings are visible, or dropped with drop_ghost. The same opening seen on several sources is collapsed into one record, the rest listed under also_at. You load real, distinct openings, not reposts and duplicates.
- Cookieless reach / anti-bot. It reaches LinkedIn, Indeed and the major ATS boards without cookies, and gets through heavy protections (Cloudflare, DataDome and similar) that return a challenge page to a plain fetch. (Anti-bot is an arms race, so this is a capability, not a guarantee against any one named vendor.)
- Self-healing: a mechanism that serves the correctness above, not the headline. When a board changes its markup, the engine re-finds fields by meaning instead of silently breaking on a selector.
- Hiring-intent signals. Enrichment normalises each role and adds hiring-intent signals (buying_signal_score, company_signals), so the data is usable for prospecting, labor analytics and recruiting research, not just a flat list of postings.
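For prospecting, the hiring-intent layer can be reduced to one score per company. The sketch below shows one way to do that on the consumer side; `rankByBuyingSignal` is a hypothetical helper, "ExampleCo" and "NoEnrichCo" are made-up rows, and only the path enrichment.hiring_intent.buying_signal_score.value and the GitLab score are taken from the sample record.

```typescript
// Minimal slice of a dataset row; everything under `enrichment` is optional
// because enrichment is only present when `enrich` is on.
interface IntentRow {
  company_name: string;
  enrichment?: {
    hiring_intent?: { buying_signal_score?: { value: number | null } };
  };
}

// Hypothetical helper: keep the highest buying_signal_score seen per company,
// skip rows without a score, and sort companies from strongest to weakest.
function rankByBuyingSignal(rows: IntentRow[]): { company: string; score: number }[] {
  const best = new Map<string, number>();
  for (const row of rows) {
    const score = row.enrichment?.hiring_intent?.buying_signal_score?.value;
    if (score === null || score === undefined) continue;
    const prev = best.get(row.company_name);
    if (prev === undefined || score > prev) best.set(row.company_name, score);
  }
  return Array.from(best, ([company, score]) => ({ company, score }))
    .sort((a, b) => b.score - a.score);
}

const intentRows: IntentRow[] = [
  // Score from the GitLab sample record.
  { company_name: "GitLab", enrichment: { hiring_intent: { buying_signal_score: { value: 0.11 } } } },
  // Made-up rows for illustration only.
  { company_name: "ExampleCo", enrichment: { hiring_intent: { buying_signal_score: { value: 0.4 } } } },
  { company_name: "ExampleCo", enrichment: { hiring_intent: { buying_signal_score: { value: 0.62 } } } },
  { company_name: "NoEnrichCo" },
];

const rankedCompanies = rankByBuyingSignal(intentRows);
console.log(rankedCompanies);
```

Taking the maximum per company is one design choice among several; averaging, or weighting by the field's own score, would be equally defensible depending on the use case.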
Supported sources
Live-verified sources only — this list is what is actually tested, not a wishlist:
- Greenhouse
- Ashby
- Lever
- SmartRecruiters
- RemoteOK
- Indeed
ATS sources (Greenhouse / Ashby / Lever / SmartRecruiters) take a company slug;
LinkedIn, Indeed and RemoteOK take keywords (+ location). A concrete source is
required. Indeed sits behind Cloudflare — it is reached through the same anti-bot stack,
cookieless. Indeed pay is usually an estimate, so it comes back as
salary.source: "inferred".
Input
{
  "source": "greenhouse",          // required: linkedin | indeed | greenhouse | lever | ashby | smartrecruiters | remoteok
  "company": "gitlab",             // ATS slug, for greenhouse/lever/ashby/smartrecruiters
  "keywords": "backend engineer",  // for LinkedIn / Indeed / RemoteOK search
  "location": "Berlin",
  "title_include": [], "title_exclude": [],
  "employment_type": [], "workplace_type": [],
  "country": [], "language": [],   // arrays, e.g. ["DE"], ["en"]
  "posted_within_days": 30,
  "drop_expired": true, "drop_ghost": false,
  "enrich": true,
  "enrich_layers": ["quality", "dedup", "normalize", "hiring_intent"],
  "dedup_across_sources": true,
  "ghost_threshold": 0.7,
  "start": 0, "limit": 25, "max_results": 100, "fetch_all": false,
  "include_description": true      // engine fetches full JD HTML/text (on by default; LinkedIn fetch is per-posting)
}
What a source needs depends on the source: ATS boards (Greenhouse / Lever / Ashby /
SmartRecruiters) require company; LinkedIn and Indeed require keywords (or
location); RemoteOK needs neither (it lists the feed). If company is missing for an
ATS source, the actor fails loud early instead of forwarding a request
the engine would reject.
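The per-source requirements can be checked before a run is even started. The following is a sketch of such a pre-flight check, assuming the rules exactly as stated in the paragraph above; `ScraperInput`, `ATS_SOURCES` and `validateInput` are hypothetical names, and the actor's real validation and error messages may differ.

```typescript
type Source =
  | "linkedin" | "indeed" | "greenhouse" | "lever"
  | "ashby" | "smartrecruiters" | "remoteok";

// Only the input keys relevant to the check; names follow the sample input.
interface ScraperInput {
  source: Source;
  company?: string;
  keywords?: string;
  location?: string;
}

const ATS_SOURCES: Source[] = ["greenhouse", "lever", "ashby", "smartrecruiters"];

// Hypothetical pre-flight check: returns a list of problems, empty when the
// input satisfies the per-source requirements described above.
function validateInput(input: ScraperInput): string[] {
  const errors: string[] = [];
  if (ATS_SOURCES.indexOf(input.source) !== -1) {
    if (!input.company) errors.push(`${input.source} requires "company"`);
  } else if (input.source === "linkedin" || input.source === "indeed") {
    if (!input.keywords && !input.location) {
      errors.push(`${input.source} requires "keywords" or "location"`);
    }
  }
  // remoteok needs neither: it lists the feed.
  return errors;
}

console.log(validateInput({ source: "greenhouse" }));                    // one error
console.log(validateInput({ source: "greenhouse", company: "gitlab" })); // []
console.log(validateInput({ source: "remoteok" }));                      // []
```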
Filters and enrichment all run on the engine; the actor just forwards them. max_results
caps how many jobs come back, and since you pay per returned job, it caps spend.
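Because billing is per returned job, the worst-case spend of a run follows directly from max_results. A small worked sketch, assuming the listed price of $2.50 per 1,000 job results (the Apify Console pricing tab is authoritative; `maxSpendUsd` is a hypothetical helper):

```typescript
// Listed price at the time of writing; treat as an assumption and check the
// Apify Console pricing tab for the current value.
const PRICE_PER_1000_USD = 2.5;

// Upper bound on spend for one run: every returned job is charged once, and
// max_results caps how many jobs can be returned.
function maxSpendUsd(maxResults: number, pricePer1000 = PRICE_PER_1000_USD): number {
  return (maxResults * pricePer1000) / 1000;
}

console.log(maxSpendUsd(100));  // 0.25
console.log(maxSpendUsd(1000)); // 2.5
```

The actual charge is usually lower, since expired, ghost-dropped and deduplicated jobs are not returned and not charged.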
Output
One dataset row per job, in the engine's snake_case JobPosting schema. Every tracked
field is always present — a missing one comes back with status: "absent" and a null
value, never silently dropped. Highlights:
- Core fields: title, company ({name, url}), location ({raw, city, region, country, workplace_type}), employment_type (FULL_TIME / PART_TIME / CONTRACT / INTERN / TEMP), date_posted, salary ({min, max, currency, period, source}; source is explicit / inferred / absent, never guessed), seniority, skills, canonical_url, apply_url.
- Trust layer: reliable (bool), overall_confidence (0–1), and fields, a map where each tracked field carries { value, status, score, source }, with status ∈ confirmed | high | low | absent.
- Dedup: top-level also_at[] lists the same opening on other portals ([] if unique).
- Enrichment (present when enrich is on) under enrichment, each field wrapped as { value, status, score, source: "inferred", model }: quality (ghost_job_score, flags, is_real), dedup, normalized, hiring_intent.
- Flat helpers the actor adds for the dataset table: company_name, location_raw, and ghost_job_score (lifted from enrichment.quality.ghost_job_score.value).
success says the engine produced a job record; reliable says that record cleared the
trust threshold. They diverge when a job is extracted but the engine is not confident —
then success is true, reliable is false, and the per-field scores stay low, so a
hedge never reads as a clean hit. Each row is the engine's record passed through
verbatim — the actor never drops or rewrites a field. The full job description
(description_html / description_text) is included; include_description (on by
default) controls whether the engine fetches it for sources fetched per-posting (e.g.
LinkedIn). Descriptions can be large.
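The success / reliable distinction maps naturally onto three buckets when loading rows. A minimal sketch, assuming a row with success: false can appear in the dataset (the README does not show one); `ResultRow`, `Bucket` and `triage` are hypothetical names.

```typescript
// Only the three status fields are needed to route a row.
interface ResultRow {
  success: boolean;
  reliable: boolean;
  error: string | null;
}

type Bucket = "trusted" | "hedged" | "failed";

// Hypothetical routing: a record that was not produced is "failed"; a produced
// record is "trusted" only when it also cleared the trust threshold.
function triage(row: ResultRow): Bucket {
  if (!row.success) return "failed";
  return row.reliable ? "trusted" : "hedged";
}

// First two rows mirror the sample records (a) and (b); the third is a made-up
// failure used only to exercise the last branch.
const triageRows: ResultRow[] = [
  { success: true, reliable: true, error: null },
  { success: true, reliable: false, error: null },
  { success: false, reliable: false, error: "example failure" },
];

console.log(triageRows.map(triage)); // [ 'trusted', 'hedged', 'failed' ]
```

Hedged rows are still worth keeping: their per-field scores show which parts of the record are shaky.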
Pricing
This actor uses Pay Per Event, with a single event:
- job-result: one flat fee per job returned.
You are charged once per job the engine returns — whether it comes back reliable: true
or, honestly, reliable: false, because an honest fail is still a result you can act on
(you learn the field is shaky instead of trusting a guess). Jobs the engine drops before
returning — expired, ghost (with drop_ghost), or duplicates collapsed across sources —
are not returned and not charged. A run that fails before returning any results
is not charged. The current price of the event is in the Apify Console pricing tab.
No flat monthly fee, no per-seat pricing. One predictable price per validated job.
A note on data
This actor returns job and company data only — title, company, location, salary, employment terms, and signals derived from the posting. It does not collect or return personal data of applicants or named recruiters.
About
Built on the data.hilgard.cz engine — the same self-healing stack that does cookieless anti-bot fetching, extraction by meaning, and independent verification, here applied to jobs with ghost-job scoring and cross-source dedup on top.
By Jan Hilgard — founder of Hosting90 (built 2002, exited 2020), contributor to vllm-mlx. The precision-first stance is deliberate: I would rather return an honest "not sure" than a confident wrong row you build a pipeline on.
Development
npm install
npm run build      # tsc → dist/
npm run start:dev  # tsx src/main.ts, reads .actor/INPUT.json