Overview

I have the following structure in a system, where the user can save attributes to jobs, which he can name on his own. One attribute can have 2 possible values, a user value and a system value. I am trying to filter the jobs based on the attributes, where either the user value or the system value can hit.

Current situation

table jobs

id	company_id	status
1	2	active
2	2	created
3	3	created
4	12	inactive

table job_data

job_id	field	value_user	value_xml
1	city	Berlin
1	phone	1234567
1	type	fulltime
2	city	New York
2	phone	33333333
2	type	parttime
3	city		Berlin
3	phone		123
3	type		fulltime

Indexes:

Table	Key_name	Column_name	Collation
jobs	PRIMARY	id	A
job_data	job_data_jobs_id_foreign	job_id	A

The user now has to filter for multiple attributes, for example: "show me all jobs with city Berlin and phone 123". This would show job 1 and 3, since they both have Berlin in one of the city-values and a phone number like 123 in one of the phone values.

I had several working solutions, but now we are having 120,000 active jobs in the database, having over a million attributes and our code is not fast enough. This is our current solution:

SELECT * FROM jobs
WHERE
(
    SELECT count(*) FROM job_data
    WHERE job_data.job_id = jobs.id
    AND job_data.field = "city" AND (job_data.value_xml LIKE "%Martinhaven%" OR job_data.value_user LIKE "%Martinhaven%")
) > 0
AND
(
    SELECT count(*) FROM job_data
    WHERE job_data.jobposting_id = jobs.id
    AND job_data.field = "category" AND (job_data.value_xml LIKE "%omnis%" OR job_data.value_user LIKE "%omnis%")
) > 0;

output:

id	company_id	status
1	2	active
3	3	created

this has been simplified, we have around 6 filter possibilities, but they are all of the same kind, so I only posted 2 of them.

Question

Can anyone tell me, how I can do this a lot faster? We currently need 5-10 seconds for one select. Ofc we could restructure the database to make it faster to filter, but we are still trying to prevent that, since it is a bigger system in construction.

Thanks a lot in advance.

CodePudding user response：

Instead of using 1 correlated subquery that aggregates for each condition and for each row of the table jobs, use conditional aggregation once:

SELECT job_id 
FROM job_data
GROUP BY job_id
HAVING SUM(field = 'city' AND (value_xml LIKE '%Martinhaven%' OR value_user LIKE '%Martinhaven%')) > 0
   AND SUM(field = 'category' AND (value_xml LIKE '%omnis%' OR value_user LIKE '%omnis%')) > 0
   AND .....

to get all the job_ids that you want and use them with the operator IN:

SELECT * FROM jobs
WHERE job_id IN (
  SELECT job_id 
  FROM job_data
  GROUP BY job_id
  HAVING SUM(field = 'city' AND (value_xml LIKE '%Martinhaven%' OR value_user 
  LIKE '%Martinhaven%')) > 0
     AND SUM(field = 'category' AND (value_xml LIKE '%omnis%' OR value_user 
  LIKE '%omnis%')) > 0
     AND .....
);

If the number of job_ids returned by the subquery is small I believe that the performance of this query will be better.

All the > 0 inequalities can be removed from the code, but I left them there for clarity.

CodePudding user response：

You have two problems.

Multiple value fields.
like '%word%' is tricky to index.

tl;dr demonstration

Fix the schema

Change job_data to a single value and a source.

job_id  field  value   source
1       city   Berlin  user
3       city   Berlin  xml

-- Add value and source columns, nullable for now.
alter table job_data add value varchar(255), add source varchar(255);

-- If value_user is not null or blank, add it to value with a source of user.
update job_data
set value = value_user, source = 'user'
where coalesce(value_user, '') <> '';

-- Same for value_xml, source of xml.
update job_data
set value = value_xml, source = 'xml'
where value_xml is not null;

-- Drop the old value columns.
alter table job_data
  drop value_xml, drop value_user;

Now we can enforce not null on value and source to avoid bad data.

alter table job_data
  modify value varchar(255) not null,
  modify source varchar(255) not null;

We can also enforce that a job can only have one value per field. Or per field and source.

-- If a job can have multiple sources for a field.
alter table add unique(job_id, field, source);

-- If it cannot.
alter table add unique(job_id, field);

And if you need to maintain compatibility, make a view.

create view old_job_data as
  select
    job_id,
    field,
    case source when 'user' then value end as value_user,
    case source when 'xml' then value end as value_xml
  from job_data;

This fixes a number of problems.

You can have as many sources as you like.
You can ensure the value is not null.
You can avoid duplicates.
You don't need to check the value in two places.
Indexing is easier.

The query

With that fixed, the query is much simpler.

To find which jobs match both field/value pairs, normally you'd do an intersect. But MySQL does not have intersect, so we emulate it with a self-join.

select * from jobs
where job_id in (
  select distinct j1.job_id
  from job_data j1
  inner join job_data j2 on j1.job_id = j2.job_id
  where
    (j1.field = 'city' and j1.value like '%York%')
    and
    (j2.field = 'type' and j2.value like '%time%')

)

You don't have to change the table, you can do j1.field = 'city' and (j1.value_xml like '%York%' or j1.value_user like '%York%'), but it makes it so much easier.

Indexing

In MySQL, like '%word%' will not use a simple index. It has to scan each row with a matching field. Simple indexes only work for like 'word%'. Other databases have special indexes for this. I don't know of a solution for MySQL.

At minimum, index field so you can at least find the rows matching field quickly.

create index job_data_field_idx on job_data(field);

Now instead of scanning every row, it will only scan the rows which match the field.

Decide if you really need full wildcard searches. Or do you need to clean up your data before inserting it? If you can get rid of the wildcard indexes, this index will make the search nearly instant.

create index job_data_field_vaue_idx on job_data(value, field);

CodePudding user response：

There's not a lot to that will help when using an unsargable criteria with like '%...', however you are needlessly burning CPU and IO by doing a count(*) for each searched criteria.

It's possible the optimizer could spot this and optimize it internally as an exists operation, but it's always preferable to be explicit

select * 
from jobs j
where exists (
    select 1 job_data d
    where d.job_id = j.id
    and d.field = "city" and (d.value_xml like "%Martinhaven%" or d.value_user like "%Martinhaven%")
)

With an entity-attribute model it would be more efficient if all the values were in the same column. That might be too much of a change to be implemented initially so you should also try multiple criteria to check each column separately rather than using or. This would be most beneficial with an index for each (on field, Value_xml and field, value_user).

select * 
from jobs j
where exists (select 1 job_data d  where d.job_id = j.id and d.field = "city" and d.value_xml like "%Martinhaven%") 
union
select * 
from jobs j
where exists (select 1 job_data d  where d.job_id = j.id and d.field = "city" and d.value_user like "%Martinhaven%")

You might find this yields better performance since each could use an index to seek to just the city rows and then scan just until there is a match; at best it can exist almost immediately, at worst it's no worse than the existing table scan.

CodePudding user response：

Partial solution:

Instead of

( SELECT COUNT(*) ... ) > 0 )

write

EXISTS ( SELECT 1 ... )

The latter will stop scanning when the first occurrence is found. (The former has to look at the entire table.)

The real villains are

OR -- which does not optimize well
LIKE '%...' -- the 'leading wildcard' prevents use of any index.

Please provide SHOW CREATE TABLE so we can see what indexes exist.