Implement two qubit unitary to operations algorithm from https://arxiv.org/abs/quant-ph/0406176 #6777

Open
NoureldinYosri opened this issue Oct 17, 2024 · 27 comments · May be fixed by #7390
Labels
good for learning For beginners in QC, this will help picking up some knowledge. Bit harder than "good first issues" kind/feature-request Describes new functionality triage/accepted A consensus emerged that this bug report, feature request, or other action should be worked on

Comments

@NoureldinYosri
Collaborator

The paper https://arxiv.org/abs/quant-ph/0406176 introduces an algorithm for performing Quantum Shannon Decomposition (QSD). We have this algorithm implemented in Cirq in

However, one of the base cases of the algorithm is not implemented: namely A.2, the base case for a two-qubit unitary. Instead, Cirq uses another decomposition:

if n == 4:
    operations = tuple(
        two_qubit_matrix_to_cz_operations(
            qubits[0], qubits[1], u, allow_partial_czs=True, clean_operations=True, atol=atol
        )
    )
    yield from operations
    i, j = np.unravel_index(np.argmax(np.abs(u)), u.shape)
    new_unitary = unitary_protocol.unitary(FrozenCircuit.from_moments(*operations))
    global_phase = np.angle(u[i, j]) - np.angle(new_unitary[i, j])
    if np.abs(global_phase) > 1e-9:
        yield ops.global_phase_operation(np.exp(1j * global_phase))
    return

This is not as efficient as the decomposition described in A.2. It would be nice to implement A.2 and use it in QSD.
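As a side note, the global-phase fix-up at the end of that snippet can be viewed in isolation as a small numpy routine (a sketch; `global_phase_between` is a hypothetical helper name, not a Cirq API):

```python
import numpy as np

def global_phase_between(u: np.ndarray, v: np.ndarray) -> float:
    """Return phi such that u ~ exp(1j * phi) * v, assuming u and v
    agree up to a global phase (as the decomposed circuit's unitary does)."""
    # Pick the largest-magnitude entry of u so the phase estimate is
    # numerically stable (a near-zero entry would give a noisy angle).
    i, j = np.unravel_index(np.argmax(np.abs(u)), u.shape)
    return float(np.angle(u[i, j]) - np.angle(v[i, j]))

# Example: u differs from v by a global phase of pi/3.
v = np.eye(4, dtype=complex)
u = np.exp(1j * np.pi / 3) * v
phi = global_phase_between(u, v)
assert np.allclose(u, np.exp(1j * phi) * v)
```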

What is the urgency from your perspective for this issue? Is it blocking important work?

P2 - we should do it in the next couple of quarters

@NoureldinYosri NoureldinYosri added kind/feature-request Describes new functionality good for learning For beginners in QC, this will help picking up some knowledge. Bit harder than "good first issues" triage/discuss Needs decision / discussion, bring these up during Cirq Cynque labels Oct 17, 2024
@RahilJain1366

Hi @NoureldinYosri, can I take this up too since I am trying to find a solution for #6770?

@senecameeks senecameeks added triage/accepted A consensus emerged that this bug report, feature request, or other action should be worked on and removed triage/discuss Needs decision / discussion, bring these up during Cirq Cynque labels Oct 30, 2024
@senecameeks
Collaborator

Cirq Cynque: This will improve performance by reducing the number of rotations and the depth of the circuit.

@senecameeks
Collaborator

@RahilJain1366, thanks for offering, let's discuss after #6770 is complete

@ldi18

ldi18 commented Nov 10, 2024

Hi, I’m also interested in working on this if @RahilJain1366 hasn’t taken it up yet @senecameeks .

@codrut3
Contributor

codrut3 commented May 18, 2025

I implemented the A.2 case. However, when comparing unoptimized circuit depth for the current Cirq package with my change, I didn't observe any improvement:

qubits = [3, 4, 5, 6, 7, 8]
cirq_package = [45, 219, 963, 4035, 16515, 66819]
with_my_change = [47, 221, 965, 4037, 16517, 66821]

To obtain these numbers I generated a random unitary for the given number of qubits and then ran Shannon decomposition.

Note that these numbers are the same as those here.

The reason for this is that the 3-qubit case already uses the A.2 optimization. The crux of A.2 was already implemented by @balopat in this method. Removing the 3-qubit base case and using A.2 for n=2 doesn't help.

However, when running the optimizer:
cirq.optimize_for_target_gateset(circuit, gateset=cirq.CZTargetGateset(preserve_moment_structure=False), max_num_passes=None)
I observed a substantial improvement in the current package (without A.2 base case):

qubits = [3, 4, 5, 6, 7, 8]
cirq_package = [41, 203, 899, 3779, 15491, 62723]

which is better than what was reported here.
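Incidentally, the transpiled depths above fit a clean recurrence, depth(n) = 4·depth(n−1) + 3·2^n − 9, i.e. the closed form (31/32)·4^n − 3·2^n + 3. This is my own curve-fit of the reported numbers, not a formula from the paper:

```python
# Reported transpiled depths per qubit count (from the numbers above).
depths = {3: 41, 4: 203, 5: 899, 6: 3779, 7: 15491, 8: 62723}

def predicted_depth(n: int) -> int:
    # Closed form of depth(n) = 4*depth(n-1) + 3*2**n - 9 with depth(3) = 41.
    return (31 * 4**n) // 32 - 3 * 2**n + 3

for n, d in depths.items():
    assert d == predicted_depth(n), (n, d, predicted_depth(n))
print("all reported depths match the closed form")
```

The dominant (31/32)·4^n term is consistent with QSD's exponential gate count.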

Furthermore, I discovered that quantum Shannon decomposition is very slow. I traced this issue back to cleanup_operations in two_qubit_to_cz.py. merge_single_qubit_gates_to_phased_x_and_z is very slow, even for a small set of gates (size 7). I was able to optimize this substantially, reducing the time by almost half.

qubits = [3, 4, 5, 6, 7, 8]
time_sec_no_optimization = [0.05, 0.18, 0.76, 2.81, 11.22, 42.79]
time_sec_with_optimization = [0.04, 0.1, 0.39, 1.31, 5.43, 21.8]

You can see my code here.

To summarize:

  • there is no need for the A.2 case as it's already part of the 3-qubit decomposition. I suggest closing this issue;
  • the running time can be reduced substantially by optimizing cleanup_operations. I am happy to submit my fix here. I suggest spending time to also fix merge_single_qubit_gates_to_phased_x_and_z, but this can be a different issue.

@NoureldinYosri @ACE07-Sev can you please have a look and let me know your thoughts?
@ACE07-Sev can you rerun your analysis and confirm the depth is better at the current package version?

@ACE07-Sev
Contributor

I implemented A.2 a few months ago for quick and it definitely improves depth.

I'll have a look soon but in the meantime kindly check yours with mine.

@codrut3
Contributor

codrut3 commented May 18, 2025

@ACE07-Sev as I explain in the comment A.2 is already implemented here, but it's not obvious, because it's part of the n=3 base case. I ran my local analysis and didn't see any improvement from removing the n=3 base case and using n=2 and A.2. However, running the circuit through the optimizer substantially reduces depth: the issue is solved because the optimizer got better and is similar to the Qiskit one now.

If you have a colab or script to share, I'm happy to run it.

@ACE07-Sev
Contributor

ACE07-Sev commented May 18, 2025

@codrut3 I'm comparing with mine now. Can you run yours with transpile afterwards please? I want to compare the depths with mine.

Here are my results:

41
197
869
3653
14981
60677

This is the depth of a circuit with U3 and CX gates as done before (I don't have my old script anymore, but that one was the same).

@codrut3
Contributor

codrut3 commented May 18, 2025

Are these Qiskit numbers? The Cirq numbers are:

qubits = [3, 4, 5, 6, 7, 8]
cirq_package = [41, 203, 899, 3779, 15491, 62723]

Would you consider this to be close enough?

@ACE07-Sev
Contributor

Yes, those are the qiskit numbers (after transpile). Did you run transpile on yours as well?

@codrut3
Contributor

codrut3 commented May 18, 2025

Yes, this is after the transpiler (I call it optimizer :)). Basically, I think the transpiler does the heavy work and so further improvements would be realized by improving the transpiler.

@ACE07-Sev
Contributor

ACE07-Sev commented May 18, 2025

I see. I mean, for small number of qubits sure, but when you go above 10 it's gonna make a big difference. Especially with practical use-cases given the deeper the circuit is the noisier it's going to get.

With qubits as large as 20-30, it'll make a big difference down the line. QSD is already exponential, so we really can use any trimming we can get.

@ACE07-Sev
Contributor

I'll put this on my TODO list to properly go through the codebase ASAP. Right now my laptop is kaput and freezes every few minutes so can't properly go through it.

@codrut3
Contributor

codrut3 commented May 18, 2025

> I see. I mean, for small number of qubits sure, but when you go above 10 it's gonna make a big difference. Especially with practical use-cases given the deeper the circuit is the noisier it's going to get.

Maybe. But A.2 doesn't help here, because it's already part of the 3-qubit base case. I did a sanity check that the number of CZ gates decreases in my implementation, so I did manage to apply A.2 lol. It's just that the big difference is made by the transpiler. I'm not sure what to do there.
For now I can't even run 9 on my old laptop :) I am proposing a running time optimization, I think this would be more important.

@ACE07-Sev
Contributor

ACE07-Sev commented May 18, 2025

I'll run it without transpile right now. You do the same please. We can then compare. If they match then LGTM, but otherwise we could really use that small improvement.

P.S.: I get you on the laptop being slow hehe. God knows I have had code that ran for a week on my laptop. I would definitely try either using multi-threading to make the loops for each qubit run in parallel (you'd have to use 3.14 with nogil to do this properly) or getting a CPU with really fast single-core speed. That works wonders.

I'm not sure if you're a community member like moi or you work at cirq, but if the latter, definitely try to run the code on sth like i7 14700k or above (the office should have a CPU like that). It can cut the runtime in 4 usually and make it easier to work on.

@ACE07-Sev
Contributor

ACE07-Sev commented May 18, 2025

By the way, before you start optimizing, please run a profiler to see what is taking the most amount of time. cProfile is great for that (let me know if you'd like some assistance with this, I love profiling). You can even make a call graph png with gprof2dot to visualize it and make it easier to read (it also makes it easier to catch redundancies based on the number of calls to a function/method).

Also, keep in mind python is not the best language for runtime speed. Hence why some packages have been porting the heavy-duty stuff to rust and only use python as the interface.

@codrut3
Contributor

codrut3 commented May 18, 2025

> I'll run it without transpile right now. You do the same please. We can then compare. If they match then LGTM, but otherwise we could really use that small improvement.

My numbers are:

qubits = [3, 4, 5, 6, 7, 8]
cirq_package = [45, 219, 963, 4035, 16515, 66819]
with_my_change = [47, 221, 965, 4037, 16517, 66821]

> P.S.: I get you on the laptop being slow hehe. God knows I have had code that ran for a week on my laptop. I would definitely try either using multi-threading to make the loops for each qubit run in parallel (you'd have to use 3.14 with nogil to do this properly) or getting a CPU with really fast single-core speed. That works wonders.

Thank you! Using multi-threading: doesn't this require writing the Shannon decomposition to be multi-threaded?

> I'm not sure if you're a community member like moi or you work at cirq, but if the latter, definitely try to run the code on sth like i7 14700k or above (the office should have a CPU like that). It can cut the runtime in 4 usually and make it easier to work on.

I work in a different part of Google. I contribute to cirq in my free time on my personal laptop. For now I'll try to stick to this :)

> By the way, before you start optimizing, please run a profiler to see what is taking the most amount of time. cProfile is great for that (let me know if you'd like some assistance with this, I love profiling). You can even make a call graph png with gprof2dot to visualize it and make it easier to read (it also makes it easier to catch redundancies based on the number of calls to a function/method).
>
> Also, keep in mind python is not the best language for runtime speed. Hence why some packages have been porting the heavy-duty stuff to rust and only use python as the interface.

I used a profiler and it did help to narrow down but I'm still not sure of the root cause. I wrote an optimization for cleanup_operations, but it would be better to optimize merge_single_qubit_gates_to_phased_x_and_z. I'll try to use gprof2dot, thanks for the tip!
I know python goes only so far, but there is definitely some runtime that can be squeezed here :)

@ACE07-Sev
Contributor

Should be fine then. I checked the older messages and I had said 70k there without transpile so I guess it could be transpile bringing it to 60k.

I'll have to properly check this to make sure. I think one way for you to be sure (I tried, but my laptop freezes so I can't run it myself; looking to get new hardware soon to not be so useless like now) is to convert the cirq circuit you're getting before optimization to qiskit via qasm, or via my package if you prefer that, then call transpile on both and compare. That way if they don't match then you know for a fact that the decompositions are not the same. That should remove the effect of transpile from the comparison.

@ACE07-Sev
Contributor

No no what I meant by multiprocessing is to just use it for when you were generating the circuit depths for a list of qubit sizes. Basically since each iteration is independent then you can run them in parallel to save time.

Like instead of generating circuit for 1 qubit then 2 qubit then 3 qubit and so on you would instead generate all of them in parallel since they are independent tasks. That way it would take as long as the longest iteration. It would save considerable time.

I wouldn't recommend making cirq itself use multiprocessing given it's a very user-specific thing (some have very few cores or very slow clock speed so for them it would be slower than no multiprocessing due to the overhead of the communication between the threads).
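The fan-out described above could be sketched like this (`benchmark_depth` is a hypothetical stand-in for the real decompose-and-measure step; for CPU-bound work, a ProcessPoolExecutor or a free-threaded Python build is what actually buys parallelism):

```python
from concurrent.futures import ThreadPoolExecutor

def benchmark_depth(n: int) -> int:
    # Hypothetical stand-in: the real body would generate a random 2**n
    # unitary, run the Shannon decomposition, and return the circuit depth.
    return 4 ** n

sizes = [3, 4, 5, 6, 7, 8]
# Each size is an independent task, so they can all run concurrently;
# swap in ProcessPoolExecutor (or a no-GIL build) for CPU-bound speedup.
with ThreadPoolExecutor() as pool:
    results = dict(zip(sizes, pool.map(benchmark_depth, sizes)))
print(results)
```

The total wall time then approaches that of the longest single iteration rather than the sum of all of them.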

@ACE07-Sev
Contributor

ACE07-Sev commented May 18, 2025

Oh yeah gprof2dot is amazing. It color codes the call graph from green to red where red is sth that takes the most time. It shows exactly how many times sth is being called and how expensive it is.

It's very useful for packages this big since you can visually just see what's red and then trail it from there and see what is in your control (I.e. you can't optimize performance of your dependencies) and optimize that.

To save time, just make a constant function and put the synthesis or whatever you want to profile there. Then call .run with cProfiler and pass the function name (iirc it's just the name as a string) and sort with tottime (also a string) and save it as a file. This makes a .prof file.

Then you can use gprof2dot to generate a png from this .prof file. I closed the laptop and went to bed, I am really sorry I didn't write proper snippets and exact command. But that's the overall steps.
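For reference, the workflow described above is roughly the following (a sketch; `workload` stands in for the synthesis code being profiled):

```python
import cProfile
import pstats

def workload() -> int:
    # Stand-in for the code under investigation (e.g. the Shannon synthesis).
    return sum(i * i for i in range(100_000))

# Profile the call and dump the raw stats to a .prof file.
profiler = cProfile.Profile()
result = profiler.runcall(workload)
profiler.dump_stats("profile.prof")

# Print the five most expensive functions by total time.
pstats.Stats("profile.prof").sort_stats("tottime").print_stats(5)

# Then render a call graph (gprof2dot and graphviz installed separately):
#   gprof2dot -f pstats profile.prof | dot -Tpng -o profile.png
```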

@NoureldinYosri
Collaborator Author

thanks @codrut3 and @ACE07-Sev for the analysis and discussion ... looking forward to your final verdict

@codrut3 your optimized cleanup_operations will definitely be a great addition to Cirq; also, what is the bottleneck in merge_single_qubit_gates_to_phased_x_and_z?

@ACE07-Sev and @codrut3 I don't know if you are aware, but we have a biweekly cirq-sync meeting. If you want to attend, you can join the cirq-dev group https://groups.google.com/g/cirq-dev, which will send you an invite to the meeting.

@codrut3
Contributor

codrut3 commented May 19, 2025

Thanks @NoureldinYosri ! I'll come to the next Cirq Sync to discuss this. @ACE07-Sev would you be able to attend too? It's on Wednesday 28th May.

@ACE07-Sev
Contributor

So so sorry for the delay. I had some court issues and forgot about this. Sure, I'll try to be there.

@ACE07-Sev
Contributor

@codrut3 Did you get a chance to run the experiment I asked? That's the main thing I would like to see to finally provide a meaningful comparison. Would be great to have that before the biweekly meeting on 28th.

@ACE07-Sev
Contributor

@NoureldinYosri You are too kind. I have been a bit useless with my laptop being glitchy. I have put this issue on my TODO if it's still open by the time I get a new hardware.

@codrut3
Contributor

codrut3 commented May 25, 2025

> I'll have to properly check this to make sure. I think one way for you to be sure (I tried, but my laptop freezes so I can't run it myself; looking to get new hardware soon to not be so useless like now) is to convert the cirq circuit you're getting before optimization to qiskit via qasm, or via my package if you prefer that, then call transpile on both and compare. That way if they don't match then you know for a fact that the decompositions are not the same. That should remove the effect of transpile from the comparison.

Thank you for the suggestion! I did this. More precisely, I ran the Cirq Shannon decomposition, converted to qasm and then ran Qiskit transpiler with basis gates u3 and cz. Then I ran Cirq transpiler with the same target gateset.

Both transpilers give the same depth!

For the current Cirq package:

qubits = [3, 4, 5, 6, 7, 8]
cirq_depth_after_transpiler = [41, 203, 899, 3779, 15491, 62723]
qiskit_depth_after_transpiler = [41, 203, 899, 3779, 15491, 62723]

With my change that implements A.2:

qubits = [3, 4, 5, 6, 7, 8]
cirq_depth_after_transpiler = [43, 205, 901, 3781, 15493, 62725]
qiskit_depth_after_transpiler = [43, 205, 901, 3781, 15492, 62724]

Here the Qiskit depths are smaller by 1 for n = 7 and n = 8.

One thing to note is that the Cirq transpiler creates more u3 gates (but this doesn't change the depth). The cz gate numbers after the transpiler match.

Do I interpret this correctly that it's not the transpiler, but the Cirq Shannon decomposition that still has some inefficiency that causes higher depth? But it can't be A.2, because I implemented it and it doesn't change things.

@codrut3
Contributor

codrut3 commented May 27, 2025

I found the problem :) I had to implement A.1 too. After doing this, the circuit depth after transpiler matches that of Qiskit exactly.

@ACE07-Sev thank you for pushing to figure this out!

I'll send a PR soon.
